*Proceeding Paper* **Combination of Post-Processing Methods to Improve High-Resolution NWP Solar Irradiance Forecasts in French Guiana †**

**Rafael Alvarenga 1,\*, Hubert Herbaux 2 and Laurent Linguet 1**


**Abstract:** Efforts have been made to improve Numerical Weather Prediction (NWP) forecasts using post-processing techniques, relying on statistical models to refine the weather forecasts. Most approaches used in the literature suffer from two main deficiencies when applied to high-resolution data: (1) they require high-capacity models to retain nonlinear data fluctuations; (2) some are known to reduce the mean random error but may still generate subsequently biased forecasts. In this study, methods from three different approaches are compared to improve 10-min resolution NWP solar irradiance forecasts: a neural network and a linear statistical model as Model Output Statistics (MOS), the Kalman filter, and Kernel Conditional Density Estimation (KCDE). The results show that none of the methods, used individually, improves the mean absolute error (MAE) and mean bias error (MBE) jointly. However, a combination of a neural network followed by Kalman filter post-processing yields significant improvements in both the mean random error and the systematic bias of the original forecasts, reducing the MAE by 45% and the MBE by 91%.

**Keywords:** solar irradiance forecast; post-processing; neural network; Kalman filter; conditional kernel density estimation

#### **1. Introduction**

The accurate forecasting of solar irradiance plays an essential role in the management and integration of photovoltaic (PV) systems in transmission grids. The intermittency of solar irradiance, which is highly correlated with the output power, needs to be accurately predicted in advance; this assures the transmission system operator that sufficient power will be available to meet the demand throughout the day.

Numerical weather prediction (NWP) models have been widely utilized to generate medium- and long-term forecasts of solar irradiance, relying on mathematical equations describing atmospheric fluid mechanics and thermodynamics [1]. Admittedly, NWP modeling has improved continuously; however, the resulting weather forecasts often present considerable errors. These errors can be divided into two groups: random errors, caused by the insufficient capacity of NWP models to predict the variations of solar irradiance, and systematic bias, caused by defective modeling that tends to systematically overestimate or underestimate solar irradiance.

Over the past decades, different post-processing techniques have been proposed to correct these deviations. Most of the proposed methods follow the approach called model output statistics (MOS) [2–5], where NWP forecast errors are corrected by learning a function that relates the response variable of interest to its predictors. Despite improving the performance of the original forecasts, the approach suffers from two primary deficiencies when applied to high-resolution data: (1) it requires a model with a high capacity to retain nonlinear data fluctuations; (2) it is known to reduce the random error but may still generate subsequently biased forecasts.


Besides MOS, other approaches have also been used in the literature for the same purpose without sharing the same drawbacks: the Kalman filter (KF) [6–11] recursively improves weather forecasts based on historical errors measured before the forecast generation, and Kernel Conditional Density Estimation (KCDE) [1] estimates the marginal distribution of the original forecasts (and consequently their expected value) conditioned on other explanatory variables.

The objective of this study is to fill the gap in the post-processing of high-resolution forecasts observed in the literature by proposing a global approach to enhance solar irradiance forecasts, based on the consecutive combination of post-processing methods that, applied individually, do not effectively enhance the global performance of the original forecasts. Furthermore, the same analysis is performed on coarser forecasts, revealing differences in the modeling and efficiency of the post-processing methods depending on the temporal resolution.

#### **2. Methodology**

The methods to be compared and combined are introduced in this section, namely the MOS, KCDE, and Kalman filters.

#### *2.1. Model Output Statistics*

MOS are statistical models that relate observed variables of interest (here, the prediction error) to appropriate predictors derived from the outputs of the original model (the original NWP predictions). Once calibrated, MOS models are used to estimate the prediction error of the WRF forecasts, which is then subtracted from the original forecasts, thereby generating improved forecasts. An artificial neural network (ANN) was selected among the MOS methods, as such models are proven to provide better improvements for high-resolution data. A linear statistical MOS was selected for an additional comparison.

#### 2.1.1. Artificial Neural Network

In ANNs, the function parameters are processing units called neurons. They are interconnected via a set of weights, analogous to synaptic connections in the nervous system, in a manner that allows signals to travel through the network during training and prediction. The neurons are grouped in consecutive layers, where each layer is responsible for modeling patterns of a different complexity level. In this study, a multi-layer perceptron is used, where each neuron of one layer is connected to every neuron of the following layer. The output of any neuron is equal to the sum of the outputs of all neurons of the previous layer multiplied by their respective weights, plus a bias term. Subsequently, the net output is passed to an activation function that bounds the activation value to a predefined range. The activation function employed is the rectified linear unit (ReLU), which helps the convergence of the model and presents good results. Finally, the model output is equal to the activation value of the only neuron in the output layer.

During training, the weights and biases are updated to reduce the mean square error for each batch of predictions and observations, using the Adam stochastic gradient descent optimizer. This process is repeated over multiple passes through the training dataset, called epochs, until the maximum number of epochs is reached.
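The paper reports a Keras implementation (Section 4.1). The sketch below shows a minimal MOS network of this kind; the layer sizes, batch size, and epoch count are illustrative assumptions, not the tuned values of Table 1.

```python
# Minimal ANN MOS sketch, assuming Keras as reported in Section 4.1.
# Layer sizes, batch size, and epochs are illustrative, not the tuned values.
from tensorflow import keras

def build_mos_ann(n_predictors: int) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(n_predictors,)),
        keras.layers.Dense(64, activation="relu"),  # hidden layers with ReLU
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),  # single output neuron: predicted forecast error
    ])
    # Adam stochastic gradient descent on the mean square error, as described above
    model.compile(optimizer="adam", loss="mse")
    return model

# X: predictors derived from WRF outputs; e: observed error (forecast - measurement)
# model = build_mos_ann(X.shape[1])
# model.fit(X, e, batch_size=32, epochs=100)
# improved = original_forecast - model.predict(X_new).ravel()
```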

#### 2.1.2. Lorenz's MOS

Compared to neural networks, the success of Lorenz's MOS [12] can be partly attributed to its simplicity. In this approach, the error is modeled as a 4th-order polynomial function:

$$e = a_1 \cos^4 Z + a_2 \hat{k}^4 + a_3 \cos^3 Z + \dots + a_8 \hat{k} \tag{1}$$

where *e* defines the error between the forecasted and measured solar irradiance. The model presented by Lorenz relies on two predictors: *Z*, the solar zenith angle, and $\hat{k}$, the forecasted clear-sky index, i.e., the ratio of the solar irradiance forecast to the expected solar irradiance in clear-sky conditions.

The function parameters can be easily estimated using the least squares method over historical forecasts and measurements. As for all tested methods, the best group of predictors was selected based on its capability to predict the original forecast error.
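As a sketch, the polynomial of Equation (1) can be fitted by ordinary least squares; the design matrix below assumes the elided middle terms are the remaining pure powers of cos *Z* and $\hat{k}$, giving the eight coefficients $a_1, \dots, a_8$.

```python
# Least-squares fit of the Lorenz-style polynomial in Equation (1).
# The elided terms are assumed to be the remaining pure powers of cos(Z) and k_hat.
import numpy as np

def fit_lorenz_mos(zenith_rad, k_hat, error):
    c = np.cos(zenith_rad)
    X = np.column_stack([c**4, k_hat**4, c**3, k_hat**3,
                         c**2, k_hat**2, c, k_hat])
    coeffs, *_ = np.linalg.lstsq(X, error, rcond=None)  # a_1 ... a_8
    return coeffs

# predicted_error = X_new @ coeffs; improved = forecast - predicted_error
```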

#### *2.2. Kernel Conditional Density Estimation*

Kernel density estimation is a probabilistic nonparametric approach for estimating the unknown distribution of a random variable *x* (e.g., the original forecast error), based on the local estimation of the distribution of its samples $X_1, \dots, X_n$ [1], as shown in Figure 1.

**Figure 1.** An example of kernel density estimation (black) generated with the help of Gaussian kernels (red) used to define the local distribution of each sample (green).

The kernel, referring to any smooth function $\mathcal{K}$, is used to define the distributions assigned to each sample, which then compose the estimated probability function. The most commonly used kernel is the Gaussian kernel, defined by:

$$\mathcal{K}(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} \tag{2}$$

It is equivalent to the probability density function of a standard normal distribution. A scaled version makes it possible to handle variables varying in different ranges:

$$\mathcal{K}_h(x) = \frac{1}{h}\,\mathcal{K}\left(\frac{x}{h}\right) \tag{3}$$

where *h* denotes the bandwidth parameter controlling the smoothness of the density estimates. For a dataset presenting *n* values, the kernel density estimator is defined as follows:

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{K}_h(x - X_i) \tag{4}$$

Ultimately, the conditional version of this density estimator weights each sample according to the proximity of its predictors to the conditioning point **x**:

$$\hat{f}(y|\mathbf{x}) = \sum_{i=1}^{n} w_i(\mathbf{x})\, \mathcal{K}_{h_y}(y - Y_i) \tag{5}$$

where

$$w_i(\mathbf{x}) = \frac{\mathcal{K}_{h_x}(\lVert \mathbf{x} - X_i \rVert)}{\sum_{j=1}^{n} \mathcal{K}_{h_x}(\lVert \mathbf{x} - X_j \rVert)} \tag{6}$$

Interested readers are invited to read the comprehensive definition of the KCDE method provided by Yang [1].

Different methods have been proposed to find the optimal values of the bandwidths $h_x$ and $h_y$. In this study, the method described in [1] was employed, where the selected bandwidth is the one that minimizes the asymptotic mean integrated squared error (AMISE). For circular data of a periodic nature, such as the solar zenith angle, a different kernel is used, the von Mises kernel, which requires adapting the AMISE expression during the application of the bandwidth selection method. Finally, for a given group of predictors, the point forecast of the prediction error is obtained by calculating the expectation of the probability density function generated with the kernel conditional density estimation.
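Because the expectation of the conditional density in Equation (5) reduces to a weighted average of the historical values with the weights of Equation (6), the point forecast can be sketched as below; Gaussian kernels and a fixed bandwidth are assumed here, whereas the paper selects bandwidths by minimizing the AMISE.

```python
# Point forecast from the KCDE of Equations (5)-(6): the expectation of the
# conditional density is the weighted average of historical errors Y_i.
# A fixed Gaussian bandwidth h_x is assumed (the paper uses AMISE selection).
import numpy as np

def kcde_expectation(x_query, X_hist, Y_hist, h_x):
    d = np.linalg.norm(X_hist - x_query, axis=1)  # distances ||x - X_i||
    k = np.exp(-0.5 * (d / h_x) ** 2)             # Gaussian kernel values
    w = k / k.sum()                               # weights of Equation (6)
    return np.dot(w, Y_hist)                      # expectation of Equation (5)
```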

#### *2.3. Kalman Filter*

In a general presentation, Kalman filters are used to estimate the error state $x_t \in \mathbb{R}^n$ of a discrete time-evolving process. At each time step, the first guess of the error state is as follows:

$$x_t = A x_{t-1} + w_t \tag{7}$$

where matrix *A* holds the values of pre-defined predictors at each time step, and the state is a vector representing the effect of each predictor when estimating the prediction error. A nonlinear Kalman filter can be implemented by adding nonlinearities to matrix *A*, such as predictors in exponential form.

The observed error $z_t \in \mathbb{R}^m$ is expressed by a measurement equation that relates the error to the measurements at each time step:

$$z_t = H x_t + v_t \tag{8}$$

The variable $x_{t-1}$ represents the previous error state, and the random variables $w_t$ and $v_t$ are white noises that model perturbations in the process and measurements, respectively. The noises are assumed to be independent of each other and to follow normal distributions with covariances *W* and *V*:

$$p(w) \sim N(0, W), \qquad p(v) \sim N(0, V) \tag{9}$$

In practice, the process noise covariance *W* and measurement noise covariance *V* matrices may change with each time step or measurement and are calculated iteratively with recent values. The same process is applied to the matrices *A* and *H*.

The correction procedure involves two groups of equations: time update equations and measurement update equations. The time update equations make a first guess of the next solar irradiance prediction error based on the last state of the measured error and the error covariance estimates, obtaining an a priori prediction for the next time step; the measurement update equations then incorporate new measurements into this first guess, obtaining an improved a posteriori prediction [13]. The interested reader is referred to [7] for a detailed description of the associated equations.
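A minimal sketch of one predict/update cycle is given below, with fixed matrices for readability; as noted above, the paper recalculates *A*, *H*, *W*, and *V* iteratively from recent values.

```python
# One time-update / measurement-update cycle of the Kalman filter.
# Fixed A, H, W, V are assumed here for simplicity.
import numpy as np

def kf_step(x, P, z, A, H, W, V):
    # Time update: a priori estimate of the error state and its covariance
    x_prior = A @ x
    P_prior = A @ P @ A.T + W
    # Measurement update: incorporate the newly observed error z
    S = H @ P_prior @ H.T + V
    K = P_prior @ H.T @ np.linalg.inv(S)              # Kalman gain
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(len(x)) - K @ H) @ P_prior
    return x_post, P_post
```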

As mentioned in [1], if the Kalman filter is applied continuously for each lead time of a day-ahead or intraday prediction, the forecast horizon becomes "resolution-ahead", e.g., hour-ahead when working at hourly resolution. The procedure therefore needs to be adapted when applied operationally, keeping forecasts at the original horizon. As the original forecasts are issued every day at 00:00 in this study, the first guess at a lead time L within the 24-h horizon cannot depend on the previous lead time L−1 of the same forecast, because the values for the entire forecast are still first guesses: the respective measurements are not yet accessible and the update equations cannot be applied. The first guess for the lead time L must instead be based on the same lead time for which measurements are available, in this case the lead time L of the forecast already improved the day before. A scheme of this procedure is presented in Figure 2.

**Figure 2.** Kalman filter applied operationally for the post-processing of daily-issued predictions.

Unlike MOS methods, Kalman filters do not require parameters to be trained on extensive historical information. However, the method requires recurrent access to observations at each time step and is therefore more suited to real-time applications than to improving long-range forecasts.

#### **3. Data**

#### *3.1. Weather Forecasts*

The weather forecasts used in this study were issued daily at 00:00 by a Weather Research and Forecasting (WRF) numerical model over one year and aggregated to a 10-min resolution.

The WRF forecasts include the global horizontal irradiance to be post-processed and correlated weather variables, which are used as predictors.

Considerable errors were observed in the WRF solar irradiance forecasts at low sun elevations (below 15°). These errors can be explained by the approximations made to model the atmosphere within WRF, which produce larger forecast errors when sunlight crosses longer distances through the atmosphere, as in the early morning and late afternoon. Therefore, the time steps before 8 a.m. and after 5 p.m. were removed from the analysis to avoid such outliers.

#### *3.2. Observations*

The ground truth of solar irradiance was obtained with a pyranometer installed at the same location in French Guiana and was used to evaluate and calibrate certain post-processing methods.

The observations were preprocessed by imputing outliers and missing values through linear interpolation when they accounted for less than 30% of a day; otherwise, the entire day was discarded.

Besides the high-resolution forecasts and observations, a second group of data was obtained by upscaling the same data to an hourly resolution. The objective is to verify the differences in modeling and performance of each method when applied to coarse-grained and fine-grained time series. Figure 3 presents observations at both temporal resolutions for a specific day.

#### **4. Results and Discussion**

The methods described in the previous section, as well as convenient combinations of them, were applied to the post-processing of WRF solar irradiance forecasts at both the 10-min and hourly resolutions. Particularities in the modeling of each time resolution and the obtained results are described in this section.

#### *4.1. Modeling*

Post-processing was applied for each method using predictor groups of varying size and composition. For the ANN, an additional search was performed over the hyperparameters of the model, such as the batch size, number of layers, and neurons per layer. The Kalman filter was tested in both linear and nonlinear implementations, with different combinations of exponential predictors in the latter case. The best group of predictors found for each method and each time resolution tested is listed in Table 1.

**Table 1.** Optimal group of predictors per method per resolution.


The MOS and Kalman filter methods were implemented in Python, using the Keras library for the neural networks, while KCDE was implemented in R using the stats package.

The best predictor groups for the higher resolution are slightly larger, which can be explained by the additional data required to explain the more sudden variations of solar irradiance. Furthermore, methods focusing on correcting the random error (e.g., ANN, Lorenz's MOS) tend to require more predictors than methods focusing on the mean bias (e.g., Kalman filter).

For the Kalman filter in particular, the length of the historical window used in the recursive calculation of the predicted bias at each time step is also a parameter to be defined. The random errors are better modeled when the historical window is longer, whereas a shorter window allows for better mean bias modeling. After multiple tests, the optimal window length capable of reducing both the random error and the mean bias was found to be 70 days for both time resolutions.

#### *4.2. Improvements on Original Forecasts*

All post-processing methods described in this study were tested on the same data and their resulting forecasts are compared by calculating the mean absolute error (*MAE*) and root mean squared error (*RMSE*) to assess the random error, and the mean bias error (*MBE*) to assess the systematic bias:

$$MAE = \frac{1}{n} \sum_{t=1}^{n} |y_t - x_t| \tag{10}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - x_t)^2} \tag{11}$$

$$MBE = \frac{1}{n} \sum_{t=1}^{n} (y_t - x_t) \tag{12}$$

where $y_t$ denotes the improved prediction and $x_t$ the observation at time step *t*. The MAE of the original forecasts and of the resulting forecasts after the application of each method is shown in Figures 4 and 5 for the 10-min and hourly resolution data, respectively.
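Equations (10)–(12) translate directly into code; a short sketch with illustrative array names:

```python
# Error metrics of Equations (10)-(12); y: improved predictions, x: observations.
import numpy as np

def mae(y, x):
    return np.mean(np.abs(y - x))

def rmse(y, x):
    return np.sqrt(np.mean((y - x) ** 2))

def mbe(y, x):
    return np.mean(y - x)  # the sign reveals systematic over/underestimation
```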

**Figure 4.** Resulting MAE per hour of the day following the application of each method and convenient combinations of methods on 10-min resolution WRF forecasts.

**Figure 5.** Resulting MAE per hour of the day following the application of each method and convenient combinations of methods on hourly resolution WRF forecasts.

Naturally, as solar irradiance presents increasingly sudden variations at finer time scales, the MAE is higher for the fine-grained WRF forecasts.

All tested methods reduce the MAE of the original WRF forecasts; the ANN presents the largest reduction of the mean error for both time resolutions. However, Figures 6 and 7 illustrate that the ANN is the only method that degrades the mean bias of the original WRF forecasts at both time resolutions. The Kalman filter is the best method for reducing the mean bias at both time resolutions, removing almost all the bias from the original WRF forecasts.

**Figure 6.** Resulting MBE following the application of each method and convenient combinations of methods over 10-min resolution WRF forecasts.

**Figure 7.** Resulting MBE following the application of each method and convenient combinations of methods over hourly resolution WRF forecasts.

The consecutive combination of these two methods was therefore tested, aiming to combine the Kalman filter's capability to reduce the mean bias with the ANN's capability to reduce the mean error.

The percentage improvement given by each method compared to the original WRF forecasts is described in Tables 2 and 3 for the 10-min and hourly resolution forecasts, respectively.

**Table 2.** Percentage improvement of each method compared to original forecasts in a 10-min resolution (the higher the better, negative values denote degraded results).


**Table 3.** Percentage improvement of each method compared to original forecasts in hourly resolution (the higher the better, negative values denote degraded results).


Focusing only on the combinations of the ANN and KF methods, the results for both time resolutions reveal that when the ANN is applied last, it reintroduces a bias that was previously removed by the Kalman filter. Conversely, when the Kalman filter is applied after the ANN on high-resolution data, it maintains the reduced mean error and additionally removes the mean bias first introduced by the ANN at the 10-min resolution. Therefore, an ANN MOS post-processing followed by Kalman filtering yields the best overall results, improving the original MAE by 45% and the original MBE by 91%.

A different behavior was observed at the hourly resolution, where, for the same combination of methods, the Kalman filter degrades the mean error previously improved by the ANN. At this time resolution, the best approach may depend on the application. If a reduced random error is preferred, the KF+ANN combination can be used in this order, improving the original MAE by 54% even though the MBE is degraded by 7%, generating positively biased solar irradiance forecasts. If, on the other hand, consecutively biased forecasts must be avoided (e.g., in PV power plants coupled with batteries, an overestimation of PV production may rapidly fill the batteries in real time, leaving no capacity to absorb further errors during the day), the ANN+KF combination is more convenient, improving the MBE of the original forecasts by 95% while improving the MAE by 13%.

#### **5. Conclusions and Further Work**

This study presents the best combination of post-processing methods to improve high-resolution solar irradiance forecasts generated by a WRF model. The annual meteorological simulation for 2020 at a specific location in French Guiana was analyzed. The evaluation focuses on the capability of each method to improve both the random error and the mean bias of the original solar irradiance forecasts. Furthermore, an identical analysis was performed on a lower-resolution version of the same forecasts, exhibiting differences in modeling and performance for each method according to the time resolution.

The methods compared in this study, namely Model Output Statistics (MOS), the Kalman filter (KF), and KCDE, are well known to improve systematic and random errors individually. The results reveal that none of the methods improves both errors simultaneously; however, a combination of a neural network MOS followed by KF post-processing improves the MAE by 45% and the MBE by 91% for high-resolution forecasts. At the coarser time resolution, the same combination improves the MAE by 13% and the MBE by 95%, while the combination in the reverse order improves the MAE by 54% at the cost of degrading the MBE by 7%. In this case, the most valuable combination may depend on which error should preferentially be reduced, according to the downstream application.

For future development of this work, a more extensive analysis can be conducted by comparing the methods on multiple high-resolution forecasts generated for different locations with different climates.

**Author Contributions:** Conceptualization, R.A., H.H. and L.L.; methodology, R.A.; software, R.A.; validation, R.A.; formal analysis, R.A.; investigation, R.A.; resources, R.A.; data curation, R.A.; writing—original draft preparation, R.A.; writing—review and editing, H.H. and L.L.; visualization, R.A.; supervision, H.H. and L.L.; project administration, L.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Online Classification of High Gamma Dose Rate Incidents †**

**Mohammed Al Saleh 1,2,3,\*, Beatrice Finance 1, Yehia Taher 1, Ali Jaber 2 and Roger Luff 4**


**Abstract:** In this paper, we propose a new method for choosing the most suitable time-series classification method to apply to online gamma dose rate incidents. We used the historical incidents measured in the German Radiation Early Warning Network and clustered them into several classes before testing existing classification methods. This raises the research problem of the online classification of time-series data with varying scales and lengths. Reviewing the state-of-the-art methods, we found that no single classification method fits our data all the time, which motivated us to introduce our own approach.

**Keywords:** machine learning algorithms; predictive model; time-series classification; gamma dose rate; Radiation Early Warning Network


#### **1. Introduction**

Time-series analysis is gaining interest in many domains. With the proliferation of sensors and IoT devices that continuously produce massive amounts of real-time data, special care has been given to analyzing these data to understand past events and patterns and to predict future ones. Medical heart monitor data, stock market prices, and weather conditions are all examples of such time-series data.

In this paper, we are interested in analyzing the gamma dose rate (background radiation level) in the environment. A serious event that causes an abrupt increase in the gamma dose rate is a leak or containment failure of a nuclear reactor, such as the Chernobyl accident, the biggest short-term leak of radioactive materials ever recorded [1]. Such an event has to be intercepted at the earliest point possible to take the proper measures and precautions and to notify the concerned authorities, minimizing the effects of such a hazardous situation. This is a very critical task, as long-term or acute exposure to a high gamma dose rate can have many hazardous consequences for humans as well as for the ecosystem.

Around the globe, thousands of probes (sensors) collect gamma dose rates in real time. A Radiation Early Warning System (REWS) [2] collects the data and raises an alarm in case of an increase in the local gamma dose rate. Whenever an event occurs (i.e., the gamma dose rate goes above an accepted threshold provided by experts), an alarm is triggered, and a team of experts and personnel has to investigate the reasons behind the rise. Currently, the analysis of incoming incidents is performed manually by experts. Such a method is time-consuming and risky, since the factors affecting the gamma dose rate are not always known immediately. Fortunately, most incoming incidents are innocent ones: they remain within a range acceptable for human health, and the value returns to normal after a period of time.

The objective of our research is to propose an *Intelligent Radiation Early Warning System* that automatically finds the cause behind an incident and classifies it as *real* or *innocent* in real time. Gaining intelligence is the key aspect of our approach; we therefore aim to transform static and semi-automatic REWSs into dynamic, fully automatic, intelligence-driven systems. The proposed system would optimize the ability to analyze any alert generated by an event such as rain.

The two main phases of our *Intelligent REWS* are: (1) building the predictive model and (2) near real-time detection and prediction. In the first phase, the historical data generated by Germany's REWS [3] were analyzed to extract knowledge about incidents that occurred in the past. The historical databases contain raw unlabeled data (i.e., time series) corresponding to the gamma dose rate monitored at each probe. The data used in this work comprise the past ten years of minute-by-minute gamma dose rate measurements for over a thousand probes.

In [4], we already proposed an unsupervised machine learning model that helps us automatically determine the reasons behind the incidents. This task was difficult and required many experiments to find the best time-series clustering algorithm. Having tackled the shortcomings of the first phase of our *Intelligent REWS*, in this paper we present our contributions to the online detection and prediction phase. As we aim to match unlabeled incidents without any human intervention as soon as possible, our research falls in the field of supervised machine learning time-series classification.

The remainder of this paper is organized as follows: Section 2 states the context and the problems behind our research. Section 3 describes in detail the state-of-the-art approaches similar to ours in one aspect or another. Section 4 presents our approach and contribution, which we evaluate in Section 5. Finally, we conclude in Section 6.

#### **2. Context and Problem Statement**

In this research, we deal with unlabeled univariate time series, as shown in Figure 1. Incidents caused by the same event may not have a recognizable temporal trace or identical characteristics, but rather a common behavior. For example, one event may cause peaks of increasing amplitude that decrease over a longer period of time; another may cause an abrupt increase whose amplitude is maintained for a period of time, and so on. Note that incidents caused by the same event can last for varying lengths of time and reach different amplitudes.

**Figure 1.** Typical gamma dose rate time series.

In the first phase, we collected nearly 300 innocent incidents from 45 different locations in Germany. Choosing different locations allowed us to obtain diverse shapes of observations representing the innocent incidents, thus gaining a higher-quality dataset to build our investigation upon. Investigating our time-series data revealed important characteristics that need to be handled carefully: incidents are of highly varying lengths, scales, and levels. Incidents caused by the same event can have different characteristics and, conversely, different incidents can share the same characteristics.

The evolving-parameters problem was solved by preparing a catalog of parameters before the extraction process started. A specific algorithm tackled the evolving issue through several calculation steps, ensuring that the parameters obtained at the end of each month were accurate. The extracted incidents then undergo a dedicated preprocessing phase to ensure that they are ready to enter the proposed clustering model. A *z-normalization* method [5] deals with the scale issue, and the *zero-padding* method deals with the varying incident lengths.

Once preprocessing was applied, the best clustering model for our context was formed by combining the DTW similarity measure [6] with the K-means clustering algorithm [7] and its DBA averaging method. With the help of experts in gamma radiation monitoring, we identified three events after applying our clustering approach, splitting the incidents into three categories: rain, stormy rain, and incidents caused by probe calibration, as depicted in Figure 2.
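The paper does not name its implementation for this step; as a sketch, the DTW + K-means + DBA combination is available, for instance, in the tslearn library:

```python
# Sketch of DTW-based K-means with DBA averaging, assuming the tslearn library
# (an assumption; the paper does not name its clustering implementation).
from tslearn.clustering import TimeSeriesKMeans

# incidents: array of shape (n_incidents, length, 1), z-normalized and zero-padded
km = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
# labels = km.fit_predict(incidents)  # 3 clusters: rain, stormy rain, calibration
```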

**Figure 2.** Clusters obtained from our predictive model.

For the online detection and prediction phase, a matching process tries to classify incoming incidents such as the one depicted in Figure 1 into the labeled clusters depicted in Figure 2, thereby identifying the real cause of the current incident. The incoming readings are analyzed against the thresholds in real time. Once an incident is detected, the data preprocessing model of phase one is applied to deal with the scale and length issues. Notice that we start analyzing an incident even before the incident has finished.

After investigating the classification algorithms presented in the literature, we noticed several shortcomings that prevented us from relying on a single algorithm for our online classification model: no specific algorithm could perfectly fit our data, since our data have unique characteristics and behavior. Although some algorithms work perfectly for specific types of incidents, they fail to classify other types. This was reason enough not to trust a single classification algorithm with our incoming incidents. Moreover, these algorithms were unable to detect incoming incidents with a unique behavior that should be classified into a new class to be labeled by the experts later.

In this context, the problem consists of finding a machine-learning-based framework that automates the event identification process, decreasing the time and effort spent and increasing the efficiency and accuracy of the process. The result is to automatically identify incoming incidents as soon as possible and give experts a correct first impression for distinguishing innocent incidents from critical ones. Hence, the main research question behind this paper can be formulated as follows: *"What machine learning model should be used for the online time-series classification of special behavior, and how do the different models perform in practice?"*

#### **3. State of the Art**

In this section, we briefly recall the main time-series classification algorithms mentioned in the literature. We compare these techniques based on applicability and effectiveness. In addition to conducting a literature study, we also apply these different techniques to our dataset to test their performance.

For time-series data, several algorithms exist that consider the time factor, which is essential in our study. A common pitfall when dealing with time-series data is to treat each value in the sequence as a separate feature. This is the core difference between time-series and tabular data: in time-series data, the order of the values is essential and critical, whereas in tabular data, the order is ignored and scrambling the features does not affect the prediction process. Therefore, each algorithm dedicated to time-series data is based on a technique that extracts knowledge from the data while respecting its order.

Those algorithms are categorized as follows:


In [14], the authors introduced time-series data and time-series classification methods, focusing on the importance of distance-based classifiers. Xing et al. [15] divided time-series classification methods into three main categories: feature-based, model-based, and distance-based methods. Diving deeper into the literature, we noticed that most research works focus on or introduce one specific classification algorithm. As we will see in the experimentation section, when these classification algorithms are applied to our data, they are not able to perform the task in all situations. That is why we propose a novel approach, which gave the best results in our testing.

#### **4. Online Detection and Prediction Phase**

In Figure 3, we depict the three main components of our online detection and prediction phase: (1) online incident extraction, (2) online preprocessing, and (3) online classification. It is important to mention that the intended result is to reduce the errors, not to eliminate them all; as our data are challenging, removing all errors is impossible.

**Figure 3.** The proposed online detection and prediction phase.

The data sent by probes from different locations are continuously monitored. A reading above the peak threshold triggers the system to check whether the readings of this probe remain above the peak for 30 min. If the incoming readings remain high (above the peak threshold), the series is extracted as an incident, starting from the first value above the maximum background mean until the 30 min have passed.
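One possible reading of this extraction rule, sketched in Python (the threshold parameters and variable names are placeholders):

```python
# Sketch of the online extraction rule described above: a reading above the
# peak threshold that stays high for 30 min is cut into an incident starting
# at the first value above the background mean. Parameters are placeholders.
def extract_incident(readings, peak_threshold, background_mean, window=30):
    n = len(readings)
    for t, v in enumerate(readings):
        if (v > peak_threshold and t + window <= n
                and all(r > peak_threshold for r in readings[t:t + window])):
            # walk back to the first value still above the background mean
            start = next((i for i in range(t, -1, -1)
                          if readings[i] <= background_mean), -1) + 1
            return readings[start:t + window]
    return None  # no incident detected
```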

Extracted incidents cannot enter the classification phase directly; the raw incidents must first be preprocessed. Incidents coming from different probes have different characteristics in terms of length, scale, and level. During the preprocessing phase, the incident data are normalized using the z-normalization technique and padded using the zero-padding technique. This preprocessing does not affect the shape of an incident; it only standardizes the incidents so that they resemble the training dataset on which the classifiers were trained, helping the classifiers identify or predict the class of these unknown incidents.
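A minimal sketch of this preprocessing step (the fixed target length is an assumption, taken to match the training incidents):

```python
# z-normalization and zero-padding as described above.
import numpy as np

def preprocess(incident, target_len):
    z = (incident - incident.mean()) / incident.std()  # remove scale and level
    # pad the tail with zeros up to target_len (assumes len(z) <= target_len)
    return np.pad(z, (0, target_len - len(z)))
```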

The classification phase is divided into two steps: the voting phase and the counselor's decision. In the voting phase, four classifiers from the state of the art are implemented separately; all of them were implemented and tested because each was successful in identifying a particular class. Each classifier accepts the incoming incident as input, and the four classifiers run in parallel:


the algorithm had slight errors in classifying calibration incidents as rain and vice versa.


In our approach, each classifier performs its prediction, and the output is not just the predicted class but the probability of each class, from which the class with the highest probability is selected. The second step of the online classification phase then takes all the classifiers' predictions; its role is to analyze what the majority has classified this new incoming incident as. The counselor has to choose one of three possible choices:


In summary, combining all classifiers' abilities helped overcome the problems and challenges found in our data. The counselor has three possible choices, depending on the highest aggregated probability. If the probability remains high (above 80%) after three classification attempts, the incident is assigned to the class with the highest probability. If the probability lies between 60% and 80%, the incident undergoes further classification after more incoming data readings are collected. Finally, if the incident does not reach a probability higher than 60%, it is left for the experts to check and verify, as it could be a new incident of a new class to be created or a new shape of an existing class.
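One possible reading of this decision rule is sketched below; aggregating the voters by a simple average of their probability vectors is an assumption.

```python
# Counselor sketch: aggregate the four classifiers' class probabilities and
# apply the decision thresholds described above (averaging is an assumption).
import numpy as np

def counselor(prob_vectors, attempts):
    avg = np.mean(prob_vectors, axis=0)  # one probability vector per classifier
    best = int(np.argmax(avg))
    p = avg[best]
    if p > 0.80 and attempts >= 3:
        return ("assign", best)          # commit to the winning class
    if p >= 0.60:
        return ("wait", None)            # collect more readings and retry
    return ("expert", None)              # possible new class: escalate to experts
```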

#### **5. Experimentation**

In order to compare the different state-of-the-art approaches and to assess the benefit of our proposed model, we systematically ran different experiments and evaluations on the labeled time-series data. After investigating all the algorithms mentioned in the state of the art, we implemented each classifier following its best practice for selecting optimal parameters and applied it to our labeled data. First, the data were split into a training dataset (90% of the dataset) and a testing dataset (the remaining 10%). When splitting the data, we guaranteed that the training dataset was balanced and well represented across all three classes, so that the classifiers would be trained well.

All the implementations were performed using Python libraries. For the best environment performance and easy implementation, we installed Anaconda and used Jupyter notebooks for writing and testing the code; Anaconda provides an isolated environment containing all the libraries needed for our tests. Among the libraries dedicated to time-series data, several methods are defined to handle this type of data. Traditional machine learning algorithms implemented for tabular classifiers cannot be applied to our time-series data, because they neglect the time factor essential to it. Thus, in addition to scikit-learn, pandas, NumPy, and other libraries, we installed and used the sktime library, which contains the time-series classifiers.

Table 1 shows the evaluation results for the four state-of-the-art time-series classifiers. Class 0 corresponds to the calibration cluster, class 1 to the rain class, and class 2 to the stormy rain class.

The first classifier is KNN with DTW, a distance-based classifier. By default, this classifier uses the Euclidean distance measure [8] to determine class membership; our time-series data, however, require a different distance, because the incidents are of varying length and are not perfectly aligned in time. Although the accuracy was not bad (89.28%), some errors still occurred after training and testing. By investigating where the model failed, we deduced that it could detect the calibration and rain classes but failed to identify the stormy rain class, confusing stormy rain incidents with rain incidents.
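With the sktime library mentioned above, such a classifier can be sketched as follows (the neighbor count is an illustrative choice):

```python
# KNN with DTW distance, assuming the sktime library; n_neighbors is illustrative.
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

clf = KNeighborsTimeSeriesClassifier(n_neighbors=1, distance="dtw")
# clf.fit(X_train, y_train)          # X: (n_cases, 1, series_length)
# probs = clf.predict_proba(X_test)  # per-class probabilities for the voting phase
```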

The second classifier is the time-series forest (TSF) classifier, which relies on an interval-based algorithm and depends on the information retrieved from various intervals of a series. First, the classifier splits the series into random intervals, each with a random starting point and length. Then, the algorithm extracts summary features (slope, mean, and standard deviation) from those intervals; the extracted features form the feature vector representing the interval. Since this algorithm is based on the random forest algorithm, it constructs and trains a decision tree from the extracted features; several trees are constructed so that the decision follows the majority of the trees in the forest. After training and testing the TSF classification model, the performance was good (89.3%) but not sufficient: the model was able to identify the calibration class but made some errors when identifying the stormy rain class.


**Table 1.** Applying different classification algorithms to the dataset.

The third classifier is the random interval spectral ensemble (RISE) classifier, based on frequency features extracted from the series after splitting it into intervals. It sounds similar to the previous classifier, TSF, especially because it also uses the random forest algorithm, but it differs from TSF in two ways: first, in how it splits the series into intervals, where the intervals for each decision tree are of the same length; and second, in the type of features extracted from the intervals, where RISE extracts spectral (series-to-series) features rather than summary statistics. The algorithm was significant in classifying the rain class but made some errors in the remaining classes. The accuracy of this model reached 85.71%.

Finally, the last classifier is the shapelet-based classifier, which is very popular for time-series data. A shapelet is a sub-shape of a series, and a bag of shapelets is used to represent a particular class. When extracting shapelets, the algorithm searches for shapes with the discriminatory power to identify a class; shapelets thus form the identity of each class. When a new unknown incident arrives, the algorithm extracts its shapelets and compares them to the classes' shapelets to determine which class the incident belongs to. The shapelet-based algorithm was implemented and tested on our data, but the results were unsatisfying: after several tests and attempts to enhance the model's overall accuracy, it only reached about 50%. Investigating the results uncovered the reason for this outcome: our data are very challenging because the incidents are very similar to each other, which makes their shapes very similar and confuses the model. Even though this classification model was able to identify the rain class, it failed on the other two classes (calibration and stormy rain).

To evaluate our online classification module, we first tested it on incoming incidents. These incidents are preprocessed online after being extracted and then prepared for classification. The classification algorithms presented in Table 2 work individually, in parallel, on each incoming incident and return their predictions as soon as possible, in the form of a probability suggested by each algorithm for mapping the incident to each class.


**Table 2.** Testing our counselor approach.

Finally, the counselor performs its assigned task: deciding which algorithm acts best and gives the right prediction for the incoming incident, as shown in Table 2, where incidents 2 and 3 were assigned to classes 0 and 2, respectively. However, incident 1's probability was not high enough for the counselor to make a decision, which is why it suggested waiting for more data.

Our proposed model overcomes the issue we were concerned about. When tested on its own, each classification algorithm could identify a single class but failed to differentiate among the rest. By combining the outputs of the four algorithms, the counselor was able either to commit to the identity of the unknown incoming incident or to consider it a new incident of a new class to be examined by experts. The proposed classification model's output was therefore satisfying, and it supported decision making for predicting the class of incoming incidents.

#### **6. Conclusions**

In this work, we presented our machine-learning-based framework for autonomously identifying the causes behind online incoming incidents caused by high gamma dose rate readings. After extracting, preprocessing, and clustering the historical incidents, our approach applies a machine learning model that matches online incoming incidents to the clustered ones, identifying the causes behind them as soon as possible using supervised classification.

In the classification phase, we specifically tackled the problem of classifying time series using several classification algorithms at the same time, which, to our knowledge, had not been properly addressed in the literature. We researched and experimented with different classic and state-of-the-art approaches to evaluate their compatibility. When each approach, properly tested alone, failed to classify our data, we proposed our counselor classification model, which uses all the classification algorithms simultaneously and votes for the outcome with the best aggregated probability.

By displaying the matching percentages obtained through the experimentation with the various algorithms, we were able to show that our model comparatively gave the best results. Furthermore, the experts expressed positive and hopeful impressions upon inspecting the results, which motivated us to publish this contribution. As future work, the next step is to improve the quality of the overall framework by extending the evaluation to more datasets and automating the evaluation.

**Author Contributions:** M.A.S. analyzed and interpreted the need for an intelligent radiation early warning system, introduced the second phase of the RIMI framework, and was a major contributor in writing the manuscript. He also queried the AI methodologies and techniques to introduce an approach that can address the shortcomings behind the RIMI second phase. B.F., Y.T., A.J. and R.L. verified the tested techniques and functions for analyzing the data and were major contributors in writing the manuscript. All authors read and agreed to the published version of the manuscript.

**Funding:** The first and corresponding author "Mohammed Al Saleh" has a scholarship from the National Council for Scientific Research in Lebanon (CNRS) to continue his Ph.D. degree, including this research.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

**Acknowledgments:** We would like to thank the National Council for Scientific Research (CNRS) in Lebanon for supporting this work. We would like to express our gratitude to the Federal Office for Radiation Protection (BfS) in Germany for allowing us to use the data their REWS has collected for more than 15 years. We would also like to thank Roy Issa for his support in implementing the code behind this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


### *Proceeding Paper* **Comparative Analysis of Residential Load Forecasting with Different Levels of Aggregation †**

**Ana Apolo Peñaloza \*, Roberto Chouhy Leborgne and Alexandre Balbinot**

Grupo de Modelagem e Análises de Sistemas de Potência and Laboratório de Instrumentação Eletrônica e Inteligência Artificial, Universidade Federal do Rio Grande do Sul UFRGS, Porto Alegre 90020-000, RS, Brazil; roberto.leborgne@ufrgs.br (R.C.L.); alexandre.balbinot@ufrgs.br (A.B.)

**\*** Correspondence: ana.apolo@ufrgs.br; Tel.: +55-(51)-3308-3129

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** Microgrids need robust residential load forecasting, which highlights the problem of predicting electricity consumption for small numbers of households. The individual demand curve is volatile and more difficult to forecast than the aggregated demand curve. For this reason, the Mean Absolute Percentage Error (MAPE) varies over a large range (from 1% to 45%), depending on the number of consumers analyzed. Different levels of aggregation of household consumers that can be used in microgrids are analyzed, and the load forecasting of single consumers and of aggregated consumers are compared. The forecasting methodology used is the most consolidated of the Recurrent Neural Networks, the LSTM. The dataset contains 920 residential consumers belonging to the Commission for Energy Regulation (CER) control group, available in the Irish Social Science Data Archive (ISSDA) repository. The results show that forecasting groups of more than 20 aggregated consumers yields a lower MAPE than individual forecasting, whereas individual forecasting is better for groups of fewer than 10 consumers.

**Keywords:** load forecasting; LSTM; residential load forecasting; aggregation

#### **1. Introduction**

Load forecasting is essential to ensure a balance between demand and generation; thus, utilities need highly accurate forecasts to maintain the security and stability of the power supply [1]. At the same time, the complexity of the distribution network has continued to grow, creating uncertainty in the grid, especially with the increase in microgeneration from renewable energy and the charging of electric vehicles [2].

Smart meters allow residents to monitor their consumption in real time and provide utilities with large amounts of data [3]. These measurements enable enhanced consumption monitoring and energy control, allowing greater flexibility in the distribution network [4].

With this amount of data coming from smart meters, it becomes possible to validate demand forecasting at the household or building level, where consumption profiles are volatile [5,6]. Most load forecasting work focuses instead on large substations with tens of MW or transmission grids with tens of GW. Forecasting is assessed by the Mean Absolute Percentage Error (MAPE), which is generally below 2% for a substation at the transmission level but can reach up to 30% for residential consumers [4]. Figure 1 shows aggregated demand curves for different numbers of consumers; it shows that with 100 consumers the demand curve is considerably smoothed.

This work aims to analyze demand forecasting in the context of microgrids. A dataset of the 30-min demand of individual residential consumers is used, allowing a comparison between the demand forecasting of individual and aggregated consumers. This comparison is performed for different numbers of aggregated consumers (5, 10, 20, 30, 50, and 100), selected at random from a set of 256 residential consumers. The error metric used is the MAPE, and the forecasting method is the Long Short-Term Memory (LSTM) recurrent neural network, a consolidated methodology in load forecasting.


**Figure 1.** Aggregated demand curves for different numbers of consumers.
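As a sketch of the forecasting setup just described, the following minimal LSTM forecaster assumes a Keras implementation and a one-day (48-step) input window of 30-min readings; the architecture and window length are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal LSTM demand forecaster; architecture and window length are
# illustrative assumptions, not the configuration used in the paper.
from tensorflow import keras

window = 48  # one day of 30-min readings (an assumption)

model = keras.Sequential([
    keras.Input(shape=(window, 1)),
    keras.layers.LSTM(64),   # recurrent layer carrying cell and hidden states
    keras.layers.Dense(1),   # next 30-min demand
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X, y)  # X: (samples, window, 1); y: next-step demand
```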

#### **2. Literature Review**

The literature on demand forecasting is quite extensive, including both statistical and machine learning methods. In recent years, deep learning methods have become popular for demand forecasting; thus, recurrent neural networks, with their variants LSTM and GRU, are the most widely used forecasting methodologies, as found in [2,7–13] for LSTM and [14,15] for GRU.

The forecasting literature is divided into the forecasting of energy consumption and the forecasting of load demand. Energy consumption forecasting has been used for few consumers, usually households or buildings, as found in [15–19]. On the other hand, load forecasting is used to forecast substations' demand, as found in [5,13,20–22]. These two approaches yield completely different forecast errors under the same performance metrics, owing to the greater difficulty of predicting the consumption of individual households.

Some authors that forecast energy consumption had aggregated individual energy consumption to obtain the building energy consumption [23].

Few papers in the literature compare load forecasting results and performance between individual and aggregated consumers. Methodology performance for different levels of aggregation was evaluated in [7]: the authors found that with an aggregation of 5 consumers the forecast errors are between 30% and 50%, while with 1000 aggregated consumers the errors drop below 5%. However, the authors did not analyze the forecasting performance for single users.

One of the first works to analyze individual load forecasting is [13]. The volatility of individual demand makes the prediction difficult. The forecasting methodology uses an LSTM neural network; demand forecasting is performed for 69 consumers belonging to a set of 10,000 consumers from Australia over a period of three months. Other quantities of aggregated consumers are not analyzed. The results show that the non-aggregated forecast is better than the forecast of the aggregated set.

Comparisons of demand forecasting with different levels of aggregation are presented in [4], but the results are ambiguous and do not consider a non-aggregated methodology. Different aggregation levels using LSTM are shown in [24]. Their results suggest that aggregating more than 200 residential consumers hardly decreases the error further, with the error curve becoming almost constant. However, the work of [24] only considers the increase in consumers and how this is reflected in the aggregate predictor's error; non-aggregated consumers are not considered.

A methodology for demand forecasting of 200 consumers divided into groups of 50, 100, and 150 is presented in [25]. Forecasting is performed by LSTM and k-means clustering; the methodology without clustering proved to be better. They do not perform analyses with groups smaller than 50 consumers, which would correspond to a typical microgrid context.

#### **3. Demand Forecasting with RNN**

Recurrent Neural Networks (RNNs) have become the most widely used methodology to perform residential demand forecasting [6,8,11,26–28]. The preference for RNNs is due to the fact that their models are sequence-based [6]. RNNs can process long data or text sequences [29] and are therefore widely used in text translation and time series forecasting [30]. Energy demand is a time series, as observed in Figure 1. Figure 2 shows the typical architecture of RNNs and their unfolding across earlier and later time steps.

**Figure 2.** Recurrent neural network architecture. Adapted from [6].

Two main gated variants of RNNs are the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). LSTM and GRU are designed to solve the problems of gradient bursting (exploding), when gradients approach infinity, and gradient fading (vanishing), when gradients approach zero [30]. These problems are associated with successive multiplications of the weight matrices W [30].

LSTM is a solution to the long-term gradient fading and bursting problems through finite gradient control [30]. GRU, in turn, is a simplified variant of LSTM introduced by Cho [31], which performs the same gradient control using fewer gates.

#### *Long Short-Term Memory*

The LSTM is trained by the Backpropagation Through Time (BPTT) algorithm [32]. Figure 3 illustrates a time step with two LSTM cells, showing the internal connections of an LSTM cell and the updates of the hidden state (*ht*) and cell state (*ct*) after a time step [30]. The key element of the LSTM is the cell state (*ct*), which moves from an earlier time step to the later one and can be regarded as a long-term memory term.

**Figure 3.** Internal connections of an LSTM cell across a time step.

The equation of *ct* is shown in (1) and is divided into two parts. The first part is controlled by the forget gate (*ft*), given in (2), which is in charge of defining which elements of the previous cell state are kept or forgotten. The input gate (*it*) is shown in (3) and the input state (*gt*) in (4): *gt* creates the values that can be added to the cell state, while *it* decides which input values (input *x*) will be updated. The output (6) is also divided into two parts: in the first, the input (*x*) and the previous hidden state are combined through the output gate (*ot*), given in (5); in the second, the same is done with the values of *ct*.

Finally, the sum of the two parts is transformed by the hyperbolic tangent function (tanh) to obtain values in the range [−1, 1] [6,29,30,33].

$$c\_t = g\_t \odot i\_t + (c\_{t-1} \odot f\_t) \tag{1}$$

$$f\_t = \sigma(W\_{f \cdot x} x\_t + W\_{f \cdot h} h\_{t-1} + b\_f) \tag{2}$$

$$i\_t = \sigma(W\_{i \cdot x} x\_t + W\_{i \cdot h} h\_{t-1} + b\_i) \tag{3}$$

$$g\_t = \tanh(W\_{g \cdot x} x\_t + W\_{g \cdot h} h\_{t-1} + b\_g) \tag{4}$$

$$o\_t = \sigma(W\_{o \cdot x} x\_t + W\_{o \cdot h} h\_{t-1} + b\_o) \tag{5}$$

$$h\_t = \tanh(c\_t) \odot o\_t \tag{6}$$

where *ct* is the cell state; *ht* is the hidden state; *ft* is the forget gate; *it* is the input gate; *gt* is the input activation gate; *ot* is the output gate; *x* is the input; *W* is the weight matrix; *b* is the gate bias; *σ* is the sigmoid function; and tanh is the hyperbolic tangent.

#### **4. Methodology for Comparison and Case Study**

Figure 4 presents the methodology used to generate the forecasts for comparing the performance of aggregated and individual forecasting. The load forecasting is performed by the LSTM recurrent neural network.

Figure 4 shows the methodology divided into two parts. The first is the pre-processing step, which includes normalization and missing-data checking. The dataset used in this work belongs to the Smart Metering Electricity Consumer Behavior Trials project, carried out by the Commission for Energy Regulation (CER), Ireland's energy generation and distribution regulatory institution. The project that generated this database aimed to analyze hourly energy consumption under different residential and industrial time-of-use (ToU) tariffs. Its concern was to characterize the behavior of consumers in each of the different price ranges and their adaptation in relation to time of use. The project relied on the use of smart meters with a real-time digital panel, access to consumption via the internet, and detailed bimonthly consumption reports [34,35].

The CER dataset included 4225 residential consumers. However, this paper focuses on the control group, which included 657 consumers with measurements over a period of one and a half years, from 14 July 2009 to 31 December 2010. Data acquisition was carried out every 30 min. As the focus of this paper is the influence of load aggregation on demand forecasting, groups with 5, 10, 20, 30, 50 and 100 consumers were randomly selected. The CER dataset does not contain missing data. For forecasting, the data were resampled to a one-hour period and only one season was considered; thus, 1600 samples per user were selected to perform the different forecasts.

In order to verify the performance of the forecast for aggregated and individual consumers, forecasts are performed for sets of {5, 10, 20, 30, 50 and 100} consumers. The consumers in each set are selected at random. In order to verify the influence of the particular consumers in each set, the random selection is repeated 10 times. Once the consumers in a set are selected, their demand is summed to create a single demand curve with 1600 time samples.

**Figure 4.** Individual and aggregated demand. (**a**) 5 individual consumers; (**b**) 5 aggregated consumers; (**c**) 10 individual consumers; (**d**) 10 aggregated consumers; (**e**) 50 individual consumers; (**f**) 50 aggregated consumers.

On the other hand, the curve called individual is composed of 1600 samples from each individual in the set (for example, for a set with 20 consumers, the number of samples used is 32,000). Each individual consumer's demand is forecast separately and the forecasts are then summed; finally, the result is compared with the forecast of the aggregated curve.

Figure 4a,c,e presents on the left the individual demand of 5, 10 and 50 consumers, and Figure 4b,d,f presents on the right the corresponding aggregated demand. Figure 4d,f shows that the demand waveform smooths as the number of consumers increases, such that the aggregated peaks of 50 consumers (see Figure 4f) are smoother than the waveform of 5 aggregated consumers (see Figure 4b).

Table 1 shows the LSTM configuration for the aggregated and individual forecasting; the selection of these parameters is based on [8]. The forecast error metric is the Mean Absolute Percentage Error (MAPE), the most widely used error measure in the literature.


**Table 1.** LSTM hyper-parameters.

#### **5. Results**

To perform the forecasting, the CER database is divided into 70% for training and 30% for testing. The consumers were aggregated in groups of 5, 10, 20, 30, 50 and 100 to analyze the forecasting performance. The load demand of the previous 48 h of each group is used by the algorithm to predict the demand for the next hour. In the case of aggregated forecasting, 48 time steps of aggregated demand are used, and the forecast is the aggregated demand of the following hour. In the case of individual forecasting, 48 time steps of individual demand are used for each consumer, and the forecast is the individual demand of the next hour for each consumer.

Figure 5a–f shows the forecast and actual curves for individual and aggregated consumers. Figure 5a shows the forecast for individual consumers. Figure 5a–c highlights the difficulty of forecasting microgrids with few consumers (fewer than 10), where uncertainty is relatively more significant due to the volatility of individual consumers. Figure 5d–f shows that beyond 20 aggregated consumers the volatility decreases, smoothing the curves and facilitating the forecast. Finally, Figure 5f shows the best result, because its curve is smoother and therefore less volatile than the curve in Figure 5a.

**Figure 5.** Actual vs. forecast demand curve. (**a**) Individual consumers; (**b**) 5 aggregated consumers; (**c**) 10 aggregated consumers; (**d**) 20 aggregated consumers; (**e**) 50 aggregated consumers; (**f**) 100 aggregated consumers.

Figures 6 and 7 show the average MAPE of the 10 forecast simulations for individual and aggregated consumers for each group of {5, 10, 20, 30, 50 and 100} consumers. Figure 6 shows the MAPE for training, whereas Figure 7 shows the MAPE for testing. Both figures show that the MAPE decreases in training and testing as the number of aggregated consumers increases, because the aggregated demand curve is smoother with a larger aggregation of consumers.

**Figure 6.** MAPE in the training of individual and aggregated consumers.

**Figure 7.** MAPE in the test of individual and aggregated consumers.

The MAPE of individual forecasting is comparable with that of aggregated forecasting when a small number of consumers (fewer than 10) is considered. Therefore, in small microgrids the individual curve could be more useful than the aggregated one. The higher MAPE of individual forecasting is due to atypical consumers, which tend to affect the individual forecast more than the aggregated one. A way to improve individual forecasting would be to work with consumers with similar consumption profiles; grouping would then be done by clustering rather than at random, as was done in this work.

Another paper [36] also works with the CER database, using a control group with 782 consumers, and reports a MAPE of 6%. This work presents a training error of 6.88% when training uses 100 consumers. According to Figures 6 and 7, the error decreases as the number of consumers increases: with 256 consumers, a 5% training error is found, because with 256 consumers the demand curve is smoother than with 100 consumers. The error decrease stabilizes after 100 consumers, with errors smaller than 3% for 782 consumers.

#### **6. Conclusions**

This paper presented a comparison of load forecasting considering aggregated and individual demand curves. The load forecasting was performed by LSTM RNNs, which are designed to work with series of data such as energy demand curves. The analysis was performed with 5, 10, 20, 30, 50 and 100 consumers. The selection of consumers was randomized, and the experiment was repeated 10 times. The aggregated 100-consumer forecasting presented the lowest MAPE. The decrease in the MAPE was due to the smoothed demand curve of the aggregated consumers, as aggregation reduces the volatility of the electricity consumption. On the other hand, the individual forecast was more susceptible to atypical consumers, which produce larger differences in the forecast for small groups (below 20 consumers). Nevertheless, for small microgrids, forecasting individual consumption can be better than forecasting the aggregate.

**Author Contributions:** Conceptualization, R.C.L. and A.A.P.; methodology, A.A.P.; validation, A.B.; writing—original draft preparation, A.A.P.; writing—review and editing, R.C.L. and A.B.; All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)–Finance Code 001.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Accessed through the Irish Social Science Data Archive: https://www.ucd.ie/issda/data/commissionforenergyregulationcer (accessed on 17 June 2022).

**Acknowledgments:** This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)–Finance Code 001. The authors would like to thank the Irish Social Science Data Archive for making the database used in this paper available through the "CER Smart Metering Project-Electricity Customer Behaviour Trial, 2009–2010", accessed through the Irish Social Science Data Archive: https://www.ucd.ie/issda/data/commissionforenergyregulationcer (accessed on 17 June 2022).

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


### *Proceeding Paper* **An Open Source and Reproducible Implementation of LSTM and GRU Networks for Time Series Forecasting †**

**Gissel Velarde \* , Pedro Brañez, Alejandro Bueno, Rodrigo Heredia and Mateo Lopez-Ledezma**

Independent Researchers, Cochabamba 06651, Bolivia; pedrobran8@gmail.com (P.B.); alebuenoaz@gmail.com (A.B.); rodrigoh1205@gmail.com (R.H.); lopezmateo97@yahoo.com (M.L.-L.)

**\*** Correspondence: gv@urubo.org

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** This paper introduces an open source and reproducible implementation of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks for time series forecasting. We evaluated LSTM and GRU networks because of their performance reported in related work. We describe our method and its results on two datasets. The first dataset is the S&P BSE BANKEX, composed of stock time series (closing prices) of ten financial institutions. The second dataset, called Activities, comprises ten synthetic time series resembling weekly activities with five days of high activity and two days of low activity. We report Root Mean Squared Error (*RMSE*) between actual and predicted values, as well as Directional Accuracy (*DA*). We show that a single time series from a dataset can be used to adequately train the networks if the sequences in the dataset contain patterns that repeat, even with certain variation, and are properly processed. For 1-step ahead and 20-step ahead forecasts, LSTM and GRU networks significantly outperform a baseline on the Activities dataset. The baseline simply repeats the last available value. On the stock market dataset, the networks perform only as well as the baseline, possibly due to the nature of these series. We release the datasets used as well as the implementation with all experiments performed to enable future comparisons and to make our research reproducible.

**Keywords:** forecasting; time series; open source; reproducibility

#### **1. Introduction**

Artificial Neural Networks (ANNs) and particularly Recurrent Neural Networks (RNNs) gained attention in time series forecasting due to their capacity to model dependencies over time [1]. With our proposed method, we show that RNNs can be successfully trained with a single time series to deliver forecasts for unseen time series in a dataset containing patterns that repeat, even with certain variation; therefore, once a network is properly trained, it can be used to forecast other series in the dataset if adequately prepared.

LSTM [2] and GRU [3] are two related deep learning architectures from the RNN family. LSTM consists of a memory cell that regulates its flow of information thanks to its non-linear gating units, known as the input, forget and output gates, and activation functions [4]. GRU architecture consists of reset and update gates and activation functions. Both architectures are known to perform equally well on sequence modeling problems, yet GRU was found to train faster than LSTM on music and speech applications [5].

Empirical studies on financial time series data reported that LSTM outperformed the Autoregressive Integrated Moving Average (ARIMA) [6]. ARIMA [7] is a traditional forecasting method that integrates autoregression with moving average processes. In [6], LSTM and ARIMA were evaluated on *RMSE* between actual and predicted values on financial data. The authors suggested that the superiority of LSTM over ARIMA was thanks to gradient descent optimization [6]. A systematic study compared different ANN architectures for stock market forecasting [8]. More specifically, the authors evaluated architectures of the types LSTM, GRU, Convolutional Neural Networks (CNN) and Extreme Learning Machines (ELM). In their experiments, two-layered LSTM and two-layered GRU networks delivered low *RMSE*.

In this study, we evaluate LSTM and GRU architectures because of their performance reported in related work for time series forecasting [6,8]. Our method is described in Section 2. In Sections 2.1–2.3, we review principles of Recurrent Neural Networks (RNN) of the type LSTM and GRU. In Section 2.4, we explain our data preparation, followed by the networks' architecture, training (Section 2.5) and evaluation (Section 2.6). The evaluation was performed on two datasets. In Section 3.1, we describe the S&P BSE-BANKEX or simply BANKEX dataset, which was originally described in [8] and consists of stock time series (closing prices). In Section 3.2, we describe the Activities dataset, a dataset composed of synthetic time series resembling weekly activities with five days of high activity and two days of low activity. The experiments are presented in Section 3. Finally, we state our conclusions in Section 5 and present possible directions for future work. We release the datasets used as well as the implementation with all experiments performed to enable future comparisons and make our research reproducible.

#### **2. Method**

The general overview of the method is described as follows. The method inputs time series of values over time and outputs predictions. Every time series in the dataset is normalized. Then, the number of test samples is defined to create the training and testing sets. One time series from the train set is selected and prepared to train an LSTM and a GRU, independently. Once the networks are trained, the test set is used to evaluate *RMSE* and *DA* between actual and predicted values for each network. The series are transformed back to unnormalized values for visual inspection. We describe every step in detail. Next, in Sections 2.1–2.3, we review principles of RNNs of the type LSTM and GRU, following the presentation as in [5].

#### *2.1. Recurrent Neural Networks*

ANNs are trained to approximate a function by learning the network parameters that best approximate it. RNNs are a special type of ANN developed to handle sequences. An RNN updates its recurrent hidden state *ht* for a sequence *x* = (*x*1, *x*2, ..., *xT*) by:

$$h\_t = \begin{cases} 0, & t = 0 \\ \phi(h\_{t-1}, x\_t), & \text{otherwise,} \end{cases} \tag{1}$$

where *φ* is a nonlinear function. The output of an RNN may be of variable length *y* = (*y*1, *y*2, ..., *yT*).

The update of *ht* is computed by:

$$h\_t = g(W x\_t + U h\_{t-1}), \tag{2}$$

where *W* and *U* are weight matrices and *g* is a smooth and bounded activation function, such as the logistic sigmoid function $f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$ or the hyperbolic tangent function $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$.

Given a state *ht*, an RNN outputs a probability distribution for the next element in a sequence. The sequence probability is represented as:

$$p(\mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_T) = p(\mathbf{x}\_1) p(\mathbf{x}\_2 \mid \mathbf{x}\_1) \dots p(\mathbf{x}\_T \mid \mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_{T-1}) . \tag{3}$$

The last element is a so-called end-of-sequence value. The conditional probability distribution is given by:

$$p(x\_t \mid x\_1, x\_2, \dots, x\_{t-1}) = g(h\_t), \tag{4}$$

where *ht* is the recurrent hidden state of the RNN as in Expression (1). Updating the network's weights involves several matrix computations, such that back-propagated errors lead to vanishing or exploding gradients, making training unfeasible. LSTM was proposed in 1997 to solve this problem by enforcing constant error flow thanks to gating units [2]. GRU is a closely related network proposed in 2014 [3]. Next, we review LSTM and GRU networks; see Figure 1 for an illustration.

**Figure 1.** LSTM (**left**) and GRU (**right**). *c* represents the memory cell and $\tilde{c}$ the new memory cell of the LSTM. *h* represents the activation and $\tilde{h}$ the new activation of the GRU. Based on [5].

#### *2.2. Long Short-Term Memory*

The LSTM unit decides whether to keep content memory thanks to its gates. If a sequence feature is detected to be relevant, the LSTM unit keeps track of it over time, modeling dependencies over long-distance [5].

In Expressions (6)–(8) and (10), *W* and *U* represent weight matrices and *V* represents a diagonal matrix. *W*, *U* and *V* need to be learned by the algorithm during training. The subscripts *i*, *o* and *f* correspond to the input, output and forget gates, respectively. For every *j*-th LSTM unit, there is a memory cell $c\_t^j$ at time *t*, whose activation $h\_t^j$ is computed as:

$$h\_t^j = o\_t^j \tanh(c\_t^j), \tag{5}$$

where $o\_t^j$ is the output gate responsible for modulating the amount of memory in the cell. The forget gate $f\_t^j$ modulates the amount of memory content to be forgotten, and the input gate $i\_t^j$ modulates the amount of new memory to be added to the memory cell, such that:

$$o\_t^j = \sigma(W\_o x\_t + U\_o h\_{t-1} + V\_o c\_t)^j, \tag{6}$$

$$f\_t^j = \sigma(W\_f x\_t + U\_f h\_{t-1} + V\_f c\_{t-1})^j, \tag{7}$$

$$\mathbf{i}\_t^j = \sigma(\mathcal{W}\_i \mathbf{x}\_t + \mathcal{U}\_i \mathbf{h}\_{t-1} + V\_i \mathbf{c}\_{t-1})^j. \tag{8}$$

where *σ* is a sigmoid function. The memory cell $c\_t^j$ partially forgets existing content and adds new memory content $\tilde{c}\_t^j$ by:

$$c\_t^j = f\_t^j c\_{t-1}^j + i\_t^j \tilde{c}\_t^j, \tag{9}$$

where:

$$\tilde{c}\_t^j = \tanh(W\_c x\_t + U\_c h\_{t-1})^j. \tag{10}$$

#### *2.3. Gated Recurrent Unit*

The main difference between the LSTM and the GRU is that the GRU does not have a separate memory cell, such that the activation $h\_t^j$ is obtained by the following expression:

$$h\_t^j = (1 - z\_t^j)h\_{t-1}^j + z\_t^j \tilde{h}\_t^j. \tag{11}$$


The update gate $z\_t^j$ decides the amount of update given the previous activation $h\_{t-1}^j$ and the candidate activation $\tilde{h}\_t^j$. In Expressions (12)–(14), *W* and *U* represent weight matrices that need to be learned during training. Moreover, the subscripts *z* and *r* correspond to the update and reset gates, respectively. The update gate $z\_t^j$ and reset gate $r\_t^j$ are obtained by the following expressions:

$$z\_t^j = \sigma(W\_z x\_t + U\_z h\_{t-1})^j, \tag{12}$$

$$r\_t^j = \sigma(W\_r x\_t + U\_r h\_{t-1})^j, \tag{13}$$

where *σ* is a sigmoid function. The candidate activation $\tilde{h}\_t^j$ is obtained by:

$$\tilde{h}\_t^j = \tanh(W x\_t + r\_t \odot (U h\_{t-1}))^j, \tag{14}$$

where ⊙ denotes element-wise multiplication.

#### *2.4. Data Preparation*

Every time series or sequence in the dataset is normalized as follows. Let *v* be a sequence *v* = (*v*1, *v*2,..., *vQ*) of *Q* samples that can be normalized between 0 and 1:

$$v' = \frac{v - v\_{\min}}{v\_{\max} - v\_{\min}}. \tag{15}$$

We define the number of samples in the test set as *tests*. The number of samples *N* for training is obtained by *N* = *Q* − *tests* − *w*. Then, a sequence *x* is selected arbitrarily and prepared to train each network as follows. We define a window of size *w* and a number of steps ahead *f*, where *f* < *w* < *N* < *Q*, such that:

$$X = \begin{bmatrix} x\_1 & x\_2 & \dots & x\_w \\ x\_2 & x\_3 & \dots & x\_{w+1} \\ x\_3 & x\_4 & \dots & x\_{w+2} \\ \vdots & \vdots & & \vdots \\ x\_{Q-(w-1+f)} & x\_{Q-(w-2+f)} & \dots & x\_{Q-f} \end{bmatrix}$$

*X* becomes a *Q* − (*w* − 1 + *f*) by *w* matrix, and:

$$Y = \begin{bmatrix} x\_{w+1} & x\_{w+2} & \dots & x\_{w+f} \\ x\_{w+2} & x\_{w+3} & \dots & x\_{w+1+f} \\ x\_{w+3} & x\_{w+4} & \dots & x\_{w+2+f} \\ \vdots & \vdots & & \vdots \\ x\_{Q-(f-1)} & x\_{Q-(f-2)} & \dots & x\_{Q} \end{bmatrix} \tag{16}$$

becomes a *Q* − (*w* − 1 + *f*) by *f* matrix containing the targets. The first *N* rows of *X* and *Y* are used for training. The remaining *Q* − *N* elements are used for testing. The settings for our experiments are described in Section 3, after we introduce the characteristics of the dataset used.

#### *2.5. Networks' Architecture and Training*

We tested two RNNs: one with LSTM memory cells and one with GRU memory cells. In both cases, we use the same architecture and training, with recurrent sigmoid activations and *tanh* activation functions as explained in Sections 2.2 and 2.3. The networks are trained for 200 epochs with the Adam optimizer [9]. The number of epochs and the architecture were set empirically. We minimize the Mean Squared Error (*MSE*) loss between the targets and the predicted values, see Expression (17). The networks are trained using a single time series prepared as described in Section 2.4. The data partition is explained in Section 3.3.

#### *2.6. Evaluation*

We use Mean Squared Error (*MSE*) to train the networks:

$$MSE = n^{-1} \sum\_{t=1}^{n} (x\_t - y\_t)^2,\tag{17}$$

where *n* is the number of samples, and *xt* and *yt* are the actual and predicted values at time *t*. Moreover, we use the Root Mean Squared Error (*RMSE*) to compare algorithms:

$$RMSE = \sqrt{MSE}.\tag{18}$$

Both metrics, *MSE* and *RMSE*, are used to measure the difference between actual and predicted values, and therefore, smaller results are preferred [10]. We also use Directional Accuracy (*DA*):

$$DA = \frac{100}{n} \sum\_{t=1}^{n} d\_t, \tag{19}$$

where:

$$d\_t = \begin{cases} 1, & (x\_t - x\_{t-1})(y\_t - y\_{t-1}) \ge 0 \\ 0, & \text{otherwise,} \end{cases}$$

such that *xt* and *yt* are the actual and predicted values at time *t*, respectively, and *n* is the sample size. *DA* is used to measure the capacity of a model to predict direction as well as prediction accuracy. Thus, higher values of *DA* are preferred [10].

#### **3. Experiments**

In this section, we report experiments performed with both datasets.

#### *3.1. The S&P BSE BANKEX Dataset*

This dataset was originally described in [8]; however, our query retrieved a different number of samples than in [8]. We assume the data must have changed since they were originally retrieved. We collected the time series on 20 January 2022, using Yahoo! Finance's API [11] for the time frame between 12 July 2005 and 3 November 2017; see Table 1. Most time series had 3035 samples, and some had 3032 samples; therefore, we stored each time series's last 3032 samples. Figure 2 presents the time series of BANKEX without and with normalization.



**Figure 2.** (**a**) Time series in the BANKEX dataset without normalization. Closing Price in Indian Rupee (INR). Daily samples retrieved between 12 July 2005 and 3 November 2017 using Yahoo! Finance's API [11]. All time series with 3032 samples. (**b**) Same time series as in (**a**), but with normalization; closing price normalized between 0 and 1. The numbers from 1 to 10 correspond to the numbers (first column) for each series in Table 1.

#### *3.2. The Activities Dataset*

The Activities dataset is a synthetic dataset created to resemble weekly activities, with five days of high activity and two days of low activity. The dataset has ten time series with 3584 samples per series. Initially, a pattern of five ones followed by two zeros was repeated to obtain a length of 3584 samples, and a slope of 0.0001 was added to the series. The original series was circularly rotated to produce the remaining series in the dataset, to which noise was added, and each sequence was arbitrarily scaled so that the peak-to-peak amplitude of each series was different; see Figure 3.

**Figure 3.** Time series in the Activities dataset without normalization, first 100 samples.

#### *3.3. Datasets Preparation and Partition*

Following Section 2.4, every time series was normalized between 0 and 1. We used a window of size *w* = 60 days and tested *f* = 1 and *f* = 20 steps ahead. We used the last 251 samples of each time series for testing and arbitrarily selected the first time series of each dataset for training our LSTM and GRU networks.

#### *3.4. Results*

The results are presented in Tables 2–5. Close-to-zero *RMSE* and close-to-one *DA* are preferred. On the Activities dataset, two-tailed Mann–Whitney tests show that for 1-step ahead forecasts, the *RMSE* achieved by either RNN is significantly lower than that delivered by the baseline (LSTM & Baseline: *U* = 19, *n* = 10, 10, *p* < 0.05. GRU & Baseline: *U* = 0, *n* = 10, 10, *p* < 0.05). In addition, GRU delivers significantly lower *RMSE* than LSTM (*U* = 91, *n* = 10, 10, *p* < 0.05). In terms of *DA*, both RNNs perform equally well and significantly outperform the baseline. For 20-step ahead forecasts, again both RNNs achieve significantly lower *RMSE* than the baseline (LSTM & Baseline: *U* = 0, *n* = 10, 10, *p* < 0.05. GRU & Baseline: *U* = 0, *n* = 10, 10, *p* < 0.05). This time, LSTM achieves lower *RMSE* than GRU (*U* = 10, *n* = 10, 10, *p* < 0.05) and higher *DA* (*U* = 81, *n* = 10, 10, *p* < 0.05).

**Table 2.** One-step ahead forecast on Activities dataset. *RMSE*: columns 2 to 4. *DA*: columns 5 to 7.


**Table 3.** Twenty-step ahead forecast on Activities dataset. *RMSE*: columns 2 to 4. *DA*: columns 5 to 7.


**Table 4.** One-step ahead forecast on BANKEX dataset. *RMSE*: columns 2 to 4. *DA*: columns 5 to 7.



**Table 5.** Twenty-step ahead forecast on BANKEX dataset. *RMSE*: columns 2 to 4. *DA*: columns 5 to 7.

On the BANKEX dataset, two-tailed Mann–Whitney tests show that for 1-step ahead forecasts there is no difference among approaches considering *RMSE* (LSTM & Baseline: *U* = 51, *n* = 10, 10, *p* > 0.05. GRU & Baseline: *U* = 55, *n* = 10, 10, *p* > 0.05. LSTM & GRU: *U* = 49, *n* = 10, 10, *p* > 0.05). Similar results are found for 20-step ahead forecasts (LSTM & Baseline: *U* = 76, *n* = 10, 10, *p* > 0.05. GRU & Baseline: *U* = 67, *n* = 10, 10, *p* > 0.05. LSTM & GRU: *U* = 66, *n* = 10, 10, *p* > 0.05). *DA* results are consistent with those obtained for *RMSE*. Figure 4a,b show examples of 20-step ahead forecasts and Figure 5 presents an example of 1-step ahead forecasts. Visual inspection helps understand the results.

**Figure 4.** Examples of 20-step ahead forecast. (**a**) Activities dataset. (**b**) BANKEX dataset.

**Figure 5.** Example of 1-step ahead forecast. Actual and predicted closing price over the first 100 days of the test set Yes Bank. Closing Price in Indian Rupee (INR).

#### **4. Discussion**

The motivation for developing a reproducible and open-source framework for time series forecasting relates to our experience in trying to reproduce previous work [6,8]. We found it challenging to find implementations that are simple to understand and replicate. In addition, the datasets are not available. We discontinued comparisons with [8], since the dataset we collected was slightly different, and we were unsure whether the reported results referred to normalized values or not. If the algorithms are described but the implementations are not available, a dataset is necessary to compare the forecasting performance of two algorithms, such that a statistical test can help determine if one algorithm is significantly more accurate than the other [12] (pp. 580–581).

#### **5. Conclusions**

We proposed a method for time series forecasting based on LSTM and GRU and showed that these networks can be successfully trained with a single time series to deliver forecasts for unseen time series in a dataset containing patterns that repeat, even with certain variation. Once a network is properly trained, it can be used to forecast other series in the dataset if adequately prepared. We tried and varied several hyperparameters. On sequences, such as those resembling weekly activities that repeat with certain variation, we found an appropriate setting; however, we failed to find an architecture that would outperform a baseline on stock market data; therefore, we assume that we either failed at optimizing the hyperparameters of the networks, the approach is unsuitable for this application, or we would need extra information that is not reflected in stock market series alone. For future work, we plan to benchmark different forecasting methods against the method presented here. In particular, we want to evaluate statistical methods as well as other machine learning methods that have demonstrated strong performance on forecasting tasks [13]. We release our code as well as the dataset used in this study to allow this research to be reproducible.

**Author Contributions:** P.B., A.B., R.H. and M.L.-L. designed, implemented, and trained the networks, G.V. advised on the project, optimized the networks and code, and wrote the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The code and datasets to reproduce this research are available at: https://github.com/Alebuenoaz/LSTM-and-GRU-Time-Series-Forecasting (accessed on 22 May 2022).

**Acknowledgments:** We would like to thank the anonymous reviewers for their valuable observations.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Outliers Impact on Parameter Estimation of Gaussian and Non-Gaussian State Space Models: A Simulation Study †**

**Fernanda Catarina Pereira 1,***<sup>∗</sup>* **, Arminda Manuela Gonçalves 2,‡ and Marco Costa 3,‡**


**Abstract:** State space models are powerful and quite flexible tools whose formulation, in which the models' parameters may vary over time, allows systems that vary significantly over time to be dealt with. Assuming a known distribution of errors, in particular the Gaussian distribution, parameter estimation is usually performed by maximum likelihood. However, in time series data, it is common to have discrepant values that can impact statistical data analysis. This paper presents a simulation study with several scenarios to find out in which situations outliers can affect the maximum likelihood estimators. The results obtained were evaluated in terms of the difference between the maximum likelihood estimate and the true value of the parameter and the rate of valid estimates. It was found that both for Gaussian and exponential errors, outliers had more impact in two situations: when the sample size is small and the autoregressive parameter is close to 1, and when the sample size is large and the autoregressive parameter is close to 0.25.

**Keywords:** state space models; parameter estimation; outliers; simulation study

#### **1. Introduction**

There are several books in the literature that describe state space models in detail [1–5]. A major advantage of these models is the possibility of explicitly integrating the unobservable components of a time series by relating them to each other stochastically.

State space models have in their structure a latent process, the state, which is not observed. The Kalman filter is typically used to estimate it: a recursive algorithm that, at each time point, computes the optimal estimator of the state, in the sense of minimum mean squared error when the model is fully specified, as well as one-step-ahead predictions, updating and improving the predictions of the state vector in real time as new observations become available. The Kalman filter was originally developed in control engineering in the 1960s in one of Kalman's papers [6], describing a recursive solution to the linear filtering problem in discrete time. Today, this algorithm is applied in various areas of study.

Usually, to estimate the unknown parameters of the model, the maximum likelihood method is used by assuming normality of the errors; however, this assumption cannot always be guaranteed. Non-parametric estimation methods can make a strong contribution by providing initial values for the iterative methods used to optimize the likelihood function, whose convergence often fails due to a poor initial choice of parameters. For example, ref. [7] proposes distribution-free estimators based on the generalized method of moments, which do not depend on the distribution of the errors.

Nevertheless, even if the assumption of normality of the errors is not verified, the Kalman filter still returns optimal predictions within the class of all linear estimators. However, the optimal properties of the Kalman filter predictors can only be ensured when all the state space model's parameters are known. When the unknown parameter vector is replaced by its estimate, the mean squared error of the estimators is underestimated.

The analysis and modeling of dynamic systems through state space models has been quite useful given its flexibility. In its formulation, the state process is assumed to be a Markov process, allowing optimal predictions of the states and, consequently, observations based only on the optimal estimator of the current state to be obtained.

Despite these advantages, any prediction model is dependent on the quality of the data. Particularly, in many cases, meteorological time series are subject to higher uncertainties, and Kalman filter solutions can be biased [8].

In particular, outliers are an important issue in time series modeling. Time series data are typically dependent over time, and the presence of outliers can impact parameter estimates, forecasting and also inference results [9]. For the case of incomplete data and outliers in the observed data, ref. [10] developed a modified robust Kalman filter. Ref. [11] showed that, in linear Gaussian state space models, difficulties in estimating the unknown parameters can consequently affect the state predictions, especially when the measurement error is much larger than the stochasticity of the process. Ref. [12] proposed a non-parametric estimation method based on statistical data depth functions to obtain robust estimates of the mean and the covariance matrix of asset returns, which is more robust in the presence of outliers and does not require parametric assumptions.

This work arose from the project "TO CHAIR—The Optimal Challenges in Irrigation", in which short-term forecast models with a state space representation were developed to model time series of maximum air temperature. For this project, we analyzed data provided by the University of Trás-os-Montes and Alto Douro, corresponding to the maximum air temperature observed at a farm located in the district of Bragança between 20 February and 11 October 2019, together with data from the website weatherstack.com, corresponding to forecasts of the same meteorological variable for the same location with a time horizon of 1 to 6 days. The main goal was to improve the accuracy of the forecasts for the farm. However, there were some modeling problems, particularly regarding the convergence of the numerical method, which arose in the presence of outliers.

Therefore, to evaluate and compare the quality of the estimates of the unknown parameters of the linear time-invariant state space model in the presence of outliers, this paper presents four simulation studies: the first is based on the linear Gaussian state space model; the second is based on the linear Gaussian state space model with contaminated observations; the third is based on the linear non-Gaussian state space model with exponential errors; and the last one is based on the linear non-Gaussian state space model with exponential errors and contaminated observations. For each of the four studies, several scenarios were tested, in which 2000 samples of size *n* (*n* = 50, 200, 500) with valid estimates were simulated. The results obtained were evaluated in terms of the difference between the maximum likelihood estimate and the true value of the parameter and the rate of valid estimates.

#### **2. Simulation Design**

In general, the linear univariate state space model is given as follows:

$$Y\_t = \beta\_t W\_t + e\_t, \quad \text{observation equation} \tag{1}$$

$$\beta\_t = \mu + \phi(\beta\_{t-1} - \mu) + \varepsilon\_t, \quad \text{state equation} \tag{2}$$

where *t* = 1, . . . , *n* is the discrete time.


This paper aims to investigate under what conditions the presence of outliers affects the estimation of parameters and states in the state space model. Thus, we simulate time series of size *n* (*n* = 50, 200, 500) using the model defined by Equations (1) and (2). For simplicity's sake, we consider *Wt* = 1, ∀*t*, and *μ* = 0 in all simulation studies, that is,

$$Y\_t = \beta\_t + e\_t, \tag{3}$$

$$\beta\_t = \phi \beta\_{t-1} + \varepsilon\_t, \quad t = 1, \dots, n. \tag{4}$$

To create the contamination scenario, we study real time series concerning maximum air temperature. We used data from two different sources: the first corresponds to daily records of maximum air temperature between 20 February and 11 October 2019 (234 observations) through a portable weather station installed on a farm located in the Bragança district in northeastern Portugal; the second database corresponds to forecasts from the weatherstack.com website. These forecasts have a time horizon of up to 6 days; this means that, for a certain time *t*, we have forecasts given at times *t* − 6, *t* − 5, . . . , *t* − 1.

So, first we took the difference between the recorded maximum temperature and the website's forecasts, say $\Lambda\_{t,(h)}$, where *t* is the time, in days, and *h* is the time horizon of the forecasts, *h* = 1, ..., 6 days. Next, we calculated the percentage of outliers of $\Lambda\_{t,(h)}$, which was on average 5%. The outliers of $\Lambda\_{t,(h)}$ were removed and replaced by linear interpolation, yielding $\Lambda^{*}\_{t,(h)}$, in order to remove the contamination present in the data, and its mean was subtracted, $\Lambda^{*}\_{t,(h)} - \mathrm{mean}(\Lambda^{*}\_{t,(h)})$, so that it had zero mean. Then, for each time horizon *h* (*h* = 1, ..., 6), the model with the state space representation given by Equations (3) and (4) was fitted to the data $\Lambda^{*}\_{t,(h)} - \mathrm{mean}(\Lambda^{*}\_{t,(h)})$.

In order to establish a relationship between the estimates of the parameters $\phi$, $\sigma^2\_{\varepsilon}$ and $\sigma^2\_{e}$ obtained from the "non-contaminated" data and the magnitude of the outliers of $\Lambda\_{t,(h)}$, a linear regression model was fitted, yielding the relationship

$$k = 1.8874 + 3.5161 \sqrt{\frac{\sigma\_{\varepsilon}^2}{1 - \phi^2} + \sigma\_{e}^2} \tag{5}$$

where $k = |\text{outliers of } \Lambda\_{t,(h)} - \text{mean of } \Lambda\_{t,(h)} \text{ without outliers}|$ is the magnitude of the outliers, and $\frac{\sigma^2\_{\varepsilon}}{1 - \phi^2} + \sigma^2\_{e}$ is the total variance of $Y\_t$. In total, $\Lambda\_{t,(h)}$ ($h = 1, \dots, 6$) shows 59 outliers.

In this work, four simulation scenarios were tested:

1. The first is based on the linear Gaussian state space model given by

$$\begin{aligned} Y\_t &= \beta\_t + e\_t, \quad e\_t \sim \mathcal{N}(0, \sigma\_e^2) \\ \beta\_t &= \phi \beta\_{t-1} + \varepsilon\_t, \quad \varepsilon\_t \sim \mathcal{N}(0, \sigma\_\varepsilon^2), \quad t = 1, \dots, n \end{aligned}$$

2. The second is based on the linear Gaussian state space model with contaminated observations.

To contaminate the model, the deterministic factor *k*, given in (5), is added as follows:

$$\begin{aligned} Y\_t &= \beta\_t + e\_t + I\_t k, \quad e\_t \sim \mathcal{N}(0, \sigma\_e^2) \\ \beta\_t &= \phi \beta\_{t-1} + \varepsilon\_t, \quad \varepsilon\_t \sim \mathcal{N}(0, \sigma\_\varepsilon^2) \end{aligned}$$

where *It* ∼ B(1, 0.05).

3. The third is based on the linear non-Gaussian state space model with exponential errors defined by

$$\begin{aligned} Y\_t &= \beta\_t + e\_t, \quad e\_t \sim \text{Exp}(\lambda\_e) - \frac{1}{\lambda\_e} \\ \beta\_t &= \phi \beta\_{t-1} + \varepsilon\_t, \quad \varepsilon\_t \sim \text{Exp}(\lambda\_\varepsilon) - \frac{1}{\lambda\_\varepsilon}, \quad t = 1, \dots, n \end{aligned}$$

4. The last one is based on the linear non-Gaussian state space model with exponential errors and contaminated observations. Similar to scenario 2, we have

$$\begin{aligned} Y\_t &= \beta\_t + e\_t + I\_t k, \quad e\_t \sim \text{Exp}(\lambda\_e) - \frac{1}{\lambda\_e} \\ \beta\_t &= \phi \beta\_{t-1} + \varepsilon\_t, \quad \varepsilon\_t \sim \text{Exp}(\lambda\_\varepsilon) - \frac{1}{\lambda\_\varepsilon}, \quad t = 1, \dots, n \end{aligned}$$

where $I\_t \sim \mathcal{B}(1, 0.05)$ and *k* is given in (5).

For each of the four scenarios, sample sizes of *n* = 50, 200, 500 were simulated. A range of values was simulated for $\phi$ (0.25, 0.75) and for $\sigma^2\_{\varepsilon}$ and $\sigma^2\_{e}$ (0.10, 1.00, 5.00, 0.10, 2.00, 0.05). For each parameter combination, 2000 replicates with valid estimates were considered, i.e., estimates within the parameter space: $-1 < \phi < 1$, $\sigma\_{\varepsilon} > 0$ and $\sigma\_{e} > 0$. In all simulations, we take the initial state $\beta\_0 = 0$ in the Kalman filter.

To evaluate the quality of the parameter estimates, we considered the Root Mean Square Error (RMSE),

$$\text{RMSE}(\Theta) = \sqrt{\frac{1}{2000} \sum\_{i=1}^{2000} \left(\Theta\_i - \widehat{\Theta}\_i\right)^2}$$

the Mean Absolute Error (MAE),

$$\text{MAE}(\Theta) = \frac{1}{2000} \sum\_{i=1}^{2000} \left| \Theta\_i - \widehat{\Theta}\_i \right| $$

the Mean Absolute Percentage Error (MAPE),

$$\text{MAPE}(\Theta) = \frac{1}{2000} \sum\_{i=1}^{2000} \left| \frac{\Theta\_i - \hat{\Theta}\_i}{\Theta\_i} \right| \times 100$$

for $\Theta = (\phi, \sigma^2\_{\varepsilon}, \sigma^2\_{e})$, as well as the convergence rate. The convergence rate provides the percentage of valid estimates among all simulations (simulations with valid and non-valid estimates), i.e., the number of simulated valid estimates (in this case, 2000) divided by the total number of simulations.

To estimate the unknown parameters $\Theta = (\phi, \sigma^2\_{\varepsilon}, \sigma^2\_{e})$ of the state space model (3) and (4) in each simulation, the maximum likelihood method was used, assuming normality of the disturbances in all four scenarios. The log-likelihood maximization was performed by the Newton–Raphson numerical method. In this study, the R package "astsa" was used [3,13,14].

#### **3. Results**

In this section, the simulation results are presented. Tables 1–3 present the results of the simulations in terms of the RMSE, MAE, MAPE (%) and the convergence rate (%) for sample sizes *n* = 50, *n* = 200 and *n* = 500, respectively, considering both non-contaminated (NC) and contaminated Gaussian errors. Tables 4–6 show the simulation results considering contaminated and non-contaminated exponential errors.

**Table 1.** RMSE, MAE, MAPE and convergence rate of Θ with 2000 simulations of sample sizes *n* = 50, considering Gaussian errors (NC = Non-Contaminated; C = Contaminated).


As expected, contamination had an impact on the performance of the maximum likelihood estimators.

First, it can be seen that for non-contaminated errors the convergence rate tends to decrease for smaller sample sizes. For example, in the case of non-contaminated Gaussian errors, the convergence rate was over 72% for *n* = 500, while for *n* = 50 it was over 57%. For contaminated Gaussian and exponential errors, the convergence rate decreased compared to non-contaminated errors.

Overall, an improvement in the rate of valid estimates (convergence rate) is noticeable when *φ* = 0.75 compared to *φ* = 0.25 in the case of non-contaminated Gaussian and exponential errors. In the case of contaminated Gaussian and exponential errors, this behavior only occurred when *n* = 500.


**Table 2.** RMSE, MAE, MAPE and convergence rate of Θ with 2000 simulations of sample sizes *n* = 200, considering Gaussian errors (NC = Non-Contaminated; C = Contaminated).

When the errors are not contaminated, the RMSE, MAE and MAPE tend to decrease with increasing sample size. However, this does not hold when the errors are contaminated. In fact, it was found that for both Gaussian and exponential errors, outliers had more impact in two situations: when *φ* = 0.75 and *n* = 50 (Tables 1 and 4), and when *φ* = 0.25 and *n* = 500 (Tables 3 and 6). This impact is reflected in the RMSE, MAE and MAPE, which reached very high values.

Furthermore, there are many cases where the RMSE of the estimators under contaminated errors is several times higher than under non-contaminated errors. For example, in the case of Gaussian errors with *n* = 500, $\phi = 0.25$, $\sigma^2\_{\varepsilon} = 0.10$ and $\sigma^2\_{e} = 0.05$, the RMSE of $\phi$, $\sigma^2\_{\varepsilon}$ and $\sigma^2\_{e}$ under contaminated Gaussian errors were about 3, 6 and 11 times higher, respectively, than under non-contaminated Gaussian errors (Table 3).

On the other hand, comparing the Gaussian and exponential error cases, we find no significant differences in the convergence rate or in the efficiency of the autoregressive parameter estimator of $\phi$. However, the RMSE, MAE and MAPE of the variance estimators, $\sigma^2\_{\varepsilon}$ and $\sigma^2\_{e}$, are in general higher in the case of exponential errors.


**Table 3.** RMSE, MAE, MAPE and convergence rate of Θ with 2000 simulations of sample sizes *n* = 500, considering Gaussian errors (NC = non-contaminated; C = contaminated).

**Table 4.** RMSE, MAE, MAPE and convergence rate of Θ with 2000 simulations of sample sizes *n* = 50, considering exponential errors (NC = non-contaminated; C = contaminated).




**Table 5.** RMSE, MAE, MAPE and convergence rate of Θ with 2000 simulations of sample sizes *n* = 200, considering exponential errors (NC = non-contaminated; C = contaminated).



**Table 6.** RMSE, MAE, MAPE and convergence rate of Θ with 2000 simulations of sample sizes *n* = 500, considering exponential errors (NC = non-contaminated; C = contaminated).

#### **4. Discussion**

In this work, outliers were found to impact the performance of the maximum likelihood estimators. In particular, the simulation study showed that outliers have a very significant impact in two situations: when the sample size is small and the autoregressive parameter is close to 1, and when the sample size is large and the autoregressive parameter is close to 0.25. This impact was reflected in the RMSE, MAE and MAPE values, which in many cases were much higher compared to the case of non-contaminated errors.

Moreover, we notice that the rate of valid estimates (convergence rate) is higher for large sample sizes, and is more evident for non-contaminated Gaussian and exponential errors. On the other hand, it is also important to have large sample sizes to avoid problems related to parameter estimation [11]. In general, the convergence rate is lower when Gaussian and exponential errors are contaminated.

Therefore, our next step is to develop methods to detect outliers in time series and/or to establish other estimation methods that are more robust, in the sense that they do not assume a distribution of the data and are less sensitive to outliers.

In this work, the outliers were generated from a regression model that established a linear relationship between the magnitude of the outliers and the total variance of the model with the state space representation of maximum air temperature real data. The rate of outliers from the real data was 5%; thus, this was the percentage used in this work.

In the literature, we did not find a unanimous approach for doing this. For example, ref. [15] contaminated the zero-mean Gaussian observation error by replacing its standard deviation with a 10-times-higher standard deviation with a probability of 10% (symmetric outliers). They also considered the case of asymmetric outliers, where the zero mean of the observation error was replaced with a value 10 times higher than the standard deviation, with a probability of 10%. Ref. [16] followed the same line as [15], but called symmetric outliers "zero-mean" and asymmetric outliers "non-zero", considering a contamination probability of 5%. Ref. [9] contaminated both the observation and state equation errors, taking the magnitude of the outliers equal to 2.5 times the standard deviation obtained from the diagonal elements of the observation and state covariance matrices, respectively.
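To make these contamination mechanisms concrete, the following R sketch reproduces the two schemes of [15] described above, with the 5% rate adopted in this work as the default; the function name and arguments are our own illustrative choices, not code from the cited papers.

```r
# Contaminating zero-mean Gaussian observation errors; 'sd_e' is the clean
# standard deviation and 'rate' the contamination probability.
contaminate <- function(n, sd_e, rate = 0.05,
                        type = c("symmetric", "asymmetric")) {
  type <- match.arg(type)
  e   <- rnorm(n, mean = 0, sd = sd_e)  # clean errors
  hit <- runif(n) < rate                # positions of the outliers
  if (type == "symmetric") {
    e[hit] <- rnorm(sum(hit), mean = 0, sd = 10 * sd_e)    # inflated sd
  } else {
    e[hit] <- rnorm(sum(hit), mean = 10 * sd_e, sd = sd_e) # shifted mean
  }
  e
}

set.seed(1)
e <- contaminate(500, sd_e = sqrt(0.05))  # n = 500, sigma2_e = 0.05 as in Table 3
```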

**Author Contributions:** F.C.P., A.M.G. and M.C. contributed to this work. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by FEDER/COMPETE/NORTE 2020/POCI/FCT funds through grants UID/EEA/00147/2013/UID/IEEA/00147/006933 - SYSTEC project and To CHAIR - POCI-01-0145-FEDER-028247. A. Manuela Gonçalves was partially financed by Portuguese Funds through FCT (Fundação para a Ciência e a Tecnologia) within the Projects UIDB/00013/2020 and UIDP/00013/2020 of CMAT-UM. Marco Costa was partially supported by The Center for Research and Development in Mathematics and Applications (CIDMA) through the Portuguese Foundation for Science and Technology (FCT - Fundação para a Ciência e a Tecnologia), references UIDB/04106/2020 and UIDP/04106/2020. F. Catarina Pereira was financed by national funds through FCT (Fundação para a Ciência e a Tecnologia) through the individual PhD research grant UI/BD/150967/2021 of CMAT-UM.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Time Series Sampling †**

**Florian Combes 1,2,\*,‡ , Ricardo Fraiman 3,‡ and Badih Ghattas 2,‡**

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** Some complex models are frequently employed to describe physical and mechanical phenomena. In this setting, we have an input *X*, which is a time series, and an output *Y* = *f*(*X*), where *f* is a very complicated function whose computational cost for every new input is very high. We are given two sets of observations of *X*, *S*<sup>1</sup> and *S*<sup>2</sup>, of different sizes such that only *f*(*S*1) is available. We tackle the problem of selecting a subsample *S*<sup>3</sup> ⊂ *S*<sup>2</sup> of a smaller size on which to run the complex model *f*, such that the distribution of *f*(*S*3) is close to that of *f*(*S*1). We adapt to this new framework five algorithms introduced in a previous work, "Subsampling under Distributional Constraints", to solve this problem and show their efficiency using time series data.

**Keywords:** optimal sampling; Kolmogorov–Smirnov; time series; encoding; dynamic time warping

**Citation:** Combes, F.; Fraiman, R.; Ghattas, B. Time Series Sampling. *Eng. Proc.* **2022**, *18*, 32. https:// doi.org/10.3390/engproc2022018032

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 23 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

### **1. Introduction**

The study of the damage and stress sustained over time by a mechanical part allows for prediction of the failure of this part [1,2]. For this, it is necessary, on the one hand, to have reliable data and, on the other hand, to have a model that is faithful to reality. Such models are used to generate scenarios by solving partial differential equations (PDEs). Their input is often composed of different variables in the form of a time series denoted by *X*, and their output *f*(*X*) may be multidimensional and depend on space and time. The use of such models consists of solving complicated PDEs, and each generated scenario corresponds, in the machine learning paradigm, to an inference for a new input *X* and, therefore, to the computation of *f*(*X*). The more complex these models are, and the closer they are to reality, the more expensive they are in terms of computing time and power. In practice, these calculations can take days or weeks.

Consider a set *S*<sup>1</sup> = {*X*1, ... , *Xn*<sup>1</sup> } with a distribution similar to that of the time series *X* lying in a separable metric space (E, *ρ*). To each observation, we apply a deterministic smooth function. This function, *<sup>f</sup>* : E → R, is expensive and complex and may be seen as a black box. Moreover, we have a large sample *S*<sup>2</sup> of size *n*<sup>2</sup> with the same distribution as *X*. We do not know the values *f*(*S*2). The goal is to find a subsample *S*<sup>3</sup> ⊂ *S*<sup>2</sup> of a size *n*<sup>3</sup> smaller than *n*<sup>2</sup> in such a way that the distribution of *f*(*S*3) is close to that of *f*(*S*1).

At first sight, this is a classic sampling problem. We can use sampling techniques identical to those used in surveys, as well as unsupervised [3] or supervised techniques [4–6]. Some recent algorithms were proposed in [7] to solve this problem in the general case of a variable *X* that lies in any metric space. In this paper, we are interested in adapting such algorithms to the case in which *X* is a time series. We explore two possible adaptations; the first consists of encoding the time series *X* by independent features, and the second of using appropriate distances between time series and adjusting the sampling algorithms to use such distances.

This manuscript is organized as follows. In Section 2, we fix some notations that will be used throughout the manuscript, and we specify the framework of the problem to be solved. In Section 3, we describe the algorithms used in [7] to solve the problem. In Section 4, we suggest alternative adaptations for these algorithms for their application to time series. Section 5 gives an industrial application of these approaches. In Section 6, some concluding remarks are provided.

#### **2. The Problem Setting**

Let *S*<sup>1</sup> = {*X*1, ... , *Xn*<sup>1</sup> } be a set of *n*<sup>1</sup> time series following the same distribution *μ* as *X*, and S<sup>2</sup> = {*X* <sup>1</sup>, ... , *X <sup>n</sup>*<sup>2</sup> } a second set of time series of size *n*<sup>2</sup> coming from the same distribution *μ*. Let *f* : E → R be a deterministic function that is very complicated and hard to compute. The unknown distribution of *f*(*X*) will be denoted by *F*. Moreover, we have the set

$$\mathcal{Y}\_1 := \{ y\_i = f(X\_i) \text{ for } i = 1, \ldots, n\_1 \}$$

of images of the first sample S<sup>1</sup>, denoted *f*(*S*1). The images of *S*<sup>2</sup> by *f* are not available.

From this information, we want to determine a subsample S<sup>3</sup> ⊂ S<sup>2</sup> of size *n*<sup>3</sup> << *n*<sup>2</sup> in such a way that the empirical distribution of *f*(*S*3) := { *f*(*Xj*) : *Xj* ∈ *S*3} is close to the distribution of *f*(*S*1).

Several approaches are possible for this problem in a general setting in which *X* is a random variable in a separable metric space.

If *μ*<sup>1</sup> stands for the empirical distribution of S<sup>1</sup> and *μ*<sup>3</sup> for that of a subset S<sup>3</sup> ⊂ S2, and the function *f* is regular, we look for a subset S<sup>3</sup> such that

$$d(\mu\_1, \mu\_3) \tag{1}$$

is minimal among all possible subsets. Here, *d* is a distance metrizing weak convergence, such as the Prokhorov distance. In the next section, we describe some algorithms proposed in [7] to solve this problem, which we aim to adapt to time series.

#### **3. Sampling Algorithms**

We describe briefly some existing algorithms designed to solve our problem in a general context where *X* is a random variable that lies in any metric space. A more detailed description of these approaches may be found in [7].

Let *S*<sup>1</sup> = {*X*1, ... *Xn*<sup>1</sup> } and *S*<sup>2</sup> = {*X* <sup>1</sup>, ... *X <sup>n</sup>*<sup>2</sup> } be two independent samples with *n*<sup>2</sup> > *n*<sup>1</sup>, both drawn from the distribution of *X*.

Among the following algorithms, some make use of Y1, and others do not.

#### *3.1. Extended Nearest Neighbors Approach*

This algorithm does not make use of Y1. The selected subset *S*<sup>3</sup> is simply the set of the extended nearest neighbors of *S*<sup>1</sup> in *S*2. Consider the nearest neighbors of *S*<sup>1</sup> in *S*2, with ordered distances *d*1, ..., *dn*<sup>1</sup> and corresponding indices *j*(1), ..., *j*(*n*1).

Suppose that two elements of *S*1, *Xi* and *Xj*, have the same nearest neighbor *X <sup>l</sup>* in *S*2, with respective distances *di* and *dj*. If *di* < *dj*, then *X <sup>l</sup>* is taken as the nearest neighbor of *Xi*, while *Xj* takes its second nearest neighbor, and so on. This approach is based on the idea that if the elements of *S*<sup>3</sup> are close to those of *S*1, then the images of *S*<sup>3</sup> by *f* should be close to those of *S*1.
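A minimal R sketch of this selection rule follows, assuming each series has already been encoded as a row of a numeric feature matrix (see Section 4.1); the function name and the Euclidean metric are our assumptions, not the authors' implementation.

```r
# Greedy extended nearest neighbors: each element of S1 (rows of s1) is
# matched to a distinct element of S2 (rows of s2), closest pairs first.
extended_nn <- function(s1, s2) {
  n1 <- nrow(s1)
  d  <- as.matrix(dist(rbind(s1, s2)))[1:n1, -(1:n1), drop = FALSE]
  nn  <- apply(d, 1, which.min)            # provisional nearest neighbours
  ord <- order(d[cbind(1:n1, nn)])         # treat the closest pairs first
  taken <- integer(0)
  for (i in ord) {                         # resolve conflicts greedily
    cand  <- order(d[i, ])
    nn[i] <- cand[!(cand %in% taken)][1]   # next free neighbour of X_i
    taken <- c(taken, nn[i])
  }
  nn                                       # indices in s2 defining S3
}
```

Under the DTW adaptation of Section 4.2, the `dist` call would simply be replaced by a precomputed matrix of DTW distances.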

#### *3.2. A Partition-Based Algorithm*

For this algorithm, a partition of the set Y<sup>1</sup> into *L* clusters of size *m* is built such that *n*<sup>1</sup> = *mL*. Denote the clusters by *Ck*, their complements by *C*−*<sup>k</sup>* = Y<sup>1</sup> \ *Ck*, and by *Fk* the empirical cdf of *Y* in *Ck*. Denote by *C*ˆ *<sup>k</sup>* the cluster which minimizes ‖*Fk* − *F*−*k*‖, and by *C*˜ *<sup>k</sup>* = {*Xi*1, ... *Xim*} ⊂ *S*<sup>1</sup> the subset of *S*<sup>1</sup> fulfilling *f*(*C*˜ *<sup>k</sup>*) = *C*ˆ *<sup>k</sup>*. *S*<sup>3</sup> is defined as the set of the nearest neighbors of *C*˜ *<sup>k</sup>* in *S*2. The partition used in this algorithm may be any random partition, or any partition obtained from a clustering algorithm such as *k*-means. A sketch is given below.
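The sketch assumes the same encoding as before and reuses `extended_nn()` from the previous sketch; here *k*-means plays the role of the clustering step and `ks.test()` supplies the distance between the empirical cdfs.

```r
# Partition-based selection: pick the cluster of y1 = f(S1) whose empirical
# cdf is closest to that of its complement, then match its preimages in S2.
partition_select <- function(s1, s2, y1, L = 5) {
  cl <- kmeans(y1, centers = L)$cluster
  ks <- sapply(1:L, function(k)
    ks.test(y1[cl == k], y1[cl != k])$statistic)   # ||F_k - F_{-k}||
  keep <- which(cl == which.min(ks))               # indices of C~_k in S1
  s2[extended_nn(s1[keep, , drop = FALSE], s2), , drop = FALSE]  # S3
}
```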

#### *3.3. A Histogram-Based Approach*

The idea of this algorithm is to use the empirical distribution of the sample Y1. First, Y<sup>3</sup> is obtained by stratified sampling from Y1, where the strata are the bins of the histogram built on Y1. *S*<sup>3</sup> is then taken to be the set of the nearest neighbors of *<sup>f</sup>* <sup>−</sup>1(Y3) in *<sup>S</sup>*2. If *f* is smooth, the empirical distribution of *f*(*S*3) is expected to be close to that of *f*(*S*1). Two alternatives to this algorithm have been proposed in [7], replacing the stratified sampling used to obtain Y<sup>3</sup> with either the support points approach [3] or D-optimality [4]. A sketch follows.
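The following sketch illustrates the histogram-based variant under the same assumptions, again reusing `extended_nn()`; the proportional allocation per bin is our reading of the stratified sampling step.

```r
# Histogram-based selection: stratify y1 = f(S1) by histogram bin, resample
# n3 indices proportionally, and match the chosen preimages in S2.
histogram_select <- function(s1, s2, y1, n3) {
  bins <- cut(y1, breaks = hist(y1, plot = FALSE)$breaks, include.lowest = TRUE)
  take <- unlist(lapply(split(seq_along(y1), bins), function(idx) {
    m <- round(n3 * length(idx) / length(y1))      # stratum share of n3
    idx[sample.int(length(idx), min(m, length(idx)))]
  }))
  s2[extended_nn(s1[take, , drop = FALSE], s2), , drop = FALSE]  # S3
}
```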

#### **4. Adaptation of Algorithms to Time Series**

We suppose that the time series available in the samples *S*<sup>1</sup> and *S*<sup>2</sup> may have different lengths. In this section, we show how the algorithms presented in Section 3 may be adjusted when *X* is a time series. For this purpose, two approaches are considered: encoding, and choosing appropriate distances for time series. In the first approach, each time series in the data is embedded in a vectorial metric space through feature extraction; the features may be chosen arbitrarily or obtained from an autoregressive linear model fitted to each time series.

#### *4.1. Encoding*

We experiment with two types of encoding: arbitrary feature extraction, and the autoregressive (AR) coefficients described below.


A linear autoregressive model is adjusted to each time series with a maximum order *pmax*. Once the optimal orders *pj* are obtained for all time series in a sample, we fix the final common order as *p* = *maxj*{*pj*}. Finally, each time series is represented by its *p* estimated coefficients, as sketched below.
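A possible R implementation of this encoding, using the `ar()` function from the stats package (order selection by AIC is an assumption on our part), is the following.

```r
# AR encoding: select each order by AIC up to p_max, then re-fit every
# series at the common order p = max_j p_j so all feature vectors match.
encode_ar <- function(series_list, p_max = 10) {
  p <- max(1, sapply(series_list, function(x) ar(x, order.max = p_max)$order))
  t(sapply(series_list, function(x)
    ar(x, aic = FALSE, order.max = p)$ar))  # one row of p coefficients per series
}
```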

#### *4.2. Using Appropriate Distance*

Most of the algorithms described in Section 3 are based on neighborhoods and, thus, on distances. Here, we suggest replacing the Euclidean distance used in the general framework with dynamic time warping (DTW).

DTW allows for comparison of two time series by measuring their similarity, even if they have different lengths [8,9].

Consider two time series *A* = {*A*1, ... , *An*} and *B* = {*B*1, ... , *Bm*}, possibly with *n* ≠ *m*, taking values in the same space. The goal of DTW is to find the optimal temporal alignment between *A* and *B*; each point of *A* is associated with at least one point of *B* (and reciprocally), as shown in Figure 1. The optimal alignment, corresponding to the DTW distance between *A* and *B*, is the one giving the minimum total length between the couples of aligned points. It may be computed by dynamic programming.

**Figure 1.** Time points alignment between two time series [10].

For long time series, DTW may be expensive to compute. The computation may be accelerated using fast versions of the underlying optimization algorithm or by subsampling the time series at suitable frequencies.
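For illustration, a bare-bones version of the classical dynamic programming recursion is sketched below; the packages used in Section 5 provide optimized implementations, so this is only meant to fix ideas.

```r
# DTW via the O(nm) dynamic programming recursion; returns the total cost
# of the optimal alignment between two series of possibly unequal lengths.
dtw_dist <- function(a, b) {
  n <- length(a); m <- length(b)
  D <- matrix(Inf, n + 1, m + 1)
  D[1, 1] <- 0
  for (i in 1:n) {
    for (j in 1:m) {
      cost <- abs(a[i] - b[j])  # local distance between aligned points
      D[i + 1, j + 1] <- cost + min(D[i, j], D[i, j + 1], D[i + 1, j])
    }
  }
  D[n + 1, m + 1]
}

dtw_dist(sin(seq(0, 6, by = 0.1)), sin(seq(0, 6, by = 0.05)))  # unequal lengths
```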

#### **5. Real Dataset Application**

We now apply the proposed approach to a time series dataset concerning real customers provided by RENAULT (the French car manufacturer). All experiments were run using the R software [11] together with packages [12,13].

#### *5.1. Driving Behavior Dataset*

The variable *X* in our dataset is the time series of a driver's speed. A large number of time series are available, sampled at 5 Hz with a recording duration varying from a few days to a few weeks. Sampling at 5 Hz yields 432,000 points per day, so the time series studied are very long and heterogeneous. An example of such a time series is provided in Figure 2. We have used a selected sample of 691 customers.

**Figure 2.** Complete driver's speed (**left**) and zoomed part (**right**).

As these time series are speed traces of real customers, they vary from 0 to 150 km/h in our sample. Long and short stays at zero may be observed, corresponding either to night periods or to shorter pauses, such as stops at red lights. The lengths of the time series in our sample vary from 25,000 points to several million. Figure 3 shows the distribution of the time series lengths and the distribution of the zero values among the time series.

**Figure 3.** Histograms of time series length (**right**) and number of zeroes in each time series (**left**).

In addition, for the 691 customers, a complex numerical model *f* was run in order to estimate the maximum soot released by each customer's vehicle. This model requires as input the speed time series together with many other characteristics of the vehicle; the maximum soot is the output of interest. As this model requires very long computations (several hours per customer, depending on its parametrization), the objective is to select, from a dataset on which the model has not been run, a sample of customers for which the model would yield a distribution of maximum soot similar to that already obtained for the modeled customers.

#### *5.2. Results*

In this section, we provide the results of the algorithms proposed in Section 3: the extended nearest neighbors, the partition-based algorithm, the histogram-based approach and both variants using the support point and D-optimality.

To test these algorithms, the sample was randomly split into two parts, with *S*<sup>1</sup> containing 100 customers and *S*<sup>2</sup> the other 591. Each algorithm was run 100 times and, for each run, the obtained optimal subsample *S*<sup>3</sup> was compared to *S*<sup>1</sup> using the two-sample Kolmogorov–Smirnov (KS) test between *f*(*S*1) and *f*(*S*3), as the output of the model is known for all the samples in our dataset. Table 1 gives the average values of the KS statistics and the *p*-values over the 100 runs. We can observe that, depending on the encoding used, the algorithms do not react in the same way. The histogram-based algorithm provides the same results for the AR coefficients and feature extraction, though it is less efficient with the DTW. For the partition-based algorithm, the DTW performs much better than the two encoding approaches. For the nearest neighbors and the support point algorithms, we get very close results for both encodings. Finally, for D-optimality, the DTW and AR features provide the same results, which are better than those for feature extraction.


**Table 1.** Average results for each encoding and for each algorithm over 100 runs. NN = nearest neighbors.

Figure 4 shows the distribution of the *p*-values and the KS statistics over the 100 runs for all algorithms, the two encodings and DTW.

We can see that the dispersion of the computed statistics differs across the three approaches: AR features, feature extraction and DTW. For all algorithms, the AR encoding results are more unstable than those of the two other approaches.

**Figure 4.** Results of all algorithms by encoding from left to right: AR features, feature extraction and DTW. Algo 1 = Extended nearest neighbors; Algo 2 = Histogram-based; Algo 3 = Partition-based; Algo 4 = D-optimality; Algo 5 = Support point.

#### **6. Discussion and Conclusions**

In this work, we have tackled the problem of selecting a subsample of time series satisfying some distributional constraints. We have adapted to our context several algorithms recently proposed in a more general framework. All of the proposed approaches were tested on a real dataset of car speed time series. The results show that our algorithms work as expected, except for one approach: the histogram-based one. To adapt the existing algorithms, we have suggested various numerical representations for the time series and shown that different choices of encoding may affect the results significantly. More complex models than the autoregressive approach, or more refined statistical representations of the time series, might be more efficient. One idea under investigation is to use supervised embedding through recurrent neural networks.

**Author Contributions:** Conceptualization, R.F. and B.G.; software, F.C.; writing—original draft preparation, R.F. and B.G.; writing—review and editing, F.C. and B.G.; supervision, B.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data are not accessible because they are the property of the Renault Group.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Modelling a Continuous Time Series with FOU***(p)* **Processes †**

**Juan Kalemkerian**

Facultad de Ciencias, Universidad de la República, Montevideo 11400, Uruguay; jkalem@cmat.edu.uy † Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** In this work we summarize the knowledge about FOU*(p)* processes (fractional iterated Ornstein–Uhlenbeck processes of order *p*). Fractional Ornstein–Uhlenbeck processes are a particular case of FOU*(p)* processes (when *p* = 1). FOU*(p)* processes are able to model time series with both long- and short-range dependence. We give the definition, the main theoretical properties, and a procedure for estimating the parameters consistently. We also show how to model a continuous time series with FOU*(p)* processes, and we give an example of an application.

**Keywords:** fractional Brownian motion; long-range dependence; fractional Ornstein–Uhlenbeck process

#### **1. Introduction**

Usually, in time series, the researcher has a series of measurements evenly spaced in time (for example, measurements every minute, every thirty seconds, or weekly). In these cases, the underlying process evolves in continuous time. The fractional iterated Ornstein–Uhlenbeck processes of order *p* (which we call FOU(*p*) and which are defined in [1]) are stationary and centred Gaussian continuous-time processes. By construction, a FOU(*p*) process depends on the two parameters defining the underlying fractional Brownian motion, namely the Hurst exponent (*H*) and the scale parameter *σ*. From the relationship between the variogram and the Hölder index of the process trajectories, using a result given in [2], it is proved in [1] that *H* is the Hölder index of a FOU(*p*), giving information about the irregularity of the trajectories. If the process is observed at discrete, equispaced times in an interval [0, *T*], then by applying a procedure suggested in [3] it is possible to estimate *H* and *σ* consistently. Apart from *H* and *σ*, a FOU(*p*) process is determined by a set of additional parameters, the so-called *λ* parameters, giving information about the local dependence. The theoretical properties of any FOU(*p*) process and a methodology for estimating its parameters consistently (including the asymptotic behaviour) are given in [1]. The estimation method and the asymptotic results for the *λ* parameters were obtained under the assumption that the process is observed over the entire interval [0, *T*], where *T* → ∞. In [4], a consistent method can be found for estimating the *λ* parameters in the discretized case. An interesting property of the FOU(*p*) process is that it exhibits short-range dependence when *p* ≥ 2, even when *H* > 1/2 (in which case the generating fractional Brownian motion has long-range dependence). In addition, when *p* = 1, FOU(1) is the classical fractional Ornstein–Uhlenbeck process (fOU) defined in [5], which has long-range dependence when *H* > 1/2. Another interesting property is that, as *p* grows, the autocorrelation function of the process goes to zero more quickly. Moreover, fOU processes (FOU(1)) can be approximated by FOU(2) processes (simply by taking *λ*<sup>1</sup> → 0, where *λ* = (*λ*1, *λ*2)). Thus, FOU(*p*) processes can be viewed as a generalization of fOU processes and are able to model a time series with short-range or long-range dependence. The main objective of this work is to summarize the results obtained in [1,4,6] for modeling a time series through FOU(*p*) processes. In Section 2, we give the definition of a FOU(*p*) process. The method for estimating its parameters is given in Section 3. In Section 4, we

**Citation:** Kalemkerian, J. Modelling a Continuous Time Series with FOU*(p)* Processes. *Eng. Proc.* **2022**, *18*, 33. https://doi.org/10.3390/ engproc2022018033

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 23 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

give a method of modeling a time series through the FOU(*p*) process, including an example. Some conclusions are given in Section 5.

#### **2. Definition of FOU(***p***) Processes**

FOU(*p*) processes are built from the fractional Brownian motion.

**Definition 1.** *A fractional Brownian motion with Hurst parameter H* ∈ (0, 1] *is an almost surely continuous centred Gaussian process* {*BH*(*t*)}*t*∈<sup>R</sup> *such that its auto-covariance function is*

$$\mathbb{E}(\mathcal{B}\_H(t)\mathcal{B}\_H(s)) = \frac{1}{2} \left( |t|^{2H} + |s|^{2H} - |t-s|^{2H} \right) \text{ for every } t, s \in \mathbb{R}.$$

To use a fractional Brownian motion with a scale parameter *σ*, we use the notation {*σBH*(*t*)}*t*∈R.

Now, we can define a fractional iterated Ornstein–Uhlenbeck process of order *p* (FOU(*p*)), as found in [1].

**Definition 2.** *Let* {*σBH*(*s*)}*s*∈<sup>R</sup> *be a fractional Brownian motion with Hurst parameter <sup>H</sup> and scale parameter σ. Suppose further that λ*1, *λ*2, ..., *λ<sup>q</sup> are pairwise different and positive numbers and <sup>p</sup>*1, *<sup>p</sup>*2, ..., *pq* ∈ N *such that <sup>p</sup>*<sup>1</sup> + *<sup>p</sup>*<sup>2</sup> + ... + *pq* = *p. Then, the fractional iterated Ornstein– Uhlenbeck process of order p is defined as*

$$X\_t := T^{p\_1}\_{\lambda\_1} \circ T^{p\_2}\_{\lambda\_2} \circ \dots \circ T^{p\_q}\_{\lambda\_q} (\sigma B\_H)(t) = \sum\_{i=1}^q K\_i(\lambda) \sum\_{j=0}^{p\_i - 1} \binom{p\_i - 1}{j} T^{(j)}\_{\lambda\_i} (\sigma B\_H)(t),$$

*where the numbers Ki*(*λ*) *are defined by*

$$K\_i(\lambda) = K\_i\left(\lambda\_1, \lambda\_2, \dots, \lambda\_q\right) := \frac{1}{\prod\_{j \neq i} (1 - \lambda\_j/\lambda\_i)}\tag{1}$$

*and the operators* $T^{(j)}\_{\lambda\_i}$ *are defined by*

$$T\_{\lambda}^{(h)}(y)(t) := \int\_{-\infty}^{t} e^{-\lambda(t-s)} \frac{\left(-\lambda(t-s)\right)^{h}}{h!} dy(s) \text{ for } h = 0, 1, 2, \dots \tag{2}$$

*We define T<sup>λ</sup> simply for the h* = 0 *case, that is*

$$T\_{\lambda}(y)(t) := \int\_{-\infty}^{t} e^{-\lambda(t-s)} dy(s). \tag{3}$$

**Remark 1.** *The equality* $T^{p\_1}\_{\lambda\_1} \circ T^{p\_2}\_{\lambda\_2} \circ \dots \circ T^{p\_q}\_{\lambda\_q} = \sum\_{i=1}^q K\_i(\lambda) \sum\_{j=0}^{p\_i - 1} \binom{p\_i - 1}{j} T^{(j)}\_{\lambda\_i}$ *that appears in Definition 2 is proved in [7].*

**Remark 2.** *The same equality implies that the composition* $T^{p\_1}\_{\lambda\_1} \circ T^{p\_2}\_{\lambda\_2} \circ \dots \circ T^{p\_q}\_{\lambda\_q}$ *is commutative. Then, we assume that λ*<sup>1</sup> < *λ*<sup>2</sup> < ... < *λq, which avoids ambiguity when estimating λ* = (*λ*1, *λ*2, . . ., *λq*)*.*

**Notation 1.** $\{X\_t\}\_{t\in\mathbb{R}} \sim FOU\big(\lambda\_1^{(p\_1)}, \lambda\_2^{(p\_2)}, \ldots, \lambda\_q^{(p\_q)}, \sigma, H\big)$*, where* 0 < *λ*<sup>1</sup> < *λ*<sup>2</sup> < ... < *λq, or more simply,* $\{X\_t\}\_{t\in\mathbb{R}} \sim FOU(p)$*.*

**Remark 3.** *The notation FOU λ*(*p*1) <sup>1</sup> , *<sup>λ</sup>*(*p*2) <sup>2</sup> , . . ., *<sup>λ</sup>*(*pq* ) *<sup>q</sup>* , *σ*, *H implies that* 0 < *λ*<sup>1</sup> < *λ*<sup>2</sup> < ... < *λq. On the other hand, the notation FOU*(*p*) *means that we have taken p times the composition of operators T<sup>λ</sup> for different or equal values of λ.*

**Remark 4.** *Every FOU λ*(*p*1) <sup>1</sup> , *<sup>λ</sup>*(*p*2) <sup>2</sup> , . . ., *<sup>λ</sup>*(*pq* ) *<sup>q</sup>* , *σ*, *H is a Gaussian, centred, almost surely continuous process and is almost surely non-differentiable at any point (the proof of these results can be found in [1]).*

**Remark 5.** *When p* = 1*, FOU*(*λ*, *σ*, *H*) *is the classical fractional Ornstein–Uhlenbeck process.*

**Remark 6.** *In the case in which p*<sup>1</sup> = *p*<sup>2</sup> = ... = *pq* = 1*, we have*

$$X\_t = T\_{\lambda\_1} \circ T\_{\lambda\_2} \circ \dots \circ T\_{\lambda\_q}(\sigma B\_H)(t) = \sum\_{i=1}^q K\_i(\lambda) T\_{\lambda\_i}(\sigma B\_H)(t) \tag{4}$$

and we simply write {*Xt*}*t*∈<sup>R</sup> ∼ FOU(*λ*1, *λ*2, . . ., *λq*, *σ*, *H*).

**Remark 7.** *From Equation (4), we have that any FOU*(*λ*1, *λ*2, *σ*, *H*) *where* 0 < *λ*<sup>1</sup> < *λ*<sup>2</sup> *can be written as*

$$X\_t = \frac{\lambda\_1}{\lambda\_1 - \lambda\_2} X\_t^{(1)} + \frac{\lambda\_2}{\lambda\_2 - \lambda\_1} X\_t^{(2)} \tag{5}$$

*where* $X^{(i)}\_t = \sigma \int\_{-\infty}^{t} e^{-\lambda\_i(t-s)}\, dB\_H(s)$ *for i* = 1, 2*. That is,* $X^{(i)}\_t$ *is a classical fractional Ornstein–Uhlenbeck process with λ* = *λi. Thus, a FOU*(2) *process is a linear combination of two fOU processes with different values of λ.*

**Remark 8.** *From Equation (5), if λ*<sup>1</sup> → 0*, then* $X\_t \to X^{(2)}\_t$*; that is, every fractional Ornstein–Uhlenbeck process can be approximated by a FOU*(2) *process (by taking λ*<sup>1</sup> *small).*

**Remark 9.** *In [1] it is shown that every FOU*(*p*) *process has short-range dependence for p* ≥ 2 *and every H* ∈ (0, 1)*. On the other hand, it is well known that if H* > 1/2*, every classical fractional Ornstein–Uhlenbeck process has long-range dependence. Therefore, if H* > 1/2*, according to the previous remark, short-range dependent FOU*(2) *processes are able to approximate a long-range dependent fOU process.*

**Remark 10.** *From Remarks 5 and 9, we can say that FOU*(*p*) *processes are a generalization of fOU processes and are able to model a time series with either short-range or long-range dependence.*

#### **3. Parameter Estimation**

In this section, we summarize a procedure that allows the parameters of any FOU(*p*) to be estimated consistently. Similarly to the estimators of (*λ*, *σ*, *H*) proposed in [8] for the fractional Ornstein–Uhlenbeck process, the procedure for estimating the parameters of any FOU(*p*) process has two steps. As a first step, we estimate *σ* and *H* independently of the values of the *λ* parameters. As a second step, using the explicit formula for the spectral density (see Equation (8)) and substituting $(\widehat{H}, \widehat{\sigma})$ for (*H*, *σ*), we can estimate *λ* = (*λ*1, *λ*2, . . ., *λq*) using Whittle estimators.

Throughout this section, we assume that we have an equispaced sample in [0, *T*] of a FOU(*p*) process, that is, *XT*/*n*, *X*2*T*/*n*, ..., *XT*, which we simply call *X*1, *X*2, . . ., *Xn*.

#### *3.1. Estimation of H and σ*

We start by recalling the definition of a filter of length *k* + 1 and order *L*.

**Definition 3.** *a* = (*a*0, *a*1, . . ., *ak*) *is a filter of length k* + 1 *and order L* ≥ 1 *when the following conditions are fulfilled (stated here as the standard vanishing-moment conditions of the filter-based estimation literature, with the convention* 0<sup>0</sup> = 1*):*

$$\sum\_{j=0}^{k} a\_j\, j^r = 0 \ \text{ for } 0 \le r \le L-1, \qquad \sum\_{j=0}^{k} a\_j\, j^L \neq 0.$$
**Remark 11.** *Given a*, *a filter of order L and length k* + 1*, the new filter a*<sup>2</sup> *defined by a*<sup>2</sup> = (*a*0, 0, *a*1, 0, *a*2, 0, . . ., 0, *ak*) *has order L and length* 2*k* + 1*.*

Given a filter *a*, we can define the quadratic variation of a sample associated with *a*.

**Definition 4.** *Given a filter a of length k* + 1 *and a sample X*1, *X*2, . . ., *Xn, we define*

$$V\_{n,a} := \frac{1}{n} \sum\_{i=0}^{n-k} \left( \sum\_{j=0}^{k} a\_j X\_{i+j} \right)^2.$$

If we use a filter *a* of order *L* ≥ 2 and length *k* + 1, and we take Δ*<sup>n</sup>* = *n*−*<sup>α</sup>* for some 0 < *α* < 1 (so that *T* = *n*Δ*<sup>n</sup>* → +∞), then, if *H* > 1/2, the estimators of *H* and *σ* are given by

$$
\hat{H} = \frac{1}{2} \log\_2 \left( \frac{V\_{n,a^2}}{V\_{n,a}} \right) \tag{6}
$$

$$\widehat{\sigma} = \left(\frac{-2V\_{n,a}}{\Delta\_n^{2\hat{H}} \sum\_{i=0}^k \sum\_{j=0}^k a\_i a\_j |i-j|^{2\hat{H}}}\right)^{1/2}.\tag{7}$$

In [1,4], the theoretical details for the asymptotic normality and consistency of $\widehat{H}$ and $\widehat{\sigma}$ can be found.
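A direct R transcription of Equations (6) and (7) might look as follows; this is a sketch on our part, with `Delta` = *T*/*n*, and averaging over the *n* − *k* available terms stands in for the 1/*n* normalization.

```r
# Quadratic variation V_{n,a} of the sample x associated with filter a.
quad_var <- function(x, a) {
  v <- stats::filter(x, rev(a), sides = 1)  # running sums sum_j a_j X_{i+j}
  mean(v^2, na.rm = TRUE)
}

# Estimators (6) and (7) of H and sigma from an equispaced sample.
estimate_H_sigma <- function(x, a, Delta) {
  k  <- length(a) - 1
  a2 <- as.vector(rbind(a, 0))[1:(2 * k + 1)]         # dilated filter a^2
  H  <- 0.5 * log2(quad_var(x, a2) / quad_var(x, a))  # Equation (6)
  s  <- sum(outer(a, a) * abs(outer(0:k, 0:k, "-"))^(2 * H))
  sigma <- sqrt(-2 * quad_var(x, a) / (Delta^(2 * H) * s))  # Equation (7)
  c(H = H, sigma = sigma)
}

# e.g., with the order-2 Daubechies filter used in Section 4.2:
a <- c(0.482962, -0.836516, 0.224143, 0.129409) / sqrt(2)
```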

#### *3.2. Estimation of the λ Parameters*

If *<sup>X</sup>* <sup>=</sup> {*Xt*}*t*∈<sup>R</sup> <sup>∼</sup> *FOU*(*λ*(*p*1) <sup>1</sup> , ... , *<sup>λ</sup>*(*pq*) *<sup>q</sup>* , *<sup>σ</sup>*, *<sup>H</sup>*), where <sup>∑</sup>*<sup>q</sup> <sup>i</sup>*=<sup>1</sup> *pi* = *p*, the spectral density of *X* is given by (see [1])

$$f^{(X)}(\mathbf{x}, \boldsymbol{\lambda}, \sigma, H) = \frac{\sigma^2 \Gamma(2H + 1) \sin(H\pi) |\mathbf{x}|^{2p - 1 - 2H}}{2\pi \prod\_{i=1}^{q} (\lambda\_i^2 + \mathbf{x}^2)^{p\_i}}.\tag{8}$$

From (8), we can estimate *λ* = (*λ*1, *λ*2, . . ., *λq*) using a modified Whittle contrast. Consider (for any fixed *T* > 0) the function

$$U\_T^{(n)}(\lambda,\sigma,H) = \frac{T}{n}\sum\_{i=1}^{n} h\_T^{(n)}(iT/n,\lambda,\sigma,H),$$

where $h\_T^{(n)}$ is defined by

$$h\_T^{(n)}(\mathbf{x}, \boldsymbol{\lambda}, \sigma, H) = \frac{1}{2\pi} \Big( \log f^{(X)}(\mathbf{x}, \boldsymbol{\lambda}, \sigma, H) + \frac{I\_T^{(n)}(\mathbf{x})}{f^{(X)}(\mathbf{x}, \boldsymbol{\lambda}, \sigma, H)} \Big) w(\mathbf{x})^2$$

where

$$I\_T^{(n)}(\mathbf{x}) = \frac{T}{2\pi} \left| \frac{1}{n} \sum\_{j=1}^n e^{\frac{ijTx}{n}} \mathbf{X}\_{\frac{jT}{n}} \right|^2$$

is the discretization of the periodogram of the process and *w* is some weight function. Then, the vector *λ* can be estimated by

$$
\hat{\lambda}\_T^{(n)} = \arg\min\_{\lambda \in \Lambda} \mathcal{U}\_T^{(n)} \left(\lambda, \hat{\sigma}, \hat{H}\right) \tag{9}
$$

where -*<sup>σ</sup>* and *<sup>H</sup>*- are defined by (7) and (6), respectively, and <sup>Λ</sup> is some compact set. Details about the consistency of this estimator, including how to choose the *w* function, can be found in [4].

**Remark 12.** *When q* = 1*, that is,* $\{X\_t\}\_{t\in\mathbb{R}} \sim FOU\big(\lambda^{(p)}, \sigma, H\big)$*, the λ parameter can be estimated more easily by the following formula:*

$$
\hat{\lambda} = \left( \frac{n\hat{\sigma}^2 \hat{H} \Gamma\big(2\hat{H}\big) \prod\_{i=1}^{p-1} \big(i - \hat{H}\big)}{(p-1)! \sum\_{i=1}^{n} X\_i^2} \right)^{\frac{1}{2\hat{H}}}.\tag{10}
$$

Theorems on the consistency and asymptotic normality of $\hat{\lambda}$ can be found in [6].
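For the *q* = 1 case, formula (10) translates directly into R; the sketch below assumes `sigma_hat` and `H_hat` have been obtained from Equations (7) and (6), with all names being ours.

```r
# Closed-form estimator (10) of lambda when q = 1, given the sample x.
estimate_lambda_q1 <- function(x, sigma_hat, H_hat, p = 1) {
  n   <- length(x)
  num <- n * sigma_hat^2 * H_hat * gamma(2 * H_hat) *
         prod(seq_len(p - 1) - H_hat)   # empty product = 1 when p = 1
  (num / (factorial(p - 1) * sum(x^2)))^(1 / (2 * H_hat))
}
```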

#### **4. Modelling an Observed Time Series Using FOU(***p***) Processes**

Before modeling with FOU(*p*), it is of course necessary to subtract the mean value and to remove the seasonal component, if there is one. Given observations *X*1, *X*2, ..., *Xn* of a stationary, centred time series that we wish to model using a FOU(*p*) process, we first assume that the observations form an equispaced sample on [0, *T*], that is, *XT*/*n*, *X*2*T*/*n*, ..., *XT* for some value of *T*. As seen in the previous section, we need to estimate (*λ*, *σ*, *H*), whose estimators depend on *T*. Thus, we first need to fix the value of *T*.

#### *4.1. Choosing the Value of T*

Choosing the value of *T* is equivalent to choosing the unit of measurement of the time at which the observations are taken. In general, it is natural to take some particular value of *T* (for example, if the observations are weekly and we have 104 of them, it is natural to take *T* = 104 weeks or *T* = 2 years), but we can take any value of *T* and interpret it in terms of the original time scale of the data. Therefore, we can choose a value of *T* that optimizes a certain criterion. According to the theoretical results (see [1,4,6]), in order to model a time series dataset using FOU(*p*) processes, it is necessary to have values of *n* and *T* such that *n*, *T* → +∞ and *T*/*n* → 0 at a certain rate. Here, *n* is the sample size, and the observations lie in [0, *T*] for some value of *T*. In [4], it is suggested that the value of *T* be chosen to optimize some criterion, for example, MAE, RMSE, AIC, BIC, or the Willmott index.

#### *4.2. An Application to Real Data*

In this section, we work with the well-known Series A (a record of 197 chemical process concentration readings, taken every two hours). To model it with a FOU(*p*) process, we use values of *p* = 2, 3, 4. As a first step, for each of these models, we select a suitable value of *T*. We minimize the forecast errors for the last *m* observations,

$$\sqrt[h]{\frac{1}{m}\sum\_{i=1}^{m}\left|\mathbf{X}\_{n-m+i}-\widehat{\mathbf{X}}\_{n-m+i}\right|^{h}}$$

for *h* = 1 (mean absolute error, MAE) and *h* = 2 (the root mean square error prediction, RMSE) and maximize the Willmott index defined by

$$W\_h = 1 - \frac{\sum\_{i=1}^{m} \left| X\_{n-m+i} - \widehat{X}\_{n-m+i} \right|^{h}}{\sum\_{i=1}^{m} \left( \left| \widehat{X}\_{n-m+i} - \overline{X}(m) \right| + \left| X\_{n-m+i} - \overline{X}(m) \right| \right)^{h}}$$

where $\overline{X}(m) := \frac{1}{m}\sum\_{i=1}^{m} X\_{n-m+i}$ and *X*1, *X*2, ..., *Xn* (or *XT*/*n*, *X*2*T*/*n*, . . ., *XT*) are the real observations, for *h* = 1 and *h* = 2. Observe that *Wh* takes values between 0 and 1, and the predictions improve as *Wh* grows (*W*<sup>2</sup> is called the Willmott index and *W*<sup>1</sup> the Willmott *L*<sup>1</sup> index). In Table 1, we show the values of the four forecast quality measures for the AR(7) and ARMA(1, 1) models and for every adjusted FOU(*p*) model with *p* = 2, 3, 4. Every considered FOU(*p*) model performs similarly, and close to AR(7), in the four measures. In Figure 1, we show the values of the four forecast quality measures as a function of *T* for the FOU(*λ*(2), *σ*, *H*) model (the other FOU(*p*) models behave similarly) for *m* = 50 predictions. That is, for every value of *T* and every model, we estimate the parameters of the FOU(*p*) model, then obtain the *m* one-step predictions for the last *m* observations and compute the RMSE, MAE, *W*1, and *W*2. In every case, the optimal value is reached in the neighbourhood of *T* = 11. To estimate (*σ*, *H*), we used a Daubechies filter of order 2:

$$a = \frac{1}{\sqrt{2}} (0.482962, -0.836516, 0.224143, 0.129409).$$

ARMA(1, 1) and AR(7) models have been suggested for the Series A dataset (see [9], where this dataset was introduced) [10,11]. In Figure 2, we show that the adjusted FOU(*λ*(3), *σ*, *H*) and FOU(*λ*(4), *σ*, *H*) fit the empirical auto-covariance better than the two ARMA-type models considered.
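The four forecast quality measures used to select *T* are easy to compute jointly; the sketch below takes the last *m* observations and their one-step predictions, with all names being ours. Scanning it over a grid of candidate *T* values reproduces the kind of curves shown in Figure 1.

```r
# MAE, RMSE and the Willmott indices W1, W2 for observations x and
# one-step predictions xhat over the last m points.
forecast_scores <- function(x, xhat) {
  err  <- abs(x - xhat)
  xbar <- mean(x)                 # the trailing mean X(m)
  W <- function(h)                # Willmott index of order h
    1 - sum(err^h) / sum((abs(xhat - xbar) + abs(x - xbar))^h)
  c(MAE = mean(err), RMSE = sqrt(mean(err^2)), W1 = W(1), W2 = W(2))
}
```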

**Figure 1.** RMSE, MAE, *W*1, and *W*<sup>2</sup> (for *m* = 50 predictions) when the model used is FOU(*λ*(2), *σ*, *H*) as a function of *T*.

**Figure 2.** In black, the empirical auto-covariance function and in blue, the fitted auto-covariance function, according to the adjusted model for the Series A dataset.


**Table 1.** Values of *W*2, *W*1, RMSE, and MAE for the different models adjusted to Series A. The optimal value is shown in bold.

#### **5. Some Concluding Remarks**

We summarize below the main conclusions obtained (details can be found in [1,4,6]):

1. FOU(*p*) processes are a Gaussian family of continuous-time stochastic processes that generalize (by taking *p* = 1) the classical fractional Ornstein–Uhlenbeck processes.


**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The Series A dataset is available from basic R packages.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**

