*Proceeding Paper* **PV Fault Diagnosis Method Based on Time Series Electrical Signal Analysis †**

**Carole Lebreton, Fabrice Kbidi \*, Frédéric Alicalapa, Michel Benne and Cédric Damour**

Energy Lab, Université de La Réunion, 15, Avenue René Cassin CS 92003, CEDEX 9, 97744 Saint-Denis, France; carole.lebreton@univ-reunion.fr (C.L.); frederic.alicalapa@univ-reunion.fr (F.A.); michel.benne@univ-reunion.fr (M.B.); cedric.damour@univ-reunion.fr (C.D.)

**\*** Correspondence: fabrice.kbidi@univ-reunion.fr

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** With the objectives of energy self-sufficiency and zero emissions in La Réunion, photovoltaics are becoming an increasingly important part of the local energy mix. Installation reliability and safety are crucial to ensure network stability and security; PV system fault diagnosis is therefore an essential tool in the expansion of this electricity production method. The DETECT Project (Diagnosis onlinE of sTate of health of EleCTric systems) is a research project aiming at the development of diagnosis methods. To this end, signal processing provides promising tools in the form of decomposition algorithms. Thanks to their low computational cost, empirical mode decomposition (EMD) and variational mode decomposition (VMD) allow real-time diagnosis through on-line analysis of PV electrical signal time-series data.

**Keywords:** PV systems; faults; diagnosis; signal processing; time series data

**Citation:** Lebreton, C.; Kbidi, F.; Alicalapa, F.; Benne, M.; Damour, C. PV Fault Diagnosis Method Based on Time Series Electrical Signal Analysis. *Eng. Proc.* **2022**, *18*, 18. https://doi.org/10.3390/engproc2022018018

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

The current ecological and environmental situation has led to an international awareness of greenhouse gas emissions. Renewable and natural energy sources have been widely used and developed over the last 20 years. The production of electrical energy from photovoltaics in territories with a high solar potential has increased considerably. As Reunion Island is located in a tropical zone with a high annual productivity (1314 h of full-power equivalent in 2019 [1]), the energy policy of France and of the island encourages the development of photovoltaic solar energy. In this context, installed photovoltaic capacity has increased from 0 to 206.3 MW in 15 years [1] in order to counter the massive importation of fossil fuels. As a non-interconnected grid territory, the Reunion Island grid is prone to instability due to strong variations in intermittent energy production. Consequently, maintenance and forecasting of photovoltaic production are necessary. Tropical conditions induce an increase in faults and premature aging of photovoltaic systems [2]. Moreover, the photovoltaic park in Reunion Island includes power plants of widely varying sizes. There is a large proportion of small installations (62.66% of the photovoltaic plants are below 9 kVA), which are unequally equipped for diagnosis and fault detection.

The DETECT project aims to implement innovative technical solutions to detect, isolate, and identify faults that may lead to failures. These solutions allow improvement of the energy efficiency of PV systems by anticipating maintenance operations. The deployment of these solutions on a large scale relies on the use of low-cost technologies, allowing the implementation of diagnostic modules while limiting the installation of additional sensors in situ. The objective of the project is to develop distributed technical solutions to analyze the health status of hybrid systems, in particular PV systems, installed in humid tropical environments such as Reunion Island. The innovative character of this project is based on several elements:


#### **2. Method Presentation**

When a fault occurs, symptoms appear in the PV system: the behavior of the voltage or current curves can be affected in a more or less visible way [3]. Advanced diagnosis methods collect the information available on the production site and process it in order to highlight this characteristic behavior, i.e., the fault symptoms.

The development of a diagnosis method consists of three steps:


In order to meet the constraints of the DETECT project, the following options are retained:


#### **3. Time Series Decomposition and Preliminary Results**

Signal processing methods are used to extract additional features from time series. Among the best known, the fast Fourier transform (FFT) converts discrete time-domain data into the frequency domain. The information collected comprises the gain and the phase of the signal at each frequency. These characteristics carry information that is not accessible in the time domain. Unlike the FFT, whose scope is limited to stationary or periodic signals, the wavelet transform (WT) is applicable to non-stationary and transient signals: the signal is decomposed into distinct details with determined bandwidths, and the information thus extracted is used for fault diagnosis.
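As an illustration (not part of the original paper), a minimal sketch of this frequency-domain feature extraction is given below; it assumes a uniformly sampled PV current record stored in a NumPy array, and the sampling period and data are hypothetical placeholders.

```python
import numpy as np

def fft_features(current, dt):
    """Return the frequency axis, gain and phase of a uniformly sampled signal."""
    n = len(current)
    spectrum = np.fft.rfft(current - np.mean(current))  # remove the DC offset first
    freqs = np.fft.rfftfreq(n, d=dt)                    # frequency axis in Hz
    gain = np.abs(spectrum) / n                         # amplitude at each frequency
    phase = np.angle(spectrum)                          # phase at each frequency (rad)
    return freqs, gain, phase

# Hypothetical usage on a 600-sample PV output current window sampled at 1 Hz
current = np.random.default_rng(0).normal(8.0, 0.1, 600)   # placeholder data
freqs, gain, phase = fft_features(current, dt=1.0)
```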

In order to extract information from PV output time series, these well-known decomposition tools, as well as more recent ones, have already been applied to PV fault diagnosis: FFT [4], EMD (empirical mode decomposition) [5], discrete wavelet transform decomposition [6], and PCA (principal component analysis) [7] can be cited. These tools deserve special attention for the diagnosis of PV system faults, e.g., to detect arc faults, define PV system health status, quantify power quality disturbances in a PV-based microgrid, or discern shading faults.

Variational mode decomposition (VMD) is a decomposition algorithm, developed in 2014 by Dragomiretskiy [8], that non-recursively decomposes a signal *x*(*t*) into a finite number of distinct amplitude-modulated, frequency-modulated (AM-FM) signals *xk*(*t*), called intrinsic mode functions (IMFs).

The relative performance of EMD and VMD has been studied for structural health monitoring [9] and medical time series [10]. The latter study highlights that EMD is more sensitive to low frequencies than VMD, and conversely that VMD is more sensitive to high frequencies. Regarding signal noise, VMD filters high-frequency noise more effectively without altering the signal characteristics, unlike EMD, which reduces the signal amplitude.

It is important to note that the respective advantages of EMD and VMD, and the superiority of one over the other, depend strongly on the characteristics of the time series.

VMD has already been used in other application domains for diagnosis [11], and it is a promising tool for electrical system diagnosis [12]. In the next subsections, EMD and VMD are applied to PV output current for healthy, cloudy, and faulty data.
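As a brief sketch of how this decomposition step could be coded (under the assumption that the open-source PyEMD and vmdpy packages are used, which the paper does not specify), the snippet below decomposes a window of PV output current with both algorithms; the file name, mode count and bandwidth constraint are placeholder values.

```python
import numpy as np
from PyEMD import EMD      # pip install EMD-signal
from vmdpy import VMD      # pip install vmdpy

# Hypothetical PV output current window (replace with the measured data)
current = np.loadtxt("pv_current_window.txt")

# Empirical mode decomposition: recursive sifting, number of IMFs is data-driven
emd_imfs = EMD()(current)

# Variational mode decomposition: K modes extracted concurrently
K, alpha = 5, 2000                                    # assumed settings
vmd_imfs, _, omega = VMD(current, alpha, 0.0, K, 0, 1, 1e-7)
```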

#### *3.1. Experimental Data Selection*

Three types of experimental conditions were applied to a real 4 kWp rooftop PV plant consisting of two strings of twelve panels: healthy conditions under a clear sky, healthy conditions under a cloudy sky, and faulty conditions under a clear sky.

#### *3.2. Preliminary Results*

The following figures illustrate EMD and VMD analysis of the selected data. The different time series are shown in Figure 1. The selected windows defined in red are decomposed by EMD and VMD. Figures 2 and 3 exhibit the different IMFs for the three conditions.

**Figure 1.** Experimental dataset. (**a**) Healthy conditions: clear sky. (**b**) Healthy conditions: cloudy sky. (**c**) Faulty conditions: partial shading.


**Figure 2.** Empirical mode decomposition. (**a**) Healthy conditions: clear sky. (**b**) Healthy conditions: cloudy sky. (**c**) Faulty conditions: partial shading.

In order to extract more information on the IMF characteristics, an analysis of each IMF was conducted. For each IMF, its energy contribution was calculated, its central frequency was collected, and the correlation coefficient between the IMF and the original signal was computed. These preliminary results suggest that some IMFs react differently according to the experimental conditions. The analysis and study of these IMFs could allow the occurring conditions to be determined with accuracy, and a faulty state to be discriminated from a healthy state.
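A sketch of this IMF characterization step is shown below; it assumes `imfs` is the array of IMFs returned by EMD or VMD and `signal` is the original window (hypothetical names), and it takes the dominant FFT bin of each IMF as the central frequency, which is one possible reading of the description above.

```python
import numpy as np

def imf_features(imfs, signal, dt):
    """Energy share, dominant frequency and correlation with the original signal."""
    imfs = np.atleast_2d(imfs)
    total_energy = np.sum(imfs ** 2)
    features = []
    for imf in imfs:
        energy_share = np.sum(imf ** 2) / total_energy
        spectrum = np.abs(np.fft.rfft(imf))
        freqs = np.fft.rfftfreq(len(imf), d=dt)
        central_freq = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin
        corr = np.corrcoef(imf, signal)[0, 1]
        features.append((energy_share, central_freq, corr))
    return features
```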

#### **4. Conclusions and Perspectives**

The DETECT Project aims to develop real-time, low-cost, and accurate PV system fault diagnosis. In this context, EMD and VMD are promising signal processing tools with a low computational cost. This advantage makes on-line analysis of PV time-series data feasible.

The targeted method should combine high efficiency with low-cost technologies, while minimizing the human and material effort deployed on photovoltaic parks. The first results have been confirmed on a set of experimental data. The analysis of IMF characteristics and the selection of the most significant ones are in progress. A comparative study of EMD and VMD performance, associated with another step of the fault diagnosis tool, is being undertaken. In parallel, different research domains are being investigated in the context of signal processing and information theory. The main perspective is to build an indicator derived from IMF characteristics, aiming to precisely detect, isolate, and diagnose possible faults.

**Author Contributions:** Conceptualization, C.L., C.D. and F.K.; methodology, C.L., C.D. and F.K.; software, C.L. and C.D.; validation, C.L., C.D., F.K., F.A. and M.B.; formal analysis, C.L., C.D., F.K., F.A. and M.B.; investigation, C.L., C.D., F.K., F.A. and M.B.; resources, C.D., F.A. and M.B.; writing—original draft preparation, C.L. and F.K.; writing—review and editing, C.L., C.D. and F.K.; supervision, C.D., F.A. and M.B.; project administration, C.D., F.A. and M.B.; funding acquisition, C.D., F.A. and M.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research work was funded by the ERDF (European Regional Development Fund) and the Region Council of La Réunion.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** Special acknowledgments go to the European Regional Development Fund (ERDF) and the Regional council of Reunion for DETECT project funding. We also thank the project's partner, the Femto-ST laboratory.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Early Detection of Flash Floods Using Case-Based Reasoning †**

**Enrique Fernádez 1,‡, José R. Villar 1,\*, Alberto Navarro 2 and Javier Sedano 2**


**Abstract:** A flash flood is a sudden increase in the water level of a basin due to an abrupt change in weather conditions. The importance of early flash flood detection lies in reducing its consequences, whether infrastructure damage or human losses. Interestingly, the studies in the literature focus on the dynamics of the basins, determining what the water levels in a basin would be under a considered scenario, and leave early, online flash flood detection unaddressed. This research addresses this latter problem and proposes a case-based reasoning (CBR) system that estimates the flooding map for a given prediction horizon. Provided enough data are available, this CBR tool can deal with different basins and locations. This research is being designed and developed on two concrete basins, one in Spain and one in France. We expect that the performance of the CBR tool will satisfactorily support the decision making of public safety experts.

**Keywords:** flash flood; case-based reasoning; risk prediction

**Citation:** Fernádez, E.; Villar, J.R.; Navarro, A.; Sedano, J. Early Detection of Flash Floods Using Case-Based Reasoning. *Eng. Proc.* **2022**, *18*, 19. https://doi.org/10.3390/engproc2022018019

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Abrupt changes in weather conditions and climate change are inducing more and more flash floods, that is, sudden increases in the water level of a basin due to a sudden change in the weather conditions, among other variables such as soil conditions. A great effort has been devoted to risk assessment, aiming to evaluate the risk of flooding under certain scenarios in different basins. Risk assessment identifies those areas of a basin that are susceptible to flooding in order to define efficient management policies [1,2], to design new infrastructure that avoids flooding [3] or to design complementary infrastructures that mitigate the effects of these flash floods [4].

Nevertheless, the early and online detection of flash floods has not yet attracted the focus of the research community [5]. Such a tool would allow public safety experts to make decisions in advance in the event of a flash flood, reducing its consequences as much as possible. Solving this latter problem, that is, aiding public safety experts in flash flood detection, requires not only the basin's geomorphological and hydrological information, but also information concerning the current scenario (for example, the water level of the rivers at certain points and the amount of rain in the last relevant period), together with weather forecast information.

In this research, a prototype for the online detection of flash floods is proposed. To do so, information from the sensory systems deployed on the basin is queried, as well as the weather forecast system, to obtain a clear picture of the basin's current state. These time series are used to estimate the height of water at any point in the basin. The water's height can be superimposed on a map, providing information about what geographical points in the basin are in danger of a sudden flood.

Features are extracted from the above-mentioned time series, and case-based reasoning (CBR), as the intelligent paradigm, finds similarities among the cases retrieved and selected from the case base. The outcome of the CBR is the above-mentioned map containing the height of the water at every single point in the basin. Together with a web application, this novel online flash flood tool can guide public safety teams in their decision-making.

The structure of this research is as follows. Section 2 discusses the related work section concerning flash flood risk assessment. Section 3 describes the prototype that has been developed. Section 4 shows the results obtained for a real basin in Spain. Finally, the conclusions from this research are drawn.

#### **2. Related Work**

Risk assessment, a widely used term in the flood detection literature, aims to identify the areas of a basin that are susceptible to flooding. Risk assessment may use, among other possibilities, Machine Learning (ML) and artificial intelligence methods and techniques; however, the majority of the research proposes off-line numerical approaches [6], the off-line characteristic arising from the computational requirements of the simulations.

When talking about ML, both unsupervised and supervised learning have been applied. With respect to unsupervised learning, the main idea is to design or select a set of risk indexes or features by clustering the instances from the available GIS (Geographic Information System) points, which introduces a risk label per group; the risk label of the most suitable group then labels every new GIS point instance. For example, the study in [7] proposes merging the improved analytic hierarchy process with the ISO-Maximum clustering algorithm, which hybridizes iterative self-organizing data analysis and Maximum Likelihood clustering. With these methods, every possible position is labeled with a risk index that reflects its geomorphological and geographical characteristics.

Moreover, supervised learning methods, such as decision trees or support vector machines, have also been used for risk assessment. In [8], flood susceptibility mapping of GIS points in the Kelantan basin (Malaysia) was addressed using rule-based decision trees and the combination of the frequency ratio and logistic regression statistical methods. Up to 10 features, called conditional factors, were calculated in a first stage devoted to feature selection from GIS and remote sensing data. Afterwards, the ML and statistical methods computed a risk value for each position on a basin map. This scheme has also been used in different studies, such as in [9], where random forest and boosted regression trees were compared.

Neural networks have also been reported for risk assessment [10,11]. As before, a set of features were extracted from GIS positions to train and test the neural model. Each GIS position is evaluated when deploying the model, computing the corresponding set of features and calculating the risk assessment using the trained neural model for the basin. Support vector machines were proposed instead of neural networks using a rather similar procedure reported in [12].

Furthermore, the study in [13] used multicriteria decision making in the evaluation of the risk assessment. This research proposed analytical hierarchy process to generate the decision-making process and assign a risk label to each GIS position. A similar study was reported in [14] for the assessment in the Mashhad Plain basin, Iran.

Additionally, the Dempster–Shafer-based evidential belief function has been proposed for the assessment and spatial prediction of flood-susceptible areas [15]. A set of features is extracted from each GIS location, and these features are used as inputs to the Dempster–Shafer method. Interestingly, this probabilistic reasoning method outperformed other methods used in the literature, such as logistic regression and frequency ratio. A similar conclusion is presented in [9], where a comparison between the Dempster–Shafer method and two different decision trees was performed. In this latter case, the features and the method vary from the former study, but the conclusions are similar.

These studies show that perhaps ML is not the most suitable technique for this type of problem, because of the difficulty of obtaining enough valid and representative data from the whole basin; training on such data could yield overly generalized models. Furthermore, the evidence that the Dempster–Shafer theory can compete with ML methods suggests that what is needed is an artificial intelligence technique that can represent the knowledge of the experts, extrapolating this knowledge to the different positions in the basin.

It is worth recalling that all of these approaches focus on off-line risk assessment. Nonetheless, if on-line operation is also considered, not only GIS information but also sensory data and weather forecasts are needed in order to assess a given scenario. With all these premises, this research proposes an alternative that has never been used in this context.

#### **3. A Flash Flood Detection System: Using CBR as Reasoning Paradigm**

The main idea is to develop a web application to assist public safety experts in their decision-making process. This web application requests basic information about the corresponding basin; then, it queries a CBR web service for a sequence of maps representing the mm of water in the area, with one map per prediction horizon. The CBR, on its side, performs the first three stages of this type of system (retrieve, reuse, and revise) to merge the maps of the most relevant cases, considering a similarity measurement between the stored cases and the current scenario or status of the basin. For the purposes of this research, the retain stage is not implemented, as explained when discussing the case representation.

This study analyses the behavior of hydrological basins, each basin including one or more locations where a prediction can be requested. A request produces a prediction of the state of the basin (in millimeters of water) for up to six prediction horizons. A prediction horizon is a time step that is a function of the dynamics of the concrete basin; for instance, for some quick-response basins, this time could be in minutes, while for steadier basins, this period could be one hour.

The following subsections provide details on the different parts of the system. Section 3.1 deals with the web interface; Section 3.2 explains how a case, representing a basin state, is stored. Finally, the different CBR stages are detailed in Section 3.3.

#### *3.1. Exploiting the CBR Tool*

Figure 1 shows the interface of the web application that has been developed for the INUNDATIO project; this interface is where the expert interacts with the system. Firstly, the user must introduce the basin and location, then the weather forecast. Then, by clicking on the button, the CBR service is requested and a sequence of maps will be delivered and shown. The tool visualizes the map of the mm of water as a layer on top of an open-layers map from the corresponding basin.

There are two main options in this interface: automatic weather forecast, requesting the values from the designated forecast service, and manual forecasting. In the latter case, the user chooses between a linear, a quadratic, or a Gaussian forecast by setting the function's parameters. Such a user-defined forecast allows evaluating the basin behavior in the case of extreme events.
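As an illustration of this manual forecasting option (a sketch only, with hypothetical parameter names not taken from the project code), the three forecast shapes could be generated as rainfall values over the prediction horizons as follows.

```python
import numpy as np

def manual_rain_forecast(kind, horizons, a=1.0, b=0.0, mu=3.0, sigma=1.0):
    """Generate a synthetic rainfall forecast (mm) for a number of horizons."""
    t = np.arange(1, horizons + 1, dtype=float)
    if kind == "linear":
        rain = a * t + b
    elif kind == "quadratic":
        rain = a * t ** 2 + b
    elif kind == "gaussian":
        rain = a * np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    else:
        raise ValueError(f"unknown forecast kind: {kind}")
    return np.clip(rain, 0.0, None)   # rainfall cannot be negative

forecast = manual_rain_forecast("gaussian", horizons=6, a=20.0, mu=3.0, sigma=1.5)
```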



**Figure 1.** Example of the web interface for the CBR deployment. (**a**) In the upper part, the manual rainfall forecast is shown. (**b**) In the bottom part, the automatic weather forecast interface is depicted.

#### *3.2. Case Representation of the Basin State*

As mentioned before, a case in the case base represents the state of a certain basin. The state of the basin is defined by a set of variables; each of them is a time series. According to the hydrodynamic experts, these time series can be represented using a window of values; the length of each window varies with the variable and the specific basin as a function of its dynamics: faster basins require smaller windows and vice versa. Moreover, each window is split into intervals, and the average of the values of the variable is computed for each interval. Then, each time series is represented as a sequence of aggregated values.

Furthermore, each basin includes information such as its list of deployed sensors and sensor types, and whether the values from a sensor type should be aggregated or not. For instance, the Rain Gauges (RG) are usually aggregated among all the rain gauges in the basin, whereas the Water Discharge Levels (WDL) are usually considered individually. Figure 2 shows the variables and the number of intervals that are considered in this study. Additionally, each case also stores maps of the basin with the millimeters of water at any point for a set of prediction horizons, as also shown in the Figure.
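The sketch below illustrates one way to turn raw sensor windows into the aggregated case representation described above; the window length, interval count and sensor names are hypothetical placeholders, not values taken from the paper.

```python
import numpy as np

def window_to_intervals(values, n_intervals):
    """Split a window of raw readings into intervals and average each interval."""
    chunks = np.array_split(np.asarray(values, dtype=float), n_intervals)
    return [float(np.mean(c)) for c in chunks]

def build_case(sensor_windows, n_intervals):
    """Build a case descriptor: aggregated rain gauges, individual WDL sensors."""
    case = {}
    # Rain gauges (RG) are averaged across the basin before interval aggregation
    rain = np.mean([sensor_windows[k] for k in sensor_windows if k.startswith("RG")], axis=0)
    case["RG"] = window_to_intervals(rain, n_intervals)
    for name, window in sensor_windows.items():
        if not name.startswith("RG"):
            # Water discharge levels (WDL) are kept per sensor
            case[name] = window_to_intervals(window, n_intervals)
    return case

# Hypothetical usage with two rain gauges and one water discharge sensor
windows = {"RG1": np.random.rand(96), "RG2": np.random.rand(96), "WDL1": np.random.rand(96)}
case = build_case(windows, n_intervals=8)
```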

It is worth mentioning that these maps and this information are gathered from the results of simulations of the basin using hydrological models. Certainly, there are well-defined hydrological models of the basins; however, running a simulation takes too much time, and the outcomes cannot be delivered quickly enough to be useful for an online request from public safety experts. This is the reason the whole CBR is designed to mimic the performance of the hydrological models in predicting a basin's behavior.

**Figure 2.** The case representation used in this research. The colored boxes represent the intervals in which each window is split. A map with the mm of water is stored for each prediction horizon.

#### *3.3. CBR Stages*

A case includes maps with the mm of water for each prediction horizon. This means that, if the CBR were required to retain cases, these flood maps would have to be gathered for each prediction horizon so the CBR system could store them within the case. Nevertheless, this information is not available to the system: there is no way to collect the actual mm of water at each point in the basin; therefore, the retain phase of the CBR cannot be completed, and the CBR relies only on the information extracted from simulations. How to use ideas such as concept drift to include extra information in the case base is left for future research.

Hence, only three stages are to be defined: retrieve, reuse, and revise. For the retrieval phase, simple queries are performed to retrieve the cases for the current location and basin. This might not be efficient in terms of computation; however, the scheme allows us to define different distance measurements without the need to design complex queries.

The reuse stage performs the calculation of the distances between the cases retrieved from the case database and the current scenario. In this research, the square root of the weighted sum of squared differences between the values of the variables describing the current scenario and those of a case is used as the distance measurement. Single-criterion sorting by this distance, together with a maximum number of neighbors, defines the selection of the most relevant cases. Moreover, further distance measurements can be defined, for instance using Sobol' indexes [16], or introducing not only single-criterion but multi-criteria sorting, for instance using the Pareto non-dominance concept.
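A minimal sketch of the reuse stage under these assumptions (a weighted Euclidean distance over flattened case descriptors and single-criterion k-nearest selection; all variable names are hypothetical) might look as follows.

```python
import numpy as np

def case_distance(case_a, case_b, weights):
    """Weighted Euclidean distance between two flattened case descriptors."""
    diff = np.asarray(case_a, dtype=float) - np.asarray(case_b, dtype=float)
    return float(np.sqrt(np.sum(weights * diff ** 2)))

def reuse(current, retrieved_cases, weights, k=5):
    """Select the k retrieved cases closest to the current scenario."""
    scored = [(case_distance(current, c["features"], weights), c) for c in retrieved_cases]
    scored.sort(key=lambda pair: pair[0])    # single-criterion sorting by distance
    return scored[:k]                        # (distance, case) pairs, nearest first
```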

Finally, the revise stage generates the outcome from the selected cases. To do so, the distance between each relevant case and the current scenario is used to weight the maps of the relevant cases (see Equation (1)). The idea is that the smaller the distance, the higher the weight should be; therefore, each reused case is assigned the product of the distances of the remaining cases, and this value is then scaled to [0.0, 1.0]. Using these weights, and for each prediction horizon, the CBR calculates a weighted map as the agreement among the cases.

$$w_i = \frac{\prod_{j \neq i} d(c_j,\ \textit{current scenario})}{\sum_{j} \prod_{k \neq j} d(c_k,\ \textit{current scenario})} \tag{1}$$
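A sketch of the revise stage implementing Equation (1) is shown below; it assumes each selected case carries one flood map (a 2D array of mm of water) per prediction horizon, and the data structures are hypothetical.

```python
import numpy as np

def revise(selected, n_horizons):
    """Combine the flood maps of the selected cases into a weighted consensus map.

    `selected` is a list of (distance, case) pairs; each case holds a list
    case["maps"] with one 2D array of mm of water per prediction horizon.
    """
    distances = np.array([d for d, _ in selected], dtype=float)
    # Weight of case i: product of the distances of the *other* cases (Eq. (1)),
    # so smaller distances lead to larger weights; then normalize to [0, 1].
    raw = np.array([np.prod(np.delete(distances, i)) for i in range(len(distances))])
    weights = raw / raw.sum()
    maps = []
    for h in range(n_horizons):
        stacked = np.stack([case["maps"][h] for _, case in selected])
        maps.append(np.tensordot(weights, stacked, axes=1))   # weighted average map
    return maps
```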

#### **4. Experimentation and Discussion**

The INUNDATIO project focuses on two basins: Venero Claro and Nive; however, this study has only been analyzed on the Venero Claro basin, in Spain. Venero Claro is a small torrential hydrographical basin in Sierra de Gredos, Ávila. The Cabrera creek is a torrential-rain current and a tributary of the Alberche river, which belongs to the Tajo basin. It comprises a series of small creeks on the northern side of Sierra del Valle; its vegetation is 45% Pinus Pinaster, but also includes Pinus Sylvestris and Alnus Glutinosa. The basin shows very active fluvial (torrential) hydrodynamics, with relatively short dynamic time periods, between 15 min and 45 min.

During the implementation of the INUNDATIO project, a set of sensors was deployed in the basin to develop the hydrodynamic models; these models, unfortunately, will not be available during the project. As a consequence, for testing this CBR, the research team produced a realistic case base.

To generate the realistic case base, two main criteria were defined in agreement with the experts: soil saturation and weather forecast; these criteria acted as a sort of lookup table to set the initial conditions and the evolution rules of the generated cases. Up to three different possible soil saturation states were considered: non-saturated soil, mid-saturated soil, and highly saturated soil. Furthermore, three possible forecasts were proposed: scarce rain, soft rain, and heavy rain. The evolution of the sensors' measurements and the rain forecast sequences were standardized for each pair of soil saturation and weather forecast. With these assumptions, a realistic case was generated by randomly biasing the standardized sensor functions and the rain forecast. Finally, the maps of the basin for each of the prediction horizons containing the millimeters of water were generated; these maps were carefully chosen to represent plausible states of the basin given the basin variables and the rain forecast.

To evaluate this prototype of the CBR, different queries were performed by manually setting the rain forecast. The outcome of the tool must be such that the maps reflect the variations in mm of water in the basin. Because the tool automatically queries the online databases for the basin sensory information, tests with different basin initial conditions were not available.

The performance of the CBR is briefly outlined in Figures 3 and 4. The former figure depicts the web interface, showing the CBR output map overlaid on the basin map. The user can easily zoom in and out as well as pan over the map. This outcome comes from a scenario where the basin was suffering strong rain and then it stopped raining; the area in the upper-left corner of the studied Venero Claro basin is completely flooded; this area is a low-lying plain.

In Figure 4, the maps for each of the 6 prediction horizons are included, limiting the view to the surroundings of the basin's flooded area. As mentioned, the outcome is generated for a transition from heavy rain to a dry period. Consequently, the expectation is that the basin stabilizes and starts draining water. This is indeed what the outcome shows, although, at the zoom level required, the draining is barely noticeable.

**Figure 3.** CBR results' visualization in the web interface for one prediction horizon, with the common facilities for zoom and pan. Notice that the limits of the study abruptly end at the flooding.

**Figure 4.** CBR outcome for the six prediction horizons, varying from 1 to 6 prediction horizons from left to right and top to bottom (left-top and right-bottom stand for 1 and 6 prediction horizons, respectively).

#### **5. Conclusions**

In this research, a web application that uses a case-based reasoning intelligent service has been proposed for predicting the behavior of basins; the goal is to assist safety experts when flash floods become more likely according to the basin state. This proposal covers different types of basins, from those with fast responses to those with slower dynamics.

Results only include the evolution of the Venero Claro basin in Spain and were obtained using realistic data and cases. This is because the data from the different basins—that is, the relevant cases, the sources of information, and forecasts—are still in development; therefore, work is still pending and, hopefully, a complete test could be available by the end of this year. Nevertheless, the CBR has shown its capacity to predict what the experts expect from a given set of initial conditions, merging the information from the most relevant stored cases.

In future work, the implementation of different extensions, such as multi-criteria sorting using Pareto dominance or the availability of different distance measurements, will be addressed.

**Author Contributions:** All the authors have equally contributed in the development of this research. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research has been funded by the SUDOE Interreg Program under the grant INUNDATIO SOE3/P4/E0929. Furthermore, this research has been funded by the European Union's Horizon 2020 research and innovation programme (project DIH4CPS) under Grant Agreement no. 872548, by the Spanish Ministry of Economics and Industry under grant PID2020-112726RB-I00, by the Spanish Research Agency (AEI, Spain) under grant agreement RED2018-102312-T (IA-Biomed), by CDTI (Centro para el Desarrollo Tecnológico Industrial) under projects CER-20211003 and CER-20211022, by the Missions Science and Innovation project MIG-20211008 (INMERBOT), by Principado de Asturias under grant SV-PA-21-AYUD/2021/50994, and by ICE (Junta de Castilla y León) under project CCTT3/20/BU/0002.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


### *Proceeding Paper* **Inland Areas, Protected Natural Areas and Sustainable Development †**

**Antonio Bertini, Immacolata Caruso and Tiziana Vitolo \***

Consiglio Nazionale delle Ricerche (CNR), Istituto di Studi sul Mediterraneo (ISMed), 80134 Naples, Italy; antonio.bertini@ismed.cnr.it (A.B.); immacolata.caruso@ismed.cnr.it (I.C.)

**\*** Correspondence: tiziana.vitolo@ismed.cnr.it

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** In recent years, policies implemented by many countries have resulted in the deterioration of the Earth's environment rather than the protection of environmental resources. The impact on the environment and territories, particularly those in the Mediterranean basin, is now evident in all its gravity. Even in Italy, the development policies pursued until 2010 favored urban settlements, neglecting the rest of the territory and, in particular, areas far from the traffic flows of people and goods. In the last decade, Italy has also begun to invest in 'inland areas' and protected natural areas that, if well analyzed, organized and managed, can become the promoter of sustainable development for the entire country. Starting from the potential expressed by local communities, landscape resources, cultural, intangible and tangible heritage, the aim is to provide direction and potential scenarios for enhancing economic opportunities, unexpressed and possible drivers of sustainable development.

**Keywords:** Italian inland areas; protected natural areas; sustainable development

**Citation:** Bertini, A.; Caruso, I.; Vitolo, T. Inland Areas, Protected Natural Areas and Sustainable Development. *Eng. Proc.* **2022**, *18*, 20. https://doi.org/10.3390/engproc2022018020

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Over the last 60 years, the Italian development model has focused mainly on urban settlements, neglecting the rest of the territory and, in particular, areas far from the traffic flows of goods and people. It is only recently that non-densely urbanized areas have been taken into consideration, analyzed and observed, and that coherent decisions have been made in response to genuinely concerted programming and planning affecting the entire country. In Italy, areas that are distant both from the most important urban centers and from communication infrastructures, in particular digital ones, have been defined as 'inland areas' since 2013, and have been described in a comprehensive planning document drawn up by the Ministry for Territorial Cohesion and the South, called the '*National Strategy for Inland Areas*' [1,2].

Inland areas are 'fragile' areas from a socio-demographic point of view, due to the ageing process of the population. They are unstable areas from an environmental, physical and eco-systemic point of view, due to widespread hydrogeological instability, loss of biodiversity and a slow but cumulative process of degradation of landscape values. The municipalities concerned number 3101; the population involved amounts to 4,171,667 inhabitants; and the available resources are 210 million euros [3].

To ensure that the analyses of the '*National Strategy for Inland Areas*' had effective consequences, the 'Support Fund for Economic, Artisan and Commercial Activities' was established, for each of the years from 2020 to 2022 [4].

In this context, some municipalities have initiated cooperation processes, involving local communities, for the production of essential services and for the protection and enhancement of environmental and cultural resources. At the same time, other inland areas, many of which are located in the central and northern regions of Italy and a few in southern Italy, have generated good policies and good practices that have made it possible to retain the population, in some cases even recording an increase in the number of settled inhabitants. In central and northern Italy, inland areas are not synonymous with depressed or poor areas: as a result of national public policies such as the expansion of the welfare state and the promotion of local development projects, together with private initiative, some areas have reached an adequate level of per capita material well-being. In the south of Italy, on the other hand, despite significant territorial capital (buildings and settlement systems, cultivable surface area, practical knowledge, landscapes and ecosystems), it has not been possible to create the virtuous circle capable of promoting and enhancing the territory [5].

This study, after outlining a brief examination of the current situation, examines the potential and problems of inland areas, in order to define a framework of guidance for sustainable planning in areas rich in raw materials—i.e., biodiversity and cultural, material and intangible heritage—but scarce in initiatives. The aim is to help create real sustainable development, considering that future generations must be handed down not only heritage but a civilization—our civilization [6].

#### **2. National and International Policies and Strategies for Inland Areas**

A recent report by the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES) notes that the current rate of species extinction is a thousand times higher than the average rate for the entire history of life on Earth. Nevertheless, eight years after the approval of the 2030 Agenda, and after the 10-year World Biodiversity Strategy 2011–2020, the state of planetary biodiversity continues to deteriorate.

The theme of environmental quality has been the center of debate at key world summits (Earth Summit, Climate Convention, Kyoto Protocol, Agenda 21) and in most policies conducted in recent years. In Europe, 2% of agricultural land has been lost since 1970—in Italy, 20%—and an irreversible process has been generated that impacts not only the territories suffering transformation but also the entire national territory, due to the fragmentation that ecosystems are facing [7]. The global context provides reference for policies on the conservation and management of the natural and environmental heritage, derived from a more consolidated and articulated programmatic regulatory apparatus at international and EU level, which has been developing for almost thirty years, and which has undergone powerful acceleration in recent years. Progressively, the subject is better-understood, and studied as a complex reality, formed by natural resources, potentials, risks, sedimentation of culture, economic activities and history [8]. The subject is considered a particularly diverse one, and in this sense it must be studied, because in it the works and transformations of the environment—planned and executed for the needs of society—interact with the rules of nature [9]. On a European scale, protected areas show strong connections with land management because they are key elements of environmental policies, although there is a known lack of systematic information on planning and management experiences.

However, in recent years, policy and planning coordination in Europe has made significant progress, which can be outlined as follows: a growth of protected areas is taking place on the continent; conflicts between parks and local contexts have been alleviated due to a shift from the concept of constrained resources to resources for development; the relationship between planning and management has been consolidated, with the integration of the different dimensions present in protected areas. The quantitative increase in protected areas has implied, and will increasingly imply, the overlapping of parks with areas affected by strong processes of urbanization and productive development. This is leading to an increase in the number of issues relating to protected areas in the overall management of these territories, which can no longer be separated from the reference contexts in which they are located. At the same time, a widespread process of environmental degradation and alteration is ongoing, with global repercussions. These facts generate conflicts between economic, social and cultural interests and the protection of natural resources that are profoundly different than those faced at the birth of nature conservation policies [10]. This has also led to the development of environmental law, and the spread of the need for sustainable growth and territorial identities [11]. However, it is possible to highlight some negative trends that profoundly affect these policies and more general land policies:


A balance must exist between the environmental conditions of protected areas and the needs of tourism development. European indications also point towards an integrated approach that is able to meet the needs of protecting the natural environment, increasing tourist flows and managing the flow of new inhabitants who choose to live in protected areas [12].

#### **3. Protected Natural Areas in Italy**

In Italy, many of the protected natural areas fall within the 'inland areas', and cover 3,100,000 hectares or approximately 10% of the national land surface. There are also 27 protected marine areas and the 'International Marine Mammal Sanctuary', which is shared between Italy, France and the Principality of Monaco, while the two underwater archaeological parks of Baia and Gaiola, both in the Gulf of Naples, constitute two rarities of considerable interest [13].

The resident population within all the Italian parks totals 4,407,741, while the population of the municipalities involved accounts for 23.7% of the national total (around 14 million inhabitants) [14]. At the same time, 68% of Italian municipalities have a territory that contributes in part or in full to the formation of a protected area, whether at national or regional level, and of these, 35% are municipalities with fewer than 5,000 inhabitants.

The system of Italy's protected natural areas is characterized by remarkable biodiversity, which coexists with an extraordinary cultural heritage made up of historical, architectural and artistic landmarks and intangible assets. The 'Law on Protected Natural Areas' No. 394 of 1991, one of the best nature conservation laws in Europe, defined an integrated system of national protected areas and introduced the concepts of valorization and of experimentation with productive activities compatible with the conservation of natural heritage and biodiversity. The law aimed to combine the conservation of nature, landscape, history, anthropology and local culture with economic opportunities and sustainable development possibilities.

However, especially in southern Italy, the policy, strategies and opportunities indicated by the law have only partially been implemented. In southern Italy in particular, most parks struggle to develop, due to a lack of participation by local communities in the pursuit of sustainable development objectives, and a lack of capacity on the part of the park authorities and the park community to act as proactive driving forces in addressing the realities involved [15].

Inland areas and protected natural areas, especially Italian parks when analyzed as a whole, are not just synonymous with depressed or poor areas. In fact, the residents in these areas have a reasonable standard of living commensurate with their needs. Most of the people who live in these somewhat secluded areas do not have much incentive to improve their economic condition, as they find their standard of living balanced against their expectations. For the most part, they are people who prefer to live away from the hectic life of densely populated areas, in an environment where the environmental quality is better, and where human relationships are daily and constant over time, and therefore more satisfying. This condition has not led to a social and economic crisis, as many think, but has relegated these communities to a sort of limbo in which there are few incentives to invest in activities that create growth, change the status quo, and develop and therefore also enhance the rich heritage of these places [16].

#### **4. For Sustainable Tourism in Inland and Protected Areas**

In 1995, at the World Conference on Sustainable Tourism held in Lanzarote (Canary Islands, Spain), the World Charter for Sustainable Tourism was signed, which represents a reference for defining the priorities, objectives and instruments needed to promote future tourism.

The World Tourism Organisation (WTO) has defined tourism activities as sustainable when they "develop in such a way that they remain viable in a tourist area for an unlimited time, do not alter the natural, social and artistic environment and do not hinder or inhibit the development of other social and economic activities".

On 19 October 2007, 'The Agenda for a Sustainable and Competitive European Tourism' marked the official start of planning for such tourism, which aims at economic prosperity, equity and social cohesion, and environmental and cultural protection.

In line with the Lisbon Treaty and the commitments ratified by EU ministers at the Madrid Conference on 14 April 2010, a new framework for EU action on European tourism was defined, which included, among other things, the development of sustainable management of tourist destinations and local enterprises.

In the wake of these premises and shared regulatory and cultural references, many and diverse attempts have been made to revive the fortunes of protected areas, but with scarce and, above all, limited results. In many areas of Italy, investments have been made in tourism activities, a sector that accounts for about 10% of European GDP and is the third sector with the greatest potential for economic growth in the Union, rightly considered a powerful driver of local development and employment in European countries. The factors of tourist interest in Italy have often been treated as separate compartments, whereas they could have had a greater attractive force precisely through their coexistence: lakes and rivers, wine and food routes, religious tourism, ancient and historical centers, archaeological nature sites and trails, craft activities including identity activities, more than 50 UNESCO sites, etc. These are environments rich in diversified tourist offerings, in which to study and propose ideas while avoiding environmental impoverishment [17,18].

In the context of the protected natural areas falling within the inland areas, another widespread, consistent and culturally relevant heritage is constituted by the 'small towns' where, in some cases, only a few hundred families live, and which for the most part maintain a close environmental, morphological and landscape relationship, if no longer a functional one, with the surrounding territory. The number of small towns is substantial, reaching 72% of Italy's municipalities; their surface area covers 55% of the national territory, and 17% of the Italian population lives there (on average, 64 inhabitants per km²). Small towns also have a significant economic, settlement and infrastructural value. The relatively recent affirmation of new policies tending to promote the local dimension, networking the widespread and articulated diversity of experience, culture and identity, and constituting a broadly multifaceted offer, has created a favorable climate for relaunching the issue of small towns in new development contexts, making previously unusual resources, instruments and availability accessible [19].

#### **5. Conclusions for the Sustainable Development of the Protected Inland Areas of Southern Italy**

One of the most agreeable and convincing approaches to dealing more effectively with the problem of a concerted strategy of actions in the inland and protected areas is that of the 'Smart land', i.e., a territorial sphere of diffuse and shared experimental policies which have the objective of increasing the attractiveness of the territory, taking due account of social cohesion, the diffusion of knowledge, accessibility, freedom of movement, the usability of the environment, the quality of the landscape and the daily life of the local communities that live in those places. The aim is to attempt to rebuild a 'middle society' made up of those residents interested in and capable of transforming the premises into real economic, social and cultural practices capable of turning local potential into economic activities [20].

Most local regeneration and development projects remain hetero-directed and top-down, because the involvement of the communities occurs late, thus lacking the direct involvement of local populations, especially for the implementation of specific projects. More successful, and growing more successful still, is the implementation of bottom-up projects, which arise from mechanisms of protection and enhancement of local identities and which constitute a driver of lasting and sustainable development according to the logic of sharing [21].

In this context, in recent years, some administrations of those municipalities suffering from significant depopulation (this phenomenon does not spare the south or even central and northern Italy, concentrating in small and mountainous centers) are offering economic and social incentives, giving away abandoned houses to encourage the return of emigrants, or welcoming new communities to revive agricultural and craft activities. Nevertheless, the conditions of peripheral location persist, not only because of geographical issues but also because of the lack of socio-economic and political connections between metropolitan and urban areas in general and inland areas. After having created the material conditions for a return to the inland areas, the next step, where possible, is to transform the community into a 'project community', a partnership, that is, between those who plan, those who reside, and those who benefit from the services offered [22].

In our opinion, it is in the encounter between the top-down and bottom-up processes that a balance that generates real development opportunities can be achieved. Furthermore, targeted policies and direct funding for the development of essential services are crucial if good practices are to become an effective incentive for communities to remain in inland and protected areas. This process is also contemplated in the *'National Recovery and Resilience Plan'*, which could help to bridge the existing gaps between northern, central and southern Italy, and demonstrate institutions' renewed awareness of marginal areas.

**Author Contributions:** Conceptualization, A.B., I.C. and T.V.; methodology, A.B., I.C. and T.V.; software, T.V.; validation, A.B., I.C. and A.B.; formal analysis, T.V.; investigation, I.C.; resources, A.B.; data curation, A.B.; writing—original draft preparation, T.V; writing—review and editing, A.B.; visualization, T.V.; supervision, T.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Expectation-Maximization Algorithm for Autoregressive Models with Cauchy Innovations †**

**Monika S. Dhull \* and Arun Kumar**

Department of Mathematics, Indian Institute of Technology Ropar, Rupnagar 140001, India; arun.kumar@iitrpr.ac.in


**Abstract:** In this paper, we study autoregressive (AR) models with Cauchy distributed innovations. In AR models, the response variable *yt* depends on previous terms and a stochastic term (the innovation). In the classical version, AR models are based on the normal distribution, which cannot capture extreme values or asymmetric behavior of data. In this work, we consider the AR model with Cauchy innovations, a heavy-tailed distribution. We derive closed forms for the estimates of the parameters of the considered model using the expectation-maximization (EM) algorithm. The efficacy of the estimation procedure is shown on simulated data, and the proposed EM algorithm is compared with the maximum likelihood (ML) estimation method. Moreover, we also discuss the joint characteristic function of the AR(1) model with Cauchy innovations, which can also be used to estimate the parameters of the model via the empirical characteristic function.

**Keywords:** autoregressive model; Cauchy innovations; EM algorithm

**Citation:** Dhull, M.S.; Kumar, A. Expectation-Maximization Algorithm for Autoregressive Models with Cauchy Innovations. *Eng. Proc.* **2022**, *18*, 21. https://doi.org/10.3390/engproc2022018021

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Autoregressive (AR) models with stable and heavy-tailed innovations are of great interest in time series modeling. These distributions can easily assimilate the asymmetry, skewness, and outliers present in time series data. The Cauchy distribution is a special case of the stable distribution with undefined expected value, variance, and higher order moments. The Cauchy distribution and its mixtures have many applications in economics [1], seismology [2], biology [3], and various other fields, but only a few studies have been conducted concerning time series models with Cauchy errors. In [4], the maximum likelihood (ML) estimation of the AR(1) model with Cauchy errors is studied.

The standard estimation techniques for the AR(*p*) model with Cauchy innovations, particularly the Yule–Walker method and conditional least squares method, cannot be used due to the infinite second order moments of the Cauchy distribution. Therefore, it is worthwhile to study and assess the alternate estimation techniques for the AR(*p*) model with Cauchy innovations. In the literature, several estimation techniques have been proposed to estimate the parameters of AR models with infinite variance errors (see e.g., [5–7]).

In this paper, we propose to use the EM algorithm to estimate the parameters of the distribution and the model simultaneously. The EM algorithm is a general iterative procedure for model parameter estimation which alternates between two steps, namely the expectation step (E-step) and the maximization step (M-step) [8]. It is a numerically stable alternative to direct numerical optimization of the likelihood function [9]. We also provide a formula based on the characteristic function (CF) and the empirical characteristic function (ECF) of the Cauchy distribution for AR(*p*) model estimation. The idea of using the ECF for stable ARMA time series models has been discussed in [10].

The remainder of the paper is organized as follows. In Section 2, we present a brief overview of the Cauchy AR(*p*) model, followed by a discussion of estimation techniques, namely the EM algorithm and estimation by CF and ECF. Section 3 checks the efficacy of the estimation procedure on simulated data. We also present the comparative study, where the proposed technique is compared with ML estimation for Cauchy innovations. Section 4 concludes the paper.

#### **2. Cauchy Autoregressive Model**

We consider the univariate stationary AR(*p*) time series $\{Y_t\}$, $t \in \mathbb{N}$, with Cauchy innovations defined as

$$Y_t = \sum_{i=1}^{p} \rho_i Y_{t-i} + \varepsilon_t = \rho^T Y_{t-1} + \varepsilon_t, \tag{1}$$

where $\rho = (\rho_1, \rho_2, \cdots, \rho_p)^T$ is a $p$-dimensional column vector, $Y_{t-1} = (Y_{t-1}, Y_{t-2}, \cdots, Y_{t-p})^T$ is the vector of $p$ lag terms, and $\{\varepsilon_t\}$, $t \in \mathbb{N}$, are i.i.d. innovations distributed as Cauchy($\alpha$, $\gamma$). The pdf of Cauchy($\alpha$, $\gamma$) [11] is

$$f(x; \alpha, \gamma) = \frac{\gamma}{\pi}\left[\frac{1}{\gamma^2 + (x - \alpha)^2}\right], \quad \gamma > 0,\ \alpha \in \mathbb{R},\ x \in \mathbb{R}. \tag{2}$$

The conditional distribution of $Y_t$ given the preceding data $\mathcal{F}_{t-1} = (Y_{t-1}, Y_{t-2}, \cdots, Y_1)^T$ is given by [4]

$$p(Y_t \mid \mathcal{F}_{t-1}) = f(y_t - \rho^T y_{t-1}; \alpha, \gamma) = \frac{\gamma}{\pi}\left[\frac{1}{\gamma^2 + (y_t - \rho^T y_{t-1} - \alpha)^2}\right],$$

where $y_{t-1}$ is the realization of $Y_{t-1}$. In the next subsection, we propose methods to estimate the model parameters $\rho$ and the innovation parameters $\alpha$ and $\gamma$ simultaneously.

#### *2.1. Parameter Estimation Using EM Algorithm*

We estimate the parameters of the AR(*p*) model using an EM algorithm, which maximizes the likelihood function iteratively. Further, we describe the time series $\{Y_t\}$ through its characteristic function (CF) and discuss an estimation method based on the CF and ECF. Recently, the exponential-squared estimator for AR models with heavy-tailed errors was introduced and proven to be $\sqrt{n}$-consistent under some regularity conditions [7]; similarly, the self-weighted least absolute deviation estimation method was also studied for the infinite-variance AR model [12]. The ML estimation of AR models with Cauchy errors, with intercept and with linear trend, has been studied, and the AR coefficient shown to be $n^{3/2}$-consistent under some conditions [4]. For the AR(*p*) model with Cauchy innovations and $n$ samples, the log likelihood is defined as

$$l(\Theta) = (n-p)\log(\gamma) - (n-p)\log(\pi) - \sum\_{t=p+1}^{n} \log(\gamma^2 + (y\_t - \rho^T Y\_{t-1} - a)^2),$$

where $\Theta = (\alpha, \gamma, \rho^T)$.

**Proposition 1.** *Consider the AR(p) time-series model given in Equation* (1) *where the error terms follow Cauchy*($\alpha$, $\gamma$)*. The maximum likelihood estimates of the model parameters using the EM algorithm are as follows:*

$$\hat{\rho}^T = \left(\sum_{t=p+1}^{n} \frac{(y_t - \alpha)\, Y_{t-1}^T}{s_t^{(k)}}\right)\left(\sum_{t=p+1}^{n} \frac{Y_{t-1} Y_{t-1}^T}{s_t^{(k)}}\right)^{-1}; \quad \hat{\alpha} = \frac{\sum_{t=p+1}^{n} \frac{y_t - \rho^T Y_{t-1}}{s_t^{(k)}}}{\sum_{t=p+1}^{n} \frac{1}{s_t^{(k)}}}; \quad \text{and} \quad \hat{\gamma} = \sqrt{\frac{n-p}{2\sum_{t=p+1}^{n} \frac{1}{s_t^{(k)}}}}, \tag{3}$$

*where* $s_t^{(k)} = (y_t - \rho^{T(k)} Y_{t-1} - \alpha^{(k)})^2 + \gamma^{(k)2}$.

**Proof.** Consider the AR(*p*) model

$$y_t = \rho^T Y_{t-1} + \varepsilon_t, \quad t = p+1,\, p+2,\, \ldots,\, n,$$

where $\varepsilon_t$ follows the Cauchy distribution Cauchy($\alpha$, $\gamma$). Let $(\varepsilon_t, V_t)$ for $t = 1, 2, \cdots, n$ denote the complete data for the innovations $\varepsilon$. The observed data $\varepsilon_t$ are assumed to be from Cauchy($\alpha$, $\gamma$), and the unobserved data $V_t$ follow the inverse gamma distribution IG($1/2$, $\gamma^2/2$). A random variable $V \sim$ IG($a$, $b$) has pdf

$$f\_V(v;a,b) = \frac{b^a}{\Gamma(a)} \frac{e^{-b/v}}{v^{a+1}}, \ a>0, \ b>0, \ v>0.$$

We can rewrite $\varepsilon_t$ as $\varepsilon_t = Y_t - \rho^T Y_{t-1}$, for $t = 1, 2, \cdots, n$. The stochastic relation $\varepsilon = \alpha + \sqrt{V}Z$ with $Z \sim N(0, 1)$, i.e., standard normal, and $V \sim$ IG($1/2$, $\gamma^2/2$) is used to generate the Cauchy($\alpha$, $\gamma$) distribution. Then, the conditional distribution is

$$f(\varepsilon = \varepsilon_t \mid V = v_t) = \frac{1}{\sqrt{2\pi v_t}} \exp\left(-\frac{(\varepsilon_t - \alpha)^2}{2 v_t}\right).$$

Now, we need to estimate the unknown parameters $\Theta = (\alpha, \gamma, \rho^T)$. To apply the EM algorithm for estimation, we first find the conditional expectation of the log-likelihood of the complete data $(\varepsilon, V)$ with respect to the conditional distribution of $V$ given $\varepsilon$. As the unobserved data are assumed to be from IG($1/2$, $\gamma^2/2$), the posterior distribution is again an inverse gamma, i.e.,

$$V|\varepsilon = e, \Theta \sim \text{IG}\left(1, \frac{(e-\alpha)^2 + \gamma^2}{2}\right).$$

The following conditional inverse first moment and $\mathbb{E}(\log(V) \mid \varepsilon = e)$ will be used in calculating the conditional expectation of the log-likelihood function:

$$\begin{aligned} \mathbb{E}(V^{-1} \mid \varepsilon = e) &= \frac{2}{(e-\alpha)^2 + \gamma^2}, \\ \mathbb{E}(\log(V) \mid \varepsilon = e) &= \log\!\left((e-\alpha)^2 + \gamma^2\right) - \log 2 + 0.5772\ldots \end{aligned}$$

The complete data likelihood is given by

$$\begin{aligned} L(\Theta) &= \prod\_{t=p+1}^n f(\epsilon\_t, v\_t) = \prod\_{t=p+1}^n f\_{\epsilon|V}(\epsilon\_t|v\_t) f\_V(v\_t) \\ &= \prod\_{t=p+1}^n \frac{\gamma}{2\pi v\_t^2} \exp\left(-\frac{(\epsilon\_t - \alpha)^2 + \gamma^2}{2v\_t}\right). \end{aligned}$$

The log likelihood function will be

$$l(\Theta) = (n-p)\log(\gamma) - (n-p)\log(2\pi) - 2\sum\_{t=p+1}^{n} \log(v\_t) - \sum\_{t=p+1}^{n} \frac{(\varepsilon\_t - a)^2 + \gamma^2}{2v\_t}.$$

Now, we will use the relation $\varepsilon_t = Y_t - \rho^T Y_{t-1}$ in further calculations. In the first step at the $k$th iteration, the E-step of the EM algorithm, we need to compute the expected value of the complete data log likelihood, known as $Q(\Theta \mid \Theta^{(k)})$, which is expressed as

$$\begin{aligned}
Q(\Theta \mid \Theta^{(k)}) &= \mathbb{E}_{V\mid\varepsilon,\Theta^{(k)}}\left[\log f(\varepsilon, V \mid \Theta) \mid \varepsilon_t, \Theta^{(k)}\right] = \mathbb{E}_{V\mid\varepsilon,\Theta^{(k)}}\left[l(\Theta \mid \Theta^{(k)})\right] \\
&= (n-p)\log\gamma - (n-p)\log 2\pi - \sum_{t=p+1}^{n}\mathbb{E}(\log v_t \mid \varepsilon_t, \Theta^{(k)}) - \frac{\gamma^2}{2}\sum_{t=p+1}^{n}\mathbb{E}(v_t^{-1} \mid \varepsilon_t, \Theta^{(k)}) \\
&\quad - \frac{1}{2}\sum_{t=p+1}^{n}(\varepsilon_t - \alpha)^2\,\mathbb{E}(v_t^{-1} \mid \varepsilon_t, \Theta^{(k)}) \\
&= (n-p)\log\gamma - (n-p)\log 2\pi + (n-p)\log 2 - 0.5772\,(n-p) - \sum_{t=p+1}^{n}\log\!\left((\varepsilon_t - \alpha^{(k)})^2 + \gamma^{(k)2}\right) \\
&\quad - \gamma^2\sum_{t=p+1}^{n}\frac{1}{(\varepsilon_t - \alpha^{(k)})^2 + \gamma^{(k)2}} - \sum_{t=p+1}^{n}\frac{(\varepsilon_t - \alpha)^2}{(\varepsilon_t - \alpha^{(k)})^2 + \gamma^{(k)2}} \\
&= (n-p)\log\gamma - (n-p)\log 2\pi + (n-p)\log 2 - 0.5772\,(n-p) - \sum_{t=p+1}^{n}\log s_t^{(k)} - \gamma^2\sum_{t=p+1}^{n}\frac{1}{s_t^{(k)}} - \sum_{t=p+1}^{n}\frac{(\varepsilon_t - \alpha)^2}{s_t^{(k)}},
\end{aligned}$$

where $s_t = (y_t - \rho^T Y_{t-1} - \alpha)^2 + \gamma^2$ and $s_t^{(k)}$ denotes its value at the current iterate $\Theta^{(k)}$. In the next M-step, we estimate the parameters $\alpha$, $\gamma$, and $\rho^T$ by maximizing the $Q$ function using the equations below:

$$\begin{aligned} \frac{\partial Q}{\partial \rho} &= 2\sum_{t=p+1}^{n}\frac{(y_t - \rho^T Y_{t-1} - \alpha)\, Y_{t-1}^T}{s_t^{(k)}}, \\ \frac{\partial Q}{\partial \alpha} &= 2\sum_{t=p+1}^{n}\frac{y_t - \rho^T Y_{t-1} - \alpha}{s_t^{(k)}}, \\ \frac{\partial Q}{\partial \gamma} &= \frac{n-p}{\gamma} - 2\gamma\sum_{t=p+1}^{n}\frac{1}{s_t^{(k)}}. \end{aligned}$$

Solving the above equations at each iteration, we obtain the following closed-form estimates of the parameters at the $(k+1)$th iteration:

$$\hat{\rho}^T = \left(\sum_{t=p+1}^{n} \frac{(y_t - \alpha)\, Y_{t-1}^T}{s_t^{(k)}}\right)\left(\sum_{t=p+1}^{n} \frac{Y_{t-1} Y_{t-1}^T}{s_t^{(k)}}\right)^{-1}; \quad \hat{\alpha} = \frac{\sum_{t=p+1}^{n} \frac{y_t - \rho^T Y_{t-1}}{s_t^{(k)}}}{\sum_{t=p+1}^{n} \frac{1}{s_t^{(k)}}}; \tag{4}$$

$$\hat{\gamma} = \sqrt{\frac{n-p}{2\sum\_{t=p+1}^{n} \frac{1}{s\_t^{(k)}}}},$$

where $s_t^{(k)} = (y_t - \rho^{T(k)} Y_{t-1} - \alpha^{(k)})^2 + \gamma^{(k)2}$.
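
The closed-form updates above translate directly into a short iterative routine. The following NumPy sketch of the EM iteration for an AR(*p*) model with Cauchy innovations is illustrative only: the function name, the starting values, and the simplified (absolute-change) stopping rule are our own choices, not part of the paper.

```python
import numpy as np

def em_cauchy_ar(y, p, alpha0=0.0, gamma0=1.0, tol=1e-4, max_iter=500):
    """EM estimation of an AR(p) model with Cauchy(alpha, gamma) innovations.

    Implements the closed-form M-step updates of Equations (3)-(4): at each
    iteration s_t^(k) is recomputed from the current iterate, then rho, alpha
    and gamma are re-estimated.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Row t of Y_lag holds (y_{t-1}, ..., y_{t-p}) for the response y_cur[t]
    Y_lag = np.column_stack([y[p - j:n - j] for j in range(1, p + 1)])
    y_cur = y[p:]

    rho = np.zeros(p)
    alpha, gamma = float(alpha0), float(gamma0)
    for _ in range(max_iter):
        s = (y_cur - Y_lag @ rho - alpha) ** 2 + gamma ** 2   # s_t^(k)
        w = 1.0 / s

        # M-step (Proposition 1): closed-form updates
        b = ((y_cur - alpha) * w) @ Y_lag            # sum_t (y_t - alpha) Y_{t-1}^T / s_t
        M = (Y_lag * w[:, None]).T @ Y_lag           # sum_t Y_{t-1} Y_{t-1}^T / s_t
        rho_new = np.linalg.solve(M, b)              # rho_hat (M is symmetric)
        alpha_new = ((y_cur - Y_lag @ rho_new) * w).sum() / w.sum()
        gamma_new = np.sqrt((n - p) / (2.0 * w.sum()))

        change = max(abs(alpha_new - alpha), abs(gamma_new - gamma),
                     np.max(np.abs(rho_new - rho)))
        rho, alpha, gamma = rho_new, alpha_new, gamma_new
        if change < tol:
            break
    return rho, alpha, gamma
```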

#### *2.2. Characteristic Function for Estimation*

Thus far, we have considered the conditional distribution of $Y_t$ given the preceding data $\mathcal{F}_{t-1}$. Now, we account for the dependence structure of the time series $\{Y_t\}$ by defining the variables $d_j = (y_j, \cdots, y_{j+p})$ for $j = 1, \cdots, n-p$; each $d_j$ shares $p$ terms with the adjacent variable. The distribution of $\{d_j\}$ will be multivariate Cauchy with dimension $r = p + 1$.

The CF of each $d_j$ is $c(\Theta, s) = \mathbb{E}(\exp(i\, s^T d_j))$ and the ECF is $c_n(s) = \frac{1}{n}\sum_{j=1}^{n-p}\exp(i\, s^T d_j)$, where $s = (s_1, \cdots, s_{p+1})^T$.

To estimate the parameters using CF and ECF, we make sure that the joint CF of the AR(*p*) model has a closed form. In the next result, the closed form expression for the joint CF of the AR(1) model with Cauchy innovations is given.

**Proposition 2.** *The joint CF of stationary AR(*1*) model with Cauchy innovations is*

$$c(s\_1, s\_2; \Theta) = \exp\left[ia(s\_1 + s\_2)\left(\frac{1}{1 - \rho}\right)\right] \times \exp\left[-\gamma \left(|s\_2| + |s\_1 + \rho s\_2| \left(\frac{1}{1 - |\rho|}\right)\right)\right].$$

**Proof.** The stationary AR(1) model $y_t = \rho y_{t-1} + \varepsilon_t$ can be rewritten as

$$y_t = \varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \rho^3\varepsilon_{t-3} + \cdots$$

Note that the $\{\varepsilon_t\}$ are i.i.d. from the Cauchy($\alpha$, $\gamma$) distribution, and the CF of Cauchy($\alpha$, $\gamma$) is $\mathbb{E}(\exp(i\, s\varepsilon_t)) = \exp(i\alpha s - \gamma|s|)$ [11]. Then, the joint CF of $(y_t, y_{t-1})$ is calculated as follows:

$$\begin{aligned}
c(s_1, s_2; \Theta) &= \mathbb{E}[\exp(i s_1 y_{t-1} + i s_2 y_t)] \\
&= \mathbb{E}[\exp(i s_1(\varepsilon_{t-1} + \rho\varepsilon_{t-2} + \rho^2\varepsilon_{t-3} + \cdots) + i s_2(\varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots))] \\
&= \mathbb{E}[\exp(i s_2\varepsilon_t + i(s_1 + \rho s_2)\varepsilon_{t-1} + i\rho(s_1 + \rho s_2)\varepsilon_{t-2} + \cdots)] \\
&= \exp(i\alpha s_2 - \gamma|s_2|)\times\exp(i\alpha(s_1 + \rho s_2) - \gamma|s_1 + \rho s_2|) \\
&\quad\times\exp(i\alpha(\rho s_1 + \rho^2 s_2) - \gamma|\rho s_1 + \rho^2 s_2|)\times\cdots \\
&= \exp\!\left(i\alpha(s_1 + s_2)(1 + \rho + \rho^2 + \rho^3 + \cdots)\right) \\
&\quad\times\exp\!\left(-\gamma\big(|s_2| + |s_1 + \rho s_2| + |\rho|\,|s_1 + \rho s_2| + |\rho|^2|s_1 + \rho s_2| + \cdots\big)\right) \\
&= \exp\!\left(i\alpha(s_1 + s_2)\frac{1}{1-\rho}\right)\times\exp\!\left(-\gamma\big(|s_2| + |s_1 + \rho s_2|(1 + |\rho| + |\rho|^2 + \cdots)\big)\right) \\
&= \exp\!\left[i\alpha(s_1 + s_2)\left(\frac{1}{1-\rho}\right)\right]\times\exp\!\left[-\gamma\left(|s_2| + |s_1 + \rho s_2|\left(\frac{1}{1-|\rho|}\right)\right)\right].
\end{aligned}$$

The joint CF for a higher dimension can be obtained in a similar manner. Now, the model parameters can be estimated by solving the following equation involving the CF and ECF, as defined in [10]:

$$\int \cdots \int w\_{\Theta}(s)(c\_n(s) - c(s; \Theta))ds = 0. \tag{5}$$

where the optimal weight function is

$$w^{\ast}_{\Theta}(s) = \frac{1}{2\pi}\int\cdots\int \exp(-i s^T d_j)\,\frac{\partial \log f(y_{j+p} \mid y_j, \cdots, y_{j+p-1})}{\partial \Theta}\, dy_j \ldots dy_{j+p}. \tag{6}$$

**Remark 1.** *For a stationary AR(l) process* {*Yt*} *with p* = *l, the ECF estimator defined by Equation* (5) *with optimal weight function defined in Equation* (6) *is a conditional ML (CML) estimator and hence asymptotically efficient. The conditional log pdf for Cauchy distribution is:*

$$\log f(y_{j+p} \mid y_j, \cdots, y_{j+p-1}) = \log\gamma - \log\pi - \log\!\left(\gamma^2 + (y_t - \rho^T Y_{t-1} - \alpha)^2\right).$$

*The proof is similar to the proof of Proposition* 2.1 *in [10].*

#### **3. Simulation Study**

In this section, we assess the proposed model and the introduced estimation technique using a simulated data set. We discuss the estimation procedure for the AR(2) model with Cauchy innovations. The AR(2) model defined in (1) is simulated with $\rho_1$ and $\rho_2$ as model parameters. We generate 1000 trajectories, each of size $N = 500$, of Cauchy innovations using the normal variance-mean mixture form $\varepsilon = \alpha + \sqrt{V}Z$ with $Z \sim N(0, 1)$, i.e., standard normal, and $V \sim$ IG($1/2$, $\gamma^2/2$). We then use the following simulation steps to generate the Cauchy innovations:

*step 1*: Generate standard normal variate *Z*;

*step 2*: Generate inverse gamma random variate IG(1/2, *γ*2/2) with *γ* = 2;

*step 3*: Using the relation $\varepsilon = \alpha + \sqrt{V}Z$, we simulate the Cauchy innovations with $\alpha = 1$;

*step 4*: The time series data $y_t$ are generated with model parameters $\rho_1 = 0.5$ and $\rho_2 = 0.3$ (a code sketch of these steps is given below).
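
A minimal NumPy sketch of these simulation steps follows; the random seed and the burn-in length used to reach stationarity are illustrative choices not specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)          # seed chosen only for reproducibility

N, burn_in = 500, 200                        # burn-in length is an illustrative choice
alpha, gamma = 1.0, 2.0                      # Cauchy(alpha, gamma) innovation parameters
rho1, rho2 = 0.5, 0.3                        # AR(2) model parameters

# Steps 1-3: Cauchy innovations via the mixture eps = alpha + sqrt(V) * Z
Z = rng.standard_normal(N + burn_in)                          # step 1: Z ~ N(0, 1)
V = 1.0 / rng.gamma(shape=0.5, scale=2.0 / gamma**2,          # step 2: V ~ IG(1/2, gamma^2/2)
                    size=N + burn_in)
eps = alpha + np.sqrt(V) * Z                                  # step 3

# Step 4: AR(2) recursion y_t = rho1*y_{t-1} + rho2*y_{t-2} + eps_t
y = np.zeros(N + burn_in)
for t in range(2, N + burn_in):
    y[t] = rho1 * y[t - 1] + rho2 * y[t - 2] + eps[t]
y = y[burn_in:]                                               # drop burn-in, keep N points
```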

The exemplary time series data plot and the scatter plot of the innovation terms are shown in Figure 1. We apply the discussed EM algorithm to estimate the model parameters and the distribution parameters. The relative change in the parameters is used to terminate the algorithm. The following stopping criterion is commonly used in the literature:

$$\max \left\{ \left| \frac{a^{(k+1)} - a^{(k)}}{a^{(k)}} \right|, \left| \frac{\gamma^{(k+1)} - \gamma^{(k)}}{\gamma^{(k)}} \right|, \left| \frac{\rho\_1^{(k+1)} - \rho\_1^{(k)}}{\rho\_1^{(k)}} \right|, \left| \frac{\rho\_2^{(k+1)} - \rho\_2^{(k)}}{\rho\_2^{(k)}} \right| \right\} < 10^{-4}. \tag{7}$$

We compare the estimation results for Cauchy($\alpha$, $\gamma$) obtained with the EM algorithm and with maximum likelihood (ML) estimation. The ML estimates are computed using the inbuilt function "mlcauchy" in R, which uses the exponential transform of the location parameter and performs non-linear minimization by a Newton-type algorithm. The estimates of Cauchy($\alpha$, $\gamma$) are compared in the boxplots in Figure 2. From the boxplots, we find that the EM algorithm converges closer to the true values of Cauchy($\alpha$, $\gamma$) than the ML estimation does. A better result might be achieved with the ML method if a different algorithm or inbuilt optimization function were used for estimation.

**Figure 1.** (**a**) The data plot of exemplary time series of length *N* = 500 and (**b**) the scatter plot of the corresponding innovation terms of the AR(2) model with Cauchy innovations. The chosen parameters of the model are *ρ*<sup>1</sup> = 0.5, *ρ*<sup>2</sup> = 0.3, *α* = 1, and *γ* = 2.

**Figure 2.** Boxplots of the estimates of the AR(2) model's parameters with theoretical values (**a**) *α* = 1 and (**b**) *γ* = 2 represented with blue dotted lines. The boxplots are created using 1000 trajectories, each of length 500.

#### **4. Conclusions and Future Scope**

In this work, we derive closed-form estimates for the parameters of an AR model with Cauchy innovations using an EM algorithm. The performance of the proposed algorithm is compared with the ML method using simulated data; the ML estimates are obtained using an inbuilt function in R. Another benefit of the EM algorithm is that it estimates the model parameters and the innovation parameters simultaneously. It is evident from the boxplots that the EM algorithm outperforms the ML method. Further, we discuss another approach, based on the CF, to estimate the parameters of AR models with stable innovations. In the future, we plan to compare the proposed algorithm and the ECF-based estimation method with the existing techniques in [5–7] for AR models with infinite variance. Furthermore, real-life phenomena can be studied using the proposed model and methods.

**Author Contributions:** Conceptualization, A.K. and M.S.D.; Methodology, A.K. and M.S.D.; Writing-Original Draft Preparation, M.S.D. and A.K.; Writing-Review & Editing, A.K. and M.S.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** M.S.D. would like to thank her institute Indian Institute of Technology Ropar, India and the Ministry of Education (MoE), Government of India, for supporting her research work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**



### *Proceeding Paper* **Deep Representation Learning for Cluster-Level Time Series Forecasting †**

**Tsegamlak T. Debella 1,2,***<sup>∗</sup>* **, Bethelhem S. Shawel 1,3 , Maxime Devanne <sup>2</sup> , Jonathan Weber <sup>2</sup> , Dereje H. Woldegebreal 1, Sofie Pollin <sup>3</sup> and Germain Forestier <sup>2</sup>**


**Abstract:** In today's data-driven world, time series forecasting is an intensively investigated temporal data mining technique. In practice, there is a range of forecasting techniques that have been proven to be efficient at capturing different aspects of an input. For instance, classic linear forecasting models such as seasonal autoregressive integrated moving average (S-ARIMA) models are known to capture the trends and seasonality evident in temporal datasets. In contrast, neural-network-based forecasting approaches are known to be best at capturing nonlinearity. Despite such differences, most forecasting techniques inherently assume that models are fitted using a single input. In practice, there are often cases where we cannot deploy forecasting models in this manner. For instance, in most wireless communication traffic forecasting problems, temporal datasets are defined by taking samples from hundreds of base stations. Moreover, the base stations are expected to have spatial correlation due to user mobility, land use, settlement patterns, etc. Thus, in such cases, it is often advised that forecasting should be approached using clusters that group the base stations based on their traffic patterns. However, when this approach is used, the quality of the cluster centroids and the overall cluster formation process is expected to have a significant impact on the performance of forecasting models. In this paper, we show the effectiveness of representation learning for cluster formation and cluster centroid definition, which in turn improves the quality of cluster-level forecasting. We demonstrate this concept using data traffics collected from 729 wireless base stations. In general, based on the experimental results, the representation learning approach outperforms cluster-level forecasting models based on classical clustering techniques such as K-means and dynamic time warping barycenter averaging K-means (DBA K-means).

**Keywords:** clustering; forecasting; representation learning; time series; multitasking

#### **1. Introduction**

Time series forecasting (prediction) is a well-developed temporal data mining technique [1–4]. The theory of time series forecasting often relies on the data having recognizable patterns that can either be captured or learned. In practice, patterns or components within a time series dataset include trends, seasonality, cycles, and irregular (error) components [2,5,6]. In general, given a set of time series denoted by $S = \{X_1, X_2, X_3, \ldots, X_{t-1}\} : X_i \in \mathbb{R}^N$, the forecasting task can be formalized as:

$$
\hat{X}\_{i,t:t+p} = f(X\_{i,t-L:t-1}, Y\_{i,t-L:t-1}) \tag{1}
$$

**Citation:** Debella, T.T.; Shawel, B.S.; Devanne, M.; Weber, J.; Woldegebrel, D.H.; Pollin, S.; Forestier, G. Deep Representation Learning for Cluster-Level Time Series Forecasting. *Eng. Proc.* **2022**, *18*, 22. https:// doi.org/10.3390/engproc2022018022

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

where $\hat{X}_{i,t:t+p} = \{\hat{X}_{i,t}, \hat{X}_{i,t+1}, \ldots, \hat{X}_{i,t+p}\}$ is the forecast for the $i$th series for $p$ forward steps or horizons. Moreover, $X_{i,t-L:t-1} = \{X_{i,t-L}, \ldots, X_{i,t-1}\}$ are past observations over a look-back window $L$. In most practical cases, different forecasting techniques often aim to provide predictions for a range of future time stamps [1,2,5]. However, when this is the case, it is important to carefully address the propagation of errors and the size of the look-back window. In this regard, some forecasting techniques often incorporate exogenous variables that are presumed to add value to (improve) the prediction accuracy [6]. In Equation (1), the possibility of incorporating exogenous variables is indicated using the parameter $Y_{i,t-L:t-1}$. In practice, we can broadly categorize forecasting models based on different factors. For instance, we can categorize them into univariate or multivariate models based on their ability to incorporate either a single or multiple time series. We can also categorize them, based on their mathematical formulation, as linear or nonlinear [1]. In general, on deployment, the underlying data often govern which of the different forecasting models is to be utilized. For instance, if forecasting is performed on a group of input time series that have some degree of correlation, then multivariate techniques are more fitting [6–8]. In contrast, if forecasting is performed on a group of series that show a sense of independence, standard univariate techniques are often considered [9].

In practice, despite the differences among forecasting approaches, we must overcome certain challenges and limitations that are associated with either the underlying data or a forecasting model [1]. For instance, a limited number of training samples often leads to the overfitting problem. In reality, overfitting is a problem in both univariate and multivariate forecasting approaches [2,5,10]. However, in reality, there are also challenges (limitations) that are specific to a given forecasting approach. For example, in some practical cases, univariate forecasting models often fail to capture the dynamic nature of the underlying data [3,11]. This is because, in these approaches, the model's locality with respect to an individual time series will restrict its scalability in the context of other correlated time series [5,7]. For instance, in most wireless communication forecasting problems, temporal datasets are collected from a range of base stations that share a certain geographical area [6,11]. In these cases, deploying univariate forecasting models on an individual base station is often not advised for two reasons [3,11]. Firstly, due to the presence of user mobility, settlement patterns, land use, etc., base stations cannot be treated as isolated entities. Consequently, a given base station is expected to have a piece of inherent hidden information about its neighbors [12,13]. In addition to this, in wireless communication networks, the base stations number hundreds if not thousands. Therefore, the sheer number alone makes it challenging to deploy univariate forecasting models. In order to overcome these and other additional challenges, most univariate time series forecasting tasks make use of pre-processing techniques. For instance, prior to fitting most univariate linear forecasting models, an input series is often processed in the context of identifying seasonal patterns, extracting external explanatory variables, putting a limit on the forecasting horizon, and using a "global" approach, which clusters the time series based on similarity [1,2,5,6,11,14]. Among such pre-processing steps, clustering is often considered to be useful in capturing the spatial correlations and reducing the number of required forecasting models [11,14]. This is because, while clustering, we often group base stations that have similar traffic patterns around a centroid (average). Thus, given an optimal cluster centroid, the patterns observed within cluster members can be generalized. This, in turn, provides a way to embed the information a base station has about its neighbors. Moreover, it has been shown to be possible to further incorporate spatial correlation by defining a cluster correlation matrix [11].

Although pre-processing techniques have proved to be useful, they often induce challenges of their own. In reality, this will be evident if the techniques are not properly configured or if the right technique is not utilized. In this respect, the quality of cluster-level forecasting is often dependent on the separability of clusters, the correlation among cluster members, and the quality of cluster centroids [11,14]. If, for instance, we focus on the cluster centroids (averages), we at times find them being utilized as representatives while fitting univariate or multivariate forecasting models [11]. However, in practice, time series cluster centroids are often significantly affected by temporal distortion, which misaligns patterns (shapes) [15,16]. For instance, in Figure 1, we show a cluster of 50-Hertz sinusoids (sine signals) that have phase differences of 0, $\pi/3$, $\pi/6$, $2\pi/3$, and $\pi$. We also show the cluster centroids (averages) that are estimated using an arithmetic mean (shown in Figure 1a) and soft dynamic time warping barycenter averaging (SDBA) (shown in Figure 1b) [17]. The average estimated using the arithmetic mean is smaller, due to the presence of a phase shift (temporal distortion). In general, in practice, temporal distortion often forces the arithmetic mean to aggregate shapes in a destructive manner [17,18]. Therefore, we often expect a cluster-level forecasting model that is based on an arithmetic mean to be affected by underprediction. In addition to this, the impact of a temporal shift is not limited to the distortion of the cluster centroids. A sub-optimal cluster centroid could also lead to the grouping of members that have no significant correlation (similarity). This in turn could affect the quality of cluster-level predictions that aim at capturing spatial correlations via clustering [11,14].

With these observations in mind, in this paper, we propose to utilize a neural network arrangement for cluster formation and centroid estimation. In this regard, we show the advantage of utilizing representation learning, which aims to identify patterns that are presumed to be useful for cluster formation in the latent space of neural networks [19]. Moreover, we also propose to utilize a neural network architecture that is able to estimate time-domain cluster centroids from latent representations [18]. In reality, the utilization of neural networks for the cluster formation and centroid estimation is expected to be advantageous as a result of at least one factor, i.e., transfer learning [20]. In this regard, neural networks are known to be good at generalizing for a range of unseen datasets. Therefore, well-organized and trained clustering and centroid estimation networks are capable of generating clusters and centroids without the need for costly re-runs, i.e., if additional datasets (base stations) become available.

We have organized the rest of the paper into three additional sections. In Section 2, we give a brief review of different clustering and forecasting techniques. Following this, i.e., in Section 3, we present the methodology used in this study. In Section 4, we present the experimental evaluations. Finally, in Section 5, we present our concluding remarks.

#### **2. Background**

*2.1. A Brief Review of Common Time Series Forecasting Techniques*

On a holistic level, the different time series forecasting methods can be grouped into [1,5,10]:

- knowledge-driven (statistical) approaches, such as linear models of the ARIMA family; and
- data-driven approaches, such as neural-network-based models.

In practice, we find both categories of forecasting approaches deployed in wireless network traffic prediction problems. Forecasting models that are deployed in this application domain could be either one of the two or a hybrid of both approaches [2,9–12]. In general, in this domain, one of the dominant knowledge-driven forecasting approaches is the S-ARIMA forecasting model. The main driving force behind this dominance is the seasonal nature of most mobile traffic data [2,6,9]. For instance, we expect base stations located in residential areas to have a cyclic peak traffic demand in the early mornings, late at night, and at weekends. The S-ARIMA model is found to be capable of modeling such seasonalities by linearly combining the forecasts of autoregressive (AR) and moving average (MA) models [1,5]. However, in practice, increasing (decreasing) trends within datasets have been found to introduce offsets into predicted values. Consequently, S-ARIMA incorporates a difference operation, which is indicated with the keyword I(*d*). Moreover, to account for seasonality, it treats the seasonal and non-seasonal parts of the data independently. In general, assuming a lag operator *L*, an S-ARIMA model is mathematically represented as [1]:

$$\phi_p(L)\,\Phi_P(L^{s_k})\,(1-L)^d\,(1-L^{s_k})^{D}\,(X_t - \mu) = \theta_q(L)\,\Theta_Q(L^{s_k})\,\varepsilon_t \tag{2}$$

where the (*p*; *P*(.)) and (*q*; *Q*(.)) orders of polynomials are used for AR(*φ*; Φ) and MA(*θ*; Θ) coefficients, i.e., for the non-seasonal and seasonal parts. The parameters d and D are also used to represent the differencing operation performed on the non-seasonal and seasonal parts of the data. Moreover, *Xt*,*εt*, and *μ* are the value of a series at *t*, its residual term, and a constant.
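
For concreteness, a seasonal ARIMA of this form can be fitted with standard tooling; the sketch below uses the SARIMAX implementation from the Python statsmodels library on a synthetic hourly series with a daily (24 h) season. The series, the (2, 1, 2) x (1, 0, 1)_24 orders, and the forecast horizon are placeholder choices for illustration, not the models selected later in Section 4 (which were fitted in R).

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical hourly traffic centroid; replace with an estimated cluster centroid.
rng = np.random.default_rng(1)
hours = np.arange(24 * 90)                                    # ~3 months of hourly samples
centroid = 5 + 2 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.3, hours.size)

# S-ARIMA(p, d, q) x (P, D, Q)_s as in Equation (2), with placeholder orders.
model = SARIMAX(centroid, order=(2, 1, 2), seasonal_order=(1, 0, 1, 24))
fitted = model.fit(disp=False)

forecast = fitted.forecast(steps=24)                          # one day ahead
print(fitted.aic, forecast[:5])
```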

In practice, there are also cases where the underlying data could have patterns that cannot be captured with linear models [8,11]. When this is the case, researchers often propose deploying neural networks that are capable of performing nonlinear transformations. In this regard, neural networks that are based on the long short-term memory (LSTM) method are often found to be efficient [2,10]. This is because these networks are capable by design of capturing the dependencies between time stamps, which is a useful characteristic in time series forecasting. However, in some cases, combining S-ARIMA and neural networks has been proposed, to better capture the seasonal, linear, and nonlinear aspects of an underlying series [8,11]. We now conclude this section and proceed to the discussion of time series clustering techniques.

#### *2.2. A Brief Review of Time Series Clustering Techniques*

Like time series forecasting, time series clustering is also an intensively investigated temporal data mining technique [15,21,22]. In general, we can broadly categorize time series clustering approaches as either distance- or feature-based. In distance-based approaches, clustering techniques utilize distance metrics either to group series around their centroids or to group them in a hierarchical manner [22,23]. However, in practice, distance-based approaches are often expected to incorporate some type of temporal alignment technique, to overcome the impact of temporal distortion. When this is the case, the most frequently proposed alignment technique is dynamic time warping (DTW) [24]. However, the incorporation of DTW often increases the computational complexity of the clustering process [16]. With this understanding, in recent years, researchers have proposed performing clustering by utilizing the latent space embedding of neural networks [19,21]. For instance, in [19], the authors proposed deep embedding clustering (DEC), which uses a denoising autoencoder to extract latent features from the input series. The latent features are then grouped into *K* clusters by first computing the soft cluster assignment given in (3).

$$q_{i,j} = \frac{\left(1 + \frac{\|Z_i - \mu_j\|^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}{\sum_{j'=1}^{K}\left(1 + \frac{\|Z_i - \mu_{j'}\|^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}} \tag{3}$$

In (3), $Z_i \in \mathbb{R}^{\tau}$ is the latent embedding corresponding to a time series $X_i \in \mathbb{R}^{N}$, where $\tau < N$. Moreover, $\mu_j \in \mathbb{R}^{\tau}$ is the arithmetic mean (centroid) of the latent embeddings corresponding to cluster $j$. In reality, the soft cluster assignment computes cluster labels based on the likelihood of the latent features under the Student t-distribution. In order to make the soft assignments meaningful, the authors proposed to define the auxiliary distribution $p_{i,j}$ given in (4). Finally, the network is forced to minimize the Kullback–Leibler (KL) divergence between its soft assignments and the auxiliary distribution (4).

$$L = KL(P\,\|\,Q) = \sum_i\sum_j p_{i,j}\log\frac{p_{i,j}}{q_{i,j}}, \quad\text{where } p_{i,j} = \frac{q_{i,j}^2 / \sum_i q_{i,j}}{\sum_{j'=1}^{K}\left(q_{i,j'}^2 / \sum_i q_{i,j'}\right)} \tag{4}$$
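
As a minimal sketch of the DEC objective, the following NumPy functions compute the Student-t soft assignments of Equation (3), the sharpened auxiliary distribution of Equation (4), and the resulting KL loss. The degrees of freedom α = 1 and the toy array shapes are the usual DEC defaults and are assumed here rather than taken from the paper.

```python
import numpy as np

def dec_soft_assignment(Z, centroids, alpha=1.0):
    """Student-t soft assignment q_{i,j} of latent points Z (n x tau)
    to cluster centroids (K x tau), as in Equation (3)."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def dec_target_distribution(q):
    """Auxiliary distribution p_{i,j} of Equation (4): squares the soft assignments,
    divides by the soft cluster frequencies, and renormalizes per sample."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def dec_kl_loss(p, q, eps=1e-12):
    """KL(P || Q) clustering loss."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

# Toy usage with random embeddings and centroids (placeholders)
rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 4))
centroids = rng.normal(size=(3, 4))
q = dec_soft_assignment(Z, centroids)
p = dec_target_distribution(q)
print(dec_kl_loss(p, q))
```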

In this paper, we propose to cluster base stations using the DEC setup. However, one minor inconvenience is that the DEC setup cannot generate time-domain cluster centroids. We propose to overcome this limitation by utilizing a multitasking autoencoder, which is set to perform multi-class classification and reconstruction. This configuration was proposed in [18], in order to estimate the averages of multi-class temporal datasets from their latent embedding. With this in mind, we will next present the methodology used.

#### **3. Methodology**

#### *3.1. Proposed Network Architecture*

In this study, we utilized the multitasking architecture shown in Figure 2, in order to estimate time-domain cluster centroids. Moreover, we also used the architecture shown at the encoder for deep embedding clustering under the DEC arrangement. In general, the multitasking setup optimizes for the reconstruction and multi-class classification losses given in (5), where $X_i$ and $\hat{X}_i$ are the input and reconstructed time series in $\mathbb{R}^N$. Moreover, $p_{i,j}$ are the softmax activation values (the likelihood) of a series $X_i$ belonging to category Cat [18]. In terms of layer arrangements, the multitasking network is constructed from transposed and normal convolutional, max-pooling, flattening, and dense layers [25].

$$L_{\mathrm{Multi}}\!\left(X_i, \hat{X}_i, \mathrm{Cat}, p\right) = \frac{1}{N}\sum_{i=1}^{N}\left\|X_i - \hat{X}_i\right\|_{2} - \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}\mathrm{Cat}_{i,j}\ln p_{i,j} \tag{5}$$
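
A compact sketch of the combined objective in Equation (5), written framework-agnostically in NumPy; the variable names, shapes, and the toy check at the end are illustrative.

```python
import numpy as np

def multitask_loss(X, X_hat, cat_onehot, p_softmax, eps=1e-12):
    """Reconstruction + classification loss of Equation (5).

    X, X_hat    : (N, T) input and reconstructed series
    cat_onehot  : (N, C) one-hot cluster/category labels
    p_softmax   : (N, C) softmax outputs of the classifier head
    """
    recon = np.mean(np.linalg.norm(X - X_hat, axis=1))                     # L2 reconstruction term
    xent = -np.mean(np.sum(cat_onehot * np.log(p_softmax + eps), axis=1))  # cross-entropy term
    return recon + xent

# Toy check with random placeholders
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2688))
X_hat = X + rng.normal(0, 0.1, X.shape)
cat = np.eye(5)[rng.integers(0, 5, 8)]            # one-hot labels for K = 5 clusters
p = np.full((8, 5), 0.2)                          # uniform softmax outputs
print(multitask_loss(X, X_hat, cat, p))
```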

**Figure 2.** Neural network architecture utilized for cluster centroid estimation.

In reality, the layer arrangements are motivated by the layer configurations observed in the Visual Geometry Group-16 (VGG16) architecture [26]. Overall, we configured the convolutional and max-pooling layers with a kernel size of 3. Moreover, we configured the transposed convolutions to have a stride of 2. In contrast, the non-transposed convolutional layer and the encoder's last max-pooling layer had a stride of 1. Furthermore, the number of neurons in the encoder's dense layer was set to 2688/4. Moreover, we configured the dense layers of the three classifiers to have 0.9 × 2688/8, 0.8 × 2688/16, and *K* neurons, where *K* is the number of clusters. Finally, we set the number of neurons for the decoder's dense layer to 2688. In terms of layer activation, we utilized a rectified linear unit (ReLU) activation function on most of the proposed neural network's layers. However, for the first encoder's convolutional layer and the last decoder's dense layer, we utilized a linear activation function. Finally, we set the classifier's last dense layer to use a softmax activation function [20,25].

#### *3.2. Datasets*

The datasets used were collected from 729 base stations providing wireless data services to regions located within Addis Ababa, Ethiopia. In order to define the temporal datasets, 24 h of data traffic measurements were taken for four consecutive months, i.e., from September 2019 to the end of December 2019. Hence, we obtained 729 temporal datasets that were 2688 time stamps long.

#### *3.3. Experimental Setup*

In order to evaluate the proposed approach, we first converted the datasets from terabytes to gigabytes by dividing by 1024. We then took the encoder and decoder portion of the multitasking setup and trained it for a reconstruction loss, i.e., using the first part of (5). Following this, we took the encoder portion of the trained autoencoder and trained it for 1500 epochs using the objective function given in (4). We then used the network to predict cluster labels for the latent embedding of the input datasets. Next, we used the labeled series to train the full multitasking setup using (5). This training was performed in order to generate time-domain centroids (averages) for the clustered latent space representations. After training, we estimated the time-domain cluster centroids using the decoder portion of the multitasking autoencoder and by taking the arithmetic mean of the latent space representations. We then took the estimated centroids and fitted a D-SARIMA model, which was implemented in R [27]. The model fitting was performed using a segment of the estimated cluster centroids, i.e., a segment that corresponded to 3 1/4 months of traffic measurements. However, in addition to segmenting the centroids, we also identified the most strongly correlated cluster centroids. We used these centroids as exogenous variables of one another while fitting the D-SARIMA model. This, in turn, becomes useful for capturing the spatial correlation among base stations that is evident due to land use.

Finally, we used the D-SARIMA models that were fitted to the cluster centroids to generate 1 1/2 weeks of predictions for the individual cluster members. However, in order to generate the predictions, we substituted the centroid segments used for model fitting with the corresponding segments of the individual cluster members. In this way, it is possible to test the representativeness of the coefficients learned using the segments of the cluster centroids. Finally, we assessed the quality of the predictions using the root mean square error (RMSE) and mean absolute error (MAE), as given in (6). As a comparison, we performed the same evaluations using the TSLearner implementation of K-means and its variant, DBA K-means [16,23,28].

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\left(y_{t+i} - \hat{y}_{t+i}\right)^2}, \quad MAE = \frac{1}{N}\sum_{i=0}^{N-1}\left|y_{t+i} - \hat{y}_{t+i}\right| \tag{6}$$

#### **4. Experimental Results and Discussion**

Prior to any model training or fitting, we first assessed the within-cluster inertia of the traffic datasets, i.e., we determined the optimal number of clusters (*K*). In this regard, we first conducted a basic K-means clustering for different values of *K*. We then observed the average within-group squared sum (WGSS) for different values of *K*. From this observation, we found that five clusters sufficiently minimized the within-cluster inertia. Next, we decomposed the datasets into their trend, seasonal, and residue components. Moreover, we conducted autocorrelation (AC) and partial autocorrelation (PAC) analyses of the datasets in order to determine the period of the seasonalities. Figure 3a,b show that the datasets have seasonal and trend components. Moreover, Figure 3c,d show that there are two seasons, i.e., a daily (24 h) season and a weekly (168 h) season. Therefore, we decided to deploy a double seasonal ARIMA (D-SARIMA) model rather than an S-ARIMA model.
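
As a sketch of this elbow check, assuming the 729 series are stacked in a (729 × 2688) array loaded from a placeholder file; the candidate range of *K* and the scikit-learn KMeans settings are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# traffic: hypothetical (729, 2688) array of hourly traffic series, one row per base station
traffic = np.load("traffic_gb.npy")           # placeholder path, not from the paper

wgss = []
for k in range(2, 11):                        # candidate numbers of clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(traffic)
    wgss.append(km.inertia_)                  # within-group squared sum for this K

# Inspect the curve for the "elbow"; the paper reports K = 5 for this dataset.
for k, w in zip(range(2, 11), wgss):
    print(k, w)
```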

**Figure 3.** Demonstration of the seasonality of the data traffic. (**a**) Daily seasonality. (**b**) Weekly seasonality. (**c**) Autocorrelation. (**d**) Partial autocorrelation.

We then used R's auto.sarima package to determine the optimal model parameters for D-SARIMA {*S*1(*P*, *Q*, *D*), *S*2(*P*, *Q*, *D*), and (*p*, *q*, *d*)}. In this regard, we found that D-SARIMA {(2, 1, 2), (2, 0, 0)24, (2, 0, 0)168} gave a better validation error. Therefore, we first clustered the base stations into five groups using the DEC arrangement. We then plotted the clusters on the map of Addis Ababa in order to visually assess whether the clusters had a geographical meaning, as shown in Figure 4.

**Figure 4.** Geographical location of clusters corresponding to base stations within Addis Ababa.

In general, the clustering algorithm identified geographical locations that correlated with the offered traffic demand. For instance, cluster 1 corresponded to the areas locally known as *"Megenagna"* and *"Bole"*. These locations are known to accommodate large entertainment facilities and the country's largest airport. Additionally, the locations corresponding to cluster 2 accommodate governmental and non-governmental institutions, universities, residential areas, and embassies. Furthermore, cluster 0 and cluster 3 corresponded to mixed-use areas that are densely populated, such as *"Cherkos"* and *"Autobis Tera"*. Finally, cluster 4 was a purely residential area that is sparsely populated. Overall, the DEC arrangement identified geographical locations that correlated with the amplitudes of the centroids. The alternative clustering also identified similar patterns. However, we observed minor distortions in the centroids of the DBA K-means clustering. We associated these distortions with the sensitivity of DBA to the offset (trend) that is evident in the datasets. Figure 5 demonstrates DBA's response to the offsets, which are manifested as sharp spikes.

**Figure 5.** Clusters and cluster centroids estimated by DBA K-means.

We next compared the performance of the D-SARIMA models fitted on the cluster centroids estimated using the different clustering techniques. In this regard, we made two types of predictions. First, we predicted cluster members using D-SARIMA models that were fitted on the cluster centroids, as discussed in the Experimental Setup section. In Table 1, these predictions are differentiated using the keyword CS. In addition to this, as a benchmark, we made similar predictions using D-SARIMA models that were fitted on the individual cluster members. In Table 1, these predictions are differentiated using the keyword BS.

We can interpret the results shown in Table 1 from two different angles. First, we can note that whatever the utilized clustering technique, cluster-level forecasting approaches better captured the dynamics of the underlying data. This is evident, because we have incorporated the spatial information by using highly correlated cluster centroids as exogenous variables. In addition to this, we can also note that the representation learning approach had the lowest aggregate prediction error. This, in turn, implies that the approach is able to overcome the impact of temporal distortion either on cluster formation or on centroid estimation. We conclude this section by presenting the forecasts generated by the three approaches for one of the base stations. For the forecasts shown in Figure 6, the D-SARIMA based on representation learning achieved RMSE and MAE values of 0.758 and 0.596. In contrast, the DBA K-means and basic K-means approaches achieved RMSE and MAE values of 0.802 and 0.617, and 0.669 and 0.847.


**Table 1.** Comparison of cluster-level and data-level forecasting.

**Figure 6.** Example forecasts generated using K-means, DBA K-means, and DEC clustering multitasking autoencoder centroid estimation. (**a**) DBA-K-means-based forecasts. (**b**) K-means-based forecasts. (**c**) DEC and multitasking autoencoder based forecasts.

#### **5. Conclusions**

In this paper, we argued that the effect of temporal distortion on cluster formation and cluster centroid estimation cannot be ignored. With this in mind, we proposed a representation-learning-based clustering and cluster centroid estimation approach for cluster-level forecasting. We showed that this approach has the ability to better capture the spatial correlation evident among base stations within wireless communication networks. In general, to validate our argument we only utilized a basic linear forecasting model. However, in reality, the potential of the proposal is not limited to this. In our future work, we aim to assess the improvement in the quality of forecasts using either neural-network-based forecasting techniques or a hybrid of linear and neural-network-based forecasting approaches.

**Author Contributions:** Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, and Data Curation: T.T.D. and B.S.S.; Original Draft Preparation: T.T.D.; Supervision, Review, and Editing: M.D., J.W., D.H.W., S.P. and G.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was conducted under the Ethio-France PhD. program.

**Data Availability Statement:** The datasets utilized for the experimental evaluation were obtained from Ethio Telecom. The authors can make the datasets available upon request and with the consent of the network operator.

**Acknowledgments:** The authors would like to thank the French Embassy for Ethiopia and the African Union and the former Ethiopian Ministry of Science and Higher Education (MOSHE) for supporting the PhD program under which this research was conducted.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Autoencoders for Anomaly Detection in an Industrial Multivariate Time Series Dataset †**

**Theodoros Tziolas 1, Konstantinos Papageorgiou 1,\* , Theodosios Theodosiou <sup>1</sup> , Elpiniki Papageorgiou <sup>1</sup> , Theofilos Mastos <sup>2</sup> and Angelos Papadopoulos <sup>2</sup>**


**Abstract:** In smart manufacturing, the automation of anomaly detection is essential for increasing productivity. Timeseries data from production processes are often complex sequences and their assessment involves many variables. Thus, anomaly detection with deep learning approaches is considered as an efficient and effective methodology. In this work, anomaly detection with deep autoencoders is examined. Three autoencoders are employed to analyze an industrial dataset and their performance is assessed. Autoencoders based on long short-term memory and convolutional neural networks appear to be the most promising.

**Keywords:** autoencoders; deep learning; LSTM; 1DCNN; anomaly detection; elevator industry

#### **1. Introduction**

The collection and processing of timeseries data in industrial procedures are essential tasks in smart manufacturing. Exploitation of these data enables data holders to engage in complex strategies and processes, such as process optimization and predictive maintenance, within the context of Industry 4.0. A key asset towards zero-defect manufacturing (ZDM) is timeseries anomaly detection (AD), which can reveal misconfigurations in manufacturing lines and imminent faults. Manual AD requires expert technicians to monitor sensor signals from manufacturing lines, identify faults in real-time, and trigger proper actions. As the volume of production and data increases, manual operations become ineffective. This task becomes more complicated if multiple measurements need to be assessed simultaneously and in combination. To alleviate such issues, artificial intelligence (AI) methods are considered.

The first step in AD in a manufacturing line is to determine what an anomaly is. Clearly, this is a case-dependent issue and makes AD in industrial timeseries an extremely wide area; thus, this work focuses on a specific case study using data from manufacturing and testing elevators. The investigated production line uses sensors to capture real-time data regarding hydraulic pressure, elevator velocity and the noise produced. Actual data for a two-year period (2019–2021) were provided by KLEEMANN (KLEE), one of the most important lift manufacturers in the European and global markets. According to KLEE expert technicians, the signal attributes that reveal anomalies in the production line are the duration, the magnitude and the direction of the acquired curves.

Conventional AD methods include clustering-based and statistical-based methods [1,2]; the k-NN [3] and the K-means [4] methods are probably the most popular clustering methods, while the autoregressive-moving-average models are typical statistical choices [5]. However, conventional methods demonstrate poor performance and face challenges such as a low anomaly recall rate, noise-resilient anomaly detection, and difficulty in dealing with complex anomalies and with high-dimensional data, especially in the case of interdependencies [2], etc. Many of these challenges are addressed by deep learning (DL).

**Citation:** Tziolas, T.; Papageorgiou, K.; Theodosiou, T.; Papageorgiou, E.; Mastos, T.; Papadopoulos, A. Autoencoders for Anomaly Detection in an Industrial Multivariate Time Series Dataset. *Eng. Proc.* **2022**, *18*, 23. https://doi.org/10.3390/engproc 2022018023

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 22 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


Common DL approaches include prediction-based models [6]. These assume a strong correlation between neighboring values and employ past values to predict the current and future ones. DL methods are usually based on artificial neural networks (ANN) and their variants to perform AD. In timeseries processing, recurrent neural networks (RNN) and convolutional neural networks (CNN) are the most efficient architectures, as they were designed for data sequences. Stacked layers of long short-term memory (LSTM) RNNs were employed in [7] for AD in engine datasets. Similarly, an LSTM predictor was used in [8] for a system of two industrial robotic manipulators using simulated data. A Bayesian neural network based on dropout and CNN was proposed in [9] for an industrial dataset to process pressure profiles. Despite their success, predictive models still have limitations in actual industrial environments due to the complexity involved in the production process [10]. Furthermore, defective products and anomalies in the signals are rather rare, making it difficult to train the AI models.

Autoencoders (AE) were proposed to overcome these shortcomings. AEs are composed of two different networks in series: an encoder that compresses the input into a lower-dimensional space and a decoder that attempts to reconstruct the input from the compressed representation [11]. A key feature of AEs is that they can be trained using only normal data, thus overcoming the lack of anomalies in the data of an actual manufacturing line. An LSTM-based AE was proposed in [10] for application on engine datasets. Both the encoder and the decoder consisted of a single layer. This architecture could detect anomalies even when the values in a sequence change very frequently. A deeper LSTM-based AE with eight hidden layers and five LSTM units in each layer was proposed in [12]. The performance of this model was confirmed for the detection of rare sound events. CNN-based architectures are also found in the literature as an alternative to the LSTM-based ones. A more complex architecture was proposed in [13] to learn features from complex vibration signals and perform gearbox fault diagnosis; this architecture employed a residual connection and a deeper 1DCNN-based AE. It is evident that AD with AEs is a promising methodology that can capture underlying correlations that often exist in industrial multivariate datasets.

For a similar case study of AD in timeseries produced by elevators, a supervised learning approach was proposed in [14]. In that work, a multi-head CNN-RNN was trained to classify the timeseries as normal or anomalous. Even though the operation of the elevator was monitored by 20 sensors, due to the complexity of labeling all possible scenarios from real data, expert knowledge was used to generate a simulated dataset of 20 variables. In this approach, the model processes each variable independently. The model showed good performance, with a trade-off of long training times.

This work builds on successful AE-based AD practices and investigates the performance of different AE-based architectures for AD in datasets acquired from actual elevator production lines. To the best of our knowledge, there are no similar implementations of AE-based AD. The contributions of this work include: (i) the assessment of three different AE-based architectures, namely ANN-, LSTM- and CNN-based AEs, for the analysis of the provided industrial dataset, (ii) the capture of underlying correlations, possibly missed if each signal were considered independently from the others, (iii) the demonstration of a simple, yet effective, methodology for inducing realistic anomalies in timeseries data, which is also applicable to optimized production lines.

#### **2. Theoretical Background**

#### *2.1. Long Short-Term Memory Recurrent Neural Networks*

Traditional ANNs treat inputs and outputs as individual values. RNNs account for information from previous inputs as well and, therefore, can efficiently treat sequences of data. While standard RNNs apply a pointwise nonlinearity to the affine transformation of inputs and recurrent units, LSTMs use "gates" to control what information is used to update the cell state [7]. LSTM units (see Figure 1) are composed of a cell $c_t$, an input gate $i_t$, an output gate $o_t$ and a forget gate $f_t$.

**Figure 1.** LSTM unit.

The equations for the forward pass of an LSTM are:

$$f_t = \sigma\!\left(\mathbf{W}_f \cdot \mathbf{x}_t + \mathbf{U}_f \cdot \mathbf{h}_{t-1} + \mathbf{b}_f\right),\tag{1}$$

$$\mathbf{i}\_t = \sigma(\mathbf{W}\_i \cdot \mathbf{x}\_t + \mathbf{U}\_i \cdot \mathbf{h}\_{t-1} + \mathbf{b}\_i),\tag{2}$$

$$o_t = \sigma\!\left(\mathbf{W}_o \cdot \mathbf{x}_t + \mathbf{U}_o \cdot \mathbf{h}_{t-1} + \mathbf{b}_o\right),\tag{3}$$

$$c\_t = f\_t c\_{t-1} + i\_t \tanh(\mathbf{W}\_c \cdot \mathbf{x}\_t + \mathbf{U}\_c \cdot \mathbf{h}\_{t-1} + \mathbf{b}\_c),\tag{4}$$

$$h\_t = o\_t \tanh(c\_t), \tag{5}$$

where *σ*,*W*, *U*, *b*, respectively denote the activation function (sigmoid in this work), the input weights, the recurrent weights, and the biases.
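
As a concrete illustration of Equations (1)–(5), the following minimal NumPy sketch implements a single LSTM forward step; the dictionary-based weight layout and variable names are illustrative only and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM forward step following Equations (1)-(5).
    W, U, b are dicts holding the input weights, recurrent weights and biases
    for the forget (f), input (i), output (o) and cell (c) components."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate,  Eq. (2)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate, Eq. (3)
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # cell state, Eq. (4)
    h_t = o_t * np.tanh(c_t)                                  # hidden state, Eq. (5)
    return h_t, c_t
```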

#### *2.2. One-Dimensional Convolutional Neural Networks*

CNNs are widely used to perform feature extraction and selection from the input in an unsupervised way. A CNN consists of convolutional layers to extract features and pooling layers to perform down-sampling. 1DCNNs are a special class of CNNs targeting timeseries data. For an input vector **x** = {*x*<sub>0</sub>, *x*<sub>1</sub>, . . . , *x*<sub>*n*−1</sub>}, *n* ∈ N<sup>∗</sup>, the output of the convolutional operation (cross-correlation) with a kernel **k** = {*k*<sub>0</sub>, *k*<sub>1</sub>, . . . , *k*<sub>*m*−1</sub>}, *m* ∈ N<sup>∗</sup>, *m* < *n*, at a timestep *t* is defined as:

$$y(t) = (\mathbf{x} \* k)(t) = \sum\_{i=0}^{m-1} \mathbf{x}(t + i + (t \* s))k(i). \tag{6}$$

In Equation (6), *s* ∈ N is the stride, i.e., the number of timesteps omitted in the cross-correlation operation; it resembles a sliding window over time. Zero-padding is typically employed to avoid the shrinking of the input at the boundaries. The output typically passes through a non-linear activation function like sigmoid or ReLU [15]. The output of a max-pooling operation with pooling size *p*, *p* ∈ N<sup>∗</sup>, at a timestep *t* can be defined as:

$$y(t) = \max\_{i=0,\ldots,p-1} \mathbf{x}(t+i+(t\*s)).\tag{7}$$
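
The following NumPy sketch illustrates the strided cross-correlation and max-pooling operations described above; it implements the usual "valid" strided form, and the example values in the comments are purely illustrative.

```python
import numpy as np

def conv1d(x, k, stride=1):
    """Valid 1D cross-correlation of signal x with kernel k, using the given stride."""
    n, m = len(x), len(k)
    return np.array([np.dot(x[t:t + m], k) for t in range(0, n - m + 1, stride)])

def max_pool1d(x, pool_size, stride=None):
    """1D max pooling with the given pooling size (stride defaults to pool_size)."""
    stride = stride or pool_size
    n = len(x)
    return np.array([x[t:t + pool_size].max() for t in range(0, n - pool_size + 1, stride)])

# Example: x = np.array([1., 2., 3., 4., 5.]), k = np.array([1., 0., -1.])
# conv1d(x, k)      -> [-2., -2., -2.]
# max_pool1d(x, 2)  -> [ 2.,  4.]
```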

#### *2.3. Anomaly Detection with Autoencoders*

AEs attempt to reconstruct their input signal. If an AE is trained with optimal inputs only, its parameters will be tuned to reconstruct optimal signals. Thus, when fed with timeseries that include anomalies, it is expected to fail at reconstruction; this failure manifests as high values of the loss function, quantified by typical error metrics such as the Mean Square Error (MSE) and the Mean Absolute Error (MAE).

Given a dataset of normal samples *X* ∈ R<sup>*t*×*m*</sup>, with *t* ∈ N<sup>∗</sup> the number of time steps and *m* ∈ N<sup>∗</sup> the number of measurements, the AE uses an encoding function *E* : *X* → *Z* to compress its input to the latent space *Z*, and a decoding function *D* : *Z* → *X*ˆ to recover samples from the latent space. Ideally, the original samples *X* and the reconstructed samples *X*ˆ, i.e., the output of the AE, should be identical. However, in practical cases suitable *E*, *D* functions are pursued such that the Loss(*X*, *X*ˆ) becomes minimal; the MSE or MAE metrics are typically employed as loss functions. If a dataset *A* ∈ R<sup>*t*×*m*</sup> containing synthetic samples (with artificial anomalies in the data) is fed into a trained AE, it is expected that

$$\text{Loss}(\mathbf{x}, \mathbf{\hat{x}}) < \text{Loss}(\mathbf{a}, \mathbf{\hat{a}}), \ \forall \mathbf{x} \in \mathbf{X}, \ \mathbf{\hat{x}} \in \mathbf{\hat{X}}, \ \mathbf{a} \in \mathbf{A}, \ \mathbf{\hat{a}} \in \mathbf{\hat{A}}.\tag{8}$$

Each sample is classified as 'Normal' or 'Anomalous' based on the values of the loss function. However, Equation (8) poses a strong condition that is difficult to satisfy for all samples, therefore, a soft margin threshold is employed to quantify the AD capabilities of the network.

$$\mathcal{L}(A) = \begin{cases} \text{Normal} & \text{Loss}(\mathbf{A}, \hat{\mathbf{A}}) < \mathbf{e}\_{th} \\ \text{Anomalous} & \text{Loss}(\mathbf{A}, \hat{\mathbf{A}}) \ge \mathbf{e}\_{th} \end{cases} \tag{9}$$

where *eth* is a predetermined threshold value. The threshold is selected to maximize the usual classification metrics, namely the Accuracy (*A*), the Precision (*P*) and the Recall (*R*) [5].
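
A minimal sketch of this thresholding step is shown below, assuming a trained Keras autoencoder `model` and an array of samples shaped (samples × time steps × variables); the function name and threshold handling are illustrative and not the authors' code.

```python
import numpy as np

def classify_by_reconstruction_error(model, samples, e_th):
    """Label each sample 'Normal' or 'Anomalous' by comparing its per-sample
    MAE reconstruction error with a predetermined threshold e_th (Equation (9))."""
    reconstructed = model.predict(samples)
    mae = np.mean(np.abs(samples - reconstructed), axis=(1, 2))  # one MAE value per sample
    labels = np.where(mae < e_th, "Normal", "Anomalous")
    return labels, mae
```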

#### **3. Implementation on Industrial Dataset**

#### *3.1. Description of the Dataset*

An industrial dataset was provided by KLEEMANN Greece containing historical measurements of elevator hydraulic power units (HPU). These measurements correspond to quality tests that monitor the speed of the elevator, the pressure developed in the hydraulic unit, and the noise produced during operation. The dataset contains 7200 different cases, corresponding to different client orders and employing various configurations and setups. An indicative example is shown in Figure 2.

**Figure 2.** Indicative diagram from a HPU quality test. Speed, pressure and noise are captured for both translational directions.

Investigation of the provided dataset revealed that the acquired curves are not affected by the translational direction. Thus, directional information was not regarded as a classification parameter and the measured curves for both directions were merged into a single dataset with 2 × 7200 = 14,400 samples. On the other hand, discrepancies were observed in the testing parameters, e.g., some tests last longer than others, or the HPU operated in a different speed range, etc. Such deviations occur due to the different client requirements of each individual order and cannot be avoided; therefore, some data preprocessing is needed, as described in the following section.

#### *3.2. Data Preprocessing*

The number of captured time steps in each HPU test varied from 201 to 1035, although most samples (ca. 80%) consisted of only 201 time steps. Therefore, all data sequences were brought to the same length (i.e., 201 time steps) to enable mini-batch processing. The values of speed were all in the same range [0, 0.91], thus no treatment was necessary. Noise and pressure were in very different ranges, namely [0, 91.2] and [0, 53.98] respectively, thus they were normalized by dividing by their maximum values.
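
A possible sketch of this preprocessing is given below; note that the paper does not state how longer sequences were reduced to 201 time steps, so the truncation/zero-padding and the assumed column order are assumptions.

```python
import numpy as np

def preprocess(sequences, target_len=201):
    """Bring each (timesteps x 3) sequence to target_len steps and scale noise and
    pressure by their dataset-wide maxima (speed is already in [0, 0.91])."""
    fixed = []
    for seq in sequences:
        seq = np.asarray(seq, dtype=float)[:target_len]        # truncate long sequences
        if len(seq) < target_len:                               # zero-pad short sequences
            pad = np.zeros((target_len - len(seq), seq.shape[1]))
            seq = np.vstack([seq, pad])
        fixed.append(seq)
    data = np.stack(fixed)                                       # (samples, 201, 3)
    # columns assumed ordered as [speed, noise, pressure]
    data[:, :, 1] /= data[:, :, 1].max()                         # noise    -> [0, 1]
    data[:, :, 2] /= data[:, :, 2].max()                         # pressure -> [0, 1]
    return data
```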

The derived dataset was split for training (90%) and testing (10%). The sizes of the datasets involved were (samples × time steps × variables) (see Table 1):

**Table 1.** Employed datasets for training and testing.


#### *3.3. Synthetic Data for Anomaly Detection*

As already mentioned, a well-configured manufacturing line rarely produces measurements with anomalies. This was also the case with the provided dataset; all timeseries correspond to optimal operation. Thus, a synthetic dataset was created to test the model and assess its anomaly detection performance. To avoid bias, expert technicians were consulted to establish realistic deviations and criteria for the identification of sub-optimal operation. According to the technical experts, anomalies in the measured timeseries should be identified by (a) deviations of more than ±1.9 bar in pressure, and (b) noise values higher than 68 dB. No hard indicator could be provided for speed. Furthermore, the noise value was treated as a weak indicator, since there were samples with noise values higher than 68 dB that were still treated as normal.

To tackle these ambiguities, the provided dataset was further explored to establish stricter and more realistic thresholds; these thresholds were then exploited to induce artificial anomalies. An example is shown in Figure 3 (Graphs 1–3), where it is clear that speed is not constant but exhibits fluctuations; thus, such curves can be used to extract the fluctuation threshold beyond which the operation is considered sub-optimal. The same applies to pressure (Graphs 4–6). Following this approach, the following factors were accounted for during the generation of artificial anomalies:


**Figure 3.** Examples of time series during optimal operation.

To create the synthetic dataset for anomaly detection testing, each measured curve of the normal testing dataset was processed. Therefore, each curve in the anomalous testing dataset is the anomaly-including counterpart of a curve in the normal testing dataset. Examples of normal and synthetic curve pairs including artificial anomalies are shown in Figure 4.
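
A hedged sketch of how such artificial anomalies could be induced is given below; the oscillation amplitude, period, and dip/rise magnitude are hypothetical placeholders, since the actual deviations were derived from the dataset and the experts' criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_oscillation(curve, amplitude=0.05, period=10):
    """Superimpose an oscillation on a normal curve (amplitude/period are illustrative)."""
    t = np.arange(len(curve))
    return curve + amplitude * np.sin(2 * np.pi * t / period)

def add_dip_or_rise(curve, magnitude=0.1, width=20):
    """Insert a short dip or rise of the given magnitude at a random position."""
    start = rng.integers(0, len(curve) - width)
    sign = rng.choice([-1.0, 1.0])
    anomalous = curve.copy()
    anomalous[start:start + width] += sign * magnitude
    return anomalous
```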

**Figure 4.** Examples of normal-synthetic/anomalous pair of curves.

#### *3.4. Description of the AE Architectures*

To acquire the most appropriate AE for AD in this industrial dataset, three different AE architectures were considered, namely ANN-, LSTM- and CNN-based AEs. Thus, some popular architectures found in the literature were explored and assessed using the training and testing datasets described above.

**ANN-based AE.** The first implementation was an ANN-based AE. It consisted of 10 fully connected (dense) layers, a layer that flattens the matrix and a reshape layer in the decoding operation that transforms the vector back to a matrix. In particular, the architecture is as follows: Input (128-64-32)-flatten-1024-latent space (128)-(1024-6432)-reshape-(64-128)-output. The ReLU activation function was used for all hidden layers.
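
A rough Keras sketch of such an ANN-based AE is given below; the exact layer ordering and counts are an interpretation of the description above, not the authors' implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ann_ae(timesteps=201, n_vars=3, latent_dim=128):
    """Dense autoencoder roughly following the layer sequence described in the text."""
    inp = keras.Input(shape=(timesteps, n_vars))
    x = layers.Dense(128, activation="relu")(inp)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Flatten()(x)                                   # 201 * 32 = 6432 units
    x = layers.Dense(1024, activation="relu")(x)
    z = layers.Dense(latent_dim, activation="relu")(x)        # latent space
    x = layers.Dense(1024, activation="relu")(z)
    x = layers.Dense(timesteps * 32, activation="relu")(x)    # back to 6432 units
    x = layers.Reshape((timesteps, 32))(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(n_vars)(x)                             # reconstructed signals
    return keras.Model(inp, out)

model = build_ann_ae()
model.compile(optimizer="adam", loss="mse")                   # L2 loss, as used in Section 3.5
```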

**LSTM-based AE.** The LSTM-based AE is a shallower network compared to the previous one. It consists of 4 LSTM layers, a layer that repeats the vector over the corresponding timesteps and a skip-connection layer. Hyperbolic tangent and sigmoid activation functions were used in the LSTM units for the input and the recurrent state, respectively. Skip-connections were employed in the stacked LSTM layers, following the practice of [16–18], to boost the model's reconstruction performance. The architecture is shown in Figure 5.

**Figure 5.** Architecture of the LSTM-based AE.

**CNN-based AE.** The CNN-based AE consists of convolutions, transpose convolutions and pooling operations. Convolutions and pooling operations were part of the encoding process, while transpose convolutions [19] were used to perform up-sampling during decoding. Convolutional and transpose convolutional layers employed "same" padding, so that the size of the output matched the size of the input. ReLU was used as the activation function. The complete architecture is presented in Figure 6.

**Figure 6.** Architecture of the CNN-Based AE.

#### *3.5. Results*

All the experiments were conducted on a custom-built workstation to accommodate the computational needs of OPTIMAI; the workstation was equipped with an Intel Core i9-11900KF @ 3.5 GHz CPU, 16 GB RAM and NVIDIA GeForce RTX 3080 Ti with 12 GB of GDDR6X memory, running on Windows 10 Pro. The proposed AEs were implemented in Python 3.8.3, with the Keras API framework of TensorFlow 2.0 [20].

The dataset was split by random selection into training (90%, 12,960 records) and testing (10%, 1440 records) subsets, so as to acquire a large, representative training dataset that helps the network learn the variety of operational values [21]. A smaller testing dataset was produced considering both synthetic anomalies and normal cases.

The L2 loss was chosen as the training cost function. In all experiments, both 5-fold and 10-fold cross-validation were used, with the held-out group of data being exploited as a validation dataset during training. The early-stopping setup of Keras was used to avoid overfitting. As shown in Table 2, there was a considerable difference in the running times and number of epochs required for the AEs to converge. The ANN-based AE was the fastest model to train despite involving significantly more parameters than the others. The CNN-based AE, on the other hand, combines computational efficiency with a low training time.
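
The snippet below sketches such a training setup with the Keras early-stopping callback, assuming `model` is one of the AEs above and `x_train` holds the normal training samples; the patience, batch size, and epoch budget are illustrative values not reported in the paper.

```python
from tensorflow.keras.callbacks import EarlyStopping

# x_train: normal training samples of shape (12960, 201, 3); during cross-validation a
# held-out fold serves as validation data (a simple 10% split is shown here instead).
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
history = model.fit(
    x_train, x_train,            # the AE learns to reconstruct its own input
    epochs=200,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stop],
)
```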

**Table 2.** Computational characteristics for each proposed architecture.


To determine the most suitable AE type, one critical parameter needs to be considered: the classification threshold. To determine the threshold for classifying signals as normal or anomalous, the mean reconstruction error values were investigated. Both the MSE and MAE metrics were considered, but the MSE curves were prone to outliers; thus, the classification threshold was selected based on the MAE. The distribution of the MAE for both normal and synthetic samples in the data is presented in Figure 7. An ideal model would perfectly reconstruct its input; thus, no overlap (zero error) should be observed in the diagrams; equivalently, the higher the overlap, the worse the performance.

**Figure 7.** Distributions of MAE for each AE architecture.

Optimal performance was achieved by setting the threshold to 0.0063, 0.00048, and 0.00514 for the ANN-based, the LSTM-based and the CNN-based AEs, respectively.

The CNN-based AE achieved the highest scores according to the classification metrics (see Table 2). Its reconstruction capabilities are presented for two types of anomalies in Figures 8 and 9, with the corresponding MAE for both normal–anomalous pairs. Oscillations (Figure 8) produced higher MAE errors than dip/rise failures (Figure 9). However, further research is needed regarding anomaly localization using DL methods.

**Figure 8.** The reconstruction of the proposed model for a pair of normal and the corresponding with oscillation anomalies sample.

**Figure 9.** The reconstruction of the proposed model for a pair of normal and the corresponding with dips/rises anomalies sample.

All three examined AE architectures demonstrated enhanced anomaly detection capabilities by producing a higher reconstruction error when fed with data including anomalies. According to the provided results, the LSTM-based AE achieved the lowest reconstruction error for both types of input data (normal or synthetic with artificial anomalies), which means that it also reconstructs anomalous signals well and is therefore less discriminative; the CNN-based AE, in contrast, presents a higher sensitivity to anomalies. In terms of classification capability, the LSTM-based AE provided worse classification metrics than those achieved by the CNN-based AE, according to the MAE distribution graphs and the observed overlap, which also limited the range available for the selection of the classification threshold. Overall, it emerges that the CNN-based AE shows better performance with regard to accuracy, while also requiring less training time due to its reduced number of parameters.

#### **4. Conclusions**

To tackle the absence of anomalies in the data, a common problem with real industrial datasets, a simple methodology for inducing anomalies based on experts' knowledge and data analysis was deployed. The AE based on 1DCNN layers outperformed both the LSTM-based and the ANN-based AEs in terms of classification accuracy. In addition, the ability of the CNN layers to share weights kept the number of parameters low relative to the network depth and resulted in fast training times. The model distinguished normal data from data including anomalies with 94% accuracy on an industrial dataset enhanced with artificial anomalies. Moreover, this work presented a methodology for inducing artificial anomalies, in order to generate samples that deviate from the normal operational thresholds of the examined industrial dataset.

A limitation of our approach is that the artificial dataset might not be representative of all possible anomalous outcomes. Thus, the classification threshold selected with our approach can be considered optimal only for this test dataset. For future work, the consideration of real anomalies and the examination of other methodologies for estimating the classification threshold are essential.

**Author Contributions:** Conceptualization, T.T. (Theodoros Tziolas), K.P. and E.P.; methodology, T.T. (Theodoros Tziolas) and K.P.; software, T.T. (Theodoros Tziolas); validation, T.T. (Theodosios Theodosiou), T.T. (Theodoros Tziolas) and T.M.; formal analysis, T.T. (Theodoros Tziolas) and T.T. (Theodosios Theodosiou); investigation, T.T. (Theodoros Tziolas); resources, T.M. and A.P.; data curation, T.T. (Theodoros Tziolas) and T.M.; writing—original draft preparation, T.T. (Theodoros Tziolas), T.T. (Theodosios Theodosiou) and K.P.; writing—review and editing, E.P., T.M. and A.P.; visualization, T.T. (Theodoros Tziolas); supervision, E.P.; project administration, E.P.; funding acquisition, E.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by OPTIMAI, European project, grant number GA 958264.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The datasets analyzed during the current study are available from the KLEEMAN co-authors on reasonable request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Time Series Clustering of High Gamma Dose Rate Incidents †**

**Mohammed Al Saleh 1,2,3,\* , Beatrice Finance 1, Yehia Taher 1, Ali Jaber <sup>2</sup> and Roger Luff <sup>4</sup>**


**Abstract:** In this paper, we proposed an unsupervised machine-learning-based framework to automate the process of extracting suspicious gamma dose rate incidents from the real, unlabeled raw historical data measured by the German Radiation Early Warning Network and to identify the underlying event behind each one. This raised the research problem of clustering unlabeled time series data of varying lengths and scales. Based on many evaluations, we demonstrated that the most popular state-of-the-art time series clustering models were not suitable for this task. This motivated us to introduce our own approach, through which we were able to perform online classification of gamma dose rate incidents of varying lengths and scales.

**Keywords:** machine learning algorithms; predictive model; time series clustering; gamma dose rate; Radiation Early Warning Network

#### **1. Introduction**

Time series analysis is gaining increasing interest in many domains. With the proliferation of sensors and IoT (Internet of Things) devices that continuously produce massive amounts of real-time data, special care has been given to analyzing these data to understand past events and patterns and to predict future ones. Medical heart monitor data, stock market prices, weather conditions, etc., are all examples of such time series data.

In this paper, we are interested in analyzing the gamma dose rate (background radiation level) in the environment. Some incidents can cause an abrupt increase in the gamma dose rate, such as the Chernobyl accident, the biggest short-term leak of radioactive materials ever recorded in history [1]. Such an event has to be intercepted at the earliest point possible to take the proper measures and precautions and notify the concerned authorities, in order to minimize the effects of such a hazardous situation. This is a very critical task, as long-term or acute exposure to a high gamma dose rate can have many hazardous consequences for humans as well as for the ecosystem.

Around the globe, there are thousands of probes (sensors) that collect gamma dose rates in real time. A Radiation Early Warning System (REWS) collects and analyzes these data while raising an alarm in case of an increase in the local gamma dose rate. Whenever an event occurs (i.e., the gamma dose rate goes above the accepted threshold provided by experts), an alarm is triggered, and a team of experts and personnel has to investigate the reasons behind this rise. Currently, analyzing the incoming incidents is performed manually, relying on expert effort. Such a method is time-consuming and risky, knowing that the factors affecting the gamma dose rate are not always known immediately.

**Citation:** Al Saleh, M.; Finance, B.; Taher, Y.; Jaber, A.; Luff, R. Time Series Clustering of High Gamma Dose Rate Incidents. *Eng. Proc.* **2022**, *18*, 24. https://doi.org/10.3390/ engproc2022018024

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 22 June 2022



Fortunately, most of the incoming incidents are innocent, as they remain within a range that is acceptable for human health, and the value returns to normal after a period of time.

The objective of our research is to propose an *Intelligent Radiation Early Warning System* that automatically finds the cause behind an incident and classifies it as *real or innocent* in real time. In order to build such an intelligent system, we first need to understand and learn from the past by using the historical databases produced by the REWS. These databases contain raw, unlabeled data (i.e., time series) corresponding to the gamma dose rate monitored at each probe. Second, we need to identify, in real time, situations already encountered (or not) in the past to predict the cause behind an incident as quickly as possible.

In this paper, we are only interested in describing the challenges of the first phase. We aim at proposing an unsupervised machine learning model that will help us automatically identify the reasons behind incidents. Since we do not have full knowledge of the incident causes (nor their number) or of the incident behavior patterns, we solicited the help of experts to validate the quality of our model's results and to label the obtained clusters. Another challenge we ran into was finding the right unsupervised machine learning model, which is a highly challenging task. We ran experiments on real data in order to come up with the most suitable approach, which is described in this paper. Surprisingly, the most popular approaches found in the state-of-the-art were not the ones selected for our proposal.

The remaining sections of the paper are organized as follows. In Section 2, we address the context and problem statement. Section 3 recalls the background related to the state-of-the-art. Section 4 describes the methodology and the different evaluations we conducted to find the best model for our data. Finally, we conclude in Section 5.

#### **2. Context and Problem Statement**

For a long time now, time series data analysis has been a center of attention in research as it is used in different applications such as weather prediction, motion capture processing, analyzing insect behavior, pattern discovery on health-care data, and so on. Similarly, intelligence can be extracted from the gamma dose rate time series data in the radiation monitoring domain.

Although the gamma dose rate is theoretically affected by real incidents, it may also be affected by other factors such as weather conditions (rain, lightning, snow, ...), environmental factors (the sun's cosmic radiation), and many other events, as shown in Figure 1. Depending on the type of event, these factors cause the gamma dose rate to go above an acceptable value for a short or long period of time. Incidents such as those caused by rain we call *soft parabolas*: incidents that stay above the acceptable gamma dose rate value for a considerable period of time. Incidents such as those caused by lightning are called *hard parabolas* and are rarely confused with real incidents or a cause for alarm. Usually, they are caused either by instantaneous events or by a malfunctioning probe.

Note that even these *innocent* events trigger the system's alarm (depicted in red in Figure 1) and entail a technically useless investigation; they have to be recognized and discarded by the experts after manually searching through and analyzing multiple data sources (rain data, temperature data, wind data, even transportation data, etc.), which may not be available for inspection at any time and which prolongs the useless process of evaluating the event as not alarming.

**Figure 1.** Increased gamma dose rate due to the deposits of Radon by-products from the atmosphere by rain.

To build our *Intelligent Radiation Early Warning System*, we faced the following problems. Firstly, the historical databases only maintain the gamma dose rate for each probe; unfortunately, the experts' evaluations of the incidents are not maintained in them. We are not dealing with multivariate time series, as the other data sources, such as precipitation and temperature, are not stored in these databases. Here, we deal with univariate time series, which are unlabeled, as shown in Figure 2. Secondly, incidents caused by the same event may not have a recognizable temporal trace or identical characteristics, but rather a common behavior. For example, a particular event may cause peaks of increasing amplitudes that decrease over a longer period of time; another may cause an abrupt increase that maintains its amplitude for a period of time, and so on. Note that incidents caused by the same event can last for varying lengths of time and reach different amplitudes. Thus, grouping the discovered incidents into similar patterns to explore their causes is deemed to be a big challenge.

**Figure 2.** Typical gamma dose rate time series.

Although the literature on time series analysis is prolific, the aforementioned challenge still remains an open question that can be formulated as follows: *"What is the best-fit unsupervised machine learning model that should be used for clustering our time series of varying lengths and different scales*?*"*

We propose to answer this question in this paper; this was made possible thanks to the German REWS [2], which offered its data for all the experiments, and thanks to all the experts we were in contact with, who validated or invalidated the results found by applying the different time series clustering algorithms. The data used in this work comprise ten years of minute-by-minute real gamma dose rate data from over a thousand probes.

#### **3. Background and State-of-the-Art**

In this section, we briefly recall the four main phases for defining a time series clustering approach. First, there is the time series data preprocessing. Second, the similarity measure should be chosen. Then, the clustering algorithm must be selected. Finally, the optimal number of clusters should be determined. Based on these, we enumerate the approaches proposed in the state-of-the-art for univariate time series clustering, especially those for series of varying length.

#### *3.1. Time Series Clustering Generic Model*

1. **Time Series Data Preprocessing**. Data normalization and missing data imputation are the basic data preprocessing activities to be applied to time series data. Both kinds of techniques have a significant impact on the performance of a model, and they should be chosen based on the problem and model at hand.

Missing data imputation: Missing values may cause problems for machine learning algorithms as they will perform better with complete well-formed data. Some of the most popular approaches to deal with this problem are dropping rows with missing values, statistical imputation, and model imputation.

Normalization: The most common normalization methods used during data transformation include min-max, decimal scaling, and z-normalization [3]. The first two methods rely entirely on the minimum and/or maximum values that should be predefined from the data and upon which normalization will be performed. This is not the case with gamma dose rate time series data because the minimum and the maximum values are unknown.

2. **Similarity Measure**. Similarity measures are algorithms used to determine the resemblance between different samples. In time series clustering, it is the determining factor used by the clustering algorithm to decide which cluster each sample belongs to. Shape-based distances evaluate the similarity of samples based on the actual or the normalized values, whereas feature-based distances evaluate similarity based on extracted features. In our context, we are only interested in shape-based measures. They fall into one of two categories:

Lock-step measures are metrics that evaluate the distance between two time series sequences as the overall difference between each point and its counterpart in the other sequence. These measures require data sequences to be of equal length. Minkowski (*Lp* norm) distances, specifically Euclidean [4], are the most favored lock-step metrics in machine learning. Their popularity is derived from their simplicity and success in machine learning literature as well as their being parameter-free.

Elastic measures, on the other hand, provide better flexibility as they permit one-to-many and one-to-none point evaluation. Due to this flexibility, these measures provide a better comparison. This flexibility, however, comes at the price of increased time complexity.

Dynamic Time Warping (DTW) [5] is the most famous elastic measure in the literature, specifically introduced for time series analysis. As its name suggests, it warps the two considered sequences in time to deal with time shift and speed variations. DTW is a good similarity measure for comparing samples of varying lengths. It produces a scale-like effect, stretching and contracting, by accepting many-to-one matching; however, this also makes it sensitive to outliers.

3. **Clustering Algorithm**. Despite the major role each part plays in the process, a recent study has shown that the choice of the proper similarity measure is more fundamental than that of the clustering algorithm itself in time series clustering. As a result, the majority of time series clustering approaches fall back on classic clustering algorithms, where either the choice of distance measure is modified to fit the time series data (raw-based methods), or the data are transformed to fit the clustering algorithm (feature- and model-based methods) [6]. The raw-based approach is often preferable to the feature- and model-based approaches. The latter are generally domain-dependent, where the features or models have to be altered depending on the application in the different domains. On the other hand, the main catch of the raw-based algorithms is the *curse of dimensionality* [7], since spaces of high dimensionality are considered.

In our study, we focus on raw-based approaches as the results can be better generalized across the different applications and domains. We therefore only consider hierarchical and partitional clustering methods, as they are the most commonly used clustering methods in the literature on time series clustering [6,8].

Hierarchical clustering takes no parameters other than the linkage criteria [9]. Depending on the linkage criteria, a tree-like nested "hierarchy" of clusters is built, which can be visualized by a dendrogram. Hierarchical clustering's main advantage is that it does not require the number of clusters as input. Once the dendrogram is obtained, the clusters can be decided by making a cut at a certain point. On the other hand, it requires the distance matrix of all possible pairs of observations. This makes it very computationally expensive and not a favorable option for huge data sets.

Partitional clustering, as its name implies, partitions the data into *k* different clusters, where *k* is specified a priori. Its aim is to minimize the intra-cluster distance and maximize the inter-cluster distance. Partitional methods need the number of clusters *k* a priori. The K-means [10] and K-medoids [11] heuristics are considered the front-runners of the partitional methods. They are both based on the concept of finding the best cluster centers, minimizing the distance between each observation and the center of the cluster it is assigned to.

4. **Number of Clusters**. Two common methods are used to estimate the optimal number of clusters:
	- Elbow Method: estimates the number of clusters by comparing the within-cluster dispersion.
	- Silhouette Method: the Silhouette index, proposed by Kaufman et al. [9], is based on the compactness and separation of the clusters.

#### *3.2. State-of-the-Art*

In Table 1, we summarize the main approaches proposed in the literature to cluster time series of varying lengths. We found that the most favored similarity measure is DTW, and the most popular clustering algorithm is K-medoids. Combining DTW and K-means does not give valid clusters, as stated in [12]. The only approach using DTW and K-means was proposed by Petitjean et al. [13], who introduced a global averaging method called DTW barycenter averaging (DBA), which is a heuristic strategy. However, combining DTW with K-means seems to have a lot of complications, and even with the DBA averaging method, the verdict is left to testing to see how DBA fares with a big length difference compared to DTW with the K-medoids model.


**Table 1.** Combined Techniques in the Literature.

While this sounds good for the similarity measure (DTW), it is still not clear if this remains true when the similarity measure is used within a machine learning model. In a recent work, Tan et al. [20] explain that little work has been published in the literature on the classification of time series of varying lengths compared to the "time-warping" problem. They say the problem is comparatively "understudied and unappreciated". When looking at the UCR archive [21], we also see that most of the datasets are of uniform length, with varying-length datasets added only very recently, in 2018. That is why we believe that the context of our research will help to gain a better understanding of the problem. Unfortunately, due to the nature of the data (radiation levels), they cannot be made public through the UCR archive.

#### **4. Our Approach and Experiments**

In this section, we describe the different choices made in our model to cluster gamma dose rate time series. Since we are not contributing to the clustering domain itself, we propose a methodology for fine-tuning the process of finding the best way to perform the clustering. This led us to the hardest part of our research, where we tested many combinations of similarity measures and clustering algorithms. In the end, we present our contribution, which is the machine learning model that achieved the best results in our tests. Our approach is depicted in Figure 3. As explained before, machine learning algorithms achieve better performance if the time series data have a consistent scale or distribution. Thus, particular attention has been paid to incident extraction and data preprocessing. We then detail the similarity measure and the clustering algorithm we retained.

**Figure 3.** Specific Model for Clustering Sequences of Varying Length.

#### *4.1. Incident Extraction and Preprocessing*

In the beginning, we considered as incidents all subsequences of the time series where the gamma dose rate went above the threshold; however, after experimentation we found that short incidents added much noise to the data set. The clustering could not achieve any satisfactory results, so we re-consulted the experts. In the remaining work, we discarded all incidents that did not last at least 30 min above the peak threshold.


2. **Normalization**. Gamma dose rates can reach unpredictable levels when affected by a radiation event; we cannot know the maximum and minimum values needed to perform min-max or decimal-scaling normalization. For this reason, we had to discard these methods. On the other hand, **z-normalization** is widely applied in the time series literature. Its strong point is that it normalizes the samples so that only their *shape* is left to compare. A value *a* of *A* is standardized to *a'* by computing:

$$a' = \frac{a - \mu(A)}{\sigma(A)}$$

The fact that it normalizes the data to have a mean equal to zero and a standard deviation equal to one has great advantages, as explained in the next section.

3. **Length Standardization**. As mentioned in the state-of-the-art, the elastic measure DTW is very sensitive to outliers, which means that if the variation in length between samples is too high, the clustering does not perform well, as we see later in the evaluation. To solve the varying-length problem, a **padding** technique was used, as proposed by Tan et al. [20]. Samples are padded with inconsequential data points such as zero, the mean or the median, depending on the data distribution. By padding the z-normalized data with zeros, neither the mean (0) nor the standard deviation was affected, since zero is indeed inconsequential for this distribution of data. Notice that without the z-normalization, it would have been impossible to apply the zero-padding. This results in all the incidents in the dataset having equal length, without interfering with the characteristics of the data.

The standardization applied in the preprocessing phase was critical for the approach. Without it, the padding could not have been performed and the other experiments would not have yielded meaningful clusters.
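
The two preprocessing steps can be sketched as follows; the function names are illustrative.

```python
import numpy as np

def z_normalize(incident):
    """Z-normalize a single incident so that only its shape matters."""
    incident = np.asarray(incident, dtype=float)
    return (incident - incident.mean()) / incident.std()

def pad_incidents(incidents):
    """Zero-pad the z-normalized incidents to the length of the longest one,
    so that all samples in the dataset share the same length."""
    max_len = max(len(s) for s in incidents)
    padded = np.zeros((len(incidents), max_len))
    for i, s in enumerate(incidents):
        padded[i, :len(s)] = z_normalize(s)
    return padded
```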


**Figure 4.** Our model's cluster results for *k* = 4.

#### *4.3. Experimentation*

In order to compare the different approaches of the state-of-the-art, as well as to see the benefit of our proposed model, we decided to evaluate, in a systematic way, different combinations of preprocessing steps (with or without z-normalization or zero-padding) with different clustering models (K-means, K-medoids, K-shape), as synthesized in Table 2. The overall number of experiments performed was 24, including the 16 experiments described in Table 2. Due to space limitations, we only give the results obtained with K-medoids or K-means with the same similarity measure and the same preprocessing; however, all evaluations are available by contacting the authors.


**Table 2.** Model Experiments.

In Figure 5, using K-medoids with DTW with/without padding, we faced the same problem, caused by the fact that the K-medoids algorithm tolerates outliers: the obtained clusters have a lot of misplaced incidents, and the centroid of each cluster does not clearly represent the observations in it. On the other hand, observing the results of K-means clustering, we can see how adding each preprocessing step brought us closer to the best cluster results, shown in Figure 6, which were approved by the experts, who found that indeed each cluster (from left to right) can be explained by a different underlying event. Cluster 1's incidents are caused by a **calibration event** performed on the probes. Cluster 2's incidents are caused by a **stormy** rain, where the wind causes the very sensitive probes to be affected by vibrations. Cluster 3's incidents are caused by a **normal rain** that brings the elements straight down and affects the probe with an immediate sharp increase. Notice that we also tried to increase the number of clusters *k*, but we found that, above 3, we started to see redundant clusters.

**Figure 5.** K-medoids with DTW(DBA) and z-normalized data with padding.

**Figure 6.** K-means with DTW(DBA) and z-normalized data with padding.
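
As an illustration of the retained combination (K-means with DTW/DBA on z-normalized, zero-padded incidents), a minimal sketch using the tslearn package is shown below; the paper does not name a specific library, so this is only one possible realization.

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans

# padded: array of z-normalized, zero-padded incidents, shape (n_incidents, length)
X = padded[:, :, np.newaxis]                        # tslearn expects (n, length, dimensions)
km = TimeSeriesKMeans(n_clusters=3, metric="dtw",   # DTW distance with DBA barycenters
                      max_iter=10, random_state=0)
labels = km.fit_predict(X)                          # cluster assignment for each incident
```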

#### **5. Conclusions**

In this work, we presented our unsupervised machine-learning-based framework for autonomously identifying the underlying events behind high gamma dose rate historical incidents. After extracting and preprocessing the incidents, our machine learning model groups similarly behaving incidents caused by the same underlying event. The experts evaluated the groups, recognized the events, and labeled the incidents. The model that we have proposed is the result of an intense period of evaluations. The systematic methodology has convinced us of the foremost importance of the preprocessing phase. We believe that our proposal could be applied to other application domains dealing with incidents of varying scale and length. To complete our *Intelligent Radiation Early Warning System*, online incidents should be classified into the proper cluster in real time. We will present our proposed solution in a future publication.

**Author Contributions:** M.A.S. analyzed and interpreted the need for an intelligent radiation early warning system, introduced the first phase of RIMI framework, and was a major contributor in writing the manuscript. He also queried the AI methodologies and techniques to introduce an approach that can address the shortcomings behind the RIMI first phase. B.F., Y.T., A.J. and R.L. verified the tested techniques and functions for analyzing the data and were major contributors in writing the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** The first and corresponding author "Mohammed Al Saleh" has a scholarship from the National Council for Scientific Research in Lebanon (CNRS) to continue his Ph.D. degree, including this research.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

**Acknowledgments:** We would like to thank the National Council for Scientific Research (CNRS) in Lebanon for supporting this work. We would like to express our gratitude to the Federal Office for Radiation Protection (BfS) in Germany for allowing us to use the data collected by their REWS over more than 15 years. We would also like to thank Roy Issa and Nourhan Bachir for their support in implementing the code behind this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


LCSS Longest Common Subsequence

#### **References**


### *Proceeding Paper* **A Dynamic Combination of Theta Method and ATA: Validating on a Real Business Case †**

**Yasin Tadayonrad \* and Alassane Ballé Ndiaye**

	- Spain, 27–30 June 2022.

**Abstract:** In order to make better decisions and take efficient actions in any supply chain system, we need better estimates of uncertain parameters, especially the future demands of our customers. To do so, we must use a forecasting model that gives the most useful and accurate forecasts. Time series forecasting methods are still among the most popular approaches used in business because of their simplicity. One of the most recent methods that caught the attention of researchers and practitioners is the Theta method, which ranked first in the M3 competition. This method works by decomposing the deseasonalized original demand data into two components: the first component represents the long-term trend, and the second component indicates the short-term behavior of the data set. The ATA method is another recently introduced method. It works like exponential smoothing methods, but in the ATA method the smoothing parameter is a function of the time point. In this paper, we propose a new form of the Theta method that benefits from the features of ATA and presents a combination of the ATA and Theta methods. We introduce a dynamic model that uses the Theta method as the main model and selects from among alternative methods, such as the ATA method and simple and double exponential smoothing, to be used as the theta lines. We also optimize the parameters of each method used in the model. Finally, we tested the proposed model on a real data set and concluded that the combination of the Theta and ATA methods performs better than the other alternatives in terms of forecast accuracy.

**Keywords:** forecasting; time series; Theta method; ATA method; simple exponential smoothing (SES); double exponential smoothing (DES); forecasting key performance indicator (KPI); combination

#### **1. Introduction**

One of the major concerns in most supply chain systems is finding the best ways to help managers and decision-makers make the right decisions and take efficient actions [1,2]. Estimating the customers' demand as accurately as possible is one of the most critical issues affecting many decisions in a supply chain system. Therefore, having an efficient and reliable forecasting model plays an important role in this industry. In the past decades, time series forecasting methods have been at the core of the attention of researchers and practitioners in this field [3–7]. These methods are usually simple and understandable. That is why these models are still popular in the industry.

#### *1.1. Theta Method*

The Theta method was introduced and ranked as the method with the best performance in the M3 competition [8]. Since then, the Theta method has been a benchmark method in forecasting projects, competitions, and research [9].

**Citation:** Tadayonrad, Y.; Ndiaye, A.B. A Dynamic Combination of Theta Method and ATA: Validating on a Real Business Case. *Eng. Proc.* **2022**, *18*, 25. https://doi.org/ 10.3390/engproc2022018025

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 22 June 2022



The classic Theta method is applied to a seasonally adjusted demand data set by decomposing the deseasonalized data into theta lines, extrapolating the theta lines, combining the extrapolated theta lines, and finally reseasonalizing the combined line to obtain the forecasts for the periods ahead [10]. In the classic form, Theta decomposes the seasonally adjusted data into two theta lines. These two lines refer to the long-term and short-term behavior of the original data set. The implementation of the classic form of the Theta method can be summarized in the following steps:

#### 1.1.1. Seasonality Detection

This step can be skipped if our original data set represents demand data on a yearly basis. In the case of quarterly, monthly, weekly, or even daily data, we need to detect seasonality patterns in the historical data.

There are several ways to detect seasonality, but one of the most common is the autocorrelation method. Using the equations below, we can determine the existence of seasonality and the seasonal lag in our data.

$$r\_k = \frac{\sum\_{i=1}^{n-k} \left(d\_i - \overline{d}\right) \left(d\_{i+k} - \overline{d}\right)}{\sum\_{i=1}^{n} \left(d\_i - \overline{d}\right)^2} \tag{1}$$

$$|r\_m| \ge q\_{\left(1-\frac{\alpha}{2}\right)} \sqrt{\frac{1}{n} \left(1 + 2 \sum\_{k=1}^{m-1} r\_k^2\right)}\tag{2}$$

*rk*: Autocorrelation coefficient for lag *k*.

*di*: Values of original time series at time *i*.

*m*: The periodicity of the data (four for quarterly data, twelve for monthly data).

The value of *q* refers to the (1−*α*/2) quantile of the standard Normal distribution.

For any value of *m* for which the absolute value of *rm* is greater than the right-hand side of Equation (2), we can say there is a seasonal pattern with lag *m*. If more than one value of *m* qualifies, we can take the largest one.
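
A small NumPy sketch of this autocorrelation-based seasonality test, following Equations (1) and (2), is shown below; the default *q* = 1.96 corresponds to *α* = 0.05.

```python
import numpy as np

def autocorrelation(d, k):
    """Autocorrelation coefficient r_k of series d at lag k (Equation (1))."""
    d = np.asarray(d, dtype=float)
    d_bar = d.mean()
    return np.sum((d[:-k] - d_bar) * (d[k:] - d_bar)) / np.sum((d - d_bar) ** 2)

def is_seasonal(d, m, q=1.96):
    """Test for a seasonal pattern of lag m at roughly 95% confidence (Equation (2))."""
    n = len(d)
    r = [autocorrelation(d, k) for k in range(1, m)]        # r_1, ..., r_{m-1}
    limit = q * np.sqrt((1 + 2 * np.sum(np.square(r))) / n)
    return abs(autocorrelation(d, m)) >= limit
```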

#### 1.1.2. Removal of Seasonality

One of the most well-known approaches for removing seasonality from the original data is the use of decomposition methods. Classical multiplicative decomposition and classical additive decomposition are the most popular approaches for decomposing seasonal patterns, by which we can separate a time series into trend, seasonality, and remainder.

#### 1.1.3. Theta Decomposition

In this step, the seasonally adjusted series needs to be separated into two components: a short-term and a long-term component. In the Theta method, *θ* is the main parameter that modifies the local curvatures of the seasonally adjusted data. Using different values of *θ* results in different representations of the original data. *θ* = 1 simply refers to the seasonally adjusted data. *θ* = 0 refers to the series with no local curvatures; in other words, it indicates the linear trend. 0 < *θ* < 1 results in a reduction of the local curvatures and, as such, an amplification of the long-term behavior. *θ* > 1 results in an amplification of the local curvatures and, as such, an amplification of the short-term behavior.

In the classic version of Theta method, two theta lines are used, with *θ*1 = 0 and *θ*2 = 2, denoted as *t*1 and *t*2. As noted, *t*1 is the linear trend line of the data and corresponds to the long-term component, whereas *t*2 represents the series with double the local curvatures of the nonseasonal data.

The first theta line is *t*1. In the classic Theta method, simple regression is used to fit the nonseasonal data.

$$t1\_t(0) = \hat{d}\_t = b\_0 + b\_1 t \tag{3}$$

*b*0: Intercept of seasonality adjusted data

*b*1: Slope of seasonality adjusted data

*t*: Time period

$$t2\_t(\theta) = \theta d\_t + (1 - \theta)\,t1\_t\tag{4}$$

$$\text{with } \theta = 2:$$

$$t2\_t(2) = 2 d\_t - t1\_t\tag{5}$$

#### 1.1.4. Extrapolation

In the standard form, *t*1 is extrapolated by a normal linear regression line, while *t*2 is extrapolated using simple exponential smoothing. The extrapolation of the long-term component using a normal linear regression line is as follows:

$$f1\_{t+1} = b\_0 + b\_1(t+1)\tag{6}$$

Extrapolation of the short-term component using SES is:

$$f2\_{t+1} = \alpha\, t2\_t + (1-\alpha)\,f2\_t \tag{7}$$

#### 1.1.5. Combination

The final forecast is a combination of the forecasts of the two theta lines. The forecasts of the long-term and short-term theta lines are combined, in the classic version, with equal weights (*w*1 = *w*2 = 0.5).

$$f\_{t+h} = 0.5 \, f \mathbf{1}\_{t+h} + 0.5 \, f \mathbf{2}\_{t+h} \tag{8}$$

For *θ* > 1 we can use the following equation for optimizing the weights [11]:

$$w\_2 = \frac{\theta\_2 - 1}{\theta\_2 - \theta\_1}, \quad w\_1 = 1 - w\_2\tag{9}$$

Using the above formula, the forecast can be obtained as:

$$f\_{t+h} = w\_1 f \mathbf{1}\_{t+h} + w\_2 f \mathbf{2}\_{t+h} \tag{10}$$

#### 1.1.6. Reseasonalization

If the series was identified as seasonal in step one, then the final forecasts are multiplied by the respective seasonal indices. Therefore, we must add/multiply the seasonality indices (calculated in step two) to the forecasts obtained in step five.
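
For illustration, a minimal sketch of the classic Theta method on deseasonalized data (Equations (3)–(8)) is given below; the SES parameter α would normally be optimized, and reseasonalization is applied afterwards for seasonal series.

```python
import numpy as np

def classic_theta_forecast(d, h, alpha=0.5):
    """Classic Theta method on deseasonalized data d: fit the linear theta line (theta = 0),
    build the theta = 2 line, extrapolate with regression and SES, combine with equal weights."""
    d = np.asarray(d, dtype=float)
    t = np.arange(1, len(d) + 1)
    b1, b0 = np.polyfit(t, d, 1)                      # linear trend: theta line t1 (Eq. 3)
    t2 = 2 * d - (b0 + b1 * t)                        # theta line t2 (Eq. 5)
    f2 = t2[0]                                        # SES extrapolation of t2 (Eq. 7)
    for value in t2:
        f2 = alpha * value + (1 - alpha) * f2
    f1 = b0 + b1 * (len(d) + np.arange(1, h + 1))     # regression extrapolation (Eq. 6)
    return 0.5 * f1 + 0.5 * f2                        # equal-weight combination (Eq. 8)
```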

#### *1.2. ATA Method*

The ATA method was introduced by a Turkish team in 2017 [12] as an alternative to two major forecasting approaches: exponential smoothing (ES) and the autoregressive integrated moving average (ARIMA). The ATA method works by modifying the smoothing constants of various exponential smoothing models in such a way that the smoothing constant becomes a function of *t*. This modification helps deal with initialization problems and makes the optimization process easier. With ATA, it is also possible to assign past information equal weights. The ATA method has a similar form to ES, but the smoothing parameters are modified so that, when obtaining a smoothed value at a specific time point, the number of observations that can contribute to the value being smoothed is taken into account when the weights among the observations are distributed. Therefore, the smoothing parameter for this method is a function of *t*, unlike in ES, where, no matter where the value being smoothed resides on the timeline, the observations receive weights depending only on their distances from it. The standard form of the ATA method is given in the following equations.

$$a\_t = \left(\frac{p}{t}\right)d\_t + \left(\frac{t-p}{t}\right)(a\_{t-1} + b\_{t-1})\tag{11}$$

$$b\_t = \left(\frac{q}{t}\right)(a\_t - a\_{t-1}) + \left(\frac{t-q}{t}\right)b\_{t-1} \tag{12}$$

$$f\_t(h) = a\_t + h\, b\_t \tag{13}$$

$$\begin{array}{l} p \in \{1, 2, \dots, n\}, \ q \in \{1, 2, \dots, p\}, \ t > p \ge q\\ \text{For } t \le p, \ a\_t = d\_t\\ \text{For } t \le q, \ b\_t = d\_t - d\_{t-1} \\ b\_1 = 0 \end{array}$$

where:

*dt*: the value of the original series

*at*: the smoothed value at time *t*

*bt*: the trend component

*p*: the smoothing parameter for level

*q*: the smoothing parameter for trend

*ft*(*h*): the *h*-step-ahead forecast value for *h* = 1, 2, . . .

To be able to handle damped trends in the historical patterns, we can use the damping factor in the ATA method as follows:

$$a\_t = \left(\frac{p}{t}\right)d\_t + \left(\frac{t-p}{t}\right)(a\_{t-1} + \varphi b\_{t-1})\tag{14}$$

$$b\_t = \left(\frac{q}{t}\right)(a\_t - a\_{t-1}) + \left(\frac{t-q}{t}\right)\varphi b\_{t-1} \tag{15}$$

$$f\_t(h) = a\_t + \left(\varphi + \varphi^2 + \dots + \varphi^h\right) b\_t \tag{16}$$

$$\begin{array}{l} p \in \{1, 2, \dots, n\}, \ q \in \{1, 2, \dots, p\}, \ \varphi \in (0, 1], \ t > p \ge q\\ \text{For } t \le p, \ a\_t = d\_t\\ \text{For } t \le q, \ b\_t = d\_t - d\_{t-1} \\ b\_1 = 0 \end{array}$$

where:

*dt*: the value of the original series

*at*: the smoothed value at time *t*

*bt*: the trend component

*p*: the smoothing parameter for level

*q*: the smoothing parameter for trend

*ft*(*h*): the *h*-step-ahead forecast value for *h* = 1, 2, . . .

*ϕ*: the damping parameter

The algorithm for implementing the ATA method can be summarized in three main steps. Step 1: Deseasonalize the data using classical decomposition methods, if necessary. Step 2: Use the appropriate model for each data type.

Find an optimal value for the parameter *p* by minimizing the in-sample one-step-ahead sMAPE using following models and obtain forecast values as desired.

Yearly: ATAadd(*p*, 1, *ϕ*) where *ϕ* ∈ {0.61, 0.62, . . . , 1}

All other types: (quarterly, monthly, weekly, daily, hourly)

ATAadd(*p*, 0, 1) and ATAadd(*p*, 1, 1) were obtained and averaged.

("add" refers to using additive decomposition method in the first step)

Step 3: The forecasts are reseasonalized using seasonal indices from classical decomposition.
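
A compact sketch of the ATA recursions with an optional damped trend (Equations (11)–(16)) is shown below; setting *ϕ* = 1 recovers the standard form.

```python
import numpy as np

def ata_forecast(d, p, q, h, phi=1.0):
    """ATA with (optionally) damped trend, following Equations (14)-(16);
    phi = 1 reduces to the standard form of Equations (11)-(13)."""
    d = np.asarray(d, dtype=float)
    a_prev, b_prev = d[0], 0.0                      # initial level a_1 = d_1, trend b_1 = 0
    for t in range(2, len(d) + 1):
        dt = d[t - 1]
        a_t = dt if t <= p else (p / t) * dt + ((t - p) / t) * (a_prev + phi * b_prev)
        b_t = (dt - d[t - 2]) if t <= q else (q / t) * (a_t - a_prev) + ((t - q) / t) * phi * b_prev
        a_prev, b_prev = a_t, b_t
    damp = np.cumsum(phi ** np.arange(1, h + 1))    # phi + phi^2 + ... + phi^h for each horizon
    return a_prev + damp * b_prev                   # 1..h step-ahead forecasts (Eq. 16)
```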

#### **2. Materials and Methods**

In this work, the Theta method has been considered as the base model. In the Theta method, other models need to be used to extrapolate the theta lines. As explained in the first section, there are two theta lines in the standard form of the Theta method, representing the long-term and short-term components of the demand. In the standard form, the values of the theta parameters are 0 and 2 for the first and second theta lines, respectively. In the proposed model, we tested the Theta method with both two and three theta lines. Given the mentioned unique features of the ATA method and its ability to project the trend in the time series, ATA has been selected to extrapolate the first theta line, which represents the long-term component. Therefore, in the third and fourth steps of the Theta method, we use the formulas of the ATA method (ATA and ATA with damped trend) instead of the simple regression model. This means that in the proposed method we use Equations (11)–(16) instead of Equations (3) and (6).

Furthermore, exponential smoothing methods (SES and DES methods as well as DES with damped trend) are used as the other candidates to extrapolate the other theta lines.

In order to find the best parameters that result in the most accurate forecasts, all parameters are optimized dynamically. As the demand range varies for different stock keeping units (SKUs), we need to attribute different importance to different products in order to pay more attention to the errors on products with higher demand. To do so, the weighted mean absolute error (wMAE) has been selected as a KPI to measure the performance of the forecasting methods, in which the error is normalized based on the scale of demand of each product. The best values for the related parameters are chosen among the potential values for each model at each iteration (see Appendix A). The possible values that we have considered for each parameter are listed in Table 1.
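
Since the paper does not give an explicit formula for the wMAE, the sketch below shows one plausible demand-weighted variant; the exact weighting used in this work may differ.

```python
import numpy as np

def wmae(actual, forecast):
    """Weighted MAE over SKUs: per-SKU absolute errors are aggregated with weights
    proportional to each SKU's total demand, so high-demand products matter more."""
    actual = np.asarray(actual, dtype=float)      # shape: (n_skus, n_periods)
    forecast = np.asarray(forecast, dtype=float)
    weights = actual.sum(axis=1)
    weights = weights / weights.sum()
    sku_mae = np.mean(np.abs(actual - forecast), axis=1)
    return float(np.sum(weights * sku_mae))
```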


**Table 1.** Potential values for different parameters.

#### **3. Results**

The proposed method was implemented in Python 3.7.0 using the NumPy and Pandas libraries. A large demand dataset from a major international company was used to test the model. The dataset includes 54 months of demand data for 19,691 SKUs in the company's European logistics network.

In Tables 2 and 3, we present the results obtained from the different models compared to the results from applying MA and SES (as the benchmarks) to the given data set. We optimized the values of n and α in MA and SES for each SKU. The first column shows the number of periods that we considered as the test set. Indeed, the models are not trained on this part of the data; it remains hidden in order to test the performance of the models. The second and third columns show the results from the classic Theta method, in which the first theta line is the simple regression and the second (and third) theta line is dynamically chosen among SES, DES, and DES with damped trend. The fourth and fifth columns represent the results from the ATA and ATA with damped trend methods compared to those of MA. Finally, the last two columns indicate the results from the proposed method, in which the first theta line has been fitted and extrapolated using the ATA with damped trend method, with two and three theta lines, respectively. As we can see in the last row of the tables, the proposed method performs better than the MA and SES benchmarks by 5.50% and 3.73%, respectively.

**Table 2.** Comparing the results obtained from applying different methods on the given data set with the results of the first benchmark (optimized MA).


**Table 3.** Comparing the results obtained from applying different methods on the given data set with the results of the second benchmark (optimized SES).


#### **4. Conclusions**

One of the main concerns of managers and decision-makers in supply chain systems is obtaining accurate estimates of uncertain parameters, and time series forecasting methods have always been one of the reliable ways to produce such forecasts. In this paper, we combined the advantages of several well-known methods and showed that forecast accuracy can be improved by a combined model in which ATA and ATA with damped trend are used for the first theta line. By testing the performance of the other alternative methods (SES, DES, and DES with damped trend), the best method(s) is selected to extrapolate the other theta line(s). The model is also capable of finding the best value for each parameter with the aim of minimizing the wMAE. We implemented the proposed model in Python and tested it on a real data set. In future work, the results of the proposed model could be compared with those of machine learning algorithms.

**Author Contributions:** Both authors contributed to the research paper equally. All authors have read and agreed to the published version of the manuscript.

**Funding:** Université Libre de Bruxelles.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** We cannot provide the original data because of confidentiality issues.

**Acknowledgments:** We thank the anonymous referees for their insightful comments, which have helped improve this paper substantially.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**



#### **References**


### *Proceeding Paper* **Limitation of Deep-Learning Algorithm for Prediction of Power Consumption †**

**Majdi Frikha 1,\*, Khaled Taouil 2, Ahmed Fakhfakh 2 and Faouzi Derbel 1**


**Abstract:** In recent years, household electricity consumption has risen due to the use of numerous domestic appliances. At the same time, many houses increasingly rely on renewable energy, such as solar generation, energy-storage systems and electric vehicles. For these reasons, forecasting household electricity consumption is essential for managing and planning energy use. Forecasting power consumption is a difficult time-series-forecasting task: the electrical load contains irregular trend components, which makes the demand for electrical energy very difficult to predict with simple forecasting techniques. Therefore, several researchers have turned to intelligent algorithms, such as machine-learning and deep-learning algorithms, to address this problem. In this work, we demonstrate that deep-learning algorithms are not always reliable and accurate in predicting power consumption.

**Keywords:** deep learning; prediction; power consumption

**Citation:** Frikha, M.; Taouil, K.; Fakhfakh, A.; Derbel, F. Limitation of Deep-Learning Algorithm for Prediction of Power Consumption. *Eng. Proc.* **2022**, *18*, 26. https:// doi.org/10.3390/engproc2022018026

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 22 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

Due to economic progress and population growth, global electricity consumption has recently increased. According to the Energy Report published in 2019, global power demand will increase by 2.1% per year, twice the rate of primary energy production in the stated policies scenario [1]. Since electricity must be consumed as it is produced by power plants, it is vital to estimate energy usage ahead of time to ensure a regular supply. Forecasting electricity consumption over a time series is difficult: the data collected by smart sensors may contain redundancies, outliers, missing values and uncertainties, and the electrical load has irregular trend components, which makes the demand for electrical energy very hard to predict with simple forecasting techniques. Therefore, various forecasting strategies for predicting electricity usage have recently been developed. Energy consumption forecasting has been examined using various methodologies, which can be divided into two categories: traditional and artificial techniques [2].

Historically, statistical techniques were mainly used to forecast power demand. In [3], the SARIMA (seasonal autoregressive integrated moving average) model and a fuzzy neural model were compared for predicting electricity consumption. In [4], the hourly and daily electricity use of the studied household was analyzed using linear and quadratic regression models. In [5], a multiple regression approach combined with a genetic algorithm was proposed to forecast the daily power consumption of an administration building. Significant drawbacks of both models include the lack of occupancy data and the fact that neither model has been studied for predicting the energy demand of comparable homes. In another study, the authors used bootstrap-aggregated autoregressive integrated moving average (ARIMA) and exponential smoothing methods to predict energy demand in different countries [6]. In general, statistical methods have shown weaknesses in predicting and capturing the non-linear behavior of energy consumption data over the long term. Furthermore, this computational approach has limited predictive capacity in the presence of non-stationary trends and sharp patterns in energy consumption. As a result, machine-learning approaches have been used to test a variety of prediction models in order to increase predictive quality [7–9]. For example, Liu et al. [10] created a support vector machine (SVM) model to predict and evaluate the energy consumption of public buildings.

Chen et al. [11] proposed a model that predicts power consumption as a function of ambient temperature, exploiting the strong nonlinear regression capability of support vector machines. Pinto et al. [12] proposed an ensemble-learning paradigm that combines three machine-learning methods. Nevertheless, because of the dynamic correlation between variables and data properties that change over time, existing machine-learning algorithms suffer greatly from overfitting, and it is difficult to establish a durable and reliable model under these conditions. Similarly, many deep sequential-learning neural networks have been set up to predict power consumption. A recurrent neural network model was used to estimate energy demand profiles for commercial and residential databases with one-hour-resolution forecasts [13]. A pooling methodology based on a recurrent neural network was developed to reduce overfitting by increasing the amount and diversity of the data [14]. An RNN architecture with LSTM cells was developed to predict power consumption in [15]. Individual household power usage patterns are frequently unpredictable due to factors such as weather conditions and holidays; therefore, predicting power consumption with methods based solely on energy consumption data is unreliable. Our work in this article shows that deep-learning algorithms are not always reliable and accurate in terms of power-consumption prediction.

#### **2. Deep-Learning Algorithms**

#### *2.1. Long Short-Term Memory Neural Network (LSTM)*

Long short-term memory (LSTM) networks are an advanced form of recurrent neural networks (RNNs) in which the original cell neurons are replaced by memory cells. The LSTM inherits the key feature of the RNN of treating the input as a connected time series, and the more complex structure of the LSTM cell addresses the vanishing-gradient problem. The LSTM cell has four key elements: the input gate, the cell state, the forget gate and the output gate. The information held in the cell state is maintained, updated and deleted via the forget, input and output gates.

The forget gate, as the name implies, decides what information to discard or retain from the previous step; this is done by the first sigmoid layer. The next step is to determine what information should be stored in the new cell state: a sigmoid function, the input gate layer, determines which values to update, and a tanh layer then generates a new vector of candidate values that can be added to the state. These are combined to produce the state update. The cell state acts as the memory of the LSTM, which is why the LSTM outperforms vanilla RNNs when processing longer input sequences. At each time step, the previous cell state is coupled to the forget gate to decide which information to carry forward and is then combined with the input gate to form the new cell memory. Finally, the LSTM cell must produce an output: the cell state obtained above is passed through the hyperbolic tangent function, so its values are filtered between −1 and 1.
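For reference, the gate interactions described above correspond to the standard LSTM cell equations (the usual textbook formulation rather than notation taken from this paper), where σ is the logistic sigmoid and ⊙ denotes element-wise multiplication:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{(candidate values)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(cell-state update)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(filtered output)}
\end{aligned}
```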

#### *2.2. Gated Recurrent Unit (GRU)*

The internal structure of an LSTM cell has three gates (input, forget and output), whereas a GRU cell has only two: a reset gate and an update gate [16]. The update gate determines how much of the preceding cell's memory is kept, and the reset gate determines how the incoming input sequence is combined with the previous cell's memory. The GRU therefore differs from the LSTM in a few respects: the GRU cell has two gates while the LSTM has three, and the input and forget gates of the LSTM are merged into the update gate, while the reset gate is applied directly to the previous hidden state.
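Analogously, the standard GRU cell can be written as follows (again the common textbook formulation, not the paper's own notation):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) &&\text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) &&\text{(reset gate)}\\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) &&\text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t &&\text{(state update)}
\end{aligned}
```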

#### *2.3. Convolutional Neural Network (CNN)*

A convolutional neural network (CNN) is a type of neural network originally designed for image-classification problems involving 2D data, but it can also be applied to 1-dimensional data in time-series tasks. A CNN uses weight sharing to provide better performance on difficult problems such as time-series forecasting and power-demand forecasting. Convolutions applied to the input data produce feature maps, and a pooling layer placed after the convolution layer condenses these feature maps into a more abstract representation.
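As an illustration of this idea, the following sketch defines a small 1-D CNN for next-step load forecasting in Keras; the paper does not report its CNN layout, so the 60-step input window, the layer sizes and the single-output dense layer are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical 1-D CNN for one-step-ahead load forecasting.
model = keras.Sequential([
    keras.Input(shape=(60, 1)),                            # 60 past readings, 1 feature
    layers.Conv1D(32, kernel_size=3, activation="relu"),   # shared weights along time
    layers.MaxPooling1D(pool_size=2),                       # condense the feature map
    layers.Flatten(),
    layers.Dense(1),                                        # next-step consumption
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```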

#### **3. Results and Discussion**

Accurate forecasting of electricity consumption improves energy utilization rates and helps building management make better energy management decisions, thus saving significant amounts of energy and money. However, due to the dynamics and random noise of the data, accurate prediction of power consumption is a difficult goal. In this paper, a framework was designed to obtain accurate results for power-demand prediction. The methodology involves three steps: data processing, training, and evaluation. First, the incoming data are preprocessed to remove outliers, missing values, and redundant values, and a min-max scaling technique is used to normalize the input dataset to a given range. The data are then analyzed and passed to the training phase, where we test the CNN, LSTM, GRU and 3-layer LSTM models. Finally, we evaluate the models using metrics such as MSE and RMSE. These measurements quantify the difference between predicted and actual values: the MSE is the mean of the squared differences between actual and predicted values, and the RMSE is its square root, expressed in the same units as the target variable.

The performance of the models was validated using the IHEPC dataset available in the UCI machine-learning repository [17]. IHEPC is a freely accessible household dataset containing electricity consumption information from 2006 to 2010. It contains more than 2 million records; missing values account for 1.25 percent of the data (around 26,000 values) and are handled during the preprocessing step. The dataset thus provides nearly four years of power-demand measurements at a 1-min sampling rate.

We divided the data into a training set and a test set: 75% of the data are used for training and the remaining 25% for testing. The model is tuned on the training set, and the fitted model then predicts the output values for the unseen test data. This method is appropriate for home applications and saves simulation time. We ran several tests on different deep-learning models for comparison, namely the LSTM, CNN, GRU, and 3-layer LSTM models; each predictive model was trained for up to 25 epochs.
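A minimal sketch of this pipeline (preprocessing, min-max scaling, windowing, 75/25 split and the two error metrics) is given below; the 60-step input window and the Global_active_power target field are assumptions, as the paper does not state which field of the IHEPC file is forecast.

```python
import numpy as np
import pandas as pd

# The IHEPC file uses ';' as separator and '?' for missing values.
df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?",
                 low_memory=False)
y = df["Global_active_power"].astype(float).interpolate()   # fill missing values

y_scaled = (y - y.min()) / (y.max() - y.min())               # min-max normalization

def make_windows(series, lookback=60):
    """Turn the series into (samples, lookback, 1) inputs and next-step targets."""
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    return X[..., None], series[lookback:]

X, t = make_windows(y_scaled.to_numpy())
split = int(0.75 * len(X))                                   # 75% train / 25% test
X_train, X_test = X[:split], X[split:]
t_train, t_test = t[:split], t[split:]

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    return float(np.sqrt(mse(y_true, y_pred)))               # square root of the MSE
```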

The models were trained on an HP Omen PC with an Intel Core i5 processor and 16 GB of RAM. The implementation was done in Python 3 using Keras with the TensorFlow backend and the Adam optimizer.

In this article, we studied several deep-learning models to find an optimal model for short-term power-consumption prediction. We performed tests on the CNN, LSTM, GRU and 3-layer LSTM models. The simulation results show that the 3-layer LSTM model performs best among the four.

For example, the CNN achieved an MSE of 0.05 and an RMSE of 0.23, the LSTM an MSE of 0.04 and an RMSE of 0.21, the GRU an MSE of 0.04 and an RMSE of 0.22, and the 3-layer LSTM an MSE of 0.04 and an RMSE of 0.19, as shown in Table 1.


**Table 1.** Performance of different deep-learning models over IHEPC dataset.

Figure 1 presents the prediction performance of the 3-layer LSTM model on the IHEPC dataset. The model correctly predicts the difficult values associated with strong variations in electricity consumption; at this resolution, the results are therefore precise and reliable for power-consumption prediction.

**Figure 1.** Prediction performance of 3-layer LSTM model over IHEPC dataset.

Figure 2 presents the architecture of the 3-layer LSTM model used in our work. We used three LSTM layers: the first with 128 neurons and tanh activation; the second with 128 neurons, a linear activation function, and a dropout of 0.25; and the third with 64 neurons, ReLU activation, and a dropout of 0.25. These results correspond to a prediction timestamp of 1 min. In real applications, however, smart meters use larger timestamps, such as 15 min or 60 min. For this reason, we tested the best model, the 3-layer LSTM, with timestamps of 5 min, 15 min and 60 min in order to assess its reliability and accuracy. The simulation results show that the prediction performance degrades significantly as the timestamp increases, which greatly affects the reliability of the 3-layer LSTM model. Table 2 shows the results for each timestamp.

**Figure 2.** Architecture of 3-layer LSTM model.
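A Keras sketch consistent with the architecture described above is shown below; the 60-step input window, the placement of dropout as separate layers and the final dense output layer are assumptions not stated in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the 3-layer LSTM described in the text (128 tanh, 128 linear, 64 ReLU).
model = keras.Sequential([
    keras.Input(shape=(60, 1)),
    layers.LSTM(128, activation="tanh", return_sequences=True),
    layers.LSTM(128, activation="linear", return_sequences=True),
    layers.Dropout(0.25),
    layers.LSTM(64, activation="relu"),
    layers.Dropout(0.25),
    layers.Dense(1),                      # one-step-ahead consumption
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, t_train, epochs=25, validation_data=(X_test, t_test))
```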

**Table 2.** Performance of different time stamps over 3-layer LSTM model.


Figures 3 and 4 present the prediction results for the 15 min and 60 min timestamps. When there is a sudden change or a peak in consumption, the model fails to predict the exact value. These results show that deep-learning algorithms are not always reliable for predicting power consumption. Several factors affect prediction performance, notably the reduction in the amount of data available to the model: a real application cannot provide 2 million datapoints at a 15 min or 60 min timestamp, since roughly 60 years of measurements would be needed to reach that amount at a 15 min timestamp. Therefore, we need an algorithm that can efficiently predict power consumption on any database at any timestamp.

**Figure 3.** Prediction performance of 3-layer LSTM model over 15 min time stamp.

**Figure 4.** Prediction performance of 3-layer LSTM model over 60 min time stamp.
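For completeness, the sketch below shows how the 1-min IHEPC series could be aggregated to the coarser timestamps discussed above; the mean aggregation rule and the target field are assumptions, since the paper does not state how the 15 min and 60 min series were built.

```python
import pandas as pd

df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?",
                 low_memory=False)
dt = pd.to_datetime(df["Date"] + " " + df["Time"], dayfirst=True)
y = pd.Series(df["Global_active_power"].astype(float).values, index=dt).interpolate()

y_15min = y.resample("15min").mean()
y_60min = y.resample("60min").mean()
# ~2 million 1-min samples shrink to roughly 138,000 (15 min) and 34,500 (60 min)
# samples, illustrating the data reduction discussed above.
print(len(y), len(y_15min), len(y_60min))
```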

#### **4. Conclusions**

In this work, we carried out a comparative study on predicting the power consumption of a specific house. The study was performed on the publicly available IHEPC dataset. A min-max scaler was used to normalize the input data before passing them to the training step. We then examined several deep-learning models to assess their performance and prediction accuracy for different time stamps. We conclude that deep-learning algorithms are not always reliable and accurate for power-consumption prediction, which confirms the limitations of deep-learning models for this task.

**Author Contributions:** Conceptualization, F.D. and A.F.; methodology, F.D.; software, F.D.; validation, F.D., A.F. and K.T.; formal analysis, F.D.; investigation, M.F.; resources, M.F.; data curation, M.F.; writing—original draft preparation, M.F.; writing—review and editing, M.F.; visualization, M.F.; supervision, F.D.; project administration, F.D.; funding acquisition, A.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

