**Preface to "The 8th International Conference on Time Series and Forecasting"**

The ITISE 2022 (8th International Conference on Time Series and Forecasting) seeks to provide a discussion forum for scientists, engineers, educators and students on the latest ideas and realizations in the foundations, theory, modeling and applications of time series analysis and forecasting, encompassing interdisciplinary and multidisciplinary research across computer science, mathematics, statistics, forecasting and econometrics.

**Ignacio Rojas, Hector Pomares, Olga Valenzuela Cansino, Fernando Rojas and Luis Javier Herrera** *Editors*

### *Editorial* **Statement of Peer Review †**

### **Olga Valenzuela <sup>1</sup>, Fernando Rojas <sup>2</sup>, Luis Javier Herrera <sup>2</sup>, Hector Pomares <sup>2</sup> and Ignacio Rojas <sup>2,\*</sup>**


In submitting conference proceedings to *Engineering Proceedings*, the volume editors of the proceedings certify to the publisher that all papers published in this volume have been subjected to peer review performed by the volume editors. Reviews are conducted by expert referees adhering to the professional and scientific standards expected of a proceedings journal.


**Citation:** Valenzuela, O.; Rojas, F.; Herrera, L.J.; Pomares, H.; Rojas, I. Statement of Peer Review. *Eng. Proc.* **2022**, *18*, 43. https://doi.org/10.3390/engproc2022018043

Published: 6 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

### *Proceeding Paper* **Evaluating a Recurrent Neural Network Model for Predicting Readmission to Cardiovascular ICUs Based on Clinical Time Series Data †**

**Sobhan Moazemi \*,‡,§, Sebastian Kalkhoff §, Steven Kessler, Zeynep Boztoprak, Vincent Hettlich, Artur Liebrecht, Roman Bibo, Bastian Dewitz, Artur Lichtenberg, Hug Aubin and Falko Schmid**

> Digital Health Lab Düsseldorf, Department of Cardiac Surgery, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, 40225 Düsseldorf, Germany; sebastian.kalkhoff@med.uni-duesseldorf.de (S.K.); steven.kessler@med.uni-duesseldorf.de (S.K.); zeynep.boztoprak@hhu.de (Z.B.); vincenthendrik.hettlich@med.uni-duesseldorf.de (V.H.); artur.liebrecht@med.uni-duesseldorf.de (A.L.); roman.bibo@med.uni-duesseldorf.de (R.B.); bastian.dewitz@med.uni-duesseldorf.de (B.D.); artur.lichtenberg@med.uni-duesseldorf.de (A.L.); hug.aubin@med.uni-duesseldorf.de (H.A.); falko.schmid@med.uni-duesseldorf.de (F.S.)

**\*** Correspondence: sobhan.moazemi@med.uni-duesseldorf.de

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.


**Abstract:** Unexpected readmission to intensive care units (ICUs) endangers patients' lives due to premature patient transfers or prolonged stays at the care units. This can be mitigated by stratification of the readmission risk at discharge times using state-of-the-art machine learning (ML) methods. We fitted two alternative recurrent neural network (RNN) models based on long short-term memory (LSTM) on the Medical Information Mart for Intensive Care (MIMIC-III) dataset and evaluated them with an independent cohort from our hospital's ICU (UKD). The first model processed all the available time series data from each patient's ICU stay, whereas the second model focused on the data from the last 48 h of the ICU stay prior to transfer. Our readmission prediction on MIMIC data reached an area under the receiver operating characteristic curve (AUC-ROC) of 0.82. Furthermore, the model with the 48 h time frame outperformed the other model when both models were applied to the independent test cohort. The results suggest that the RNN model for time series forecasting holds promise for future use as a clinical decision support tool, although follow-up studies with larger cohorts as well as user studies should be conducted to assess the generalizability and usability of the methods, respectively.

**Keywords:** readmission prediction; intensive care unit (ICU); recurrent neural network (RNN); long short-term memory (LSTM); machine learning (ML); time series analysis; health forecasting

**Citation:** Moazemi, S.; Kalkhoff, S.; Kessler, S.; Boztoprak, Z.; Hettlich, V.; Liebrecht, A.; Bibo, R.; Dewitz, B.; Lichtenberg, A.; Aubin, H.; et al. Evaluating a Recurrent Neural Network Model for Predicting Readmission to Cardiovascular ICUs Based on Clinical Time Series Data. *Eng. Proc.* **2022**, *18*, 1. https://doi.org/10.3390/engproc2022018001

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 16 June 2022

#### **1. Introduction**

According to different retrospective and review studies [1–3], the rates of readmission to ICUs in hospitals in developed countries are quantified inconsistently, in the range from 0.14 to 14.5 percent. Regardless of the inconsistent readmission rates, patients who are readmitted to ICUs due to an inappropriate discharge time are subject to life-threatening risks, with respiratory and cardiac complications as the most common causes of readmission [1,4]. At the same time, prolonged stays at ICUs can also lead to increased mortality or poor long-term prognosis for patients [5]. In particular, patients who undergo cardiac surgery are prone to longer ICU stays due to the invasive nature of heart surgery and the resulting increased risk of postoperative complications [6]. Therefore, it is important to find the optimal point in time for transfer for this particularly vulnerable group of patients. Intensivists need to quantify patients' readmission risks in order to determine this point in time and avoid unplanned readmissions. This implies an increasing need for clinical decision support tools which complement current discharge routines through the use of predictive models taking advantage of artificial intelligence (AI).

AI in general and ML in particular have been widely used in many diagnostic and prognostic domains, such as oncology [7,8] and computational neuroscience [9,10], to provide clinical decision support tools, which either replace or complement established diagnostic and prognostic routines. More specifically, time series analysis has been successfully applied in many medical domains, including management of type 2 diabetes [11], hospital admission prediction [12], age-related death prediction [13] and ICU readmission prediction [3,14].

From a technical point of view, as related works suggest, a variety of techniques, ranging from simple probabilistic predictors such as logistic regression (LR), through ML-based classifiers such as support vector machines (SVMs), to more sophisticated models such as convolutional neural networks (CNNs), have shown their potential for the prediction of readmission to ICUs [3,15,16].

In a previous study [17], currently under review, we trained a model with a recurrent architecture on a subset of the MIMIC-III dataset, an open access dataset of time series data from more than 50,000 care unit stays with different health conditions at a single center [18], without any limitation on the time frames of the input time series data, and showed its superiority to logistic regression and a feed-forward network in predicting readmission. In the current study, we analyze the performance of a recurrent neural network (RNN) model processing health-related time series data, taking advantage of the long short-term memory (LSTM) method with slight architectural enhancements compared to the previous model, considering time series data from the last 48 h of patients' stays at ICUs, sampled in 60 min intervals.

The main contribution of our work is to propose and evaluate two alternative ML-based models, which are trained and fit using a publicly available dataset of health-related time series data, with and without concrete time windows (in compliance with state-of-the-art metrics in the domain) prior to patients' discharge from ICUs. Moreover, as a further contribution, the proposed models were evaluated using a real-world hospital cohort with data from our own cardiovascular intensive care unit at UKD to assess how they generalize to new sets of unseen data.

In the next sections, first, a detailed overview of the patient data and methods is given, and then the results of the model training and evaluations are presented. Finally, the relevance of the findings and possible future follow-ups are discussed.

#### **2. Materials and Methods**

This section explains the methodology of the study. First, the details of the dataset curation for the training, validation and test steps are elaborated, including the selection criteria and cohort statistics, followed by a description of the integrated time series data. Then, the details of the training and validation step using 5-fold cross-validation on the MIMIC-III dataset are discussed. Finally, the details of the test step on an unseen set of data from a held-out MIMIC cohort as well as the UKD database are presented. The entire analysis pipeline was developed in-house in the Python (v3.6) programming language. The LSTM models for time series data analysis are implemented using the PyTorch library [19].

The inclusion criteria of subjects from both MIMIC and UKD datasets have been defined as follows:

1. **Patients who were transferred from the ICU to a normal ward and then returned to the ICU within 48 h.** Patients whose first ICU stay was shorter than 24 h are excluded, as such short stays may be due to logistical reasons. These patients are labeled as 'readmitted'.

2. **Patients who visited the ICU only once and were not readmitted within 30 days after their discharge.** These patients are labeled as 'non-readmitted'.


#### *2.1. Patient Cohorts*

For the training, we focused on the cases of patients from the MIMIC-III dataset who visited cardiovascular care units. As a result, the training cohort consisted of 11,513 patients, of which 966 were labeled as readmitted and 10,547 as non-readmitted. The training subjects' ages ranged from 17 to 89 years (mean (M) = 66, standard deviation (SD) = 13.51). For anonymization purposes, the real admission times are not available in the MIMIC dataset.

For test purposes, a cohort of 502 patients who visited the cardiovascular ICU at the UKD was retrospectively analyzed. Among these, 100 patients returned to the ICU within 48 h after discharge (labeled as 'readmitted') and 402 patients visited the ICU just once, without being readmitted within 30 days after their discharge (labeled as 'non-readmitted'). The ages of the subjects from the UKD database ranged from 19 to 95 years (M = 67, SD = 12.25). The cohort corresponds to cardiovascular ICU stays in the time period from January 2017 to December 2020. Due to the retrospective character of the study, patient consent was waived.

#### *2.2. Time Series and Patient Data*

For each subject in the train or test cohort, 10 vital or laboratory variables (selected by highly qualified cardiovascular surgeons) were captured during the patient's stay at the corresponding intensive care unit. To form the time series data for all the subjects, 60 min time intervals were used. Missing values were imputed using the mean value of the corresponding variable throughout the entire corresponding cohort. Moreover, the age and weight values were extracted from the patients' files. Finally, the length of stay (LoS) of each subject was quantified from the admission time entries of the corresponding database (MIMIC or UKD) and added to the rest of the features, resulting in feature vectors of size 13. Table 1 lists the 10 variables used for the time series data analysis and the 3 extra patient variables.

**Table 1.** The input parameters, including time series data from laboratory or vital values and information from patient files (ABP: arterial blood pressure, LoS: length of stay). The table is adapted from [17].


#### *2.3. Preprocessing*

Preprocessing of the input data is applied similarly to our previous work [17] and consists of the following steps: First, the time series features and patient variables are obtained from the corresponding dataset (MIMIC or UKD). Then, all values are standardized and missing data are replaced with the corresponding distribution's mean value. In the last step, the data are both re-sampled and oversampled before being processed by the corresponding model. Figure 1 gives an overview of the preprocessing pipeline.

**Figure 1.** Preprocessing pipeline. First, the data are extracted and standardized, and then the missing data are filled in. Finally, re-sampling and oversampling are applied before the data are fed to the models. This figure is adapted from previous work [17].

In the feature extraction step, the case IDs of the subjects who met our selection criteria were processed to acquire the feature values from the corresponding database using appropriate queries. Then, normalization is applied at two different levels. First, features which might have been stored in different units (such as weight) were unified and converted accordingly, and cases with extreme and outlier values were removed from the cohorts. In the second step, we applied standardization around the mean value to end up with normalized feature vectors. The standardization of each variable is applied as defined in Equation (1):

$$\hat{s}_i = \frac{s_i - \mu_x}{\sigma_x} \tag{1}$$

where $\hat{s}_i$ is the standardized value, $s_i$ is the original value, $\mu_x$ is the mean of the variable over the whole cohort $x$ and $\sigma_x$ is the standard deviation over $x$.

As required by the LSTM method, all the input feature vectors must be of equal size before being analyzed. Thus, for all the timestamps without an actual measurement, the missing value was filled in with zero (which is the mean value after standardization). Moreover, as the vital and lab values of patients are usually measured at fragmented timestamps, the feature vectors extracted from the databases were non-uniformly distributed along the subject cohorts. Thus, to unify the entire cohorts, the time series parameters are resampled to a frequency of one entry per hour. If there is more than one measurement per hour, the mean of that hour is used as the value. Alternatively, if there is no record for a time step, the mean over the whole cohort is used. Furthermore, the imbalance in the label distribution (there are more non-readmitted cases than readmitted ones in both cohorts) might result in a deep learning model that is not able to predict the minority class labels accurately [20]. Therefore, the data for training the LSTM models were oversampled in order to have evenly distributed classes.
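The following sketch illustrates these preprocessing steps in Python with pandas and NumPy; the function and column names (e.g., `timestamp`) are hypothetical, as the paper does not specify the exact implementation:

```python
import numpy as np
import pandas as pd

def standardize(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Equation (1): subtract the cohort mean and divide by the cohort SD."""
    for c in cols:
        df[c] = (df[c] - df[c].mean()) / df[c].std()
    return df

def resample_hourly(stay: pd.DataFrame) -> pd.DataFrame:
    """One entry per hour: the mean of all measurements within each hour."""
    hourly = stay.set_index("timestamp").resample("1H").mean()
    # After standardization the cohort mean is 0, so missing hours become 0.
    return hourly.fillna(0.0)

def oversample(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Duplicate minority-class stays until both classes are balanced."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    extra = rng.choice(minority, size=(y == 0).sum() - minority.size)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]
```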

#### *2.4. Model Training*

In a previous study [17], we showed the superiority of an RNN model based on the LSTM architecture to logistic regression and a feed-forward network using the MIMIC-III dataset. Therefore, in the current study, two alternative models with LSTM architecture are analyzed, compared and further evaluated on unseen sets of data from the open access (MIMIC) and in-house (UKD) databases. Both models are trained and validated using a randomly selected subset of the MIMIC-III dataset. The first model, which will be denoted as the 'cropped' model from now on, applies a fixed window of the last 48 h of each ICU stay. The second model, which will be denoted as 'uncropped' in the remainder of the text, applies no limits on the available time series data from each subject.

As a special kind of recurrent neural network (RNN), long short-term memory (LSTM) is able to learn long-term dependencies in time series data. The input to an LSTM is a batch of arrays containing time series data. Although the LSTM model can handle inputs with varying lengths, the input length has to be equal inside each batch. Thus, to pad each time series with zeros until they have equal length, a PyTorch object is used.
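One way to realize this padding (a sketch, not necessarily the authors' exact implementation) is PyTorch's `pad_sequence` utility:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical mini-batch: three stays with 10 features and varying lengths.
batch = [torch.randn(36, 10), torch.randn(48, 10), torch.randn(12, 10)]

# Zero-pad every sequence to the length of the longest one in the batch,
# yielding a (batch, max_len, features) tensor that the LSTM can consume.
padded = pad_sequence(batch, batch_first=True)   # shape: (3, 48, 10)
lengths = torch.tensor([36, 48, 12])             # kept for later masking
```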

To be able to capture non-linear relations between the input variables and the target labels, a hidden layer with a size of 50 neurons is applied after the LSTM pipeline. The input to the models is the time series data. For the cropped model, a fixed window of 48 h is used, while, for the uncropped model, no constraint on the length of the time series data is applied. After processing the input time series batches, the three patient variables (age, weight and LoS) are concatenated to the feature vectors. Then, a rectified linear unit (ReLU) is applied as the activation function, while a dropout layer with a value of 0.3 is used for regularization purposes. The final fully connected layer consists of 2 neurons, which are passed through the softmax function, which quantifies the final class probabilities (one for each label).
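A minimal PyTorch sketch of this architecture follows; the LSTM hidden size is an assumption, as only the 50-neuron hidden layer, the 0.3 dropout rate and the 2 output neurons are stated in the text:

```python
import torch
import torch.nn as nn

class ReadmissionLSTM(nn.Module):
    def __init__(self, n_features: int = 10, lstm_hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, lstm_hidden, batch_first=True)
        # +3 for the concatenated patient variables (age, weight, LoS)
        self.hidden = nn.Linear(lstm_hidden + 3, 50)
        self.dropout = nn.Dropout(0.3)
        self.out = nn.Linear(50, 2)

    def forward(self, x: torch.Tensor, patient: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)             # last hidden state per sequence
        z = torch.cat([h_n[-1], patient], 1)   # append age, weight and LoS
        z = self.dropout(torch.relu(self.hidden(z)))
        return torch.softmax(self.out(z), 1)   # class probabilities
```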

Figure 2 illustrates the architecture of the LSTM-based models. The $x_i$ input refers to the $i$-th time step of all time series. Table 2 summarizes the hyperparameters for model fitting.

**Figure 2.** LSTM models' architecture: First, the time series data are passed to the LSTM part. Then, the patient information (age, weight and length of stay) is concatenated to the feature vectors from LSTM output. Next, the ReLU and dropout are applied before the feature vectors are passed through the hidden layers. Finally, the softmax function is used to assign probabilities for each binary label.

**Table 2.** The hyperparameters of the LSTM-based model. The table is adapted from [17].


#### *2.5. Model Validation*

The training cohort from the MIMIC dataset is first split into separate train and validation cohorts with similar ratios of target labels. Then, to validate the model internally, 5-fold cross-validation is applied with random folds. As a result, the hyperparameters of the best-performing models are stored to be used for the corresponding test steps. The training of each model continued until the area under the curve of precision recall (AUC-PR) stopped increasing for several epochs. Subsequently, the model that achieved the highest AUC-PR in the cross-validation was applied to the separate validation set. This procedure was repeated 10 times and the mean and standard deviation of the performance metrics were computed accordingly.
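A sketch of this validation loop is shown below; `fit_lstm` and `predict_proba` are hypothetical wrappers around model training and scoring, and scikit-learn's `average_precision_score` is used as a stand-in for AUC-PR:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import average_precision_score

def validate(X, y, n_splits=5, seed=0):
    # Separate validation cohort with a similar ratio of target labels.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    best_model, best_aucpr = None, -np.inf
    folds = StratifiedKFold(n_splits, shuffle=True, random_state=seed)
    for tr, te in folds.split(X_tr, y_tr):
        model = fit_lstm(X_tr[tr], y_tr[tr])   # early-stops on AUC-PR
        aucpr = average_precision_score(
            y_tr[te], predict_proba(model, X_tr[te]))
        if aucpr > best_aucpr:
            best_model, best_aucpr = model, aucpr
    # The best fold model is then applied to the separate validation set.
    return best_model, average_precision_score(
        y_val, predict_proba(best_model, X_val))
```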

#### *2.6. Model Evaluation*

Both of the trained models are evaluated against unseen cohorts from the MIMIC-III and UKD datasets. To this end, two independent test cohorts act as representatives of unseen data. To quantify the model performance on each of the held-out test cohorts, each of the cropped and uncropped models, which were trained and fit in the validation step, is applied to the independent test cohorts. Then, the performance of the models is quantified and compared in terms of balanced accuracy, AUC-PR, precision, recall and AUC-ROC.
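These metrics could be computed with scikit-learn as sketched below; the 0.5 decision threshold is an assumption, and `average_precision_score` again stands in for AUC-PR:

```python
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, prob, threshold=0.5):
    """prob: predicted probability of the 'readmitted' class."""
    pred = (prob >= threshold).astype(int)
    return {
        "balanced_accuracy": balanced_accuracy_score(y_true, pred),
        "auc_pr": average_precision_score(y_true, prob),
        "precision": precision_score(y_true, pred),
        "recall": recall_score(y_true, pred),
        "auc_roc": roc_auc_score(y_true, prob),
    }
```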

#### **3. Results**

In this section, the results of the model validation and evaluation are presented. First, the results obtained by different models in the training and validation phase will be elaborated. Then, the models' performance on the test cohorts will be quantified and compared.

#### *3.1. Model Training and Validation*

The results of the model training with five-fold cross-validation on the MIMIC dataset are summarized in Table 3. Both cropped and uncropped models performed well on held-out validation subsets within the training cohort, with AUC-ROCs of 0.877 and 0.859, respectively.

**Table 3.** Evaluation metrics for the validation set from MIMIC database.


#### *3.2. Model Evaluation*

The results of the evaluation steps on the held-out cohorts from the MIMIC and UKD databases are illustrated in Table 4 and Figures 3 and 4, comparing the models' performance with each other and with an unskilled classifier. In general, we observe a remarkable gap in the performance of the models as applied to the two groups of held-out data. Both the cropped and uncropped models perform reasonably well on the held-out set from MIMIC, with AUC-ROCs of 0.796 and 0.828, respectively. On the UKD test data, the cropped model (AUC-ROC = 0.554) performed better than the uncropped model (AUC-ROC = 0.517).

**Table 4.** Evaluation metrics for the test sets from UKD and MIMIC databases.


**Figure 3.** Results of the two alternative models' predictions on the unseen cohort from MIMIC dataset: (**left**) the precision–recall curve; (**right**) the receiver operating characteristic curve. The dashed lines denote the performance expected by an unskilled classifier.

**Figure 4.** Results of the two alternative models' predictions on the unseen dataset from UKD: (**left**) the precision–recall curve; (**right**) the receiver operating characteristic curve. The dashed lines denote the performance expected by an unskilled classifier.

#### **4. Discussion**

Predicting the risk of readmission to intensive care units at discharge times through the use of artificial intelligence can support healthcare systems in managing their resources in a more efficient way. As unplanned readmissions might result in longer stays as well as health hazards for patients [1,4], it is critical to provide a quantification of the readmission risks as an additional asset for the intensivists' decision making. Related works suggest that different vital and laboratory values of patients sampled in the last 48 h prior to the patient's discharge from the intensive care unit contribute the most to whether the patient will return to the care unit [21]. Thus, state-of-the-art time series forecasting methods on health-related data from ICU stays could be of great importance for the analysis of patients' readmission risk. Therefore, the motivation behind this study has been to propose and evaluate two alternative models, based on time series forecasting, for readmission prediction for patients visiting cardiovascular ICUs after heart surgery.

The first model analyzes all available vital and lab values for all the participating subjects during their stays at the corresponding ICU. Alternatively, the second model takes into account only the values from the last 48 h prior to discharge from the care unit. As elaborated in the results, both of the alternative models performed reasonably well in predicting patients who would be readmitted to the ICU when they were applied to an unseen cohort from the MIMIC dataset. Furthermore, we showed that the model with the 48 h time limit outperformed the other model in predicting readmissions for a cohort of unseen cases from UKD, which conforms with the findings from related work [21]. The superiority of the model with a consistent number of timestamps can be justified, as data-driven ML-based methods typically perform more efficiently on data cohorts with higher consistency [22]. Moreover, it signifies the relevance of the patients' clinical factors measured in the most recent timestamps prior to transfer for predicting readmission risks.

Overall, the findings reveal the predictive potential of the AI-based approach for health forecasting after evaluating its performance on unseen datasets. However, there are some drawbacks to the currently applied methodology, which should be mitigated in future follow-ups. As for any other time series model, the methodology towards health forecasting as described in this paper would also be affected by the issue of missing data. For this study, we replaced missing measurements with the corresponding mean value throughout the whole cohort. For the future follow-ups, we will consider alternative AI-based imputation methods such as bidirectional recurrent imputation for time series (BRITS) [23], which would most likely help to further improve the robustness of our methods. Furthermore, most of the state-of-the-art ML-based models with deep architectures are less interpretable due to their 'black box' nature. Therefore, one of our future works will be to provide an ML-based approach towards ICU readmission prediction in an explainable manner, i.e., by following explainable AI (XAI) best practices. This will further help to give the medical experts higher levels of interpretability and confidence in AI analyses.

The results of the evaluation step reveal an inconsistency between the time series measurements from the two databases (MIMIC-III and UKD), also known as the dataset shift problem [24]. This might reflect different protocols and standards for patient care and discharge in the corresponding care units and should be diminished in the future using appropriate transfer learning techniques or by training new models on larger cohorts which better fit the corresponding protocols. Thus, we are gathering more annotated data to prepare a suitable cohort of ICU stays at UKD for further follow-ups. Nonetheless, even these relatively poor preliminary results on the data from UKD are slightly better than the performance of an unskilled classifier. These findings identify room for improvement in the readmission prediction accuracy of the AI model. Although the AI performance might be considered far from perfect, even highly experienced intensivists will not predict ICU readmissions perfectly. Therefore, to further quantify how the proposed AI approach would compete with experienced domain experts, we intend to conduct experiments in which the performance of the AI model for ICU readmission prediction is compared to that of a group of experienced intensivists. Nevertheless, it should be noted that the proposed methods are designed as decision support tools for medical experts and therefore should not be used as an autonomous system for decision making.

Another important aspect of any clinical decision support tool is usability. In this regard, as part of the upcoming follow-ups, we intend to conduct proper empirical studies to assess how a provided AI assistant user interface would satisfy user needs in terms of experience and performance. Moreover, to account for better generalizability, the integration of datasets from other hospitals and intensive care units would be a useful next step to take.

#### **5. Conclusions**

The aim of this study has been to train and evaluate a deep model with a recurrent architecture to predict the readmission of patients who visited cardiovascular ICUs based on time series data of vital and lab values. To this end, the RNN model based on the LSTM method was trained with open access data and tested using independent hospital data. Our findings revealed the potential of the proposed AI-based methodology to assist domain experts at cardiovascular intensive care units in better quantifying readmission risks at discharge times. To further address the usability and generalizability aspects of the methods, empirical user studies and experiments with data from other care units should be conducted.

**Author Contributions:** Conceptualization, S.M., S.K. (Sebastian Kalkhoff), S.K. (Steven Kessler), H.A. and F.S.; methodology, S.M., S.K. (Sebastian Kalkhoff), S.K. (Steven Kessler) and Z.B.; software, S.K. (Steven Kessler) and Z.B.; validation, S.K. (Steven Kessler), Z.B. and H.A.; formal analysis, S.M., S.K. (Sebastian Kalkhoff), H.A., F.S. and A.L. (Artur Lichtenberg); resources, A.L. (Artur Liebrecht), R.B., B.D., H.A., F.S. and A.L. (Artur Lichtenberg); data curation, V.H., S.K. (Steven Kessler), Z.B. and H.A.; writing—original draft preparation, S.M., S.K. (Sebastian Kalkhoff), S.K. (Steven Kessler), Z.B., A.L. (Artur Liebrecht), R.B. and B.D.; writing—review and editing, A.L. (Artur Liebrecht), R.B., B.D., A.L. (Artur Lichtenberg), H.A. and F.S.; visualization, S.K. (Steven Kessler) and Z.B.; supervision, S.M., S.K. (Sebastian Kalkhoff), A.L. (Artur Lichtenberg), H.A. and F.S.; project administration, A.L. (Artur Lichtenberg), H.A. and F.S.; funding acquisition, A.L. (Artur Lichtenberg), H.A. and F.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors disclose receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the Federal Ministry of Education and Research of Germany (grant number 16SV8601).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee at the Faculty of Medicine of Heinrich Heine University Düsseldorf (Study No. 2021-1388, 5 May 2021).

**Informed Consent Statement:** Patient consent was waived due to the retrospective nature of the study.

**Data Availability Statement:** This study applied an open access dataset (MIMIC-III [18]), which is referenced accordingly, as well as an independent cohort from UKD that, according to German data protection policies, cannot be accessed from outside our facilities.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


### *Proceeding Paper* **K-Means Clustering Assisted Spectrum Utilization Prediction with Deep Learning Models †**

**Bethelhem S. Shawel <sup>1,2,\*</sup>, Frehiwot Bantigegn <sup>3</sup>, Tsegamlak T. Debella <sup>1,4</sup>, Sofie Pollin <sup>2</sup> and Dereje H. Woldegebreal <sup>1</sup>**


**Abstract:** The radio spectrum is a finite and scarce resource needed to transport the data generated by existing and emerging wireless mobile networks and services. As the demand for wireless services increases, operators look for ways to utilize their assigned spectrum efficiently. Operators regularly perform spectrum occupancy measurements, using an external spectrum analyzer or a dedicated sensing network, to understand and plan the spectrum utilization level in both the time and spatial dimensions; however, such a measurement-based approach is expensive, given the dynamics and the wide area covered by spectrum utilization. This paper proposes an indirect approach to assess and predict the average spectrum utilization level using data traffic measured from the base stations of an operator network. K-Means clustering and deep learning algorithms, namely Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), are used to model and analyze the current and future spectrum utilization in the 900 MHz frequency range. Data collected from 639 base stations of a mobile operator are used to build the spectrum utilization model. The results show that the CNN model trained on clustered data outperforms the model developed on non-clustered data (with a Root Mean Square Error (RMSE) of 0.58), mainly for base station level prediction. In terms of utilization level, the results also show that the operator does not optimally utilize the 900 MHz range.

**Keywords:** spectrum; utilization; prediction; time-series; clustering; K-Means; LSTM; CNN

**Citation:** Shawel, B.S.; Bantigegn, F.; Debella, T.T.; Pollin, S.; Woldegebreal, D.H. K-Means Clustering Assisted Spectrum Utilization Prediction with Deep Learning Models. *Eng. Proc.* **2022**, *18*, 2. https://doi.org/10.3390/engproc2022018002

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 17 June 2022

#### **1. Introduction**

Over the last three decades, cellular data traffic has exploded due to the proliferation of attractive telecom services with requirements ranging from high throughput to guaranteed low latency and low error rate. To respond to the demand, mobile network operators (MNOs) continuously upgrade their networks, e.g., by deploying advanced network technologies such as Advanced Fourth Generation Long Term Evolution (LTE+) and Fifth Generation (5G) broadband cellular mobile networks. These networks are designed to efficiently utilize the available spectrum and operate in the previously unutilized spectrum in the GHz range.

As the radio spectrum is one of the key and finite resources for transporting the growing traffic, there is an ever-increasing demand for it. Due to their favorable propagation conditions, providing high coverage and capacity, some bands (mostly sub-6 GHz) are intensively utilized at different times of the day and in different places (e.g., location and service area), creating some sort of "scarcity", in contrast with other bands [1]. In most cases, the non-uniformity of users' spatial distribution and usage patterns within various geographical areas and times limits the full utilization of the available radio spectrum assigned to MNOs [2]. In order to increase the efficient use of spectrum bands, MNOs consider different spectrum harvesting approaches, such as intra- or inter-operator spectrum sharing or spectrum refarming, that rely on detailed knowledge about the spectrum usage in time and space.

From the MNO's perspective, optimizing the spectrum usage focuses not only on maintaining the quality of service but also on ensuring that the allocated spectral resources can support the demands of subscribers. Thus, operators are expected to understand the dynamics of their spectral resource utilization by continuously monitoring its use in terms of space, time, and the number of channels (in a channelized band) that all users in a certain territory may access. While conducting continuous spectrum measurements with dedicated sensing networks is the most accurate approach to attaining such knowledge, it is costly and resource intensive, given cellular traffic's rising demand and dynamic nature. An alternative approach is to exploit the correlation between spectrum use and the transported data/voice traffic information, which is already available in the operator's network, to understand, estimate, and predict the spectrum utilization.

From these aspects, several studies have analyzed efficient spectrum/channel use in the spectral and temporal dimensions. Motivated by the lack of knowledge regarding spectrum occupancy in South Africa, the authors of [3] measured the spectrum occupancy for the Global System for Mobile Communications (GSM) 900 and 1800 MHz bands. The results indicate a maximum occupancy of 20% for UHF bands and different maximum utilization during peak hours for GSM 900 (92%) and 1800 (40%) MHz. Similarly, a study of spectrum utilization in Malaysia's TV and cellular bands was carried out in [3], showing maximum utilizations of 35%, 10%, and 26% in the GSM900, GSM1800, and 3G bands, respectively, and 11% and 13% utilization for TV broadcasting in the VHF and UHF bands.

The practical power of time series modeling methodologies has been demonstrated for predicting spectrum occupancy in different bands and applications. In this regard, ref. [4] applied Autoregressive Integrated Moving Average (ARIMA) models, a Lagrangian Support Vector Machine, and an Elman network (a simplified model of Recurrent Neural Networks (RNNs)) to predict spectrum occupancy in TV and cellular bands. The results show that the RNN technique outperforms the other models in prediction accuracy for cellular networks, as it better captures the non-stationarity and several irregularities in the data traffic. In contrast, the ARIMA model works efficiently in the TV band, since the traffic pattern is stationary. A similar analysis is presented in [5] for GSM channel utilization modeling and prediction with Seasonal ARIMA (SARIMA).

On the other hand, traffic volume and resource utilization mapping is presented in [1,6]. Traffic-related measurements such as call success rate, call drop rate, and antenna properties, including antenna height, transmit power, used/unused time, and frequency bandwidth, are used in [6] to indirectly map system/network parameters into spectrum utilization efficiency. Under a multi-MNO environment, the analysis results showed a heavy under-utilization of the spectrum. In [1], upper limits on the traffic volume and the spectrum resource usage are evaluated for a single LTE cell to ensure seamless video streaming in dense urban environments.

While these papers indicated (1) the need for analyzing spectrum usage directly from measurements or indirectly from voice/data traffic volume and (2) the need for prediction or classification models capable of capturing random spectrum usage, they are still limited in capturing the spatial correlation within various geographical regions.

With these limitations in mind, the main objective of this paper is to develop a machine-learning-based model that captures the spatio-temporal variation of spectrum utilization. The model helps to understand (in an average sense) how the operator utilizes the different spectrum bands allocated to it. For that, 100 days of voice traffic channel data (hourly frequency utilization per cell, in percentage) are collected from 639 base stations operating at the 900 MHz frequency range. Based on the data, the approaches followed in this work are as follows:

1. Map the measured voice traffic into channel utilization per base station;
2. Cluster the base stations by their utilization patterns using K-Means;
3. Develop and compare CNN- and LSTM-based prediction models for both clustered and non-clustered data.
The indirect spectrum usage assessment is low cost and requires fewer resources, as it uses the operator's already monitored/available data. With knowledge of the average spectrum utilization, operators may follow multiple approaches, such as acquiring a new frequency band in the case of "full utilization"; refarming frequencies, implementing half-rate configurations, and adopting spectrally efficient technologies to improve utilization in the case of moderate/medium utilization; or, in the case of low utilization, even allowing other users to utilize their frequencies in the context of cognitive radios or spectrum sharing with other operators [9]. To the best of our knowledge, no prior work has investigated spectrum utilization based on the operator's data. Even though our analysis considers voice traffic in the 900 MHz spectrum band, it can easily be extended to different spectrum bands and data traffic volume inputs with an appropriate traffic-spectrum utilization mapping and a more complex learning architecture.

The remainder of the paper is organized as follows. The spectrum utilization concept and its traffic mapping are discussed in Section 2. We present the clustering and prediction approaches used in Section 3, followed by the results and discussion in Section 4. Finally, we conclude the work in Section 5.

#### **2. Spectrum Utilization Analysis**

#### *2.1. Spectrum Bands in Cellular Mobile Networks*

The radio spectrum is divided into frequency/spectrum bands; e.g., in mobile systems, the 800 MHz, 900 MHz, 1800 MHz, 2100 MHz, and 2600 MHz bands are allocated to various generations of mobile systems [9]. Each band has different propagation characteristics and bandwidth that, in turn, determine mobile network coverage and capacity [2]. Operators further divide the frequency bands into channels (also called carriers), which are used to transport traffic and control information. For example, in GSM systems operating in the 900 MHz band, the band is divided into 124 duplex channels (or carriers) of 200 kHz bandwidth each.

By systematically spacing base stations in a geographic area, each base station is configured to operate on a certain group/cluster of channels. The configured channels are reused by other base stations as many times as possible, as long as the resulting co-channel interference stays within the service requirement [10]. In mobile systems, channels are classified into physical channels and logical channels. A physical channel corresponds to one timeslot on one carrier/channel, while a logical channel reflects the specific type of information carried by the physical channel, which can be either a traffic channel or a control channel. A traffic channel in GSM is abbreviated as TCH and is used for either voice or data service. For voice service, each timeslot can carry a full-rate TCH of 9600 bit/s, two half-rate TCHs of 4800 bit/s each, or one of the control channels.

#### *2.2. Traffic Engineering*

As previously stated, the main intention of this work is to use data available in operators' performance report systems (PRS) to estimate current and future channel utilization. When viewed at the base station level, where measurement is available, the channel utilization level depends, among other factors, on the number of configured channels per base station; the specific geographic area; the time of day; users' behaviors; the service delivered; the generation of the cellular network; the rate supported, i.e., full or half rate; and the multiplexing scheme used.

Cognizant of these facts, as well as the availability of the operator's data and the operator's understanding, the utilization study focuses on a 2G network providing voice service. The approach is, however, generalizable to other advanced networks, and this is one area in which we are working, with results to be published in the future. Moreover, from the operator's PRS, one can collect the aggregate offered traffic, measured in Erlang, for the configured channels per base station on an hourly basis and for both full-rate, $T_{v_F}$, and half-rate, $T_{v_H}$, TCH channels. For this paper, the data available per base station are the TCH traffic for both half and full rate, the configured channels per base station, and each site's longitude and latitude to analyze the spatial behavior of its utilization. Six hundred and thirty-nine sites (base stations) are used, and 100 days of data with a granularity of 1 h are collected.

Traffic engineering is then used to map the utilized channels, which in turn are compared with the configured channels to compute the percentage utilization. The channel utilization, $C_U$, is defined as:

$$C_U = \frac{T_{v_F}}{C_c \times \left(1 + \frac{T_{v_H}}{T_{v_F}}\right)} \times 100 \tag{1}$$

where $C_c$ is the number of channels configured at the base station.

The operator under study designed its voice service assuming an Erlang B service model with a grade of service corresponding to 98% network availability [9]. Figure 1 shows the spectrum/channel utilization calculated with Equation (1) for a particular base station.
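As a quick illustration, the following sketch evaluates Equation (1) as reconstructed above; the traffic figures in the example are made up:

```python
def channel_utilization(t_vf: float, t_vh: float, c_c: int) -> float:
    """Percentage channel utilization per Equation (1), from the full-rate
    (t_vf) and half-rate (t_vh) offered traffic in Erlang and the number
    of configured channels (c_c)."""
    return t_vf / (c_c * (1.0 + t_vh / t_vf)) * 100.0

# e.g., 12 Erlang full-rate and 4 Erlang half-rate traffic on 61 channels
print(channel_utilization(12.0, 4.0, 61))  # ~14.75%
```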

**Figure 1.** Spectrum/channel utilization.

#### **3. Methodology**

As the spectrum utilization data are derived from the cellular traffic observed over time, they are modeled as a non-linear and non-stationary time series. In order to capture the commonality in users' behaviors and distribution at different times and locations, a cluster-level approach is considered when predicting the utilization. The prediction model is developed using LSTM and CNN for both clustered and non-clustered data.

#### *3.1. Utilization Clustering with K-Means*

As one of the most popular unsupervised clustering algorithms, owing to its simplicity and linear complexity, K-Means is widely used in many application areas such as computer vision, image processing, and business analytics [9]. The goal of the algorithm is to group unlabeled multidimensional data into K clusters by assigning each data point to one unique cluster based on the provided features. With the objective of maximizing intra-cluster similarities and minimizing inter-cluster similarities in the spectrum usage, we used *Silhouette analysis* to find the optimum number of clusters.

The Silhouette index (SI) is used to distinguish the different unique patterns; it measures the distance between each time series and the centroid of the cluster it belongs to, compared with the other clusters. Base stations might have significant load variation, as shown in Figure 2, due to changes in work or rest times in commercial and residential regions, as well as variations in human behavior over time and across locations, among other factors. Based on the distinct patterns in the spectrum utilization time series data, the SI can be used to cluster the spatially distributed base stations. For a time series dataset $y$ and $k$ clusters, the SI is defined as [9]:

$$SI(\mathbb{C}) = \frac{1}{N} \sum_{c_k \in \mathbb{C}} \sum_{y_i \in c_k} \frac{b(c_k, y_i) - a(c_k, y_i)}{\max(a(c_k, y_i),\, b(c_k, y_i))} \tag{2}$$

where $a(c_k, y_i) = \frac{1}{|c_k|} \sum_{y_j \in c_k} \|y_i - y_j\|$ is the measure of similarity of a time series to its own cluster, and $b(c_k, y_i) = \min_{c_m \in \mathbb{C},\, c_m \neq c_k} \frac{1}{|c_m|} \sum_{y_j \in c_m} \|y_i - y_j\|$ is the measure of dissimilarity from the time series in other clusters.
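In practice, this analysis can be carried out with scikit-learn, as sketched below; the Euclidean metric on the raw utilization series and the candidate range of k are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_utilization(series: np.ndarray, k_range=range(2, 10)):
    """series: (n_base_stations, n_hours) array of utilization time series.
    Returns the k with the best silhouette score and its cluster labels."""
    best = (None, -1.0, None)                     # (k, SI, labels)
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(series)
        si = silhouette_score(series, labels)     # Equation (2)
        if si > best[1]:
            best = (k, si, labels)
    return best
```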

**Figure 2.** Different channel utilization observed at four base stations.

#### *3.2. Spectrum Utilization Prediction Approach with Deep Learning*

Deep learning has achieved remarkable prediction results on wireless network problems, owing to its capability to capture complex structure and, with "time-aware" architectures, to process time series information. Without explicitly decomposing the different time series characteristics of the spectrum utilization, deep learning models learn its dynamic temporal behavior. We consider two widely used deep learning networks, CNN and LSTM, for the spectrum utilization problem.

#### 3.2.1. Spectrum Utilization Modeling Using CNN

CNN is a type of deep neural network initially designed for image processing problems, but it is now applied to any data that can be represented in a grid-like matrix form. In a CNN, time series and textual data can be represented as 1D vectors, while a 2D matrix can represent the pixels of image data [11]. Unlike image processing, CNN-based time series analysis/prediction requires extracting information along the time dimension, which is why we use a stacked 1D CNN model.

To extract features, two 1D-convolution layers are used. A max pooling layer follows each convolution layer to shrink the input resolution and assist the convolution layers in extracting abundant temporally correlated features at various input resolutions. In addition, a flattening layer for data reshaping and two dense layers are used sequentially to obtain the required output shape.
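A minimal PyTorch sketch of such a stacked 1D CNN is given below; the channel counts, kernel sizes, and the 24-step input window are assumptions, as the actual values are listed in Table 2:

```python
import torch.nn as nn

# Input shape: (batch, 1, 24), i.e., one utilization series per 24 h window.
cnn = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(2),                  # shrink the temporal resolution
    nn.Conv1d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Flatten(),                     # reshape for the dense layers
    nn.Linear(64 * 6, 32),            # 24 steps -> 6 after two poolings
    nn.ReLU(),
    nn.Linear(32, 1),                 # one predicted utilization value
)
```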

#### 3.2.2. Spectrum Utilization Modeling Using LSTM

The LSTM network is an advanced recurrent neural network (RNN) designed to learn order dependence in sequence prediction. The LSTM contains three gates, namely the forget gate, $f_t$, the input gate, $i_t$, and the output gate, $O_t$, together with a memory cell, $C_t$, where each part performs a separate function. The forget gate chooses whether the information coming from the previous timestamp is to be remembered or is irrelevant and can be forgotten. The input gate quantifies the importance of the new information carried by the input. The output gate passes the updated information from the current timestamp to the next timestamp. Even with its computational complexity for retaining memory, it is easy to model complex non-linear feature interactions using the LSTM [12]:

$$f_t = \sigma(X_t \times U_f + H_{t-1} \times W_f) \tag{3}$$

$$i_t = \sigma(X_t \times U_i + H_{t-1} \times W_i) \tag{4}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tanh(X_t \times U_c + H_{t-1} \times W_c) \tag{5}$$

$$O_t = \sigma(X_t \times U_o + H_{t-1} \times W_o) \tag{6}$$

$$H_t = O_t \odot \tanh(C_t) \tag{7}$$

For our real-valued prediction problem, three layers of LSTM networks are used with ReLU, $R(z) = \max(0, z)$, as the activation function.
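A PyTorch sketch of this three-layer LSTM predictor follows; the hidden size and input window are assumptions:

```python
import torch
import torch.nn as nn

class UtilizationLSTM(nn.Module):
    """Three stacked LSTM layers with a ReLU head, as described above."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            num_layers=3, batch_first=True)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)           # x: (batch, time, 1)
        return self.head(out[:, -1])    # predict the next utilization value
```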

#### **4. Results and Discussion**

*4.1. Experimental Settings*

4.1.1. Data Preprocessing

We considered a dataset from an operator measuring the spectrum utilization of GSM 900 in Ethiopia. The data were collected for 100 days, from 1 January to 10 April 2021, with a granularity of 1 h for 639 base stations. Additionally, each site's longitude and latitude information was taken to analyze the spatial behavior of its utilization. The operator has 61 and 85 channels configured to handle GSM service at 900 MHz and 1800 MHz, respectively (Table 1). The maximum cell capacity for GSM900 is 8 TRX per cell and 12 TRX for DCS 1800 [9].

**Table 1.** GSM channel configuration.


Data preprocessing techniques, such as handling missing values, standardization, and outlier handling, are applied to the collected dataset. In this process, five base stations with continuous missing values are excluded; the dataset is divided into 80% training data, 10% validation data, and 10% test data.

#### 4.1.2. Hyperparameter Tuning

Hyperparameter tuning refers to finding the best parameters to get the best results from models. Hyperparameters are set before training a machine learning model. These hyperparameters need to be optimized to adapt the model to a dataset [6].

When building the LSTM model, the number of hidden layers, the number of LSTM cells in each layer, and the dropout rate must be considered, in addition to other parameters. Similarly, the CNN model is defined with various hyperparameters such as kernel size, filter size, hidden layers, optimizer, activation function, and number of epochs. A grid search algorithm was used for selecting the appropriate combinations of the hyperparameters listed in Table 2, as this is critical for building a model with better accuracy.
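A grid search over such a space can be sketched as follows; the parameter values and the `train_model`/`validation_rmse` helpers are hypothetical, with the actual search space listed in Table 2:

```python
from itertools import product

grid = {"lstm_cells": [32, 64, 128], "layers": [1, 2, 3],
        "dropout": [0.1, 0.3, 0.5]}

best_cfg, best_rmse = None, float("inf")
for cells, layers, drop in product(*grid.values()):
    model = train_model(cells, layers, drop)   # hypothetical trainer
    rmse = validation_rmse(model)              # hypothetical metric helper
    if rmse < best_rmse:
        best_cfg, best_rmse = (cells, layers, drop), rmse
```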


**Table 2.** Hyperparameters used in the LSTM and CNN models.

#### 4.1.3. Evaluation Metrics

Model evaluation aims to estimate the generalization accuracy of a model on future or test data. We jointly used two evaluation metrics to quantify our models' performance: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

$$MAE = \frac{1}{N} \sum\_{i=1}^{N} |y\_i - \hat{y}\_i| \tag{8}$$

$$RMSE = \sqrt{\frac{1}{N} \sum\_{i=1}^{N} (y\_i - \hat{y}\_i)^2} \tag{9}$$

where $N$ is the size of the evaluated set and $\hat{y}_i$ is the predicted utilization for $y_i$.
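Both metrics are straightforward to compute with NumPy, as in this sketch:

```python
import numpy as np

def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Equation (8): mean absolute error."""
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Equation (9): root mean squared error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))
```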

#### *4.2. Model Performances*

4.2.1. Clustering with K-Means

As the utilization of the spectrum resource at a particular base station relates to the number of channels allocated and the aggregated traffic requested by the users, its temporal pattern resembles, to a certain extent, the users' behavior.

In the K-Means analysis, the closeness of the utilization pattern of a particular base station to the mean traffic pattern of a cluster is evaluated for cluster membership. Using the preprocessed data, the K-Means algorithm clusters the data into an optimal cluster size of four based on the SI score. Figure 3 illustrates the utilization patterns and the spatial distribution of the corresponding base stations. The plots illustrate a distinct variation in utilization patterns due to factors such as user behavior (the high call rate during working hours indicated by the high peaks) and a higher number of channel allocations (reflected in the third cluster, from left to right). Aside from being an input for better spectrum utilization through dynamic spectrum allocation, the clustering-based approach averaged out the different patterns observed at the base station level into four clusters, simplifying network-level predictions.

**Figure 3.** The four clusters of base stations based on their average spectrum utilization patterns. The top-row plots represent the average utilization pattern of each cluster over a five-day period, and the bottom-row plots represent the spatially distributed base stations of each cluster.

#### 4.2.2. Prediction Performance

The prediction for spectrum utilization at a cluster level and base station level was made considering the two models: LSTM and CNN. Figure 4 and Table 3 present results for cluster-level prediction that showed close performance between the LSTM and CNN models in capturing characteristics of the GSM 900 spectrum usage. Similarly, results for 24 h base station level prediction are shown in Table 4 and Figure 5.

**Figure 4.** Actual vs. predicted plot for CNN and LSTM.

**Table 3.** Cluster level prediction performance.



**Table 4.** Performance evaluation results for base station level prediction.

**Figure 5.** Base station level prediction with both LSTM and CNN.

As noted, the prediction error produced by our LSTM model is much greater than that of the CNN (RMSE of 0.58) at both the cluster and base station levels, showing its limited ability to sufficiently learn the patterns (i.e., trend, seasonalities, and non-linearities) inherent in the data.

Compared to the error generated by the cluster-level prediction, the base-station-level approaches in both models perform poorly (RMSE of 1.04 for the CNN). Especially during high utilization, the prediction error is very significant, which would degrade network conditions and QoS in the case of prediction-based network resource allocation and optimization.

#### **5. Conclusions**

In wireless network planning and optimization, prediction-assisted data analysis provides the opportunity for operators to determine the extent to which resources are utilized and quality of service is attained. As spectrum is the scarce resource in wireless communication, it is important for MNOs to understand how the spectrum is utilized over time and space.

In our paper, spectrum utilization data are modeled as time series data during model development, and various strategies for enhancing the models' performance are employed to obtain a better model with the least amount of error. Since the utilization data are not typically measured in practice, we exploit the traffic-utilization relation. A cluster-level approach is considered with the help of K-Means to provide network-level spectrum utilization prediction with the CNN and LSTM algorithms. Based on the temporal pattern, the GSM 900 band utilization is clustered into four groups. To compare and evaluate the prediction accuracy, the MAE and RMSE metrics are used. As shown, the model developed for the clustered data using the CNN outperforms the LSTM algorithm with an RMSE value of 0.58. Similarly, for base-station-level prediction, CNN is found to be the best predicting model, with an RMSE value of 1.04.

We hope the presented results provide new insight for MNOs into the utilization level of the spectrum allocated to them, an approach that can also be extended to 3G and beyond networks. Moreover, the presented approach works at a per-cluster level, which spans a wider geographic area. How to obtain base-station-level knowledge for a large number of base stations is another area to explore in future work.

**Author Contributions:** Conceptualization, B.S.S., F.B. and D.H.W.; methodology, B.S.S., F.B. and D.H.W.; software, B.S.S., F.B. and T.T.D.; validation, B.S.S. and T.T.D.; formal analysis, B.S.S., F.B. and D.H.W.; investigation, B.S.S., F.B. and D.H.W.; resources, F.B. and D.H.W.; data curation, F.B.; writing—original draft preparation, B.S.S. and F.B.; writing—review and editing, B.S.S. and D.H.W.; visualization, F.B.; supervision, D.H.W. and S.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Restrictions apply to the availability of these data. Data was obtained from Ethio Telecom and are available from the corresponding author with the permission of Ethio Telecom.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Alone We Can Do So Little; Together We Cannot Be Detected †**

**Sergej Korlakov <sup>1,\*,‡</sup>, Gerhard Klassen <sup>1,\*,‡</sup>, Marcus Bravidor <sup>2</sup> and Stefan Conrad <sup>1</sup>**


**Abstract:** It is no longer possible to imagine our everyday life without time series data. This includes, for example, market developments, COVID-19 cases, electricity prices, and other data from a wide variety of domains. An important task in the analysis of these data is the detection of anomalies. In most cases, this is accomplished by examining individual time series. In our work, we use the techniques of cluster analysis to establish a relationship between time series and groups of time series. This relationship allows us to observe the development of time series in their entirety, thereby gaining additional insights. Our approach identifies outliers with a real-world reference and enables the user to locate outliers without prior knowledge. To underline the strengths of our approach, we compare our method with another known method on two real-world datasets. We found that our solution needs significantly fewer calculations, produces more reasonable results, and can be applied to real-time data. Moreover, our method detected additional outliers, whose occurrence could be explained by real events.

**Keywords:** outlier detection; outlier detection in time series; time series analysis; time series clustering; time series cluster evaluation

#### **1. Introduction**

Outlier or anomaly detection in time series is the problem of identifying rare or deviating observations in (univariate or multivariate) time series. Those observations may occur once or form a sequence when arising multiple times in a row. Finding anomalies in time series can be beneficial in a variety of applications, such as fraud detection in stock markets [1,2], anomaly detection in network data [3,4], and the detection of unusual time series in medical data [5,6]. The sheer number of possible applications of anomaly detection in time series makes it important for industry; therefore, it has been implemented in a number of business applications released by Google (https://cloud.google.com/blog/products/data-analytics/ (accessed on 10 April 2022)), RapidMiner (https://rapidminer.com/glossary/anomaly-detection/ (accessed on 10 April 2022)), Microsoft, and IBM [7,8]. The broad diversity of applications and products indicates a variation in the underlying data, which requires specific solutions in order to detect meaningful anomalies. Furthermore, outliers may also be defined differently, depending on the context at hand.

**Citation:** Korlakov, S.; Klassen, G.; Bravidor, M.; Conrad, S. Alone We Can Do So Little; Together We Cannot Be Detected. *Eng. Proc.* **2022**, *18*, 3. https://doi.org/10.3390/engproc2022018003

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 17 June 2022

Most approaches focus on the detection of anomalous observations or subsequences in a single time series. This is useful for many applications but does not include a comparison to other time series from the same domain. However, assuming that time series from the same context are influenced by similar framework conditions, such a comparison becomes necessary. The idea of outlier detection by comparing time series is part of recent research and often applies Dynamic Time Warping (DTW) techniques [9,10] or Granger causality [11,12]. Although these approaches are able to identify anomalous data points or even subsequences, they are limited to the comparison of only two time series at a time. The comparison of one time series to a group of other time series at the same time is, therefore, the next logical step; however, it also requires techniques to localize corresponding groups. The latter is well researched and is referred to as cluster analysis in time series.

In general, the goal of cluster analysis is to group objects with the objects in the same group being as similar as possible to each other and the objects from different groups being as dissimilar as possible. The similarity between two objects is usually expressed by a distance function (e.g., Euclidean distance). The number of existing clustering algorithms is very large; hence, which one should be used depends, among other things, on the given data and performance requirements. There are also different approaches for the clustering of time series. Examples of these are clustering using common methods with an explicit distance function for time series, clustering of multiple time series at each point in time, or using a clustering algorithm that was specifically developed for time series.

An example of a clustering algorithm developed for time series (based on K-means [13]) is the method of Chakrabarti et al. [14]. The authors claim that their algorithm can preserve a certain consistency of clusters across consecutive points in time. A recent study by Tatusch et al. [15], however, has shown that the development of time-series-adapted algorithms is not implicitly required to preserve this consistency. Instead, it is sufficient to find corresponding parameters for existing clustering algorithms that are not time-adjusted. In a different work, the authors also demonstrated the use of this technique to find behavioral outliers in time series. Although the results are convincing, the given method is not applicable to streaming data because it is decidedly computationally intensive. Furthermore, the approach of Tatusch et al. [16] requires the user to set a threshold, which is based on the time series' over-time stability. However, the construct of this stability measure is not intuitive; therefore, it may be very difficult to find appropriate values for the threshold.

In this paper, we present an alternative approach based on a **cl**uster **o**ver-time **s**tability **e**valuation measure called *CLOSE* [15] that is significantly less computationally expensive. Moreover, the results are based on a much more intuitive threshold that incorporates the cluster membership of time series. We compare the obtained results with those of Tatusch et al. [16].

In the remainder of this section, we provide the necessary notations and definitions (Section 1.1). In the next section, we first introduce a categorization of machine learning methods for time series analysis. Based on this categorization, we present the related work with the respective fundamental concepts and discuss the limitations in comparison to our solution. In Section 3, we describe our method mathematically, with examples for every introduced definition. Then, we propose an optimization to reduce the number of computations required and, thus, to improve the performance of our method. In Section 4, we compare the results of our method with those of Tatusch et al. [16], a solution with a similar key idea. To ensure a fair comparison, we use the same data set and hyperparameters as in [16]. Finally, we conclude and discuss possible future work in Section 5.

#### *1.1. Notation and Definitions*

Since we compared our procedure with that of Tatusch et al. [16], we adapted the definitions they provided.

**Definition 1** (Time Series)**.** *A time series $TS_x = x_1, \dots, x_n$ is an ordered set of $n$ real-valued data points of any dimension. The data points are ordered chronologically by time, and the order is represented by the corresponding indices of the data points.*

**Definition 2** (Subsequence)**.** *A subsequence $S_{x,[k,m]} = x_k, \dots, x_m$ of a time series $TS_x$ is an ordered subset of $m - k + 1$ real-valued data points of $TS_x$ with $k < m$ and $\forall x_l \in TS_x : k \le l \le m : x_l \in S_{x,[k,m]}$.*

**Definition 3** (Data Set)**.** *A data set $D = \{TS_1, \dots, TS_m\}$ is a set of $m$ time series of the same length $n$ and with equivalent points in time.*

**Definition 4** (Cluster)**.** *A cluster $C_{i,t}$ at time $t$, with $i \in \{1, \dots, p\}$ being a unique identifier, is a set of similar data points identified by a clustering algorithm. All clusters have distinct labels regardless of time.*

**Definition 5** (Cluster Member)**.** *A time series $TS_x$ is a member of a cluster $C_{i,t}$ if its data point $x_t$ at time $t$ is assigned to $C_{i,t}$.*

#### **2. Related Work**

The different nature of data and approaches has led to a variety of diverse definitions of outliers; however, the general definition by Douglas M. Hawkins is often used: "An observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [17]. The observations mentioned may have been made at one point in time or over a period of time and refer to univariate or multivariate time series. To capture the differences in both data and methods, Blázquez-García et al. [18] proposed a taxonomy that categorizes methods for anomaly detection in time series along three axes: input data (i.e., univariate, multivariate), outlier type (i.e., point, subsequence, time series), and nature of the method (i.e., univariate, multivariate). The latter describes whether a technique converts a multivariate time series into multiple univariate time series before further processing of the data. In the following, the described taxonomy is used to categorize the methods presented in the remainder of this section.

One method that, under the described taxonomy, is of univariate nature and supports point and subsequence outliers was presented by Sun et al. [19]. The goal of the method is to detect anomalies in character sequences. To this end, a probabilistic suffix tree is first constructed from character sequences, which is then used to estimate the probability of a point or subsequence being an anomaly. Because the method takes univariate character sequences as input, a time series must first be converted into such a sequence, e.g., by SAX [20], in order to be processed in the following steps. In contrast, our proposed method can process time series without first having to convert them into a different (reduced) representation.

Alternatively, Munir et al. [21] introduced a method that is multivariate in nature and can be applied equally to both types of time series. It consists of two modules: a CNN-based event predictor and a distance-function-based anomaly detector. As the name suggests, the event predictor predicts the next event based on all given time series, and the anomaly detector determines whether the deviation between the prediction and the observed event is higher than a certain threshold. Thus, the outlier type handled by this method is the point outlier. Compared to the supervised CNN-based event predictor, our proposed method is unsupervised, so a training phase is not required. Moreover, the approach presented here has a white-box character, which allows a more detailed analysis of outlier formation.

In contrast to the black-box model of Munir et al. [21], Hyndman et al. [22] proposed a white-box method exclusively for anomaly detection in multivariate time series. In their study, they first extracted 18 features from each time series. Then, principal components were determined from the data points of these features. Finally, a density-based multidimensional anomaly detection algorithm [23] was applied to the first two principal components to detect anomalous time series. Although the method has good performance and accuracy compared to other presented models, the approach contains a number of drawbacks. On one hand, the feature extraction from time series and the dimension reduction by PCA (Principal Component Analysis) can lead to a loss of important information. On the other hand, principal components generally have low interpretability. In this context, it can be difficult to determine the impact of features on the outlier detection. Since we consider the dimensions of the time series as a whole in our approach, there is no loss of information. In addition, based on cluster transitions and the reasons for those transitions, our method allows a detailed analysis of the occurrence of anomalous subsequences.

In response to the fact that several approaches (including those described here) focused on the deviation of one time series from the others, Tatusch et al. [16] presented a method for anomalous subsequence detection, which examines the behavior of a time series relative to its peers. For this, all time series are first clustered per timestamp; then, the transitions of time series between clusters are analyzed. If a time series or its subsequence frequently moves to different clusters compared to its peers from previous clusters, it obtains a higher outlier score. According to this method, a time series or its subsequence is an outlier if the outlier score exceeds a threshold, which must be specified by the user. In other words, if a time series changes its group often enough over time, it will be identified as an outlier.

As for any threshold that has to be set by a user, the approach of Tatusch et al. [16] raises the question of what value to set it to. Thus, the problem arises that the outlier score is not intuitive. While an outlier score of zero states that the corresponding time series is consistently in the same clusters with its peers over time, an outlier score from an interval *I* = [0, 1] can be much more difficult to interpret. Further, under the assumption that all time series of a dataset *D* have the same length *l*, the proposed method requires *K* computations to find all anomalous subsequences in *D*, where *K* is defined as:

$$K = \frac{(l-1)^2 + (l-1)}{2} \cdot |D|$$

Finally, Tatusch et al. [16] differentiate between two outlier types: outliers by distance and intuitive outliers. Outliers by distance can be detected based on their outlier score, whereas intuitive outliers are subsequences consisting of noise, which can arise during clustering per timestamp. This distinction is necessary to be able to categorize the latter type of subsequences as outliers as well, since the outlier score for these would be zero.

In our work, we present an alternative definition of an outlier, which is also based on a clustering of the time series per timestamp; however, it addresses the problems listed for the approach of Tatusch et al. [16]. Thus, the threshold of our method is more intuitive, there is no need to differentiate between types of outliers, and the number of computations $K'$, based on the same assumptions as above, is smaller with:

$$K' = (l-1) \cdot |D|$$
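For illustration, consider a data set of $|D| = 30$ time series of length $l = 11$: the approach of Tatusch et al. [16] then requires $K = \frac{10^2 + 10}{2} \cdot 30 = 1650$ score computations, whereas our method requires only $K' = 10 \cdot 30 = 300$.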

For this purpose, the given time series are clustered per timestamp first. Then, scores are calculated for each time series between consecutive timestamps, indicating the number of peers with which the time series remains in the same cluster. Finally, a threshold is defined to indicate how unique a path between two clusters in consecutive timestamps must be for the corresponding subsequence to be classified as an anomaly.

#### **3. Method**

Building on the described fundamentals, this section first introduces further terminology that is necessary to understand the method. This is followed by the definition of an anomalous subsequence of a time series. Finally, a way to find all anomalous subsequences within a time series is presented.

For a better illustration of the equations and corresponding calculations, we refer to the example given in Figure 1, which represents multiple time series clustered per timestamp. In the context of this work, data points defined as noise are considered as separate clusters. Thus, the data points of the time series *TSf* are assigned to the clusters *Cr*,1 and *Cv*,2 at timestamps one and two.

**Figure 1.** Example: clustering per timestamp.

The first term which will be relevant in the rest of this section is the cluster transitions set *ct*. A single cluster transition of a time series *TSy* is a tuple of two cluster labels indicating in which clusters two adjacent data points of *TSy* are located. Thus, a set of cluster transitions *ct* has the following definition:

$$ct(TS_y) = \{ (C_{i,t}, C_{j,t+1}) \mid \exists\, y_t, y_{t+1} \in TS_y : y_t \in C_{i,t} \land y_{t+1} \in C_{j,t+1} \}. \tag{1}$$

For example, in Figure 1, the cluster transition set of the time series $TS_c$ is:

$$ct(TS_c) = \{ (C_{p,1}, C_{u,2}), (C_{u,2}, C_{w,3}) \}.$$

Given the description of the cluster transition set, we next define a multiset *M* that contains all cluster transitions for all time series *TSi* of the data set *D*:

$$M\_D = \bigcup\_{TS\_i \in D} ct(TS\_i). \tag{2}$$

Regarding Figure 1, the multiset *MD* is:

$$\begin{aligned} M_D = \{\, & (C_{p,1}, C_{s,2}), (C_{p,1}, C_{s,2}), (C_{p,1}, C_{u,2}), \\ & (C_{q,1}, C_{u,2}), (C_{q,1}, C_{u,2}), (C_{r,1}, C_{v,2}), \\ & (C_{s,2}, C_{w,3}), (C_{s,2}, C_{w,3}), (C_{u,2}, C_{w,3}), \\ & (C_{u,2}, C_{x,3}), (C_{u,2}, C_{x,3}), (C_{v,2}, C_{x,3}) \,\}. \end{aligned}$$

In combination with the equation for the cluster transition set, the given multiset description is then used to define a conformity score, which indicates how often a particular cluster transition *p* occurs in all time series of a data set *D*:

$$\text{conformity\_score}(p, M_D) = M_D(p). \tag{3}$$

With respect to this equation, the conformity score of the cluster transition (*Cp*,1, *Cs*,2) of the data set *D* presented in Figure 1 is:

$$\text{conformity\_score}((C_{p,1}, C_{s,2}), M_D) = M_D((C_{p,1}, C_{s,2})) = 2.$$

Using Equations (1)–(3), a set of anomalous transitions *at* for a time series *TSy* ∈ *D* is defined as follows:

$$at(TS_y) = \{\, p \mid p \in ct(TS_y) \land \text{conformity\_score}(p, M_D) \le \sigma \,\}, \tag{4}$$

where *σ* is a threshold for the conformity score of a single cluster transition. Thus, if the conformity score is less than or equal to *σ*, then the corresponding transition is categorized as anomalous. Consequently, a subsequence *Sy*,[*k*,*m*] of a time series *TSy* is anomalous if and only if:

$$at(S_{y,[k,m]}) = ct(S_{y,[k,m]}). \tag{5}$$

According to Equations (4) and (5), the entire time series *TSc* of the data set *D* shown in Figure 1 is anomalous due to:

$$at(TS_c) = \{ (C_{p,1}, C_{u,2}), (C_{u,2}, C_{w,3}) \} = ct(TS_c).$$
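Equations (1)–(4) translate almost directly into code. The following sketch is our illustration, not the authors' released implementation: it encodes the per-timestamp cluster labels of Figure 1 (series names other than $TS_c$ and $TS_f$ are arbitrary) and reproduces $at(TS_c)$ for $\sigma = 1$.

```python
# Sketch of Equations (1)-(4): cluster transitions, their multiset, the
# conformity score, and the anomalous transitions for a threshold sigma.
from collections import Counter

# Per-timestamp cluster labels of the six time series in Figure 1.
labels = {
    "a": ["p", "s", "w"], "b": ["p", "s", "w"], "c": ["p", "u", "w"],
    "d": ["q", "u", "x"], "e": ["q", "u", "x"], "f": ["r", "v", "x"],
}

def ct(series):
    """Cluster transition set of one time series (Equation (1))."""
    return [((c1, t), (c2, t + 1))
            for t, (c1, c2) in enumerate(zip(series, series[1:]), start=1)]

# Multiset of all transitions in the data set (Equation (2)).
M_D = Counter(p for s in labels.values() for p in ct(s))

def conformity_score(p):
    """Equation (3): multiplicity of transition p in M_D."""
    return M_D[p]

def at(series, sigma=1):
    """Anomalous transitions of one time series (Equation (4))."""
    return [p for p in ct(series) if conformity_score(p) <= sigma]

# TS_c is anomalous: all of its transitions have conformity score <= sigma.
print(at(labels["c"]))  # [(('p', 1), ('u', 2)), (('u', 2), ('w', 3))]
```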

Using Equation (5), anomalous subsequences of a time series *TSy* can be identified by iterating over all possible subsequences of *TSy*. As an alternative to this approach, a set of tuples can be derived based on the set of anomalous cluster transitions, where the first element of each tuple indicates the beginning of an anomalous subsequence, and the second one specifies the end of the corresponding subsequence. Given a lower number of required iterations through a time series, this alternative is intended to optimize the performance of the method presented in this paper. This first requires the definition of an order relation ≤:

$$(C_{i,t_a}, C_{j,t_b}) \le (C_{k,t_c}, C_{l,t_d}) :\Leftrightarrow (t_a \le t_c) \land (t_b \le t_d). \tag{6}$$

Based on the presented order relation, a set of tuples can be defined with respect to a time series *TSy*, where each tuple consists of two anomalous cluster transitions. The first transition indicates the beginning of an anomalous subsequence of the time series *TSy*, and the last one specifies the end of the corresponding subsequence. The formal description for this set of anomalous transition boundaries *atb* is given by:

$$\begin{aligned} atb(TS_y) = \{\, (p, p') \mid\; & p, p' \in at(TS_y) \land p \le p' \;\land \\ & \nexists\, \hat{p} \in ct(TS_y) \setminus at(TS_y) : p \le \hat{p} \le p' \;\land \\ & \nexists\, \hat{p} \in at(TS_y) : \hat{p} \text{ immediately precedes } p \text{ or immediately follows } p' \,\}. \end{aligned}$$

In the case of the time series *TSc* from Figure 1, the set of anomalous transition boundaries for *σ* = 1 is:

$$atb(TS_c) = \{\, ((C_{p,1}, C_{u,2}), (C_{u,2}, C_{w,3})) \,\}.$$

Further, to obtain data-point-based outlier boundaries for a time series $TS_y$, a mapping $dp_y(C_{i,t}, C_{j,t'}) = (y_t, y_{t'})$ is required that maps a tuple of clusters to a tuple of data points of $TS_y$, such that:

$$y_t \in TS_y \land y_t \in C_{i,t} \land y_{t'} \in TS_y \land y_{t'} \in C_{j,t'}.$$

Finally, the outlier boundaries set $ob$ of a time series $TS_y$ is defined as an element-wise merge of the tuples in the set of anomalous transition boundaries, which is then mapped to the data points of $TS_y$:

$$ob(TS_y) = \{\, dp_y(C_{w,t}, C_{z,t'+1}) \mid ((C_{w,t}, C_{x,t+1}), (C_{y,t'}, C_{z,t'+1})) \in atb(TS_y) \,\}.$$

In this context, the set of outlier boundaries consists of tuples, where the first element of each tuple marks the beginning of an anomalous subsequence, and the second one indicates the end of that subsequence. Consequently, the outlier boundaries set of the time series $TS_c = c_1, c_2, c_3$ shown in Figure 1 for $\sigma = 1$ is:

$$ob(TS_c) = \{\, dp_c(C_{p,1}, C_{w,3}) \,\} = \{\, (c_1, c_3) \,\}.$$
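Continuing the previous sketch, the order relation and merge step described above can be implemented by scanning the anomalous transitions in chronological order and merging adjacent ones into maximal runs. Again, this is our illustration under the same assumptions, not the reference implementation.

```python
# Continuation of the previous sketch: merge runs of consecutive anomalous
# transitions into data-point boundaries (start, end), mirroring the idea
# behind the anomalous transition boundaries and the mapping dp_y.
def outlier_boundaries(series, sigma=1):
    anomalous = at(series, sigma)          # from the previous sketch
    boundaries, run_start, prev_end = [], None, None
    for (c1, t1), (c2, t2) in anomalous:
        if run_start is None:
            run_start = t1                 # a new anomalous run begins
        elif t1 != prev_end:               # gap -> close the current run
            boundaries.append((run_start, prev_end))
            run_start = t1
        prev_end = t2
    if run_start is not None:
        boundaries.append((run_start, prev_end))
    return boundaries                      # indices of the bounding data points

print(outlier_boundaries(labels["c"]))     # [(1, 3)] -> the whole of TS_c, i.e., (c_1, c_3)
```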

#### **4. Experiments**

In this section, we evaluate the method presented in this work. Since the method for the detection of outliers in time series (*DOOTS* (https://github.com/tatusch/ots-eval/blob/main/doc/doots.md (accessed on 20 April 2022))) of Tatusch et al. [16] also detects anomalous subsequences based on time series transitions between different clusters, we used their method for comparison with our approach. In addition, we used the identical data sets and the same clustering method. This is intended to make the comparison of the two methods as fair as possible. In both approaches, DBSCAN with Euclidean distance was used for timestamp-based clustering, with *minPts* and *ε* set differently for each data set but equally for both methods. In order to identify the best cluster parameters, we made use of the cluster over-time stability evaluation measure *CLOSE* [15] and found that these were the same as Tatusch et al. used in their study [16]. Furthermore, we used the same thresholds for *DOOTS* as proposed by Tatusch et al. In order to make a meaningful comparison with our method, we chose the conformity score threshold in a way that ensured the results of both methods were as similar as possible. Overall, we used two real-world data sets to compare both methods.

#### *4.1. Eikon Financial Data Set*

One of the real-world data sets Tatusch et al. used for the evaluation of their work was an extract from the EIKON database. This database contains financial data from over 150,000 sources worldwide, and the information covers the previous 65 years. The extract contained annual values of the features "net sales" and "expected return" for 30 randomly selected companies. The values of both features were normalized by min–max normalization. The parameters used for the clustering by DBSCAN were *ε* = 0.15 and *minPts* = 2.
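The per-timestamp clustering used by both methods can be reproduced with scikit-learn. The sketch below uses the parameter values reported above; the data array is a random placeholder for the EIKON extract.

```python
# Per-timestamp clustering as used in the experiments: at every timestamp the
# 2D points (net sales, expected return) of all companies are clustered with
# DBSCAN. Label -1 marks noise, treated as separate clusters by the method.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

data = np.random.rand(30, 14, 2)          # (companies, years, features) -- placeholder

# Min-max normalization per feature, as in the paper.
flat = MinMaxScaler().fit_transform(data.reshape(-1, 2)).reshape(data.shape)

# Cluster each timestamp independently.
labels_per_t = [
    DBSCAN(eps=0.15, min_samples=2).fit_predict(flat[:, t, :])
    for t in range(flat.shape[1])
]
```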

Figure 2a shows the result of *DOOTS* [16] for the threshold *τ* = 0.6, and Figure 2b illustrates the result of our outlier detection method for the conformity score threshold of *σ* = 1. The black dashed boxes in Figure 2a mark intuitive outliers, which by definition consist of noise, and the red boxes represent outliers found by analyzing cluster transitions (outlier by distance). In contrast, in Figure 2b, each black dashed box highlights the beginning of an anomalous subsequence and each red dashed box marks the remainder of that subsequence.

The most noticeable fact when comparing the results of both methods is the different number of detected subsequences. While *DOOTS* [16] found only two anomalous subsequences (*SGM*,[2008,2009] and *SKR*,[2009,2013]), our solution identified four of them (*SGM*,[2008,2009], *SKR*,[2008,2013], *STJX*,[2008,2009], and *SUPS*,[2012,2013]). A detailed analysis of *STJX*,[2008,2009] and *SUPS*,[2012,2013] showed that both of them had a unique cluster transition with a conformity score of one each. Since the threshold *σ* was set to the same value, both subsequences were identified as anomalous by our method.

The explanation for why *STJX*,[2008,2009] and *SUPS*,[2012,2013] were not considered anomalous by *DOOTS* [16] is more complex. In the case of UPS, neither a subsequence score nor the best score can be calculated, due to the lack of a cluster membership of the last data point of the time series. The inclusion of such a case requires an additional case differentiation, which can be seen as a disadvantage of the method of Tatusch et al. [16]. The reason for the missed detection of *STJX*,[2008,2009] was the small size of the cluster in which TJX was located in 2008. Therefore, the calculation of the subsequence score for *STJX*,[2008,2009] resulted in 0.5. Since the best score for this subsequence was one, the outlier score for it was (1 − 0.5) = 0.5, thus lower than the threshold *τ* of 0.6. From this case, the dependence of the outlier detection result on the corresponding cluster sizes can be derived, which can be considered another disadvantage of *DOOTS* [16].

**Figure 2.** (**a**) Result of Tatusch et al. Colors: cluster memberships, red dashed boxes: outlier by distance, black dashed boxes: intuitive outliers. (**b**) Result of our method. Colors: cluster memberships, red dashed boxes: outliers, black dashed boxes: preoutliers.

Most interesting with regard to the identification of UPS (2012–2013) and TJX (2008–2009) as outliers is probably the realization that they can be explained by related events. Unlike the companies with which UPS was clustered in 2012, UPS had to lower its expected return in 2013. This was probably attributable to the crash of UPS Airlines Flight 1354 (https://www.bbc.com/news/world-us-canada-23698279; accessed on 25 April 2022) on 14 August 2013. TJX Companies is a multinational department store corporation that in 2009 was still struggling with the consequences of the recession triggered by the economic crisis in 2008 (https://www.bizjournals.com/denver/stories/2009/07/06/daily63.html; accessed on 25 April 2022). For this reason, sales fell sharply, as they did for all retail traders. The reason why TJX was identified as an outlier here was that it was the only retail company in the data set.

#### *4.2. Airline On-Time Performance Data Set*

The other real-world data set that Tatusch et al. [16] used in their publication is called "Airline on-time performance". It was originally created for a challenge with the goal of predicting delayed and canceled flights. To this end, it includes flight data on all commercial flights in the USA between October 1987 and April 2008, resulting in a total of 120 million records. Based on this data set, Tatusch et al. [16] generated one-dimensional time series with the feature "distance". As described in their work, they took eight days of every month and calculated the average distance for each airline in the data set. Before the authors clustered the created data set with DBSCAN, they normalized the distances by min–max normalization. The parameters for the applied clustering algorithm were *minPts* = 3 and *ε* = 0.03.

The result of *DOOTS* [16] for *τ* = 0.4 is displayed in Figure 3a. Here, the black solid lines represent outliers by distance, and the black dashed lines are both outliers by distance and intuitive outliers. Our result for *σ* = 1 is shown in Figure 3b, where the black solid lines mark anomalous subsequences. In both figures, the colors of the dots set at each time point represent the cluster membership, with the red color representing noise found by DBSCAN. The results of both methods show strong similarities regarding the detection of anomalous subsequences, but there are also some differences. The most relevant of them are discussed below.

**Figure 3.** Airline on-time performance data set. (**a**) Outlier detection result of the method of Tatusch et al. (**b**) Result of our method.

Foremost, the subsequences *SA*,[1,2] and *SA*,[3,4] of the time series marked as *A* in Figure 3 were detected as anomalous by *DOOTS* [16], while they were not detected as such by our method. In the context of our approach, these subsequences were not detected as anomalous, because the number of equal cluster transitions and, therefore, the conformity score of *SA*,[1,2] as well as of *SA*,[3,4] was higher than the threshold *σ* = 1. In contrast, explaining the results of *DOOTS* [16] requires detailed calculations. First, the subsequence score for *SA*,[1,2] is:

$$\text{subseq\_score}(S_{A,[1,2]}) = 3/4 = 0.75.$$

Given that the best score for the last cluster of *SA*,[1,2] is one, the outlier score for this subsequence is:

$$outlier\\_score(S\_{A,[1,2]}) = 1 - 0.75 = 0.25 < \tau.$$

Based on this result, *SA*,[1,2] should not be labeled as anomalous. However, if we take the subsequence *SA*,[1,3], which in addition to *SA*,[1,2] contains a further value at *time* = 3, the subsequence score for *SA*,[1,3] becomes smaller compared to *SA*,[1,2] with:

$$\text{subseq\_score}(S_{A,[1,3]}) = 0.5 \cdot (0.5 + 0.1) = 0.3.$$

Since the best score for the corresponding cluster at *time* = 3 is one, the outlier score for *SA*,[1,3] is:

$$outlier\\_score(S\_{A,[1,3]}) = 1 - 0.3 = 0.7 > \tau.$$

Thus, the subsequence *SA*,[1,3] was marked as anomalous. The same type of calculation led to the detection of the subsequence *SA*,[3,5] as anomalous, which contains the non-anomalous subsequence *SA*,[3,4].

The detection of *SA*,[1,2] and *SA*,[3,4] by *DOOTS* [16] leads to two possible conclusions. On one hand, it can be concluded that the approach of Tatusch et al. [16] considers anomalous subsequences in a broader context with respect to the length of a time series than our method, which would be an advantage of *DOOTS*. On the other hand, this solution leads to subsequences that are not anomalous within their interval being detected as anomalous. This can be seen as a disadvantage of *DOOTS* [16]. In contrast, our method detects only those subsequences that exhibit suspicious behavior within their time interval. However, a broader context regarding our detection method can be achieved by additionally considering non-anomalous subsequences that are adjacent to detected anomalous subsequences.

A more general observation is that our method detected every subsequence of length two as anomalous if it contained one or more noisy data points. The reason for this is that the conformity score of such sequences was one and thus less than or equal to the chosen threshold *σ*. In contrast, the solution of Tatusch et al. [16] follows a different approach. Even if a subsequence of length two has a noisy data point, it does not mean that *DOOTS* [16] will detect this subsequence as anomalous. In addition to the subsequence score, the detection of outliers by the method of Tatusch et al. [16] also depends on the value of the corresponding best score and on whether both scores can be calculated for the given subsequence. In summary, we can conclude that *DOOTS* [16] has a much more complex set of rules than our method, while the results of both methods are similar.

#### **5. Conclusions and Future Work**

In this paper, we introduced a new approach for detecting outliers in multiple multivariate time series. For this purpose, we first clustered the time series data at each point in time and then calculated conformity scores for each subsequence of length two. Finally, we determined whether the conformity scores were less than or equal to the specified threshold and, if so, labeled the corresponding subsequences as outliers.

Since we found only one alternative algorithm (*DOOTS* [16]) based on a similar idea, we compared the two in detail. The application of both methods led to similar results, although our solution has a much simpler rule set and, therefore, requires fewer calculations (the runtime estimation is provided in Section 2). On one hand, this simplifies the understanding of the origin of the outliers; on the other hand, the better performance of our method allows it to be used on real-time data. In addition, our rule set seems to be more consistent than that of *DOOTS* [16]. This statement is supported, among other things, by comparing the thresholds of both methods. While for our method the same minimum conformity score threshold was used for each dataset, the threshold set for *DOOTS* [16] varied without much difference between the results of both methods. Furthermore, our solution detected subsequences even if they ended with a noisy data point.

The most important drawback of our method is that the outlier detection result depends highly on the result of the previous clustering. This implies that a poor clustering result leads to a poor detection result of our method. Since the analysis of cluster transitions is the core idea of our method, we have to rely on approaches such as *CLOSE* [15] to obtain reasonable clustering results for our solution. Furthermore, while our method can be applied to real-time data due to its performance, this requires prior clustering in real time with reasonable results for time series data. Here, *CLOSE* [15] provides good results, but the procedure is not applicable to real-time data because of the high number of required computations. Given that, the most important aspect for future work is to optimize *CLOSE* [15] for real-time application. In addition, the freely selectable threshold *σ* in our work has a high impact on the detection result of our method. Every increment of *σ* leads to a superset of detected outliers relative to the result for the previous threshold. Although in this work we set this parameter to one for each dataset in order to obtain results that were as similar as possible and thus comparable to *DOOTS* [16], the optimal value for *σ* can be determined in several ways, and each determination should depend on the dataset in question. One way to determine the optimal *σ* is to count the cluster transitions that occur, sort them in ascending order, and choose the value for *σ* at which the slope change is greatest (see the sketch below). However, the analysis of this and other possible methods is beyond the scope of this work and should therefore be addressed in future work. Another aspect is that not all detected anomalous subsequences may be useful in the context of given requirements. Since this case requires further analysis, our method could be applied to multiple data sets in which the outliers have already been labeled.
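A minimal sketch of this selection heuristic follows; it is our illustration, and `M_D` refers to the transition multiset from the sketch in Section 3.

```python
# Sketch of the suggested heuristic for choosing sigma: sort the observed
# conformity scores in ascending order and pick the value just before the
# largest jump in the sorted curve. Other selection criteria are possible.
import numpy as np

def choose_sigma(M_D):
    counts = np.sort(np.array(list(M_D.values())))
    jumps = np.diff(counts)                # slope changes of the sorted curve
    return int(counts[np.argmax(jumps)])   # sigma just before the largest jump

# With the Figure 1 multiset from the earlier sketch: choose_sigma(M_D) -> 1.
```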

**Author Contributions:** Conceptualization, S.K. and G.K.; methodology, S.K. and G.K.; software, S.K. and G.K.; validation, S.K. and G.K.; formal analysis, S.K. and G.K.; investigation, S.K., G.K. and S.C.; data collection, S.K., G.K. and M.B.; writing—original draft preparation, S.K. and G.K.; writing—review and editing, S.K., G.K., M.B. and S.C.; visualization, S.K. and G.K.; project administration, S.C. and M.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by the Jürgen Manchot Foundation.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data are available in a publicly accessible repository that does not issue DOIs (https://github.com/YellowOfTheEgg/mldp-outlier\_detection (accessed on 16 June 2022)). Publicly available datasets were analyzed in this study. These data can be found here: EIKON dataset: https://www.refinitiv.com/ (accessed on 20 April 2020); Airline on-time performance dataset: https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2009 (accessed on 20 April 2020).

**Acknowledgments:** We would like to thank the Jürgen Manchot Foundation, which supported this work by financing the AI research group Decision-making with the help of Artificial Intelligence at Heinrich Heine University Düsseldorf.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

DOOTS Detection of outliers in time series (method)
CLOSE Cluster over-time stability evaluation (measure)
DTW Dynamic time warping

#### **References**


### *Proceeding Paper* **ODIN TS: A Tool for the Black-Box Evaluation of Time Series Analytics †**

**Niccolò Zangrando \* , Rocio Nahime Torres , Federico Milani and Piero Fraternali**

Department of Electronics Information and Bioengineering, Politecnico di Milano, 20133 Milano, Italy; rocionahime.torres@polimi.it (R.N.T.); federico.milani@polimi.it (F.M.); piero.fraternali@polimi.it (P.F.)

**\*** Correspondence: niccolo.zangrando@polimi.it

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** The increasing availability of time series datasets enabled by the diffusion of IoT architectures and the progress in the analysis of temporal data fostered by Deep Learning methods are boosting the interest in anomaly detection and predictive maintenance applications. The analysis of performance for these tasks relies on standard metrics applied to the entire dataset. Such indicators provide a global performance assessment but might not provide a deep understanding of the model weaknesses. A complementary diagnostic approach exploits error categorization and ad hoc visualizations. In this paper, we present ODIN TS, an open source diagnosis framework for time series analysis that lets developers compute performance metrics, disaggregated by different criteria, and visualize diagnosis reports. ODIN TS is agnostic to the training platform and can be extended with application- and domain-specific meta-annotations and metrics with almost no coding. We show ODIN TS at work through two time series analytics examples.

**Keywords:** time series; anomaly detection; predictive maintenance; model evaluation; error diagnosis

#### **1. Introduction**

Time series datasets collect observations sampled at different times. Recording can be continuous, when data are collected continuously in a given interval, or discrete, when data are recorded at set time intervals [1]. Based on the number of observations at each timestamp, the time series can be univariate or multivariate. Univariate time series log values generated by a single sensor, whereas multivariate time series record signals from multiple sensors simultaneously. Time series are used to study time-varying phenomena in many fields: in the economy [2] (e.g., stock price trends), in medicine [3] (e.g., the progress of health variables) or in industry [4] (e.g., the status or energy consumption of a machine). Given a time series dataset, different tasks can be performed to predict a specific attribute or event at a given timestamp or assign a label to a particular observation. The most common tasks can be summarized as follows:


**Citation:** Zangrando, N.; Torres, R.N.; Milani, F.; Fraternali, P. ODIN TS: A Tool for the Black-Box Evaluation of Time Series Analytics. *Eng. Proc.* **2022**, *18*, 4. https://doi.org/10.3390/ engproc2022018004

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 20 June 2022


The different tasks are usually evaluated by means of standard metrics such as the Mean Absolute Error (MAE) or Precision and Recall. While these metrics are useful global indicators of model performance, they provide little insight into the weaknesses of the models. For example, predicting a false positive close to a real anomaly is a less severe error than predicting it at a very distant time. Furthermore, information collected but not used during the model training phase could help in understanding the model performance. For example, in an industrial application it could be interesting to analyze whether performance varies across appliance versions or installation locations. A similar analysis could be performed on any other attribute not exploited for training but available at diagnosis time.

This paper introduces ODIN TS, the extension for anomaly detection and predictive maintenance of the ODIN machine learning diagnosis tool. ODIN [12] is an open-source, Python-based, black-box framework for error diagnosis initially conceived for generic classification and computer vision tasks. Contrary to explainability techniques that aim at "opening" the box by exploring the internals of the models (e.g., CNNs), black-box approaches study only the results of the model with regard to the input and its characteristics. ODIN TS adds the implementation of the most widely adopted metrics for the anomaly detection and predictive maintenance tasks and proposes new analyses for anomaly detection, such as false positive error categorization. ODIN TS also enables the inspection of the time series dataset and of the related predictions by means of a visualizer with different functionalities.

The contributions of the paper can be summarized as follows:


The paper is organized as follows: Section 2 summarizes the most common metrics used for time series analysis, Section 3 describes the proposed framework and its functionalities, Section 4 presents some examples of how the tool can be employed, and Section 5 concludes and provides insight into the future work.

#### **2. Related Work**

The evaluation of inference models applies standard metrics to compute performance indicators based on a comparison between the ground truth (what is expected) and the model predictions. Table 1 presents a by-task summary of the metrics and performance indicators found in the literature for time series analytics and provides a reference to the definition of each index. Different works have focused on the evaluation of time series tasks by proposing novel metrics and assessment procedures or by providing efficient implementations of the classical indicators. In [13], the authors presented a new benchmarking metric, the Numenta Anomaly Benchmark (NAB) score, which augmented traditional indices by incorporating time explicitly so as to reward early detection and introduced the concept of an anomaly window. The work in [14] illustrated the benefits of decomposing performance metrics based on the characteristics of the observations. As a use case, a model to detect illicit use of computational resources was created and assessed. The evaluation considered how the performance of the predictor changed in servers with specific attributes (low or high profile). Other contributions proposed novel time series evaluation frameworks. The work in [15] introduced Darts, a Python framework for handling time series, which implements some state-of-the-art machine learning methods and provides off-the-shelf a subset of the standard metrics reported in Table 1. Another example is the RELOAD tool proposed in [16] to identify the most informative features of a dataset, run anomaly detection algorithms, and apply a set of evaluation metrics to the results. While both tools aim to support training and evaluation with standard metrics, ODIN TS extends the support to time series analytics with a black-box error diagnosis approach focused on the anomaly detection and predictive maintenance tasks. It enables error categorization, prediction decomposition, and visualizations. The decomposition and visualization functionalities exploit the meta-annotations of the data set, i.e., features not used during model training that can contribute to the interpretation of the model results.

**Table 1.** Metrics and analyses found in the literature for time series, based on the different tasks. The value "yes" indicates that the metric applies to the specific task, whereas "n/a" indicates the contrary.


#### **3. The ODIN TS Framework**

The ODIN TS framework supports the development of predictive maintenance and anomaly detection tasks, enabling designers to evaluate standard metrics on inputs and outputs grouped by meta-annotation values, perform error categorization, evaluate the confidence calibration error, and visualize a variety of diagnostic reports. ODIN TS also includes a Visualizer module for the inspection of the dataset and of the model predictions. ODIN TS is implemented by extending classes of the Python-based ODIN framework and is publicly released (https://github.com/rnt-pmi/odin (accessed on 15 June 2022)).

#### *3.1. Dataset Input and Output Formats*

ODIN TS supports the import of time series data, of ground truth (GT) annotations, and of inference results. The artifacts should follow the guidelines common to most publicly available datasets (summarized in Table 2). For anomaly detection:

	- **–** Time Series data: a CSV file with the column "timestamp" of the observation and one additional column for each "signal".
	- **–** Ground Truth: a JSON file containing a list of the "timestamp" in which the anomalies appear.
	- **–** Predictions: a CSV file where the first column specifies the "timestamp" and the following column(s) the "confidence" value, or the reconstructed/predicted "signal" values.

For predictive maintenance:

	- **–** Time Series data: a CSV with the columns "observation\_id" for the unique identifier of the observation, "unit\_id" for the machine or appliance identifier, and one additional column for each "signal".
	- **–** Ground Truth: embedded in the previous CSV file as an additional column ("label") with a Boolean value denoting whether the machine is going to fail within *n* timestamps (for classification) or with the RUL value (for regression).
	- **–** Predictions: a CSV file (named "unit\_id.csv") for each machine or appliance mentioned in the ground truth. The file contains one column with the identifier of the observation and one column with the "confidence" score (for classification) or with the RUL value (for regression).

In both cases, if additional metaproperties are associated with the time series, they are input as a CSV file, where the first column is the identifier of the time series observation ("timestamp" or "observation\_id"), and there is one column per metaproperty.
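For illustration, a minimal (hypothetical) anomaly detection input could look as follows; the file names, example values, and exact JSON structure are ours, chosen to match the description above rather than taken from the tool's documentation:

```
# data.csv -- time series data
timestamp,signal_1
2013-07-04 11:00:00,69.88
2013-07-04 12:00:00,71.22

# gt.json -- timestamps at which anomalies appear
["2013-07-04 12:00:00"]

# predictions.csv -- per-timestamp anomaly confidence
timestamp,confidence
2013-07-04 11:00:00,0.02
2013-07-04 12:00:00,0.91
```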


**Table 2.** Dataset format for ODIN time series. The value "n/a" indicates that a field is not used.

#### *3.2. Supported Types of Dataset Analysis*

ODIN TS supports the following types of analysis of the observations and of the ground truth annotations:

**Distribution of classes.** For classification, a plot displays the percentage of samples for each category.

**Distribution of properties.** A plot displays the percentage of the observations associated with each property value. For example, it visualizes if an observation is associated with a certain period of the day (morning, evening, or night) or with a specific type of anomaly (point, contextual, or collective).

**Stationarity analysis**. A stationary time series is one whose properties do not depend on the time at which the series is observed. The implementation of ODIN TS uses the Augmented Dickey–Fuller statistical test [24].

**Seasonality, trend, and residual decomposition.** These analyses expose the repeating short-term cycles (seasonal) and the general movement over time (trend) of the series [25]. Residuals include everything not captured by the previous two types of decomposition. Decomposition can be realized with an additive model (addition of the decomposed values restores the original times series) or with a multiplicative one (the original series is obtained by multiplying the decomposed values).
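The two dataset analyses can also be reproduced stand-alone with statsmodels; the sketch below is illustrative and does not show ODIN TS's own API. The series is a synthetic placeholder with a daily cycle.

```python
# Sketch of the two dataset analyses: the Augmented Dickey-Fuller stationarity
# test and an additive seasonal decomposition into trend/seasonal/residual.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2022-01-01", periods=30 * 24, freq="h")
series = pd.Series(np.sin(np.arange(len(idx)) * 2 * np.pi / 24), index=idx)

stat, p_value, *_ = adfuller(series)      # small p-value suggests stationarity
result = seasonal_decompose(series, model="additive", period=24)
trend, seasonal, residual = result.trend, result.seasonal, result.resid
```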

#### *3.3. Supported Types of Prediction Analysis*

ODIN TS implements all the metrics of Table 1. To the best of our knowledge, there is no other framework that offers all of them off-the-shelf. Based on the implemented metrics, ODIN TS implements multiple performance reports and types of prediction analysis, summarized in Table 3.


**Table 3.** Types of analysis in ODIN TS for the different tasks. The value "n/a" specifies that the type is not relevant for the specific task.

**Summary report**. A report that tabulates the results of all the metrics. The total shows both the micro- and macro-averaged values: the former computes the value without distinguishing the categories; the latter computes the metrics for each class and then performs an unweighted mean.

**Performance per threshold value**. The classification metrics of interest are computed and shown in a graph for each value of the confidence threshold.

**Per property analysis**. The values of the metrics of interest are decomposed by property value and contrasted with the average across all the property values. For example, the RUL value prediction or probability of failure in the next N timestamps could be distinguished per appliance brand or installation location.

**FP anomaly categorization**. The analysis of incorrectly predicted anomalies is supported, including their categorization into the following cases:


**FP error distance distribution**. A distribution plot of the distance (measured as a number of timestamps) between an FP and the closest GT anomaly, color-coded with the FP anomaly category (affected, continuous, or generic).

**RUL variation distribution**. Given that a machine or appliance degrades over time, the predicted RUL should decrease by 1 at each timestamp. In a perfectly consistent model, the following formula should apply:

$$\hat{y}_t - \hat{y}_{t-1} = -1 \tag{1}$$

where $\hat{y}_t$ is the predicted RUL at time $t$, and $\hat{y}_{t-1}$ is the predicted RUL at time $t-1$. This type of analysis plots the distribution of the differences in predicted RUL between the current cycle and the previous one to assess the consistency of the model predictions.
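A stand-alone sketch of this analysis follows; it is illustrative (ODIN TS computes the distribution for imported predictions), and the prediction array is a placeholder.

```python
# Sketch of the RUL variation distribution (Equation (1)): for each engine,
# compute the difference between consecutive predicted RUL values; a perfectly
# consistent model always yields -1.
import numpy as np

predicted_rul = {"unit_1": np.array([120, 119, 117, 118, 116])}  # placeholder predictions

variations = np.concatenate([np.diff(rul) for rul in predicted_rul.values()])
consistent = np.mean(variations == -1)    # share of perfectly consistent steps
print(variations, f"{consistent:.0%} consistent")
```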

**Calibration analysis**. It exploits the confidence histogram and the reliability diagram [26]. Both plots assign the confidence values to buckets (e.g., 0–0.1, 0.1–0.2, ..., 0.9–1) on the abscissa. The confidence histogram shows the percentage of positively predicted samples that fall into each confidence range. The reliability diagram indicates, for each confidence range, the average accuracy of the positive samples in that range. When a classifier is well-calibrated, its probability estimates can be interpreted as correctness likelihood, i.e., of all the samples that are predicted with a probability estimate of 0.6, around 60% should belong to the positive class [27]. ODIN reports the Expected Calibration Error (ECE) (Equation (2)) and the Maximum Calibration Error (MCE) (Equation (3)):

$$ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| acc(B_m) - conf(B_m) \right| \tag{2}$$

$$\text{MCE} = \max\_{m \in (1..M)} |acc(B\_m) - conf(B\_m)| \tag{3}$$

where *n* is the number of samples in the data set, *M* is the number of buckets (each of size 1/M), and *Bm* denotes the set of indices of observations whose prediction confidence falls into the interval *m*.
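Both errors can be computed as in the following stand-alone sketch (our illustration, using equal-width buckets as described; `confidence` and `correct` are hypothetical NumPy arrays of per-sample confidence scores and 0/1 correctness flags):

```python
# Sketch of Equations (2) and (3): bucket predictions by confidence, then
# compare per-bucket average accuracy with per-bucket average confidence.
import numpy as np

def calibration_errors(confidence, correct, n_buckets=10):
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    ece, mce, n = 0.0, 0.0, len(confidence)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if not mask.any():
            continue  # empty bucket contributes nothing
        gap = abs(correct[mask].mean() - confidence[mask].mean())
        ece += mask.sum() / n * gap        # Equation (2): weighted average gap
        mce = max(mce, gap)                # Equation (3): worst bucket
    return ece, mce
```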

#### *3.4. Supported Visualizations*

ODIN TS allows one to visualize the dataset and the corresponding model predictions, if provided. The dataset visualization offers the following functionalities:


If the predictions are available, the following functionalities can be used:


#### **4. ODIN TS in Action**

This section exemplifies ODIN TS at work on an anomaly detection and a predictive maintenance case. The first example from the NAB datasets [13] used the "ambient temperature system failure" data, which contain ≈5000 hourly temperature measurements in an office and feature two anomalies. To detect the anomalies, an LSTM model was trained as in [9]. It comprised two LSTM modules with four hidden layers and a 0.2 dropout rate. The training set included the first 30% of the data (of which 10% was used for validation), while the remaining 70% was used for testing. The model was trained for 100 epochs, with a batch size of 32 and an input window length of 30. The scenario was relatively small given the few anomalies, but it was still useful to highlight some of the ODIN TS capabilities.

Figure 1 shows the distribution of the FP error categories. For the computation of the "affected" errors, an anomaly window of length 34 was used. Most FPs were within a short distance from the real anomalies ("affected", in orange), which suggests that the anomaly was perceived before its reported occurrence time and continued to be perceived shortly after. Furthermore, the "continuous" FPs were more numerous than the generic ones, which shows that the model tends to identify prolonged anomalies rather than instantaneous exceptions. Figure 2 shows the distance of each FP from the closest anomaly to confirm that the "affected" FPs were the closest to a GT anomaly. These errors are better appreciated in Figure 3 in which the Visualizer helps show the findings of the analysis more intuitively.

**Figure 1.** False Positive distribution plot.

**Figure 2.** False positive distribution of the distance of each FP from the closest anomaly.

To illustrate a case of predictive maintenance, we used a NASA dataset of turbofan engine degradation [28]. The dataset comprised samples from 100 engines, whose state was represented by 24 variables. The GT consisted of the RUL at each inter-observation interval (called a "cycle"). The authors provided the dataset in two splits: training (with 20,631 observations among 100 machines) and testing (13,096 observations among 100 machines). In the testing split, the cycles per machine ranged from 31 to 303. To predict the RUL at each cycle, a two-layer LSTM was employed, with 128 and 64 hidden units, respectively, and a dropout rate of 0.3. The LSTM was trained for 10 epochs with a batch size of 150 and an input window length of 60 cycles.
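A sketch of such a model in Keras, following the hyperparameters given above, might look as follows; the training arrays are random placeholders for the windowed turbofan data, and the exact architecture of the original experiment may differ in details.

```python
# Sketch of the described RUL regressor: a two-layer LSTM over windows of 60
# cycles of the 24 engine variables, with dropout 0.3 after each LSTM layer.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

X_train = np.random.rand(1000, 60, 24)    # (windows, cycles, variables) -- placeholder
y_train = np.random.rand(1000) * 300      # RUL target per window -- placeholder

model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(60, 24)),
    Dropout(0.3),
    LSTM(64),
    Dropout(0.3),
    Dense(1),                             # predicted RUL at the window's last cycle
])
model.compile(optimizer="adam", loss="mae")
model.fit(X_train, y_train, epochs=10, batch_size=150, verbose=0)
```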

**Figure 3.** The ODIN TS Visualizer showing the actual data and the model predictions color-coded by error type.

The model had an RUL estimation error of 20 cycles (provided by the MAE) and an MAPE of ≈0.17, which denote good performance. Further analysis helps understand where the model can be improved. Figure 4 shows the residual analysis and enables a visual interpretation of the deviation of the predictions from the GT. Two plots report the predicted RUL on the X-axis. The Y-axis of the left plot reports the GT RUL value, while the Y-axis of the right plot reports the standardized difference between the prediction and the GT. Each color represents a different engine. From the analysis, it can be seen that most errors occurred when the component was still in good condition (with an RUL value greater than 100), which highlights the inability of the model to predict the engine's remaining life in the long term. In particular, the highest predicted RUL was 160, while the corresponding GT value was 260. This suggests that the model is not able to learn high RUL values properly. In scenarios where analysts are more interested in correctly predicting the RUL when the engine is close to a failure, it makes sense to set a maximum GT RUL value to reduce the relevance of large values during training [29].

**Figure 4.** Residuals analysis. The colors indicate the different engines.

Figure 5 illustrates the RUL variation distribution analysis, which shows that the model predictions were not very consistent. The model sometimes increased the estimated RUL by 10 cycles instead of decreasing it by 1. This finding can help improve the prediction; for example, if the variation among consecutive cycles was high, one might interpolate or average the RUL values of previous cycles to mitigate the noise.

**Figure 5.** RUL variation distribution: the green bar represents a perfect variation of the RUL estimation, the yellow bars an acceptable variation, and the red bars an inconsistent prediction.

#### **5. Conclusions**

This paper described the addition of time series analysis functions into a black-box error diagnosis framework originally conceived for CV tasks. The novel version of ODIN includes an ODIN TS module, which supports performance diagnosis for two analytics tasks on time series: anomaly detection and predictive maintenance. ODIN TS implements all the most widely adopted metrics for the addressed tasks and introduces new types of analysis for anomaly detection, such as FP error categorization. ODIN TS also enables the inspection of the dataset and of the predictions by means of a Visualizer with rich functionalities. ODIN TS is implemented in Python and released as an open-source project, which developers can easily extend with their own metrics, reports, and visualizations. To conclude, we illustrated the tool at work on two use cases, so as to give a glimpse of its utility. Future work will focus on extending the implemented metrics and on supporting more tasks (e.g., time series classification and forecasting). We also plan to extend the Visualizer with novel functions and to integrate a time series annotator for enriching the GT or generating it from scratch (e.g., by annotating the anomalies in a dataset). Finally, we plan to create automatic property extractors (e.g., for assigning to each anomaly the proper type [8]).

**Author Contributions:** Conceptualization, N.Z., R.N.T., F.M. and P.F.; Methodology, N.Z., R.N.T., F.M. and P.F.; Validation, N.Z., R.N.T., F.M. and P.F.; Writing—original draft, N.Z., R.N.T., F.M. and P.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by the project "PRECEPT—A novel decentralized edge-enabled PREsCriptivE and ProacTive framework for increased energy efficiency and well-being in residential buildings" funded by the EU H2020 Programme, grant agreement No. 958284.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**

