*Proceeding Paper* **Cloud-Base Height Estimation Based on CNN and All Sky Images †**

**Emanuele Ogliari 1,\*,‡ , Alfredo Nespoli 1,‡ , Elena Collino 2,‡ and Dario Ronzio 2,‡**


**Abstract:** Among several meteorological parameters, Cloud-Base Height is employed in many applications to provide operational and real-time cloud-base information to the aviation industry, to initialize Numeric Weather Prediction models and to validate climate models. Moreover, Cloud-Base Height is also useful in the nowcasting (very short-term forecasting) of solar radiation. As cloud movements mainly affect the solar irradiance availability, their characterization is extremely important for solar power applications; an accurate estimation of the ground shadowing requires the knowledge of cloud height and extent. In the present work, the Cloud-Base Height value is estimated starting from sky images acquired from a single All Sky Imager. In order to fulfill this task, a Convolutional Neural Network model is chosen and developed.

**Keywords:** convolutional neural network; CNN; all sky images; cloud-base height; machine learning

### **1. Introduction**

Cloud-Base Height (CBH) estimation is an increasingly crucial task: this parameter is required in many applications [1]. It is exploited to validate climate models [2] and to improve Numeric Weather Prediction (NWP) models [3]; for instance, it improves the definition of the cloud-drift vector field, allowing a more effective modeling of the atmosphere dynamics [4]. Moreover, CBH is also useful in the nowcasting (very short-term forecasting) of solar resources [5] and photovoltaic power plant energy outputs [6]. As clouds are the primary cause of intermittency in solar irradiance, they are of interest for solar power applications: an accurate calculation of the ground shadowing requires the knowledge of cloud height and extent [7].

In the scientific literature, several studies are aimed at estimating CBH through different procedures. In [1], seven All Sky Imagers (ASIs) belonging to the Eye2Sky ASI network, located in the city of Oldenburg, are exploited to estimate CBH. In detail, an independent CBH estimation is derived considering each possible pair of ASIs, and then all the different estimations are properly merged into a final and more reliable one. In [8], the authors combined infrared satellite images and spectral information derived from meteorological sounders to improve the accuracy of the cloud height estimation task. In [9], a method for CBH estimation was developed starting from successions of images captured by ground-based imagers with a hemispherical view of the sky. In [10], a model capable of detecting and tracking individual clouds, aimed at creating a 3D model providing cloud attributes such as height, position, size, optical properties and motion, was developed. In [11], a newly developed temporal height-tracking (THT) algorithm applied to the backscatter profiles of two
ceilometers led to retrieved cloud bases that are statistically consistent with each other and ensured reliable detection of CBH, particularly when inhomogeneous cloud fields were present and changing rapidly in time. In situ measurements of cloud properties are essential, but they are quite expensive and typically limited in time and spatial location. On the contrary, Machine Learning-based models are capable of overcoming physical model limitations, as shown for solar power plant energy estimation [12]. A CNN is a classification model specifically designed to detect patterns within images. Since the goal of the work was to evaluate the possibility of using All Sky Imagers to detect the Cloud-Base Height, and considering the complexity of cloud characteristics, this model was considered well suited to this kind of problem. Therefore, the objective of the present work is to estimate, with a Machine Learning algorithm, the CBH value starting from sky images acquired from a single ASI. In order to fulfill this task, a Convolutional Neural Network (CNN) model was chosen and developed.

#### **2. Convolutional Neural Network**

The method selected to fulfill the objective of the current work is the so-called CNN, a classification model specifically designed to detect patterns within images. In the following, the model is first described from a theoretical point of view. Then, all the characteristics of the model implemented in the present work are discussed and explained.

#### *2.1. General Description*

CNNs are classification Machine Learning models, nowadays involved, for example, in image search services, self-driving cars, automatic video classification systems, etc. Moreover, their utilization is not restricted to visual tasks: they power many other applications such as voice recognition or natural language processing.

The structure of CNNs derives from studies of the brain's visual cortex. Several studies and experiments demonstrated that neurons dedicated to vision present a small local receptive field, hence they process only information deriving from a limited region of the visual field. Moreover, the receptive fields of different neurons may overlap, and together they cover the entire visual field. This structure, capable of detecting complex patterns in any region of the visual field, inspired researchers to develop a Neural Network architecture that gradually evolved into the current CNN.

In further detail, the typical CNN structure consists of a sequence of convolutional and pooling layers: convolutional layers apply a set of learnable filters to the input, producing feature maps that highlight local patterns, while pooling layers downsample these feature maps, reducing their spatial resolution and making the extracted features more robust to small translations.


After being processed in the cascade of convolutional and pooling layers, the information flow is flattened, i.e., it is structured in a suitable format to be further processed. The last step of the classification process takes place in one or more dense layers, providing the final output.

#### *2.2. Adopted Structure*

The CNN structure adopted is represented in Figure 1. The combination of a convolutional and a pooling layer is applied twice in order to obtain a reasonably deep feature extraction from the input images. Then, the information flow is flattened and delivered to the final dense layer, aimed at providing the output label corresponding to an unlabeled input image.

**Figure 1.** Adopted CNN structure.

Some characteristic parameters of the CNN structure need to undergo an optimization procedure in order to achieve the best possible performance on the considered case study. In detail, a sensitivity analysis is carried out on the number of filters involved in the convolutional layers, while the dimension of the dense layer is kept fixed at 16 hidden neurons. This study is aimed at selecting the configuration representing the optimal compromise between model accuracy and complexity. The results of the sensitivity analysis depend primarily on the amount of data involved in training and are reported in the results section.
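For illustration, the following is a minimal sketch of the adopted structure in Keras, assuming 256 × 256 RGB inputs, 64 filters per convolutional layer (the value later selected by the sensitivity analysis), a 16-neuron dense layer, and 3 output classes; the kernel size, pooling size, activations, and optimizer are assumptions not specified in the paper.

```python
# Minimal sketch of the adopted two-block CNN (assumed 3x3 kernels, 2x2 max pooling,
# 256x256 RGB input, 64 filters per convolutional layer, 16 hidden neurons, 3 classes).
from tensorflow import keras
from tensorflow.keras import layers

def build_cbh_cnn(n_filters=64, n_classes=3):
    model = keras.Sequential([
        keras.Input(shape=(256, 256, 3)),
        # First convolution + pooling block
        layers.Conv2D(n_filters, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        # Second convolution + pooling block
        layers.Conv2D(n_filters, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        # Flatten and classify
        layers.Flatten(),
        layers.Dense(16, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cbh_cnn()
model.summary()
```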

#### **3. Case Study and Available Data**

The CBH is a crucial parameter in the characterization of cloud features and, consequently, in the determination of the attenuation of solar radiation at the ground. Unfortunately, the instruments able to measure this parameter objectively (ceilometers) are few and very expensive. The alternative considered in this work is the exploitation of the atmospheric soundings performed at some airports by means of radio soundings. These measurements are made at a low frequency (generally twice a day, at 00 and 12 UTC) because of the cost of launching the weather balloon that carries the battery-powered telemetry instrument. In the presented case study, a further constraint was the need for a radio sounding in proximity of the All Sky Imager. For these reasons, the observed dataset is not large and it has been necessary to consider a reduced number of classes; with a larger dataset of CBH measurements obtained from a ceilometer, it would be possible to increase the number of classes.

Furthermore, in order to train a supervised Machine Learning model, a proper set of data is required. In detail, it is necessary to have target data, i.e., the quantity that the model is going to estimate, and input data, i.e., the information on which the estimation is performed. As input data, sky images from an ASI are used; as target data, information derived from radio soundings is available.

#### *3.1. Input Data*

All available sky images are acquired through a whole-sky camera and constitute the input used to perform the estimation. The images cover two time periods, one from May to September 2020 and one from January 2014 to February 2015, and have an initial resolution of 1124 × 1124 pixels. In order to reduce the computational burden of the estimation algorithm, the images are downsampled to a 256 × 256 pixel resolution.
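As a minimal sketch of this preprocessing step, the downsampling can be performed with the Pillow library; the file name is hypothetical and the resampling filter is an assumption.

```python
# Sketch: downsample an all-sky image from 1124x1124 to 256x256 (file path is hypothetical).
from PIL import Image

img = Image.open("asi_2020-05-01_1200.jpg")          # original 1124x1124 image
img_small = img.resize((256, 256), Image.BILINEAR)   # reduce the computational burden
img_small.save("asi_2020-05-01_1200_256.jpg")
```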

#### *3.2. Target Data*

Target data consist of the CBH of the cloud cover depicted in each image. This value is computed starting from the Pressure of Lifted Condensation Level (PLCL) values, recorded through a radio sounding sensor. However, PLCL alone is not enough to properly estimate the CBH value because the atmospheric pressure is not constant throughout the year. To address this issue, the difference between the pressure at sea level and the PLCL, representative of the height of the bottom layer of the cloud, is used to estimate the CBH target value. In the presented case study, radio soundings are carried out once a day, at 12:00; therefore, the time frames useful for CNN training, i.e., those where both the PLCL value and the corresponding sky image are available, are not numerous: only 109 sky images with a corresponding target label are available. The CNN is applied, in the present work, as a classification method, while the CBH is a continuous variable; this means it cannot be used as a target variable as it is. However, it is possible to divide the range of the registered CBH values into classes, each corresponding to a range of values, and to train the model to assign each image to its corresponding class. Therefore, the available dataset is divided into 3 classes. The identification of more than 3 classes leads to a critical issue: the number of samples inside each group becomes too small, strongly affecting the classification accuracy. In that case, the algorithm would struggle to recognize the classes because it does not have a sufficient number of examples to infer their characteristic patterns. On the contrary, if only 2 classes were defined, the range of CBH values corresponding to each class would become too large, reducing the usefulness of the classification.

CBH classes, within the interval between the minimum (670) and the maximum (970) registered values, can be defined according to different strategies, leading to different partitionings of the samples. The strategies considered in the current work, and the relevant thresholds, are represented in Figure 2 and are described as:

• Linear: the class boundary thresholds are evenly spaced between the minimum and the maximum registered CBH values;
• Logarithmic: the class boundary thresholds are evenly spaced on a logarithmic scale;
• Equal number of samples: the class boundary thresholds are chosen so that each class contains the same number of samples.


**Figure 2.** Different strategies for CBH classes definition and relevant thresholds: (**a**) linear; (**b**) logarithmic; (**c**) equal number of samples.
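A minimal sketch of the three partitioning strategies, assuming a NumPy array of registered CBH values and the reported minimum (670) and maximum (970), is given below; the simulated values are only placeholders.

```python
# Sketch of the three class-definition strategies over the registered CBH range;
# np.digitize assigns each sample to one of the 3 classes.
import numpy as np

rng = np.random.default_rng(0)
cbh = rng.uniform(670, 970, size=109)        # hypothetical CBH values (109 labelled samples)
lo, hi = 670, 970

# (a) Linear: thresholds evenly spaced between the minimum and the maximum.
linear_edges = np.linspace(lo, hi, 4)[1:-1]
# (b) Logarithmic: thresholds evenly spaced on a logarithmic scale.
log_edges = np.geomspace(lo, hi, 4)[1:-1]
# (c) Equal number of samples: thresholds at the empirical terciles.
quantile_edges = np.quantile(cbh, [1 / 3, 2 / 3])

for name, edges in [("linear", linear_edges),
                    ("logarithmic", log_edges),
                    ("equal samples", quantile_edges)]:
    labels = np.digitize(cbh, edges)         # class index 0, 1 or 2 per sample
    print(name, edges.round(1), np.bincount(labels, minlength=3))
```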

The different partitioning strategies lead to a different number of samples in each class, as reported in Table 1:


**Table 1.** Number of samples in each class according to different partitioning strategies.

#### *3.3. Oversampling*

CNNs require a large amount of input images in order to be trained. However, in our case study, the available dataset is not that large. To address this issue, oversampling is performed to obtain additional input samples, useful to improve the training process of the model. Assuming a negligible time variation of the vertical profile of the atmosphere, it is possible to assign to a single radio sounding all the images acquired in the surrounding time frames, as graphically represented in Figure 3.

**Figure 3.** Oversampling strategy adopted to enlarge the amount of training data available.
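A minimal sketch of this oversampling step is given below, assuming a ±90 min window around each sounding and simple pandas tables for soundings and image timestamps; the window width and table layout are assumptions.

```python
# Sketch: assign the 12:00 UTC sounding label to all images within a +/- 90 minute
# window of the sounding time (window width and dataframe layout are assumptions).
import pandas as pd

soundings = pd.DataFrame({            # one labelled sounding per day (hypothetical)
    "time": pd.to_datetime(["2020-05-01 12:00", "2020-05-02 12:00"]),
    "cbh_class": [1, 2],
})
images = pd.DataFrame({               # all-sky image timestamps (hypothetical)
    "time": pd.to_datetime(["2020-05-01 11:00", "2020-05-01 12:30",
                            "2020-05-02 11:45", "2020-05-02 14:00"]),
    "file": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
})

# merge_asof pairs each image with the nearest sounding within the tolerance window
oversampled = pd.merge_asof(images.sort_values("time"),
                            soundings.sort_values("time"),
                            on="time",
                            direction="nearest",
                            tolerance=pd.Timedelta("90min")).dropna(subset=["cbh_class"])
print(oversampled)
```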

The impact of oversampling on class population is reported in Table 2:

**Table 2.** Samples per class of the adopted partitioning strategy after oversampling.


#### **4. Results**

The performance evaluation is a crucial step to assess the capability of the model to correctly identify the target classes. Here, the classification performances of the CNN are presented and discussed according to specific evaluation metrics, for each of the class definition strategies previously described in Section 3.2.

#### *4.1. Evaluation Metrics*

The evaluation metrics adopted in order to assess the model performances are presented in the following. The Global Accuracy (*A*) measures "how good" a classification model is, returning the fraction of right predictions. It is calculated as in Equation (1):

$$A = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

where:

• *TP* (True Positives) are samples belonging to class *C* that are correctly classified as belonging to class *C*;
• *TN* (True Negatives) are samples not belonging to class *C* that are correctly classified as not belonging to class *C*;
• *FP* (False Positives) are samples not belonging to class *C* that are incorrectly classified as belonging to class *C*;
• *FN* (False Negatives) are samples belonging to class *C* that are incorrectly classified as belonging to a different class.

Precision, also denoted as Positive Predictive Value (PPV), for a class *C* is calculated as in Equation (2):

$$\text{Precision} = \text{PPV} = \frac{TP}{TP + FP} \tag{2}$$

Recall, also denoted as sensitivity, for a class *C* is calculated as it is stated in the Equation (3):

$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN} \tag{3}$$
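As a minimal sketch, these metrics can be computed directly from a confusion matrix (cell (i, j) counting class-i samples assigned to class j); the matrix values below are hypothetical.

```python
# Sketch: global accuracy, per-class precision (PPV) and recall from a confusion matrix.
import numpy as np

cm = np.array([[5, 3, 0],      # hypothetical 3-class confusion matrix
               [2, 12, 1],
               [0, 4, 2]])

accuracy = np.trace(cm) / cm.sum()           # fraction of correct predictions
precision = np.diag(cm) / cm.sum(axis=0)     # TP / (TP + FP), column-wise
recall = np.diag(cm) / cm.sum(axis=1)        # TP / (TP + FN), row-wise

print(f"accuracy = {accuracy:.2f}")
print("precision per class:", precision.round(2))
print("recall per class:", recall.round(2))
```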

#### *4.2. Linear Classes Definition*

In this case, the classes are defined according to the linear strategy, meaning that the class boundary thresholds are evenly spaced between the minimum and the maximum CBH values, as previously described in Section 3.2. The sensitivity analysis carried out to identify the optimal network structure indicates an optimal number of filters equal to 64 for both the first and the second convolutional layers. Figure 4 depicts the confusion matrix representing the classification performances in the considered case. A generic cell in row *i* and column *j* represents the number of samples belonging to class *i* that are assigned to class *j* during classification. Table 3 reports the classification performance evaluated through the metrics previously defined.

**Figure 4.** Confusion matrix for the Linear Classes Definition.



In this case, the developed classification model is unable to recognize two classes (classes 0 and 2) out of three. As a matter of fact, the only class appearing in the model output is class 1, i.e., the most numerous one. The performance metrics are consistent with the results represented by the confusion matrix.

#### *4.3. Logarithmic Classes Definition*

In this case, the classes are defined according to the logarithmic strategy. The sensitivity analysis carried out in order to identify the optimal network structure indicates the number of filters equal to 64 for both the first and the second convolutional layers.

Figure 5 depicts the confusion matrix representing the classification performances in the considered case. A generic cell in row *i* and column *j* represents the number of samples belonging to class *i* that are assigned to class *j* during classification. Table 4 represents the classification performance evaluated through the metrics previously defined.

**Figure 5.** Confusion matrix for the Logarithmic Classes Definition.



Unlike the previous case, the CNN recognizes the presence of two classes (classes 0 and 1) instead of only one. A small number of samples are assigned to class 0, but only a couple of them really belong to that class. The evaluation metrics coherently highlight the small number of correct classifications in class 0.

#### *4.4. Classes with Equal Number of Samples*

In this case, the classes are defined in a way that an equal number of samples belongs to each class. The sensitivity analysis carried out in order to identify the optimal network structure indicates the number of filters equal to 64 for both the first and the second convolutional layers.

Figure 6 depicts the confusion matrix representing the classification performances in the considered case. A generic cell in row *i* and column *j* represents the number of samples belonging to class *i* that are assigned to class *j* during classification. Table 5 represents the classification performance evaluated through the metrics previously defined.

**Figure 6.** Confusion matrix for classes defined with an equal number of samples.


This last case demonstrates that, even with an equal number of samples, the classification performances are slightly worse than with the logarithmic class definition. Once again, the CNN recognizes the presence of only two classes out of three, and a large number of test samples are misclassified, as highlighted also by the performance metrics. Finally, the Global Accuracy (*A*) for the different class definitions, together with the other evaluation metrics adopted in this work, is reported in Table 6. In this case study, the logarithmic partition of the samples within classes scores the best result in terms of global accuracy (0.63), even if the limited amount of samples strongly affects the classification precision (PPV). On the contrary, the linear class definition shows the worst classification results, with a global accuracy equal to 0.4, and it is unable to classify samples belonging to class 0 and class 2, while the Equal Number of Samples strategy yields results (0.6) almost comparable to the logarithmic partition. Finally, samples in class 2 are incorrectly classified, presumably due to the lack of samples belonging to that class, especially in the linear partition strategy.

**Table 6.** Classification performances for the different class definitions.


#### **5. Conclusions**

The present work aims at estimating the CBH (Cloud-Base Height) value through a CNN (Convolutional Neural Network) model processing sky images acquired from a single ASI. The CBH value for each of the training images is estimated starting from the PLCL (Pressure of Lifted Condensation Level) value recorded by radio soundings and the pressure at sea level. Moreover, the model output was not a specific CBH value but a class corresponding to a range of possible CBH values. In total, three classes were defined according to different strategies, namely: linear, logarithmic, and an equal number of samples in each group. In order to increase the number of available training samples, an oversampling procedure was carried out. The final best classification accuracy is 63%, obtained with the logarithmic class definition. The reduced number of samples does not allow generalized conclusions to be drawn, even if the classification obtained with the logarithmic class definition seems the most promising. Future works will aim at adding more data and will combine the information contained in sky images with additional exogenous parameters (e.g., from sensors located in situ) to further improve the accuracy of the model.

**Author Contributions:** Conceptualization, E.O., A.N., E.C., and D.R.; methodology, E.O. and A.N.; software, A.N.; validation, E.C. and D.R.; formal analysis, E.O.; investigation, E.O. and A.N.; resources, E.C. and D.R.; data curation, A.N.; writing—original draft preparation, E.O.; writing—review and editing, E.C. and D.R.; visualization, E.O., A.N., E.C., and D.R.; supervision, E.O., E.C., and D.R.; project administration, E.O. and E.C.; funding acquisition, E.C. and D.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work has been partly financed by the Research Fund for the Italian Electrical System with the Decree of 16 April 2018.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **A Hybrid Model of VAR-DCC-GARCH and Wavelet Analysis for Forecasting Volatility †**

**Maryam Nafisi-Moghadam and Shahram Fattahi \***

Department of Economics, Razi University, Kermanshah 6714414971, Iran; nafisi1988@gmail.com

**\*** Correspondence: sh\_fatahi@yahoo.com

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** The purpose of this study is to investigate the time-varying co-movement between the volatility of gold, exchange rate, and stock market returns in Iran, using weekly data from 27 September 2013 to 3 December 2021. The results of the wavelet-based random forest show that the performance of the VAR-DCC-GARCH model is better than that of the DCC-GARCH model in predicting financial market volatilities. Furthermore, the results of the VAR-DCC-GARCH model indicate that a positive and relatively high conditional correlation exists between the daily exchange rate and gold-return volatilities. The conditional correlation is lower between the exchange rate–stock market returns and the daily gold–stock market returns. In the short and long terms, there is no correlation between the exchange rate and the stock market volatilities, nor between the gold and stock market volatilities, while the correlation between the paired markets exists in the medium term.

**Keywords:** financial market volatility; VAR-DCC-GARCH; wavelet-based random forest; forecasting

**Citation:** Nafisi-Moghadam, M.; Fattahi, S. A Hybrid Model of VAR-DCC-GARCH and Wavelet Analysis for Forecasting Volatility. *Eng. Proc.* **2022**, *18*, 6. https:// doi.org/10.3390/engproc2022018006

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 20 June 2022


#### **1. Introduction**

Among the important topics of financial economics are the modeling of volatility, the interdependence of volatility in financial markets, and forecasting. Volatility in financial markets occurs when these markets are at risk [1]. With the spread of information in financial markets, the correlation between markets and their volatility has increased. Investors predict price changes based on the price changes of other markets. In other words, an increase in the return of a market may change the return of the other markets at the same time. This change is called the "Spillover Effect". Therefore, besides accurate modeling and forecasting of financial market volatility, examining their dependence can help investors, risk managers, and traders to create sustainable profits [2].

The exchange rate is determined by the trade balance or current account balance of countries. Since exchange-rate changes affect international competition and the trade balance, they affect real incomes and input prices. An increase in input costs leads to a decrease in firms' profits and a decline in stock prices [3]. On the other hand, the exchange rate is determined by the supply and demand of financial assets such as equities and bonds, since expectations of relative changes in the domestic currency have a significant effect on the price of financial assets. Bodnar and Gentry [4] show that the relationship between firm profitability and exchange-rate changes depends on the nature and type of industrial activity. An increase in the exchange rate leads to a decline in the price of exported goods in the importing country, and if demand is high in the importing country, exports will increase and the exporting company will benefit. Conversely, an increase in the exchange rate makes imported goods more expensive and reduces the value of the importing company (unless importing firms can pass the price increase on to the end consumer). The empirical literature provides conflicting findings on the dynamic relationship between the exchange rate and the stock market. Aggarwal [5], Kim [6], Doong et al. [7], and Dahir et al. [3] showed a positive relationship between the exchange rate and the stock market index. Erdogan et al. [8] showed that there were significant volatility spillover effects from the Islamic stock market to the foreign exchange market. According to Gavins [9] and Zhao [10], there is no relationship between exchange rates and stock prices. Hashim et al. [11] showed a negative relationship between the exchange rate and the stock market.

Since individuals hold their financial assets in a combination of cash, gold, currency, bonds, and equities, changes in each market can affect individuals' investment portfolios. Rising inflation, exchange-rate fluctuations, or political instability can lead to some form of market excitement and increase demand for gold. Various studies have been conducted on the relationship between gold prices and stock indices, some of which are as follows: Gilmore et al. [12] showed that there is a cointegration between the gold price, the stock-market price of gold, and the US stock index. Mishra et al. [13] showed that there is no correlation between the gold price and the stock index in India. Srinivasan [14] investigated the causal nexus between gold price, stock price, and exchange rate in India using ARDL and showed that there exists no causality from gold price to stock price, or vice versa, in the short run. Arouri et al. [15] examined the relationship between the global price of gold and the Chinese stock market over the period from 22 March 2004 to 31 March 2011. For this purpose, they used VAR-GARCH and other multivariate GARCH models such as DCC and CCC. Their results show that past gold returns play an important role in explaining the dynamics of conditional returns and stock-market volatility. Jain and Biswal [16], using a DCC-GARCH model, showed that falling gold and crude oil prices were causing the Indian stock market to decline. Yousaf and Ali [17] showed that there are return and volatility spillovers between the gold and emerging Asian stock markets during the global financial crisis. Moreover, several studies in the literature exist about the relationship between gold and the exchange rate. Sjaastad [18] found that since the dissolution of the Bretton Woods international monetary system, floating exchange rates among the major currencies have been a major source of price instability in the world gold market. Qureshi et al. [19], using a wavelet approach, showed that gold acts as a consistent short-run hedge against the exchange rate. Wang et al. [20] argued that gold can partly hedge against the depreciation of the currency in the long run.

The purpose of this paper is to investigate the correlation between the volatility of gold, foreign exchange, and stock markets in Iran, using weekly average data from 27 September 2013 to 3 December 2021. For this purpose, this study compares two approaches, VAR-DCC-GARCH and DCC-GARCH, using an out-of-sample evaluation by the wavelet-based random forest (WRF) model. Then, the co-movement of the volatility is analyzed using a continuous wavelet transform (CWT) across timescales. Iran has been subjected to sanctions, a sharp increase in liquidity, and structural problems that caused inflation; therefore, investors generally turned to investing in stocks, buying foreign currency, and gold. Given the increasing volatility in the underlying markets, this paper contributes to the literature on the Iranian financial market in the following ways:

• it compares the out-of-sample forecasting performance of the VAR-DCC-GARCH and DCC-GARCH models for the Iranian gold, foreign exchange, and stock markets through a wavelet-based random forest evaluation;
• it analyzes the time-varying co-movement between the volatilities of these markets in the short, medium, and long term using continuous wavelet analysis.


#### **2. Materials and Methods**

#### *2.1. VAR-DCC-GARCH*

In the VAR-DCC-GARCH model, the VAR method is used to estimate the conditional mean equation of the DCC-GARCH model. In this study, we consider a VAR(1) conditional mean equation, according to the Schwarz and Akaike criteria. The conditional mean equation is written as Equation (1):

$$R\_t = \mu + \Phi R\_{t-1} + e\_t, \quad e\_t = D\_t^{\frac{1}{2}} \eta\_t \tag{1}$$

$R\_t$ is the 4 × 1 vector of returns of each variable, *t* is time, *μ* denotes a 4 × 1 vector of constants, and

$$\Phi = \begin{pmatrix} \phi\_{11} & \phi\_{12} & \phi\_{13} & \phi\_{14} \\\\ \phi\_{21} & \phi\_{22} & \phi\_{23} & \phi\_{24} \\\\ \phi\_{31} & \phi\_{32} & \phi\_{33} & \phi\_{34} \\\\ \phi\_{41} & \phi\_{42} & \phi\_{43} & \phi\_{44} \end{pmatrix}$$

is a 4 × 4 matrix of parameters measuring the influence of each variable's own lag and the lags of the other variables. $e\_t$ is the residual of the mean equation for the four return variables, $\eta\_t$ indicates a 4 × 1 vector of independently and identically distributed random vectors, and

$$D\_t^{1/2} = \operatorname{diag}\left(\sqrt{h\_t^{\text{oil}}}, \sqrt{h\_t^{\text{TEPIX}}}, \sqrt{h\_t^{\text{er}}}, \sqrt{h\_t^{\text{gold}}}\right) \tag{2}$$

where $h\_t$ indicates the conditional variance of each return variable. The conditional variance equation of the DCC-GARCH model is written as Equation (3):

$$H\_t = D\_t R\_t D\_t \tag{3}$$

The $H\_t$ matrix is an MGARCH process that can be modeled using dynamic conditional correlation (DCC) [21,22]. According to Engle's [21] model, a DCC model is defined as in Equation (4):

$$R\_t = \operatorname{diag}\left( q\_{11,t}^{-\frac{1}{2}}, \dots, q\_{NN,t}^{-\frac{1}{2}} \right) \varrho\_t \operatorname{diag}\left( q\_{11,t}^{-\frac{1}{2}}, \dots, q\_{NN,t}^{-\frac{1}{2}} \right) \tag{4}$$

where $\varrho\_t$ is the N × N symmetric positive definite matrix and

$$
\varrho\_t = (1 - \alpha - \beta)\overline{\varrho} + \alpha u\_{t-1} u\_{t-1}' + \beta \varrho\_{t-1} \tag{5}
$$

where $u\_{it} = \varepsilon\_{it} / \sqrt{h\_{iit}}$, $\overline{\varrho}$ is the unconditional variance matrix of $u\_t$, and *α* and *β* are nonnegative scalar parameters that satisfy *α* + *β* < 1.
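As a minimal illustration of the first stages of this model, the sketch below fits a VAR(1) conditional mean with statsmodels and univariate GARCH(1,1) models to the residuals with the `arch` package; the DCC correlation step itself is not shown, and the column names and simulated returns are assumptions.

```python
# Sketch of the mean/variance stages: a VAR(1) conditional mean (Eq. (1)) and
# univariate GARCH(1,1) models on the residuals (Eq. (2)); the DCC step is omitted.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
from arch import arch_model

rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.standard_normal((400, 3)),
                       columns=["TEPIX", "USD", "gold"])   # weekly returns (simulated)

var_fit = VAR(returns).fit(1)            # VAR(1) mean equation
residuals = var_fit.resid

cond_vol = {}
for col in residuals.columns:
    garch_fit = arch_model(residuals[col], vol="GARCH", p=1, q=1).fit(disp="off")
    cond_vol[col] = garch_fit.conditional_volatility   # sqrt(h_t) for each series

cond_vol = pd.DataFrame(cond_vol)
print(var_fit.coefs[0])                  # estimated Phi matrix
print(cond_vol.tail())
```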

#### *2.2. Wavelet-Based Random Forest*

In this paper, we consider wavelet-based random forest as a prediction approach. WRF is built based on maximal overlap discrete wavelet transform (MODWT) function and random forest algorithm learning approach. The model structure is flexible and efficient [23].

The wavelet-based random forest (WRF) is based on the method originally proposed by Breiman [24]. The WRF has two stages: the first stage is bootstrap aggregation, or bagging, which is performed with two-thirds of the inputs. Second, for each node of an individual decision tree, a random subset of the predictor variables is chosen, and the variable providing the best split is used to grow the tree. To identify the best split, the Gini impurity function, first introduced by Breiman et al. [25], is used to minimize the impurity and maximize the homogeneity of the selected subsets. Each tree undergoes this process of selecting the best subset and minimizing impurity until it is halted by reaching the best possible training performance for each training process [26].

We use this model to select the best model for evaluating volatilities. The WRF performance is measured with the following error metrics: mean-squared error (MSE), root-mean-squared error (RMSE), the correlation coefficient (R), and mean-absolute error (MAE), defined in Equations (6)–(9):

$$MSE = \frac{1}{n} \sum\_{i=1}^{n} \left(O\_i - P\_i\right)^2 \tag{6}$$

$$RMSE = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left(O\_i - P\_i\right)^2} \tag{7}$$

$$R = \frac{\sum\_{i=1}^{n} \left(O\_{i} - O\_{avg}\right) \times \left(P\_{i} - P\_{avg}\right)}{\left(\sum\_{i=1}^{n} \left(O\_{i} - O\_{avg}\right)^{2}\right)^{0.5} \times \left(\sum\_{i=1}^{n} \left(P\_{i} - P\_{avg}\right)^{2}\right)^{0.5}} \tag{8}$$

$$MAE = \frac{1}{n} \sum\_{i=1}^{n} |O\_i - P\_i| \tag{9}$$
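A minimal sketch of a wavelet-based random forest forecaster is given below. Since PyWavelets does not expose MODWT directly, the stationary wavelet transform (`pywt.swt`) is used here as a stand-in for the wavelet stage, and the lag-based feature construction and the simulated volatility series are assumptions.

```python
# Sketch of a wavelet-based random forest: wavelet decomposition of the volatility
# series, then a random forest trained on lagged wavelet coefficients.
import numpy as np
import pywt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
vol = np.abs(rng.standard_normal(256)).cumsum() / 50       # simulated volatility series

# Wavelet stage: 3-level stationary wavelet transform (series length must divide 2**level)
coeffs = pywt.swt(vol, "db4", level=3)                      # list of (cA, cD) pairs
features = np.column_stack([c for pair in coeffs for c in pair])

lag = 1                                                     # predict t from coefficients at t-1
X, y = features[:-lag], vol[lag:]
split = int(0.8 * len(y))

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[:split], y[:split])
pred = rf.predict(X[split:])

mse = mean_squared_error(y[split:], pred)
mae = mean_absolute_error(y[split:], pred)
print(f"MSE={mse:.4f}  RMSE={np.sqrt(mse):.4f}  MAE={mae:.4f}")
```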

#### *2.3. Continuous Wavelet Analysis*

The continuous wavelet transform is a method used to investigate the relationship between two time series. In mathematics, the continuous wavelet transform of a square-integrable time series x(t) is defined, for a scale a > 0 and a location b ∈ R, as in Equation (10):

$$X(a,b) = \frac{1}{\sqrt{a}} \int\_{-\infty}^{\infty} x(t) \psi^\* \left(\frac{t-b}{a}\right) dt\tag{10}$$

In this relation, *ψ*(*t*) is a continuous function in both the time and frequency domains. We use the Morlet wavelet to define the *ψ*(*t*) function as follows:

$$
\psi^M(t) = \frac{1}{\pi^{\frac{1}{4}}} \left( e^{i\omega t} - e^{-\omega^2/2} \right) e^{-t^2/2} \tag{11}
$$

where *ψM*(*t*) is the continuous Morlet function, *t* is time, and *ω* is frequency.
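As a minimal sketch, the continuous wavelet transform of a single return series can be computed with PyWavelets and a Morlet wavelet; the scales and the simulated series are illustrative, and the cross-wavelet coherence between two series (used in Section 3.2) additionally requires smoothing of the cross-spectrum, which is not shown here.

```python
# Sketch: continuous wavelet transform of a return series with a Morlet wavelet (Eqs. (10)-(11)).
import numpy as np
import pywt

rng = np.random.default_rng(0)
returns = rng.standard_normal(430)              # ~430 weekly observations (simulated)

scales = np.arange(1, 65)                       # short- to long-term scales
coefs, freqs = pywt.cwt(returns, scales, "morl", sampling_period=1.0)

power = np.abs(coefs) ** 2                      # wavelet power spectrum, scale x time
print(power.shape)                              # (64, 430)
```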

#### **3. Discussion**

The study employs weekly data on the Tehran Price Index, gold price, and Unofficial Free Exchange Rate (US Dollar) in the period from 27 September 2013 to 3 December 2021. The weekly return of the above variables can be defined as follows:

$$y\_t = 100\left[\ln(p\_t) - \ln(p\_{t-1})\right] \tag{12}$$

where *pt* represents each variable under study. The descriptive statistics for the return series are shown in Table 1. Accordingly, the mean returns on TEPIX, USD, and gold are 0.87, 0.55, and 0.636, respectively. All of the returns series exhibit positive skewness and asymmetry. According to the Jarque–Bera test, the null of normality is strongly rejected for all returns series.
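A minimal sketch of Equation (12) and of the normality check is given below; the simulated price series is a placeholder for the weekly TEPIX, USD, or gold prices.

```python
# Sketch: weekly log returns as in Eq. (12) and the Jarque-Bera normality test.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(rng.standard_normal(430).cumsum() * 0.02))  # simulated prices

y = 100 * np.log(prices).diff().dropna()        # y_t = 100 [ln(p_t) - ln(p_{t-1})]

res = stats.jarque_bera(y)
print(y.describe())
print(f"skewness={y.skew():.3f}  kurtosis={y.kurtosis():.3f}  JB p-value={res.pvalue:.4f}")
```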


**Table 1.** Descriptive statistics.

The results of Pearson's correlation coefficient are shown in Table 2. Accordingly, the unconditional correlations between the TEPIX–USD, TEPIX–gold, and USD–gold returns are positive. Since this analysis is straightforward and does not consider the effect of time on the correlation between the return series, the conditional correlation method is used.


**Table 2.** Pearson's correlation coefficient.

This study employs the VAR-DCC-GARCH and DCC-GARCH methods to calculate the volatility of all returns. In order to estimate the VAR-DCC-GARCH model, it is necessary to determine the optimal lag. In this research, a VAR(1) is chosen for the mean equation using the Akaike and Schwarz Bayesian criteria. The results of this estimation are reported in Table 3.

**Table 3.** Determination of optimal Lag.


The results of estimating the mean equation are presented in Table 4. *ϕ*11, *ϕ*22, and *ϕ*33 are the coefficients of own-mean spillover. *ϕ*11 and *ϕ*33 are significantly positive. In other words, the lag of the return of the time series has a direct effect on the current returns of the total stock index TEPIX and of gold. *ϕ*22 is significantly positive. This is not unexpected, because the exchange rate is controlled by the Iranian government. The coefficients *ϕ*21 and *ϕ*31 indicate unidirectional and positive return spillover from USD and gold to TEPIX. In other words, when the prices of USD and gold rise, investors tend to increase their investment in the stock market. Moreover, the gold–USD return results (*ϕ*23; *ϕ*32) indicate unidirectional and positive return spillover from USD to gold. This indicates that when the returns of USD increase, investors tend to increase their investment in gold as well.


**Table 4.** Estimates of conditional mean VAR-DCC-GARCH model.

In the following, the variance equations of the VAR-DCC-GARCH and DCC-GARCH are estimated.

Table 5 presents the results of the variance equations of the VAR-DCC-GARCH and DCC-GARCH models. In this study, Ω1, Ω2, and Ω3 indicate the intercepts for TEPIX, USD, and gold, respectively. The GARCH parameters (α and β) in the two models are statistically significant, and the α and β coefficients in the estimated equations are non-negative and satisfy the condition α + β < 1. The sum of the coefficients in the two models is close to unity, implying that shocks are strongly persistent. In other words, this condition guarantees that the volatility of the previous period affects the volatility of the current period.


**Table 5.** Estimates of conditional variance VAR-DCC-GARCH model.

For the conditional correlation analysis, we use the conditional correlations estimated by the VAR-DCC-GARCH model between the returns of the variables. Figures 1–3 show the dynamic conditional correlation trends between USD–TEPIX, gold–TEPIX, and USD–gold, respectively.

**Figure 1.** Dynamic conditional correlation between USD–TEPIX in VAR-DCC-GARCH model.

**Figure 2.** Dynamic conditional correlation between gold–TEPIX in VAR-DCC-GARCH model.

**Figure 3.** Dynamic Conditional Correlation between USD–gold in VAR-DCC-GARCH model.

The dynamic conditional correlation between USD–TEPIX lies in the range of −0.2657 to 0.7558, with a mean correlation coefficient of 0.1068; these correlations are positive for most years. Figure 2 shows the dynamic conditional correlation between the returns of gold and TEPIX. The mean correlation coefficient in the period under study is 0.12098, and the highest and lowest correlation coefficients between gold–TEPIX are 0.7458 and −0.3338, respectively. Figure 3 shows the dynamic conditional correlation of the returns of USD and gold. The mean correlation coefficient is 0.67409, and the highest and lowest correlation coefficients are 0.8626 and 0.3924, respectively.

The results show that the correlations between gold–TEPIX, USD–TEPIX, and gold–USD are positive. In addition, the correlation between gold–USD is the highest among the pairs. This means that investing in the stock market can be considered an appropriate hedge against gold and USD, and vice versa. Moreover, these figures show that co-movement between the conditional correlations of the variables exists from early 2018 to December 2021.

#### *3.1. Forecasting Performance*

In this section, the wavelet-based random forest is used to select the best model for forecasting. In fact, we compare the out-of-sample forecasting of the two approaches, VAR-DCC-GARCH and DCC-GARCH, using the wavelet-based random forest (WRF). For this purpose, the time series of volatilities are calculated using the VAR-DCC-GARCH and DCC-GARCH methods; then, these volatilities of the financial markets are predicted using the wavelet-based random forest. Table 6 shows the performance metrics of the WRF modeling approach employed in this study. Comparing the RMSE, MSE, R-squared, and MAE indicates that the VAR-DCC-GARCH model outperforms the DCC-GARCH model.

**Table 6.** Model performance metrics.


#### *3.2. Results of Wavelet Coherence Estimation*

Applying volatilities extracted from VAR-DCC-GARCH model, the wavelet approach is used to examine the interconnected relationships between markets. The results of the coherence estimation are shown in Figures 4–6. The color of the graph shows the strength of coherency, ranging from blue (low coherency) to red (high coherency). Therefore, the more the color spectrum trends to red, the higher the correlation. The horizontal axis represents the time scale and the vertical axis represents the frequency. The graphs also show the frequency of the period (short, medium, and long term). In this way, the shortest period is 4 days and the longest is 256 days.

**Figure 4.** The continuous wavelet power spectrum of volatilities in USD–TEPIX.

**Figure 5.** The continuous wavelet power spectrum of volatilities in gold–TEPIX.

**Figure 6.** The continuous wavelet power spectrum of volatilities in gold–USD.

Figure 4 shows the correlation wavelet of volatilities in USD–TEPIX. According to the Figure, the color blue dominates for the most part in the short run, so there is no high correlation in the short run between volatilities in USD–TEPIX. However, in the medium term, despite the orange and red color spectrum between 2014 and 2015, it can be said that the correlation between USD–TEPIX volatilities has increased in the medium term. In the long run, there is correlation from 2016 to 2021.

The wavelet correlation analysis of gold–TEPIX volatilities is presented in Figure 5. In the short run, given that the blue color spectrum is predominant in most of the periods, there is a low correlation between the volatilities of gold and TEPIX. In the medium term, given the dominance of the orange and yellow spectrum, especially in the period around 2017, there is a correlation between the volatility of the two markets. In the long run, there is also correlation between the volatility of these two markets.

Finally, Figure 6 shows the correlation wavelet analysis on USD–gold volatilities. According to this figure, in the short and medium term from early 2016 to January 2020, given the dominant red color spectrum, a high correlation is observed in the volatilities of the returns of gold–USD. In the long run, the correlation trend in the volatilities of gold–USD has been increasing.

#### **4. Conclusions**

This study investigated the time-varying co-movement between the volatility of gold, exchange rate, and stock market returns in Iran. To do so, we first modelled the conditional correlations of the financial markets using the VAR-DCC-GARCH and DCC-GARCH models. Then, we compared the out-of-sample forecasting of these models. Finally, the CWT was used to study the co-movement between the financial markets' volatilities. The results of the wavelet-based random forest model showed that the forecasting performance of the VAR-DCC-GARCH model is better than that of the DCC-GARCH model. Furthermore, there are positive correlations between the returns of gold–TEPIX and USD–TEPIX, as well as between the returns of USD–gold. On the other hand, the correlation between the returns of USD–gold is higher than that of the other paired markets. This result implies that investors can use stocks as a suitable investment against gold and USD to increase profitability and reduce risk in their portfolios, and vice versa. In addition, holding both gold and USD in a portfolio can significantly reduce the benefits of portfolio diversification and increase portfolio risk. The results of the wavelet analysis showed that the correlation of the volatilities in the returns of gold–USD is increasing in the medium and long term. This result helps investors to maximize profitability in the short, medium, and long terms by choosing different assets in their portfolios.

**Author Contributions:** Conceptualization and methodology, S.F. and M.N.-M.; software, S.F and M.N.-M.; validation, S.F. and M.N.-M.; writing—original draft preparation, S.F. and M.N.-M.; writing review and editing, S.F. and M.N.-M.; visualization, S.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no significant external funding for the submitted work.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data are available on request from the corresponding author.

**Conflicts of Interest:** The authors have no conflicts of interest to declare that are relevant to the content of this paper.

#### **References**


### *Proceeding Paper* **Synthetic Subject Generation with Coupled Coherent Time Series Data †**

**Xabat Larrea 1,2,\* , Mikel Hernandez 1,\* , Gorka Epelde 1,3 , Andoni Beristain 1,3 , Cristina Molina <sup>1</sup> , Ane Alberdi <sup>2</sup> , Debbie Rankin 4, Panagiotis Bamidis 5,‡ and Evdokimos Konstantinidis 5,6,‡**


**Abstract:** A large amount of health and well-being data is collected daily, but little of it reaches its research potential because personal data privacy needs to be protected as an individual's right, as reflected in the data protection regulations. Moreover, the data that do reach the public domain will typically have undergone anonymization, a process that can result in a loss of information and, consequently, research potential. Lately, synthetic data generation, which mimics the statistics and patterns of the original, real data on which it is based, has been presented as an alternative to data anonymization. As the data collected from health and well-being activities often have a temporal nature, these data tend to be time series data. The synthetic generation of this type of data has already been analyzed in different studies. However, in the healthcare context, time series data have reduced research potential without the subjects' metadata, which are essential to explain the temporal data. Therefore, in this work, the option to generate synthetic subjects using both time series data and subject metadata has been analyzed. Two approaches for generating synthetic subjects are proposed. Real time series data are used in the first approach, while in the second approach, time series data are synthetically generated. Furthermore, the first proposed approach is implemented and evaluated. The generation of synthetic subjects with real time series data has been demonstrated to be functional, whilst the generation of synthetic subjects with synthetic time series data requires further improvements to demonstrate its viability.

**Keywords:** time series; synthetic data; shareable data; privacy

### **1. Introduction**

Time series data are defined as a class of temporal data objects, i.e., a collection of chronological observations [1]. They tend to be large in size and of high dimensionality, and are characterized by their numerical and continuous nature, typically considered as a whole rather than as individual numerical fields.

The motivation to investigate synthetic time series generation (STSG) is born from the VITALISE H2020 project [2]. One of the main objectives of this project is to provide virtual transnational access to data generated from several living labs (LLs) throughout Europe and beyond. To provide this transnational access, synthetic data generation (SDG) techniques
have been incorporated into the controlled data processing workflow to generate shareable data for external researchers in compliance with the General Data Protection Regulation (GDPR) [3]. LLs are research infrastructures that enable research studies to take place in real-life environments. Those research studies generate data that are potentially interesting for the research community. However, given that these data contain human personal or sensitive information, they are stored internally in LL infrastructures and cannot be externally shared outside the original research context.

Traditionally, anonymization techniques have been used to allow sensitive data to be made publicly available whilst preserving privacy, but traditional anonymization techniques tend to suppress useful data because many of them add noise to the real data or delete attributes from them. In this scenario, SDG is presented as a game-changing anonymization technique, as it has the potential to create data without erasing potentially interesting data.

In this context, a workflow to make LL data accessible for external researchers by generating synthetic data (SD) has been proposed by Hernandez et al. [4]. As explained in the proposed workflow, SD is created with a clear purpose, enabling researchers to develop algorithms and analyses locally using it. Subsequently, the locally developed algorithms and analyses are remotely executed with the real data stored in LL infrastructures. This approach enables external researchers to conduct and validate experiments with GDPR compliance.

When evaluating SD quality, the following three dimensions are most commonly considered: privacy, utility, and resemblance. The main aim of the use case mentioned above is to develop algorithms remotely with SD and then test them with real data. Therefore, once the privacy of the SD is ensured, in the context described above, more importance should be given to the utility dimension of the SD than to its resemblance to the real data.

#### **2. Related Work**

SDG has been gaining importance for privacy-preserving data publishing. It enables the creation of artificial data with a high statistical resemblance to real data without containing potentially private data [5]. In this context, Hernandez et al. reviewed the SDG approaches proposed as an alternative to anonymization techniques for health domain applications [6]. Furthermore, there are also studies in which STSG has been researched and used [7–14].

In 2018, Norgaard et al. [7] proposed the use of a supervised generative adversarial network (GAN), a variation of the originally proposed GAN approach [15]. In 2019, Yoon et al. [8] presented a time-series-specific GAN model, named TimeGAN, whose focus was on preserving the temporal dynamics of data. TimeGAN has later been used in the medical time series context by Dash et al. [9]. In 2020, Wang et al. [10] proposed a privacy-preserving augmentation and releasing scheme for time series data via a GAN (PART-GAN). This approach added differential privacy to the conditional temporal GAN (CT-GAN) [16], an approach that was proposed for generating videos. In addition, in 2020, an update to the Synthetic Data Vault (SDV) [11], a Python package used for generating synthetic data, added a specific model for generating time series data. This model is a probabilistic autoregressive (PAR) model, but its mathematical principles are still unpublished. In 2021, Hyun et al. [13] proposed NeuralProphet, a neural network variation of the forecasting tool Prophet [17], as a method for STSG to create synthetic diabetic foot patients. In 2022, Li et al. [14] presented the transformer-based time-series GAN (TTS-GAN), based on a transformer-encoder architecture.

Although the number of proposed STSG approaches is considerable, most of them do not consider the metadata of the subjects, as they are focused on generating highly realistic sequences of data. Furthermore, some of the approaches mentioned above transform time series data into the latent space, without analyzing the option to transform the generated synthetic time series back to the data format of the real data.

#### **3. Proposed Approaches**

This section presents two approaches for generating synthetic subjects containing temporal data. The first approach assumes that time series data do not require further transformations to ensure patient privacy. The second approach requires the time series data themselves to be synthesized, and thus needs well-performing STSG techniques that consider the metadata of the subjects. In this section, both approaches are presented on a general theoretical basis; the specific data generation models that have been used are specified in Section 4.

#### *3.1. Synthetic Subjects with Real Time Series Data*

This approach, as depicted in Figure 1, has been proposed to generate useful partially synthetic multi-subject datasets containing time series data. The approach of adding time series data to synthetically generated patients, by using synthetic tabular data generation techniques, was inspired by Schiff et al. [18], who proposed to enrich synthetic patients' data with real time series data. In their approach, tabular data of synthetic patients are generated from a dataset, creating patients that have their illnesses labelled with diagnostic codes. Then, time series data, which are also labelled with diagnostic codes and do not have any relationship with the first dataset, are added to enrich patients' information. This process is carried out by comparing the diagnostic codes of both datasets. Our approach improves Schiff et al.'s proposed approach, by linking (with a meaningful relationship) time series data with synthetically generated tabular subject metadata, starting from real cohort data containing such links.

**Figure 1.** Workflow for generating a synthetic patient containing real time series data.

The first and obvious step in this approach is to perform a basic exploratory data analysis. Then, data preprocessing is performed to remove missing values and inconsistencies, such as negative age values. The next step is to extract meaningful statistics from each time series and append them in a table to each subject's metadata.

Once the multi-subject tabular data, including subject metadata and basic time series statistics, have been created, synthetic tabular data generation (STDG) techniques [6] are applied to generate a synthetic patient table with synthetic metadata and synthetic time series statistics. The last step is to couple the synthetic statistics with the statistics of real time series. This process matches each synthetic patient with the fitting time series.

#### *3.2. Synthetic Subjects with Synthetic Time Series Data*

The second approach depicted in Figure 2 could be considered the ideal approach, as its outcome is a fully synthetic dataset. This approach can be understood as an evolution of the approach introduced in Section 3.1, since it incorporates STSG techniques.

**Figure 2.** Workflow for generating a synthetic patient containing synthetic time series data.

The process of generating synthetic subjects and synthetic statistics is the same as the previously explained approach. However, in this approach, time series data are synthetically generated instead of using the real time series. Once STSG techniques have been applied, the same basic statistics computed from the real time series are extracted. Considering the statistics generated with the synthetic patients and the statistics obtained from the synthetic time series data, the best fitting synthetic time series is selected for each synthetic subject, using a distance-based metric.

#### **4. Implementation**

Attempts have been made to implement both approaches presented in Section 3, but the quality of the synthetic time series generated for the approach presented in Section 3.2 is not yet suitable for the target use. Thus, the approach introduced in Section 3.1, which generates synthetic patient datasets and then couples them with the real time series data, has been implemented. In this section, the steps followed for the implementation of this approach are explained.

#### *4.1. Dataset Selection*

To implement the proposed approach, a dataset with a high patient volume, simple metadata, and manageable time series data that are not excessively long has been chosen. The treadmill maximal exercise tests (TMET) database [19,20] was selected from PhysioNet [21] as it fulfills the aforementioned requirements.

The TMET database is an ensemble of cardiorespiratory measurements acquired during 992 treadmill maximal graded exercise tests (GET) performed in the Exercise Physiology and Human Performance Lab of the University of Malaga. During the maximal effort tests, heart rate (HR), oxygen uptake (VO2), carbon dioxide elimination (VCO2), respiration rate (RR), and pulmonary ventilation (VE) were measured on a breath-by-breath basis alongside the treadmill speed. All these variables are measured and time-stamped (Time) every 2–5 s.

The dataset is composed of two files. The first one contains all the subjects' metadata and environmental metadata (humidity and temperature) in a tabular format. It contains data from 992 effort tests performed by 857 subjects, as some subjects performed several tests. These metadata are organized as shown in Table 1. The second file contains the results obtained from the treadmill experiments, i.e., the time series data. The mean length of these effort tests is 580 data rows, with each row being a measurement taken every two seconds. The time series data are organized as shown in Table 2. In this approach, each test has been considered as a subject.

**Table 1.** Subject metadata variables.


#### *4.2. EDA and Data Preprocessing*

Before applying the proposed approach to the selected dataset, an exploratory data analysis (EDA) has been carried out to identify missing values and inconsistencies.

Once these undesired values have been identified, a criterion has been established to decide how to treat them. If missing values are found in the time series data, a threshold of 30 data points (approximately 5% of the mean time series length) is used to decide whether the time series is kept or rejected. If a time series contains fewer than 30 missing values, they are imputed by linear interpolation; if more than 30 missing values are found, the time series and the related subject are excluded. If missing values are found in the metadata, the subject and its related time series are excluded.
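A minimal sketch of this criterion, assuming each test's time series is available as a pandas DataFrame with the physiological columns listed in Section 4.1 (counting missing values per signal is an assumption of this sketch):

```python
from typing import Optional

import pandas as pd

MAX_MISSING = 30  # ~5% of the mean test length (580 rows)
SIGNALS = ["Speed", "HR", "VO2", "VCO2", "RR", "VE"]

def preprocess_test(ts: pd.DataFrame) -> Optional[pd.DataFrame]:
    """Impute short gaps by linear interpolation; reject tests with long gaps."""
    worst_gap = ts[SIGNALS].isna().sum().max()  # assumption: the worst signal decides
    if worst_gap > MAX_MISSING:
        return None  # the test (and the related subject) is excluded
    return ts.interpolate(method="linear", limit_direction="both")
```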

The exploratory analysis found that the environmental data were missing for 30 subjects. Those subjects were excluded from the dataset. No unexpected values were found in the metadata. The time series analysis found 942 missing HR data points and 4871 missing VO2 and VCO2 data points. Following the criteria described above, 12 subjects were excluded due to the amount of missing data in their time series.

Upon completing the preprocessing stage, 950 of the original 992 subjects remained in the dataset.

#### *4.3. Time Series Feature Extraction*

Before selecting the appropriate statistics that need to be extracted, it is important to consider the nature of the tests from which the time series data were obtained. For the selected dataset, the data were obtained from a treadmill test. The treadmill effort test began at a speed of 5 km/h, which increased by 1 km/h per minute until the subject reached exhaustion [21]. Once the subject's maximal effort had been reached, the treadmill speed was reduced to 5 km/h, and the recovery was recorded for 200 s.

Considering that each test has a different duration and maximal speed, the following statistics have been selected: maximal speed, test duration (last time value), and maximal, minimal, and mean values of the physiological variables (HR, VO2, VCO2, RR, and VE). These statistics have been extracted for each time series, identified with the ID\_test variable, and appended to the corresponding subject in the metadata table.
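A possible implementation of this extraction, assuming the column and file names used in the earlier sketches, is the following:

```python
import pandas as pd

PHYS_VARS = ["HR", "VO2", "VCO2", "RR", "VE"]

def extract_features(ts: pd.DataFrame) -> pd.Series:
    """Summary statistics for one effort test (column names are assumptions)."""
    feats = {"max_speed": ts["Speed"].max(),
             "duration": ts["Time"].iloc[-1]}       # last time value = test duration
    for var in PHYS_VARS:
        feats[f"{var}_max"] = ts[var].max()
        feats[f"{var}_min"] = ts[var].min()
        feats[f"{var}_mean"] = ts[var].mean()
    return pd.Series(feats)

# One row of statistics per test, appended to the subject metadata table.
measures = pd.read_csv("test_measure.csv")          # hypothetical file name
metadata = pd.read_csv("subject-info.csv")          # hypothetical file name
stats = measures.groupby("ID_test").apply(extract_features).reset_index()
table = metadata.merge(stats, on="ID_test")
```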

#### *4.4. Synthetic Subject Generation*

The synthetic subjects and the synthetic time series statistics have been generated by applying an STDG technique. Specifically, the Synthetic Data Vault (SDV) [11] Python package has been used. This enables the generation of a cohort of synthetic subjects together with their metadata and the statistics that their effort tests should have. SDV contains several STDG models, of which the tabular variational autoencoder (TVAE) model with default parameters has been used to generate the SD.
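A minimal sketch of this step with the SDV 0.x API that was current at the time of this work (newer SDV releases expose the model as `TVAESynthesizer` and require an explicit metadata object):

```python
from sdv.tabular import TVAE

# `table`: the real tabular dataset assembled in Section 4.3
# (subject metadata plus the extracted time series statistics).
model = TVAE()                       # default parameters, as stated above
model.fit(table)
synthetic_table = model.sample(num_rows=len(table))
```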

#### *4.5. Time Series Coupling with Synthetic Subject*

Considering that the synthetically generated statistics and the statistics extracted from the real time series do not take the same values, a strategy is proposed to couple a real effort test to each synthetic subject. For every synthetic subject and every real time series, the Euclidean distances between the corresponding statistics are computed and aggregated; the real time series yielding the lowest aggregate value is then selected.

Firstly, all the statistics are normalized so that no variable is more influential than another when aggregating the distances. Subsequently, the distances between each pair of corresponding statistics are calculated and summed. The ID\_test of the best-fitting time series, i.e., the one with the lowest distance sum, is appended to each synthetic subject. Once all the synthetic subjects have been assigned a time series, another dataset containing those time series is created. Finally, new identifiers linking the synthetic subjects with the time series data are created. The pseudocode of this coupling can be observed in Algorithm 1.
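A minimal Python sketch of this coupling, assuming the real and synthetic statistics share the same columns; the min–max normalization fitted on the real statistics and the function and variable names are assumptions of this sketch:

```python
import pandas as pd

def couple(real_stats: pd.DataFrame, synth_stats: pd.DataFrame,
           id_col: str = "ID_test") -> pd.Series:
    """Assign to each synthetic subject the ID_test of the closest real test."""
    cols = [c for c in real_stats.columns if c != id_col]
    # Normalize so that no statistic dominates the aggregated distance.
    lo, hi = real_stats[cols].min(), real_stats[cols].max()
    real_n = (real_stats[cols] - lo) / (hi - lo)
    synth_n = (synth_stats[cols] - lo) / (hi - lo)
    assigned = []
    for _, s in synth_n.iterrows():
        dist_sum = (real_n - s).abs().sum(axis=1)   # per-statistic distances, summed
        assigned.append(real_stats.loc[dist_sum.idxmin(), id_col])
    return pd.Series(assigned, index=synth_stats.index, name=id_col)
```

For scalar statistics, the Euclidean distance reduces to the absolute difference, so the per-statistic sum above corresponds to the aggregation described in the text.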


#### *4.6. Validation of Subjects*

To validate the metadata of the cohort of synthetic subjects, some standardized metrics and methods proposed by Hernandez et al. [22] have been used.

Firstly, the mean and standard deviation values for each variable of real data and SD have been obtained. These values are collected in Table 3. From there, it can be observed that the mean and standard deviation values of the attributes of SD are similar to those of real data.
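The comparison collected in Table 3 can be reproduced with a few lines; selecting the attributes by numeric dtype is an assumption of this sketch:

```python
import pandas as pd

def mean_std_summary(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Mean ± std per numeric attribute for the real and synthetic cohorts."""
    rows = {}
    for col in real.select_dtypes("number").columns:
        rows[col] = {
            "real": f"{real[col].mean():.2f} ± {real[col].std():.2f}",
            "synthetic": f"{synth[col].mean():.2f} ± {synth[col].std():.2f}",
        }
    return pd.DataFrame(rows).T
```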

**Table 3.** Mean and standard deviation (mean ± std) values for real data and SD (where SD is generated using SDV).


A dimensionality reduction method, specifically principal component analysis (PCA), has been used to analyze whether the dimensional properties of the real cohort are preserved in the synthetic one. Figure 3 indicates that the generated cohort of synthetic subjects preserves these properties quite well; only a few points deviate from the cohort of real subjects.
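A sketch of this check with scikit-learn and matplotlib, assuming `real` and `synth` are the numeric attribute tables of the two cohorts and that the projection is fitted on the real cohort (a choice not specified in the text):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

num_cols = real.select_dtypes("number").columns
scaler = StandardScaler().fit(real[num_cols])
pca = PCA(n_components=2).fit(scaler.transform(real[num_cols]))

# Project both cohorts into the same two-dimensional space.
real_2d = pca.transform(scaler.transform(real[num_cols]))
synth_2d = pca.transform(scaler.transform(synth[num_cols]))

plt.scatter(real_2d[:, 0], real_2d[:, 1], alpha=0.5, label="real")
plt.scatter(synth_2d[:, 0], synth_2d[:, 1], alpha=0.5, label="synthetic")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```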

**Figure 3.** PCA plot of real data (blue) and SD (orange).

Figure 4 shows the pairwise Pearson correlation (PPC) matrices for both the cohort of real subjects and the cohort of synthetic subjects. From there, it can be observed that the correlations between the attributes are very similar for both cohorts. A few correlations in the cohort of synthetic subjects are weaker than the correlations from the cohort of real subjects.
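The two matrices can be obtained directly with pandas; the element-wise difference then highlights where the synthetic correlations deviate from the real ones:

```python
real_corr = real.select_dtypes("number").corr(method="pearson")
synth_corr = synth.select_dtypes("number").corr(method="pearson")
corr_diff = (real_corr - synth_corr).abs()   # larger values = larger deviation
```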

A pairwise distance has been computed between each pair of real and synthetic subjects to evaluate how private the cohort of synthetic subjects is. The Hamming distance is used for this purpose; it represents the proportion of attributes that differ between two records. Therefore, the higher the pairwise distance, the better the privacy is preserved, since fewer attributes of the SD subjects are exactly equal to the attributes of the real subjects.

After computing the Hamming distance for each pair of real data and SD subjects, the distribution of those pairwise distances has been analyzed. As shown in Figure 5, most of the pairwise distance values are higher than 0.9, which indicates that for most synthetic subjects, the attributes are different from the real subject. This result indicates that privacy has been quite well preserved in the cohort of synthetic subjects.
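A sketch of this privacy check, assuming all attributes have been numerically encoded so that SciPy's `hamming` metric (proportion of disagreeing components) can be applied to the two cohorts:

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# One distance per (real, synthetic) pair of records.
pairwise = cdist(real.to_numpy(), synth.to_numpy(), metric="hamming")

plt.hist(pairwise.ravel(), bins=50)
plt.xlabel("pairwise Hamming distance")
plt.ylabel("count")
plt.show()
```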

**Figure 5.** Distribution of pairwise Hamming distances between real data and SD.

#### **5. Conclusions**

The main conclusion derived from this work is that the proposed approach, which generates synthetic subjects and couples them with real time series data, can be used to produce a synthetic, and thus shareable, dataset. With the selected dataset, the methodology generates synthetic patients and then combines them with real time series data. Furthermore, the generated cohort of synthetic subjects preserves the privacy of the cohort of real subjects while maintaining its correlations and dimensional properties.

However, the developed work has a few limitations. Firstly, further evaluation with other time series datasets should be performed to validate the generalizability of the approach. Secondly, despite some trials implementing the second approach described in Section 3.2 (using the SDV PAR model for STSG), the results obtained with this model on the selected dataset did not meet the minimum similarity requirements to be presented. More favorable results may be obtained by applying this approach to other datasets. Thirdly, the privacy of the generated subject cohort has not been extensively analyzed, nor has the quality of the coupling. A more extensive evaluation of the generated synthetic cohorts should be carried out, comparing different STDG or STSG techniques to select those that yield better results. Fourthly, this approach has been evaluated with only one cohort of subjects and one set of temporal data. More research with additional cohorts and time series datasets should be carried out to validate the generalizability of, and improve, both the implemented approach and the other proposed approach.

The limitations mentioned above can be taken as guidelines for future work. Firstly, other missing-data imputation techniques for time series, such as forecasting-based imputation, will be incorporated into the data processing step. Secondly, a strategy to evaluate the coupling process will be devised, for example, using forecasting analysis methods. Then, the implemented approach will be evaluated with more datasets. In addition, the second approach, in which synthetic time series data are generated, will be implemented and validated, together with better-performing STSG techniques and more datasets for generating multivariate time series. Concerning this approach, a method to validate the temporal nature of the synthetic time series will be established, since the time series data will be fully synthetic. Furthermore, a complete strategy to evaluate the subjects' resemblance and privacy will be defined. For multivariate resemblance, the comparison of eigenvalues, the percentage of variance explained by each component, and the coordinates of the PCA analysis will be considered. In terms of privacy, the use of Wilcoxon signed-rank tests, the analysis of re-identification risks, and the computation of similarity to real data will be considered. Finally, it is intended to incorporate the proposed approaches into the VITALISE controlled data processing workflow presented by Hernandez et al. [4]. This workflow enables researchers to develop algorithms and perform analyses locally using SD and then request their execution remotely with the real data.

**Author Contributions:** Conceptualization, X.L., M.H., G.E. and A.B.; Data curation, X.L. and M.H.; Funding acquisition, G.E., P.B. and E.K.; Investigation, X.L., M.H. and G.E.; Methodology, G.E.; Project administration, G.E. and E.K.; Software, X.L., M.H., G.E., A.B. and C.M.; Supervision, G.E.; Validation, X.L. and M.H.; Visualization, X.L. and M.H.; Writing—original draft, X.L. and M.H.; Writing—review and editing, X.L., M.H., G.E., A.B., C.M., A.A., D.R. and E.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partly funded by the VITALISE (Virtual Health and well-being Living Lab Infrastructure) project, funded by the Horizon 2020 Framework Program of the European Union for Research Innovation (grant agreement 101007990).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data used in this study are openly available in Physionet at https://doi.org/10.13026/7ezk-j442 (accessed on 17 June 2022).

**Acknowledgments:** VITALISE Consortium partners and external people participating in requirements shaping open sessions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

