#### 5.2.2. Cross-Correlation

In order to assess potential temporal shifts between the measurements of the same physiological parameter, we investigate the cross-correlation at different lags. The corresponding cross-correlation pattern in Figure 7 shows that some pairs of parameters (especially the top five rows of the IBI group) have a rather low variance among the lags and tend to correlate positively (i.e., the lags within a cell show rather homogeneous coefficients). Other pairs of parameters (especially the lower half of the IBI group) have a rather high variance among the lags, indicating positive correlations around lag 0 and negative correlations at lag +15 and −15, respectively.

**Figure 7.** Cross-correlation matrix of pairs of parameters and participants; detail (**a**) complements Figure 4, detail (**b**) complements Figure 5; colour: orange indicates positive cross-correlation, blue indicates negative cross-correlation; a cell detail shows lags as small horizontal bars: lag −15 at top, and lag +15 at bottom.

Note that the cross-correlation matrix in Figure 7 is organized as follows: for each group of parameters, the top row shows the highest cross-correlations (i.e., lowest variance) among all participants, while the bottom row shows the lowest cross-correlation (i.e., highest variance). Further, the left column shows the participant with the highest cross-correlations among all parameters, while the right column shows the participant with the lowest cross-correlations among all parameters.

Figure 7 detail (a) refers to the HR example shown in Figure 4 and detail (b) refers to the GSR example shown in Figure 5.
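The lag analysis described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes both series are sampled on a common 1 Hz grid (so that integer lags correspond to seconds), and the function name `lagged_cross_correlation` and the sign convention (a positive lag means the second series trails the first) are illustrative choices.

```python
import numpy as np

def lagged_cross_correlation(x, y, max_lag=15):
    """Pearson correlation between x[t] and y[t + lag] for each integer lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")
    n = len(x)
    if max_lag >= n - 1:
        raise ValueError("max_lag must be smaller than the series length minus one")
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:n - lag], y[lag:]      # y shifted forward in time
        else:
            a, b = x[-lag:], y[:n + lag]     # y shifted backward in time
        result[lag] = float(np.corrcoef(a, b)[0, 1])
    return result

# Synthetic example (assumption): y is a copy of x delayed by 2 samples
rng = np.random.default_rng(0)
x = rng.normal(size=600)
y = np.roll(x, 2)
ccf = lagged_cross_correlation(x, y, max_lag=15)
best_lag = max(ccf, key=ccf.get)
print(best_lag)  # lag with the highest correlation
```

In this synthetic case the highest coefficient occurs at the lag that matches the artificial delay. A cell in Figure 7 corresponds to one such set of 31 coefficients for one parameter pair and one participant.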

#### 5.2.3. MINE Statistics

By using MINE statistics, we investigate relationships between and among parameter pairs of the same as well as of different types, for instance, GSR versus HR. In addition to the coefficient of determination R<sup>2</sup> of the linear correlation (see Section 5.2.1), we use the Maximal Information Coefficient (MIC) to identify all functional relationships, including the linear ones captured by R<sup>2</sup>. Although a functional relationship between certain combinations of parameters might be obvious (e.g., IBI [ms] = 60,000/HR [beats per minute]), we nonetheless include such combinations herein for confirmation.
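The IBI and HR example also illustrates why a purely linear statistic is not sufficient. The short sketch below uses synthetic, noise-free heart rates (an assumption for illustration, not study data) and shows that the exact reciprocal relationship between IBI and HR yields a coefficient of determination below 1.

```python
import numpy as np

# Synthetic heart rates in beats per minute (assumed range, evenly spaced)
hr_bpm = np.linspace(60.0, 180.0, 121)
# Inter-beat interval in milliseconds: 60,000 ms per minute / beats per minute
ibi_ms = 60_000.0 / hr_bpm

# For a straight-line fit, R^2 equals the squared Pearson correlation
r = np.corrcoef(hr_bpm, ibi_ms)[0, 1]
r_squared = r ** 2
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

The relationship is fully deterministic, yet R<sup>2</sup> stays below 1 because the dependence is reciprocal rather than linear. A statistic such as the MIC, which is designed to capture general functional relationships, is intended for exactly this situation.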

Figure 8 shows three k-means clusters of pairs of different parameter types. We tested k ∈ [1, 5] and chose k = 3 because the result is the most intuitive: the clusters show low, moderate, and high correlations. Figure 8a–f highlights some particularly interesting parts of the clusters.


**Figure 8.** Maximal Information Coefficient (MIC) cluster matrix of pairs of parameters (moving averaged versions only); colors: blue (cluster 1): low correlations; orange (cluster 2): moderate correlations; green (cluster 3): high correlations; symbol size in a matrix cell: average MIC among all participants; (**a**–**f**) highlight special characteristics described in the text.
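The clustering step behind Figure 8 can be illustrated with a small one-dimensional k-means (Lloyd's algorithm). This is a sketch under assumptions: the MIC values below are hypothetical, the quantile-based initialization is an illustrative choice, and the study may well have used a library implementation with different settings.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on scalar values; returns (labels, centers, inertia)."""
    v = np.asarray(values, dtype=float)
    # Deterministic start: spread the initial centers over the data quantiles
    centers = np.quantile(v, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a cluster is empty
                new_centers[j] = v[labels == j].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
    inertia = float(np.sum((v - centers[labels]) ** 2))
    return labels, centers, inertia

# Hypothetical average MIC values for nine parameter pairs
mic = np.array([0.12, 0.15, 0.18, 0.45, 0.50, 0.55, 0.88, 0.92, 0.95])

# Compare the within-cluster sum of squares for k = 1..5
for k in range(1, 6):
    _, _, inertia = kmeans_1d(mic, k)
    print(k, round(inertia, 4))

labels, centers, _ = kmeans_1d(mic, 3)
print(labels, centers)
```

With k = 3 the hypothetical values separate into a low, a moderate, and a high group, which mirrors the interpretation of the three clusters given in the text.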

When focusing on the level of individual participants, Figure 9 shows MIC correlations among pairs of different parameters. Within a single cell, the small vertical bars represent participants (one bar per participant). Figure 9 complements Figure 8 by adding participant information to the corresponding clusters.

**Figure 9.** MIC participant-level matrix of pairs of parameters (moving averages only); details of a matrix cell show participants as small vertical bars (order of participants is shown in the legend).

The MIC is used to quantify the strength of any functional relationship, i.e., including linear ones, while the R<sup>2</sup> coefficient can only quantify linear relationships. By subtracting R<sup>2</sup> from the MIC, we compute a measure of nonlinearity [36], which we use to identify the following three classes of relationships as shown in Figure 10: "false" linear relationships, "true" linear relationships, and functional but not linear relationships.


On the individual level, the MIC–R<sup>2</sup> matrix shown in Figure 11 provides additional detail to the clustering view. Interestingly, GSR measurements from E4 and VP show rather strong functional but not linear relationships with almost all other parameters (see third and fifth row in Figures 10 and 11). Particularly interesting is the relationship between GSR VP (filtered and moving averaged version, fourth row) and GSR E4 (filtered and moving averaged version, fifth-last column), which shows some highly negative values (see black arrow). These cases indicate a "false" linear relationship. For instance, for participant RP 9–17, MIC 0.2 minus R<sup>2</sup> 0.91 results in −0.71. In other words, the MIC does not confirm the highly linear relationship indicated by R<sup>2</sup>; in fact, the MIC indicates that there is almost no relationship. From a physiological point of view, this relationship might be obvious; however, the quantification of this relationship from a data-driven perspective is, to the best of our knowledge, novel.
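The arithmetic of the nonlinearity measure can be made explicit with a short sketch. Note the assumptions: the paper reports the three classes as clusters (Figure 10), whereas the fixed tolerance `tol` used here is a hypothetical cut-off introduced only for illustration, and the function names are not taken from the study.

```python
def nonlinearity(mic, r_squared):
    """MIC minus R^2, the nonlinearity measure described in the text."""
    return mic - r_squared

def classify_relationship(mic, r_squared, tol=0.2):
    """Label a parameter pair by the sign and size of MIC - R^2.

    `tol` is a hypothetical threshold for illustration only; the paper
    presents its classes as clusters rather than fixed cut-offs.
    """
    diff = nonlinearity(mic, r_squared)
    if diff < -tol:
        return "false linear"                # R^2 high, MIC does not confirm it
    if diff > tol:
        return "functional but not linear"   # MIC high, R^2 low
    return "true linear"                     # MIC and R^2 roughly agree

# Example from the text: participant RP 9-17, MIC 0.2 and R^2 0.91
diff = nonlinearity(0.2, 0.91)
print(round(diff, 2), classify_relationship(0.2, 0.91))  # -0.71 false linear
```

A difference near zero only indicates agreement between the two statistics, so in practice it should be read together with the MIC itself: two low values also produce a difference close to zero.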

**Figure 10.** MIC–R<sup>2</sup> cluster matrix of pairs of parameters (moving averages only); colors: green (cluster 1): "false" linear relationships; yellow (cluster 2): "true" linear relationships; red (cluster 3): functional but not linear relationships.

**Figure 11.** MIC–R<sup>2</sup> individual matrix of pairs of parameters (moving averages only); details of a matrix cell show participants as small vertical bars (order of participants is shown in the legend); black arrow points to pairs of parameters with very weak association.

#### 5.2.4. Fréchet Distance (Global and Local)

The Fréchet distance is a measure of how different two curves are from each other in terms of geometric structure [32]. Herein we use the Fréchet distance to measure the geometric similarity of two time series of a physiological parameter measured by different sensor platforms, one being professional and well-calibrated while the other is low-cost and wearable. In addition to the standard global Fréchet distance, we also compute local versions using a moving window approach.
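The global and local variants can be sketched as follows. This is an illustration under assumptions rather than the study's implementation: it uses the discrete Fréchet distance (a common dynamic-programming approximation of the continuous measure), treats each series as a curve of (time, value) points with an assumed 1 s sampling interval, and the helper names `as_curve` and `local_frechet` are hypothetical.

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as (n, d) arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    n, m = len(p), len(q)
    # Pairwise Euclidean distances between all points of p and q
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            # Cheapest way to reach (i, j), then account for the current pair
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return float(ca[-1, -1])

def as_curve(values, dt=1.0):
    """Turn a regularly sampled series into (time, value) points."""
    values = np.asarray(values, dtype=float)
    return np.column_stack([np.arange(len(values)) * dt, values])

def local_frechet(a, b, window=60, step=1):
    """Discrete Fréchet distance inside a sliding window over two aligned series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    starts = range(0, len(a) - window + 1, step)
    return np.array([
        discrete_frechet(as_curve(a[s:s + window]), as_curve(b[s:s + window]))
        for s in starts
    ])

# Synthetic example (assumption): two series that differ by a constant offset of 3
t = np.arange(300)
a = 70 + 10 * np.sin(t / 30)
b = a + 3
global_distance = discrete_frechet(as_curve(a), as_curve(b))
local_distances = local_frechet(a, b, window=60, step=60)
print(global_distance, local_distances)
```

Because time and measurement values are combined in one Euclidean distance, the result depends on the units of both axes; rescaling or normalizing the series before the comparison is a design decision that this sketch leaves open.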

The global Fréchet distance matrix (Figure 12) shows two general patterns that were to be expected. First, cardiac parameters such as IBI, HF, LF and VLF derived from ECG seem to be more similar than GSR. This is likely because, from a measuring point of view, ECG-related measurements are simply more robust than, for instance, GSR-related ones. Second, the moving averaged versions of the time series also tend to be more similar than their non-averaged counterparts, which include more local fluctuations. Figure 12 detail (a) refers to the HR example shown in Figure 4 and detail (b) refers to the GSR example shown in Figure 5. In both detail (a) and detail (b), the moving average smooths the time series, which results in a higher similarity compared to the original (non-smoothed) time series.


**Figure 12.** Global Fréchet distance matrix of pairs of parameters and participants; detail (**a**) complements Figure 4, detail (**b**) complements Figure 5; color: green indicates low distance thus high similarity, blue indicates high distance thus low similarity.

In addition to the global geometric similarity, Figure 13 shows local similarity characteristics of the time series using a moving window approach. The figure shows that the local Fréchet distance of a 1-min moving window indeed reveals differences in similarity at different intensities of physical activity (0–300 s: no activity; 301–900 s: cycling with increasing intensity; 901–1200 s: no activity, cool-down; for details refer to Section 4.1). For instance, IBI derived from ECG tends to have a rather constant similarity over the entire measurement period (Figure 13, fourth row), and it tends to be more similar than IBI measured "directly" (Figure 13, first and second row).

**Figure 13.** Local Fréchet distance of a moving time window (1 min) of selected pairs of parameters (inter beat interval IBI, heart rate HR, galvanic skin response GSR; moving averaged only).

#### 5.2.5. DTW Distance

The Dynamic Time Warping (DTW) distance is a measure typically used to assess the similarity of time series [50,51]. Simply speaking, DTW optimizes the alignment of one time series (test) with another (reference) by stretching or shrinking it non-linearly along its time axis. The overall distance is the sum of the distances between all aligned pairs of points. Identical time series have a distance of zero. Figure 14 shows the DTW distance between pairs of parameters and participants; detail (a) refers to the HR example shown in Figure 4 and detail (b) refers to the GSR example shown in Figure 5.
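The alignment idea can be sketched with the classic dynamic-programming formulation of DTW. The sketch makes several assumptions that the text does not specify: the local cost is the absolute difference between two samples, no warping-window constraint or normalization is applied, and the function name `dtw_distance` is illustrative.

```python
import numpy as np

def dtw_distance(test, reference):
    """Classic DTW: minimal summed |difference| over all monotone alignments."""
    x = np.asarray(test, dtype=float)
    y = np.asarray(reference, dtype=float)
    n, m = len(x), len(y)
    # cost[i, j] = best accumulated cost of aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch the reference
                                 cost[i, j - 1],      # stretch the test series
                                 cost[i - 1, j - 1])  # advance both
    return float(cost[n, m])

a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1]   # same shape, delayed by one sample

pointwise = sum(abs(u - v) for u, v in zip(a, b))
print(dtw_distance(a, a))   # 0.0 for identical series
print(dtw_distance(a, b))   # 1.0: the warping absorbs the delay
print(pointwise)            # 6: sample-by-sample comparison without warping
```

The comparison with the sample-by-sample sum shows why DTW is less sensitive to small temporal shifts between two sensors than a direct pointwise distance.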


**Figure 14.** Dynamic Time Warping (DTW) distance matrix of pairs of parameters and participants; detail (**a**) complements Figure 4, detail (**b**) complements Figure 5; colour: green indicates low distance thus high similarity, purple indicates high distance thus low similarity.

The global DTW distances from Figure 14 can be illustrated as an individual pairwise comparison of time series. For instance, Figure 15 shows an example of the DTW distance between two time series of one participant's HR measurements, which are highly similar (low DTW distance). The corresponding exploratory plots of the HR example are shown in Figure 4. Figure 16 shows an example of two time series of GSR measurements with rather low similarity (high DTW distance); however, the overall trend is highly similar. The corresponding exploratory plots of the GSR example are shown in Figure 5.

**Figure 15.** Illustration of the Dynamic Time Warping (DTW) distance between two parameters of participant RP 5-14: moving averaged version of heart rate HR from BioHarness BH versus moving averaged version of heart rate HR from VarioPort VP (note the offset of the two y-axes).

**Figure 16.** Illustration of the Dynamic Time Warping (DTW) distance between two parameters of participant RP 3-8: moving averaged version of galvanic skin response GSR from E4 versus moving averaged version of galvanic skin response GSR from VarioPort VP (note the offset of the two y-axes).

#### **6. Discussion and Limitations**

Overall, the sensor benchmarking worked well, both from a standardized laboratory study and data acquisition viewpoint, as well as from the data analysis methodology perspective. The high correlations between the cardiovascular parameters HR and IBI were as expected because these parameters are comparably simple to measure through a range of methodologies (electrical, optical). The high correlations between the other ECG-derived measurements were a little more surprising because (1) ECG is measured through a multi-channel electric current-based system, which is a complex procedure; (2) the use of contact electrodes of the BioHarness sensor (in contrast to the sticky electrodes of high-quality sensors) may cause contact problems and thus measurement problems; (3) ECG is measured at a very high frequency (at least 200 Hz), which is technologically challenging for low-cost wearables.

For GSR, our experiment resulted in lower, but still reasonable similarities, which may be caused by a number of factors such as different measurement methods (sticky electrodes vs. plate electrodes) and different placement of the sensors (palm vs. wrist).

From a more general point of view, it is a known issue that low-cost wearable sensors tend to be prone to producing datasets that suffer from reduced data quality, even though we checked the appropriate positioning of the sensors before we started the exercise.

A particular issue arose with participant RP 1–2. As the results show, the measurements for this participant indicate low correlations for almost all physiological parameters. This may be due to problems with the contact between the electrodes and the skin, which may have been compromised by the person's physical characteristics.

A vital part of the analysis is the visualisation of results on two complementary levels: First, on the individual level, the data from different sensors measuring the same physiological parameter at the same time on the same participant provide a direct comparison between the two time series of interest. This enables reaching conclusions on the sensors' measuring behaviour. Second, on the collective level, the consolidation of global metrics allows for comparing signals between participants. This further provides useful insights into the influence of the participants' individual components (physical constitution, individual baseline level of skin conductance, etc.).

These complementary visualizations enable a flexible method of interpretation. For instance, they allow starting the interpretation on the individual level on a particular pair of physiological parameters of interest (e.g., HR of participant RP 5–14) using the corresponding exploratory plot as shown in Figure 4, then rolling up using the R<sup>2</sup> matrix (Figure 6a) together with the cross-correlation matrix (Figure 7a) and comparing the individual result between participants. Further, they allow checking whether that particular pair of parameters has a functional relationship and whether that relationship is stable among other participants by using the MIC–R<sup>2</sup> individual matrix (Figure 11). The focus on a particular pair of parameters and participant can be continued to the similarity measures, namely the Fréchet distance and the DTW distance (Figures 12 and 14). Another method of interpretation is to begin at the collective level using the MIC–R<sup>2</sup> cluster matrix of pairs of parameters (Figure 10), then drilling down on a specific parameter combination of interest using the MIC–R<sup>2</sup> individual matrix (Figure 11) and contextualizing this matrix with the corresponding exploratory plots as shown in Figure 5.

The central advantage of this kind of visualisation for sensor benchmarking is the "big picture" view: it serves as a basis for visual analysis of the correlations between the measurements of one parameter as measured by two different sensors (each row in the matrix) and the correlations between the different parameters for a single participant (each column in the matrix). Furthermore, the matrix allows the simple assessment of each single cell to trace back particularities of each measurement to a test person, which makes it easier to single out anomalies that may be caused by usage errors, a user's characteristics, single sensor failures, or violations of the benchmark protocol.

The cross-correlation analysis shows that groups of physiological parameters can be associated with different patterns of temporal shifts. As illustrated in Figure 7, the cross-correlation also varies between participants: from overall positive for IBI derived from ECG for more than 60% of the participants to highly positive at small lags and highly negative at larger lags for the heart rate variability parameters VLF, LF, and HF. Although the clocks of the sensors were synchronized right before the study, we observed a lag of 1–2 s between HR measurements (BH versus VP) and GSR measurements (E4 versus VP), as exemplarily shown in Figure 4a,c and Figure 5c,d, respectively. In all cases, the VP time series were "leading", which may indicate that the response characteristics of the VP sensors are more sensitive compared to the other sensors.

Measuring the strength of association between pairs of physiological parameters was of particular interest. Herein, we contrasted the coefficient of determination R<sup>2</sup> with the MIC (Figures 10 and 11). In other words, we compared a statistic that measures linear relationships with a statistic that measures all types of functional relationships, including linear ones, and thereby classified the relationship as 'false linear', 'true linear', or 'functional but not linear'. The results are notable: on the one hand, some already expected linear relationships have been confirmed by a purely data-driven approach (for instance, relationships between IBI and VLF, LF, HF); on the other hand, some relationships that were expected to be linear are in fact neither linear nor functional, for instance, the relationships between GSR measured by E4 and GSR measured by VP (both filtered and moving averaged versions).

#### **7. Conclusions and Future Work**

In this paper, we performed a benchmark of two wearable physiological sensors (Zephyr BioHarness 3 and Empatica E4) by comparing their measurements (heart rate, inter-beat interval, galvanic skin response, and derived heart rate variability parameters) to highly calibrated high-end professional equipment. In our study, we used the measurements from 18 participants to compare the correlations (Pearson's r), cross-correlations at different temporal lags from −15 s to +15 s, the (sub-)linearity of functional dependencies (MIC), the difference of two measurement time series with respect to their geometric structure (Fréchet distance), local time series similarities (moving window), and time series similarity with respect to their temporal alignment (DTW).

The results of our study show that the measured cardiovascular parameters yield very high similarities between the low-cost wearable and the calibrated professional sensors. Although cardiovascular parameters are simple to measure (technically and phenomenon-wise), the obtained similarities are remarkable. For GSR, our experiment resulted in lower similarities, which may be caused by a number of factors such as different measurement methods, different placement of the sensors (palm vs. wrist), conduction characteristics between skin and sensor surface (use of electrolyte gel or not), and others. It should be noted that the use of isotonic electrolyte gel is a scientific standard for measurement of electrodermal activity [52] and was used with the VarioPort GSR measure but not with the other devices.

We demonstrated that our methodological approach to quantify correlations and similarities on both the individual and the aggregated level can provide interesting insights into the relationships between and among physiological parameters. The many figures generated (only the most essential ones are presented in this paper) enable different points of view on the same data and thus a more holistic interpretation for the benchmark of physiological sensors. Our research contributes to such a holistic interpretation in two ways: (1) the comparison of the coefficient of determination R<sup>2</sup> with the Maximal Information Coefficient MIC, in particular, the classification of non-linear correlations, and (2) the quantification of the signals' temporal and geometric similarity based on well-established distance metrics (DTW distance and Fréchet distance).

Our future work will focus on two main research challenges. First, to continue fine-tuning the methodology and integrate additional similarity measures, for instance, the Time Warp Edit Distance (TWED) [53]. Second, to evaluate the transferability of the methodology to other time series benchmarking challenges, not necessarily physiological measurements. In the long run we want to expand the methodology to the geospatial domain, i.e., integrating the location in addition to the timestamp and the measurement of mobile sensors. This approach will likely warrant an additional field study that addresses the suitability of measurement devices and measurement quality on moving subjects, e.g., persons riding a bicycle or walking, and relating sensor data to subjective experience self-report data.

**Author Contributions:** Conceptualization, G.S., B.R., A.P., K.K., M.L. and F.H.W.; Data curation, G.S., B.R., A.P., K.K. and M.L.; Formal analysis, G.S. and B.R.; Investigation, G.S., B.R., A.P. and M.L.; Methodology, G.S., B.R. and K.K.; Project administration, A.P.; Software, G.S., A.P., K.K. and M.L.; Validation, G.S., B.R., M.L. and F.H.W.; Visualisation, G.S.; Writing—Original draft, G.S., B.R. and M.L.; Writing—Review & editing, G.S., B.R., A.P., K.K., M.L. and F.H.W.

**Funding:** This research was supported by the Austrian Science Fund (FWF) through the project "Urban Emotions" (FWF I-3022) and by the Austrian Research Promotion Agency (FFG) through the project "Walk&Feel" (FFG 865208). This research was partly funded by the Austrian Science Fund (FWF) through the Doctoral College GIScience (DK W 1237-N23).

**Acknowledgments:** Open Access Funding by the Austrian Science Fund (FWF). We would like to thank all participants of the case study who made this work possible by donating several hours of their free time.

**Conflicts of Interest:** The authors declare no conflict of interest.
