#### **1. Introduction**

In the last decade, the body of literature about physiological sensing and deriving emotions from physiological parameters has grown significantly. One reason for this is the rapid increase in the variety of affordable wearable sensors that measure a broad range of physiological parameters, such as heart rate, galvanic skin response, and skin temperature. With this increase, the "Quantified Self" community, which promotes the idea of 24/7 tracking and monitoring, has also been growing significantly [1–3].

These new low-cost wearables are increasingly used in scientific studies in a variety of areas like health research, well-being assessment, disaster management, emotion information extraction and spatial emotion analysis, and stress detection [4–13]. However, some research efforts have used wearable physiological sensors without prior investigation of the sensor's exact quality parameters, i.e., how accurately a sensor actually measures a given parameter or how reliable a sensor is in producing continuously high-quality measurement results.

Understanding a sensor's quality and accuracy is critical because the research results may otherwise be unreliable: while traditional professional wired sensor devices, which have been used for some time in laboratory and ambulatory studies in the fields of psychological and medical research, are proven to be highly accurate, most wearable sensors used in previous studies are not. In fact, most of them are not medically and/or electronically certified, which compromises the reliability of the measurement results. However, recently, some wearable sensors have been released that are certified and comply with a number of international standards (sensor technology, wireless communication, data transmission, etc.), which makes them a viable alternative to traditional wired equipment.

In the context of this research, we aim to investigate the measurement quality of two wearable sensor devices, namely the Zephyr BioHarness 3 and the Empatica E4, by comparing their measurements to those of calibrated laboratory sensors. Concretely, we are interested in the similarity and correlation of univariate time series from two different sensors that measure the same physiological parameters at the same time on the same participant. To evaluate the accuracy of the low-cost sensors, we benchmark them against high-quality, well-calibrated sensors that act as the trusted gold standard. The second aim of this research is to detect and quantify relationships and dependencies between pairs of the same and different physiological parameters measured by different sensors. Our study assesses the parameters heart rate (HR), inter-beat interval (IBI), and galvanic skin response (GSR).

The remaining part of the paper is structured as follows. In Section 2, we provide a concise summary of related work regarding sensor benchmarking, followed by an overview of the physiological parameters of interest and the sensors used for this research (Section 3). The benchmarking methodology is presented in Section 4, where we also explain the entire workflow from sensor data acquisition to the analysis results. Section 5 descriptively illustrates the results, including a variety of statistical visualisations of similarity and correlation patterns. Finally, we discuss the results obtained and close the paper with our core conclusions.

#### **2. Sensor Benchmark Methods—Related Work**

The analysis of physiological signals from wearable sensors in order to better understand the human emotional response to the immediate surroundings has been investigated for several years. In recent years, a variety of affordable wearable sensors that measure well-established physiological parameters, such as heart rate and galvanic skin response, has reached the market. As a logical consequence, and as already mentioned in the introduction, the "Quantified Self" community is growing faster than ever and is inspiring scientific research, especially related to emotion and stress detection [5,7–9,14–17]. In any case, the basis for any further advanced analyses is adequate data quality in terms of accuracy, reliability, and validity [7,9,18]. However, scientific literature about the similarity and correlation of the measurements from such affordable wearables compared to those from well-calibrated and high-quality sensors from scientific laboratories is rare.

#### *2.1. Similarity Measures*

Generally speaking, the term 'similarity' is not rigorously mathematically defined. A variety of similarity measure families exist, for instance, distance-based (e.g., Euclidean distance), feature-based (e.g., Fourier coefficients), model-based (e.g., autoregressive), and elastic measures such as Dynamic Time Warping (DTW) and Edit Distance on Real sequence (EDR) [19–22]. A comprehensive review, however, is out of the scope of this paper; the interested reader may refer to [19,20,23,24], among other work.

In this research, we go beyond global measures and linear models to assess similarity. To uncover local similarity characteristics of time series, we thus follow a moving window approach combined with more informative distance metrics. Elastic measures, such as DTW and the Fréchet distance, allow for a one-to-many comparison of time series elements, while so-called "Lock-Step" measures, such as Euclidean and Manhattan distance, only allow comparison of fixed pairs, making them very sensitive to local time-shifts and noise [23].
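To make the moving window idea concrete, the following minimal sketch applies a lock-step distance to aligned windows of two equally sampled series. The function names, the window length, and the step size are illustrative assumptions and are not taken from the study's actual implementation.

```python
import math


def euclidean(a, b):
    """Lock-step Euclidean distance: element i of a is compared only with element i of b."""
    if len(a) != len(b):
        raise ValueError("lock-step measures need equal-length inputs")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Lock-step Manhattan distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("lock-step measures need equal-length inputs")
    return sum(abs(x - y) for x, y in zip(a, b))


def moving_window_distance(a, b, window, step=1, dist=euclidean):
    """Return a list of (start_index, distance) for aligned windows of a and b.

    Both series are assumed to be sampled at the same time stamps
    (an assumption of this sketch; resampling is not handled here).
    """
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, dist(a[start:start + window], b[start:start + window]))
        for start in range(0, len(a) - window + 1, step)
    ]


if __name__ == "__main__":
    reference = [0, 0, 1, 0, 0, 0]
    shifted = [0, 0, 0, 1, 0, 0]  # same peak, one sample later
    for start, d in moving_window_distance(reference, shifted, window=3, step=3):
        print(start, d)
```

In the toy example, a peak that is shifted by a single sample already produces a non-zero distance in both windows, which illustrates the sensitivity of lock-step measures to local time shifts.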

DTW temporally aligns two time series using the shortest path in a distance matrix, i.e., the path with the minimal global warping distance [25,26], thereby finding the most representative distance of the overall difference [20]. However, a comprehensive experimental comparison of representation methods and distance measures for time series reveals inconsistencies and even contradictions in the observations reported in individual studies [23]. An important consequence is that experimental results cannot be generalised without critically reviewing the assumptions made for a particular research context and study design. As concluded in [23], "there is no clear evidence that one similarity measure exists that is superior to others in the literature in terms of accuracy. While some similarity measures are more effective on certain data sets, they are usually inferior on some other data sets" (p. 297). Nevertheless, the DTW distance outperforms the Euclidean distance in a variety of studies [27]. Other types of measures are "Edit measures" and "Threshold measures". The former type includes, for instance, Longest Common Sub-Sequence (LCSS), Edit Distance on Real sequence (EDR), and Edit Distance with Real Penalty (ERP). The latter type includes Tightness of Lower Bounds (TLB). The accuracy of these other types is close to that of DTW, but DTW is much simpler [23,28]. We therefore decided to use DTW to assess the temporal similarity of the physiological time series.
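As an illustration of the alignment principle, the sketch below computes the classic unconstrained DTW distance with dynamic programming. The absolute difference as local cost, the absence of a warping window, and the function name `dtw_distance` are assumptions of this example; the configuration used in the study may differ.

```python
def dtw_distance(a, b):
    """Unconstrained DTW distance between two numeric sequences.

    Local cost is the absolute difference |a[i] - b[j]|. The result is the
    accumulated cost of the cheapest warping path through the cost matrix.
    Only two rows of the matrix are kept in memory.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


if __name__ == "__main__":
    reference = [0, 0, 1, 0, 0, 0]
    shifted = [0, 0, 0, 1, 0, 0]  # same peak, one sample later
    print(dtw_distance(reference, shifted))
```

For the two toy series, which differ only by a one-sample shift of the peak, the warping path absorbs the shift completely, whereas a lock-step comparison of the same series would report a non-zero distance.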

To assess the geometric shape of a curve or curve segment, other distance measures, such as the Fréchet distance [29], can be used [30–32]. "The Fréchet distance is typically explained as the relationship between a person and a dog connected by a leash walking along the two curves and trying to keep the leash as short as possible. The maximum length the leash reaches is the value of the Fréchet distance" [33] (p. 7). We thus use the Fréchet distance to assess the geometric similarity of time series of sensor measurements of the same physiological parameter (e.g., GSR) on the same participant at the same time but with different sensors.
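The leash analogy can be sketched with the discrete Fréchet distance, which restricts the person and the dog to the sampled points of each curve and therefore only approximates the continuous Fréchet distance. Whether the study uses the discrete or the continuous variant is not stated here, so the variant, the representation of a time series as (time, value) points, and the function name are assumptions of this example.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines.

    p and q are sequences of points (tuples of equal dimension), e.g.
    (time, value) pairs. ca[i][j] holds the shortest possible leash length
    needed to walk both curves from their start up to p[i] and q[j].
    """
    n, m = len(p), len(q)
    if n == 0 or m == 0:
        raise ValueError("both curves must contain at least one point")
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Either walker (or both) advanced by one point.
                best_previous = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[n - 1][m - 1]


if __name__ == "__main__":
    flat = [(0, 0.0), (1, 0.0), (2, 0.0)]
    bump = [(0, 0.0), (1, 2.0), (2, 0.0)]
    print(discrete_frechet(flat, bump))
```

Because the result is a maximum rather than a sum, a single local deviation in shape (the bump in the example) determines the distance, which is why this measure captures geometric rather than accumulated temporal differences. Note that time and value axes have different units, so some scaling is usually needed before distances between (time, value) points are meaningful.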

#### *2.2. Correlation Statistics*

The correlation of time series has been investigated for decades in diverse fields. Herein, our focus on time series correlation is twofold: first, the correlation between equal-type physiological parameters measured by different sensors at the same time on the same participant in order to quantify differences between low-cost and un-calibrated sensors versus high-end and calibrated laboratory sensor equipment; second, the correlation between physiological parameters of different types, for instance, IBI and GSR, to explore potentially hidden relationships.

According to [34], Pearson's correlation coefficient is the most robust metric for measuring the similarity of physiological time series, where robustness is understood as insensitivity to small variations. However, Pearson's r is highly sensitive to outliers and only considers linear relationships. Spearman's rank correlation coefficient (rho) is, as the name says, based on the ranks of the values rather than on the values themselves; thus, it measures monotonicity rather than linearity. Therefore, using Spearman's rho to measure the strength of the association between two variables leaves room for interpretation [35].
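The difference between linearity and monotonicity can be illustrated with a small, self-contained sketch. The helper names and the toy data are assumptions made for this example; tied values receive average ranks, which is the usual convention for Spearman's rho.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the values."""
    return pearson_r(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [v ** 3 for v in x]  # monotonic but non-linear in x
    print(pearson_r(x, y), spearman_rho(x, y))
```

For the cubic toy relationship, rho reaches its maximum because the ordering of the values is preserved, while r stays below 1 because the relationship is not linear.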

The human cardiovascular system and the autonomic nervous system are highly non-linear systems. In order to explore possible underlying non-linear interactions in the relationship between different physiological parameters, we herein use the Maximal Information Coefficient (MIC) [36,37]. Several studies show the possibility of gaining new insights into such non-linear interactions when applying the MIC, for instance, in the interactions between neural and respiratory dynamics [38].

Further, one method to assess the temporal lag (or lead) between pairs of time series is the cross-correlation function in the time domain [39]. To get meaningful cross-correlation results, the time series need to be stationary, i.e., have a constant mean and variance. Time series stationarity can be tested using, for instance, the Augmented Dickey-Fuller test [40]. If a time series is not stationary, it needs to be differenced and then tested for stationarity again.
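The differencing and lag estimation steps can be sketched as follows. This is a simplified illustration under stated assumptions: the correlation at each lag is computed as Pearson's r on the overlapping segments only (the textbook cross-correlation function normalises by the statistics of the full series), the function names are hypothetical, and the stationarity test itself is not implemented here; in practice, a statistics library (for instance, the `adfuller` function in statsmodels) would be used for the Augmented Dickey-Fuller test.

```python
import math


def difference(series, order=1):
    """Difference a series `order` times: d[t] = s[t + 1] - s[t]."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def _pearson(xs, ys):
    """Pearson's r; NaN if there are too few points or a segment is constant."""
    n = len(xs)
    if n < 3:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def cross_correlation(x, y, max_lag):
    """Map each lag k in [-max_lag, max_lag] to the correlation of x[t] and y[t + k]."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[:n - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:n + lag]
        result[lag] = _pearson(xs, ys)
    return result


def best_lag(x, y, max_lag):
    """Lag with the largest absolute correlation; positive means y trails x."""
    ccf = cross_correlation(x, y, max_lag)
    valid = {k: r for k, r in ccf.items() if not math.isnan(r)}
    if not valid:
        raise ValueError("no lag produced a defined correlation")
    return max(valid, key=lambda k: abs(valid[k]))


if __name__ == "__main__":
    x = [0, 1, 3, 2, 5, 4, 6, 9, 7, 8, 12, 10]  # trending toy signal
    y = [0, 0] + x[:-2]                          # same signal, delayed by 2 samples
    print(best_lag(difference(x), difference(y), max_lag=4))
```

Differencing removes the upward trend of the toy signal before the lag is estimated; without it, the trend alone would produce high correlations at many lags.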

#### **3. Physiological Parameters of Interest and Sensors Used for Benchmarking**

Herein, we describe the physiological parameters we investigated, and the sensors used to measure them. We investigated three sensors and four physiological parameters (Table 1):



**Table 1.** Benchmarked sensors and physiological parameters of interest.
