1. Introduction
Nowadays, smartwatches offer much more than simply connecting to a mobile phone and managing basic functions. Recent studies have shown that, thanks to integrated sensors, smartwatches can monitor heart rate, ECG, and blood pressure, or measure peripheral blood oxygen saturation (SpO2) values [1,2,3,4]. An advantage of real-time monitoring of vital signs using smartwatches is their convenient placement on the forearm, which, unlike conventional blood pressure cuffs or finger sensors for SpO2 measurement, does not restrict the user in daily activities. Furthermore, users can customize the device to their needs and view the recorded data, which can help in the diagnosis, prevention, and possibly management of various diseases [5]. Atrial fibrillation is an example of a life-threatening condition that smartwatches can diagnose very accurately [6]. With the onset of the global COVID-19 pandemic, smartwatch data have also been shown to identify those infected among symptomatic individuals [7]. Smartwatch-based SpO2 monitoring has been discussed primarily in patients suffering from chronic obstructive pulmonary disease (COPD) [8], sleep apnea syndrome [9], and COVID-19 disease [10,11]. These diseases are often characterized by a prolonged decrease in SpO2 levels, where measurement with a standard fingertip oximeter is limiting for the patient. Physiological SpO2 values in a healthy individual are in the range of 95–100% but are typically reduced to 88–92% in an individual with acute or chronic cardiopulmonary problems [12]. Smartwatch-based SpO2 monitoring may also be used to predict acute mountain sickness [13].
Smartwatches, as well as, for instance, fitness trackers, belong to a group of electronic devices commonly referred to as ‘wearable devices’. The reflectance method used in wearable devices to measure SpO2 suffers from a significantly lower signal-to-noise ratio than the standard transmission method used in clinical practice and from the much smaller perfusion of the wrist compared to the finger [14,15]. Motion artifacts are also an issue, being more common in wrist measurements than in finger measurements due to the presence of tendons and bones. If the sensor is not pressed firmly against the tissue, artifacts due to ambient lighting also occur [16]. In addition, a study by Apple [17] describes a possible problem with the deteriorated quality of the signal from the photoplethysmographic sensor in people with darker skin, which applies to all oximeters. However, it appears that despite the many complications of wrist-based measurements, it is possible to achieve high accuracy in determining the final SpO2 value through hardware and software optimization [14,15,18]. In addition, studies validating the SpO2 accuracy of wrist- or finger-worn wearables have demonstrated their ability to achieve clinically sufficient accuracy over the range of oxygen saturation values examined [19,20,21].
ISO 80601-2-61 is the international standard for assessing the accuracy of SpO2 measurements by pulse oximeters [22]. The use of this standard requires, among other things, the induction of short-term hypoxemia in participants and allows validation of the accuracy of SpO2 measurements in both invasive and non-invasive ways. The accuracy of the pulse oximetry device, determined as the root-mean-square difference (Arms) from the reference device over the SpO2 range of 70% to 100%, must be 4.0% or lower. The U.S. Food and Drug Administration (FDA) has established a stricter criterion for validating the accuracy of SpO2 measurements by pulse oximeters. For an oximeter with a reflectance method of measuring SpO2, a root-mean-square difference of 3.5% or lower is required [23]. In a study by Kirszenblat and Edouard [16] that followed this standard, the SpO2 value determined by Withings ScanWatch was compared with the arterial oxygen saturation (SaO2) value measured by a blood gas analyzer. A study by Apple [17], which followed the development of an SpO2 measurement app used in their smartwatch, also adhered to the standard, using the invasive measurement as a reference. Both studies found minimal differences in SpO2 measured by the smartwatch compared to the gold standard. A non-invasive approach to validate the accuracy of SpO2 measurement using Apple Watch 6 compared to a standard oximeter was taken in our previous study [24]. The results of the study also demonstrated the high accuracy of the smartwatch in detecting hypoxemia. Lauterbach et al. [25] used a different approach to test the smartwatch using a normobaric hypoxic chamber but again found only minimal differences in SpO2 measurements when compared to a standard oximeter. The above studies [16,17,24,25] were concerned with determining the accuracy of smartwatch SpO2 measurements in a group of healthy volunteers during hypoxemia only. Several other studies have investigated the accuracy of Apple Watch SpO2 measurement in a group of patients with pathologically impaired SpO2 values [26,27,28,29]. The measurement methods and the conclusions of the studies on the clinical use of smartwatches vary, but the results show relatively little systematic bias between the devices tested. Recently, Schroder et al. [30] pointed out a potential problem with outliers in smartwatch SpO2 measurements, as some of the values measured by the watch compared to a standard oximeter lay outside the physiological range of 95–100%, even though the measurements were performed on healthy volunteers under normal conditions.
Only two studies [16,17] have validated the accuracy of SpO2 measurements even at SpO2 levels below 80% and fully complied with ISO 80601-2-61. While other manufacturers are introducing their smartwatches capable of measuring SpO2, according to the available literature, no study has been conducted that simultaneously compares multiple smartwatch models using a single measurement method while meeting the criteria set by the above-mentioned standard.
The aim of this study was to experimentally compare the accuracy of several smartwatches with a clinically used pulse oximeter in the SpO2 range of 70–100%.
2. Methods
This prospective, interventional, randomized crossover study was approved by the Ethical Review Board of the Faculty of Biomedical Engineering of the Czech Technical University in Prague on 7 February 2023 (no. C27/2023). All participants provided written consent prior to enrollment. The study has been registered with ClinicalTrials.gov (NCT05789563).
A total of 18 healthy Caucasian volunteers (14 males, 4 females) aged 21–26 years participated in the study; the group characteristics are shown in Table 1. The number of participants enrolled in the study and the number of paired SpO2 observations were based on the International Organization for Standardization (ISO 80601-2-61:2019) guideline for in vivo accuracy testing of pulse oximeters, which requires at least 200 paired SpO2 readings balanced across the SpO2 range of 70–100% from at least 10 subjects [22]. All volunteers were screened before enrollment for cardiovascular or respiratory disease, pregnancy, diabetes, any acute illness, and upper limb or hand injury that could affect peripheral perfusion; none had to be excluded.
Each participant underwent the experimental assessment three times in a randomized order, wearing one of three smartwatches (Apple Watch 8 (Apple Inc., Cupertino, CA, USA), Samsung Galaxy Watch 5 (Samsung Electronics Co., Ltd., Suwon-si, Republic of Korea), or Withings ScanWatch (Withings, Issy-les-Moulineaux, France)). The order in which the smartwatches were worn was assigned using computer-generated random numbers. At least a 2 h recovery interval was included between the experimental assessments.
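The order assignment described above can be illustrated with a minimal sketch. The text states only that computer-generated random numbers were used; how those numbers were mapped to wearing orders is not specified, so the function name `assign_orders`, the seed, and the choice of drawing uniformly from the six possible orders are all assumptions made for illustration.

```python
import random
from itertools import permutations

WATCHES = ("Apple", "Samsung", "Withings")


def assign_orders(n_participants, seed=None):
    """Return one randomly chosen wearing order per participant.

    Hypothetical helper: draws uniformly from the 6 possible orders of
    the three watches. The study's actual procedure may have differed.
    """
    rng = random.Random(seed)
    orders = list(permutations(WATCHES))
    return [rng.choice(orders) for _ in range(n_participants)]


# Seed chosen arbitrarily so that the illustration is reproducible.
schedule = assign_orders(18, seed=2023)
for participant, order in enumerate(schedule, start=1):
    print(participant, " -> ".join(order))
```

Drawing each order independently does not balance the six orders across participants; a balanced (Latin-square style) allocation would need an extra step.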
Upon arrival at the workplace, the participants were seated in a comfortable position with their left hand placed on the table in front of them near heart level and with the wrist and palm facing down. A smartwatch (hereafter referred to as Apple, Samsung, or Withings) was attached to the participant’s left wrist according to the manufacturer’s instructions. The Radical-7 reference pulse oximeter sensor (Masimo Corporation, Irvine, CA, USA) was placed on the left middle finger. Three test SpO2 readings from the smartwatch were always taken before the start of the experimental assessment. If the three consecutive readings did not indicate SpO2 greater than 90%, the position of the smartwatch was adjusted, and the test readings were repeated.
A non-rebreathing circuit was set up for the experimental assessments. It allowed the participant to inhale either a hypoxic gas mixture from the Douglas bag or the ambient air and exhale into the ambient air outside the Douglas bag. Inhalation was performed through an anesthetic mask covering the mouth and nose. The composition of the inhaled gas mixtures was monitored continuously by a Datex-Ohmeda S/5 patient monitor (Datex-Ohmeda Inc., Madison, WI, USA) with a sensor placed in the breathing circuit. A disposable antibacterial filter separated the participant from the breathing circuit.
There were three phases in each of the 12 min experimental assessments. During the first 2 min, in the initial stabilization phase, participants inhaled the ambient air via the non-rebreathing circuit. This was followed by the 7.5 min desaturation phase, during which participants inhaled the hypoxic gas mixture from the Douglas bag. Three different hypoxic gas mixtures (14% O2, 12% O2, 10% O2) were used consecutively under normobaric conditions during the desaturation phase (2.5 min each), which we expected to cover the desired saturation range. The reduced oxygen content corresponds approximately to altitudes of 3200 m (14% O2), 4400 m (12% O2), and 5800 m (10% O2). The final stabilization phase, during which participants inhaled ambient air through the breathing circuit, lasted until stable readouts were reached.
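The altitude equivalents quoted above can be roughly reproduced with a short sketch. It assumes, as one plausible approach that the text does not confirm, that the equivalent altitude is the one at which ambient air (taken here as 20.9% O2) has the same oxygen partial pressure as the normobaric mixture, with pressure following the International Standard Atmosphere troposphere formula. Water vapor in the airways is ignored, and the function name is hypothetical.

```python
SEA_LEVEL_O2_PERCENT = 20.9  # assumed O2 content of dry ambient air


def equivalent_altitude_m(o2_percent):
    """Altitude (m) with the same O2 partial pressure as a normobaric mixture.

    Uses the ISA troposphere relation P/P0 = (1 - 2.25577e-5 * h) ** 5.25588,
    solved for h. This is an illustrative assumption, not the study's method.
    """
    pressure_ratio = o2_percent / SEA_LEVEL_O2_PERCENT
    return (1.0 - pressure_ratio ** (1.0 / 5.25588)) / 2.25577e-5


for o2 in (14, 12, 10):
    print(f"{o2}% O2 ~ {equivalent_altitude_m(o2):.0f} m")
```

Under these assumptions, the computed values fall within about 100 m of the rounded figures of 3200 m, 4400 m, and 5800 m.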
Manual readings of SpO2 values from the smartwatch and the reference oximeter were taken simultaneously at predefined time points during the experimental assessment at intervals of 40–50 s. The SpO2 value from the reference oximeter was obtained at the time the SpO2 reading from the smartwatch was completed. A total of 16 paired SpO2 readings were obtained from each experimental assessment.
Data Processing
Only successful coupled readings from both the smartwatch and reference oximeter were included in the final analysis. All data were analyzed in Matlab 2021a (MathWorks, Natick, MA, USA) after transcription from the participant’s log.
To compare the three smartwatch models, each set of paired data was fitted with a linear regression line using the method of least squares, and the correlation coefficient was determined.
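The regression and correlation step can be sketched as follows. The analysis in the study was carried out in Matlab; this Python version is only an illustration of ordinary least squares and the Pearson coefficient, and the function name and sample readings are invented for the example.

```python
from math import sqrt


def fit_line_and_pearson(reference, watch):
    """Least-squares line watch = slope * reference + intercept, plus Pearson r."""
    n = len(reference)
    mean_x = sum(reference) / n
    mean_y = sum(watch) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    syy = sum((y - mean_y) ** 2 for y in watch)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, watch))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up readings (% SpO2) for illustration; NOT data from the study.
reference = [98, 96, 93, 90, 86, 82, 78]
watch = [97, 96, 92, 88, 85, 80, 77]
slope, intercept, r = fit_line_and_pearson(reference, watch)
print(f"slope={slope:.2f}, intercept={intercept:.1f}, r={r:.3f}")
```

A slope near 1 and an intercept near 0 correspond to a regression line close to the identity line.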
To assess the relative response of the smartwatch and the oximeter, we compared the SpO2 readings of all participants for both devices at each time point of the experimental assessment using a two-tailed paired t-test, with a p-value of less than 0.05 considered statistically significant.
The Bland–Altman analysis was conducted to evaluate the agreement between the SpO2 readings obtained from the smartwatch and the oximeter. This standard approach determines the scatter and bias between measurement methods. The 95% limits of agreement (LOAs) were calculated by adding and subtracting 1.96 standard deviations from the mean bias to provide an estimate of the expected differences between the simultaneous SpO2 readings acquired from the smartwatch and the oximeter. The standard deviation was calculated using the modified Bland–Altman method for multiple observations per individual when the measured quantity changes over the observation period. The mean bias was calculated as the average difference between the smartwatch and the oximeter measurements. Mean bias, LOAs, and Arms between the smartwatch and the reference oximeter were also calculated for subintervals of the entire measured SpO2 range (100–91%, 90–81%, and ≤80%). Paired SpO2 readings were assigned to each interval according to the SpO2 value from the reference oximeter.
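The quantities defined above can be sketched in a few lines. Note one loud simplification: the standard deviation below is the plain sample standard deviation of all differences, not the modified Bland–Altman estimate for multiple observations per individual that the study used, so limits of agreement from this sketch would generally be somewhat narrower. The function names, interval labels, and sample values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def agreement_stats(watch, reference):
    """Mean bias, 95% limits of agreement, and Arms for paired readings."""
    diffs = [w - r for w, r in zip(watch, reference)]
    bias = mean(diffs)
    # Simplification: plain SD, NOT the repeated-measures modification.
    sd = stdev(diffs)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)
    a_rms = sqrt(mean([d * d for d in diffs]))
    return bias, loa, a_rms


def interval_label(reference_value):
    """Assign a pair to an SpO2 subinterval by the reference oximeter value."""
    if reference_value >= 91:
        return "91-100"
    if reference_value >= 81:
        return "81-90"
    return "<=80"


# Made-up readings (% SpO2) for illustration; NOT data from the study.
watch = [95, 90, 85]
reference = [96, 90, 83]
bias, (lower, upper), a_rms = agreement_stats(watch, reference)
print(f"bias={bias:.2f}, LOA=({lower:.2f}, {upper:.2f}), Arms={a_rms:.2f}")
```

The computed Arms can then be compared with the 4.0% (ISO) and 3.5% (FDA, reflectance) limits mentioned in the Introduction.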
Finally, we evaluated the diagnostic sensitivity, specificity, and accuracy of each smartwatch in detecting hypoxemia, defined as SpO2 below 90% based on the reference oximeter, similar to the study by Santos et al. [19]. SpO2 values below 90% can be considered as a serious deterioration in oxygenation [31].
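A minimal sketch of this classification is given below. It assumes that a smartwatch reading counts as a positive when it is also below 90%, which the text implies but does not state explicitly; the function name and the sample readings are invented for illustration.

```python
def hypoxemia_metrics(watch, reference, threshold=90):
    """Sensitivity, specificity, and accuracy of the watch in flagging hypoxemia.

    Ground truth: reference SpO2 < threshold.
    Assumption: the watch flags hypoxemia when its own reading < threshold.
    Requires at least one hypoxemic and one non-hypoxemic reference reading.
    """
    tp = fp = tn = fn = 0
    for w, r in zip(watch, reference):
        truth = r < threshold
        flagged = w < threshold
        if truth and flagged:
            tp += 1
        elif truth:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Made-up readings (% SpO2) for illustration; NOT data from the study.
watch = [88, 91, 89, 95, 92]
reference = [89, 88, 91, 96, 93]
sensitivity, specificity, accuracy = hypoxemia_metrics(watch, reference)
print(sensitivity, specificity, accuracy)
```

A device that systematically reads low will flag more readings as hypoxemic, which tends to raise sensitivity at the cost of specificity.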
3. Results
The study was conducted on healthy volunteers at the Faculty of Biomedical Engineering in Kladno, Czech Republic, in the Laboratory of Special Equipment for ICU during March and April 2023. All 18 participants completed all three stages of the experimental assessment, resulting in 54 complete datasets with 864 paired manual SpO2 readings (288 for each smartwatch). Of the 288 readings attempted with each smartwatch, 274 (95%) were successfully displayed for Apple, 283 (98%) for Samsung, and 238 (83%) for Withings. Of the 795 successful paired manual SpO2 readings in total, 454 (57%) were in the 91–100% SpO2 range, 229 (29%) were in the 81–90% SpO2 range, and 112 (14%) were in the ≤80% SpO2 range.
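The counts and percentages in this paragraph can be cross-checked with simple arithmetic. All numbers in the sketch come from the text; only the variable names are introduced here.

```python
# 18 participants x 16 paired readings per assessment, for each smartwatch
ATTEMPTED_PER_WATCH = 18 * 16

successful = {"Apple": 274, "Samsung": 283, "Withings": 238}
success_rate = {
    name: 100 * count / ATTEMPTED_PER_WATCH for name, count in successful.items()
}

total_successful = sum(successful.values())
by_range = {"91-100%": 454, "81-90%": 229, "<=80%": 112}
share = {label: 100 * count / total_successful for label, count in by_range.items()}

for name, rate in success_rate.items():
    print(f"{name}: {rate:.1f}% of {ATTEMPTED_PER_WATCH} readings successful")
for label, pct in share.items():
    print(f"{label}: {pct:.0f}% of {total_successful} successful readings")
```

The success rates are fractions of the 288 readings attempted per smartwatch, whereas the range shares are fractions of the 795 successful readings pooled over all three devices.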
The individual datasets for all smartwatches were fitted with a regression line using the least squares method (Figure 1). While the regression line for Apple and Withings followed the ideal identity line well throughout the 70–100% SpO2 range, the difference between the SpO2 measured by Samsung and that measured by the reference oximeter decreased as SpO2 decreased. Pearson correlation coefficients were greater than 0.9 for all three smartwatches.
The average SpO2 values measured by the smartwatch and the reference oximeter at each time point of the experimental assessment are depicted for each smartwatch in Figure 2A–C. The average SpO2 values measured by the reference oximeter for all experimental assessment stages decreased from 98% in the stabilization phase to about 78% at the end of the desaturation phase. The average difference between the smartwatch and the reference oximeter ranged from 0.0% SpO2 to −1.4% SpO2 for Apple, from −1.8% SpO2 to −3.2% SpO2 for Samsung, and from 0.0% SpO2 to −8.3% SpO2 for Withings. For Apple, there was only one statistically significant difference in the first manual reading (Figure 2A). For Samsung, there were statistically significant differences throughout the experimental assessment (Figure 2B), and for Withings, there were three statistically significant differences (Figure 2C).
Bland–Altman plots that evaluate potential bias and limits of agreement between the smartwatch and the reference oximeter, derived from all pairs of pooled successfully obtained SpO2 readings, are displayed in Figure 3A–C. The mean bias between the SpO2 values measured by the smartwatch and by the reference oximeter was −0.1% for Apple (Figure 3A), −2.6% for Samsung (Figure 3B), and 0.4% for Withings (Figure 3C). The 95% limits of agreement (LOAs) were found to be between −4.4% and 4.2% SpO2 for Apple, with the largest difference between the smartwatch SpO2 reading and the reference oximeter being −7% in the negative direction and 8% in the positive direction (Figure 3A). The 95% LOA for Samsung ranged from −8.1% to 2.9% SpO2, and the largest difference was −14% in the negative direction and 4% in the positive direction (Figure 3B). The 95% LOA for Withings ranged from −6.5% to 7.2% SpO2, and the largest difference was −15% in the negative direction and 8% in the positive direction (Figure 3C).
The comparison between the smartwatch and the reference oximeter was further analyzed after splitting the data into three SpO2 intervals (100–91%, 90–81%, ≤80%). The mean bias, lower and upper LOA, and Arms are summarized in Table 2. As shown, Apple and Withings had a mean bias that was not statistically different from zero in all intervals, except for the 90–81% SpO2 interval for Withings. In contrast, Samsung had a mean bias that was always statistically significant, although small. The widest range between lower and upper LOA was found for Withings and the narrowest for Apple. The calculated Arms was less than 4% for all the smartwatches.
The reliability of smartwatches in detecting hypoxemia (SpO2 < 90%) is shown in Table 3. The negative mean bias in Samsung resulted in the highest sensitivity but the lowest specificity. The accuracy of Apple was statistically significantly higher than that of Samsung.
4. Discussion
In this experimental study, we directly compared the SpO2 measurements of three smartwatches from different manufacturers in young and healthy participants. Our main finding is that, although there are differences in the accuracy of SpO2 measurements between the smartwatches, these differences are small and of little importance to the average user. All three smartwatch models (Apple Watch 8, Samsung Galaxy Watch 5, and Withings ScanWatch) meet the accuracy requirements according to ISO 80601-2-61 when compared to the reference medical-grade pulse oximeter. However, in our study, only Apple Watch 8 and Withings ScanWatch met the more stringent FDA accuracy requirements.
When comparing the averages of paired readings between the smartwatch and the reference oximeter over time (Figure 2), a similar pattern of SpO2 decrease during the experimental assessment can be observed. Nevertheless, the only statistically significant difference for Apple was observed at the first time point of the SpO2 readings (Figure 2A), whereas, for Samsung, there was a statistically significant negative bias at all 16 time points of the SpO2 readings (Figure 2B). For Withings, the three statistically significant differences occurred at 190 s, 310 s, and 600 s (Figure 2C). These differences may have been caused by the long interval required for the Withings smartwatch to determine the SpO2 value (30 s) compared to the 2–4 s averaging time of the reference oximeter. The longer time of SpO2 calculation may result in a deviation from the reference pulse oximeter when there is a rapid change in SpO2. Possible differences in calibration curves could also contribute to local differences.
The overall root-mean-square deviation of Apple (2.2%) is comparable to the value found in previous studies [17,24]. Samsung and Withings had Arms values that were higher but still within the limits of acceptable accuracy specified by the ISO standard. Apple and Withings also showed a mean bias of less than 1% in absolute value (Figure 3), which is completely negligible from a clinical point of view. In studies by Apple [17], Rafl et al. [24], and Pipek et al. [27], a mean bias of less than 1% in SpO2 was also observed for Apple smartwatches. Samsung had the largest mean bias, −2.6%. For Withings and Samsung, there was a decrease in Arms and the mean bias at lower SpO2 values. However, this trend was reversed for the Apple smartwatch and is contrary to the findings of Kirszenblat and Edouard [16] for Withings.
The authors of this study believe that the three selected smartwatch models appropriately represent the global market, as Apple’s watch market share reached 43% in terms of shipments in Q4 2022, making it the leading vendor. Samsung was next in line with 8% followed by Huawei, Amazfit, Garmin, Withings, and others. This distribution has not changed significantly over the years [32]. It is interesting to note that Withings ScanWatch is the only smartwatch on the market with FDA clearance for the functions of monitoring abnormal heart rhythms using ECG and alerting for breathing problems during the night using SpO2 measurement [33].
For Apple, of the 288 paired SpO2 readings, 274 (95.1%) were successful overall, indicating the smartwatch was properly attached to the wrist. This success rate is comparable to the 94.7% reported in the Apple study [17]. Samsung achieved an even higher success rate for paired SpO2 readings in this study (98.3%). In contrast, Withings achieved a success rate of only 82.6%. In the study by Kirszenblat and Edouard [16], which also tested Withings ScanWatch, the success rate was comparable. Thus, it should be considered that, even under ideal measurement conditions (participants at rest with no movement and with the hand in front according to the manufacturer’s recommendations), there may be a number of failed SpO2 readings with some smartwatches.
All three smartwatch models demonstrated high diagnostic accuracy for hypoxemia (SpO2 < 90%). Although the sensitivity was highest for Samsung (0.97), the smartwatch underestimated the SpO2 value compared to the reference oximeter throughout the measurement range and consequently had the lowest specificity value (0.76). However, the authors of this study suggest that underestimating SpO2 values is preferable to overestimating them.
This study had several limitations. First, only healthy Caucasian volunteers aged 21–26 years participated in the study. The gender imbalance of the study participants, which approximates the gender distribution of our students, may also be perceived as a limitation, but we did not expect this to significantly affect the results. Results may vary in elderly patients with chronic disease or due to differences in skin pigmentation, which affects light transmission and reflectance. Second, the method of inducing hypoxemia did not allow stable SpO2 values or the same level of desaturation to be achieved in all participants. This also resulted in a relatively low number of SpO2 measurements below 80%. On the other hand, based on our experience with these types of hypoxic experiments, we believe that the method is a good compromise between slow desaturation, the reaching of relatively low stable SpO2 values, and the tolerable length of the experiment for the participants. Third, the SpO2 measurements were performed under laboratory conditions when the participants were at rest, comfortably seated, and the correct position of the smartwatch was verified. Thus, the success rate of SpO2 readings would likely be lower in routine practice, where the position of the smartwatch is not checked multiple times and the patient is not sitting perfectly still with their hand on the table in front of them. However, verifying the effects of the smartwatch position and motion artifacts on reading success rate was not the focus of this study. Fourth, we did not evaluate SaO2 in this study as it is a method that requires invasive arterial blood sampling, which greatly complicates the experimental assessment and increases the safety requirements for the participants. Finally, we do not know exactly how each smartwatch’s algorithms work to determine the final displayed SpO2 value. The smartwatches differ in the interval required to determine the final SpO2 value. For Apple Watch 8, the interval is 15 s. Samsung Galaxy Watch 5 does not have a fixed interval for determining the resulting value; it ranges between 12 and 17 s. Withings ScanWatch determines the SpO2 value in an interval of 30 s. As a result, we were unable to fully distinguish SpO2 measurement variations between devices from the time shift. For Apple and Samsung, the time shift was not apparent; however, for Withings, it appears that the difference between the reference oximeter and the smartwatch in the first half of the experimental assessment was due to the long interval required to determine the final SpO2 value, whereas, toward the end of the desaturation phase, this effect was no longer apparent (Figure 2C).
This study compared smartwatches from two top-selling smartwatch manufacturers as well as a smartwatch with FDA clearance for detecting nighttime breathing problems. To our knowledge, no previous study has compared multiple smartwatch models simultaneously using a single testing method. Therefore, we suggest the findings of the study can be applied to smartwatches in general with greater confidence than studies that validate a single model of a single manufacturer. Overall, the analysis of SpO2 measurements by smartwatches showed the high accuracy of these devices compared to a standard pulse oximeter. The differences we found are small, of little importance to the average user, and likely to diminish as manufacturers introduce new models. Smartwatches are not intended for clinical SpO2 measurement, as the manufacturers themselves emphasize. Although the overall accuracy of smartwatches is sufficient, the long time needed to determine the SpO2 value and the high sensitivity to motion artifacts limit their potential clinical use. On the other hand, smartwatches allow long-term and continuous monitoring of SpO2 trends, detection of abnormal fluctuations, and, thus, faster evaluation of changes in the user’s health status over time. This is particularly advantageous for some groups of individuals, such as those suffering from chronic pulmonary disease, sleep apnea, or post-COVID syndrome.