1. Introduction
Healthcare applications see an ever-growing need for high-quality medical data in abundant amounts. In particular, the rise of smart health services can provide valuable insights into individual health conditions and personalized remedy recommendations. For example, solutions for detecting stress from physiological measurements of wearable devices receive attention from academic [1,2,3] and industrial [4,5,6] communities alike.
However, each entry in a medical dataset often contains detailed information about an individual’s health status, making it highly sensitive and motivating various anonymization techniques [7,8,9]. Still, the risk of re-identification persists, as current methods can successfully identify individuals based solely on their health signal data [10,11,12,13]. These threats ultimately lead to complex ethical and privacy requirements, complicating the collection of and access to sufficient patient data for real-world research [14,15].
Regarding patient privacy, training machine learning models under the constraints of Differential Privacy (DP) [16] provides a robust and verifiable privacy guarantee. This approach ensures the secure handling of sensitive data and effectively mitigates the risk of potential attacks when these models are deployed in operational settings.
To address the limitations related to data availability, one effective strategy is the synthesis of data points, often achieved through techniques like Generative Adversarial Networks (GANs) [17]. GANs learn the statistical distribution of a given dataset and leverage this knowledge to generate new synthetic samples that follow the same distribution. In addition, we can directly integrate the privacy assurances of DP into the GAN training process, yielding a privacy-preserving generation model. This ensures that the synthetically generated data offer and maintain privacy guarantees [18].
In this work, we train both non-private GAN and private DP-GAN models to generate new time-series data needed for smartwatch stress detection. Existing datasets for stress detection are small and can benefit from augmentation, especially considering the difficulties of privately training detection models using DP [19]. We present and evaluate multiple strategies for incorporating both non-private and private synthetic data to enhance the utility–privacy trade-off introduced by DP. Through this augmentation, we aim to optimize the performance of privacy-preserving models in a scenario constrained by limited amounts of real data.
Our contributions are as follows:
We build GAN-based data generation models that produce synthetic multimodal time-series sequences corresponding to the available smartwatch health sensors. Each data point represents a moment of stress or non-stress and is labeled accordingly.
Our models generate realistic data that are close to the original distribution, allowing us to effectively expand or replace publicly available, albeit limited, data collections for stress detection while keeping their characteristics and offering privacy guarantees.
With our solutions for training stress detection models on synthetic data, we are able to improve on state-of-the-art results. Our private synthetic data generators for training DP-conform classifiers enable applying DP with much better utility–privacy trade-offs and lead to higher performance than before. We give a quick overview of the improvements over related work in Table 1.
Our approach enables applications for stress detection via smartwatches while safeguarding user privacy. By incorporating DP, we ensure that the generated health data can be leveraged freely, circumventing privacy concerns of basic anonymization. This facilitates the development and deployment of accurate models across diverse user groups and enhances research capabilities through increased data availability.
In Section 2, we briefly review relevant basic knowledge and concepts before focusing on existing related work regarding synthetic health data and stress detection in Section 3. Section 4 presents an overview of our methodology, describes our experiments, and gives reference to the environment for our implementations. The outcome of our experiments is then detailed and evaluated in Section 5. Section 6 is centered around discussing the implications of our results and determining the actual best strategies from different perspectives, as well as their possible limitations. Finally, in Section 7, we provide both a concise summary of our key findings and an outlook on future work.
4. Methodology
In this part, we detail the different methods and settings for our experiments. A general overview is given in Figure 2, while each process and component is described in more detail in the following subsections.
4.1. Environment
On the software side, we employ Python 3.8 as our programming language and utilize the TensorFlow framework for our machine-learning models. The accompanying TensorFlow Privacy library provides the relevant DP-SGD training implementations. Our hardware configuration for the experiments comprises machines with 32 GB of RAM and an NVIDIA GeForce RTX 2080 Ti graphics card. We further set the random seed to 42.
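A minimal sketch of such a reproducibility setup is given below; only the seed value 42 is taken from the text, while the exact seeding calls are our assumption.

```python
# Hypothetical seeding routine for the described environment
# (Python 3.8 + TensorFlow); the seed value 42 comes from the text.
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)  # stabilize hash-based ordering
random.seed(SEED)                         # Python stdlib RNG
np.random.seed(SEED)                      # NumPy RNG
tf.random.set_seed(SEED)                  # TensorFlow op-level RNG
```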
4.2. Dataset Description
Our proposed method is examined on the openly accessible multimodal WESAD dataset [20], a frequently utilized dataset for stress detection. The dataset comprises 15 healthy participants (12 males, 3 females), each with approximately 36 min of health data recorded during a laboratory study. Throughout this time, data are continuously and concurrently collected from a wrist-worn and a chest-worn device, both providing multiple modalities as time-series data. We limit our consideration to signals obtainable from wrist-worn wearables such as smartwatches, specifically the Empatica E4 device used in the dataset. The wristband provides six modalities at varying sampling frequencies: blood volume pulse (BVP), electrodermal activity (EDA), body temperature (TEMP), and three-axis acceleration (ACC[x,y,z]). The dataset records three pertinent affective states: neutral, stress, and amusement. Our focus is on binary classification, distinguishing stress from non-stress, where the latter merges the neutral and amusement classes. Ultimately, the data comprise approximately 30% stress and 70% non-stress instances.
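For illustration, reading one subject's recordings from the published WESAD pickle files might look as follows; the file layout follows the dataset documentation, while the helper name is ours.

```python
import pickle

def load_wesad_subject(path):
    """Read one WESAD subject file, e.g., 'WESAD/S4/S4.pkl'.

    Each pickle holds a dict with 'signal' (chest and wrist
    modalities), 'label' (study protocol labels), and 'subject'.
    """
    with open(path, "rb") as f:
        data = pickle.load(f, encoding="latin1")
    wrist = data["signal"]["wrist"]
    # Wrist modalities with their native sampling rates per the docs:
    # ACC 32 Hz (3 axes), BVP 64 Hz, EDA 4 Hz, TEMP 4 Hz.
    signals = {
        "ACC": wrist["ACC"],
        "BVP": wrist["BVP"],
        "EDA": wrist["EDA"],
        "TEMP": wrist["TEMP"],
    }
    return signals, data["label"]
```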
4.3. Data Preparation
Transforming the time-series signal data to match the expected input format requires several pre-processing steps and is crucial for achieving good models. For our approach, we adopt the process of Gil-Martin et al. [22] in many points. We, however, change some key transformations to better accommodate the data to our setting and stop at 60-s windows, since we feed them into our GANs instead of their CNN model. Our process can be divided into four general steps.
First, since the Empatica E4 signal modalities are recorded at different sampling rates due to technical implementations, they need to be resampled to a unified sampling rate. We further need to align these sampled data points to ensure that for each point of time in the time series, there is a corresponding entry for all signals. To achieve this, signal data are downsampled to a consistent sampling rate of 1 Hz using the Fourier method. Despite the reduction in original data points, most of the crucial non-stress/stress dynamics are still captured after the Fourier transformation process, while model training is greatly accelerated by reducing the number of samples per second. An additional result is the smoothing of the signals, which helps the GAN in learning the important overall trends without smaller fluctuations present due to higher sampling rates.
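Since SciPy's resample routine implements exactly this kind of Fourier-method resampling, the first step could be sketched as follows; the function name and example rate are illustrative, not our exact pipeline.

```python
import numpy as np
from scipy.signal import resample

def downsample_to_1hz(signal, native_rate_hz):
    """Fourier-method downsampling of a 1-D signal to 1 Hz."""
    n_seconds = int(len(signal) / native_rate_hz)
    # scipy.signal.resample works in the frequency domain, which also
    # smooths the signal as described in the text.
    return resample(np.asarray(signal, dtype=float).ravel(), n_seconds)

# Example: BVP recorded at 64 Hz becomes one sample per second.
# bvp_1hz = downsample_to_1hz(bvp, native_rate_hz=64)
```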
In the second step, we adjust the labels by combining neutral and amusement into the common non-stress label. In addition to these data, we only keep the stress part of the dataset. We reduce the labels because we aim to enhance binary stress detection, which only distinguishes between moments of stress and moments without stress. However, keeping only neutral data would underestimate the importance of differentiating the amusement phase from the stress phase, since there is an overlap in signal characteristics, such as BVP or ACC, between amusement and stress [20]. After the first and this relabeling step, we obtain an intermediate result of 23,186 non-stress- and 9966 stress-labeled seconds.
Thirdly, we normalize the signals using a min–max normalization in the range of [0,1] to eliminate the differences in scale among the modalities while still capturing their relationships. In addition, the normalization has a great impact on the subsequent training process, as it helps the model to converge faster, thus shortening the time to learn an optimal weight distribution.
Given that the dataset consists of about 36 min of session data per subject, in our fourth and final step, we divide these long sessions into smaller time frames that pose as input windows for our models. We transform each session into 60-s windows but additionally, as described by Dzieżyc et al. [31], we introduce a sliding window effect of 30 s. This means that instead of neatly dividing the stream into 60-s windows, we create a 60-s window after every 30 s of the data stream. These additional intermediate windows fill the gap between cleanly aligned 60-s windows by overlapping the previous and the next window by 30 s each, providing more contextual information by capturing the correlated time series between individual windows. Additionally, sliding windows increase the number of data points available for subsequent training. We opt for a 30-s offset over shorter ones to limit the repeated inclusion of unique data points, which would escalate the amount of DP noise with increased sampling frequency, as detailed in Section 4.8. A lower number of overlapping windows ensures manageable DP noise while still giving more samples. To assign a label to a window, we determine the majority class in the given 60-s frame. Finally, we concatenate the 60-s windows and their associated class labels from all subjects into a final training dataset.
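Steps three and four could be sketched as follows, assuming the six modalities are already aligned at 1 Hz; the function names and the tie-breaking rule for the majority label are our own choices.

```python
import numpy as np

def min_max(x):
    # Step 3: scale one modality to [0, 1].
    return (x - x.min()) / (x.max() - x.min())

def make_windows(signals, labels, win=60, stride=30):
    """Step 4: 60-s windows with a 30-s sliding offset.

    `signals`: (T, 6) array of the six 1 Hz modalities,
    `labels`:  length-T vector with 0 (non-stress) / 1 (stress).
    """
    X, y = [], []
    for start in range(0, len(signals) - win + 1, stride):
        X.append(signals[start:start + win])
        # The majority class inside the 60-s frame decides the label
        # (ties fall to stress here, an arbitrary choice).
        y.append(int(labels[start:start + win].mean() >= 0.5))
    return np.stack(X), np.array(y)
```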
An example of pre-processed data is given in Figure 3, where we show the graphs for Subject ID4 from the WESAD dataset after the first three processing steps. The orange line represents the associated label for each signal plot and is given as 0 for non-stress and 1 for stress. Certain differences between the two states are already visible in the signal curves.
4.4. Generative Models
After transforming our signal data into a suitable and consistent input format, it is important to determine the proper model architecture for the given data characteristics. Compared to the original GAN architecture [17], we face three main challenges:
Time-series data: Instead of singular and individual input samples, we find continuous time-dependent data recorded over a specific time interval. Further, each data point is correlated to the rest of the sequence before and after it.
Multimodal signal data: For each point in time, we find not a single sample but one each for all of our six signal modalities. Artificially generating this multimodality is further complicated by the fact that the modalities correlate to each other and to their labels.
Class labels: Each sample also has a corresponding class label as stress or non-stress. This is solvable with standard GANs by training a separate GAN for each class, as when using the Time-series GAN (TimeGAN) [32]. However, with such individual models, some correlation between label and signal data might be lost.
Based on these data characteristics and resulting challenges, we have selected the following three GAN architectures that address these criteria in different ways.
4.4.1. Conditional GAN
The Conditional GAN (CGAN) architecture was first introduced by Mirza and Osindero [33]. Here, both the generator and the discriminator receive additional auxiliary input information, such as a class label, with each sample. This means that, in addition to generating synthetic samples, the CGAN is able to learn and output the corresponding labels for synthetic samples, effectively allowing the synthesis of labeled multimodal data. For our time-series CGAN variant, we mainly adopt the architecture and approach from the related work by Ehrhart et al. [28]. They also evaluated the CGAN against the TimeGAN and determined that the TimeGAN’s generative performance was inferior for our specific task; consequently, we exclude the TimeGAN from our evaluation. The used CGAN architecture is based on the LSTM-CGAN [34] but is expanded by a diversity term to stabilize training and an FCN discriminator model with convolutional layers. We instead rely on an LSTM discriminator stacking two LSTM layers, which performs better in our scenario [35]. As hyperparameters, we employ an Adam [36] optimizer and set the diversity term and learning rate through hyperparameter tuning; we further pick a batch size of 64 and train for 1600 epochs, values likewise derived from tuning.
4.4.2. DoppelGANger GAN
The other architecture considered is the DoppelGANger GAN (DGAN) by Lin et al. [37]. Like the CGAN, the DGAN uses LSTMs to capture relationships inside the time-series data. Thanks to a new architectural element, the DGAN is able to include multiple generators in its training process. The goal is to decouple the conditional generation part from the time-series generation. They thus include separate generators for auxiliary metadata, like labels, and for the continuous measurements. In the same vein, they use an auxiliary discriminator in addition to the standard discriminator, which exclusively judges the correctness of metadata outputs. To address mode collapse problems, they further introduce a third generator, which treats the min and max of signal values as additional metadata. By combining these techniques, Lin et al. [37] try to incorporate the relationships between the many different attributes. This approach also offers the advantage that a trained model can be further refined and, by flexibly changing the metadata, can generate synthetic data for a different use case. In terms of hyperparameters, we use a tuned learning rate and train for 10,000 epochs with the number of training samples as the batch size.
4.4.3. DP-CGAN
Our private DP-GAN architecture of choice is the DP-CGAN, which was already used by Torkzadehmahani et al. [30], albeit without our focus on time-series data. With its multiple generator and discriminator parts, the DGAN has a harder time complying with private training, which is why we stay with the CGAN that performed well in the initial tests. To incorporate our task into the architecture, we take the CGAN part from Ehrhart et al. [28] and make it private using DP-SGD. More specifically, we use the DP-Adam optimizer, an Adam variant of DP-SGD. For privatizing the CGAN architecture, we draw on the DP-GAN ideas of both Xie et al. [18] and Liu et al. [38]. Both approaches secure a DP guarantee for GANs by applying noise to the gradients through the optimizer during training. During GAN training, the generator only reacts to the feedback received from the discriminator, while the discriminator is the part that accesses real data for calculating the loss function [38]. From this, we can determine that only the discriminator needs to inject noise when seeing real samples to hide their influence. Thus, only the discriminator switches to the DP optimizer, and the generator keeps its standard training procedure. The hyperparameters of DP-CGAN training are described in Section 4.8, where we focus on the information necessary for implementing the private training.
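A minimal sketch of this split is given below, assuming the DP-Adam implementation from TensorFlow Privacy; all concrete values are placeholders rather than our tuned settings.

```python
# Only the discriminator trains privately; the generator never touches
# real data and keeps a standard Adam optimizer.
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import (
    DPKerasAdamOptimizer,
)

generator_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

discriminator_optimizer = DPKerasAdamOptimizer(
    l2_norm_clip=1.0,        # placeholder gradient norm clip
    noise_multiplier=1.1,    # derived from the target (epsilon, delta)
    num_microbatches=8,      # here equal to the batch size
    learning_rate=1e-4,
)
# Note: the discriminator loss must be computed per example
# (reduction=NONE) so gradients can be clipped per microbatch.
```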
4.5. Synthetic Data Quality Evaluation
Under the term of data quality, we unite the visual and statistical evaluation methods for our synthetic data. We use the following four strategies to obtain a good understanding of the achieved diversity and fidelity provided by our GANs:
Principal Component Analysis (PCA) [39]. As a statistical technique for simplifying and visualizing a dataset, PCA converts many correlated statistical variables into principal components to reduce the dimensional space. PCA identifies the principal components that best explain the data while preserving their coarser structure. We restrict our analysis to the first two PCs, which is a feasible representation since the major PCs capture most of the variance.
t-Distributed Stochastic Neighbor Embedding (t-SNE) [40]. Another method for visualizing high-dimensional data is t-SNE. Each data point is assigned a position in a two-dimensional space, reducing the dimension while maintaining significant variance. Unlike PCA, it is less suited to preserving the location of distant points but better represents the similarity between nearby points.
Signal correlation and distribution. To validate the relationships between signal modalities and to their respective labels, we analyze the strength of the Pearson correlation coefficients [41] found inside the data. A successful GAN model should output synthetic data with a similar correlation structure as the original training data. Even though correlation does not imply causation, the correlation between labels and signals can be essential for training classification models. Additionally, we calculate the corresponding p-values (probability values) [42] for our correlation coefficients to analyze whether our findings are statistically significant. As a further analysis, we also look at the actual distribution of signal values to see if the GANs are able to replicate these statistics.
Classifier Two-Sample Test (C2ST). To evaluate whether the generated data are overall comparable to real WESAD data, we employ a C2ST mostly as described by Lopez-Paz and Oquab [43]. The C2ST uses a classification model that is trained on a portion of both real and synthetic data, with the task of differentiating between the two classes. Afterward, the model is fed a test set that again consists of real and synthetic samples in equal amounts. If the synthetic data are close to the real data, the classifier will have a hard time correctly labeling the samples, leaving it with a low accuracy. In the optimal case, the synthetic data are indistinguishable and the classifier cannot beat the 0.5 accuracy of random guessing, e.g., by labeling all test samples as real. This test shows us whether the generated data are indistinguishable from real data for a trained classifier. For our C2ST model, we decided on a Naive Bayes approach.
4.6. Use Case Specific Evaluation
We test the usefulness of our generated data in an actual stress detection task classifying stress and non-stress data. The task is based on the WESAD dataset and follows an evaluation scheme using Leave One Subject Out (LOSO) cross-validation. In standard machine-learning evaluation, we would split the subjects from the WESAD dataset into distinct train and test sets; we would then only test on the selected subjects, which would also be excluded from training. In the LOSO format, we instead train 15 different models, one for each subject in the WESAD dataset. A training run uses 14 of the 15 subjects as the training data and the 15th subject as the test set for evaluation. Thereby, when cycling through the whole dataset with this strategy, every subject constitutes the test set once and is included in the training of the 14 other runs. This allows us to evaluate the classification results for each subject. For the final result, all 15 test set results are averaged into one score, simulating an evaluation over all subjects. This process is also performed by the related work presented in Table 1.
To evaluate our synthetic data, we generate time-series sequences per GAN model with the size of an average subject of roughly 36 min in the WESAD dataset. We also conform to the same distribution of stress and non-stress with about 70% and 30%, respectively. By this, we want to generate comparable subject data that allow us to realistically augment or replace the original WESAD dataset with synthetic data. We can then evaluate the influence of additional subjects on the classification. The synthetic subjects are included in each training round of the LOSO evaluation but the test sets are only based on the original 15 subjects to obtain comparable and consistent results. The GANs are also part of the LOSO procedure, which means the subject that currently provides the test set is omitted from their training. Finally, each full LOSO evaluation run is performed 10 times to better account for randomness and fluctuations from the GAN data, classifier training, and DP noise. The results are then again averaged into one final score.
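Schematically, the LOSO procedure with optional synthetic subjects could be expressed as below; all function arguments are placeholders for our training and scoring routines.

```python
import numpy as np

def loso_score(subjects, train_fn, score_fn, synth_fn=None):
    """LOSO: train on all but one subject, test on the held-out one.

    `subjects` maps subject id -> that subject's windows and labels.
    `synth_fn` returns synthetic subjects from a GAN whose own
    training excluded the held-out subject (LOSO-conform).
    """
    scores = []
    for test_id in subjects:
        train_data = [d for s, d in subjects.items() if s != test_id]
        if synth_fn is not None:
            train_data += synth_fn(exclude=test_id)
        model = train_fn(train_data)
        scores.append(score_fn(model, subjects[test_id]))
    return float(np.mean(scores))  # averaged F1 over all held-out runs
```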
For an evaluation metric, we use the F1-score over accuracy since it combines both precision and recall and shows the balance between these metrics. The F1-score gives their harmonic mean and is particularly useful for unbalanced datasets, such as the WESAD dataset with its minority label distribution for stress. Precision is defined as $\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$, while recall is $\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$, and the F1-score is then given as $F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$.
To improve the current state-of-the-art classification results using our synthetic data, we test the following two strategies in both non-private and private training scenarios:
Train Synthetic Test Real (TSTR). The TSTR framework is commonly used in the synthetic data domain: the classification model is trained on just the synthetic data and then evaluated on the real data for testing. We implement this concept by generating synthetic subject data in differing amounts, i.e., numbers of subjects. We first use the same size as the WESAD set of 15 subjects to simulate a synthetic replacement of the dataset, and then evaluate a larger synthetic set of 100 subjects. Complying with the LOSO method, the model is trained using the respective GAN model, leaving out the test subject on which it is then tested. The results averaged over all subjects are then compared to the original WESAD LOSO result. Private TSTR models can use our already privatized DP-CGAN data in normal training.
Synthetic Data Augmentation (AUGM). The AUGM strategy focuses on enlarging the original WESAD dataset with synthetic data. For each LOSO run of a WESAD subject, we combine the respective original training data and our LOSO-conform GAN data in differing amounts. As before in TSTR, we consider 15 and 100 synthetic subjects. Testing is also performed in the LOSO format. With this setup, we evaluate whether adding more subjects, even though synthetic and of the same nature, helps the classification. Private training in this scenario takes the privatized DP-CGAN data but also has to consider the not-yet-private original WESAD data it is combined with. Therefore, the private AUGM models still undergo a DP-SGD training process to guarantee DP. A schematic of both strategies is sketched after this list.
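The following minimal sketch contrasts how the two strategies assemble their training sets; the strategy names mirror the text, everything else is illustrative.

```python
def assemble_training_set(real_subjects, synthetic_subjects, strategy):
    """Build the LOSO training data for one run.

    TSTR trains on synthetic data only; AUGM enlarges the real
    training subjects with LOSO-conform synthetic subjects.
    """
    if strategy == "TSTR":
        return list(synthetic_subjects)
    if strategy == "AUGM":
        return list(real_subjects) + list(synthetic_subjects)
    raise ValueError(f"unknown strategy: {strategy}")
```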
4.7. Stress Classifiers
In the following section, we present the tested classifier architectures and their needed pre-processing.
4.7.1. Pre-Processing for Classification
After already pre-processing our WESAD data for GAN training, as described in Section 4.3, we now need the aforementioned further processing steps from Gil-Martin et al. [22] to transform our training data into the correct shape for posing as inputs to our classification models. The 60-s windows from Section 4.3 are present in both the WESAD and the synthetically generated data. The only difference between the two is that we do not apply the 30-s sliding window to the original WESAD data as we did before for GAN training.
In the next step, we want to convert each window into a frequency-dependent representation using the Fast Fourier Transform (FFT). The FFT is an efficient algorithm for computing the Fourier transform, which converts a time-dependent signal into the frequency components that constitute the original signal; the windows thus become frequency spectra. However, before applying the FFT, we further partition the 60-s windows into subwindows of varying lengths based on the signal type. For these subwindows, we implement a sliding window of 0.25 s. The varying lengths of the subwindows are due to the distinct frequency spectrum characteristics of each signal type. We modify the subwindow length based on a signal’s frequency range to achieve a consistent spectrum shape comprising 210 frequency points.
Gil-Martin et al. [22] provide each signal’s frequency range and give the corresponding subwindow length, as shown in Table 2. The subwindow lengths are chosen to always result in the desired 210 data points when multiplied by the upper bound of the frequency range, which is the input size for the classification models. For our GAN-generated data, an important intermediate step is to pad the FFT subwindows with additional zeroes where needed to reach the desired 210 points, avoiding possible errors from missing frequencies in the higher ranges. The frequency spectra are then averaged along all subwindows inside a 60-s window to finally obtain a single averaged spectrum representation with 210 frequency points per 60-s window. We plot the spectrum results for the subwindows of a 60-s window in Figure 4a and show their final averaged spectrum representation in Figure 4b. Higher amplitudes are more present in the lower frequencies.
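The per-signal spectrum computation could be sketched as follows, with the subwindow length taken from Table 2 for the respective signal; padding and averaging follow the description above, while the parameter names are ours.

```python
import numpy as np

def averaged_spectrum(signal, fs, sub_len_s, n_points=210, stride_s=0.25):
    """FFT every subwindow of a 60-s window, zero-pad each magnitude
    spectrum to 210 frequency points, and average the spectra."""
    sub_len = max(1, int(sub_len_s * fs))
    stride = max(1, int(stride_s * fs))
    spectra = []
    for start in range(0, len(signal) - sub_len + 1, stride):
        magnitude = np.abs(np.fft.rfft(signal[start:start + sub_len]))
        padded = np.zeros(n_points)
        n = min(n_points, len(magnitude))
        padded[:n] = magnitude[:n]  # pad (or cut) to 210 points
        spectra.append(padded)
    return np.mean(spectra, axis=0)  # one 210-point spectrum per window
```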
4.7.2. Time-Series Classification Transformer
As our first classification model, we pick the Time-Series Classification Transformer (TSCT) from Lange et al. [19], which provides the only related-work comparison for privacy-preserving stress detection, as described in Section 3. The model is, however, unable to reach the best state-of-the-art results in the non-private setting. In their work, the authors argue that the transformer model could drastically benefit from more training samples, like our synthetic data. In our implementation, we use class weights and train for 110 epochs with a batch size of 50 using the Adam optimizer at a tuned learning rate.
4.7.3. Convolutional Neural Network
The Convolutional Neural Network (CNN) is the currently best-performing model in the non-private setting, presented by Gil-Martin et al. [22]. We include their model in our evaluations to see if it keeps the top spot. We mostly keep the setup of the TSCT in terms of hyperparameters but train the CNN for just 10 epochs.
4.7.4. Hybrid Convolutional Neural Network
As the final architecture, we consider a hybrid LSTM-CNN model, for which we take the same CNN architecture but add two Long Short-Term Memory (LSTM) layers of sizes 128 and 64 between the convolutional part and the dense layers. Through these additions, we want to combine the advantages of the state-of-the-art CNN with the LSTM’s ability to recognize temporal correlations in the time series. For the hyperparameters, we keep the same setup as for the standard CNN but increase the training time to 20 epochs.
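A Keras sketch of such a hybrid is given below; only the two LSTM sizes (128 and 64) and their placement come from the text, while the convolutional blocks and the input shape of 210 frequency points across six signal channels are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_lstm(input_shape=(210, 6), n_classes=2):
    """Hybrid sketch: convolutional blocks, then LSTM(128) and LSTM(64)
    between the convolutional part and the dense layers."""
    model = tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(64, 5, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.LSTM(128, return_sequences=True),  # added temporal layers
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```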
4.8. Private Training
In this section, we go over the necessary steps and parameters to follow our privacy implementation. We first focus on the training of our private DP-CGANs and then follow with the private training of our classification models.
We want to evaluate three DP guarantees that represent different levels of privacy. The first budget is a more relaxed although still private setting. The second and third options are significantly stricter in their guarantees; the second budget is already considered strong in the literature [24,25,26], making the third setting a very strict guarantee. A smaller privacy budget leads to higher induced noise during training and therefore a higher utility loss. We test all three values to see how the models react to the different amounts of randomness and privacy.
4.8.1. For Generative Models
We already described our private DP-CGAN models in Section 4.4 and now offer further details on how we choose the hyperparameters relevant to their private training. The noise induced at every training step needs to be calculated depending on the wanted DP guarantee and under consideration of the training setup. We switch to a tuned learning rate, set the epochs to 420, and take a batch size of 8, which is also our number of microbatches. Next, we determine the number of samples in the training dataset, which for us is the number of windows. By applying a 30-s sliding window over the 60-s windows when preparing the WESAD dataset for our GANs, we technically double our training data. Since subjects offer differing numbers of training windows, the total number of windows for each LOSO run depends on the current test subject, with the sliding windows roughly doubling the count in each case. We thus see this doubled count as the number of windows for each DP-CGAN after leaving a test subject out for LOSO training. The number of unique windows, on the other hand, stays at the lower, non-overlapping count, since the overlapping windows from sliding do not include new unique data points but instead just resample the points already included in the original 60-s windows. Thus, the original data points are only duplicated into the created intermediate sliding windows, meaning they are not unique anymore. To resolve this issue, we calculate the noise using the unique training set size. We, however, take twice the number of epochs, which translates to seeing each unique data point twice during training and accounts for our increased sampling probability for each data point. We subsequently choose $\delta < \frac{1}{n}$, with $n$ as the training set size, according to [16], and use a fixed gradient norm clip.
4.8.2. For Classification Models
When training our three different classification models in the privacy-preserving setting, we only need to apply DP when including original WESAD data, since the DP-CGANs already produce private synthetic data. In these cases, we mostly keep the same training hyperparameters as before. We, however, exchange the Adam optimizer for the DP-Adam optimizer from the TensorFlow Privacy library at the same learning rate, which is an Adam version of DP-SGD. Regarding the DP noise, we calculate the needed amount conforming to the wanted guarantee before training. We already know the number of epochs and the batch size, which we also set as the number of microbatches. We, however, also have to consider other relevant parameters. The needed noise depends on the number of training samples, which for us is the number of windows. Since we do not use the 30-s sliding windows when training classifiers on the original WESAD data, all windows are unique. The number of remaining windows when omitting a test subject for LOSO training again determines our choice of $\delta < \frac{1}{n}$ according to [16]. We finally apply a fixed gradient norm clip.
6. Discussion
The CGAN wins over the DGAN in our usefulness evaluation on an actual stress detection task, conducted in Section 5.2.2. In non-private classification, we are, however, still unable to match the state-of-the-art results listed in Table 1 with just our synthetic CGAN data. In contrast, we are able to surpass them slightly by +0.45% at a 93.01% F1-score when combining the synthetic and original data in our AUGM setup using a CNN-LSTM. The TSCT model generally tends to underperform, while the performance of the CNN and CNN-LSTM models fluctuates, with each model outperforming the other depending on the specific setting. Our private classification models, which work best when only using synthetic data from DP-CGANs in the TSTR setting, show a favorable utility–privacy trade-off by keeping high performance at all privacy levels. With an F1-score of 84.19% at our strictest budget, our most private model still delivers usable performance with a loss of just 8.82% compared to the best non-private model, while also offering a very strict privacy guarantee. Compared to other private models from the related work presented in Table 1, we are able to give a substantial improvement in utility, ranging from +11.90% at the most relaxed privacy level through +14.10% to +15.48% at the strictest level in terms of F1-score. The related work on private stress detection further indicates a large number of failing models due to increasing noise when training with strict DP budgets [19]. We did not encounter any such failing models when using our strategies supported by GAN data, making synthetic data a feasible solution to this problem. Our overall results in the privacy-preserving domain indicate that creating private synthetic data using DP-GANs before the actual training of a stress classifier is more effective than applying DP later in its training. Using only already-privatized synthetic data is favorable because GANs seem to cope better with the induced DP noise than the classification model itself.
In relation to our baseline results in Section 5.2.1, our method demonstrates a significant performance boost and the advantage of making feature selection obsolete. Without additional GAN data, our non-private deep learning model delivers 86.48%, surpassing the baseline by 5.1%. The best non-private model incorporating synthetic data exhibits an even more substantial increase, outperforming the baseline by 11.63%. Moreover, even our most private model still manages to outperform the best LR model by 2.81%. Overall, the deep learning approach, particularly when augmented with GAN data, proves superior to the baseline LR model.
Until now, we have only considered the overall average performance from our LOSO evaluation runs; it is, however, also interesting to take a closer look at the actual per-subject results. In this way, we can identify whether our synthetic data just boost the already well-recognized subjects or also enable better results for the otherwise poorly classified and thereby underrepresented subjects. In our results on the original WESAD data, we see that Subject ID14 and ID17 from the WESAD dataset are the hardest to classify correctly. In Table 5, we therefore give a concise overview of the results for the LOSO runs with Subject ID14 and ID17 as our test sets. We include the F1-scores delivered by our best synthetically enhanced models at each privacy level and compare them to the best result on the original WESAD data, as found in Table 4. We can see that our added synthetic data mostly allow for better generalization and improve the classification of difficult subjects. Even two of our DP-CGANs, which are subject to a utility loss from DP, display increased scores. The remaining DP-CGAN, however, struggles on Subject ID14. A complete rundown of each subject-based result for the selected models is given in Table A1 of Appendix A. The key insights from the full overview are that our GANs mostly facilitate enhancements on challenging subjects. However, especially the non-private GANs somewhat equalize the performance across all subjects, which also leads to a decrease in performance on less challenging subjects. In contrast, private DP-CGANs tend to exhibit considerable differences between subjects, excelling in some while falling short in others. The observed inconsistency is linked to the DP-CGANs’ struggle to correctly learn the full distribution, a challenge exacerbated by the noise introduced through DP. Such inconsistencies may pose a potential constraint on the actual performance of our DP-CGANs on specific subjects.
While improving the classification task is our main objective, we also consider the quality of our synthetic data in Section 5.1. The CGAN generates the best data for our use case, comparable to the original dataset in all data quality tests, while also performing best in classification. The DGAN achieves good results for most tested qualities but stays slightly behind the CGAN in all aspects and performs especially weakly in our indistinguishability test. We notice progressively reduced data quality with stricter DP guarantees in our DP-CGANs but still see huge improvements in utility for our private classification. Considering the benefits and limitations, the CGAN could potentially generate a dataset that closely approximates the original, offering a viable extension or alternative to the small WESAD dataset. The DP-CGANs, on the other hand, show their advantages only in classification; considering their added privacy attributes, however, the resulting data quality trade-off could still be tolerable depending on what the synthetic data are used for. The private data are shown to still be feasible for our use case of stress detection. For usage in applications outside of stress classification, e.g., other analyses in clinical or similarly critical settings, the DP-CGAN data might already be too inaccurate.
Beyond the aforementioned points, our synthetic data approach, to a certain extent, inherits the limitations found in the original dataset it was trained on. Consequently, we encounter the same challenges that are inherent in the WESAD data. These include a small number of subjects, an uneven distribution of gender and age, and the specific characteristics of the study itself, such as the particular method used to trigger stress moments. With such small datasets, training GANs carries the risk of overfitting. However, we have mitigated this risk through the use of LOSO cross-validation. Further, as demonstrated in Table 5, our GANs have proven capable of enhancing performance on subjects who are underrepresented in earlier classification models. Nevertheless, questions remain regarding the generalizability of our stress classifiers to, e.g., subjects with other stressor profiles, and the extent to which our GANs can help overcome the inherent shortcomings of the original WESAD dataset.
7. Conclusions
We present an approach for generating synthetic health sensor data to improve stress detection in wrist-worn wearables, applicable in both non-private and private training scenarios. Our models generate multimodal time-series sequences based on original data, encompassing both stress and non-stress periods. This allows for the substitution or augmentation of the original dataset when implementing machine learning algorithms. Given the significant privacy concerns associated with personal health data, our DP-compliant GAN models facilitate the creation of privatized data at various privacy levels, enabling privacy-aware usage. While our non-private classification results show only slight improvements over current state-of-the-art methods, our approach of including private synthetic data generation effectively manages the utility–privacy trade-offs inherent in DP training for privacy-preserving stress detection. We significantly improve upon the results found in related work, maintaining usable performance levels while ensuring privacy through strict DP budgets. Compared to the basic anonymization of metadata currently applied to smartwatch health data in practice, DP offers a provable privacy guarantee for each individual. This not only facilitates the development and deployment of accurate models across diverse user groups but also enhances research capabilities through the increased availability of public data. However, the generalizability of our classifiers to subject data with differing stressors, and the potential enhancement of these capabilities through our synthetic data, remain uncertain without additional public data for evaluation.
Our work sets the stage for how personal health data can be utilized in a secure and ethical manner. The exploration of fully private synthetic data as a viable replacement for real datasets, while maintaining utility, represents a promising direction for making the benefits of big data accessible without compromising individual privacy.
Looking ahead, the potential applications of our synthetic data generation techniques may extend beyond stress detection. They could be adapted for other health monitoring tasks such as heart rate variability, sleep quality assessment, or physical activity recognition, where privacy concerns are similarly demanding. Moreover, the integration of our synthetic data approach with other types of wearable sensors could open new avenues for comprehensive health monitoring systems that respect user privacy. Future work could also explore the scalability of our methods in larger, more diverse populations to further validate the robustness and applicability of the generated synthetic data.