Article
Peer-Review Record

Determining the Optimal Window Duration to Enhance Emotion Recognition Based on Galvanic Skin Response and Photoplethysmography Signals

Electronics 2024, 13(16), 3333; https://doi.org/10.3390/electronics13163333
by Marcos F. Bamonte 1,*, Marcelo Risk 2 and Victor Herrero 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 27 June 2024 / Revised: 16 August 2024 / Accepted: 20 August 2024 / Published: 22 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1) If the approach is subject-dependent, its performance can be good or bad, depending on the user. Just showing the averaged results is not enough to see the usefulness of the proposed method. Especially in Table 5, it is better to add the standard deviations of all methods. For future research, you may identify the subject first and then use a model customized for each user (or a group of users) for emotion classification. This can improve the accuracy further.

2) From Figure 3, we can see the "optimal" window sizes are 11 and 16 seconds (for arousal and valence, respectively). The change in emotion during data collection can be the main reason large window sizes are not good. In this sense, the results shown in Figure 3 may be the longest window sizes available for obtaining the best classification performance.  It is important to check the (average) durations of different emotions to improve the classification performance.

3) About Table 6, the computational cost is divided into High and Low without detailed information. How do you judge a method as high or low? Can you provide information like a big-O notation? In practice, if we can implement a method in real-time (i.e. recognize an emotion immediately) using the available computing resources (e.g. a normal PC with a reasonable price), we can say that this method is "low" and useful. In this sense, the method proposed in [15] can be the state of the art. Please provide detailed information and reasons to support the proposed method (i.e. RF with properly selected window durations).

Author Response

Comments 1: If the approach is subject-dependent, its performance can be good or bad, depending on the user. Just showing the averaged results is not enough to see the usefulness of the proposed method. Especially in Table 5, it is better to add the standard deviations of all methods. For future research, you may identify the subject first and then use a model customized for each user (or a group of users) for emotion classification. This can improve the accuracy further.

 

Response 1: Thank you for pointing this out. I agree with this comment. I have therefore added the standard deviations for all methods to Table 5 (page 8).

 

***

 

Comments 2: From Figure 3, we can see the "optimal" window sizes are 11 and 16 seconds (for arousal and valence, respectively). The change in emotion during data collection can be the main reason large window sizes are not good. In this sense, the results shown in Figure 3 may be the longest window sizes available for obtaining the best classification performance.  It is important to check the (average) durations of different emotions to improve the classification performance.

 

Response 2: Thank you for pointing this out. I agree with this comment. 

Studies on the effects of emotion on autonomic nervous system activity typically cover periods ranging from 0.5 to 300 seconds. Modern datasets using videos as the eliciting stimulus have employed validated videos with durations of 50 to 180 seconds. Although the duration of emotions is highly variable and can range from a few seconds to hours (depending on factors such as the subject, the emotion-eliciting situation, the intensity of the emotion at onset, and the format of the response scale), it is established in the literature that Event-Related Skin Conductance Responses (ER-SCR) typically take between 2 and 6 seconds to reach their peak following stimulus onset. Additionally, SCR recovery times (i.e., the duration from peak to full recovery) can vary, resulting in a total SCR duration ranging from 6 to 20 seconds. However, we were unable to find studies measuring the time response from stimulus onset to significant changes in the PPG waveform. Despite the lack of specific data on PPG response times, the typical durations observed in SCR may be related to the optimal window duration results obtained in this study.

We have added this response along with additional citations in the Discussion section of the manuscript. The changes can be found on page 10, lines 300-308, and page 12, lines 309-316.

 

***

 

Comments 3: About Table 6, the computational cost is divided into High and Low without detailed information. How do you judge a method as high or low? Can you provide information like a big-O notation? In practice, if we can implement a method in real-time (i.e. recognize an emotion immediately) using the available computing resources (e.g. a normal PC with a reasonable price), we can say that this method is "low" and useful. In this sense, the method proposed in [15] can be the state of the art. Please provide detailed information and reasons to support the proposed method (i.e. RF with properly selected window durations).

 

Response 3:  Thank you for pointing this out. I agree with this comment. 

 

The computational cost of the algorithms listed in Table 6 varies significantly depending on several factors, including the number of features, training samples, and the employed kernel functions. Additionally, training cost differs from inference cost. Since we cannot calculate the cost for the studies mentioned, we decided to remove the 'Computation Cost' column from the table. This change can be found on page 12, Table 6.

 

Regarding the last sentence of the comment, the support for Random Forest (RF) is based solely on empirical evidence, as demonstrated by the results. We have included the window sizes and overlap percentages that resulted in the best RF performance in the Results section to make this more explicit. This change can be found on page 9, lines 249-250.

Reviewer 2 Report

Comments and Suggestions for Authors

This work presents the results of a window sensitivity analysis on a continuous annotation dataset for automatic emotion recognition. Galvanic Skin Response and Photoplethysmography data are used together with non-linear features and common machine learning classifiers such as Random Forest and Support Vector Machines for emotion (valence and arousal) level classification. The manuscript is clearly organized. However, some critical issues exist.

First of all, the research gap is not clearly identified. It appears that similar work has already been conducted, and it is not clear why the presented work is important.

Second and most significantly, the influence of window settings on the reported results could be an artifact of how the authors were organizing the datasets rather than due to the intrinsic connection between window settings and machine learning techniques. 

Serious revisions and justifications are needed for this work to be deemed solid for publication. Please see the following for more details.

 

Line 37-38: “On the other hand, PPG sensors indirectly measure heart rate and other associated metrics…” Please specify the other associated metrics.

 

Line 51: “…trials to familiarize with the system, baseline recordings, stimulus presentation, and annotations.” Please specify how the annotations are done. Were they obtained from the feedback of the participants or in some other way?

 

Lines 62-96. Critical details are missing when presenting related literature. What are the limitations of the related work? What is the research gap? Why is this proposed work needed?

 

Line 134: “We replaced each segment with its median value.” Why not use a standard majority? Isn’t it always more representative than the median in this context?

 

Line 153: “To handle imbalanced data, all the data and its corresponding labels were split into…” How is splitting data into training and testing able to handle imbalanced data?

 

Following the previous comment, please report the label distributions of the dataset.

 

Equations 1 and 2. If Eq. (2) is performed, then there is no need for Eq. (1).

 

Figure 2. The figure is blurry; please use high-resolution or vector graphics for publication.

 

Figure 3. How does the percentage of the ground truth (GT) label (i.e., the count of the median label divided by the total length of the window) change over different window sizes and overlap ratios? It is important to show this to decide whether the accuracy change is correlated with the dominance of the GT label within the window. Please show the average percentage (dominance) of the GT label together with its standard deviation range against window sizes and overlap ratios.

 

Lines 239-241: “These effects could be due to the reinforcement in the model’s training caused by the overlap, which allows capturing emotional information that would otherwise be missed in the no-overlap situation with the same window duration.” Why would there be a reinforcement in the model’s training caused by the overlap? All the employed classifiers are single-frame and thus should not care how frames are connected. If the number of available data segments changes with different window sizes and overlap ratios, then it is possible. However, in this case, it is the data volume that matters, not really the window sizes or overlap ratios. Serious evidence and discussion need to be provided to support the significance of the presented work.

 

Author Response

Comments 1: First of all, the research gap is not clearly identified. It appears that similar work has already been conducted, and it is not clear why the presented work is important.

 

Response 1: I agree. I have, accordingly, modified the manuscript to emphasize this point. We added critical details about the related works and their limitations.

 

All of the related works employ discrete labeling methods. Although some used windowing to increase the number of samples and enhance recognition performance, they often reused the same labels due to the discontinuous annotation paradigm of the datasets (e.g., multiple uses of a single label from a 60-second video clip). As a result, local emotional changes are not captured, because participants can only rate their emotions at specific intervals. Therefore, a genuine windowing method based solely on GSR and PPG signals is still missing from the literature. This study addresses this gap by employing a continuous, real-time annotation dataset.

 

Changes can be found on page 2, lines 69-71, 75-78, and on page 3, lines 84-90, 97, 98-101, 110-112,  115-122, 124-127.

 

***

 

Comments 2: Second and most significantly, the influence of window settings on the reported results could be an artifact of how the authors were organizing the datasets rather than due to the intrinsic connection between window settings and machine learning techniques. 

 

Response 2: I agree. To handle imbalanced data and avoid artifacts that may arise from the way the training and test datasets are organized, we applied a stratified K-Fold cross-validation strategy with 10 folds. We computed metrics on each hold-out test set and averaged them over the 10 folds to determine each participant's emotion recognition performance. We believe this method allows us to demonstrate the intrinsic relationship between the window settings and the machine learning techniques.
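As an illustration of this evaluation scheme, here is a minimal sketch assuming scikit-learn, a per-participant feature matrix X, and its labels y (the variable names and classifier settings are illustrative, not the exact code used in the study):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def participant_accuracy(X, y, n_splits=10, seed=0):
    """Stratified 10-fold CV: every fold preserves the class proportions of y,
    and the participant's score is the mean accuracy over the 10 hold-out folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        fold_scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return np.mean(fold_scores)
```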

Accordingly, I have changed the first paragraph of the Data Splitting section. This change can be found on page 6, lines 194-199.

 

***

 

Comments 3: Line 37-38: “On the other hand, PPG sensors indirectly measure heart rate and other associated metrics…” Please specify the other associated metrics.

 

Response 3: Thank you for pointing this out. I agree with this comment. Therefore, I have added some metrics as examples: the average of the inter-beat intervals (IBI), the standard deviation of the inter-beat (NN) intervals (SDNN), and the root mean square of successive differences (RMSSD). This change can be found on page 2, lines 38-39.
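For reference, these metrics follow directly from the inter-beat interval series derived from the PPG waveform; a brief sketch, assuming the intervals are given in milliseconds (purely illustrative):

```python
import numpy as np

def time_domain_hrv(ibi_ms):
    """Basic time-domain metrics from an inter-beat-interval series (ms)."""
    ibi = np.asarray(ibi_ms, dtype=float)
    mean_ibi = ibi.mean()                            # average inter-beat interval
    sdnn = ibi.std(ddof=1)                           # SD of the intervals (SDNN)
    rmssd = np.sqrt(np.mean(np.diff(ibi) ** 2))      # RMS of successive differences (RMSSD)
    return mean_ibi, sdnn, rmssd
```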

 

***

 

Comments 4: Line 51: “…trials to familiarize with the system, baseline recordings, stimulus presentation, and annotations.” Please specify how the annotations are done. Were they obtained from the feedback of the participants or in some other way?

 

Response 4: Thank you for pointing this out. I agree with this comment. Therefore, I have made it more explicit that annotations are made subjectively by the participants themselves. This change can be found on page 2, lines 44,  51-52, and page 4, line 141.

 

***

 

Comments 5: Lines 62-96. Critical details are missing when presenting related literature. What are the limitations of the related work? What is the research gap? Why is this proposed work needed?

 

Response 5: Please refer to Response 1. We have unified the responses to Comments 1 and 5. 

 

***

 

Comments 6:  Line 134: “We replaced each segment with its median value.” Why not use a standard majority? Isn’t it always more representative than the median in this context?

 

Response 6: Thank you for pointing this out. We didn’t use a standard majority because there are cases where there is no clear majority, as shown in Figure 1c, where there are no horizontal segments. In such situations, a standard majority method would not yield a good result and would require discarding an instance, which is not desirable.

On the other hand, when there are several different short horizontal lines, as in Figure 1b, both the standard majority and the median yield very similar results. For these reasons, we believe that the median better represents the different situations.
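A small hypothetical example of the difference (the label sequence is invented for illustration only):

```python
from collections import Counter
import numpy as np

labels = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])       # drifting annotation: no clear majority
median_label = int(np.median(labels))                 # -> 5, still a usable central label

majority_label, count = Counter(labels.tolist()).most_common(1)[0]
has_majority = count / len(labels) > 0.5              # -> False: a majority vote would force
                                                      #    this window to be discarded
```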

We have made this point more explicit in the manuscript. The changes can be found on page 5, lines 172-175, 177, 183-184, and 185-187.

 

***

Comments 7: Line 153: “To handle imbalanced data, all the data and its corresponding labels were split into…” How is splitting data into training and testing able to handle imbalanced data?

 

Response 7: Thank you for pointing this out. I agree with this comment. 

 

To handle imbalanced data, we applied a stratified K-Fold cross-validation strategy with 10 folds (see Response 2). Accordingly, I have changed the first paragraph of the Data Splitting section. This change can be found on page 6, lines 194-199.

 

***

Comments 8:  Following the previous comment, please report the label distributions of the dataset.

 

Response 8: Thank you for pointing this out. I have accordingly added a scatter plot showing the label distribution. The figure can be found on page 4. Additionally, I have added a reference to the figure in line 142.



***

Comments 9: Equations 1 and 2. If Eq. (2) is performed, then there is no need for Eq. (1).

 

Response 9: Thank you for pointing this out. We made an error in the manuscript; we standardized the extracted features, not the signals. Therefore, I have removed Equation (2) from the Data Splitting section (Section 2.4) and added it to the Feature Extraction section (Section 2.5). The changes can be found on page 4, line 158, and page 7, lines 216-219.



***

Comments 10: Figure 2. The figure is blurry; please use high-resolution or vector graphics for publication.

 

Response 10: I agree. Thank you for pointing this out. We replaced the original figure with a high-resolution version. It can be found on page 7. 

 

***

 

Comments 11: Figure 3. How does the percentage of the ground truth (GT) label (i.e., the count of the median label divided by the total length of the window) change over different window sizes and overlap ratios? It is important to show this to decide whether the accuracy change is correlated with the dominance of the GT label within the window. Please show the average percentage (dominance) of the GT label together with its standard deviation range against window sizes and overlap ratios.

 

Response 11: Thank you for your question. I am seeking clarification on the calculation of the percentage of the ground truth label. Specifically, is it calculated as the percentage of data within a window where the labels match the median label? I would appreciate any further details you can provide on this matter.

 

Percentage of GT label = (Number of occurrences of the median label within the window) / (Total number of elements in the window) 
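In code form, a hypothetical window of annotation samples to illustrate the intended computation (values are made up):

```python
import numpy as np

window_labels = np.array([4, 4, 5, 5, 5, 6, 5, 5])   # annotation samples inside one window
gt_label = np.median(window_labels)                   # median label used as the window's GT
pct_gt = np.mean(window_labels == gt_label)           # -> 0.625, fraction matching the GT label
```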

 

***

 

Comments 12: Lines 239-241: “These effects could be due to the reinforcement in the model’s training caused by the overlap, which allows capturing emotional information that would otherwise be missed in the no-overlap situation with the same window duration.” Why would there be a reinforcement in the model’s training caused by the overlap? All the employed classifiers are single-frame and thus should not care how frames are connected. If the number of available data segments changes with different window sizes and overlap ratios, then it is possible. However, in this case, it is the data volume that matters, not really the window sizes or overlap ratios. Serious evidence and discussion need to be provided to support the significance of the presented work.

 

Response 12: Thank you for pointing this out. I agree with this comment; our manuscript was not sufficiently clear on this point.

The classification algorithms mentioned require a single label value per window (i.e., per instance), which can 'filter' local label fluctuations in specific parts of the window when annotations are continuous. However, these local fluctuations can be captured by an adjacent overlapping window, provided that an appropriate window size and overlap stride are chosen. In this way, the overlap 'reinforces' the learning. Without overlap, these local label variations might be 'missed' or filtered out, resulting in a loss of emotional information.
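To make this mechanism concrete, here is a sketch of overlapping segmentation with median labels (the window duration, overlap, and names are illustrative assumptions, not the exact implementation):

```python
import numpy as np

def segment(signal, labels, fs, win_s=3.0, overlap=0.5):
    """Slice a signal and its continuous annotation into overlapping windows,
    labelling each window with the median of its annotation samples."""
    win = int(win_s * fs)
    step = max(1, int(win * (1.0 - overlap)))         # 50% overlap -> step of half a window
    segments, seg_labels = [], []
    for start in range(0, len(signal) - win + 1, step):
        segments.append(signal[start:start + win])
        seg_labels.append(np.median(labels[start:start + win]))
    # A brief fluctuation near the end of one window may be filtered out by that
    # window's median, yet dominate the next, overlapping window and still be learned.
    return np.array(segments), np.array(seg_labels)
```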

Therefore, I have added this explanation and a figure to the manuscript to clarify this point. These changes can be found on page 10, lines 286-299, and in Figure 7 (page 11).

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The replies to my comments are satisfactory. I have only one more comment. Please use (as far as possible) font sizes larger than 10 for characters (axis labels, legends, etc.) in the figures.

Author Response

I agree. Thank you for pointing this out. Therefore, I have increased the font size of Figures 2 to 8.

The changes can be found on page 5 (Figure 2), page 6 (Figure 3), page 10 (Figures 5 and 6), page 11 (Figures 7 and 8), page 12 (Figures 9 and 10). 

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for the revision and response. Many of the reviewer’s previous comments have been sufficiently addressed. However, there are still a few critical matters requiring further attention.

 

Following the previous comment on sufficiently identifying the research gap. After this round of revision, it is clear why a continuously annotated dataset can be helpful. However, this dataset is not part of this work’s contribution. The main contribution of this work is the sensitivity analysis of the window size, which is still not fully motivated. It is a common operation in time series analysis to find a proper window size for data discretization and feature generation. So, how do these workers select their window sizes? Why can’t the authors use window sizes similar to those of their peers? Why is it particularly important to perform such window size sensitivity analysis? These questions are important but not yet clear regarding the proposed objective of this work.

 

Following the previous comment on using the median vs. the majority as the label for a window. The reviewer believes that simply using the median is inappropriate and could lead to serious issues. For example, imagine there are three classes within a window: A (49%), B (2%), C (49%). Simply using the median as the label would make B the label for this window. This is clearly problematic and would seriously confuse the classifier during training. If no majority (by whatever definition of the majority, e.g., percentage of appearance > 60% within a window) exists within a specific window, then this instance of the window should be excluded from training and testing, because the label simply cannot represent the data. It should be the job of the window sensitivity analysis to decide how best to set the window size and overlap to reduce the number of “bad” window instances that cannot have a representative label.

 

Figure 1. Thank you for adding the scatter plot. Please also report the number of points for each class.

 

Figure 4. Following the previous comment on presenting the change in the ground truth (GT) label percentage over different window sizes and overlap ratios. Let’s define the percentage or Dominance of the GT label within a window instance as:

Dominance (GT label) = (Number of data points that have the GT label within the window) / (Total number of data points within the window, i.e., window duration * sample frequency)

Then, for each window size and overlap ratio setting, please report the mean and standard deviation of the Dominance of all window instances. Meanwhile, please also report the number of window instances for each setting. These are important metrics associated with window-related analysis.

 

Author Response

Comments 1: Following the previous comment on sufficiently identifying the research gap. After this round of revision, it is clear why a continuously annotated dataset can be helpful. However, this dataset is not part of this work’s contribution. The main contribution of this work is the sensitivity analysis of the window size, which is still not fully motivated. It is a common operation in time series analysis to find a proper window size for data discretization and feature generation. So, how do these workers select their window sizes? Why can’t the authors use window sizes similar to those of their peers? Why is it particularly important to perform such window size sensitivity analysis? These questions are important but not yet clear regarding the proposed objective of this work.

 

Response 1:  I agree. Thank you for pointing this out. 

Regarding this question: “Why is it particularly important to perform such window size sensitivity analysis?”: 

This sensitivity study is relevant for determining the optimal window duration and percentage of overlap that best capture elicited emotions at the precise moment the individual feels them and within the appropriate time interval. It is important to note that several factors justify this approach:

  1. The duration of emotions is highly variable, and therefore, the optimal window duration strongly depends on how the individual labels the elicited emotions.
  2. Within a given window, the individual can significantly change the label (because it is continuous). Therefore, small window sizes can potentially capture 'local variations' but are more susceptible to capturing label artifacts, which may confuse the classifier. On the other hand, large window sizes combined with an appropriate metric condensing the participant's annotation of the entire segment can filter out these label artifacts and capture more emotional information. However, this metric may be unsuitable if label fluctuations are significant. An optimal window duration can both filter out label artifacts and keep the metric representative of the labels within the segment.
  3. The representative metric of a consecutive overlapped window can better capture local emotional fluctuations that might be filtered out by the metric of the previous window (e.g., if the fluctuations occur at the end of the previous window).

These changes can be found on page 3, lines 127-135, and page 4, lines 136-145.

 

Regarding these questions: “how do these workers select their window sizes? Why can’t the authors use window sizes similar to those of their peers? ”: 

We selected these ranges based on the works of Kreibig et al., Ayata et al., and Zhang et al. (2021). Kreibig et al. reported several studies employing films as the eliciting stimulus with window durations ranging from 0.5 to 120 seconds. Ayata et al. utilized window sizes from 1 to 30 seconds, but with non-uniform steps and decreasing accuracy for window sizes greater than 30 seconds. Zhang et al. (2021) employed window durations ranging from 0.125 to 16 seconds, with non-uniform steps, and reported poor performance for window sizes shorter than one second. Therefore, we chose window durations ranging from 1 to 29 seconds with a uniform step to facilitate the identification of the optimal window size.

 

These changes can be found on page 5, lines 183-189.

                                                ***

Comments 2: Following the previous comment on using the median vs. the majority as the label for a window. The reviewer believes that simply using the median is inappropriate and could lead to serious issues. For example, imagine there are three classes within a window: A (49%), B (2%), C (49%). Simply using the median as the label would make B the label for this window. This is clearly problematic and would seriously confuse the classifier during training. If no majority (by whatever definition of the majority, e.g., percentage of appearance > 60% within a window) exists within a specific window, then this instance of the window should be excluded from training and testing, because the label simply cannot represent the data. It should be the job of the window sensitivity analysis to decide how best to set the window size and overlap to reduce the number of “bad” window instances that cannot have a representative label.

Response 2: I agree. Thank you for pointing this out. 

We replaced each segment with its median value because this method avoids discarding segments, which was a premise of this work, given the low number of data samples available (e.g., fewer than 250 instances for a window size of 5 seconds and 0% overlap). Moreover, we found it more convenient not to discard segments a priori, so that the window sensitivity analysis could be performed for all the proposed window sizes. To ensure that the median is representative of the annotation segment, we chose a minimum dominance threshold for the ground-truth (GT) label, D_GT,min, of 50%. Once we determined how the GT label dominance D_GT changes across the different window sizes and overlap ratios, we only considered those window segments that met the criterion D_GT - σ(D_GT) ≥ D_GT,min, based on a worst-case scenario (note that σ(D_GT) is the standard deviation of D_GT). Therefore, we integrated this criterion into our method and modified the resulting optimal window sizes and performances for valence and arousal accordingly in the manuscript.
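One possible reading of this worst-case criterion, applied per (window size, overlap) setting and assuming the per-window dominance values have already been computed (the function and the 0.5 default are a sketch, not the authors' code):

```python
import numpy as np

def setting_passes(dominances, d_gt_min=0.5):
    """Accept a (window size, overlap) setting only if, in the worst case
    (mean dominance minus one standard deviation), the ground-truth label
    still covers at least d_gt_min of a window."""
    d_mean = float(np.mean(dominances))
    d_std = float(np.std(dominances))
    return (d_mean - d_std) >= d_gt_min
```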

These changes can be found on page 1, lines 9,10,11; page 5, lines 195-199; page 6, lines 200-207, 209-210, 213-215, 220-222, 224, 225, page 9, lines 287-292; page 10, Figures 5-6, Table 5 and lines 302-304; page 11, Figure 7 and 8, and lines 314, 318; page 12, lines 326-327 and Figure 9; page 13, lines 344-351; page 15, lines 416-425 and 441. 

                                                ***

Comments 3: Figure 1. Thank you for adding the scatter plot. Please also report the number of points for each class.

Response 3:  I agree. Thank you for pointing this out. We have added a figure with the number of points for each class. 

Changes can be found on page 4, lines 160-161, and page 5, Figure 2. 

                                                ***

Comments 4: Figure 4. Following the previous comment on presenting the change in the ground truth (GT) label percentage over different window sizes and overlap ratios. Let’s define the percentage or Dominance of the GT label within a window instance as:

Dominance (GT label) = (Number of data points that have the GT label within the window) / (Total number of data points within the window, i.e., window duration * sample frequency)

Then, for each window size and overlap ratio setting, please report the mean and standard deviation of the Dominance of all window instances. Meanwhile, please also report the number of window instances for each setting. These are important metrics associated with window-related analysis.

Response 4: I agree. Thank you for clarifying this point. Therefore, we added the mean and standard deviation of the dominance of all window instances for each window size and overlap ratio, as well as the number of window instances for each setting.

These changes can be found on page 10, Figure 6, and page 11, Figure 7. 

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for the revision and response. Most of the reviewer’s previous comments have been sufficiently addressed. But there are a few points that should be further clarified.

 

Figures 1 & 2. The association between these two plots is confusing. Figure 1 contains fewer points than Figure 2. Does each point in Figure 1 correspond to a video clip, and each point in Figure 2 to a sensor measurement? Please make this clear.

 

Line 317: “The main finding is the identification of different optimal window durations for arousal and valence, which are 3 seconds and 7 seconds, respectively, for the CASE dataset, using 50% overlap.” Why is 7 seconds picked as the optimal window duration for valence? Based on Figures 5 and 6, 3 seconds appears to be a better option for valence. 

 

Figures 5 & 6. More discussion, insights, and recommendations should be provided based on these results in terms of window settings. For example, smaller window sizes are more agile but may not contain sufficient data for feature generation, while larger window sizes cannot effectively capture class transitions. Please provide recommendations for other researchers on how best to decide their window sizes. This will further increase the contribution and impact of this work.

Author Response

Comments 1: Figures 1 & 2. The association between these two plots is confusing. Figure 1 contains fewer points than Figure 2. Does each point in Figure 1 correspond to a video clip, and each point in Figure 2 to a sensor measurement? Please make this clear.

Response 1: I agree. Thank you for pointing this out.

Regarding Figure 1: Each point in the scatter plot represents the mean label for a specific video, annotated by a participant. A total of 240 points are displayed (30 participants x 8 videos per participant).

Regarding Figure 2: We mistakenly uploaded an incorrect histogram for Figure 2. We apologize for the inconvenience and have replaced it with the correct histogram. This figure shows the number of points for each class, where each point corresponds to a sample from the joystick-based sensors with a sample period of t_s = 0.05 s.

These changes can be found on page 4, lines 163-168, and Figure 1; and on page 5, Figure 2.

 

***

Comments 2: Line 317: “The main finding is the identification of different optimal window durations for arousal and valence, which are 3 seconds and 7 seconds, respectively, for the CASE dataset, using 50% overlap.” Why is 7 seconds picked as the optimal window duration for valence? Based on Figures 5 and 6, 3 seconds appears to be a better option for valence. 

Response 2: I agree. Thank you for pointing this out. 

Indeed, 3 seconds is a better option for valence as it provides the maximum accuracy and dominance. We have made the necessary changes accordingly. 

These changes can be found on page 1, lines 8-10; page 9, lines 290-292, 294; page 10, line 307 and Table 5; page 11, lines 314-315, 321-322 and Figure 8; and page 12, Figure 9. 

***

Comments 3: Figures 5 & 6. More discussion, insights, and recommendations should be provided based on these results in terms of window settings. For example, smaller window sizes are more agile but may not contain sufficient data for feature generation, while larger window sizes cannot effectively capture class transitions. Please provide recommendations for other researchers on how best to decide their window sizes. This will further increase the contribution and impact of this work.

Response 3:  I agree. Thank you for pointing this out. We have added a comment on the balance required between the number of window segments and sufficient emotional content within the window. Additionally, we have included recommendations for determining optimal window sizes. Finally, we have added a comment on the benefits of the reinforcement effect caused by overlapping windows in the conclusions section. 

Changes can be found on page 13, lines 372-377, 392-395; page 15, lines 412-431; and page 16, line 469. 
