Article
Peer-Review Record

Validation of an Automatic Arousal Detection Algorithm for Whole-Night Sleep EEG Recordings

Clocks & Sleep 2020, 2(3), 258-272; https://doi.org/10.3390/clockssleep2030020
by Daphne Chylinski 1,†, Franziska Rudzik 2,3,†, Dorothée Coppieters ‘t Wallant 4, Martin Grignard 1, Nora Vandeleene 1, Maxime Van Egroo 1, Laurie Thiesse 2,3, Stig Solbach 2, Pierre Maquet 1,5, Christophe Phillips 1,6, Gilles Vandewalle 1, Christian Cajochen 2,3 and Vincenzo Muto 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 26 May 2020 / Revised: 9 July 2020 / Accepted: 10 July 2020 / Published: 16 July 2020
(This article belongs to the Section Computational Models)

Round 1

Reviewer 1 Report

The paper presents a comparison of the automatic algorithm for arousal detection in EEG with human visual scoring. It is generally well written, uses appropriate research methods, and the results support the further use of the algorithm. In addition, this work highlights the important question of how an arousal should be comprehensively defined.

 

General comments

1) The aim of the research is missing from the Abstract.

2) The number of scoring experts is relatively small (effectively 2 vs. AD) to be able to draw more general conclusions. Moreover, if the experts are also among the Authors of the paper, its objective value may be even smaller in the readers’ perception.

3) If "that more excitement would be detected by AD" was really a hypothesis, it should follow from previous work and findings, and should be rationally justified when mentioned. Otherwise, it looks more like the result of this work.

4) The correlation between AD and HR is very weak (Section 2.2.3, r = 0.30-0.38). This finding should be discussed in relation to the other statistical measures, which have proved to be quite good.

5) Section 4.1: Since the impact of sex on the results of analyses is studied as well, this demographic information should also be presented here.

6) There was a difference in the methodology of arousal scoring between the BAS and DC centres (age vs. sleep stage as additional information). The potential impact of this difference on the results and conclusions should be briefly discussed.

7) Section 4.3.4: There is no justification here that/why the used procedure can distinguish between arousals and artefacts. All final results and conclusions depend on it (!).

8) Among Refs 23–27, the most recent publications, in particular those using time-frequency analysis in automatic detection of arousals, are missing.

 

Minor comments

i) p. 12, line 417: should be: "were considered"

ii) p.13, line 453: a reference to the applied time-frequency analysis using Morlet's method would be useful here.

Author Response

We kindly thank the reviewer for recognising the quality of our work and emphasising the potential importance of our publication. We would also like to thank him/her for raising pertinent points that helped improve the manuscript. Here are our point-by-point replies to the raised concerns:

1) The aim of the research has been added to the abstract. We modified lines 30-31: “… visual detection from two research centres with the aim to evaluate the algorithm performance.”

2) We understand the reviewer’s point of view, but respectfully disagree. Our validation dataset includes 35 recordings, split roughly 50/50 between men and women and between younger and older participants, to cover several aspects of inter-individual variability with a relatively large sample size. To be able to compute comparisons in the most comprehensive way, we managed to have all nights scored by 2 scorers. We indeed report that the Basel scorers (n=4) could be considered as a single scorer, likely because of the research centre’s sleep scoring “tradition”. Yet the two scorers come from different research centres with different scoring traditions. It therefore seems to us that we made a great effort to provide a strong characterisation of the automatic detection performance. As stated by reviewer 2, “The use of two human raters and their use in defining the “gold standard” classification of arousals was very clearly described and is an excellent proxy for a currently ill defined ‘gold standard’.” Arousal scoring and interpretation remain an area of controversy, and our aim was to compare visual and automatic detection in a thorough way. Keeping this objective in mind, we decided to construct two “gold standards” (inclusive and conservative) to reflect the multiple sources of variance affecting the human eye.
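To make the two “gold standards” concrete, the sketch below combines two human raters' per-epoch arousal labels. It assumes, as an illustration only, that the inclusive standard marks an epoch when either rater scored an arousal and the conservative standard marks it only when both did; the function name `combine_raters` and the example labels are hypothetical and are not taken from the manuscript.

```python
def combine_raters(rater_a, rater_b):
    """Return (inclusive, conservative) arousal masks from two raters.

    Assumption: each input is a per-epoch sequence of 0/1 labels of equal
    length. Inclusive = arousal scored by either rater (logical OR);
    conservative = arousal scored by both raters (logical AND).
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same number of epochs")
    inclusive = [bool(a) or bool(b) for a, b in zip(rater_a, rater_b)]
    conservative = [bool(a) and bool(b) for a, b in zip(rater_a, rater_b)]
    return inclusive, conservative


# Hypothetical labels for six epochs from the two centres.
bas = [0, 1, 1, 0, 0, 1]
dc = [0, 1, 0, 0, 1, 1]
inclusive, conservative = combine_raters(bas, dc)
print(sum(inclusive), sum(conservative))
```

By construction the conservative standard can never contain more arousals than the inclusive one, which is why agreement figures against the two standards bracket the effect of rater disagreement.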

3) This point was already approached in the discussion (line 418), where we argue that it is typical of automatic detection of transitory events to find a higher number of events than what is yielded by visual inspection of the EEG. Sleep spindle automatic detections have for instance systematically proven this statement. In the revised version of the manuscript, we further added the following (line 89) “Usually, automatic detection methods used for transitory events (e.g. sleep spindles) yield a higher count of events than visual detection”; and line 110 “Moreover, we expected more arousals to be detected through AD based on previous automatic detection methods for transitory events and that supplemental arousals would contain lower frequency oscillation – rendering them less obvious to the human eye.”

4) The first goal of our study was to validate AD vs. human ratings. For this, we used Cohen’s Kappa, which is considered the best metric for this kind of purpose, and all our comparisons fall within the “Almost perfect” category of the Kappa scale. We further provide measures of inter-rater agreement, sensitivity, overlap of events and FDR in an attempt to provide a very comprehensive and detailed characterisation of AD performance. Following the characterisation of the potential impact of the age and gender of the recorded volunteer, and of AD performance according to sleep stage, we wondered whether AD and human rater arousals were correlated, i.e. whether more arousals detected by the human eye were reflected in more arousals detected automatically. We agree with the reviewer that the correlation is far from perfect, but this was to be expected given the much larger number of arousals detected by AD (cf. Figure 1). Yet r ≥ .3 still constitutes a medium effect size association, with a relatively low p-value when considering the conservative human rating. In our view, this brings further validity to our algorithm and means that it can be used to compare arousal prevalence across subjects and its potential links with other individual phenotypes.
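The agreement measures named in this reply can be illustrated with a minimal sketch. It assumes binary per-epoch labels (1 = arousal) for a human reference and for the automatic detection; the unit of comparison used in the manuscript may differ, and the function name `agreement_metrics` and the example data are hypothetical.

```python
import math


def agreement_metrics(reference, detected):
    """Cohen's kappa, sensitivity and false discovery rate (FDR).

    Assumption: `reference` (human rating) and `detected` (automatic
    detection) are equal-length sequences of 0/1 labels, one per epoch.
    """
    if len(reference) != len(detected):
        raise ValueError("inputs must have the same length")
    tp = fp = fn = tn = 0
    for r, d in zip(reference, detected):
        if r and d:
            tp += 1
        elif d:
            fp += 1
        elif r:
            fn += 1
        else:
            tn += 1
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else math.nan
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    fdr = fp / (tp + fp) if (tp + fp) else math.nan
    return {"kappa": kappa, "sensitivity": sensitivity, "fdr": fdr}


# Hypothetical ten-epoch example.
hr = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
ad = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
metrics = agreement_metrics(hr, ad)
print(metrics)
```

Kappa corrects the raw agreement for the agreement expected by chance, which matters when arousals are rare relative to non-arousal epochs. FDR here counts AD-only detections as false positives relative to the human reference, which is the convention discussed in the replies rather than a statement about whether those events are true arousals.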

5) The information was present in Table 5 and has been added to the methods section 2.1 (line 119).

6) As more clearly stated in the revised version of the manuscript (line 107), “we further explored whether sex, age and sleep stage influence detection reliability”. In other words, these analyses were computed in addition to our main objective. The origin of the difference between age groups found in the Basel detection is unclear and, as the reviewer pointed out, may have been driven in part by the difference in methodology. We stress, however, that Basel raters also had access to the sex of the participants, but this seems arguably less likely to underlie the significant link with sex found in Table 4. Furthermore, although Basel raters had access to the full visual sleep staging, which is arguably related to arousal detection, we failed to find a significant difference in AD performance across sleep stages. We developed this part in the revised manuscript to take this comment into account (line 450): “Although we did find differences in some agreement coefficients according to age, particularly when comparing against HR inclusive, we consider that this is likely due to the difference between age groups in the amount of arousals detected by BAS, as arousal density was similar across age-groups when using AD, as well as for Liège HR. The origin of the discrepancy between BAS on the one hand and DC and AD on the other regarding age is unclear but may be related in part to the slight difference in methodology where BAS raters had access to the age of the participants, together with their sex and to the full visual sleep staging (vs. NREM, REM, WAKE for DC).”

7) The detection of artefacts is performed via several steps described in the methods section that are carried out prior to arousal detection (e.g., removal of epochs with noisy or flat channels, or the reconstruction of bad EMG signal), as well as some steps that have not been described in the present paper but can be found in the original article (Coppieters ‘t Wallant et al., 2016). As described in sections 2.3.1 and 2.3.2, artefacted portions of the EEG are first removed. Then, in section 2.3.3, we explain that a search for EEG shifts is performed by looking for changes in the theta, alpha and beta bands, comparing the power in each band to a global median (of the power in that band for the whole recording) and to a local median in a shorter time window. All shifts in EEG in any of those 3 frequency bands lasting more than 3 seconds are considered to be arousals.

The aim of the previous paper was to validate the algorithm that detected the clean segments of EEG (those not containing artefacts and arousals), in order to further perform power spectrum analysis. In this previous paper, we compared automatically and visually detected clean EEG segments. Here, our aim was specifically to compare arousals detected by the AD with visually detected arousals.

Thus, the main modifications done to the code are in the way arousal events are classified and retrieved from all the “rejected” segments of the EEG.
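The shift search described in reply 7 can be sketched roughly as follows. This is not the authors' code: it assumes a per-second power series for a single band (theta, alpha or beta) has already been computed on artefact-free data, and the threshold factors, the local window length and the name `detect_shifts` are hypothetical choices made only for illustration.

```python
from statistics import median


def detect_shifts(power, global_factor=3.0, local_factor=2.0,
                  local_window=30, min_duration=3):
    """Return (start, end) second indices (end exclusive) of power shifts.

    Assumptions (not from the manuscript): `power` holds one value per
    second for one frequency band; a second is flagged when its power
    exceeds both `global_factor` times the whole-recording median and
    `local_factor` times the median of the surrounding `local_window`
    seconds on each side.
    """
    global_med = median(power)
    flagged = []
    for i, p in enumerate(power):
        lo = max(0, i - local_window)
        hi = min(len(power), i + local_window + 1)
        local_med = median(power[lo:hi])
        flagged.append(p > global_factor * global_med
                       and p > local_factor * local_med)

    events = []
    start = None
    # A trailing False closes any run that reaches the end of the series.
    for i, is_shift in enumerate(flagged + [False]):
        if is_shift and start is None:
            start = i
        elif not is_shift and start is not None:
            # Keep only shifts lasting more than `min_duration` seconds.
            if i - start > min_duration:
                events.append((start, i))
            start = None
    return events


# Hypothetical series: a 5 s burst (kept) and a 2 s burst (discarded).
power = [1.0] * 20 + [10.0] * 5 + [1.0] * 20 + [10.0] * 2 + [1.0] * 20
events = detect_shifts(power)
print(events)
```

Running the same function on each of the three bands and pooling the resulting events would mirror the “any of those 3 frequency bands” criterion. Because every step is a fixed numerical rule, the same input always gives the same output, which is the sense of reproducibility the authors invoke in their reply to reviewer 2.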

8) We have added a recent publication to the revised manuscript (line 86), i.e. Ugur et al. 2019. We are not sure which other recent papers the reviewer was referring to but are willing to add other recent references to the text.

i) We thank the reviewer for spotting this typo, which has been corrected.

ii) A reference is included in the revised version of the manuscript.

Reviewer 2 Report

Overall, this is a well-written manuscript describing an automated arousal detection algorithm for use in clinical or research sleep laboratories.  The introduction is appropriate and clearly defines the problem being addressed.  The use of two human raters and their use in defining the “gold standard” classification of arousals was very clearly described and is an excellent proxy for a currently ill defined ‘gold standard’.  Evaluation of potential confounds was well considered and the discussion clearly noted the likely difficulties to be encountered with increasing age; likewise, the authors noted the need for evaluation on clinical populations with significant age-related disorders in validating their algorithm.

This reviewer has only two concerns:

  1. While there is mention in the results that the reported results have been replicated, no information on replicability is actually presented. Acknowledging that the authors make no claim to have fully validated their algorithm, if they wish to claim replicability, additional studies will need to be included.
  2. The AD algorithm does return far more ‘arousal’ hits than either of the HR ‘gold standard’ raters. Despite the excellent performance of the AD relative to HR raters, there should be some discussion of the consequences of those ratings considered “false positives” in this analysis on research or clinical staff attempting to deploy this algorithm in analysis of their data.

Author Response

We kindly thank the reviewer for recognising the quality of our work and emphasising the potential importance of our publication. We would also like to thank him/her for raising pertinent points that helped improve the manuscript. Here are our replies to the raised concerns:

1) This comment likely arises from the use of the words “reproducibility” and “reproducible” at the end of the discussion. We did not mean replicability in the sense of replicating study findings, but rather that the same recording would always lead to the same detection when using AD, which is not the case with human raters, even if the same rater evaluates the same recording twice. By reproducibility, we also meant that detection would rely on the same mathematical criteria across research centres, lab traditions, age groups, sex, etc. In other words, any bias in the detection of arousals (arousal detection remains sub-optimal whatever the approach) would be systematically the same. We agree with the reviewer that our algorithm should be tested on more diverse (clinical) populations to be fully validated. Overall, we do feel that the results presented in the present manuscript show that our AD can be reliably used as an alternative to visual scoring of arousals in healthy individuals devoid of sleep pathologies.

We modified the manuscript as follows (line 429): “In contrast to visual detection, it will always yield the same detection when using the same dataset, and its detection bias will be systematic across research centre and study sample.”

2) While it is true that AD detects many more arousals than were visually detected, and while those are considered “false positives” when compared to human detection, there is no way to claim with certainty that they are not true arousal events. The current definition of an arousal is “a transient shift in EEG frequencies”, and this is what our algorithm detects based on mathematical variations. There is no real objective threshold determining how big the shift should be to be classified as an event. Would a given event detected only automatically also be seen by a human expert if they had more time to inspect the recording? The additional analysis reported in section 2.3 shows that visually detected vs. AD-only detected arousals are statistically different over the whole dataset with, as expected, relatively lower power in fast frequencies in the latter, which arguably makes them less obvious to the human eye and trickier to detect. This does not mean that these events do not constitute arousals as defined by the standard definition.

We modified the text as follows to take this comment into account (line 431): “Based on the current definition of an arousal where, for instance, there is no objective threshold determining how large the frequency shift for an event to be classified as an arousal, there is no reason to consider that events that are only detected by AD algorithm do not constitute arousals.”
