Article
Peer-Review Record

Perturbation-Based Explainable AI for ECG Sensor Data

Appl. Sci. 2023, 13(3), 1805; https://doi.org/10.3390/app13031805
by Ján Paralič 1,*, Michal Kolárik 1, Zuzana Paraličová 2, Oliver Lohaj 1 and Adam Jozefík 1
Submission received: 13 December 2022 / Revised: 17 January 2023 / Accepted: 27 January 2023 / Published: 31 January 2023
(This article belongs to the Special Issue Recent Advances in Deep Learning for Image Analysis)

Round 1

Reviewer 1 Report

 

The authors propose a perturbation-based method for explaining predictions made by a deep network using ECG data. I agree that explainable AI – especially in medical applications – is an important topic and falls within the scope of the journal. I thought the paper was well organized and mostly easy to understand. I do have some methodological and other concerns that I hope the authors will address.

 

General comments & questions:

- The reference list seems to contain only 7 peer-reviewed publications (plus 2 non-peer-reviewed preprints and 1 online reference). Since explainable AI and medical AI are very broad and active research fields, a more thorough literature review would better position this paper and identify the problem it is trying to solve.

- The authors reference Jeyakumar’s study, which suggests that explanation-by-example may be the preferred method of explaining ECG-based predictions. Yet the authors are proposing a perturbation-based method instead. Could this choice be better justified?

- The authors note that other studies have built explainable ECG models by treating the ECG trace as an image, while this study treats the ECG as a data stream. However, doctors likely perceive the ECG trace more like an image. Could this choice be better justified?

- The model in this study uses 1D convolutional layers. Does this mean that there is no analysis across ECG channels?

- The authors point out that they have used a simple neural network for this study, one that trails state-of-the-art approaches in terms of F1 score. The reason given is to reduce the time needed for explanation. However, for a medical application like this, high performance is probably critical. Moreover, a poor model needs no explanation anyway (nobody will request an explanation for a prediction that isn't good!). Are the authors suggesting that their method would take too long to run on higher-performance models? Perhaps they could quantify the extra time required to explain a higher-performance network.

- Collaboration with doctors is important, and a great feature of this work. However, I found the experts’ credentials were unclear. The abstract refers to “senior medics”, but line 336 calls them “medical students”. How many experts were engaged, and what were their credentials and level of experience?

- Line 154: How was the threshold of 0.373284 chosen? Using cross validation? On the training or validation data?

 

Methodological comments & questions:

- The authors use a perturbation-based approach to select important intervals. This approach is based on the same principles as permutation-based feature selection, which identifies the features important to a statistical model by shuffling them one by one and observing the effect on the model's performance (a minimal sketch of such a procedure is given after this list). This raises a couple of potential problems:

o   It’s well known that this approach selects features/intervals that are important to the model, not necessarily the ones with high intrinsic predictive/explanatory power. That is, the approach could select different intervals when used with a more sophisticated model.

o   With a simple model (the authors point out that they have chosen a simpler model for this study) it’s more likely that the intervals important to the model are also the ones with good predictive power. But as the model becomes more complex, it is more likely to learn sophisticated relationships and rules that no longer match how a human would see the task. Thus, the intervals selected by this perturbation approach may match human expectations for a simple model, but not a complex one.

o   Another drawback of the perturbation/permutation approach is that it devalues correlated features/intervals. If one of two correlated intervals is perturbed, the model can get the same information from the other; as a result, both intervals will appear less important than they really are. Again, this effect could become worse with a more complex model.

o   Since the same perturbation approach is used to select channels, the same concerns exist for the channel selections.

These sorts of limitations should be considered, described, and addressed if possible – especially since any practical application would probably use a higher-complexity model. Have the authors considered running their method on a more complex model and comparing the results?
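For concreteness, here is a minimal sketch of the kind of perturbation procedure discussed above. This is my own illustration, not the authors' implementation: each interval is replaced with a baseline value, the model is re-scored, and the drop in the predicted class probability is taken as the interval's importance. The names `model`, `ecg`, the window size, and the Keras-style `predict` interface are assumptions.

```python
import numpy as np

def interval_importance(model, ecg, target_class, window=50, baseline=0.0):
    """Score each time window by how much masking it lowers the model's
    confidence in the target class. Purely illustrative: `model` is assumed
    to map an array of shape (1, length, channels) to class probabilities."""
    length = ecg.shape[0]
    p_ref = model.predict(ecg[np.newaxis])[0][target_class]   # unperturbed confidence
    importances = []
    for start in range(0, length, window):
        perturbed = ecg.copy()
        perturbed[start:start + window] = baseline             # mask one interval
        p = model.predict(perturbed[np.newaxis])[0][target_class]
        importances.append(p_ref - p)                          # importance = drop in confidence
    return np.array(importances)
```

The concerns above apply directly to such a procedure: the importances it returns describe what this particular model relies on, and correlated intervals can share the score between them.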

- Using experts to validate the machine output is a great idea. However, in this case there may be an issue. The experimental design is to show human experts the model's output and ask to what extent they agree. This creates the potential for a sort of anchoring bias, where the human's default response will be to agree with the model and to disagree only in extreme situations. It would be better to ask the experts to label the regions they consider important without showing them the model's predictions, and then compare. Could the authors comment?

- Furthermore, the authors state that the task of labelling the "most important areas" was confusing for the human experts. Does this indicate that this is not how the task actually works? If humans perform the diagnosis using a holistic view of the ECG signal rather than by focusing on specific intervals, what does that mean for this approach to XAI? (Note that nobody rated the highlighted intervals "completely right". Could this be why?)


Minor comments:

- Line 70-73: the meaning here is unclear to me. Methods like SHAP and Explanation-by-example are not specific to convolutional networks. Nor were they originally designed for computer vision.

- Line 139: ReLU activation?

- Line 157-170: The description of micro- and macro-averaging, and how they were used in these results, was a little unclear to me (a brief illustration of the two averaging schemes follows this list).

- Line 195: what does “break the interactions” mean here?
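To illustrate the micro/macro distinction referred to above, here is a small sketch using scikit-learn; the labels are invented purely for illustration and have nothing to do with the paper's data. Micro-averaging pools all decisions before computing the score, so frequent classes dominate, whereas macro-averaging computes F1 per class and then takes the unweighted mean.

```python
from sklearn.metrics import f1_score

# Toy multi-class example (invented labels, purely illustrative).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2]

# Micro: pool all predictions, then compute precision/recall/F1 once;
# dominated by the majority class.
print(f1_score(y_true, y_pred, average="micro"))

# Macro: compute F1 for each class separately, then take the unweighted mean;
# treats rare and common classes equally.
print(f1_score(y_true, y_pred, average="macro"))
```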

 

 

Author Response

Thank you very much for your excellent and very helpful review. We appreciate your detailed comments, which reflect your deep expertise in the related research area. We have addressed all of them in the revised version of our manuscript. You can find our answers to each comment or question in the attached file.

Author Response File: Author Response.docx

Reviewer 2 Report

1. The Introduction could be expanded to support the core points clearly; the main contributions should be stated concisely and explicitly.

2. More papers on explainability could be discussed.

3. Please explain how the perturbation is performed and provide its definition.

4. The newest explanation approaches could be added to this work as comparison methods. For example, the methods from reference [5] could be included.

5. The proposed method is run on only one dataset, and although its performance is better than that of other approaches, how can the method's robustness be verified?

6. Descriptions of the biological significance, combined with the explainable deep learning methods, should be given in the Results or Discussion section.

Author Response

Thank you very much for your excellent and helpful review. We have addressed all of your comments and remarks in the revised version of our manuscript. You can find our answers to each comment or remark in the attached file.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The revised paper provides many of the clarifications I was hoping for, and better acknowledges this study's limitations.

I would like to suggest two more corrections, after which I would call the paper ready for acceptance:

- In round one I proposed a better experimental design in which human participants are asked for their opinions before seeing the model's output. The authors point out this would be a completely different study, which is fair. But the paper should at least acknowledge somewhere the potential for bias that the current experimental design poses.

- The revised paper notes that thresholds were chosen using a simplex method on the validation set. This is inappropriate, as it overfits the validation set (validation sets should be used only at the end of model development, and only for validation, not for setting parameters/thresholds). The results should be recomputed with the thresholds chosen on the training set, per best practice (a small sketch of this practice follows).
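As a sketch of the practice suggested above: sweep the decision threshold on the training split only, then apply the frozen threshold to the held-out data. The grid search, the F1 criterion, and the variable names below are my own assumptions for illustration (the paper reportedly used a simplex search), not the authors' procedure.

```python
import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(y_train, p_train, grid=np.linspace(0.05, 0.95, 181)):
    """Pick the probability threshold maximizing F1 on the training split only
    (illustrative grid search over binary labels/probabilities)."""
    scores = [f1_score(y_train, (p_train >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]

# Usage sketch (hypothetical variable names): tune on the training predictions,
# then apply the frozen threshold to held-out data for an unbiased estimate.
# t_star = pick_threshold(y_train, p_train)
# held_out_f1 = f1_score(y_val, (p_val >= t_star).astype(int))
```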

 

Author Response

Thank you very much for carefully reading our revised manuscript and for the further useful remarks. We have addressed both of them in the revised version of our manuscript. You can find our answers to each remark in the attached file.

Author Response File: Author Response.docx

Reviewer 2 Report

Results of the compared methods could be included in Section 4.

Author Response

Thank you for carefully reading our revised manuscript and for another helpful remark. We have addressed it in the revised version of our manuscript. You can find our answer to your remark in the attached file.

Author Response File: Author Response.docx
