Article
Peer-Review Record

Performance of AI-Based Automated Classifications of Whole-Body FDG PET in Clinical Practice: The CLARITI Project

Appl. Sci. 2023, 13(9), 5281; https://doi.org/10.3390/app13095281
by Arnaud Berenbaum 1,*, Hervé Delingette 2, Aurélien Maire 3, Cécile Poret 3, Claire Hassen-Khodja 3, Stéphane Bréant 4, Christel Daniel 4, Patricia Martel 5, Lamiae Grimaldi 5, Marie Frank 6, Emmanuel Durand 1,7,8 and Florent L. Besson 1,7,8
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 11 February 2023 / Revised: 19 April 2023 / Accepted: 20 April 2023 / Published: 23 April 2023
(This article belongs to the Special Issue AI Technologies in Biomedical Image Processing and Analysis)

Round 1

Reviewer 1 Report

In this manuscript, the authors assessed the feasibility of a three-dimensional deep convolutional neural network (3D CNN) for the general triage of whole-body FDG PET scans in daily clinical practice. On this basis, they provide new insight that AI-based automated classification of whole-body FDG PET is a promising method for high-throughput triage in general clinical practice. The manuscript is well organized and clearly written. I would suggest accepting it after the following minor concerns are addressed.

1. The article's introduction lacks a smooth transition to the reason for selecting a CNN. Deep learning offers several families of models, so why did this study choose to test the feasibility of a 3D CNN for the general triage of whole-body FDG PET scans? I believe the logic could be more rigorous. Also, I believe the introduction is overly focused on the EDS, which could instead be covered in the Methods section.

2. The image quality is poor, particularly in Figures 3 and 4. Do these images have a resolution of 300 dpi?

3. In Table 3, the percentage corresponding to the number 631 in the test-set column appears to be calculated incorrectly.

4. In Tables 1 and 3, the F and M entries under sex are not aligned, and the items and numbers under clinical context do not line up with each other, making them difficult for readers to follow. I recommend that all tables in the article be presented in a three-line (horizontal rules only) format, with row and column spacing adjusted so that the tables are neater and easier to read.

5. Since you concluded that further studies are mandatory to overcome the challenges highlighted by the imperfection of real-life PET data, I would like to know whether you will continue research in this direction and what the next step in your research plan is.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper describes the training and evaluation of a 3D CNN for PET data using "real-life" and "unambiguous" datasets. The 3D CNN classifies the data into abnormal and normal PET scans, which can be used for triage. The dataset is large, which is difficult to obtain in general and also uncommon in PET research. The paper is interesting in its separation of "real-life" and "unambiguous" data and is mostly easy to read. Most of my concerns relate to reproducibility: the code is not accessible (a standard nowadays, via GitHub or similar), and the selection of data for the "unambiguous" dataset is not clear to me. More details can be given, as mentioned in the comments below.
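For orientation only, a minimal, hypothetical sketch of the kind of 3D convolutional binary classifier discussed here is given below; it is purely illustrative and is not the authors' architecture, which is described in the manuscript.

```python
# Hypothetical sketch (PyTorch): a small 3D CNN mapping a whole-body PET volume
# to a single abnormality probability, i.e., the normal-vs-abnormal triage task.
import torch
import torch.nn as nn

class Pet3DClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.BatchNorm3d(16), nn.ReLU(),
            nn.MaxPool3d(2),                     # halve each spatial dimension
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.BatchNorm3d(64), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),             # global pooling -> (N, 64, 1, 1, 1)
        )
        self.classifier = nn.Linear(64, 1)       # single logit: abnormal vs. normal

    def forward(self, x):                        # x: (N, 1, D, H, W) PET volume
        return self.classifier(self.features(x).flatten(1))

# Example: one dummy 64x64x64 volume -> abnormality probability via sigmoid
model = Pet3DClassifier()
prob_abnormal = torch.sigmoid(model(torch.randn(1, 1, 64, 64, 64)))
```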

 

Comments

- In Figure 2, it is stated "most representative", but this is not described technically. Finally, how would you bring this to clinical triage, with the real-life or the unambiguous model, and why? In this context, there is literature pointing out that it is important to also report the confidence of the algorithm; could you please comment on this? I think this is where lines 308/309 are going (a minimal illustration of confidence-aware triage is sketched after this comment list).

- The exclusion criteria for data selection are described in lines 124 and 125. Was this performed through visual selection of the 2D projections? Could you point out how many such scans were found and therefore excluded?

- Although I have no specific comments about the CNN architecture, there are many well-known architectures for classification. Why did you implement a new one, and how does it compare with the others?

- The results obtained with the "real-life" and the "unambiguous" data are shown completely separately. It would be interesting to show, on the same independent test dataset, whether the results differ or not. Although it is mentioned in the text that the results from the "unambiguous" training have an NPV of 0.05, no further results are shown, and the reverse comparison is not shown either. Could you add more information on this? I would like to see how the "real-life" model performs on just the "unambiguous" dataset. This is a really important point to review.

- Since your dataset is slightly imbalanced with respect to sex, I think it is important to add a comment on whether there were significant differences between the results for women and men.

- Please describe the anonymization of the written reports in more detail, or add a reference.

- In line 101, I think you meant "did not differ from the template".

- How did you choose the nine 2D projections for visual inspection?

- Line 144: typo in "NVIDIA".

- Line 301: typo; "real-life" is missing its hyphen.
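As a minimal, hypothetical sketch of the confidence-aware triage mentioned above (not part of the manuscript; the thresholds are illustrative placeholders), only scans on which the classifier is confident would be auto-sorted, with the remainder routed to a reader:

```python
# Hypothetical confidence-aware triage: auto-sort only confident predictions.
def triage(prob_abnormal, low=0.1, high=0.9):
    """Route a scan based on the classifier's predicted abnormality probability."""
    if prob_abnormal >= high:
        return "flag as abnormal"
    if prob_abnormal <= low:
        return "flag as normal"
    return "send to human reading"   # low-confidence scans are not auto-sorted

print(triage(0.97))  # flag as abnormal
print(triage(0.55))  # send to human reading
```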

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

An overall well-written manuscript on a currently very hot topic.

The aim of automatically sorting normal from abnormal scans, and thus reducing reading time and workload, is interesting.

Comments:

1) Regarding the conclusion in the abstract and in the manuscript: in the real-life setting, a sensitivity of 61% seems quite low for a diagnostic/sorting test, since it implies that roughly 4 in 10 abnormal scans would be missed. The technique is indeed promising, but in this study it is not sufficient to solve the task of sorting abnormal from normal scans. This should be stated more clearly in the conclusion.

2) Please elaborate on the definitions of normal and abnormal scans, as these are crucial for the rest of the study and for the potential generality of the model.

a. For scans classified as normal, did the text report consist only of the "normal" template, without manual confirmation?

b. Lines 100-101, page 3: "Reports that differed from this template were considered normal". Don't you mean reports that did not differ from the template?

c. The nine 2D projections used: please describe them more thoroughly. Were they "planar" summed parts of the MIP? Were they different orientations (AP, sagittal, coronal)?

d. Could one speculate that minor abnormal findings, and findings close to normal physiological uptake, might be missed?

3) Regarding the test, training, and validation sets: please specify/clarify in the text whether the initial training set (the unambiguous set) is a subset of the final real-life dataset (it looks that way in Figure 2).

a. Figure 2: what is the difference between available medical records and usable medical reports?

b.   Is the medical record the whole patient chart/journal or only the PET report?

c. "Oui" should be changed to "Yes" (arrow between the quality check and the final database).

4) Discussion: a more thorough discussion of why the model fails on the real-life data would be very fruitful, including more of the authors' thoughts on how to improve the model. For example, they mention the difficulty of defining "normal". What are the authors' suggestions for defining the term "normal" and for validating normal scans? What could further improve the model? What would need to be achieved to provide a reliable classification?

 

5) Limitations: another significant limitation is not only the single-center design, but also the single vendor (same PET scanner) and single reconstruction method. The variety of vendors and reconstruction algorithms introduces further variability that an automated classification algorithm needs to address if generalizability is required.

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
