Peer-Review Record

A Novel Intraretinal Layer Semantic Segmentation Method of Fundus OCT Images Based on the TransUNet Network Model

Photonics 2023, 10(4), 438; https://doi.org/10.3390/photonics10040438
by Zhijun Gao *,†, Zhiming Wang † and Yi Li
Submission received: 19 January 2023 / Revised: 29 March 2023 / Accepted: 10 April 2023 / Published: 12 April 2023
(This article belongs to the Special Issue High-Performance Optical Coherence Tomography)

Round 1

Reviewer 1 Report

The authors propose a new deep architecture for segmenting OCT intraretinal layers. The architecture is a novel TransUNet variant with proposed improvements, and it shows better performance than a number of relevant methods from the literature, evaluated against two expert gold standards. The methodology is outlined clearly, and the results are convincing.

There are a few points the authors need to improve on.

1) How are the compared methods trained? A description of their training is important for a fair comparison.

2) How generic is the approach? Some inter-institute testing is important to understand whether the model overfits the current datasets from the institute. The authors are recommended to show some experiments on data from other institutions and to report the generalization even if it is not good enough. That provides readers with a better understanding of the algorithm and its limitations.

On that note, there are more datasets from Duke on AMD patients and control subjects that may be used for testing as well.

3) Continuing from the previous point, the training used DME and control patients. How would the performance be on drusen deposits on the RPE? Some qualitative results are recommended; datasets for AMD should contain such examples. Section 4 discusses generalization, but it reads more like an ablation study. A proper generalization experiment needs to be done on other datasets (ones not used for training).

Author Response

Dear Reviewer:

     Thank you for your letter and for the reviewer's comments concerning our manuscript entitled "Research on the Intraretinal Layer Semantic Segmentation Method of Fundus OCT Images Based on the Novel TransUNet Network Model". Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope will meet with approval. Revised portions are marked in red using the "Track Changes" function in the revised manuscript.

     Please see the attachment.

Author Response File: Author Response.doc

Reviewer 2 Report

This is an interesting work, which developed an improved lightweight TransUNet deep learning model for intraretinal layer segmentation of fundus OCT images. However, the authors should present the whole method more transparently. Here are my comments.

1. Overall, all the captions of figures and tables should be improved. All the abbreviations, symbols, and labels should be explained clearly.

2. The critical improvement of this work is using the RL-Transformer and Dense Block to replace the Transformer and upsampling parts of the original TransUNet. The authors should discuss more deeply why these replacements improve segmentation performance (e.g., the Dice coefficient), based on the unique properties of these components compared to their counterparts.

3. In Fig.4 and the related discussion, showing the difference between ResMLP and MLP is appreciated. The difference in their properties and overall effects could thus be discussed more deeply.

4. In Eq.(1), why is there the LN(.) operator? Is it Layer Norm? It does not appear in Fig.4b.

5. At the beginning of page 6, please explain the term "patch". What is its correspondence to the output of the downsampling CNN of U-Net? Also, please explain Eq.(3) clearly.

6. On page 7, the authors should also explain the Dense Block part more clearly. The first two paragraphs should be largely revised. In addition:

   1) What is the meaning of BN-ReLU-Conv in Fig.5?

   2) The first paragraph states that a 5-layer dense block comprises five BN-ReLU-Conv layers, but Fig.5 shows only four BN-ReLU-Conv layers.

   3) The count of input layers of the L-th layer, and the explanations of k, k0, and 4k, are very confusing.

7. Algorithm 1 should be explained, especially all the symbols, statements, and expressions.

8. On page 9, an explanation of the Hausdorff distance (HD) should be added.

9. In Table 1:

   1) Are Expert 1 and Expert 2 mistakenly reversed? The table seems inconsistent with the discussion on page 10.

   2) Please explain the meaning of the numbers in the rows for Expert 1 and Expert 2.

   3) Please explain the meaning of significance P.

   4) What is the "consistency index" in the caption? Which numbers does it refer to?

   5) Are all the numbers averages over all testing images? If so, please show error bars of the uncertainties across the testing images.

10. On page 10, what is the meaning of the numbers after the sentences "... superior to other four methods" in paragraphs 1 and 2? Also, in paragraph 3, please explain how the t-test is calculated, especially since the Dice coefficients do not differ much.

11. In paragraph 4 of page 10, there is a typo: "Jakd similarity".

12. Please explain the meaning of the up and down arrows.

13. Please explain how the ROC curves are drawn, since each model yields only a single value of "True Positive Rate" and "False Positive Rate".

14. On page 12, the first paragraph is quite redundant. It could be replaced by a table.

15. In Table 5, please explain FPS. Also, please estimate the number of parameters of the original upsampling part and of the Dense Block somewhere in the manuscript.

Author Response

Dear Reviewer:

     Thank you for your letter and for the reviewer's comments concerning our manuscript entitled "Research on the Intraretinal Layer Semantic Segmentation Method of Fundus OCT Images Based on the Novel TransUNet Network Model". Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope will meet with approval. Revised portions are marked in red using the "Track Changes" function in the revised manuscript.

     Please see the attachment.

Author Response File: Author Response.doc

Reviewer 3 Report

The manuscript presents a deep-learning model to segment intraretinal layers in OCT images. The model combines the TransUNet framework with a ResLinear-Transformer and a dense block. I do believe the use of transformer-like networks in this kind of segmentation task is useful. However, the manuscript is not well written, and the novelty of the paper is questionable. My major concerns are listed as follows:

1.    If I understand it correctly, RL-Transformer is an existing model introduced in a published paper (ref. 24). If this is the case, how could the authors state that “using RL-Transformer to replace the Transformer part of the original TransUNet model” is a contribution of the current paper?

2.    The methodological novelty is not clear. The authors only combine different existing methods to achieve a high-performance model.

3.    Please describe how the ROC curve is plotted. What is the final activation layer of the network? Is it softmax? How are the different data points in the ROC curve obtained?

4.    It seems that when choosing the comparison models, the authors focused more on Transformer-like networks. How about models developed specifically for OCT segmentation? Those models are truly state-of-the-art, and I think comparisons to them would be more meaningful.

5.    What is the difference between the annotations obtained by the two experts? Is it large?

6.    From Fig. 7, many boundary details are not correctly segmented by the proposed method, so I don’t think it is appropriate to state that “the texture processing at the boundary is also ideal” (line 378).

7.    Some sentences are confusing and difficult to understand. For example, “The CNN of the encoder is downsampled from the corresponding cascade of the same layer resolution, which is used to declare the hidden characteristics to export the resulting split mask” in lines 157-158, and “Patch embeddings was applied to 1×1 patches extracted from the CNN feature map instead of the original image” in lines 173 and 174. The authors are suggested to modify the manuscript to make it more reader-friendly.

8.    Line 166, the authors wrote “… forecast the segmentation map of corresponding size H × W with a resolution of H × W in space and a number of channels C”, I think H × W should be the matrix size instead of resolution.

9.    Normally, we say "feature extractor" rather than "feature puller" (line 172) and "receptive field" rather than "sensory field" (line 175). The authors are suggested to read more papers in the relevant field to avoid unusual word choices.

10.   I don't think the title is appropriate. It is a little awkward for a scientific paper to have a title starting with "Research on".

11.   The English writing needs to be improved. There are many errors in the current version. For example, "The pre-trained weights were used on ImageNet to initialize network parameters" (line 266) and "which was used to optimize the model for backpropagation" (line 269).

Author Response

Dear Reviewer:

     Thank you for your letter and for the reviewer's comments concerning our manuscript entitled "Research on the Intraretinal Layer Semantic Segmentation Method of Fundus OCT Images Based on the Novel TransUNet Network Model". Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope will meet with approval. Revised portions are marked in red using the "Track Changes" function in the revised manuscript.

     Please see the attachment.

Author Response File: Author Response.doc

Round 2

Reviewer 1 Report

The authors addressed some of the concerns raised in the first review. The generalization experiments show good potential for the approach.

However, it seems unlikely that the authors could not locate AMD and drusen datasets. Still, this can be left as future work and ignored in the current scope.

The authors probably did not understand the point of the first question: how are the "compared" methods trained? That is, when the other methods (such as TransUNet, ReLayNet, etc.) are compared, how are they trained? Are the authors using pre-trained models distributed by the original authors, or are the models trained by the authors themselves?

Finally, one part remains unclear: section 3.2 explains selecting 60% of the OCT database images for training. Is that done for both datasets used in the evaluation, or just the SD-OCT dataset? For generalization, the criterion should be that no data from the new POne dataset has been used for training.

The above two concerns, (1) training details of the "compared" methods and (2) details on the use of the POne dataset, still need to be addressed. Since this should not require significant effort, a minor revision may suffice.

Author Response

Please see the attachment.

Author Response File: Author Response.doc

Reviewer 2 Report

Thanks to the authors' efforts to improve the manuscript, which addressed many of my questions. However, I still hope that section 2.2.1 (RL-Transformer) will be entirely revised, since it is one of the central parts of this manuscript but is still not transparent to readers, especially readers who are not familiar with this area.

I hope the presentation will at least follow a reasonable logic, without jumping back and forth. Here I suggest one possibility: just follow Fig.3 and Fig.4 as presented by the authors.

1) In Fig.3, the RL-Transformer layers follow the CNN block, whose last layer is Linear Projection. I guess the Linear Projection corresponds to Eq.(1) and Eq.(3). If the authors want to present this section starting from the Linear Projection, then Eq.(3) should be described first.

2) In my previous question, I asked about the meaning of the term "patch". The authors still do not describe it clearly. After googling, I found these:

https://d2l.ai/chapter_attention-mechanisms-and-transformers/vision-transformer.html

https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/

So, a "patch" means dividing the image into parts, each of dimension p x p. Then Eq.(3) can be explained clearly: the xip are image patches, E is the patch embedding operation, Epos is the position embedding, etc. The authors should at least describe all these meanings in words, so that people will appreciate the beauty of this design.
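For concreteness, here is a minimal NumPy sketch of this reading of patch embedding (the function name, shapes, and matrices are illustrative, not taken from the manuscript):

```python
import numpy as np

def patch_embed(image, p, E, E_pos):
    """Split an H x W image into non-overlapping p x p patches,
    flatten each patch, project it with an embedding matrix E,
    and add one position-embedding row per patch.
    Mirrors the reading of Eq.(3): z0 = [x1p E; ...; xNp E] + Epos."""
    H, W = image.shape
    n = (H // p) * (W // p)                       # number of patches N
    patches = (image[: H // p * p, : W // p * p]  # crop to a multiple of p
               .reshape(H // p, p, W // p, p)
               .swapaxes(1, 2)
               .reshape(n, p * p))                # each row is one flattened xip
    return patches @ E + E_pos                    # z0: an (N, d) token sequence
```

Describing Eq.(3) in this spirit, with each symbol named, would make the design easy to follow.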

3) Now we come to Fig.4a. Note that the input should be zl, instead of the (x1p, x2p, ..., xNp) drawn in the diagram. Here zl comes from Eq.(3) for l=0. Since Fig.3 shows 12 RL-Transformer layers, l takes values from 0 to 12. In this part, the authors should describe Eq.(4) and Eq.(5).

4) In Fig.4a, I think two lines are missing from the diagram. See the figure at my first URL above: one line bypasses the "Norm" and "Multi-head attention" layers in the bottom block, and the other bypasses the "Norm" and "MLP" layers in the upper block.

5) Finally we come to Fig.4b, which should correspond to Eq.(2). Note that Eq.(2) is only half of Fig.4b. Similarly, I think two lines are also missing from Fig.4b. Please see Fig.1 of:

https://arxiv.org/pdf/2105.03404v2.pdf

See the two "Skip-connection" lines in that diagram. This is why z''l is added in Eq.(2).

6) Finally, what is the difference between ResMLP and MLP? I suggest the authors say a few words about the MLP, e.g., that it is actually a fully connected neural network, like this:

https://www.researchgate.net/figure/Multi-Layer-Perceptron-MLP-diagram-with-four-hidden-layers-and-a-collection-of-single_fig1_334609713

So the reader will immediately understand the significance of using ResMLP to replace the MLP.
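The contrast can be made concrete with a small sketch; the residual (skip) connection is the only difference shown here, and the function names and weight shapes are illustrative (the actual ResMLP design also includes affine normalization and a cross-patch sub-layer, omitted for clarity):

```python
import numpy as np

def mlp_block(z, W1, b1, W2, b2):
    """A plain MLP sub-layer: one hidden fully connected layer with ReLU."""
    h = np.maximum(0.0, z @ W1 + b1)   # hidden layer, ReLU activation
    return h @ W2 + b2                 # project back to the token dimension

def resmlp_style_block(z, W1, b1, W2, b2):
    """The same MLP with a residual (skip) connection added back,
    so the block learns a correction to its input rather than
    a full transformation."""
    return z + mlp_block(z, W1, b1, W2, b2)
```

A sentence of this kind in the manuscript, noting that the skip connection eases gradient flow through the 12 stacked layers, would let readers see at once why the replacement can help.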

In brief, I really hope the authors will consider revising the presentation of this section logically, with more words describing each term, so that people can easily follow, understand, and appreciate the whole design, and see why it yields a significant gain in performance.

Besides this section, there are other questions in the remaining part.

7) Section 3.2 (page 8) mentions that "the weights pre-trained on ImageNet were used to initialize the model parameters". The authors should describe this more clearly. Does it mean that the proposed model was first trained on ImageNet data and then trained on the OCT datasets? Or is there an existing model pre-trained on ImageNet (and which model?) whose weights are available? All the details should be described.

8) Again, in section 3.2, there are two patch sizes, 16 and 24. What do they refer to ?

9) Section 2.2 mentions: "Patch embedding was applied to 1x1 patches extracted from CNN feature maps". Does it mean that in this work there is only one patch in the Linear Projection? Please note that this would be a very significant difference from the original design of ResMLP. It also contradicts my question 8, so please clarify.

10) In line 115, note that the tested data is not only the SD-OCT dataset but also the POne dataset.

Finally, I also hope the authors will pay more attention to English editing. Some sentences read confusingly. For example, line 137: ".... were analyzed and background for eight-level semantic high-precision segmentation operation." Another example: in lines 150-154, there are two "which"es inside a single sentence; splitting it into two or more shorter sentences would probably be better.

Author Response

Please see the attachment.

Author Response File: Author Response.doc

Reviewer 3 Report

The authors have addressed most of my comments, which is much appreciated. Still, I think the manuscript could be improved by going through a professional English editing service.

Author Response

Please see the attachment.

Author Response File: Author Response.doc

Round 3

Reviewer 2 Report

Thanks to the authors for their efforts to improve the manuscript further. The presentation in section 2.2.1 is now much more logical and clearer. There are a few minor suggestions:

1. Line 221: the description of LN(.) duplicates line 206.

2. More careful English editing is suggested. Some sentences read oddly, for example in lines 191 and 193.

After these revisions, the manuscript can be accepted.

Author Response

Please see the attachment.

Author Response File: Author Response.doc
