Article
Peer-Review Record

Swin Transformer Assisted Prior Attention Network for Medical Image Segmentation

Appl. Sci. 2022, 12(9), 4735; https://doi.org/10.3390/app12094735
by Zhihao Liao 1,*,†, Neng Fan 2,† and Kai Xu 2,†
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 20 April 2022 / Revised: 5 May 2022 / Accepted: 6 May 2022 / Published: 8 May 2022
(This article belongs to the Special Issue Selected Papers from the ICCAI and IMIP 2022)

Round 1

Reviewer 1 Report

This research proposes a window-based self-attention mechanism called a prior attention network, fuses the features of the skip connections with those of the prior attention network, and refines boundary details for finer segmentation of skin lesions from dermoscopic images. The architecture is composed of a Swin Transformer-assisted prior attention network, a hybrid Transformer network with a multi-head cross-fusion Transformer, and an enhanced attention module. This provides better attention learning and interpretability of the network for accurate and automatic segmentation of medical images. The experimental results show that the method consistently outperforms existing methods on skin lesion segmentation and nuclei segmentation.

Author Response

Response to Reviewer 1 Comments

 

Point 1:

This research proposes a window-based self-attention mechanism called a prior attention network, fuses the features of the skip connections with those of the prior attention network, and refines boundary details for finer segmentation of skin lesions from dermoscopic images. The architecture is composed of a Swin Transformer-assisted prior attention network, a hybrid Transformer network with a multi-head cross-fusion Transformer, and an enhanced attention module. This provides better attention learning and interpretability of the network for accurate and automatic segmentation of medical images. The experimental results show that the method consistently outperforms existing methods on skin lesion segmentation and nuclei segmentation.

 

Response 1:

We have adjusted the structure of the paper and added new content in each section to help readers better understand our proposed method. In short, Section 1 adds the research questions, research ideas, and a brief explanation of the paper's structure; Section 2 adds a detailed introduction of the achievements and results of the state-of-the-art algorithms; Section 3 adds a new description of the contributions of each module in Swin-PANet and focuses on the three items corresponding to Section 1; Section 4 adds a new discussion subsection; and Section 5 concludes the paper while adding a particular use case and an open question of Swin-PANet, as well as our future work.

Although Swin-PANet achieves better performance than some state-of-the-art methods, the proposed network still has limitations in its transfer learning ability. As shown in Table 2, when Swin-PANet is applied to another dataset such as ISIC 2016, it achieves 90.68% and 84.06% in terms of the Dice and IoU metrics. Compared with some specially designed methods such as FAT-Net [42], Ms RED [43], and BAT [44], Swin-PANet still shows a considerable gap in skin lesion segmentation performance. Making Swin-PANet perform well across different segmentation tasks remains challenging, and we believe the backbone capability of Swin Transformer and the potential of combining Transformers and CNNs can make this possible. Our future work is to investigate the transferability of the combination of Swin Transformer and CNNs and to design a more powerful and reliable network structure for medical image segmentation.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Please clearly state some fields of potential clinical implementation. Which kind of clinical improvements could be achieved by using your method?

Author Response

Response to Reviewer 2 Comments

 

Point 1:

Please clearly state some fields of potential clinical implementation. Which kind of clinical improvements could be achieved by using your method?

 

Response 1:

We have extended lines 38-45 in the Introduction section and attached a Featured Application at the head of the paper, stating that this technique can contribute to the computer-aided diagnosis (CAD) of cell cancer and skin cancer to improve the efficiency and accuracy of medical image segmentation. The original contents are as follows:

 

Featured Application: The proposed Swin-PANet can be utilized for computer-aided diagnosis (CAD) of skin cancer or cell cancer to improve segmentation efficiency and accuracy. It can be considered a significant technique for the accurate screening of diseased or abnormal areas in patients, assisting doctors in better evaluating disease and optimizing prevention measures.

 

Lines 38-45

The evaluation and analysis of pathologies based on lesion segmentation provide valuable information, such as the progression of disease, to help physicians improve the quality of clinical diagnosis, monitor treatment strategies, and efficiently predict a patient's outcome. For instance, cell segmentation in microscopic images is a critical challenge in biological study, clinical practice, and disease diagnosis. Robust plasma cell segmentation is the initial step towards detecting malignant cells in the case of Multiple Myeloma (MM), a type of blood cancer. Given the voluminous data accessible, there is an increasing demand for automated methods and tools for cell analysis. Furthermore, due to variable intra-cellular and inter-cellular dynamics, as well as the structural features of cells, there is a constant need for more accurate and effective segmentation models. Hence, accurate medical image segmentation is of great significance for computer-aided diagnosis and image-guided clinical surgery [1-3].

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors propose a novel approach for combining two algorithms to achieve better results in medical image segmentation. The approach is interesting, and the corresponding results show that an improvement can be made.

 

However, the current version of the paper has some flaws and needs to be improved before it can be published. Here are my remarks: 

 

Structure:

the structure of the document is not clear. 

 

Section 1:

  • The „Introduction“ section is not a real introduction. It merely already summarizes some State-Of-The-Art techniques and introduces a lot of prerequisites. 
  • The introduction does not define a Research Question, that is addressed by this paper. And it does not give a particular use case, where the results can provide a benefit. 
  • I would suggest to move all the content from lines 35-66 to section 2 and to clearly define the Research Question of the paper, which can then be addressed by the three mentioned aspects (line 79ff).
  • Furthermore, at the end of the introduction, a brief explanation about the paper’s structure should be given. What is in which section, which methodology does the paper follow, etc.

 

Section 2 should be renamed to "State Of the Art and Related Work" then. 

  • However, several terms and references are already mentioned before. For example, line 92 has been defined in line 40 already. This is a bit confusing for the reader. 
  • It would be important to also mention the outcome of the related work. For example, in line 113, „excellent researches“ are mentioned, however none of their results. Why are they important for this work? What do they contribute? Or where do they distinguish?
  • Lines 133-140 also seem to be repeated without contributing further information to the paper.

 

Section 3 should be renamed to "Modeling, Methods, and Design"

  • In this section, an overview of the structure and purpose of the subsection of section 3 is given at the beginning. Each subsection should be mentioned, e.g. (line 143) „Firstly, the overview…“ -> „Firstly, in section 3.1, the overview…“
  • At the end of section 3, a summary is missing. 
  • Please clearly state the achievements of the modeling and their purpose and how they contribute to answering the research question.
  • Also refer to the „three items“ you mentioned in lines 80-87 and how your modeling reflects them

 

Section 4 

  • This section should also contain some implementation details and examples.
  • Also provide a discussion at the end of the section to summarize the results of the experiments

 

Section 5

  • Is more a summary than a conclusion. Maybe add some ideas for future work or particular use cases, that can now be addressed in a better way.

 

 

Formal Modeling / Writing / Language:

 

In general, the paper needs a detailed rework on language. In many cases, articles like "the" or "a" are missing, often sentences are not completed, and in some cases, passages seem to be repeated without any purpose. Abbreviations are used without being formally introduced, references are missing, and the punctuation needs to be revised. Also, in the formal modeling of this paper, some flaws have been discovered. I will provide a list of examples in the following. However, this list might not be complete and an additional review after the corrections is required. 

 

  • Line 30 „order that helping physicians improve the quality of clinical diagnosis, monitor the plan 31 of treatment strategies, and efficiently judge prediction of patient outcome“ > „order to helping physicians improving the quality…, monitoring the plan…, and efficiently judging the prediction of a patient’s outcome“
  • Line 35 „approaches about medical…“ > „approaches to medical…“
  • Line 38 „The skip connections“ > has to be introduced and shortly defined
  • Line 42 „become“ > „became“
  • Lines 49-52 > repetition
  • Line 52 „these tasks for segmentation accuracy“ > „these tasks in terms of segmentation accuracy“
  • Line 52 „Since the…“ > „Due to the…“
  • Line 54 „to counteract the inductive caused by…“ > I don’t understand, what this should mean
  • Line 59: Transformer and Natural Language Processing (NLP) should be introduced briefly
  • Line 62: „[20] is presented“ > „[20] has presented“
  • Line 62: „combining it with…“ > what is meant by „it“, please be precise here
  • Line 63: please introduce „Swin Transformer“ with a short explanation
  • Line 63: „adopts hierarchical…“ > „adopts a hierarchical…“
  • Line 65: „Take Swin Transformer as vision as backbone“ > „Taking Swin Transformer as a backbone“
  • Line 69 (and multiple times, when this is mentioned): „Transformer, which is consisted of two components“ > „Transformer, which consists of two components“ + „prior attention network and hybrid Transformer network“ > „a prior attention network and a hybrid Transformer network“
  • Line 80 (see Line 69)
  • Line 84 „and its poor“ > „due to its poor“ (do you mean that?)
  • Line 112 „ViT well be trained“ > „ViT to be trained“
  • Line 113 „Specially“ > „Particularly“
  • Line 118 „is firstly“ > „has firstly“
  • Line 123 „on order that“ > „in order to“
  • Line 123 „of network“ > „of the network“ or „of a network“
  • Line 124 „with additive attention gate“ > „with an additive attention gate“
  • Lines 133-140 repetition
  • Line 142: First introduce the structure, then mention the Figure 1
  • Line 143/144 (see Line 69)
  • Line 145 „details about Swin Transformer block“ > „details about the Swin Transformer block“
  • Line 148+149 > I don’t understand this sentence
  • Line 151 (see Line 69)
  • Line 150-154 > rework these sentences
  • Line 156 „from the out of prior…“ > „from outside the prior…“
  • Line 157 „in enhanced…“ > „in an enhanced…“
  • Line 157 „following direct supervision…“ > „subsequent direct supervision“
  • Line 157 „In the ... network, it is basically…“ > „The hybrid Transformer network is basically modified…“
  • Line 166/167 (see Line 69)
  • Formula 1 > LN is not introduced
  • Formula 2 > MLP is not introduced
  • Line 197 „maps own different spatial…“ > „maps have different spatial“
  • Line 198 „to make sure the sample spatial…“ > „to ensure similar spatial…“
  • Line 200 „following“ > „next“
  • Formula 7 > what is D4
  • Line 210 „feature concatenation“ > needs to be introduced and explained. Which algorithm is used for this?
  • Line 212: „It will receive the multiple…“ > „It receives multiple…“
  • Line 214: „In the end“ > „At the end“
  • Formula 9 > what is „Y“
  • Formula 9 > does not fit to Formulae 6-8. Why is this different?
  • Line 222 > check, if Figure 4 can be placed on the same page
  • Formula 11 > H is not defined
  • Line 271: why are these datasets selected? 
  • Line 279 „same experimental protocols“ > please explain briefly also in this paper. Methodology, experiment steps, previous outcome, etc.
  • Line 281 (and further times) „What’s more“ > „Furthermore“
  • Line 294 „Adam optimizer“ > not introduced, please add reference and short explanation
  • Section 4.1.2 > provide further information about tools, programming language, code snippets
  • Line 329 (see 281)
  • Line 339 > default text of a Latex-table. Please revise
  • Line 347 > Figure 6 is not mentioned in the paper
  • Line 366 (see 281)
  • Line 370 „the result indicates that it’s great“ > „the result indicates its great…“
  • Line 379 (see 69)
  • Line 384 „displays“ > „shows“
  • Line 384 „has capability“ > „has the capability“

 

 

Author Response

Response to Reviewer 3 Comments

 

Point 1:

Structure:

the structure of the document is not clear.

 

Response 1:

We have adjusted the structure of the paper and added new content in each section to help readers better understand our proposed method. In short, Section 1 adds the research questions, research ideas, and a brief explanation of the paper's structure; Section 2 adds a detailed introduction of the achievements and results of the state-of-the-art algorithms; Section 3 adds a new description of the contributions of each module in Swin-PANet and focuses on the three items corresponding to Section 1; Section 4 adds a new discussion subsection; and Section 5 concludes the paper while adding a particular use case and an open question of Swin-PANet, as well as our future work.

 

Point 2:

Section 1:

The „Introduction“ section is not a real introduction. It merely already summarizes some State-Of-The-Art techniques and introduces a lot of prerequisites.

The introduction does not define a Research Question, that is addressed by this paper. And it does not give a particular use case, where the results can provide a benefit.

I would suggest to move all the content from lines 35-66 to section 2 and to clearly define the Research Question of the paper, which can then be addressed by the three mentioned aspects (line 79ff).

Furthermore, at the end of the introduction, a brief explanation about the paper’s structure should be given. What is in which section, which methodology does the paper follow, etc.

 

Response 2:

As suggested, we have moved all the content from lines 35-66 to Section 2. We have added the research question and research idea in lines 48-81 of the paper; this content corresponds to the three mentioned aspects (line 79ff). We have also extended lines 38-45 and attached a Featured Application at the head of the paper, stating that this technique can contribute to the computer-aided diagnosis (CAD) of cell cancer and skin cancer to improve the efficiency and accuracy of medical image segmentation. The original contents are as follows:

 

Featured Application: The proposed Swin-PANet can be utilized for computer-aided diagnosis (CAD) of skin cancer or cell cancer to improve segmentation efficiency and accuracy. It can be considered a significant technique for the accurate screening of diseased or abnormal areas in patients, assisting doctors in better evaluating disease and optimizing prevention measures.

 

 

Particular use (lines 38-45 and the above featured application):

For instance, cell segmentation in microscopic images is a critical challenge in biological study, clinical practice and disease diagnosis. Robust plasma cell segmentation is the initial step towards detecting malignant cells in the case of Multiple Myeloma (MM), a type of blood cancer. Given the voluminous data accessible, there is an increasing demand for automated methods and tools for cell analysis. Furthermore, due to variable intra-cellular and inter-cellular dynamics, as well as the structural features of cells, there is a constant need for more accurate and effective segmentation models.

 

Lines 48-81 (research question and research idea, the marked contents correspond to three proposed aspects.)

Convolutional neural networks based on the U-shaped topology are widely popular in medical image segmentation. However, despite great breakthroughs in this field, CNN-based methods mostly show limitations in capturing features to model global dependency, caused by the inherent locality of the convolution operation. The attention mechanism, inspired by human visual cognition and perception, is a useful solution to this problem. Recently, an excellent work called the prior attention network [4] proposed a novel structure consisting of a U-shaped convolutional neural network and an intermediate supervision network between encoder and decoder. It follows a coarse-to-fine strategy and an intermediate supervision strategy. Particularly, they propose an attention-based method called the attention guiding decoder in the intermediate supervision network, which takes the rich information of multi-scale features from the encoder to generate spatial attention maps for guiding the final segmentation of the decoder in the CNN. Further, the process of traditional attention learning is often not humanly interpretable, and the regions focused on by the network often differ from the regions that humans pay attention to. In the prior attention network, the intermediate supervision learning that guides the next step of segmentation can provide the interpretability of the network to a certain extent. However, this non-local attention mechanism has poor capability for aggregating multi-scale features from different modules and extracting boundary information.

Different from the non-local attention-based methods, the Transformer has been proposed as an alternative method for modeling long-range dependency through its self-attention mechanism. The current research directions of Transformers in medical image segmentation are pure Transformer networks and hybrid Transformer networks. An interesting work called UCTransNet [5] investigates the potential limitations and the semantic gap of the skip-connections between the encoder and decoder in U-Net, and proposes a novel design called the channel-wise Transformer module to replace the skip-connections for better fusing multi-scale features from the encoder and reducing the semantic gap. Combining the above excellent works, it is very feasible to apply a Transformer-based method in the intermediate supervision network to further improve the performance of medical image segmentation, enhance the transferability of the prior attention network, and provide the interpretability of the Transformer to a certain extent for better performing the dual supervision strategy. Specially, Swin Transformer [6] is proposed to expand the applicability of the vision Transformer and serve as a general backbone in the field of computer vision. The key design of Swin Transformer is the shifted window-based attention mechanism, in which the shifted windows bridge the windows layer by layer and construct connections for further enhancing the power of modeling long-range dependency.

 

 

Point 3:

Section 2 should be renamed to "State Of the Art and Related Work" then.

However, several terms and references are already mentioned before. For example, line 92 has been defined in line 40 already. This is a bit confusing for the reader.

It would be important to also mention the outcome of the related work. For example, in line 113, „excellent researches“ are mentioned, however none of their results. Why are they important for this work? What do they contribute? Or where do they distinguish?

Lines 133-140 also seem to be repeated without contributing further information to the paper.

 

Response 3:

Section 2 has been renamed to "State of the Art and Related Work". As suggested in the previous point, we have integrated all the content from lines 35-66 (including line 40) into Section 2 (including line 92) to resolve the problem of literature being mentioned repeatedly. We have added the contributions and results of related work, such as Swin Transformer, in lines 164-179. We have removed the repeated content in lines 133-140 concerning other state-of-the-art methods and related work.

 

Lines 164-179 (the importance, contributions and results of Swin Transformer)

Recently there have been several excellent research works based on the vision Transformer [6,31,32]. Particularly, Swin Transformer, an efficient and effective hierarchical vision Transformer, adopts a hierarchical window scheme that imitates the process of a CNN to enlarge the receptive field and is designed as a backbone for vision tasks such as ImageNet classification, object detection and semantic segmentation based on its shifted windows mechanism. It outperforms state-of-the-art methods such as ViT/DeiT [27,30] and ResNet models [12], importantly with similar parameters and latency on these tasks. Swin Transformer obtains 53.5 mIoU on ADE20K, a gain of +3.2 mIoU over state-of-the-art methods such as SETR, and it achieves a 87.3% top-1 accuracy on the ImageNet-1K image classification task. Different from previous works in the field of the vision Transformer, Swin Transformer makes the following two outstanding contributions [6] that distinguish it from other works: (1) Swin Transformer constructs hierarchical feature maps of the image with linear computational complexity; these hierarchical features make Swin Transformer suitable as a general backbone for various computer vision tasks. (2) Swin Transformer proposes a key design in which shifted windows are equipped between consecutive attention layers, which enhances modeling power while maintaining a computation-efficient strategy.

 

Point 4:

Section 3 should be renamed to "Modeling, Methods, and Design"

In this section, an overview of the structure and purpose of the subsection of section 3 is given at the beginning. Each subsection should be mentioned, e.g. (line 143) „Firstly, the overview…“ -> „Firstly, in section 3.1, the overview…“

At the end of section 3, a summary is missing.

Please clearly state the achievements of the modeling and their purpose and how they contribute to answering the research question.

Also refer to the „three items“ you mentioned in lines 80-87 and how your modeling reflects them

 

Response 4:

Section 3 has been renamed to "Modeling, Methods, and Design". We have added an overview of Subsections 3.1-3.4 at the head of Section 3 (lines 207-214) and a summary at the end of Section 3 (lines 380-389). Specifically, we have considerably extended each subsection, including the overview of the network structure (lines 221-243), the Swin Transformer block (lines 245-264), the attention guiding decoder (lines 279-297) and the enhanced attention block (lines 317-336). The marked contents correspond to the three proposed aspects. We reflect these three aspects by introducing the functional design of each module and its image processing mechanisms.

 

Lines 207-214 (an overview of the structure and purpose of the subsection of section 3)

In this section we give a detailed introduction of Swin-PANet. Firstly, in Subsection 3.1, the overview of the proposed network, which consists of a prior attention network and a hybrid Transformer network, is explained. Then, details about the Swin Transformer block and the attention guiding decoder are provided in Subsection 3.2. Thereafter, the hybrid Transformer network with enhanced attention blocks, which receives multiple features for aggregation and refinement, is explained in Subsection 3.3. In Subsection 3.4, the dual supervision strategy that achieves two steps of segmentation in a single network is introduced.

 

Lines 221-243 (the overview of network structure)

Swin-PANet consists of a prior attention network and a hybrid Transformer network. The proposed Swin-PANet is illustrated in Fig. 1. The prior attention network, assisted by Swin Transformer, performs intermediate supervision learning. The hybrid Transformer network with enhanced attention blocks performs direct supervision learning. The dual supervision strategy can enhance the performance and interpretability of the Transformer attention mechanism and provide a humanly interpretable way to guide the attention learning in the Transformer. In the prior attention network, the Swin Transformer block and the attention guiding decoder are cascaded to receive multi-scale features for shifted-window attention learning and feature fusion. The attention prediction from the output of the prior attention network is involved in the enhanced attention block to guide the subsequent direct supervision learning. The hybrid Transformer network is basically modified from the U-Net [9] structure: it adopts a U-shaped topology with skip-connections between encoder and decoder, and enhanced attention blocks to guide the information filtration and cross attention of the Transformer attention features along the channel dimension. The enhanced attention block at each decoder layer receives the multi-scale features of the previous block, the attention prediction generated by the prior attention network and the features from the corresponding skip-connections, which ensures consistency between encoder and decoder and recovers the information discarded by convolution operations. The enhanced attention block also performs a channel-wise feature fusion between global and local contexts for better attention learning. The coarse process of the Swin Transformer and attention guiding decoder, and the finer process of the enhanced attention blocks, are integrated into the proposed coarse-to-fine strategy.
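To make the dual supervision strategy concrete, a minimal PyTorch sketch of how the two sub-networks and the two losses might be composed is given below; the module and loss names are hypothetical placeholders, not the authors' implementation.

import torch
import torch.nn as nn

class SwinPANetSketch(nn.Module):
    """Illustrative composition only: prior_net and hybrid_net stand in for the
    two sub-networks described above (prior attention network and hybrid
    Transformer network)."""
    def __init__(self, prior_net: nn.Module, hybrid_net: nn.Module):
        super().__init__()
        self.prior_net = prior_net    # Swin Transformer blocks + attention guiding decoder
        self.hybrid_net = hybrid_net  # U-shaped network with enhanced attention blocks

    def forward(self, x):
        # Coarse step: intermediate supervision produces an attention prediction.
        attn_pred = self.prior_net(x)
        # Fine step: direct supervision refines the result guided by attn_pred.
        final_pred = self.hybrid_net(x, attn_pred)
        return attn_pred, final_pred

def dual_supervision_loss(attn_pred, final_pred, target,
                          criterion=nn.BCEWithLogitsLoss()):
    # Both the intermediate and the final predictions are supervised
    # against the same ground-truth mask.
    return criterion(attn_pred, target) + criterion(final_pred, target)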

 

Lines 245-264 (Swin Transformer block)

Different from the typical multi-head self-attention (MSA) mechanism in the vision Transformer, Swin Transformer [6] implements its self-attention mechanism based on shifted windows. The shifted windows are equipped between consecutive attention layers, which enhances modeling power while maintaining a computation-efficient strategy. At the same time, hierarchical feature maps of the image are constructed with linear computational complexity. These hierarchical feature maps make Swin Transformer more suitable as a general backbone for various computer vision tasks. Since the non-local attention mechanism in the traditional prior attention network has poor capability for aggregating multi-scale features from different modules and extracting boundary information, we insert the Swin Transformer block into the prior attention network to enhance the modeling power of the network. As shown in Fig. 2, two cascaded Swin Transformer modules construct one complete block. Each Swin Transformer module consists of a LayerNorm layer (LN), a multi-head self-attention module (MSA), a multilayer perceptron (MLP) with the non-linear activation function GELU, and two residual connections around the LayerNorm layers. The important difference [6] is that the window-based multi-head self-attention (W-MSA) module is applied in the first Swin Transformer module and the shifted window-based multi-head self-attention (SW-MSA) module is applied in the second. The two successive Transformer modules thus apply the conventional window and the shifted window partitioning mechanisms, respectively.
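The layout of the two consecutive Swin Transformer modules described above (LN, window-based or shifted-window attention, MLP with GELU, and residual connections) can be sketched as follows; the attention modules are passed in as placeholders rather than implemented, so this is only an illustration of the block structure, not the authors' code.

import torch.nn as nn

class SwinBlockPairSketch(nn.Module):
    """Minimal sketch of two consecutive Swin Transformer modules.
    `w_msa` and `sw_msa` stand for window-based and shifted-window
    multi-head self-attention and are supplied as placeholder modules."""
    def __init__(self, dim: int, w_msa: nn.Module, sw_msa: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim),
                nn.GELU(),                       # non-linearity used in the MLP
                nn.Linear(mlp_ratio * dim, dim),
            )
        self.norm1, self.attn1 = nn.LayerNorm(dim), w_msa
        self.norm2, self.mlp1 = nn.LayerNorm(dim), mlp()
        self.norm3, self.attn2 = nn.LayerNorm(dim), sw_msa
        self.norm4, self.mlp2 = nn.LayerNorm(dim), mlp()

    def forward(self, x):
        # First module: LN -> W-MSA -> residual, then LN -> MLP -> residual.
        x = x + self.attn1(self.norm1(x))
        x = x + self.mlp1(self.norm2(x))
        # Second module: same layout but with shifted-window attention.
        x = x + self.attn2(self.norm3(x))
        x = x + self.mlp2(self.norm4(x))
        return x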

 

Lines 279-297 (Attention guiding decoder)

In traditional cascaded networks, the first step is to perform a coarse segmentation and find the ROIs in medical images. In the typical prior attention network, this process of finding coarse ROIs is performed by the attention guiding decoder. In the proposed Swin-PANet, the attention guiding decoder is utilized to generate the ROI-related attention prediction from the outputs of the Swin Transformer. The attention guiding decoder plays the role of refining the feature representations and improving the quality of segmentation. The refined features are then sent to the multi-level enhanced attention blocks for performing finer segmentation. The coarse process of the Swin Transformer and attention guiding decoder, and the finer process of the enhanced attention blocks, are integrated into the proposed coarse-to-fine strategy. As shown in Fig. 3, the feature maps E1 to E4 extracted from the one-to-one Swin Transformer blocks are fed into the attention guiding decoder for feature fusion.
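A minimal sketch of the fusion performed by the attention guiding decoder is given below, assuming the multi-scale features E1 to E4 are upsampled to a common resolution, concatenated channel-wise, and projected to a single attention map Y; the channel sizes and layer choices are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGuidingDecoderSketch(nn.Module):
    """Assumed behaviour only: multi-scale features E1..E4 are upsampled to the
    finest resolution, concatenated channel-wise, and projected to one map Y."""
    def __init__(self, channels=(96, 192, 384, 768)):   # assumed channel widths
        super().__init__()
        self.proj = nn.Conv2d(sum(channels), 1, kernel_size=1)  # fuse and reduce to one map

    def forward(self, feats):                     # feats: [E1, E2, E3, E4], NCHW tensors
        target_size = feats[0].shape[-2:]         # resolution of the finest feature map
        up = [F.interpolate(f, size=target_size, mode='bilinear', align_corners=False)
              for f in feats]
        fused = torch.cat(up, dim=1)              # channel-wise feature concatenation
        y = torch.sigmoid(self.proj(fused))       # attention prediction Y in [0, 1]
        return y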

 

Lines 317-336 and lines 358-369 (enhanced attention block)

To better perform the dual supervision strategy and fuse multi-scale features of inconsistent semantics between the prior attention network and the hybrid Transformer network, the enhanced attention block is proposed to guide the information filtration and cross attention of the Transformer attention features along the channel dimension. The enhanced attention block is equipped in the decoder layers of the hybrid Transformer network and performs a channel-wise feature fusion between global and local contexts for better attention learning. Compared with the traditional CAA module, the difference is that the enhanced attention block receives multiple features: from the previous level of the enhanced attention block, from the corresponding skip-connection between encoder and decoder, and from the prior attention network assisted by Swin Transformer. As shown in Fig. 4, we take the previous-level output of the enhanced attention block, the attention prediction, and the corresponding-level output of the skip-connection as the input of the next-level enhanced attention block. C, H, W denote the number of channels, and the height and width of the features, respectively. Compared with traditional residual blocks, we set a learnable parameter in the residual paths, which is updated through back propagation and plays a role in retaining effective features and adding non-linearity to the process of integrating and refining the features. On account of the channel-axis attention of the hybrid Transformer network, this parameter can activate the effective channels and restrain the useless channels so as to increase the convergence speed of the network while ensuring segmentation performance.
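A hypothetical PyTorch sketch of this fusion is given below: the previous-level decoder feature, the skip-connection feature, and the attention prediction are combined, with a learnable scalar on the residual path as described; the channel-wise attention is reduced to a simple squeeze-and-gate here, so this should not be read as the exact module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedAttentionBlockSketch(nn.Module):
    """Hypothetical sketch: fuses the previous decoder feature, the skip-connection
    feature and the attention prediction, with a learnable scalar on the residual path."""
    def __init__(self, prev_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Conv2d(prev_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(out_ch, out_ch, kernel_size=1),
                                  nn.Sigmoid())
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable weight on the residual path

    def forward(self, prev, skip, attn_pred):
        # Align spatial sizes, then weight the skip features by the attention prediction.
        prev = F.interpolate(prev, size=skip.shape[-2:], mode='bilinear', align_corners=False)
        attn = F.interpolate(attn_pred, size=skip.shape[-2:], mode='bilinear', align_corners=False)
        x = self.fuse(torch.cat([prev, skip * attn], dim=1))
        # Channel-wise gating plus the learnable residual connection.
        return x + self.gamma * (x * self.gate(x))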

 

Lines 358-369

After the first step of attention learning in the Swin Transformer and attention guiding decoder, the enhanced attention block receives the attention prediction and fuses it with the other multiple features. It can acquire semantic-rich features and recover the original image information lost during the attention calculation. On the other hand, the second step of attention learning in the enhanced attention block guides the information filtration and cross attention of the attention features along the channel dimension, which enhances the ability to extract global context features and model channel-wise long-range dependency.

 

Point 5:

Section 4

This section should also contain some implementation details and examples.

Also provide a discussion at the end of the section to summarize the results of the experiments

 

Response 5:

We have added implementation details such as the experimental setup (lines 419-421), code snippets (Appendix A), and an introduction of the Adam optimizer (lines 432-434). We have also provided an "Experimental Summary and Discussion" at the end of the section (lines 531-561).

 

Lines 419-421 (Experimental setup)

The proposed Swin-PANet is implemented on a single NVIDIA RTX 3060 Ti. The experimental setup includes PyCharm 2021, Python 3.5.7, and the PyTorch 1.9.0 framework on an Ubuntu 20.04 server.

 

Lines 430-434 (Adam optimizer)

To achieve faster convergence in training, the Adam optimizer is employed with a learning rate of 0.005 to optimize the performance of the network. The Adam optimizer [40] is straightforward to implement, is memory- and computation-efficient, and is well suited to networks with a large number of parameters.
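As a hedged illustration of this setup, the training step below uses torch.optim.Adam with the stated learning rate of 0.005; `model` and `train_loader` are assumed to exist, and the dual-output loss mirrors the sketches above rather than the authors' exact code.

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)   # lr = 0.005 as stated above

for images, masks in train_loader:               # hypothetical DataLoader over image/mask pairs
    optimizer.zero_grad()
    attn_pred, final_pred = model(images)        # dual outputs, as in the earlier sketches
    loss = criterion(attn_pred, masks) + criterion(final_pred, masks)
    loss.backward()
    optimizer.step()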

 

Lines 531-561 (Experimental Summary and Discussion)

To combine the advantages of two excellent algorithms and enhance the attention ability of the network, the proposed Swin-PANet integrates both the attention guiding decoder and the Swin Transformer into the prior attention network to capture both global contextual information and local features. Furthermore, an enhanced attention block is designed for better performing the coarse-to-fine strategy and enhancing the attention ability of the network. Experiments and ablation studies on the GlaS, MoNuSeg, and ISIC 2016 datasets have been conducted for overall evaluation. The proposed Swin-PANet can extract rich global contextual information with the shifted-window self-attention mechanism in the prior attention network, and enhanced attention blocks are equipped in the hybrid Transformer network to capture channel-wise long-range dependency and further strengthen the attention ability of the network. The quantitative comparison on the GlaS and MoNuSeg datasets shows that Swin-PANet consistently achieves better performance than other state-of-the-art methods. Compared with UCTransNet, the proposed Swin-PANet improves from 89.84% (82.24%) to 91.42% (84.88%) in terms of the Dice (IoU) metrics on the GlaS dataset, and from 79.87% (66.68%) to 81.59% (69.00%) in terms of the Dice (IoU) metrics on the MoNuSeg dataset. On the other hand, the ISIC 2016 dataset is selected for evaluating the contribution of each module in Swin-PANet. The quantitative and visual comparisons of skin lesion segmentation confirm the effectiveness of the enhanced attention block and the dual supervision strategy in Swin-PANet. The enhanced attention block is applied in the hybrid Transformer network to achieve better channel-wise feature fusion between global and local contexts.

Although Swin-PANet achieves better performance than some state-of-the-art methods, the proposed network still has limitations in its transfer learning ability. As shown in Table 2, when Swin-PANet is applied to another dataset such as ISIC 2016, it achieves 90.68% and 84.06% in terms of the Dice and IoU metrics. Compared with some specially designed methods such as FAT-Net [42], Ms RED [43], and BAT [44], Swin-PANet still shows a considerable gap in skin lesion segmentation performance. Making Swin-PANet perform well across different segmentation tasks remains challenging, and we believe the backbone capability of Swin Transformer and the potential of combining Transformers and CNNs can make this possible.
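For reference, the Dice and IoU metrics quoted above can be computed for binary masks as in the following sketch; the paper's own evaluation code may differ in detail.

import torch

def dice_iou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Compute Dice and IoU for binary masks; pred and target are {0, 1} tensors."""
    pred, target = pred.float().flatten(), target.float().flatten()
    inter = (pred * target).sum()
    dice = (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (pred.sum() + target.sum() - inter + eps)
    return dice.item(), iou.item()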

 

Point 6:

Section 5

Is more a summary than a conclusion. Maybe add some ideas for future work or particular use cases, that can now be addressed in a better way.

 

Response 6:

As suggested, we have provided a better conclusion of our research work, adding particular use cases (Section 1) and our future work (lines 563-587).

 

Lines 563-587 (conclusion of our research work, particular use cases and future work)

In this paper, we proposed a novel network structure combining two algorithms, called Swin-PANet, which follows the coarse-to-fine strategy and dual supervision strategy and aims to perform accurate segmentation of medical images, including the challenging tasks of cell segmentation and skin lesion segmentation. The proposed Swin-PANet can be utilized for computer-aided diagnosis (CAD) of skin cancer to improve segmentation efficiency and accuracy, and can be considered a significant technique for the accurate screening of diseased or abnormal areas in patients, assisting doctors in better evaluating disease and optimizing prevention measures.

 

In conclusion, the proposed Swin-PANet integrates both the Swin Transformer and the attention guiding decoder into the prior attention network for performing intermediate supervision learning and capturing pixel-level global contextual information. Furthermore, an enhanced attention block is designed to perform channel-wise feature fusion between global and local contexts, better realize the coarse-to-fine strategy, and enhance the attention ability of the network. Extensive experiments are conducted on three public datasets (GlaS, MoNuSeg, and ISIC 2016) for the overall evaluation of the proposed Swin-PANet. The quantitative and visual comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed Swin-PANet and its excellent performance in these segmentation tasks based on our coarse-to-fine and dual supervision strategies.

 

Although Swin-PANet achieves better performance than some state-of-the-art methods, the proposed network still has limitations in its transfer learning ability. Recent work [2] demonstrates that the Transformer has superior transferability for various downstream tasks under pre-training. Our future work is to investigate the transferability of the combination of Swin Transformer and CNNs and to design a more powerful and reliable network structure for medical image segmentation.

 

Point 7:

Formal Modeling / Writing / Language:

In general, the paper needs a detailed rework on language. In many cases, articles like "the" or "a" are missing, often sentences are not completed, and in some cases, passages seem to be repeated without any purpose. Abbreviations are used without being formally introduced, references are missing, and the punctuation needs to be revised. Also, in the formal modeling of this paper, some flaws have been discovered. I will provide a list of examples in the following. However, this list might not be complete and an additional review after the corrections is required.

 

Response 7:

We have carefully corrected all grammatical errors in the paper and made improvements regarding the other issues mentioned. The other mentioned issues are as follows:

 

Lines 49-52 > repetition

We have removed this content from lines 49-52.

 

Line 59: Transformer and Natural Language Processing (NLP) should be introduced briefly

We have moved this content into Section 2 for integration and describe it in one sentence.

Lines 151-152

Transformer [26] was initially proposed for solving the problems of machine translation and made eminent contributions in natural language processing (NLP).

 

Line 63: please introduce „Swin Transformer“ with a short explanation

Lines 163-165

Particularly, Swin Transformer, an efficient and effective hierarchical vision Transformer, adopts a hierarchical window scheme that imitates the process of a CNN to enlarge the receptive field and is designed as a backbone for vision tasks.

Lines 133-140 repetition

We have removed this content from lines 133-140.

 

Line 142: First introduce the structure, then mention the Figure 1

Lines 220-221

Swin-PANet consists of a prior attention network and a hybrid Transformer network. The proposed Swin-PANet is illustrated in Fig. 1.

 

Line 148+149 > I don’t understand this sentence

We have removed this content from lines 148-149 and reintroduced it in Section 3.1.

Lines 221-225

The prior attention network, assisted by Swin Transformer, performs intermediate supervision learning. The hybrid Transformer network with enhanced attention blocks performs direct supervision learning. The dual supervision strategy can enhance the performance and interpretability of the Transformer attention mechanism and provide a humanly interpretable way to guide the attention learning in the Transformer.

 

Line 150-154 > rework these sentences

We have rewritten this part (lines 150-154); it reads as follows:

Lines 221-225

Swin-PANet consists of a prior attention network and a hybrid Transformer network. The proposed Swin-PANet is illustrated in Fig. 1. The prior attention network, assisted by Swin Transformer, performs intermediate supervision learning. The hybrid Transformer network with enhanced attention blocks performs direct supervision learning.

 

Formula 1 > LN is not introduced

Formula 2 > MLP is not introduced

Lines 256-259

It can be seen that each Swin Transformer module consists of a LayerNorm layer (LN), a multi-head self-attention module (MSA), a multilayer perceptron (MLP) with the non-linear activation function GELU, and two residual connections around the LayerNorm layers (LN).

 

Formula 7 > what is D4

There’s a typing mistake and actually D4 is D3.

 

Formula 9 > what is „Y“

Lines 307-308

At the end of the attention guiding decoder, the output Y is computed and sent to the loss function calculation in the prior attention network.

 

Formula 9 > does not fit to Formulae 6-8. Why is this different?

There’s no different between them. We just want to focus on the end processing mechanism of the module and add content to it, so distinguish it.

 

Line 222 > check, if Figure 4 can be placed on the same page

We have solved this problem; it can be seen on page 9.

 

Formula 11 > H is not defined

Lines 329-330

C, H, W denote the number of channels, and the height and width of the features, respectively. Compared with traditional residual blocks, …

 

Line 271: why are these datasets selected?

Lines 396-398

For overall comparison with other state-of-the-art methods in the direction of Transformers complementing CNNs, the gland segmentation (GlaS) dataset [38] and the MoNuSeg segmentation dataset [39] are selected to evaluate our proposed method.

Lines 408-410

Furthermore, to investigate the performance of the proposed modules and further demonstrate the effectiveness of Swin-PANet, the skin lesion segmentation task of the International Symposium on Biomedical Imaging (ISBI) 2016 challenge is also used to evaluate the proposed Swin-PANet.

 

Line 279 „same experimental protocols“ > please explain briefly also in this paper. Methodology, experiment steps, previous outcome, etc.

These experimental protocols indicate how to divide the training and test sets and how to measure the performance.

Lines 404-408

Experiments on the GlaS and MoNuSeg datasets follow the same experimental protocols as the recent research [5], in which the datasets are divided according to the training and test sets provided by the competition, and a five-fold cross-validation strategy is conducted on the training and test sets for fair comparison.
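A minimal sketch of such a five-fold cross-validation split using scikit-learn is shown below; the sample count is a placeholder, since the actual division follows the competition's provided training and test sets.

from sklearn.model_selection import KFold

num_training_images = 85        # placeholder count; actual splits follow the competition data
sample_ids = list(range(num_training_images))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(sample_ids)):
    # Train on `train_idx`, evaluate on `val_idx`, then average metrics over the five folds.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation samples")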

 

Line 294 „Adam optimizer“ > not introduced, please add reference and short explanation

Lines 431-433

The Adam optimizer [40] is straightforward to implement, is memory- and computation-efficient, and is well suited to networks with a large number of parameters.

 

Section 4.1.2 > provide further information about tools, programming language, code snippets

Lines 419-422

The proposed Swin-PANet is implemented on a single NVIDIA RTX 3060 Ti. The experimental setup includes PyCharm 2021, Python 3.5.7, and the PyTorch 1.9.0 framework on an Ubuntu 20.04 server. Code snippets of the network structure are provided in Appendix A.

 

Line 339 > default text of a Latex-table. Please revise

Lines 480-482

Table 1. Quantitative comparison of state-of-the-art methods on the GlaS and MoNuSeg datasets. For simplicity, AttUNet and MRUNet denote Attention U-Net and MultiResUNet, respectively. UCTransNet-pre denotes UCTransNet based on pre-training.

 

Line 347 > Figure 6 is not mentioned in the paper

Lines 521-529

Furthermore, the visual comparisons of the ablation studies are shown in Fig. 6. The first two rows represent situations with very ambiguous boundaries of the skin lesion areas. The third row represents a situation in which hair partially covers the skin lesion and destroys local contextual information. The last row represents a situation with immense variability of the skin lesion, including irregular shape and ambiguous boundary. The proposed Swin-PANet performs better segmentation compared with other combinations with the baseline. It can be seen in the second row that the proposed method has the capability of modeling global dependencies between boundary pixels and accurately segmenting the ambiguous boundary.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Thank you for considering my remarks and updating the paper. Just two minor remaining topics: line 117 (empty bullet) and line 592 (nippet > snippet). 

Author Response

Response to Reviewer 3 Comments

 

Point 1:

Structure:

Just two minor remaining topics: line 117 (empty bullet) and line 592 (nippet > snippet).

 

Response 1:

We have corrected these problems in the paper; the corrections can be seen in line 117 (a new bullet) and line 607 (nippet > snippet) of the PDF and Word versions. Thank you very much for your suggestions.

 

Line 117

The proposed Swin-PANet addresses the dilemma that traditional Transformer networks have poor interpretability in the process of attention calculation; Swin-PANet inserts its attention predictions into the prior attention network for intermediate supervision learning, which is humanly interpretable and controllable.

 

Lines 607

Appendix A. Code snippets of the network structure.

Author Response File: Author Response.pdf
