Peer-Review Record

STHarDNet: Swin Transformer with HarDNet for MRI Segmentation

by Yeonghyeon Gu †, Zhegao Piao † and Seong Joon Yoo *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Appl. Sci. 2022, 12(1), 468; https://doi.org/10.3390/app12010468
Submission received: 19 November 2021 / Revised: 30 December 2021 / Accepted: 1 January 2022 / Published: 4 January 2022
(This article belongs to the Special Issue Applications of Artificial Intelligence in Medicine Practice)

Round 1

Reviewer 1 Report

The authors propose an architecture called STHarDNet, which combines HarDNet blocks and Swin Transformers in a U-Net shape designed for MRI segmentation. The content of the document is scientifically adequate and mostly well written, although some parts need clarification and revision for typos, grammatical errors, and punctuation errors. The manuscript describes the components needed to replicate the results, but some elements in the Discussion of the results need further revision.

The description of the characteristics of the proposed model needs further editing.

The reviewer advises first describing the backbone of the architecture, which is HarDNet, before describing the proposal; this helps the reader understand which elements the authors are changing to achieve better results. The same applies to the Swin Transformer block. This means moving the contents of Sections 4.1 and 4.2 before the description of STHarDNet.

On page 9, "Dice" should always be uppercase; please correct. Just to clarify for the authors: both IoU and DSC are used in computer-vision papers that address semantic segmentation tasks. Dice and IoU are used interchangeably in the medical area and in any other area that focuses on semantic segmentation (robotics, machine learning, medical imaging, etc.).

The authors mention that the loss function is the sum of DSC and cross-entropy; are these weighted the same, or do the authors assign a different weight to each term?

In Table 3, the performance of D-UNet is not the same as reported in [18]. The correct value should be 0.5349, which is closer to the authors' result. Why are these values different from those reported in [18]? If the authors are reporting results obtained by implementing their own versions of the models, this might not be a fair comparison due to small randomness in the training process.

The authors should include examples of segmentation results: the ground truth versus their proposal and other state-of-the-art methods. This is important for visually assessing the performance of the model.

Section 5.4, Speed Comparison, needs rework to address the results of the method. The authors measure the inference time required for different image batches (100, 1K, 10K, and 100K frames), but what is the ideal quantity for comparison? ATLAS has 189 images per case, which is between 100 and 1,000 images, so why is it important to report data on 10K and 100K frames? Time is an essential factor in medical imaging, but the number of images that an MRI can provide is limited. Unless the network is classifying multiple patients per batch from different sources, 10K and 100K frames are not important for this specific use case. For other scenarios (such as robotics) this would be different.

Also, the authors should put in bold the best results for 100, 1K, and 10K frames, not only 100K.

Another important point is the difference between the inference times calculated for 100 and 1,000 frames across all the networks. Are the authors including the loading time of the model in the inference measurements? It seems so. As an example, U-Net takes 4.7 seconds for 100 images and 7.1 seconds for 1,000 images; this only makes sense if the model takes around 4 seconds to load and the rest to process the images. The same holds for the other reported models. This is also evident because the only linear relation is between 10K and 100K, i.e., 100K frames take approximately 10 times longer than 10K frames.
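To make this arithmetic explicit, using only the U-Net numbers quoted above: if the measured time follows T(n) = L + n·t, where L is a fixed startup (loading) cost and t the per-frame inference cost, then t = (7.1 − 4.7) / (1000 − 100) ≈ 2.7 ms per frame and L ≈ 4.7 − 100 × 0.0027 ≈ 4.4 s, i.e., a roughly constant 4-second overhead that dominates the small-batch measurements.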

Are the authors testing under the same conditions for the small batches (100 and 1,000 frames)? Are the GPUs preheated before running these small batches? For very small batches of images, it is important to first do a dummy run to bring the GPUs up to a temperature of 40 degrees and then run the final test, because the silicon performs worse below 40 degrees.
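For illustration, a minimal sketch of the warm-up-then-time procedure described above, assuming a PyTorch model on a CUDA device (the names `model`, `loader`, and the input shape are illustrative, not taken from the paper):

```python
import time
import torch

def timed_inference(model, loader, device="cuda", warmup_iters=10):
    """Time pure inference after a GPU warm-up, excluding model-loading cost."""
    model = model.to(device).eval()
    dummy = torch.randn(1, 1, 224, 224, device=device)  # illustrative input shape
    with torch.no_grad():
        for _ in range(warmup_iters):  # dummy runs bring GPU clocks/temperature up
            model(dummy)
        torch.cuda.synchronize()       # make sure warm-up kernels have finished
        start = time.perf_counter()
        for batch in loader:
            model(batch.to(device))
        torch.cuda.synchronize()       # wait for all inference kernels to finish
        elapsed = time.perf_counter() - start
    return elapsed
```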

Author Response

Response to Reviewer 1 Comments

Point 1: The reviewer advises first describing the backbone of the architecture, which is HarDNet, before describing the proposal; this helps the reader understand which elements the authors are changing to achieve better results. The same applies to the Swin Transformer block. This means moving the contents of Sections 4.1 and 4.2 before the description of STHarDNet.

Response 1: The suggested modification has been made. (Sections 4.1 and 4.2 are now presented before the description of STHarDNet.)

Point 2: On page 9, "Dice" should always be uppercase; please correct. Just to clarify for the authors: both IoU and DSC are used in computer-vision papers that address semantic segmentation tasks. Dice and IoU are used interchangeably in the medical area and in any other area that focuses on semantic segmentation (robotics, machine learning, medical imaging, etc.).

Response 2: “dice” has been changed to “Dice” throughout.

Point 3: The authors mention that the loss function is the sum of DSC and cross-entropy; are these weighted the same, or do the authors assign a different weight to each term?

Response 3: DSC and cross-entropy are used with the same weight (Page 9).
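For concreteness, a minimal sketch of an equally weighted Dice + cross-entropy loss for binary segmentation, assuming a PyTorch setup (this illustrates the stated equal weighting and is not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1.0):
    """Unweighted sum of soft Dice loss and binary cross-entropy."""
    prob = torch.sigmoid(logits)
    intersection = (prob * target).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return dice + ce  # both terms carry weight 1.0
```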

Point 4: In Table 3, the performance of D-UNet is not the same as reported in [18]. The correct value should be 0.5349, which is closer to the authors' result. Why are these values different from those reported in [18]? If the authors are reporting results obtained by implementing their own versions of the models, this might not be a fair comparison due to small randomness in the training process.

The authors should include examples of segmentation results: the ground truth versus their proposal and other state-of-the-art methods. This is important for visually assessing the performance of the model.

Response 4: We re-ran the experiments of [18] but still did not obtain very good results. The purpose of the experiment is to compare the performance of the models under identical conditions, so all models were trained under the conditions described in Section 5.2, Parameter Settings.

Point 5: Section 5.4, Speed Comparison, needs rework to address the results of the method. The authors measure the inference time required for different image batches (100, 1K, 10K, and 100K frames), but what is the ideal quantity for comparison? ATLAS has 189 images per case, which is between 100 and 1,000 images, so why is it important to report data on 10K and 100K frames? Time is an essential factor in medical imaging, but the number of images that an MRI can provide is limited. Unless the network is classifying multiple patients per batch from different sources, 10K and 100K frames are not important for this specific use case. For other scenarios (such as robotics) this would be different.

Also, the authors should put in bold the best results for 100, 1K, and 10K frames, not only 100K.

Another important point is the difference between the inference times calculated for 100 and 1,000 frames across all the networks. Are the authors including the loading time of the model in the inference measurements? It seems so. As an example, U-Net takes 4.7 seconds for 100 images and 7.1 seconds for 1,000 images; this only makes sense if the model takes around 4 seconds to load and the rest to process the images. The same holds for the other reported models. This is also evident because the only linear relation is between 10K and 100K, i.e., 100K frames take approximately 10 times longer than 10K frames.

Are the authors testing under the same conditions for the small batches (100 and 1,000 frames)? Are the GPUs preheated before running these small batches? For very small batches of images, it is important to first do a dummy run to bring the GPUs up to a temperature of 40 degrees and then run the final test, because the silicon performs worse below 40 degrees.

Response 5: Thank you for providing insightful suggestions.

In this study, the GPU was not preheated during the speed comparison experiment. When we initially tested 100 and 1K images, the speed differences between the models were unexpected, owing to the GPU not being preheated. Additionally, if the number of images is too small, it is difficult to compare the speed differences between the models.

Therefore, we decided to report the result of the experiment with 10K images as the speed of the final model. (The results for 100, 1,000, and 100,000 frames were omitted.)

MRI data comprise 189 images per patient, so when 100 patients are screened, approximately 20,000 images (100 × 189 = 18,900) must be processed. From an application point of view, a faster model is better if several hospitals across the country want to transmit patients' MRI images to a server for analysis. Therefore, in this study, we emphasized the speed of the model, considering the large amount of data to be analyzed.

Author Response File: Author Response.docx

Reviewer 2 Report

Abstract

I think the abstract needs to be revised; it does not state the problem to be solved clearly at the beginning, nor the hypothesis. It seems to be a summary of the method, with too much detail for an abstract.

Introduction

  • CT scans are also 3D studies, so it is strange to state this as a benefit of MRI. An advantage of MRI is the reduction of radiation. Also, the resolution of CT is higher than that of MRI. This paragraph should be modified, as it is incorrect.
  • Please include references to HarDNet (67). Please justify the use of these blocks for the current problem. The goal of HarDNet is to reduce inference times, which is especially important for real-time applications, which is not the case here. Could the authors please justify or explain the hypothesis of why these networks were selected?
  • Figure captions could be more descriptive.
  • A description of HarDNet and Swin Transformers is included in the method section. I think everything should go there, instead of dividing it between the introduction and the method. I would keep in the introduction only the basic information needed to explain why the decision to use these state-of-the-art networks was taken.

Dataset

  • Why are not all of the images available in ATLAS used?
  • I would include a better explanation of the dataset, stating that there are 189 MR scans (which are 3D scans) and that 43,281 slices are densely (at the pixel level) annotated from those scans. The same words are used when referring to 3D scans and to slices.

Method

  • I would start by introducing the base networks, and then the proposed network based on them.
  • I suggest removing Table 1, as it does not provide essential information. It is about HarDNet and not the proposed STHarDNet.

Experiments

  • The description of how scans are used for evaluation is not clear. It would be enough to explain that 52 scans are used for testing and 177 for training. The following sentence is not clear: "the disease of a patient was predicted using 189 MRI scan images captured for the patient 278 as input values of the model".
  • What is the rationale for combining Dice and cross-entropy in the loss function?
  • Table 3 exceeds the margins.
  • Why a speed comparison is needed is not clear for this application. This is related to my comment on the introduction about why improving efficiency is determinant in this application.
  • Qualitative results are missing (images showing segmentation results vs. ground truth).

Conclusions

  • I would not say that an "excellent performance" is obtained, but rather an improvement over the SOTA.
  • Limitations of the study and further future work should be included.
  • There are other works (https://arxiv.org/ftp/arxiv/papers/1911/1911.11209.pdf) where segmentation results are better; why are these not mentioned? What is the benefit of the proposed method, taking into account that lower Dice scores are obtained?

Author Response

Point 1: I think the abstract needs to be revised; it does not state the problem to be solved clearly at the beginning, nor the hypothesis. It seems to be a summary of the method, with too much detail for an abstract.

Response 1: We have substantially revised the abstract in accordance with your suggestion.

Point 2: CT scans are also 3D studies, so it is strange to state this as a benefit of MRI. An advantage of MRI is the reduction of radiation. Also, the resolution of CT is higher than that of MRI. This paragraph should be modified, as it is incorrect.

Response 2: We have deleted the corresponding sentence: “Furthermore, MRI images are 3D images and have the advantage that the resolution and readability are higher than those of CT images.” (Lines 45–46).

Point 3: Please include references to HarDNet (67).

Response 3: We have added a reference for HarDNet.

Point 4: Please justify the use of these blocks for the current problem. The goal of HarDNet is to reduce inference times, which is especially important for real-time applications, which is not the case here. Could the authors please justify or explain the hypothesis of why these networks were selected?

Response 4: The goal of this study is to improve the performance of the segmentation model by combining the characteristics of a CNN and a transformer:

  • The CNN model completes the task while analyzing local regions of the image.
  • The transformer completes the task while analyzing the sequence data across the whole image.

However, in an actual field stroke-image-analysis application, the speed of the model is important: the golden time for strokes, from onset to diagnosis and treatment, is less than one hour. Thus, the HarDNet block was selected as the base model.

The Swin Transformer is a model better suited to segmentation/detection, built on the Vision Transformer. It solves the problem of the computational cost increasing rapidly with the size of the input image in the original Vision Transformer. Thus, in this study, the Swin Transformer is used in combination with the CNN model.
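For context, the standard complexity comparison that underlies this point (taken from the Swin Transformer literature, not a result of this paper): for an h × w grid of patch tokens with channel dimension C, global multi-head self-attention costs Ω(MSA) = 4hwC² + 2(hw)²C, which is quadratic in the number of tokens hw, whereas window-based self-attention with fixed M × M windows costs Ω(W-MSA) = 4hwC² + 2M²hwC, which is linear in hw for fixed M.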

Point 5: Figure captions can be more descriptive.

A description of HarDNet and Swin Transformers is included in the method section. I think everything should go there, instead of dividing it between the introduction and the method. I would keep in the introduction only the basic information needed to explain why the decision to use these state-of-the-art networks was taken.

Response 5: We believe the figures in the paper require little additional description. We use a table to show the structure of the main model's architecture.

Point 6: Dataset

Why are not all of the images available in ATLAS used?

I would include a better explanation of the dataset, stating that there are 189 MR scans (which are 3D scans) and that 43,281 slices are densely (at the pixel level) annotated from those scans. The same words are used when referring to 3D scans and to slices.

Response 6: All of the ATLAS dataset was used for training and validation. We have revised the corresponding sentences to indicate this (Lines 187–188).

Point 7: Method

I would start by introducing the base networks, and then the proposed network based on them.

I suggest removing Table 1, as it does not provide essential information. It is about HarDNet and not the proposed STHarDNet.

Response 7: We have revised the logical order accordingly.

Table 1 provides a detailed explanation of HarDNet. This is necessary because the proposed model is based on HarDNet combined with the Swin Transformer. Furthermore, the sizes of the input/output feature maps of the Swin Transformer can be read from the table. Thus, we think this table is necessary.

Point 8: Experiments

The description of how scans are used for evaluation is not clear. It would be enough to explain that 52 scans are used for testing and 177 for training. The following sentence is not clear: "the disease of a patient was predicted using 189 MRI scan images captured for the patient 278 as input values of the model".

Response 8: In this study, the MRI scans of 229 patients were used. Approximately 80% of the data, the scans of 177 patients (177 × 189 slices), were randomly assigned as training data, and the scans of the remaining 52 patients (52 × 189 slices) were designated as validation data.

An MRI scan of one patient is a 3D image composed of 189 slices. Thus, in the performance verification, the performance measured over the 189 slices of one patient was defined as one result.
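As a consistency check on these counts: (177 + 52) × 189 = 229 × 189 = 43,281 slices in total, which matches the overall slice count of the ATLAS dataset cited in Point 6.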

Point 9: What is the rationale for combining Dice and cross-entropy in the loss function?

Response 9: DSC and cross-entropy are used with the same weight (Page 9, Line 290).

Point 10: Table 3 exceeds the margins.

Why a speed comparison is needed is not clear for this application. This is related to my comment on the introduction about why improving efficiency is determinant in this application.

Qualitative results are missing (images showing segmentation results vs. ground truth).

Response 10: The golden time for strokes, from onset to diagnosis and treatment, is less than one hour. However, in a real application, not only one patient is treated; the scans of several stroke patients can be received at the same time. Therefore, we think the diagnosis speed is also important.

Point 11: Conclusions

I would not say that an "excellent performance" is obtained, but rather an improvement over the SOTA.

Limitations of the study and further future work should be included.

There are other works (https://arxiv.org/ftp/arxiv/papers/1911/1911.11209.pdf) where segmentation results are better; why are these not mentioned? What is the benefit of the proposed method, taking into account that lower Dice scores are obtained?

Response 11: Thank you for your valuable comments.

We now state that the performance is superior to that of the SOTA models used in the experiment (Line 350). We have also added information on future research (Lines 374–376).

In this study, models using 3D layers were excluded from the experimental comparison; we carried out our comparison with models whose code is openly available. Regarding the paper you indicated, we did not find the code for the corresponding model on the internet; therefore, we did not include it in our comparative experiment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Please review page 11, Section 5.4, Speed Comparison Experiments of Models. The authors still mention the experiments using 100 images. Please correct.

Author Response

Point 1: Please review page 11, Section 5.4, Speed Comparison Experiments of Models. The authors still mention the experiments using 100 images. Please correct.

Response 1: Thank you for your comments. We have modified the content at Line 327.

Author Response File: Author Response.docx
