Article
Peer-Review Record

STAVOS: A Medaka Larval Cardiac Video Segmentation Method Based on Deep Learning

Appl. Sci. 2024, 14(3), 1239; https://doi.org/10.3390/app14031239
by Kui Zeng, Shutan Xu, Daode Shu and Ming Chen *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 23 December 2023 / Revised: 24 January 2024 / Accepted: 25 January 2024 / Published: 2 February 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript proposes a novel deep-learning-based video segmentation method for medaka ventricles. The manuscript mainly contains three components: 1) constructing a video object segmentation dataset comprising over 7000 microscopic images of medaka ventricles; 2) proposing a semi-supervised video object segmentation model named STAVOS using a spatiotemporal attention mechanism; 3) developing an automated system capable of calculating various parameters and visualizing results for medaka ventricles. While the proposed method demonstrated promising results compared to the conventional U-Net model and the SOTA TBD model, the manuscript requires improvements in terms of the English language delivery and organization. Some parts of the manuscript lack adequate explanations and clarity. Additionally, the Discussion section is missing. Specific recommendations for revising the manuscript are provided below:

  1. Line 24: Please spell out “TBD” when using the acronym for the first time.
  2. Lines 30-34: Fix the grammar mistake in the sentence; consider breaking it into two sentences if necessary.
  3. Line 39: Please explain what “accuracy” refers to here.
  4. Lines 49-51: Please cite pertinent references to support the assertion that “the accuracy of u-net in medaka ventricular segmentation task is not high”.
  5. Lines 87-93: Fix the grammar mistake in the sentence; consider breaking it into two sentences if necessary.
  6. Lines 155-156: Please provide examples of data labeled as “Better” and “Lower”. Currently, these terms are too vague and subjective.
  7. Figure 2: Please label all terms in the figure. The figure itself should be self-explanatory.
  8. Lines 216-219: Please label all letters in the equations.
  9. Figure 3: Please label all terms and provide more detailed elaboration in the figure caption.
  10. Please include a discussion section before the conclusion.
Comments on the Quality of English Language

Please see the comments above.

Author Response

Dear Editors,

Thank you very much for taking the time to review this manuscript and for providing so many valuable comments. Your feedback has made us acutely aware of numerous shortcomings in our paper.

We have addressed each specific comment you raised and made corresponding modifications.

 

Comments 1: [Line 24: Please spell out “TBD” when using the acronym for the first time.]

Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.

[Line 24: Tackling Background Distraction (TBD)]

 

Comments 2: [Line 30—34: Fix the grammar mistake in the sentence; consider breaking it into two sentences if necessary.]

Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we rewrote this sentence.

[Line 31-35: Medaka (Oryzias latipes) is a commonly used model organism with significant importance in cardiovascular disease research [1], widely applied in the fields of genetic modification and drug development. This is attributed to the high genetic similarity of medaka to humans, low cultivation costs, and a fast growth cycle. Additionally, during the early larval stages, the bodies of medaka larvae are transparent.]

 

Comments 3: [Line 39: Please explain what “accuracy” it is.]

Response 3: Thank you for pointing this out. We agree with this comment. Here, "accuracy" refers to the precision of manual measurement. However, due to the rewriting of the introduction section, this sentence has been deleted.

 

Comments 4: [Line 49—51: Please cite pertinent references to support the assertion that “the accuracy of u-net in medaka ventricular segmentation task is not high”.]

Response 4: Thank you for pointing this out. We agree with this comment. Therefore, we have added the relevant references.

[Line 71: For instance, Bohan Zhang et al. [10]]

 

Comments 5: [Line 87—93: Fix the grammar mistake in the sentence; consider breaking it into two sentences if necessary.]

Response 5: Thank you for pointing this out. We agree with this comment. Therefore, we rewrote this sentence.

[Line 95-101: By providing a video capturing the heartbeat of medaka larvae and a mask for one frame, the system can perform ventricular segmentation. Based on the precise segmentation results from STAVOS, the system automates the computation of various cardiac parameters of the medaka ventricle, such as heart rate (HR), stroke volume (SV), fractional shortening (FS), ejection fraction (EF), etc. Some of these parameters will undergo visualization to facilitate further research by relevant biologists into medaka.]

 

Comments 6: [Lines 155-156: Please provide examples of data labeled as “Better” and “Lower”. Currently, these terms are too vague and subjective.]

Response 6: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten this section. We provide examples of data labeled as “Better” and “Lower”, as shown in Figure 1.

[Line 158-175]

Comments 7: [Figure 2: Please label all terms in the figure. The figure itself should be self-explanatory.]

Response 7: Thank you for pointing this out. We agree with this comment. Therefore, we made slight modifications to the content in Figure 2 and explained all the terms.

[Line 250-255, Figure 2: Architecture of our proposed method. The STA module is our proposed spatiotemporal attention mechanism module, and the TBD module is the core module of the TBD model. "pre_frames" and "sub_frames" respectively represent the video sequence before and after the STA frame. During the training phase, the model outputs "pre_scores" and "sub_scores". During the validation or testing phase, the model produces "pre_masks" and "sub_masks". “S&T diversity” represents matching templates for spatial and temporal diversity.]

 

Comments 8: [Line 216—219: Please label all letters in the equations.]

Response 8: Thank you for pointing this out. We agree with this comment. Therefore, we employed an alternative expression for this equation and provided explanations for all mathematical symbols.

[Line 229-238: the horizontal gradient G_x and the vertical gradient G_y. Here, I(x,y) represents the pixel value at position (x,y) in the image, and G_x(i,j) and G_y(i,j) are the two convolution kernels of the Sobel operator]
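The response describes the standard Sobel operator. As a minimal sketch of those definitions (the authors' actual implementation is not shown in this letter, so the kernels and helper below are illustrative, not their code), the two convolution kernels and the gradient maps can be computed as:

```python
import numpy as np
from scipy.ndimage import convolve

# Standard 3x3 Sobel kernels -- an illustrative sketch of the operator
# described in the response, not the authors' actual code.
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)  # G_x: horizontal gradient kernel
KY = KX.T                                 # G_y: vertical gradient kernel

def sobel_gradients(image):
    """Return the horizontal and vertical gradient maps of I(x, y)."""
    gx = convolve(image, KX, mode="nearest")
    gy = convolve(image, KY, mode="nearest")
    return gx, gy

# A purely vertical intensity edge yields a nonzero horizontal gradient
# and a zero vertical gradient.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
gx, gy = sobel_gradients(img)
```

On such an edge, all the gradient response falls in G_x, which is why the operator is useful for locating the blurred ventricle contour.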

 

Comments 9: [Figure 3: Please labels all terms and provide more detailed elaboration in the figure caption.]

Response 9: Thank you for pointing this out. We agree with this comment. Therefore, we made slight modifications to the content in Figure 3 and explained all the terms.

[Line 346-349, Figure 3]

 

Comments 10: [Please include a discussion section before the conclusion.]

Response 10: Thank you for pointing this out. We agree with this comment. Therefore, we have included a discussion section before the conclusion.

[Lines 540-550]

 

Additionally, we have rewritten almost all sections of the paper, especially the introduction, methods, and experiment sections. We have uploaded the latest manuscript; please see the attachment. We hope that the revised version meets with your approval.

 

Sincerely,

Kui Zeng

Shanghai Ocean University

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

 

The article does not meet the basic requirements for a scientific article: it lacks a review of state-of-the-art methods applied in the research area, the quality of the mathematical description is low, the experimental design is unreliable, and the presentation of the results is poor. Please find the additional comments below.

- The problem description (introduction), the review of similar research, the methods used, and the motivation behind using them (lines 66-92) are mixed together in the introduction section. The quality of the review of similar research is low, as only a few articles are mentioned in this part of the article.

- (Line 24) The TBD abbreviation is not explained.

- (Lines 148-150, Table 1) The number of videos is low, so the video dataset lacks diversity. Although the authors apply data augmentation during the training procedure, the test set and validation set contain data from only 1-2 videos. Thus, the later experimental results (Table 2) can be highly dependent on the videos used in the dataset. A cross-validation analysis (leave-one-out, bootstrapping, etc.) would provide more reliable results.

- (Lines 217-219) The terms in the equations are not explained.

- (Lines 307-343) The experimental process section describes the experimental results instead of the experimental procedure. Moreover, the results represent only the training data; thus, there is no analysis of how the model behaves with unseen (testing) data.

- (Lines 346-347) The authors state that they compare the results with the U-Net and TBD models. However, they do not describe the fine-tuning procedure of the models on the created dataset (parameters, epochs, etc.). Was the fine-tuning even performed with the training dataset?

- (Line 405) The caption of Table 3 is not defined.

- Some typographic mistakes: (line 79) “can be applied to TBD to improves…”, (line 279) “Experment”, …

 

 

Author Response

Dear reviewer:

Thank you very much for taking the time to review this manuscript and for providing so many valuable comments. Your feedback has made us acutely aware of numerous shortcomings in our paper.


We have addressed each specific comment you raised and made corresponding modifications.


Comments 1: The problem description (introduction), review of similar research, used methods and the motivation behind using them (lines 66-92) are mixed in the introduction section. The quality of review of similar research is low as only several articles are mentioned in this part of article.


Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten the introduction section of the paper, added some references, and included examples from classical computer vision cases.
[lines 43-79]


Comments 2: (24 line) TBD abbreviation is not explained.


Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.
[Line 24: Tackling Background Distraction (TBD)]


Comments 3: (148-150 lines, Table 1) The number of videos is low, thus the video dataset lacks diversity. Although the authors apply data augmentation during the training procedure, the test set and validation set contains data from 1-2 videos only. Thus, the later experimental results (Table 2) can be highly dependent on the videos used in the dataset. The cross-validation analysis (leave 1-out, bootstrapping, etc.) would provide more reliable results.

Response 3: We primarily used the DAVIS 2016 validation set to assess the framework's performance and employed the medaka test set for supplementary evaluation. 
[lines 467-487, lines 488-516, Table 2: Table 2. Segmentation performance of TBD and STAVOS on video sequences from the DAVIS2016 validation set.]


Comments 4: (217-219 lines) the terms in the equations are not explained

Response 4: Thank you for pointing this out. We agree with this comment. Therefore, we employed an alternative expression for this equation and provided explanations for all mathematical symbols.
[Line 229-238: the horizontal gradient G_x and the vertical gradient G_y. Here, I(x,y) represents the pixel value at position (x,y) in the image, and G_x(i,j) and G_y(i,j) are the two convolution kernels of the Sobel operator]


Comments 5: (307-343 lines) the experimental process section describes the experimental results instead of the experimental procedure. Moreover, the results represent only the training data, thus, there is no analysis on how the model behaves with the unseen (testing) data.

Response 5: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten the Experiments section, which is now divided into: 4.1. Experimental Environment, 4.2. Experimental Setup, 4.3. Training Results, 4.4. Ablation Experiments, 4.5. Comparative Experiments.
[lines 432-487]


Comments 6: (346-347 lines) the authors state that they compare the results with U-Net and TBD model. However, they do not describe the finetuning procedure of the models to the created dataset (parameters, epochs, etc.). Was the finetuning even performed with the training dataset?

Response 6: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten the Results section, which is now divided into: 5.1. Qualitative Results, 5.2. Quantitative Results, 5.3. Automated System Results. 
[lines 489-516]


Comments 7: (405 line) Caption of table 3 is not defined.

Response 7: Thank you for pointing this out. We agree with this comment. We sincerely apologize for the oversight in omitting the title for Table 3 in the initial submission. However, due to substantial revisions made to various sections of the paper, the previous Table 3 now corresponds to Table 4.
[lines 523-536, Table 4: Table 4. Ventricular parameters for R0039 output by the automated system. "ED": end-diastole; "ES": end-systole; "FS": fractional shortening; "EDV": the volume at ED; "ESV": the volume at ES; "SV": stroke volume; "EF": ejection fraction; "HR": heart rate. Due to the original table's extensive length, the presented table is a truncated version, including only the first 10 rows.]
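The parameters abbreviated in this caption are related by standard echocardiographic definitions. A minimal sketch, assuming the common textbook formulas (the paper's own Section 3.5 formulas are not reproduced in this letter and may differ in detail; the function name and arguments below are hypothetical):

```python
def cardiac_parameters(edv, esv, ed_diameter, es_diameter):
    """Standard definitions of SV, EF, and FS.

    edv / esv: ventricular volume at end-diastole / end-systole (EDV, ESV);
    ed_diameter / es_diameter: ventricular diameter at ED / ES.
    An illustrative sketch, not the authors' exact computation.
    """
    sv = edv - esv                                           # stroke volume
    ef = sv / edv * 100.0                                    # ejection fraction (%)
    fs = (ed_diameter - es_diameter) / ed_diameter * 100.0   # fractional shortening (%)
    return sv, ef, fs

sv, ef, fs = cardiac_parameters(edv=10.0, esv=4.0, ed_diameter=2.0, es_diameter=1.5)
# sv = 6.0, ef = 60.0, fs = 25.0
```

HR, the remaining parameter in the table, is obtained separately by counting contraction cycles per unit time rather than from the volumes above.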


Comments 8: Some typographic mistakes: (line 79) “can be applied to TBD to improves…”

Response 8: Thank you for pointing this out. We agree with this comment. Therefore, we rewrote this sentence.
[lines 92-95: Additionally, to address the problem of edge blurring, we propose a spatiotemporal attention module for video object segmentation (STAVOS), which better captures the dynamic features of heartbeats in the video, achieving precise segmentation of the medaka ventricle.]


Comments 9: (line 279) “Experment”, …

Response 9: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.
[lines 395: Experiment]


Additionally, we have rewritten almost all sections of the paper, especially the introduction, methods, and experiment sections. We have uploaded the latest manuscript; please see the attachment. We hope that the revised version meets with your approval.


Sincerely,
Kui Zeng
Shanghai Ocean University

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Please, see the attached PDF file.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Moderate English revision is needed to improve the fluency of some sentences and fix some grammatical errors.

Author Response

Dear Editors,

 

Thank you very much for taking the time to review this manuscript and for providing so many valuable comments. Your feedback has made us acutely aware of numerous shortcomings in our paper.

We have addressed each specific comment you raised and made corresponding modifications.

 

Comments 1: Lines 52-75: This part needs to be covered by a sufficient bibliography, as the authors refer to the main limitations and challenges of video/image analysis in this context. The authors should enhance it by introducing appropriate bibliographic references, also including studies that have used more classical computer vision techniques without deep learning approaches. In this way, the Introduction would be more comprehensive, emphasizing the potential benefits of the proposed DL framework.

Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten the introduction section of the paper, added some references, and included examples from classical computer vision cases.

[lines 43-52]

 

Comments 2: Lines 95-103: For clarity, contributions of the study could be provided as a list.

Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.

[lines 103-113]

 

Comments 3: Lines 114 and 118: The authors used two metrics to indicate the age of the embryos (dpf and hpf). It would be better to use the same metric for clarity.

Response 3: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.

[lines 124 and 128: 32-36 hours post-fertilization (hpf), 32-36 hpf.]

 

Comments 4: Line 120: The resolution and frame rate of the videos are very low, which probably impacts on the results. Is “15fps” enough to measure the heartbeat correctly? Is “640x480px” enough to accurately detect the contours? A higher frame rate and resolution would probably have improved the results and allowed fewer original videos to be discarded. The authors should discuss these points and justify their choice.

Response 4: Thank you for pointing this out. We agree with this comment. Therefore, we discuss these points and provide justification for our choice.

[lines 131-138]

 

Comments 5: Figure 1: It would be helpful for the authors to mark (with an arrow, for example) the position of the ventricle in the images (b, c, d, f, g, h) for readers unfamiliar with this type of images

Response 5: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.

[Figure 1]

 

Comments 6: Lines 139-140: The authors should clarify what they mean by "The quality of the original video data is uneven, and many videos are no longer able to see the heart clearly" and explicitly indicate why nearly half (25 videos) of the original 63 videos were discarded. Since the original dataset is public (lines 122-123), readers/researchers need to know the reasons for this selection.

Response 6: Thank you for pointing this out. We agree with this comment. Therefore, we explained the reasons for the omission. In addition, we modified the content of Figure 1 to illustrate images of different qualities.

[lines 170-171 and Figure 1: Due to the extremely poor quality of the videos in the third level, which prevented accurate labeling, they were discarded]

 

Comments 7: Lines 156-159: Could not low-quality videos (named “lower” in Table 1) introduce confusion into the DL model rather than generalizing it? It is also unclear what the difference is between the “lower” videos and those initially discarded (25 videos of the original 63): one would have expected that videos with blurred contours and occlusions would also be discarded. The authors should better argue this critical point.

Response 7: Thank you for pointing this out. We agree with this comment. Therefore, we explained the reasons for the omission. In addition, we modified the content of Figure 1 to illustrate images of different qualities.

[lines 171-175 and Figure 1: The videos in the second level also exhibited suboptimal quality, and considering the prevalence of ventricular occlusion and blurred edges, coupled with the limited dataset, we chose to use the first and second levels of quality videos for dataset construction. For differentiation, we labeled the video data from the first level as "better" and the second level as "lower".]

 

Comments 8: Section 3.1: This section is loosely described. The authors should include more details regarding the framework’s components (Figure 2) to favor the reproducibility of the study. Indeed, many details of the methodological approach need to be included, as well as the rationale for some choices (e.g., why was the 0.5 threshold chosen?).

Response 8: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten the Methods section, and STAVOS is now part of Section 3.3. We have provided a comprehensive explanation of STAVOS, which is divided into the STA module and the TBD module. The TBD module is further divided into Encode, Matcher, and Decoder components.

[lines 256-299]

 

Comments 9: Lines 177-178: The sentence “omitting the details and partial connections of the spatiotemporal diversity matching template” needs to be clarified

Response 9: Thank you for pointing this out. We agree with this comment. Therefore, we made adjustments to Figure 2 to make the architecture more complete.

[Figure 2]

 

Comments 10: Section 3.2: This section is loosely described. The authors should include more details on how the computer vision techniques were applied, to favor the reproducibility of the study.

Response 10: Thank you for pointing this out. We agree with this comment. Therefore, we rewrote this section, dividing it into the Spatiotemporal Attention Mechanism section and the Deriving STA Frame section. In Section 3.1, we provide a detailed introduction to the Spatiotemporal Attention Mechanism, and in Section 3.2, we explain the calculation method for the STA frame.

[lines 191-247]

 

Comments 11: Line 219: What are Gx and Gy in Equation 3? The previous equations do not define them.

Response 11: Thank you for pointing this out. We agree with this comment. Therefore, we have defined these terms.

[line 230: G_x: the horizontal gradient, G_y: the vertical gradient.]

 

Comments 12: Line 249: “F-Measure or other contour-related metrics”: What other metrics? The authors should clarify this point since they referred only to F-Measure (lines 233 and 253).

Response 12: Thank you for pointing this out. We agree with this comment. Therefore, we specified other metrics.

[lines 323-324: such as F-Measure, Precision, and IoU]
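For readers unfamiliar with the metrics named here, their standard definitions for binary segmentation masks can be sketched as follows (an illustrative implementation of the textbook formulas, not the authors' evaluation code; the function name is hypothetical):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Precision, IoU, and F-measure between binary masks pred and gt."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()                 # true positives
    precision = tp / max(pred.sum(), 1)                 # TP / predicted positives
    recall = tp / max(gt.sum(), 1)                      # TP / actual positives
    iou = tp / max(np.logical_or(pred, gt).sum(), 1)    # intersection over union
    f_measure = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, iou, f_measure

# Toy 2x2 masks: one overlapping pixel out of two predicted / two true.
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
precision, iou, f_measure = segmentation_metrics(pred, gt)
# precision = 0.5, iou = 1/3, f_measure = 0.5
```

F-measure rewards boundary agreement symmetrically in precision and recall, while IoU penalizes both false positives and false negatives through the union term, which is why both are commonly reported together for contour-sensitive tasks like ventricle segmentation.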

 

Comments 13: Section 3.3: This section is loosely described. The authors should include more details on how the automatic evaluation system was implemented and how the outputs (parameters, signals, and graphs) were obtained (Figure 3).

Response 13: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten this section, and it now corresponds to Section 3.5. In Section 3.5, we provide detailed explanations on the implementation of the automated evaluation system and the calculation of relevant parameters.

[lines 350-394]

 

Comments 14: Lines 285-293: How the DAVIS2016 datasets fit into the authors' proposed model needs to be clarified.

Response 14: Thank you for pointing this out. We agree with this comment. Therefore, we explained the reasons for choosing DAVIS2016.

[lines 406-411: The reason for selecting DAVIS2016 is attributed to its annotation approach, which distinguishes only between foreground and background through binary labeling, aligning closely with the nature of our task. Moreover, the prominence of DAVIS lies in its high-quality, high-resolution characteristics, making it a mainstream benchmark dataset for VOS tasks. This choice is expected to enhance the credibility of our model, particularly given the dataset's reputation for being a leading evaluation dataset in the field.]

 

Comments 15: Lines 294-306: The authors should clarify the training process. For example, it needs to be clarified whether videos from the ad-hoc dataset (Table 1) or those included in the DAVIS2016 dataset were used to evaluate framework performance.

Response 15: Thank you for pointing this out. We primarily used the DAVIS 2016 validation set to assess the framework's performance and employed the medaka test set for supplementary evaluation.

[lines 467-471: Due to the limited size of the medaka dataset, both the test and validation sets consist of only 1-2 videos. Consequently, the dataset lacks diversity, resulting in experimental outcomes being significantly dependent on the specific videos used in the dataset. Therefore, we chose to assess the framework's performance using the DAVIS2016 dataset instead of the medaka dataset.]

 

Comments 16: Section 4.2: This section is also loosely described. In addition, it seems more like an "experiment result" rather than a description of the "experimental process" (Figure 4 and Figure 5)

Response 16: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten the Experiments section, which is now divided into: 4.1. Experimental Environment, 4.2. Experimental Setup, 4.3. Training Results, 4.4. Ablation Experiments, 4.5. Comparative Experiments.

[lines 432-487]

 

Comments 17: Lines 353-381: The results in this section (0.493, 0.829, and 0.914) seem the average (Figures 6 and 7) of the segmentation results on a single frame from one lateral-right video (N0068) and from one ventral video (R0039) for the three models compared (U-Net, TBD, and STAVOS respectively). These values should be reported as percentages for clarity and congruence with those shown in Figures 6 and 7 and Table 2. The same applies to 0.745, obtained from U-Net only for the ventral frame in the diastolic phase. However, these results seem to be obtained on single frames: what are the results on the whole test and validation sets (according to Table 1)? In addition, how should the results in Table 2 be interpreted? The caption is too poor. The author should improve and clarify the presentation of the results.

Response 17: Thank you for pointing this out. We agree with this comment. Therefore, we have rewritten the Results section, which is now divided into: 5.1. Qualitative Results, 5.2. Quantitative Results, 5.3. Automated System Results.

[lines 489-516]

 

Comments 18: Lines 394-404: The authors should add details on the meaning and method of computation of the estimated parameters (it would be better in the Methods section).

Response 18: Thank you for pointing this out. We agree with this comment. Therefore, we provide detailed explanations and formulas for each parameter in Section 3.5.

[lines 362-389]

 

Comments 19: Table 3: For consistency, the column header should match the name of the estimated parameters (e.g., "stroke V" was used instead of "SV"). In addition, the caption of Table 3 is missing (the one from the original template is shown). A critical and missing aspect is the validation of the results: in the study, was a validation of the measures shown in Table 3 performed at least by a comparison with traditional estimation methods? For example, what is the accuracy of the estimated heart rate (bpm) compared with the actual value? Finally, the parameters shown in Table 3 seem to refer to a portion of a single video. Were they reported only as an example? What are the overall results? The authors should clarify these issues

Response 19: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment. Furthermore, we manually assessed the heartbeat counts of all videos and compared them with the predicted values. As for other parameters, which require measurement by specialized researchers, this paper does not provide an evaluation for them.

[lines 532-536, Table 4, Figure 8]

 

Comments 20: The Discussion section of the results needs to be included (results are reported only in the Conclusions without a discussion). It should also include a comparison with the state of the art, the current limitations of the study, and any future developments. In addition, the Conclusion section should be rewritten by reporting some concluding remarks.

Response 20: Thank you for pointing this out. We agree with this comment. Therefore, we have included a discussion section before the conclusion. Furthermore, we rewrote the conclusion section and provided some concluding remarks.

[lines 540-550, lines 552-568]





Comments 1: Abstract: The acronym TBD must be defined before using it.

Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.

[Line 24: Tackling Background Distraction (TBD)]

 

Comments 2: Lines 32-33: The sentence needs to be corrected. The authors should rewrite it.

Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we rewrote this sentence.

[lines 31-33: Medaka (Oryzias latipes) is a commonly used model organism with significant importance in cardiovascular disease research [1], widely applied in the fields of ge-netic modification and drug development. This is attributed to the high genetic simi-larity of medaka to humans, low cultivation costs, and a fast growth cycle.]

 

Comments 3: Line 79: Fix “to improves”.

Response 3: Thank you for pointing this out. We agree with this comment. Therefore, we rewrote this sentence.

[lines 92-95: Additionally, to address the problem of edge blurring, we propose a spatiotemporal attention module for video object segmentation (STAVOS), which better captures the dynamic features of heartbeats in the video, achieving precise segmentation of the medaka ventricle.]

 

Comments 4: Line 115: The acronym ERM must be defined before using it.

Response 4: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.

[lines 125-126: embryo medium (ERM)]

Comments 5: Line 198: The acronym CBAM must be defined before using it

Response 5: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.

[lines 291-292: CBAM (Convolutional Block Attention Module)]

 

Comments 6: For the sake of clarity and consistency, it would be appropriate to use the same terms throughout the paper (e.g., "DenseNet-121" or "DenseNet121"; "Intersection over Union" or "Intersection-over-Union"; "Spatio-Temporal Attention" or "Spatiotemporal Attention").

Response 6: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment. We apologize: we actually used DenseNet-201, and we have corrected this in the paper.

[lines 268 and 272: DenseNet-201]

[lines 92, 107, 203...: Spatiotemporal]

 

Comments 7: Line 279: “Experment” instead of “Experiment”: please, fix it

Response 7: Thank you for pointing this out. We agree with this comment. Therefore, we have made corresponding changes according to your comment.

[lines 395: Experiment]

 

Additionally, we have rewritten almost all sections of the paper, especially the introduction, methods, and experiment sections. We have uploaded the latest manuscript; please see the attachment. We hope that the revised version meets with your approval.

 

Sincerely,

Kui Zeng

Shanghai Ocean University

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

All my comments were addressed.

Comments on the Quality of English Language

N/A

Author Response

Dear reviewer:

Thank you very much for taking the time to review this manuscript. In accordance with the feedback from other reviewers, we have made minor modifications to several locations in the manuscript, specifically at lines 49, 51, and 278.
We sincerely appreciate your careful work and hope the corrections will meet with your approval. Once again, thank you very much for your comments and suggestions.


Sincerely,
Kui Zeng, Shutan Xu, Daode Shu, Ming Chen
Shanghai Ocean University

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors addressed my previous comments properly and substantially modified the article to meet the requirements. The current version still contains some minor mistakes, e.g., (line 51) "... embryos, however, However, this equipment...".

Author Response

Dear reviewer:

Thank you very much for taking the time to review this manuscript.

Comments 1: (51 line) "... embryos, however, However, this equipment...". 

Response 1: Thank you for pointing this out. We are very sorry for this careless mistake, which we have now corrected.

[lines 51: De Luca et al. [5] designed a platform based on a resonant laser scanning confocal microscope for analyzing the heart rate of fish embryos. However, this equipment is quite expensive for some laboratories.]



We sincerely appreciate your careful work and hope the corrections will meet with your approval. Once again, thank you very much for your comments and suggestions.

Sincerely,
Kui Zeng, Shutan Xu, Daode Shu, Ming Chen
Shanghai Ocean University

Reviewer 3 Report

Comments and Suggestions for Authors

I would like to thank the authors for responding to all comments and for revising the article accordingly. The overall quality has improved, as its clarity and reproducibility. The new information and images have added completeness to the work. Well done. 

I have no further comments on the article. I just point out the following typos that can be corrected in the later stages of paper finalization.

1) Line 49: "De Luca et al." instead of "Luca et al."

2) Line 51: "however" is repeated twice.

3) Line 278: "the will be" instead of "they will be".

Comments on the Quality of English Language

English is good in general, although some sentences could be rephrased to simplify them.

Author Response

Dear reviewer:

Thank you very much for taking the time to review this manuscript.

 

Comments 1: Line 49: "De Luca et al." instead of "Luca et al."

Response 1: Thank you for pointing this out. We are very sorry for this careless mistake, which we have now corrected.

[lines 49: De Luca et al. [5] designed a platform ...]

 

Comments 2: Line 51: "however" is repeated twice.

Response 2: Thank you for pointing this out. We are very sorry for this careless mistake, which we have now corrected.

[lines 51: However, this equipment is quite expensive for some laboratories.]

Comments 3: Line 278: "the will be" instead of "they will be".

Response 3: Thank you for pointing this out. We are very sorry for this careless mistake, which we have now corrected.

[lines 278: The feature maps and "key" of the STA frame will be utilized to initialize templates for fine-grained matching and coarse-grained matching, serving as the initial state of the state dictionary.]

 

 


We sincerely appreciate your careful work and hope the corrections will meet with your approval. Once again, thank you very much for your comments and suggestions.

Sincerely,
Kui Zeng, Shutan Xu, Daode Shu, Ming Chen
Shanghai Ocean University
