Article
Peer-Review Record

MDCT: Multi-Kernel Dilated Convolution and Transformer for One-Stage Object Detection of Remote Sensing Images

Remote Sens. 2023, 15(2), 371; https://doi.org/10.3390/rs15020371
by Juanjuan Chen 1, Hansheng Hong 2, Bin Song 1,*, Jie Guo 1, Chen Chen 1 and Junjie Xu 1
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5:
Submission received: 8 December 2022 / Revised: 4 January 2023 / Accepted: 4 January 2023 / Published: 7 January 2023
(This article belongs to the Section AI Remote Sensing)

Round 1

Reviewer 1 Report

This paper proposes an object detection method for remote sensing images using MDCT. The algorithms are described clearly, and the experiments on three datasets demonstrate the method's efficiency. Some improvements should be made to the explanation of the experiments: please clarify whether the three detection accuracies reported in the Abstract are obtained by comparison with a specific known method, or whether they are average improvement rates.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper proposes an RSI detection model based on YOLO for detecting small remote sensing objects and remote sensing objects in complex scenes. The main components of the proposed method are a super-resolution module and a transformer encoder. The super-resolution module is developed for learning features of objects at different scales. The transformer encoder is proposed to enhance the recognition ability of the model. Extensive experiments are performed on the DIOR, DOTA, and NWPU VHR-10 datasets. The proposed model achieves promising results. However, some concerns should be resolved.

 

1. The motivation is not clear. The authors just list the recent works one by one and do not summarize them. Thus, we cannot learn the drawbacks of current works or obtain the motivation for the designed model. Besides, the super-resolution module and transformer encoder are not novel technologies. Dilated convolutions with different dilation rates are widely used for extracting features at different scales, and the transformer model is good at learning global information. However, the authors just combine them into the YOLO model, which makes a limited contribution to the field of RSI object detection. Please reorganize the introduction, clarify the motivation, and highlight the contributions of the proposed method. (A minimal illustrative sketch of parallel dilated convolutions is given at the end of this report's comments.)

 

2. The names of the key components in this paper are not appropriate. Firstly, the name "super-resolution module" confuses me. Generally speaking, super-resolution means that the method increases the spatial resolution of the input feature or image. However, the "super-resolution module" in this paper just enhances the feature representation with features at different scales; the spatial resolution is not changed. Besides, a transformer encoder usually comprises multiple transformer blocks, whereas the authors leverage only a single transformer block to obtain global information. Thus, naming the "transformer encoder" a "transformer block" would be more appropriate.

 

3. The proposed method is built on YOLO, which is known for its high inference speed, but there is no comparison of inference times in the current version.

 

4. Context information in RSI is widely used in remote sensing analysis as well as other relevant vision tasks. However, the paper lacks a comprehensive review of the literature. In particular, several relevant works should be discussed, such as Contextual Transformation Network for Lightweight Remote Sensing Image Super-Resolution, Volumetric Memory Network for Interactive Medical Image Segmentation, and MATNet: Motion-Attentive Transition Network for Zero-Shot Video Object Segmentation.

 

5. There are many typos in the current manuscript. Some of them are listed as follows:

1. A reference is missing in line 47 of page 2.

2. K^{\prime}_{2} is missing from the caption of Fig. 2.

3. "where" should be placed at the beginning of the line in line 214 of page 6.

4. The "equal sign" in Fig. 3 is not appropriate.

5. What is the meaning of using dashed lines to represent "MLP" in Fig. 4?

6. Why is "shortcut" presented in a different font in Fig. 5?

7. Each equation should end with a comma or period.

The authors should carefully proofread their manuscript and correct the above typos.
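
To make the background of comment 1 concrete: the multi-scale behaviour described there comes from applying the same 3x3 convolution with several dilation rates in parallel, so each branch covers a different receptive field, after which the branch outputs are fused. The sketch below is a minimal, hypothetical PyTorch illustration of such a block; the module name MultiDilationBlock, the channel counts, the dilation rates, and the 1x1 fusion convolution are illustrative assumptions, not the authors' MDCT implementation. Because the padding equals the dilation rate, the spatial size is unchanged, which is also the basis of the naming concern in comment 2.

import torch
import torch.nn as nn

class MultiDilationBlock(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates, fused by a 1x1 conv."""
    def __init__(self, in_ch=64, out_ch=64, dilations=(1, 2, 3)):
        super().__init__()
        # padding == dilation keeps the spatial size of a 3x3 convolution, so the
        # block enlarges the receptive field without changing H x W (no upsampling
        # or super-resolution takes place here).
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
             for d in dilations]
        )
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees a different effective receptive field, i.e. a different scale.
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

x = torch.randn(1, 64, 80, 80)        # dummy feature map
print(MultiDilationBlock()(x).shape)  # torch.Size([1, 64, 80, 80])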

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors have done a lot of work, but the paper needs some revision.

The introduction section should be revised.

The title should be changed to reflect the main idea of the paper.

The abstract does not match the content of the work; it should be revised, and so should the conclusions. Future work is missing.

Studies utilizing deep learning approaches and methods in the context of object detection are well known to the community. It is suggested to read and cite the following papers:

1. Liao, L., Du, L., & Guo, Y. (2021). Semi-Supervised SAR Target Detection Based on an Improved Faster R-CNN. Remote Sensing (Basel, Switzerland), 14(1), 143. doi:10.3390/rs14010143

2. H., Z., G., L., J., L., & F., Y. W. (2022). C2FDA: Coarse-to-Fine Domain Adaptation for Traffic Object Detection. IEEE Transactions on Intelligent Transportation Systems, 23(8), 12633-12647. doi:10.1109/TITS.2021.3115823

The figures should be revised.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

MINOR REVISION:

This manuscript proposes a multi-kernel dilated convolution and transformer model for remote sensing object detection. I think the idea is interesting, and the theoretical derivations are detailed. More detailed comments are given below:

1. The English of your manuscript must be improved before resubmission.

2. In the target detection case given in this paper, if there are multiple target categories, how many different categories of targets can be detected simultaneously at most?

3. What will happen when there are too many target categories? I suggest that the authors elaborate on this case or provide insights on this issue.

4. The paper focuses on small objects in dense scenes and complex backgrounds. Could small objects in dense scenes be quantified?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

This paper (remotesensing-2117609) proposes a remote sensing object detection algorithm, a multi-kernel dilated convolution and transformer-based one-stage object detection model (MDCT), with the aim of overcoming background interference and the difficulty of detecting small objects in dense fields. The experimental results observed on three datasets were compared and analyzed against other recently reported approaches from the literature ([41], [50]-[61]). The processing steps, the motivation of the experiments, and the observed results are described in detail with one algorithm, three figures, and five tables. However, in order to clarify the purpose of the paper and increase readability, minor revisions are recommended from the following perspectives.

1) In general, when comparing experimental results, the results should be measured under the same conditions and in the same manner. In the three tables representing the experimental results, it seems that the experimental conditions of the methods compared to the proposed method should be presented.

2) The performance of MDCT in detecting small objects in dense fields and its ability to overcome background interference would be further clarified by adding experimental results compared with existing SOTA methods in the ablation study.

3) It is expected that compliance with the general rules for using abbreviations will improve the readability of the paper. This includes abbreviations in the figures (e.g., SPP Layer, C3 Layer, Conv2d, etc.) and abbreviations in the text (e.g., NLP on line #147).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The revision has addressed my concerns.

Author Response

Thank you very much for your positive feedback, which has greatly encouraged us in revising and submitting our manuscript.

Reviewer 3 Report

Check the overall format of the paper against the journal's template.

Update the conclusion so that it is consistent with the abstract. Cite the latest research work.

Many grammatical mistakes were found throughout the paper; please correct them.

This paper can be published.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
