Peer-Review Record

Drone-Based Visible–Thermal Object Detection with Transformers and Prompt Tuning

Drones 2024, 8(9), 451; https://doi.org/10.3390/drones8090451
by Rui Chen, Dongdong Li *, Zhinan Gao, Yangliu Kuai and Chengyuan Wang
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Submission received: 31 July 2024 / Revised: 21 August 2024 / Accepted: 29 August 2024 / Published: 1 September 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The comments are attached as a PDF.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Moderate editing of the English language is required; use of the passive voice is recommended.

Author Response

We kindly refer the esteemed reviewer to the attached document, where our comprehensive responses to the insightful comments and suggestions are meticulously detailed. We trust that this will provide a clear and thorough account of our considerations and revisions made in accordance with your valuable feedback.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The article is dedicated to solving a computer vision problem. The topic is relevant. However, the structure of the article does not conform to the standard format for research articles in MDPI (Introduction (including a review of related work), Models and Methods, Results, Discussion, Conclusions). The level of English is acceptable, and the article is easy to read. The figures in the article are of acceptable quality. The article cites 31 sources, some of which are outdated.

The following remarks and recommendations can be made about the article's content:
1. The authors have chosen to complicate the task of object detection in images by focusing on infrared versions. This approach brings several issues. I will mention only the most critical ones: Objects in infrared images often have low contrast compared to the background, making them difficult to detect. Infrared cameras can capture thermal noise from various sources, such as ambient temperature, which may interfere with accurate object detection. Objects with different thermal characteristics can appear differently in infrared images, complicating their classification and identification. Weather conditions, such as fog, rain, or dust, can significantly degrade the quality of infrared images, making object detection difficult. The low resolution of infrared cameras can limit the ability to distinguish fine details of objects, leading to detection errors. All these issues should be addressed before applying image detection methods. I did not see this in the article.

2. Additionally, the computer vision problem tackled by the authors is already complicated. The reason is that they work with images obtained using drones. This brings in specific negative factors and leads to various effects, which I will mention next. Drones can capture objects from different heights and angles, resulting in significant variations in the images and making object recognition difficult. Due to the flight height and speed of drones, the quality and resolution of images may be limited, complicating the identification of small objects. Drone images can contain densely populated or complex scenes with many objects close to each other, making individual object detection difficult. Drones and objects on the ground can be in motion, which may lead to image blurring and, consequently, reduce detection accuracy. I did not see how the authors account for these circumstances in their model.

3. Considering the specific challenges of the problem addressed by the authors, I believe it would be more appropriate to form an ensemble of various neural network architectures, each aimed at solving a corresponding issue. Then, a rational rule for combining classification results should be formulated. This approach would provide scientific novelty, which seems lacking in the current version of the article.
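
To make the suggestion in point 3 concrete, the following is a minimal, purely illustrative sketch of one possible "rational rule" for combining per-class confidence scores from several specialized detectors. The detector roles, the weights, and the weighted-averaging rule are assumptions made for exposition; nothing here comes from the manuscript under review.

```python
# Illustrative only: a weighted-average combination rule for an ensemble of
# specialized detectors, as the reviewer suggests. Roles and weights below
# are hypothetical.
from typing import Dict, List

def combine_scores(per_model_scores: List[Dict[str, float]],
                   weights: List[float]) -> Dict[str, float]:
    """Weighted average of per-class confidence scores from several detectors."""
    assert len(per_model_scores) == len(weights) and sum(weights) > 0
    total = sum(weights)
    combined: Dict[str, float] = {}
    for scores, w in zip(per_model_scores, weights):
        for cls, s in scores.items():
            combined[cls] = combined.get(cls, 0.0) + w * s / total
    return combined

# Example: one model tuned for low-contrast infrared imagery, one for small
# objects, and one general-purpose detector.
fused = combine_scores(
    [{"car": 0.82, "truck": 0.10},
     {"car": 0.64, "truck": 0.25},
     {"car": 0.71, "truck": 0.18}],
    weights=[0.4, 0.3, 0.3],
)
print(fused)  # ≈ {'car': 0.733, 'truck': 0.169}
```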

Author Response

We kindly refer the esteemed reviewer to the attached document, where our comprehensive responses to the insightful comments and suggestions are meticulously detailed. We trust that this will provide a clear and thorough account of our considerations and revisions made in accordance with your valuable feedback.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper proposes the Visual Prompt multi-modal Detection (VIP-Det) framework, which significantly improves the target detection accuracy of UAVs in complex environments by using the Transformer architecture as the main feature extractor and combining it with visual cues. The limitations of CNNs are addressed, improving detection accuracy and efficiency in complex environments. The paper's method has a positive impact on surveillance and monitoring applications of UAVs in complex environments, especially in low-light, bad-weather, and occlusion situations. However, some problems remain, as follows:

1. In the introduction, the authors only mention that most existing research relies on Convolutional Neural Networks (CNNs). It is suggested to clearly point out the limitations of CNNs in dealing with specific problems such as lighting changes and target occlusion. A detailed explanation of the reasons for choosing the Transformer should also be provided, clarifying how it overcomes the limitations of CNNs and what advantages it brings.

2. Although the Related Work section lists many image fusion techniques and cutting-edge algorithmic frameworks, some of the latest research and techniques are not yet mentioned. It is suggested to expand the scope of the literature review to include important research results from the last two years. For example, discussions of "CVANet: Cascaded Visual Attention Network" and "GACNet: Generate Adversarial-Driven Cross-Aware Network" could be added in order to fully reflect the latest progress in the field.

3. In Section 3.2, "Prompt-based Fusion", it is suggested to add pseudo-code or a specific algorithm flow chart so that readers can understand the working principle and implementation details of the algorithm more clearly (see the illustrative sketch after this list). This will help readers better grasp the core mechanism and operational steps of the method.

4. The article does not reference figures such as Figure 1 or Figure 2 in the text. It is recommended to reference all diagrams in the text and to provide a detailed description of each module in Figure 2, explaining the specific flow of data through the VIP-Det architecture. This will help the reader better understand the model's structure and functionality.

5. In the ablation experiments, the authors currently only illustrate the effectiveness of the ablated modules in the overall model. It is recommended that the specific role of each ablated module be analyzed further and that the results of each ablation experiment be explained in detail in Tables 1 and 2. For example, it could be explained why "baseline+prompt", "fine-tune 12 layers", or other configurations significantly improve performance.

6. In the section "4.4.3 Comparison of Visual Detection Results", it is suggested to add more sets of visual comparison graphs to show the detection results under different environments (e.g., daytime, nighttime, rain, fog, and occlusion). This will more prominently demonstrate the superiority of the proposed method under various challenging conditions.

7. In the conclusion section, it is suggested to add the outlook of future work or the discussion of potential application scenarios. This will not only summarize the achievements of the current work, but also be able to show the future research direction and possible practical applications, enhancing the forward-looking nature of the article.
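
To illustrate the kind of pseudo-code requested in point 3 above, the following is a minimal, hypothetical sketch of a prompt-based fusion module in which learnable prompt tokens are concatenated with the visible and infrared patch tokens before shared transformer blocks. The token dimensions, prompt count, and concatenation design are assumptions made for exposition, not the authors' actual VIP-Det implementation.

```python
# Hypothetical sketch of prompt-based fusion: learnable prompt tokens are
# prepended to the concatenated visible and infrared patch tokens, and
# self-attention mixes all of them. Sizes below are illustrative.
import torch
import torch.nn as nn

class PromptFusion(nn.Module):
    def __init__(self, dim: int = 768, num_prompts: int = 8, depth: int = 2):
        super().__init__()
        # Learnable prompt tokens shared across modalities.
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_prompts = num_prompts

    def forward(self, vis_tokens: torch.Tensor, ir_tokens: torch.Tensor):
        b = vis_tokens.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        # Concatenate prompts with both modalities; attention fuses them.
        x = torch.cat([prompts, vis_tokens, ir_tokens], dim=1)
        x = self.blocks(x)
        # Drop the prompt positions; keep fused patch tokens for detection.
        return x[:, self.num_prompts:]

# Smoke test with dummy patch tokens (batch 2, 196 patches, dim 768).
vis = torch.randn(2, 196, 768)
ir = torch.randn(2, 196, 768)
print(PromptFusion()(vis, ir).shape)  # torch.Size([2, 392, 768])
```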


Author Response

We kindly refer the esteemed reviewer to the attached document, where our comprehensive responses to the insightful comments and suggestions are meticulously detailed. We trust that this will provide a clear and thorough account of our considerations and revisions made in accordance with your valuable feedback.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The reviewer would like to thank the editor and was pleased to review this manuscript. This study investigates a drone-based visible–thermal object detection method with transformers and prompt tuning. The topic is interesting and fits the scope of the journal. Before the final recommendation, some suggestions are listed for further consideration:

(1) It is recommended to clarify why the visual embedding is frozen while the infrared embedding is fine-tuned. Was this choice based primarily on the differences between the two modalities? (An illustrative sketch of one such freezing scheme appears after this list of points.)

(2) The details of the fine-tuning algorithm should be clarified.

(3) “Turned” seemed to be a typo.

(4) How did the authors design the prompt-based fusion module? It seems similar to the visual and infrared image embedding modules.

(5) Why did the authors embed the prompt layer only in L2? How was the fusion strategy determined?

(6) Comparison fairness is a crucial issue in comparative studies of different deep learning models; it is recommended to clarify this further, especially how a local optimum was avoided for each investigated model.

(7) It is recommended to also present comparisons of the fine-tuned parameter volume in the ablation experiments.

(8) A more in-depth literature review is recommended. Recently, many related studies on vision-transformer-based small-object recognition using remote sensing and drone images have been reported, which might include but are not limited to the following papers:

Hybrid convolutional-transformer framework for drone-based few-shot weakly supervised object detection. Computers and Electrical Engineering, 2022, 102, 108154.
Transformer-based semantic segmentation of postearthquake dense buildings in urban areas using remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2022, 16, 369-385.
Misaligned visible-thermal object detection: a drone-based benchmark and baseline. IEEE Transactions on Intelligent Vehicles, doi: 10.1109/TIV.2024.3398429.
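
As an aside on points (1), (5), and (7), the following is a minimal, hypothetical sketch of the kind of freezing scheme the reviewer asks about: everything is frozen except an infrared embedding branch and the prompt parameters, and the trainable parameters are then counted for a parameter-volume comparison. The module names (vis_embed, ir_embed, prompts) and the dummy model are invented for illustration and do not reflect the authors' code.

```python
# Hypothetical sketch: freeze the visible branch (and everything else), then
# fine-tune only the infrared embedding and the prompt tokens. Names below
# are invented for illustration.
import torch
import torch.nn as nn

class DummyDualBranch(nn.Module):
    def __init__(self, dim: int = 768, patch: int = 16):
        super().__init__()
        self.vis_embed = nn.Linear(3 * patch * patch, dim)   # RGB patches
        self.ir_embed = nn.Linear(1 * patch * patch, dim)    # thermal patches
        self.prompts = nn.Parameter(torch.zeros(1, 8, dim))  # prompt tokens

def freeze_for_prompt_tuning(model: nn.Module) -> None:
    for p in model.parameters():
        p.requires_grad = False                  # freeze everything first
    for name, p in model.named_parameters():
        if name.startswith(("ir_embed", "prompts")):
            p.requires_grad = True               # tune IR branch + prompts

def count_trainable(model: nn.Module) -> int:
    # The fine-tuned parameter volume requested in point (7).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = DummyDualBranch()
freeze_for_prompt_tuning(model)
print(count_trainable(model))  # ir_embed + prompts parameters only
```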

Author Response

We kindly refer the esteemed reviewer to the attached document, where our comprehensive responses to the insightful comments and suggestions are meticulously detailed. We trust that this will provide a clear and thorough account of our considerations and revisions made in accordance with your valuable feedback.

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors

This paper proposes a novel transformer-based framework named VIP-Det for dual-modal (visible and thermal) object detection tailored for unmanned aerial vehicles (UAVs). The framework leverages the Vision Transformer (ViT) as its backbone, introduces a prompt-based fusion module for refined feature integration, and adopts a stage-wise optimization strategy for efficient fine-tuning. The experiments conducted on the DroneVehicle dataset demonstrate the effectiveness of VIP-Det in handling complex UAV-to-ground target detection scenarios. However, there are several aspects that could be improved to strengthen the paper.

Specific Suggestions for Revision:

(1) Clarify Motivation and Contributions More Concisely: The introduction section could be streamlined to more concisely motivate the need for a dual-modal object detection framework and highlight the key contributions of VIP-Det. Focus on the main limitations of existing approaches and how VIP-Det addresses them.

(2) Strengthen Related Work Section: The related work section provides a good overview of existing methods, but could be expanded to more thoroughly discuss the most relevant and recent approaches in dual-modal object detection, particularly those that utilize transformers and prompt-based techniques.

(3) Improve Details of the Proposed Method: Provide more details on the implementation of the prompt-based fusion module and the stage-wise optimization strategy. Clarify how the prompts are integrated into the ViT architecture and how they facilitate feature fusion. Explain the specifics of the stage-wise training process and its advantages over traditional training methods (a generic, illustrative sketch of such a stage-wise schedule appears after this report).

(4) Expand Experimental Evaluation: Conduct additional experiments to more thoroughly evaluate the performance of VIP-Det. Compare against more baseline methods, especially those that utilize similar architectures or techniques. Provide ablation studies to analyze the impact of different components of VIP-Det, such as the number of transformer layers and the choice of pre-trained weights.

(5) Discuss Limitations and Future Work: Discuss the limitations of the current work and propose directions for future research. For example, consider how VIP-Det could be extended to handle more complex scenes or detect a wider range of object categories. Discuss potential improvements to the prompt-based fusion module and stage-wise optimization strategy.

(6) Improve Figures and Tables: Ensure that all figures and tables are clearly labeled and described in the text. Improve the readability of the figures by adjusting font sizes, line widths, and color schemes.

(7) Check for Consistency and Clarity: Carefully proofread the manuscript to correct any grammatical errors, typos, or inconsistencies in notation. Ensure that the explanations are clear and concise, and that technical terms are defined when first introduced.

Overall, this paper presents an interesting and promising approach to dual-modal object detection for UAVs. With some revisions to address the above suggestions, the paper has the potential to make a valuable contribution to the field.
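
Regarding point (3), the following is a generic, purely illustrative sketch of a stage-wise fine-tuning schedule of the kind the reviewer asks the authors to specify. The stage boundaries, epoch counts, learning rates, and the choice of which blocks to unfreeze are assumptions, not VIP-Det's actual strategy.

```python
# Generic illustration of stage-wise fine-tuning: first train only the
# unfrozen parts (e.g., prompts and detection head), then unfreeze the last
# few backbone blocks at a lower learning rate. All hyperparameters here
# are placeholders; train_one_epoch is a caller-supplied training step.
import torch

def run_stagewise(model, blocks, train_one_epoch):
    # Stage 1: backbone blocks frozen; only the remaining parameters train.
    for blk in blocks:
        for p in blk.parameters():
            p.requires_grad = False
    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3)
    for _ in range(10):
        train_one_epoch(model, opt)

    # Stage 2: also unfreeze the last four backbone blocks and continue
    # training at a reduced learning rate.
    for blk in blocks[-4:]:
        for p in blk.parameters():
            p.requires_grad = True
    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
    for _ in range(5):
        train_one_epoch(model, opt)
```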

Author Response

We kindly refer the esteemed reviewer to the attached document, where our comprehensive responses to the insightful comments and suggestions are meticulously detailed. We trust that this will provide a clear and thorough account of our considerations and revisions made in accordance with your valuable feedback.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

I have formulated the following comments on the previous version of the article:

1. The authors have chosen to complicate the task of object detection in images by focusing on infrared versions. This approach brings several issues. I will mention only the most critical ones: Objects in infrared images often have low contrast compared to the background, making them difficult to detect. Infrared cameras can capture thermal noise from various sources, such as ambient temperature, which may interfere with accurate object detection. Objects with different thermal characteristics can appear differently in infrared images, complicating their classification and identification. Weather conditions, such as fog, rain, or dust, can significantly degrade the quality of infrared images, making object detection difficult. The low resolution of infrared cameras can limit the ability to distinguish fine details of objects, leading to detection errors. All these issues should be addressed before applying image detection methods. I did not see this in the article.
2. Additionally, the computer vision problem tackled by the authors is already complicated. The reason is that they work with images obtained using drones. This brings in specific negative factors and leads to various effects, which I will mention next. Drones can capture objects from different heights and angles, resulting in significant variations in the images and making object recognition difficult. Due to the flight height and speed of drones, the quality and resolution of images may be limited, complicating the identification of small objects. Drone images can contain densely populated or complex scenes with many objects close to each other, making individual object detection difficult. Drones and objects on the ground can be in motion, which may lead to image blurring and, consequently, reduce detection accuracy. I did not see how the authors account for these circumstances in their model.
3. Considering the specific challenges of the problem addressed by the authors, I believe it would be more appropriate to form an ensemble of various neural network architectures, each aimed at solving a corresponding issue. Then, a rational rule for combining classification results should be formulated. This approach would provide scientific novelty, which seems lacking in the current version of the article.

The authors have addressed all my comments. I found their responses quite convincing. I support the publication of the current version of the article. I wish the authors creative success.

Reviewer 3 Report

Comments and Suggestions for Authors

All my concerns have been well addressed. The manuscript can be recommended for publication.

Reviewer 4 Report

Comments and Suggestions for Authors

No further comments.

Reviewer 5 Report

Comments and Suggestions for Authors

No comment
