Article
Peer-Review Record

A Low-Altitude Remote Sensing Inspection Method on Rural Living Environments Based on a Modified YOLOv5s-ViT

Remote Sens. 2022, 14(19), 4784; https://doi.org/10.3390/rs14194784
by Chunshan Wang 1,2,3, Wei Sun 1,3, Huarui Wu 2,*, Chunjiang Zhao 2, Guifa Teng 1,3, Yingru Yang 4 and Pengfei Du 4
Reviewer 1: Anonymous
Reviewer 3:
Submission received: 30 August 2022 / Revised: 19 September 2022 / Accepted: 21 September 2022 / Published: 25 September 2022

Round 1

Reviewer 1 Report (Previous Reviewer 3)

Reviewer’s comments
to the manuscript “A low-altitude Remote Sensing Inspection Method on Rural Living Environments Based on Modified YOLOv5s-ViT” (Authors: Chunshan Wang, Wei Sun, Huarui Wu, Chunjiang Zhao, Guifa Teng, Yingru Yang, Pengfei Du).
The article is devoted to efficient and accurate inspection of rural living environments. For this purpose, the authors propose a modified version of the neural network YOLOv5s. Modifications include a changed BottleNeck structure, the embedding of the SimAM attention mechanism module, and the Vision Transformer component. The testing results of the established model showed that, compared with the original YOLOv5 network, the Precision, Recall, and mAP of the modified YOLOv5s-ViT model were improved by 2.2%, 11.5%, and 6.5%, respectively, the total number of parameters was reduced by 68.4%, and the computation volume was reduced by 83.3%. Relative to other mainstream detection models, YOLOv5s-ViT achieved a good balance between detection performance and model complexity. This study provides new ideas for improving the digital capability of the governance of rural living environments.
There are some other points to correct or to make the information more exact:
Essential drawbacks.
Remark 1. The paper uses different names for the same term. For example, YOLOv3 and yolov3, YOLOv5 and yolov5. This should be unified.
Technical drawbacks.
Remark 1. Lines 338-341, 343-346, 348-352, 354-355. Wrong indentation.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report (Previous Reviewer 1)

Line 126 should be placed on the next page.

Letter c of Figure 2 should be enhanced.

Change the colors used in Figures 3, 4, 5, 6, 7, and 8 to make them more visible (lighter colors are okay).

Provide an interpretation of Figures 11 and 12.

Enhance Table 5 so that it fits on one page.

Table 6 should be interpreted, or a discussion of it should be provided.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report (New Reviewer)

This paper proposes a YOLOv5s-ViT method to address the issue of illegal random construction and random storage in the promotion of the rural revitalization strategy. Based on YOLOv5s, a BottleNeck structure, the SimAM attention mechanism, and a Vision Transformer component are incorporated to improve performance. Experimental results on the authors' own rural living environment inspection dataset are presented. Overall, the structure and presentation of this manuscript should be improved. There are some issues that need to be resolved before the paper can be considered for publication.

1. Why choose YOLOv5s as the basic model? There are some inconsistencies between the results in Table 2 and the corresponding analysis. For example, “it can be seen that the detection precision of the YOLOv5s model was 92.2%,…”, while the value in Table 2 is 91.2%.

2. Figures 11 and 12 are redundant; their information is already listed in Table 4.

3. The ablation experiments are not very convincing. The baseline should be provided in Table 7. The effectiveness of the different proposed modules should be highlighted. Intermediate visualization results of the proposed modules might help.

4. In the experiments, can the proposed method be compared with References [19], [20], and [21]?

5. The presentation of this manuscript is very poor. For example, some tables and figures span pages (Table 2, Table 5).

6. The structure of this manuscript should be improved. Section 3.1 (Basic YOLOv5 Network) and Section 4.3 (Evaluation Indicators) take up too much space. However, the motivation for the proposed modules is not clear.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report (New Reviewer)

No more comments.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

I think "rural living environment and village appearance" is not an appropriate keyword.

What is written on line 56 and elsewhere: "Error! Reference source not found"?

Some reference numbers are missing.

Enhance Figures 2, 3, 4, 5, 6, 7, 8, 11 and 12 (with Chinese characters).

Table 5 should be changed to a figure.

Please add related references on your topic from the last 3 to 5 years.

Although the concept and purpose are clearly defined, the motivation and objectives of the study are not well established.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper addresses the issue of detecting rural areas from low-altitude UAVs. The paper is well written and easy to understand.

1) Some of the references in the introduction (page 2) are not displayed correctly. Please reformat the paper.

2) The authors have addressed key issues that occur during processing such as variable target size, relation between global and local targets and have appropriately addressed them.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Reviewer’s comments

to the manuscript “A low-altitude Remote Sensing Inspection Method on Rural Living Environments Based on Modified YOLOv5s-ViT” (Authors: Chunshan Wang, Wei Sun, Huarui Wu, Chunjiang Zhao, Guifa Teng, Yingru Yang, Pengfei Du).

The article is devoted to efficient and accurate inspection of rural living environments. For this purpose, the authors propose a modified version of the neural network YOLOv5s. Modifications include a changed BottleNeck structure, the embedding of the SimAM attention mechanism module, and the Vision Transformer component. The testing results of the established model showed that, compared with the original YOLOv5 network, the Precision, Recall, and mAP of the modified YOLOv5s-ViT model were improved by 2.2%, 11.5%, and 6.5%, respectively, the total number of parameters was reduced by 68.4%, and the computation volume was reduced by 83.3%. Relative to other mainstream detection models, YOLOv5s-ViT achieved a good balance between detection performance and model complexity. This study provides new ideas for improving the digital capability of the governance of rural living environments.

There are some other points to correct or to make the information more exact:

Essential drawbacks.

Remark 1. Line 17. YOLOv5s-ViT. There is no explanation of the abbreviation “ViT”.

Remark 2. Line 123. “Therefore, if the images were directly resized” - How exactly was this done?

Remark 3. Line 126. “manually segmented using Photoshop” - Does this mean that the experiment cannot be repeated, given the heuristic approach?

Remark 4. Line 153. Figure 3 is unreadable.

Remark 5. Line 192. “From the results (as shown in Table 2), it can be seen that the detection precision of the YOLOv5s model”. There is no such model in Table 2.

Remark 6. Line 199. “Table 2. Comparison of different YOLOv5 models.” On what data was the training conducted? How much data was there for training and testing?

Remark 7. Line 203. “In order to balance the detection precision with model complexity, the BackBone network of YOLOv5s was further optimized in this paper.” Why was this particular model chosen?

Remark 8. Line 206. Figure 7. The text on the figure is very small and unreadable. How can the reader determine what the authors' novelty is here?

Remark 9. Line 266. Figure 9. How does different sorting change the colors of cells, and therefore their essence?

Remark 10. Line 284. Figure 10. The text does not give any description of what happens in the blocks in the figure presented, although this, according to the authors, is their novelty.

Remark 11. Line 288. “The enhanced dataset contained 5072 images in total”. In line 128, the authors write that there were 1018 images: “Eventually, 1018 images were obtained”.

Remark 12. Line 289. “were divided into the training set and test set according to the ratio of 9:1.” Why is that? Usually the ratio is 80/20.

Remark 13. Line 292. “the dataset was converted to the VOC2007 format first”. What is this format and why did the authors convert to it?

Remark 14. Line 310. What do the authors mean by Detection Result and Ground Truth?

Remark 15. Line 329. “The modified YOLOv5s-ViT was compared”. Where did it come from?

Remark 16. Line 339. “its total number of parameters”. What parameters are we talking about?

Remark 17. Line 349. Table 4. The authors use different designations for the same neural network. For example, YOLO5s and YOLOv5s. The designations should be uniform throughout the article.

Remark 18. It is not clear from the article what the main novelty is. Why did the authors choose SimAM and not ECA-Net, for example?

Remark 19. The presentation quality of this article is very poor. Many abbreviations are not defined. How can the accuracy reported in the article be reproduced? How can readers access this dataset? Why is there no comparison of accuracy with modern detection algorithms in the article?
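
For context on Remark 14: in object detection, "Detection Result" and "Ground Truth" usually denote the predicted bounding box and the manually annotated box, and the two are commonly compared through intersection over union (IoU). The record does not reproduce the manuscript's equation, so the following is only a minimal sketch of the usual definition. The function name `iou` and the `(x_min, y_min, x_max, y_max)` box layout are illustrative assumptions, not taken from the manuscript.

```python
from typing import Tuple

# Assumed box layout (hypothetical, for illustration only):
# (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; zero if the box is degenerate."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    return width * height


def iou(detection: Box, ground_truth: Box) -> float:
    """Intersection over union of a detection result and a ground-truth box."""
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix_min = max(detection[0], ground_truth[0])
    iy_min = max(detection[1], ground_truth[1])
    ix_max = min(detection[2], ground_truth[2])
    iy_max = min(detection[3], ground_truth[3])

    intersection = box_area((ix_min, iy_min, ix_max, iy_max))
    union = box_area(detection) + box_area(ground_truth) - intersection

    # Guard against two degenerate boxes.
    return intersection / union if union > 0 else 0.0
```

A detection is typically counted as a true positive when its IoU with a ground-truth box exceeds a chosen threshold (0.5 is a common choice), which is what makes the two terms matter for Precision and Recall.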

 

Technical drawbacks.

Remark 1. Line 34. villages[1] – a space is missing.

Remark 2. Lines 56, 64, 66. Incorrect reference - [Error! Reference source not found.,9].

Remark 3. Line 177. “FPN+PAN” - The abbreviation is not expanded.

Remark 4. Line 193. “15.9GFLOPs,” – a space is missing.

Remark 5. Line 205. Figure 7. The text on the figure is very small and unreadable.

Remark 6. Line 315. R should be in italics in the equation.

Remark 7. Line 324. Equation 4. The multiplication sign is missing between P and R.

Remark 8. Line 344. Figure 12. The abscissa axis is labeled incorrectly.

Remark 9. Lines 310, 315, 320, 325. The word “where” must begin with a lowercase letter and without an indent after the comma.

Remark 10. Lines 20-27. Incorrect indent.
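
For context on Technical Remark 7: the record does not reproduce Equation 4, so its exact form is unknown here. One common evaluation metric in which P (Precision) and R (Recall) are multiplied is the F1 score, F1 = 2 × P × R / (P + R). The sketch below assumes, purely for illustration, that the equation has this form; the function name `f1_score` is hypothetical and not from the manuscript.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Assumption: Equation 4 of the manuscript is the F1 score; the review
    record itself does not confirm this.
    """
    denominator = precision + recall
    if denominator == 0:
        # Both values are zero, so the score is defined as zero here.
        return 0.0
    # The explicit product of P and R is the term the remark refers to.
    return 2 * precision * recall / denominator
```

Writing the product explicitly matters because "PR" without a multiplication sign can be misread as a single symbol rather than Precision times Recall.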

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
