Article
Peer-Review Record

Residual Transformer YOLO for Detecting Multi-Scale Crowded Pedestrian

Appl. Sci. 2023, 13(21), 12032; https://doi.org/10.3390/app132112032
by Hechao Ye and Yanni Wang *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 8 October 2023 / Revised: 25 October 2023 / Accepted: 2 November 2023 / Published: 4 November 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report (New Reviewer)

Comments and Suggestions for Authors

This paper proposes a Residual Transformer YOLO (RT-YOLO) algorithm which enhances the multi-scale fusion strategy based on YOLOv7 and introduces a dedicated detection layer for small-scale occluded targets. As mentioned by the authors, RT-YOLO achieves an FPS of 67, maintaining real-time performance, and exhibits improved effectiveness and robustness, with a 5.1% enhancement in generalization compared to the original algorithm.

 

The research is well conducted and the solution is well modeled. The experiments are well organized and compared against the baseline.

 

There are some observations for consideration:

1)    It is better not to use long sentences, for example, lines 37-42.

2)    Section 2, Related Work, contains no references. At the least, the caption of Figure 1 should cite a YOLOv7 paper.

3)    For a scientific paper, it is better not to copy figures (Figures 1 and 2) from original research. The authors may need copyright authorization from the original authors.

4)    Regarding Section 2 again: a related-work section usually analyzes related research to justify the contributions. Some references in Section 1 could be moved to Section 2.

5)    All variables in the equations and figures need to be defined. For example, what are T, k, i, h, and the others in Equations 1-2? In Equation 10, what are R and P(R)? Please check all the equations and figures.

6)    The references should be more recent; most of them predate 2021. For example, Athoff et al., 2022, Once Learning for Looking and Identifying Based on YOLO-v5 Object Detection, and others.
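
For context on comment 5: in object detection evaluation, R and P(R) conventionally denote recall and precision as a function of recall, and average precision (AP) is the area under that curve. Whether Equation 10 of the manuscript has exactly this form is an assumption, since this record does not reproduce it. The following is a minimal Python sketch of the conventional computation using all-point interpolation; the function name `average_precision` and the sample values are illustrative and do not come from the paper.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    Illustrative helper, not the authors' code. `recalls` and `precisions`
    are equal-length lists of floats in [0, 1].
    """
    # Sort the points by recall and add sentinel endpoints.
    points = sorted(zip(recalls, precisions))
    r = [0.0] + [pt[0] for pt in points] + [1.0]
    p = [0.0] + [pt[1] for pt in points] + [0.0]

    # Make precision non-increasing as recall grows (precision envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum the rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


if __name__ == "__main__":
    # Hypothetical curve: precision 1.0 at recall 0.5, precision 0.5 at recall 1.0.
    print(average_precision([0.5, 1.0], [1.0, 0.5]))
```

Averaging this quantity over all object classes gives the mAP figure that the reports refer to.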

Comments on the Quality of English Language

It is better not to use long sentences, for example, lines 37-42.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report (New Reviewer)

Comments and Suggestions for Authors

The authors propose a new variant of the YOLOv7 algorithm, called Residual Transformer YOLO (RT-YOLO), for detecting multi-scale crowded pedestrians. Based on simulations on two datasets (CrowdHuman and WiderPerson), they demonstrate the efficiency of their model in terms of several metrics, such as mAP and recall. Generally, the paper is well structured and easy to follow and understand. The obtained results look important and promising. However, I have the following comments regarding the proposed work:

1) In the multi-head self-attention in BOTrans (Figure 5), please justify the choice of the sum, matrix, and softmax operations in the structure.

2) Please give more details about the two datasets used (number of images, labels, image dimensions, etc.).

3) In the experiment section, the authors describe what they observe in the figures and tables without explaining why these results were obtained. Please state which components added to RT-YOLO affect or enhance each experimental parameter.

4) In Table 5, RT-YOLO shows high efficiency when detecting bus (100) and cat (99.6) compared with other object types. Could you explain these notable results?

5) A study of other important efficiency metrics for evaluating the proposed algorithm, particularly complexity and execution time, is missing.
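
To make comment 1 concrete, the sketch below shows where a sum, matrix products, and a softmax typically appear in one self-attention head that adds a position term to the content similarity. It is a generic illustration under stated assumptions, not the authors' BOTrans block: the additive position encoding `r`, the scaling by the square root of the feature dimension, and all function names are assumptions made for this example.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v, r):
    """Single attention head; q, k, v, r are n rows of d floats each.

    r holds hypothetical position encodings (an assumption of this sketch).
    """
    d = len(q[0])
    content = matmul(q, transpose(k))   # matrix product: content-content similarity
    position = matmul(q, transpose(r))  # matrix product: content-position similarity
    # Sum: merge both similarity terms into one logit matrix, then scale.
    logits = [[(c + p) / math.sqrt(d) for c, p in zip(c_row, p_row)]
              for c_row, p_row in zip(content, position)]
    weights = softmax_rows(logits)      # softmax: each row becomes a distribution
    return matmul(weights, v), weights  # weighted average of the value rows
```

In this reading, the matrix products measure similarity, the sum lets position information shift those similarities, and the softmax normalizes each row so that the output is a weighted average of the values.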

Comments on the Quality of English Language

Minor editing of English language required

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report (New Reviewer)

Comments and Suggestions for Authors
  • Algorithm Details: a. Can you provide more information about the architecture of the RT-YOLO algorithm, specifically how the Resnet and Transformer structures are integrated? b. How does the Normalization-based Attention Module (NAM) operate within the network, and what specific benefits does it bring to pedestrian detection?

  • Generalization and Robustness: a. The paper mentions an improvement in generalization. Could you elaborate on what aspects of generalization have been enhanced, and what methods or techniques contributed to this improvement? b. What types of occlusion and crowd scenarios were considered during the evaluation, and how well does RT-YOLO perform in handling different degrees of occlusion and crowd density?

  • Performance Metrics: a. Apart from mean Average Precision (mAP), did you consider other evaluation metrics such as F1-score? How does RT-YOLO perform in terms of this metric? b. Could you provide a more detailed breakdown of the mAP improvements for different object scales and occlusion levels?

  • Feature Fusion Strategy: a. The paper introduces a new multi-scale fusion strategy. Can you explain in more detail how this strategy works and how it contributes to the improvement in detection accuracy? b. Are there any specific design choices in the multi-scale fusion strategy that significantly impact its performance?
  • Future Work: a. In the "Future Work" section, you mention potential directions for further research. Could you elaborate on one or two of these directions and discuss the expected challenges and benefits of pursuing them?

  • Practical Applications: a. Beyond the experimental results, can you discuss potential practical applications for RT-YOLO, particularly in the fields of autonomous driving and video surveillance? b. How well does the algorithm adapt to different real-world scenarios and environments?

  • Open Source Availability: a. Are the code and pre-trained models for RT-YOLO available for the research community to use and reproduce your results? If so, where can they access them?

  • Model Complexity: a. How does the complexity of RT-YOLO compare to other state-of-the-art object detection models? Are there any trade-offs between model complexity and performance?
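
The F1-score raised under Performance Metrics is the harmonic mean of precision and recall, so it can be derived from the same detection counts used for mAP. The following is a minimal Python sketch; the function name and the counts in the usage lines are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
    print(precision_recall_f1(8, 2, 2))
```

Because F1 depends on a single confidence threshold while AP integrates over all thresholds, reporting both gives complementary views of a detector.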

Comments on the Quality of English Language

It needs minor revision.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes the RT-YOLO algorithm to improve the faulty detection of many small and occluded objects in real-time pedestrian detection. The proposed solution is obtained by combining a convolutional network with Transformer structure and replacing the E-ELAN structure in the original network of YOLOv7. It extracts global features and combines contextual information to discriminate the crowded and occluded objects. Based on this, the multi-scale fusion strategy is changed to design special network layers for small and occluded objects. These contain feature maps with different receptive fields, so that they have semantic information of high-level features while retaining detailed information of low-level features to the maximum extent. A special attention mechanism is introduced to focus the network on the object area of interest at minimal cost and improve the efficiency of network parameter utilization. Finally, the classification information and localization information of objects are obtained to enable accurate localization. As experimentally demonstrated, RT-YOLO improved the performance compared to YOLOv7 when applied to the filtered CrowdHuman and WiderPerson datasets by 3.8% and 3.4%, respectively. In addition, generalization experiments of the RT-YOLO algorithm with Pascal VOC2007 verified its robustness and effectiveness, with an improvement of 5.1% compared to YOLOv7. The algorithmic complexity of RT-YOLO increased due to the addition of additional modules, but still met the real-time requirements. Since the RT-YOLO algorithm still has a certain leakage phenomenon, there is further room for improvement.
The paper has formulated a clear and relevant research question. The methodology is appropriate and well described. All abbreviations should be explained before using them. The paper is well written, and the figures and tables are illustrative. The structure is logical and clear. For the sake of clarity, many sentences should be shorter. The sources are quite current. The argumentation is convincing and based on the computation results. Disadvantages of the proposed methods are stated. Overall, the paper is of good quality and has a value for the research field, but there is also room for improvement and further investigation. After correcting the minor typographical errors, the paper could be published.

Comments on the Quality of English Language

Line                     Hints / Typos

95                        feature map, when inputting … is feature map -> please, check sentence

120                      T In this paper -> In this paper

147                      transformers[16] -> Transformers[16]

234                      To observe the object information…-> please, complete sentence

277                      whose calculation formula is calculated as follows .->.is computed …

285                      where γ is the scaling factor -> Here, γ is the scaling factor

Please, use shorter sentences for better comprehension of the paper.

397-402              Please, avoid doubling of parts of sentences.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

- The paper needs to be improved drastically. It doesn't convey clear context in many places.

- There are several silly mistakes in the writing and repeated sentences, for example lines 93-97.
Silly mistakes: line 120 starts with a stray "T"; the term ELAN appears at line 78, before its actual description at line 104; line 155 "Multia"; line 142 "throw".

- The figures and their captions can be improved.
For example, Fig. 2b is referenced in line 109 but is missing a caption in the actual Fig. 2.

- Table 1 doesn't serve any purpose, except for the last minor change shown.

- The description in Section 3 can be improved; there are no proper connections between sub-sections.

- Justifications are wrong in many places. For example, at line 47, traditional algorithms are less robust in complex scenarios, ... are not due to the rapid development of deep learning.


Comments on the Quality of English Language

The level of English is good in some places, but not appropriate in many instances. 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors propose RT-YOLO, which improves YOLOv7 and provides a specific detection layer for small-scale occlusion targets. The proposed algorithm showed improvements when tested on the CrowdHuman and WiderPerson datasets. The paper is well written, with a few minor errors listed in sequence.

 

The paper seems very complete, as the authors propose an improvement over the original YOLOv7 algorithm that yields better results while keeping the real-time performance of the original algorithm. I have no further suggestions to improve the quality of the paper other than correcting some minor errors.

Comments on the Quality of English Language

More general comments and minor errors are listed as follows.

 

"Hong et al" -> "Hong et al."

"background noise;" -> "background noise."

"T In this paper," -> "In this paper,"

"transformers[16] is a" -> "Transformers[16] are a"

"BOTrans" -> the term was already defined in the text

"Resnet usually have" -> "Resnet usually has"

"Figure 6.The " -> "Figure 6. The "

"To observe the object information of interest for each scale of feature map." -> this sentence seems incomplete

"the rest two annotations are irrelevant to our experiment, as shown in Figure 11, so deleted." -> please rewrite

"delete annotations" -> "deleted annotations"

"in Figure 13 From" -> "in Figure 13. From"

"Table 3,Table 4" -> "Table 3, Table 4"

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf
