Article
Peer-Review Record

Meta-YOLO: Meta-Learning for Few-Shot Traffic Sign Detection via Decoupling Dependencies

Appl. Sci. 2022, 12(11), 5543; https://doi.org/10.3390/app12115543
by Xinyue Ren 1, Weiwei Zhang 1,2,3,*, Minghui Wu 1, Chuanchang Li 1 and Xiaolan Wang 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 4 May 2022 / Revised: 25 May 2022 / Accepted: 26 May 2022 / Published: 30 May 2022
(This article belongs to the Topic Applied Computer Vision and Pattern Recognition)

Round 1

Reviewer 1 Report

The paper needs to be improved, and overall proofreading is required.

The authors used two datasets in the study; they are recommended to use two more datasets to justify the overall study.

The overall accuracy of the present model should be detailed, highlighting how their model is better than others.

 

Author Response

Dear Reviewers:

 

Thank you very much for giving us an opportunity to revise our manuscript titled “Topology optimization of hierarchical honeycomb acoustic metamaterials for extreme multi-broad band gaps”. The reviewers’ comments are very helpful for improving our paper and provide important guidance for our research. We have studied the comments carefully and have made extensive revisions. We now hope that the article can meet the high quality and interest criteria of Mechanics of Advanced Materials and Structures. In the following paragraphs, the reviewers’ comments are shown in BLACK and our responses in RED.

 

Sincerely,

Weiwei Zhang

 

 

Response to Reviewer 1 Comments

Point 1: The paper needs to be improved and overall proof reading is required.

 

Response 1: Thank you for your careful reading and insightful suggestions. We apologize for the problems in the paper; in the revised version, we have proofread the entire manuscript.

 

Point 2: The authors used two datasets in the study; they are recommended to use two more datasets to justify the overall study.

 

Response 2: Thank you for your careful reading and insightful suggestions. We added the Mapillary Traffic Sign Dataset and compared our method with other methods to justify the overall study.

MTSD: The Mapillary Traffic Sign Dataset covers multiple locations on six continents and consists of 52,453 high-resolution images with more than 80,000 annotated signs. The dataset includes 313 categories and covers variations in weather, season, time of day, camera, and viewpoint.

Table 1. Main characteristics of available TSD datasets.

Dataset        Images    Annotated signs (BBoxes)   Classes   Annotated sign size    Acquisition location
GTSDB [4]      900       1,206                      43        16-128 (longer edge)   Germany
TT-100K [40]   100,000   30,000                     45        27 to 397394           China
MTSD [46]      52,453    80,000                     313       256x256                Global

 

To ensure a fair comparison, in this section the proposed Meta-YOLO is compared with state-of-the-art methods: Han et al. [47], Min et al. [16], Zhang et al. [48] and Fan et al. [42]. Experiments are performed with K=10 across the MIST datasets, and the results on mAP50 and on base and novel classes are shown in Table 4.

 

Point 3: The overall accuracy of the present model should be detailed, highlighting how their model is better than others.

 

Response 3: Thank you for the meaningful advice. We apologize for not describing the experimental part in detail. In the revised version, we reworked the details of the experiments, which now read as follows.

To ensure a fair comparison, in this section the proposed Meta-YOLO is compared with state-of-the-art methods: Han et al. [47], Min et al. [16], Zhang et al. [48] and Fan et al. [42]. Experiments are performed with K=10 across the MIST datasets, and the results on mAP50 and on base and novel classes are shown in Table 4. The results show that our method outperforms most algorithms. Meta-YOLO achieves 46.2 and 51.7 on mAP50, which are the best among all methods. Fan et al. [42] combine a two-way contrastive training strategy with Attention-RPN to construct an object detection framework that addresses the poor generalization of few-shot detection. Although Attention-RPN mitigates the dependence on region proposals to some extent, its framework is still RPN-based and the detection background consists of complex traffic scenes, so its performance is worse than ours when training samples are scarce.

Author Response File: Author Response.docx

Reviewer 2 Report

The manuscript is devoted to solving the topical problem of traffic sign detection. The authors clearly explain their contributions in detail. Firstly, they present a few-shot object detection framework, namely Meta-YOLO, which combines localization meta-learning and object classification at the image level into a single module. Secondly, the authors develop a feature decorrelation module to get rid of false correlations and improve system reliability. Thirdly, they propose a three-head mechanism for handling spatial-positional relationships, which learns global, local and patch correlations with the category detection result output by the aggregation in the meta-learner.
The authors present a description of their solution that is sufficient for reproducing the experiments. All datasets used and the computer platform settings are described in detail.
As part of the experiments, the authors performed a comparison with three competitive baselines: YOLO-joint, YOLO-ft and YOLO-based. The detection performance of Meta-YOLO turned out to be the best among the compared models on all datasets.
In addition to recognition experiments, the authors performed adaptation rate measurements, as well as a number of ablation studies aimed at studying the effectiveness of the separately developed components. The results confirm the positive impact of the feature decorrelation module, meta-learner and three-head mechanism on object recognition.
Some comments and observations:
1. It is necessary to extend the Introduction. Discuss the context of the study and the need for solutions to the problem of traffic sign detection in general.
2. The discussion of related work seems insufficient. The authors cite many related articles but do not state the distinguishing features of the proposed solution in comparison to Meta-YOLO. Also summarize the limitations of the reviewed papers as a rationale for your research.
3. The Conclusion should also be improved. Add a summary of the results obtained by the model in comparison with the other three YOLOs, and information about the speed of adaptation and the need to use individual components (results of the ablation studies). In addition, discuss the limitations of your Meta-YOLO and how to overcome them.

Author Response

Response to the comments on the paper submitted to Applied Sciences

 

Dear Reviewers:

 

Thank you very much for giving us an opportunity to revise our manuscript titled “Meta-YOLO: Meta Learning for Few-Shot Traffic Sign Detection via Decoupling Dependencies”. The reviewers’ comments are very helpful for improving our paper and provide important guidance for our research. We have studied the comments carefully and have made extensive revisions. In the following paragraphs, the reviewers’ comments are shown in BLACK and our responses in RED.

 

Sincerely,

Weiwei Zhang

 

 

Response to Reviewer 2 Comments

Point 1: The manuscript is devoted to solving the topical problem of traffic sign detection. The authors clearly explain their contributions in detail. Firstly, they present a few-shot object detection framework, namely Meta-YOLO, which combines localization meta-learning and object classification at the image level into a single module. Secondly, the authors develop a feature decorrelation module to get rid of false correlations and improve system reliability. Thirdly, they propose a three-head mechanism for handling spatial-positional relationships, which learns global, local and patch correlations with the category detection result output by the aggregation in the meta-learner.

The authors present a description of their solution that is sufficient for reproducing the experiments. All datasets used and the computer platform settings are described in detail.

As part of the experiments, the authors performed a comparison with three competitive baselines: YOLO-joint, YOLO-ft and YOLO-based. The detection performance of Meta-YOLO turned out to be the best among the compared models on all datasets.

In addition to recognition experiments, the authors performed adaptation rate measurements, as well as a number of ablation studies aimed at studying the effectiveness of the separately developed components. The results confirm the positive impact of the feature decorrelation module, meta-learner and three-head mechanism on object recognition.

Some comments and observations:

  1. It is necessary to extend the Introduction. Discuss the context of the study and the need for solutions to the problem of traffic sign detection in general.

 

Response 1: Thank you for your careful reading and insightful suggestions. We apologize for not providing the context of the study and the need for solutions to the problem of traffic sign detection. In the revised version, we added this part as follows.

Traffic sign detection is the prerequisite for driverless cars to understand traffic information, avoid traffic congestion and accidents, and ensure safe and orderly driving. It is also an essential submodule of driver assistance systems. Recently, CNNs have been widely used for traffic sign detection [1-4], but they rely heavily on a large number of accurate bounding box annotations and artificially balanced training classes. When the contextual information of the training and testing sets is distributed unevenly, the detector can fail to generalize, which is a serious error. Guaranteeing the accuracy and robustness of detection results when samples are limited is a great challenge, because object scale varies widely with vehicle speed and traffic signs are inconsistent between regions.

 

Point 2: The discussion of related work seems insufficient. The authors cite many related articles but do not state the distinguishing features of the proposed solution in comparison to Meta-YOLO. Also summarize the limitations of the reviewed papers as a rationale for your research.

Response 2: Thank you for the meaningful advice. We added the following analysis of related work.

Traffic Sign Detection by CNNs.

In view of complex backgrounds and unbalanced sample distributions, Li et al. [15] fully study the relationship between different traffic signs with digital characters and, on that basis, design an SE block that can automatically learn the importance of each channel from global information. This method reduces the detection of a wide variety of numerical traffic signs to 10 digit categories, but it has difficulty distinguishing similar false targets in complex real traffic scenes. Min et al. [16] propose LW-RefineNet to segment the scene and obtain spatial position information at the pixel level; a constraint model is then constructed to establish the search regions. Experiments show that this method alleviates mis-detection of small traffic signs. However, it is limited to scenarios involving both sides of the road, and detection is ineffective in other scenarios (such as intersections). The above research shows that fully understanding and representing the real features of the extracted traffic signs is an effective way to distinguish similar objects and filter false associations, which provides a basis for the design of the FDM module in this algorithm. Very small traffic signs are one of the main causes of mis-detection, and improving the multi-scale ability of detection algorithms is a common way to address the challenge of small target detection. Cao et al. [17] present an improved Sparse R-CNN and construct hierarchical residual-like connections within each single radix block, while a cross-channel attention mechanism is added in the RoI division process to fuse shallow feature information. However, due to its single attention mechanism, global correlation in the RPN may suffer from spatial scale dislocation, and local correspondence between objects may be ignored. Wang et al. [18] apply inception and a channel attention mechanism to a superclass detector and directly concatenate feature maps of different channels, which overcomes the complex backbone problem of Cao et al. [17]. At the same time, robustness is negatively affected because the importance of different feature channels is not considered. Although CNN-based traffic sign detection algorithms have achieved remarkable results in real-time performance and accuracy, most methods require a large amount of labeled sample data, and in practice our dataset cannot cover all traffic scenes. Based on this consideration, we combine the meta-learning paradigm with CNNs to improve robustness on tasks with unseen classes.

Few-shot object detection.

Zhao et al. [29] propose a multiscale few-shot detection method based on fine-tuning, which uses residual involution blocks to construct the overall feature representation learning architecture and designs PAM to aggregate features from all levels. This method exploits shallow-feature semantic information for object localization in the first stage and is partly fine-tuned on a small balanced dataset in the second stage. However, Zhang et al. [30] visualize the feature distribution of samples in the pre-training space, showing that fine-tuning yields limited performance improvement in meta-learning and can easily increase the risk of over-fitting to base tasks. Therefore, in this work, we update the meta-learner with meta-information instead of fine-tuning. Whang et al. [31] propose a general object detection system that combines a feature-based domain attention mechanism with squeeze-and-excitation networks and assigns network activations to different domains through a learned SE adapter library, so as to automatically obtain the importance of each feature channel. The core idea of SENet is to learn the feature weights according to the loss, so that effective feature maps receive large weights and ineffective or less effective feature maps receive small weights, thereby achieving better results. However, this general detection system ignores the problem of spatial dislocation, which leads to poor performance when detecting small traffic signs against cluttered backgrounds. Han et al. [32] address the problems that arise when a detector trained on base classes generates candidate proposals for novel classes and misses high-IoU boxes in the RPN stage. A coarse-grained prototype matching network (meta RPN) is proposed, which uses a nonlinear classifier based on metric learning to replace the traditional linear object classifier and handles the similarity between anchor boxes and novel classes in query images, so as to improve the recall of the few novel-class candidate boxes.
A fine-grained prototype matching network (meta classifier) is also designed. This network has spatial feature alignment and foreground attention modules to handle the similarity between noise and novel classes, so as to enhance the overall detection accuracy. However, the meta optimizer suffers from the problem of prototype deviation. The cause of this problem is the use of an average-based method to roughly estimate the gradient when the labeled samples in each category are limited.
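
The SENet idea mentioned above, learning one weight per feature channel so that useful channels are amplified and the others suppressed, can be illustrated with a minimal sketch. It assumes the standard squeeze-and-excitation form (global average pooling, two fully connected layers with ReLU and sigmoid, then channel-wise scaling). The function name `se_block`, the hand-picked weights, and the omission of bias terms are illustrative assumptions; this is not the implementation used in [31] or in Meta-YOLO.

```python
import math


def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation over a feature map (illustrative sketch).

    feature_map: list of C channels, each an H x W nested list of floats.
    w1: hidden x C weight matrix; w2: C x hidden weight matrix (biases omitted).
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: fully connected layer followed by a sigmoid,
    # which yields a gate in (0, 1) for every channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value of a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Hypothetical 2-channel, 2x2 feature map and hand-picked weights.
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w1 = [[1.0, 1.0]]    # 2 channels -> 1 hidden unit
w2 = [[0.0], [1.0]]  # 1 hidden unit -> 2 channel gates
scaled, gates = se_block(features, w1, w2)
print(gates)
```

In a trained network the two weight matrices are learned from the loss, which is what lets the gates reflect channel importance; here they are fixed only to keep the example small.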

 

Point 3: Conclusion should also be improved. Finally, add a summary of the results obtained by the model in comparison with the other three YOLOs, and information about the speed of adaptation and the need to use individual components (results of Ablation Studies). In addition, discuss the limitations of your Meta-YOLO and how to overcome them.

 

Response 3: Thank you for your careful reading and insightful suggestions. Following your suggestion, we added the following content to Sec. 5.

Thirdly, Meta-YOLO outperforms the three competitive baselines and improves the mAP of few-shot detection by 39.8%. Compared with state-of-the-art methods, our performance is also better than that of most other detectors. The variation of Meta-YOLO performance over different iterations shows that the two-stage meta-learner model F is able to learn parameters quickly. A large number of ablation studies confirm the positive impact of the FDM, the meta-learner, and the three-head mechanism during detection.

We designed the three-head mechanism to obtain information of different categories and levels, but we did not completely integrate the information obtained by the different heads. This is a limitation of our work. The idea of residual connections may help alleviate this problem, and we will continue to work on it in the future.

Author Response File: Author Response.docx

Reviewer 3 Report

The authors presented Meta-YOLO: Meta Learning for Few-Shot Traffic Sign Detection via Decoupling Dependencies. The topic is interesting and the paper is well written. To improve the quality of the paper:

Add literature from the year 2022.

Compare your method with the latest state-of-the-art papers.

Author Response

Response to the comments on the paper submitted to Applied Sciences

 

Dear Reviewer:

 

Thank you very much for giving us an opportunity to revise our manuscript titled “Meta-YOLO: Meta Learning for Few-Shot Traffic Sign Detection via Decoupling Dependencies”. The reviewers’ comments are very helpful for improving our paper and provide important guidance for our research. We have studied the comments carefully and have made extensive revisions. In the following paragraphs, the reviewers’ comments are shown in BLACK and our responses in RED.

 

Sincerely,

Weiwei Zhang

 

Response to Reviewer 3 Comments

Point 1: The authors presented Meta-YOLO: Meta Learning for Few-Shot Traffic Sign Detection via Decoupling Dependencies. The topic is interesting and the paper is well written. To improve the quality of the paper:

Add literature from the year 2022.

 

Response 1: Thank you for your careful reading and insightful suggestions. We apologize for the lack of literature from 2022. In the revised version, we added the following content on literature from 2022.

Li et al. [15] fully study the relationship between different traffic signs with digital characters and, on that basis, design an SE block that can automatically learn the importance of each channel from global information. This method reduces the detection of a wide variety of numerical traffic signs to 10 digit categories, but it has difficulty distinguishing similar false targets in complex real traffic scenes. Min et al. [16] propose LW-RefineNet to segment the scene and obtain spatial position information at the pixel level; a constraint model is then constructed to establish the search regions. Experiments show that this method alleviates mis-detection of small traffic signs. However, it is limited to scenarios involving both sides of the road, and detection is ineffective in other scenarios (such as intersections).

Zhao et al. [29] propose a multiscale few-shot detection method based on fine-tuning, which uses residual involution blocks to construct the overall feature representation learning architecture and designs PAM to aggregate features from all levels. This method exploits shallow-feature semantic information for object localization in the first stage and is partly fine-tuned on a small balanced dataset in the second stage. However, Zhang et al. [30] visualize the feature distribution of samples in the pre-training space, showing that fine-tuning yields limited performance improvement in meta-learning and can easily increase the risk of over-fitting to base tasks. Therefore, in this work, we update the meta-learner with meta-information instead of fine-tuning. Whang et al. [31] propose a general object detection system that combines a feature-based domain attention mechanism with squeeze-and-excitation networks and assigns network activations to different domains through a learned SE adapter library, so as to automatically obtain the importance of each feature channel. The core idea of SENet is to learn the feature weights according to the loss, so that effective feature maps receive large weights and ineffective or less effective feature maps receive small weights, thereby achieving better results. However, this general detection system ignores the problem of spatial dislocation, which leads to poor performance when detecting small traffic signs against cluttered backgrounds. Han et al. [32] address the problems that arise when a detector trained on base classes generates candidate proposals for novel classes and misses high-IoU boxes in the RPN stage. A coarse-grained prototype matching network (meta RPN) is proposed, which uses a nonlinear classifier based on metric learning to replace the traditional linear object classifier and handles the similarity between anchor boxes and novel classes in query images, so as to improve the recall of the few novel-class candidate boxes.
A fine-grained prototype matching network (meta classifier) is also designed. This network has spatial feature alignment and foreground attention modules to handle the similarity between noise and novel classes, so as to enhance the overall detection accuracy. However, the meta optimizer suffers from the problem of prototype deviation. The cause of this problem is the use of an average-based method to roughly estimate the gradient when the labeled samples in each category are limited.
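
Why averaging becomes unreliable with few labeled samples can be seen in a small numerical sketch. It rests on an assumption that should be stated plainly: the "average-based" estimate is taken here to mean averaging the support feature vectors of a class into a prototype, as is common in prototype-matching methods, whereas the response itself speaks of estimating the gradient. The function names and the 2-D feature values are hypothetical and are not taken from [32] or from Meta-YOLO.

```python
def class_prototype(support_features):
    """Average the support feature vectors of one class into a prototype."""
    count = len(support_features)
    dim = len(support_features[0])
    return [sum(vec[d] for vec in support_features) / count for d in range(dim)]


def shift_from_outlier(clean_vector, outlier_vector, shots):
    """Euclidean distance the prototype moves when one of `shots` samples is an outlier."""
    clean = class_prototype([clean_vector] * shots)
    noisy = class_prototype([clean_vector] * (shots - 1) + [outlier_vector])
    return sum((a - b) ** 2 for a, b in zip(clean, noisy)) ** 0.5


# Hypothetical 2-D features: the same outlier is averaged into a 2-shot
# prototype and into a 10-shot prototype.
shift_2_shot = shift_from_outlier([0.0, 0.0], [3.0, 4.0], shots=2)
shift_10_shot = shift_from_outlier([0.0, 0.0], [3.0, 4.0], shots=10)
print(shift_2_shot, shift_10_shot)
```

Because the mean divides the outlier's contribution by the number of shots, the displacement shrinks as more labeled samples become available; with only a handful of samples a single atypical example can move the prototype substantially.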

 

Point 2: Compare your method with the latest state-of-the-art papers.

 

Response 2: We added Section 4.3.2, Comparison with State-of-the-Art Methods, to Sec. 4 to supplement the experiments.

4.3.2. Comparison with State-of-the-Art Methods

To ensure a fair comparison, in this section the proposed Meta-YOLO is compared with state-of-the-art methods: Han et al. [47], Min et al. [16], Zhang et al. [48] and Fan et al. [42]. Experiments are performed with K=10 across the MIST datasets, and the results on mAP50 and on base and novel classes are shown in Table 4. The results show that our method outperforms most algorithms. Meta-YOLO achieves 46.2 and 51.7 on mAP50, which are the best among all methods. Fan et al. [42] combine a two-way contrastive training strategy with Attention-RPN to construct an object detection framework that addresses the poor generalization of few-shot detection. Although Attention-RPN mitigates the dependence on region proposals to some extent, its framework is still RPN-based and the detection background consists of complex traffic scenes, so its performance is worse than ours when training samples are scarce. It is worth noting that the Meta-DETR framework proposed by Zhang et al. [48] is on a par with ours, and even better on novel classes. We think this is mainly due to its semantic alignment mechanism (SAM), which uses a residual connection to align high-level and low-level semantics and also acts as a regularizer.

Table 4: Comparison of performance with other detectors under 10 shots.

                     mAP                mAPbase            mAPnovel
Method/Shot          5-shot   10-shot   5-shot   10-shot   5-shot   10-shot
Han et al. [47]      29.2     38.9      46.4     56.1      10.8     12.4
Min et al. [16]      32.1     43.4      50.1     58.6      10.4     11.7
Zhang et al. [48]    46.1     51.2      58.4     68.0      37.8     42.2
Fan et al. [42]      46.0     50.2      57.0     65.4      35.8     38.3
Ours                 46.2     51.7      58.7     67.9      37.6     41.9
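
The claims drawn from Table 4 can be checked mechanically. The short sketch below copies the table values into a dictionary and reports the best-scoring method in each column; the names `COLUMNS`, `RESULTS` and `best_per_column` are introduced here purely for illustration.

```python
# Values copied from Table 4. Column order:
# mAP 5-shot, mAP 10-shot, mAPbase 5-shot, mAPbase 10-shot,
# mAPnovel 5-shot, mAPnovel 10-shot.
COLUMNS = (
    "mAP 5-shot", "mAP 10-shot",
    "mAPbase 5-shot", "mAPbase 10-shot",
    "mAPnovel 5-shot", "mAPnovel 10-shot",
)
RESULTS = {
    "Han et al. [47]": (29.2, 38.9, 46.4, 56.1, 10.8, 12.4),
    "Min et al. [16]": (32.1, 43.4, 50.1, 58.6, 10.4, 11.7),
    "Zhang et al. [48]": (46.1, 51.2, 58.4, 68.0, 37.8, 42.2),
    "Fan et al. [42]": (46.0, 50.2, 57.0, 65.4, 35.8, 38.3),
    "Ours": (46.2, 51.7, 58.7, 67.9, 37.6, 41.9),
}


def best_per_column(results, columns):
    """Return the method with the highest score for every column."""
    return {
        column: max(results, key=lambda method: results[method][index])
        for index, column in enumerate(columns)
    }


best = best_per_column(RESULTS, COLUMNS)
for column, method in best.items():
    print(f"{column}: {method}")
```

With these values, "Ours" leads both mAP columns and mAPbase at 5 shots, while Zhang et al. [48] leads mAPbase at 10 shots and both mAPnovel columns, which agrees with the authors' remark that Meta-DETR is on a par overall and ahead on novel classes.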

 

Author Response File: Author Response.docx
