Multi-Scale Residual Aggregation Feature Pyramid Network for Object Detection
Round 1
Reviewer 1 Report
The introduction provides a lot of information without proper explanation. Figures 1, 2 and 3 are meaningless without a description of what is shown in those images.
In the experiment description you use the terms "Method" and "Backbone", which, in my opinion, are not explained well.
Please correct the references so that they are more uniform - for some entries there is only a list of authors and a title. Please also provide full links as text for some of the references.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The paper investigates the design of a feature pyramid module applicable to different object detection architectures.
The main weakness of this paper is the lack of novelty. The proposed feature pyramid module, called UCLRAM, is similar to the previously proposed BiFPN [1], DLA [2], and others. These related approaches are indeed mentioned in the paper; however, I would encourage the authors to discuss the differences more thoroughly. Furthermore, I am struggling to see the difference between the proposed CGM and the squeeze-and-excitation module from the literature [3]. Similarly, I suggest discussing the differences. If there are no differences, I suggest referring to this module as the squeeze-and-excitation module. Introducing a new name for an existing module does not help the community, but leads to confusion.
The paper presents a strong evaluation study on two datasets: Pascal VOC and the proposed TKFD. To the best of my knowledge, there are no datasets similar to the proposed TKFD, which I find very interesting. However, at present, both datasets are limited in size. I think the evaluation section would be even stronger if results on COCO [4] were presented.
I am a little bit confused about the spatial resolutions of the features in the feature pyramid. The formula P^in = (P^in_2, ..., P^in_4) suggests that there are only three levels in the feature pyramid, and that the most condensed level is at 1/16 of the input image resolution. I believe that the default feature pyramid in the literature contains four levels and that the most condensed features are at 1/32 of the input image resolution. Furthermore, I am not sure that this notation is followed in the rest of the paper. For example, in Figure 4 I am not sure whether the outputs of C_4 are at 1/16 or 1/32 of the input resolution. I recommend clarifying this.
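To make the resolution concern concrete, here is a minimal sketch (not code from the paper) of the usual convention, under which level P_k of a pyramid has stride 2^k, so a three-level pyramid P2..P4 bottoms out at 1/16 of the input resolution while the common four-level pyramid P2..P5 reaches 1/32:

```python
# Illustrative sketch of the standard FPN stride convention: level P_k
# has stride 2**k, so its feature map is the input downsampled by 2**k.
def pyramid_resolutions(input_hw, levels=(2, 3, 4, 5)):
    """Return {level: (H, W)} for each pyramid level of a given input size."""
    h, w = input_hw
    return {k: (h // 2 ** k, w // 2 ** k) for k in levels}

# Four levels P2..P5: most condensed map is at 1/32 of the input.
print(pyramid_resolutions((512, 512)))
# Three levels P2..P4, as the formula in the paper suggests: 1/16 at most.
print(pyramid_resolutions((512, 512), levels=(2, 3, 4)))
```

For a 512x512 input this yields 128x128, 64x64, 32x32 (and 16x16 for P5), which is why the notation P^in_2, ..., P^in_4 implies the deepest features stop at 1/16 rather than 1/32.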
[1] Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, pp. 10778–10787. https://doi.org/10.1109/CVPR42600.2020.01079.
[2] Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep Layer Aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2403–2412.
[3] Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[4] Lin, T.-Y.; et al. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision. Springer, Cham, 2014, pp. 740–755.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf