Multi-Scale Feature Pyramid Network: A Heavily Occluded Pedestrian Detection Network Based on ResNet
Abstract
1. Introduction
2. Related Works
- (1) Loss Function. Wang et al. [11] proposed repulsion loss to optimize the original loss function and improve pedestrian detection in crowded scenes. Zhang et al. [10] proposed aggregation loss, which forces the bounding boxes to lie close to the ground truths and to locate compactly. Although these loss functions improve detection accuracy to some degree, overlapped proposals remain difficult to recall because traditional NMS is still used.
- (2) Part-based Anchors. Zhou et al. [31] presented a pedestrian detection method in which the whole body and the visible part of each pedestrian are located by regressing two bounding boxes. Chi et al. [32] proposed PedHunter, which can handle occlusion in pedestrian detection. However, these methods are complicated and time-consuming.
- (3) Improved Network. Pang et al. [33] proposed the mask-guided attention network (MGAN), which emphasizes the visible parts of pedestrians and adjusts the overall features to suppress invisible areas and detect occluded pedestrians. Zhang et al. [10] proposed occlusion-aware R-CNN, which divides pedestrians into several parts and extracts their features for further fusion. Wang et al. [34] used compositional convolutional neural networks to detect objects. Wu et al. [35] exploited the local temporal context of pedestrians in videos and proposed a tube feature aggregation network (TFAN) to detect occluded pedestrians. However, these methods are either too intricate to implement or not robust enough for heavily occluded pedestrians.
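The limitation noted in item (1), that traditional NMS makes overlapped proposals hard to recall, can be illustrated with a minimal sketch of greedy NMS. The function names `iou` and `greedy_nms`, the threshold of 0.5, and the toy boxes are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop every box overlapping it above thresh, repeat."""
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= thresh]  # survivors of this round
    return keep

# Hypothetical scene: boxes 0 and 1 are two pedestrians standing close together,
# box 2 is an isolated pedestrian.
boxes = np.array([[0, 0, 100, 200],
                  [30, 0, 130, 200],
                  [300, 0, 400, 200]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores, thresh=0.5)
print(kept)
```

In this toy case the second pedestrian (box 1) overlaps box 0 with an IoU of about 0.54 and is therefore discarded even though it is a correct detection, which is the recall problem described above.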
3. Materials and Methods
- (1) Intra-class occlusion. Both intra-class and inter-class occlusion are common in pedestrian detection. For intra-class occlusion, the features of targets in the same class are equally important. How well the network extracts and expresses these similar features at different levels influences the later detection; for example, more false positives occur when the features are weakly expressive. The feature extraction network should therefore not only extract and fuse high-level semantic features but also retain contour information.
- (2) Loss function related to occluded pedestrians. The loss function also influences the detection accuracy of occluded targets in pedestrian detection systems. To repel predicted boxes from the surrounding ground truths of other pedestrians in crowds, we introduce the repulsion loss [11], integrated with our minimum of two losses, to separate irrelevant boxes.
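The repelling behavior mentioned in item (2) can be sketched with the ground-truth repulsion term as we understand it from [11]: the overlap between a predicted box and a neighboring, non-target ground truth is measured by intersection over ground truth (IoG) and penalized through a smoothed logarithm. The function names, the default `sigma = 0.5`, and the example boxes are assumptions for illustration. The "minimum of two losses" combination is not specified in this section and is not reproduced here.

```python
import numpy as np

def iog(pred, gt):
    """Intersection over ground-truth area for boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return iw * ih / gt_area

def smooth_ln(x, sigma=0.5):
    """Smoothed -ln(1 - x); linear beyond sigma. Assumes 0 <= x <= 1 and 0 <= sigma < 1."""
    if x <= sigma:
        return -np.log(1.0 - x)
    return (x - sigma) / (1.0 - sigma) - np.log(1.0 - sigma)

def rep_gt_term(pred, other_gt, sigma=0.5):
    """Penalty that grows as pred overlaps a ground truth it is NOT assigned to."""
    return smooth_ln(iog(pred, other_gt), sigma)

# Hypothetical boxes: one prediction and three candidate neighboring ground truths.
pred = [0.0, 0.0, 100.0, 200.0]
far_gt = [300.0, 0.0, 400.0, 200.0]    # no overlap with pred
near_gt = [50.0, 0.0, 150.0, 200.0]    # IoG = 0.5
closer_gt = [20.0, 0.0, 120.0, 200.0]  # IoG = 0.8

for gt in (far_gt, near_gt, closer_gt):
    print(iog(pred, gt), rep_gt_term(pred, gt))
```

The penalty is zero when the prediction does not touch the neighboring pedestrian and increases with the overlap, so minimizing it pushes the predicted box away from ground truths of other pedestrians.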
3.1. Architecture Network
3.2. DFR Network
3.3. Repulsion Loss of Minimum
4. Experiments
4.1. Datasets
4.2. Detailed Settings
4.3. Experiment Platform
4.4. Comparison of Feature Maps
4.5. Ablation Study
4.6. Comparison of Previous Works
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Shen, Y.; Zhang, L.; Wang, Z.L.; Hao, X.L.; Hou, Y.L. Multi-Level Residual Up-Projection Activation Network for Image Super-Resolution. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 2841–2845.
- Rasouli, A.; Tsotsos, J.K. Autonomous vehicles that interact with pedestrians: A survey of theory and practice. IEEE Trans. Intell. Transp. Syst. 2020, 21, 900–918.
- Shao, S.; Zhao, Z.J.; Li, B.X.; Xiao, T.T.; Yu, G.; Zhang, X.Y.; Sun, J. Crowdhuman: A benchmark for detecting humans in a crowd. arXiv 2018, arXiv:1805.00123.
- Vimal, S.P.; Ajay, B.; Thiruvikiraman, P.K. Context pruned histogram of oriented gradients for pedestrian detection. In Proceedings of the 2013 International Mutli-Conference on Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), Kottayam, India, 22–23 March 2013; pp. 718–722.
- Zhuang, J. Compressive tracking based on HOG and extended Haar-like feature. In Proceedings of the 2016 2nd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 14–17 October 2016; pp. 326–331.
- Cosma, C.; Brehar, R.; Nedevschi, S. Pedestrians detection using a cascade of LBP and HOG classifiers. In Proceedings of the 2013 IEEE 9th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 5–7 September 2013; pp. 69–75.
- Dominguez-Sanchez, A.; Cazorla, M.; Orts-Escolano, S. Pedestrian Movement Direction Recognition Using Convolutional Neural Networks. IEEE Trans. Intell. Transp. Syst. 2017, 18, 3540–3548.
- Zhao, J.X.; Li, J.; Ma, Y.D. RPN+ fast boosted tree: Combining deep neural network with traditional classifier for pedestrian detection. In Proceedings of the 2018 4th International Conference on Computer and Technology Applications (ICCTA), Istanbul, Turkey, 3–5 May 2018; pp. 141–150.
- Zhang, Z.S.; Gao, J.Y.; Mao, J.H.; Liu, Y.K. STINet: Spatio-Temporal-Interactive Network for Pedestrian Detection and Trajectory Prediction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11343–11352.
- Zhang, S.F.; Wen, L.Y.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-Aware R-CNN: Detecting Pedestrians in a Crowd. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 657–674.
- Wang, X.L.; Xiao, T.T.; Jiang, Y.N.; Shao, S.; Sun, J.; Shen, C.H. Repulsion Loss: Detecting Pedestrians in a Crowd. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7774–7783.
- Shao, X.T.; Wei, C.K.; Shen, Y.; Wang, Z.L. Feature Enhancement Based on CycleGAN for Nighttime Vehicle Detection. IEEE Access 2021, 9, 849–859.
- Ke, W.; Zhang, T.L.; Huang, Z.Y.; Ye, Q.X.; Liu, J.Z.; Huang, D. Multiple Anchor Learning for Visual Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10203–10212.
- Chen, Y.H.; Cao, Y.; Hu, H.; Wang, L.W. Memory Enhanced Global-Local Aggregation for Video Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10334–10343.
- Li, Y.; Wang, T.; Kang, B.Y.; Tang, S.; Wang, C.F.; Li, J.T.; Feng, J.S. Overcoming Classifier Imbalance for Long-Tail Object Detection with Balanced Group Softmax. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10988–10997.
- Shen, Y.; Luo, M.; Chen, Y.; Shao, X.T.; Wang, Z.L.; Hao, X.L.; Hou, Y.L. Cross-View Image Translation Based on Local and Global Information Guidance. IEEE Access 2021, 9, 12955–12967.
- Tan, M.X.; Pang, R.M.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
- Zhu, C.C.; He, Y.H.; Savvides, M. Feature Selective Anchor-Free Module for Single-Shot Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 840–849.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.M.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 318–327.
- Tan, M.X.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 10691–10700.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- He, K.M.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Cai, Z.W.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019.
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS-Improving Object Detection with One Line of Code. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5562–5570.
- He, Y.H.; Zhu, C.C.; Wang, J.R.; Savvides, M.; Zhang, X.Y. Bounding Box Regression with Uncertainty for Accurate Object Detection. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2883–2892.
- Liu, S.T.; Huang, D.; Wang, Y.H. Adaptive NMS: Refining Pedestrian Detection in a Crowd. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6452–6461.
- Zhou, C.L.; Yuan, J.S. Bi-box Regression for Pedestrian Detection and Occlusion Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 138–154.
- Chi, C.; Zhang, S.F.; Xing, J.L.; Lei, Z.; Li, S.Z.; Zou, X.D. PedHunter: Occlusion Robust Pedestrian Detector in Crowded Scenes. In Proceedings of the 2020 AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 10639–10646.
- Pang, Y.W.; Xie, J.; Khan, M.H.; Anwer, R.M.; Khan, F.S.; Shao, L. Mask-Guided Attention Network for Occluded Pedestrian Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 4966–4974.
- Wang, A.T.; Sun, Y.H.; Kortylewski, A.; Yuille, A. Robust Object Detection under Occlusion with Context-Aware CompositionalNets. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12642–12651.
- Wu, J.L.; Zhou, C.L.; Yang, M.; Zhang, Q.; Li, Y.; Yuan, J.S. Temporal-Context Enhanced Detection of Heavily Occluded Pedestrians. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 13427–13436.
- Cao, J.L.; Pang, Y.W.; Han, J.G.; Gao, B.L.; Li, X.L. Taking a Look at Small-Scale Pedestrians and Occluded Pedestrians. IEEE Trans. Image Process. 2020, 29, 3143–3152.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.M.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
- Liu, S.; Qi, L.; Qin, H.F.; Shi, J.P.; Jia, J.Y. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Sifre, L. Rigid-Motion Scattering for Image Classification. Ph.D. Thesis, Ecole Polytechnique, Palaiseau, France, 2014; pp. 111–114.
- Chu, X.G.; Zheng, A.L.; Zhang, X.Y.; Sun, J. Detection in Crowded Scenes: One Proposal, Multiple Predictions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12211–12220.
Heatmaps | ResNet50 | FPN | BiFPN | Double FPN | Separable Convolution(SC) | General Convolution(GC) |
---|---|---|---|---|---|---|
(a) | ✓ | ✓ | ✓ | |||
(b) | ✓ | ✓ | ✓ | |||
(c) | ✓ | ✓ | ✓ | |||
(d) | ✓ | ✓ | ✓ | |||
(e) | ✓ | ✓ | ✓ |
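The table above contrasts Separable Convolution (SC) with General Convolution (GC). Assuming SC refers to a depthwise separable convolution (a depthwise k×k convolution followed by a 1×1 pointwise convolution), the difference in weight count can be sketched as follows. The function names and the choice of 256 channels with a 3×3 kernel are illustrative assumptions, and biases are ignored.

```python
def general_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k):
    """Weights of a depthwise k x k conv plus a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise

# Hypothetical layer size, chosen only for illustration.
gc = general_conv_params(256, 256, 3)
sc = separable_conv_params(256, 256, 3)
print(gc, sc, round(gc / sc, 2))
```

Under these assumptions the separable layer uses roughly one ninth of the weights, which is the usual reason for trading a general convolution for a separable one; the accuracy side of that trade is what the ablation results report.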
Method | AP (%) | MR⁻² (%) | JI (%) |
---|---|---|---|
ResNet50 + FPN baseline (impl. by [41]) | 85.8 | 42.9 | 79.8 |
ResNet50 + FPN (impl. by [41]) | 90.3 | 42.2 | 82.0 |
ResNet50 + BiFPN (Separable) | 89.81 | 42.81 | 81.75 |
ResNet50 + BiFPN | 90.64 | 40.82 | 82.83 |
ResNet50 + DoubleFPN (Separable w/o RLM) | 90.76 | 41.05 | 82.83 |
ResNet50 + DoubleFPN (w/o RLM) | 90.88 | 40.52 | 82.92 |
ResNet50 + DoubleFPN (with RLM) | 90.96 | 40.24 | 83.12 |
Dataset | Method | AP (%) | MR⁻² (%) | JI (%) |
---|---|---|---|---|
CrowdHuman | FPN baseline (impl. by [41]) | 85.8 | 42.9 | 79.8 |
CrowdHuman | ResNet50 + FPN (impl. by [41]) | 90.3 | 42.2 | 82.0 |
CrowdHuman | FPN + Soft-NMS [28] | 88.2 | 42.9 | 79.8 |
CrowdHuman | Adaptive NMS [30] | 84.7 | 49.7 | — |
CrowdHuman | Cascade R-CNN [27] (impl. by [41]) | 85.6 | 43.0 | 80.6 |
CrowdHuman | Ours | 90.96 | 40.24 | 83.12 |
CityPersons | FPN baseline [41] | 95.2 | 11.7 | — |
CityPersons | FPN + Soft-NMS [28] | 95.3 | 11.8 | — |
CityPersons | ResNet50 + FPN [41] | 96.1 | 10.7 | — |
CityPersons | Ours | 96.23 | 10.64 | — |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Shao, X.; Wang, Q.; Yang, W.; Chen, Y.; Xie, Y.; Shen, Y.; Wang, Z. Multi-Scale Feature Pyramid Network: A Heavily Occluded Pedestrian Detection Network Based on ResNet. Sensors 2021, 21, 1820. https://doi.org/10.3390/s21051820