Dynamic Label Assignment for Object Detection by Combining Predicted IoUs and Anchor IoUs
Abstract
1. Introduction
2. Related Works
2.1. Object Detection
2.2. Label Assignment
3. Proposed Approach
3.1. Revisit ATSS
- 1: Compute the Euclidean distances between the centers of all anchors and the center of the ground-truth (GT) box.
- 2: Select the k anchors with the smallest distances on each feature pyramid level; with L feature levels, the total number of candidate anchors for the GT is k×L.
- 3: Compute the IoUs between the candidate anchors and the GT box, then calculate the mean and standard deviation of those IoUs.
- 4: The adaptive threshold is mean + std. An anchor whose IoU with the GT is larger than or equal to this threshold is a candidate positive for the GT; otherwise, it is a negative.
- 5: Only candidate positives whose centers lie inside the GT box become final positives of the GT; all others are negatives.
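The steps above can be sketched in NumPy for a single GT box (function and variable names are illustrative, not taken from the authors' code):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def atss_assign(anchor_centers, anchor_boxes, gt_box, num_levels, k=9):
    """ATSS positive-sample selection for one GT box.

    anchor_centers: (N, 2) anchor centers, concatenated over pyramid levels
    anchor_boxes:   (N, 4) anchors as (x1, y1, x2, y2)
    gt_box:         (4,) GT box as (x1, y1, x2, y2)
    num_levels:     per-level anchor counts (sum == N)
    """
    gt_center = np.array([(gt_box[0] + gt_box[2]) / 2,
                          (gt_box[1] + gt_box[3]) / 2])
    # Step 1: Euclidean distance from each anchor center to the GT center.
    dists = np.linalg.norm(anchor_centers - gt_center, axis=1)

    # Step 2: the k closest anchors on each level -> k*L candidates in total.
    candidates, start = [], 0
    for n in num_levels:
        candidates.append(np.argsort(dists[start:start + n])[:k] + start)
        start += n
    candidates = np.concatenate(candidates)

    # Step 3: IoU statistics of the candidates.
    ious = np.array([iou(anchor_boxes[i], gt_box) for i in candidates])
    thr = ious.mean() + ious.std()  # Step 4: adaptive threshold mean + std

    # Steps 4-5: keep candidates at/above the threshold whose centers
    # fall inside the GT box; everything else is negative.
    pos = []
    for i, v in zip(candidates, ious):
        cx, cy = anchor_centers[i]
        if v >= thr and gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]:
            pos.append(i)
    return np.array(pos, dtype=int)
```

Because the threshold adapts to the IoU statistics of each GT, ATSS needs no hand-tuned fixed IoU cutoff; k is its only notable hyperparameter.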
3.2. Dynamic ATSS
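The central change in Dynamic ATSS is to replace the static anchor IoUs (AIoUs) used in ATSS's thresholding step with a combination of AIoUs and predicted IoUs (PIoUs), i.e., IoUs between the network's current box predictions and the GT. A minimal sketch, assuming a simple weighted sum with the weighting factors ablated in Section 4.1.3 (the function name and the exact combination rule below are illustrative assumptions, not the authors' code):

```python
import numpy as np

def combined_ious(predicted_ious, anchor_ious, w_p=1.0, w_a=1.0):
    """Hypothetical blend of dynamic predicted IoUs with static anchor IoUs.

    predicted_ious: IoUs between current predicted boxes and the GT; these
                    change every iteration, so the assignment adapts as the
                    detector improves.
    anchor_ious:    IoUs between preset anchors and the GT; a stable prior
                    that keeps the assignment sensible early in training,
                    when predictions are still unreliable.
    w_p, w_a:       weighting factors, as ablated in Section 4.1.3.
    """
    return w_p * np.asarray(predicted_ious) + w_a * np.asarray(anchor_ious)
```

The combined scores would then stand in for the plain anchor IoUs in steps 3 and 4 of the ATSS procedure (the mean + std thresholding), so the set of positives evolves with the predictions instead of being fixed by anchor geometry alone.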
3.3. Soft Targets for Classification Loss
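Soft classification targets replace the hard 0/1 positive label with a continuous quality score such as the IoU. In the spirit of the quality focal loss (QFL) from the Generalized Focal Loss paper in the references, this might be sketched as follows (the formula follows GFL's QFL definition; the scalar NumPy implementation is illustrative):

```python
import numpy as np

def quality_focal_loss(pred_score, soft_target, beta=2.0):
    """Quality focal loss: binary cross-entropy against a continuous
    IoU target y in [0, 1], scaled by |y - p|^beta so that samples the
    model already estimates well are down-weighted."""
    p = np.clip(pred_score, 1e-7, 1 - 1e-7)  # avoid log(0)
    y = soft_target
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.abs(y - p) ** beta * bce
```

The loss vanishes when the predicted score matches the soft target exactly and grows with the gap, which encourages the classification branch to predict localization quality rather than bare class membership.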
4. Experiments
4.1. Ablation Study
4.1.1. The Effectiveness of Proposed Method
4.1.2. The Contribution of Each Element
4.1.3. Balancing Predicted IoUs and Anchor IoUs
4.2. Application to the State-of-the-Art
4.3. Comparison to the State-of-the-Art
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Nguyen, C.C.; Tran, G.S.; Burie, J.C.; Nghiem, T.P. Pulmonary Nodule Detection Based on Faster R-CNN With Adaptive Anchor Box. IEEE Access 2021, 9, 154740–154751.
- Zhang, H.; Zhou, X.; Lan, X.; Li, J.; Tian, Z.; Zheng, N. A real-time robotic grasping approach with oriented anchor box. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 3014–3025.
- Dewi, C.; Chen, R.C.; Liu, Y.T.; Liu, Y.S.; Jiang, L.Q. Taiwan stop sign recognition with customize anchor. In Proceedings of the 12th International Conference on Computer Modeling and Simulation, Brisbane, Australia, 22–24 June 2020; pp. 51–55.
- Bharati, S.P.; Wu, Y.; Sui, Y.; Padgett, C.; Wang, G. Real-time obstacle detection and tracking for sense-and-avoid mechanism in UAVs. IEEE Trans. Intell. Veh. 2018, 3, 185–197.
- Zhang, T.; Zhang, X.; Yang, Y.; Wang, Z.; Wang, G. Efficient Golf Ball Detection and Tracking Based on Convolutional Neural Networks and Kalman Filter. arXiv 2020, arXiv:2012.09393.
- Cen, F.; Zhao, X.; Li, W.; Wang, G. Deep feature augmentation for occluded image classification. Pattern Recognit. 2021, 111, 107737.
- Patel, K.; Wang, G. A discriminative channel diversification network for image classification. Pattern Recognit. Lett. 2022, 153, 176–182.
- Ma, W.; Tu, X.; Luo, B.; Wang, G. Semantic clustering based deduction learning for image recognition and classification. Pattern Recognit. 2022, 124, 108440.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- He, L.; Lu, J.; Wang, G.; Song, S.; Zhou, J. SOSD-Net: Joint semantic object segmentation and depth estimation from monocular images. Neurocomputing 2021, 440, 251–263.
- Hemmati, M.; Biglari-Abhari, M.; Niar, S. Adaptive real-time object detection for autonomous driving systems. J. Imaging 2022, 8, 106.
- Li, K.; Fathan, M.I.; Patel, K.; Zhang, T.; Zhong, C.; Bansal, A.; Rastogi, A.; Wang, J.S.; Wang, G. Colonoscopy Polyp Detection and Classification: Dataset Creation and Comparative Evaluations. arXiv 2021, arXiv:2104.10824.
- Gosavi, D.; Cheatham, B.; Sztuba-Solinska, J. Label-Free Detection of Human Coronaviruses in Infected Cells Using Enhanced Darkfield Hyperspectral Microscopy (EDHM). J. Imaging 2022, 8, 24.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9759–9768.
- Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. AutoAssign: Differentiable label assignment for dense object detection. arXiv 2020, arXiv:2007.03496.
- Ge, Z.; Wang, J.; Huang, X.; Liu, S.; Yoshie, O. LLA: Loss-aware label assignment for dense pedestrian detection. arXiv 2021, arXiv:2101.04307.
- Ke, W.; Zhang, T.; Huang, Z.; Ye, Q.; Liu, J.; Huang, D. Multiple anchor learning for visual object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10206–10215.
- Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. arXiv 2020, arXiv:2006.04388.
- Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. VarifocalNet: An IoU-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8514–8523.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
- Li, K.; Ma, W.; Sajid, U.; Wu, Y.; Wang, G. Object detection with convolutional neural networks. In Deep Learning in Computer Vision; CRC Press: Boca Raton, FL, USA, 2020; pp. 41–62.
- Ma, W.; Wu, Y.; Cen, F.; Wang, G. MDFN: Multi-scale deep feature learning network for object detection. Pattern Recognit. 2020, 100, 107149.
- Xu, W.; Wu, Y.; Ma, W.; Wang, G. Adaptively denoising proposal collection for weakly supervised object localization. Neural Process. Lett. 2020, 51, 993–1006.
- Mo, X.; Sajid, U.; Wang, G. Stereo frustums: A siamese pipeline for 3D object detection. J. Intell. Robot. Syst. 2021, 101, 1–15.
- Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2016; pp. 354–370.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the 2016 Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 379–387.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6569–6578.
- Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 850–859.
- Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9627–9636.
- Patel, K.; Bur, A.M.; Li, F.; Wang, G. Aggregating Global Features into Local Vision Transformer. arXiv 2022, arXiv:2201.12903.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 213–229.
- Zheng, M.; Gao, P.; Zhang, R.; Li, K.; Wang, X.; Li, H.; Dong, H. End-to-end object detection with adaptive clustering transformer. arXiv 2020, arXiv:2011.09315.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Ma, W.; Zhang, T.; Wang, G. Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence. arXiv 2021, arXiv:2112.13310.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 2017 Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3651–3660.
- Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast convergence of DETR with spatially modulated co-attention. arXiv 2021, arXiv:2101.07448.
- Kim, K.; Lee, H.S. Probabilistic anchor assignment with IoU prediction for object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 355–371.
- Zhang, X.; Wan, F.; Liu, C.; Ji, R.; Ye, Q. FreeAnchor: Learning to match anchors for visual object detection. arXiv 2019, arXiv:1909.02466.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Li, X.; Wang, W.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–24 June 2021; pp. 11632–11641.
- Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6054–6063.
- Zhu, C.; Chen, F.; Shen, Z.; Savvides, M. Soft anchor-point object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 91–107.
- Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. RepPoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9657–9666.
Model | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|
ATSS | 39.06 | 57.11 | 42.49 | 22.33 | 43.27 | 50.23 |
ATSS+CIoUs | 39.75 | 57.43 | 43.08 | 23.03 | 43.83 | 52.27 |
ATSS+QFL | 39.61 | 57.41 | 43.05 | 23.25 | 43.69 | 52.19 |
ATSS+VFL | 39.65 | 57.38 | 43.39 | 23.45 | 43.53 | 52.19 |
AIoUs | PIoUs | QFL | VFL | Centerness Branch | IoU Branch | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|---|---|---|---|
✓ | ✓ | 39.06 | 57.11 | 42.49 | 22.33 | 43.27 | 50.23 | ||||
✓ | ✓ | 29.39 | 46.77 | 31.13 | 21.57 | 28.38 | 37.08 | ||||
✓ | ✓ | ✓ | 39.75 | 57.43 | 43.08 | 23.03 | 43.83 | 52.27 | |||
✓ | ✓ | ✓ | ✓ | 40.07 | 57.46 | 43.73 | 23.47 | 44.30 | 52.60 | ||
✓ | ✓ | ✓ | ✓ | 39.83 | 57.45 | 43.15 | 22.75 | 44.22 | 52.88 | ||
✓ | ✓ | ✓ | ✓ | 40.30 | 57.49 | 44.00 | 22.85 | 44.48 | 53.71 | ||
✓ | ✓ | ✓ | ✓ | 40.15 | 57.37 | 43.64 | 23.51 | 44.09 | 53.21 |
PIoUs | AIoUs | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
1 | 1 | 39.75 | 57.43 | 43.08 | 23.03 | 43.83 | 52.27 |
0.5 | 1 | 39.43 | 57.23 | 42.92 | 23.04 | 43.56 | 51.31 |
1.5 | 1 | 39.55 | 57.38 | 42.63 | 23.15 | 43.28 | 51.60 |
1 | 0.5 | 39.42 | 57.13 | 42.46 | 22.95 | 43.04 | 51.03 |
1 | 1.5 | 39.51 | 57.27 | 42.68 | 22.94 | 43.31 | 51.72 |
D_up | 1 | 39.22 | 56.92 | 42.50 | 22.74 | 42.98 | 51.35 |
1 | D_down | 35.05 | 53.33 | 37.58 | 22.44 | 35.76 | 46.27 |
D_up | D_down | 35.64 | 53.78 | 38.09 | 22.68 | 36.57 | 48.12 |
Model | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|
GFLV2 | 40.6 | 58.1 | 44.4 | 22.9 | 44.2 | 52.6 |
GFLV2+CIoUs | 41.1 | 58.6 | 44.8 | 23.7 | 44.3 | 53.9 |
VFL | 41.3 | 59.2 | 44.8 | 24.5 | 44.9 | 54.2 |
VFL+CIoUs | 41.6 | 59.5 | 44.9 | 24.3 | 45.1 | 54.6 |
Method | Backbone | Scheduler | Time | MStrain | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|---|---|---|
FreeAnchor [48] | ResNet-101 | 2x | - | ✓ | 43.1 | 62.2 | 46.4 | 24.5 | 46.1 | 54.8 |
FreeAnchor [48] | ResNeXt-101-32x8d | 2x | - | ✓ | 44.9 | 64.3 | 48.5 | 26.8 | 48.3 | 55.9 |
TridentNet [51] | ResNet-101 | 2x | - | ✓ | 42.7 | 63.6 | 46.5 | 23.9 | 46.6 | 56.6 |
FCOS [38] | ResNet-101 | 2x | - | ✓ | 41.5 | 60.7 | 45.0 | 24.4 | 44.8 | 51.6 |
FCOS [38] | ResNeXt-101-64x4d | 2x | - | ✓ | 44.7 | 64.1 | 48.4 | 27.6 | 47.5 | 55.6 |
SAPD [52] | ResNet-101 | 2x | - | ✓ | 43.5 | 63.6 | 46.5 | 24.9 | 46.8 | 54.6 |
SAPD [52] | ResNet-101-DCN | 2x | - | ✓ | 46.0 | 65.9 | 49.6 | 26.3 | 49.2 | 59.6 |
RepPoints [53] | ResNet-101 | 2x | - | | 41.0 | 62.9 | 44.3 | 23.6 | 44.1 | 51.7 |
RepPoints [53] | ResNet-101-DCN | 2x | - | ✓ | 45.0 | 66.1 | 49.0 | 26.6 | 48.6 | 57.5 |
GFL [22] | ResNet-101 | 2x | - | ✓ | 45.0 | 63.7 | 48.9 | 27.2 | 48.8 | 54.5 |
GFL [22] | ResNet-101-DCN | 2x | - | ✓ | 47.3 | 66.3 | 51.4 | 28.0 | 51.1 | 59.2 |
ATSS * [18] | ResNet-50 | 1x | 80 ms | | 39.2 | 57.5 | 42.6 | 22.3 | 41.9 | 49.0 |
ATSS * [18] | ResNet-50-DCN | 1x | 96 ms | | 43.0 | 61.2 | 46.8 | 24.5 | 45.9 | 55.3 |
ATSS [18] | ResNet-101 | 2x | 105 ms | ✓ | 43.6 | 62.1 | 47.4 | 26.1 | 47.0 | 53.6 |
ATSS [18] | ResNet-101-DCN | 2x | 131 ms | ✓ | 46.3 | 64.7 | 50.4 | 27.7 | 49.8 | 58.4 |
ATSS [18] | ResNeXt-64x4d-101 | 2x | 191 ms | ✓ | 45.6 | 64.6 | 49.7 | 28.5 | 48.9 | 55.6 |
ATSS [18] | ResNeXt-32x8d-101-DCN | 2x | 225 ms | ✓ | 47.7 | 66.6 | 52.1 | 29.3 | 50.8 | 59.7 |
ATSS [18] | ResNeXt-64x4d-101-DCN | 2x | 236 ms | ✓ | 47.7 | 66.5 | 51.9 | 29.7 | 50.8 | 59.4 |
Dynamic ATSS | ResNet-50 | 1x | 80 ms | | 40.3 | 57.9 | 44.1 | 22.5 | 43.6 | 51.2 |
Dynamic ATSS | ResNet-50-DCN | 1x | 96 ms | | 44.4 | 61.9 | 48.6 | 25.1 | 47.8 | 58.1 |
Dynamic ATSS | ResNet-101 | 2x | 105 ms | ✓ | 44.7 | 62.5 | 48.9 | 26.7 | 48.3 | 55.7 |
Dynamic ATSS | ResNet-101-DCN | 2x | 131 ms | ✓ | 47.3 | 65.0 | 51.7 | 28.3 | 50.8 | 60.4 |
Dynamic ATSS | ResNeXt-64x4d-101 | 2x | 191 ms | ✓ | 46.5 | 64.7 | 50.8 | 29.1 | 49.7 | 57.6 |
Dynamic ATSS | ResNeXt-32x8d-101-DCN | 2x | 225 ms | ✓ | 48.6 | 66.7 | 53.0 | 29.9 | 51.5 | 61.8 |
Dynamic ATSS | ResNeXt-64x4d-101-DCN | 2x | 236 ms | ✓ | 48.6 | 66.7 | 52.9 | 29.4 | 51.9 | 61.3 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, T.; Luo, B.; Sharda, A.; Wang, G. Dynamic Label Assignment for Object Detection by Combining Predicted IoUs and Anchor IoUs. J. Imaging 2022, 8, 193. https://doi.org/10.3390/jimaging8070193