Extreme R-CNN: Few-Shot Object Detection via Sample Synthesis and Knowledge Distillation
Abstract
1. Introduction
- By synthesizing new samples through data augmentation of object instances and fine-tuning the modified Faster R-CNN model on these samples, we significantly improve average precision on novel categories (a schematic sketch of this idea follows this list).
- Using knowledge distillation, we enhance the model's recognition of novel categories while keeping its performance on base categories intact.
- We validate the effectiveness of our approach through extensive experiments on the Microsoft COCO and PASCAL VOC datasets.
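The first contribution can be illustrated with a short sketch: crop each annotated object instance and generate augmented copies of it for fine-tuning. This is only a schematic reading of the approach described above, not the authors' released code; the augmentation pipeline, crop size, and `n_per_instance` parameter are illustrative choices of ours.

```python
# Illustrative sketch of instance-level sample synthesis (not the paper's code).
# Assumes PIL images and boxes in (x1, y1, x2, y2) pixel coordinates.
from PIL import Image
from torchvision import transforms

# Hypothetical augmentation pipeline: flip, scale jitter, and color jitter.
AUGMENT = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

def synthesize_samples(image: Image.Image, boxes, n_per_instance: int = 5):
    """Crop each annotated instance and return augmented copies of it."""
    synthesized = []
    for (x1, y1, x2, y2) in boxes:
        instance = image.crop((x1, y1, x2, y2))
        synthesized.extend(AUGMENT(instance) for _ in range(n_per_instance))
    return synthesized
```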
2. Related Work
2.1. Few-Shot Object Detection
2.2. Sample Synthesis
2.3. Knowledge Distillation
3. Method
3.1. Overview of Our Proposed Model (Extreme R-CNN)
3.2. Synthesizing Samples
3.3. Knowledge Distillation for Base Categories
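The body of this section is not reproduced here. As a hedged sketch, distillation for base categories can be implemented with the standard soft-label KL objective of Hinton et al. [1], computed between the frozen base-trained teacher's and the fine-tuned student's per-RoI classification logits; the temperature and any loss weighting are assumed hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label KL distillation (Hinton et al. [1]); logits are (N, num_base_classes)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```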
3.4. Decoupled Classification and Regression Heads
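Decoupling here means the classification and regression heads no longer share a single FC trunk on top of the RoI features. A minimal sketch of such a head pair follows; the hidden width, depth, and class count are illustrative, not the paper's configuration.

```python
import torch.nn as nn

class DecoupledHeads(nn.Module):
    """Separate feature branches for classification and box regression."""
    def __init__(self, in_dim: int = 1024, hidden: int = 1024, num_classes: int = 21):
        super().__init__()
        self.cls_branch = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_classes))
        self.reg_branch = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 4 * num_classes))

    def forward(self, roi_feats):
        # roi_feats: (N, in_dim) pooled RoI features; each branch learns its own representation.
        return self.cls_branch(roi_feats), self.reg_branch(roi_feats)
```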
3.5. Using Siamese Network and Triplet Loss in Classification Head
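For a Siamese classification head, a standard choice is the FaceNet-style triplet loss [2] on embedded RoI features, pulling same-category pairs together and pushing different-category pairs apart by at least a margin. The sketch below assumes L2-normalized embeddings and an illustrative margin of 0.5; it shows the generic loss, not necessarily the exact variant used in the paper.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """FaceNet-style triplet loss [2] on (N, D) embeddings."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    d_ap = (a - p).pow(2).sum(dim=1)  # squared distance anchor-positive
    d_an = (a - n).pow(2).sum(dim=1)  # squared distance anchor-negative
    # Penalize triplets where the negative is not at least `margin` farther away.
    return F.relu(d_ap - d_an + margin).mean()
```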
4. Experiments
4.1. Experiment Configuration
4.1.1. Datasets
4.1.2. Evaluation Metrics
4.1.3. Implementation Details
4.2. Main Results
4.2.1. PASCAL VOC
4.2.2. Microsoft COCO
4.3. Ablation Study
4.3.1. Synthesizing Samples
4.3.2. Knowledge Distillation
4.3.3. Siamese Network and Triplet Loss (SNTL)
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
2. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
3. Sun, Y.; Chen, Y.; Wang, X.; Tang, X. Deep learning face representation by joint identification-verification. Adv. Neural Inf. Process. Syst. 2014, 27, 1988–1996.
4. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3733–3742.
5. Xie, J.; Zhan, X.; Liu, Z.; Ong, Y.S.; Loy, C.C. Delving into inter-image invariance for unsupervised visual representations. Int. J. Comput. Vis. 2022, 130, 2994–3013.
6. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 1597–1607.
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
8. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136.
9. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957.
10. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
11. Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. FSCE: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7352–7362.
12. Wu, J.; Liu, S.; Huang, D.; Wang, Y. Multi-scale positive sample refinement for few-shot object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVI; Springer: Berlin/Heidelberg, Germany, 2020; pp. 456–472.
13. Zhang, W.; Wang, Y.X. Hallucination improves few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13008–13017.
14. Zhu, C.; Chen, F.; Ahmed, U.; Shen, Z.; Savvides, M. Semantic relation reasoning for shot-stable few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8782–8791.
15. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
16. Fan, Q.; Zhuo, W.; Tang, C.K.; Tai, Y.W. Few-shot object detection with attention-RPN and multi-relation detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4013–4022.
17. Han, G.; He, Y.; Huang, S.; Ma, J.; Chang, S.F. Query adaptive few-shot object detection with heterogeneous graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3263–3272.
18. Han, G.; Huang, S.; Ma, J.; He, Y.; Chang, S.F. Meta Faster R-CNN: Towards accurate few-shot object detection with attentive feature alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; Volume 36, pp. 780–789.
19. Han, G.; Chen, L.; Ma, J.; Huang, S.; Chellappa, R.; Chang, S.F. Multi-modal few-shot object detection with meta-learning-based cross-modal prompting. arXiv 2022, arXiv:2204.07841.
20. Hsieh, T.I.; Lo, Y.C.; Chen, H.T.; Liu, T.L. One-shot object detection with co-attention and co-excitation. Adv. Neural Inf. Process. Syst. 2019, 32, 2725–2734.
21. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8420–8429.
22. Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta R-CNN: Towards general solver for instance-level low-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9577–9586.
23. Xiao, Y.; Lepetit, V.; Marlet, R. Few-shot object detection and viewpoint estimation for objects in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3090–3106.
24. Yuan, D.; Zhang, H.; Shu, X.; Liu, Q.; Chang, X.; He, Z.; Shi, G. An attention mechanism based AVOD network for 3D vehicle detection. IEEE Trans. Intell. Veh. 2023, 8, 1–13.
25. Chen, D.J.; Hsieh, H.Y.; Liu, T.L. Adaptive image transformer for one-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12247–12256.
26. Chen, T.I.; Liu, Y.C.; Su, H.T.; Chang, Y.C.; Lin, Y.H.; Yeh, J.F.; Chen, W.C.; Hsu, W.H. Dual-awareness attention for few-shot object detection. IEEE Trans. Multimed. 2021, 25, 291–301.
27. Doersch, C.; Gupta, A.; Zisserman, A. CrossTransformers: Spatially-aware few-shot transfer. Adv. Neural Inf. Process. Syst. 2020, 33, 21981–21993.
28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
29. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
30. Liu, Q.; Pi, J.; Gao, P.; Yuan, D. STFNet: Self-supervised transformer for infrared and visible image fusion. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 1513–1526.
31. Dhillon, G.S.; Chaudhari, P.; Ravichandran, A.; Soatto, S. A baseline for few-shot image classification. arXiv 2019, arXiv:1909.02729.
32. Nilsson, J.; Andersson, P.; Gu, I.Y.H.; Fredriksson, J. Pedestrian detection using augmented training data. In Proceedings of the 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; IEEE: New York, NY, USA, 2014; pp. 4548–4553.
33. Wong, S.C.; Gatt, A.; Stamatescu, V.; McDonnell, M.D. Understanding data augmentation for classification: When to warp? In Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia, 30 November–2 December 2016; IEEE: New York, NY, USA, 2016; pp. 1–6.
34. Lemley, J.; Bazrafkan, S.; Corcoran, P. Smart augmentation learning an optimal data augmentation strategy. IEEE Access 2017, 5, 5858–5869.
35. Shijie, J.; Ping, W.; Peiyi, J.; Siping, H. Research on data augmentation for image classification based on convolution neural networks. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; IEEE: New York, NY, USA, 2017; pp. 4165–4170.
36. Fujita, K.; Kobayashi, M.; Nagao, T. Data augmentation using evolutionary image processing. In Proceedings of the 2018 Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, 10–13 December 2018; IEEE: New York, NY, USA, 2018; pp. 1–6.
37. Lei, C.; Hu, B.; Wang, D.; Zhang, S.; Chen, Z. A preliminary study on data augmentation of deep learning for image classification. In Proceedings of the 11th Asia-Pacific Symposium on Internetware, Fukuoka, Japan, 28–29 October 2019; pp. 1–6.
38. Namozov, A.; Im Cho, Y. An improvement for medical image analysis using data enhancement techniques in deep learning. In Proceedings of the 2018 International Conference on Information and Communication Technology Robotics (ICT-ROBOT), Busan, Republic of Korea, 6–8 September 2018; IEEE: New York, NY, USA, 2018; pp. 1–3.
39. Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 535–541.
40. Ba, J.; Caruana, R. Do deep nets really need to be deep? Adv. Neural Inf. Process. Syst. 2014, 27, 2654–2662.
41. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
42. Zhang, S.; Wang, W.; Li, H.; Zhang, S. Bounding convolutional network for refining object locations. Neural Comput. Appl. 2023, 35, 19297–19313.
43. Wu, Y.; Chen, Y.; Yuan, L.; Liu, Z.; Wang, L.; Li, H.; Fu, Y. Rethinking classification and localization for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
44. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2, 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 10 November 2022).
45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
46. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
47. Chen, H.; Wang, Y.; Wang, G.; Qiao, Y. LSTD: A low-shot transfer detector for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
Novel-category AP50 on the three PASCAL VOC novel splits; Sx-k denotes split x with k shots. The final row gives the relative improvement (%) of Extreme R-CNN over the TFA w/cos baseline.

| Method | S1-1 | S1-2 | S1-3 | S1-5 | S1-10 | S2-1 | S2-2 | S2-3 | S2-5 | S2-10 | S3-1 | S3-2 | S3-3 | S3-5 | S3-10 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TFA w/cos (Baseline) | 39.8 | 36.1 | 44.7 | 55.7 | 56.0 | 23.5 | 26.9 | 34.1 | 35.1 | 39.1 | 30.8 | 34.8 | 42.8 | 49.5 | 49.8 | 39.9 |
| FSCE | 44.2 | 43.8 | 51.4 | 61.9 | 63.4 | 27.3 | 29.5 | 43.5 | 44.2 | 50.2 | 37.2 | 41.9 | 47.5 | 54.6 | 58.5 | 46.6 |
| DeFRCN | 53.6 | 57.5 | 61.5 | 64.1 | 60.8 | 30.1 | 38.1 | 47.0 | 53.3 | 47.9 | 48.4 | 50.9 | 52.3 | 54.9 | 57.4 | 51.9 |
| VFA | 57.7 | 64.6 | 64.7 | 67.2 | 67.4 | 41.4 | 46.2 | 51.1 | 51.8 | 51.6 | 48.9 | 54.8 | 56.6 | 59.0 | 58.9 | 56.1 |
| Extreme R-CNN (Ours) | 57.0 | 58.3 | 62.7 | 65.3 | 66.2 | 40.2 | 46.4 | 47.8 | 52.4 | 53.8 | 49.5 | 52.3 | 54.4 | 57.3 | 59.4 | 54.9 |
| Improvement over baseline (%) | +43.2 | +61.5 | +40.3 | +17.2 | +18.2 | +71.1 | +72.5 | +40.2 | +49.3 | +37.6 | +60.1 | +50.3 | +27.1 | +15.8 | +19.3 | +37.5 |
Novel-category AP (nAP) on Microsoft COCO under the 10-shot and 30-shot settings.

| Method | nAP (10-shot) | nAP (30-shot) |
|---|---|---|
| LSTD [47] | 3.2 | 6.7 |
| FSRW | 5.6 | 9.1 |
| MetaDet | 7.1 | 11.3 |
| Meta-RCNN | 8.7 | 12.4 |
| MPSR | 9.8 | 14.1 |
| TFA w/cos (Baseline) | 10.0 | 13.7 |
| FSCE | 11.9 | 16.4 |
| FADI | 12.2 | 16.1 |
| VFA | 16.2 | 18.9 |
| DeFRCN | 18.5 | 22.6 |
| Extreme R-CNN (Ours) | 16.5 | 19.3 |
Ablation on PASCAL VOC Novel Split 1 (AP50 at 1/5/10 shots). SS = sample synthesis, KD = knowledge distillation, SNTL = Siamese network and triplet loss.

| Method | SS | KD | SNTL | 1 | 5 | 10 | Avg. |
|---|---|---|---|---|---|---|---|
| TFA w/cos (Baseline) | – | – | – | 39.8 | 55.7 | 56.0 | 50.5 |
| TFA w/cos (Our reimpl.) | ✗ | ✗ | ✗ | 40.2 | 55.9 | 57.2 | 51.1 |
| Extreme R-CNN (Ours) | ✓ | ✗ | ✗ | 51.1 | 62.7 | 64.3 | 59.4 |
| Extreme R-CNN (Ours) | ✓ | ✓ | ✗ | 51.8 | 63.3 | 65.0 | 60.0 |
| Extreme R-CNN (Ours) | ✓ | ✓ | ✓ | 57.0 | 65.3 | 66.2 | 62.8 |
Base- and novel-category AP50 on PASCAL VOC Split 1 at 1/3/5/10 shots.

| Method | Base-1 | Base-3 | Base-5 | Base-10 | Novel-1 | Novel-3 | Novel-5 | Novel-10 |
|---|---|---|---|---|---|---|---|---|
| MPSR [12] | 59.4 | 67.8 | 68.4 | – | 41.7 | 51.4 | 55.2 | 61.8 |
| FSCE [11] | 78.9 | 74.1 | 76.6 | – | 44.2 | 51.4 | 61.9 | 63.4 |
| TFA w/cos [9] (Baseline) | 79.1 | 77.3 | 77.0 | – | 39.8 | 44.6 | 55.7 | 56.0 |
| TFA w/cos [9] (Our reimpl.) | 78.8 | 77.5 | 77.2 | 77.0 | 40.2 | 45.3 | 55.9 | 57.2 |
| TFA w/cos [9] + SS | 77.5 | 76.7 | 76.0 | 75.4 | 51.1 | 56.3 | 62.7 | 64.3 |
| TFA w/cos [9] + SS + KD | 78.4 | 77.3 | 77.0 | 76.5 | 51.8 | 56.9 | 63.3 | 65.0 |
| Extreme R-CNN (Ours) | 78.2 | 77.0 | 76.7 | 76.3 | 57.0 | 62.7 | 65.3 | 66.2 |