An Integrated Transformer with Collaborative Tokens Mining for Fine-Grained Recognition
Abstract
1. Introduction
2. Related Work
2.1. Fine-Grained Recognition
2.2. Inter-Class and Intra-Class Relations by Metric Learning
2.3. Feature Representation
3. Transformer Collaborative Region Mining
3.1. Pyramid Tokens Multiplications
3.2. Tokens Proposals Generation
3.3. Relations among Inter-Class and Intra-Class
3.4. Loss Function
- Ranking Loss.
- Tokens List Loss.
- Predicting Loss.
4. Experiments
4.1. Experimental Setup
4.2. Comparison with Previous Results
4.3. Ablation Experiments
4.3.1. Influence of Pyramid Tokens Multiplication
4.3.2. Influence of Tokens Proposals Generation
4.3.3. Complexity
4.3.4. Limitations
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
- Li, Z.; Yang, Y.; Liu, X.; Zhou, F.; Wen, S.; Xu, W. Dynamic Computational Time for Visual Attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1199–1209.
- Liu, X.; Xia, T.; Wang, J.; Lin, Y. Fully Convolutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition. arXiv 2016, arXiv:1603.06765.
- Zhang, X.; Xiong, H.; Zhou, W.; Lin, W.; Tian, Q. Picking Neural Activations for Fine-Grained Recognition. IEEE Trans. Multimedia 2017, 19, 2736–2750.
- Liu, C.; Xie, H.; Zha, Z.; Yu, L.; Chen, Z.; Zhang, Y. Bidirectional Attention-Recognition Model for Fine-Grained Object Classification. IEEE Trans. Multimedia 2020, 22, 1785–1795.
- Fu, J.; Zheng, H.; Mei, T. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4476–4484.
- Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5219–5227.
- Yang, Z.; Luo, T.; Wang, D.; Hu, Z.; Gao, J.; Wang, L. Learning to Navigate for Fine-Grained Classification. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 438–454.
- He, X.; Peng, Y.; Zhao, J. Fine-grained Discriminative Localization via Saliency-guided Faster R-CNN. In Proceedings of the 25th ACM International Conference on Multimedia (ACM MM), Mountain View, CA, USA, 23–27 October 2017; pp. 627–635.
- Wang, Z.; Wang, S.; Zhang, P.; Li, H.; Zhong, W.; Li, J. Weakly Supervised Fine-grained Image Classification via Correlation-guided Discriminative Learning. In Proceedings of the 27th ACM International Conference on Multimedia (ACM MM), Nice, France, 21–25 October 2019; pp. 1851–1860.
- Ding, Y.; Zhou, Y.; Zhu, Y.; Ye, Q.; Jiao, J. Selective Sparse Sampling for Fine-Grained Image Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6598–6607.
- Wang, Z.; Wang, S.; Li, H.; Dou, Z.; Li, J. Graph-Propagation Based Correlation Learning for Weakly Supervised Fine-Grained Image Classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 12289–12296.
- Wang, Z.; Wang, S.; Yang, S.; Li, H.; Li, J.; Li, Z. Weakly Supervised Fine-Grained Image Classification via Gaussian Mixture Model Oriented Discriminative Learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9746–9755.
- Sun, G.; Cholakkal, H.; Khan, S.; Khan, F.; Shao, L. Fine-Grained Recognition: Accounting for Subtle Differences between Similar Classes. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 12047–12054.
- Wang, S.; Li, H.; Wang, Z.; Ouyang, W. Dynamic Position-aware Network for Fine-grained Image Recognition. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), Virtual Event, 2–9 February 2021; pp. 2791–2799.
- Dubey, A.; Gupta, O.; Guo, P.; Raskar, R.; Farrell, R.; Naik, N. Pairwise Confusion for Fine-Grained Visual Classification. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 71–88.
- Zhuang, P.; Wang, Y.; Qiao, Y. Learning Attentive Pairwise Interaction for Fine-Grained Classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 13130–13137.
- Sun, M.; Yuan, Y.; Zhou, F.; Ding, E. Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11220, pp. 834–850.
- Liu, M.; Zhang, C.; Bai, H.; Zhang, R.; Zhao, Y. Cross-Part Learning for Fine-Grained Image Classification. IEEE Trans. Image Process. 2022, 31, 748–758.
- Zhao, Y.; Li, J.; Chen, X.; Tian, Y. Part-Guided Relational Transformers for Fine-Grained Visual Recognition. IEEE Trans. Image Process. 2021, 30, 9470–9481.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; Volume 139, pp. 10347–10357.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021.
- He, J.; Chen, J.; Liu, S.; Kortylewski, A.; Yang, C.; Bai, Y.; Wang, C. TransFG: A Transformer Architecture for Fine-Grained Recognition. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), Virtual Event, 22 February–1 March 2022; pp. 852–860.
- Liu, H.; Li, J.; Li, D.; See, J.; Lin, W. Learning Scale-Consistent Attention Part Network for Fine-Grained Image Recognition. IEEE Trans. Multimedia 2022, 24, 2902–2913.
- Sohn, K. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; pp. 1849–1857.
- Law, M.T.; Thome, N.; Cord, M. Quadruplet-Wise Image Similarity Learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 249–256.
- Song, H.O.; Xiang, Y.; Jegelka, S.; Savarese, S. Deep Metric Learning via Lifted Structured Feature Embedding. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4004–4012.
- Ge, W.; Huang, W.; Dong, D.; Scott, M.R. Deep Metric Learning with Hierarchical Triplet Loss. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11210, pp. 272–288.
- Manmatha, R.; Wu, C.; Smola, A.J.; Krähenbühl, P. Sampling Matters in Deep Embedding Learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2859–2867.
- Wang, X.; Hua, Y.; Kodirov, E.; Hu, G.; Garnier, R.; Robertson, N.M. Ranked List Loss for Deep Metric Learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5207–5216.
- Lam, M.; Mahasseni, B.; Todorovic, S. Fine-Grained Recognition as HSnet Search for Informative Image Parts. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6497–6506.
- Cai, S.; Zuo, W.; Zhang, L. Higher-Order Integration of Hierarchical Convolutional Activations for Fine-Grained Visual Categorization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 511–520.
- Yu, C.; Zhao, X.; Zheng, Q.; Zhang, P.; You, X. Hierarchical Bilinear Pooling for Fine-Grained Visual Recognition. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 595–610.
- Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative Pretraining From Pixels. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 13–18 July 2020; Volume 119, pp. 1691–1703.
- Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022.
- Noroozi, M.; Favaro, P. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Volume 9910, pp. 69–84.
- Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 19–25 June 2021; pp. 12873–12883.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 28th Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
- Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report; California Institute of Technology: Pasadena, CA, USA, 2011.
- Krause, J.; Stark, M.; Deng, J.; Li, F.-F. 3D Object Representations for Fine-Grained Categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Sydney, Australia, 1–8 December 2013; pp. 554–561.
- Maji, S.; Rahtu, E.; Kannala, J.; Blaschko, M.B.; Vedaldi, A. Fine-Grained Visual Classification of Aircraft. arXiv 2013, arXiv:1306.5151.
- Rabiee, H.R.; Haddadnia, J.; Mousavi, H.; Kalantarzadeh, M.; Nabi, M.; Murino, V. Novel dataset for fine-grained abnormal behavior understanding in crowd. In Proceedings of the 13th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), Colorado Springs, CO, USA, 23–26 August 2016; pp. 95–101.
- Engin, M.; Wang, L.; Zhou, L.; Liu, X. DeepKSPD: Learning Kernel-Matrix-Based SPD Representation For Fine-Grained Image Recognition. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 629–645.
- Wang, Y.; Morariu, V.I.; Davis, L.S. Learning a Discriminative Filter Bank Within a CNN for Fine-Grained Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4148–4157.
- Chen, Y.; Bai, Y.; Zhang, W.; Mei, T. Destruction and Construction Learning for Fine-Grained Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5157–5166.
- Zheng, X.; Qi, L.; Ren, Y.; Lu, X. Fine-Grained Visual Categorization by Localizing Object Parts With Single Image. IEEE Trans. Multimedia 2021, 23, 1187–1199.

Dataset | Classes | Training images | Test images |
---|---|---|---|
CUB-200-2011 | 200 | 5994 | 5794 |
Stanford Cars | 196 | 8144 | 8041 |
FGVC Aircraft | 100 | 6667 | 3333 |
Stanford Dogs | 120 | 12,000 | 8580 |
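
As a quick sanity check on the split sizes in the dataset table above, the following sketch computes the total image count, the training fraction, and the average number of training images per class. The names `DATASETS` and `split_summary` are illustrative and do not come from the paper; only the numbers are taken from the table.

```python
# (classes, train images, test images), copied from the dataset table above.
DATASETS = {
    "CUB-200-2011": (200, 5994, 5794),
    "Stanford Cars": (196, 8144, 8041),
    "FGVC Aircraft": (100, 6667, 3333),
    "Stanford Dogs": (120, 12000, 8580),
}


def split_summary(classes, train, test):
    """Summarise one dataset split (hypothetical helper, not from the paper)."""
    total = train + test
    return {
        "total": total,
        "train_fraction": train / total,
        "train_per_class": train / classes,
    }


for name, stats in DATASETS.items():
    s = split_summary(*stats)
    print(
        f"{name}: {s['total']} images, "
        f"{s['train_fraction']:.1%} train, "
        f"~{s['train_per_class']:.1f} training images per class"
    )
```

The per-class averages are small (tens of images per class on CUB-200-2011), which is one reason these benchmarks are usually approached with pretrained backbones.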
Method | Backbone | CUB-200-2011 | Stanford Cars | FGVC Aircraft | Stanford Dogs |
---|---|---|---|---|---|
RA-CNN [6] | VGGNet-19 | 85.3% | 92.5% | 88.2% | 87.3% |
MA-CNN [7] | VGGNet-19 | 86.5% | 92.8% | 89.9% | - |
HIHCA [32] | ResNet-50 | 85.3% | 91.7% | 88.3% | - |
Deep KSPD [44] | VGGNet-19 | 86.5% | 92.8% | 89.9% | - |
HBP [33] | VGGNet-16 | 87.1% | 93.7% | 90.3% | - |
NTS [8] | ResNet-50 | 87.5% | 93.9% | - | - |
DFL-CNN [45] | ResNet-50 | 87.4% | 93.8% | 92.0% | - |
DCL [46] | ResNet-50 | 87.8% | 94.5% | 93.0% | - |
S3Ns [11] | ResNet-50 | 88.5% | 94.7% | 92.8% | - |
GCL [12] | ResNet-50 | 88.3% | 94.0% | 93.2% | - |
BARM [5] | ResNet-50 | 88.5% | 94.3% | 92.5% | - |
ASD [14] | ResNet-50 | 88.6% | 94.9% | 93.5% | - |
DF-GMM [13] | ResNet-50 | 88.8% | 94.8% | 93.8% | - |
DP-Net [15] | ResNet-50 | 89.3% | 94.8% | 93.9% | - |
SCAPNet [24] | ResNet-50 | 89.5% | - | 93.6% | - |
PART [20] | ResNet-50 | 89.6% | 94.4% | 95.1% | - |
CP-CNN [19] | ResNet-50 | 91.4% | 95.4% | 95.5% | - |
PC [16] | ResNet-50 | 80.2% | 93.4% | 83.4% | - |
MAMC [18] | ResNet-50 | 86.2% | 92.8% | - | 84.8% |
API-Net [17] | ResNet-50 | 87.7% | 94.8% | 93.0% | 90.3% |
LOPSI [47] | ResNet-50 | 88.9% | 94.2% | 92.3% | - |
DeiT | DeiT-B [21] | 90.0% | 93.9% | - | - |
ViT | ViT-B [22] | 90.3% | 93.7% | - | 91.7% |
TransFG [23] | ViT-B [22] | 91.7% | 94.8% | - | 92.3% |
TCTM | Swin-B [38] | 93.2% | 96.3% | 95.7% | 93.8% |

Method | CUB-200-2011 | Stanford Cars | FGVC Aircraft | Stanford Dogs |
---|---|---|---|---|
Swin [38] | 90.8% | 91.8% | 90.3% | 90.9% |
Swin [38] + PTMs (stage 4) + Loss (Equation (11)) | 91.7% | 92.7% | 91.4% | 91.8% |
Swin [38] + PTMs (stages 3, 4) + Loss (Equation (11)) | 92.0% | 93.6% | 92.3% | 92.1% |
Swin [38] + PTMs (stages 2, 3, 4) + Loss (Equation (11)) | 92.2% | 94.4% | 94.2% | 92.3% |
Swin [38] + PTMs + Loss (Equation (11)) | 92.3% | 95.1% | 94.2% | 92.5% |
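
To make the trend in the PTM ablation easier to read, the sketch below computes each variant's gain over the plain Swin row in percentage points. The accuracies are copied from the table above; the dictionary keys and the helper `gains_over_baseline` are illustrative names, not identifiers from the paper.

```python
# Top-1 accuracy (%) copied from the PTM ablation table above.
DATASETS = ("CUB-200-2011", "Stanford Cars", "FGVC Aircraft", "Stanford Dogs")
BASELINE = (90.8, 91.8, 90.3, 90.9)  # plain Swin row
VARIANTS = {
    "PTMs (stage 4)": (91.7, 92.7, 91.4, 91.8),
    "PTMs (stages 3, 4)": (92.0, 93.6, 92.3, 92.1),
    "PTMs (stages 2, 3, 4)": (92.2, 94.4, 94.2, 92.3),
    "PTMs": (92.3, 95.1, 94.2, 92.5),  # last row, listed without a stage qualifier
}


def gains_over_baseline(row, baseline=BASELINE):
    """Per-dataset gain in percentage points, rounded to one decimal."""
    return tuple(round(acc - base, 1) for acc, base in zip(row, baseline))


for name, row in VARIANTS.items():
    gains = gains_over_baseline(row)
    summary = ", ".join(f"{d}: +{g:.1f}" for d, g in zip(DATASETS, gains))
    print(f"{name}: {summary}")
```

Read this way, the gains are largest on Stanford Cars and FGVC Aircraft, and FGVC Aircraft shows no further change between the stage 2, 3, 4 row and the last row.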

Method | CUB-200-2011 | Stanford Cars | FGVC Aircraft | Stanford Dogs |
---|---|---|---|---|
Swin [38] + PTM + Loss (Equation (11)) | 92.3% | 95.1% | 94.2% | 92.5% |
Swin [38] + PTM I + TPG I + Loss (Equation (15)) | 91.2% | 91.9% | 90.7% | 89.3% |
Swin [38] + PTM I + TPG + Loss (Equation (15)) | 92.6% | 93.7% | 92.3% | 90.6% |
Swin [38] + PTM + TPG I + Loss (Equation (15)) | 92.5% | 93.4% | 92.1% | 90.4% |
Swin [38] + PTM + TPG + Loss (Equation (15)) | 93.2% | 96.3% | 95.7% | 93.8% |

Module | PTM I | PTM II | TPG I | TPG II |
---|---|---|---|---|
Params | 2.64 M | 2.62 M | 0.73 M | 0.71 M |
FLOPs | 388 MMac | 386 MMac | 294 MMac | 293 MMac |
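
The complexity table reports compute in MMac (millions of multiply-accumulate operations). For comparison with figures quoted in FLOPs, a rough conversion can be sketched as below. The factor of two FLOPs per MAC is a common convention and an assumption here, not something the table states; `MODULES` and `approx_gflops` are illustrative names.

```python
# Per-module figures copied from the complexity table above.
MODULES = {
    "PTM I": {"params_m": 2.64, "mmac": 388},
    "PTM II": {"params_m": 2.62, "mmac": 386},
    "TPG I": {"params_m": 0.73, "mmac": 294},
    "TPG II": {"params_m": 0.71, "mmac": 293},
}

# Assumption: one multiply-accumulate is counted as two floating-point operations.
FLOPS_PER_MAC = 2


def approx_gflops(mmac, flops_per_mac=FLOPS_PER_MAC):
    """Convert millions of MACs to approximate GFLOPs."""
    return mmac * 1e6 * flops_per_mac / 1e9


for name, m in MODULES.items():
    print(
        f"{name}: {m['params_m']:.2f} M params, "
        f"~{approx_gflops(m['mmac']):.3f} GFLOPs"
    )
```

Under that convention each module costs well under 1 GFLOP, though the exact figure depends on how the counting tool treats MACs.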
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yang, W.; Yin, J. An Integrated Transformer with Collaborative Tokens Mining for Fine-Grained Recognition. Electronics 2023, 12, 2635. https://doi.org/10.3390/electronics12122635