Self-Knowledge Distillation via Progressive Associative Learning
Abstract
1. Introduction
- We introduce a highly efficient self-knowledge distillation (self-KD) framework that emulates human associative learning, enabling the distilled network to acquire powerful features from associated samples used as its inputs (a minimal sketch of one way such associated samples can be formed follows this list).
- The student network not only learns the characteristics of the original samples but is also compelled to capture the inter-class relations among all categories in a self-distillation manner.
- Our method delivers promising results compared with state-of-the-art methods. For example, on the CIFAR-100 dataset, with the same network, it achieves a 2.22% higher accuracy than the CS-KD method. Notably, our method even outperforms conventional distillation with a pretrained teacher.
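As a concrete illustration of the associated inputs described above, the following is a minimal sketch that pairs each image in a batch with a randomly chosen partner and blends the two, in the spirit of mixup; the function name `associate`, the Beta-distributed blend ratio, and PyTorch as the framework are assumptions of this sketch, not details taken from the paper.

```python
import torch

def associate(images: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend each image with a randomly chosen partner from the same batch.

    This is only an illustrative stand-in for ALSD's association operator.
    """
    perm = torch.randperm(images.size(0))                          # random partner for every sample
    lam = torch.distributions.Beta(alpha, alpha).sample().item()   # single blend ratio in (0, 1)
    return lam * images + (1.0 - lam) * images[perm]
```

Any other association operator could be substituted here; the point is only that every training input carries information from more than one sample.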
2. Related Work
3. Materials and Methods
3.1. Motivation
3.2. Associative Learning for Self-Distillation
3.3. Training Procedure
Algorithm 1 Associative Learning for Self-Distillation
Input: image data and labels.
Output: the parameters of the student model.
Initialize: the student model parameters and the training hyper-parameters.
Repeat:
  Stage 1: Learn the relationships among classes.
  Stage 2: Self-distillation.
Until: the student model parameters converge.
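The sketch below gives one plausible PyTorch reading of Algorithm 1, assuming Stage 1 learns the inter-class relations with a standard cross-entropy loss on the original samples and Stage 2 self-distills the model's own detached soft predictions on those samples into its predictions on associated samples; `train_alsd_sketch`, the `associate_fn` hook, the temperature `T`, and the weights `w_ce` and `w_kd` are hypothetical names and defaults rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def train_alsd_sketch(model, loader, optimizer, associate_fn,
                      epochs: int = 1, T: float = 4.0,
                      w_ce: float = 0.1, w_kd: float = 1.0) -> None:
    """Run the two stages of Algorithm 1 on every mini-batch (one plausible reading)."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            # Stage 1: learn the relationships among classes from the original samples.
            logits = model(images)
            ce_loss = F.cross_entropy(logits, labels)

            # Stage 2: self-distillation -- the model's own (detached) soft predictions
            # on the originals supervise its predictions on the associated samples.
            assoc_logits = model(associate_fn(images))
            soft_targets = F.softmax(logits.detach() / T, dim=1)
            kd_loss = F.kl_div(F.log_softmax(assoc_logits / T, dim=1),
                               soft_targets, reduction="batchmean") * (T * T)

            loss = w_ce * ce_loss + w_kd * kd_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The `associate_fn` argument can be any operator that produces associated samples, for example the mixup-style helper sketched in Section 1; the default weights 0.1 and 1.0 mirror the best-performing row of the ablation table in Section 5.2 under this interpretation.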
4. Results
4.1. Implementation Details
4.2. CIFAR-10 and CIFAR-100
4.3. Tiny-Imagenet-200
4.4. CUB200-2011 and Stanford Dogs
5. Discussion
5.1. Improvements on Various Architectures
5.2. Ablation Experiment
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Chen, Z.; Li, J.; You, X. Learn to focus on objects for visual detection. Neurocomputing 2019, 348, 27–39.
- Noh, H.; Hongsuck Seo, P.; Han, B. Image Question Answering Using Convolutional Neural Network With Dynamic Parameter Prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 30–38.
- Gong, X.; Rong, Z.; Wang, J.; Zhang, K.; Yang, S. A hybrid algorithm based on state-adaptive slime mold model and fractional-order ant system for the travelling salesman problem. Complex Intell. Syst. 2023, in press.
- Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Laurent, C.; Bengio, Y.; Courville, A.C. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. In Proceedings of the Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 8–12 September 2016; ISCA: Singapore, 2016; pp. 410–414.
- Alam, M.; Samad, M.D.; Vidyaratne, L.; Glandon, A.; Iftekharuddin, K.M. Survey on Deep Neural Networks in Speech and Vision Systems. Neurocomputing 2020, 417, 302–321.
- Bian, C.; Feng, W.; Wan, L.; Wang, S. Structural Knowledge Distillation for Efficient Skeleton-Based Action Recognition. IEEE Trans. Image Process. 2021, 30, 2963–2976.
- Zhao, H.; Sun, X.; Dong, J.; Dong, Z.; Li, Q. Knowledge distillation via instance-level sequence learning. Knowl. Based Syst. 2021, 233, 107519.
- Zhao, H.; Sun, X.; Dong, J.; Chen, C.; Dong, Z. Highlight Every Step: Knowledge Distillation via Collaborative Teaching. IEEE Trans. Cybern. 2020, 52, 1–12.
- Ding, F.; Luo, F.; Hu, H.; Yang, Y. Multi-level Knowledge Distillation. Neurocomputing 2020, 415, 106–113.
- Wu, S.; Wang, J.; Sun, H.; Zhang, K.; Pal, N.R. Fractional Approximation of Broad Learning System. IEEE Trans. Cybern. 2023, in press.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Comput. Sci. 2015, 14, 38–39.
- Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Bengio, Y. FitNets: Hints for Thin Deep Nets. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015.
- Zhao, H.; Sun, X.; Dong, J.; Yu, H.; Wang, G. Multi-instance semantic similarity transferring for knowledge distillation. Knowl. Based Syst. 2022, 256, 109832.
- Liu, T.; Lam, K.M.; Zhao, R.; Qiu, G. Deep Cross-modal Representation Learning and Distillation for Illumination-invariant Pedestrian Detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 315–329.
- Chen, L.; Jiang, Z.; Tong, L.; Liu, Z.; Zhao, A.; Zhang, Q.; Dong, J.; Zhou, H. Perceptual underwater image enhancement with deep learning and physical priors. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3078–3092.
- Liu, Y.; Chen, K.; Liu, C.; Qin, Z.; Luo, Z.; Wang, J. Structured Knowledge Distillation for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 2604–2613.
- Zhao, H.; Sun, X.; Dong, J.; Yu, H.; Zhou, H. Dual Discriminator Adversarial Distillation for Data-free Model Compression. arXiv 2021, arXiv:2104.05382.
- Lateef, F.; Ruichek, Y. Survey on semantic segmentation using deep learning techniques. Neurocomputing 2019, 338, 321–348.
- Guo, P.; Du, G.; Wei, L.; Lu, H.; Chen, S.; Gao, C.; Chen, Y.; Li, J.; Luo, D. Multiscale face recognition in cluttered backgrounds based on visual attention. Neurocomputing 2022, 469, 65–80.
- Ge, S.; Zhao, S.; Li, C.; Zhang, Y.; Li, J. Efficient Low-Resolution Face Recognition via Bridge Distillation. IEEE Trans. Image Process. 2020, 29, 6898–6908.
- Xue, G.; Wang, J.; Yuan, B.; Dai, C. DG-ALETSK: A High-Dimensional Fuzzy Approach With Simultaneous Feature Selection and Rule Extraction. IEEE Trans. Fuzzy Syst. 2023, 31, 3866–3880.
- Tang, Y.; Wei, Y.; Yu, X.; Lu, J.; Zhou, J. Graph Interaction Networks for Relation Transfer in Human Activity Videos. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2872–2886.
- Yuan, L.; Tay, F.E.H.; Li, G.; Wang, T.; Feng, J. Revisit Knowledge Distillation: A Teacher-free Framework. arXiv 2019, arXiv:1909.11723.
- Hou, Y.; Ma, Z.; Liu, C.; Loy, C.C. Learning Lightweight Lane Detection CNNs by Self Attention Distillation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 1013–1021.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Technical Report. 2009. Available online: http://www.cs.utoronto.ca/~kriz/learning-features-2009-TR.pdf (accessed on 22 May 2024).
- Le, Y.; Yang, X. Tiny Imagenet Visual Recognition Challenge. Stanford Class CS 231N. 2015. Available online: http://cs231n.stanford.edu/reports/2015/pdfs/yle_project.pdf (accessed on 22 May 2024).
- Welinder, P.; Branson, S.; Wah, C.; Schroff, F.; Belongie, S.; Perona, P. Caltech-UCSD Birds 200; California Institute of Technology: Pasadena, CA, USA, 2010; Available online: https://www.florian-schroff.de/publications/CUB-200.pdf (accessed on 22 May 2024).
- Khosla, A.; Jayadevaprakash, N.; Yao, B.; Li, F.L. Novel dataset for fine-grained image categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011.
- Ba, J.; Caruana, R. Do Deep Nets Really Need to be Deep? In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; pp. 2654–2662.
- Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11962.
- Huang, T.; Zhang, Y.; Zheng, M.; You, S.; Wang, F.; Qian, C.; Xu, C. Knowledge diffusion for distillation. Adv. Neural Inf. Process. Syst. 2024, 36, 65299–65316.
- Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4320–4328.
- Zhou, G.; Fan, Y.; Cui, R.; Bian, W.; Zhu, X.; Gai, K. Rocket Launching: A Universal and Efficient Framework for Training Well-Performing Light Net. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, LA, USA, 2–7 February 2018; pp. 4580–4587.
- Hahn, S.; Choi, H. Self-Knowledge Distillation in Natural Language Processing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP 2019, Varna, Bulgaria, 2–4 September 2019; pp. 423–430.
- Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; Ma, K. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 3712–3721.
- Yang, C.; Xie, L.; Su, C.; Yuille, A.L. Snapshot Distillation: Teacher-Student Optimization in One Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 2859–2868.
- Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125.
- Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 7–12 February 2020; AAAI Press: Washington, DC, USA, 2020; pp. 13001–13008.
- Xu, T.B.; Liu, C.L. Data-Distortion Guided Self-Distillation for Deep Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5565–5572.
- Nowlan, S.J.; Hinton, G.E. Simplifying Neural Networks by Soft Weight-Sharing. Neural Comput. 1992, 4, 473–493.
- Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 10778–10787.
- Yun, S.; Park, J.; Lee, K.; Shin, J. Regularizing Class-Wise Predictions via Self-Knowledge Distillation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 13873–13882.
- Ji, M.; Shin, S.; Hwang, S.; Park, G.; Moon, I. Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge Distillation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
- Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
- Bottou, L. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, 2nd ed.; Montavon, G., Orr, G.B., Müller, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7700, pp. 421–436.
- Zagoruyko, S.; Komodakis, N. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017.
- Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational Knowledge Distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 3967–3976.
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
Method | CIFAR-10 Acc (%) | CIFAR-100 Acc (%) |
---|---|---|
Baseline | 94.86% | 75.30% |
KD | 95.66% | 76.68% |
AT | 94.89% | 75.84% |
RKD | 95.13% | 76.02% |
ALSD | 96.04% | 80.10% |
Dataset | Baseline | DDGSD | BYOT | DML | CS-KD | FRSKD | ALSD |
---|---|---|---|---|---|---|---|
CIFAR-100 | 75.30% | 76.15% | 76.19% | 77.51% | 78.01% | 77.76% | 80.10% |
Tiny-Imagenet-200 | 56.63% | 58.52% | 55.98% | 58.65% | 58.38% | 59.60% | 59.70% |
Method | Acc (%) |
---|---|
Baseline | 56.63% |
KD | 58.44% |
AT | 57.49% |
RKD | 57.45% |
ALSD | 59.70% |
Dataset | Baseline | DDGSD | BYOT | DML | CS-KD | FRSKD | ALSD |
---|---|---|---|---|---|---|---|
CUB200-2011 | 54.00% | 58.53% | 59.24% | 54.15% | 66.72% | 67.52% | 70.09% |
Stanford Dogs | 62.71% | 68.47% | 65.98% | 63.24% | 69.15% | 70.75% | 71.51% |
Dataset | Method | Vgg13 | Vgg8 | MobileNetV2 | ShuffleNetV2 | DenseNet-121 |
---|---|---|---|---|---|---|
CIFAR-100 | Vanilla | 72.93% | 68.99% | 52.41% | 65.47% | 77.77% |
CIFAR-100 | Our ALSD | 75.36% | 70.27% | 62.83% | 72.83% | 81.46% |
Tiny-Imagenet-200 | Vanilla | 59.90% | 55.68% | 49.27% | 56.67% | 60.78% |
Tiny-Imagenet-200 | Our ALSD | 62.61% | 58.80% | 54.89% | 59.36% | 63.36% |
Weight 1 | Weight 2 | Acc (%) |
---|---|---|
1 | 0 | 78.65% |
0 | 1 | 1.29% |
1 | 1 | 79.10% |
0.1 | 1 | 80.09% |
0.1 | 1.5 | 79.85% |
0.1 | 2 | 79.54% |
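Reading the two weight columns as coefficients on a supervised term and a self-distillation term, the ablation corresponds to an overall objective of the following form; the symbols w1, w2, L_CE, L_KD, f_theta, and sg (stop-gradient) are labels attached here for illustration, not notation taken from the paper.

```latex
\mathcal{L}_{\mathrm{total}}
  = w_{1}\,\mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(x),\, y\big)
  + w_{2}\,\mathcal{L}_{\mathrm{KD}}\big(f_{\theta}(\tilde{x}),\, \mathrm{sg}[f_{\theta}(x)]\big)
```

Under this reading, the collapse to 1.29% when the first weight is zero suggests that the supervised term is indispensable, while the best setting (0.1, 1) indicates that a small supervised term combined with a dominant self-distillation term works best here.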
Method | CIFAR-100 | Tiny-Imagenet-200 | CUB200-2011 | Stanford Dogs | MIT67 |
---|---|---|---|---|---|
Baseline | 75.30% | 56.63% | 54.00% | 62.71% | 55.91% |
FitNet | 76.67% | 58.92% | 58.97% | 67.18% | 59.18% |
AT | 75.89% | 59.52% | 59.28% | 67.65% | 59.38% |
Overhaul | 74.51% | 59.42% | 59.58% | 66.43% | 58.88% |
FRSKD | 77.76% | 59.60% | 67.52% | 70.75% | 61.18% |
Our ALSD | 80.10% | 59.70% | 70.09% | 71.51% | 61.11% |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhao, H.; Bi, Y.; Tian, S.; Wang, J.; Zhang, P.; Deng, Z.; Liu, K. Self-Knowledge Distillation via Progressive Associative Learning. Electronics 2024, 13, 2062. https://doi.org/10.3390/electronics13112062