A Green AI Methodology Based on Persistent Homology for Compressing BERT
Abstract
1. Introduction
- (RQ1) How effective and robust is persistent homology at measuring the importance of neurons in a transformer encoder model such as BERT?
- (RQ2) Is it feasible to propose a practical methodology using persistent homology for simplifying BERT-based models?
- (RQ3) Can persistent homology be employed to enhance the explainability of language models like BERT?
- A methodology has been developed for compressing the BERT model by analyzing the topological features of the outputs of each neuron. This can be understood as explainability in terms of the individual role of each neuron (see the sketch after this list for how such per-neuron outputs can be collected).
- This methodology is applied to two versions of the BERT model, interpreting the topological characteristics as a tool to assess the importance of neurons, pruning those that contribute less information, and generating a compressed version of the BERT model.
- The performance of the simplified models has been evaluated on the GLUE benchmark, and the results have been compared with those of other state-of-the-art compression techniques, demonstrating the effectiveness of PBCE for both model explainability and compression.
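As a minimal sketch of the first contribution only (not the authors' exact pipeline), the snippet below shows how per-neuron outputs can be collected from a BERT encoder with the Hugging Face `transformers` library. The `bert-base-cased` checkpoint matches the model cited in the references; the example sentences, the chosen layer, and the reshaping into one output vector per hidden unit are illustrative assumptions.

```python
# Sketch: gather the outputs of every hidden unit of one BERT encoder layer,
# so each unit can later be scored by the topology of its output values.
# Assumes the Hugging Face `transformers` library; sentences are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

sentences = [
    "Persistent homology summarizes the shape of data.",
    "BERT produces contextual token embeddings.",
]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden_states = model(**batch).hidden_states  # embeddings + one tensor per layer

layer = 6                                      # example encoder layer (assumption)
acts = hidden_states[layer]                    # shape: (batch, seq_len, 768)
per_unit = acts.reshape(-1, acts.shape[-1]).T  # one row of outputs per hidden unit
print(per_unit.shape)                          # (768, batch * seq_len)
```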
2. Previous Works
2.1. Persistent Homology Applied to Machine Learning Problems
2.2. Brief Description of the BERT Model
2.3. BERT Model Pruning Methods
3. Our Proposal
3.1. An Intuitive Geometric Description of Persistent Homology Applied to LLM Explanations
3.2. PBCE: Using Persistent Homology to Compress BERT
Algorithm 1 PBCE: BERT compression through zero-dimensional persistent homology
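As an illustration of the core computation the algorithm's title refers to (a sketch under our own assumptions, not the authors' exact Algorithm 1), zero-dimensional persistence of a Vietoris–Rips filtration can be computed without a dedicated TDA library: the finite zero-dimensional lifetimes coincide with the edge lengths of the minimum spanning tree of the pairwise distance graph. Summing them is one plausible way to turn a neuron's output cloud into a single importance score; the function name `h0_total_persistence` and the synthetic point clouds are hypothetical.

```python
# Illustrative sketch: score a point cloud by the total persistence of its
# zero-dimensional homology classes. In a Vietoris-Rips filtration every H0
# class is born at 0 and dies when two components merge, so the finite
# lifetimes equal the edge lengths of the minimum spanning tree (MST).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def h0_total_persistence(points: np.ndarray) -> float:
    """Sum of the finite H0 lifetimes of the Rips filtration of `points`."""
    dist = squareform(pdist(points))      # dense pairwise distance matrix
    mst = minimum_spanning_tree(dist)     # sparse matrix holding the MST edges
    return float(mst.data.sum())          # each MST edge weight is a death time


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tight_cluster = rng.normal(0.0, 0.05, size=(50, 3))   # low H0 persistence
    spread_cloud = rng.normal(0.0, 1.00, size=(50, 3))    # high H0 persistence
    print(h0_total_persistence(tight_cluster), h0_total_persistence(spread_cloud))
```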
3.2.1. Corpus Selection
3.2.2. Using Persistent Homology to Analyze BERT Layer Outputs
3.2.3. Evaluation of Distribution and Selection of the Important Units
- The first level, which is the lightest, involves calculating the first quartile (Q1) of the values and retaining neurons with values higher than Q1.
- The second level is slightly more severe, applying the same operation but with the second quartile (Q2).
- The most intense pruning is the third level, where only the neurons with values higher than the third quartile (Q3) are kept (see the sketch after this list).
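A minimal sketch of these three pruning levels, assuming each hidden unit has already been assigned a persistence-based importance score (the `scores` array below is a hypothetical stand-in, not values from the paper):

```python
# Select the hidden units to keep at a given pruning level, following the
# quartile thresholds described above (level 1 = lightest, 3 = heaviest).
import numpy as np


def select_units(scores: np.ndarray, level: int) -> np.ndarray:
    """Return indices of units whose score exceeds the Q1/Q2/Q3 threshold."""
    threshold = np.quantile(scores, {1: 0.25, 2: 0.50, 3: 0.75}[level])
    return np.flatnonzero(scores > threshold)


# Example: random importance scores for 768 hidden units (BERT base width).
scores = np.random.default_rng(0).random(768)
for level in (1, 2, 3):
    kept = select_units(scores, level)
    print(f"level {level}: {kept.size} units kept")
```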
3.2.4. Evaluation of the Compressed Model Through the GLUE Benchmark
- Model evaluation: Use the fine-tuned simplified model to generate predictions for each GLUE benchmark task.
- Evaluation metrics: Calculate task-specific evaluation metrics for each GLUE task. These metrics vary depending on the task but typically include accuracy, F1 score, or other relevant measures; GLUE provides a standard evaluation script for each task (Table 1). A minimal sketch of these metrics is given after this list.
- Comparison: Compare the performance of the simplified model to the performance of the original, more complex BERT model and other simplified approaches from the literature. This will give an indication of how much simplification impacted the model’s ability to perform various NLP tasks.
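The task-specific measures listed in Table 1 can be reproduced with standard libraries; the sketch below uses scikit-learn and SciPy with hypothetical prediction and label arrays (it is not the GLUE evaluation script itself):

```python
# Sketch: computing GLUE-style metrics from Table 1 with scikit-learn/SciPy.
# `labels` and `preds` are synthetic stand-ins for a task's evaluation split.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
preds = np.where(rng.random(200) < 0.85, labels, 1 - labels)  # ~85% correct

print("CoLA  (Matthews correlation):", matthews_corrcoef(labels, preds))
print("SST-2 (accuracy)            :", accuracy_score(labels, preds))
print("MRPC  (F1 / accuracy)       :", f1_score(labels, preds), accuracy_score(labels, preds))

# STS-B is a regression task scored by Pearson and Spearman correlation.
gold = rng.random(200) * 5.0
pred_scores = gold + rng.normal(0.0, 0.5, size=200)
print("STS-B (Pearson / Spearman)  :", pearsonr(gold, pred_scores)[0],
      spearmanr(gold, pred_scores)[0])
```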
4. Empirical Evaluation
4.1. Distribution and Selection of the More Informative Neurons
4.2. Results and Analysis on the BERT Base Model
4.3. Results and Analysis on the BERT Large Model
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
LLM | Large Language Model |
NLP | Natural Language Processing |
PBCE | Persistent BERT Compression and Explainability |
BERT | Bidirectional Encoder Representations from Transformers |
GLUE | General Language Understanding Evaluation |
NHU | Number of Hidden Units |
Appendix A. Homology Theory: Notation and Mathematical Background
A simplicial complex is a finite collection K of simplices satisfying two conditions:
- 1. If σ ∈ K and τ is a face of σ, then τ ∈ K.
- 2. If σ1, σ2 ∈ K, then the intersection σ1 ∩ σ2 is either empty or a face of both.
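For convenience, two standard notions underlying zero-dimensional persistent homology are recalled below in our own notation (a compact recap, assuming a Vietoris–Rips filtration, the construction most commonly used in practice; the appendix's original symbols may differ):

```latex
% Vietoris-Rips filtration of a finite point cloud X and zero-dimensional
% persistence (standard definitions; notation is ours).
\[
  \mathrm{VR}_{\varepsilon}(X)
  = \bigl\{\, \sigma \subseteq X \;:\; d(x_i, x_j) \le \varepsilon
    \ \text{for all } x_i, x_j \in \sigma \,\bigr\},
  \qquad
  \mathrm{VR}_{\varepsilon_1}(X) \subseteq \mathrm{VR}_{\varepsilon_2}(X)
  \ \text{whenever } \varepsilon_1 \le \varepsilon_2 .
\]
\[
  \text{Each class of } H_0\bigl(\mathrm{VR}_{\varepsilon}(X)\bigr)
  \text{ (a connected component) is recorded as a birth--death pair } (b, d),
  \ \text{with persistence } d - b .
\]
```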
References
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Bolón-Canedo, V.; Morán-Fernández, L.; Cancela, B.; Alonso-Betanzos, A. A review of green artificial intelligence: Towards a more sustainable future. Neurocomputing 2024, 599, 128096. [Google Scholar] [CrossRef]
- Schwartz, R.; Dodge, J.; Smith, N.A.; Etzioni, O. Green AI. arXiv 2019, arXiv:1907.10597. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2020, arXiv:1909.11942. [Google Scholar]
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Minneapolis, MN, USA, 2020; pp. 4163–4174. [Google Scholar] [CrossRef]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Association for Computational Linguistics: Minneapolis, MN, USA, 2018; pp. 353–355. [Google Scholar] [CrossRef]
- Mileyko, Y.; Mukherjee, S.; Harer, J. Probability measures on the space of persistence diagrams. Inverse Probl. 2011, 27, 124007. [Google Scholar] [CrossRef]
- Chen, M.; Wang, D.; Feng, S.; Zhang, Y. Topological Regularization for Representation Learning via Persistent Homology. Mathematics 2023, 11, 1008. [Google Scholar] [CrossRef]
- Choe, S.; Ramanna, S. Cubical Homology-Based Machine Learning: An Application in Image Classification. Axioms 2022, 11, 112. [Google Scholar] [CrossRef]
- Pun, C.S.; Lee, S.X.; Xia, K. Persistent-homology-based machine learning: A survey and a comparative study. Artif. Intell. Rev. 2022, 55, 5169–5213. [Google Scholar] [CrossRef]
- Routray, M.; Vipsita, S.; Sundaray, A.; Kulkarni, S. DeepRHD: An efficient hybrid feature extraction technique for protein remote homology detection using deep learning strategies. Comput. Biol. Chem. 2022, 100, 107749. [Google Scholar] [CrossRef]
- Nauman, M.; Ur Rehman, H.; Politano, G.; Benso, A. Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins. J. Grid Comput. 2019, 17, 225–237. [Google Scholar] [CrossRef]
- Wu, K.; Zhao, Z.; Wang, R.; Wei, G.W. TopP–S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility. J. Comput. Chem. 2018, 39, 1444–1454. [Google Scholar] [CrossRef] [PubMed]
- Rathore, A.; Zhou, Y.; Srikumar, V.; Wang, B. TopoBERT: Exploring the topology of fine-tuned word representations. Inf. Vis. 2023, 22, 186–208. [Google Scholar] [CrossRef]
- Clark, K.; Khandelwal, U.; Levy, O.; Manning, C.D. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 276–286. [Google Scholar] [CrossRef]
- google-bert/bert-base-cased · Hugging Face. Available online: https://huggingface.co/bert-base-cased (accessed on 3 September 2023).
- google-bert/bert-large-cased · Hugging Face. Available online: https://huggingface.co/bert-large-cased (accessed on 3 September 2023).
- Gupta, M.; Agrawal, P. Compression of Deep Learning Models for Text: A Survey. ACM Trans. Knowl. Discov. Data 2022, 16, 61. [Google Scholar] [CrossRef]
- Ganesh, P.; Chen, Y.; Lou, X.; Khan, M.A.; Yang, Y.; Sajjad, H.; Nakov, P.; Chen, D.; Winslett, M. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Trans. Assoc. Comput. Linguist. 2021, 9, 1061–1080. [Google Scholar] [CrossRef]
- Lee, H.D.; Lee, S.; Kang, U. AUBER: Automated BERT regularization. PLoS ONE 2021, 16, e0253241. [Google Scholar] [CrossRef] [PubMed]
- Zhang, X.; Fan, J.; Hei, M. Compressing BERT for Binary Text Classification via Adaptive Truncation before Fine-Tuning. Appl. Sci. 2022, 12, 12055. [Google Scholar] [CrossRef]
- Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019. [Google Scholar]
- Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 5797–5808. [Google Scholar] [CrossRef]
- Huang, S.; Liu, N.; Liang, Y.; Peng, H.; Li, H.; Xu, D.; Xie, M.; Ding, C. An Automatic and Efficient BERT Pruning for Edge AI Systems. arXiv 2022, arXiv:2206.10461. [Google Scholar]
- Zheng, D.; Li, J.; Yang, Y.; Wang, Y.; Pang, P.C.I. MicroBERT: Distilling MoE-Based Knowledge from BERT into a Lighter Model. Appl. Sci. 2024, 14, 6171. [Google Scholar] [CrossRef]
- Zhang, Z.; Lu, Y.; Wang, T.; Wei, X.; Wei, Z. DDK: Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT. Neural Netw. 2024, 173, 106164. [Google Scholar] [CrossRef]
- Huang, T.; Dong, W.; Wu, F.; Li, X.; Shi, G. Uncertainty-Driven Knowledge Distillation for Language Model Compression. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2850–2858. [Google Scholar] [CrossRef]
- Lin, Y.J.; Chen, K.Y.; Kao, H.Y. LAD: Layer-Wise Adaptive Distillation for BERT Model Compression. Sensors 2023, 23, 1483. [Google Scholar] [CrossRef]
- Zhang, S.; Zheng, X.; Li, G.; Yang, C.; Li, Y.; Wang, Y.; Chao, F.; Wang, M.; Li, S.; Ji, R. You only compress once: Towards effective and elastic BERT compression via exploit–explore stochastic nature gradient. Neurocomputing 2024, 599, 128140. [Google Scholar] [CrossRef]
- Chen, T.; Frankle, J.; Chang, S.; Liu, S.; Zhang, Y.; Wang, Z.; Carbin, M. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. Adv. Neural Inf. Process. Syst. 2020, 33, 15834–15846. [Google Scholar]
- Guo, F.M.; Liu, S.; Mungall, F.S.; Lin, X.; Wang, Y. Reweighted Proximal Pruning for Large-Scale Language Representation. arXiv 2019, arXiv:1909.12486. [Google Scholar]
- Shen, S.; Dong, Z.; Ye, J.; Ma, L.; Yao, Z.; Gholami, A.; Mahoney, M.W.; Keutzer, K. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. arXiv 2019, arXiv:1909.05840. [Google Scholar] [CrossRef]
- Li, B.; Kong, Z.; Zhang, T.; Li, J.; Li, Z.; Liu, H.; Ding, C. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Minneapolis, MN, USA, 2020; pp. 3187–3199. [Google Scholar] [CrossRef]
- Piao, T.; Cho, I.; Kang, U. SensiMix: Sensitivity-Aware 8-bit index 1-bit value mixed precision quantization for BERT compression. PLoS ONE 2022, 17, e0265621. [Google Scholar] [CrossRef] [PubMed]
- legacy-datasets/wikipedia · Datasets at Hugging Face. Available online: https://huggingface.co/datasets/wikipedia (accessed on 8 September 2024).
- Williams, A.; Nangia, N.; Bowman, S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 1112–1122. [Google Scholar] [CrossRef]
- Wang, Z.; Hamza, W.; Florian, R. Bilateral multi-perspective matching for natural language sentences. arXiv 2017, arXiv:1702.03814. [Google Scholar]
- The Stanford Question Answering Dataset. Available online: https://rajpurkar.github.io/SQuAD-explorer/ (accessed on 4 September 2024).
- Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
- Warstadt, A.; Singh, A.; Bowman, S.R. Neural Network Acceptability Judgments. Trans. Assoc. Comput. Linguist. 2019, 7, 625–641. [Google Scholar] [CrossRef]
- Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017); Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 1–14. [Google Scholar] [CrossRef]
- Download Microsoft Research Paraphrase Corpus from Official Microsoft Download Center. Available online: https://www.microsoft.com/en-us/download/details.aspx?id=52398 (accessed on 4 September 2023).
- Bentivogli, L.; Dagan, I.; Magnini, B. The Recognizing Textual Entailment Challenges: Datasets and Methodologies. In Handbook of Linguistic Annotation; Ide, N., Pustejovsky, J., Eds.; Springer: Dordrecht, The Netherlands, 2017; pp. 1119–1147. [Google Scholar] [CrossRef]
- Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2024, arXiv:2307.06435. [Google Scholar]
- Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic Evaluation of Language Models. arXiv 2023, arXiv:2211.09110. [Google Scholar]
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
- Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy 2021, 23, 18. [Google Scholar] [CrossRef] [PubMed]
- Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining Explanations: An Overview of Interpretability of Machine Learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89. [Google Scholar] [CrossRef]
- Edelsbrunner, H.; Harer, J. Computational Topology: An Introduction; American Mathematical Society: Providence, RI, USA, 2010. [Google Scholar] [CrossRef]
- Hensel, F.; Moor, M.; Rieck, B. A Survey of Topological Machine Learning Methods. Front. Artif. Intell. 2021, 4, 681108. [Google Scholar] [CrossRef] [PubMed]
Task | Train | Evaluation | Metric |
---|---|---|---|
CoLA | 10 K | 1 K | Matthews Correlation |
SST-2 | 67 K | 872 | Accuracy |
MRPC | 5.8 K | 1 K | F1/Accuracy |
STS-B | 7 K | 1.5 K | Pearson–Spearman Correlation |
QQP | 400 K | 10 K | F1/Accuracy |
MNLI | 393 K | 20 K | Accuracy |
QNLI | 108 K | 11 K | Accuracy |
RTE | 2.7 K | 0.5 K | Accuracy |
BertLayer Component | Q1 | Q2 | Q3 |
---|---|---|---|
Q | 3, 4, 12 | 8–11 | 5–7 |
K | 3, 4, 12 | 8–11 | 5–7 |
V | 4 | 3, 11, 12 | 5–10 |
Intermediate | 3, 4 | 5–12 | - |
BertLayer Component | Q1 | Q2 | Q3 |
---|---|---|---|
Q | 11, 12, 14–17 | 4–10, 18 | 3, 13, 19–24 |
K | 11, 14–17 | 4–10, 12, 18, 19 | 3, 13, 20–24 |
V | - | 4, 8–12, 14–17 | 3, 5–7, 10, 13, 18–24 |
Intermediate | 3, 4, 8–16, 18 | 5–7, 17, 19–24 | - |
Method | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | RP (M) |
---|---|---|---|---|---|---|---|---|---|
Original [1] | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | - |
AUBER [23] | - | - | - | - | 60.59 ± 0.73 | - | 85.62 ± 0.51 | 65.31 ± 1.30 | - |
AE-BERT [27] | - | - | 88.7 | - | - | 86.1 | 89.5 | 69.7 | - |
ETbLSL [36] | 82.9 | 90.7 | 88.2 | 89.3 | 52.6 | 84.6 | 88.3 | 63.9 | - |
LotteryTicketBert [33] | - | - | 88.9 | - | 53.8 | 88.2 | 84.9 | 66 | - |
Michel et al. [25] | - | - | - | - | 58.86 ± 0.64 | - | 84.22 ± 0.33 | 63.9 | - |
Voita et al. [26] | - | - | - | - | 55.34 ± 0.81 | - | 83.92 ± 0.71 | 64.12 ± 1.65 | - |
QBERT [35] | 77.02/76.56 | - | - | 84.63 | - | - | - | - | - |
YOCO-BERT [32] | 82.6 | 90.5 | 87.2 | 91.6 | 59.8 | - | 89.3 | 72.9 | 67 |
SENSIMIX [37] | - | 89.6 | 86.5 | 90.3 | - | - | 87.2 | - | 13.75 |
LAD [31] | 81.01/81.47 | 87.56 | 89.24 | 91.74 | - | - | 88.71 | 67.15 | 52.5 |
DDK [29] | 83.6 | 88.2 | 91.6 | 92.7 | 61.9 | 89.1 | 90.7 | 73.7 | 67 |
UEM [30] | - | 86.2 | 86.4 | 87.5 | 46.8 | - | 86.7 | - | 66.8 |
MicroBERT [28] | 80.3 | 86.6 | - | 89.6 | - | - | 88.7 | 62.8 | 14.5 |
PBCE | 83.7/81.6 | 91.23 | 91.73 | 91.87 | 62.96 | 89.52 | 91.09 | 71.98 | 52 |
Method | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | RP (M) |
---|---|---|---|---|---|---|---|---|---|
Original [1] | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | - |
RPP [34] | 86.1/85.7 | - | - | - | 61.3 | - | 88.1 | 70.1 | 201 |
PBCE | 87.1/86.2 | 71.9 | 93.2 | 94.7 | 62.1 | 85.2 | 88.6 | 71.7 | 146.2 |