Article

A Green AI Methodology Based on Persistent Homology for Compressing BERT

by
Luis Balderas
1,2,3,4,*,
Miguel Lastra
2,3,4,5 and
José M. Benítez
1,2,3,4
1
Department of Computer Science and Artificial Intelligence, University of Granada, 18071 Granada, Spain
2
Distributed Computational Intelligence and Time Series Lab, University of Granada, 18071 Granada, Spain
3
Sport and Health University Research Institute, University of Granada, 18071 Granada, Spain
4
Andalusian Research Institute in Data Science and Computational Intelligence, University of Granada, 18071 Granada, Spain
5
Department of Software Engineering, University of Granada, 18071 Granada, Spain
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(1), 390; https://doi.org/10.3390/app15010390
Submission received: 22 November 2024 / Revised: 17 December 2024 / Accepted: 31 December 2024 / Published: 3 January 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:
Large Language Models (LLMs) such as BERT have gained significant prominence due to their remarkable performance in a wide variety of natural language processing tasks. However, they come with substantial computational and memory costs, and they are essentially black-box models that are challenging to explain and interpret. In this article, Persistent BERT Compression and Explainability (PBCE) is proposed: a Green AI methodology to prune BERT models using persistent homology, which measures the importance of each neuron by studying the topological characteristics of its outputs. As a result, PBCE can compress BERT significantly, reducing the number of parameters (to 47% of the original parameters for BERT Base and 42% for BERT Large). The proposed methodology has been evaluated on the standard GLUE Benchmark, and its results have been compared with state-of-the-art techniques, achieving outstanding performance. Consequently, PBCE simplifies the BERT model by providing explainability for its neurons and by reducing the model’s size, making it more suitable for deployment on resource-constrained devices.

1. Introduction

In recent years, Large Language Models (LLMs) like Bidirectional Encoder Representations from Transformers (BERT) [1] have achieved phenomenal results in a wide variety of natural language processing (NLP) tasks. These models, pre-trained on large amounts of data, are available for use by the scientific community, either as semantic analysis tools or as the basis for fine-tuned solutions to specific problems.
However, these models suffer from two weaknesses. On the one hand, they are computationally and memory-intensive, making it challenging to deploy them on devices with limited resources. As a result, these advancements run counter to the Green AI [2,3] paradigm, which is dedicated to developing AI technologies that minimize computational costs and place equal emphasis on efficiency and predictive accuracy. On the other hand, they are black-box models: due to the large number of neurons, layers, parameters, and data transformations (e.g., in Attention layers [4]), it is practically impossible to explain the internal state of the network at any given moment and provide interpretability to the final output.
In this article, Persistent BERT Compression and Explainability (PBCE) is proposed, a Green AI methodology based on homology theory to compress the BERT model and give insights into the importance of its neurons. BERT has been chosen as the reference model given that it is one of the foundational models that laid the groundwork for LLM encoders. Several subsequent BERT-based encoder models, such as RoBERTa [5], DistilBERT [6], ALBERT [7], or TinyBERT [8], have been developed to address specific limitations of the original. PBCE focuses on the internal representation of the neural network, considering the individual role of each neuron when the model makes an inference. In particular, PBCE uses zero-dimensional persistent homology, extracting the topological characteristics of the neurons’ outputs and providing an assessment of their importance within the network. This way, it is feasible to identify the neurons that contribute more information to the overall computation of the network and to prune those that contribute less. As a consequence, the proposed technique becomes an effective method for compressing BERT.
To the best of our knowledge, this is the first approach that uses persistent homology to compress the BERT model and give insights into the importance of its units. To measure the effectiveness of PBCE, an extensive experimentation has been designed based on the natural language processing tasks proposed in the General Language Understanding Evaluation (GLUE) benchmark [9], achieving results that outperform other state-of-the-art techniques for BERT compression. The main purpose of the experimentation is to provide an answer to the following research questions (RQ):
(RQ1)
How effective and robust is persistent homology at measuring the importance of neurons in a transformer encoder model such as BERT?
(RQ2)
Is it feasible to propose a practical methodology using persistent homology for simplifying BERT-based models?
(RQ3)
Can persistent homology be employed to enhance the explainability of language models like BERT?
The main contributions of the article can be summarized as follows:
  • A methodology has been developed for compressing the BERT model by analyzing the topological features of the outputs of each neuron. This can be understood as explainability in terms of the individual role of each neuron.
  • This methodology is applied to two versions of the BERT model, interpreting the topological characteristics as a tool to assess the importance of neurons, simplifying those that contribute less information, and generating a pruned version of the BERT model.
  • The performance of the simplified models has been evaluated on the GLUE Benchmark, and the results have been compared with other state-of-the-art compression techniques, demonstrating the effectiveness of PBCE for model explainability and compression.
The rest of this paper is structured as follows: Section 2 reviews the state of the art on approaches that combine persistent homology with deep learning, as well as BERT compression techniques. In Section 3, PBCE is described. In Section 4, the methodology is experimentally analyzed. Section 5 discusses the results, and Section 6 highlights the conclusions.

2. Previous Works

In this section, the most relevant articles from the state of the art related to PBCE are presented. To the best of our knowledge, there is no technique in the literature that uses persistent homology to prune neural networks. Therefore, techniques that utilize homology theory and deep learning to solve scientific problems are presented first. Finally, pruning techniques for LLMs are introduced, which will serve as a reference to validate the proposed methodology.

2.1. Persistent Homology Applied to Machine Learning Problems

In recent years, numerous machine learning methods based on persistent homology, which is a mathematical method used in topological data analysis to study features of data, have been proposed. Some of them involve feature extraction through the analysis of persistence diagrams (Birth–Death diagrams) and persistence barcodes. Additionally, there are methods which establish similarity metrics between barcodes [10]. From a mathematical perspective, solutions based on persistent homology have been presented to regularize the internal representation of deep neural networks applied to image classification [11,12].
Persistent homology, combined with machine learning, also finds applications in the fields of biology and chemistry, particularly in protein analysis. Deep neural networks are not directly applicable to molecular data, as they are characterized by complex three-dimensional structures. Therefore, the use of topological representations and features is essential for solving protein classification problems [13]. They are also highly useful for identifying structures and functions in a protein sequence [14], for automatic protein annotation [15] or, in chemistry, for tasks like simultaneous prediction of partition coefficients and aqueous solubility [16].
Getting closer to the topic of the article, homology theory has been used to analyze natural language. Specifically, in [17], TopoBERT is presented as a visual tool to explore the fine-tuning process of different language models from a topological perspective. Additionally, it facilitates the visualization of the shape of embedding spaces and the linguistic and semantic connection between the input dataset and the topology of its embedding space.

2.2. Brief Description of the BERT Model

BERT, an acronym for “Bidirectional Encoder Representations from Transformers” [1], is a language model developed by Google that has revolutionized natural language processing. Its primary innovation lies in its ability to comprehend a word’s context within a text by analyzing all surrounding words, both to the left and right. Unlike previous language models, which were unidirectional and predicted words based only on the previous ones, BERT takes the entire context into account.
BERT is built upon the Transformer architecture, of which it uses the encoder stack. The Transformer architecture employs multi-head attention to process sequences. Before the network processes the information, the texts are tokenized: each text is divided into a set of tokens that represent the fundamental semantic content of its words and are translated into their numerical positions in the model’s vocabulary. Additionally, special tokens are added, such as [CLS], marking the start of a sentence; [SEP], marking the end; and [PAD], used to pad sequences to a common length. Some research papers from the literature, such as [18], show that the [CLS] token, besides marking the beginning of a sequence, provides an aggregated representation of the input. In fact, it is used in text classification and sentiment analysis tasks. Figure 1 shows an example of how sequences are generated to be a valid input for the BERT model.
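The following sketch illustrates this tokenization step using the Hugging Face transformers library; the library choice, the model checkpoint, and the maximum length are assumptions made for illustration, not a prescription from the original text.

```python
# Minimal sketch: turning a sentence into a valid BERT input with the
# Hugging Face tokenizer (library and parameters are illustrative).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

encoded = tokenizer(
    "Persistent homology studies the shape of data.",
    padding="max_length",   # pad with [PAD] up to max_length
    max_length=16,
    truncation=True,
    return_tensors="pt",
)

# The resulting ids start with [CLS] and contain [SEP] before the padding.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
```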
To pre-train and fine-tune BERT, an extensive amount of text was used. During the training process, BERT learns to predict hidden words in complete sentences, enabling it to better understand how words relate to each other in a given context.
This pre-trained model can be fine-tuned for specific tasks such as text classification, entity tagging, machine translation, and more. Due to its context-awareness, BERT has outperformed many records in various natural language processing tasks and has become an essential tool in this field.
Each BERT Layer, as part of the Encoder of the Transformer architecture, consists of three components: the Multi-Head Attention layer, which comprises the query (Q), key (K), and value (V) matrices and the Attention Output; the Intermediate layer; and the Output layer. Figure 2 shows the complete representation of the BERT architecture. Additionally, more detailed information can be found in [1,4].
Let L be the number of layers, H the hidden size, and A the number of self-attention heads of the model. There are two implementations of the BERT model: the BERT Base Cased model [19] (L = 12, H = 768, A = 12, total number of parameters = 110 M) and the BERT Large Cased model [20] (L = 24, H = 1024, A = 16, total number of parameters = 340 M).
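These figures can be verified programmatically; the short sketch below loads both checkpoints with the Hugging Face transformers library (an assumed tool) and reports their configuration and parameter counts, which land close to the 110 M and 340 M quoted above.

```python
# Sketch: inspect L, H, A and the parameter count of both BERT variants.
from transformers import BertModel

for name in ("bert-base-cased", "bert-large-cased"):
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    cfg = model.config
    print(f"{name}: L={cfg.num_hidden_layers}, H={cfg.hidden_size}, "
          f"A={cfg.num_attention_heads}, params={n_params / 1e6:.1f} M")
```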

2.3. BERT Model Pruning Methods

As mentioned earlier, to the best of our knowledge, there are no pruning methods for LLMs based on persistent homology. However, there are compression algorithms for the BERT architecture that provide competitive results [21]. These methods can be divided into structured and unstructured approaches [22].
Among the structured methods, some focus exclusively on pruning attention heads. In [23], a regularization method for BERT is proposed that uses reinforcement learning to prune attention heads. For text classification, a compression method based on adaptive truncation is presented in [24]. In [25], a forward and backward pass is used to calculate gradients, which serve as an importance score for each attention head. Ref. [26] suggests constructing a loss function that minimizes both the classification error and the number of heads used, pruning unproductive heads while maintaining performance. Ref. [27] presents a technique with three stages: in the first stage, n pruning strategies are generated with the same pruning ratio; in the second stage, the n candidates are evaluated on the training set and, after all iterations, the one that yields the best subnetwork is chosen as the best candidate; this best candidate then undergoes fine-tuning to obtain a good subnetwork. In [28], a new sentence-level feature alignment loss distillation mechanism for BERT compression is introduced. It is guided by a mixture of experts to transfer contextual semantic knowledge from the teacher model to the student model while reducing its parameters. In [29], a dynamic structure pruning method called DDK, based on differentiable search and recursive knowledge distillation, is presented. It focuses on pruning all feed-forward and self-attention layers, and the recursive knowledge distillation method is designed to extract the most important features from the intermediate layers. Another uncertainty-driven knowledge distillation technique is presented in [30]; the uncertainty modeling guides the training process effectively, especially when there is a large gap in performance between the pretrained language model and the compressed one. In [31], a Layer-wise Adaptive Distillation (LAD) method is proposed for model compression; specifically, an iterative aggregation mechanism is designed to distill layer-wise internal knowledge from the teacher model to the student model. Finally, in [32], a novel approach to compress BERT, called You Only Compress Once-BERT (YOCO-BERT), is introduced. YOCO-BERT constructs a search space with all the possible configurations of the BERT model and then generates an optimal candidate architecture using a stochastic nature gradient optimization method.
Among the unstructured methods, ref. [33] uses unstructured magnitude pruning to find subnetworks with sparsity levels between 40% and 90%. It concludes that those with 70% sparsity, using the masked language modeling task, are universal, and can solve other tasks without losing accuracy. Ref. [34] proposes a weight pruning algorithm that integrates reweighted L1 minimization with a proximal algorithm. More information on pruning algorithms for LLMs is available in [21].
Quantization algorithms have also been proposed, such as [35], which uses second-order Hessian matrix information for quantizing BERT models to low precision. Hardware-based techniques have also been proposed [36]. In [37], a sensitivity-aware mixed precision quantization method called SensiMix is proposed, which applies a 1-bit quantization to insensitive parts of the model. Finally, there are knowledge distillation-based algorithms, like the one presented in [30], which is based on parameter retention and feed forward network parameter distillation.

3. Our Proposal

In this article, Persistent BERT Compression and Explainability (PBCE) is proposed, a novel Green AI technique to compress BERT-based models through the application of homology theory, allowing us to derive a simplified yet effective version of the model. Specifically, using the BERT architecture as a reference, persistent homology is employed as a fundamental tool for analyzing the topological characteristics of the neurons, drawing conclusions to discard those that do not make a relevant contribution to the model. In this section, an intuitive geometric description of how PBCE applies zero-dimensional persistent homology to the vectors generated by the hidden layers of a neural network is presented first. Then, the proposed methodology for compressing the model is described.

3.1. An Intuitive Geometric Description of Persistent Homology Applied to LLM Explanations

Homology theory is a branch of algebraic topology increasingly used in data science. In particular, persistent homology is a powerful tool for studying patterns in data. Its mathematical foundation, summarized in Appendix A, is deep and complex with a significant algebraic burden. However, it can be given a much more intuitive geometric interpretation, which will be explained below.
In Figure 3, a complete example of data processing in BERT can be found. Starting with a set of elements from a corpus, the texts are tokenized and fed into the network. The output of any neuron in a hidden layer, which serves as input for the next layers, consists of vectors that, in an abstract sense, can be represented as points in a hyperspace. The geometric distribution of these points can provide meaningful information about the behavior and role that the corresponding neuron plays within the data flow of the network. Zero-dimensional persistent homology helps explain this role.
As can be seen in Figure 3, each BERT layer generates multidimensional vectors of size $N \times M \times NHU$, with $N$ being the number of input sequences, $M$ the number of tokens in each sequence, and $NHU$ the number of hidden units. These vectors represent the internal state of the texts used as inputs. For the sake of clarity, a simple two-dimensional example is introduced in Figure 4, which shows two plots at three different time moments. The plot on the left depicts some vectors generated by a layer (as explained earlier), which become the centers of disks. The disks evolve, growing uniformly with a common radius. The current value of the radius is marked by the red line on the right (Birth–Death diagram); this red line moves upwards along the Y-axis of the plot on the right.
Since PBCE is based on zero-dimensional persistent homology, each output vector becomes a singleton connected component, and all of them originate at time zero (value 0 on the Birth–Death diagram). In consequence, there are as many connected components as output vectors. As the radius of the disks grows, the connected components eventually collapse. When two connected components touch, they merge into a single component, and a blue point is marked on the Birth–Death plot at the radius value that led to the merger. This point on the Birth–Death plot represents the death of a connected component. The radius grows until the last two connected components touch, creating one large connected component that includes all of the vectors. Let us call $r_f$ the radius value at which all connected components collapse. Please note that this is the smallest value of the radius at which all the connected components are glued together. Figure 5 shows an example of a persistence diagram in which $r_f = 0.20229639$.
The $r_f$ value is fundamental in the analysis because it provides information about the distribution of the output vectors of each neuron and their variability. This helps assess whether the neuron has very uniform outputs (providing little information) or exhibits variability in its outputs (providing more information). Concretely, if $r_f$ is small, it means that the different outputs of a neuron are very similar to each other; if $r_f$ is large, the outputs are less redundant. All in all, this procedure allows us to “measure” the variability in the output distribution of a neuron. Although it can be described through a simple geometric procedure, it rests on a well-founded mathematical background.
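The procedure in Figure 4 can be reproduced with any topological data analysis library that computes zero-dimensional persistence diagrams. The toy sketch below uses the ripser package (an assumed tool) on a synthetic point cloud; note that ripser reports deaths as distance thresholds, so whether the largest finite death equals $r_f$ or $2 r_f$ depends on the radius convention adopted.

```python
# Toy sketch of the geometric procedure: compute the H0 persistence
# diagram of a point cloud and read off the scale at which all
# connected components have merged.
import numpy as np
from ripser import ripser

# Outputs of one neuron, idealized here as 50 random 2-D points.
points = np.random.default_rng(0).normal(size=(50, 2))

h0 = ripser(points, maxdim=0)["dgms"][0]        # birth-death pairs of components
finite_deaths = h0[np.isfinite(h0[:, 1]), 1]    # the last component never dies (inf)
r_f = finite_deaths.max()                       # scale at which everything has merged
print(f"r_f = {r_f:.4f}")
```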

3.2. PBCE: Using Persistent Homology to Compress BERT

In this article, PBCE is proposed, a methodology aimed at compressing BERT models. PBCE uses zero-dimensional persistent homology to analyze the topological characteristics of neuron outputs for each layer, identifying which of them play a more significant role in the information flow and, consequently, in the model’s decision-making process. This allows for the removal of neurons that provide less information, compressing the network and making it more efficient. Algorithm 1 outlines the proposed methodology. In summary, the methodology starts by selecting a corpus. Next, zero-dimensional persistent homology is used as a tool to measure the importance of neurons. Then, the persistence diagram is used to determine which neurons can be removed. Finally, the simplified model is built and evaluated on the GLUE Benchmark.
Next, a detailed description of each of the steps outlined in Algorithm 1 can be found; a high-level code sketch of the full pipeline is given right after the listing.
Algorithm 1 PBCE: BERT compression through zero-dimensional persistent homology
(1)
Select the corpus to carry out the analysis.
(2)
Use the zero-dimensional persistent homology and the persistence diagram for the outputs of each unit and layer evaluated on the corpus.
(3)
Analyze the distribution of $r_f$ from the persistence diagram. Select the units to be removed.
(4)
Construct the simplified model.
(5)
Evaluate the simplified model with the GLUE Benchmark.
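For concreteness, a high-level sketch of Algorithm 1 is shown below. Every helper function is a hypothetical placeholder for the corresponding step described in the following subsections, not the authors' released code.

```python
# Hypothetical end-to-end sketch of Algorithm 1; all helpers are
# illustrative placeholders for the steps detailed below.
def pbce(model, corpus, glue_tasks):
    sentences = select_corpus(corpus)                       # Step 1: corpus selection
    rf = {}                                                 # Step 2: r_f per (layer, component, unit)
    for layer, component, unit, outputs in neuron_outputs(model, sentences):
        rf[(layer, component, unit)] = zero_dim_rf(outputs)
    keep = select_units(rf)                                 # Step 3: analyze the r_f distribution
    pruned_model = build_simplified_model(model, keep)      # Step 4: construct the simplified model
    return {task: evaluate_glue(pruned_model, task)         # Step 5: GLUE evaluation
            for task in glue_tasks}
```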

3.2.1. Corpus Selection

To measure the importance of each unit in the BERT model, an extensive text corpus is selected. Specifically, English Wikipedia [38] is chosen, consisting of more than 20 GB of sanitized text and including all the entries of Wikipedia in that language. To ensure that the texts, after tokenization, adhere to the input size constraint imposed by the model, the entries are split by periods, generating sentences, which constitute the definitive input for PBCE.
This corpus is suitable because it contains a large number of texts on diverse topics, all written with high quality. Moreover, it is not directly related to any of the tasks in the GLUE benchmark, which will be used to evaluate the performance of the simplified models.
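A possible implementation of this step, assuming the Hugging Face copy of English Wikipedia referenced in the Data Availability Statement, is sketched below; the configuration name and the naive period-based splitting are assumptions made for illustration.

```python
# Sketch of the corpus-selection step: load English Wikipedia and split
# each entry by periods so the resulting sentences fit BERT's input limit.
from datasets import load_dataset

wiki = load_dataset("legacy-datasets/wikipedia", "20220301.en", split="train")

def to_sentences(article_text):
    return [s.strip() for s in article_text.split(".") if s.strip()]

sentences = to_sentences(wiki[0]["text"])   # sentences of the first entry
```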

3.2.2. Using Persistent Homology to Analyze BERT Layer Outputs

Once the corpus has been selected for analysis, persistent homology is used to collect and analyze the topological characteristics expressed by each neuron in the network. Considering the output vectors generated by each neuron, the connected components are constructed. These connected components represent the clusters of data points that are topologically connected in some way, and their persistence features help reveal meaningful hidden structures in the data. This information is crucial for understanding the topological features that each neuron contributes to the model. As can be seen in Figure 4, the connected components evolve as the radius r grows, until all of them collapse into one. The lowest value of r that provokes the union of all the connected components, denoted as $r_f$, is crucial in the proposed methodology. The larger the $r_f$, the farther the neuron’s outputs are from one another, meaning that they exhibit higher variability. Conversely, smaller values of $r_f$ suggest that the neuron is less relevant, as its outputs are very close in the metric space generated by the unit, and it can thus be removed. When a neuron is removed, to avoid losing all the information it generated, the mean plus the standard deviation of its outputs is added to the bias of the layer. This way, the contribution of the eliminated neuron is implicitly considered in the network’s inference process.
As mentioned in the previous section, the persistence diagram is used to analyze the evolution of the connected components. Since PBCE focuses on zero-dimensional persistent homology, the points in the diagram align along a line parallel to the Y-axis, since all of them share the same abscissa (given that all connected components are born simultaneously). Therefore, only the values on the Y-axis are considered. $r_f$ can be readily identified in the persistence diagram as the value of r for which all connected components first merge.
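The sketch below illustrates one way to obtain per-unit $r_f$ values: the outputs of a BERT sub-layer are captured with a forward hook and a zero-dimensional persistence diagram is computed per hidden unit. The choice of sub-layer, the arrangement of each unit's outputs into a point cloud (one point per input sentence), and the tiny batch are illustrative assumptions rather than the authors' exact setup.

```python
# Sketch: capture the outputs of one BERT sub-layer and compute r_f
# for a few of its hidden units (illustrative choices throughout).
import numpy as np
import torch
from ripser import ripser
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased").eval()

captured = []
handle = model.encoder.layer[3].intermediate.register_forward_hook(
    lambda module, inputs, output: captured.append(output.detach())
)

with torch.no_grad():
    batch = tokenizer(["A first example sentence.", "A second, different one."],
                      padding=True, return_tensors="pt")
    model(**batch)
handle.remove()

acts = torch.cat(captured, dim=0).numpy()       # shape (N, M, NHU)

def rf_of_unit(unit_outputs):
    # One point per input sentence; r_f is the largest finite H0 death.
    h0 = ripser(unit_outputs, maxdim=0)["dgms"][0]
    deaths = h0[np.isfinite(h0[:, 1]), 1]
    return deaths.max() if deaths.size else 0.0

rf_values = [rf_of_unit(acts[:, :, j]) for j in range(8)]  # first 8 units only
print(rf_values)
```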

3.2.3. Evaluation of the $r_f$ Distribution and Selection of the Important Units

Once the value of $r_f$ for each neuron has been obtained, the analysis of the distribution of these values begins. This helps determine which units can potentially be suppressed. To facilitate the simplification task, three levels of pruning are established:
  • The first level, which is the lightest, involves calculating the first quartile (Q1) of the $r_f$ values and retaining the neurons with $r_f$ values higher than Q1.
  • The second level is slightly more severe, applying the same operation but with the second quartile (Q2).
  • The most intense pruning is the third level, where only the neurons with $r_f$ values higher than the third quartile (Q3) are kept.
The more information a unit provides, the higher its chance to remain in the model. This way, the most influential neurons are retained and the less relevant ones are removed.
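A sketch of this selection step, together with the bias compensation described in Section 3.2.2, is given below for a pair of consecutive dense layers. The exact form of the compensation (mean plus standard deviation of the removed units' outputs propagated through the next layer's weights, ignoring any non-linearity in between) is our reading of the description above, not a specification taken from the original text.

```python
# Sketch: keep units whose r_f exceeds the chosen quartile and fold an
# estimate of the removed units' outputs into the downstream bias.
# `acts` holds the observed activations of `layer`, flattened to shape
# (samples, out_features); `layer` and `next_layer` are assumed to be
# consecutive torch.nn.Linear modules.
import numpy as np
import torch

def prune_dense_pair(layer, next_layer, rf_values, acts, level="Q1"):
    rf_values = np.asarray(rf_values)
    q = {"Q1": 25, "Q2": 50, "Q3": 75}[level]
    threshold = np.percentile(rf_values, q)
    keep = np.where(rf_values > threshold)[0]
    drop = np.where(rf_values <= threshold)[0]

    # Smaller replacement for the pruned layer.
    pruned = torch.nn.Linear(layer.in_features, len(keep))
    pruned.weight.data = layer.weight.data[keep].clone()
    pruned.bias.data = layer.bias.data[keep].clone()

    # Removed units contribute a constant (mean + std of their observed
    # outputs) that is absorbed into the next layer's bias.
    constant = torch.tensor(acts[:, drop].mean(axis=0) + acts[:, drop].std(axis=0),
                            dtype=next_layer.weight.dtype)
    next_layer.bias.data += next_layer.weight.data[:, drop] @ constant
    next_layer.weight.data = next_layer.weight.data[:, keep].clone()
    next_layer.in_features = len(keep)
    return pruned
```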

3.2.4. Evaluation of the Compressed Model Through the GLUE Benchmark

The evaluation of the simplified model through the General Language Understanding Evaluation (GLUE) benchmark involves assessing the model’s performance on a set of diverse natural language understanding tasks. GLUE consists of multiple downstream NLP tasks, such as text classification, sentence similarity, and question answering, and it serves as a standard evaluation suite for assessing the generalization and performance of language models.
Here is how the evaluation process generally works:
  • Fine-tuning process: The simplified model is fine-tuned on the GLUE benchmark tasks. The tasks are MNLI [39], QQP [40], QNLI [41], SST-2 [42], CoLA [43], STS-B [44], MRPC [45] and RTE [46].
  • Model evaluation: Use the fine-tuned simplified model to make predictions on the GLUE benchmark tasks. For each task, the model will generate predictions.
  • Evaluation metrics: Calculate task-specific evaluation metrics for each GLUE task. These metrics can vary depending on the task but often include accuracy, F1 score, or other relevant measures. GLUE provides a standard evaluation script for each task (Table 1).
  • Comparison: Compare the performance of the simplified model to the performance of the original, more complex BERT model and other simplified approaches from the literature. This will give an indication of how much simplification impacted the model’s ability to perform various NLP tasks.
The GLUE benchmark provides a standardized way to assess the trade-off between model complexity and task performance. It helps determine if the simplified model retains sufficient performance on a range of NLP tasks while being computationally more efficient than other BERT models.
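The sketch below shows what this fine-tuning and evaluation loop might look like for a single GLUE task with the Hugging Face Trainer. The library choice, the hyperparameters (other than the 40 epochs mentioned in Section 4), and the variable pruned_model (a simplified BERT with a classification head) are assumptions made for illustration.

```python
# Sketch: fine-tune and evaluate a simplified model on one GLUE task.
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import BertTokenizer, Trainer, TrainingArguments

task = "mrpc"
raw = load_dataset("glue", task)
metric = evaluate.load("glue", task)
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def preprocess(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

encoded = raw.map(preprocess, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return metric.compute(predictions=np.argmax(logits, axis=-1), references=labels)

trainer = Trainer(
    model=pruned_model,   # simplified BERT with a sequence-classification head (assumed)
    args=TrainingArguments(output_dir="pbce-mrpc", num_train_epochs=40,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```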

4. Empirical Evaluation

To assess the performance of PBCE, a thorough empirical procedure has been designed that covers a wide range of natural language tasks included in the GLUE benchmark, through their datasets and specific metrics (MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, and RTE, as listed in Table 1), and both architectures of the BERT model: BERT Base and BERT Large. This benchmark is commonly used in the literature, as shown in [47,48,49]. Additionally, PBCE has been compared with state-of-the-art BERT compression techniques.
This section presents and analyzes the experimental results. First, the distribution of the $r_f$ values and their role in identifying significant neurons for prediction are examined. Subsequently, specific results for simplifying both the BERT Base and BERT Large models are presented, addressing research questions RQ1 and RQ2.

4.1. Distribution of $r_f$ and Selection of the Most Informative Neurons

The analysis of the distribution of $r_f$ is crucial for correctly identifying which neurons contribute the most information to the neural network’s predictions and which units are dispensable. The premise here is that a high $r_f$ value associated with a neuron implies that its outputs exhibit sufficient variability, making the neuron important when the network computes outputs.
Bearing in mind that the mean can be heavily influenced by excessively high $r_f$ values in some situations, the median was chosen as the key metric. Figure 6 shows the median of the $r_f$ values for each layer and component of BERT Base (upper image, Figure 6a) and BERT Large (lower image, Figure 6b). The figure helps determine the pruning intensity to apply to each component and layer of the model. By analyzing the median of the $r_f$ values of the neurons within a component, PBCE quantifies the importance of that component and, consequently, the extent to which it should be pruned. The experimentation revealed that components with median $r_f$ values above 0.2 should receive light pruning, values between 0.1 and 0.2 medium pruning, and values below 0.1 intensive pruning. Nevertheless, it is crucial to assess the optimal cut-off values for each specific case to ensure effective pruning.
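A direct translation of this rule into code might look as follows; the thresholds are those reported above, and the function name is illustrative.

```python
# Sketch: map the median r_f of a component to a pruning intensity.
import numpy as np

def pruning_level(rf_values):
    median = np.median(rf_values)
    if median > 0.2:
        return "light"      # keep units above Q1
    if median >= 0.1:
        return "medium"     # keep units above Q2
    return "intensive"      # keep units above Q3
```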
Before delving into the details of the simplification for each architecture, some considerations that apply to both BERT Base and BERT Large should be taken into account. Firstly, PBCE starts the simplification process from the third layer on. This decision minimally reduces the compression capacity, but it favors the subsequent fine-tuning needed to meet the requirements of the GLUE benchmark tasks. In addition, the implementation of the Attention Output and Output components involves a LayerNorm operation in which the state of the data and the original input are added together. As a consequence, these components cannot be simplified and are excluded from the analysis. It is also observed that the behavior of the Q and K components of each BertLayer is very similar, so the same pruning process is applied to both. Moreover, it has been verified that the component that contributes the most information, according to our hypothesis, is the Intermediate component. Table 2 and Table 3 present the pruning level assigned to each component and layer. For example, for the Q component, layers 3, 4, and 12 are lightly pruned (retaining the neurons with $r_f$ values higher than the first quartile), layers 8 to 11 are pruned with medium intensity (retaining the neurons above the second quartile), and layers 5, 6, and 7 are severely pruned (retaining the neurons above the third quartile). Next, the main ideas derived from Figure 6 are explained.
In the case of BERT Base, there is an increasing trend in the contribution of the neurons, with outputs becoming more variable as they approach the final layer. This trend is abruptly disrupted in layers 3 and 4, where all components exhibit a significantly high median of their $r_f$ values. In consequence, as shown in Table 2, the third and fourth layers are lightly pruned for all components, except for the V component, for which medium pruning is applied to the third layer. In contrast, layers 5 to 7 are the least relevant for every component. Additionally, the Intermediate component is the most meaningful one; as a result, none of its layers is heavily pruned.
In contrast, in the case of BERT Large, the median of the $r_f$ values across the layers does not exhibit a consistent growth pattern. As observed, between layers 3 and 11, all components show increasing values, contributing more information layer by layer. However, there are significant drops in the median of $r_f$ for the Q, K, and V components between layers 12 and 13, as well as between layers 17 and 18. In these same layers, the Intermediate component experiences local maxima. In general, the values are not very high, and the Intermediate component continues to provide a significant amount of information compared to the others, except for layers 15, 16, and 17, where the Q and K components are more informative.

4.2. Results and Analysis on the BERT Base Model

The results for BERT Base are presented in Table 4. In this table, each column corresponds to a task, and each row represents a compression technique. Thus, for each row, the result obtained by each technique is expressed in the metric corresponding to the task (Table 1). Results are reported according to the common practices in the literature.
Notice that most state-of-the-art techniques do not perform a complete evaluation on the tasks in the GLUE Benchmark. In this work, a thorough study of these tasks has been conducted, reducing the model to 47% of the original parameters (from 110 M to 52 M). PBCE achieves better results than state-of-the-art techniques in all the learning tasks except for SST-2 and RTE, where it still obtains competitive results. Even with a significant reduction in the number of parameters, it manages to improve the performance of the original model in most tasks, such as QQP (+20), QNLI (+1.2), CoLA (+10.86), STS-B (+3.72), MRPC (+2.19), and RTE (+5.58). This was accomplished with only 40 epochs of fine-tuning for each of the tasks. Regarding the number of parameters retained after pruning, PBCE is far from SENSIMIX (13.75 M) and MicroBERT (14.5 M), although the accuracy results indicate that PBCE achieves the best balance between accuracy and size.

4.3. Results and Analysis on the BERT Large Model

In contrast to BERT Base, there are far fewer experimental results reported in the state of the art for BERT Large. Once again, PBCE outperforms RPP [34] in all tasks except for MRPC and RTE, while improving upon the original model in MNLI (0.4/0.33), QNLI (0.45), and CoLA (1.28) (Table 5), again with only 40 epochs of fine-tuning for each of the tasks. In this case, the reduction in the number of parameters is significant, resulting in a compressed model with only 43% of the parameters (from 340 M to 146.2 M). This can be explained by the fact that BERT Large is much larger than BERT Base and hence exhibits greater redundancy.

5. Discussion

The current section elaborates on the key results detailed in the previous section. As evidenced in Table 4, PBCE demonstrates superior performance on six of the eight GLUE Benchmark tasks when applied to BERT Base. Notably, for the remaining tasks, while our approach may not achieve the highest performance, it offers a substantial reduction in model size, with a parameter count that is more than 20% lower (15 M fewer parameters) compared to the DDK method. Moreover, when scaled up to BERT Large, Table 5 shows that PBCE consistently outperforms the state of the art, achieving a 27% reduction in model size (55 M fewer parameters).
In summary, given the results and methodology, it can be concluded that research questions RQ1 and RQ2 have been answered. It has been demonstrated that persistent homology is a useful tool for selecting the most important neurons in the predictive process. Furthermore, a methodology such as PBCE is proposed to simplify the BERT models.
The explainability of a machine learning model is associated with its internal logic and the transformations that occur within it. The more explainable a model is, the greater the level of understanding a human can achieve of the internal processes that take place when the model makes decisions [50]. As mentioned in [51], understanding the role of individual units in the inference and learning process is one possible way to endow a deep neural model with explainability. Given that PBCE identifies the most significant units of each layer by analyzing the topological features of their outputs, it can be considered that PBCE provides explainability to highly complex neural networks, such as Transformer encoders like BERT, answering research question RQ3.

6. Conclusions

Large Language Models (LLMs) are revolutionizing a number of applications of artificial intelligence, especially in the field of natural language processing. However, due to their complex Transformer-based architecture, LLMs are black-box models with a large number of parameters. In this work, PBCE is presented, a Green AI methodology based on zero-dimensional persistent homology designed to compress BERT models and give insights into the role of their units in the inference and learning process. In particular, PBCE analyzes the topological characteristics of neuron outputs, thereby identifying which neurons contribute more information to the inference process and which ones are dispensable. Even though it can be simply described through a geometric procedure, it has a well-founded mathematical background based on homology theory. Therefore, persistent homology is an effective tool for selecting key neurons, answering RQ1.
As a result of the proposed methodology, simplified versions of the BERT model are built that outperform state-of-the-art techniques, even surpassing the original model’s performance on most tasks included in the GLUE Benchmark. This allows us to understand the topological behavior of LLMs like BERT, making them more explainable while maintaining performance and increasing efficiency. Therefore, RQ2 and RQ3 are answered by PBCE, which enhances model explainability.
As a line of research, it would be worthwhile to investigate the generalization of PBCE to other encoder architectures, particularly those based on the BERT family (e.g., RoBERTa, DeBERTa, DistilBERT). Furthermore, it would be interesting to explore the potential of applying PBCE to downstream tasks by fine-tuning these models on specific predictive objectives, such as sentiment classification, text summarization, and machine translation.

Author Contributions

Conceptualization: L.B., M.L. and J.M.B.; methodology: L.B., M.L. and J.M.B.; software: L.B.; validation: L.B.; investigation: M.L. and J.M.B.; writing—original draft: L.B.; writing—review and editing: M.L. and J.M.B.; visualization: L.B.; supervision: M.L. and J.M.B.; project administration: J.M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the projects with references PID2020-118224RB-100 (funded by MICIU/AEI/10.13039/501100011033) and PID2023-151336OB-I00, granted by the Spanish Ministerio de Ciencia, Innovación y Universidades.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in the experimentation of this article are publicly available on the Internet. Specifically, Wikipedia dataset can be accessed at the following link: https://huggingface.co/datasets/legacy-datasets/wikipedia (accessed on 1 August 2024). Additionally, GLUE benchmark can be accessed at the following link: https://gluebenchmark.com/ (accessed on 1 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

LLM	Large Language Model
NLP	Natural Language Processing
PBCE	Persistent BERT Compression and Explainability
BERT	Bidirectional Encoder Representations from Transformers
GLUE	General Language Understanding Evaluation
NHU	Number of Hidden Units

Appendix A. Homology Theory: Notation and Mathematical Background

Firstly, based on the book [52] and the survey [53], in which a thorough explanation of homology theory can be found, some concepts are introduced, such as affine combination, affine independence, $k$-simplex, the face of a $k$-simplex, simplicial complex, chain complexes, and the boundary of a $p$-simplex, which are essential for building the theory of homology. Let $v_0, v_1, \ldots, v_k$ be vectors in $\mathbb{R}^d$. A point
$$x = \sum_{i=0}^{k} \lambda_i v_i$$
is an affine combination of the $v_i$ if $\sum_{i=0}^{k} \lambda_i = 1$. The set of affine combinations constitutes the affine hull.
Definition A1 (Affinely independent). Let $x = \sum_i \lambda_i v_i$ and $y = \sum_i \gamma_i v_i$, with $x, y \in \mathbb{R}^d$, be affine combinations of the $v_i$. The points $v_0, v_1, \ldots, v_k$ are affinely independent if $x = y$ holds if and only if $\lambda_i = \gamma_i$ for all $i$. In other words, in a plane of dimension $k$ (a $k$-plane), $k+1$ points are affinely independent if the $k$ vectors $v_i - v_0$, for $1 \le i \le k$, are linearly independent.

An affine combination $x = \sum_i \lambda_i v_i$ is a convex combination if $\lambda_i \ge 0$ for all $i$. The set of convex combinations is called the convex hull. Now, the $k$-simplex concept is introduced.
Definition A2 ($k$-simplex). A $k$-simplex, $\sigma$, is the convex hull of $k+1$ affinely independent points. Its dimension is $\dim \sigma = k$.

Note that a $0$-simplex is a vertex, a $1$-simplex is an edge, a $2$-simplex is a triangle, and a $3$-simplex is a tetrahedron. A face of a $k$-simplex $\sigma$ is the convex hull of a non-empty subset of $\{v_0, v_1, \ldots, v_k\}$. If $\tau$ is a face of $\sigma$, then $\tau \subseteq \sigma$.
At this point, it is of interest to focus on sets of simplices that are closed under taking faces and that have no improper intersections.
Definition A3 (Simplicial complex). A simplicial complex is a finite collection of simplices, $K$, such that
1. if $\sigma \in K$ and $\tau$ is a face of $\sigma$, then $\tau \in K$;
2. if $\sigma, \sigma_0 \in K$, then $\sigma \cap \sigma_0$ is either empty or a face of both.
Simplicial complexes ultimately emerge from intersections of collections of sets. While the theory is developed for entirely general simplicial complexes, in practice the sets considered here will be geometric disks. Thus, the special case of the Vietoris–Rips (VR) complex arises, where the convex sets are disks of radius r. The VR complex plays a fundamental role in the proposed methodology, as it is crucial for interpreting persistent homology and its application in PBCE towards the explainability and simplification of the BERT model.
Definition A4 (Vietoris–Rips (VR) complex). Let $(P, d)$ be a finite metric space. The VR complex of $P$ and $r$ consists of all subsets of $P$ of diameter at most $2r$; that is, $\sigma \in \mathcal{R}(r)$ if and only if $d(p, q) \le 2r$ for all $p, q \in \sigma$.
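As a small worked example (ours, added for illustration), consider three points $p$, $q$, and $s$ with pairwise distances $d(p,q) = 1$ and $d(p,s) = d(q,s) = 2$. Then
$$
\mathcal{R}(r) =
\begin{cases}
\{\{p\},\{q\},\{s\}\} & \text{if } 0 \le 2r < 1,\\
\{\{p\},\{q\},\{s\},\{p,q\}\} & \text{if } 1 \le 2r < 2,\\
\text{all non-empty subsets of } \{p,q,s\} & \text{if } 2r \ge 2.
\end{cases}
$$
In dimension zero, three connected components are born at $0$; one dies at the scale $2r = 1$ and another at $2r = 2$, the scale at which all components have merged into one.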
After introducing the most basic ideas, let us construct the more elaborate concepts that are part of the definition of the homology group and of persistent homology, which are the key tools of this work.
Definition A5 (Chain complexes). Let $K$ be a simplicial complex and $p$ a dimension. A $p$-chain is a formal sum of $p$-simplices, $c = \sum_i a_i \sigma_i$, where the $\sigma_i$ are $p$-simplices and the $a_i$ are coefficients modulo 2.

Two $p$-chains can be added component-wise. In fact, the $p$-chains with the addition operation form the abelian group of $p$-chains, $(C_p, +)$.
The boundary of a $p$-simplex is defined as the sum of its $(p-1)$-dimensional faces. Writing $\sigma = [v_0, v_1, \ldots, v_p]$ for the simplex defined by the listed vertices, its boundary is
$$\partial_p \sigma = \sum_{j=0}^{p} [v_0, v_1, \ldots, \tilde{v}_j, \ldots, v_p],$$
where $\tilde{v}_j$ indicates that $v_j$ is omitted. For a $p$-chain, $c = \sum_i a_i \sigma_i$, the boundary is $\partial_p c = \sum_i a_i \partial_p \sigma_i$. Hence, the boundary operator maps a $p$-chain to a $(p-1)$-chain,
$$\partial_p : C_p(K) \to C_{p-1}(K).$$
Note that $\partial_p(c + c') = \partial_p c + \partial_p c'$; in other words, the boundary operator is a homomorphism. The chain complex is the sequence of chain groups connected by boundary homomorphisms,
$$\cdots \xrightarrow{\;\partial_{p+2}\;} C_{p+1}(K) \xrightarrow{\;\partial_{p+1}\;} C_p(K) \xrightarrow{\;\partial_p\;} C_{p-1}(K) \xrightarrow{\;\partial_{p-1}\;} \cdots$$
The elements of $Z_p(K) = \ker(\partial_p)$ are called $p$-cycles, and those of $B_p(K) = \operatorname{im}(\partial_{p+1})$ are called $p$-boundaries.
Definition A6 (Homology group). The $p$-th homology group is defined as
$$H_p = Z_p / B_p.$$
The $p$-th Betti number is the rank of this group, $\beta_p = \operatorname{rank} H_p$.
Finally, persistent homology is presented, which measures the scale of a topological feature by combining geometry and algebra. Consider a simplicial complex $K$ and a monotonic function $f : K \to \mathbb{R}$, which implies that the sublevel set $K(a) = f^{-1}(-\infty, a]$ is a subcomplex of $K$ for every $a \in \mathbb{R}$. Letting $m$ be the number of simplices in $K$, $n + 1 \le m + 1$ different subcomplexes appear,
$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_n = K.$$
In other words, if $-\infty = a_0 < a_1 < a_2 < \cdots < a_n$ are the function values of the simplices in $K$, then $K_i = K(a_i)$. This sequence of complexes is known as the filtration of $f$. For every $i \le j$, there is an inclusion map from $K_i$ to $K_j$ and, therefore, an induced homomorphism
$$f_p^{i,j} : H_p(K_i) \to H_p(K_j).$$
The filtration corresponds to a sequence of homology groups connected by homomorphisms,
$$0 = H_p(K_0) \to H_p(K_1) \to \cdots \to H_p(K_n) = H_p(K),$$
for each dimension $p$. In the transition from $K_{i-1}$ to $K_i$, new homology classes are gained, and others are lost when they merge with each other. A class is born at the threshold at which it first appears and dies at the threshold at which it merges into an older class.
Definition A7 (Persistent homology). The $p$-th persistent homology groups are the images of the homomorphisms induced by inclusion, $H_p^{i,j} = \operatorname{im} f_p^{i,j}$, for $0 \le i \le j \le n$. The corresponding $p$-th persistent Betti numbers are the ranks of these groups, $\beta_p^{i,j} = \operatorname{rank} H_p^{i,j}$.

The collection of persistent Betti numbers can be visualized by drawing points in the extended real plane $\bar{\mathbb{R}}^2$. Letting $\mu_p^{i,j}$ be the number of $p$-dimensional classes born at $K_i$ and dying entering $K_j$,
$$\mu_p^{i,j} = (\beta_p^{i,j-1} - \beta_p^{i,j}) - (\beta_p^{i-1,j-1} - \beta_p^{i-1,j}),$$
for all $i < j$ and all $p$. Drawing each point $(a_i, a_j)$ with multiplicity $\mu_p^{i,j}$, the $p$-th persistence diagram of the filtration is defined.

References

  1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  2. Bolón-Canedo, V.; Morán-Fernández, L.; Cancela, B.; Alonso-Betanzos, A. A review of green artificial intelligence: Towards a more sustainable future. Neurocomputing 2024, 599, 128096. [Google Scholar] [CrossRef]
  3. Schwartz, R.; Dodge, J.; Smith, N.A.; Etzioni, O. Green AI. arXiv 2019, arXiv:1907.10597. [Google Scholar] [CrossRef]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  5. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  6. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar]
  7. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2020, arXiv:1909.11942. [Google Scholar]
  8. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Minneapolis, MN, USA, 2020; pp. 4163–4174. [Google Scholar] [CrossRef]
  9. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Association for Computational Linguistics: Minneapolis, MN, USA, 2018; pp. 353–355. [Google Scholar] [CrossRef]
  10. Mileyko, Y.; Mukherjee, S.; Harer, J. Probability measures on the space of persistence diagrams. Inverse Probl. 2011, 27, 124007. [Google Scholar] [CrossRef]
  11. Chen, M.; Wang, D.; Feng, S.; Zhang, Y. Topological Regularization for Representation Learning via Persistent Homology. Mathematics 2023, 11, 1008. [Google Scholar] [CrossRef]
  12. Choe, S.; Ramanna, S. Cubical Homology-Based Machine Learning: An Application in Image Classification. Axioms 2022, 11, 112. [Google Scholar] [CrossRef]
  13. Pun, C.S.; Lee, S.X.; Xia, K. Persistent-homology-based machine learning: A survey and a comparative study. Artif. Intell. Rev. 2022, 55, 5169–5213. [Google Scholar] [CrossRef]
  14. Routray, M.; Vipsita, S.; Sundaray, A.; Kulkarni, S. DeepRHD: An efficient hybrid feature extraction technique for protein remote homology detection using deep learning strategies. Comput. Biol. Chem. 2022, 100, 107749. [Google Scholar] [CrossRef]
  15. Nauman, M.; Ur Rehman, H.; Politano, G.; Benso, A. Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins. J. Grid Comput. 2019, 17, 225–237. [Google Scholar] [CrossRef]
  16. Wu, K.; Zhao, Z.; Wang, R.; Wei, G.W. TopP–S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility. J. Comput. Chem. 2018, 39, 1444–1454. [Google Scholar] [CrossRef] [PubMed]
  17. Rathore, A.; Zhou, Y.; Srikumar, V.; Wang, B. TopoBERT: Exploring the topology of fine-tuned word representations. Inf. Vis. 2023, 22, 186–208. [Google Scholar] [CrossRef]
  18. Clark, K.; Khandelwal, U.; Levy, O.; Manning, C.D. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 276–286. [Google Scholar] [CrossRef]
  19. google-bert/bert-base-cased · Hugging Face—huggingface.co. Available online: https://huggingface.co/bert-base-cased (accessed on 3 September 2023).
  20. google-bert/bert-large-cased · Hugging Face—huggingface.co. Available online: https://huggingface.co/bert-large-cased (accessed on 3 September 2023).
  21. Gupta, M.; Agrawal, P. Compression of Deep Learning Models for Text: A Survey. ACM Trans. Knowl. Discov. Data 2022, 16, 61. [Google Scholar] [CrossRef]
  22. Ganesh, P.; Chen, Y.; Lou, X.; Khan, M.A.; Yang, Y.; Sajjad, H.; Nakov, P.; Chen, D.; Winslett, M. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Trans. Assoc. Comput. Linguist. 2021, 9, 1061–1080. [Google Scholar] [CrossRef]
  23. Lee, H.D.; Lee, S.; Kang, U. AUBER: Automated BERT regularization. PLoS ONE 2021, 16, e0253241. [Google Scholar] [CrossRef] [PubMed]
  24. Zhang, X.; Fan, J.; Hei, M. Compressing BERT for Binary Text Classification via Adaptive Truncation before Fine-Tuning. Appl. Sci. 2022, 12, 12055. [Google Scholar] [CrossRef]
  25. Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019. [Google Scholar]
  26. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Florence, Italy, 2019; pp. 5797–5808. [Google Scholar] [CrossRef]
  27. Huang, S.; Liu, N.; Liang, Y.; Peng, H.; Li, H.; Xu, D.; Xie, M.; Ding, C. An Automatic and Efficient BERT Pruning for Edge AI Systems. arXiv 2022, arXiv:2206.10461. [Google Scholar]
  28. Zheng, D.; Li, J.; Yang, Y.; Wang, Y.; Pang, P.C.I. MicroBERT: Distilling MoE-Based Knowledge from BERT into a Lighter Model. Appl. Sci. 2024, 14, 6171. [Google Scholar] [CrossRef]
  29. Zhang, Z.; Lu, Y.; Wang, T.; Wei, X.; Wei, Z. DDK: Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT. Neural Netw. 2024, 173, 106164. [Google Scholar] [CrossRef]
  30. Huang, T.; Dong, W.; Wu, F.; Li, X.; Shi, G. Uncertainty-Driven Knowledge Distillation for Language Model Compression. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2850–2858. [Google Scholar] [CrossRef]
  31. Lin, Y.J.; Chen, K.Y.; Kao, H.Y. LAD: Layer-Wise Adaptive Distillation for BERT Model Compression. Sensors 2023, 23, 1483. [Google Scholar] [CrossRef]
  32. Zhang, S.; Zheng, X.; Li, G.; Yang, C.; Li, Y.; Wang, Y.; Chao, F.; Wang, M.; Li, S.; Ji, R. You only compress once: Towards effective and elastic BERT compression via exploit–explore stochastic nature gradient. Neurocomputing 2024, 599, 128140. [Google Scholar] [CrossRef]
  33. Chen, T.; Frankle, J.; Chang, S.; Liu, S.; Zhang, Y.; Wang, Z.; Carbin, M. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. Adv. Neural Inf. Process. Syst. 2020, 33, 15834–15846. [Google Scholar]
  34. Guo, F.M.; Liu, S.; Mungall, F.S.; Lin, X.; Wang, Y. Reweighted Proximal Pruning for Large-Scale Language Representation. arXiv 2019, arXiv:1909.12486. [Google Scholar]
  35. Shen, S.; Dong, Z.; Ye, J.; Ma, L.; Yao, Z.; Gholami, A.; Mahoney, M.W.; Keutzer, K. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. arXiv 2019, arXiv:1909.05840. [Google Scholar] [CrossRef]
  36. Li, B.; Kong, Z.; Zhang, T.; Li, J.; Li, Z.; Liu, H.; Ding, C. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Minneapolis, MN, USA, 2020; pp. 3187–3199. [Google Scholar] [CrossRef]
  37. Piao, T.; Cho, I.; Kang, U. SensiMix: Sensitivity-Aware 8-bit index 1-bit value mixed precision quantization for BERT compression. PLoS ONE 2022, 17, e0265621. [Google Scholar] [CrossRef] [PubMed]
  38. legacy-datasets/wikipedia · Datasets at Hugging Face—huggingface.co. Available online: https://huggingface.co/datasets/wikipedia (accessed on 8 September 2024).
  39. Williams, A.; Nangia, N.; Bowman, S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 1112–1122. [Google Scholar] [CrossRef]
  40. Wang, Z.; Hamza, W.; Florian, R. Bilateral multi-perspective matching for natural language sentences. arXiv 2017, arXiv:1702.03814. [Google Scholar]
  41. The Stanford Question Answering Dataset—rajpurkar.github.io. Available online: https://rajpurkar.github.io/SQuAD-explorer/ (accessed on 4 September 2024).
  42. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
  43. Warstadt, A.; Singh, A.; Bowman, S.R. Neural Network Acceptability Judgments. Trans. Assoc. Comput. Linguist. 2019, 7, 625–641. [Google Scholar] [CrossRef]
  44. Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017); Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 1–14. [Google Scholar] [CrossRef]
  45. Download Microsoft Research Paraphrase Corpus from Official Microsoft Download Center—microsoft.com. Available online: https://www.microsoft.com/en-us/download/details.aspx?id=52398 (accessed on 4 September 2023).
  46. Bentivogli, L.; Dagan, I.; Magnini, B. The Recognizing Textual Entailment Challenges: Datasets and Methodologies. In Handbook of Linguistic Annotation; Ide, N., Pustejovsky, J., Eds.; Springer: Dordrecht, The Netherlands, 2017; pp. 1119–1147. [Google Scholar] [CrossRef]
  47. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2024, arXiv:2307.06435. [Google Scholar]
  48. Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic Evaluation of Language Models. arXiv 2023, arXiv:2211.09110. [Google Scholar]
  49. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
  50. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy 2021, 23, 18. [Google Scholar] [CrossRef] [PubMed]
  51. Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining Explanations: An Overview of Interpretability of Machine Learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89. [Google Scholar] [CrossRef]
  52. Edelsbrunner, H.; Harer, J. Computational Topology: An Introduction; American Mathematical Society: Providence, RI, USA, 2010. [Google Scholar] [CrossRef]
  53. Hensel, F.; Moor, M.; Rieck, B. A Survey of Topological Machine Learning Methods. Front. Artif. Intell. 2021, 4, 681108. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Usage of BERT involves taking a sentence, adding the special tokens [CLS] and [SEP], tokenizing the words, and using these tokens as input for the neural network.
Figure 2. Representation of the BERT architecture. It is composed of an embedding module followed by the Encoder, which consists of N BERT Layers (12 for BERT Base, 24 for BERT Large). Within each BERT Layer, three main components stand out: the Attention layer, the Intermediate layer, and the Output layer. After the Encoder, BERT has a Pooler layer.
Figure 3. Flow of information through the BERT model and subsequent extraction of values from intermediate dense layers. The process begins by feeding a set of sentences from a corpus into the network: once the input of N sentences of length M is constructed, it is passed through the model. The lower right part of the image shows the output of an arbitrary dense layer, a three-dimensional matrix whose dimensions are the number of input sentences (N), the length of the tokenized sentences (M), and the number of hidden neurons in that layer. This matrix is taken as the information associated with each neuron and is analyzed with persistent homology.
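The extraction step illustrated in Figure 3 can be reproduced with a forward hook on any dense layer. The following is a minimal sketch assuming the HuggingFace transformers library and PyTorch; the checkpoint, hooked layer, and sentences are purely illustrative and not necessarily the exact setup used by the authors.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

captured = {}

def save_output(module, inputs, output):
    # The dense layer emits a (N, M, hidden) tensor: sentences x tokens x hidden neurons
    captured["activations"] = output.detach()

# Illustrative choice: the Intermediate dense layer of encoder layer 3
handle = model.encoder.layer[3].intermediate.dense.register_forward_hook(save_output)

sentences = ["The cat sat on the mat.", "Persistent homology studies connected components."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model(**batch)
handle.remove()

acts = captured["activations"]        # shape (N, M, hidden)
neuron_0 = acts[:, :, 0]              # all values produced by a single neuron

Each per-neuron slice of this matrix is the point set to which persistent homology is applied in the next step.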
Figure 4. Application of persistent homology to the output of a neuron at three specific moments. On the left, each point of the output becomes the center of a disk whose radius grows uniformly for all points. On the right, the Birth–Death diagram of zero-dimensional persistent homology is shown; each blue point marks the disappearance of a connected component after collapsing with another. The last moment depicted corresponds to the radius r at which all connected components first merge. This value, denoted r_f, is crucial in the proposed methodology because it quantifies the importance of a neuron's output within the neural network's data flow.
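The growing-disk construction of Figure 4 corresponds to a Vietoris–Rips filtration, whose zero-dimensional Birth–Death diagram can be computed with standard topological data analysis tools. Below is a minimal sketch assuming the ripser Python package; the random point cloud merely stands in for one neuron's output values.

import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))      # synthetic stand-in for a neuron's output

# Zero-dimensional persistent homology (connected components only)
diagram_h0 = ripser(points, maxdim=0)["dgms"][0]
# Each row is (birth, death); every component is born at 0 and one death is infinite,
# corresponding to the component that never disappears.
print(diagram_h0[:5])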
Figure 5. [Diagram: X-axis, Birth Time; Y-axis, Death Time (Persistence)]. Example of a Birth–Death persistence diagram. Birth times of connected components are shown on the X-axis; since zero-dimensional persistent homology is used, all connected components are born at time zero. As the value of r increases (Y-axis), the connected components collapse, and each collapse of two components is represented by a point. The last value below the dashed line corresponds to r_f (circled in red).
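Under the same assumptions (ripser, synthetic data), the value r_f described in Figure 5 can be read off the diagram as the largest finite death time, i.e., the scale at which the last two connected components merge. Note that ripser reports deaths on the pairwise-distance scale; if the disk-radius convention of Figure 4 is used instead, this value would be halved.

import numpy as np
from ripser import ripser

points = np.random.default_rng(0).normal(size=(200, 2))
diagram_h0 = ripser(points, maxdim=0)["dgms"][0]

deaths = diagram_h0[:, 1]
finite_deaths = deaths[np.isfinite(deaths)]
r_f = finite_deaths.max()               # merge scale of the last two components
print(f"r_f = {r_f:.4f}")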
Figure 6. Median per layer of the distribution of r_f for BERT Base (upper image) and BERT Large (lower image). The components of the BertLayer are shown for each layer: Self-Attention (Q, blue; K, red; V, green) and Intermediate (orange). A higher value of r_f (y-coordinate) indicates a greater contribution of the information generated by that component to the network. As can be observed, the flow of information differs between the two networks, with distinct behavior per component and layer in BERT Base and BERT Large. (a) BERT Base; (b) BERT Large.
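Once an r_f value is available for every neuron, the per-layer medians plotted in Figure 6 are straightforward to aggregate. The sketch below assumes the values have already been collected into a table with layer and component labels; the numbers shown are invented for illustration.

import pandas as pd

# Hypothetical records: one row per neuron with its layer, component, and r_f value
records = pd.DataFrame({
    "layer":     [3, 3, 3, 3, 4, 4, 4, 4],
    "component": ["Q", "K", "V", "Intermediate", "Q", "K", "V", "Intermediate"],
    "r_f":       [0.81, 0.64, 0.40, 1.20, 0.77, 0.61, 0.38, 1.15],
})

medians = records.groupby(["layer", "component"])["r_f"].median().unstack()
print(medians)   # one median per (layer, component), as plotted in Figure 6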
Table 1. GLUE tasks, training and evaluation sizes, and metrics.
Task | Train | Evaluation | Metric
CoLA | 10 K | 1 K | Matthew's Correlation
SST-2 | 67 K | 872 | Accuracy
MRPC | 5.8 K | 1 K | F1/Accuracy
STS-B | 7 K | 1.5 K | Pearson–Spearman Correlation
QQP | 400 K | 10 K | F1/Accuracy
MNLI | 393 K | 20 K | Accuracy
QNLI | 108 K | 11 K | Accuracy
RTE | 2.7 K | 0.5 K | Accuracy
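For reference, the GLUE tasks and metrics in Table 1 are accessible through the HuggingFace datasets and evaluate libraries. The snippet below is a small illustrative example using CoLA; split sizes reported by the library may differ slightly from the rounded figures in the table.

from datasets import load_dataset
import evaluate

cola = load_dataset("glue", "cola")
metric = evaluate.load("glue", "cola")   # Matthew's correlation for CoLA

print(cola["train"].num_rows, cola["validation"].num_rows)
print(metric.compute(predictions=[1, 0, 1, 1], references=[1, 1, 0, 1]))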
Table 2. Levels of pruning applied to each component and layer of the BERT Base architecture. The columns Q1, Q2, and Q3 indicate in which layers light, intermediate, or intense pruning is performed based on their r_f values. Note that the V component tends to contribute less information (most of its layers are heavily pruned), while the Intermediate component provides a significant amount of information (none of its layers is pruned heavily, and most are pruned lightly). For more details, please refer to Figure 6.
BertLayer Component | Q1 | Q2 | Q3
Q | 3, 4, 12 | 8–11 | 5–7
K | 3, 4, 12 | 8–11 | 5–7
V | 4 | 3, 11, 12 | 5–10
Intermediate | 3, 4 | 5–12 | -
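Tables 2 and 3 group layers by how aggressively they are pruned according to their r_f values. The sketch below shows one plausible way to derive such a grouping from per-layer median r_f values using quartile thresholds; the thresholds, the direction (lower r_f implies heavier pruning, consistent with the captions), and the resulting labels are illustrative assumptions, not the exact PBCE settings.

import numpy as np

def pruning_levels(median_r_f):
    """Map each layer's median r_f to an illustrative pruning intensity.
    Layers in the top quartile of r_f (most informative) are pruned lightly,
    layers in the bottom quartile are pruned intensely, the rest moderately."""
    r = np.asarray(median_r_f, dtype=float)
    q_low, q_high = np.quantile(r, [0.25, 0.75])
    return ["light" if v >= q_high else "intense" if v <= q_low else "intermediate"
            for v in r]

# Hypothetical median r_f values for layers 3-8 of one component
print(pruning_levels([1.3, 1.2, 0.5, 0.6, 0.7, 0.9]))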
Table 3. Levels of pruning applied to each component and layer of the BERT Large architecture. The columns Q1, Q2, and Q3 indicate in which layers light, intermediate, or intense pruning is performed based on their r_f values. As in the case of BERT Base, the Intermediate component proves to be relevant for the network's prediction in the BERT Large architecture as well. For more details, please refer to Figure 6.
BertLayer Component | Q1 | Q2 | Q3
Q | 11, 12, 14–17 | 4–10, 18 | 3, 13, 19–24
K | 11, 14–17 | 4–10, 12, 18, 19 | 3, 13, 20–24
V | - | 4, 8–12, 14–17 | 3, 5–7, 10, 13, 18–24
Intermediate | 3, 4, 8–16, 18 | 5–7, 17, 19–24 | -
Table 4. Results of compressing BERT Base on the GLUE Benchmark tasks. For each metric and task, the higher the value, the better (Table 1). For the remaining parameters (RP), expressed in millions, the smaller the value, the better. Best results are highlighted in bold (excluding the original BERT).
Method | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | RP (M)
Original [1] | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | -
AUBER [23] | - | - | - | - | 60.59 ± 0.73 | - | 85.62 ± 0.51 | 65.31 ± 1.30 | -
AE-BERT [27] | - | - | 88.7 | - | - | 86.1 | 89.5 | 69.7 | -
ETbLSL [36] | 82.9 | 90.7 | 88.2 | 89.3 | 52.6 | 84.6 | 88.3 | 63.9 | -
LotteryTicketBert [33] | - | - | 88.9 | - | 53.8 | 88.2 | 84.9 | 66 | -
Michel et al. [25] | - | - | - | - | 58.86 ± 0.64 | - | 84.22 ± 0.33 | 63.9 | -
Voita et al. [26] | - | - | - | - | 55.34 ± 0.81 | - | 83.92 ± 0.71 | 64.12 ± 1.65 | -
QBERT [35] | 77.02/76.56 | - | - | 84.63 | - | - | - | - | -
YOCO-BERT [32] | 82.6 | 90.5 | 87.2 | 91.6 | 59.8 | - | 89.3 | 72.9 | 67
SENSIMIX [37] | - | 89.6 | 86.5 | 90.3 | - | - | 87.2 | - | 13.75
LAD [31] | 81.01/81.47 | 87.56 | 89.24 | 91.74 | - | - | 88.71 | 67.15 | 52.5
DDK [29] | 83.6 | 88.2 | 91.6 | 92.7 | 61.9 | 89.1 | 90.7 | 73.7 | 67
UEM [30] | - | 86.2 | 86.4 | 87.5 | 46.8 | - | 86.7 | - | 66.8
MicroBERT [28] | 80.3 | 86.6 | - | 89.6 | - | - | 88.7 | 62.8 | 14.5
PBCE | 83.7/81.6 | 91.23 | 91.73 | 91.87 | 62.96 | 89.52 | 91.09 | 71.98 | 52
Table 5. Results of compressing BERT Large on the GLUE Benchmark tasks. For each metric and task, the higher the value, the better (Table 1). For the remaining parameters (RP), expressed in millions, the smaller the value, the better. Best results are highlighted in bold (excluding the original BERT).
Method | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | RP (M)
Original [1] | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | -
RPP [34] | 86.1/85.7 | - | - | - | 61.3 | - | 88.1 | 70.1 | 201
PBCE | 87.1/86.2 | 71.9 | 93.2 | 94.7 | 62.1 | 85.2 | 88.6 | 71.7 | 146.2
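The RP column in Tables 4 and 5 reports remaining parameters in millions. A quick way to obtain such counts for an original or pruned checkpoint is to sum parameter sizes directly, as in the sketch below (assuming PyTorch and the HuggingFace transformers library; the checkpoint name is illustrative).

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f} M parameters")   # roughly 110 M for BERT Base
# Running the same count on a compressed model yields the values reported in the RP column.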