1. Introduction
Over the last few years, deep neural networks have seen a substantial increase in their number of parameters and in their computational cost. As a result, their memory footprint and inference time severely hinder the deployment of such methods, especially in resource-constrained environments such as embedded systems or mobile devices. For that reason, neural network compression has recently been the subject of extensive research. Although many theoretical studies have been conducted on compression, the field still lacks convenient tools for (1) practical applications and (2) research. Additionally, because of this lack of tools, there is no standard way of implementing new compression techniques, making comparison with previous techniques more difficult [
1]. To solve this issue, we propose FasterAI [
2], an open-source library, released under an Apache-2.0 license and available at
https://nathanhubens.github.io/fasterai (accessed on 20 September 2022). It also includes extensive documentation and several tutorials to help users become acquainted with the library.
1.1. Related Work
The research field of neural network compression has recently been extremely active, leading to a large number of published ideas [
3,
4,
5], but also to the release of their corresponding implementations [
6,
7]. However, the available implementations may rely on different deep learning libraries and be designed for different application cases, thus requiring extensive adaptation before they can be compared. As a result, the field of compression can appear overwhelming both for researchers who wish to implement new techniques and compare them with current methods, and for newcomers who simply want to compress a neural network for a concrete application.
Several pieces of work have addressed this problem by creating libraries that allow compression techniques to be implemented seamlessly, such as PyTorch Pruning [
8] and Sparse ML [
9]. However, those are mainly concerned with sparsification, neglecting other compression techniques, such as knowledge distillation and regularization. Another library, Nervana Distiller [
10], provides a more thorough compression toolset, but is intended primarily for research usage. Additionally, most of those libraries require new compression techniques to be implemented in a self-contained way, limiting the opportunities for extensive experiments. In FasterAI, we aim to reduce the need for custom implementation to a bare minimum. Indeed, implementing a new method in FasterAI usually boils down to writing a single line of code. Moreover, to the best of our knowledge, FasterAI is the first compression library available for both fastai [
11] and PyTorch Lightning [
12].
1.2. Overview
The objective of FasterAI is twofold: (1) allow users not familiar with the domain to apply compression techniques; and (2) allow researchers to easily implement new compression methods and perform various experiments. FasterAI is organized around four modules, each providing distinct compression capabilities and depending on several arguments, as represented in
Figure 1.
Sparsify. The first module is responsible for making neural networks sparse, either in a static way, when retraining cannot be considered, or in a dynamic way, using a callback system, with sparsification occurring during the training of the neural network.
Distill. This module is in charge of knowledge distillation techniques, i.e., training with a teacher–student paradigm, where a large model guides a smaller one to reach better performance, thus compressing the knowledge of a large model into a smaller one.
Regularize. The regularize module handles group regularization methods, i.e., techniques adding a penalty term on the magnitude of the weights, acting as a feature selection method, where some weights will be pushed toward 0, leading to a learned sparse model.
Misc. The last module includes standalone compression methods, such as batch normalization folding, which absorbs batch normalization layers into the preceding layers, as they become redundant once training is complete. It also includes factorization methods for fully connected layers that replace large weight matrices with smaller ones, thus reducing the total number of weights.
To summarize, with FasterAI, we provide:
An extensive, documented and open-source PyTorch-based neural network compression library.
A new granular design approach for compression techniques, making it possible to seamlessly compose thousands of different compression methods simply by choosing between the available options.
A framework suited for practical use as well as for research, providing common compression techniques out of the box and allowing new compression methods to be conceived in a single line of code.
This paper is divided into four sections, each describing a compression module of FasterAI. In particular, we want to highlight how convenient it is to perform different kinds of experiments, either using well-known techniques or creating novel ones. Indeed, by leveraging the callback system of recent deep learning libraries, such as fastai and PyTorch Lightning, FasterAI provides a user-friendly, high-level API, making it easy to combine and customize compression techniques. Although FasterAI is suitable for many types of architectures, we illustrate its use on convolutional neural networks.
2. Sparsify
The core of FasterAI resides in its sparsify module, which contains capabilities for creating sparse networks, i.e., networks in which a large number of weight values are zeroes. FasterAI offers two main ways to create a sparse network: (1) the static way, using the Sparsifier class, which can sparsify either a specified layer or the whole model, and (2) the dynamic way, using the SparsifyCallback, which must be used in conjunction with training and removes weights while the network is learning. Examples of usage for both methods are given in Listing 1.
Listing 1. The two ways of sparsifying a model. The static way is performed offline, disconnected from training, while the dynamic way is performed during training.
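As an illustration, a minimal usage sketch is given below. It assumes a PyTorch model named model and a fastai Learner named learn; the import path, the sparsify_model method and the exact argument names are assumptions and may differ from the released API.

from fasterai.sparse.all import *  # assumed import path

# (1) Static: sparsify an already-trained model, without any retraining.
sparsifier = Sparsifier(model, granularity='weight', context='local', criteria=large_final)
sparsifier.sparsify_model(sparsity=50)  # zero out 50% of the weights (method name assumed)

# (2) Dynamic: sparsify while the network trains, through a callback.
sp_cb = SparsifyCallback(sparsity=50, granularity='weight', context='local',
                         criteria=large_final, schedule=one_cycle)
learn.fit_one_cycle(10, cbs=sp_cb)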
While the static way is faster to apply, as it does not require any additional steps, the lack of retraining after the removal of some parameters deeply impairs the model performance. For that reason, the dynamic way is usually preferred when trying to achieve compression while keeping the performance as high as possible. Although the distinction is not always clear in the literature, within FasterAI we differentiate between the process of sparsification, i.e., making neural network weights sparse, and pruning, i.e., physically removing those sparse weights. Indeed, the SparsifyCallback does not remove any of the network’s weights; rather, it creates a binary mask with the same structure as the weights and applies it, zeroing out a weight when the mask value is 0 and keeping it unchanged when the mask value is 1. Weights sparsified during such a process are still present in the computation graph but no longer participate in the final decision.
The whole power of the sparsification capabilities of FasterAI lies in its SparsifyCallback, designed around four independent building blocks: granularity, context, criteria, and schedule, which are sufficient to fully describe the most common sparsification techniques. Those building blocks correspond to the four main axes of research in the field, each providing an answer to the following questions:
Granularity: how to sparsify?
Context: where to sparsify?
Criteria: what to sparsify?
Schedule: when to sparsify?
The purpose is to decompose the sparsifying problem into four subproblems. By doing so, each argument can be modified independently from the others, which (1) creates a vast number of opportunities and combinations for experiments and (2) provides a unique and versatile callback, reducing the implementation of a novel sparsification technique to the modification of a single argument.
2.1. Granularity: How to Sparsify?
In FasterAI, the granularity designates the structure of the blocks of weights that are removed during the sparsification process. FasterAI handles the most common sparsifying granularities, e.g., weight, kernel and filter, but also allows the use of less common ones, e.g., horizontal slices and shared kernels. In the literature, the terms unstructured and structured sparsity are often used to indicate whether sparsity is applied to individual weights (unstructured) or to larger blocks (structured). In FasterAI, we adopt a more nuanced approach by defining as many granularities as there are slicing combinations of the weight tensor. In the case of 2D convolutions, 16 granularities are thus available by default. Following PyTorch conventions [8], the weights of a 2D convolutional layer are given by a 4D tensor of dimension [O, I, Kx, Ky], with O, I being, respectively, the output and input channel dimensions, and Kx, Ky the dimensions of the convolutional kernel. The granularities available by default are defined in Listing 2.
Listing 2. Different available granularities for a 4D weight tensor of dimension [O, I, Kx, Ky].
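To make the idea concrete, the self-contained PyTorch sketch below (illustrative, not the FasterAI source) shows how a few of these granularities simply correspond to aggregating an importance score over different dimensions of the weight tensor:

import torch

w = torch.randn(64, 32, 3, 3)  # Conv2d weights of shape [O, I, Kx, Ky]

scores = {
    'weight':           w.abs(),                                    # one score per individual weight
    'vertical_slice':   w.abs().mean(dim=2, keepdim=True),          # aggregate over Kx
    'horizontal_slice': w.abs().mean(dim=3, keepdim=True),          # aggregate over Ky
    'kernel':           w.abs().mean(dim=(2, 3), keepdim=True),     # one score per 2D kernel
    'shared_kernel':    w.abs().mean(dim=(0, 2, 3), keepdim=True),  # one score per input channel, shared by all filters
    'shared_weight':    w.abs().mean(dim=0, keepdim=True),          # same weight pattern shared by all filters
    'filter':           w.abs().mean(dim=(1, 2, 3), keepdim=True),  # one score per output filter
}
for name, s in scores.items():
    print(f'{name:17s} -> score shape {tuple(s.shape)}')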
These granularities are represented in
Figure 2, sorted by how structured the granularity is. On top of the presented granularities, suited for ConvNets, FasterAI also proposes granularities for fully connected layers, as well as for the self-attention layers used in transformer architectures. FasterAI thus allows a wide variety of granularities along which the network’s parameters can be sparsified. Among the less common granularities, we introduce the concept of “shared granularity”, denoting granularity structures that are shared between all filters. For example, shared_weight in
Figure 2d defines a granularity, where weights are selected individually in each filter, but the same selection pattern is applied to each filter in the layer.
As a proof of concept, we conduct an experiment to highlight the impact of pruning granularity on the performance of a neural network. We choose the ResNet-18 architecture [
13], as it is a model commonly used for pruning benchmarking, and apply it to the CALTECH-101 dataset [
14], a dataset with diverse images and classes, split 80:20 between training and validation sets. The model is trained for 30 epochs using the Adam [15] optimizer and a batch size of 64. We then compare the final validation accuracy obtained after sparsifying with each of the available granularities, at four studied sparsity levels. Two initialization methods are considered: either the model is trained from scratch, i.e., the weights are randomly initialized, or fine-tuned from a pretrained version. The context, criteria and schedule are respectively set to local, large_final and one_cycle. The results, as well as the baseline, i.e., the unpruned model, are presented in
Table 1. For readability, the granularity names in Table 1 are abbreviated, e.g., s-v-slice corresponds to the shared_vertical_slice granularity. The layer granularity is deliberately omitted, as it is not applicable in the local context.
From those results, it can be observed that the general trend is that the more structured the granularity, i.e., the larger the removed blocks, the larger the drop in final performance. This can be explained by the fact that more structured granularities are less precise in their selection of weights, potentially removing weights that might be important for the network. It is, however, important to mention that, although less performant, more structured granularities allow for an easier speed-up in practice, as they require less overhead to store sparse weight indices [
4]. Additionally, it can be observed that smaller granularities and low sparsity levels can lead to better performance than the baseline, illustrating the regularization capabilities of sparsification, thus helping to reduce overfitting and increase generalization.
2.2. Context: Where to Sparsify?
In FasterAI, the context refers to the locality of the selection of the weights. In the literature, the two most common options are: (1) local pruning, i.e., the selection of the weights is performed in each layer separately, producing equally sparse layers in the network, and (2) global pruning, i.e., the selection of the weights is performed by comparing those of the whole network, producing a network with different sparsity levels for each layer. Both techniques are expressed in a simplified way in Listing 3.
Listing 3. Simplified representation of local sparsification, comparing weights in each layer independently, and global sparsification, comparing weights from all the layers.
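The self-contained sketch below (illustrative, not the library source) shows the essential difference: a local context computes one threshold per layer, whereas a global context computes a single threshold over the concatenated weights of all layers.

import torch

def local_thresholds(layer_weights, sparsity):
    # One threshold per layer: every layer ends up equally sparse.
    return [torch.quantile(w.abs().flatten(), sparsity / 100) for w in layer_weights]

def global_threshold(layer_weights, sparsity):
    # A single threshold for the whole network: per-layer sparsity may differ.
    all_weights = torch.cat([w.abs().flatten() for w in layer_weights])
    return torch.quantile(all_weights, sparsity / 100)

layers = [torch.randn(64, 32, 3, 3), torch.randn(128, 64, 3, 3)]
print(local_thresholds(layers, 50))  # two thresholds, one per layer
print(global_threshold(layers, 50))  # one shared threshold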
FasterAI handles both methods by default, simply by selecting the local or global option accordingly in the SparsifyCallback. Local and global sparsification have different implications on the final sparsity of the network: a local context leads to equally sparse layers, while a global context leads to layers with different sparsities, which can pose issues for networks possessing bottlenecks, where it can be undesirable to remove too many parameters. If the user wants to specify a particular sparsity level for certain layers, FasterAI accepts a list of sparsities that will be applied to the corresponding layers.
We propose to compare the impact of each context on the performance of a neural network. For this purpose, we use the same architecture, dataset and training parameters as in the experiment conducted in Section 2.1, but using a global context instead. The results are provided in
Table 2. For readability, the same abbreviations are applied to the granularity names in
Table 2.
As can be observed in
Table 2, the general trend seems to be that coarser granularities perform worse than finer ones. Additionally, the drop in performance at high sparsities is larger when the network has been fine-tuned than when it is trained from scratch. By comparing
Table 1 and
Table 2, it can be observed that global sparsifying seems to achieve better results in the scratch training regime, while providing similar results when fine-tuning.
2.3. Criteria: What to Sparsify?
The criteria are a fundamental component of any sparsifying technique, as they act as a proxy for weight importance. In practice, applying the desired criteria to each group of weights returns a score, on which the selection of weights is based. Groups of weights with the lowest scores are zeroed out first, while those with the largest scores are retained. There exist many sparsifying criteria [16], with 14 currently available by default in FasterAI, expressed in a simplified way, following PyTorch notation, in Listing 4. To that end, we define w_i and w_f as, respectively, the initial and final values of the weights, i.e., their values at initialization and at the current step of training.
Listing 4. The list of criteria available in FasterAI and their PyTorch implementation.
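A few representative criteria are sketched below as plain functions of w_i and w_f; this is an illustrative subset and not the library's exact implementation.

import torch

def large_final(w_i, w_f):  return w_f.abs()            # keep weights that are large at the current step
def small_final(w_i, w_f):  return -w_f.abs()           # keep weights that are small at the current step
def large_init(w_i, w_f):   return w_i.abs()            # keep weights that were large at initialization
def magnitude_increase(w_i, w_f): return w_f.abs() - w_i.abs()   # keep weights whose magnitude grew
def movement(w_i, w_f):     return (w_f - w_i).abs()    # keep weights that moved the most
def large_i_large_f(w_i, w_f): return torch.minimum(w_i.abs(), w_f.abs())  # large both at init and now (one possible formulation)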
Because of the way the criteria are implemented in FasterAI, it is very convenient to create custom criteria. Indeed, implementing new selection criteria boils down to writing a single function that is applied to each weight before the sparsification mask is computed. For example, we introduce a novel criterion named mov_large_final, which is similar to the movement criterion but puts more emphasis on the final value of the weights. Similarly, we introduce another criterion, named mov_mag, which favors weights whose absolute value has moved the most. Those criteria are expressed in Listing 5.
Listing 5. Custom criteria and their corresponding implementation in PyTorch.
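One possible implementation of these two custom criteria, following the same function-per-criterion pattern as above, is sketched below; the exact weighting used in FasterAI may differ.

def mov_large_final(w_i, w_f):
    # Movement, re-weighted to emphasize the final magnitude of the weight.
    return (w_f - w_i).abs() * w_f.abs()

def mov_mag(w_i, w_f):
    # How much the absolute value of the weight has moved since initialization.
    return (w_f.abs() - w_i.abs()).abs()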
The decision boundaries of available criteria are represented in
Figure 3. In this figure, we plot the weight values at initialization, w_i, against their values at the current training step, w_f.
In practice, at each sparsifying phase, the chosen criteria are applied to each weight, and the resulting scores are then aggregated according to the desired granularity. The pruning mask is computed by retaining the groups with the largest scores, according to the desired context and sparsity level. The mask is then applied, replacing the weights deemed least important by the criteria with zeroes. Additionally, FasterAI keeps track of the values of the weights during training, which paves the way to creating criteria that use first-order information and take the training dynamics into account.
In
Table 3, we report the comparison between all the available criteria. Experiments are conducted in the same conditions as the previous experiments, with the same architecture and dataset. The granularity is set to weight, the context to local, and the schedule to one_cycle. For readability, the criterion names in
Table 3 are abbreviated, e.g., large i,f corresponds to the large_i_large_f criteria.
From those results, we can observe that the criteria have a minor effect on performance at low sparsity levels. This can be explained by the fact that the network, although part of its parameters are removed, still possesses enough capacity to compensate for the removed weights and achieve decent performance. When the sparsity level increases, however, criteria based on low weight values, e.g., small f, small i and small i,f, perform poorly. This phenomenon happens because weights with low values do not participate much in the final results and thus do not hold much discriminative information about the data.
2.4. Schedule: When to Sparsify?
The last argument required in the SparsifyCallback is the sparsification schedule. It defines when the sparsification process occurs during the training phase. Traditionally, the most common schedules are one-shot, which performs the sparsification in a single step, and iterative, which performs it in several steps. Those methods usually require a fine-tuning phase after each sparsification stage to help the network recover the lost performance. In FasterAI, all schedules are implemented within a single class and are differentiated only by three parameters:
start_pct (defaults to 0): the percentage of training at which the sparsification process starts, i.e., for how long the model will be pretrained.
end_pct: the percentage of training at which the sparsification process stops, i.e., for how long the model will be fine-tuned after being sparsified.
schedule_function: the function describing the evolution of the sparsity during the training. There are four currently available by default: one_shot, iterative, gradual [
17], and one_cycle [
18]. Those schedule functions are expressed in Listing 6.
Listing 6. Schedules available by default in FasterAI and their corresponding implementation.
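As an illustration, the four default schedules can be thought of as functions mapping the training progress t in [0, 1] to the fraction of the target sparsity applied at that point; the sketch below uses this simplified signature, and the one_cycle curve is only a smooth S-shaped stand-in for the actual schedule of [18].

import math

def one_shot(t):
    return 1.0                                # full target sparsity immediately

def iterative(t, n_steps=3):
    return math.ceil(t * n_steps) / n_steps   # staircase over n_steps pruning rounds

def gradual(t):
    return 1 - (1 - t) ** 3                   # cubic ramp-up commonly used for gradual pruning

def one_cycle(t):
    return 3 * t ** 2 - 2 * t ** 3            # smooth S-shaped ramp-up (stand-in for [18])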
By shifting the complexity of the pruning schedule to the schedule_function, we ensure that all schedules can be defined in FasterAI. By doing so, we remove the need for complex training loops, as all schedules are applied in a single main training phase.
In
Figure 4, we represent variations of the four available sparsifying schedules, where adjustments are made to customize the schedule behavior. As can be observed, the start_pct and end_pct parameters can further help the user alter the pruning schedule as desired. For example, in
Figure 4b, the one-shot pruning schedule could also be used with a value of start_pct=0, becoming what is better known as pruning at initialization [
19], achieving the target amount of sparsity right from the start of training. For readability, we abbreviate the names of our schedules, e.g., one_shot becomes os.
We report in
Table 4 the results of applying the schedules represented in
Figure 4. Experiments were conducted in the same training conditions as the previous ones. The granularity was set to weight, the context to local and the criteria to large_final. For readability, in
Table 4, the name of the schedule directly refers to the subfigure index in
Figure 4.
As can be observed, schedules that remove weights later in training seem to produce suboptimal results, especially in the fine-tuning regime. Indeed, removing parameters close to the end of training does not leave the network enough time to adjust its remaining weights and compensate for the loss. Additionally, schedules producing a gradual increase in sparsity, such as gradual and one-cycle, seem to provide better and more stable results.
By modifying the three schedule parameters, users can also create their own pruning schedule or easily implement other existing ones, such as the dense–sparse–dense (DSD) schedule [
20] for example, which increases the sparsity during the first half of training, then gradually decays it until the network is fully dense again. The corresponding schedule_function would be defined as in Listing 7.
Listing 7. Implementation of the dense–sparse–dense technique in FasterAI.
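Under the same simplified convention as above (training progress t in [0, 1] mapped to a fraction of the target sparsity), a possible dense–sparse–dense schedule_function could look as follows:

def dsd(t):
    # Ramp the sparsity up until mid-training, then back down to dense.
    return 2 * t if t <= 0.5 else 2 * (1 - t)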
By then modifying the values of start_pct and end_pct in the SparsifyCallback, we can further customize our pruning schedule, as displayed in
Figure 5. Such a schedule_function also shows that it is possible not only to use a schedule to perform sparsification, but also weight growing, i.e., start from a sparse network, and gradually allow zeroed-out weights to be retrained, creating new connections in the network.
2.5. Lottery Ticket Hypothesis
Recent studies have demonstrated that an optimal sparse network can be discovered right from the initialization of a neural network, i.e., without any training being required [
21,
22]. This particularity is named the lottery ticket hypothesis (LTH) and was empirically demonstrated for simple datasets and architectures [
21]. The optimal subnetwork is thus said to have “won” the initialization lottery and is consequently named the “winning ticket”. To generalize the concept to more complex cases, the authors had to weaken the hypothesis, extracting the optimal subnetwork not from initialization but after a few iterations of training. This generalized method is called the Lottery Ticket Hypothesis with Rewinding (LTHR), and the subnetworks found are named “matching tickets” [
23]. To discover such subnetworks, the authors proposed a five-step procedure, represented in
Figure 6 and detailed below:
Train a freshly initialized network (weights W_0) for t iterations and save its set of weights (W_t).
Continue the training until completion (W_T).
Apply a pruning mask m according to the desired sparsity level, granularity, context and criteria (m ⊙ W_T).
Reset the weights to their saved values, still applying the pruning mask (m ⊙ W_t).
Continue training and repeat the previous steps, each time updating the mask until the desired sparsity is achieved.
In the case of the original LTH experiment, the rewinding iteration t at which the set of weights is saved is equal to 0. For those experiments, the authors sparsified their network at the weight granularity, in a global context, using the ℓ1-norm criteria and an iterative schedule. FasterAI handles such LTH experiments by default, but allows them to be extended to any granularity, context, criteria and schedule, opening the way to many novel experiments on finding winning tickets. To accomplish such a procedure in FasterAI, some additional arguments can be provided to the SparsifyCallback:
lth: whether weights are reinitialized to their saved value after each pruning round.
rewind_epoch (defaults to 0): the training epoch at which the weight values are saved for later reinitialization.
reset_end: whether to reset the weights to their saved values after training.
The classic Lottery Ticket Experiments [
21,
23] can be performed with Listing 8.
Listing 8. Changes to the SparsifyCallback to perform lottery ticket experiments.
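A sketch of such a configuration is shown below, assuming, as before, a fastai Learner named learn; the sparsity level, criteria and schedule are illustrative choices, and the exact callback signature may differ from the released library.

# Original LTH: rewind to the initial weights (rewind_epoch=0, the default).
lth_cb = SparsifyCallback(sparsity=90, granularity='weight', context='global',
                          criteria=large_final, schedule=iterative,
                          lth=True, reset_end=True)

# LTH with rewinding (LTHR): rewind to the weights saved after the first epoch.
lthr_cb = SparsifyCallback(sparsity=90, granularity='weight', context='global',
                           criteria=large_final, schedule=iterative,
                           lth=True, rewind_epoch=1, reset_end=True)

learn.fit(30, cbs=lthr_cb)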
In
Table 5, we report the results obtained when performing the classic LTH and LTHR techniques using the same architecture and datasets as the previous experiments. Each pruning round is performed for 30 epochs and the rewind_epoch is set to 1 for LTHR. We can observe that, in our case, both techniques provide similar results. Additionally, results show that it is possible to find high-performing pruned networks, even for high sparsity levels.
2.6. Prune
As described previously, sparsification is usually introduced by applying a binary mask, multiplying the values to keep by 1 and those to remove by 0. This leads to a sparse network that is difficult to accelerate in practice. However, some particular granularities allow the sparse weights to be physically removed from the network, making it possible to benefit from the compression and obtain a speed-up without any dedicated hardware or software support. Two granularities allow such a removal: (1) filter and (2) shared kernel.
Once a filter is completely zeroed out, it can be removed from the network, leading to a dense but smaller architecture. There is one subtlety, however, as removing the zeroed filter is not enough for the architecture to remain operational. Removing a filter changes the output shape of the layer concerned, as there is one less feature map. This means that the following convolutional layer now receives an input with fewer channels and thus, in each of its filters, the kernel corresponding to the removed feature map has to be removed as well. As depicted in
Figure 7, removing a single filter in layer l results in the removal of its corresponding feature map and of the corresponding kernels in layer l+1. On the other hand, if we decide to zero out shared kernels, we perform the exact inverse operation: once a shared kernel is removed from layer l+1, the corresponding input feature map becomes useless and can also be removed. As a result, the corresponding filter in layer l can be removed as well.
Because it removes parameters that have no impact on the computed result, this pruning is considered lossless: it reduces the number of parameters and operations of the network without altering its performance. The code required to perform such an operation in FasterAI is given in Listing 9, where model has been sparsified beforehand according to the filter granularity.
Listing 9. Code required to prune a filter-sparse model.
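The entry point below is only a sketch of what such a call might look like; the Pruner name and its prune_model method are assumptions and may not match the released API exactly.

pruner = Pruner()
pruned_model = pruner.prune_model(model)  # returns a smaller, dense model with zeroed filters physically removed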
Such a technique is currently only available for strictly feed-forward architectures. Indeed, the implementation for architectures containing skip connections is not straightforward, as there is no longer an exclusive connection between a filter and its corresponding kernels in the next layer.
3. Distill
FasterAI also brings knowledge distillation [
24] capabilities to users with the help of its Distill module. Knowledge distillation methods are a set of techniques involving student–teacher-based training. In such a training, a large and performant model (the teacher) guides a small and less performant model (the student) in its learning process, as depicted in
Figure 8. Knowledge distillation is generally used to make a teacher provide information about its predictions, with a chosen loss encouraging the student to replicate those predictions. This loss is applied to the respective logits of the teacher and the student and is thus called the logits loss (L_logits). A teacher may also be used to provide information about intermediate computation states, e.g., activation maps; the loss responsible for incentivizing the student to replicate similar computation states is called the feature loss (L_feat). A total knowledge-distillation loss can then be interpolated from those two losses and the classic training loss (L_classif), e.g., the cross entropy between the student’s predictions and the data labels, with two interpolation parameters α and β, as

L_KD = α · L_logits + β · L_feat + (1 − α − β) · L_classif.
In FasterAI, this is managed by the KnowledgeDistillationCallback, which offers knowledge distillation capabilities in a single line of code. As knowledge distillation is managed by a separate callback, it can be used in conjunction with the SparsifyCallback, offering even more flexibility for extreme compression or for original experiments. The FasterAI usage of the KnowledgeDistillationCallback is given below in Listing 10, where layers_std and layers_tch are optional lists of layers that are used to compute the feature loss if desired.
Listing 10. Code required to perform knowledge distillation in FasterAI.
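A sketch of such usage is given below, assuming a fastai Learner named learn wrapping the student model and a trained teacher_model; the argument names follow the description above, but the exact signature may differ from the released library.

kd_cb = KnowledgeDistillationCallback(teacher=teacher_model,
                                      loss=SoftTarget,        # or Attention, for a feature-based loss
                                      layers_std=layers_std,  # optional: student layers used for the feature loss
                                      layers_tch=layers_tch)  # optional: matching teacher layers
learn.fit_one_cycle(30, cbs=kd_cb)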
Knowledge distillation losses can be modified or created according to the user’s needs. There are currently 3 logit losses and 4 feature losses available by default in FasterAI. We compare two of those losses in the same training conditions as the previous experiments. In this scenario, the teacher is a ResNet-34 model trained for 30 epochs from pretrained weights, and the student is a ResNet-18 model starting from random initialization. In particular, two distillation losses are compared for different values of the interpolation parameters: (1) SoftTarget, the loss computed between the logits of the teacher and the student, and (2) Attention, a loss computed from features extracted after each residual block of the teacher and the student. We report the results in Table 6. It can be observed that basing the knowledge distillation process on logits provides better results than attention. While SoftTarget compares the respective predictions of the teacher and the student, Attention relies on a stronger hypothesis, namely that the compared layers extract the same information, which can make it harder to set up correctly.
4. Regularize
The regularize module of FasterAI concerns regularization techniques that reduce the magnitude of groups of weights in the network, according to a chosen granularity. This technique is often called weight decay when applied at the granularity of individual weights. In practice, it adds a penalty term to the training loss. This penalty term acts as a regularizer, pushing the groups of weights toward values as small as possible during the optimization process. When the regularization penalizes weights according to their ℓ1-norm, it creates sparsity in the network. Eventually, this acts as a feature selection method, sparsifying some weights according to the desired granularity. However, as it depends on the optimization process, the sparsity level cannot be defined beforehand. It is nonetheless possible to control the importance of the penalty, and thus impose more or less sparsity in the final network, through a penalty factor λ. The final loss thus receives an extra term, adding the absolute value of the weights for each layer l, according to the chosen granularity, as

L = L_classif + λ Σ_l R(W_l),   with   R(W_l) = Σ_{g ∈ W_l} (1/G) Σ_{w ∈ g} |w|,

with L_classif as the classification loss, generally a cross-entropy computed between the predictions and the labels, and R as the regularization term, G being the number of elements in each group g. Such regularization can be applied in FasterAI by using the RegularizationCallback, according to a chosen granularity. This callback is presented in Listing 11.
Listing 11. Code required to perform group regularization in FasterAI.
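A sketch of such usage is given below; the penalty-factor argument name is an assumption, and the exact signature may differ from the released library.

reg_cb = RegularizationCallback(granularity='filter', lambda_=0.01)  # λ controls the strength of the group penalty
learn.fit_one_cycle(30, cbs=reg_cb)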
We provide the results of the experiments conducted for different values of λ in Table 7. As can be observed, a higher value of λ leads to a degradation in accuracy, as too much penalty is added to the loss value, making the optimization process put more emphasis on small-magnitude weights than on an accurate network. Moreover, we can see that, as opposed to sparsifying, regularization performs better for coarser granularities. This can be explained by the fact that the penalty value depends on the granularity structure, as the ℓ1-norm is averaged over the size of each block. This means that smaller structures are penalized more, with the regularization term driving the loss value, thus giving more importance to the ℓ1-norm of the weights than to the correct classification of the data.
6. Conclusions and Future Development
In this paper, we detail the FasterAI library, which provides a lightweight framework enabling quick and diverse experiments on neural network compression techniques. In particular, we present the four modules around which the library is developed: (1) sparsify, concerning techniques that introduce sparsity in neural networks; (2) distill, which concerns knowledge distillation techniques, helping a small model to reach higher performance; (3) regularize, providing capabilities to perform grouped weight decay; and (4) misc, gathering other compression techniques such as batch normalization folding or fully connected layer decomposition. For each technique available in FasterAI, we provide extensive proof-of-concept experiments, performed with a ResNet-18 trained on CALTECH-101, validating the different techniques and demonstrating the range of parameters available by default.
More than just a compression library, we believe that the way FasterAI was built lays solid foundations for an easier implementation of novel compression techniques. Indeed, its granular approach to implementing compression techniques makes it possible to seamlessly combine and customize them. Because it provides many default options, it will help enthusiasts apply compression techniques to their neural networks. Additionally, as demonstrated in this paper, because the implementation of novel techniques usually comes down to writing a single line of code, we hope that the library will help researchers in the field to create new compression techniques and to easily perform extensive experiments. We would like to continue developing FasterAI with the same philosophy in mind, striving for an increasingly flexible and convenient framework. We would also like to keep it up to date with new compression techniques, such as quantization [
25] and conditional computation [
26].