1. Introduction
The convolutional neural network (CNN) is one of the most popular networks in the field of deep learning, showing decent performance in various computer vision tasks, including classification [
1,
2], semantic segmentation [
3,
4], and action recognition [
5,
6]. CNNs were designed to extract two-dimensional features by taking structured data such as images as input and then processing them using convolutional operators [
7,
8]. Studies [
1,
9,
10] have shown that a larger number of layers results in increased receptive fields and, therefore, captures more detail of the image. Recent networks have achieved higher accuracy by increasing the filter bank [
11,
12,
13]. There have also been cases of improved block architecture that yielded higher accuracy without significantly increasing the size of the networks [
10,
14].
Scaling is a widely used technique to achieve better accuracy and numerous methods have been utilized to scale networks. Upscaling depth is the most prevalent method, although scaling models by image resolution is also becoming increasingly popular.
Figure 1 represents the depth and width of a network. EfficientNet [
15] was created by compound scaling MobileNets [
14,
16] and ResNet [
9] networks, i.e., scaling the network width, depth, and resolution by fixed coefficients. Network width refers to the number of filter banks in any layer of a network, depth refers to the number of layers in a network, and the resolution represents the number of pixels in the input image [
15]. Scaling is the increasing or decreasing of any of these factors, resulting in a change in size and accuracy. Upscaling, however, can leave the upscaled networks with configurations that are ill-suited to their tasks, because hand-crafting networks leads to human errors and consequent inaccuracies, resulting in an inefficient network. Filter pruning has long been considered a good alternative to accelerate deep neural networks, but it does not solve the core inefficiencies in the construction of the network. In other words, the accuracy gain diminishes as networks get larger because the parameters in a network differ in how strongly they affect its accuracy.
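As a rough illustration of compound scaling, the sketch below multiplies a baseline depth, width, and resolution by fixed base coefficients raised to a shared exponent. The function name `compound_scale`, the default coefficient values, and the baseline numbers are assumptions chosen for illustration; they are not values taken from this study.

```python
import math


def compound_scale(base_depth, base_width, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Scale depth, width and resolution together by one exponent phi.

    alpha, beta and gamma are assumed base coefficients for depth, width
    and resolution; phi = 0 returns the baseline unchanged.
    """
    depth = math.ceil(base_depth * alpha ** phi)
    width = math.ceil(base_width * beta ** phi)
    resolution = math.ceil(base_resolution * gamma ** phi)
    return depth, width, resolution


if __name__ == "__main__":
    # Hypothetical baseline: 16 layers, 32 channels, 224-pixel input.
    for phi in range(3):
        print(phi, compound_scale(16, 32, 224, phi))
```

Every factor grows with the same exponent, so a larger budget makes the network deeper, wider, and higher-resolution at once rather than along a single axis.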
Identifying these superfluous parameters in a network is crucial to the optimization process. The Lottery Ticket Hypothesis [
17] states that a randomly initialized dense neural network contains a subnetwork (called a winning ticket), which, when trained in isolation for at most the same number of iterations, can match the test accuracy of the original network. Automated machine learning (AutoML) [
18] automates steps in the machine learning pipeline. This concept has been applied to neural architecture search (NAS) [
19,
20,
21] to optimize the search for winning tickets within large networks. However, applying AutoML concepts for NAS has wider consequences that can be aggravated during the search stage. The No Free Lunch Theorem [
22] posits that no universal optimization algorithm consistently outperforms all other algorithms across every optimization problem. Therefore, it is necessary to optimize networks with respect to the tasks at hand.
EfficientNet [
15] showed that networks can be optimized by removing all channel connections in the depth-wise layer, and instead increasing the number of channels to boost capacity. This reduces the number of parameters, but significantly increases data movement, resulting in poor performance on hardware accelerators. The final scaling coefficients are determined by grid search. Grid search optimizes scaled networks by conducting a complete search over a given subset of the search space [
23]. However, grid search becomes very expensive when scaling larger networks. Therefore, scaling algorithms are primarily applied to small networks, and large networks are created by massively upscaling small networks. The proposed algorithm instead downscales large networks to generate slightly smaller networks. Because the larger networks are already designed for larger channel connections, downscaling them does not require increasing the number of channels. Thus, the generated network is similar in both size and accuracy to the large network it was derived from.
Evolutionary algorithms have been found to significantly outperform random and systematic search methods when searching in large search spaces [
24]. Over the years, various multi-objective evolutionary algorithms have been proposed with varying degrees of success [
25,
26]. However, they tend to suffer from a weak global search ability in low inter-task relevance problems [
27] because the cross-over operator is unable to distinguish between information and noise. This problem can be addressed by introducing multiple search strategies into the objective function to evolve an efficient network with high accuracy.
In this paper, we present a framework that automatically generates the optimal number of layers and channels for a network without manual intervention. We use evolutionary algorithms to search through the large sample space of possible subnetworks and find an optimized subnetwork that keeps the number of parameters low without compromising accuracy. Instead of upscaling networks, a collection of layers and channels is downscaled to find the optimal configuration. The evolved network is built to counteract the lack of expressiveness and effectiveness that is inherent to hand-crafted and grid-searched networks. Generated networks counteract these drawbacks by integrating pruning concepts into the creation of new networks, resulting in more efficient networks.
Our contributions are threefold:
We propose an algorithm to counter inefficiencies in subnetworks by evolving task-agnostic networks of ideal depth and width for a given architecture.
We create a framework to efficiently search through a large sample space of subnetworks to identify smaller networks without a major loss in accuracy.
We experimentally show the superiority of the network generated by the proposed method on publicly available pre-trained CNNs.
The remainder of this study is organized as follows: Section 3 briefly discusses the algorithm and its working concept, and Section 4 describes the results of experiments conducted on networks evolved using the EvolveNet algorithm. We also discuss the advantages of this algorithm in Section 4.4 and conclude this study in Section 5.
3. EvolveNet
The EvolveNet algorithm attempts to build new networks from scratch by evolving an ideal configuration of layers for a pre-defined architecture. There are four major steps to EvolveNet: (1) filter training to strengthen the individuality of layers, (2) depth evolution to find the ideal number of layers, (3) width evolution to compute the ideal width for each layer, and (4) retraining to fine-tune the evolved network.
The pre-built networks use the bottleneck blocks used by EfficientNet. The bottleneck blocks allow the network to reduce the number of parameters and, consequently, the number of floating-point operations. This makes the network more compact and efficient. The bottleneck operation consists of three operators: a linear transformation followed by a non-linear transformation, and then another linear transformation. Each bottleneck first expands a low-dimensional feature map into a high-dimensional feature map using a point-wise convolution. A depth-wise convolution then performs spatial filtering on the high-dimensional tensor. Finally, another point-wise convolution projects the spatially filtered map back down into a low-dimensional tensor. We change as little about the original network as possible, so that most of the changes are made by evolution and not by human intervention.
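To make the parameter saving of the bottleneck concrete, the following sketch counts the convolution weights of the expand, depth-wise, and project steps and compares them with a dense convolution applied directly at the expanded width. The helper names, the expansion ratio of 6, and the channel sizes are illustrative assumptions; biases and normalization parameters are ignored.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Number of weights in a bias-free 2D convolution."""
    return (c_in // groups) * c_out * kernel * kernel


def bottleneck_params(c_in, c_out, expand_ratio, kernel):
    """Weights of point-wise expand + depth-wise + point-wise project."""
    hidden = c_in * expand_ratio
    expand = conv_params(c_in, hidden, 1)                            # 1x1 expansion
    depthwise = conv_params(hidden, hidden, kernel, groups=hidden)   # one filter per channel
    project = conv_params(hidden, c_out, 1)                          # 1x1 projection
    return expand + depthwise + project


if __name__ == "__main__":
    # Hypothetical block: 24 -> 24 channels, expansion 6, 3x3 depth-wise kernel.
    print("bottleneck:", bottleneck_params(24, 24, 6, 3))
    print("dense 3x3 at expanded width:", conv_params(144, 144, 3))
```

The spatial filtering is done per channel, so its cost grows linearly with the expanded width instead of quadratically, which is where most of the saving comes from.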
3.1. Filter Training
An initial network $\mathcal{N}$ is constructed as a set of layers and filters whose configurations are ideal as found from existing network architectures as follows:
$$\mathcal{N} = \{l_1, l_2, \ldots, l_N\},$$
where $l_j$ represents each layer with channel and kernel sizes. $N$ is the total number of layers, with each group of layers consisting of $n$ layers. Using the weights $W$, the new collection of layers is derived into a pseudo network $\hat{\mathcal{N}}$, which is a parameterized subset of the layers in $\mathcal{N}$, i.e., $\hat{\mathcal{N}} \subseteq \mathcal{N}$. During each epoch of the training stage, random layers and filters are chosen to be trained, which omits certain layers and filters from $\mathcal{N}$, as shown in Figure 2. $\hat{\mathcal{N}}$ is trained and its cross-entropy loss is computed as follows:
$$\mathcal{L}(\mathcal{D}) = -\sum_{i} y_i \log p_i,$$
where $\mathcal{L}(\mathcal{D})$ represents the cross-entropy loss w.r.t. training data $\mathcal{D}$, and $y_i$ and $p_i$ are the truth label and softmax probability of class $i$ w.r.t. $\mathcal{D}$. This loss is smoothed as a consequence of cross-entropy and then backpropagated through every layer of $\mathcal{N}$, including the omitted layers, as follows:
$$W \leftarrow W - \eta \, \nabla_W \mathcal{L}(\mathcal{D}),$$
where $\eta$ is the learning rate, adjusted by the Lambda scheduler [49] to converge quickly and optimally. This trains the larger network, which will have layers that can efficiently be recalibrated into smaller networks composed of only a few of its layers without impacting the overall accuracy. Each layer contributes to the feature map without taking away from the feature map of the larger network, which is made up of other layers trained in a similar fashion.
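The two ingredients of filter training described above, choosing a random subset of layers each epoch and scoring the resulting pseudo network with a cross-entropy loss, can be sketched as follows. This is a minimal illustration only: the function names, the keep probability, and the toy logits are assumptions, and no actual network or backpropagation is included.

```python
import math
import random


def sample_active_layers(num_layers, keep_prob, rng):
    """Randomly choose which layers take part in this epoch (True = kept)."""
    mask = [rng.random() < keep_prob for _ in range(num_layers)]
    if not any(mask):
        # Keep at least one layer so the pseudo network is never empty.
        mask[rng.randrange(num_layers)] = True
    return mask


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Cross-entropy for a one-hot label: -log of the true-class probability."""
    return -math.log(softmax(logits)[true_class])


if __name__ == "__main__":
    rng = random.Random(0)
    for epoch in range(3):
        print("epoch", epoch, sample_active_layers(8, 0.5, rng))
    print("loss:", cross_entropy([2.0, 0.5, -1.0], 0))
```

Because a different mask is drawn every epoch, each layer is trained in many different contexts, which is the property the later evolution steps rely on.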
3.2. Evolving Depth
The filter-trained network, which is now a collection of layers that have been trained to work independently of the network, is used to evolve a recalibrated network with an ideal depth configuration. The configuration of each block, such as the number of out-channels, kernel sizes, and strides of each layer, remains the same as before. The depth of the recalibrated network, i.e., the set of layers chosen to be trained, is determined by depth encoding vectors (DEVs). A DEV is a vector generated using genetic algorithms that encodes the depth of the recalibrated network as the presence or absence of each layer. These DEVs generate networks with parameter sizes within preset constraints, and the reward computed for each generated network is assigned to its DEV. Since the filter-trained network is only a collection of layers strung together to build a network, the recalibrated network, whose layers are chosen by the DEV, shows which layers will be present in the network to be built.
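A DEV can be read as a presence/absence mask over the trained layers. The sketch below decodes such a mask into a list of layers and checks the result against a parameter budget; the function names, layer names, and parameter counts are hypothetical and serve only to illustrate the encoding.

```python
def decode_dev(dev, layer_names):
    """Return the layers whose gene is 1 (present) in the depth encoding vector."""
    if len(dev) != len(layer_names):
        raise ValueError("DEV length must match the number of candidate layers")
    return [name for gene, name in zip(dev, layer_names) if gene]


def within_budget(dev, layer_params, max_params):
    """Check that the layers selected by the DEV respect the parameter constraint."""
    total = sum(params for gene, params in zip(dev, layer_params) if gene)
    return total <= max_params


if __name__ == "__main__":
    # Hypothetical four-layer candidate pool.
    names = ["block1", "block2", "block3", "block4"]
    params = [100, 200, 300, 400]
    dev = [1, 0, 1, 1]
    print(decode_dev(dev, names), within_budget(dev, params, 800))
```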
The reward function computes the pre-train accuracy of the generated network. The networks are then evolved as chromosomes under the same preset constraints to maximize their rewards, as seen in Algorithm 1. The best chromosomes are chosen, then mutated and crossed over, and propagated to the next chromosome pool. Mutation is the genetic operation of flipping arbitrary genes in a parent chromosome to generate offspring. Cross-over is the genetic operation of combining the genetic information of two parent chromosomes to generate offspring. In EvolveNet, mutation is implemented by giving every gene on a random chromosome a 10% chance to be switched to a random new gene. Cross-over is implemented by choosing two random chromosomes and then selecting each gene from either parent with a 50% chance. Because not every genetically modified offspring is better than its parents, the subsequent chromosome pools are selected from the overall pool and not just the genetically modified pool.
After multiple chromosome pools are generated, the best chromosome is chosen and the network generated from it has the ideal depth configuration for a network block architecture for the given task.
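The evolutionary loop described in this subsection can be sketched as below, assuming binary genes, a 10% per-gene mutation chance, uniform cross-over, and selection from the combined pool of parents and offspring. The reward used here is a toy stand-in for pre-train accuracy, and all names and pool sizes are illustrative assumptions rather than the settings used in the experiments.

```python
import random


def mutate(chromosome, rng, rate=0.1):
    """Give each gene a `rate` chance of being replaced by a random binary gene."""
    return [rng.randint(0, 1) if rng.random() < rate else gene
            for gene in chromosome]


def crossover(parent_a, parent_b, rng):
    """Pick every gene from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def evolve(reward, num_genes, pool_size, num_selected, epochs, seed=0):
    """Evolve binary chromosomes and return the one with the highest reward."""
    rng = random.Random(seed)
    pool = [[rng.randint(0, 1) for _ in range(num_genes)]
            for _ in range(pool_size)]
    for _ in range(epochs):
        parents = sorted(pool, key=reward, reverse=True)[:num_selected]
        mutated = [mutate(rng.choice(parents), rng) for _ in range(pool_size)]
        crossed = [crossover(rng.choice(parents), rng.choice(parents), rng)
                   for _ in range(pool_size)]
        # Select from parents and offspring together, so good parents survive.
        pool = sorted(parents + mutated + crossed,
                      key=reward, reverse=True)[:pool_size]
    return pool[0]


def toy_reward(chromosome):
    """Toy stand-in for accuracy: reward depth, penalize exceeding 6 layers."""
    depth = sum(chromosome)
    return depth if depth <= 6 else -depth


if __name__ == "__main__":
    best = evolve(toy_reward, num_genes=10, pool_size=20,
                  num_selected=5, epochs=30)
    print(best, toy_reward(best))
```

Keeping the parents in the selection pool means the best reward never decreases from one pool to the next, which mirrors the reasoning given above for selecting from the overall pool.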
Algorithm 1 Algorithm for evolving depth
Hyperparameters: number of searching epochs $E_s$; number of fine-tuning epochs $E_f$; number of chromosomes in $C$, $n$; number of chromosomes selected for evolution, $k$
Input: $\mathcal{D}$: training images that can be split into batches $d$; $\mathcal{N}$: filter-trained network
Functions: reward($c$) computes the reward of the network created using $c$; mutation and crossover are evolutionary operations performed on a list of chromosomes; layers($c$, $\mathcal{N}$) converts $c$ into a list of layers using the trained network; create builds a network from a list of layers; $f$ trains $x$ using given data; ∇ computes the gradient of the loss of the trained network
Output: depth-evolved network $x$
    $C$ = list of $n$ random DEVs
    $S$ = {}
    for $e = 1$ to $E_s$ do
        $R$ = [$r_1$, $r_2$, …, $r_n$]
        for $i = 1$ to $n$ do
            $r_i$ = reward($c_i$)
        end for
        $S$.append($C$)
        sort $S$ in descending order of $R$
        $S_m$ = mutation($S$)
        $S_c$ = crossover($S$)
        $C$ = $S_m$ + $S_c$
        $S$ = $S$[:k]
    end for
    [$l_1$, $l_2$, …, $l_m$] = layers($S$[0], $\mathcal{N}$)
    $x$ = create([$l_1$, $l_2$, …, $l_m$])
    for $e = 1$ to $E_f$ do
        $x$ += ∇$f$($x$, $d$)
    end for
3.3. Evolving Width
The best DEV after depth evolution is used to build a recalibrated network for width evolution, as explained in Algorithm 1. The recalibrated network is then used to compute rewards for the networks derived from it using the width encoding vectors (WEVs). Unlike DEVs, each WEV is injectively mapped to a network of specific depth and layers of out-channels, i.e., the networks created by every WEV are unique to each element in it and also to the DEV it is evolved from. As shown in
Figure 3, 50 WEVs are built from the recalibrated network, and their rewards are computed. Each gene of a chromosome, represented by the WEV, is encoded with the ratio of channels of each layer compared to the original. The individual genes chosen do not matter, as this evolution is used to resolve the size of the final network, not the specific configuration. The size of each chromosome is dependent on the size of the recalibrated network. At every step before evolution, random chromosomes are used to generate sub-networks for the recalibrated network, and their reward is computed agnostic to the training dataset. The chromosomes with the highest rewards are mutated and crossed over, and the subsequent chromosome pool is created by selecting the best chromosomes from a pool of the parents and their offspring.
The chromosome with the highest reward from the final chromosome pool is used to derive the final network from the recalibrated network. The final derived network has had its depth and width evolved to be ideal for the given task and architecture.
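Since each WEV gene holds the ratio of channels kept in a layer relative to the original, applying a WEV amounts to scaling each layer's channel count. The sketch below illustrates this; rounding to a multiple of 8 and enforcing a minimum width are assumptions added for illustration, as are the function name and the example channel counts.

```python
def apply_wev(base_channels, wev, divisor=8):
    """Scale each layer's out-channels by its WEV ratio.

    Results are rounded to the nearest multiple of `divisor` and never fall
    below `divisor` (both are illustrative assumptions).
    """
    if len(base_channels) != len(wev):
        raise ValueError("WEV length must match the number of layers")
    scaled = []
    for channels, ratio in zip(base_channels, wev):
        width = int(channels * ratio / divisor + 0.5) * divisor
        scaled.append(max(divisor, width))
    return scaled


if __name__ == "__main__":
    # Hypothetical three-layer recalibrated network.
    print(apply_wev([32, 64, 128], [1.0, 0.5, 0.75]))
```

The length of the WEV follows the number of layers that survived depth evolution, which is why the size of each chromosome depends on the recalibrated network.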
3.4. Retraining
Once the final configuration of the network has been evolved, the derived network is retrained to achieve competitive accuracy with minimum parameters. The final network is a subnetwork; hence, according to Frankle et al. [17], it has to be trained to achieve similar accuracy. The number of retraining epochs is determined by the network whose architecture is used to build the initial network, i.e., EfficientNet [15]. The number of layers and their configurations are determined by the DEV, whereas the number of channels in each specific layer is determined by the WEV. Each unique pair of DEV and WEV thus generates a unique network. Thus, unlike in EfficientNet, where the network configuration was found using grid search, the channel configurations of the final network here are selected by evolutionary computation without manual influence.
4. Experiments
In this section, we demonstrate the superiority of networks generated by EvolveNet. We compare the results obtained with other major state-of-the-art models. Lastly, we discuss the effect of various hyperparameters on the proposed method.
4.1. Experimental Settings
The generated networks were trained on the ImageNet dataset [
50] for image classification tasks. The kernel sizes were selected after conducting ablation studies and the best results were found for kernels of sizes
and
. In other words, the ideal resolution was hand-crafted. The data augmentation involves random cropping after randomly resizing from
to
. The images are also randomly rotated and flipped to further augment the data. Experiments were conducted on four NVIDIA RTX A6000 GPUs with 40 workers and a batch size of 512.
4.2. Evaluation Protocol
We measured the Top-1 and Top-5 accuracies, as well as the number of parameters, to evaluate the evolved networks, and compared them to existing state-of-the-art networks. The primary aim of this experiment is to showcase the improvement in the performance and efficiency of a network after its depth and width have been evolved. Accuracy is the proportion of images that have been labeled correctly. For each image, the network computes the probability of it being classified into each label. Top-1 accuracy is the proportion of images for which the predicted label is the same as the actual label. Top-5 accuracy is the proportion of images for which the actual label is among the top five predictions. Fewer parameters result in a more streamlined and efficient network.
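The two accuracy measures defined above can be computed with the same routine by varying k. The following is a minimal sketch; the function name and the toy scores are illustrative and not taken from the experiments.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest predicted score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy predictions for three images over three classes.
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print(top_k_accuracy(scores, labels, 1), top_k_accuracy(scores, labels, 2))
```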
4.3. Experimental Results
We present the performance of different networks created using EvolveNet. The networks have been created by setting the parameter size as a constraint and generating networks with an ideal network configuration. The generated networks have been called EvolveNet-XS (Extra small EvolveNet), EvolveNet-S (Small EvolveNet), EvolveNet-M (Medium EvolveNet), and EvolveNet-L (Large EvolveNet). These networks are then compared with state-of-the-art methods of similar size.
4.3.1. Performance against Very Small Networks
A set of very small state-of-the-art networks, including DenseNet121 [
10], HRFormer-T [
51], EfficientNetB1 [
15], and EfficientNetV2B1 [
52], are selected for comparison with EvolveNet-XS.
Table 1 presents the result of the comparison with those networks. EvolveNet-XS shows Top-1 and Top-5 accuracy of 80.4% and 95.1%, respectively, with 7.8 M parameters. Overall, it shows competitive performance compared to other methods with the least number of parameters. It outperforms DenseNet121 and HRFormer-T by 5.4% and 1.9% in Top-1 accuracy. Additionally, it has 0.3 M fewer parameters compared to DenseNet121. EvolveNet-XS shows comparable performance to the EfficientNetB1 and EfficientNetV2B1. The difference in number of parameters between EfficientNetB1 and EvolveNet-XS is only 0.1 M. However, EvolveNet-XS shows a gain of 1.3% and 0.7% in Top-1 and Top-5 accuracies, respectively. It outperforms EfficientNetV2B1 by small margins of 0.6% and 0.1% in Top-1 and Top-5 accuracies, respectively, in spite of having 0.4 M fewer parameters.
4.3.2. Performance against Small Networks
Table 2 presents the results of the EvolveNet-S network compared to small networks, including EfficientNetB2 [
15], and EfficientNetV2B2 [
52]. EvolveNet-S shows Top-1 and Top-5 accuracy of 81.1% and 95.6%, respectively, with 8.6 M parameters. It shows competitive performance against those networks with the least number of parameters, outperforming LeViT-128 and ConViT-Ti+ by 1.5% and 4.4%, respectively, in Top-1 accuracy while using 0.2 M and 1.4 M fewer parameters. EfficientNetV2B2 has 10.2 M parameters with Top-1 and Top-5 accuracy of 80.5% and 95.1%, respectively. EvolveNet-S outperforms it by a small margin of 0.6% and 0.5% in Top-1 and Top-5 accuracy, respectively, while having 1.6 M fewer parameters. Similarly, EvolveNet-S outperforms EfficientNetB2 by 1.0% in Top-1 accuracy and 0.7% in Top-5 accuracy with 0.6 M fewer parameters.
4.3.3. Performance against Medium-Sized Networks
A set of medium-sized networks, including DenseNet169 [
10], TinyNet [
56], EfficientNetB3 [
15], and EfficientNetV2B3 [
52], are selected for comparison with the EvolveNet-M network.
Table 3 shows the results for EvolveNet-M compared to these networks. It shows Top-1 and Top-5 accuracy of 82.8% and 96.3%, respectively. Overall, it outperforms other networks by a decent margin, with fewer parameters. It outperforms SAMix ResNet-18, which has 0.4 M more parameters, by 10.5%. DenseNet169, with 14.3 M parameters, shows Top-1 and Top-5 accuracy of 76.2% and 93.2%, respectively. However, EvolveNet-M, with 3 M fewer parameters, outperforms it by a significant margin of 6.6% and 3.1% in Top-1 and Top-5 accuracy, respectively. Similarly, it outperforms TinyNet by 3.4% and 1.8% in Top-1 and Top-5 accuracy, respectively, despite having 0.6 M fewer parameters. EvolveNet-M shows comparable performance to EfficientNetB3 and EfficientNetV2B3 in Top-5 accuracy, outperforming them by a small margin of 0.6% and 0.5% while having 1 M and 3.2 M fewer parameters, respectively. Additionally, the Top-1 accuracy of EvolveNet-M is 1.2% and 0.8% higher than that of EfficientNetB3 and EfficientNetV2B3, respectively.
4.3.4. Performance against Large Networks
For the comparison of EvolveNet-L, a collection of larger networks with a significantly larger number of parameters is selected. These include Xception [
58], ConvNeXtTiny [
1], ConvNeXtSmall [
1], NASNetLarge [
13], and EfficientNetB4 [
15]. The results of EvolveNet-L compared to the above networks are presented in
Table 4. EvolveNet-L outperforms the other networks by a decent margin in Top-1 accuracy, with the least number of parameters. It shows Top-1 and Top-5 accuracy of 83.2% and 96.5%, respectively, with 17.6 M parameters. The performance of EfficientNetB4, with Top-1 and Top-5 accuracy of 82.9% and 96.4%, respectively, is comparable to that of EvolveNet-L. However, with 1.9 M fewer parameters, EvolveNet-L outperforms it by 0.3% and 0.1% in Top-1 and Top-5 accuracy, respectively. EvolveNet-L also has significantly fewer parameters than NASNetLarge. In spite of having 71.3 M fewer parameters, it outperforms NASNetLarge by 0.7% and 0.5% in Top-1 and Top-5 accuracy, respectively. Similarly, with 32.6 M fewer parameters, it outperforms ConvNeXtSmall by 0.9% in Top-1 accuracy. EvolveNet-L outperforms ConvNeXtTiny by a decent margin of 1.9% in Top-1 accuracy despite having 11 M fewer parameters. Xception has 22.9 M parameters and shows Top-1 and Top-5 accuracy of 79.0% and 94.5%, respectively. However, EvolveNet-L, with 5.3 M fewer parameters, outperforms it by a decent margin of 4.2% and 2.0% in Top-1 and Top-5 accuracy, respectively.
4.4. Discussion
We have experimentally shown that the networks generated by the depth and width encoding vectors evolved using the EvolveNet method consistently show better performance when compared to EfficientNet, while maintaining their architecture. The generated network can still be pruned using the same methods that are used on EfficientNet and other similar CNNs. Hence, it can be inferred that the architecture computed by evolution outperforms the architectures computed using the grid-search method. MobileNetV2 introduced inverted residuals and bottlenecks and improved the accuracy of MobileNetV1 using the new architecture. EfficientNet was an improvement on the MobileNetV2 architecture, where the accuracy of the network was improved by scaling the width, depth, and resolution of MobileNetV2 using grid search. By evolving networks with higher accuracy and efficiency, EvolveNet provides experimental evidence that hand-crafting and grid search are not ideal methods to build networks. Pruning algorithms have shown that a randomly initialized dense network contains multiple sub-networks with fewer parameters and comparable accuracies. However, most pruning algorithms limit themselves by trying to reduce the number of parameters. Since the proposed algorithm evolves networks to emphasize ideal configurations while maximizing rewards, the focus is placed on accuracy, and efficiency follows as a consequence. This allows for high accuracy on a relatively smaller network.
The generated network is also independent of the original network, but given the original network and the depth and width encoding vectors, the network can be regenerated. The generated network is injectively mapped to each DEV and WEV and the number of layers in the larger network. Therefore, changing any of these encoding vectors will significantly change the final network. The number of randomly generated layers, from which recalibrated networks are generated, is a hyperparameter used to control the size of the final network, but it has no other bearing on the evolution of the final network. This can be seen in
Table 5. Before evolving the width of the final network, the number of out-channels in each layer is equal to the number of out-channels in MobileNetV2. The structure is the same as that of each block in EfficientNet and MobileNetV2. There is one fixed-out channel for each block layer, but the number of additional blocks is determined by the network encoding vector.