Article

Zero-Shot Proxy with Incorporated-Score for Lightweight Deep Neural Architecture Search

Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3325; https://doi.org/10.3390/electronics13163325
Submission received: 2 July 2024 / Revised: 12 August 2024 / Accepted: 15 August 2024 / Published: 21 August 2024
(This article belongs to the Special Issue Towards Efficient and Reliable AI at the Edge)

Abstract

Designing a high-performance neural network is a difficult task. Neural architecture search (NAS) methods aim to automate this process. However, the construction of a high-quality accuracy predictor, which is a key component of NAS, usually requires significant computation. Therefore, zero-shot proxy-based NAS methods have been actively and extensively investigated. In this work, we propose a new efficient zero-shot proxy, Incorporated-Score, to rank deep neural network architectures instead of using an accuracy predictor. The proposed Incorporated-Score proxy is generated by incorporating the zen-score and entropy information of the network, and it does not need to train any network. We then introduce an optimal NAS algorithm called Incorporated-NAS that targets the maximization of the Incorporated-Score of the neural network within the specified inference budgets. The experiments show that the network designed by Incorporated-NAS with Incorporated-Score outperforms the previously proposed Zen-NAS and achieves a new SOTA accuracy on the CIFAR-10, CIFAR-100, and ImageNet datasets at a lightweight scale.

1. Introduction

Neural architecture search (NAS) is a well-known approach for automatically designing high-performance deep neural network architectures for various deep learning tasks, such as image recognition and language modeling. Most NAS approaches comprise two main components, namely a network generator and a performance predictor. The network generator suggests prospective high-quality models, while the performance predictor estimates the accuracy of these proposed networks.
There are three commonly used generators, namely uniform sampling [1], reinforcement learning [2], and the evolutionary algorithm [3]. The three main approaches to building performance predictors are brute-force [3,4,5,6], predictor-based [2,7,8], and one-shot methods [5,9,10,11,12,13,14,15,16,17].
Due to the huge computational overhead, constructing an accuracy predictor that provides good suggestions at a reasonable cost is a significant challenge. For instance, a considerable number of networks must be trained in both brute-force and predictor-based approaches. To alleviate these issues, one-shot methods propose parameter-sharing strategies such as DARTS [9], SNAS [13], PC-DARTS [10], ProxylessNAS [14], GDAS [15], FBNetV2 [16], DNANet [18], and Single-Path One-Shot NAS [1]. Although these one-shot methods are more efficient than previous methods, they still require training a large supernet at high computational cost. Another disadvantage is the degradation of accuracy and predictor quality owing to model interference [17,19,20] in almost all supernet-based methods. Moreover, searching for large target networks under resource limitations is challenging because the supernet needs to be significantly larger than the target network. Therefore, designing high-performance networks using one-shot methods is difficult.
To address these issues, in recent works, zero-cost predictors have been proposed instead of expensive predictors for the search process. The key idea behind these methods is to leverage the expressiveness of the deep neural network as a proxy that is positively correlated with the network’s performance [21,22,23,24,25,26].
In this paper, we propose a new optimal zero-shot proxy, called Incorporated-Score, that measures a network's expressiveness for lightweight NAS. The proposed method is inspired by two lines of deep learning research. First, recent breakthroughs in deep learning research [27,28,29,30,31,32,33,34,35,36,37] show that the advantages of deep models over shallow models come from their greater expressiveness, even when the number of neurons is the same. The number of linear regions is known to be a good metric for estimating the performance of a neural network. However, Lin et al. [25] showed that directly counting the linear regions is infeasible for a large network. Using Gaussian complexity, they proposed the zero-shot proxy zen-score, which takes advantage of both the distribution of linear regions and the linear classifier's Gaussian complexity. Second, based on information theory in deep learning research [26,38,39,40,41,42,43], the differential entropy of a network measures its expressiveness. A deep neural network's entropy depends on several factors, such as the number of channels, the kernel size, and the group number. These are two different approaches to generating a high-quality neural network. Inspired by the two theories mentioned above, we use the zen-score [25] and the entropy of the network [26] as inputs to compute the network's expressiveness. This approach enables our proxy to account for both the distribution of linear regions and the coefficient matrix under entropy optimization. However, the conventional approach of simply summing the zen-score and the entropy of the network [26] lets one term dominate the other. Therefore, we propose an efficient method, called Incorporated-Score, to incorporate the two factors. The proposed approach employs neural architecture search (NAS) to discover high-quality networks under diverse model inference measures, including floating-point operations (FLOPs) and the number of model parameters. The achieved performance sets a new benchmark, outperforming Zen-NAS on CIFAR-10/CIFAR-100/ImageNet and establishing state-of-the-art (SOTA) results at a lightweight scale. In particular, our work achieves top-1 accuracy of 96.86% and 81.1% on the CIFAR-10 and CIFAR-100 datasets, respectively, with less than 1 M params, outperforming ZenNet-cifar-1M (96.2%/80.1%). Within 5 GPU hours, the lightweight EZenNet designed by Incorporated-NAS peaks at 80.1% top-1 accuracy on ImageNet-1K with only 800 M FLOPs.
The contributions of this study are summarized as follows:
  • We propose a zero-cost proxy for NAS called Incorporated-Score. It leverages zen-score and entropy factors to build an efficient proxy for ranking networks. The entropy of the network serves as an auxiliary optimization factor to the zen-score in the proposed method.
  • EZenNet, which is designed by Incorporated-Score, outperforms the baseline SOTA ZenNet, which is designed by Zen-NAS, and it achieves a new SOTA on the CIFAR-10/CIFAR-100/ImageNet datasets at a lightweight scale.

2. Related Works

2.1. Information Theory in Deep Learning

Information theory is known as a robust method for investigating complicated systems such as machine learning and deep learning networks. The maximum entropy principle [38,39] is one of the most popular principles applied in this field of research. Numerous studies [41,43,44,45,46] have sought the relationship between the entropy factor and neural network architecture. For instance, Chan et al. [44] attempted to clarify the learning capabilities of deep neural networks by reducing the subspace entropy. Saxe et al. [46] tried to understand the power of bottlenecks in deep neural network architectures by investigating the entropy distribution and the outflow of information. Yu et al. [43] introduced the principle of maximal coding rate reduction for optimization, while Sun et al. [41] proposed high-performance object detection networks by maximizing the entropy of a multi-scale feature map. Additionally, a monograph [45] examined the mutual information among different neurons in the multilayer perceptron model.
DeepMAD [26] considered the entropy of the model rather than focusing on coding rate reduction as in [44]. Its authors also experimentally verified that solely maximizing entropy, as in [41], is insufficient.

2.2. Neural Architecture Search (NAS)

The goal of NAS is to automatically generate a neural architecture that delivers optimal performance using limited computing resources and minimal human intervention. Early methods used brute force to find the most accurate network structure by training candidate models for accuracy. The evolutionary algorithm (EA) and reinforcement learning (RL) are popular generators or samplers in NAS. AmoebaNet [3] applied EA to conduct a neural network search on the CIFAR-10 dataset and then transferred the architecture to the ImageNet dataset. However, it took approximately 3150 GPU days to search and reach more than 80% top-1 accuracy on the ImageNet dataset. Subsequently, several EA-based NAS algorithms, such as CARS [12], EcoNAS [11], PNAS [47], and GeNet [5], have been proposed to enhance search capacity by using downsampled images or by decreasing the query count. Other researchers used RL as a generator in NAS, including MnasNet [48], NASNet [49], and MetaQNN [6]. However, both RL and EA methods consume hundreds of GPU days or more, which is a heavy burden for researchers due to the extensive computational cost. Several one-shot methods have been proposed to address these problems. Some works train predictor architectures to estimate the accuracy of networks [2,8]. By training a single large supernet with shared weights, one-shot methods decrease the training cost of NAS. Therefore, they are commonly utilized in efficient NAS works, such as SNAS [13], DARTS [9], ProxylessNAS [14], PC-DARTS [10], single-path one-shot NAS [1], GDAS [15], FBNetV2 [16], and DNANet [18]. OFANet [17] showed that weight sharing is hindered by model interference, which leads to a significant drop in the accuracy of the target model, and proposed a progressive shrinkage strategy to address this issue. It took approximately 52 GPU days to construct OFANet, which achieved 80% top-1 accuracy on ImageNet. Another high-quality network constructed by NAS is EfficientNet [50], which achieved 84.4% accuracy on ImageNet after searching for 3800 GPU days. Although the above efforts have significantly decreased search costs, one-shot methods still spend many GPU days searching for neural networks.
Recently, several works have actively explored zero-shot proxies for efficient NAS, including [21,23,24,25]. Abdelfattah et al. [21] evaluated the effectiveness of several zero-shot pruning-based expressiveness methods, among which synflow [22] achieves the highest performance. Pruning proxy methods prune parameters using a saliency metric that estimates the change in loss or gradient norm when a parameter of the network is pruned. In synflow, the loss is computed as a product of all parameters in the network, whereas previous proxies need a mini-batch of training data and a cross-entropy loss. Although these proxies achieve moderate accuracy on the CIFAR-10 dataset, their top-1 accuracy needs further improvement on the CIFAR-100 dataset. Another work, training-free neural architecture search (TE-NAS) [23], ranks architectures by considering two factors of the deep neural network: the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space. However, a limitation of this method is that directly calculating the number of linear regions becomes impractical when the neural network is too large. This work achieved 74.1% top-1 accuracy on ImageNet, which is still far behind the current state-of-the-art baselines. NASWOT [24] achieves network performance similar to TE-NAS on the CIFAR-10 and CIFAR-100 datasets with greater time efficiency by utilizing the kernel matrix of binary activation patterns. Zen-NAS [25] proposes a zero-shot proxy, the zen-score, that requires only a few forward inferences on randomly initialized networks with random Gaussian inputs. To overcome the limitation of directly counting the number of linear regions in a large deep network, it takes advantage of both the distribution of linear regions and the power of the coefficient matrix in each region via the Gaussian complexity of the input. Zen-NAS achieves more robust performance than TE-NAS and NASWOT in terms of both search time and network performance, and it reports a zero-cost-proxy NAS that surpasses the performance of training-based NAS studies on ImageNet.

3. Preliminary

In this section, we briefly introduce the notations used in this paper, including the L-layer neural network and the vanilla convolutional neural network (VCNN). Moreover, we present the entropy of the MLP model and the control of the over-deep network issue in MLPs.

3.1. L-Layer Neural Network Notation

A neural network with a depth of M layers is defined as a function $f: \mathbb{R}^{d_0} \to \mathbb{R}^{d_M}$, where $d_0$ and $d_M$ are the input and output dimensions, respectively. The input image and the t-th layer's feature map are represented by $x_0 \in \mathbb{R}^{d_0}$ and $x_t$, respectively. For each t-th layer, the number of input channels is denoted by $d_{t-1}$, while the number of output channels is denoted by $d_t$; $\theta_t \in \mathbb{R}^{d_t \times d_{t-1} \times k \times k}$ refers to the convolutional kernel. The image resolution and the mini-batch size are denoted by $H \times W$ and B, respectively.

3.2. Vanilla Convolutional Neural Network

The VCNN is a commonly applied prototype in theoretical studies [30,35,36]. The main body of a vanilla network consists of multiple convolutional layers, each followed by a ReLU activation. All additional components, such as residual links and batch normalization layers, are removed from the main network. The resolution of the feature map is reduced to $1 \times 1$ by a global average pooling (GAP) layer following the backbone. After that, a fully connected layer is attached. Finally, the features of the neural network are mapped to the target distribution by applying the softmax function. The feature map $f(x|\theta)$, with input $x$ and network parameters $\theta$, is the output of the main body of the network before the GAP layer. Network expressivity is measured on the pre-GAP feature map because most of the necessary information is retained there. Recently, some theoretical studies have been conducted on the expressivity of deep networks [35,37]. An important insight gleaned from these studies is that any standard network can be viewed as an ensemble of piecewise linear convex polytopes $\mathcal{U} = \{U_1, U_2, \ldots, U_{|\mathcal{U}|}\}$, where $|\mathcal{U}|$ represents the number of linear regions, as follows:
Lemma 1
([27,33]). At the t-th layer, the activation pattern is denoted as $\mathcal{G}_t(x)$. Then, any standard network $f(\cdot)$ can be written as follows:
$$f(x|\theta) = \sum_{U_i \in \mathcal{U}} \mathbb{I}_x(U_i)\, W_{U_i} x$$
where each convex polytope $U_i$ is determined by $\{\mathcal{G}_1(x), \mathcal{G}_2(x), \ldots, \mathcal{G}_M(x)\}$; $\mathcal{U}$ is a finite set of convex polytopes in $\mathbb{R}^{d_0}$; $\mathbb{I}_x(U_i) = 1$ if $x \in U_i$ and zero otherwise; and $W_{U_i} \in \mathbb{R}^{d_M \times d_0}$ is a coefficient matrix.
In other words, a standard network can be construed as a set of piecewise linear functions that depend on activation patterns. Therefore, the abovementioned studies used the number of linear regions $|\mathcal{U}|$ as an expressivity proxy for the VCNN family. However, counting $|\mathcal{U}|$ is computationally infeasible for large networks. Moreover, directly using $|\mathcal{U}|$ does not take advantage of the representational power of each coefficient matrix $W_{U_i}$ in Lemma 1.
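As a quick sanity check of Lemma 1, the following small PyTorch snippet (our own illustration, not from the cited works) builds a tiny bias-free ReLU network and verifies that, inside the linear region containing a given input, the network acts exactly as a fixed linear map whose coefficient matrix is the input Jacobian:

```python
import torch
import torch.nn as nn

# Tiny bias-free ReLU network: within each activation pattern (linear
# region U_i) it reduces to a fixed linear map W_{U_i}, as in Lemma 1.
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 8, bias=False), nn.ReLU(),
                    nn.Linear(8, 3, bias=False))

x = torch.randn(2)
# The Jacobian at x is the coefficient matrix W_{U_i} of the region containing x.
W = torch.autograd.functional.jacobian(net, x)
print(torch.allclose(net(x), W @ x, atol=1e-6))   # True: f(x) = W_{U_i} x on U_i
```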

3.3. Entropy of MLP Models

Entropy is used as a measure of the expressiveness of a deep network [41,44]. Suppose that in an L-layer MLP $f(\cdot)$, $w_i$ and $w_{i+1}$ are the input and output widths of the i-th layer, respectively. The output $x_{i+1}$ is given by $x_{i+1} = M_i x_i$, where $M_i \in \mathbb{R}^{w_{i+1} \times w_i}$ denotes a trainable weight matrix. According to the entropy analysis reported in [44], the entropy of the MLP model $f(\cdot)$ is expressed as follows:
$$H_f = w_{L+1} \sum_{i=1}^{L} \log(w_i),$$
where $H_f$ is the upper bound of the normalized Gaussian entropy of the MLP $f(\cdot)$. In [26], it was shown that simply maximizing the entropy $H_f$ to determine the best expressiveness of a network results in an over-deep network problem. This is because entropy increases exponentially faster with depth than with width, as shown in Equation (2). An excessively deep network is challenging to train because effective information propagation is hindered [45]. This raises the question of how to address this problem, which is discussed in Section 3.4.
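For illustration, the sketch below (ours; the function and variable names are hypothetical) evaluates Equation (2) for a list of layer widths and shows how much faster the bound grows when layers are added than when layers are widened:

```python
import math

def mlp_entropy(widths):
    """Upper bound on the normalized Gaussian entropy of an MLP (Eq. (2)).

    widths = [w_1, ..., w_L, w_{L+1}]: the input widths of the L layers
    followed by the final output width w_{L+1}.
    """
    w_out = widths[-1]                                  # w_{L+1}
    return w_out * sum(math.log(w) for w in widths[:-1])

# Adding depth raises the entropy bound much faster than adding width:
print(mlp_entropy([256] * 4 + [128]))   # 4 hidden layers of width 256
print(mlp_entropy([256] * 8 + [128]))   # twice as deep  -> bound doubles
print(mlp_entropy([512] * 4 + [128]))   # twice as wide  -> only +log(2) per layer
```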

3.4. Effectiveness of Controlling Extremely Deep Networks in MLP

A network that hinders effective information propagation can be considered chaotic. This phenomenon is known as the over-deep network problem [45]. Specifically, in an over-deep network with randomly initialized weights, a small perturbation in the first layers can result in an exponentially large perturbation in the output of the final layers. During backpropagation, the gradient flow does not effectively propagate through the entire network. Hence, training a network becomes difficult when it is over-deep. In [45], the metric termed network effectiveness was proposed as follows:
$$\rho = L / w,$$
where L is the number of MLP layers and the width of each layer, w, is assumed to be uniform. The MLP behaves as a single-layer linear model if $\rho \to 0$. In contrast, it becomes a chaotic system if $\rho \to \infty$. Equation (3) assumes that the MLP has a uniform width, whereas in practice, the width $w_i$ of each layer may differ. To address this, Shen et al. [26] proposed using the average width of the MLP in Equation (3), defined as follows:
$$\bar{w} = \left(\prod_{i=1}^{L} w_i\right)^{1/L} = \exp\left(\frac{1}{L}\sum_{i=1}^{L} \log w_i\right).$$
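The following short sketch (our illustration) computes the effectiveness $\rho$ of Equation (3) using the geometric-mean width of Equation (4), so that layers of different widths are handled:

```python
import math

def average_width(widths):
    """Geometric-mean width of an L-layer MLP (Eq. (4))."""
    return math.exp(sum(math.log(w) for w in widths) / len(widths))

def effectiveness(widths):
    """Effectiveness rho = L / w_bar (Eq. (3) with the averaged width)."""
    return len(widths) / average_width(widths)

# A deep, narrow MLP has a much larger rho (closer to the chaotic regime)
# than a shallow, wide one.
print(effectiveness([64] * 16))    # 0.25
print(effectiveness([512] * 8))    # 0.015625
```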

4. Method

The overall workflow of the proposed Incorporated-NAS is represented in Figure 1. There are two main components, which are the architecture generator and the accuracy predictor. As shown in Figure 1, we apply an evolutionary algorithm (EA) in the architecture generator, which generates the architecture from the search space ( A ). Then, each generated network is evaluated by the accuracy predictor based on the proposed zero-cost proxy, which is Incorporated-Score. The proposed Incorporated-Score proxy is a key component of the accuracy predictor in scoring the networks without any training in NAS.
The problem of measuring the expressivity of a deep neural network and its positive correlation with model accuracy plays an important role in NAS. In this section, we briefly describe the expressivity of the VCNN using zen-score. The original paper [25] that proposed zen-score contains many more details and in-depth specifications. Then, we briefly introduce the efficient maximization of entropy as the expressivity of a CNN. Finally, we present our methods, consisting of the effectiveness evaluation of Incorporated-Score as a new zero-cost proxy and Incorporated-NAS, which optimizes the proposed score in the search step.

4.1. Zen-Score

Based on the Gaussian complexity [51] of a linear classifier, Lin et al. [25] proposed a new expressivity measure for a network that can be efficiently estimated by its expected Gaussian complexity, represented by the Ψ score as follows:
$$\Psi(f) = \log \mathbb{E}_{x,\theta} \left\{ \sum_{U_i \in \mathcal{U}} \mathbb{I}_x(U_i) \left\| W_{U_i} \right\|_F \right\}$$
$$= \log \mathbb{E}_{x,\theta} \left\| \nabla_x f(x|\theta) \right\|_F$$
Specifically, $x$ and $\theta$ are randomly sampled from prior distributions, and the average of $\|W_{U_i}\|_F$ is then taken. Following [25], we sample the input $x$ and the parameters $\theta$ from the Gaussian distribution $\mathcal{N}(0,1)$. This corresponds to computing the expected gradient norm of f with respect to the input $x$. It is crucial to highlight that the Ψ score considers only the gradient with respect to $x$, not the gradient with respect to $\theta$.
However, Lin et al. [25] showed that directly computing the Ψ score for a very deep network without BN layers incurs numerical overflow due to gradient explosion. This problem could be resolved by adding BN layers, but doing so introduces a scale-sensitivity issue known from deep learning complexity analysis [52,53]. To address these issues, the Ψ score is rescaled by the product of the BN layers' variance statistics. To differentiate it from the Ψ score, the new score is defined as the zen-score.
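As a concrete illustration, the following PyTorch-style sketch (ours) estimates the Ψ score of Equation (6) with the two-forward-pass finite-difference approximation mentioned in Section 4.3; it omits the BN-variance rescaling that turns Ψ into the full zen-score, and the toy VCNN and all names are placeholders rather than the released Zen-NAS code:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def psi_score(model, in_ch=3, resolution=32, batch=16, repeats=8, alpha=1e-2):
    """Finite-difference estimate of Psi(f) in Eq. (6): the expected Frobenius
    norm of the input-gradient of the pre-GAP feature map, approximated with
    two forward passes on Gaussian inputs.  The full zen-score additionally
    rescales by the BN layers' variance statistics (not shown here)."""
    deltas = []
    for _ in range(repeats):
        x = torch.randn(batch, in_ch, resolution, resolution)
        eps = torch.randn_like(x)
        d = model(x + alpha * eps) - model(x)   # two feed-forward inferences
        deltas.append(d.norm() / alpha)
    return torch.log(torch.stack(deltas).mean()).item()

# Toy VCNN body: plain conv + ReLU stack, no residual links (Section 3.2).
vcnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
print(psi_score(vcnn))
```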

4.2. Maximizing Entropy as Expressivity of Network

In this subsection, we briefly describe the definition of entropy and how its effectiveness is generalized from an MLP to a CNN as in [26].

4.2.1. Entropy of CNN

Suppose that the i-th CNN layer is defined by the number of input channels $c_i$, the number of output channels $c_{i+1}$, the kernel size $k_i$, and the group number $g_i$. The operator of this layer corresponds to a matrix multiplication with $W_i \in \mathbb{R}^{c_{i+1} \times c_i k_i^2 / g_i}$. Compared with MLP feature vectors, CNN feature maps have an additional dimension, the resolution $r_i \times r_i$ of the i-th layer. The entropy of a CNN is defined as follows [26]:
$$H_L \propto \log(r_{L+1}^2 c_{L+1}) \sum_{i=1}^{L} \log(c_i k_i^2 / g_i).$$
In Equation (7), inspired by [54], using logarithms provides a more effective formulation of the ground-truth entropy for natural images.
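The sketch below (ours) evaluates the CNN entropy of Equation (7), as reconstructed above, for a hypothetical stack of convolutional layers; the proportionality constant is dropped:

```python
import math

def cnn_entropy(layers, r_out, c_out):
    """CNN entropy as in Eq. (7), up to a constant factor.

    layers: list of (c_i, k_i, g_i) tuples (input channels, kernel size, groups).
    r_out, c_out: resolution and channel count of the final feature map.
    """
    width_term = sum(math.log(c * k * k / g) for c, k, g in layers)
    return math.log(r_out * r_out * c_out) * width_term

# Hypothetical 3-layer stage: 3 -> 32 -> 64 -> 64 channels, 3x3 kernels, no groups.
print(cnn_entropy([(3, 3, 1), (32, 3, 1), (64, 3, 1)], r_out=8, c_out=64))
```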

4.2.2. Effectiveness Entropy for Scoring Networks

Some studies have used entropy as an expressivity measure to find high-performance CNN models. However, model performance does not always increase with entropy (see Figure 2 in [26]). Specifically, the positive relationship no longer holds for over-deep networks. Therefore, Shen et al. [26] proposed using the effectiveness $\rho$ to control this issue, as described in Section 3.4. To find the best network by maximizing its entropy, they formulated the optimization as a mathematical programming problem. The final score is defined as
$$E_{score} = \sum_{i=1}^{M} \alpha_i H_i - \beta Q,$$
where the L-layer CNN model $f(\cdot)$ has M stages; $H_i$ denotes the entropy of the i-th stage, defined as in Equation (7); and $\alpha_i$ and $\beta$ are hyperparameters. The weights of the entropy at different scales are denoted as $\{\alpha_i\}$. $E_{score}$ is maximized by finding $w_i^*$ and $L_i^*$ under given constraints, such as the inference budget (FLOPs, params) and $\rho_n \leq \rho_0$. According to Equations (3) and (4), the effectiveness $\rho_n$ of a CNN is expressed as follows:
$$\rho_n = L \cdot \left(\prod_{i=1}^{L} w_i\right)^{-1/L},$$
where each CNN layer's width is denoted as $w_i = c_i k_i^2 / g_i$ and $L_i$ is the depth of the i-th stage for $i \in \{1, 2, 3, \ldots, M\}$. The effectiveness of the network is controlled by $\rho_0$, whose value is typically within the range $[0.1, 2.0]$. Q penalizes the objective function if the depth distribution is nonuniform across stages, as follows:
$$Q \propto \exp\left[\mathrm{Var}(L_1, L_2, \ldots, L_M)\right].$$
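To make Equations (8)–(10) concrete, here is a small sketch (ours; the use of the population variance for Var is our assumption) that scores a hypothetical M-stage network from its per-stage entropies and depths:

```python
import math
import statistics

def e_score(stage_entropies, stage_depths, alphas=None, beta=10.0):
    """E_score = sum_i alpha_i * H_i - beta * Q (Eq. (8)), with
    Q = exp(Var(L_1, ..., L_M)) penalizing nonuniform stage depths (Eq. (10))."""
    if alphas is None:
        alphas = [1.0] * len(stage_entropies)
    q = math.exp(statistics.pvariance(stage_depths))
    return sum(a * h for a, h in zip(alphas, stage_entropies)) - beta * q

# Hypothetical 3-stage network: uniform stage depths incur the smallest penalty.
print(e_score([40.0, 55.0, 70.0], [4, 4, 4]))   # Q = exp(0) = 1
print(e_score([40.0, 55.0, 70.0], [2, 4, 6]))   # larger Q, lower score
```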

4.3. Effectiveness of the Proposed Incorporated-Score as a Proxy for NAS

We observed that using $E_{score}$ as the expressivity of the network, which is grounded in information theory for deep learning, can optimize high-performance CNNs at the same scale as ResNet. The key idea is to maximize the network entropy while maintaining network effectiveness, controlled by a small constant, to avoid generating extremely deep networks. However, the method proposed in [26] also has several limitations; for example, its three empirical guidelines are not based on a strong theoretical foundation.
Zen-NAS [25] uses the gradient norm with respect to an input image as the network's expressiveness score and approximates this metric with two feed-forward inferences for classification. In particular, it works efficiently for finding lightweight networks for the classification task. Moreover, this metric captures the power of the network's weights and the number of linear regions. Entropy is a different, information-theoretic approach based on the channel number, kernel size, and group number. We take advantage of entropy as an additional optimization factor alongside the zen-score to rank neural networks for image classification tasks.
First, the simplest idea is to take a weighted sum of the zen-score and the effectiveness entropy ($E_{score}$). With this naive combination, the generated network performance decreases dramatically compared to the original Zen-NAS. We investigated this issue by analyzing the values of the component scores in our proxy during the search. Interestingly, we observed that the value of $E_{score}$ is much higher than the zen-score value, as shown in Figure 2, which biases the search for a high-performance network. In particular, $E_{score}$ lies in the range $[0, 8000]$, whereas the maximum value of the zen-score is approximately 120. Second, we drastically reduced the entropy weight to $10^{-4}$ and increased the zen-score weight to 0.9999. The generated network performance then returned to the same level as the original Zen-NAS, showing that $E_{score}$ needs to be normalized to the same level as the zen-score. Therefore, we propose an efficient normalization of $E_{score}$ to address this dominance issue. The effective Incorporated-Score is defined as follows:
$$\text{Incorporated-Score} = w_z \cdot Z_{score} + w_d \cdot F(E_{score}),$$
where $Z_{score}$ denotes the zen-score; $F(\cdot)$ is the normalization function, for which we choose the logarithm or square-root function to normalize the value of $E_{score}$; and $w_z$ and $w_d$ are hyperparameters that balance the weights between the zen-score and the normalized $E_{score}$, respectively, with constraints $0 < w_z < 1$, $0 < w_d < 1$, and $w_z + w_d = 1$. We consider the logarithm and square-root functions as normalization functions because both are monotonic and yield positive values for the large $E_{score}$ magnitudes encountered here.
In our experiments, we set the weights of the entropies to $\alpha_i = 1$ for each scale to allow comparison with the original Zen-NAS. This is because the entropy works as an additional optimization factor alongside the zen-score, and our architecture generator is the same as that reported in [25]. We set $\beta = 10$, as suggested in [26].
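Putting the pieces together, a minimal sketch of Equation (11) (ours, using the weight pairs selected later in Section 5.2) shows how the sqrt/log normalization brings $E_{score}$ down to the scale of the zen-score before the weighted sum:

```python
import math

def incorporated_score(zen_score, e_score, w_z=0.8, w_d=0.2, norm="sqrt"):
    """Incorporated-Score = w_z * Z_score + w_d * F(E_score)  (Eq. (11)),
    where F is a log or square-root normalization and w_z + w_d = 1."""
    assert 0 < w_z < 1 and 0 < w_d < 1 and abs(w_z + w_d - 1.0) < 1e-9
    f = math.sqrt(e_score) if norm == "sqrt" else math.log(e_score)
    return w_z * zen_score + w_d * f

# Without normalization, an E_score of a few thousand would swamp a zen-score
# of ~100; after sqrt or log, both terms live on a similar scale.
print(incorporated_score(110.0, 6400.0, 0.8, 0.2, "sqrt"))  # 0.8*110 + 0.2*80 = 104
print(incorporated_score(110.0, 6400.0, 0.9, 0.1, "log"))
```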

4.4. Incorporated-NAS with Optimized Incorporated-Score

We propose the Incorporated-NAS algorithm, which maximizes the Incorporated-Score of the neural network. Incorporated-NAS is based on Zen-NAS and additionally computes the efficiency scores ($\rho_n$ and $E_{score}$) during the search stage. The evolutionary algorithm is used as the architecture generator in Incorporated-NAS; other generators, such as greedy selection or reinforcement learning, could also be chosen. The sequential process of Incorporated-NAS is given in Algorithm 1.
First, N structures are randomly generated as a population P. The structure with the highest Incorporated-Score is the output of Incorporated-NAS after T evolutionary iterations. Specifically, at each iteration step t, a network in the population P is mutated to generate a new architecture, as shown in Algorithm 2. In the mutation stage, the depth and width of the chosen layer are changed within the mutation ratio range, which is $[0.5, 2.0]$, as in [25]; that is, the new value is obtained by scaling the current value by a factor between 0.5 and 2.0. A new child structure $\hat{A}_t$ is added to the population if its inference cost is within the given budget, as shown in row 5 of Algorithm 1. In addition, to avoid generating overly deep structures, the depth of the network is limited by the maximal value L and the effectiveness $\rho_0$. Finally, we remove the network with the lowest score if the population size exceeds N.
Algorithm 1 Incorporated-NAS
  • Input: search space S, evolutionary population size N, inference budget O, total number of iterations T, maximal depth L, initial structure $A_0$, effectiveness constraint $\rho_0$
  • Ensure: optimal EZenNet $A^*_{ezennet}$
  •     Initialize the population $P = \{A_0\}$.
  •     for $t = 1, 2, \ldots, T$ do
  •      Randomly select $A_t \in P$.
  •      Mutation stage: $\hat{A}_t = \mathrm{MUTATION}(A_t, S)$
  •      if $\hat{A}_t$ has more than L layers, exceeds budget O, or $\rho_n > \rho_0$ then
  •       continue
  •      else
  •       Compute the zen-score $z = \text{zen-calculation}(\hat{A}_t)$
  •       Compute $E_{score}$
  •       Compute the Incorporated-Score $y = w_z z + w_d F(E_{score})$
  •       Append $\hat{A}_t$ to P
  •      end if
  •      Remove the architecture with the lowest Incorporated-Score from P if the size of P exceeds the population size N.
  •     end for
  •     Return $A^*_{ezennet}$, the network with the maximal Incorporated-Score in P
Algorithm 2 Mutation
  • Input: structure $A_t$, search space S
  • Ensure: randomly mutated structure $\hat{A}_t$
  •    Uniformly choose a block b in $A_t$
  •    Alternate the block type, kernel size, width, and depth of b uniformly within the mutation ratio range.
  •    Obtain the child structure $\hat{A}_t$ from structure $A_t$
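The skeleton below (our own sketch, not the released implementation) mirrors Algorithms 1 and 2: the mutate, score_fn, and budget_ok callables are placeholders for Algorithm 2, the Incorporated-Score computation, and the budget/depth/effectiveness checks, respectively:

```python
import copy
import random

def incorporated_nas(initial_arch, search_space, budget_ok, score_fn, mutate,
                     population_size=256, iterations=96_000):
    """Evolutionary search keeping the population_size architectures with the
    highest Incorporated-Score (Algorithm 1).  budget_ok(arch) is expected to
    check the inference budget O, the depth limit L, and rho_n <= rho_0."""
    population = [(score_fn(initial_arch), initial_arch)]
    for _ in range(iterations):
        parent = random.choice(population)[1]
        child = mutate(copy.deepcopy(parent), search_space)   # Algorithm 2
        if not budget_ok(child):
            continue                                          # reject and resample
        population.append((score_fn(child), child))
        if len(population) > population_size:
            worst = min(range(len(population)), key=lambda i: population[i][0])
            population.pop(worst)                             # drop lowest score
    best_score, best_arch = max(population, key=lambda p: p[0])
    return best_arch
```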

5. Experiments

To validate the effectiveness of the proposed Incorporated-Score, we conduct experiments on three datasets, namely CIFAR-10, CIFAR-100, and ImageNet-1k. First, we compare the proposed method to the baseline SOTA proxy, zen-score, on the CIFAR-10 and CIFAR-100 datasets using the same settings for search policy, search space, and training. Next, we compare the search cost of Incorporated-NAS with that of the other methods. Finally, we compare our proposed method to zen-score on ImageNet in lightweight instances.

5.1. Experimental Setting

5.1.1. NAS Settings

We conduct the experiments in the following search spaces:
  • Search Space A: comprises residual and bottleneck blocks from ResNet, following [55,56].
  • Search Space B: comprises MobileNet blocks, following [57,58]. The depth-wise expansion ratio is selected from { 1 , 2 , 4 , 6 } during the search.
To satisfy the inference budget, we set the initial architecture as a small, randomly selected network, as in [25], for a fair comparison. The set of kernel sizes is { 3 , 5 , 7 }. The number of stages is set to three for CIFAR-10 and CIFAR-100 and five for ImageNet. The size of the evolutionary population (N) is 256, and the number of evolutionary iterations is T = 96,000. We set the maximal network depth to L = 18 and the effectiveness to $\rho_0 = 0.5$, following previous works [25,26]. The resolutions are 32 × 32 for the CIFAR-10 and CIFAR-100 datasets and 224 × 224 for the ImageNet dataset. We find the optimal pair ($w_z$, $w_d$) on the CIFAR-10 and CIFAR-100 datasets, which we then use for NAS on ImageNet. The search for the optimal hyperparameters ($w_z$, $w_d$) is described in Section 5.2. The following experiments are conducted (a configuration sketch summarizing these settings is given after the list):
  • Incorporated-Score-l: Incorporated-Score is generated by using the logarithm function as the normalization function.
  • Incorporated-Score-s: Incorporated-Score is generated by using the square-root function as the normalization function.
  • Incorporated-Score-w/o-ls: To explain the effect of our normalization approach, we experiment with Incorporated-Score without any normalization and balanced weights.
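For reference, a hypothetical configuration dictionary (the key names are ours, not from any released code) summarizing the NAS settings above might look as follows:

```python
# Hypothetical configuration mirroring the NAS settings above (CIFAR case).
nas_config = {
    "search_space": "A",                 # ResNet residual/bottleneck blocks
    "kernel_sizes": [3, 5, 7],
    "num_stages": 3,                     # 3 for CIFAR-10/100, 5 for ImageNet
    "population_size": 256,              # N
    "iterations": 96_000,                # T
    "max_depth": 18,                     # L
    "rho_0": 0.5,                        # effectiveness constraint
    "resolution": 32,                    # 32 for CIFAR, 224 for ImageNet
    "budget": {"max_params": 1_000_000}, # lightweight (< 1 M parameters)
    "proxy": "Incorporated-Score-s",     # sqrt normalization, (w_z, w_d) = (0.8, 0.2)
}
```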

5.1.2. Training Setting

  • Dataset: CIFAR-10 and CIFAR-100 are two benchmark datasets for image classification. CIFAR-10 consists of 50,000 images for training and 10,000 images for testing in 10 classes. Each image has a resolution of 32 × 32 . CIFAR-100 has a similar number of samples for training and testing but is divided into 100 classes. ImageNet-1k is a large dataset that includes over 1.2 million images for training and 50,000 test images divided into 1000 classes. We experiment with the official training and validation dataset.
  • Augmentation: Augmentations including label smoothing, AutoAugment [59,60], mix-up [61], random erasing [62], and random crop/resize/flip/light are used.
  • Optimizer: The optimizer is SGD with a momentum of 0.9 for all experiments. The weight decay is 5 × 10 4 and 4 × 10 5 for CIFAR-10/100 and ImageNet, respectively. The batch size is 256, with an initial learning rate of 0.1 and cosine learning rate decay [63]. For CIFAR-10 and CIFAR-100, we train the models up to 1440 epochs. For ImageNet, the number of train epochs is 480. Following previous research [17,18,64], we use EfficientNet-B3 as a teacher network when training EZenNets.
We compare Incorporated-Score with the baseline SOTA proxy, zen-score. Moreover, we provide the results for three other proxies, namely gradient-norm (grad), synflow [22], and TE-Score [23], to provide an overview of zero-shot methods. For each proxy, the search step operates as in Algorithm 1 with T = 96,000. The best networks on CIFAR-10 and CIFAR-100 are searched for within fewer than 1 M network parameters in the NAS step. After the NAS step, the best-scoring network is trained in the same training setting.

5.2. Search for Hyperparameters ( w z , w d ) for Each Normalization Proxy

To find the best pair ($w_z$, $w_d$) in Equation (11) for optimal performance, we constrain both parameters to the range (0, 1) with $w_z + w_d = 1$. We find the best pair on the CIFAR-10 and CIFAR-100 datasets with fewer than 1 M network parameters and then apply these pairs to NAS on the ImageNet-1k dataset. All experimental results are provided in Table 1. For Incorporated-Score-s, which uses the square-root function for normalization, we observe that the computational cost of the network in FLOPs tends to rise as $w_d$ increases. Our goal is to design a high-performance, lightweight network, which is why we choose ($w_z$, $w_d$) = (0.8, 0.2) for the Incorporated-Score-s proxy. For Incorporated-Score-l, ($w_z$, $w_d$) = (0.9, 0.1) is the best pair in terms of both performance and the network's computational cost. For Incorporated-Score-w/o-ls, the top-1 accuracy improves significantly on the CIFAR-100 dataset when the weight of $E_{score}$ is extremely small ($w_d = 10^{-4}$) relative to the weight of the zen-score. This further confirms that $E_{score}$ needs to be normalized.

5.3. Comparison of Results between Incorporated-Score and Zen-Score on CIFAR-10 and CIFAR-100 Datasets

Table 2 presents the detailed experimental results for the CIFAR-10 and CIFAR-100 datasets with a model size of less than 1 M parameters. The proposed Incorporated-Score outperforms the original zen-score. In particular, the top-1 accuracy reaches 96.66% and 96.86% on the CIFAR-10 dataset when the Incorporated-Score-l and Incorporated-Score-s proxies are used, respectively. For the CIFAR-100 dataset, the top-1 accuracy exceeds that of zen-score, reaching 80.67% and 81.10% with the Incorporated-Score-l and Incorporated-Score-s proxies, respectively. The results demonstrate that the entropy of the network works effectively when incorporated into the zen-score.

5.4. The Effectiveness of $E_{score}$ Normalization

To explain the effectiveness of normalizing $E_{score}$ when computing Incorporated-Score, Table 2 also reports the performance of models generated without any normalization function. The experimental results show that normalizing the entropy affects NAS performance: Incorporated-Score with entropy normalization yields better performance than Incorporated-Score without it. As mentioned, the main reason is that the value of $E_{score}$ is much larger than that of the zen-score, which makes $E_{score}$ dominate the Incorporated-Score. Figure 3a,c show the maximum value of Incorporated-Score with logarithmic and square-root normalization through the NAS stage. The normalized $E_{score}$ values are shown in Figure 3b,d. The figures show that the value of $E_{score}$ after normalization is at the same level as the zen-score; specifically, its values are spread over the range [0, 100]. This avoids the dominance problem that occurs when we combine zen-score and $E_{score}$. Finally, Figure 3e shows the correlation between zen-score and Incorporated-Score, with Incorporated-Score exhibiting a smoother curve than zen-score. For a fair comparison with zen-score, we also compare the computational efficiency of Incorporated-Score and zen-score in Table 3. The computational time of Incorporated-Score is almost equal to that of zen-score. The main point is that we can find more robust, high-performance networks with an insignificant increase in computational cost compared with zen-score.

5.5. Incorporated-Score for Lightweight Model on ImageNet Dataset

Table 4 shows a comparison between the proposed Incorporated-Score and zen-score on the ImageNet dataset. As shown in Table 4, the models that are searched for using the proposed Incorporated-Score outperform the models that are searched for using zen-score with both 400 M and 600 M FLOPs.
To demonstrate the effectiveness of the proposed Incorporated-Score, we provide additional experimental results using other NAS approaches for lightweight cases, as shown in Table 5. The experiments were carried out with the following popular networks as baselines: (a) manually designed networks such as MobileNet-V2 [57]; (b) NAS-designed networks for fast inference on GPUs, i.e., DFNet [65] and RegNet [56], based on the search space optimization approach; (c) NAS-designed networks optimized for FLOPs, including one-shot methods such as OFANet [17], DNANet [18], and EfficientNet [50]; and (d) NAS-designed networks optimized for FLOPs, including RL methods such as MnasNet [48]. As shown in Table 5, EZenNet models searched for using the proposed Incorporated-Score outperform other models with similar FLOPs.
Table 6 provides the NAS searching time of the proposed approach on ImageNet-1k compared to algorithms proposed in previous research, including CARS-I [12], PC-DARTS [10], FBNetV2 [16], MetaQNN [6], TE-NAS [23], and OFANet [17]. As shown in Table 6, the proposed method outperforms the methods from previous research with an efficient search time. In particular, the proposed Incorporated-NAS reaches 80.1% accuracy in approximately 0.2 GPU days, while OFANet spends 51.6 GPU days to achieve the same accuracy.

5.6. Architecture Comparison

For further information, we compare architectures generated using the proposed Incorporated-Score proxy and the zen-score proxy, as shown in Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12, on the CIFAR-10, CIFAR-100, and ImageNet datasets in ResNet and MobileNet search spaces.
In the tables, ‘Conv’ represents a standard convolution layer followed by batch normalization (BN) and a ReLU; ‘Res’ is the residual block designed in ResNet-18; ‘Btn’ is the residual bottleneck block designed in ResNet-50; ‘MB’ is the MobileBlock used in MobileNet and EfficientNet; ‘Input’ and ‘Output’ are short for the number of input and output channels, respectively; and ‘Bottleneck’ is the number of bottleneck channels. EZenNet models, which are generated using the proposed Incorporated-Score, tend to expand horizontally and are less deep than ZenNet models. For example, at a computation cost of 400 M FLOPs, EZenNet has 12 layers, whereas ZenNet has 14. This indicates that the entropy component helps control over-deep network generation and optimizes the networks along the width dimension.

6. Conclusions

In this paper, we proposed Incorporated-Score, an efficient zero-shot neural architecture search proxy for designing high-performance deep image recognition networks based on the zen-score and entropy. We conducted experiments verifying that entropy serves as an auxiliary optimization factor to the zen-score, and EZenNet models automatically designed by Incorporated-NAS outperformed state-of-the-art NAS models in top-1 accuracy on the CIFAR-10 and CIFAR-100 datasets under the same budget. Moreover, EZenNet outperforms ZenNet on the ImageNet dataset in lightweight instances. In the future, we will further investigate this proxy for other deep learning problems, such as object detection and natural language processing.

Author Contributions

Conceptualization, T.-T.N. and J.-H.H.; methodology, T.-T.N. and J.-H.H.; software, T.-T.N.; validation, T.-T.N. and J.-H.H.; formal analysis, T.-T.N. and J.-H.H.; investigation, T.-T.N.; resources, J.-H.H.; data curation, T.-T.N.; writing—original draft preparation, T.-T.N.; writing—review and editing, J.-H.H.; visualization, T.-T.N.; supervision, J.-H.H.; project administration, J.-H.H.; funding acquisition, J.-H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2022-00156295) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Data Availability Statement

Data are available in a publicly accessible repository.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Guo, Z.; Zhang, X.; Mu, H.; Heng, W.; Liu, Z.; Wei, Y.; Sun, J. Single Path One-Shot Neural Architecture Search with Uniform Sampling. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 544–560. [Google Scholar] [CrossRef]
  2. Luo, R.; Tian, F.; Qin, T.; Chen, E.; Liu, T.Y. Neural architecture optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 3–8 December 2018; pp. 7827–7838. [Google Scholar]
  3. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized evolution for image classifier architecture search. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 29–31 January 2019. [Google Scholar] [CrossRef]
  4. Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Tan, J.; Le, Q.V.; Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, ICML’17, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 2902–2911. [Google Scholar]
  5. Xie, L.; Yuille, A. Genetic CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1388–1397. [Google Scholar] [CrossRef]
  6. Baker, B.; Gupta, O.; Naik, N.; Raskar, R. Designing Neural Network Architectures using Reinforcement Learning. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  7. Wen, W.; Liu, H.; Chen, Y.; Li, H.; Bender, G.; Kindermans, P.J. Neural Predictor for Neural Architecture Search. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 660–676. [Google Scholar] [CrossRef]
  8. Luo, R.; Tan, X.; Wang, R.; Qin, T.; Chen, E.; Liu, T.Y. Semi-Supervised Neural Architecture Search. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33. [Google Scholar]
  9. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  10. Xu, Y.; Xie, L.; Zhang, X.; Chen, X.; Qi, G.; Tian, Q.; Xiong, H. PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  11. Zhou, D.; Zhou, X.; Zhang, W.; Loy, C.C.; Yi, S.; Zhang, X.; Ouyang, W. EcoNAS: Finding Proxies for Economical Neural Architecture Search. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11393–11401. [Google Scholar]
  12. Yang, Z.; Wang, Y.; Chen, X.; Shi, B.; Xu, C.; Xu, C.; Tian, Q.; Xu, C. CARS: Continuous Evolution for Efficient Neural Architecture Search. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2019; pp. 1826–1835. [Google Scholar]
  13. Xie, S.; Zheng, H.; Liu, C.; Lin, L. SNAS: Stochastic neural architecture search. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  14. Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  15. Benyahia, Y.; Yu, K.; Smires, K.B.; Jaggi, M.; Davison, A.C.; Salzmann, M.; Musat, C. Overcoming Multi-model Forgetting. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 594–603. [Google Scholar]
  16. Wan, A.; Dai, X.; Zhang, P.; He, Z.; Tian, Y.; Xie, S.; Wu, B.; Yu, M.; Xu, T.; Chen, K.; et al. FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 13–19 June 2020; pp. 12962–12971. [Google Scholar] [CrossRef]
  17. Cai, H.; Gan, C.; Wang, T.; Zhang, Z.; Han, S. Once-for-All: Train One Network and Specialize it for Efficient Deployment. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  18. Li, C.; Peng, J.; Yuan, L.; Wang, G.; Liang, X.; Lin, L.; Chang, X. Block-Wisely Supervised Neural Architecture Search with Knowledge Distillation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2019; pp. 1986–1995. [Google Scholar]
  19. Yu, K.; Sciuto, C.; Jaggi, M.; Musat, C.; Salzmann, M. Evaluating The Search Phase of Neural Architecture Search. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  20. Ying, C.; Klein, A.; Christiansen, E.; Real, E.; Murphy, K.; Hutter, F. NAS-Bench-101: Towards Reproducible Neural Architecture Search. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 7105–7114. [Google Scholar]
  21. Abdelfattah, M.S.; Mehrotra, A.; Dudziak, Ł.; Lane, N.D. Zero-Cost Proxies for Lightweight {NAS}. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  22. Tanaka, H.; Kunin, D.; Yamins, D.L.K.; Ganguli, S. Pruning neural networks without any data by iteratively conserving synaptic flow. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 6–12 December 2020. [Google Scholar]
  23. Chen, W.; Gong, X.; Wang, Z. Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  24. Mellor, J.; Turner, J.; Storkey, A.; Crowley, E.J. Neural Architecture Search without Training. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 7588–7598. [Google Scholar]
  25. Lin, M.; Wang, P.; Sun, Z.; Chen, H.; Sun, X.; Qian, Q.; Li, H.; Jin, R. Zen-NAS: A Zero-Shot NAS for High-Performance Image Recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 337–346. [Google Scholar]
  26. Shen, X.; Wang, Y.; Lin, M.; Huang, Y.; Tang, H.; Sun, X.; Wang, Y. DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  27. Montúfar, G.; Pascanu, R.; Cho, K.; Bengio, Y. On the number of linear regions of deep neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, Cambridge, MA, USA, 8–14 December 2014; pp. 2924–2932. [Google Scholar]
  28. Daniely, A.; Frostig, R.; Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, 5–10 December 2016; pp. 2261–2269. [Google Scholar]
  29. Liang, S.; Srikant, R. Why Deep Neural Networks for Function Approximation? In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  30. Poole, B.; Lahiri, S.; Raghu, M.; Sohl-Dickstein, J.; Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, 5–10 December 2016; pp. 3368–3376. [Google Scholar]
  31. Cohen, N.; Shashua, A. Inductive Bias of Deep Convolutional Networks through Pooling Geometry. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  32. Lu, Z.; Pu, H.; Wang, F.; Hu, Z.; Wang, L. The expressive power of neural networks: A view from the width. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 December 2017; pp. 6232–6240. [Google Scholar]
  33. Raghu, M.; Poole, B.; Kleinberg, J.; Ganguli, S.; Sohl-Dickstein, J. On the Expressive Power of Deep Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 2847–2854. [Google Scholar]
  34. Rolnick, D.; Tegmark, M. The power of deeper networks for expressing natural functions. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  35. Serra, T.; Tjandraatmadja, C.; Ramalingam, S. Bounding and Counting Linear Regions of Deep Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm Sweden, 10–15 July 2018; Volume 80, pp. 4558–4566. [Google Scholar]
  36. Hanin, B.; Rolnick, D. Complexity of Linear Regions in Deep Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 2596–2604. [Google Scholar]
  37. Xiong, H.; Huang, L.; Yu, M.; Liu, L.; Zhu, F.; Shao, L. On the number of linear regions of convolutional neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML’20, Virtual, 13–18 July 2020. [Google Scholar]
  38. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. 1957, 106, 620–630. [Google Scholar] [CrossRef]
  39. Kullback, S. Information Theory and Statistics; Wiley: New York, NY, USA, 1959. [Google Scholar]
  40. Sun, Z.; Ge, C.; Wang, J.; Lin, M.; Chen, H.; Li, H.; Sun, X. Entropy-Driven Mixed-Precision Quantization for Deep Network Design. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  41. Sun, Z.; Lin, M.; Sun, X.; Tan, Z.; Li, H.; Jin, R. MAE-DET: Revisiting Maximum Entropy Principle in Zero-Shot NAS for Efficient Object Detection. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  42. Wang, J.; Sun, Z.; Qian, Y.; Gong, D.; Sun, X.; Lin, M.; Pagnucco, M.; Song, Y. Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition. arXiv 2023, arXiv:2303.02693. [Google Scholar]
  43. Yu, Y.; Chan, K.H.R.; You, C.; Song, C.; Ma, Y. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 6–12 December 2020. [Google Scholar]
  44. Chan, K.H.R.; Yu, Y.; You, C.; Qi, H.; Wright, J.; Ma, Y. ReduNet: A white-box deep network from the principle of maximizing rate reduction. J. Mach. Learn. Res. 2022, 23, 1–103. [Google Scholar]
  45. Roberts, D.A.; Yaida, S.; Hanin, B. The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar] [CrossRef]
  46. Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the Information Bottleneck Theory of Deep Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  47. Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.J.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive Neural Architecture Search. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 19–35. [Google Scholar] [CrossRef]
  48. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2815–2823. [Google Scholar] [CrossRef]
  49. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  50. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 6105–6114. [Google Scholar]
  51. Kakade, S.M.; Sridharan, K.; Tewari, A. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization. In Proceedings of the NIPS, Vancouver, BC, Canada, 12 December 2008; pp. 793–800. [Google Scholar]
  52. Bartlett, P.L.; Foster, D.J.; Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 December 2017; pp. 6241–6250. [Google Scholar]
  53. Neyshabur, B.; Li, Z.; Bhojanapalli, S.; LeCun, Y.; Srebro, N. The role of over-parametrization in generalization of neural networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  54. Ruderman, D.L. The statistics of natural images. Netw. Comput. Neural Syst. 1994, 5, 517. [Google Scholar] [CrossRef]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  56. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing Network Design Spaces. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10425–10433. [Google Scholar] [CrossRef]
  57. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  58. Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient Neural Architecture Search via Parameters Sharing. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4095–4104. [Google Scholar]
  59. Cubuk, E.D.; Zoph, B.; Mané, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Strategies From Data. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 113–123. [Google Scholar] [CrossRef]
  60. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
  61. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  62. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13001–13008. [Google Scholar]
  63. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  64. Aguilar, G.; Ling, Y.; Zhang, Y.; Yao, B.; Fan, X.; Guo, C. Knowledge Distillation from Internal Representations. Proc. AAAI Conf. Artif. Intell. 2020, 34, 7350–7357. [Google Scholar]
  65. Li, X.; Zhou, Y.; Pan, Z.; Feng, J. Partial Order Pruning: For Best Speed/Accuracy Trade-off in Neural Architecture Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Figure 1. Overview of the proposed Incorporated-NAS. Incorporated-NAS searches for the best network architecture in the search space A based on the architecture generator and the accuracy predictor, without training any network.
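For a concrete picture of the loop summarized in Figure 1, the minimal Python sketch below shows one way such a budgeted, training-free search can be organized: candidates are repeatedly mutated, those exceeding the inference budget are rejected, and the survivors are ranked by a zero-shot proxy. All function names and the toy "architecture" encoding are illustrative placeholders, not the authors' implementation.

import random

def zero_shot_nas(sample_fn, mutate_fn, score_fn, cost_fn, budget,
                  iterations=1000, population_size=64):
    # Evolutionary zero-shot NAS sketch: mutate candidates, reject those over
    # the inference budget, and keep the ones with the highest proxy score.
    population = [sample_fn() for _ in range(population_size)]
    for _ in range(iterations):
        child = mutate_fn(random.choice(population))
        if cost_fn(child) > budget:
            continue  # candidate exceeds the inference budget
        population.append(child)
        population.sort(key=score_fn)           # weakest candidate first
        population = population[-population_size:]
    return max(population, key=score_fn)

# Toy usage: an "architecture" is just a list of layer widths.
random.seed(0)
sample = lambda: [random.choice([16, 32, 64]) for _ in range(5)]
mutate = lambda arch: [w if random.random() < 0.8 else random.choice([16, 32, 64, 128])
                       for w in arch]
cost = lambda arch: sum(w * w for w in arch)    # stand-in for FLOPs / model size
score = lambda arch: float(sum(arch))           # stand-in for a zero-shot proxy
best = zero_shot_nas(sample, mutate, score, cost, budget=30000)
print(best, cost(best))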
Figure 2. Values of the zen-score and the effectiveness entropy value (E_score) through T = 96,000 iterations of the NAS process when the model size constraint is less than 1 M in search space A (defined in Section 5.1.1). Black lines are the trend lines of the two scores in (a,b).
Figure 3. Values of Incorporated-Score-l, Incorporated-Score-s, and zen-score through T = 96,000 iterations in the NAS process in search space I. Black lines are trend lines in (b,d).
Table 1. Top-1 accuracy on CIFAR-10 and CIFAR-100 for different pairs (w_z, w_d) of Incorporated-Score with a model size of N ≤ 1 M. The bold represents the selected pairs of (w_z, w_d).
Proxy | (w_z, w_d) | CIFAR-10 | CIFAR-100 | FLOPs
Incorporated-Score-s | (0.9; 0.1) | 96.10% | 79.00% | 280 M
  | (0.8; 0.2) | 96.86% | 81.10% | 309 M
  | (0.7; 0.3) | 96.86% | 80.50% | 472 M
  | (0.6; 0.4) | 97.21% | 79.85% | 592 M
  | (0.5; 0.5) | 96.63% | 79.57% | 315 M
  | (0.4; 0.6) | 96.74% | 81.00% | 572 M
  | (0.3; 0.7) | 96.80% | 79.61% | 692 M
  | (0.2; 0.8) | 97.08% | 80.10% | 630 M
  | (0.1; 0.9) | 96.71% | 80.40% | 634 M
Incorporated-Score-l | (0.9; 0.1) | 96.66% | 80.67% | 112 M
  | (0.8; 0.2) | 96.50% | 80.18% | 411 M
  | (0.7; 0.3) | 96.09% | 79.32% | 472 M
  | (0.6; 0.4) | 96.56% | 78.03% | 592 M
  | (0.5; 0.5) | 96.29% | 79.19% | 315 M
  | (0.4; 0.6) | 96.21% | 79.17% | 572 M
  | (0.3; 0.7) | 96.53% | 79.49% | 692 M
  | (0.2; 0.8) | 96.39% | 79.40% | 630 M
  | (0.1; 0.9) | 96.33% | 79.66% | 634 M
Incorporated-Score-w/o-ls | (0.5; 0.5) | 96.66% | 75.97% | 560 M
  | (0.9999; 10^-4) | 96.33% | 80.10% | 411 M
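Table 1 sweeps the weight pair (w_z, w_d) that balances the two ingredients of Incorporated-Score, i.e., the zen-score and the entropy term. As a rough illustration only, a weighted combination of this kind can be evaluated as in the sketch below; the paper's exact formula and any normalization of the two terms follow the definitions given earlier, and the numeric inputs here are made-up placeholders.

def incorporated_score(zen_score, entropy_score, w_z, w_d):
    # Illustrative weighted combination of the zen-score and the entropy
    # term (E_score); the paper's exact scaling may differ.
    return w_z * zen_score + w_d * entropy_score

# e.g., one candidate scored with the selected Incorporated-Score-s pair (0.8; 0.2);
# the two score values below are placeholders, not measured quantities.
print(incorporated_score(zen_score=120.0, entropy_score=95.0, w_z=0.8, w_d=0.2))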
Table 2. Top-1 accuracy on CIFAR-10 and CIFAR-100 comparing the proposed Incorporated-Score, using the best pair of weights (w_z, w_d), with four other zero-shot proxies, including the SOTA zen-score. The bold represents the best accuracy.
Proxy | CIFAR-10 | CIFAR-100
zen-score | 96.20% | 80.10%
grad | 92.80% | 65.40%
synflow | 95.10% | 75.90%
TE-Score | 96.10% | 77.20%
Incorporated-Score-l | 96.66% | 80.67%
Incorporated-Score-s | 96.86% | 81.10%
Incorporated-Score-w/o-ls | 96.59% | 75.97%
Table 3. Time cost of Incorporated-Score and zen-score for the ZenNet-400M-imagenet model at a resolution of 224 × 224 and ZenNet-1M-cifar at a resolution of 32 × 32. The computational times of Incorporated-Score and zen-score for N images are reported in seconds, with the results averaged across 100 trials.
Proxy | Model | N | Time (s)
Incorporated-Score | ZenNet-400M-imagenet | 16 | 0.0345
  | ZenNet-1M-cifar | 16 | 0.0348
zen-score | ZenNet-400M-imagenet | 16 | 0.0337
  | ZenNet-1M-cifar | 16 | 0.0346
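The protocol behind Table 3 is simple to restate: score a batch of N images with the proxy, repeat 100 times, and report the average wall-clock time. A minimal PyTorch sketch is given below; the stand-in model and the one-line proxy are placeholders, not the released Incorporated-Score or zen-score implementations.

import time
import torch
import torch.nn as nn

def average_proxy_time(score_fn, model, batch, trials=100):
    # Average wall-clock time of one proxy evaluation over `trials` runs.
    elapsed = []
    for _ in range(trials):
        start = time.perf_counter()
        score_fn(model, batch)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

# Stand-in model and proxy; N = 16 images at 32 x 32, matching the CIFAR setting.
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
batch = torch.randn(16, 3, 32, 32)
proxy = lambda m, x: m(x).sum().item()   # placeholder for Incorporated-Score / zen-score
print(f"{average_proxy_time(proxy, model, batch):.4f} s per evaluation")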
Table 4. Top-1 accuracies of the Incorporated-Score and zen-score proxies on the ImageNet database in lightweight cases. The architectures searched using the Incorporated-Score and zen-score proxies are denoted as EZenNet and ZenNet, respectively. The bold represents the best accuracy.
Model | Top1-Acc (%) | FLOPs
EZenNet-400M-SE-l | 78.30 | 405 M
EZenNet-400M-SE-s | 78.29 | 418 M
ZenNet-400M-SE [25] | 78.00 | 410 M
EZenNet-600M-SE-l | 79.64 | 610 M
EZenNet-600M-SE-s | 79.73 | 609 M
ZenNet-600M-SE [25] | 79.10 | 611 M
Table 5. Top-1 accuracies of lightweight models on ImageNet with a computational cost budget of FLOPs ≤ 1 G. '*': OFANet is trained from scratch. '+': OFANet is trained using supernet parameters as an initialization.
Model | Resolution | Params | FLOPs | Top1-Acc
MobileNetV2-0.25 | 224 | 1.5 M | 44 M | 51.80%
MobileNetV2-0.5 | 224 | 2.0 M | 108 M | 64.40%
MobileNetV2-0.75 | 224 | 2.6 M | 226 M | 69.40%
MobileNetV2-1.0 | 224 | 3.5 M | 320 M | 74.70%
MobileNetV2-1.4 | 224 | 6.1 M | 610 M | 74.70%
DFNet-1 | 224 | 8.5 M | 746 M | 69.80%
RegNetY-200MF | 224 | 3.2 M | 200 M | 70.40%
RegNetY-400MF | 224 | 4.3 M | 400 M | 74.10%
RegNetY-600MF | 224 | 6.1 M | 600 M | 75.50%
RegNetY-800MF | 224 | 6.3 M | 800 M | 76.30%
OFANet-9ms (+) | 224 | 5.2 M | 313 M | 75.30%
OFANet-11ms (+) | 224 | 6.2 M | 352 M | 76.10%
OFANet-389M (*) | 224 | 8.4 M | 389 M | 76.30%
OFANet-482M (*) | 224 | 9.1 M | 482 M | 78.80%
OFANet-595M (*) | 224 | 9.1 M | 595 M | 79.80%
EfficientNet-B0 | 224 | 5.3 M | 390 M | 76.30%
EfficientNet-B1 | 240 | 7.8 M | 700 M | 78.80%
EfficientNet-B2 | 260 | 9.2 M | 1.0 G | 79.80%
DNANet-a | 224 | 4.2 M | 348 M | 77.10%
DNANet-b | 224 | 4.9 M | 406 M | 77.50%
DNANet-c | 224 | 5.3 M | 466 M | 77.80%
DNANet-d | 224 | 6.4 M | 611 M | 78.40%
MnasNet-1.0 | 224 | 4.4 M | 330 M | 74.20%
Deep MAD-B0 | 224 | 5.3 M | 390 M | 76.10%
ZenNet-400M-SE | 224 | 5.7 M | 410 M | 78.00%
ZenNet-600M-SE | 224 | 7.1 M | 611 M | 79.10%
ZenNet-900M-SE | 224 | 13.3 M | 926 M | 80.80%
EZenNet-400M-SE-l | 224 | 7.1 M | 405 M | 78.30%
EZenNet-400M-SE-s | 224 | 7.2 M | 418 M | 78.29%
EZenNet-600M-SE-l | 224 | 9.1 M | 610 M | 79.64%
EZenNet-600M-SE-s | 224 | 9.6 M | 609 M | 79.73%
EZenNet-800M-SE-s | 224 | 12.9 M | 801 M | 80.10%
Table 6. Comparison of NAS search time. 'Top1-Acc': the top-1 accuracy on the ImageNet-1k dataset. 'Method': evolutionary algorithm (EA), gradient descent (GD), reinforcement learning (RL), zero-shot proxy (ZS), and progressive shrinkage (PS). The bold represents the best performance.
Model | Method | Top1-Acc (%) | GPU Days
CARS-I [12] | EA | 75.20 | 0.40
PC-DARTS [10] | GD | 75.80 | 3.80
FBNetV2 [16] | GD | 77.20 | 25.00
MetaQNN [6] | RL | 77.40 | 96.00
TE-NAS [23] | ZS | 74.10 | 0.20
OFANet [17] | PS | 80.10 | 51.60
EZenNet-400M-SE-l | ZS | 78.30 | 0.13
EZenNet-400M-SE-s | ZS | 78.29 | 0.09
EZenNet-600M-SE-l | ZS | 79.64 | 0.13
EZenNet-600M-SE-s | ZS | 79.73 | 0.11
EZenNet-800M-SE-s | ZS | 80.10 | 0.20
Table 7. Architecture comparison of EZenNet-1.0M-l vs. ZenNet-1.0M for the CIFAR-10 and CIFAR-100 datasets. Each cell lists the ZenNet-1.0M value followed by the EZenNet-1.0M-l value; '-' indicates that the corresponding model has no such block.
Block | Kernel | Input | Output | Stride | Bottlenecks | # Layers
Conv / Conv | 3 / 3 | 3 / 3 | 88 / 24 | 1 / 1 | - / - | 1 / 1
Btn / Res | 7 / 5 | 88 / 24 | 120 / 88 | 1 / 1 | 16 / 8 | 1 / 1
Btn / Btn | 7 / 7 | 120 / 88 | 192 / 304 | 2 / 2 | 16 / 16 | 3 / 5
Btn / Res | 5 / 5 | 192 / 304 | 224 / 48 | 1 / 1 | 24 / 8 | 4 / 3
Btn / Btn | 5 / 7 | 224 / 48 | 96 / 304 | 2 / 1 | 24 / 16 | 2 / 4
Btn / Btn | 3 / 5 | 96 / 304 | 168 / 80 | 2 / 2 | 40 / 32 | 3 / 2
Btn / Btn | 3 / 5 | 168 / 80 | 112 / 256 | 1 / 2 | 48 / 40 | 3 / 1
Conv / Conv | 1 / 1 | 112 / 256 | 512 / 232 | 1 / 1 | - / - | 1 / 1
Table 8. Architecture comparison of EZenNet-1.0M-s vs. ZenNet-1.0M for the CIFAR-10 and CIFAR-100 datasets. Each cell lists the ZenNet-1.0M value followed by the EZenNet-1.0M-s value; '-' indicates that the corresponding model has no such block.
Block | Kernel | Input | Output | Stride | Bottlenecks | # Layers
Conv / Conv | 3 / 3 | 3 / 3 | 88 / 56 | 1 / 1 | - / - | 1 / 1
Btn / Res | 7 / 7 | 88 / 56 | 120 / 48 | 1 / 1 | 16 / 8 | 1 / 5
Btn / Btn | 7 / 3 | 120 / 48 | 192 / 200 | 2 / 2 | 16 / 32 | 3 / 3
Btn / Btn | 5 / 5 | 192 / 200 | 224 / 160 | 1 / 1 | 24 / 24 | 4 / 3
Btn / Btn | 5 / 7 | 224 / 160 | 96 / 624 | 2 / 2 | 24 / 16 | 2 / 4
Btn / Btn | 3 / 3 | 96 / 624 | 168 / 48 | 2 / 2 | 40 / 40 | 3 / 1
Btn / - | 3 / - | 168 / - | 112 / - | 1 / - | 48 / - | 3 / -
Conv / Conv | 1 / 1 | 112 / 48 | 512 / 304 | 1 / 1 | - / - | 1 / 1
Table 9. Architecture comparison of EZenNet-400M-SE-l vs. ZenNet-400M-SE for the ImageNet dataset. Each cell lists the ZenNet-400M-SE value followed by the EZenNet-400M-SE-l value; '-' indicates that the corresponding model has no such block.
Block | Kernel | Input | Output | Stride | Bottleneck | Expansion | # Layers
Conv / Conv | 3 / 3 | 3 / 3 | 16 / 32 | 2 / 2 | - / - | - / - | 1 / 1
MB / MB | 7 / 7 | 16 / 32 | 40 / 40 | 2 / 2 | 40 / 40 | 1 / 1 | 1 / 1
MB / MB | 7 / 7 | 40 / 40 | 64 / 64 | 2 / 2 | 64 / 40 | 1 / 2 | 1 / 1
MB / MB | 7 / 7 | 64 / 64 | 96 / 128 | 2 / 2 | 96 / 176 | 4 / 2 | 5 / 1
MB / MB | 7 / 7 | 96 / 128 | 224 / 256 | 2 / 2 | 224 / 152 | 2 / 4 | 5 / 3
- / MB | - / 7 | - / 256 | - / 104 | - / 1 | - / 152 | - / 6 | - / 4
Conv / Conv | 1 / 1 | 224 / 104 | 2048 / 2048 | 1 / 1 | - / - | - / - | 1 / 1
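For readers who prefer a machine-readable form, the EZenNet-400M-SE-l column of Table 9 can be transcribed as a compact block list, as in the sketch below. The tuple layout and field order are an illustrative assumption rather than the authors' configuration format; the numbers are copied from the table above.

# EZenNet-400M-SE-l, as read from the EZenNet column of Table 9.
# Each entry: (block, kernel, in_channels, out_channels, stride, bottleneck, expansion, layers)
EZENNET_400M_SE_L = [
    ("Conv", 3,   3,   32, 2, None, None, 1),
    ("MB",   7,  32,   40, 2,   40,    1, 1),
    ("MB",   7,  40,   64, 2,   40,    2, 1),
    ("MB",   7,  64,  128, 2,  176,    2, 1),
    ("MB",   7, 128,  256, 2,  152,    4, 3),
    ("MB",   7, 256,  104, 1,  152,    6, 4),
    ("Conv", 1, 104, 2048, 1, None, None, 1),
]

total_blocks = sum(layers for *_, layers in EZENNET_400M_SE_L)
print(total_blocks)  # simple sanity check on the transcription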
Table 10. Architecture comparison of EZenNet-400M-SE-s vs. ZenNet-400M-SE for the ImageNet dataset. Each cell lists the ZenNet-400M-SE value followed by the EZenNet-400M-SE-s value; '-' indicates that the corresponding model has no such block.
Block | Kernel | Input | Output | Stride | Bottleneck | Expansion | # Layers
Conv / Conv | 3 / 3 | 3 / 3 | 16 / 40 | 2 / 2 | - / - | - / - | 1 / 1
MB / MB | 7 / 7 | 16 / 40 | 40 / 80 | 2 / 2 | 40 / 16 | 1 / 1 | 1 / 1
MB / MB | 7 / 7 | 40 / 80 | 64 / 80 | 2 / 2 | 64 / 48 | 1 / 1 | 1 / 1
MB / MB | 7 / 7 | 64 / 80 | 96 / 336 | 2 / 2 | 96 / 112 | 4 / 1 | 5 / 1
MB / MB | 7 / 7 | 96 / 336 | 224 / 360 | 2 / 2 | 224 / 360 | 2 / 1 | 5 / 3
- / MB | - / 7 | - / 360 | - / 360 | - / 1 | - / 360 | - / 1 | - / 4
Conv / Conv | 1 / 1 | 224 / 360 | 2048 / 2048 | 1 / 1 | - / - | - / - | 1 / 1
Table 11. Architecture comparison of EZenNet-600M-SE-l vs. ZenNet-600M-SE for the ImageNet dataset. Each cell lists the ZenNet-600M-SE value followed by the EZenNet-600M-SE-l value; '-' indicates that the corresponding model has no such block.
Block | Kernel | Input | Output | Stride | Bottleneck | Expansion | # Layers
Conv / Conv | 3 / 3 | 3 / 3 | 24 / 16 | 2 / 2 | - / - | - / - | 1 / 1
MB / MB | 7 / 7 | 24 / 16 | 48 / 32 | 2 / 2 | 16 / 32 | 1 / 4 | 1 / 1
MB / MB | 7 / 7 | 48 / 32 | 72 / 72 | 2 / 2 | 16 / 40 | 2 / 4 | 1 / 1
MB / MB | 7 / 7 | 72 / 72 | 96 / 304 | 2 / 2 | 24 / 96 | 6 / 2 | 5 / 1
MB / MB | 7 / 7 | 96 / 304 | 192 / 360 | 2 / 2 | 24 / 360 | 4 / 1 | 5 / 3
- / MB | - / 7 | - / 360 | - / 176 | - / 1 | 40 / 240 | - / 4 | - / 5
Conv / Conv | 1 / 1 | 192 / 176 | 2048 / 2048 | 1 / 1 | - / - | - / - | 1 / 1
Table 12. Architecture comparison of EZenNet-600M-SE-s vs. ZenNet-600M-SE for the ImageNet dataset. Each cell lists the ZenNet-600M-SE value followed by the EZenNet-600M-SE-s value; '-' indicates that the corresponding model has no such block.
Block | Kernel | Input | Output | Stride | Bottleneck | Expansion | # Layers
Conv / Conv | 3 / 3 | 3 / 3 | 24 / 16 | 2 / 2 | - / - | - / - | 1 / 1
MB / MB | 7 / 7 | 24 / 16 | 48 / 24 | 2 / 2 | 16 / 64 | 1 / 2 | 1 / 1
MB / MB | 7 / 7 | 48 / 24 | 72 / 120 | 2 / 2 | 16 / 120 | 2 / 1 | 1 / 1
MB / MB | 7 / 7 | 72 / 120 | 96 / 192 | 2 / 2 | 24 / 144 | 6 / 2 | 5 / 1
MB / MB | 7 / 7 | 96 / 192 | 192 / 320 | 2 / 2 | 24 / 224 | 4 / 2 | 5 / 4
- / MB | - / 7 | - / 320 | - / 384 | - / 1 | 40 / 384 | - / 1 | - / 5
Conv / Conv | 1 / 1 | 192 / 384 | 2048 / 2048 | 1 / 1 | - / - | - / - | 1 / 1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

