
LMD-DARTS: Low-Memory, Densely Connected, Differentiable Architecture Search

School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(14), 2743; https://doi.org/10.3390/electronics13142743
Submission received: 6 June 2024 / Revised: 5 July 2024 / Accepted: 10 July 2024 / Published: 12 July 2024

Abstract

Neural architecture search (NAS) technology is pivotal for designing lightweight convolutional neural networks (CNNs), facilitating the automatic search for network structures without requiring extensive prior knowledge. However, NAS is resource-intensive, consuming significant computational power and time to evaluate numerous candidate architectures. To address the high memory usage and slow search speed of traditional NAS algorithms, we propose the Low-Memory, Densely Connected, Differentiable Architecture Search (LMD-DARTS) algorithm. To expedite the updating of the optional operation weights during the search, LMD-DARTS introduces a continuous relaxation strategy based on weight redistribution. Furthermore, to mitigate the influence of low-weight operations on classification results and reduce the number of searches, LMD-DARTS employs a dynamic sampler to prune underperforming operations during the search, thereby lowering memory consumption and simplifying individual searches. Additionally, to sparsify the dense connection matrix and reduce redundant connections while maintaining network performance, we introduce an adaptive downsampling search algorithm. Our experimental results show that LMD-DARTS achieves a 20% reduction in search time, along with a significant decrease in memory utilization within the NAS process. Notably, the lightweight CNNs derived through this algorithm exhibit strong classification accuracy, underscoring their effectiveness and efficiency for practical applications.

1. Introduction

In recent years, the field of artificial intelligence has witnessed a wave of theoretical advancements and methodological innovations, propelling the convolutional neural network (CNN) into widespread adoption across diverse domains such as computer vision, natural language processing, and speech recognition. This integration has fostered a profound fusion between artificial intelligence and various facets of society. However, as research has deepened, the relentless pursuit of higher network accuracy has driven a continuous escalation of CNN model complexity, posing a formidable challenge to deploying these networks on resource-constrained devices. Lightweight convolutional neural networks have therefore become a key means of addressing these problems.
Lightweight convolutional neural networks can be deployed on resource-constrained computing platforms, expanding the possibilities for their popularization and application. At present, research on lightweight CNNs follows two directions [1]. The first involves building network structures with fewer parameters and lower computational requirements while ensuring network accuracy, using either manual design or neural architecture search (NAS) technology. The other direction starts from network optimization, applying lightweight processing to existing models and removing as many redundant parameters as possible while keeping accuracy essentially unchanged, for example via parameter pruning, knowledge distillation, and low-rank decomposition.
With the deepening of network layers and the proposal of various convolutional methods, more and more factors need to be considered when designing CNNs manually. Designers must constantly adjust the network structure according to their experience and the performance of the models. At the same time, large datasets such as ImageNet and COCO increase the trial-and-error cost of designing CNNs. NAS thus emerged, aiming to automatically search for better network structures with less prior knowledge [2].
To optimize the candidate operation weights during the continuous relaxation of the discrete search space in differentiable architecture search, we propose LMD-DARTS, a low-memory neural architecture search algorithm. This algorithm leverages a densely connected, lightweight search space with DAS-Block adaptive subsampling. Firstly, to mitigate the influence of poor candidate operations on classification results and realize weight redistribution, we introduce WRD-softmax, a continuous weight-redistribution relaxation strategy. This strategy reduces the number of iterations required for network convergence and accelerates the search process. Secondly, to reduce the cache consumption and search time of a single search while further improving search efficiency without affecting the search results, a dynamic sampler is proposed that prunes candidate operations and dynamically eliminates poorly performing ones during the search. The improvements of this algorithm are as follows:
  • A continuous strategy redistributes weights to accelerate updates for optional operations during the search, minimizing the impact of low-weight operations on classification results and reducing search iterations.
  • A dynamic sampler prunes underperforming operations in real time, cutting memory usage and simplifying individual search processes.
  • An adaptive downsampling search algorithm is proposed that sparsifies the dense connection matrix to reduce redundant connections while ensuring the performance of the network.

2. Related Work

NAS evaluates a large number of search results, which can be extremely time-consuming when using traditional deep learning evaluation methods. For example, on the CIFAR-10 dataset, it takes three days to search for a network architecture even with a partial search space [3]. Zoph et al. [4] evaluated 12,800 neural network architectures, consuming 24,000 GPU days; GPU days reflect the computational cost of a NAS algorithm by measuring the number of days the search would require when running on a single GPU. The search of Real et al. [5] required 3150 GPU days. To improve evaluation efficiency and reduce search time, designing an efficient and reasonable performance evaluation strategy is key to the further popularization and application of NAS.
Baker et al. [6] and Zhong et al. [7] used reinforcement learning as the search strategy for NAS, proposing the MetaQNN and BlockQNN methods, respectively. The former employs a greedy search strategy and Q-learning with experience replay to search for network architectures; the latter represents the layers of the network with a structural code over a block-based search space. Zoph et al. [4] used a recurrent neural network (RNN) as the controller, combined with reinforcement learning to optimize the parameters of the RNN; they encoded CNN architectures as variable-length codes and limited the maximum number of layers to reduce the search space. Using the policy gradient method to optimize the performance of the searched CNNs, they achieved 96.35% accuracy on the CIFAR-10 dataset, surpassing most manually designed networks of the time. Under a block-based search space, Zoph et al. [8] and Cai et al. [9] defined a recursive network controller and controllers with different candidate operation spaces, respectively; the former used proximal policy optimization to accelerate the search and improve its efficiency.
NASNet [10] uses an evolutionary algorithm as the search strategy for NAS. During the search, it systematically eliminates poorly performing architectures and selects well-behaved architectures as parents. In the replication and mutation stages, the evolutionary process retains the structures and weights of the parent networks and adds the offspring to the population once they are obtained. MnasNet [11] introduces recombination operations during the generation of progeny, retrains and evaluates the obtained progeny, eliminates individuals whose accuracy is lower than a threshold, and reduces the number of progeny generated in the next round of evaluation so as to speed up the search process. Xie et al. [12] used a fixed-length, binary-coded genotype to represent the network architecture and evolve basic operations such as convolution and pooling, and the search results showed good transfer performance. Suganuma et al. [13] proposed an optimizer based on an evolutionary strategy [14] that uses genotypes encoded by triples to represent the network architecture. These genotypes are divided into active and inactive parts; the inactive part participates in mutation but does not affect the phenotype, reducing the training cost of evaluating different architectures. Liu et al. [15] used a hierarchical search space in which mutation selects the layer structure to mutate for each individual and removes individually selected links. So et al. [16] introduced a human-designed transformer framework into the initial population for a natural language processing task. Real et al. [5] avoided the repeated selection of well-performing network structures through constraints, ensuring the diversity of the population. AmoebaNet-B and AmoebaNet-C exceed the performance of manually designed networks and demonstrate that evolutionary algorithms are faster than reinforcement learning for NAS. However, the search process of this method still takes 131 GPU days.
Liu et al. [17] proposed DARTS, a differentiable neural architecture search algorithm based on gradient optimization, with eight candidate operations in its search space. The whole search takes only 1 GPU day on the CIFAR dataset, though the method requires a large amount of memory. To address the excessive memory occupation of DARTS, Cai et al. [18] proposed ProxylessNAS, which replaces the original mixed operation with a probabilistic sampling operation; with N candidate operations, memory consumption is reduced to 1/N of the original, allowing the architecture to be searched directly on the ImageNet dataset. Xu et al. [19] proposed PC-DARTS, which reduces redundant parameters and accelerates the search by sampling only part of the feature-map channels. Chen et al. [20] proposed P-DARTS, which progressively approximates the search space to reduce memory consumption and alleviates the different search preferences that DARTS exhibits at different depths during the search and verification phases. Zheng et al. [21] modeled the multinomial distribution of DARTS candidate operations and optimized their hyperparameters; the proposed MDENAS converges after 4 GPU hours on ImageNet. Hundt et al. [22] used the MaxW regularization method to correct the performance-evaluation bias of DARTS and proposed sharpDARTS. Wang et al. [23] proposed MetaNTK-NAS, applying NAS to meta-learning optimization and validating it on small-sample datasets, obtaining a network structure superior to traditional meta-learning. Xue et al. [24] proposed ADARTS, applying an attention mechanism to DARTS and reducing weight-free operations in the search results. Huang et al. [25] applied DARTS to single-image super-resolution (SISR) tasks, obtaining a more lightweight and accurate SR structure. Luo et al. [26] proposed SurgeNAS, a hardware-aware, differentiable NAS framework that performs more accurate gradient estimation, effectively alleviates the memory bottleneck, and ensures the fairness of the search. Li et al. [27] used a polarization regularizer to relieve performance collapse and search more directly for a better model by weighting the architectural parameters.
In recent years, neural architecture search has demonstrated significant promise across various computer vision tasks, including image classification, recognition, restoration, enhancement, and retrieval, and numerous studies have applied NAS to these areas to improve performance and efficiency. For example, Ghiasi et al. [28] proposed the Neural Architecture Search Feature Pyramid Network (NAS-FPN), which employs reinforcement learning to sample and evaluate different feature pyramid architectures for detection tasks while minimizing the computational cost. Zhang et al. [29] proposed a memory-efficient hierarchical NAS (HiNAS) to address image denoising and image super-resolution. Priyadarshi et al. [30] presented DONNAv2, a computationally efficient neural architecture distillation method showcasing NAS applications in various vision tasks. Liu et al. [31] introduced Auto-DeepLab, expanding the application of neural architecture search to semantic segmentation tasks. Additionally, Mandal et al. [32] and Liu et al. [33] proposed methods for haze removal and nighttime image enhancement, setting benchmarks in image restoration and enhancement. Our proposed LMD-DARTS algorithm further optimizes NAS by incorporating a dynamic sampler and adaptive downsampling, which lower memory consumption and shorten the search; these improvements can benefit models in tasks such as image classification, recognition, and retrieval.

3. LMD-DARTS: Low-Memory, Densely Connected, Differentiable Architecture Search

3.1. Search Space

To enhance the efficiency and adaptability of the search process, we use a cell-based search space. The architecture comprises eight sequentially connected cells, including six normal cells and two reduction cells. There are seven ordered nodes in a single cell; the inputs Cell i-1 and Cell i-2 are the output feature maps of the two preceding cells. Every node is connected to all of its preceding nodes. The network architecture, denoted as α, is shared within each cell type, thereby reducing the search for the entire network to the search for two types of cells over the same search space. The normal cell is responsible for extracting features and keeps the input and output feature maps the same size. Conversely, the reduction cell reduces the size of the feature map while extracting features, halving the height and width of the output feature map compared to the input. During the search, normal cells and reduction cells are stacked in a preset pattern to form the network structure to be evaluated. Figure 1 shows the stacking settings of the two types of cells; a sketch of one possible stacking order follows.
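As a concrete illustration of this stacking, the short Python sketch below lays out an 8-cell sequence. The placement of the two reduction cells at roughly 1/3 and 2/3 of the depth follows the common DARTS convention and is an assumption; Figure 1 defines the actual preset pattern used in this work.

```python
# A minimal sketch of how the 8-cell supernet described above could be stacked.
# The reduction-cell positions (1/3 and 2/3 of the depth) are an assumption.

def build_cell_sequence(num_cells: int = 8):
    """Return a list of cell types ('normal' or 'reduction') for the supernet."""
    reduction_positions = {num_cells // 3, 2 * num_cells // 3}  # e.g., {2, 5} for 8 cells
    return ["reduction" if i in reduction_positions else "normal"
            for i in range(num_cells)]

if __name__ == "__main__":
    print(build_cell_sequence())
    # ['normal', 'normal', 'reduction', 'normal', 'normal', 'reduction', 'normal', 'normal']
```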
Figure 2 illustrates the search space integrated with DAS-Blocks. Average pooling and maximum pooling are employed to downsample the feature map, while a connectionless operation sparsifies the connections between nodes. The residual connection is used to achieve feature reuse. Depthwise separable convolution and DAS-Block can reduce the number of parameters and computational complexity. Additionally, dilated convolution is used to maintain the receptive field of the convolution kernel.
The DAS-Block consists of a trunk branch and a residual connection. Initially, the channel dimension of the input feature map is reduced using a 1 × 1 convolution. Channels are then exchanged, deleted, and duplicated according to the index layer. After the feature map passes through a BN layer and ReLU activation, a 3 × 3 group convolution is used to extract features, in which the number of groups, g, is a hyperparameter. The ratio of the number of channels in the output feature map to the number of channels in the input feature map is called the bottleneck factor, b. During feature extraction, the information between channels is learned via a squeeze-and-excitation module; the factor by which the number of channels is compressed in the squeeze stage is called the squeeze parameter, R. After another BN layer and ReLU activation, the channel dimension of the feature map is restored via a second learned convolution, and the output of the trunk branch is obtained after a final BN layer and ReLU activation. Finally, the input feature map of the trunk branch is concatenated with its output feature map through the residual connection to obtain the final output of the DAS-Block. A simplified sketch of this structure is given below.
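The following PyTorch sketch illustrates one possible reading of the DAS-Block described above. It is not the authors' implementation: the index layer is approximated by a fixed random channel permutation, the channel counts derived from the bottleneck factor b, the group count g, and the squeeze parameter R are illustrative defaults, and the block simply concatenates its input with the trunk output, so the channel count grows.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Standard squeeze-and-excitation block; R is the squeeze (reduction) ratio."""
    def __init__(self, channels: int, R: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // R), nn.ReLU(inplace=True),
            nn.Linear(channels // R, channels), nn.Sigmoid())

    def forward(self, x):
        s = x.mean(dim=(2, 3))                          # squeeze: global average pooling
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)
        return x * w                                    # excitation: channel-wise reweighting

class DASBlock(nn.Module):
    """Hypothetical rendering of the DAS-Block: 1x1 reduction -> index layer ->
    3x3 group conv with SE -> 1x1 expansion, with the input concatenated to the
    trunk output. The index layer is approximated by a fixed random permutation."""
    def __init__(self, in_ch: int, g: int = 4, b: float = 0.5, R: int = 4):
        super().__init__()
        mid = max(g, int(in_ch * b) // g * g)           # keep mid divisible by the group count g
        self.reduce = nn.Conv2d(in_ch, mid, kernel_size=1, bias=False)
        self.register_buffer("index", torch.randperm(mid))  # stand-in for the learned index layer
        self.bn1, self.act = nn.BatchNorm2d(mid), nn.ReLU(inplace=True)
        self.group_conv = nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=g, bias=False)
        self.se = SqueezeExcite(mid, R)
        self.bn2 = nn.BatchNorm2d(mid)
        self.expand = nn.Conv2d(mid, mid, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(mid)

    def forward(self, x):
        t = self.reduce(x)
        t = t[:, self.index]                            # exchange/duplicate channels via the index
        t = self.act(self.bn1(t))
        t = self.se(self.group_conv(t))                 # grouped feature extraction + channel attention
        t = self.act(self.bn2(t))
        t = self.act(self.bn3(self.expand(t)))
        return torch.cat([x, t], dim=1)                 # residual branch: concatenate input and trunk output

if __name__ == "__main__":
    y = DASBlock(16)(torch.randn(2, 16, 32, 32))
    print(y.shape)  # torch.Size([2, 24, 32, 32]) with b = 0.5
```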

3.2. Weight Redistribution Softmax

During the search, not only the network weights, w, but also the network architecture, α, are updated via backpropagation. The update speed is determined not only by the training loss but also by the learning rate at the current stage, and simply increasing the learning rate to speed up the search can easily cause the network to oscillate. Therefore, a reasonable search acceleration strategy is essential for a differentiable architecture search algorithm.
In the relaxation strategy of DARTS, the discrete search space is made continuous as shown in Equation (1):

$$\mathrm{Softmax}(X_i) = \frac{\exp(X_i)}{\sum_{j=1}^{n} \exp(X_j)}, \tag{1}$$

To ensure that the weight of the most important candidate operation approaches 1 while the weights of the other operations approach 0, that is, to achieve weight redistribution, this relaxation strategy needs to be further modified. Based on this, this paper proposes a relaxation strategy, WRD-softmax (weight-redistribution softmax), and the weights learned via WRD-softmax are used as the basis for dynamic sampling. The expression is shown in Equation (2):

$$\mathrm{WRD\text{-}Softmax}(X_i) = \frac{\exp\!\left(\frac{\log X_i}{t}\right)}{\sum_{j=1}^{n} \exp\!\left(\frac{\log X_j}{t}\right)}, \tag{2}$$
Here, t is the reduction coefficient and is set to a small value, while X represents the weights of the candidate operations. WRD-softmax uses the logarithmic function to magnify the largest value in X and shrink the remaining values while ensuring that the weights still sum to 1.
The weights of candidate operations in a differentiable neural architecture search are not all positive: as training proceeds, the weights of the convolutional neural network gradually shift from a uniform distribution to a normal distribution centered around 0 [26]. The WRD-softmax formula, adjusted for these weight distribution characteristics, is shown in Equation (3):
$$\mathrm{WRD\text{-}Softmax}(X_i) = \frac{\exp\!\left(\frac{\log\left(\mathrm{ReLU}(X_i) + d\right)}{t}\right)}{\sum_{j=1}^{n} \exp\!\left(\frac{\log\left(\mathrm{ReLU}(X_j) + d\right)}{t}\right)}, \tag{3}$$
Here, d is a small hyperparameter with a negligible impact on the model; combined with the ReLU function, it keeps the weights within the domain of the logarithm. The ReLU function also sets the weights of candidate operations that are less than 0 to 0. When t is small, the influence of such candidate operations on the output feature map is further reduced, promoting sparsity and decreasing parameter interdependence.
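For clarity, a minimal PyTorch sketch of Equation (3) is given below; it is equivalent to a temperature-scaled softmax applied to log(ReLU(X) + d). The default values t = 0.5 and d = 0.5 are taken from the experimental settings and would generally need tuning, as noted next.

```python
import torch
import torch.nn.functional as F

def wrd_softmax(x: torch.Tensor, t: float = 0.5, d: float = 0.5) -> torch.Tensor:
    """Sketch of the weight-redistribution softmax of Equation (3):
    an ordinary softmax over log(ReLU(x) + d) / t, i.e. negative weights are
    clamped to zero, shifted into the domain of the logarithm, and sharpened
    by the temperature t."""
    return F.softmax(torch.log(F.relu(x) + d) / t, dim=-1)

# Example: the largest candidate-operation weight is amplified, the rest shrink,
# and the outputs still sum to 1.
alpha = torch.tensor([0.15, 0.10, 0.05, 0.0, -0.05])
print(F.softmax(alpha, dim=-1))   # nearly uniform
print(wrd_softmax(alpha))         # mass concentrated on the largest weight
```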
In the process of network training, the distribution of the network weights is uncertain, so the hyperparameter t should be adjusted according to the actual effect when using WRD-softmax. Combining WRD-softmax with the partial channel connection strategy, the output feature map passed from a preceding node i to node j during the feature-map computation of the differentiable architecture search is given by Equation (4):
$$X_{i,j}(X_i, S_{i,j}) = \sum_{o \in O} \frac{\exp\!\left(\frac{\log\left(\mathrm{ReLU}(\alpha_{i,j}^{o}) + d\right)}{t}\right)}{\sum_{o' \in O} \exp\!\left(\frac{\log\left(\mathrm{ReLU}(\alpha_{i,j}^{o'}) + d\right)}{t}\right)} \cdot o\!\left(S_{i,j} * X_i\right) + \left(1 - S_{i,j}\right) * X_i, \tag{4}$$
Here, $X_{i,j}$ denotes the function computing the output feature map from node i to node j, o denotes a candidate operation from the search space O, and $\alpha_{i,j}^{o}$ denotes the weight of candidate operation o on edge (i, j). The downsampling mask $S_{i,j}$ is a one-dimensional random vector of 0 s and 1 s whose length equals the channel depth of the input feature map $X_i$ of node i.
During the NAS process, WRD-softmax amplifies the maximum value in the weight matrix of candidate operations, which is equivalent to increasing the learning rate. However, this amplification is not indiscriminate; it specifically raises the learning rate of the structure that is optimal given the current search results, and the structures with amplified weights are very likely to be the final search results. In the early stage of training, the weight matrix oscillates considerably in order to find the global optimum, and the highest weight may fall on a parameter-free or poorly performing operation, that is, an incorrect structure. WRD-softmax magnifies the weights of these erroneous operations, so larger losses are allocated to them during gradient backpropagation; this mechanism helps the NAS algorithm quickly escape local optima and saves search time. Additionally, WRD-softmax is applicable to the edge regularization process. The feature-map computation that integrates partial channel connection with WRD-softmax acceleration is shown in Equation (5):
$$X_j = \sum_{i<j} \frac{\exp\!\left(\frac{\log\left(\mathrm{ReLU}(\beta_{i,j}) + d\right)}{t}\right)}{\sum_{i'<j} \exp\!\left(\frac{\log\left(\mathrm{ReLU}(\beta_{i',j}) + d\right)}{t}\right)} \cdot f_{i,j}\!\left(X_i, S_{i,j}\right), \tag{5}$$

where $\beta_{i,j}$ is the weight of edge (i, j) and $f_{i,j}(\cdot)$ is the mixed operation on that edge.
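The sketch below illustrates how Equation (4) combines partial channel connection with WRD-softmax weighting on a single edge. The operation list, the mask handling, and the (1 − S) bypass term follow the PC-DARTS formulation cited above; treat it as an illustration rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def mixed_edge_output(x, ops, alpha, S, t=0.5, d=0.5):
    """Sketch of Equation (4): a partial-channel mixed operation weighted by
    WRD-softmax. `ops` is a list of candidate operations that preserve the
    number of channels, `alpha` their architecture weights, and `S` a 0/1 mask
    over channels."""
    w = F.softmax(torch.log(F.relu(alpha) + d) / t, dim=-1)   # WRD-softmax weights
    S = S.view(1, -1, 1, 1).to(x.dtype)
    sampled = x * S                                           # channels sent through the mixed op
    mixed = sum(wi * op(sampled) for wi, op in zip(w, ops))
    return mixed + x * (1 - S)                                # untouched channels bypass the edge

# Tiny usage example with parameter-free candidate operations.
x = torch.randn(2, 8, 16, 16)
ops = [torch.nn.Identity(), torch.nn.AvgPool2d(3, stride=1, padding=1)]
alpha = torch.tensor([0.2, -0.1])
S = (torch.rand(8) < 0.25).long()                             # sample roughly 1/K of the channels (K = 4)
print(mixed_edge_output(x, ops, alpha, S).shape)              # torch.Size([2, 8, 16, 16])
```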

3.3. Dynamic Sampling

In the early stages of a DARTS search, the weights of different candidate operations are not reliable indicators of the final search result because of network turbulence and inadequate training. However, existing experimental results [27] show that, if a candidate operation performs poorly at the beginning of training, its likelihood of being sampled in the final search result is low. Based on this observation, we propose a simple and effective dynamic sampler to accelerate the neural architecture search. During training, the dynamic sampler progressively prunes the worst-performing candidate operations as the number of iterations increases.
The candidate operations of each edge are updated within the same epoch. After several rounds of training, the dynamic sampler prunes the candidate operations with the lowest probability using a mask, M. To keep the pruning operation stable, uniform sampling is used to generate the hypernetwork structure in the early stage of training; during this stage M is only screened, and no candidate operations are pruned. Mask M is then activated. Within an epoch, the weights and architecture of the network are trained and evaluated, and WRD-softmax, the relaxation strategy applied to the network architecture after weight redistribution, is continuously updated based on the validation loss. After repeating these operations T times, the candidate operations are pruned using Equation (6):
$$\alpha_i = \alpha_i * M_i, \quad M_k = 0, \tag{6}$$
Here, k is the index of the operation with the lowest weight among the candidate operations. A similar progressive pruning method is used in P-DARTS, except that P-DARTS does not incorporate the downsampling process; P-DARTS also jointly optimizes the network architecture and weights with a gradient-based method, and its use of the same edge-sampling strategy as DARTS limits its effectiveness to a large extent. In contrast, this paper prunes the worst-performing operation after several iterations while retaining more effective predecessor nodes, making the search process more efficient.
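A minimal sketch of the pruning step in Equation (6) is shown below. The masking convention (setting the mask entry of the lowest-weighted, still-active operation to zero) is a reconstruction of the described behaviour, not the authors' implementation.

```python
import torch

def prune_worst_operation(alpha: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sketch of the dynamic-sampler pruning step of Equation (6).

    `alpha` holds the candidate-operation weights of one edge and `mask` is the
    0/1 mask M. The still-active operation with the lowest weight is disabled
    by setting its mask entry to 0, so it receives zero probability afterwards."""
    masked = alpha.masked_fill(mask == 0, float("inf"))  # ignore already-pruned operations
    k = torch.argmin(masked)                             # worst remaining candidate
    mask = mask.clone()
    mask[k] = 0
    return mask

# Example: prune every T epochs after the warm-up phase.
alpha = torch.tensor([0.3, -0.2, 0.1, 0.05])
mask = torch.ones_like(alpha)
mask = prune_worst_operation(alpha, mask)
print(mask)  # tensor([1., 0., 1., 1.]) -- the lowest-weight operation is removed
```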

3.4. Adaptive Downsampling

In the downsampling policy of DARTS, when node j is sampled, $\alpha_{i,j}^{best}$ denotes the largest of the optional operation weights on edge (i, j). After the search, the connection weights between nodes tend to converge, and DARTS retains the two predecessor nodes with the largest weight values during downsampling. Suppose nodes j−3 and j−2 are retained, that is, $\min(\alpha_{j-3,j}^{best}, \alpha_{j-2,j}^{best}) > \alpha_{j-1,j}^{best}$; then the mixed operations on edges (j−3, j) and (j−2, j) are replaced with the operations corresponding to $\alpha_{j-3,j}^{best}$ and $\alpha_{j-2,j}^{best}$, respectively, which makes the architecture discrete. However, this approach lacks flexibility and control. In a mixed operation on a given edge, a higher weight for one optional operation merely indicates that this operation is the most effective for that edge; it does not necessarily mean that its performance is better than that of lower-weighted operations on other edges. Such a sampling method may therefore degrade the overall performance of the network.
This section further optimizes the downsampling method and proposes an adaptive downsampling strategy suited to dense connections. To better compare the importance of the mixed operations on different edges, edge regularization is introduced: the weight $\alpha_{i,j}$ of the mixed operation on edge (i, j) is regularized to $\theta_{i,j}$, which is used to evaluate the importance of the alternative operations. Algorithm 1 shows the adaptive downsampling strategy.
Here, $\gamma \in (0, 1)$ is a hyperparameter used to control how aggressively connections between nodes are sampled, and $\theta_{i,j}[k]$ denotes the k-th largest weight among the sorted mixed-operation weights.
For the first three nodes in the cell, the number of predecessor nodes is at most 2; therefore, in adaptive downsampling, these nodes are downsampled in the same way as in DARTS, that is, the candidate operations corresponding to $\theta_{i,j}[0]$ and $\theta_{i,j}[1]$ are retained. For the 4th to 6th nodes in the cell, if the ratio between the weight of an unsampled predecessor node and the minimum weight of the sampled predecessor nodes is greater than γ, the importance of the unsampled node is comparable to that of the sampled nodes; in such cases, the unsampled node with the highest weight is additionally selected, and the process repeats until the condition no longer holds.
Algorithm 1 Adaptive downsampling strategy
Input: Mixed operation weights α
1: if end of search then
2:    Regularize the optimal weights $\alpha_{i,j}^{best}$ of the mixed operations on all edges (i, j) as $\theta_{i,j}$
3:    Traverse each node j:
4:    if number of predecessor nodes > 2 then
5:       Sort $\theta_{i,j}$ and iterate through it:
6:       if $\theta_{i,j}[k] / \theta_{i,j}[2] > \gamma$ then
7:          Preserve the operation corresponding to $\theta_{i,j}[k]$
8:       end if
9:    end if
10: end if
Output: Search results θ
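For readers who prefer code, the sketch below renders Algorithm 1 in Python. It follows the prose description, in which an extra predecessor edge is kept whenever its regularized weight is within a factor γ of the smallest weight among the two edges already sampled (i.e., the second-largest weight); the data layout, a dictionary of per-node edge weights, is an assumption.

```python
def adaptive_downsample(theta: dict, gamma: float = 0.8):
    """Illustrative reconstruction of Algorithm 1.

    `theta` maps each node j to a dict {predecessor i: regularized best-operation
    weight}. Nodes with at most two predecessors keep the top-two edges, as in
    DARTS; otherwise additional edges are kept while their weight stays within a
    factor gamma of the minimum weight among the sampled edges."""
    kept = {}
    for j, edge_weights in theta.items():
        ranked = sorted(edge_weights.items(), key=lambda kv: kv[1], reverse=True)
        selected = [i for i, _ in ranked[:2]]          # top-two edges, as in DARTS
        if len(ranked) > 2:
            reference = ranked[1][1]                   # minimum weight among sampled edges
            for i, w in ranked[2:]:
                if reference > 0 and w / reference > gamma:
                    selected.append(i)                 # comparably important -> also keep
                else:
                    break
        kept[j] = selected
    return kept

# Example: node 4 has one extra predecessor whose weight is close to the top two.
theta = {3: {0: 0.6, 1: 0.4}, 4: {0: 0.50, 1: 0.45, 2: 0.40, 3: 0.10}}
print(adaptive_downsample(theta))  # {3: [0, 1], 4: [0, 1, 2]}
```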
Based on the proposed DAS-Block-α search space, adaptive downsampling strategy, weight-redistribution continuation strategy, and dynamic sampler, this paper proposes a low-memory architecture search algorithm for densely connected networks (Algorithm 2).
Algorithm 2 LMD-DARTS
Input: Image classification dataset, lightweight search space based on DAS-Block-α
1: Assign edge weights $\beta_{i,j}$ and candidate operation weights $\alpha_{i,j}$ for each edge (i, j) and its mixed operation $f_{i,j}(\cdot)$
2: Make the search space continuous according to Equation (3)
3: Form the supernet by stacking normal cells and reduction cells
4: while the search is not complete do
5:    Subsample feature maps according to Equation (4)
6:    Update the network architecture parameters α and β using approximate gradients and finite differences
7:    Use gradient descent on $\nabla_{\omega} L_{\mathrm{train}}(\omega, \alpha)$ to update the network weights ω
8:    Prune with the dynamic sampler according to Equation (6)
9: end while
10: Sample the supernet with the adaptive downsampling strategy
Output: Network architecture α

4. Experiment

4.1. Evaluation Criteria

In order to comprehensively evaluate a model, multiple evaluation metrics, including accuracy metrics based on the confusion matrix, are used for the image classification task.
(1) Parameters measure the storage space consumed by a convolutional neural network. Fewer parameters mean less memory usage, allowing larger batch sizes during training and thus reducing the training time. The parameters mainly come from the convolutional layers, the fully connected layers, the fully connected layers in the squeeze-and-excitation module, and the index layer of the learned group convolution, with the convolutional layers contributing the most. The units M and G denote one million and one billion parameters, respectively.
(2) GPU days measure the complexity of a neural architecture search algorithm, indicating the number of days the algorithm needs to complete its search on a single GPU. For example, if an algorithm searches for three days on four GPUs, this is reported as 12 GPU days. For faster searches, GPU hours can be used instead.
Additionally, top-one accuracy, top-five accuracy, and floating-point operations (FLOPs) are also used to evaluate the performance of the model.

4.2. Compared Methods

In this section, we compare our LMD-DARTS method with various NAS technologies, highlighting the advantages of our approach in terms of search speed and overall performance. These methods include reinforcement learning-based (RL) search methods, evolutionary algorithm-based (EA) search methods, and gradient-based search strategies. Each has been applied to neural architecture search tasks with outstanding results, providing efficient and scalable solutions for various applications, such as image classification, object detection, and other computer vision tasks.
Reinforcement learning-based search: These methods use a reinforcement learning agent to guide the search toward high-performing architectures. Examples in this category include NASNet [10], ENAS [34], Beta-DARTS [35], and Bandit-NAS [36]. These approaches enhance the robustness of the search process.
Evolutionary algorithm-based search: These methods evolve a population of architectures in a discrete search space and can handle large-scale NAS tasks effectively. AmoebaNet [5] falls into this category and performs well on tasks involving discrete search spaces.
Gradient-based search: This method optimizes architecture by iteratively adjusting parameters in the gradient direction derived from the differentiable objective function. CDARTS [37], XNAS [38], DARTS [17], PC-DARTS [19], EPC-DARTS [39], SWD-NAS [40], IS-DARTS [41], VNAS [42], EfficientNet [43], and Shapley-NAS [44] belong to this category. Compared to other methods, gradient-based NAS approaches effectively mitigate performance collapse issues.

4.3. Experimental Setting

During the search phase, an NVIDIA Tesla V100 GPU with 32 GB of memory was used, and a network of eight cells (six normal cells and two reduction cells) was stacked to speed up the search. The learning rate was initialized to 0.1 and adjusted automatically with a cosine annealing schedule. To further improve training stability, we initialized the candidate operation weights to equal values and fixed the architecture parameters α for the first 15 epochs, training and updating only the network parameters w to stabilize the initial search network. During the first 25 epochs, softmax was used to continualize the search space and regularize the edges; in the subsequent search, WRD-softmax was employed for both, with candidate operations pruned every five epochs. In the adaptive downsampling strategy, the hyperparameter γ was set to 0.8, the channel downsampling ratio K was set to 4, and the hyperparameter t in WRD-softmax was set to 0.5. These search-phase settings are collected in the configuration sketch below.
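The dictionary below simply restates the settings listed above in one place; optimizer details beyond those stated in the text are intentionally omitted.

```python
# Search-phase settings of LMD-DARTS as reported in the text (nothing added).
search_config = {
    "gpu": "Tesla V100 (32 GB)",
    "cells": {"total": 8, "normal": 6, "reduction": 2},
    "initial_lr": 0.1,
    "lr_schedule": "cosine annealing",
    "warmup_epochs_fixed_alpha": 15,   # only network weights w are updated
    "plain_softmax_epochs": 25,        # plain softmax before switching to WRD-softmax
    "prune_every_epochs": 5,           # dynamic-sampler pruning interval
    "gamma": 0.8,                      # adaptive downsampling threshold
    "channel_sampling_K": 4,           # partial channel connection ratio 1/K
    "wrd_softmax_t": 0.5,
}
```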
In the network evaluation phase, a network of 20 cells (18 normal cells and 2 reduction cells) was stacked and trained for 600 epochs on the CIFAR-10 dataset, and the best result was reported.

4.4. Experimental Result

As the search progressed, the weights of the convolutional neural network gradually moved from a uniform distribution to a normal distribution with a mean of 0; to handle negative inputs, d was set to 0.5. Table 1 shows the weight redistribution effect of WRD-softmax.
The softmax function maps the initial values almost uniformly to the corresponding probabilities. In WRD-softmax, the ReLU activation is applied to the candidate operation weights, strongly suppressing non-positive values. As t decreases, the weight redistribution effect of WRD-softmax becomes more pronounced. Compared with softmax, when t is 0.4 or 0.5, WRD-softmax maps the intermediate value X_2 to a similar value while appropriately reducing the mapping of the non-positive inputs; moreover, the weight is reassigned to X_0 and X_1, which have the larger initial values, so the redistribution effect is closer to the ideal.
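These rows can be checked numerically. The snippet below applies plain softmax and WRD-softmax (with t = 0.5 and d = 0.5, as in the experiments) to the initial values of Table 1 and reproduces the corresponding rows up to rounding.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.15, 0.10, 0.05, 0.0, -0.05, -0.10, -0.15])

print(F.softmax(x, dim=-1))
# ≈ [0.165, 0.157, 0.149, 0.142, 0.135, 0.129, 0.122]  (softmax row of Table 1)

print(F.softmax(torch.log(F.relu(x) + 0.5) / 0.5, dim=-1))
# ≈ [0.203, 0.173, 0.145, 0.120, 0.120, 0.120, 0.120]  (t = 0.5 row of Table 1)
```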
The search results of LMD-DARTS with the CIFAR-10 dataset are shown in Figure 3, where (a) represents a normal block, and (b) represents a reduction block. In a normal block, the candidate operation between node 0 and node 1 is finally determined as DAS-Block, demonstrating that the performance of DAS-Block in this structure is superior to that of other candidate structures. However, in both the normal cell and the reduction cell, no node has more than two predecessor nodes. This indicates that the weight of unsampled candidate operations is low enough to be negligible in the final decision of the whole network.
The search results of LMD-DARTS with the CIFAR-10 dataset were compared with other models, as shown in Table 2.
The classification accuracy of LMD-DARTS was higher than that of the manually designed CNNs, with 4.23 M parameters, roughly 60% of DenseNet's and approximately one-third of SE-Net's. Among the neural architecture search models based on gradient optimization, LMD-DARTS had the fastest search speed. Compared with DARTS, the accuracy of LMD-DARTS was increased by 0.23%, and the search speed was increased by 9.1 times. With accuracy close to that of PC-DARTS, the search speed was improved by 20%. Although LMD-DARTS showed a 0.89% reduction in accuracy compared to XNAS, which had the highest accuracy, its search speed was 1.73 times faster. Compared with AmoebaNet-B, based on an evolutionary algorithm, the search time of LMD-DARTS was only 1/26,250 of that of AmoebaNet-B with similar accuracy. It can also be seen that, when the adaptive downsampling module was removed, the third candidate operation was not sampled, so the parameter count and accuracy of the ablation test were similar to the original. Although the use of DAS-Block caused a certain increase in parameters, it also improved the accuracy of LMD-DARTS to some extent.
Figure 4 shows the search results of LMD-DARTS in the ImageNet dataset, where Figure 4a is a normal cell, and Figure 4b is a reduction cell.
The LMD-DARTS search results with the ImageNet dataset were compared with other models, as shown in Table 3. The classification accuracy of LMD-DARTS was clearly higher than that of some manually designed networks, such as VGG-16 and MobileNet V2, although it had 1.57 times more parameters than MobileNet V2. Compared to ResNet-101, which had the highest accuracy, the top-one classification accuracy of LMD-DARTS was lower by 4.93%, but its FLOPs were only about 1/13 of ResNet-101's. Among the network models designed via neural architecture search, the top-one classification accuracy of LMD-DARTS was 1.9% lower than that of EfficientNet, which had the highest classification accuracy, but its search time was far better than that of grid search, which had the lowest search efficiency. Among the neural architecture search models based on gradient optimization, LMD-DARTS had the highest accuracy. Compared with DARTS, LMD-DARTS improved the accuracy by 0.9% while increasing the number of parameters by 17% and reducing the search time by 27.5%. Compared to PC-DARTS, LMD-DARTS improved the accuracy by 0.3% with a 3.8% increase in the number of parameters.

4.5. Ablation Experiments

To verify the effectiveness and contribution of adaptive downsampling and DAS-Block, we designed ablation experiments that removed the adaptive downsampling module and the DAS-Block module separately on the CIFAR-10 dataset. The experimental results are shown in Table 4.

5. Discussion

The adaptive downsampling search algorithm used in LMD-DARTS can effectively reduce redundant connections. However, it may also eliminate useful connections, potentially affecting model performance. While LMD-DARTS reduces the search time, its performance gains may vary with task complexity and dataset size. Conducting a neural architecture search with large models therefore remains an intriguing challenge for future work. In the future, we will focus on optimizing computational efficiency to make LMD-DARTS applicable to larger models, and we will consider combining it with other optimization algorithms to expand its application to a broader range of scenarios, such as image enhancement and image restoration.

6. Conclusions

To optimize the search space, search strategy, and performance evaluation strategy of neural architecture search, we have proposed a low-memory, differentiable neural architecture search algorithm based on weight redistribution. For the search space, a neural architecture search space containing the adaptive subsampling DAS-Block was designed. In terms of the search strategy, we first introduced the principles of partial channel connection and edge regularization, and we then proposed a weight-redistribution relaxation strategy, WRD-softmax, to accelerate the convergence of the candidate operation weights and reduce the number of search iterations. Regarding the performance evaluation strategy, a low-memory search algorithm based on a dynamic sampler was proposed; during the search, candidate operations with poor performance are continuously pruned to reduce the complexity of a single search. Finally, experiments on the CIFAR-10 and ImageNet datasets verified that the lightweight convolutional neural networks obtained by the search perform well.

Author Contributions

Methodology, Z.L.; validation, X.X. and Y.X.; investigation, R.S.; writing—original draft preparation, P.Y.; writing—review and editing, Z.L. and H.C.; supervision, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 61976217), the Fundamental Research Funds of Central Universities (No. 2019XKQYMS87), and the Science and Technology Planning Project of Xuzhou (No. KC21193).

Data Availability Statement

The data presented in this study are available in this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cong, S.; Zhou, Y. A review of convolutional neural network architectures and their optimizations. Artif. Intell. Rev. 2023, 56, 1905–1969. [Google Scholar] [CrossRef]
  2. Xie, X.; Song, X.; Lv, Z.; Yen, G.G.; Ding, W. Efficient Evaluation Methods for Neural Architecture Search: A Survey. arXiv 2023, arXiv:2301.05919. [Google Scholar]
  3. Tian, S. Research on Neural Architecture Automatic Search and Neural Network Acceleration Technology; National University of Defense Technology: Changsha, China, 2021. (In Chinese) [Google Scholar]
  4. Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  5. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 4780–4789. [Google Scholar]
  6. Baker, B.; Gupta, O.; Naik, N.; Raskar, R. Designing neural network architectures using reinforcement learning. arXiv 2016, arXiv:1611.02167. [Google Scholar]
  7. Zhong, Z.; Yan, J.; Wu, W.; Shao, J.; Liu, C.L. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2423–2432. [Google Scholar]
  8. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710. [Google Scholar]
  9. Cai, H.; Chen, T.; Zhang, W.; Yu, Y.; Wang, J. Efficient architecture search by network transformation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2787–2794. [Google Scholar]
  10. Qin, X.; Wang, Z. Nasnet: A neuron attention stage-by-stage net for single image deraining. arXiv 2019, arXiv:1912.03151. [Google Scholar]
  11. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2820–2828. [Google Scholar]
  12. Xie, L.; Yuille, A. Genetic CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1379–1388. [Google Scholar]
  13. Suganuma, M.; Shirakawa, S.; Nagao, T. A genetic programming approach to designing convolutional neural network architectures. In Proceedings of the Genetic and Evolutionary Computation Conference, Berlin, Germany, 15–19 July 2017; pp. 497–504. [Google Scholar]
  14. Cubuk, E.D.; Zoph, B.; Schoenholz, S.S.; Le, Q.V. Intriguing properties of adversarial examples. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  15. Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; Kavukcuoglu, K. Hierarchical representations for efficient architecture search. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  16. So, D.; Le, Q.; Liang, C. The evolved transformer. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 5877–5886. [Google Scholar]
  17. Liu, H.; Simonyan, K.; Yang, Y. Darts: Differentiable architecture search. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  18. Cai, H.; Zhu, L.; Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  19. Xu, Y.; Xie, L.; Zhang, X.; Chen, X.; Qi, G.J.; Tian, Q.; Xiong, H. Pc-darts: Partial channel connections for memory-efficient architecture search. arXiv 2019, arXiv:1907.05737. [Google Scholar]
  20. Chen, X.; Xie, L.; Wu, J.; Tian, Q. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1294–1303. [Google Scholar]
  21. Zheng, X.; Ji, R.; Tang, L.; Zhang, B.; Liu, J.; Tian, Q. Multinomial distribution learning for effective neural architecture search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1304–1313. [Google Scholar]
  22. Hundt, A.; Jain, V.; Hager, G.D. sharpdarts: Faster and more accurate differentiable architecture search. arXiv 2019, arXiv:1903.09900. [Google Scholar]
  23. Wang, H.; Wang, Y.; Sun, R.; Li, B. Global convergence of maml and theory-inspired neural architecture search for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9797–9808. [Google Scholar]
  24. Xue, Y.; Qin, J. Partial connection based on channel attention for differentiable neural architecture search. IEEE Trans. Ind. Inform. 2023, 19, 6804–6813. [Google Scholar] [CrossRef]
  25. Huang, H.; Shen, L.; He, C.; Dong, W.; Liu, W. Differentiable neural architecture search for extremely lightweight image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 2672–2682. [Google Scholar] [CrossRef]
  26. Luo, X.; Liu, D.; Kong, H.; Huai, S.; Chen, H.; Liu, W. Surgenas: A comprehensive surgery on hardware-aware differentiable neural architecture search. IEEE Trans. Comput. 2023, 72, 1081–1094. [Google Scholar] [CrossRef]
  27. Li, Y.; Li, S.; Yu, Z. DARTS-PAP: Differentiable neural architecture search by polarization of instance complexity weighted architecture parameters. In Proceedings of the International Conference on Multimedia Modeling, Bergen, Norway, 9–12 January 2023; Springer Nature: Cham, Switzerland, 2023; pp. 277–288. [Google Scholar]
  28. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  29. Zhang, H.; Li, Y.; Chen, H.; Gong, C.; Bai, Z.; Shen, C. Memory-efficient hierarchical neural architecture search for image restoration. Int. J. Comput. Vis. 2022, 130, 157–178. [Google Scholar] [CrossRef]
  30. Priyadarshi, S.; Jiang, T.; Cheng, H.P.; Krishna, S.; Ganapathy, V.; Patel, C. DONNAv2-Lightweight Neural Architecture Search for Vision tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 1384–1392. [Google Scholar]
  31. Liu, C.; Chen, L.C.; Schroff, F.; Adam, H.; Hua, W.; Yuille, A.L.; Fei-Fei, L. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 82–92. [Google Scholar]
  32. Mandal, M.; Meedimale, Y.R.; Reddy, M.S.K.; Vipparthi, S.K. Neural architecture search for image dehazing. IEEE Trans. Artif. Intell. 2022, 4, 1337–1347. [Google Scholar] [CrossRef]
  33. Liu, Y.; Yan, Z.; Tan, J.; Li, Y. Multi-purpose oriented single nighttime image haze removal based on unified variational retinex model. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1643–1657. [Google Scholar] [CrossRef]
  34. Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient neural architecture search via parameters sharing. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4095–4104. [Google Scholar]
  35. Ye, P.; Li, B.; Li, Y.; Chen, T.; Fan, J.; Ouyang, W. b-darts: Beta-decay regularization for differentiable architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10864–10873. [Google Scholar]
  36. Lin, Y.; Endo, Y.; Lee, J.; Kamijo, S. Bandit-NAS: Bandit sampling and training method for Neural Architecture Search. Neurocomputing 2024, 597, 127684. [Google Scholar] [CrossRef]
  37. Yu, H.; Peng, H.; Huang, Y.; Fu, J.; Du, H.; Wang, L.; Ling, H. Cyclic differentiable architecture search. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 211–228. [Google Scholar] [CrossRef] [PubMed]
  38. Nayman, N.; Noy, A.; Ridnik, T.; Friedman, I.; Jin, R.; Zelnik, L. Xnas: Neural architecture search with expert advice. Adv. Neural Inf. Process. Syst. 2019, 32, 1975–1985. [Google Scholar]
  39. Cai, Z.; Chen, L.; Liu, H.L. EPC-DARTS: Efficient partial channel connection for differentiable architecture search. Neural Netw. 2023, 166, 344–353. [Google Scholar] [CrossRef] [PubMed]
  40. Xue, Y.; Han, X.; Wang, Z. Self-Adaptive Weight Based on Dual-Attention for Differentiable Neural Architecture Search. IEEE Trans. Ind. Inform. 2024, 20, 6394–6403. [Google Scholar] [CrossRef]
  41. He, H.; Liu, L.; Zhang, H.; Zheng, N. IS-DARTS: Stabilizing DARTS through Precise Measurement on Candidate Importance. In Proceedings of the AAAI Conference on Artificial Intelligence, Stanford, CA, USA, 25–27 March 2024; pp. 12367–12375. [Google Scholar]
  42. Ma, B.; Zhang, J.; Xia, Y.; Tao, D. VNAS: Variational Neural Architecture Search. Int. J. Comput. Vis. 2024, 1–25. [Google Scholar] [CrossRef]
  43. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  44. Xiao, H.; Wang, Z.; Zhu, Z.; Zhou, J.; Lu, J. Shapley-NAS: Discovering operation contribution for neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11892–11901. [Google Scholar]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  46. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  47. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  48. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  49. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
Figure 1. Stacking rules of cells.
Figure 2. Search space with DAS-Block-α.
Figure 3. Search result with CIFAR-10. (a) is a normal block, and (b) is a reduced-dimension block. A cell is a directed, acyclic graph consisting of an ordered sequence of N nodes. Each node is a feature map in convolutional networks.
Figure 4. Search result with ImageNet. (a) is a normal block, and (b) is a reduced-dimension block. A cell is a directed, acyclic graph consisting of an ordered sequence of N nodes. Each node is a feature map in convolutional networks.
Table 1. Comparison of Different Continuation Strategies.
Strategy | X_0 | X_1 | X_2 | X_3 | X_4 | X_5 | X_6
Initial value | 0.15 | 0.1 | 0.05 | 0 | −0.05 | −0.1 | −0.15
Softmax | 0.165 | 0.157 | 0.150 | 0.142 | 0.135 | 0.129 | 0.122
WRD-softmax (t = 0.9) | 0.174 | 0.160 | 0.145 | 0.130 | 0.130 | 0.130 | 0.130
WRD-softmax (t = 0.8) | 0.179 | 0.162 | 0.145 | 0.129 | 0.129 | 0.129 | 0.129
WRD-softmax (t = 0.7) | 0.184 | 0.164 | 0.145 | 0.127 | 0.127 | 0.127 | 0.127
WRD-softmax (t = 0.6) | 0.192 | 0.168 | 0.145 | 0.124 | 0.124 | 0.124 | 0.124
WRD-softmax (t = 0.5) | 0.203 | 0.173 | 0.145 | 0.120 | 0.120 | 0.120 | 0.120
WRD-softmax (t = 0.4) | 0.220 | 0.180 | 0.145 | 0.114 | 0.114 | 0.114 | 0.114
WRD-softmax (t = 0.3) | 0.250 | 0.191 | 0.143 | 0.104 | 0.104 | 0.104 | 0.104
WRD-softmax (t = 0.2) | 0.314 | 0.210 | 0.136 | 0.085 | 0.085 | 0.085 | 0.085
WRD-softmax (t = 0.1) | 0.519 | 0.233 | 0.098 | 0.038 | 0.038 | 0.038 | 0.038
Table 2. Results with the CIFAR-10 dataset.
Model | Paper Results/% | Our Impl./% | Parm/M | Search Time/GPU Days | Search Strategy
ResNet-50 [45] | 93.03 | - | 25.56 | - | Manual
DenseNet121 [46] | 94.04 | - | 6.96 | - | Manual
SENet [47] | 95.23 | - | 11.2 | - | Manual
NASNet [8] | 97.35 | 97.33 | 3.3 | 1800 †† | RL
ENAS [34] | 97.11 | 97.14 | 4.6 | 0.45 | RL
Beta-DARTS [35] | 97.49 | - | 3.78 | 0.4 # | RL
CDARTS [37] | 97.52 | - | 3.98 | 0.3 * | Gradient
AmoebaNet-B [5] | 97.42 | 97.47 | 3.2 | 3150 ‡‡ | EA
XNAS [38] | 98.2 | 97.45 | 3.79 | 0.3 | Gradient
DARTS [17] | 97.08 | 97.09 | 4.38 | 1.0 | Gradient
PC-DARTS [19] | 97.36 | 97.18 | 3.98 | 0.15 | Gradient
EPC-DARTS [39] | 97.6 | - | 3.2 | 0.2 | Gradient
SWD-NAS [40] | 97.49 | - | 3.17 | 0.13 | Gradient
IS-DARTS [41] | 97.6 | - | 4.47 | 0.42 | Gradient
Bandit-NAS [36] | 97.06 | - | 3.4 | 0.3 # | RL
VNAS [42] | 97.69 | - | 3.5 | 0.3 * | Gradient
LMD-DARTS a | 97.34 | - | 4.25 | 0.12 | Gradient
LMD-DARTS b | 97.2 | - | 4.03 | 0.11 | Gradient
LMD-DARTS | 97.42 | - | 4.23 | 0.12 | Gradient
a LMD-DARTS model with the adaptive downsampling module removed. b LMD-DARTS model with the DAS-Block module removed. # Recorded on GTX 1080Ti GPU. * Recorded on Tesla V100 GPU. Recorded on RTX 3090Ti GPU. Recorded on GTX 3090 GPU. †† Results from the original paper using NVIDIA P100 GPU despite replication efforts. ‡‡ Results from the original paper using Tesla K40 GPU despite replication efforts.
Table 3. Results with the ImageNet dataset.
Model | Top-1 ACC/% | Top-5 ACC/% | Parm/M | FLOPs | Search Cost/GPU Days | Search Method
VGG-16 [48] | 71.93 | 90.67 | 138.36 | 15.48 G | - | Manual
ResNet-101 [45] | 80.13 | 95.4 | 44.55 | 7.83 G | - | Manual
MobileNet V2 [49] | 71.8 | 91 | 3.5 | 0.3 G | - | Manual
EfficientNet [43] | 77.1 | 93.3 | 5.3 | 399 M | - | Grid Search
NASNet [8] | 74 | 91.6 | 5.3 | 564 M | 1800 †† | RL
DARTS [17] | 74.3 | 91.3 | 4.7 | 574 M | 4 | Gradient
PC-DARTS [19] | 74.9 | 92.2 | 5.3 | 586 M | 3.8 | Gradient
Shapley-NAS [44] | 76.1 | - | 5.4 | 582 M | 4.2 # | Gradient
VNAS [42] | 76.3 | 92.9 | 5.4 | 599 M | 5 * | Gradient
LMD-DARTS | 75.2 | 93.2 | 5.5 | 602 M | 2.9 | Gradient
†† Results from the original paper using NVIDIA P100 GPU despite replication efforts. # Recorded on GTX 1080Ti GPU. * Recorded on Tesla V100 GPU.
Table 4. Ablation experiment results with the CIFAR-10 dataset.
Module | Results/% | Parm/M
Adaptively downsampling | 97.08 | 4.38
DAS-Block | 97.36 | 3.98
LMD-DARTS | 97.42 | 4.23

