Article

TA-DARTS: Temperature Annealing of Discrete Operator Distribution for Effective Differential Architecture Search

1 Department of Computer Engineering, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan 47011, Republic of Korea
2 Super Computing Cloud Center, Korea Institute of Science and Technology Information, Daejeon 34141, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10138; https://doi.org/10.3390/app131810138
Submission received: 5 July 2023 / Revised: 21 August 2023 / Accepted: 2 September 2023 / Published: 8 September 2023
(This article belongs to the Special Issue Recent Advances in Automated Machine Learning)

Abstract

In the realm of machine learning, the optimization of hyperparameters and the design of neural architectures entail laborious and time-intensive endeavors. To address these challenges, considerable research effort has been directed towards Automated Machine Learning (AutoML), with a focus on enhancing these inherent inefficiencies. A pivotal facet of this pursuit is Neural Architecture Search (NAS), a domain dedicated to the automated formulation of neural network architectures. Given the pronounced impact of network architecture on neural network performance, NAS techniques strive to identify architectures that can manifest optimal performance outcomes. A prominent algorithm in this area is Differentiable Architecture Search (DARTS), which transforms discrete search spaces into continuous counterparts using gradient-based methodologies, thereby surpassing prior NAS methodologies. Notwithstanding DARTS’ achievements, a discrepancy between discrete and continuously encoded architectures persists. To ameliorate this disparity, we propose TA-DARTS in this study—a temperature annealing technique applied to the Softmax function, utilized for encoding the continuous search space. By leveraging temperature values, architectural weights are judiciously adjusted to alleviate biases in the search process or to align resulting architectures more closely with discrete values. Our findings exhibit advancements over the original DARTS methodology, evidenced by a 0.07%p enhancement in validation accuracy and a 0.16%p improvement in test accuracy on the CIFAR-100 dataset. Through systematic experimentation on benchmark datasets, we establish the superiority of TA-DARTS over the original mixed operator, thereby underscoring its efficacy in automating neural architecture design.

1. Introduction

With the rapid advancement of artificial intelligence (AI) technologies, the demand for constructing AI models has surged [1,2,3,4,5]. Concurrently, the endeavor of devising neural architectures within machine learning has emerged as a time-intensive and laborious task, mainly due to the inherent challenge of attaining optimal neural architectures solely through expert insights [1,5,6]. More precisely, the pursuit of optimal or sufficiently effective architectures frequently entails considerable time and effort. The complexity stems from the inherent diversity of plausible neural architectures for a given task, necessitating recourse to a brute-force methodology involving training for each architecture to pinpoint the optimal configuration [1,2,3,5,6]. This backdrop has spurred a surge in AutoML research, particularly within the domain of Neural Architecture Search (NAS), which strives to streamline the exploration of efficient architectures [4,6]. Given the multifaceted nature of neural architectures, their formative influence on model performance is profound. Consequently, a mere embrace of automated construction proves inadequate; the imperative is a systematic methodology to prospect for an optimal, or nearly optimal, model configuration.
Many approaches have been explored for the efficient search of optimal neural architectures [2,7,8,9]. One main difficulty of NAS is that previous algorithms have performed searches over discrete spaces, which stems from the nature of the NAS problem itself. In NAS, one of the main components is an operator. A neural architecture can contain many different operators, and NAS algorithms evaluate, select, and combine these operators to construct the optimal architecture. Combining different operators involves selecting an appropriate operator from a discrete operator pool, and thus it is a search over a discrete search space. After the NAS problem was raised, evolutionary algorithms [2,7] and reinforcement learning algorithms [8,9] were proposed and achieved considerable performance improvements. Recently, the differentiable architecture search (DARTS) algorithm has been in the spotlight because it reduced the NAS problem to a differentiable architecture search problem [10]. DARTS searches for the optimal operator by computing a linear weighted sum over the diverse operators that comprise the discrete search space. The operator weights, normalized with the Softmax function, constitute a continuous space and enable backpropagation.
In the realm of Neural Architecture Search, the gradient-based approach encompasses the automatic generation of neural architectures, including sequences of layers as well as optimal architectural choices tailored to specific datasets. Consequently, even configurations akin to residual networks [11] can be accommodated. To determine the optimal configuration for a skip connection, our method involves searching through connections linking preceding and subsequent layers. While alternate approaches to building search models may exist, our work follows the One-Shot NAS algorithm framework [12], which entails creating a supernet where results from all potential operators are aggregated. DARTS facilitates the determination of optimal operators for individual inter-layer connections by parameterizing them through a Softmax function, thereby ensuring the validity of updates through backpropagation.
DARTS applies a weighting scheme to a discrete operator pool to compose a continuous search space. However, it has to make a “discrete” choice of the operator from the encoded architecture where each operator has its weight, which shows a limitation of DARTS. It is suggested that this limitation could be relieved by temperature annealing of Softmax [13]. However, the actual design, implementation, and analysis of temperature annealing for this limitation have yet to be thoroughly explored. In this paper, we performed a systematic analysis of the temperature annealing of Softmax to illustrate the feasibility of relieving this limitation.
DARTS (differentiable architecture search) has gained popularity [14,15,16] as a gradient-based approach that achieved state-of-the-art NAS performance on datasets such as Penn Treebank, CIFAR-10, and ImageNet. DARTS constructs models through a supernetwork, which searches for the optimal architecture. Unlike other NAS algorithms that have relied on RL-based or evolutionary approaches, DARTS parameterizes the model’s layers to make them differentiable, allowing the supernetwork to construct the optimal model in a gradient-based manner [12,17]. As explained earlier, to evaluate the candidate operators for each layer, a weighted sum is performed over the operators, with the weights normalized by a single Softmax function. This is referred to as a mixed operator, and it is used during the search instead of selecting a single operator for each layer. Despite the success of DARTS in achieving high performance through a gradient-based approach, several challenges remain to be addressed.
The α parameter represents the weights of each operator at a specific edge. When the operator weights pass through the Softmax function at each edge (i.e., layer), yielding β, these values can be seen as the contributions of each operator to that particular edge. Each operator is treated like a one-hot encoded label. Therefore, in DARTS, when selecting the final architecture, only the operator with the highest β value is chosen, and the rest are discarded. During the search process, the β values are real numbers between 0 and 1, and the edge output is the weighted sum of each operator’s result. However, only one operator is used during testing, leading to architectural discrepancies between the search-time and final models. This is known as the discrepancy problem. To mitigate this issue, we employ Softmax temperature annealing.
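To make the discrepancy concrete, the following minimal sketch (an illustration in PyTorch-style code under our own naming, not the reference implementation) contrasts the continuous mixed output used during the search with the discrete argmax selection used when the final architecture is derived.

```python
import torch
import torch.nn.functional as F

# Hypothetical architecture weights (alpha) for one edge with 8 candidate operators.
alpha = torch.randn(8)

# Continuous encoding used during the search: beta = Softmax(alpha), summing to 1.
beta = F.softmax(alpha, dim=0)

def mixed_output(op_outputs, beta):
    # Search-time edge output: weighted sum over ALL candidate operator outputs.
    return sum(b * out for b, out in zip(beta, op_outputs))

# Derivation-time choice: keep ONLY the operator with the largest beta, discard the rest.
chosen_index = torch.argmax(beta).item()
# If beta is nearly uniform (e.g., its maximum is only ~0.15), the single retained operator
# represents a small fraction of what the supernetwork actually computed during the search;
# this gap between the two encodings is the discrepancy problem.
```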

2. Related Work

Our method transforms the Softmax function to adjust the distribution of the architecture parameters (i.e., β). Previously published methods include using L1 regularization instead of Softmax to relax the unfair competition among operators [18], or applying the Gumbel distribution to the Softmax function to sample the architecture parameters (i.e., α) into binary codes [19]. However, these approaches are not sufficient for addressing the discrepancy problem. Viewed from the perspective that the discrepancy stems from the discrete selection of architecture parameters in the final decision-making step, they nonetheless head in the right direction and can contribute to improving the performance of DARTS. Although our approach shares similarities with the idea of discretization, instead of discretizing architectures and then ensembling them, we use the temperature to push the architecture parameters toward values close to 0 and 1, creating an environment similar to the final selection while allowing backpropagation to proceed effectively.

2.1. EDD-DARTS

EDD-DARTS [20] is an algorithm that sought to address the discrepancy problem of DARTS with an approach closely aligned with ours. Its authors pointed out the difficulty of adjusting the gradient of the α values solely through temperature control. Consequently, instead of adjusting the temperature, they opted to modify e^{α/t}, a component of the Softmax function. At each epoch, they computed the gradient of β with respect to e^{α/t} and adjusted the β values accordingly; in essence, this is a direct intervention in the transition from α to β performed by the Softmax operation. The adjustment also incorporated a scaling factor to enable effective scaling. In addition, they introduced an adaptive temperature scheme called entropy-based dynamic decay (EDD). Unlike conventional temperature adjustments, their approach modifies the e^{α/t} value itself, and the extent of the adjustment is determined by the entropy of β at each epoch: the higher the entropy, the more significant the adjustment of e^{α/t}.
This approach gradually produces α values that approximate discrete values as the search advances. However, it was designed to keep the temperature consistently below 1, whereas our approach can also accommodate temperatures above 1. At temperatures higher than 1, intense competition among operators ensues; in the absence of such competition, some connections can be established without undergoing sufficient exploration. In practice, it is not uncommon for a single early-dominant operator to persist and eventually be selected, meaning that other operators are deprived of the opportunity to exhibit their efficacy. To ensure a comprehensive search, our approach therefore allows a strategy in which the search begins with a temperature higher than 1 and gradually transitions to a temperature below 1 as the search concludes, facilitating the derivation of a discrete architecture.
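As an illustration of such a schedule, the temperature could, for example, be decayed exponentially from a value above 1 to a value below 1 over the course of the search; the schedule form and the constants below are illustrative assumptions rather than the exact configuration used in our experiments.

```python
def temperature_schedule(epoch, total_epochs, t_start=5.0, t_end=0.1):
    """Decay the Softmax temperature from t_start (>1, relaxed competition among operators)
    to t_end (<1, near-discrete beta values) as the search progresses."""
    ratio = epoch / max(total_epochs - 1, 1)
    return t_start * (t_end / t_start) ** ratio

# Example: a 50-epoch search starts at T = 5.0 and ends at T = 0.1.
temperatures = [temperature_schedule(e, 50) for e in range(50)]
```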

2.2. Gumbel-Softmax

DARTS with Ensemble Gumbel-Softmax [19] applied the Gumbel distribution to the Softmax function in order to sample binary variables from the architecture distribution. The authors then replaced the discontinuous argmax with Softmax to obtain the Gumbel-Softmax. The architectures generated in this manner are ensembled to enhance sampling performance. A temperature is also applied in this process, but only as a single hyperparameter: it is not used to regulate the competition among architecture parameters during the search; rather, the focus is on sampling with binary codes, and ensembling the sampled architectures improves performance. The reported results show that the proposed approach achieves performance similar to DARTS on the CIFAR-10 dataset while reducing the search cost from 4 GPU days to 1.5 GPU days.

2.3. Fair DARTS

Fair-DARTS [18] alleviates the inherent competition among input values in the Softmax function by using the Sigmoid function, which expands the range of the solution set.
Fair-DARTS argues that the Softmax function induces unfair, monopolistic competition, in which one or two operators dominate a specific edge during the search process. As a solution, Fair-DARTS replaces the Softmax function with the Sigmoid function and adds Gaussian noise to disrupt the advantages enjoyed under the Softmax function. Furthermore, the paper challenges the fundamental assumption of DARTS that the continuously encoded architecture solutions are similar to the one-hot encoded architectures, suggesting that a smaller discrepancy between these representations leads to more consistent performance in the resulting architectures. To address the discrepancy between the continuously encoded architecture and the discrete architecture obtained from it, the paper introduces a zero-one auxiliary loss, similar to L1 regularization, that pushes the architecture weights toward the extreme values of 0 or 1. Consequently, Fair-DARTS surpasses other NAS algorithms such as P-DARTS, PC-DARTS, SNAS, GDAS, and FBNet-C, with a top-1 accuracy of 75.6% on the ImageNet dataset. Although this approach minimizes the discrepancy, it results in a form where only one extreme option is selected because the architecture weights are encoded discontinuously. In this study, competition is gradually relaxed as the search progresses, creating a natural competitive environment while ultimately shaping the resulting architecture toward the desired form.

2.4. β -DARTS

β-DARTS [21] leverages domain adaptation techniques from AdaptNAS to improve the generalization performance of DARTS while also introducing a proper regularization method for the DARTS algorithm. To ensure correct normalization of the mixed edges in DARTS, β-DARTS emphasizes the need for a mapping function within the Softmax function that adjusts the architecture parameters α without being influenced by their magnitudes. The authors define the outputs of the Softmax function applied to the α values as β and propose a new regularization method, called beta-decay regularization, which normalizes these values to have a small variance around their mean, thereby achieving an effect similar to weight decay. As a result, β-DARTS achieves state-of-the-art (SOTA) performance on the CIFAR-100 dataset of NAS-Bench-201 with a validation accuracy of 73.49% and a test accuracy of 73.51%. On the CIFAR-10 dataset, it outperforms models that use the second-order approximation, with a superior accuracy of 97.47%, while maintaining a speed similar to models using the first-order approximation of DARTS.

2.5. ProxylessNAS

ProxylessNAS [22] employs a top-level network (SuperNet) that converges to the searched model within a single training process through backpropagation, similar to One-Shot NAS and DARTS. However, it introduces binary edges to significantly reduce the required memory, addressing the excessive parameterization of previous approaches. On the CIFAR-10 dataset, it achieved a test error rate of 2.08%, improving upon ENAS [8] (a test error rate of 2.83%), DARTS (2.83%), and AmoebaNet-B (2.13%).

3. Materials and Methods

3.1. Search Space

NAS is the problem of choosing the neural architecture with the highest performance for a given dataset. Many different operators can be candidates in neural architecture construction, and these operators can be represented as edges in a directed acyclic graph (DAG), as shown in Figure 1.
In Figure 1, a node denotes the input/output of a simplified search architecture. The output of x_i, computed with every operator, becomes the input of the next layer, x_{i+1}. In practice, several x_i can be connected to x_{i+1}; therefore, the following equations denote a given input node as i and an output node as j. In the design of TA-DARTS, we optimize the selection mechanism among the various operators in the pool. Since the operator selection mechanism constructs a discrete search space, evolutionary algorithms and reinforcement learning algorithms were previously adopted [2,7,8,9]. However, those methods required tremendous computing resources, such as more than 2000 GPU days. In contrast, DARTS achieved gradient-based optimization by encoding the discrete search space into a continuous one.

3.2. Continuous Relaxation

To construct neural architectures automatically, NAS algorithms must perform operator selection in the search space. Each operator from the operator pool is denoted as o. The task is to search for the best operator on every edge from node i to node j, where the search is performed over a discrete space that can contain multiple operators. Thus, the inputs/outputs and operators in NAS are defined as in Equation (1) [18].
x_j = \sum_{i < j} o^{(i,j)}(x_i)
DARTS employs the notion of “mixed operators” and represents the mixture over a set of edges, denoted as \bar{o}^{(i,j)} in Equation (2) [10].
\bar{o}^{(i,j)}(x_i) = \sum_{o \in O} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in O} \exp\left(\alpha_{o'}^{(i,j)}\right)}\, o(x_i)
Given the input x_i to each operator o^{(i,j)} between nodes i and j, the result is defined as the weighted sum of the outputs of the operator pool. DARTS adopts the Softmax function to encode these mixed operators of the discrete search space into a mixture of candidate edges in continuous space. As shown in Equation (2) [10], the output of each operator in the pool is multiplied by the weight obtained by passing the parameter α through the Softmax function, and the weighted outputs are summed. The mixed operator is denoted as \bar{o}.
For every edge between nodes x_i and x_j, the mixed operator \bar{o}^{(i,j)} is calculated. The mixed operator \bar{o}^{(i,j)} binds the operators with the architectural weights α^{(i,j)}. This mechanism parameterizes the importance of each operator by assigning it an architectural weight α, enabling backpropagation. Consequently, the set of α values together with the operators represents the continuously encoded architecture.
\mathrm{Softmax}\left(\alpha^{(i,j)}\right) \rightarrow \beta^{(i,j)}
where \beta^{(i,j)} = \left(\beta_{o_1}^{(i,j)}, \beta_{o_2}^{(i,j)}, \ldots, \beta_{o_M}^{(i,j)}\right) is the α vector normalized by the Softmax function, and \alpha^{(i,j)} = \left(\alpha_{o_1}^{(i,j)}, \alpha_{o_2}^{(i,j)}, \ldots, \alpha_{o_M}^{(i,j)}\right) is the architectural weight vector [18].
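Equations (2) and (3) can be read as the following sketch of a mixed operator (a simplified PyTorch-style module written for illustration; the candidate operators are left abstract, and the class is not taken from the DARTS reference code).

```python
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Weighted sum of all candidate operators on one edge (i, j), as in Equation (2)."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)  # the operator pool O

    def forward(self, x, alpha_edge):
        # beta = Softmax(alpha) for this edge, as in Equation (3); beta sums to 1.
        beta = F.softmax(alpha_edge, dim=0)
        # Every candidate contributes to the edge output in proportion to its beta value.
        return sum(b * op(x) for b, op in zip(beta, self.ops))
```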
Here, α passed through the Softmax function is intended to approximate a class value in a one-hot encoded format. In the resulting formulation, α represents a collection of importance values assigned to multiple operators. However, only the single operator with the highest contribution is retained during the test process, while the remaining operators are disregarded.
DARTS seeks to find the α that minimizes \mathcal{L}_{val}\left(w^*(\alpha), \alpha\right) as follows:
\mathcal{L}_{val}\left(w^*(\alpha), \alpha\right) = \mathcal{L}_{val}\left(\operatorname*{arg\,min}_{w} \mathcal{L}_{train}(w, \alpha),\ \alpha\right)
This can be solved by bi-level optimization, where w is the lower-level variable and α is the upper-level variable. \mathcal{L}_{train} is the loss obtained when training w with α fixed, and \mathcal{L}_{val} is the loss obtained when training α with w fixed. Training of w is done with the training dataset, and training of α is done with the validation dataset. In this study, we regularize the optimization process of the architecture search by annealing the Softmax function with a temperature; that is, we perform relaxation by scaling the weight distribution of the operators (α) so that it is close to the one-hot encoded distribution of the operators.
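The alternating first-order update, w on the training split and α on the validation split, can be sketched as below; the optimizers, the criterion, and the model signature model(x, alpha) are placeholders introduced for illustration rather than the configuration used in this paper.

```python
def search_step(model, alpha, train_batch, val_batch,
                w_optimizer, alpha_optimizer, criterion):
    # Lower level: update the network weights w on the training split with alpha fixed.
    x_tr, y_tr = train_batch
    w_optimizer.zero_grad()
    criterion(model(x_tr, alpha), y_tr).backward()    # L_train(w, alpha)
    w_optimizer.step()

    # Upper level: update the architecture parameters alpha on the validation split with w fixed.
    x_val, y_val = val_batch
    alpha_optimizer.zero_grad()
    criterion(model(x_val, alpha), y_val).backward()  # L_val(w, alpha)
    alpha_optimizer.step()
```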

3.3. The Architecture of DARTS

We would like to provide further clarity on the architectural intricacies of the DARTS method. The structure we employ involves a cell configuration, where each cell, apart from the stem layer, comprises a total of 4 layers. However, due to the exploration of 8 operators for each layer within the cell and the examination of connections between layers, the search space expands to 8 operators multiplied by 14 inter-layer connections. Throughout the search phase, the DARTS algorithm duplicates this cell 8 times and updates them simultaneously, with a similar duplication of 20 cells during the training process. Additionally, reduction cells were applied twice at specific intervals, corresponding to 1/3 and 2/3 of the total architecture size. This inclusion serves to refine the architecture’s performance.
The resulting search space for α amounts to 912 layers. Regarding the inner weights (w), during the search phase the total number of layers equates to (2 + 8 × 14) × 8, encompassing two stem layers plus 8 candidate operators on each of the 14 inter-layer connections, across 8 cells. During the training process, the total number of layers is (2 + 4) × 20: only 4 layers remain in a cell once the operators are fixed, with 2 stem layers and 4 internal layers per cell, over the 20 cells of the training model. It is important to acknowledge that the actual layer count can deviate from the search results. If the search for an inter-layer connection yields ‘none’, the connection is eliminated; similarly, if the result is ‘skip connect’, the connection is skipped.
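The count of 14 inter-layer connections per cell follows from each of the four intermediate nodes being connected to the two cell inputs and to all earlier intermediate nodes; a quick check:

```python
num_cell_inputs = 2       # outputs of the two preceding cells
num_intermediate = 4      # intermediate nodes inside a cell
# Intermediate node k (k = 0..3) receives edges from the 2 cell inputs and the k earlier nodes.
edges_per_cell = sum(num_cell_inputs + k for k in range(num_intermediate))  # 2 + 3 + 4 + 5 = 14
num_operators = 8
print(edges_per_cell, edges_per_cell * num_operators)  # 14 edges, 112 (operator, edge) pairs
```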

3.4. Softmax Temperature Annealing

When the input values of the Softmax function are divided by a temperature, the resulting probabilities can be adjusted to decrease or increase the variance compared to those obtained from the original Softmax function as follows:
\bar{o}^{(i,j)}(x) = \sum_{o \in O} \frac{\exp\left(\alpha_o^{(i,j)} / T\right)}{\sum_{o' \in O} \exp\left(\alpha_{o'}^{(i,j)} / T\right)}\, o(x)
Applying a temperature value of 1 yields the same probabilities as the traditional Softmax function, while using a temperature value greater than 1 reduces the differences among the probabilities. Conversely, applying a temperature value between 0 and 1 accentuates the differences across the probabilities. By incorporating this approach into DARTS, we can push the probabilities toward values closer to the extremes of 0 and 1. This effectively mitigates the discrepancy between the weighted encoded architectures generated by continuous relaxation and the discrete architectures. Setting a lower temperature value allows DARTS, which may otherwise introduce discrepancies when one-hot encoding the final architecture, to perform the search in an environment similar to testing.
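Equation (5) amounts to dividing the architecture weights by T before applying the Softmax function; a minimal sketch (the α values below are arbitrary illustrative numbers):

```python
import torch
import torch.nn.functional as F

def softmax_with_temperature(alpha, T):
    """Temperature-annealed Softmax of Equation (5): T = 1 recovers the standard Softmax,
    T > 1 flattens the distribution, and 0 < T < 1 sharpens it toward one-hot."""
    return F.softmax(alpha / T, dim=-1)

alpha = torch.tensor([1.2, 0.5, 0.4, 0.7, 1.1, 1.0, 0.9, 0.8])  # illustrative weights
for T in (10.0, 1.0, 0.1):
    print(T, softmax_with_temperature(alpha, T))
```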
For example, suppose that the α vector of an edge, after passing through the Softmax function (the β vector shown in Figure 2), is [0.1537, 0.0959, 0.0956, 0.1136, 0.1425, 0.1366, 0.1349, 0.1272] over the operator pool [‘none’, ‘max pool 3 × 3’, ‘avg pool 3 × 3’, ‘skip connect’, ‘sep conv 3 × 3’, ‘sep conv 5 × 5’, ‘dil conv 3 × 3’, ‘dil conv 5 × 5’]. In Figure 2 and Figure 3, the operators are labeled op0 to op7 in the order of the operator pool, and the values are the architectural weights of each operator. In this case, only the first value (the highest) is kept, and the rest are discarded when deciding the operator for that layer (edge). This decision differs considerably from the search phase: during the search, the first operator contributed only 15.37% to the layer output, yet it is chosen as the operator of that layer. This can be a problem because its actual contribution was similar to that of all the other operators. Moreover, we have little confidence that the operator with a contribution of 0.1366 will perform better than the operator with 0.1349 on that layer. Therefore, with the Softmax temperature annealing method, we relieve this discrepancy between the continuously relaxed architecture parameters (β) and the discrete architecture (a composition of operators). Temperature annealing can push the distribution of β values toward the discrete values 0 and 1.
The temperature annealing method is a way to control the distribution of β. Since the β distribution itself is the prediction of operator selection, temperature annealing affects the search process. During the operator search procedure, each operator exists as a one-hot encoded label; when the entire operator pool is combined, it forms a vector of the same length as β. The values of the β vector indicate the importance of the operator corresponding to each index. As the β values are the output of the Softmax function, the vector sums to 1 and each value lies between 0 and 1. Since the Softmax function does not alter the ranking of the α values, the β values maintain the same ranking as the α values. When the search process is completed, the operator with the highest β value is selected as the most effective operator.
As mentioned earlier, the beta values are derived from the alpha values, which are updated through backpropagation. Scaling the alpha values with the Softmax temperature therefore also affects the update process. In detail, even though the rankings of the beta and alpha values are the same, scaling the beta values closer to or further from one another during the search influences the magnitude of the update. For example, naming the operators at the 1st through 8th indices in Figure 3 op0 to op7, when the vector is in the alpha state (i.e., before passing through the Softmax function), the gap between op3 and op4 is not significant. However, when the vector is in the beta state (i.e., after passing through the Softmax function), the gap between them can increase or decrease depending on the temperature. If the gap widens, the model perceives a larger difference than the actual difference in the alpha values. This is similar to the mean squared error (MSE), where squaring the errors makes them larger than the actual differences. The difference between MSE and temperature annealing, however, is that temperature annealing does not spread all differences equally: some operators experience an increase in the difference, while others experience a decrease. Additionally, depending on the chosen temperature, the variance of the beta vector can be increased or decreased.
In reality, only individual operators are applied in the final architecture. To select one of these operators, all we have to do is set the value at that operator’s index to one, or close to one. To make the beta vector resemble a one-hot encoded vector of 0s and 1s, each beta value should therefore lie very close to either 0 or 1. However, the original DARTS method has no mechanism for adjusting the spread of beta, so it cannot always yield results similar to a one-hot encoded vector. Instead, DARTS’ mixed operators parameterize the architecture to make it continuous, resulting in beta values that are distant from the actual one-hot encoded (discrete) architecture, i.e., far from 0 and 1. During the search process, the alpha values are learned as continuous values. Although multiple operators with high contributions can exist during the search, after the final epoch only one operator must be selected; we expect one operator to have a value close to 1 while the others have values close to 0. By adjusting the beta values through the temperature, the search process can be guided to yield results similar to a one-hot encoded vector.
When scaling the alpha values during the search, the rankings of the beta values remain unchanged; only the differences between them change. Let us examine how the beta values of the operators change when this difference is adjusted. The changes in the beta values depending on the temperature can be seen in Figure 3. When the temperature is 1, as shown in Figure 3b, the scaling leaves the alpha values unchanged, so the result follows the general Softmax function. When the temperature is 0.1, as shown in Figure 3d, the beta values approach 1 and 0, resulting in a larger difference between the operator with the highest beta value and the one with the lowest; one or a few operators dominate the layer or edge. As the temperature approaches 0, only the operator with the highest alpha value converges to a beta value close to 1, while the rest approach values close to 0. When the temperature is 10, as shown in Figure 3c, multiple operators have similar beta values: their rankings remain the same, but their evaluations and contributions become similar. As the temperature increases further, all the beta values become nearly identical. In this case, selecting a single operator that dominates the layer or edge becomes meaningless; since all operators contribute almost equally, an architecture constructed by selecting only one operator will differ significantly from the architecture used during the search.
Suppose the temperature is high, so that even a large difference in alpha values yields only a small difference in beta values. In that case, operators with low alpha values receive higher evaluations than their actual contributions warrant; in other words, operators with low alpha values gain an advantage, while operators with high alpha values are put at a disadvantage. Such changes give operators that initially performed poorly the possibility of a reversal in subsequent epochs. Conversely, if the temperature is low, even a small difference in alpha values yields a large difference in beta values: operators with high alpha values receive higher evaluations and gain an advantage, while operators with low alpha values receive lower evaluations and are disadvantaged. In this situation, compared with a temperature of 1, the beta vector moves closer to 0 and 1, leading to results similar to the one-hot encoded operator.
In summary, by scaling the alpha values, the following benefits can be obtained:
  • Controlling the difference in beta values to regulate the search process;
  • Intensifying or relaxing the competition among operators;
  • Increasing search diversity through high temperature;
  • Creating an architectural environment similar to the one-hot encoded operator through low temperature.
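To make these points concrete, the example β vector of Figure 2 can be re-tempered directly (treating log β as a stand-in for α, which is exact up to an additive constant); the snippet below is illustrative only.

```python
import torch
import torch.nn.functional as F

beta_T1 = torch.tensor([0.1537, 0.0959, 0.0956, 0.1136, 0.1425, 0.1366, 0.1349, 0.1272])
alpha = beta_T1.log()  # Softmax(alpha) reproduces beta_T1 exactly

for T in (10.0, 1.0, 0.1):
    beta_T = F.softmax(alpha / T, dim=0)
    print(f"T={T:>4}: max={beta_T.max():.3f}, min={beta_T.min():.3f}")
# Expected pattern: T = 10 gives a nearly uniform vector (every entry close to 0.125);
# T = 1 reproduces the original vector; T = 0.1 widens the spread sharply, with the largest
# entry growing to roughly 0.44 while the smallest drops below 0.01, i.e., closer to one-hot.
```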

4. Experimental Results

Prior to delving into the main body of this section, we explain the factors that were applied to all our experiments. First, we applied the cutout technique in all experiments. Cutout [23] is an effective data preprocessing technique for Convolutional Neural Networks (CNNs) that improves the generalization ability of the NAS algorithm. As other NAS algorithms sometimes apply cutout and sometimes do not, we denote +cutout with the model name in every table to indicate that cutout is applied. Secondly, the search cost in the result tables refers to the amount of computation expended until the completion of a given search algorithm, which allows comparison across different scales of computational resources. We report the search cost in GPU days, i.e., the search time normalized to a single GPU. Since GPU days depend on the GPU configuration and may decrease as GPUs advance, the device should be specified: all of our experiments were run on an RTX A4000, while the results from the original DARTS paper were obtained on a GTX 1080Ti. When we ran the DARTS algorithm on the RTX A4000, it took 2 days on average; since calculating the temperature at every epoch in our method takes only a few seconds, the search cost is nearly the same. Lastly, Params in the result tables indicates the number of parameters that the model contains, often referred to as the size of the model.
Table 1 shows the results obtained by searching with different temperatures. The CIFAR-10 dataset was used, and the search was conducted for 50 epochs. Search validation accuracy represents the validation accuracy of the last epoch. Train top 1 accuracy refers to the top 1 accuracy of the last epoch when retraining with the searched architecture. Retraining was performed for 600 epochs. Since all top 5 accuracies showed results above 99.99%, the difference cannot be determined by the second decimal place alone. Therefore, we did not indicate the top 5 accuracies. Test accuracy represents the test accuracy of the retrained model. As there are infinitely many possible temperatures, only a few temperatures were selected for experimentation.
In Table 1, the temperature with the highest test accuracy was 3.0, recording 97.40%. The second highest temperature was 0.1, which recorded 97.39%, 0.01%p lower than the accuracy at temperature 3.0. The temperature with the highest validation accuracy was 0.5, recording 88.90%. The second highest validation accuracy was achieved at temperature 0.1, recording 88.74%. Temperature 0.1, which showed high rankings in both test and validation accuracy, appears to be the most suitable temperature for the CIFAR-10 dataset among these examined temperatures. Empirically, architectures with high validation accuracy have a higher chance of improved performance after retraining. Therefore, if multiple trials are conducted, the temperature value of 0.1 may surpass the temperature value of 10.0. However, this is not always the case. For example, in the case of the architecture with a temperature of 0.5, despite recording the highest validation accuracy, it showed the lowest performance in terms of test accuracy.
Table 2 shows the results of experiments conducted on the CIFAR-100 dataset. The items are similar to those in Table 1, comparing search validation accuracy, train top-1 accuracy, and test accuracy, with train top-5 accuracy reported additionally. In the CIFAR-100 experiment, a temperature of 0.5 showed the highest performance in all categories, with a test accuracy of 81.67%. The performance dropped significantly at temperatures 0.1 and 5.0: the test accuracy at temperature 0.1 was 76.86%, a decrease of 4.81%p compared to temperature 0.5, and at temperature 5.0 it was 78.80%, a decrease of 2.87%p. This indicates that the relationship between temperature and performance is neither directly nor inversely proportional; finding and applying an appropriate temperature value is crucial.
Table 3 presents the validation results during the architecture search process. Both validation accuracy and test accuracy refer to top-1 accuracy. The validation accuracy in Table 3 represents the average result over five experiments. As seen in Table 3, the validation accuracy during the search process is 88.35% for DARTS and 88.42% for TA-DARTS with a temperature setting of 0.1, an increase of 0.07%p over the original DARTS algorithm.
Table 4 shows the test results after retraining with the searched architecture. With a temperature value of 0.5, TA-DARTS achieved a top-1 accuracy of 97.40%, which is 0.16%p higher than the original DARTS mixed operator and 0.03%p higher than the Gumbel-Softmax method. In future research, different temperature values can be explored, and a Hyperparameter Optimization (HPO) algorithm can be developed to find the optimal temperature value. Another approach could involve studying the correlation between temperature and loss to find the optimal value.
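As a simple starting point for such an HPO study, the temperature could be treated as a one-dimensional hyperparameter and tuned by grid or random search; in the hypothetical sketch below, search_and_evaluate stands in for a full TA-DARTS search followed by retraining and is not an existing function of our implementation.

```python
import random

def tune_temperature(search_and_evaluate,
                     candidates=(0.1, 0.5, 1.0, 3.0, 5.0, 10.0), trials=None):
    """Grid search (or a random subsample of the grid) over Softmax temperatures.
    search_and_evaluate(T) is assumed to run a TA-DARTS search at temperature T,
    retrain the derived architecture, and return its validation accuracy."""
    pool = list(candidates) if trials is None else random.sample(list(candidates), trials)
    results = {T: search_and_evaluate(T) for T in pool}
    best_T = max(results, key=results.get)
    return best_T, results
```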
DARTS adopts a cell-based neural architecture [10,24,25,26]. One distinctive feature of the cell-based architecture is that it incorporates a neural architecture within a single building block called a cell [6,8,27]. All cells share the same structure, and their weights are updated simultaneously [8]. In DARTS’ cell-based architecture, the output of the last cell and the output of the cell before it are used as the input for the next cell. Each cell consists of multiple layers. In the search architecture of TA-DARTS, a cell comprises eight layers connected to a total of five nodes, and each layer is equipped with a mixed operator. Figures depicting the architectural variations under temperature values of 0.1 and 10 can be found in the Appendix section.

5. Conclusions

In our experiments, we addressed the discrepancy issue between the continuous and discrete architectures in the original DARTS. The use of the simple Softmax function to convert the discrete search space [10,13] into a continuous one, and its successful application in the backpropagation process, attracted significant attention from the NAS research community. However, the problem of selecting discrete architectures from a continuous search space still lacks sufficient investigation. To tackle this, we applied temperature annealing to the mixed operator in DARTS, resulting in improved performance compared to the original DARTS. Our study successfully mitigated the architectural discrepancy inherent in DARTS; however, complete elimination of the discrepancy has yet to be achieved. Fundamentally, representing an architecture obtained through continuous relaxation as an entirely discrete architecture remains a challenging problem in applying gradient-based methods to NAS.

Author Contributions

Conceptualization, J.S. and D.-K.K.; methodology, D.-K.K.; software, J.S.; validation, D.-K.K. and K.P.; formal analysis, J.S. and D.-K.K.; investigation, J.S.; resources, K.P.; data curation, K.P.; writing—original draft preparation, J.S.; writing—review and editing, D.-K.K.; visualization, J.S.; supervision, D.-K.K.; project administration, D.-K.K. and K.P.; funding acquisition, D.-K.K. and K.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly supported by the Technology development Program “Development of Supply Chain Optimization Technology Based on Big Data and Machine Learning” (S3126610), funded by the Ministry of SMEs and Startups (MSS, Korea).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here (accessed on 8 September 2023): [https://www.cs.toronto.edu/~kriz/cifar.html].

Acknowledgments

The authors wish to thank members of the Dongseo University Machine Learning/Deep Learning Research Lab., and anonymous referees for their helpful comments on earlier drafts of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AutoML    Automated Machine Learning
DARTS     Differentiable Architecture Search
HPO       Hyperparameter Optimization
TA-DARTS  Temperature Annealing Differentiable Architecture Search
NAS       Neural Architecture Search
MSE       Mean Squared Error

Appendix A. Architectural Changes of TA-DARTS (Temperature = 0.1)

This appendix contains figures illustrating examples of the changes in the normal and reduction cells of TA-DARTS when the temperature was 0.1. Epoch 0 represents the initial architecture before the learning of α, and throughout the entire search process only the architecture at every tenth epoch is shown. Note that in a mixed operator, every subsequent node has edges connecting it to every previous node; this is to find the optimal places to put the skip and none connections, as in a residual network [11]. Furthermore, every edge carries a distribution of β, but only the operator with the highest value is shown. The normal cell maintains the feature map size, while the reduction cell reduces it with a stride of 2. DARTS applies reduction cells at every 1/3 of the total cells. c_{k−1} and c_{k−2} are the outputs of the previous cell and the cell before it, and c_k is the output of the current cell.
The full operator pool is [‘none’, ‘max pool 3 × 3’, ‘avg pool 3 × 3’, ‘skip connect’, ‘sep conv 3 × 3’, ‘sep conv 5 × 5’, ‘dil conv 3 × 3’, ‘dil conv 5 × 5’]. The operator ‘none’ removes the connection. The operator ‘max pool 3 × 3’ is a max pooling operator with a kernel size of 3 by 3, and ‘avg pool 3 × 3’ is an average pooling operator with a kernel size of 3 by 3. The operator ‘skip connect’ is a skip connection that transfers the input directly to the output. The operators ‘sep conv 3 × 3’ and ‘sep conv 5 × 5’ are separable convolutions with kernel sizes of 3 by 3 and 5 by 5, respectively. The operators ‘dil conv 3 × 3’ and ‘dil conv 5 × 5’ are dilated convolutions with kernel sizes of 3 by 3 and 5 by 5, respectively, and a dilation rate of 2.
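For quick reference, the operator pool described above can be summarized as a small lookup table; the entries merely restate the text and are not the implementation of the operators.

```python
# Operator pool of the TA-DARTS/DARTS cells: name -> (operation type, kernel size, dilation).
OPERATOR_POOL = {
    "none":          ("remove the connection", None, None),
    "max_pool_3x3":  ("max pooling",           3,    None),
    "avg_pool_3x3":  ("average pooling",       3,    None),
    "skip_connect":  ("identity / skip",       None, None),
    "sep_conv_3x3":  ("separable convolution", 3,    1),
    "sep_conv_5x5":  ("separable convolution", 5,    1),
    "dil_conv_3x3":  ("dilated convolution",   3,    2),
    "dil_conv_5x5":  ("dilated convolution",   5,    2),
}
```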

Appendix A.1

Appendix A.1 shows the changes in normal cell architecture of TA-DARTS when the temperature was 0.1.
Figure A1 shows the initial architecture of a normal cell when the temperature was 0.1. Epoch 0 is the only epoch that shows a randomly initialized architecture.
Figure A1. Initial architecture of a normal cell when the temperature was 0.1.
Figure A2 shows the 10th architecture of a normal cell when the temperature was 0.1. After 10 epochs of search, TA-DARTS removed all the connections among nodes 0 to 3 and accumulated all of their results at the end.
Figure A2. 10th architecture of a normal cell when the temperature was 0.1.
Figure A3 shows the 20th architecture of a normal cell when the temperature was 0.1. After 20 epochs of search, TA-DARTS still maintains the architecture without connections between the hidden nodes: all edges among nodes 0 to 3 remain removed, and all remaining edges stay connected. One previously existing skip connection has been changed to a dilation convolution.
Figure A3. 20th architecture of a normal cell when the temperature was 0.1.
Figure A4 shows the 30th architecture of a normal cell when the temperature was 0.1. A skip connection appeared again on the edge from c_{k−1} to node 0.
Figure A4. 30th architecture of a normal cell when the temperature was 0.1.
Figure A5 shows the 40th architecture of a normal cell when the temperature was 0.1. The skip connection on the edge from c_{k−1} to node 0 disappeared and changed into a 5 by 5 dilation convolution. We can see that TA-DARTS hesitates in choosing the right operator for this edge.
Figure A5. 40th architecture of a normal cell when the temperature was 0.1.
Figure A6 shows the final (50th) architecture of a normal cell when the temperature was 0.1. The connection pattern has been maintained since the 10th epoch: the edges from c_{k−1} to node 0, c_{k−2} to node 3, c_{k−2} to node 2, c_{k−1} to node 3, and c_{k−1} to node 1 have not changed since the 10th epoch. This does not mean that there was no search; there could have been a struggle in selecting these operators from the operator pool, but the most effective operator did not change. No skip connection is selected for any of the edges, which is a good sign because DARTS tends to choose skip connections too often and thus lose too much information [18].
Figure A6. Final architecture of a normal cell when the temperature was 0.1.

Appendix A.2

Appendix A.2 shows the changes in reduction cell architecture of TA-DARTS when the temperature was 0.1.
Figure A7 shows the initial architecture of a reduction cell when the temperature was 0.1. This architecture is randomly initialized, and we can expect the results from the mixed operators to change greatly after several epochs.
Figure A7. Initial architecture of a reduction cell when the temperature was 0.1.
Figure A8 shows the 10th architecture of a reduction cell when the temperature was 0.1. We can see that there were changes not only in the operators but also in the architecture. A connection from c_{k−1} to node 3 has appeared, and the connection from c_{k−1} to node 1 has disappeared. For node 2, the input connection has changed from c_{k−2} to c_{k−1}.
Figure A8. 10th architecture of a reduction cell when the temperature was 0.1.
Figure A9 shows the 20th architecture of a reduction cell when the temperature was 0.1. Node 0 and node 1 now give their outputs to node 3 as well, and node 3 now receives input from node 1 and c_{k−1}.
Figure A9. 20th architecture of a reduction cell when the temperature was 0.1.
Figure A10 shows the 30th architecture of a reduction cell when the temperature was 0.1. The architecture has become different again; we can see that the search has not yet converged.
Figure A10. 30th architecture of a reduction cell when the temperature was 0.1.
Figure A11 shows the 40th architecture of a reduction cell when the temperature was 0.1. From this architecture onward, there is no change until the end of the search, so we can assume that the search has already converged.
Figure A11. 40th architecture of a reduction cell when the temperature was 0.1.
Figure A12 shows the 50th architecture of a reduction cell when the temperature was 0.1. This architecture is the same as the 20th architecture, which suggests that evidence for this choice had already appeared during the search procedure.
Figure A12. Final architecture of a reduction cell when the temperature was 0.1.

Appendix B. Architectural Changes of TA-DARTS (Temperature = 10)

This Appendix B contains Figures illustrating the changes in the normal and reduction cells of TA-DARTS when the temperature was 10. Similar to Appendix A, we can examine how the architecture evolves by inspecting the initial epoch and every tenth epoch.

Appendix B.1

Appendix B.1 shows the changes in normal cell architecture when the temperature was 10.
Figure A13 shows the initial architecture of a normal cell when the temperature was 10. This architecture is based on the randomly generated initial distribution of α.
Figure A13. Initial architecture of a normal cell when the temperature was 10.
Figure A14 shows the 10th architecture of a normal cell when the temperature was 10. The architecture exhibited substantial alterations in this epoch relative to the other epochs investigated during this search. Two more connections, from node 0 to node 1 and to node 2, have appeared; the inputs of node 1 have changed from c_{k−1} and c_{k−2} to c_{k−1} and node 0; the connection from node 1 to node 3 has disappeared; one of the inputs to node 2 has changed from c_{k−1} to node 0; the connection between node 2 and node 3 has appeared; and one of the inputs of node 3 has changed from node 1 to node 2.
Figure A14. 10th architecture of a normal cell when the temperature was 10.
Figure A15 shows the 20th architecture of a normal cell when the temperature was 10. The architecture has not changed except for the operator between node 2 and node 3, which has changed into a 5 by 5 dilation convolution.
Figure A15. 20th architecture of a normal cell when the temperature was 10.
Figure A16 shows the 30th architecture of a normal cell when the temperature was 10. There were a few architectural changes: the connection from node 0 to node 2 has disappeared, and the connection from node 1 to node 3 has appeared.
Figure A16. 30th architecture of a normal cell when the temperature was 10.
Figure A17 shows the 40th architecture of a normal cell when the temperature was 10. The connection from node 0 to node 1 has disappeared, and the inputs of node 1 have changed from c_{k−1} and node 0 to c_{k−1} and c_{k−2}.
Figure A17. 40th architecture of a normal cell when the temperature was 10.
Figure A18 illustrates the 50th architecture of a normal cell obtained at a temperature of 10. The final architecture configuration has been chosen based on similarities observed in the 10th and 20th epochs. However, a distinction was observed in the connection between node 1 and node 3. This particular architecture exhibits a pattern characterized by residual networks between every two nodes, suggesting that TA-DARTS leverages the full potential of residual networks.
Figure A18. Final architecture of a normal cell when the temperature was 10.

Appendix B.2

Appendix B.2 shows the changes in reduction cell architecture of TA-DARTS when the temperature was 10.
Figure A19 illustrates the initial architecture of a reduction cell obtained at a temperature of 10. The architecture is made with the initial α vector, which is randomly generated.
Figure A19. Initial architecture of a reduction cell when the temperature was 10.
Figure A20 illustrates the 10th architecture of a reduction cell obtained at a temperature of 10. Given the early stage of the search process, there is significant potential for substantial changes in subsequent epochs. The biggest change in this architecture is that node 1 became the bridge between node 0 and node 3.
Figure A20. 10th architecture of a reduction cell when the temperature was 10.
Figure A21 illustrates the 20th architecture of a reduction cell obtained at a temperature of 10. The architecture underwent a sudden transition, exhibiting a notable shift towards simplicity, characterized by the absence of connections between nodes.
Figure A21. 20th architecture of a reduction cell when the temperature was 10.
Figure A22 illustrates the 30th architecture of a reduction cell obtained at a temperature of 10. From the 20th simplistic architecture, the sole modification observed was the relocation of node 3 adjacent to node 1.
Figure A22. 30th architecture of a reduction cell when the temperature was 10.
Figure A23 illustrates the 40th architecture of a reduction cell obtained at a temperature of 10. Node 1 now receives input not only from c_{k−2} but also from node 0. The algorithm appears to be actively exploring more intricate architectural configurations.
Figure A23. 40th architecture of a reduction cell when the temperature was 10.
Figure A24 illustrates the final architecture of a reduction cell obtained at a temperature of 10. Node 2 has been repositioned next to node 1, forming a skip connection in which nodes 2 and 3 essentially represent the same entity. This finding indicates that combining node 1 and the cell input c_{k−2} through summation yielded highly effective results.
Figure A24. Final architecture of a reduction cell when the temperature was 10.

References

1. Chitty-Venkata, K.T.; Somani, A.K. Neural Architecture Search Survey: A Hardware Perspective. ACM Comput. Surv. 2022, 55, 1–36.
2. Zhou, X.; Qin, A.K.; Sun, Y.; Tan, K.C. A Survey of Advances in Evolutionary Neural Architecture Search. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Kraków, Poland, 28 June–1 July 2021; pp. 950–957.
3. Elsken, T.; Metzen, J.H.; Hutter, F. Neural Architecture Search: A Survey. J. Mach. Learn. Res. (JMLR) 2019, 20, 1–21.
4. Heuillet, A.; Nasser, A.; Arioui, H.; Tabia, H. Efficient Automation of Neural Network Design: A Survey on Differentiable Neural Architecture Search. arXiv 2023, arXiv:2304.05405.
5. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.Y.; Li, Z.; Chen, X.; Wang, X. A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions. ACM Comput. Surv. 2021, 54, 1–34.
6. White, C.; Safari, M.; Sukthanker, R.; Ru, B.; Elsken, T.; Zela, A.; Dey, D.; Hutter, F. Neural Architecture Search: Insights from 1000 Papers. arXiv 2023, arXiv:2301.08727.
7. Pan, C.; Yao, X. Neural Architecture Search Based on Evolutionary Algorithms with Fitness Approximation. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
8. Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient Neural Architecture Search via Parameters Sharing. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4095–4104.
9. Zoph, B.; Le, Q.V. Neural Architecture Search with Reinforcement Learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
10. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
12. Brock, A.; Lim, T.; Ritchie, J.; Weston, N. SMASH: One-Shot Model Architecture Search through HyperNetworks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
13. Zela, A.; Elsken, T.; Saikia, T.; Marrakchi, Y.; Brox, T.; Hutter, F. Understanding and Robustifying Differentiable Architecture Search. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Conference, 26 April–1 May 2020.
14. Liang, H.; Zhang, S.; Sun, J.; He, X.; Huang, W.; Zhuang, K.; Li, Z. DARTS+: Improved Differentiable Architecture Search with Early Stopping. arXiv 2019, arXiv:1909.0603.
15. Chen, X.; Xie, L.; Wu, J.; Tian, Q. Progressive Differentiable Architecture Search: Bridging the Depth Gap Between Search and Evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
16. Xu, Y.; Xie, L.; Zhang, X.; Chen, X.; Qi, G.J.; Tian, Q.; Xiong, H. PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
17. Stamoulis, D.; Ding, R.; Wang, D.; Lymberopoulos, D.; Priyantha, B.; Liu, J.; Marculescu, D. Single-Path NAS: Designing Hardware-Efficient ConvNets in Less than 4 Hours. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Würzburg, Germany, 16–20 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 481–497.
18. Chu, X.; Zhou, T.; Zhang, B.; Li, J. Fair DARTS: Eliminating Unfair Advantages in Differentiable Architecture Search. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 465–480.
19. Chang, J.; Zhang, X.; Guo, Y.; Meng, G.; Xiang, S.; Pan, C. Differentiable Architecture Search with Ensemble Gumbel-Softmax. arXiv 2019, arXiv:1905.01786.
20. Zhang, J.; Ding, Z. Small Temperature is All You Need for Differentiable Architecture Search. In Proceedings of the 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Osaka, Japan, 25–28 May 2023; pp. 303–315.
21. Ye, P.; Li, B.; Li, Y.; Chen, T.; Fan, J.; Ouyang, W. β-DARTS: Beta-Decay Regularization for Differentiable Architecture Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10864–10873.
22. Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
23. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552.
24. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
25. Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.J.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive Neural Architecture Search. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 19–35.
26. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized Evolution for Image Classifier Architecture Search. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA, 27 January–1 February 2019; pp. 4780–4789.
27. Zhong, Z.; Yan, J.; Wu, W.; Shao, J.; Liu, C.L. Practical Block-Wise Neural Network Architecture Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 18–22 June 2018; pp. 2423–2432.
Figure 1. Operator pool for Neural Architecture Search (NAS).
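For readers unfamiliar with the notion of an operator pool, the list below reproduces the candidate operations ("PRIMITIVES") commonly used in DARTS-style implementations; it is shown only as a representative example and may not match the exact pool depicted in Figure 1.

```python
# Representative operator pool for DARTS-style search (illustrative;
# the pool in Figure 1 may differ in its exact contents).
PRIMITIVES = [
    "none",          # zero operation (effectively removes the edge)
    "max_pool_3x3",
    "avg_pool_3x3",
    "skip_connect",  # identity / skip connection
    "sep_conv_3x3",
    "sep_conv_5x5",
    "dil_conv_3x3",
    "dil_conv_5x5",
]
```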
Figure 2. Visualization of an alpha vector at a certain edge during the search process. The horizontal axis represents the index of operators based on the operator pool. The vertical axis represents the β values for each operator.
Figure 3. Comparison of beta values passed through the Softmax function at different temperatures. The horizontal coordinates indicate each operator based on the index of the operator in the operator pool. (a) The values of an alpha vector before it passes through the mixed operator’s Softmax function. (b) The values of a beta vector passed through the original Softmax function. (c) The values of a beta vector passed through the Softmax function with a temperature of 10. (d) The values of a beta vector passed through the Softmax function with a temperature of 0.1.
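The effect visualized in Figure 3 can be reproduced with the following minimal sketch (not the authors’ released code): a temperature-scaled Softmax is applied to an illustrative alpha vector, where the specific alpha values and the helper name tempered_softmax are assumptions made for demonstration only.

```python
# Minimal sketch of a temperature-scaled Softmax over architecture weights.
# Assumes PyTorch; the alpha values below are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def tempered_softmax(alpha: torch.Tensor, temperature: float) -> torch.Tensor:
    """Return beta = Softmax(alpha / T); T > 1 flattens, T < 1 sharpens."""
    return F.softmax(alpha / temperature, dim=-1)

# One edge with 8 candidate operators (indices match the operator pool).
alpha = torch.tensor([0.1, 1.2, -0.3, 0.4, 0.9, -1.0, 0.2, 0.0])

for T in (1.0, 10.0, 0.1):
    beta = tempered_softmax(alpha, T)
    print(f"T={T:>4}: {beta.numpy().round(3)}")

# T = 10 yields a nearly uniform beta (reducing bias toward any single operator
# early in the search), while T = 0.1 pushes beta toward a one-hot vector,
# closer to the discrete architecture that is ultimately derived.
```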
Table 1. Experimental results on CIFAR-10 dataset with different temperatures.

Model                        | Search Validation Accuracy (%) | Train Top-1 Accuracy (%) | Test Accuracy (%)
TA-DARTS (T = 0.1) + cutout  | 88.74                          | 99.14                    | 97.39
TA-DARTS (T = 0.5) + cutout  | 88.90                          | 99.85                    | 95.25
TA-DARTS (T = 0.8) + cutout  | 88.38                          | 98.62                    | 96.93
TA-DARTS (T = 2.0) + cutout  | 88.34                          | 98.58                    | 97.20
TA-DARTS (T = 3.0) + cutout  | 88.53                          | 99.09                    | 97.40
TA-DARTS (T = 10.0) + cutout | 87.87                          | 99.01                    | 97.19
Table 2. Experimental results on CIFAR-100 dataset with different temperatures.

Model                        | Search Validation Accuracy (%) | Train Top-1 Accuracy (%) | Train Top-5 Accuracy (%) | Test Accuracy (%)
TA-DARTS (T = 0.1) + cutout  | 63.14                          | 92.95                    | 99.37                    | 76.86
TA-DARTS (T = 0.5) + cutout  | 64.12                          | 97.84                    | 99.89                    | 81.67
TA-DARTS (T = 5.0) + cutout  | 63.65                          | 93.71                    | 99.38                    | 78.80
TA-DARTS (T = 10.0) + cutout | 63.40                          | 95.91                    | 99.69                    | 81.21
Table 3. Comparison of validation accuracy with similar gradient-based approaches on the CIFAR-10 dataset. The validation accuracies are obtained over five search runs.

Model                           | Params (M) | Search Cost (GPU days) | Validation Accuracy (%) | Search Method
DARTS (2nd order) + cutout [10] | 3.4        | 1                      | 88.35 ± 0.19            | Gradient-based
TA-DARTS (T = 0.1) + cutout     | 4.1        | 4                      | 88.42 ± 0.28            | Gradient-based
TA-DARTS (T = 10) + cutout      | 4.1        | 4                      | 87.98 ± 0.20            | Gradient-based
Table 4. Comparison of test accuracy with similar gradient-based approaches on CIFAR-10 dataset. The test accuracies are obtained with retrained models.

Model                           | Params (M) | Search Cost (GPU days) | Test Accuracy (%) | Search Method
DARTS (1st order) + cutout [10] | 3.4        | 1.5                    | 97.00 ± 0.14      | Gradient-based
DARTS (2nd order) + cutout [10] | 3.4        | 4                      | 97.24 ± 0.09      | Gradient-based
DARTS-EGS (M = 4) + cutout [19] | 2.6        | 1                      | 96.99             | Gradient-based
DARTS-EGS (M = 7) + cutout [19] | 2.9        | 1                      | 97.37             | Gradient-based
TA-DARTS (T = 3.0) + cutout     | 4.1        | 4                      | 97.40             | Gradient-based
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
