Control the number of skip-connects to improve robustness of the NAS algorithm

Recently, gradient-based neural architecture search (NAS) has made remarkable progress owing to its high efficiency and fast convergence. However, two common problems remain in gradient-based NAS algorithms. First, as training time increases, the algorithm increasingly favours the skip-connect operation, leading to performance degradation and unstable results. Second, computing resources are not allocated sensibly among the valuable candidate network models. These two issues make it difficult to search for the optimal sub-network and harm stability. To address them, the super-net is pre-trained so that each operation has an equal opportunity to develop its strength, providing a fair competition condition for the convergence of the architecture parameters. In addition, a skip-controller is proposed to ensure that every sampled sub-network has an appropriate number of skip-connects. Experiments on three mainstream datasets, CIFAR-10, CIFAR-100 and ImageNet, show that the improved method achieves comparable results with higher accuracy and stronger robustness.


| INTRODUCTION
In the field of artificial intelligence, neural network algorithms have made remarkable achievements in image, voice, and other areas. Zhao [1] proposes a novel super-resolution regularization model based on adaptive sparse representation and self-learning frameworks. Zeng [2] presents a weighted structural local sparse appearance model to further improve the robustness of tracking. Luo [3] proposes a fast and effective pruning method to eliminate outlier correspondences, especially sharp ones. However, designing neural networks manually requires a great deal of specialized knowledge and repetitive experimentation.
In recent years, the Neural Architecture Search (NAS) algorithm, which can design high performance networks automatically, has received extensive attention from researchers, and many new network models have evolved [4][5][6]. NAS algorithms based on reinforcement learning [7], evolutionary algorithm [8,9], Bayesian optimization [10], and gradient constitute the four development directions of the neural structure search.
ENAS [6] reduced the architecture search time to the day level. Gradient-based NAS has become the focus of research due to its fast search speed and high accuracy. The first gradient-based method [11] made a breakthrough in search time, and follow-up works [12,13] used Gumbel-softmax [14] to alleviate the gap problem.
However, a problem called 'skip aggressive' remains unresolved: gradient-based NAS methods tend towards the skip-connect operation as training time increases. There is also another common problem with current methods: the cell-based NAS algorithm has a vast search space, but the number of candidate networks sampled is far smaller than the size of that space.
In this article, we propose a skip-controller to limit the number of skip-connects during the search, which not only solves the 'skip aggressive' problem but also avoids wasting resources on sampling and training meaningless sub-networks with too many skip-connects. In summary, the strengths of our method compared with previous NAS methods lie in the following aspects:
1. Limiting the number of skip-connects to a reasonable range significantly narrows the search space and reduces the difficulty of searching for the optimal solution.
2. Alleviating the unfair advantage of skip-connect in operation selection provides a fair competition platform, which helps to select the relatively optimal structure.
3. Limiting the number of skip-connects avoids performance degradation and improves the accuracy and stability of the search algorithm.
4. Experiments show higher accuracy and more robust results. Specifically, the average error rates on CIFAR-10, CIFAR-100 [15], and ImageNet [16] are 2.68%, 17.28%, and 25.1%, respectively, all lower than those of the previous three Gumbel-based methods [12,13,17]. With the skip-controller, the variance of model accuracy decreases from 0.14 to 0.07.

| RELATED WORK
Liu et al. proposed DARTS (Differentiable Architecture Search) [11], the first gradient-based NAS method. Based on DARTS, SNAS [12] and FBNet [13] proposed using Gumbel-softmax to reduce the gap between the derived sub-network and the superNet. GDAS [17] further uses Gumbel-softmax to speed up the search. However, none of them addresses the skip-connect aggressive issue in DARTS. P-DARTS [18], aiming at the problem of an excessive number of skip-connects, uses path drop-out to limit the unfair advantage of skip-connect during the search and replaces excessive skip-connects with other operations at the end of the search. This is an ex-post remedy; more attention should be paid to avoiding being misled during the search process itself.
FairDARTS [19] trains the structure parameters of different operations independently to avoid interference between operations and to make the competition between them fair, alleviating excessive skip-connects. However, the above methods ignore the vast search space of the cell-based NAS algorithm: the number of candidate networks sampled in the whole search process is far smaller than the size of the search space. DARTS+ [20] observes that performance degradation occurs when the number of skip-connects exceeds two during the search, and it uses an early-stop strategy to avoid this degradation. Similar to DARTS+, we also introduce the prior observation that the accuracy of models with more than four skip-connects almost never exceeds 97%. Unlike DARTS+'s early-stop strategy, which only yields a structure with two skips, we reduce the search space to a reasonable range and search for a network structure within it. In addition, our method avoids sampling and training meaningless sub-networks, which greatly improves search efficiency.

| Preliminary: gradient-based NAS
For convolutional neural networks, DARTS searches for a normal cell and a reduction cell to construct the final network model. Each cell can be defined as a directed acyclic graph that contains N nodes, where each node represents a feature map. In the experiments, the default setting is N = 7, which contains two input nodes, four intermediate nodes, and one output node. The edge $e_{i,j}$ from node i to node j operates on the input feature $x_i$, and its output is denoted as $o_{i,j}(x_i)$. Each edge has an optional set of operations $\mathcal{O} = \{o^1_{i,j}, o^2_{i,j}, \ldots, o^M_{i,j}\}$. DARTS uses softmax to combine the candidate operations into a mixed operation, so the mixed operation of edge $e_{i,j}$ is
$$\bar{o}_{i,j}(x_i) = \sum_{k=1}^{M} \frac{\exp(\alpha^k_{i,j})}{\sum_{k'=1}^{M} \exp(\alpha^{k'}_{i,j})}\, o^k_{i,j}(x_i),$$
where $\alpha_{i,j} = \{\alpha^1_{i,j}, \alpha^2_{i,j}, \ldots, \alpha^M_{i,j}\}$ is the probability vector corresponding to the operations on edge $e_{i,j}$.
SNAS and FBNet use Gumbel-softmax instead of softmax to select the operations, so that the structure parameters gradually converge to softened one-hot variables. The corresponding mixed operation of edge $e_{i,j}$ is
$$\bar{o}_{i,j}(x_i) = \sum_{k=1}^{M} Z^k_{i,j}\, o^k_{i,j}(x_i), \qquad Z^k_{i,j} = \frac{\exp\big((\log \alpha^k_{i,j} + G^k_{i,j})/\tau\big)}{\sum_{k'=1}^{M} \exp\big((\log \alpha^{k'}_{i,j} + G^{k'}_{i,j})/\tau\big)},$$
where $Z_{i,j}$ is the softened one-hot random variable for operation selection on edge $e_{i,j}$, $G^k_{i,j} = -\log(-\log(U^k_{i,j}))$ is the k-th Gumbel random variable, $U^k_{i,j}$ is a uniform random variable, and $\tau$ is the temperature. GDAS no longer uses the softened one-hot random variables, which require M times the computational memory and time. Instead, the parameters are sampled as strictly one-hot vectors to reduce computation time and memory:
$$\bar{o}_{i,j}(x_i) = \sum_{k=1}^{M} f(Z_{i,j})^k\, o^k_{i,j}(x_i),$$
where f is a function that converts softened one-hot variables into strict one-hot variables. Each intermediate node $x_j$ sums the outputs of all its incoming edges:
$$x_j = \sum_{i < j} \bar{o}_{i,j}(x_i).$$
The output node of the cell concatenates all the intermediate nodes. The neural architecture search problem is then to optimize the architecture parameters $\alpha_{i,j}$ and the network operation parameters $w^*$ to minimize the validation loss.
At present, all of the gradient-based methods except SNAS use bi-level optimization. In this article, we also use bi-level optimization and the same search space as DARTS. However, we no longer use softmax for operation selection; instead, we use Gumbel-softmax to obtain one-hot architecture parameters.
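To make the operation-selection step concrete, the following minimal PyTorch sketch (our illustration, not code from any of the cited papers; names such as `alpha` and `tau` are assumptions) shows how a Gumbel-softmax sample over one edge can be hardened into a strict one-hot vector in the GDAS style while keeping gradients through the soft sample.

```python
import torch
import torch.nn.functional as F

def sample_edge_operation(alpha: torch.Tensor, tau: float, hard: bool = True):
    """Sample a (softened or strict) one-hot vector over the M candidate operations
    of a single edge from the architecture logits `alpha` (shape [M])."""
    gumbel = -torch.log(-torch.log(torch.rand_like(alpha)))  # G = -log(-log(U))
    z = F.softmax((alpha + gumbel) / tau, dim=-1)            # softened one-hot Z
    if hard:
        # GDAS-style straight-through: strict one-hot forward, gradients through z
        index = z.argmax(dim=-1, keepdim=True)
        one_hot = torch.zeros_like(z).scatter_(-1, index, 1.0)
        z = (one_hot - z).detach() + z
    return z

# Example: one edge with 8 candidate operations, search temperature starting at 10
alpha = torch.zeros(8, requires_grad=True)
z = sample_edge_operation(alpha, tau=10.0)
```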

| Existing problems
• Problem 1: The unfair advantage of skip-connect leads to performance degradation. Specifically, skip-connect has an unfair advantage [19] that makes it easy to sample, and this advantage is constantly amplified over the iterations, so too many skip-connects are eventually produced, which in turn degrades performance. To be more precise, the accuracy of networks with more than four skip-connects hardly exceeds 97%. Consider the cell of DARTS as an example: each cell contains 14 edges that need to select operations, which means the number of skip-connects can range from 0 to 14; in other words, most of the sampled networks are meaningless. That is, computing time and resources are not allocated reasonably.
• Problem 2: A cell consists of seven nodes including four intermediate nodes, so a cell contains 14 edges that must select operations, and each edge offers eight candidate operations. When the sampled architecture parameter is a one-hot tensor, that is, only one operation is activated per edge, the size of the search space is $8^{14}$. In the previous gradient-based search methods, with the settings batch_size = 64, epochs = 200, and train_dataset : validate_dataset = 1:1, the number of sampled networks is only 200 × 25,000 / 64 = 78,125, which is far smaller than the $8^{14}$ configurations in the search space. This means that the previous search strategy is insufficient to explore the search space, and it can hardly cover the optimal sub-network.
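As a quick sanity check of these numbers, the following plain-Python sketch (split and batch size taken from the text above) contrasts the size of the search space with the number of sub-networks that can ever be sampled.

```python
# Rough arithmetic behind Problem 2
search_space = 8 ** 14                       # 14 edges, 8 candidate operations each
epochs, train_size, batch_size = 200, 25_000, 64
sampled = epochs * train_size // batch_size  # one sub-network per mini-batch

print(search_space)                          # 4_398_046_511_104
print(sampled)                               # 78_125
print(sampled / search_space)                # ~1.8e-8 of the space is ever visited
```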

| Pre-training the superNet similar to warm-up
For Problem 1, to mitigate the unfair advantage of skip-connect, an idea similar to warm-up, which we call 'pre-training the superNet', is proposed. In the early stage of the search process, sampling is completely random and only the operation parameters are trained, so every candidate sub-network receives the same training opportunity. After pre-training the superNet, different operations have the same degree of convergence, providing a fair and reasonable competition platform for the convergence of the architecture parameters. Specifically, the superNet is first trained for K epochs, and bi-level optimization is then used to train the architecture parameters and network parameters in the remaining 200 − K epochs. The pseudo code is shown in Algorithm 1; a simplified sketch of the idea follows.
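The sketch below is our hedged illustration of this warm-up idea, not a reproduction of the paper's Algorithm 1; the `supernet.loss` interface and the loader pairing are assumptions.

```python
def search(supernet, w_optimizer, a_optimizer, train_loader, val_loader,
           total_epochs: int = 200, K: int = 50):
    """Warm up the superNet for K epochs, then switch to bi-level optimization."""
    for epoch in range(total_epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            if epoch >= K:
                # outer (architecture) step on validation data
                a_optimizer.zero_grad()
                supernet.loss(x_val, y_val).backward()
                a_optimizer.step()
            # inner (weight) step on training data; during the first K epochs the
            # architecture is sampled uniformly, so every operation trains equally
            w_optimizer.zero_grad()
            supernet.loss(x_tr, y_tr).backward()
            w_optimizer.step()
```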

| Search space exploration strategy
For Problem 2, the search space is enormous but its exploration is insufficient. Moreover, following the prior observation in Problem 1, the NAS process samples a large number of meaningless sub-networks with more than S skip-connects per cell, which wastes considerable computational resources.
To fully and reasonably explore the search space, we propose two methods to reduce the search space and avoid sampling meaningless sub-networks, and further increase the number of candidate network samples.
First, a skip-controller is proposed that only samples meaningful sub-networks containing at most S skip-connects. The analysis of the parameter S is detailed in Section 5.3. On the one hand, the search space is greatly reduced, from the initial $8^{14}$ to approximately $8^{14} \times 0.21$, which reduces the difficulty of finding the optimal network. On the other hand, since only networks with at most S skip-connects are trained, the final searched network must lie in this narrowed search space, which solves the problem of the excessive number of skip-connects mentioned in Problem 1 and also improves the robustness of the search results. The pseudo code of the skip-controller is shown in Algorithm 2, and a simplified sampling sketch is given at the end of this subsection. Second, we increase the number of samples. It is easy to see from Algorithm 1 that the number of samples is related to the number of epochs, the batch size, and the size of the training data set:

$$\text{samples} = \text{epochs} \times \text{train\_dataset\_size} \,/\, \text{batch\_size} \qquad (7)$$
So we adjust the dataset ratio from 1:1 to 4:1, increasing the number of sampled sub-networks by 60% without increasing the search time. Specifically, this can be expressed on the basis of Algorithm 1 as Algorithm 3.
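As a worked example of Equation (7), using the epoch count and batch size quoted above:

```python
def n_samples(epochs: int, train_dataset_size: int, batch_size: int) -> int:
    # Equation (7): one sub-network is sampled per mini-batch of training data
    return epochs * train_dataset_size // batch_size

print(n_samples(200, 25_000, 64))   # 78_125  with the original 1:1 split
print(n_samples(200, 40_000, 64))   # 125_000 with the 4:1 split (~60% more samples)
```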
It is easy to see that these two tricks not only greatly reduce the search space but also expand the scope of exploration. Therefore, our method is more conducive to finding network structures with excellent performance. Our method uses the same search space as DARTS but analyses the problem from the perspective of the candidate models. During training, only meaningful candidate sub-networks are sampled, and these contain the optimal solution with high probability. Furthermore, only the sampled sub-networks are trained, so their loss is lower than that of network models that are never sampled, and the optimization of the structural parameters is therefore pushed towards meaningful sub-networks.
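The following is our hedged sketch of the skip-controller sampling step referred to above (the paper's Algorithm 2 is not reproduced in the text); the tensor layout, `skip_index`, and rejection-sampling loop are our assumptions.

```python
import torch
import torch.nn.functional as F

def sample_cell_with_skip_control(alphas: torch.Tensor, tau: float,
                                  skip_index: int, S: int = 4) -> torch.Tensor:
    """alphas: edge logits of shape [n_edges, n_ops]; skip_index: position of the
    skip-connect operation in the candidate set. Resample until at most S skips."""
    while True:
        gumbel = -torch.log(-torch.log(torch.rand_like(alphas)))
        z = F.softmax((alphas + gumbel) / tau, dim=-1)   # softened one-hot per edge
        ops = z.argmax(dim=-1)                           # sampled operation indices
        if (ops == skip_index).sum().item() <= S:
            return ops                                   # accept only <= S skips

# Example: 14 edges, 8 candidate operations, skip-connect assumed at index 0
ops = sample_cell_with_skip_control(torch.zeros(14, 8), tau=10.0, skip_index=0, S=4)
```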

| Resource constraint
Although limiting the number of skip-connects in the search space stabilises the search results and reduces the difficulty of finding the optimal sub-network, the advantage of structures with fewer parameters is offset, so the search tends towards operations with parameters and yields models with many parameters. To solve this problem, we follow MorphNet [21] and restrict the network model from three perspectives during the search: the model parameter size, the number of floating-point operations (FLOPs), and the memory access cost (MAC) [12]. We define H, W as the output spatial dimensions, f, k as the filter dimensions, I, O as the numbers of input and output channels, and g as the number of groups in a group convolution. According to MorphNet, for a convolutional layer the three costs can be calculated as
$$\text{Params} = \frac{f \cdot k \cdot I \cdot O}{g}, \qquad \text{FLOPs} = \frac{H \cdot W \cdot f \cdot k \cdot I \cdot O}{g}, \qquad \text{MAC} = H \cdot W \cdot (I + O) + \frac{f \cdot k \cdot I \cdot O}{g}.$$
Following SNAS, the pooling and skip-connect operations should be distinguished from convolutions and have their FLOPs and MAC defined separately, since they contain no trainable parameters. The final NAS objective can therefore be redefined as a constrained optimization problem:
$$\min_{\alpha} \; L_{val}\big(w^*(\alpha), \alpha\big) + \eta \cdot C(\alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} L_{train}(w, \alpha),$$
where C(α) is the cost of the sub-network associated with the random variables α that decide which operations are sampled for training, and η is the coefficient of C(α).
Since the cost of each operation is fixed, C(α) is linear in the one-hot vector α. In other words, the cost of each sampled sub-network is the one-hot structure parameter multiplied by the cost of each operation. It can therefore be added to the loss function as a regular term, so that the training process tends towards network structures with low computational cost.
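The snippet below is our illustration of this regular term; `op_costs` (a precomputed per-edge table of operation costs) and the toy values are assumptions, not quantities defined in the paper.

```python
import torch

def resource_cost(choices, op_costs):
    """choices: per-edge one-hot vectors; op_costs: per-edge cost vectors holding,
    e.g., FLOPs + MAC + #params for every candidate operation."""
    return sum((z * c).sum() for z, c in zip(choices, op_costs))

def total_loss(val_loss, choices, op_costs, eta: float = 1e-6):
    # C(alpha) is linear in the one-hot choices, so it enters the loss as a regular term
    return val_loss + eta * resource_cost(choices, op_costs)

# Toy usage: 14 edges, 8 operations, random one-hot choices and random costs
choices = [torch.eye(8)[torch.randint(8, (1,)).item()] for _ in range(14)]
op_costs = [torch.rand(8) for _ in range(14)]
loss = total_loss(torch.tensor(1.0), choices, op_costs)
```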

| Data augmentation adjustment
Through experiments, we found that the validation set should not use the same data augmentation strategy as the training set. We process the validation set in the same way as the test set to achieve higher accuracy. A possible explanation is that the architecture parameters are few and do not require overly complex data samples. Moreover, data augmentation is better suited to training the operation parameters to prevent over-fitting: the architecture parameters are mainly used to distinguish performance differences between operations, and data augmentation may obscure those differences.
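A minimal torchvision sketch of this adjustment (the exact transform parameters and normalization statistics are our assumptions, not values from the paper): augmentation is applied only to the training split, while the validation split used for the architecture step is processed like the test set.

```python
from torchvision import transforms

CIFAR_MEAN, CIFAR_STD = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
])
val_transform = transforms.Compose([       # same as the test set: no augmentation
    transforms.ToTensor(),
    transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
])
```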

| Experiment setups
Searching setup: Our method samples a sub-network with Gumbel-softmax rather than training the whole network model as SNAS and FBNet do, so GDAS is taken as the baseline. The whole process consists of two stages: cell searching and training the final network. Different from GDAS, the 50k images are divided into a 40k training set and a 10k validation set. Following DARTS, the number of initial channels in the first convolutional layer is 16 and the number of computational nodes per cell is 4. Besides, the number of layers in one block M is set to 2, and the total number of search epochs is 200. The candidate operations contain eight different functions: (1) identity, (2) zero, (3) 3 × 3 depth-wise separable conv, (4) 3 × 3 dilated depth-wise separable conv, (5) 5 × 5 depth-wise separable conv, (6) 5 × 5 dilated depth-wise separable conv, (7) 3 × 3 average pooling, (8) 3 × 3 max pooling. We set K = 50, S = 4, and η = $10^{-6}$ in the experiments, obtaining three versions of the model: pre_trained (Pre), pre_trained + skip_controller (Pre + skip), and pre_trained + skip_controller + resource_constraint (Pre + skip + res). SGD is applied to optimize the network weights with an initial learning rate of 0.025, annealed down to $10^{-3}$ following a cosine schedule; the weight decay is $10^{-3}$ and the momentum is 0.9. The temperature of the Gumbel-softmax is initialized to 10 and is linearly reduced to 0.3. The architecture parameters are optimized with Adam using a weight decay of $10^{-3}$ and a learning rate of $3 \times 10^{-4}$.
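The hyper-parameters above can be wired up as in the following hedged PyTorch sketch; `supernet.weights()` and `supernet.arch_parameters()` are assumed accessor names, not an interface defined in the paper.

```python
import torch

def build_optimizers(supernet, epochs: int = 200):
    # SGD with cosine annealing for the operation weights (0.025 -> 1e-3)
    w_opt = torch.optim.SGD(supernet.weights(), lr=0.025,
                            momentum=0.9, weight_decay=1e-3)
    w_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        w_opt, T_max=epochs, eta_min=1e-3)
    # Adam for the architecture parameters
    a_opt = torch.optim.Adam(supernet.arch_parameters(),
                             lr=3e-4, weight_decay=1e-3)
    return w_opt, w_sched, a_opt

def gumbel_temperature(epoch: int, epochs: int = 200) -> float:
    # linear decay of the Gumbel-softmax temperature from 10 to 0.3
    return 10.0 - (10.0 - 0.3) * epoch / (epochs - 1)
```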

| Experimental results
Results on CIFAR: After the search procedure, we stack the best-performing cell to form a CNN with the cell number set to 20, the block number set to 6, and the node number set to 4. When the complete network is built, the learning rate is initialized to 0.025 and annealed to zero with a cosine learning-rate schedule. Similar to GDAS, we train the network for 600 epochs in total. Following existing works [5], additional enhancements consist of cutout, a path drop-out probability of 0.2, and auxiliary towers [22] with weight 0.4. Other standard pre-processing and data augmentation, such as horizontal flipping and normalization, are also applied. The experimental results of the three versions and other state-of-the-art results are listed in Table 1. The accuracies of the three versions are higher than those of the other Gumbel-based gradient methods. From Table 2, it is easy to see that adding the skip_controller significantly enhances the stability of the results. Finally, the resource constraint trick for model compression reduces the parameter size significantly with little effect on accuracy.
Results on ImageNet: ILSVRC2012 is used to test the cells discovered on CIFAR-10. Following GDAS, we train the model with 14 cells and 50 initial channels from scratch for 250 epochs with batch size 512 on four Nvidia Titan XP GPUs, which takes 6 days and yields 5.0M parameters. The network parameters are optimized using an SGD optimizer with an initial learning rate of 0.2 (decayed linearly after each epoch), a momentum of 0.9, and a weight decay of $3 \times 10^{-5}$. Additional enhancements including label smoothing and an auxiliary loss tower are used during training. Because a large batch size and learning rate are adopted, learning-rate warm-up is also used for the first five epochs. In the end, the architectures searched using pre_GDAS and pre_GDAS + skip_controller achieve accuracies of 74.6% and 74.9% on ImageNet, surpassing the previous Gumbel-based algorithms. On ImageNet, the accuracy of pre_skip (pre_trained + skip_controller) again exceeds that of pre_GDAS (pre-training the superNet based on GDAS), indicating its strong transfer capability. The evaluation results and previous state-of-the-art methods are compared in Table 3. It is easy to see that our method outperforms the baseline GDAS by about 1.5%, which proves the success of our improvement. Furthermore, when the architecture searched on CIFAR-100 is transferred to ImageNet, its accuracy is further improved to 75.6%.

| Skip_controller improves accuracy
To prove the effectiveness of the skip_controller, we take GDAS as the baseline and only add the skip_controller without changing anything else; the resulting model is named GDAS_skip. As shown in Table 4, the accuracy of GDAS_skip is higher than that of the original GDAS, and even that of GDAS (FRC), which is obtained by running GDAS with a manually designed reduction cell. This shows that the skip_controller can indeed reduce the search space and thus the difficulty of finding high-quality network models.

| Pre-train the superNet to get higher accuracy
To illustrate that pre-training the superNet helps level the playing field between operations and improves accuracy, we pre-train the superNet based on GDAS and name the result Pre_GDAS. Different from GDAS, we believe that when a one-hot vector is used to sample a sub-network model, as many sub-network models as possible should be sampled to increase the probability of covering the best sub-network. Therefore, the ratio between the training set and the validation set is set to 4:1. The whole experiment is repeated five times with different random seeds. The results are listed in Table 5. It is easy to see that our method outperforms the baseline GDAS.
TABLE 1 Classification errors of our method and previous state-of-the-art methods on CIFAR. EV, RL, and G denote evolution, reinforcement learning, and gradient-based methods respectively, while Gumbel denotes gradient methods based on Gumbel-softmax. It is worth noting that our average error rate of 2.68% with a low variance of 0.005% is much lower than that of the other Gumbel-softmax methods, and our best result on CIFAR-10 is 2.57%. In particular, our algorithm improves the accuracy on CIFAR-100 by 1% compared with GDAS.

| Skip-controller contributes to further improvement of accuracy and robustness
As mentioned above, DARTS is inherently unstable, and so is Pre_GDAS. As shown in Table 5, although the results of Pre_GDAS are better than those of GDAS, their variance is large, which means the method is not robust. Our robustness is reflected in two aspects: first, the model searched by our method will not contain too many skip-connects; second, because the skip-controller restricts the search space to a smaller range, the variance of the accuracy of the searched models also decreases. To further illustrate that the skip-controller enhances stability and improves accuracy, we add the skip_controller on top of Pre_GDAS to obtain the version Pre_GDAS + skip_controller, which has the highest accuracy. Different random seeds are applied to repeat the experiment from scratch five times. Comparing Tables 5 and 2, it is easy to see that with the help of the skip-controller the accuracy is improved and the variance is reduced.

TABLE 3 Classification errors of our method and previous state-of-the-art methods on ImageNet (columns: Params (M), +× (M), GPU-days, Top-1, Top-5). +× indicates the number of multiply-add operations. It is easy to see that our approach is superior to our baseline GDAS, and even comparable to the latest state-of-the-art methods FairDARTS and DARTS+.

To compare transfer capability, we select the best models from them and transfer them to CIFAR-100 and ImageNet, respectively. Pre_skip still achieves better results than Pre_GDAS in Table 6, which proves that Pre_skip has better transfer capability.

| Analysis of parameter S in skip-controller
We set up the skip-controller mainly to avoid training meaningless sub-networks and to constrain the search space to sub-networks with at most S = 4 skip-connects, avoiding the skip aggressive problem. We randomly sampled sub-networks with more than four skip-connects from the superNet and tested their accuracy, and found that it is unlikely to exceed 97%. Therefore, it is unnecessary to sample and train sub-network structures with more than four skip-connects. Constraining the number of skip-connects not only avoids wasting computing resources but also avoids the instability and performance degradation caused by an excessive number of skip-connects.
For each S, we perform four repeated experiments, and the results are shown in Table 7. When S is large, the skip-controller is almost useless: the final searched model has more skip-connects and the network structure is too sparse.
When S is too small, the candidate sub-networks in the search space have more parameters. However, more parameters do not mean higher accuracy, and a huge model with redundant connections is not a good result. Therefore, S should take an appropriate value, which avoids the skip aggressive problem while avoiding over-large searched models.
Setting S to 2 does not mean that the accuracy must be higher; it only means sampling in a search space with skip counts of 0, 1, and 2. We found that if this restriction is too strong, the search process will tend towards sub-networks with more parameters (with few or no skips), because the sampled sub-networks are parameter-heavy and sparse network structures are lacking. Just as in general machine learning, when there are too many positive samples for certain features and negative samples are lacking, the learning algorithm will drift towards those features, forming a kind of over-fitting. When the skip-controller constraint is too strong, the samples (candidate sub-networks) used to train the structural parameters are dominated by parameter-heavy structures while candidate sub-networks with sparse structures are lacking, so the structural parameters tend towards redundant, parameter-heavy structures during convergence, bringing another kind of over-fitting. This may lead to the worst result: a final network structure that is too redundant.
When S is small, the search tends to produce networks with huge numbers of parameters, which causes over-fitting; when S is large, it tends to produce sub-network structures with too many skips. Therefore, the constraint on skips needs an appropriate value. Of course, S can also be set to 2, and it is still possible to find sub-network structures with good accuracy, but the probability of obtaining too many parameters is greater. Therefore, setting S to four is a reasonable choice: it eliminates meaningless sub-networks, keeps the search space reasonable, and yields a small number of parameters with high accuracy. DARTS+ searches in the entire search space, so sub-networks with too many skips are included, which is prone to the skip aggressive problem; to avoid too many skip-connects during the NAS process, it stops when skip == 2. Unlike this early stop, we search in the skip-restricted search space and solve the skip aggressive problem from the perspective of the search space. Our method limits the search space to a reasonable range by excluding meaningless candidate sub-networks; in other words, the final network structure must be one of the networks in this search space. Therefore, the skip aggressive problem is avoided at the root, from the perspective of the search space, and even if other operations in the cell change, the harm caused by excessive skips is still avoided.

TABLE 7 We perform four repeated searches for each S. When S is large, the network structure is too sparse; when S is small, the model has too many parameters.

| Resource constraint
Unfortunately, Pre_skip results in a large model of around 4.1M parameters, so the resource constraint is applied for model compression. As shown in Table 8, the parameter size decreases significantly with little influence on accuracy.

| CONCLUSION
A more robust NAS algorithm with a skip-controller is proposed. We pre-train the superNet to mitigate aggressive skip-connects. Furthermore, the skip-controller is applied to limit the number of skips and avoid computation on meaningless sub-networks, bringing stability to the NAS algorithm. Finally, the resource constraint is used to compress the model effectively.
Through experiments on CIFAR and ImageNet, our method is superior to other Gumbel-softmax-based baseline methods and is comparable to the state-of-the-art methods, with the error rates of 2.57% and 17.19% on CIFAR-10 and CIFAR-100, respectively. When transferred to the ImageNet, it reaches the test error of 24.4%.
In the future, we will try to search on ImageNet directly, since our algorithm does not require huge memory consumption. We will also test the validity of the skip-controller on other tasks.