Article

Efficient and Lightweight Visual Tracking with Differentiable Neural Architecture Search

1 School of Cyber Science and Engineering, Qufu Normal University, Qufu 273165, China
2 Yuntian Group, Dezhou 253700, China
3 Network and Information Center, Qufu Normal University, Qufu 273165, China
4 School of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(17), 3623; https://doi.org/10.3390/electronics12173623
Submission received: 20 July 2023 / Revised: 19 August 2023 / Accepted: 22 August 2023 / Published: 27 August 2023

Abstract

Over the last decade, Siamese network architectures have emerged as the dominant tracking paradigm and have led to significant progress. These architectures are made up of a backbone network and a head network. The backbone network comprises two identical feature extraction sub-branches, one for the target template and one for the search candidate. The head network takes both the template and candidate features as inputs and produces a local similarity score for the target object at each location of the search candidate. Despite the promising results attained in visual tracking, challenges persist in developing efficient and lightweight models due to the inherent complexity of the task. Specifically, existing tracking models are designed manually and rely heavily on the knowledge and experience of domain experts. In addition, existing tracking approaches achieve excellent performance at the cost of large numbers of parameters and vast amounts of computation. A novel Siamese tracking approach called TrackNAS, based on neural architecture search, is proposed to reduce the complexity of the neural architecture applied in visual tracking. First, according to the principle of the Siamese network, backbone and head network search spaces are constructed, constituting the search space for the network architecture. Next, under the given resource constraints, the network architecture that meets the tracking performance requirements is obtained by optimizing a hybrid search strategy that combines distributed and joint approaches. Then, an evolutionary method is used to lighten the network architecture obtained from the search phase under a FLOPs constraint, facilitating deployment to resource-constrained devices. Finally, to verify the performance of TrackNAS, comparison and ablation experiments are conducted using several large-scale visual tracking benchmark datasets, namely OTB100, VOT2018, UAV123, LaSOT, and GOT-10k. The results indicate that the proposed TrackNAS achieves competitive performance in terms of accuracy and robustness, and its number of network parameters and computation volume are far smaller than those of other advanced Siamese trackers, meeting the requirements for lightweight deployment to resource-constrained devices.

1. Introduction

Visual tracking is an essential branch of computer vision research. It requires designing a model that continuously tracks a target object, specified only in the initial frame, throughout a video sequence. Visual tracking is widely used in the real world, for example, in traffic surveillance, intelligent robots, and unmanned vehicles. However, due to the complexity and variability of object states in video sequences, existing visual tracking methods require increasingly high computational and storage overheads to improve tracking performance. Therefore, achieving real-time visual tracking in complex scenarios is an important challenge for the development of the computer vision field [1,2].
In recent years, convolutional neural networks (CNNs) have been widely used in various computer vision tasks because of their excellent feature extraction and representation ability. As a special CNN architecture, the Siamese network can measure the similarity among inputs well, making it suitable for visual tracking [3]. Thus, visual tracking approaches based on the Siamese network (a.k.a. Siamese trackers) have been developed [4,5,6,7,8]. However, the CNNs used by Siamese trackers are generally migrated directly from other computer vision tasks and are not tailored to visual tracking [9]. Additionally, these network architectures are designed and trained manually, and their design processes are often influenced by subjective and objective factors: on the one hand, experts may shape the CNN architectures according to their own experience; on the other hand, when experts' knowledge is insufficient, multiple trials are needed to determine the best CNN architecture, which requires considerable time and effort [10,11,12]. In addition, although the existing Siamese trackers have made considerable progress toward high tracking performance [5,6,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28], the network architectures used are becoming increasingly heavy and expensive, resulting in larger numbers of parameters and computation volumes, as shown in Figure 1. For example, the recent Siamese trackers SiamRPN++ [15] and SiamFC++ [14] utilize 12.9 GFLOPs and 20.3 GFLOPs as well as 11.2 million and 16.6 million parameters, respectively (tracker efficiency was evaluated here using AlexNet [29]), to achieve advanced tracking performance, and this high complexity makes it difficult to deploy these trackers to resource-constrained devices, such as unmanned aerial vehicles, industrial robots, and driver assistance systems [30].
To alleviate the aforementioned problems, we propose TrackNAS, a novel Siamese tracker based on neural architecture search (NAS) [31,32,33,34]. We use NAS to overcome the dependence on expert knowledge when searching for efficient and lightweight CNN architectures suitable for visual tracking, allowing Siamese trackers to be deployed on devices with hardware resource constraints. Designing CNN architectures within the NAS framework requires the consideration of three aspects: the search space, that is, which CNN architectures can be searched for in principle; the search strategy, that is, how the search space is explored, specifically, how the current architecture to be evaluated is chosen and how the next architecture is selected based on the architectures already explored and their evaluation results; and performance evaluation, that is, assessing the performance of the architectures found with the aim of identifying the best one in the search space, which is the key factor determining the performance of the NAS framework. In TrackNAS, the backbone network search space mainly contains mobile inverted bottleneck convolution (MBConv) cells [35], and the head network search space mainly contains depthwise separable convolution (DSConv) cells [36,37], with the aim of increasing the computational efficiency of the network architecture and reducing its complexity. In the search strategy, the backbone network is pretrained first; then, different search schemes are used for the backbone and head networks, and the search is performed jointly on the benchmark datasets to obtain a network architecture suitable for visual tracking. Additionally, an evolutionary method is used to lighten the obtained network architecture to meet the resource requirements of practical applications, and retraining is performed to optimize the network parameters. The obtained network architecture is assessed via detailed comparison and ablation experiments on large-scale evaluation benchmark datasets. Experimental results indicate that TrackNAS achieves excellent tracking performance with far fewer parameters and computations than other advanced Siamese trackers.
In summary, this work makes the following contributions:
  • We propose TrackNAS, a new Siamese tracker based on single-path one-shot NAS and differentiable NAS that facilitates deployment to resource-constrained devices.
  • We design a series of search spaces and strategies to find promising network architectures for visual tracking, reducing the computational burden and improving the tracking performance.
  • Extensive experiments on several large-scale datasets show the efficiency and effectiveness of our approach. TrackNAS achieves state-of-the-art performance using few storage and computational resources.
The remainder of this paper is organized as follows: Section 2 briefly reviews representative works related to this study. Section 3 describes the proposed TrackNAS algorithm. Section 4 introduces and discusses the experimental results for several challenging large-scale evaluation datasets. Finally, conclusions are presented in Section 5.

2. Related Work

2.1. Neural Architecture Search

In recent years, NAS [38] has received increasing attention in academia and industry, with the aim of automatically searching for network architectures that achieve the best results for given tasks. A hallmark event in the development of NAS occurred in 2016, when reinforcement learning was used to search for network architectures in image recognition and language processing, representing a breakthrough over traditional manually designed models [31]. In 2018, Pham et al. proposed ENAS [32], which, by sharing model parameters, explored models better than those designed by humans in just 1 GPU day. Recent studies on NAS have focused on the following three aspects [39,40].
The search space defines the network architectures that can be sampled and is usually divided into a global search space and a local search space. A global search space spans the entire network architecture. Zoph et al. proposed setting up a search space at each layer of the CNN architecture [31], including the height, width, and number of convolutional kernels. Such an approach tends to consume considerable computational and storage resources. Instead, NAS methods typically use local search spaces, also known as cell-based search [41,42]. Cells are divided into normal and reduction cells, and the final network architecture is formed by stacking and splicing them. Additionally, to reduce the search time, the internal architectures of all normal cells and of all reduction cells are often shared, so the entire network architecture is searched over only these two cells. Differentiable architecture search (DARTS) [34] is a typical cell-based search method with a smaller search space because it only searches the operations on each path. In addition, there is a special type of local search space. Methods such as ProxylessNAS [33] and MDENAS [43] also adopt cell-based search, but the architecture is not shared between normal and reduction cells. In these methods, the search space is based on an existing network architecture, such as MobileNet [35,36], and the obtained network architecture is optimized using different strategies.
Search strategies are used to explore the network architectures in the search space and find the architecture with the best performance. Commonly used search strategies include random search [44], Bayesian optimization [45], evolutionary methods [46], reinforcement learning [31], and gradient-based methods [34]. Zoph et al. used a recurrent neural network as a controller to generate the architectural parameters of each layer of the network and used reinforcement learning to continually train this controller [31]. This method is simple but consumes considerable computational resources. Baker et al. improved on it by searching for the network architecture with a metamodeling algorithm, but this is still time-consuming [47]. Real et al. used an evolutionary method to search in a relatively large space and achieved superior image classification performance [46]. However, the large-scale search space inevitably increases the computational burden. DARTS uses a gradient-based search method, which relaxes the discrete search space and significantly increases the speed of architecture search; convergence of the search phase can be achieved in only 1 day of training on a single GPU [34]. Marvasti-Zadeh et al. applied DARTS to the field of visual tracking and proposed the CHASE tracker [48], which automates the network architecture design during offline training and effectively integrates the obtained network architecture into a manually designed Siamese network.
Performance evaluation refers to assessing the performance of the obtained network architectures. Early evaluation methods [31,34,41] involved training each obtained network architecture until convergence and then evaluating it on the validation set, which consumed large amounts of time and resources. For the method proposed by Real et al., training on 450 GPUs took close to a week [42]. Many solutions to this problem have been developed, which can be broadly classified into three categories. The first category trains only on partial or low-resolution datasets, or reduces the number of training steps, accepting lower-accuracy predictions in exchange for speed. The second category predicts the final accuracy by extrapolating learning curves. The third category treats all candidate network architectures as subnets of a larger supernet, so the subnets need not be trained independently; ultimately, only one of the subnets is retained, which is why this is often referred to as the one-shot method, and it has been used for performance evaluation in many studies [49]. The aforementioned ENAS significantly reduces the training time by sharing the weights of operations among different network architectures [32]. However, weight sharing perturbs the parameters of the individual networks, which causes model instability and affects the final performance. The methods of Cai et al. [50] and Stamoulis et al. [51] address this defect by implementing weight sharing at the level of individual operations, refining its granularity. Because methods in this third category explore only one or two network paths at a time, they are often referred to as single-path methods; they not only search for a better network architecture but also significantly reduce the storage overhead during training.
The rapid emergence of new studies indicates that NAS is moving toward multi-task applications. Initially, the problems addressed by NAS focused on image classification and on reducing the search overhead. As researchers have developed more complex network architecture designs, fields such as object detection [52], semantic segmentation [53], surface scratch detection [54], and SAR image processing [55] have seen considerable progress. In particular, in the field of visual tracking, NAS can exploit its ability to search automatically for the most suitable network architecture to find one dedicated to visual tracking, reducing the reliance on prior knowledge. Moreover, the number of parameters and the computational burden can be taken into account when designing the network architecture, reducing the complexity of the model.

2.2. Siamese Trackers

Because of their excellent similarity measurement ability and robustness, Siamese networks [56] can cope well with occlusion, scale variations, and motion blur. They have attracted considerable attention in the field of visual tracking and have become a research hotspot [1,2,3,57]. SINT [4], proposed by Tao et al. in 2016, is a groundbreaking Siamese tracker. By training a matching function with the Siamese network, each candidate region in subsequent frames is matched against the target template obtained in the first frame, and the candidate region with the highest score gives the target position in the current frame. However, because the hardware of the time was far from meeting the computational requirements of its optical flow component, SINT only ran at 2 frames per second (FPS), making real-time tracking impossible. In the same year, Bertinetto et al. proposed SiamFC [5], which uses a fully convolutional Siamese network for visual tracking. SiamFC has two branches: one for target template feature extraction and another for candidate region feature extraction. The target position is determined by calculating the similarity between the template and candidate features. Compared with SINT, SiamFC has a higher tracking speed and accuracy. In addition, SiamFC relies entirely on offline-learned similarity matching and keeps the target template fixed, which keeps it simple and fast; it has become one of the representative approaches in the field of visual tracking and is widely used in practical scenarios.
Many researchers have improved the backbone network used by SiamFC. SA-Siam [58] improves the tracker's ability to discriminate targets by combining semantic and appearance branches during inference, whereas RASNet [59] uses residual attention, channel attention, and global attention to weight the template features of SiamFC. StructSiam [60] adds a local pattern detection module and uses a message-passing module to extract contextual information from the relevant patterns to further refine the predicted local patterns. Although each of these trackers has advantages and disadvantages, they have all contributed to the development of Siamese trackers. Another representative approach is SiamRPN, proposed by Li et al. [6], which transforms the visual tracking task into an object detection problem. By using the region proposal network (RPN) [61] in the tracking process, a more accurate target bounding box can be obtained, which increases the tracking speed while reducing the computational overhead; there is also a considerable accuracy improvement due to the enhanced ability of the tracker to characterize the target object. However, the tracking performance of SiamRPN deteriorates significantly when similar objects appear in the background surroundings. To address this problem, Zhu et al. proposed DaSiamRPN [62], which generates image pairs for training by performing data augmentation on object detection datasets. The authors introduced the COCO [63] and ImageNet-DET [64] datasets into the training, which significantly enriched the category information in the training set. Additionally, negative samples with semantic information were constructed to enrich the sample data and improve the discrimination ability of the tracker. Wang et al. proposed SiamMask [65], a modified version of SiamRPN that adds a mask branch to the Siamese network to achieve object segmentation; the network architecture was also optimized by replacing the original AlexNet [29] backbone with ResNet [66], which significantly improved the tracking performance. Later, Li et al. proposed SiamRPN++ [15], which uses a simple but efficient spatial-aware sampling strategy to overcome the limitations of deep Siamese trackers. By replacing the traditional dense cross-correlation with a depthwise cross-correlation operation, multichannel score maps with different semantic features are generated, and the tracking accuracy is increased. SPM-Tracker [67] and C-RPN [68] use lightweight backbone networks and stacked RPN refinement modules to improve discriminative power and tracking accuracy. Some recently proposed approaches, such as SiamCAR [10] and SiamBAN [17], avoid the excessive hyperparameters of the RPN and increase the tracking speed and accuracy by adopting the object detection framework FCOS [69] to predict the bounding box directly at the pixel level.
The aforementioned Siamese trackers, whose network architectures are manually designed, tend to have large numbers of parameters and large computation volumes, making it difficult for them to achieve both high accuracy and high speed on resource-constrained devices. The current research trend is to explore the use of NAS methods for designing Siamese trackers, with the aim of reducing the cost and complexity of the network design [48,70]. In this paper, the Siamese network is combined with NAS to explore efficient and lightweight network architectures designed specifically for high-performance visual tracking.

3. Proposed Approach

The overall architecture of TrackNAS, the proposed tracker based on the Siamese network and NAS, is shown in Figure 2. Considering that the network architecture of existing Siamese trackers is usually composed of a pretrained backbone network for feature extraction and a head network for object localization, search spaces are constructed separately for these two parts. The construction of the search spaces determines the size and performance of the model. The performance of candidate network architectures is evaluated on the training dataset, and the architecture with the best performance is selected, reducing the complexity of the model while ensuring high tracking performance.

3.1. Search Space

3.1.1. Basic Convolutional Cell

For constructing the search space, the standard convolution cell, DSConv cell, and MBConv cell are mainly used.
DSConv is derived from MobileNet [36], a lightweight network architecture proposed by Google in 2017, and consists of two parts: a depthwise convolution and a pointwise convolution. The depthwise convolution is essentially a channelwise convolution: each channel of the input feature map is convolved channel by channel with its own kernel, and the outputs of all the kernels are then combined to obtain the intermediate output. The pointwise convolution is a 1 × 1 convolution and plays two roles in DSConv: first, it allows DSConv to freely change the number of output channels; second, it performs channel fusion on the feature map produced by the depthwise convolution. The DSConv cell, as an alternative to standard convolution, can significantly increase computational efficiency. The overall architecture of the DSConv cell is shown in Figure 3a.
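To make the DSConv cell concrete, the following minimal PyTorch sketch (our illustration; the kernel size, normalization, and activation choices are assumptions rather than the exact configuration used in this paper) factorizes a standard convolution into a depthwise convolution followed by a 1 × 1 pointwise convolution.

import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: channelwise (depthwise) conv followed by a 1x1 (pointwise) conv."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one kernel per input channel (groups = in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_channels, bias=False)
        # Pointwise: 1x1 conv that fuses channels and sets the number of output channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: a 64-channel feature map is mapped to 128 channels at the same resolution.
y = DSConv(64, 128)(torch.randn(1, 64, 32, 32))  # y.shape == (1, 128, 32, 32)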
The MBConv cell is derived from MobileNetV2 [35]. The network is designed around an inverted residual module with a linear bottleneck in its last layer, which significantly increases accuracy while keeping the model small. The MBConv cell first raises the dimension via a 1 × 1 convolution with a typical expansion ratio of 4 or 6. Features are then extracted by a depthwise convolution module, possibly combined with a squeeze-and-excitation (SENet) module. Finally, a 1 × 1 convolution is used for dimension reduction. The architecture of the MBConv cell is shown in Figure 3b.
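The sketch below, under the same caveat that the concrete expansion ratio, kernel size, and SE reduction are placeholder choices of ours, illustrates an MBConv cell of this form: 1 × 1 expansion, depthwise convolution, optional squeeze-and-excitation, and a linear 1 × 1 projection with a residual shortcut when the input and output shapes match.

import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention: globally pool, squeeze, excite, and rescale the feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))

class MBConv(nn.Module):
    """Mobile inverted bottleneck: expand (1x1), depthwise conv, optional SE, linear project (1x1)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, expand=4, use_se=True):
        super().__init__()
        mid = in_ch * expand
        self.use_res = (stride == 1 and in_ch == out_ch)  # residual shortcut only when shapes match
        layers = [
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, kernel_size, stride, kernel_size // 2, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True)]
        if use_se:
            layers.append(SqueezeExcite(mid))
        layers += [nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch)]  # linear bottleneck
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out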

3.1.2. Search Space for Backbone Network Architecture

The backbone network is mainly used to extract CNN features from the input images for the subsequent tracking task. The feature extraction ability of the backbone network significantly affects the tracking performance [3]. The backbone networks commonly used in Siamese trackers include AlexNet [29] and ResNet [66], and many lightweight trackers employ MobileNet [35,36] as the backbone network. Inspired by these approaches, we construct the search space of the backbone network with the MBConv cell as the basic architecture; standard convolution cells are also added to the backbone network search space. Similar to most Siamese trackers [5,6], the total downsampling stride of the backbone network is set to 16. The settings of the backbone network search space $\mathcal{A}_b$ are presented in Table 1.

3.1.3. Search Space for Head Network Architecture

The head network is mainly used to predict the object location and size; it generally receives the feature maps from the backbone network as input and outputs the prediction results. The design of the head network is usually driven by the specific functional and performance requirements of the visual tracking approach. For example, some trackers need to output the exact location and size of the target object, whereas others only need to determine its approximate position. In general, the head network in recent Siamese trackers can be divided into two parts: a classification subnetwork and a regression subnetwork [6,10,17]. The former treats the visual tracking task as an object detection problem and classifies each pixel to derive the object location, while the latter typically uses bounding box regression to directly predict the object size. The head network architectures of mainstream Siamese trackers are often implemented by stacking standard 3 × 3 and 5 × 5 convolutions, but this can significantly increase the computational burden, resulting in models that cannot be deployed to resource-constrained devices. Therefore, in the present study, DSConv cells are used to increase computational efficiency. In addition, skip connections [66] are added to allow a structural difference between the two branches of the head network, that is, the regression and classification subnetworks may have different numbers of convolutional layers. In our proposed approach, the total downsampling stride of the head network is set to 8. The settings of the head network search space $\mathcal{A}_h$ are presented in Table 2.

3.2. Search Strategy

As mentioned in the previous analysis, one of the challenges faced by NAS is the long search time. To address this problem, a combination of the one-shot scheme and DARTS is used to search for the network architecture in this paper. Because the architecture of the backbone supernet is more complex, the weight-sharing strategy of the one-shot scheme is used to search the backbone supernet, which can significantly increase the efficiency of the search [70]. For the head network, the cell-based search scheme of DARTS is adopted, which can significantly reduce the search time [48]. According to the characteristics of the Siamese network architecture, the following search process is used: first, the backbone supernet is pretrained; then, the backbone and head network architectures are searched for in a distributed manner, after which the two are merged for a joint search. The evolutionary method is used for lightweighting after the search is completed, and finally, the obtained network architecture is fine-tuned to adjust the network parameters. An overview of the search strategy designed in this paper is shown in Figure 4.

3.2.1. Pretraining of Backbone Supernet

Siamese trackers generally use the ImageNet dataset to pretrain the backbone network to learn a more generic feature representation, improving both tracking accuracy and robustness. In our approach, we encode the search space of backbone architectures $\mathcal{A}_b$ into a supernet $\mathcal{N}_b$, denoted as $\mathcal{N}_b(\mathcal{A}_b, W_b)$, where $W_b$ denotes the parameters of the backbone supernet $\mathcal{N}_b$. The single-path one-shot scheme is used to pretrain $\mathcal{N}_b$ before performing NAS to reduce the time cost of the search. This involves uniformly sampling different backbone subnets $\alpha_b \in \mathcal{A}_b$ in $\mathcal{N}_b$ and then optimizing the parameters $W_b(\alpha_b)$ of each sampled subnet. The subnets sampled in this process are single-path, and each pretraining step is also performed in a single-path manner. Because the paths are sampled uniformly, the weights of all the subnets can be trained adequately and fairly. Figure 5 shows the single-path one-shot subnet architecture search process used in our approach. In each search step, a single-path subnet is randomly selected from the backbone network search space; "$\alpha_{b3}$" in the figure illustrates the currently selected single-path subnet architecture. Among all the single-path subnets in the supernet, only the currently selected one is trained and optimized; the other subnets are not involved in training. After a sufficient number of iterations, that is, after every single-path subnet $\alpha_b$ in $\mathcal{A}_b$ has been selected at least once, the parameters of each basic cell in $\mathcal{N}_b$ can be obtained. The pretraining procedure is performed by optimizing the classification loss function $\mathcal{L}^{cls}_{ptrain}$ on the ImageNet dataset as
$W_b^{ptrain} = \arg\min_{W_b} \mathcal{L}^{cls}_{ptrain}\big(\mathcal{N}_b(\mathcal{A}_b, W_b)\big)$  (1)
where $\mathcal{A}_b$ denotes the backbone network search space, and $W_b$ denotes the parameters of the backbone supernet $\mathcal{N}_b$. The pretrained weights $W_b^{ptrain}$ are shared among different subnet architectures and serve as the basis for the subsequent distributed search of the entire tracking network architecture. It is noteworthy that the pretraining procedure is performed only on the backbone supernet $\mathcal{N}_b$, not on the individual subnets $\alpha_b$, significantly reducing the training cost.
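The sketch below illustrates this uniform single-path sampling in PyTorch; the supernet layout, candidate cells, and training details are our own simplified assumptions and not the exact search space of Table 1.

import random
import torch.nn as nn
import torch.nn.functional as F

class ChoiceLayer(nn.Module):
    """One supernet layer holding several candidate cells; `choice` selects the active path."""
    def __init__(self, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)

    def forward(self, x, choice):
        return self.ops[choice](x)

class BackboneSupernet(nn.Module):
    """Stack of choice layers; a sampled path activates exactly one candidate cell per layer."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def sample_path(self):
        # Uniform sampling: every candidate in every layer is chosen with equal probability.
        return [random.randrange(len(layer.ops)) for layer in self.layers]

    def forward(self, x, path):
        for layer, choice in zip(self.layers, path):
            x = layer(x, choice)
        return x

def pretrain_step(supernet, cls_head, images, labels, optimizer):
    """One single-path pretraining step: only the cells on the sampled path receive gradients."""
    path = supernet.sample_path()
    optimizer.zero_grad()
    logits = cls_head(supernet(images, path))
    loss = F.cross_entropy(logits, labels)  # classification loss on ImageNet
    loss.backward()
    optimizer.step()
    return loss.item()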

3.2.2. Distributed Search for Backbone and Head Network Architectures

The distributed search applies different NAS methods to the backbone network architecture and the head network architecture. The backbone network architecture is searched with the same approach as that used for backbone supernet pretraining, that is, the single-path one-shot method. The head network in the Siamese tracker includes a classification subnetwork and a regression subnetwork, which determine the location and size of the target object, respectively. Searching for the head network by means of a global architecture search would consume considerable resources. Therefore, in this approach, DARTS is used to find the optimal head network architecture $\alpha_h^*$, as shown in Figure 6. We employ the gradient descent algorithm to minimize the classification loss $\mathcal{L}^{cls}_{ptrain}$ and the validation loss $\mathcal{L}_{val}$ on the ImageNet dataset to learn the pretrained weights of the head supernet $W_h^{ptrain}$ and the optimal architecture of the head subnet $\alpha_h$ as
$\alpha_h^{*} = \arg\min_{\alpha_h \in \mathcal{A}_h} \mathcal{L}_{val}\big(\mathcal{N}_h(\alpha_h, W_h^{ptrain}(\alpha_h))\big) \quad \text{s.t.} \quad W_h^{ptrain} = \arg\min_{W_h} \mathcal{L}^{cls}_{ptrain}\big(\mathcal{N}_h(\mathcal{A}_h, W_h)\big)$  (2)
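For reference, the continuous relaxation behind this DARTS-style search can be sketched as follows; the candidate operations, the first-order (alternating) update, and the split into two optimizers over disjoint parameter groups are illustrative assumptions rather than the exact procedure of this paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: the output is a softmax-weighted sum of all candidate ops."""
    def __init__(self, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # architecture parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def derive(self):
        # After the search, keep only the operation with the largest architecture weight.
        return self.ops[int(self.alpha.argmax())]

def darts_step(model, train_batch, val_batch, weight_opt, arch_opt, loss_fn):
    """First-order DARTS step: operation weights are updated on the training loss,
    architecture parameters (the alphas) on the validation loss.
    weight_opt and arch_opt are assumed to be built over disjoint parameter groups."""
    x_t, y_t = train_batch
    weight_opt.zero_grad()
    loss_fn(model(x_t), y_t).backward()
    weight_opt.step()

    x_v, y_v = val_batch
    arch_opt.zero_grad()
    loss_fn(model(x_v), y_v).backward()
    arch_opt.step()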
The backbone and head network architectures together determine the tracking performance and model size. Therefore, it is crucial to search for the backbone and head network architectures as a whole so that the resulting network architecture is well suited to the visual tracking task. To this end, we construct a tracking supernet $\mathcal{N}_{trk}$, which consists of the backbone supernet $\mathcal{N}_b$ and the head supernet $\mathcal{N}_h$ and is formulated as $\mathcal{N}_{trk} = \{\mathcal{N}_b, \mathcal{N}_h\}$. As mentioned previously, the backbone supernet $\mathcal{N}_b$ is searched with the single-path one-shot method, whereas the head supernet $\mathcal{N}_h$ is pretrained based on DARTS. Although they use different NAS methods, they can be regarded as a whole in the weight learning process, which can be expressed as follows:
$W_b^{*}, W_h^{*} = \arg\min_{W_b, W_h} \mathcal{L}^{cls}_{train}\big(\mathcal{N}_{trk}(\mathcal{A}_b, W_b; \mathcal{A}_h, W_h)\big)$  (3)
where $W_b^{*}$ and $W_h^{*}$ denote the fully trained weights of the backbone and head supernets, respectively. The parameters $W_b$ and $W_h$ are initialized with the pretrained weights $W_b^{ptrain}$ and $W_h^{ptrain}$, respectively, to accelerate training convergence while improving tracking performance.

3.2.3. Joint Search for Tracking Network Architecture

After the weights of the backbone and head supernets have been trained in a distributed manner, the tracking supernet is searched jointly as a whole using the single-path one-shot method, as shown in Figure 7. The detection datasets, including COCO [63], ImageNet-VID [64], ImageNet-DET [64], and YouTube-BB [73], are used successively to search the tracking supernet and fine-tune its parameters, adapting it to challenging tracking scenarios. The joint search process can be expressed as follows:
$\alpha_b^{*}, \alpha_h^{*} = \arg\min_{\alpha_b, \alpha_h} \mathcal{L}^{trk}_{val}\big(\mathcal{N}_{trk}(\alpha_b, W_b^{*}(\alpha_b); \alpha_h, W_h^{*}(\alpha_h))\big)$  (4)
where $\alpha_b^{*}$ and $\alpha_h^{*}$ represent the optimal backbone and head network architectures, which are found by minimizing the validation loss $\mathcal{L}^{trk}_{val}$ on the tracking datasets. It is worth noting that the weights of the architectures $\alpha_b \in \mathcal{A}_b$ and $\alpha_h \in \mathcal{A}_h$ are inherited from $W_b^{*}$ and $W_h^{*}$.
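The joint search can be read as sampling backbone/head pairs and ranking them with the inherited supernet weights; the sketch below follows that reading, with the sample_path() methods from the earlier supernet sketch and the evaluate_fn helper assumed for illustration rather than taken from the paper.

def joint_search(backbone_supernet, head_supernet, evaluate_fn, num_samples=1000):
    """Single-path joint search: sample candidate (alpha_b, alpha_h) pairs and keep the best one.

    Weights are inherited from the trained supernets (W_b*, W_h*), so candidates are ranked
    by the validation tracking loss alone, without retraining each architecture."""
    best_pair, best_loss = None, float("inf")
    for _ in range(num_samples):
        alpha_b = backbone_supernet.sample_path()  # single-path backbone subnet
        alpha_h = head_supernet.sample_path()      # single-path head subnet
        # evaluate_fn is an assumed helper returning L_val^trk for the pair with inherited weights.
        loss = evaluate_fn(alpha_b, alpha_h)
        if loss < best_loss:
            best_pair, best_loss = (alpha_b, alpha_h), loss
    return best_pair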

3.2.4. Lightweighting of Tracking Network Architecture

For the trained tracking network architecture, a further search optimization is performed using an evolutionary algorithm that adopts the single-path one-shot search scheme. That is, only a single subnet is sampled at a time, the basic cells in different paths are recombined and mutated via the evolutionary algorithm, and finally only the optimal path is retained. The iteration repeats until the stopping conditions are satisfied. Considering the limited computational power of real-world resource-constrained devices, the constraint on network architecture complexity is set as
$\mathrm{FLOPs}(\alpha_b) + \mathrm{FLOPs}(\alpha_h) \le \mathrm{FLOPs}_{max}$  (5)
where $\mathrm{FLOPs}(\alpha_b)$ and $\mathrm{FLOPs}(\alpha_h)$ represent the computation counts of the searched backbone and head subnets, respectively, and $\mathrm{FLOPs}_{max}$ is the resource constraint, set to 600 MFLOPs [50]. If the computation count of a searched network architecture is greater than $\mathrm{FLOPs}_{max}$, the architecture is discarded. Concretely, different subnets in the supernet are selected and evaluated under the control of the evolutionary algorithm, as illustrated in Figure 8. First, a population of network architectures is randomly initialized. Then, the top-k architectures with the best performance are selected as parents to generate child networks, and the next generation of architectures is produced via mutation and crossover. For crossover, the basic cells of two randomly selected network architectures are crossed to generate a new architecture. For mutation, the basic cells of a randomly selected architecture are mutated into different types of basic cells to generate a new architecture; the mutation probability is set to 0.1 in our approach. Crossover and mutation are repeated continuously to generate a sufficient number of network architectures, which are mutated and recombined by the evolutionary algorithm until the resource constraint given by Equation (5) is satisfied, making the network architecture obtained by the final search sufficiently lightweight.
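A schematic of this FLOPs-constrained evolutionary search is given below; the encoding of an architecture as a list of per-layer cell indices, the population sizes, and the fitness() and count_flops() helpers are our assumptions for illustration, not the exact implementation.

import random

def evolve(init_population, fitness, count_flops, num_choices,
           flops_max=600e6, generations=20, top_k=10, mutate_prob=0.1):
    """Evolutionary lightweighting: only architectures satisfying the FLOPs constraint survive."""
    population = [arch for arch in init_population if count_flops(arch) <= flops_max]
    for _ in range(generations):
        # Keep the top-k best-performing architectures as parents.
        parents = sorted(population, key=fitness, reverse=True)[:top_k]
        children = []
        while len(children) < len(init_population):
            if random.random() < 0.5:                       # crossover of two parents
                p1, p2 = random.sample(parents, 2)
                child = [random.choice(pair) for pair in zip(p1, p2)]
            else:                                           # mutation of one parent
                parent = random.choice(parents)
                child = [random.randrange(num_choices) if random.random() < mutate_prob else cell
                         for cell in parent]
            if count_flops(child) <= flops_max:             # enforce Equation (5)
                children.append(child)
        population = children
    return max(population, key=fitness)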

3.2.5. Retraining of Tracking Network Architecture

Although a lightweight and suitable network architecture for visual tracking is obtained using the previous method, the parameters of the searched network are not optimized. The direct use of this network architecture will lead to poor tracking performance. Therefore, retraining of the network is needed to optimize the network parameters.

4. Experiments

4.1. Implementation Details

As described in Section 3.2, the proposed TrackNAS is obtained through a total of five stages of network architecture search and training. The input sizes of the target template and candidate region are set to 128 × 128 and 256 × 256 pixels, respectively. We start by pretraining the backbone supernet on ImageNet [64] for 300 epochs using the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 4 × 10^{-5}. The initial learning rate is set to 0.5, and linear annealing is applied. Then, we search the backbone and head network architectures for 120 epochs on the same dataset using the same settings as before. Next, we optimize the tracking network architecture for 30 epochs on tracking data, which include YouTube-BB [73], ImageNet-DET [64], ImageNet-VID [64], and COCO [63]. During the first 10 epochs, we freeze the parameters of the backbone network, and in the remaining epochs we set their learning rate to be 10 × smaller than the global learning rate. Later, we perform evolutionary training on the training split of GOT-10k [74] to make the searched tracking network architecture lightweight. Finally, we retrain the lightweight tracking network architecture on ImageNet for 300 epochs using the RMSProp optimizer with a momentum of 0.9 and a decay of 0.9. The weight decay is set to 1 × 10^{-5}, the dropout ratio is 0.2, and the initial learning rate is 0.065. A warm-up is employed in the first 3 epochs, followed by cosine annealing; the AutoAugment policy and an exponential moving average are used during training. Our tracker was implemented in Python 3.7 with PyTorch 1.8.2. All experiments were performed on an instance with an Intel® Xeon® Gold 6148v4 CPU @ 2.2 GHz (Santa Clara, CA, USA), 256 GB RAM, and 4 NVIDIA® Tesla® V100-SXM2 GPUs (Austin, TX, USA) with 64 GB VRAM. TrackNAS runs in real time at more than 150 FPS.

4.2. Tracking Datasets

The performance of TrackNAS was compared with that of state-of-the-art trackers on the OTB100 [75] (http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html (accessed on 15 August 2023)), UAV123 [76] (https://cemse.kaust.edu.sa/ivul/uav123 (accessed on 15 August 2023)), LaSOT [77] (http://vision.cs.stonybrook.edu/~lasot/index.html (accessed on 15 August 2023)), and GOT-10k [74] (http://got-10k.aitestunion.com (accessed on 15 August 2023)) datasets. We also evaluated the proposed search spaces and search strategies on the VOT2018 dataset [78] (https://votchallenge.net/vot2018/dataset.html (accessed on 15 August 2023)). Table 3 presents the details of each dataset, including their contents and statistics. More visual samples can be found on the website of each dataset.

4.3. Performance Metrics

There are two metrics used by most tracking datasets to assess the performance of trackers: precision rate (PR) and success rate (SR). The former measures the localization performance of the tracker, while the latter measures its scale estimation capability.
PR is defined as the percentage of video frames in which the Euclidean distance between the centroid of the estimated bounding box $(B_x, B_y)$ and that of the ground-truth annotation $(G_x, G_y)$ is less than 20 pixels.
SR is based on the overlap ratio between the estimated bounding box $B$ and the ground-truth annotation $G$ as
$\text{Overlap ratio} = \dfrac{|B \cap G|}{|B \cup G|}$  (6)
The area under the curve (AUC) of the success plot, obtained by varying the overlap threshold from 0 to 1 and recording the percentage of video frames whose overlap ratio exceeds each threshold, is used to rank the trackers.
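For clarity, the two metrics can be computed for a single sequence as in the sketch below; boxes are assumed to be given as (x, y, w, h) arrays, which is our convention and may differ from the official toolkits.

import numpy as np

def precision_rate(pred, gt, threshold=20.0):
    """PR: fraction of frames whose predicted box center is within `threshold` pixels of the ground truth."""
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0   # centers of (x, y, w, h) boxes
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    dist = np.linalg.norm(pred_c - gt_c, axis=1)
    return float(np.mean(dist <= threshold))

def overlap_ratio(pred, gt):
    """Per-frame intersection-over-union between predicted and ground-truth boxes (Equation (6))."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """SR/AUC: average fraction of frames whose overlap ratio exceeds each threshold in [0, 1]."""
    iou = overlap_ratio(pred, gt)
    return float(np.mean([np.mean(iou > t) for t in thresholds]))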

4.4. Comparisons with State-of-the-Art Trackers

We compared TrackNAS with previously reported advanced Siamese trackers including SiamFC [5], SiamRPN [6], SiamRPN++ [15], SiamDW [25], SiamFC++ [14], SiamBAN [17], SiamGAT [18], Ocean [19], SiamTPN [27], HiFT [26], and TrTr [28]. The comparison was conducted directly using the experimental data reported in the relevant papers. The overall performances of the participant trackers are summarized in Table 4.

4.4.1. Results on OTB100

We exploited the official toolkit of the OTB100 dataset [75] to assess the performance of the participating Siamese trackers. The SR and PR scores provided by the official toolkit were used as evaluation indices [79]. Compared with existing advanced Siamese trackers, such as SiamRPN++ [15] and SiamFC++ [14], TrackNAS used only one-tenth of the parameters and computation volume to obtain competitive performance. Furthermore, compared with recent CNN-Transformer Siamese trackers, such as TrTr [28], HiFT [26], and SiamTPN [27], our tracker is much smaller, and its performance is on par with theirs. This demonstrates the efficiency of the proposed TrackNAS tracker.

4.4.2. Results on UAV123

The above-mentioned Siamese trackers were also compared on the UAV123 dataset [76]. Compared with OTB100 and other datasets, the target objects in UAV123 are small, and the tracking sequences contain several distractor objects and long occlusions; thus, tracking a target object in this dataset is more challenging. The results are reported using SR and PR, the same metrics as for OTB100. TrackNAS significantly outperformed most well-known Siamese trackers by a large margin owing to the excellent feature extraction capability of the backbone network and the strong discriminative capability of the head network, which were thoroughly searched without informative redundancy. It achieved an SR score of 0.633, which was 17.2%, 10.6%, 9.7%, and 4.4% higher than SiamFC [5], SiamRPN [6], SiamDW [25], and HiFT [26], respectively. The results on UAV123 demonstrate the efficacy of the carefully designed search space and the proposed search strategy of TrackNAS.

4.4.3. Results on LaSOT

LaSOT [77] is one of the largest tracking datasets to date. The proposed TrackNAS tracker was compared with the other 11 Siamese trackers on the LaSOT dataset using SR and PR as evaluation indices. TrackNAS showed promising performance on LaSOT, with an SR score of 0.538 and a PR score of 0.532. Its SR score was 8.7% and 1.1% higher than those of the CNN-Transformer Siamese trackers HiFT [26] and TrTr [28], respectively. However, TrackNAS is still 4.3% inferior in terms of SR score to SiamTPN [27]; the performance of the tracker could be further improved by adding attention mechanisms to the search spaces to enhance the global feature capturing capability of the tracking network architecture.

4.4.4. Results on GOT-10k

GOT-10k [74] is another large-scale tracking dataset. It requires submitting the available source code to the official online platform, where the validation experiments are conducted and the results are reported publicly. The average overlap (AO) and the SR at a threshold of 0.5 (SR$_{0.5}$) suggested by the GOT-10k evaluation protocol were used to rank the trackers in our experiment. Although SiamTPN [27] showed good performance on OTB100, UAV123, and LaSOT, its performance on GOT-10k was inferior to that of our TrackNAS. This is mainly attributed to the fact that the network architecture of TrackNAS possesses an excellent generalization capability for handling unfamiliar target objects.

4.4.5. Results on Efficiency Analysis

Tracking efficiency is crucial for real-world tracking applications, especially deploying trackers to resource-constrained devices [30,57]. We analyzed the efficiency of the advanced Siamese trackers in terms of their number of parameters and the number of floating point operations (FLOPs) in their tracking model on the LaSOT dataset. Although most Siamese trackers achieved competitive performance, their high FLOPs made them unsuitable for some applications that only run on resource-constrained devices. In particular, SiamBAN [17] and Ocean [19] obtained poor efficiency results with 53.9 and 37.2 million parameters and 48.8 and 20.2 GFLOPs, respectively, since they used deeper backbone network architectures to ensure a more powerful feature extraction capability. Our TrackNAS achieved top efficiency results with 2.1 million parameters and 0.6 GFLOPs while showing a considerable tracking performance.

4.4.6. Visualized Results

Figure 9 shows some visualized results of the proposed TrackNAS with the ground-truth annotations on some challenging video sequences from the evaluation datasets. Despite using low FLOPs and parameters, TrackNAS can accurately predict the bounding boxes even when target objects suffer from rotations, motion blur, and scale variation.
We also show two tracking failure cases in extreme scenes in Figure 10. Although the two cases differ in nature, in both the proposed TrackNAS cannot successfully discriminate the target object from the background surroundings when similar distractors of the same category and appearance appear along with background clutter and occlusion.
In summary, the proposed TrackNAS achieved a competitive tracking performance on the four large-scale evaluation datasets, and the number of parameters and computation volume were far fewer than those of other advanced Siamese trackers. The results indicate the reliability and effectiveness of the proposed approach based on NAS for designing efficient and lightweight Siamese trackers.

4.5. Ablation Studies

4.5.1. Methodology

To validate the construction of the search space and the effectiveness of the design of the search strategies in this approach, ablation studies were conducted using the VOT2018 dataset [78] containing 60 video sequences, with the official committee-recommended expected average overlap (EAO), accuracy, and robustness scores as evaluation indices [80,81]. Accuracy measures the average overlap rate during successful tracking, robustness measures the average failure times, and EAO reflects both accuracy and robustness and is used to obtain the overall ranking. The network architecture of TrackNAS has multiple options for the search space and the search strategy during the search procedure. Therefore, the ablation studies were used to verify that each setting in TrackNAS is optimal, which included the following: a comparison of experimental results for the performance impact of (a) removing MBConv cells from the search space of the backbone network, (b) removing DSConv cells from the search space of the head network, (c) searching only for the backbone network in the distributed search, and (d) searching only for the head network in the joint search.

4.5.2. Search Space

To verify the effectiveness of the search space settings of TrackNAS, detailed ablation experiments were conducted to compare the number of parameters and computation volume of TrackNAS under different search space settings. First, the tracker TrackNAS (w/o MBConv) was obtained by using only standard convolutional cells in the search space of the backbone network while keeping the search space of the head network the same as that in Section 3.1. Second, TrackNAS (w/o DSConv) was obtained by removing the DSConv cells from the search space of the head network while keeping the search space of the backbone network the same as that in Section 3.1. The results of the ablation experiments are presented in Table 5. As shown, the computation volume of the model increased significantly when the backbone search space used only standard convolutional cells. This is mainly because the MBConv cell, as a lightweight architecture, helps reduce the overall computation of the model, and because the MBConv cell contains a DSConv part, discarding it also affects the number of model parameters. In addition, changing the head network search space significantly increased the number of model parameters. The analysis indicated that, because the DSConv cell helps reduce the computation of the network model, discarding it also increases the computational load.

4.5.3. Search Strategy

To verify the effectiveness of the search strategy of TrackNAS, detailed ablation comparison experiments were conducted on the specific settings of the search strategy, and trackers using different search strategies were compared and analyzed on the VOT2018 dataset. For the distributed search strategy, TrackNAS (w/o Head) was obtained by setting the model to search only for the backbone network. As indicated in Table 5, the number of model parameters of TrackNAS (w/o Head) differed significantly from that of TrackNAS, and its EAO, accuracy, and robustness scores were reduced by 4.6%, 2.8%, and 10.4%, respectively. We mainly attribute this to the fact that the head network is responsible for target object classification and regression; performance deteriorates owing to the inadequate search of its architecture, while redundant architectures that are not discarded increase the number of model parameters. After the joint search strategy was changed to search only the head network, the number of parameters of TrackNAS (w/o Backbone) differed even more from that of TrackNAS, and its EAO, accuracy, and robustness scores were reduced by 8.2%, 6.9%, and 15%, respectively. This was due to the inadequate search of the backbone network architecture, which reduced the feature extraction capability of the tracking network. Moreover, the poor coupling with the head network architecture resulted in a low level of integration of the entire tracking network architecture and degraded the tracking performance. The results of the ablation studies indicate that searching and training the backbone and head networks as a whole contributes significantly to improving the performance of the tracker.

5. Conclusions

In this paper, TrackNAS, a novel Siamese tracker based on NAS that is both efficient and lightweight, was designed. We carefully designed the backbone and head network search spaces according to the characteristics of the visual tracking task. We also used various search strategies to improve the performance of the backbone network, the head network, and the overall tracking model and to reduce the number of parameters and the computation volume of the model, including pretraining of the backbone supernet, distributed search for the backbone and head network architectures, joint search of the tracking network architecture, lightweighting of the tracking network architecture, and retraining of the tracking network parameters. Comparative experimental results on four large-scale evaluation datasets indicated that TrackNAS achieved a competitive tracking performance. For instance, it attained SR scores of 0.685, 0.633, and 0.538 on the OTB100, UAV123, and LaSOT datasets, respectively, and achieved an SR$_{0.5}$ score of 0.696 on the GOT-10k dataset. Furthermore, ablation studies on the VOT2018 dataset proved the effectiveness of the designed search space and search strategy. Compared with previous efficient Siamese trackers, TrackNAS is much more lightweight owing to the use of MBConv and DSConv cells; the searched tracking network architecture is free of redundancy and achieves advanced performance with very few FLOPs and parameters. Considering the complexity of NAS methods, there is still room for improvement and further development. Future research can focus on the following aspects: (1) a more complex search space can be constructed to provide more types of convolutional cells, making the searched network architecture more task-specific and the extracted features more representative, further increasing the accuracy of the tracker; (2) only the evolutionary method was used for lightweighting in this paper, and other lightweighting methods, such as model pruning and numerical quantization, could also be applied during the search to further reduce the number of parameters and the computation of the searched tracking network architecture; and (3) considering the excellent feature extraction ability of the attention mechanism, it may be worthwhile to incorporate attention mechanisms into the search process to enable the searched tracking network architecture to learn more target-specific feature representations, further simplify the network architecture, and improve the tracking performance.

Author Contributions

P.G.: conceptualization, methodology, investigation, writing—original draft, writing—review and editing, and software. X.L.: writing—original draft, writing—review and editing, visualization, and software. H.-C.S.: conception, design, writing—original draft, and resources. Y.W.: software and data curation. F.W.: supervision and validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the China Postdoctoral Science Foundation under Grant 2023M732022, in part by Qufu Normal University under Grant 167-602801, in part by the Shandong Provincial Natural Science Foundation under Grants ZR2021QF061 and ZR2022MF353, and in part by the Guangdong Provincial Basic and Applied Basic Research Foundation under Grant 2020A1515010706.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Javed, S.; Danelljan, M.; Khan, F.S.; Khan, M.H.; Felsberg, M.; Matas, J. Visual object tracking with discriminative filters and siamese networks: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 6552–6574. [Google Scholar]
  2. Marvasti-Zadeh, S.M.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. Deep learning for visual tracking: A comprehensive survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 3943–3968. [Google Scholar]
  3. Ondrašovič, M.; Tarábek, P. Siamese visual object tracking: A survey. IEEE Access 2021, 9, 110149–110172. [Google Scholar]
  4. Tao, R.; Gavves, E.; Smeulders, A.W.M. Siamese Instance Search for Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429. [Google Scholar]
  5. Bertinetto, L.; Valmadre, J.; Henriques, J.; Vedaldi, A.; Torr, P.H.S. Fully-Convolutional Siamese Networks for Object Tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 October 2016; pp. 850–865. [Google Scholar]
  6. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking With Siamese Region Proposal Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980. [Google Scholar]
  7. Gao, P.; Ma, Y.; Yuan, R.; Xiao, L.; Wang, F. Learning Cascaded Siamese Networks for High Performance Visual Tracking. In Proceedings of the International Conference on Image Processing, Chinese Taipei, China, 22–25 September 2019; pp. 3078–3082. [Google Scholar]
  8. Bao, J.; Yan, M.; Yang, Y.; Chen, K. SiamFFN: Siamese Feature Fusion Network for Visual Tracking. Electronics 2023, 12, 1568. [Google Scholar]
  9. Gao, P.; Yuan, R.; Wang, F.; Xiao, L.; Fujita, H.; Zhang, Y. Siamese attentional keypoint network for high performance visual tracking. Knowl.-Based Syst. 2020, 193, 105448. [Google Scholar]
  10. Cui, Y.; Guo, D.; Shao, Y.; Wang, Z.; Shen, C.; Zhang, L.; Chen, S. Joint Classification and Regression for Visual Tracking with Fully Convolutional Siamese Networks. Int. J. Comput. Vis. 2022, 130, 550–566. [Google Scholar]
  11. Li, J.; Zhang, K.; Gao, Z.; Yang, L.; Zhuo, L. SiamPRA: An Effective Network for UAV Visual Tracking. Electronics 2023, 12, 2374. [Google Scholar]
  12. Gao, P.; Zhang, Q.; Wang, F.; Xiao, L.; Fujita, H.; Zhang, Y. Learning reinforced attentional representation for end-to-end visual tracking. Inf. Sci. 2020, 517, 52–67. [Google Scholar]
  13. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14572–14581. [Google Scholar]
  14. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. AAAI Conf. Artif. Intell. 2020, 34, 12549–12556. [Google Scholar] [CrossRef]
  15. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4282–4291. [Google Scholar]
  16. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
  17. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 6668–6677. [Google Scholar]
  18. Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph Attention Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 9538–9547. [Google Scholar]
  19. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, Scotland, UK, 23–28 August 2020; pp. 771–787. [Google Scholar]
  20. Cheng, S.; Zhong, B.; Li, G.; Liu, X.; Tang, Z.; Li, X.; Wang, J. Learning to filter: Siamese relation network for robust tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4421–4431. [Google Scholar]
  21. Lin, L.; Fan, H.; Zhang, Z.; Xu, Y.; Ling, H. Swintrack: A simple and strong baseline for transformer tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 16743–16754. [Google Scholar]
  22. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 13608–13618. [Google Scholar]
  23. Song, Z.; Yu, J.; Chen, Y.P.P.; Yang, W. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 8791–8800. [Google Scholar]
24. Lan, J.P.; Cheng, Z.Q.; He, J.Y.; Li, C.; Luo, B.; Bao, X.; Xiang, W.; Geng, Y.; Xie, X. Procontext: Exploring progressive context transformer for tracking. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  25. Zhang, Z.; Peng, H. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4591–4600. [Google Scholar]
26. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. Hift: Hierarchical feature transformer for aerial tracking. In Proceedings of the International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 15457–15466. [Google Scholar]
  27. Xing, D.; Evangeliou, N.; Tsoukalas, A.; Tzes, A. Siamese transformer pyramid networks for real-time UAV tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 2139–2148. [Google Scholar]
  28. Zhao, M.; Okada, K.; Inaba, M. Trtr: Visual tracking with transformer. arXiv 2021, arXiv:2105.03817. [Google Scholar]
  29. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  30. Fu, C.; Lu, K.; Zheng, G.; Ye, J.; Cao, Z.; Li, B. Siamese object tracking for unmanned aerial vehicle: A review and comprehensive analysis. arXiv 2022, arXiv:2205.04281. [Google Scholar]
31. Zoph, B.; Le, Q. Neural Architecture Search with Reinforcement Learning. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–16. [Google Scholar]
  32. Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient neural architecture search via parameters sharing. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4095–4104. [Google Scholar]
  33. Cai, H.; Zhu, L.; Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv 2018, arXiv:1812.00332. [Google Scholar]
  34. Liu, H.; Simonyan, K.; Yang, Y. Darts: Differentiable architecture search. arXiv 2018, arXiv:1806.09055. [Google Scholar]
  35. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  36. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  37. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  38. Ghimire, D.; Kil, D.; Kim, S.H. A survey on efficient convolutional neural networks and hardware acceleration. Electronics 2022, 11, 945. [Google Scholar]
39. Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Comput. Surv. 2023, 55, 271. [Google Scholar]
  40. Cai, H.; Lin, J.; Lin, Y.; Liu, Z.; Tang, H.; Wang, H.; Zhu, L.; Han, S. Enable deep learning on mobile devices: Methods, systems, and applications. ACM Trans. Des. Autom. Electron. Syst. 2022, 27, 1–50. [Google Scholar]
  41. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8697–8710. [Google Scholar]
  42. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized evolution for image classifier architecture search. AAAI Conf. Artif. Intell. 2019, 33, 4780–4789. [Google Scholar] [CrossRef]
  43. Zheng, X.; Ji, R.; Tang, L.; Zhang, B.; Liu, J.; Tian, Q. Multinomial distribution learning for effective neural architecture search. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1304–1313. [Google Scholar]
  44. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  45. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. Adv. Neural Inf. Process. Syst. 2011, 24, 1–9. [Google Scholar]
  46. Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Tan, J.; Le, Q.V.; Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2902–2911. [Google Scholar]
  47. Baker, B.; Gupta, O.; Naik, N.; Raskar, R. Designing neural network architectures using reinforcement learning. arXiv 2016, arXiv:1611.02167. [Google Scholar]
  48. Marvasti-Zadeh, S.M.; Khaghani, J.; Cheng, L.; Ghanei-Yakhdan, H.; Kasaei, S. Chase: Robust visual tracking via cell-level differentiable neural architecture search. arXiv 2021, arXiv:2107.03463. [Google Scholar]
  49. Guo, Z.; Zhang, X.; Mu, H.; Heng, W.; Liu, Z.; Wei, Y.; Sun, J. Single path one-shot neural architecture search with uniform sampling. In Proceedings of the European Conference on Computer Vision, Glasgow, Scotland, UK, 23–28 August 2020; pp. 544–560. [Google Scholar]
  50. Cai, H.; Gan, C.; Wang, T.; Zhang, Z.; Han, S. Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020; pp. 1–15. [Google Scholar]
51. Stamoulis, D.; Ding, R.; Wang, D.; Lymberopoulos, D.; Priyantha, B.; Liu, J.; Marculescu, D. Single-path nas: Designing hardware-efficient convnets in less than 4 hours. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Würzburg, Germany, 16–20 September 2019; pp. 481–497. [Google Scholar]
  52. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7036–7045. [Google Scholar]
  53. Liu, C.; Chen, L.C.; Schroff, F.; Adam, H.; Hua, W.; Yuille, A.L.; Fei-Fei, L. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 82–92. [Google Scholar]
  54. Liu, S.; Kong, W.; Chen, X.; Xu, M.; Yasir, M.; Zhao, L.; Li, J. Multi-scale ship detection algorithm based on a lightweight neural network for spaceborne SAR images. Remote Sens. 2022, 14, 1149. [Google Scholar] [CrossRef]
  55. Li, W.; Zhang, L.; Wu, C.; Cui, Z.; Niu, C. A new lightweight deep neural network for surface scratch detection. Int. J. Adv. Manuf. Technol. 2022, 123, 1999–2015. [Google Scholar] [CrossRef]
  56. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 1993, 6, 1–8. [Google Scholar]
  57. Thangavel, J.; Kokul, T.; Ramanan, A.; Fernando, S. Transformers in Single Object Tracking: An Experimental Survey. arXiv 2023, arXiv:2302.11867. [Google Scholar]
  58. He, A.; Luo, C.; Tian, X.; Zeng, W. A twofold siamese network for real-time object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4834–4843. [Google Scholar]
  59. Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; Maybank, S. Learning attentions: Residual attentional Siamese Network for high performance online visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4854–4863. [Google Scholar]
  60. Zhang, Y.; Wang, L.; Qi, J.; Wang, D.; Feng, M.; Lu, H. Structured siamese network for real-time visual tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 351–366. [Google Scholar]
  61. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  62. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-Aware Siamese Networks for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 103–119. [Google Scholar]
  63. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  64. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar]
  65. Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1328–1338. [Google Scholar]
  66. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  67. Wang, G.; Luo, C.; Xiong, Z.; Zeng, W. Spm-tracker: Series-parallel matching for real-time visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3643–3652. [Google Scholar]
  68. Fan, H.; Ling, H. Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7952–7961. [Google Scholar]
  69. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  70. Yan, B.; Peng, H.; Wu, K.; Wang, D.; Fu, J.; Lu, H. LightTrack: Finding lightweight neural networks for object tracking via one-shot architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 15180–15189. [Google Scholar]
  71. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  72. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  73. Real, E.; Shlens, J.; Mazzocchi, S.; Pan, X.; Vanhoucke, V. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5296–5305. [Google Scholar]
  74. Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  75. Wu, Y.; Lim, J.; Yang, M.H. Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848. [Google Scholar] [CrossRef] [PubMed]
76. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 445–461. [Google Scholar]
  77. Fan, H.; Bai, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Huang, M.; Liu, J.; Xu, Y.; et al. Lasot: A high-quality large-scale single object tracking benchmark. Int. J. Comput. Vis. 2021, 129, 439–461. [Google Scholar] [CrossRef]
78. Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Čehovin, L.; Vojir, T.; Bhat, G.; Lukezic, A.; Eldesokey, A.; et al. The Sixth Visual Object Tracking VOT2018 Challenge Results. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  79. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–27 June 2013; pp. 2411–2418. [Google Scholar]
  80. Čehovin, L.; Leonardis, A.; Kristan, M. Visual Object Tracking Performance Measures Revisited. IEEE Trans. Image Process. 2016, 25, 1261–1274. [Google Scholar] [CrossRef]
  81. Kristan, M.; Matas, J.; Leonardis, A.; Vojíř, T.; Pflugfelder, R.; Fernández, G.; Nebehay, G.; Porikli, F.; Čehovin, L. A Novel Performance Evaluation Methodology for Single-Target Trackers. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2137–2155. [Google Scholar] [CrossRef]
Figure 1. Comparison of the performance and complexity of representative advanced Siamese trackers on the LaSOT dataset. An efficient tracker should lie toward the top-right corner. The red vertical dashed line denotes the real-time bound (30 FPS) according to [1]. The circle diameter is proportional to the tracker's computational volume in FLOPs; the smaller the circle, the fewer the FLOPs. The compared trackers are SiamFC [5], SiamRPN [6], SeqTrack [13], SiamFC++ [14], SiamRPN++ [15], DiMP [16], SiamBAN [17], SiamGAT [18], Ocean [19], SiamRN [20], SwinTrack [21], MixFormer [22], CSWinTT [23], ProContEXT [24], SiamDW [25], HiFT [26], SiamTPN [27], TrTr [28], and the proposed TrackNAS.
Figure 2. Overall architecture searched by the proposed TrackNAS. “Conv” means the standard convolution cell, “MBConv” indicates a mobile inverted bottleneck convolution cell [35] with a squeeze-and-excitation convolution module [71], while “DSConv” denotes a depthwise separable convolution cell [36]. The number (4 or 6) following each “MBConv” represents the dimension multiplication rate, and “ k × k ” denotes the kernel size. The searched convolution layers are drawn in color, while the predefined parts are plotted in gray.
Figure 3. The architectures of DSConv and MBConv cells. “Swish” denotes the Swish activation layer [72]. (a) DSConv cell. (b) MBConv cell.
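To make the two searchable cells of Figures 2 and 3 concrete, the following is a minimal PyTorch-style sketch written for illustration only (it is not the authors' released code): a depthwise separable convolution cell [36], and a mobile inverted bottleneck cell [35] with a squeeze-and-excitation module [71] and Swish activations [72]; class names and the default hyperparameters are our own assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation module [71]: channel-wise re-weighting of features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),                       # Swish activation [72]
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
    def forward(self, x):
        return x * self.gate(x)

class DSConv(nn.Module):
    """Depthwise separable convolution cell [36]: depthwise conv + pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                      kernel_size // 2, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.SiLU(),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
    def forward(self, x):
        return self.block(x)

class MBConv(nn.Module):
    """Mobile inverted bottleneck cell [35] with SE; expand_ratio is 4 or 6 (Figure 2)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, expand_ratio=4):
        super().__init__()
        mid = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),                      # expansion
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, kernel_size, stride,
                      kernel_size // 2, groups=mid, bias=False),       # depthwise
            nn.BatchNorm2d(mid), nn.SiLU(),
            SqueezeExcite(mid),
            nn.Conv2d(mid, out_ch, 1, bias=False),                     # projection
            nn.BatchNorm2d(out_ch),
        )
    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```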
Figure 4. Search strategy of the proposed TrackNAS. There are five processes: pretraining of the backbone supernet, distributed search for the backbone and head network architectures, joint search for the tracking network architecture, lightweighting of the searched architecture, and retraining of the final tracking network.
Figure 5. Single-path one-shot backbone supernet search process.
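As a rough illustration of the single-path one-shot idea behind Figure 5 (following the general scheme of [49], not the authors' exact implementation), each supernet layer holds several candidate cells, and at every training step one candidate per layer is sampled uniformly so that only that path receives gradients. The class and method names below are our own.

```python
import random
import torch.nn as nn

class OneShotLayer(nn.Module):
    """One supernet layer holding all candidate cells; one choice is active per step."""
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)
        self.active = 0
    def sample(self):
        self.active = random.randrange(len(self.candidates))
    def forward(self, x):
        return self.candidates[self.active](x)

class OneShotSupernet(nn.Module):
    """Backbone supernet: a stack of one-shot layers traversed along a single path."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
    def sample_path(self):
        for layer in self.layers:
            layer.sample()
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# Training-loop sketch: uniformly sample one path per batch and update only its weights.
# for images, targets in loader:
#     supernet.sample_path()
#     loss = criterion(supernet(images), targets)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```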
Figure 6. Differentiable head supernet search process.
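The differentiable head search in Figure 6 can be read as a DARTS-style relaxation [34]: each searched position computes a softmax-weighted mixture of candidate operations, and the architecture logits are learned jointly with the network weights. The sketch below is a generic illustration of that relaxation under our own naming, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted mixture over candidate operations (DARTS-style relaxation [34])."""
    def __init__(self, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)
        # architecture parameters: one logit per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def derive(self):
        """After the search converges, keep only the strongest candidate."""
        return self.ops[int(self.alpha.argmax())]
```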
Figure 7. Joint search for tracking network architectures.
Figure 8. Schematic diagram of the network architecture lightweighting framework.
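The lightweighting stage of Figure 8 can be viewed as a resource-constrained evolutionary search: candidate architectures whose FLOPs exceed the budget are rejected, the remainder are scored with the pretrained supernet weights, and the best candidates are mutated to form the next generation. The control-flow sketch below is our own paraphrase; `estimate_flops`, `evaluate`, and `mutate` are placeholders for whatever FLOPs estimator, validation routine, and mutation scheme one plugs in.

```python
import random

def evolutionary_search(init_population, flops_budget, estimate_flops,
                        evaluate, mutate, generations=20, top_k=10):
    """Evolve architectures while discarding any candidate over the FLOPs budget."""
    population = [a for a in init_population if estimate_flops(a) <= flops_budget]
    for _ in range(generations):
        # keep the top-k architectures by validation score
        survivors = sorted(population, key=evaluate, reverse=True)[:top_k]
        children = []
        while len(children) < len(init_population) - len(survivors):
            child = mutate(random.choice(survivors))
            if estimate_flops(child) <= flops_budget:   # resource constraint
                children.append(child)
        population = survivors + children
    return max(population, key=evaluate)
```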
Figure 9. Examples of tracking results of the proposed TrackNAS on four challenging sequences from the evaluation datasets.
Figure 10. Failure cases of the proposed TrackNAS in extreme scenes on two challenging sequences from the evaluation datasets.
Table 1. Basic cell settings of the backbone network search space.
Basic Cells | Kernel Sizes | Strides
Standard Conv | 3 × 3, 5 × 5, 7 × 7 | 1, 2
MBConv | 3 × 3, 5 × 5, 7 × 7 | 1, 2
Table 2. Basic cell settings of the head network search space.
Basic Cells | Kernel Sizes | Dimensions | Strides
Standard Conv | 3 × 3, 5 × 5, 7 × 7 | 128, 192, 256 | 1, 2
DSConv | 3 × 3, 5 × 5 | 128, 192, 256 | 1
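For readers who prefer a programmatic view, the settings of Tables 1 and 2 could be written down as a simple configuration dictionary. The structure below is our own paraphrase of the tables (with the MBConv expansion ratios of 4 and 6 taken from the caption of Figure 2); it is not a file shipped with TrackNAS.

```python
# Candidate cells and hyperparameters of the backbone and head search spaces
SEARCH_SPACE = {
    "backbone": {
        "Standard Conv": {"kernel_sizes": [3, 5, 7], "strides": [1, 2]},
        "MBConv":        {"kernel_sizes": [3, 5, 7], "strides": [1, 2],
                          "expand_ratios": [4, 6]},
    },
    "head": {
        "Standard Conv": {"kernel_sizes": [3, 5, 7],
                          "dimensions": [128, 192, 256], "strides": [1, 2]},
        "DSConv":        {"kernel_sizes": [3, 5],
                          "dimensions": [128, 192, 256], "strides": [1]},
    },
}
```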
Table 3. Details of the datasets used in our experiments.
Datasets | OTB100 | VOT2018 | UAV123 | LaSOT | GOT-10k
Video Sequences | 100 | 60 | 123 | 1400 (training: 1120, test: 280) | 10 k (training: 9.34 k, test: 420)
Total Video Frames (Annotations) | 58.61 k | 21.356 k | 113.476 k | 3.52 M (training: 2.8 M, test: 685 k) | 1.5 M (training: 1.4 M, test: 56 k)
Classes | 22 | 41 | 9 | 70 | 563 (training: 480, test: 84)
Minimum Frames | 71 | 24 | 109 | 1000 | 51
Maximum Frames | 3872 | 1500 | 3085 | 11,397 | 920
Average Resolution | 356 × 530 | 758 × 465 | 1231 × 699 | 632 × 1089 | 929 × 1638
Frame Rate | 30 FPS | 30 FPS | 30 FPS | 30 FPS | 10 FPS
Table 4. Comparisons of representative state-of-the-art Siamese trackers on OTB100, UAV123, LaSOT, and GOT-10k datasets. (R) and (G) represent ResNet-50 and GoogLeNet, respectively. “-” indicates that data are not available. “↑” denotes a larger value is better, while “↓” means that a smaller value is better.
Trackers | OTB100 [75] SR↑ / PR↑ | UAV123 [76] SR↑ / PR↑ | LaSOT [77] SR↑ / PR↑ | GOT-10k [74] AO↑ / SR0.5↑ | # Params (M) ↓ | GFLOPs ↓
SiamFC [5] | 0.582 / 0.771 | 0.461 / 0.691 | 0.336 / 0.339 | 0.392 / 0.426 | 2.3 | 2.6
SiamRPN [6] | 0.637 / 0.851 | 0.527 / 0.796 | 0.433 / - | 0.481 / 0.581 | 7.6 | 4.9
SiamDW (R) [25] | 0.670 / 0.892 | 0.536 / 0.776 | 0.385 / 0.389 | 0.429 / 0.483 | 2.5 | 12.9
SiamRPN++ (R) [15] | 0.696 / 0.915 | 0.642 / 0.840 | 0.496 / 0.491 | 0.518 / 0.618 | 53.9 | 48.9
SiamBAN [17] | 0.696 / 0.910 | 0.631 / 0.833 | 0.514 / 0.521 | - / - | 53.9 | 48.8
SiamFC++ (G) [14] | 0.683 / 0.896 | 0.623 / 0.810 | 0.543 / 0.547 | 0.595 / 0.695 | 13.9 | 17.5
Ocean [19] | 0.684 / 0.920 | 0.621 / 0.823 | 0.560 / 0.566 | 0.611 / 0.721 | 37.2 | 20.2
SiamGAT [18] | 0.710 / 0.916 | 0.646 / 0.843 | 0.539 / 0.530 | 0.627 / 0.488 | 14.2 | 17.3
TrTr [28] | 0.712 / 0.931 | 0.633 / 0.839 | 0.527 / 0.544 | - / - | 11.9 | 21.2
HiFT [26] | 0.614 / 0.814 | 0.589 / 0.787 | 0.451 / 0.421 | - / - | 11.0 | 6.5
SiamTPN [27] | 0.702 / 0.902 | 0.636 / 0.823 | 0.581 / 0.578 | 0.576 / 0.441 | 6.2 | 2.1
TrackNAS (Ours) | 0.685 / 0.893 | 0.633 / 0.816 | 0.538 / 0.532 | 0.603 / 0.696 | 2.1 | 0.6
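The SR and PR columns in Table 4 are the standard one-pass-evaluation measures [75]: the success rate (SR) is the area under the curve of the fraction of frames whose bounding-box overlap exceeds a threshold, and the precision (PR) is the fraction of frames whose center-location error is within 20 pixels. A minimal NumPy sketch of these two measures, written by us purely for illustration, is:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given in (x, y, w, h) format."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes):
    """Area under the success plot over overlap thresholds in [0, 1]."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    thresholds = np.linspace(0.0, 1.0, 21)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def precision(pred_boxes, gt_boxes, threshold=20.0):
    """Fraction of frames whose center-location error is within `threshold` pixels."""
    errors = [np.hypot((p[0] + p[2] / 2) - (g[0] + g[2] / 2),
                       (p[1] + p[3] / 2) - (g[1] + g[3] / 2))
              for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(np.array(errors) <= threshold))
```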
Table 5. Ablation experiments for different search spaces and strategies. “↑” denotes a larger value is better, while “↓” means that a smaller value is better.
Trackers | # Params (M) ↓ | GFLOPs ↓ | EAO ↑ | Accuracy ↑ | Robustness ↓
TrackNAS (w/o MBConv) | 3.3 | 0.6 | 0.407 | 0.582 | 0.244
TrackNAS (w/o DSConv) | 2.8 | 0.6 | 0.398 | 0.577 | 0.267
TrackNAS (w/o Head) | 2.6 | 0.6 | 0.379 | 0.574 | 0.302
TrackNAS (w/o Backbone) | 2.8 | 0.6 | 0.343 | 0.533 | 0.348
TrackNAS | 2.1 | 0.6 | 0.425 | 0.602 | 0.198
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
