*Article* **Multipopulation Particle Swarm Optimization for Evolutionary Multitasking Sparse Unmixing**

**Dan Feng <sup>1</sup>, Mingyang Zhang <sup>2,</sup>\* and Shanfeng Wang <sup>3</sup>**


**Abstract:** Recently, multiobjective evolutionary algorithms (MOEAs) have been designed to cope with the sparse unmixing problem. Owing to their excellent performance on NP-hard optimization problems, MOEAs have also achieved good results for sparse unmixing. However, most of these MOEA-based methods unmix only a single pixel at a time and therefore suffer from low efficiency and high time consumption. In fact, sparse unmixing can naturally be cast as a multitasking problem when the hyperspectral imagery is clustered into several homogeneous regions, so that evolutionary multitasking can be employed to take advantage of the implicit parallelism among the regions. In this paper, a novel evolutionary multitasking multipopulation particle swarm optimization framework is proposed to solve the hyperspectral sparse unmixing problem. First, we resort to evolutionary multitasking optimization to cluster the hyperspectral image into multiple homogeneous regions, and directly process the entire spectral matrix in multiple regions to avoid the curse of dimensionality. In addition, we design a novel multipopulation particle swarm optimization method for the main evolutionary exploration. Furthermore, an intra-task and inter-task transfer strategy and a local exploration strategy are designed to balance the exchange of useful information during the multitasking evolutionary process. Experimental results on two benchmark hyperspectral datasets demonstrate the effectiveness of the proposed method compared with state-of-the-art sparse unmixing algorithms.

**Keywords:** evolutionary multitasking; particle swarm optimization; multipopulation optimization; computational intelligence; sparse unmixing

#### **1. Introduction**

With the progress of remote sensing technology, hyperspectral imagery, which captures hundreds of contiguous spectral bands, has been widely applied in both civilian and military scenarios, for example, land-cover classification [1–3], environmental monitoring [4–6] and target detection [7,8]. However, mixed pixels remain a problem due to the low spatial resolution of sensors and the mixture of surface features [9,10]. Spectral unmixing therefore aims at extracting the collection of constituent spectra (called endmembers) from the mixed pixels and calculating the fractional abundances of these endmembers [11,12]. Spectral unmixing methods can be divided into three categories, that is, geometrical-based, statistical-based and sparse-regression-based approaches. Traditional geometrical-based and statistical-based methods are extensively used because they are easy and flexible to apply, but they suffer from poor performance on highly mixed scenes and from high time consumption, respectively [13]. Sparse unmixing, an emerging spectral unmixing technology in recent years, is devised to find the subset of signatures that best represents each pixel of the hyperspectral image from a spectral library

**Citation:** Feng, D.; Zhang, M.; Wang, S. Multipopulation Particle Swarm Optimization for Evolutionary Multitasking Sparse Unmixing. *Electronics* **2021**, *10*, 3034. https:// doi.org/10.3390/electronics10233034

Academic Editor: Amir Mosavi

Received: 22 October 2021 Accepted: 4 December 2021 Published: 5 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

known in advance. Among these algorithms, sparse unmixing via variable splitting and augmented Lagrangian (SUnSAL), based on the alternating direction method of multipliers, has been proposed to relax the *l*<sub>0</sub> norm [14]. To overcome the disadvantage of SUnSAL that it only utilizes spectral information without considering spatial-contextual information, Iordache et al. proposed the collaborative SUnSAL (CLSUnSAL), which improves the unmixing results by solving a joint sparse regression problem, where the sparsity is simultaneously imposed on all pixels in the dataset [15,16].

Mathematically, sparse unmixing is an NP-hard problem. Multiobjective evolutionary algorithms (MOEAs), which can optimize several contradictory objectives and acquire a set of nondominated solutions called the Pareto-optimal front, are suitable for solving NP-hard problems and overcoming the aforementioned difficulty in sparse unmixing [17]. A multiobjective sparse unmixing (MOSU) model was first proposed by Gong et al. [18] to deal with sparse unmixing for hyperspectral imagery. Xu et al. [19] developed a multiobjective-optimization-based sparse unmixing method (SMoSU) to take full advantage of the spectral characteristics of hyperspectral images under the framework of the multiobjective evolutionary algorithm based on decomposition (MOEA/D). In [20], SMoSU was further improved and a classification-based model called CM-MoSU was designed, in which the estimation of distribution algorithm is modified to pay more attention to high-quality regions of the feasible space.

However, the existing sparse unmixing algorithms based on MOEAs are limited to pixel-based unmixing, which leads to low efficiency and neglects spatial structure information [21]. In some recent studies [22–24], a hyperspectral image is clustered into multiple homogeneous regions based on the assumption that the pixels in a homogeneous region are likely to share the same active endmember set, which not only reduces the complexity of unmixing, but also further enhances the spatial correlation of pixels in the same category. Interestingly, this coincides with the idea of the evolutionary multitasking framework that has emerged in recent years. Evolutionary multitasking [25] aims to solve different optimization problems simultaneously to take advantage of the implicit parallelism among tasks. Therefore, it is promising to employ the evolutionary multitasking multiobjective framework to efficiently solve the sparse unmixing problem. Besides, the particle swarm optimization (PSO) algorithm, which simulates the behavior of bird flocks, has proved to be effective in solving multiobjective endmember extraction problems [26–28]. On this basis, the current multitasking paradigm can be further explored and applied to sparse unmixing problems.

In this paper, we propose a novel evolutionary multitasking multipopulation particle swarm optimization (EMMPSO) framework for sparse unmixing. In the proposed method, a hyperspectral image is first clustered into multiple homogeneous regions, and then multipopulation particle swarm optimization is employed to explore each sparsity level. Finally, multiobjective optimization is applied to each task simultaneously to obtain a compromise between the reconstruction error and the endmember sparsity. Notably, unlike traditional MOEA-based algorithms, which aim at pixel-based unmixing only, EMMPSO can process the entire matrix owing to the decomposition strategy of evolutionary multitasking. In addition, we design a novel intra-task and inter-task transfer strategy to overcome the impact of negative transfer in multitasking. It can not only utilize the effective information within the same task to speed up the convergence of each sub-particle swarm, but also explore the similarities between different tasks to improve the overall convergence performance. Finally, the Pareto-optimal solution in each task can be obtained to invert the final endmember abundances.

The contributions of the proposed EMMPSO algorithm are summarized as follows:

(1) A novel evolutionary multitasking multipopulation particle swarm optimization framework is proposed to solve the sparse unmixing problem. With the decomposition provided by evolutionary multitasking, multiple homogeneous regions of a hyperspectral image can be processed simultaneously, which accelerates convergence by exploiting the relevance of all the tasks. In addition, the Pareto-optimal solution trading off the reconstruction error and the endmember sparsity can be obtained with the multiobjective optimization.


The remainder of this paper is structured as follows: Section 2 briefly reviews related work on sparse unmixing. In Section 3, our method is introduced in detail. Section 4 gives the experimental settings and the analysis of the experimental results. Finally, conclusions and future work are described in Section 5.

#### **2. Related Work**

Generally, mixed pixels are unmixed under the linear mixing model. A single mixed pixel **y** ∈ ℝ<sup>*L*×1</sup> with *L* spectral bands can be expressed as:

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n},\tag{1}$$

where **A** ∈ ℝ<sup>*L*×*D*</sup> is the spectral library, in which all the spectral information is known in advance. In addition, **x** ∈ ℝ<sup>*D*×1</sup> is the corresponding fractional abundance vector, that is, the proportion of each endmember, and **n** ∈ ℝ<sup>*L*×1</sup> represents the noise term for the mixed pixel. For a hyperspectral image **Y** ∈ ℝ<sup>*L*×*n*</sup> containing *n* pixels, the matrix form of (1) can be formulated as:

$$Y = AX + N.\tag{2}$$
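As an illustration of the model above, the matrix form in Eq. (2) can be simulated with a few lines of NumPy. The sizes and the three-signature mixture per pixel are arbitrary choices for this sketch, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

L, D, n = 224, 230, 100          # bands, library size, pixels (illustrative sizes)
A = rng.random((L, D))           # spectral library, known in advance

# Sparse abundance matrix X: each pixel mixes a few library signatures,
# with non-negative fractions that sum to one (abundance constraints).
X = np.zeros((D, n))
for p in range(n):
    active = rng.choice(D, size=3, replace=False)
    X[active, p] = rng.dirichlet(np.ones(3))   # non-negative, sums to 1

N = 0.01 * rng.standard_normal((L, n))         # additive noise term
Y = A @ X + N                                  # observed hyperspectral matrix, Eq. (2)
```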

Therefore, the purpose of sparse unmixing is to obtain, from the huge spectral library, the most suitable set of endmembers for reconstructing the remote sensing image. Mathematically, this is an NP-hard optimization problem, which can be expressed as:

$$\min_{\mathbf{x}} \|\mathbf{x}\|_0 \quad \text{s.t.}\ \|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 \le \delta.\tag{3}$$

Many studies employ relaxation methods to handle the *l*<sub>0</sub>-norm. SUnSAL [14] resorts to the *l*<sub>1</sub>-norm as a convex surrogate of the *l*<sub>0</sub>-norm, and the mathematical optimization formula is as follows:

$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda \|\mathbf{x}\|_1 + \iota_{\mathbb{R}_+}(\mathbf{x}) + \iota_{\{1\}}(\mathbf{1}^{T}\mathbf{x}),\tag{4}$$

where *λ* stands for a regularization parameter that controls the relative weight between the sparse term and the error term. In [15], the CLSUnSAL takes spatial information into account and directly processes the whole matrix, which is shown as follows:

$$\min\_{\mathbf{X}} \|\mathbf{Y} - \mathbf{A}\mathbf{X}\|\_F^2 + \lambda \|\mathbf{X}\|\_{2,1} + \iota\_{\mathbb{R}+}(\mathbf{X}).\tag{5}$$
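To make the relaxation concrete, the following is a minimal ISTA (proximal gradient) sketch for the non-negative *l*<sub>1</sub> problem in Eq. (4), with the sum-to-one constraint omitted for simplicity. It is an illustrative solver, not the ADMM-based routine actually used by SUnSAL and CLSUnSAL:

```python
import numpy as np

def sunsal_ista(A, y, lam=1e-3, n_iter=500):
    """Projected ISTA for min 0.5*||y - Ax||^2 + lam*||x||_1  s.t. x >= 0."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)             # gradient of the quadratic data term
        # proximal step: soft-threshold, then project onto the non-negative orthant
        x = np.maximum(x - step * grad - step * lam, 0.0)
    return x
```

With a small λ the recovered abundances stay sparse and non-negative; a larger λ trades reconstruction accuracy for sparsity.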

Considering the excellent performance of MOEAs in solving NP-hard optimization problems, many studies have turned their attention to MOEAs to solve the sparse unmixing problem in recent years. Gong et al. [18] proposed a novel multiobjective cooperative coevolutionary algorithm to optimize the reconstruction term, the sparsity term and the total variation regularization term simultaneously, which can be expressed as:

$$\min_{\mathbf{x}} \left(\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2,\ \|\mathbf{x}\|_0,\ \sum_{j \in \varepsilon} \|\mathbf{x} - \mathbf{x}_j\|_1\right),\tag{6}$$

where *ε* stands for the set of horizontal and vertical neighbors in *X*. Jiang et al. [29] decomposed the sparse unmixing problem into two stages and employed MOEAs to solve them separately. The first phase mainly aims at endmember extraction, with the optimization formulated as min<sub>*M*</sub>(*RSE*<sub>1</sub>, *SP*<sub>1</sub>), where *RSE*<sub>1</sub> is the residual of the measured hyperspectral image and *SP*<sub>1</sub> represents the size of the estimated endmember set *M*. The second phase focuses on abundance estimation, expressed as min<sub>*M*</sub>(*RSE*<sub>2</sub>, *SP*<sub>2</sub>), where *RSE*<sub>2</sub> is the residual of the hyperspectral unmixing and *SP*<sub>2</sub> represents the favorable abundance matrix obtained by incorporating the spatial–contextual information. In addition, Jiang et al. [30] improved Tp-MoSU to address its limited performance in identifying real endmembers from high-noise data in the first phase, and its inability to effectively use spatial context information in the second phase due to the similarity metric used. Besides, many sparse unmixing algorithms based on evolutionary multiobjective decomposition [19,20,31] have also been explored.

Recently, evolutionary multitasking optimization [25,32] has become a popular topic in the field of evolutionary computing. In a nutshell, evolutionary multitasking aims to deal with multiple optimization problems at the same time, and to promote the optimization of each task by exploring the hidden relationships between these problems. Many evolutionary-multitasking-related algorithms have been explored and applied in many fields, such as feature selection [33], reinforcement learning [34] and sparse regression [22]. In sparse unmixing, a hyperspectral image can be clustered into multiple homogeneous regions according to spatial information, which coincides with the concept of evolutionary multitasking. It is therefore promising to model each homogeneous region as an optimization task, since the decomposition into multiple tasks can effectively mitigate the curse of dimensionality.

#### **3. EMMPSO Framework**

The pseudo code of EMMPSO is shown in Algorithm 1. In this section, the proposed framework is introduced in detail from initialization, multipopulation particle swarm optimization and the decision making with MOEA.

#### *3.1. Initialization and Representation*

In sparse unmixing, the spectral library known in advance and a hyperspectral image are the inputs, and the endmember set selected from the library and the corresponding abundance maps are the outputs. In the proposed EMMPSO, a hyperspectral image is first clustered into *K* homogeneous regions, and each homogeneous region is processed as a task [22], as shown in Figure 1. The spectra of the entire spectral library are coded into each particle in order, that is, the length of a particle equals the number of spectra. Considering that the sparsity of particles remains unchanged during evolution for most current discrete particle swarm optimization algorithms, the population in each task is divided into multiple subpopulations according to sparsity, to ensure that every sparsity level has particles exploring it. For the *s*-th subpopulation in the *j*-th task, the position of each particle is initialized as follows:

$$\{X_{i,s}^{t}\}_{j} = \{(x_1, x_2, \dots, x_n) \mid x_i \in \{0, 1\},\ \|X_{i,s}^{t}\|_0 = s\},\tag{7}$$

where *x<sub>i</sub>* takes one of two values, 0 or 1. If *x<sub>i</sub>* is equal to 1, the spectrum at the corresponding position in the spectral library is selected, and vice versa.
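A minimal sketch of this encoding: each particle is a binary vector over the library with exactly *s* entries set to 1 (the library size and sparsity below are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(1)

def init_particle(n_spectra, s):
    """Binary position vector of Eq. (7): exactly s library spectra selected."""
    x = np.zeros(n_spectra, dtype=int)
    x[rng.choice(n_spectra, size=s, replace=False)] = 1
    return x

particle = init_particle(230, 5)   # e.g., a 230-signature library, sparsity 5
```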

Then, each particle is evaluated with the reconstruction error ||*Y* − *A<sub>v</sub>X<sub>v</sub>*||<sub>*F*</sub> in the corresponding task, where *Y* is the hyperspectral image, *v* denotes the endmember set selected by the particle {*X<sup>t</sup><sub>i,s</sub>*}<sub>*j*</sub>, and *A<sub>v</sub>* and *X<sub>v</sub>* are the corresponding subset of the spectral library *A* and the abundances of the selected endmembers, respectively. After the evaluation is completed, the skill factor *τ<sub>i,s</sub>*, defined as the task on which the subpopulation with sparsity *s* performs best among all tasks, is assigned to each particle. Besides, the {*pbest<sub>s</sub>*}<sub>*j*</sub> and {*gbest<sub>s</sub>*}<sub>*j*</sub> for the subpopulation with sparsity *s* in the *j*-th task can be obtained. The velocity of each particle is initialized as:

$$\{V_{i,s}^t\}_j = (\{pbest_s\}_j - \{X_{i,s}^t\}_j) + (\{gbest_s\}_j - \{X_{i,s}^t\}_j).\tag{8}$$
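The fitness evaluation described above can be sketched as follows: the particle picks a column subset of the library, the abundances are fitted by least squares, and the Frobenius residual is returned. Function and variable names here are ours, not the paper's:

```python
import numpy as np

def reconstruction_error(Y, A, particle):
    """||Y - A_v X_v||_F for the endmember subset v encoded by a binary particle."""
    v = np.flatnonzero(particle)                     # indices of selected endmembers
    A_v = A[:, v]
    X_v, *_ = np.linalg.lstsq(A_v, Y, rcond=None)    # least-squares abundances
    return np.linalg.norm(Y - A_v @ X_v, "fro")
```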

**Algorithm 1** The EMMPSO Framework

1: %Initialization
2: Set *t* = 0, *G* = ∅.
3: **for** *j* = 1 to *K* **do**
4: **for** *s* = 1 to *S* **do**
5: **for** *i* = 1 to *N*/*KS* **do**
6: {*X<sup>t</sup><sub>i,s</sub>*}<sub>*j*</sub> = {(*x*<sub>1</sub>, *x*<sub>2</sub>, ..., *x<sub>n</sub>*) | *x<sub>i</sub>* ∈ {0, 1}, ||*X<sup>t</sup><sub>i,s</sub>*||<sub>0</sub> = *s*}.
7: **end for**
8: **end for**
9: Evaluate the fitness of each particle in task *T<sub>j</sub>*.
10: Assign the skill factor *τ<sub>i</sub>*.
11: Initialize the {*pbest<sup>t</sup><sub>s</sub>*}<sub>*j*</sub> and {*gbest<sup>t</sup><sub>s</sub>*}<sub>*j*</sub>.
12: {*V<sup>t</sup><sub>i,s</sub>*}<sub>*j*</sub> = ({*pbest<sup>t</sup><sub>s</sub>*}<sub>*j*</sub> − {*X<sup>t</sup><sub>i,s</sub>*}<sub>*j*</sub>) + ({*gbest<sup>t</sup><sub>s</sub>*}<sub>*j*</sub> − {*X<sup>t</sup><sub>i,s</sub>*}<sub>*j*</sub>).
13: *G<sup>t</sup><sub>j</sub>* = ∑<sup>*S*</sup><sub>*s*=1</sub> {*gbest<sup>t</sup><sub>s</sub>*}<sub>*j*</sub>.
14: **end for**
15: %Evolution
16: **while** *t* < *Max<sub>t</sub>* **do**
17: **for** *j* = 1 to *K* **do**
18: Update the {*X<sup>t+1</sup><sub>i,s</sub>*}<sub>*j*</sub> and {*V<sup>t+1</sup><sub>i,s</sub>*}<sub>*j*</sub> based on (9).
19: **end for**
20: Update the particles according to Algorithm 2.


**Figure 1.** The evolutionary multitasking optimization framework for hyperspectral sparse unmixing.

#### *3.2. Multipopulation Particle Swarm Optimization for Knowledge Transfer*

Considering the discreteness of decision variables in sparse unmixing, the population in each task is divided into multiple subpopulations according to the sparsity during initialization. In the process of particle swarm evolution, the position and velocity of the particles in the *j*-th task with the sparsity *s* are updated as follows:

$$\begin{aligned} \{X_{i,s}^{t+1}\}_j &= \{X_{i,s}^t\}_j + \{V_{i,s}^t\}_j, \\ \{V_{i,s}^{t+1}\}_j &= \begin{cases} T(\{V_{i,s}^{t+1}\}_j), & \text{if } \operatorname{any}(\{V_{i,s}^t\}_j) \ge 0 \\ R(\{X_{i,s}^{t+1}\}_j), & \text{otherwise} \end{cases} \end{aligned} \tag{9}$$

where *T* and *R* are both the selection functions [35].

After updating the positions and velocities of all particles, we design an efficient intra-task and inter-task knowledge transfer to explore useful information, which is shown in Algorithm 2. First, two particles are randomly selected from the current generation. The intra-task transfer focuses on exploitation of the shared positions: ∩(*p<sub>a</sub>*, *p<sub>b</sub>*) denotes the positions where the elements in *p<sub>a</sub>* and *p<sub>b</sub>* are both 1. The new particles directly inherit the positions in ∩(*p<sub>a</sub>*, *p<sub>b</sub>*) and randomly inherit the remaining positions from the original particle. On the contrary, the inter-task transfer focuses on random exploration: ∪(*p<sub>a</sub>*, *p<sub>b</sub>*) denotes the positions where the elements equal 1 in *p<sub>a</sub>* or *p<sub>b</sub>*. For the new particles *p<sub>a</sub>* and *p<sub>b</sub>*, ||*p<sub>a</sub>*||<sub>0</sub> and ||*p<sub>b</sub>*||<sub>0</sub> positions are directly selected from ∪(*p<sub>a</sub>*, *p<sub>b</sub>*), respectively. The particles with the better fitness then replace *p<sub>a</sub>* and *p<sub>b</sub>*. To illustrate the essence of Algorithm 2 more intuitively, Figure 2 shows a simple example of the genetic knowledge transfer. Two particles with sparsities of 3 and 4 are first selected from the current generation. In the intra-task transfer, the new particles inherit all the positions the two particles share, which constitute *p<sub>c</sub>*, and the rest of the required positions are randomly set to 1 so that the sparsity of each new particle matches that of its predecessor. Similar operations are performed in the inter-task transfer, but there the new particles instead select from the positions that are 1 in either previous particle and randomly set them to 1 with the same sparsity.

#### **Algorithm 2** Genetic Knowledge Transfer

**Input:** *P<sup>t</sup>*: the current generation of particles.
1: **for** *g* = 1 to *N*/2 **do**
2: Randomly select two particles *p<sub>a</sub>* and *p<sub>b</sub>* in *P<sup>t</sup>*.
3: **if** *τ<sub>a</sub>* = *τ<sub>b</sub>* **then**
4: %Intra-task Transfer
5: *p<sub>c</sub>* ← ∩(*p<sub>a</sub>*, *p<sub>b</sub>*).
6: *p<sub>a</sub>* ← Inherit all positions of *p<sub>c</sub>* and randomly set (||*p<sub>a</sub>*||<sub>0</sub> − ||*p<sub>c</sub>*||<sub>0</sub>) positions to 1.
7: *p<sub>b</sub>* ← Inherit all positions of *p<sub>c</sub>* and randomly set (||*p<sub>b</sub>*||<sub>0</sub> − ||*p<sub>c</sub>*||<sub>0</sub>) positions to 1.
8: **else**
9: %Inter-task Transfer
10: *p<sub>c</sub>* ← ∪(*p<sub>a</sub>*, *p<sub>b</sub>*).
11: *p<sub>a</sub>* ← Randomly select ||*p<sub>a</sub>*||<sub>0</sub> positions in *p<sub>c</sub>*.
12: *p<sub>b</sub>* ← Randomly select ||*p<sub>b</sub>*||<sub>0</sub> positions in *p<sub>c</sub>*.
13: **end if**
14: Evaluate the fitness of *p<sub>a</sub>* and *p<sub>b</sub>*.
15: Update *p<sub>a</sub>* and *p<sub>b</sub>*.
16: *g* = *g* + 1.
17: **end for**
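The two transfer operators of Algorithm 2 can be rendered in NumPy as follows, assuming particles are 0/1 vectors; the helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def intra_task_transfer(pa, pb):
    """Children keep the shared 1-positions (the intersection) and fill the
    remaining sparsity budget with randomly chosen extra positions."""
    pc = pa & pb
    def make_child(sparsity):
        child = pc.copy()
        zeros = np.flatnonzero(child == 0)
        child[rng.choice(zeros, size=sparsity - child.sum(), replace=False)] = 1
        return child
    return make_child(pa.sum()), make_child(pb.sum())

def inter_task_transfer(pa, pb):
    """Children draw all of their 1-positions at random from the union."""
    pool = np.flatnonzero(pa | pb)
    def make_child(sparsity):
        child = np.zeros_like(pa)
        child[rng.choice(pool, size=sparsity, replace=False)] = 1
        return child
    return make_child(pa.sum()), make_child(pb.sum())
```

Both operators preserve each child's sparsity, matching lines 6–7 and 11–12 of Algorithm 2.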

**Figure 2.** An example of the knowledge transfer.

#### *3.3. An Efficient Local Exploration Strategy with MOEA*

After the optimization of the multipopulation particle swarms, the set of globally optimal particles over all sparsity levels and tasks, *G* = ∑<sup>*K*</sup><sub>*j*=1</sub> ∑<sup>*S*</sup><sub>*s*=1</sub> {*gbest<sub>s</sub>*}<sub>*j*</sub>, can be obtained. Each particle involves two conflicting objectives, that is, the endmember sparsity and the reconstruction error. Therefore, we employ a multiobjective optimization algorithm to facilitate the search process and obtain the optimal points in each task. In the evolutionary multitasking multiobjective framework, the optimized function is expressed as follows:

$$\begin{cases} \{X_1^*, X_2^*, \dots, X_K^*\} = \arg\min \{F(X_1), F(X_2), \dots, F(X_K)\}, \\ F(X_j) = \min_{X_j} (\|X_j\|_0, \|Y_j - A X_j\|_F), \end{cases} \tag{10}$$

where the *Y<sup>j</sup>* and *X<sup>j</sup>* represent the original image and inversion abundance in the *j*-th task, respectively.

The local exploration strategy is illustrated in Figure 3. First, the globally optimal particles are transcoded into the first generation of the NSGA-II framework. Roulette selection, single-point crossover and bitwise mutation operators are employed in the multiobjective evolution. Then, the generated offspring are evaluated to update the Pareto front in each task according to nondominated sorting and crowding distance, and the nondominated solutions are transcoded back into the set of globally optimal particles.
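For the two objectives used here, the non-dominated filtering step can be written compactly; this is only the selection criterion of NSGA-II, not the full algorithm with crowding distance:

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated points for two minimization objectives,
    e.g., (endmember sparsity, reconstruction error) pairs."""
    pts = np.asarray(points, dtype=float)
    front = []
    for i, p in enumerate(pts):
        # p is dominated if some point is <= in both objectives and < in at least one
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            front.append(i)
    return front
```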

With the above design, the optimal point in each task can finally be obtained by *X*<sup>∗</sup><sub>*jv*</sub> = arg min ||*Y<sub>j</sub>* − *A<sub>v</sub>X<sub>jv</sub>*||<sub>*F*</sub>, which can be solved simply with the least squares method. Finally, the optimal abundance maps obtained from the individual tasks constitute the final inverted abundance map.

**Figure 3.** The illustration of the Local Exploration Strategy with MOEA.

#### **4. Experimental Results**

#### *4.1. Data Sets*

Data 1, provided by Iordache et al. [36], is an image containing 100 × 100 pixels with 224 bands per pixel; the abundance maps of its nine endmembers are shown in Figure 4. It contains nine randomly selected signatures from a sublibrary of 230 spectral signals, and the fractional abundances are piecewise smooth. Data 2, provided by Tang et al. [37], is an image containing 64 × 64 pixels with 224 bands per pixel; the abundance maps of its five endmembers are shown in Figure 5. It includes five endmembers from a sublibrary of 498 spectral signals, and the fractional abundances are also homogeneous. These two benchmark datasets were tested at different levels of white noise, that is, SNR = 20, 30 and 40 dB. The number of tasks was set to three on both datasets, as recommended in [22]. To ensure the fairness of the experiments, all experimental results were averaged over 20 runs, consistent with the compared methods.

**Figure 4.** True abundance maps of nine endmembers in data 1.

**Figure 5.** True abundance maps of five endmembers in data 2.

#### *4.2. Performance Analysis of EMMPSO*

In this section, ablation experiments were performed to demonstrate the effectiveness of the knowledge transfer and the local exploration strategy. The hypervolume indicator was used to compare the evolution process and the convergence behavior of EMMPSO and EMMPSO without transfer. The hypervolume was calculated using a reference point 1% larger in every component than the corresponding nadir point [38]. As an important indicator of the quality of the Pareto-optimal front (PF), a larger hypervolume value, or a faster convergence of the hypervolume, indicates a better PF obtained by the algorithm. The evolution of the hypervolume indicator is shown in Figure 6. Clearly, after a few iterations, our method obtains higher hypervolume values with the help of the intra-task and inter-task transfer strategy. When several related tasks are optimized simultaneously under the framework of evolutionary multitasking, the convergence rate is improved significantly.
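For a two-objective minimization front, the hypervolume used above reduces to a sum of rectangle areas between consecutive front points and the reference point. A minimal sketch:

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective minimization front w.r.t. reference point ref."""
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(front):            # ascending in the first objective
        if y < prev_y:                    # only non-dominated steps add new area
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```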

Secondly, to test the efficiency of the local exploration, the performance of EMMPSO and EMMPSO without local exploration was compared. Usually, the signal-to-reconstruction error (SRE) is used to measure the quality of the reconstruction of a signal. Table 1 shows the SRE (dB) at different noise levels for our proposed method and for EMMPSO without local exploration on the simulated data. It can be observed that our method achieves higher SRE (dB) values than EMMPSO without local exploration on both simulated datasets, showing that the local exploration is useful for facilitating the search process to obtain the optimal points.
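The SRE values in Table 1 follow the usual definition, which we assume here: the ratio of the signal energy of the true abundances to the energy of the estimation error, in decibels:

```python
import numpy as np

def sre_db(X_true, X_est):
    """Signal-to-reconstruction error in dB; higher is better."""
    signal = np.sum(X_true ** 2)
    error = np.sum((X_true - X_est) ** 2)
    return 10.0 * np.log10(signal / error)
```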


**Table 1.** Comparison of EMMPSO and EMMPSO without Local Exploration on data 1 and data 2.


**Figure 6.** Comparison of the hypervolume indicator for EMMPSO and EMMPSO without transfer. (**a**) task 1 on data 1, (**b**) task 2 on data 1, (**c**) task 3 on data 1, (**d**) task 1 on data 2, (**e**) task 2 on data 2, (**f**) task 3 on data 2.

#### *4.3. Comparison with State-of-the-Art Algorithms*

To demonstrate the superiority of our proposed algorithm, EMMPSO is compared with state-of-the-art algorithms, including SUnSAL, CLSUnSAL, two-phase multiobjective sparse unmixing (Tp-MOSU) and evolutionary multitasking sparse reconstruction (MTSR). Among them, SUnSAL and CLSUnSAL are traditional pixel-based and matrix-based processing algorithms, while Tp-MOSU and MTSR are based on multiobjective optimization and multitasking optimization, respectively. Figures 7 and 8 depict the estimated abundance maps for endmembers 2, 5 and 8 on data 1 and endmembers 1, 3 and 5 on data 2, respectively. The rightmost column shows the abundance maps of the real endmembers. The closer the inverted abundance map is to the real abundance map, the better the unmixing performance of the corresponding algorithm. It can be seen that Tp-MOSU, MTSR and EMMPSO are more similar to the original abundance maps than the other two methods. Although the abundance maps obtained by Tp-MOSU, MTSR and EMMPSO are similar, the abundance map of EMMPSO contains much less noise. Table 2 shows the SRE (dB) results obtained by the five methods on data 1 and data 2. At all noise levels, the proposed EMMPSO achieves the highest SRE (dB) values on both simulated datasets. The experimental results on the two datasets prove that our proposed EMMPSO achieves competitive performance through evolutionary multitasking and the local exploration strategy.

**Figure 7.** The fractional abundance maps of endmembers 2, 5 and 8 by SUnSAL, CLSUnSAL, Tp-MOSU, MTSR and EMMPSO on Data 1.

**Figure 8.** The fractional abundance maps of endmembers 1, 3 and 5 by SUnSAL, CLSUnSAL, Tp-MOSU, MTSR and EMMPSO on Data 2.


**Table 2.** Comparison of EMMPSO and other methods on data 1 and data 2.

#### **5. Conclusions**

In this paper, we propose a novel evolutionary multitasking multipopulation particle swarm optimization framework called EMMPSO to solve the sparse unmixing problem. By processing multiple homogeneous regions of a hyperspectral image simultaneously, the convergence of the evolution is accelerated. The local exploration strategy with MOEA is also designed to obtain the optimal solution. In the case study, the proposed EMMPSO is compared with some state-of-the-art methods on benchmark simulated datasets. The results demonstrate the superiority of EMMPSO.

In future work, we will focus on reducing the time complexity of EMMPSO, and design an efficient multiobjective particle swarm optimization paradigm for the sparse unmixing problem.

**Author Contributions:** Conceptualization, D.F. and M.Z.; methodology, D.F.; validation, M.Z.; investigation, D.F.; writing—original draft preparation, D.F.; writing—review and editing, D.F., M.Z. and S.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This work was supported by the National Natural Science Foundation of China under Grant 61906147 and Grant 61806153, the Fundamental Research Funds for the Central Universities (Grant no. XJS200216) and China Post-Doctoral Science Foundation (Grant no. 2021T140528).

**Conflicts of Interest:** The authors declare no conflict of interest.



#### **References**


## *Article* **Reliable Memory Model for Visual Tracking**

**Daohui Ge 1, Ruyi Liu 1, Yunan Li <sup>1</sup> and Qiguang Miao 1,2,\***


**\*** Correspondence: qgmiao@mail.xidian.edu.cn

**Abstract:** Effectively learning the appearance change of a target is the key point of an online tracker. When occlusion and misalignment occur, the tracking results usually contain a great amount of background information, which heavily affects the ability of a tracker to distinguish between targets and backgrounds, eventually leading to tracking failure. To solve this problem, we propose a simple and robust reliable memory model. In particular, an adaptive evaluation strategy (AES) is proposed to assess the reliability of tracking results. AES combines the confidence of the tracker predictions with the similarity distance between the current predicted result and the existing tracking results. Based on the reliable results selected by AES, we design an active–frozen memory model to store reliable results. Training samples stored in the active memory are used to update the tracker, while the frozen memory temporarily stores inactive samples. The active–frozen memory model maintains the diversity of samples while satisfying the storage limitation. We performed comprehensive experiments on five benchmarks: OTB-2013, OTB-2015, UAV123, Temple-color-128, and VOT2016. The experimental results show that our tracker achieves state-of-the-art performance.

**Keywords:** online update; reliable evaluation strategy; active–frozen memory model; visual tracking

#### **1. Introduction**

Visual tracking is a fundamental problem of computer vision: given the position and size of a target in the first frame, the target is tracked in subsequent frames. It has been successfully applied to robots, video surveillance, and self-driving cars. There are some challenging factors, such as deformation, in-plane and out-of-plane rotation, scale variation, and illumination variation. These challenges are likely to cause significant changes to the appearance of the target. Therefore, how to effectively learn the appearance change of a target is an essential issue in visual tracking.

Recently, online learning-based trackers have achieved good performance. Online updates are often employed to learn the appearance changes of targets, with tracking results collected as online training samples every frame or at fixed intervals. Several online update strategies have been proposed [1–6]. For example, these include selecting the most confident tracking result within fixed-interval frames to update specific networks [7]; collecting two consecutive frames [2]; storing each frame in order [3,8,9]; using a convolutional neural network to update the template [4,5]; and storing all tracking results using a Gaussian Mixture Model (GMM) [1,10].

Although the effectiveness of these online update strategies has been validated, two challenges remain. One challenge is that tracking results are not always reliable. When misalignment, occlusion, or out-of-view events occur, the tracking results are likely to contain a great amount of background information, which acts as noise. Unreliable tracking results reduce the ability of a tracker to distinguish between targets and backgrounds, ultimately leading to tracking failure. The other challenge is that tracking results are not appropriately stored. Either the predicted tracking result of each frame [8,9] or several tracking results with higher confidence [2,7] are stored. However, these methods keep very few online training samples, which represent only the latest appearance changes of the target, so the tracker can easily over-fit to the current appearance of the target.

**Citation:** Ge, D.; Liu, R.; Li, Y.; Miao, G. Reliable Memory Model for Visual Tracking. *Electronics* **2021**, *10*, 2488. https://doi.org/10.3390/electronics10202488

Academic Editor: Christos J. Bouras

Received: 5 September 2021; Accepted: 30 September 2021; Published: 13 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

To solve the above challenges, we propose a robust reliable memory model that can accurately evaluate the reliability of tracking results and efficiently store all reliable results. First, we propose an adaptive evaluation strategy (AES) to assess the reliability of tracking results. AES calculates a reliability weight based on the tracking confidence of the tracker's prediction and the similarity distance between the current predicted result and the existing tracking results. Reliability thresholds are adaptively calculated to enhance the generalization of AES, and only reliable tracking results are selected to construct online training samples. Based on the reliable results of the AES selection, and inspired by computer storage structures, we devised an active–frozen memory model to store all reliable tracking results. Training samples stored in the active memory are used to update the tracker online, while the frozen memory temporarily stores some of the oldest results. The active–frozen memory model maintains the diversity of training samples by exchanging samples between the two memories. Combining AES with the active–frozen memory model effectively avoids introducing background information while preventing the tracker from over-fitting to the current target appearance.

The contributions are summarized as follows:


#### **2. Related Work**

When scale variability, deformation, and rotation occur, the appearance of the target tends to change significantly. How to effectively learn the appearance change of a target is an essential issue of visual tracking. Recently, most approaches utilize the tracking results as online training samples to fine-tune the tracker to learn the appearance change of targets.

**Reliability evaluation of tracking results.** The reliability of online training samples is key to updating the tracker. There are two main strategies for constructing online training samples. One strategy is to directly use the tracking results as online training samples, regardless of their reliability. Some trackers [1,2,8,10] collect one training sample based on the tracking result in each frame. Other trackers [3,11,12] draw some positive and negative samples around the predicted target location. When tracking drift occurs, the tracking results are likely to contain a great amount of background information that contaminates the online training samples.

The second strategy is to consider only the confidence of the tracking results, as predicted by the tracker. FCNT [7] collects the most confident tracking results within the intervening frames. STCT [13] sets a confidence threshold and collects the tracking results whose confidence is higher than the threshold. However, the tracking results are predicted by the tracker itself, which is always more confident about its own predictions; thus, incorrect tracking results can still achieve high confidence. Different from the above methods, we designed a robust adaptive evaluation strategy (AES) to assess the reliability of the tracking results. AES considers not only the confidence of the tracking results but also the similarity distance between the current predicted result and the existing tracking results.

**Storage of online training samples.** Existing trackers construct a fixed volume of space to store online training samples. Some trackers [2,7,13] maintain a very small space, storing only one or two samples, to reduce the amount of computation. CREST [2] stores only two samples, namely the last two frames. FCNT [7] stores only one training sample within the intervening frames. STCT [13] stores the tracking result whose tracker-predicted confidence is higher than a predefined threshold. These methods collect only a small number of tracking results, making the tracker over-fit easily to the current training samples.

Other trackers collect large numbers of tracking results in larger spaces. Some positive and negative samples are stored in each frame [3,11,12], or one sample is added in each frame [8,10]. UpdateNet [4] uses the initial frame and an accumulated template to estimate the optimal template for the next frame. Meta-updater [6] integrates geometric, appearance, and discriminative cues into sequential information. In particular, ECO [1] employs the Gaussian Mixture Model (GMM) to reduce the redundancy of the training samples. When the number of samples reaches the maximum capacity, the tracker discards the oldest samples, which easily causes the tracker to over-fit to the current appearance of the target. We propose an active–frozen memory model to store all reliable tracking results. The training samples stored in the active memory are used to fine-tune the tracker, while the frozen memory temporarily stores the samples discarded by the active memory, whose weights are less than a threshold. The samples in the active memory and frozen memory are exchanged to ensure the diversity of samples in the active memory.

#### **3. Our Approach**

As mentioned earlier, the reliability of training samples is very important for the online updating of a tracker. When occlusion or tracking misalignment occurs, the tracking result is likely to contain background information, which can be regarded as noise. When the tracker is updated with these tracking results, its ability to distinguish between the background and the target is reduced, eventually leading to poor location estimation or tracking failure. As shown in Figure 1, ECO (red box) does not consider the reliability of the tracking result and is easily affected by similar objects, scale variation, and rotation. Our approach (green box) evaluates the reliability of the result to avoid introducing noise and generates better prediction results. The reliability of tracking results has received insufficient attention from researchers. We obtained two observations by analyzing the confidence of the current tracking result and the similarity distance between the current predicted result and the existing tracking results. Based on the two observations, an adaptive evaluation strategy (AES) was designed to evaluate the reliability of the tracking results.

**The first observation.** The similarity distance between the current predicted result and the existing tracking results increases significantly when tracking drift occurs. Figure 2 shows the change of the minimum distance during the tracking process. Around the 70th frame, the target jumps, causing the appearance to change significantly and the similarity distance to increase rapidly. Thus, the similarity distance can help to recognize when tracking drift occurs.
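The drift cue used in this observation — the minimum Euclidean distance between the current result's features and the stored result features — can be sketched as follows. The function name and the one-result-per-column layout are our own assumptions for illustration, not the authors' code.

```python
import numpy as np

def min_similarity_distance(x, U):
    """Minimum Euclidean distance between the current result's
    feature vector x (shape m) and the stored result features
    U (shape m x n, one column per stored result)."""
    # L(x, u_i) = ||x - u_i||_2 for every stored result u_i
    dists = np.linalg.norm(U - x[:, None], axis=0)
    return dists.min()
```

A sudden jump in this value relative to its recent history signals likely drift, as in Figure 2.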

**Figure 1.** Our approach is compared with the ECO [1] on three test sequences called Basketball (**top row**), CarScale (**middle row**), and Twinnings (**bottom row**). ECO (red box) does not consider the reliability of the tracking result and is easily affected by similar objects, scale variability, and rotation. Our approach (green box) evaluates the result's reliability by AES.

**Figure 2.** Visualization of the dynamic changes of the similarity distance on Biker. The tracking drift occurs when the target jumps around 70th frame. We can clearly observe that the similarity distance is significantly increased.

**The second observation.** We used the VGG network to extract semantic features and represented the target with HOG and color name (CN) features together, so the tracker can handle some variations in the appearance of the target. Figure 3 shows the relationship between the similarity distance and the confidence. Consistent with the first observation, when the illumination or appearance of a target changes drastically, the similarity distance (purple curve) increases significantly. However, the confidence of the current predicted result (blue curve) is still higher than the mean confidence (red curve). That is, when the appearance of the target changes significantly, the tracker can still show a high level of confidence in the current prediction result.

**Figure 3.** Visualization of the relationship between the confidence and similarity distance on Singer2. Even if the target's pose or appearance changes significantly (purple curve), the confidence of the current predicted result (blue curve) is still higher than the mean confidence (red curve).

The tracking results are collected as training samples to update the tracker online. Based on the reliable results of the AES selection, we designed an active–frozen memory model to maintain the diversity of results while satisfying the limitation of storage.

#### *3.1. Adaptive Evaluation Strategy (AES) of the Reliability*

Inspired by the aforementioned two observations, we propose an adaptive evaluation strategy (AES) that combines the similarity distance with the confidence of the tracker prediction to assess the reliability of tracking results.

We use *U* = {*u*<sub>1</sub>, . . . , *u<sub>n</sub>*} ∈ ℝ<sup>*m*×*n*</sup> to represent the features of the tracking results and *C* = {*c*<sub>1</sub>, . . . , *c<sub>n</sub>*} ∈ ℝ<sup>1×*n*</sup> to represent the confidences of the tracker predictions. For the current predicted result *x*, its tracking confidence is denoted by *t* and its reliability weight by *V*. *V* is composed of a distance-based reliability weight *V*<sub>1</sub> and a confidence-based reliability weight *V*<sub>2</sub>. When the current predicted result *x* is unreliable, *V* is assigned a value of zero.

We first calculate the distance-based reliability weight *V*<sub>1</sub> based on the similarity distance between the current predicted result and the existing tracking results:

$$\min_{V_1} E(V_1; r) = V_1 \left( r - \min_{i=1,\dots,n} L(x, u_i) \right) \qquad \text{s.t.} \quad V_1 \in \{0, 1\} \tag{1}$$

where *L*(*x*, *y*) is the Euclidean distance and *r* is a threshold: when the minimum distance is greater than *r*, *V*<sub>1</sub> = 0; otherwise, *V*<sub>1</sub> = 1. The purpose of *V*<sub>1</sub> is to help the tracker identify significant changes in the appearance of the target. The confidence-based reliability weight *V*<sub>2</sub> is calculated according to the confidence of the tracking results:

$$\min_{V_2} E(V_2) = V_2 \left( \frac{1}{n} \sum_{i=1}^{n} c_i - t \right) \qquad \text{s.t.} \quad V_2 \in \{0, 1\} \tag{2}$$

The confidence-based reliability weight *V*<sub>2</sub> makes the tracker robust to appearance changes of the target. Based on the distance-based reliability weight *V*<sub>1</sub> and the confidence-based reliability weight *V*<sub>2</sub>, the reliability weight *V* is calculated by Equation (3):

$$V = V_1 \circ V_2 \tag{3}$$

where ◦ is the Hadamard product. According to Equations (1)–(3), the global optimum *V*<sup>⋆</sup> of the reliability weight *V* is calculated as follows:

$$V^\star = \begin{cases} 1, & V_1 \circ V_2 = 1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

The reliability of the tracking results can be effectively evaluated by Equation (4). The parameter *r* is an important threshold that determines the reliability of the current predicted result. Figure 4 shows the similarity distance between the current predicted result and the existing tracking results in different sequences. In the FleetFace sequence (yellow curve), the similarity distance is significantly smaller than in the Bolt2 (red curve) and BlurCar1 (green curve) sequences. In the Bolt2 sequence, the similarity distance shows significant dynamic change. The similarity distance differs remarkably between sequences because the target has different motion states, appearance changes, and feature resolutions. According to the second observation, the confidence of the tracking result can effectively account for changes in the target's appearance. We therefore propose a method that adaptively calculates the threshold *r*.

**Figure 4.** Visualization of the similarity distance between the current predicted result and the existing tracking results in the BlurCar1 (green curve), Bolt2 (red curve), and FleetFace (yellow curve). The similarity distance of different sequences is remarkably different.

In the case of *V*<sub>1</sub> ⊕ *V*<sub>2</sub> = 1, the distance-based reliability weight *V*<sub>1</sub> differs from the confidence-based reliability weight *V*<sub>2</sub>. When *V*<sub>2</sub> = 1, the appearance of the target has changed significantly, and the threshold *r* should be increased to select more tracking results as online training samples. When *V*<sub>2</sub> = 0, the tracker is not certain about its own prediction; although the new tracking result is close enough to the existing tracking results, the threshold *r* should be reduced to ensure the quality of the current predicted result. The threshold *r* is adaptively updated by the following formula:

$$r = r + w \left[ r - \min_{i=1,\dots,n} L(x, u_i) \right] \left( V_1 \oplus V_2 \right) \tag{5}$$

where *w* represents the pace for each calculation.
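Putting Equations (1)–(5) together, a minimal sketch of AES might look as follows. The function name and array layout (one stored result per column) are our own assumptions; the binary weights follow the textual rules stated after Equations (1) and (2) rather than the authors' released code.

```python
import numpy as np

def aes_weight(x, U, t, C, r, w):
    """Adaptive evaluation strategy (Eqs. (1)-(5)): returns the
    reliability weight V of the current result x and the updated
    threshold r. U holds stored result features (one per column),
    C their confidences, t the confidence of x, and w the pace."""
    d_min = np.linalg.norm(U - x[:, None], axis=0).min()
    V1 = 1 if d_min <= r else 0      # distance-based weight, Eq. (1)
    V2 = 1 if t >= C.mean() else 0   # confidence-based weight, Eq. (2)
    V = V1 * V2                      # Hadamard product, Eqs. (3)-(4)
    # adapt the threshold only when V1 and V2 disagree, Eq. (5)
    r = r + w * (r - d_min) * (V1 ^ V2)
    return V, r
```

Note that the threshold moves only when the two cues disagree (the XOR term), so a result that is both close and confident leaves *r* untouched.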

#### *3.2. Active–Frozen Memory Model*

In order to learn the appearance change of the target, tracking results are collected as training samples to update the tracker online. Most trackers [1,7,10] discard the oldest results when the number of samples reaches the maximum limit, which results in training samples that do not fully represent the appearance change of the target.

Based on the reliable results of the AES selection, and inspired by the multi-level cache technique in computer storage, we propose an active–frozen memory model that stores all reliable tracking results. The structure of the active–frozen memory model, shown in Figure 5, is a cascaded structure that can exchange components between the two memories. Tracking results stored in the active memory are used to update the tracker online, while the frozen memory temporarily stores some of the oldest results. To reduce the computational load, following [1], we used the Gaussian Mixture Model (GMM) to fuse the tracking results in each memory. The two closest components in the GMM, namely *K* and *S*, are merged into one component *G*:

$$\mathcal{W}_G = \mathcal{W}_K + \mathcal{W}_S, \qquad \overline{\mathcal{X}_G} = \frac{\mathcal{W}_K \overline{\mathcal{X}_K} + \mathcal{W}_S \overline{\mathcal{X}_S}}{\mathcal{W}_K + \mathcal{W}_S} \tag{6}$$
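As a small illustration of this merge rule, the weights of the two components add and their means combine in a weight-proportional average. The helper below assumes scalar mean features for simplicity; its name is ours.

```python
def merge_components(w_k, x_k, w_s, x_s):
    """Merge the two closest GMM components K and S into G (Eq. (6)):
    weights add, and means combine weighted by the component weights."""
    w_g = w_k + w_s
    x_g = (w_k * x_k + w_s * x_s) / w_g
    return w_g, x_g
```

With vector-valued means the same formula applies element-wise.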

**Figure 5.** The structure of the active–frozen memory model (**top row**). There are two operations (**bottom row**), namely the transfer component and the exchange component. Only reliable tracking results are stored; the others are discarded directly. The active–frozen memory model guarantees the diversity and reliability of the tracking results in the active memory through exchange operations and AES.

We first constructed a Gaussian component based on the weight *Wx* and mean features *X* of the current predicted result *x*. The reliability of *x* was evaluated by AES (see Section 3.1 for details). If the current predicted result *x* is reliable, it is stored in the active memory. Otherwise, it is discarded directly.

After the current predicted results are collected, we checked whether the component numbers in the active memory had reached the maximum limit and whether the weight of one component was less than the predefined threshold. If an existing component satisfies the above requirement, it is exchanged with the closest component from the frozen memory. If the frozen memory is empty, we place this component directly into the frozen memory. The active–frozen memory model guarantees the diversity and reliability of tracking results in the active memory. The stored procedure of the active–frozen memory model is illustrated in Algorithm 1.

#### **Algorithm 1** Stored procedure of the active–frozen memory model.

**Require:** current predicted result *x*.

**Ensure:** active–frozen memory.
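Since the body of Algorithm 1 is not reproduced here, the following is only a plausible sketch of the stored procedure under our own simplifications: scalar features, illustrative capacities, and a simple lowest-weight demotion policy standing in for the weight-threshold test described above.

```python
class ActiveFrozenMemory:
    """Sketch of the active-frozen memory model. Components are
    (weight, feature) pairs; capacities and policies are illustrative."""

    def __init__(self, active_cap=50, frozen_cap=10):
        self.active, self.frozen = [], []
        self.active_cap, self.frozen_cap = active_cap, frozen_cap

    def store(self, weight, feature, reliable):
        if not reliable:                 # unreliable results are discarded
            return
        self.active.append((weight, feature))
        if len(self.active) > self.active_cap:
            # demote the lowest-weight component to frozen memory
            idx = min(range(len(self.active)),
                      key=lambda i: self.active[i][0])
            self.frozen.append(self.active.pop(idx))
            if len(self.frozen) > self.frozen_cap:
                self.frozen.pop(0)       # frozen memory is also bounded

    def exchange(self, feature):
        """Swap the frozen component closest to `feature` back into
        active memory in place of the current lowest-weight one."""
        if not self.frozen:
            return
        j = min(range(len(self.frozen)),
                key=lambda i: abs(self.frozen[i][1] - feature))
        idx = min(range(len(self.active)),
                  key=lambda i: self.active[i][0])
        self.active[idx], self.frozen[j] = self.frozen[j], self.active[idx]
```

The exchange keeps older appearance modes reachable, which is how the model preserves sample diversity under a fixed storage budget.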


#### *3.3. Model Update*

In recent trackers [1,7,12,14], a sparse update scheme has been employed. The tracker, which takes the collected tracking results as online training samples, is updated every *N<sub>s</sub>* frames, and each update performs a fixed number *N<sub>i</sub>* of iterations of the optimization algorithm. The sparse update scheme not only reduces computation but also reduces over-fitting to the recent online training samples.

We also utilized the sparse update scheme in our approach. Only the training samples stored in the active memory were used to update our tracker (see Section 3.2 for details). When the current predicted result was unreliable, the active memory did not change because the predicted result was discarded directly. Thus, before updating the tracker, we checked whether the active memory had changed within the last *N<sub>s</sub>* frames, that is, whether new tracking results had been collected. If the active memory had not changed, indicating that the last *N<sub>s</sub>* tracking results were unreliable, we reduced the number of iterations *N<sub>i</sub>* of the optimization algorithm to avoid over-fitting to the existing online training samples. Otherwise, we performed *N<sub>i</sub>* iterations of the optimization algorithm.
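The update schedule above can be sketched as follows. `optimize_step` is a hypothetical single-iteration hook on the tracker, and the default values of *N<sub>s</sub>* and *N<sub>i</sub>* follow the settings reported in Section 4.1.

```python
def sparse_update(tracker, memory_changed, frame_idx,
                  Ns=6, Ni=5, Ni_reduced=4):
    """Sparse update scheme: update every Ns frames; if no reliable
    result entered active memory since the last update, run fewer
    optimization iterations to avoid over-fitting. Returns the
    number of iterations performed."""
    if frame_idx % Ns != 0:      # skip non-update frames
        return 0
    iters = Ni if memory_changed else Ni_reduced
    for _ in range(iters):
        tracker.optimize_step()  # one iteration of the optimizer
    return iters
```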

#### **4. Experiments**

We validated the performance of our tracker on five benchmark datasets, including OTB-2013 [15], OTB-2015 [16], UAV123 [17], Temple-color-128 [18], and VOT2016 [19].

#### *4.1. Implementation Details*

Our tracker was implemented in PyTorch. We initialized our tracker using the method proposed in [1]. The VGG-m network was used as a feature extractor to capture the *Conv*1 (the first convolutional layer) and *Conv*5 (the last convolutional layer) features, and the HOG and Color Name (CN) features were combined to represent the target. For the adaptive evaluation strategy (AES) of the reliability, the threshold *r* was initialized to 0. In order to obtain a reasonable value of *r*, the tracking results of the first 50 frames were used to adaptively calculate the value of *r* by Equation (5); in fact, the initial value of *r* had no effect on the performance of the tracker. In the first 50 frames, the pace for each calculation *w* was set to 0.5. In the subsequent frames, the pace *w* was calculated by the following formula:

$$w = \begin{cases} 0.4 \max(c_i) + 0.6 \, \frac{r}{\text{distance}_{\min}}, & r > \text{distance}_{\min} \\ 0.4 \max(c_i) + 0.6 \left( \frac{r}{\text{distance}_{\min}} - 1 \right), & \text{otherwise} \end{cases} \tag{7}$$

where *distancemin* represents the minimum similarity distance between the current predicted result and the existing online training samples.
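Equation (7) translates directly into code. In this small helper, `c` stands for the list of stored confidences and `d_min` for the minimum similarity distance; the names are ours.

```python
def pace(c, r, d_min):
    """Adaptive pace w of Eq. (7): mixes the peak confidence with
    the ratio of the threshold r to the minimum similarity distance."""
    if r > d_min:
        return 0.4 * max(c) + 0.6 * (r / d_min)
    return 0.4 * max(c) + 0.6 * (r / d_min - 1)
```

When the threshold already exceeds the closest distance, the pace is positive and large; otherwise the second branch can drive the threshold down.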

For the active–frozen memory model, as presented in Section 3.2, the maximum number of training samples in the active memory and frozen memory was set to 50 and 10, respectively. We initialized the active memory with the tracking results of the first 50 frames of the sequence. The learning rate was set to 0.009. We updated the tracker every *N<sub>s</sub>* = 6 frames. When tracking results were added to the active memory, we used the same iteration number *N<sub>i</sub>* = 5 as in [1]; otherwise, the number of iterations *N<sub>i</sub>* was set to 4. Note that all parameter settings were kept fixed for all sequences in the datasets. It is important to note that the computational complexity of the proposed adaptive evaluation strategy (AES) and active–frozen memory model is O(*n*), which is negligible and thus preserves the real-time performance of tracking.

#### *4.2. Ablative Study*

In this section, we analyze the contribution of both the adaptive evaluation strategy (AES) of the reliability and the active–frozen memory model to the tracker by performing experiments on the OTB-2013 dataset [15]. The OTB-2013 dataset contains 50 sequences that are all fully annotated. There are 11 attributes, such as occlusion, scale transformation, and deformation, which represent the challenge factors in visual tracking. Each sequence has at least one challenge factor. We used a precision plot and a success plot to evaluate the performance of the tracker. Precision plots calculate the Euclidean distance between the estimated location and the ground truth, and counts the percentage of frames that are less than a given threshold distance. The threshold was set to 20 pixels. The success plot quantitatively calculates the overlap ratio of the bounding box, where the overlap rate ranges from 0 to 1. The success plot counts the number of frames whose overlap rate is greater than a given threshold. The threshold was set to 0.5.
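As a concrete sketch of the two evaluation metrics described above, the following helpers compute the single-threshold scores (20 pixels and 0.5 overlap). Function names and input layouts are our own; centers are (x, y) pixel coordinates and overlaps are precomputed IoU values.

```python
import numpy as np

def precision_score(pred_centers, gt_centers, thr=20):
    """Fraction of frames whose center location error (Euclidean
    distance between predicted and ground-truth centers) is below
    `thr` pixels."""
    err = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return (err < thr).mean()

def success_score(ious, thr=0.5):
    """Fraction of frames whose bounding-box overlap ratio (IoU)
    exceeds `thr`."""
    return (np.asarray(ious) > thr).mean()
```

The reported precision plots and success plots sweep the thresholds over a range and, for success, summarize the curve by its area under the curve (AUC).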

We chose ECO [1] as our baseline tracker and organized four comparison experiments by controlling variables, including the standard ECO, only the adaptive evaluation strategy (ours-AES), only the active–frozen memory model (ours-AF memory), and our full approach (ours). Figure 6 shows the comparison experiment results on the OTB-2013 dataset. In the precision plot, the score of the baseline tracker was 93%. Compared with the baseline tracker, our active–frozen memory model achieved a 0.8% improvement and our adaptive evaluation strategy achieved a 1.6% improvement, which provided the greatest contribution. Our approach finally improved by 1.8%. In the success plot, the baseline tracker obtained an area-under-curve (AUC) score of 70.9%. Both the adaptive evaluation strategy and the active–frozen memory model achieved a 0.4% improvement, and our approach achieved a 0.5% improvement compared with the baseline tracker.

**Figure 6.** Ablative experiments on the OTB-2013 dataset. The area-under-curve (AUC) score of the success plot and the score of the precision plot are represented in the legend, respectively.

We also analyzed the performance of the tracker under different challenge factors. Figure 7 only shows the results of the scale variation, illumination variation, in-plane rotation, and deformation challenge factors; we achieved an increase of 1%, 1%, 0.6%, and 2.6% respectively. In particular, our method can better learn the deformation of a target, which is our main purpose, i.e., learning the appearance change of a target.

**Figure 7.** Success plot on scale variation, illumination variation, in-plane rotation, and deformation. The AUC score of each challenge factor is shown in the legend.

AES guarantees the quality of online training samples to avoid introducing background information, and the active–frozen memory model guarantees the diversity of online training samples to prevent the tracker from over-fitting to the current target appearance. The experimental results in Figures 6 and 7 show that the adaptive evaluation strategy (AES) and the active–frozen memory model are useful for improving the performance of the tracker.

Meanwhile, we conducted ablation experiments on VOT2016 [19] as shown in Table 1. Our tracker can reach 35 FPS with negligible computation introduced by AES and AF memory, satisfying the real-time requirement.


**Table 1.** Ablative experiments on the VOT2016.

#### *4.3. Comparisons to State-of-the-Art Trackers*

In this section, we compare our approach with state-of-the-art trackers on five benchmark datasets: OTB-2013 [15], OTB-2015 [16], UAV123 [17], Temple-color-128 [18], and VOT2016 [19].

**OTB-2013.** We compared our approach with VITAL [3], ECO [1], MDNET [12], DAT [11], MCPF [20], CREST [2], CCOT [9], TRACA [21], BACF [22], DeepSRDCF [23], SRDCF [8], SiamFC [24], and 29 trackers from the OTB-2013 dataset. The experimental results are shown in Figure 8. In the precision plot, VITAL achieved the best performance. Our tracker obtained a precision score of 94.8%, second only to VITAL and 0.4% and 1.8% higher than DAT and ECO, respectively. In the success plot, our method achieved the best performance among all the state-of-the-art trackers, obtaining an AUC score of 71.4%, which is 0.4% and 0.5% higher than VITAL and ECO, respectively. Compared with ECO, although the adaptive evaluation strategy (AES) and the active–frozen memory model were added, the extra computation was negligible and our tracker ran at the same speed as ECO.

**Figure 8.** Precision plot and success plot on the OTB-2013 dataset. The AUC score and precision score of each tracker are shown in the legend. For clarity, we only show the top 10 trackers by performance.

**OTB-2015.** The OTB-2015 dataset extends the OTB-2013 dataset with 50 additional sequences and is still fully annotated. We compared our approach with recent state-of-the-art trackers: VITAL [3], ECO [1], MDNET [12], DAT [11], MCPF [20], CREST [2], CCOT [9], TRACA [21], BACF [22], DeepSRDCF [23], SRDCF [8], SiamFC [24], and 29 existing trackers from the OTB-2015 dataset. The experimental results are shown in Figure 9. Our approach achieved the best performance in both the precision and success plots, with a precision score of 92.3% and an AUC score of 69.4%. In the precision plot, our tracker was 0.5% higher than ECO and 1.3% higher than VITAL; in the success plot, it was 0.3% higher than ECO and 1.2% higher than VITAL.

**Figure 9.** Precision plot and success plot on the OTB-2015 dataset. The AUC score and precision score of each tracker are shown in the legend. For clarity, we only show the top 10 trackers by performance.

**UAV123.** UAV123 consists of 123 video sequences and more than 110K frames captured from a low-altitude aerial perspective, covering 12 tracking attributes. We compared our approach with state-of-the-art trackers: ECO [1], MEEM [14], DSST [25], SRDCF [8], DCF [26], Struck [27], MUSTER [28], SAMF [29], and 31 trackers from the UAV123 dataset. Figure 10 shows the results over all 123 sequences in the UAV123 dataset. Our tracker provided the best performance, with a precision score of 74.9% and an AUC score of 52.8%. Additionally, our tracker achieved a substantial improvement over ECO [1], with a gain of 0.8% in the precision plot and a gain of 0.3% in the AUC.

**Figure 10.** Precision plot and success plot on the UAV123 dataset. The AUC score and precision score of each tracker are shown in the legend. For clarity, we only show the top 10 trackers by performance.

**VOT2016.** The VOT2016 dataset contains 60 sequences with new annotations. We compared our approach with SiamDW [30], UpdateNet [4], SiamRPN [31], and ECO [1]. Table 2 shows the results of the VOT2016 dataset. Our tracker provided the best performance with an EAO score of 0.389.



**Temple-color-128.** The Temple-color-128 dataset consists of 128 color sequences with ground truth and challenge factor annotations. It is well known that the color information of a target provides rich discriminative cues for inference; the purpose of this dataset is to study the use of color information for visual tracking. We compared our approach with MEEM [14], Struck [27], KCF [26], and other trackers from the Temple-color-128 dataset. The experimental results over all the sequences are shown in Figure 11. Our approach achieved the best performance in both the precision and success plots, with a precision score of 79.35% and an AUC score of 59.10%. Additionally, our tracker again achieved a substantial improvement over MEEM [14], with a gain of 8.54% in the precision plot and a gain of 9.10% in the AUC.

**Figure 11.** Precision plot and success plot on the Temple-color-128 dataset. The AUC score and precision score of each tracker are shown in the legend. For clarity, we only show the top 10 trackers by performance.

#### **5. Conclusions**

In this paper, we proposed a robust strategy for constructing online training samples to learn the changes of a target's appearance. The adaptive evaluation strategy (AES) combines the tracking confidence of the tracker's prediction with the similarity distance between the current predicted result and the existing tracking results to assess the reliability of the tracking results, ensuring the quality of the online training samples. We also proposed an active–frozen memory model that can effectively store all reliable tracking results. Training samples stored in the active memory are employed to update the tracker, and the diversity of the online training samples is ensured by exchanging samples between the two memories to prevent the tracker from over-fitting to current appearance changes. Extensive experiments on five benchmark datasets show that our approach outperforms state-of-the-art trackers.

**Author Contributions:** Conceptualization, D.G., R.L. and Q.M.; methodology, D.G. and Y.L.; software, D.G. and R.L.; validation, D.G. and Q.M.; formal analysis, D.G., Y.L. and Q.M.; investigation, D.G. and R.L.; resources, Q.M.; data curation, D.G. and Y.L.; writing—original draft preparation, D.G. and Y.L.; writing—review and editing, D.G., R.L. and Y.L.; visualization, D.G.; supervision, D.G. and Q.M.; project administration, Y.L.; funding acquisition, Q.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research study was jointly funded by the National Key R&D Program of China under grant number 2018YFC0807500; the National Natural Science Foundations of China under grant numbers 61772396, 61772392, 61902296, and 62002271; Xi'an Key Laboratory of Big Data and Intelligent Vision under grant number 201805053ZD4CG37; the National Natural Science Foundation of Shaanxi Province under grant number 2020JQ-330, 2020JM-195; the China Postdoctoral Science Foundation under grant number 2019M663640; and Guangxi Key Laboratory of Trusted Software (number KX202061); the Fundamental Research Funds for the Central Universities under grant No.XJS210310.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**

