1. Introduction
Target tracking is a comprehensive technology that draws on computer technology [1], pattern recognition, image processing [2], artificial intelligence [3] and other technologies [4]. It is widely used in smart homes [5], human–computer interaction [6], virtual reality, medical diagnosis, modern military, information security [7], and other computer vision fields. Video target tracking uses the context information of a video or image sequence to model the appearance and motion of the target so as to predict its motion state and calibrate its position [8].
Although research on video target tracking algorithms has made great progress in recent years, existing methods have not yet reached the ideal state because of changes in the target's appearance and size, object occlusion, motion blur, background interference and other factors. According to the tracking method, target tracking algorithms can be divided into algorithms based on correlation filtering and algorithms based on deep learning. Algorithms based on correlation filtering use a correlation filter to compute the similarity between a template image and a predicted image in order to determine the target position during tracking, while algorithms based on deep learning learn target features by training deep networks to complete video target tracking.
Bolme et al. [9] introduced correlation filtering into video target tracking for the first time and proposed the Minimum Output Sum of Squared Error (MOSSE) algorithm. After that, algorithms based on correlation filtering gradually became the mainstream method in the field of video target tracking.
However, since the introduction of the AlexNet [10] network, video target tracking algorithms based on deep learning have gradually emerged; they have received wide attention in recent years and produced many algorithms with excellent performance. Although this kind of algorithm has high tracking accuracy and strong robustness, its large number of model parameters and heavy computation make it difficult to run under the limited hardware conditions of an embedded platform such as a DSP, FPGA or ARM, and the requirement of real-time processing is also hard to meet. For example, SANet [11], one of the best-performing target-tracking networks based on Convolutional Neural Networks (CNNs), can only achieve 1 FPS on a high-performance NVIDIA GTX TITAN Z GPU with 12 GB of memory. Network models with strong performance achieve satisfactory accuracy and robustness at the cost of computing speed, so target-tracking algorithms based on CNNs are still difficult for engineers to adopt in application scenarios with high real-time requirements.
The KCF [12] tracking algorithm, proposed by Henriques et al., is an excellent algorithm with high tracking speed, accuracy and robustness. As a similarity measure between two signals, a correlation filter provides a reliable distance measure and a reasonable interpretation scheme. However, because the algorithm uses a fixed-size tracking template determined by the target object in the initial frame, it cannot deal with scale changes of the target during tracking, and the template is easily contaminated by an occluding object, leading to tracking failure. To address these challenges, this paper proposes a fast multi-scale kernel correlation filter tracker with an adaptive template updater for rigid objects. In the correlation filtering phase, we build a three-layer scale pyramid filter on the basis of KCF, correlate the target image of the previous frame with each layer of the pyramid, and output the scale factor and maximum response value, which handles multiple scale changes of the target effectively while maintaining an extremely high operating speed. In the template update phase, we propose an adaptive template updater based on the Mean of Cumulative Maximum Response Values (MCMRV), which sets adaptive thresholds to limit the updating of the template and effectively alleviates template drift when occlusion occurs. The main contributions are summarized as follows:
A simple three-layer scale pyramid filter is embedded into KCF, which makes the tracker adapt to the scale change of the target efficiently.
We propose an adaptive template updater based on MCMRV, which adaptively adjusts the template update threshold according to MCMRV criteria and plays a reliable role in dealing with target occlusion.
Experimental results show that the improved algorithm can effectively solve the problems of scale variation and target occlusion in target tracking under the condition of high operation speed.
The rest of this paper is organized as follows. Section 2 presents the related work. In Section 3, we propose an improved KCF algorithm. Section 4 reports the results of the experiment. Finally, Section 5 concludes the paper and outlines our future work.
2. Related Works
According to the method used in the observation model, target tracking algorithms can be divided into two categories: generative models and discriminant models. Discriminant models are further divided into models based on correlation filtering and models based on deep neural networks. Generative models mainly include the Kalman filter [13], particle filter [14], Meanshift [15] and Camshift [16], which are the earliest tracking models. A serious shortcoming of the generative model is that it does not update the model and always uses the model built at the beginning of the task, so it does not take into account the influence of environmental changes on the target state during the task. When the target is clear in a frame, it can be located well. However, when the target is blocked or the ambient light is poor, the tracking effect of this model is not satisfactory.
The discriminant model is the mainstream model of target tracking. It transforms the target tracking problem into a binary classification problem and locates the target by separating it from the background. This model can track targets well under complex environmental conditions, and it can be divided into algorithms based on correlation filtering and algorithms based on deep learning according to the features used.
Correlation filtering was originally used in signal processing to describe the correlation between two signals. Bolme et al. proposed the MOSSE filter, which applied a correlation filter to target tracking for the first time and has excellent real-time tracking performance. Circulant Structure of tracking-by-detection with Kernels (CSK) [17] uses a kernel correlation filter and exploits the circulant structure of cyclically shifted samples so as to improve the tracking accuracy. KCF uses Histogram of Oriented Gradient (HOG) features in place of the raw pixel information of an image and obtains a large number of samples by cyclically shifting the original feature samples. A Gaussian kernel function is introduced to transform low-dimensional non-separable feature information into high-dimensional separable feature information so as to facilitate the calculation of feature correlation. The discrete Fourier transform and the properties of the cyclic matrix are used to reduce the cost of computation and improve the speed of the algorithm in both classifier training and new sample detection. During target tracking, however, the accuracy of the algorithm is greatly reduced by target scale variation. The Discriminative Scale Space Tracking (DSST) [18] algorithm proposed by Danelljan et al. treats target tracking as two independent problems, target center shift and scale change, and trains a shift correlation filter and a scale correlation filter, respectively, with HOG features. Later, Danelljan et al. proposed fast DSST (fDSST) [19] on the basis of DSST and improved the performance of the algorithm by 6.13% and the FPS by 83.37% through feature reduction and interpolation. In 2015, Danelljan et al. proposed a further improved correlation filtering tracking algorithm, Spatially Regularized Discriminative Correlation Filters (SRDCF) [20]. Its idea is to expand the search area and restrict the effective scope of the filter template to mitigate the boundary effect, but its running speed is noticeably reduced. Background-Aware Correlation Filters (BACF) [21], proposed by H. Kiani et al., extend the HOG feature of a single channel to the HOG feature of multiple channels and use the ADMM method to speed up computation.
The above algorithms based on improved correlation filtering can solve the problem of target scaling well, but they still update the model even when the target is blocked, which leads to the introduction of a large amount of irrelevant information into the filter. The tracking effect will be reduced if the target is blocked for a long time, and the calculation complexity is high and the amount of calculation is large. In an embedded system with limited computing speed, the real-time performance of the tracking algorithm is greatly affected.
In the task of target tracking, acquiring target features is a key problem, and deep learning has shown powerful feature extraction and expression ability in other fields, so deep learning has been applied to the field of target tracking. Currently, commonly used neural network models include AlexNet, VGG [22], ResNet [23], YOLO [24] and GAN [25].
Reference [26] proposes MDNet, a classification-based deep-learning tracking algorithm that uses a small VGG network. The authors argue that targets in different training videos share common characteristics and therefore adopt multi-domain training, but the algorithm does not perform well in terms of speed and target occlusion. In reference [27], a GAN is added on the basis of MDNet, and positive samples under occlusion are generated through the GAN so that the classifier can deal with occlusion. However, the rapid increase in computation reduces the speed further. Reference [28] proposes the Siamese network SiamFC, which regards tracking as a similarity-learning problem and adopts two AlexNet networks to form a double-branch structure. Although the running speed is improved, it can meet real-time requirements only when a high-performance graphics card is used for acceleration.
The purpose of this study is to provide a tracking algorithm with excellent performance and speed for common rigid targets in engineering practice, and deep-learning-based algorithms are still difficult to fully apply in embedded platforms, so this paper will focus on tracking algorithms based on kernel correlation filtering. In this paper, an adaptive multi-scale pyramid and adaptive mean updater are used to improve the tracking performance of KCF for rigid targets.
3. The Proposed Approach
Our tracker framework is summarized in Figure 1. Based on the KCF algorithm, we build a simple scale pyramid module, which constructs a multi-layer pyramid around the target position as the input of the correlation filter. Accordingly, the filter outputs a multi-layer response pyramid from which the most suitable response value is taken as the tracking result. Considering that the response value decreases sharply when a rigid target is occluded, we introduce an adaptive template updater based on MCMRV. During tracking, the template updater adaptively adjusts the threshold according to the response results and judges whether target occlusion has occurred so as to avoid the template being polluted by noise.
3.1. Kernel Correlation Filter Algorithm
Our approach is improved on the basis of the KCF algorithm. The KCF target tracking algorithm firstly extracts HOG features from the image information of the target region and then trains the target classifier by generating a large number of samples through cyclic displacement. A Gaussian kernel function is used to calculate the correlation response between the target sample and the sample to be tested, and the coordinate of the maximum point of the response value is the latest position of the target. Using discrete Fourier transform to transform the above process from the time domain to frequency domain can greatly reduce the amount of computation and improve the speed of computation. Finally, the classifier is updated with new target features.
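Assume that the one-dimensional vector $a = [a_1, a_2, \ldots, a_q]$ of size 1 × q is the information of the target sample. The displacement matrix $P$ is used to carry out cyclic displacement of the sample as $Pa = [a_q, a_1, \ldots, a_{q-1}]$, and the matrix $P$ is shown as follows:

$$P = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} \quad (1)$$

$P^{i}a$ represents the displacement of sample $a$ by $i$ bits and denotes the sample after cyclic displacement, from which the sample cyclic matrix can be formed as $A = C(a) = [P^{0}a, P^{1}a, \ldots, P^{q-1}a]^{T}$.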
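The cyclic displacement described above can be illustrated with a minimal NumPy sketch. The helper names `shift_matrix` and `cyclic_samples` and the sample values are hypothetical and chosen only for illustration; the sketch builds the displacement matrix, stacks all shifted samples into a cyclic matrix, and checks that multiplying by this matrix agrees with an element-wise product in the Fourier domain.

```python
import numpy as np

def shift_matrix(q):
    """Permutation matrix P: P @ a moves the last element of a to the front."""
    return np.roll(np.eye(q), 1, axis=0)

def cyclic_samples(a):
    """Stack the shifted samples P^0 a, P^1 a, ..., P^(q-1) a as rows."""
    return np.stack([np.roll(a, i) for i in range(len(a))])

a = np.array([1.0, 2.0, 3.0, 4.0])   # toy 1 x q sample (illustrative values)
P = shift_matrix(len(a))
A = cyclic_samples(a)

# Multiplying by the cyclic matrix is a circular cross-correlation, so it
# can be computed as an element-wise product in the Fourier domain.
z = np.array([0.5, -1.0, 2.0, 0.0])
slow = A @ z
fast = np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(z)).real
print(np.allclose(slow, fast))
```

The agreement between `slow` and `fast` is the property that lets correlation filters replace a dense matrix product with fast Fourier transforms.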
3.2. Features Extraction and Regularization
The KCF algorithm is an extension of the CSK algorithm, which uses multi-channel HOG features instead of gray-level features to enrich the target sample information and improve the target tracking accuracy. The computed HOG features are 3 × nOrients + 5 dimensional. There are 2 × nOrients contrast-sensitive orientation channels, nOrients contrast-insensitive orientation channels, 4 texture channels and 1 all-zeros channel (used as a ‘truncation’ feature [29]). Using the standard value of nOrients = 9 gives a 32-dimensional feature vector at each cell. This variant of HOG, referred to as FHOG, has been shown to achieve superior performance to the original HOG features.
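The KCF algorithm introduces a kernel function to solve the problem of low-dimensional linear inseparability of samples and uses ridge regression to train the classifier. The classifier $f(x) = w^{T}\varphi(x)$ is trained by minimizing a regularized loss function, $\varphi(\cdot)$ being a function that maps the sample to the Hilbert feature space. The optimal $w$ is obtained to minimize the function value, and the mathematical formula is expressed as follows:

$$\min_{w} \sum_{i} \left( f(x_i) - y_i \right)^{2} + \lambda \lVert w \rVert^{2} \quad (2)$$

where $y_i$ is the regression label of sample $x_i$ and $\lambda$ is the regularization coefficient. The similarity between sample $X$ and $X'$ is expressed by a Gaussian kernel function, and the following formula can be derived, where $\mathcal{F}$ represents the Fourier transform, $\mathcal{F}^{-1}$ its inverse, and $\hat{X}$ is the Fourier transform of $X$:

$$k^{XX'} = \exp\left( -\frac{1}{\sigma^{2}} \left( \lVert X \rVert^{2} + \lVert X' \rVert^{2} - 2\,\mathcal{F}^{-1}\left( \hat{X}^{*} \odot \hat{X}' \right) \right) \right) \quad (3)$$

The kernel matrix $K$ is constructed from training sample $X$ and can be obtained by the Gaussian kernel function. The optimal solution can be obtained through Equation (2) as follows, where $\alpha$ is the vector of dual coefficients of the classifier and $\hat{y}$ is the Fourier transform of the regression labels:

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{XX} + \lambda} \quad (4)$$

For the newly input sample $z$, the sample set $Z$ can be obtained through feature extraction and cyclic displacement, which constructs the kernel matrix $K^{Z}$ with the training sample $X$ and satisfies the cyclic conditions. From this, the response of the test sample can be obtained, and the coordinate of the point with the maximum response value represents the latest position of the target. Then the tracker updates the template parameter $\alpha$ and the sample parameter $X$ by the following formula:

$$\alpha_{t} = (1 - \eta)\,\alpha_{t-1} + \eta\,\alpha, \qquad X_{t} = (1 - \eta)\,X_{t-1} + \eta\,X \quad (5)$$

where $\alpha_{t}$ and $X_{t}$ are the model parameters and sample parameters applied to the next frame, which are obtained from $\alpha_{t-1}$ and $X_{t-1}$ of the previous frame, and $\eta$ is the template update rate. The traditional KCF algorithm still updates the template and sample parameters when the target is blocked and cannot adjust the detection region.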
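The training, detection and update steps of this subsection can be sketched in NumPy for a single-channel patch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian label helper `gaussian_labels`, and the values of `sigma`, `lam` and the label width are hypothetical, and real trackers use multi-channel HOG features and a cosine window that are omitted here.

```python
import numpy as np

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between z and every cyclic shift of x, via the FFT."""
    cross = np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)).real
    dist = (x ** 2).sum() + (z ** 2).sum() - 2.0 * cross
    return np.exp(-np.maximum(dist, 0.0) / sigma ** 2)

def gaussian_labels(shape, width):
    """Regression target with its peak at index (0, 0), wrapping cyclically."""
    rows = np.fft.fftfreq(shape[0]) * shape[0]
    cols = np.fft.fftfreq(shape[1]) * shape[1]
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.exp(-(rr ** 2 + cc ** 2) / (2.0 * width ** 2))

def train(x, y, sigma, lam):
    """Ridge regression solved in the Fourier domain (dual coefficients)."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_hat, x, z, sigma):
    """Response map for test patch z; its peak gives the target shift."""
    k = gaussian_correlation(x, z, sigma)
    return np.fft.ifft2(np.fft.fft2(k) * alpha_hat).real

def update(old, new, eta):
    """Linear interpolation between the previous and the new parameters."""
    return (1.0 - eta) * old + eta * new

rng = np.random.default_rng(0)
x = rng.random((16, 16))                    # toy training patch
alpha_hat = train(x, gaussian_labels(x.shape, 2.0), sigma=0.5, lam=1e-4)
z = np.roll(x, (3, 5), axis=(0, 1))         # same patch moved by (3, 5)
response = detect(alpha_hat, x, z, sigma=0.5)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)
```

In this toy setting the test patch is an exact cyclic shift of the training patch, so the peak of the response map falls at the applied shift.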
3.3. Dimension Reduction
The tracking speed of the kernel correlation filter is determined by the cost of computing the Fourier transform. In this paper, our adaptive dimension reduction strategy adopts the Principal Component Analysis (PCA) method [30]. The implementation principle is briefly described below.
Let
be a d-dimensional training sample, where each eigenvector is n-dimensional. Based on the above description, we update the template to:
By minimizing the reconstruction error of
, the projection matrix
is obtained:
Because the reconstruction error
can be minimized under the constraints of
, our projection matrix
can be calculated by the eigenvalue decomposition of the autocorrelation matrix, which corresponds to the maximum eigenvalue. We use a compressed sample and transform template to obtain the response of test sample
:
where
represents the compressed transformation template composed of HOG features.
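As a rough illustration of the projection step, the sketch below computes a PCA projection matrix from the eigenvalue decomposition of the autocorrelation matrix and uses it to compress a template. The array layout (one row per spatial cell, one column per feature channel), the function names and the toy dimensions are assumptions made for this example only.

```python
import numpy as np

def pca_projection(template, d_reduced):
    """Rows are the eigenvectors of the autocorrelation matrix that belong
    to the d_reduced largest eigenvalues. template has shape (n, d)."""
    autocorr = template.T @ template            # (d, d), symmetric
    _, eigvecs = np.linalg.eigh(autocorr)       # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :d_reduced]       # largest eigenvalues first
    return top.T                                # (d_reduced, d), orthonormal rows

def compress(features, projection):
    """Project (n, d) features down to (n, d_reduced)."""
    return features @ projection.T

rng = np.random.default_rng(1)
# Toy template: 50 cells with 32 channels that really live in 3 dimensions.
template = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 32))
projection = pca_projection(template, 3)
compressed = compress(template, projection)
reconstruction = compressed @ projection
print(compressed.shape, np.allclose(reconstruction, template))
```

Because the toy template has rank 3, three components reconstruct it without loss; on real HOG features the reconstruction error is small but not zero, and the saving comes from running the Fourier transforms on fewer channels.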
3.4. Adaptive Multi-Scale Pyramid
In the process of target tracking, target scale changes often occur, but the size of the target tracking window in the traditional KCF algorithm is fixed. When the visual distance of the target changes or the camera moves, the proportion of the target to the image also changes, and there is an error between the tracking window and the actual target. In this paper, we build a scale pyramid of the current target in the original KCF algorithm. The template will carry out correlation filtering with each layer of images in the multi-scale pyramid and judge the scaling degree of the current target size according to the maximum filtering response value.
Assuming that
is the scale factor of the target size of the current frame compared to the previous frame and
is the target image of the previous frame,
can be obtained by the following optimization formula:
where
represents the correlation filtering results of
x and
y and
T represents the tracking template. Since the correlation filter requires input images of the same size, the scaled target images in the multi-scale pyramid must be resized to the size of the previous frame's target image. Meanwhile, the size of the patch window in KCF is also scaled according to the scale factor determined in Formula (6). The construction process of the scale pyramid is shown in Figure 2.
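The scale search can be sketched as follows. This is an illustrative stand-in rather than the tracker itself: a normalised circular cross-correlation replaces the kernel correlation filter, a nearest-neighbour resize replaces a library resize routine, and all function names and scale factors are hypothetical. The toy usage deliberately uses coarse factors (0.5, 1, 2) so that the nearest-neighbour resize gives an unambiguous answer; a practical pyramid would use factors much closer to 1.

```python
import numpy as np

def crop(frame, center, size):
    """Cut a size=(h, w) window centred on center=(row, col)."""
    top = center[0] - size[0] // 2
    left = center[1] - size[1] // 2
    return frame[top:top + size[0], left:left + size[1]]

def resize_nearest(img, size):
    """Nearest-neighbour resize to size=(h, w)."""
    rows = np.arange(size[0]) * img.shape[0] // size[0]
    cols = np.arange(size[1]) * img.shape[1] // size[1]
    return img[np.ix_(rows, cols)]

def peak_response(template, patch):
    """Peak of the normalised circular cross-correlation (at most 1)."""
    t = template - template.mean()
    p = patch - patch.mean()
    corr = np.fft.ifft2(np.conj(np.fft.fft2(t)) * np.fft.fft2(p)).real
    return corr.max() / (np.linalg.norm(t) * np.linalg.norm(p) + 1e-12)

def best_scale(frame, center, template, scales):
    """Correlate the template with every pyramid layer; keep the best one."""
    h, w = template.shape
    responses = []
    for s in scales:
        window = (int(round(s * h)), int(round(s * w)))
        layer = resize_nearest(crop(frame, center, window), (h, w))
        responses.append(peak_response(template, layer))
    i = int(np.argmax(responses))
    return scales[i], responses[i]

rng = np.random.default_rng(2)
base = rng.random((64, 64))
template = crop(base, (32, 32), (16, 16))       # target in the previous frame
frame = np.kron(base, np.ones((2, 2)))          # target now appears twice as large
scale, response = best_scale(frame, (64, 64), template, scales=(0.5, 1.0, 2.0))
print(scale)
```

Each layer is resized back to the template size before correlation, mirroring the requirement that the filter inputs share one size, and the selected factor would then be used to rescale the tracking window.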
3.5. Adaptive Template Updater Based on the Mean of Cumulative Maximum Response Values
Target occlusion can easily blur the template of the tracker and lead to tracking failure. When tracking a rigid target, occlusion can be judged from the maximum response value of the correlation filter. The change in the maximum response value of the correlation filter during the original KCF tracking process is shown in Figure 3. As can be seen from the figure, when the target is in the normal tracking state, the maximum response value fluctuates around an average value. When the target is occluded, the maximum response value is clearly lower than the mean. Therefore, this paper proposes to judge whether occlusion occurs according to the mean of the cumulative maximum response values and to stop updating the tracker template if occlusion occurs.
Based on the above, we propose the criterion of the Mean of Cumulative Maximum Response Values (MCMRV), and the KCF tracker will implement adaptive template updates according to the MCMRV criteria. The maximum response value
is accumulated after
t frames accumulation to obtain the cumulative value:
Therefore, the update threshold is:
The updating of the template parameter
and the sample parameter
X based on MCMRV criteria yields the following expression, where
indicates the allowed floating range:
In this paper, MCMRV is initialized by assuming that the tracker is in a normal tracking state for a fixed number of initial frames after tracking begins, during which the tracker template is updated according to Formula (5). Once the frame count exceeds this initialization length, the MCMRV criterion is enabled for adaptive updating of the tracker template, according to Formula (7). The implementation of the adaptive multi-scale KCF based on the MCMRV criterion (MMKCF) is shown in Algorithm 1.
Algorithm 1: MMKCF
Input: The video frame, ; Initial bounding box of the target, ;
Output: The target position predicted by the tracker, p;
1: while do
2:   if then
3:
4:   else
5:     if then
6:       for to do
7:
8:       end for
9:
10:
11:    else
12:      for do
13:
14:      end for
15:      if then
16:
17:      else
18:
19:
20:
21:      end if
22:    end if
23:  end if
24: end while
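The gating idea behind the MCMRV criterion can be sketched as a small Python class. The exact threshold rule is not recoverable from the text here, so this sketch makes three labeled assumptions: the threshold is a fixed fraction (`ratio`, hypothetical) of the running mean of maximum responses, the first `warmup` frames (hypothetical) always update the template, and only accepted frames contribute to the running mean.

```python
class McmrvUpdater:
    """Decides per frame whether the tracker template may be updated."""

    def __init__(self, warmup, ratio):
        if warmup < 1:
            raise ValueError("warmup must be at least 1 frame")
        self.warmup = warmup      # assumed number of always-update frames
        self.ratio = ratio        # assumed allowed fraction of the mean
        self.total = 0.0          # cumulative maximum response
        self.count = 0            # number of accumulated frames

    def should_update(self, max_response):
        if self.count < self.warmup:
            allow = True          # initialization: assume normal tracking
        else:
            mean = self.total / self.count
            allow = max_response >= self.ratio * mean
        if allow:
            # Assumption: rejected (likely occluded) frames are not
            # accumulated, so they cannot drag the mean down.
            self.total += max_response
            self.count += 1
        return allow

updater = McmrvUpdater(warmup=3, ratio=0.6)
decisions = [updater.should_update(r) for r in (0.8, 0.9, 0.7, 0.2, 0.85)]
print(decisions)
```

In the toy sequence the fourth response drops far below the running mean, as it would under occlusion, so the template update is skipped for that frame and resumes when the response recovers.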
5. Conclusions
This paper improves the KCF algorithm in two respects. First, a simple multi-scale pyramid is integrated into KCF so that the tracker can adapt to size changes of a rigid target while meeting the real-time requirement. Second, the adaptive template updater based on the MCMRV criterion enables KCF to deal effectively with occlusion of a rigid target. Experimental results show that our approach is effective and improves the precision and success rate of the tracking algorithm. Compared with other state-of-the-art tracking algorithms based on the kernel correlation filter, MMKCF adapts well to scale changes of the target and deals with occlusion effectively while maintaining high-speed processing. MMKCF is therefore well suited to embedded platforms with low power, small volume and limited computing power.
Although the proposed method can efficiently deal with the problem of target scale variation and occlusion, tracking accuracy is often reduced due to target illumination variation and motion blur in practical engineering. Meanwhile, solving the problem of targets that rotate or roll and cause tracking failure is also a worthwhile research priority. In future work, we will focus on exploring solutions to these problems and strive to promote the application of our academic research results in engineering projects.