The object tracking algorithm proposed in this paper uses SOM and multiple correlation filters to address three challenges in visual tracking: (1) significant changes in target appearance over time; (2) changes in scale; and (3) recovery of the target after a tracking failure. First, existing algorithms based on a single correlation filter [23] cannot achieve these goals, because it is difficult to balance stability and adaptability with only one filter. Second, although much work has addressed scale prediction [17,24,44], the problem remains unresolved, because even a slight error in scale estimation causes rapid degradation of the appearance model. Third, it remains challenging to determine when a tracking failure occurs and to re-detect and resume tracking the target after the failure. In the proposed algorithm, we use three displacement filters at different levels, a scale filter and a memory filter to solve these problems.
Figure 2 shows the construction of a correlation filter for visual tracking. The displacement filters model and estimate different forms of the target, the scale filter estimates the scale of the tracked object, and the long-time memory filter retains a long-time memory of the target appearance in order to estimate the confidence level of every tracking result.
Figure 3 shows a schematic diagram of the object tracking algorithm using three correlation filters. The tracker is initialized in the first input frame: the SOM is trained at the specified object position to extract regional features, and the three correlation filters of the algorithm are learned. For each subsequent frame, we first apply the three displacement filters to a search window centered on the target position from the previous frame, which yields three target locations. The average of these three locations is our estimated target location. Once the estimated position is obtained, we use the scale filter to predict the change in target scale, which determines the bounding box of the tracked target. For each tracking result, we use the long-time memory filter to judge whether tracking has failed, that is, whether the target confidence is lower than a set threshold. If the tracker loses the target, the online detector is activated to recover the lost or drifting target. When the confidence of the re-detected object is greater than the set update threshold, the long-time memory filter is updated first, and the displacement filters are then updated with a reasonable learning rate.
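The position fusion step described above amounts to averaging the peak locations reported by the displacement filters. The following is a minimal sketch of that step only; the function name `fuse_locations` and the (row, col) coordinate convention are assumptions made for illustration and do not come from the original implementation.

```python
import numpy as np

def fuse_locations(locations):
    """Average per-filter (row, col) target estimates into one location."""
    pts = np.asarray(locations, dtype=float)
    if pts.ndim != 2 or pts.shape[1] != 2:
        raise ValueError("expected a sequence of (row, col) pairs")
    return pts.mean(axis=0)

# Hypothetical peak locations reported by the three displacement filters.
estimate = fuse_locations([(120.0, 64.0), (122.0, 66.0), (121.0, 62.0)])
print(estimate)
```

A plain mean treats the three filters as equally reliable, which matches the description in the text; a weighted mean would be a different design and is not implied here.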
In our experiments comparing classifiers, the support vector machine (SVM) obtained much better results than the other algorithms on small training sets. The SVM has excellent generalization ability and places weaker requirements on the scale and distribution of the data. Although the long-time memory filter proposed in this algorithm can itself be used as a detector, it uses high-dimensional features, so its computational load is large. To improve computational efficiency, we use an online-trained SVM classifier to construct an additional detector. We update the detection module and the long-time memory filter with a reasonable learning rate, so that they capture the target appearance over a long period of time.
3.1. Kernelized Correlation Filters-Based Tracker
Trackers based on correlation filters [17,45] have achieved very good performance in recent evaluations [38,46]. The main idea of these works is to regress the cyclically shifted input features to soft regression targets, such as those generated by a Gaussian function. The cyclically shifted input features resemble densely sampled patches of the target appearance [6]. Since training a correlation filter does not require binary (hard-thresholded) samples, trackers based on correlation filters effectively alleviate the sampling ambiguity that adversely affects most tracking-by-detection algorithms. In addition, by exploiting the redundancy in the set of shifted samples, the Fast Fourier Transform (FFT) makes it possible to train correlation filters efficiently on a large number of training samples. This increase in training data helps distinguish the target from the surrounding background. This section explains the derivation of kernelized correlation filtering in detail.
Henriques et al. [6] sample the target area cyclically, that is, densely, which reduces the amount of computation and improves both efficiency and tracking accuracy. Unlike the sparse sampling used by other algorithms, the correlation filtering used in the proposed method does not strictly distinguish between positive and negative samples; instead, a transformation matrix is used to cyclically shift the target image block
x. For a one-dimensional image
, the transformation matrix is as follows:
The cyclic shift transformation matrix (1) is used to cyclically shift the image, and the images transformed by this permutation matrix constitute the circulant matrix:
X is the circulant matrix, which the Discrete Fourier Transform (DFT) diagonalizes as follows:
where
F represents the constant matrix of DFT that transforms the spatial domain data into frequency domain;
is the DFT transform of
x (such as
),
is the Hermitian transpose, also called the conjugate transpose, that is, the matrix is conjugated and then transposed.
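The diagonalization property described above can be checked numerically. The sketch below is an illustration under stated assumptions: the rows of the data matrix are taken to be `np.roll(x, i)`, F is built as the unitary DFT matrix, and the helper names `circulant_rows` and `unitary_dft_matrix` are hypothetical.

```python
import numpy as np

def circulant_rows(x):
    """Stack all cyclic shifts of x as rows: row i is x shifted right by i."""
    return np.stack([np.roll(x, i) for i in range(len(x))])

def unitary_dft_matrix(n):
    """Unitary DFT matrix F, so that F @ F.conj().T is the identity."""
    return np.fft.fft(np.eye(n)) / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
X = circulant_rows(x)
F = unitary_dft_matrix(len(x))
x_hat = np.fft.fft(x)  # unnormalized DFT of the base sample

# The circulant matrix is recovered from the DFT of its generating vector.
X_rebuilt = F @ np.diag(x_hat) @ F.conj().T
print(np.allclose(X, X_rebuilt))  # True
```

Because every circulant matrix shares the same eigenvectors (the columns of F), products and inverses of such matrices reduce to element-wise operations on the DFT coefficients, which is what the derivation that follows relies on.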
The linear correlation filter f, trained on an image block x, can be regarded as a ridge regression model that uses all cyclic shifts (horizontal and vertical) of x as training data. We assign a Gaussian regression target score to each shifted sample, indexed by its shift along the horizontal and vertical directions. At the center of the target object, the score takes its highest value of 1. As the position moves away from the target center, the score drops quickly from 1 to 0. The kernel width is a predefined parameter that controls the sensitivity of the scoring function.
First, in the Fourier domain, the ridge regression solution for the circulant matrix
X is as follows:
where
I is the identity matrix with size
, according to Equation (
3), we obtain:
The operations on the diagonal matrices are all element-wise, so we obtain the following:
Here, the symbol ⊙ denotes the Hadamard product, that is, element-wise multiplication in which elements at the same position are multiplied. Then, using the unitarity of the Fourier transform matrix, namely:
, Equation (
4) can be rewritten as:
Substituting Equation (
3) into Equation (
7), we get:
According to the properties of the circulant matrix, its construction rule and the properties of the Fourier transform, we have:
is the cyclic shift matrix of
. Combining the right-hand sides of Equations (9) and (10), we have:
According to Equation (
8):
According to the nature of the circulant matrix convolution:
From Equation (
12), we can get:
Since
and
are in a conjugate relationship, each element in
is a real number, so taking the conjugate leaves every element unchanged. Therefore, Equation (
13) can continue to be deduced, as follows:
The objective function for training the correlation filter by linear ridge regression is:
where
is a regularization term. Equation (
14) is a linear estimator:
. From Equation (
13), the Fourier frequency domain solution is:
where
represents the Fourier signal of
,
is the complex conjugate transform of
and ⊙ denotes the Hadamard product. To strengthen the discriminative ability of the learned filter, Henriques et al. [5] introduced a kernel K, so that the correlation filter is trained in the kernel space while the computational complexity remains the same as that of the linear filter. The kernelized correlation filter is computed as:
where
is the dual variable of
W. For shift-invariant kernels, such as the RBF kernel, the dual coefficients [20,47] can be found in the Fourier domain by exploiting the circulant matrix structure:
where
K represents the kernel correlation matrix, and the Fourier transform of
K is as follows:
Since the algorithm requires only element-wise products, the FFT and the inverse FFT, its computational time complexity is O(n log n), where n is the number of input data.
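The practical effect of the derivation above is that ridge regression over all cyclic shifts needs no explicit matrix inversion. The sketch below compares a direct solve with the frequency-domain shortcut for the linear filter. It is an illustration under stated assumptions: the rows of the data matrix are `np.roll(x, i)`, NumPy's forward FFT sign convention is used, and the function names and the value of `lam` are hypothetical. Where the complex conjugate appears depends on these conventions, so the sketch verifies the result against the direct solution rather than asserting one particular written form.

```python
import numpy as np

def ridge_direct(x, y, lam):
    """Closed-form ridge regression on the explicit circulant data matrix."""
    X = np.stack([np.roll(x, i) for i in range(len(x))])
    A = X.T @ X + lam * np.eye(len(x))
    return np.linalg.solve(A, X.T @ y)

def ridge_fourier(x, y, lam):
    """Same solution using only FFTs and element-wise operations."""
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    w_hat = x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)
    return np.fft.ifft(w_hat).real

rng = np.random.default_rng(1)
n = 16
x = rng.standard_normal(n)
# Gaussian regression targets peaked at zero shift, with wrap-around.
shift = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (shift / 2.0) ** 2)

w_direct = ridge_direct(x, y, lam=1e-2)
w_fast = ridge_fourier(x, y, lam=1e-2)
print(np.allclose(w_direct, w_fast))
```

The direct solve costs on the order of n cubed operations, whereas the frequency-domain version is dominated by the FFT.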
Given a new input frame, we use a solution similar to Equation (
19) to efficiently compute the correlation response map. To do so, we crop an image block
z centered on the object position in the previous frame, and then use the trained target template
to calculate the response map
f in the Fourier transform domain:
Finally, we search for the position of the maximum value of the response map f to locate the target.
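The training and detection steps of this section can be sketched end to end for a single-channel patch. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel, raw pixel values in place of SOM features, no cosine window, and hypothetical parameter values (`sigma`, `lam`, label width). The function names are also assumptions.

```python
import numpy as np

def gaussian_labels(shape, sigma):
    """Gaussian regression targets peaked at zero shift, with wrap-around."""
    h, w = shape
    dy = np.minimum(np.arange(h), h - np.arange(h))[:, None]
    dx = np.minimum(np.arange(w), w - np.arange(w))[None, :]
    return np.exp(-(dy ** 2 + dx ** 2) / (2.0 * sigma ** 2))

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between x and every cyclic shift of z, via the FFT."""
    cross = np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)).real
    dist2 = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross) / x.size
    return np.exp(-np.maximum(dist2, 0.0) / sigma ** 2)

def train(x, y, sigma, lam):
    """Dual coefficients in the Fourier domain."""
    k_xx = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k_xx) + lam)

def detect(alpha_hat, x, z, sigma):
    """Response map for a new patch z given the learned template x."""
    k_xz = gaussian_correlation(x, z, sigma)
    return np.fft.ifft2(np.fft.fft2(k_xz) * alpha_hat).real

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))           # stand-in for a feature patch
alpha_hat = train(x, gaussian_labels(x.shape, 2.0), sigma=0.5, lam=1e-4)

z = np.roll(x, (3, -5), axis=(0, 1))        # target moved by (3, -5)
response = detect(alpha_hat, x, z, sigma=0.5)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print(peak)  # (3, 27): the shift (3, -5) modulo the patch size
```

Peak coordinates above half the patch size correspond to negative displacements because of the cyclic wrap-around, so a tracker would map 27 back to -5 before moving the bounding box.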
3.3. Scale Filter
Danelljan et al. [
17] proposed a discriminative correlation filter for scale estimation. We similarly construct a feature pyramid of the target appearance, centered on the estimated position, and use it to train the scale filter. Unlike [
17], our method does not use the predicted scale change to update the displacement filter
. Let
be the size of the tracking target and
S be the target scale set. For scale
, the size of the image area captured with the estimated target position as the center is
, and the captured image block is rescaled to
. Then SOM features are extracted from each sampled image block to form a multi-scale representation of the feature pyramid containing the target. Assuming that
is the feature vector of scale
s, and
is the optimal scale of the target object, then:
During tracking, our method first estimates the change in target displacement and then predicts the change in scale. This differs from existing tracking algorithms, which generally infer changes in position and scale at the same time. For example, tracking algorithms based on particle filtering [51] use random samples to approximate the distribution of the target state over position and scale. Gradient descent methods (such as Lucas-Kanade [52]) infer the locally optimal position and scale iteratively. The proposed algorithm breaks the tracking task into two independent subtasks, which not only reduces the burden of densely evaluating the target state, but also avoids noisy updates of the displacement filter when the scale estimation is inaccurate.
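The scale search described in this section can be sketched as follows. This is an illustration under stated assumptions: the scale set is taken to be a geometric series symmetric around 1, and the number of scales, the scale step, the target size and the response scores are hypothetical values, not the settings used in the paper.

```python
import numpy as np

def scale_set(num_scales, step):
    """Scale factors step**n for n symmetric around 0, so 1.0 is included."""
    half = (num_scales - 1) // 2
    exponents = np.arange(-half, half + 1).astype(float)
    return step ** exponents

def candidate_sizes(target_size, scales):
    """Size of the region cropped around the estimated position per scale."""
    p, q = target_size
    return [(max(1, int(round(float(s) * p))), max(1, int(round(float(s) * q))))
            for s in scales]

def best_scale(scores, scales):
    """Scale whose filter response is largest."""
    return float(scales[int(np.argmax(scores))])

scales = scale_set(num_scales=5, step=1.05)      # hypothetical settings
sizes = candidate_sizes((40, 20), scales)        # each crop is resized to 40 x 20
scores = [0.10, 0.30, 0.90, 0.20, 0.10]          # made-up scale filter responses
print(sizes, best_scale(scores, scales))
```

Because the scale is searched only after the position has been fixed, the number of evaluations grows with the number of scales rather than with the product of positions and scales.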
The particle filter-based tracking algorithm [
51] uses random samples to approximate the target state distribution including position and scale changes, as shown in
Figure 4a. Gradient descent methods (such as Lucas-Kanade [
52]) iteratively infer local optimal positions and scale changes (see
Figure 4b). The correlation filter-based object tracking algorithm [23] decomposes the tracking task into two independent subtasks (position and scale estimation), as illustrated in Figure 4d, which not only reduces the burden of densely estimating the target state, but also avoids noisy updates of the displacement filter when the scale estimation is inaccurate. Experimental results (see the Ablation Study section) show that our tracker performs significantly better than an alternative implementation (CT-JOP), which uses the estimated scale change to update the displacement filter.
3.4. Long-Time Memory Filter
To adapt to changes in the target appearance during tracking, the tracking algorithm must update the previously trained displacement filters over time. However, if the filter is updated by directly minimizing the output error over all tracking results, the computational overhead during tracking becomes very large [53,54]. The proposed algorithm instead uses a moving average scheme to update the displacement filter. The update equations are as follows:
where
t is the index of the image frame, and
is the learning rate. This scheme updates a position filter in every frame, which emphasizes model adaptation and short-time memory of the target appearance, but only one of the three position filters is updated each time; the three filters are selected in a round-robin manner. Because this scheme is very effective in dealing with appearance changes, the tracking algorithms in [6,17] achieved good performance in recent benchmark studies [38,46]. However, when the training samples are noisy, these trackers are prone to drift and cannot recover from tracking failures, because they lack a long-time memory of the target appearance. The update scheme in Equations (21) and (22) assumes that the tracking result of each frame is sufficiently reliable to be used as a training sample for updating the correlation filter. This assumption does not hold in complex scenes, where such updates easily cause tracking drift. To solve this problem, we create a long-time memory filter that preserves the appearance of the target. To maintain the stability of object tracking, we set a threshold for conditionally updating the long-time memory filter: the filter is updated only when the target's confidence is greater than this threshold. The proposed algorithm uses the maximum value of the correlation response map as the confidence score, because it reflects the similarity between the tracked object and the template learned by the long-time memory correlation filter. Compared with long-time memory methods [55,56] that use only the first frame as the target appearance, we conditionally update the long-time memory filter, which allows it to adapt to a certain degree of time-varying target appearance.
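The two update policies described in this section can be sketched together. This is a minimal illustration under stated assumptions: the moving average is taken to be a linear interpolation between the old and new model with learning rate `lr`, the models are plain arrays standing in for the learned template and coefficients, and the class name, function names and numeric values are hypothetical.

```python
import numpy as np

def interpolate(old, new, lr):
    """Moving-average update: keep (1 - lr) of the old model, add lr of the new."""
    return (1.0 - lr) * old + lr * new

class RoundRobinUpdater:
    """Updates one of several displacement filter models per frame, in turn."""

    def __init__(self, models, lr):
        self.models = list(models)
        self.lr = lr
        self.next_index = 0

    def update(self, new_model):
        i = self.next_index
        self.models[i] = interpolate(self.models[i], new_model, self.lr)
        self.next_index = (i + 1) % len(self.models)
        return i  # index of the filter that was refreshed

def update_memory(memory, new_model, confidence, threshold, lr):
    """Long-time memory model changes only when the result is confident."""
    if confidence > threshold:
        return interpolate(memory, new_model, lr)
    return memory

updater = RoundRobinUpdater([np.zeros(2), np.zeros(2), np.zeros(2)], lr=0.1)
first = updater.update(np.ones(2))    # refreshes filter 0
second = updater.update(np.ones(2))   # refreshes filter 1

memory = np.zeros(2)
kept = update_memory(memory, np.ones(2), confidence=0.3, threshold=0.4, lr=0.1)
moved = update_memory(memory, np.ones(2), confidence=0.6, threshold=0.4, lr=0.1)
print(first, second, kept, moved)
```

Refreshing the filters in turn means each displacement filter reflects a slightly different recent history, while the gated memory model ignores frames whose confidence is too low to be trusted.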
3.5. Online Object Detector
The displacement filter captures the appearance of the target and is a short-time memory filter. We use contextual information around the target object to learn this filter. To reduce the boundary discontinuity caused by the cyclic shift, we weight each channel of the input feature with a two-dimensional cosine window. We use SOM features to learn the scale filter. Unlike the displacement filter, the scale filter is learned from features extracted directly from the target area without the surrounding context, because the surrounding context provides no information about the change in target scale. We learn the long-time memory filter with a conservative learning rate, so that it maintains a long-time memory of the target appearance and can be used to determine whether a tracking failure has occurred.
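The cosine weighting mentioned above can be sketched as follows. This is an illustration that assumes a Hann window built with `np.hanning` and a feature array laid out as (height, width, channels); the function names are hypothetical.

```python
import numpy as np

def cosine_window(height, width):
    """Two-dimensional cosine (Hann) window that is zero at the borders."""
    return np.outer(np.hanning(height), np.hanning(width))

def apply_window(features):
    """Weight every channel of an (H, W, C) feature array by the same window."""
    h, w, _ = features.shape
    return features * cosine_window(h, w)[:, :, None]

features = np.ones((8, 8, 3))        # stand-in for multi-channel features
windowed = apply_window(features)
print(windowed.shape, windowed[0, 0, 0], windowed.max())
```

Tapering the patch toward zero at its edges suppresses the artificial discontinuities that appear when the patch is wrapped around by a cyclic shift.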
Tracking failure is generally caused by severe occlusion or by the target moving out of the camera view. In our tracking algorithm, for each tracked target z, we use the memory filter to calculate its confidence. Only when the confidence is lower than the predefined re-detection threshold Tr do we activate the detector. This reduces the computational load during tracking and avoids running sliding-window detection in every frame.
To keep the system efficient, we use an SVM as the detector instead of the long-time memory filter. We extract training samples at the estimated target position to train the SVM detector incrementally, and assign binary labels to these samples according to their overlap ratio [35]. In this algorithm, we extract only samples with changed targets for training, which further reduces the computational workload. During training, a quantized color histogram is used as the feature representation: the image is converted to the CIE Lab color space and each channel is quantized to 5 bits (referring to four equal intervals in each channel). To improve robustness against drastic illumination changes, we apply the non-parametric local rank transform [57] to the L channel.
3.6. Method Implementation
As shown in Figure 3, the proposed tracking algorithm uses SOM features to train three correlation filters for position estimation, scale estimation and long-time memory of the target appearance. We also build a re-detection module that uses the SVM detector to recover the target after tracking failures. The proposed tracking algorithm is summarized in Algorithm 1.
Algorithm 1: Object tracking algorithm based on SOM and correlation filter.
The displacement filters combine context information to separate the tracked target from the background. Some methods [20,58] enlarge the target bounding box by a fixed ratio of 2.5 to include the surrounding context. Our experiments indicate that an appropriate increase in the context area further improves the tracking results. We initially enlarge the bounding box by a factor of 2.8, and then take the aspect ratio of the target bounding box into account. We also observed that when the target (such as a pedestrian) has a small aspect ratio, a smaller zoom ratio yields less unnecessary context area in the vertical direction. For this reason, when the aspect ratio of the target is less than 0.5, we reduce the zoom in the vertical direction by half. To train the SVM detector, we densely sample a large window centered on the estimated target. When the overlap ratio between a sample and the target position is greater than 0.5, we assign it a positive label of +1; when the overlap ratio is less than 0.1, we assign it a negative label of −1.
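The labeling rule for the SVM training samples can be sketched as follows. This is an illustration under stated assumptions: the overlap ratio is taken to be intersection over union, boxes are given as (x, y, width, height), the negative label is taken to be -1, and samples between the two thresholds are assumed to be discarded, which the text does not state explicitly. The function names are hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def assign_label(sample_box, target_box, pos_thresh=0.5, neg_thresh=0.1):
    """+1 for high overlap, -1 for low overlap, None for ambiguous samples."""
    ratio = overlap_ratio(sample_box, target_box)
    if ratio > pos_thresh:
        return 1
    if ratio < neg_thresh:
        return -1
    return None  # assumption: in-between samples are not used for training

target = (0, 0, 10, 10)
labels = [
    assign_label((0, 0, 10, 10), target),    # identical box
    assign_label((5, 0, 10, 10), target),    # overlap ratio of 1/3
    assign_label((20, 20, 10, 10), target),  # disjoint box
]
print(labels)  # [1, None, -1]
```

Leaving a gap between the positive and negative thresholds keeps ambiguous, partially overlapping patches out of the training set.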
In this algorithm, the re-detection threshold Tr is set to a low value of 0.20. When the confidence level falls below this value, the algorithm activates the SVM detection module. For re-detection, the target acceptance threshold Ta is set to 0.4, and a detection is accepted as the target only if its confidence is higher than this threshold. Each detection result must be retained, because it is needed when relocating the target and reinitializing the tracking process. We also set the stability threshold to 0.4, and update the memory filter when the confidence is greater than this threshold, so as to keep a long-time memory of the target appearance. All thresholds are compared with the confidence score calculated by the long-time memory filter
, and the regularization parameter of Equation (
2) is set to
. The Gaussian kernel width setting in Equation (
9) is proportional to the target size
,
. The learning rate
in Equations (21) and (22). For scale estimation, we use the feature pyramid series
, and the scale factor
.
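The threshold logic of this subsection can be sketched as a per-frame decision. This is an illustration under stated assumptions: the three threshold values are the ones given in the text, while the constant names, the function names and the `detect` callable (standing in for the SVM detector followed by confidence scoring with the long-time memory filter) are hypothetical.

```python
T_REDETECT = 0.20  # re-detection threshold Tr
T_ACCEPT = 0.40    # target acceptance threshold Ta
T_STABLE = 0.40    # stability threshold for updating the memory filter

def needs_redetection(confidence):
    return confidence < T_REDETECT

def accept_detection(detection_confidence):
    return detection_confidence > T_ACCEPT

def should_update_memory(confidence):
    return confidence > T_STABLE

def resolve_frame(tracked_box, tracked_conf, detect):
    """Return (box, confidence, update_memory) for one frame.

    `detect` is a hypothetical callable returning (box, confidence), where the
    confidence is assumed to be scored by the long-time memory filter.
    """
    box, conf = tracked_box, tracked_conf
    if needs_redetection(conf):
        det_box, det_conf = detect()
        if accept_detection(det_conf):
            box, conf = det_box, det_conf
    return box, conf, should_update_memory(conf)

tracked = (10, 10, 40, 80)
found = (14, 12, 40, 80)
confident = resolve_frame(tracked, 0.55, lambda: (found, 0.90))
recovered = resolve_frame(tracked, 0.10, lambda: (found, 0.60))
rejected = resolve_frame(tracked, 0.10, lambda: (found, 0.30))
print(confident, recovered, rejected)
```

The detector is only called when the tracking confidence falls below the re-detection threshold, so confident frames incur no detection cost.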