1. Introduction
In recent years, there has been increasing interest in real-time tracking and target identification across several contemporary applications. This can be attributed to the widespread use of surveillance cameras and their increasing deployment, especially in the security and surveillance domains [1].
Although the term object tracking in computer vision research commonly refers to following a single object in a video, Multi-Target Tracking (MTT) is what observers routinely perform in the real world.
In computer vision, two techniques are employed for MTT. The first creates several instances of a single-object tracker (SOT) and assigns each one to a different target. Because the same approach is used to track numerous objects in this case, cognitive neuroscience studies of single-object tracking can be informative. The second technique creates an MTT algorithm that monitors many objects at the same time. This technique is helpful because it can exploit information shared across the system, which is important for tracking individual targets and handling interactions between them. In general, MTT algorithms in computer vision, especially brain-inspired algorithms, are still far from the precise processes used by the brain. However, there is much more that MTT and cognitive neuroscience can learn from each other.
Regarding the effect of semantic information, MTT is facilitated by the semantic distinction between target and distractor categories. This facilitation arises from the categorical separation of targets and distractors, which is supported by four processes. First, any visual contrast between targets and distractors aids tracking. Second, semantic distinctions between targets and distractors lead the attentional system to select a different method for allocating attention to objects, thereby facilitating tracking. Third, category information may be kept in visual working memory, improving recovery from errors. Fourth, there is a system that categorizes information, making it simpler to track targets amid distractors belonging to different categories.
As for the effect of surface features, color and form are the most immediately perceivable surface characteristics that help an observer distinguish between items. One study investigated whether surface-feature information helps MTT or whether tracking numerous objects depends only on spatiotemporal information [2]. Surface characteristics such as color and form are stored in the brain's early visual processing regions [3]. According to prior studies, these characteristics clearly influence the processing of targets and distractors during MTT and aid tracking. MTT methods, on the other hand, lack generality and are biased toward finding the optimal feature set for a single application.
In terms of the effect of motion features and depth in MTT, motion information plays a crucial role [4]. A task that introduces motion into the texture of each item and the backdrop is designed to examine the influence of motion information on visual attention via MTT [5].
In MTT tasks, in addition to intrinsic motion, an item's speed and direction of movement are often assigned at random to force the object to follow different motion trajectories. In other research, velocity is assumed to be constant in order to disregard the influence of acceleration [6].
Among the various methods used for target tracking, the correlation filtering approach stands out due to its efficiency and robustness [7]. A real-time target tracking system based on kernel correlation filtering is a computer vision system that uses kernel correlation filters to track targets in real time. Unlike deep learning, which relies on large training datasets, correlation filter-based trackers such as kernelized correlation filters (KCFs) exploit the implicit properties of tracked images, such as their circulant structure, for real-time training [8]. This system begins by detecting the target in the initial frame of the video sequence and subsequently tracks the target in later frames using a kernel correlation filter [9].
The computational efficiency of KCFs, which makes them well suited to low-power heterogeneous computational processing technologies, lies in their ability to compute in a high-dimensional feature space without explicitly performing computations in that space [10]. A kernel correlation filter is a mathematical algorithm that uses a kernel function to map the input image to a high-dimensional space. The filter calculates the correlation between the target template and the input image in this high-dimensional space, enabling it to track the target even as scale, rotation, and lighting conditions change. A real-time target tracking system based on kernel correlation filtering offers several advantages over other tracking systems. It is computationally efficient and enables real-time tracking of targets. Furthermore, it is robust to changes in a target's appearance, making it suitable for tracking targets in complex situations [11]. Kernelized correlation filters (KCFs) have gained significant popularity since their inception due to their satisfactory speed and accuracy. They are particularly well suited to target tracking systems with demanding real-time requirements. However, they lack the ability to detect tracking failures, so they are not suitable for long-term target tracking. Building on previous research, we propose an improved KCF that meets the requirements of long-term target tracking. First, we introduce a confidence mechanism to evaluate the target tracking results and determine the tracking status. Second, we design a tracking model update strategy to reduce the interference from background information, thereby enhancing the robustness of the algorithm.
Several studies on real-time detection and tracking algorithms have been conducted in the past. Abdulghafoor and Abdullah (2022) developed a novel real-time framework for detecting and tracking multiple objects, addressing various challenges [1]. Afterwards, Guoqing Shi et al. (2022) conducted research to enhance the kernel correlation filtering algorithm by incorporating the Kalman filter [12]. Subsequently, Sun et al. (2023) introduced a target tracking approach utilizing a kernelized correlation filter in conjunction with MWIR and SWIR sensors [13]. Following this, Feng and Wang (2023) proposed a model-adaptive updating kernel correlation filter tracker that incorporated deep CNN features [14]. Each of these studies presented distinct methodologies in order to identify the most effective techniques yielding optimal results.
The major focus of this research is to investigate target tracking approaches in the context of high real-time requirements while addressing the challenge of tracking robustness. The second part provides an overview of the necessary theoretical foundation; we delve into the kernel-based tracking approach and explore methods to improve scale estimation. Finally, this research presents the experimental findings and a thorough analysis of the tracking algorithms, including a tracking algorithm evaluation index system, a comparative analysis of the improved algorithm against the core experimental results of related tracking algorithms, and a comparison between the algorithm developed in this study and other prominent tracking algorithms.
2. Methodology and Development of the Proposed Tracking Method
2.1. Typical Links of Target Tracking
Image target tracking means that, after a target is manually specified, the image processing system provides status information for the target in each frame of the image in the field of view. A typical target tracker [15] can be divided into the following four parts: a search mechanism, feature extraction, a target classifier, and a model update mechanism. To enhance robustness, some researchers have proposed tracking algorithms that integrate multiple features or multiple trackers. Figure 1 describes a typical target tracker composition.
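To make this four-part decomposition concrete, the following minimal Python sketch mirrors the structure of Figure 1; all class and function names and parameter values here are illustrative assumptions, not components of any cited tracker.

```python
# Minimal sketch of the four-part tracker: search mechanism, feature
# extraction, target classifier, and model update (bounds checks omitted).
import numpy as np

class TypicalTracker:
    def __init__(self, first_frame, target_box):
        self.box = target_box                    # (x, y, w, h)
        self.model = self.extract_features(first_frame, target_box)

    def extract_features(self, frame, box):
        """Feature extraction: here simply the normalized gray patch."""
        x, y, w, h = box
        patch = frame[y:y + h, x:x + w].astype(np.float64)
        return (patch - patch.mean()) / (patch.std() + 1e-8)

    def search(self, frame):
        """Search mechanism: candidate boxes near the previous position."""
        x, y, w, h = self.box
        return [(x + dx, y + dy, w, h)
                for dx in (-4, 0, 4) for dy in (-4, 0, 4)]

    def classify(self, feat):
        """Target classifier: correlation score against the stored model."""
        return float(np.sum(feat * self.model))

    def update(self, frame):
        candidates = self.search(frame)
        scores = [self.classify(self.extract_features(frame, b))
                  for b in candidates]
        self.box = candidates[int(np.argmax(scores))]
        # Model update mechanism: blend in the newly tracked appearance.
        lr = 0.02
        self.model = (1 - lr) * self.model + lr * self.extract_features(frame, self.box)
        return self.box
```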
2.2. Basic Theory
2.2.1. The Kernel Method
Most real-world models are nonlinear; linear models are a special case of nonlinear models. If the target classifier in the target tracking field employs a nonlinear model, its classification performance can be superior to that of a linear model. There are two approaches to the problem of nonlinear pattern recognition: one is to search directly for nonlinear patterns in the data; the other is to map the original data into a high-dimensional feature space using a mapping function and then use linear algorithms to discover the linear patterns in that high-dimensional space. Research on linear problems is well established, computation on linear problems is efficient, and applying a nonlinear model directly requires a more precise prior understanding of the data.
Because the generalization capacity of directly fitted nonlinear models is limited, the second line of thinking is commonly adopted when dealing with nonlinear models. As demonstrated in Figure 2, after the mapping function moves the original data from the original space to the high-dimensional space, the data in the high-dimensional space become linearly separable, and the regression problem can then be solved with a linear algorithm.
However, applying this method directly raises two issues. First, working in the high-dimensional feature space risks the curse of dimensionality. Second, mapping each sample point to the high-dimensional space and then solving in that space is time-consuming, and the computational efficiency is low.
We can convert the original ridge regression into its dual form, in which the training process only needs the inner products of the samples, and the prediction process only needs inner products with the training samples; neither requires direct knowledge of the mapped samples or the regression variables in the high-dimensional space. As a result, we never need to write down the mapping function explicitly; we can solve the ridge regression problem using only inner products. Figure 3 illustrates the flow of the processing mode in the kernel method: the data processing procedure uses the kernel function to generate the kernel matrix, the kernel matrix is then processed using the pattern analysis (PA) algorithm to obtain a pattern function, and, lastly, the pattern function is used to handle new cases.
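As a concrete illustration of the kernel trick described above, the following sketch solves ridge regression in its dual form with a Gaussian kernel: both training and prediction touch the data only through kernel evaluations (inner products in the implicit feature space). The function names and constants are our own.

```python
# Kernel ridge regression in dual form: the mapping function is never
# computed explicitly; only the kernel matrix is needed.
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between row-sample matrices A and B."""
    d2 = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def train_dual(X, y, lam=1e-2):
    """Dual solution: alpha = (K + lam * I)^-1 y."""
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, Z):
    """f(z) = sum_i alpha_i * k(x_i, z): only inner products needed."""
    return rbf_kernel(Z, X_train) @ alpha

# Example: fit a nonlinear function that a linear model cannot represent.
X = np.linspace(-3, 3, 50)[:, None]
y = np.sin(X).ravel()
alpha = train_dual(X, y)
print(predict(X, alpha, np.array([[0.5]])))   # close to sin(0.5)
```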
2.2.2. The Circulant Matrix
The circulant matrix’s simplicity and other outstanding properties make it frequently employed in a variety of professions. As an example of working process applications, we can find the limits of linear time-invariant systems in signal processing [
3], sparse signal reconstruction in compressed sensing [
4], and deblurring methods in image processing [
5]. Similarly, in machine vision, the circulant matrix may be utilized to describe the training dataset [
17].
This section starts with a one-dimensional vector to introduce the circulant matrix and then expands to two dimensions. For a vector $x = (x_1, x_2, \ldots, x_n)$, we can generate a circulant matrix $C(x)$ whose rows are the cyclic shifts of $x$. Figure 4 shows the cyclic shift of a one-dimensional vector.
We can use a cyclic shift operator to generate a one-dimensional image offset transformation. The permutation matrix $P$ that implements this operator can be represented by

$$P = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}$$

The product $Px$ moves each element of $x$ by one place, creating a small offset due to the cyclic structure. All the offset vectors in the composed set can be obtained by the following formula:

$$\{P^{u} x \mid u = 0, 1, \ldots, n-1\}$$
Because of the cyclic property, the first half of the elements in the set can be regarded as shifts of $x$ in the positive direction, whereas the latter half can be regarded as shifts in the negative direction. We then construct a circulant matrix using this cyclic offset approach. After locating the target, shifting by varied offsets yields all the shifted samples relative to the reference sample. We assign each sample its regression value and then use these samples to train the classifier.
Because the cyclic shift wraps the signal around, the edges of the shifted samples are not smooth, and a cosine window may be employed to reduce this boundary effect (as shown in the figure). As a result, we can compactly represent the sample set using the circulant matrix. The most significant advantage of employing the circulant matrix to build the sample set is that it can be readily diagonalized by the discrete Fourier transform, which dramatically accelerates model training and prediction.
Figure 5 shows how cyclic shifting generates training samples (the left picture is a two-dimensional cyclic sample generated by shifting, and the right picture is the corresponding sample).
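The following small NumPy check illustrates both points: the cyclic shifts of a vector form the (virtual) sample set, and products with the resulting circulant matrix reduce to element-wise operations in the frequency domain because circulant matrices are diagonalized by the discrete Fourier transform.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)

# The sample set: row u is x cyclically shifted by u places.
C = np.stack([np.roll(x, u) for u in range(n)])

# C(x)^T v is a circular convolution, so it can be computed with FFTs
# in O(n log n) instead of the O(n^2) matrix product.
v = np.array([0.5, -1.0, 2.0, 0.25])
direct = C.T @ v
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(v)))
print(np.allclose(direct, via_fft))   # True
```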
2.3. Tracking Based on Correlation Filtering and Its Improvements
Initially, researchers used correlation filters in signal processing, but they later extended their use to pattern matching. Image target tracking involves a significant amount of computational work, as it requires identifying, among numerous candidate locations, the one with the highest likelihood of containing the target. Early target tracking algorithms often lacked real-time performance and robust tracking capability; in particular, their computational complexity restricts the use of such target tracking methods in embedded systems.
Kernel correlation filter tracking begins with ridge regression based on the kernel technique, using the circulant matrix as both the training sample set and the test sample set. Owing to the circulant matrix's properties, both the model training and the sample detection process can be carried out efficiently with the Fourier transform in the frequency domain, ensuring a balance between computational efficiency and tracking performance. Note that plain kernel correlation filter tracking only estimates the target's position; it cannot estimate the target's scale.
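As a hedged sketch of this frequency-domain training and detection, the following code implements the standard KCF closed-form solution with a Gaussian kernel; the function names and the parameter values (sigma, lam) are assumptions for illustration.

```python
# KCF training and detection in the frequency domain with a Gaussian
# kernel. x: training patch, y: Gaussian-shaped regression target,
# z: test patch (all 2D arrays of equal size).
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    """Kernel correlation of all cyclic shifts, computed via the FFT."""
    c = np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z))).real
    d2 = (x**2).sum() + (z**2).sum() - 2.0 * c
    return np.exp(-np.maximum(d2, 0) / (sigma**2 * x.size))

def train(x, y, lam=1e-4):
    """Ridge regression over all shifts: alpha_hat = y_hat / (k_hat + lam)."""
    k = gaussian_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_hat, x, z):
    """Response map over all translations of the test patch z."""
    k = gaussian_correlation(z, x)
    return np.fft.ifft2(alpha_hat * np.fft.fft2(k)).real
```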
2.3.1. Correlation Filtering
Given a test image $I$ and a reference image $T$, the correlation operation between them is defined as

$$(I \otimes T)(x, y) = \sum_{i}\sum_{j} I(x+i,\, y+j)\; T(i, j)$$

where $\otimes$ denotes the correlation operator.
The next section examines how to use correlation filtering for template matching. The aim of template matching is to find, in the test image $I$, the point that best matches the reference image $T$. To discover the optimal matching point, one must first define a way to quantify the mismatch between a sub-image of $I$ and the reference image $T$, and then iterate over all positions $(x, y)$.
Using the Euclidean distance between the sub-image of $I$ at position $(x, y)$ and the reference image $T$ as the degree of mismatch,

$$D(x, y) = \sum_{i}\sum_{j} \left[ I(x+i,\, y+j) - T(i, j) \right]^{2} \tag{6}$$

template matching consists of finding the position $(x, y)$ at which $D(x, y)$ is smallest. Expanding Formula (6) gives

$$D(x, y) = \sum_{i,j} T(i, j)^{2} + \sum_{i,j} I(x+i,\, y+j)^{2} - 2\sum_{i,j} I(x+i,\, y+j)\; T(i, j) \tag{7}$$
In Equation (7), the first term depends only on the reference picture and is independent of position. The second term is the sum of the squares of the gray values of the pixels in the filter's coverage region. The final term is minus twice the correlation between the test picture and the template. If there is no significant change in the gray level across the test image, we can discard the second term, so that $D(x, y)$ depends only on the third term: the higher the correlation value between the test and reference images, the smaller the Euclidean distance. In many cases, the assumption that the gray level of the test image does not fluctuate significantly holds.
According to the preceding analysis, correlation operations can readily fulfill the template matching task. If the template is rotated by 180°, the correlation and convolution operations become the same; therefore, the convolution theorem can be utilized to compute the correlation, and switching to the frequency domain significantly accelerates the calculations.
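The following small NumPy experiment verifies this numerically: a template planted in an image is located by computing the correlation in the frequency domain (multiplying by the conjugate spectrum of the zero-padded template, which corresponds to convolving with the 180°-rotated template). The image sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.random((64, 64)) * 0.2      # low-contrast background
T = rng.random((10, 10))            # distinctive template
I[20:30, 40:50] = T                 # plant the template in the image

# Zero-pad T to the image size, then correlate via the FFT.
Tp = np.zeros_like(I)
Tp[:T.shape[0], :T.shape[1]] = T
corr = np.fft.ifft2(np.fft.fft2(I) * np.conj(np.fft.fft2(Tp))).real

peak = np.unravel_index(np.argmax(corr), corr.shape)
print(peak)                         # (20, 40): the planted location
```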
2.3.2. Tracking Based on Correlation Filtering
In basic cases, a simple template matching algorithm can perform target tracking; however, with template-based filtering, the detector's response to the background is also large, and the capacity to distinguish between the target and the background is insufficient. The design philosophy of a correlation filter is to produce the highest possible response at the target position while producing the lowest possible response at background positions. By identifying the peak of the input picture's response through the filter, we can accurately predict the target location.
The convolution theorem is the basic principle of correlation filtering tracking.
The response is computed by element-wise multiplication in the frequency domain and then returned to the spatial domain using an inverse Fourier transform. We extract the features in the current frame and then use a cosine window to remove the discontinuity at the window border. We then apply a Fourier transform to convert the feature values into the frequency domain, followed by a correlation operation with the correlation filter to generate the frequency-domain response. To obtain the spatial response, we use the inverse Fourier transform; the spatial response's maximum point is the predicted target position. Finally, the features at the predicted position and the standard output are combined to generate a training sample for updating the correlation filter. Figure 6 shows the mechanism of a typical correlation filter tracking algorithm flow.
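A minimal sketch of one pass through this flow is given below, assuming a filter already trained in the frequency domain (filter_hat); the windowing, transform, and peak search follow the steps just described, while the names and conventions are ours.

```python
import numpy as np

def cosine_window(h, w):
    """2D Hann window used to suppress the boundary discontinuity."""
    return np.outer(np.hanning(h), np.hanning(w))

def track_step(filter_hat, patch):
    """One tracking step: window, FFT, correlate, inverse FFT, peak."""
    win = cosine_window(*patch.shape)
    z_hat = np.fft.fft2(patch * win)
    response = np.fft.ifft2(np.conj(filter_hat) * z_hat).real
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    # Peaks past the midpoint correspond to negative (wrap-around) shifts.
    h, w = patch.shape
    if dy > h // 2: dy -= h
    if dx > w // 2: dx -= w
    return dy, dx, response.max()
```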
The mechanism for determining the filter is at the heart of correlation filtering. In the kernel correlation filtering tracking algorithm, ridge regression is first formulated; the classifier is then trained with the circulant-matrix sample set; and, finally, the correlation filter is obtained.
2.4. KCF Tracker Size Estimation
Since the kernel correlation filter only outputs the target's position during tracking, it cannot adapt to alterations in the target's scale during the tracking process. As a result, when the target's scale varies significantly throughout the tracking process, the standard KCF fails, leading to a significant reduction in the algorithm's tracking accuracy. The results of the KCF on a video collection with different target scales are shown in the middle two images of Figure 7. During the algorithm test, we discovered that when the target's scale changes significantly from Figure 7a to Figure 7b, the system cannot adapt to the change in the target's size, and the tracking drifts.
On the one hand, many existing detection-based tracking algorithms only estimate a target's translational location without accounting for changes in scale; on the other hand, some current approaches do use scale estimation, but their tracking rates are quite poor. Consequently, since this research is based on the KCF, it is important to add scale estimation to it: the KCF framework's excellent computational efficiency ensures rapid tracking, while improving the scale estimation improves the tracking accuracy and stability.
To tackle the problem of changes in target scale within kernel correlation filter tracking and make the scale estimation more accurate, a scale estimation approach based on a scale pyramid search space is provided.
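A minimal sketch of such a pyramid search is shown below: the appearance filter's response is evaluated at a few candidate scales around the current size, and the best-responding scale is kept. The respond callable, the scale factors, and the nearest-neighbor resampling are illustrative assumptions, not the exact procedure of the cited scale filters.

```python
import numpy as np

def best_scale(frame, center, base_size, respond, scales=(0.95, 1.0, 1.05)):
    """respond(patch) -> peak response; patches are resampled to base_size."""
    cy, cx = center
    h0, w0 = base_size
    best = (1.0, -np.inf)
    for s in scales:
        h, w = int(h0 * s), int(w0 * s)
        y0, x0 = max(cy - h // 2, 0), max(cx - w // 2, 0)
        patch = frame[y0:y0 + h, x0:x0 + w]
        # Nearest-neighbor resample back to the filter's fixed size.
        ys = np.arange(h0) * patch.shape[0] // h0
        xs = np.arange(w0) * patch.shape[1] // w0
        score = respond(patch[np.ix_(ys, xs)])
        if score > best[1]:
            best = (s, score)
    return best[0]
```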
2.5. Long-Term Tracking Research Based on Kernelized Correlation Filtering
If the target tracking algorithm is to accomplish long-term tracking, several key concerns must be addressed, including suppression of template drift, long-term occlusion of the target, the target moving out of the field of vision, and violent movement of the target or camera [9]. The suggested approach can handle short-term occlusion of the target but not long-term occlusion, and because the detector does not conduct a global search, the object cannot be identified again once it has moved out of the field of vision [15]. In the tracking learning detection (TLD) framework, tracking and detection happen simultaneously; the tracker generates training samples to train the detector, which reinitializes the tracker when it malfunctions [18]. On the other hand, TLD tracks and performs detection tests concurrently, resulting in a slower algorithm execution rate. We propose an algorithm framework that integrates tracking and detection utilizing the kernel correlation filter (KCF) target tracking method. We analyze the tracking failure process and propose a tracking confidence index for use during the tracking process. We identify the current tracking conditions and modify the tracking approach based on a confidence measure, the peak-to-sidelobe ratio (PSR) [13]. If the PSR falls below a threshold, the tracking is judged to have failed; the KCF model halts its update, and the detection module is switched on to search for the target position in the global scope. Because the detection module must search for and locate targets within the overall picture, a cascade detector is employed to preserve real-time performance. In the first stage, a variance classifier is used to filter out candidates whose variance is less than 50% of the target's variance value.
2.5.1. Tracking Based on Evaluating Confidence in the PSR of the Tracker
Under typical tracking conditions, the detected picture block passes through the KCF filter operation and yields a sharply peaked spatial response, with the peak in the output corresponding to the anticipated target location. As a result, we can utilize the peak's strength to define the tracker's level of confidence. MOSSE [19], an abbreviation for Minimum Output Sum of Squared Error, introduced the PSR for properly assessing the peak intensity. The PSR formula is given by [13,20]:

$$\mathrm{PSR} = \frac{g_{\max} - \mu_{s}}{\sigma_{s}} \tag{10}$$
where $g_{\max}$ is the maximum value of the response map, an 11 × 11 window is selected around the peak corresponding to this maximum value, and $\mu_{s}$ and $\sigma_{s}$ are the mean and standard deviation of the sidelobe, i.e., the response map with this window excluded. From Equation (10), we can see that the PSR measures the peak value relative to the sidelobe values. When the face tracking situation is favorable, the PSR is relatively high, and the location of the response peak represents the new position of the face; conversely, a lower PSR indicates that the face is obscured. Testing the algorithm on a test video set revealed a typical range of PSR values during tracking of 20 to 60. A simple case analysis was performed on the Dudek face video sequence, which contains strong expression changes; Figure 8b plots the distribution of the PSR measured throughout KCF tracking of the sequence. The diagram indicates that the PSR at points A, B, C, D, E, and F was relatively low. According to the results of the analysis (see Figure 8a), these points correspond to the target's occlusion, distortion of appearance, quick movements, rotation, leaving the field of vision, and occlusion again. The PSR therefore reliably reflects the tracker's tracking performance.
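A direct implementation of Equation (10) under the MOSSE-style sidelobe definition might look as follows; the 11 × 11 exclusion window matches the text above, and the small epsilon is our own numerical safeguard.

```python
import numpy as np

def psr(response, exclude=11):
    """Peak-to-sidelobe ratio of a 2D correlation response map."""
    peak = response.max()
    py, px = np.unravel_index(np.argmax(response), response.shape)
    # Sidelobe: everything outside an 11 x 11 window around the peak.
    mask = np.ones_like(response, dtype=bool)
    r = exclude // 2
    mask[max(py - r, 0):py + r + 1, max(px - r, 0):px + r + 1] = False
    side = response[mask]
    return (peak - side.mean()) / (side.std() + 1e-8)
```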
2.5.2. The Variance Classifier
The detector here is a two-stage cascade. The first-stage classifier must be able to swiftly filter out regions that do not contain the target. After this first screening, the more accurate second-stage classifier is applied to the remaining candidate frames, which saves a significant amount of calculation time.
Figure 9 illustrates the blue-labeled regions classified as background areas; intuitively, they contain almost no structure, and their detailed information reveals only slight variance. The yellow-labeled areas are high-variance areas, rich in detail.
The variance in the target frame is calculated efficiently through an integral image, as follows. Let $x_i$ denote the gray value of the $i$-th pixel in the target frame. The variance is defined by Formula (11):

$$\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n} \left( x_i - \mu \right)^{2} \tag{11}$$
where $n$ is the number of pixels in the target frame, and $\mu$ is the average gray level of the pixels in the target frame, which can be determined using the following formula, (12):

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{12}$$
Formula (11) can be transformed into the following formula, (13):

$$\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n} x_i^{2} - \mu^{2} \tag{13}$$
Determining the variance over a target frame of $n$ pixels directly requires $n$ memory query operations, but the integral image technique requires just eight. The integral image $II$ has the same size as the source picture $I$; its value at coordinate $(x, y)$ is the sum of the gray values of all the pixels of the original picture above and to the left of $(x, y)$. Formula (14) provides the definition of the integral image:

$$II(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y') \tag{14}$$
The integral image allows for rapid determination of the total gray value in any rectangular region. For example, for a rectangle $B(x, y, w, h)$, where $(x, y)$ is the coordinate of the top-left corner of the rectangular box and $w$ and $h$ are its width and height, the sum of the values of its inner pixels is

$$S(B) = II(x+w,\, y+h) - II(x+w,\, y) - II(x,\, y+h) + II(x,\, y)$$
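The following sketch shows the whole mechanism in NumPy: two integral images (of $I$ and $I^2$) give the mean and the variance of any box via Formulas (13) and (14) in a constant number of array reads. The padded first row/column is an implementation convenience of ours.

```python
import numpy as np

def integral(img):
    """Integral image with a zero first row/column for easy indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, 0), 1)
    return ii

def box_sum(ii, x, y, w, h):
    """Rectangle sum from four reads, per the formula above."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def box_variance(ii, ii2, x, y, w, h):
    n = w * h
    mu = box_sum(ii, x, y, w, h) / n
    return box_sum(ii2, x, y, w, h) / n - mu**2     # Formula (13)

img = np.random.rand(48, 48)
ii, ii2 = integral(img), integral(img**2)
print(np.isclose(box_variance(ii, ii2, 5, 7, 16, 12),
                 img[7:19, 5:21].var()))            # True
```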
2.6. The Tracking Algorithm Framework Combined with Detection
If the target tracker fails in its tracking, we switch to the target detection mode and perform target detection over the entire picture; target tracking does not resume until the target is identified. The KCF is a detection-based tracking algorithm: the tracker is essentially a detector applied to a small region close to the target location in the previous frame, whereas the detector processes the entire picture. The two are therefore very similar; only the detection area differs.
When the target is moving continuously, utilizing the tracker yields a quicker tracking rate. When the target is blocked for an extended period of time, the target's position can still be recovered by searching the global scope. A cautious target model updating technique can be used to prevent drift of the target model while retaining a description of the target's most recent appearance.
Figure 10 illustrates the framework for the entire algorithm.
TLD tracks and detects at the same time; in this article, by contrast, detection is invoked only when the tracker fails in tracking, and the detector then searches for the target at the global scale. Furthermore, the tracker and the detector employ the same KCF model as the target model, whose goal is to differentiate the target from its surroundings. The current frame's tracker depends on the target location tracked in the previous frame. The process involves extracting histogram of oriented gradients (HOG) features. We extract the detection feature vector $z_t$, apply the sKCF, and calculate the filter's response $g_t$; we can then estimate the PSR based on the response $g_t$.
If the PSR is greater than 20, the tracker's results are deemed reliable, and the model is updated. If the PSR is less than 20, the detection branch is activated: the tracker is assumed to have failed, which has two possible causes. The first is a change in the target's appearance, such as occlusion or distortion; the second is violent movement of the target or the camera, producing rapid changes. In either case, we are unable to locate the target in the vicinity of its position in the previous frame, so we start global target detection to determine the target's position.
If the tracker’s PSR is greater than 20, the algorithm refreshes the model and indicates that the tracking is normal. To begin with, using the first frame of the model retrieved at the maximum response of the tracker P
t, we obtain the feature vector Z’. The target model Z’ is referred to in the first frame, and we apply linear filtering to obtain the maximum response position (P
t*). If this ||P
t − P
t*|| is less than a given threshold, the model can be updated. We obtain I
t and P
t* during KCF model training, and we determine α
t+1 and x
t+1 as follows:
where $\eta$ represents the learning rate, tuned to a small value in practice, and $\alpha'$ and $x'$ are the filter coefficients and appearance template estimated from the current frame.
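In code, this update is a simple linear interpolation between the stored model and the current-frame estimate; a minimal sketch (names are ours) is:

```python
import numpy as np

def update_model(alpha, x_model, alpha_new, x_new, eta=0.015):
    """Blend the previous model with the current-frame estimate."""
    return ((1 - eta) * alpha + eta * alpha_new,
            (1 - eta) * x_model + eta * x_new)
```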
If the tracker’s PSR is not greater than 20, the algorithm activates the detection module. For target detection, the detection module examines all candidates in the supplied picture location. First, the variance in the gray value is less than the initial variance in the goal value thanks to the variance classifier. To speed up the calculation, half of the target frame to be identified is filtered out. This strategy is effective for tracking situations with a basic background because the variance classifier can exclude most basic backdrops, such as blue sky and white clouds. On the passing side, the difference classifier recognizes the sKCF in the target frame, carries out target detection, and responds to all the target boxes using the PSR values.
When the maximum PSR exceeds 20, this signifies the presence of the target in the corresponding box; the model update module then receives the detected location and updates the model accordingly, and the detected target position is used to reinitialize the tracker. If the maximum PSR value is less than 20, the detector did not detect the target: the target has not yet reappeared in the picture, and the target detection module continues to run on the next frame.
This section has provided an algorithm framework that integrates the kernel correlation filtering tracking method with detection. We monitor tracking failures based on confidence: when the PSR falls below the threshold, we judge the tracking to be unsuccessful, halt updates to the KCF model, and switch on the detection module to search for the target position in the global scope. The detection module employs a cascade detection strategy.
First level: We use a variance classifier to filter out candidates whose variance is less than 50% of the target's variance value.
Second level: We use the kernel correlation filter to perform target detection on the samples that pass the first detector. If the maximum confidence exceeds the threshold, the target is visible in the current frame, and the location with the highest confidence among these samples serves as the predicted target location; otherwise, the target remains invisible in the current frame.
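Putting the pieces together, the following high-level sketch shows how the tracking and detection branches alternate around the PSR threshold; track, detect_globally, and update_model are stand-ins for the components described above, not a fixed API.

```python
def process_frame(frame, state, track, detect_globally, update_model,
                  psr_threshold=20.0):
    """One frame of the combined tracking-and-detection framework."""
    if state.get("tracking", False):
        box, psr_value = track(frame, state)          # local KCF search
        if psr_value > psr_threshold:
            update_model(frame, box, state)           # reliable: update model
            return box
        state["tracking"] = False                     # tracker has failed
    # Levels 1 and 2: global cascade detection (variance filter, then
    # kernel correlation filtering of the surviving boxes, scored by PSR).
    box, psr_value = detect_globally(frame, state)
    if box is not None and psr_value > psr_threshold:
        state["tracking"] = True                      # reinitialize tracker
        update_model(frame, box, state)
        return box
    return None                                       # target not visible yet
```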