Section 3.2 observes the difference between using the CPU or the GPU to process exhaustive NCC. Next, our NCC implementation is validated and compared against other trackers using the results reported by Smeulders et al. [3] (Section 3.3). Section 3.4 then describes the experiment performed to find the optimal number of generations for HSA. The experiment in Section 3.5 determines which of the proposed implementations is the best alternative in terms of accuracy and speed. Additionally, Section 3.6 uses artificial noise to test the robustness of the best alternative. Finally, Section 3.7 compares the performance of our best alternative against other recent trackers.
3.1. Polynomial Time Complexity
The time complexity of NCC is obtained by observing Equations (11), (12) and (13). The time it takes to compute the term given by Equation (13) is O(mn), where m × n is the size of the template. The next term is also obtained in O(mn) time. Finally, the remaining quantity is computed using three different summations, each of them requiring O(mn) time. To conclude, the time complexity of computing NCC for a single point is O(mn).
Sequential exhaustive NCC requires computing NCC for every single possible position within the frame. This is the size of the feasible region, wh, for a frame of width w and height h. The total cost is therefore O(mnwh); note that wh is the greatest term because m should be smaller than or equal to w and n should be lesser than or equal to h (as illustrated in Figure 1). This is the time complexity of sequential exhaustive NCC (Table 2, row 1).
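The O(mnwh) behaviour of sequential exhaustive NCC can be illustrated with a minimal sketch (this is not the paper's implementation; the grayscale simplification and all names are our own):

```python
import numpy as np

def ncc(region, template):
    """Normalized cross-correlation between a template and an
    equally sized image region: O(mn) work for an m x n template."""
    r = region - region.mean()
    t = template - template.mean()
    denom = np.sqrt((r * r).sum() * (t * t).sum())
    return 0.0 if denom == 0 else float((r * t).sum() / denom)

def exhaustive_ncc(frame, template):
    """Evaluate NCC at every feasible position: O(mnwh) overall."""
    n, m = template.shape          # n rows (n <= h), m columns (m <= w)
    h, w = frame.shape
    best, best_pos = -2.0, (0, 0)
    for y in range(h - n + 1):     # roughly wh candidate positions
        for x in range(w - m + 1):
            score = ncc(frame[y:y + n, x:x + m], template)
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos, best
```

A template cut directly out of the frame is recovered at its original position with a score close to 1.0, since each candidate position costs O(mn) and all wh positions are visited.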
In exhaustive parallel NCC, the GPU performs the computation of several NCC evaluations simultaneously. Let c be the maximum number of simultaneous work-items of the GPU; then the time complexity of parallel exhaustive NCC is O(mnwh/c) (Table 2, row 2).
The time complexity of HSA can be obtained by observing the pseudocode in Figure 1, Figure 2 and Figure 3, together with Equations (1) and (2). Please note that this analysis is specific to the case where the fitness function is NCC; if the fitness function changes, then so does the time complexity. The first line of the Exploration pseudocode is a constrained random generation of individuals, taking time linear in the population size. The second line of the Exploration pseudocode evaluates the fitness, at an O(mn) cost, for every one of the individuals. Next, lines 4 and 5 are executed per generation; line 6 is omitted (no sharing). Exploration line 7 performs sorting, which is log-linear in the population size. In total, the cost of lines 4–7 is dominated by line 5, which is significantly slower to compute than the rest. This per-generation cost is repeated several times (line 3); this number is called the maximum number of generations, or g. The resulting total is polynomially greater than the cost of lines 1–2, so the generational loop dominates the exploration phase. The recruitment phase requires the summation of the fitness of all the individuals (Equation (1)), and then Equation (2) has to be computed for each of those individuals, producing a time complexity linear in the population size. The harvest phase is very similar to the exploration phase; the key difference is the size of the evaluated populations. In polynomial terms, the greatest time complexity among the exploration, recruitment, and harvest phases is that of the harvest phase, given that its evaluated population should be at least as large as that of the exploration phase. Taking those considerations into account, the time complexity of sequential HSA NCC (Table 2, row 3) is proportional to the number of generations times the harvest population size times the O(mn) cost of each fitness evaluation.
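As a rough illustration of where the generations × population × O(mn) cost comes from, the generational structure of the search can be sketched as follows (a simplified toy, not the pseudocode of Figures 1–3; the population size, the recruitment rule, and the one-dimensional fitness are all assumptions):

```python
import random

def hsa_search(fitness, bounds, pop_size=20, generations=2, seed=0):
    """Simplified honeybee-style search: each generation evaluates the
    whole population, so total fitness calls = generations * pop_size."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Constrained random generation of the initial individuals (line 1).
    bees = [rng.uniform(lo, hi) for _ in range(pop_size)]
    evaluations = 0
    for _ in range(generations):                      # generational loop (line 3)
        scored = []
        for b in bees:
            scored.append((fitness(b), b))            # dominant cost: one O(mn)
            evaluations += 1                          # fitness call per individual
        scored.sort(key=lambda s: s[0], reverse=True)  # sorting step (line 7)
        # Recruitment in spirit: resample new bees around the best sites.
        elite = [b for _, b in scored[: pop_size // 4]]
        bees = [rng.gauss(rng.choice(elite), 0.1 * (hi - lo))
                for _ in range(pop_size)]
    best_score, best = max(scored)                    # best of the last generation
    return best, best_score, evaluations
```

Counting `evaluations` makes the cost model visible: with a population of 20 and 2 generations the toy performs exactly 40 fitness evaluations, independently of the frame size.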
The parallelization of HSA is described in Section 2.5.4. The evaluation of the fitness function (NCC) for each individual is performed by q work-items simultaneously. Please note that all individuals are processed in parallel; the number of individuals that can be handled at once depends on the capacity of the GPU. This step yields a time complexity of O(mn/q) to obtain the fitness of all individuals in the population once. Still, it is not possible to process several generations at the same time, since each generation depends on the results of the previous one. Thus, the overall time complexity of parallel HSA NCC (Table 2, row 4) is O(gmn/q).
Since O(mnwh/c) < O(mnwh), and the parallel HSA cost is likewise divided among the GPU work-items, the parallelization is expected to successfully accelerate computations in both cases. However, the comparison between using exhaustive search and using HSA is non-intuitive. Still, it can be reduced to the simple idea that HSA will be faster if the size of the image frame (wh) is significantly greater than the number of individuals in the population multiplied by the number of generations g. In summary, HSA is expected to be faster than exhaustive search as long as the total number of fitness evaluations it performs is much smaller than wh, which should be the case in most, if not all, of the tested videos. In the following sections, the different tracker proposals are tested in terms of speed and accuracy, to contrast the time complexity derivations of each proposal and to confirm the advantages of parallelizing HSA for object tracking.
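To make the condition concrete, the following sketch compares the number of NCC evaluations per frame under purely illustrative values (a 640 × 480 frame, a population of 20 individuals, and g = 2; none of these figures are taken from the experiments):

```python
# Hypothetical values for illustration only.
w, h = 640, 480      # frame width and height
pop, g = 20, 2       # assumed HSA population size and number of generations

exhaustive_evals = w * h   # roughly one NCC evaluation per feasible position
hsa_evals = pop * g        # one evaluation per individual per generation

print(exhaustive_evals)                # 307200
print(hsa_evals)                       # 40
print(exhaustive_evals // hsa_evals)   # 7680x fewer evaluations for HSA
```

Under these assumptions the product of population size and generations is several orders of magnitude below wh, which is the regime where HSA is expected to win.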
3.2. Sequential Exhaustive NCC vs. Parallel Exhaustive NCC
Our theoretical analysis supports the assumption that using a GPU accelerates the computation of exhaustive NCC compared with using the CPU. The experiments show that the assumption is correct: the GPU implementation of exhaustive NCC takes fewer seconds per frame on average.
We considered it impractical to fully process all videos of the ALOV dataset using the CPU and exhaustive search; in order to estimate the average time per frame, we tested only the first two frames of each video. We made this decision because the CPU required 1 day and 18 h to process the first 5 videos of the ALOV dataset. To check that the accuracy of the results is not affected, a few random videos from the ALOV dataset were selected and used for testing. According to the results (Table 3), the GPU computes NCC for a full frame around 170 times faster than the CPU. A small difference was observed in the average F-score, equivalent to 12.60% of the overall standard deviation (not relevant).
As expected, the parallelization of exhaustive NCC successfully accelerated the execution. This result supports our previous time complexity analysis: O(mnwh/c) < O(mnwh).
3.3. Reported NCC and Other Trackers versus This Work’s NCC
This experiment was performed to show that the proposed NCC implementation behaves as reported by other authors. We used all 314 videos of the ALOV dataset in this experiment. The tests labeled OURS use the described GPU implementation of NCC, and the tests labeled CTRL are the ones reported for NCC [3,44].
As illustrated in Figure 7, the difference in average F-score between OURS and CTRL is small. The distance between the averages is 9.06% of the standard deviation of all the tests, which is not significant. These results validate that the proposed GPU implementation of NCC performs as well as expected in terms of F-score.
Figure 8 shows survival curves for OURS, CTRL, and 18 trackers reported by Puddu [44]. Survival curves are obtained by sorting the scores for every video and show a general overview of video tracker performance; the closer the curve is to the perfect score (1.0), the better. The other displayed trackers are: Struck (STR) [45], Foreground-Background Tracker (FBT) [46], Tracking Learning and Detection (TLD) [47], Tracking by Sampling Trackers (TST) [48], Lucas-Kanade Tracker (LKT) [49], Kalman Appearance Tracker (KAT) [50], Fragments-based Robust Tracking (FRT) [51], Mean Shift Tracking (MST) [52], Locally Orderless Tracking (LOT) [53], Incremental Visual Tracking (IVT) [54], Tracking on the Affine Group (TAG) [55], Tracking by Monte Carlo (TMC) [56], Adaptive Coupled-layer Tracking (ACT) [57], L1-minimization Tracker (L1T) [58], L1 Tracker with Occlusion detection (L1O) [59], Hough-Based Tracking (HBT) [60], Super Pixel tracking (SPT) [61], and Multiple Instance learning Tracking (MIT) [62].
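A survival curve as described above reduces to a short sketch: sort a tracker's per-video F-scores in descending order and plot score against video rank (the function and variable names here are our own, not from [44]):

```python
def survival_curve(scores):
    """Sort per-video F-scores from best to worst; the value at index i
    is the score achieved by the (i+1)-th best video."""
    return sorted(scores, reverse=True)

# Example: F-scores of one hypothetical tracker over five videos.
f_scores = [0.62, 0.91, 0.45, 0.78, 0.30]
print(survival_curve(f_scores))  # [0.91, 0.78, 0.62, 0.45, 0.3]
```

Curves that stay close to 1.0 for more videos indicate a tracker that performs well on a larger fraction of the dataset.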
According to this experiment, our implementation of NCC is validated and can be considered a tracker of acceptable performance in terms of accuracy.
3.4. Adjusting the Honeybee Search Algorithm
This experiment was designed to determine whether the time per frame is reduced when the number of generations (explained in Section 2.2) is lower, and what the optimal number of generations is. This experiment considers all 314 videos of the ALOV dataset. All the tests use NCC, the GPU, and HSA; only the number of generations varies. The other configuration parameters of HSA reuse the optimal values proposed by Olague and Puente [14,15,16] after numerous experiments.
Figure 9a shows that the average F-score is stable between 2 and 10 generations. The difference in score between 2 and 3 generations is 3.50% of the overall standard deviation, and 1.95% between 3 and 4 generations. The leap between 4 and 10 generations is only 4.02%. On the other hand, there is a notable difference between 1 and 2 generations, measured as 66.79% of the overall standard deviation.
Figure 9b shows that as the number of generations increases, the average time per frame increases. From 1 to 2 generations, the average time per frame increases by 8.65% of the overall standard deviation. From 2 to 3 generations, the increment is 4.50%. When changing from 3 to 4 generations, the time per frame increases by 8.48%. This pattern is caused by the increment in the number of generations of the harvest phase, which is half the number of generations used for the exploration phase. Finally, an increment from 4 to 10 generations causes the average time per frame to increase by 38.82% of the overall standard deviation, showing that the trend continues.
Figure 10a shows a three-dimensional plot whose axes are the average F-score, the average time per frame, and the number of generations; the color of the bars indicates a difference in the average F-score. This helps to visualize the trend observed as the number of generations of HSA varies. After two generations, the average F-score stabilizes, while the average time per frame keeps increasing. We inferred from that information that two generations provide the right balance between time and score. Moreover, the Score-Time Efficiency shows without a doubt that two generations is the best option (Figure 10b).
As expected, the average time increases with each new generation. This aspect supports the time complexity analysis of parallel HSA with NCC as fitness function, O(gmn/q), where the number of generations g has a significant influence. This result shows the practical utility of the Score-Time Efficiency, using it to determine the optimal number of generations. Also, the observation that the F-score exhibits asymptotic behaviour starting at two generations ensures that selecting that number does not have a negative effect on accuracy.
Figure 11 shows an example of the results obtained using two generations: the location of the object (a ball) is recovered with a reasonable degree of accuracy even when the size and illumination of the object change.
3.5. GPU Exhaustive vs. GPU with Honeybee Search Algorithm
This experiment also uses all 314 videos of the ALOV dataset, and all the tests use NCC and the GPU. The tests labeled BEE use parallel HSA configured to run for two generations, and the ones labeled NO BEE use exhaustive search. Keep in mind that the full capacity of the GPU is not used in BEE because of its synchronization needs.
Notice in Figure 12 that the average difference in F-score between BEE and NO BEE is relatively small. This difference is 1.54% of the overall standard deviation, an irrelevant difference. This validates that BEE performs as well as the exhaustive NCC tracker.
The difference in average time per frame, displayed in Figure 13a, is 42.22% of the overall standard deviation. However, there is a remarkable difference in the standard deviation of the two groups. In other words, BEE provides a lower and more stable time per frame (about 7.58 times steadier). This result validates that using parallel HSA provides an improvement in time per frame over a plain GPU-accelerated NCC.
Figure 13b shows how using the GPU accelerates the implementation that uses HSA and how it compares against exhaustive search. This outcome shows that the polynomial time complexity of GPU BEE is lower than that of GPU NO BEE, which in turn is lower than that of CPU NO BEE, supporting the theoretical analysis O(gmn/q) < O(mnwh/c) < O(mnwh).
Finally, Figure 14 shows that using parallel HSA (GPU BEE) provides a greater Score-Time Efficiency, meaning it is the best alternative. This simple observation is an intuitive condensation of the previous analysis of both the F-score and the average time per frame.
3.6. Tests with Gaussian Noise
This experiment uses 14 videos of the ALOV dataset, one for each category. All the tests use NCC, HSA with two generations, and the GPU. We apply Gaussian noise to test the robustness of the proposal to external perturbations. Gaussian noise requires the mean and standard deviation as configuration parameters to generate random numbers with that distribution [63]; these random values are then added to the values of each color channel.
Figure 15 shows an example frame without noise in (a), labeled as noise level 0; the same frame with Gaussian noise at noise level 50 in (b); and the same frame with Gaussian noise at noise level 100 in (c).
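The noise procedure described above can be sketched as follows, assuming NumPy and zero-mean noise whose standard deviation equals the noise level (the exact parameterization used in the experiments may differ):

```python
import numpy as np

def add_gaussian_noise(frame, noise_level, seed=None):
    """Add zero-mean Gaussian noise to every color channel of an 8-bit
    frame and clip the result back to the valid [0, 255] range."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=noise_level, size=frame.shape)
    noisy = frame.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

With `noise_level=0` the frame is returned unchanged, matching the definition of noise level 0 above.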
Figure 16a shows that there is a difference in F-score between videos with different levels of noise. In other words, the underlying object tracker (NCC) shows a degree of susceptibility to noisy image frames. The difference between noise level 0 and level 50 is 34.88% of the overall standard deviation, and the difference between levels 50 and 100 is 10.38% of the total standard deviation. On the other hand, Figure 16b shows that noise has no noticeable effect on the average time per frame.
A conclusion drawn from the results obtained with the combination of NCC, HSA with two generations, and the GPU is that Gaussian noise affects the accuracy. Nevertheless, the difference in F-score observed between no noise (level 0) and some noise (levels 50 and 100) is in the range of 0.11 to 0.14, which is small considering that the full range of the F-score is from 0 to 1. On the other hand, the noise has no effect on speed.
3.7. Our Best Tracker Proposal versus Two Recent Trackers
Two recent trackers were selected for contrast: Struck [45] and SiamMask [10]. Struck remains relevant because it is a purely online tracking proposal that requires no training and is still able to participate in the latest editions of the VOT challenge [64]. On the other hand, SiamMask is considered among the best tracker proposals that participated in the latest VOT challenge (top 20) in terms of accuracy, and several techniques that take inspiration from SiamMask are positioned in the top tier. SiamMask relies on pre-trained models, as it is based on convolutional neural networks. This section compares the performance of these two state-of-the-art trackers against our best proposal (GPU BEE) in terms of speed and accuracy, using 14 videos of the ALOV dataset, one per category.
The analysis of Struck and SiamMask revealed that both trackers preprocess the image frames before actually performing tracking; in the case of SiamMask, this is useful to reduce the latency caused by CPU-to-GPU communication. Taking that into account, a simple reduction of the frame size is performed in our proposal, scaling the picture accordingly whenever the width or the height of the frame is greater than 300 pixels. The effect on accuracy caused by the reduction of the frames can be considered negligible compared to the full-resolution frame, while a relevant reduction of the average time per frame is obtained.
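The frame-size reduction can be sketched as follows; the 300-pixel threshold comes from the text, while the nearest-neighbor resampling and all names are our simplifications:

```python
import numpy as np

MAX_SIDE = 300  # threshold from the text: scale when width or height exceeds it

def downscale_frame(frame, max_side=MAX_SIDE):
    """Shrink the frame, preserving aspect ratio, so that neither side
    exceeds max_side. Nearest-neighbor sampling keeps the sketch simple."""
    h, w = frame.shape[:2]
    if max(h, w) <= max_side:
        return frame                      # small frames pass through untouched
    scale = max_side / max(h, w)
    new_h, new_w = int(h * scale), int(w * scale)
    rows = (np.arange(new_h) / scale).astype(int)   # source row indices
    cols = (np.arange(new_w) / scale).astype(int)   # source column indices
    return frame[rows][:, cols]
```

For example, a 800 × 600 frame is reduced to 300 × 225, while a 250 × 200 frame is left at full resolution.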
Figure 17a shows that the accuracy of GPU BEE is inferior to that of SiamMask and Struck. This is not surprising, since GPU BEE uses NCC as a tracker, and different sources have found it to be surpassed in accuracy by Struck. However, it is interesting to note that Struck showed better accuracy than SiamMask in this particular setting, which is not the general case. The conclusion drawn is that Struck shows greater robustness in certain cases: low-resolution videos and videos with background objects that resemble the target. In addition, it is clear that the parallelization of HSA does not improve the accuracy of its fitness function (Figure 13), but a significant acceleration is observed compared to the conventional NCC tracker. As shown in Figure 17b, the average time per frame of our best proposal (GPU BEE) is almost four orders of magnitude smaller than that obtained for CPU NO BEE. Moreover, the survival curve of the average time per frame of GPU BEE is in the same window (from 0.2 to 0.9 s per frame) as those of Struck and SiamMask, a remarkable improvement over CPU NO BEE.