In this section, we first provide an overview of the proposed DCOM in Section 3.1. DCOM, our bounding box estimation method, consists of three parts: the overlap maximization module (Section 3.2), the distribution calibration module (Section 3.3), and the updating strategy (Section 3.4). Finally, we discuss the differences between DCOM and other bounding box estimation methods in Section 3.5.
3.2. Preliminary
Bounding box estimation by overlap maximization [17], which is based on IoU-Net [36], is the baseline of our approach. In the reference branch, given the backbone features of the initial frame and the target bounding box annotation, the method obtains the modulation vector through a convolutional layer, a PrPool layer, and a fully connected layer. In the test branch, the method first extracts the backbone features of the current test frame. Then, given the initial bounding box estimate generated by the localization branch, it employs two convolutional layers and a PrPool layer to obtain the feature representation of the target, where K is the spatial size of this representation. The feature representation is then modulated by the modulation vector through a channel-wise multiplication, generating the target-specific representation for IoU prediction. Finally, the baseline uses a multi-layer perceptron (MLP) to predict the IoU between the bounding box estimate and the target.
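The baseline can be summarized in one expression. The following is a sketch in assumed notation borrowed from the overlap maximization formulation of [17], and the symbols are not necessarily those used elsewhere in this paper: $x_0$ and $B_0$ denote the backbone features and annotated box of the initial frame, $x$ and $B$ the backbone features and box estimate of the test frame, $c(\cdot)$ the reference branch, $z(\cdot)$ the test branch, $g$ the MLP, and $D$ an assumed channel dimension.

```latex
\[
\mathrm{IoU}(B) = g\bigl(c(x_0, B_0) \cdot z(x, B)\bigr),
\qquad
c(x_0, B_0) \in \mathbb{R}^{1 \times 1 \times D},
\quad
z(x, B) \in \mathbb{R}^{K \times K \times D},
\]
```

Here $\cdot$ denotes channel-wise multiplication, so every channel of the $K \times K$ test representation is scaled by the matching coefficient of the modulation vector.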
3.3. Bounding Box Estimation by Distribution Calibration
Since the modulation vector in the baseline depends only on the initial frame, the reference information is biased toward the initial state of the target and becomes less reliable as tracking proceeds, especially when the target undergoes severe variations. The baseline therefore fails to provide accurate bounding box estimates continuously in online tracking. We thus propose to enhance bounding box estimation with distribution calibration, which generates reliable and diverse reference information.
We take inspiration from few-shot learning with distribution calibration [24] and propose a distribution calibration module over the modulation vector. We assume that every dimension of the modulation vector follows a Gaussian distribution, and from Table 1 we observe that targets of similar classes and close sizes usually share similar means and variances. Based on these observations, we can use the statistics of large-scale training datasets with accurate annotations to calibrate the distribution of modulation vectors in online tracking. Reliable and sufficient reference information can then be obtained directly from the calibrated distribution. Note that modern trackers use large-scale tracking datasets only for offline training of the networks and cannot exploit such ground-truth information effectively in online tracking. In contrast, our approach is the first to exploit large-scale tracking datasets in the online stage for more precise bounding box estimation, which alleviates the scarcity of data in online tracking.
Statistics extraction. As observed from Table 1, targets with similar sizes tend to share similar means and variances of the feature representations in the reference information. Therefore, for each video of the training datasets, we divide the frames into multiple clips according to the target sizes. Each clip is defined by a constraint that relates the target size in every frame, computed from the target height h and width w, to the target size in the first selected frame of the clip. To avoid noise, only clips with more than 50 frames, collected from all videos, are selected as base clips.
Then, given the annotations, we obtain the modulation vectors of all frames in the base clips through the reference branch. For each base clip, the mean of every dimension of the vector is calculated by averaging over the frames of the clip, that is, by summing the modulation vectors of the i-th base clip and dividing by its number of frames. The covariance matrix of the modulation vectors from the i-th base clip is computed over the same frames.
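The per-clip statistics can be illustrated with a minimal NumPy sketch. The function name `clip_statistics` and the array layout (one modulation vector per row) are assumptions made for illustration, and `np.cov` normalizes by the number of frames minus one by default, which may differ from the normalization intended here.

```python
import numpy as np

def clip_statistics(vectors):
    """Return (mean, covariance) of the modulation vectors of one base clip.

    vectors: array of shape (n_frames, dim), one modulation vector per frame.
    """
    vectors = np.asarray(vectors, dtype=float)
    # Mean of every dimension over the frames of the clip.
    mean = vectors.mean(axis=0)
    # rowvar=False treats each column as one dimension of the vector.
    # The default (n_frames - 1) normalization is an assumption.
    cov = np.cov(vectors, rowvar=False)
    return mean, cov

# Hypothetical base clip: 60 frames of 4-dimensional modulation vectors.
rng = np.random.default_rng(0)
clip = rng.normal(size=(60, 4))
mean, cov = clip_statistics(clip)
```

Storing only the mean and covariance of each base clip keeps the online stage cheap, since the training frames themselves are never revisited during tracking.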
Distribution calibration via statistics transfer. We obtain the modulation vector of the initial target through the reference branch. Similar to [24], we transform it using Tukey's ladder of powers transformation [37] to make its distribution more Gaussian-like. Then, from the universe of base clips, we select the set of k base clips whose mean vectors are closest in Euclidean distance to the transformed modulation vector, using an operator that selects the top k elements of its input set. Finally, we calibrate the mean and covariance of the distribution using the statistics of the selected base clips.
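The statistics transfer step can be sketched as follows. This is an illustration that follows the calibration rule of the few-shot distribution calibration work cited as [24], which may differ in detail from the rule used here. The names `tukey` and `calibrate`, the power `lam`, and the spread term `alpha` are hypothetical, and the entries of the modulation vector are assumed to be positive so that the power transform is defined.

```python
import numpy as np

def tukey(x, lam=0.5):
    """Tukey's ladder of powers: x**lam for lam != 0, log(x) for lam == 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else np.power(x, lam)

def calibrate(c0, base_means, base_covs, k=2, lam=0.5, alpha=0.0):
    """Calibrate mean and covariance from the k nearest base clips.

    c0:         modulation vector of the initial target, shape (dim,)
    base_means: per-clip means, shape (n_clips, dim)
    base_covs:  per-clip covariances, shape (n_clips, dim, dim)
    """
    c0_t = tukey(c0, lam)
    # Euclidean distance between each base-clip mean and the transformed vector.
    dists = np.linalg.norm(base_means - c0_t, axis=1)
    nearest = np.argsort(dists)[:k]
    # Average the selected means together with the transformed vector.
    mean = (base_means[nearest].sum(axis=0) + c0_t) / (k + 1)
    # Average the selected covariances; alpha (assumed) adds extra spread.
    cov = base_covs[nearest].mean(axis=0) + alpha
    return mean, cov, nearest

# Hypothetical base clips: two close to the target, one far away.
base_means = np.array([[1.0, 1.0], [2.0, 2.0], [10.0, 10.0]])
base_covs = np.stack([np.eye(2)] * 3)
mean, cov, nearest = calibrate(np.array([1.0, 1.0]), base_means, base_covs, k=2)
```

Including the transformed vector in the averaged mean keeps the calibrated distribution anchored to the actual target, while the borrowed covariances supply the variability that a single frame cannot provide.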
Bounding box estimation. To provide sufficient and reliable reference information for precise bounding box estimation, we use the calibrated mean and covariance to generate a set of extra modulation vectors by sampling from the calibrated Gaussian distribution, where M is the total number of sampled modulation vectors. For the current test frame, given the coarse target location from the localization branch and the target size from the previous frame, we first obtain a rough bounding box and then generate N candidate bounding boxes by adding Gaussian noise to the rough bounding box. Then, the predicted IoUs are obtained by the test branch and the modulation vectors. For simplicity, we collect these predictions into a set of predicted IoUs. Note that the modulation vector of the initial target always contributes to the prediction, since it contains the ground-truth information of the target. The refined bounding boxes are estimated by maximizing each predicted IoU in this set using five gradient ascent iterations with a step length of 1. Finally, based on the refined bounding boxes, we obtain the bounding box estimate by taking the mean of the three bounding boxes with the highest predicted IoU.
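Two of the steps above, sampling extra modulation vectors and averaging the best boxes, can be sketched in NumPy. The function names, the `(x, y, w, h)` box layout, and the example values are assumptions made for illustration. The gradient ascent refinement is omitted because it requires the trained IoU predictor.

```python
import numpy as np

def sample_modulation_vectors(mean, cov, m, seed=0):
    """Draw m extra modulation vectors from the calibrated Gaussian."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=m)

def fuse_top_boxes(boxes, ious, top=3):
    """Average the `top` boxes with the highest predicted IoU.

    boxes: array of shape (n_boxes, 4), assumed layout (x, y, w, h)
    ious:  array of shape (n_boxes,), one predicted IoU per refined box
    """
    boxes = np.asarray(boxes, dtype=float)
    ious = np.asarray(ious, dtype=float)
    # Indices of the highest predicted IoUs, best first.
    best = np.argsort(ious)[::-1][:top]
    return boxes[best].mean(axis=0)

# Hypothetical refined boxes and their predicted IoUs.
boxes = [[0, 0, 10, 10], [2, 2, 10, 10], [4, 4, 10, 10], [100, 100, 10, 10]]
ious = [0.9, 0.8, 0.7, 0.1]
final_box = fuse_top_boxes(boxes, ious)
samples = sample_modulation_vectors(np.zeros(3), np.eye(3), m=5)
```

Averaging only the highest-scoring boxes discards outliers such as the last candidate in the example, which would otherwise pull the estimate away from the target.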
3.4. Updating Strategy for Reference Information
As tracking proceeds, the reference information from the initial frame becomes less reliable, especially when the target undergoes severe appearance variations such as deformation, which may cause the tracker to drift. Thus, it is necessary to update the reference information during online tracking. Based on the distribution calibration module, we propose a simple yet effective strategy to update the reference information, i.e., the modulation vectors.
To achieve a good balance between efficiency and accuracy, we update the reference information every T frames, where T is the updating interval. Specifically, in the current test frame t, we observe that, although the target can be localized with high confidence by the localization branch, the predicted bounding box is not precise enough when the predicted IoU lies between two thresholds. When the predicted IoU is below the lower threshold, the target can hardly be tracked successfully, and we reinitialize the reference information with that of the initial target. When it is above the upper threshold, the modulation vector is kept unchanged for efficiency. When it lies between the two thresholds, based on the estimated bounding box, we obtain the new modulation vector of the current frame via the reference branch. Then, we perform distribution calibration with respect to this new vector by substituting it for the modulation vector of the initial target in Equations (5) and (6). Given the calibrated mean and covariance of the new reference information, we update the modulation vectors by sampling from the new Gaussian distribution.
As such, compared with the baseline, we obtain more reliable reference information for robust bounding box estimation throughout the whole tracking process. Note that, if the modulation vector is updated without distribution calibration, i.e., the set of modulation vectors only contains the new modulation vector of the current frame, tracking performance will not improve, since a modulation vector computed from the estimated bounding box rather than from a ground-truth annotation is less reliable. We present the main steps of the updating strategy in Algorithm 1.
Algorithm 1: Updating strategy for reference information
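The decision logic of the updating strategy described in this subsection can be sketched as a small function. This is an illustration only. It assumes that the quantity compared with the thresholds is the predicted IoU of the current estimate, and the function name, the action labels, the threshold values, and the handling of values exactly at a threshold are hypothetical.

```python
def update_action(frame_index, predicted_iou, interval, low, high):
    """Decide how to treat the modulation vectors in the current frame.

    Returns one of:
      "keep"        - leave the modulation vectors unchanged
      "reset"       - fall back to the reference information of the initial target
      "recalibrate" - compute a new modulation vector from the estimated box,
                      calibrate its distribution, and resample
    """
    # Only consider an update every `interval` frames.
    if frame_index % interval != 0:
        return "keep"
    # Very low predicted IoU: the target can hardly be tracked.
    if predicted_iou < low:
        return "reset"
    # High predicted IoU: the current reference information is good enough.
    if predicted_iou > high:
        return "keep"
    # In between: localized, but the box is not precise enough.
    return "recalibrate"

# Hypothetical setting: update every 10 frames with thresholds 0.3 and 0.8.
actions = [update_action(t, iou, 10, 0.3, 0.8)
           for t, iou in [(10, 0.2), (20, 0.5), (30, 0.9), (35, 0.5)]]
```

Separating the decision from the calibration itself makes the cost visible: only the "recalibrate" branch runs the reference branch and resamples the modulation vectors.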
3.5. Discussion
Comparison with direct bounding box regression. DCOM differs from direct bounding box regression (BBR) methods in two respects. First, BBR methods obtain the estimated box mainly through a regression network or module that is trained only offline or on the first frame, while DCOM obtains the bounding box via an overlap maximization module and a distribution calibration module, which benefit from the training datasets in both the offline and online processes. Second, most BBR methods are tightly coupled with a Siamese-based pipeline, which lacks online discriminative localization, while DCOM is lightweight and can easily be combined with modern discriminative localization methods for robust tracking.
Comparison with bounding box estimation by overlap maximization. Although DCOM shares the overlap maximization module of ATOM, the two methods differ in how they generate and update reference information. First, ATOM generates the reference information only from the first frame, which biases bounding box estimation in online tracking. Second, this reference information is fixed and cannot be updated effectively, since new reference information derived only from the tracking results is less reliable, and its error accumulates. DCOM improves on ATOM in two ways. On the one hand, we use large-scale tracking datasets, which previous methods can only use in offline training, to provide extra reference information via distribution calibration. On the other hand, DCOM enables a simple yet effective strategy that updates the reference information according to the updated distribution in addition to the tracking results. Thus, compared with ATOM, the reference information in DCOM is more sufficient and less biased for precise bounding box estimation.