Article

LocaLock: Enhancing Multi-Object Tracking in Satellite Videos via Local Feature Matching

1 Key Laboratory of Target Cognition and Application Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(3), 371; https://doi.org/10.3390/rs17030371
Submission received: 2 December 2024 / Revised: 13 January 2025 / Accepted: 21 January 2025 / Published: 22 January 2025

Abstract

Multi-object tracking (MOT) in satellite videos is a challenging task due to the small size and blurry features of objects, which often lead to intermittent detection and tracking instability. Many existing detection and tracking models struggle with these issues, as they are not designed to handle the unique characteristics of satellite videos. To address these challenges, we propose LocaLock, a joint detection and tracking framework for MOT that incorporates feature matching concepts from single object tracking (SOT) to enhance tracking stability and reduce intermittent tracking results. Specifically, LocaLock utilizes an anchor-free detection backbone for efficiency and employs a local cost volume (LCV) module to perform precise feature matching in the local area. This provides valuable object priors to the detection head, enabling the model to “lock” onto objects with greater accuracy and mitigate the instability associated with small object detection. The local computation within the LCV module also ensures low computational complexity and memory usage. Furthermore, LocaLock incorporates a novel motion flow (MoF) module to accumulate and exploit temporal information, further enhancing feature robustness and consistency across frames. Rigorous evaluations on the VISO dataset demonstrate the superior performance of LocaLock, surpassing existing methods in tracking accuracy and precision within the demanding satellite video analysis domain. Notably, LocaLock achieved state-of-the-art performance on the VISO benchmark, reaching a multi-object tracking accuracy (MOTA) of 62.6 while maintaining a fast running speed.

1. Introduction

Multi-object tracking in satellite videos plays a pivotal role in applications ranging from traffic analysis to disaster management. With the advancement of Earth observation technology and the availability of video satellites like SkySat-1 and Jilin-1, MOT within satellite videos has attracted increasing attention from researchers.
In satellite videos, objects of interest such as ships, aircraft, and vehicles typically fall into stable and well-defined categories. Multi-object tracking systems, which integrate object detection capabilities, are particularly well-suited for these applications, as they enable the detection, identification, and continuous monitoring of multiple objects across frames. MOT approaches can effectively detect and track numerous objects simultaneously, even when the number of targets reaches dozens or even hundreds. Contemporary deep learning-based MOT methods can be broadly categorized into two approaches: detection-based tracking (DBT) and joint detection and tracking (JDT). Both approaches often leverage motion or re-identification (ReID) features to enhance object association, enabling them to track multiple objects simultaneously across complex scenarios.
DBT methods employ a two-stage process where a trained detector identifies objects and an association model assigns consistent identities across frames. The advantage of these methods is that they can always leverage the latest detection models and benefit from improvements in detection accuracy. Conversely, JDT methods predict bounding boxes and association information in an end-to-end manner, integrating detection and tracking into a unified model. The advantage of these methods is that the joint optimization of detection and association tasks can potentially lead to higher performance ceilings, especially in scenarios where object detection heavily relies on temporal motion information.
Despite significant advancements in detection models for natural scenes and high-resolution satellite images, these methods still face challenges when applied to satellite videos. One major reason is that the resolution of current video satellites is still limited, with some objects, such as vehicles, occupying only a few pixels in the video. Additionally, changes in weather conditions and lighting also make object detection in satellite videos more difficult, often resulting in intermittent object detection and leading to instability in tracking. For example, as shown in Figure 1, the cars in satellite videos are very tiny and may appear blurry in certain frames, making it difficult to detect objects in these frames, as illustrated in Frame 3. These interruptions severely disrupt the continuity of object tracking, highlighting the need for more robust and adaptive solutions. Therefore, in such scenarios, we choose to design a JDT-type method, aiming to achieve better tracking performance by leveraging temporal information and multi-task design.
For tracking stability, there is related research in a similar area known as SOT. In contrast to MOT, SOT methods specialize in persistently following a given object, even under conditions of temporary occlusion or disappearance. Unlike MOT models, which usually process high-resolution full-frame images, SOT models typically work with images within a limited search area, aiming to conserve computational resources and reduce potential distractions. Early SOT methods [1,2,3] primarily relied on correlation filters to locate objects within constrained regions. The transition to deep learning-based features was marked by hierarchical convolutional features (HCF) [4], leveraging convolutional neural networks (CNNs) for enhanced feature representation and tracking performance, albeit with a trade-off in computational speed. Subsequent advancements have seen the development of models that integrate continuous convolution operators, such as the continuous convolution operator tracker (C-COT) [5] and the efficient convolution operators (ECO) [6], which optimize tracking accuracy and speed through innovative model update strategies and sample space models. Siamese networks [7] then used template matching to improve tracking accuracy, and other frameworks [8] adapted object detection strategies for real-time SOT, balancing precision and efficiency. With the introduction of transformer-based architectures, models like TransT [9] and STARK [10], among others [11,12], have set new benchmarks by leveraging global attention mechanisms to track objects under challenging conditions, such as occlusions or background clutter.
As research on MOT and SOT continues to evolve, there is a growing interest in exploring how the strengths of SOT can be effectively utilized within MOT frameworks. The primary motivation behind this integration is to leverage the strengths of SOT methods, such as their superior class-agnostic similar-region search capability, to create auxiliary clues for processing complex MOT scenes. For example, BMTC [13] initializes trackers with object detection results and then runs an SOT tracker for each target to simultaneously track multiple objects, achieving robust and stable multi-object tracking. By leveraging the strengths of object detection for initialization and SOT techniques for continuous tracking, this approach has the potential to improve the overall performance and reliability of MOT systems in complex real-world scenarios. However, directly combining SOT and MOT methods in a pipeline manner can lead to increased computational complexity and difficulties in joint optimization.
Existing JDT methods typically rely on either motion information or ReID information for the affinity measure in detection predictions. However, in satellite videos, objects of the same category often appear very similar, which diminishes the utility of ReID information. Consequently, most MOT methods in satellite videos utilize only motion information, as depicted in Figure 2a. To enhance the tracking continuity of MOT, some approaches combine separate MOT and SOT algorithms through a pipeline, as shown in Figure 2b. However, running separate SOT and MOT methods simultaneously can significantly increase computational overhead. In contrast, we propose a unified framework named LocaLock, as illustrated in Figure 2c. It seamlessly integrates the local matching paradigm of SOT into the JDT framework. LocaLock strikes a balance between the strengths of SOT’s local matching and MOT’s global tracking, leveraging their complementary characteristics within a single, cohesive architecture. As a result, LocaLock enables efficient and stable multi-object tracking in complex scenarios while maintaining real-time performance. Our primary contributions can be summarized as follows:
(1)
We introduce LocaLock, a versatile MOT framework that incorporates the local matching paradigm from SOT into the MOT task. By leveraging the local matching concept, LocaLock enhances the robustness and stability of MOT without the need to simultaneously run separate SOT and MOT methods.
(2)
We present the local cost volume module (LCV), which effectively derives objects’ current priors by leveraging appearance-based information, further bolstering the robustness of the tracking process.
(3)
We propose a motion flow module (MoF) designed to accumulate past temporal information for predicting current features, thereby enhancing robustness and temporal consistency in feature representation.

2. Preliminaries

This section reviews the evolution and state-of-the-art in object tracking methodologies, encompassing MOT and approaches introducing SOT into MOT.

2.1. Multi-Object Tracking (MOT)

Multiple-object tracking is a complex challenge that entails two principal tasks: object detection and trajectory association. These tasks can be unified within a single network or addressed separately, leading to two primary paradigms in MOT methodologies: joint detection and tracking and detection-based tracking. Detection-based tracking represents a series of approaches wherein detections from a pre-trained object detector are utilized as inputs for subsequent association algorithms to construct object trajectories across frames. For the trajectory association phase, several strategies are employed. Some methods rely solely on motion information to link detections across frames [14,15,16], while others combine ReID features with motion information for association [17]. Joint detection and tracking integrates detection and tracking within a unified framework. Tracktor [18] exploits the fact that some object detection algorithms [19] include modules for bounding box refinement through regression, directly using the detector’s regressor by initializing the current frame’s bounding box with the target’s bounding box from the previous frame. This approach eliminates the need for a separate data association step and does not require any tracking-specific training. Other methods train a unified network to simultaneously output detection results and information for trajectory association. Some methods, such as CenterTrack [20], predict motion information for association, while others, like FairMOT [21], balance detection with re-identification (ReID) tasks to achieve better tracking performance.
Advancements in satellite technology and video processing have catalyzed significant research into MOT applied to remote sensing. Studies addressing vehicle detection and tracking in satellite videos  [22,23,24,25] have underscored the capability of MOT methodologies to contribute to urban and traffic monitoring from a remote sensing perspective. Similarly, research dedicated to tracking airplanes and ships  [26] highlights the importance of MOT in global transportation surveillance and maritime safety. The challenge of distinguishing small, easily misclassified objects in satellite videos necessitates specialized MOT approaches. Techniques like CKDNet  [24], which employ cross-differential operations to enhance feature differentiation and reduce static noise, represent tailored solutions addressing the unique requirements of satellite imagery. Furthermore, the DSORT-PT method  [27], an adaptation of DeepSORT for aerial imagery, combines motion and appearance detection to tackle the difficulty of identifying low-visibility objects, albeit at the cost of increased computational demand. Emerging research also explores the potential of super-resolving satellite videos in spatial and temporal dimensions  [28] to improve the performance of detection and tracking tasks. As this field evolves, the effective use of temporal semantic information and the development of more sophisticated algorithms will be crucial for improving accuracy and efficiency in object tracking applications relevant to geoscience and environmental monitoring.

2.2. Introducing SOT into MOT

The first kind of method  [29,30,31] utilizes the SOT tracker to measure similarity for association. For instance, the method SOTMOT  [31] extends the CenterNet architecture by adding an SOT branch that runs in parallel with the existing object detection branch. The added SOT branch trains a separate SOT model per target online to distinguish the target from its surrounding targets, thus assigning SOT models novel discrimination capabilities.
The second kind of method  [13,32,33,34,35] utilizes the SOT tracker to generate supplementary candidate bounding boxes for detection results, in order to mitigate poor detection performance when the detector faces disastrous visual degradation, such as motion blur and occlusions. For example, OMC  [32] proposed a re-check SOT network to reload false negatives and repair broken tracklets. BMTC [13] uses the SOT tracker to search and predict locations for each object between consecutive frames, with the detector only used for initializing the trackers’ states. This approach helps mitigate issues such as fragmented tracklets and broken temporal consistency due to missed or false detections by the primary detector in satellite videos. However, although advanced SOT methods can run at high speeds, simultaneously tracking dozens or hundreds of objects in MOT using SOT is still time-consuming.
Despite the potential demonstrated by integrating SOT methods into MOT frameworks, several significant challenges remain. In the first category of methods, SOT only participates in the association of bounding boxes, and the performance of MOT still heavily relies on the detector’s capability. In the second category, SOT can mitigate detection defects under adverse conditions, thereby enhancing MOT robustness, but this approach is typically more time-consuming. Consequently, it may prevent the tracking process from being conducted in real time, as noted in [13]. In this paper, we investigate the critical challenges related to efficiency, scalability, and alignment of tracking objectives when integrating SOT into MOT. Furthermore, we propose the LocaLock method to fully exploit the capabilities of SOT for robust and high-performance MOT.

3. Method

This section provides a detailed explanation of our proposed method. We first give a brief overview of LocaLock’s architecture and the training workflow in Section 3.1. After that, in Sections 3.2–3.5, we delve into the details of the individual components and loss functions within LocaLock.

3.1. Overview

LocaLock is an end-to-end multi-object tracking network based on the YOLOX detector [36], and its overall architecture is illustrated in Figure 3. We adopt ConvNeXt-Tiny [37] as the visual backbone.
As shown in Figure 3, LocaLock has two inputs: the current frame image $I_t$ at time $t$ and the reference frame image $I_{t-k}$ at time $t-k$. $I_t, I_{t-k} \in \mathbb{R}^{H \times W \times 3}$ are independently fed into the same backbone network to extract their respective features, denoted as $F_{cur}$ and $F_{ref}$. These two features are then processed by a deformable attention module [38] to facilitate interaction between the information from the two frames. Post-interaction, the features of the two frames are denoted as $E_{ref}$ and $E_{cur}$. $E_{ref}, E_{cur} \in \mathbb{R}^{H_f \times W_f \times C}$ are employed in two main processes, where $H_f = H/4$ and $W_f = W/4$. On the one hand, in the local cost volume module (LCV), $E_{ref}$ and $E_{cur}$ are used to compute the similarity of local visual embeddings between the two frames, resulting in a cost volume denoted as $V$. The cost volume $V$ is then warped with the predictions in the reference frame to obtain the prior mask in the current frame. On the other hand, $E_{ref}$ is used in the motion flow module to obtain an enhanced current feature, denoted as $\hat{E}$, which is fed to the unified head. The unified head fuses the current frame’s features $\hat{E}$ and the prior mask, unifying information from both aspects to output the final detection results. The association and matching between the resulting boxes are performed using the same method as SORT.
Finally, the output of the model can be represented as a set $P$ composed of $P^{c}_{id} = [x_1, y_1, w_1, h_1]$, where $id$ is the object’s identifier and $c$ is the object’s category.
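To make the data flow described above concrete, the following PyTorch-style sketch outlines one forward step of the pipeline in Figure 3. It is an illustrative outline only: the module names (backbone, deform_attn, local_cost_volume, warp_prior, motion_flow, unified_head) are placeholders and do not reproduce the exact implementation.

def localock_forward(model, I_cur, I_ref, M_ref, D=5):
    # 1. A shared backbone extracts stride-4 features from both frames.
    F_cur = model.backbone(I_cur)                  # [B, C, H/4, W/4]
    F_ref = model.backbone(I_ref)
    # 2. Deformable attention lets the two frames exchange information.
    E_cur, E_ref = model.deform_attn(F_cur, F_ref)
    # 3. Local cost volume: similarity of each position with its D x D
    #    neighborhood in the other frame (see Algorithm 1 below).
    V = model.local_cost_volume(E_ref, E_cur, D)
    # 4. Warp the reference-frame prediction mask with V to obtain the prior.
    C_inst = model.warp_prior(V, M_ref)            # [B, 1, H/4, W/4]
    # 5. The motion flow module accumulates temporal context.
    E_hat = model.motion_flow(E_ref, E_cur)
    # 6. The unified head fuses the prior with E_hat (broadcast sum) and
    #    predicts boxes, classes, and objectness; SORT-style association follows.
    return model.unified_head(E_hat + C_inst)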

3.2. Local Cost Volume Module (LCV)

The local cost volume (LCV) module is designed to perform precise feature matching in the local area. This module leverages appearance-based information to derive object priors, which are then used to enhance the robustness of the tracking process. The conventional approach to calculating the cost volume matrix involves computing the similarity between each feature vector in the reference frame and each feature vector in the current frame, leading to a matrix of dimensions $H_f \times W_f \times H_f \times W_f$. In the context of remote sensing imagery, where $H_f$ and $W_f$ can each exceed several hundred, this method demands extensive computational power and memory. Recognizing that object movement in remote sensing images is predominantly localized, we propose a more efficient, localized calculation method for the cost volume. This method significantly reduces both the computational effort and memory usage.
The computational process of the local cost volume module is illustrated in Figure 4. It involves processing features from both the reference frame, $E_{ref}$, and the current frame, $E_{cur}$. For each vector at position $(x, y)$ in the reference frame features, a $D \times D$ neighborhood is considered. The similarity is computed by multiplying each vector in this neighborhood with the corresponding vector in the current frame. A softmax is then applied to these $D^2$ similarity values to obtain the cost volume value at $(x, y)$. The resulting matrix is of size $H_f \times W_f \times D^2$. The pseudocode for the computational process is shown in Algorithm 1.
Algorithm 1 Pseudocode for the local cost volume module
Input: E_ref, E_cur, D
Output: cost_volume

import torch
import torch.nn.functional as F

def local_cost_volume(E_ref, E_cur, D):
    query_data = E_cur                      # current-frame features  [b, c, h, w]
    key_data = E_ref                        # reference-frame features [b, c, h, w]
    b, c, h, w = key_data.shape
    # [b, c, h, w] -> [b, h, w, c] -> [b*h*w, c, 1]
    query_data = query_data.permute(0, 2, 3, 1).reshape(-1, c, 1)
    # Gather the D x D neighborhood of every reference position.
    key_data = F.unfold(key_data, kernel_size=(D, D), stride=1, padding=(D // 2, D // 2))
    key_data = key_data.view(b, c, D ** 2, h, w).permute(0, 3, 4, 1, 2)
    # [b, h, w, c, D**2] -> [b*h*w, c, D**2]
    key_data = key_data.reshape(-1, c, D ** 2)
    # Scaled dot product: [b*h*w, 1, c] x [b*h*w, c, D**2] -> [b*h*w, 1, D**2]
    correlation = torch.bmm(query_data.permute(0, 2, 1), key_data) / (c ** 0.5)
    cost_volume = F.softmax(correlation, dim=2)
    return cost_volume
In this work, we chose $D = 5$; since $D^2$ is significantly smaller than $H_f \times W_f$, the proposed method drastically reduces both the computational load and the memory required to store the cost volume matrix.
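To illustrate the scale of this saving, the short calculation below compares the number of entries in a global cost volume ($H_f \times W_f \times H_f \times W_f$) with the local one ($H_f \times W_f \times D^2$) for the feature-map size used in our training setting. This is a rough sketch for intuition, not a measurement of actual memory usage.

# Element-count comparison for Hf = Wf = 128 (512 x 512 input, stride 4) and D = 5.
Hf = Wf = 128
D = 5
global_entries = (Hf * Wf) ** 2        # every position paired with every position
local_entries = Hf * Wf * D ** 2       # every position paired with its DxD neighborhood
print(f"global: {global_entries:,}")   # 268,435,456
print(f"local:  {local_entries:,}")    # 409,600
print(f"reduction: {global_entries // local_entries}x")  # 655x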

3.3. Motion Flow Module (MoF)

In satellite videos, tiny objects are easily confused with indistinguishable background noise, such as clouds, waves, and reflections. In this section, we introduce the motion flow module for temporal context, designed to implicitly model the flow of motion features from the reference frame to the current frame.
The MoF process, depicted in Figure 5, involves the features $E_t$ of the current frame at time $t$ and the features $E_{ref}$ of frame $t-k$ after interaction. Each set of features undergoes a convolution layer followed by a batch normalization layer. The processed features are then fed into $k$ Conv-LSTM units sequentially, resulting in the output features $\hat{E} = h_t$, which are subsequently input into the detection branch for prediction.
Within the MoF module, all Conv-LSTM units share the same parameters. The Conv-LSTM, a convolutional variant of the traditional LSTM structure, consists of forget gates, input gates, and output gates. The computation within the Conv-LSTM can be formally described as follows:
$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} * x_t + W_{hi} * h_{t-1} + W_{ci} \circ c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * x_t + W_{hf} * h_{t-1} + W_{cf} \circ c_{t-1} + b_f\right) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh\left(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * x_t + W_{ho} * h_{t-1} + W_{co} \circ c_t + b_o\right) \\
h_t &= o_t \circ \tanh\left(c_t\right)
\end{aligned}
$$
Here, $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate, respectively. For the first Conv-LSTM unit, the cell state $c_{t-k}$ and the hidden state $h_{t-k}$ are initialized with all-zero matrices. In the first Conv-LSTM unit, the input $x$ is $E_{ref}$; in the last Conv-LSTM unit, the input $x$ is $E_t$; for the intermediate Conv-LSTM units, the input $x$ is an all-zero matrix. In our experiments, we set $k$ to 3, which means we consider the current frame and the two preceding frames to capture the motion information. This choice is based on the observation that most objects in satellite videos do not move significantly over a short period, and thus a small $k$ is sufficient to capture the relevant motion features. We also conducted experiments with different values of $k$ to validate this choice, and the results showed that $k = 3$ provides a good balance between computational efficiency and tracking performance.
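For reference, a compact Conv-LSTM cell matching the gate equations above can be written as follows. This is a generic sketch (the channel, kernel, and spatial sizes are illustrative assumptions), not the exact MoF implementation.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # Convolutional LSTM cell with Hadamard peephole terms W_ci, W_cf, W_co.
    def __init__(self, in_ch, hid_ch, kernel_size=3, spatial=(128, 128)):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)
        self.w_ci = nn.Parameter(torch.zeros(1, hid_ch, *spatial))
        self.w_cf = nn.Parameter(torch.zeros(1, hid_ch, *spatial))
        self.w_co = nn.Parameter(torch.zeros(1, hid_ch, *spatial))

    def forward(self, x, h_prev, c_prev):
        zi, zf, zc, zo = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        i = torch.sigmoid(zi + self.w_ci * c_prev)          # input gate
        f = torch.sigmoid(zf + self.w_cf * c_prev)          # forget gate
        c = f * c_prev + i * torch.tanh(zc)                 # cell state
        o = torch.sigmoid(zo + self.w_co * c)               # output gate
        h = o * torch.tanh(c)                               # hidden state
        return h, c

In the MoF module, the same cell (with shared parameters) would be applied $k$ times starting from zero states, with the inputs ordered as $[E_{ref}, \mathbf{0}, \dots, E_t]$ as described above.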

3.4. Fusion of Prior Mask and Current Feature

Using the local cost volume module (LCV) described in the previous section, we obtain the local cost volume, denoted as $V \in \mathbb{R}^{H_f \times W_f \times D^2}$. We then perform a local operation similar to that of the LCV, but this time we use the cost volume $V$ and the mask $M_{ref} \in \mathbb{R}^{H_f \times W_f \times 1}$, derived from the predictions of the reference frame, as inputs. This outputs an object prior $C_{inst}$ for the current frame. The computation process can be expressed with the following formula:
$$
C_{inst}(x, y) = \sum_{k=1}^{D^2} V(x, y, k) \cdot M_{ref}(x_k, y_k)
$$
where $(x_k, y_k)$ are the coordinates of the $k$-th neighbor of the position $(x, y)$.
The unified head takes the current frame’s FPN feature after the MoF module, $\hat{E}_t \in \mathbb{R}^{H_f \times W_f \times C}$, and the object prior $C_{inst} \in \mathbb{R}^{H_f \times W_f \times 1}$ as inputs. First, it fuses these two inputs using a broadcast sum and then passes the fused feature $\hat{E}_t \in \mathbb{R}^{H_f \times W_f \times C}$ to the original detection head.
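The warping in the formula above can be implemented with the same unfold trick as Algorithm 1. The sketch below is illustrative; it assumes the cost volume has been reshaped to [B, Hf, Wf, D*D] and the reference mask to [B, 1, Hf, Wf].

import torch
import torch.nn.functional as F

def warp_prior(V, M_ref, D):
    # V:     [B, Hf, Wf, D*D]  local cost volume (softmax over the last dim)
    # M_ref: [B, 1, Hf, Wf]    mask predicted in the reference frame
    B, Hf, Wf, _ = V.shape
    # Gather the D x D neighborhood of every mask position: [B, D*D, Hf*Wf].
    neigh = F.unfold(M_ref, kernel_size=(D, D), stride=1, padding=(D // 2, D // 2))
    neigh = neigh.view(B, D * D, Hf, Wf).permute(0, 2, 3, 1)   # [B, Hf, Wf, D*D]
    # Weighted sum over the neighborhood, matching the equation above.
    C_inst = (V * neigh).sum(dim=-1)                           # [B, Hf, Wf]
    return C_inst.unsqueeze(1)                                 # [B, 1, Hf, Wf]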

3.5. Training Loss

The output head of the model has two branches. The classification branch outputs $c$ channels, where $c$ is the number of classes the network focuses on. The box branch has a regression head and an IoU head: the regression head outputs 4 channels for the bounding box coordinates $x$, $y$, width ($w$), and height ($h$), and the IoU head outputs 1 channel for the objectness score. In the detection loss $L_{det}$, the classification loss $L_{cls}$ and objectness loss $L_{obj}$ are computed using BCE-with-logits loss, and the regression loss $L_{reg}$ using IoU loss. More details about $L_{det}$ can be found in the YOLOX [36] paper.
$$L_{det} = L_{cls} + L_{reg} + L_{obj}$$
The computation of the correlation loss, as illustrated in Figure 6, involves a local correlation calculation between the mask corresponding to the target in the reference frame and the cost volume matrix, yielding a prior $C_{inst} \in \mathbb{R}^{H_f \times W_f \times 1}$ for the current frame. This prior is then used to calculate the cross-entropy loss with the ground truth mask $G \in \mathbb{R}^{H_f \times W_f \times 1}$ of the current frame. The cross-entropy loss employs a contrastive learning approach, assigning a label of 1 for the same object and 0 for different objects.
$$L_{corr} = \mathrm{CrossEntropy}\left(C_{inst}, G\right)$$
During training, the loss function is the sum of the detection loss and the correlation loss, with a weight $\alpha$ for the correlation loss. In our experiments, $\alpha = 0.5$.
$$L_{total} = L_{det} + \alpha \cdot L_{corr}$$
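As an illustration, the combined objective can be sketched as below. The helper names and the use of binary cross entropy for the correlation term are assumptions consistent with the description above, not the exact training code.

import torch.nn.functional as F

ALPHA = 0.5  # weight of the correlation loss used in our experiments

def training_loss(cls_logits, cls_tgt, obj_logits, obj_tgt, iou_term, C_inst, G):
    # Detection loss (YOLOX-style): BCE-with-logits for classification and
    # objectness, plus an IoU-based regression term computed elsewhere.
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_tgt)
    l_obj = F.binary_cross_entropy_with_logits(obj_logits, obj_tgt)
    l_det = l_cls + iou_term + l_obj
    # Correlation loss: cross entropy between the warped prior and the
    # ground-truth mask of the current frame (label 1 = same object).
    l_corr = F.binary_cross_entropy(C_inst.clamp(1e-6, 1 - 1e-6), G)
    return l_det + ALPHA * l_corr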

4. Experiments

4.1. Experiment Setting

4.1.1. Datasets

In our experiments, we use the VISO [23] dataset for evaluation. This dataset was captured by the Jilin-1 satellite, with a spatial resolution of 0.9 m and a frame rate of 10 fps. VISO is a commonly used and challenging benchmark for satellite object detection and multi-object tracking. The dataset includes annotations for four key categories relevant to real-world applications: vehicles, airplanes, ships, and trains; its multi-object tracking benchmark focuses exclusively on vehicle tracking. The dataset is divided into a training set (70 sequences) with a pixel resolution of 512 × 512 and a test set (7 sequences) with a pixel resolution of 1024 × 1024. Each video sequence has 300–326 frames.

4.1.2. Evaluation Metrics

Our experiments adopt the popular MOT metric set [39] used in MOTChallenge [40] and KITTI Tracking [41] to quantitatively evaluate the methods’ accuracy, including multi-object tracking accuracy (MOTA), multi-object tracking precision (MOTP), ID precision (IDP), ID recall (IDR), IDF1 score (IDF1), the number of false positives (FP), false negatives (FN), ID switches (IDS), fragmentations (FM), and the percentage of mostly tracked (MT), partially tracked (PT), and mostly lost (ML) trajectories. The official evaluation code is available at TrackEval [42].
Among these, MOTA, IDF1, and MOTP are three comprehensive and important metrics. Higher values of these metrics indicate better tracking performance of the model. Their calculation formulas are as follows:
  • MOTA is a measure of the overall accuracy of the tracking system. It is calculated as follows:
    $$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t\right)}{\sum_t \mathrm{GT}_t}$$
    where $\mathrm{FN}_t$ is the number of false negatives, $\mathrm{FP}_t$ is the number of false positives, $\mathrm{IDS}_t$ is the number of identity switches, and $\mathrm{GT}_t$ is the number of ground truth objects at time $t$. A higher MOTA score indicates fewer missed detections, false alarms, and identity switches, which translates to better tracking accuracy in real-world scenarios. Since FN, FP, and GT in the MOTA formula are ID-agnostic, MOTA places more emphasis on detection accuracy in MOT.
  • IDF1 is a harmonic mean of ID precision and ID recall, providing a balanced measure of the accuracy of object identity assignments. It is calculated as follows:
    $$\mathrm{IDF1} = \frac{2 \times \mathrm{IDP} \times \mathrm{IDR}}{\mathrm{IDP} + \mathrm{IDR}}$$
    where IDP is the ID precision and IDR is the ID recall. A higher IDF1 score indicates more consistent and accurate tracking of object identities over time, which is crucial for applications such as traffic monitoring and surveillance, where maintaining the correct identity of objects is essential.
  • MOTP measures the average precision of the bounding box predictions. It is calculated as follows:
    $$\mathrm{MOTP} = \frac{\sum_{i,t} \mathrm{IoU}\left(p_{i,t}, gt_{i,t}\right)}{\sum_t c_t}$$
    where $c_t$ is the number of matches found at time $t$, and $p_{i,t}$, $gt_{i,t}$ are the predicted position and the ground truth of object $i$ at time $t$. A higher MOTP score indicates more accurate bounding box predictions, which is important for applications where precise localization of objects is required.
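As a quick numerical illustration of the first two formulas (with hypothetical counts, not results from our experiments):

# Hypothetical accumulated counts over a test sequence, for illustration only.
FN, FP, IDS, GT = 1000, 500, 50, 10000
MOTA = 1 - (FN + FP + IDS) / GT            # 1 - 1550/10000 = 0.845 -> 84.5

IDP, IDR = 0.80, 0.70                      # hypothetical ID precision / recall
IDF1 = 2 * IDP * IDR / (IDP + IDR)         # ~0.747 -> 74.7
print(round(MOTA * 100, 1), round(IDF1 * 100, 1))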
In the VISO dataset, the vehicle objects in satellite videos are quite small, and tiny shifts in the predicted bounding box (i.e., 1 or 2 pixels) can cause significant fluctuations in the IoU score. Following the VISO benchmark [43], we consider a predicted detection a true positive (TP) if the predicted bounding box overlaps with the ground-truth bounding box. In practice, we achieve this by setting the IoU threshold to $1 \times 10^{-7}$.
We also report the number of frames per second (FPS) that the model can infer to quantitatively show the model efficiency.

4.1.3. Implementation Details

LocaLock was implemented using the PyTorch framework. LocaLock employed the AdamW optimizer [44] for end-to-end training of the model, with a batch size set to 2. The weight decay was set at 0.0005, and the initial learning rate was established at 0.007. We adopted a cosine learning rate update strategy, with a base learning rate of $7.8 \times 10^{-6}$ per image. All the training experiments were conducted on four NVIDIA RTX 4090 GPUs, and the training spanned 15 epochs. During the training phase, a variety of data augmentation techniques were utilized, including mosaic scaling, mixup, random scaling (ranging from 0.5 to 2.0), color jittering, and random horizontal flipping. These augmentation methods enhanced the robustness and generalizability of the model.
In the inference process, the network’s output is filtered by a confidence threshold θ = 0.4 to obtain the final tracklets. All FPS measurements were conducted on a single NVIDIA RTX 3090 GPU using the PyTorch framework and the original codebase with a batch size of 1. The tests were performed on the VISO test set, and the FPS values represent the average processing speed across all frames of the seven videos in the VISO test set.
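A minimal optimizer and schedule setup consistent with the hyperparameters above might look like the following sketch; the model and train_loader objects are placeholders, and this is not the exact training script.

import torch

def train(model, train_loader, epochs=15):
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.007, weight_decay=0.0005)
    # Cosine learning-rate decay over the training epochs.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for epoch in range(epochs):
        for cur_imgs, ref_imgs, targets in train_loader:   # (I_t, I_{t-k}, labels)
            loss = model(cur_imgs, ref_imgs, targets)      # assumed to return the total loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()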

4.2. Results

4.2.1. Quantitative Results

We test our method on the VISO multi-object tracking benchmark and compare it with other MOT networks, including DBT methods such as MMB + CMOT [45], DTTP [46], and MMB + SORT [45], and JDT methods such as FairMOT [21], CFTracker [47], and DSFNet [23]. The experimental results are presented in Table 1, with the best outcomes highlighted in bold. In this table, a predicted detection is considered a TP if the predicted bounding box overlaps with the ground-truth bounding box; this criterion is adopted to ensure a fair comparison with the method used in the VISO [23] paper. To achieve this, we set the IoU threshold to $1 \times 10^{-7}$.
Experimental results demonstrate the superiority of our proposed LocaLock method over other approaches. It achieves the highest MOTA, MOTP, and IDF1 scores, indicating that LocaLock surpasses all current major MOT models in terms of overall performance. Notably, our method achieves a MOTP score of 67.5, which is significantly higher than other methods. This improvement highlights the substantial enhancement in target localization accuracy brought by the integration of the LCV and MoF modules proposed by LocaLock, as well as the specially designed training loss to incorporate the SOT concept into the MOT framework. Additionally, our model has the lowest combined false positives (FP) and false negatives (FN), indicating that it makes the fewest errors overall. In terms of inference speed, LocaLock is significantly faster than DSFNet because it does not require input from multiple video frames and avoids the computational overhead of 3D convolutions.
For the VISO MOT benchmark, there are some papers that consider a prediction to be a TP if the IoU between the predicted bounding box and the ground-truth bounding box exceeds 0.4. Therefore, we also reported LocaLock’s quantitative results for the VISO test set in Table 2 when the IoU threshold is 0.4.
As shown in Table 2, our LocaLock achieved the best MOTA and IDF1 at an IoU threshold of 0.4 as well, demonstrating the robustness of our proposed method. Notably, it has significantly lower IDS, proving the effectiveness of the “local lock” effect brought by the LCV module.

4.2.2. Visualization Results

To provide a clearer understanding of the performance of our proposed LocaLock framework, we present a series of visualization results. As shown in Figure 7, the leftmost column presents the zoom-out tracking results of LocaLock for the seven entire test videos in VISO. On the right side, the zoomed-in object trajectories corresponding to the yellow boxed areas in the first column are displayed. From left to right, the results are shown for (a) GT, (b) LocaLock, (c) CFTracker, and (d) DSFNet, respectively.
From the visualizations, it is evident that LocaLock maintains the smoothest trajectories in most of the test videos, with better preservation of object identities and fewer fragmented tracks. In addition, in video 003, LocaLock demonstrates greater robustness to noise than DSFNet, resulting in fewer false alarms from background noise.

4.3. Ablation Study

4.3.1. Component Analysis

To validate the effectiveness of our proposed modules, we conducted ablation studies to quantify the contributions of each component within the LocaLock framework. In these experiments, the initial YOLOX + SORT configuration served as the baseline. We then compared this baseline with the results obtained by individually adding the LCV, unified loss, and MoF modules. The ablation experiments were performed using the VISO training set for training and the VISO test set for validation, with results based on an IoU threshold of $1 \times 10^{-7}$ as the reference. The performance improvements attributed to each module are summarized in Table 3.
Compared with the baseline model YOLOX + SORT, the inclusion of the LCV module led to an improvement of 2.9% in MOTA and 0.2% in IDF1, with a significant reduction of 338 in the fragmentations and 69 in the ID switches. Furthermore, the integration of the unified loss resulted in an additional enhancement of 2.5% in MOTA, 0.9% in IDF1, and a reduction of 25 in the fragmentations. By incorporating the MoF module, the model further improved MOTA, MOTP, and IDF1. Overall, the combination of LCV, unified loss, and MoF modules significantly improved the tracking performance, as evidenced by the higher MOTA and IDF1 scores, and reduced the number of FMs compared to the baseline model.

4.3.2. Effect of Neighborhood Parameter D

To explore the impact of the neighborhood parameter D in the LCV module and the training loss computation, we conducted a series of experiments. The results are shown in Table 4. When the neighborhood range D was increased from 3 to 5, MOTA increased by 0.5, and FP decreased by 779. This indicates that using a small neighborhood parameter makes the model’s prediction of object motion more conservative, thereby increasing false positives. Appropriately increasing the neighborhood range allows the model to better learn the matching of object appearance features, thus improving overall detection and tracking performance. As D continues to increase, more similar objects are included in the current object’s neighborhood. Due to the very small size and fuzzy features of vehicles in satellite videos, this increases the difficulty for the model to learn, and thus an excessively large D can degrade the model’s performance.
In addition to affecting tracking accuracy, the size of $D$ also impacts the memory usage and computational cost of the model. During model training, the input images have a size of 512 × 512 pixels. After feature extraction by the backbone, $H_f = W_f = 128$. As shown in Table 5, when $D$ is less than or equal to 7, the additional memory usage is minimal. However, as $D$ continues to increase, the additional memory usage gradually becomes unsustainable. Specifically, when $D = 127$, which corresponds to not using the LCV module but instead computing the global cost volume, the memory requirement exceeds the available memory, making it impractical. Keeping $D$ relatively low helps to save memory and computational resources, allowing for larger batch sizes within the constraints of limited memory.
When $D = 5$, the neighborhood range in the original video frame is approximately equivalent to 20 pixels. Subtracting the 8 to 15 pixels occupied by the vehicle itself, and considering the spatial resolution of 0.9 m and the frame rate of approximately 10 fps of the Jilin-1 video satellite, this can cover the motion of vehicles traveling at speeds up to around 180 km/h. Therefore, from this perspective, $D = 5$ is also an optimal choice.
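The back-of-the-envelope calculation behind this estimate can be written out as follows (purely illustrative, using the values quoted in the text):

# D = 5 on a stride-4 feature map spans roughly 5 * 4 = 20 pixels in the frame.
stride, D = 4, 5
search_px = D * stride                 # 20 px search extent
vehicle_px = 15                        # upper end of the 8-15 px vehicle footprint
margin_px = search_px - vehicle_px     # ~5 px of inter-frame displacement headroom

gsd_m, fps = 0.9, 10                   # Jilin-1: 0.9 m per pixel, ~10 frames per second
speed_kmh = margin_px * gsd_m * fps * 3.6
print(speed_kmh)                       # ~162 km/h, i.e., on the order of 180 km/h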

5. Conclusions

In this work, we introduce LocaLock, a robust multi-object tracking framework that incorporates the local feature matching concept from SOT into the MOT task, simultaneously improving MOT performance in satellite videos. We introduce two novel modules: the local cost volume (LCV) module, which performs precise feature matching in the local area to provide valuable object priors to the detection head efficiently, and the motion flow (MoF) module, which accumulates and exploits temporal information to enhance feature robustness and consistency across frames. Our experiments demonstrate that the proposed LocaLock performs well on real satellite video datasets.
While LocaLock has shown commendable performance on satellite video tracking datasets, there is still room for further enhancements. Specifically, the integration between SOT and MOT tasks can be strengthened, fostering a more unified approach that capitalizes on the strengths of both methods. In future work, we intend to delve into the integration of few-shot learning strategies within MOT tasks by capitalizing on the strengths of SOT.

Author Contributions

Conceptualization, L.K., Z.Y., H.S., T.Z. and L.W.; methodology, L.K. and Z.Y.; software, L.K. and H.S.; validation, L.K., Z.Y., H.S., T.Z. and L.W.; formal analysis, L.K. and L.W.; data curation, Z.Y.; writing—original draft preparation, L.K.; writing—review and editing, L.K., Z.Y. and T.Z.; visualization, L.K. and T.Z.; project administration, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Key Laboratory of Target Cognition and Application Technology under Grant 2023-CXPT-LC-005.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 2544–2550. [Google Scholar]
  2. Montero, A.; Lang, J.; Laganière, R. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the International Conference Computer Vision Workshop, Santiago, Chile, 7–13 December 2015; pp. 587–594. [Google Scholar]
  3. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
  4. Ma, C.; Huang, J.B.; Yang, X.; Yang, M.H. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3074–3082. [Google Scholar]
  5. Danelljan, M.; Robinson, A.; Shahbaz Khan, F.; Felsberg, M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 472–488. [Google Scholar]
  6. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
  7. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–16 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 850–865. [Google Scholar]
  8. Nam, H.; Han, B. Learning Multi-domain Convolutional Neural Networks for Visual Tracking. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302. [Google Scholar] [CrossRef]
  9. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  10. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
  11. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. HiFT: Hierarchical Feature Transformer for Aerial Tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 1–10. [Google Scholar]
  12. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618. [Google Scholar]
  13. Zhang, J.; Zhang, X.; Huang, Z.; Cheng, X.; Feng, J.; Jiao, L. Bidirectional Multiple Object Tracking Based on Trajectory Criteria in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  14. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3464–3468. [Google Scholar]
  15. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–21. [Google Scholar]
  16. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9686–9696. [Google Scholar]
  17. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE international conference on image processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3645–3649. [Google Scholar]
  18. Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
  19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  20. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 474–490. [Google Scholar]
  21. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  22. Pflugfelder, R.; Weissenfeld, A.; Wagner, J. Deep Vehicle Detection in Satellite Video. arXiv 2022, arXiv:2204.06828. [Google Scholar]
  23. Xiao, C.; Yin, Q.; Ying, X.; Li, R.; Wu, S.; Li, M.; Liu, L.; An, W.; Chen, Z. DSFNet: Dynamic and Static Fusion Network for Moving Object Detection in Satellite Videos. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  24. Feng, J.; Zeng, D.; Jia, X.; Zhang, X.; Li, J.; Liang, Y.; Jiao, L. Cross-frame keypoint-based and spatial motion information-guided networks for moving vehicle detection and tracking in satellite videos. ISPRS J. Photogramm. Remote Sens. 2021, 177, 116–130. [Google Scholar] [CrossRef]
  25. Wu, J.; Su, X.; Yuan, Q.; Shen, H.; Zhang, L. Multivehicle Object Tracking in Satellite Video Enhanced by Slow Features and Motion Features. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–26. [Google Scholar] [CrossRef]
  26. Yu, C.; Liu, Y.; Wu, S.; Xia, X.; Hu, Z.; Lan, D.; Liu, X. Pay Attention to Local Contrast Learning Networks for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  27. Sommer, L.; Krüger, W.; Teutsch, M. Appearance and motion based persistent multiple object tracking in wide area motion imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3878–3888. [Google Scholar]
  28. Xiao, Y.; Yuan, Q.; He, J.; Zhang, Q.; Sun, J.; Su, X.; Wu, J.; Zhang, L. Space-time super-resolution for satellite video: A joint framework based on multi-scale spatial-temporal transformer. Int. J. Appl. Earth Obs. Geoinf. 2022, 108, 102731. [Google Scholar] [CrossRef]
  29. Chu, Q.; Ouyang, W.; Li, H.; Wang, X.; Liu, B.; Yu, N. Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4836–4845. [Google Scholar]
  30. Breitenstein, M.D.; Reichlin, F.; Leibe, B.; Koller-Meier, E.; Van Gool, L. Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 1820–1833. [Google Scholar] [CrossRef] [PubMed]
  31. Zheng, L.; Tang, M.; Chen, Y.; Zhu, G.; Wang, J.; Lu, H. Improving Multiple Object Tracking with Single Object Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2453–2462. [Google Scholar] [CrossRef]
  32. Liang, C.; Zhang, Z.; Zhou, X.; Li, B.; Hu, W. One More Check: Making “Fake Background” Be Tracked Again. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; pp. 1546–1554. [Google Scholar] [CrossRef]
  33. Chen, L.; Ai, H.; Zhuang, Z.; Shang, C. Real-Time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018. [Google Scholar] [CrossRef]
  34. Chu, P.; Fan, H.; Tan, C.C.; Ling, H. Online Multi-Object Tracking With Instance-Aware Tracker and Dynamic Model Refreshment. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019. [Google Scholar] [CrossRef]
  35. Jung, I.; Son, J.; Baek, M.; Han, B. Real-time mdnet. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 83–98. [Google Scholar]
  36. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  37. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  38. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  39. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  40. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
  41. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  42. Luiten, J.; Hoffhues, A. TrackEval. 2020. Available online: https://github.com/JonathonLuiten/TrackEval (accessed on 1 July 2024).
  43. Yin, Q.; Hu, Q.; Liu, H.; Zhang, F.; Wang, Y.; Lin, Z.; An, W.; Guo, Y. Detecting and Tracking Small and Dense Moving Objects in Satellite Videos: A Benchmark. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18. [Google Scholar] [CrossRef]
  44. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  45. Bae, S.H.; Yoon, K.J. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1218–1225. [Google Scholar]
  46. Ahmadi, S.A.; Ghorbanian, A.; Mohammadzadeh, A. Moving vehicle detection, tracking and traffic parameter estimation from a satellite video: A perspective on a smarter city. Int. J. Remote Sens. 2019, 40, 8379–8394. [Google Scholar] [CrossRef]
  47. Kong, L.; Yan, Z.; Zhang, Y.; Diao, W.; Zhu, Z.; Wang, L. CFTracker: Multi-Object Tracking with Cross-Frame Connections in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  48. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  49. Yi, K.; Luo, K.; Luo, X.; Huang, J.; Wu, H.; Hu, R.; Hao, W. Ucmctrack: Multi-object tracking with uniform camera motion compensation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6702–6710. [Google Scholar]
  50. Wang, Y.; Kitani, K.; Weng, X. Joint object detection and multi-object tracking with graph neural networks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13708–13715. [Google Scholar]
  51. Stanojevic, V.D.; Todorovic, B.T. BoostTrack: Boosting the similarity measure and detection confidence for improved multiple object tracking. Mach. Vis. Appl. 2024, 35, 1–15. [Google Scholar] [CrossRef]
  52. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  53. Morsali, M.M.; Sharifi, Z.; Fallah, F.; Hashembeiki, S.; Mohammadzade, H.; Shouraki, S.B. SFSORT: Scene Features-based Simple Online Real-Time Tracker. arXiv 2024, arXiv:2404.07553. [Google Scholar]
  54. Zhang, Y.; Wang, T.; Zhang, X. Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22056–22065. [Google Scholar]
  55. Cetintas, O.; Brasó, G.; Leal-Taixé, L. Unifying short and long-term tracking with graph hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22877–22887. [Google Scholar]
  56. Chen, H.; Li, N.; Li, D.; Lv, J.; Zhao, W.; Zhang, R.; Xu, J. Multiple Object Tracking in Satellite Video With Graph-Based Multi-Clue Fusion Tracker. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5639914. [Google Scholar]
Figure 1. Challenges of MOT on satellite videos. The right side provides a zoomed-in sequence of six consecutive frames. The green boxes highlight two objects. The red boxes indicate the positions of the two objects after further zooming in. Due to the limitations of satellite-based video capture technology, objects can appear blurred in some frames, as shown in Frame 3, where the decreased sharpness of the objects increases the possibility of missing objects.
Figure 2. Comparison between the frameworks of (a) joint detection and tracking MOT, (b) SOT for MOT, and (c) our LocaLock method. The notation $t$ represents the current frame time, while $t-k$ refers to the reference frame time, where $k$ is the time difference between the current frame and the reference frame.
Figure 3. The overall architecture of our proposed LocaLock. The network input is the reference frame, reference object, and the current frame. The output includes the bounding boxes and the corresponding object IDs of the predicted objects.
Figure 4. Diagram of the local cost volume module (LCV). For each vector in the reference feature, the module calculates its similarity with its neighbors in the current feature in four steps.
Figure 5. Diagram of the motion flow module (MoF). The Conv-LSTM cell is applied k times in sequence, and all Conv-LSTM cells share parameters.
Figure 6. Diagram of the training loss. The total loss consists of two parts: detection loss and correlation loss. The calculation details of the two losses are shown in the figure.
Figure 7. Visualization of object trajectories in the VISO test set. The leftmost column shows the zoom-out tracking results of LocaLock. The right side displays the zoomed-in areas corresponding to the yellow boxes in the first column, with trajectories shown from left to right for (a) GT, (b) our LocaLock, (c) CFTracker, and (d) DSFNet, respectively.
Table 1. Quantitative results on the VISO test set. The IoU threshold is $1 \times 10^{-7}$. ↑ indicates that higher is better and ↓ indicates that lower is better. The best results are highlighted in bold. Methods marked with * are only suitable for detecting moving objects due to the use of inter-frame differencing or background formation.
MethodMOTA↑MOTP↑IDF1↑MT↑PT↓ML↓FPFNFP + FN↓IDS↓FM↓ FPS↑
* MMB + CMOT [45]22.89.5-38111494071,63871,63889111-
DTTP [46]44.516.3-4831532238,32910,03248,36130901344-
MMB + SORT  [45]58.228.6-21421822111736,37736,49422752047-
FairMOT  [21]2.328.0-2113623207383,25885,331522057.9
CFTracker  [47]57.658.964.8519924727,42311,32738,7505767727.8
DSFNet  [23]61.150.575.7549901928,99166263561745510642.1
LocaLock (Ours)62.667.575.937710317811,25423,19034,4442184966.8
Table 2. Quantitative results on the VISO test set. IoU threshold is 0.4. ↑ indicates that higher is better, ↓ indicates that lower is better. The best results are highlighted in bold.
MethodDetectorMOTA↑IDF1↑IDP↑IDR↑ MT↑ML↓FPFNFP + FN↓IDS↓
Bot-YOLOv7 [48]YOLOv7-X46.148.360.356.527523526,45735,22561,6822971
UCMCTrack [49]YOLOv7-X47.151.053.748.728839624,94734,98859,9353519
OC-SORT [16]YOLOv7-X48.858.761.556.246612925,62035,31660,936578
GSDT [50]DSFNet48.147.945.950.129131324,14534,98159,1263128
BoostTrack [51]Swin-b+Dino48.753.655.551.937733424,15835,68059,8381696
StrongSORT [52]Swin-b+Dino48.957.259.854.93989324,95535,57860,533761
SFSORT [53]Swin-b+Dino49.156.359.553.434717824,75035,2035,99531101
MOTRv2 [54]Deform-DETR49.660.263.557.434511424,65135,19659,847607
SUSHI [55]Cascade-RCNN50.255.660.451.54899824,03235,10859,140593
CFTracker [47]-50.957.760.655.039210023,51534,65758,172641
GMFTracker [56]-52.361.766.957.34998423,46633,23156,697517
LocaLock (Ours)-56.9172.878.268.234518713,90925,84539,754207
Table 3. Ablation results on VISO dataset. A “✓” indicates the use of the module, while a “-” denotes its absence. ↑ indicates that higher is better and ↓ indicates that lower is better.
LCV | Unified Loss | MoF | MOTA↑ | MOTP↑ | IDF1↑ | FM↓ | IDS↓
✓ | - | - | 59.4 | 67.3 | 74.2 | 542 | 151
✓ | ✓ | - | 61.9 | 66.6 | 75.1 | 517 | 202
✓ | ✓ | ✓ | 62.6 | 67.5 | 75.9 | 496 | 218
YOLOX + SORT (baseline) | | | 56.5 | 66.7 | 74.0 | 880 | 220
Table 4. Ablation study on different neighborhood parameters D on the VISO dataset. The IoU threshold is $1 \times 10^{-7}$. ↑ indicates that higher is better and ↓ indicates that lower is better.
D | MOTA↑ | MOTP↑ | IDF1↑ | FP | FN | FP + FN↓ | FM↓ | IDS↓
3 | 61.4 | 68.1 | 75.7 | 11,251 | 24,368 | 35,619 | 507 | 141
5 | 61.9 | 66.6 | 75.1 | 10,472 | 24,642 | 35,114 | 517 | 202
7 | 59.4 | 66.0 | 74.9 | 17,975 | 19,395 | 37,370 | 769 | 256
Table 5. Ablation study of different parameters D on memory usage during training. The input image size is 512, and the memory usage is tested with a batch size of 1. ↓ indicates that lower is better.
D | Memory Usage (MB)↓
3 | 8360
5 | 8450
7 | 8526
15 | 9242
31 | 13,980
127 | CUDA out of memory

