MobileSAM-Track: Lightweight One-Shot Tracking and Segmentation of Small Objects on Edge Devices
Abstract
1. Introduction
- (1) We introduce a tracker that pinpoints the position of the tracked object in each frame. This eliminates the need to store previous segmentation results, avoiding any growth in VRAM usage.
- (2) We employ two lightweight components: the Discriminative Correlation Filter with Channel and Spatial Reliability (CSR-DCF) and the Mobile Segment Anything Model (MobileSAM). Together, these components give our method fast inference speed and minimal VRAM usage.
- (3) We design a diffusion module that attempts to diffuse the mask over the complete tracked object, improving segmentation accuracy without increasing the model's VRAM usage.
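The division of labor described in the contributions above (a tracker propagates a box between frames; a segmenter turns that box into a mask; nothing from past frames is cached) can be sketched as a per-frame loop. The sketch below is illustrative only: `track` and `segment` are placeholder stubs standing in for a real CSR-DCF tracker (e.g., OpenCV's TrackerCSRT) and for MobileSAM prompted with a box, and their bodies are toy implementations.

```python
import numpy as np

def track(frame, prev_box):
    # Placeholder for a CSR-DCF tracker update; a real tracker would
    # re-locate the target here. This stub simply returns the old box.
    return prev_box

def segment(frame, box):
    # Placeholder for MobileSAM with a box prompt; this stub just fills
    # the box region as a dummy mask.
    mask = np.zeros(frame.shape[:2], dtype=bool)
    x, y, w, h = box
    mask[y:y + h, x:x + w] = True
    return mask

def run_pipeline(frames, init_box):
    """Per-frame loop: only the current box is carried between frames,
    so segmentation results never accumulate in memory."""
    box = init_box
    masks_out = []
    for frame in frames:
        box = track(frame, box)      # propagation step
        mask = segment(frame, box)   # segmentation step
        masks_out.append(mask)       # written out, not fed back
    return masks_out

frames = [np.zeros((48, 64, 3), dtype=np.uint8) for _ in range(3)]
masks = run_pipeline(frames, init_box=(10, 8, 20, 16))
print(len(masks), int(masks[0].sum()))  # → 3 320
```

The key design point is that the only state passed from frame to frame is a bounding box, which is what keeps VRAM usage constant regardless of video length.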
2. Preliminary
3. Materials and Methods
3.1. Task Definition
3.2. Mask Generator
3.3. Tracker
3.4. Diffusion Module
4. Experiments
4.1. Dataset
4.2. Experimental Environment
4.3. Evaluation Metrics
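The J, F, and J&F scores reported in the tables below are the standard DAVIS-style measures: region similarity J (mask IoU), contour accuracy F (an F-measure over boundary pixels), and their mean. A minimal NumPy sketch follows; note this is a simplified zero-tolerance variant of F, whereas the official DAVIS protocol matches boundary pixels within a small distance.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union (Jaccard index) of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary(mask):
    """Boundary pixels: mask pixels with at least one 4-neighbor
    outside the mask (a simple erosion-based contour)."""
    m = mask.astype(bool)
    pad = np.pad(m, 1, constant_values=False)
    interior = (pad[:-2, 1:-1] & pad[2:, 1:-1] &
                pad[1:-1, :-2] & pad[1:-1, 2:])
    return m & ~interior

def contour_accuracy(pred, gt):
    """F: F-measure between the two boundary pixel sets."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    hits = (bp & bg).sum()
    precision = hits / max(bp.sum(), 1)
    recall = hits / max(bg.sum(), 1)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy example: prediction covers 3/4 of a square ground-truth object.
gt = np.zeros((32, 32), dtype=bool); gt[8:24, 8:24] = True
pred = np.zeros_like(gt); pred[8:24, 8:20] = True
j, f = region_similarity(pred, gt), contour_accuracy(pred, gt)
print(round(j, 3), round((j + f) / 2, 3))  # J and J&F
```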
4.4. Accuracy Improvement Validation of Diffusion Module
4.5. Lightweight Validation of MobileSAM
4.6. Comparison with Other Related Methods
4.7. Generalization Ability Validation
5. Discussion
5.1. Interpretation and Evaluation of Experimental Results
5.2. Implications of This Work
- (1) Inspired by propagation-based S-VOS networks, we decompose the S-VOS task into two subtasks, tracking and segmentation, which leverage mature achievements in Visual Object Tracking (VOT) [45,46,47,48,49] and instance segmentation [50,51,52,53,54]. Our model inherits its lightweight design and fast inference speed from CSR-DCF [30] and MobileSAM [24].
- (2)
- (3) We introduce a diffusion module that enhances the accuracy and robustness of segmentation without increasing VRAM usage. Our model demonstrates clear advantages in inference efficiency, VRAM utilization, and segmentation accuracy.
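The diffusion idea referenced above can be pictured as constrained region growing: starting from an initial mask, the covered region is repeatedly expanded and clipped to the area the segmenter accepts as the object. The sketch below is a toy stand-in, not the paper's actual module: `object_support` substitutes for the union of sub-masks that repeated MobileSAM prompts would return, and reusing the "number of sub-masks" parameter as an iteration count is our assumption.

```python
import numpy as np

def dilate(mask):
    """One step of 4-neighbor binary dilation, implemented with shifts."""
    pad = np.pad(mask, 1, constant_values=False)
    return (pad[1:-1, 1:-1] | pad[:-2, 1:-1] | pad[2:, 1:-1] |
            pad[1:-1, :-2] | pad[1:-1, 2:])

def diffuse_mask(seed_mask, object_support, num_steps=5):
    """Grow the seed mask over the object: each round expands the
    current mask by one dilation step and clips it to the region the
    segmenter accepts (`object_support` is a toy stand-in for the
    union of sub-masks from repeated prompts)."""
    mask = seed_mask.copy()
    for _ in range(num_steps):
        grown = dilate(mask) & object_support
        if (grown == mask).all():   # converged: nothing new covered
            break
        mask = grown
    return mask

# Toy example: the object occupies a 10x10 square, but the initial
# mask covers only its upper-left quarter.
obj = np.zeros((20, 20), dtype=bool); obj[5:15, 5:15] = True
seed = np.zeros_like(obj); seed[5:10, 5:10] = True
out = diffuse_mask(seed, obj, num_steps=10)
print(int(seed.sum()), int(out.sum()), int(obj.sum()))  # → 25 100 100
```

Because the diffusion operates on the current frame's mask alone, it adds accuracy without requiring any extra stored state, which is consistent with the VRAM claim above.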
5.3. Limitations of Our Method and Future Research Directions
- (1) The proposed model has weak anti-interference ability under occlusion. When the target is partially or completely occluded by other objects, the segmentation results tend to be erroneous or unstable. This is mainly because the mask generator lacks a prior estimate from the previous frame's mask, which hinders its ability to maintain the continuity and integrity of the target.
- (2) This paper adopts a correlation-filter-based tracker as the propagation component, which offers fast speed and strong robustness but also has limitations. For example, the tracker is insensitive to appearance changes and motion patterns of the target, which may cause tracking drift or loss; segmentation accuracy therefore largely depends on tracker performance. In future work, we plan to improve the tracker's accuracy and stability.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Luo, S.; Yu, J.; Xi, Y.; Liao, X. Aircraft Target Detection in Remote Sensing Images Based on Improved YOLOv5. IEEE Access 2022, 10, 5184–5192. [Google Scholar] [CrossRef]
- Zhou, L.; Yan, H.; Shan, Y.; Zheng, C.; Liu, Y.; Zuo, X.; Qiao, B. Aircraft Detection for Remote Sensing Images Based on Deep Convolutional Neural Networks. J. Electr. Comput. Eng. 2021, 2021, 4685644. [Google Scholar] [CrossRef]
- Li, Y.; Zhao, J.; Zhang, S.; Tan, W. Aircraft Detection in Remote Sensing Images Based on Deep Convolutional Neural Network. In Proceedings of the 2018 IEEE 3rd International Conference on Cloud Computing and Internet of Things (CCIOT), Dalian, China, 20–21 October 2018; pp. 135–138. [Google Scholar]
- Wu, S.; Zhang, K.; Li, S.; Yan, J. Learning to Track Aircraft in Infrared Imagery. Remote Sens. 2020, 12, 3995. [Google Scholar] [CrossRef]
- Oh, S.W.; Lee, J.-Y.; Xu, N.; Kim, S.J. Video Object Segmentation Using Space-Time Memory Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9225–9234. [Google Scholar]
- Cheng, H.K.; Tai, Y.-W.; Tang, C.-K. Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 9 June 2021; Volume 15, pp. 11781–11794. [Google Scholar]
- Wang, H.; Jiang, X.; Ren, H.; Hu, Y.; Bai, S. SwiftNet: Real-Time Video Object Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1296–1305. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
- Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model. arXiv 2023, arXiv:2306.16269. [Google Scholar]
- Wang, Y.; Zhao, Y.; Petzold, L. An Empirical Study on the Robustness of the Segment Anything Model (SAM). arXiv 2023, arXiv:2305.06422. [Google Scholar]
- Huang, Y.; Yang, X.; Liu, L.; Zhou, H.; Chang, A.; Zhou, X.; Chen, R.; Yu, J.; Chen, J.; Chen, C.; et al. Segment Anything Model for Medical Images? arXiv 2023, arXiv:2304.14660. [Google Scholar]
- Caelles, S.; Maninis, K.K.; Pont-Tuset, J.; Leal-Taixé, L.; Cremers, D.; Van Gool, L. One-Shot Video Object Segmentation. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 5320–5329. [Google Scholar]
- Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 30 June 2016; pp. 724–732. [Google Scholar]
- Perazzi, F.; Khoreva, A.; Benenson, R.; Schiele, B.; Sorkine-Hornung, A. Learning Video Object Segmentation from Static Images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3491–3500. [Google Scholar]
- Cheng, H.K.; Tai, Y.W.; Tang, C.K. Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5555–5564. [Google Scholar] [CrossRef]
- Cheng, H.K.; Schwing, A.G. XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2022; Volume 13688, pp. 640–658. [Google Scholar]
- Li, M.; Hu, L.; Xiong, Z.; Zhang, B.; Pan, P.; Liu, D. Recurrent Dynamic Embedding for Video Object Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 24 June 2022; pp. 1322–1331. [Google Scholar]
- Liang, Y.; Li, X.; Jafari, N.; Chen, Q. Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 15 October 2020; Volume 2020, pp. 3430–3441. [Google Scholar]
- Li, X.; Loy, C.C. Video Object Segmentation with Joint Re-Identification and Attention-Aware Mask Propagation; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2018; Volume 11207, pp. 93–110. [Google Scholar]
- Rahmatulloh, A.; Gunawan, R.; Sulastri, H.; Pratama, I.; Darmawan, I. Face Mask Detection Using Haar Cascade Classifier Algorithm Based on Internet of Things with Telegram Bot Notification. In Proceedings of the 2021 International Conference Advancement in Data Science, E-Learning and Information Systems, ICADEIS 2021, Nusa Dua Bali, Indonesia, 13–14 October 2021. [Google Scholar]
- Lakhan, A.; Elhoseny, M.; Mohammed, M.A.; Jaber, M.M. SFDWA: Secure and Fault-Tolerant Aware Delay Optimal Workload Assignment Schemes in Edge Computing for Internet of Drone Things Applications. Wirel. Commun. Mob. Comput. 2022, 2022, 5667012. [Google Scholar] [CrossRef]
- Mostafa, S.A.; Mustapha, A.; Gunasekaran, S.S.; Ahmad, M.S.; Mohammed, M.A.; Parwekar, P.; Kadry, S. An Agent Architecture for Autonomous UAV Flight Control in Object Classification and Recognition Missions. Soft Comput. 2023, 27, 391–404. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition At Scale. In Proceedings of the ICLR 2021—9th International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
- Zhang, C.; Han, D.; Qiao, Y.; Kim, J.U.; Bae, S.-H.; Lee, S.; Hong, C.S. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv 2023, arXiv:2306.14289. [Google Scholar]
- Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. TinyViT: Fast Pretraining Distillation for Small Vision Transformers; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2022; Volume 13681, pp. 68–85. [Google Scholar]
- Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef] [PubMed]
- Held, D.; Thrun, S.; Savarese, S. Learning to Track at 100 FPS with Deep Regression Networks; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2016; Volume 9905, pp. 749–765. [Google Scholar]
- Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual Object Tracking Using Adaptive Correlation Filters. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550. [Google Scholar]
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
- Lukežič, A.; Vojíř, T.; Čehovin Zajc, L.; Matas, J.; Kristan, M. Discriminative Correlation Filter Tracker with Channel and Spatial Reliability. Int. J. Comput. Vis. 2018, 126, 671–688. [Google Scholar] [CrossRef]
- Feng, Q.; Xu, X.; Wang, Z. Deep Learning-Based Small Object Detection: A Survey. Math. Biosci. Eng. 2023, 20, 6551–6590. [Google Scholar] [CrossRef] [PubMed]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 2017, pp. 5999–6009. [Google Scholar]
- Li, R.Y.M.; Tang, B.; Chau, K.W. Sustainable Construction Safety Knowledge Sharing: A Partial Least Square-Structural Equation Modeling and a Feedforward Neural Network Approach. Sustainability 2019, 11, 5831. [Google Scholar] [CrossRef]
- Nguyen, A.; Pham, K.; Ngo, D.; Ngo, T.; Pham, L. An Analysis of State-of-the-Art Activation Functions for Supervised Deep Neural Network. In Proceedings of the 2021 International Conference on System Science and Engineering, ICSSE 2021, Ho Chi Minh City, Vietnam, 26–28 August 2021; pp. 215–220. [Google Scholar]
- Tancik, M.; Srinivasan, P.P.; Mildenhall, B.; Fridovich-Keil, S.; Raghavan, N.; Singhal, U.; Ramamoorthi, R.; Barron, J.T.; Ng, R. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 2020. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
- Zamir, S.W.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Khan, F.S.; Zhu, F.; Shao, L.; Xia, G.-S.; Bai, X. ISAID: A Large-Scale Dataset for Instance Segmentation in Aerial Images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 28–37. [Google Scholar]
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
- Shermeyer, J.; Hossler, T.; Van Etten, A.; Hogan, D.; Lewis, R.; Kim, D. RarePlanes: Synthetic Data Takes Flight. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 207–217. [Google Scholar]
- Li, F.; Kim, T.; Humayun, A.; Tsai, D.; Rehg, J.M. Video Segmentation by Tracking Many Figure-Ground Segments. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2192–2199. [Google Scholar]
- Pont-Tuset, J.; Perazzi, F.; Caelles, S.; Arbeláez, P.; Sorkine-Hornung, A.; Van Gool, L. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv 2017, arXiv:1704.00675. [Google Scholar]
- Xu, N.; Yang, L.; Fan, Y.; Yang, J.; Yue, D.; Liang, Y.; Price, B.; Cohen, S.; Huang, T. YouTube-VOS: Sequence-to-Sequence Video Object Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 603–619. [Google Scholar]
- Oh, S.W.; Lee, J.Y.; Sunkavalli, K.; Kim, S.J. Fast Video Object Segmentation by Reference-Guided Mask Propagation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7376–7385. [Google Scholar]
- Chiroma, H.; Herawan, T.; Fister, I.; Fister, I.; Abdulkareem, S.; Shuib, L.; Hamza, M.F.; Saadi, Y.; Abubakar, A. Bio-Inspired Computation: Recent Development on the Modifications of the Cuckoo Search Algorithm. Appl. Soft Comput. J. 2017, 61, 149–173. [Google Scholar] [CrossRef]
- Chen, X.; Yan, B.; Zhu, J.; Lu, H.; Ruan, X.; Wang, D. High-Performance Transformer Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8507–8523. [Google Scholar] [CrossRef] [PubMed]
- Zhao, J.; Dai, K.; Zhang, P.; Wang, D.; Lu, H. Robust Online Tracking with Meta-Updater. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6168–6182. [Google Scholar] [CrossRef] [PubMed]
- Zhu, J.; Lai, S.; Chen, X.; Wang, D.; Lu, H. Visual Prompt Multi-Modal Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9516–9526. [Google Scholar]
- Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14572–14581. [Google Scholar]
- Liu, S.; Li, X.; Lu, H.; He, Y. Multi-Object Tracking Meets Moving UAV. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; Volume 2022, pp. 8866–8875. [Google Scholar]
- Li, R.; He, C.; Li, S.; Zhang, Y.; Zhang, L. DynaMask: Dynamic Mask Selection for Instance Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 11279–11288. [Google Scholar]
- Li, R.; He, C.; Zhang, Y.; Li, S.; Chen, L.; Zhang, L. SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7193–7203. [Google Scholar]
- Zhang, T.; Wei, S.; Ji, S. E2EC: An End-to-End Contour-Based Method for High-Quality High-Speed Instance Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4443–4452. [Google Scholar]
- Zhu, C.; Zhang, X.; Li, Y.; Qiu, L.; Han, K.; Han, X. SharpContour: A Contour-Based Boundary Refinement Approach for Efficient and Accurate Instance Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; Volume 2022, pp. 4382–4391. [Google Scholar]
- Cheng, T.; Wang, X.; Chen, S.; Zhang, W.; Zhang, Q.; Huang, C.; Zhang, Z.; Liu, W. Sparse Instance Activation for Real-Time Instance Segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; Volume 2022, pp. 4423–4432. [Google Scholar]
Dataset | Number of Videos | Number of Frames | Resolution | Maximum Frame Number | Average Pixel Ratio
---|---|---|---|---|---
SegTrack [40] | 24 | 1516 | - | 279 | 5.39%
DAVIS-16 [13] | 50 | 3455 | 854 × 480 | 104 | 8.09%
DAVIS-17 [41] | 209 | 13,586 | 854 × 480 | 104 | 6.09%
YouTube-VOS [42] | 6559 | 160,697 | 1280 × 720 | 36 | 10.77%
Ours | 28 | 6280 | 854 × 480 | 327 | 4.76%
Device | Cropping Scaling Factor | Mask Binarization Threshold | Number of Sub-Masks |
---|---|---|---|
GTX 1080 Ti | 0.15 | 0.3 | 5 |
Jetson Xavier NX | 0.15 | 0.3 | 2 |
Image Encoder Structure | With Mask Diffusion Module | J (%) | F (%) | J&F (%) | FPS | VRAM Usage |
---|---|---|---|---|---|---|
Tiny-ViT | Yes | 60.4 | 72.4 | 66.4 | 12.31 | 425 MB |
ViT-B | Yes | 62.4 | 73.2 | 67.8 | 3.30 | 3269 MB |
ViT-L | Yes | 63.2 | 75.8 | 69.5 | 1.50 | 5047 MB |
ViT-H | Yes | 64.5 | 76.4 | 70.5 | 0.95 | 6570 MB |
Tiny-ViT | No | 51.1 | 63.1 | 57.1 | 16.95 | 425 MB |
ViT-B | No | 52.5 | 63.5 | 58.0 | 3.73 | 3269 MB |
ViT-L | No | 52.7 | 65.4 | 59.1 | 1.73 | 5047 MB |
ViT-H | No | 52.1 | 66.0 | 59.1 | 0.97 | 6570 MB |
Methods | Init Methods | J (%) | F (%) | J&F (%) | FPS | Maximum VRAM Usage |
---|---|---|---|---|---|---|
OSVOS [12] | Mask | 37.0 | 46.6 | 41.8 | 15.99 | 3246 MB |
RGMP [43] | Mask | 53.9 | 66.3 | 60.1 | 7.29 | 3584 MB |
STM [5] | Mask | 44.6 | 60.1 | 52.4 | 6.89 | 7308 MB |
SwiftNet [7] | Mask | 49.8 | 60.3 | 55.1 | 10.27 | 6998 MB |
MiVOS [15] | Click | 45.7 | 64.1 | 54.9 | 6.38 | 7254 MB |
STCN [6] | Mask | 44.5 | 66.2 | 55.4 | 9.53 | 7841 MB |
XMem [16] | Click | 55.8 | 69.2 | 62.5 | 10.76 | 4211 MB |
MobileSAM-Track | Box | 61.4 | 71.4 | 66.4 | 12.31 | 425 MB |
Liu, Y.; Zhao, Y.; Zhang, X.; Wang, X.; Lian, C.; Li, J.; Shan, P.; Fu, C.; Lyu, X.; Li, L.; et al. MobileSAM-Track: Lightweight One-Shot Tracking and Segmentation of Small Objects on Edge Devices. Remote Sens. 2023, 15, 5665. https://doi.org/10.3390/rs15245665