Article

Detecting Moving Vehicles from Satellite-Based Videos by Tracklet Feature Classification

1 School of Earth Sciences and Engineering, Hohai University, Nanjing 211100, China
2 School of Art, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(1), 34; https://doi.org/10.3390/rs15010034
Submission received: 11 November 2022 / Revised: 19 December 2022 / Accepted: 19 December 2022 / Published: 21 December 2022
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Satellite-based video enables potential vehicle monitoring and tracking for urban traffic management. However, due to the tiny size of moving vehicles and the cluttered background, it is difficult to distinguish actual targets from random noise and pseudo-moving objects, resulting in low detection accuracy. In contrast to the currently overused deep-learning-based methods, this study takes full advantage of the geometric properties of vehicle tracklets (segments of a moving object's trajectory) and proposes a tracklet-feature-based method that can achieve high precision and high recall. The approach is a two-step strategy: (1) smoothing filtering is used to suppress noise, and then a non-parametric background subtraction model is applied to obtain preliminary recognition results with high recall but low precision; and (2) the generated tracklets are used to discriminate between true and false vehicles by tracklet feature classification. Experiments and evaluations were performed on SkySat and ChangGuang videos, showing that our method can improve precision and retain high recall, outperforming some classical and deep-learning methods from previously published literature.

1. Introduction

In recent years, capturing videos from satellites has become a reality, with the launch of commercial satellites such as SkySat, ChangGuang, and Urthecast. Unlike conventional satellites, video satellites can capture consecutive images (videos) by staring at a designated area, thus providing critical information for dynamic monitoring [1,2,3].
Video satellites have opened up new areas for remote sensing applications, such as large-scale traffic surveillance, 3D reconstruction of buildings, urban management, etc. In traffic surveillance, extracting moving vehicles from satellite videos is one of the most critical tasks. Moving vehicle detection from ground-based cameras has been an active research topic for many years, and many state-of-the-art methods have been proposed, for example, Temporal Differencing [4], the Gaussian Mixture Model (GMM) [5], the Codebook method [6], and Visual Background Extraction (ViBE) [7]. However, these methods do not work well on satellite videos because of several challenges as follows [2,8]:
(1)
Tiny Size of Targets: In ground-based surveillance videos, targets are relatively large and can be easily distinguished from noise. In satellite videos, moving vehicles are usually composed of several pixels without distinctive colour and texture, making them difficult to distinguish from noise. Under such conditions, motion information is probably the most robust feature that can be used to identify moving vehicles.
(2)
Pseudo-Moving Objects: Due to the camera's ego-motion, the parallax of tall buildings causes many pseudo-moving objects, which significantly affect the detection accuracy. False objects often appear on the edges of tall buildings. In the absence of available features, efficiently distinguishing between true and false targets is a crucial challenge.
(3)
Cluttered Background: Satellite videos cover expansive views and capture various ground objects, such as roads, buildings, water bodies, vegetation, etc. These objects form complex and cluttered backgrounds, making vehicles tiny and challenging to identify. As Ao et al. [2] adequately mentioned, the task is like looking for a needle in a haystack. Moreover, video illumination changes cause more difficulties, resulting in inconsistent recognition results.
Motion detection has been a hot research area for decades, and many methods for detecting objects from videos have been proposed. Despite the difference between satellite- and ground-based surveillance videos, some basic techniques still apply to detect objects from satellite-based videos.
Background subtraction for motion segmentation in static scenes is a commonly used technique to detect moving objects from videos [9]. This technique relies on subtracting the current image from a reference image (also called a background image), resulting in the difference image. The pixels where the difference is above a given threshold are classified as moving objects (i.e., foreground). Nevertheless, the critical task is to create the background image (reference image), which can be solved by the median background model [10], the GMM [5], the Kernel Density Estimation (KDE) [11], the ViBE model [7], etc. As the most popular and effective methods for motion detection, the Background Subtraction Models (BSMs) also play an essential role in generating initial detection results from satellite videos. For instance, Kopsiaftis and Karantzalos [1] applied the BSMs and mathematical morphology to estimate traffic density. Yang et al. [12] presented a novel saliency-based background model adapted to detect tiny vehicles from satellite-based videos. The evaluation of different background models was performed by Chen [13], who indicated the best method to detect vehicle candidates from satellite-based videos.
Besides the BSMs, the temporal differencing method, in which moving objects are found from the difference between two or three consecutive frames, has been widely used to detect moving vehicles because of its efficiency. Although frame differencing can be used for real-time processing, it generally fails to obtain complete outlines of moving objects and causes ghost regions when objects move fast. Therefore, it is commonly combined with or adapted to other techniques to overcome these drawbacks. More recently, Chen et al. [14] proposed a method for detecting moving objects, called adaptive separation, which is based on the idea of frame difference. Their method can be easily implemented and offers high performance. Likewise, Shi et al. [15] developed a normalised frame difference method, which is also a variant of frame differencing. Shu et al. [16] proposed a hybrid method to detect small vehicles from satellite-based videos by combining three-frame differencing with the Gaussian mixture model.
Witnessing the great success of convolutional neural networks (CNN) in object detection, recent works have utilised CNN to detect moving vehicles. For example, Pflugfelder et al. [17] trained and fine-tuned a FoveaNet for detecting vehicles from satellite-based videos. Chen et al. [13] proposed a lightweight CNN (LCNN) to reject false targets from preliminary detection results to increase recognition accuracy. Zhang et al. [18] trained a weakly supervised deep convolutional neural network (DCNN) to improve detection accuracy. They used the pseudo labels extracted by extended low-rank and structured sparse decomposition.
Some methods exploit motion information to improve detection accuracy. Unlike moving objects in ground surveillance videos, moving vehicles in satellite-based videos lack texture and colour features, and therefore motion information is accounted for. For instance, Xu et al. [19] used a global scene motion compensation and a local dynamic updating method to remove false vehicles by considering the moving background. Likewise, Ao et al. [2] proposed a motion-based detection algorithm via noise modelling. Hu et al. [20] also utilised motion features in deep neural networks for accurate object detection and tracking. Feng et al. [21] proposed a deep-learning framework guided by spatial motion information. More recently, Pi et al. [22] integrated motion information from adjacent frames into an end-to-end neural network framework to detect tiny moving vehicles. Besides spatial information, Lei et al. [23] utilised multiple prior information to improve the accuracy of detecting tiny moving vehicles from satellite-based videos.
Besides the above methods, some studies deal with detecting moving objects using low-rank and sparse representation [24]. Inspired by the low-rank decomposition, Zhang et al. [25] extended the decomposition formulation with bounded errors, called Extended LSD (E-LSD). This method has improved the background subtraction method with boosted detection precision over other methods. In addition, Zhang et al. [26] proposed online low-rank and structured sparse decomposition (O-LSD) to reduce the computational cost, which addresses the processing delay in seeking rank minimisation.
Detecting moving vehicles from satellite videos has been studied for only about a decade, and the existing detection methods still have limitations. Methods based on BSMs often introduce noise that is difficult to distinguish from actual vehicles, while methods based on frame difference are sensitive to noise. Although deep-learning-based approaches have achieved excellent performance, they require many samples and repetitive training, making them highly time-consuming. In addition, deep learning requires expensive computational resources, making it unsuitable for high-speed processing.
Due to the cluttered background and absence of conspicuous appearance features, moving vehicles are often mistaken for false targets caused by complex moving backgrounds. Although some algorithms allow us to adjust parameters to achieve higher detection precision or recall, generally increasing precision results in low recall or vice versa. The critical task is to eliminate the effects of background movement and environmental changes (e.g., illumination change). Overall, a low computational detection algorithm with high precision and recall is lacking, and further studies are needed.
Among the state-of-the-art techniques for detecting moving objects from satellite-based videos, BSMs are still the most widely used because they are practical. Nevertheless, we have noticed that some BSM-based approaches produce results with low precision but high recall. That is to say, they can identify almost all true objects, but they also misidentify many false ones. The key difficulty is to distinguish these pseudo-moving vehicles from true ones. If we can remove the false objects and keep the true vehicles, both precision and recall can be guaranteed. To this end, this paper proposes a moving vehicle detection approach based on BSMs and tracklet analysis.
Our proposed method fully uses motion information, which is a potential cue for identifying actual moving vehicles. This study aims to distinguish between true and false vehicles using the features of their tracklets, which have not been studied in the existing literature. Object tracking tries to obtain an object’s trajectory, which refers to the entire path an object travels from its appearance to its disappearance. A tracklet is considered one of the fragments of the whole path and can be used to distinguish different moving objects.
To distinguish between these tracklets, we need to investigate the characteristics of different tracklets and design appropriate features accordingly. This paper has several contributions as follows: (1) an object matching algorithm considering the directional distance metrics is proposed to create reliable tracklets of moving objects; (2) a tracklet feature classification method is proposed to distinguish between true and false targets; and (3) a tracklet rasterisation method is proposed to create a confidence map that can be used to adaptively alter the parameters in tracklet classification and derive road regions that can be utilised to verify actual moving vehicles. The following sections cover the details of our procedure and experiments.

2. Method

The proposed method mainly consists of the following steps: (1) image filtering and enhancement, (2) background modelling, (3) object tracking, (4) tracklet feature extraction, and (5) tracklet classification, as summarised in Figure 1.
In the first step, we apply a Gaussian filter to the input videos, suppressing some of the random noise that often appears at object edges. This step removes part of the false objects from the object detection results. In the second step, a non-parametric background modelling method [27] is chosen to obtain initial vehicle recognition results with high recall but low precision. In the third step, we extract object tracklets using our newly developed tracking method, derived from the Hungarian Matching Algorithm (HMA) [28] and Optical Flow (OF) [29]; then, features are extracted for each tracklet and used to distinguish true vehicles from false targets. At the same time, a confidence map can be derived from vehicle tracklets and then used to adjust parameters in the tracklet classification. In the final step, the confidence map can also be used as a constraint to verify true moving vehicles, achieving both high precision and recall.

2.1. Video Filtering

The challenge of detecting moving vehicles from satellite videos lies in the target size and random distracting noise. We observed that noise often occurred at building edges and road curbs. Image smoothing is an essential technique for dealing with noise. Some commonly used filters are available for our task, such as the mean, median, and Gaussian filters. For detecting such tiny objects, the filter should be finely tunable through its parameters. Among the three, the mean and median filters can be adjusted by only one parameter, the window size, so targets tend to be over-blurred even when a small window size is applied. In contrast, the Gaussian filter can adjust the smoothing strength by tuning two parameters: the window size (w) and the standard deviation (σ) of the Gaussian function. We therefore utilised the Gaussian filter in our experiments. The appropriate parameters were determined by repetitive testing until the filter removed most of the noise while keeping the targets intact. The empirical parameters were finally set to w = 5 and σ = 1.
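As an illustration, the smoothing step can be written in a few lines with OpenCV. The following is a minimal sketch assuming a grayscale frame and the empirical parameters given above (w = 5, σ = 1); the frame file name is hypothetical.

```python
import cv2

# Minimal sketch of the smoothing step (Section 2.1), assuming a grayscale frame.
# Window size w = 5 and standard deviation sigma = 1 are the empirical values above.
frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frame file
smoothed = cv2.GaussianBlur(frame, (5, 5), sigmaX=1, sigmaY=1)
```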

2.2. Background Subtraction Model

Satellite-based videos differ from ground-based videos. We observed that some popular background models, such as GMM [5] and ViBE [7], do not work well when applied directly to satellite-based videos. In our study, we found that GMM generated too much noise, while ViBE tended to miss many true objects. Therefore, we generated initial detection results using a non-parametric modelling method [11] that can adapt quickly to changes in the scene and yield sensitive results with low false alarm rates. This model applies a normal kernel function to estimate the density and can obtain a more accurate estimation. In our study, we adopted an improved version proposed by Zivkovic and van der Heijden [27], which is available in OpenCV (an open-source package for computer vision). This model needs two parameters specified by the user, namely (1) the threshold (dT2) on the squared distance between a pixel and a sample, which decides whether the pixel is close to that data sample; and (2) the history length (Lhis), i.e., the number of recent frames that affect the background model.
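The model of [27] is exposed in OpenCV as the KNN background subtractor, whose history and dist2Threshold arguments correspond to Lhis and dT2. The sketch below shows how a preliminary foreground mask could be obtained with the parameter values reported in Section 3; the video file name is hypothetical.

```python
import cv2

# Sketch of the background modelling step. OpenCV's KNN subtractor implements the
# Zivkovic-van der Heijden model [27]; Lhis = 50 and dT2 = 200 follow Section 3.
subtractor = cv2.createBackgroundSubtractorKNN(history=50, dist2Threshold=200.0,
                                               detectShadows=False)

capture = cv2.VideoCapture("satellite_video.mp4")  # hypothetical video file
while True:
    ok, frame = capture.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 1)       # filtering step from Section 2.1
    foreground = subtractor.apply(gray)            # binary mask of candidate moving pixels
capture.release()
```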

2.3. Tracklet Extraction

When detecting large objects from videos, noise is much smaller than the target objects and can be filtered out by specifying a threshold value. However, in satellite-based videos, moving vehicles appear similar in size to the noise, making it extremely challenging to remove false targets from the initial detection results. Two major types of false targets exist, namely random noise and pseudo-moving objects, the latter caused by the motion parallax of building roofs. We have noticed that a vehicle moves at a constant speed and in a steady direction, following a regular trajectory, whereas random noise appears at random positions and therefore cannot form smooth trajectories. Although pseudo-moving objects also form trajectories, their trajectories are often short and erratic. Based on these observations, our method discriminates true targets from false ones by taking full advantage of the features of moving objects' trajectories.
Object tracking techniques can be utilised to obtain trajectories. Unlike object tracking, however, our goal is to detect moving vehicles rather than to track each object continuously. Therefore, we do not need the entire trajectory of a moving object, but only its tracklets, as explained in the introduction. In object tracking, an object may be lost for a while and then re-acquired afterward, so in this scenario only separate tracklets, rather than a complete trajectory, can be obtained. Since our study aims to detect the positions of true vehicles rather than to recover whole trajectories, the remainder of this paper discusses only the tracklet tracking method.
The background modelling mentioned above allows us to detect potential moving vehicles, which consist of true and false targets. To track their tracklets, we need to associate each object on the current frame with its corresponding identification on the next frame. This task is also known as data association in Multiple Object Tracking (MOT) [30]. One technique for data association is the Hungarian Matching Algorithm (HMA) [28]. It requires high-quality detection results to obtain satisfactory tracking results. However, some vehicles cannot be detected on specific frames, due to the brightness of targets or motion blur, resulting in incoherent detection results. Missed detections degrade the performance of HMA. To compensate, another technique, the Lucas-Kanade Optical Flow (LKOF) [31], is applied to track vehicles when HMA fails. Combining the two techniques makes our tracklet tracking algorithm reliable.

2.3.1. The Hungarian Matching Algorithm—HMA

Let T1, T2, T3, …, Tn denote the targets on the k-th frame, and D1, D2, D3, …, Dm denote the detected objects on the (k + 1)-th frame. To track Ti, we must find its match Dj from the next frame. If Dj cannot be found, the tracking of Ti is interrupted, or Ti has disappeared from the image. If a detection Dj cannot be matched to any target, it is considered a new object. To match a target Ti to its corresponding detection, we need to measure the distances between the target Ti and all the detections. Given n targets and m detections, a distance matrix Cnm (as shown in Figure 2) is generated, based on the distance measurement. Each element Sij measures the distance between the target Ti and the detection Dj. Through HMA, we can find the globally optimal solution for this problem. That is to say, we can assign all targets to their detections while guaranteeing that the sum of distances is minimised.
The quality of the matching algorithm depends on the distance measurement. To ensure the algorithm's reliability, we propose a measurement that considers both spatial and directional distances. As shown in Figure 3, the two triangles represent two targets, T1 and T2, and the two circles represent the two corresponding detections, D1 and D2. When matching T1 to the detections under the measurement of spatial distance only, T1 may be assigned to D2, because the spatial distance ds12 between T1 and D2 is smaller than the distance ds11 between T1 and D1. Consequently, the matching is wrong and, to avoid this, directional distance is considered. After a period of tracking, each target's tracklet is obtained. From some recent tracking points (e.g., the last 10 points), we can predict the target's velocity, which is a vector depicted by the thick, dotted lines in Figure 3. The velocities of the targets T1 and T2 are represented by V1 and V2, respectively.
Here, we define the directional distance as the distance between a detection point and the line determined by a velocity. In geometry, the distance between a point and a line is defined as the length of the perpendicular segment connecting the point to the given line. For convenience, we denote dpij as the directional distance between target Ti and detection Dj. For instance, in Figure 3, the directional distance between T1 and point D1 is dp11, and dp12 is the directional distance between T1 and point D2. Given a straight line as ax + by + c = 0, the distance between a point (xp, yp) and the straight line can be calculated with the following equation:
$$d = \frac{\left| a x_p + b y_p + c \right|}{\sqrt{a^2 + b^2}}$$
Combining the spatial and directional distance, we define the distance measurement as:
$$S_{ij} = \alpha \, ds_{ij} + (1 - \alpha) \, dp_{ij}$$
where dsij is the spatial distance between the target Ti and the detection Dj, and dpij is the directional distance between Vi (velocity of target Ti) and Dj. The coefficient α (0~1.0) controls the weights between dsij and dpij. In our experiments, we found that the spatial and directional distance are equally important, and thus α was set to 0.5. Under our distance measurement, the target T1 in Figure 3 can be matched to its correct detection D1.
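As a concrete sketch of the matching step, the combined distance of Equation (2) can be assembled into a cost matrix and solved with the Hungarian algorithm (here via SciPy's linear_sum_assignment). Targets, velocities, and detections are assumed to be (x, y) tuples; gating of implausibly large distances is omitted for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm solver

def directional_distance(point, origin, velocity):
    """Perpendicular distance from `point` to the line through `origin` along `velocity` (Eq. 1)."""
    a, b = velocity[1], -velocity[0]                 # normal of the line a*x + b*y + c = 0
    c = -(a * origin[0] + b * origin[1])
    return abs(a * point[0] + b * point[1] + c) / (np.hypot(a, b) + 1e-9)

def match_targets(targets, velocities, detections, alpha=0.5):
    """Assign each target to a detection by minimising the summed combined distance (Eq. 2)."""
    cost = np.zeros((len(targets), len(detections)))
    for i, (t, v) in enumerate(zip(targets, velocities)):
        for j, d in enumerate(detections):
            ds = np.hypot(d[0] - t[0], d[1] - t[1])      # spatial distance ds_ij
            dp = directional_distance(d, t, v)           # directional distance dp_ij
            cost[i, j] = alpha * ds + (1 - alpha) * dp
    rows, cols = linear_sum_assignment(cost)             # globally optimal assignment
    return list(zip(rows, cols))
```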

2.3.2. Optical Flow Tracking

Optical flow (OF) is the pattern of apparent motion of image objects between consecutive frames, caused by object or camera movement. It is a two-dimensional vector field where each vector is a displacement vector showing the movement of points from one frame to the second. A practical method for calculating the OF is LKOF, which takes a w × w patch around one point and assumes that all the points in the patch have the same motion. The equation for the OF calculation can be represented as follows:
$$I(x, y, t) = I(x + dx,\ y + dy,\ t + 1)$$
where I(x, y, t) is the intensity of the pixel (x, y) at time t, and dx and dy are the pixel displacements in the x and y directions, respectively. By expanding Equation (3) in a first-order Taylor series and linearising, we obtain the OF constraint equation:
$$\frac{\partial I}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial I}{\partial y}\frac{\partial y}{\partial t} + \frac{\partial I}{\partial t} = 0$$
where ∂I/∂x and ∂I/∂y are the gradients of the image in the x and y directions, respectively. In addition, ∂I/∂t is the partial derivative with respect to time. For simplicity, Equation (4) can be rewritten as:
$$I_x v_x + I_y v_y + I_t = 0$$
where (vx, vy) is the OF velocity in the x and y directions. The LKOF method assumes that the flow (vx, vy) is constant in a small window of size w × w . Using a point and its surrounding pixels, the velocity (vx, vy) can be worked out through a group of equations as follows:
$$\begin{bmatrix} I_{x_1} & I_{y_1} \\ \vdots & \vdots \\ I_{x_n} & I_{y_n} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \end{bmatrix} = - \begin{bmatrix} I_{t_1} \\ \vdots \\ I_{t_n} \end{bmatrix}$$
Equation (6) means that the OF can be estimated by calculating the derivatives of an image in three dimensions. LKOF is often used to track feature points in computer vision. In our research, we extracted the centroids of the detected objects as the “feature points” and then estimated the OF of each object and thus could track each object.
So far, we have introduced two methods for tracking moving objects: the HMA and the LKOF. HMA is a popular and effective method for multiple object tracking, and it is chosen as the primary tracking method in this study. However, HMA may fail if an object cannot be successfully detected in the next frame. Therefore, LKOF can be used as an auxiliary method when HMA cannot track the object.
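A minimal sketch of this fallback is shown below: the centroids of objects that HMA failed to match are propagated into the next frame with OpenCV's pyramidal Lucas-Kanade tracker. The inputs prev_gray, curr_gray, and lost_centroids are assumed to come from the surrounding pipeline.

```python
import cv2
import numpy as np

def lk_track(prev_gray, curr_gray, lost_centroids, win_size=15):
    """Propagate unmatched object centroids to the next frame with pyramidal LKOF."""
    if not lost_centroids:
        return []
    points = np.array(lost_centroids, dtype=np.float32).reshape(-1, 1, 2)
    next_points, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, points, None,
        winSize=(win_size, win_size), maxLevel=2)        # w x w patch assumption of LKOF
    # Keep only the points that were tracked successfully (status == 1).
    return [tuple(p.ravel()) for p, ok in zip(next_points, status.ravel()) if ok == 1]
```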

2.4. Tracklet Feature Analysis

It is difficult to distinguish true vehicles from false ones using only static information from each frame. Since we track each moving object's tracklet from consecutive frames, we can distinguish them using temporal information. We observed that different moving objects yielded tracklets with different shapes. These tracklets can be categorised into four types (Figure 4). In Figure 4, the first row shows the illustration for each tracklet type, and the second row shows the corresponding instances from a video. Figure 4a shows a vehicle running on a straight road, Figure 4b represents a vehicle that has just completed a right turn, and Figure 4c depicts a vehicle that is making a U-turn. All three vehicle tracklets are smooth, and the points on the tracklets are almost evenly spaced. In contrast, the rough and erratic tracklet shown in Figure 4d is tracked from a distractor on a building. The background modelling algorithm has misinterpreted it as a vehicle. From the characteristics of these tracklets, the roughness of the tracklet can be used as an indicator that can distinguish between true and false vehicles.
To measure the roughness of the tracklet, we proposed a method based on the directional change angle (DCA). As shown in Figure 5, P1, P2, P3, and P4 are four consecutive points on a tracklet, and the angle β2, between two consecutive line segments P1P2 and P2P3, is defined as the DCA. Suppose an object is moving along the tracklet, starting from P1, it changes its direction towards P3 when it reaches P2. Therefore, β2 measures how much the motion direction has changed at P2. Similarly, we can define another DCA β3 at P3 and so on. All these DCAs will be small if the tracklet is smooth, whereas a rough and erratic tracklet would yield large DCAs.
Given a tracklet with n points, we can calculate (n − 2) DCAs, denoted as (β2, β3, …, βn−1), and then derive their descriptive statistics (measured in radians): minimum (DCAmin), maximum (DCAmax), mean value ( DCA ¯ ), and standard deviation ( σ DCA ). We selected some samples for each type of tracklet and calculated their DCA statistics, and the results are presented in Table 1. We can see that even when a vehicle is making a U-turn, the DCA ¯ is about 0.315 (less than 0.5) and the σ DCA is 0.218 (less than 0.5). These data indicate that vehicle tracklets (i.e., types (A), (B), and (C)) have small DCAmin, DCAmax, DCA ¯ , and σ DCA , whereas these values are much larger for non-vehicle tracklets.
Among the four statistical values (Table 1), the DCAmin and DCAmax are easily affected by outliers in the DCAs, due to the presence of noises on the tracklet. Similarly, a non-vehicle tracklet may also contain a small DCA value, resulting in a small DCAmin. In contrast, the DCA ¯ and σ DCA are calculated from a series of DCAs, so they are less affected by outlier values and are more reliable. From analysing the geometric features of the tracklets, we can distinguish non-vehicles from vehicles through the mean and standard deviation of the DCAs calculated from their tracklets. The following section covers the tracklet classification based on their tracklet features.
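The DCA statistics used below can be computed directly from a tracklet's point sequence. The sketch assumes each DCA is the angle (in radians) between consecutive segment directions, as defined in Figure 5.

```python
import numpy as np

def dca_statistics(points):
    """Return (DCA_min, DCA_max, DCA_mean, DCA_std) of a tracklet given its (x, y) points."""
    pts = np.asarray(points, dtype=float)
    v1 = pts[1:-1] - pts[:-2]                            # directions P1->P2, P2->P3, ...
    v2 = pts[2:] - pts[1:-1]                             # following directions
    dot = (v1 * v2).sum(axis=1)
    norms = np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-9
    dca = np.arccos(np.clip(dot / norms, -1.0, 1.0))     # angles beta_2 ... beta_{n-1}
    return dca.min(), dca.max(), dca.mean(), dca.std()
```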

2.5. Tracklet Classification

To describe our algorithm, we define the tracklet as an object for convenience, analogous to the concept of an object in object-oriented programming (OOP). Figure 6 shows the tracklet object with its attributes in UML (Unified Modelling Language) format. The attribute DCAs.x denotes the DCAmin, DCAmax, DCA ¯ , and σ DCA statistics discussed in the previous section. The attribute "Length" is the number of points in the tracklet, and "Active" is a Boolean attribute that is true if the tracklet is still being tracked, or false if the tracking has been completed.
As we have observed in traffic videos, vehicle tracklets are smooth, even when the vehicle is travelling around a bend or making a U-turn, whereas non-vehicle tracklets are erratic and ragged. Additionally, vehicle tracklets are long (often more than 20 points), whereas non-vehicle tracklets are relatively short.
In the previous section, we mentioned that the DCAmin and DCAmax features might not be reliable for distinguishing non-vehicles from vehicles, so we chose DCA ¯ , σ DCA , and the length of the tracklet as the features to classify tracklets. In addition, the attribute "Active" must be considered when judging whether a short tracklet is a vehicle or not, because a true vehicle tracklet may not be long enough at the beginning of tracking. If the active state of a tracklet were ignored, such short tracklets of truly moving vehicles would be misclassified as non-vehicles.
In our research, we do not need to classify tracklets into four types because we aim to distinguish real vehicles from spurious ones. Therefore, our classification is a binary classification problem. The statistics in Table 1 were calculated from many real instances in traffic videos. Note that the DCA ¯ of a non-vehicle is above 1.0, whereas all the DCA ¯ values of vehicles are less than 0.3. The σ DCA of a non-vehicle is usually larger than 0.5, and that of a vehicle is less than 0.5. This indicates that non-vehicles and vehicles are linearly separable. Considering the tracklet length as another feature, we set three thresholds for tracklet classification: Tmean = 0.5 C, Tstd = 0.5 C, and Tlength = 20/C, where C is a self-adaptive coefficient (default value 1.0), which we discuss in the next section. We need to process each frame as fast as possible, so the classification method should be lightweight. For this reason, we use the decision tree in Figure 6 as our classifier:
From the decision tree, we can see that if a tracklet is too short (shorter than TLength), it will be classified as a non-vehicle when the tracklet has been completed (Active = False), since short tracklets are usually tracked from noise. If a short tracklet is still active, we cannot currently judge whether it is a true vehicle or not and just skip it. If the tracklet is longer than the threshold Tlength and has a small DCA ¯ value (less than Tmean), it is a vehicle on a straight road. On the other hand, if the DCA ¯ is larger than Tmean, we need the third feature σ DCA to make further decisions: (1) if σ DCA is also larger than the corresponding threshold Tstd, the tracklet is classified as a non-vehicle; (2) if σ DCA is less than Tstd, it is a vehicle.
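The decision logic can be summarised in a few lines. The sketch below assumes a tracklet object carrying the attributes of Figure 6 (length, active, and the DCA mean and standard deviation) and returns "undecided" for short tracklets that are still being tracked.

```python
def classify_tracklet(tracklet, C=1.0):
    """Decision-tree tracklet classification (Section 2.5) with self-adaptive coefficient C."""
    t_mean, t_std, t_length = 0.5 * C, 0.5 * C, 20.0 / C
    if tracklet.length < t_length:
        # Short tracklets are rejected as noise once tracking has finished;
        # short but still-active tracklets cannot be judged yet.
        return "undecided" if tracklet.active else "non-vehicle"
    if tracklet.dca_mean < t_mean:
        return "vehicle"            # long and smooth: vehicle on a straight road
    if tracklet.dca_std < t_std:
        return "vehicle"            # larger mean but consistent DCAs: turning vehicle
    return "non-vehicle"            # long but rough and erratic: distractor
```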

2.6. Confidence Map

In urban areas, moving vehicles are driven on certain fixed regions: roads. Getting the road ROIs (Region of Interest) in advance would help suppress false targets using the ROIs as constraints. Yang et al. [12] and Chen et al. [14] utilised a similar strategy in their research, where the road mask was generated from accumulated trajectories and then used to filter false targets. Unlike their work, we created a confidence map from the tracked tracklets and then used it to adapt the parameter C in the tracklet classification.
In the previous section, three thresholds, Tmean = 0.5 C, Tstd = 0.5 C, and Tlength = 20/C, were set. When C is set to the default value of 1.0, the classification criteria are strict, and only tracklets that meet these strict criteria are classified as vehicles. The confidence of a tracklet is computed by the following formula:
$$V_{conf} = 0.5 \cdot \frac{\max\left(0,\ 2 - T.DCA.mean\right)}{2} + 0.5 \cdot \frac{\min\left(T.Length,\ 255\right)}{255}$$
where T.DCA.mean is the mean value (i.e., DCA ¯ ) of the tracklet's DCAs, and T.Length is the length of the tracklet. The first term of Equation (7) depends on the smoothness of the tracklet and the second on its length. The smoother and longer the tracklet, the higher the confidence. The DCA ¯ of a tracklet is larger than 0 and usually less than 2, so the first term ranges from 0 to 0.5. The second term also ranges from 0 to 0.5, and thus VConf varies between 0 and 1.0. The confidence measures the likelihood that a tracklet is tracked from a true moving vehicle.
A confidence map can be generated from vehicle tracklets as follows: each vehicle tracklet is rasterised into a 3-pixels wide trace, and the corresponding pixel values are set to the tracklet’s confidence value. Once all traces are created, they are overlapped together. When overlapping these traces, low confidence values are replaced by high values, as shown in Figure 7a, where four traces with different confidence values (1, 2, 3, and 4) are displayed in blue, green, yellow, and red colours, respectively.
This way, the confidence map is accumulated and updated during tracking. Figure 7b shows a confidence map generated and dynamically updated from real tracklets.
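The rasterisation and max-overlay steps can be sketched as follows; conf_map is assumed to be a single-channel float32 image of the frame size, and each vehicle tracklet is drawn as a 3-pixel-wide polyline carrying its confidence value from Equation (7).

```python
import cv2
import numpy as np

def update_confidence_map(conf_map, tracklet_points, v_conf):
    """Rasterise one vehicle tracklet as a 3-pixel-wide trace and max-overlay it onto the map."""
    trace = np.zeros_like(conf_map)                       # conf_map: float32, frame-sized
    pts = np.asarray(tracklet_points, dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(trace, [pts], isClosed=False, color=float(v_conf), thickness=3)
    np.maximum(conf_map, trace, out=conf_map)             # low values are replaced by high ones
    return conf_map
```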
Once the confidence map is obtained, the coefficient C is calculated by the following formula:
$$C = 2 \max\left(0.25,\ V_{conf}\right)$$
As mentioned above, Vconf varies between 0 and 1.0, and thus C ranges from 0.5 to 2.0. As previously mentioned, C is a self-adaptive coefficient that controls the three thresholds. If an object is in a high-confidence region, C is large, so Tmean and Tstd increase while Tlength decreases. This means that the smoothness and length criteria are relaxed in high-confidence areas, reducing the probability of rejecting a true vehicle with a ragged tracklet. Conversely, the criteria are tightened in low-confidence regions, so false moving vehicles can be rejected safely.
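Putting the pieces together, a sketch of Equations (7) and (8): the confidence of a tracklet is computed from its DCA mean and length, and the coefficient C for an object is looked up from the accumulated map at the object's position (the integer indexing of the map is an assumption of this sketch).

```python
def tracklet_confidence(dca_mean, length):
    """Confidence V_conf of a tracklet (Eq. 7): smoother and longer tracklets score higher."""
    smoothness_term = 0.5 * max(0.0, 2.0 - dca_mean) / 2.0    # in [0, 0.5]
    length_term = 0.5 * min(length, 255) / 255.0              # in [0, 0.5]
    return smoothness_term + length_term

def adaptive_coefficient(conf_map, x, y):
    """Self-adaptive coefficient C (Eq. 8) read from the confidence map at (x, y); C in [0.5, 2.0]."""
    v_conf = float(conf_map[int(round(y)), int(round(x))])
    return 2.0 * max(0.25, v_conf)
```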

3. Result

To evaluate the performance of the proposed method, two experiments were conducted on two different satellite-based videos. The two parameters of the background model were set as: Lhis = 50 and dT2 = 200. All other parameters were set as mentioned in the previous sections. Our approach was implemented in Python and the OpenCV package, which offers basic image processing functions, the optical flow tracking algorithm, and classical background subtraction models. The HMA, tracklet classification, and evaluation modules were implemented in Python language, along with NumPy (a mathematical package for Python).

3.1. Datasets

The datasets consist of videos from two recently launched satellites. The first video was acquired by SkySat over Las Vegas, USA, on 25 March 2014, with a frame size of 1280 × 720. This dataset has been used in our previous study [13] and can be reused in this experiment. Due to a large number of tiny and dim vehicles in videos, the manual labelling work of samples is too enormous to apply to the whole frame. Therefore, the evaluation was conducted in a sub-region of 360 × 460, in which tall buildings cause obvious motion parallax. Eleven frames (200, 350, 400, 500, 600, 700, 800, 900, 1000, 1100, and 1200) were chosen as the validation data. The second video was captured by ChangGuang satellite over Atlanta, USA, on 3 May 2017, with a frame size of 1920 × 1080. A 416 × 516 sized sub-region was cropped as the experimental area and ten frames (410, 510, 560, 610, 640, 670, 730, 760, 790, and 820) were chosen as the validation data. In the two validation datasets, true moving vehicles were manually annotated as ground-truth images. The green rectangles in Figure 8 are the moving vehicles annotated manually.

3.2. Evaluation Metrics

The validation of object detection is often conducted by comparing the detection results against the ground-truth images. If a detected object overlaps with the corresponding ground-truth object, it is considered a True Positive (TP) detection; otherwise, it is a False Positive (FP). A ground-truth object is a False Negative (FN) when it is not successfully detected. Once the TP, FP, and FN counts are obtained, the precision, recall, and F1 score can be computed as follows [32]:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
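For completeness, a small helper that turns the TP/FP/FN counts into the three scores (guarding against empty denominators):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```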

3.3. Evaluation

3.3.1. Qualitative Evaluation

This section presents some results of the tracklets and evaluates the performance by visual inspection. For qualitative evaluation, we conducted experiments on Dataset 1 (Figure 8a), which contains tall buildings causing motion parallax that resembles moving vehicles.
Figure 9 shows the tracklets from frame 359 to frame 379 of Dataset 1. We only show an enlarged portion of the image so that the tracklets and vehicles are visible. In each frame, the green dots indicate positive cases (true moving vehicles), and the red dots indicate false ones (non-vehicles). Note that a green dot does not mean the object is permanently confirmed as a true vehicle, nor does a red dot permanently mark a false target. The colours only indicate the decision (true or false vehicle) the classifier can make up to the current frame. Therefore, the decision is dynamically updated from one frame to the next, and the final decision is only made once a tracklet has been completed. Take the two green dots in the dotted rectangle in Figure 9b as examples. They are random noise on building tops and are misidentified as moving vehicles by the background model. At this point, the classifier cannot judge whether they are vehicles or not, because they first appear in frame 362 and their tracklets are not yet long enough for a decision. The two objects are then classified as non-vehicles in frame 365 (Figure 9c), where their tracking has finished and the classifier can make the final decision. After frame 365, the two non-vehicle objects in the rectangle in Figure 9c become inactive and disappear.
Figure 9d–f provides more examples of such random noise. Notice that three new objects and two new objects appear in the dotted rectangles in frames 372 and 375, respectively. Similarly, some other objects appear on the building tops from frame 375 to frame 378, which are not presented here, due to space constraints. When the classifier proceeds to frame 379, the objects on building tops are correctly classified as non-vehicle—red dots in the rectangle in Figure 9f. These examples show that the classifier effectively distinguishes random noise in the detection results.
Our classifier makes decisions dynamically with the accumulation of tracklets. This can be verified by the examples in the circle in Figure 9b, where some true vehicles are marked with red dots. This does not mean that our classifier cannot make correct decisions; rather, these vehicles are newly tracked and their tracklets are not yet long enough to obtain a stable calculation of the features. As the tracklets accumulate, the calculation becomes stable, and the decision can be made correctly, as shown in the circle in Figure 9f, where all true vehicles are marked in green.
Apart from random noise, which cannot form long tracklets, some distractors do form long tracklets. In such a scenario, our classifier can still identify these false targets using the tracklet features discussed above. Figure 10 shows an instance in the dotted rectangle on the top of a building, which forms a long tracklet and is misidentified as a vehicle from frame 514 to frame 525. Although this situation persists for an extended period, the object is correctly identified as a non-vehicle when the algorithm reaches frame 530 in Figure 10d. This example shows that our classifier can make a correct decision once the tracklet has accumulated.
As mentioned before, our algorithm can generate a confidence map as a by-product during the tracklet classification. The confidence map is generated from the vehicle tracklets and used to adjust the thresholds adaptively. Figure 11 shows the phases of the confidence map generated from Dataset 1, at intervals of 100 frames. One can notice that the road regions are fully formed at frames 600 to 700 and remain almost unchanged afterward. This indicates that the confidence map can also be used as road ROIs to verify our detection results and to remove false targets outside these road ROIs. With the assistance of the confidence map, our algorithm can also remove false targets when the tracklet classification fails, thus guaranteeing detection accuracy.

3.3.2. Quantitative Evaluation

This section compares the detection results with the ground-truth images, and then the quantitative evaluation metrics are calculated. For comparison, we also implemented several other classic motion detection models and methods from previously published literature: (1) three-frame difference (Diff3): a classic moving object detection method that uses the difference image between consecutive frames [33]; (2) MOG2: an improved method based on GMM [34]; (3) ViBE [7]: a background model using an update mechanism composed of random substitutions and spatial diffusion; (4) AMS-DAT [14]: a novel moving vehicle detection approach using adaptive motion separation and difference accumulated trajectory; (5) NPBSM: a non-parametric background subtraction model [27]; (6) LCNN [13]: a moving vehicle detection method based on a background model and a lightweight convolutional neural network. For convenience, the proposed method is abbreviated as TFC (Tracklet Feature Classification).
The performance of different methods depends on the datasets' characteristics and the videos' content. An area threshold is often used to filter out noise to improve object detection accuracy; that is, objects smaller than the area threshold are removed. However, the performance of the Diff3, MOG2, and ViBE methods is significantly affected by the area threshold: a small threshold keeps much noise in the detection, resulting in low precision values, whereas a large threshold often omits true objects, leading to low recall values. To ensure fairness, the thresholds for these three methods were carefully adjusted to obtain the best trade-off between precision and recall (i.e., the best F1 score). NPBSM also depends on the area threshold, but its threshold was set to a small value (3 pixels) to ensure a high recall value, because our method (i.e., TFC) takes its results as input and requires high recall. The remaining methods were also tuned to their best performance trade-off.
The detection results of Datasets 1 and 2 are presented in Figure 12 and Figure 13, respectively, and the quantitative evaluation results are shown in Table 2. Note that the statistics in Table 2 are calculated over multiple frames, while Figure 12 and Figure 13 show only one frame of the results. Therefore, the scores may seem inconsistent with the results in some of these images. From the statistics in Table 2, one can see that TFC can achieve the highest F1 score on both datasets.
Dataset 1 contains tall buildings that cause motion parallax resembling moving vehicles. Even after noise filtering, Diff3 achieves a relatively high recall but low precision, because this method is very susceptible to noise: it can detect subtle changes between frames and thus retrieves a large portion of the true vehicles, but it also introduces many false targets. ViBE has been reported to perform satisfactorily in detecting moving objects from ground surveillance videos, but it tends to omit many true vehicles in satellite-based videos, resulting in the lowest recall and F1 score. As a classical method, MOG2 achieves precision, recall, and F1 scores of around 70%. NPBSM achieves the highest recall (96.4%) but the lowest precision (48.6%). As mentioned earlier, NPBSM was not tuned to the best trade-off between precision and recall because our method needed its high-recall results as input; the comparison shows that our method can significantly improve its precision. Finally, AMS-DAT, LCNN, and TFC achieve precision and F1 scores of over 80%. TFC obtains the highest F1 score among the three methods, indicating the best overall performance, that is, the best trade-off between precision and recall.
Dataset 2 does not contain tall buildings, but it has many "distractors" from the parking lots and viaducts in the scene, which introduce random noise. Moreover, some local distortions, as well as dim and blurred vehicles, make detection difficult. Diff3's F1 score is the lowest among all the results and its precision is also poor, indicating its high sensitivity to noise. With the area threshold set to 3 pixels, MOG2 achieves the highest precision (89.9%), but its recall is unsatisfactory (66%, the second-lowest value). ViBE again achieves the lowest recall and the second-lowest F1 score on this dataset. AMS-DAT obtains moderate precision, recall, and F1 scores. Similarly, NPBSM achieves the highest recall but low precision. Both LCNN and TFC achieve precision, recall, and F1 scores of over 80%. Of the two, TFC has higher recall but lower precision, and the F1 scores show that TFC achieves the best overall performance.

3.3.3. Analysis of Area Threshold

We have mentioned earlier that using area thresholds to filter out noise or false objects is not effective for detecting small moving vehicles from satellite-based videos. This is because the vehicles in the videos often resemble noise in size or shape. Removing noise by an area threshold can improve the precision to some extent, but it also removes true targets, lowering recall value. This can be confirmed by the following experiment. Among the background subtraction models used in the previous experiments, the NPBSM proved to have the best overall performance. Therefore, we chose the NPBSM method to demonstrate the results.
In this experiment, the area threshold gradually increased from 1 pixel to 15 pixels, and detected objects smaller than each of these thresholds were filtered out. Thus, we obtained a series of detection results under different area thresholds, as shown in Figure 14. These detection results were quantitatively evaluated, and the assessments are presented in Table 3 (due to space limitation in the row, only assessments under odd-numbered thresholds are shown).
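A sketch of the area-threshold filtering evaluated in this experiment is given below: connected components are extracted from the binary foreground mask and blobs not larger than the threshold are discarded.

```python
import cv2
import numpy as np

def filter_by_area(foreground_mask, area_threshold):
    """Remove detected blobs whose area is less than or equal to `area_threshold` pixels."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(foreground_mask, connectivity=8)
    kept = np.zeros_like(foreground_mask)
    for i in range(1, n):                                  # label 0 is the image background
        if stats[i, cv2.CC_STAT_AREA] > area_threshold:
            kept[labels == i] = 255
    return kept
```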
The first image in Figure 14 shows the detection results when objects less than, or equal to, 1 pixel are removed. Under this threshold, almost all true vehicles (objects in yellow rectangles) are correctly recognised, except for one (the object in the cyan rectangle). However, many distractors (objects in red rectangles) are also mistaken as true vehicles. With the increase in the area threshold, more red rectangles disappear and more cyan rectangles appear in the following images. This means that precision increases while recall decreases, as shown in Figure 15. In Figure 15, the precision and recall curves intersect at about 7.5 pixels, where the precision and recall reach the best compromise. The assessments in Table 3 also show that the F1 score reaches the highest value (about 83.3%) at the threshold of 7 or 8 pixels.
The experiment shows that high precision and recall cannot be obtained simultaneously by directly filtering out small objects via area thresholds. The highest F1 score the filtering method can obtain on Dataset 1 is 83.3%, which is lower than that achieved by TFC (89.5%). This indicates that the TFC method effectively improves detection precision while retaining high recall.

4. Discussion

The experimental results show that the frame difference method (Diff3) introduces too much noise, resulting in inferior accuracy. The ViBE method, as a recently proposed novel background subtraction model, works well with surveillance monitoring videos but not when applied to detect tiny moving objects from satellite-based videos. The MOG2 method achieves moderate performance. Both LCNN and TFC methods adopt the NPBSM as the preliminary detector and then eliminate false targets, but they use different strategies to remove them. LCNN uses a lightweight convolutional neural network that requires collecting samples and costs additional processing time, whereas TFC takes full advantage of motion information to identify false objects. Both AMS-DAT and TFC use motion information to eliminate false targets, but they adopt different strategies. AMS-DAT uses a straightforward method that accumulates the difference foreground to obtain moving trajectories. TFC applies complete temporal information and uses tracklet features to distinguish between true and false targets.
Our method uses a two-step strategy to obtain the final result. The accuracy of the final result depends on two aspects: high recall of the preliminary detection and the accuracy of the tracklet classifier. Hence, the basic premise of the TFC method is that including more false targets in the preliminary detection is more acceptable than missing true objects. This is because false targets can be filtered out through post-processing, but missed targets will never be retrieved again. Therefore, the algorithm should guarantee that all candidates can be recognised in the first step. The NPBSM used in our method can detect candidates with recall higher than 96%, fulfilling the task.
In the subsequent post-processing step, our method tracks the centroids of candidates using the HMA and LKOF, which is different from the accumulation strategy proposed by Chen et al. [14]. The edges of tall buildings tend to form tracklets similar to vehicles. The yellow and green rectangles in Figure 16a indicate edges that are incorrectly recognised as moving vehicles, and Figure 16b shows the tracklets extracted by accumulation, including incorrect tracklets (in the two rectangles) caused by building edges. In contrast, Figure 16c shows that our strategy can prevent the false targets from forming tracklets, yielding more accurate results. In addition, our method can obtain more complete tracklets than the accumulation method.
As for the detection performance of the methods involved in our experiments, a summary is drawn as follows: NPBSM has the highest recall (over 96%), indicating that it can retrieve almost all targets, but it also produces many false alarms, with the precision from 48 to 55%. Therefore, NPBSM has the advantage of high recall capability and can be used to generate initial detection in two-step methods, outperforming Diff3, MOG2, and ViBE. As two-step methods, AMS-DAT, LCNN, and TFC apply different strategies to eliminate false alarms. AMS-DAT uses a straightforward trajectory accumulation method to verify true targets and can achieve moderate precisions (78−85%) and recalls (68−78%). LCNN employs a lightweight convolutional neural network and achieves precisions from 84 to 90% and recalls of around 85%. TFC utilises tracklet features to discriminate between true and false targets and achieves precisions from 82 to 93% and recalls from 85 to 92%. On our datasets, TFC outperforms LCNN and AMS-DAT. Among all the methods in the experiments, TFC has the best overall performance, achieving the highest F1 score.

5. Conclusions

Satellite videos offer a new data source for monitoring moving vehicles for urban traffic management. Because of the tiny size and lack of distinctive appearance features of the vehicles, improving the accuracy of moving vehicle detection is challenging. This study proposed a two-step approach to improve detection accuracy. First, the Gaussian smoothing filtering is applied to suppress noise, and then the background subtraction model is used to obtain the preliminary detection results. Second, the Hungarian matching algorithm and Lucas–Kanade optical flow are used to generate tracklets, which are then used to discriminate between true and false targets by extracting tracklet features. Experiments on different datasets demonstrated that the proposed method could achieve both satisfactory precision and recall, outperforming some classical and other specific methods from recently published literature.
Our method takes full advantage of motion information and does not need any auxiliary data. Therefore, it avoids much extra work, such as sample collection and training, which are indispensable and time-consuming in deep-learning-based methods. In addition, the method is easy to implement. The experiments showed that the algorithm can detect vehicles travelling at different speeds, indicating that the detection results are not affected by vehicle speed as long as the tracklets can be tracked. It is important to note that our method depends on motion information and can therefore only detect moving vehicles (it fails when vehicles stop at road intersections). Future work will be devoted to detecting both moving and stationary vehicles. Another limitation is that the method cannot recognise vehicle types, due to the low resolution and the lack of colour and texture information. The only available clue is the target size, which can distinguish large vehicles (e.g., trucks and buses) from small ones (cars). This could be addressed by extending the method to higher-resolution data, such as unmanned aerial vehicle (UAV) videos.

Author Contributions

Conceptualisation, R.C.; Methodology, R.C.; Validation, X.L.; Writing—original draft, R.C.; Writing—review and editing, V.G.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 41471276.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the editors and reviewers for their suggestions. The authors also acknowledge the companies Skybox Imaging and ChangGuang Satellite Technology for the satellite videos.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kopsiaftis, G.; Karantzalos, K. Vehicle Detection and Traffic Density Monitoring from Very High Resolution Satellite Video Data. In Proceedings of the IGARSS, Milan, Italy, 26–31 July 2015; pp. 1881–1884. [Google Scholar]
  2. Ao, W.; Fu, Y.; Hou, X.; Xu, F. Needles in a Haystack: Tracking City-Scale Moving Vehicles From Continuously Moving Satellite. IEEE Trans. on Image Process. 2020, 29, 1944–1957. [Google Scholar] [CrossRef] [PubMed]
  3. Ahmadi, S.A.; Ghorbanian, A.; Mohammadzadeh, A. Moving Vehicle Detection, Tracking and Traffic Parameter Estimation from a Satellite Video: A Perspective on a Smarter City. Null 2019, 40, 8379–8394. [Google Scholar] [CrossRef]
  4. Shaikh, S.H.; Saeed, K.; Chaki, N. Moving Object Detection Approaches, Challenges and Object Tracking. In Moving Object Detection Using Background Subtraction; SpringerBriefs in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 5–14. ISBN 978-3-319-07385-9. [Google Scholar]
  5. Friedman, N.; Russell, S. Image Segmentation in Video Sequences: A Probabilistic Approach. arXiv 1997, arXiv:1302.1539. [Google Scholar] [CrossRef]
  6. Kim, K.; Chalidabhongse, T.H.; Harwood, D.; Davis, L. Background Modeling and Subtraction by Codebook Construction. In Proceedings of the 2004 International Conference on Image Processing, 2004. ICIP ’04, Singapore, 24–27 October 2004; Volume 5, pp. 3061–3064. [Google Scholar]
  7. Barnich, O.; Droogenbroeck, M.V. ViBE: A Powerful Random Technique to Estimate the Background in Video Sequences. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Taipei, Taiwan, 19–24 April 2009; pp. 945–948. [Google Scholar]
  8. Zhang, W.; Jiao, L.; Liu, F.; Li, L.; Liu, X.; Liu, J. MBLT: Learning Motion and Background for Vehicle Tracking in Satellite Videos. IEEE Trans. Geosci. Remote Sensing 2021, 60, 1–15. [Google Scholar] [CrossRef]
  9. Piccardi, M. Background Subtraction Techniques: A Review. In Proceedings of the 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), IEEE, The Hague, The Netherlands, 10–13 October 2004; Volume 4, pp. 3099–3104. [Google Scholar]
  10. Hung, M.-H.; Pan, J.-S.; Hsieh, C.-H. A Fast Algorithm of Temporal Median Filter for Background Subtraction. J. Inf. Hiding Multimed. Signal Process. 2014, 5, 33–40. [Google Scholar]
  11. Elgammal, A.; Harwood, D.; Davis, L. Non-Parametric Model for Background Subtraction. In Computer Vision—ECCV 2000; Vernon, D., Ed.; Springer: Berlin/Heidelberg, Germany, 2000; Volume 1843, pp. 751–767. ISBN 978-3-540-67686-7. [Google Scholar]
  12. Yang, T.; Wang, X.; Yao, B.; Li, J.; Zhang, Y.; He, Z.; Duan, W. Small Moving Vehicle Detection in a Satellite Video of an Urban Area. Sensors 2016, 16, 1528. [Google Scholar] [CrossRef] [Green Version]
  13. Chen, R.; Li, X.; Li, S. A Lightweight CNN Model for Refining Moving Vehicle Detection From Satellite Videos. IEEE Access 2020, 8, 221897–221917. [Google Scholar] [CrossRef]
  14. Chen, X.; Sui, H.; Fang, J.; Zhou, M.; Wu, C. A Novel AMS-DAT Algorithm for Moving Vehicle Detection in a Satellite Video. IEEE Geosci. Remote Sens. Lett. 2020, 19, 3501505. [Google Scholar] [CrossRef]
  15. Shi, F.; Qiu, F.; Li, X.; Zhong, R.; Yang, C.; Tang, Y. Detecting and Tracking Moving Airplanes from Space Based on Normalized Frame Difference Labeling and Improved Similarity Measures. Remote Sens. 2020, 12, 3589. [Google Scholar] [CrossRef]
  16. Shu, M.; Zhong, Y.; Lv, P. Small Moving Vehicle Detection via Local Enhancement Fusion for Satellite Video. Null 2021, 42, 7189–7214. [Google Scholar] [CrossRef]
  17. Pflugfelder, R.; Weissenfeld, A.; Wagner, J. On Learning Vehicle Detection in Satellite Video. arXiv 2020, arXiv:2001.10900. [Google Scholar]
  18. Zhang, J.; Zhang, J.; Jia, X. Learning Via Watching: A Weakly Supervised Moving Object Detector for Satellite Videos. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, IEEE, Brussels, Belgium, 11–16 July 2021; pp. 2333–2336. [Google Scholar]
  19. Xu, A.; Wu, J.; Zhang, G.; Pan, S.; Wang, T.; Jang, Y.; Shen, X. Motion Detection in Satellite Video. J. Remote Sens. GIS 2017, 6, 194. [Google Scholar] [CrossRef]
  20. Hu, Z.; Yang, D.; Zhang, K.; Chen, Z. Object Tracking in Satellite Videos Based on Convolutional Regression Network With Appearance and Motion Features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 783–793. [Google Scholar] [CrossRef]
  21. Feng, J.; Zeng, D.; Jia, X.; Zhang, X.; Li, J.; Liang, Y.; Jiao, L. Cross-Frame Keypoint-Based and Spatial Motion Information-Guided Networks for Moving Vehicle Detection and Tracking in Satellite Videos. ISPRS J. Photogramm. Remote Sens. 2021, 177, 116–130. [Google Scholar] [CrossRef]
  22. Pi, Z.; Jiao, L.; Liu, F.; Liu, X.; Li, L.; Hou, B.; Yang, S. Very Low-Resolution Moving Vehicle Detection in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624517. [Google Scholar] [CrossRef]
  23. Lei, J.; Dong, Y.; Sui, H. Tiny Moving Vehicle Detection in Satellite Video with Constraints of Multiple Prior Information. Int. J. Remote Sens. 2021, 42, 4110–4125. [Google Scholar] [CrossRef]
  24. Liu, X.; Zhao, G.; Yao, J.; Qi, C. Background Subtraction Based on Low-Rank and Structured Sparse Decomposition. IEEE Trans. Image Process. 2015, 24, 2502–2514. [Google Scholar] [CrossRef]
  25. Zhang, J.; Jia, X.; Hu, J. Error Bounded Foreground and Background Modeling for Moving Object Detection in Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2659–2669. [Google Scholar] [CrossRef] [Green Version]
  26. Zhang, J.; Jia, X.; Hu, J.; Chanussot, J. Online Structured Sparsity-Based Moving-Object Detection From Satellite Videos. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6420–6433. [Google Scholar] [CrossRef] [Green Version]
  27. Zivkovic, Z.; van der Heijden, F. Efficient Adaptive Density Estimation per Image Pixel for the Task of Background Subtraction. Pattern Recognit. Lett. 2006, 27, 773–780. [Google Scholar] [CrossRef]
  28. Kuhn, H.W. The Hungarian Method for the Assignment Problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef] [Green Version]
  29. Horn, B.K.P.; Schunck, B.G. Determining Optical Flow. Artif. Intell. 1981, 17, 185–203. [Google Scholar] [CrossRef]
  30. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Zhao, X.; Kim, T.-K. Multiple Object Tracking: A Literature Review. arXiv 2017, arXiv:1409.7618. [Google Scholar] [CrossRef]
  31. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence—Volume 2; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1981; pp. 674–679. [Google Scholar]
  32. Goyal, K.; Singhai, J. Recursive-Learning-Based Moving Object Detection in Video with Dynamic Environment. Multimed. Tools Appl. 2021, 80, 1375–1386. [Google Scholar] [CrossRef]
  33. Zhang, Y.; Wang, X.; Qu, B. Three-Frame Difference Algorithm Research Based on Mathematical Morphology. Procedia Eng. 2012, 29, 2705–2709. [Google Scholar] [CrossRef] [Green Version]
  34. Zivkovic, Z.; van der Heijden, F. Recursive Unsupervised Learning of Finite Mixture Models. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 651–656. [Google Scholar] [CrossRef]
Figure 1. The flow of the proposed method.
Figure 2. Distance matrix between the targets and detections. Sij measures the distance between the target Ti and the detection Dj.
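For readers who want to prototype the target-to-detection association illustrated in Figure 2, the sketch below builds a cost matrix S (one row per tracked target, one column per detection) and solves the assignment with the Hungarian method [28] as implemented in SciPy. The gating threshold max_cost and the example values are hypothetical placeholders, not values taken from the paper.
```python
# Minimal association sketch (not the authors' implementation).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost: np.ndarray, max_cost: float = 5.0):
    """Assign targets (rows) to detections (columns) by minimising total cost,
    then discard pairs whose cost exceeds the (assumed) gate max_cost."""
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]

# Two targets, three detections; values are illustrative only.
S = np.array([[1.2, 4.0, 6.3],
              [5.1, 0.8, 7.7]])
print(associate(S))  # [(0, 0), (1, 1)]
```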
Figure 3. Distance measurement between targets and detections. ds11 and ds12 are the spatial distances from T1 to D1 and D2, respectively; dp11 and dp12 are the directional distances from line V1 to D1 and D2, respectively.
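The two distances in Figure 3 can be computed as in the following sketch: the spatial distance is the Euclidean distance between a target's last position and a detection, while the directional distance is the perpendicular distance from the detection to the line through the target's two most recent positions. How the two terms are weighted into the final cost S_ij is not reproduced here.
```python
import numpy as np

def spatial_distance(t_pos, d_pos):
    """Euclidean distance between the target's last position and a detection."""
    t_pos, d_pos = np.asarray(t_pos, float), np.asarray(d_pos, float)
    return float(np.linalg.norm(d_pos - t_pos))

def directional_distance(p_prev, p_curr, d_pos):
    """Perpendicular distance from a detection to the line V through the
    target's two most recent positions (its current motion direction)."""
    p_prev, p_curr, d_pos = (np.asarray(p, float) for p in (p_prev, p_curr, d_pos))
    v = p_curr - p_prev          # motion direction
    w = d_pos - p_prev
    # |2-D cross product| / |v| is the point-to-line distance
    return float(abs(v[0] * w[1] - v[1] * w[0]) / (np.linalg.norm(v) + 1e-12))
```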
Figure 4. Types of tracklets. Panels (a–c) show a vehicle's tracklet when traveling on a straight road, a bend, and a U-turn, respectively. Panel (d) shows an erratic tracklet tracked from a non-vehicle object.
Figure 5. Calculation of the directional change angle. P1, P2, P3, and P4 are points on the tracklet. β2 and β3 are two directional change angles formed by P1, P2, P3, and P4.
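One way to compute the directional change angles (DCAs) of Figure 5 is sketched below: each angle is the absolute heading change between consecutive tracklet segments, and the per-tracklet mean and standard deviation correspond to the statistics reported in Table 1. The example tracklet is illustrative only.
```python
import numpy as np

def directional_change_angles(points):
    """DCAs (radians) between consecutive segments of a tracklet,
    e.g. the angles beta_2 and beta_3 formed by P1..P4 in Figure 5."""
    pts = np.asarray(points, dtype=float)
    seg = np.diff(pts, axis=0)                    # segment vectors
    heading = np.arctan2(seg[:, 1], seg[:, 0])    # heading of each segment
    dca = np.abs(np.diff(heading))
    return np.minimum(dca, 2 * np.pi - dca)       # wrap into [0, pi]

tracklet = [(0, 0), (1, 0), (2, 0.2), (3, 0.1)]   # illustrative points
dca = directional_change_angles(tracklet)
print(dca.mean(), dca.std())                      # per-tracklet statistics as in Table 1
```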
Figure 6. Decision tree of classification based on the attributes of the tracklet.
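The decision tree of Figure 6 can be approximated by simple rules over tracklet attributes, as in the sketch below. The attributes used here (tracklet length and mean DCA) follow Figures 4, 5 and 9, but the threshold values are hypothetical placeholders rather than those chosen in the paper.
```python
def is_vehicle_tracklet(length: int, mean_dca: float,
                        min_length: int = 5, max_mean_dca: float = 0.5) -> bool:
    """Rule-based sketch in the spirit of Figure 6 (thresholds are assumptions)."""
    if length < min_length:       # very short tracklets: likely noise (cf. Figure 9)
        return False
    if mean_dca > max_mean_dca:   # erratic heading changes: non-vehicle (Table 1, type D)
        return False
    return True
```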
Figure 7. Creation of confidence map. Panel (a) shows a confidence map, which was created by overlapping traces with different confidence values. Panel (b) shows a confidence map generated from true vehicle tracklets, where the jet colourmap indicates confidence value from 0 (blue) to 1 (red).
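A minimal sketch of the confidence-map construction in Figure 7 is given below: the points of each tracklet are rasterized onto a grid and a per-pixel confidence is accumulated. Keeping the maximum confidence where traces overlap is an assumption; other fusion rules would also fit the figure.
```python
import numpy as np

def build_confidence_map(tracklets, shape):
    """tracklets: iterable of (points, confidence) with points as (row, col) pairs."""
    cmap = np.zeros(shape, dtype=float)
    for points, conf in tracklets:
        for r, c in points:
            if 0 <= r < shape[0] and 0 <= c < shape[1]:
                cmap[r, c] = max(cmap[r, c], conf)   # keep the strongest trace per pixel
    return cmap
```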
Figure 8. Enlarged parts of the video data and manually annotated ground-truth vehicles. Image (a) shows the SkyBox satellite data [13], and (b) shows the ChangGuang satellite data.
Figure 9. Short tracklets of noise from the tops of buildings. Green dots indicate true vehicles and red dots depict false ones in the current frame.
Figure 10. An example of a long tracklet of a false target from the top of a building.
Figure 11. An example of a confidence map generated from tracklets.
Figure 12. Detection of Dataset 1: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets. Panels (a) Ground-truth, (b) Diff3, (c) MOG2, (d) ViBE, (e) AMS-DAT, (f) NPBSM, (g) LCNN, and (h) TFC (the proposed method).
Figure 13. Detection of Dataset 2: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets. Panels (a) Ground-truth, (b) Diff3, (c) MOG2, (d) ViBE, (e) AMS-DAT, (f) NPBSM, (g) LCNN, and (h) TFC (the proposed method).
Figure 14. Noise removal of NPBSM detection results under different area thresholds: yellow indicates correct targets, red shows false targets, and cyan depicts missed targets.
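The area-threshold post-processing varied in Figure 14 and Table 3 amounts to deleting connected components smaller than a given number of pixels from the binary foreground mask. A generic sketch using SciPy, not the authors' exact code, is shown below.
```python
import numpy as np
from scipy import ndimage

def remove_small_blobs(mask: np.ndarray, min_area: int) -> np.ndarray:
    """Return a copy of the binary mask with components smaller than min_area removed."""
    labels, _ = ndimage.label(mask)
    sizes = np.bincount(labels.ravel())   # sizes[0] counts background pixels
    keep = sizes >= min_area
    keep[0] = False                       # never keep the background label
    return keep[labels]
```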
Figure 15. Trend curves of precision, recall, and F1 score under different area thresholds (in pixels). Increasing precision is accompanied by decreasing recall.
Figure 16. Comparison of the tracklets extracted using different methods. (a) The initially detected candidates of moving objects; (b) Motion trajectories extracted by accumulation; (c) Tracklets extracted using our method, in which incorrect tracklets in the rectangles in (b) were removed.
Table 1. Statistics of DCAs for each tracklet type.

Type of Tracklet | DCA_min | DCA_max | Mean DCA | σ(DCA)
(A) Straight     | 0.003   | 0.472   | 0.166    | 0.121
(B) Bend         | 0.015   | 0.891   | 0.279    | 0.191
(C) U-turn       | 0.038   | 0.715   | 0.315    | 0.218
(D) Non-vehicle  | 0.293   | 2.511   | 1.178    | 0.909
Table 2. Evaluation of the detection results from different methods.

Data      | Method     | Precision (%) | Recall (%) | F1-Score (%)
Dataset 1 | Diff3      | 48.8          | 87.2       | 62.5
Dataset 1 | MOG2       | 68.0          | 72.0       | 69.0
Dataset 1 | ViBE       | 57.0          | 62.7       | 60.0
Dataset 1 | AMS-DAT    | 85.1          | 78.3       | 81.5
Dataset 1 | NPBSM      | 48.6          | 96.4       | 64.6
Dataset 1 | LCNN       | 90.3          | 84.2       | 87.2
Dataset 1 | TFC (ours) | 93.7          | 85.6       | 89.5
Dataset 2 | Diff3      | 30.1          | 77.7       | 43.4
Dataset 2 | MOG2       | 89.9          | 66.0       | 76.1
Dataset 2 | ViBE       | 75.9          | 60.6       | 67.4
Dataset 2 | AMS-DAT    | 77.9          | 67.6       | 72.4
Dataset 2 | NPBSM      | 54.8          | 96.6       | 69.9
Dataset 2 | LCNN       | 84.0          | 85.3       | 84.6
Dataset 2 | TFC (ours) | 81.8          | 91.8       | 86.5
Note: Bold indicates the highest score, and underline the lowest.
Table 3. Evaluation under different area thresholds.

Threshold (pixels) | 1    | 3    | 5    | 7    | 9    | 11   | 13   | 15
Precision (%)      | 30.6 | 61.3 | 74.4 | 82.2 | 85.8 | 88.4 | 88.4 | 87.7
Recall (%)         | 98.5 | 94.4 | 90.3 | 84.4 | 77.6 | 67.8 | 53.9 | 42.2
F1 score (%)       | 46.7 | 74.3 | 81.6 | 83.3 | 81.5 | 76.8 | 67.0 | 57.0
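As a quick arithmetic check on Tables 2 and 3, the F1 score is the harmonic mean of precision and recall; the snippet below reproduces one entry (TFC on Dataset 1).
```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(93.7, 85.6), 1))  # 89.5, matching the TFC row of Table 2
```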