Article

Dynamic Multiple Object Segmentation with Spatio-Temporal Filtering

1 School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2 Jiangsu Key Laboratory of Spectral Imaging & Intelligent Sense, Nanjing University of Science and Technology, Nanjing 210094, China
3 National Key Laboratory of Transient Physics, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(7), 2094; https://doi.org/10.3390/s24072094
Submission received: 2 February 2024 / Revised: 15 March 2024 / Accepted: 22 March 2024 / Published: 25 March 2024
(This article belongs to the Section Sensing and Imaging)

Abstract

This article primarily focuses on the localization and extraction of multiple moving objects in images taken from a moving camera platform, such as image sequences captured by drones. The positions of moving objects in the images are influenced by both the camera's motion and the movement of the objects themselves, while the background position in the images is related to the camera's motion. The main objective of this article was to extract all moving objects from the background in an image. We first constructed a motion feature space containing motion distance and direction, to map the trajectories of feature points. Subsequently, we employed a clustering algorithm based on trajectory distinctiveness to differentiate between moving objects and the background, as well as between feature points corresponding to different moving objects. The pixels between the feature points were then designated as source points, and complete moving objects were segmented within local regions by growing from these source points. We validated the algorithm on several sequences from the Video Verification of Identity (VIVID) program database and compared it with related algorithms. The experimental results demonstrated that, in the test sequences, when the feature point trajectories exceeded 10 frames, there was a significant difference in the feature space between the feature points on the moving objects and those on the background. Frames with correctly classified feature points accounted for 67% of the total frames. The positions of the moving objects in the images were accurately localized, with an average IoU value of 0.76 and an average contour accuracy of 0.57. This indicates that our algorithm effectively localizes and segments moving objects in images captured by moving cameras.

1. Introduction

Object segmentation plays an important role in the field of computer vision. Our primary focus in image processing is to identify the regions of interest within the image. Here, we refer to these regions as the foreground or objects, while the remaining areas are referred to as the background. For an input image, the purpose of a segmentation algorithm is to accurately distinguish the image into foreground and background regions. In most cases, the foreground in an image is considered one or several objects, such as a vehicle.
A typical application scenario of this technique is image region segmentation. Grayscale-based object segmentation is a common method in image segmentation that utilizes the grayscale information of the pixels in an image to segment the object regions. This method typically includes techniques such as thresholding, edge detection, and region growing, which analyze the grayscale levels of pixels to determine the boundaries between the target and the background, thus achieving object segmentation. Thresholding and edge-detection-based image segmentation algorithms divide an image into multiple regions; these regions are then further classified as foreground or background using other algorithms. On the other hand, region-growing algorithms first identify a portion of the foreground region and then expand this region to obtain the complete foreground area. These fundamental ideas can also be applied to multi-object segmentation. The main difference is that, in addition to distinguishing the foreground and background in the image, further differentiation needs to be performed among the portions attributed to the foreground.
To achieve more precise object segmentation, some research has focused on studying the objects that need to be segmented. For example, in [1,2], the authors focused on pedestrians on the road, whose segmentation objects often occupy large areas in the image, formed by interconnected but different small regions. In [1], the authors focused on the continuity of trajectories in global associations and used a conditional random field (CRF) model to better match the trajectories of objects. Ref. [2] was more focused on predicting the position of the pedestrians first, to reduce the difficulty of object association, thus achieving continuous detection of targets. Objects can also include athletes on a sports field [3]. Compared to pedestrians on the road, this type of object often involves a mutual influence between objects and fast movement speeds. The study of small animals in groups (birds [4,5], bats [6], ants [7], fish [8], cells [9], etc.) is another focus, where the movement of such targets is more difficult to estimate, and there is a higher degree of occlusion between them. Vehicles are also one of the key objects of segmentation that researchers focus on [10]. First, vehicles in images are a very typical type of rigid object, and compared to other objects such as pedestrians or animals, vehicles typically have distinct shapes and structural characteristics. Second, vehicles come in different types and brands, and there is a practical demand for research on vehicle recognition and classification.
Currently, research on object segmentation in images mainly focuses on single-frame images or moving objects in static cameras. However, with the rise of drone technology, identifying segmented objects from image sequences captured by cameras on moving platforms has become more valuable and meaningful. Compared to previous application scenarios, the changing background adds complexity to the distinction between foreground and background in images. In this paper, we aimed to differentiate between the foreground and background in the current frame by analyzing the distinct motion patterns of the background and objects across multiple frames. This allows us to identify a portion of the foreground, which satisfies the preconditions for region-growing algorithms. Compared to edge-based segmentation, feature point tracking and small-scale region-growing algorithm computations involve significantly lower computational overheads, which is advantageous for processing large image sequences.
To accomplish this task, we construct tracks of the detected points using the optical flow method [11]. Then, through point classification, we can distinguish between the foreground (objects) and the background, as well as identify different objects. Finally, by using the adjusted object points as centers, the contour of each object can be segmented. Our method makes the following contributions:
(1)
The dual-threshold DBSCAN algorithm is used to temporally classify and differentiate moving objects with inconsistent motion.
(2)
An iterative algorithm based on the sample mean and variance is used to differentiate spatial objects.
(3)
An improved two-step contour generation algorithm is used to extract the contours of the objects.
The remainder of this paper is organized as follows: An overview of the related work is presented in Section 2. Section 3 presents the feature point extraction algorithm for spatiotemporal clustering and filters the feature points by the plane constraint of feature points. Section 4 shows the experimental results for different data sets. Then, the results are analyzed. Finally, Section 5 provides the conclusions.

2. Related Work

This section introduces the latest techniques for object segmentation. Traditional video segmentation techniques can generally be divided into unsupervised, semi-supervised, and supervised methods.
Unsupervised Video Segmentation: Unsupervised video segmentation methods do not rely on any annotated information and achieve segmentation by clustering or matching pixels in a video sequence. They include solutions for hierarchical segmentation, temporal superpixels, and supervoxels [12]. These methods mostly rely on a bottom-up approach, starting from low-level features in images or videos and gradually deriving higher-level semantic information and object segmentation results. In [13], the authors utilized a voting mechanism based on the similarity of pixels in the image to perform object segmentation and iteratively refine the results. Pixel-level algorithms also include those combining the first-order directional derivatives of the facet model [14]. These pixel-based approaches are often used for small object segmentation. However, it is important to note that the performance of pixel-based methods for small object segmentation can be influenced by various factors, such as the resolution of the image, the quality of object boundaries, and the level of noise present in the image. Therefore, it is necessary to carefully design and optimize the algorithm parameters to achieve accurate and robust segmentation results for small objects. For unsupervised segmentation of large-scale objects, motion characteristics play an important role. Large-scale objects often exhibit significant motion characteristics, such as movement or complex deformations, which can be utilized to locate and segment the objects in image sequences. This also applies to the requirements for long-term tracking of objects. In [11], an optical-flow-feature fusion-based video object detection method was proposed, which considers the temporal coherence among video frames to solve the occlusion problem. In terms of multi-object tracking, in [15] a method for monitoring the ruminant behavior of multiple cows based on an optical flow method and an inter-frame difference method was proposed; however, that method was only suitable for short-term and small-scale motion tracking. In [16], the authors detected moving objects and performed blob analysis; then, a Kalman filter or particle filter was applied to every object to obtain a predicted position, and Munkres' version of the Hungarian algorithm was used to distinguish objects. However, this method was greatly influenced by the motion model.
In semi-supervised object segmentation, only a small portion of the image data are manually labeled as object or background. In this scenario, the methods utilize the labeled data to guide the entire object segmentation process. The core idea of semi-supervised methods is to learn the features of the object and background based on the known label information. A common approach is to use graph cuts or conditional random fields to minimize an energy function or maximize a probability model for segmentation. In [12], the authors employed a modified version of the density peak clustering algorithm to generate super-trajectories. Trajectories were generated starting from known objects in the first frame. To enhance the reliability of the trajectories, a reverse tracking strategy was introduced to counteract the impact of slight camera movements. In [17], the authors performed segmentation in "bilateral space", which is a high-dimensional feature space [18]; this method segments images by using interpolation and locally connected labeling in bilateral space to label different pixels.
Supervised methods require significant human intervention [19,20]. The traditional supervision method needs to correct the image sequence frame by frame, to obtain a high-quality boundary. This huge workload makes the supervised approach more suitable for specific scenarios such as video post-production.
Deep learning algorithms can be considered a special type of fully supervised algorithm, because they typically require a large amount of labeled data for training. In [21], the authors focused on the interactions of objects and built a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial transformer decoder layer to model the spatial-temporal relationships of the objects. In [22], the authors reduced the number of model parameters by introducing convolutional operators. They then addressed overfitting and other training issues by establishing a compact sample distribution model and modifying the update strategy. In [23], the authors used a high-capacity convolutional neural network for object localization and segmentation. To address the issue of scarce training data, they employed auxiliary tasks to pre-train the model and fine-tuned the training results based on actual detection. In [24], the authors proposed a biomimetic retinal neural network based on neurodynamic temporal filtering and multi-form two-dimensional spatial Gabor filtering; by using two spatial filters that are orthogonal to each other, the movement direction can be accurately estimated. In [25], the authors combined CenterNet with object recognition to enhance the localization and identification of objects, through a series of detailed designs to reduce ambiguity in network training. It is important to note that, while deep learning algorithms excel in fully supervised settings, they also have some limitations. For instance, in the case of object segmentation tasks, acquiring pixel-level labeled data can be challenging, leading to issues such as data imbalance, labeling errors, and boundary ambiguity.

3. Algorithm Scheme

In related research on object segmentation, the definition and extraction of the objects is the initial consideration. In this paper, point tracks in the image are defined, and the motion information in the image is determined from the motion across multiple frames.
We first extract the feature points of the image using a curvature model and describe them with a SURF descriptor [26]. In the next frame, the optical flow method [11] is used to associate the same feature points, from which tracks are constructed. The trajectories represent the temporal correlation of the video sequence. Then, using the motion consistency of the feature points on a rigid body, the tracks of the feature points belonging to the same object are distinguished from the other tracks. Finally, the points on an object are used as reference points to obtain the full contour of the object. An overall algorithm flow chart is shown in Figure 1.

3.1. Image Initialization and Feature Point Detection

The curvature of a curve reflects the degree of bending at each point. For a point on a continuous curve, the curvature at this point can be calculated according to Equation (1):
$$\kappa = \frac{y''}{\left(1 + y'^2\right)^{3/2}},$$
where $y''$ represents the second derivative of the curve at the point $M$, $y'$ represents the first derivative at the point, and $\kappa$ represents the curvature of a point on the curve. The image is a discrete signal, which can be described by the facet model [14]. Based on this model, the first and second derivatives of the image can be calculated. In this paper, the feature points are matched between adjacent frames; therefore, the feature points need to be described. The SURF descriptor was chosen because of its repeatability, distinctiveness, and robustness.
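As a minimal numerical illustration of Equation (1) (not the facet-model implementation used in our algorithm, which estimates the derivatives from the image as in [14]), the curvature of a sampled curve can be computed with finite differences; the use of np.gradient here is an assumption made purely for illustration.

```python
import numpy as np

def curvature(x, y):
    """Curvature of a sampled curve y(x) via Equation (1), using finite differences.

    This is only a sketch: the paper estimates the first and second derivatives
    from the image with the facet model instead of np.gradient.
    """
    dy = np.gradient(y, x)        # first derivative y'
    d2y = np.gradient(dy, x)      # second derivative y''
    return d2y / (1.0 + dy ** 2) ** 1.5

# Example: a parabola y = x^2 has curvature 2 at x = 0.
x = np.linspace(-1.0, 1.0, 2001)
print(curvature(x, x ** 2)[1000])   # approximately 2.0
```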

3.2. Trajectory Correlation

To record the trajectories, we construct the feature vector $F_i^{frame} = \left( v_i^{frame}, L_i^{frame}, \theta_i^{frame} \right)$, where $v_i^{frame}$ is the velocity, $L_i^{frame}$ is the accumulated distance in the x and y directions, and $\theta_i^{frame}$ represents the angle of the point's trajectory. These can be calculated as follows:
$$v_i^{frame} = p_i^{frame} - p_i^{frame-1},$$
$$L_i^{frame} = \sum_{num = refer+1}^{frame} \left| p_i^{num} - p_i^{num-1} \right|,$$
$$\theta_i^{frame} = \frac{L_i^{frame}(y)}{L_i^{frame}(x)},$$
where $i$ is the label of the feature point and $frame$ is the frame number. $p_i^{num}$ represents the image coordinates of the $i$-th feature point in frame $num$. $refer$ is the reference frame, with an initial value of 1; this value changes when the trajectory is updated. $L_i^{frame}$ can be regarded as the integral of the absolute value of $v_i^{frame}$ from the reference frame to the current frame.
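The following sketch shows how the feature vector of Equations (2)–(4) could be computed for a single trajectory; the array layout (one row of (x, y) coordinates per frame) and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

def trajectory_features(track, refer=0):
    """Compute (v, L, theta) of Equations (2)-(4) for one feature point.

    track: array of shape (num_frames, 2) holding the (x, y) image coordinates
           of the point from the reference frame to the current frame.
    refer: index of the reference frame within `track`.
    """
    track = np.asarray(track, dtype=float)
    v = track[-1] - track[-2]                       # Eq. (2): inter-frame displacement
    steps = np.abs(np.diff(track[refer:], axis=0))  # |p^num - p^(num-1)| per axis
    L = steps.sum(axis=0)                           # Eq. (3): accumulated distance (x, y)
    theta = L[1] / (L[0] + 1e-12)                   # Eq. (4): direction ratio L(y)/L(x)
    return v, L, theta

# Example: a point drifting 1 pixel per frame in x and 2 pixels per frame in y.
track = np.stack([np.arange(10), 2 * np.arange(10)], axis=1)
print(trajectory_features(track))   # v = [1, 2], L = [9, 18], theta ~ 2.0
```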
According to Equations (2) and (3), the speed and distance of all points are shown in Figure 2.
In Figure 2, the red lines represent the feature points located on the objects. As can be seen, in terms of velocity the object points do not stand out from the background points in Figure 2. After integration, the contrast between the moving objects and the background becomes more pronounced.
Compared to the speed, the distance of the trajectory is more discriminative. To distinguish the objects from the background, polar coordinates are used to represent the feature points. The parameter $(L, \theta)$ representation of the trajectories is shown in Figure 3.
The method of polar coordinates includes the moving distance and direction of the points, which is more conducive to the subsequent clustering.

3.3. Temporal Clustering and Plane Constraint

Although the spread of the trajectory distances of the background feature points is large, their density is also high. This makes the points in the background distinguishable from the points on the objects. For multiple objects, it should be taken into account that different objects do not move consistently with each other, so it is not possible to simply set the number of clusters to two. Therefore, density-based spatial clustering of applications with noise (DBSCAN) was chosen as the basis for clustering the distance data [27].
DBSCAN relies on a distance threshold between data points. When the distance between two elements in the data set is less than the threshold $\varepsilon$, the two elements are considered density-reachable. All density-reachable elements are considered part of the same cluster. When the data are one-dimensional, the clustering proceeds as shown in Figure 4.
In this paper, the clustering threshold between moving objects needed to be different from the clustering threshold between the objects and the background. So, the dual-threshold DBSCAN algorithm was used for the temporal clustering of the data.
According to Algorithm 1 and Figure 5, the entire processing procedure is as follows:
Algorithm 1: Process of temporal clustering and plane constraint
Input: all feature points
parameter:
  1: t h r 1 = smaller clustering threshold
  2: t h r 2 = larger clustering threshold
Output: the clusters of feature points representing different motion modes
Do:
  1: distinguish the points with different movement patterns using t h r 1 .
  2: use the background points to calculate the plane constraints.
  3: use t h r 2 to re-cluster the non-background feature points and distinguish the objects of different motion modes.
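Before walking through the three steps in detail, the sketch below illustrates the two clustering passes of Algorithm 1 on the one-dimensional accumulated-distance values, using the DBSCAN implementation from scikit-learn; the threshold values thr1 and thr2 and the synthetic data are illustrative assumptions rather than the settings used in our experiments, and the plane-constraint step between the two passes is sketched separately at the end of this subsection.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dual_threshold_clustering(L, thr1=2.0, thr2=8.0, min_samples=3):
    """Two-pass (dual-threshold) DBSCAN over accumulated trajectory distances L.

    Pass 1 (eps = thr1): separate candidate object points from the background,
    taking the largest cluster as the background (Step 1 / Equation (5)).
    Pass 2 (eps = thr2): re-cluster the remaining points so that objects with
    different motion modes fall into different clusters (Step 3).
    Returns (is_background, object_labels), with object_labels = -1 for background.
    """
    L = np.asarray(L, dtype=float).reshape(-1, 1)

    labels1 = DBSCAN(eps=thr1, min_samples=min_samples).fit_predict(L)
    valid = labels1[labels1 >= 0]
    background_label = np.bincount(valid).argmax() if valid.size else -1
    is_background = labels1 == background_label

    object_labels = np.full(len(L), -1)
    idx = np.where(~is_background)[0]
    if idx.size:
        labels2 = DBSCAN(eps=thr2, min_samples=1).fit_predict(L[idx])
        object_labels[idx] = labels2
    return is_background, object_labels

# Example: background distances around 3, two objects around 20 and 40.
L = np.concatenate([np.random.normal(3, 0.5, 50),
                    np.random.normal(20, 0.5, 5),
                    np.random.normal(40, 0.5, 5)])
print(dual_threshold_clustering(L))
```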
Step 1: Try to distinguish the objects and the background first. Since the movements of the objects are unpredictable, a smaller threshold is used to distinguish the objects from the background. Then, mark the cluster with the largest amount of data as the background cluster:
$$\text{susTarget}_{index_i}^{frame} = \begin{cases} \text{true}, & p_{index_i} \text{ not in the largest cluster} \\ \text{false}, & p_{index_i} \text{ in the largest cluster.} \end{cases}$$
Step 2: Use the background points to calculate the plane constraints.
Since the background points cannot be completely separated by the clustering algorithm, an additional plane constraint is required for the classification of the feature points. The projection between two planes satisfies a homography transformation, so the coordinates of points on the ground satisfy a homography transformation under the overhead perspective:
$$p^{frame} = H p^{ref} = \begin{bmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{bmatrix} p^{ref},$$
where $p^{ref}$ are the feature points of the reference frame and $p^{frame}$ are the feature points of the $frame$-th frame. Equation (6) can be rewritten as
$$s^{frame} = p^{frame} - p^{ref} = (H - I)\, p^{ref},$$
where $s^{frame}$ indicates the distance from the reference frame to the $frame$-th frame, and $I$ is a 3-by-3 identity matrix. From Equation (7) we can obtain
$$s_x^{frame} = a_1 p_x^{ref} + b_1 p_y^{ref} + c_1,$$
$$s_y^{frame} = a_2 p_x^{ref} + b_2 p_y^{ref} + c_2,$$
where $s_x^{frame}$ and $s_y^{frame}$ respectively represent the x and y components. So, the triples $S_{x,index_i}^{frame} = \left( p_x^{ref}, p_y^{ref}, s_x^{frame} \right)$ and $S_{y,index_i}^{frame} = \left( p_x^{ref}, p_y^{ref}, s_y^{frame} \right)$ of the background points form two plane equations, $P_x^{frame}$ and $P_y^{frame}$. For a spatial point set, only three non-collinear points are needed to determine the plane. Then, the points that were not judged as background are judged again: the points that satisfy the plane equations in both directions are also considered to be in the background,
$$\text{isTarget}_{index_i}^{frame} = \begin{cases} \text{false}, & \text{if } S_{x,index_i}^{frame} \in P_x^{frame} \text{ and } S_{y,index_i}^{frame} \in P_y^{frame} \\ \text{true}, & \text{otherwise.} \end{cases}$$
Step 3: A large threshold t h r 2 is used to distinguish different objects. For multiple objects, different motion modes produce different motion eigenvalues.
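For Step 2, a minimal sketch of the plane constraint of Equations (8) and (9) is given below: the two planes are fitted to the background points by least squares, and the remaining points are re-checked against them. The residual tolerance tol is an illustrative assumption rather than a value from the paper.

```python
import numpy as np

def fit_plane(px_ref, py_ref, s):
    """Least-squares fit of s = a*px_ref + b*py_ref + c (Equation (8))."""
    A = np.column_stack([px_ref, py_ref, np.ones_like(px_ref)])
    coeffs, *_ = np.linalg.lstsq(A, s, rcond=None)
    return coeffs                                   # (a, b, c)

def recheck_with_planes(ref_pts, disp, is_background, tol=1.5):
    """Re-classify non-background points that satisfy both plane equations.

    ref_pts:       (N, 2) reference-frame coordinates of all feature points.
    disp:          (N, 2) displacements s = p^frame - p^ref of the same points.
    is_background: boolean mask from the first clustering pass.
    Returns the updated background mask (the complement of Equation (9)).
    """
    px, py = ref_pts[:, 0], ref_pts[:, 1]
    coef_x = fit_plane(px[is_background], py[is_background], disp[is_background, 0])
    coef_y = fit_plane(px[is_background], py[is_background], disp[is_background, 1])

    A = np.column_stack([px, py, np.ones_like(px)])
    res_x = np.abs(A @ coef_x - disp[:, 0])         # residual against plane P_x
    res_y = np.abs(A @ coef_y - disp[:, 1])         # residual against plane P_y
    on_both_planes = (res_x < tol) & (res_y < tol)
    return is_background | on_both_planes
```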

3.4. Spatial Point Clustering

When the motion modes of the objects are consistent, which happens in many scenarios, spatial clustering is necessary, as the number of objects cannot be determined and the position relationship of different objects is also uncertain. To better distinguish different objects, we use an iterative algorithm based on the sample mean and variance.
The main idea of the clustering is to solve for the cluster centers and regroup all the data according to the Euclidean distance between each data point and the cluster centers. When the standard deviations $std_x, std_y$ of a class with center point $(x, y)$ exceed the threshold $\theta_s$, the cluster is split. If $std_x$ is greater than $std_y$, the new center points $(x^+, y)$ and $(x^-, y)$ after splitting are
$$x^+ = x + s \cdot std_x,$$
$$x^- = x - s \cdot std_x,$$
where $s$ is the scaling factor, which is generally 0.5. If $std_y$ is greater than $std_x$, the value of $y$ changes in the same way. After splitting, the points are re-clustered and the new cluster centers are calculated.
When there are two clusters whose center distance is less than the threshold, the two clusters are merged. In the iteration process, if the cluster centers change and the number of iterations is less than the maximum number of iterations, the iteration continues. The algorithm flow is shown in Algorithm 2.
Algorithm 2: Process of spatial clustering
Input: object points sets
parameter:
  1: κ = the expected number of cluster centers
  2: θ s = the maximum standard deviation of a cluster
  3: θ c = the minimum distance between two cluster centers
  4: I = maximum iteration times
Output: space clustering results in the current frame
While: the clustering results change or the iteration times are less than I:
  1: Classify according to the cluster center
  2: Re-determine the cluster center
  3: Determine whether to split a class based on the threshold
  4: Re-determine the cluster center
  5: Merge the clusters that meet the criteria
  6: Re-determine the cluster center
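A compact sketch of the split-and-merge iteration of Algorithm 2 (an ISODATA-style procedure) is given below; the parameter values and the synthetic example are illustrative assumptions only.

```python
import numpy as np

def spatial_clustering(points, theta_s=10.0, theta_c=15.0, max_iter=20, s=0.5):
    """ISODATA-style split/merge clustering of object feature points (cf. Algorithm 2).

    points:  (N, 2) image coordinates of the points classified as objects.
    theta_s: maximum standard deviation of a cluster before it is split.
    theta_c: minimum distance between two cluster centers before they are merged.
    Returns (labels, centers).
    """
    points = np.asarray(points, dtype=float)
    centers = points.mean(axis=0, keepdims=True)              # start from one cluster

    for _ in range(max_iter):
        old_centers = centers.copy()

        # Steps 1-2: assign points to the nearest center and recompute the centers.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        groups = [points[labels == k] for k in range(len(centers)) if np.any(labels == k)]
        centers = np.array([g.mean(axis=0) for g in groups])

        # Steps 3-4: split clusters whose spread exceeds theta_s (Equations (10)-(11)).
        new_centers = []
        for g, c in zip(groups, centers):
            std = g.std(axis=0)
            if std.max() > theta_s and len(g) > 2:
                axis = int(std.argmax())                       # split along the wider axis
                for sign in (1.0, -1.0):
                    split = c.copy()
                    split[axis] += sign * s * std[axis]
                    new_centers.append(split)
            else:
                new_centers.append(c)
        centers = np.array(new_centers)

        # Steps 5-6: merge centers that are closer than theta_c.
        merged = []
        for c in centers:
            if merged and min(np.linalg.norm(c - m) for m in merged) < theta_c:
                continue
            merged.append(c)
        centers = np.array(merged)

        if len(centers) == len(old_centers) and np.allclose(centers, old_centers):
            break

    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

# Example: two well-separated groups of points should yield two cluster centers.
pts = np.vstack([np.random.normal([50, 50], 5, (30, 2)),
                 np.random.normal([150, 120], 5, (30, 2))])
labels, centers = spatial_clustering(pts)
print(centers)
```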

3.5. Object Contour Generation

To segment the objects' contours, this paper segments the object shape based on the detected feature points. We use a combination of point clustering and image matching to judge the label of each pixel around the selected feature points. For points on the plane, the corresponding points between two frames should satisfy the corresponding homography relationship. This can be expressed as
$$p^{frame} = H_{refer}^{frame}\, p^{refer}.$$
In Equation (13), $H_{refer}^{frame}$ represents the homography matrix from frame $refer$ to frame $frame$. Note that the homography matrix can only constrain points in the background; these points were distinguished in Section 3.3. After obtaining $H_{refer}^{frame}$, the image can be warped to eliminate the influence of the moving platform and obtain the region of interest. This process is shown in Figure 6.
In Figure 6, the background image was generated using a Gaussian mixture model. As the number of frames increases, the gray value of each pixel is jointly determined by the gray values at the same position from the reference frame to the current frame. Since objects present in the background model are difficult to distinguish, the objects in the reference frame are also projected in Figure 6c. The object area is determined by the object locations extracted in Section 3.3. In this case, the selected area will be affected by the reference frame, so it is necessary to make a further judgment on the pixels according to the moving points.
In Figure 7, we begin by fusing the points with the region of interest, which is referred to as the first extraction. Then, the marked points on each object are taken as seed points to generate the contour within the region of interest, which is referred to as the second extraction. Finally, the contours of the different objects are generated.
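A rough OpenCV sketch of this two-step extraction is given below: the background image is aligned to the current frame with the estimated homography, the binarized difference provides the region of interest, and the object feature points serve as seeds for flood filling. The thresholds and the use of a single reference image in place of the Gaussian mixture background model of Figure 6 are simplifying assumptions made for illustration.

```python
import cv2
import numpy as np

def extract_object_masks(background, frame, bg_pts_ref, bg_pts_cur, seed_pts,
                         diff_thresh=30, flood_tol=10):
    """Two-step object extraction, simplified.

    background:  grayscale background image built from the reference frame.
    frame:       current grayscale frame.
    bg_pts_ref:  (N, 2) background feature points in the reference frame (N >= 4).
    bg_pts_cur:  (N, 2) the same points tracked into the current frame.
    seed_pts:    list of (x, y) object feature points used as seeds.
    """
    h, w = frame.shape[:2]

    # Align the background to the current frame with the homography H_refer^frame.
    H, _ = cv2.findHomography(np.float32(bg_pts_ref), np.float32(bg_pts_cur), cv2.RANSAC, 3.0)
    warped_bg = cv2.warpPerspective(background, H, (w, h))

    # First extraction: binarized difference image as the region of interest.
    diff = cv2.absdiff(frame, warped_bg)
    _, roi = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)

    # Second extraction: flood fill from each object seed point, restricted to the ROI.
    masks = []
    for (x, y) in seed_pts:
        mask = np.zeros((h + 2, w + 2), np.uint8)
        mask[1:-1, 1:-1] = cv2.bitwise_not(roi)      # pixels outside the ROI act as barriers
        cv2.floodFill(frame.copy(), mask, (int(x), int(y)), 0,
                      loDiff=flood_tol, upDiff=flood_tol,
                      flags=4 | cv2.FLOODFILL_MASK_ONLY | (1 << 8))
        masks.append(mask[1:-1, 1:-1] == 1)          # pixels filled from this seed
    return roi, masks
```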

3.6. Trajectory Update

Over time, the number of trajectory points will inevitably decrease due to the movement of the platform. It is necessary to update the feature points and the background image while preserving the trajectory information. Therefore, the threshold $thr_{num}$ is set as follows:
$$updatesign = \begin{cases} \text{true}, & \text{if } num^{frame} < thr_{num} \cdot num^{refer} \\ \text{false}, & \text{otherwise,} \end{cases}$$
where $num^{frame}$ represents the number of feature point trajectories retained in the current frame and $num^{refer}$ represents the number of feature points detected in the reference frame. The empirical value of $thr_{num}$ is 0.5. In addition, the feature points are not updated when the trajectory duration is less than 15 frames. When the trajectory information is updated, the accumulated trajectory distance $L$ is reset as the reference frame changes. To reduce the influence of the update, a certain initial value is given to the feature points existing in the object area. Then, in the subsequent frames, the feature points on the objects can still be distinguished from the feature points on the background through time-domain filtering.
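The update criterion of Equation (14), together with the 15-frame minimum trajectory duration mentioned above, can be summarized in a few lines; this sketch covers only the decision rule and omits the bookkeeping of the reference frame and the re-initialization of object points.

```python
def need_update(num_frame, num_refer, track_len, thr_num=0.5, min_len=15):
    """Equation (14): trigger a feature-point/background update when too few
    trajectories survive, but never before the trajectory is min_len frames long."""
    if track_len < min_len:
        return False
    return num_frame < thr_num * num_refer

# Example: 40 of 100 reference points survive after 20 frames -> update.
print(need_update(40, 100, 20))   # True
```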

4. Experiment and Result

This study intended to find multiple moving objects (mainly rigid bodies such as vehicles) under a moving platform. So, we used some sequences in the Video Verification of Identity (VIVID) program database [28] to verify the algorithm. The selected image sequences had interference such as object intersection and object occlusion. Part of the experimental image sequences is shown in Figure 8. Table 1 lists the major challenges of the relevant sequences.
For the algorithm designed in this paper, no preconditions were used other than the image sequences. The values of the constant parameters are described in Section 3. We implemented our method in MATLAB 2018 on a computer with a 3 GHz Intel processor and 16 GB of RAM.

4.1. Moving Point Segmentation

The key point of the algorithm is to determine the number and location of objects according to different movement modes. Therefore, after the tracks have been established, the primary task is to filter the moving points.
Part of the experimental results are shown in Figure 9 and Figure 10.
In the data processing, all images were transformed into gray images. As seen in Figure 9 and Figure 10, the algorithm could identify the feature points belonging to the moving objects from a large number of feature points, and it classified the moving feature points correctly.
We evaluated the model performance using the F1-score:
$$P_i = \frac{TP}{TP + FP},$$
$$R_i = \frac{TP}{TP + FN},$$
$$F = \frac{2 P_i R_i}{P_i + R_i},$$
where $TP$ represents the number of points that were correctly classified, $FP$ represents the incorrectly classified points, and $FN$ represents the points on the objects that were not detected.
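For reference, Equations (15)–(17) can be computed directly from the point-classification counts as follows (a minimal sketch; the example counts are arbitrary):

```python
def point_f1(tp, fp, fn):
    """Precision, recall and F-value of the point classification (Equations (15)-(17))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(point_f1(tp=18, fp=2, fn=4))   # (0.9, ~0.818, ~0.857)
```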
The F-values calculated for the first 300 frames of each sequence are shown in Figure 11. Table 2 shows the proportion of frames that accurately differentiated between moving target points and background points. In Seq. (a), the feature point update was triggered a total of 5 times, and due to its relatively simple background, the F-values for the entire sequence were consistently high. The points on the objects and the background were well distinguished. The decrease in the F-value was mainly caused by the updates of the feature points. The overall accuracy of the sequence reached 84%. After the feature points had been updated, the algorithm was able to correctly differentiate between moving feature points and background points within 5 frames. Seq. (b) was similar to Seq. (a), with an overall accuracy of 85%. Multiple objects in Seq. (a) and Seq. (b) were also correctly distinguished from each other. This demonstrated the multi-target segmentation ability of the algorithm presented in this study. In Seq. (c), due to the limited number of stable feature points that could be tracked on the background, it was challenging to establish long-term trajectories, resulting in a higher frequency of updates. In the 300 frames of images tested, the feature points were updated 15 times. Additionally, significant camera motion combined with minimal object movement after trajectory updates could have led to some feature points on the objects being misclassified as background points. In Seq. (d), the overall detection success rate was also 85%. According to Figure 11, there were some frames in the sequence where F-values were missing. Comparing the data, during these frames the camera focal length was changing, causing an overall blurring of the images and significantly impacting the detection of feature points, so that no feature points were detected on the objects. After the image quality was restored, the detection performance returned to normal, demonstrating the robustness of the algorithm. In Seq. (e), which contained large-scale vehicles as objects, the overall accuracy was only 52%. Upon analyzing the classification of feature points, we found that this was due to a significant overlap between the objects and the background. In these overlapping regions, it was easier for feature points to be generated. However, when counting such feature points, we did not consider them as correct feature points on the objects, since their trajectories were unstable. This was the main reason for the significant fluctuations in the F-value. Due to the complexity of the background, the number of feature points on the background remained stable, leading to only four updates of the feature points. The main challenges of Seq. (f) were intense camera movement and rapid changes in target scale, affecting the stability of the trajectories of both object and background feature points. This also resulted in a faster update of feature points. In the 300 frames of images, the feature points were updated 16 times, which was similar to Seq. (c). This was the reason for the overall accuracy of 75%.
In general, after the trajectory disappears, the updating strategy may cause fluctuations in point classification in a short period. Then, it quickly adjusts and achieves good results. The experimental results indicated that the algorithm proposed in this study performed well in handling occlusion in image sequences. In the case of multiple target vehicles in an image, the algorithm could effectively differentiate feature points belonging to different targets based on their trajectories.

4.2. Segmentation Results on the Test Dataset

Locating the points on the objects demonstrated the localization function. Building on this, we further addressed the object segmentation problem through background differencing and seed generation algorithms. In object segmentation algorithms, there are two commonly used evaluation metrics: the intersection-over-union score ($Y$) and the contour accuracy ($F$) [12].
The intersection-over-union score is defined as
$$Y = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| M_i \cap G_i \right|}{\left| M_i \cup G_i \right|},$$
where $M_i$ is the segmentation mask of the $i$-th object, $G_i$ is the corresponding ground-truth mask, and $n$ is the number of objects.
The contour accuracy $F$ measures how well the segmented contours match the ground-truth contours. The contour-based precision $P_i$ and recall $R_i$ between $c_M$ and $c_G$ can be calculated as follows:
$$P_i = \frac{TP}{TP + FP},$$
$$R_i = \frac{TP}{TP + FN},$$
where, as in Equations (15) and (16), $TP$ indicates that both the measured value and the true value are object boundaries, $FP$ represents the parts incorrectly detected as boundaries, and $FN$ represents the undetected boundaries. The contour accuracy $F$ can then be defined as
$$F = \frac{1}{n} \sum_{i=1}^{n} \frac{2 P_i R_i}{P_i + R_i}.$$
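A sketch of how these two scores could be computed from binary masks is shown below. The contour accuracy here is a simplified variant that matches boundary pixels within a small dilation tolerance rather than the exact bipartite boundary matching used in [12]; the tolerance value and the synthetic example are assumptions.

```python
import cv2
import numpy as np

def iou_score(mask, gt):
    """Intersection-over-union of one object (one term of Equation (18))."""
    mask, gt = mask.astype(bool), gt.astype(bool)
    union = np.logical_or(mask, gt).sum()
    return np.logical_and(mask, gt).sum() / union if union else 1.0

def contour_accuracy(mask, gt, tol=2):
    """Simplified contour F-measure (Equations (19)-(21)) for one object.

    Boundary pixels are matched if they fall within `tol` pixels of a boundary
    pixel of the other mask (approximated by dilation), which is a looser
    criterion than the bipartite matching of the original benchmark.
    """
    def boundary(m):
        m = m.astype(np.uint8)
        return cv2.morphologyEx(m, cv2.MORPH_GRADIENT, np.ones((3, 3), np.uint8)) > 0

    bm, bg = boundary(mask), boundary(gt)
    kernel = np.ones((2 * tol + 1, 2 * tol + 1), np.uint8)
    bg_dil = cv2.dilate(bg.astype(np.uint8), kernel) > 0
    bm_dil = cv2.dilate(bm.astype(np.uint8), kernel) > 0

    precision = (bm & bg_dil).sum() / bm.sum() if bm.sum() else 1.0
    recall = (bg & bm_dil).sum() / bg.sum() if bg.sum() else 1.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: two overlapping squares on a 100 x 100 grid.
a = np.zeros((100, 100), bool); a[20:60, 20:60] = True
b = np.zeros((100, 100), bool); b[25:65, 25:65] = True
print(iou_score(a, b), contour_accuracy(a, b))
```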
Qualitative segmentation results for the sequences are presented in Figure 12 and Figure 13.
As shown in Table 3, Figure 12 and Figure 13, when the overall size of the objects was small, the algorithm in this paper could effectively recognize multiple objects and extract their contours. In some sequences, the contours of the objects were not well segmented, because the objects were similar to the background while the parameters were kept consistent across all sequences.
In Seq. (a), the results demonstrated that the algorithm could effectively segment the objects. The errors mainly stemmed from the shadows formed by the vehicles on the road. Since the shadows moved along with the vehicles, they also complied with the definition of moving targets in the images. For Seq. (b), when there was an intersection between objects, the algorithm could effectively differentiate between different vehicle objects due to their distinct trajectories. However, during the process of intersection, it may not be possible to accurately separate the portions occupied by each object. The segmentation results for Seq. (c) were lower than those of Seq. (a) and Seq. (b). This is because both background differencing and seed generation rely on the differences between the objects and the background. In Seq. (c), the difference between the objects and the background was minimal. Even if feature points on the object can be located, this may not effectively segment a distinct region of the object. For Seq. (d), the segmentation of the target was quite good, with an IoU value of 0.83. When the target is heavily occluded, the segmentation of the target may be greatly disturbed. Then, when the object reappears, it can still be completely segmented. In Seq. (e), the segmentation of the target also depended primarily on the contrast between the target and the background. Additionally, due to the larger size of the target, it became more challenging to achieve complete segmentation. This was the main reason for the lower IoU value and accuracy compared to the other test sequences, which were only 0.62 and 0.49. In Seq. (f), the significant scale change of the object was the main reason affecting the segmentation results. When the scale was small, the IoU value could reach 0.83. However, when the target scale rapidly increased, this often led to the rapid disappearance of feature points as well. The trajectories of feature points on the target were also affected, resulting in unstable segmentation results.

4.3. Evaluation on the YouTube-Objects Dataset

To facilitate comparison with other algorithms, we ran our algorithm on the YouTube-Objects dataset [29] and compared it with other algorithms: BVS [17], VSF [30], OFL [31], and STV [12]. Since some of the code was not available, we could only compare against the sequences reported in the respective papers. The IoU scores of our method and the compared methods are presented in Table 4.
Table 4 reflects the characteristics of the different image sequences. In the aeroplane sequence, the background remained relatively static, which means that the trajectories of the background feature points had a higher level of similarity; on the other hand, there was minimal interference from the sky background, which facilitated object segmentation in this case. For the bird sequence, because the movement of the bird was concentrated in its head, the feature points on its body were classified as background points. This was the reason for the lower IoU value achieved by the algorithm in this study. The algorithm in this study achieved good results when detecting targets that moved as a whole, such as cars and horses. For the train sequence, as the train gradually approached the camera, the algorithm in this study could effectively segment the object. However, when the train came too close to the camera and occupied a large portion of the image, it no longer adhered to the assumption that there are more background feature points than object feature points. As a result, the algorithm failed to segment the target effectively.

5. Summary and Future Work

This paper introduced an algorithm that focused on detecting all moving objects in a dynamic background. The algorithm in this paper is based on feature points, establishing trajectories for detected feature points in the image. By utilizing the density and consistency of background points, the trajectories of feature points on the objects are filtered. The algorithm leverages the differences in background point matching to identify foreground objects and uses motion consistency to differentiate between different objects, enabling the detection of multiple objects within the same frame. Through experiments and evaluations on various image sequences, we validated the feasibility of the proposed method.
For the algorithm designed in this study, there were three main factors that influenced the final results: the formation of feature points and trajectory generation, extracting information from trajectories to distinguish between motion feature points and background feature points, and complete classification and segmentation of objects based on the detected object feature points.
The results of this study demonstrate that trajectory classification can effectively differentiate moving objects in a dynamic background. The algorithm performed well in handling issues such as occlusion and sudden changes in image sequences. When the motion of the background in the images became smoother, the algorithm produced good results. Through comparing the results of various experiments, it was found that image feature point detection exhibited more randomness, and the stability of trajectories needs to be improved. On the other hand, segmentation algorithms based on feature points are limited by the positions of the feature points on the objects. The way feature points are generated determines that they are more likely to appear at the edges of objects. Although this study selected points by considering the positions of different points on the same object comprehensively and prioritized selecting intermediate points to increase the probability of seed points being inside the object, there was still room for improvement in the overall segmentation performance.
Therefore, in future research, we aim to focus on improving trajectory stability and object segmentation by using feature line trajectories in images, instead of the current feature points. Feature lines can provide more stable feature descriptions, helping to form more stable trajectories. When an object is surrounded by multiple sets of feature lines, the likelihood of obtaining its complete outline increases. Additionally, dealing with the trajectory classification issue brought about by sparse feature lines on the background compared to feature points will also be addressed.

Author Contributions

Conceptualization, W.Q. and K.R.; methodology, W.Q.; software, W.Y.; validation, W.Q., K.R. and M.W.; formal analysis, M.W.; resources, W.Q.; data curation, X.K.; writing—original draft preparation, W.Y. and X.K.; writing—review and editing, W.Y.; supervision, W.Q.; project administration, W.Q.; funding acquisition, K.R., M.W. and X.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62175111, Grant 62201260, and Grant 62001234, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20200487, and in part by the Fundamental Research Funds for the Central Universities under Grant No. 30923011015 and Grant No. 30922010715.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable suggestions and comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, B.; Huang, C.; Nevatia, R. Learning Affinities and Dependencies for Multi-Target Tracking Using a CRF Model. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1233–1240. [Google Scholar] [CrossRef]
  2. Pellegrini, S.; Ess, A.; Schindler, K.; Van Gool, L. You’ll Never Walk Alone: Modeling Social Behavior for Multi-Target Tracking. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 261–268. [Google Scholar] [CrossRef]
  3. Lu, W.L.; Ting, J.A.; Little, J.J.; Murphy, K.P. Learning to Track and Identify Players from Broadcast Sports Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1704–1716. [Google Scholar] [CrossRef] [PubMed]
  4. Luo, W.; Kim, T.K.; Stenger, B.; Zhao, X.; Cipolla, R. Bi-Label Propagation for Generic Multiple Object Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1290–1297. [Google Scholar]
  5. Attanasi, A.; Cavagna, A.; Castello, L.D.; Giardina, I.; Jelić, A.; Melillo, S.; Parisi, L.; Pellacini, F.; Shen, E.; Silvestri, E.; et al. GReTA-A Novel Global and Recursive Tracking Algorithm in Three Dimensions. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2451–2463. [Google Scholar] [CrossRef]
  6. Betke, M.; Hirsh, D.E.; Bagchi, A.; Hristov, N.I.; Makris, N.C.; Kunz, T.H. Tracking Large Variable Numbers of Objects in Clutter. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8. [Google Scholar] [CrossRef]
  7. Khan, Z.; Balch, T.; Dellaert, F. An MCMC-Based Particle Filter for Tracking Multiple Interacting Targets. In Proceedings of the Computer Vision-ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004; Proceedings, Part IV 8. Springer: Berlin/Heidelberg, Germany, 2004; pp. 279–290. [Google Scholar]
  8. Castro, F.M.; Marín-Jiménez, M.J.; Medina-Carnicer, R. Pyramidal fisher motion for multiview gait recognition. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1692–1697. [Google Scholar]
  9. Yoon, H.Y.; Lee, D.; Lim, D.K.; Koo, H.; Kim, K. Copper-Free Click Chemistry: Applications in Drug Delivery, Cell Tracking, and Tissue Engineering. Adv. Mater. 2022, 34, 2107192. [Google Scholar] [CrossRef] [PubMed]
  10. Talukder, A.; Matthies, L. Real-Time Detection of Moving Objects from Moving Vehicles Using Dense Stereo and Optical Flow. In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), Sendai, Japan, 28 September–2 October 2004; Volume 4, pp. 3718–3725. [Google Scholar] [CrossRef]
  11. Fan, L.; Zhang, T.; Du, W. Optical-Flow-Based Framework to Boost Video Object Detection Performance with Object Enhancement. Expert Syst. Appl. 2021, 170, 114544. [Google Scholar] [CrossRef]
  12. Wang, W.; Shen, J.; Porikli, F.; Yang, R. Semi-Supervised Video Object Segmentation with Super-Trajectories. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 985–998. [Google Scholar] [CrossRef] [PubMed]
  13. Faktor, A.; Irani, M. Video Segmentation by Non-Local Consensus Voting. 2014, 2, 8. Available online: https://www.wisdom.weizmann.ac.il/~alonf/papers/NonLocalVideoSegmentation.pdf (accessed on 14 March 2024).
  14. Dong, J.; Shan, T.; Pang, D. Infrared small target detection based on facet model and structure tensor. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; Volume 1, pp. 1246–1250. [Google Scholar]
  15. Li, T.; Jiang, B.; Wu, D.; Yin, X.; Song, H. Tracking Multiple Target Cows’ Ruminant Mouth Areas Using Optical Flow and Inter-Frame Difference Methods. IEEE Access 2019, 7, 185520–185531. [Google Scholar] [CrossRef]
  16. Zheng, S. Tracking Multiple Objects in Surveillance Cameras; Technical Report; 2010. [Google Scholar]
  17. Maerki, N.; Perazzi, F.; Wang, O.; Sorkine-Hornung, A. Bilateral Space Video Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 743–751. [Google Scholar]
  18. Chen, J.; Paris, S.; Durand, F. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph. (TOG) 2007, 26, 103. [Google Scholar] [CrossRef]
  19. Discontinuity-Aware Video Object Cutout. ACM Transactions on Graphics. 2023. Available online: https://dl.acm.org/doi/abs/10.1145/2366145.2366194 (accessed on 14 March 2024).
  20. Fan, Q.; Zhong, F.; Lischinski, D.; Cohen-Or, D.; Chen, B. JumpCut: Non-Successive Mask Transfer and Interpolation for Video Cutout. ACM Trans. Graph. 2015, 34, 195. [Google Scholar] [CrossRef]
  21. Chu, P.; Wang, J.; You, Q.; Ling, H.; Liu, Z. TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking. arXiv 2021, arXiv:2104.00194. [Google Scholar] [CrossRef]
  22. Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient Convolution Operators for Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646. [Google Scholar]
  23. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  24. Huang, X.; Qiao, H.; Li, H.; Jiang, Z. A Bioinspired Retinal Neural Network for Accurately Extracting Small-Target Motion Information in Cluttered Backgrounds. Image Vis. Comput. 2021, 114, 104266. [Google Scholar] [CrossRef]
  25. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  26. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  27. Schubert, E.; Sander, J.; Ester, M.; Kriegel, H.P.; Xu, X. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Trans. Database Syst. 2017, 42, 19. [Google Scholar] [CrossRef]
  28. Collins, R.; Zhou, X.; Teh, S.K. An Open Source Tracking Testbed and Evaluation Web Site. Available online: https://www.cs.cmu.edu/~rcollins/Papers/opensourceweb.pdf (accessed on 14 March 2024).
  29. Prest, A.; Leistner, C.; Civera, J.; Schmid, C.; Ferrari, V. Learning Object Class Detectors from Weakly Annotated Video. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3282–3289. [Google Scholar] [CrossRef]
  30. Nagaraja, N.S.; Schmidt, F.R.; Brox, T. Video Segmentation With Just a Few Strokes. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3235–3243. [Google Scholar]
  31. Tsai, Y.H.; Yang, M.H.; Black, M.J. Video Segmentation via Object Flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3899–3908. [Google Scholar]
Figure 1. Overall algorithm flow chart.
Figure 2. The velocity and distance representation of the trajectories in one direction. The X-axis represents the frame number.
Figure 3. Polar representation of the trajectory of feature points in a frame.
Figure 4. Density clustering. All data are divided into three clusters, and elements in each cluster are less than a threshold (radius of the circle) from at least one element in the same cluster.
Figure 5. (a) Using small thresholds to distinguish the background. (b) Screening the feature points by plane constraints. (c) Using large thresholds to distinguish different objects.
Figure 6. The image is aligned by a homography matrix. (a) represents the generated background image $I_1$, (b) represents the image $I_{reback}$ after changes through the homography matrix, and (c) represents the gray difference between the current frame and the background frame after binarization.
Figure 7. Extraction of objects based on background points and background regions.
Figure 8. Part of the experimental image sequences.
Figure 9. Motion feature points screening (Seq. (a)~Seq. (c)).
Figure 10. Motion feature points screening (Seq. (d) ~Seq. (f)).
Figure 11. F values obtained from different sequences. The red dots represent the updated frames.
Figure 12. Qualitative segmentation results on Seq. (a)~(c). The results show that the proposed algorithm could cut out multiple objects well and distinguish different objects.
Figure 13. Qualitative segmentation results on Seq. (d)~(f). The results show that the proposed algorithm could cut out multiple objects well and distinguish different objects.
Table 1. The main challenges present in each image sequence.
Seq. | Challenge
(a)  | There are more moving objects.
(b)  | The objects are close together.
(c)  | The objects' contrast is not high.
(d)  | The image is blurred and there is occlusion.
(e)  | The interference is greater and a large number of background points are not on the ground.
(f)  | There are scale variations.
Table 2. The accuracy of detection points for each sequence.
Evaluation Tools | Seq. (a) | Seq. (b) | Seq. (c) | Seq. (d) | Seq. (e) | Seq. (f)
Accuracy         | 84%      | 85%      | 40%      | 85%      | 52%      | 75%
Table 3. IoU score ( Y ) and contour accuracy (F) averaged value on the tested sequences. Higher values of IoU score and contour accuracy are better.
Evaluation Tools | Seq. (a) | Seq. (b) | Seq. (c) | Seq. (d) | Seq. (e) | Seq. (f)
IoU              | 0.86     | 0.83     | 0.75     | 0.83     | 0.62     | 0.65
Contour accuracy | 0.64     | 0.63     | 0.58     | 0.66     | 0.49     | 0.42
Table 4. IoU score on the some Youtube-Objects dataset. Higher values mean better results.
Dataset        | Category  | BVS   | VSF   | OFL   | STV   | OURS
Youtube-Object | Aeroplane | 0.808 | 0.89  | 0.853 | 0.811 | 0.83
               | bird      | 0.764 | 0.816 | 0.831 | 0.813 | 0.62
               | car       | 0.567 | 0.709 | 0.606 | 0.754 | 0.72
               | horse     | 0.531 | 0.678 | 0.623 | 0.716 | 0.73
               | train     | 0.621 | 0.782 | 0.747 | 0.761 | 0.621
Bold numbers indicate the best-performing values across all algorithms.
