High-Speed Spatial–Temporal Saliency Model: A Novel Detection Method for Infrared Small Moving Targets Based on a Vectorized Guided Filter

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(10), 1685; https://doi.org/10.3390/rs16101685
Submission received: 27 March 2024 / Revised: 4 May 2024 / Accepted: 8 May 2024 / Published: 9 May 2024

Abstract

Infrared (IR) imaging-based detection systems are of vital significance in the domains of early warning and security, necessitating a high level of precision and efficiency in infrared small moving target detection. IR targets often appear dim and small relative to the background and are easily buried by noise, making them difficult to detect. A novel high-speed spatial–temporal saliency model (HS-STSM) based on a guided filter (GF) is proposed, which innovatively introduces the GF into IR target detection to extract the local anisotropy saliency in the spatial domain and substantially suppresses the background region as well as the bright clutter false alarms present in the background. Moreover, the proposed model extracts the motion saliency of the target in the temporal domain through vectorization of IR image sequences. Additionally, the proposed model significantly improves the detection efficiency through a vectorized filtering process and effectively suppresses edge components in the background by integrating a prior weight. Experiments conducted on five real infrared image sequences demonstrate the superior performance of the model compared to existing algorithms in terms of the detection rate, noise suppression, real-time processing, and robustness to the background.

1. Introduction

Infrared imaging-based detection systems transform the infrared radiation emitted by an object into observable infrared images, which enables target detection and tracking. These systems currently serve as crucial tools in various fields such as early warning and security. However, infrared small moving targets, such as small aircraft and unmanned aerial vehicles [1], are typically observed at a considerable distance, resulting in images with small, dim targets that often occupy only a few pixels and lack distinctive features. Additionally, the complex imaging environment often includes areas with high radiation characteristics, leading to the presence of various forms of bright noise in the background [2]. As a result, IR targets often appear dim and small relative to the background and are easily buried by noise, making them difficult to detect. Moreover, real-time detection of infrared targets demands high-speed detection methods. Given these challenges, the demand for enhanced detection accuracy and timeliness has increased significantly, and enhancing the stability and real-time performance of infrared small moving target detection in complex backgrounds holds significant practical importance.

1.1. Related Works

Conventional spatial filtering methods for IR target detection, including the top-hat transform filter [3], max-mean/max-median filters [4], and bilateral filter [5], leverage the discernible features of the target and background in the image sequence. Although these techniques are computationally efficient and straightforward to implement, they tend to be less effective in complex scenes.
Based on low-rank sparse decomposition (LRSD), various methods have been developed, such as the infrared patch-image model (IPI) by Gao et al. [6] and the reweighted infrared patch-tensor model (RIPT) by Dai et al. [7]. Zhang et al. [8] introduced corner-point and edge prior weight information, approximated the rank of the matrix by the partial sum of tensor singular values, and proposed the partial sum of the tensor nuclear norm (PSTNN) model. Hu et al. [9] presented the multi-frame spatial–temporal patch-tensor model (MFSTPT), which utilizes rank approximation with Laplace operator constraints.
Human visual system (HVS)-based methods perform target detection by extracting salient features such as local contrast. The local contrast method (LCM), proposed by Chen et al. [10], was one of the first methods to use this approach for IR target detection. Wei et al. [11] introduced the multiscale patch-based contrast measure (MPCM), which builds upon the LCM by adding a multiscale contrast calculation, leading to improved detection performance. Shi et al. [12] introduced a high-boost-based multiscale local contrast measure (HB-MLCM) as a refinement to LCM. Additionally, Han et al. [13] proposed a multiscale three-layer local contrast measurement (TLLCM), while Cui et al. [14] proposed a weighted three-layer window local contrast method (WTLLCM). To capture time-domain information between frames, Deng et al. [15] developed the spatial–temporal local contrast filter (STLCF). Du and Hamdulla [16] also introduced a novel spatial–temporal local difference measure (STLDM) algorithm for the detection of moving targets in IR image sequences. Moreover, Ma et al. [17] proposed a method based on a coarse-to-fine structure (MFCS). HVS-based detection methods have gained widespread use due to their minimal reliance on prior information and efficient computational processing. However, there is still room to improve both the accuracy of detecting small moving IR targets in complex scenes and the robustness to noise.
In recent years, neural networks have emerged as a viable approach for detecting IR targets. Wang et al. [18] utilized convolutional neural networks (CNNs) for extracting target features. Furthermore, the YOLO series, introduced by Redmon et al. [19], and generative adversarial networks (GANs), proposed by Goodfellow et al. [20], have been the subject of extensive research. Wang et al. [21] introduced the asymmetric patch attention fusion network (APAFNet), aimed at fusing high-level semantics and low-level spatial details. Nevertheless, the limited resolution of IR images and the lack of adequate training samples pose a challenge, as they restrict the extraction of target features. Consequently, there is still a need to enhance the detection efficiency and robustness of these methods.

1.2. Motivation

Most of the current methods for infrared small moving target detection process single-frame images, focusing only on the spatial domain features of the target. Therefore, these methods are most effective in scenarios where the background is uncomplicated and the target stands out prominently in comparison to the background. In situations where targets have low contrast, small size, and complex backgrounds, ensuring the accuracy and robustness of detection results can be challenging. Effectively utilizing the inter-frame information of IR image sequences to extract temporal-domain features of the target becomes crucial for enhancing detection accuracy and reducing false alarms.
In addition, the current filtering methods and HVS-based methods are prone to miss detection and false alarms, particularly in scenarios with complex scenes. Therefore, algorithms for extracting spatial-domain features of the target based on local saliency require further enhancement. This is especially relevant in cases where strong edges exist in the background and point noise resembles the shape of the target. It becomes essential to integrate the suppression of background edge components into the spatial filtering process.
Finally, in practical applications, the speed of detection is as crucial as its accuracy. While single-frame-based methods can often guarantee high detection efficiency, their accuracy and robustness need to be enhanced. On the other hand, LRSD-based and multi-frame-based methods often achieve higher detection accuracy, but at the cost of efficiency. Therefore, it is essential to find a balance between detection accuracy and speed and to maintain high-precision detection while enhancing efficiency.
To address the aforementioned issues, a fast and robust method for detecting small moving IR targets is introduced in this paper. The main contributions are as follows:
(1)
A novel high-speed spatial–temporal saliency model (HS-STSM) is proposed, which simultaneously extracts the temporal saliency of the target from the inter-frame information of IR image sequences and the local anisotropy saliency in the spatial domain.
(2)
To enhance the extraction of spatial saliency of the target, this paper proposes a novel fast spatial filtering algorithm via a guided filter. This approach is combined with edge suppression using local prior weights, which serves to further reduce background residuals and highlight the target.
(3)
Achieving real-time performance in IR target detection is crucial. To address this, the vectorization of IR image sequences is introduced into the filtering process. This method significantly improves the speed of multi-frame detection and the extraction of the temporal saliency of the IR target, resulting in superior detection efficiency while maintaining high levels of detection accuracy.
(4)
Both qualitative and quantitative experimental results on five real sequences demonstrate that our model performs advanced, fast, and robust detection of IR small moving targets.
The rest of this paper is structured as follows. Section 2 presents the proposed model, covering the extraction of temporal and spatial saliency, as well as the filtering process. Additionally, a detailed explanation of the computation of local prior weights is provided. Section 3 presents the experimental results and analysis and offers both qualitative and quantitative evaluations. Section 4 is the discussion of the performance of the proposed method and provides insights into the effectiveness and limitations of the model. Finally, Section 5 presents the conclusion of this paper.

2. Proposed Model

Inspired by the guided filter (GF) introduced by He et al. [22], the edge preservation property is capitalized on to jointly filter IR image sequences in spatial and temporal domains, thereby achieving both background suppression and target enhancement. Specifically, the vectorization of the IR image sequence facilitates the mapping of inter-frame information into a two-dimensional matrix, allowing for the extraction of temporal features of the moving target and further target enhancement. This matrix serves as the guided image for the fast guided filter (FGF) [23], which leverages anisotropic features in the spatial domain to extract background components and differentiate them from the original image, effectively suppressing the background. Furthermore, to eliminate bright edge interference in the background, prior weights are computed utilizing the local structure tensor in the spatial domain. Finally, the target image is reconstructed, and detection results are derived through adaptive threshold segmentation.

2.1. Spatial–Temporal Saliency

The background often exhibits diverse properties such as gradient, texture, reflection, etc., which manifest differently in various directions within the spatial domain. In contrast, these properties appear isotropic in the target, as illustrated in Figure 1. Leveraging this disparity, background suppression can be achieved based on the anisotropic features between the target and the background. To better preserve target components from the background, the superior detail preservation capability of GF is harnessed. The image to be detected is input as the guide image, thereby enabling the filter output to retain more structural features. Moreover, the proposed approach also utilizes the temporal saliency of the moving target. In the temporal domain, under a fixed system, as the target traverses a pixel location, the pixel’s value at this location sharply rises to a high level and then decreases to its original level. This temporal feature can be described by a varying curve with a pronounced peak when the target is present and a relatively smooth line when the target is absent, as depicted in Figure 2.

2.2. Vectorization of IR Image Sequence

To extract the temporal saliency of a moving target for target enhancement, each frame in the IR image sequence is converted into a column vector. Each pixel of the current image is mapped into the column vector in left-to-right and top-to-bottom order, as shown in Figure 3. These column vectors are then employed to construct a two-dimensional matrix, which combines temporal and spatial domain information and serves as the input for the filtering process. Compared with single-frame-based spatial-domain filtering algorithms, the proposed method removes point-like bright noise in the background more effectively, since such noise fails to align with the temporal saliency feature of the moving target, thus improving robustness.
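To make the mapping concrete, the following is a minimal NumPy sketch of this vectorization step and its inverse; the function names and the (T, H, W) array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def vectorize_sequence(frames):
    """Map each H x W frame to a column vector in left-to-right,
    top-to-bottom (row-major) order and stack the T columns into an
    (H*W) x T spatial-temporal matrix."""
    T, H, W = frames.shape
    # C-order reshape flattens each frame row by row, matching the
    # left-to-right, top-to-bottom mapping described above
    return frames.reshape(T, H * W).T

def devectorize(matrix, H, W):
    """Inverse mapping: place every pixel back at its original position."""
    return matrix.T.reshape(-1, H, W)
```

Each row of the resulting matrix is the temporal profile of one pixel, so a moving target shows up as a sharp, transient peak along its row, which is exactly the temporal saliency feature described above.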

2.3. Filtering Process Based on Vectorized Guided Filter

To satisfy the real-time requirements of infrared target detection, the improved FGF is employed. Initially, nearest-neighbor subsampling is applied to the two-dimensional matrix $F$, yielding the subsampled matrix $F'$. This subsampled result is utilized as both the input image $I$ and the guide image $G$. Specifically, the output $O$ is defined as a linear transformation of $G$ in a window $\omega_k$ centered at pixel $k$:
$$O_i = \sum_{j} W_{ij}(G) I_j = a_k G_i + b_k, \quad \forall i \in \omega_k \tag{1}$$
where $i$ and $j$ represent pixel indices, $W_{ij}$ denotes the filter kernel, and $(a_k, b_k)$ are constant linear coefficients within the window $\omega_k$, which takes the form of a square window of radius $r$. Since $\nabla O = a_k \nabla G$ inside the window, the filtered output image $O$ changes linearly with the guide image $G$, which serves to preserve detailed information such as targets and edges within the matrix $F$ prior to filtering. The objective is to minimize the following cost function within the window:
$$E(a_k, b_k) = \sum_{i \in \omega_k} \left[ \left( a_k G_i + b_k - I_i \right)^2 + \epsilon a_k^2 \right] \tag{2}$$
where $\epsilon$ is a regularization parameter constraining $a_k$. The coefficients $(a_k, b_k)$ can be solved by the principle of linear regression:
$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} G_i I_i - \mu_k \bar{I}_k}{\sigma_k^2 + \epsilon} \tag{3}$$
$$b_k = \bar{I}_k - a_k \mu_k \tag{4}$$
where $\mu_k$ and $\sigma_k^2$ are the mean and variance of the grayscale values of the pixels within the window $\omega_k$ in $G$, and $|\omega|$ is the number of pixels within the window $\omega_k$. $\bar{I}_k$, on the other hand, is the mean grayscale value of the pixels within the window $\omega_k$ in the input image $I$. It is calculated as follows:
$$\bar{I}_k = \frac{1}{|\omega|} \sum_{i \in \omega_k} I_i \tag{5}$$
By combining the previously mentioned formulas, the equation for deriving the background component in G extracted by the filter is as follows:
$$O_i = \frac{1}{|\omega|} \sum_{k : i \in \omega_k} \left( a_k G_i + b_k \right) = \bar{a}_i G_i + \bar{b}_i \tag{6}$$
$$\bar{a}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} a_k, \qquad \bar{b}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} b_k \tag{7}$$
where $\bar{a}_i$ and $\bar{b}_i$ are the averages of the coefficients over all windows overlapping pixel $i$. This yields the following filter kernel function $W_{ij}(G)$ with normalized characteristics; the filtered matrix $O$ is then upsampled to restore it to its original full resolution.
$$W_{ij}(G) = \frac{1}{|\omega|^2} \sum_{k : (i, j) \in \omega_k} \left( 1 + \frac{(G_i - \mu_k)(G_j - \mu_k)}{\sigma_k^2 + \epsilon} \right) \tag{8}$$
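The complete filtering step can be sketched compactly. The code below is a minimal NumPy/SciPy illustration of Formulas (1)–(8) under the assumption, stated above, that the subsampled spatial–temporal matrix serves as both input and guide (with $I = G$, $a_k$ reduces to $\sigma_k^2 / (\sigma_k^2 + \epsilon)$); the subsampling ratio s and the box-filter boundary mode are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def vectorized_guided_filter(F, r=6, eps=0.01, s=2):
    """Estimate the background component of the spatial-temporal matrix F
    with a fast guided filter (I = G = subsampled F) and subtract it."""
    Fs = F[::s, ::s].astype(float)      # nearest-neighbour subsampling
    size = 2 * max(r // s, 1) + 1       # window omega_k at the coarse scale

    def box(x):                         # mean over the square window omega_k
        return uniform_filter(x, size=size, mode='reflect')

    mu = box(Fs)                                   # mu_k
    var = box(Fs * Fs) - mu * mu                   # sigma_k^2
    a = var / (var + eps)                          # Formula (3) with I = G
    b = mu - a * mu                                # Formula (4)
    a_bar, b_bar = box(a), box(b)                  # Formula (7)
    # nearest-neighbour upsampling of the coefficients to full resolution
    a_full = np.kron(a_bar, np.ones((s, s)))[:F.shape[0], :F.shape[1]]
    b_full = np.kron(b_bar, np.ones((s, s)))[:F.shape[0], :F.shape[1]]
    O = a_full * F + b_full                        # Formula (6): background
    return F - O                                   # suppressed result
```

Because only the small coefficient maps are filtered and then upsampled, the cost per frame stays close to that of a single box filter, which is the source of the speed-up claimed for the FGF.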

2.4. Edge Suppression Based on Prior Weights

After the filtering, several strong edge components still remain in the suppressed results. Gao et al. [24] proposed a method to differentiate edges and corners by computing the eigenvalues $(\lambda_1, \lambda_2)$ of the structure tensor of the pixels in the image:
$$J_\rho = K_\rho * \left( \nabla I \otimes \nabla I \right) = \begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix} = \begin{bmatrix} K_\rho * I_x^2 & K_\rho * (I_x I_y) \\ K_\rho * (I_x I_y) & K_\rho * I_y^2 \end{bmatrix} \tag{9}$$
$$\lambda_1 = \frac{1}{2} \left( J_{11} + J_{22} + \sqrt{(J_{22} - J_{11})^2 + 4 J_{12}^2} \right) \tag{10}$$
$$\lambda_2 = \frac{1}{2} \left( J_{11} + J_{22} - \sqrt{(J_{22} - J_{11})^2 + 4 J_{12}^2} \right) \tag{11}$$
where $\otimes$ is the Kronecker (outer) product, $\nabla$ is the gradient operator, $K_\rho$ is the Gaussian kernel function with variance $\rho$, and $I_x$ and $I_y$ are the derivatives along the x and y directions, respectively. Gao et al. stated that $\lambda_1 \gg \lambda_2 \approx 0$ indicates that the pixel belongs to an edge. The spatial prior information of the edge is then calculated as follows, as adopted from Dai et al. [7] in RIPT:
$$E(x, y) = \lambda_1 - \lambda_2 \tag{12}$$
This spatial weight is combined in the filtering process to further suppress the residual edge components.
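As an illustration, a possible NumPy/SciPy computation of Formulas (9)–(12) is sketched below. The text only states qualitatively how the weight enters the pipeline, so the conversion of the edge map E into a multiplicative suppression weight at the end is an assumption, as are the Sobel derivatives and the smoothing scale rho.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def edge_prior_weight(img, rho=1.5):
    """Eigenvalues of the Gaussian-smoothed structure tensor;
    lambda1 >> lambda2 ~ 0 marks an edge pixel."""
    img = img.astype(float)
    Ix = sobel(img, axis=1)                 # derivative along x
    Iy = sobel(img, axis=0)                 # derivative along y
    J11 = gaussian_filter(Ix * Ix, rho)     # K_rho * Ix^2
    J12 = gaussian_filter(Ix * Iy, rho)     # K_rho * (Ix Iy)
    J22 = gaussian_filter(Iy * Iy, rho)     # K_rho * Iy^2
    root = np.sqrt((J22 - J11) ** 2 + 4.0 * J12 ** 2)
    lam1 = 0.5 * (J11 + J22 + root)
    lam2 = 0.5 * (J11 + J22 - root)
    E = lam1 - lam2                         # large on edges, small on blobs
    # assumed conversion: strong edges (large E) get weights close to zero
    return 1.0 - E / (E.max() + 1e-12)
```

An isotropic blob-like target gives $\lambda_1 \approx \lambda_2$ and hence a small E, so it survives the weighting, while elongated edge structures are attenuated.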

2.5. Adaptive Threshold Segmentation

Although the background clutter and edge interference have been suppressed and the target is preserved, the residual background values do not become exactly zero. The post-processing adaptive threshold segmentation module should therefore consider all pixels. The threshold proposed in this paper is computed as follows:
$$T_h = q \times I_{\max} + (1 - q) \times I_{\text{mean}} \tag{13}$$
where $I_{\max}$ is the maximum gray value, $I_{\text{mean}}$ is the average of the non-zero pixels, and $q$ is an adjustable threshold factor between 0 and 1.
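A direct implementation of Formula (13) is straightforward; the sketch below assumes the suppressed result is a non-negative array, and the choice q = 0.5 is purely illustrative (the paper only constrains 0 < q < 1).

```python
import numpy as np

def adaptive_threshold(result, q=0.5):
    """Th = q * I_max + (1 - q) * I_mean, with I_mean taken over the
    non-zero pixels; returns a binary target mask."""
    nonzero = result[result > 0]
    i_mean = nonzero.mean() if nonzero.size else 0.0
    th = q * result.max() + (1.0 - q) * i_mean
    return result >= th
```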

2.6. The Flowchart of the Proposed Model

The flowchart of the proposed model is depicted in Figure 4.
  • The input consists of an IR image sequence;
  • The spatial prior weight map is constructed by Formulas (9)–(12);
  • Each pixel of the current image is mapped into the column vector in left-to-right and top-to-bottom order to construct an input matrix for the filtering process, as shown in Figure 3;
  • The fast guided filter is utilized for the extraction of spatial saliency and background suppression in the filtering process;
  • The filtered image is subtracted from the original image to perform background suppression;
  • The reconstruction process involves placing each pixel in the IR sequence matrix back to their original positions in the IR images;
  • Spatial prior weights are integrated into the reconstructed infrared image to suppress edge residuals in the background;
  • Finally, the adaptive threshold segmentation, as shown in Formula (13), is performed on the recovered target detection result map to obtain the final target image; a composite code sketch of these steps is given after this list.
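Putting the steps of the flowchart together, a possible end-to-end composition is sketched below, reusing the illustrative helper functions from the previous sections (vectorize_sequence, devectorize, vectorized_guided_filter, edge_prior_weight, adaptive_threshold); computing the prior weight from the current frame and clipping negative residuals are assumptions on details the text leaves open.

```python
import numpy as np

def detect_targets(frames, r=6, eps=0.01, q=0.5):
    """End-to-end sketch of the HS-STSM flowchart for one IR sequence
    of shape (T, H, W); returns a binary mask for the current frame."""
    T, H, W = frames.shape
    weight = edge_prior_weight(frames[-1])                 # prior weight map
    F = vectorize_sequence(frames)                         # (H*W) x T matrix
    residual = vectorized_guided_filter(F, r=r, eps=eps)   # filter + subtract
    recon = devectorize(residual, H, W)[-1]                # reconstruct frame
    recon = np.clip(recon, 0.0, None) * weight             # edge suppression
    return adaptive_threshold(recon, q=q)                  # Formula (13)
```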

3. Experimental Results and Analysis

This section presents the experimental results and analysis of the proposed model. A detailed description of the five real infrared image sequences used in the experiments is provided; these sequences contain a variety of complex background components and targets of varying sizes. Additionally, the evaluation metrics used in the experiments are introduced. Quantitative experiments, including parameter analysis, background suppression, target detection accuracy, and detection efficiency, are performed, together with a qualitative analysis of the detection results. Finally, seven state-of-the-art methods are chosen for direct comparison, and the experimental results demonstrate the superiority of the proposed method.

3.1. Evaluation Metrics

Based on the evaluation metrics commonly used in the field of infrared target detection, the target and its neighboring region are defined as shown in Figure 5, where $a \times b$ denotes the target region and $(a + 2d) \times (b + 2d)$ denotes the background region around the target. The specific evaluation metrics used are described below.
  • First, the signal-to-clutter ratio (SCR) is defined as follows:
    $$SCR = \frac{|\mathrm{mean}_t - \mathrm{mean}_n|}{\mathrm{dev}_n} \tag{14}$$
    where $\mathrm{mean}_t$ and $\mathrm{mean}_n$ denote the grayscale means of the target region and the surrounding neighborhood region, respectively, and $\mathrm{dev}_n$ represents the grayscale standard deviation of the neighborhood region. The signal-to-clutter ratio gain (SCRG) is frequently utilized to assess the effectiveness of clutter suppression and target enhancement. From the SCR, the SCRG is calculated as follows:
    $$SCRG = \frac{SCR_r}{SCR_i} \tag{15}$$
    where $SCR_r$ and $SCR_i$ represent the SCR values of the detection result and the input IR image, respectively.
  • To measure the effect of the method on background suppression, the background suppression factor (BSF) is calculated as follows:
    $$BSF = \frac{\mathrm{dev}_i}{\mathrm{dev}_r} \tag{16}$$
    where $\mathrm{dev}_i$ and $\mathrm{dev}_r$ represent the standard deviation of the background region in the original input image and in the detection result, respectively.
  • The detection rate and false alarm rate are used for a comprehensive evaluation of the detection performance across the entire image sequence. The detection rate $P_d$ is calculated as follows:
    $$P_d = \frac{D_T}{A_T} \tag{17}$$
    where $D_T$ is the number of successful detections and $A_T$ is the total number of targets to be detected in the IR sequence. The false alarm rate $F_a$ is calculated as follows:
    $$F_a = \frac{F_P}{N_P} \tag{18}$$
    where $F_P$ is the number of falsely detected non-target pixels and $N_P$ is the total number of pixels in the IR sequence. Further, the receiver operating characteristic (ROC) curve can be plotted by varying the threshold, with the false alarm rate on the horizontal axis and the detection rate on the vertical axis. The AUC value, i.e., the area under the curve enclosed by the ROC curve and the axes, is a direct indicator of the target detection performance; a code sketch of these metrics is given after this list.
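For reproducibility, a minimal NumPy sketch of these metrics follows; the slice-based target specification, the neighborhood width d, and the small stabilizing constant are illustrative assumptions.

```python
import numpy as np

def scr(img, ys, xs, d=10):
    """SCR = |mean_t - mean_n| / dev_n for a target box given by the
    slices ys, xs; the neighborhood extends the box by d pixels per side."""
    ny = slice(max(ys.start - d, 0), ys.stop + d)
    nx = slice(max(xs.start - d, 0), xs.stop + d)
    patch = img[ny, nx].astype(float)
    mask = np.ones(patch.shape, dtype=bool)
    mask[ys.start - ny.start:ys.stop - ny.start,
         xs.start - nx.start:xs.stop - nx.start] = False  # exclude target
    mean_n, dev_n = patch[mask].mean(), patch[mask].std()
    return abs(img[ys, xs].mean() - mean_n) / (dev_n + 1e-12)

def scrg(scr_result, scr_input):       # SCRG = SCR_r / SCR_i
    return scr_result / scr_input

def bsf(dev_input, dev_result):        # BSF = dev_i / dev_r
    return dev_input / dev_result

def detection_rate(detected, total_targets):        # P_d = D_T / A_T
    return detected / total_targets

def false_alarm_rate(false_pixels, total_pixels):   # F_a = F_P / N_P
    return false_pixels / total_pixels
```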

3.2. Description of the Dataset

To better demonstrate the robustness of the proposed method to infrared small targets in different backgrounds, experiments are performed on real IR image sequences [25]. The targets in this dataset are mainly airborne IR targets in sky and ground scenes under complex conditions. To evaluate the performance of the proposed method, five representative scenes with high detection difficulty are selected as test sequences, as shown in Figure 6. The corresponding 3D displays are given in Figure 7. A detailed description of the sequences is shown in Table 1, where the SCR values represent the signal-to-clutter ratio.

3.3. Parameter Analysis

The model’s adjustable parameters primarily reside in the filtering process, specifically the radius r of the filtering window ω k and the regularization parameter ϵ . ϵ controls the degree of smoothness in the original guided filter. However, the method proposed in this study achieves background suppression through the disparity between the original image and the filtered image, rendering the detection result almost independent of the parameter ϵ . The experimental findings conducted on real IR target sequences also support this observation. Hence, the default ϵ value is set to 0.01.
Here, this article concentrates on how the radius $r$ of the filter window $\omega_k$ affects the detection performance. Given the size range of IR small targets, the radii are set to 2, 4, 6, 8, 10, and 12, respectively, and experiments are conducted on the five real IR target sequences. The 3D receiver operating characteristic (ROC) curves of the experimental results are depicted in Figure 8, with the corresponding area under the curve (AUC) value provided in the legend for each case. A larger AUC value indicates superior detection performance.
It is evident that the model exhibits the poorest detection performance when the radius is at the minimum value of 2, that the optimal performance is achieved when the radius is approximately 6, and that the effectiveness diminishes again once the radius reaches 8 or more. This trend can be attributed to the target size range of 2 × 2 to 5 × 5 within the five sequences. Specifically, when the window radius is small, the window fails to completely encompass the target area, resulting in excessive contraction of the target and a reduction in the detection rate. Conversely, when the window radius is too large, the window includes excessive background clutter components, leading to a higher false alarm rate. Consequently, the ideal window radius should be slightly larger than the size of the target to be detected. As a result, the value is set to $r = 6$, which yields the optimal and most robust detection performance.

3.4. Ablation Experiments

To validate the efficacy of each part of the proposed model, ablation experiments are conducted, and the 3D ROC curves derived from the results are illustrated in Figure 9. Ablation experiments are performed separately on five real infrared sequences listed in Table 1. In the legend, “Without” denotes an experiment conducted without a particular component of the proposed method. The AUC values derived from the 3D ROC curves are listed in Table 2. The best results are marked in red. The results of the ablation experiments demonstrate that the overall algorithm surpasses the removal or replacement of any individual module, thereby affirming the significance of each component in enhancing the detection performance.

3.5. Qualitative Experiments

In this section, qualitative experiments on robustness in different scenes are performed. Furthermore, a comparison with seven state-of-the-art methods is presented, evaluating the ability of target enhancement and background suppression.

Qualitative Comparison with State-of-the-Art Methods

To demonstrate the superiority of the proposed model, it is compared with seven current state-of-the-art and efficient IR target detection methods. These include the single-frame detection methods HB-MLCM [12], MPCM [11], PSTNN [8], TLLCM [13], and WTLLCM [14], as well as the multi-frame detection methods STLCF [15] and STLDM [16]. The parameter settings for each comparison method are shown in Table 3; all parameters are consistent with the settings reported in the corresponding literature.
The qualitative detection results are shown in Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14, where the red boxes indicate successful detection for the target, the blue oval boxes represent false alarms, and the green boxes indicate missed detection. Furthermore, 3D maps of the results are provided in the upper right corner of each image.
Observing the results in data 1 and data 4, where the target stands out more prominently against the background, all methods successfully detect the target. However, it is evident that HB-MLCM, MPCM, and STLDM exhibit a higher number of false alarms in the detection results for data 1, as well as insufficient suppression of the buildings present in data 4. This is attributed to their sensitivity to strong edges and bright noise in the background. The targets in data 2 and data 3 are characterized by their small size and low contrast and the presence of considerable clutter interference in the background. This makes it challenging to effectively detect them using only the spatial saliency features of the targets. Consequently, single-frame-based detection methods such as PSTNN are less effective on these sequences, resulting in a large number of missed detections. Furthermore, multi-frame-based detection methods such as STLCF and STLDM do not adequately suppress the background components in the spatial domain, thereby leading to a bright noise residual, which is easily confused with the target. The background of data 5 contains a substantial amount of noise components which are similar to the target. Therefore, pure spatial filtering methods like TLLCM and WTLLCM encounter difficulty in distinguishing between the target and the noise, leading to a high number of false alarms and missed detections simultaneously in the detection results. However, it is interesting to note that the single-frame-based PSTNN performs surprisingly well on data 5. This may be attributed to the ability of the low-rank matrix constructed by its neighborhood sliding-window sampling, which effectively preserves the local features of the target in the spatial domain.
The proposed HS-STSM model outperforms other methods on all five datasets by accurately detecting the target location and effectively preserving or enhancing the target. This superior performance is achieved through the extraction of temporal saliency features of the target motion by image vectorization, and the complete suppression of background components via spatial anisotropy as well as prior weight assignment. The 3D maps of the results also demonstrate that the proposed method can accurately distinguish the target from the complex background. These qualitative detection results in various scenarios show the robustness of the proposed method.

3.6. Quantitative Analysis

Quantitative analysis is performed on the detection results to achieve a more accurate evaluation of the algorithms. The evaluation metrics described before are employed, and the results are presented in Table 4 and Table 5.
The optimal results are highlighted in red in the tables. It is evident that the proposed method outperforms the seven state-of-the-art methods by achieving optimal or suboptimal values. The high BSF values denote superior suppression of background residuals by effectively leveraging spatial anisotropic features and prior weights. Furthermore, the high SCRG values indicate strong target enhancement through the utilization of temporal saliency. Finally, the high AUC values indicate that the proposed method achieves high detection rates while maintaining low false alarm rates, demonstrating its superior detection performance and robustness to noise and scenes. Overall, the proposed model effectively enhances targets while successfully suppressing background residuals.
To visualize the detection results across the entire image sequence, Figure 15 displays the ROC curves of the eight methods for the five image sequences mentioned. The corresponding AUC values for each case are provided in the legend. It is evident that methods such as HB-MLCM and MPCM, which rely on single-frame filtering, exhibit less robustness and display more fluctuations in different scenarios. Furthermore, the LRSD-based approach, PSTNN, demonstrates moderate performance, with a significant number of missed detections on data 3, where the target size is the smallest. STLCF and STLDM efficiently leverage multi-frame information, resulting in greater resistance to interference. Nevertheless, these methods perform unsatisfactorily in detecting small, dim targets in data 2 and data 3.
Compared with other algorithms, the model presented in this paper effectively suppresses background noise and improves target visibility by extracting spatial anisotropic features and temporal saliency. This approach achieves optimal detection and false alarm rates across various scenarios. Results from the 3D ROC curves for five real and complex scenes demonstrate the high accuracy and robustness of the proposed model in practical applications.

3.7. Detection Efficiency

Real-time detection is crucial for IR target detection. The average detection times per single IR image for the aforementioned methods are provided in Table 6, with the optimal results highlighted in red font. The results demonstrate the superior detection efficiency of the proposed method, which achieves optimal or near-optimal average detection times on the real IR sequences. Spatial filtering methods, such as HB-MLCM, MPCM, and WTLLCM, have relatively low computational complexity, resulting in higher detection efficiency; however, these methods compromise detection accuracy and exhibit high false alarm rates. LRSD-based methods typically have high computational complexity and longer detection times. PSTNN operates only on single-frame images and reduces computational complexity through the partial sum of the tensor nuclear norm, achieving considerable detection efficiency; compared to filtering methods, however, it still lacks competitiveness. The STLDM method is limited by its multi-frame differencing, resulting in high computational effort and low detection efficiency. As a multi-frame spatial–temporal filtering algorithm, the proposed method achieves significantly higher detection efficiency than similar filtering methods, and it even exceeds the detection efficiency of single-frame-based methods. This superiority can be attributed to the vectorization of IR image sequences, which reduces repeated and redundant computations. In conclusion, the proposed HS-STSM model maintains high detection accuracy without sacrificing efficiency, thus demonstrating practical application value.

3.8. Intuitive Effect

To intuitively demonstrate the superior detection performance of the proposed model, the experimental results from quantitative evaluation, i.e., BSF, SCRG, AUC, and computation time, are summarized in a histogram, as shown in Figure 16. The values in the histogram represent the average results obtained from the aforementioned five real IR sequences. It is evident that the proposed HS-STSM model has the highest BSF, SCRG, and AUC values and the lowest computation time, which exhibits the model’s improved detection accuracy and efficiency.

4. Discussion

Research in the field of IR target detection methods has indeed received considerable attention, and significant advancements have been made in recent years. However, there is always room for improvement and further development. Spatial filtering methods and HVS-based detection methods have low computational complexity and therefore are faster. Still, they often perform poorly on the detection of dim and small targets in complex scenes. LRSD-based models have shown improvements in detection accuracy, yet they often rely on limited prior information and struggle to effectively leverage inter-frame information in the temporal domain of moving targets. Furthermore, multi-frame-based detection tends to involve high computational complexity, and there is a need to enhance detection efficiency in this area.
To address the problem of robustness and efficiency for IR target detection in complex scenes, the HS-STSM model is proposed. This method leverages temporal information of moving targets to effectively suppress clutter interference from complex backgrounds. In addition, it utilizes the local structure tensor in the spatial domain to assign prior weight, further reducing residual bright edge components in the background. Furthermore, by the vectorization of IR image sequences, the model significantly improves detection efficiency, as shown in Table 6. Qualitative and quantitative experiments were conducted, including robustness evaluation, background suppression, target enhancement, detection ability, and computation time, on five real IR target image sequences as depicted in Figure 6 and Table 1. These results were compared with seven state-of-the-art methods, namely HB-MLCM, MPCM, PSTNN, STLCF, STLDM, TLLCM, and WTLLCM. The obtained qualitative results in Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 and the overall evaluation results presented in Figure 16 demonstrate that the proposed model outperforms the other methods in terms of both detection efficiency and robustness.
Nevertheless, the proposed model still has room for improvement. The current size of the filtering window radius r is a fixed value, which may lead to excessive target shrinkage or missed detection when the target size varies greatly. One possible improvement idea is to adaptively determine the optimal window size in the filtering process based on the maximum response principle.

5. Conclusions

In order to achieve fast and robust IR target detection in complex scenes, a high-speed spatial–temporal saliency model (HS-STSM) for detecting small moving IR targets based on vectorized guided filter is proposed. The model extracts the target’s motion saliency in the temporal domain and its local anisotropy saliency in the spatial domain from IR sequences in complex scenes. Thus, the target can be distinguished from both dimensions. Moreover, the local structure tensor in the spatial domain is employed to assign prior weights to strong edges, contributing significantly to the suppression of the background. Both qualitative and quantitative experimental results conducted on five real sequences demonstrate that the model proposed in this paper enhances the detection performance of small moving IR targets in complex scenes while preserving high speed and robustness.

Author Contributions

A.A. and Y.L. proposed the original idea and designed the experiments. A.A. performed the experiments and wrote the manuscript. Y.L. reviewed and edited the manuscript. Y.H. and G.Z. contributed computational resources and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Aerospace Information Research Institute, Chinese Academy of Sciences.

Data Availability Statement

Bingwei Hui et al., "A dataset for infrared image dim-small aircraft target detection and tracking under ground/air background." Science Data Bank, 28 October 2019. Available online: https://doi.org/10.11922/sciencedb.902 (accessed on 2 December 2022).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Eysa, R.; Hamdulla, A. Issues on Infrared Dim Small Target Detection and Tracking. In Proceedings of the 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA), Xiangtan, China, 10–11 August 2019; pp. 452–456.
  2. Tong, Z.; Can, C.; Xing, F.W.; Qiao, H.H.; Yu, C.H. Improved small moving target detection method in infrared sequences under a rotational background. Appl. Opt. 2018, 57, 9279–9286.
  3. Tom, V.T.; Peli, T.; Leung, M.; Bondaryk, J.E. Morphology-based algorithm for point target detection in infrared backgrounds. In Proceedings of the Defense, Security, and Sensing, Orlando, FL, USA, 22 October 1993.
  4. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Optics and Photonics, Denver, CO, USA, 4 October 1999.
  5. Tomasi, C.; Manduchi, R. Bilateral filtering for gray and color images. In Proceedings of the Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), Bombay, India, 7 January 1998; pp. 839–846.
  6. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared Patch-Image Model for Small Target Detection in a Single Image. IEEE Trans. Image Process. 2013, 22, 4996–5009.
  7. Dai, Y.; Wu, Y. Reweighted Infrared Patch-Tensor Model With Both Nonlocal and Local Priors for Single-Frame Small Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3752–3767.
  8. Zhang, L.; Peng, Z. Infrared Small Target Detection Based on Partial Sum of the Tensor Nuclear Norm. Remote Sens. 2019, 11, 382.
  9. Hu, Y.; Ma, Y.; Pan, Z.; Liu, Y. Infrared Dim and Small Target Detection from Complex Scenes via Multi-Frame Spatial-Temporal Patch-Tensor Model. Remote Sens. 2022, 14, 2234.
  10. Chen, C.L.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2014, 52, 574–581.
  11. Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226.
  12. Shi, Y.; Wei, Y.; Yao, H.; Pan, D.; Xiao, G. High-Boost-Based Multiscale Local Contrast Measure for Infrared Small Target Detection. IEEE Geosci. Remote Sens. Lett. 2018, 15, 33–37.
  13. Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A Local Contrast Method for Infrared Small-Target Detection Utilizing a Tri-Layer Window. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1822–1826.
  14. Cui, H.; Li, L.; Liu, X.; Su, X.; Chen, F. Infrared Small Target Detection Based on Weighted Three-Layer Window Local Contrast. IEEE Geosci. Remote Sens. Lett. 2022, 19, 7505705.
  15. Deng, L.; Zhu, H.; Tao, C.; Wei, Y. Infrared moving point target detection based on spatial–temporal local contrast filter. Infrared Phys. Technol. 2016, 76, 168–173.
  16. Du, P.; Hamdulla, A. Infrared Moving Small-Target Detection Using Spatial–Temporal Local Difference Measure. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1817–1821.
  17. Ma, Y.; Liu, Y.; Pan, Z.; Hu, Y. Method of Infrared Small Moving Target Detection Based on Coarse-to-Fine Structure in Complex Scenes. Remote Sens. 2023, 15, 1508.
  18. Wang, W.; Qin, H.; Cheng, W.; Wang, C.; Leng, H.; Zhou, H. Small target detection in infrared image using convolutional neural networks. In Proceedings of the AOPC 2017: Optical Sensing and Imaging Technology and Applications, Beijing, China, 4–6 June 2017; SPIE: Bellingham, WA, USA, 2017; Volume 10462, p. 1046250.
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
  20. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
  21. Wang, Z.; Yang, J.; Pan, Z.; Liu, Y.; Lei, B.; Hu, Y. APAFNet: Single-Frame Infrared Small Target Detection by Asymmetric Patch Attention Fusion. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7000405.
  22. He, K.; Sun, J.; Tang, X. Guided Image Filtering. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1397–1409.
  23. He, K.; Sun, J. Fast Guided Filter. arXiv 2015, arXiv:1505.00996.
  24. Gao, C.Q.; Tian, J.; Wang, P. Generalised-structure-tensor-based infrared small target detection. Electron. Lett. 2008, 44, 1349–1351.
  25. Hui, B.; Song, Z.; Fan, H.; Zhong, P.; Hu, W.; Zhang, X.; Ling, J.; Su, H.; Jin, W.; Zhang, Y.; et al. A dataset for infrared detection and tracking of dim-small aircraft targets under ground/air background. China Sci. Data 2020, 5, 12.
Figure 1. Spatial saliency of IR target.
Figure 2. Temporal saliency of the IR target under a fixed system.
Figure 3. Vectorization of IR image sequence.
Figure 4. Flowchart of the proposed HS-STSM model.
Figure 5. The IR target and its neighboring region. The target region is enclosed within the red box, whereas the background region is enclosed within the blue box.
Figure 6. Five real IR sequences used in the experiments. The target regions are marked with red boxes.
Figure 7. Three-dimensional maps of the IR sequences used in the experiments. The target regions are marked with red boxes. The numbers represent corresponding IR sequences.
Figure 8. The impact of the filter window’s radius r.
Figure 9. Three-dimensional ROC curves of ablation experiments on five real infrared sequences.
Figure 10. Comparative detection results of data 1.
Figure 11. Comparative detection results of data 2.
Figure 12. Comparative detection results of data 3.
Figure 13. Comparative detection results of data 4.
Figure 14. Comparative detection results of data 5.
Figure 15. Three-dimensional ROC curves of the detection results by different methods.
Figure 16. Intuitive display of the detection performance by different methods.
Table 1. Description of the dataset.

Sequence | Frames | Image Size | Average SCR | Target Size
Data 1   | 399    | 256 × 256  | 6.07        | 3 × 3 to 4 × 4
Data 2   | 1500   | 256 × 256  | 5.20        | 1 × 1 to 2 × 2
Data 3   | 750    | 256 × 256  | 3.42        | 1 × 1 to 3 × 3
Data 4   | 1599   | 256 × 256  | 3.84        | 2 × 2 to 4 × 4
Data 5   | 499    | 256 × 256  | 2.20        | 3 × 3 to 5 × 5
Table 2. AUC values of the ablation experiments.

Method                | Data 1 | Data 2 | Data 3 | Data 4 | Data 5
Proposed              | 0.9908 | 0.9579 | 0.8925 | 1.0000 | 0.9930
Without prior         | 0.9641 | 0.8502 | 0.7919 | 0.9979 | 0.9698
Without vectorization | 0.9351 | 0.4458 | 0.5037 | 0.7500 | 0.8129
Table 3. Comparison methods and parameters.

Methods  | Parameters
HB-MLCM  | Window size: 15 × 15, ε = 25, K = 4
MPCM     | Window sizes: 3 × 3, 5 × 5, 7 × 7; mean filter size: 3 × 3
PSTNN    | Patch size: 40 × 40, step: 40, λ = 3.7/√(min(m, n)·n₃), ε = 10⁻⁷
STLCF    | Window size: 5 × 5, frames = 5
STLDM    | Frames = 5
TLLCM    | Gaussian filter kernel
WTLLCM   | Window size: 3 × 3, K = 4
Proposed | Window radius: r = 6, regularization parameter: ϵ = 0.01
Table 4. Quantitative results of different methods.

         |           Data 1          |           Data 2          |          Data 3
Methods  | BSF     | SCRG  | AUC     | BSF     | SCRG  | AUC     | BSF    | SCRG  | AUC
HB-MLCM  | 17.320  | 4.465 | 0.8816  | 16.343  | 3.107 | 0.0229  | 5.642  | 4.741 | 0
MPCM     | 402.936 | 2.907 | 0.5223  | 185.529 | 0.359 | 0.0426  | NaN    | 0.032 | 0.0015
PSTNN    | 18.799  | 4.941 | 0.9974  | 8.690   | 2.156 | 0.6647  | 14.545 | 0.205 | 0.0951
STLCF    | 6.314   | 4.189 | 0.9699  | 7.160   | 2.245 | 0.6854  | 11.349 | 0.494 | 0.2224
STLDM    | 10.127  | 3.241 | 0.8905  | 8.500   | 2.443 | 0.7925  | 10.884 | 2.606 | 0.5433
TLLCM    | 6.144   | 1.407 | 0.3503  | 3.735   | 2.484 | 0.5163  | 8.173  | 1.647 | 0.5310
WTLLCM   | 12.210  | 5.564 | 0.9709  | 5.157   | 5.014 | 0.8244  | 10.245 | 2.845 | 0.7210
Proposed | 396.504 | 5.762 | 0.9908  | 199.205 | 3.176 | 0.9579  | 32.117 | 4.965 | 0.8925
Table 5. Quantitative results of different methods.

         |          Data 4           |           Data 5
Methods  | BSF     | SCRG  | AUC     | BSF     | SCRG  | AUC
HB-MLCM  | 8.800   | 5.428 | 0.9243  | 30.757  | 2.184 | 0.1100
MPCM     | 242.064 | 2.236 | 0.9000  | 129.428 | 1.407 | 0.0482
PSTNN    | 24.889  | 2.030 | 1.0000  | 47.581  | 5.210 | 0.8416
STLCF    | 5.185   | 1.967 | 0.9146  | 18.567  | 4.203 | 0.6956
STLDM    | 5.653   | 3.717 | 0.9219  | 56.253  | 6.547 | 0.8121
TLLCM    | 30.354  | 2.812 | 0.9997  | 13.933  | 0.923 | 0.0679
WTLLCM   | 45.199  | 2.944 | 0.9992  | 16.641  | 3.214 | 0.3164
Proposed | 64.105  | 4.757 | 1.0000  | 650.755 | 7.758 | 0.9930
Table 6. Computation time of different methods (in seconds).

Method   | Data 1 | Data 2 | Data 3 | Data 4 | Data 5
HB-MLCM  | 0.026  | 0.030  | 0.026  | 0.031  | 0.022
MPCM     | 0.032  | 0.053  | 0.062  | 0.041  | 0.050
PSTNN    | 0.210  | 0.527  | 0.891  | 0.428  | 0.293
STLCF    | 0.315  | 0.327  | 0.375  | 0.333  | 0.383
STLDM    | 1.571  | 1.650  | 1.634  | 1.582  | 1.702
TLLCM    | 1.078  | 1.143  | 1.165  | 1.132  | 1.186
WTLLCM   | 0.054  | 0.805  | 1.145  | 0.216  | 0.038
Proposed | 0.030  | 0.025  | 0.024  | 0.028  | 0.021