1.1. Related Work
According to their basic principles, existing detection methods can be divided into four categories: methods based on background subtraction, methods based on target enhancement (many of which are inspired by the human visual system, HVS), methods based on optimization, and methods based on deep learning.
The first category comprises the methods based on background subtraction, which predict the background image and then subtract it from the original image to obtain the target image. Classical background subtraction methods include the Top-Hat transformation [3], Max-mean/Max-median filtering [4], two-dimensional least mean square (TDLMS) filtering [5], and the low-pass filter (LPF). Many modifications of these classic methods have been proposed and are widely applied. The histogram rightwards cyclic shift binarization (HRCSB) combines background subtraction with histogram curve transformation [6]. The new white top-hat (NWTH) transformation improves detection performance through structuring element construction and operation reorganization [7]. The double-layer two-dimensional least mean square filter uses different filter settings for background suppression and target enhancement [8]. These methods assume that the background is flat. They work well on simple backgrounds, but they cannot cope with complex ones, because heterogeneous regions in complex backgrounds make accurate background prediction difficult.
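As a minimal illustration of the background subtraction idea, a white Top-Hat detector can be sketched as follows (the structuring element size and the mean-plus-k-sigma threshold are illustrative choices made here, not the settings of any cited method):

```python
import numpy as np
from scipy.ndimage import grey_opening

def tophat_detect(image, struct_size=5, k=3.0):
    """White Top-Hat small-target detection sketch.

    Morphological opening estimates the background (it removes bright
    features smaller than the structuring element); subtracting it from
    the original image leaves compact bright residuals, which are then
    thresholded to declare targets.
    """
    background = grey_opening(image, size=(struct_size, struct_size))
    residual = image.astype(np.float64) - background
    threshold = residual.mean() + k * residual.std()
    return residual, residual > threshold

# Synthetic example: flat background with one 3x3 bright target.
img = np.full((64, 64), 20.0)
img[30:33, 40:43] = 120.0
residual, mask = tophat_detect(img)
```

Because the 3x3 target is smaller than the 5x5 structuring element, the opening removes it entirely from the background estimate, and the residual isolates it cleanly; this is exactly the flat-background assumption that breaks down on heterogeneous scenes.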
The second category comprises the methods based on target enhancement, which directly suppress the background and enhance the target by filtering. The difference of Gaussians (DoG) [9] and the Laplacian of Gaussian (LoG) [10] are two classic methods built on the Gaussian model of a small target's gray-level distribution. Inspired by the human visual system (HVS) mechanism, researchers have developed many detection methods using local contrast or gray-level difference [11]. These methods usually take the form of a center-surround difference. For example, the average absolute gray difference (AAGD) is a typical center-surround difference method, but it does not utilize directional information and therefore cannot suppress clutter edges [12]. The absolute average difference with cumulative directional derivatives (AADCDD) improved AAGD by introducing directional information [13]. The absolute directional mean difference (ADMD) also uses directional information to suppress structural backgrounds [14]. The local contrast measure (LCM) introduced the classic nine-cell sliding window [15]. Many improvements to the LCM have since been developed. For example, the improved LCM (ILCM) [16] and the novel LCM (NLCM) [17] divide the image into sub-blocks to reduce computation. The multi-scale patch-based contrast measure (MPCM) adopts the product of the differences in opposite directions [18]. The tri-layer LCM (TLLCM) adopts a tri-layer nested window [19]. The homogeneity-weighted LCM (HWLCM) combines the LCM with the homogeneity of the cells [20]. The weighted strengthened local contrast measure (WSLCM) combines a strengthened LCM with a weighting function [21]. The above methods are spatial filtering methods, which are easy to implement and fast to compute, but they do not exploit the target's motion information. Considering the gray-level fluctuation caused by target motion in the time domain, researchers have developed many spatial–temporal filtering methods. The spatial–temporal local contrast filter (STLCF) [22] and the spatial–temporal local contrast map (STLCM) [23] calculate the temporal and spatial local contrast of moving small targets separately and then fuse them by multiplication. The spatial–temporal local difference measure (STLDM) detects targets directly in the 3-D spatial–temporal domain [24]. The novel spatiotemporal saliency method (NSTSM) was proposed for LSS IR target detection; it utilizes the variance and gray-intensity characteristics of target and background pixels in the spatiotemporal domain [25]. These target enhancement methods can effectively suppress heterogeneous regions in complex backgrounds, but they cannot deal with clutter that closely resembles a real small target.
The third category comprises the methods based on optimization, which transform small target detection into the optimization problem of recovering sparse and low-rank matrices. The IR patch-image (IPI) model assumes that the background matrix is low-rank and the target matrix is sparse, and restores the target image by optimization [26]. This model has been extensively researched, and many improved methods based on it have been proposed. Dai et al. proposed the weighted IR patch-image (WIPI) model to solve the problem of excessive target shrinkage, in which a target-likelihood coefficient is designed as the weight of the target patch-image [27]. Dai et al. also proposed the non-negative IR patch-image model based on partial sum minimization of singular values (NIPPS) to better suppress strong edges; it replaces the nuclear norm in the IPI model with the partial sum of singular values [28]. Like NIPPS, Zhao et al. also improved the sparsity term of the IPI model, proposing a method based on non-convex optimization with an Lp-norm constraint (NOLC), which replaces the nuclear norm with the Lp-norm [29]. Considering that the single-subspace assumption is unsuitable for estimating complex backgrounds, many researchers proposed models based on multiple subspaces. He et al. proposed the low-rank and sparse representation (LRSR) model, which constructs over-complete dictionaries for the sparse representation of small targets [30]. Wang et al. proposed stable multi-subspace learning (SMSL), which adopts a subspace learning strategy to improve robustness against complex backgrounds and noise [31]. To make the most of prior information, Dai et al. extended the IPI model to the tensor field and proposed the IR patch-tensor (IPT) model [32]. A large number of methods based on the IPT model then appeared. The partial sum of the tensor nuclear norm (PSTNN) achieves very fast computation [33]. The non-convex tensor rank surrogate joint local contrast energy (NTRS) method uses a non-convex tensor rank surrogate merging the tensor nuclear norm and the Laplace function for the background patch constraint [34]. Many researchers have extended the tensor model to the spatial–temporal domain, such as multiple subspace learning and spatial–temporal IPT (MSL-STIPT) [35], the spatial–temporal tensor model (STTM) [36], and the novel spatial–temporal tensor model with saliency filter regularization (STTM-SFR) [37]. These optimization-based methods can continuously improve performance through model refinement. However, their low-rank background assumption makes them highly demanding of background uniformity, and some clutter also satisfies the local sparsity assumption, producing false alarms. Meanwhile, they are generally time-consuming due to their iterative operations.
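The low-rank plus sparse separation underlying the IPI family can be illustrated with a simple alternating-projection scheme in the spirit of GoDec (a generic sketch with a hand-picked rank and threshold, not the solver used by any of the cited models):

```python
import numpy as np

def lowrank_sparse_split(D, rank=1, thresh=1.0, n_iter=30):
    """Alternate two projections: B is the best rank-`rank` approximation
    of D - T (the smooth background), and T keeps only the entries of
    D - B whose magnitude exceeds `thresh` (the sparse target component)."""
    T = np.zeros_like(D)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(D - T, full_matrices=False)
        B = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        R = D - B
        T = np.where(np.abs(R) > thresh, R, 0.0)
    return B, T

# Smooth rank-1 "background" plus one bright "target" pixel.
v = np.linspace(1.0, 2.0, 20)
background = np.outer(v, v)
D = background.copy()
D[5, 7] += 10.0
B, T = lowrank_sparse_split(D)
```

The sketch also makes the failure modes above concrete: if the background is not well approximated at low rank the target leaks into B, and any clutter pixel that is itself sparse and bright ends up in T alongside the true target. The repeated SVDs are also why such methods are comparatively slow.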
The fourth category comprises the methods based on deep learning, which first train a model to mine image features from a large number of samples and then detect small targets with the trained model. The convolutional neural network (CNN) is commonly used because it is well suited to learning hierarchical features of infrared images. Zhao et al. proposed a lightweight CNN called TBC-Net, which consists of two modules, target extraction and semantic constraint [38]. With these two modules, TBC-Net can incorporate high-level semantic constraint information into training. Since the pooling layers in a CNN can cause targets to be missed in deep layers, Li et al. proposed the dense nested attention network (DNA-Net), which preserves deep target features through progressive interaction between high-level and low-level features [39]. Considering that CNNs cannot capture long-range dependencies, many researchers have begun to use transformers based on the self-attention mechanism. Liu et al. proposed an IR small-dim target detection method with a transformer, which uses a feature enhancement module to improve feature learning for small-dim targets [40]. Chen et al. proposed a hierarchical vision transformer-based method called IRSTFormer, designed specifically for small target detection in large images [41]. Due to the small size and few features of IR small targets, it is difficult for purely data-driven methods to achieve high performance, so researchers have tried to combine neural networks with traditional small-target models. Dai et al. proposed a model-driven deep network that introduces local contrast and extends the application of traditional small target features to the field of neural networks [42].
The performance of deep learning-based methods relies heavily on a large number of training samples. However, obtaining an ample amount of training samples for IR small target detection is challenging in military scenarios, which hinders the application of deep learning approaches. Therefore, there is a need for further development of models based on small samples.
While the aforementioned methods demonstrate good performance in many scenarios, they struggle in heavy sea clutter environments. Extensive clutter on the sea surface can significantly interfere with detection, and some clutter closely resembles real small targets, making it difficult for existing methods to suppress them effectively.
1.2. Motivation
At a long imaging distance, the projection of a small target on the camera focal plane typically covers only a few pixels, or even less than one pixel. However, due to scattering, diffraction, and defocusing effects, the IR radiation emitted by the target is diffused, producing an image spot larger than the target's physical size. Although this spot is not perfectly circular, it is approximately isotropic. Since neither the target's IR radiation nor the properties of the optical system change dramatically over a short period, the appearance of the target spot remains relatively stable in time.
Figure 1a illustrates an IR image containing a small target and heavy sea clutter, with the small target marked by a red box. The first line of
Figure 1b shows the local images of the small target captured in five consecutive frames. It can be observed that the appearance of the small target varies little over these five frames and consistently remains approximately isotropic. This combination of isotropy in the spatial domain and stability in the temporal domain is referred to as Appearance Stable Isotropy (ASI).
Sea clutter primarily consists of waves and sun glints. The IR radiation emitted by waves gives rise to heterogeneous regions in the background, whose appearance is determined by the shape of the waves. Sun glints, on the other hand, are bright spots formed by sunlight reflecting off the sea surface, and their appearance depends on the local reflective surfaces. The sea surface itself exhibits significant randomness, resulting in clutter of irregular shape and varying size, and both isotropic and anisotropic clutter can coexist in a single image. Furthermore, the dynamic nature of the sea surface causes clutter to deform continuously: an initially isotropic clutter can become markedly anisotropic within a short duration. Consequently, sea clutter does not maintain a consistent appearance and lacks the Appearance Stable Isotropy characteristic.
In
Figure 1a, two examples of clutter are marked with yellow boxes, labeled as Clutter A and Clutter B, respectively. The second and third lines of
Figure 1b display the local images of Clutter A and B, respectively, captured over five consecutive frames. It can be observed that Clutter A initially appears isotropic in the first frame but undergoes subsequent changes in shape. By the fifth frame, Clutter A has become significantly anisotropic. On the other hand, Clutter B consistently demonstrates anisotropy.
The above analysis and examples demonstrate substantial differences between small targets and sea clutter in terms of ASI. Leveraging this distinction can effectively differentiate between them.
Based on the above analysis, this paper proposes a detection method that utilizes the appearance stable isotropy measure (ASIM). The contributions of this paper can be summarized as follows:
- (1)
The Gradient Histogram Equalization Measure (GHEM) is proposed to effectively characterize the spatial isotropy of local regions. It aids in distinguishing small targets from anisotropic clutter.
- (2)
The Local Optical Flow Consistency Measure (LOFCM) is proposed to assess the temporal stability of local regions. It facilitates the differentiation of small targets from isotropic clutter.
- (3)
By combining GHEM, LOFCM, and Top-Hat, ASIM is developed as a comprehensive characteristic for distinguishing between small targets and different types of sea clutter. We also construct an algorithm based on ASIM for IR small target detection in heavy sea clutter environments.
- (4)
Experimental results validate the superior performance of the proposed method compared to the baseline methods in heavy sea clutter environments.
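As a rough, hypothetical illustration of how gradient information can quantify spatial isotropy (this sketch is not the paper's GHEM definition; the bin count and the entropy-based score are arbitrary choices made here for illustration), one can histogram gradient orientations in a local patch and score how evenly they spread:

```python
import numpy as np

def orientation_uniformity(patch, n_bins=8):
    """Score in [0, 1]: near 1 when gradient orientations in the patch
    spread evenly over all bins (isotropic, e.g. a round spot), lower
    when one direction dominates (anisotropic, e.g. a straight edge)."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    # Magnitude-weighted orientation histogram.
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    total = hist.sum()
    if total == 0:
        return 0.0
    p = hist / total
    p = p[p > 0]
    # Normalized entropy: uniform histogram -> 1, single-bin -> 0.
    return float(-(p * np.log(p)).sum() / np.log(n_bins))

# An isotropic Gaussian blob scores high; a straight edge scores low.
yy, xx = np.mgrid[-7:8, -7:8]
blob = np.exp(-(xx**2 + yy**2) / 8.0)
edge = np.zeros((15, 15))
edge[:, 8:] = 1.0
u_blob = orientation_uniformity(blob)
u_edge = orientation_uniformity(edge)
```

A target spot, whose gradients point radially in all directions, yields a high score in every frame, while edge-like wave clutter concentrates its gradients in one orientation; combining such a spatial score with a temporal stability cue is the general idea pursued in the contributions above.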
The remainder of this paper is organized as follows:
Section 2 presents the proposed method, detailing its key components. Subsequently, in
Section 3, comprehensive experimental results and analysis are provided. Finally, this paper is concluded in
Section 4.