1. Introduction
Compared to the visible spectrum, the infrared band is broader and carries more detailed target information, and infrared optical imaging systems exhibit stronger penetration and better anti-interference performance. By passively exploiting the thermal radiation of targets, these systems can operate day and night, adapt to various environments, and remain covert. They are particularly effective under adverse, low-visibility weather such as fog, rain, snow, and frost, making them invaluable for industrial inspection, aerial surveillance, border patrol, and disaster rescue. Among infrared systems, Mid-Wave Infrared (MWIR) systems are sensitive to both thermal and reflective radiation. Cooled infrared detectors significantly reduce the thermal noise generated in high-temperature environments by cooling the core detector components; operating at lower temperatures enhances signal clarity and detection accuracy, and the resulting low-noise characteristics allow weak infrared radiation signals to be captured, substantially improving detection performance. Currently, most infrared imaging systems employ single-aperture configurations. However, as application scenarios and target detection requirements become increasingly diverse, single-aperture systems struggle to meet demands such as a wide field of view or rapid target detection. Building upon traditional single-aperture designs, researchers have therefore developed multi-aperture imaging systems, typically categorized into multi-view vision systems, camera arrays, microlens arrays, and compound eye lenslet arrays. Multi-view vision systems and camera arrays are typically bulky, while microlens arrays are constrained by aperture and focal length limitations, restricting their application scenarios. Consequently, researchers have developed bio-inspired compound eye imaging systems. These systems consist of numerous small optical units, which we term "ommatidia" by analogy to their biological counterparts, enabling wide field-of-view imaging with comparable image quality in a more compact form factor [1,2]. Moreover, compound eye visual information processing differs significantly from that of single-aperture systems. Generally, higher resolution demands greater processing power, and higher pixel counts entail more information to process; compound eye cameras, however, extract valuable information from limited resolution. Additionally, owing to their unique structural design, compound eye imaging systems can acquire more spatial and motion information about targets. Existing compound eye imaging systems are typically designed for visible light or uncooled infrared applications, whereas our cooled bio-inspired infrared compound eye camera exhibits enhanced detection capabilities for weak targets [3].
A key technology in infrared target detection is small target detection. Figure 1 shows a typical scenario of infrared small target detection against an aerial or space background. Infrared small targets are currently defined as targets occupying no more than 9 × 9 pixels [4]. With the advancement of modern technology, rapidly and accurately detecting small targets across diverse scenarios has become a primary task, so developing algorithms with high detection accuracy, strong robustness, and fast processing speed for complex scenarios presents a significant challenge. In an infrared bio-inspired compound eye system, adjacent ommatidia capture the same scene, but because of their geometric arrangement on a spherical surface, the same scene appears at different pixel coordinates across ommatidia, consistent with overlap ratio analysis. At the same time, for the same target, the relative position of each ommatidium results in varying target poses in the captured images; in the compound eye images, targets therefore exhibit different sizes and responses across ommatidia. Detecting all small targets appearing in the ommatidia is essential for further combining their geometric information to estimate target distance ranges.
Infrared small target detection methods can be broadly categorized into model-based approaches and deep learning-based approaches. Owing to the wide field of view and low pixel resolution of individual ommatidia in our infrared compound eye camera, detailed features are difficult to obtain. Therefore, within the model-based framework, we employ Low-Rank and Sparse Decomposition (LRSD) for small target detection. LRSD-based methods assume that infrared images consist of low-rank background signals, sparse target signals, and noise signals [6,7]. Specifically, in infrared imaging, some image patches in the background are approximately linearly correlated, indicating that the background satisfies the low-rank property. For targets, because of the considerable distance between real targets and the imaging system, the targets occupy only a few pixels in the entire infrared image, making the target component sparse relative to the whole image [7]. Performing low-rank and sparse decomposition on the original infrared image separates the low-rank and sparse components, with the sparse component corresponding to the target; applying threshold segmentation to the sparse component then yields the detection result. Based on these assumptions, Gao et al. [7] first developed the Infrared Patch-Image (IPI) model for infrared small target detection. However, on non-smooth backgrounds, the IPI model produces target images with strong edge residuals. Consequently, numerous improved algorithms have been proposed based on IPI, such as the Non-negative Infrared Patch-Image model via Partial Sum Minimization of Singular Values (NIPPS) proposed by Dai et al. [8], and the approach of Zhang and Xue [9,10], which employs a non-convex γ-norm to constrain the background, overcoming the limitations of the nuclear norm in the IPI model and achieving higher detection speeds, while using the $\ell_{2,1}$-norm to constrain sparse edges with approximately linear structures. Dai et al. [11] were the first to extend LRSD-based infrared small target detection from two-dimensional matrices to three-dimensional tensors, proposing the Reweighted Infrared Patch Tensor (RIPT) model. Following this tensor construction scheme, Zhang and Peng [12] utilized the Partial Sum of the Tensor Nuclear Norm (PSTNN) to constrain the low-rank component and effectively suppress the background, and Kong et al. [13] employed the log-based Tensor Fibered Nuclear Norm with hyper total variation (Log-TFNN) to suppress both background and noise.
In image sequences, the background exhibits low-rank properties in the temporal domain, while the target is sparsely distributed in the temporal domain [14]. To leverage the spatial–temporal correlations in infrared image sequences, Sun et al. [15] extended the IPI model from the spatial domain to the spatial–temporal domain. In recent years, many scholars have explored various methods, each with distinct advantages. Liu et al. [16] proposed a Non-convex Tensor Low-rank Approximation (NTLA) with Asymmetric Spatial–Temporal Total Variation (ASTTV) to enhance target detection capabilities. Wu et al. [17] constructed a four-dimensional infrared tensor from a series of infrared images and decomposed it into low-dimensional tensors using the Tensor Train (TT) technique and its extension, Tensor Ring (TR), while preserving the spatial structural features and temporal characteristics of the original data. Liu et al. [18] applied the RCTV method to small target detection in infrared image sequences, improving detection efficiency. Sun [19] proposed a novel approach based on multi-subspace learning and spatial–temporal tensor data structures. Lu [20] introduced a Long-term Spatial–Temporal Tensor (LSTT) model, employing image registration to achieve frame-to-frame alignment and constructing a new image tensor by directly stacking the aligned frames. Wei [21] developed a four-dimensional tensor model based on superpixel segmentation and statistical clustering for infrared dim target detection, further enhancing the spatial structural correlations among image patches. Liu [22] proposed a novel IPT model (termed IPT–TCTV), constructing an improved Spatial–Temporal Tensor (STT) model through sliding 3-D windows, which better preserves the spatial correlation and temporal continuity of multi-frame infrared images in the constructed tensor. Yin [23] introduced a 3-D paradigm framework that incorporates spatial–temporal weighting and regularization into the low-rank sparse tensor decomposition model. Zhao [24] developed an iterative corner and edge weighting method based on tensor decomposition, using corner intensity as the weight for target components and edge intensity as the weight for interference components, enabling more accurate separation of targets and interference. These recent methods preserve the characteristics of the original image sequences during tensor construction and calculate target weights from the original image, demonstrating strong performance in small target detection for single-aperture infrared image sequences. However, they are not well suited to leveraging the unique features of bio-inspired infrared compound eye images.
This paper proposes a low-rank and sparse decomposition method based on the image characteristics of bio-inspired infrared compound eyes. The method reconstructs the structural tensor of our infrared compound eye images according to their features. First, the entire compound eye image is segmented to retain the ommatidia regions. These ommatidia images are then arranged into a tensor along the temporal dimension according to their size and number. A compound eye structural weighting operator is designed to integrate information from all ommatidia images, effectively exploiting the scene correlation between adjacent ommatidia and the scene variability due to different imaging angles. We combine Representative Coefficient Total Variation (RCTV) and a reweighted $\ell_1$-norm that incorporates the compound eye structural weights into a novel model. The model imposes TV regularization on the representative coefficients to enhance computational efficiency and employs the compound eye structural weighting operator to improve the accuracy of target detection, enabling rapid and accurate detection of small infrared targets. Quantitative and qualitative experiments demonstrate that, compared to other methods, this approach offers significant advantages in quickly detecting small targets in infrared compound eye images and effectively suppressing background clutter.
2. Materials and Methods
2.1. Construction of the Compound Eye Spatiotemporal Tensor Model
We have designed a bio-inspired infrared compound eye camera that employs a high-performance cooled mid-wave infrared detector. The array of small lenses is arranged on a curved surface, with their optical axes perpendicular to the spherical mounting structure. The optical axis of the central microlens serves as the principal optical axis of the compound eye, with further optical axes extending outward at intervals of 10°. The microlenses on each concentric circle are evenly spaced, forming a field of view of 108° × 108°. The edge distortion of the microlenses is approximately 4–5%, essentially achieving wide-field imaging with minimal edge distortion [3].
Due to the unique structure and optical system of the bio-inspired infrared compound eye camera, the infrared compound eye images we obtain exhibit distinct structural characteristics. A complete compound eye image contains both ommatidial images with imaging information and non-imaging regions corresponding to the spherical shell. According to the arrangement of ommatidia in the compound eye camera, even the edge ommatidia have three adjacent neighbors, which means the same scene is captured by different clusters of adjacent ommatidia and appears in the imaging results as overlapping scenes with a certain geometric regularity. Typically, infrared images can be represented by the following model [6]:
$$f_D(x, y) = f_B(x, y) + f_T(x, y) + f_N(x, y)$$
Here, the subscripts $D$, $B$, $T$, and $N$ denote the original infrared image, background image, target image, and noise image, respectively. Based on the number of ommatidia and the imaging results from different ommatidia, we can represent the entire infrared compound eye image using the following model:
$$f_D = \bigoplus_{i=1}^{N_o} f_D^{(i)}$$
Here, $f_D^{(i)}$ represents the $i$-th ommatidial image, $N_o$ denotes the total number of ommatidia, and ⊕ is the composition operator, indicating that the ommatidial images, based on the mechanical geometric structure, jointly form the entire compound eye image.
In our previous research, we explored the scene overlap rate between adjacent ommatidia after imaging the same scene with the compound eye camera; this overlap rate can reach 50–70% [25]. Based on this characteristic and the geometric arrangement of the ommatidia, we make the following observations: the background images of adjacent ommatidia are correlated, and the target is present in a subset of adjacent ommatidia. As shown in Figure 2, we utilize these features to reconstruct our infrared compound eye image structural tensor: in the full compound eye image, only the ommatidial regions are retained, and the ommatidial images are stacked into a tensor along the temporal dimension.
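The construction just described can be sketched as follows. The helper below and its inputs (a frame list, precomputed ommatidium centers, and a crop radius) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def build_ommatidia_tensor(frames, centers, radius):
    """Stack square crops around each ommatidium center into a 4-D tensor.

    A minimal sketch of the tensor construction described above.
    frames  : list of F full compound eye images, each (H, W),
              with centers assumed at least `radius` from the border
    centers : (N, 2) integer array of ommatidium centers (row, col)
    radius  : half-size of the square crop enclosing one ommatidium
    returns : tensor of shape (2*radius, 2*radius, N, F)
    """
    F, N, s = len(frames), len(centers), 2 * radius
    tensor = np.zeros((s, s, N, F), dtype=np.float32)
    for f, img in enumerate(frames):
        for n, (r, c) in enumerate(centers):
            # Retain only the ommatidial region; non-imaging shell
            # areas of the full compound eye image are discarded.
            tensor[:, :, n, f] = img[r - radius:r + radius,
                                     c - radius:c + radius]
    return tensor
```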
By extending the IPI model to infrared compound eye images, our infrared compound eye image tensor model can be described as
$$\mathcal{D} = \mathcal{B} + \mathcal{T} + \mathcal{N},$$
where $\mathcal{D}$, $\mathcal{B}$, $\mathcal{T}$, and $\mathcal{N}$ represent the ommatidial tensor, background tensor, target tensor, and noise tensor, respectively.
The assumptions of low rankness and sparsity are well aligned with the structural characteristics of our compound eye infrared imaging data. Each frame consists of multiple ommatidium images capturing the scene from different but overlapping perspectives. This spatial redundancy, especially between neighboring ommatidia, results in strong correlations across the image regions, which in turn gives rise to a low-rank structure in the data.
To empirically verify this, we conducted mode-n unfolding of the four-dimensional data tensor (spatial width, spatial height, number of ommatidia, and time), followed by singular value decomposition. In all four modes, the singular values exhibit a steep decay: only a few components retain most of the energy, while the rest drop rapidly to near zero. This confirms a substantial low-rank structure across all dimensions; the singular value distributions are shown in Figure 2. Furthermore, in the context of infrared small target detection, the low-rank and sparsity priors have a clear physical meaning: the background tends to be smooth and slowly varying, and thus low rank, while small targets are localized and rare, and are naturally modeled as sparse components. This analysis validates the low-rank nature of the background tensor, $\operatorname{rank}(\mathcal{B}) \le r$, where the constant $r > 0$ reflects the complexity of the background, while the target tensor is sparse, satisfying the following condition:
$$\|\mathcal{T}\|_0 \ll \operatorname{size}(\mathcal{D})$$
Typically, the random noise in an infrared image can be assumed to be additive white Gaussian noise of intensity $\delta$, which satisfies the following condition:
$$\|\mathcal{N}\|_F \le \delta$$
Here, $\|\cdot\|_F$ denotes the Frobenius norm. Therefore, the low-rank background tensor and the sparse target tensor can be separated by considering the following problem:
$$\min_{\mathcal{B},\,\mathcal{T}} \ \operatorname{rank}(\mathcal{B}) + \lambda \|\mathcal{T}\|_0 \quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_F \le \delta$$
Here, $\lambda$ denotes the regularization weight parameter. However, since minimizing the $\ell_0$-norm is an NP-hard problem, the $\ell_1$-norm is typically used as a substitute, and the problem can be rewritten as:
$$\min_{\mathcal{B},\,\mathcal{T}} \ \operatorname{rank}(\mathcal{B}) + \lambda \|\mathcal{T}\|_1 \quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_F \le \delta$$
2.2. Compound Eye Structural Weighting Operator
For compound eye images, due to the overlapping imaging of different parts of the same scene by different ommatidia, adjacent ommatidia can capture the same target. Moreover, in our preliminary research, as the detection distance increases, the overlap ratio between ommatidia approaches a fixed value. Although adjacent ommatidia exhibit overlapping scenes, different ommatidia have varying scene complexities. For ommatidia with low scene complexity, the visual saliency of targets is higher. We use this characteristic to compute a weighting operator that enhances the sparsity of targets.
For each segmented ommatidial image, the Laplace operator is used to compute the gradients in four directions. After ignoring gradient values at the edges of the circular ommatidial imaging regions, we normalize the gradient values of each ommatidium and calculate the standard deviation of the normalized gradients. Because scene complexity varies across ommatidia, those with low scene complexity exhibit relatively smaller normalized gradient standard deviations. We take the ommatidial image with the smallest normalized gradient standard deviation as the baseline and, based on the geometric arrangement of the ommatidia on the spherical shell, calculate the horizontal and vertical displacement ranges of the remaining ommatidia relative to this baseline. The estimation method for the displacement ranges is as follows:
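A minimal sketch of this complexity measure and baseline selection is given below; the directional kernels and normalization are reasonable assumptions standing in for the paper's exact operators:

```python
import numpy as np
from scipy.ndimage import convolve

# Directional second-difference kernels (hypothetical choices standing in
# for the four-direction Laplace operator described in the text).
KERNELS = [
    np.array([[0, 0, 0], [1, -2, 1], [0, 0, 0]], dtype=float),   # horizontal
    np.array([[0, 1, 0], [0, -2, 0], [0, 1, 0]], dtype=float),   # vertical
    np.array([[1, 0, 0], [0, -2, 0], [0, 0, 1]], dtype=float),   # diagonal
    np.array([[0, 0, 1], [0, -2, 0], [1, 0, 0]], dtype=float),   # anti-diagonal
]

def gradient_std(omm, mask):
    """Normalized gradient standard deviation inside the circular mask."""
    grads = [np.abs(convolve(omm, k)) for k in KERNELS]
    g = sum(grads)[mask]              # keep only valid imaging pixels,
                                      # ignoring the circular edge region
    g = g / (g.max() + 1e-12)         # normalize to [0, 1]
    return g.std()

def pick_baseline(ommatidia, mask):
    """Index of the lowest-complexity ommatidium (smallest gradient std)."""
    return int(np.argmin([gradient_std(o, mask) for o in ommatidia]))
```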
We define the field of view of the compound eye camera as a function of three quantities: the half-field of view of a single ommatidium, the number of ommatidia, and the arrangement positions of the ommatidia on the spherical shell, which carry the geometric location information of the ommatidia.
Figure 3 shows the lens of our compound eye camera and a schematic diagram of the geometric arrangement of the ommatidia. The overlap rate between adjacent ommatidia can be expressed as [25] (Supplementary Materials):
We use the overlap rate of the imaging area to calculate the pixel displacement range, with the radius of the ommatidial imaging region converting the overlap ratio into pixel units. For all ommatidial images after displacement estimation based on geometric distance, it can be roughly assumed that their local imaging regions correspond to one another. The base ommatidial image can therefore be used to suppress the background gray-level response of the remaining ommatidial images, allowing a tolerance range around the estimated displacements; here, ⊗ denotes the multiplication operation over the regions matched by the displacement results. The resulting weight values are assembled into a tensor consistent with the input image tensor to be detected.
2.3. Weighted Regularization Model
Because the global low-rank property of the background makes it difficult to describe local details within the background, a common strategy for reducing the impact of edges, corners, noise, and other factors on detection performance is to add a regularization term that further constrains the background [26]. This preserves more detail within the background component, thereby reducing its impact on target detection.
Currently, when processing discrete images, two commonly used TV regularization terms are the Anisotropic Total Variation (ATV), based on the $\ell_1$-norm, and the Isotropic Total Variation (ITV), based on the $\ell_2$-norm. ITV tends to smooth the image while preserving edge clarity, whereas ATV places more emphasis on preserving image details and edges. For a given two-dimensional image $\mathbf{X}$ of size $m \times n$, and without considering the boundaries, the isotropic and anisotropic total variation are defined as
$$\|\mathbf{X}\|_{\mathrm{ITV}} = \sum_{i=1}^{m-1}\sum_{j=1}^{n-1} \sqrt{(x_{i+1,j}-x_{i,j})^2 + (x_{i,j+1}-x_{i,j})^2},$$
$$\|\mathbf{X}\|_{\mathrm{ATV}} = \sum_{i=1}^{m-1}\sum_{j=1}^{n-1} \left( |x_{i+1,j}-x_{i,j}| + |x_{i,j+1}-x_{i,j}| \right).$$
Clearly, ITV is isotropic but non-differentiable, and its optimization cannot match the speed, ease, and stability of the separable $\ell_1$-based formulation. Therefore, ATV is more commonly used and generally yields better results than ITV.
For convenience of representation, two auxiliary operators are introduced. Let $D_h$ and $D_v$ denote the two-dimensional difference operators in the horizontal and vertical directions, defined with periodic (circular) boundary conditions so that they act as convolutions. The ATV expression can then be rewritten as
$$\|\mathbf{X}\|_{\mathrm{ATV}} = \|D_h \mathbf{X}\|_1 + \|D_v \mathbf{X}\|_1.$$
In Ref. [27], Theorem 1 states that the spatial information of the original large-size matrix can, to some extent, be reflected in the much smaller coefficient matrix, thereby avoiding complex computations and improving the efficiency of target detection. That is, a matrix $\mathbf{B}$ with rank $r$ admits the decomposition $\mathbf{B} = \mathbf{U}\mathbf{V}^{\top}$, where $\mathbf{U}$ is an orthogonal matrix ($\mathbf{U}^{\top}\mathbf{U} = \mathbf{I}$), and the coefficient matrix can be obtained as $\mathbf{V} = \mathbf{B}^{\top}\mathbf{U}$. The TV semi-norm is applied to each slice of $\mathbf{V}$ [28] and the results are summed, yielding the Representative Coefficient Total Variation (RCTV) regularization:
$$\|\mathbf{B}\|_{\mathrm{RCTV}} = \sum_{j=1}^{r} \left( \|D_h \mathbf{v}_j\|_1 + \|D_v \mathbf{v}_j\|_1 \right),$$
where $\mathbf{v}_j$ denotes the $j$-th slice of $\mathbf{V}$. The expression can be simplified as
$$\|\mathbf{B}\|_{\mathrm{RCTV}} = \|D_h \mathbf{V}\|_1 + \|D_v \mathbf{V}\|_1.$$
Unlike existing TV-based methods that directly apply TV regularization to constrain the background tensor $\mathcal{B}$, RCTV employs TV regularization to constrain the representative coefficient matrix $\mathbf{V}$, describing the local smoothness prior. This method eliminates the need to compute a Singular Value Decomposition (SVD) of the full-size data and to solve complex regularization terms, thus lowering computational complexity and enhancing detection speed.
To further improve the sparsity of targets and differentiate sparse non-target points, we employ a reweighted $\ell_1$ minimization scheme combined with the compound eye structural weighting operator to adaptively assign weights to targets. In our model, ⊙ denotes the Hadamard product; the overall target weight is the Hadamard product of the reweighting term, whose entries are the reciprocals of the current target magnitudes, with the compound eye structural weights; the trade-off parameters are positive; and ε is a minimal value introduced to avoid division by zero in the reweighting computation.
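For readability, the overall shape of this model (Equation (21)) can be written as follows. This is a sketch assembled from the terms described above (RCTV on the representative coefficients plus a weighted $\ell_1$ penalty on the target), with symbol names chosen here for illustration rather than taken verbatim from the paper:
$$\min_{\mathbf{U},\,\mathbf{V},\,\mathcal{T}} \ \|D_h\mathbf{V}\|_1 + \|D_v\mathbf{V}\|_1 + \lambda\,\|\mathcal{W}\odot\mathcal{T}\|_1 \quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_F \le \delta,\ \ \mathcal{B} = \mathbf{U}\mathbf{V}^{\top},$$
with the weight tensor updated per iteration as $\mathcal{W} = \mathcal{W}_{\mathrm{ce}} \odot \big(1/(|\mathcal{T}| + \varepsilon)\big)$, where $\mathcal{W}_{\mathrm{ce}}$ denotes the compound eye structural weights.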
2.4. Iterative Optimization Process
Using the Alternating Direction Method of Multipliers (ADMM) [29], we solve Equation (21) by fixing all other variables while solving for one variable at a time. This approach introduces auxiliary variables and Lagrange multipliers, decomposing the original problem into a series of smaller subproblems that are alternately updated to progressively approach the optimal solution of the original problem.
By introducing auxiliary variables, we reformulate Equation (21) into a separable form. We then employ the Augmented Lagrange Multiplier Method (ALMM) [30] to solve the resulting convex optimization problem, which combines nuclear norm and $\ell_1$-norm minimization, rewriting the problem as the augmented Lagrangian of Equation (23). In this formulation, the $\mathcal{Y}_i$ represent the Lagrange multipliers, and $\mu$ denotes the penalty parameter. Subsequently, the ADMM method is employed to iteratively solve Equation (23). The complete solution process is outlined as follows.
- (1)
Updating the sparse auxiliary variables: by fixing all other variables in Equation (23), we obtain an $\ell_1$-regularized subproblem, which is solved in closed form by the soft thresholding function [31], whose standard form is recalled below.
- (2)
Updating the orthogonal basis matrix: by fixing all other variables in Equation (23), we obtain a subproblem whose solution is computed using the theorem given in [32].
- (3)
Updating the representative coefficient matrix: by fixing all other variables in Equation (23), we obtain a quadratic subproblem. Taking the derivative of this subproblem and setting it to zero yields a linear system that involves the difference operators and their "transpose" (adjoint) operators. Treating each difference operation as convolution with the corresponding difference filter, the closed-form solution can be derived by applying the Fourier transform to both sides of the equation and utilizing the convolution theorem [33]. In the resulting expression, $\mathcal{F}(\cdot)$ and $|\cdot|^{2}$ denote the Fourier transform and the element-wise square operation, respectively, while $\mathbf{1}$ represents a tensor with all elements equal to 1. A generic Fourier-domain sketch of this step is given below.
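To make the convolution-theorem step concrete, the following sketch solves a generic quadratic TV subproblem of this type in the Fourier domain. It is a minimal illustration with an assumed penalty parameter `rho` and periodic boundaries, not the paper's exact update:

```python
import numpy as np

def solve_tv_quadratic(rhs, rho=1.0):
    """Closed-form FFT solve of (rho*I + Dh^T Dh + Dv^T Dv) x = rhs.

    Assumes periodic boundaries so the forward difference operators
    diagonalize in the Fourier domain via the convolution theorem.
    rhs : (m, n) right-hand side assembled from the other fixed variables
    """
    m, n = rhs.shape
    # Frequency responses of the forward difference filters [-1, 1].
    dh = np.zeros((m, n)); dh[0, 0], dh[0, -1] = -1.0, 1.0
    dv = np.zeros((m, n)); dv[0, 0], dv[-1, 0] = -1.0, 1.0
    denom = rho + np.abs(np.fft.fft2(dh))**2 + np.abs(np.fft.fft2(dv))**2
    # Element-wise division in frequency replaces the large linear solve.
    return np.real(np.fft.ifft2(np.fft.fft2(rhs) / denom))
```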
- (4)
Updating the target tensor: by fixing all other variables in Equation (23), we obtain a weighted $\ell_1$ subproblem, which is again solved by soft thresholding with the adaptive weights.
- (5)
Updating the weight tensor: the adaptive weights are recomputed from the current target estimate, as described in Section 2.3.
- (6)
Updating the Lagrange multipliers and the penalty parameter.
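In standard ADMM form, these dual and penalty updates follow the usual rules (shown generically here; the paper's Equations (35)–(37) are not reproduced verbatim):
$$\mathcal{Y}_i^{k+1} = \mathcal{Y}_i^{k} + \mu^{k}\,\big(\text{residual of the } i\text{-th constraint}\big), \qquad \mu^{k+1} = \min\big(\rho\,\mu^{k},\ \mu_{\max}\big), \quad \rho > 1.$$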
Algorithm 1 summarizes the entire solution process. In the initialization step, the input data dimensions represent the pixel size of each ommatidium image, the number of ommatidia, and the number of frames in the image sequence, respectively. The factor matrices are computed from the given rank and trade-off parameters, and the Lagrange multipliers and penalty parameter are initialized to the values listed in the table.
Algorithm 1 Pseudocode outlining the main steps of the proposed algorithm
Input: compound eye image sequences
Initialization: model parameters, Lagrange multipliers, and penalty parameter
While not converged do
 1: Update the sparse auxiliary variables via Equation (25)
 2: Update the orthogonal basis matrix via Equation (27)
 3: Update the representative coefficient matrix via Equation (30)
 4: Update the target tensor via Equation (32)
 5: Update the weight tensor via Equation (33)
 6: Update the Lagrange multipliers via Equations (35) and (36)
 7: Update the penalty parameter via Equation (37)
 8: Check the convergence conditions
End while
Output: T
2.5. Complexity Analysis
The computation of the compound eye structural weights involves only matrix operations, so we primarily analyze the computational complexity of the iterative optimization process. For the constructed infrared image tensor, the per-iteration costs are as follows: the soft thresholding updates are element-wise and scale linearly with the tensor size; the FFT-based update of the representative coefficient matrix scales log-linearly with the tensor size; the update of the orthogonal factor requires an SVD whose cost grows with the tensor size and the (small) rank; and updating the weights and other parameters is linear in the tensor size. The overall complexity of the iterative solution is therefore dominated by the FFT and SVD steps.
For an input image of size 640 × 640, the detection time per frame is approximately 0.1821 s. This includes all preprocessing steps applied to the original image, such as retaining only the valid imaging regions and computing the weighting operator. These steps involve various matrix operations, and there is room for further optimization in future implementations to improve computational efficiency.
2.6. Convergence Analysis
Following the above algorithm, each variable is solved iteratively, and the optimization terminates when the relative change between successive iterates or the reconstruction residual falls below a preset tolerance; if any of these conditions is met, the iteration stops.
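A typical stopping rule of this kind (a standard choice; the paper's exact thresholds are not reproduced) is
$$\frac{\|\mathcal{T}^{k+1} - \mathcal{T}^{k}\|_F}{\|\mathcal{D}\|_F} \le \epsilon \quad \text{or} \quad \frac{\|\mathcal{D} - \mathcal{B}^{k+1} - \mathcal{T}^{k+1}\|_F}{\|\mathcal{D}\|_F} \le \epsilon,$$
for a small tolerance $\epsilon > 0$.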
2.7. Target Detection Process
As shown in Figure 4, the input sequence images are cropped to retain the effective imaging regions and unfolded according to the ommatidia image dimensions and the temporal sequence to obtain the input tensor. Simultaneously, the effective imaging regions of all ommatidia in a single frame are used to compute the compound eye structural weight tensor. The results are then iteratively updated following Algorithm 1.
3. Results
3.1. Objective Evaluation Metrics
- (1)
Model-based approach evaluation metrics
To assess the effectiveness of the algorithm, we adopt four widely used metrics: the Receiver Operating Characteristic (ROC) curve, Signal-to-Clutter Ratio Gain (SCRG), Background Suppression Factor (BSF), and Contrast Gain (CG). The ROC curve provides a comprehensive evaluation of detection performance: its y-axis represents the True Positive Rate (TPR), and its x-axis represents the False Positive Rate (FPR).
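These rates are computed in the standard way for small target detection (definitions consistent with their usage here):
$$\mathrm{TPR} = \frac{\text{number of correctly detected targets}}{\text{number of actual targets}}, \qquad \mathrm{FPR} = \frac{\text{number of false alarm pixels}}{\text{total number of image pixels}}.$$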
Detection performance can be quantified using the Area Under the Curve (AUC). A higher AUC value indicates better detection performance. The Signal-to-Clutter Ratio (SCR) measures the difficulty of detecting targets in infrared images.
It is defined as
$$\mathrm{SCR} = \frac{|\mu_t - \mu_b|}{\sigma_b},$$
where $\mu_t$ represents the average grayscale of the target region, $\mu_b$ denotes the average pixel value of the local neighborhood region, and $\sigma_b$ is the standard deviation of that neighborhood. $\mathrm{SCR}_{\mathrm{in}}$ and $\mathrm{SCR}_{\mathrm{out}}$ refer to the SCR values of the input source image and the output detection image, respectively, and the Signal-to-Clutter Ratio Gain is defined as
$$\mathrm{SCRG} = \frac{\mathrm{SCR}_{\mathrm{out}}}{\mathrm{SCR}_{\mathrm{in}}}.$$
The effectiveness of background suppression is quantified by the Background Suppression Factor (BSF), expressed as
$$\mathrm{BSF} = \frac{\sigma_{\mathrm{in}}}{\sigma_{\mathrm{out}}},$$
where $\sigma_{\mathrm{in}}$ and $\sigma_{\mathrm{out}}$ represent the standard deviations of the local region before and after suppression, respectively. In summary, higher SCRG and BSF values indicate better target enhancement and background suppression. However, in methods based on sparse and low-rank recovery, the background may be suppressed almost completely, making the output standard deviation nearly zero and the calculated value Inf. To address this issue, we additionally use the Contrast Gain (CG) metric, defined as
$$\mathrm{CG} = \frac{\mathrm{CON}_{\mathrm{out}}}{\mathrm{CON}_{\mathrm{in}}},$$
where $\mathrm{CON}_{\mathrm{in}}$ and $\mathrm{CON}_{\mathrm{out}}$ represent the contrast of the input and output infrared images, respectively. The contrast is calculated as
$$\mathrm{CON} = |\mu_t - \mu_b|.$$
Furthermore, in the measurements of SCRG, BSF, and CG, the neighborhood size is defined as the area of the effective imaging region of a single ommatidium.
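Assuming binary masks for the target region and its neighborhood (the effective imaging region of one ommatidium, per the text), these metrics can be computed as in the following sketch:

```python
import numpy as np

def scr(img, t_mask, b_mask):
    """Signal-to-clutter ratio |mu_t - mu_b| / sigma_b on given masks."""
    mu_t, mu_b = img[t_mask].mean(), img[b_mask].mean()
    return abs(mu_t - mu_b) / (img[b_mask].std() + 1e-12)

def metrics(inp, out, t_mask, b_mask):
    """SCRG, BSF, and CG as defined above; small epsilons stand in for
    the Inf cases that arise when the background is fully suppressed."""
    scrg = scr(out, t_mask, b_mask) / (scr(inp, t_mask, b_mask) + 1e-12)
    bsf = inp[b_mask].std() / (out[b_mask].std() + 1e-12)
    con_in = abs(inp[t_mask].mean() - inp[b_mask].mean())
    con_out = abs(out[t_mask].mean() - out[b_mask].mean())
    return scrg, bsf, con_out / (con_in + 1e-12)
```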
- (2)
Deep learning-based approach evaluation metrics
To fairly evaluate and compare the performance of both traditional model-based and deep learning-based small target detection methods, we adopt two commonly used metrics: precision and recall. These metrics are model-agnostic and can be consistently applied across different types of detection outputs (saliency maps, confidence maps, bounding boxes), making them well suited for cross-method evaluation.
Recall is defined as the proportion of true targets that are successfully detected, and precision as the proportion of correct detections among all detections:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
Furthermore, the Precision–Recall (PR) curve provides an intuitive and informative visualization of the trade-off between these two metrics, making it well suited for comparing detection performance across fundamentally different methodologies.
3.2. Data Characteristics
A collimator, blackbody, and circular aperture target are combined to form a simulated infinite-distance target source system, as shown in Figure 5. The radiative flux of the target source is controlled by adjusting the blackbody temperature, while the projected cross-sectional size of the target source is simulated by substituting circular aperture targets of different sizes.
A rotating reflection device reflects the radiation from the target source at a fixed angle and, through a rotating arm moving within the vertical plane, generates an optically moving target. The optical axis of the infrared compound eye camera is aligned with the center of the target’s rotational axis to ensure that the optical motion target appears to move with a certain line-of-sight angular velocity and angular acceleration within the compound eye camera’s field of view. By setting the angular velocity of the rotating optical target, the equivalent angular velocity of the target relative to the compound eye camera can be precisely adjusted.
Figure 6a–c show different small target datasets, where the targets simulate objects at an infinite distance, and the backgrounds represent the laboratory environment.
To analyze the performance of our small target detection method across various scenarios, we captured image sequences using the infrared bio-inspired compound eye camera. By moving the camera at non-uniform speeds, we captured several common ground-based infrared scenes as data backgrounds. Panels (1)–(5) in Figure 6 depict buildings at different distances and densities, forests, and sky with clouds.
We employed the target embedding method described in the IPI work [7] to add targets: the normalized target image is blended into the normalized background image, with the target amplitude scaled by the maximum grayscale value of the background scene of the ommatidium containing the target.
The data characteristics are listed in Table 1.
3.3. Parameter Settings
- (1)
Frame Count: We evaluate the impact of different F values on algorithm performance by varying F from 1 to 10 in increments of 1. We use ROC curves to compare the results of different parameter settings.
- (2)
Trade-off parameter: the trade-off parameter between target and background is related to the image dimensions and is typically computed from the parameter H in combination with the image size and frame count. We perform parameter analysis with different H values, adjusting H from 1 to 10 in increments of 1.
- (3)
Rank: the rank primarily describes the low-rank property of infrared images; it determines the size of the orthogonal matrix U and therefore affects the computational complexity. We analyze the impact of the rank on algorithm performance by varying it in increments of 1, with F set to 10 frames.
- (4)
α: α is used to balance the target and background. Typically, a lower α-value enables the reconstruction of more detailed background information, but the target may also be retained in the background, leading to missed detections. A higher α-value results in a coarser background, with many background details potentially preserved in the target image, causing false alarms.
Based on the results in Figure 7, we set the parameters to the best-performing values identified in this analysis.
3.4. Ablation Study
We conducted an ablation study to demonstrate the importance of the weighting operator. As shown in Figure 8, we compare detection results with and without the weighting operator across five image sequences. It can be clearly observed that incorporating the weighting operator effectively suppresses false alarms.
3.5. Comparative Algorithms
We primarily compared several classic small target detection algorithms and recent low-rank sparse decomposition-based algorithms that have demonstrated excellent performance on single-aperture infrared image sequences: 4D_TT and 4D_TR [17], ADMD [34], ASTTV_NTLA [16], ICEW [24], IPI [7], RCTV [26], RCTVW [18], NFTDGSTV [35], Top_hat [36], and MPCM [37]. The parameters for all algorithms are listed in Table 2, and each algorithm is applied to all ommatidial imaging regions during data processing. We also compared several representative deep learning-based small target detection methods: SSD [38], YOLOv5 [39], ILNet [40], and MSHNet [41].
3.6. Visual Analysis
As shown in Figure 9, 4D_TT and 4D_TR achieve good detection results but suffer from missed detections when target responses are low and from false alarms in regions with high background response. RCTV detects most targets but generates numerous false alarms, especially in ommatidia that do not capture targets. RCTVW further improves target–background separability, yet it still produces false alarms in non-target ommatidial regions and missed detections in images with low target response. MPCM and ADMD miss detections even for strong targets, indicating that the local features of small targets in compound eye images are not prominent. ASTTV_NTLA and NFTDGSTV produce fewer false alarms in images with smooth background grayscale transitions but tend to detect bright background regions as targets. ICEW and Top_hat show varying degrees of false alarms across the image sequences. The IPI algorithm detects targets almost completely but still misidentifies background regions as targets in ommatidial images without targets.
Traditional model-based small target detection methods typically produce intermediate or post-processed result maps, which are well suited for direct visual comparison and qualitative analysis. In contrast, deep learning-based methods often provide detection results in the form of bounding boxes or confidence maps, which differ in format and interpretation.
Therefore, in the qualitative analysis section, we mainly focus on comparing traditional algorithms, as their visual outputs are more consistent and interpretable for analysis. Nonetheless, we have included quantitative comparisons with deep learning-based detection networks to ensure a comprehensive evaluation of all methods.
3.7. Quantitative Analysis
Since detection methods based on low-rank sparse decomposition typically suppress the background almost completely, the output standard deviation becomes nearly negligible, resulting in Signal-to-Clutter Ratio Gain (SCRG) and Background Suppression Factor (BSF) values of Inf, as shown in Table 3. Our method better suppresses the background within the ommatidial imaging regions, and its higher Contrast Gain (CG) values indicate enhanced contrast. The proposed algorithm therefore demonstrates strong background suppression performance.
As illustrated in Figure 10 and Table 3, across the five image sequences our detection algorithm achieves more comprehensive target detection in each ommatidium at a given false alarm rate, exhibiting a higher detection rate.
In addition to the comparison with traditional model-based methods, we evaluate our approach against several representative deep learning-based small target detection networks, including SSD [38], YOLOv5 [39], ILNet [40], and MSHNet [41]. These methods are trained and tested on our compound eye infrared dataset with a training-to-testing split ratio of 9:1. The comparison in Figure 11 further validates the effectiveness and generalizability of our method under the same experimental conditions.
Although the current deep learning-based small target detection methods included in our comparison demonstrate limited performance on our infrared compound eye dataset, we believe that developing deep learning approaches specifically tailored for compound eye infrared imaging is a promising and meaningful direction.
In future work, we plan to construct a more diverse and task-specific dataset to better support data-driven learning. Additionally, we aim to explore novel neural network architectures that are well suited to the unique spatial and angular structure of compound eye images, with the goal of enhancing detection performance under complex scenarios.
3.8. Noise Impact
Our equipment utilizes a cooled detector, which mitigates the influence of noise to some extent. Nevertheless, it remains important to evaluate the robustness of our algorithm to noise. As illustrated in Figure 12, even after introducing varying degrees of Gaussian white noise, the proposed algorithm still detects all small targets, demonstrating its robustness in noisy environments.
3.9. Algorithm Runtime Analysis
The experiments were run in MATLAB R2020b on a 12th-generation Intel(R) Core(TM) i5-12600K processor at 3.70 GHz under Windows 11. The original single-frame infrared compound eye image input has dimensions of 640 × 640 pixels.
Taking into account the runtime and detection efficiency of the various methods in Table 4, our approach demonstrates superior performance.
4. Discussion
The ability to accurately detect small targets in all ommatidium images greatly benefits the further application of compound eye cameras.
Upon obtaining the pixel coordinates of all detected targets, the motion trajectory of the targets can be determined by combining the geometric structure of the compound eye camera with the camera calibration results. As illustrated in Figure 13, we simulated the 19-ommatidia structure of our compound eye camera and calculated the spatial coordinates of a target from the pixel coordinates of an ideal imaging result of a point in space; the error of the calculation is shown in Figure 13b. The method establishes imaging equations using the pixel coordinates from a varying number of ommatidia and derives the optimal result through the least squares method. The results indicate that detecting all targets in a single-frame compound eye image and utilizing more target information significantly reduces computational error. Our method's lower false alarm rate and higher detection rate therefore greatly aid the further recovery of target spatial information.
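As an illustration of this least-squares estimation, the sketch below intersects viewing rays from several ommatidia; the ray representation is an assumption standing in for the paper's actual imaging equations and calibration model:

```python
import numpy as np

def triangulate(rays_o, rays_d):
    """Least-squares estimate of a 3-D point from several ommatidium rays.

    Each ommatidium contributes a viewing ray (origin, unit direction)
    derived from its calibration and the target's pixel coordinates.
    rays_o : (N, 3) ray origins; rays_d : (N, 3) unit direction vectors.
    Requires at least two non-parallel rays, otherwise A is singular.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(rays_o, rays_d):
        P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the ray
        A += P
        b += P @ o
    # Point minimizing the sum of squared distances to all rays;
    # using more ommatidia (larger N) reduces the estimation error.
    return np.linalg.solve(A, b)
```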
Although the proposed method is evaluated using data from a custom-built mid-wave infrared (MWIR) compound eye camera, the underlying detection framework is not tightly coupled to any specific hardware configuration or waveband. The method is based on general assumptions about the data, such as the sparsity of target signals, the low-rank nature of the background, and the spatial redundancy among ommatidia—features that are generally applicable to infrared imaging in both MWIR and long-wave infrared (LWIR) systems.
Therefore, we expect the method to generalize well to other compound eye platforms with different microlens arrangements or operating wavelengths. Further validation on diverse datasets and hardware will be part of our future work to enhance the robustness and applicability of the proposed approach.