An Efficient Knowledge Distillation-Based Detection Method for Infrared Small Targets

Tang, Wenjuan; Dai, Qun; Hao, Fan

doi:10.3390/rs16173173

Open Access Communication

by Wenjuan Tang ¹,², Qun Dai ¹,* and Fan Hao ³

¹ College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

² HIWING Technology Academy, China Aerospace Science and Industry Corporation Limited (CASIC), Beijing 100074, China

³ School of Integrated Circuits, Beijing University of Posts and Telecommunications, Beijing 100876, China

* Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(17), 3173; https://doi.org/10.3390/rs16173173

Submission received: 13 June 2024 / Revised: 16 August 2024 / Accepted: 23 August 2024 / Published: 28 August 2024


Abstract: Infrared small-target detection is widely used in maritime surveillance, flight guidance, and other fields. However, extracting small targets from complex backgrounds remains challenging because of the small target scale and the complex imaging environment. Many studies improve detection precision by designing more elaborate model structures, which significantly increases the number of parameters (Params) and floating-point operations (FLOPs). In this work, a knowledge distillation-based detection method (KDD) is proposed to overcome this challenge. KDD uses the small-target labeling information provided by a large-scale teacher model to refine the training of the student model, thereby improving performance while keeping the student lightweight. Specifically, we added efficient local attention (ELA), which can accurately identify areas of interest while avoiding dimensionality reduction. In addition, we added the group aggregation bridge (GAB) module to connect low-level and high-level features and fuse features at different scales. Furthermore, a feature fusion loss was introduced to enhance detection precision. Extensive evaluations demonstrate that KDD performs better than several existing methods while achieving extremely low Params and FLOPs, as well as higher FPS.

Keywords: infrared small-target detection; knowledge distillation; lightweight; complex background

1. Introduction

Infrared small-target detection (ISTD) identifies dim and small targets in infrared images full of clutter interference [1]. It is widely applied in traffic management, maritime surveillance, and flight guidance [2,3,4]. Compared with targets in general detection tasks, infrared small targets exhibit the following traits: (1) because of long-distance imaging, the target usually occupies a small number of pixels, and its shape and texture characteristics are not obvious; and (2) infrared images are characterized by a low signal-to-clutter ratio (SCR) and a great deal of noise, so small targets are more likely to be submerged in the background and are difficult to identify [5]. These characteristics lead to false alarms and missed detections in ISTD, which makes the detection task challenging.

Researchers have continuously improved and optimized detection methods to solve the problems caused by the above characteristics. These methods can be categorized into traditional and deep learning-based methods. Among the traditional methods, filtering-based methods [6,7,8,9] are computationally cheap and can suppress uniform backgrounds to some extent. Still, they cannot handle complex backgrounds, resulting in low detection rates. Human visual system (HVS)-based methods [10,11,12] mainly make use of saliency maps, where the presence of small targets leads to a saliency change in the local texture of the image. HVS-based methods can be divided into two kinds: those based on spectral residuals and those based on local contrast. However, spectral residual methods cannot suppress clutter well, and local contrast-based methods generalize poorly and cannot adapt to complex background images. Low-rank sparse recovery methods [13,14,15,16,17] can be applied in scenes with high noise interference and low target contrast, but they consume large computational resources and time in practical applications.

Unlike traditional methods, deep learning-based methods acquire target features through data-driven learning. Liu et al. [18] constructed a five-layer detection network based on a multilayer perceptron. MDvsFA-cGAN [19] separates the ISTD task into two subtasks, one for reducing false alarms and one for reducing missed detections. The attentional local contrast network (ALCNet) [20] designs a cyclic shifting method for feature maps and utilizes bottom-up attention modulation. The dense nested attention network (DNANet) [21] achieves progressive interactions between higher and lower layers through a dense nested network. Ma et al. [22] generated background components and reduced their interference via a generative adversarial network. Multiscale Feature Extraction U-Net [23] extracts rich multi-scale information to accurately detect targets in complex backgrounds. Li et al. [1] proposed a diffusion model framework, IRSTD-Diff. The above methods can be regarded as converting ISTD into binary semantic segmentation, where the positive and negative samples are extremely imbalanced [24]. With its continuous development, YOLO has been applied to various kinds of scenarios [25,26], including infrared small-target detection. Liu et al. proposed a compact network based on YOLO to improve the accuracy and speed of detection [27]. Ciocarlan et al. introduced an inverse decision criterion in the training of a YOLO detector [28]. However, YOLO still performs poorly in identifying tiny objects. Some classic clustering methods for unbalanced data [29] may also offer inspiration for detecting infrared targets.

Because of the characteristics of infrared small targets, deep learning-based methods tend to extract features at different levels through purpose-built modules or to retain small-target information through feature fusion strategies. These complex network models improve detection accuracy but consume a great deal of storage space and computational resources. Therefore, we use knowledge distillation as a lightweight technique, so that a high-precision complex network assists the training of a compact student model. We also designed dedicated modules to locate small targets more accurately and to retain the characteristics of small targets at different levels.

Inspired by knowledge distillation [30], we propose KDD, a knowledge distillation-based ISTD network. DNANet is utilized as the teacher model in KDD, providing soft labels to guide and assist the training process of the student model. U-Net [31] is used as the basic architecture of the student model. Infrared targets occupy relatively small areas of the image and therefore place higher demands on the localization ability of the network. For this reason, we added the ELA [32] module to the student model, which can effectively utilize spatial information to accurately locate the position of interest without sacrificing the channel dimension. Depending on the imaging environment, small targets in infrared images exhibit different appearance characteristics, and acquiring multi-scale information helps the network better understand the contextual information. We therefore added the GAB [33] module to fuse the low-level and high-level features in groups and obtain multi-scale information. KDD is lightweight and also improves accuracy. The main contributions of this article are summarized below:

We designed a novel ISTD method based on the idea of knowledge distillation for infrared target detection. KDD not only improves the detection accuracy, but also achieves a lightweight network.
KDD incorporates the ELA module and the GAB module. The small-target pixels account for a relatively small percentage of pixels. To enhance the localization ability, the ELA module was added to achieve precise localization of the region of interest without sacrificing the channel dimension. The GAB module was integrated to exploit grouping for multi-scale information capture and to make rational use of the contextual information.

2. Related Work

Deep learning has demonstrated significant benefits in computer vision, speech understanding, and various other domains. It relies on large pre-trained models to improve the performance of deep neural networks in different tasks. These models are complex and have numerous parameters, making them difficult to deploy on mobile hardware. Therefore, model compression and acceleration techniques have become a research hotspot. Model quantization is one of the main effective methods. Other popular methods mainly include model pruning [34,35], knowledge distillation [30,36,37], neural network architecture search [38,39], and tensor decomposition [40,41]. Specifically, model pruning reduces model size by removing redundant parameters and connections while maintaining performance. Knowledge distillation uses trained teacher models to guide the training of student models by designing specific losses. Neural network architecture search automatically designs the neural network structure through algorithms. Tensor decomposition factorizes large tensors into smaller ones to reduce the storage space of the neural network model.

Large language models (LLMs) like GPT-4 and Gemini have emerged as seminal techniques for solving complex problems [42]. However, their wide application has been hindered by the high cost of use and limited access. In contrast, some open-source models (e.g., LLaMA [43]) do not have cost and access restrictions, but their performance lags behind because of their smaller size. Knowledge distillation has changed this situation by allowing open-source models to learn from extensively trained and fine-tuned proprietary models, thereby improving performance. Knowledge distillation is widely used in the field of natural language processing [44]. The gap between proprietary and open-source models has gradually narrowed or even disappeared through the use of knowledge distillation techniques, which has led to further improvements in both the performance and efficiency of open-source models.

Currently, knowledge distillation is categorized into logit-based distillation [45] and feature-based distillation [36]. To learn the decision process and probability distribution of the large model, logit-based knowledge distillation takes the logit output of a teacher model as a soft target for training a student model. Neural networks commonly produce class probabilities by employing a softmax output layer, which transforms the logit value $z_{i}$ calculated for each class into a probability $q_{i}$ through a comparison with the logit values of the other classes. It can be expressed as follows:

$$q_{i} = \frac{\exp(z_{i} / T)}{\sum_{j} \exp(z_{j} / T)}, \tag{1}$$

where T denotes the temperature. In using the cross-entropy loss of soft targets, the performance of the student model can be further improved.
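As a rough illustration of Equation (1), the following standard-library Python sketch computes temperature-scaled probabilities for a small list of logits. The function name `softmax_with_temperature` and the example logits are illustrative assumptions and do not come from the paper.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return q_i = exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]  # illustrative values, not taken from the paper
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)
```

With a higher temperature, the distribution in `soft` is flatter than the one in `sharp`, so the non-maximal classes carry more of the probability mass that a student can learn from.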

Feature-based knowledge distillation lets the student model use the intermediate representations learned by the teacher as hints during training. Since the intermediate hidden layer of the student is usually smaller than that of the teacher, additional parameters (an adaptation layer) need to be introduced to map the hidden layer of the student model to the dimensions of the hidden layer of the teacher model. In general, hidden layer feature distillation can be described as follows:

$$L_{fea} = \frac{1}{N} \sum_{c=1}^{C} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(T_{c,h,w} - f(S)_{c,h,w}\right)^{2}, \tag{2}$$

where N denotes the number of elements in the feature; T and S denote the teacher and student features, respectively; and f denotes the adaptation layer that maps the student feature to the same dimensions as the teacher feature map. In summary, the idea of knowledge distillation is worth exploring for ISTD.
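To make Equation (2) concrete, here is a minimal standard-library sketch that treats the features as nested lists. It assumes the adaptation layer f is a bias-free 1x1 convolution and that N is the number of teacher elements; the paper does not specify either choice, and the names `adapt_channels` and `feature_distillation_loss` are hypothetical.

```python
def adapt_channels(student, weights):
    """Hypothetical adaptation layer f: a 1x1 convolution without bias.

    student: nested list [C_s][H][W]; weights: nested list [C][C_s].
    Returns a nested list [C][H][W] matching the teacher's channel count.
    """
    height, width = len(student[0]), len(student[0][0])
    return [
        [
            [sum(w * student[k][h][x] for k, w in enumerate(row))
             for x in range(width)]
            for h in range(height)
        ]
        for row in weights
    ]

def feature_distillation_loss(teacher, student, weights):
    """Mean squared difference between teacher features and f(student)."""
    adapted = adapt_channels(student, weights)
    total, count = 0.0, 0
    for t_map, s_map in zip(teacher, adapted):
        for t_row, s_row in zip(t_map, s_map):
            for t_val, s_val in zip(t_row, s_row):
                total += (t_val - s_val) ** 2
                count += 1
    return total / count  # count is the number of teacher elements

# Toy example: the teacher has 2 channels, the student 1 channel, maps are 1x2.
teacher = [[[1.0, 2.0]], [[2.0, 4.0]]]
student = [[[1.0, 2.0]]]
weights = [[1.0], [1.5]]  # output channel 0 copies the input, channel 1 scales it by 1.5
loss = feature_distillation_loss(teacher, student, weights)
```

In this toy case only the second channel differs from the teacher (by 0.5 and 1.0), so the loss is (0.25 + 1.0) / 4 = 0.3125.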

3. Methodology

3.1. Overall Architecture

As shown in Figure 1, the overall structure of KDD has two components: the teacher model and the student model. DNANet was selected as the teacher model. A single infrared image was used as the network input, and the process of detection was achieved through feature extraction, feature fusion, and clustering. The core of the DNANet network lies in its dense nested architecture, enabling interactions of different level features, coupled with the cascaded attention module that adaptively boosts multi-scale features. The student model uses the U-Net structure. Convolution, pooling, and interpolation operations are utilized in the shallow part for encoding and decoding operations, and the ELA module is used in the deep part for accurate localization of regions of interest. In the skip connection, we added the GAB module to extract information about the interaction features between different levels.

3.2. Network Details

(1) Logits-based distillation: The purpose of knowledge distillation is to allow students to obtain valuable potential knowledge from the teacher. The negative labeling information provided by the teacher model contains a certain amount of information that can assist the student model in rapidly acquiring an understanding of the reasoning process of the teacher model. The specific process of knowledge distillation using the logits method for infrared small targets is shown in Figure 2.

(2) Loss function: To better learn the information provided by the teacher model, we calculated the loss between the teacher and student model outputs. Soft Intersection over Union Loss (SoftIoULoss) was chosen as the loss function because a smaller value indicates that the detection result of the student model has a higher degree of overlap with that of the teacher model. It can be expressed as follows:

$$L_{SoftIoU} = 1 - \frac{|O_{s} \cap O_{t}| + \theta_{smooth}}{|O_{s}| + |O_{t}| - |O_{s} \cap O_{t}| + \theta_{smooth}}, \tag{3}$$

where $O_{s}$ and $O_{t}$ are the outputs of the student and teacher models, respectively, and $\theta_{smooth}$ is a smoothing constant.
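A small sketch of Equation (3) on flattened probability maps is given below. It assumes the intersection of two soft maps is computed as the sum of element-wise products and uses a smoothing value of 1.0; both are common conventions rather than details confirmed by the paper, and the example maps are made up.

```python
def soft_iou_loss(student, teacher, smooth=1.0):
    """Soft IoU loss between two probability maps given as flat lists in [0, 1]."""
    # Assumed soft intersection: sum of element-wise products.
    intersection = sum(s * t for s, t in zip(student, teacher))
    union = sum(student) + sum(teacher) - intersection
    return 1.0 - (intersection + smooth) / (union + smooth)

teacher_map = [0.0, 0.9, 0.8, 0.0]    # illustrative teacher output
close_student = [0.1, 0.8, 0.7, 0.0]  # overlaps the teacher well
far_student = [0.9, 0.1, 0.0, 0.8]    # responds in the wrong places
loss_close = soft_iou_loss(close_student, teacher_map)
loss_far = soft_iou_loss(far_student, teacher_map)
```

The student map that overlaps the teacher output yields the smaller loss, which is the behavior the text describes.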

We used BCEloss and Diceloss [46] to measure the difference between the results of the student model and the mask. BCEloss measures the pixel-wise quality of the binary classification predictions, and Diceloss serves as a region-based loss function. We also fused the feature maps of different stages by convolution and interpolation operations and generated a predicted mask for comparison with the ground truth. The losses are expressed as follows:

$$L(O, M) = L_{BCE}(O, M) + L_{Dice}(O, M), \tag{4}$$

$$L_{hard} = L(O_{s}, M) + L(O_{pg}, M), \tag{5}$$

where $O$ denotes a predicted output, $O_{pg}$ denotes the result of feature fusion, and $M$ denotes the ground truth.
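Equations (4) and (5) can be sketched as follows for flattened predictions. The clamping constant `eps`, the Dice smoothing term, and the equal weighting of the two terms are assumptions made for this illustration; the paper only states that the two losses are summed.

```python
import math

def bce_loss(pred, mask, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, m in zip(pred, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += -(m * math.log(p) + (1 - m) * math.log(1.0 - p))
    return total / len(pred)

def dice_loss(pred, mask, smooth=1.0):
    """1 - Dice coefficient; `smooth` is an assumed stabilizing constant."""
    intersection = sum(p * m for p, m in zip(pred, mask))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(mask) + smooth)

def combined_loss(pred, mask):
    """Eq. (4): L(O, M) = L_BCE(O, M) + L_Dice(O, M)."""
    return bce_loss(pred, mask) + dice_loss(pred, mask)

def hard_loss(student_out, fused_out, mask):
    """Eq. (5): L_hard = L(O_s, M) + L(O_pg, M)."""
    return combined_loss(student_out, mask) + combined_loss(fused_out, mask)

mask = [0, 1, 1, 0]                 # toy ground truth
student_out = [0.1, 0.8, 0.7, 0.2]  # toy student prediction O_s
fused_out = [0.2, 0.6, 0.9, 0.1]    # toy feature-fusion prediction O_pg
value = hard_loss(student_out, fused_out, mask)
```

Both predictions are compared against the same ground-truth mask, so the fused prediction acts as an additional supervised output rather than as a distillation target.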

3.3. Efficient Local Attention

U-Net realizes the encoding and decoding operations through convolutional operations, which may lose feature information and affect the subsequent segmentation or detection tasks. Although attention mechanisms can significantly improve the performance of models, existing implementations are often accompanied by a reduction in channel dimensions or an increase in network complexity. For example, the SE block attention [47] only encodes inter-channel information and ignores the potential benefits of spatial location information. The coordinate attention (CA) [48] method focuses on spatial information but has insufficient generalization ability.

In our approach, we use ordinary convolution to extract features in the shallow layers and replace the ordinary convolution operation with the ELA module in the deep layers. Infrared small targets occupy a relatively small proportion of the image and place higher demands on the localization ability of the network. The ELA module combines one-dimensional convolution and group normalization as feature enhancement techniques to help the network focus its attention on regions that may contain small targets without sacrificing the channel dimension, thus improving the accuracy and precision of localization. Figure 3 shows the schematic diagram of ELA.

The input feature map x was average pooled along two spatial directions, the horizontal direction $(H, 1)$ and the vertical direction $(1, W)$, and the outputs of the $c$-th channel at height h and width w are denoted as follows:

$$z_{c}^{h}(h) = \frac{1}{W} \sum_{0 \leq i < W} x_{c}(h, i), \tag{6}$$

$$z_{c}^{w}(w) = \frac{1}{H} \sum_{0 \leq j < H} x_{c}(j, w). \tag{7}$$
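The two pooling operations in Equations (6) and (7) can be checked on a toy channel with plain Python lists. The helper name `directional_average_pool` and the example values are illustrative only.

```python
def directional_average_pool(channel):
    """Pool one H x W channel along each axis.

    Returns (z_h, z_w): z_h[h] averages row h over its W columns (Eq. 6),
    and z_w[w] averages column w over its H rows (Eq. 7).
    """
    height, width = len(channel), len(channel[0])
    z_h = [sum(row) / width for row in channel]
    z_w = [sum(channel[j][w] for j in range(height)) / height
           for w in range(width)]
    return z_h, z_w

# A 3 x 4 toy channel with one bright pixel at row 1, column 2.
channel = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 8.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
z_h, z_w = directional_average_pool(channel)
```

The row profile `z_h` peaks at row 1 and the column profile `z_w` peaks at column 2, so the two vectors together preserve the position of the bright pixel.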

The two resulting feature vectors were enhanced by one-dimensional convolution, and a group normalization operation was added to process the enhanced positional information. Finally, the attention features for the two spatial directions were output through an activation function. The process is represented as follows:

$$f_{h} = \delta(GN(Conv1d(z^{h}))), \tag{8}$$

$$f_{w} = \delta(GN(Conv1d(z^{w}))), \tag{9}$$

where $Conv1d$ denotes the 1D convolution operation, with the kernel size set to 7 and the padding set to 3; $GN$ denotes group normalization, with the number of groups set to 8; and $\delta$ denotes the Sigmoid function.

Finally, the horizontal and vertical attention outputs were multiplied with the input, and the product was fed into a 2D convolution to obtain the final output:

$$y = Conv2d(f_{h} \otimes f_{w} \otimes x). \tag{10}$$
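The following standard-library sketch walks through Equations (6) to (10) on nested lists. It is a simplified illustration under several assumptions that are not taken from the paper: a single fixed length-7 kernel is shared by all channels instead of learned weights, group normalization has no learned scale or shift, the toy input uses two groups rather than eight, and the final Conv2d is omitted. All function names are hypothetical.

```python
import math

def conv1d_same(signal, kernel):
    """Zero-padded 1D convolution (cross-correlation form) that keeps the length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def group_norm(vectors, num_groups, eps=1e-5):
    """Normalize per-channel vectors group by group, without learned scale or shift."""
    size = len(vectors) // num_groups
    out = []
    for g in range(num_groups):
        group = vectors[g * size:(g + 1) * size]
        values = [v for vec in group for v in vec]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([(v - mean) * scale for v in vec] for vec in group)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def attention(pooled, kernel, num_groups):
    """delta(GN(Conv1d(z))) applied to a list of per-channel pooled vectors."""
    convolved = [conv1d_same(z, kernel) for z in pooled]
    return [[sigmoid(v) for v in vec] for vec in group_norm(convolved, num_groups)]

def ela_gate(x, kernel, num_groups):
    """Gate x ([C][H][W] nested lists) with f_h * f_w; the final Conv2d is omitted."""
    height, width = len(x[0]), len(x[0][0])
    z_h = [[sum(row) / width for row in ch] for ch in x]
    z_w = [[sum(ch[j][w] for j in range(height)) / height for w in range(width)]
           for ch in x]
    f_h = attention(z_h, kernel, num_groups)
    f_w = attention(z_w, kernel, num_groups)
    return [[[f_h[c][h] * f_w[c][w] * ch[h][w] for w in range(width)]
             for h in range(height)] for c, ch in enumerate(x)]

kernel = [k / 16 for k in (1, 2, 3, 4, 3, 2, 1)]  # length 7, so the padding is 3
x = [
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 8.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]],
]
y = ela_gate(x, kernel, num_groups=2)
```

Because both gates lie strictly between 0 and 1, the output never exceeds the input in magnitude, and the channel dimension is unchanged throughout, which matches the property of ELA highlighted in the text.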

3.4. Group Aggregation Bridge

One of the advantages of U-Net is the incorporation of encoder information into the decoding process through skip connections. In target detection tasks, it is necessary to obtain information about features at different scales. Because of the different imaging distances and angles, targets present different appearance characteristics and are easily disturbed by noise and other factors. Acquiring multi-scale information captures visual representations at different scales and provides a better understanding of contextual information, which improves accuracy. Therefore, we introduced the GAB module in the skip connection. Figure 4 shows the schematic diagram.

GAB receives low-level features and high-level features. First, the size of the high-level features was changed using 2D convolution and interpolation to match the low-level features:

$$y_{h} = P(Conv(x_{h})), \tag{11}$$

where P denotes the interpolation, and $Conv$ denotes the Conv2d.

Second, the different levels of features were divided into four groups along the channel direction, which makes the correlation between each group of features stronger. After that, we combined the high-level and low-level counterparts and performed layer normalization and 2D convolution for each combination to extract information at different scales.

$$z_{h} = \{y_{1}^{h}, y_{2}^{h}, y_{3}^{h}, y_{4}^{h}\}, \quad z_{l} = \{x_{1}^{l}, x_{2}^{l}, x_{3}^{l}, x_{4}^{l}\}, \tag{12}$$

$$\alpha_{i} = DConv(Layer(C(y_{i}^{h}, x_{i}^{l}))), \tag{13}$$

where $z_{h}$ and $z_{l}$ denote the division results of high-level features and low-level features, respectively; C denotes the cat (concatenation) operation; $Layer$ denotes the LayerNorm operation; and $DConv$ denotes the dilated Conv2d applied to the different subgroups with dilation rates of 1, 2, 5, and 7.

Finally, the four combinations were concatenated again, and the feature interaction was realized by convolution:

$$\beta = Conv(Layer(C(\alpha_{1}, \alpha_{2}, \alpha_{3}, \alpha_{4}))). \tag{14}$$
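The grouping and pairing steps of Equations (12) and (13) can be illustrated with channel labels instead of real tensors. The sketch below is only a bookkeeping example: the channel counts are made up, the helper names are hypothetical, and the 3x3 kernel size used to compute the spatial extent of each dilated convolution is an assumption, since the paper only lists the dilation rates.

```python
def split_into_groups(channels, num_groups=4):
    """Split a list of channels into equally sized consecutive groups (Eq. 12)."""
    if len(channels) % num_groups:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(channels) // num_groups
    return [channels[i * size:(i + 1) * size] for i in range(num_groups)]

def pair_groups(high, low, num_groups=4):
    """Concatenate the i-th high-level group with the i-th low-level group."""
    return [h + l for h, l in zip(split_into_groups(high, num_groups),
                                  split_into_groups(low, num_groups))]

def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

high = [f"y{c}" for c in range(8)]  # resized high-level channels (labels only)
low = [f"x{c}" for c in range(8)]   # low-level channels (labels only)
pairs = pair_groups(high, low)
# Dilation rates come from the text; the 3x3 kernel size is an assumption.
extents = [effective_kernel(3, d) for d in (1, 2, 5, 7)]
```

Under the assumed 3x3 kernel, the four branches cover windows of 3, 5, 11, and 15 pixels, which shows how different dilation rates let each group look at a different scale.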

4. Experiments

4.1. Experimental Setup

(1) Datasets: We conducted experiments on three datasets, IRSTD-1k [4], SIRST [49], and MSISTD [50]. The IRSTD-1k dataset comprises one thousand infrared images covering various types of small targets including vehicles, aircraft, boats, etc. The background of the dataset covers a wide range, including cities, fields, mountains, oceans, rivers, and skies with heavy clutter and noise [4]. The SIRST dataset comprises 427 images. A significant proportion of these targets are low in brightness and are lost in cluttered and complex backgrounds. The MSISTD dataset consists of 1077 images with multi-scene, multi-scale, and lower SCR targets. It covers all types of scenes including sky, sea, land, forest, urban buildings, etc. The datasets are split into training and testing sets in a ratio of 8:2.

(2) Comparison methods: We compared KDD with some SOTA methods, including New TopHat [6], nonconvex rank approximation minimization (NRAM) [15], asymmetric contextual modulation (ACM) [49], DNANet [21], interior attention-aware network (IAANet) [51], attention-guided pyramid context networks (AGPCNet) [2], and lightweight IR small-target segmentation network (LW-IRSTNet) [52]. New TopHat improves on the classic top-hat method by utilizing two distinct but related structural elements and by taking into account the differences between the target and its surroundings. NRAM employs a tighter non-convex rank surrogate and a weighted $l_{1}$ norm. ACM utilizes a bottom-up mechanism to preserve low-level features with fine details. DNANet is designed with densely nested interaction modules to maintain deep information about small targets, and it utilizes cascaded channel and spatial attention modules to enhance multilevel features. IAANet first captures the coarse target area and filters out the background, then performs fine detection. AGPCNet is designed with attention-guided context blocks to enhance the ability to sense small targets, and it uses context pyramids and asymmetric fusion modules to retain small-target feature information. LW-IRSTNet integrates regular convolution, depthwise separable convolution, atrous convolution, and asymmetric convolution modules.

(3) Evaluation Metrics: Among the most commonly used metrics to assess detection accuracy are $recall$ and $precision$. Pursuing high $precision$ or high $recall$ alone does not give a sufficiently comprehensive measure of the performance of a detection method. For a more balanced assessment, we introduced $F1$, $IoU$, and the ROC curve. The calculation of $F1$ is shown in Equation (15). $IoU$ reflects the accuracy of localization through the ratio of the intersection to the union of the prediction results and the target, and its calculation is shown in Equation (16), where $A_{i}$ and $A_{u}$ denote the areas of the intersection and the union, respectively. $AUC$ is the area under the ROC curve, which quantifies the trade-off between correctly detected targets and misclassifications.

$$F1 = 2 \times \frac{Recall \times Precision}{Recall + Precision}, \tag{15}$$

$$IoU = A_{i} / A_{u}. \tag{16}$$
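As a worked example of Equations (15) and (16), the sketch below computes the metrics for two flattened binary masks. It assumes pixel-level counting, which the paper does not state explicitly, and the function name `pixel_metrics` and the toy masks are illustrative.

```python
def pixel_metrics(pred, target):
    """Precision, recall, F1 and IoU for flat binary (0/1) masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    union = tp + fp + fn                # A_u: area of the union
    iou = tp / union if union else 0.0  # A_i / A_u with A_i = tp
    return precision, recall, f1, iou

pred = [1, 1, 1, 0, 0, 0]    # toy prediction
target = [1, 1, 0, 1, 0, 0]  # toy ground truth
precision, recall, f1, iou = pixel_metrics(pred, target)
```

Here two pixels are correct, one is a false alarm, and one is missed, so precision, recall, and F1 all equal 2/3 while IoU equals 2/4 = 0.5. The example shows that IoU penalizes the same errors more strongly than F1.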

(4) Implementation Details: The method was implemented in PyTorch on a computer with an NVIDIA GeForce GTX 1650. The teacher model was trained with the original parameters of DNANet, for 1500 epochs with a batch size of 4. For the student model, the AdamW optimizer was employed with an initial learning rate of 0.001 and a weight decay of 0.01, and the learning rate was adjusted using CosineAnnealingLR. The student model was trained for 150 epochs with a batch size of 4.
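For intuition about the learning-rate schedule, the closed-form cosine annealing rule can be evaluated without any deep learning library. The sketch below assumes a period equal to the 150 training epochs and a minimum learning rate of 0; neither value is reported in the paper. In practice, PyTorch provides `torch.optim.AdamW` and `torch.optim.lr_scheduler.CosineAnnealingLR` for this purpose.

```python
import math

def cosine_annealing_lr(epoch, base_lr=0.001, t_max=150, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr to eta_min over t_max epochs."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)

# Learning rate at every epoch boundary, under the assumed t_max and eta_min.
schedule = [cosine_annealing_lr(epoch) for epoch in range(151)]
```

Under these assumptions the learning rate starts at 0.001, passes through 0.0005 at epoch 75, and reaches 0 at epoch 150.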

4.2. Basic Experiments

(1) Visual Evaluation: Figure 5, Figure 6 and Figure 7 show some visualization results of the different methods on the SIRST, IRSTD-1k, and MSISTD datasets, where red indicates correct detection results, green indicates missed detections, and yellow denotes false alarms.

Figure 5 illustrates the detection results on the SIRST dataset. For Images (1)–(3), which contain prominent bright regions, New TopHat, NRAM, and IAANet were prone to false alarms when detecting the target, and AGPCNet was prone to missed detections of dark targets. However, KDD was able to avoid the interference of the highlighted regions and produce detection results that closely match the masks. For targets that are very dark but lie in a simple background, e.g., Images (4)–(5), New TopHat and NRAM were able to detect the target location, but the detection results were incomplete. IAANet and AGPCNet were prone to missed detections of dark targets. In the challenging building background (e.g., Image (6)), New TopHat, NRAM, ACM, IAANet, and AGPCNet all produced varying degrees of false alarms. In summary, KDD can perform relatively stable detection in different situations.

The results on the IRSTD-1k dataset are shown in Figure 6. For images where small targets are more obvious and the background interference is small (e.g., Image (1)), all methods were able to accurately localize the target, but the New TopHat method still produced false detections. When the target was obvious but the image contained interference such as highlighted areas in the background (e.g., Images (4), (5)), New TopHat, NRAM, ACM, IAANet, AGPCNet, and LW-IRSTNet all produced a certain degree of false detection for highlighted areas. In images where the target was not obvious and there was a great deal of distracting information in the background (e.g., Images (2), (3), and (6)), the detection was susceptible to background influence, and the target was difficult to distinguish from the highlighted regions in the background. New TopHat and NRAM produced a great many false alarms, and AGPCNet missed targets. From the detection results on the whole dataset, it can be concluded that the traditional methods have defects in the completeness of their target detection. The size of the detection result differed significantly from the mask when there was more interfering information in the background.

Figure 7 shows the visualized detection results for the MSISTD dataset. For small-target images with a simple sky background (e.g., Image (2)), all of the methods except NRAM were able to detect the small-target region, but, because of the influence of clouds, New TopHat, IAANet, and LW-IRSTNet were prone to false detection. For infrared images with a relatively large proportion of small-target pixels (e.g., Image (4)), all of the methods except ACM were able to detect the targets, although some targets were still missed. In images with urban building backgrounds and more obvious targets (e.g., Images (1), (3), and (6)), the detection was affected by the highlighted areas of the buildings: New TopHat and NRAM produced a large number of false detections, and IAANet and LW-IRSTNet produced a small number. In images with complex backgrounds and unclear targets (e.g., Image (5)), NRAM and AGPCNet experienced missed detections, and New TopHat, NRAM, and IAANet produced false detections. Overall, KDD was able to obtain more satisfactory results in both simple and complex scenes.

(2) Numerical Evaluation: Table 1 lists the $F1$, $IoU$, and $AUC$ for the different methods on the three datasets, with the best results highlighted in red and the second-best results highlighted in green. Figure 8 shows the ROC curves.

(a): SIRST: From Table 1, it is evident that deep learning methods are superior to traditional methods. KDD scored the highest in $recall$, $F1$, and $IoU$. Although DNANet had a high $precision$, its $recall$ was only average, so its $F1$ value, which combines the two, was not good. The results of the traditional methods included false alarms, which enlarged the union area between the detection results and the targets and thus resulted in smaller $IoU$ values. The $AUC$ of KDD reached 0.9601 on the SIRST dataset.
(b): IRSTD-1k: In the quantitative comparison, KDD had the best $IoU$, which was 0.6359, and the ROC curve is shown in Figure 8b. Although DNANet performed the best on $precision$ and $F1$, its margin over KDD was very small, and DNANet was not the best on several other metrics. In addition, because of the large impact of background and clutter in the images, as well as the small target size, the metrics on this dataset were lower than those on the SIRST dataset. Overall, KDD still obtained the best detection performance.
(c): MSISTD: In the quantitative evaluation, KDD had the highest values of $F1$ and $IoU$, with 0.7571 and 0.6688, respectively. The ROC curve is shown in Figure 8c. DNANet also performed well on the MSISTD dataset and achieved the highest $precision$, but its $recall$ was lower than that of KDD, so its $F1$, which combines the two, was not the best. IAANet achieved the highest $AUC$, and KDD ranked second.

4.3. Ablation Experiments

In KDD, the basic network structure of the student model is U-Net, into which three parts are incorporated: the ELA module, the GAB module, and the feature fusion loss. To assess the effect of each part on the overall network performance, a set of ablation studies was conducted on two distinct datasets. Table 2 lists the results of the ablation experiments. In the absence of the ELA module and of the feature fusion loss, the model performed poorly on both datasets in $F1$, $IoU$, and $AUC$. In the absence of the GAB module, although $recall$ was good, reaching 0.8264 and 0.8118 on the two datasets, $precision$ was worse at 0.7957 and 0.6216, respectively, which led to a poorer balanced measure $F1$ on both. When all three parts were present, the evaluation metrics were better. In conclusion, the additional modules incorporated into the fundamental structure proved to be beneficial in enhancing network performance.

4.4. Complexity Analysis

Computational complexity is an important index for detection methods. Table 3 provides a comparison of the Params, FLOPs, and FPS for the compared methods. Notably, KDD achieved the lowest values for both Params and FLOPs, followed by ACM in second place. New TopHat had a higher FPS value among the traditional methods, but the method is simpler and its results contained more missed detections and false alarms. Among the deep learning-based methods, KDD had the highest FPS value because it is a lightweight network. The FPS values of DNANet and ACM were second and third, respectively.

5. Conclusions

This paper proposes a lightweight method based on knowledge distillation for ISTD. Utilizing the concept of knowledge distillation, KDD transfers labeling information from the teacher model to the student model, aiding the training process of the latter. In the student model, the region of interest is identified by the ELA module, while the GAB module is used to obtain features at different scales. Experimental results on three public datasets demonstrate that KDD can accurately detect infrared small targets and has superior performance in several evaluation metrics. The ablation experiments demonstrate the efficacy of each module of KDD. The comparison of computational complexity shows that KDD is a lightweight network. In missions such as search and rescue and surveillance, real-time monitoring capabilities and low energy consumption are critical, and KDD meets these needs, which makes it suitable for many real-world applications. Although KDD shows superior performance, it still has certain limitations. On the second dataset (IRSTD-1k), its results on some evaluation metrics were very close to those of the teacher model but not optimal. In future work, we intend to optimize the student model further by retaining more information about small-target features. To obtain better guidance from the teacher model, we can also consider introducing feature-based distillation methods, which could potentially lead to further improvements in performance.

Author Contributions

W.T.: conceptualization, methodology, formal analysis, and writing—review and editing; Q.D.: investigation, resources, writing—review and editing, and supervision; F.H.: software, validation, data curation, writing—original draft preparation, and visualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant no. 62476126.

Data Availability Statement

SIRST can be obtained from https://github.com/YimianDai/sirst, IRSTD-1k from https://github.com/RuiZhang97/ISNet, and MSISTD from https://github.com/Crescent-Ao/MSISTD (all accessed on 17 February 2024).

Conflicts of Interest

Author Wenjuan Tang was employed by the HIWING Technology Academy, China Aerospace Science and Industry Corporation Limited (CASIC). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. The overall structure of KDD, where Output-S is the result of the student model, Output-T is the result of the teacher, gt-pre is the result of feature fusion, and Mask is the ground truth.

Figure 2. The knowledge distillation process.

Figure 3. Schematic diagram of ELA. The feature maps are subjected to one-dimensional convolution and normalization operations at different spatial scales, and the two results are then combined by multiplication. Finally, the result is output using convolution.

Figure 4. Schematic diagram of GAB, which groups and aggregates features from different levels.

Figure 5. The target detection results obtained based on the SIRST dataset. (a) Input image, (b) New Tophat, (c) NRAM, (d) ACM, (e) DNANet, (f) IAANet, (g) AGPCNet, (h) LW-IRSTNet, (i) KDD, (j) Ground truth.

Figure 6. The target detection results obtained based on the IRSTD-1k dataset. (a) Input image, (b) New Tophat, (c) NRAM, (d) ACM, (e) DNANet, (f) IAANet, (g) AGPCNet, (h) LW-IRSTNet, (i) KDD, (j) Ground truth.

Figure 7. The target detection results obtained based on the MSISTD dataset. (a) Input image, (b) New Tophat, (c) NRAM, (d) ACM, (e) DNANet, (f) IAANet, (g) AGPCNet, (h) LW-IRSTNet, (i) KDD, (j) Ground truth.

Figure 8. Performance of the ROC curves on different datasets.

Table 1. The precision, recall, F1, IoU, and AUC for different methods on the three datasets.

Method	SIRST Precision	SIRST Recall	SIRST F1	SIRST IoU	SIRST AUC	IRSTD-1k Precision	IRSTD-1k Recall	IRSTD-1k F1	IRSTD-1k IoU	IRSTD-1k AUC	MSISTD Precision	MSISTD Recall	MSISTD F1	MSISTD IoU	MSISTD AUC
New Tophat	0.5098	0.6968	0.5218	0.2260	0.8284	0.2265	0.6263	0.2702	0.0535	0.6635	0.2341	0.5237	0.2702	0.0483	0.5861
NRAM	0.8518	0.6142	0.6827	0.4831	0.7970	0.5199	0.3554	0.3587	0.1549	0.5867	0.5586	0.3270	0.3681	0.1040	0.5553
ACM	0.9095	0.6895	0.7677	0.4921	0.9367	0.6075	0.7589	0.6490	0.5821	0.8908	0.7808	0.5643	0.6274	0.5312	0.8671
DNANet	0.9506	0.6451	0.7549	0.5992	0.9424	0.8897	0.6370	0.7199	0.6172	0.9058	0.8630	0.7010	0.7522	0.6554	0.9159
IAANet	0.8541	0.6083	0.6774	0.5097	0.9459	0.6567	0.7815	0.6763	0.5038	0.9116	0.8137	0.6721	0.6992	0.5552	0.9500
AGPCNet	0.6943	0.4968	0.5396	0.4465	0.8531	0.4836	0.4983	0.4622	0.3910	0.8159	0.6961	0.4327	0.5042	0.3962	0.7836
LW-IRSTNet	0.9406	0.6217	0.7296	0.6109	0.9809	0.6685	0.6575	0.6241	0.5039	0.9388	0.8620	0.5063	0.6089	0.4931	0.8660
KDD	0.8641	0.7647	0.7949	0.6671	0.9601	0.7214	0.7691	0.7164	0.6359	0.9364	0.7616	0.7964	0.7571	0.6688	0.9261
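For readers unfamiliar with the metrics in Table 1, the sketch below computes pixel-level precision, recall, F1, and IoU from a binary prediction and a ground-truth mask using their standard definitions. The exact evaluation protocol (thresholding and how scores are averaged over images) is not restated in this section, so the function name `pixel_metrics` and the single-image setting are assumptions. Note that the F1 values in Table 1 are not exactly the harmonic mean of the reported precision and recall, which would be consistent with averaging per-image scores, although that is an inference rather than something stated here.

```python
def pixel_metrics(pred, mask, eps=1e-12):
    """Precision, recall, F1, and IoU for two flat binary sequences (1 = target pixel)."""
    tp = sum(1 for p, m in zip(pred, mask) if p == 1 and m == 1)  # true positives
    fp = sum(1 for p, m in zip(pred, mask) if p == 1 and m == 0)  # false alarms
    fn = sum(1 for p, m in zip(pred, mask) if p == 0 and m == 1)  # missed pixels
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou


# Toy 8-pixel example: two target pixels found, one false alarm, one miss.
pred = [0, 1, 1, 1, 0, 0, 0, 0]
mask = [0, 0, 1, 1, 1, 0, 0, 0]
scores = pixel_metrics(pred, mask)
print("precision=%.3f recall=%.3f F1=%.3f IoU=%.3f" % scores)
```

The small `eps` term only guards against division by zero when an image contains no predicted or no true target pixels; it does not change the scores at the printed precision.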

Table 2. Ablation study of the ELA, GAB, and loss on different evaluation metrics. ✓ and ✗ indicate that the module is present or absent, respectively. The red number represents the highest value in each evaluation metric.

Dataset	ELA	GAB	Loss	Precision	Recall	F1	IoU	AUC
SIRST	✗	✓	✓	0.8391	0.7775	0.7849	0.6456	0.9501
	✓	✗	✓	0.7957	0.8264	0.7883	0.6567	0.9721
	✓	✓	✗	0.8554	0.7361	0.7765	0.6620	0.9659
	✓	✓	✓	0.8641	0.7647	0.7949	0.6671	0.9601
IRSTD-1k	✗	✓	✓	0.7135	0.7842	0.7001	0.5785	0.9042
	✓	✗	✓	0.6216	0.8118	0.6673	0.5540	0.9158
	✓	✓	✗	0.7048	0.7659	0.7057	0.5966	0.9007
	✓	✓	✓	0.7214	0.7691	0.7164	0.6359	0.9364

Table 3. The Params, FLOPs, and FPS of the different methods. The red number represents the best value in each evaluation metric.

	Params (M)	FLOPs (G)	FPS (Hz)
New Tophat	–	–	28
NRAM	–	–	0.6
ACM	0.39	0.22	22
DNANet	2.61	8.59	36
IAANet	14.05	408.90	16
AGPCNet	12.36	33.06	8
LW-IRSTNet	0.16	0.232	18
KDD	0.041	0.056	41

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
