Res-SwinTransformer with Local Contrast Attention for Infrared Small Target Detection
Abstract
1. Introduction
- (1) We designed a ResSwin backbone based on a residual structure and the Swin Transformer. Through self-attention computation and residual connections, it improves the interaction of global information while fully preserving the shallow detail features of small infrared targets (a minimal sketch of this residual design follows this list).
- (2) We proposed a plug-and-play attention module, the LCA Block, based on local contrast calculation. It enhances the feature representation of infrared small targets and helps the network locate and identify them more accurately.
- (3) We built an air-to-ground multi-scene infrared vehicle dataset using a UAV. The dataset covers diverse scenes and environments, and can support both testing of infrared target detection models and studies of infrared target characteristics for aerial remote sensing. Experiments on our dataset and on other infrared datasets such as DroneVehicle show that our method achieves state-of-the-art performance while remaining suitable for real-time application.
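As referenced in contribution (1), the sketch below illustrates the residual idea behind the ResSwin backbone in PyTorch. It is a minimal approximation, not the paper's implementation: the shifted-window attention of the Swin Transformer is replaced here by ordinary multi-head self-attention over flattened tokens, and all module and parameter names are ours. The elementwise skip connection corresponds to the "Add" connection mode compared against "Concat" in the ablation study (Section 4.4.1).

```python
import torch
import torch.nn as nn

class ResidualAttentionStage(nn.Module):
    """Sketch of a residual transformer stage: the (shallow) input feature
    is added back to the attention output so fine detail is preserved."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        # Stand-in for shifted-window attention; real Swin attention is windowed.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # [b, h*w, c]
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t)            # token-wise self-attention
        attn_map = attn_out.transpose(1, 2).reshape(b, c, h, w)
        # "Add" connection: shallow input features are preserved via the skip.
        return self.proj(attn_map) + x

if __name__ == "__main__":
    feat = torch.randn(2, 96, 32, 32)               # [b, c, h, w]
    out = ResidualAttentionStage(96)(feat)
    print(out.shape)                                # torch.Size([2, 96, 32, 32])
```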
2. Materials
2.1. Motivation
2.2. Dataset Introduction
3. Methods
3.1. RSLCANet
3.2. ResSwin Backbone
3.3. LCA Block
Algorithm 1: Local Contrast Calculation
Input: input feature F of dimension [b, c, h, w]; conversion factor p.
Output: enhanced feature.
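Below is a hedged PyTorch sketch of how the local contrast calculation in Algorithm 1 could be realized as an attention module. Only the input/output shapes and the conversion factor p come from the algorithm header above; the neighborhood averaging, the subtraction-based contrast, and the sigmoid gating are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalContrastAttention(nn.Module):
    """Illustrative local-contrast attention: contrast = center response minus
    local neighborhood mean, scaled by a conversion factor p and used as a gate."""

    def __init__(self, window_size: int = 3, p: float = 1.0):
        super().__init__()
        self.window_size = window_size
        self.p = p  # conversion factor from Algorithm 1 (its exact role is assumed)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: [b, c, h, w]
        pad = self.window_size // 2
        # Mean of the local neighborhood around every pixel.
        neighborhood_mean = F.avg_pool2d(
            feat, kernel_size=self.window_size, stride=1, padding=pad
        )
        # Local contrast: how much each location stands out from its surroundings.
        contrast = self.p * (feat - neighborhood_mean)
        attention = torch.sigmoid(contrast)
        # Enhanced feature: input rescaled by the contrast-derived attention.
        return feat * attention

if __name__ == "__main__":
    x = torch.randn(1, 64, 40, 40)
    enhanced = LocalContrastAttention(window_size=3, p=2.0)(x)
    print(enhanced.shape)  # torch.Size([1, 64, 40, 40])
```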
4. Experiment and Results
4.1. Evaluation Metrics
4.1.1. Precision and Recall
4.1.2. F1 Score
4.1.3. Mean Average Precision
4.1.4. Frames per Second (FPS)
4.2. Experimental Details
4.3. Comparison Experiments
4.3.1. Comparison Experiments on the Dim-Small Aircraft Targets Dataset
4.3.2. Comparison Experiments on the DroneVehicle Dataset
4.3.3. Comparison Experiments on Our Captured Multi-Scene Infrared Vehicle Dataset
4.4. Ablation Experiments
4.4.1. The Design of ResSwin Backbone
4.4.2. The Design of LCA Block
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Number | Time | Location | Weather | Height (m) |
---|---|---|---|---|
1 | 6:30 | Highway | Sunny | 80 |
2 | 15:30 | Highway | Sunny | 80 |
3 | 21:00 | Parking | Heavy fog | 50 |
4 | 20:00 | Parking | Heavy fog | 80 |
5 | 21:00 | Highway | Cloudy | 100 |
6 | 19:30 | Mall | Light Fog | 100 |
7 | 20:30 | Highway | Light Fog | 80 |
8 | 19:30 | Mall | Sunny | 100 |
9 | 10:00 | Crossroad | Light Fog | 30–200 |
10 | 11:00 | Crossroad | Light Fog | 200 |
11 | 12:30 | Crossroad | Sunny | 250 |
12 | 12:40 | Crossroad | Sunny | 80–250 |
13 | 17:00 | Crossroad | Sunny | 100 |
14 | 20:30 | Highway | Sunny | 30–100 |
15 | 20:30 | Highway | Sunny | 100 |
16 | 21:30 | Highway | Sunny | 200 |
17 | 20:30 | Highway | Sunny | 100–200 |
18 | 20:30 | Highway | Sunny | 300 |
19 | 6:30 | Highway | Sunny | 300–100 |
20 | 15:30 | Highway | Sunny | 70 |
References
- Ren, K.; Sun, W.; Meng, X.; Yang, G.; Peng, J.; Huang, J. A locally optimized model for hyperspectral and multispectral images fusion. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5519015. [Google Scholar] [CrossRef]
- Zhou, J.; Sun, W.; Meng, X.; Yang, G.; Ren, K.; Peng, J. Generalized linear spectral mixing model for spatial–temporal–spectral fusion. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5533216. [Google Scholar] [CrossRef]
- Sun, W.; Ren, K.; Meng, X.; Yang, G.; Xiao, C.; Peng, J.; Huang, J. MLR-DBPFN: A multi-scale low rank deep back projection fusion network for anti-noise hyperspectral and multispectral image fusion. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522914. [Google Scholar] [CrossRef]
- Hou, T.; Sun, W.; Chen, C.; Yang, G.; Meng, X.; Peng, J. Marine floating raft aquaculture extraction of hyperspectral remote sensing images based decision tree algorithm. Int. J. Appl. Earth Obs. Geoinf. 2022, 111, 102846. [Google Scholar] [CrossRef]
- Sun, W.; Liu, K.; Ren, G.; Liu, W.; Yang, G.; Meng, X.; Peng, J. A simple and effective spectral-spatial method for mapping large-scale coastal wetlands using China ZY1-02D satellite hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2021, 104, 102572. [Google Scholar] [CrossRef]
- Ma, J.; Guo, H.; Rong, S.; Feng, J.; He, B. Infrared Dim and Small Target Detection Based on Background Prediction. Remote Sens. 2023, 15, 3749. [Google Scholar] [CrossRef]
- Toth, C.; Jozkow, G. Remote sensing platforms and sensors: A survey. ISPRS J. Photogramm. Remote Sens. 2016, 115, 22–36. [Google Scholar] [CrossRef]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. pp. 740–755. [Google Scholar]
- Henini, M.; Razeghi, M. Handbook of Infrared Detection Technologies; Elsevier: Amsterdam, The Netherlands, 2002. [Google Scholar]
- Razeghi, M.; Nguyen, B.-M. Advances in mid-infrared detection and imaging: A key issues review. Rep. Prog. Phys. 2014, 77, 082401. [Google Scholar] [CrossRef]
- Li, X.; Sun, S.; Gu, L.; Liu, X. Infrared scene prediction of night unmanned vehicles based on multi-scale feature maps. Infrared Phys. Technol. 2021, 118, 103897. [Google Scholar] [CrossRef]
- Qiu, G.Y.; Wang, B.; Li, T.; Zhang, X.; Zou, Z.; Yan, C. Estimation of the transpiration of urban shrubs using the modified three-dimensional three-temperature model and infrared remote sensing. J. Hydrol. 2021, 594, 125940. [Google Scholar] [CrossRef]
- Ren, H.; Ye, X.; Nie, J.; Meng, J.; Fan, W.; Qin, Q.; Liang, Y.; Liu, H. Retrieval of land surface temperature, emissivity, and atmospheric parameters from hyperspectral thermal infrared image using a feature-band linear-format hybrid algorithm. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4401015. [Google Scholar] [CrossRef]
- Zhang, J.; Liu, C.; Wang, B.; Chen, C.; He, J.; Zhou, Y.; Li, J. An infrared pedestrian detection method based on segmentation and domain adaptation learning. Comput. Electr. Eng. 2022, 99, 107781. [Google Scholar] [CrossRef]
- Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Signal and Data Processing of Small Targets 1999, Denver, CO, USA, 18–23 July 1999; pp. 74–83. [Google Scholar]
- Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581. [Google Scholar] [CrossRef]
- Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A robust infrared small target detection algorithm based on human visual system. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172. [Google Scholar]
- Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616. [Google Scholar] [CrossRef]
- Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared patch-image model for small target detection in a single image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef]
- Shi, M.; Wang, H. Infrared dim and small target detection based on denoising autoencoder network. Mob. Netw. Appl. 2020, 25, 1469–1483. [Google Scholar] [CrossRef]
- Zheng, G.; Wu, X.; Hu, Y.; Liu, X. Object detection for low-resolution infrared image in land battlefield based on deep learning. In Proceedings of the 2019 Chinese Control Conference (CCC), Guangzhou, China, 27–30 July 2019; pp. 8649–8652. [Google Scholar]
- Du, S.; Zhang, P.; Zhang, B.; Xu, H. Weak and occluded vehicle detection in complex infrared environment based on improved YOLOv4. IEEE Access 2021, 9, 25671–25680. [Google Scholar] [CrossRef]
- Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional local contrast networks for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
- Zhu, R.; Zhuang, L. Unsupervised Infrared Small-Object-Detection Approach of Spatial–Temporal Patch Tensor and Object Selection. Remote Sens. 2022, 14, 1612. [Google Scholar] [CrossRef]
- Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-infrared object detection by reducing cross-modality redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
- Dang, L.M.; Wang, H.; Li, Y.; Min, K.; Kwak, J.T.; Lee, O.N.; Park, H.; Moon, H. Fusarium wilt of radish detection using RGB and near infrared images from Unmanned Aerial Vehicles. Remote Sens. 2020, 12, 2863. [Google Scholar] [CrossRef]
- Wu, J.; Shen, T.; Wang, Q.; Tao, Z.; Zeng, K.; Song, J. Local Adaptive Illumination-Driven Input-Level Fusion for Infrared and Visible Object Detection. Remote Sens. 2023, 15, 660. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. pp. 213–229. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor detr: Query design for transformer-based detector. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22), Virtual, 22 February–1 March 2022; pp. 2567–2575. [Google Scholar]
- Wei, Y.; You, X.; Li, H. Multiscale patch-based contrast measure for small infrared target detection. Pattern Recognit. 2016, 58, 216–226. [Google Scholar] [CrossRef]
- Chen, Y.; Wang, H.; Pang, Y.; Han, J.; Mou, E.; Cao, E. An Infrared Small Target Detection Method Based on a Weighted Human Visual Comparison Mechanism for Safety Monitoring. Remote Sens. 2023, 15, 2922. [Google Scholar] [CrossRef]
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
- Jocher, G.; Stoken, A.; Borovec, J.; Christopher, S.; Laughing, L.C. Ultralytics/Yolov5: v6.0; Zenodo: Geneva, Switzerland, 2021. [Google Scholar]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
- Braun, M.; Krebs, S.; Flohr, F.; Gavrila, D.M. The eurocity persons dataset: A novel benchmark for object detection. arXiv 2018, arXiv:1805.07193. [Google Scholar]
- Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
- Hui, B.; Song, Z.; Fan, H.; Zhong, P.; Hu, W.; Zhang, X.; Ling, J.; Su, H.; Jin, W.; Zhang, Y. A dataset for infrared detection and tracking of dim-small aircraft targets under ground/air background. China Sci. Data 2020, 5, 291–302. [Google Scholar]
- Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
- Han, J.; Moradi, S.; Faramarzi, I.; Liu, C.; Zhang, H.; Zhao, Q. A local contrast method for infrared small-target detection utilizing a tri-layer window. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1822–1826. [Google Scholar] [CrossRef]
- Moradi, S.; Moallem, P.; Sabahi, M.F. Fast and robust small infrared target detection using absolute directional mean difference algorithm. Signal Process. 2020, 177, 107727. [Google Scholar] [CrossRef]
- Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 950–959. [Google Scholar]
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
- Devaguptapu, C.; Akolekar, N.; Sharma, M.M.; Balasubramanian, V.N. Borrow from anywhere: Pseudo multi-modal object detection in thermal imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
Indicators | Thermal Camera | Visual Camera |
---|---|---|
Spectral Band | 8–14 μm | 0.38–0.7 μm |
Resolution | 640 × 512 | 3840 × 2160/1920 × 1080 |
Sensors | Uncooled VOx Microbolometer | 1/2 CMOS |
Ground Truth\Predicted Value | Positive | Negative |
---|---|---|
Positive | True Positive (TP) | False Negative (FN) |
Negative | False Positive (FP) | True Negative (TN) |
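From the confusion-matrix counts defined above, precision, recall, and F1 score (Sections 4.1.1 and 4.1.2) follow directly; the short sketch below computes them, using made-up counts purely for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example with made-up counts.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")  # P=0.900, R=0.750, F1=0.818
```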
Method | P | R | F1 |
---|---|---|---|
RLCM [18] | 0.949 | 0.444 | 0.605 |
TLLICM [44] | 0.931 | 0.347 | 0.506 |
ADMD [45] | 0.754 | 0.455 | 0.567 |
ACM [46] | 0.523 | 0.242 | 0.331 |
CenterNet [47] | 0.876 | 0.728 | 0.795 |
YOLOv5 [38] | 0.978 | 0.291 | 0.449 |
YOLOX [39] | 0.970 | 0.790 | 0.871 |
Ours | 0.996 | 0.785 | 0.878 |
Method | Backbone | mAP@0.5 | mAP@0.5:0.95 | FPS | Parameters (M) |
---|---|---|---|---|---|
Faster R-CNN [48] | resnet50 | 0.744 | 0.479 | 26 | 42.0 |
DETR [31] | resnet50 | 0.874 | 0.531 | 27 | 41.3 |
Anchor DETR [33] | resnet50 | 0.890 | 0.568 | 37 | 36.8 |
YOLOX [39] | Darknet53 | 0.855 | 0.642 | 55 | 8.95 |
Ours | ResSwin | 0.898 | 0.673 | 41 | 28.5 |
Model | Dataset | mAP@0.5 | mAP@0.5:0.95 | Parameters |
---|---|---|---|---|
Darknet53YOLOX | VOC2007 | 0.513 | 0.286 | 8.95 M |
ResSwinYOLOX (3 layers) | VOC2007 | 0.528(+0.015) | 0.297(+0.011) | 28.54 M |
ResSwinYOLOX (4 layers) | VOC2007 | 0.534(+0.021) | 0.305(+0.019) | 42.74 M |
Darknet53YOLOX | DroneVehicle | 0.855 | 0.642 | 8.95 M |
ResSwinYOLOX (3 layers) | DroneVehicle | 0.871(+0.016) | 0.647(+0.005) | 28.54 M |
ResSwinYOLOX (4 layers) | DroneVehicle | 0.876(+0.021) | 0.648(+0.006) | 42.74 M |
Model | Connection Mode | Dataset | mAP@0.5 | mAP@0.5:0.95 | Parameters |
---|---|---|---|---|---|
Darknet53YOLOX | / | VOC2007 | 0.513 | 0.286 | 8.95 M |
ResSwinYOLOX (3 layers) | Add | VOC2007 | 0.528(+0.015) | 0.297(+0.011) | 28.54 M |
ResSwinYOLOX (3 layers) | Concat | VOC2007 | 0.523(+0.010) | 0.298(+0.013) | 28.93 M |
Darknet53YOLOX | / | DroneVehicle | 0.855 | 0.642 | 8.95 M |
ResSwinYOLOX (3 layers) | Add | DroneVehicle | 0.871(+0.016) | 0.647(+0.005) | 28.54 M |
ResSwinYOLOX (3 layers) | Concat | DroneVehicle | 0.869(+0.014) | 0.644(+0.002) | 28.93 M |
Model | LCA Block | Position | mAP@0.5 | mAP@0.5:0.95 | Parameters |
---|---|---|---|---|---|
ResSwinYOLOX (3 layers) | ✗ | / | 0.871 | 0.647 | 28.54 M |
ResSwinYOLOX (3 layers) | ✓ | Backbone | 0.892(+0.021) | 0.669(+0.022) | 28.54 M |
ResSwinYOLOX (3 layers) | ✓ | Neck | 0.898(+0.027) | 0.673(+0.026) | 28.54 M |