1. Introduction
In recent years, infrared detection technology has been widely applied in military, remote sensing, civil, and other fields, with application scenarios including infrared reconnaissance and early warning, infrared space detection, automotive navigation, and medical infrared detection. As a critical technology in infrared early warning detection, infrared object detection algorithms adapted to different complex scenes have been widely studied. Given that the spatial resolution of the optical system is difficult to improve further, it is of great significance to study infrared object detection algorithms with a low false alarm rate and strong adaptability that are suitable for different scenes.
However, infrared image detection also presents several challenges. First, objects offer fewer usable features. Second, the signal-to-noise ratio of infrared images is low. Finally, the real-time performance of infrared object detection is limited. These factors make it challenging to design an object detection network for infrared images with both high accuracy and good real-time performance. Current research interest centers on object detection in visible scenes, including single-stage detectors such as the Single Shot MultiBox Detector (SSD) [1], the You Only Look Once (YOLO) series [2,3], and Fully Convolutional One-Stage (FCOS) object detection [4]. Two-stage detectors such as Faster R-CNN [5] also exist, along with Task-aligned One-stage Object Detection (TOOD) [6]. In addition, some detectors are built on anchor-free designs [7] or transformers [8]. These methods perform well on visible images but remain limited for infrared image detection.
Despite these challenges, many methods have been attempted for infrared object detection, with relatively good results. For example, the YOLO-FIRI [9] algorithm improves on YOLOv5 [10] to propose a region-free infrared image object detection method that reaches an advanced level on the KAIST [11] and FLIR [12] datasets. I-YOLO [13] explicitly targets infrared object detection on roads: it combines DRUNet [14] with YOLOv3 [2], enhancing the infrared image through DRUNet and then using YOLOv3 for accurate object recognition, which gives it advantages in both precision and speed. Air-to-ground detection is another active topic in single-frame infrared detection. In [15], Jiang et al. proposed a UAV object detection framework for infrared images and video, in which features are extracted from ground objects and an improved YOLOv5s performs recognition; this method achieves 88.69% recognition accuracy at 50 FPS. IARet [16] performs well in single infrared image object detection, with a Focus module designed to improve detection speed, and is also lightweight: the entire model is just 4.8 MB. However, these methods all operate on a single infrared image, and although they achieve good results, they share common problems: the single-image detection ability is limited, the loss of object features is severe, and the false alarm rate is high.
As is well known, producing visible images requires compensation for external illumination when lighting conditions are poor. Infrared cameras can produce infrared spectral images around the clock, but these images lack details such as texture and color. Moreover, in infrared images the critical factor determining an object's visibility is the temperature difference between the object and its environment; a car, for example, appears brighter than the background [17,18]. However, non-object heat sources can also lead to false detections. Infrared and visible images therefore have complementary advantages in information distribution, and combining the unique benefits of visible images with infrared images can compensate for the precision loss incurred by infrared-only object detection.
Based on the above analysis, some researchers have begun to explore complementary detection between infrared and visible images. For example, MFFN [19] proposes a multi-modal feature fusion network that uses morphological, infrared radiation, and motion features to compensate for the deficiencies of single-modal detection of small infrared objects. MFFN also proposes a feature pyramid structure with skip connections (SCFPN) and integrates SCFPN and a dilated convolution attention module into the Resblocks of its backbone; this design achieves a detection accuracy of 92.01% on the OEDD dataset. However, not all fused features are helpful, and much research addresses the problems fusion introduces, such as TIRNet [20]. To reduce information redundancy in the fusion of infrared and visible images, RISNet [17] designs a mutual information minimization module. RISNet also proposes a histogram-based classification of lighting conditions that automatically distinguishes fine-grained illumination levels to facilitate the complementary fusion of infrared and RGB images; as a result, RISNet outperforms state-of-the-art infrared detection methods, especially under insufficient illumination, complex backgrounds, and low contrast. PearlGAN [21] also advances infrared and visible fusion detection: it designs a top-down guided attention module with a corresponding attention loss to achieve a hierarchical attention distribution, reducing local semantic ambiguity and exploiting contextual information during image encoding. Moreover, PearlGAN introduces a structured gradient alignment loss. This design performs well in image translation tasks and provides a new idea for infrared object detection. Like PearlGAN's loss-function constraints on infrared and visible fusion detection, there are many other excellent works, such as CMPD [22].
Based on the above observations on visible image object detection and infrared-visible fusion detection, we propose the Dual-YOLO method. It effectively addresses the low accuracy, feature loss, redundant fused features, and slow detection speed of infrared image object detection. Compared with general object detection, the proposed Dual-YOLO is better suited to detection based on RGB UAV imagery, which, as shown in [23], is more challenging than general object detection. For targets with complex backgrounds, dense distribution, and small size, such as in crop quality inspection, detection methods based on RGB UAV imagery can improve accuracy. In summary, the main contributions of this paper are as follows:
(1) Building on YOLOv7 [3], currently the most accurate network for real-time object detection, we propose a dual-branch infrared and visible object detection network named Dual-YOLO. This method alleviates the problem of missing texture features in single infrared image object detection, and detection accuracy is improved by complementing infrared and visible image feature information.
(2) We propose the attention fusion module, which adds an Inception module and an SE mutual attention module to the infrared and visible feature fusion process, so that infrared and visible images achieve the best feature complementarity and fusion effect without increasing the number of parameters (see the sketch after this list).
(3) We propose the fusion shuffle module, which adds dilated convolution to the infrared and visible feature fusion process, enlarging the receptive field of the fusion module's feature extraction, and a channel shuffle module that mixes infrared and visible features more uniformly and reduces redundant features. We also design a feature fusion loss function to accelerate the convergence of Dual-YOLO.
(4) Our method achieves state-of-the-art results on the challenging KAIST multispectral pedestrian dataset and the DroneVehicle [24] remote sensing dataset. Experiments on the FLIR multispectral object detection dataset further demonstrate the effectiveness and versatility of our algorithm.
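To make the fusion ideas in contributions (2) and (3) concrete, the following PyTorch sketch shows one plausible way to combine cross-modal SE attention, dilated convolution, and channel shuffle. It is a minimal illustration under our own assumptions: the names (DualFusion, SEWeight), the reduction ratio, and the two-group shuffle are hypothetical, the Inception-style multi-branch convolutions are omitted for brevity, and the exact Dual-YOLO modules are specified in Section 2.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channel groups (ShuffleNet-style) so infrared and
    visible features mix uniformly after concatenation."""
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


class SEWeight(nn.Module):
    """Squeeze-and-excitation gate: returns per-channel weights in (0, 1)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.size()
        return self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)


class DualFusion(nn.Module):
    """Hypothetical fusion block: mutual SE attention, then a dilated 3x3
    convolution for a larger receptive field, then channel shuffle."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate_ir = SEWeight(channels)   # attention from the IR stream
        self.gate_vis = SEWeight(channels)  # attention from the visible stream
        self.dilated = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3,
                                 padding=2, dilation=2)
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_ir: torch.Tensor, f_vis: torch.Tensor) -> torch.Tensor:
        # Mutual re-weighting: each modality is gated by attention computed
        # from the other, so each stream keeps what the other lacks.
        fused = torch.cat([f_ir * self.gate_vis(f_vis),
                           f_vis * self.gate_ir(f_ir)], dim=1)
        fused = self.dilated(fused)        # enlarge the receptive field
        fused = channel_shuffle(fused, 2)  # mix IR and visible channels
        return self.project(fused)


# Shape check: two same-sized feature maps fuse into one of equal shape.
out = DualFusion(256)(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
assert out.shape == (1, 256, 40, 40)
```

Note that gating each stream with the other modality's attention, rather than its own, is what makes the attention mutual: bright infrared responses can emphasize visible channels at the same locations, and vice versa.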
The rest of this paper is structured as follows: Section 2 describes the network structure and methods in detail; Section 3 presents implementation details, experimental results, and comparisons verifying the effectiveness of our method; finally, Section 4 summarizes the research.