*3.2. Method Comparison*

The proposed detection method is compared with five other infrared and visible image detection methods: MCDetection [51], FusionDetection [1,52], TwoFusion [53], TripleFusion [8], and IAF R-CNN [9]. The first two methods use pixel-level fusion, while the last three methods and our detection method use feature-level fusion. All six methods are trained and tested on the same training and testing images, and the common network parameters of the six methods are identical.

**MCDetection:** As shown in Figure 7a, MCDetection [51] first combines a single-channel infrared image and a single-channel visible image into a two-channel pseudo-color image. Then, the two-channel image is used as the input of the detection network Fast R-CNN, which outputs the detection results for the infrared and visible images. Since the other five detection networks are designed based on Faster R-CNN, we replace the Fast R-CNN in MCDetection with Faster R-CNN. The infrared and visible features are fused at the pixel level, so MCDetection uses a single base CNN to extract infrared and visible features.
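The pixel-level combination described above amounts to stacking the two single-channel inputs along a channel axis before the network sees them. The following is a minimal sketch of that stacking step, using nested Python lists (H × W images) purely for illustration; the function name `stack_channels` is our own and not from [51].

```python
def stack_channels(infrared, visible):
    """Stack two single-channel images (H x W) into one
    two-channel image (H x W x 2), pixel by pixel."""
    assert len(infrared) == len(visible), "images must have the same height"
    return [
        [[ir_px, vis_px] for ir_px, vis_px in zip(ir_row, vis_row)]
        for ir_row, vis_row in zip(infrared, visible)
    ]

# Toy 2x2 inputs: each output pixel now carries [infrared, visible].
ir = [[0.1, 0.2], [0.3, 0.4]]
vis = [[0.5, 0.6], [0.7, 0.8]]
two_channel = stack_channels(ir, vis)
print(two_channel[0][0])  # [0.1, 0.5]
```

Because the fusion happens before feature extraction, a single base CNN processes the stacked input, which is why MCDetection needs only one backbone.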

**FusionDetection:** As shown in Figure 7b, FusionDetection first utilizes the multi-band image fusion method HMSD [52] to fuse a single-channel infrared image and a single-channel visible image into a single-channel fused image. Note that other state-of-the-art multi-band image fusion methods [54–58] could also be used in FusionDetection. Then, the fused image is detected with the detection network Faster R-CNN [1]. In FusionDetection, infrared and visible features are also fused at the pixel level, so one base CNN is used to extract multi-band features.

**TwoFusion:** As shown in Figure 7c, TwoFusion [53] uses two base CNNs to extract infrared features and visible features separately, and the extracted features are then fused for the subsequent recognition and localization of the objects of interest. The detection architecture of Faster R-CNN is adopted in TwoFusion, and the two base CNNs have identical structures.
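A common way to realize this kind of feature-level fusion is to concatenate the two backbones' feature vectors along the channel dimension at each spatial location. The sketch below assumes concatenation for illustration; the actual fusion operator in [53] may differ, and `fuse_features` is a hypothetical name.

```python
def fuse_features(infrared_feats, visible_feats):
    """Fuse per-location feature vectors from two backbones by
    channel concatenation: C_ir + C_vis channels per location."""
    assert len(infrared_feats) == len(visible_feats)
    return [ir + vis for ir, vis in zip(infrared_feats, visible_feats)]

# Two spatial locations, 2 channels from each backbone -> 4 fused channels.
ir_feats = [[0.2, 0.9], [0.1, 0.4]]
vis_feats = [[0.7, 0.3], [0.8, 0.5]]
print(fuse_features(ir_feats, vis_feats))  # [[0.2, 0.9, 0.7, 0.3], [0.1, 0.4, 0.8, 0.5]]
```

The fused map then feeds the shared Faster R-CNN head, so downstream classification and regression see both modalities jointly.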

**TripleFusion:** As shown in Figure 8a, TripleFusion [8] proposes a three-branch detection architecture that takes advantage of infrared features, visible features, and their fused features to perform classification and regression on the region proposals. Then, the output results of the three branches are combined with an accumulated probability fusion layer to produce more accurate detection results. Note that TripleFusion also uses two identical base CNNs.

**Figure 7.** (**a**–**c**) The pipelines for MCDetection, FusionDetection, and TwoFusion.

**IAF R-CNN:** As shown in Figure 8b, IAF R-CNN [9] observes that detection performance is correlated with illumination conditions. Therefore, IAF R-CNN first uses two detection networks (Faster R-CNN) to produce detection results for the infrared image and the visible image, respectively. In this process, the fused features from the two networks are used to generate region proposals. Then, an illumination-aware network is introduced to measure the illumination of the visible image. According to the measured illumination value, the detection results from the two detection networks are adaptively merged to obtain the final detection outputs.
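The illumination-aware merging can be pictured as a weighted average of the two branches' confidences, where the weight comes from the measured illumination: in bright scenes the visible branch dominates, in dark scenes the infrared branch does. This is a hedged sketch of that idea, not the actual gating function of [9]; `merge_detections` is a hypothetical name.

```python
def merge_detections(score_visible, score_infrared, illumination):
    """Merge the two branches' confidence scores using an
    illumination weight w in [0, 1] (1 = well lit, 0 = dark)."""
    w = max(0.0, min(1.0, illumination))  # clamp to a valid weight
    return w * score_visible + (1.0 - w) * score_infrared

# Well-lit scene: trust the visible branch; dark scene: trust infrared.
print(merge_detections(0.9, 0.3, illumination=1.0))  # 0.9
print(merge_detections(0.9, 0.3, illumination=0.0))  # 0.3
```

In IAF R-CNN the weight is predicted by the illumination-aware subnetwork rather than supplied by hand, but the merge itself follows this adaptive-weighting pattern.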

**Figure 8.** (**a**,**b**) The pipelines for TripleFusion and IAF R-CNN.
