*3.3. Comparison*

Figure 9 shows the infrared and visible image detection results of the six detection methods. To display the results more clearly, the detections are drawn on fused infrared and visible images produced by the fusion method HMSD [52]. Since MCDetection and FusionDetection simply combine an infrared image and a visible image into a single fused image, the detection network cannot effectively extract complementary features from the fused image alone, so they produce inaccurate localization and classification results (Figure 9c,d). TwoFusion uses two identical base CNNs to extract complementary and diverse infrared and visible features, but this strategy can hardly achieve the desired goal; without complementary features, TwoFusion gives incorrect detection results (Figure 9e). Although TripleFusion and IAF R-CNN design more complex network structures, their detection results are still unsatisfactory (Figure 9f,g). Thanks to the carefully designed difference maximum loss function, focused feature enhancement module, and cascaded semantic extension module, our detection network gives more accurate detection results.

Figure 10 shows another comparison example. The objects in the images are crowded and not easily distinguished, and some irrelevant objects closely resemble the objects of interest. In this situation, the other five detection methods give inaccurate localization or classification results (Figure 10c–g), while the proposed method outputs satisfactory detection results (Figure 10h). Figure 11 shows an example similar to Figure 10. Some objects in Figure 11 are very small, and certain lighting conditions can confuse detection networks. From Figure 11c–e, we can see that MCDetection, FusionDetection, and TwoFusion give quite inaccurate localization and classification results. The detection results of TripleFusion and IAF R-CNN also leave room for improvement (see Figure 11f,g). In contrast, the output of our detection network is more accurate.

**Figure 9.** (**a**–**h**) The infrared and visible image detection results from six different detection methods. In order to better display the results, the detection results are shown on the fused images.

We use mean average precision (mAP) to quantitatively evaluate the detection performance of the six detection networks. Table 2 lists the mAPs of the different detection methods on the testing set. The first column shows the six detection methods, the second column shows the mAPs on the testing images captured in the daytime, the third column shows the mAPs on the testing images captured at night, and the fourth column shows the mAPs on all testing images. MCDetection and FusionDetection use only one base CNN to extract diverse multi-band features, resulting in lower mAPs compared with the other detection networks. Since TwoFusion uses two identical CNNs, its detection accuracy is also relatively low. Although TripleFusion and IAF R-CNN introduce well-designed network structures, their mAPs are still lower than that of our detection network.
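For reference, mAP is computed by averaging the per-class average precision (AP), where each AP is the area under a class's precision–recall curve. The sketch below illustrates the standard all-point-interpolation AP computation; it is a generic illustration of the metric, not the exact evaluation code used in our experiments, and the function names are our own.

```python
def average_precision(recalls, precisions):
    """AP via all-point interpolation of the precision-recall curve.

    `recalls` and `precisions` are parallel lists sampled along the
    ranked detections, with recall in increasing order.
    """
    # Pad the curve so it starts at recall 0 and ends at recall 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left,
    # i.e., replace each value with the max precision at >= that recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the area under the resulting step-wise curve.
    ap = 0.0
    for i in range(1, len(r)):
        ap += (r[i] - r[i - 1]) * p[i]
    return ap

def mean_average_precision(ap_per_class):
    """mAP is simply the mean of the per-class APs."""
    return sum(ap_per_class) / len(ap_per_class)
```

A detector that reaches precision 1.0 at recall 0.5 and precision 0.5 at recall 1.0, for example, scores an AP of 0.75 under this scheme.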

**Figure 10.** (**a**–**h**) The infrared and visible image detection results from six different detection methods. In order to better display the results, the detection results are shown on the fused images.


**Table 2.** The mAP for the different detection methods on the testing set.


**Figure 11.** (**a**–**h**) The infrared and visible image detection results from six different detection methods. In order to better display the results, the detection results are shown on the fused images.
