Article
Peer-Review Record

GF-Detection: Fusion with GAN of Infrared and Visible Images for Vehicle Detection at Nighttime

Remote Sens. 2022, 14(12), 2771; https://doi.org/10.3390/rs14122771
by Peng Gao, Tian Tian *, Tianming Zhao, Linfeng Li, Nan Zhang and Jinwen Tian
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 13 April 2022 / Revised: 14 May 2022 / Accepted: 3 June 2022 / Published: 9 June 2022

Round 1

Reviewer 1 Report

The paper is interesting, on a challenging topic, but there are some critical flaws:

  • It seems impossible to actually reproduce the results. The parameters of the proposed architecture are not specified (dimensions and settings of the generators and of the discriminators, number of layers, size of the kernel, etc.). These are important parameters. Without them, the paper is only advocating for a general idea. This idea might be sound, though.
  • In the branches used to generate the IR and VI images, the fusion strategy is not clear (Figure 3.b & 3.d). Is the fusion done by concatenation of tensors? This must be clearly specified and explained.
  • There seems to be an inconsistency between Figure 5 and Table 1. Figure 5 (precision-recall curve) shows that the RetinaNet model with VI images alone gives very good results (very close to the GF-detection model), but in Table 1 this same model is one of the worst. Please clarify.
  • The following paper could be of interest for the authors:
    Deng, C.; Liu, X.; Chanussot, J.; Xu, Y.; Zhao, B. Towards perceptual image fusion: A novel two-layer framework. Information Fusion 2020, 57, 102–114. ISSN 1566-2535. https://doi.org/10.1016/j.inffus.2019.12.002; https://www.sciencedirect.com/science/article/abs/pii/S1566253518303075 (see Figure 8 in the paper).

Author Response

Reviewer 1

The paper is interesting, on a challenging topic, but there are some critical flaws:

1) It seems impossible to actually reproduce the results. The parameters of the proposed architecture are not specified (dimensions and settings of the generators and of the discriminators, number of layers, size of the kernel, etc.). These are important parameters. Without them, the paper is only advocating for a general idea. This idea might be sound, though.

Response:

Thank you for underlining this deficiency; your comments have helped us improve the manuscript substantially. Model reproducibility is one of our concerns in this study, and a detailed description of our model has been added to the revised manuscript as follows:

Table 1. Parameters of the main modules of GF-detection.

  • Generator
    - Encoder: Conv(1,3,7,2,0) + MaxPooling; Conv(3,64,4,2,1) + BatchNorm2d; Conv(64,128,4,2,1) + BatchNorm2d; Conv(128,256,4,2,1) + BatchNorm2d; Conv(256,512,4,2,1) + BatchNorm2d
    - Decoder: ConvTranspose(512,256,4,2,1) + BatchNorm2d; ConvTranspose(256,128,4,2,1) + BatchNorm2d; ConvTranspose(128,64,4,2,1) + BatchNorm2d; ConvTranspose(64,3,4,2,1) + BatchNorm2d; UpPooling
  • Discriminator
    - Encoder: Conv(3,64,7,2,0) + MaxPooling; Conv(64,128,4,2,1) + BatchNorm2d + LeakyReLU; Conv(128,256,4,2,1) + BatchNorm2d + LeakyReLU; Conv(256,512,4,2,1) + BatchNorm2d + LeakyReLU; FC(204800,4096)
  • Fusion
    - Encoder: Conv(3,64,7,2,0) + MaxPooling; Conv(64,256,7,2,0) + BatchNorm2d + LeakyReLU; Conv(256,512,7,2,0) + BatchNorm2d + LeakyReLU; Conv(512,1024,7,2,0) + BatchNorm2d + LeakyReLU; Conv(1024,2048,7,2,0) + BatchNorm2d + LeakyReLU
    - Fusion: Conv(512,512,3,1,1) + BatchNorm2d + LeakyReLU; Conv(1024,1024,3,1,1) + BatchNorm2d + LeakyReLU; Conv(2048,2048,3,1,1) + BatchNorm2d + LeakyReLU
    - Decoder: ConvTranspose(2048,1024,3,2,1) + BatchNorm2d + LeakyReLU; ConvTranspose(1024,512,4,2,1) + BatchNorm2d + LeakyReLU; ConvTranspose(512,256,4,2,1) + BatchNorm2d + LeakyReLU; ConvTranspose(256,64,4,2,1) + BatchNorm2d + LeakyReLU; ConvTranspose(64,3,4,2,1) + BatchNorm2d + LeakyReLU; UpPooling
  • Self-attention fusion model
    - Channel attention: AdaptiveAvgPool2d + FC(512,32) + LeakyReLU + Conv(32,512,1,1); AdaptiveMaxPool2d + FC(512,32) + LeakyReLU + Conv(32,512,1,1) + Sigmoid()
    - Spatial attention: torch.mean + torch.max + Conv(2,1,7,1,1) + Sigmoid()
  • RetinaNet: RetinaNet with a ResNet-50 backbone
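To make the table above easier to act on, here is a minimal PyTorch sketch of the generator, reading each Conv entry as (input channels, output channels, kernel size, stride, padding) in the style of nn.Conv2d. The channel, kernel, stride and padding values follow the table; the activation functions, the pooling settings and the use of bilinear upsampling for the final up-pooling step are assumptions, since the table does not specify them.

```python
import torch
import torch.nn as nn

def conv_bn(in_ch, out_ch, k, s, p):
    # Conv(in, out, kernel, stride, padding) + BatchNorm2d; the activation is assumed
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

def deconv_bn(in_ch, out_ch, k, s, p):
    # ConvTranspose(in, out, kernel, stride, padding) + BatchNorm2d; the activation is assumed
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class Generator(nn.Module):
    """Encoder-decoder generator following the layer list in the table above."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 3, kernel_size=7, stride=2, padding=0),
            nn.MaxPool2d(kernel_size=2, stride=2),
            conv_bn(3, 64, 4, 2, 1),
            conv_bn(64, 128, 4, 2, 1),
            conv_bn(128, 256, 4, 2, 1),
            conv_bn(256, 512, 4, 2, 1),
        )
        self.decoder = nn.Sequential(
            deconv_bn(512, 256, 4, 2, 1),
            deconv_bn(256, 128, 4, 2, 1),
            deconv_bn(128, 64, 4, 2, 1),
            deconv_bn(64, 3, 4, 2, 1),
            # "UpPooling" in the table; bilinear upsampling is assumed here
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```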

2) In the branches used to generate the IR and VI images, the fusion strategy is not clear (Figure 3.b & 3.d). Is the fusion done by concatenation of tensors? This must be clearly specified and explained.

Response:

We apologize for the lack of clarity. The fusion strategy is explained as follows:

In the fusion branch, two tensors are concatenated, and several convolution layers are then applied to the fused tensor. In the visible fusion branch, these tensors are the visible features extracted by the encoder and the visible detection features extracted by the detection backbone. In the infrared fusion branch, they are the infrared features extracted by the encoder and the infrared detection features extracted by the detection backbone.
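As a concrete illustration of this strategy, the following minimal PyTorch sketch concatenates an encoder feature map with a detection feature map along the channel axis and mixes them with a few convolution layers. The channel sizes and the number of mixing layers here are illustrative assumptions, not the exact GF-detection settings.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Fuse encoder features with detection-backbone features by channel-wise
    concatenation followed by a few convolution layers (channel sizes are illustrative)."""
    def __init__(self, enc_ch=512, det_ch=512, out_ch=512):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(enc_ch + det_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, enc_feat, det_feat):
        # Both feature maps are assumed to share the same spatial resolution.
        return self.mix(torch.cat([enc_feat, det_feat], dim=1))

# Example: fuse visible encoder features with visible detection features.
vis_feat = torch.randn(1, 512, 32, 32)
det_feat = torch.randn(1, 512, 32, 32)
fused = ConcatFusion()(vis_feat, det_feat)   # -> torch.Size([1, 512, 32, 32])
```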

3) There seems to be an inconsistency between Figure 5 and Table 1. Figure 5 (precision-recall curve) shows that the RetinaNet model with VI images alone gives very good results (very close to the GF-detection model), but in Table 1 this same model is one of the worst. Please clarify.

Response:

Thank you for raising this important point. The RetinaNet model trained on visible images alone loses considerable precision in the high-recall region, which is what Table 1 reflects. Further clarification follows:

The PR curve, precision, recall and F1 score are the evaluation metrics used in our experiments. The PR curve in Figure 5 plots precision against recall as the number of predicted boxes varies: the more predicted boxes are counted, the higher the recall and the lower the precision. As Figure 5 shows, the red line (the detection model trained on visible images) follows a distribution similar to the black line (our proposed GF-detection model) over most of the recall range, especially at low recall. However, as Table 1 shows, most of the detection models achieve high recall, so it is the high-recall portion of the PR curve that matters most for overall detection performance, and this is exactly where the visible-only model suffers a large loss, as seen in Figure 5. Table 1 lists precision, recall and F1 score at an IoU threshold of 0.5, the common threshold in practical applications; the corresponding operating points of high precision and high recall lie in the top-right zone of Figure 5.
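The relationship between the PR curve and the single-point metrics can be made concrete with a small script: sweeping the confidence threshold traces out the PR curve, while one operating point at IoU 0.5 yields the precision, recall and F1 values reported in a table. The inputs below (scores, matches, ground-truth count) are dummy placeholders, not results from the paper.

```python
def pr_curve(scores, is_tp, num_gt):
    """Precision/recall pairs obtained by sweeping the confidence threshold.

    scores: confidence of each predicted box
    is_tp:  True if the box matches a ground-truth vehicle (IoU >= 0.5)
    num_gt: total number of ground-truth vehicles
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = []
    for i in order:   # adding boxes raises recall and tends to lower precision
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))   # (recall, precision)
    return points

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# A single operating point: e.g. precision 65.3% and recall 77.0% give F1 of about 70.7%.
print(round(100 * f1(0.653, 0.770), 1))
```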

4) The following paper could be of interest for the authors:

Deng, C.; Liu, X.; Chanussot, J.; Xu, Y.; Zhao, B. Towards perceptual image fusion: A novel two-layer framework. Information Fusion 2020, 57, 102–114. ISSN 1566-2535. https://doi.org/10.1016/j.inffus.2019.12.002; https://www.sciencedirect.com/science/article/abs/pii/S1566253518303075 (see Figure 8 in the paper).

Response:

We greatly appreciate this valuable suggestion. The TLF model puts forward a decomposition scheme based on the human perceptual mechanism and can extract visible and infrared features discriminatively, which is beneficial to visible and infrared image fusion. It offers an innovative perspective on this fusion problem and is of great value. A citation has been added as follows:

Many decomposition approaches, such as DRF [8], Decomposition [9] and TLF [10], analyze the differences between infrared and visible images and select valuable features for image fusion.

Reference:

  1. Deng, C.; Liu, X.; Chanussot, J.; Xu, Y.; Zhao, B. Towards perceptual image fusion: A novel two-layer framework. Information Fusion 2020, 57, 102–114.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper presents a deep learning-based approach to perform image registration. Infrared and visible images of a car data set are acquired during night-time with the help of drones. Overall, the paper addresses a very interesting practical problem and application of Generative Adversarial Networks, but it lacks an explanation of the deep models. Also, I would suggest rewriting the abstract and the start of the introduction section, as I find the selection of vocabulary inappropriate for scientific writing (e.g., hotpot, etc.). Several sentences are not clear, e.g., lines 2, 4, 18, 28-30, 35, 67, etc. Although the obtained results are encouraging, the poor explanation of the models does not justify the paper. So, I would suggest revising the paper with more detailed technical explanation and making the introduction more precise, especially since there is no state of the art there; e.g., on page 3, it seems that you are summarizing the reference paper.

Author Response

Reviewer 2

This paper presents a deep learning-based approach to perform image registration. Infrared and visible images of a car data set are acquired during night-time with the help of drones. Overall, the paper addresses a very interesting practical problem and application of Generative Adversarial Networks, but it lacks an explanation of the deep models. Also, I would suggest rewriting the abstract and the start of the introduction section, as I find the selection of vocabulary inappropriate for scientific writing (e.g., hotpot, etc.). Several sentences are not clear, e.g., lines 2, 4, 18, 28-30, 35, 67, etc.

So, I would suggest revising the paper with more detailed technical explanation and making the introduction more precise, especially since there is no state of the art there; e.g., on page 3, it seems that you are summarizing the reference paper.

Response:

We are very grateful for your keen and careful review of our article, which is of great value for improving the quality of our manuscript. All comments have been considered seriously and addressed accordingly. The revised introduction, abstract and wording are as follows:

Revisions in the introductions:

More detailed technical explanations and a more precise introduction have been added on Page 3, especially for the key components: the GAN, the visible fusion branch, the infrared fusion branch and the self-attention fusion model, as follows:

The Generative Adversarial Network (GAN) is a popular generative model used in image reconstruction. A GAN contains a generator and a discriminator and aims to generate images similar to the reference images. The generator consists of several convolution layers with up-pooling operations that enlarge the size of the inputs, while the discriminator is a classifier with convolution layers and down-pooling operations that distinguishes the fake reconstructed images from the true reference images. GANs have been introduced into the fusion of infrared and visible images in recent years.
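For illustration, a minimal sketch of one adversarial training step implementing the generator/discriminator game described above is given below. The generator and discriminator stand for the modules listed in the response to Reviewer 1; the binary cross-entropy adversarial loss, the L1 reconstruction term and the optimizers are assumptions, since the exact losses are not restated here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCEWithLogitsLoss()

def gan_step(generator, discriminator, g_opt, d_opt, inputs, reference):
    # Discriminator step: reference images should score "real", reconstructions "fake".
    fake = generator(inputs).detach()
    real_logits = discriminator(reference)
    fake_logits = discriminator(fake)
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: fool the discriminator while staying close to the reference
    # image (the L1 reconstruction term is an assumption).
    fake = generator(inputs)
    fake_logits = discriminator(fake)
    g_loss = bce(fake_logits, torch.ones_like(fake_logits)) + F.l1_loss(fake, reference)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```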

In this paper, to improve nighttime vehicle detection performance, we propose GF-detection, a GAN-based fusion model of visible and infrared images for the vehicle detection task. The fusion model is driven by the detection task. GF-detection contains a visible branch, an infrared branch, a self-attention fusion model and a detection model. Each branch contains a generator, a discriminator and a detection backbone used to extract detection features.

The fusion operation is as follows: in the visible branch, the tensors from the visible images and the visible detection features are concatenated and processed with convolution layers in the visible fusion model; in the infrared branch, the tensors from the infrared images and the infrared detection features are concatenated and processed with convolution layers in the infrared fusion model. The tensors from the two sub-branches, i.e., the visible branch and the infrared branch, are concatenated in the self-attention fusion model and then sent to the detection model. Since visible images contain more contour and color information while infrared images retain more targets under low-illumination conditions, visible images are suitable for extracting semantic features for vehicle classification and infrared images are suitable for extracting salient features for vehicle detection. Accordingly, different feature extraction strategies are applied to the visible and infrared images: semantic features are extracted with deep convolution models from the visible images for vehicle classification, and salient features are extracted with shallow convolution models from the infrared images for vehicle detection. The self-attention fusion model contains a channel attention and a spatial attention module to fuse the two reconstructed images. There are two fusion stages in our model: the first is the fusion of images and detection features within the GAN of each branch, and the second is performed by the self-attention fusion mechanism.
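As an illustration of the second fusion stage, the sketch below implements a CBAM-style channel and spatial attention consistent with the parameters listed in the response to Reviewer 1 (average/max pooling with a 512-to-32 reduction for channel attention, and a 7x7 convolution over the channel-wise mean and max for spatial attention). The 1x1 convolutions stand in for the FC(512,32) reduction, the padding of 3 (rather than the listed 1) keeps the spatial mask the same size as the input, and the initial channel-reduction convolution after concatenation is our assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels=512, reduced=32):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        # 1x1 convolutions play the role of the FC(512, 32) reduction on a 1x1 map.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1),
        )

    def forward(self, x):
        w = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        # 7x7 convolution over the stacked channel-wise mean and max maps.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class SelfAttentionFusion(nn.Module):
    """Concatenate visible and infrared branch features, reduce the channel
    count, then apply channel and spatial attention before detection."""
    def __init__(self, channels=512):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, vis_feat, ir_feat):
        x = self.reduce(torch.cat([vis_feat, ir_feat], dim=1))
        return self.sa(self.ca(x))
```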

Revisions in the abstract:

Vehicles are important targets in remote sensing applications, and nighttime vehicle detection has been a hot research topic in recent years. Vehicles in visible images captured at nighttime lack adequate features for object detection. Infrared images retain the contours of vehicles but lose the color information. Thus, it is valuable to fuse infrared and visible images to improve vehicle detection performance at nighttime. However, designing effective fusion models remains challenging due to the complexity of visible and infrared images. To improve nighttime vehicle detection performance, this paper proposes a fusion model of infrared and visible images with Generative Adversarial Networks (GAN) for vehicle detection, named GF-detection. GAN is widely used in image reconstruction and has recently been introduced into image fusion. Specifically, to exploit more features for the fusion, GAN is utilized to fuse the infrared and visible images via image reconstruction. The generator fuses the image features and detection features and then generates the reconstructed images for the discriminator to classify. Two branches, a visible branch and an infrared branch, are designed in GF-detection. Different feature extraction strategies are applied according to the differences between visible and infrared images. Detection features and a self-attention mechanism are added to the fusion model, aiming to build a detection-task-driven fusion model of infrared and visible images. Extensive experiments on nighttime images are conducted to demonstrate the effectiveness of the proposed fusion model for nighttime vehicle detection.

Revision in the words:

All inappropriate wording in lines 2, 4, 18, 28-30, 35 and 67 has been rectified. Furthermore, a careful check has been conducted to eliminate spelling errors in our manuscript.

 

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors propose a GF framework for nighttime vehicle detection and present experimental results and comparison work.

However, I have some questions and suggestions about this article as follows:

1) Does GF detection in this paper refer to GAN for fusion? I do not fully understand the meaning of the abbreviation.

2) There are many keywords listed, but in fact, they do not highlight the key technology of this paper.

3) In Figure 3, about the structures of GF-Detection: the relationship between B, C, D and A is not very intuitive, so further description is recommended.

4) Does the whole paper focus on the fusion of infrared and visible images, or on vehicle detection? The logical relationship is a little confusing and the description is not very clear.

5) In addition, are the fields of view of the two types of images contained in the data set (infrared and visible) registered?

In short, the description of the work flow is not clear, the idea of the work flow is not very clear, and the suggestions given in this paper are not very clear.

Author Response

Reviewer 3

The authors propose a GF framework for nighttime vehicle detection and present experimental results and comparison work.

However, I have some questions and suggestions about this article as follows:

1) Does GF detection in this paper refer to GAN for fusion? I do not fully understand the meaning of the abbreviation.

Response:

We are sorry for not explaining this clearly. G in GF-detection refers to GAN, and F refers to the fusion of visible and infrared images. GF-detection is thus the abbreviation of a detection model based on the GAN fusion of visible and infrared images. In brief, GF-detection is a vehicle detection model that contains a fusion stage for the visible and infrared images; the fusion is designed with a Generative Adversarial Network (GAN), and the GAN is driven by the detection task.

2) There are many keywords listed, but in fact, they do not highlight the key technology of this paper.

Response:

Thank you for pointing out our inappropriate expression. The key technology of this paper is the fusion of visible and infrared images with Generative Adversarial Networks for nighttime vehicle detection.

The keywords are listed as follows:

Keywords: vehicle detection, fusion of visible and infrared images, Generative Adversarial Networks

3) In Figure 3, about the structures of GF-Detection: the relationship between B, C, D and A is not very intuitive, so further description is recommended.

Response:

Thank you for your valuable suggestion. The color of each part in Figure 3 has been changed to give a more intuitive impression. Further descriptions of the whole structure and the submodules have been added as follows:

The whole structure of GF-detection contains a visible branch, an infrared branch and a self-attention fusion model. The visible branch aims to reconstruct the visible images via the fusion of the visible images and detection features; the infrared branch aims to reconstruct the infrared images via the fusion of the infrared images and detection features. The trained detection models extract the detection features from the visible and infrared images for the image reconstruction. The self-attention fusion model fuses the visible and infrared features and sends them to the detection model. Different feature extraction strategies are applied according to the differences between the visible and infrared images. Two GANs are contained in the whole model, and two trained detection models are employed. A detection model is placed after the self-attention fusion model.

Fig. 3: The structures of GF-Detection. Part (a) is the whole structure, which contains visible branch, infrared branch, self-attention fusion model and detection model. Part (b) is the visible branch, and it contains a generator (including an encoder and a decoder) and a discriminator. Part (c) is the self-attention fusion model, containing a fusion model and a self-attention mechanism. Part (d) is the infrared branch, and it contains a generator (including an encoder and a decoder) and a discriminator.

4) Does the whole paper focus on the fusion of infrared and visible images, or on vehicle detection? The logical relationship is a little confusing and the description is not very clear.

Response:

Thank you for this valuable feedback. The whole paper aims to improve vehicle detection via image fusion. The differences between visible and infrared images provide complementary features for nighttime vehicle detection. Table 1 demonstrates that the fusion models achieve better nighttime detection performance than the single-source detection models. Furthermore, detection-task-driven mechanisms are designed to adapt the fused images to the nighttime vehicle detection task.

5) In addition, are the fields of view of the two types of images contained in the data set (infrared and visible) registered?

Response:

Thank you for raising this important point. The visible and infrared images are registered and share the same image size. A pixel-to-pixel mapping between the visible and infrared images has been established before fusion.

In short, the description of the work flow is not clear, the idea of the work flow is not very clear, and the suggestions given in this paper are not very clear.

Response:

We have carefully considered all the comments you listed and provided a point-by-point response for your convenience. Furthermore, we have revised the manuscript carefully and added a clear description of the workflow to improve its readability. The workflow is as follows:

First, we introduce the necessity of fusing visible and infrared images for nighttime vehicle detection, which is the background of our work. Nighttime visible images suffer from false alarms and missed detections in vehicle detection due to low-illumination conditions. Infrared images contain high-contrast contours of the vehicles but lose the color information useful for vehicle detection. It is therefore valuable to fuse visible and infrared images to integrate their complementary features and improve nighttime vehicle detection performance.

Second, we point out the defects of existing fusion works, such as feature-based fusion, multi-scale fusion and fusion based on Generative Adversarial Networks (GAN). None of them exploits the differences between visible and infrared images effectively or achieves competitive fusion performance for vehicle detection. These defects are the motivation of our study.

Third, we introduce our fusion work, GF-detection, and its competitive fusion performance for vehicle detection. The innovations of our model are the GAN-based fusion branches, the fusion with detection features, the distinct visible and infrared branches, the self-attention fusion model, and a dataset containing paired visible and infrared images for fusion studies.

In brief, these comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with approval.

 

Author Response File: Author Response.pdf

Reviewer 4 Report

This paper discusses fusion with GAN of infrared and visible images for vehicle detection at nighttime. It is an interesting paper. However, there are some problems.

  1. The English writing is not of sufficient quality. There are many English grammar errors and typos. The paper should be revised carefully.
  2. It is good that the authors constructed their own data set with visible and infrared paired vehicle images. However, the authors used only one data set to verify their research work, which seems very weak. In my view, the single data set is simply a typical one, and it cannot prove the generality of nighttime vehicle detection. Therefore, the authors should generate or construct more data sets of visible and infrared paired images for nighttime vehicle detection in various conditions, such as different illumination, different weather conditions (sunny, rainy, foggy, snowy, thunderstorm days, etc.), morning time, evening time, etc., and then train the proposed approach on these various data sets.
  3. Since there are many other similar studies, the authors should compare their results with those of other studies.
  4. In Section 4.3, the paper mentioned Table III. However, Table III is not shown.

Author Response

Reviewer 4

This paper discusses fusion with GAN of infrared and visible images for vehicle detection at nighttime. It is an interesting paper. However, there are some problems.

  • The English writing is not of sufficient quality. There are many English grammar errors and typos. The paper should be revised carefully.

Response:

Thank you for your valuable suggestions; they are very helpful for revising and improving our paper and provide important guidance for our research.

The English grammar issues and typos mentioned above have been rectified in the revision, and a careful check has been conducted to eliminate incorrect wording.

(2) It is good that the authors constructed their own data set with visible and infrared paired vehicle images. However, the authors used only one data set to verify their research work, which seems very weak. In my view, the single data set is simply a typical one, and it cannot prove the generality of nighttime vehicle detection. Therefore, the authors should generate or construct more data sets of visible and infrared paired images for nighttime vehicle detection in various conditions, such as different illumination, different weather conditions (sunny, rainy, foggy, snowy, thunderstorm days, etc.), morning time, evening time, etc., and then train the proposed approach on these various data sets.

Response:

Thank you again for your valuable suggestions. Datasets are important for evaluating the effectiveness of the designed models. Since open paired visible and infrared images for vehicle detection are lacking, in this paper we propose a dataset named RGBT-Vehicle by collecting and labeling paired visible and infrared vehicle images from a crowd counting dataset. RGBT-Vehicle is used to demonstrate the effectiveness of our fusion model.

Many of the fusion models of visible and infrared images referenced in this paper are evaluated on pedestrian datasets [19,20,22,23,24,25] or open datasets such as the TNO Image Fusion Dataset [26,27,28]. To our knowledge, there is no paired visible and infrared vehicle dataset covering different illumination and weather conditions; collecting such images would be a separate piece of work beyond the fusion work we put forward.

Coincidentally, two works related to vehicle datasets are currently in progress.

One is the collection of infrared images under various illumination and weather conditions. Figure 1 shows long-wave infrared image samples we collected from drones. It contains images captured at different times (night and day), corresponding to different illumination conditions (low and high), and under different weather conditions (sunny and cloudy). However, the variety of illumination and weather conditions is still limited, and more infrared images under additional conditions will be collected in the future.

The other is the translation of visible remote sensing images into infrared images via Generative Adversarial Networks (GAN). A GAN can translate visible images into infrared images via image reconstruction; in this way, paired visible and infrared images are generated and can be used for vehicle detection. A test has been conducted on these reconstructed images, visualized in Figure 2. There are 1230 pairs of visible and generated infrared images in the new evaluation.

These two works will be completed in the future to generate paired visible and infrared images under different illumination and weather conditions. Given the limited revision time, only a preliminary test on the collected and generated images has been conducted. Your encouragement and affirmation are important for us to carry out further research.

Figure 1: Infrared images under different illumination and weather conditions. Part (a) is the night image; part (b) is the day image; part (c) is the sunny image and part (d) is the cloudy image. The night infrared image contains less interference than the day infrared image, and the sunny image contains more interference than the cloudy infrared image.

Figure 2: Samples of visible images and generated infrared images. The generated infrared images contain contours similar to those of real infrared images.

 

Table 1: Evaluations of the generated infrared images.

  detection model       precision   recall   F1 score
  visible               65.3        77.0     70.6
  generated infrared    63.2        72.4     67.4
  fusion                77.4        82.3     79.7
  GF-detection          82.1        86.2     84.1
 

Reference:

  1. Chen, Y.; Shin, H. Multispectral image fusion based pedestrian detection using a multilayer fused deconvolutional single-shot detector. JOSA A 2020, 37, 768–779.
  2. Ding, L.; Wang, Y.; Laganière, R.; Huang, D.; Luo, X.; Zhang, H. A robust and fast multispectral pedestrian detection deep network. Knowledge-Based Systems 2021, 227, 106990.
  3. Takumi, K.; Watanabe, K.; Ha, Q.; Tejero-De-Pablos, A.; Ushiku, Y.; Harada, T. Multispectral object detection for autonomous vehicles. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, 2017, pp. 35–43.
  4. Xiao, X.; Wang, B.; Miao, L.; Li, L.; Zhou, Z.; Ma, J.; Dong, D. Infrared and visible image object detection via focused feature enhancement and cascaded semantic extension. Remote Sensing 2021, 13, 2538.
  5. Zhang, Y.; Yin, Z.; Nie, L.; Huang, S. Attention based multi-layer fusion of multispectral images for pedestrian detection. IEEE Access 2020, 8, 165071–165084.
  6. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognition 2019, 85, 161–171.
  7. Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Information Fusion 2019, 50, 148–157.
  8. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Information Fusion 2019, 48, 11–26.
  9. Li, J.; Huo, H.; Li, C.; Wang, R.; Feng, Q. AttentionFGAN: Infrared and visible image fusion using attention-based generative adversarial networks. IEEE Transactions on Multimedia 2020,23, 1383–1396.
  10. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Transactions on Instrumentation and Measurement 2021, 70, 1–13.

 

(3) Since there are many other similar studies, the authors should compare their results with those of other studies.

Response:

Thank you again for your valuable suggestion. A large amount of work has been done to collect and reproduce methods for vehicle detection via the fusion of visible and infrared images. However, we encountered the following problems:

Some similar studies cited in this paper, such as DRF [8], Dunit [14], instance aware [28], MFDSSD [19], SKNet [20], multispectral ensemble detection [21], IAFR-CNN [24] and IATDNN+IAMSS [25], are not applied to vehicle detection via the fusion of visible and infrared images. Other related studies cited in this manuscript, such as FFECSE [22] and CS-RCNN [23], do not provide publicly available implementations.

Therefore, we reimplemented several fusion models, such as feature-based fusion, DRF [8] and fusion with GAN [26], as comparisons for our fusion model. A large amount of work was devoted to collecting and reproducing those models for the comparison.

Reference:

  1. Xu, H.; Wang, X.; Ma, J. DRF: Disentangled representation for visible and infrared image fusion. IEEE Transactions on Instrumentation and Measurement 2021, 70, 1–13.
  2. Bhattacharjee, D.; Kim, S.; Vizier, G.; Salzmann, M. Dunit: Detection-based unsupervised image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4787–4796.
  3. Chen, Y.; Shin, H. Multispectral image fusion based pedestrian detection using a multilayer fused deconvolutional single-shot detector. JOSA A 2020, 37, 768–779.
  4. Ding, L.; Wang, Y.; Laganière, R.; Huang, D.; Luo, X.; Zhang, H. A robust and fast multispectral pedestrian detection deep network. Knowledge-Based Systems 2021, 227, 106990.
  5. Takumi, K.; Watanabe, K.; Ha, Q.; Tejero-De-Pablos, A.; Ushiku, Y.; Harada, T. Multispectral object detection for autonomous vehicles. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, 2017, pp. 35–43.
  6. Xiao, X.; Wang, B.; Miao, L.; Li, L.; Zhou, Z.; Ma, J.; Dong, D. Infrared and visible image object detection via focused feature enhancement and cascaded semantic extension. Remote Sensing 2021, 13, 2538.
  7. Zhang, Y.; Yin, Z.; Nie, L.; Huang, S. Attention based multi-layer fusion of multispectral images for pedestrian detection. IEEE Access 2020, 8, 165071–165084.
  8. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognition 2019, 85, 161–171.
  9. Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Information Fusion 2019, 50, 148–157.
  10. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Information Fusion 2019, 48, 11–26.
  11. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Transactions on Instrumentation and Measurement 2021, 70, 1–13.

 

(4) In Section 4.3, the paper mentioned Table III. However, Table III is not shown.

Response:

Thank you for underlining this deficiency. Table III refers to Table 3; this mistake has been rectified in the revised version, and a careful check has been done to eliminate similar mistakes in the revised manuscript.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

I would suggest accepting the paper, as the issues have been explained.

Reviewer 3 Report

In this version, the author responded to my questions and made corresponding modifications according to my previous suggestions. It can be published in its present form.
