Article
Peer-Review Record

A Method for Vehicle Detection in High-Resolution Satellite Images that Uses a Region-Based Object Detector and Unsupervised Domain Adaptation

Remote Sens. 2020, 12(3), 575; https://doi.org/10.3390/rs12030575
by Yohei Koga 1,*, Hiroyuki Miyazaki 2 and Ryosuke Shibasaki 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 31 December 2019 / Revised: 28 January 2020 / Accepted: 5 February 2020 / Published: 9 February 2020
(This article belongs to the Special Issue Deep Transfer Learning for Remote Sensing)

Round 1

Reviewer 1 Report

Thank you for the opportunity to review this valuable paper. The paper titled "A Method for Vehicle Detection in High-Resolution Satellite Images that Uses a Region-based Object Detector and Unsupervised Domain Adaptation" is well structured, has the potential to be of great interest to readers, and presents an up-to-date research problem.

The introduction provides sufficient background and includes all relevant references, but the reference list is not up-to-date. There is an interesting background on related research in the second part of the study, giving the research problem more significance.

To my knowledge, the methodology and the accuracy assessment are performed correctly.

I would like to see more discussion and conclusions relating this paper to others. The authors need to present more about the advantages and disadvantages of the method used.

Also, the authors must discuss and compare their work with the most recent studies; almost all of the references are somewhat old, and there are only a few references from the past three years (2017-2019).

 

Author Response

Response to Reviewer #1:

Thank you for the opportunity to review this valuable paper. The paper titled "A Method for Vehicle Detection in High-Resolution Satellite Images that Uses a Region-based Object Detector and Unsupervised Domain Adaptation" is well structured, has the potential to be of great interest to readers, and presents an up-to-date research problem.

 

Thank you for your valuable comments, which helped to improve the quality of this manuscript. We carefully addressed your comments in the revision. Unless otherwise mentioned, the corresponding revisions are highlighted in PURPLE in the manuscript.

 

The introduction provides sufficient background and includes all relevant references, but the reference list is not up-to-date. There is an interesting background on related research in the second part of the study, giving the research problem more significance.

 

We added two recent vehicle detection papers in L118-120 and a comparative discussion in L139-146 (in GREEN).

 

To my knowledge, the methodology and the accuracy assessment are performed correctly.

 

I would like to see more discussion and conclusions relating this paper to others. The authors need to present more about the advantages and disadvantages of the method used.

 

Also, the authors must discuss and compare their work with the most recent studies; almost all of the references are somewhat old, and there are only a few references from the past three years (2017-2019).

 

We listed recent papers on domain adaptation for remote sensing tasks in the Related Work section and added discussions (L192-207, in BLUE). We also added discussions of the advantages and disadvantages of our method in Section 5 (L556-563, 572-576, and 599-627, in GREEN).

Reviewer 2 Report

The authors use two domain adaptation methods in the area of vehicle detection to improve detection accuracy in the absence of labeled data. The approach has been tested on a real-world dataset, and the authors demonstrate that, through knowledge transfer from a source domain, the target-domain performance, where labeled data is not accessible, is almost as good as when labeled data is available. The results are convincing, and I find the paper interesting from a practical point of view. I have the following comments before final publication:

1. The authors can include the following recent papers published in Remote Sensing in the Related Work section and compare their method with them in terms of novelty:
   a. Benjdira, B., Bazi, Y., Koubaa, A. and Ouni, K., 2019. Unsupervised Domain Adaptation Using Generative Adversarial Networks for Semantic Segmentation of Aerial Images. Remote Sensing, 11(11), p.1369.
   b. Rostami, M., Kolouri, S., Eaton, E. and Kim, K., 2019. Deep Transfer Learning for Few-Shot SAR Image Classification. Remote Sensing, 11(11), p.1374.
   c. Garea, A.S., Heras, D.B. and Argüello, F., 2019. TCANet for Domain Adaptation of Hyperspectral Images. Remote Sensing, 11(19), p.2289.
   d. Bejiga, M.B., Melgani, F. and Beraldini, P., 2019. Domain Adversarial Neural Networks for Large-Scale Land Cover Classification. Remote Sensing, 11(10), p.1153.
2. In the experiments, can the authors add an additional result for different values of the \gamma parameter? I want to see how sensitive the algorithm is with respect to this parameter.
3. In Figure 6, please run the experiment 10 times and plot the average and STD. This is important to trust the results.
4. In Tables 1 and 2, please also run the experiment 10 times and report both the average and STD.

Author Response

Response to Reviewer #2:

The authors use two domain adaptation methods in the area of vehicle detection to improve detection accuracy in the absence of labeled data. The approach has been tested on a real-world dataset, and the authors demonstrate that, through knowledge transfer from a source domain, the target-domain performance, where labeled data is not accessible, is almost as good as when labeled data is available. The results are convincing, and I find the paper interesting from a practical point of view. I have the following comments before final publication:

 

Thank you for your valuable comments, which helped to improve the quality of this manuscript. We carefully addressed your comments in the revision. Unless otherwise mentioned, the corresponding revisions are highlighted in BLUE in the manuscript.

 

The authors can include the following recent papers published in Remote Sensing in the Related Work section and compare their method with them in terms of novelty:
   a. Benjdira, B., Bazi, Y., Koubaa, A. and Ouni, K., 2019. Unsupervised Domain Adaptation Using Generative Adversarial Networks for Semantic Segmentation of Aerial Images. Remote Sensing, 11(11), p.1369.
   b. Rostami, M., Kolouri, S., Eaton, E. and Kim, K., 2019. Deep Transfer Learning for Few-Shot SAR Image Classification. Remote Sensing, 11(11), p.1374.
   c. Garea, A.S., Heras, D.B. and Argüello, F., 2019. TCANet for Domain Adaptation of Hyperspectral Images. Remote Sensing, 11(19), p.2289.
   d. Bejiga, M.B., Melgani, F. and Beraldini, P., 2019. Domain Adversarial Neural Networks for Large-Scale Land Cover Classification. Remote Sensing, 11(10), p.1153.

 

Thank you for introducing these valuable papers. We listed them in the Related Work section and added discussions. We also cited the paper by Rostami et al. as a promising approach for improving our method in future work (L556-563, in PURPLE).

 

In the experiments, can the authors add an additional result for different values of the \gamma parameter? I want to see how sensitive the algorithm is with respect to this parameter.

 

We conducted additional experiments with different values of gamma. Because the performance differences seemed subtle, we repeated the experiments 10 times and report statistics, as you suggested. Results are shown in Table 2, Figure 10, and L590-592.

 

In Figure 6, please run the experiment 10 times and plot the average and STD. This is important to trust the results.

In Tables 1 and 2, please also run the experiment 10 times and report both the average and STD.

 

We are sorry to say that it was infeasible to repeat all experiments 10 times due to the limited time for revision and our limited computing resources, although we understand that this procedure would increase trust in the results. Instead, we repeated a limited set of experimental conditions to justify our results effectively. Specifically, we repeated the experiments for all values of the gamma parameter to investigate how our proposed method (reconstruction loss) improved accuracy. Results and discussions are shown in Table 2, Figure 10, and L590-592, 597-598, and 599-615 (in GREEN).

 

Reviewer 3 Report

The authors present a method based on unsupervised domain adaptation to detect vehicles in satellite images. 

The work is very well presented and explained, following a structure that facilitates the understanding of the method. The intro is short but sufficient; the RW is divided into three lines of research that converge in this work. The methodology is very well explained, with good mathematical formulation and justification of the networks used. The weakest sections are the results and conclusions. The paper needs some improvement before it is accepted:

-Line 11, abstract: The first sentence contains too many pauses and clarifications. It must be rewritten.

-Line 31: The reference can be deleted. It is explained later.

-Line 37: "their model...": the sentence is confusing. Rewrite.

-In RW, YOLO and SSD need more explanation given their current relevance.

-Line 91: The AdaBoost [13] algorithm needs more explanation. The relevance of AdaBoost is not clear.

-Each subsection should state, at its end, the contributions of this work with respect to the RW.

-In vehicle detection, it would be better to mention more car detection work. Reference is made repeatedly to Tang et al., by way of example, but many other works have the same problems. In that respect, the RW must be expanded with more papers.

-Does Figure 1 require a reference?

-Line 236: "(the rectangle drawn in red dotted line in Figure 2(a))". Figure 2(a) is enough.

-In results, for each training, you must indicate the parameters, equipment, time, number of samples per class, etc.

-Line 269: "We first down-sampled the images to 0.3 m/pixel because our targets are satellite images". Can the authors justify this based on some reference? It seems a somewhat sweeping statement.

-Figure 4(a) needs a scale.

-It is not clear why there are two resolutions in the images, and why the 1000x1000 images?

-Can the authors explain the peak in Figure 7?

-In results, there should be some example images of the vehicle detection and, if possible, comparative images of the detection results with different methods.

-The discussion is very short and lacks comparison with other methods; please mention the strengths and weaknesses of the detection.

-Conclusions must be extended

Author Response

Response to Reviewer #3:

The authors present a method based on unsupervised domain adaptation to detect vehicles in satellite images.

 

The work is very well presented and explained, following a structure that facilitates the understanding of the method. The intro is short but sufficient; the RW is divided into three lines of research that converge in this work. The methodology is very well explained, with good mathematical formulation and justification of the networks used. The weakest sections are the results and conclusions. The paper needs some improvement before it is accepted:

 

Thank you for your valuable comments, which helped to improve the quality of this manuscript. We carefully addressed your comments in the revision. Unless otherwise mentioned, the corresponding revisions are highlighted in GREEN in the manuscript.

 

-Line 11, abstract: The first sentence contains too many pauses and clarifications. It must be rewritten.

 

We rewrote it (L11).

 

-Line 31: The reference can be deleted. It is explained later.

 

We deleted it (L31).

 

-Line 37: "their model...": the sentence is confusing. Rewrite.

 

We rewrote it (L35-36).

 

-In RW, YOLO and SSD need more explanation given their current relevance.

 

We listed recent methods relevant to YOLO and SSD in L81-97 (in PURPLE).

 

-Line 91: The AdaBoost [13] algorithm needs more explanation. The relevance of AdaBoost is not clear.

 

We added an explanation in L109-112.

 

-Each subsection should state, at its end, the contributions of this work with respect to the RW.

 

We added descriptions of our novelty at the end of each Related Work subsection (L98-99, 150-152, 209-213).

 

-In vehicle detection, it would be better to mention more car detection work. Reference is made repeatedly to Tang et al., by way of example, but many other works have the same problems. In that respect, the RW must be expanded with more papers.

 

We added recent vehicle detection papers in the Related Work (L118-120, in PURPLE) and a discussion of those methods (L139-146). Additionally, we added an experiment evaluating M2Det as a vehicle detection method in L447-462 (in PURPLE) and a comparison to DA methods in L572-576 (in PURPLE).

 

-Does Figure 1 require a reference?

 

We believe a reference is not required because Figure 1 was drawn by us, and exactly the same figure does not appear in other papers, including the original paper.

 

-Line 236: "(the rectangle drawn in red dotted line in Figure 2(a))". Figure 2(a) is enough.

 

We changed it to Figure 3(a) (new numbering) in L310.

 

-In results, for each training, you must indicate the parameters, equipment, time, number of samples per class, etc.

 

While we believe all parameters are explained in Sections 3 and 4, we added descriptions of the other settings: equipment in L434, 457-458, and 496; time in L434, 458, and 495; and samples per class (vehicle numbers) in Table 1.

 

-Line 269: "We first down-sampled the images to 0.3 m/pixel because our targets are satellite images". Can the authors justify this based on some reference? It seems a somewhat sweeping statement.

 

We selected 0.3 m/pixel according to the resolution of WorldView-3 (0.31 m/pixel), which is one of the highest-resolution commercial satellites. We added an explanation in L336-338 (in RED) and L348-350.

 

-Figure 4(a) needs a scale.

 

We added a scale.

 

-It is not clear why there are two resolutions in the images, and why the 1000x1000 images?

 

We added explanations in L351 and L383-387.

 

-Can the authors explain the peak in Figure 7?

 

We added an explanation in L435-437.

 

-In results, there should be some example images of the vehicle detection and, if possible, comparative images of the detection results with different methods.

 

We added example detection images in Figure 11.

 

-The discussion is very short and lacks comparison with other methods; please mention the strengths and weaknesses of the detection.

 

-Conclusions must be extended

 

We extended the discussion and conclusion sections according to the above modifications.

 

Reviewer 4 Report

Please see the attached report.

Comments for author File: Comments.pdf

Author Response

Response to Reviewer #4:

This paper is well-written. I have some suggestions that may improve the paper.

 

Thank you for your valuable comments, which helped to improve the quality of this manuscript. We carefully addressed your comments in the revision. Unless otherwise mentioned, the corresponding revisions are highlighted in RED in the manuscript.

 

Is the SSD architecture shown in Figure 1 from another publication? If so, it would be good to cite the source.

 

We believe a reference is not required because Figure 1 was drawn by us, and exactly the same figure does not appear in other papers, including the original paper.

 

For CORAL DA, it would be good to draw a block diagram to illustrate the process. Are the covariance matrices CS and CT generated using examples within a batch of 32 (32 source and 32 target examples)?

 

You are exactly right. We added a block diagram in Figure 2 and more detailed descriptions in L260-262 and L468-469.

 

In Eq. (1), CF should have been CT.

 

We modified it.
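For context on this correction: the standard CORAL loss from Sun and Saenko (2016), which an equation of this form typically follows (stated here as the generic definition, not necessarily the paper's exact Eq. (1)), is

```latex
\ell_{\mathrm{CORAL}} = \frac{1}{4d^{2}} \left\lVert C_S - C_T \right\rVert_F^{2}
```

where C_S and C_T are the covariance matrices of the d-dimensional source and target features, and the norm is the Frobenius norm.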

 

It is not very clear what CORAL DA is doing. My understanding of DA is that a vehicle detection model is trained using source data. Now, we have a small set of target-domain images, and one wants to update the previously trained source model with the target images so that the source model is adapted to the new domain. Is the source model being used to generate CS? Is the updated model being adjusted using the target images to generate CT? Please clarify.

 

To minimize the distance between CS and CT, not only CT but also CS needs to be calculated. Since deep learning training uses minibatch SGD, CT is calculated from the examples in a minibatch; therefore, CS is also calculated from the examples in a minibatch, in the same way as CT. In this way, CS is calculated using the source model. (As an implementation detail, the source and target models share weights, i.e., they are an identical model, as Figure 2 shows.) We have clarified these points in L260-262.
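To make this minibatch computation concrete, here is a minimal sketch of a per-minibatch CORAL loss in PyTorch (an illustrative example, not the authors' actual code; `model`, `source_batch`, `target_batch`, and `gamma` are hypothetical names):

```python
import torch

def coral_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """CORAL loss between source and target minibatch features.

    f_s, f_t: (batch, d) feature matrices produced by the SAME
    weight-shared network on source and target examples, respectively.
    """
    d = f_s.size(1)

    def covariance(f: torch.Tensor) -> torch.Tensor:
        n = f.size(0)
        f = f - f.mean(dim=0, keepdim=True)  # center the features
        return (f.t() @ f) / (n - 1)         # (d, d) sample covariance

    c_s = covariance(f_s)  # C_S, estimated from the source minibatch
    c_t = covariance(f_t)  # C_T, estimated from the target minibatch
    # Squared Frobenius distance, scaled as in Sun & Saenko (2016)
    return ((c_s - c_t) ** 2).sum() / (4 * d * d)

# Usage sketch: one weight-shared model processes both minibatches, so
# minimizing this term aligns second-order feature statistics:
#   f_s = model(source_batch); f_t = model(target_batch)
#   loss = detection_loss + gamma * coral_loss(f_s, f_t)
```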

 

Traditionally, a simple method in target detection is to apply histogram matching [a][b]. The detection model may be trained using some images, which may have different characteristics from the test image due to illumination or other factors. If the test image is histogram-matched to those training images, then the detection model based on source images can still be applied. For example, in cloud and shadow detection for Landsat images, there are some formulae to detect clouds and shadows [a]. To detect shadows and clouds in WorldView images, one cannot directly use those formulae because Landsat and WorldView images are very different. In ref [a] below, a simple histogram matching was applied to transform WorldView images to have similar characteristics to Landsat. After that, the formulae for Landsat could be applied to WorldView images. I believe such a simple histogram matching can be applied to vehicle detection in this paper. The target images can be histogram-matched to the source images. After that, the source-trained model can be applied to detect vehicles in the target images.

Actually, this idea was used in a vehicle tracking and classification paper [c]. Please comment on refs. [a], [b], and [c] regarding the use of histogram matching and include some discussion of this idea in the revised paper. If possible, the authors may also include some experiments using this simple idea.

[a] “Simple and effective cloud-and shadow-detection algorithms for Landsat and Worldview images,” Signal Image and Video Processing, 2019.

[b] “Handbook of Image and Video Processing, A volume in Communications,” Networking and Multimedia, 2nd Edition, Academic Press, 2005.

[c] “Deep Learning-Based Target Tracking and Classification for Low Quality Videos Using Coded Aperture Cameras,” Sensors 19 (17), 3702, 2019.

 

Thank you for your valuable insights. In our understanding, traditional image processing methods such as the one you mentioned are widely used as "data augmentation" techniques in deep learning, and such a method is already incorporated in our vehicle detector (SSD). Despite its effectiveness, simple data augmentation is insufficient because object detection requires capturing more sophisticated features, such as geometrical structures, unlike simple pixel-wise classification based on pixel intensity. Image feature differences at the geometrical level cannot be aligned by simple image processing; thus, a DA method is required. We listed the above articles and added relevant descriptions and discussions in L129-139 and L237-242. We are sorry that we could not identify where histogram matching is used in paper [c], and thus we did not cite paper [c] in this revision.
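For concreteness, the histogram matching the reviewer describes can be sketched as follows (an illustrative example using scikit-image, not part of the authors' pipeline; as discussed above, it aligns pixel-intensity statistics only and cannot align geometrical-level differences):

```python
import numpy as np
from skimage.exposure import match_histograms  # requires scikit-image >= 0.19

def match_target_to_source(target_img: np.ndarray,
                           source_img: np.ndarray) -> np.ndarray:
    """Histogram-match a target-domain image to a source-domain reference
    so that a source-trained detector sees familiar intensity statistics.

    Both images are (H, W, C) arrays; matching is done per channel.
    """
    return match_histograms(target_img, source_img, channel_axis=-1)
```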

 

There are six datasets in Section 4.1. I spent quite some time going back and forth to see which dataset is used for what. It would be good to summarize them in a table so that readers can trace which dataset is used for which purpose. I created a table for the datasets. In later sections, the authors can refer to those datasets instead of writing out the whole phrase. For example, in L407, the authors can simply refer to Dataset 5 instead of "Target domain test dataset with annotations".

(table omitted)

 

Thank you for your valuable suggestion. We summarized all training data in a new table, as you suggested, with some modifications. We named each dataset so that its purpose can be understood more easily at a glance, and we added the number of vehicles contained in each dataset. We added Table 1 and replaced all dataset references in the body text with the new names.

 

There are some minor presentation issues. I list some of them below.

L269: I am confused about the statement "our targets are satellite images". In L283, the target images are actually aerial images, as mentioned in Figure 4. So, is the image in Figure 4(b) a satellite image, not an aerial image?

L281: The image resolution of the test images is 0.16 m/pixel, which is similar to that of the source images (0.15 m/pixel). Why are those images downsampled to 0.3 m?

 

We simulated satellite images by using downsampled aerial images. We selected 0.3 m/pixel according to the resolution of WorldView-3 (0.31 m/pixel), which is one of the highest-resolution commercial satellites. We added an explanation in L336-338 and L348-350 (in GREEN).

 

L240: L1 → L_1

Figure 6: Tonly → T only; SandT → S and T

L363, L370: "beta 1" and "beta 2" should use subscripts.

L451: interfering → interfering with

L458: Section 4.3 → Section 4.4

 

Thank you for pointing these out. We revised all of them.

Round 2

Reviewer 3 Report

The work has improved considerably. The authors have done a great deal of work to improve it. Their answers are clear and direct. I recommend its acceptance in the present form.

Reviewer 4 Report

The authors have responded to all of my comments. The paper looks very good and I enjoyed reading it.
