Article
Peer-Review Record

Landscape Similarity Analysis Using Texture Encoded Deep-Learning Features on Unclassified Remote Sensing Imagery

Remote Sens. 2021, 13(3), 492; https://doi.org/10.3390/rs13030492
by Karim Malik * and Colin Robertson
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 19 December 2020 / Revised: 20 January 2021 / Accepted: 25 January 2021 / Published: 30 January 2021

Round 1

Reviewer 1 Report

Dear authors, 

I have read the resubmitted version of your paper. Thank you for addressing all my comments from the initial submission. I therefore consider that the paper should be accepted in its present form!

Best regards

 

Author Response

We highly appreciate the reviewer's comments and suggestions, as these have helped improve the manuscript.

Reviewer 2 Report

1. Only three landscape types are actually used in this study. At least one more type needs to be added, e.g., the residential type that appears in Fig. 1.
2. (Page 22, Lines 680–682) “Presenting the models … likely to refine … reduce misclassification rates”. Please carry out additional experiments to support this conclusion.

Author Response

We highly appreciate the reviewer's comments and suggestions, as these have helped improve the manuscript. An additional response to the issues raised has been attached.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Suggest final writing improvements.

Author Response

We appreciate the reviewer's comment. The manuscript has been double-checked and the writing improved.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Dear Editor,

I have reviewed your paper and found it to be an interesting and well-organized work. I would like to suggest publishing this paper, subject to the several moderate comments below:

  1. Please provide the training/testing ratio used for modeling in the Abstract.
  2. This paper developed novel models; this should be indicated more clearly in the Introduction.
  3. Also, the literature review needs to be extended in the Introduction section.
  4. Please provide the equations used in each method and model and give the proper citations for them.
  5. Please provide more detail on the steps in the flowchart and on the hyper-parameters used in each model and method; please also provide a table of the values used for these parameters.
  6. Please provide the limitations and future work for this study in the Conclusion section.
  7. Please add the section with the contribution of each co-author.

 

Reviewer 2 Report

The manuscript addresses the landscape-image classification problem by means of modified/adapted CNNs, the so-called Tex-CNN or texture-based convolutional neural networks. The scheme is somewhat classic: it combines some (three) convolutional layers, reshaping of the reduced images, and finally a SoftMax classifier. The novelty turns out to be the modification of a convolutional layer by introducing a kind of texture detector.
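The concatenate-then-classify scheme the reviewer summarizes can be illustrated with a minimal numpy toy (single-channel kernels, a hypothetical `tex_cnn_forward` with random weights; this is a sketch of the general idea, not the authors' implementation):

```python
import numpy as np

def conv_layer(x, k):
    # 'valid' 2-D cross-correlation with a single kernel, followed by ReLU
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return np.maximum(out, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tex_cnn_forward(img, kernels, W, b):
    # pass the image through stacked conv layers, keeping each layer's output
    feats = []
    x = img
    for k in kernels:
        x = conv_layer(x, k)
        feats.append(x.ravel())
    # concatenate the multi-scale features before the softmax classifier
    f = np.concatenate(feats)
    return softmax(W @ f + b)
```

Because every layer's output feeds the classifier, both fine (texture) and coarse (structure) responses reach the final decision, which is the multi-resolution aspect discussed later in this record.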

The manuscript is clearly written, describes its contributions clearly, provides a large list of references and a state-of-the-art review, and the experiments certainly support the announced improvements of the proposal through the numerical results, e.g., in Figures 5 and 6.

However, I find a couple of issues. First of all, the performance of the proposed algorithms shows improvements that go much further than one could expect in view of the proposal. Unfortunately, I have to take those results on trust, since I do not have the algorithms at hand (to validate the results, or even to stress the implementation by means of harder experiments).

On the other hand, the proposal combines, in a convenient manner, several standard functions/steps/procedures/... of CNN libraries whose performance has been tested extensively in many instances. This puts in doubt the level of novelty of the present work, and whether it deserves to be accepted in a top journal.

Reviewer 3 Report

  1. Section 3.1: Clear and detailed descriptions are needed for the architecture.
  2. More detailed mathematical equations/descriptions are needed in methods.
  3. The architecture of the classical CNN used in this study needs to be described, and its difference from the Tex-CNN emphasized.
  4. Lines 303, 305: The corresponding Eq. (2) seems not to exist.
  5. Eq. (1): A numerical formulation for computing W is needed.
  6. Eq. (2): Is LocX or LocY one grid or a set of grids?
  7. Fig. 6: Why are the results in the first row so bad?
  8. Only three landscape types are used in this study. This is not enough to demonstrate the advantages of the study; more types are needed, for example 5–10.

Reviewer 4 Report

This paper describes a method to measure the landscape similarity between images. It relies on deep-learning features obtained through a specific CNN architecture. This CNN aims at classifying images, i.e., assigning them a landscape label, and is trained on the AID dataset. The specificity of the proposed architecture lies in the concatenation of the features produced by the different convolutional layers just before a fully connected layer, in order to have a multi-resolution analysis of the image. However, the paper does not focus on image classification but on landscape similarity analysis. So the feature maps generated by the different convolutional layers of the network are concatenated and then undergo a PCA. Only the first principal component is kept; it is then encoded into a feature vector through a HoG analysis. So, at this step, each image is described by such a feature vector, and the similarity between different images can thus be calculated by computing a distance between their corresponding feature vectors. In this study, the distance used is the Earth Mover's Distance (EMD).
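The pipeline summarized above (concatenated feature maps → first principal component → HoG vector → EMD) can be sketched roughly in numpy. All function names here are hypothetical, the HoG is a bare-bones version without block normalization, and the EMD is the simple 1-D closed form over concatenated histogram bins rather than a full transportation solve; this is an illustration of the data flow, not the authors' code:

```python
import numpy as np

def pca_first_component(feature_maps):
    # feature_maps: (H, W, C) stack of concatenated CNN feature maps;
    # project each pixel's C-vector onto the first principal component
    h, w, c = feature_maps.shape
    X = feature_maps.reshape(-1, c).astype(float)
    X -= X.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return (X @ vt[0]).reshape(h, w)

def hog_vector(img, n_bins=9, cell=8):
    # minimal HoG: per-cell histograms of unsigned gradient orientation,
    # weighted by gradient magnitude, concatenated and L2-normalized
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    h, w = img.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            hist, _ = np.histogram(ang[i:i + cell, j:j + cell],
                                   bins=n_bins, range=(0.0, np.pi),
                                   weights=mag[i:i + cell, j:j + cell])
            feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-9)

def emd_1d(p, q):
    # 1-D EMD between two normalized histograms = L1 distance of their CDFs
    p = p / (p.sum() + 1e-12)
    q = q / (q.sum() + 1e-12)
    return np.abs(np.cumsum(p - q)).sum()

def landscape_distance(fmaps_a, fmaps_b):
    ha = hog_vector(pca_first_component(fmaps_a))
    hb = hog_vector(pca_first_component(fmaps_b))
    return emd_1d(ha, hb)
```

Identical feature stacks yield a distance of zero, and dissimilar landscapes yield larger values, which is exactly the behaviour the distance histograms discussed below are meant to expose.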

 

The CNN was trained on the AID dataset, but experiments were then conducted both on the AID images and on Sentinel-2 images. Landscape classification was assessed (confusion matrices) and, more interestingly, the relevance of the proposed landscape similarity measure was assessed through an analysis of the distribution (histograms) of the computed distance, both between images of the same landscape class and between images of distinct land-cover classes.

 

The paper is interesting. Its originality mostly lies in its application, i.e., in the fact that it does not focus directly on image classification, but on the definition of an efficient similarity metric to compare landscapes.

However, it is not always completely clear, as explained below in my detailed comments.

 

  • The state-of-the-art is interesting and pleasant to read. It contains many interesting references. However, when discussing landscape or image similarity measurement strategies, it is mostly oriented toward remote sensing applications, while this kind of approach is also used in the more general content-based image retrieval (CBIR) context. Thus it would be nice to add a few more generic CBIR references.
  • CNN architecture. The proposed CNN architecture has 3 convolutional layers. Why 3 layers? Did you test it with fewer or more layers?
  • A PCA is used to obtain a one-channel feature map containing the most relevant information from the concatenated feature map generated by the CNN. However, it would have been possible to do this directly within the CNN using attention strategies, i.e., to make a weighted sum of the different feature maps instead of concatenating them before the fully connected layer. Why was such a strategy not considered? It should be discussed.
  • Data augmentation. Data augmentation is performed: horizontal flips and rotations are mentioned. However, no scale transforms are mentioned. This is quite surprising, as it would be relevant to make the proposed strategy more invariant to scale, and as the Sentinel-2 dataset (not seen while training the network) is supposed to assess this invariance to scale.
  • Datasets. Two datasets are used: AID and Sentinel-2 scenes. They are mentioned but not sufficiently described. For instance, the number of images of both datasets is not provided. Please add more information about the datasets.
  • Proposed “Tex-CNN” vs. classical CNN. The proposed “Tex-CNN” is compared to a “classical CNN” in several experiments. However, the “classical CNN” architecture is never defined… I guess it must correspond to a similar architecture without the feature concatenation, but it is not very clear. Please define it clearly.
  • Confusion matrices. Confusion matrices are provided but the way they are computed is not very clear. Do they correspond to the results of the CNN classifications? Or do they correspond to a nearest neighbour classification using the proposed similarity metric w.r.t. reference examples for each class? Please clarify.
  • Histograms presenting the distribution of the proposed similarity distance between scenes. I appreciated this approach to assess the proposed method. However, in practice, what is presented is not always very clear:
    • Do they correspond to AID or to Sentinel-2 datasets (or to both)?
    • What do “G1” and “G2” mean? Do they refer to AID or Sentinel-2 datasets? Or do they refer to landscape sub-types? Or do they correspond to two images? Please define.
    • The way these histograms were computed is not clearly explained. Is each histogram computed only for one specific scene (compared to other ones) or are they an accumulation of such results for all scenes? Please explain.
  • Additional results that would be relevant.
    • To assess the relevance of the method, it would be nice to also show results (EMD distribution histograms) obtained using a HoG descriptor computed directly over the original image (instead of over the derived features).
    • The outputs of the three convolutional layers of the network are concatenated and used to calculate the descriptor. However, it would be interesting to assess the relevance of the information provided by each layer. For instance, the first layers mostly extract “texture information” that can lead to noise when comparing landscapes, while the last layers mostly extract “structure information”. Thus, it would be really interesting to have an ablation study here and to show the results obtained for different combinations of the feature maps provided by the different layers of the CNN (only layer 1, only layer 2, only layer 3, layers 1 and 2, layers 2 and 3).
  • Complex landscapes (‘mountain’). “Bad” results are obtained when distinguishing mountain from forest, which can be perfectly understood, as mountain scenes also contain forest. Maybe using only the information provided by the last layers of the CNN could improve this distinction, as they contain more structure information than texture information. Another alternative could be the use of a bag-of-words approach, often used in CBIR.
  • Title. The title refers to aerial imagery. It is true that the AID dataset is mostly used in the paper and contains aerial data. However, their metric spatial resolution (1 to 8 m) corresponds more to satellite imagery. Besides experiments also involve Sentinel-2 data. Thus, maybe the title could be changed to “Landscape similarity analysis using texture encoded deep-learning features on unclassified remote sensing imagery”?

 

  • Section 3.1. Only convolutional layers are mentioned, but one can guess the network also includes ReLU and (max/mean?) pooling... --> Please clarify.
  • Figure 1 is a little ambiguous as it could let the reader think that the input of the network requires two images.
  • Figure 2. I am not sure this figure is really useful.
  • Section 3.2. What is the size of the receptive field of the network at the end of conv layer 3?
  • Section 3.5. Equation 1 is useless. Maybe it could be deleted.
  • Figures 7, 8 and 9. It could be interesting here to also show the result of the PCA, as at the end, only this result is used to compute the HoG descriptor.

 

  • Typos. The paper is well written and contains very few typos. However, I noted the following:
    • Section 1. “CNNS” --> “CNNs”
    • Section 2.2. “hand-crated” --> “hand-crafted”
    • Section 2.3. “the use features” --> “the use of features”
    • Section 2.3. “in demonstrated in a study” --> “is demonstrated in a study”
    • Section 3.3. “is depicted in figure 2 illustrate” --> Strange sentence with two verbs.
    • Section 3.5. “Equations 1 and 2 summarize feature maps derivation” --> Equation 2 refers to something else.
    • Section 4. “the availability sufficient” --> “the availability of sufficient”
    • Section 4. “Kolmogorov-smirnov” --> “Kolmogorov-Smirnov”