Patch-Based Discriminative Learning for Remote Sensing Scene Classification
Round 1
Reviewer 1 Report
The paper presents a new architecture called Patch-Based Discriminative Learning (PBDL) for Remote Sensing Scene Classification (RSSC). PBDL aims to improve classification by concentrating on two of the main RSSC challenges, inter-class similarity and intra-class diversity. It approaches scenes patch by patch, but explores the dependencies among patches rather than analyzing each patch only individually.
The new method is tested on four well-known benchmark datasets for RSSC. The results are reported using the accuracy metric and suggest that the method outperforms previous works in the field.
The introduction states the motivation of the work and gives a first presentation of the proposed method, but neither this presentation nor the figure captions are entirely clear. The related work section is well presented; still, it would be helpful to clearly state the gaps that your work fills, e.g. how do you differ from multi-scale approaches or attention mechanisms (two approaches that solve problems similar to the ones you mention)?
Section 3 describes the proposed method, but many things are not clearly explained (it is hard to read). For example, it is mentioned that a multi-scale approach is used with different scales (σ), but only in Section 4 is it said that σ corresponds to the scale factor of the Gaussian kernel. The section also motivates the use of a BiLSTM by the temporal pattern of scenes across image time series, but the temporal aspect of the datasets is not present or explained, so the motivation should be better presented.
The paper also presents an ablation section to study the different parameters. While this section is valuable, it lacks consistency: each analysis is presented on a different dataset (consistency is especially important for a fair comparison against other approaches), and in most cases the details of how the test was carried out are not all clear (e.g. patch size, training split, etc.).
In the experimental results, each dataset is used to compare the proposed method against previous work. The list of previous works should follow clear criteria and be homogeneous across all datasets, as should the training split percentages.
Overall, the paper is interesting and significant, given that the results presented surpass previous works, but the presentation of the proposed method and of the results should be improved. Points to clarify or improve are listed below, section by section.
SECTION 1. Introduction
Lines 26-28: “Remote sensing has received unprecedented attention due to its role in mapping land cover, geographic image retrieval, natural hazards detection, and monitoring changes in land cover.” → Citation for each application example is recommended
Figure 1. Improve the image labels (a) and (b); perhaps make it more explicit which class each image belongs to. There is a punctuation issue in the figure description. Also, this phrase would be better placed in the main text than in the image description: “This encourages us to learn multi-level spatial features that have small within-class scatter but large between-class separation.“
Figure 2. “The main idea of the proposed work. “ → Rephrase. Also, the term SURF is used for the first time here, so it may be appropriate to add citations for the reader.
Line 55-58 “In comparison 55 to handcrafted features, the bag-of-words (BoW) model is one of the famous mid-level 56 (global) representations and is extremely popular in image analysis and classification, while 57 providing an efficient solution for aerial or satellite image scene classification. “ → Add citation
Line 78: “Moreover, deep learning-based methods generally analyze an individual 78 patch and treat different scene categories equally. “. Clarify this phrase, what does it mean that other methods treat categories equally, and how do you do it any differently?
Line 90. “Ideally, 89 special attention should be paid on the image patches that are the most informative for 90 classification. This is due to the fact that objects can appear at any location in the image “ → Attention mechanisms focus on the most relevant part of the images, and have been applied to RS scenarios as well. You should clearly state that as well, since this phrase you inserted calls for that analysis. They should be included in your literature review if you compare them with methods based on this premise.
Line 97-99: “In this paper, instead of working towards a new CNN model 97 or a local descriptor, we introduce patch-based discriminative learning (PBDL) to extract 98 image features region by region based on small, medium, and large neighborhood patches…” → What does small, medium and large actually mean? How small is small? Do they depend on the image size or is it fixed?
Line 100-101 “This is motivated by 100 the fact that different patch sizes still exhibit good learning ability of spatial dependencies 101 between image region features that may help to interpret the scene [19].” Is there a more recent reference for this? or maybe drop the “still”?
Line 102 introduces Figure 2 which has its own description caption but that’s not enough. The figure is not clearly explained in the introduction section. Add in this part of the main text an explanation of the workflow presented in the Figure.
Lines 103 to 110 are not clear, too many concepts are introduced at once.
Line 119. You use BiLSTM network, at least provide a citation since it is the first reference to it.
SECTION 2. Literature review
Line 170 “In brief, handcrafted features have their benefits and disadvantages. For instance, 170 the color features are more convenient to extract in comparison with texture and shape 171 features “ → add citation
Line 253. ”Thus, a natural question arises: can we combine different region 253 features effectively and efficiently to address scene image classification? With the exception 254 [20], to our knowledge, this question still remains mostly unanswered.” Introduce in one or two lines the answer provided by [20] and how you try to explore it further.
Recent surveys could be cited or used as tool to find relevant works aligned to yours (e.g. Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities).
SECTION 3. Proposed method
Line 260. The first paragraph introduces four components: (a) estimation 260 of patch-based regions, (b) scale-space representation, (c) information fusion and (d) a 261 BiLSTM based sub-network for classification purpose. Figure 3 instead presents three components (feature learning in place of (a) and (b), followed by (c) and (d)). Unify the concepts (names and number of main components) and use the same division into components throughout the document (e.g. first paragraph, figures and subsection names).
Figure 3. Why the output for one image in three different classes, a single or multi-label setting? Is the output the confidence of the class belonging to that class? (I guess so for the softmax) but it should be really clear from the paper.
Line 273/274 “Here, the definition of different neighborhood sizes is considered to be small, medium, or large regions.“ → How are these defined? When this concept is introduced, its meaning, its value and its utility should all be explained.
Line 288/289. “For instance, taking 288 an equal 4 × 4 pixel stride at the lowest scale σ = 1.6“ → it is not clear to say that 1.6 is “the lowest” since at this point the different values of scale(σ) that you propose are not presented.
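To illustrate why the set of σ values needs to be stated before one of them is called "the lowest", the following minimal sketch builds a normalized 1-D Gaussian kernel for two scales. Only σ = 1.6 appears in the manuscript text quoted above; the second value (3.2), the `truncate` parameter and the function name are assumptions made for illustration, not the authors' implementation.

```python
import math


def gaussian_kernel_1d(sigma: float, truncate: float = 3.0) -> list[float]:
    """Return a normalized 1-D Gaussian kernel truncated at `truncate` sigmas."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    radius = int(truncate * sigma + 0.5)
    weights = [
        math.exp(-(x * x) / (2.0 * sigma * sigma))
        for x in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# 1.6 is quoted from the manuscript; 3.2 is a hypothetical second scale.
for sigma in (1.6, 3.2):
    kernel = gaussian_kernel_1d(sigma)
    print(f"sigma={sigma}: {len(kernel)} taps, peak weight {max(kernel):.4f}")
```

A larger σ spreads the weights over more pixels, so stating every σ used would let the reader see how much smoothing each scale applies.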
Figure 4. Which scales and patch size are used? Clarify image caption → ”Illustration of overlapping sample windows at two sizes. In both images, the pixel offset kept the same between the yellow window and the red window. A large overlapping can be observed in the bottom images.”
*at two sizes → which sizes?
*In both images, the pixel offset kept the same between the yellow window and the red window → it is not clear what you meant with that, which two images (left/right, bottom/up?) ?
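One way to make the caption concrete would be to report the overlap numerically. The sketch below assumes square windows shifted horizontally by a fixed pixel offset; the function name and the example window sizes are illustrative and are not taken from the manuscript.

```python
def overlap_fraction(window: int, offset: int) -> float:
    """Fraction of a square window's area shared with a copy of itself
    shifted horizontally by `offset` pixels."""
    if window <= 0 or offset < 0:
        raise ValueError("window must be positive and offset non-negative")
    shared_width = max(0, window - offset)
    return shared_width / window


# Same 4-pixel offset, two hypothetical window sizes.
for size in (8, 32):
    print(f"window {size} x {size}: overlap {overlap_fraction(size, offset=4):.3f}")
```

With the offset held fixed, the shared area grows with the window size, which appears to be the point the caption is trying to make.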
Lines 341. “SURF features are clustered through the 341 k-means clustering process and mapped to a specific codeword“ → Which specific codeword? how do you generate it? based on a % of the dataset? how is this chosen?
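For reference, the usual bag-of-words step that this sentence seems to describe can be sketched as follows. It assumes that the codebook (the k-means cluster centres) has already been learned from some subset of training descriptors, which is exactly the detail the manuscript should state. The toy 2-D descriptors and all names are hypothetical.

```python
import math
from collections import Counter


def nearest_codeword(descriptor, codebook):
    """Index of the codebook centre closest to `descriptor` (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(descriptor, codebook[i]))


def bow_histogram(descriptors, codebook):
    """Normalized histogram of codeword assignments for one image."""
    if not descriptors:
        return [0.0] * len(codebook)
    counts = Counter(nearest_codeword(d, codebook) for d in descriptors)
    return [counts.get(i, 0) / len(descriptors) for i in range(len(codebook))]


# Toy 2-D data; real SURF descriptors have 64 or 128 dimensions.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (4.8, 5.1)]
print(bow_histogram(descriptors, codebook))
```

Each descriptor is mapped to the index of its nearest centre, so "a specific codeword" would simply be that index; how the centres themselves were obtained remains the open question.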
Lines 373-377. “The Earth observation satellites normally capture consecutive images of the same ground by visiting the same area every few days. Thus, the time elapsed between consecutive images complement the temporal resolution (i.e., the time when it was acquired) [47]. Our motivation for using bidirectional long short-term memory (BiLSTM) [48] is to take advantage of the temporal pattern of the scenes across image time series.“
The motivation you stated for using BiLSTM is to take advantage of the image time series. But the four datasets that you use in the experiments do not include the images for the same location in consecutive times, do they? Please expand on this aspect to make it clear either way.
SECTION 4. Datasets and Experimental setup
Line 400 and 401. Numbers have different format, unify notation → 100000 vs. 31,500
Line 414. “The vocabulary size of k in the remote-sensing 414 domain varies from a few hundred to thousands“. Be more specific for replicability purposes.
Line 415. “We set the size of visual vocabulary 415 to 15000 for UC Merced, AID, NWPU, and 10000 for the WHU-RS dataset.“ Explain why? (e.g. given that WHU-RS has a lower amount of images).
Line 425, ablation study. Some analyses were performed on one dataset or another (WHU-RS, Fig. 7; UC Merced, Table 3), some on four datasets (Gaussian scales, Fig. 8; neighborhood size, Table 1), and some on three datasets (image descriptors, Fig. 9; pixel strides, Table 4). It is confusing why certain datasets were chosen for certain experiments and not for others. Consistency would be appropriate: run all analyses on the four datasets. Otherwise, present everything for one dataset in the paper and provide the other three in an annex or supporting material (if the journal supports it).
Line 449. “surpassing 90% with just 10% of all samples as a 449 training sample. This is a remarkable improvement compared with the previous methods.” → previous methods have not been presented in the paper at this point.
Line 451. “In addition, UC-Merced, WHU-RS, NWPU, and Aerial Image take 19343.48 s, 22904.76 451 s, 44542.16 s, and 82170.04 s for training, and 601.32 s, 1452.6 s, 1452.6 s, and 19263.92 s 452 for testing, respectively.” What significant insight is drawn from this information? What about showing the average time for processing each image, is that useful information too?
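The per-image figure suggested here is a simple division, sketched below. The 601.32 s value is the UC Merced testing time quoted above; the test-set size of 420 images is a hypothetical count (20% of the 2,100 UC Merced images), since the split used for this timing is not stated.

```python
def seconds_per_image(total_seconds: float, n_images: int) -> float:
    """Average processing time per image."""
    if n_images <= 0:
        raise ValueError("n_images must be positive")
    return total_seconds / n_images


# 601.32 s is quoted from the manuscript; 420 images is an assumed test-set size.
uc_merced_test = seconds_per_image(601.32, 420)
print(f"UC Merced: {uc_merced_test:.3f} s per test image")
```

Reporting this number for each dataset, together with the actual image counts, would make the timings comparable across datasets of different sizes.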
Figure 7. Letters (a) and (b) are cropped.
Figure 9. In the three sub-figures (a,b,c) it seems that SURF-BOW and SPM-BOW reach a vocabulary size in which the accuracy stops improving, while with the presented approach the bigger the vocabulary size the higher the accuracy. What is the intuition behind this?
Figure 10. Image description goes right, center, left, maybe it would be easier to read from left to right.
Figure 11. “All points in the scatterplots are class coded”--> suggestion→ ..are color coded by class.
Tables 1, 2 and 4: which training ratio was used for each dataset?
Table 4. Explain in the caption what PS1, PS2 and PS mean (not only in the text).
Line 470. “One 470 can see that even the proposed one-stage detection method with the neighborhood size of 471 (4 × 4) significantly outperforms the SPM method.” How does the reader know that you used a 4X4 neighborhood size if in each subsection of the ablation studies you use different settings? e.g. which training ratio is used for each dataset. Also “neighborhood size” equals “patch size”, “window size”? Although more repetitive, it is clearer for the reader to always use the same terms and not change them from paragraph to paragraph since it is a complex method with a number of variables.
Line 482. ”In comparison with these 482 state-of-the-art fusion methods, our proposed fusion performs best with an accuracy of 483 99%.” Actually, in Table 3 the “Ours” method only surpasses the other state-of-the-art methods in 1 out of 4 training split percentages (80%). How do you establish that it is better nonetheless? What is your intuition on how the different percentages affect the results?
Some confusion matrices present percentages others do not. Consistency among all would be better. Also some are really big and hard to read when printed.
Also, always refer to each dataset by the same name: “Aerial Image” is used in Table 2, whereas “AID” is used in almost all other figures and tables.
Line 457. Figure 15. Why is this study of training ratios comparison among methods only presented for 1 out of 4 datasets? What valuable insight is drawn from this figure? (other than the fact that your methods perform better as already stated).
Line 512. For some datasets two training ratios are used, while for others only one. Ideally, all datasets would be compared at the same percentages, but at least use the ratios that are usually used to benchmark them in the literature. E.g. the AID dataset uses (20, 50), but UC Merced only 80 instead of (80, 20). Review and add the missing ratios to make sure the comparison is fair.
Line 573. “As shown in Table 8, 573 the PBDL achieves the highest classification (99.63%) accuracy and outperforms all the 574 previous methods for the 19 classes.” The table only shows the overall accuracy and not a per-class accuracy, so it is not possible to corroborate that it works better for all 19 classes (which most likely it is but it is not something that can actually be seen in table 8 as stated).
I hope my recommendations help you in any way possible.
Best regards,
Author Response
Please find the attached response file. Thanks.
Author Response File: Author Response.pdf
Reviewer 2 Report
This study mainly proposed to explore the spatial dependencies between different image regions and introduced patch-based discriminative learning (PBDL) for remote-sensing scene classification. Experiments on four datasets verify the effectiveness of the proposed method. However, the following issues still need to be addressed before publication.
1. Abstract “Although deep learning algorithms can deal with a large amount of data, convolutional neural networks (CNNs) generally analyze an individual patch without considering the dependencies among different image regions”. However, with the development of attentional mechanisms, more and more studies have focused on the dependencies among different image regions.
2. It is recommended that the authors open source the code used in the study to make it easier for readers to understand, and provide downloadable links to the datasets.
3. The presentation of the manuscript needs further improvement.
1) In “Table 3. Comparison of classification accuracy (%) with feature-level fusion methods under different training sizes on the UC Merced dataset”, The accuracy value does not retain two decimal places.
2) The authors conducted several experiments in "4.4. Performance comparison with state-of-the-art methods", so the experimental results are presented as (Mean±std). Why other parts are not experimented with several times?
4. The author mentions that the training ratio is 80%, is this the ratio of the training set to the testing set? Or is it the ratio of the training set to the validating set?
5. An important aim of this study is to explore the dependencies of different image regions, and spatial attention is indeed a mature and applicable module for this problem, so I suggest that the authors include some of this type of approach in the section comparing with the latest methods.
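Points 3.2 and 4 could both be addressed by stating the protocol explicitly. A minimal sketch of one common protocol is given below: a seeded random train/test split at a stated ratio, repeated several times, with the mean and sample standard deviation reported. The accuracy values are hypothetical placeholders, and the two-way split (no separate validation set) is an assumption, not the authors' stated setup.

```python
import random
import statistics


def split_indices(n_samples: int, train_ratio: float, seed: int):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_ratio)
    return indices[:n_train], indices[n_train:]


def summarize(accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


train, test = split_indices(n_samples=100, train_ratio=0.8, seed=0)
# Hypothetical accuracies (%) from five runs with different seeds.
mean, std = summarize([98.9, 99.1, 99.0, 98.8, 99.2])
print(len(train), len(test), f"{mean:.2f} ± {std:.2f}")
```

Stating the ratio as train/test (or train/validation/test) and the number of repetitions in every table caption would remove the ambiguity.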
Author Response
Please find the attached response file. Thanks.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The manuscript was improved based on the reviewer's recommendations. Still, there are some points for improvement and to my understanding some unanswered questions.
In particular, the presentation of the content still needs work. Please make sure to review all image descriptions, which are really important, and review the English.
One aspect that drew my attention was that you discovered a new training setting that led to better results on UC Merced. While you updated Table 3, you did not update the other tables in which the UC Merced dataset is used (e.g. Table 7). Also, I suggest you use this new setting in the other experiments as well, if it leads to improvements.
Also, I still encourage the authors to do the analysis on the same dataset with the same settings changing only one aspect at a time (e.g. patch size vs feature fusion methods), to enable a fair comparison otherwise it is not clear why a certain dataset is used for one analysis and not the other.
Some detailed comments:
> Figure 2 description: "These path sizes can significantly improve the scene classification performance." The anticipated result is not relevant as part of the image description. Please review all images to make sure that it is only image description and that do not draw conclusions.
Still don't really understand the value of the figure, the fact that the smaller the patch the denser the green squares are in the image seems obvious. Lines 108-111 could be part of the image description. While the main text should contain the value the reader can draw from the image itself.
> "The exact meaning of the small, medium, large can also context dependent according to the available image resolution from the image sensor". This does not seem to be a well-formed sentence. Additionally, what does context-dependent mean? How is the actual value calculated w.r.t. the image resolution? Is there a formula? Is it a grid search?
>This comment from the previous review still stands. "Figure 3. Why the output for one image in three different classes, a single or multi-label setting? Is the output the confidence of the class belonging to that class? (I guess so for the softmax) but it should be really clear from the paper"
> "To demonstrate this, we first define a region over the 288 entire image, where the patch sizes used are (4 × 4), (6 × 6), (8 × 8), (10 × 10), with the 289 sliding steps corresponding to patch sizes. Here, the definition of different neighborhood 290 sizes is considered to be small, medium, or large". There are four patch sizes (4, 6, 8, 10) but only three descriptions (small, medium, large).
Also, it is still not clear, how one should choose the right patch size, you continue to say that it can depend on the image resolution, what's the relation? a percentage? a formula? how one should choose, how did you?
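To show what the relation to image resolution could look like in practice, the sketch below counts the patches produced by each of the four quoted patch sizes when the sliding step equals the patch size. The 256 × 256 image size is an assumption (it matches UC Merced images), and the function name is illustrative.

```python
def patches_per_side(image_size: int, patch: int, stride: int) -> int:
    """Number of patch positions along one image side for a sliding window."""
    if patch <= 0 or stride <= 0:
        raise ValueError("patch and stride must be positive")
    if patch > image_size:
        return 0
    return (image_size - patch) // stride + 1


IMAGE_SIZE = 256  # assumed; e.g. UC Merced images are 256 x 256 pixels
for patch in (4, 6, 8, 10):
    n = patches_per_side(IMAGE_SIZE, patch, stride=patch)
    print(f"{patch} x {patch}: {n} per side, {n * n} patches in total")
```

A table of this kind for each dataset, alongside the rule used to pick the sizes, would answer the question of how the patch size depends on resolution.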
> Line 303. "Moreover, we adopt multi-scale representation by utilizing different scale σ sizes." How σ is actually used? How the right value is selected.
> Figure 4. It's not clear, what are you trying to show here? What I understand from the image:
Top row: Left: an image of P1 x P1, Right: an image covering the same territory but of size P2 x P2.
Bottom row: Left: an image of P3 x P3, Right: an image covering the same territory but of size P4 x P4.
Where P{1,2,3,4} stands for the pixel size. Is it the same image resized? or is it different sampling size of the same image?
What insight do you need to show here to the reader? Maybe add in the image description also the scale σ parameter for each image.
>Line 308. "Both images share significant overlapping even at the large scale. " If it is the same image resized at different scales what do you mean by redundancy? the same image at different sampling sizes sure would have the same content... It sure is obvious and clear to the authors, but not to the reader.
> "We set the size of visual vocabulary to 432 15000 for UC Merced, AID, NWPU, and 10000 for the WHU-RS dataset. The WHU-RS 433 dataset has relatively lower number of images in comparison to other datasets, which was 434 the reason of decreasing the size of visual vocabulary" Is it proportional?
> Figure 7. Not readable
>In the previous review I stated that all analyses should be performed on the same dataset, to make a fair evaluation (for example, you evaluate how different feature-level fusion methods work on UC Merced, but test how different pixel strides work on WHU-RS). You could do all analyses on the four datasets or at least all analyses on the same one. Your answer: "The previous methods [32,33] used for comparison in Table 3 provide the results only on UC Merced dataset, therefore, we quote their results for a fair comparison."
My comment stands, it is appropriate to perform all analyses in all datasets or all analyses in one dataset. (You can always replicate previous works if needed)
> Lines 470 to 474: you first indicate the training and inference times for each dataset and then say that the patch size has a significant influence on the time. But the times provided are not associated with any patch size. Review this statement, since with the numbers provided one can only assume that the differences in time among datasets depend on the resolution and number of images.
>Previous comment was not answered (you only removed sub-figures). "Figure 9. It seems that SURF-BOW and SPM-BOW reach a vocabulary size in which the accuracy stops improving, while with the presented approach the bigger the vocabulary size the higher the accuracy. What is the intuition behind this?"
> For each analysis and table, which training ratio was used for each dataset? If it is the same for all the analysis please specify it in a neutral point indicating that it is valid for all (not only in table 1).
> As a reply to my previous comment concerning Table 3 results, you indicate that "We have run the experiments again on UC Merced dataset and noticed that increasing the hidden layer size (100) of BiLSTM further improves the performance " Please also update Table 7 with the results of this new size of hidden layers.
Also, why not re-do the experiments for all the other cases to see if you can get better results for all?? Why did you remove the 80% split (that was the only one that in the previous version of the table was better?)
> Figure 10 caption: indicate window size and stride for each one of the three.
> English must be reviewed. (Just an example, Although the time requires to construct the vocabulary in the range of few hours.... --> the time required ... is in the range... or ranges from.. )
Author Response
Please find the attached response letter. Thanks.
Author Response File: Author Response.pdf