Article
Peer-Review Record

TransHSI: A Hybrid CNN-Transformer Method for Disjoint Sample-Based Hyperspectral Image Classification

Remote Sens. 2023, 15(22), 5331; https://doi.org/10.3390/rs15225331
by Ping Zhang 1, Haiyang Yu 1,2,*, Pengao Li 1 and Ruili Wang 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Reviewer 5:
Submission received: 13 September 2023 / Revised: 10 November 2023 / Accepted: 10 November 2023 / Published: 12 November 2023

Round 1

Reviewer 1 Report (New Reviewer)

Comments and Suggestions for Authors

The author proposed a method that incorporates a CNN into the transformer to extract both local and global spectral and spatial features. Some related experiments were conducted to verify the effectiveness of the network. However, some issues in this manuscript still need to be addressed.

1、The author is recommended to include related work in the article.

2、The author should keep terminology consistent (for example, '3D CNN' in lines 195-196) and correct the formatting in lines 178-183, line 298, and lines 337-344 of the manuscript.

3、The author should enhance the credibility of the classification results for the three datasets, for instance, by presenting the standard deviation alongside them.

4、Please cite the specific table being referred to at line 521 of the manuscript.

5、In Section 4.1.1, for Table 7, (a), (f), and (g), the AA accuracy is lower than that of TransHSI. Does this imply that the addition of a transformer encoder would decrease the network's accuracy? Or could it be that the author did not effectively integrate CNN with the transformer encoder?

6、The author should provide an analysis of the network's parameters and complexity in the manuscript.

7、The author should clearly label each specific module mentioned in lines 170-171, i.e., 'the HSI pretreatment module,' 'the spectral–spatial feature extraction module,' and 'the fusion module' in Figure 1. And please clarify whether 'the spectral–spatial feature extraction module' represents a joint spectral-spatial extraction module or separate spectral and spatial extraction modules.

8、The author mentions the use of a stratified sampling strategy for selecting the test and training sets. It is recommended to provide a brief example using one dataset to illustrate how the training and test sets are chosen.

9、The author should explicitly specify in the manuscript which dataset Figure 11 and Figure 12 correspond to.

10、The author should provide the experimental results for all datasets, along with the corresponding patch sizes and training sample percentages in the manuscript.

11、In Lines 184-185, the statement 'In this manuscript, the size of the India Pines dataset is 145*145*200, with B set to 30.' Is there experimental evidence to support the selection of 'B' as 30? If so, it is recommended to include this in the discussion.

12、Section 2.2.1 and Section 2.2.2 both describe specific procedures without explaining why these steps are taken and their respective purposes. It is suggested that the author provide the specific objectives for these actions.

13、The author should select several recently published CNN-related algorithms for comparison to demonstrate that the proposed method is superior.

14、The author should state the innovations of this manuscript in the conclusion section.

15、Please have the author explain why the activation maps of the Pavia University Dataset samples are displayed as 16 blocks and how they were sampled.
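Comment 8 above asks for a brief illustration of how a stratified split selects training and test samples. The following is a minimal, generic sketch of per-class stratified sampling, not the authors' actual procedure; the function name `stratified_split`, the `train_fraction` parameter, and the label counts are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Split sample indices so every class contributes about train_fraction to training."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: keep at least one training sample for every class.
        n_train = max(1, round(len(indices) * train_fraction))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# Hypothetical label vector: three classes with 100, 40, and 10 labelled pixels.
labels = [0] * 100 + [1] * 40 + [2] * 10
train_idx, test_idx = stratified_split(labels, train_fraction=0.1)
print(len(train_idx), len(test_idx))
```

Because the split is drawn class by class, rare classes are still represented in the training set, which a purely random split over all pixels does not guarantee.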

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report (New Reviewer)

Comments and Suggestions for Authors

The paper proposes the TransHSI classification model, a spectral-spatial feature system in which the spectral feature extraction module combines 3D CNNs with different convolution kernel sizes and a Transformer to extract global and local spectral features of HSIs (hyperspectral images), while the spatial feature extraction module combines 2D CNNs and a Transformer to extract the local and global spatial features of HSIs. Furthermore, the proposed model contains a fusion module that cascades the extracted spectral-spatial features and the original HSI after dimensionality reduction, enabling the classification of HSIs using both shallow and deep features. Experimental tests are conducted on three public datasets, Indian Pines, Pavia University and Data Fusion Contest 2018, and the results show that, compared with 11 other traditional and advanced HSI classification algorithms, TransHSI achieves competitive performance regarding the overall accuracy and kappa coefficient on all three datasets.

The Introduction clearly states the purpose of the paper.

The paper uses appropriate methods, well explained and referenced, to produce competitive results. The methodology corresponding to each of the end points in the Results section is provided. For each layer of the system, the authors explain its relevance and contribution to the classification results. However, to fully understand the work, a strong knowledge of working with CNNs/deep learning mechanisms is required.

Many experimental tests have been conducted on three public datasets, and the results are effectively presented, using appropriate non-textual elements, such as figures and tables. I particularly appreciated the comparative analysis with 11 traditional and advanced HSI classification algorithms. The results of the study are fully interpreted and discussed, some of them offering novel insights.

The reference list comprises relevant papers focused on the authors' area of study, which are explicitly cited within the paper. Most of them were published in recent years.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report (New Reviewer)

Comments and Suggestions for Authors

This paper is very well written, with quality content. It is filled with well-designed experiments and detailed analyses. Not much modification is needed other than some minor improvements, such as in lines 15-18.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report (New Reviewer)

Comments and Suggestions for Authors

This paper is well written and its contribution is acceptable. However, the conclusions should be toned down to claim results comparable to other techniques rather than superior ones, and phrases used in many places, such as "strong generalization", should be moderated.
In addition, the authors should demonstrate how their methodology reduces the network complexity compared to other models.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 5 Report (New Reviewer)

Comments and Suggestions for Authors

TransHSI: A Hybrid CNN-Transformer Method for Disjoint Sample-based Hyperspectral Image Classification

The abstract should clearly define the problem and state the dataset used and the results obtained.

A separate section called Literature Review is needed. All recent state-of-the-art techniques should be discussed, along with the limitations and drawbacks of each. A summary table is needed to cover all the literature.

A dataset section is needed to give all the technical details of the dataset used for this study, including any preprocessing and data augmentation techniques applied.

Section 2.4 should be in a separate section.

Section 3.1 should be a separate main section.

What are the performance evaluation metrics?

What are OA, AA, and K?

The results should be supported and justified with graphs and confusion matrices.

The proposed methodology should be discussed in a separate section named: Proposed Methodology.

The novelty of the methodology should be discussed in a separate subsection.

The comparative analysis should be done with different studies and proper citations should be provided.

The comparative analysis should be done based on various metrics.

A discussion section is needed.
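
For reference, the abbreviations queried above conventionally denote overall accuracy (OA), average accuracy (AA), and Cohen's kappa coefficient (K) in hyperspectral image classification. A minimal sketch of how the three are computed from a confusion matrix follows, assuming rows are true classes and columns are predicted classes; the function name `classification_metrics` and the example matrix are hypothetical, and nothing here is taken from the manuscript's code or results.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    Assumes every class has at least one sample and chance agreement is below 1.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all samples classified correctly.
    oa = correct / total

    # Average accuracy: unweighted mean of the per-class accuracies (recalls).
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa


# Hypothetical 3-class confusion matrix, for illustration only.
example = [
    [50, 2, 3],
    [5, 40, 5],
    [0, 10, 35],
]
oa, aa, kappa = classification_metrics(example)
print(f"OA={oa:.4f}  AA={aa:.4f}  kappa={kappa:.4f}")
```

OA is dominated by the largest classes, whereas AA weights every class equally, so the two can diverge noticeably on imbalanced datasets.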


Comments on the Quality of English Language

No comments

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report (New Reviewer)

Comments and Suggestions for Authors

Thanks to the author for answering my comments comprehensively. A small suggestion: if you want to present the results in a confusion matrix, consider presenting them in a 3D confusion matrix. It is also recommended that you include the t-SNE algorithm to display the data distribution of the network output more visually. I hope the following literature can help you:

MS3Net: Multiscale stratified-split symmetric network with quadra-view attention for hyperspectral image classification

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report (New Reviewer)

Comments and Suggestions for Authors

The authors have revised the paper based on the comments. Hence, the paper can be accepted.

Comments on the Quality of English Language

NO Comments

Author Response

Please see the attachment.

Author Response File: Author Response.docx

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors proposed a hybrid convolution and transformer framework for hyperspectral image classification. Extensive experiments are given to evaluate the proposed model, but the contribution is limited for publication to Remote Sensing. Some detailed suggestions are shown as follows.

1. The research motivation stated in the abstract and throughout the paper (for example, that CNNs are limited in global perception and in handling the sequential nature of the data) does not highlight the originality of the work. Many methods already combine convolution and transformers. What is your innovation and/or advantage?

2. Also in the abstract, what does “the network is optimized using residual modules” mean? A network can be trained with an optimizer, but describing residual modules as optimizing it is unsuitable here. In addition, the experimental results do not indicate excellent classification performance; the authors need to summarize their strengths carefully.

3. The authors can simplify the introduction and analyze the current research around their motivation. Some methods irrelevant to the proposed model can be removed from the Introduction.

4. In subsection 2.1 HSI pretreatment, (xi, xj) denotes the center pixel position, but it is easily confused with the variable for data features, and the notation is not used later. I think the author can simplify the description.

5. “…, it is common to monotonically employ CNNs in the shallow stage and stack Transformer Encode blocks in the final one to two layers[39,40]. However, such strategies fail to capture the global information of the shallow stage in HSI.” The authors could analyze why previous works do not place Transformer blocks in the shallow layers, or verify with an ablation study (for example, by removing the step in Eq. (5)) that the Transformer in the shallow layer is indispensable.

6. In line 262, the description, “every convolutional layer except the first Conv 3D is followed by a Batch Normalization (BN) layer, …”, is in disagreement with Eq. (1), which presents the BN operation.

7. In line 305, the sentence states that “the weight matrix learned in the previous step is fed into the MLP layer”, but this is not a weight matrix if it refers to the output of MHSA in Eq. (13).

8. In line 328, the authors emphasize that Wa and Wb are initialized with Xavier initialization. How are the other parameters initialized, and is there any special treatment?

9. How do you define the position embedding in Eq.(19)?

10. More critically, Section 2 reads more like a lab report than a research paper. I therefore further think that this paper is limited in novelty and significance of content.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Authors are recommended to carefully check the entire manuscript, including misspellings or unreadable descriptions. Here we list some cases, but not limited to these.

1) In subsection 2.1 HSI pretreatment, the sentence for PCA processing is too verbose.

2) In lines 251-252, there is a sentence that does not have a subject. This problem appears elsewhere in the paper, such as in lines 320 and 323.

3) In line 317, in the sentence “The cascade layer is responsible for fusing the shallow features and deep spectral-spatial features of HSI.”, the word “responsible” is inappropriate.

4) In lines 339-341, the sentence here is not relevant to the context in this subsection.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper combines CNN and transformer to extract spectral and spatial features separately, and then integrates them for classification. Overall, the topic is interesting, but there still exist some issues to address.

1. There are more convolution modules in Figure 2 (a) and (b) compared to the description in the previous text. Is the "×2" wrongly repeated?

2. What tokens does X_cls refer to? Please provide an explanation when it first appears.

3. In line 479, a "%" is missing.

4. What is the basis for setting the number of training samples? Why does the IP dataset have such a high proportion of training samples?

5. More experimental settings need to be clearly explained, such as network size and number of convolutional kernels.

6. Does TransHSI have any advantages in terms of time consumption or number of parameters?

7. Some relevant references on how to better extract spatial-spectral features can be cited, such as "Unlocking the Potential of Data Augmentation in Contrastive Learning for Hyperspectral Image Classification" and "ContrastNet: Unsupervised feature learning by autoencoder and prototypical contrastive learning for hyperspectral imagery classification".

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Review of Remote Sensing Manuscript ID: 2476124

TransHSI: A Hybrid CNN-Transformer Method for Disjoint Sample-based Hyperspectral Image Classification

 

The paper addresses the important challenge of building a deep neural network architecture that can combine spectral patterns and spatial correlations together in a learned representation of a hyperspectral image. It is well-known that CNNs have a spatial invariance inductive bias while transformers have a permutation invariance inductive bias. Yet, CNNs struggle to capture spectral sequence features and transformers struggle with spatial correlations unless additional care is taken. The authors of the current paper propose a new spectral-spatial feature extraction module that utilizes 3D CNNs and 2D CNNs with different kernel sizes in combination with Transformers to extract both local and global spectral-spatial features of HSI. They also propose a fusion mechanism that concatenates features extracted at different stages and applies a semantic tokenizer to transform the features, enhancing their discriminative power. Residual connections are employed at various stages for better gradient propagation during training. Care has been taken to design the training and test sets using the disjoint sampling strategy to minimize overlaps between the training and test sets. This reviewer finds this paper to be within the scope of the Remote Sensing journal and recommends acceptance for publication. Given that the suggestions for improvement are only minor, this reviewer recommends that the authors re-submit with the suggested changes, but a second round of review is not required.

Strengths:

1. Paper is well-written, well-motivated and has appropriate review of the literature for context.

2. The proposed model architectures, training procedures, and experiments are appropriately described.

3. The proposed method has been compared to existing methods.

4. Relevant ablation studies have been discussed.

 

Suggestions for further improvement:

1. On page 6, just as the authors have done for the parameter B, it would be good to give the ranges or orders of magnitude for the parameters M, N, L, S.

Author Response

Please see the attachment.

Additionally, we have placed the modified files behind the comments.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors


Comments for author File: Comments.pdf

Comments on the Quality of English Language

Authors are recommended to carefully check the entire manuscript
