Article
Peer-Review Record

Hyperspectral Image Super-Resolution Based on Spatial Correlation-Regularized Unmixing Convolutional Neural Network

Remote Sens. 2021, 13(20), 4074; https://doi.org/10.3390/rs13204074
by Xiaochen Lu 1, Dezheng Yang 1, Junping Zhang 2 and Fengde Jia 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 8 August 2021 / Revised: 1 October 2021 / Accepted: 7 October 2021 / Published: 12 October 2021
(This article belongs to the Special Issue Deep Learning in Remote Sensing Application)

Round 1

Reviewer 1 Report

This paper proposed a super-resolution approach based on a convolutional neural network to spatially enhance the resolution of hyperspectral images. The main technique used in the proposed method is similar to the given reference, so the novelty of the work should be enhanced. Detailed comments are as follows.

  1. It seems that the spatial spread transform matrix plays an important role in the method. Is there some physical meaning of this matrix, or could the authors give a detailed explanation of the “spatial correlation”?
  2. Why partition the image into small pieces of patches in a pixel-by-pixel manner?
  3. Line 216: what’s the condition of convergence?
  4. For the testing image, it’s more meaningful to use images with larger size since practically we need to process massive data. It’s suggested to make full use of the original images to conduct experiments instead of sub-images.
  5. Line 258: how is the number of endmembers defined? It seems that the defined number is not the number of classes.
  6. Line 290-292: it’s suggested to introduce the definition of the measurements.
  7. Line 93: the space between "in" and "spired" should be removed.

Author Response

This paper proposed a super-resolution approach based on a convolutional neural network to spatially enhance the resolution of hyperspectral images. The main technique used in the proposed method is similar to the given reference, so the novelty of the work should be enhanced. Detailed comments are as follows.

 

We would like to express our appreciation for your valuable comments and suggestions. We have considered all the comments and made serious efforts to address these issues. Our point-by-point responses to your comments are detailed below (in blue).

 

Point 1: It seems that the spatial spread transform matrix plays an important role in the method. Is there some physical meaning of this matrix, or could the authors give a detailed explanation of the “spatial correlation”?

 

Response 1: Thank you very much for your comments. The spatial spread transform matrix (SSTM) indeed plays an important role in our method. In fact, the idea of using the SSTM comes from the following publications:

  1. Yokoya, N., Yairi, T., and Iwasaki, A. Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Trans. Geosci. Remote Sens. 2012, 50, 528-537. doi: 10.1109/TGRS.2011.2161320.
  2. Dian, R., Li, S., Guo, A., and Fang, L. Deep hyperspectral image sharpening. IEEE Trans. Neural Networks Learn. Syst. 2018, 29, 5345-5355. doi: 10.1109/TNNLS.2018.2798162.

As is known, the point spread functions of high- and low-resolution images are highly correlated. Therefore, the point spread transform function, which is also termed the spatial spread transform matrix in this work, can be used to formulate the transformation, or degradation, between a high-resolution image and its lower-resolution counterpart. The physical meaning of the SSTM is that it describes the transformation between each pixel of the LR image and the corresponding pixel, together with its neighbors, in the HR image. To better illustrate this point, we have revised the description of the SSTM in Para. 3 of Section 2.1 as follows,

Suppose S denotes the spatial spread transform matrix (SSTM) [6, 10] that is used to formulate the transformation between each pixel of X and the corresponding pixel, with its neighbors, of Y. For instance, a Gaussian low-pass filter [6] can be used to simulate the degradation of resolution and applied to each pixel with its neighbors of the HR image. Then, each column of the SSTM is the vectorization of the Gaussian filter at the proper location. In other words, S represents the spatial degradation function between the high-resolution image and its counterpart for each pixel. Consequently, we have X = YS, and obviously, the same relation holds for the corresponding abundance maps.

Besides, since the SSTM can also be regarded as the vectorization of a local convolution operation, it is reasonable to define a unified local SSTM for each sample patch. Hence, we define the form of the local SSTM used in our experimental setup in Para. 8 of Sub-section 2.2 as follows,

Note that the SSTM defined in (6) is actually a local transform matrix with p^2 × p^2 elements, which transforms each pixel from the HR image patch to the LR patch, rather than the global transform matrix in (3). In this paper, we employ a Gaussian low-pass filter with (2r-1)×(2r-1) elements to simulate the transformation between HR and LR pixels. Each column of S then corresponds to the vectorized filter at the proper location.

Last, we agree that we should give a detailed explanation of “spatial correlation”. Therefore, we append the following explanation in Para. 6 of the introduction as follows,

Furthermore, since the LR HS image is actually a degraded version of the HR HS image, the spatial correlation between the HR and LR images can be formulated by the spatial spread transform matrix. Therefore, we propose a spatial correlation-regularized CNN to refine the prediction of abundance maps.
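To make this concrete, the sketch below shows one plausible Python/PyTorch form of such a regularizer. The function name, tensor shapes, default balance value, and the exact loss form are illustrative assumptions made for this response, not the definition used in the manuscript:

    import torch

    def spatial_correlation_loss(pred_lr, pred_hr, target_lr, S, h=0.1):
        # pred_lr, pred_hr: (D, N) predicted LR/HR abundance maps
        # target_lr:        (D, N) LR abundances obtained by unmixing
        # S:                (N, N) fixed, non-trainable SSTM
        # h:                balance parameter between the two terms
        # Data term: the predicted LR abundances should match the unmixing result.
        data_term = torch.mean((pred_lr - target_lr) ** 2)
        # Regularizer: degrading the HR abundances with the SSTM should
        # reproduce the LR abundances (the relation X = YS carried over).
        reg_term = torch.mean((pred_hr @ S - target_lr) ** 2)
        return data_term + h * reg_term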

We hope that we have clearly answered your questions. Thank you again.

 

Point 2: Why partition the image into small pieces of patches in a pixel-by-pixel manner?

 

Response 2: Thank you very much. On the one hand, by partitioning the training image into small patches in a pixel-by-pixel manner, we aim to obtain sufficient training samples to effectively train our network, as well as the other compared methods. In addition, the stride is set to 5×5 to avoid high similarity among the training samples.

On the other hand, by partitioning the testing image into patches, we can reconstruct the HR image in a pixel-by-pixel manner. The overlapping parts can then be averaged to suppress random errors and enhance the reconstruction accuracy. A code sketch of this strategy is given below.
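A minimal NumPy sketch of the strategy (the function names and shapes are our illustration, not the manuscript's code): patches are extracted with a chosen stride, and overlapping predictions are averaged during reassembly.

    import numpy as np

    def extract_patches(img, p, stride):
        # Slide a p x p window over an (H, W, C) image with the given stride.
        H, W, _ = img.shape
        return [img[i:i + p, j:j + p, :]
                for i in range(0, H - p + 1, stride)
                for j in range(0, W - p + 1, stride)]

    def reassemble(patches, H, W, C, p):
        # Average overlapping patches back into a full (H, W, C) image.
        # With stride 1 (pixel-by-pixel testing), every pixel is covered.
        out = np.zeros((H, W, C))
        count = np.zeros((H, W, 1))
        k = 0
        for i in range(H - p + 1):
            for j in range(W - p + 1):
                out[i:i + p, j:j + p, :] += patches[k]
                count[i:i + p, j:j + p, :] += 1
                k += 1
        return out / count  # overlapped pixels are averaged

For training, a larger stride keeps the samples from being overly similar; for testing, stride 1 lets the overlapping predictions be averaged.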

In fact, the following references also use this strategy:

  1. Mei, S., Yuan, X., Ji, J., Zhang, Y., Wan, S., and Du, Q. Hyperspectral image spatial super-resolution via 3d full convolutional neural network. Remote Sensing 2017, 9. doi: 10.3390/rs9111139.
  2. Hu, J., Jia, X., Li, Y., He, G., and Zhao, M. Hyperspectral image super-resolution via intrafusion network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7459-7471. doi: 10.1109/TGRS.2020.2982940.
  3. Jiang, J., Sun, H., Liu, X., and Ma, J. Learning spatial-spectral prior for super-resolution of hyperspectral imagery. IEEE Trans. Comput. Imaging 2020, 6, 1082-1096. doi: 10.1109/TCI.2020.2996075.

Thus, we have also cited them at the end of Para. 1 in Section 2.2.

 

Point 3: Line 216: what’s the condition of convergence?

 

Response 3: Thank you. In this work, we did not set an explicit convergence condition for the update of the endmember matrix E. Instead, we set the maximum number of epochs to 200 for CNN training and update E in each epoch. Once the maximum epoch number is reached, the network stops updating, as does the endmember matrix E.

Considering your comments, we have described this point in the last paragraph of Section 2,

“Thus, E will be updated during each epoch of the network training until a maximum epoch number is met.”

 

Point 4: For the testing image, it’s more meaningful to use images with larger size since practically we need to process massive data. It’s suggested to make full use of the original images to conduct experiments instead of sub-images.

 

Response 4: Thank you for your suggestion. It is indeed an interesting suggestion. We have to admit that we had not considered this manner, and we would have conducted these experiments during this revision. However, we were asked to resubmit this manuscript within 10 days and therefore did not have enough time to perform the additional experiments. Nevertheless, we believe the method may work well in this setting to some extent, and we will make this effort in our next work. Thank you again.

 

Point 5: Line 258: how is the number of endmembers defined? It seems that the defined number is not the number of classes.

 

Response 5: Thank you. The number of endmembers is not defined according to the number of classes, since we cannot know the number of classes for each scene a priori in practice. In fact, in our previous manuscript, we set the endmember numbers according to published references, our previous work, and empirical studies:

  1. Yokoya, N., Yairi, T., and Iwasaki, A. Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Trans. Geosci. Remote Sens. 2012, 50, 528-537. doi: 10.1109/TGRS.2011.2161320.
  2. Lu, X., Li, T., Zhang, J., and Jia, F. A novel unmixing-based hypersharpening method via convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2021, 1-14. doi: 10.1109/TGRS.2021.3063105.
  3. Lu, X., Zhang, J., Yang, D., Xu, L., and Jia, F. Cascaded convolutional neural network-based hyperspectral image resolution enhancement via an auxiliary panchromatic image. IEEE Trans. Image Process. 2021, 30, 6815-6828. doi: 10.1109/TIP.2021.3098246.

We acknowledge that this was not rigorous enough. Therefore, during the last reviewing period, we conducted several additional groups of experiments with various endmember numbers. The results have been appended in Figure 12 of the revised manuscript (Figure 11 has also been replotted for conciseness), and the corresponding analyses are appended in the first paragraph of Sub-section 3.3. Although the chosen number of endmembers D is not always optimal, for the sake of fairness and generality we still prefer to set D=40 for all the experiments in Sub-section 3.2, as some other related works also suggest this number, and our results remain acceptable.

 

Point 6: Line 290-292: it’s suggested to introduce the definition of the measurements.

 

Response 6: Thank you for your suggestion. We have introduced the definition of the measurements in (10)-(14) at the end of Sub-section 3.1.

 

Point 7: Line 93: the space between "in" and "spired" should be removed.

 

Response 7: Thank you for your careful reading. We have revised this mistake in this version of our manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

In this paper, the problem of improving the resolution of HS images without using auxiliary image sources is addressed. As a proposal, a CNN-based single-image super-resolution approach combined with a linear spectral mixture model is studied. It is assumed that the differences between high- and low-resolution HS images are due to discrepancies in endmember abundances.


The paper clearly explains the followed methodology and describes the experiments conducted to show how the proposal outperforms other networks already proposed in the literature. There are three points this reviewer would like to comment on:

- In figure 1, the architecture of the network is described. In the very first stage, the last 64 feature maps are split into two groups (marked with red and blue boundaries). It would be very useful, from my point of view, for the potential reader to know the criterion that has been employed to do the splitting. Could you please elaborate more on the process that has been followed?

- It is mentioned that VCA (Vertex Component Analysis) is utilized to extract the endmembers of the scene. One of the VCA parameters is the number of endmembers present in the scene. For the datasets employed during the assessment of the proposal, it is clear that the value of this parameter is known in advance. However, this reviewer wonders what would happen in a general case where a new hyperspectral cube is captured and there is no prior knowledge about the number of endmembers. It is also possible to envision an application in which a sequence of hyperspectral cubes (hyperspectral video) is captured and the scene could therefore change dynamically. Could you please elaborate more on the consequences for the performance of the proposal if no prior knowledge about the number of endmembers is used?

- The PSNR plots show more variability than is currently captured in the corresponding paragraph. This reviewer would suggest adding more details in the text to help the reader better understand the differences observed at various wavelengths.

Author Response

In this paper, the problem of improving the resolution of HS images without using auxiliary image sources is addressed. As a proposal, a CNN-based single-image super-resolution approach combined with a linear spectral mixture model is studied. It is assumed that the differences between high- and low-resolution HS images are due to discrepancies in endmember abundances.

The paper clearly explains the followed methodology and describes the experiments conducted to show how the proposal outperforms other networks already proposed in the literature. There are three points this reviewer would like to comment on:

 

We would like to express our appreciation for your valuable comments and suggestions. We have considered all the comments and made serious efforts to address these issues. Our point-by-point responses to your comments are detailed below (in blue).

 

Point 1: In figure 1, the architecture of the network is described. In the very first stage, the last 64 feature maps are split into two groups (marked with red and blue boundaries). It would be very useful, from my point of view, for the potential reader to know the criterion that has been employed to do the splitting. Could you please elaborate more on the process that has been followed?

 

Response 1: Thank you very much for your comments. We have to admit that we forgot to state the criterion for feature-map splitting. We have revised the description of the splitting procedure and the subsequent processing in the text.

In our approach and experiments, the 64 feature maps are split into two groups of 32 maps each. The first 32 feature maps are fed into the upper branch to predict the LR abundances, whereas the last 32 feature maps are fed into the lower branch to predict the HR abundances. In fact, the number of feature maps, 64, is set according to our empirical knowledge. While preparing this manuscript, we tested several configurations, including 128 feature maps split into two groups of 64, and 32 feature maps split into two groups of 16. In summary, we found that the number of feature maps has little influence on the final results in practice; we select this number simply because it is commonly used in CNN-based image processing tasks. Besides, we suggest splitting the feature maps into two groups of equal size, because in this manner the upper and lower branches have the same time and space complexities. This will allow us to develop a multi-processing program that conducts deeper convolutional operations simultaneously in our future work.

Considering this issue, we have appended the following statements in Para. 2 of Sub-section 2.2,

Each convolutional layer involves Cout filters, resulting in Cout intermediate feature maps. Afterwards, the last Cout feature maps are split into two groups (the maps with red and blue boundaries in Figure 1) and fed into a two-branch prediction stage. In the preparatory work for this paper, we found that the value of Cout actually has limited influence on the super-resolution results. Following existing works, a typical value of 64 is used in the experimental section. The 64 feature maps are then split into two groups, each containing Cout/2 = 32 maps, which are fed into the two branches, respectively. Here, we suggest bisecting the feature maps, since in this manner the upper and lower branches have the same time and space complexities, which simplifies the development of the implementation programs.
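As an illustration only (the layer sizes, kernel size, and class name are our assumptions for this response, not the exact architecture of the manuscript), the bisection can be realized as follows:

    import torch
    import torch.nn as nn

    class TwoBranchHead(nn.Module):
        # The last C_out feature maps are bisected so that both branches
        # have the same time and space complexity.
        def __init__(self, c_out=64, n_endmembers=40):
            super().__init__()
            self.lr_branch = nn.Conv2d(c_out // 2, n_endmembers, 3, padding=1)
            self.hr_branch = nn.Conv2d(c_out // 2, n_endmembers, 3, padding=1)

        def forward(self, features):                         # (B, 64, H, W)
            first, second = torch.chunk(features, 2, dim=1)  # 32 + 32 maps
            lr_abund = self.lr_branch(first)   # upper branch: LR abundances
            hr_abund = self.hr_branch(second)  # lower branch: HR abundances
            return lr_abund, hr_abund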

In addition, we have also defined this number in Para. 4 of the experimental section. We believe this revision indeed helps to further clarify the implementation details of our approach.

 

Point 2: It is mentioned that VCA (Vertex Component Analysis) is utilized to extract the endmembers of the scene. One of the VCA parameters is the number of endmembers present in the scene. For the datasets employed during the assessment of the proposal, it is clear that the value of this parameter is known in advance. However, this reviewer wonders what would happen in a general case where a new hyperspectral cube is captured and there is no prior knowledge about the number of endmembers. It is also possible to envision an application in which a sequence of hyperspectral cubes (hyperspectral video) is captured and the scene could therefore change dynamically. Could you please elaborate more on the consequences for the performance of the proposal if no prior knowledge about the number of endmembers is used?

 

Response 2: Thank you very much. We have carefully considered your comments, and would like to answer these issues as follows.

First, in practice, it is true that we cannot know the number of endmembers a priori for every captured scene. Therefore, in our work, we suggest setting the endmember numbers according to published references, our previous work, and empirical studies:

  1. Yokoya, N., Yairi, T., and Iwasaki, A. Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Trans. Geosci. Remote Sens. 2012, 50, 528-537. doi: 10.1109/TGRS.2011.2161320.
  2. Lu, X., Zhang, J., Yang, D., Xu, L., and Jia, F. Cascaded convolutional neural network-based hyperspectral image resolution enhancement via an auxiliary panchromatic image. IEEE Trans. Image Process. 2021, 30, 6815-6828. doi: 10.1109/TIP.2021.3098246.

Second, according to existing references, a possible approach to estimating the number of endmembers is virtual dimensionality. In fact, the automatic calculation of endmember numbers is one of our planned future works; we were unable to conduct further research on it due to the limited revision time, but we appreciate your reminder. A rough illustration is sketched below.
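The snippet below is only a crude illustrative stand-in for virtual dimensionality: it counts the principal components needed to retain a chosen fraction of the signal energy. The energy threshold and the function name are our assumptions, not an algorithm from the manuscript or its references.

    import numpy as np

    def crude_endmember_count(cube, energy=0.999):
        # cube: (H, W, B) hyperspectral image; returns a rough estimate of D.
        H, W, B = cube.shape
        X = cube.reshape(-1, B)
        X = X - X.mean(axis=0)
        # Eigenvalues of the band covariance matrix, largest first.
        eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
        cum = np.cumsum(eigvals) / eigvals.sum()
        # Smallest number of components whose cumulative energy >= threshold.
        return int(np.searchsorted(cum, energy) + 1)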

Third, your prospect of hyperspectral video is indeed interesting. In hyperspectral image sequences, due to the diversity and time-varying characteristics of ground objects, we can hardly know the endmembers for all of the frames. However, if we have enough training scenes or frames, it may be possible to build a prior endmember library, although the number of endmembers (i.e., the number of output feature maps) may be huge; the correlations between endmembers could then be used to reduce this number. If possible, in our future work, we would also like to devote ourselves to the super-resolution of hyperspectral videos.

Last, we have to admit that additional experiments should be performed, at least to discuss the influence of the endmember number. In fact, during the last reviewing period, we conducted several additional groups of experiments with various endmember numbers. The results are shown in Figure 11 of the revised manuscript, and the corresponding analyses are appended in the first paragraph of Sub-section 3.3. Although this is not rigorous enough to fully explain the consequences of unknown endmember numbers, we believe it provides some guidance for determining endmember numbers in general cases. In the conclusion, we have also introduced the prospect of automatically selecting the number of endmembers D.

We hope we have properly addressed your concerns. Thank you again.

 

Point 3: The PSNR plots show more variability than is currently captured in the corresponding paragraph. This reviewer would suggest adding more details in the text to help the reader better understand the differences observed at various wavelengths.

 

Response 3: Thank you very much. This is an important suggestion. Actually, as can be seen in Figure 9, although the proposed method achieves the highest PSNRs on the whole, the overall PSNR values differ considerably across spectral bands; in other words, for some bands the PSNR is much higher than for others. This is mainly because some bands are spatially smoother than others. To better illustrate this point, we have plotted the average gradient of each band for the three reference images in Figure 10.


Figure 10. Average gradient of each band for the reference images. (From left to right: University of Pavia, University of Houston, and San Diego air station.)

The average gradient (AG) reflects the differences between neighboring pixels. We can see that for some bands the AGs are much higher than for others, which means the neighboring pixels differ considerably from each other. In these bands, the spatial structure information is complicated, and reconstruction is relatively difficult. As a simple example, for the bicubic interpolation method, a large difference between neighboring pixels hinders the prediction of the center pixel, and the PSNR is accordingly lower. The corresponding relationship between the average gradient and the PSNR of each band can easily be observed from Figures 9 and 10: the higher the gradient, the lower the PSNR. The AG computation is sketched below.
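For reference, one common definition of the average gradient (the manuscript's exact formula may differ in detail) can be computed per band as follows:

    import numpy as np

    def average_gradient(band):
        # Mean magnitude of the local intensity differences of one band.
        gx = np.diff(band, axis=0)[:, :-1]  # vertical differences
        gy = np.diff(band, axis=1)[:-1, :]  # horizontal differences
        return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2))

    def per_band_ag(cube):
        # Average gradient of every band of an (H, W, B) cube.
        return np.array([average_gradient(cube[:, :, b])
                         for b in range(cube.shape[2])])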

Based on this fact, we have revised the penultimate paragraph of Sub-section 3.2 and appended the corresponding discussions. The revisions have been highlighted in the manuscript.

 

Author Response File: Author Response.pdf

Reviewer 3 Report

This article is a well-written, concise and interesting work. It involves the use of a CNN-based framework to increase the spatial resolution of single hyperspectral image acquisitions. The novelty lies in the spatial correlation-regularized loss function definition, and the use of the linear spectral mixture model that does not make use of auxiliary sources. The paper in my opinion needs only a moderate revision. Please find my remarks of minor and moderate importance mixed below:

 

  • Line 28: I guess you refer to remote sensing / Earth Observation data because the rest of the sentence is not true for hyperspectral cameras operated in situ, e.g. in crop fields. Please specify if so.

  • Line 93: In spired => Inspired

  • Line 151: n is undefined

  • Line 155 to 156: “at lower resolution” but still the same dimensionality as in line 149?

  • Line 197: Have you tried any other padding besides mirror? Why mirror? Will that not be an issue in cases where edges are found within the patch and two different classes exist within that patch?

  • Line 221 missing full stop

  • How is the “Number of parameters (M)” of Table 1 calculated? Please define it in the text.

  • Please either define the metrics of lines 290 to 294 or provide references for them.

  • Line 310 “of proposed” => “of the proposed”

  • Rephrase line 323; it does not make sense to me.

  • Use a common legend for all three subfigures of figures 9 and 10 and put it outside the figures so that they don’t obscure the results

  • Line 382 “peculiar parameter”?

  • For Section 3.3, are ERGAS and UIQI presented in Figures 11 and 12 calculated in the training or the testing set?

  • A bar chart of Table 4 for a selected metric (e.g. ERGAS?) could go a long way here to easily compare the different methodologies.

  • The caption of Table 4 can be improved.

  • Line 433: “, thus,” => “. It thus”

  • A rather important aspect that I didn’t understand and hope you can clarify in the text is how the matrix S (SSTM) is calculated. Is it trainable? Is it simply a matrix that performs a mean operation to subsample from the HR to the LR abundance maps? My confusion also stems from the fact that S is defined in line 159 as NxN while it should operate on a patch basis (line 210, p^2 x p^2). I think part of what you describe in lines 266 to 268 should move to the previous section.

  • I am missing some more in-depth discussion of why the bicubic method has the lowest SAMs in the San Diego dataset.

  • The discussion should be more critical of the results and highlight any potential shortcomings of the methodology.

Author Response

This article is a well-written, concise and interesting work. It involves the use of a CNN-based framework to increase the spatial resolution of single hyperspectral image acquisitions. The novelty lies in the spatial correlation-regularized loss function definition, and the use of the linear spectral mixture model that does not make use of auxiliary sources. The paper in my opinion needs only a moderate revision. Please find my remarks of minor and moderate importance mixed below:

 

We would like to express our appreciation for your valuable comments and suggestions. We have considered all the comments and made serious efforts to address these issues. Our point-by-point responses to your comments are detailed below (in blue).

 

Point 1: Line 28: I guess you refer to remote sensing / Earth Observation data because the rest of the sentence is not true for hyperspectral cameras operated in situ, e.g. in crop fields. Please specify if so.

 

Response 1: Thank you for your comments. We agree that only HS remote sensing data suffer from such deficiencies. Therefore, we have revised the sentence as,

“Hyperspectral (HS) remote sensing images always suffer from the deficiencies of low spatial resolution and mixed pixels, which seriously deteriorate the performance of target detection and recognition in Earth observation applications.”

 

Point 2: Line 93: In spired => Inspired

 

Response 2: Thanks. We have revised this mistake in this version of our manuscript.

 

Point 3: Line 151: n is undefined

 

Response 3: Thank you, and we apologize for our carelessness. We have given the definition of n in line 4 of Para. 2 in Sub-section 2.1.

 

Point 4: Line 155 to 156: “at lower resolution” but still the same dimensionality as in line 149?

 

Response 4: Yes, because we have up-scaled the LR HS image to the size of the HR image via the bicubic interpolation method. Therefore, the LR HS image X has the same number of pixels, N, as Y. We clarified this point in Para. 2 of Sub-section 2.1, above Eq. (2), in the previous version of our manuscript. A sketch of this up-scaling is given below.
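For illustration, the up-scaling can be done as follows; SciPy's cubic spline zoom is used here as a stand-in for the bicubic interpolation described in the manuscript, and the function name is our own.

    from scipy.ndimage import zoom

    def upscale(lr_cube, r):
        # (h, w, B) LR cube -> (r*h, r*w, B), so X has the same N pixels as Y.
        return zoom(lr_cube, (r, r, 1), order=3)  # order=3: cubic spline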

 

Point 5: Line 197: Have you tried any other padding besides mirror? Why mirror? Will that not be an issue in cases where edges are found within the patch and two different classes exist within that patch?

 

Response 5: Thank you very much. In fact, in our early work, we conducted several groups of extra experiments to explore the influence of several secondary parameters, including the padding mode, the batch size, and the number of intermediate feature maps. According to those experimental results, the zero-padding and mirror-padding modes make little difference in practice, so we did not conduct further research or record the results; we simply selected the same strategy as in our previous work. The two modes are contrasted in the snippet below.
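The difference between the two padding modes can be seen on a toy patch; NumPy's "reflect" mode stands in for mirror padding here, and the variable names are illustrative.

    import numpy as np

    patch = np.arange(9.0).reshape(3, 3)
    mirrored = np.pad(patch, 1, mode="reflect")  # continues the local structure
    zeroed = np.pad(patch, 1, mode="constant")   # injects an artificial dark border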

We agree that there may be an issue when edges are found within a patch and two different classes exist within it, and this will probably affect the extraction of object features. Fortunately, in our method, it has limited influence on the super-resolution performance. A possible reason is that the input patches and the output target patches of our network are selected from the same locations, and the patch sizes are large enough to mitigate the boundary effect.

Nevertheless, your question is indeed worth further research. Due to the limited revision time, we will make serious efforts on it in our next work. Thank you again.

 

Point 6: Line 221 missing full stop

 

Response 6: Thank you. We have added the full stop. We are sorry for this fault.

 

Point 7: How is the “Number of parameters (M)” of Table 1 calculated? Please define it in the text.

 

Response 7: Thank you for your suggestion. The numbers of parameters for the CNN-based approaches are calculated by counting the weights and biases of each convolutional layer. For a convolutional layer with Cout filters of size K1 × ⋯ × Ki (i = 2 or 3 for 2-D and 3-D convolutional layers, respectively), the number of parameters is Cout × (Cin × K1 × ⋯ × Ki + 1), where Cin is the number of input channels. The networks used in this manuscript contain no fully connected layers, and the other layers (e.g., batch normalization layers) contain far fewer parameters, which can be ignored. Thus, the parameter count of the entire network is the sum over all convolutional layers.

We have given the main calculation method in the text according to your suggestion.
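A small helper makes the calculation concrete (the function is our illustration, not code from the manuscript):

    def conv_params(c_in, c_out, *kernel_dims):
        # Weights + biases of one 2-D or 3-D convolutional layer.
        k = 1
        for d in kernel_dims:
            k *= d
        return c_out * (c_in * k + 1)

    # e.g., a 2-D layer with 64 input/output channels and 3x3 filters:
    # conv_params(64, 64, 3, 3) -> 36928. Summing over all convolutional
    # layers and dividing by 1e6 gives the "Number of parameters (M)".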

 

Point 8: Please either define the metrics of lines 290 to 294 or provide references for them.

 

Response 8: It is an important suggestion. In this revised manuscript, we have given the definitions of these measurements in Eqs. (10)-(14) and the last paragraph of Sub-section 3.1.
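For readers without the revised equations at hand, standard definitions of two of the measurements are sketched below; they may differ in minor conventions from Eqs. (10)-(14).

    import numpy as np

    def psnr(ref, est, peak=1.0):
        # Peak signal-to-noise ratio of one band, data scaled to [0, peak].
        mse = np.mean((ref - est) ** 2)
        return 10 * np.log10(peak ** 2 / mse)

    def sam(ref, est, eps=1e-12):
        # Mean spectral angle (degrees) between pixel spectra of (H, W, B) cubes.
        r = ref.reshape(-1, ref.shape[-1])
        e = est.reshape(-1, est.shape[-1])
        cos = np.sum(r * e, axis=1) / (
            np.linalg.norm(r, axis=1) * np.linalg.norm(e, axis=1) + eps)
        return np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))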

 

Point 9: Line 310 “of proposed” => “of the proposed”

 

Response 9: Thank you. We have revised it in this version.

 

Point 10: Rephrase line 323; it does not make sense to me.

 

Response 10: Thank you for your suggestion. We have revised the statements as,

Therefore, for different resolution ratios, e.g., r=2 and r=4, the training sets always contain the same number of samples, and the running times are therefore roughly comparable.

 

Point 11: Use a common legend for all three subfigures of figures 9 and 10 and put it outside the figures so that they don’t obscure the results

 

Response 11: Thank you very much. It is an important suggestion. In this revised version, we have merged Figure 9 and Figure 10, and replotted each sub-figure according to your suggestion.

 

Point 12: Line 382 “peculiar parameter”?

 

Response 12: Thanks. We have revised it as “unique parameters” to avoid potential misunderstanding.

 

Point 13: For Section 3.3, are ERGAS and UIQI presented in Figures 11 and 12 calculated in the training or the testing set?

 

Response 13: Thanks. The ERGAS and UIQI values in Figures 11 and 12 are all calculated on the testing data. To clarify this point, we have revised the fourth sentence of Para. 1 of Sub-section 3.3 as follows,

To this end, Figures 11 and 12 exhibit the ERGAS and UIQI values of the super-resolved and reference images on the testing data, under resolution ratio r=2, regarding different h and D, respectively, …

 

Point 14: A bar chart of Table 4 for a selected metric (e.g. ERGAS?) could go a long way here to easily compare the different methodologies.

 

Response 14: Thank you; it is an interesting suggestion. In the revised manuscript, we have plotted several bar charts in Figure 13 in place of Table 4, showing the ERGAS and UIQI values simultaneously. This indeed helps to compare the different methodologies more clearly, and the text has been revised accordingly.

 

Point 15: The caption of Table 4 can be improved.

 

Response 15: Thank you very much. We have substituted Figure 13 for Table 4 in this revised manuscript according to your suggestion, and the caption of Figure 13 is revised as “Figure 13. Comparison of ERGAS and UIQI values with respect to different structures of the proposed methods. (The upper line is under r=2, and the lower line is under r=4. From left to right: University of Pavia, University of Houston, and San Diego air station.)”.

 

Point 16: Line 433: “, thus,” => “. It thus”

 

Response 16: Thank you. We have revised this mistake.

 

Point 17: A rather important aspect that I didn’t understand and hope you can clarify in the text is how the matrix S (SSTM) is calculated. Is it trainable? Is it simply a matrix that performs a mean operation to subsample from the HR to the LR abundance maps? My confusion also stems from the fact that S is defined in line 159 as NxN while it should operate on a patch basis (line 210, p^2 x p^2). I think part of what you describe in lines 266 to 268 should move to the previous section.

 

Response 17: Thank you very much for pointing out these issues. We would like to explain these issues as follows.

First, the SSTM S is calculated by vectorizing the Gaussian low-pass filter; it is not trainable. It acts like a weighted-average operation that transforms the HR image (abundance maps) into the LR one. Take a Gaussian low-pass filter as an example: vectorizing the filter, centered at each pixel location in turn, and putting the result into the corresponding column of the SSTM yields S. Then, for a 5×5 image or patch, namely N=25 or p=5, each pixel of the output image or patch can be calculated by multiplying the vectorized input with the corresponding column of S. Obviously, this is equal to multiplying the vectorized image by S as a whole, and for a multi-channel image, each output channel can be calculated in the same way.
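The construction can be sketched as follows. The kernel size, sigma, and boundary handling are illustrative assumptions (for a resolution ratio r=2, the manuscript's (2r-1)×(2r-1) filter is 3×3), and the row/column conventions may differ from the paper's.

    import numpy as np
    from scipy.ndimage import convolve

    def gaussian_kernel(size, sigma=1.0):
        # (size x size) Gaussian low-pass filter, normalized to sum to 1.
        ax = np.arange(size) - size // 2
        xx, yy = np.meshgrid(ax, ax)
        k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
        return k / k.sum()

    def local_sstm(p, kernel):
        # Build the (p^2 x p^2) local SSTM: column j is the vectorized
        # response of the filter to a unit impulse at pixel j of a p x p
        # patch (mirror boundary), i.e., the filter "at the proper location".
        S = np.zeros((p * p, p * p))
        for j in range(p * p):
            delta = np.zeros((p, p))
            delta.flat[j] = 1.0
            S[:, j] = convolve(delta, kernel, mode="mirror").ravel()
        return S

    # Degrading a patch with the SSTM is identical to low-pass filtering it.
    p = 5
    k = gaussian_kernel(3)
    patch = np.random.rand(p, p)
    S = local_sstm(p, k)
    assert np.allclose((S @ patch.ravel()).reshape(p, p),
                       convolve(patch, k, mode="mirror"))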

Second, in line 159 of the previous manuscript, we define the global SSTM for the entire image; thus, S has N×N elements. However, as mentioned in the text, in order to train the network, the entire image is partitioned into small patches. Considering the local nature of low-pass filtering, it is reasonable to define a local SSTM that performs the transformation from an HR patch to an LR patch. Therefore, in our implementation code, we only need to calculate the local SSTM, which has p^2×p^2 elements.

Last, we have to admit that the previous version of this manuscript did not clearly describe the definition of the SSTM. In this revised version, we have therefore made the following revisions according to your suggestions.

In Para. 3 of Sub-section 2.1, we use more words to clearly present the calculation method and physical meaning of SSTM,

Suppose S denotes the spatial spread transform matrix (SSTM) [6, 10] that is used to formulate the transformation between each pixel of X and the corresponding pixel, with its neighbors, of Y. For instance, a Gaussian low-pass filter [6] can be used to simulate the degradation of resolution and applied to each pixel with its neighbors of the HR image. Then, each column of the SSTM is the vectorization of the Gaussian filter at the proper location. In other words, S represents the spatial degradation function between the high-resolution image and its counterpart for each pixel.

Then we move the description of experimental setup, i.e., lines 266 to 268, to Para. 8 of Sub-section 2.2, and modify the statements of the aforementioned local SSTM,

Note that the SSTM defined in (6) is actually a local transform matrix with p^2 × p^2 elements, which transforms each pixel from the HR image patch to the LR patch, rather than the global transform matrix in (3). In this paper, we employ a Gaussian low-pass filter with (2r-1)×(2r-1) elements to simulate the transformation between HR and LR pixels. Each column of S then corresponds to the vectorized filter at the proper location.

We hope these revisions have properly addressed your questions and will no longer confuse potential readers.

 

Point 18: I am missing some more in-depth discussion of why the bicubic method has the lowest SAMs in the San Diego dataset.

 

Response 18: Thank you for your careful reading. It is true that in our experiments, the bicubic method achieves lower SAMs on the San Diego data set than the CNN-based super-resolution methods. The main reason is that the San Diego scene contains several large flat areas. In such cases, the bicubic method can simply predict the values from a few neighboring pixels, while the CNN-based methods lose their advantage in practice. Combining this with the results in Figure 8, it can be seen that the CNN-based methods have a clear advantage in preserving edges and small objects, since the bicubic method has larger errors in edge areas. Considering your comments, we have appended a few words in the first paragraph of Sub-section 3.2 to explain this issue,

“The Bicubic method sometimes has the lowest SAMs (e.g., on the San Diego data set). This probably occurs in scenes with large flat areas, where the CNN-based super-resolution approaches lose their advantage in exploiting spatial structure features. Nevertheless, the following figures also show that, compared with the Bicubic method, the CNN methods always have a certain advantage in preserving edges and small objects.”

In addition, an effective way to improve the performance of the CNN is to increase the number of training samples generated in such flat areas. In our future work, we will devote more time to overcoming this problem and continue to improve the performance of our method.

 

Point 19: The discussion should be more critical of the results and highlight any potential shortcomings of the methodology.

 

Response 19: Thank you very much for your kind suggestion. Following it, we have made major revisions to the experimental section and the conclusion. Aside from the replotted figures and the revised discussion of results in Sub-sections 3.2 and 3.3, we have also carefully revised the first paragraph of Sub-section 3.3, in which we point out one of the shortcomings of our method,

“Nevertheless, we should also point out that the selection of the two parameters h and D in this work actually depends on our empirical knowledge, supported by massive experiments and tests. We have not yet figured out adaptive rules to automatically determine these two parameters, and this will be our following work.”

In the last paragraph of Sub-section 3.3, we also point out the importance of sample quality and quantity for image super-resolution tasks,

“Besides, it is also worth noting that the quantity and diversity of samples are quite critical to CNN-based super-resolution approaches. In our future work, we will also devote ourselves to promoting the quality and quantity of samples so that we can train more robust and effective networks to handle various scenes.”

In the conclusion, we use more words to explain the significance of the experiments, including the ablation experiments and the discussion of parameters. Finally, we summarize the shortcomings of this work and specify the future work,

“The ablation experiments also demonstrate that the proposed network architecture and spatial correlation regularization can further improve the spectral quality of the super-resolved images. Meanwhile, in this paper, we have also analyzed the influence of two important parameters, namely the balance parameter h and the number of endmembers D. However, it should be pointed out that, in this work, the selection of parameters basically depends on our empirical knowledge and massive experiments. Therefore, our next work will focus on enhancing the adaptability of our approach by automatically determining these parameters. We will also make efforts to improve the quality and quantity of the training samples to handle various scenes, and to expedite the network training procedure.”

Your comments have indeed helped to improve the readability and quality of our manuscript. We sincerely appreciate your valuable comments and suggestions. Thank you again.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Most of the comments have been addressed; some concerns remain, as follows:

  1. It is suggested to enhance the introduction background by including more recently published relevant references: super-resolution-based change detection network with stacked attention module for images with different resolutions, TGRS, 2021; hyperspectral image denoising using a 3d attention denoising network, TGRS, 2020; super-resolution for hyperspectral remote sensing images based on the 3d attention-SRGAN network, RS, 2020.
  2. What is the result if the model trained on sub-images is used to test an image of the original size?
  3. There are some methods to estimate the number of endmembers automatically. It would be better to find a way to make the setting of this number reproducible.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 3

Reviewer 1 Report

My comments have been addressed.
