Article
Peer-Review Record

HyperKon: A Self-Supervised Contrastive Network for Hyperspectral Image Analysis

Remote Sens. 2024, 16(18), 3399; https://doi.org/10.3390/rs16183399
by Daniel La’ah Ayuba 1,*, Jean-Yves Guillemaut 1, Belen Marti-Cardona 2 and Oscar Mendez 1
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 26 July 2024 / Revised: 4 September 2024 / Accepted: 10 September 2024 / Published: 12 September 2024
(This article belongs to the Special Issue Advances in Hyperspectral Remote Sensing Image Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Be careful in the use of the term "images": each ENMAP scene comprises 224 images (channels, each at a wavelength, producing an image). I would consider using "scene" to describe the 224x1300x1200 data that you are testing against.

 

2.1: The biggest concern is the conflation of radiance, orthorectified, and atmospherically compensated (reflectance) data. Conflating these is not ideal given the physics and geometrical relationships. What is the breakdown of the 800 different scenes: number of radiance scenes, number of orthorectified scenes, and number of reflectance scenes?

 

Puzzled by the 160 x 160 patch dimensions being optimal as 1300 / 160 = 8.125 and 1200 / 160 = 7.5. Does the 160 x 160 patch have a buffer?

 

Table 1: for ENMAP size 1276x1248 doesn’t match text in 2.1. 

 

Even with 5% overlap the calculation doesn’t work with 5% in either the x-direction or y-direction. This is 8 pixels of the 160. Lines 97-103. See previous comment about the 160x160 patch being optimal.

 

2.4: Concerned that the spectral discrepancy (difference) in Equation 7 is very similar in appearance to the Euclidean distance equation. This is concerning because Euclidean distance as a measure of separability decreases with increasing dimensions. Differencing between images at the same wavelengths has been used in multispectral instruments.

 

Iref vs. Ihat is a challenge because radiance, reflectance/emittance, and orthorectified radiance involve different phenomenology and geometries. See previous comment about radiance, orthorectified, and reflectance data.

 

TABLE 2. Use HyperKon instead of Ours, if that is the intent. It will clarify the table.

 

3.1: Curious what effect discontinuous band reduction (i.e., removal of bad bands, channels with low SNR, or selected wavelengths) has on the overall performance, since the 3D convolution requires contiguous channels.

 

Figure 7: Add a value legend/colormap. The overall value is there, but what is the significance of the yellow vs. darker blue regions in a quantitative sense?

 

Figure 8. Show a mask map. This would clarify between sub figures, a, b, and c. Break up the % of correct classification as the ground truth mask has the number of pixels for each class. This would help further understand the challenges.

Author Response

We thank the reviewers and the editor for their insightful feedback. We have made substantial revisions to the document. For convenience, we have highlighted the changes and provided a detailed breakdown of how each comment was addressed below.

Comment 1: Be careful in the use of the term "images": each ENMAP scene comprises 224 images (channels, each at a wavelength, producing an image). I would consider using "scene" to describe the 224x1300x1200 data that you are testing against.

Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have revised the text to consistently use the term “scene” to describe the 224x1300x1200 data.

Comment 2: 2.1: The biggest concern is the conflation of radiance, orthorectified, and atmospherically compensated (reflectance) data. Conflating these is not ideal given the physics and geometrical relationships. What is the breakdown of the 800 different scenes: number of radiance scenes, number of orthorectified scenes, and number of reflectance scenes?

Response 2: Thank you for pointing this out. We appreciate your feedback on distinguishing the data types. Our dataset comprises 200 Level 1B (radiometrically corrected radiance), 200 Level 1C (orthorectified), and 400 Level 2A (atmospherically corrected reflectance) scenes. In subsection 2.1, we have added details on utilizing these processing levels as data augmentation (i.e., radiometric and geometric distortions relative to the orthorectified bottom-of-atmosphere reflectance data) for self-supervised contrastive learning. For each anchor patch from one level, we generate positives from corresponding patches in other levels, while negatives come from unrelated patches in the same batch.
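
For concreteness, a minimal sketch of this sampling scheme (the dataset structure, names, and tensor shapes below are illustrative assumptions, not our exact implementation):

    # Illustrative sketch: anchor/positive pairs drawn from co-located patches at
    # different EnMAP processing levels (L1B, L1C, L2A); negatives are simply the
    # other patches in the batch, handled by the contrastive loss itself.
    import torch
    from torch.utils.data import Dataset

    class MultiLevelPatchDataset(Dataset):
        def __init__(self, patches):
            # patches: dict such as {"L1B": t1b, "L1C": t1c, "L2A": t2a}, where each
            # tensor holds spatially aligned patches of shape (N, 224, 160, 160)
            self.patches = patches
            self.levels = list(patches.keys())
            self.n = patches[self.levels[0]].shape[0]

        def __len__(self):
            return self.n

        def __getitem__(self, idx):
            # Anchor from one randomly chosen level, positive from a different level
            # at the same spatial location.
            a, p = torch.randperm(len(self.levels))[:2].tolist()
            anchor = self.patches[self.levels[a]][idx]
            positive = self.patches[self.levels[p]][idx]
            return anchor, positive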

Comment 3: Puzzled by the 160 x 160 patch dimensions being optimal as 1300 / 160 = 8.125 and 1200 / 160 = 7.5. Does the 160 x 160 patch have a buffer?

Response 3: Thank you for this observation and question. We also apologise for using the wrong term "optimal". The 160x160 patch size was chosen to balance computational efficiency with adequate spatial context capture, while maintaining consistency with existing architectures for our self-supervised contrastive network backbone. Our patch extraction process, now clarified in subsection 2.1, employs a sliding-window approach with a 5% overlap buffer. Edge patches are zero-padded when necessary. The overlap percentage is an adjustable hyperparameter.
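
As an illustration, a minimal sketch of such a sliding-window extraction (the function name, parameters, and padding strategy below are illustrative, not our exact code):

    # Illustrative sketch: sliding-window extraction of 160x160 patches with a
    # fractional overlap; edge patches are zero-padded to the full patch size.
    import numpy as np

    def extract_patches(scene, patch=160, overlap=0.05):
        # scene: (bands, H, W) array; returns a list of (bands, patch, patch) patches
        bands, H, W = scene.shape
        stride = int(round(patch * (1 - overlap)))    # e.g. round(160 * 0.95) = 152
        patches = []
        for top in range(0, H, stride):
            for left in range(0, W, stride):
                tile = scene[:, top:top + patch, left:left + patch]
                if tile.shape[1:] != (patch, patch):  # zero-pad tiles at the right/bottom edges
                    padded = np.zeros((bands, patch, patch), dtype=scene.dtype)
                    padded[:, :tile.shape[1], :tile.shape[2]] = tile
                    tile = padded
                patches.append(tile)
        return patches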

Comment 4: Table 1: for ENMAP size 1276x1248 doesn’t match text in 2.1.

Response 4: Thank you for pointing this out. We have corrected Table 1 to match the text in Section 2.1, ensuring consistency throughout the paper.

Comment 5: 2.4: Concerned that the spectral discrepancy (difference) in Equation 7 is very similar in appearance to the Euclidean distance equation. This is concerning because Euclidean distance as a measure of separability decreases with increasing dimensions. Differencing between images at the same wavelengths has been used in multispectral instruments.

Response 5: Thank you for pointing this out. We appreciate this observation, which highlighted the need for clarity in our explanation. In subsection 2.4, first paragraph, we have revised our description of the Hyperspectral Perceptual Loss (HSPL). We now clarify that HSPL operates in the 128-dimensional feature embedding space, rather than the original 224-band spectral space. This feature embedding (FE) is a learned representation of the input patch that provides a more compact and informative space for loss calculation.
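
For illustration, a minimal sketch of a perceptual-style loss computed in a learned embedding space (the encoder interface and distance choice below are assumptions, not our exact HSPL formulation):

    # Illustrative sketch: a perceptual-style loss computed in a learned 128-D
    # embedding space rather than on the raw 224-band spectra. `encoder` stands
    # for a frozen, pretrained feature extractor returning a (B, 128) embedding.
    import torch
    import torch.nn.functional as F

    def embedding_perceptual_loss(encoder, pred, ref):
        # pred, ref: (B, 224, H, W) tensors; returns a scalar loss
        with torch.no_grad():
            z_ref = encoder(ref)          # reference embedding, no gradients needed
        z_pred = encoder(pred)            # embedding of the predicted image
        return F.mse_loss(z_pred, z_ref)  # distance measured in embedding space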

Comment 6: TABLE 2. Use HyperKon instead of Ours, if that is the intent. It will clarify the table.

Response 6: We thank you for this suggestion. We have updated Table 2 to replace "Ours" with "HyperKon".

Comment 7: 3.1: Curious what effect discontinuous band reduction (i.e., removal of bad bands, channels with low SNR, or selected wavelengths) has on the overall performance, since the 3D convolution requires contiguous channels.

Response 7: Thank you for this insightful question regarding band reduction and its impact on 3D convolutions. We have updated subsection 3.1.1 to clarify that band reduction was not always continuous in our experiments. Our results show that 3D convolutions can effectively handle non-contiguous spectral bands, maintaining robust performance even with discontinuous band selection. This finding challenges the assumption that 3D convolutions require strictly contiguous channels and demonstrates their flexibility in processing hyperspectral data under various band selection scenarios.
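
To illustrate the mechanics, a minimal sketch of feeding a non-contiguous band selection to a 3D convolution (the band indices and layer sizes below are illustrative only): the selected bands are simply restacked along the spectral axis before convolution.

    # Illustrative sketch: selecting a non-contiguous subset of bands and restacking
    # them into a reduced cube before applying a 3D convolution.
    import torch
    import torch.nn as nn

    scene = torch.randn(1, 1, 224, 32, 32)       # (batch, channel, bands, H, W); small spatial size for speed
    keep = [b for b in range(224) if b not in range(100, 114)]   # drop an arbitrary block of 14 bands
    reduced = scene[:, :, keep, :, :]            # non-contiguous selection, restacked to (1, 1, 210, 32, 32)

    conv3d = nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1))
    features = conv3d(reduced)
    print(features.shape)                        # torch.Size([1, 8, 210, 32, 32])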

Comment 8: Figure 7: Add a value legend/colormap. The overall value is there, but what is the significance of the yellow vs. darker blue regions in a quantitative sense?

Response 8: Thank you for this observation. We have updated Figure 7 to include a clearer value legend for the Mean Absolute Error (MAE) heatmaps. The colormap ranges from purple (low error) to yellow (high error), with the scale shown on the right side of the figure. Specifically, darker blue/purple areas indicate lower MAE values, representing regions where the model's prediction closely matches the ground truth, and yellow areas indicate higher MAE values, representing regions with larger discrepancies between the prediction and ground truth.
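
A minimal sketch of how such a heatmap and legend can be produced (the array shapes and colormap choice below are illustrative, not our plotting code):

    # Illustrative sketch: per-pixel MAE heatmap with an explicit colorbar so the
    # purple (low error) to yellow (high error) range can be read quantitatively.
    import numpy as np
    import matplotlib.pyplot as plt

    pred = np.random.rand(160, 160, 224)    # placeholder prediction (H, W, bands)
    gt = np.random.rand(160, 160, 224)      # placeholder ground truth
    mae = np.abs(pred - gt).mean(axis=-1)   # per-pixel MAE averaged over bands

    plt.imshow(mae, cmap="viridis")
    plt.colorbar(label="Mean Absolute Error")
    plt.title("Per-pixel MAE")
    plt.savefig("mae_heatmap.png", dpi=200)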

Comment 9: Figure 8. Show a mask map. This would clarify between sub figures, a, b, and c. Break up the % of correct classification as the ground truth mask has the number of pixels for each class. This would help further understand the challenges.

Response 9: We appreciate your valuable suggestions regarding Figure 8. We have implemented the following changes to address your comments:

  1. We have updated Figure 8 to include a mask map, which now shows:
    1. Predicted classification map generated by HyperKon
    2. Predicted classification map with masked regions (showing only labelled areas)
    3. Predicted Accuracy map: Green for correct predictions, Red for incorrect predictions, and Black for unlabelled areas
    4. Ground truth classification map
    5. RGB display of the scene
  2. To address your suggestion about breaking down the percentage of correct classification, we have added Table 5. This new table presents class-wise accuracies for each dataset (Indian Pines, Pavia University, and Salinas Scenes), along with additional performance metrics such as training time, test time, average inference time, and throughput.

 

Reviewer 2 Report

Comments and Suggestions for Authors

1. What does the H and W in formula (1) represent?

2. What is the X in formula (2)? Does it refer to the reflectance value of the pixel?

3. Figure 1 should appear after line 106 of Section 2.2.

4. What is the physical difference between Xα and X+ in Figure 1?

5. The model uses a training dataset with an image size of 160 * 160 pixels. Given that the spatial resolution of the hyperspectral data used is 30 m, the spatial extent of each training sample is 4800 m * 4800 m. Such a large area will contain multiple types of ground features, and there will be many mixed spectra affecting ground classification. It is suggested that the author take a single pixel as the training sample to extract different ground types from the subtle differences in spectral reflectance, so as to make full use of the advantages of hyperspectral data.

6. The graph or table referenced in the text should be placed after the location of the first reference in the text. For example, Figure 1 should be after line 106, and Table 4 should be after line 271.

7. The classification results of (b) and (c) in Figure 8 are very similar, and some objects can be very accurately identified, even better than the author's method, but why are other objects unrecognizable at all? Please give a reasonable explanation.

 

 

Author Response

We thank the reviewers and the editor for their insightful feedback. We have made substantial revisions to the document. For convenience, we have highlighted the changes and provided a detailed breakdown of how each comment was addressed below.

Comment 1: What does the H and W in formula (1) represent? & What is the X in formula (2)? Does it refer to the reflectance value of the pixel?

Response 1: Thank you for this observation. We have added the definitions: H and W represent the Height and Width of the input feature map, respectively. X in formula (2) represents the input feature map, with $x_{ijc}$ being the value at the spatial position (i-th row, j-th column) and the c-th channel.

Comment 2: Figure 1 should appear after line 106 of Section 2.2.

Response 2: Thank you for this observation. We have repositioned Figure 1 to appear immediately after its first reference in the text, which occurs after line 106 in Section 2.2. We have also reviewed and adjusted the placement of all other figures and tables to ensure they follow their first mention in the text.

Comment 3: What is the physical difference between Xα and X+ in Figure 1?

Response 3: We appreciate your insightful question. In our contrastive learning framework, Xα represents the current sample under evaluation, known as the anchor. X+ denotes the positive samples, which are augmented versions of the anchor from the same scene. Our objective is to minimize the distance between the anchor and its positives in the embedding space. Conversely, X- represents the negative samples, which are patches from the batch that are neither anchors nor positives. We aim to maximize the distance between the anchor and these negatives. This approach encourages the model to learn representations that group similar samples closely while pushing dissimilar samples apart. We have modified subsection 2.3.2 to provide a more detailed explanation of these terms and their roles in our contrastive learning process.

Comment 4: The model uses a training dataset with an image size of 160 * 160 pixels. Given that the spatial resolution of the hyperspectral data used is 30 m, the spatial extent of each training sample is 4800 m * 4800 m. Such a large area will contain multiple types of ground features, and there will be many mixed spectra affecting ground classification. It is suggested that the author take a single pixel as the training sample to extract different ground types from the subtle differences in spectral reflectance, so as to make full use of the advantages of hyperspectral data.

Response 4: Thank you for pointing this out; we appreciate the suggestion. Spatial context is crucial in our HyperKon contrastive learning approach. Reducing the patch size to a single pixel removes the spatial information, thereby decreasing the model's capacity to learn meaningful spatial-spectral representations. However, reducing the patch size is an interesting research direction that we will explore in future work.

Comment 5: The classification results of (b) and (c) in Figure 8 are very similar, and some objects can be very accurately identified, even better than the author's method, but why are other objects unrecognizable at all? Please give a reasonable explanation.

Response 5: Thank you for pointing this out. We appreciate your valuable concerns regarding Figure 8. We have implemented the following changes to address your comments:

  1. We have updated Figure 8 to include a mask map, which now shows:
    1. Predicted classification map generated by HyperKon
    2. Predicted classification map with masked regions (showing only labelled areas)
    3. Predicted Accuracy map: Green for correct predictions, Red for incorrect predictions, and Black for unlabelled areas
    4. Ground truth classification map
    5. RGB display of the scene

 

Reviewer 3 Report

Comments and Suggestions for Authors

This paper proposes HyperKon, a self-supervised contrastive network specifically developed for hyperspectral image (HSI) analysis. The authors design a novel hyperspectral native convolutional architecture and innovative HSPL functions designed to enhance performance in hyperspectral super-resolution and classification tasks. Additionally, they present EnHyperSet-1, a hyperspectral dataset specifically tailored for deep learning applications. The authors point out that HyperKon outperforms traditional RGB-based models and other state-of-the-art methods, excelling in maintaining spectral integrity and capturing complex spectral-spatial relationships.

 

I have the following concerns:

 

1. What does Xijc in formula (1) in Section 2.2.2 specifically mean? Please explain.

2. The author points out that NT-Xent loss is based on the concept of InfoNCE. Please explain the difference between the two.

3. Does the size and overlap rate of the hyperspectral patches in the pre-training dataset affect the feature extraction effect of the model? Please explain how to determine the optimal parameter settings.

4. The author mentions in the Introduction that, compared with models such as SpectralGPT, HyperKon has more than 30 times fewer parameters, but there is no comparison of the model parameters in Section 3 (Results). Please add this.

5. The speed of hyperspectral classification has important application value. Therefore, please analyze the model operation efficiency problem and compare it with other methods.

Author Response

We thank the reviewers and the editor for their insightful feedback. We have made substantial revisions to the document. For convenience, we have highlighted the changes and provided a detailed breakdown of how each comment was addressed below.

Comment 1: What does Xijc in formula (1) in Section 2.2.2 specifically mean? Please explain.

Response 1: Thank you for this observation. We have clarified and added the definitions of the terms used in Section 2.2.2. X represents the input feature map, with Xijc being the value at the spatial position (i-th row, j-th column) and the c-th channel.

Comment 2: The author points out that NT-Xent loss is based on the concept of InfoNCE. Please explain the difference between the two.

Response 2: Thank you for this observation. We have added clarification on this point in subsection 2.3.1. NT-Xent (Normalized Temperature-scaled Cross Entropy) is a specific implementation of the more general InfoNCE principle. Key differences include L2 normalization of embeddings, temperature scaling, and typically symmetric implementation in NT-Xent.
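
For concreteness, a minimal SimCLR-style NT-Xent sketch showing these elements (a generic formulation, not our exact training code):

    # Illustrative sketch: NT-Xent as InfoNCE with L2-normalised embeddings,
    # temperature scaling, and symmetric anchor/positive pairing across two views.
    import torch
    import torch.nn.functional as F

    def nt_xent(z1, z2, temperature=0.1):
        # z1, z2: (B, D) embeddings of two views of the same B samples
        B = z1.shape[0]
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)   # L2 normalisation
        z = torch.cat([z1, z2], dim=0)                            # (2B, D)
        sim = z @ z.t() / temperature                             # temperature-scaled cosine similarities
        mask = torch.eye(2 * B, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(mask, float("-inf"))                # exclude self-similarity
        targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
        return F.cross_entropy(sim, targets)                      # positives: the other view of each sample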

Comment 3: Does the size and overlap rate of the hyperspectral patches in the pre-training dataset affect the feature extraction effect of the model? Please explain how to determine the optimal parameter settings.

Response 3: Thank you for this observation and question. We also apologise for using the wrong term "optimal". The 160x160 patch size was chosen to balance computational efficiency with adequate spatial context capture, while maintaining consistency with existing architectures for our self-supervised contrastive network backbone. Our patch extraction process, now clarified in subsection 2.1, employs a sliding-window approach with a 5% overlap buffer. Edge patches are zero-padded when necessary. The overlap percentage is an adjustable hyperparameter.

Comment 4: The author mentions in the Introduction that, compared with models such as SpectralGPT, HyperKon has more than 30 times fewer parameters, but there is no comparison of the model parameters in Section 3 (Results). Please add this.

Response 4: We appreciate the reviewer’s observation. To address this, we have added a new subsection in the Results section titled "Model Efficiency Analysis," which includes a table comparing the computational metrics of HyperKon with other state-of-the-art models. Additionally, we clarified in the Introduction that while SpectralGPT was referenced as a Remote Sensing Foundation Model (RSFM), it was not used for downstream HSI classification tasks. SpectralGPT was trained on multispectral images (MSI) with 12 bands, whereas HyperKon was specifically trained on hyperspectral images (HSI) with 224 bands.

Comment 5: The speed of hyperspectral classification has important application value. Therefore, please analyze the model operation efficiency problem and compare it with other methods.

Response 5: Thank you for pointing this out. As stated in response 4, we have added Table 5. This new table presents class-wise accuracies for each dataset (Indian Pines, Pavia University, and Salinas Scenes), along with additional performance metrics such as training time, test time, average inference time, and throughput.
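
For illustration, a minimal sketch of how the average inference time and throughput figures can be measured (the benchmarking setup below is an illustrative assumption, not our exact evaluation code):

    # Illustrative sketch: measuring average inference time and throughput for a
    # classification model over a set of test patches.
    import time
    import torch

    def benchmark(model, patches, device="cuda" if torch.cuda.is_available() else "cpu"):
        # patches: iterable of input tensors, each shaped (1, C, H, W)
        model = model.to(device).eval()
        start = time.perf_counter()
        with torch.no_grad():
            for patch in patches:
                _ = model(patch.to(device))
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        return elapsed / len(patches), len(patches) / elapsed   # avg inference time (s), throughput (samples/s)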

 

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

This manuscript has addressed the issues I raised earlier, and there are currently no obvious issues. My only suggestion is to appropriately increase the size of the text in most of the figures in the manuscript when the final version is submitted.
