Article
Peer-Review Record

Self-Supervised Keypoint Detection and Cross-Fusion Matching Networks for Multimodal Remote Sensing Image Registration

Remote Sens. 2022, 14(15), 3599; https://doi.org/10.3390/rs14153599
by Liangzhi Li 1, Ling Han 1,2 and Yuanxin Ye 3,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 20 June 2022 / Revised: 22 July 2022 / Accepted: 22 July 2022 / Published: 27 July 2022

Round 1

Reviewer 1 Report

The paper proposes a two-step "detection + matching" framework, where each step consists of a deep neural network, avoiding the traditional tedious "detection + feature description + matching" pipeline. First, a self-supervised detection network is designed to generate similar keypoint feature maps between multimodal images, which is used to detect highly repeatable keypoints. Second, the authors propose a cross-fusion matching network, which aims to exploit global optimization and fusion information for cross-modal feature descriptors and matching. The proposed method achieves superior feature detection and matching performance compared with current state-of-the-art methods. This is an interesting research paper. There are some suggestions for revision.

1)  The motivation is not clear. Please specify the importance of the proposed solution.

2)  The listed contributions are weak. Please highlight the innovations of the proposed solution.

3) This model is complicated. As the image size and information content increase, the time cost of processing tasks will increase. How can the model be optimized to reduce complexity and time cost while improving performance?

4) This separate processing of tasks is still cumbersome. How to improve the efficiency of processing each task separately still needs to be studied.

5) Obtaining repeatable keypoints on the reference and sensed images is crucial to matching. How the repeatability problem of keypoints is solved is not well described.

6)  Please specify how to obtain the suitable parameter values used in the proposed solution.

7)  There is insufficient discussion on existing work.

8) In the experiments section, some results are very descriptive but fail to come across as compelling to the reader due to obfuscation in the language chosen for description.

9)  The experimental results are not convincing. Please compare the proposed solution with more recently published solutions.

10) In the introduction, there are few descriptions of the shortcomings of existing work.

Author Response

Reviewer 1

 

The paper proposes a two-step "detection + matching" framework, where each step consists of a deep neural network, avoiding the traditional tedious "detection + feature description + matching" pipeline. First, a self-supervised detection network is designed to generate similar keypoint feature maps between multimodal images, which is used to detect highly repeatable keypoints. Second, the authors propose a cross-fusion matching network, which aims to exploit global optimization and fusion information for cross-modal feature descriptors and matching. The proposed method achieves superior feature detection and matching performance compared with current state-of-the-art methods. This is an interesting research paper. There are some suggestions for revision.

 

1) The motivation is not clear. Please specify the importance of the proposed solution.

Reply: Thanks for your valuable comments. We apologize for the earlier lack of clarity in this sentence. We have rewritten the motivation.

Line 1, the edited sentence in the "Abstract" now reads: "Remote sensing image matching is the basis for obtaining integrated observations and complementary information representation of the same scene from multiple source sensors, and is a prerequisite for remote sensing tasks such as image fusion and change detection. However, the intricate geometric and radiometric differences between multimodal images render registration quite challenging."

Line 22, the edited sentence in the "Introduction" now reads: "The joint observation of multimodal remote sensing images can bring unexpected discoveries about a region and significantly improve the interpretation of the same scene. Remote sensing image registration is therefore sought, which aligns two or more multi-temporal/multi-sensor images of the observed scene in the same coordinate system. Registration is crucial since it determines the quality and accuracy of remote sensing image fusion and change detection. Therefore, a matching process must be implemented for remote sensing missions involving joint observations."

 

2) The listed contributions are weak. Please highlight the innovations of the proposed solution.

Reply: We appreciate your comment. We have re-described the contributions in the revised version. The detailed contributions follow.

Line 22, the edited sentence in the "Introduction" now reads: "In this work, a two-step "detection + matching" network framework is proposed, where each network is applied to a different task to accommodate multimodal remote sensing image matching. To generate keypoints with repeatability, we construct a self-supervised keypoint detection network in both the spatial and channel domains, rather than relying on local keypoint responses obtained by maximum pooling. To obtain cross-modal similarity feature descriptions, an interactive fusion network is proposed for global optimization. The main contributions are summarized as follows:"

(1) For the detection network, considering the differences in the input multimodal images, we build confidence at the same locations from the spatial and channel domains, which discards the maximum-pooling keypoint detection mechanism. Overall, we design a self-supervised training manner that assigns the same keypoint confidence to the same locations, which shifts keypoint detection from a single-image response to the fitting of keypoint positions conditioned on two images.

(2) In terms of the matching network, considering that constructing feature descriptors from local image patches ignores global information and the interaction between image patch pairs, we develop a cross-fusion mechanism for exchanging high-level semantic information between image patches for feature description. Simultaneously, the network generates a matrix of matching relationships from the overall image patches, unifying the "description + matching" steps into a single network to achieve global optimization of remote sensing image matching.

 

3) This model is complicated. As the image size and information content increase, the time cost of processing tasks will increase. How can the model be optimized to reduce complexity and time cost while improving performance?

Reply: Thanks for your valuable comments. Your concern is worth discussing. Since our proposed method assigns a suitable network to each matching step, the whole matching framework consists of two networks, which reduces complexity. Likewise, these models are trained separately according to their respective tasks, which reduces the time spent on network fitting. However, this separated matching pipeline is tedious, and further integration of the two networks in a follow-up study is required to improve computational efficiency.

 

4) This separate processing of tasks is still cumbersome. How to improve the efficiency of processing each task separately still needs to be studied.

Reply: Thanks for your review. The proposed framework is composed of two networks, which is a compromise made in favor of remote sensing image matching accuracy. In actual data processing, handling the tasks separately is troublesome. Therefore, we believe that integrating detection and matching in one network while reducing network complexity will be a future research direction.

 

5) Obtaining repeatable keypoints on the reference and sensed images is crucial to matching. How the repeatability problem of keypoints is solved is not well described.

Reply: We apologize for the lack of clarity in the previous version. We have added a description of the proposed keypoint detection methods.

Line 116, the edited sentence in the "Introduction" now reads: "To generate keypoints with repeatability, we construct a self-supervised keypoint detection network in both the spatial and channel domains, rather than relying on local keypoint responses obtained by maximum pooling."

Line 121, the edited sentence in the "Introduction" now reads: "(1) For the detection network, considering the differences in the input multimodal images, we build confidence at the same locations from the spatial and channel domains, which discards the maximum-pooling keypoint detection mechanism. Overall, we design a self-supervised training manner that assigns the same keypoint confidence to the same locations, which shifts keypoint detection from a single-image response to the fitting of keypoint positions conditioned on two images."

Line 223, the edited sentence in Section 2.2 now reads: "To obtain keypoints that are robust to scaling changes, feature maps from three stages of the network (Sconv-8/20 and DCN-2) are used for keypoint detection. Subsequently, the outputs of the three stages are input to the upsampling network, which recovers the original size and assigns the corresponding weights. Keypoints are determined from the peaks in the local spatial and channel domains, as shown in Figure 2. Specifically, for each position and channel in the feature map output by the detection network, the local spatial and channel scores are calculated by"
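The formula itself is not reproduced in this record. As a rough illustration of the kind of spatial/channel peakiness measure described above, the following sketch scores each position by its response relative to its local neighborhood and to the other channels; the D2-Net-style ratios, the window size, and the post-ReLU assumption are ours, not the authors' exact equations.

```python
import torch
import torch.nn.functional as F

def peakiness_scores(fmap: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Hypothetical spatial/channel peakiness for one detection stage.

    fmap: (C, H, W) feature map, assumed non-negative (post-ReLU).
    Returns an (H, W) keypoint confidence map in [0, 1].
    """
    x = F.relu(fmap).unsqueeze(0)  # (1, C, H, W) for pooling
    # Spatial score: response relative to its local neighborhood mean.
    local_mean = F.avg_pool2d(x, window, stride=1, padding=window // 2)
    spatial = F.softplus(x - local_mean)
    # Channel score: response relative to the strongest channel at each pixel.
    channel = x / (x.max(dim=1, keepdim=True).values + 1e-8)
    # Combine, take the best channel per position, normalize to [0, 1].
    score = (spatial * channel).max(dim=1).values.squeeze(0)
    return score / (score.max() + 1e-8)
```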

 

6) Please specify how to obtain the suitable parameter values used in the proposed solution.

Reply: Thanks for your review. To help design the two networks, we performed separate ablation experiments to obtain the optimal image input size, network operations, number of network layers, and scale-weight combinations.

Line 348, the edited sentence in Section 3.3 now reads: "To aid in the design of the detection network described in Section 2, we made different combinations of the variants of the three-stage network. The three components were replaced with the original convolution, where the combinations are .

In the experiments, we needed to analyze keypoint detection performance under different combinations of the scale-weighting parameters, where the scale parameter combinations could be enumerated as , , for the Sconv-8/20, with the DCN-2 weights varying between  and . Therefore, an ablation study was performed over the network combinations and scale-weight parameters. Repeatability at different pixel thresholds was used to measure detection performance. The specific parameter combinations are shown in Figure 6."

Line 348, the edited sentence in Section 3.4 now reads: "To fully understand the matching network, four different variants were evaluated. ViT was employed as the base network () in the ablation studies. Network variants combining the ViT and interaction fusion modules were evaluated (). Furthermore, ablation studies were conducted on the number of layers of the interaction fusion module, with  as the number of layers. For a fair comparison, the random data used during training was made deterministic. The performance of the cross-fusion matching network was evaluated using MMA and NN mAP at different pixel thresholds. Detailed results of the ablation tests are given in Table 2."

Line 443, the edited sentence in Section 3.6 now reads: "To determine the optimal patch size, MMA at a 2.0-pixel threshold on the same dataset was evaluated for different patch sizes (32 × 32, 48 × 48, 64 × 64, 96 × 96 and 128 × 128), whose MMA values were 0.321, 0.539, 0.553, 0.596 and 0.504, respectively. This illustrates that matching performance gradually increased with patch size, reaching its highest value at a patch size of 96 × 96, around which the matching accuracy leveled off. A patch size of 96 × 96 was therefore selected for the following experiments to improve computational efficiency while ensuring matching performance."

 

7) There is insufficient discussion on existing work.

Reply: Thanks for your valuable comments. We apologize for the lack of clarity in the previous version. We have re-described the existing work in the introduction and the experiments, respectively.

Line 84, the edited sentence in the "Introduction" now reads: "Some methods combine feature detection and description in one network for natural images, such as D2-Net, Superpoint and R2D2. D2-Net uses the original image as input to generate keypoint feature maps. However, the accuracy of the keypoint locations is relatively low because detection is performed on the feature maps. Superpoint applies a simulated training approach to obtain keypoint localization, which makes it difficult to obtain keypoint responses on multimodal remote sensing images with complex features. R2D2 uses upsampling to maintain the size of the original image and regards the final output as the key information for generating keypoints, which loses feature responses at corners and edges.

For these natural-image-based matching methods, the keypoint detection and feature description mechanisms must be improved; e.g., keypoints detected by maximum pooling have low repeatability on remote sensing images with nonlinear radiometric differences. Integrating multiple tasks in a single network may not perform well for both keypoint detection and feature description. Furthermore, these unified networks are optimized on fixed-size image patches, which makes global matching optimization difficult for large remote sensing images. Therefore, each matching step requires an adapted network structure to optimize remote sensing images globally. Based on the above description, multimodal remote sensing image matching needs to overcome the following problems."

Line 415, the edited sentence in the "Experiment" section now reads: "For Superpoint, repeatability was lower than that of R2D2 because Superpoint used simulation data to train the keypoint detector, which hardly reflects real remote sensing scenes. The R2D2 method finds correspondences by processing the images independently and does not consider the information differences between multimodal images, which made the repeatability of its keypoints on multimodal images lower than that of our proposed method. The keypoint confidence threshold was used to filter keypoints in the proposed detection network, which determines the number of repeatable keypoints. Choosing a larger threshold could increase the repeatability of these keypoints but would reduce the overall number of keypoints. The confidence threshold needs to be determined manually in practical applications. Therefore, we will study an algorithm for adaptive confidence threshold selection."

Line 490, the edited sentence in the "Experiment" section now reads: "Superpoint obtained a lower M.S. on multimodal remote sensing images, which might be attributed to nonlinear radiometric differences between the images resulting in distinct feature descriptions. The Harris detector employed in SAR-SIFT was too sensitive to nonlinear radiometric differences, resulting in low matching accuracy. The proposed method used a cross-fusion mechanism to make the network robust to both geometric and radiometric variations, obtaining feature descriptions that fuse the similarity between the two images. Furthermore, the proposed method performed detection, description and matching on the original image, which further enhanced matching localization accuracy compared with the R2D2 method, which localizes on the output feature map."

 

8) In the experiments section, some results are very descriptive but fail to come across as compelling to the reader due to obfuscation in the language chosen for description.

Reply: Thank you very much for your comment. Following your comment, we rechecked our manuscript and rewrote the descriptions of the tests; the changed parts are highlighted.

 

9) The experimental results are not convincing. Please compare the proposed solution with more recently published solutions.

Reply: Thanks for your valuable comments. We have evaluated the effectiveness of our proposed method on the test datasets against competitive approaches such as R2D2, Superpoint and D2-Net. In addition, we compared the proposed method with the other methods to obtain matching results for each method on the test images, as shown in the figure below.

Line 500, the edited sentence in the "Experiment" section now reads: "Figure 10 presents the qualitative results of the proposed method's overall performance, with blue lines indicating good matches and red lines indicating mismatches with errors greater than 4 pixels. According to the results, the proposed method was capable of obtaining uniformly distributed correspondences on all tested datasets. SIFT and POS-SIFT could hardly obtain correct correspondences and were covered overall by red lines. SAR-SIFT obtained a higher number of correct correspondences on SAR and optical images. For Superpoint and D2-Net, the number of correct correspondences was overall lower than that of the proposed method. For the proposed method, the correspondences found on , , and  were largely concentrated in construction areas, which could be because ,  and  are mostly covered by plants and water bodies with few textural differences. Overall, the proposed method was effective in providing correspondences on multimodal remote sensing images with geometric and radiometric invariance."

Figure 10. Qualitative matching results of the overall network on , where blue lines indicate correct matches and red lines indicate incorrect correspondences.

 

10) In the introduction, there are few descriptions of the shortcomings of existing work.

Reply: We apologize for the lack of clarity in the previous version. We have re-described the existing work in the introduction and the experiments, respectively.

Line 37, the edited sentence in the "Introduction" now reads: "It uses structural information from the entire template window, combined with a fast similarity measure, to detect correspondences between images. However, since these area-based methods are weak at handling large geometric deformations, they fail to process complex remote sensing image registration."

Line 50, the edited sentence in the "Introduction" now reads: "Since these images are obtained by various sensors, multi-temporal observation, or different imaging views, they have complicated geometric and radiometric differences. Additionally, non-learned features (e.g., statistical information of edges, textures, corners and gradients) lack high-level semantic information. Therefore, these methods cannot guarantee that the extracted features are highly repeatable and distinct between multimodal remote sensing images."

Line 68, the edited sentence in the "Introduction" now reads: "The above methods introduce high-level features as matching primitives and achieve considerable matching performance in many cases. While these methods use deep neural networks to provide descriptors of salient features, they still use non-learned methods for feature detection rather than learning-based schemes."

Line 82, the edited sentence in the "Introduction" now reads: "Although the above methods avoid the keypoint detection step, they are very time-consuming when searching for the best correspondence for each patch."

Line 84, the edited sentence in the "Introduction" now reads: "Some methods combine feature detection and description in one network for natural images, such as D2-Net, Superpoint and R2D2. D2-Net uses the original image as input to generate keypoint feature maps. However, the accuracy of the keypoint locations is relatively low because detection is performed on the feature maps. Superpoint applies a simulated training approach to obtain keypoint localization, which makes it difficult to obtain keypoint responses on multimodal remote sensing images with complex features. R2D2 uses upsampling to maintain the size of the original image and regards the final output as the key information for generating keypoints, which loses feature responses at corners and edges.

For these natural-image-based matching methods, the keypoint detection and feature description mechanisms must be improved; e.g., keypoints detected by maximum pooling have low repeatability on remote sensing images with nonlinear radiometric differences. Integrating multiple tasks in a single network may not perform well for both keypoint detection and feature description. Furthermore, these unified networks are optimized on fixed-size image patches, which makes global matching optimization difficult for large remote sensing images. Therefore, each matching step requires an adapted network structure to optimize remote sensing images globally. Based on the above description, multimodal remote sensing image matching needs to overcome the following problems."

Author Response File: Author Response.docx

Reviewer 2 Report

Self-supervised keypoint detection and cross-fusion matching networks for multimodal remote sensing image registration

This is a revised version of a paper that I previously reviewed.

The authors present the use of two different deep learning networks for key point detection and matching (respectively) in multimodal satellite images (optical and SAR). The detection network uses a siamese network while the matching network is based on transformers.

In my previous review I raised three main concerns:

1) The novelty of the paper was not clear enough. I believe this has not been improved enough. Please make a further effort to make clear how your research differs from previous contributions. You have made your goals very clear as well as the structure of your algorithms but you have not stressed enough what ideas are new. As far as I can tell, you are mostly re-using existing ideas with the correct emphasis for your application. This is not a problem but you should make it clearer.

2) Data issues. I am satisfied that the data problems have been solved.

3) Applicability and significance of the results. At this moment I believe this is the main issue preventing publication. As the technical novelty of the paper is low I believe the interesting part of the contribution is whether or not you achieve better results in practice. Unfortunately, all the metrics that you use focus on technical definitions that are closer to your optimization problem than to any practical use. Furthermore, the differences observed between the proposed methods and other methods using similar ideas (R2D2, superpoint) seem small. Why do these differences matter in practical applications?  


The paper contains a number of typos and grammar problems; please re-check the paper thoroughly.

Author Response

Response to the suggestions:

 

 

Reviewer 2

 

Self-supervised keypoint detection and cross-fusion matching networks for multimodal remote sensing image registration


This is a revised version of a paper that I previously reviewed.


The authors present the use of two different deep learning networks for key point detection and matching (respectively) in multimodal satellite images (optical and SAR). The detection network uses a siamese network while the matching network is based on transformers.

In my previous review I raised three main concerns:

1) The novelty of the paper was not clear enough. I believe this has not been improved enough. Please make a further effort to make clear how your research differs from previous contributions. You have made your goals very clear as well as the structure of your algorithms but you have not stressed enough what ideas are new. As far as I can tell, you are mostly re-using existing ideas with the correct emphasis for your application. This is not a problem but you should make it clearer.
Reply: Thank you very much for your comments and encouragement of our work. The self-supervised training manner is indeed drawn from current research. However, for the keypoint detection network, our main contribution is the construction of a keypoint detection mechanism conditioned on two images in the channel and spatial domains, which discards the current maximum-pooling-based keypoint response operation. For the matching network, our main contribution is a cross-fusion module that interactively fuses the candidate patches from the reference and sensed images to improve the similarity of the feature descriptions. We outline the novel ideas of the proposed method below.

Line 121, the edited sentence in the "Introduction" now reads: "(1) For the detection network, considering the differences in the input multimodal images, we build confidence at the same locations from the spatial and channel domains, which discards the maximum-pooling keypoint detection mechanism. Overall, we design a self-supervised training manner that assigns the same keypoint confidence to the same locations, which shifts keypoint detection from a single-image response to the fitting of keypoint positions conditioned on two images.

(2) In terms of the matching network, considering that constructing feature descriptors from local image patches ignores global information and the interaction between image patch pairs, we develop a cross-fusion mechanism for exchanging high-level semantic information between image patches for feature description. Simultaneously, the network generates a matrix of matching relationships from the overall image patches, unifying the "description + matching" steps into a single network to achieve global optimization of remote sensing image matching."


2) Data issues. I am satisfied that the data problems have been solved.

Reply: Thank you for contributing to our manuscript.

3) Applicability and significance of the results. At this moment I believe this is the main issue preventing publication. As the technical novelty of the paper is low I believe the interesting part of the contribution is whether or not you achieve better results in practice. Unfortunately, all the metrics that you use focus on technical definitions that are closer to your optimization problem than to any practical use. Furthermore, the differences observed between the proposed methods and other methods using similar ideas (R2D2, Superpoint) seem small. Why do these differences matter in practical applications?  

For the question "At this moment I believe this is the main issue preventing publication. As the technical novelty of the paper is low, I believe the interesting part of the contribution is whether or not you achieve better results in practice. Unfortunately, all the metrics that you use focus on technical definitions that are closer to your optimization problem than to any practical use.":
Reply: Thank you for your review. We have restated the innovations of our method in this version.

For the detection network, we used the repeatability metric for keypoints, which indicates the quality of the detected keypoints; without repeatable keypoints, the subsequent matching step cannot be completed.

For the matching network, we use the mean matching accuracy (MMA) and the nearest-neighbor mean average precision (NN mAP), which measure the distinguishability of the feature descriptions and the matching accuracy.

To evaluate the practical use of our proposed method, we tested the proposed pipeline framework as a whole, using RMSE and M.S. to evaluate the accuracy of whole-image matching; the experimental results show that our proposed method achieves high matching accuracy on multimodal remote sensing images. For remote sensing tasks such as image fusion and change monitoring, matching is a fundamental step that must be performed, because the degree of pixel alignment affects the accuracy of those tasks.
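For reference, a minimal sketch of how such whole-image metrics are commonly computed; treating M.S. as the standard matching score (correct matches over detected keypoints) is our assumption here, and the authors' exact protocol may differ.

```python
import numpy as np

def match_rmse(pts_ref: np.ndarray, pts_mapped: np.ndarray) -> float:
    """RMSE (pixels) between reference keypoints and their matched
    counterparts mapped into the reference frame; both arrays are (N, 2)."""
    residuals = np.linalg.norm(pts_ref - pts_mapped, axis=1)
    return float(np.sqrt(np.mean(residuals ** 2)))

def matching_score(n_correct: int, n_detected: int) -> float:
    """Assumed M.S. definition: fraction of detected keypoints that
    yield a correct match."""
    return n_correct / max(n_detected, 1)
```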

 

For the question "Furthermore, the differences observed between the proposed methods and other methods using similar ideas (R2D2, Superpoint) seem small. Why do these differences matter in practical applications?":

Reply: We evaluate the performance of our proposed method using the average values of the metrics over the test dataset; therefore, the quantitative differences between the proposed method and R2D2 or Superpoint are small. As can be seen from the qualitative matching results (Figure 10), although the quantitative results differ only slightly, the differences have a considerable impact on the correct matching correspondences in practical matching.

4) The paper contains a number of typos and grammar problems; please re-check the paper thoroughly.

Reply: Thank you very much for your comment. Following your comment, we rechecked our manuscript carefully, and the revised parts are highlighted.

Author Response File: Author Response.docx

Reviewer 3 Report

Nice paper.

Overall Decision: Minor revision

This manuscript introduces a two-step "detection + matching" framework, where each step consists of a deep neural network, avoiding the traditional tedious "detection + feature description + matching" pipeline. In summary, the research is interesting and provides valuable results, but the current document has several weaknesses that must be strengthened in order to obtain a documentary result equal to the value of the publication.

General considerations:

(1) At the thematic level, the proposal provides a very interesting vision, as remote sensing image registration determines the quality and accuracy of remote sensing image fusion and change detection and would be a very useful resource for researchers. Nevertheless, multimodal remote sensing images are geographically aligned using satellite orbit parameters and GPS. While geographically matched images minimize global geometric distortion, local matching errors of tens of pixels remain, severely limiting the joint application of multimodal remote sensing images. Therefore, a matching process must be implemented for remote sensing missions involving joint observations. This issue is an important limitation on the aspirations of the proposal, and these limitations should be acknowledged with more rigour and realism in the argumentation of the manuscript.

Title, Abstract and Keywords:

(2) The abstract is complete and well-structured and explains the contents of the document very well. Nonetheless, since the two steps in the article use two neural networks, the part relating to the results could provide numerical indicators obtained from the two neural networks in the research.

 

Chapter 1: Introduction

(3) Vision technology applications in various engineering fields may be introduced briefly for a full view of the scope of related areas (A Study on Long–Close Distance Coordination Control Strategy for Litchi Picking. Agronomy 2022; Seismic Performance Evaluation of Recycled aggregate Concrete-filled Steel tubular Columns with field strain detected via a novel mark-free vision method. Structures, 2022).

Chapter 2: The method

(4) Given the complexity of the contents developed in Chapter 2, the scheme in Figure 1 could be even more complex and detailed in its explanation of the processes.

Chapter 3: Experiments

(5) The number and resolution of the images in the training dataset need to be reflected in the article.

Chapter 4: Conclusions

(6) The authors may mention the limitations of their research, the scope for further research, and the applications of the study.

Author Response

Reviewer 3

 

Nice paper.

 

Overall Decision: Minor revision

 

This manuscript introduces a two-step "detection + matching" framework, where each step consists of a deep neural network, avoiding the traditional tedious "detection + feature description + matching" pipeline. In summary, the research is interesting and provides valuable results, but the current document has several weaknesses that must be strengthened in order to obtain a documentary result equal to the value of the publication.

 

General considerations:

 

(1) At the thematic level, the proposal provides a very interesting vision, as remote sensing image registration determines the quality and accuracy of remote sensing image fusion and change detection and would be a very useful resource for researchers. Nevertheless, multimodal remote sensing images are geographically aligned using satellite orbit parameters and GPS. While geographically matched images minimize global geometric distortion, local matching errors of tens of pixels remain, severely limiting the joint application of multimodal remote sensing images. Therefore, a matching process must be implemented for remote sensing missions involving joint observations. This issue is an important limitation on the aspirations of the proposal, and these limitations should be acknowledged with more rigour and realism in the argumentation of the manuscript.

 

Title, Abstract and Keywords:

 

Reply: Thank you very much for your comment. Following your comment, we have rewritten the sentence.

Line 22, the edited sentence in the "Introduction" now reads: "The joint observation of multimodal remote sensing images can bring unexpected discoveries about a region and significantly improve the interpretation of the same scene. Remote sensing image registration is therefore sought, which aligns two or more multi-temporal/multi-sensor images of the observed scene in the same coordinate system. Registration is crucial since it determines the quality and accuracy of remote sensing image fusion and change detection. Therefore, a matching process must be implemented for remote sensing missions involving joint observations."

 

(2) The abstract is complete and well-structured and explains the contents of the document very well. Nonetheless, since the two steps in the article use two neural networks, the part relating to the results could provide numerical indicators obtained from the two neural networks in the research.

Reply: Thank you very much for your suggestion. We have added numerical indicators obtained from the two networks.

Line 12, the edited sentence in the "Abstract" now reads: "The experiments show that the proposed method achieves superior feature detection and matching performance compared with current state-of-the-art methods. Specifically, the keypoint repeatability of the detection network and the NN mAP of the matching network are 0.435 and 0.712 on the test datasets, respectively. The whole pipeline framework was evaluated and achieves an average M.S. of 0.298 and an RMSE of 3.41. This provides a novel solution for the joint use of multimodal remote sensing images for observation and localization."

 

Chapter 1: Introduction

 

(3) Vision technology applications in various engineering fields may be introduced briefly for a full view of the scope of related areas (A Study on Long–Close Distance Coordination Control Strategy for Litchi Picking. Agronomy 2022; Seismic Performance Evaluation of Recycled aggregate Concrete-filled Steel tubular Columns with field strain detected via a novel mark-free vision method. Structures, 2022).

Reply: Following your suggestion, we have added this to the introduction.

Line 56, the edited sentence in the "Introduction" now reads: "Currently, deep learning (DL) methods have achieved great success."

 

Chapter 2: The method

 

(4) Given the complexity of the contents developed in Chapter 2, the scheme in Figure 1 could be even more complex and detailed in its explanation of the processes.

Reply: We apologize for the earlier lack of clarity, and thank you for your encouraging remarks. We have rewritten the description of the proposed approach, as shown below:

Line 12, the edited caption in the "Methodology" now reads: "Flowchart of the multimodal remote sensing image matching framework. (a) Keypoint detection network. The detection network parameterizes the multimodal remote sensing images ( and ) and calculates keypoint responses at three scales to generate the keypoint feature maps ( and ). (b) Matching network. The matching network performs global interactive fusion of all patches cropped from  and , which is used to obtain similarity feature descriptions ( and ) with cross-modal matching. The matching matrix describes the matching correspondences of all candidate patches."

Line 181, the edited sentence in the "Methodology" now reads: "The proposed matching method consists of two parts, a detection network and a cross-fusion matching network, each used for a specific task. A detailed description of these components is given in Figure 1.

In the detection network, let R and S be the reference and sensed images. DSCs are used to encode the input image to improve the computational efficiency of the network, while DCN operations are introduced in the decoding stage to enhance the geometric robustness of the keypoints. The weighted results from three scales are input into the peakiness measurement to obtain the candidate keypoint feature maps. The keypoint measurement is calculated conditioned on the two images from the local spatial and channel domains, respectively, additionally using  and a confidence threshold  to activate the peaks to positive values. The detailed calculation procedure is described in Subsection 2.2.

In the matching network, image patches containing the candidate keypoints are first fed into the position embedding operation, which extracts position-dependent features. The patches to be matched are cropped from the neighborhoods of the keypoints, enhancing the local information around each keypoint. Both the reference and sensed image patches are used as input to the designed interactive fusion layers, which exchange information between them. The patches therefore capture each other's information, which is transformed into easy-to-match feature descriptors (feature description step). In the matching phase, multiple fully connected layers are used to optimize the feature descriptors under an overall constraint strategy and normalize them to  for obtaining the final matching matrix between  and  (matching step). The proposed cross-fusion network unifies the "description + matching" steps into a single network for global optimization."
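To make the interactive fusion idea concrete, here is a minimal sketch of one such layer and a soft matching matrix. The layer sizes, head count, and the dual-softmax normalization are our assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CrossFusionLayer(nn.Module):
    """One interactive fusion layer: patch embeddings from one image
    attend to those of the other, exchanging cross-modal information."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Self-attention within one image's patch embeddings (B, N, dim).
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        # Cross-attention: these patches query the other image's patches.
        x = self.norm2(x + self.cross_attn(x, y, y)[0])
        return self.norm3(x + self.ffn(x))

def matching_matrix(desc_r: torch.Tensor, desc_s: torch.Tensor) -> torch.Tensor:
    """Soft matching matrix between fused descriptors (N, dim) and (M, dim):
    pairwise similarities normalized by a dual softmax."""
    sim = desc_r @ desc_s.t()
    return sim.softmax(dim=1) * sim.softmax(dim=0)
```

Applied symmetrically, i.e., fused_r = layer(patches_r, patches_s) and fused_s = layer(patches_s, patches_r), each side's descriptors absorb information from the other before the matching matrix is formed.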

 

Chapter 3: Experiments

 

(5) The number and resolution of the images in the training dataset need to be reflected in the article.

Reply: We apologize for omitting a detailed description of the training dataset. We have reorganized its introduction.

Line 310, the edited sentence in the "Experiments" now reads: "To train the network model, many optical and SAR images from different areas were acquired to generate the dataset. The optical satellite sensor is SkySat, with a spatial resolution of 0.8 m. The SAR images were acquired by Sentinel-1 and contain all ground-range-detected scenes. Each scene is available in three resolutions and four band combinations (corresponding to the scene polarization). In the experiments, we used a combination of VV + VH polarizations. Following the above training data generation method, 70,000 image pairs were used for detection network training and 120,000 positive and negative samples were generated for matching network training. The original images used to generate the dataset (available at https://github.com/liliangzhi110/SARopticaldataset) are released for benchmark evaluation."

 

Chapter 4: Conclusions

 

(6) The authors may mention the limitations of your research and the scope for further research as well as the application of the study.

Reply: Thanks for the advice. We give a detailed description of the limitations of the proposal.

Line 530, the edited sentence in the "Experiments" now reads: "Although two networks are proposed for the detection and description problems in multimodal remote sensing image registration, this separate processing of tasks is still cumbersome. Therefore, we will study a unified network to achieve end-to-end matching in future work."

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

There is a significant improvement since last version. More solutions published in 2022 should be discussed, such as "Small Object Detection Method Based on Adaptive Spatial Parallel Convolution and Fast Multi-Scale Fusion", Remote Sensing 14 (2), 420, 2022.

Author Response

Response to the suggestions:

 

Reviewer 1

There is a significant improvement since last version. More solutions published in 2022 should be discussed, such as "Small Object Detection Method Based on Adaptive Spatial Parallel Convolution and Fast Multi-Scale Fusion", Remote Sensing 14 (2), 420, 2022.

Reply: Thanks for your valuable comments. Following your suggestion, we have added this to the introduction.

Line 56, the edited sentence in the "Introduction" now reads: "They have been applied in remote sensing image processing tasks including image registration, change detection, object detection \cite{qi2022small}, etc."

Author Response File: Author Response.docx

Reviewer 2 Report

My previous concerns have been sufficiently addressed.

Author Response

Reviewer 2

My previous concerns have been sufficiently addressed.
Reply: Thank you very much for your comments and contributions to our manuscript.

Author Response File: Author Response.docx

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Self-supervised keypoint detection and cross-fusion matching networks for multimodal remote sensing image registration

The authors present the use of two different deep learning networks for keypoint detection and matching (respectively) in multimodal satellite images (optical and SAR). The detection network uses a siamese network while the matching network is based on transformers. This is solid research work on an interesting problem that has, however, significant problems in its present form.

The research does not deviate much from existing networks and focuses instead on their adaptation to the specific problem at hand. This is not a problem in itself but, as it stands, the text does not make a sufficiently clear distinction regarding the novelty aspects of the paper and does not go sufficiently deep into the application significance of what is being presented. Moreover, the testing methodology is flawed and I am not sure the numbers presented are totally trustworthy application-wise due to this flaw. Finally, the language of the paper contains some problems that the authors should take care to fix, as they detract from the paper's contribution.

I believe the changes that need to be made would require more than the 10 days normally granted by MDPI for "major revisions", so I am going to recommend rejection of the paper and encourage the authors to resubmit to Remote Sensing when they have had time to properly improve the paper. I am also going to tell the editors that if an extended revision period is considered, I think this paper would merit it.


Major comments:

Novelty:

The section starting at line 125 does a good job of summarising your paper but does not clearly indicate what is novel in your approach. For example, I would say that neither self-supervised keypoint detection networks nor the use of siamese networks for keypoint detection (even in multi-modal images) are new. Please indicate clearly which parts of your contribution are fundamentally new.


Data issues:

"We mix data from all scenes, ensuring that the ratio of data from each scene in the training, validation and test datasets are all 0.7/0.2/0.1."

This methodology presents a problem that may be misrepresenting the application performance of the research. As the training, validation and testing data are chosen from the same images, the training/validation sets and the testing sets share many characteristics. This is a commonly occurring data issue, often referred to as a "non-independent test set". See, for example:

https://www.mdpi.com/2072-4292/13/14/2837

or

https://www.sciencedirect.com/science/article/pii/S1361841520300591?via%3Dihub


In particular, it is very likely that patches constructed from very close physical locations and/or sharing part of their manually detected corresponding points will be present in both the training and testing datasets. This can result in a subtle form of overfitting (not to the training images themselves, as is usually defined, but to the scenes that were used to create the dataset). If this happens, the numbers we are seeing in the paper reflect the ability to find and detect keypoints in the scenes used, not in general scenes. I am not saying that this is definitely the case, but I do believe that the methodology should be changed to make sure that this is not happening. My suggestion would be to separate out a number of scenes and use those only in the testing set; this way the patches in the testing set would not be related in any way to those in the training/validation sets. Additionally, if the testing numbers under this protocol were then compared with those on validation (with the validation set constructed as it is done now), we would also obtain a measure of the impact of this subtle form of overfitting.
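In code, the reviewer's suggestion amounts to splitting by scene rather than by patch. A minimal sketch (the function name and data layout are hypothetical):

```python
import random

def scene_split(patches_by_scene: dict, n_test_scenes: int = 2,
                val_ratio: float = 0.2, seed: int = 0):
    """Hold out whole scenes for testing so that no test patch shares a
    scene with the training/validation data (avoiding a non-independent
    test set).

    patches_by_scene: maps scene id -> list of patch samples.
    Returns (train, val, test) lists of patch samples.
    """
    rng = random.Random(seed)
    scenes = sorted(patches_by_scene)
    rng.shuffle(scenes)
    test = [p for s in scenes[:n_test_scenes] for p in patches_by_scene[s]]
    rest = [p for s in scenes[n_test_scenes:] for p in patches_by_scene[s]]
    rng.shuffle(rest)
    n_val = int(len(rest) * val_ratio)
    return rest[n_val:], rest[:n_val], test
```

Comparing test metrics under this split with validation metrics under the current patch-level split would quantify the scene-level overfitting described above.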


Application issues:

The authors make a compelling case (data issues aside) that their algorithm outperforms others according to several criteria. This is all well and good. However, this paper has an important applied aspect. I would be very interested in getting some sense of how much the differences observed affect the practical use of the algorithms. Specifically, suppose I want to use your keypoint matching algorithm to track the evolution of a deforestation phenomenon in one region. How much is the 0.653 to 0.712 change pictured in Figure 10 going to help me? I understand this is not easy to do, but any advance in this direction would make your paper much stronger.


Minor comments (language)


The paper contains a number of typos and grammar problems. I will list just a few examples, but please check the paper thoroughly:


Handcraft-> Handcrafted (I am personally also not crazy about this terminology, I would use "classical" or "non-learned", but handcrafted has become sadly popular)


Grammar!: ... which aims exploit global and interactive information of keypoint descriptors by self-attention and cross-attention


"invariant to the translation, rotation and scale" (drop "the", the use of the definite article "the" should be checked in the whole text) 


Grammar!:Accordingly, these methods are hard to extract highly repeatable keypoints between multimodal images with significant geometric and radiometric differences.


figure 7 caption "uesed"

Reviewer 2 Report

This paper proposes a keypoint detection network and a cross-fusion matching network for multimodal remote sensing image registration. For keypoint detection, it designs a self-supervised siamese network to detect highly repeatable keypoints. In the keypoint matching stage, it designs a Transformer-based cross-fusion matching network, which improves the cross-modal matching ability of the model. Extensive comparative experiments show that the proposed method is competitive with state-of-the-art methods. The effectiveness of the proposed modules is also evaluated in an ablation study. This is an interesting research paper. There are some suggestions for revision.

  1. The motivation is not clear. Please specify the importance of the proposed solution.
  2. The listed contributions are a little bit weak. Please highlight the innovations of the proposed solution.
  3. What is "SAR-derived GCPs"? There is no corresponding explanation or reference in this paper.
  4. How does the interleaving of self-attention layers and cross-attention layers affect the final prediction result? The authors are suggested to add relevant explanations or experiments for verification.
  5. In the introductory part of Section 3, what do subsections A/B/C/D/E/F stand for? The authors are suggested to check similar issues to increase the readability of the paper.
  6. During the training phase, this paper does not report the model optimizer used and the corresponding hyperparameter settings.
  7. The readability of Table 2 is a little bit low. It is recommended to highlight the best results in each project to make the table more descriptive.
  8. Please specify how to obtain the suitable parameters used in the proposed solution.
  9. The experimental results are not convincing. Please compare the proposed solution with more recently published solutions.
  10. English grammar errors and inaccurate descriptions cause distraction and make the paper difficult to understand. The authors are suggested to correct the linguistic errors in the text.