Article
Peer-Review Record

A Swin Transformer with Dynamic High-Pass Preservation for Remote Sensing Image Pansharpening

by Weisheng Li, Yijian Hu, Yidong Peng * and Maolin He
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Remote Sens. 2023, 15(19), 4816; https://doi.org/10.3390/rs15194816
Submission received: 30 August 2023 / Revised: 29 September 2023 / Accepted: 2 October 2023 / Published: 3 October 2023
(This article belongs to the Section Remote Sensing Image Processing)

Round 1

Reviewer 1 Report

The authors have presented their work under the title “A Swin Transformer with Dynamic High-Pass Preservation for Remote Sensing Image Pansharpening”.

The research contributions of this paper are mainly reflected in the following points:

- The authors have investigated the detail injection mechanism in pansharpening networks. A dynamic high-pass preservation module is developed to enhance the high frequencies present in the input shallow features.

- They achieve this by adaptively learning to generate convolution kernels, employing a distinct kernel at each spatial location so that high frequencies are amplified effectively. The subtraction framework obtains the details directly by differencing the single PAN image with each MS band (a minimal sketch of both ideas follows the reference list below).

- Their work avoids compromising the spatial information in a preprocessing step based on the detail extraction techniques of classical pansharpening approaches, letting the framework spectrally adjust the extracted details through the estimation of a nonlinear, local injection model.

- They developed a full Transformer network for pansharpening, named SwinPAN, based on the Swin Transformer. It introduces content-based interactions between image content and attention weights, resembling spatially varying convolutions.

- This is achieved through a shifted-window mechanism, which enables effective long-range dependency modelling. Notably, the Swin Transformer achieves improved performance while using fewer parameters than the Vision Transformer (ViT).

- Experimental results on three remote sensing datasets (QuickBird, GaoFen2 and WorldView3) demonstrate that the proposed method achieves superior performance compared with other state-of-the-art CNN-based methods.

- The authors are requested to avoid words such as I, we and you in the running text; the paper should be written in the third person.

- The authors are requested to add the following papers, which are relevant to this work, to the reference list and to cite them in the running text:

 

1.    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85048196566&doi=10.3923%2fjeasci.2018.1606.1612&partnerID=40&md5=6e5eb995d61528240cff131ec45fd3b0

2.    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85126820141&doi=10.1007%2fs13042-022-01524-8&partnerID=40&md5=4421e4c4cd9db0380f8325803a0b4ee2

3.    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85132786767&doi=10.1007%2f978-981-16-8739-6_3&partnerID=40&md5=8a15c3bcf73edaad60b2a50447d7326a

4.    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85128518037&doi=10.1007%2fs11063-021-10679-4&partnerID=40&md5=a721b35ad7b90875db9856b407d7f0b2

5.    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85096224355&doi=10.1007%2fs12524-020-01265-7&partnerID=40&md5=b849e51ed4724a086901f2923c055698
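
The following minimal PyTorch sketch illustrates the two ideas summarized in the contribution list above: per-location (dynamic) convolution kernels and the subtraction framework. It is a simplified illustration only, not the authors' SwinPAN implementation; the names DynamicHighPass, kernel_head and subtraction_details are hypothetical.

```python
# Minimal sketch, assuming a PyTorch setting; NOT the authors' SwinPAN code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicHighPass(nn.Module):
    """Predicts a distinct k x k smoothing kernel per spatial location and
    returns the high-pass residual (input minus adaptive local average)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Small head mapping input features to per-pixel kernel weights.
        self.kernel_head = nn.Conv2d(channels, kernel_size * kernel_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Per-pixel kernels, normalized over the k*k taps: (B, k*k, H, W).
        kernels = torch.softmax(self.kernel_head(x), dim=1)
        # Unfold into k*k neighborhoods: (B, C, k*k, H*W).
        patches = F.unfold(x, self.k, padding=self.k // 2)
        patches = patches.view(b, c, self.k * self.k, h * w)
        # Location-specific smoothing, shared across channels.
        smooth = (patches * kernels.view(b, 1, self.k * self.k, h * w)).sum(2)
        # High-pass residual; adding it back to x amplifies high frequencies.
        return x - smooth.view(b, c, h, w)

def subtraction_details(pan: torch.Tensor, ms: torch.Tensor) -> torch.Tensor:
    """Subtraction framework: difference the single PAN image with each
    (upsampled) MS band to obtain per-band high-frequency details."""
    ms_up = F.interpolate(ms, size=pan.shape[-2:], mode="bicubic",
                          align_corners=False)
    return pan - ms_up  # (B,1,H,W) - (B,C,H,W) broadcasts to (B,C,H,W)
```

Here, subtraction_details returns one detail band per MS channel, matching the summary above: details are obtained by differencing the single PAN image with each MS band.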

Author Response

We have made a detailed reply to your comments in the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper studies the pansharpening problem with dynamic high-pass preservation, attempting to break through the short-range contextual dependency limitation of convolution operations using the Swin Transformer. Here are some concerns.
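
As background for the shifted-window mechanism invoked here (and in Reviewer 1's summary), the sketch below shows the standard Swin-style window partition plus a cyclic shift. It is a toy illustration with assumed sizes, not the SwinPAN code.

```python
# Toy sketch of Swin-style shifted windows; sizes are assumed for illustration.
import torch

def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """(B, H, W, C) -> (num_windows*B, win, win, C); H and W divisible by win."""
    b, h, w, c = x.shape
    x = x.view(b, h // win, win, w // win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win, win, c)

x = torch.randn(1, 8, 8, 32)                 # toy feature map, window size 4
# Cyclic shift: windows in the next block straddle the previous block's window
# borders, which is what yields cross-window (long-range) interactions.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
windows = window_partition(shifted, win=4)   # self-attention runs per window
```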

(1) The authors should take care with the terminology used in this scientific article. For example, in Figure 2, which illustrates the structure of SwinPAN, the result of subtracting the upsampled MS image from the PAN image is called LR-HF features, and the result of DRNet is called HR-HF features. However, according to the figure, they should have the same spatial size. Why did the authors call them LR- and HR- features? In my understanding, they should be Shallow-HF and Deep-HF features.

(2) Furthermore, one of the biggest innovations is attributed to dynamic high-pass preservation. However, in the most important figure, Figure 2, the reviewer could not identify which module accounts for this innovation. From my point of view, this SwinPAN shows no difference compared with a network without dynamic high-pass preservation.

(3) Additionally, the illustration in Figure 2 and the description in Section 3.1 do not match. Some abbreviations are used without explanation.

(4) Figure 1 carefully classifies the different structures of pansharpening networks. However, the authors ignored algorithm-unrolling-based networks, many of which have been published in top-tier remote sensing and computer vision journals/conferences, for example the first algorithm-unrolling pansharpening network, GPPNN [R1], and its enhanced version [R2]. Following these, [R3] inserted a memory-augmented mechanism to facilitate training.

[R1] S. Xu, J. Zhang, Z. Zhao, K. Sun, J. Liu, C. Zhang. Deep Gradient Projection Networks for Pan-sharpening. CVPR 2021, pp. 1366-1375. https://doi.org/10.1109/CVPR46437.2021.00142

[R2] Jamila Mifdal, Marc Tomás-Cruz, Alessandro Sebastianelli, Bartomeu Coll, Joan Duran. Deep Unfolding for Hypersharpening Using a High-Frequency Injection Module. CVPRW 2023, pp. 2105-2114. https://openaccess.thecvf.com/content/CVPR2023W/EarthVision/html/Tomas-Cruz_Deep_Unfolding_for_Hypersharpening_Using_a_High-Frequency_Injection_Module_CVPRW_2023_paper.html

[R3] Man Zhou, Keyu Yan, Jinshan Pan, Wenqi Ren, Qi Xie & Xiangyong Cao. Memory-Augmented Deep Unfolding Network for Guided Image Super-resolution. IJCV, 131(1): 215-242 (2023). https://doi.org/10.1007/s11263-022-01699-1

(5) The compared methods are old and not state-of-the-art. There have been many Transformer-based pansharpening networks, including [R4-R7]. I know that it is unfair for a CNN to be compared with a Transformer, but the SwinPAN proposed in this article is Transformer-based. At least one Transformer-based network should be compared in the experiments.

[R4] Nan Wang, Xiangjun Meng, Xiangchao Meng, Feng Shao. Convolution-Embedded Vision Transformer With Elastic Positional Encoding for Pansharpening. IEEE Trans. Geosci. Remote Sens. 60: 1-9 (2022). https://doi.org/10.1109/tgrs.2022.3227405

[R5] Xunyang Su, Jinjiang Li, Zhen Hua. Transformer-Based Regression Network for Pansharpening Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 60: 1-23 (2022). https://doi.org/10.1109/TGRS.2022.3152425

[R6] W. G. C. Bandara, V. M. Patel. HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening. CVPR 2022, pp. 1767-1777. https://doi.org/10.1109/CVPR52688.2022.00181

[R7] Feng Zhang, Kai Zhang, Jiande Sun. Multiscale Spatial–Spectral Interaction Transformer for Pan-Sharpening. Remote Sens. 2022, 14(7), 1736; https://doi.org/10.3390/rs14071736

(6) In Section 5.1, an extra ablation study should be added. The current ablation proves that removing the high-pass information leads to worse results; however, I believe most readers would also want to see the results when the high-pass information is generated statically instead of dynamically (a minimal sketch of such a static baseline follows).
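
To make the requested ablation concrete: a static variant would replace the per-pixel predicted kernels with one fixed, input-independent high-pass kernel, e.g. a Laplacian. A minimal sketch under that assumption (hypothetical, not the authors' ablation code):

```python
# Sketch of a "static" high-pass baseline: one fixed Laplacian kernel shared
# by all spatial locations, instead of per-pixel predicted kernels.
import torch
import torch.nn.functional as F

def static_high_pass(x: torch.Tensor) -> torch.Tensor:
    """Apply a fixed 3x3 Laplacian to every channel (depthwise convolution)."""
    lap = torch.tensor([[0., -1., 0.],
                        [-1., 4., -1.],
                        [0., -1., 0.]], device=x.device, dtype=x.dtype)
    c = x.shape[1]
    weight = lap.view(1, 1, 3, 3).repeat(c, 1, 1, 1)  # one kernel per channel
    return F.conv2d(x, weight, padding=1, groups=c)
```

Comparing SwinPAN against such a baseline would isolate how much of the gain comes from the kernels being dynamic rather than from the mere presence of high-pass information.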

Author Response

We have made a detailed reply to your comments in the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

In this paper, a new framework for pansharpening, SwinPAN, is proposed. SwinPAN is a Swin Transformer-based framework that aims to overcome some limitations of previous CNN-based and Transformer-based methods: CNN-based methods fail to capture long-range contextual information, while Transformer-based methods relieve this problem but lose high-resolution details.

The paper's approach is well justified and the proposed framework is well described. Good quantitative and qualitative results on three datasets are shown in the experimental section. The proposed method surpasses other state-of-the-art methods.

Some comments:

- Line 193: the word "shadow" should be replaced with "shallow".
- Figure 3 on page 6 seems to be wrong: W_i should be convolved with X_i.
- Line 218: C_{in} has not been defined.
- Definitions should be included for some of the acronyms used in Subsection 3.3 and in the DRL scheme of Figure 3.

An English text review should be performed.

Author Response

We have made a detailed reply to your comments in the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The authors have addressed most issues. Additionally, some traditional pansharpening methods should be compared.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
