Article

S2Transformer: Exploring Sparsity in Remote Sensing Images for Efficient Super-Resolution

Zicheng Zhang, Hongke Xu, Shan Lin, Dejun Li and Yinghui Gao
1 School of Electronics and Control Engineering, Chang’an University, Xi’an 710000, China
2 Aviation University of Air Force, Nanhu Road Campus, Changchun 130022, China
3 Academy of Military Science, Beijing 100080, China
* Authors to whom correspondence should be addressed.
Sensors 2025, 25(18), 5643; https://doi.org/10.3390/s25185643
Submission received: 8 August 2025 / Revised: 5 September 2025 / Accepted: 7 September 2025 / Published: 10 September 2025
(This article belongs to the Section Sensing and Imaging)

Abstract

Remote sensing image super-resolution (SR) techniques play a crucial role in geographic information analysis, environmental observation, and urban development planning. However, existing approaches are computationally intensive, which hinders them from being applied on resource-constrained devices. Although numerous efforts have focused on efficient image SR, the intrinsic sparsity characteristics of remote sensing images remain under-explored. To tackle these challenges, this paper introduces an efficient SR method founded on a dynamic Sparse Swin Transformer (S2Transformer). First, a dynamic sparse mask module is proposed to distinguish important regions from less informative ones. Subsequently, a dynamic sparse Transformer is developed to adaptively allocate more computational resources to important regions, markedly reducing redundant computation over background regions. Experiments are conducted on several benchmark remote sensing datasets and the results demonstrate that the proposed approach significantly outperforms existing methods in detail restoration, edge sharpness, and robustness, achieving superior PSNR and SSIM scores.

1. Introduction

Remote sensing images play a crucial role in various fields such as geographic information science, environmental monitoring, and urban planning. With the continuous advancement of remote sensing technology, the demand for high-resolution remote sensing images is increasing. However, due to the limitations of sensors, imaging conditions, and transmission bandwidth, the resolution of remote sensing images is often constrained, leading to the loss of image details. To address this issue, remote sensing image super-resolution (SR) has been widely applied to restore these missing details.
The research on remote sensing image SR dates back to the 1980s, initially focusing on traditional interpolation methods such as bicubic interpolation and nearest-neighbor interpolation. Motivated by the great success of deep learning, learning-based methods have been widely investigated and have gradually dominated the research of remote sensing image SR. Although multi-image SR demonstrates superior performance in the area of remote sensing image SR, acquiring multiple images is often challenging in the real world. Consequently, this paper focuses on single-image SR, which has broad applicability in practical scenarios. Despite the promising results achieved by previous methods on benchmark datasets, the expensive computational cost of these methods hinders their application in resource-limited devices.
Recently, many efforts have been made to address this issue. Specifically, Shi et al. [1] proposed ESPCN, an efficient image SR network that accelerates the SR process by utilizing subpixel convolution. Motivated by the success of Transformers, Zhang et al. [2] proposed ELAN, which develops an efficient long-range attention module and acceleration mechanism to achieve efficient image SR. Later, Liu et al. [3] proposed a lightweight image SR method termed CATANet to efficiently aggregate content-similar tokens using a content-aware token aggregation (CATA) module. For remote sensing image SR, Peng et al. proposed CALSRN [4], which reduces the number of parameters by about 30% while maintaining reconstruction quality by fusing the global features of the Swin Transformer with the local features produced by a CNN. Lin et al. proposed DTCNet [5], which distills knowledge from a Transformer-based teacher network to guide a lightweight CNN-based student network. Hou et al. proposed CSwT-SR [6], which combines spatial- and frequency-domain features in an amplitude-phase learning framework to enhance structural details.
Despite the progress made by the aforementioned methods, they commonly rely on techniques developed for general image SR and do not fully consider the characteristics of remote sensing images. In a remote sensing image, most regions are usually background, while targets of interest (e.g., airplanes, ships, and cars) occupy only a few pixels. As a result, considerable redundant computational cost is spent on background regions. To remedy this, we introduce a dynamic Sparse Swin Transformer that can dynamically allocate computation across different regions according to their importance. Particularly, we first develop a dynamic sparse mask module to produce a binary mask that distinguishes patches of higher importance from others in the image. Then, we propose a dynamic Swin Transformer block that can adaptively activate the corresponding patches for subsequent computation based on the learned binary mask. In this way, redundant computation in background regions can be largely reduced to achieve significant speedup on edge devices without decreasing the accuracy of the reconstructed images. As shown in Figure 1, our S2Transformer achieves the best balance between computational efficiency and reconstruction quality.
Overall, the contributions of this paper can be summarized as follows:
  • We develop an efficient remote sensing image super-resolution network by exploiting the inherent sparsity of remote sensing images.
  • We construct a dynamic sparse mask module to distinguish patches of high importance from others in an image, and then guide the Swin Transformer block to adaptively activate corresponding patches for efficient inference.
  • We conduct extensive experiments on multiple remote sensing image datasets, validating the effectiveness and superiority of the proposed method.

2. Related Work

In this section, a brief review of existing image SR methods is first presented. Then, recent advances in network acceleration techniques are discussed.

2.1. Image Super-Resolution

2.1.1. General Image SR

Image super-resolution (SR) aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Over the last decade, significant advancements have been made in the field of general image super-resolution, with a shift from traditional techniques to deep learning approaches.
SRCNN [7] stands out as the pioneering CNN-based SR network, laying the foundation for learning-based methods. Later, VDSR [8] adopted a deeper network to achieve superior performance to SRCNN. Lim et al. [9] employed residual connections to build EDSR with over 60 layers. Subsequently, Zhang et al. presented RCAN [10], which adopts a deep residual network structure with channel attention to focus on more informative features and achieves much higher accuracy than previous approaches. Chen et al. proposed HAT [11] by combining channel attention with self-attention to better capture long-range correspondence in image SR tasks. SMSR [12] learns spatial masks to identify important regions and channel masks to prune redundant channels in unimportant regions. With the emergence of Transformers, many efforts have been made to introduce Transformers to image SR. Specifically, TTSR [13] was developed as the first Transformer-based SR model, which performs reference-based image super-resolution through rigid and soft attention modules. Inspired by the Swin Transformer, SwinIR [14] combined local window self-attention with convolutional operations to further enhance SR performance. ESTNet [15] is an efficient Swin Transformer that uses an ECAB to select key channels and group-wise multi-window self-attention (GAB) to strengthen cross-window modeling, achieving better remote sensing image SR with lower computational cost. Recently, ESRT [16] optimized the structure by integrating CNN and Transformer features, reducing complexity.
With the recent development of Mamba and diffusion models, Xia et al. proposed S3Mamba [17], which combines a scalable state-space model and a scale-aware self-attention mechanism to achieve high-quality arbitrary-scale SR with linear complexity. Di et al. developed QMambaBSR [18], which is capable of efficiently extracting subpixel information while mitigating the impact of noise interference. Liu et al. [19] introduced the first diffusion-based RSISR approach, which leverages low-resolution images as conditional inputs to produce high-resolution outputs.

2.1.2. Remote Sensing Image SR

The success of learning-based image SR methods has promoted the research of remote sensing image SR (RSISR). Particularly, Liebel and Körner [20] proposed the first CNN-based SR method for remote sensing images, termed msiSRCNN. Then, Xu et al. introduced DMCN [21], which performs feature integration by constructing local–global memory connections. Ren et al. [22] proposed ERCNN, which employs feature attention modules to address the mixed-pixel problem. Dong et al. [23] proposed DSSR, which significantly improves the super-resolution of remotely sensed imagery through the introduction of a dense-sampling mechanism, a wide-feature attention module, and a chained training strategy. These methods have made significant contributions to different remote sensing super-resolution tasks, especially in multi-scale feature extraction, residual learning, and adaptive feature fusion, which significantly improve the quality and effectiveness of image reconstruction.
To leverage the great model capacity of Transformers, TransENet [24] was developed by employing a multi-stage enhancement architecture to integrate multi-scale features, overcoming the limitations of traditional models relying on upsampling layers and significantly improving RSISR performance. Subsequently, SLTN [25] constructed a multi-level feature extraction module guided by a spectral response function and employed Transformers for multi-layer nonlinear mapping learning. ESSAformer [26] combines a self-attention mechanism based on spectral correlation coefficients to improve computational efficiency and strengthen spectral feature interaction.
More recently, Mamba-based SR methods have also been investigated for remote sensing images. For instance, ConvMambaSR [27] integrates the state-space model (SSM) with a CNN to model global dependencies and extract local details, achieving better RSISR performance. Later, MambaFormerSR [28] further combines Mamba with a Transformer, designing a state-space attention fusion module and a convolutional Fourier feedforward network (CTFFN).
Although the aforementioned learning-based methods have achieved promising RSISR performance, they suffer from considerable redundant computation, which hinders their deployment on resource-limited devices. While several lightweight SR networks have been developed for general image SR, these methods do not fully consider the inherent characteristics of remote sensing images.

2.2. Network Acceleration

2.2.1. Network Quantization

Network quantization techniques are primarily used to reduce storage cost and accelerate model inference, especially on resource-constrained devices. Quantization methods can be broadly divided into two categories: quantization-aware training (QAT) and post-training quantization (PTQ).
Quantization-aware training (QAT) introduces quantization operations during the training process, enabling the model to adapt to low-bit quantization while training. In the field of image SR, PAMS [29] proposes a tunable truncation parameter that dynamically adjusts the upper limit of the quantization range, thereby mitigating quantization errors. Then, DAQ [30] further employs channel-level distribution-aware quantization, which quantizes each channel based on its distribution characteristics. Subsequently, CADyQ [31] introduces a mix-bit quantization method that is able to dynamically allocate the bit widths to various regions according to their image contents.
Post-training quantization (PTQ) quantizes the model after training is completed, without the need to retrain the entire model. Instead, PTQ methods search for optimal quantization boundaries by minimizing the performance loss caused by quantization. As a pioneering PTQ method for image SR, DBDC+Pac [32] utilizes boundary compression and quantization calibration to reduce quantization loss. However, this method performs poorly on Transformer models, especially in handling activation values with long-tail distributions. Later, MinMax [33] employs a Min–Max quantization strategy, mapping weights and activation values to integers within their observed minimum and maximum values. A simplified sketch of this idea follows.
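As a concrete (and deliberately simplified) illustration of the Min–Max idea, the sketch below quantizes a tensor to 8-bit integers using its observed minimum and maximum; the function names and the asymmetric zero-point handling are illustrative assumptions rather than the exact procedures of [32] or [33].

```python
import numpy as np

def minmax_quantize(x: np.ndarray, num_bits: int = 8):
    """Map the observed [min, max] range of x onto the integer grid [0, 2**num_bits - 1]
    and return the integer codes plus the (scale, zero_point) needed for dequantization."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximate floating-point tensor from the integer codes."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: quantize random activations and measure the reconstruction error.
x = np.random.randn(64, 64).astype(np.float32)
q, s, z = minmax_quantize(x)
print("max abs error:", np.abs(dequantize(q, s, z) - x).max())
```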

2.2.2. Network Pruning

Network pruning reduces computational and storage requirements by removing redundant neurons or connections in a neural network. Existing pruning methods are categorized into three groups: pruning before training (PBT), pruning during training (PDT), and pruning after training (PAT).
Pruning before training (PBT) refers to pruning before the training starts by selecting the least important connections to remove. For instance, SNIP [34] removes the least important weights with a single scoring pass, avoiding extra computational overhead during training. SynFlow [35] determines the pruning structure by evaluating the interactions between network layers. GraSP [36] stabilizes the training process by retaining the weights that contribute the least to the gradient signal.
Pruning during training (PDT) progressively removes unimportant connections during training. For example, SET [37] progressively prunes unimportant weights through dynamic sparse training while regenerating new weights during training. RigL [38] optimizes the pruning process by removing weights with the smallest gradients and gradually restoring sparse connections during training. Network slimming [39] introduces scaling factors for each channel and uses L1 regularization to selectively prune unimportant channels. MorphNet [40] forces network sparsity through the regularization of the batch normalization layers.
Pruning after training (PAT) occurs after training is completed, typically by removing redundant or unimportant weights to reduce the model size. Specifically, LTH [41] prunes the "winning subnet" and retrains it to recover original performance. FreeTickets [42] improves model performance by integrating multiple sparse sub-networks. Network pruning not only reduces the computational load but also speeds up inference, and it can decrease the model’s complexity without significantly affecting its accuracy.
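To make the magnitude-based flavor of pruning-after-training concrete, the sketch below zeroes out a given fraction of the smallest-magnitude weights; this generic criterion is for illustration only and does not reproduce the specific procedures of LTH [41] or FreeTickets [42].

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out roughly the fraction `sparsity` of weights with the smallest magnitude.
    In practice the resulting mask is stored and reapplied during fine-tuning."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value across the whole tensor.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)

# Example: prune 90% of a random 256x256 weight matrix.
w = np.random.randn(256, 256).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.9)
print("remaining non-zero ratio:", np.count_nonzero(w_pruned) / w_pruned.size)
```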

2.2.3. Sparse Inference

Sparse inference accelerates neural network inference by skipping unnecessary computations. This technique exploits the abundance of zero elements in neural networks and reduces the computational burden by optimizing the execution path. In recent years, sparse inference has been widely adopted to speed up neural networks. Willette et al. [43] proposed the Delta Attention method, which improves efficiency and accuracy in long-sequence inference by correcting the distribution shift of sparse attention while preserving high computational sparsity. Zhang et al. [44] proposed SpargeAttn, a sparse-quantized attention operator that requires no retraining. Gao et al. [45] proposed a lightweight and pluggable sparse-attention gating framework termed SeerAttention-R. Acharya et al. [46] proposed Star Attention, a two-stage block-sparse attention mechanism that combines “anchor-block” local encoding with distributed global querying.

3. Method

In this section, an overview of our proposed network is first introduced. Then, the structural details of our proposed dynamic sparse Transformer module and mask prediction block are presented.

3.1. Overview

The overall framework of our method is illustrated in Figure 2. Given a low-resolution image $I_{LR} \in \mathbb{R}^{H \times W \times 3}$, shallow features are first extracted through a convolutional neural network (CNN). Then, the resultant features are patchified to obtain tokens and passed to $M$ dynamic sparse Transformer modules (DSTMs), which form the core of our network. Within each DSTM, the tokens are passed to the mask prediction module to produce a pair of binary masks ($M \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P}}$ and $M_{shift} \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P}}$, where $P$ is the patch size), which are then incorporated by the subsequent $B$ Transformer blocks to achieve adaptive inference. Afterwards, the resultant tokens are merged back into feature maps and fed to another CNN to generate the final high-resolution image $I_{SR} \in \mathbb{R}^{sH \times sW \times 3}$.
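For readers who prefer pseudocode, the following PyTorch-style sketch mirrors the data flow of Figure 2 under our reading of this section; the module names, the 3 × 3 convolutions, and the PixelShuffle upsampler are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class S2TransformerSketch(nn.Module):
    """Sketch of the overall pipeline: CNN head -> M dynamic sparse Transformer
    modules (DSTMs) -> CNN tail with PixelShuffle upsampling."""

    def __init__(self, dim: int = 180, num_modules: int = 6, scale: int = 4, dstm_factory=None):
        super().__init__()
        self.head = nn.Conv2d(3, dim, 3, padding=1)            # shallow feature extraction
        self.body = nn.ModuleList(
            [dstm_factory() for _ in range(num_modules)] if dstm_factory else [])
        self.tail = nn.Sequential(                             # reconstruction + upsampling
            nn.Conv2d(dim, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        feat = self.head(lr)
        for dstm in self.body:                                 # each DSTM predicts its own masks
            feat = dstm(feat)                                  # and runs sparse window attention
        return self.tail(feat)

# Toy forward pass with no DSTMs plugged in (identity body), just to check shapes.
sr = S2TransformerSketch()(torch.randn(1, 3, 64, 64))          # -> (1, 3, 256, 256)
```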

3.2. Dynamic Sparse Transformer Module

As shown in Figure 2, the dynamic sparse Transformer module is a core component of our network, which aims to activate important tokens for subsequent processing while leaving other tokens untouched to improve computational efficiency. The detailed structure of this module is illustrated in Figure 3.
First, the input tokens $T \in \mathbb{R}^{H \times W \times C}$ are fed to a layer normalization layer. Optionally, a cyclic shift is performed on the results, following the Swin Transformer [47]. Then, the normalized tokens are partitioned into $P \times P$ patches, resulting in a tensor of shape $\frac{H}{P} \times \frac{W}{P} \times P \times P \times C$. Afterwards, the corresponding binary mask of shape $\frac{H}{P} \times \frac{W}{P}$ is employed to activate tokens in the corresponding patches, which are then passed to an attention layer to capture the relationships between tokens within each patch. Meanwhile, the remaining tokens skip the attention layer, and the outputs of both paths are merged to recover the original token shape $\frac{H}{P} \times \frac{W}{P} \times P \times P \times C$. In this way, only regions of higher importance are processed by the attention layer, striking a good balance between accuracy and efficiency. Finally, the results are passed through another layer normalization layer and a Multilayer Perceptron (MLP), producing the final output.
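The key routing step of this module can be sketched as follows; the helper name route_windows, the use of torch.nn.MultiheadAttention as a stand-in for window self-attention, and the identity path for skipped windows are our own simplifications of the mechanism described above.

```python
import torch
import torch.nn as nn

def route_windows(windows: torch.Tensor, mask: torch.Tensor, attn) -> torch.Tensor:
    """Run attention only on windows flagged as important.

    windows: (N, P*P, C) tokens of N = (H/P) * (W/P) windows.
    mask:    (N,) binary tensor; 1 = run attention, 0 = keep input tokens unchanged.
    attn:    callable mapping (n, P*P, C) -> (n, P*P, C).
    """
    out = windows.clone()                       # skipped windows pass through untouched
    idx = mask.nonzero(as_tuple=True)[0]        # indices of activated windows
    if idx.numel() > 0:
        out[idx] = attn(windows[idx])           # dense attention only where the mask is 1
    return out

# Toy usage: 16 windows of 8x8 tokens with 180 channels, roughly half activated.
mha = nn.MultiheadAttention(embed_dim=180, num_heads=4, batch_first=True)
attn = lambda x: mha(x, x, x, need_weights=False)[0]
windows = torch.randn(16, 64, 180)
mask = (torch.rand(16) > 0.5).long()
out = route_windows(windows, mask, attn)
```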

Mask Predictor Block

Within each dynamic sparse Transformer module, a mask prediction block aims to identify important patches among the input tokens and produce binary masks. As shown in Figure 4, a cyclic shift is first performed on the input tokens. Next, both the original and shifted tokens are passed to a linear layer and a GeLU layer and then partitioned into patches, resulting in a tensor of shape $\frac{H}{P} \times \frac{W}{P} \times P \times P \times C$. Afterwards, average pooling is conducted on each patch to aggregate all tokens within the patch, producing a tensor of shape $\frac{H}{P} \times \frac{W}{P} \times C$. Finally, the results are fed to a linear layer to generate a pair of binary masks.
Although binary masks are able to distinguish "important" patches from the remaining ones, they are inherently non-differentiable. To make the binary spatial mask learnable, a Gumbel softmax layer is employed. Specifically, the pooled tokens of shape $\frac{H}{P} \times \frac{W}{P} \times C$ are fed to the linear layer to produce $F \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times 2}$. Then, the Gumbel softmax trick is used to obtain a softened spatial mask $M \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P}}$:
$$M[x,y] = \frac{\exp\big((F[x,y,1] + G[x,y,1])/\tau\big)}{\sum_{i=1}^{2} \exp\big((F[x,y,i] + G[x,y,i])/\tau\big)},$$
where $x, y$ are the vertical and horizontal indices, $G \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times 2}$ is a Gumbel noise tensor with all elements following a $\mathrm{Gumbel}(0, 1)$ distribution, and $\tau$ is a temperature parameter. When $\tau \rightarrow \infty$, samples from the Gumbel softmax distribution become uniform. When $\tau \rightarrow 0$, samples from the Gumbel softmax distribution become one-hot and binary masks can be obtained.
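A minimal sketch of this Gumbel softmax mask prediction is shown below, using torch.nn.functional.gumbel_softmax in straight-through (hard) mode; the layer sizes, the ordering of pooling and projection, and the choice of the second channel as the "important" class are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskPredictorSketch(nn.Module):
    """Pool tokens per patch, project to two logits per patch, and draw a
    (differentiable) binary mask with the straight-through Gumbel softmax."""

    def __init__(self, dim: int = 180):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2))

    def forward(self, pooled: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # pooled: (H/P, W/P, C) average-pooled tokens, one vector per patch.
        logits = self.proj(pooled)                            # F in the equation: (H/P, W/P, 2)
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        return onehot[..., 1]                                 # (H/P, W/P) binary mask M

# Toy usage: masks for a 64x64 token map with patch size P = 8 -> an 8x8 patch grid.
mask = MaskPredictorSketch()(torch.randn(8, 8, 180), tau=0.5)
```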

3.3. Loss Function

During the training phase, the L1 loss between the SR result and the ground truth is used for end-to-end optimization of the entire model. Compared to the L2 loss, the L1 loss avoids the blurring effect observed in our experiments.

4. Experiments

In this section, the implementation details are presented, including datasets, training settings, and evaluation metrics. Then, experiments are conducted to compare our proposed method against previous approaches. Finally, ablation experiments are conducted to study the effectiveness of our network designs.

4.1. Implementation Details

  • Datasets: In this section, our experiments are carried out on four widely-used remote sensing datasets: AID [48], DOTA V1.0 [49], DIOR [50], and NWPU-RESISC45 [51]. The specific details of each dataset are presented in Table 1.
  • AID: This dataset contains 3000 training images and 900 test images, each at 600 × 600 pixels, with spatial resolutions ranging from 0.5 m to 8 m. It is designed for scene-classification tasks encompassing 30 categories.
  • DOTA V1.0: This dataset contains 900 test images whose dimensions vary between 800 and 4000 pixels. This dataset targets object-detection tasks with 15 categories.
  • DIOR: This dataset contains 1000 test images at 800 × 800 pixels. It is intended for object-detection tasks across 20 categories.
  • NWPU-RESISC45: This dataset contains 315 test images at 256 × 256 pixels, with spatial resolutions from 0.2 m to 30 m. This dataset serves scene-classification tasks spanning 45 categories.
As illustrated in Table 1, these datasets span diverse spatial scales and task domains, thereby offering sufficiently heterogeneous data to evaluate the generality and effectiveness of the proposed method.
  • Evaluation Metrics: In this paper, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [52] are employed to evaluate the results (a minimal reference sketch of both metrics follows this list). The PSNR is calculated as follows:
    $$\mathrm{PSNR} = 20 \times \log_{10} \left( \frac{MAX_I}{\sqrt{MSE}} \right),$$
    where $MSE$ is the mean squared error between the SR result and the ground truth and $MAX_I$ is the maximum pixel value. SSIM is computed as follows:
    $$\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
    where $\mu_x$ and $\mu_y$ are the pixel means of images $x$ and $y$, $\sigma_{xy}$ is the covariance between $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances, and $C_1$ and $C_2$ are non-zero constants.
  • Model Details: In our experiments, the window size of our model is set to 8 and the MLP expansion ratio is fixed at 4. For our S2Transformer (Ours), the embed dimension is set to 180, the number of Transformer layers is set to 6 and the number of attention heads is set to 4. For the lightweight S2Transformer (Ours_s), the embed dimension is set to 60, the number of Transformer layers is set to 4 and the number of attention heads is set to 6.
  • Training Details: Our experiments were performed on two NVIDIA GeForce RTX 4090 GPUs (Santa Clara, CA, USA). During the training phase, the AdamW optimizer was adopted to train the model with batch size set to 24. The learning rate was initialized as $2 \times 10^{-4}$. The training was stopped after 500 epochs.
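As referenced in the Evaluation Metrics item above, the following minimal sketch computes PSNR and a single-window SSIM directly from the two formulas; it assumes 8-bit images ($MAX_I$ = 255) and the common constants $C_1 = (0.01 \cdot MAX_I)^2$ and $C_2 = (0.03 \cdot MAX_I)^2$, whereas reference SSIM implementations additionally average over a sliding Gaussian window.

```python
import numpy as np

def psnr(sr: np.ndarray, hr: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR (dB) between an SR result and its ground truth, following the formula above."""
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 20.0 * np.log10(max_val / np.sqrt(mse))

def ssim_global(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    """Single-window SSIM over the whole image (no sliding window), following the formula above."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```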

4.2. Performance Evaluation

4.2.1. Bicubic Degradation

We first compare our S2Transformer with previous state-of-the-art SR methods under the widely applied bicubic degradation. Table 2 summarizes the SR performance of different methods on 30 distinct scene categories of the AID dataset. As we can see, our method attains the highest PSNR and SSIM scores across all scene categories and clearly outperforms EDSR, RCAN, HAT, and other baselines by notable margins. Relative to HAT, it achieves the largest PSNR gains in structure-intensive scenes such as Airport (+0.98 dB), Parking (+0.83 dB), and D-Residential (+0.92 dB), indicating that extra computation is indeed routed to fine man-made details. Improvements are also consistent, though generally smaller, in low-texture classes such as Desert (+0.52 dB) and Meadow (+1.26 dB), showing that the model avoids over-processing homogeneous backgrounds while still preserving radiometric fidelity.
We further visualize the results produced by different methods in Figure 5. It can be observed that bicubic interpolation produces blurring artifacts, yielding poor visual quality and correspondingly low PSNR and SSIM scores. EDSR and RCAN can recover some details, yet they still exhibit noticeable blurring at the boundaries of intricate textures and fine objects. Although SRFlow and SRGAN provide a certain degree of visual improvement, their overall PSNR/SSIM metrics remain sub-optimal. By contrast, our proposed method more effectively restores fine details and produces sharper, clearer edges, yielding visuals that closely approximate the ground truth and achieving the highest PSNR and SSIM values. For example, in the first image set, Ours attains 35.12 dB/0.9482, clearly surpassing HAT (34.68 dB/0.9426) and RCAN (32.03 dB/0.9023). These results demonstrate the stability and robustness of our method for remote sensing image SR.
Table 3 presents the performance of various methods on the AID, DOTA, DIOR, and NWPU-RESISC45 datasets. Our proposed method attains the best overall performance with a moderate parameter count of 16.01 million. Meanwhile, the lightweight version, Ours_s, contains merely 4.16 million parameters, yet still delivers highly competitive results. Averaged over all datasets, Ours clearly surpasses widely used CNN/Transformer baselines (e.g., HAT: 31.22/0.8201, RCAN: 31.13/0.8160, SwinIR: 31.16/0.8188, ESRT: 31.17/0.8189, ESTNet: 31.55/0.8263, SMSR: 31.59/0.8272). Ours_s further reduces computation to 8.73 G FLOPs yet remains in the leading cluster, demonstrating the method’s efficiency and suitability for resource-constrained scenarios.
Figure 6 illustrates the visual results and corresponding quantitative metrics (PSNR/SSIM) achieved on the DOTA V1.0 dataset. Qualitatively, bicubic interpolation yields pronounced blurring and detail loss, especially along object boundaries and within texture-rich areas, resulting in inferior visual quality. Methods such as SRFlow and SRGAN offer some visual improvement, yet they still suffer from inadequate detail restoration, for instance, boundary blurring and shape distortion of targets, with SRGAN exhibiting marked instability in certain images. EDSR and RCAN perform relatively better in edge sharpening and detail recovery, but still fall short of the desired level of finesse. In contrast, Ours surpasses all competitors in both visual fidelity and quantitative metrics. For example, on “Img_P0168”, Ours attains 34.87 dB/0.8901, outperforming HAT (34.66 dB/0.8871) and RCAN (33.82 dB/0.8741) and delivering crisper object boundaries and richer texture details. Moreover, the lightweight Ours_s also demonstrates strong performance even under constrained computational settings, highlighting the efficiency and practicality of our approach.
It is evident in Figure 7 that bicubic interpolation suffers from pronounced blurring and detail loss on the DIOR dataset, yielding the poorest visual quality. EDSR and RCAN recover certain details effectively, yet they remain deficient in restoring fine textures and crisp edges; for instance, in Img_01903 and Img_03841, target boundaries still exhibit blurring or artifacts. Meanwhile, SRFlow and SRGAN produce noticeable texture distortions and artifacts across multiple images (e.g., Img_03841 and Img_10862), resulting in substantially lower PSNR and SSIM scores. By contrast, Ours exhibits clear superiority in both visual fidelity and quantitative performance. For example, on Img_03758, Ours achieves the highest score of 35.63 dB/0.9335, surpassing the next-best HAT (35.21 dB/0.9318) and RCAN (32.62 dB/0.9171).
We also visualize the results achieved on the NWPU-RESISC45 dataset in Figure 8. Bicubic interpolation yields the lowest image quality, with substantial blurring of details particularly along object boundaries and within textured areas, as is evident in the “basketball_court” and “freeway” scenes. EDSR, RCAN, and VDSR improve sharpness to some extent, yet they remain deficient in texture fidelity and contour clarity, performing poorly in complex scenarios such as “parking_lot.” By contrast, Ours yields pronounced visual improvements; for instance, in “basketball_court,” it achieves the highest score of 29.35 dB/0.7699. Even in the relatively complex “freeway” scene, Ours attains the top score of 27.84 dB/0.7373, attesting to its robustness and generalization capability. Moreover, the lightweight Ours_s exhibits consistently strong performance, underscoring the efficiency of the proposed dynamic sparse Transformer architecture.

4.2.2. Realistic Degradations

Following [53], experiments are also conducted on realistic degradations. Specifically, HR images are first blurred using $21 \times 21$ anisotropic Gaussian kernels and then bicubically downsampled, with noise added afterwards. The blur kernel is determined by a Gaussian probability density function $N(0, \Sigma)$, where the covariance $\Sigma$ is determined by two random eigenvalues $\lambda_1, \lambda_2 \sim U(0.2, 4)$ and a random rotation angle $\theta \sim U(0, \pi)$. The noise level ranges within [0, 25].
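A minimal sketch of this degradation pipeline is given below, assuming SciPy is available; the kernel is built from the sampled covariance exactly as described above, while the strided downsampling in degrade() is a simplification of bicubic downsampling, and both function names are our own.

```python
import numpy as np
from scipy.ndimage import convolve

def anisotropic_gaussian_kernel(size: int = 21, rng=None) -> np.ndarray:
    """Sample a size x size anisotropic Gaussian blur kernel with eigenvalues
    lambda_1, lambda_2 ~ U(0.2, 4) and rotation theta ~ U(0, pi)."""
    rng = rng or np.random.default_rng()
    lam1, lam2 = rng.uniform(0.2, 4.0, size=2)
    theta = rng.uniform(0.0, np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    sigma = rot @ np.diag([lam1, lam2]) @ rot.T            # covariance matrix Sigma
    coords = np.arange(size) - size // 2
    grid = np.stack(np.meshgrid(coords, coords), axis=-1)  # (size, size, 2) pixel offsets
    expo = np.einsum("...i,ij,...j->...", grid, np.linalg.inv(sigma), grid)
    kernel = np.exp(-0.5 * expo)
    return kernel / kernel.sum()

def degrade(hr: np.ndarray, scale: int = 4, noise_level: float = 10.0, rng=None) -> np.ndarray:
    """Blur a single-channel HR image, downsample it, and add Gaussian noise."""
    rng = rng or np.random.default_rng()
    blurred = convolve(hr.astype(np.float64), anisotropic_gaussian_kernel(rng=rng), mode="reflect")
    lr = blurred[::scale, ::scale]                         # stand-in for bicubic downsampling
    return np.clip(lr + rng.normal(0.0, noise_level, lr.shape), 0.0, 255.0)
```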
Table 4 presents the results achieved on the DIOR dataset under varying noise levels (0, 5, 10) and Gaussian blur kernels. Ours exhibits the highest robustness across all noise levels and anisotropic blur settings, with PSNR values that substantially surpass those of competing approaches such as EDSR, RCAN, VDSR, and HAT. For instance, at a noise level of 0, Ours attains an average PSNR of 31.21 dB, whereas the runner-up reaches only 31.00 dB. This margin widens at higher noise levels (5 and 10), further underscoring the superiority of the proposed method.
We further visualize the results in Figure 9, which illustrates the visual restoration results and quantitative metrics (PSNR/SSIM) of different methods. As noise and blur intensities increase, previous methods experience a marked decline in their ability to restore fine details. For instance, at a noise level of 10, images reconstructed by these methods exhibit pronounced blurring and structural loss, with PSNR and SSIM scores dropping appreciably. By contrast, our proposed method demonstrates pronounced robustness across all degradation settings and effectively restores image details while sustaining high visual fidelity and metric scores. Concretely, under the first anisotropic Gaussian kernel ($[\lambda_1, \lambda_2, \theta] = [2.0, 0.6, 0]$), Ours attains 31.18 dB/0.8143, outperforming the runner-up Ours_s (31.15 dB/0.8127) and all other methods. This advantage persists at a noise level of 10, underscoring our method’s strong noise resilience.
We also visualize the results achieved by different methods in Figure 10. It can be observed that, as noise increases, the performance of all models decreases, while Ours decreases the least, showing greater robustness. Under the same noise and blur conditions, Ours consistently achieves the highest PSNR and SSIM values, followed by Ours_s. For example, under the first anisotropic Gaussian kernel ($[\lambda_1, \lambda_2, \theta] = [3.4, 3.2, \pi]$) combined with the extreme conditions of high noise (noise = 10) and strong blurring, the advantage of Ours is even more pronounced. These results further validate the effectiveness and superiority of our method.
In summary, the visual and quantitative results consistently highlight the superiority of Ours across diverse noise and blur conditions, further confirming the broad applicability of our model to complex remote sensing image degradations.

4.3. Model Analyses

4.3.1. Mask Predictor Block

To verify the effectiveness of the Mask Predictor Block (MPB), we conducted an ablation study in Table 5. When the MPB is removed (Baseline), the model processes all regions equally and incurs a much higher computational cost. Once the MPB is introduced, Ours maintains competitive performance while the FLOPs are significantly reduced. This is because the MPB can recognize the importance of different regions and focus computation on those with a high probability of containing objects. These findings demonstrate the effectiveness of our MPB.

4.3.2. Visualization of Learned Masks

We further visualize our learned masks in Figure 11. As we can see, our mask prediction module is able to identify patches with texture-rich edges and targets very well. For example, on Img_Parking_314, we can see that vehicles in the parking lot are recognized as important regions. Furthermore, as the network depth increases, the learned mask gradually focuses on fewer regions. By skipping the computation of these patches, inference efficiency can be improved while maintaining superior performance.

5. Discussion

Our approach differs from prior RSISR methods by introducing a dynamic, block-wise sparse inference mechanism. Conventional baselines (e.g., SwinIR [14]) execute dense self-attention over all regions regardless of their image content. To improve efficiency, many efforts rely on network pruning [54,55,56], network quantization [29,57,58], or lightweight designs [59,60]. For example, ESTNet [15] develops group-wise attention and channel attention modules to boost the efficiency of Transformer-based SR. A related prior work, SMSR [12], reveals that most LR images contain large flat regions where dense computation is wasteful. However, SMSR was specially developed for CNN-based structures and cannot be directly extended to Transformers. In this paper, S2Transformer learns binary, patch-level spatial masks that route only high-importance regions through the attention and MLP branches, while bypassing low-importance areas with a light path. This content-adaptive approach aligns better with the sparse nature of remote sensing imagery, which typically contains large homogeneous areas (e.g., sea, cropland, desert) with sparse but critical objects (e.g., roads, buildings, vessels).
Compared with the baseline SwinIR, the key difference is a shift in the inference paradigm rather than the choice of backbone. SwinIR centers on local window self-attention with shifted windows and convolutions but treats all tokens uniformly, making computation weakly coupled to content. In contrast, our model explicitly predicts binary masks per block to preserve the coverage of cross-window salient regions under the shifted-window scheme. Then, only activated tokens are processed by attention, while non-activated tokens follow a lightweight recovery path. As a result, our method maintains high reconstruction quality while substantially reducing redundant computation and memory footprint, offering a more practical efficiency–accuracy trade-off for resource-constrained deployments.

6. Conclusions

This paper proposes an efficient remote sensing image SR approach founded on a dynamic Sparse Swin Transformer (S2Transformer), which is tailored to exploit the sparsity of target regions and the redundancy of background information in remote sensing images. Our sparse Transformer blocks can adaptively identify regions of interest and markedly reduce the computational load on background areas, thereby boosting efficiency while preserving high-quality reconstructions. Experiments on remote sensing datasets show that the proposed approach delivers marked gains in visual quality, quantitative metrics, and robustness. Additional studies under various noise levels and blur degradations confirm that the method reliably recovers fine details even in complex scenarios.

Author Contributions

Z.Z. and S.L. designed the S2Transformer structure and compared the methods. Z.Z. and H.X. analyzed and summarized the experimental results, data, and visualization images. Z.Z. wrote the manuscript. D.L. and Y.G. provided reliable advice for writing and during the revision process. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (52172302), the Jilin Province Science and Technology Development Projects (20250102209JC), and the Science and Technology Research Projects of the Education Office of Jilin Province (JJKH20251951KJ).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new datasets were created or analyzed.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  2. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In The European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 649–667. [Google Scholar]
  3. Liu, X.; Liu, J.; Tang, J.; Wu, G. CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 17902–17912. [Google Scholar]
  4. Peng, G.; Xie, M.; Fang, L. Context-aware lightweight remote-sensing image super-resolution network. Front. Neurorobotics 2023, 17, 1220166. [Google Scholar] [CrossRef]
  5. Lin, C.; Mao, X.; Qiu, C.; Zou, L. Dtcnet: Transformer-cnn distillation for super-resolution of remote sensing image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11117–11133. [Google Scholar] [CrossRef]
  6. Hou, M.; Huang, Z.; Yu, Z.; Yan, Y.; Zhao, Y.; Han, X. CSwT-SR: Conv-swin transformer for blind remote sensing image super-resolution with amplitude-phase learning and structural detail alternating learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5629514. [Google Scholar] [CrossRef]
  7. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  8. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  9. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  10. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  11. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  12. Kang, X.; Duan, P.; Li, J.; Li, S. Efficient swin transformer for remote sensing image super-resolution. IEEE Trans. Image Process. 2024, 33, 6367–6379. [Google Scholar] [CrossRef]
  13. Wang, Y.; Jin, S.; Yang, Z.; Guan, H.; Ren, Y.; Cheng, K.; Zhao, X.; Liu, X.; Chen, M.; Liu, Y.; et al. TTSR: A transformer-based topography neural network for digital elevation model super-resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4403719. [Google Scholar] [CrossRef]
  14. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
  15. Wang, L.; Dong, X.; Wang, Y.; Ying, X.; Lin, Z.; An, W.; Guo, Y. Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4917–4926. [Google Scholar]
  16. Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 457–466. [Google Scholar]
  17. Xia, P.; Peng, L.; Di, X.; Pei, R.; Wang, Y.; Cao, Y.; Zha, Z.J. S3mamba: Arbitrary-scale super-resolution via scaleable state space model. arXiv 2024, arXiv:2411.11906. [Google Scholar]
  18. Di, X.; Peng, L.; Xia, P.; Li, W.; Pei, R.; Cao, Y.; Wang, Y.; Zha, Z.J. Qmambabsr: Burst image super-resolution with query state space model. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 23080–23090. [Google Scholar]
  19. Liu, J.; Yuan, Z.; Pan, Z.; Fu, Y.; Liu, L.; Lu, B. Diffusion model with detail complement for super-resolution of remote sensing. Remote Sens. 2022, 14, 4834. [Google Scholar] [CrossRef]
  20. Liebel, L.; Körner, M. Single-image super resolution for multispectral remote sensing data using convolutional neural networks. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 41, 883–890. [Google Scholar] [CrossRef]
  21. Xu, W.; Guangluan, X.; Wang, Y.; Sun, X.; Lin, D.; Yirong, W. High quality remote sensing image super-resolution using deep memory connected network. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 8889–8892. [Google Scholar]
  22. Ren, C.; He, X.; Qing, L.; Wu, Y.; Pu, Y. Remote sensing image recovery via enhanced residual learning and dual-luminance scheme. Knowl.-Based Syst. 2021, 222, 107013. [Google Scholar] [CrossRef]
  23. Dong, X.; Sun, X.; Jia, X.; Xi, Z.; Gao, L.; Zhang, B. Remote sensing image super-resolution using novel dense-sampling networks. IEEE Trans. Geosci. Remote Sens. 2020, 59, 1618–1633. [Google Scholar] [CrossRef]
  24. Lei, S.; Shi, Z.; Mo, W. Transformer-based multistage enhancement for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5615611. [Google Scholar] [CrossRef]
  25. Li, Z.; Li, L.; Liu, B.; Cao, Y.; Zhou, W.; Ni, W.; Yang, Z. Spectral-learning-based transformer network for the spectral super-resolution of remote-sensing degraded images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5505705. [Google Scholar] [CrossRef]
  26. Zhang, M.; Zhang, C.; Zhang, Q.; Guo, J.; Gao, X.; Zhang, J. ESSAformer: Efficient transformer for hyperspectral image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 23073–23084. [Google Scholar]
  27. Zhu, Q.; Zhang, G.; Zou, X.; Wang, X.; Huang, J.; Li, X. Convmambasr: Leveraging state-space models and cnns in a dual-branch architecture for remote sensing imagery super-resolution. Remote Sens. 2024, 16, 3254. [Google Scholar] [CrossRef]
  28. Zhi, R.; Fan, X.; Shi, J. MambaFormerSR: A lightweight model for remote-sensing image super-resolution. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6015705. [Google Scholar] [CrossRef]
  29. Li, H.; Yan, C.; Lin, S.; Zheng, X.; Zhang, B.; Yang, F.; Ji, R. Pams: Quantized super-resolution via parameterized max scale. In The European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 564–580. [Google Scholar]
  30. Hong, C.; Kim, H.; Baik, S.; Oh, J.; Lee, K.M. Daq: Channel-wise distribution-aware quantization for deep image super-resolution networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2675–2684. [Google Scholar]
  31. Hong, C.; Baik, S.; Kim, H.; Nah, S.; Lee, K.M. Cadyq: Content-aware dynamic quantization for image super-resolution. In The European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 367–383. [Google Scholar]
  32. Tu, Z.; Hu, J.; Chen, H.; Wang, Y. Toward accurate post-training quantization for image super resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5856–5865. [Google Scholar]
  33. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2704–2713. [Google Scholar]
  34. Lee, N.; Ajanthan, T.; Torr, P.H. Snip: Single-shot network pruning based on connection sensitivity. arXiv 2018, arXiv:1810.02340. [Google Scholar]
  35. Tanaka, H.; Kunin, D.; Yamins, D.L.; Ganguli, S. Pruning neural networks without any data by iteratively conserving synaptic flow. Adv. Neural Inf. Process. Syst. 2020, 33, 6377–6389. [Google Scholar]
  36. Wang, C.; Zhang, G.; Grosse, R. Picking winning tickets before training by preserving gradient flow. arXiv 2020, arXiv:2002.07376. [Google Scholar] [CrossRef]
  37. Mocanu, D.C.; Mocanu, E.; Stone, P.; Nguyen, P.H.; Gibescu, M.; Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nat. Commun. 2018, 9, 2383. [Google Scholar] [CrossRef]
  38. Evci, U.; Gale, T.; Menick, J.; Castro, P.S.; Elsen, E. Rigging the lottery: Making all tickets winners. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 2943–2952. [Google Scholar]
  39. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2736–2744. [Google Scholar]
  40. Gordon, A.; Eban, E.; Nachum, O.; Chen, B.; Wu, H.; Yang, T.J.; Choi, E. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1586–1595. [Google Scholar]
  41. Frankle, J.; Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv 2018, arXiv:1803.03635. [Google Scholar]
  42. Liu, S.; Chen, T.; Atashgahi, Z.; Chen, X.; Sokar, G.; Mocanu, E.; Pechenizkiy, M.; Wang, Z.; Mocanu, D.C. Deep ensembling with no overhead for either training or testing: The all-round blessings of dynamic sparsity. arXiv 2021, arXiv:2106.14568. [Google Scholar]
  43. Willette, J.; Lee, H.; Hwang, S.J. Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction. arXiv 2025, arXiv:2505.11254. [Google Scholar] [CrossRef]
  44. Zhang, J.; Xiang, C.; Huang, H.; Wei, J.; Xi, H.; Zhu, J.; Chen, J. Spargeattn: Accurate sparse attention accelerating any model inference. arXiv 2025, arXiv:2502.18137. [Google Scholar] [CrossRef]
  45. Gao, Y.; Guo, S.; Cao, S.; Xia, Y.; Cheng, Y.; Wang, L.; Ma, L.; Sun, Y.; Ye, T.; Dong, L.; et al. SeerAttention-R: Sparse Attention Adaptation for Long Reasoning. arXiv 2025, arXiv:2506.08889. [Google Scholar]
  46. Acharya, S.; Jia, F.; Ginsburg, B. Star attention: Efficient llm inference over long sequences. arXiv 2024, arXiv:2411.17116. [Google Scholar] [CrossRef]
  47. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  48. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  49. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  50. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  51. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  52. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  53. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Wang, Y.; Zhang, L. From degrade to upgrade: Learning a self-supervised degradation guided adaptive network for blind remote sensing image super-resolution. Inf. Fusion 2023, 96, 297–311. [Google Scholar] [CrossRef]
  54. Wang, L.; Guo, Y.; Dong, X.; Wang, Y.; Ying, X.; Lin, Z.; An, W. Exploring fine-grained sparsity in convolutional neural networks for efficient inference. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4474–4493. [Google Scholar] [CrossRef]
  55. Zhan, Z.; Gong, Y.; Zhao, P.; Yuan, G.; Niu, W.; Wu, Y.; Zhang, T.; Jayaweera, M.; Kaeli, D.; Ren, B.; et al. Achieving on-mobile real-time super-resolution with neural architecture and pruning search. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021; pp. 4821–4831. [Google Scholar]
  56. Zhang, Y.; Zhang, K.; Van Gool, L.; Danelljan, M.; Yu, F. Lightweight image super-resolution via flexible meta pruning. In Proceedings of the ICML, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  57. Wang, L.; Dong, X.; Wang, Y.; Liu, L.; An, W.; Guo, Y. Learnable lookup table for neural network quantization. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 12423–12433. [Google Scholar]
  58. Yamamoto, K. Learnable companding quantization for accurate low-bit neural networks. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021; pp. 5029–5038. [Google Scholar]
  59. Zhang, X.; Zhang, Y.; Yu, F. HiT-SR: Hierarchical transformer for efficient image super-resolution. In Proceedings of the ECCV, Milano, Italy, 29 September–4 October 2024; pp. 483–500. [Google Scholar]
  60. Zamfir, E.; Wu, Z.; Mehta, N.; Zhang, Y.; Timofte, R. See more details: Efficient image super-resolution by experts mining. In Proceedings of the ICML, Vienna, Austria, 21–27 July 2024. [Google Scholar]
Figure 1. Quantitative comparison between our method and previous approaches. Our method well balances PSNR (dB) accuracy against parameter count and computational cost FLOPs (G), with metrics averaged across four remote sensing datasets.
Figure 2. An overview of our proposed network. The input LR image is first fed to a CNN for feature extraction. Then, the resultant features are passed to M dynamic sparse Transformer modules and another CNN to reconstruct the SR result.
Figure 3. An illustration of our dynamic sparse transformer module.
Figure 4. An illustration of our Mask Predictor Block.
Figure 5. Visualization ×4 results using different methods on “parking_251” and “parking_314” samples of AID. The best result PSNR/SSIM is shown in boldface. Magnify to get a clearer view.
Figure 6. Visualization ×4 results using different methods on DOTA V1.0. The best PSNR/SSIM is shown in boldface. Magnify to get a clearer view.
Figure 7. Visualization ×4 results using different methods on DIOR. The best PSNR/SSIM is shown in boldface. Magnify to get a clearer view.
Figure 8. Visual x4 comparisons on NWPU-RESISC45.
Figure 9. The visual comparison of experiments on noises and anisotropic Gaussian blur. This image, “Img_01903”, is taken from the DIOR dataset. The best PSNR/SSIM is shown in boldface.
Figure 10. The visual comparison of experiments on noises and anisotropic Gaussian blur. This image, “Img_P0189”, is taken from the DOTA V1.0 dataset. The best PSNR/SSIM is shown in boldface.
Figure 11. Visualization of learned sparse masks. (a–d) visualize the masks produced at stages 1–4.
Table 1. A detailed summary of the dataset attributes, including AID, DOTA, DIOR, and NWPU-RESISC45.
Attribute | AID [48] (Train) | AID [48] (Test) | DOTA V1.0 [49] (Test) | DIOR [50] (Test) | NWPU-RESISC45 [51] (Test)
Used (Total) Images | 3000 (10,000) | 900 (10,000) | 900 (2806) | 1000 (23,463) | 315 (31,500)
Image Size | 600 × 600 | 600 × 600 | 800–4000 | 800 × 800 | 256 × 256
Resolution | 0.5–8 m | 0.5–8 m | – | – | 0.2–30 m
Categories | 30 | 30 | 15 | 20 | 45
Task | Scene Classification | Scene Classification | Object Detection | Object Detection | Scene Classification
Table 2. Quantitative results on the AID test set. Here we report the PSNR/SSIM performance of SISR models on 30 classes of scenes. The best result is shown in boldface.
Land Cover | Bicubic | EDSR | VDSR | SRFlow | RCAN | SRGAN | HAT | This Study
(each cell: PSNR/SSIM)
Airport | 27.83/0.7554 | 29.93/0.8282 | 30.12/0.8301 | 28.97/0.7916 | 30.13/0.8318 | 30.29/0.8416 | 30.15/0.8319 | 31.13/0.8571
Bare Land | 35.60/0.8564 | 36.94/0.8837 | 36.18/0.8854 | 35.75/0.8634 | 36.99/0.8844 | 35.86/0.8624 | 36.88/0.8841 | 37.39/0.8987
Baseball Field | 31.00/0.8305 | 33.05/0.8765 | 32.16/0.8351 | 32.89/0.8687 | 33.30/0.8789 | 32.78/0.8635 | 33.25/0.8789 | 33.81/0.8817
Beach | 32.90/0.8446 | 34.18/0.8727 | 34.19/0.8754 | 34.13/0.8706 | 34.33/0.8751 | 34.26/0.8814 | 34.34/0.8756 | 34.99/0.8849
Bridge | 30.22/0.8283 | 32.93/0.8800 | 32.91/0.8795 | 32.57/0.8653 | 33.13/0.8819 | 32.64/0.8689 | 33.04/0.8809 | 33.68/0.8946
Center | 26.51/0.6944 | 28.77/0.7921 | 28.68/0.7859 | 27.65/0.7721 | 28.96/0.7966 | 27.47/0.7671 | 28.92/0.7956 | 29.67/0.8159
Church | 24.29/0.6333 | 26.30/0.7469 | 26.35/0.7519 | 25.91/0.7389 | 26.54/0.7529 | 25.93/0.7409 | 26.56/0.7532 | 27.75/0.7758
Commercial | 27.33/0.7174 | 29.01/0.7940 | 28.98/0.7915 | 28.96/0.7840 | 29.24/0.8000 | 29.36/0.7978 | 29.21/0.8007 | 30.29/0.8179
D-Residential | 22.93/0.5671 | 24.38/0.6839 | 24.71/0.6926 | 24.15/0.6737 | 24.63/0.6930 | 24.37/0.6889 | 24.67/0.6936 | 25.59/0.7316
Desert | 39.26/0.9100 | 40.20/0.9268 | 40.05/0.9175 | 40.03/0.9218 | 40.24/0.9272 | 40.56/0.9356 | 40.37/0.9278 | 40.89/0.9413
Farmland | 33.10/0.8226 | 35.00/0.8683 | 34.81/0.8628 | 34.86/0.8616 | 35.11/0.8701 | 34.06/0.8349 | 35.03/0.8691 | 35.68/0.8824
Forest | 28.79/0.6605 | 29.85/0.7315 | 29.76/0.7286 | 29.45/0.7115 | 29.95/0.7345 | 29.15/0.7019 | 30.01/0.7363 | 30.49/0.7595
Industrial | 26.77/0.6952 | 28.88/0.7931 | 28.67/0.7927 | 28.68/0.7731 | 29.04/0.7977 | 28.61/0.7721 | 29.04/0.7980 | 29.98/0.8159
Meadow | 33.86/0.7483 | 34.63/0.7804 | 34.59/0.7795 | 34.37/0.7704 | 34.65/0.7815 | 34.29/0.7694 | 34.70/0.7815 | 35.96/0.8023
M-Residential | 26.36/0.6335 | 28.34/0.7365 | 28.31/0.7349 | 28.17/0.7267 | 28.52/0.7415 | 28.08/0.7159 | 28.46/0.7408 | 29.65/0.7610
Mountain | 29.51/0.7349 | 30.63/0.7885 | 30.60/0.7849 | 30.26/0.7785 | 30.72/0.7908 | 30.31/0.7816 | 30.78/0.7923 | 31.59/0.8098
Park | 29.06/0.7530 | 30.54/0.8130 | 30.51/0.8009 | 30.01/0.7954 | 30.72/0.8170 | 30.09/0.8004 | 30.71/0.8189 | 31.56/0.8327
Parking | 24.24/0.7060 | 27.25/0.8317 | 27.19/0.8294 | 27.11/0.8287 | 27.50/0.8372 | 27.16/0.8297 | 27.56/0.8405 | 28.39/0.8585
Playground | 32.64/0.8450 | 35.37/0.8943 | 35.29/0.8907 | 35.14/0.8816 | 35.61/0.8964 | 35.26/0.8856 | 35.49/0.8959 | 36.56/0.9115
Pond | 30.70/0.8167 | 32.11/0.8542 | 32.08/0.8524 | 31.94/0.8342 | 32.21/0.8555 | 31.92/0.8348 | 32.18/0.8555 | 33.89/0.8716
Port | 26.67/0.7986 | 28.50/0.8596 | 28.55/0.8679 | 28.31/0.8497 | 28.76/0.8635 | 28.33/0.8501 | 28.81/0.8638 | 29.49/0.8789
Railway Station | 26.78/0.6793 | 28.72/0.7738 | 28.77/0.7786 | 28.55/0.7654 | 28.91/0.7789 | 28.51/0.7667 | 28.88/0.7780 | 29.87/0.7979
Resort | 26.79/0.7029 | 28.52/0.7799 | 28.53/0.7801 | 27.31/0.7406 | 28.72/0.7846 | 27.36/0.7469 | 28.71/0.7849 | 29.56/0.7936
River | 30.37/0.7402 | 31.55/0.7891 | 31.51/0.7876 | 31.04/0.7784 | 31.62/0.7906 | 31.11/0.7841 | 31.63/0.7909 | 32.27/0.8016
School | 27.41/0.7237 | 29.36/0.8044 | 29.31/0.7997 | 29.06/0.7915 | 29.55/0.8089 | 29.09/0.7921 | 29.54/0.8104 | 30.13/0.8269
S-Residential | 26.66/0.6006 | 27.71/0.6728 | 27.69/0.6736 | 27.56/0.6621 | 27.84/0.6759 | 27.59/0.6639 | 27.88/0.6759 | 28.89/0.6956
Square | 28.55/0.7391 | 30.84/0.8200 | 30.87/0.8238 | 30.53/0.8006 | 31.03/0.8237 | 30.59/0.8027 | 31.00/0.8251 | 31.73/0.8357
Stadium | 27.16/0.7547 | 29.63/0.8387 | 29.57/0.8346 | 29.41/0.8216 | 29.82/0.8425 | 29.57/0.8276 | 29.77/0.8422 | 30.54/0.8526
Storage Tanks | 25.65/0.6793 | 27.44/0.7664 | 27.33/0.7643 | 27.17/0.7469 | 27.61/0.7705 | 27.21/0.7486 | 27.60/0.7698 | 28.56/0.7813
Viaduct | 26.97/0.6755 | 28.99/0.7757 | 28.72/0.7689 | 28.67/0.7559 | 29.16/0.7805 | 28.71/0.7613 | 29.11/0.7794 | 30.16/0.7889
Average | 28.86/0.7382 | 30.65/0.8086 | 30.43/0.7988 | 30.28/0.7802 | 30.82/0.8121 | 30.51/0.7942 | 30.81/0.8124 | 31.13/0.8229
Table 3. Quantitative Results on AID, DOTA V1.0, DIOR and NWPU-RESISC45 Test Sets. The best result is shown in boldface.
Methods | #Param. | FLOPs | AID | DOTA V1.0 | DIOR | NWPU-RESISC45 | Average
(each dataset cell: PSNR/SSIM)
Bicubic | – | – | 28.86/0.7382 | 31.16/0.7947 | 28.57/0.7432 | 26.20/0.6873 | 28.70/0.7587
EDSR | 43.09 M | 823.34 G | 30.65/0.8086 | 33.64/0.8648 | 30.63/0.8116 | 28.91/0.7458 | 30.96/0.8077
VDSR | 2.55 M | 36.78 G | 30.43/0.7988 | 33.54/0.8563 | 30.15/0.8064 | 28.65/0.7495 | 30.69/0.8025
RCAN | 15.59 M | 261.01 G | 30.82/0.8121 | 33.86/0.8680 | 30.85/0.8159 | 28.99/0.7681 | 31.13/0.8160
SRFlow | 16.78 M | 321.4 G | 30.28/0.7802 | 33.58/0.8561 | 30.39/0.8026 | 29.05/0.7687 | 30.82/0.8019
SRGAN | 2.79 M | 139.26 G | 30.51/0.7942 | 33.47/0.8526 | 30.45/0.8049 | 28.97/0.7712 | 30.85/0.8057
HAT | 40.32 M | 672.15 G | 30.81/0.8124 | 33.99/0.8684 | 30.87/0.8161 | 29.19/0.7835 | 31.22/0.8201
SwinIR | 5.85 M | 152.42 G | 30.85/0.8137 | 33.93/0.8671 | 30.75/0.8133 | 29.12/0.7812 | 31.16/0.8188
ESRT | 8.7 M | 187.5 G | 30.75/0.8089 | 33.91/0.8675 | 30.89/0.8173 | 29.11/0.7821 | 31.17/0.8189
ESTNet | 3.28 M | 89.32 G | 30.89/0.8158 | 34.28/0.8697 | 31.27/0.8257 | 29.78/0.7943 | 31.55/0.8263
SMSR | 1.2 M | 87.5 G | 30.95/0.8176 | 34.32/0.8703 | 31.39/0.8276 | 29.71/0.7936 | 31.59/0.8272
Ours_s | 4.16 M | 8.73 G | 31.04/0.8195 | 35.52/0.8856 | 32.24/0.8358 | 29.87/0.8049 | 32.17/0.8364
Ours | 16.01 M | 114.49 G | 31.13/0.8229 | 35.69/0.8881 | 32.31/0.8374 | 29.95/0.8078 | 32.27/0.8391
Table 4. We present the PSNR results under various noise conditions and Anisotropic Gaussian blurs. All models are evaluated on the DIOR dataset across 11 representative kernel widths and noise intensities ranging between 0 and 10. The best results are shown in boldface.
Method | Noise | Kernel 1 | Kernel 2 | Kernel 3 | Kernel 4 | Kernel 5 | Kernel 6 | Kernel 7 | Kernel 8 | Kernel 9 | Kernel 10 | Kernel 11 | Average
(Kernels 1–11 denote the 11 representative anisotropic Gaussian blur kernels, which are displayed as images in Table 4.)
Bicubic | 0 | 27.59 | 27.36 | 26.62 | 26.12 | 26.01 | 26.53 | 26.75 | 26.03 | 26.31 | 25.95 | 25.76 | 26.37
EDSR | 0 | 29.65 | 29.47 | 29.46 | 29.15 | 28.15 | 29.16 | 29.26 | 28.76 | 28.69 | 28.78 | 28.41 | 29.00
RCAN | 0 | 29.86 | 29.81 | 29.15 | 27.48 | 27.91 | 27.17 | 29.03 | 27.74 | 27.65 | 27.96 | 29.15 | 28.44
HAT | 0 | 29.87 | 29.66 | 29.25 | 28.61 | 28.37 | 28.75 | 29.17 | 28.65 | 28.79 | 28.55 | 28.18 | 28.89
VDSR | 0 | 29.15 | 29.07 | 29.05 | 28.75 | 28.63 | 28.89 | 28.74 | 28.89 | 28.75 | 28.54 | 28.65 | 28.82
Ours_s | 0 | 31.09 | 31.01 | 31.12 | 31.23 | 31.15 | 31.24 | 31.25 | 31.24 | 30.91 | 31.05 | 30.75 | 31.00
Ours | 0 | 31.25 | 31.27 | 31.36 | 31.45 | 31.37 | 31.49 | 31.35 | 31.35 | 31.14 | 31.29 | 30.95 | 31.21
Bicubic | 5 | 27.15 | 26.91 | 26.33 | 25.82 | 25.76 | 26.24 | 26.41 | 25.79 | 25.95 | 25.64 | 25.48 | 25.59
EDSR | 5 | 28.42 | 28.48 | 28.25 | 27.39 | 27.64 | 27.97 | 28.23 | 27.31 | 27.67 | 27.19 | 27.14 | 27.52
RCAN | 5 | 28.86 | 28.41 | 28.17 | 26.94 | 27.54 | 26.61 | 27.10 | 26.47 | 26.97 | 25.75 | 26.41 | 27.20
HAT | 5 | 28.89 | 28.43 | 28.27 | 26.96 | 27.57 | 26.69 | 27.35 | 26.84 | 27.39 | 25.45 | 26.95 | 27.35
VDSR | 5 | 27.97 | 27.95 | 27.84 | 26.69 | 27.15 | 26.59 | 27.58 | 26.03 | 27.15 | 27.29 | 26.88 | 26.74
Ours_s | 5 | 29.97 | 29.94 | 29.01 | 28.65 | 29.13 | 28.89 | 29.12 | 28.56 | 29.63 | 29.21 | 29.12 | 29.20
Ours | 5 | 30.07 | 30.01 | 29.97 | 29.05 | 29.45 | 29.19 | 29.40 | 29.87 | 28.91 | 29.94 | 29.55 | 29.58
Bicubic | 10 | 26.68 | 26.34 | 25.85 | 25.31 | 25.19 | 25.74 | 25.91 | 25.26 | 25.59 | 25.16 | 24.91 | 24.72
EDSR | 10 | 27.03 | 27.79 | 27.15 | 26.59 | 26.43 | 27.07 | 27.17 | 26.57 | 26.89 | 26.51 | 26.01 | 26.47
RCAN | 10 | 27.23 | 27.07 | 27.18 | 26.89 | 26.69 | 27.28 | 27.54 | 26.83 | 26.24 | 26.72 | 26.19 | 26.53
HAT | 10 | 27.76 | 27.16 | 27.57 | 26.93 | 26.81 | 27.54 | 27.73 | 26.93 | 26.46 | 26.78 | 26.27 | 26.81
VDSR | 10 | 27.01 | 27.65 | 27.03 | 26.22 | 26.38 | 27.01 | 27.09 | 26.54 | 26.56 | 26.35 | 25.96 | 26.52
Ours_s | 10 | 28.84 | 28.79 | 28.17 | 27.65 | 27.71 | 28.01 | 28.27 | 27.98 | 27.87 | 27.54 | 27.45 | 27.84
Ours | 10 | 29.05 | 28.99 | 28.87 | 28.01 | 27.98 | 28.15 | 28.57 | 28.10 | 28.05 | 27.99 | 27.64 | 28.31
Table 5. Ablation results achieved by our method with different settings on AID.
Method | MPB | #Param. | FLOPs | PSNR | SSIM
Baseline | × | 11.58 M | 190.82 G | 31.20 | 0.8257
Ours | ✓ | 16.01 M | 114.49 G | 31.13 | 0.8229
Ours_s | ✓ | 4.16 M | 8.73 G | 30.91 | 0.8174
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
