Article

Activated Sparsely Sub-Pixel Transformer for Remote Sensing Image Super-Resolution

1 Faculty of Data Science, City University of Macau, Macau SAR, China
2 School of Data Science, Qingdao University of Science and Technology, Qingdao 266000, China
3 Zhuhai Aerospace Microchips Science & Technology Co., Ltd., Zhuhai 519000, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Remote Sens. 2024, 16(11), 1895; https://doi.org/10.3390/rs16111895
Submission received: 21 March 2024 / Revised: 15 May 2024 / Accepted: 21 May 2024 / Published: 24 May 2024

Abstract

Transformers have recently achieved significant breakthroughs in various visual tasks. However, these methods often overlook the optimization of interactions between convolution and transformer blocks. Although the basic attention module strengthens the feature selection ability, it is still weak in generating superior quality output. In order to address this challenge, we propose the integration of sub-pixel space and the application of sparse coding theory in the calculation of self-attention. This approach aims to enhance the network’s generation capability, leading to the development of a sparse-activated sub-pixel transformer network (SSTNet). The experimental results show that compared with several state-of-the-art methods, our proposed network can obtain better generation results, improving the sharpness of object edges and the richness of detail texture information in super-resolution generated images.

1. Introduction

Single image super-resolution pertains to the process of generating a high-resolution image from a low-resolution input [1]. This technique significantly enhances the quality of original images captured by imaging hardware, thereby improving data quality and facilitating subsequent data analysis and applications. In the context of remote sensing images, high-resolution images typically display more distinct contours and texture details than their low-resolution counterparts. Consequently, high-resolution images play a crucial role in the utilization of subsequent remote sensing data [2,3,4,5].
Super-resolution is an inherently ill-posed problem: the ambiguity among plausible outcomes tends to introduce smoothness into the results [6,7]. Presently, a plethora of methods employ deep learning for feature representation. These deep learning-based modules are designed to predict non-linear mappings between low-resolution and high-resolution images throughout network iterations. To enhance the precision of these results, techniques such as residual learning [8,9,10], dense learning [11,12], and attention mechanisms [13,14,15,16] have been progressively incorporated.
In recent studies, it has been demonstrated that models based on the transformer architecture offer unique advantages in achieving high-quality super-resolution. The global modeling capability of self-attention proves more effective in capturing long-range information than traditional convolution. Vision transformer [17] stands out as a significant contribution, integrating the transformer into computer vision tasks. Subsequent works, such as SwinIR [18], HAT [19], and SRFormer [20], have further developed transformer’s self-attention in diverse network designs for super-resolution. Despite the undeniable advantage of a robust global perspective, self-attention comes with a relatively high computational cost. Hence, there is a need to strike a balance and make improvements between global modeling with self-attention and leveraging the computational advantages of convolution. Blindly increasing self-attention computations may introduce cost pressures that do not necessarily align with improvements in results. Simultaneously, many studies enhance the generation of remote sensing image super-resolution by incorporating additional guidance information, such as gradient details. However, using more data in self-attention computations could pose significant cost challenges, requiring further deliberation and refinement.
In contrast to natural images, remote sensing images have greater complexity. Remote sensing images contain more similar objects within the data. Additionally, the scale of objects in remote sensing images varies significantly, and objects of the same type appear at different scales. As a result, single-scale convolution and single-scale networks struggle to effectively extract multiscale information. Hybrid-scale self-similarity exploitation network (HSENet) [5] addresses this challenge by incorporating non-local attention with mixed scales within the convolutional neural network, enabling targeted computations of similar information. However, convolution-based non-local attention has limitations in terms of global representation capabilities. Transformer-based enhancement network (TransENet) [21] utilizes transformer self-attention computations for feature extraction from self-similarity but faces issues related to a large number of parameters and the need for improved feature representation.
The question arises whether we can further combine convolution and self-attention computations to obtain richer multi-scale feature information and enhance computational efficiency through the utilization of low-dimensional space. We can further consider the use of sub-pixel space to calculate the interaction between convolution and transformer modules in low-dimensional space, so as to obtain better multi-level fusion processing.
Significant objects frequently occupy a substantial portion of natural images. Conversely, remote sensing images often contain numerous valueless areas, such as oceans. Remote sensing images also exhibit many self-similarities, which are internal repetitions of information. These images typically cover larger areas, with similar geographical targets appearing in the image at the same or different scales repeatedly. While this repetition can provide more features for self-attention learning, it also presents the challenge of overall redundancy caused by repeated data. Consequently, the direct application of existing deep learning methodologies to extract features from remote sensing images not only escalates the complexity of the problem but also introduces superfluous computation. At the same time, it is noted that in dictionary learning, sparse coding shows excellent representation ability. Objects in remote sensing images often represent a minor portion of the entire image data, displaying evident sparse characteristics. Figure 1 provides an example of sparse key information in a typical remote sensing scenario. In this case, details requiring more focused learning, such as edges and textures for objects like bridges and vessels, constitute a smaller fraction of the overall scene. Hence, it is crucial in feature learning to further exploit the prevalent sparsity and redundancy in remote sensing images. The integration of sparse representations within a self-attention model enhances feature representation, resulting in a more robust depiction.
Regarding self-attention computations, a natural question emerges: Can we apply sparse representation theory in the transformer to obtain stronger feature representation?
In this article, we introduce a sparse-activated sub-pixel transformer network (SSTNet) to fully leverage remote sensing image information. Transformer self-attention computations are integrated into feature extraction modules at different stages to optimize global information utilization. In comparison to relying solely on convolutions, self-attention computations provide more long-range dependency information, facilitating better adaptation to the correlations between high and low-dimensional features in remote sensing images. Additionally, we incorporate sub-pixel space consideration in the decoding and reconstruction modules for multi-level fusion of high and low-dimensional features. Sub-pixel space aligns more effectively with pixel-related correlation information for reconstruction and proves to be more efficient than a transformer decoder. To impose constraints on self-attention computations, sparse loss is employed to guide and constrain attention, ensuring increased stability in the overall generation process. Sparse loss, in line with the characteristics of remote sensing image data, directs more attention to texture during the generation process.
Our contributions can be summarized in three aspects:
  • We employ sparse activation to direct feature extraction during the self-attention process, and we choose to focus on the correlation of adjacent pixels in sub-pixel space to augment the capacity for feature extraction and comprehension. The enhancement of the feature extraction capability is realized through the integration of sparse coding and sub-pixels.
  • We propose an encoder–decoder structure that further integrates and utilizes the advantages of both Transformer and CNN. In terms of encoder–decoder interaction, we have designed a multi-scale network structure to better align with the task characteristics of super-resolution in remote sensing images. At the same time, our proposed structure offers a more lightweight design with a reduced parameter count compared to other Transformer-based models.
  • Through comparative experiments conducted on the UCMerced dataset and the AID dataset, our method demonstrates satisfactory performance.
The organization of the remaining sections in this paper is outlined as follows: Section 2 offers an overview of existing research related to single-image super-resolution (SISR). The network proposed in this study is detailed in Section 3. Section 4 delves into implementation specifics and provides super-resolution experimental results obtained using the UCMerced dataset and AID dataset, along with accompanying discussions. Section 5 provides further discussion of the network proposed in this paper. Finally, Section 6 concludes this work.

2. Related Work

With the application of deep learning in image super-resolution tasks, deep neural networks have shown excellent effects and development prospects. Nowadays, there are two main architectures for commonly used image super-resolution deep learning methods, namely convolutional neural network (CNN)-based methods and transformer [22]-based methods.

2.1. CNN-Based Image SR

The super-resolution convolutional neural network (SRCNN) [23] was the earliest shallow CNN model aimed at recovering high-frequency information. A very deep convolutional network (VDSR) [24] extends its capabilities by incorporating a deeper convolutional network, enabling it to effectively learn residuals between the HR image and the upsampled LR image. Shi et al. [25] introduced an efficient subpixel convolutional network (ESPCN) aimed at reducing the computational load of the network. This reduction is achieved by appending a subpixel convolutional layer at the network’s end, ensuring that all computations are conducted in the LR space. The use of subpixel convolution enhances the naturalness of the upsampled result.
The enhanced deep super-resolution network (EDSR) [8] focuses on residual learning and subpixel convolution to achieve better performance. EDSR is based on the SRResNet [7] model and removes batch normalization from the original residual block (RB). This change in the RB helps to improve efficiency and simplify the network. Building on the residual block of EDSR, Yu et al. [26] further expanded the number of channels before the ReLU activation function to obtain a wider activation strategy.
The deep Laplacian pyramid super-resolution network (LapSRN) [27] predicts the residual image step by step from coarse to fine by gradually enlarging the residual image, which reduces the amount of computation and effectively improves accuracy. In addition, in order to make full use of the extracted features, Li et al. [28] proposed a multi-scale residual block (MSRB), which merges multi-scale modules into the residual structure. This innovation enables feature fusion at different scales and demonstrates the importance of multi-scale design in the generation task. Wang et al. [29] discussed the design interaction of multi-scale convolutions and focused on effectively integrating multi-scale features in the overall network, demonstrating the promise of multi-scale fusion frameworks. Some studies [30,31,32,33] also incorporate adversarial loss [34] to create realistic textures during the reconstruction process.
Recently, several approaches have incorporated attention mechanisms into CNN-based SR models to accommodate the varying importance of features under different tasks. Zhang et al. [13] pioneered the introduction of channel attention mechanisms into image super-resolution, proposing the residual channel attention network (RCAN). The channel attention mechanism assigns different scale factors to the feature maps of different channels, so that the relative importance of the extracted features can be adjusted more accurately. Dai et al. [14] proposed a second-order attention network (SAN), which uses a second-order channel attention module to adaptively rescale channel features based on second-order feature statistics, so as to learn more expressive features. The non-local attention modules of RCAN and SAN only explore features at a uniform scale. Mei et al. [35] proposed a cross-scale non-local attention module to exploit feature similarity. Some studies [36,37] also introduce attention modules into generative adversarial networks. By implementing the design of attention modules, these studies improve the global modeling ability of the network, produce better generation results, and highlight the potential advantages of flexibly utilizing different strategies.

2.2. Transformer-Based Image SR

Transformer [22] has attracted the attention of the computer vision community due to its success in the field of natural language processing [38,39,40]. Vision transformer [17] has shown its advantages in modeling long-range dependencies, and there is still a lot of work to prove that convolution can help transformer achieve better visual representations.
IPT [41] is a large pre-trained model, built on standard transformers, that can be used for a variety of low-level vision tasks. It computes local features and self-attention on non-overlapping blocks, but this can lose some of the detail information needed to reproduce the image.
SwinIR [18] proposes an image restoration model based on the Swin Transformer, which further combines the advantages of convolution and the transformer. SwinIR calculates self-attention within small, fixed-size windows, which limits its ability to exploit long-range feature dependencies. ELAN [42] enables self-attention computation in a larger window by sharing weights in the self-attention computation. HAT [19] proposes structural improvements that combine channel attention and self-attention based on the original transformer module and strives to activate more pixels to reconstruct high-resolution results.

2.3. Remote Sensing Image SR

The Local-Global Combined Network (LGCNet) [6] represents the inaugural CNN-based super-resolution (SR) model for remote sensing imagery. This model employs both local and global representations to learn image residuals between high-resolution (HR) images and their enlarged low-resolution (LR) counterparts. Liu et al. [43] leveraged gradient maps to augment edge generation. Furthermore, numerous studies have addressed the challenge of SR in remote sensing through wavelet analysis. Wang et al. [44] employed multiple parallel shallow CNNs to execute wavelet analysis across various scales. Ma et al. [45] transformed RGB images into the wavelet domain and utilized a recursive ResNet for generation. The wavelet transform enables the network to discern additional high-frequency information.
Several studies have incorporated attention mechanisms to enhance the quality of reconstruction results. The Multiscale Attention Network (MSAN) [46] employs hybrid high-order attention for extracting multi-level features from remote sensing images during feature extraction, while also utilizing scene adaptation strategies to align with varying scene structures. Similarly, the Hybrid-Scale Self-Similarity Exploitation Network (HSENet) [5] utilizes a non-local attention module to more effectively adapt to the multi-scale self-similarity found in remote sensing images.
Nowadays, transformer-based architectures have also been integrated into remote sensing super-resolution networks. The transformer-based enhancement network (TransENet) [21] incorporates a multi-scale structure within the encoder and decoder modules of the transformer, achieving improved results by effectively leveraging long-range information. However, the computational expense of self-attention within a multi-scale structure is a challenging issue, and making long-range correlations align better with the data remains a significant consideration.

3. Proposed Method

3.1. Overall Structure

Figure 2 illustrates the overall architecture of the proposed network. Initially, the input data, which is of low resolution in the context of remote sensing, undergoes a transformation using convolutional layers to convert it into feature space.
In the shallow feature extraction module, the multi-scale feature extraction group (MEG) is proposed to extract shallow feature information with various convolution kernel sizes.
In the initial phase of feature extraction, the block of this module makes use of convolution kernels with dimensions of 3 × 3, 5 × 5, and 3 × 3, respectively. The design of the feature extraction module is shown in Figure 3 and the shallow feature extraction process can be represented as follows:
$$f_0 = \mathrm{Conv}(LR), \qquad f_n = \mathrm{MEG}_n(f_{n-1}) = \mathrm{MEG}_n(\mathrm{MEG}_{n-1}(\cdots \mathrm{MEG}_1(f_0))) \tag{1}$$
where $LR$ stands for the low-resolution image, $\mathrm{Conv}$ stands for convolution, $f_0$ represents the initial feature, and $\mathrm{MEG}_n$ represents the $n$th MEG module. This utilization of multi-scale convolution enhances feature extraction, enabling the capture of richer information and the improvement in feature representation. Addressing the intricate nature of multi-scale feature processing, the deep feature extraction module involves a design that branches into multiple levels for fusion. This module utilizes transformer-based self-attention calculations to facilitate deep feature extraction, and the obtained features are expanded through subpixel space to perform the necessary multi-level fusion and reconstruction operations. Subpixel convolution is applied in the reconstruction module to effectively utilize subpixel space for generating results, leading to a significant reduction in parameter scale. Considering the well-established advantages of multi-level fusion in processing remote sensing data across diverse scales, our network architecture incorporates four branches to facilitate more effective interaction with multi-scale information. Thus, the process can be represented as
$$\begin{aligned}
f_{E4} &= \mathrm{Encoder}_4(\mathrm{Subpixel}(f_3)) \\
f_{En} &= \mathrm{Encoder}_n(f_n), \quad n = 1, 2, 3 \\
f_{D3} &= \mathrm{Decoder}_3(f_{E3}, f_{E4}) \\
f_{Dn} &= \mathrm{Decoder}_n(f_{D(n+1)}, f_{En}), \quad n = 1, 2 \\
I_{SR} &= \mathrm{Conv}(f_{D1})
\end{aligned} \tag{2}$$
where $\mathrm{Encoder}_n$ represents the $n$th encoding module, $f_{En}$ represents the output of the $n$th encoder module, $\mathrm{Decoder}_n$ represents the $n$th decoding module, $f_{Dn}$ represents the output of the $n$th decoder module, $\mathrm{Conv}$ represents convolution, and $I_{SR}$ refers to the final super-resolution image. The encoders excel in the multi-level extraction of high/low-dimensional features across various stages. Following this, the decoder systematically fuses these features in a staged manner within the sub-pixel space, resulting in a progressive enhancement of features. The overall structure of the network mainly consists of an encoder–decoder computation part with a sparsely activated self-attention encoder and a subpixel multi-level fusion decoder; these will be further elaborated in the following two subsections.
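To make the shallow feature extraction of Equation (1) concrete, the following is a minimal PyTorch sketch of one MEG block. Only the 3 × 3, 5 × 5, 3 × 3 kernel sequence is taken from the text; the channel width, the ReLU activations, the residual connection, and the number of cascaded MEGs are illustrative assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class MEG(nn.Module):
    """One multi-scale feature extraction group (MEG): three convolutions with
    3x3, 5x5 and 3x3 kernels, as stated in Section 3.1. Channel width, ReLU
    activations and the residual connection are illustrative assumptions."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # assumed residual connection

# Shallow feature extraction of Eq. (1): f_0 = Conv(LR), f_n = MEG_n(f_{n-1}).
head = nn.Conv2d(3, 64, kernel_size=3, padding=1)
megs = nn.Sequential(*[MEG(64) for _ in range(4)])  # number of MEGs is an assumption
lr = torch.randn(1, 3, 48, 48)                      # a 48x48 LR training patch
print(megs(head(lr)).shape)                         # torch.Size([1, 64, 48, 48])
```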

3.2. Transformer Encoder for Sparse Representation

We apply the self-attention mechanism of the transformer after extracting features through multi-scale convolution. The self-attention mechanism has consistently demonstrated its ability to capture global similarity information and optimize feature weighting. Through the utilization of a combination of multi-head self-attention blocks and multi-layer perception blocks, the encoder structure effectively enhances and consolidates features extracted from the shallow layers of multi-scale convolution. We suggest that the generation guidance can be further augmented by incorporating a penalty term into the self-attention encoder architecture, thereby prompting the model to generate more explicit details in a constructive manner. Sparse coding will serve as the technique employed for super-resolution tasks in remote sensing images.
In the encoder, the extracted shallow features are further divided into tokens for the self-attention computation. The feature $X_0 \in \mathbb{R}^{H \times W \times C}$ is split into patches and flattened to obtain $X_p^i \in \mathbb{R}^{H_p W_p C}$. The number of patches is $N = \frac{HW}{H_p W_p}$, which is also the length of the input sequence. A linear projection maps $X_p$ to $D$ dimensions for vector computation in the module. Thus, the encoder input can be expressed as follows:
$$X = [X_p^1 W, X_p^2 W, \ldots, X_p^N W] \tag{3}$$
where $W \in \mathbb{R}^{(H_p W_p C) \times D}$ is the linear projection layer.
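A minimal sketch of the tokenization in Equation (3) is given below. The patch size $H_p = W_p = 8$ and the channel count $C = 64$ are illustrative assumptions; $D = 512$ follows the experimental setup in Section 4.1.

```python
import torch
import torch.nn as nn

# Patch embedding for the encoder input, Eq. (3).
B, C, H, W = 1, 64, 48, 48
Hp = Wp = 8
D = 512

x0 = torch.randn(B, C, H, W)
# extract non-overlapping Hp x Wp patches: shape (B, C*Hp*Wp, N) with N = HW/(Hp*Wp)
patches = nn.functional.unfold(x0, kernel_size=(Hp, Wp), stride=(Hp, Wp))
tokens = patches.transpose(1, 2)            # (B, N, Hp*Wp*C)
proj = nn.Linear(Hp * Wp * C, D)            # the linear layer W in Eq. (3)
X = proj(tokens)                            # (B, N, D) sequence fed to the encoder
print(X.shape)                              # torch.Size([1, 36, 512])
```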
The encoder follows the original structure of [17,21]. It mainly contains a multi-head attention module and a multilayer perceptron structure, with layer normalization applied before each module and residual connections around them. The encoder structure used is shown in Figure 4. The calculations within the encoder can be expressed as follows:
$$\begin{aligned}
X_i &= \mathrm{LN}(X) \\
X_j &= \mathrm{MSA}(X_i) + X \\
Y &= \mathrm{MLP}(\mathrm{LN}(X_j)) + X_j
\end{aligned} \tag{4}$$
where $X_i$ and $X_j$ denote intermediate features, and $Y$ represents the output of the encoder block. Self-attention helps learn more similarity from the data. The variables $Q$, $K$, and $V$ for the attention calculation are obtained from three linear layers, and the self-attention calculation is performed according to Equation (5):
$$\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \\
\mathrm{head}_i &= \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}
\end{aligned} \tag{5}$$
where $d_k$ denotes the dimension of the features in these encoders, $h$ is the number of heads in the encoder module, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are all projection matrices.
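The pre-norm encoder block of Equations (4) and (5) can be sketched in PyTorch as follows. The MLP expansion ratio is an assumption, and eight heads are used here only because nn.MultiheadAttention requires the embedding dimension to be divisible by the head count; the paper instead specifies six heads of dimension 32.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm transformer encoder block following Eq. (4):
    X_i = LN(X); X_j = MSA(X_i) + X; Y = MLP(LN(X_j)) + X_j."""
    def __init__(self, dim: int = 512, heads: int = 8, mlp_ratio: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xi = self.norm1(x)
        xj = self.msa(xi, xi, xi, need_weights=False)[0] + x
        return self.mlp(self.norm2(xj)) + xj

tokens = torch.randn(1, 36, 512)        # (B, N, D) tokens from the patch embedding
print(EncoderBlock()(tokens).shape)     # torch.Size([1, 36, 512])
```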
The inherent uncertainty of current super-resolution generation often leads to blurred outputs when implemented in deep neural networks, thereby undermining the clarity of intricate lines. This problem is particularly pronounced in object super-resolution for remote sensing images due to scale complexity. To address this, we propose the incorporation of supplementary bias generation guidance, leveraging self-attention mechanisms, to enhance image quality. Dictionary learning has demonstrated substantial theoretical potential in both machine learning and deep neural networks. By integrating sparse coding with self-attention encoders, we can optimize the deep feature extraction module to accommodate the sparsity of remote sensing image data. The self-attention calculation mimics human attention patterns by adopting a weighted approach to determine the final attention value for the original data. For the attention values K, Q, V, a sparse penalty term is introduced for V, ensuring that the overall self-attention outcome aligns more closely with a sparse structure. This method not only enhances attention quality but also accentuates key landmark features within the image, thereby providing superior generation guidance. Furthermore, sparse self-attention enables the network to filter out interference pixels from low-resolution inputs, resulting in more robust global feature modeling. Considering the information redundancy inherent in remote sensing images, the strategic incorporation of sparse representation introduces a computational bias that promotes sparsity, thereby amplifying the network’s capacity for feature learning.
The sparse encoder introduces a sparsity-penalty loss term by measuring the sparsity of the encoder output, which guides the encoder–decoder to learn sparser features and obtain stronger learning ability. Sparse autoencoders [47,48] have shown that sparse coding can help neural networks learn data features and adapt to task characteristics. The principle of the original sparse encoder–decoder is shown in Figure 5: the sparsity of the encoder output is usually measured and penalized, thereby constraining the overall sparsity of the encoder–decoder. Given the inherent similarities and complementarities between sparse coding and the self-attention mechanism, employing sparse coding to guide self-attention can be beneficial. This approach has the potential to enhance the capabilities of the Transformer structure, thereby yielding more robust and globally effective feature learning outcomes.
Imposing a sparse loss on the encoding information shared between the encoder and decoder compels self-attention to prioritize key details during the learning process. This imposition not only enhances stability but also intensifies the focus on critical information in the self-encoding/decoding feature learning outcomes. The $L_2$ norm of $V$ in the self-attention module is selected as its activation degree, and this value is used as the penalty term for sparse coding in the generation process, so as to better guide the self-attention. Equation (6) provides a representation of the sparse loss.
$$L_{\mathrm{Sparse}}(\theta) = \frac{1}{N}\sum_{n=0}^{N}\left(\frac{1}{K}\sum_{k=0}^{K} L_2^{k}(V)\right) \tag{6}$$
where $N$ denotes the number of images in a training batch, $\theta$ denotes the parameter set of our network, and $K$ signifies the quantity of self-attention computations in the encoding module.
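A short sketch of the sparsity penalty in Equation (6) follows, assuming the value tensors $V$ from the $K$ self-attention computations are collected with shape (batch, tokens, dim).

```python
import torch

def sparse_loss(v_list):
    """Sparsity penalty of Eq. (6): the L2 norm of each value tensor V from the
    K encoder self-attention computations, averaged over K and over the batch."""
    per_block = [v.flatten(1).norm(p=2, dim=1).mean() for v in v_list]  # one scalar per block
    return torch.stack(per_block).mean()

# dummy value tensors from K = 8 encoder blocks, batch of 16, 36 tokens, dim 512
vs = [torch.randn(16, 36, 512) for _ in range(8)]
print(sparse_loss(vs))
```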
To quantify the accuracy of image reconstruction, we compute the mean absolute error (MAE) between the reconstructed images and their respective ground truth, represented as
$$L_{\mathrm{MAE}}(\theta) = \frac{1}{N}\sum_{n=0}^{N}\left\lVert I_{HR}^{n} - H_{net}(I_{LR}^{n}) \right\rVert_{1} \tag{7}$$
where $H_{net}(I_{LR}^{n})$ and $I_{HR}^{n}$ denote the $n$th reconstructed high-resolution image and the corresponding ground truth image, respectively. $N$ represents the number of images in a training batch, and $\theta$ signifies the parameter set of our network. In summary, the ultimate target loss of the proposed model is the weighted sum of the two losses, as Equation (8) shows:
$$\mathrm{Loss} = L_{\mathrm{MAE}} + \alpha L_{\mathrm{Sparse}} \tag{8}$$
The parameter $\alpha$ is employed as a weight to balance the two losses. By implementing sparse characterization in self-attention computations, we can attain feature extraction capabilities that are both stable and reliable.
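Combining Equations (7) and (8) gives the training objective; a minimal sketch is shown below with α = 10⁻⁵ from Section 4.1. The sparse term is recomputed inline here to keep the snippet self-contained.

```python
import torch
import torch.nn.functional as F

ALPHA = 1e-5  # weight of the sparse term (Section 4.1)

def total_loss(sr, hr, v_list):
    """Training objective of Eq. (8): L1 reconstruction loss (Eq. (7)) plus
    the weighted sparsity penalty (Eq. (6)) over the collected V tensors."""
    l_mae = F.l1_loss(sr, hr)
    l_sparse = torch.stack(
        [v.flatten(1).norm(p=2, dim=1).mean() for v in v_list]
    ).mean()
    return l_mae + ALPHA * l_sparse

sr = torch.randn(16, 3, 96, 96, requires_grad=True)  # e.g. x2 outputs of 48x48 patches
hr = torch.randn(16, 3, 96, 96)
vs = [torch.randn(16, 36, 512) for _ in range(8)]
print(total_loss(sr, hr, vs))
```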

3.3. Subpixel Multi-Level Decoder

Within the module corresponding to the decoder structure, features are subjected to further multi-level fusion and reconstruction within the subpixel space. The decoding structure of the basic Transformer [17,21,22] is usually built around the self-attention computation module. Nevertheless, the computational burden associated with this module frequently poses substantial challenges to the overall framework. Our research indicates that sub-pixel space can serve as a potential solution to mitigate these issues. Furthermore, given the theoretical benefits of sub-pixel space, its integration with the decoding structure could substantially improve the network’s comprehension and generation capabilities with respect to the physical world. Sub-pixel space has received considerable physical analysis with regard to the imaging process, in which the continuous scene of the physical world is discretized into pixels. A single pixel in the imaged result contains color information from its neighborhood. By explicitly computing the sub-pixels that are implicitly present in the imaging process, the information obtained during imaging can be used more fully, so as to generate super-resolution results that are more consistent with the physical world. Subpixel convolution [25] is recognized as an efficacious method for exploiting sub-pixel space, and Figure 6 exemplifies this process in sub-pixel space. Subpixel convolution operates under the assumption that corresponding sub-pixel values exhibit a strong correlation with their original imaging pixels. By computing the subdivided values between two pixels, a more robust upsampling generation capability is achieved.
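Sub-pixel convolution in the sense of ESPCN [25] can be written in a few lines of PyTorch: a convolution expands the channels by r², and PixelShuffle rearranges those channels into an r-times larger spatial grid, so the heavy computation stays in the LR space. The channel count here is illustrative.

```python
import torch
import torch.nn as nn

r = 2          # upscaling factor
channels = 64  # illustrative feature width

subpixel_up = nn.Sequential(
    nn.Conv2d(channels, channels * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),  # rearranges r^2 channel groups into an r-times larger grid
)
x = torch.randn(1, channels, 48, 48)
print(subpixel_up(x).shape)  # torch.Size([1, 64, 96, 96])
```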
It has been observed that the self-attention encoder of the Transformer model can effectively transfer more global modeling information. The application of subpixel space theory can enhance the overall network’s self-attention, enabling it to learn super-resolution results that align more closely with physical reality. Furthermore, supporting the decoder with subpixel space facilitates better multi-level fusion and upsampling generation for the extracted features. Utilizing subpixel convolution in lieu of self-attention computation during the decoding stage can significantly reduce computational cost. Subpixel convolution is used to generate features within a low-dimensional space, and convolution effectively fuses multi-scale features from different branches. These successive convolution operations promote collaborative feature interaction across different stages, ultimately producing high-resolution image outcomes. The decoder of the original Transformer primarily performs a decoding step involving a self-attention calculation; its design mirrors that of the encoder, incorporating a self-attention module, normalization, and a multilayer perceptron module. However, employing self-attention in both the encoder and decoder leads to higher computational cost than convolution operations. Moreover, if each decoder adopts a self-attention design for multi-level fusion, it imposes significant computational strain on the multi-level fusion structure, resulting in an increase in cost that is not matched by a corresponding improvement in results. The principle of subpixel space has been demonstrated to be an effective computing strategy in both physical imaging and neural network domains. In the interaction between the Transformer and convolution, this approach can further leverage the theoretical foundation of subpixel space. By embedding subpixel convolution and convolution computations within the decoder module, it is possible to reconstruct the deep features extracted by the Transformer and to exploit the role of convolution in multi-level fusion.
For the upsampled encoder output, a decoder is utilized to systematically combine and regenerate it together with the outputs of the non-upsampled encoder branches. Specifically, at each level the encoder outputs are upsampled by sub-pixel convolution, gradually added to the processing results of the other branches, and fused step by step through convolution operations. The structure of the multi-level fusion subpixel decoder is shown in Figure 7. Through multi-level feature fusion within the subpixel space, enhanced generation is attainable. The incorporation of subpixel convolution in synergy with conventional convolution in the decoder extends the advantages of both convolution and the transformer, leading to a better utilization of local correlations on the basis of long-range modeling.
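One plausible reading of a single decoder level is sketched below: a lower-resolution feature is upsampled by sub-pixel convolution, added to a higher-resolution feature from another branch, and fused by a convolution. The exact wiring of Figure 7 (fusion by addition versus concatenation, the number of convolutions) is an assumption here, not the paper’s verified design.

```python
import torch
import torch.nn as nn

class SubpixelFusionDecoder(nn.Module):
    """Illustrative decoder level: sub-pixel upsampling of a lower-resolution
    feature, addition with a higher-resolution feature, convolutional fusion."""
    def __init__(self, channels: int = 64, r: int = 2):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(channels, channels * r * r, kernel_size=3, padding=1),
            nn.PixelShuffle(r),
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        # f_low:  feature from the lower-resolution branch
        # f_high: feature from the higher-resolution branch
        return self.fuse(self.up(f_low) + f_high)

dec = SubpixelFusionDecoder()
f_low = torch.randn(1, 64, 48, 48)
f_high = torch.randn(1, 64, 96, 96)
print(dec(f_low, f_high).shape)  # torch.Size([1, 64, 96, 96])
```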

4. Experiment

4.1. Experimental Setup

In this paper, we conduct experiments using two publicly available datasets, namely the UCMerced dataset [49] and the AID dataset [50]. These datasets are widely employed in the field of remote sensing single-image super-resolution. The utilization of these datasets serves to further validate the effectiveness of the proposed network.
UCMerced Dataset [49]: This dataset comprises 21 distinct scene categories, encompassing commercial areas, densely populated residential areas, medium-sized residential areas, lakes, sparse residential regions, harbors, and more. The images have a pixel size of 256 × 256 with a spatial resolution of 0.3 m per pixel. Each category comprises 100 images, culminating in a total of 2100 images. For the UCMerced dataset, 50% of the images are utilized for training, with an additional 20% serving as a validation set.
AID Dataset [50]: This is a collection of remote sensing images sized at 600 × 600 pixels, with spatial resolutions ranging from about 0.5 m to 8 m per pixel. It includes a total of 30 scene categories, covering areas such as airports, baseball fields, bridges, stadium central structures, churches, mountains, and commercial areas. The dataset contains a total of 10,000 images, and each category contains approximately 220 to 420 images. For the AID dataset, 80 percent of the samples are chosen for training, while the remaining images are reserved for the test set. Additionally, five random images per category are selected for validation during training.
This paper primarily focuses on remote sensing image super-resolution for upscaling factors of ×2, ×3, and ×4. For training, LR remote sensing images are randomly cropped into 48 × 48 patches, with corresponding HR image patches extracted as reference blocks. Training samples are augmented by random rotations of 90°, 180°, and 270° and by horizontal flips. For the encoder modules, $K$ is set to 8. The dimension $D$ in the encoder and MLP blocks is set to 512. Furthermore, the number of attention heads in the multi-head self-attention mechanism is set to 6, with each attention head having a dimension of 32. The number of decoder modules $M$ is set to 1. The weight $\alpha$ of the loss function is set to $10^{-5}$. We employ the Adam optimizer during training, with hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.99$, and $\epsilon = 10^{-8}$. The initial learning rate is set to $10^{-4}$, the batch size is 16, and the learning rate is halved every 400 epochs. The total number of training epochs is 2000. The proposed network is implemented using PyTorch, and all experiments are conducted on an NVIDIA GeForce RTX 3060 graphics card. All experimental results are evaluated using the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [51] metrics to assess the quality of the generated images, and comparative analyses are conducted with alternative approaches to gauge effectiveness.
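The optimizer settings above translate directly into PyTorch, as sketched below; `model` is a placeholder standing in for the SSTNet instance, which is not defined here.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder for the SSTNet instance

optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.99), eps=1e-8
)
# halve the learning rate every 400 epochs over 2000 epochs in total
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=400, gamma=0.5)

for epoch in range(2000):
    # ... iterate over batches of 16 random 48x48 LR crops, compute
    #     Loss = L_MAE + 1e-5 * L_Sparse, call loss.backward() and optimizer.step() ...
    scheduler.step()
```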

4.2. Comparisons with Other Methods

We further compare the proposed method with several existing super-resolution techniques, including FSRCNN [52], DCM [53], CTNET [54], and TransENet [21]. Notably, TransENet is a Transformer-based super-resolution method, while FSRCNN, DCM, and CTNET are CNN-based methods; DCM, CTNET, and TransENet are all designed specifically for remote sensing image super-resolution. Comparing against existing Transformer-based and CNN-based techniques gives a more intuitive view of how the proposed method performs in its fusion design of CNN and Transformer, and comparing against existing remote sensing super-resolution methods more clearly reflects the effectiveness of the improvements tailored to the characteristics of the remote sensing super-resolution task.
(1) Results on UCMerced Dataset: Table 1 presents a performance evaluation of all methods on this dataset, with the best results highlighted in bold. Compared to other methods, our method achieves the best performance at ×2 and competitive results at upsampling factors of ×3 and ×4. In addition, compared with the best-performing method at ×4, TransENet, the proposed network has an 8% lower parameter count. The proposed network can therefore achieve competitive generation results with a smaller number of parameters. We propose a lighter Transformer-based structure, which is further illustrated with respect to parameter count in the ablation study section. Table 2 shows the mean PSNR of the ×3 super-resolution results for each category on the UCMerced dataset. Our network attains class-optimal performance across eight classes, the best result among the compared methods. We find that the proposed network performs much better than other existing methods on storage tanks, tennis courts, and freeways, which require better perception of image information changes and global information to distinguish objects from the detailed texture of the reconstructed image. These promising results provide further evidence of the effectiveness of our approach.
Figure 8 illustrates the qualitative outcomes on UCMerced. It can be seen that the proposed model restores richer shadow and texture details in the super-resolution images. Our method exhibits better visual effects and fewer artifacts compared to other approaches, noticeable in aspects such as the stripes in storagetanks98 and the boundaries of the ship hull in harbor58.
(2) Results on AID Dataset: Table 3 displays the performance comparison of all methods on this dataset, with the best results highlighted in bold. Our proposed method consistently outperforms all other methods on the evaluation metrics at ×2, ×3, and ×4. The detailed results of the different methods for all 30 scene classes of the AID dataset are provided in Table 4 at an upscale factor of ×3. We can see that the proposed method achieves the best results on all the ground target scenes. Compared to the UCMerced dataset, the AID dataset’s broader spectrum of scene categories and larger size often result in higher evaluation metrics for super-resolution performance. Nevertheless, the greater diversity of categories simultaneously presents more intricate processing challenges.
Figure 9 displays several super-resolution instances on the AID dataset. In the playground359, center145 and port227 examples, the proposed network excels in generating clearer boundary lines and better represents small objects within the scenes.

4.3. Ablation Study

In this section, we execute various ablation studies to provide evidence for the effectiveness of the principal components of our proposed approach, namely, the sub-pixel multi-level auto-encoder module and the self-attention sparse loss. The ablation experiments are compared on the UCMerced dataset at the upsampling factor of ×2.
(1) Effects of Encoders and Decoders: The sub-pixel multi-level auto-encoder module is engineered for the efficient computation of global feature information and the seamless integration of multi-stage feature fusion. For the comparative assessments under different configurations, the baseline model replaces the basic transformer encoder–decoder from TransENet [21] with the proposed module. As illustrated in Table 5, replacing the transformer decoder module with a sub-pixel decoder module improves performance by 0.05 dB while reducing the number of parameters by 10%. This finding further substantiates the rationality and efficiency of the proposed encoder–decoder architecture for image generation tasks. In comparison to the original Transformer decoder structure, the design of a sub-pixel multi-level fusion decoder allows greater focus on the correlation information of surrounding pixels. This enables more precise control of computations in the low-dimensional space and simultaneously reduces the parameter scale of the multi-level fusion upsampling. Parameter-sharing sub-pixel convolution is also incorporated into the overall decoder structure of the proposed network, further balancing the computational cost. The results demonstrate that the sub-pixel multi-level fusion module not only improves performance but also requires fewer computing resources. These contrasting results strongly validate the effectiveness of the proposed Transformer sub-pixel multi-level fusion module.
As shown in Table 6, SSTNet achieves a lower parameter count than Transformer-based networks such as TransENet [21]; compared to TransENet, the proposed network exhibits an 8% reduction in parameter quantity. However, there is still room for a further reduction in parameter count compared to CNN-based methods. The use of self-attention computation provides better global modeling capabilities and improved generation results, but it also increases computational cost. This paper introduces SSTNet, a more lightweight transformer-based architecture that retains the benefits of self-attention computation while significantly reducing the parameter quantity.
(2) Effects of Sparse Activation of Self-attention: The self-attention sparse loss enforces restrictions on the representational capacity of self-attention computations, resulting in enhanced stability in learning outcomes. To further assess the efficacy of the sparse loss, we conducted a comparison between using only the reconstruction $L_1$ loss and employing both losses simultaneously. As depicted in Table 7, including the sparse loss in the overall reconstruction loss of the network yields a performance gain of 0.01 dB compared to using solely the $L_1$ reconstruction loss. The sparsity penalty term quantifies the degree of sparsity in the $V$ value derived from self-attention, thereby constraining the overall encoding process to manage the sparsity bias. By leveraging the sparse loss, both the encoder and decoder can acquire more pertinent features, bolstering their feature extraction capabilities within the network. This enhancement contributes to improved robustness and provides consistent and effective guidance for improving the generation outcomes in super-resolution tasks. Applying the sparse loss guides the modeling of sparsity-biased features in the self-attention computation for feature extraction, facilitating the learning of more representative and stable feature representations and ultimately refining the generation outcomes. The efficacy of this sparse strategy is further validated through a comparative analysis of experimental results across various categories of super-resolution on both the AID and UCMerced datasets. When compared to existing Transformer-based and CNN-based technologies that use the $L_1$ objective function, the proposed Transformer-based sparse representation scheme consistently delivers stable and superior generation results across diverse scenarios, particularly in sparse contexts. Notably, the inclusion of the sparse loss aligns more closely with our task objectives, as it effectively guides data prioritization and optimizes attention towards small-object generation.

5. Discussion

Although the proposed network has made some progress in the super-resolution of remote sensing images, its results are still not ideal in some cases. Since sub-pixel convolution is adopted in the decoder module and parameter sharing is used to reduce overall computational resources, the proposed network shows a noticeable performance degradation at ×4 magnification; further improving the encoder structure could therefore yield better generation under super-resolution tasks with large magnification factors. We also observe that, although the proposed SSTNet can generally achieve better performance in detail generation, some artifacts still exist in the generated results. In the future, we will investigate how to further improve different combinations of sparse coding and self-attention to solve this problem. The current sparse activation strategy is rather conservative, confining the computation solely to self-attention. More sparse penalties could be considered for guiding the network, further activating the generation potential brought by sparse coding.

6. Conclusions

In this paper, we propose a novel sparse-activated sub-pixel transformer network (SSTNet). The central design of this network is rooted in the subpixel multi-level fusion structure of the transformer. This innovative approach combines convolution and transformer modules, utilizing the subpixel space to seamlessly integrate multi-scale features. By introducing an attention module for sparse representation, the network effectively extracts similar information and enhances feature stability. The subpixel multi-level structure systematically combines features through spatial expansion in low-resolution areas and across various scales. Simultaneously, our method’s superiority is demonstrated through experimental results on two publicly accessible datasets, showcasing promising results when compared to existing technologies.

Author Contributions

All authors contributed to the work. Conceptualization, C.G.; methodology, C.G. and Y.G.; software, C.G.; validation, C.G. and Y.G.; formal analysis, C.G.; investigation, C.G.; resources, C.G.; data curation, C.G.; writing—original draft preparation, C.G.; writing—review and editing, C.G. and Y.G.; visualization, C.G.; supervision, Y.G. and J.Y.; project administration, Y.G. and J.Y.; funding acquisition, Y.G. and J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by NSFC-FDCT under its Joint Scientific Research Project Fund (Grant No. 0051/2022/AFJ).

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank the editors and the reviewers for their crucial comments and suggestions which improved the quality of this paper.

Conflicts of Interest

Author Jun Yan was employed by the company Zhuhai Aerospace Microchips Science & Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Yue, L.; Shen, H.; Li, J.; Yuan, Q.; Zhang, H.; Zhang, L. Image super-resolution: The techniques, applications, and future. Signal Process. 2016, 128, 389–408. [Google Scholar] [CrossRef]
  2. Hou, B.; Zhou, K.; Jiao, L. Adaptive super-resolution for remote sensing images based on sparse representation with global joint dictionary model. IEEE Trans. Geosci. Remote. Sens. 2017, 56, 2312–2327. [Google Scholar] [CrossRef]
  3. Pan, Z.; Ma, W.; Guo, J.; Lei, B. Super-resolution of single remote sensing image based on residual dense backprojection networks. IEEE Trans. Geosci. Remote. Sens. 2019, 57, 7918–7933. [Google Scholar] [CrossRef]
  4. Lei, S.; Shi, Z.; Zou, Z. Coupled adversarial training for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2019, 58, 3633–3643. [Google Scholar] [CrossRef]
  5. Lei, S.; Shi, Z. Hybrid-scale self-similarity exploitation for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 5401410. [Google Scholar] [CrossRef]
  6. Lei, S.; Shi, Z.; Zou, Z. Super-resolution for remote sensing images via local–global combined network. IEEE Geosci. Remote. Sens. Lett. 2017, 14, 1243–1247. [Google Scholar] [CrossRef]
  7. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  8. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  9. Hu, Y.; Li, J.; Huang, Y.; Gao, X. Channel-wise and spatial feature modulation network for single image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3911–3927. [Google Scholar] [CrossRef]
  10. Li, J.; Fang, F.; Li, J.; Mei, K.; Zhang, G. MDCN: Multi-scale dense cross network for image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2547–2561. [Google Scholar] [CrossRef]
  11. Tong, T.; Li, G.; Liu, X.; Gao, Q. Image super-resolution using dense skip connections. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4799–4807. [Google Scholar]
  12. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481. [Google Scholar]
  13. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  14. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11065–11074. [Google Scholar]
  15. Liu, J.; Zhang, W.; Tang, Y.; Tang, J.; Wu, G. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2359–2368. [Google Scholar]
  16. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual aggregation transformer for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12312–12321. [Google Scholar]
  17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  18. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  19. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  20. Zhou, Y.; Li, Z.; Guo, C.L.; Bai, S.; Cheng, M.M.; Hou, Q. Srformer: Permuted self-attention for single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12780–12791. [Google Scholar]
  21. Lei, S.; Shi, Z.; Mo, W. Transformer-based multistage enhancement for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 5615611. [Google Scholar] [CrossRef]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017): 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  23. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  24. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  25. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  26. Yu, J.; Fan, Y.; Yang, J.; Xu, N.; Wang, Z.; Wang, X.; Huang, T. Wide activation for efficient and accurate image super-resolution. arXiv 2018, arXiv:1808.08718. [Google Scholar]
  27. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 624–632. [Google Scholar]
  28. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 517–532. [Google Scholar]
  29. Wang, Y.; Shao, Z.; Lu, T.; Wu, C.; Wang, J. Remote sensing image super-resolution via multiscale enhancement network. IEEE Geosci. Remote. Sens. Lett. 2023, 20, 5000905. [Google Scholar] [CrossRef]
  30. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 63–79. [Google Scholar]
  31. Zhang, W.; Liu, Y.; Dong, C.; Qiao, Y. Ranksrgan: Generative adversarial networks with ranker for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3096–3105. [Google Scholar]
  32. Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 1905–1914. [Google Scholar]
  33. Park, J.; Son, S.; Lee, K.M. Content-aware local gan for photo-realistic super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 10585–10594. [Google Scholar]
  34. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014): 28th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  35. Mei, Y.; Fan, Y.; Zhou, Y.; Huang, L.; Huang, T.S.; Shi, H. Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5690–5699. [Google Scholar]
  36. Jia, S.; Wang, Z.; Li, Q.; Jia, X.; Xu, M. Multiattention generative adversarial network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 5624715. [Google Scholar] [CrossRef]
  37. Xu, Y.; Luo, W.; Hu, A.; Xie, Z.; Xie, X.; Tao, L. TE-SAGAN: An improved generative adversarial network for remote sensing super-resolution images. Remote. Sens. 2022, 14, 2425. [Google Scholar] [CrossRef]
  38. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  39. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. OpenAI Technical Report, 2018. [Google Scholar]
  40. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  41. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12299–12310. [Google Scholar]
  42. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 649–667. [Google Scholar]
  43. Liu, Z.; Feng, R.; Wang, L.; Zhong, Y.; Zhang, L.; Zeng, T. Remote Sensing Image Super-Resolution via Dilated Convolution Network with Gradient Prior. In Proceedings of the IGARSS 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 2402–2405. [Google Scholar]
  44. Wang, T.; Sun, W.; Qi, H.; Ren, P. Aerial image super resolution via wavelet multiscale convolutional neural networks. IEEE Geosci. Remote. Sens. Lett. 2018, 15, 769–773. [Google Scholar] [CrossRef]
  45. Ma, W.; Pan, Z.; Guo, J.; Lei, B. Achieving super-resolution remote sensing images via the wavelet transform combined with the recursive res-net. IEEE Trans. Geosci. Remote. Sens. 2019, 57, 3512–3527. [Google Scholar] [CrossRef]
  46. Zhang, S.; Yuan, Q.; Li, J.; Sun, J.; Zhang, X. Scene-adaptive remote sensing image super-resolution using a multiscale attention network. IEEE Trans. Geosci. Remote. Sens. 2020, 58, 4764–4779. [Google Scholar] [CrossRef]
  47. Ng, A. Sparse autoencoder. CS294A Lect. Notes 2011, 72, 1–19. [Google Scholar]
  48. Chen, X.; Liu, Z.; Tang, H.; Yi, L.; Zhao, H.; Han, S. SparseViT: Revisiting activation sparsity for efficient high-resolution vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2061–2070. [Google Scholar]
  49. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
  50. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote. Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  51. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  52. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II; pp. 391–407. [Google Scholar]
  53. Haut, J.M.; Paoletti, M.E.; Fernández-Beltran, R.; Plaza, J.; Plaza, A.; Li, J. Remote sensing single-image superresolution based on a deep compendium model. IEEE Geosci. Remote. Sens. Lett. 2019, 16, 1432–1436. [Google Scholar] [CrossRef]
  54. Wang, S.; Zhou, T.; Lu, Y.; Di, H. Contextual transformation network for lightweight remote-sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 5615313. [Google Scholar] [CrossRef]
Figure 1. Illustration of sparse key information in a representative remote sensing scene: the objects that require specific attention for edge and texture generation, such as bridges and vessels (highlighted in red boxes), constitute only a small portion of the overall image.
Figure 2. Overall architecture of SSTNet.
Figure 3. MEG structure.
Figure 4. Encoder structure.
Figure 5. An example of a sparse autoencoder.
Figure 6. An illustration of sub-pixel convolution within the sub-pixel space.
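For readers unfamiliar with the operation sketched in Figure 6, the following minimal PyTorch snippet illustrates sub-pixel (pixel-shuffle) upsampling in general form; the channel count, scale factor, and feature-map size are illustrative assumptions and not the exact configuration used in SSTNet.

    import torch
    import torch.nn as nn

    # Sub-pixel upsampling sketch: a convolution expands the channel dimension
    # to C_out * r^2, and PixelShuffle rearranges those channels into an
    # r-times larger spatial grid. The channel count (64), scale (4), and
    # input size (48x48) are illustrative assumptions.
    scale = 4
    sub_pixel = nn.Sequential(
        nn.Conv2d(64, 3 * scale ** 2, kernel_size=3, padding=1),
        nn.PixelShuffle(scale),
    )
    lr_features = torch.randn(1, 64, 48, 48)   # low-resolution feature map
    sr_output = sub_pixel(lr_features)         # shape: (1, 3, 192, 192)
    print(sr_output.shape)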
Figure 7. Decoder structure.
Figure 8. Qualitative results on UCMerced for 4× scaling factor.
Figure 9. Qualitative results on AID for 4× scaling factor.
Table 1. Mean PSNR (dB) and SSIM over the UCMerced test dataset (each entry reports PSNR/SSIM).
Scale | FSRCNN [52]  | DCM [53]     | CTNET [54]   | TransENet [21] | SSTNet
2×    | 33.18/0.9196 | 33.65/0.9274 | 33.59/0.9255 | 34.03/0.9301   | 34.09/0.9311
3×    | 29.09/0.8167 | 29.52/0.8394 | 29.44/0.8319 | 29.92/0.8408   | 29.88/0.8397
4×    | 26.93/0.7267 | 27.22/0.7528 | 27.41/0.7512 | 27.77/0.7630   | 27.66/0.7598
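As a reference for how the PSNR and SSIM values in Tables 1 and 3 are typically computed, a minimal sketch using scikit-image is given below; details such as the colour space and any border cropping in the actual evaluation protocol are assumptions, not taken from the paper.

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate_pair(sr, hr):
        """Return (PSNR in dB, SSIM) for one super-resolved/reference pair.

        Assumes uint8 RGB arrays of identical shape; evaluating on RGB rather
        than on the luminance channel only is an assumption made here.
        """
        psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
        ssim = structural_similarity(hr, sr, data_range=255, channel_axis=-1)
        return psnr, ssim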
Table 2. Mean PSNR (dB) of each class for upscaling factor 3 on the UCMerced test dataset.
Class             | FSRCNN [52] | DCM [53] | CTNET [54] | TransENet [21] | SSTNet
agricultural      | 27.61 | 29.06 | 31.79 | 28.02 | 27.65
airplane          | 28.98 | 30.77 | 28.22 | 29.94 | 29.94
baseballdiamond   | 34.64 | 33.76 | 29.37 | 35.04 | 35.04
beach             | 37.21 | 36.38 | 34.73 | 37.53 | 37.59
buildings         | 27.50 | 28.51 | 37.39 | 28.81 | 28.74
chaparral         | 26.21 | 26.81 | 28.01 | 26.69 | 26.69
denseresidential  | 28.02 | 28.79 | 26.42 | 29.11 | 29.12
forest            | 28.35 | 28.16 | 28.41 | 28.59 | 28.57
freeway           | 29.27 | 30.45 | 28.43 | 30.38 | 30.47
golfcourse        | 36.43 | 34.43 | 29.67 | 36.68 | 36.65
harbor            | 23.29 | 26.55 | 36.24 | 24.72 | 24.56
intersection      | 28.06 | 29.28 | 23.99 | 29.03 | 29.02
mediumresidential | 27.58 | 27.21 | 28.42 | 28.47 | 28.42
mobilehomepark    | 24.34 | 26.05 | 27.86 | 25.64 | 25.52
overpass          | 26.53 | 27.77 | 24.99 | 27.83 | 27.87
parkinglot        | 23.34 | 24.95 | 27.48 | 24.45 | 24.38
river             | 29.07 | 28.89 | 23.63 | 29.25 | 29.21
runway            | 31.01 | 32.53 | 29.03 | 31.25 | 31.19
sparseresidential | 30.23 | 29.81 | 30.68 | 31.57 | 31.61
storagetanks      | 31.92 | 29.02 | 31.18 | 32.71 | 32.77
tenniscourt       | 31.34 | 30.76 | 32.43 | 32.51 | 32.54
AVG               | 29.09 | 29.52 | 29.44 | 29.92 | 29.88
Table 3. Mean PSNR (dB) and SSIM over the AID test dataset (each entry reports PSNR/SSIM).
Scale | FSRCNN [52]  | DCM [53]     | CTNET [54]   | TransENet [21] | SSTNet
2×    | 34.67/0.9308 | 35.21/0.9366 | 35.12/0.9357 | 35.24/0.9369   | 35.29/0.9376
3×    | 30.71/0.8423 | 31.28/0.8560 | 31.21/0.8542 | 31.38/0.8581   | 31.45/0.8595
4×    | 28.56/0.7620 | 29.16/0.7821 | 29.01/0.7782 | 29.32/0.7879   | 29.34/0.7896
Table 4. Mean PSNR (dB) of each class for upscaling factor 3 on the AID test dataset.
Class             | FSRCNN [52] | DCM [53] | CTNET [54] | TransENet [21] | SSTNet
airport           | 30.38 | 31.01 | 30.91 | 31.13 | 31.20
bareland          | 38.24 | 38.54 | 38.54 | 38.58 | 38.60
baseballdiamond   | 33.24 | 33.81 | 33.71 | 33.94 | 33.99
beach             | 34.20 | 34.54 | 34.55 | 34.61 | 34.64
bridge            | 32.92 | 33.60 | 33.52 | 33.75 | 33.81
center            | 28.91 | 29.82 | 29.68 | 29.96 | 30.05
church            | 25.57 | 26.25 | 26.15 | 26.33 | 26.38
commercial        | 29.61 | 30.21 | 30.13 | 30.29 | 30.35
denseresidential  | 26.29 | 26.92 | 26.84 | 27.01 | 27.08
desert            | 40.84 | 41.00 | 40.99 | 41.03 | 41.05
farmland          | 35.72 | 36.25 | 36.18 | 36.35 | 36.41
forest            | 30.71 | 30.98 | 31.01 | 31.06 | 31.08
industrial        | 28.40 | 29.14 | 29.02 | 29.25 | 29.32
meadow            | 34.49 | 34.72 | 34.72 | 34.77 | 34.80
mediumresidential | 29.84 | 30.46 | 30.39 | 30.54 | 30.59
mountain          | 30.94 | 31.10 | 31.11 | 31.15 | 31.17
park              | 29.57 | 30.02 | 29.96 | 30.11 | 30.16
parking           | 27.58 | 28.83 | 28.64 | 29.05 | 29.27
playground        | 31.43 | 32.47 | 32.37 | 32.69 | 32.82
pond              | 31.66 | 32.06 | 31.10 | 32.12 | 32.16
port              | 28.08 | 28.72 | 28.61 | 28.82 | 28.94
railwaystation    | 29.92 | 30.58 | 30.48 | 30.69 | 30.75
resort            | 29.51 | 30.12 | 30.06 | 30.21 | 30.27
river             | 32.39 | 32.65 | 32.62 | 32.70 | 32.73
school            | 28.53 | 29.18 | 29.08 | 29.29 | 29.35
sparseresidential | 27.97 | 28.27 | 28.25 | 28.32 | 28.35
square            | 30.75 | 31.46 | 31.37 | 31.58 | 31.65
stadium           | 28.26 | 29.02 | 28.90 | 29.19 | 29.27
storagetanks      | 27.03 | 27.64 | 27.54 | 27.73 | 27.78
viaduct           | 29.18 | 29.90 | 29.76 | 30.04 | 30.11
AVG               | 30.71 | 31.28 | 31.21 | 31.38 | 31.45
Table 5. Mean PSNR (dB), SSIM, and Params (M) with different components.
Method                                     | PSNR  | SSIM   | Params (M)
TransENet [21] encoder and decoder module  | 34.01 | 0.9299 | 37.31
Proposed SSTNet encoder and decoder module | 34.06 | 0.9307 | 33.48
Table 6. Params on the UCMerced dataset for the 2× scaling factor.
Method         | Params
FSRCNN [52]    | 115.59 k
DCM [53]       | 1.84 M
CTNET [54]     | 401.91 k
TransENet [21] | 37.31 M
SSTNet         | 34.14 M
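The parameter counts in Tables 5 and 6 can be reproduced for any PyTorch model with the short helper below; the model name in the commented usage line is a placeholder, not an identifier from the released code.

    import torch.nn as nn

    def count_parameters(model: nn.Module) -> float:
        """Total number of trainable parameters, reported in millions (M)."""
        return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

    # Example with a placeholder model; a value of roughly 34.14 would be
    # reported as 34.14 M in Table 6.
    # print(f"{count_parameters(my_sstnet):.2f} M")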
Table 7. Mean PSNR (dB) and SSIM with different loss functions.
Loss              | PSNR  | SSIM
L1                | 34.08 | 0.9308
L1 + α·Lsparse    | 34.09 | 0.9311
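For the loss comparison in Table 7, a composite objective of this form can be assembled as sketched below; the use of a mean-absolute-activation penalty for the sparsity term and the weight value α = 0.01 are illustrative assumptions rather than the paper's exact definitions.

    import torch.nn.functional as F

    def training_loss(sr, hr, activations, alpha=0.01):
        """L1 reconstruction loss plus a weighted sparsity penalty (assumed form).

        `activations` is a feature tensor whose activations are encouraged to
        be sparse; penalising their mean absolute value is one common choice
        and, like alpha = 0.01, an assumption here.
        """
        l1 = F.l1_loss(sr, hr)
        l_sparse = activations.abs().mean()
        return l1 + alpha * l_sparse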
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
