1. Introduction
The technique of Single Image Super-Resolution (SISR) employs software algorithms to compensate for lost details in a low-resolution (LR) image, restoring it to a high-resolution (HR) counterpart. This technology has seen extensive application across various fields, notably in video surveillance [1], medical diagnosis [2], and remote sensing [3,4]. In remote sensing, high spatial resolution images are very important in many scenarios, such as target detection [5], change detection [6], and object tracking [7].
Image sensors are the main factor limiting the spatial resolution of remotely sensed images, and increasing sensor pixel density significantly increases hardware cost. Remote sensing image super-resolution (RSISR) reconstruction therefore offers a more cost-effective way to obtain high-resolution remote sensing images than upgrading the imaging equipment.
Image super-resolution reconstruction is an ill-posed problem: a single low-resolution input image can correspond to multiple high-resolution solutions. To overcome this issue, image prior information is typically used to constrain the solution space for HR reconstruction. Deep learning methods, applied to SR tasks in recent years, can reconstruct images with clearer textures and edges than earlier learning-based SR approaches such as those based on sparse coding [8] and local linear regression [9]. The Super-Resolution Convolutional Neural Network (SRCNN) [10], the first CNN-based image super-resolution method, initiated this trend by learning an end-to-end nonlinear mapping from LR to HR images through a three-layer convolutional network. Since then, numerous CNN-based SR methods have been proposed, emphasizing residual blocks [11,12], dense connections [13,14], and recursive structures [15,16].
Because the receptive field of a convolution kernel is limited, a convolutional neural network can only perceive local information in an image. Researchers have expanded the receptive field by building deeper models with pooling layers; however, reducing the resolution of feature maps loses some information. By fusing feature maps of different resolutions, the receptive field can be expanded while avoiding the information loss caused by pooling. UNet [17], a classic Convolutional Neural Network (CNN) architecture typically employed in image segmentation tasks, uses an encoder to extract features and downsample to lower-resolution feature maps, followed by a decoder that incrementally upsamples and merges these features through skip connections. Inspired by the success of the UNet structure in image segmentation, researchers have proposed various UNet variants, such as UNet++ [18] and Attention U-Net [19]. RUNet [20] was the first model to adapt the UNet architecture to image super-resolution tasks.
In recent years, following the success of transformers [21] in natural language processing, transformers have also attracted great attention in computer vision. The Multi-Head Self-Attention (MSA) mechanism, capable of establishing long-range dependencies and adaptively weighting different positions in a sequence, has proven particularly adept at processing image details and grasping global semantics. The Vision Transformer (ViT) [22] was the first pure transformer architecture for image recognition, achieving performance comparable to convolution-based state-of-the-art (SOTA) methods. Following the introduction of ViT, many vision tasks have adopted transformer-based models, including object detection [23] and image segmentation [24,25]. The Image Processing Transformer (IPT) [26] was the first model to apply transformers to low-level tasks such as image super-resolution and denoising.
For pixel-level vision tasks such as image restoration and segmentation, the computational cost of ViT-based models increases significantly with the resolution of the input image. Additionally, ViT usually requires a fixed sequence length, whereas image sizes vary in practical vision tasks. Because transformer models discard the inductive biases of CNNs, a large amount of training data is typically required for ViT to achieve good accuracy. The Swin Transformer [27], a model based on window attention and shifted windows, substantially overcomes these drawbacks. SwinIR [28] is a SOTA method for SISR built on the Swin Transformer; it outperforms previous models based on both pure convolutional structures and ViT architectures on public datasets such as DIV2K and DF2K.
Existing algorithms usually increase the number of network layers to extract features from low-resolution images more adequately and improve the quality of reconstructed images. However, too many layers may bring negative effects on network performance, such as gradient vanishing, network degradation, and overfitting. In recent years, lightweight super-resolution models such as the feature enhancement network (FeNet) [29] and OmniSR [30] have been proposed; they consume fewer computational resources but deliver lower reconstruction quality. Larger network models generate higher-quality super-resolution images but consume a large number of resources. Achieving a balance between reconstruction quality and model complexity is therefore an important goal in image super-resolution research.
This paper presents an Efficient Super-Resolution Hybrid Network (EHNet) based on a UNet-like architecture that adeptly fuses CNN and Swin Transformer. It also introduces a novel sequence-to-sequence upsampling method that focuses more on semantic information, diverging from previous convolution-based methods. SwinIR does not use the patch merging module of the original Swin Transformer to downsample feature maps and obtain features at different resolutions, so its multi-scale feature extraction ability is weakened compared with the original Swin Transformer. UNet's inherent encoder-decoder structure provides stronger feature extraction: the encoder first downsamples to extract features, and the decoder then upsamples to recover detailed information. We design a convolution-based Lightweight Feature Extraction Block (LFEB) as the fundamental module of the encoder, which gradually downsamples to extract semantic features. Convolutional structures, being more cost-effective than self-attention mechanisms, are well suited to extracting image features; to further reduce computational cost, we employ depthwise convolutions. For the decoder, we use the Swin Transformer as the backbone because it can establish long-range dependencies through self-attention, enhancing the restoration of image details, while its window attention mechanism significantly reduces the model's computational cost. On the other hand, almost all super-resolution models use convolution-based upsampling methods, such as the widely used sub-pixel convolution [31]. However, data flow through a transformer as a sequence of tokens, and our experiments demonstrate that placing a convolution-based upsampling method between two transformer layers may introduce extraneous semantic information unrelated to the target, potentially reducing the model's accuracy. We therefore propose a new upsampling module tailored to the attention mechanism over sequential data, the Sequence-based Upsample Block (SUB).
The principal contributions of this paper are summarized as follows:
We propose the Efficient Super-Resolution Hybrid Network (EHNet), a lightweight RSISR network that efficiently fuses CNN and Swin Transformer within a UNet-like structure. This hybrid model exploits both the inductive bias of convolution and the long-range modeling capability of the transformer. Moreover, the multi-scale capability of UNet and its skip connections enable the reconstruction of images with richer details;
We design a lightweight and efficient convolutional block as the fundamental unit for image feature extraction. The dual-branch design of the Cross Stage Partial (CSP) connection integrates features from different stages, helping the model understand and utilize these varied stage features. In addition, we find that an SELayer can realize cross-channel feature combination at a much lower computational cost than pointwise convolution;
In the decoder, we propose a novel upsampling method, SUB, that operates on a sequence of tokens. Compared with convolution-based upsampling methods, our SUB is better suited to transformer-based models and improves image detail recovery by focusing on semantic information.
3. Methodology
In this section, we first introduce the overall architecture of EHNet. Then, we introduce our proposed Lightweight Feature Extraction Block (LFEB) and the new Sequence-based Upsample Block (SUB) in detail.
3.1. Network Architecture
Figure 1 displays the overall architecture of our EHNet, which adopts an encoder-decoder pattern based on the UNet structure. The encoder uses our efficient convolutional layers to capture the low-level features and spatial context of the image, while the decoder uses the Swin Transformer to reconstruct image details. Following the Swin Transformer, a specialized upsampling module designed for the sequence of tokens expresses token features more richly: it operates directly at the sequence level, avoiding the information compression and loss that convolutional layers can cause, and it performs SR reconstruction guided by semantic information during upsampling. To compensate for the spatial information that may be lost when reshaping feature maps into sequences, we incorporate skip connections between the encoder and decoder. This architecture not only facilitates the effective integration of local details with global information but also enhances super-resolution reconstruction by exploiting focused semantic information, leading to significant improvements in image clarity and richness and making our model particularly suitable for application scenarios that require high-quality image reconstruction.
Given an LR image $I_{LR}$, we first interpolate it to the target resolution and then use a 3 × 3 convolution to transform it into a feature map, thereby extracting the initial feature $F_0$. This process can be expressed mathematically as follows:

$$F_0 = \mathrm{Conv}(I_{LR\uparrow}),$$

where $\mathrm{Conv}$ denotes a convolutional operation, $I_{LR\uparrow}$ is the interpolated LR image, and $F_0$ represents the initial feature, which is the input of the following feature extraction part.
We use three LFEGs to construct the encoder within the UNet structure. The primary function of these LFEGs is to extract low-level features at various scales from the image. Each LFEG is composed of multiple stacked LFEBs. The feature map is downsampled by a factor of 2 in each LFEG, so the resolution of the feature map after three LFEGs is 1/8 of the HR image. The output of the encoder part can be written as follows:

$$F_i = H_{\mathrm{LFEG}_i}(F_{i-1}), \quad i = 1, 2, 3,$$

where $H_{\mathrm{LFEG}_i}$ and $F_i$ represent the operation of the $i$-th LFEG and its output, respectively.
After passing through the encoder composed of convolutional structures, we use Swin Transformer Blocks (STB) and SUBs to gradually upscale the features and restore image details.
The STB is the basic module of the Swin Transformer. It divides the image into a series of windows, and all attention is computed only within each window; this windowed attention mechanism greatly reduces the amount of computation. However, computing attention only within windows weakens the long-range modeling ability of the transformer, so the Swin Transformer also uses a shifted-window mechanism to exchange information between windows.
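To illustrate why window attention is cheaper, the short helper below partitions a feature map into non-overlapping windows so that attention can be computed per window. It mirrors the standard Swin-style window-partition utility but is only an illustrative sketch; the 8 × 8 window size is an assumption, not a setting taken from this paper.

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into (num_windows*B, ws*ws, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
    return windows

# Attention inside 8x8 windows scales with (ws*ws)^2 per window instead of (H*W)^2 globally.
feat = torch.randn(1, 64, 64, 96)
tokens = window_partition(feat, 8)   # -> (64, 64, 96): 64 windows of 64 tokens each
```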
In our EHNet, the STB is mainly used to extract higher-dimensional semantic features for the SUB, while our specially designed SUB uses these features to recover image details and upsample the feature maps by a factor of 2. The output of each upsampling step is concatenated with the corresponding output of the encoder before being used as the input of the next layer; this feature fusion compensates for the loss of spatial information caused by downsampling. The decoder can be written as follows:

$$U_i = H_{\mathrm{SUB}_i}\big(H_{\mathrm{STB}_i}(U_{i-1})\big), \quad i = 1, 2, 3, \qquad U_0 = F_3,$$

where $H_{\mathrm{SUB}_i}$ and $H_{\mathrm{STB}_i}$ represent the operations of the $i$-th sequence-based upsample block and Swin Transformer block, and $U_i$ represents the output after the $i$-th upsampling, which is concatenated with the corresponding encoder feature before entering the next stage.
Finally, the output of the decoder is concatenated with the initial feature $F_0$ and passed through another convolutional layer to obtain the final SR image.
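To make the data flow concrete, the following is a minimal PyTorch sketch of the encoder-decoder pipeline described above. The LFEG and STB + SUB stages are replaced by simple convolutional stand-ins, and the channel count, downsampling operator, and skip-fusion convolutions are illustrative assumptions rather than the authors' implementation; the actual blocks are detailed in Sections 3.2 and 3.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EHNetSketch(nn.Module):
    """Schematic of the EHNet data flow in Figure 1; LFEG and STB+SUB stages are placeholders."""

    def __init__(self, c=96, scale=4):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(3, c, 3, padding=1)                 # 3x3 conv producing F0
        # Encoder: three LFEG stand-ins, each downsampling by 2 (stacked LFEBs + pooling).
        self.lfegs = nn.ModuleList([nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True), nn.AvgPool2d(2))
            for _ in range(3)])
        # Decoder: three stand-ins for (Swin Transformer block -> SUB), each upsampling by 2.
        self.stages = nn.ModuleList([nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='nearest')) for _ in range(3)])
        self.fuse = nn.ModuleList([nn.Conv2d(2 * c, c, 1) for _ in range(2)])  # skip fusion
        self.tail = nn.Conv2d(2 * c, 3, 3, padding=1)             # final conv after concat with F0

    def forward(self, lr):
        x = F.interpolate(lr, scale_factor=self.scale, mode='bicubic', align_corners=False)
        f0 = self.head(x)                                         # initial feature F0
        feats, f = [], f0
        for lfeg in self.lfegs:                                   # encoder outputs at 1/2, 1/4, 1/8
            f = lfeg(f)
            feats.append(f)
        d = feats[-1]
        for i, stage in enumerate(self.stages):                   # decoder: upsample step by step
            d = stage(d)
            if i < 2:                                             # fuse with matching encoder scale
                d = self.fuse[i](torch.cat([d, feats[1 - i]], dim=1))
        return self.tail(torch.cat([d, f0], dim=1))               # final skip with F0 -> SR image

sr = EHNetSketch()(torch.randn(1, 3, 48, 48))                     # -> torch.Size([1, 3, 192, 192])
```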
3.2. Lightweight Feature Extraction Block (LFEB)
In this section, we design an efficient feature extraction module that can extract rich features for the decoder at low computational cost. The LFEB is the basic unit of the encoder; we stack multiple LFEBs and incorporate residual learning to form the residual-in-residual structure of the LFEG, which is capable of constructing deeper networks without gradient explosion. Each LFEG ends with a pooling layer that downsamples the feature map. Finally, three LFEGs form the encoder. The encoder of our EHNet is shown in Figure 2.
The overall structural design of the LFEB is similar to the Residual Channel Attention Block (RCAB) [32], which mainly consists of standard convolutions followed by Channel Attention (CA). Our LFEB is instead composed of a CSP connection and lightweight convolution modules. The dual-branch design of CSP effectively integrates information from different stages at minimal computational cost, while the lightweight convolution modules, consisting of depthwise convolution (dwconv) and a Squeeze-and-Excitation layer (SELayer) [49], extract features efficiently. The SELayer enables cross-channel feature fusion while avoiding the computational cost of the pointwise convolution (pwconv) used in separable convolutions. In our LFEB, we therefore use depthwise convolution in tandem with the SELayer as the basic combination. In many lightweight convolutional designs, dwconv followed by pwconv is a common pairing, with pwconv compensating for the lack of cross-channel information fusion in dwconv. However, our experiments demonstrate that this combination is not necessarily helpful for super-resolution tasks, and that the SELayer can take over the role of channel information fusion from pwconv at a lower computational cost.
SELayer adaptively recalibrates the feature responses between channels by explicitly modeling their interdependencies. Specifically, SELayer learns to automatically obtain the importance of each channel and then enhances useful features and suppresses features that are less useful according to this importance. The main operation of SELayer is to globally average pool the feature map to obtain 1 × 1 × C features (Squeeze) and then predict the importance of each channel through the fully connected layer, obtaining channel-level attention weights (Excitation), which are used to recalibrate the feature maps.
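A minimal SELayer sketch following the standard squeeze-and-excitation formulation described above (global average pooling, two fully connected layers, sigmoid gating); the reduction ratio of 16 is the usual default and is an assumption here, not a value reported in the paper.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(                           # excitation: per-channel importance
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                       # recalibrate feature maps channel-wise
```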
Motivated by the success of CSPDarknet in YOLOv4 [50], we also add our own design of a Cross Stage Partial (CSP) connection to extend the channel space in the LFEB. The addition of CSP hardly increases the computation and improves the performance of the model to a certain extent. The structure of the LFEB is shown in Figure 3.
CSP allows the fusion of features from different network stages through its dual-branch design. This helps to integrate and propagate features from lower and higher levels more efficiently, improving the model's understanding and utilization of features from different levels. In super-resolution tasks, this fusion helps the network better understand image details and facilitates more accurate detail reconstruction. The CSP structure in our LFEB divides the input feature map $F_{in}$ with 2C channels into two branches, each with C channels. This process can be written as Equation (4):

$$[X_1, X_2] = \mathrm{Split}(F_{in}),$$

where $X_1$ and $X_2$ denote the feature maps at the beginning of the two branches.
In branch2, features are extracted as usual through the subsequent two convolutional stages; in branch1, $X_1$ is passed through unchanged and directly concatenated with the features extracted in branch2. Finally, a 1 × 1 convolution is used for information fusion, producing the output feature $F_{out}$. This process can be mathematically described as follows:

$$F_{out} = \mathrm{Conv}_{1\times1}\big([X_1, \mathrm{Branch2}(X_2)]\big),$$

where $\mathrm{Branch2}(\cdot)$ represents the convolutions, batch normalization (BN), SELayer, and all other operations within branch2.
Branch2 of our LFEB consists mainly of a tandem stack of dwconv and SELayer, both of which have low computational cost, with a BN layer added to speed up convergence.
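The following sketch puts the pieces of the LFEB together under the assumptions above: a CSP split into two halves, a processed branch built from depthwise convolution, BN, and the SELayer sketched earlier (assumed to be in scope), and a 1 × 1 convolution that fuses the concatenated branches. Kernel sizes, the number of depthwise stages, and the block-level residual are illustrative choices, not the exact released configuration.

```python
import torch
import torch.nn as nn

class LFEB(nn.Module):
    """Sketch of the Lightweight Feature Extraction Block with a CSP-style split."""

    def __init__(self, channels):                          # `channels` = 2C in the paper's notation
        super().__init__()
        c = channels // 2

        def dw_stage():
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1, groups=c),    # depthwise conv: spatial features only
                nn.BatchNorm2d(c),                          # BN to speed up convergence
                SELayer(c))                                 # cross-channel fusion instead of pwconv

        self.branch2 = nn.Sequential(dw_stage(), dw_stage())
        self.fuse = nn.Conv2d(channels, channels, 1)        # 1x1 conv merges the two branches

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)                   # CSP split into two C-channel halves
        out = torch.cat([x1, self.branch2(x2)], dim=1)      # identity half + processed half
        return self.fuse(out) + x                           # block-level residual (assumed, RCAB-style)
```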
3.3. Sequence-Based Upsample Block
In super-resolution tasks, most models use convolution-based upsampling methods such as transposed convolution or sub-pixel convolution. The design inspiration for our SUB originally came from the patch expanding layer of Cao et al. [51], which achieves upsampling and feature-dimension changes without using convolution or interpolation. Compared with sub-pixel convolution and bilinear interpolation, this type of upsampling achieved higher accuracy in segmentation tasks. Based on this sequence-based upsampling concept, we propose a new upsampling module, SUB, that is better suited to super-resolution tasks. Our SUB focuses more on the semantic information of the image to obtain better reconstruction results; to our knowledge, this is the first time a sequence-based upsampling method has been proposed for super-resolution tasks.
The structure of our SUB is shown in Figure 4. The input sequence of tokens is first dimensionally transformed through an MLP layer, which introduces a nonlinear transform to enhance feature learning and expression and doubles the channel dimension. The MLP is followed by a layer of the Swin Transformer to recover more image details. There are three such Swin Transformer layers in the decoder, each corresponding to one of the three downsampling layers in the encoder. After the transformer layer, we rearrange the sequence of tokens into feature maps of shape B × 2C × H × W and then apply a pixel shuffle operation, which doubles the resolution of the feature maps and reduces the channel dimension to half that of the SUB input. Finally, we keep the tokens in the form of feature maps mainly to facilitate fusion with the features extracted by the convolutions in the encoder.
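A minimal sketch of the SUB as described: an MLP doubles the token dimension, a Swin Transformer layer (represented here by a pluggable placeholder) refines the tokens, and the sequence is rearranged into a B × 2C × H × W map before pixel shuffle doubles the spatial resolution. The identity placeholder for the Swin layer and the exact dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SUB(nn.Module):
    """Sketch of the Sequence-based Upsample Block: tokens in, a 2x-larger feature map out."""

    def __init__(self, dim, swin_layer=None):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU())         # double the channel dim
        self.swin = swin_layer if swin_layer is not None else nn.Identity()  # stand-in Swin layer
        self.shuffle = nn.PixelShuffle(2)                                    # 2C x H x W -> C/2 x 2H x 2W

    def forward(self, tokens, h, w):
        b, l, c = tokens.shape                               # l == h * w
        x = self.swin(self.mlp(tokens))                      # (B, L, 2C) token sequence
        x = x.transpose(1, 2).reshape(b, 2 * c, h, w)        # rearrange to B x 2C x H x W
        return self.shuffle(x)                               # feature map, ready for skip fusion

up = SUB(dim=96)
out = up(torch.randn(2, 24 * 24, 96), 24, 24)                # -> torch.Size([2, 48, 48, 48])
```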
In summary, our SUB effectively upsamples the sequence of tokens in transformers and restores more precise and accurate details in super-resolution tasks. To demonstrate the effectiveness of our SUB module, we used Local Attribution Maps (LAM) [52] to analyze which pixels in the input LR image contribute most to the SR (Super Resolution) reconstruction. LAM is an attribution analysis method based on integrated gradients: by selecting a region of interest in the image, LAM identifies the pixels that significantly contribute to the SR reconstruction of that area.
We applied LAM to analyze both the convolution-based upsampling method and our SUB, with results shown in Figure 5. In the airplane scene, we selected the engine as the target region. In the LAM results for convolution-based upsampling, many pixels that do not match the semantic information of the airplane still influence the SR result, and this introduction of extraneous pixel information degrades the quality of the SR reconstruction. In contrast, the LAM results of our method are focused on the region that matches the target semantics: most of the highly contributing pixels concentrate on the airplane engine. This semantically guided SR reconstruction is an important reason why our EHNet achieves higher performance. Similar results appear in the overpass scene, where we selected a car on the road as the target region; our method again produces attributions concentrated on the car, which leads to better reconstruction results.
4. Experiments
4.1. Experiment Settings
To verify the effectiveness of our model, we trained on two widely used public remote sensing datasets, UCMerced [53] and AID [54], respectively.
UCMerced dataset: This dataset contains 21 types of remote sensing scenes, including airports, highways, ports, etc. Each scene category has 100 images, each measuring 256 × 256 pixels with a spatial resolution of 0.3 m/pixel. The dataset is divided into two equal parts: one is used as the training set with a total of 1050 images, and the other is used as the test set, with 20% of the training set serving as the validation set.
AID dataset: Compared with the UCMerced dataset, the AID dataset contains more and larger images: 10,000 images covering 30 remote sensing scene categories. Each image is 600 × 600 pixels with a spatial resolution of 0.5 m/pixel. In this dataset, 8000 images were randomly selected as the training set, and the remaining 2000 images were used as the test set. In addition, we selected five images from each category, for a total of 150 images, as the validation set.
The images in both the UCMerced dataset and the AID dataset were used as HR images in the experiment, and their corresponding LR images were obtained by Bicubic interpolation. We trained and evaluated the model by constructing such paired HR-LR images.
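As a small illustration of how such paired samples can be produced, the sketch below downsamples an HR image with bicubic interpolation using Pillow; the crop-to-multiple step and the function name are assumptions, since the paper only specifies that LR images are obtained by bicubic interpolation.

```python
from PIL import Image

def make_lr_hr_pair(path, scale=4):
    """Create a paired (LR, HR) sample by bicubic downsampling of the original image."""
    hr = Image.open(path).convert('RGB')
    w, h = hr.size
    hr = hr.crop((0, 0, w - w % scale, h - h % scale))                 # make size divisible by scale
    lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
    return lr, hr

# Example: a 256x256 UCMerced image yields a 64x64 LR input for x4 training.
```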
We used the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) to evaluate the experimental results, and all evaluations of the super-resolution results were performed on the RGB channels. In general, SSIM better reflects image quality as perceived by the human eye but is computationally complex, whereas PSNR is computationally simple but does not necessarily fully reflect perceived image quality; we therefore used the two metrics in combination to assess super-resolution quality more comprehensively. The PSNR and SSIM of a super-resolution image can be calculated as follows:

$$\mathrm{PSNR}(x, y) = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, y)}\right),$$

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

where $\mu_x$, $\mu_y$ and $\sigma_x^2$, $\sigma_y^2$ are the means and variances, respectively, $\sigma_{xy}$ is the covariance between $x$ and $y$, $C_1$ and $C_2$ are constants, $\mathrm{MAX}$ is the maximum pixel value, $\mathrm{MSE}$ is the mean squared error, $x$ is the super-resolution image, and $y$ is the high-resolution image.
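For reference, a minimal implementation of the two metrics as defined above, evaluated over the RGB channels; details of the paper's actual evaluation code, such as border cropping, are not specified here and are assumed away.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(sr, hr, max_val=255.0):
    """PSNR = 10 * log10(MAX^2 / MSE), computed over all RGB channels."""
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

def ssim(sr, hr):
    """Mean SSIM over the RGB channels of uint8 images shaped (H, W, 3)."""
    return structural_similarity(sr, hr, channel_axis=2, data_range=255)
```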
Floating-point operations (FLOPs) and the number of model parameters are used to measure the computational cost of the model; FLOPs are computed for a fixed input image size.
Our loss function is the L1 loss, which is the most common choice in super-resolution tasks. Given a training set $\{I_{LR}^{(i)}, I_{HR}^{(i)}\}_{i=1}^{N}$, the loss function can be expressed as follows:

$$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| H_{\mathrm{EHNet}}(I_{LR}^{(i)}) - I_{HR}^{(i)} \right\|_1,$$

where $H_{\mathrm{EHNet}}(\cdot)$ denotes the proposed network and $\Theta$ its parameters.
We conducted experiments on remote sensing images with scale factors of ×2 and ×4. During training, we randomly cropped 192 × 192 patches from the images and applied random flips and rotations to the training samples to increase sample diversity. We used the Adam optimizer together with a cosine annealing learning-rate decay schedule. We used a batch size of 16 and trained the model for 2000 epochs. The entire training was performed on two NVIDIA 3080 Ti GPUs.
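A sketch of the corresponding training loop under these settings: L1 loss, the Adam optimizer, and cosine annealing over 2000 epochs. The learning-rate values below are placeholders because the exact rates are not reproduced in this excerpt, and `train_loader` is assumed to yield paired LR-HR tensors built from 192 × 192 random crops with flips and rotations.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=2000, device='cuda'):
    """Training loop sketch: pixel-wise L1 loss, Adam, cosine-annealed learning rate."""
    model = model.to(device)
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)            # lr is a placeholder value
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-6)                           # decay to a small minimum lr
    for _ in range(epochs):
        for lr_img, hr_img in train_loader:                              # paired (LR, HR) crops
            loss = criterion(model(lr_img.to(device)), hr_img.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```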
4.2. Ablation Studies
In this section, we performed a series of ablation experiments on the UCMerced dataset to explore the importance of each module in our model, where all models were trained on the same settings. For simplicity, all experiments had a super-resolution factor of 4.
4.2.1. Effects of LFEB
The LFEB is the most important component of the encoder, and we explored the effect of using this module with different settings. The number of LFEBs in each LFEG is set to 9 in our experiments. Compared with RCAB, a benchmark module commonly used in super-resolution models, our LFEB is 0.11 dB higher in PSNR. We compared the most commonly used dwconv + pwconv combination with our dwconv + SELayer scheme and found that our approach performs better. Moreover, pwconv incurs a larger computational cost and memory usage, whereas SELayer is a lightweight feature recalibration module that uses only fully connected layers. We also validated the effectiveness of the CSP dual-branch structure in the LFEB and found that PSNR improved by 0.06 dB after introducing CSP; all results are shown in Table 2.
In recent years, several other attention modules have been widely used in super-resolution tasks, and we compared SELayer with these methods. The Convolutional Block Attention Module (CBAM) [55] applies attention in both the channel and spatial dimensions; by combining a channel attention module and a spatial attention module, the network achieves better feature selection and reinforcement in both dimensions, improving its representation ability. Efficient Channel Attention (ECA) [56] proposes a local cross-channel interaction strategy without dimensionality reduction, which can be implemented efficiently with one-dimensional convolution. We tested these popular convolutional attention methods with the other parts of the LFEB fixed, and SELayer obtained the best performance in both PSNR and SSIM. The experimental results are shown in Table 3.
4.2.2. Effects of SUB
SUB is our proposed sequence-based upsampling module, which improves the detail-restoration ability of the transformer-based decoder by focusing on semantic information.
We explored the performance of different components forming the SUB and identified the most effective configuration; all experimental results are shown in Table 4. If only the MLP layer is used for dimension transformation, the performance is mediocre; after adding a Swin Transformer layer, the PSNR increases by 0.1 dB. There are two ways to transform features from an expanded channel dimension to a larger spatial resolution: one is to directly reshape the feature map to the target resolution, and the other is to reshape with the channel dimension unchanged and then use pixel shuffle to increase the spatial resolution. The experimental results show that the latter scheme yields higher reconstruction quality.
We also compared SUB with transposed convolution and sub-pixel convolution, the upsampling methods commonly used in other SOTA methods. Our SUB exceeds transposed convolution and sub-pixel convolution in PSNR by 0.23 dB and 0.12 dB, respectively, and in SSIM by 0.0051 and 0.0036, respectively. These results verify the validity of the SUB upsampling method and are shown in Table 5.
4.2.3. Ablation Study of Our EHNet
We performed ablation experiments on the whole EHNet, examining the number of Swin Transformer layers, the number of convolutional layers, and the effect of the feature dimension on model accuracy and complexity. When the number of LFEBs, the number of Swin layers, and the number of feature channels are set to 9, 2, and 96, respectively, EHNet obtains higher PSNR and SSIM while keeping a low computational overhead. All experimental results are shown in Table 6.
4.3. Comparison with the State-of-the-Arts
To verify the effectiveness of the proposed EHNet, we conducted comparative experiments with several SOTA competitors, namely SRCNN [10], VDSR [11], LGCNet [34], DCM [35], CTNet [36], HSENet [37], TransENet [38], SwinIR [28], and HAT [33]. Among these methods, SRCNN [10], VDSR [11], HAT [33], and SwinIR [28] were proposed for natural image SR, while LGCNet [34], DCM [35], HSENet [37], CTNet [36], and TransENet [38] are designed for RSISR. We retrained all of these methods based on their open-source code and tested them under the same conditions.
4.3.1. Quantitative Evaluation
Quantitative Results on UCMerced Dataset: Table 7 presents a comparison of the latency and accuracy of various methods on the UCMerced dataset. The results indicate that our EHNet achieves a superior balance between the number of parameters and accuracy. For both the ×2 and ×4 super-resolution factors, EHNet demonstrates the best PSNR performance. Compared with recent high-performing models such as SwinIR [28], TransENet [38], and HSENet [37], EHNet shows improvements in both parameter count and performance. Specifically, at the ×4 super-resolution factor, EHNet's PSNR is higher than TransENet [38], SwinIR [28], and HAT [33] by 0.24 dB, 0.15 dB, and 0.16 dB, respectively, while having only 7%, 58%, and 50% of their parameter counts. In comparison with lightweight models such as SRCNN [10], VDSR [11], and CTNet [36], our EHNet also maintains competitive accuracy and efficiency.
Quantitative Results on AID Dataset: In Table 8, our proposed EHNet demonstrates competitive performance across all metrics on the AID test dataset. However, due to its limited model capacity, the performance of our model deteriorates when trained on the larger AID training dataset. Despite this limitation, EHNet still achieves the best or second-best PSNR on the AID test dataset and obtains the best results on the SSIM metric, which is more aligned with human visual perception. Overall, the proposed method maintains competitive performance. To further analyze the reasons behind these phenomena, we discuss the quantitative performance of different methods across categories.
Table 9 lists the performance across the 30 categories of the AID dataset. The experiments demonstrate that our method performs well in scenes with rich textural details, such as airports, schools, parking lots, and sparse residential areas, achieving the best PSNR results in most cases. In contrast, the scenes where PSNR results are less satisfactory tend to be those with more uniform and less detailed content, such as bare land, beaches, and deserts; these images lack sufficient feature information. Our method primarily relies on enhancing high-frequency details to improve image resolution, and in scenes with simple content there may not be enough information for effective reconstruction. Moreover, the PSNR metric may be better suited to assessing detail enhancement in richly textured scenes; in less textured environments, it may not fully reflect the true improvement in image quality.
4.3.2. Qualitative Evaluation
In addition to the quantitative comparisons discussed above, we also conducted a qualitative analysis of super-resolved image quality. Figure 6 presents the visual results for two scenes from the UCMerced dataset: airplane and freeway. In the case of 'airplane78', our method successfully recovers the texture of the engine while maintaining sharp edges. For 'freeway97', our EHNet uniquely restores the car windows, a detail not achieved by other methods. Moreover, the super-resolved image exhibits clearer lane lines, demonstrating EHNet's significant advantage in recovering image details.
Figure 7 shows two examples from the AID dataset. For 'parking210', our proposed method successfully recovers clear marker lines, while the other methods are either very blurred or exhibit checkerboard artifacts. Furthermore, in the super-resolution result for 'stadium262', our model achieves sharper edges around the letters, further evidencing its superior performance in enhancing details.
5. Conclusions
In our work, we introduce a novel model named EHNet, an efficient single-frame SR model for remote sensing. EHNet ingeniously merges an encoder formed by LFEBs with an improved Swin Transformer within a UNet architecture. The LFEB utilizes depthwise convolution to reduce computational cost, while the incorporation of the SELayer enhances inter-channel information fusion, addressing the insufficient channel information integration of depthwise convolution. Additionally, we employ a CSP dual-branch structure to boost model performance without adding extra parameters. In the decoder, we utilize the Swin Transformer to restore image details and introduce a novel sequence-based upsampling method, SUB, to capture more accurate long-range semantic information. EHNet achieves state-of-the-art results on multiple metrics on the AID and UCMerced datasets and surpasses existing methods in visual quality. Its 2.64 M parameters effectively balance model efficiency and computational cost, highlighting its potential for broader application in SR tasks.
The experimental results show that our EHNet performs better on smaller datasets, but its performance degrades on datasets such as AID, which has larger images and more samples. Examining the super-resolution results for different scenes, we find that EHNet tends to underperform in scenes with few details and small gradients. We speculate that the model does not perform well enough on large datasets because its small number of parameters cannot fully cope with all scenes, especially those with smaller gradients. In addition, our model does not perform as well at a super-resolution factor of 3 as at factors of 2 and 4, which may be because the UNet architecture of EHNet adopts 2× downsampling and therefore does not handle LR reconstruction at a factor of 3 well.
In future research, we will focus on enhancing the model’s performance in scenes with less texture, further improving its overall effectiveness.