1. Introduction
Satellite remote sensing is used in many practical domains, such as mapping, agriculture, environmental protection, land use, urban planning, geology, natural disasters, hydrology, oceanography, and meteorology. In this field, spatial resolution is a fundamental parameter of particular importance: the higher the resolution of an image, the more details and information are available, both for human interpretation and for automated machine understanding [1]. However, providing high-spatial-resolution instruments at market prices is not always the optimal solution [2]. Over the past two decades, super-resolution (SR) has been defined as the process of reconstructing a high-resolution (HR) image from one or more low-resolution (LR) images. The fundamental challenge of SR is that it is an ill-posed problem: a single LR image can be derived from several HR images with minor variations, such as changes in camera angle, color, or brightness [3]. Therefore, various approaches with different constraints and mapping equations have been developed to map an LR image to an HR image.
Today, the Sentinel-2 constellation freely provides multispectral optical data with a five-day revisit time and global coverage. These capabilities have spurred a new wave of activity in the space business sector [4].
SR methods are categorized into multi-image and single-image approaches. In Multi-Image Super-Resolution (MISR), the missing information is extracted from multiple LR images of the same scene to recover the HR image [5]. However, because satellite images of a scene are acquired from different angles and passes and exhibit sub-pixel misalignments, the correspondence between a pixel in one image and the same pixel in the next will not be exact. Obtaining a sufficient number of images of a scene is therefore challenging in practice [1]. In contrast, Single-Image Super-Resolution (SISR) derives the HR output from only one LR image. SR can thus be applied without a satellite constellation, which results in substantial cost savings and offers a good opportunity for small platforms with low-resolution, inexpensive instruments [6].
With the advent of efficient Deep Learning (DL) methods, SISR gradually became the dominant approach [7]. Today, a common technique in SR algorithms is to use convolutional neural networks (CNNs). The pioneering CNN-based algorithm was the Super-Resolution Convolutional Neural Network (SRCNN) [8], and its enhancement has continued to this day through various methodologies. Meanwhile, Generative Adversarial Networks (GANs) [9] have attracted growing interest within the research community: GAN-based networks can generate more photo-realistic outputs, albeit with lower quantitative metrics [2].
Satellite imagery, unlike standard imagery, covers a large extent of landscape within each pixel because of its lower spatial resolution. Consequently, satellite imagery such as Sentinel-2 is more prone to pixelation artifacts during the SR process. In this context, we propose a model that mitigates these problems using spatial features. Specifically, the proposed network modifies the input features of the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) [10] to obtain better results on satellite images.
The rest of the paper is organized as follows. Section 2 describes the datasets and the proposed method. Section 3 presents the evaluation metrics, comparative solutions, and experimental results. Finally, Section 4 provides concluding remarks.
2. Materials and Methods
The development of a DL-based method involves at least three key steps, with some iterations among them:
Preparing an appropriate dataset for training and testing.
Designing and developing a DL model.
Training and evaluating the model using prepared data.
Following this approach, we outline the development of our proposed model in the same order for clarity.
2.1. Dataset
In deep learning, data is a critical factor in a model’s quality and efficiency. Deep neural networks require a large volume of training data to learn intricate patterns, and large, diverse datasets mitigate many issues related to model performance degradation [11].
We used archived WorldView-2/3 images from the European Space Agency (ESA) as ground truth to prepare the LR-HR image pairs. These images were acquired between February 2009 and December 2020 and cover several areas of Earth; an example is shown in Figure 1. Care was taken to select archived scenes with minimal or no cloud cover, and to ensure that a Sentinel-2 image acquired within one week of each WorldView acquisition was available. In the end, 46 large images were selected, covering areas with urban, rural, agricultural, maritime, and other land uses. However, given the structural differences of densely populated urban areas and the scarcity of images over such regions, these areas were excluded from the training data.
After selecting the Sentinel-2 and WorldView image pairs, initial preprocessing steps were performed, such as converting Digital Numbers (DN) to radiance. The Sentinel-2 images were then co-registered to their corresponding higher-resolution WorldView images, and the WorldView bands were resampled to a spatial resolution of 2.5 m. Finally, the Sentinel-2 and WorldView images were divided into 64 × 64 and 256 × 256 patches, respectively, and augmented according to the frequency of each land use.
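As a rough illustration of the patching step, the sketch below tiles an already co-registered Sentinel-2/WorldView pair into 64 × 64 and 256 × 256 patch pairs at the 4× scale factor. The non-overlapping tiling, the array layout, and all names are our assumptions; the paper does not specify these details.

```python
import numpy as np

def extract_patch_pairs(s2_img, wv_img, lr_size=64, scale=4):
    """Tile a co-registered Sentinel-2 (10 m) / WorldView (2.5 m) pair into
    matching LR-HR patches; shapes are (H, W, C) and (H*scale, W*scale, C)."""
    hr_size = lr_size * scale  # 64 * 4 = 256
    pairs = []
    h, w = s2_img.shape[:2]
    for y in range(0, h - lr_size + 1, lr_size):
        for x in range(0, w - lr_size + 1, lr_size):
            lr = s2_img[y:y + lr_size, x:x + lr_size]
            hr = wv_img[y * scale:(y + lr_size) * scale,
                        x * scale:(x + lr_size) * scale]
            pairs.append((lr, hr))
    return pairs
```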
2.2. Proposed Method
In this section, we describe the experiments conducted to verify our method; the generation of the dataset was explained in the previous section. In total, 6500 patches were collected and split into two sets: 90% for training and 10% for testing. For a fair comparison, the ESRGAN network and the proposed method were each trained once on this dataset. For quantitative evaluation, the well-known Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) metrics were used, as in many previous studies. Additionally, the Spectral Angle Mapper (SAM) metric was selected to assess spectral quality.
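For reference, the following sketch shows one way these three metrics can be computed for an HR reference and a super-resolved output, using scikit-image for PSNR and SSIM and a hand-rolled SAM; the data_range value and the per-pixel averaging of the spectral angle are assumptions on our part.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def spectral_angle_mapper(hr, sr):
    """Mean spectral angle (radians) between corresponding pixel vectors."""
    hr = hr.reshape(-1, hr.shape[-1]).astype(np.float64)
    sr = sr.reshape(-1, sr.shape[-1]).astype(np.float64)
    dot = np.sum(hr * sr, axis=1)
    norms = np.linalg.norm(hr, axis=1) * np.linalg.norm(sr, axis=1)
    cos = np.clip(dot / (norms + 1e-12), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

def evaluate(hr, sr, data_range=1.0):
    """Compute PSNR (dB), SSIM, and SAM (radians) for one image pair."""
    return {
        "PSNR": peak_signal_noise_ratio(hr, sr, data_range=data_range),
        "SSIM": structural_similarity(hr, sr, channel_axis=-1,
                                      data_range=data_range),
        "SAM": spectral_angle_mapper(hr, sr),
    }
```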
The ESRGAN network [10] was presented in 2018 by Wang et al. Its generator follows the post-upsampling framework. First, a convolutional layer with a 3 × 3 kernel and stride 1 extracts 64 features. These features are fed into a deeper feature extractor composed of 23 Residual-in-Residual Dense Blocks (RRDBs), each of which embeds three residual dense blocks within an outer residual connection. Finally, after the upsampling layers, the extracted features are combined by two convolutional layers, and the network output is obtained as three channels representing the RGB spectral bands.
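A condensed PyTorch sketch of the RRDB structure is given below. The hyperparameters (64 features, a growth of 32 channels, and the 0.2 residual-scaling factor) follow the public ESRGAN reference implementation rather than anything stated here, and the class names are ours.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Five densely connected 3x3 convolutions with residual scaling."""
    def __init__(self, nf=64, gc=32, beta=0.2):
        super().__init__()
        self.beta = beta
        # Conv i sees the input plus all previous outputs; the last conv
        # projects back to nf channels.
        self.convs = nn.ModuleList(
            [nn.Conv2d(nf + i * gc, gc if i < 4 else nf, 3, 1, 1)
             for i in range(5)]
        )
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:
                out = self.lrelu(out)
                feats.append(out)
        return x + self.beta * out  # local residual connection

class RRDB(nn.Module):
    """Residual-in-Residual Dense Block: three dense blocks in an outer residual."""
    def __init__(self, nf=64, gc=32, beta=0.2):
        super().__init__()
        self.beta = beta
        self.blocks = nn.Sequential(
            *[ResidualDenseBlock(nf, gc, beta) for _ in range(3)])

    def forward(self, x):
        return x + self.beta * self.blocks(x)  # outer residual connection
```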
In our proposed method, we extract spatial features, such as the morphological profile and a sharpening operation, to reduce pixelation in the super-resolved images. Morphological analyses use structuring elements of various shapes to extract information encoded in the spatial relationships between pixels. A common way of employing them is to build a morphological profile: different morphological operators are applied to the image, and their results are arranged according to a specific pattern and fed into a computational model [12]. In this study, we used two morphological operators, opening and closing, to generate the morphological profile. For each of the three RGB spectral bands, these operators were applied three times, and their results were arranged sequentially, as shown in Figure 2.
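The sketch below illustrates how such a profile can be assembled with scikit-image; the disk-shaped structuring elements and their radii are our assumptions, since the text only states that each operator is applied three times per band.

```python
import numpy as np
from skimage.morphology import opening, closing, disk

def morphological_profile(rgb, radii=(1, 2, 3)):
    """Stack openings and closings of each RGB band into one feature cube.

    rgb: (H, W, 3) array. Returns (H, W, 3 * 2 * len(radii)) profile,
    with layers arranged band by band, operator by operator.
    """
    layers = []
    for b in range(rgb.shape[-1]):
        band = rgb[..., b]
        for r in radii:
            se = disk(r)  # assumed disk structuring element
            layers.append(opening(band, se))
            layers.append(closing(band, se))
    return np.stack(layers, axis=-1)
```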
Additionally, we used slope to prevent the model from degrading the land surface topography. For this purpose, the Shuttle Radar Topography Mission (SRTM) v3 Digital Elevation Model (DEM) was acquired for each scene, and the regional slope was calculated. Using slope rather than the elevation model itself is advantageous because slope values are bounded and can be used regardless of the image’s position on Earth.
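A minimal sketch of the slope computation, assuming the DEM is loaded as a NumPy array with an approximately 30 m ground sampling distance (SRTM v3 at 1 arc-second) and square pixels; note that, unlike raw elevation, the resulting values are bounded to 0–90 degrees.

```python
import numpy as np

def slope_degrees(dem, pixel_size=30.0):
    """Slope (degrees) from a DEM via finite differences.

    dem: (H, W) elevation in metres; pixel_size: ground sampling in metres.
    Treating pixels as square in metres is a simplification of a
    geographic-coordinate SRTM grid.
    """
    dz_dy, dz_dx = np.gradient(dem, pixel_size)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
```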
After preparing the required input data, training and fine-tuning the implemented model become crucial. In GAN networks, the Generator (G) and Discriminator (D) are trained simultaneously, engaging in an adversarial relationship in which each drives the other to improve. Maintaining a balanced competition between them is therefore of paramount importance. For this reason, the G network was first trained independently, and the pretrained network was then pitted against the D network.
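The two-stage scheme could look roughly like the sketch below. For brevity it uses a plain (non-relativistic) adversarial loss and omits ESRGAN’s perceptual term, so it should be read as an outline of the pretrain-then-adversarial strategy rather than a faithful reproduction of the training objective; the learning rates and adversarial weight are placeholders.

```python
import torch
import torch.nn as nn

def pretrain_generator(G, loader, epochs, device="cuda"):
    """Stage 1: train G alone with a pixel-wise L1 loss (PSNR-oriented warm-up)."""
    opt = torch.optim.Adam(G.parameters(), lr=2e-4)  # placeholder learning rate
    l1 = nn.L1Loss()
    for _ in range(epochs):
        for lr_img, hr_img in loader:
            lr_img, hr_img = lr_img.to(device), hr_img.to(device)
            loss = l1(G(lr_img), hr_img)
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_adversarial(G, D, loader, epochs, device="cuda", adv_weight=5e-3):
    """Stage 2: alternate D and G updates, starting from the pretrained G."""
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
    for _ in range(epochs):
        for lr_img, hr_img in loader:
            lr_img, hr_img = lr_img.to(device), hr_img.to(device)
            sr = G(lr_img)
            # Discriminator step: real patches labelled 1, generated patches 0.
            real_logits, fake_logits = D(hr_img), D(sr.detach())
            d_loss = (bce(real_logits, torch.ones_like(real_logits))
                      + bce(fake_logits, torch.zeros_like(fake_logits)))
            opt_d.zero_grad()
            d_loss.backward()
            opt_d.step()
            # Generator step: pixel loss plus a small adversarial term.
            fake_logits = D(sr)
            g_loss = l1(sr, hr_img) + adv_weight * bce(
                fake_logits, torch.ones_like(fake_logits))
            opt_g.zero_grad()
            g_loss.backward()
            opt_g.step()
```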
3. Results
To compare our proposed method with the ESRGAN network, the three metrics PSNR, SSIM, and SAM were considered; the outcomes are shown in Table 1. According to Table 1, the proposed method achieved better results on the SAM and SSIM metrics, with 0.10 radians and 0.92, respectively, versus 0.12 radians and 0.90 for the ESRGAN network. By contrast, ESRGAN attained a higher PSNR of 37.34 dB, compared with 37.23 dB for the proposed approach.
In addition, to visually compare the ground truth, the proposed approach, and the ESRGAN network, two samples were selected from the test set. Figure 3a–c corresponds to an area with dense vegetation, and Figure 3d–f to a desert area with scattered bushes. Comparing the outputs with the WorldView images shows that our proposed model performs better than ESRGAN in the vegetated area but generates extra noise in the desert area, most likely due to the lack of training data from desert regions. Moreover, to evaluate the spectral quality, the histograms of the RGB bands were examined, as shown in Figure 4; comparing the histograms leads to similar conclusions.
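A histogram comparison of this kind can be produced with a few lines of matplotlib; the function below, with names and figure layout of our choosing, overlays per-band histograms of the ground truth and the two super-resolved outputs.

```python
import matplotlib.pyplot as plt

def compare_band_histograms(images, labels, bins=256):
    """Overlay per-band histograms of several RGB images (e.g. ground truth,
    proposed output, ESRGAN output) for a quick spectral-fidelity check."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, band, name in zip(axes, range(3), ("R", "G", "B")):
        for img, label in zip(images, labels):
            ax.hist(img[..., band].ravel(), bins=bins,
                    histtype="step", label=label)
        ax.set_title(f"{name} band")
    axes[0].legend()
    plt.tight_layout()
    plt.show()
```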