1. Introduction
Single Image Super-Resolution (SISR) is a classical problem of computer vision that aims to obtain a high-resolution (HR) image from a low-resolution (LR) version. In other words, the objective of SISR techniques is to make an image larger without losing details. One of the biggest challenges of SISR is that multiple HR solutions exist for the same LR image; this makes the mapping between the LR space and the HR space ill-defined, and recovering it is intractable in most cases [1,2].
All these techniques are used in many visual applications that require high-resolution images to allow an adequate interpretation of the data stored within them. Examples are found in medicine, security and remote sensing, among others. In the case of crops, for example, satellite images are proving to be very useful for optimizing the efficiency and profitability of farms [3]. However, the spatial resolution of these satellite images only allows us to identify general features: it is sufficient for monitoring crop growth, for example, but not for the early detection of pests.
Among the most popular satellites is Sentinel-2, a pair of twin satellites belonging to the Sentinel missions [4], which provide free and global acquisitions of multispectral images with a revisit frequency of 5 days. The objective of these missions is to supply data for remote sensing tasks, such as land monitoring or disaster management. The multispectral bands of the Sentinel-2 sensors have up to 10 m spatial resolution, which is modest compared with that provided by commercial high-resolution satellites. PlanetScope, a satellite constellation operated by Planet [5], provides multispectral images with a resolution of 3.125 m. Nevertheless, these HR satellite images are very expensive, which makes them inaccessible to most users. This is the main motivation for increasing the resolution of Sentinel-2 images without any additional cost.
There are three main approaches to SISR: interpolation-based, reconstruction-based and learning-based. Interpolation-based methods are very fast and easy to implement, but do not provide very precise results; among them, one of the most widely used is bicubic interpolation [6]. Reconstruction-based methods, on the other hand, are more sophisticated and often provide better results. However, their performance is severely limited by the scaling factor, since reconstruction quality degrades rapidly as this factor increases [1].
Deep-learning-based methods have become very popular in the past few years. However, deep learning super-resolution algorithms cannot be applied universally; they are specific to the type of images they are trained with. Moreover, since most existing SISR methods have been developed using synthetic data, their super-resolution performance degrades drastically on real-world images [7]. There is also another difficulty related to the HR-LR image pairs needed for training. To create these pairs, an HR image is usually downsampled to obtain its LR version, as sketched below. Nevertheless, there are cases in which HR versions of the images to super-resolve do not exist. Despite these drawbacks, deep learning methods have proven to be a much better alternative to traditional methods, offering great results both visually and in terms of metrics.
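As a minimal sketch of this common pair-generation scheme (assuming TensorFlow and a normalized HR image batch; the scale factor and shapes are illustrative):

```python
import tensorflow as tf

def make_lr(hr_batch: tf.Tensor, scale: int = 2) -> tf.Tensor:
    """Create synthetic LR images by bicubic downsampling of HR images.

    hr_batch: float tensor of shape (N, H, W, C), H and W divisible by scale.
    """
    h, w = hr_batch.shape[1], hr_batch.shape[2]
    return tf.image.resize(hr_batch, (h // scale, w // scale), method="bicubic")
```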
In this work, we propose a method for SISR of multispectral images using deep learning techniques. Specifically, we present a residual network-based model which incorporates a spectral attention mechanism. Such a mechanism allows our network to consider interdependencies among channels, highlighting the most informative ones.
The rest of the paper is arranged as follows. In Section 2, we review work related to SISR, focusing on deep learning techniques. In Section 3, we present our model for the super-resolution of Sentinel-2 images, together with all the information about the dataset used and the pre-processing of the images. The experiments that have been carried out are described in Section 4, including comparisons with other existing models. Finally, the conclusions are drawn in Section 5.
3. Materials and Methods
3.1. Proposal
The agri-food sector is one of the engines of Navarre's economy. The European Commission published the Green Deal in 2019, which was followed by two strategies that will have a great impact in the coming years: the Biodiversity Strategy and the "From Farm to Fork" Strategy. The latter deals with the transition of the European food system towards an economically, socially and environmentally sustainable system, and is the one that will mark the path of many of the policies that affect the agri-food sector, including the new Common Agricultural Policy (CAP). This strategy will lead the sector to adopt measures for a more environmentally sustainable production, achieving reductions in the use of phytosanitary products and mineral fertilizers and promoting an increase in organic production and the digitization of the food chain.
NAITEC is a Technology Centre specialized in mechatronics. As such, it wants to provide professional farmers, advisers and organizations with a tool that allows them to understand the evolution of crops in order to make predictive and precise decisions regarding their management, saving costs and reducing their environmental footprint.
It is well known that, from Sentinel images, it is possible to calculate vegetation indices such as the Normalized Difference Water Index (NDWI), the Normalized Difference Vegetation Index (NDVI) and the Normalized Difference Snow Index (NDSI), which are already being incorporated into different agricultural management software. Their standard definitions are recalled below.
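For reference, these are standard normalized band-difference indices. Expressed with Sentinel-2 band numbers (B3 = Green, B4 = Red, B8 = NIR, B11 = SWIR), and using the McFeeters form of the NDWI (other NDWI variants exist):

```latex
\mathrm{NDVI} = \frac{B_8 - B_4}{B_8 + B_4}, \qquad
\mathrm{NDWI} = \frac{B_3 - B_8}{B_3 + B_8}, \qquad
\mathrm{NDSI} = \frac{B_3 - B_{11}}{B_3 + B_{11}}
```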
With this in mind, we propose a new model for super-resolution which is specifically designed to work with multispectral images. This differentiates it from the state-of-the-art models, which are designed to work with RGB images. Additionally, our model incorporates the idea of channel attention, which takes advantage of the spectral correlations between bands. The result of this strategy is a model that not only meets its super-resolution purpose, but also exceeds the state-of-the-art methods presented so far.
3.2. Satellite Images
The Copernicus program [27] is a joint initiative of the European Commission, the Member States, the European Space Agency (ESA), the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT), the European Centre for Medium-Range Weather Forecasts (ECMWF), the EU Agencies and Mercator Ocean. The program provides operational information about our planet captured from space, which is useful for multiple security and environmental applications. The information services are free and openly accessible to users.
In this context, five different Earth observation missions, called Sentinels [4], have been planned to guarantee the provision of data. Sentinel-2 is a mission with a constellation of two multispectral polar-orbiting satellites monitoring the Earth. It provides images for several applications, such as the study of vegetation, soil or water.
The constellation is based on two identical satellites (Sentinel-2A and Sentinel-2B) located in the same orbit and separated by 180° for optimal coverage of the Earth. The first Sentinel-2 satellite was launched on 23 June 2015. These satellites carry a Multispectral Instrument (MSI) with 13 spectral bands.
Table 1 shows the spatial and spectral characteristics of Sentinel-2A and Sentinel-2B satellites.
Sentinel-2 images can be obtained through the "Copernicus Open Access Hub" platform [28], which provides access to images from the Sentinel-1, Sentinel-2, Sentinel-3 and Sentinel-5P constellations.
The other satellite we have used for the task of super-resolution is PlanetScope, a constellation of approximately 130 satellites operated by Planet [5]. It has a coverage of 200 million km² per day, which makes it capable of covering the entire Earth's surface daily. Its multispectral cameras capture four bands (Blue, Green, Red and Near-Infrared), and the Ortho Tile product GeoTIFFs are resampled at 3.125 m. It has operated since 2017, after the successful launches of 88 Dove satellites in February and of a further 48 Dove satellites in July.
3.3. Dataset
One of the main problems when trying to super-resolve the Sentinel-2 10 m spectral bands is that there are no higher-resolution images from this satellite to use as ground truth. To overcome this problem, two main solutions have been proposed:
For ×2 super-resolution, a model is trained to super-resolve the Sentinel-2 20 m bands to 10 m, and it is then used to super-resolve from 10 m to 5 m [29,30].
A high-resolution satellite as similar as possible to Sentinel-2 is selected and then used as ground truth [19,20,25].
Nevertheless, learning 5 m image features from 10 m ones gives very poor results, because the high-frequency details present at 5 m resolution cannot be found in a 10 m image. Therefore, such models are not capable of generalizing. For this reason, we decided to find a satellite as similar as possible to the one we want to super-resolve. The PlanetScope satellite is a good candidate, because its high coverage frequency makes it possible to find images referring to the same place and time as those of Sentinel-2. Moreover, while high-resolution satellite images are generally very expensive, Planet offers different alternatives to obtain its images for free.
The PlanetScope images used in our experiments are the Ortho Tile Analytic Surface Reflectance products. These are orthorectified, radiometrically corrected and atmospherically corrected to Bottom-of-Atmosphere (BOA) reflectance images. This is the image processing level we are interested in, because it represents the real reflectance of the ground, removing the distortions created by the gases of the atmosphere. The images have been obtained using the "Education and Research Standard Plan" of Planet [31], which has a download quota of 5000 km² per month.
On the other hand, the Sentinel-2 images are freely accessible and are also provided as BOA reflectance images. The images we have used are the available Sentinel-2 Level-2A products.
The study focuses on Navarre. This region is committed to the use of new techniques that enable sustainable agriculture. Satellite data are essential to determine the state of agroecosystems and to monitor vegetation and humidity in all productive areas. However, the images' resolution is not sufficient for woody crops (vines, fruit trees, etc.) or for the small farms that abound in Navarre.
The dataset consists of 31 pairs of Sentinel-PlanetScope images that were taken in this area during the years 2020-2022. An example of the images used in the study can be seen in Figure 2. The area of study has been separated into four parts: the north-east of the region (NE), the north-west (NW), the south-west (SW) and the south-east (SE). Table 2 shows the images used for the analysis and the set they have been assigned to.
3.4. Image Pre-Processing
The next step after downloading the images is to properly co-register them. The Sentinel-2 images cover a much larger area, so we crop them following the bounding box of the corresponding PlanetScope image. However, since the images come from two different sensors, some misregistrations still exist. In order to correct them, we use the publicly available Python package AROSICS [32], a library created to perform automatic subpixel co-registration of two satellite images. Before the corrections, the PlanetScope images are resampled to 5 m resolution for the case of ×2 super-resolution and to 2.5 m resolution for the case of ×4 super-resolution using bicubic interpolation [6].
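A minimal sketch of these two steps, assuming rasterio for the bicubic resampling and the AROSICS package for co-registration; the file names and the grid_res/window_size tuning values are illustrative:

```python
import rasterio
from rasterio.enums import Resampling
from arosics import COREG_LOCAL

# 1. Resample the PlanetScope image from its native 3.125 m to 5 m (the ×2 case).
with rasterio.open("planetscope.tif") as src:
    factor = src.res[0] / 5.0  # < 1: the output grid is coarser than the input
    data = src.read(
        out_shape=(src.count, int(src.height * factor), int(src.width * factor)),
        resampling=Resampling.cubic,  # bicubic interpolation
    )
    # (writing the resampled array back to disk is omitted here)

# 2. Local subpixel co-registration of the resampled PlanetScope image
#    against the cropped Sentinel-2 reference.
cr = COREG_LOCAL(
    "sentinel2_crop.tif",    # reference image
    "planetscope_5m.tif",    # target image to be shifted
    grid_res=200,            # tie-point grid resolution (pixels)
    window_size=(256, 256),  # matching window size
    path_out="planetscope_5m_coreg.tif",
    fmt_out="GTIFF",
)
cr.correct_shifts()
```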
Next, the PlanetScope images are divided into patches of the appropriate size for ×2 and ×4 super-resolution, respectively, while the Sentinel-2 images are divided into correspondingly smaller patches. We obtain 56,821 pairs of patches for ×2 super-resolution and 56,823 for ×4 super-resolution.
Histogram Matching [33] is applied to the PlanetScope patches to match them with the corresponding Sentinel-2 images, while maintaining the high-frequency components. This is a very common pre-processing step in many computer vision tasks, and in particular it has already been used in super-resolution tasks for remote sensing [19]. Additionally, in Section 5 we show that this is a fundamental step to preserve the spectral information of the original LR image. Finally, we match the PlanetScope patches to the bicubically upsampled versions of Sentinel-2 and normalize them.
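A sketch of this step using scikit-image's match_histograms (band-wise matching via channel_axis); the patch arrays are assumed to be (H, W, C) with the same band count:

```python
import numpy as np
from skimage.exposure import match_histograms

def match_to_sentinel(planet_patch: np.ndarray, sentinel_up: np.ndarray) -> np.ndarray:
    """Match each band of a PlanetScope patch to the histogram of the
    corresponding bicubically upsampled Sentinel-2 patch."""
    return match_histograms(planet_patch, sentinel_up, channel_axis=-1)
```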
3.5. Network Architecture
We propose a network for the super-resolution of multispectral satellite images named Spectral Attention Residual Network (SARNet). The model is based on SRResNet, the network proposed in [14] that first introduced the use of ResBlocks [15] for SISR.
Following the argumentation given in [16], we decided to study the effects of the Batch Normalization layer [34]. As mentioned before, those authors argue that these layers reduce the flexibility of the network and increase the GPU memory required for training. On the contrary, in [18] the authors find that, for real-world super-resolution, Batch Normalization can be beneficial due to the amount of noise in the images and the small size of the datasets. After carrying out our own experiments, we conclude that this layer helps to stabilize our training process.
One of the main differences with respect to the SRResNet network is that, instead of ResBlocks, we propose the use of RCABs [17], residual blocks that incorporate a channel attention mechanism. This mechanism makes the network focus on the most informative components of the input and leads to notable performance improvements over previous state-of-the-art methods. Furthermore, the attention block extracts the spectral dependencies that standard residual networks are not capable of capturing.
The architecture of an RCAB is shown in Figure 3. Global Average Pooling is applied as an information extractor, and its output is passed to a channel descriptor. At the end of the block, a sigmoid activation followed by an element-wise multiplication distributes the importance among channels.
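A sketch of an RCAB in Keras, under the assumptions that the channel descriptor is the usual two-layer 1×1-convolution bottleneck and that the reduction ratio is a free hyperparameter (Batch Normalization layers, which we found to stabilize training, could be inserted after each 3×3 convolution):

```python
import tensorflow as tf
from tensorflow.keras import layers

def rcab(x, filters=64, reduction=16):
    """Residual Channel Attention Block: two convolutions followed by a
    channel attention gate and a short skip connection (SSC)."""
    skip = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)

    # Channel attention: Global Average Pooling squeezes each channel to a
    # scalar, a bottleneck models inter-channel dependencies, and a sigmoid
    # produces per-channel weights that rescale the feature maps.
    w = layers.GlobalAveragePooling2D(keepdims=True)(y)       # (N, 1, 1, C)
    w = layers.Conv2D(filters // reduction, 1, activation="relu")(w)
    w = layers.Conv2D(filters, 1, activation="sigmoid")(w)
    y = layers.Multiply()([y, w])

    return layers.Add()([skip, y])  # short skip connection
```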
With the rise of deep learning, many studies have focused on the improvements achieved by increasing the depth of the models [1]. The authors of SRResNet proposed the use of 16 ResBlocks. In our experiments, a baseline model with eight RCABs is considered, but the benefit of using 16 blocks is also tested. We experimentally show in Section 4 the effects of working with a deeper network in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM).
After the residual blocks, we use upsampling layers to increase the resolution of the images, as in SRResNet. Each layer increases the resolution by a factor of two, so two upsampling layers are concatenated for the case of ×4 super-resolution. Each upsampling layer is originally formed by a convolutional layer to increase the number of filters, a Pixel Shuffle transformation to obtain a bigger image by reorganizing the low-resolution image channels, and a ReLU activation. Then, as proposed in [35], we introduce an Average Pooling layer to blur the output of the Pixel Shuffle operator, in order to prevent checkerboard artifacts [36]. Our upsampling layer can be seen in Figure 4.
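A sketch of this upsampling stage (the 4× channel expansion and the stride-1 pooling window are the natural choices for a ×2 Pixel Shuffle, but remain assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def upsample_x2(x, filters=64):
    """One ×2 upsampling stage: convolution -> Pixel Shuffle -> ReLU,
    followed by a stride-1 Average Pooling that blurs the result to
    suppress checkerboard artifacts."""
    y = layers.Conv2D(filters * 4, 3, padding="same")(x)        # 4x channels
    y = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(y)  # Pixel Shuffle
    y = layers.ReLU()(y)
    return layers.AveragePooling2D(pool_size=2, strides=1, padding="same")(y)
```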
Finally, we introduce Short Skip Connections (SSC) inside each RCAB and a Long Skip Connection (LSC) to help stabilize the network. This way, the low-frequency information goes through the skip connections and the main network can focus on learning high-frequency information.
Our model architecture is shown in Figure 5.
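Putting the pieces together, a sketch of the overall architecture (reusing the rcab and upsample_x2 sketches above; the head/tail convolution sizes are assumptions):

```python
from tensorflow.keras import layers, Model

def build_sarnet(scale=2, bands=4, filters=64, n_blocks=8):
    """Shallow feature extractor, a stack of RCABs closed by a long skip
    connection (LSC), and one ×2 upsampling stage per factor of two."""
    inp = layers.Input(shape=(None, None, bands))
    head = layers.Conv2D(filters, 3, padding="same")(inp)

    x = head
    for _ in range(n_blocks):
        x = rcab(x, filters)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Add()([head, x])  # long skip connection

    for _ in range({2: 1, 4: 2}[scale]):  # ×4 chains two ×2 stages
        x = upsample_x2(x, filters)

    out = layers.Conv2D(bands, 3, padding="same")(x)
    return Model(inp, out)
```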
3.6. Loss Function
Regarding the loss function, our starting point is the L1 metric instead of the commonly used L2. In [16], the authors show that the L1 loss provides better convergence than L2. Nevertheless, this loss function relies only on pixel-wise differences and is not capable of capturing other important aspects based on the content or style of the images. To overcome this issue, the authors of [10] propose a metric based on features extracted from the pre-trained VGG-16 network [12]. We briefly describe the VGG-16 network in order to explain our final choice.
Finally, the loss function we propose is a combination of the previous ones, where the three coefficients in Equation (5) represent the weights of the respective terms.
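As an illustration only (not a reproduction of Equation (5)): a three-term weighted combination of a pixel-wise L1 loss, a VGG-16 feature loss and a style loss in the spirit of [10]. The choice of VGG layer, the use of the first three bands as pseudo-RGB input, the omission of VGG input preprocessing and the default weights are all assumptions:

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# Frozen VGG-16 feature extractor (ImageNet weights, 3-channel input).
_vgg = VGG16(include_top=False, weights="imagenet")
_feat = tf.keras.Model(_vgg.input, _vgg.get_layer("block3_conv3").output)
_feat.trainable = False

def _gram(f):
    """Gram matrix over spatial positions, for the style term."""
    n = tf.shape(f)[0]
    h, w, c = tf.shape(f)[1], tf.shape(f)[2], tf.shape(f)[3]
    f = tf.reshape(f, (n, h * w, c))
    return tf.matmul(f, f, transpose_a=True) / tf.cast(h * w * c, f.dtype)

def combined_loss(y_true, y_pred, alpha=1.0, beta=1.0, gamma=1.0):
    ft = _feat(y_true[..., :3])  # assumption: bands ordered so the
    fp = _feat(y_pred[..., :3])  # first three act as pseudo-RGB
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))              # pixel term
    content = tf.reduce_mean(tf.square(ft - fp))              # feature term
    style = tf.reduce_mean(tf.square(_gram(ft) - _gram(fp)))  # style term
    return alpha * l1 + beta * content + gamma * style
```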
Therefore, as in the original SRResNet model, we study the benefit of using perceptual losses in the case of images coming from two different sensors. We also implement these losses with the VGG-16 network as a feature extractor. The results are presented in Section 4.
3.7. Evaluation Metrics
To measure the differences between target and predicted images, two standard metrics are considered: PSNR and SSIM.
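Both metrics are available in TensorFlow directly; a minimal sketch for batches of normalized images:

```python
import tensorflow as tf

def evaluate(y_true, y_pred, max_val=1.0):
    """Mean PSNR and SSIM over a batch of (N, H, W, C) images in [0, max_val]."""
    psnr = tf.reduce_mean(tf.image.psnr(y_true, y_pred, max_val=max_val))
    ssim = tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=max_val))
    return float(psnr), float(ssim)
```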
3.8. Training Details
We train our model with the Adam optimizer [38]. The initial learning rate is halved every 25 epochs, and we set the batch size to 16. The weights in Equation (5) are set to fixed values. Finally, our baseline model has a total of 848 K trainable parameters for the case of ×2 super-resolution and 994 K for ×4 super-resolution.
We implement the proposed model in TensorFlow and train it on an NVIDIA A100 GPU. Results are evaluated using the PSNR and SSIM metrics.
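A sketch of this training setup, reusing the build_sarnet and combined_loss sketches above; the initial learning rate, the epoch count and the random placeholder dataset are illustrative, not the paper's values:

```python
import tensorflow as tf

def schedule(epoch):
    # Initial learning rate (placeholder value) halved every 25 epochs.
    return 1e-4 * 0.5 ** (epoch // 25)

# Placeholder dataset of random (LR, HR) patch pairs, batched at 16.
lr = tf.random.uniform((64, 32, 32, 4))
hr = tf.random.uniform((64, 64, 64, 4))
train_ds = tf.data.Dataset.from_tensor_slices((lr, hr)).batch(16)

model = build_sarnet(scale=2, bands=4)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss=combined_loss)
model.fit(
    train_ds,
    epochs=100,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(schedule)],
)
```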
5. Conclusions and Future Work
Super-resolution of multispectral satellite images is a complex task, since the images usually come from different sensors. In this context, the pre-processing step is of great importance. We have demonstrated that an appropriate co-registration can make a big difference in the results. For example, it avoids pixel misalignments that affect pixel-wise loss functions such as L1.
In this paper, we have presented a new model for the super-resolution of the RGBN bands of the Sentinel-2 Multispectral Instrument from the original 10 m to either 5 m or 2.5 m. Our model, named SARNet, has proven to be superior to the rest of the state-of-the-art networks used for SISR. By incorporating a spectral channel attention mechanism, SARNet focuses on the spectral dependencies between bands, achieving improved results. We have also shown that standard loss functions such as L1 fail to pay attention to the perceptual characteristics of the images, while perceptual losses are a far better option. We have studied the benefits of using deeper models; our results show that deeper models take advantage of skip connections during training. Moreover, we have ensured that the spectral information of the images is preserved after the upsampling process through Histogram Matching.
In addition, we have dealt with the lack of data, one of the most common problems in deep learning. Even if there were more data, we would still need to take images from two different satellites, and these images should be as similar as possible, which again compromises the dataset size. Transfer learning could be a possible solution: a model is pre-trained only with images from the HR satellite, obtaining the corresponding LR images through downsampling, and is then fine-tuned with PlanetScope-Sentinel image pairs.
Another alternative is data augmentation, one of the most widely used methods when implementing a model with little data. However, the classical approach may not be the best choice for this task, mainly because the properties of multispectral satellite images are very different from those of the standard RGB images used in most studies. The authors of [39] warn about this issue and propose different approaches for applying data augmentation to satellite images.
This study focuses on Navarre, but other areas could be studied to create a more generalized model. Finally, other architectures could be analyzed. For example, GANs have proven to be a very powerful tool for the task of SISR [14,25].