1. Introduction
The twin Sentinel-2 satellites ensure global coverage with a revisit time of five days at the equator, providing a multi-resolution stack composed of 13 spectral bands, ranging from the visible to the short-wave infrared (SWIR), distributed over three resolution levels. Four bands lying between the visible and the near-infrared (NIR) are given at the finest resolution of 10 m, while the remaining ones are provided at 20 m (six bands) and 60 m (three bands), as a result of a trade-off between storage and transmission bandwidth limitations. The 10 and 20 m bands are commonly employed for land-cover or water mapping, agriculture or forestry, estimation of biophysical variables, and risk management (floods, forest fires, subsidence, and landslides), while the lower-resolution 60 m bands can be used for water vapor monitoring, aerosol corrections, pollution monitoring, cirrus cloud estimation, and so forth [1,2]. Specifically, beyond land-cover classification, S2 images can be useful in such diverse applications as the prediction of growing stock volume in forest ecosystems [3], the estimation of the Leaf Area Index (LAI) [4,5], the retrieval of canopy chlorophyll content [6], the mapping of glacier extent [7], water quality monitoring [8], the classification of crop or tree species [9], and the detection of built-up areas [10].
In light of its free availability, world-wide coverage, revisit frequency and, not least, its wide applicability remarked above, several research teams have proposed solutions to super-resolve Sentinel-2 images, raising the 20 m and/or 60 m bands up to 10 m resolution. Moreover, several works attest to the advantage of using super-resolved S2 images in applications such as water mapping [11], fire detection [12], urban mapping [13], and vegetation monitoring [14].
According to the taxonomy suggested by Lanaras et al. [2], resolution enhancement techniques can be gathered into three main groups: (i) pansharpening and related adaptations; (ii) imaging model inversion; and (iii) machine learning. In addition to these categories, it is also worth mentioning the matrix factorization approaches (e.g., [15,16]), which are more suited to the fusion of low-resolution hyperspectral images with high-resolution multispectral ones. In that case, the spectral variability becomes a serious concern to be handled carefully by means of unmixing-oriented methodologies [17,18]. The first category refers to classical pansharpening, where the super-resolution of low-resolution bands is achieved by injecting spatial information from a single spectrally-overlapping higher-resolution band. This is the case for many remote sensing systems such as Ikonos, QuickBird, GeoEye, WorldView, and so forth. The so-called component substitution methods [19,20], the multi-resolution analysis approaches [21,22], and other energy minimization methods [23,24,25] belong to this category. A recent survey on pansharpening can be found in [26]. Pansharpening methods can also be extended to Sentinel-2 images in different ways, although S2 bands at different resolutions present a weak or negligible spectral overlap, as shown by several works [27,28,29,30,31].
The second group refers to methods that cast super-resolution as an inverse problem under the hypothesis of a known imaging model. The ill-posedness is then addressed by means of additional regularization constraints encoded in a Bayesian or variational framework. Brodu’s super-resolution method [32] separates band-dependent from cross-band spectral information, ensuring the consistency of the “geometry of scene elements” while preserving their overall reflectance. Lanaras et al. [33] adopted an observation model with per-band point spread functions that accounts for convolutional blur, downsampling, and noise. The regularization consists of two parts: a dimensionality reduction that implies correlation between the bands, and a spatially varying, contrast-dependent penalization of the (quadratic) gradients learned from the 10 m bands. In a similar approach, Paris et al. [34] employed a patch-based regularization that promotes self-similarity of the images. The method proceeds hierarchically by first sharpening the 20 m bands and then the coarser 60 m ones.
The last category comprises machine learning approaches, notably deep learning (DL) ones, which have recently gained great attention from the computer vision and signal processing communities and nearby fields, including remote sensing. In this case, contrary to the previous categories, no explicit modeling (neither exact nor approximate) of the relationship between high- and low-resolution bands is required, since it is learned directly from data. Deep networks can in principle mimic very complex nonlinear relationships, provided that enough training data are available. In this regard, it is also worth recalling that the pansharpening of multi-resolution images is somewhat related to the unmixing of multi-/hyper-spectral images [17,18], since in both cases the general aim is to derive the different spectral responses covered by a single, spatially coarse observation. More specifically, however, the expectations in these two problems are considerably different: spectral unmixing is a pertinent solution when the interest is focused on surface materials, hence requiring high precision in the retrieval of the corresponding spectral responses without the need to improve their spatial localization. In pansharpening, the focus is mainly on spatial resolution enhancement while preserving as much as possible the spectral properties of the sources, and no specific information discovery about the radiometry of materials is typically expected. In fact, traditional pansharpening methods try to model spectral diversity, for example, by means of the modulation transfer function of the sensor [21,22], instead of using radiative transfer models associated with the possible land covers. In any case, from the deep learning perspective, this makes little difference once the goal is fixed and, more importantly, a sufficiently rich training dataset is provided, as the knowledge (model parameters) will come from experience (data). To the best of our knowledge, the first notable example of DL applied to the super-resolution of remote sensing images is the pansharpening convolutional neural network (PNN) proposed by Masi et al. [35], which has recently been upgraded [36] with the introduction of a residual learning block and a fine-tuning stage for target adaptivity and cross-sensor usage. Another residual network for pansharpening (PanNet) is proposed in [37]. However, none of these methods can be applied to S2 images without some architectural adaptation and retraining. Examples of convolutional networks conceived for Sentinel-2 are instead proposed in [2,11]. In [11], the super-resolution was limited to the SWIR 20 m band, as the actual goal was water mapping by means of the modified normalized difference water index (MNDWI), for which the green and SWIR bands were required. Lanaras et al. [2], instead, collected a very large training dataset which was used to train two much deeper super-resolution networks, one for the 20 m subset of bands and the other for the remaining 60 m bands, achieving state-of-the-art results. In related problems, for example the single-image super-resolution of natural images or more complex vision tasks such as object recognition or instance segmentation, thanks to the knowledge hidden in huge, shared training databases, deep learning has shown truly impressive results compared to model-based approaches. Data sharing has represented a key enabling factor in these cases, allowing researchers to compete with each other or reproduce others’ models. In light of this consideration, we believe that Sentinel-2 is a very interesting case because its free data access can serve as a playground for larger-scale research activity on remote sensing super-resolution and other tasks. In the same spirit, Lanaras et al. [2] leveraged the power of the data by collecting a relatively large dataset to obtain good generalization properties. On the other hand, complexity is also an issue that end users care about. In this regard, the challenge of our contribution is to design and train a relatively small and flexible network capable of achieving competitive results at a reduced cost on the super-resolution of the 20 m S2 bands, exploiting spatial information from the higher-resolution 10 m S2/VNIR bands. Indeed, being lightweight, the proposed network not only enables the use of the method on cheaper hardware, but also allows quickly fine-tuning it whenever the target data are misaligned with the training data. The proposed method for Fast Upscaling of SEntinel-2 (FUSE) images is an evolution of the proof-of-concept work presented in [38]. In particular, the major improvements with respect to the method in [38] reside in the following changes:
- a. Architectural improvements with the introduction of an additional convolutional layer.
- b. The definition of a new loss function which accounts for both spectral and structural consistency.
- c. An extensive experimental evaluation using diverse datasets for testing that confirms the generalization capabilities of the proposed approach.
The rest of the paper is organized as follows. In Section 2, we describe the datasets and the proposed method. Evaluation metrics, comparative solutions, and experimental results are then gathered in Section 3. Insights about the performance of the proposed solution and related future perspectives are given in Section 4. Finally, Section 5 provides concluding remarks.
2. Materials and Methods
The development of a deep learning super-resolution method suited to a given remote sensing imagery involves at least three key steps, with some iterations among them:
- a. Selection/generation of a suitable dataset for training, validation and test;
- b. Design and implementation of one or more DL models;
- c. Training and validation of the models (b) using the selected dataset (a).
By following this rationale, for ease of presentation, in this section, we first present the datasets and their preprocessing (a), then we describe design (b) and training (c) of the proposed model.
2.1. Datasets and Labels Generation
Regardless of its complexity and capacity, a target deep learning model remains a data-driven machinery whose ultimate behavior heavily depends on the training dataset, notably on its representativeness of real-world cases. Hence, we provide here detailed information about our datasets and their preprocessing.
For the sake of clarity, let us first recall the main characteristics of the 13 spectral bands of Sentinel-2, gathered in Table 1, and clarify the symbols and notations used in the following with the help of Table 2.
Except for some cases where unsupervised learning strategies can be applied, a sufficiently large dataset containing input–output examples is usually necessary to train a deep learning model. This is also the case for super-resolution and pansharpening. In our case, we decided to fuse the 10 m bands with the 20 m ones in order to enhance the resolution of the latter by a factor of 2 (the resolution ratio), which means that we need examples pairing each composite input instance (the 10 m and 20 m band stacks) with the desired (super-resolved) output. In rare cases, one can rely on referenced data, for example thanks to ad hoc missions that collect full-resolution data to be used as reference, whereas in most cases referenced samples are unavailable.
Under the latter assumption, many deep learning solutions for super-resolution or pansharpening have been developed (e.g., [2,11,35,36,39,40,41]) by means of a proper schedule for generating referenced training samples from the same no-reference input dataset. It consists of a resolution downgrade process that each input band undergoes, involving two steps:
- (i) band-wise low-pass filtering; and
- (ii) uniform spatial subsampling by a factor R, R being the target super-resolution factor.
This is aimed at shifting the problem from the original full-resolution domain to a reduced-resolution domain. In our case, R = 2, and the two original input components, the 10 m and 20 m band stacks, are transformed into corresponding reduced-resolution versions at 20 m and 40 m, respectively, with the associated reference trivially given by the original 20 m bands. How to filter the several bands before subsampling is an open question. Lanaras et al. [2] pointed out that with deep learning one does not need to specify sensor characteristics, for instance spectral response functions, since sensor properties are implicit in the training data. On the contrary, Masi et al. [35] asserted that the resolution scaling should be done accounting for the sensor Modulation Transfer Function (MTF), in order to generalize properly when applied at full resolution. This position follows the same rationale as the so-called Wald’s protocol, a procedure commonly used for generating referenced data for the objective comparison of pansharpening methods [26]. Actually, this controversial point cannot be resolved by looking at the performances in the reduced-resolution space, since a network learns the due relationship from the training data whatever preprocessing has been performed on the input. On the other hand, in the full-resolution domain, no objective measures can be used because of the lack of referenced test data. In this work, we follow the approach proposed in [35], making use of the sensor MTF. The process for the generation of a training sample is summarized in Figure 1. Each band undergoes a different low-pass filtering, prior to being downsampled, whose cut-off frequency is related to the sensor MTF characteristics. Additional details can be found in [42].
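For illustration, the MTF-matched downgrade can be sketched as follows. The Gaussian approximation of the MTF and the Nyquist gain of 0.3 used in the example are our own illustrative assumptions, not Sentinel-2 specifications; the actual per-band gains are sensor characteristics.

```python
import numpy as np

def mtf_gaussian_sigma(gain_nyquist: float, ratio: int) -> float:
    """Std of a Gaussian whose frequency response equals the sensor MTF
    gain at the Nyquist frequency of the decimated (1/ratio) grid:
    exp(-2 * pi^2 * sigma^2 * f^2) = gain  at  f = 1 / (2 * ratio)."""
    return (ratio / np.pi) * np.sqrt(-2.0 * np.log(gain_nyquist))

def gaussian_kernel(sigma: float, half_width: int = 7) -> np.ndarray:
    t = np.arange(-half_width, half_width + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()                      # normalized 1-D kernel

def downgrade(band: np.ndarray, gain_nyquist: float, ratio: int = 2) -> np.ndarray:
    """Wald-style resolution downgrade: MTF-matched blur, then decimation."""
    k = gaussian_kernel(mtf_gaussian_sigma(gain_nyquist, ratio))
    pad = len(k) // 2
    padded = np.pad(band, pad, mode="reflect")
    # separable low-pass filtering, rows then columns
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, "valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, "valid"), 0, blurred)
    return blurred[::ratio, ::ratio]        # uniform subsampling by `ratio`
```

For R = 2 and a Nyquist gain of 0.3, the resulting standard deviation is close to one pixel of the fine grid, consistent with the gentle roll-off of typical optical MTFs.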
Another rather critical issue is the selection of the training dataset, as it impacts the capability of the trained models to generalize well on unseen data. In the computer vision domain, a huge effort has been devoted to the collection of very large datasets in order to support the development of deep learning solutions for such diverse problems as classification, detection, semantic segmentation, video tracking, and so forth (notable examples are the ImageNet and KITTI datasets). Within the remote sensing domain, instead, there are no examples of datasets as large as ImageNet or KITTI. This is due to several obstacles, among which are the cost of the data and of the related labeling, which requires domain experts, as well as the data sharing policy usually adopted in past years by the remote sensing community. Luckily, for super-resolution, one can at least rely on the above-described fully-automated resolution downgrading strategy to avoid labeling costs. Due to the scarcity of data, most deep-learning models for resolution enhancement applied to remote sensing have been trained on relatively small datasets, possibly taken from a few large images, from which non-overlapping sets for training, validation and testing are singled out [35,37,41]. The generalization limits of a pansharpening model trained on too few data have been stressed in [36], for both cross-image and cross-sensor scenarios, where a fine-tuning stage was proposed to cope with the scarcity of data. In particular, it was shown that, for a relatively small CNN that integrates a residual learning module, a few training iterations (fine-tuning) on the reduced-resolution version of the target image allow quickly recovering the performance loss due to the misalignment between training and test sets. For Sentinel-2 imagery, thanks to the free access guaranteed by the Copernicus program, larger and more representative datasets can be collected, as done by Lanaras et al. [2], aiming for a roughly even distribution over the globe and for variety in terms of climate zone, land-cover and biome type. In this study, we opted for a lighter and more flexible solution, with a relatively small number of parameters to learn and a (pre-)training dataset of relatively limited size. This choice is motivated by the experimental observation that, in actual applications, the tuning of the parameters is still advisable even when larger datasets have been used in training, which makes lighter solutions that can be quickly tuned particularly appealing.
To be aligned with the work of Lanaras et al. [2], we kept their setting by using Sentinel-2 data without atmospheric correction (L1C product) for our experiments. For training and validation, we referred to three scenes (see Figure 2), corresponding to different environmental contexts: Venice, Rome, and Geba River. In particular, we randomly cropped 18,996 square tiles of size 33 × 33 (at 20 m resolution) from the three selected scenes to be used for training (15,198) and validation (3898). In addition, we chose four more scenes for testing, namely Athens, Tokyo, Addis Ababa, and Sydney, which present different characteristics, hence allowing for a more robust validation of the proposed model. From each of these sites, we singled out three 512 × 512 crops at 10 m resolution, for a total of twelve test samples.
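The random tiling used to build the training and validation sets can be sketched as follows; the function name and the placeholder scene in the usage are ours.

```python
import numpy as np

def random_tiles(scene: np.ndarray, n: int, size: int = 33, seed: int = 0):
    """Draw n random square tiles of side `size` from a (H, W, bands) scene."""
    rng = np.random.default_rng(seed)
    h, w = scene.shape[:2]
    tiles = []
    for _ in range(n):
        r = rng.integers(0, h - size + 1)   # top-left row of the crop
        c = rng.integers(0, w - size + 1)   # top-left column of the crop
        tiles.append(scene[r:r + size, c:c + size])
    return tiles
```

In practice, each tile would then undergo the resolution downgrade of Figure 1 to form a referenced input–output training pair.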
2.2. Proposed Method
The proposed solution takes inspiration from two state-of-the-art CNN models for pansharpening, namely PanNet [37] and the target-adaptive version [36] of PNN [35], both conceived for very high resolution sensors such as Ikonos or WorldView-2/3. Both methods rely on a residual learning scheme, while the main differences concern the loss function, the input preprocessing, and the overall shape and size of the network backbone.
Figure 3 shows the top-level flowchart of the proposed method. As we deal with Sentinel-2 images, differently from Yang et al. [37] and Scarpa et al. [36], we have 10 input bands: six lower-resolution (20 m) bands to be super-resolved, plus four higher-resolution (10 m) bands. Let us preliminarily point out that we train a single (relatively small) network for each band to be super-resolved, as represented at the output in Figure 3. The deterministic preprocessing bounded by the dashed box is shared, while the core CNN, with fixed hyper-parameters, changes from one band to another. This choice presents two main advantages. The first is that, whenever users need to super-resolve only a specific band, they can resort to a lighter solution with computational advantages. The second is related to the experimental observation that training the six networks separately allows reaching the desired loss levels more quickly than using a single wider network. This feature is particularly desirable if users need to fine-tune the network on their own dataset. Turning back to the workflow, observe that both input subsets, the 10 m and 20 m stacks, are high-pass filtered (HPF), as also done by PanNet. This operation relies on the intuition that the missing details the network is asked to recover lie in the high-frequency range of the input image. Next, the 20 m HPF component is upsampled by a factor of 2 using a standard bicubic interpolation, in order to match the size of the 10 m component with which it is concatenated before feeding the actual CNN. The single-band CNN output is then combined with the upsampled target band to provide its super-resolved version. This last combination, obtained through a skip connection that retrieves the low-resolution content of the target band directly from the input, is known as the residual learning strategy [43], and soon became a standard option for deep-learning-based super-resolution and pansharpening [2,36,37], as it has been proven to speed up the learning process.
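A minimal sketch of this workflow is given below, where a box filter stands in for the actual low-pass used in the HPF step, nearest-neighbor replication replaces bicubic interpolation for brevity, and a stub takes the place of the trained CNN.

```python
import numpy as np

def low_pass(band: np.ndarray, k: int = 5) -> np.ndarray:
    """Crude box-filter low-pass (stand-in for the actual filter)."""
    pad = k // 2
    p = np.pad(band, pad, mode="reflect")
    out = np.zeros_like(band, dtype=float)
    for dr in range(k):
        for dc in range(k):
            out += p[dr:dr + band.shape[0], dc:dc + band.shape[1]]
    return out / (k * k)

def high_pass(band: np.ndarray) -> np.ndarray:
    return band - low_pass(band)           # detail (HPF) component

def upsample2(x: np.ndarray) -> np.ndarray:
    return np.kron(x, np.ones((2, 2)))     # nearest-neighbor; bicubic in the paper

def super_resolve_band(bands10, band20, cnn):
    """Residual scheme: the CNN predicts the missing details, and a skip
    connection adds them back to the plain upsampled 20 m target band."""
    stack = np.stack([high_pass(b) for b in bands10]
                     + [upsample2(high_pass(band20))])
    return upsample2(band20) + cnn(stack)  # residual skip connection
```

With a zero-output CNN stub, the scheme degenerates to plain interpolation of the target band, which makes the role of the skip connection explicit.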
The CNN architecture is closer to the pansharpening models [35,36] than to PanNet [37], making use of just four convolutional layers, whereas PanNet uses ten, each one extracting 32 features (except for the output layer). Moreover, a batch normalization layer operating on the input stack precedes the convolutional ones. This has proven to make the learning process robust with respect to the statistical fluctuations of the training dataset [44]. The network hyper-parameters of the convolutional layers are summarized in Table 3.
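As a rough indication of the model size, a back-of-the-envelope parameter count for a hypothetical configuration can be sketched as follows; the 5-channel input stack, the 3 × 3 kernels, and the 32 features per hidden layer are our assumptions for illustration only (the actual hyper-parameters are those of Table 3).

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of one convolutional layer: weights plus biases."""
    return (k * k * c_in + 1) * c_out

# Hypothetical 4-layer configuration: 5-channel HPF input stack,
# 32 features per hidden layer, single-band output (all assumptions).
layers = [(5, 32, 3), (32, 32, 3), (32, 32, 3), (32, 1, 3)]
total = sum(conv_params(*layer) for layer in layers)
```

Even with such guessed hyper-parameters, the count lands in the few-tens-of-thousands range, orders of magnitude below networks with millions of parameters.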
Training
Once the training dataset and the model are fixed, a suitable loss function to be minimized needs to be defined in order for the learning process to take place. The ℓ2 and ℓ1 norms are typical choices [2,35,36,38,39] due to their simplicity and robustness, with the ℓ1 norm being probably more effective at speeding up the training, as observed in [2,36]. However, these measures do not account for structural consistency, as they are computed on a pixel-wise basis and therefore assess only spectral dissimilarity. To cope with this limitation, an option is to resort to a so-called perceptual loss [45], which is an indirect error measurement performed in a suitable feature space generated with a dedicated CNN. In [37], structural consistency is enforced by working directly on detail (HPF) bands. In the proposed solution, in addition to the use of HPF components, we also define a combined loss that explicitly accounts for spectral and structural consistency. In particular, inspired by the variational approach of [46], we make use of the following loss function:
L(Θ) = α L_fid(Θ) + β L_struct(Θ) + γ L_reg(Θ),   (1)

where three terms, corresponding to fidelity, or spectral consistency (L_fid), structural consistency (L_struct), and regularity (L_reg), are linearly combined. The weights α, β, and γ were tuned experimentally using the validation set.
Following the intuition proposed in [2,36], we based the fidelity term on the ℓ1 norm, that is,

L_fid(Θ) = E ‖f_Θ(I↓) − x↓‖₁,   (2)

where the expectation E is estimated on the reduced-resolution training minibatches during the gradient descent procedure, and f_Θ stands for the CNN function (including preprocessing), whose parameters to learn are collectively indicated with Θ. This loss term, as well as the other two, refers to the super-resolution of a single band, with I↓ the composite reduced-resolution input and x↓ the corresponding ground-truth. As the training is performed in the reduced-resolution domain, in the remainder of this section we drop the subscript ↓ for the sake of simplicity.
The structural consistency term is given by

L_struct(Θ) = E ‖G(f_Θ(I)) − G(P)‖_p^p,   (3)

where the operator G generalizes the gradient operator by including derivatives along the diagonal directions, which help to improve quality, as shown in [46], and P denotes the high-resolution reference band. It has been shown that the gradient distribution of real-world images is better fit by a heavy-tailed distribution such as a hyper-Laplacian. Accordingly, we make use of a p-norm with p < 1, which we believe can be more effective [46]. This term penalizes discontinuities in the super-resolved band if they do not occur, with the same orientation, in the panchromatic band. As the dynamics of these discontinuities are different, an additional prior regularization term that penalizes the total variation of the super-resolved band helps to avoid unstable behaviors:

L_reg(Θ) = E [ TV( f_Θ(I) ) ].   (4)
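The three loss terms can be sketched on a single minibatch item as follows; the weight values and the exponent p = 0.5 are illustrative placeholders, not the tuned values from the validation set.

```python
import numpy as np

def grads(x: np.ndarray):
    """Horizontal, vertical, and the two diagonal finite differences."""
    return (x[:, 1:] - x[:, :-1],
            x[1:, :] - x[:-1, :],
            x[1:, 1:] - x[:-1, :-1],
            x[1:, :-1] - x[:-1, 1:])

def fuse_loss(pred, target, guide, alpha=1.0, beta=1.0, gamma=0.1, p=0.5):
    """Composite loss: l1 fidelity + hyper-Laplacian structural term that
    matches the guide's gradients + total-variation regularization.
    alpha/beta/gamma/p are illustrative placeholders."""
    fid = np.mean(np.abs(pred - target))                        # spectral term
    struct = sum(np.mean(np.abs(gp - gg) ** p)
                 for gp, gg in zip(grads(pred), grads(guide)))  # structural term
    dh, dv, _, _ = grads(pred)
    tv = np.mean(np.abs(dh)) + np.mean(np.abs(dv))              # regularity term
    return alpha * fid + beta * struct + gamma * tv
```

When the prediction matches the target and its gradients match those of the guide, all three terms vanish, so the loss is zero at the ideal solution.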
Eventually, the network parameters were (pre-)trained by means of the Adaptive Moment Estimation (ADAM) optimization algorithm [47], applied to the overall loss defined above (Equation (1)). In particular, we set the ADAM default hyper-parameters [48], namely a learning rate of 0.001 and decay rates of 0.9 and 0.999 for the first and second moments, respectively. The training was run for 200 epochs, an epoch being a single pass over all the minibatches (118) into which the training set was split, with each minibatch composed of 128 input–output samples of size 33 × 33.
4. Discussion
To assess the impact of the proposed changes with respect to the baseline HP-M5, namely the additional convolutional layer and the composite loss that adds a regularization term and a structural term to the basic spectral (ℓ1-norm) loss, we also carried out an ablation study. In particular, we considered the three-layer scaled version of FUSE and the four-layer version trained without the regularization and structural loss terms. These two solutions are also reported in Table 4. As can be seen, except for the SAM index, the full version of FUSE consistently outperforms both scaled versions, with remarkable gains on ERGAS, in the reference-based framework, and on the spatial distortion index, in the no-reference context. Focusing on the two ablations, it seems that the use of the composite loss has a relatively larger impact than the increase in network depth. This is particularly evident looking at the SAM indicator.
The experimental evaluation presented above confirms the great potential of the DL approach in the context of the data fusion problem at hand, as already seen for pansharpening [35] and for the single-image super-resolution of natural images [39] a few years ago. The numerical gap between DL methods and the others is considerable and is confirmed by visual inspection. In particular, we observe that the use of the additional structural loss term, the most relevant change with respect to our previous models M5 and HP-M5, allowed us to reach and slightly exceed the accuracy level of DSen2. Besides accuracy, it is worth focusing on the related computational burden. DL methods, in fact, are known to be computationally demanding, hence potentially limited in large-scale applicability. Thus, we focused from the beginning on relatively small CNN models. Indeed, the proposed model involves about 28K parameters, in contrast to DSen2, which has 2M parameters. In Table 5, we gather a few figures obtained experimentally on a single Quadro P6000 GPU with 24 GB of memory. For both the proposed method and DSen2, we show the GPU memory load and the inference time as functions of the image size.
As the proposed model is replicated, with different parameters, for each of the six bands to be super-resolved, we assume either a sequential GPU usage (as done in the table) or a parallel implementation, with 6× memory usage but also 6× faster processing. In any case, to get a rough idea of the different burden, it suffices to observe that, using about one third of the memory DSen2 needs to super-resolve a 512 × 512 image, FUSE can super-resolve a 16× larger image (2048 × 2048) in the same time slot. Moreover, in many applications the user may be interested in super-resolving a single band, thus saving additional computation and/or memory. Finally, this picture does not consider the less critical training phase or a possible fine-tuning stage, which would further highlight the advantage of using a smaller network. As a reference, according to Lanaras et al. [2], DSen2 was trained in about three days on an NVIDIA Titan Xp 12 GB GPU, whereas the training of our model took about 3 h on a Titan X 12 GB.