Article

Enhanced Multi-Stream Remote Sensing Spatiotemporal Fusion Network Based on Transformer and Dilated Convolution

College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(18), 4544; https://doi.org/10.3390/rs14184544
Submission received: 29 July 2022 / Revised: 27 August 2022 / Accepted: 7 September 2022 / Published: 11 September 2022

Abstract

Remote sensing images with high temporal and spatial resolutions play a crucial role in land surface-change monitoring, vegetation monitoring, and natural disaster mapping. However, existing technical conditions and cost constraints make it very difficult to directly obtain remote sensing images with high temporal and spatial resolution. Consequently, spatiotemporal fusion technology for remote sensing images has attracted considerable attention, and deep learning-based fusion methods have developed rapidly in recent years. In this study, to improve the accuracy and robustness of deep learning models and better extract the spatiotemporal information of remote sensing images, the existing multi-stream remote sensing spatiotemporal fusion network MSNet is improved using dilated convolution and an improved transformer encoder to develop an enhanced version called EMSNet. Dilated convolution is used to extract temporal information while reducing the number of parameters. The transformer encoder is further adapted to image fusion so that it can effectively extract spatiotemporal information. A new weight strategy is used for fusion that substantially improves the prediction accuracy of the model, image quality, and fusion effect. The superiority of the proposed approach is confirmed by comparing it with six representative spatiotemporal fusion algorithms on three disparate datasets. Compared with MSNet, EMSNet improves SSIM by 15.3% on the CIA dataset, ERGAS by 92.1% on the LGC dataset, and RMSE by 92.9% on the AHB dataset.

Graphical Abstract

1. Introduction

Remote sensing images are generated by various types of satellite sensors, such as the Moderate Resolution Imaging Spectroradiometer (MODIS), Landsat-equipped sensors, and Sentinel. MODIS sensors are usually installed on the Terra and Aqua satellites, which can circle the earth in half a day or one day, so the data they acquire have superior temporal resolution. However, the spatial resolution of MODIS data (i.e., the rough image) is very low, only 250–1000 m [1]. By contrast, data (the fine image) acquired by Landsat have higher spatial resolution (15–30 m) and capture sufficient surface-detail information, but their temporal resolution is very low because of the 16-day revisit cycle [1]. In practical applications, we often need remote sensing images with high temporal and spatial resolution. For example, images with high temporal and spatial resolutions can be used for research in the fields of heterogeneous regional surface change [2,3], seasonal vegetation monitoring [4], real-time natural disaster mapping [5], and land-cover change [6]. Unfortunately, current technical and cost constraints, coupled with noise such as cloud cover in some areas, make it challenging to directly obtain remote sensing products with high temporal and spatial resolution, and a single high-resolution image cannot meet practical needs. To fill these gaps, spatiotemporal fusion has attracted considerable attention. In spatiotemporal fusion, the two types of images are fused together with the aim of obtaining images with high spatiotemporal resolution [7,8].
Existing spatiotemporal fusion methods can generally be subdivided into four categories: unmixing-based, reconstruction-based, dictionary pair learning-based, and deep learning-based.
Unmixing-based methods unmix the spectral information at the prediction moment and then use the unmixed result to predict the unknown image with high spatial and temporal resolution. Multi-sensor multi-resolution image fusion (MMFN) [9] was the first fusion method to apply the idea of unmixing. MMFN reconstructs the MODIS and Landsat images separately: first, the MODIS image is spectrally unmixed, and then the unmixed result is spectrally reset on the Landsat image to obtain the final reconstruction result. Wu et al. considered the issue of nonlinear time-varying similarity and spatial variation in spectral unmixing, improved MMFN, and obtained a new spatiotemporal fusion method, STDFA [10], which also achieved good fusion results. A flexible spatiotemporal data-fusion algorithm, FSDAF [11], has also been proposed, which combines the unmixing method, spatial interpolation, and the spatial and temporal adaptive reflectance fusion model (STARFM) into a new algorithm that is computationally inexpensive, fast, and accurate, and performs well in heterogeneous regions.
The core idea of reconstruction-based algorithms is to calculate the weights of similar adjacent pixels in the spectral information of the input and then combine them. STARFM was the first method to use reconstruction for fusion [8]. In STARFM, the reflectance changes of pixels between the rough image and the fine image are assumed to be continuous, and the weights of adjacent pixels can be calculated to reconstruct a surface-reflectance image with high spatial resolution. In light of STARFM’s large number of computations and the need to improve the reconstruction effect for heterogeneous regions, Zhu et al. made improvements and proposed an enhanced version of STARFM called ESTARFM [12]. They used two different coefficients to deal with the weights of adjacent pixels in homogeneous and heterogeneous regions, achieving a better effect. Inspired by STARFM, the spatial temporal adaptive algorithm for mapping reflectance changes (STAARCH) [13] also achieves good results. Overall, the difference between these algorithms lies in how the weights of adjacent pixels are calculated. Although these algorithms generally give good results, they are unsuitable for areas that change too much or too quickly.
Dictionary learning-based methods mainly learn the correspondence between the two types of remote sensing images to perform prediction. The sparse representation-based spatiotemporal reflectance fusion method (SPSTFM) [14] may be the first fusion method to successfully apply dictionary learning. SPSTFM assumes that the sparse coefficients of low-resolution and high-resolution images are the same, and introduces super-resolution ideas from the field of natural images into spatiotemporal fusion; images are reconstructed by establishing correspondences between low- and high-resolution images. However, in practical situations, the same coefficients may not be applicable to some of the data obtained under existing conditions [15]. Wei et al. studied the explicit mapping between low- and high-resolution images and proposed a new fusion method based on dictionary learning and compressive sensing theory, called compressive sensing spatiotemporal fusion (CSSF) [16], which noticeably improves the accuracy of the prediction results, but the training time also increases considerably while the efficiency decreases. In this regard, Liu et al. proposed an extreme learning machine-based method called ELM-FM for spatiotemporal fusion [17], which considerably reduces the training time and improves efficiency.
As deep learning has gradually been applied in various fields in recent years, deep learning-based spatiotemporal fusion methods for remote sensing have also advanced. For example, Song et al. proposed STFDCNN [18] for spatiotemporal fusion using a convolutional neural network. In STFDCNN, the image-reconstruction process is treated as a super-resolution and nonlinear mapping problem: a super-resolution network and a nonlinear mapping network are constructed through an intermediate-resolution image, and the final fusion result is obtained through high-pass modulation. STFDCNN achieved good results. Liu et al. proposed a two-stream CNN, StfNet [19], for spatiotemporal fusion; they effectively extracted and fused spatial details and temporal information using spatial consistency and temporal dependence, and achieved good results. On the basis of spatial consistency and temporal dependence, Chen et al. introduced a multiscale mechanism for feature extraction and proposed a spatiotemporal remote sensing image-fusion method based on a multiscale two-stream CNN (STFMCNN) [20]. Jia et al. proposed a new deep learning-based two-stream convolutional neural network [21] that fuses temporal variation information with spatial detail information by weight, which enhances its robustness. Furthermore, Jia et al. adopted different prediction methods for phenological change and land-cover change, and proposed a hybrid deep learning-based spatiotemporal fusion method to combine satellite images with differing resolutions [22]. Tan et al. proposed DCSTFN [23] to derive high spatiotemporal-resolution remote sensing images using CNNs, based on convolution and deconvolution combined with the fusion method of STARFM. To address the information loss in the deconvolution-based reconstruction process, Tan et al. later increased the number of prior-moment inputs, added a residual coding block, and used a composite loss function to improve the learning ability of the network, proposing an enhanced convolutional neural network, EDCSTFN [24], for spatiotemporal fusion. In addition, CycleGAN-STF [25] introduces other ideas from the vision field into spatiotemporal fusion: it achieves fusion through image generation with CycleGAN, which is used to generate a fine image at the prediction time; the real image at the prediction time is used to select the closest generated image, and finally FSDAF is used for fusion. Other fusion methods are applied in specific scenarios. For example, STTFN [26], a CNN-based model for spatiotemporal fusion of surface-temperature changes, uses a multiscale CNN to establish a nonlinear mapping relationship and a spatiotemporal-continuity weight strategy for fusion, achieving good results. DenseSTF [27], a deep learning-based spatiotemporal data-fusion algorithm, uses a block-to-point modeling strategy and model comparison to provide rich texture details for each target pixel to deal with heterogeneous regions, and achieves very good results. Furthermore, with the development of transformer models [28] in the natural language field, many researchers have introduced the concept into the vision field as well; for example, the vision transformer (ViT) [29], data-efficient image transformer (DeiT) [30], conditional position encoding vision transformer (CPVT) [31], transformer-in-transformer (TNT) [32], and convolutional vision transformer (CvT) [33] can be used for image classification.
In addition, there are the Swin transformer [34] for image classification, image segmentation, and object detection, and the texture transformer [35] for image super-resolution. These variants have gradually been introduced into the spatiotemporal fusion of remote sensing. For example, MSNet [36] is a method obtained by introducing the original transformer and ViT into spatiotemporal fusion; it learns the global temporal correlation information of the image through the transformer structure and uses a convolutional neural network to establish the relationship between input and output, finally obtaining a good effect. SwinSTFM [37] introduces the Swin transformer and combines it with linear spectral mixing theory, which improves the quality of the generated images. There is also MSFusion [38], which introduces the texture transformer into spatiotemporal fusion and has achieved quite good results on multiple datasets.
Existing spatiotemporal fusion algorithms perform a certain amount of information extraction and noise processing during the fusion process, but certain gaps remain. First, the acquisition and processing of suitable datasets is not easy: owing to the existence of noise, the data that can be directly used for research are insufficient, and in deep learning the size of the dataset affects the learning ability during reconstruction, so achieving good reconstruction with small datasets is a major challenge. Second, the same fusion model can have different prediction performance on different datasets, i.e., the model is not robust. Furthermore, the features extracted by a CNN alone are not sufficient, and increasing the network depth can also lead to feature loss.
In order to address the aforementioned challenges, this study improves MSNet and proposes an enhanced version of the spatiotemporal fusion method of multi-stream remote sensing images called EMSNet. In EMSNet, the input image adopts the original scale size, and the rough image is no longer scaled to fully extract the temporal information and reduce the loss. The main contributions of this paper are summarized as follows.
(1)
The number of prior input images required by the model is reduced from five to three, which achieves better results with less input, so that even a dataset with a small amount of data can reconstruct images with better effects.
(2)
The transformer encoder structure is introduced and its projection method improved to obtain the improved transformer encoder (ITE), which is adapted to remote sensing spatiotemporal fusion, effectively learns the relationship between local and global information in rough and fine images, and effectively extracts temporal and spatial information.
(3)
Dilated convolution is used to extract temporal information, which expands the receptive field while keeping the parameter quantity unchanged and fully extracts a large amount of temporal feature information contained in the rough image.
(4)
A new feature-fusion strategy is used to fuse the features extracted by the ITE and dilated convolution based on their differences from real predicted images in order to avoid introducing noise.
The rest of the article is organized as follows. The overall structure of EMSNet and its internal modules and weight strategy are introduced in Section 2. Experimental results are described in Section 3, along with the datasets used. Section 4 discusses the performance of EMSNet. Finally, conclusions are provided in Section 5.

2. Methods

2.1. EMSNet Architecture

Figure 1 shows the overall structure of EMSNet, where $M_i$ ($i = 1, 2$) represents the MODIS image at time $t_i$, $L_i$ represents the Landsat image at time $t_i$, and $Pre\_L_2$ represents the prediction result of the fused image at time $t_2$ based on time $t_1$. Rectangles of different colors represent different operations, including convolution, dilated convolution, the ReLU activation function, and the various operations inside the improved transformer encoder (ITE). EMSNet is an end-to-end structure, which can be divided into three parts:
  • ITE-related modules, used to extract temporal change information and spatial texture detail features and learn local and global correlation information;
  • an extraction network composed of convolution and dilated convolution, used to establish a nonlinear relationship between input and output, while fully extracting the features of time information;
  • a weight strategy, used to calculate the corresponding weight according to the difference between the features obtained in the above two parts and the real prediction map for final fusion.
A detailed description of each module can be found in Section 2.2, Section 2.3 and Section 2.4.
In this study, three images of the same size are used as input: a pair of MODIS–Landsat images at the prior time $t_1$ and a MODIS image at the prediction time $t_2$. The overall procedure of EMSNet is as follows:
(1)
First, we subtract $M_1$ from $M_2$ to get $M_{12}$, which represents the change area between the two times and provides temporal-change information. We input $M_{12}$ into the feature-extraction network composed of convolution and dilated convolution to fully extract the temporal information contained in it.
(2)
Second, we add $M_{12}$ and $L_1$ and input the result into the ITE to extract the rich temporal information and spatial texture detail information, and simultaneously learn the connection between local and global information.
(3)
Inspired by ResNet [39] and DenseNet [40]: as the network depth increases, the temporal and spatial information in the input image may be lost during transmission. Therefore, we add $L_1$ as a residual to the temporal variation information obtained in the first step to supplement the spatial details that may be lost in the subsequent fusion process.
(4)
Finally, the differences between the results obtained in the second and third steps and $L_2$ are calculated to obtain their respective weights, which are used to fuse and reconstruct the final prediction map $Pre\_L_2$.
The structure of EMSNet can be represented by Equation (1) below:
$$Pre\_L_2 = W\left(T(M_{12} + L_1),\; E(M_{12}) + L_1\right)\tag{1}$$
Here, $T$ represents the ITE module, $E$ represents the temporal information-extraction network composed of convolution and dilated convolution, and $W$ represents the weight strategy adopted in this study.
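To make the data flow of Equation (1) concrete, the following PyTorch-style sketch wires the three parts together. It is a minimal illustration rather than the authors' implementation: the module and argument names (ite, extractor, fusion, m1, m2, l1, l2) are assumptions, and the submodules are detailed in the sketches accompanying Sections 2.2–2.4.

```python
import torch
import torch.nn as nn


class EMSNet(nn.Module):
    """Top-level wiring of the three parts described in Section 2.1 (sketch)."""

    def __init__(self, ite: nn.Module, extractor: nn.Module, fusion):
        super().__init__()
        self.ite = ite              # improved transformer encoder T(.)
        self.extractor = extractor  # dilated-convolution network E(.)
        self.fusion = fusion        # weight strategy W(., .)

    def forward(self, m1, m2, l1, l2):
        m12 = m2 - m1                         # temporal change between t1 and t2
        t_branch = self.ite(m12 + l1)         # spatiotemporal features from the ITE
        e_branch = self.extractor(m12) + l1   # temporal features plus the L1 residual
        # The weight strategy of Section 2.4 uses the reference L2 to compute
        # its difference-based weights during training.
        return self.fusion(t_branch, e_branch, l2)
```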

2.2. Improved Transformer Encoder

The transformer [28], as a kind of attention mechanism, is well suited not only to the field of natural language but also to the field of vision. Inspired by the application of the transformer in MSNet [36] and the removal of position encoding in CPVT [31] and CvT [33], in this study the transformer encoder applied to remote sensing spatiotemporal fusion is further improved: the MLP head used for classification is removed, position encoding is removed, and a convolutional projection replaces the original linear projection of the transformer. The resulting structure, shown in Figure 2 below, is called the improved transformer encoder (ITE) and is mainly used to learn temporal variation information and spatial texture details. These operations ensure that the input and output have the same dimensions, which facilitates subsequent fusion and reconstruction.
Figure 2 is the ITE structure diagram, in which the yellow part represents the convolutional projection operation and the blue box and its interior represent the specific operations of the ITE. As can be seen from the figure, this study projects the input information directly through a convolution operation, and the overlap between adjacent convolution blocks effectively strengthens the connection between blocks. Consequently, the ITE strengthens the correlation between local and global information and removes the need for the position encoding required by the linear projection method, making it more suitable for spatiotemporal fusion. The ITE is composed of alternating multi-head attention and feedforward parts. The input is normalized before each submodule, and a residual connection follows each block. The multi-head self-attention mechanism is a series of SoftMax and linear operations, and the input data gradually change dimensions during propagation and training to match these operations. The feedforward part is composed of linear layers, the Gaussian error linear unit (GELU), and dropout, where GELU is used as the activation function. In practical applications, for different amounts of data, ITEs with different depths are required to learn global time-varying information more accurately. Nx in the figure represents the depth value.
In this study, the ITE is used as a module for learning time-varying information and spatial texture detail. Compared with the previous MSNet, it further expands the learning range of the transformer encoder for remote sensing.
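The sketch below illustrates one possible realization of the ITE in PyTorch, with a 3 × 3 overlapping convolutional projection, pre-norm attention blocks, and a convolution mapping the tokens back to image space. The embedding dimension, projection stride, and output mapping are assumptions not specified in the text (the 9 heads and the depth Nx follow Section 3.3), and in practice the tokenization granularity would have to be chosen to keep the attention cost manageable.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Pre-norm multi-head self-attention + GELU feed-forward, both with residuals."""

    def __init__(self, dim, heads, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(dim * mlp_ratio, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))


class ITE(nn.Module):
    """Improved transformer encoder: convolutional projection, no position encoding."""

    def __init__(self, in_ch=1, dim=72, heads=9, depth=5, dropout=0.1):
        super().__init__()
        # Overlapping 3x3 convolutional projection replaces the linear patch
        # projection, so no explicit position encoding is added.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=3, stride=1, padding=1)
        self.blocks = nn.ModuleList(EncoderBlock(dim, heads, dropout=dropout)
                                    for _ in range(depth))
        self.back = nn.Conv2d(dim, in_ch, kernel_size=3, padding=1)  # back to image space

    def forward(self, x):                                  # x: (B, C, H, W)
        b, _, h, w = x.shape
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, H*W, dim)
        for blk in self.blocks:
            tokens = blk(tokens)
        return self.back(tokens.transpose(1, 2).reshape(b, -1, h, w))
```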

2.3. Dilated Convolution

In order to extract the temporal information contained in $M_{12}$ and establish the mapping relationship between input and output, this study proposes a seven-layer neural network mainly composed of dilated convolution as the feature-extraction network. The key feature of dilated convolution is that receptive fields of different sizes can be obtained by setting different dilation rates, so as to extract effective information at multiple scales. Compared with ordinary convolution operations, dilated convolution does not increase the number of redundant parameters. Figure 3 shows the proposed dilated convolution-based neural network and the receptive fields under different dilation rates.
The right side of the dotted line in Figure 3 shows the architecture of the seven-layer neural network, which consists of one layer of convolution, three layers of dilated convolution, and three layers of ReLU. The convolution operation converts the original $M_{12}$ into a multidimensional nonlinear tensor, and its kernel size is 3 × 3; the dilated convolutions effectively extract the temporal features in $M_{12}$, their basic convolution kernel has the same size of 3 × 3, and dilation rates of 2, 3, and 4 are set in turn for the three consecutive layers of dilated convolution. The left side of the dotted line is a schematic diagram of the receptive field under various dilation rates. When the dilation rate is 1, dilated convolution is no different from ordinary convolution. As the dilation rate increases, the receptive field also gradually increases, which enables the network to better learn feature information at various scales while guaranteeing that the number of parameters involved in its operation does not increase [41].
Each dilated convolution operation can be defined as:
$$\Phi(x) = w_i \ast x + b_i$$
Here, $x$ represents the input, $\ast$ represents the dilated convolution operation, $w_i$ represents the weight of the current convolutional layer, and $b_i$ represents the current bias. The output channels of the three dilated convolution operations are 32, 16, and 1 in sequence. After each convolution, the ReLU operation is used to make the features nonlinear and avoid network overfitting [42]. The ReLU operation can be defined as:
$$\mathrm{ReLU}(x) = \max(0, x)$$
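A minimal PyTorch sketch of this seven-layer extraction network is given below, following the kernel size, dilation rates, and output channels stated above; the channel width of the first convolution (here 64) and the padding values (chosen to preserve spatial size) are assumptions.

```python
import torch.nn as nn


def make_extractor(in_ch=1, base_ch=64):
    """Sketch of E(.): one 3x3 convolution, then three 3x3 dilated convolutions
    (dilation 2, 3, 4; output channels 32, 16, 1), each followed by ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, base_ch, kernel_size=3, padding=1),            # plain conv
        nn.Conv2d(base_ch, 32, kernel_size=3, dilation=2, padding=2),   # dilated conv 1
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 16, kernel_size=3, dilation=3, padding=3),        # dilated conv 2
        nn.ReLU(inplace=True),
        nn.Conv2d(16, 1, kernel_size=3, dilation=4, padding=4),         # dilated conv 3
        nn.ReLU(inplace=True),
    )
```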

2.4. Weight Strategy

After feature extraction by the ITE and the dilated convolutional neural network, plus the residual $L_1$ for supplementary information, two distinct features are obtained. Their differences from the real image $L_2$ are used to compute weights for the final fusion, and the specific weight strategy can be defined as:
$$Pre\_L_2 = W\left(T(M_{12} + L_1),\; E(M_{12}) + L_1\right) = \alpha\, T(M_{12} + L_1) + \beta\left(E(M_{12}) + L_1\right)$$
$$\alpha = \frac{\dfrac{1}{\left|T(M_{12} + L_1) - L_2\right|}}{\dfrac{1}{\left|T(M_{12} + L_1) - L_2\right|} + \dfrac{1}{\left|\left(E(M_{12}) + L_1\right) - L_2\right|}}, \qquad \beta = \frac{\dfrac{1}{\left|\left(E(M_{12}) + L_1\right) - L_2\right|}}{\dfrac{1}{\left|T(M_{12} + L_1) - L_2\right|} + \dfrac{1}{\left|\left(E(M_{12}) + L_1\right) - L_2\right|}}$$
Here, $T$ represents the ITE module, and $E$ represents the temporal information-extraction network composed of convolution and dilated convolution.
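The following sketch illustrates this difference-based weighting: each branch is weighted by the reciprocal of its absolute deviation from the reference Landsat image $L_2$, which is available during training. Reducing the deviation to a scalar mean and adding a small epsilon for numerical stability are assumptions for illustration.

```python
import torch


def weighted_fusion(t_branch, e_branch, l2, eps=1e-6):
    """Fuse the ITE branch and the dilated-convolution branch with weights
    inversely proportional to their deviation from the reference L2 (sketch)."""
    d_t = torch.abs(t_branch - l2).mean() + eps   # deviation of the ITE branch
    d_e = torch.abs(e_branch - l2).mean() + eps   # deviation of the dilated-conv branch
    alpha = (1.0 / d_t) / (1.0 / d_t + 1.0 / d_e)
    beta = (1.0 / d_e) / (1.0 / d_t + 1.0 / d_e)
    return alpha * t_branch + beta * e_branch
```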

2.5. Network Training

During the entire training process of the model, the loss is calculated on the prediction results of the whole model, so that the learnable parameters are continuously adjusted during backpropagation to obtain better convergence. To calculate the difference between the predicted result and the real value, the smooth L1 loss function, namely the Huber loss [43], is chosen; it can be defined as $\mathcal{L}$:
$$S(L_i) = \sum_{m=1}^{H}\sum_{n=1}^{W} L_i(m, n)$$
$$\mathcal{L} = loss(Pre\_L_2, L_2) = \frac{1}{N}\begin{cases} \dfrac{1}{2}\left(S(Pre\_L_2) - S(L_2)\right)^2, & \text{if } \left|S(Pre\_L_2) - S(L_2)\right| < 1 \\[2mm] \left|S(Pre\_L_2) - S(L_2)\right| - \dfrac{1}{2}, & \text{otherwise} \end{cases}$$
where $H$ represents the height of the image, $W$ represents the width of the image, $L_i$ represents the input image, and $S$ represents the pixel-sum formula.
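A minimal training-step sketch using PyTorch's built-in smooth L1 loss is shown below. Applying the loss directly to the pixel arrays (rather than to the pixel sums $S(\cdot)$ of the definition above) is an assumption for illustration, and the model and optimizer objects follow the earlier sketches.

```python
import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss(beta=1.0)   # quadratic below 1, linear above


def training_step(model, m1, m2, l1, l2, optimizer):
    """One optimization step: predict the fused image and backpropagate the loss."""
    optimizer.zero_grad()
    pred_l2 = model(m1, m2, l1, l2)     # EMSNet forward pass (see Section 2.1 sketch)
    loss = criterion(pred_l2, l2)
    loss.backward()
    optimizer.step()
    return loss.item()
```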

3. Experiments and Results

3.1. Datasets

Three separate datasets were employed to test the robustness of EMSNet.
The first study area is the Coleambally Irrigation Area (CIA) in southern New South Wales, Australia (34.0034°S, 145.0675°E) [44]. The dataset was acquired from October 2001 to May 2002 and comprises 17 pairs of MODIS–Landsat images. The Landsat images are all from Landsat-7 ETM+, and the MODIS images are MODIS Terra MOD09GA Collection 5 data. The CIA dataset includes six bands, and the image size is 1720 × 2040.
The second study area is the Lower Gwydir Catchment (LGC) in northern New South Wales, Australia (29.0855°S, 149.2815°E) [44]. The dataset was acquired from April 2004 to April 2005 and comprises 14 pairs of MODIS–Landsat images. All Landsat imagery is from the Landsat-5 TM, and the MODIS imagery is MODIS Terra MOD09GA Collection 5 data. The LGC dataset contains six bands, and the image size is 3200 × 2720.
The third study area is the Alu Horqin Banner (AHB) region (43.3619°N, 119.0375°E) in the central Inner Mongolia Autonomous Region of northeastern China, which has many circular pastures and farmland [45,46]. Li et al. collected 27 cloud-free MODIS–Landsat image pairs from 30 May 2013 to 6 December 2018, a time span of more than 5 years. The area has experienced substantial phenological changes owing to the growth of crops and other types of vegetation. The AHB dataset contains six bands, and the image size is 2480 × 2800.
In this study, all images of the three datasets are combined according to a prior time and a prediction time. Each set of training data has four images, i.e., two pairs of MODIS–Landsat images. The image sizes within each MODIS–Landsat pair are the same, and the ratio of their spatial resolutions is 16:1. When combining the data, pairs with the same time span between the prior moment and the predicted moment are given priority as experimental data. In addition, for the training of the network, the images of the three datasets are all adjusted to a size of 1200 × 1200. Figure 4, Figure 5 and Figure 6 show the MODIS–Landsat image pairs obtained on two different dates for the three datasets. During the experiments, the three datasets were input into EMSNet for training: 70% of each dataset was used for training, 15% was used for validation, and 15% was used as the final test set for evaluating the fusion and reconstruction ability of the model.
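A simple sketch of the sample grouping and the 70/15/15 split described above is given below; the tuple layout and the use of a fixed random seed are assumptions for illustration.

```python
import random


def split_pairs(sample_list, seed=0):
    """Split a list of (M1, L1, M2, L2) sample tuples into 70/15/15 subsets."""
    random.Random(seed).shuffle(sample_list)
    n = len(sample_list)
    n_train, n_val = int(0.7 * n), int(0.15 * n)
    train = sample_list[:n_train]
    val = sample_list[n_train:n_train + n_val]
    test = sample_list[n_train + n_val:]
    return train, val, test
```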

3.2. Evaluation

We evaluated the proposed spatiotemporal fusion method by comparing it with FSDAF, STARFM, DCSTFN, STFDCNN, StfNet, and the previous MSNet under the same criteria.
As in the case of MSNet, six evaluation metrics are used. The first indicator is the spectral angle mapper (SAM) [47], which can measure the spectral distortion of the fusion result. It can be defined as follows:
$$\mathrm{SAM} = \frac{1}{N}\sum_{n=1}^{N}\arccos\frac{\sum_{k=1}^{K} L_i^k \cdot Pre\_L_i^k}{\sqrt{\sum_{k=1}^{K}\left(L_i^k\right)^2}\,\sqrt{\sum_{k=1}^{K}\left(Pre\_L_i^k\right)^2}}$$
where $N$ represents the total number of pixels in the predicted image, $K$ represents the total number of bands, $Pre\_L_i$ represents the prediction result, $Pre\_L_i^k$ represents the prediction result of the $k$th band, and $L_i^k$ represents the true value of the $k$th band of $L_i$. A smaller SAM indicates a better result.
The second metric is the root mean square error (RMSE), which is the square root of the MSE and is used to measure the deviation between the predicted image and the observed image. It provides a global depiction of the radiometric differences between the fusion result and the real observation image and is defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{m=1}^{H}\sum_{n=1}^{W}\left(L_i(m, n) - Pre\_L_i(m, n)\right)^2}{H \times W}}$$
where $H$ represents the height of the image, $W$ represents the width of the image, $L_i$ represents the observed image, and $Pre\_L_i$ represents the predicted image. The smaller the RMSE, the closer the predicted image is to the observed image.
The third indicator is the erreur relative globale adimensionnelle de synthèse (ERGAS) [48], which measures the overall quality of the fusion result. It can be defined as:
$$\mathrm{ERGAS} = 100\,\frac{h}{l}\sqrt{\frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{RMSE}\left(L_i^k\right)^2}{\left(\mu^k\right)^2}}$$
where $h$ and $l$ represent the spatial resolutions of the Landsat and MODIS images, respectively; $L_i^k$ represents the real image of the $k$th band; and $\mu^k$ represents the average value of the $k$th band image. When ERGAS is small, the fusion effect is better.
The fourth index is the structural similarity (SSIM) index [49], which is used to measure the similarity of two images. It can be defined as:
$$\mathrm{SSIM} = \frac{\left(2\mu_{Pre\_L_i}\mu_{L_i} + c_1\right)\left(2\sigma_{Pre\_L_i L_i} + c_2\right)}{\left(\mu_{Pre\_L_i}^2 + \mu_{L_i}^2 + c_1\right)\left(\sigma_{Pre\_L_i}^2 + \sigma_{L_i}^2 + c_2\right)}$$
where $\mu_{Pre\_L_i}$ represents the mean value of the predicted image, $\mu_{L_i}$ represents the mean value of the real observation image, $\sigma_{Pre\_L_i L_i}$ represents the covariance of the predicted image $Pre\_L_i$ and the real observation image $L_i$, $\sigma_{Pre\_L_i}^2$ represents the variance of the predicted image $Pre\_L_i$, $\sigma_{L_i}^2$ represents the variance of the real observation image $L_i$, and $c_1$ and $c_2$ are constants used to maintain stability. The value range of SSIM is [−1, 1]; the closer the value is to 1, the more similar the predicted image and the observed image are.
The fifth index is the correlation coefficient (CC), which is used to indicate the correlation between two images. It can be defined as:
$$\mathrm{CC} = \frac{\sum_{n=1}^{N}\left(Pre\_L_i^n - \mu_{Pre\_L_i}\right)\left(L_i^n - \mu_{L_i}\right)}{\sqrt{\sum_{n=1}^{N}\left(Pre\_L_i^n - \mu_{Pre\_L_i}\right)^2}\,\sqrt{\sum_{n=1}^{N}\left(L_i^n - \mu_{L_i}\right)^2}}$$
The closer the CC is to 1, the greater the correlation between the predicted image and the real observation image.
The sixth indicator is the peak signal-to-noise ratio (PSNR) [50]. It is defined indirectly by the MSE, which can be defined as:
$$\mathrm{MSE} = \frac{1}{H W}\sum_{m=1}^{H}\sum_{n=1}^{W}\left(L_i(m, n) - Pre\_L_i(m, n)\right)^2$$
Then PSNR can be defined as:
$$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{MAX_{L_i}^2}{\mathrm{MSE}}\right)$$
where $MAX_{L_i}$ is the maximum possible pixel value of the real observation image $L_i$. If each pixel is represented by an 8-bit binary value, then $MAX_{L_i}$ is 255. Generally, if the pixel value is represented by $B$-bit binary, then $MAX_{L_i} = 2^B - 1$. PSNR can evaluate the quality of the image after reconstruction; a higher PSNR means that the predicted image quality is better.
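For reference, the following NumPy sketch implements three of the six metrics (RMSE, PSNR, and CC) directly from the definitions above for a single band; SAM, ERGAS, and SSIM follow the cited definitions and are omitted for brevity.

```python
import numpy as np


def rmse(obs, pred):
    """Root mean square error between observed and predicted bands."""
    return np.sqrt(np.mean((obs - pred) ** 2))


def psnr(obs, pred, max_val=255.0):
    """Peak signal-to-noise ratio; max_val depends on the bit depth of the data."""
    mse = np.mean((obs - pred) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)


def cc(obs, pred):
    """Correlation coefficient between observed and predicted bands."""
    o, p = obs - obs.mean(), pred - pred.mean()
    return np.sum(o * p) / np.sqrt(np.sum(o ** 2) * np.sum(p ** 2))
```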

3.3. Parameter Settings

For the improved transformer encoder, the number of heads is set to 9, and the depth is set according to the data volume and characteristics of the three datasets: 20 for CIA, 5 for LGC, and 20 for AHB. The size of the patches input into it is 240 × 240. The ordinary convolution as well as the three layers of dilated convolution in the dilated convolutional neural network each use a 3 × 3 convolution kernel; the dilation rates are 2, 3, and 4, and the numbers of output channels are 32, 16, and 1. The initial learning rate is set to 0.0008, the optimizer is Adam, and the weight decay is set to 1 × 10−6. EMSNet was trained on two workstations running Windows 10 Professional, each with 64 GB of memory, an Intel Core i9-9900K CPU @ 3.60 GHz × 16, and an NVIDIA GeForce RTX 2080 Ti GPU.
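The optimizer settings above translate directly into the following sketch; the placeholder module stands in for the EMSNet model assembled as in the Section 2 sketches.

```python
import torch
import torch.nn as nn

# Placeholder module; in practice this would be the EMSNet model of Section 2.
model = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Adam optimizer with the initial learning rate of 0.0008 and weight decay of 1e-6.
optimizer = torch.optim.Adam(model.parameters(), lr=8e-4, weight_decay=1e-6)
```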

3.3.1. Subjective Evaluation

In order to visualize the experimental results, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13 show the experimental results of FSDAF, STARFM, DCSTFN, STFDCNN, StfNet, MSNet, and the proposed improved EMSNet on each of three datasets. GT in the figure represents the real observed image, while Proposed is the proposed EMSNet method.
Figure 7 shows the overall prediction results on the CIA dataset, while Figure 8 shows a cropped and enlarged part of the prediction results. Visually, FSDAF, STARFM, and DCSTFN are less accurate than the other methods in predicting phenological changes. For example, in the overall results in Figure 7, the black areas produced by these methods are noticeably smaller than those contained in GT, and the prediction effects in the box in Figure 8 also differ considerably. Relatively speaking, the prediction results obtained by the deep learning-based methods are better, but the prediction map of StfNet is somewhat blurry and its effect is not good. The results of STFDCNN and MSNet are relatively good, but those of our proposed method are better. Thus, Figure 8 shows that the results obtained by the proposed method are closer to the ground truth in terms of clarity and accuracy.
Figure 9 illustrates the overall prediction results on the LGC dataset, while Figure 10 illustrates a cropped and enlarged portion of the prediction. In general, the performance of each algorithm is relatively stable, but there are differences in the specific spectral information and in the processing of heterogeneous regions. It can be seen from the black box in the enlarged area in the lower right corner of Figure 10 that the prediction accuracy of the spectral information in DCSTFN and StfNet is lower than that of the other methods; the remaining methods achieve good results, but the effect obtained by the proposed method is closer to the actual value. In addition, only FSDAF and the proposed method are able to predict the information of the highly heterogeneous curved river channel below the black box; in this local area, FSDAF is slightly closer to the real value than the proposed method. Overall, the proposed method achieves good results in both spectral information and the processing of heterogeneous regions.
Figure 11 shows the overall prediction results on the AHB dataset, while Figure 12 and Figure 13 show some cropped and enlarged results. On the whole, the prediction results of STARFM are not accurate enough in the processing of spectral information, and there is considerable ambiguous spectral information. DCSTFN fails to predict the results accurately and fails to effectively extract information for a dataset with a large number of heterogeneous regions and much temporal information. The results obtained by StfNet are relatively good in places, such as the spatial details between rivers, but there is still a large gap between the overall result and the real values. In addition, although the prediction results of FSDAF are much better than those of STARFM in the processing of spectral information, there are still shortcomings compared with the real values. STFDCNN and MSNet achieve better results, with relatively adequate spatial details and spectral-temporal information, but the proposed method achieves better results still, with spatial details and spectral information closer to the real values. Locally, in Figure 12, in a large number of continuous phenological change areas, the proposed method shows a noticeable improvement over the previous MSNet; compared with the other methods, its processing of boundary information is also better and is closest to the true value. In Figure 13, for the prediction of a large number of circular pasture areas, FSDAF, STARFM, DCSTFN, and StfNet fail to predict accurately, which may be due to the complex spatial distribution and the large amount of time-varying information in the AHB dataset limiting the learning ability of these models, so the results obtained are not ideal. STFDCNN and the previous MSNet achieve good results, but the boundary information is still insufficient. The proposed method thus achieves the best prediction effect, in the prediction of phenological change information as well as in the boundary processing between circular pastures.

3.3.2. Objective Evaluation

Six evaluation indicators are used to objectively evaluate various algorithms and the proposed method. Table 1, Table 2 and Table 3 present the quantitative evaluation of the prediction results obtained by various methods on three datasets, including global indicators SAM and ERGAS as well as local indicators RMSE, SSIM, PSNR, and CC. Furthermore, the optimal value of each indicator is marked in bold.
Table 1, Table 2 and Table 3 present the quantitative evaluation results of several existing fusion methods and the proposed method on the CIA, LGC, and AHB datasets, respectively. In each table, it can be seen that the proposed method achieves the optimal value on the global indicators and all local indicators.

4. Discussion

Through the experiments, it can be seen that, whether on the CIA dataset with phenological changes in regular areas or on the AHB dataset with phenological changes in a large number of irregular and heterogeneous areas, our proposed method makes better predictions. Similarly, for the LGC dataset, which mainly contains land cover-type changes, the proposed method outperforms traditional methods and other deep learning-based methods in the processing of temporal information and high-frequency spatial details. The temporal information and high-frequency texture information are processed more appropriately because of the combination of the ITE and dilated convolution in EMSNet. More importantly, the refined ITE further expands the learning range in the remote sensing field and can fully extract the spatiotemporal information contained in the input images.
It is worth noting that for datasets with different amounts of data and different characteristics, the depth of the improved transformer encoder (ITE) should also differ to better fit the data. Table 4 lists the average evaluation values of the prediction results obtained without the ITE and with ITEs of different depths, where the optimal value is shown in bold. A depth of 0 indicates that the ITE has not been introduced. It can be seen that when the ITE is not introduced, the experimental results are relatively poor, and as the depth changes, the results obtained vary. The best experimental results are obtained when the depth is 20 for the CIA dataset, 5 for the LGC dataset, and 20 for the AHB dataset.
In addition, the difference between the original linear projection method of the transformer encoder and the improved convolutional projection method was also investigated. Table 5 lists the global indicators and average evaluation values of the prediction results obtained under the two projection methods, where the optimal value is shown in bold. It can be seen that on all three datasets, the ITE with convolutional projection and with position encoding removed achieves better results.
Furthermore, the last six layers of the network for extracting temporal information in Figure 3 comprise three layers of dilated convolution and three layers of ReLU. This paper also conducts a comparative experiment on the three layers of dilated convolution. Table 6 lists the evaluation results obtained when using convolution versus dilated convolution. In the difference column, “conv” means that the above-mentioned three layers of dilated convolution are replaced with three layers of ordinary convolution; “conv_dia” means that the three layers of dilated convolution remain unchanged; and “conv&conv_dia” means that the three layers of dilated convolution are replaced by a three-layer alternation of convolution, dilated convolution, and convolution. It can be seen that when the subsequent operations for extracting temporal information are all dilated convolutions, the effect is better.
Although the proposed method has achieved good results, some issues are worthy of further exploration. First, in order to fully expand the learnable range of the ITE, the larger original MODIS image is used as input. Although dilated convolution is used to reduce the number of parameters, compared with MSNet the number of parameters in this study is still quite high. Table 7 presents the deep learning fusion models and the number of parameters that each method, including the proposed one, needs to learn. It can be seen that the proposed method needs the largest number of parameters, which means that, compared with other methods, it requires more training time and equipment with larger memory during training. Considering the cost of learning, obtaining better results with a smaller model is a direction worthy of future research. Second, the refined ITE shows very good performance, but further improvements to adapt it to remote sensing spatiotemporal fusion can be researched in the future. Furthermore, improving the fusion effect while avoiding the noise introduced by the fusion strategy is also worthy of further study.

5. Conclusions

In this study, the effectiveness of EMSNet in three research areas with diverse characteristics is evaluated. Its performance enhancement is found to be mainly because of the following reasons:
  • The projection method of the original transformer encoder is improved to adapt it to remote sensing spatiotemporal fusion, which further expands the learning range of the improved transformer encoder, effectively learns the connection between local and global information in remote sensing images, and uses the attention mechanism to fully extract the spatiotemporal information in remote sensing images.
  • Dilated convolution is used to expand the receptive field to adapt to the original input of larger size, while keeping the number of learned parameters unchanged, effectively extracting time information and balancing the increase in parameters brought about by the improved transformer encoder.
  • A unique residual structure and a differentiated weight fusion method are used to supplement the lost information and reduce the introduction of noise in the fusion process.
Experiments show that on the CIA and AHB datasets, with noteworthy phenological changes, and on the LGC dataset, with mainly land cover-type changes, EMSNet outperforms other models that use three or five original images for fusion and gives more stable prediction results on each dataset. Although EMSNet achieves good results, there are still many areas worth further research. First, the application of transformer-related structures in the field of remote sensing spatiotemporal fusion will be further studied. Second, compared with other methods, the method proposed in this paper needs to learn significantly more parameters; how to achieve a better fusion effect with a smaller model and lower learning cost is also a focus of future research. Third, although the three datasets used in this paper cover a variety of phenological changes and land-cover changes, there are still regional types that are not included; for example, datasets containing changes in urban areas will be considered in the future.

Author Contributions

Data curation, W.L.; formal analysis, W.L.; methodology, W.L. and D.C.; validation, D.C.; visualization, D.C.; writing—original draft, D.C.; writing—review and editing, D.C. and M.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61972060, U1713213, and 62027827), National Key Research and Development Program of China (2019YFE0110800), and Natural Science Foundation of Chongqing (cstc2020jcyj-zdxmX0025, cstc2019cxcyljrc-td0270).

Data Availability Statement

Data sharing is not applicable to this article.

Acknowledgments

The authors would like to thank all of the reviewers for their valuable contributions to our work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Justice, C.O.; Vermote, E.; Townshend, J.R.; Defries, R.; Roy, D.P.; Hall, D.K.; Salomonson, V.V.; Privette, J.L.; Riggs, G.; Strahler, A.; et al. The Moderate Resolution Imaging Spectroradiometer (MODIS): Land remote sensing for global change research. IEEE Trans. Geosci. Remote Sens. 1998, 36, 1228–1249. [Google Scholar] [CrossRef]
  2. Lin, C.; Li, Y.; Yuan, Z.; Lau, A.K.; Li, C.; Fung, J.C.H. Using satellite remote sensing data to estimate the high-resolution distribution of ground-level PM2.5. Remote Sens. Environ. 2015, 156, 117–128. [Google Scholar] [CrossRef]
  3. Zhang, L.; Zhang, Q.; Du, B.; Huang, X.; Tang, Y.Y.; Tao, D.J. Simultaneous spectral-spatial feature selection and extraction for hyperspectral images. IEEE Trans. Cybern. 2016, 48, 16–28. [Google Scholar] [CrossRef] [PubMed]
  4. Yu, Q.; Gong, P.; Clinton, N.; Biging, G.; Kelly, M.; Schirokauer, D.J.P.E.; Sensing, R. Object-based detailed vegetation classification with airborne high spatial resolution remote sensing imagery. Photogramm. Eng. Remote Sens. 2006, 72, 799–811. [Google Scholar] [CrossRef]
  5. White, M.A.; Nemani, R.R. Real-time monitoring and short-term forecasting of land surface phenology. Remote Sens. Environ. 2006, 104, 43–49. [Google Scholar] [CrossRef]
  6. Hansen, M.C.; Loveland, T.R. A review of large area monitoring of land cover change using Landsat data. Remote Sens. Environ. 2012, 122, 66–74. [Google Scholar] [CrossRef]
  7. Gao, F.; Masek, J.; Schwaller, M.; Hall, F. On the blending of the Landsat and MODIS surface reflectance: Predicting daily Landsat surface reflectance. IEEE Trans. Geosci. Remote Sens. 2006, 44, 2207–2218. [Google Scholar] [CrossRef]
  8. Hilker, T.; Wulder, M.A.; Coops, N.C.; Seitz, N.; White, J.C.; Gao, F.; Masek, J.G.; Stenhouse, G. Generation of dense time series synthetic Landsat data through data blending with MODIS using a spatial and temporal adaptive reflectance fusion model. Remote Sens. Environ. 2009, 113, 1988–1999. [Google Scholar] [CrossRef]
  9. Zhukov, B.; Oertel, D.; Lanzl, F.; Reinhackel, G.; Sensing, R. Unmixing-based multisensor multiresolution image fusion. IEEE Trans. Geosci. Remote Sens. 1999, 37, 1212–1226. [Google Scholar] [CrossRef]
  10. Wu, M.; Niu, Z.; Wang, C.; Wu, C.; Wang, L. Use of MODIS and Landsat time series data to generate high-resolution temporal synthetic Landsat data using a spatial and temporal reflectance fusion model. J. Appl. Remote Sens. 2012, 6, 063507. [Google Scholar] [CrossRef]
  11. Zhu, X.; Helmer, E.H.; Gao, F.; Liu, D.; Chen, J.; Lefsky, M.A. A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote Sens. Environ. 2016, 172, 165–177. [Google Scholar] [CrossRef]
  12. Zhu, X.; Chen, J.; Gao, F.; Chen, X.; Masek, J.G. An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote Sens. Environ. 2010, 114, 2610–2623. [Google Scholar] [CrossRef]
  13. Hilker, T.; Wulder, M.A.; Coops, N.C.; Linke, J.; McDermid, G.; Masek, J.G.; Gao, F.; White, J.C. A new data fusion model for high spatial-and temporal-resolution mapping of forest disturbance based on Landsat and MODIS. Remote Sens. Environ. 2009, 113, 1613–1627. [Google Scholar] [CrossRef]
  14. Huang, B.; Song, H.; Sensing, R. Spatiotemporal reflectance fusion via sparse representation. IEEE Trans. Geosci. Remote Sens. 2012, 50, 3707–3716. [Google Scholar] [CrossRef]
  15. Belgiu, M.; Stein, A. Spatiotemporal image fusion in remote sensing. Remote Sens. 2019, 11, 818. [Google Scholar] [CrossRef]
  16. Wei, J.; Wang, L.; Liu, P.; Chen, X.; Li, W.; Zomaya, A.Y.; Sensing, R. Spatiotemporal fusion of MODIS and Landsat-7 reflectance images via compressed sensing. Geosci. Remote Sens. 2017, 55, 7126–7139. [Google Scholar] [CrossRef]
  17. Liu, X.; Deng, C.; Wang, S.; Huang, G.-B.; Zhao, B.; Lauren, P.; Letters, R.S. Fast and accurate spatiotemporal fusion based upon extreme learning machine. IEEE Geosci. Remote Sens. Lett. 2016, 13, 2039–2043. [Google Scholar] [CrossRef]
  18. Song, H.; Liu, Q.; Wang, G.; Hang, R.; Huang, B.; Sensing, R. Spatiotemporal satellite image fusion using deep convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 821–829. [Google Scholar] [CrossRef]
  19. Liu, X.; Deng, C.; Chanussot, J.; Hong, D.; Zhao, B.; Sensing, R. StfNet: A two-stream convolutional neural network for spatiotemporal image fusion. Geosci. Remote Sens. 2019, 57, 6552–6564. [Google Scholar] [CrossRef]
  20. Chen, Y.; Shi, K.; Ge, Y. Spatiotemporal Remote Sensing Image Fusion Using Multiscale Two-Stream Convolutional Neural Networks. Geosci. Remote Sens. 2021, 60, 1–12. [Google Scholar] [CrossRef]
  21. Jia, D.; Song, C.; Cheng, C.; Shen, S.; Ning, L.; Hui, C. A novel deep learning-based spatiotemporal fusion method for combining satellite images with different resolutions using a two-stream convolutional neural network. Remote Sens. 2020, 12, 698. [Google Scholar] [CrossRef]
  22. Jia, D.; Cheng, C.; Song, C.; Shen, S.; Ning, L.; Zhang, T. A Hybrid Deep Learning-Based Spatiotemporal Fusion Method for Combining Satellite Images with Different Resolutions. Remote Sens. 2021, 13, 645. [Google Scholar] [CrossRef]
  23. Tan, Z.; Yue, P.; Di, L.; Tang, J. Deriving high spatiotemporal remote sensing images using deep convolutional network. Remote Sens. 2018, 10, 1066. [Google Scholar] [CrossRef]
  24. Tan, Z.; Di, L.; Zhang, M.; Guo, L.; Gao, M. An enhanced deep convolutional model for spatiotemporal image fusion. Remote Sens. 2019, 11, 2898. [Google Scholar] [CrossRef]
  25. Chen, J.; Wang, L.; Feng, R.; Liu, P.; Han, W.; Chen, X.; Sensing, R. CycleGAN-STF: Spatiotemporal fusion via CycleGAN-based image generation. Geosci. Remote Sens. 2020, 59, 5851–5865. [Google Scholar] [CrossRef]
  26. Yin, Z.; Wu, P.; Foody, G.M.; Wu, Y.; Liu, Z.; Du, Y.; Ling, F.; Sensing, R. Spatiotemporal fusion of land surface temperature based on a convolutional neural network. Geosci. Remote Sens. 2020, 59, 1808–1822. [Google Scholar] [CrossRef]
  27. Ao, Z.; Sun, Y.; Pan, X.; Xin, Q. Deep learning-based spatiotemporal data fusion using a patch-to-pixel mapping strategy and model comparisons. Geosci. Remote Sens. 2022. [Google Scholar] [CrossRef]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA; 2017; pp. 5998–6008. [Google Scholar]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the ICLR 2021, Virtual Conference (Formerly Vienna), Vienna, Austria, 3–7 May 2021. [Google Scholar]
  30. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual. 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  31. Chu, X.; Zhang, B.; Tian, Z.; Wei, X.; Xia, H. Do we really need explicit position encodings for vision transformers. arXiv 2021, arXiv:2102.10882. [Google Scholar]
  32. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  33. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 22–31. [Google Scholar]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  35. Yang, F.; Yang, H.; Fu, J.; Lu, H.; Guo, B. Learning texture transformer network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5791–5800. [Google Scholar]
  36. Li, W.; Cao, D.; Peng, Y.; Yang, C. MSNet: A multi-stream fusion network for remote sensing spatiotemporal fusion based on transformer and convolution. Remote Sens. 2021, 13, 3724. [Google Scholar] [CrossRef]
  37. Chen, G.; Jiao, P.; Hu, Q.; Xiao, L.; Ye, Z. SwinSTFM: Remote Sensing Spatiotemporal Fusion Using Swin Transformer. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar]
  38. Yang, G.; Qian, Y.; Liu, H.; Tang, B.; Qi, R.; Lu, Y.; Geng, J. MSFusion: Multistage for Remote Sensing Image Spatiotemporal Fusion Based on Texture Transformer and Convolutional Neural Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4653–4666. [Google Scholar] [CrossRef]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  40. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 4700–4708. [Google Scholar]
  41. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  42. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  43. Huber, P.J. Robust estimation of a location parameter. In Breakthroughs in Statistics; Springer: Berlin/Heidelberg, Germany, 1992; pp. 492–518. [Google Scholar] [CrossRef]
  44. Emelyanova, I.V.; McVicar, T.R.; Van Niel, T.G.; Li, L.T.; Van Dijk, A.I. Assessing the accuracy of blending Landsat–MODIS surface reflectances in two landscapes with contrasting spatial and temporal dynamics: A framework for algorithm selection. Remote Sens. Environ. 2013, 133, 193–209. [Google Scholar] [CrossRef]
  45. Li, Y.; Li, J.; He, L.; Chen, J.; Plaza, A. A new sensor bias-driven spatio-temporal fusion model based on convolutional neural networks. Sci. China Inf. Sci. 2020, 63, 140302. [Google Scholar] [CrossRef]
  46. Li, J.; Li, Y.; He, L.; Chen, J.; Plaza, A. Spatio-temporal fusion for remote sensing data: An overview and new benchmark. Sci. China Inf. Sci. 2020, 63, 140301. [Google Scholar] [CrossRef]
  47. Yuhas, R.H.; Goetz, A.F.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Proceedings of the Summaries 3rd Annual JPL Airborne Geoscience Workshop, Pasadena, CA, USA, 1–5 June 1992; pp. 147–149. [Google Scholar]
  48. Khan, M.M.; Alparone, L.; Chanussot, J. Pansharpening quality assessment using the modulation transfer functions of instruments. Geosci. Remote Sens. 2009, 47, 3880–3891. [Google Scholar] [CrossRef]
  49. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [Green Version]
  50. Ponomarenko, N.; Ieremeiev, O.; Lukin, V.; Egiazarian, K.; Carli, M. Modified image visual quality metrics for contrast change and mean shift accounting. In Proceedings of the 2011 11th International Conference the Experience of Designing and Application of CAD Systems in Microelectronics (CADSM), Polyana, Ukraine, 23–25 February 2011; pp. 305–311. [Google Scholar]
Figure 1. EMSNet architecture.
Figure 2. Structure of improved transformer encoder (ITE).
Figure 3. Neural network based on dilated convolution and receptive fields with different dilation rates.
Figure 4. Composite MODIS (top row) and Landsat (bottom row) image pairs on 7 October (a,c) and 16 October (b,d) 2001 from the CIA [44] dataset. The CIA dataset focuses on noteworthy phenological changes in irrigated farmland.
Figure 5. Composite MODIS (top row) and Landsat (bottom row) image pairs on 29 January (a,c) and 14 February (b,d) 2005 from the LGC [44] dataset. The LGC dataset focuses on changes in land cover types after the flood.
Figure 6. Composite MODIS (top row) and Landsat (bottom row) image pairs on 21 June (a,c) and 7 July (b,d) 2015 from the AHB [45,46] dataset. The AHB dataset focuses on noteworthy phenological changes in pasture.
Figure 7. Entire prediction results for the target Landsat image (16 October 2001) in the CIA dataset. The comparison methods FSDAF [11], STARFM [8], DCSTFN [23], STFDCNN [18], StfNet [19], and MSNet [36] are shown in (b–g), respectively; (a) is the ground truth (GT), and (h) is the proposed method.
Figure 8. Specific prediction results for the target Landsat image (16 October 2001) in the CIA dataset. The white boxes mark the regions where the results of the methods differ most noticeably. The comparison methods FSDAF [11], STARFM [8], DCSTFN [23], STFDCNN [18], StfNet [19], and MSNet [36] are shown in (b–g), respectively; (a) is the ground truth (GT), and (h) is the proposed method.
Figure 9. Comprehensive prediction results for the target Landsat image (14 February 2005) in the LGC dataset. The comparison methods FSDAF [11], STARFM [8], DCSTFN [23], STFDCNN [18], StfNet [19], and MSNet [36] are shown in (b–g), respectively; (a) is the ground truth (GT), and (h) is the proposed method.
Figure 10. Specific prediction results for the target Landsat image (14 February 2005) in the LGC dataset. The grey boxes mark the regions where the results of the methods differ most noticeably. The comparison methods FSDAF [11], STARFM [8], DCSTFN [23], STFDCNN [18], StfNet [19], and MSNet [36] are shown in (b–g), respectively; (a) is the ground truth (GT), and (h) is the proposed method.
Figure 11. Complete prediction results for the target Landsat image (7 July 2015) in the AHB dataset. The comparison methods FSDAF [11], STARFM [8], DCSTFN [23], STFDCNN [18], StfNet [19], and MSNet [36] are shown in (b–g), respectively; (a) is the ground truth (GT), and (h) is the proposed method.
Figure 12. First specific prediction results for the target Landsat image (7 July 2015) in the AHB dataset. The white boxes mark the regions where the results of the methods differ most noticeably. The comparison methods FSDAF [11], STARFM [8], DCSTFN [23], STFDCNN [18], StfNet [19], and MSNet [36] are shown in (b–g), respectively; (a) is the ground truth (GT), and (h) is the proposed method.
Figure 13. Second specific prediction results for the target Landsat image (7 July 2015) in the AHB dataset. The white boxes mark the regions where the results of the methods differ most noticeably. The comparison methods FSDAF [11], STARFM [8], DCSTFN [23], STFDCNN [18], StfNet [19], and MSNet [36] are shown in (b–g), respectively; (a) is the ground truth (GT), and (h) is the proposed method.
Table 1. Quantitative assessment of various spatiotemporal fusion methods for the CIA dataset.

| Evaluation | Band | FSDAF | DCSTFN | STARFM | STFDCNN | StfNet | MSNet | Proposed |
|---|---|---|---|---|---|---|---|---|
| SAM | all | 0.23875 | 0.21556 | 0.23556 | 0.21402 | 0.21614 | 0.19209 | 0.00114 |
| ERGAS | all | 3.35044 | 3.07221 | 3.31676 | 3.14461 | 3.00404 | 2.94471 | 0.45234 |
| RMSE | band1 | 0.01365 | 0.01059 | 0.01306 | 0.01076 | 0.00956 | 0.01009 | 0.00051 |
| | band2 | 0.01415 | 0.01256 | 0.01366 | 0.01236 | 0.01271 | 0.01132 | 0.00044 |
| | band3 | 0.02075 | 0.01922 | 0.02055 | 0.01792 | 0.02121 | 0.01724 | 0.00032 |
| | band4 | 0.04619 | 0.04377 | 0.04899 | 0.04100 | 0.05001 | 0.03669 | 0.00079 |
| | band5 | 0.06031 | 0.05655 | 0.06153 | 0.05900 | 0.05302 | 0.04898 | 0.00026 |
| | band6 | 0.05322 | 0.04690 | 0.05278 | 0.05389 | 0.04500 | 0.04325 | 0.00067 |
| | avg | 0.03471 | 0.03160 | 0.03509 | 0.03249 | 0.03192 | 0.02793 | 0.00050 |
| SSIM | band1 | 0.90147 | 0.94678 | 0.91699 | 0.95517 | 0.94190 | 0.95050 | 0.99996 |
| | band2 | 0.91899 | 0.93652 | 0.92325 | 0.93812 | 0.94340 | 0.95149 | 0.99998 |
| | band3 | 0.85786 | 0.88428 | 0.86290 | 0.87329 | 0.89950 | 0.91156 | 0.99999 |
| | band4 | 0.76070 | 0.79776 | 0.74636 | 0.78318 | 0.84868 | 0.86248 | 0.99995 |
| | band5 | 0.66598 | 0.70744 | 0.66011 | 0.72789 | 0.74118 | 0.76460 | 0.99999 |
| | band6 | 0.66168 | 0.72121 | 0.66323 | 0.73555 | 0.74068 | 0.76257 | 0.99997 |
| | avg | 0.79445 | 0.83233 | 0.79548 | 0.83553 | 0.85256 | 0.86720 | 0.99997 |
| PSNR | band1 | 37.29537 | 39.50404 | 37.68327 | 39.36680 | 40.38939 | 39.92510 | 65.81463 |
| | band2 | 36.98507 | 38.01703 | 37.29114 | 38.16128 | 37.91972 | 38.92643 | 67.09016 |
| | band3 | 33.65821 | 34.32276 | 33.74247 | 34.93560 | 33.46842 | 35.27141 | 69.83863 |
| | band4 | 26.70854 | 27.17708 | 26.19858 | 27.74355 | 26.01829 | 28.70879 | 62.06650 |
| | band5 | 24.39249 | 24.95152 | 24.21822 | 24.58366 | 25.51175 | 26.19920 | 71.78578 |
| | band6 | 25.47784 | 26.57641 | 25.55050 | 25.37055 | 26.93525 | 27.28095 | 63.47700 |
| | avg | 30.75292 | 31.75814 | 30.78070 | 31.69357 | 31.70714 | 32.71865 | 66.67879 |
| CC | band1 | 0.80138 | 0.79672 | 0.79845 | 0.84521 | 0.83428 | 0.84448 | 0.99951 |
| | band2 | 0.79873 | 0.81009 | 0.79319 | 0.83720 | 0.83156 | 0.84929 | 0.99978 |
| | band3 | 0.83290 | 0.84688 | 0.82554 | 0.87373 | 0.87264 | 0.87787 | 0.99996 |
| | band4 | 0.88511 | 0.89683 | 0.86697 | 0.91181 | 0.90546 | 0.92743 | 0.99997 |
| | band5 | 0.76395 | 0.79363 | 0.74894 | 0.78783 | 0.84732 | 0.84784 | 0.99999 |
| | band6 | 0.76036 | 0.80739 | 0.75144 | 0.76502 | 0.84588 | 0.83826 | 0.99996 |
| | avg | 0.80707 | 0.82526 | 0.79742 | 0.83680 | 0.85619 | 0.86420 | 0.99986 |
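For reference, the snippet below sketches one plausible way to compute the six evaluation indices reported in Tables 1–3 (SAM, ERGAS, RMSE, SSIM, PSNR, and CC) for a predicted and a reference multi-band image. It is a minimal re-implementation based on the standard definitions of these indices (SAM as in [47], SSIM as in [49]), not the authors' evaluation code; the reflectance data range of 1.0, the 1/16 resolution ratio used in ERGAS, and the use of scikit-image for SSIM are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

def fusion_metrics(pred, ref, data_range=1.0, resolution_ratio=1 / 16):
    """pred, ref: float arrays of shape (bands, H, W) in reflectance units."""
    bands = pred.shape[0]
    rmse_b = np.sqrt(((pred - ref) ** 2).reshape(bands, -1).mean(axis=1))
    psnr_b = 20 * np.log10(data_range / rmse_b)
    ssim_b = np.array([structural_similarity(ref[b], pred[b], data_range=data_range)
                       for b in range(bands)])
    cc_b = np.array([np.corrcoef(ref[b].ravel(), pred[b].ravel())[0, 1]
                     for b in range(bands)])
    # SAM: mean spectral angle (radians) between per-pixel spectra.
    dot = (pred * ref).sum(axis=0)
    cos = dot / (np.linalg.norm(pred, axis=0) * np.linalg.norm(ref, axis=0) + 1e-12)
    sam = np.arccos(np.clip(cos, -1.0, 1.0)).mean()
    # ERGAS: global relative dimensionless error, scaled by the resolution ratio.
    mean_b = ref.reshape(bands, -1).mean(axis=1)
    ergas = 100 * resolution_ratio * np.sqrt(((rmse_b / mean_b) ** 2).mean())
    return {"SAM": sam, "ERGAS": ergas, "RMSE": rmse_b.mean(),
            "SSIM": ssim_b.mean(), "PSNR": psnr_b.mean(), "CC": cc_b.mean()}

# Example with random data standing in for a 6-band Landsat-sized patch.
rng = np.random.default_rng(0)
ref = rng.random((6, 128, 128)).astype(np.float32)
pred = np.clip(ref + 0.01 * rng.standard_normal(ref.shape).astype(np.float32), 0, 1)
print(fusion_metrics(pred, ref))
```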
Table 2. Quantitative assessment of various spatiotemporal fusion methods for the LGC dataset.

| Evaluation | Band | FSDAF | DCSTFN | STARFM | STFDCNN | StfNet | MSNet | Proposed |
|---|---|---|---|---|---|---|---|---|
| SAM | all | 0.08411 | 0.08354 | 0.08601 | 0.06792 | 0.09284 | 0.06335 | 0.00035 |
| ERGAS | all | 1.93861 | 1.91167 | 1.92273 | 1.80392 | 2.03970 | 1.68639 | 0.13248 |
| RMSE | band1 | 0.00763 | 0.00763 | 0.00729 | 0.00719 | 0.00824 | 0.00585 | 0.00006 |
| | band2 | 0.00913 | 0.00870 | 0.00907 | 0.00843 | 0.01167 | 0.00712 | 0.00006 |
| | band3 | 0.01279 | 0.01258 | 0.01256 | 0.01151 | 0.01353 | 0.00969 | 0.00006 |
| | band4 | 0.02383 | 0.02332 | 0.02295 | 0.02102 | 0.02971 | 0.01864 | 0.00006 |
| | band5 | 0.02830 | 0.02679 | 0.02607 | 0.02251 | 0.02284 | 0.02159 | 0.00006 |
| | band6 | 0.02197 | 0.02072 | 0.02181 | 0.01673 | 0.02054 | 0.01425 | 0.00006 |
| | avg | 0.01727 | 0.01662 | 0.01662 | 0.01457 | 0.01775 | 0.01286 | 0.00006 |
| SSIM | band1 | 0.97422 | 0.97455 | 0.97355 | 0.98460 | 0.97464 | 0.98558 | 0.99999 |
| | band2 | 0.96698 | 0.96918 | 0.96495 | 0.98209 | 0.96062 | 0.98031 | 0.99999 |
| | band3 | 0.94456 | 0.94632 | 0.94152 | 0.97475 | 0.94162 | 0.96954 | 0.99999 |
| | band4 | 0.92411 | 0.93283 | 0.91759 | 0.96417 | 0.91455 | 0.96393 | 0.99999 |
| | band5 | 0.89418 | 0.90416 | 0.88558 | 0.95539 | 0.91215 | 0.95239 | 0.99999 |
| | band6 | 0.88485 | 0.90337 | 0.87789 | 0.95259 | 0.90154 | 0.95087 | 0.99999 |
| | avg | 0.93148 | 0.93840 | 0.92684 | 0.96893 | 0.93419 | 0.96710 | 0.99999 |
| PSNR | band1 | 42.35483 | 42.34734 | 42.73997 | 42.86245 | 41.68016 | 44.65345 | 84.25491 |
| | band2 | 40.79034 | 41.20550 | 40.85222 | 41.48586 | 38.65611 | 42.95050 | 85.06363 |
| | band3 | 37.86428 | 38.00486 | 38.02099 | 38.77733 | 37.37629 | 40.27059 | 83.84853 |
| | band4 | 32.45760 | 32.64622 | 32.78532 | 33.54859 | 30.54336 | 34.59058 | 84.47622 |
| | band5 | 30.96416 | 31.44212 | 31.67671 | 32.95179 | 32.82613 | 33.31671 | 84.01371 |
| | band6 | 33.16535 | 33.67016 | 33.22812 | 35.52920 | 33.74927 | 36.92082 | 84.00084 |
| | avg | 36.26610 | 36.55270 | 36.55056 | 37.52587 | 35.80522 | 38.78378 | 84.27631 |
| CC | band1 | 0.93627 | 0.92666 | 0.92935 | 0.94611 | 0.94664 | 0.96138 | 0.99999 |
| | band2 | 0.93186 | 0.93379 | 0.92880 | 0.94530 | 0.93566 | 0.95800 | 0.99999 |
| | band3 | 0.93549 | 0.93512 | 0.93516 | 0.95262 | 0.95539 | 0.96499 | 0.99999 |
| | band4 | 0.96360 | 0.96585 | 0.96287 | 0.97181 | 0.96125 | 0.97591 | 0.99999 |
| | band5 | 0.95527 | 0.95492 | 0.95222 | 0.97545 | 0.97048 | 0.97890 | 0.99999 |
| | band6 | 0.95313 | 0.95738 | 0.95214 | 0.97285 | 0.97164 | 0.97924 | 0.99999 |
| | avg | 0.94594 | 0.94562 | 0.94342 | 0.96069 | 0.95684 | 0.96974 | 0.99999 |
Table 3. Quantitative assessment of various spatiotemporal fusion methods for the AHB dataset.

| Evaluation | Band | FSDAF | DCSTFN | STARFM | STFDCNN | StfNet | MSNet | Proposed |
|---|---|---|---|---|---|---|---|---|
| SAM | all | 0.16991 | 0.23877 | 0.29277 | 0.18583 | 0.25117 | 0.14677 | 0.01297 |
| ERGAS | all | 2.80156 | 4.03380 | 4.46147 | 4.25224 | 3.86535 | 2.90661 | 0.81967 |
| RMSE | band1 | 0.00039 | 0.00081 | 0.00251 | 0.00096 | 0.00112 | 0.00047 | 0.00007 |
| | band2 | 0.00044 | 0.00215 | 0.00235 | 0.00092 | 0.00081 | 0.00051 | 0.00007 |
| | band3 | 0.00067 | 0.00363 | 0.00358 | 0.00117 | 0.00118 | 0.00064 | 0.00007 |
| | band4 | 0.00109 | 0.00187 | 0.00590 | 0.00124 | 0.00201 | 0.00103 | 0.00006 |
| | band5 | 0.00126 | 0.00208 | 0.00408 | 0.00183 | 0.00177 | 0.00122 | 0.00006 |
| | band6 | 0.00136 | 0.00225 | 0.00263 | 0.00200 | 0.00198 | 0.00126 | 0.00007 |
| | avg | 0.00087 | 0.00213 | 0.00351 | 0.00135 | 0.00148 | 0.00085 | 0.00006 |
| SSIM | band1 | 0.99895 | 0.99459 | 0.96538 | 0.99205 | 0.98927 | 0.99822 | 0.99998 |
| | band2 | 0.99877 | 0.96845 | 0.96977 | 0.99293 | 0.99500 | 0.99805 | 0.99998 |
| | band3 | 0.99741 | 0.91914 | 0.93438 | 0.98947 | 0.98965 | 0.99740 | 0.99998 |
| | band4 | 0.99616 | 0.98506 | 0.92038 | 0.99419 | 0.98248 | 0.99631 | 0.99999 |
| | band5 | 0.99382 | 0.98085 | 0.94190 | 0.98371 | 0.98464 | 0.99388 | 0.99999 |
| | band6 | 0.99129 | 0.97145 | 0.96825 | 0.97625 | 0.97636 | 0.99226 | 0.99998 |
| | avg | 0.99607 | 0.96992 | 0.95001 | 0.98810 | 0.98623 | 0.99602 | 0.99998 |
| PSNR | band1 | 68.18177 | 61.87013 | 52.01008 | 60.34502 | 59.00582 | 66.48249 | 83.62824 |
| | band2 | 67.04371 | 53.35105 | 52.56484 | 60.68929 | 61.8316 | 65.80339 | 83.61930 |
| | band3 | 63.49068 | 48.79810 | 48.93197 | 58.63694 | 58.55977 | 63.88021 | 83.56309 |
| | band4 | 59.22553 | 54.57435 | 44.58211 | 58.13169 | 53.95486 | 59.77506 | 84.23956 |
| | band5 | 58.02282 | 53.65469 | 47.79106 | 54.74701 | 55.05539 | 58.28599 | 83.87554 |
| | band6 | 57.35352 | 52.93719 | 51.60634 | 53.96601 | 54.06602 | 58.02322 | 83.56037 |
| | avg | 62.21967 | 54.19759 | 49.58107 | 57.75266 | 57.07891 | 62.04173 | 83.74768 |
| CC | band1 | 0.84000 | 0.78227 | 0.71181 | 0.80368 | 0.49726 | 0.86845 | 0.99570 |
| | band2 | 0.85657 | 0.76351 | 0.74545 | 0.86845 | 0.38062 | 0.89114 | 0.99795 |
| | band3 | 0.84979 | 0.79147 | 0.81230 | 0.83576 | 0.27147 | 0.88345 | 0.99918 |
| | band4 | 0.53986 | 0.40161 | 0.34009 | 0.58944 | 0.37556 | 0.60303 | 0.99893 |
| | band5 | 0.79576 | 0.52206 | 0.76553 | 0.83580 | 0.62926 | 0.85320 | 0.99972 |
| | band6 | 0.80288 | 0.47565 | 0.76492 | 0.80338 | 0.61085 | 0.85154 | 0.99975 |
| | avg | 0.78081 | 0.62276 | 0.69002 | 0.78942 | 0.46083 | 0.82514 | 0.99854 |
Table 4. Average evaluation values of ITEs of various depths on the three datasets.

| Database | Depth | SAM | ERGAS | RMSE | SSIM | PSNR | CC |
|---|---|---|---|---|---|---|---|
| CIA | 0 | 0.223768 | 3.144353 | 0.032796 | 0.844214 | 31.477961 | 0.819219 |
| | 5 | 0.001597 | 0.530676 | 0.000550 | 0.999948 | 67.018571 | 0.999620 |
| | 10 | 0.001182 | 0.473233 | **0.000473** | 0.999971 | **68.362024** | 0.999807 |
| | 15 | 0.001394 | 0.509978 | 0.000639 | 0.999960 | 64.859474 | 0.999776 |
| | 20 | **0.001142** | **0.452341** | 0.000499 | **0.999974** | 66.678786 | **0.999863** |
| LGC | 0 | 0.082166 | 1.939385 | 0.016704 | 0.943749 | 36.315476 | 0.948030 |
| | 5 | **0.000352** | **0.132476** | **0.000061** | **0.9999982** | **84.276309** | **0.9999989** |
| | 10 | 0.000367 | 0.139728 | 0.000069 | 0.9999979 | 83.319692 | 0.9999987 |
| | 15 | 0.000378 | 0.153687 | 0.000092 | 0.9999976 | 81.181723 | 0.999998 |
| | 20 | 0.000638 | 0.287639 | 0.000476 | 0.999885 | 77.511986 | 0.999900 |
| AHB | 0 | 0.082166 | 1.939385 | 0.016704 | 0.943749 | 36.315476 | 0.748201 |
| | 5 | 0.013112 | 0.826490 | 0.000066 | 0.999982 | 83.686718 | 0.998556 |
| | 10 | 0.013106 | 0.825792 | 0.000066 | 0.999982 | 83.680289 | **0.998557** |
| | 15 | 0.013102 | 0.828641 | 0.000066 | 0.999982 | 83.625675 | 0.998539 |
| | 20 | **0.012967** | **0.819673** | **0.000065** | **0.999983** | **83.747684** | 0.998540 |

Bold indicates the optimal value across the different ITE depths.
Table 5. Average evaluation values of ICTEs with various projection methods on the three datasets.

| Database | Projection Method | SAM | ERGAS | RMSE | SSIM | PSNR | CC |
|---|---|---|---|---|---|---|---|
| CIA | line | 0.001142 | 0.452660 | 0.000500 | 0.999966 | 65.565960 | 0.999462 |
| | conv | 0.001141 | 0.452341 | 0.000499 | 0.999974 | 66.678786 | 0.999863 |
| LGC | line | 0.000352 | 0.133565 | 0.000070 | 0.999990 | 82.659593 | 0.999984 |
| | conv | 0.000351 | 0.132476 | 0.000061 | 0.999998 | 84.276309 | 0.999999 |
| AHB | line | 0.013024 | 0.823650 | 0.000066 | 0.999896 | 81.265960 | 0.990570 |
| | conv | 0.012967 | 0.819673 | 0.000065 | 0.999983 | 83.747684 | 0.998540 |
Table 6. Average evaluation values of various convolution operations on the three datasets.

| Database | Difference | SAM | ERGAS | RMSE | SSIM | PSNR | CC |
|---|---|---|---|---|---|---|---|
| CIA | conv | 0.001491 | 0.549008 | 0.000593 | 0.999953 | 66.089266 | 0.999672 |
| | conv_dia | 0.001142 | 0.452341 | 0.000499 | 0.999974 | 66.678786 | 0.999863 |
| | conv&conv_dia | 0.101984 | 2.197340 | 0.015059 | 0.943943 | 37.726046 | 0.963180 |
| LGC | conv | 0.000365 | 0.136848 | 0.000064 | 0.9999980 | 83.841604 | 0.9999988 |
| | conv_dia | 0.000352 | 0.132476 | 0.000061 | 0.9999982 | 84.276309 | 0.9999989 |
| | conv&conv_dia | 0.050764 | 1.532304 | 0.010584 | 0.975687 | 40.476035 | 0.980252 |
| AHB | conv | 0.012998 | 0.826653 | 0.000066 | 0.999982 | 83.658194 | 0.998290 |
| | conv_dia | 0.012967 | 0.819673 | 0.000065 | 0.999983 | 83.747684 | 0.998540 |
| | conv&conv_dia | 0.091340 | 2.057066 | 0.000496 | 0.998751 | 66.825864 | 0.921079 |
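As a small illustration of why the `conv_dia` variant in Table 6 can enlarge the receptive field at no cost in parameters, the sketch below compares a standard 3 × 3 convolution with a dilated one in PyTorch; the channel sizes are arbitrary examples, not the network's actual configuration.

```python
import torch.nn as nn

def n_params(module):
    """Count the learnable parameters of a module."""
    return sum(p.numel() for p in module.parameters())

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)                   # 3x3 receptive field
conv_dia = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # 5x5 receptive field

# Identical parameter counts (64*64*3*3 + 64 = 36,928) and identical output size
# for stride 1, but the dilated kernel samples a wider neighbourhood.
print(n_params(conv), n_params(conv_dia))
```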
Table 7. Number of parameters for different deep learning methods.

| Method | DCSTFN | STFDCNN | StfNet | MSNet | Proposed |
|---|---|---|---|---|---|
| Parameters | 445,889 | 114,562 | 36,866 | 521,064 (depth = 5); 978,764 (depth = 10); 1,894,164 (depth = 20) | 3,673,617 (depth = 5); 7,329,217 (depth = 10); 14,640,417 (depth = 20) |
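Totals such as those in Table 7 can be reproduced for any PyTorch model with a one-line parameter count; the sketch below shows the generic pattern. The `model` here is only a stand-in, not EMSNet or any of the compared networks.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of learnable parameters of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-in model; substitute the actual network instance to reproduce a Table 7 entry.
model = nn.Sequential(
    nn.Conv2d(6, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 6, kernel_size=3, padding=1),
)
print(count_parameters(model))
```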