1. Introduction
Recently, cloud-motion prediction has received significant attention because of its importance for the prediction of both precipitation and solar-energy availability [1]. Research has shown that the prediction of the short-time motion of clouds, especially of convective clouds, is important for precipitation forecasts [2,3,4,5,6,7]. Since most models of solar variability [8,9] and of solar irradiation [10,11,12] require cloud motion velocity as the main input, accurate cloud motion estimation is also essential for the intra-hour forecast of solar energy [13,14,15,16]. The difference between weather forecasts and solar forecasts is that the latter are usually conducted in a shorter time window (less than one hour); otherwise, cloud-motion prediction is essentially similar in these two fields. Because the temperature of clouds is lower than that of the ground, clouds can be identified from infrared (IR) satellite images (with wavelengths of 10.5 to 12.5 μm), in which the intensity of IR radiation is correlated with temperature [1,17]. Therefore, cloud motion can be estimated from a given sequence of IR images for weather forecasting [18] or intra-hour solar forecasting.
Nevertheless, cloud motion is a complex phenomenon involving nonrigid motion and nonlinear events [19], and predicting it remains challenging. Several methods have been proposed for the prediction of cloud motion; most of them are correspondence-based approaches. In general, cloud motion vectors (CMVs) are obtained by first locating salient image features, such as brightness gradients, corners, cloud edges, or brightness temperature gradients [20,21], and subsequently tracking these features in successive images with the assumption that they do not change significantly over a short interval. CMVs can be obtained from data collected by sky-imaging devices, such as whole-sky imagers (WSIs) [22], or by satellites. CMVs derived from WSI data are used for short-term forecasts (less than 20 min) in the local spatial area [11], whereas CMVs obtained from satellite images are commonly utilized to find the global atmospheric motion and the climate status of a large area [21,23]. Adopting a similar concept to CMVs, Brad and Letia [19] developed a model combining a block matching algorithm (BMA) and a best candidate block search, along with vector median regularization, to estimate cloud motion. This method divides successive images into blocks and restricts the candidate list of blocks to a predefined number, whereas the full-search BMA finds the best match between blocks of successive frames over the full domain. Based on the idea of block matching, Jamaly and Kleissl [24] applied the cross-correlation method (CCM) and cross-spectral analysis (CSA) as matching criteria for cloud motion estimation. Additional quality-control measures, including removing conditions with low variability and less-correlated sites, can help to ensure that CSA and CCM reliably estimate cloud motion. Nevertheless, CCM can lead to relatively large errors because the assumption of uniform cloud motion does not hold in the presence of cloud deformation, topographically induced wind-speed variations, or a changing optical perspective [25]. This is a common problem for other block matching methods as well.
One approach to overcoming the challenges brought by variations in cloud motion is to compute the CMV of every pixel. Chow et al. [26] proposed a variational optical flow (VOF) technique to determine cloud motion with subpixel accuracy for every pixel. They focused on cloud motion detection and did not extend their work to prediction. Shakya and Kumar [27] applied a fractional-order optical-flow method to cloud-motion estimation and used extrapolations based on advection and anisotropic diffusion to make predictions. However, their method is not an end-to-end method of cloud-motion prediction.
Since the CMV is computed by extracting and tracking features, improving feature extraction is another approach to improving performance. The deep convolutional neural network (CNN) [28] has proved able to extract and utilize image features effectively; it has achieved great success in visual recognition tasks, such as the ImageNet classification challenge [29]. Methods based on deep CNNs have been introduced to cloud classification [30,31], cloud detection [32], and satellite video processing [33] in recent years. Although the deep CNN performs excellently when dealing with spatial data, it discards temporal information [34] that provides important clues in the forecasting of cloud motion. A prominent class of deep neural network, the recurrent neural network (RNN), can learn complex and compound relationships in the time domain. However, the simple RNN model lacks the ability to backpropagate the error signal over long temporal ranges. Long short-term memory (LSTM) [35] was proposed to tackle this problem, and the model is widely used in the solar power forecasting field [36,37]. Recent deep-learning models trained on videos have been used successfully for captioning and for encoding motion. Ji et al. [38] formulated a video as a set of images and directly applied a deep CNN to the frames. Zha et al. [39] extended the deep 2-D CNN to a deep 3-D CNN and performed the convolutional operation on both the spatial and the temporal dimensions. Donahue et al. [40] combined convolutional networks with LSTM and proposed the long-term recurrent convolutional network (LRCN). LRCN first processes the inputs with a CNN and then feeds the outputs of the CNN into stacked LSTMs. This method set a precedent for the combination of CNN and RNN known as the recurrent convolutional network (RCN). Unlike previous proposals that focused on high-level deep CNN “visual percepts”, the novel convolutional long short-term memory (ConvLSTM) network proposed by Shi et al. [41] has convolutional structures in both the input-to-state and state-to-state transitions to extract “visual percepts” for precipitation nowcasting. Ballas et al. [42] extended this work and proposed a variant of the gated recurrent unit (GRU). They captured spatial information using an RNN with convolutional operations and empirically validated their GRU-RCN model on a video classification task. GRU-RCN has fewer parameters than ConvLSTM.
Since both the input and output of a cloud-motion forecast are spatiotemporal sequences, cloud-motion prediction is a spatiotemporal-sequence forecast problem for which GRU-RCN would seem well suited. However, Ballas et al. [42] focused on video classification, which is quite different from our forecast problem. Given the input video data, the output of their model is a number that depends on the class of the video; in our problem, the output must have a spatial domain as well. We therefore need to modify the structure of the GRU-RCN model and apply it directly at the pixel level.
Moreover, cloud-motion prediction poses another challenge: new clouds often appear suddenly at the boundary of the image. To overcome this challenge, our model includes information about the surrounding context in which each small portion of the cloud scene is embedded; this was not considered in previous methods.
In this paper, we suggest the use of deep-learning methods to capture nonstationary information regarding cloud motion and deal with nonrigid processes. We propose a multiscale-input end-to-end model with a GRU-RCN layer. The model takes the surrounding context into account, achieves precise localization, and extracts information from multiple scales of resolution. Using a database of FengYun-2G IR satellite images, we compare our model’s intra-hour predictions to those of the state-of-the-art variational optical flow (VOF) method and three deep-learning models (ConvLSTM, LSTM, and GRU); our model performs better than the other methods.
The remainder of this paper is organized as follows: Section 2 introduces the GRU-RCN model. Section 3 describes the data we used and the experiments we conducted. Section 4 presents the results and briefly describes the other methods with which the GRU-RCN model was compared. Section 5 discusses the advantages and disadvantages of our model and our plans for future work. Section 6 provides our concluding remarks.
2. Methodology
2.1. Deep CNN
Deep CNNs [28] have been proven able to extract and utilize image features effectively and have achieved great success in visual recognition tasks. Regular neural networks do not scale well to full images because, in the case of large images, the number of model parameters increases drastically, leading to low efficiency and rapid overfitting. The deep-CNN architecture avoids this drawback. A deep CNN contains a sequence of layers, typically convolutional layers, pooling layers, and fully connected layers. In a deep CNN, the neurons in a given layer are not connected to all the neurons in the preceding layer but only to those in a kernel-sized region of it. This architecture provides a certain amount of shift and distortion invariance.
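For illustration, the following is a minimal PyTorch sketch of such a layer sequence. The layer counts, channel widths, and 64 × 64 input size are arbitrary choices for the example, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: convolution + pooling layers followed by a fully connected layer."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # each neuron sees only a 3x3 region
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling adds shift invariance
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # fully connected layer

    def forward(self, x):            # x: (batch, 1, 64, 64), e.g., a 64 x 64 grayscale patch
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))
```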
2.2. GRU
A GRU [43] is a type of RNN. An RNN is designed to process sequential data; it defines a recurrent hidden state, the activation of which depends on the previous state. Given a variable-length sequence $X = (x_1, x_2, \ldots, x_T)$, the hidden state $h_t$ of the RNN at each time step $t$ is updated by
$$h_t = \phi(h_{t-1}, x_t),$$
where $\phi$ is a nonlinear activation function.
An RNN can be trained to learn the probability distribution of sequences and thus to predict the next element in the sequence. At each time step $t$, the output can be represented as a probability distribution.
However, because of the vanishing-gradient and exploding-gradient problems, training an RNN becomes difficult when input/output sequences span long intervals [44]. Variant RNNs with complex activation functions, such as LSTMs and GRUs, have been proposed to overcome this problem. LSTMs and GRUs both perform well on machine-translation and video-captioning tasks, but a GRU has a simpler structure and lower memory requirements [45].
A GRU compels each recurrent unit to capture the dependencies of different timescales adaptively. The GRU model is defined by the following equations:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}),$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
$$\tilde{h}_t = \tanh\big(W x_t + U(r_t \odot h_{t-1})\big),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$
where $\odot$ denotes element-wise multiplication; $W$, $W_z$, $W_r$, $U$, $U_z$, and $U_r$ are weight matrices; $x_t$ is the current input; $h_{t-1}$ is the previous hidden state; $z_t$ is an update gate; $r_t$ is a reset gate; $\sigma$ is the sigmoid function; $\tilde{h}_t$ is a candidate activation, which is computed similarly to that of the traditional recurrent unit in an RNN; and $h_t$ is the hidden state at time step $t$.
The update gate determines the extent to which the hidden state is updated when the unit updates its contents, and the reset gate determines whether the previous hidden state is preserved. More specifically, when the value of the reset gate of a unit is close to zero, the information from the previous hidden state is discarded and the update is based exclusively on the current input of the sequence. By such a mechanism, the model can effectively ignore irrelevant information for future states. When the value of the reset gate is close to one, on the other hand, the unit remembers long-term information.
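As a concrete illustration, the following is a minimal NumPy sketch of one GRU update step following the equations above; the variable names and dimensions are ours, chosen for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update step.

    x_t:    current input, shape (input_dim,)
    h_prev: previous hidden state, shape (hidden_dim,)
    params: dict with matrices W_z, W_r, W (hidden_dim x input_dim)
            and U_z, U_r, U (hidden_dim x hidden_dim)
    """
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)          # update gate
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)          # reset gate
    h_tilde = np.tanh(params["W"] @ x_t + params["U"] @ (r * h_prev))  # candidate activation
    return (1.0 - z) * h_prev + z * h_tilde                            # new hidden state
```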
2.3. GRU-RCN
In this section, we introduce the GRU-RCN layer utilized in our model. A GRU converts input into a hidden state through fully connected units, but this can lead to an excessive number of parameters. In cloud imaging, the inputs of satellite images are 3-D tensors formed from the two spatial dimensions and the input channels. We regard the inputs as a sequence $(x_1, x_2, \ldots, x_T)$; the size of the hidden state should be the same as that of the input. Let $N_1$, $N_2$, and $C$ be the height, width, and number of channels of the input at every time step, respectively. If we applied a GRU to these inputs directly, the size of both the weight matrix $W$ and the weight matrix $U$ would be $(N_1 \times N_2 \times C) \times (N_1 \times N_2 \times C)$.
Images are composed of patterns with strong local correlation that are repeated at different spatial locations. Moreover, satellite images vary smoothly over time: the position of a tracked cloud in successive images will be restricted to a local spatial neighborhood. Ballas et al. [42] embedded convolution operations into the GRU architecture and proposed the GRU-RCN model. In this way, recurrent units have sparse connectivity and can share their parameters across different input spatial locations. The structure of the GRU-RCN model is expressed in the following equations:
$$z_t^l = \sigma(W_z^l * x_t^l + U_z^l * h_{t-1}^l),$$
$$r_t^l = \sigma(W_r^l * x_t^l + U_r^l * h_{t-1}^l),$$
$$\tilde{h}_t^l = \tanh\big(W^l * x_t^l + U^l * (r_t^l \odot h_{t-1}^l)\big),$$
$$h_t^l = (1 - z_t^l) \odot h_{t-1}^l + z_t^l \odot \tilde{h}_t^l,$$
where $*$ denotes convolution and the superscript $l$ denotes the layer of the GRU-RCN; the weight matrices $W^l$, $W_z^l$, $W_r^l$ and $U^l$, $U_z^l$, $U_r^l$ are 2-D convolutional kernels; and $h_t^l = \big(h_t^l(i,j)\big)$, where $h_t^l(i,j)$ is a feature vector defined at the location $(i,j)$.
With convolution, the spatial sizes of the kernels $W^l$, $W_z^l$, $W_r^l$ and $U^l$, $U_z^l$, $U_r^l$ are all $k_1 \times k_2$, where $k_1 \times k_2$ is the convolutional-kernel spatial size (chosen in this paper to be 3 × 3), significantly smaller than the size $N_1 \times N_2$ of the input frame. Furthermore, this method preserves spatial information, and we use zero padding in the convolution operation to ensure that the spatial size of the hidden state remains constant over time. The candidate hidden representation $\tilde{h}_t^l(i,j)$, the activation gate $z_t^l(i,j)$, and the reset gate $r_t^l(i,j)$ are defined based on a local neighborhood of size $k_1 \times k_2$ at the location $(i,j)$ in both the input data $x_t^l$ and the previous hidden state $h_{t-1}^l$. In addition, the size of the receptive field associated with $h_t^l(i,j)$ increases with every previous time step $h_{t-1}^l, h_{t-2}^l, \ldots$ as we go back in time. The model implemented in this paper is, therefore, capable of characterizing the spatiotemporal pattern of cloud motion with high spatial variation over time.
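To make the layer concrete, here is a minimal PyTorch sketch of a GRU-RCN (convolutional GRU) cell implementing these equations. For brevity, the input-to-state and state-to-state convolutions are fused into single convolutions over concatenated tensors, which is equivalent to keeping separate $W$ and $U$ kernels; the class name and channel arguments are ours, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU cell whose input-to-state and state-to-state transforms are 2-D convolutions."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # zero padding keeps the spatial size of the hidden state constant
        self.conv_gates = nn.Conv2d(in_channels + hidden_channels,
                                    2 * hidden_channels, kernel_size, padding=padding)
        self.conv_cand = nn.Conv2d(in_channels + hidden_channels,
                                   hidden_channels, kernel_size, padding=padding)

    def forward(self, x_t, h_prev):
        # update gate z and reset gate r, computed from the current input and previous state
        gates = torch.sigmoid(self.conv_gates(torch.cat([x_t, h_prev], dim=1)))
        z, r = gates.chunk(2, dim=1)
        # candidate activation uses the reset-gated previous state
        h_tilde = torch.tanh(self.conv_cand(torch.cat([x_t, r * h_prev], dim=1)))
        # convex combination of the previous state and the candidate
        return (1 - z) * h_prev + z * h_tilde
```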
2.4. Multi-GRU-RCN Model
In this section, we introduce the model structure of Multi-GRU-RCN. Ballas et al. [42] focused on the problem of video classification and therefore implemented a VGG16 model structure in their paper. However, this does not fit our problem well: we need to operate at the pixel level directly. The model structure of Shi et al. [41] consists of an encoding network as well as a forecasting network, and both networks are formed by stacking several ConvLSTM layers. In their model, there is a single input, and the input and output data have the same dimensions. We modified this model structure and proposed a new one, which can extract information from the surrounding context. The model structure is presented in Figure 1.
There are multiple inputs in this model, and the input from each small region has the same center as the input from the corresponding larger region. The input from the small region has the same dimensions as the output, while the input from the large region covers four times the area. We consider the region that is included in the large region but excluded from the small region as the surrounding context. The purpose of utilizing multiple inputs from different regions is to enrich the information with the surrounding context. Like the model of Shi et al. [41], our model consists of an encoding part and a forecasting part. In addition to stacked GRU-RCN layers, batch normalization [46] was introduced into both the encoding and forecasting parts to accelerate the training process and avoid overfitting. When utilizing the input from the large region, we used a max pooling layer to reduce the dimensions and improve the ability of the model to capture invariant information about the objects in the image. The initial states of the forecasting part are copied from the final states of the encoding part. We concatenate the output states of the two inputs and subsequently feed the result into a 1 × 1 convolutional layer with ReLU activation to obtain the final prediction.
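The following PyTorch sketch outlines this two-stream encoder-forecaster arrangement, reusing the ConvGRUCell from the previous sketch. The single-layer cells, channel width, omission of batch normalization, and the all-zero input frame driving the forecasting part are simplifying assumptions for illustration; this does not reproduce the exact architecture of Figure 1.

```python
import torch
import torch.nn as nn

class MultiGRURCNSketch(nn.Module):
    """Two-scale sketch: a 64 x 64 target patch plus a 128 x 128 context patch centered on it."""

    def __init__(self, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.pool = nn.MaxPool2d(2)               # reduces the 128 x 128 context patch to 64 x 64
        self.enc_small = ConvGRUCell(1, hidden)   # encoder for the target-region stream
        self.enc_large = ConvGRUCell(1, hidden)   # encoder for the surrounding-context stream
        self.fore_small = ConvGRUCell(1, hidden)  # forecasting cells, initialized from the encoders
        self.fore_large = ConvGRUCell(1, hidden)
        self.head = nn.Sequential(nn.Conv2d(2 * hidden, 1, kernel_size=1), nn.ReLU())

    def forward(self, seq_small, seq_large):
        # seq_small: (T, B, 1, 64, 64); seq_large: (T, B, 1, 128, 128)
        T, B = seq_small.shape[0], seq_small.shape[1]
        h_s = seq_small.new_zeros(B, self.hidden, 64, 64)
        h_l = seq_small.new_zeros(B, self.hidden, 64, 64)
        for t in range(T):                        # encoding pass over the input frames
            h_s = self.enc_small(seq_small[t], h_s)
            h_l = self.enc_large(self.pool(seq_large[t]), h_l)
        # forecasting pass: states copied from the encoders, driven by an all-zero frame
        zero_frame = seq_small.new_zeros(B, 1, 64, 64)
        h_s = self.fore_small(zero_frame, h_s)
        h_l = self.fore_large(zero_frame, h_l)
        # concatenate the two streams and map them to the predicted frame (1 x 1 conv + ReLU)
        return self.head(torch.cat([h_s, h_l], dim=1))
```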
4. Results and Analysis
One epoch is one training cycle through the entire training dataset. The models described in the previous sections were trained on the training dataset for 50 epochs and evaluated on the validation dataset after every epoch. The MSE loss is presented in Figure 5.
In Figure 5, it is apparent that the MSE loss declined dramatically over the first 10 epochs; thereafter, the rate of decline gradually decreased, and the MSE loss eventually converged to a lower level. After 40 epochs, the loss was small relative to that within the first 10. Despite fluctuations in the validation loss, its overall trend continued to decline, which indicates that the model was not overfitting. Thus, the training procedure was effective and converged to a satisfactory result.
We then randomly selected 20 days from 2018 as the test dataset, on which we compared our method (Multi-GRU-RCN) with the VOF technique, ConvLSTM, LSTM, and GRU. For the VOF algorithm, we used the method of Chow et al. [26], which minimizes an objective function based on brightness constancy and global smoothness assumptions. We set the size of the input patches to 128 × 128 and the size of the output patches to 64 × 64 to produce results comparable with those of Multi-GRU-RCN. For ConvLSTM, we adopted the model structure of Shi et al. [41], setting the kernel size to 3 × 3 for convolution. The input frame had the same size as the output frame. For LSTM and GRU, we used five frames to predict the next frame. Because LSTM and GRU cannot extract spatial information, we treated every pixel in a frame as an independent sample; thus, there were 4096 samples in a frame. All experiments were carried out on an NVIDIA Tesla T4 GPU. Training took 7.78 h for the GRU model, 9.44 h for the LSTM model, 12.29 h for the ConvLSTM model, and 13.96 h for the Multi-GRU-RCN model; the VOF method requires no training. At test time, predicting one frame took 2.57, 3.65, 3.72, 4.28, and 4.73 s with the VOF method, GRU model, LSTM model, ConvLSTM model, and Multi-GRU-RCN model, respectively.
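As an illustration of how input and output patches can be paired, the following sketch crops a 64 × 64 target patch and the 128 × 128 context patch sharing the same center from a full frame. The frame size and patch coordinates are hypothetical; this is not the authors' preprocessing code.

```python
import numpy as np

def crop_patch_pair(frame, center_row, center_col, small=64, large=128):
    """Return the (small x small) target patch and the (large x large) context patch
    that share the same center; assumes the center is far enough from the frame edge."""
    hs, hl = small // 2, large // 2
    target = frame[center_row - hs:center_row + hs, center_col - hs:center_col + hs]
    context = frame[center_row - hl:center_row + hl, center_col - hl:center_col + hl]
    return target, context

frame = np.random.randint(0, 256, (512, 512))   # stand-in for one IR satellite frame
target, context = crop_patch_pair(frame, 256, 256)
print(target.shape, context.shape)              # (64, 64) (128, 128)
```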
The MBEs of the predictions by VOF, GRU, LSTM, ConvLSTM, and Multi-GRU-RCN are 0.50%, 1.47%, 1.64%, −0.51%, and 0.45%, respectively. These nearly zero MBEs indicate that none of the methods systematically under- or over-forecasts, so no postprocessing step is needed to calibrate the results. Quantitative results in terms of PSNR and SSIM over the test dataset are summarized in Table 1. The results confirm that Multi-GRU-RCN achieves the most promising results among these methods on both the PSNR and SSIM metrics over the entire test dataset. Specifically, compared with ConvLSTM, Multi-GRU-RCN achieves a performance gain of 4.11% on PSNR and 2.60% on SSIM. To investigate the results in detail, we calculated the average PSNR and SSIM over the 64 test samples for each day. The PSNR and SSIM results of VOF, GRU, LSTM, ConvLSTM, and Multi-GRU-RCN on the test data for each day are compared in Figure 6 and Figure 7, respectively.
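For reference, the evaluation metrics can be computed as in the following sketch, which is a straightforward implementation of the standard MBE and PSNR definitions for 8-bit imagery; delegating SSIM to scikit-image is our assumption about tooling, not the authors' stated implementation.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mbe_percent(pred, truth):
    """Mean bias error as a percentage of the mean observed intensity."""
    return 100.0 * np.mean(pred - truth) / np.mean(truth)

def psnr(pred, truth, max_val=255.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, 255]."""
    mse = np.mean((pred.astype(np.float64) - truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example with random 64 x 64 frames standing in for a prediction and the ground truth.
truth = np.random.randint(0, 256, (64, 64)).astype(np.float64)
pred = truth + np.random.normal(0, 5, truth.shape)
print(mbe_percent(pred, truth), psnr(pred, truth), ssim(pred, truth, data_range=255.0))
```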
According to Figure 6 and Figure 7, the forecast results of these methods were consistent in terms of each metric. For instance, the PSNRs and SSIMs of the five methods were all highest on 2 February, which means that all five methods performed best on the data of that day. Based on the forecast results, the Multi-GRU-RCN method consistently outperformed the other four methods over the entire test period. VOF was the worst-performing method on the test data. Multi-GRU-RCN and ConvLSTM had quite similar performance in terms of both MSE and PSNR values, but Multi-GRU-RCN performed slightly better. The MSE of the Multi-GRU-RCN forecasts on the test dataset was 72.93, which means that the average intensity difference per pixel between ground truth and prediction was 8.54 (a satisfactory result, given that the intensity range was 0–255).
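The per-pixel figure is simply the root of the mean squared error:
$$\sqrt{\mathrm{MSE}} = \sqrt{72.93} \approx 8.54.$$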
To show the results more intuitively, we randomly picked three input sequences from the test dataset: May 2 between 0 and 5 am, January 31 between 6 and 11 pm, and July 7 between 6 and 11 am. Figure 8 shows the predictions of the next hour produced by VOF, GRU, LSTM, ConvLSTM, and Multi-GRU-RCN. The PSNRs of the predictions by VOF are 22.99, 23.01, and 22.91; those by GRU are 23.92, 24.63, and 23.50; those by LSTM are 23.98, 24.61, and 23.42; those by ConvLSTM are 28.40, 28.50, and 27.67; and those by Multi-GRU-RCN are 29.66, 30.37, and 29.86. The PSNR values of Multi-GRU-RCN are consistently larger than those of the other methods, which indicates that its predictions are more accurate. This result also agrees with the difference between the ground truth and the predictions. Even though the VOF predictions have sharper outlines, the Multi-GRU-RCN predictions are more accurate. When a cloud appears at the edge of the prediction domain, Multi-GRU-RCN predicts it better than VOF. This indicates that some of the complex spatiotemporal patterns in the dataset can be learned by the nonlinear and convolutional structure of the network. The model also performs well at predicting nonstationary processes, such as inversion and deformation, whereas VOF does not: in the VOF prediction for such situations, an abrupt change of intensity between adjacent pixels occurs at the bottom of the image. Multi-GRU-RCN gives a better prediction without a blocky appearance.
5. Discussion
The relationships among GRU, LSTM, ConvLSTM, GRU-RCN, and Multi-GRU-RCN are illustrated in Figure 9. The GRU simplifies the LSTM by replacing the forget gate and input gate with a single update gate and by combining the cell state and hidden state. By embedding convolutional operations in the recurrent unit, ConvLSTM and GRU-RCN were designed for spatiotemporal data, and GRU-RCN has fewer parameters than ConvLSTM. As the GRU-RCN model structure proposed by Ballas et al. [42] was aimed at the video classification problem, we changed the model structure to adapt it to the pixelwise cloud-motion prediction problem. We took the ConvLSTM model structure proposed by Shi et al. [41] as a reference and replaced the ConvLSTM layers with GRU-RCN layers. In addition, the surrounding context was introduced into our model to enrich the input information.
In predicting cloud motion, both temporal and spatial information provide important clues. Temporally, the current frame correlates with the previous frame; spatially, the intensity of a given pixel correlates with those of the surrounding pixels. A GRU captures temporal information but ignores spatial information; therefore, it underperforms the ConvLSTM model, which captures both. However, in the ConvLSTM model, the input frame has the same shape as the output frame: as a result of the convolutional operation and the same-padding method, it loses boundary information. In addition, the movement of clouds is very complicated and cannot be determined by looking at the current region exclusively; more information must be brought into the model. To improve prediction accuracy, especially in the boundary region, we incorporated the surrounding context into our new end-to-end model. The performance improvement of Multi-GRU-RCN is also attributable to the model structure. For instance, in the experiment, we set the large region as the input and the small region as the output for the VOF algorithm, and we also conducted a control experiment with the small region as both the input and output. The average PSNR and SSIM on the test dataset in the control experiment were 22.69 and 0.41, which indicates that introducing the large region achieves a performance gain of only 1.28% and 1.22% with the VOF algorithm. In terms of model structure, we exploited the max pooling layer to reduce dimensions, improve the ability of the model to capture invariant information about the cloud while it is moving, and fuse features from different scales. In addition, the activation functions introduce nonlinearity into the model [49]; stacking such activation functions produces a model capable of learning sophisticated patterns. The essential advantage of the end-to-end structure is that all the parameters of the model can be trained simultaneously, making the training process more effective. The predictions of our model have consistently higher PSNR and SSIM than those of the other methods. The spatial and temporal patterns learned by the model from the region of interest provide the foundation for predicting cloud motion, and the utilization of external information from outside the region of interest enriches the model's understanding of the environmental circumstances. This illustrates that utilizing information from both the internal and external regions reveals a more accurate pattern of cloud motion.
There are three possible explanations for the better performance of Multi-GRU-RCN over the VOF algorithm. First, Multi-GRU-RCN can learn complex patterns during the training process. Clouds often seem to appear instantaneously, indicating that they either move in from outside the domain or form suddenly. If similar situations occurred in the training dataset, Multi-GRU-RCN could learn these patterns during training and subsequently provide reasonable predictions on the test dataset; such events cannot be detected by the VOF algorithm. The second explanation is that Multi-GRU-RCN is trained end-to-end for this task, whereas the VOF algorithm is not an end-to-end model, and it is difficult to find a reasonable way to update the future flow fields. The final reason is that Multi-GRU-RCN can smooth out a blocky appearance, whereas the predictions of VOF will have a blocky appearance whenever there are abrupt changes in the motion vectors and therefore in the intensity between adjacent pixels.
Although the proposed Multi-GRU-RCN achieves promising intra-hour cloud motion prediction, the model still has limitations. Compared with the VOF algorithm, Multi-GRU-RCN produces blurrier predictions. This property is associated with the MSE loss used to train the model. The future state of a satellite cloud image is uncertain and by nature multimodal. When there are multiple valid outcomes with equal probability, the MSE loss accommodates the uncertainty by averaging all the possible outcomes, thus resulting in a blurry prediction. Generative adversarial networks (GANs) have emerged as a powerful alternative for enhancing prediction sharpness. In future work, we will combine the MSE loss with an adversarial training loss to improve the visual quality of the predictions. In addition, limited by the number of layers in its architecture, the model cannot completely eliminate the influence of interference, such as complex surface conditions. Li et al. [50] proposed a multi-scale convolutional feature fusion method for cloud detection. Their research confirmed that using dilated convolutional layers and fusing shallow appearance information with deep semantic information helps to improve interference tolerance.
In this paper, the forecasting range was one hour. Extending the forecast horizon will convert the output from a single frame to a sequence of frames. A weakness of the encoder-decoder architecture is that it lacks alignment between the input and output sequences. Bahdanau et al. [51] proposed an attention mechanism that utilizes a context vector to align the source and target inputs. The context vector preserves information from all hidden states in the encoder cells and aligns them with the current target output. The attention mechanism allows the decoder to “attend” to different parts of the source sequence at each step of output generation, and this concept has had a major impact on the field. Introducing the attention mechanism should help address the issues that arise in long-horizon prediction. Furthermore, we plan to incorporate more data sources to enrich the information in the dataset and to introduce data-fusion techniques into the model to improve accuracy. Combining our current research with the precipitation forecast problem also merits further research.