#### *3.1. Decomposition Modules*

#### 3.1.1. Wavelet Transform Block

Object edges represent abrupt changes between smooth regions and are concentrated in the high-frequency components of an image. Cloud edges thus offer crucial information about the motion and coverage of clouds. The wavelet transform is a powerful analysis tool widely used in image signal processing [32]. Compared with the Fourier transform, it captures frequency properties without losing location information. Therefore, we build the wavelet transform block (WTBlock) to extract frequency features and provide the RSI encoder with additional cloud details. As shown in Figure 2, the image signals are passed through high-pass and low-pass filters, sequentially along the horizontal and vertical directions. The high-pass filter (HPF) extracts high-frequency components such as edges, while the low-pass filter (LPF) obtains low-frequency components for approximation. We summarize this 2D discrete wavelet transform as

$$\mathbf{I}\_{\rm LL}, \mathbf{I}\_{\rm LH}, \mathbf{I}\_{\rm HL}, \mathbf{I}\_{\rm HH} = \text{WTBlock}(\mathbf{I}), \tag{2}$$

where the image signal $\mathbf{I}$ is decomposed into four components: the approximation $\mathbf{I}\_{\rm LL}$ (passing through LPFs in both directions) and three details ($\mathbf{I}\_{\rm LH}$, $\mathbf{I}\_{\rm HL}$, $\mathbf{I}\_{\rm HH}$) in the horizontal, vertical, and diagonal orientations, respectively.

**Figure 2.** Illustration of 2D wavelet transform. LPF denotes the low-pass filter and HPF denotes the high-pass filter.
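As a concrete reference, below is a minimal PyTorch sketch of a WTBlock using the Haar wavelet (the wavelet family and the depthwise stride-2 convolution implementation are our assumptions; the paper does not specify them):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WTBlock(nn.Module):
    """Single-level 2D Haar wavelet transform via depthwise convolution."""

    def __init__(self, channels: int):
        super().__init__()
        lo = torch.tensor([1.0, 1.0]) / 2 ** 0.5   # low-pass filter (LPF)
        hi = torch.tensor([1.0, -1.0]) / 2 ** 0.5  # high-pass filter (HPF)
        # Outer products give the four separable 2x2 kernels of Figure 2:
        # LL (approximation) and the LH/HL/HH details.
        bank = torch.stack([torch.outer(a, b)
                            for a in (lo, hi) for b in (lo, hi)])  # (4, 2, 2)
        self.register_buffer("weight",
                             bank.unsqueeze(1).repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x: torch.Tensor):
        # x: (B, C, H, W) -> four sub-bands, each (B, C, H/2, W/2)
        y = F.conv2d(x, self.weight, stride=2, groups=self.channels)
        b, _, h, w = y.shape
        ll, lh, hl, hh = y.view(b, self.channels, 4, h, w).unbind(dim=2)
        return ll, lh, hl, hh

# Example: decompose a batch of 40x40 single-channel cloud images.
ll, lh, hl, hh = WTBlock(1)(torch.randn(8, 1, 40, 40))  # each (8, 1, 20, 20)
```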

#### 3.1.2. Series Decomposition Block

Real-world sequential signals (e.g., PV power series) often contain entangled temporal patterns that are informative for forecasting. Time-series decomposition is an effective strategy for disentangling such patterns. Among decomposition methods, seasonal-trend decomposition [33] has been widely employed as a feature engineering technique that separates a sequence into seasonal and trend parts. Inspired by Autoformer [34], we apply this decomposition idea as a series decomposition block (SDBlock) to enhance the pattern extraction ability of DualET. Given an input sequence **X**, the procedure is

$$\mathbf{S}, \mathbf{T} = \text{SDBlock}(\mathbf{X}), \tag{3}$$

where **T** is the moving average result of **X** and is considered as the trend part; **S** is the residual part (i.e., the detrend part), which is regarded as the seasonal part. To keep the sequence length unchanged, a padding operation is performed on the input sequence.
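A minimal sketch of the SDBlock, following the moving-average decomposition used in Autoformer (the kernel size is an assumed hyperparameter; replication padding keeps the sequence length unchanged, as described above):

```python
import torch
import torch.nn as nn

class SDBlock(nn.Module):
    """Series decomposition of Equation (3): trend = moving average,
    seasonal = residual. kernel_size is an assumed hyperparameter."""

    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=0)
        self.kernel_size = kernel_size

    def forward(self, x: torch.Tensor):
        # x: (B, L, D). Pad by replicating the endpoints so the moving
        # average output keeps the sequence length L unchanged.
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        back = x[:, -1:, :].repeat(1, self.kernel_size // 2, 1)
        padded = torch.cat([front, x, back], dim=1)
        trend = self.avg(padded.transpose(1, 2)).transpose(1, 2)
        seasonal = x - trend  # detrended residual = seasonal part
        return seasonal, trend

# Example: s, t = SDBlock()(torch.randn(32, 24, 512)); both (32, 24, 512).
```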

#### *3.2. Learning Modules*

#### 3.2.1. Residual Connection and Residual Block

The residual network architecture (i.e., residual connection) has become a foundation of deep neural networks: stacked layers learn a residual mapping, which eases the optimization of deep models [35]. It can be formalized as *y* = *F*(*x*) + *x*, i.e., the input is added to the output of the stacked layers (*F*) to produce the result. For the RSI encoder, we employ residual blocks to learn the representation of the remote-sensing data. As shown in Figure 3, the residual block stacks two convolution layers with batch normalization [36] and ReLU activations. The process is summarized as

$$\mathbf{X}\_{\rm R} = \operatorname{ResBlock}(\mathbf{X}). \tag{4}$$

**Figure 3.** Illustration of residual block.
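A sketch of the residual block of Figure 3 (the 3 × 3 kernel size is an assumption):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block of Equation (4): two convolutions with batch
    normalization, ReLU activations, and a skip connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # y = F(x) + x
```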

#### 3.2.2. Attention Mechanism

As one of the most representative hallmarks of transformers, the attention mechanism is proposed as a query–key–value (QKV) model to learn long-range dependencies without recurrent structures. Given the matrices $\mathbf{Q} \in \mathbb{R}^{L\_q \times D\_k}$, $\mathbf{K} \in \mathbb{R}^{L\_k \times D\_k}$, and $\mathbf{V} \in \mathbb{R}^{L\_k \times D\_v}$ as the projected queries, keys, and values, the single-head version of the standard attention mechanism can be formalized as $\mathcal{A}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{D\_k}})\mathbf{V}$, where $L\_q$ and $L\_k$ denote the lengths of the queries and keys/values; $D\_k$ and $D\_v$ denote the projected dimensions; and $\frac{1}{\sqrt{D\_k}}$ is the scale factor that prevents $\text{Softmax}(\cdot)$ from yielding extremely small gradients. Furthermore, the multi-head version is as follows:

$$\begin{aligned} \mathcal{A}\_{\text{multi-head}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &= \text{Concat}(\text{head}\_1, \dots, \text{head}\_H) \mathbf{W}^O, \\ \text{where } \text{head}\_i &= \mathcal{A}(\mathbf{Q} \mathbf{W}\_i^Q, \mathbf{K} \mathbf{W}\_i^K, \mathbf{V} \mathbf{W}\_i^V). \end{aligned} \tag{5}$$

The queries, keys, and values with dimension $D$ are mapped into $H$ heads (i.e., subspaces) by $\mathbf{W}\_i^Q, \mathbf{W}\_i^K \in \mathbb{R}^{D \times D\_k}$ and $\mathbf{W}\_i^V \in \mathbb{R}^{D \times D\_v}$. Then, the outputs of these heads are concatenated and mapped back to dimension $D$ by $\mathbf{W}^O \in \mathbb{R}^{HD\_v \times D}$. In most cases, $D\_k = D\_v = D/H$. The standard transformer has two types of multi-head attention: self-attention and cross-attention. For self-attention, the projected queries, keys, and values come from the same source, while the key–value pairs of cross-attention typically come from the output of the encoder.
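For reference, the multi-head attention of Equation (5) coincides with PyTorch's built-in `nn.MultiheadAttention`; with the dimensions used later in Section 4.2.2 ($D = 512$, $H = 8$, hence $D\_k = D\_v = 64$), a cross-attention call looks like this:

```python
import torch
import torch.nn as nn

# D = 512, H = 8, so each head works in a D/H = 64-dimensional subspace.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
q = torch.randn(32, 12, 512)    # decoder-side queries (L_q = 12)
kv = torch.randn(32, 24, 512)   # encoder-side keys/values (L_k = 24)
out, _ = attn(q, kv, kv)        # out: (32, 12, 512)
```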

In practice, the attention modules used in DualET are modified to better capture dependencies. First, we design an additional cross-domain attention module to discover the correlations between image and sequence features. Concretely, the queries are the temporal features of the decoder, and the key–value pairs are the cloud information from the RSI encoder. Furthermore, we perform the fast Fourier transform (FFT) on the input and the inverse FFT on the output:

$$\mathcal{A}\_{\text{FFT}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathcal{F}^{-1}(\mathcal{A}(\mathcal{F}(\mathbf{Q}), \mathcal{F}(\mathbf{K}), \mathcal{F}(\mathbf{V}))), \tag{6}$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the FFT and its inverse; they are also used in the self-attention module of the LSI encoder. The FFT plays a key role in signal processing because it can rapidly convert a signal from the time/space domain to the frequency domain (and vice versa) and describe the relationships between these domains [37]. It is defined by

$$\overline{\mathbf{X}}\_{k} = \sum\_{m=0}^{L-1} e^{-2\pi i km/L} \mathbf{X}\_{m}, \quad k = 0, \ldots, L-1. \tag{7}$$

Based on the FFT, the attention module can discover frequency-domain correlations between queries and keys. In addition, we employ the ProbSparse attention mechanism [38] for the self-attention and cross-attention modules of the decoder to improve their performance.
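Equation (6) leaves open how the complex-valued spectra enter the softmax; one possible realization, assuming the FFT is taken along the time axis and the scores come from the real part of the complex inner product, is:

```python
import torch

def fft_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Sketch of A_FFT (Equation (6)): attention computed on frequency-
    domain representations, mapped back with the inverse FFT.
    Inputs are real-valued tensors of shape (B, L, D)."""
    L = q.size(1)
    qf, kf, vf = (torch.fft.rfft(x, dim=1) for x in (q, k, v))
    # Scores from the real part of the complex query-key inner product
    # (an assumption; Equation (6) does not prescribe this detail).
    scores = torch.einsum("bld,bmd->blm", qf, kf.conj()).real / q.size(-1) ** 0.5
    weights = torch.softmax(scores, dim=-1).to(vf.dtype)
    return torch.fft.irfft(torch.einsum("blm,bmd->bld", weights, vf), n=L, dim=1)
```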

#### 3.2.3. Embedding and Feed-Forward Layer

For sequence modeling, the order information of time steps is crucial. Furthermore, the timestamp records of local sequences (meteorological and PV power series) are instructive for PV power prediction but are hardly utilized in the standard transformer architecture. To introduce this information, we employ timestamp-embedding layers (following Autoformer [34]) for the local sequence inputs.

The feed-forward layer is a position-wise fully connected module, i.e., its learnable parameters are shared across all time steps. It contains two linear layers (**W**1, **b**1, **W**2, **b**2) with a ReLU activation in between, formulated as

$$\text{FeedForward}(\mathbf{X}) = \text{ReLU}(\mathbf{XW}\_1 + \mathbf{b}\_1)\mathbf{W}\_2 + \mathbf{b}\_2. \tag{8}$$
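In code, Equation (8) is two linear layers around a ReLU (the inner width `d_ff` is an assumption; 2048 is the conventional 4× choice for $D = 512$):

```python
import torch.nn as nn

def feed_forward(d_model: int = 512, d_ff: int = 2048) -> nn.Sequential:
    """Position-wise feed-forward layer of Equation (8)."""
    return nn.Sequential(
        nn.Linear(d_model, d_ff),  # X W_1 + b_1
        nn.ReLU(),
        nn.Linear(d_ff, d_model),  # (.) W_2 + b_2
    )
```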

#### *3.3. Remote-Sensing Information Encoder*

The RSI encoder is designed to learn spatial and temporal features from remote-sensing data. As shown in the top diagram of Figure 1, it is mainly composed of a two-dimensional convolution layer, a wavelet transform block, and a residual block. Given the historical $L\_{\rm in}$ steps of cloud images $\mathbf{X}\_{\rm RS} \in \mathbb{R}^{L\_{\rm in} \times H \times W \times C\_{\rm RS}}$ as the input of the RSI encoder, the procedure is

$$\begin{aligned} \mathbf{I}\_1 &= \text{Conv2D}(\mathbf{X}\_{\rm RS}), \\ \mathbf{I}\_{\text{LL}}, \mathbf{I}\_{\text{LH}}, \mathbf{I}\_{\text{HL}}, \mathbf{I}\_{\text{HH}} &= \text{WTBlock}(\mathbf{I}\_1), \\ \mathbf{I}\_2 &= \text{Conv2D}(\text{ResBlock}(\mathbf{I}\_1)), \\ \mathbf{I}\_3 &= \text{ConvFusion}(\text{Concat}([\mathbf{I}\_2, \mathbf{I}\_{\text{LL}}, \mathbf{I}\_{\text{LH}} + \mathbf{I}\_{\text{HL}} + \mathbf{I}\_{\text{HH}}])), \\ \mathbf{Z}\_{\text{RS}} &= \text{Linear}(\text{Flatten}(\text{ReLU}(\mathbf{I}\_3))), \end{aligned} \tag{9}$$

where ConvFusion is a 1 × 1 2D convolution that integrates the frequency components with the image features, and $\mathbf{Z}\_{\rm RS} \in \mathbb{R}^{L\_{\rm in} \times D}$ is the output of the RSI encoder.
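Reusing the WTBlock and ResBlock sketches above, the forward pass of Equation (9) can be sketched as follows (channel widths, kernel sizes, and the stride-2 second Conv2D that matches the half-resolution wavelet components are our assumptions):

```python
import torch
import torch.nn as nn

class RSIEncoder(nn.Module):
    """Sketch of Equation (9); reuses the WTBlock/ResBlock sketches above."""

    def __init__(self, c_rs: int, c_mid: int, d_model: int, h: int, w: int):
        super().__init__()
        self.conv1 = nn.Conv2d(c_rs, c_mid, 3, padding=1)
        self.wt, self.res = WTBlock(c_mid), ResBlock(c_mid)
        # Stride 2 so I_2 matches the half-resolution wavelet components.
        self.conv2 = nn.Conv2d(c_mid, c_mid, 3, stride=2, padding=1)
        self.fusion = nn.Conv2d(3 * c_mid, c_mid, 1)  # ConvFusion (1x1)
        self.proj = nn.Linear(c_mid * (h // 2) * (w // 2), d_model)

    def forward(self, x):  # x: (L_in, C_RS, H, W), one image per time step
        i1 = self.conv1(x)
        ll, lh, hl, hh = self.wt(i1)
        i2 = self.conv2(self.res(i1))
        i3 = self.fusion(torch.cat([i2, ll, lh + hl + hh], dim=1))
        return self.proj(torch.relu(i3).flatten(1))  # Z_RS: (L_in, D)

# Example (values are illustrative only):
# z_rs = RSIEncoder(16, 64, 512, 40, 40)(torch.randn(24, 16, 40, 40))
```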

#### *3.4. Local Seasonal Information Encoder*

For the weather and PV power series, the seasonal part contains the main volatility features, which are the key to accurate prediction. Therefore, we introduce series decomposition blocks into the LSI encoder to extract seasonal patterns from the local measurement data. As shown in the middle of Figure 1, the LSI encoder is stacked with $L\_{\rm LSI}$ LSI encoder layers. Given the input of the LSI encoder $\mathbf{X}\_{\rm LS}^{0} \in \mathbb{R}^{L\_{\rm in} \times D}$, which is embedded from $\mathbf{X}\_{\rm LS} \in \mathbb{R}^{L\_{\rm in} \times D\_{\rm LS}}$, the procedure in the $l$-th LSI encoder layer is

$$\begin{aligned} \mathbf{S}\_{\text{LS},1}^{l}, \mathbf{T}\_{\text{LS},1}^{l} &= \text{SDBlock}(\text{Attention}(\mathbf{X}\_{\text{LS}}^{l-1}) + \mathbf{X}\_{\text{LS}}^{l-1}),\\ \mathbf{S}\_{\text{LS},2}^{l}, \mathbf{T}\_{\text{LS},2}^{l} &= \text{SDBlock}(\text{FeedForward}(\mathbf{S}\_{\text{LS},1}^{l}) + \mathbf{S}\_{\text{LS},1}^{l}),\\ \mathbf{X}\_{\text{LS}}^{l} &= \mathbf{S}\_{\text{LS},2}^{l}, \end{aligned} \tag{10}$$

where "\_" is the ignored trend part; **S***<sup>l</sup>* LS,i, *i* ∈ {1, 2} denotes the seasonal part in the *l*-th layer; **<sup>Z</sup>**LS = **<sup>X</sup>***L*LSI LS denotes the output of the LSI encoder.

#### *3.5. Joint-Feature Decoder*

The joint-feature decoder models temporal dynamics based on the joint features of the local and remote-sensing data and then outputs the short-term PV power. As shown in the bottom diagram of Figure 1, the decoder is stacked with $L\_{\rm de}$ decoder layers, and each layer contains three attention modules (i.e., self-attention, cross-domain attention, and cross-attention) to determine the correlations from different perspectives. The outputs of the two encoders are integrated by ConvFusion as the input of the cross-attention: $\mathbf{Z}\_{\rm en} = \text{ConvFusion}(\text{Concat}([\mathbf{Z}\_{\rm RS}, \mathbf{Z}\_{\rm LS}]))$. The inputs of the decoder comprise the initialized seasonal part $\mathbf{X}\_{\rm de}^{0}$ and trend part $\mathbf{T}\_{\rm de}^{0}$, which are decomposed from the latter half of $\mathbf{X}\_{\rm LS}^{0}$ and concatenated with scalar placeholders (zeros for the seasonal part and the series means for the trend part). The details of the $l$-th decoder layer are

$$\begin{aligned} \mathbf{Z}\_{\text{de},1}^{l} &= \text{Attention}(\mathbf{X}\_{\text{de}}^{l-1}),\\ \mathbf{Z}\_{\text{de},2}^{l} &= \text{CrossDomain}(\mathbf{X}\_{\text{de}}^{l-1}, \mathbf{Z}\_{\text{RS}}),\\ \mathbf{Z}\_{\text{de},3}^{l} &= \mathbf{X}\_{\text{de}}^{l-1} + \text{ConvFusion}(\text{Concat}([\mathbf{Z}\_{\text{de},1}^{l}, \mathbf{Z}\_{\text{de},2}^{l}])),\\ \mathbf{S}\_{\text{de},1}^{l}, \mathbf{T}\_{\text{de},1}^{l} &= \text{SDBlock}(\text{Attention}(\mathbf{Z}\_{\text{de},3}^{l}, \mathbf{Z}\_{\text{en}}) + \mathbf{Z}\_{\text{de},3}^{l}),\\ \mathbf{X}\_{\text{de}}^{l}, \mathbf{T}\_{\text{de},2}^{l} &= \text{SDBlock}(\text{FeedForward}(\mathbf{S}\_{\text{de},1}^{l}) + \mathbf{S}\_{\text{de},1}^{l}),\\ \mathbf{T}\_{\text{de}}^{l} &= \mathbf{T}\_{\text{de}}^{l-1} + \mathcal{W}\_{1}^{l} \ast \mathbf{T}\_{\text{de},1}^{l} + \mathcal{W}\_{2}^{l} \ast \mathbf{T}\_{\text{de},2}^{l}, \end{aligned} \tag{11}$$

where $\mathbf{Z}\_{\text{de},i}^{l}$, $i \in \{1, 2, 3\}$, are the intermediate features; $\mathbf{S}\_{\text{de},1}^{l}$ and $\mathbf{X}\_{\rm de}^{l}$ denote the seasonal parts in the $l$-th layer; $\mathbf{T}\_{\text{de},i}^{l}$ ($i \in \{1, 2\}$) and $\mathbf{T}\_{\rm de}^{l}$ denote the trend parts in the $l$-th layer; and $\mathcal{W}\_{i}^{l}$ ($i \in \{1, 2\}$) denote the projection functions for the trend parts. After $L\_{\rm de}$ decoder layers, the final prediction $\hat{\mathbf{y}}$ is the sum of two parts, $\mathcal{W} \ast \mathbf{X}\_{\rm de}^{L\_{\rm de}} + \mathbf{T}\_{\rm de}^{L\_{\rm de}}$, where $\mathcal{W}$ is the projector for the seasonal part.
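A sketch of one decoder layer, reusing the earlier SDBlock and feed_forward sketches; the three attention callables stand in for the self-attention, cross-domain attention, and cross-attention modules of Section 3.2.2, and realizing $\mathcal{W}\_{1}^{l}$, $\mathcal{W}\_{2}^{l}$ as 1 × 1 convolutions is our assumption:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Sketch of Equation (11)."""

    def __init__(self, self_attn, cross_domain, cross_attn,
                 d_model: int = 512, c_out: int = 1):
        super().__init__()
        self.self_attn, self.cross_domain, self.cross_attn = \
            self_attn, cross_domain, cross_attn
        self.fusion = nn.Conv1d(2 * d_model, d_model, 1)  # ConvFusion
        self.ff = feed_forward(d_model)
        self.sd1, self.sd2 = SDBlock(), SDBlock()
        self.w1 = nn.Conv1d(d_model, c_out, 1)  # trend projection W_1
        self.w2 = nn.Conv1d(d_model, c_out, 1)  # trend projection W_2

    def forward(self, x, t, z_rs, z_en):
        z1 = self.self_attn(x)
        z2 = self.cross_domain(x, z_rs)  # queries from the decoder
        z3 = x + self.fusion(torch.cat([z1, z2], -1)
                             .transpose(1, 2)).transpose(1, 2)
        s1, t1 = self.sd1(self.cross_attn(z3, z_en) + z3)
        x_out, t2 = self.sd2(self.ff(s1) + s1)
        t = t + self.w1(t1.transpose(1, 2)).transpose(1, 2) \
              + self.w2(t2.transpose(1, 2)).transpose(1, 2)
        return x_out, t  # seasonal and accumulated trend outputs

# Example with placeholder attention callables:
# layer = DecoderLayer(lambda x: x, lambda x, z: x, lambda x, z: x)
```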

#### **4. Experiment**

In this section, we evaluate the proposed DualET on satellite images and actual PV station data. We first introduce the datasets and data preprocessing. Then, we describe the experimental setting in detail. Finally, we compare the prediction performance of DualET and the baseline models and conduct several ablation experiments.

#### *4.1. Datasets and Data Preprocessing*

Two datasets were used in this study, namely, satellite remote-sensing data and PV station data. The satellite data were the L1 gridded data from Himawari-8, a geostationary satellite launched in 2015 by the Japan Meteorological Agency to provide weather forecasts and typhoon and storm reports for Japan, East Asia, and the Western Pacific. The detection range of Himawari-8 spans 60° S to 60° N and 80° E to 160° W, with a spatial resolution of 0.05°, which corresponds to about 5 km on the ground, and a temporal resolution of 10 min. The PV station data contain local measurements from three real PV stations at different latitudes and longitudes in Hebei, China. Each station records meteorological factors (including global and diffuse irradiance, temperature, wind direction and speed, and air pressure) and PV power at 15 min intervals.

We set the temporal resolution to 30 min to harmonize the time intervals of the two datasets. The satellite remote-sensing data were processed into 40 × 40 cloud images centered on the latitude and longitude of each PV station. The satellite data were sampled from July 2018 to June 2019 to align with the PV station data. For each day, the data within the time range from 7:00 to 19:00 (UTC+8) were used. We divided the two datasets into a training set, a validation set, and a test set in the ratio of 8:1:1 after arranging them to ensure that the test period covers multiple seasons. Before being input to the model, the data were standardized to eliminate the magnitude inconsistency across dimensions.
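As a sketch, the split and normalization can be written as follows (fitting the statistics on the training split only is our assumption, and the season-covering rearrangement step is omitted):

```python
import numpy as np

def split_and_standardize(x: np.ndarray):
    """8:1:1 train/validation/test split with z-score standardization.
    Statistics come from the training split (an assumption); the small
    epsilon guards against constant features."""
    n_train, n_val = int(0.8 * len(x)), int(0.1 * len(x))
    train, val, test = np.split(x, [n_train, n_train + n_val])
    mean, std = train.mean(axis=0), train.std(axis=0) + 1e-8
    return (train - mean) / std, (val - mean) / std, (test - mean) / std
```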

#### *4.2. Experimental Setting*

#### 4.2.1. Baseline Models

We selected five models as baselines for comparison, namely, the classic statistical model ARIMA [39], the RNN-based model LSTM [40], and three state-of-the-art models, i.e., Transformer [28] and its variants Informer [38] and Autoformer [34].

#### 4.2.2. Hyperparameters and Platform

Our model DualET and the transformer baselines, i.e., Transformer, Informer, and Autoformer, were set to the same number of layers: two encoder layers and one decoder layer. The hidden dimension $D$ of the model was set to 512, and the number of attention heads was set to 8. The batch size was set to 32, and the number of training epochs was set to 10 (with early stopping). The loss function was the mean-squared error (MSE) (Equation (12)), and the optimizer was Adam with an initial learning rate of $1 \times 10^{-4}$. The input length of the model was set to 24 steps, i.e., 12 h, and the output prediction length was set to 12 steps, i.e., 6 h. All the models were implemented with PyTorch and run on an Ubuntu server with four NVIDIA GeForce RTX 2080Ti 11 GB GPUs.

$$\text{MSE} = \frac{1}{N} \sum\_{t=1}^{N} (y\_t - \hat{y}\_t)^2. \tag{12}$$

#### 4.2.3. Evaluation Metrics

We evaluated the performance of the model with three widely used metrics, i.e., mean absolute error (MAE), root-mean-squared error (RMSE), and symmetric mean absolute percentage error (SMAPE).

$$\begin{aligned} \text{MAE} &= \frac{1}{N} \sum\_{t=1}^{N} |y\_t - \hat{y}\_t|, \\ \text{RMSE} &= \sqrt{\frac{1}{N} \sum\_{t=1}^{N} (y\_t - \hat{y}\_t)^2}, \\ \text{SMAPE} &= \frac{100\%}{N} \sum\_{t=1}^{N} \frac{|y\_t - \hat{y}\_t|}{(|y\_t| + |\hat{y}\_t|)/2}. \end{aligned} \tag{13}$$
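The three metrics of Equation (13) translate directly into NumPy (the small `eps` guard is our addition, not part of Equation (13)):

```python
import numpy as np

def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def smape(y: np.ndarray, y_hat: np.ndarray, eps: float = 1e-8) -> float:
    # Returned in percent; eps guards against the zero denominators that
    # occur when both series are zero (e.g., PV power at night).
    denom = (np.abs(y) + np.abs(y_hat)) / 2 + eps
    return float(100.0 * np.mean(np.abs(y - y_hat) / denom))
```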

#### *4.3. Results*

As shown in Table 1, we evaluated the prediction performance of the proposed DualET on three different PV stations. For short-term (6 h) PV power prediction, DualET achieved the best results on all three error metrics: MAE, RMSE, and SMAPE. Compared with the other models, averaged over all stations, DualET achieved relative MAE and RMSE reductions of 22.53% and 22.75%, respectively, which is a significant improvement. The average MAE reduction exceeded 53% compared with the traditional ARIMA model and reached 27.72% compared with the popular LSTM. Among the baselines, the transformer-based models, i.e., Transformer, Informer, and Autoformer, outperformed the ARIMA and LSTM models. Moreover, DualET still outperformed these competitive transformer-based models, yielding a relative MAE reduction of 10.64%.


**Table 1.** Prediction performance of the proposed DualET.

The prediction results of the different models are presented in Figures 4 and 5. They clearly show that the number of deviation points predicted by the statistical model ARIMA is much larger than those predicted by the DNN-based models, which indicates that ARIMA is unsuitable for short-term prediction horizons of several hours, especially for nonstationary PV power series. As shown in Figure 5, the LSTM model has a more scattered distribution of points than the transformer-based models, which indicates the significance of the transformer architecture for sequence modeling. It can also be seen that the proposed DualET presents the best-fitting curves in Figure 4 and the narrowest band in Figure 5 compared with the other baselines, which demonstrates its advantages for PV power prediction.

**Figure 4.** Prediction results of different models.

**Figure 5.** Scatter plots of different models.

#### *4.4. Ablation Studies*

DualET contains dual encoders, including the LSI encoder to deal with local seasonal information and the RSI encoder to process remote-sensing information, with a shared decoder to combine joint features. In addition, there are different decomposition modules and attention modules employed in DualET to enhance feature extraction. We conducted additional experiments to evaluate the impact of the dual encoders and different functional modules with MAE, RMSE, and SMAPE as evaluation metrics, and we present the results of these experiments in this section.
