*3.3. Temporal Attention Mechanism*

In addition to the feature attention mechanism, a temporal attention mechanism assigns attention weights to the temporal information carried by each historical moment of the input sequence, so that the influence of each moment on the output at the current time can be distinguished. At the same time, the temporal features of each historical moment are extracted independently, and the information expressed at critical moments is enhanced. The architecture of the temporal attention mechanism is shown in Figure 3.

**Figure 3.** Architecture of the temporal attention mechanism.

From Figure 3, it can be seen that the input is the hidden-layer state of the Bi-LSTM network iterated to time *t*, which can be expressed as *h<sub>t</sub>* = [*h<sub>1,t</sub>*, *h<sub>2,t</sub>*, ..., *h<sub>n,t</sub>*], where *n* is the time window length of the input sequence. The temporal attention weight vector *l<sub>t</sub>*, which relates the current moment to each historical moment, can be described as:

$$l\_t = \text{ReLU}(W\_d h\_t + b\_d) \tag{9}$$

where *l<sub>t</sub>* = [*l<sub>1,t</sub>*, *l<sub>2,t</sub>*, ..., *l<sub>n,t</sub>*]; *W<sub>d</sub>* is a trainable weight matrix; *b<sub>d</sub>* is a bias vector; and ReLU(·) is the activation function, which increases the differences between features and makes the weight distribution more concentrated.

Moreover, from Figure 3, it can be seen that the scores produced by the activation function are normalized by the softmax function to obtain the temporal attention weights, which can be expressed as *β<sub>t</sub>* = [*β<sub>1,t</sub>*, *β<sub>2,t</sub>*, ..., *β<sub>n,t</sub>*], where *β<sub>k,t</sub>* is the attention weight of the *k*-th historical moment, which can be denoted as:

$$\beta\_{k,t} = \frac{\exp(l\_{k,t})}{\sum\_{i=1}^{n} \exp(l\_{i,t})} \tag{10}$$

Hence, the weighted feature vector *h′<sub>t</sub>* can be obtained by recalculating the hidden-layer feature vector *h<sub>t</sub>* with the temporal attention weights, which can be expressed as:

$$h\_t' = \beta\_t \odot h\_t = \sum\_{i=1}^{n} \beta\_{i,t} h\_{i,t} \tag{11}$$
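
For illustration, a minimal sketch of Equations (9)–(11) is given below. PyTorch is an assumption (the paper does not name a framework), and the per-moment scoring layer, tensor shapes, and variable names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Temporal attention over Bi-LSTM hidden states, following Eqs. (9)-(11).

    Assumed input shape: (batch, n, hidden_dim), where n is the time window length.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # W_d and b_d of Eq. (9); scoring each moment with a single linear unit
        # is an assumption about how the weight matrix is applied.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor):
        # Eq. (9): l_t = ReLU(W_d h_t + b_d), one score per historical moment
        l = F.relu(self.score(h)).squeeze(-1)                   # (batch, n)
        # Eq. (10): softmax normalization over the n historical moments
        beta = F.softmax(l, dim=-1)                             # (batch, n)
        # Eq. (11): attention-weighted sum of the hidden states
        h_weighted = torch.sum(beta.unsqueeze(-1) * h, dim=1)   # (batch, hidden_dim)
        return h_weighted, beta
```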

### *3.4. The Proposed Attention-Bi-LSTM PGPM*

In this paper, the Attention-Bi-LSTM PGPM, based on the attention mechanism and the Bi-LSTM network, is proposed. It consists of an input layer, a feature attention layer, a Bi-LSTM layer, a temporal attention layer, a residual connection layer, and a fully connected layer; the architecture of the Attention-Bi-LSTM PGPM is shown in Figure 4.

From Figure 4, it can be seen that a Bi-LSTM network is built to extract the hidden temporal correlation information from the input sample *X<sub>t</sub>*, which is composed of the historical sequence and the related four-dimensional input feature vector extracted from environmental factors. The sample is fed into the first Bi-LSTM network, and the hidden-layer feature *X′<sub>t</sub>* is obtained.
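
As a minimal illustration of this step (PyTorch is an assumption, and the batch size, window length, and hidden dimension below are placeholders rather than the paper's settings), the first Bi-LSTM can be sketched as:

```python
import torch
import torch.nn as nn

batch, n, num_features, hidden_dim = 32, 24, 4, 64   # illustrative sizes only

# First Bi-LSTM: extracts hidden temporal correlation information from X_t
bilstm_1 = nn.LSTM(input_size=num_features, hidden_size=hidden_dim,
                   batch_first=True, bidirectional=True)

X_t = torch.randn(batch, n, num_features)   # history sequence + environmental features
X_prime_t, _ = bilstm_1(X_t)                # hidden-layer feature X'_t: (batch, n, 2*hidden_dim)
```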

**Figure 4.** Architecture of Attention-Bi-LSTM PGPM.

Then, the feature attention mechanism is used to explore the potential correlations among the hidden-layer features *X′<sub>t</sub>*: the features extracted by the first Bi-LSTM are sent to the feature attention layer, where the attention weight of each feature is allocated dynamically, and the weighted hidden-layer feature *X<sup>a</sup><sub>t</sub>* is obtained, as sketched below.
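
The exact formulation of the feature attention layer is given in Section 3.2; the per-feature scoring and softmax used in the following sketch (analogous to Equations (9) and (10), but applied over the feature dimension) are therefore assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAttention(nn.Module):
    """Assumed feature attention layer: a learned score per feature dimension,
    normalized with softmax and used to reweight the hidden-layer feature X'_t."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, feature_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, feature_dim) -- hidden-layer feature X'_t
        alpha = F.softmax(F.relu(self.score(x)), dim=-1)   # feature attention weights
        return alpha * x                                    # weighted feature X^a_t
```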

Next, the weighted feature *X<sup>a</sup><sub>t</sub>* is combined with the original feature *X′<sub>t</sub>* through a residual connection, and the result is fed into the second Bi-LSTM to obtain the hidden-layer feature *h<sub>t</sub>*. The correlation between the historical sequence and the feature *h<sub>t</sub>* is mined in the second Bi-LSTM's hidden layer, and the weighted feature vector *h′<sub>t</sub>* is then computed in the temporal attention layer, as in Equations (9)–(11). Finally, the power generation is predicted by the fully connected layer from *h′<sub>t</sub>*.
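
Putting the steps together, a hedged end-to-end sketch of the Attention-Bi-LSTM PGPM (reusing the `TemporalAttention` and `FeatureAttention` sketches above, with illustrative layer sizes and a single power-generation output) could look like this:

```python
import torch
import torch.nn as nn

class AttentionBiLSTMPGPM(nn.Module):
    """End-to-end sketch of the pipeline in Figure 4: first Bi-LSTM, feature
    attention, residual connection, second Bi-LSTM, temporal attention, and a
    fully connected output layer. Sizes are assumptions, not the paper's values."""

    def __init__(self, num_features: int = 4, hidden_dim: int = 64):
        super().__init__()
        self.bilstm_1 = nn.LSTM(num_features, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.feature_attention = FeatureAttention(2 * hidden_dim)
        self.bilstm_2 = nn.LSTM(2 * hidden_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.temporal_attention = TemporalAttention(2 * hidden_dim)
        self.fc = nn.Linear(2 * hidden_dim, 1)   # fully connected output layer

    def forward(self, X_t: torch.Tensor) -> torch.Tensor:
        X_prime, _ = self.bilstm_1(X_t)             # hidden-layer feature X'_t
        X_a = self.feature_attention(X_prime)       # weighted feature X^a_t
        h, _ = self.bilstm_2(X_a + X_prime)         # residual connection, second Bi-LSTM
        h_weighted, _ = self.temporal_attention(h)  # weighted feature vector h'_t
        return self.fc(h_weighted)                  # power-generation prediction
```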
