4.1. Results of Point Prediction
For point prediction, the performance of six methods, namely Ridge, RF, GRU, NGB, GBRT-Mean, and GBRT-Med, is compared. For each prediction model, the choice of hyperparameters is crucial to its performance. We use grid search with cross validation on the validation set to evaluate model performance and determine the optimal hyperparameters of each model, which are listed in
Table 1. The order of importance of the RF model variables is illustrated in
Figure 4.
It can be observed from
Figure 4 that global horizontal radiation ranks first in importance among all independent variables, accounting for more than 50% of the total variable importance, followed by t-15 and t-30.
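The grid search with cross validation and the impurity-based variable-importance ranking described above can be sketched as follows with scikit-learn; the data here are synthetic stand-ins for the actual features (global horizontal radiation, t-15, t-30, etc.), and the parameter grid is illustrative rather than the one used in the study.

```python
# Sketch of hyperparameter search plus variable-importance ranking for
# a random forest, on synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))              # stand-ins for GHI, t-15, t-30, ...
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Grid search with cross validation over a small illustrative grid.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
search.fit(X, y)

# Impurity-based importances, normalised to sum to one.
importances = search.best_estimator_.feature_importances_
ranking = np.argsort(importances)[::-1]    # indices, most important first
```

With this construction the first feature dominates the ranking, mirroring how global horizontal radiation dominates in Figure 4.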
To ensure a faster convergence rate and a better learning effect, we standardize the data before establishing the GRU model. As a mainstream optimization algorithm, Adam [
46] is chosen as the optimizer. The number of iterations is set to 1000, the learning rate to 0.01, the number of hidden layers to 4, the number of hidden nodes per layer to 60, and the regularization parameter to 0.0001.
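The standardization step applied before training the GRU can be sketched as a simple z-score transform (the GRU itself is omitted); the feature matrix below is a hypothetical stand-in for the actual inputs.

```python
# Z-score standardization: each column is shifted to zero mean and
# scaled to unit variance before being fed to the network.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 3))

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma   # each column now has zero mean, unit variance
```

At prediction time the same `mu` and `sigma` from the training set would be reused, so that the model never sees statistics computed from test data.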
In the NGB model, we choose the Gaussian distribution, and the learning rate is set to 0.01, the number of iterations is 532, and the percent subsample of rows to use in each boosting iteration is 0.4. The importance ranking of location parameter variables is depicted in the left panel of
Figure 5. As in the RF results, global horizontal radiation ranks first in importance, followed by t-15, and month_cos is more important than month_sin. The third most important variable differs slightly, however: it is the diffuse horizontal radiation.
In the GBRT-Mean method, we set the maximum depth of the tree to 5, the number of boosting stages to perform is 400, the minimum number of samples required to split an internal node is 10, the minimum number of samples required to be at a leaf node is 15, and the learning rate is 0.05. For the median method, GBRT-Med, we set the maximum depth of the tree to 15, the number of boosting stages to perform is 400, the minimum number of samples required to split an internal node is 15, the minimum number of samples required to be at a leaf node is 10, and the learning rate is 0.15.
To measure the closeness between the predicted and actual values, the relevant evaluation indexes are calculated from the prediction results of the six models described above to evaluate their prediction performance; the results are listed in
Table 2. The best results under each evaluation index are displayed in bold according to the description in
Section 3.2.1.
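For reference, the four point-prediction indexes can be written down directly in numpy. These are the common definitions; the paper's exact formulas are given in Section 3.2.1, and the SMAPE convention in particular varies across the literature, so this is one standard variant.

```python
# Standard definitions of MAE, RMSE, MAPE, and SMAPE.
import numpy as np

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    # Percentage error relative to the actual values (undefined at y=0).
    return 100.0 * np.mean(np.abs((y - yhat) / y))

def smape(y, yhat):
    # Symmetric variant: error relative to the mean magnitude of both.
    return 100.0 * np.mean(2.0 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

y = np.array([1.0, 2.0, 4.0])
yhat = np.array([1.5, 2.0, 3.0])
# mae -> 0.5, mape -> 25.0
```

All four indexes are negatively oriented: smaller values indicate a closer fit between predictions and observations.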
Table 2 reveals that, compared to the other models, the MAE, RMSE, MAPE, and SMAPE calculated from the prediction results of the RF model are the smallest and its R²
is the largest, which indicates that the RF model has the best performance. However, the differences between RF and both GBRT-Mean and GBRT-Med are negligible for most indicators, indicating that their performances are similar, followed by GRU and NGB. The worst is Ridge regression, whose accuracy is far from that of the other methods.
4.2. Results of Interval Prediction
In this study, we have compared the performance and prediction interval quality of twelve recently proposed interval prediction methods. Then, we have used six ensemble methods to combine the prediction intervals of a subset of these methods to obtain better prediction intervals.
Specifically, the following methods are chosen: the KDE methods, including GRU-KDE, RF-KDE, Ridge-KDE, GBRT-Mean-KDE, and GBRT-Med-KDE, built on the corresponding point prediction residuals; the J+ab methods, including J+ab-Ridge, J+ab-MLP, and J+ab-RF; RF-OOB, QRF, and SC-RF, based on random forest; and NGB.
For the KDE method, we train the point prediction models, including GRU, RF, Ridge, GBRT-Mean, and GBRT-Med (see
Section 4.1 for the training process and hyperparameter adjustment). Then, the residuals between the predicted and actual values of the five models are calculated. Next, we estimate the kernel density bandwidth of the residuals by cross validation, with candidate values from 0.005 to 0.15; the bandwidths obtained for the five models are 0.040, 0.016, 0.072, 0.034, and 0.026, respectively. We then compute the cumulative distribution function and the corresponding quantiles. At confidence levels of 95%, 90%, 85%, and 80%, the upper and lower quantiles corresponding to the five methods are demonstrated in
Table 3.
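The KDE step above can be sketched as follows: fit a Gaussian kernel density to the point-prediction residuals, selecting the bandwidth by cross validation over the stated range, then read the interval bounds off the numerical CDF of the fitted density. The residuals here are synthetic stand-ins, and the grid-based CDF inversion is one simple way to obtain the quantiles, not necessarily the paper's exact implementation.

```python
# KDE of residuals with cross-validated bandwidth, then 95% quantiles.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
residuals = rng.normal(scale=0.05, size=500).reshape(-1, 1)

# Bandwidth selection by cross validation over 0.005..0.15, as in the text.
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    param_grid={"bandwidth": np.linspace(0.005, 0.15, 30)},
    cv=5,
)
search.fit(residuals)
kde = search.best_estimator_

# Numerical CDF of the fitted density on a fine grid, then quantiles.
grid = np.linspace(-0.5, 0.5, 2001).reshape(-1, 1)
pdf = np.exp(kde.score_samples(grid))
cdf = np.cumsum(pdf)
cdf /= cdf[-1]
lo = grid[np.searchsorted(cdf, 0.025), 0]   # lower bound of 95% interval
hi = grid[np.searchsorted(cdf, 0.975), 0]   # upper bound of 95% interval
```

Adding `lo` and `hi` to each point forecast then yields the KDE prediction interval for that model.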
For the J+ab method, the multi-layer perceptron in J+ab-MLP uses the Adam optimizer, the maximum number of iterations is set to 8000, the activation function is tanh, the sizes of the three hidden layers are 50, 40, and 30, and the regularization parameter is 0.001. The parameter settings of the J+ab-Ridge and J+ab-RF base learners are the same as those in
Section 4.1.
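The J+ab construction can be sketched compactly, here with Ridge as the base learner. This follows the standard jackknife+-after-bootstrap recipe (bootstrap ensembles plus leave-one-out aggregation of predictions and residuals); the paper's exact implementation and quantile conventions may differ, and all data below are synthetic.

```python
# Compact jackknife+-after-bootstrap (J+ab) sketch with a Ridge base learner.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n, B, alpha = 200, 30, 0.05
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=n)
x_test = rng.normal(size=(1, 3))

models, in_bag = [], []
for _ in range(B):
    idx = rng.integers(0, n, size=n)           # bootstrap resample
    models.append(Ridge(alpha=1.0).fit(X[idx], y[idx]))
    in_bag.append(np.bincount(idx, minlength=n) > 0)
in_bag = np.array(in_bag)                      # (B, n) membership mask

lo_vals, hi_vals = [], []
for i in range(n):
    oob = ~in_bag[:, i]                        # models not trained on point i
    if not oob.any():
        continue
    mu_i = np.mean([m.predict(X[i:i + 1])[0] for m, u in zip(models, oob) if u])
    mu_t = np.mean([m.predict(x_test)[0] for m, u in zip(models, oob) if u])
    r = abs(y[i] - mu_i)                       # leave-one-out residual
    lo_vals.append(mu_t - r)
    hi_vals.append(mu_t + r)

lower = np.quantile(lo_vals, alpha)            # J+ab interval bounds
upper = np.quantile(hi_vals, 1 - alpha)
```

Swapping the Ridge learner for an MLP or a random forest gives the J+ab-MLP and J+ab-RF variants, respectively.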
For the NGB method, the specific modeling process and parameters are described in
Section 4.1. For interval prediction with NGB, the scale parameter is the more important one. The order of importance of the variables affecting the scale parameter is depicted in the right panel of
Figure 5. It can be observed that the three most important factors affecting the scale parameters are the global horizontal radiation, t-15, and diffuse horizontal radiation.
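Because NGB outputs a full Gaussian distribution per point, a prediction interval follows directly from the predicted location and scale via normal quantiles. The loc/scale values below are illustrative stand-ins, not NGB outputs.

```python
# Central (1 - alpha) interval from a predicted Gaussian loc and scale.
import numpy as np
from scipy.stats import norm

loc = np.array([1.2, 0.8])        # predicted means (illustrative)
scale = np.array([0.1, 0.3])      # predicted standard deviations (illustrative)
alpha = 0.05

z = norm.ppf(1 - alpha / 2)       # ~1.96 for a 95% interval
lower = loc - z * scale
upper = loc + z * scale
```

This is why the scale parameter drives the interval width: a larger predicted scale widens the interval proportionally.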
For the prediction intervals established by the above method, we draw the resulting graph of active power from 1 September 2015 to 4 September 2015, as illustrated in
Figure 6. In addition, the corresponding weather conditions including wind speed, temperature (Celsius), and relative humidity are depicted in
Figure 7. It can be observed from
Figure 6 that, in general, the prediction intervals obtained by the twelve methods follow a variation trend similar to that of the actual values and contain most of them. Among the methods, the intervals obtained by J+ab-Ridge, J+ab-MLP, and J+ab-RF are relatively wide, followed by Ridge-KDE, while those of the other methods are relatively narrow.
In addition, we calculate PICP, PINAW, Winkler score, CWC, and MPICD from the prediction interval of each method and the results are depicted in
Figure 8. The specific values at the 95% confidence level are listed in
Table 4 (see
Table A1,
Table A2 and
Table A3 for the results at other confidence levels).
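Three of these interval-quality indexes can be written down concisely: PICP is the empirical coverage, PINAW is the average width normalised by the target range, and for the Winkler score the standard (positively valued, lower-is-better) definition is shown below, whereas the paper reports a variant on a negative scale.

```python
# Common definitions of PICP, PINAW, and the Winkler score.
import numpy as np

def picp(y, lower, upper):
    # Fraction of actual values covered by the interval.
    return np.mean((y >= lower) & (y <= upper))

def pinaw(y, lower, upper):
    # Mean interval width, normalised by the range of the targets.
    return np.mean(upper - lower) / (y.max() - y.min())

def winkler(y, lower, upper, alpha):
    # Width plus a 2/alpha penalty for each violation of the bounds.
    width = upper - lower
    below = (lower - y) * (y < lower)
    above = (y - upper) * (y > upper)
    return np.mean(width + (2.0 / alpha) * (below + above))

y = np.array([0.0, 1.0, 2.0, 3.0])
lower = y - 0.5
upper = y + 0.5
# picp -> 1.0, pinaw -> 1/3
```

A good interval therefore needs PICP at or above the nominal level while keeping PINAW, and hence the Winkler penalty, small.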
According to the results in
Figure 8, the models with PICP close to the nominal level are J+ab-Ridge, RF-OOB, SC-RF, NGB, GRU-KDE, and Ridge-KDE for different indexes under each confidence level. The models with narrow PINAW and high Winkler score are RF-OOB, SC-RF, QRF, NGB, GRU-KDE, RF-KDE, GBRT-Mean-KDE and GBRT-Med-KDE, which are less than 0.12 and greater than −1.1, respectively. The models with smaller CWC are RF-OOB, SC-RF, QRF, NGB, GRU-KDE, Ridge-KDE, and GBRT-Mean-KDE, all of which are less than 0.23. The models with smaller MPICD are RF-OOB, SC-RF, QRF, NGB, GRU-KDE, RF-KDE, Ridge-KDE, GBRT-Mean-KDE, and GBRT-Med-KDE, all of which are less than 0.13.
More specifically, for the performance of each method: although the PICP of the three J+ab methods is generally close to the confidence level, their PINAW is too large and their other indexes are also not ideal. The PICP of QRF and GBRT-Mean-KDE sometimes falls short of the confidence level, although it remains relatively close, and their other indexes are relatively ideal. The PICP of RF-KDE and GBRT-Med-KDE is far from the given confidence level; compared with the other methods, their CWC is larger, while their other indexes are ideal. The PICP of Ridge-KDE reaches the nominal level, but its PINAW is slightly higher, resulting in a relatively wide interval. In contrast, NGB, GRU-KDE, RF-OOB, and SC-RF perform well on all indexes.
In addition, we also compare the total computational time of each method for obtaining the prediction interval under four confidence levels, as depicted in
Figure 9. The training was completed by a personal computer with AMD R7-5800h CPU, 3.20 GHz processor and 16 GB memory. It can be observed from
Figure 9 that the methods with the shortest time are RF-OOB, NGB, and SC-RF, all of which take less than 40 s, followed by QRF, J+ab-Ridge, J+ab-MLP, and RF-KDE, all under 200 s. Meanwhile, the most time-consuming methods are J+ab-RF, GBRT-Mean-KDE, GBRT-Med-KDE, GRU-KDE, and Ridge-KDE, all of which take more than 200 s. In general, the KDE methods take longer than the others, probably owing to the time-consuming cross validation used to select the bandwidth.
To ensure that the results are more stable, we have repeated the tests ten times. As illustrated in
Figure 10, the results obtained by the J+ab methods are unstable, with values fluctuating by more than 0.1 for PICP and 0.05 for PINAW, while the results obtained by the other methods overlap and lie almost on a straight line, indicating much more stable behavior; for readability, these are not shown in the graph. Therefore, in the previous results we reported only the test corresponding to the fifth PICP value of each method among the ten tests at the 95% confidence level.
Based on the above results, the J+ab methods are not considered in the subsequent ensemble with the other methods, because J+ab is itself already an ensemble method, it performs poorly on various indicators, and its prediction intervals contain a number of outliers.
As for the ensemble part, we have implemented six methods, including Ensemble-Mean, Ensemble-Med, Ensemble-En, Ensemble-TE, Ensemble-TI, and Ensemble-PM to combine the prediction intervals obtained by nine methods, namely RF-OOB, SC-RF, QRF, NGB, GRU-KDE, RF-KDE, Ridge-KDE, GBRT-Mean-KDE, and GBRT-Med-KDE to obtain the ensemble prediction intervals. Then, we compare the results from the ensemble methods with those of the previously implemented models with best performance, such as RF-OOB, SC-RF, NGB, and GRU-KDE, and calculate the five indexes, namely PICP, PINAW, Winkler score, CWC, and MPICD. The results are reflected in
Figure 11, in which the method with the best performance on the basis of four indexes except PICP is marked in black, and the corresponding confidence levels are indicated by dotted lines. Moreover, we have listed the index comparison results under 90% confidence level in
Table 5, in which the method with the best performance based on the four indexes except PICP is marked in bold (see
Table A4,
Table A5 and
Table A6 for the results at other confidence levels). In addition, we have also compared the data with an interval of 5 min according to the same method. The corresponding results are illustrated in
Figure 12 and
Table 6 (see
Table A7,
Table A8 and
Table A9 for the results at other confidence levels).
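The two simplest ensemble schemes named above, Ensemble-Mean and Ensemble-Med, combine the member intervals by taking the mean (or median) of the nine methods' lower and upper bounds at each time step. A minimal sketch, with illustrative stand-in bounds rather than the actual member intervals:

```python
# Ensemble-Mean and Ensemble-Med: pointwise mean/median of the member
# methods' interval bounds.
import numpy as np

rng = np.random.default_rng(5)
n_methods, n_points = 9, 50
lowers = rng.normal(loc=-1.0, scale=0.1, size=(n_methods, n_points))
uppers = rng.normal(loc=1.0, scale=0.1, size=(n_methods, n_points))

ens_mean_lower = lowers.mean(axis=0)
ens_mean_upper = uppers.mean(axis=0)
ens_med_lower = np.median(lowers, axis=0)
ens_med_upper = np.median(uppers, axis=0)
```

The other ensemble variants (e.g. envelope- or trimming-based schemes) replace the mean/median with different aggregation rules over the same stacked bounds.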
It can be observed from
Figure 11 and
Figure 12,
Table 5 and
Table 6 that the PICP of all ensemble methods reaches the confidence level at both the fifteen-minute and five-minute time intervals. Ensemble-TE is the optimal method at all confidence levels: its PINAW and CWC are significantly smaller than those of the other methods, and its Winkler score is higher. Moreover, in most cases the Ensemble-TE method also has the smallest MPICD; the exceptions are the 80% confidence level with a fifteen-minute interval, where Ensemble-Mean has the smallest MPICD, and the 95% confidence level with a five-minute interval, where RF-OOB does. In these two cases, however, only a slight difference exists between the results of the Ensemble-TE method and those of the optimal method. Therefore, Ensemble-TE is considered the best interval prediction method among all those compared.
Finally, the numerical results of five interval quality indicators at different confidence levels are described in
Table 7 and
Table 8, respectively.
In addition, the indexes calculated from the interval prediction results in this paper also outperform those reported in other studies. To be more specific, when using the A-GRU-KDE [
15] method on the same data set with fifteen-minute and five-minute intervals, the PINAW values of the prediction intervals obtained are 0.258 and 0.195, respectively, at a confidence level of 95%. In the same setting, however, the PINAW values of our method are 0.066 and 0.063, respectively, indicating that the relative average width of the interval is reduced by nearly three quarters with guaranteed coverage. Meanwhile, the Winkler scores of the previously proposed method are −2.39 and −1.88, respectively, at a confidence level of 95%. In contrast, the Winkler scores of our method are −0.608 and −0.591, an improvement of about 70%, which indicates a high-quality prediction interval. Similar conclusions can be obtained at the other confidence levels, including 90%, 85%, and 80%. Thus, the proposed method exhibits increased accuracy and a higher-quality prediction interval on the same DKASC data set.