1. Introduction
As one of the most promising new energy utilization technologies, photovoltaic (PV) power generation has attracted attention in most countries of the world [1]. PV systems come in two forms: centralized and distributed. Compared with centralized systems, distributed PV (DPV) systems are built close to loads, so their output can be consumed locally, which helps to overcome the common mismatch between the actual distribution of PV resources and application demand. Therefore, in recent years, the Chinese DPV industry has developed vigorously along with whole-county DPV promotion.
However, with the increasing proportion of DPV installations in county-level regions, power grid operation is greatly affected. Because of system construction and operation costs, traditional DPV operation and maintenance is coarse-grained, resulting in low system benefits. Centralized operation and maintenance management of the DPV systems in a county can improve the overall efficiency of the systems. To meet the needs of power grid dispatching and of the centralized operation and maintenance of DPV stations, the output of the regional DPV stations group needs to be predicted. Therefore, we treat the DPV systems in a county as a whole to study regional DPV output prediction.
Regional PV prediction models can be divided into different types. Liu et al. [2] reviewed the research on regional PV prediction methods based on multiple time scales. Kim and Kim [3] divided the models into two categories: Type 1 covers public utility-scale systems [4,5,6,7,8,9,10,11,12,13], and Type 2 covers behind-the-meter systems [14,15,16,17]. Pierro et al. [9] proposed an interesting classification method based on prediction strategies:
(1) Bottom-up strategy. In this strategy, the output of each PV system in the region is predicted first, and the regional output is then obtained by summing these predicted values.
(2) Upscaling strategy. This strategy can be further divided into the model output average strategy and the model input average strategy. The model output average strategy predicts the output of a selected subset of regional PV systems, which is taken as representative of all the systems. The predicted subset power output is then rescaled based on the subset capacity and the total capacity to predict the regional power output. In the model input average strategy, the PV output in the predicted area is taken as the output of a virtual PV system, and the regional PV power output is predicted directly.
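The rescaling step of the model output average strategy can be illustrated with a minimal Python sketch. It assumes the simplest possible rule, scaling the predicted subset output by the ratio of total capacity to subset capacity; the function name, the capacities, and the predicted values are hypothetical and are not taken from the cited studies.

```python
def upscale_regional_output(subset_prediction_kw, subset_capacity_kw, total_capacity_kw):
    """Rescale a predicted subset output to the whole region by capacity ratio.

    Assumption: a plain linear capacity ratio; the cited studies may use
    more elaborate mapping functions.
    """
    if subset_capacity_kw <= 0:
        raise ValueError("subset capacity must be positive")
    scale = total_capacity_kw / subset_capacity_kw
    return [p * scale for p in subset_prediction_kw]


# Hypothetical numbers: reference systems total 200 kW, the region totals 1000 kW.
regional = upscale_regional_output([50.0, 120.0, 80.0], 200.0, 1000.0)
print(regional)  # [250.0, 600.0, 400.0]
```

The sketch makes the main weakness of this strategy visible: any bias in the reference subset is multiplied by the same capacity ratio.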
For the model output average strategy, Shaker et al. [18] proposed a data-driven method to estimate the power output of invisible PV systems based on the measured values of a small number of representative systems. The representative sites were selected using the proposed data dimension reduction model based on K-means clustering and principal component analysis, and regional PV power generation was then obtained according to the mapping function. In addition to power generation estimation, Shaker et al. [17] also proposed a probabilistic prediction model based on a Fuzzy Arithmetic Wavelet Neural Network (FAWNN) to predict the power generation of a large number of small PV systems. Bright et al. [19] evaluated satellite-only and upscaling-only PV output estimation methods and concluded that combining the two methods is more beneficial. Saint-Drenan et al. [8] analyzed the performance of the upscaling strategy using measured power data from a set of 366 PV systems. They found that the error decreases with an increasing number of reference systems and a decreasing number of un-metered systems, and that the average distance between a reference system and the unknown system has a great influence on the performance of a set of reference systems.
For the model input average strategy, Fonseca et al. [5] proposed a method based on principal component analysis, support vector regression, and weather prediction data. One-day-ahead regional PV power outputs of the four main regions of Japan in 2009 were predicted with hourly power output data of 453 PV systems. Aillaud et al. [20] proposed a model that combines a convolutional neural network (CNN) with a long short-term memory architecture. The day-ahead regional PV power outputs of Germany were predicted, and the main result of this study shows that the proposed model is more accurate than the Random Forest model. Moschella et al. [21] attempted to directly predict wind and solar power generation in each Italian region based on the model input average strategy. Based on the same strategy, Pierro et al. [22] conducted a more detailed study on solar power generation prediction for six regions in Italy by comparing six different prediction models. Yu et al. [23] presented a probabilistic prediction method based on a CNN and non-linear quantile regression (QR). The model was used to predict the regional PV power output of PV systems in the Weifang region of China, and the results show that the improved CNN can effectively process high-dimensional and complex input data and that the non-linear QR model can provide quantile prediction information for the regional PV power output.
The upscaling prediction strategy is improved in some interesting studies, such as those of Pierro et al. [9], Wolff et al. [24], Saint-Drenan et al. [12], and Fu [10]. They first clustered the PV systems in the region and then applied the upscaling prediction strategy to each clustered subset to predict the PV power output of the whole region.
There are also some studies in which different strategies were compared. Fonseca et al. [6] conducted a comparative study of four models, each of which assumed a different scenario regarding the data available to make the prediction. When regional PV power data are completely available, the strategy of direct prediction and then accumulation is adopted. When regional PV power data are partially available, a prediction model based on stratified sampling is proposed. When the regional aggregate PV power is available, the model input average strategy is adopted. When power data cannot be obtained, the strategy of indirect prediction and then accumulation is adopted. The comparison showed that, in a region with a variety of weather conditions, the prediction methods based on single systems' predictions and the one based on stratified sampling provided the best results. Zamo et al. [7] predicted the regional PV power generation in two counties based on the bottom-up strategy and the model input average strategy. Using a reference system to directly predict the regional aggregate PV power gives an RMSE of about 6% in either county, and the RMSE can be reduced to about 5.8% by using the bottom-up strategy. Pierro et al. [9] first clustered the PV systems in a region and then compared two strategies: (1) averaging the prediction results of each cluster to obtain the regional PV power (the model output average strategy), and (2) using the input variables of each cluster center to directly predict the regional PV power (the model input average strategy). The results show that the accuracy of the latter is slightly better.
Saint-Drenan et al. [25] proposed a new strategy that can be used as an alternative to the upscaling strategy for the scenario where no or few power measurements are available. The strategy uses an average PV model to calculate the power output of the most frequent module orientation angles. The calculated power values are finally weighted according to their probability of occurrence to estimate the real power output. The basic condition of this strategy is that the physical model information of regional PV systems is available. However, in practice, it is usually difficult to meet this condition.
For regional PV output prediction, the bottom-up strategy needs to predict the output of every system in the predicted area, which requires establishing a prediction model for each system and performing a large amount of data processing and calculation. When some PV systems in the region lack historical data, data-driven models cannot be applied to them, and a prediction method based on a physical model must be adopted instead. In reality, however, it is difficult to obtain the physical models of all PV systems in a region. Therefore, the bottom-up strategy is difficult to apply in practice [10]. To reduce the amount of calculation through simplified methods, research on regional PV output prediction focuses mainly on the upscaling strategy [9].
A review of the previous research reveals two problems with county-level regional DPV output prediction. The first concerns the data resources available for prediction. Most of the DPV systems in county-level regions are small rooftop PV systems. Because of construction costs, these systems generally have no output prediction device of their own, so predicted output data from single systems are lacking. For the same reason, these systems have no meteorological data acquisition device, so locally measured meteorological data for output prediction are also lacking. Although easily available weather prediction data can be used to predict the regional power output, the inherent weather prediction errors affect the output prediction accuracy. The lack of available data resources and the weather prediction errors make it difficult to directly use the previously proposed models to predict the county-level regional DPV output. The second concerns the prediction method. Most previously proposed deep neural network (DNN) architectures have been applied successfully to images, text, and audio but are not well suited to tabular data [26]. Therefore, there are few studies on regional PV output prediction methods based on DNNs. As with other data-driven models, when a new DNN for tabular data, such as TabNet, is applied to predict the county-level regional DPV output, the optimal training sample collection period (TSCP) changes dynamically, and this hyper-parameter is difficult to select. In practice, either a fixed value is used or all historical samples are taken as training samples, which reduces the accuracy of the predicted results.
In this paper, the weather prediction information is used to predict the county-level regional DPV output based on the model input average strategy. To eliminate the effect of a non-optimal TSCP on the prediction accuracy, an ensemble prediction method based on the minimum redundancy maximum relevance (mRMR) criterion and the TabNet model is proposed. To reduce the influence of weather prediction errors on the power output prediction, a modified model based on error prediction is proposed. The proposed ensemble prediction method is used to predict the day-ahead output, and a combination prediction model based on the proposed ensemble prediction method and the modified model is established to predict the hour-ahead output. Finally, the performance of the proposed models is verified by error analysis.
3. Proposed Method
3.1. Data Experiment Scheme
As shown in Figure 4, the steps of the data experiment are:
Step 1: Collect the measured output data (sampling period: 15 min). Collect the weather prediction data (sampling period: 1 h), and then obtain the weather prediction data of the same period with the measured output data by linear interpolation. Preprocess and combine the output data and weather data to establish the experimental sample set.
Step 2: Take a test sample.
Step 3: For the test sample extracted in the previous step, fixed TSCPs are randomly generated. Based on different TSCPs, training sample sets are established by extracting qualified training samples from the experimental sample set.
Step 4: Based on the training sample sets established in the previous step, a TabNet model is trained on each set to generate the prediction models.
Step 5: Taking the prediction models generated in the previous step as base predictors, an ensemble prediction model based on mRMR is established.
Step 6: Based on the test sample taken in Step 2 and the ensemble prediction model established in the previous step, the day-ahead and hour-ahead outputs are predicted, respectively.
Step 7: Based on the hour-ahead output predicted in the previous step and the proposed modified model, the final hour-ahead output is obtained.
Repeat Steps 2 to 7 once for each test sample (that is, as many times as the test sample size) to obtain the day-ahead and hour-ahead output prediction series, respectively. Steps 8 and 9 are prediction error analyses.
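The normalized mean absolute error (nMAE), calculated by Equation (13), and the normalized root mean square error (nRMSE), calculated by Equation (14), are used to present the prediction errors in this paper:
$$\mathrm{nMAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{P}_i - P_i\right| \quad (13)$$
$$\mathrm{nRMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{P}_i - P_i\right)^{2}} \quad (14)$$
where $N$ is the test sample size, $\hat{P}_i$ is the normalized predicted regional DPV output, and $P_i$ is the normalized measured output.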
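The two error measures can be sketched in Python as follows. This is a minimal illustration that assumes the usual definitions (the mean absolute difference and the root of the mean squared difference between normalized predicted and measured outputs); the function names and the sample values are hypothetical.

```python
import math


def nmae(predicted, measured):
    """Mean absolute error between normalized predicted and measured outputs."""
    n = len(measured)
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / n


def nrmse(predicted, measured):
    """Root mean square error between normalized predicted and measured outputs."""
    n = len(measured)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)


# Hypothetical normalized series (values in [0, 1]).
pred = [0.0, 0.5, 1.0, 0.5]
meas = [0.0, 0.25, 0.75, 0.5]
print(nmae(pred, meas))   # 0.125
print(nrmse(pred, meas))  # about 0.177
```

Because the nRMSE squares each deviation, it is never smaller than the nMAE on the same series and is more sensitive to a few large errors.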
3.2. Proposed Ensemble Prediction Model
In predicting the regional DPV output based on the model input average strategy, the training sample set affects the prediction performance of the trained TabNet model, and the training sample set depends on the TSCP. For the output prediction of a specific period, there is a relatively optimal training sample set, which corresponds to a specific TSCP and the historical samples in that TSCP. As the systems run, newly generated samples update the historical sample set of the previously optimal TSCP, so the training sample set in this TSCP is no longer optimal. Therefore, the TSCP corresponding to the optimal training sample set is dynamic, and it is difficult to select the optimal value of this hyper-parameter. The TSCP selected by traditional methods is often not optimal, which affects the prediction accuracy. To solve this problem, we propose an ensemble prediction model based on the mRMR criterion and the TabNet model. The specific steps of the model are as follows:
Step 1: Randomly generate fixed TSCPs and establish training sample sets for the day before the predicted day. Then, based on the established training sample sets and TabNet model, the regional DPV output series is predicted.
Step 2: Calculate the mutual information (MI) between each pair of regional DPV output prediction series obtained in the previous step, and calculate the MI between each predicted output series and the measured output series.
Step 3: Initialize the selected set as empty and the candidate set as all output prediction series from Step 1. Select the candidate series with the largest MI with the measured output series, add it to the selected set, and remove it from the candidate set.
Step 4: Select from the candidate set the series that meets the conditions expressed in Equation (12), add it to the selected set, and remove it from the candidate set.
Step 5: Repeat Step 4 until the required number of output prediction series has been selected, according to the mRMR criterion, from the output prediction series in Step 1; these series constitute the selected set. The fixed TSCPs corresponding to the output prediction series in the selected set are extracted to constitute a TSCP set. Then calculate the MI between each selected predicted output series and the measured output series to constitute an MI set.
Step 6: Calculate the weights by Equation (15) and construct a weight vector.
Step 7: Predict the regional DPV output series of the predicted day based on the fixed TSCPs in the TSCP set and the TabNet model. Then, construct an output prediction matrix from these series.
Step 8: Calculate the county-level regional DPV output prediction series of the predicted day by Equation (16), that is, as the weighted average of the series in the output prediction matrix using the weight vector.
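The selection and weighting steps above can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: MI is estimated with a simple histogram (hypothetical helper `mutual_information`), the selection rule is the common mRMR difference form (relevance minus mean redundancy) standing in for Equation (12), and the weights are taken proportional to MI as a stand-in for Equation (15). The base predictions, which would come from TabNet models trained on different TSCPs, are replaced by hypothetical lists.

```python
import math
from collections import Counter


def mutual_information(x, y, bins=4):
    """Histogram estimate of MI (in nats) for two series with values in [0, 1]."""
    def to_bin(v):
        return min(int(v * bins), bins - 1)
    bx, by = [to_bin(v) for v in x], [to_bin(v) for v in y]
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def mrmr_select(candidates, target, k):
    """Greedy mRMR: most relevant series first, then relevance minus mean redundancy."""
    relevance = [mutual_information(c, target) for c in candidates]
    remaining = list(range(len(candidates)))
    selected = [max(remaining, key=lambda i: relevance[i])]
    remaining.remove(selected[0])
    while remaining and len(selected) < k:
        def score(i):
            redundancy = sum(mutual_information(candidates[i], candidates[j])
                             for j in selected) / len(selected)
            return relevance[i] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected, [relevance[i] for i in selected]


def mi_weights(relevances):
    """Assumed weighting rule: weights proportional to MI, summing to one."""
    total = sum(relevances)
    if total == 0:
        return [1.0 / len(relevances)] * len(relevances)
    return [r / total for r in relevances]


def weighted_average(predictions, weights):
    """Combine base predictions for the predicted day, time step by time step."""
    return [sum(w * p[t] for w, p in zip(weights, predictions))
            for t in range(len(predictions[0]))]


# Hypothetical day-before data: measured series and three base predictions.
measured_prev = [0.1, 0.4, 0.6, 0.9, 0.2, 0.7, 0.3, 0.8]
preds_prev = [
    [0.5] * 8,                                  # uninformative base predictor
    [0.1, 0.4, 0.6, 0.9, 0.2, 0.7, 0.3, 0.8],   # matches the measured series
    [0.2, 0.3, 0.7, 0.8, 0.6, 0.1, 0.9, 0.4],
]
selected, relevances = mrmr_select(preds_prev, measured_prev, k=2)
weights = mi_weights(relevances)

# Hypothetical predictions of the same base predictors for the predicted day.
preds_next = [[0.4, 0.6], [0.3, 0.7], [0.5, 0.5]]
ensemble = weighted_average([preds_next[i] for i in selected], weights)
print(selected, weights, ensemble)
```

Selecting on the day before the predicted day means the weights reflect how well each TSCP performed most recently, so no single TSCP has to be chosen in advance.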
3.3. Proposed Modified Model
The weather prediction information is taken as the input to predict the power output of the regional DPV stations group, so weather prediction accuracy has a great influence on the prediction accuracy of the power output. However, weather prediction errors are similar in adjacent time periods. In this paper, a modified model based on error prediction is established by mining this similarity: the unknown prediction errors are predicted from the known prediction errors, which reduces the influence of weather prediction errors on the output prediction accuracy.
To describe the proposed modified model, the concepts of potential test sample (PTS), non-potential test sample (NPTS), and closest similar sample (CSS) are defined. A test sample is defined as a PTS if there are historical samples with the same weather type on the same day; otherwise, it is defined as an NPTS. The power output prediction error in a PTS period can potentially be reduced by the proposed modified model. The closest historical sample on the same day and with the same weather type as a PTS is defined as the CSS of that PTS. As shown in Figure 5, the steps of the proposed modified model are:
Step 1: Take a test sample and determine whether it is a PTS. If so, proceed to the next step; if not, take the next test sample.
Step 2: Extract the historical PTSs with the same weather type as the PTS taken in the previous step, together with the CSSs corresponding to these historical PTSs.
Step 3: Calculate the prediction errors in the periods of the PTSs extracted in Step 2 by Equation (17), and transform the errors by Equation (18). Equation (17) defines the prediction error in terms of the normalized predicted output and the normalized measured output, and Equation (18) gives the transformed error.
Step 4: Calculate the differences of extraterrestrial solar radiation in the periods of the PTSs extracted in Step 2 by Equation (19), and transform the differences by Equation (20). Equation (19) defines the difference, for each of these periods, in terms of the normalized extraterrestrial solar radiation in the period of the PTS and the normalized extraterrestrial solar radiation in the period of the corresponding CSS, and Equation (20) gives the transformed difference of extraterrestrial solar radiation.
Step 5: Establish a training sample set from the transformed errors and transformed radiation differences obtained above; each training sample also includes the transformed error in the period of the CSS corresponding to the training sample.
Step 6: Train a TabNet model based on the training sample set established in the previous step, and then predict the error of the test sample.
Step 7: Transform the predicted error according to Equation (21) and modify the predicted power output according to Equation (22). Equation (21) gives the transformed predicted error, and Equation (22) gives the modified regional DPV output from the output of the proposed ensemble prediction model and the transformed predicted error.
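The PTS, NPTS, and CSS definitions used in these steps can be illustrated with a short Python sketch. It assumes that "historical" means earlier in time than the test sample and that each sample carries a timestamp and a weather type; the dictionary keys, the function name `find_css`, and the sample values are hypothetical.

```python
from datetime import datetime


def find_css(test_sample, history):
    """Return the closest earlier sample on the same day with the same weather type.

    Returns None when no such sample exists, i.e. the test sample is an NPTS.
    """
    candidates = [
        s for s in history
        if s["time"].date() == test_sample["time"].date()
        and s["time"] < test_sample["time"]
        and s["weather"] == test_sample["weather"]
    ]
    if not candidates:
        return None
    # The closest historical sample is the latest one before the test sample.
    return max(candidates, key=lambda s: s["time"])


# Hypothetical 15-min samples.
history = [
    {"time": datetime(2021, 5, 1, 9, 0), "weather": "sunny"},
    {"time": datetime(2021, 5, 1, 9, 15), "weather": "cloudy"},
    {"time": datetime(2021, 5, 1, 9, 30), "weather": "sunny"},
    {"time": datetime(2021, 4, 30, 12, 0), "weather": "rainy"},
]
pts = {"time": datetime(2021, 5, 1, 10, 0), "weather": "sunny"}
npts = {"time": datetime(2021, 5, 1, 10, 0), "weather": "rainy"}

css = find_css(pts, history)
print(css["time"])              # 2021-05-01 09:30:00
print(find_css(npts, history))  # None
```

Only test samples for which `find_css` returns a sample can be corrected; for the others, the ensemble prediction is kept unchanged.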
3.4. Experimental Data and Data Preprocessing
The raw data used in the experiment include measured regional DPV output data and weather prediction data from 7 January 2021 to 30 September 2021. The measured output data (sampling period: 15 min) were collected from 27 DPV systems in Xiaoshan District, Hangzhou City, Zhejiang Province, China. The original weather prediction data (sampling period: 1 h) were obtained from the Xinzhi weather prediction platform and then converted to a 15 min sampling period by linear interpolation. The experimental sample set is established by combining the multisource data sets according to the time attribute of the samples. The attributes of the established experimental sample set include extraterrestrial solar radiation, weather type, air temperature, air index, wind speed, and measured regional DPV output.
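The conversion of hourly weather prediction data to a 15 min sampling period can be sketched as follows. This minimal Python example assumes a numeric feature such as air temperature (a categorical attribute such as weather type would need a different treatment, for example holding the hourly value); the function name and the values are hypothetical.

```python
def interpolate_hourly_to_15min(hourly_values):
    """Linearly interpolate hourly values to a 15-min grid (4 steps per hour)."""
    if not hourly_values:
        return []
    result = []
    for left, right in zip(hourly_values, hourly_values[1:]):
        for step in range(4):
            result.append(left + (right - left) * step / 4)
    result.append(hourly_values[-1])  # keep the final hourly point
    return result


# Hypothetical hourly air temperatures in degrees Celsius.
temps = [10.0, 14.0, 12.0]
print(interpolate_hourly_to_15min(temps))
# [10.0, 11.0, 12.0, 13.0, 14.0, 13.5, 13.0, 12.5, 12.0]
```

After interpolation, each weather record can be matched to the measured output record with the same timestamp to form one experimental sample.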
After establishing the experimental sample set, data preprocessing is carried out. To improve the convergence speed and accuracy of DNN models, Max-Min normalization is usually carried out on the input and output features as shown in Equation (23):
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (23)$$
where $x'$ is a normalized feature value, $x$ is the original feature value, $x_{\max}$ is the maximum feature value, and $x_{\min}$ is the minimum feature value.
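Max-Min normalization of a single feature can be sketched in Python as below. The function name and the sample values are hypothetical, and the handling of a constant feature (mapping every value to zero) is an assumption added to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale a feature to [0, 1] using its minimum and maximum values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical wind speeds in m/s.
print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

In a prediction setting, the minimum and maximum would normally be taken from the training samples and reused for the test samples so that both are scaled consistently.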
5. Conclusions
This paper presents a new prediction method for the output of the county-level regional DPV stations group, which aims to improve the centralized operation and maintenance of the stations and to meet the needs of power grid dispatching. The weather prediction information is used to predict the output based on the model input average strategy. Considering the effect of a non-optimal TSCP on the prediction accuracy, an ensemble prediction method based on the mRMR criterion and the TabNet model is used to predict the day-ahead output. First, multiple fixed TSCPs are randomly generated, and the output series of the day before the predicted period are predicted based on the fixed TSCPs and the TabNet model. The weight vector of the output prediction series of the previous day is calculated according to the mRMR algorithm. Then, based on the fixed TSCPs and the TabNet model, the output vector of the predicted period is predicted. Finally, the output prediction value of the predicted period is obtained by the weighted average method. The nMAEs and nRMSEs of the prediction results based on the fixed TSCPs are in the interval (8.85%, 16.09%) and the interval (12.06%, 23.74%), respectively. The nMAE and nRMSE of the prediction results based on the proposed ensemble prediction model are 8.4% and 11.11%, respectively, both below the lower bounds of these intervals. Therefore, the effect of a non-optimal TSCP on the prediction accuracy can be eliminated by the proposed ensemble prediction model.
Taking into account the influence of weather prediction errors on the power output prediction, a modified model based on error prediction is proposed. First, the functional relationship between the prediction errors of the same weather type on the same day is learned. Then, based on this functional relationship, the prediction error of the predicted period is predicted. Finally, the output prediction result of the proposed ensemble prediction model is modified by the predicted error. The nMAE and nRMSE of the hour-ahead output prediction results obtained by this combination prediction model are 6.9% and 9.49%, respectively, both lower than those of the proposed ensemble prediction model. Thus, the influence of weather prediction errors on the power output prediction is reduced by the proposed modified model.
According to the overall error analysis, compared with the reference day-ahead prediction model, the proposed ensemble prediction model reduces the nMAE by 2.86% and the nRMSE by 5.51%; compared with the reference hour-ahead prediction model, the proposed combination prediction model reduces the nMAE by 3.05% and the nRMSE by 3.05%. In the daily error analysis, the proposed ensemble prediction model reduces the mean daily nMAE by 2.9% and the mean daily nRMSE by 4.2% relative to the reference day-ahead prediction model, and the proposed combination prediction model reduces the mean daily nMAE by 3.11% and the mean daily nRMSE by 3.08% relative to the reference hour-ahead prediction model. In the monthly error analysis, the proposed ensemble prediction model reduces the mean monthly nMAE by 2.91% and the mean monthly nRMSE by 5.3% relative to the reference day-ahead prediction model, and the proposed combination prediction model reduces the mean monthly nMAE by 3.02% and the mean monthly nRMSE by 3.01% relative to the reference hour-ahead prediction model. Therefore, the proposed day-ahead and hour-ahead prediction models are more accurate and stable than the corresponding reference models and show robust performance with monthly variations.