The proposed methodology is summarized in
Figure 1, and essentially uses DOE, PCA and ANN. The methodological process of this work is based on research [
37]. The authors encourage the application of this model due to its versatility, since it reduces computational effort and tends to produce good results. To facilitate the reader's understanding, each step of this process is detailed below.
The application of the proposed methodology helps in the operation of active distribution networks and emerging transmission systems, since the operator is informed about the actual generation availability in the next time window. Thus, adjustments to the generation and system configuration are possible, enabling the utilities to provide a reliable service.
In addition to the use of DOE, this work introduces the use of principal component analysis to reduce the dimensionality of climate data, with minimal loss of information, for training the machine learning model.
3.1. Data Collection and Preparation
An essential step that precedes data analysis is the collection and preparation of the time series. These data are often difficult to obtain due to the data protection policies of local generation plants [
38], which can hinder advances in photovoltaic generation forecasting. The entire forecasting process can be compromised if this step is not carefully considered [
39]. This step covers correcting missing data, normalizing data, adjusting data resolution and grouping data [
40]. Real photovoltaic generation data were used from the PVOutput.org [
41] repository, with daily resolution, except for the data from the generation plants in the cities of Machado and Passos, which were acquired from the Federal Institute of the South of Minas Gerais (IFSULDEMINAS).
Seventeen generating units are considered in this study; each one has a different generation capacity, and they are geographically spread throughout the Brazilian territory. These units were chosen due to the availability and quality of their data over the time horizon of the study. Missing or null data were disregarded.
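As a purely illustrative sketch of this preparation step (the original pipeline was implemented in Matlab, so Python with pandas is used here only for exposition, and the file and column names are hypothetical), missing or null records can be discarded and the series aggregated to daily resolution as follows:

```python
import pandas as pd

# Hypothetical file and column names, used only to illustrate the preparation step.
raw = pd.read_csv("pv_generation_machado.csv", parse_dates=["timestamp"])

# Discard missing or null generation records, as done in this study.
raw = raw.dropna(subset=["generation_kwh"])
raw = raw[raw["generation_kwh"] > 0]

# Adjust the resolution: aggregate intraday measurements into daily totals.
daily = (raw.set_index("timestamp")["generation_kwh"]
            .resample("D").sum()
            .to_frame())

print(daily.head())
```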
The climatic data were obtained through the National Institute of Meteorology (INMET) [
42], considering the weather stations closest to the previously selected photovoltaic generation plants, covering sixteen parameters: instantaneous temperature (°C), maximum temperature (°C), minimum temperature (°C), instantaneous humidity (%), maximum humidity (%), minimum humidity (%), instantaneous precipitation (mm), maximum precipitation (mm), minimum precipitation (mm), instantaneous pressure (hPa), maximum pressure (hPa), minimum pressure (hPa), wind speed (m/s), wind direction (°), wind gust (m/s) and radiation (kJ/m²).
In order to facilitate the identification of each photovoltaic generation plant and its respective climatic data, the city closest to each measurement point was used as an identifier; these characteristics are listed in
Table 2:
The data series was divided into seasons, since each season presents different characteristics that can be relevant to a more accurate forecast. The increase or decrease in the efficiency of the panels can be influenced by environmental factors in the region, such as wind speed, humidity, dust and temperature, among others [43], justifying the segmentation by season. In Brazil, summers are hot and humid, with a predominance of rain in several regions, while winters are dry and cold. From
Figure 2, it is possible to identify these periods throughout the months of the year.
Therefore, when it comes to photovoltaic generation, the panels can gain or lose efficiency due to numerous uncontrollable factors, related to the season [
44], such as the accumulation of dust, predominance of clouds over the generation area, cooling of solar cells, etc. Separating the data into seasons aims to mitigate these effects, so that the forecast model is not affected by inconsistencies that could arise during the training process.
Since each photovoltaic generation plant has a different generation capacity, and the climatic data have different measurement units, two ways of normalizing the data were considered, placing them on a scale suitable for the optimization process performed by the machine learning algorithms.
The first, which uses the maximum and minimum values of the time series, rescales the data within the interval between 0 and 1, as observed in Equation (1), where $x$ is the observed value, $x_{min}$ the minimum value of the time series and $x_{max}$ the highest value [45]:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (1)$$
The other normalization technique, known as standardization or the Z-score method, uses the mean and standard deviation of the series itself, making the normalized value centered around the mean with unit standard deviation [46]. The standardization calculation is performed according to Equation (2), where $x$ is the observed value, $\mu$ is the mean and $\sigma$ the standard deviation:

$$z = \frac{x - \mu}{\sigma} \qquad (2)$$
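A minimal sketch of the two normalization schemes of Equations (1) and (2), assuming a one-dimensional NumPy array; the sample values are arbitrary:

```python
import numpy as np

def min_max_normalize(x):
    """Rescale the series to the [0, 1] interval (Equation (1))."""
    return (x - x.min()) / (x.max() - x.min())

def z_score_standardize(x):
    """Center the series on its mean with unit standard deviation (Equation (2))."""
    return (x - x.mean()) / x.std()

series = np.array([120.0, 95.0, 143.0, 87.0, 110.0])  # e.g., daily generation in kWh
print(min_max_normalize(series))
print(z_score_standardize(series))
```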
3.2. Hierarchical Cluster–Grouping of Similar Days
After dividing the data series into seasons, the hierarchical clustering technique was used to group days with similar characteristics. These grouping methods initially consider each data point (or object) as a group [
47]. Then, similar objects begin to coalesce to form groups.
Figure 3 schematically illustrates, in a simplified way, the separation of six data points into groups and shows a minimalist representation of the corresponding dendrogram. The distance between the groups that form is calculated by the linkage method; this work considered the following two: Complete and Ward.
The Complete linkage method, also known as furthest neighbor, calculates the maximum distance between an object in a given cluster and an object belonging to another cluster. In general, the formed groups tend to have similar diameters. The Complete method was chosen because it performs well in certain cases [
48] and is represented by Equation (3), where $d(r,s)$ is the distance between the clusters $r$ and $s$, and $x_{ri}$ symbolizes an object $i$ in the cluster $r$ [49]:

$$d(r,s) = \max\left\{ \operatorname{dist}(x_{ri}, x_{sj}) \right\}, \quad i = 1,\dots,n_r,\; j = 1,\dots,n_s \qquad (3)$$
Ward's linkage method minimizes the sum of squares within each cluster, and the distance between clusters is calculated from the sum of squared deviations of the points from the centroids. In this case, each group tends to have a similar number of objects. Ward's method was chosen for the experiments of this work because it demonstrates good separability between groups and consistency [
50]. Equation (4) calculates Ward's distance, where $n_r$ and $n_s$ represent the number of objects present in clusters $r$ and $s$, and $\bar{x}_r$ and $\bar{x}_s$ their respective centroids:

$$d(r,s) = \sqrt{\frac{2\, n_r n_s}{n_r + n_s}}\; \lVert \bar{x}_r - \bar{x}_s \rVert_2 \qquad (4)$$
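For illustration, the grouping of similar days could be reproduced with SciPy's hierarchical clustering as sketched below; the Euclidean metric, the random data and the cut into four clusters are assumptions made only for this example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each row is a (normalized) day described by its climatic/generation features.
rng = np.random.default_rng(0)
days = rng.random((90, 5))  # 90 days, 5 features (illustrative)

# Complete (furthest-neighbor) linkage, Equation (3).
Z_complete = linkage(days, method="complete", metric="euclidean")

# Ward linkage, Equation (4); uses the Euclidean metric on the observations.
Z_ward = linkage(days, method="ward")

# Cut each dendrogram into a fixed number of groups of similar days.
labels_complete = fcluster(Z_complete, t=4, criterion="maxclust")
labels_ward = fcluster(Z_ward, t=4, criterion="maxclust")
print(labels_complete[:10], labels_ward[:10])
```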
3.3. Principal Component Analysis (PCA) for Dimensionality Reduction
PCA is a multivariate tool widely used in the literature [
51]. It reduces the dimensionality of the dataset to an uncorrelated set, known as principal components, that can explain most of the original set, separating out information that is redundant or random. Most of the data variance tends to be concentrated in the first components (the first component having the maximum explanation compared to the other components [52], and so on), while noise tends to be concentrated in the last components; that is, the principal components are uncorrelated linear combinations [53] of the original variables, with weights given by the eigenvectors of the covariance matrix.
According to [54], PCA can be briefly described by considering $n$ observation vectors $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_n$ and the respective mean vector $\bar{\mathbf{y}}$ (where the origin of the ellipsoid axes will be). The change to the origin is described by $(\mathbf{y}_i - \bar{\mathbf{y}})$. Rotating the axes centered on the mean results in the principal components, which are uncorrelated. The rotation multiplies each $(\mathbf{y}_i - \bar{\mathbf{y}})$ by an orthogonal matrix $\mathbf{A}$, according to Equation (5):

$$\mathbf{z}_i = \mathbf{A}(\mathbf{y}_i - \bar{\mathbf{y}}) \qquad (5)$$

If $\mathbf{A}$ is orthogonal, then $\mathbf{A}^{\top}\mathbf{A} = \mathbf{I}$, and the distance to the origin remains the same, as observed in Equation (6):

$$\mathbf{z}_i^{\top}\mathbf{z}_i = (\mathbf{y}_i - \bar{\mathbf{y}})^{\top}\mathbf{A}^{\top}\mathbf{A}\,(\mathbf{y}_i - \bar{\mathbf{y}}) = (\mathbf{y}_i - \bar{\mathbf{y}})^{\top}(\mathbf{y}_i - \bar{\mathbf{y}}) \qquad (6)$$

The rotation therefore transforms $(\mathbf{y}_i - \bar{\mathbf{y}})$ to a point $\mathbf{z}_i$ that keeps the same distance from the origin. The calculation of the matrix $\mathbf{A}$ allows the discovery of the axes of the ellipsoid, making the components of $\mathbf{z}$ uncorrelated. In this way, the sample covariance matrix of $\mathbf{z}$, $\mathbf{S}_z$, is desired to be diagonal, as in Equation (7):

$$\mathbf{S}_z = \mathbf{A}\,\mathbf{S}\,\mathbf{A}^{\top} = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_p) \qquad (7)$$

where $\mathbf{S}$ is the covariance matrix of $\mathbf{y}$. Since $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$ are the eigenvalues of $\mathbf{S}$ and $\mathbf{C}$ is an orthogonal matrix whose columns are the normalized eigenvectors of $\mathbf{S}$, $\mathbf{A} = \mathbf{C}^{\top}$. The transpose of the matrix $\mathbf{C}$ is thus the orthogonal matrix that diagonalizes $\mathbf{S}$, as shown in Equation (8):

$$\mathbf{C}^{\top}\mathbf{S}\,\mathbf{C} = \mathbf{A}\,\mathbf{S}\,\mathbf{A}^{\top} = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_p) \qquad (8)$$

so that the $i$th column of $\mathbf{C}$ is the normalized $i$th eigenvector of $\mathbf{S}$. The principal components are represented by the variables $z_1, z_2, \dots, z_p$ in $\mathbf{z} = \mathbf{A}(\mathbf{y} - \bar{\mathbf{y}})$. The diagonal elements of $\mathbf{S}_z$ are the eigenvalues of $\mathbf{S}$, which makes the eigenvalues $\lambda_i$ of $\mathbf{S}$ the variances of the principal components $z_i$, as described in Equation (9):

$$s_{z_i}^{2} = \lambda_i, \quad i = 1, 2, \dots, p \qquad (9)$$

Since the eigenvalues are the variances of the principal components, the percentage of variance explained by the first $k$ components is given by Equation (10):

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \times 100\% \qquad (10)$$
Reducing the dimensionality of the meteorological data for training machine learning models avoids overfitting and allows the original data to be replaced by this new, reduced dataset, which retains most of the original information [
55].
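A minimal sketch of this reduction with scikit-learn is shown below; the standardization step, the random data and the 95% explained-variance threshold are illustrative assumptions, not the settings of the original experiments:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
climate = rng.random((365, 16))  # one year of the 16 climatic variables (illustrative)

# Standardize first (Equation (2)) so that no variable dominates due to its unit.
scaled = StandardScaler().fit_transform(climate)

# Keep the number of components needed to explain 95% of the variance (Equation (10)).
pca = PCA(n_components=0.95)
components = pca.fit_transform(scaled)

print(components.shape)                         # reduced dataset
print(pca.explained_variance_ratio_.cumsum())   # cumulative explained variance
```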
The application of the PCA method extends to problems in different areas and has contributed to interesting solutions. For example, recently, some authors [
56] have proposed a variation of the PCA combined with the modified affinity propagation clustering algorithm (called PCA-MAP) to classify tourist preference information. It is also worth mentioning the work by [
57] that explores day-ahead carbon price prediction using PCA combined with several machine-learning methods, providing dimensionality reduction from 37 variables to only 4.
This work applies dimensionality reduction in two specific cases, depending on whether or not the meteorological variables are considered in the methodological process.
Figure 4 exemplifies the structure of the collected data, with each row representing a measurement day and each column an observed variable. The first column consists of the photovoltaic generation data; the others (2 to 17) comprise the climatic variables. When the climatic variables are considered in the experimental run, PCA is applied to columns 2 to 17 of
Figure 4.
On the other hand, when the experimental process does not consider the climatic variables, but only the photovoltaic generation variables, a data restructuring is necessary. In this case, the data stacking process for model training is exemplified in
Figure 5. Here, the six generation days preceding the observed measurement day are chosen. These six days compose the training data for that observed day, as indicated by the “A” (green) and “B” (yellow) markings in Figure 5. As the sliding window moves through the generation data structure, it forms new training data for the measurements of the subsequent days. Finally, PCA is applied to this dataset (columns 2 to 7). The region highlighted in red is disregarded in this situation because it has many null cells, which represent noise for the prediction model.
Thus, when the climatic variables are considered, PCA reduces the dimensionality of the 16 climatic variables. When only the photovoltaic generation is considered, the six lagged generation columns form the variables submitted to dimensionality reduction.
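The restructuring for the generation-only case can be sketched as follows: each row pairs an observed day with the six previous generation days, and PCA is then applied to the six lagged columns. The six-day window follows the text; the data and the number of retained components are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def build_lagged_matrix(generation, window=6):
    """Stack, for each observed day, the 'window' previous generation days."""
    rows = []
    # Starting at t = window avoids the rows with many null cells
    # (the region highlighted in red in Figure 5).
    for t in range(window, len(generation)):
        target = generation[t]
        lags = generation[t - window:t]  # the six previous days
        rows.append(np.concatenate(([target], lags)))
    return np.asarray(rows)

rng = np.random.default_rng(0)
gen = rng.random(100)                      # illustrative daily generation series

data = build_lagged_matrix(gen, window=6)  # column 0: target day, columns 1-6: lags
reduced_lags = PCA(n_components=3).fit_transform(data[:, 1:])
print(data.shape, reduced_lags.shape)
```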
3.4. Artificial Neural Networks (ANN) Parametrization
There are numerous situations in which the use of artificial neural networks is satisfactory [
58], such as pattern recognition, classification, fault detection and PV generation forecasting [
Since the photovoltaic generation prediction problem is essentially non-linear, machine learning models attempt to capture these variations efficiently and reflect them in the output [
60], bearing in mind that no single model in the literature performs well in all cases. ANNs were chosen in this work because of their superior performance compared to other machine learning models [
61].
Essentially, in its minimal architecture, an ANN is made up of three layers [
62]. The first layer is the data input layer, which may contain one or more neurons. The second, known as the intermediate (or hidden) layer, need not be unique and has a number of neurons set by the analyst, independent of the number chosen for the first layer. Finally, there is the last, or output, layer, where the results are obtained after the training and testing process.
Neurons are present in all layers and define the network's architecture; they can be added to (or removed from) each layer according to how well (or poorly) the network fits the problem at hand. The anatomy of a neuron shows that it receives an input, computes the weights relative to that input, and returns the result via an activation function [
63]. The training process consists of transferring information from one layer to the next, optimizing the adjustment of the weights of the neurons until a stopping condition is reached. Equation (11) expresses, in a simplified way, the mathematical modeling of this computation:

$$z = b + \sum_{i=1}^{n} w_i x_i \qquad (11)$$

where $z$ is the network output, $b$ the bias value, $x_i$ the input information, $w_i$ the related weight and $n$ the total number of inputs.
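A minimal NumPy sketch of Equation (11) for a single neuron is given below; the sigmoid activation, weights and bias are arbitrary choices for illustration:

```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of the inputs plus bias (Equation (11))."""
    return b + np.dot(w, x)

def sigmoid(z):
    """Example activation function applied to the neuron's output."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.4, 0.7, 0.1])   # inputs (e.g., principal components of a day)
w = np.array([0.2, -0.5, 0.9])  # weights adjusted during training
b = 0.1                         # bias

z = neuron_output(x, w, b)
print(z, sigmoid(z))
```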
The definition of parameters that optimize the functioning of the ANN is not immediate, and often there is no consensus regarding certain choices, such as the number of layers and the number of neurons in each layer [
64]. Some authors consider the choice of parameters by trial-and-error [
65] rather than in a systematic way. The ANN parameters considered in this work were based on the research in [
66] and [
67], and are detailed in the next section, which presents DOE as a statistical tool for reducing the parametric search space.
3.5. Factorial Design of Experiments (DOE)
The literature records extensive use of DOE, for example for the parametric calibration of prediction models [
68], for choosing the training set [
69] and for parameter optimization in manufacturing simulations [
70]. DOE, through its set of statistical tools, allows cause-and-effect relationships to be systematically identified, which can lead to a solution that optimizes the process. In general, this involves choosing the factors and levels, the response variables, the structure of the experimental design and the execution itself [
14]. The logic of choice is intrinsically linked to the type of study.
Full or fractional factorial designs, usually with two levels, are well accepted by the industry [
71]. Full factorial designs consider all possible combinations, which generates a search space of dimension 2^k, where k is the number of factors. As the number of factors (or of their respective levels) increases, a full factorial design leads to an extremely high number of experimental runs, which can generate high costs and long execution times [
72]. Thus, this study considers a two-level fractional factorial design due to the natural limitations of a simulated experiment, which are the scarcity of computational resources and time.
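To illustrate how a fractional design shrinks the number of runs, the sketch below builds a 2^3 base design and generates a fourth factor as D = ABC, producing a 2^(4-1) half fraction of resolution IV; this toy example is much smaller than the 11-factor design used in this work:

```python
from itertools import product

# Full 2^3 factorial for factors A, B, C (levels coded as -1 and +1).
base = list(product([-1, 1], repeat=3))
print(len(base))  # 8 runs

# 2^(4-1) half fraction: the fourth factor D is generated as D = A*B*C,
# so 4 factors are studied in only 8 runs instead of the 16 of a full 2^4 design.
# The defining relation I = ABCD makes this a resolution IV design.
half_fraction = [(a, b, c, a * b * c) for (a, b, c) in base]
for run in half_fraction:
    print(run)
```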
Figure 6 shows a schematic representation that clarifies the potential of DOE, which allows the analyst to restrict the parametric search space to the factors that potentially lead to the solution of the problem. Scanning the entire search space implies a high computational and time cost. Thus, based on references (from the literature, for example), the analyst manages to reduce this search space to a specific set of parameters, which naturally does not guarantee the optimal solution, but gives an idea of this adjustment and of how the factors interact with each other.
The large volume of photovoltaic generation and climate data, together with the number of machine learning model parameters that can be combined, challenges the processing power of current computers, which is limited [
40]. Research involving computer simulation usually involves large amounts of data and/or many parameters. In order to mitigate the computational cost of the experiments in this work, DOE was used to reduce the parametric search space, as it is an effective tool for this purpose [
73].
The quality of reducing (or increasing) the depth of this search using DOE is measured in terms of confounding and is summarized in the experiment’s resolution. When there is a shortage of resources to carry out the experiments, in addition to choosing the levels of factors, the DOE allows the reduction of experimental runs, maintaining the statistical reliability [
72] of these runs. As shown in
Figure 7, this work considered a resolution IV design, since at this level the main effects are estimated without confounding with two-factor interactions.
This research considered 11 factors in the experimental architecture, separated into five factors related to the time series and six factors associated with the artificial neural network. The versatility of the experimental design, aligned with the nature of the photovoltaic generation prediction problem, allowed the choice of factors and their respective levels to be based on previous works [
67], and one of these works also considers this object of study [
37]. Since there are numerous possible combinations of factors and each factor can take numerous levels, DOE reduces this search space; in this way, the analyst can change factors or levels and understand the impact of these changes on the quality of the results.
Table 3 summarizes each factor considered.
3.6. Mixture Design of Experiments (MDOE) for Defining the Ensemble Weights
Combining forecasts seeks to achieve better performance than the forecasters achieve when considered individually [
74]. The literature reports an empirical benefit of this combination in improving the forecast results [
75]. Thus, this work uses Mixture DOE to combine the prediction results. Specifically, a mixture experiment seeks the optimal proportion of each ingredient; in the prediction problem, these proportions are identified by the weights $w_i$, and the factors (the individual forecasts) represent the ingredients in this analogy.
Here, the combined value (which is taken as the response) depends only on the weights (the proportions of the ingredients) and not on the factors themselves. According to [
76], the weights $w_i$ are non-negative, expressed as fractions of the mixture, and the sum over all $q$ factors (ingredients) must be unity, as described in Equation (12):

$$\sum_{i=1}^{q} w_i = 1, \qquad 0 \le w_i \le 1 \qquad (12)$$
Considering an example with three factors, or ingredients, there is a graphic representation of this arrangement as a triangle, as seen in
Figure 8. The vertices are considered pure mixtures, because at these points the values of the weights of the other factors are null [
14]. As the number of factors increases, the geometric representation also changes; for example, with four factors the representation is a tetrahedron. Five or more factors are feasible, but visual representation is no longer possible.
The metrics for evaluating and defining the weights are based on the mean absolute percentage error (MAPE), which has already been used in recent forecasting works [
77], and on the root mean squared error (RMSE), which penalizes errors of greater magnitude, for comparison purposes. The error calculation is obtained as shown in Equation (13), for MAPE, and in Equation (14), for RMSE:

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{t=1}^{N} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \qquad (13)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( y_t - \hat{y}_t \right)^2} \qquad (14)$$

where $y_t$ is the actual measured value of the photovoltaic generation, $\hat{y}_t$ is the predicted value and $N$ corresponds to the number of predicted points.
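The sketch below computes MAPE and RMSE (Equations (13) and (14)) and scans candidate weights that satisfy the mixture constraint of Equation (12) for two hypothetical forecasters; the grid step and the data are assumptions, and in this work the weights are actually defined through the mixture design analysis:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, Equation (13)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def rmse(y_true, y_pred):
    """Root mean squared error, Equation (14)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Illustrative measured generation and two individual forecasts.
y = np.array([10.0, 12.0, 9.5, 11.0])
f1 = np.array([9.0, 12.5, 10.0, 10.5])
f2 = np.array([11.0, 11.0, 9.0, 11.5])

# Candidate mixture weights (w1 + w2 = 1, Equation (12)), scanned on a coarse grid.
best = min(
    ((w, mape(y, w * f1 + (1 - w) * f2)) for w in np.linspace(0, 1, 21)),
    key=lambda t: t[1],
)
print(f"best w1 = {best[0]:.2f}, MAPE = {best[1]:.2f}%")
print(f"RMSE at best weights: {rmse(y, best[0] * f1 + (1 - best[0]) * f2):.3f}")
```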
Figure 9 details the pseudocode that automates the prediction process described in the previous topics. The algorithm was implemented in Matlab.