Reduction of the Risk of Inaccurate Prediction of Electricity Generation from PV Farms Using Machine Learning

Krechowicz, Maria; Krechowicz, Adam; Lichołai, Lech; Pawelec, Artur; Piotrowski, Jerzy Zbigniew; Stępień, Anna

doi:10.3390/en15114006


by Maria Krechowicz ¹,*, Adam Krechowicz ², Lech Lichołai ³, Artur Pawelec ¹, Jerzy Zbigniew Piotrowski ⁴ and Anna Stępień ⁵

¹ Faculty of Management and Computer Modelling, Kielce University of Technology, al. 1000-lecia P.P. 7, 25-314 Kielce, Poland
² Faculty of Electrical Engineering, Automatic Control and Computer Science, Kielce University of Technology, al. 1000-lecia P.P. 7, 25-314 Kielce, Poland
³ Faculty of Civil Engineering, Environmental Engineering and Architecture, Rzeszow University of Technology, ul. Poznańska 2, 35-959 Rzeszow, Poland
⁴ Faculty of Environmental, Geomatic and Energy Engineering, Kielce University of Technology, al. 1000-lecia P.P. 7, 25-314 Kielce, Poland
⁵ Faculty of Civil Engineering and Architecture, Kielce University of Technology, al. 1000-lecia P.P. 7, 25-314 Kielce, Poland
* Author to whom correspondence should be addressed.

Energies 2022, 15(11), 4006; https://doi.org/10.3390/en15114006

Submission received: 26 April 2022 / Revised: 19 May 2022 / Accepted: 25 May 2022 / Published: 30 May 2022

(This article belongs to the Special Issue New Advances in Building Physics and Renewable Energy)


Abstract: Problems with inaccurate prediction of electricity generation from photovoltaic (PV) farms cause severe operational, technical, and financial risks, which seriously affect both their owners and grid operators. Accurate predictions are required for optimal planning of the spinning reserve as well as for managing inertia and frequency response in the case of contingency events. In this work, the impact of a number of meteorological parameters on PV electricity generation in Poland was analyzed using the Pearson coefficient. Furthermore, seven machine learning models using Lasso Regression, K–Nearest Neighbours Regression, Support Vector Regression, AdaBoosted Regression Tree, Gradient Boosted Regression Tree, Random Forest Regression, and Artificial Neural Network were developed to predict electricity generation from a 0.7 MW solar PV power plant in Poland. The models were evaluated using the coefficient of determination (R^{2}), the mean absolute error (MAE), and the root mean square error (RMSE). It was found that horizontal global irradiation and water saturation deficit have a strong proportional relationship with the electricity generation from PV systems. All proposed machine learning models performed well in predicting electricity generation from the analyzed PV farm. Random Forest Regression was the most reliable and accurate model, as it achieved the highest R^{2} (0.94) and the lowest MAE (15.12 kWh) and RMSE (34.59 kWh).

Keywords: photovoltaic systems; PV farm; machine learning; risk reduction

1. Introduction

The European Green Deal aims to reduce greenhouse gas emissions by at least 55% by 2030, with the goal of making Europe the first climate-neutral continent in the world by 2050 [1]. Production of energy from renewable sources is considered one of the most efficient tools to counteract global warming. Electricity production from photovoltaic (PV) plants is in line with this goal, as it enables a reduction of the carbon footprint compared to traditional fossil fuel-based generators. Furthermore, the increasing global demand for energy, together with the threat of a global energy crisis [2], is driving an increase in the production of energy from renewable sources. Another visible trend is to reduce the primary energy consumption of buildings without negatively affecting their indoor microclimates, which can be achieved, for example, by using renewable energy sources, modern systems such as facade ventilation equipped with heat recovery exchangers [3], and effective insulation (e.g., reflective insulation) [4]. PV energy systems are considered an important renewable energy source [5], as Earth’s surface receives approximately 1.5 × 10^{18} kWh/year of solar energy [6]. The ecological approach to renewable energy is also related to fundamental research on devices and components of renewable energy systems, such as highly efficient heat exchangers (as presented, e.g., in [7,8]). The photovoltaic effect, which is the basis for cell operation, is the voltage generated in a semiconductor material as a result of its interaction with electromagnetic radiation [9]. Installed PV capacity is currently increasing not only in Poland, but also in many other locations in the world, and therefore the study of PV system performance under various external conditions is needed [10]. It is important to analyze the performance of PV systems with various module technologies under different climatic conditions [11].

Prediction of electricity generation from PV systems is a challenging task, as it is dependent on meteorological and climatic conditions in the analyzed location. It should be stressed that PV power is highly volatile, as it can vary from zero to a hundred percent, depending on the meteorological conditions and geographical characteristics. To ensure reliable and safe grid operation, operators require PV farms to provide energy production forecasts in advance [12,13]. It enables grid operators to optimally plan the spinning reserve as well as to manage inertia and frequency response in the case of contingency events [14]. Improper forecasting of electricity generation from PV power plants poses a risk not only to smooth operation, scheduling, and balancing of PV plants, but also to the security of the grid.

The planning and scheduling of PV power plant work is very difficult, as it is carried out under variable meteorological conditions, often leading to poor balancing of power generation with a load demand, causing penalties for the power producer [15], which vary by country or state due to their own patterns and policies [16]. In the case of an overproduction of energy from renewable sources in relation to the promised value, a phenomenon of negative pricing may occur in the electricity market. The typical time for negative pricing to occur is the middle of the day, when all renewable generators supply energy [6]. The solutions to this problem are accurate, appropriate, and flexible forecasts of PV energy production and a dynamic pricing scheme.

All in all, problems with inaccurate prediction of electricity production from PV farms cause severe operational, technical, and financial risks, which seriously affect both owners and grid operators. The technical risk related to an inaccurate forecast of electricity production from PV farms is that the national energy system cannot adequately plan the operation of coal- and gas-fired power plants for individual hours. The system may then be overloaded or underloaded, resulting in frequency fluctuations. The economic risk from the point of view of the power system is that if the system is unstable due to inaccurate forecasts, energy will be more expensive. Furthermore, from the point of view of a PV farm, if the farm makes an inaccurate forecast of energy production, it will get a lower price for the volume introduced into the power grid at the specified time. The operational risks associated with inaccurate planning of energy production from a PV farm include automatic shutdown of the installation when it tries to introduce too much power into the power grid. From the point of view of the national energy system, underestimating energy production from PV farms results in the need to quickly switch on large power units, which is associated with a long start-up time and high start-up costs.

Effective, holistic risk management focuses on risk identification, risk treatment, risk registers, risk monitoring and review, as well as risk financing [17,18]. It is vital to analyze the detection possibilities of a certain risk, because correctly applied detection actions are usually much cheaper and less troublesome than treating risk after the risk occurrence [19]. Moreover, it was demonstrated that effective risk management helps to achieve the desired results of various projects [20,21]. Therefore, in the case of treating the risk of inaccurate prediction of electricity generation from PV farms, it is beneficial to look for risk cause reduction, which can be achieved by developing accurate prediction models using machine learning.

There are various approaches to PV power output forecasting, inter alia, on a national level, on a single PV farm level, and concerning various time horizons (one day, one month, the whole year). It should be stressed that, from the point of view of an owner and a single PV farm, forecasting of electricity generation from PV systems is a desirable service addressed to a large group of recipients (owners operating PV farms). However, it is difficult to predict the energy production from PV farms in Poland, which is located in a warm temperate transitional climate characterized by large fluctuations in temperature and solar radiation. In the case of Poland, the relationship between solar radiation and PV panel output can be much less significant than in Colombia (tropical and isothermal climate) or Australia (continental climate), making the forecasting task more difficult. Furthermore, the proposed models should take into account the variability associated with seasonality.

The aim of this work is to develop and compare machine learning models for the prediction of electricity generation from a PV farm, taking into account local weather patterns. The contribution to the body of knowledge is as follows:

The study of the impact of meteorological parameters on electricity generation from a PV farm in Poland,
Development of machine learning models for electricity generation from a PV farm, taking into account local weather patterns,
A comparative study of the performance of various machine learning models.

The scientific novelty of this work is connected with developing a model that will respond to the needs of PV farm owners in Poland by providing an easy to apply model that allows effective prediction of PV panel output, enabling the reduction of the risk of inaccurate forecasts, which are currently a concern.

The paper is organized as follows. Section 2 is a literature review covering solar PV power output prediction and the principles of chosen machine learning techniques. Section 3 presents the proposed approach to risk reduction of inaccurate prediction of electricity generation from PV farms using machine learning. It covers the description of data gathering and preprocessing, the relationship between power production and meteorological parameters, the selection of the attributes to be considered when developing machine learning models, machine learning model development, and model evaluation. Section 4 presents experimental results and discussion. Section 5 summarizes the paper.

2. Literature Review

2.1. Solar PV Power Output Prediction

Solar PV generation forecasting can be carried out using numerical weather prediction (NWP), satellite/sky image-based forecasting and machine learning models [22]. Several-hour ahead solar irradiance forecasting was carried out using a traditional NWP technique. In [23], 10 min ahead solar irradiance forecasting was successfully carried out using a machine learning model with processed sky images combined with cloud motion information. In [24,25,26], various models aiming to predict solar PV power output based on historical data were presented. In [5], PV power estimation models were divided into models based on past values and atmospheric models. Models based on past values cover persistence models (based only on historical power production records), statistical approaches (regression and auto–regression), machine learning models, and hybrid models (combining physical and statistical models). Atmospheric models use the forecasted variables obtained by numerical prediction programs of meteorological institutes, which may be enhanced with models based on past values.

Table 1 presents an overview of recent work on machine learning applications for PV power output forecasting.

In [27], it was stressed that the performance of the model depends on the geographical region. In [28], the possibilities of using solar energy under Polish climatic conditions were analyzed. In [29], PV generation in Poland was simulated at the national level using an artificial neural network. Data for one month concerning hourly irradiation, wind speed, and temperature were collected from 23 meteorological stations. The proposed model allowed successful simulation of PV generation on a national level with a low mean absolute percentage error. In [30], a real PV solar power production was compared with a SODA hl (SOlar radiation DAta) forecasting model for the city of Falenica, Poland. It was found that the forecasting of annual energy production simulation was associated with a minor absolute error, but monthly and daily forecasts were associated with a significant error.

Table 1. An overview of recent work on machine learning applications for PV panel output forecasting.

Literature	Year	Model	Comments
[31]	2022	3D-geographic information system combined with deep learning integrated approach, to predict dynamic rooftop solar irradiance	The solution is helpful in facilitating solar energy applications considering shading in urban areas
[14]	2021	PV power generation forecasting using various models: LR, PR, DTR, SVR, RFR, and MPR	RFR model outperforms other models
[32]	2021	Pattern identification of PV energy generation applying several deep neural architectures, such as LSTM, GRU, CNN, and Autoencoder	The proposed solution outperforms the state–of–the–art energy disaggregation approaches
[33]	2021	A deep learning approach to PV energy generation forecasting using LSTM neural network, adaptive neuro–fuzzy inference system (ANFIS) accompanied by fuzzy c-means and ANFIS with grid partition	LSTM model outperforms other analyzed models
[34]	2021	Solar radiation prediction using weather data and feed-forward neural networks, SVR, KNNR	Models related to combining heterogeneous models using neural meta-models have shown superior performance
[35]	2021	Solar irradiance and PV power output prediction based on cloud coverage using a classical regression model, deep learning methods, and boosting methods	Sky–facing cameras combined with machine learning models can be used to predict PV power output
[36]	2021	Solar energy forecasting guided by Pearson correlation with the following machine learning techniques: LR, RF, SVR, and ANN	RF and ANN are the most accurate for real–time solar energy prediction, while ANN is the most accurate for short–term forecasting
[37]	2020	Solar irradiance prediction using Genetic Algorithm/Particle Swarm Optimization and CNN	Superior performance of the model is a basis for a precise estimation of solar power
[38]	2020	Solar radiation and PV power forecasting using uncertainty bias and Kalman filter	The proposed model outperforms traditional autoregressive integrated moving average model
[22]	2020	Prophet model for one-day-ahead forecasting of PV panel short circuit current	The proposed model demonstrates a high accuracy and coefficient of determination
[39]	2019	Solar irradiance forecasting using past weather data and LR, DTR and SVM	SVM outperforms other analyzed models
[40]	2019	PV power forecasting using a Seasonal Auto-Regressive Integrated Moving Average (SARIMA) combined with ANN	Hybrid model enables reduction of forecast error by 10% in comparison with individual models used separately
[41]	2019	Forecasting soiled solar PV panel output using linear regression and neural network	Both linear regression and neural network models have high accuracy
[42]	2019	Wind and solar power prediction using a modified SVR	A proposed modified SVR outperforms other analyzed models
[43]	2018	Hourly day-ahead solar irradiance forecasting using LSTM	LSTM outperforms other analyzed models
[44]	2017	PV power output forecasting using SVR model	A minor improvement in prediction was observed in comparison with analytical method

Abbreviations: LR: linear regression; PR: polynomial regression; DTR: decision tree regression; SVR: support vector regression; RFR: random forest regression; LSTM: long short-term memory; MPR: multilayer perceptron regression; GRU: gated recurrent unit; CNN: convolutional neural network; KNNR: K–nearest neighbour regression; RF: random forests; ANN: artificial neural network; SVM: support vector machine.

There is a research gap: the literature still lacks models for predicting electricity generation from PV systems in Poland that would meet the needs of solar farms regarding short-term hourly forecasting of energy production throughout the year. Despite a large amount of work on predicting energy production from PV systems, no work concerning the Świętokrzyskie Voivodeship, Poland, was found.

The power generated by a PV panel depends on many meteorological parameters, such as solar irradiance, temperature, relative humidity, and wind speed [5]. In [14], the dependence of PV power production in Australia on relative humidity, temperature, diffuse horizontal radiation, global horizontal radiation, and daily precipitation was studied. It was found that the last parameter is less significant. Solar irradiance depends on temperature, relative humidity, pressure, dew point, and wind speed and direction, among other factors [6]. As PV system operation is based on solar energy, PV power production is a function of the solar irradiance falling on the PV module area [29]. Global horizontal irradiation (GHI) is the sum of direct, diffuse, and reflected solar radiation. Diffuse horizontal radiation is the part that is dispersed by particles in the atmosphere. Reflected solar radiation is the part that is reflected by Earth’s surface. Both global and diffuse horizontal radiation are important for PV power generation [14]. The temperature of the PV panel has a significant impact on its power generation. It depends on irradiation, ambient air temperature, and wind speed. A temperature rise results in a slight increase of the short-circuit current and a considerable drop in the potential difference across the PV panel [45]. The increase of the PV panel temperature leads to a drop in the net efficiency of the PV system, which, according to [46], can be 0.45% in potential per kelvin of temperature rise. Relative humidity, representing the amount of water vapor in the air, also affects the PV power output, as the moisture in the air causes diffraction of the incoming sunlight, reducing the effective solar irradiation of the PV panel [47]. Wind speed also influences PV power generation due to its cooling effect [29]. Moreover, dust present on the surface of the PV panel causes the panel temperature to rise, which also affects the PV panel output [48].

2.2. Principles of Chosen Machine Learning Techniques

2.2.1. Lasso Regression

Linear regression algorithms are often applied to carry out predictive analysis of continuous datasets. In this method, a set of variables that significantly influence the estimated outcome are determined [14]. Formula (1) shows a simple regression equation with one dependent and one independent variable.

y = c + b x    (1)

where c is a constant, b is the regression coefficient, x is the independent variable, and y is the estimated dependent (target) variable. There are many variants of linear regression models: simple linear regression, multiple linear regression, ridge regression, lasso regression, and logistic regression. The lasso (least absolute shrinkage and selection operator) was originally proposed by Tibshirani in 1996 for linear regression models [49]. In this model, the residual sum of squares is minimized subject to the sum of the absolute values of the coefficients being less than a constant. Because of the nature of this constraint, the method tends to produce some coefficients that are exactly 0, which gives interpretable models. The method combines the advantages of subset selection (interpretability of the model) and ridge regression (stability) [49]. Due to these advantages, lasso regression has become a popular model [50].
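As a hedged illustration of the constrained minimization described above, the lasso estimate for a model with several predictors can be written as follows. The symbols n (number of samples), p (number of predictors), x_{ij}, b_{j}, and the bound t are notation assumed here as an extension of the c and b in Formula (1); they are not taken from the source.

```latex
(\hat{c}, \hat{b}) = \arg\min_{c,\, b} \sum_{i=1}^{n} \Big( y_i - c - \sum_{j=1}^{p} b_j x_{ij} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |b_j| \le t
```

A smaller bound t shrinks the coefficients more strongly, and for a sufficiently small t some of them become exactly zero, which is the source of the interpretability mentioned above.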

2.2.2. K–Nearest Neighbours Regression

The K–nearest neighbours algorithm can be used to solve both classification and regression tasks. For each record to be classified or forecasted, the K records with the most similar characteristics (predictor values) are found. In classification, the category that is dominant among these records is assigned to the new record; in regression, the prediction is the average of their target values. The results of the prediction depend on the scaling of the features, the value of K, and the way in which the similarity is measured. All predictors should be represented numerically. If K is too low, overfitting can occur, whereas higher K values smooth over the training data, which lowers the risk of overfitting. If K is too high, the data are oversmoothed, making it impossible to find local structures in the data [51].
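For the regression case, a minimal sketch is given below, assuming Euclidean distance and an unweighted mean over the K nearest records. The function name `knn_regress` and the toy data are hypothetical and only illustrate the idea; they are not the authors' implementation.

```python
import math


def knn_regress(train_X, train_y, query, k=3):
    """Predict a continuous target as the mean over the k nearest training rows."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training rows")
    # Euclidean distance from the query to every training row
    distances = [(math.dist(row, query), target)
                 for row, target in zip(train_X, train_y)]
    # Keep the k closest rows and average their targets
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical toy data: (irradiation, humidity) pre-scaled to [0, 1]; target in kWh
X = [(0.0, 0.9), (0.2, 0.8), (0.5, 0.5), (0.8, 0.3), (1.0, 0.2)]
y = [0.0, 50.0, 200.0, 400.0, 500.0]

prediction = knn_regress(X, y, query=(0.9, 0.25), k=3)
print(prediction)
```

Because the distance is computed on raw feature values, features on very different scales would dominate the neighbour search, which is why the scaling of the features matters.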

2.2.3. Support Vector Regression

The support vector machine (SVM) is a comprehensive machine learning model that enables both linear and non-linear classification, regression, and outlier identification. This model is often used to classify complex datasets of different sizes. The decision boundary of the SVM classifier separates the analyzed classes using the widest possible margin, thus staying as far as possible from the nearest samples. Samples at the edge of the margin are called support vectors [52].

Although SVM is the basis for support vector regression (SVR), there is a difference between them. SVR forecasts continuous target values, in contrast to SVM classification, which predicts discrete class labels. SVR also differs from simple regression, which aims to minimize the error between the forecasted and the real value, as SVR aims to keep the error within a certain threshold. A boundary region is defined using the predefined threshold. The points within this boundary are considered to improve the fit of the model. In this way, SVR tries to find the closest match between the real data points and the function representing them [14].
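The idea of keeping the error within a threshold is commonly expressed with the epsilon-insensitive loss. The sketch below assumes that formulation; the name `epsilon_insensitive_loss` and the value of `epsilon` are illustrative and are not taken from the source.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Mean epsilon-insensitive loss: errors inside the +/- epsilon tube cost nothing."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical values: the first prediction lies inside the tube, the others outside
actual = [10.0, 20.0, 30.0]
predicted = [10.05, 20.5, 28.0]

loss = epsilon_insensitive_loss(actual, predicted, epsilon=0.1)
print(loss)
```

Only deviations larger than the threshold contribute to the loss, so small errors do not influence the fitted function.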

2.2.4. AdaBoosted Regression Tree

The name “decision tree” comes from the specific structure produced by the partition algorithm, which resembles a flowchart or a tree. It begins with a root node, which contains all the cases and non-cases used during model training and branches into child nodes. A leaf node is reached through recursive data partitioning according to the impurity splitting criteria [53]. The AdaBoost algorithm was first presented in [54] and is considered an effective ensemble learning technique that is able to increase the performance of weak learners by altering the distribution of sample weights [53]. An AdaBoost regressor begins by fitting a regressor on the analyzed dataset. Subsequently, it fits additional copies of the regressor on the same dataset, with the weights of the instances adjusted according to the error of the current prediction. In this way, subsequent regressors concentrate more on difficult cases [55].
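The source does not spell out the weight update itself. As a rough sketch, the step below assumes an AdaBoost.R2-style update with a linear loss (an assumption made for simplicity); the function name `adaboost_r2_reweight` and the toy values are hypothetical.

```python
def adaboost_r2_reweight(weights, y_true, y_pred):
    """One AdaBoost.R2-style reweighting step with a linear loss (assumed variant)."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    max_err = max(abs_err)
    if max_err == 0:
        return list(weights)  # perfect fit: nothing to reweight
    # Linear loss scaled to [0, 1]
    loss = [e / max_err for e in abs_err]
    avg_loss = sum(w * l for w, l in zip(weights, loss))
    if avg_loss >= 0.5:
        raise ValueError("weak learner is too poor to continue boosting")
    beta = avg_loss / (1.0 - avg_loss)
    # Well-predicted samples (small loss) are scaled down the most
    updated = [w * beta ** (1.0 - l) for w, l in zip(weights, loss)]
    total = sum(updated)
    return [w / total for w in updated]


# Hypothetical example: the last sample is predicted badly
weights = [0.25, 0.25, 0.25, 0.25]
new_weights = adaboost_r2_reweight(weights,
                                   y_true=[1.0, 2.0, 3.0, 4.0],
                                   y_pred=[1.1, 2.0, 3.0, 5.0])
print(new_weights)
```

After the update, the poorly predicted sample carries the largest weight, so the next regressor fitted on the reweighted data concentrates on it.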

2.2.5. Gradient Boosted Regression Tree

A decision tree is commonly used for classification and regression tasks, as it offers many advantages, such as high effectiveness, simplicity, and interpretability [56]. The boosting strategy was first developed for classification problems, but has also been applied successfully to regression. The aim of boosting is to combine the outputs of many “weak learners” into a strong “committee” [57]. Gradient boosting considers additive models of the following form, which are learned in a forward stage-wise manner [58]:

F_{m}(x) = F_{m-1}(x) + h_{m}(x)    (2)

where h_{m}(x) are the weak learning functions.

In the case of applying gradient boosting strategy to regression trees, small regression trees of various sizes are the basis functions, and the whole gradient boosting regression tree model is the sum of them.
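The stage-wise form of Formula (2) can be sketched with one-split regression trees (stumps) fitted to the current residuals. This is a simplified illustration under a squared-error loss; the `learning_rate` shrinkage factor, the function names, and the toy data are assumptions not present in the source (Formula (2) corresponds to `learning_rate = 1`).

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (single split) on a one-dimensional feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Build F_m = F_{m-1} + learning_rate * h_m, with h_m fitted to the residuals."""
    base = sum(ys) / len(ys)  # F_0: constant model
    learners = []
    predictions = [base] * len(ys)
    for _ in range(n_stages):
        # For squared error, the negative gradient equals the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        h = fit_stump(xs, residuals)
        learners.append(h)
        predictions = [p + learning_rate * h(x) for p, x in zip(predictions, xs)]

    def model(x):
        return base + learning_rate * sum(h(x) for h in learners)

    return model


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data: hour of day vs. generated energy in kWh
hours = [6, 8, 10, 12, 14, 16, 18, 20]
energy = [5.0, 120.0, 340.0, 480.0, 410.0, 220.0, 40.0, 0.0]

one_stage = gradient_boost(hours, energy, n_stages=1)
twenty_stages = gradient_boost(hours, energy, n_stages=20)
print(mean_squared_error(one_stage, hours, energy),
      mean_squared_error(twenty_stages, hours, energy))
```

Each added stump corrects part of the error left by the previous stages, so the training error of the summed model decreases as stages are added.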

2.2.6. Random Forest Regression

Random forests are a set of parallel decision trees. Due to the fact that the classifier of random forests is based on two main random factors, it contains different decision trees. The data used to generate each tree is sampled with replacement from the training set. The best division of all features is selected from a random subset of them [59].

A random forest works by fitting a number of decision trees on different sub-samples of the dataset and averaging their outputs to increase the predictive accuracy and control overfitting. Typical individual decision trees suffer from high variance and a tendency to overfit. The randomness aims to lower the variance of the forest estimator [55]. Random forest regression is believed to do well on diverse tasks and can cope with non-linear relationships [14].

Random forests are often used to solve the following problems connected with using decision trees: sensitivity to the form of the training data (e.g., a change in the order of data may lead to different results) and burdening subsequent branches of decision trees with a larger uncertainty, because they are developed on ever smaller datasets [60]. The application of random forests enables combining the results from many trees that were developed on randomly selected subsets of the training data [61].

To perform random forest regression on a training dataset, first, a k number of data points is selected from the input data. It enables building a decision tree that is associated with these data points. Selections of k number of data points and generation of decision trees are repeated until N decision trees are generated.
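The procedure above can be sketched as follows. To keep the example short, each "tree" is a single-split stump, which is a simplification of full decision trees; the names `fit_stump` and `fit_forest`, the parameters, and the toy data are hypothetical.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Least-squares stump restricted to the given candidate features."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no split possible: fall back to the sample mean
        mean = sum(targets) / len(targets)
        return lambda row: mean
    _, f, threshold, left_mean, right_mean = best
    return lambda row: left_mean if row[f] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=25, k=None, n_features=1, seed=0):
    """Fit n_trees stumps, each on k points drawn with replacement."""
    rng = random.Random(seed)
    k = k or len(rows)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(k)]   # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], feats))
    # The forest prediction is the average over all trees
    return lambda row: sum(tree(row) for tree in trees) / len(trees)


# Hypothetical toy data: (irradiation in W/m^2, relative humidity in %) -> energy in kWh
rows = [(0.0, 95.0), (100.0, 85.0), (300.0, 70.0),
        (500.0, 55.0), (700.0, 45.0), (900.0, 30.0)]
targets = [0.0, 40.0, 150.0, 280.0, 400.0, 520.0]

forest = fit_forest(rows, targets, n_trees=25, seed=0)
low = forest(rows[0])
high = forest(rows[-1])
print(low, high)
```

Averaging many trees trained on different bootstrap samples and feature subsets is what reduces the variance compared to a single tree.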

2.2.7. Artificial Neural Network

The biological information processing that occurs in the human brain was the inspiration for the development of neural networks. An artificial neural network is a type of mathematical model that is trained to produce and optimize the definition of a function (or distribution) that describes a set of input (training) characteristics. The network is trained by modifying the weight parameters of its nodes, which is possible once a performance measure is passed to the network training function. The training function generates the weight adjustments, which make it possible to minimize the error. The network consists of a set of neurons (nodes), each of which has weights and an activation function that process the input data. These weights must be updated in the learning process. Essential elements of the network are the connections, which define which node transmits data to which node. Layered node structures can be created in which data must flow in a specific direction. Neurons in the network can be connected in different ways to form different topologies, which has a big impact on the learning capabilities of the network. One of the most popular network topologies in supervised learning is the multilayer perceptron, a layered, feed-forward network [59].
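The forward flow of data through a multilayer perceptron can be sketched in a few lines. The ReLU activation, the layer sizes, and the weight values below are illustrative assumptions; training (the weight updates) is not shown.

```python
def relu(values):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs plus a bias per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(inputs, layers):
    """Pass the inputs through the layers in order; the output layer stays linear."""
    activation = inputs
    for index, (weights, biases) in enumerate(layers):
        activation = dense(activation, weights, biases)
        if index < len(layers) - 1:
            activation = relu(activation)
    return activation


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer weights and biases
    ([[2.0, 4.0]], [1.0]),                    # output layer weights and bias
]

output = mlp_forward([1.0, 2.0], layers)
print(output)
```

In a regression setting the single output neuron is typically left without a non-linear activation so that it can produce any real value.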

3. Proposed Approach

Figure 1 presents the proposed approach for predicting the electricity generation from a PV power plant. The proposed approach consists of six steps: data gathering, data preprocessing, analysis of the relationship between PV power output and meteorological parameters, selection of the attributes to be considered in the models, developing machine learning models, and model evaluation and comparison.

3.1. Data Gathering and Preprocessing

Data were collected for a solar PV plant in Zajączków, Świętokrzyskie Voivodeship, Poland. A total of 2540 Canadian Solar CS6K-275P polycrystalline photovoltaic modules with a nominal power of 275 Wp and an efficiency of 16.80% were installed on the photovoltaic farm. Each module measured 1650 × 992 × 40 mm and had a fill factor of 76.65%. The modules were mounted on steel tables anchored in the ground, tilted at an angle of 30 degrees and facing south (azimuth S). The installation was equipped with 7 SUNGROW PV inverters, type SG80 KTL, with a nominal power of 81,600 W. The installed power of the PV farm was 0.7 MW.
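The stated figures can be cross-checked with simple arithmetic. The check below assumes standard test conditions with an irradiance of 1000 W/m² when relating module power to efficiency; this reference irradiance is an assumption and is not stated in the source.

```latex
P_{\text{installed}} = 2540 \times 275\ \text{Wp} = 698{,}500\ \text{Wp} \approx 0.7\ \text{MW}

\eta = \frac{275\ \text{W}}{1000\ \text{W/m}^2 \times 1.650\ \text{m} \times 0.992\ \text{m}}
     = \frac{275}{1636.8} \approx 0.168
```

Both results agree with the reported installed power of 0.7 MW and module efficiency of 16.80%.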

Hourly data on a number of meteorological parameters were gathered from the nearest meteorological station in Kielce, Poland, which is operated by the Institute of Meteorology and Water Management. These parameters included: ambient air temperature [°C], sunshine duration [h], wind gust [m/s], rain in 6 h [mm], height of freshly fallen snow [m], occurrence of dew [0/1], height of the base of the upper clouds [m], height of the base of the lower clouds [m], general cloud cover [oktas], operator visibility [m], automatic visibility [m], visibility [m], water vapor pressure [hPa], and water saturation deficit [hPa]. Moreover, hourly data on various types of solar irradiation, cloud opacity, azimuth, dew point, snow depth, pressure, wind direction, wind velocity, daily albedo, and zenith were taken from the PVSYST database [62]. All in all, the collected data comprised 17,232 samples at one-hour intervals. The data showing the PV power plant energy output measured every hour for 2 years were obtained from the PV solar plant in Zajączków. Energy was measured using the metering system of the distribution system operator, in this case PGE Dystrybucja S.A., with a certified E650 Series 3 energy meter, type ZMD405CT, by Landis + Gyr. All parameters were measured every hour for a 2-year period from 1 January 2019 to 21 December 2021. Inconsistent data due to technical breaks or failures in the operation of the PV power plant were filtered out. Normalization and transformation of the data were not applied before the forecasting task, as experiments with them did not give better results.

The negative impact of deposited dust, scale, and dirt on the output of the analyzed PV panels, which could disrupt model learning, was minimized. Standard operating activities such as systematic visual inspection, seasonal mowing of grass, removal of spot dirt, and defect detection using a drone with a thermal imaging camera were performed on the analyzed PV farm. The PV panels on this farm do not accumulate excessive dust that would require special washing. Rain at this latitude systematically removes dust from modules tilted at an angle of about 30 degrees.

Figure 2 presents the electricity generation for the analyzed PV farm for 2020 (Figure 2a) and 2021 (Figure 2b).

3.2. Relationship between Power Production and Meteorological Parameters

Based on meteorological data from the Institute of Meteorology and Water Management and the PVSYST database [62], as well as data on electricity generation from the PV power plant in Zajączków in Poland, the Pearson correlation coefficient r was calculated.
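For reference, the Pearson correlation coefficient between one meteorological parameter and the electricity generation can be computed as in the sketch below. The function name `pearson_r` and the toy values are hypothetical and do not come from the analyzed dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical hourly values: irradiation [W/m^2], relative humidity [%], energy [kWh]
irradiation = [0.0, 150.0, 420.0, 610.0, 380.0, 90.0]
humidity = [92.0, 85.0, 60.0, 48.0, 63.0, 88.0]
energy = [0.0, 70.0, 260.0, 400.0, 230.0, 35.0]

print(pearson_r(irradiation, energy))
print(pearson_r(humidity, energy))
```

Values of r close to 1 indicate a strong proportional relationship, values close to −1 a strong inversely proportional relationship, and values near 0 no linear relationship.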

The results of the correlation analysis are presented in Table 2. It was found that horizontal global irradiation and water saturation deficit have a strong proportional relationship with electricity generation from the PV system (r ≥ 0.79). The analysis of the Pearson coefficient also confirmed the influence of the moisture present in the air, revealing that relative humidity has a strong inversely proportional relationship with electricity generation (r ≤ −0.76). Although the strong influence of horizontal global irradiation on electricity generation from the PV system is not surprising (in other climates it was even much higher, reaching 0.9112 in Colombia [5]), the observed strong relation between the amount of moisture present in the air and electricity generation from the photovoltaic system in the Polish climate is worth emphasizing. Moreover, similar to what was observed in Australia in [14], no dominant relation between rainfall and electricity generation from the PV system in Poland was found. A medium proportional relationship was found between PV system electricity generation and diffuse horizontal irradiance (DHI), direct (beam) horizontal irradiance (EBH), direct normal irradiance (DNI), ambient air temperature, and operator visibility. It was also observed that there is a medium inversely proportional relationship between the electricity generation from the PV system and zenith. Moreover, a minor proportional relationship was found between electricity generation from the PV system and both wind speed and the height of the base of the upper clouds. In addition, the analysis showed that cloud opacity has a minor inversely proportional relationship with electricity generation from the PV system.

3.3. Selection of the Attributes to Be Considered When Developing Machine Learning Models

When selecting the data to be taken into account in the development of machine learning models, the following premises were considered:

The results of the correlation analysis between various meteorological parameters and electricity generation from the PV power plant (presented in Section 3.2),
The possibility of obtaining data from available weather forecasts (for real forecasting tasks),
The results of the literature analysis on the parameters that were considered in other machine learning models in different countries (presented in Section 2.1).

After conducting many experiments using different combinations of input meteorological parameters and various machine learning models, it was decided that the input meteorological variables into the models should include: global horizontal irradiation, relative humidity, ambient air temperature, and cloud opacity. Moreover, it was found to be beneficial to use the electricity generation from the PV power plant from the previous hour as an input parameter, as it reflects the earlier performance of the system in the face of various parameters that affect it. The electricity generation from the PV power plant was an output parameter.
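The construction of the five model inputs, including the generation from the previous hour, can be sketched as follows. This is only an illustration: the record field names (`ghi`, `relative_humidity`, `air_temperature`, `cloud_opacity`, `generation_kwh`) are hypothetical, and the sketch assumes that the records are sorted chronologically with exactly one hour between consecutive entries.

```python
def build_samples(records):
    """Turn chronologically ordered hourly records into (features, targets).

    Each feature row holds the four meteorological inputs for the current
    hour plus the generation measured in the previous hour. The first record
    has no predecessor, so it only serves as the lag for the second one.
    """
    features, targets = [], []
    for previous, current in zip(records, records[1:]):
        features.append([
            current["ghi"],
            current["relative_humidity"],
            current["air_temperature"],
            current["cloud_opacity"],
            previous["generation_kwh"],  # lagged output used as an input
        ])
        targets.append(current["generation_kwh"])
    return features, targets


# Made-up records with hypothetical field names, for illustration only.
records = [
    {"ghi": 100.0, "relative_humidity": 80.0, "air_temperature": 10.0,
     "cloud_opacity": 40.0, "generation_kwh": 50.0},
    {"ghi": 300.0, "relative_humidity": 65.0, "air_temperature": 14.0,
     "cloud_opacity": 20.0, "generation_kwh": 180.0},
    {"ghi": 450.0, "relative_humidity": 55.0, "air_temperature": 17.0,
     "cloud_opacity": 10.0, "generation_kwh": 290.0},
]
X, y = build_samples(records)
```

With gaps in the time series, rows whose predecessor is not exactly one hour earlier would have to be dropped; that check is omitted here for brevity.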

3.4. Machine Learning Models Development

The best parameters for each model were experimentally selected:

In the case of the lasso regression (LassoR) model, the L1 regularization with the alpha coefficient = 1.0 was applied, the maximum number of iterations was set to 1000, and the tolerance for optimization was 1 × 10$^{-4}$;
In the case of the K–nearest neighbours regression (KNNR) model, K = 3 neighbours and the Euclidean distance measure were used;
In the case of the support vector regression (SVR) model, radial basis function was applied as the kernel type;
In the case of the AdaBoosted regression tree (AdaBoosted RT) model, the maximum number of estimators at which boosting is terminated was set to 5, learning rate was set to 0.9, and the exponential loss function was applied for each boosting iteration after updating the weights;
In the case of the gradient boosted regression tree (GBRT) model, 250 estimators were used, the applied learning rate was set to 0.1, and the maximum depth of the individual regression estimators was set to 8;
In the case of the random forest regression (RFR) model, the number of trees in the forest was 100, the minimum number of samples required to split an internal node was 2, and the minimum number of samples required at a leaf node was 1;
In the case of the artificial neural network (ANN) model, a multilayer perceptron (MLP) regressor was applied; the network had 5 inputs and one output, two hidden layers containing 80 and 50 neurons were used, 500 iterations were performed during training, the learning rate was set to 0.0001, and L2 penalty regularization was applied with an alpha coefficient of 0.0001.

The other parameters were left at their default values. The experiments were carried out on a Lenovo Legion laptop with an Intel Core i7 processor, 16 GB RAM, and a GeForce GTX 1650 GPU. All the experiments were carried out in Python using the scikit-learn library.
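The settings listed above map naturally onto scikit-learn estimators. The sketch below shows one possible configuration and is not the authors' code. It assumes that the stated ANN learning rate corresponds to `learning_rate_init` and that the 500 training iterations correspond to `max_iter`. The helper name `make_models` is hypothetical, and random seeds and any feature scaling are left out because they are not specified in the text.

```python
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR


def make_models():
    """Return the seven regressors with the hyperparameters stated in the text.

    All parameters not mentioned in the text keep their scikit-learn defaults.
    """
    return {
        "LassoR": Lasso(alpha=1.0, max_iter=1000, tol=1e-4),
        "KNNR": KNeighborsRegressor(n_neighbors=3, metric="euclidean"),
        "SVR": SVR(kernel="rbf"),
        "AdaBoosted RT": AdaBoostRegressor(
            n_estimators=5, learning_rate=0.9, loss="exponential"
        ),
        "GBRT": GradientBoostingRegressor(
            n_estimators=250, learning_rate=0.1, max_depth=8
        ),
        "RFR": RandomForestRegressor(
            n_estimators=100, min_samples_split=2, min_samples_leaf=1
        ),
        # Assumption: "learning rate" -> learning_rate_init, "500 iterations" -> max_iter.
        "ANN": MLPRegressor(
            hidden_layer_sizes=(80, 50),
            max_iter=500,
            learning_rate_init=0.0001,
            alpha=0.0001,
        ),
    }


models = make_models()
```

Each estimator exposes the same `fit(X, y)` and `predict(X)` interface, so all seven can be trained and evaluated in one loop over `models.items()`.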

3.5. Model Evaluation

Eighty percent of randomly selected data were used to train the models, and the remaining 20% were used to test them. Additionally, in order to determine the stability of the developed models with respect to the selection of the training and test sets, 5-fold cross-validation was performed. For this purpose, the training set was additionally divided into five subsets. Five experiments were carried out, in each of which one of the five subsets was used as the validation set and the remaining subsets formed the training set. The average values of the metrics obtained from the individual experiments and their standard deviations were analyzed.
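The evaluation protocol described above can be sketched in plain Python as follows. The function names, the fixed seed, and the use of the sample standard deviation are assumptions made for this illustration; the text does not state which variant of the standard deviation was used.

```python
import random
import statistics


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Randomly split sample indices into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, n_folds=5):
    """Yield (training, validation) index lists, one pair per fold."""
    # Every n_folds-th index goes to the same fold, so folds are near-equal in size.
    folds = [train_indices[i::n_folds] for i in range(n_folds)]
    for held_out, validation in enumerate(folds):
        training = [
            idx
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for idx in fold
        ]
        yield training, validation


def summarize(scores):
    """Mean and sample standard deviation of per-fold metric values."""
    return statistics.mean(scores), statistics.stdev(scores)


train_idx, test_idx = train_test_split_indices(100)
splits = list(k_fold_indices(train_idx))
```

A low standard deviation across the five folds indicates that a model's performance does not depend strongly on which samples happen to fall into the training and validation sets.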

The proposed models were evaluated using the most popular metrics, namely the determination coefficient ($R^{2}$), mean absolute error ($MAE$), root mean square error ($RMSE$), and the standard deviations of these metrics obtained during 5-fold cross-validation. The determination coefficient is a measure of the dependency between the forecasted and real electricity generation from the PV power plant, providing information about the correlation between these two datasets. The closer $R^{2}$ is to 1, the better [14]. The value of $R^{2}$ is defined by Formulas (3) and (4).

$$R^{2} = 1 - \frac{\operatorname{var}(E_{real} - E_{forecast})}{\operatorname{var}(E_{forecast})}$$

(3)

$$R^{2} = 1 - \frac{n \sum_{i=1}^{n} E_{real,i} E_{forecast,i} - \left(\sum_{i=1}^{n} E_{real,i}\right) \left(\sum_{i=1}^{n} E_{forecast,i}\right)}{\sqrt{n \sum_{i=1}^{n} E_{real,i}^{2} - \left(\sum_{i=1}^{n} E_{real,i}\right)^{2}} \sqrt{n \sum_{i=1}^{n} E_{forecast,i}^{2} - \left(\sum_{i=1}^{n} E_{forecast,i}\right)^{2}}}$$

(4)

where $E_{real}$ is the real electricity generation from the PV power plant (kWh), and $E_{forecast}$ is the forecasted electricity generation from the PV power plant (kWh).

$MAE$ is the average magnitude of the prediction error, with every error weighted equally; it measures the difference between the forecasted and real electricity generation from the PV power plant. $MAE$ is defined by Formula (5).

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| E_{forecast,i} - E_{real,i} \right|$$

(5)

$RMSE$ is most sensitive to the largest errors in the forecasted dataset, because the errors are squared before averaging [63]; it is defined by Formula (6).

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(E_{forecast,i} - E_{real,i}\right)^{2}}$$

(6)
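The three metrics can be computed with a few lines of Python, as in the sketch below. $MAE$ and $RMSE$ follow Formulas (5) and (6) directly. For $R^{2}$ the sketch uses the conventional definition 1 − SS_res/SS_tot (the one implemented, for example, by scikit-learn's `r2_score`), which is an assumption and not a transcription of Formulas (3) and (4). The sample values are made up.

```python
import math


def mae(real, forecast):
    """Mean absolute error, Formula (5)."""
    return sum(abs(f - r) for r, f in zip(real, forecast)) / len(real)


def rmse(real, forecast):
    """Root mean square error, Formula (6)."""
    return math.sqrt(sum((f - r) ** 2 for r, f in zip(real, forecast)) / len(real))


def r2(real, forecast):
    """Determination coefficient, conventional form 1 - SS_res / SS_tot (assumption)."""
    mean_real = sum(real) / len(real)
    ss_res = sum((r - f) ** 2 for r, f in zip(real, forecast))
    ss_tot = sum((r - mean_real) ** 2 for r in real)
    return 1.0 - ss_res / ss_tot


# Made-up hourly generation values in kWh, for illustration only.
real = [0.0, 100.0, 200.0, 300.0]
forecast = [10.0, 90.0, 220.0, 280.0]

example_mae = mae(real, forecast)    # 15.0 kWh
example_rmse = rmse(real, forecast)  # sqrt(250), about 15.81 kWh
example_r2 = r2(real, forecast)      # 0.98
```

In this example the two errors of 20 kWh raise $RMSE$ above $MAE$, which illustrates the greater weight that $RMSE$ gives to large deviations.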

4. Experimental Results and Discussion

In the case of RFR, GBRT, AdaBoosted RT and LassoR, the PV panel generation from the previous hour was the most important variable. In the case of RFR, GBRT, and AdaBoosted RT, global horizontal irradiation was in second place, whereas in the case of LassoR, ambient air temperature was in second place.

Table 3 shows a comparison of the performance of the proposed models in terms of $R^{2}$, $MAE$, and $RMSE$.

All proposed models performed well in predicting electricity generation from the analyzed photovoltaic farm, as high determination coefficients (≥ 0.91) and fairly low mean absolute errors were obtained. Random forest regression was the best model, as it had the highest determination coefficient (0.94), a low mean absolute error (15.12 kWh), and a low root mean square error (34.59 kWh). Those values are fully satisfactory for the owner of the analyzed solar farm. The performance of the artificial neural network model was very close to it, with a determination coefficient of 0.938, a low mean absolute error (16.331 kWh), and a low root mean square error (35.25 kWh).

Due to the random nature of the selection of data for cross-validation, the cross-validation results cannot show how the models perform when forecasting energy production for consecutive hours within one day or for consecutive months. That is why it was decided to carry out additional experiments showing the models' performance when forecasting electricity generation for consecutive hours within one day. To test this case, 20% of the days in each month were selected for testing, and the remaining 80% were used for training the models.
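A day-wise split of this kind keeps all hours of a test day together, so that a complete daily profile can be forecast and plotted. The sketch below shows one way to build it. The text does not say how the test days were chosen within each month, so the random choice, the seed, and the function name `split_days_by_month` are assumptions.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def split_days_by_month(days, test_fraction=0.2, seed=0):
    """Hold out roughly `test_fraction` of the days of every calendar month."""
    by_month = defaultdict(list)
    for day in days:
        by_month[(day.year, day.month)].append(day)

    rng = random.Random(seed)
    train_days, test_days = [], []
    for month in sorted(by_month):
        month_days = sorted(by_month[month])
        n_test = max(1, round(len(month_days) * test_fraction))
        held_out = set(rng.sample(month_days, n_test))  # assumption: random choice
        for day in month_days:
            (test_days if day in held_out else train_days).append(day)
    return train_days, test_days


# All calendar days of the study period, 1 January 2020 to 31 December 2021.
start = date(2020, 1, 1)
all_days = [start + timedelta(days=offset) for offset in range(731)]
train_days, test_days = split_days_by_month(all_days)
```

The hourly samples are then assigned to the training or test set according to the day they belong to.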

Figure 3 presents the forecast results obtained for the example day, for which the consistency of the electricity production prediction with the actual production was high; Figure 4 is for the example day, for which the consistency was medium; and Figure 5 is for the example day, for which the consistency was low.

It should be emphasized that the energy production on days with low consistency was marginal, and therefore the low consistency in these cases has little impact on the overall performance of the machine learning models. It was observed that on some late autumn or winter days with low electricity generation, the predicted PV panel output was significantly higher than the real one (Figure 5). This may be related to the impact of an additional input parameter (attribute) that was not included in the models. This parameter may be smog, which is severe in the świętokrzyskie voivodeship, especially on certain days in late autumn and winter.

Moreover, in this work, additional experiments were carried out to present the models' performance when forecasting electricity generation in consecutive months. For this case, the period from 10 August 2021 to 31 December 2021, which constitutes the last 20% of the days in the whole dataset, was selected for testing, and the remaining 80% was used for training the models. To better visualize the results, the hourly data from individual days were averaged.
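The chronological split and the daily averaging used for visualization can be sketched as below. The function names and the representation of the data as `(timestamp, value)` pairs are assumptions made for this example; the dates are those stated in the text.

```python
from collections import defaultdict
from datetime import date, datetime


def chronological_split(hourly, first_test_day):
    """Split (timestamp, value) pairs into those before and from a given day."""
    train = [(ts, v) for ts, v in hourly if ts.date() < first_test_day]
    test = [(ts, v) for ts, v in hourly if ts.date() >= first_test_day]
    return train, test


def daily_means(hourly):
    """Average the hourly values within each calendar day."""
    per_day = defaultdict(list)
    for ts, value in hourly:
        per_day[ts.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(per_day.items())}


# Share of the dataset covered by the test period stated in the text.
first_day, last_day = date(2020, 1, 1), date(2021, 12, 31)
first_test_day = date(2021, 8, 10)
total_days = (last_day - first_day).days + 1     # 731 days
n_test_days = (last_day - first_test_day).days + 1  # 144 days
test_share = n_test_days / total_days            # about 0.197

# Made-up hourly values, for illustration only.
hourly = [
    (datetime(2021, 8, 9, 12), 10.0),
    (datetime(2021, 8, 10, 11), 20.0),
    (datetime(2021, 8, 10, 12), 40.0),
]
train, test = chronological_split(hourly, first_test_day)
test_daily = daily_means(test)
```

Unlike the random split, this split never trains on data recorded after the test period, which is closer to how a deployed forecasting model would be used.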

Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 present the forecast results obtained for the period 10 August 2021–31 December 2021, which were the data selected for model testing.

Comparing the forecasts of the individual machine learning models shown in Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 with the real electricity generation, it can be concluded that all developed models cope well with this task. These figures confirm the observation made in Figure 5: on some days in late autumn and winter with low electricity generation, the predicted PV panel output was higher than the real one. That is why further research on the impact of smog on PV power output is planned.

We carried out additional experiments comparing the performance of the proposed machine learning models with the traditional statistical model SARIMA. It was found that SARIMA did not give satisfactory results: the best $RMSE$ obtained was 141 kWh. It should also be emphasized that SARIMA turned out to be computationally expensive.

Comparing our results with other work, it was found that, in the case of Poland, the relationship between solar radiation and PV panel output is weaker than in Colombia (the Pearson coefficient for świętokrzyskie voivodeship in Poland was 0.7968, and for Medellin in Colombia it was 0.9122 [5]), making the forecasting task for Poland more difficult than for warmer climates. In [33], the authors used a deep learning approach to forecast PV energy generation for a Turkish solar PV power plant of 1.15 MW capacity using an LSTM neural network and obtained $MAE$ = 30.47 kWh and $RMSE$ = 60.66 kWh. The $MAE$ obtained for our models dedicated to Polish conditions was 15.124–22.887 kWh, and the $RMSE$ was 34.590–41.493 kWh. The significantly lower values of $MAE$ and $RMSE$ indicate that our models perform much better; however, it is difficult to compare results for a Mediterranean climate (Turkey) with results for a warm temperate transitional climate (Poland), or results from PV power plants of different capacities (1.15 MW versus 0.7 MW). In [44], the authors analyzed the performance of the SVR model over 12 months for a PV installation of 544 W in southern India. Their $MAE$ for the much smaller PV installation ranged from 34.8839 W to 69.1791 W.
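The caveat about different plant capacities can be made concrete with a rough scaling of the reported errors by installed capacity. This is only an illustrative sketch and not a normalization used in the paper: it assumes that both studies report errors of hourly energy and that dividing by capacity is a fair first-order adjustment.

```python
# Reported absolute errors in kWh and plant capacities in MW, from the text.
turkey_capacity_mw = 1.15
poland_capacity_mw = 0.7

turkey_mae_kwh, turkey_rmse_kwh = 30.47, 60.66
poland_mae_kwh = (15.124, 22.887)   # best and worst of the seven models
poland_rmse_kwh = (34.590, 41.493)  # best and worst of the seven models


def per_mw(error_kwh, capacity_mw):
    """Error per MW of installed capacity (assumed first-order scaling)."""
    return error_kwh / capacity_mw


turkey_mae_per_mw = per_mw(turkey_mae_kwh, turkey_capacity_mw)
turkey_rmse_per_mw = per_mw(turkey_rmse_kwh, turkey_capacity_mw)
poland_mae_per_mw = tuple(per_mw(e, poland_capacity_mw) for e in poland_mae_kwh)
poland_rmse_per_mw = tuple(per_mw(e, poland_capacity_mw) for e in poland_rmse_kwh)
```

Scaled this way, the Turkish values (about 26.5 and 52.7 kWh per MW) fall inside the ranges obtained for the Polish models (about 21.6–32.7 and 49.4–59.3 kWh per MW), which is in line with the caution expressed above about comparing plants of different sizes.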

Comparing our results with the results obtained for models dedicated to Australia, located in a continental climate (linear regression, polynomial regression, SVR, decision trees, RFR, MLP, and LSTM), it was found that their $R^{2}$ ranged from −1.4027 to 0.9880 [14], whereas ours ranged from 0.913 to 0.940. The results obtained in [26] for data collected on only one day in Port Harcourt (tropical climate) for linear SVM, cubic SVM, quadratic SVM, rational quadratic Gaussian process regression (GPR), squared exponential GPR, and Matern 5/2 GPR indicated $R^{2}$ of 0.88–0.98. This shows that all our models perform very well, with a high dependency between the forecasted and real electricity generation from the PV system. It is important to stress that our models are able to predict electricity generation throughout the whole year (they are based on data gathered over 2 years) and are not based only on data from a single chosen day. It is also important to note that the stability of our models was analyzed by calculating the standard deviations of the analyzed metrics. Their low values indicate the stability of the proposed models. We also normalized the results to be able to compare them with the results of other authors. We compared our $RMSE$ after normalization to the results obtained for 300,000 solar power plants located in southern Germany, which has a specific climate that is 9 degrees Celsius warmer than the average temperature in the country [42]. Their $RMSE$ obtained for one month for support vector quantile regression combined with fuzzy information granulation, T-S fuzzy neural network, back propagation, and radial basis function ranged from 0.0756 to 0.6642, whereas ours ranged from 0.0632 to 0.0760. Moreover, we compared our normalized $RMSE$ to the $RMSE$ obtained for the prediction of solar power on the Pecan Street dataset [32]. The authors obtained $RMSE$ ranging from 0.0104 to 0.4119 using the following ANNs: rectified linear unit, autoencoder, CNN, GRU, and LSTM, whereas ours ranged from 0.0632 to 0.0760. For the Australian models mentioned above [14], the $MAE$ ranged from 0.0098 to 0.1492, whereas our $MAE$ after normalization ranged from 0.0276 to 0.0414.
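The text does not state how the metrics were normalized. As a consistency check, the divisor implied by the reported values can be back-calculated by dividing each absolute error by its normalized counterpart. The pairing of the best absolute value with the best normalized value (and worst with worst) is an assumption of this sketch, and the result is an inference and not a figure reported by the authors.

```python
# (absolute value in kWh, normalized value) pairs taken from the text.
reported = {
    "RMSE, best model": (34.590, 0.0632),
    "RMSE, worst model": (41.493, 0.0760),
    "MAE, best model": (15.124, 0.0276),
    "MAE, worst model": (22.887, 0.0414),
}

# Implied normalization constant in kWh for each pair.
implied_divisor_kwh = {
    name: absolute / normalized
    for name, (absolute, normalized) in reported.items()
}

lowest = min(implied_divisor_kwh.values())
highest = max(implied_divisor_kwh.values())
```

All four pairs give a divisor of roughly 546–553 kWh, which suggests that a single constant was used for both metrics. The spread is compatible with rounding of the normalized values to four decimal places.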

Moreover, we tried to compare our results with other results concerning Poland found in the literature. It is difficult to compare our results with those presented in [29], where artificial neural networks simulating PV generation on a national level over a one-month period achieved a mean percentage error of 3.2%. The difficulty results from the completely different scale of the installation: our paper takes the point of view of a single PV farm, whereas [29] takes the point of view of national energy production. Moreover, it is difficult to compare data from completely different time horizons (2 years of observation in our case and one selected month in [29]).

The presented comparison of our results with the results of previous research obtained by other authors for various PV installations around the world indicates that our models perform very well in predicting electricity generation from PV panels.

5. Conclusions

Problems with inaccurate prediction of electricity generation from PV power plants cause severe operational, technical, and financial risks, which seriously affect both their owners and grid operators. This work is a response to the needs of PV farm owners in Poland for models that allow an effective prediction of PV panel output and thus reduce the risk of inaccurate forecasts, with which they are currently struggling. The purpose of this work was to reduce the risk of inaccurate prediction of electricity generation from PV farms using machine learning. The analysis was based on data concerning electricity generation from the 0.7 MW PV power plant in świętokrzyskie voivodeship and on meteorological data from the Institute of Meteorology and Water Management and the PVSYST database, covering the period from 1 January 2020 to 31 December 2021. The input variables of the models were experimentally chosen and included global horizontal irradiation, relative humidity, ambient air temperature, cloud opacity, and the electricity generation from the PV power plant from the previous hour. The electricity generation from the PV power plant was the output parameter. Seven machine learning approaches were developed: lasso regression, K–nearest neighbours regression, support vector regression, AdaBoosted regression tree, gradient boosted regression tree, random forest regression, and artificial neural network. The models were evaluated using the most popular metrics, namely the determination coefficient ($R^{2}$), mean absolute error ($MAE$), and root mean square error ($RMSE$). All proposed models were able to predict electricity production from the analyzed PV farm with high determination coefficients (≥ 0.91) and fairly low mean absolute errors. Random forest regression was the most reliable and accurate model, as it had the highest determination coefficient (0.94), a low mean absolute error (15.12 kWh), and a low root mean square error (34.59 kWh).

The collected data were the basis for a correlation analysis to find the relationship between electricity production from the PV power plant and various meteorological parameters. It was found that horizontal global irradiation and water saturation deficit have a strong proportional relationship with electricity generation. Medium proportional relationships were found between electricity generation from the PV power plant and diffuse horizontal irradiance (DHI), direct (beam) horizontal irradiance (EBH), direct normal irradiance (DNI), ambient air temperature, and operator visibility. It was also observed that there is an inversely proportional medium relationship between PV power generation and zenith. In addition to this, a minor proportional relationship was found between wind speed, height of the base of the upper clouds, and electricity generation from the PV power plant. Additionally, the analysis showed that cloud opacity has a minor inversely proportional relationship with electricity generation from the PV power plant.

Our work aimed to develop an innovative approach for photovoltaic farms in Poland. Long-term forecasting of energy generation allows, on the one hand (globally), conventional generation to be prepared to cover energy shortages and, on the other hand (locally), production processes to be adjusted to the availability of cheap energy. Better forecasting enables better financial results. All in all, the contribution of this work to the body of knowledge covers the following:

The study of the impact of weather parameters on the generation of electricity from the PV power plant in Poland on the example of świętokrzyskie voivodeship and finding that horizontal global irradiation and water saturation deficit have a strong proportional relationship with electricity generation;
Development of seven machine learning models for the prediction of PV power generation, taking into account local weather patterns;
A comparative study of the performance of various machine learning models and the choice of random forest regression as the best model.

One of the limitations of the proposed approach is that the real task of forecasting the electricity generation from a PV system requires the forecasted weather parameters to be applied as input data. In this work, this was not tested, as the parameters from weather forecasts were not gathered. The application of the proposed models on the forecasted weather parameters during real implementation may slightly change the forecasting performance due to the forecast error inherent in weather forecasts. That is why future work is oriented to integrate the proposed machine learning models with the weather forecast system to check its performance during real prediction tasks. Furthermore, as current knowledge about the impact of smog on the generation of electricity from PV power plants is still incomplete, future work is planned to analyze its impact.

In addition to this, in the face of advancing climate change that we currently observe, it is possible that the prediction efficiency of the developed machine learning models will decline over time. Therefore, it is planned to develop a retraining module, as in the face of climate changes, it seems advisable to retrain the model at a future date, taking into account the newly generated data on energy production and meteorological parameters.

Author Contributions

Conceptualization, M.K., L.L. and A.P.; methodology, M.K.; software, M.K. and A.K.; validation, M.K. and A.K.; investigation, M.K. and A.K.; resources, A.P.; data curation, A.P.; Writing—original draft preparation, M.K., L.L., A.P., A.K., J.Z.P. and A.S.; writing—review and editing, M.K., L.L., A.P., A.K., J.Z.P. and A.S.; visualization, M.K.; supervision, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

The project is supported by the program of the Minister of Science and Higher Education under the name: “Regional Initiative of Excellence” in 2019–2022 project number 025/RID/2018/19 financing amount PLN 12,000,000.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PV	Photovoltaic
LR	Linear Regression
PR	Polynomial Regression
DTR	Decision Tree Regression
SVR	Support Vector Regression
RFR	Random Forest Regression
LSTM	Long Short-Term Memory
MPR	Multilayer Perceptron Regression
GRU	Gated Recurrent Unit
CNN	Convolutional Neural Network
KNNR	K–Nearest Neighbour Regressors
RF	Random Forest
ANN	Artificial Neural Network
GBRT	Gradient Boosted Regression Tree
LassoR	Lasso Regression
AdaBoosted RT	AdaBoosted Regression Tree
$R^{2}$	Determination Coefficient
$MAE$	Mean Absolute Error
$RMSE$	Root Mean Square Error

References

European Green Deal. Available online: https://ec.europa.eu/clima/eu-action/european-green-deal_en (accessed on 12 November 2021).
Patiño, J.; López, J.D.; Espinosa, J. Analysis of control sensitivity functions for power system frequency regulation. In Workshop on Engineering Applications; Springer: Berlin/Heidelberg, Germany, 2018; pp. 606–617.
Zender-Świercz, E. Microclimate in rooms equipped with decentralized façade ventilation device. Atmosphere 2020, 11, 800.
Piotrowski, J.Z.; Orman, Ł.J.; Lucas, X.; Zender-Świercz, E.; Telejko, M.; Koruba, D. Tests of thermal resistance of simulated walls with the reflective insulation. EPJ Web of Conferences. EDP Sci. 2014, 67, 2095.
Gutiérrez, L.; Patiño, J.; Duque-Grisales, E. A Comparison of the Performance of Supervised Learning Algorithms for Solar Power Prediction. Energies 2021, 14, 4424.
Singla, P.; Duhan, M.; Saroha, S. A comprehensive review and analysis of solar forecasting techniques. Front. Energy 2021, 1–37.
Orman, Ł.J. Boiling heat transfer on meshed surfaces of different aperture. In AIP Conference Proceedings; American Institute of Physics: College Park, MD, USA, 2014; Volume 1608, pp. 169–172.
Orman, Ł.J. Boiling heat transfer on single phosphor bronze and copper mesh microstructures. EPJ Web Conf. EDP Sci. 2014, 67, 2087.
Tovar, M.; Robles, M.; Rashid, F. PV power prediction, using CNN-LSTM hybrid neural network model. Case of study: Temixco-Morelos, México. Energies 2020, 13, 6512.
Zdyb, A.; Gulkowski, S. Performance assessment of four different photovoltaic technologies in Poland. Energies 2020, 13, 196.
Gulkowski, S.; Zdyb, A.; Dragan, P. Experimental efficiency analysis of a photovoltaic system with different module technologies under temperate climate conditions. Appl. Sci. 2019, 9, 141.
Mohy-ud-din, G.; Muttaqi, K.M.; Sutanto, D. Transactive energy-based planning framework for VPPs in a co-optimised day-ahead and real-time energy market with ancillary services. IET Gener. Transm. Distrib. 2019, 13, 2024–2035.
Csereklyei, Z.; Qu, S.; Ancev, T. The effect of wind and solar power generation on wholesale electricity prices in Australia. Energy Policy 2019, 131, 358–369.
Mahmud, K.; Azam, S.; Karim, A.; Zobaed, S.; Shanmugam, B.; Mathur, D. Machine learning based PV power generation forecasting in alice springs. IEEE Access 2021, 9, 46117–46128.
Antonanzas, J.; Osorio, N.; Escobar, R.; Urraca, R.; Martinez-de Pison, F.J.; Antonanzas-Torres, F. Review of photovoltaic power forecasting. Sol. Energy 2016, 136, 78–111.
Gigoni, L.; Betti, A.; Crisostomi, E.; Franco, A.; Tucci, M.; Bizzarri, F.; Mucci, D. Day-ahead hourly forecasting of power generation from photovoltaic plants. IEEE Trans. Sustain. Energy 2017, 9, 831–842.
Krechowicz, M. Comprehensive risk management in horizontal directional drilling projects. J. Constr. Eng. Manag. 2020, 146, 4020034.
Krechowicz, M.; Piotrowski, J.Z. Comprehensive Risk Management in Passive Buildings Projects. Energies 2021, 14, 6830.
Krechowicz, M.; Gierulski, W.; Loneragan, S.; Kruse, H. Human and Equipment Risk Factors Evaluation in Horizontal Directional Drilling Technology Using Failure Mode and Effect Analysis. Manag. Prod. Eng. Rev. 2021, 12, 45–56.
Krechowicz, M. Risk management in complex construction projects that apply renewable energy sources: A case study of the realization phase of the Energis educational and research intelligent building. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Chengdu, China, 9–11 October 2017; IOP Publishing: Bristol, UK, 2017; Volume 245, p. 62007.
Krechowicz, M. The hybrid Fuzzy Fault and Event Tree analysis in the geotechnical risk management in HDD projects. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2021, 15, 12–26.
Shawon, M.M.H.; Akter, S.; Islam, M.K.; Ahmed, S.; Rahman, M.M. Forecasting PV panel output using Prophet time series machine learning model. In Proceedings of the 2020 IEEE Region 10 Conference (TENCON), Osaka, Japan, 16–19 November 2020; pp. 1141–1144.
Tiwari, S.; Sabzehgar, R.; Rasouli, M. Short term solar irradiance forecast based on image processing and cloud motion detection. In Proceedings of the 2019 IEEE Texas Power and Energy Conference (TPEC), College Station, TX, USA, 7–8 February 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6.
Bou-Rabee, M.; Sulaiman, S.A.; Saleh, M.S.; Marafi, S. Using artificial neural networks to estimate solar radiation in Kuwait. Renew. Sustain. Energy Rev. 2017, 72, 434–438.
Urrego-Ortiz, J.; Martínez, J.A.; Arias, P.A.; Jaramillo-Duque, Á. Assessment and day-ahead forecasting of hourly solar radiation in Medellín, Colombia. Energies 2019, 12, 4402.
Zazoum, B. Solar photovoltaic power prediction using different machine learning methods. Energy Rep. 2022, 8, 19–25.
Premalatha, N.; Valan Arasu, A. Prediction of solar radiation for solar systems by using ANN models with different back propagation algorithms. J. Appl. Res. Technol. 2016, 14, 206–214.
Bigorajski, J.; Chwieduk, D. Analysis of a micro photovoltaic/thermal–PV/T system operation in moderate climate. Renew. Energy 2019, 137, 127–136.
Jurasz, J.; Wdowikowski, M.; Figurski, M. Simulating Power Generation from Photovoltaics in the Polish Power System Based on Ground Meteorological Measurements—First Tests Based on Transmission System Operator Data. Energies 2020, 13, 4255.
Chwieduk, M. Use of solar radiation data from HelioClim database for short-term PV system power output prediction for Polish localization. Pol. Energetyka Slonecz. 2017, 1–4, 1–6.
Ren, H.; Xu, C.; Ma, Z.; Sun, Y. A novel 3D-geographic information system and deep learning integrated approach for high-accuracy building rooftop solar energy potential characterization of high-density cities. Appl. Energy 2022, 306, 117985.
Khodayar, M.; Khodayar, M.E.; Jalali, S.M.J. Deep learning for pattern recognition of photovoltaic energy generation. Electr. J. 2021, 34, 106882.
Ozbek, A.; Yildirim, A.; Bilgili, M. Deep learning approach for one-hour ahead forecasting of energy production in a solar-PV plant. Energy Sources Part A Recovery Util. Environ. Eff. 2021, 1–16.
Al-Hajj, R.; Assi, A.; Fouad, M. Short-term prediction of global solar radiation energy using weather data and machine learning ensembles: A comparative study. J. Sol. Energy Eng. 2021, 143, 051003.
Park, S.; Kim, Y.; Ferrier, N.J.; Collis, S.M.; Sankaran, R.; Beckman, P.H. Prediction of Solar Irradiance and Photovoltaic Solar Energy Product Based on Cloud Coverage Estimation Using Machine Learning Methods. Atmosphere 2021, 12, 395.
Jebli, I.; Belouadha, F.Z.; Kabbaj, M.I.; Tilioua, A. Prediction of solar energy guided by pearson correlation using machine learning. Energy 2021, 224, 120109.
Dong, N.; Chang, J.F.; Wu, A.G.; Gao, Z.K. A novel convolutional neural network framework based solar irradiance prediction method. Int. J. Electr. Power Energy Syst. 2020, 114, 105411.
Dong, J.; Olama, M.M.; Kuruganti, T.; Melin, A.M.; Djouadi, S.M.; Zhang, Y.; Xue, Y. Novel stochastic methods to predict short-term solar radiation and photovoltaic power. Renew. Energy 2020, 145, 333–346.
Javed, A.; Kasi, B.K.; Khan, F.A. Predicting solar irradiance using machine learning techniques. In Proceedings of the 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), Tangier, Morocco, 24–28 June 2019; pp. 1458–1462.
Vrettos, E.; Gehbauer, C. A Hybrid approach for short-term PV power forecasting in predictive control applications. In Proceedings of the 2019 IEEE Milan PowerTech, Milan, Italy, 23–27 June 2019; pp. 1–6.
Shapsough, S.; Dhaouadi, R.; Zualkernan, I. Using linear regression and back propagation neural networks to predict performance of soiled PV modules. Procedia Comput. Sci. 2019, 155, 463–470.
He, Y.; Yan, Y.; Xu, Q. Wind and solar power probability density prediction via fuzzy information granulation and support vector quantile regression. Int. J. Electr. Power Energy Syst. 2019, 113, 515–527.
Qing, X.; Niu, Y. Hourly day-ahead solar irradiance prediction using weather forecasts by LSTM. Energy 2018, 148, 461–468.
Nageem, R.; Jayabarathi, R. Predicting the power output of a grid-connected solar panel using multi-input support vector regression. Procedia Comput. Sci. 2017, 115, 723–730.
Tobnaghi, D.M.; Madatov, R.; Naderi, D. The effect of temperature on electrical parameters of solar cells. Int. J. Adv. Res. Electr. Electron. Instrum. Eng. 2013, 2, 6404–6407.
Radziemska, E. The effect of temperature on the power drop in crystalline silicon solar cells. Renew. Energy 2003, 28, 1–12.
Touati, F.A.; Al-Hitmi, M.A.; Bouchech, H.J. Study of the effects of dust, relative humidity, and temperature on solar PV performance in Doha: Comparison between monocrystalline and amorphous PVS. Int. J. Green Energy 2013, 10, 680–689.
Jiang, H.; Lu, L.; Sun, K. Experimental investigation of the impact of airborne dust deposition on the performance of solar photovoltaic (PV) modules. Atmos. Environ. 2011, 45, 4299–4304.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
Meier, L.; Van De Geer, S.; Bühlmann, P. The group lasso for logistic regression. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2008, 70, 53–71.
Bruce, P.; Bruce, A.; Gedeck, P. Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python; O’Reilly Media: Newton, MA, USA, 2020.
Géron, A. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; O’Reilly Media, Inc.: Newton, MA, USA, 2019.
Liu, Q.; Wang, X.; Huang, X.; Yin, X. Prediction model of rock mass class using classification and regression tree integrated AdaBoost algorithm based on TBM driving data. Tunn. Undergr. Space Technol. 2020, 106, 103595.
Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Zhang, H.; Yang, Q.; Shao, J.; Wang, G. Dynamic streamflow simulation via online gradient-boosted regression tree. J. Hydrol. Eng. 2019, 24, 04019041.
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer: New York, NY, USA, 2001.
Persson, C.; Bacher, P.; Shiga, T.; Madsen, H. Multi-site solar power forecasting using gradient boosted regression trees. Sol. Energy 2017, 150, 423–436.
Hearty, J. Advanced Machine Learning with Python; Packt Publishing Ltd.: Birmingham, UK, 2016.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Krechowicz, M.; Krechowicz, A. Risk Assessment in Energy Infrastructure Installations by Horizontal Directional Drilling Using Machine Learning. Energies 2021, 14, 289.
PVSYST Photovoltaic Software. Available online: https://www.pvsyst.com/ (accessed on 7 April 2022).
Lauret, P.; Voyant, C.; Soubdhan, T.; David, M.; Poggi, P. A benchmarking of machine learning techniques for solar radiation forecasting in an insular context. Sol. Energy 2015, 112, 446–457.

Figure 1. The proposed approach.

Figure 2. Power output for the analyzed PV farm for 2020 (a) and 2021 (b).

Figure 3. Forecasting results for a chosen day in which high compliance was achieved.

Figure 4. Forecasting results for a chosen day in which medium compliance was achieved.

Figure 5. Forecasting results for a chosen day in which low compliance was achieved.

Figure 6. Forecasting results for 10 August to 31 December 2021 using RFR.

Figure 7. Forecasting results for 10 August to 31 December 2021 using GBRT.

Figure 8. Forecasting results for 10 August to 31 December 2021 using ANN.

Figure 9. Forecasting results for 10 August to 31 December 2021 using LassoR.

Figure 10. Forecasting results for 10 August to 31 December 2021 using KNNR.

Figure 11. Forecasting results for 10 August to 31 December 2021 using SVR.

Figure 12. Forecasting results for 10 August to 31 December 2021 using AdaBoosted RT.

Table 2. The results of the correlation analysis between various meteorological parameters and the electricity generation from the PV power plant.

Parameter	Pearson Correlation Coefficient (r)
Electricity generation [kWh]	1.0000
Global horizontal irradiance (GHI) [W/m²]	0.7968
Water saturation deficit [hPa]	0.7950
Diffuse horizontal irradiance (DHI) [W/m²]	0.6519
Direct (beam) horizontal irradiance (EBH) [W/m²]	0.6473
Direct normal irradiance (DNI) [W/m²]	0.6200
Ambient air temperature [°C]	0.5435
Operator visibility [m]	0.5127
Automatic visibility [m]	0.5028
Visibility [m]	0.3808
Wind speed [m/s]	0.3330
Height of the base of the upper clouds [m]	0.2608
Height of the base of the lower clouds [m]	0.2513
Azimuth [degree]	0.2349
Dew point [°C]	0.2245
General cloud cover [oktas]	0.1765
Water vapor pressure [hPa]	0.1663
Wind direction	0.1304
Gust of wind [m/s]	0.1151
Pressure [hPa]	−0.0201
The height of the freshly fallen snow [m]	−0.0215
Sunshine duration [h]	−0.0438
Rain in 6 h [mm]	−0.0438
The occurrence of dew [0/1]	−0.1123
Albedo daily [-]	−0.1207
Snow depth [cm]	−0.1398
Light cloud cover [oktas]	−0.2672
Cloud opacity [%]	−0.3188
Zenith [degree]	−0.6592
Relative humidity [%]	−0.7681
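
The coefficients in Table 2 are presumably the standard sample Pearson r between each meteorological parameter and the electricity generation. A minimal sketch of that calculation follows. The function name `pearson_r` and all numeric readings are hypothetical illustrations, not the authors' code or data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy readings (assumption: NOT data from the study).
ghi_w_m2 = [0.0, 120.0, 340.0, 560.0, 610.0, 300.0]
generation_kwh = [0.0, 40.0, 150.0, 260.0, 270.0, 120.0]
humidity_pct = [95.0, 88.0, 70.0, 52.0, 48.0, 72.0]

r_ghi = pearson_r(ghi_w_m2, generation_kwh)
r_humidity = pearson_r(humidity_pct, generation_kwh)
print(f"r(GHI, generation) = {r_ghi:.4f}")
print(f"r(humidity, generation) = {r_humidity:.4f}")
```

With these toy values the irradiance series correlates positively and the humidity series negatively with generation, mirroring the signs of the corresponding rows in Table 2.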

Table 3. A comparison of the performance of the proposed models.

Model	$\overline{R^{2}}$	$\overline{\mathrm{MAE}}$	$\overline{\mathrm{RMSE}}$	$\sigma(R^{2})$	$\sigma(\mathrm{MAE})$	$\sigma(\mathrm{RMSE})$
RFR	0.940	15.124	34.590	0.002	0.198	0.563
GBRT	0.931	17.007	37.177	0.001	0.215	0.344
ANN	0.938	16.331	35.245	0.002	0.320	0.742
LassoR	0.921	22.877	39.756	0.002	0.293	0.625
KNNR	0.925	17.279	38.797	0.001	0.348	0.713
SVR	0.924	19.363	39.052	0.003	0.399	0.839
AdaBoosted RT	0.913	22.685	41.593	0.003	0.453	0.801
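
Each row of Table 3 reports the mean and standard deviation of R², MAE, and RMSE for one model. The sketch below shows how such a row could be computed from repeated evaluation runs. It is an illustration under stated assumptions: the helper names are hypothetical, the population standard deviation (`statistics.pstdev`) is assumed because this section does not say which estimator was used, and the sample values are invented.

```python
import math
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def summarize(runs):
    """Return {metric: (mean, std)} over a list of (y_true, y_pred) runs.

    Assumption: population standard deviation; the paper may use the
    sample estimator instead.
    """
    metrics = {"R2": r2_score, "MAE": mae, "RMSE": rmse}
    summary = {}
    for name, fn in metrics.items():
        values = [fn(y_true, y_pred) for y_true, y_pred in runs]
        summary[name] = (statistics.mean(values), statistics.pstdev(values))
    return summary


# Hypothetical measurements and predictions from three repeated runs.
measured = [0.0, 50.0, 200.0, 300.0, 100.0]
runs = [
    (measured, [5.0, 45.0, 190.0, 310.0, 95.0]),
    (measured, [0.0, 60.0, 205.0, 290.0, 110.0]),
    (measured, [10.0, 50.0, 195.0, 305.0, 90.0]),
]
for name, (mean_value, std_value) in summarize(runs).items():
    print(f"{name}: mean={mean_value:.3f}, std={std_value:.3f}")
```

Reporting the spread alongside the mean indicates how sensitive each model is to the particular run, which is what the σ columns of Table 3 convey.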

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

