1. Introduction
Investment in renewable energy production has grown remarkably, and the recent severe threats to Europe’s energy security, caused by the scarcity and soaring prices of gas, have accelerated the design of a future European electricity system characterized by a dominant share of renewable energy supply. This is in line with the stated targets of European governments and the official position of the EU, which has increased its renewable energy target from 40 to 45% and set a target of 592 GWac (740 GWdc) for solar in the European Union by 2030 [
1]. In Greece, the currently installed peak power is 4381 MWp of wind parks, 256 MWp of small-scale hydropower systems, 112 MWp of biomass systems, 3989 MWp of photovoltaic (PV) systems, and 352 MWp of rooftop PV systems [
2]. Photovoltaic applications are widespread in the residential and commercial sectors, assisted by the continuous drop in purchasing prices, which currently range from 0.20 to 0.40 EUR/Wp for silicon-based photovoltaic modules with 20% efficiency [
3]. Net-metering has further boosted the expansion of PV applications in the residential sector. Its variants include net-metering, virtual net-metering, and net-metering with battery systems, the latter aiming at zero energy injection into the public grid. Sizing these applications demands meteorological data, consumption data and profiles, and, most importantly, reliable solar potential data because of the uncertainty of solar insolation. Thus, forecasting power output is an essential task to support the optimal sizing of these applications. The significant expansion of electric vehicles is another development that will necessitate a close synergy with rooftop and other PV installations [
4]. Again, the system’s optimization in these cases needs to be supported by reliable power forecasting because of the inherent uncertainty of the availability of solar energy and the randomness of the EV charging behavior [
5]. Furthermore, forecasting the output of the various forms of power production is an important tool in electricity and energy markets [
6,
7].
Many researchers have addressed this problem using different approaches. Hassan et al. proposed an empirical technique using experimental data and interpolated datasheet information in order to produce a mathematical equation for power prediction under various weather conditions [
8]. Huld et al. presented a method for energy rating using mathematical models for shallow-angle reflectivity, spectral sensitivity, module efficiency dependence on irradiance, and module temperature [
9]. Zamo et al. studied PV forecasting for some power plants in mainland France using statistical methods exploiting outputs from numerical weather prediction (NWP) models. Forecasts were built without using technical information on the power plants. The results showed a root mean square error (RMSE) in the range of 9–12% [
10]. Neural networks are another popular approach adopted by researchers in PV energy forecasting. Kim et al. [
6] presented the most recent research works on deep neural networks applied to short-term (forecast horizon from 30 min to 1 week ahead) photovoltaic power forecasting. Graditi et al. conducted a comparative analysis among three models for energy forecasting: a phenomenological model, a multilayer perceptron (MLP) neural network, and a model based on regression analysis, applied to a dataset of an existing PV power plant. The neural network technique demonstrated superior accuracy in the results [
11]. Sundaram et al. proposed an artificial neural network (ANN) for the long-term prediction of an existing 1 MWp plant using three different options. Input variables were the module and ambient temperatures, wind speed, and global irradiance. The ANN model with the four inputs achieved more accurate results, with a mean absolute percentage error (MAPE) of 1.68% [
12]. Kardakos et al. proposed the application of the seasonal autoregressive integrated moving average (SARIMA) model and two ANNs for energy generation forecasting of grid-connected plants in Greece. Comparisons were conducted in terms of normalized root mean square error (nRMSE) [
13]. López Gómez et al. created an ANN to predict power from a real PV plant in Italy. Three scenarios were investigated: measured data employed for both training and prediction; measured data employed only for training, with prediction based on data from the Global Data Assimilation System (GDAS); and, finally, GDAS data used for both training and prediction. The results show that the lack of on-site weather measurements can be compensated for by a validated numerical weather model in combination with machine learning techniques for PV power forecasting [
14]. Kothona et al. proposed a forecasting model based on the long short-term memory (LSTM) algorithm. The examined parameters were solar irradiance, PV module temperature, historical PV data, and clearness index. The results indicated that the inclusion of the clearness index as input can improve the performance of the forecaster [
15]. Omar Nour-Eddine et al. studied PV power production data of a 5.94 kWp grid-connected PV plant in Morocco. The PV system comprised panels of three different PV technologies. The proposed models are an ANN and a persistence model that are applied to a data set with solar irradiance and temperature as inputs. The results were compared in terms of mean bias error (MBE), mean absolute error (MAE), MAPE, RMSE, and nRMSE. The ANN-based model demonstrated a 2.107% MAE and 2.645% RMSE against 2.406% and 5.185%, respectively, for the persistence model [
16]. Akhter et al. proposed a model for hour-ahead prediction on a yearly basis for three different PV plants, based on available data for wind speed, module and ambient temperature, and solar irradiation. The model employs a long short-term memory (LSTM) recurrent neural network (RNN) with a deep learning method, and its results were compared with regression, hybrid adaptive neuro-fuzzy inference system (ANFIS), and machine learning methods [
17]. Natarajan et al. proposed a radial basis function neural network (RBFN) with inputs from large-scale PV plants, evaluated with RMSE, nRMSE, MBE, MAE, MaxAE, MAPE, the Kolmogorov–Smirnov test integral (KSI) and OVER metrics, skewness, kurtosis, and variability estimation [
18]. Pan et al. proposed a support vector machine (SVM) that uses weather temperature, relative humidity, global horizontal radiation, diffuse horizontal radiation, wind direction, and sampling time as input parameters. The output of the model was active power and the results were a correlation coefficient R² up to 0.997, MSE and MAE values of 0.0349 and 0.1569, respectively [
19]. Pasion et al. proposed a machine learning technique using latitude, month, hour, ambient temperature, pressure, humidity, wind speed, and cloud cover as independent variables and power production as output. A distributed random forest regression algorithm modeled the combined dataset with an R² value of 0.94 [
20]. Jaber et al. presented a prediction model for comparing the performance of six different photovoltaic (PV) modules using artificial neural networks (ANNs), with the basic characteristics from the PV panels’ manufacturer datasheets, such as cell temperature, irradiance, fill factor, short-circuit current, open-circuit voltage, and maximum power. Comparing the results with the measured data yielded a MAPE of 0.874% [
21]. Kim et al. proposed a combination of bidirectional long short-term memory (BLSTM) with an ANN to predict power generation for a specific hour, using four types of historical input data: hourly PV generation (168 h ahead), hourly horizontal radiation, hourly ambient temperature, and hourly surface temperature. The results showed that the LSTM prediction model with the ANN estimation model using exponential moving average preprocessing exhibited higher accuracy [
6].
Another related aspect of neural network application in the building’s rooftop PV sector is consumption forecasting. This is a useful tool for the optimal sizing of PV systems for net-metering applications, including those with battery storage systems. Knowledge of the building’s consumption profile is a necessary starting point for the improvement of the building’s energy management and application of energy-saving actions. Villanueva et al. proposed a method for predicting the consumption of household appliances by evaluating statistical distributions (Kolmogorov–Smirnov and Pearson’s χ² tests) [
22]. Kalogirou et al. proposed a model for the prediction of energy consumption in a passive solar building, which generates a mapping between easily measurable inputs and the desired output, i.e., the building’s energy consumption [
23].
In addition to the above issues, photovoltaic applications must be based on special techniques for fault detection, degradation investigation, and other effects such as soiling and potential induced degradation. Neural networks offer great help in studying these effects. Laurino et al. proposed a four-layered feed-forward artificial neural network that learns the correlation of the I–V curves with irradiance and temperature. This approach has also been applied to detect anomalous increases in series resistance from a large experimental set of curves [
24]. Chine et al. proposed an ANN that uses a given set of working conditions parameters—solar irradiance and photovoltaic (PV) module’s temperature—to predict a number of attributes such as current, voltage, and number of peaks in the I–V curve of the PV strings. The simulated attributes were then compared with real field measurements, leading to the identification of possible faulty operating conditions [
25]. Massi Pavan et al. studied the effect of soiling on large-scale photovoltaic plants using two different techniques: four Bayesian neural network (BNN) models were developed in order to calculate the performance at standard test conditions (STCs) of two plants installed in Southern Italy, and the results were compared against those of a regression model. The results indicate that the losses due to dust accumulation on poly-crystalline Si PV modules’ surfaces range from roughly 1 to 5% after 1 year of operation [
26], which agrees with observations from other researchers [
27].
Neural networks are also a significant help in solar radiation forecasting, which is necessary for the planning, operation, and maintenance procedures of PV plants. Behrang et al. [
28] employed different ANN techniques to predict daily global radiation on a horizontal surface, based on the daily mean air temperature, relative humidity, and sunshine hours. De O. Santos et al. proposed a heterogeneous ensemble dynamic selection model, named HetDS, to forecast solar irradiance, choosing the most suitable forecasting model from a pool of seven well-known literature methods: ARIMA, support vector regression (SVR), multilayer perceptron neural networks, extreme learning machine (ELM), deep belief network (DBN), random forest (RF), and gradient boosting (GB). The experimental evaluation showed that the proposal can overcome the single models in almost all scenarios with a small dispersion [
29].
It can be observed that the various approaches differ in the type of neural network, the selection of inputs and outputs, and the kind of data employed for training and validation. Another important selection parameter is the type of forecasting, which may be classified into three main categories according to Das et al. [
30]: (i) short-term forecasting of PV power generation that is applied in scheduling and dispatching of electrical power, or in the design of a PV-integrated energy management system, (ii) medium-term forecast, carried out for periods from one week to one month for power system and maintenance scheduling, and (iii) long-term PV power forecasting for periods from one month to one year, which is helpful for the planning of electricity generation and distribution [
30]. As expected, long-term PV power forecasting is the most demanding task for the use of machine learning methods, which need a lot more research to understand their complications and transform them into efficient computational tools [
31]. The experience from the application of deep learning in other related fields is useful in this context. Nguyen et al. [
32] constructed a long short-term memory (LSTM) RNN for making predictions of French nuclear power plants’ steam generator output over a long-term horizon, in which the network hyperparameters are automatically optimized by a tree-structured Parzen estimator (TPE) algorithm. RNNs include feedback connections from the hidden/output layer to the preceding layers, to capture the dynamics of sequential data and reproduce previous patterns in the future prediction. Similarly, RNNs are applied in long-term forecasting of lithium-ion battery performance [
33], membrane degradation in fuel cells [
34], rolling bearings remaining useful life [
35], and machine health monitoring [
36].
The current work belongs to long-term power forecasting. It focuses on the exploitation of actual data collected from monitoring grid-connected PV systems and has three objectives. The first objective is the comparison of an ANN model with a simple linear regression in PV energy forecasting using actual data. The second is the evaluation of the data in order to be useful either for forecasting or for fault diagnosis and degradation analysis. The third is to create a methodology for the performance evaluation of a PV system and reliable estimation of degradation rates, which is essential information for the creation of energy forecasting models.
The rest of the manuscript is organized as follows: Section 2.1 describes the photovoltaic park’s details and the experimental setup employed for its monitoring. Section 2.2 describes the pre-processing, and Section 2.3 introduces the forecasting methods, whereas the results are presented and compared with actual data in Section 3. In Section 4, a comparative discussion of the performance of the alternative models and training approaches is carried out, and finally, conclusions are drawn on the long-term prediction and fault diagnosis capabilities.
3. Results
First, it is important to conduct a sensitivity analysis in order to correlate the input parameters (irradiance and back panel temperature) with the output parameter (power output) and, consequently, energy production. An additional important parameter that assists this sensitivity analysis is the airmass [
45], which defines the direct optical path length of the Sun’s rays through the Earth’s atmosphere, expressed as a ratio relative to the minimum path length vertically upwards at the solar zenith. This coefficient is useful for characterizing the modification of the solar spectrum after traveling through the atmosphere. It is also intensively investigated as a valuable predictor in meteorological models [
46]. For our purposes, airmass is routinely calculated for the available system’s performance data according to
Appendix B.
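As an illustration of how an airmass value can be obtained from the solar zenith angle, the following minimal sketch uses the Kasten and Young (1989) approximation. This choice is an assumption made for illustration only; the formula actually applied to the system's data is the one given in Appendix B and may differ. The function name `relative_airmass` is hypothetical.

```python
import math


def relative_airmass(zenith_deg: float) -> float:
    """Relative optical airmass for a given apparent solar zenith angle (degrees).

    Kasten-Young (1989) approximation, assumed here for illustration;
    it is not necessarily the formula of Appendix B.
    Returns NaN when the Sun is below the horizon.
    """
    if zenith_deg > 90.0:
        return float("nan")
    cos_z = math.cos(math.radians(zenith_deg))
    return 1.0 / (cos_z + 0.50572 * (96.07995 - zenith_deg) ** -1.6364)


# Airmass is close to 1 with the Sun overhead and close to 2 at a 60 degree zenith angle.
for zenith in (0.0, 60.0, 80.0):
    print(zenith, round(relative_airmass(zenith), 3))
```

Because airmass depends only on the Sun's position, it can be computed for every timestamp of the monitoring record without any additional measurement.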
Figure 2 classifies the electricity produced by the PV park according to the respective dimensionless airmass values, which were conveniently divided into six classes (AM1–AM2, AM2–AM3, AM3–AM4, AM4–AM5, AM5–AM10, and AM10 and beyond) [
47].
According to the results of Figure 2, more than 90% of electricity production corresponds to conditions with airmass values in the range of 1 to 4. Moreover, 99% of the production is achieved for airmass values <10. For this reason, data for higher values are rejected, as discussed in Section 2. Data with values of irradiance smaller than 50 W/m² are also rejected.
Figure 3 shows that only 7.5% of electrical energy production is achieved for irradiance values smaller than 200 W/m². As already mentioned in Section 2, records that correspond to irradiance values lower than 50 W/m² are rejected. These points represent, in total, less than 1% of the annual electricity generation. Excluding these records significantly improves the prediction accuracy of the neural networks, since it avoids training them with noisy data of low information value.
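The classification and rejection rules described above can be sketched as follows. The class edges and the two thresholds (50 W/m² and airmass 10) are taken from the text; the record layout and all function names are hypothetical, and the sample values are made up for illustration.

```python
from collections import defaultdict

# Airmass class edges as listed in the text; the last class is open-ended.
AM_EDGES = [1, 2, 3, 4, 5, 10]


def airmass_class(am: float) -> str:
    """Label of the airmass class that contains `am`."""
    for lo, hi in zip(AM_EDGES, AM_EDGES[1:]):
        if am < hi:
            return f"AM{lo}-AM{hi}"
    return "AM10+"


def keep_record(irradiance_wm2: float, am: float) -> bool:
    """Rejection rule from the text: drop irradiance below 50 W/m2 and airmass of 10 or more."""
    return irradiance_wm2 >= 50.0 and am < 10.0


def energy_share_by_class(records):
    """Fraction of the total energy produced in each airmass class.

    `records` is an iterable of (energy_kwh, airmass) pairs (assumed layout).
    """
    totals = defaultdict(float)
    for energy, am in records:
        totals[airmass_class(am)] += energy
    grand_total = sum(totals.values())
    return {label: energy / grand_total for label, energy in totals.items()}


# Made-up sample: three records in different airmass classes.
sample = [(6.0, 1.5), (3.0, 2.5), (1.0, 12.0)]
print(energy_share_by_class(sample))
```

Computing the shares before filtering, as in Figures 2 and 3, is what justifies the thresholds: the rejected records carry only a small part of the annual energy.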
The percentages of electricity production in the different backsheet temperature classes defined in the pre-processing are presented in
Figure 4. More than 99% of the total electricity production occurs in the four backsheet temperature classes between 15 and 55 °C.
Figure 5 shows the observed linear trend in the evolution of AC power output with irradiance. Backsheet temperature also shows a general linear trend with irradiance. However, according to the technical datasheet of the PV panels, backsheet temperature has a negative effect on the panel’s efficiency. These two conflicting trends produce the correlation in
Figure 6 between power output and temperature, characterized by a wide margin of possible backsheet temperatures associated with each AC output level.
The situation is more clearly depicted in
Figure 7, where the correlation between power output, temperature, and irradiance is apparent. The PV panels’ manufacturers describe the correlation of these three parameters with characteristic curves and temperature coefficients. STCs include irradiance (1000 W/m²), temperature (25 °C), and a spectrum corresponding to AM 1.5. The lack of spectral measurements is compensated for by the introduction of the airmass factor, which quantifies the path length of the Sun’s rays through the Earth’s atmosphere [48].
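The interplay of irradiance and temperature can be illustrated with the widely used linear temperature-coefficient model, P = P_STC · (G / 1000) · [1 + γ (T − 25)]. The sketch below is an assumption for illustration and not the model used in this work: the coefficient γ = −0.004 per °C is a placeholder typical of crystalline silicon rather than a datasheet value of the installed panels, and backsheet temperature is taken as a proxy for cell temperature.

```python
G_STC = 1000.0  # W/m2, irradiance at standard test conditions
T_STC = 25.0    # deg C, module temperature at standard test conditions


def dc_power(p_stc_w, irradiance_wm2, module_temp_c, gamma_per_c=-0.004):
    """DC power estimate: linear in irradiance, derated linearly with temperature.

    gamma_per_c is the power temperature coefficient (1/degC); the default
    is a placeholder, not a value from the panels' datasheet.
    """
    temp_factor = 1.0 + gamma_per_c * (module_temp_c - T_STC)
    return p_stc_w * (irradiance_wm2 / G_STC) * temp_factor


# Same irradiance, hotter module: the output drops although the Sun is equally strong.
print(dc_power(1000.0, 1000.0, 25.0))
print(dc_power(1000.0, 1000.0, 55.0))
```

Since high irradiance usually coincides with high module temperature, the two factors pull in opposite directions, which is consistent with the wide scatter of temperatures observed at each power level.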
Figure 8 shows the correlation between power output and airmass in different classes of irradiance. It is observed that power output follows an exponential trend with airmass. The airmass factor thus serves as an indicator for comparing time-series values from different seasons and hours at which the Sun’s rays have the same path length.
Next, we proceed to compare the results of the ANN and the linear regression model predictions to the actual performance during characteristic periods in the years 2016 to 2021.
Figure 9a compares the results of the linear regression model and the ANN with the actual performance data during three consecutive cloudy days in 2016. As expected, the prediction with the more complex (nonlinear) curve fitting of the ANN is always closer to the actual data. The same behavior is observed during the sunny days (
Figure 9b).
The comparison continues with three consecutive cloudy days in 2017, followed by three sunny days (
Figure 10). The performance of the ANN is more accurate in these examples from 2017. The linear regression model lacks accuracy in prediction, especially during the high insolation time during the noon hours of sunny days.
The comparison continues in
Figure 11 with three cloudy and three sunny days, with similar findings: the qualitative comparison shows that the ANN prediction remains closer to the measurements.
In summary, as regards the years 2016 to 2018, which are closer to the training period, the characteristic results of
Figure 9,
Figure 10 and
Figure 11 show that both models forecast the power output with very good accuracy, the ANN model being generally superior. The nRMSE values fluctuated between 4.6 and 7.4% for the ANN model and between 6.2 and 8.8% for the linear regression model. The respective MAPE values fluctuated between −2.27 and −1.32% for the ANN model and between −1.81 and −1.13% for the linear regression model. It is also important to note that during these years the performance ratio (PR) was in the range of 0.86–0.88.
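For readers who wish to reproduce such metrics, a minimal sketch is given below. Two assumptions are made that are not confirmed by the text: the RMSE is normalized by the mean of the measured values (other choices, such as the installed capacity, are common), and the percentage error is computed with its sign as (actual − predicted) / actual, which matches the reported convention that negative values accompany overprediction. The function names are hypothetical.

```python
import math


def nrmse_percent(actual, predicted, norm=None):
    """RMSE expressed as a percentage of `norm` (default: mean of `actual`)."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    if norm is None:
        norm = sum(actual) / n  # assumed normalization
    return 100.0 * rmse / norm


def signed_mpe_percent(actual, predicted):
    """Mean of (actual - predicted) / actual, in percent.

    Negative when the model overpredicts on average; zero measurements are skipped.
    """
    terms = [(a - p) / a for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(terms) / len(terms)


# Made-up values: the model overpredicts both samples by 10 units.
measured = [100.0, 200.0]
forecast = [110.0, 210.0]
print(nrmse_percent(measured, forecast), signed_mpe_percent(measured, forecast))
```

With this sign convention, a change from negative to positive values indicates a transition from overprediction to underprediction.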
Moving further away from the training period,
Figure 12 shows that both models forecast the power output with a markedly higher nRMSE for 2019 compared to the previous years. The nRMSE values are 11.2% for the ANN model and 12.5% for the linear regression model. The respective MAPE values are −5.97% for the ANN model and −5.75% for the linear regression model. It must be noted here that this year was characterized by a very low PR of 0.84. Both models also show an increase in RMSE, to 9.8% and 9.7% for the ANN and the regression model, respectively.
Shifting attention to the year 2020 (Figure 13), the prediction accuracy of both models deteriorates further, with both overpredicting the electricity output of the PV park. This behavior is most likely related to the rate of efficiency deterioration of the PV panels. During the assessment of the first six years of operation of the specific PV park [
47], the degradation rates of normalized efficiency for the first two AM classes, which have the highest impact on total energy production, varied between 1.28 and 6.92%. The degradation rate varied over the years. Thus, the degradation embodied in the first three years of data, which were employed in the training of both models, cannot be expected to predict its evolution over the following years with high fidelity.
Figure 14 shows that in the following year, 2021, both models underestimate the output for the first time. This is further supported by the MAPE values, which are now positive. This probably indicates an error in the irradiance sensor. To allow for a better assessment of the situation, a comparison of the calculated average PV plant efficiency for the period 2016–2021 is presented in
Table 1.
4. Discussion
The application of linear regression (M1) and ANN (M2) provides insight into the effectiveness of these models in understanding and predicting the operation of a PV plant over the period of a decade. It is clear that the ANN model achieves better statistical metrics than the regression model, as expected from the published experience of other researchers. However, as one moves further away from the training period, the predictions of both models deviate significantly from the measured performance, especially during the years 2019–2021. This fact indicates a possible drop in the PV plant’s performance that may be explained either by the expected degradation of the PV panels or by the soiling effect, as there exists no systematic cleaning procedure in the plant. Another explanation could point to an error in the irradiance measurement system. The irradiance sensor described in
Appendix A is itself a reference PV cell and is thus subject to soiling, with an adverse effect on the recorded irradiance values.
Energy generation metrics in the form of PR and total PV performance indicate a decreasing trend from 2016 to 2019 that is consistent with the manufacturer’s warranties. In contrast, PR and total PV performance show an increase during 2020–2021. Moreover, the average performance for the year 2021 is significantly higher than in the years 2016–2018. This fact points to an erroneous irradiance sensor.
Figure 15 supports this conclusion by comparing the irradiance–airmass correlations of the years 2013 and 2021. The readings for 2021 indicate a systematic shift to lower values of insolation producing useful electricity across the entire airmass range of 1–10. This suggests a drift of the sensor output.
A closer look at Figure 16 reveals the problem with the soiled irradiance sensor. More specifically, 26 April is a clear-sky, sunny day; however, the sensor does not correctly read the maximum irradiance value of about 1000 W/m². The same observation can be made for the next three days, which are days with intermittent clouds: the maximum irradiance in the sunny intervals is always recorded at reduced levels.
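Why an under-reading irradiance sensor inflates the apparent performance can be seen from the standard definition of the performance ratio as final yield divided by reference yield. The sketch below assumes this definition; the energy, rated power, insolation, and the 5% sensor bias are made-up numbers for illustration and are not measurements from the PV park.

```python
G_STC_KW = 1.0  # kW/m2, reference irradiance


def performance_ratio(energy_ac_kwh, rated_power_kwp, insolation_kwh_m2):
    """PR = final yield / reference yield."""
    final_yield = energy_ac_kwh / rated_power_kwp   # kWh per kWp
    reference_yield = insolation_kwh_m2 / G_STC_KW  # equivalent full-sun hours
    return final_yield / reference_yield


# Hypothetical year: identical energy, but the soiled sensor reads 5% low.
true_pr = performance_ratio(1400.0, 1.0, 1650.0)
apparent_pr = performance_ratio(1400.0, 1.0, 1650.0 * 0.95)
print(round(true_pr, 3), round(apparent_pr, 3), round(apparent_pr / true_pr, 4))
```

Because the measured insolation appears in the denominator, a sensor that reads low by a given fraction raises the apparent PR by the reciprocal of the remaining fraction. The same bias makes models driven by the measured irradiance underestimate the output.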
Table 2 summarizes the calculated statistical and energy metrics of models M1 (regression model) and M2 (FF ANN trained with three-year data), for the years 2016 to 2021.
Figure 17 shows the trend of MAPE and nRMSE of the two models’ predictions with time. In general, an increase in error is observed over the years. However, this increase is not monotonic, as also observed in
Table 2. Of course, one cannot expect the specific approach to be very accurate for long-term forecasting, because of the erratic nature of the efficiency degradation rate of an ensemble consisting of a large number of PV panels, which are not regularly cleaned from dust and soiling. However, it is employed here as a useful tool for fault detection and degradation analysis. Previous studies, which investigate the efficiency degradation of PV plants and especially in Greece, point to an average annual degradation rate in the range of 1–4% [
47]. These findings converge—on average—with the general trends resulting from the proposed method (
Figure 17). Furthermore, the effect of heavy soiling is another important factor that may decrease energy production by up to 6.5% on an annual basis [
49]. Heavy soiling was observed during the spring of 2019 in Central Greece which resulted in an average decrease of 5.6% [
27] in energy generation. This fact may be related to the observed deterioration in the models’ performance for 2019. One must keep in mind that the dust and soiling effects are reversible, e.g., by the short-term accumulation and subsequent melting of snow on the panels’ surface. This fact may partially explain the characteristic error fluctuations from year to year in
Figure 17.
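To give a feeling for the magnitude of the error that degradation alone can introduce, the following sketch computes the signed percentage error of a model that ignores a constant annual efficiency loss. The constant rate is a simplifying assumption made here for illustration (the text notes that the observed rate is variable), the rates of 1%, 2%, and 4% are sample values from the 1–4% range quoted above, and the function name is hypothetical.

```python
def degradation_bias_percent(annual_rate, years_after_training):
    """Signed error (actual - predicted) / actual, in percent, for a model
    that ignores a constant annual efficiency loss `annual_rate`."""
    remaining = (1.0 - annual_rate) ** years_after_training  # actual / predicted
    return 100.0 * (1.0 - 1.0 / remaining)


# Error growth over five years for three assumed degradation rates.
for rate in (0.01, 0.02, 0.04):
    errors = [round(degradation_bias_percent(rate, n), 1) for n in range(1, 6)]
    print(f"{rate:.0%} per year -> {errors}")
```

Under this assumption the error grows steadily and remains negative, so year-to-year fluctuations and positive errors must stem from other effects such as soiling or sensor drift.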
Degradation is an important factor that must be considered in the long-term prediction of a PV plant’s performance. In the specific case study, the fluctuating dust and soiling effects overlap with the monotonic efficiency degradation, so the short training period of three years may be the cause of the propagation of this instability. In order to further investigate this matter, it was decided to extend the training period of the ANN by two more years, to a total of five years. This is model version M3, which is added in
Figure 17 and
Table 3, for the years 2019 and 2020. It is interesting to compare in Table 3 the prediction errors of this second scenario for these years. The improvement in the MAPE and nRMSE with the 5-year training of the ANN model is also apparent in
Figure 17 (years 2019–2020).
Table 3 compares the performance prediction for 2019 and 2020 of the regression model, the FF ANN with 3-year training, and the FF ANN with 5-year training. The ANN model’s performance improves with the extended training; however, a fair basis for comparison between the two ANN training scenarios is not yet available: the monitoring of the PV park must be extended for at least two more years in order to assess whether extending the training period improves the performance in long-term analysis.