**A Machine Learning Approach to Low-Cost Photovoltaic Power Prediction Based on Publicly Available Weather Reports**

**Nailya Maitanova 1,\*, Jan-Simon Telle 1, Benedikt Hanke 1, Matthias Grottke 2, Thomas Schmidt 1, Karsten von Maydell <sup>1</sup> and Carsten Agert <sup>1</sup>**


Received: 20 December 2019; Accepted: 28 January 2020; Published: 7 February 2020

**Abstract:** A fully automated transferable predictive approach was developed to predict photovoltaic (PV) power output for a forecasting horizon of 24 h. The prediction of PV power output was made with the help of a long short-term memory machine learning algorithm. The main challenge of the approach was using (1) publicly available weather reports without solar irradiance values and (2) measured PV power without any technical information about the PV system. Using this input data, the developed model can predict the power output of the investigated PV systems with adequate accuracy. The lowest seasonal mean absolute scaled error of the prediction was reached by maximum size of the training set. Transferability of the developed approach was proven by making predictions of the PV power for warm and cold periods and for two different PV systems located in Oldenburg and Munich, Germany. The PV power prediction made with publicly available weather data was compared to the predictions made with fee-based solar irradiance data. The usage of the solar irradiance data led to more accurate predictions even with a much smaller training set. Although the model with publicly available weather data needed greater training sets, it could still make adequate predictions.

**Keywords:** photovoltaic power prediction; publicly available weather reports; machine learning; long short-term memory; integrated energy systems; smart energy management

#### **1. Introduction**

The building sector consumed one-third of global final energy use in 2016. About 80% of this final energy consumption was supplied by fossil fuels. The combustion of this fossil fuel amount caused 28% of global energy-related CO2 emissions, which represent one of the main reasons for the greenhouse effect in the atmosphere and global warming. A plan to limit global warming is described in the Paris Agreement, which entered into force on 4 November 2016. The first point of this agreement defines that the increase in global average temperature should not exceed 2 ◦C in comparison to the pre-industrial level [1].

Many countries included the aims of the Paris Agreement in their National Climate Action Plans, which, among other things, outline the national policies for reduction of the greenhouse gas emissions in the building sector through to 2050. The Climate Action Plan of Germany aims to make the building stock virtually climate neutral by reducing the primary energy demand of buildings by at least 80% compared to 2008 levels by 2050. This ambitious goal can be achieved by a combination of increasing efficiency, using renewable energy, and sector coupling with the energy and transport sectors [2].

These solutions, particularly the coupling between different sectors, face big challenges. One of these challenges is the development and integration of energy management systems for the buildings. Such smart energy management system can have a significant impact, especially on the energy consumption of non-residential commercial buildings, because this type of building has an advantage of simultaneity between load behavior and locally generated energy from photovoltaic (PV) systems [3,4]. However, the fluctuating electricity generation from PV systems challenges the energy management system to use a prediction of PV power output. This type of power prediction leads to increasing the self-consumption, avoiding higher grid fees, and efficient controlling of temporal coincidence between integrated energy systems such as PV systems, battery electric vehicles (BEV), heat pumps, and other flexible loads.

As a continuation of a previous study Hanke et al. [5], a predictive model based on a machine learning approach was developed within this study to forecast the PV power output for the next 24 h. This model had the same conditions as before [5]; the input dataset for the predictive approach consists of only measured power values of PV systems and free publicly available weather reports. The prediction of PV power is made without any technical information about the PV system (e.g., without installed capacity, efficiency, inclination of the PV modules, etc.). Wang et al. [1] used also only historical PV power output and numerical weather prediction for the short-term forecasting of PV power. One of the main differences between the study [6] and this study is the type of forecasting method, whereby Wang et al. [6] proposed interval forecasting of PV power with lower and upper boundaries, while the predictive model of this study makes deterministic point forecasting.

The study [5] and this study are not the first surveys, which used publicly available data for making predictions. For example, Kwon et al [7] used publicly available weather data (temperature, humidity, dew point, and sky coverage) to make predictions of global horizontal irradiance (GHI), which can theoretically be used for PV power prediction. The given study made direct predictions of PV power using publicly available weather reports as input data. However, these reports do not provide measurements and predictions of GHI. For this reason, the input dataset was extended with an additional descriptive feature. Two possible additional features were investigated within this study: PV power under clear-sky conditions and maximum PV power, calculated from power measurements of the last five days.

The requirement to use free publicly available weather data can be explained by the assumption that most commercial building integrated and grid-connected PV systems do not have any profitable business model. These systems do not generate enough revenue from feed-in tariffs and self-consumption. Without enough revenue, the buildings with PV systems cannot have an energy management system based on cost-intensive forecast data. Using publicly available weather reports is necessary to ensure that energy management systems with an integrated developed predictive approach can operate at a low-cost level.

In addition to using publicly available weather reports, the predictive model should also satisfy other requirements; it should operate fully automatically, be continuously learning, be transferable to other PV systems, and adapt to changes in weather conditions and the PV system, such as degradation of the solar modules.

With regard to these requirements, different studies were investigated in order to find an appropriate predictive approach. In recent years, more and more machine learning algorithms were developed and applied for time series predictions, as References [8–11] showed in their reviews about PV power forecast techniques. Mohammed et al. [12] investigated different machine learning models for forecasting PV power for the next 24 h: seven individual machine learning methods (k-nearest neighbors, decision tree, gradient boosting, etc.) and three ensemble models. Both individual and ensemble models were compared to the benchmark models (autoregressive integrated moving average model, naïve, and seasonal naïve). Das et al. [13] used support vector regression to forecast PV power, and they compared the prediction to a persistence and conventional artificial neural network (ANN). In particular, ANNs are used widely for PV power prediction. Rosato et al. [14] proposed three predictive approaches based on ANN, which were used to predict power output for a large-scale PV plant in Italy. An ANN model was also used by Khandakar et al. [15] to predict the PV power output in Qatar, but the authors of this study additionally investigated two different feature selection techniques. In References [16–18], it was discussed that the application of long short-term memory (LSTM) neural networks provides particularly good results in PV power forecasting.

Taking into account the conducted literature research and the defined requirements for the predictive model, it was not suitable for this study to use classical splitting of the dataset into training and test sets and training the model only once. The dataset with the measured values used in this study is regularly updated with current weather data and PV measurements. Therefore, a re-training of the model with new data occurs at regular time intervals (every 12 h). Together with the current data, the weather forecast is also updated regularly (every 3 h). After each update of the weather forecast the developed model makes updated predictions of the PV power output for the next 24 h. These procedures can ensure continuous learning of the predictive model and addressing concept drift, i.e., adaptation to possible changes in data, completely new data, and new relationships between input and output.

The developed predictive approach was tested for two PV systems, which have different installed capacity, age, solar cell types, and location. This test checked whether the predictive approach can be transferred to different PV systems, despite having different individual parameters. In addition to this, PV power was predicted for two seasons of the year with different levels of GHI, in order to investigate how seasonal fluctuations of GHI can influence the accuracy of PV power prediction.

After making predictions of PV power for two seasons using two different PV systems, it was necessary to evaluate the quality of the developed predictive approach. For this purpose, the developed predictive approach was compared to a benchmark model (seasonal naïve forecast). This comparison analysis was implemented with the help of seasonal mean absolute scaled error (MASE). Furthermore, the PV power prediction of the model, based on publicly available weather data, was compared to predictions made with fee-based solar irradiance data.

At the end of Section 1, it is important to underline the novelty of this study and its contribution to the knowledge of PV power prediction. The "low-cost" predictive approach, which was developed within this study, can predict PV power output without any knowledge of the PV system, i.e., no technical information of the PV system is required. Only measured PV power values and publicly available weather reports were used as input data for the predictive approach. These publicly available weather reports do not have any values of the GHI; thus, the PV power prediction was made without the prediction of GHI. Moreover, all conducted simulations proved that the developed "low-cost" approach can operate fully automatically, can predict the power output of different PV systems, and can adapt for different seasons of the year.

#### **2. Data**

In this section, the origin, main characteristics, and quality of input data are explained. In supervised machine learning, the input data are divided into descriptive and target features [19]. The descriptive features of this study included historical weather measurements and numerical weather predictions for two different locations in Germany: Oldenburg and Munich. These features were used for prediction of PV power output, which was defined as a target feature.

The datasets for Oldenburg and Munich included weather data and measured PV power for different observation periods. The observation period for Oldenburg lasted from 5 May 2017 until 10 April 2018, and the observation period for Munich lasted from 5 March 2019 until 30 June 2019. These datasets were explored extensively to determine possible data quality issues and to estimate correlation between descriptive and target features for feature selection.

#### *2.1. Photovoltaic Power Output*

The origins of the PV power measurements used in this study are roof-top PV systems. The first system is in operation at the DLR Institute of Networked Energy Systems in Oldenburg since November 2010. Another system was newly installed on a commercial building in Munich and it generates electricity since November 2018.

The investigated PV systems have not only different locations, but also different installed capacities, solar cell types, etc. The main technical characteristics of these two systems are presented in Table 1.


**Table 1.** Main technical characteristics of the investigated photovoltaic (PV) systems.

From 5 May 2017 until 10 April 2018, the PV system in Oldenburg generated 675.63 kWh of electricity. The measured values of the PV system in Munich are available for the DLR institute since 5 March 2019 and, from this time until 30 June 2019, it produced 50,740.14 kWh of energy.

The technical characteristics of the PV systems were not considered in the predictive model and they are given here only for better understanding of the prediction results. Only measured values of PV power output were used in the prediction model. These measured values for both PV systems were recorded with a time resolution of 5 min.

#### *2.2. Weather Data*

The requirements for the predictive model defined that input data should be open and publicly available, in order to ensure low-cost operation of the approach. Therefore, the input weather data for the predictive model were taken from an online service "OpenWeatherMap" (OWM) of a company Openweather Ltd. The main activity profile of this company includes providing current weather data, historical weather data, and weather forecasts of different locations to developers who want to present meteorological data on their homepages, mobile applications, or other web services. If the developers make fewer than 60 calls per minute, the current weather and five-day weather forecasts are available for free. The data of OWM are licensed by the Open Data Commons Open Database License (ODbL). Because the data from OWM are available to the public, the data are here referred to as "publicly available weather reports" [20].

OWM provides current and forecast values of various weather parameters, including ambient air temperature, pressure, air humidity, cloudiness factor, wind speed, precipitation type, etc. Current and forecast values from OWM have different time resolutions; meteorological parameters of the current weather are given every 30 min, but the forecasted weather data are available only every 3 h. The data receiving time of the meteorological values from OWM is in coordinated universal time (UTC) [20].

The used publicly available weather reports do not contain measured and predicted values of GHI. In order to evaluate how the absence of GHI in input data impacts the accuracy of PV power prediction, further weather data were purchased for use as input data for the same model. Because these data are not publicly available, they are here referred to as "fee-based solar irradiance data". These fee-based data include measured values of GHI, which are collected by pyranometer. They also contain calculated values of GHI, direct normal irradiance (DNI), and diffuse horizontal irradiance (DHI) under clear-sky conditions and angles of solar zenith and azimuth. The GHI prediction is based on an optimized combination of different forecasting methods, including satellite cloud images and different numerical weather prediction models [21]. This dataset does not include other meteorological

parameters, such as temperature or humidity. The measured values and predictions of solar irradiance in the fee-based dataset are presented in 1 h and 15 min time intervals, respectively. These data were provided by the DLR Institute of the Networked Energy Systems and the Energy Meteorology Group, Institute of Physics, Oldenburg University. Kalisch et al. [21] described these data and published a dataset for the year 2014. For this study, the solar irradiance data (W/m2) were taken from the year 2019.

The problem of different time resolutions of the input data was dealt with in the data pre-processing step, as described later.

#### *2.3. Additional Descriptive Feature*

The publicly available weather reports from OWM do not provide measured and predicted values of GHI, which are strongly correlated with PV power output. Because of this reason, the input dataset with weather parameters was extended with an additional descriptive feature. Two possible features were investigated: PV power under clear-sky conditions and maximum PV power calculated from the PV power measurements of previous days. Then, these values were inserted in the input dataset and used for training of the model and for making predictions.

#### 2.3.1. PV Power under Clear-Sky Conditions

PV power output under clear-sky conditions was the first additional descriptive feature investigated within this study. The process of this feature generation consisted of three steps.

Step 1: Calculation of the total solar irradiance on a horizontal surface, which is called GHI, under clear-sky conditions. GHI was estimated with the help of pvlib, a special open-source python toolbox for modeling PV system performance [22]. This toolbox provides three different simple clear-sky models in order to estimate solar irradiance on horizontal surface under clear-sky conditions: Ineichen, Haurwitz, and simplified soils [22]. These and other clear-sky models were investigated by Reno et al. [23], and the Ineichen model was ranked among the more accurate approaches. For this study, GHI under clear-sky conditions was estimated in pvlib with the Ineichen model, because this model does not need very specific information and it showed good performance in Reference [23]. The Ineichen model in pvlib needs the following input data for the calculation of GHI under clear-sky conditions [22]:


At the end of the first step, a time series with GHI values under clear-sky conditions was generated for the definite location of the PV system.

Step 2: Calculation of PV system-specific clear-sky index *CSIPV system* (m2). For this step, the maximum value of the PV power measurements *PPV*, *max* (W) was divided by the maximum value of the GHI under clear-sky conditions *GHIclear sky*,*max* (W/m2) (see Equation (1)).

$$\text{CSI}\_{PV\,\,system} = \frac{P\_{PV,\,\,max}}{GHI\_{clear\,\,sky,\,max}}.\tag{1}$$

According to the motivation of this study, the prediction of the PV power output should be made without any technical information of the PV system. The calculated index is a parameter for initial assessment of the size, installed capacity, and efficiency of the PV system.

Step 3: Multiplication of GHI values under clear-sky conditions from the first step with the PV system-specific clear-sky index from the second step. The third step generates a time series of power output, which can theoretically be produced by the PV system under clear-sky conditions. In the study, this additional feature is called "clear-sky PV power".

Figure 1 presents two time series for one summer day (1 June 2017) and one winter day (1 December 2017) for the PV system in Oldenburg (see technical parameters of this PV system in Table 1). The yellow areas in Figure 1 present PV power output under clear-sky conditions. As it can be seen, this feature takes into account seasonal properties such as lower solar irradiance and the shorter daytime in winter.

**Figure 1.** PV power output under clear-sky conditions on (**a**) June 1st and (**b**) December 1st 2017 estimated for the PV system in Oldenburg with an installed capacity of 1.14 kWP.

#### 2.3.2. Maximum PV Power

The maximum PV power curve was the next additional descriptive feature investigated within this work. The calculation methodology of maximum PV power and the optimal number of days to look back were taken from a fully automated model of Hanke et al. [5]. Table 2 presents the working steps for the estimation of maximum power from PV power measurements of the last five days.

**Table 2.** Working steps for the estimation of the maximum PV power of the last five days.

Table 2 explains only the calculation approach for the estimation of the maximum PV power output; it does not explain why this parameter was calculated using power values of the last five days. This number of days was not chosen randomly; rather, Hanke et al. [5] investigated different numbers of days. In this study, the model was consistently trained with data from only yesterday up to data of the last 30 days. Then, the simulation results were compared to each other and to the measured values. Statistical errors, mean absolute error (MAE), and root-mean-square error (RMSE) were selection criteria for choosing the optimal number of days for training of the model. Because the model which was trained with the last five days had the lowest statistical errors in the study [5], this number of days was chosen for generation of the additional feature, i.e., maximum PV power, within this study.

Both maximum PV power and clear-sky PV power were used not only for training the model, but also for verification of the PV power prediction. The prediction of PV power output should not be greater than these additional features. If the predicted PV power was greater than the additional feature, then this predicted value was replaced with the additional feature value. The main advantage of this verification rule was that it prevented the prediction of PV power output overnight, as well as great overestimation by day.

#### *2.4. Data Exploration*

The predictive model, which was developed within this study, is a data-driven approach. For this kind of approach, input data and quality of these data play key roles in prediction accuracy. Therefore, exploration of the input data is one of the main steps in data pre-processing and feature selection [19].

Firstly, all input data were explored with the goal of determining whether the data suffered from any data quality issues. Both the publicly available weather reports with current meteorological parameters and the dataset with measured PV power output suffered from missing values. Only 2.9% of values in the publicly available weather dataset were missing. The dataset with the PV measurements had 5.7% missing values. The rule of thumb of Kelleher et al. [19] helped to decide whether these amounts of missing values were critical or not. This rule recommends removing a feature from the input dataset if the proportion of missing values exceeds 60% of the whole dataset. In this case, the amount of relevant information stored in the rest of the data is very low. According to this rule, the proportion of missing values in the datasets with weather parameters and PV measurements was uncritical, and both datasets could be used as input data for the predictive model. The next common data quality issue involves outliers; publicly available weather reports had only one outlier (humidity value) within the investigated period, and the PV dataset did not have any outliers in the same period. According to the OWM homepage, humidity is calculated as a percentage and varies between zero and one hundred. During data exploration, it was determined that the humidity value on 12 March 2018 at 21:00 was 107%.

Secondly, the quality of the numerical weather forecast from publicly available weather reports could also be evaluated, because the weather forecast was available for the whole investigated period of time. The prediction accuracy of publicly available weather reports was evaluated with the help of three statistical metrics.

Mean absolute error (MAE):

$$MAE = \frac{1}{n} \sum\_{i=1}^{n} |\overline{Y}\_i - \underline{Y}\_i|. \tag{2}$$

Root-mean-square error (RMSE):

$$RMSE = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left(\overline{Y\_i} - Y\_i\right)^2}. \tag{3}$$

Symmetric mean absolute percentage error (sMAPE) [24]:

$$\text{CsMAPE} = \frac{100\%}{n} \sum\_{i=1}^{n} \frac{\left| \overline{Y\_i} - Y\_i \right|}{\left| \overline{Y\_i} + Y\_i \right|}. \tag{4}$$

In Equations (2)–(4), *Yi* is the measured value, *Yi* is the predicted value, and *n* is the number of values.

The statistical metrics can only be calculated for the numerical variables. The meteorological parameter "precipitation" in publicly available weather reports is a categorical feature, which contains categories such as "rain" and "snow". In order to evaluate the forecast accuracies for this parameter, it was converted into dummy variables. After this procedure, the dataset contained two new columns with discrete variables: "precipitation (rain)" and "precipitation (snow)".

The meteorological parameters cover different ranges; for example, cloudiness and humidity cover the range [0, 100], ambient air temperature covers the range [−10, 29], etc. In order to compare them with each other, all values were converted into the range [0, 1] using range normalization [19]. Afterward, MAE, RMSE, and sMAPE for all meteorological parameters were calculated. The statistical metrics which indicate the quality of weather forecast of publicly available weather reports are presented in the Table 3.

**Table 3.** Statistical metrics of OpenWeatherMap (OWM) forecast: MAE—mean absolute error; RMSE—rootmean-square error; sMAPE—symmetric mean absolute percentage error.


Among all meteorological parameters, temperature and pressure, with an sMAPE of less than 3%, had the best prediction accuracy. The high statistical errors of cloudiness and precipitation make sense, because these meteorological parameters are the hardest to predict. The accuracy of humidity forecasting lay between the accuracies for temperature and cloudiness.

#### *2.5. Correlation Analysis and Feature Selection*

In the previous subsections, the meteorological parameters were investigated separately from the target feature, i.e., the PV power output. The next step was the calculation of the relationship between weather data, additional features (clear-sky PV power and maximum PV power), and measured PV power. The relationship between all these features was estimated with the help of covariance *cov* and correlation *corr* metrics [19]. The formulas for the calculation of these metrics are given below.

$$cov(a, b) = \frac{1}{n - 1} \sum\_{i = 1}^{n} \left( (a\_i - \overline{a}) \times \left( b\_i - \overline{b} \right) \right) . \tag{5}$$

$$corr(a,b) = \frac{cov(a,b)}{sd(a) \times sd(b)}.\tag{6}$$

In Equations (5) and (6), *a*, *b* represent features a and b, *a*, *b* represent the means of features a and b, and *sd* is the standard deviation.

The correlation values between meteorological parameters from publicly available weather reports, two additional features, and PV power output are presented as a heat map in Figure 2.

**Figure 2.** Correlation values between descriptive, additional, and target features. Values close to −1 mean a strong negative linear correlation, values close to 1 mean a strong positive linear correlation, and values around 0 mean no linear correlation [19].

The correlation analysis showed that air temperature and humidity have a stronger relationship with PV power output in comparison with the other meteorological features. The strong positive correlation between PV power and temperature can be explained by the fact that daily curves of the solar irradiation and air temperature were similar to each other. In general, an increase in solar irradiation also causes an increase in air temperature. The same explanation is valid for the strong negative correlation between PV power and humidity. However, in this case, the increase in solar irradiation causes a decrease in humidity.

Despite the commonly perceived fact that the PV generation strongly depends on the current cloud cover, the given probability for cloudiness from the OWM showed a weak correlation with the measured PV power. The calculated maximum PV power values and the expected PV power under clear-sky conditions had the strongest positive relationship with PV power output, as the correlation values were greater than 0.80. Pressure, wind speed, and both precipitation features had the lowest correlation with the measured PV power, or the relationship between these features was strongly non-linear.

The two precipitation features "rain" and "snow" were combined as a common feature, which received the name "precipitation". In the case of rain or snow, the parameter "precipitation" was equal to one; otherwise, it was zero. The correlation value between the new descriptive feature "precipitation" and the target feature "measured PV power" was −0.14.

The conducted correlation analysis helped for a better understanding of the data. Furthermore, the results of this analysis were applied for feature selection. Hall [25] defined the feature selection as a process of identifying and removing features which contain irrelevant or redundant information. The presence of irrelevant information can reduce the prediction accuracy and make the results less understandable. By the same token, the removal of irrelevant features can improve the performance of the machine learning algorithm, whereby the operation time decreases and the algorithm operates more effectively [25].

There are a lot of techniques for feature selection, but this research work used correlation values in order to identify and select the input features with relevant information. Only those descriptive features with correlation values greater than 0.1 or lower than −0.1 were selected for the predictive model. The selected features were air temperature, humidity, cloudiness, precipitation, maximum PV power, and clear-sky PV power.

The heat map in Figure 2 presents the correlation values for the whole observation period (one year). The next step was to investigate whether the correlation values were dependent on seasons or whether they remained constant throughout the whole year. For this purpose, the correlation was calculated for each month in the investigated period. Figure 3 presents the monthly correlation values among the measured PV power, four selected meteorological parameters, and two additional features.

**Figure 3.** Monthly correlation values among measured PV power, selected meteorological parameters, and additional features.

The correlation between the measured PV power and two additional features remained the strongest over the whole observation period. Moreover, the monthly correlation values between these parameters fluctuated slightly over the year; for example, the correlation values between the measured PV power and calculated maximum PV power ranged between 0.63 in December 2017 and 0.87 in April 2018. The relationship between the PV measurements and air temperature had strong seasonal dependency. The correlation between these features was much stronger in warm seasons. It decreased in autumn and reached its minimum in winter months. In spring, the correlation values began increasing again. The same tendency can be seen by observing the relationship between measured PV power and humidity. The correlation between the measured PV power and two remaining weather parameters did not fluctuate as strongly, and it lay between 0 and −0.25 for almost the whole year.

#### **3. Methodology**

#### *3.1. Machine Learning Algorithm*

There are two main approaches to PV power output forecasts: performance method and machine learning method. The performance or physical method needs technical specifications of the PV system and prediction of the solar irradiance for this location. However, the main aim of the developed predictive approach is to obtain the PV power output without any information about the PV system (except historical measured values of the generated power). Thus, the performance method could not be used according to the motivation of this study. The machine learning method does not need any information of the system. This was the first reason for choosing the machine learning approach. The second reason was an absence of the solar irradiance in the publicly available weather reports.

The next step was to select a machine learning technique among the many techniques of forecasting the PV power output. The artificial neural network (ANN) is currently the most used machine learning approach for the prediction of PV power. In 24% of all studies in Reference [10], the researchers predicted PV power using ANN models. ANN has many different variants with their own advantages and disadvantages, such as feed-forward, convolutional, recurrent, etc. This study used a special architecture of artificial recurrent neural network (RNN) named the long short-term memory network (LSTM).

LSTM was firstly proposed by Hochreiter and Schmidhuber [26]. This new model was developed in order to overcome a classic problem of the RNN, i.e., that error signals in the RNN blow up or vanish during their backpropagation through time. The LSTM model developed by Hochreiter and Schmidhuber [26] is not affected by this problem. Since then, the classical structure of LSTM was improved and developed further by many different scientists. One of the most relevant improvements was the addition of a component called a "forget gate", which was developed and described by Gers et al. [27]. This additional component and all standard components of the LSTM module are presented in Figure 4.

**Figure 4.** Internal structure of a single long short-term memory (LSTM) block (following the description of Hua et al. [28]).

The key components of the LSTM block are the cell state and gates (input, output, and forget). The cell state is displayed as a horizontal line running through the top of the LSTM module in Figure 4. This component is responsible for remembering the current state of the neural network. The gates are also marked and labeled in the figure. They consist of different units, such as a sigmoid layer (σ), tangent hyperbolic layer (tanh), and multiplication operator (×). These components help to control the information flow to the memory cell. The forget gate looks at the input of the current cell and the output of the previous cell and decides how much information remains in the current cell state. This gate returns a number between 0 and 1. The next step is to control what new information is stored in the current cell state. This is the task of the input gate, which combines sigmoid and tanh layers. The last gate, i.e., the output gate, decides what information flows to the next memory block [28].

The following characteristics of the LSTM network were the main reasons for choosing it for the predictive model of this study [26]:


The detailed mathematical explanation of the LSTM, as well as application cases, can be found in References [26–28].

#### *3.2. Description of Developed Predictive Model*

After the neural network structure was chosen, it was decided which programming language and framework would be used to develop the predictive model. The predictive model was designed and trained with machine learning library Keras, which was written in Python. This open-source framework was developed as part of the research project ONEIROS (open-ended neuro-electronic intelligent robot operating system). Keras is intended for the quick implementation of neural networks, both convolutional and recurrent, for different experimental purposes [29].

However, before training the model with the chosen machine learning technique, i.e., LSTM, the input data were prepared. The data preparation, model training, and other main steps of PV power forecasting are presented in a simplified flow chart in Figure 5.

**Figure 5.** Simplified flow chart of the predictive model.

The developed predictive tool is initialized at the very beginning of the operation. In this case, initialization means a definition of parameters such as geographical coordinates, location, and name of the additional descriptive feature. These parameters are necessary to generate the additional feature, i.e., clear-sky PV power or maximum PV power. The reasons and procedures of additional feature generation were described in Section 2.3. If maximum PV power was chosen as the additional input feature, the predictive tool waited five days to collect enough data to calculate the maximum power output (see Section 2.3.2).

After the definition of the additional feature, the main process of PV power prediction could start. This process occurred continuously in an endless loop. Each iteration of this loop started by making a decision about the execution of the training process (see rhombus in Figure 5). In the case of a positive decision, the predictive model was trained with updated weather data and PV measurements. A positive decision occurred twice per day, at 00:00 and at 12:00. As known from the previous chapter, all input data used in this study had data quality issues, e.g., missing values and different time resolutions. Therefore, data pre-processing was an unavoidable step, which included the following functions: timestamp transformation of PV measurements from local time UTC + 01:00 to UTC, detection of the double or completely inconsistent timestamps, imputation of missing values by linear interpolation, and transformation of data time resolutions into the pre-defined unified time resolution. This pre-defined unified time resolution for all used input data was 30 min. The last step of data pre-processing was the normalization of values. The normalization involved a scaling of values into the range [0, 1] using range normalization. The descriptive and target features were scaled separately from each other, and the scalers were saved for future use in the forecasting process, i.e., for scaling of the input weather forecast values and rescaling of the predicted PV power values.

After the normalization step, the scaled values of the weather reports and the PV measurements were used to train the predictive model. The training process consisted of three steps: architecture definition, compilation (configuration of the learning process), and training. The architecture of the model had the following characteristics:


Then, the learning process was configured and the model was trained for 100 epochs. After this process, the model architecture and the weights were saved in order to use them later for the prediction of PV power output.

As mentioned above, the model was trained only at midnight and at noon. At other times, the prediction of PV power output was made with the help of the model, which was saved after the previous training step. The last steps in the loop included rescaling of the predicted values and evaluation of the forecast accuracy.

#### *3.3. Content and Sizes of the Training and Test Sets*

Because the PV power output depends strongly on the weather conditions and seasons, it was checked whether the developed predictive model can forecast the PV power output for different seasons. For this purpose, the model was trained with the weather data of warm and cold periods, and then the trained model was used to make predictions for warm and cold periods, respectively.

Both the content and the size of the training set were varied. Figure 6 shows four different sizes of the training set (seven days, 14 days, 30 days, and 90 days), one constant size of the test set (23 days), and a general splitting of the whole dataset into training and test sets.

**Figure 6.** Splitting the dataset into training and test sets.

As seen in Figure 6, the first prediction of PV power output always began at the same time point of the test set, independent of the training set size. One after the other, the pre-defined training set sizes were used to train the model and make predictions of PV power. Then, the impact of the training set sizes on the forecast accuracy was investigated by comparing the predictions with each other.

#### *3.4. Evaluation of Prediction Accuracy*

The evaluation of the prediction accuracy was done by calculating standard statistical errors MAE, RMSE, and MAPE (see Equations (2)–(4)). Antonanzas et al. [10] described an adaptation of the classical MAPE for the evaluation of PV power forecasting.

$$MAPE = \frac{100\%}{n} \sum\_{i=1}^{n} \frac{\left| P\_{\text{pred}} - P\_{\text{meas}} \right|}{P\_0} \,\tag{7}$$

where *P*<sup>0</sup> is the installed capacity of the PV system.

Another measure for estimation of the forecast accuracy is the mean absolute scaled error (MASE), which was described in References [10,30].

$$MSE = \frac{MAE}{\frac{1}{n-m} \sum\_{i=1}^{n} \left| P\_{m \text{ans},i} - P\_{m \text{ans},i-m} \right|}. \tag{8}$$

This error differs from classical statistical errors in the fact that MASE is independent of the scale of the data. An MASE less than one points to the used predictive method being better than the average naïve forecast [30]. The naïve forecast and persistence forecast represent simple forecasting methods, whereby the forecast value is equal to the last measured value. There is an extension of the classical naïve forecast for seasonal data called the seasonal naïve forecasting method. According to this method, the forecast value is equal to the last measured value of the same season (season can be day, month, year, etc.) [31].

Because PV power output can be defined as seasonal data, it was decided to use the seasonal MASE to evaluate prediction accuracy in this study. Some assumptions which were needed to calculate MASE within this work are listed below.


In addition to statistical errors, the difference between the measured and predicted daily energies plays an important role in energy management systems. This measure is called the energy forecast error, and it was calculated with the following equation [32]:

$$
\Delta E = \frac{E\_{\text{daily, modeI}} - E\_{\text{daily, measured}}}{E\_{\text{daily, measured}}}.\tag{9}
$$

The sign of the energy forecast error indicates whether the predictive model overestimates or underestimates the measured energy production of PV system.

#### **4. Results and Discussion**

In this chapter, the PV power prediction, made with the publicly available weather reports, is presented, discussed, and compared with the prediction with fee-based solar irradiance data. Moreover, it was also checked whether the developed predictive model meets the defined requirements, such as self-learning ability, transferability, etc.

#### *4.1. Choice of the Additional Descriptive Feature*

Section 2.3. described two additional descriptive features: clear-sky PV power and maximum PV power. Here, these two features were compared by making predictions of PV power output and estimating the prediction accuracy. The developed predictive model was trained with the same weather data and the same power measurements of the PV system in Oldenburg (the installed capacity of this PV system is 1.14 WP). Both training datasets contained 90 days of data. The architecture of the predictive model and the procedure of the predictive process also remained the same. The only difference was that the first input dataset contained maximum PV power and the second input dataset contained clear-sky PV power as the additional descriptive feature.

After training the model and predicting PV power, the MAE of this prediction was calculated for each day in the test set. These MAE values are presented in Figure 7 in the form of boxplots.

**Figure 7.** Distribution of daily MAE values of PV power prediction after training the model without any additional feature and with maximum PV power or clear-sky PV power as additional features (training set 90 days; test set 8 August 2017–31 August 2017).

A boxplot is a very representative way of displaying the distribution of values. The main components of boxplots are the box, median, whiskers, and outliers. The box or interquartile range contains the middle 50% of all values. The line in the box indicates the median. The space between the lower whisker and the box covers the range between the minimum and lower quartile. The space above the box includes the values from the upper quartile until the maximum.

The first important conclusion which can be drawn from Figure 7 is that the extension of input data with one of the additional features (maximum PV power or clear-sky PV power) improved the prediction accuracy significantly; the median MAE decreased from 150 W to 67 W. The second important conclusion is that both additional features resulted in a similar accuracy of the PV power prediction. MAE values of these predictions varied between 25 W and 125 W. The only difference between these prediction accuracies was the distribution of the errors between the lower quartile, interquartile range, and upper quartile. Because MAEs of the predictions with both additional features had almost the same range, they could be equally used for PV power prediction.

To generate the maximum PV power, only two parameters are needed for the calculation: measured power of the PV system and pre-defined number of days looking back (see calculation procedure in Section 2.3.2). The feature "clear-sky PV power" required more input data for its generation, including geographical coordinates, location name, and calculated specific clear-sky index for the feature generation. Thus, the additional feature "maximum PV power" was more independent than the feature "clear-sky PV power". For this reason, it was decided to use the maximum PV power as the additional descriptive features for all predictions, as described later.

#### *4.2. Prediction with Publicly Available Weather Reports*

The developed predictive model with publicly available weather reports as input data must be able to make reasonably good predictions of PV power output. Moreover, the model must meet the requirements defined in Section 1. One of these requirements was the suitability of the model for different seasons of the year. This requirement was checked using the PV system in Oldenburg. Therefore, the predictive model was tested for two periods of time with different weather and solar irradiance conditions: a warm period from 8 August until 30 August 2017 and a cold period from 19 March until 10 April 2018.

The prediction accuracies for warm and cold periods are displayed in Figure 8 in the form of boxplots with the distribution of MAE, RMSE, MAPE, and MASE values. The predictions for both warm and cold periods were made using the predictive model, which was trained with four training set sizes one after the other. All training and test sets contained the data from publicly available weather reports. The *X*-axis in the figure presents the four sizes of the training set, which were used to train the model. The *Y*-axis shows the distribution of the error values for warm and cold periods.

**Figure 8.** Daily values of MAE, RMSE, MAPE, and mean absolute scaled error (MASE) of PV power prediction for warm and cold seasons, which were made after training with four sizes of the training set containing publicly available weather reports.

The training dataset size was increased step by step in order to investigate and improve the prediction accuracy of the model. The prediction accuracy improved with the increase in size of the training set. This was especially significant in the cold period; for example, if the training set was increased from seven days to 90 days, the median MAE of the cold period decreased from 51 W to 45 W (almost 12%).

The next important point, which can be seen in Figure 8, is that the scale-dependent errors MAE and RMSE in the cold period were lower than in the warm period (see Figure 8a,b). These results can be explained by the fact that the measured values of PV power in winter were much lower than in summer. The percentage error MAPE had the same disadvantage. For this reason, it was important to calculate the scale-independent metric MASE and to use it for comparison of the prediction accuracy across different seasons.

The distribution of the daily MASE values for the two seasons and four training sets are also presented as boxplots in Figure 8d. It is obvious from this figure that the prediction of the PV power in the cold period was less accurate than that in the warm period. The MASE medians of the cold period lay above 1.0 for almost all training sets, except for the set with 90 days, and the MASE medians in the warm period were about 0.90 for all training sets. One of the possible reasons for the better prediction in the warm period is that solar irradiance and, consequently, PV power output in the warm season was more stable, whereas the cold season had a lot of days with strongly fluctuating solar irradiance during the day. However, the developed predictive model should forecast the PV power output for all seasons of the year equally well. In this case, only the training set with 90 days could ensure appropriate prediction accuracy for both warm and cold seasons; the MASE medians of the PV power prediction for this training set were about 0.90, regardless of the season.

After testing whether the predictive model could accurately predict PV power for different seasons, the same model was also tested with regard to forecasting PV power for a completely different PV system. This system is located in Munich, Germany, and its installed capacity is almost one hundred times greater than the PV system in Oldenburg. The technical parameters of this PV system were presented in Table 1. The power output prediction for the PV system in Munich was also made for 23 days from 8 June until 30 June 2019. The dataset of the Munich PV system was split in the same way as the dataset of the Oldenburg PV system (see Figure 6).

Because the installed capacity of the Munich PV systems was much greater than the capacity of the Oldenburg PV system, only the scale-independent statistical metric MASE could be used for comparison of the prediction accuracy of these two systems. The average values of MASE for the whole test periods of the two PV systems are displayed as a dot chart in Figure 9.

**Figure 9.** Average of MASE values for two different PV systems in Oldenburg and Munich.

The predicted values of the Munich PV system had similar average values of MASE to the predicted values of the Oldenburg's system. The most accurate prediction occurred again after 90 days of training, but the extension of the Munich training set to 90 days led to a greater improvement of the prediction accuracy. Figure 9 not only shows the forecasting quality; it also verifies that the developed predictive model was able to forecast the power output for the different PV systems without any technical information about these systems, except for the measured power values, using only publicly available weather reports without direct prediction of GHI.

#### *4.3. Prediction with Fee-Based Solar Irradiance Data*

In this section, the PV power prediction with publicly available weather data was compared to the prediction with fee-based solar irradiance data. The main difference between these two data sources lies in the fact that publicly available weather data contain only indirect values of the GHI such as cloudiness and precipitation, while fee-based data contain measurements and predictions of the GHI, which are highly correlated with PV power output. The prediction of PV power output presented in this section was also made using the same predictive model described in Section 3.

In the previous subsection, it was proven that the developed predictive approach could be used to make predictions for two different PV systems. This is why fee-based solar irradiance data were used to make predictions for only one PV system. The PV system in Munich was chosen for this purpose.

Firstly, the MASE values of the PV power predictions with publicly available weather reports and fee-based solar irradiance data for all training sets were compared with each other, in order to determine the optimal sizes of the training set for each data origin. Figure 10 shows these values.

**Figure 10.** MASE of PV power predictions with publicly available data and fee-based data. The prediction was made for the Munich PV system (installed capacity of 99.9 kWP).

Figure 10 shows that the prediction accuracy of the model with the solar irradiance data was much better when using shorter training datasets. The extension of the training set with solar irradiance data from 30 days to 90 days led to an increase in MASE. The predictive model made the most accurate prediction if the training set contained 14 days of solar irradiance data or 90 days of publicly available data. The PV power predictions with these exact training set sizes are compared below.

The next measure for the comparison between predictions, made with two data origins, was the error between the predicted and measured daily energies, calculated using Equation (9). The normalized distribution of energy forecast errors of the predictive model, which was trained with 90 days of publicly available weather data and 14 days of fee-based data, is displayed in Figure 11.

**Figure 11.** Normalized distribution of the energy forecast errors of predictions with publicly available weather reports (training set with 90 days) and fee-based data (training set with 14 days). The prediction was made for the Munich PV system (installed capacity of 99.9 kWP).

Two main conclusions can be drawn from Figure 11. The first conclusion is that the developed predictive model was slightly inclined to overestimate the measured values regardless of the input data used; about 60% of the energy forecast errors had a positive sign The second conclusion is that the predictive model with fee-based solar irradiance data could predict daily energy more accurately than the same model trained with publicly available weather reports. The energy forecast errors of the prediction with solar irradiance data varied between −10% and 60%, but the usage of publicly available weather reports for the input data led to an increase in forecast errors in both directions, i.e., overestimating and underestimating.

It is relevant to consider the predicted PV power output not only for the whole test period, but also for single days. This is why two days with different solar irradiance were selected from the test set: one with clear-sky conditions during the whole day (29 June 2019) and another with strongly fluctuating solar irradiance (20 June 2019). The developed model was used to make predictions for these days with publicly available weather reports and fee-based solar irradiance data. The results were compared to each other and to the measured values. The predicted and measured power curves for the selected days are presented in Figure 12.

**Figure 12.** Measured PV power of the Munich PV system in comparison to the PV power predictions made by the model with publicly available weather reports (training set with 90 days) and fee-based solar irradiance data (training set with 14 days) on 20 and 29 June 2019.

Despite the smaller training set, the model trained with fee-based irradiance data could predict PV power output more accurately for both selected days than the model trained with publicly available weather reports. Moreover, the training with solar irradiance data led to the model predicting single peaks and drops of the PV system accurately even with fluctuating PV power production on 20 June 2019. Despite the fact that publicly available weather reports do not have direct predictions of the GHI, the model trained for 90 days with this dataset could forecast the main trends of the PV power production for both days. Furthermore, this model was also able to predict the rapid power drop on the day with strongly fluctuating solar irradiance (20 June 2019 at 12:00).

#### V **5. Conclusions**

The developed predictive approach is a data-driven method, where the quality of input data plays a key role. Therefore, the accuracy of the publicly available weather reports was investigated at the very beginning of the study. The accuracy of the PV power output prediction cannot be better than the accuracy of the used input data.

The evaluation of the prediction accuracy indicated that the machine learning approach provided adequate results of PV power prediction for the next 24 h even with publicly available weather reports. Although the publicly available weather data from OWM are to be used mainly on websites and

mobile applications, they can be also used for the purposes of PV power prediction. This study also proved that it is possible to predict the PV power output without solar irradiance measurements and forecasts, and without any technical information about the PV system, except the measured power values. Because publicly available weather reports do not provide GHI values, this input dataset was extended with an additional descriptive feature, i.e., the maximum PV power of the last five days. The addition of this feature improved the prediction accuracy, prevented the prediction of PV power overnight, and avoided overestimated predictions of PV power by day.

The requirements of the predictive approach were defined in the motivation of the study. The first requirement was a fully automated online operation of the day-ahead PV power forecasting. This requirement was proven during the simulation, where the weather forecast was updated every 3 h. In the same time interval, the PV power was predicted for the next 24 h. The constant updating of the training set with current weather data and PV measurements resulted in a periodic re-training of the predictive model every 12 h. The second requirement was the transferability of the model for different seasons and PV systems. The suitability for different seasons was tested using a simulation with the dataset from Oldenburg, which contains weather data for warm and cold periods. The comparison of the simulation results with these two datasets pointed to an influence of seasons on prediction accuracy, as well as the ability of the model to adapt to seasonal weather changes. The transferability of the model to the PV systems with different locations, sizes, and technical parameters was proven by the prediction of the PV power for two completely different PV systems. During the transferability check, the main disadvantage of MAE, RMSE, and MAPE was detected. Therefore, the seasonal MASE was selected to compare the forecasting accuracies across different seasons and PV systems.

Afterward, the PV power predictions with publicly available weather reports were compared to the predictions with fee-based solar irradiance data. The predictive model, which was fitted with publicly available weather data, needed more training data in order to make more accurate predictions of the PV power. The best accuracy of the prediction with publicly available weather reports occurred using the training set with data from the last 90 days (the time resolution of all data was 30 min). If fee-based solar irradiance data were used, the training set with the last two weeks of data led to the most accurate prediction. The predictive model which used solar irradiance data not only had better prediction accuracy, but it also forecasted single power peaks and drops of the PV system more accurately than the model which used publicly available data.

The developed predictive approach with publicly available weather data is not suitable for applications which require higher accuracy and finer resolution, e.g., grid stabilization. However, the accuracy of the developed predictive approach with publicly available weather reports is suitable for other applications, such as forecast-based energy management system for buildings with PV systems. An energy management system based on PV power prediction can increase the self-consumption of the PV system and optimize its operation and flexible loads, such as BEVs and heat pumps. Moreover, if the PV power prediction is considered for the distribution of energy demand over the day, it can support a reduction in peak load, which prevents power limits on the house connection point from being exceeded and avoids high grid fees.

The publicly available weather reports from OWM and the PV power measurements are continuously recorded in order to collect more input data for training and validation of the developed predictive approach with a greater input dataset. This greater input dataset contains more information about seasonal circumstances, rapid weather changes, and relationships between weather data and PV power, which can improve the accuracy of the predictive approach.

In this study, the accuracy of the developed predictive model with publicly available weather reports was improved in different ways, such as selection of appropriate input features and machine learning algorithm, optimal configuration of the LSTM network, increase in training set size, etc. However, the prediction accuracy may also be improved in the future if publicly available weather data sources provide measurements and prediction of solar irradiance.

Future investigations should not only deal with the improvement of the developed predictive approach, but also investigate many other questions. One of these refers to the optimal size of the training set. The accuracy of PV power prediction strongly depends on the type of solar irradiance for the next 24 h; in other words, the prediction accuracy depends on whether the prediction is made for a day with constant solar irradiance or for a day with strongly fluctuating solar irradiance. Therefore, it will be interesting to investigate dynamic sizes of the training set, i.e., the training set size can be adapted automatically depending on the weather forecast and type of solar irradiance. Dynamic training set sizes can also be more efficient in cases of disruptive changes like snow cover on PV modules or the failure of strings.

The focus for future investigation can be on an economic analysis of the PV power prediction. This study used only statistical metrics (MAE, RMSE, MAPE, and MASE) to evaluate the performance of the developed predictive approach and to estimate influences of the data origin on the prediction accuracy. Additionally, predictions made with different data origins can also be evaluated with economic metrics, e.g., whether paying for solar irradiance data can provide considerable economic benefit. Another goal for further investigations is a combination of the developed machine learning approach for PV power prediction with another approach for load prediction. These two approaches use different input features, predict different target features, and can use different measures for prediction accuracy evaluation. However, in order to evaluate the combination of these predictions uncertainty quantification needs to be conducted for both approaches. Afterward, these predictive tools can be integrated into the energy management system of the buildings.

**Author Contributions:** Conceptualization, N.M., J.-S.T. and B.H.; Data curation, N.M., B.H., T.S. and M.G.; Formal analysis, N.M.; Funding acquisition, B.H. and K.v.M.; Investigation, N.M. and J.-S.T.; Methodology, N.M., J.-S.T. and B.H.; Project administration, J.-S.T.; Resources, C.A., K.v.M. and M.G.; Software, N.M. and J.-S.T.; Supervision, B.H. and M.G.; Visualization, N.M.; Writing—original draft, N.M; Writing—review & editing, J.-S.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors would like to acknowledge the financial support of the "Federal Ministry for Economic Affairs and Energy" of the Federal Republic of Germany for the project "EG2050: EMGIMO: Neue Energieversorgungskonzepte für Mehr-Mieter-Gewerbeimmobilien" (03EGB0004G and 03EGB0004A). For more details, visit www.emgimo.eu. The presented study was conducted as part of this project.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Machine Learning Based Photovoltaics (PV) Power Prediction Using Di**ff**erent Environmental Parameters of Qatar**

#### **Amith Khandakar 1,\*, Muhammad E. H. Chowdhury 1, Monzure- Khoda Kazi 2, Kamel Benhmed 1, Farid Touati 1, Mohammed Al-Hitmi <sup>1</sup> and Antonio Jr S. P. Gonzales <sup>1</sup>**


Received: 27 May 2019; Accepted: 14 July 2019; Published: 19 July 2019

**Abstract:** Photovoltaics (PV) output power is highly sensitive to many environmental parameters and the power produced by the PV systems is significantly affected by the harsh environments. The annual PV power density of around 2000 kWh/m2 in the Arabian Peninsula is an exploitable wealth of energy source. These countries plan to increase the contribution of power from renewable energy (RE) over the years. Due to its abundance, the focus of RE is on solar energy. Evaluation and analysis of PV performance in terms of predicting the output PV power with less error demands investigation of the effects of relevant environmental parameters on its performance. In this paper, the authors have studied the effects of the relevant environmental parameters, such as irradiance, relative humidity, ambient temperature, wind speed, PV surface temperature and accumulated dust on the output power of the PV panel. Calibration of several sensors for an in-house built PV system was described. Several multiple regression models and artificial neural network (ANN)-based prediction models were trained and tested to forecast the hourly power output of the PV system. The ANN models with all the features and features selected using correlation feature selection (CFS) and relief feature selection (ReliefF) techniques were found to successfully predict PV output power with Root Mean Square Error (RMSE) of 2.1436, 6.1555, and 5.5351, respectively. Two different bias calculation techniques were used to evaluate the instances of biased prediction, which can be utilized to reduce bias to improve accuracy. The ANN model outperforms other regression models, such as a linear regression model, M5P decision tree and gaussian process regression (GPR) model. This will have a noteworthy contribution in scaling the PV deployment in countries like Qatar and increase the share of PV power in the national power production.

**Keywords:** PV power prediction; artificial neural network; renewable energy; environmental parameters; multiple regression model

#### **1. Introduction**

Due to global warming and climate change concerns, many pieces of energy legislation and incentives to promote the use of renewable energy have been established worldwide. Among renewable energy resources, photovoltaics (PV) energy is one of the most-promising supplements for fossil fuel-generated electricity, and has received a lot of attention recently it is abundant, inexhaustible, and clean. Arabian Peninsula is blessed with solar irradiance of more than 2000 kWh/m<sup>2</sup> annually [1]. Due to this high amount of solar irradiance in this region, PV technology has potential in comparison to other renewable energy sources (e.g., wind energy or tidal energy). Solar energy is gaining popularity day-by-day, due to some other salient features like noise and pollution free technology with low maintenance cost. Together with the ever-decreasing prices of PV modules and continuous

depletion of fossil fuel, it is expected that the penetration level of PV energy into modern electric power and energy systems will further increase. However, due to the chaotic and erratic nature of the weather systems, the power output of the PV energy system exhibits strong uncertainties regarding its intermittency, volatility, and randomness. These uncertainties may potentially degrade the real-time control performance, reduce system economics, and, thus, pose a great challenge for the management and operation of electric power and energy system. Predicting the power efficiency of a PV power plant is very crucial in making the best economic benefit out of it. The PV output power is directly related to the solar irradiance on the PV panels, which is a well-known fact. However, other meteorological parameters (e.g., ambient temperature, relative humidity, wind speed and dust accumulation) have been reported to influence the PV efficiency as well [2–5]. This association substantially increases in the harsh environment of the Gulf region. There are several recent works that showed the negative influence of dust accumulation on the PV panel on PV output power prediction [6,7]. The authors hypothesize that suitable weather parameters at a specific geographic location can be an important aspect for PV power forecasting. Moreover, PV power prediction can be particularly useful when multiple energy sources are combined to produce a hybrid energy matrix. Since solar energy source is highly intermittent, it is difficult to maintain system stability with an intolerable proportion of renewable energy injection. Solar power forecasting can be used to improve system stability by providing approximated future power generation to system control engineers. This will help the utility companies to devise a mechanism to design a switching controller to switch between the energy sources in a hybrid energy source [8]. It can be hypothesized that the key design parameters for the switching controller will be linked to the environmental parameters due to its potential effect in PV power generation.

Several recent works reported different approaches for PV output power forecasting and estimation. In detail, the specific literature on PV plant power production estimation presents three different types of models: Phenomenological, stochastic/statistical learning and hybrid ones. Deterministic approaches, based on physical phenomena, try to predict PV plant output by considering the electrical model of the PV devices constituting the plant using software like PVSyst, System Advisor Model (SAM). A deterministic approach was used to model electrical, thermal, and optical characteristics of PV modules [9]. Most of the published researches for PV power forecasting concentrate only on the deterministic forecast, i.e., point forecast. Deterministic forecasting methods sometimes fail to evaluate the uncertainties exhibited in PV power data. Probabilistic PV power forecasting models that can statistically describe these uncertainties have received much attention recently. One of the mainstreams for generating probabilistic uncertainty is to use an ensemble of deterministic forecaster. The main shortcoming of ensemble-based PV power forecasting model is their high computational cost—which may cause a real-time problem for practical implementation. Another demerit with respect to the methodologies used in deterministic and probabilistic PV power forecasting is their shallow learning models. Because of the complicated nature of the weather system, these shallow models may be inadequate to fully extract the corresponding nonlinear features and static traits in PV power data. Therefore, more investigation on the deterministic technique to provide high accuracy by optimization of artificial neural network (ANN) can lead to better performance.

By using the hourly solar resource and meteorological data, the model has been validated for different modules types. Statistical and machine learning ones, such as: Artificial neural network (ANN), support vector machine (SVM), multiple linear regression (MLR), adaptive neuro-fuzzy inference system (ANFIS) operate without any a priori knowledge of the system under consideration. They try to "understand" the relation between inputs and outputs by adequately analyzing a dataset containing acquired input and output variables collection. Statistical learning algorithms have many advantages. Firstly, they are able to learn from them, and they can also work in the presence of incomplete data. Secondly, once trained, they are able to generalize and to provide predictions. Their features make them suitable to be used in different contexts. Different machine learning (ML) algorithms to predict output power have been investigated for other renewable energy sources rather

than solar energy [10,11]. ML gives insights into the properties of data dependencies and shows the significance of individual characteristics in datasets [10]. Jawaid et al. [12], compared different ANN algorithms without showing the details of the prediction model and their comparative performance numerically. Several other works predicted the solar irradiance using machine learning techniques, rather than the PV power itself [13–15]. Some of the researches were only focusing on training and testing of one machine learning algorithm for PV power prediction [16]. An adaptive ANN was used to model and size a stand-alone PV plant, using a minimum input dataset [17,18]. An ANFIS was applied to model the different devices constituting a PV power system and its output signals [19]. A linear regression model and an ANN were applied to estimate daily global solar radiation [20,21]. Thirdly, a hybrid model can combine different models to overcome limitations characterizing one single technique [22]. In addition, "ensemble" methods [23] build predictive models by integrating multiple strategies in order to improve the overall prediction performance.

It can be noted that the previous works extensively explore different ML-based prediction models; however, there is still scope for improvement in prediction accuracy. In this manuscript, authors have reported the following new contributions:


The rest of the paper is organized as follows: Section 2 describes the materials and methodologies used in the paper; Section 3 describes the analysis techniques used in this work, Section 4 discusses the results, and finally, Section 5 concludes the work.

#### **2. Materials and Methodology**

To analyze the effect of PV performance due to the PV and environmental parameters, an in-house PV setup was designed and implemented, which acquired and recorded the PV performance and environmental parameters. The experimental setup comprised of two sub-systems (as shown in Figure 1: One at the rooftop which was equipped with sensors for acquiring the data and the other sub-system in the laboratory which was used for archiving the data and plotting them in real time.

The system on the terrace included PV modules (characteristics are shown in Table 1), signal conditioning circuits for all sensors of weather parameters, an Arduino Mega 2560 microcontroller, a wireless transceiver (XBee/Wifi) and a controllable electronic resistive load along with a DC-DC converter. Maximum power point tracking (MPPT) algorithm was implemented to produce pulse width modulation (PWM) signals to drive the controllable electronic load, which was emulating different levels of currents and voltage across the load without varying the actual load resistance itself. Power-voltage (P-V) and current-voltage (I-V) curves were plotted using the calibrated voltage and current sensors' data across the emulated electronic load. The MPPT was used to adjust the orientation of the PV panels to optimize solar irradiance, while achieving maximum PV output power yield. The sub-system at research laboratory consisted of XBee/Wifi adapters, connected to a workstation, for receiving and logging data from the rooftop subsystem wirelessly. All measurements from the rooftop sub-system sensors were received on demand or periodically at a specified time interval by these wireless adapters. Received data were recorded as a LabVIEW measurement file and displayed

on the LabVIEW front panel numerically on the workstation screen. The recorded parameters were also uploaded to an open Internet of things (IoT) data platform called Thingspeak [24] for widespread access. Both sub-systems communicate through an Xbee/EtherMega shield connected to the Arduino Mega 2560 microcontroller. The overall block diagram of the PV system is depicted in Figure 1.

**Figure 1.** Block diagram of experimental photovoltaics (PV) system set-up: Roof-top sub-system (left) and lab sub-system (right).

**Table 1.** Poly-crystalline silicon PV module characteristics (manufacturer: PTL Solar, UAE).


#### *2.1. Sensor Calibration*

The hardware components consist of a microcontroller (Atmega32), six sensors with signal conditioning circuits, a DC-DC Buck-boost converter and long-range XBee Pro wireless modules. The six sensors read the ambient temperature LM35DT (http://www.ti.com/product/LM35/datasheet/pin\_ configuration\_and\_functions#SNIS1593406), solar irradiance SP110 (http://www.apogeeinstruments.co. uk/pyranometer-sp-110/), humidity HSM-20G (http://www.geeetech.com/wiki/index.php/Humidity\_ /Temperature\_Sensor\_Module\_HSM-20G), dust GP2Y1010AU0F(https://digitalmeans.co.uk/shop/ compact\_optical\_dust\_sensorgp2y1010au0f), wind speed anemometer (https://www.adafruit.com/ products/1733) and the PV module surface temperature sensor PT100 (http://export.farnell.com/ labfacility/rtf4-3/sensorpt100-patch-3m/dp/1633500). The PT100 was fixed at the backside of the PV module using a highly thermally conductive adhesive. Also, the voltage and current transducers are used to sense voltages and currents from the PV module in order to plot the P-V and I-V curves. Before installing the overall system, all sensors were tested and calibrated methodically. The BK PRECISION-720 humidity and temperature meter are used as a reference when calibrating the humidity, surface and ambient temperatures' sensors. The temperature of a heating element is controlled to generate various ambient and surface temperatures for sensors' calibration. The HSM-20G sensor

was calibrated using steam generated inside an encapsulated box where both the HSM-20G humidity sensor along with the BK-PRECISION-720 m was placed. Simultaneous measurements were performed by taking the readings from both the BK PRECISION meter and the humidity and temperature sensors. The commercially available INSPEED VORTEX wind speed sensor, using a CATEYE VELO8 display, was used to calibrate the anemometer (wind speed sensor with analog output). The voltage and current sensors were calibrated using Yokogawa GS510 SMU (source measurement unit) with standard procedures. For the dust sensor, we used the firm calibration curve. In the laboratory, different dust levels were deposited on the sensing element of the GP2Y1010AU0F sensor, which were found to be within the operating range of this sensor. All the calibration results were repeatable. The output voltages of the various sensors were amplified in order to match the full-scale analog range of the microcontroller's analog-to-digital converter (ADC) without causing ADC saturation errors. However, the dust and PV surface temperature sensors do not directly provide analog signals, and a circuit was developed so that they can be interfaced to the microcontroller. For the PV surface temperature sensor, which is a resistive type (resistance temperature dependent, RTD), a constant current source circuit is devised to provide an output voltage that is linearly dependent to the variation of resistance. For the dust sensor, its output pulse lies on a 0.32 ms pulse width that needs to be acquired correctly by sampling at 0.28 ms of the pulse. All the sensors conditioning circuits are integrated into a single printed circuit board (PCB).

The maximum power point tracking (MPPT) used a back-boost converter, which serves as a direct load to the PV modules. Through a gate driver circuit, by adjusting the firing angles of the insulated gate bipolar transistor (IGBT) switch of the back-boost converter, the microcontroller keeps adjusting the output voltage of the PV modules until reaching the maximum power point. Then, the microcontroller reads the corresponding voltages and currents of the PV module through the voltage and current transducers above discussed. Furthermore, two XBee Pro transceivers were used to transfer the measured data wirelessly from the rooftop sensors and electronics modules to a LABVIEW-based monitoring station (Figure 1), which plots I-V curves, P-V curves and also save the measured data for future analysis.

#### *2.2. Machine Learning-Based Prediction*

The process of applying ML on any dataset to predict unknown output values consists of three general phases (Figure 2): Pre-processing of data to extract features, training the prediction models and observing validation accuracy on training dataset and evaluation of the pre-trained model for the test dataset. Firstly, the acquired dataset was pre-processed to make it suitable in format, free of anomalies, such as missing, outliers and erroneous data values. Most importantly, then the relevant features were extracted. We have used the collected parameters, e.g., Temperature, Relative Humidity, PV surface Temperature, Irradiance, Dust accumulation and Wind Speed as features for the training and testing; which eased this sub-task. Training and testing dataset was created using the cvpartition function in Matlab, which allows to randomly partition the training and testing data into 80% and 20% respectively. In this study, 380 instances were used for training and validation, whereas 95 instances were used for testing. In the prediction phase, data with known output response values were used for training several ML algorithms using Regression Learner from Statistical and Machine Learning Toolbox and Neural Network toolbox of Matlab.

**Figure 2.** Block diagram of the machine learning-based training and prediction stages.

An additional step of feature selection can be used to optimize the trained algorithm. Selection of features is the process of selecting a subset of relevant, high-quality and non-redundant features to create learning models with better accuracy [25,26]. Several feature reduction techniques were tested to obtain the optimized prediction model with the selected subset of features. In the testing phase, the best performing pre-trained models were evaluated for test dataset, and the evaluation parameters were computed to perform the reliable statistical evaluation. In addition, this process can be made adaptive and can be accomplished to improve model quality as historical data gradually becomes available.

#### *2.3. Features Selection*

After processing the data acquired from the data acquisition system, as shown in Figure 1, the acquired PV and environmental parameters were used as features for the training, validation and testing purpose. However, it is important to evaluate whether the complete set of environmental parameters are necessary for the prediction or the feature number can be reduced. Correlation feature selection (CFS) and Relief feature selection (ReliefF) techniques were used to select most contributory features. CFS technique selects feature sub-sets, based on correlation-based heuristic evaluation function, and uses a sub-set search method and calculates the level of redundancy between features in all sub-sets created. It then evaluates the importance of sub-sets, where the low inter-correlation, but high-correlation to the target result are selected. ReliefF is an instance-based algorithm that assigns a relevance weight to each feature that reflects its ability to differentiate class values. Because of sufficient data, ReliefF has the potential to detect interactions higher than pairwise. In order to select the best subset with ReliefF from the ranked features, the lowest ranked features were iteratively removed until the best result was achieved.

#### *2.4. Prediction Models*

There are many predictive methods, based on ML, and they can perform differently for the given datasets. Several simple and popular regression and prediction models were attempted in this work to estimate the PV output power. These are namely simple linear regression [27], gaussian process regression (GPR) [28] from the regression learner, M5P regression tree [29]. The simple linear regression model has a linear relationship between the output response and the input parameters. GPR involves a Gaussian process using lazy learning and a measure of the point similarity (kernel function) to predict the value from the training data for an unseen point. The prediction is not only an estimate for that

point, but also information about uncertainty. It is a one-dimensional Gaussian distribution (which at that point is the marginal distribution) [28]. In the M5P regression tree model, a tree-based model with an M5 algorithm developed by Quinlan, 1992 [29], combines a conventional decision tree with the linear regression functions at each branch end of the tree; it creates a model that predicts the target's value by learning simple decision rules [30]. In other words, predicted power would be the result of "if... then... else..." statements [8].

In this work, the ANN was also used to predict daily PV output power, which is a very popular machine learning tool for classification and regression application [31]. Figure 3 provides the layered structure of the ANN along with the detailed depiction of forward propagation and weight adjustment. The ANN tries to replicate the machine learning in the similar nature of the human brain with a layered structure (input, hidden and output layers) (Figure 3). Models of ANNs take the form of artificial neurons where a number of inputs are given to each neuron. The activation function is applied to these inputs resulting in neuron activation level (neuron output value) and learning knowledge is provided in the form of training inputs and output pairs (Figure 3). More details on the various training functions/algorithms are listed in Table 2. Each training function has their own advantages and disadvantages, and they work differently with different datasets. It was necessary to explore all the training functions to check which of them works the best for the dataset developed and used in this work.

**Figure 3.** The layered structure of the artificial neural network (ANN).

**Table 2.** Training algorithms of a ANN in Matlab and their description.


*Energies* **2019**, *12*, 2782

The artificial neural network and other regression-based predictors were implemented on Matlab 2017b version on a workstation with the below specification:

Processor: Intel® Core™ i7-7500U CPU @ 2.70 GHz Installed memory (RAM): 16.0 GB System type: 64-bit operating system, x64 based processor

In this work, various combinations of the number of hidden layers and training functions were explored to find the best combination that predicts the PV power most accurately, as shown in Figure 4. An in-house written Matlab script was used to train automatically 10 different training functions (Table 2). The script was written to change the number of hidden layers from 10 to 300 in increments of 10 and each training was performed using a particular training function and a specific number of hidden layers. This is due to the fact that each run provides different network and the best network out of that 10 tries is selected for that specific combination of training function and number of hidden layers. Later, a comparison was carried out between the best set of networks with different training functions, and the number of hidden layers and the best network for each function amongst all the combination was selected. A final comparison was made with the best network of different training functions, and the best network among all functions was selected. Figure 4 shows the flowchart of how the best model selection was carried out.

**Figure 4.** Method to find the best ANN to predict PV power.

Table 3 summarizes the network settings for the ANN-based PV power prediction. It can be seen in Table 3 that the optimum number of hidden layers providing the best model were different for all features, CFS technique and ReliefF technique.



#### *2.5. Bias Calculation and Correction in Prediction*

The biased forecast is described as a tendency to either over-forecast (the forecast is more than the actual), or under-forecast (the forecast is less than the actual). To improve the forecast accuracy in the presence of bias is possible if the bias is correctly identified. The correction of the forecast error can be achieved by adjusting the forecast by the appropriate amount in the appropriate direction, i.e., increase it in the case of under-forecast bias, and decrease it in the case of over-forecast bias.

Two different techniques are used in this work to calculate the bias in the forecast:


Tracking Signal-Based Technique

The other common metric used to measure forecast accuracy is the tracking signal. The "Tracking Signal" quantifies "Bias" in a forecast. No product can be planned from a badly biased forecast. Tracking Signal is the gateway test for evaluating forecast accuracy. The tracking signal in each period is calculated using the formula as follows:

> Tracking signal <sup>=</sup> Actual <sup>−</sup> Forecast ABS(Actual − Forecast)

Once this is calculated, for each period, the numbers are added to calculate the overall tracking signal. A forecast history totally void of bias is returned a value of zero, with 12 observations, the worst possible result would return either +12 (under-forecast) or −12 (over-forecast). Such a forecast history generally returns a value greater than 4.5 or less than negative 4.5 would be considered out of control. NFM Technique

Normalized Forecast Metric (NFM) can be used to measure the bias. The formula of NFM to calculate bias is:

$$\text{NFM} = \frac{(\text{Forecast} - \text{Actual})}{(\text{Forecast} + \text{Actual})}$$

As can be seen, this metric stays between −1 and 1, with 0 indicating the absence of bias. Consistent negative values indicate a tendency to under-forecast, whereas consistent positive values indicate a tendency to over-forecast. Over a 12 period window, if the added values are more than 2, we consider the forecast to be biased towards over-forecast. Likewise, if the added values are less than −2, we consider the forecast to be biased towards under-forecast. A forecasting process with a bias eventually get off-rails unless steps are taken to correct the course from time to time.

The bias correction and change factor methods work well for bias correcting non-stochastic variables. The quantile mapping (QM) technique removes the systematic bias in the predicted output and has the benefit of accounting for biases in statistical downscaling approaches.

#### **3. Analysis**

Several statistical analyses were carried out to evaluate the performance of machine learning algorithms for PV output power prediction. To compare models' performances, various evaluation metrics are commonly used: (i) Correlation coefficient, which measures the linear dependency between two variables; (ii) mean absolute error (MAE), which takes the average of the absolute difference between the real and predicted values; (iii) mean square error (MSE) measures the average squared error and the square difference between target and predicted values were calculated and averaged; (iv) root mean square error (RMSE) is the square root of MSE and similar to MAE, but it averages the squares of the difference and then finds the square root where it actually puts a heavier weight on larger errors; and (v) coefficient of determination (R2) always lies between − ∞ to 1 and is the ratio between how well the prediction model in comparison to naive mean model. These parameters provide better descriptions of predictor performance [32].

$$\text{Correlation Coefficient}, \mathbf{r} = \frac{\text{Con}(\mathcal{X}, \mathcal{Y})}{\sigma\_{\mathbf{x}} \sigma\_{\mathbf{y}}} \tag{i}$$

$$\text{Mean absolute error, } \text{MAE} = \frac{1}{\mathbf{n}} \sum\_{\mathbf{n}} |\mathbf{X} - \mathbf{Y}| \tag{ii}$$

$$\text{Mean Squared Error, MSE} = \frac{\sum |\mathbf{X} - \mathbf{Y}|^2}{\mathbf{n}} \tag{iii}$$

$$\text{Root mean square error, RMSE} = \sqrt{\frac{\sum |\chi - \chi|^2}{n}} = \sqrt{\text{MSE}} \tag{iv}$$

$$\text{Coefficient of determination, or } \text{R}^2 = 1 - \frac{\text{MSE (Model)}}{\text{MSE (Baseline)}} \tag{v}$$

$$\text{MSE}(\text{Baseline}) \text{ is calculated by } \frac{\sum \left| \mathbf{X} - \overline{\mathbf{Y}} \right|^2}{\mathbf{n}} \tag{vi}$$

where X is the actual data vector, Y and Y are the predicted data vector and mean of the predicted data vector.

Different ANNs and regression models were compared using MAE, MSE, RMSE, r-value and R2 value. After extensively exploring the ANN training functions that provide a better prediction of the PV, Bayesian regularization backpropagation algorithm was used from the neural network toolbox of Matlab. A built-in Matlab function for Bayesian regularization backpropagation minimizes a linear combination of squared errors and weights and then determines the correct combination so as to produce a network that generalizes well. It updates the weight and bias values according to Levenberg-Marquardt optimization [33]. The best ANN selection technique (as shown in Figure 4) was repeated for three different scenarios: (i) When all the features were used; (ii) when features selected using CFS technique are used; and (iii) when features selected using ReliefF technique.

#### **4. Results and Discussion**

The prototype system (setup shown in Figure 1) was used for collecting the PV and environmental parameters and PV power output data from the period of November 2014 until October 2016. Summary of the PV and environmental parameters and the data used for deriving the predictive model of the PV power is shown in Table 4.


**Table 4.** Details of the environment parameters used for the predictive model.

Table 5 shows that Temperature, PV Temperature, Irradiance and Accumulated dust were the selected feature using CFS algorithm, whereas the irradiance, wind speed, PV temperature, and environmental temperature were selected as highly ranked features by ReliefF technique.


**Table 5.** Selected features vector.

Table 6. summarizes the evaluation matrix for different regression techniques evaluated in this study. Linear regression, M5F tree and GPR were implemented using MATLAB with all the features, and also with reduced features using CFS and ReliefF. The reason for selecting these regression techniques, because they provided the best performance compared to other regression techniques commonly used. Table 6A shows the performance matrix of the different regression techniques in the validation phase, whereas Table 6B shows their performance matrix for the unseen test dataset.

**Table 6.** (A) Performance comparison between the various regression techniques (validation phase). (B) Performance comparison between the various regression techniques (testing phase).



**Table 6.** *Cont.*

This is clearly revealed from the tables in Table 6, the overall performance of the evaluated regression techniques for the testing dataset was not similar to that of the training dataset. For the testing dataset, CFS-based feature selection technique outperforms all features and ReliefF-based techniques for Linear and GPR regression techniques; however, M5P outperforms for all features-based technique.

Training and validation performance of the ANN-based prediction models were shown in Figures 5 and 6 for the three different techniques. The validation performance, as shown in Figure 5, the best training performance was observed at different epochs for different techniques. Out of the numerous models developed by the ANN using the different set of features, the best epochs were obtained 707, 91 and 598 respectively. This could be used in future to derive the model for predicting the PV power. Figure 6 shows the error histogram with 20 bins, where bins represent the vertical bars in the graph. Total error from each neural network ranges from (−25.84 to 16.88), (−49.08 to 26.43) and (−43.83 to 20.03) respectively. Each vertical bar represents the number of samples from corresponding dataset, which lies in a particular bin. There is a zero-error line in the graph, and more than 80% of the errors lie within +10 Watt. It is typically assumed that any algorithm which could predict the output where 80% of the error lying within 10% (i.e., approximately 10 W) of the target value, is a very good predictive model [34].

**Figure 5.** Comparison of the MSE for different techniques: (**a**) With all features; (**b**) CFS feature selection technique; (**c**) ReliefF feature selection technique for training and validation.

**Figure 6.** Error Histogram (**a**) with all features; (**b**) CFS feature selection technique; (**c**) ReliefF feature selection technique for training and validation.

Figure 7 shows the relation between the original power output and the predicted power output using the best epochs from the ANN. The dots represent the original power output, the blue line is the best linearized predictive model derived from ANN, and the dotted line represents the best linear relation for the true target. The difference between the dotted line and the blue line is represented by the correlation coefficient, i.e., in Figure 7, r represents the successful linearized model developed by the ANN using three different techniques. However, the difference between the predictive model trend-line and the true trend-line was noticed minimum for all features, which is evident in Figure 6a and Table 7.

**Figure 7.** Relation between the original power output and the predicted power output in training and validation: (**a**) With all features; (**b**) CFS feature selection technique; (**c**) ReliefF feature selection technique.

The best ANN model found using the Figure 4 approach, was validated during training and validation process and results were reported in Table 7A. The testing dataset was used to validate the trained model, and the results were shown in Table 7B. As seen in Table 7A, the ANN provides really good prediction compared to the other regression techniques. By using the various feature selection techniques, it was found that the ANN provides the lowest RMSE, i.e., 2.1436 for all feature set; whereas ReliefF feature selection technique provides the second-best performance in terms of RMSE, i.e., 5.5351 in the validation phase. Similar performance was observed for the testing dataset as well. All features technique outperforms others, and the best RMSE was observed to be 5.4784. Figure 8 shows the comparison of the ANN predicted power with the actual power using test dataset for different sets of features. It is apparent from Figure 9 that there is a consistent over-forecasting or under-forecasting in the predicted output, i.e., the predicted output was biased in prediction for some instances. Figure 9 shows the biased forecasting for all the features-based predictions, as shown in Figure 8a. It can be seen that tracking signal-based bias calculation can identify bias more accurately than the NFM technique.


**Table 7.** (A) Performance comparison between the various ANN techniques (validation phase). (B) Performance comparison between the various ANN techniques (testing phase).

**Figure 8.** Comparison of the ANN predicted power with the actual power using test dataset: (**a**) With all features; (**b**) CFS feature selection technique; (**c**) ReliefF feature selection technique.

**Figure 9.** Bias calculation for a biased prediction for all features based PV power prediction: (**A**) Using tracking signal technique; (**B**) using NFM technique.

In order to confirm if the ANN derived models are efficient in predicting PV power, test dataset (not included in training) was used for testing the performance of the ANN algorithms. From Figures 8 and 10, it is evident that using all the features, it is closer to predict the actual power than CFS and ReliefF filtering. It should be noted that more data should be included in the training dataset to increase the accuracy of prediction. It has been mentioned in the literature by several groups that the bias correction can improve accuracy; however, some other article showed a contrary performance. Most importantly, by incorporating bias correction in the prediction algorithm, overall computational complexity and cost will significantly increase. Moreover, shallow Convolutional Neural Network (CNN) can be used for PV power prediction to remove the biasing problem. The ANN's computational complexity could be less than CNN or deep learning approach and more flexible for real-time prediction.

**Figure 10.** Error Histogram between the Predicted and actual PV with test dataset: (**a**) With all features; (**b**) CFS feature selection technique; (**c**) ReliefF feature selection technique.

#### **5. Conclusions**

An in-house PV system was developed at Qatar University to monitor, analyze and evaluate the performance of PV using various weather factors. The PV and environmental data collected from the system was used to develop a prediction model that can be used to predict the PV power in advance. To conclude, the prediction model was developed using several regressions—and ANN-based networks using the data collected by the PV system. Two feature selection techniques (CFS and ReliefF) were used to select subsets of applicable, high-quality and non-redundant characteristics. Compared to the three best regression models (simple linear regression model, M5P decision tree model and GPR), the ANN was more accurate in predicting the output power with RMSE of 2.1436. It was found that using feature selection techniques along with the ANN can predict the PV power with RMSE of 6.1555 and 5.5351, respectively. The trained ANN models are simpler and can be used to accurately predict the output power of PV systems with minimal computational complexity. Since the PV system was designed and tested in Qatar, this work can help the researchers in the Gulf to utilize the optimized algorithm and its performance in prediction for this region. We believe this would help the solar industry of this region in a great deal for optimizing the overall PV output. More PV and environmental data are being acquired for training a more accurate predictive model using the approach described in this work and will be compared with CNN-based approach in the future. Moreover, the concept of

cooling the PV using the techniques discussed in Reference [5] are going to combine with the existing PV system, which can potentially help in increasing the efficiency and also avoid the accumulation of dust, which can affect the PV performance. Future works can open new horizons in this domain regarding different bias correction algorithm, with five years' system data, and more parameters for prediction—including cooling and cleaning effect on the PV system.

**Author Contributions:** Experiments were designed by A.K., and M.E.H.C.; Experiments were performed by F.T., A.J.S.P.G., K.B., M.-K.K.; Results were analyzed by A.K., M.E.H.C., F.T. and M.H.; All authors were involved in the interpretation of data and writing the paper.

**Funding:** The publication of this article was funded by the Qatar National Library.

**Acknowledgments:** The authors would like to thank the Electrical Engineering Department of Qatar University for providing the opportunity to conduct the experiments at Qatar University.

**Conflicts of Interest:** The authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **Simulating Power Generation from Photovoltaics in the Polish Power System Based on Ground Meteorological Measurements—First Tests Based on Transmission System Operator Data**

#### **Jakub Jurasz 1,2,\*, Marcin Wdowikowski <sup>3</sup> and Mariusz Figurski 3,4**


Received: 5 July 2020; Accepted: 12 August 2020; Published: 17 August 2020

**Abstract:** The Polish power system is undergoing a slow process of transformation from coal to one that is renewables dominated. Although coal will remain a fundamental fuel in the coming years, the recent upsurge in installed capacity of photovoltaic (PV) systems should draw significant attention. Owning to the fact that the Polish Transmission System Operator recently published the PV hourly generation time series in this article, we aim to explore how well those can be modeled based on the meteorological measurements provided by the Institute of Meteorology and Water Management. The hourly time series of PV generation on a country level and irradiation, wind speed, and temperature measurements from 23 meteorological stations covering one month are used as inputs to create an artificial neural network. The analysis indicates that available measurements combined with artificial neural networks can simulate PV generation on a national level with a mean percentage error of 3.2%.

**Keywords:** photovoltaics; artificial neural networks; national power system

#### **1. Introduction**

The transformation of the power system is a continuous process, and to fully realize this we will need years of ongoing commitment and well-thought decisions on country and regional levels. In the past two years in the Polish power system, we could observe a significant increase in the installed capacity in both residential and commercial photovoltaic (PV) systems. To a set of interesting investments in PV capacity one could potentially include a 600 kWp system located near Por ˛abka-Zar ˙ pumped-storage hydropower station, 2.5 MW installation for waterworks in Szczecin, and 739 kW for 35 buildings belonging to a housing cooperative in Wrocław. The growing interest in PV systems can be linked to (a) increasing electricity prices, (b) the decreasing cost of PV systems, and (c) growing awareness of the impact of the energy generation sector on the natural environment. (Impact of PV systems from the perspective of Life Cycle Assessment should not be neglected. Readers are referred to other works strictly dedicated to this topic.)

The Polish Transmission System Operator (PSE) has been publishing wind generation data on an aggregated level for quite some time. These time series with an hourly time step are of great importance to visualize and analyze the variability [1] of wind generation with different time horizons. They can also be used to create forecasting or simulation models [2–5]. Recently, the PSE has also made publicly available aggregated generation time series from photovoltaic installations located in Poland. These data create new opportunities for further research, including an analysis of the complementarity between renewable energy sources in Poland [6] or the use of available ground measurements to simulate PV generation on a country level. The second research direction can be later used to simulate and predict how the growing installed capacity in PV systems in Poland will affect power system operations. Based on available measurement data, and knowing the transmission system constraints, decisions can be made regarding the optimal distribution of renewable generators in the power system.

Considering the information above, the objective of this study is to conduct first tests with regard to the possibility of using ground measurements from meteorological stations to simulate the power generation from PV systems on a country level. Such research results can be found in the international literature both for PV and wind generation, using different kinds of inputs and simulation tools. For example, for Sweden reanalysis based on MERRA (Modern Era Retrospective-Analysis for Research and Applications), data sets have been used to effectively model a national fleet of wind turbines' power output [7]. Recently, Olauson [8] found that ERA5 data sets performed much better than the MERRA reanalysis sets mentioned previously [7]. Similar to the analysis presented in this work, Black et al. [9] used meteorological data and regression techniques to simulate a fleet of PV systems.

#### **2. Materials and Methods**

To simulate the energy generation from PV systems on a country level, the method based on an artificial neural network (ANN) was applied. ANNs are a computational system loosely inspired by biological neural networks. They belong to a group of artificial intelligence information processing paradigms that has gained immense popularity in recent years. Simulation or forecasting models based on ANNs have been successfully applied in various areas of cognitive and application research, including water-demand forecasting [10], lake water-level forecasting [11], renewables integration studies [12], in the area of color image identification and reconstruction [13], multi-core optic fibers [14], wind speed prediction [15], or, most importantly from this paper's perspective, in the areas of direct and global radiation prediction [16] and PV energy yield forecasting [17].

A model of feedforward neural network (NN) has been developed in Matlab 2019a software. The Levenberg–Marquardt method was used to optimize the weights and bias values [18]. The input data were divided into training, validation, and testing subsets in proportions of 70, 15, and 15. Since the length of the available time series was limited to one month, the data were divided so that the first 70% of hours was used to teach the neural network, followed by 15% to validate/supervise the teaching process and the remaining 15% to test the NN performance. The number of neurons in the hidden layer was selected following a brute-force approach, namely NNs with the number of hidden neurons k ranging from 1 to *n*, where *n* is the number of input neurons being tested. In the literature, various approaches to solving this problem can be found [18,19]. However, considering the low computing effort to create an ANN, a brute-force approach in this particular case seems to be a justified choice.

As inputs, the PV generation time series covering the month of May (available at https://www.pse. pl/) and time series of wind speed, temperature, and irradiation for 23 meteorological stations located in Poland, obtained from the Institute of Meteorology and Water Management—National Research Institute (IMWM-NRI), were used. The locations of meteorological stations along with the equipment used are presented in Figure 1 and in Table 1. The data have an hourly time step. The nighttime hours and hours when the energy generation from the PV system was less than 5 MW were removed from the input data set. Such low values were removed because hourly temporal resolution does not take into account the spatial distribution of PV systems in Poland, and low generation periods occur during sunset and sunrise. The final data set consisted of 473 hourly records. The data were normalized to a range of 0–1 before the ANN creation procedure. The performance of ANN has been assessed based

on two commonly applied metrics, namely (MAPE) mean absolute percentage error (Equation (1)) and (RMSE) root-mean-square error (Equation (2)).

$$MAPE = \frac{1}{n} \sum\_{i=1}^{n} \left| \frac{o\_i - s\_i}{o\_i} \right| \tag{1}$$

$$RMSE = \left[ \sum\_{i=1}^{n} \left( o\_i - s\_i \right)^2 / n \right]^{1/2} \tag{2}$$

where *n*—sample size, *o*—observed value, and *s*—simulated value.

**Figure 1.** Locations of 23 IMWM-NRI meteorological stations.



#### **3. Results and Discussion**

The energy yield from photovoltaic modules is a function of irradiation falling on the module area, the modules' efficiency, and their temperature. Since detailed information about the PV systems' location is currently not available to the authors, nor is the specification of the modules used, it is justified to use a black-box approach where the input data are transferred into desired output. Because the temperature of the PV modules is determined by irradiation and ambient air temperature as well as wind speed, which has a cooling effect, the above-mentioned meteorological parameters have been considered as explanatory variables. To simulate an hourly power generation from PV systems in Poland, a set of 69 (3 meteorological parameters from 23 stations: Bielsko-Biała, Gdynia, Gorzów Wielkopolski, Jarczew, Jelenia Góra, Kasprowy Wierch, Kłodzko, Koło, Koszalin, Legionowo, Legnica, Łeba, Łód´z, Mikołajki, Piła, Radzy ´n, Sulejów, Suwałki, Toru ´n, Warszawa-Bielany, Wielu ´n, Włodawa, and Zakopane – please see Figure 1) explanatory variables was used. These explanatory variables exhibited various correlation coefficient values with the response variable. For the irradiation, it was on average 0.777, whereas the air temperature and wind speed were, respectively, 0.439 and 0.299. Figure 2 shows the observed hourly irradiation on the 1st of May 2020 and the production of energy from PV systems at the national level. During that day, the observed irradiation in individual hours varied significantly among considered locations (meteorological stations), while the PV systems maintained a relatively smooth energy generation pattern. This phenomenon can be attributed to the spatial smoothing of power generation due to the geographical dispersion of the PV systems [20]. This situation is beneficial from the perspective of variable renewable energy sources (VRES) integration to the power system, although constraints such as transmission network capacity may limit the benefits resulting from spatial smoothing.

**Figure 2.** Observed irradiation in 23 meteorological stations along with photovoltaic (PV) production on a country level on 1st of May 2020.

The generation from PV systems during the whole period is visualized in Figure 3. Significant variability between individual daily yield sums can be observed. On a side note, accordingly to the PSE data in May, the PV systems generated 223 GWh, whereas wind turbines generated almost 1.072 TWh, which contributed to covering, respectively, 1.8% and 8.6% of the national demand in this month.

**Figure 3.** PV generation on a country level during May 2020.

As mentioned in the Methods and Data section, in total 69 potential configurations of the ANNs were tested. For those, the one with the lowest MAPE was selected for further analysis. The performance of the selected and the remaining ANNs as a function of the number of neurons in the hidden layer is presented in Figure 4. The best performing ANN had 15 neurons in the hidden layer and MAPE of 3.2%.

**Figure 4.** Performance of neural networks for the testing subset based on mean absolute percentage error (MAPE) criteria.

The ANN also performed well in terms of RMSE, which was found to be 39.2 MW. In Figure 5, the performance of the ANN was visualized for the testing subset only. This set is the final verification if the neural network is capable of obtaining good-quality results for input data that remained unknown during the training phase. In Figure 5 it can be noted that the ANN performed very well for the extreme values (very low and high generation), whereas some systematic errors with an unknown source occurred for the mid-range values. On average, it was found that the ANN tended to overestimate the generation by 6 MW. The highest overestimation error was found to be 133 MW, whereas the highest underestimation was 150 MW.

**Figure 5.** Performance of the neural network for the testing subset. R<sup>2</sup> refers to linear regression. Red line indicates a theoretical perfect match.

Figure 6 visualizes the performance of the neural network over a period of 4.5 days by the end of May 2020. The night hours are excluded from the analysis. As shown in the figure, the values simulated by the ANN followed the real PV systems generation well. During the second day, the ANN wrongly simulated a sudden drop in PV generation, increasing the variability of the modeled time series (in terms of ramp rates). During the fourth day, one can observe that in the midday hours the simulated generation was slightly greater than the observed one. In general, the absolute errors did not exceed 150 MW.

**Figure 6.** Performance of the neural network for the testing subset. Please note that night hours (no irradiation) are excluded.

#### **4. Conclusions**

The presented short communication archives the first results regarding the potential of simulating the generation from the PV systems on a national level based on ground measurements provided by the Institute of Meteorology and Water Management. The conducted analysis based on artificial neural networks revealed that ground measurements consisting of irradiation, wind speed, and air temperature can be used to correctly model the power generation from PV systems on a country level. Despite some limitations, such as neglecting the technical specification of the PV systems, the spatial distribution of PV farms across the country, taking input irradiation on a horizontal rather than an inclined surface, and finally representing the meteorological conditions in Poland based on a sample of 23 locations, it was possible to model the PV generation with a mean absolute percentage error of roughly 3.2%.

Most importantly, the obtained results have some practical implications. In the research and reality of the operation of present energy systems, meteorology starts to play a very important and, in many instances, crucial role in enabling and securing an efficient and reliable operation of the power system. The need for including meteorology-based studies in energy research comes directly from the unprecedented increase in the installed capacity of renewable energy sources, especially the ones in which variability/availability is driven by climate and weather [21]. The Polish power sector is starting a process of transformation. The increasing share of renewables such as wind and, in particular, solar energy (as observed in 2019/2020) is driven by an increasing awareness of the energy sector's impact on the natural environment, the growing cost of electricity generation from conventional fuels, the decreasing cost of renewables, and national/international policy. The power system expansion/development studies [22] call for data with both higher temporal and spatial resolution. This can be provided by either ground measurements, satellite measurements, numerical weather prediction models, or reanalysis data sets. In this paper we have investigated whether the in-house data available from a governmental institution (Institute of Meteorology and Water Management) can be used to simulate aggregated PV generation on a country level. The results

obtained proved the high value of the already available data and its promising applications in power system expansion studies as well as research dedicated, for example, to optimal location of PV systems from the perspective of grid topology or the impact of PV systems on the residual load curve.

This study intended to present the results of the first tests on the freshly published data by the Polish Transmission System Operator—PSE. Therefore, we did not go into detail comparing different statistical or machine learning techniques. Clearly, for now, the data sample is relatively short and, at the time of writing, was limited to one month. In the future, we plan to extend this research by investigating (a) in detail the impact of the input set on the model performance, (b) comparing different simulation tools, (c) the impact of a long series of meteorological parameters on the quality of the model, and finally (d) selecting a minimal set of representative meteorological stations sufficient for simulations. The results presented here should be taken with caution since solar irradiation has a high annual variability, and model performance might, therefore, vary depending on the part of the year.

**Author Contributions:** Conceptualization, J.J. and M.W.; methodology, J.J.; software, J.J.; validation, M.W., M.F.; resources, M.W.; data curation, M.W.; writing—original draft preparation, J.J.; writing—review and editing, M.W. and M.F.; visualization, J.J.; supervision, M.F.; project administration, M.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Nomenclature**


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Complex Network Analysis of Photovoltaic Plant Operations and Failure Modes**

#### **Fabrizio Bonacina 1,\*, Alessandro Corsini 1, Lucio Cardillo <sup>2</sup> and Francesca Lucchetta <sup>1</sup>**


Received: 23 April 2019; Accepted: 22 May 2019; Published: 24 May 2019

**Abstract:** This paper presents a novel data-driven approach, based on sensor network analysis in Photovoltaic (PV) power plants, to unveil hidden precursors in failure modes. The method is based on the analysis of signals from PV plant monitoring, and advocates the use of graph modeling techniques to reconstruct and investigate the connectivity among PV field sensors, as is customary for Complex Network Analysis (CNA) approaches. Five month operation data are used in the present study. The results showed that the proposed methodology is able to discover specific hidden dynamics, also referred to as emerging properties in a Complexity Science perspective, which are not visible in the observation of individual sensor signal but are closely linked to the relationships occurring at the system level. The application of exploratory data analysis techniques on those properties demonstrated, for the specific plant under scrutiny, potential for early fault detection.

**Keywords:** sensor network; data fusion; complex network analysis; fault prognosis; photovoltaic plants

#### **1. Introduction**

The cumulative global photovoltaic (PV) capacity has been growing exponentially around the world over recent years. In the decade 2005–2015, the solar PV generation capacity in the EU has increased from 1.9 GW to 95.4 GW [1]. Notwithstanding this, the Europe PV market's conditions are still substantially dependent on regional energy policies and public subsidies for renewable energies. As per the Italian market, in June 2013 the Italian public company GSE (Gestore dei Servizi Energetici) officially announced the discontinuation of the last Feed-in-Tariff incentive after its cap of 6700 million euro was reached [2]. The end of such subsidies has led to new attention being focused on PV plant performance management, lifetime and availability, with a view to reduce operating and maintenance costs [3].

Faults in PV plant components (i.e., modules, converters, connection lines), in addition to downtime correlated penalties, could result in an acceleration of system aging and as a consequence in a reduction of power plant reliability [4]. Typically, faults severity depends on various factors. These factors include the time to detect, time to repair or substitute, COE, and occurrence over time, and all of these factors can have a significant impact on profitability [5]. For these reasons, over the last decade Fault Detection and Diagnosis (FDD) in PV systems has been established as a critical field of research [6]. To mention but a few areas, research has addressed key topics like real-time monitoring, partial shading effects analysis, estimation of natural degradation rate over time and residual life for solar panels [7]. In the FDD arena, a number of studies have proposed data-driven fault detection algorithms based on statistical processing of performance parameters (e.g., power loss factor analysis studies, I-V output characteristics [8–11]) or an exponentially weighted moving average control chart [12]. In addition, model-based techniques, implementing dynamic fault trees [13] or combining PV system performance simulation (voltage ratio, performance ratio, &c) and Fuzzy and

neuro-Fuzzy logic classification systems [14,15], have also been studied as methods to detect fault occurrence [16]. The open literature already offers several review papers on this field, illustrating the state-of-the-art and opportunities [4,17–19].

With the advent of Internet of Things, technologies for smart monitoring have found widespread applications, including in PV systems, thereby enabling decision-making processes [20]. However, monitoring systems typically employ several distributed sensors and the collection of raw data opening the field to a number of specific challenges is still unresolved, which is in part due to the different communication standards, heterogeneous structure and the huge volume of data [21,22]. As reported in a number of big data studies [23,24], traditional data analysis in statistics, management and visualization fails with sensor network data streams. High-dimensionality and heterogeneity encourage data mining and artificial intelligence solutions, such as feature extraction, classification, clustering and sampling [25–30]. Furthermore, multi-sensor data fusion techniques have been introduced to provide a robust description of an environment or process of interest by combining observations from a number of different sources [31].

In this context, complex network science emerged as an active field, cross-fertilized by natural, social and computer science, exploiting the general idea that artificial systems like biological ones (metabolic network, worldwide web, distribution network, etc.) are complex systems, composed of a large number of interacting parts (components or entities) with an articulated self-organization. The complexity of the systems, then, resides in interaction and loops among its entities [32]. Looking for example at industrial systems, constitutive elements are governed by well-defined physical laws and are included in a pre-determined process scheme. However, surrounding environments and business laws (determining the operations) influence this by-definition deterministic structure. Moving to the component level, multi-physics processes are nonlinear, as is the component matching. In a general view, these systems can evolve surfacing complex structures characterized by diversity, as well as multiple interactions within and between layers. These characteristics can make industrial systems a complex engineering system [33,34].

In this paper CNA has been used to model the sensor network of a 1 MW PV field, advocating a complex system engineering point of view [35,36]. A data fusion criterion has been developed processing the signals from the sensor equipment with a view to detect fault precursors. The sensors include PV field AC and DC electrical parameters, temperature, and solar irradiance measurements.

The paper, first, illustrates the methodology for data processing with a consideration of connectivity analysis to create the digital twin of the sensor network in the form of a functional graph. The result of these analyses is a correlation matrix that defines the graph structure in terms of the weight of the edges connecting the nodes (element/sensor signal) at a given instant from all the pairs of pre-processed sensor time series. By applying the data processing within a fixed time window, sliding over the monitoring interval, the result is a fully-connected weighted dynamic graph. Then, dynamic graph analysis is used to explore the sensor network in order to unveil hidden correlations among signal time series. The results are discussed by using exploratory data analysis to combine the most significant topological graph metrics in the identification of operational patterns of the PV plant.

#### **2. Methodology**

The field sensors are modeled using graph theory as a complex network, where the nodes are represented by the signals from sensors and the edge are evaluated with non-linear statistical correlation functions applied to the time series pairs. Specifically, the data flow is structured according to the following steps:


At the initial stage (a), heterogeneous data acquired by the sensors are synchronized and cleaned by removing outliers. Then for stage (b), a fully-connected graph is created in which each monitored variable represents a node and all the nodes are connected to each other with directed edges. The output of this phase is a complete weighted graph model (also called functional graph) of the sensor network. After, some typical complex network measurements are applied to extrapolate synthetic properties from the functional graphs. Finally (c), these measurements were used as input for exploratory network analysis, with the aim of grouping the data as a function of multiple topological graph metrics.

The implementation of the proposed methodology has been done using Python version 3.5 [37]. In particular, a Python code has been created using Scikit-learn and NetworkX packages,respectively [38,39].

#### *2.1. Complex Network Analysis*

The CNA has its origins in graph theory and is used to describe the properties of complex systems through the mathematical study of networks. The key ingredient in the CNA is the study of the correlation among uni-variate signals recorded from different sources, as is customary for biological network analysis. For this purpose, the mutual information (MI) has been widely used [40–42].

In this paper we propose a novel approach based on the study of correlations between signals of m heterogeneous sensors within a fixed window of n samples. In particular, given two signals yn,i and yn,j, taken from the i-th and j-th component of the feature matrix Yn,m, where (n) is spanning over time steps, the mutual information MI quantifies the level of uncertainty in yn,i removed by the conditional knowledge of yn,j. This measure essentially tells us how much extra information one gets from one signal by knowing the outcomes of the other one [43–45]. Once for each sample, a correlation matrix is obtained computing the MI in the m-dimensional feature space, while sliding the window from the beginning of the data set up to the entire monitoring interval. As a consequence, the evolution of correlation matrix is represented in the form of a functional graph consisting of a set of m nodes associated by k weighted edges representing the connecting force between the pairs of nodes.

The result of this phase is therefore a dynamic graph that contains all the information about the evolution of spatial and temporal relationships between all the entities monitored and allows the definition of parameters quantifying the characteristics of the systems [46].

In the context of CNA, the network measurements represent the most used tools for the extrapolation of synthetic information through the analysis of network topology. These measurements can be evaluated over the entire network (e.g., Shortest Path length, Diameter, Global Efficiency, Modularity, &c) or they can locally refer to nodes (e.g., Centrality Measures). This type of metric has been applied in several social networking studies to address the problem of identifying and ranking those people who exert an unequal amount of influence on the decision-making of others (also called influencers or opinion leaders) and to study the diffusion information within the network [47–49].

In the present approach, both the network measurements relating to the whole network (global scale) and those specific to individual nodes (local scale) were considered. Specifically, the study uses Shortest Path Length [50], Diameter and Radius of the graph [51], among the different global-scale measurements, Eccentricity and Weighted Degree Centrality of the nodes [39,52–54], among the local-scale measurements.

#### *2.2. Exploratory Data Analysis*

Promoted by Tukey [55], exploratory data analysis (EDA) represents a powerful tool to maximize insight into the underlying structure of a complex dataset. It facilitates the understanding of the distribution of samples and simplifies the data analysis, pointing out to special observations (outliers), clusters of similar observations, groups of related variables, and crossed relationships between specific observations and variables [56,57]. All this information in turn, can be very informative for further data modeling and it is of paramount importance to improve data knowledge. For these purposes, it encompasses a wide range of statistics and graphical tools (e.g., histogram, box plot, Pareto chart, principal component analysis, etc.) which have been commonly used for decades in various research fields such as archeology, biology, anthropology, medicine, chemometrics, &c [58–60].

In this paper, EDA is used to investigate the results of the complex network analysis and compare them with the sensor signals Yn,m. In particular, the 3D scatterplot has been used for visual pattern recognition. This tool is based on a features representation in a multivariable space, allowing us to identify the dependencies between different network measurements in order to discriminate specific operational PV plant clusters.

#### **3. Case Study**

The data were collected from a PV plant with a power of 1 MWp. The solar field is connected to two inverters, each with three conversion blocks. Both inverters are grid-tied, feeding a medium voltage power distribution network. They are equipped with fully independent monitoring systems and incorporate a solar power controller to regulate the Maximum Power Point Tracking (MPPT).

Measurements were recorded every 5 min and included AC and DC electrical parameters, AC power output, ambient temperature and solar irradiance. The monitoring system is shown in Figure 1. In particular, solar radiation on the plane of the modules and ambient temperature have been measured by a pyranometer and a PT-100 RTD sensor respectively, both installed on a sensor box near the panels. Both inverters are equipped with resistive potential divider voltage sensors for DC and AC parameters measurement for each conversion block. Shunt resistors have been used to measure 12 string currents as a representative sample of the plant.

**Figure 1.** Schematic block circuit diagram of the PV system with sensors and measuring points.

The measurements were collected in the period from May 20th to September 15th 2017. The 47 monitored variables which define the m-dimensional feature matrix Yn,m are listed in Table 1.

**Table 1.** Monitored PV plant variables.


Table 2 provides the specification of the sensors used in the monitoring system. As far as the dataset is concerned, the 47 monitored variables Yn,m are used for the connectivity analysis with a sliding window set to 216 data (about 18 h) per variable, along the available 5-months records.

**Table 2.** Main characteristics of the different sensors involved.


#### **4. Results**

Figure 2 provides the evolution of the solar irradiance and the active power of the inverter 1 (block A) gathered from the measurement system during the period of investigation.

**Figure 2.** Solar irradiance and active power of the inverter 1 (block A) during the period of investigation.

The interest on inverter 1 is motivated by the observation of the signals revealing the occurrence of a fault in the period between 31st August and 15th September. In particular, a breakdown of switching devices occurred, which caused the failure of the block A of inverter 1 inductor. In this case the internal protection of the inverter automatically reacted by turning off the block involved in the fault. It is worth noting that the protection reacts within the sampling interval.

Detailing on a three-day period at the end of August, the trends of the active power on the 3 blocks (from A to C) of the two inverters with the irradiation are shown in Figure 3 (inverter 2) and Figure 4 (inverter 1) respectively. The plots refer to 30th August (Figures 3a and 4a), 31st August (Figures 3b and 4b) and September 1st (Figures 3c and 4c).

**Figure 3.** Analysis of sensor signals of the Inverter 2 on 30st August (**a**), on 31th August (**b**) and on 1st September (**c**)—Solar Irradiance over the Active Power of the inverter blocks.

In Figures 3a and 4a, August 30th plots reveals the typical operation of PV plant in sunny condition, where the solar irradiance reaches the peak value of about 950 W/m<sup>2</sup> at mid-day and the active powers of all the inverters blocks follow its trend.

Looking at the monitored data on 31st August (Figures 3b and 4b), it is possible to infer that the PV plant operates under cloudy weather conditions with variations of solar irradiance. In particular, while the inverter 2 (Figure 3b) appears to work in standard conditions, following the solar radiation trend, inverter 1 (Figure 4b) starting from 13:30 features the collapse of active power on block A, which then extends to 1st September (see Figure 4c) and gives evidence of fault occurrence.

To give more hints on this fault event, Figure 5 compares the signals Yn,7–8 and Yn,36 gathered from inverter 1 block A (Figure 5a) and the corresponding CNA metrics. Specifically, Figure 5b illustrates the behavior of degree centralities of solar irradiance, DC voltage, AC voltage and string currents.

**Figure 4.** Analysis of sensor signals of the Inverter 1 on 30st August (**a**), on 31th August (**b**) and on 1st September (**c**)—Solar Irradiance over the Active Power of the inverter blocks.

As confirmed by sensor signals (Figure 5a), at 13:30 the fault has an immediate effect on the string currents, while the voltages zeroed around 16:30. When looking at complex network topology measurements (Figure 5b), the fault occurrence correlates with the departure of the degree centrality of solar irradiance from those of voltage-current signals which remain correlated.

**Figure 5.** Analysis of (**a**) sensor signals from block A of the inverter 1 compared with (**b**) sensor network CNA metrics on 31st August.

In order to understand the dynamic of the PV field operation, EDA is applied to raw data from the plant monitoring system as well as to the modeled sensor network topology variables. Figure 6, first, shows the results of the three-dimensional scatterplot of sensor signals as a function of time and solar irradiance, focusing on: the active power (Figure 6a), DC voltage (Figure 6b) and AC voltage (Figure 6c) of the inverter 1 block A. Notably, the plots refer to the sampling period from 20th May to 15th September. Data have been colored according to an agglomerative clustering.

Irrespective of the EDA variables, the analysis of the scatter plots in Figure 6a–c identify three patterns. Specifically:


In terms of raw data clustering, it is possible to resolve only the two behaviors which determine the operations before and after the fault event.

Figure 7 illustrates the results of EDA applied to network topology measurements. The scatter plots are created to investigate the correlation in time between a global graph measure (e.g., the graph diameter) and a local node related variable (e.g., sensor signal degree of centrality). In particular, the sensor network graph diameter is plotted against active power degree centrality (Figure 7a), AC voltage degree centrality (Figure 7b), and DC voltage degree centrality (Figure 7c).

**Figure 6.** Exploratory data analysis applied to sensor signals from block A of the inverter 1. Trends of (**a**) active power, (**b**) DC voltage and (**c**) AC voltage (x-axis), against solar irradiance (y-axis) and time (z-axis).

**Figure 7.** Exploratory data analysis applied to complex network measurements from block A of the inverter 1. Trends of (**a**) active power, (**b**) DC voltage and (**c**) AC voltage normalized weighted degree centralities (x-axis), against network diameter (y-axis) and time (z-axis).

As a first general comment, Figure 7 gives the evidence of a wealth of information emerging from the application of CNA methods. All the scatter plots demonstrate how the combination of global-local network topology measures determine the emergence of the dynamics in the PV plant operations, opening the possibility for a different performance indexing. In detail, all the plots distinguish three patterns possibly linked to the operations of the inverter block:


In this case, the results of the agglomerative clustering coincide with those emerging from the scatterplots and confirm the existence of a pre-fault behavior following a nearly proportional decrease in global/local graph topology measures.

#### **5. Conclusions**

This paper proposes a sensor fusion approach based on the use of graph modeling techniques to investigate the connectivity among several key parameters of the PV plant.

The results show that the study of the properties of the graphs through the application of Complex Network Analysis techniques is able to reveal a wealth of hidden information.

As reported in Korn et al. [61] the global behavior of the system is more than the sum of its parts, so these properties can be thought of as behavior deriving from the interaction between the components that can't be identified through their simple functional decomposition.

In particular, the visual analysis of the single topological metrics of the functional graph focused on the inverter block where the fault occurs (see Figure 4), reveals an evident variation in terms of correlation between the monitored variables, mainly associated with an anomalous deviation of the degree centrality of the solar irradiance with respect to the centrality of the key block parameters in the fault conditions.

However, by combining the multiple information deriving from the different network measurements (i.e., Degree Centrality and Network Diameter) with EDA techniques, it is possible to clearly distinguish not only the standard operation from the fault conditions, but also to isolate specific pre-fault conditions with an advance time with respect to the event of about 7 h. It is important to note that these conditions only emerge after applying CNA and are not observable with EDA techniques based on simple sensor signals. This latter type of analysis, in fact, is only able to discriminate standard operating conditions from fault conditions.

Thus, the results of this study show interesting potential in the evaluation of useful Key Performance Indicators and Control Charts based on topographic metrics of graphs for early fault detection in the PV plant.

**Author Contributions:** Conceptualization, A.C. and F.B.; Methodology, F.B.; Writing—Python Code, L.C.; Data Curation and Validation, F.L.; Formal Analysis and Interpretation, F.B. and L.C.; Writing—Original Draft Preparation, all authors; Writing—Review & Editing, all authors; Supervision, A.C.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Deterioration Diagnosis of Solar Module Using Thermal and Visible Image Processing**

#### **Heon Jeong 1, Goo-Rak Kwon <sup>2</sup> and Sang-Woong Lee 3,\***


Received: 1 May 2020; Accepted: 27 May 2020; Published: 3 June 2020

**Abstract:** Several factors cause the output degradation of the photovoltaic (PV) module. The main affecting elements are the higher PV module temperature, the shaded cell, the shortened or conducting bypass diodes, and the soiled and degraded PV array. In this paper, we introduce an image processing technique that automatically identifies the module generating the hot spots in the solar module. In order to extract feature points, we used the maximally stable extremal regions (MSER) method, which derives the area of interest by using the inrange function, using the blue color of the PV module. We propose an effective matching method for feature points and a homography translation technique. The temperature data derivation method and the normal/ abnormal decision method are described in order to enhance the performance. The effectiveness of the proposed system was evaluated through experiments. Finally, a thermal image analysis of approximately 240 modules was confirmed to be 97% consistent with the visual evaluation in the experimental results.

**Keywords:** thermal image; photovoltaic module; hot spot; image processing; deterioration

#### **1. Introduction**

Photovoltaic (PV) modules are made up of several solar cells. When an abnormality occurs in such a cell, the cell operates like an electrical resistor. As a result, the short circuit current decreases because of the power consumption and the series circuit characteristic. This phenomenon leads to an increase in temperature and accelerates the damage of the PV cell. Eventually, this abnormal process limits the short circuit current of the faulty PV module, reducing the power generation efficiency, performance and life of the PV module [1]. When the current capacity is reduced due to shadows produced by leaves, clouds, and dust or an abnormality occurs such that one cell is completely obscured or the cell is aged, a reverse bias is applied to the PV cell. This causes a so-called hot spot [2]. Some studies have found the modules that may cause failures by the thermal analysis of PV modules using an infrared thermal camera to prevent this loss [3–8].

Suárez-Domínguez's PV module thermal image analysis study [5] showed that the mean temperature of the PV module was 21 degrees. However, the temperatures at the two hot spots were found to be 26 degrees and 28 degrees. E. Kaplani's study [6], confirmed that the temperature of the hot spot area exceeded 100 ◦C in summer. Several other studies have also suggested that thermal imaging cameras are reliable, economical, and easy methods of inspecting hot spots, and suggest that periodic inspection can lead to optimal plant operation [6,7]. However, reflection and emissivity must be considered when inspecting a PV module using a thermal imaging camera. When PV modules are inspected from the front, a thermal imaging camera sees the heat distribution on the glass surface but only indirectly sees the heat distribution in the underlying cells. Glass reflections are specular, which means that the surrounding objects, with different temperatures, can be seen clearly in the

thermal image. In the worst case, this results in misinterpretations (false "hot spots") and measurement errors. In order to avoid the reflection of the thermal imaging camera and the operator in the glass, it should not be positioned perpendicularly to the module being inspected. However, emissivity is at its highest when the camera is perpendicular and decreases as the angle increases [8].

Some solar power plants are considering installing fixed dual cameras. However, high-resolution thermal cameras are still very expensive. In order to install a large number of these expensive cameras, a large initial investment cost is required. In addition, it is necessary to establish and maintain an information processing system for multiple cameras. In recent years, in order to inspect the PV modules installed in a wide area or in a difficult accessing place, a thermal image of the PV module using a drone is taken and a thermal temperature analysis is performed [9,10]. However, these studies rely on visually inspecting the temperature distribution and finding faulty modules while viewing infrared thermal images on the monitor. Thus, a deterioration diagnosis program of a photovoltaic module is required to manage a module installed in various solar power plants in a wide area.

Thermal cameras generally measure the wavelength emitted by an object to extract temperature information. However, it is difficult to distinguish objects according to temperature conditions. The thermal image alone provides the temperature value of a specific object. Nevertheless, object boundaries cannot be distinguished by the thermal image alone in order to automatically extract the region and determine the abnormality through the temperature in the region [11]. Therefore, research has been carried out to realize object classification and object temperature distribution through a visible camera and a thermal camera. These studies express the temperature image of a specific region by performing a matching process between a visible image and a thermal image [12–14]. The image registration process is particularly essential in order to obtain necessary information by using multiple thermal and visible cameras. The matching process performs a geometric alignment process on the image so that data can be compared or integrated from different images. In these studies, the feature points must be extracted well for matching, and the matching relation of feature points is important. Davis [12] proposed a method of extracting background-subtraction and contour and synthesizing thermal image and visible images to extract the feature points. Conaire [11] evaluated the use of feature points by applying various methods such as four feature points, color and edge histograms, and weighted models.

Another study suggests a method of finding a plane homography that aligns the thermal spectrum with the visible spectrum. Using the obtained homography, the thermal image is image-warped to match the image [14]. In order to obtain such a precise homography, a number of correspondence points are defined in the image manually and perform a homography search process is performed that minimizes the least squared error [15]. However, the lack of correlation between the brightness intensity of the degradation image and the intensity of the visible image makes it difficult to generate an automatic correspondence point due to the brightness. Thus, a passive correspondence point designation or a specific landmark search method is required [16].

In this paper, we introduce an image processing method that can automatically identify the module generating hot spots in the solar module. In order to overcome the difficulties of generating the automatic correspondence points proposed in the previous studies, the proposed PV module analysis system adopts the image processing method that uses the characteristic elements of the PV module. Additionally, the proposed method extracts temperature information automatically and determines whether there is an abnormality through the rectified module image. The proposed method consists of four major image processing methods using the unique features of the PV module. The first method uses the shape and color characteristics of the PV module. The second exploits the fact that the PV modules are installed in multiple clusters. The third method involves the extraction of characteristic points from the visible image and deterioration. Finally, the fourth method derives a homography transformation equation for matching between minutiae points. To extract the feature points, we use the maximally stable extremal regions (MSER) algorithm, which computes the depth change value with the neighboring pixels and generates a tree up to the neighboring pixel level value [17]. Next, the proposed

method derives the area of interest by using the inrange function using the blue color of the PV module. We propose an effective matching method for feature points and a homography conversion technique using the random sample consensus (RANSAC) algorithm [18]. The module area is rectified to increase the ease of the temperature distribution inspection of the module area after the matching of the thermal image and the visible image is completed.

The temperature information is extracted from the rectified thermal image and the failure prediction module is determined by calculating the abnormal area of high temperature area. Through this image processing method, it is possible to automatically identify the module generating the hot spot of the photovoltaic module, and it can help identify the cause of hot spots by displaying the thermal image and visible image generated by the hot spot. If there is contamination, shadows, or breakage, etc., in the visible image in the high temperature area, it is easy to identify the cause. However, if there is no abnormal point, it can be judged that the performance degradation is due to the characteristic change. In order to distinguish between the normal and abnormal modules, accurate region extraction of the photovoltaic module must proceed. If the photovoltaic module area is well extracted from the thermal image, the abnormal temperature distribution is not found as shown in Figure 1a, but it is judged as normal. If the module area is incorrectly extracted as shown in Figure 1b, an abnormal temperature distribution is found and it is judged to be an abnormal module.

**Figure 1.** Examples of photovoltaic module area extraction and decision: (**a**) Correct extraction; (**b**) Incorrect extraction.

In Section 2, we describe an algorithm for detecting the occurrence of hot spots by analyzing the temperature distribution of a solar module. Furthermore, we describe the proposed visible image and thermal image registration, rectification process, temperature data derivation method, and normal/abnormal decision method. In Section 3, image analysis is performed on various solar modules and the effectiveness of the proposed algorithm is examined. In Section 4, we discuss the results and draw conclusions.

#### **2. Deterioration Diagnosis Method of PV Module Using Thermal Image and Visible Image Processing**

#### *2.1. Proposed Method*

The flow chart of the proposed method is shown in Figure 2. The proposed algorithm can be summarized in the following five steps:


**Figure 2.** Flowchart of the proposed visible and thermal image matching and rectification algorithm.

Figures 3 and 4 show the algorithm proposed in this paper which includes the process of feature points extraction, feature points matching, segmentation, rectification and abnormal module judgment. The detailed description of the proposed method consists of following functional steps.


**Figure 3.** Feature points extraction and matching images creation through corresponding points.

**Figure 4.** Segmentation, rectification, abnormal module judgment.

#### *2.2. Image Acquisition*

To acquire the image, we constructed a stereo camera with a visible image camera (resolution: 1280 × 720) and an infrared camera (640 × 512) as shown in Figure 5. Camera calibration was performed to derive a common image area with different angles of view between the visible image and the thermal image. Through the camera calibration process, the lens distortion correction and the main shooting distance were taken at positions of 8 to 18 m, and a strategic area of the thermal image was set in the visible image area.

**Figure 5.** Stereo camera configuration of visible and infrared camera.

#### *2.3. PV Module Area Search and Block Segmentation*

2.3.1. Extraction of Region of Interest through Color Inrange

We implemented an algorithm that derives the area of the photovoltaic module by exploiting the fact that the main color is in the blue region due to the characteristics of the PV module. The color image was transformed into the HSV image using the color inrange function. The hue value was 80~140, the saturation value was 65~255, and the value was 65~255. An example of the image processing using HSV conversion and inrange function is shown in Figure 6.

**Figure 6.** Image processing through HSV image and inrange function: (**a**) visible image; (**b**) HSV image; (**c**) inrange function result; (**d**) mask image (RMImg1).

2.3.2. Block Segmentation through Find Contours

Since PV modules have a cluster array structure for serial-parallel connection and the installation type may be changed in units of cluster blocks, the image was segmented in units of blocks. An example of the block segmentation by the findcontours function is shown in Figure 7.

**Figure 7.** Block segmentation through findcontours: (**a**) find contours; (**b**) mask image; (**c**) sub-block mask image.

#### *2.4. Module Feature Point Detection Using MSER Algorithm*

This step converted the visible image into a gray image and extracted the module area using the MSER algorithm. Additionally, the thermal area image was extracted using the MSER algorithm. As an example, the results of using MSER without considering module size are shown in Figure 8.

**Figure 8.** MSER algorithm without consideration of module size: (**a**) result of visible image gray MSER algorithm; (**b**) result of thermal image MSER algorithm.

The MSER rectangular areas of the visible image as shown in Figure 8a were sorted in ascending order, and are used as the predicted area value of one PV module based on the intermediate area value. In Figure 9, we show a graph of rectangular area vs. rank by area. MSER rectangular areas within 20%, the appropriate value according to various experiments, are derived by comparing these with the predicted area values. The calculation was performed in following steps.

**Figure 9.** Choosing PV module area size by median value in a visible image.


In Figure 10, we show predicted areas by using the median value. The number of green rectangles in Figure 10 is reduced, compared to Figure 8, as a result of applying the MSER algorithm considering the module size and extracting the feature points. The yellow dots are used as feature points.

**Figure 10.** MSER algorithm and feature point detection result considering module size: (**a**) result of visible image; (**b**) result of thermal image.

#### *2.5. Valid Matching Point, Homography and Registration*

#### 2.5.1. Valid Matching Point

An unnecessary matching point of a rectangular area is estimated as shown in Figure 10a,b. In order to remove the unnecessary matching points, the validity between the points of two images is determined, which is expressed as:

$$\text{tr1}\_{i} \in \text{MP1}, \text{ x2}\_{j} \in \text{MP2}, \quad i: 1 \sim \text{MP1}\_{\text{size}}, \quad j: 1 \sim \text{MP2}\_{\text{size}} \tag{1}$$

$$\begin{array}{ll} r = \sqrt{\left(x1\_i - x2\_j\right)^2 + \left(y1\_i - y2\_j\right)^2} \\ \text{if } (r < thr) \text{ then} \quad M1\_k = MP1\_{i\prime} \quad M2\_k = MP2\_j \end{array} \tag{2}$$

The set of feature points of the visible image and degradation images are called *MP*1 and *MP*2, respectively as shown in Figure 3. The expressions (*x*1*i*, *y*1*i*) and *x*2*j*, *y*2*<sup>j</sup>* respectively represent the x value and the y value of the i-th feature point of the *MP*1 and the j-th element of the *MP*2, Equation (1) performs an iterative search while changing to the introduction number. The validity evaluation step computes the distance value (r) between the two points. If the estimated distance is smaller than the boundary value (*thr*), these two points are determined as an element of the effective feature point set (*M*1, *M*2).

#### 2.5.2. Homography Derivation through Valid Feature Points, Registration

A set *MP*1 of points (*x*1*i*) existing in the projection plane space corresponding to the set *MP*2 of points (*x*2*i*) existing in the two-dimensional projection plane space can be projected, and a projective transformation between the two images can be expressed as a relation function. Projectivity is defined as an invertible mapping *h*, which is a linear transformation through a non-singular 3 × 3 matrix [19].

$$
\begin{pmatrix} \mathbf{x}\_1' \\ \mathbf{x}\_2' \\ \mathbf{x}\_3' \end{pmatrix} = \begin{bmatrix} h\_{11} & h\_{12} & h\_{13} \\ h\_{21} & h\_{22} & h\_{23} \\ h\_{31} & h\_{32} & h\_{33} \end{bmatrix} \begin{pmatrix} \mathbf{x}\_1 \\ \mathbf{x}\_2 \\ \mathbf{x}\_3 \end{pmatrix} \tag{3}
$$

$$X' = HX \tag{4}$$

The homography converting the thermal image feature point *M*2 into the visible image feature point *M*1 was obtained as follows.

$$H1 \; :\; M2 \to M1 \tag{5}$$

Homography (*H*1) in the case of Figure 7 is as follows.

$$H1 = \begin{pmatrix} 1.011490762 & -0.0011793 & -0.17881936 \\ -0.0074779 & 0.95735177 & 5.3210135 \\ -0.0000029 & -0.0000903 & 1 \end{pmatrix} \tag{6}$$

Using the obtained homography*H*1, the thermal image is projected onto the visible image. Figure 11 shows a composite image of a red image with a visible image converted to green. The disparity between the images is shown by a simple synthesis before the conversion in Figure 11a,c. However, it can be confirmed that the images are well matched without any difference value in the composite image using the homography.

**Figure 11.** Registration images of visible and thermal using homography: (**a**) registration image without homography; (**b**) registration image with homography; (**c**) enlarged registration image of (**a**); (**d**) enlarge registration image of (**b**).

#### *2.6. Rectification, Thermal Data Extraction, and Determination of Results*

The module region was rectified to increase the ease of the temperature distribution inspection of the module area, once the registration of the thermal image and the visible image was completed. The faulty module was determined by calculating the high temperature area. Through this image processing procedure, the module generating hot spot in the PV module was automatically identified. This helped to identify the cause of the hot spot by displaying the thermal image and visible image generated by the hot spot. Visible images can be used to distinguish between contaminants, shadows, breakages, microcracks, or cell deterioration.

#### 2.6.1. Homography Derivation and Rectification

In the blocks obtained through the segmentation, the entire area of the PV module was obtained, and the homography was derived using the edges of the valid feature points needing to be rectified.

$$\text{ABn} : \text{MBn} \to \text{RBn}, \text{ n} : 1 \sim \text{MB}\_{\text{size}} \tag{7}$$

$$RBinding = \sum\_{n=0}^{k \text{locksize}} RBn \tag{8}$$

*HBn* uses a homography matrix for rectifying the sub-block *MBn* into the rectified block *RBn*, and obtains the rectified image *Rbimg* by summing the rectified images of sub-blocks. An example of obtaining the rectified images from sub-blocks is shown in Figure 12.

**Figure 12.** Process of obtaining a rectified image: (**a**) sub-block *MB*1; (**b**) sub-block *MB*2; (**c**) rectified block *RB*1*, RB*2.

2.6.2. Extracting Temperature Information and Determining the Abnormality

In order to determine whether an error had occurred in the PV module, the average temperature value (T\_avg), the maximum temperature value (T\_max), and the minimum temperature value, (T\_min) information in the module area were extracted. Next, the high threshold temperature (Th\_high) and the low threshold temperature (Th\_low) were determined in the PV module area. If the temperature value was less than Th\_high and less than Th\_low, the count value (a\_count) would be increased. If the value a\_count exceeded 0.2% of the module area value (area\_module), then the module would be considered as an abnormal module. Figure 13 shows an example of extracting temperature information and determining the abnormality. Figure 13a shows the temperature distribution for the abnormal module among several modules, and Figure 13b shows the results of the normal module (left) and the abnormal module (right) obtained through the proposed equation.

**Figure 13.** Extracting temperature information and determining the abnormality: (**a**) extracting temperature information; (**b**) normal module and abnormal module judgment.

$$\begin{aligned} \text{TH\\_high} &= \text{T\\_avg} + \text{T\\_max} \times 0.2\\ \text{TH\\_low} &= \text{T\\_avg} + \text{T\\_min} \times 0.2\\ \text{if } ((T\\_val > Th\\_high) \text{ or } (T\\_val < Th\\_low)) &\\_a\\_count ++;\\ \text{if } (a\\_count > area\\_module \times 0.002) &\\_abnormal\text{ PV\\_module} \end{aligned} \tag{9}$$

#### **3. Experiment**

To evaluate the validity of the proposed algorithm, 40 visible and thermal images were simultaneously captured using a drone. The shooting altitude was set to about 10 m, and the acquired image is shown in Figure 14.

**Figure 14.** Samples for experiment: (**a**) visible image samples; (**b**) thermal image samples.

In order to evaluate correspondence consistency, ground-true images were manually generated for seven images as shown in Figure 15.

**Figure 15.** Ground-true images for evaluating matching consistency.

For each image, feature points were derived, and the homography for matching was derived, projected, and finally, matched. To obtain the validity, the homography was applied to the ground-true image to calculate the match rate. The results showed an average of 97% agreement.

The mean area value of the module area obtained by the MSER algorithm was 10,336 pixels, and the standard deviation of the PV area was 215 pixels. The average area accuracy of the module area was about 98%. Finally, the proposed method determined whether the module was normal or faulty, and estimated the faulty module by calculating the high temperature area. The thermal image analysis of approximately 240 modules was confirmed to be 97% consistent with the visual evaluation. The module with the error was confirmed to be faulty due to the reflection of sunlight.

Figure 16 shows the experimental results for finding the abnormal module. Based on the GPS information, the analysis results are shown on Google Maps in Figure 16a. A green mark indicates a normal module and a red mark indicates an abnormal module. When an user presses a red mark on a screen, the user can see the detailed analysis such as the corresponding visible image (Figure 16b), thermal image (Figure 16c), merge image before registration (Figure 16d), merge image after registration (Figure 16e), normalized image (Figure 16f), and the abnormal region derivation result by the determination algorithm (Figure 16g) are expressed. As a result of pressing the mark manually, it was confirmed that both normal and abnormal modules can be distinguished well.

**Figure 16.** Abnormal module detection system: (**a**) main result screen; (**b**) visible image; (**c**) thermal image; (**d**) merge image before registration; (**e**) merge image after registration; (**f**) normalized image; (**g**) abnormal region derivation result.

#### **4. Conclusions**

In this paper, we introduced an image processing technique that can automatically identify the modules generating the hot spots in solar modules. The proposed PV module analysis system adopted the image processing method that uses the characteristic elements of the PV module in order to overcome the difficulties of generating the automatic correspondence point presented in the previous studies. Additionally, the proposed method determined an abnormality of the PV module. The image processing method using the characteristic elements of the PV module extracts the feature point derivation from the visible image and the degradation image by using the shape characteristic of the color module, the rectangular shape of the PV module, and the installation of a plurality of clusters, and a homography transform equation is derived from the matching points. To extract the feature points, we used the MSER algorithm, the inrange function, an effective proposed matching method and a homography conversion technique using the RANSAC algorithm.

Experimental results show that the match between the thermal image and the visible image was 97%. The accuracy of applying the MSER algorithm and the average area derivation algorithm was confirmed to be 98%. Similarly, through the rectification of the module area, the ease of the inspection of the temperature distribution of the module area was enhanced, and the failure prediction module was determined by calculating the abnormal area of high temperature. The thermal image analysis of approximately 240 modules was confirmed to be 97% consistent with the visual and visual evaluation. Through the application of these algorithms, it is possible to automatically identify the module generating the hot spots in the solar module, and to identify the cause of the hot spots by displaying the thermal image and visible image generated by the hot spot.

**Author Contributions:** Conceptualization, H.J. and G.-R.K.; methodology, H.J.; software, H.J.; validation, H.J., G.-R.K. and S.-W.L.; formal analysis, G.-R.K.; investigation, H.J.; resources, S.-W.L.; data curation, G.-R.K.; writing—original draft preparation, G.-R.K.; writing—review and editing, H.J.; visualization, G.-R.K.; supervision, S.-W.L.; project administration, H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (NO. 2017-0-01712) and National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2019R1F1A1062075).

**Conflicts of Interest:** The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Clustering-Based Self-Imputation of Unlabeled Fault Data in a Fleet of Photovoltaic Generation Systems**

#### **Sunme Park , Soyeong Park, Myungsun Kim and Euiseok Hwang \***

School of Mechanical Engineering, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-gu, Gwangju 61005, Korea; pishp00200@gist.ac.kr (S.P.); soyeongp@gist.ac.kr (S.P.); rlaaudtjs@gist.ac.kr (M.K.)

**\*** Correspondence: euiseokh@gist.ac.kr; Tel.: +82-62-715-3223

Received: 30 November 2019; Accepted: 3 February 2020; Published: 7 February 2020

**Abstract:** This work proposes a fault detection and imputation scheme for a fleet of small-scale photovoltaic (PV) systems, where the captured data includes unlabeled faults. On-site meteorological information, such as solar irradiance, is helpful for monitoring PV systems. However, collecting this type of weather data at every station is not feasible for a fleet owing to the limitation of installation costs. In this study, to monitor a PV fleet efficiently, neighboring PV generation profiles were utilized for fault detection and imputation, as well as solar irradiance. For fault detection from unlabeled raw PV data, K-means clustering was employed to detect abnormal patterns based on customized input features, which were extracted from the fleet PVs and weather data. When a profile was determined to have an abnormal pattern, imputation for the corresponding data was implemented using the subset of neighboring PV data clustered as normal. For evaluation, the effectiveness of neighboring PV information was investigated using the actual rooftop PV power generation data measured at several locations in the Gwangju Institute of Science and Technology (GIST) campus. The results indicate that neighboring PV profiles improve the fault detection capability and the imputation accuracy. For fault detection, clustering-based schemes provided error rates of 0.0126 and 0.0223, respectively, with and without neighboring PV data, whereas the conventional prediction-based approach showed an error rate of 0.0753. For imputation, estimation accuracy was significantly improved by leveraging the labels of fault detection in the proposed scheme, as much as 18.32% reduction in normalized root mean square error (NRMSE) compared with the conventional scheme without fault consideration.

**Keywords:** PV fleet; clustering-based PV fault detection; unsupervised learning; self-imputation

#### **1. Introduction**

A photovoltaic (PV) power plant is one of the most renewable, sustainable, and eco-friendly setups for converting solar energy, which is the most abundant and freely available energy source, into electrical energy [1]. The contribution of solar energy to the total global energy supply has rapidly increased in recent decades with PV installation capacity growing to more than 500 GW by the end of 2018 [2,3]. However, PV systems are exposed to harsh working conditions owing to uncertain outdoor environments and their complex structure. In [4], it was reported that the annual losses in PV generation reached 18.9% under zero or shading faults. Improving the reliability of renewable energy generation by fault detection and diagnosis (FDD) and correcting faulty data is essential for maintaining the efficiency of PV generation [5]. In addition, reliable information is required to be applied for various power applications, e.g., energy scheduling [6] and energy forecasting [7,8], to guarantee safe and stable grid systems.

For utility planners and operators, it is essential to examine the power output variability [9] to aggregate the fleet of PV systems, which is defined as the number of individual PV systems spread out over a geographical area [10]. Several studies presented station-pair correlation analyses by introducing virtual networks. A correlation was observed between short-term irradiance variability as a function of diverse distance and time scale [11]. Similarly, the maximum output variability of a fleet of PV plants was estimated by using the clearness index [12]. A variability model was built in [13] to integrate a large amount of generated solar power into power systems. To integrate the fleet as a distributed power source, it should be managed by an intelligent monitoring system that can correct abnormal data via real-time fault diagnosis and power generation forecasts.

PV faults occur because of various reasons at different locations, such as a module, string, or any other spot related to the PV systems. Visual and thermal methods employed in PV fault detection can detect superficial problems, such as browning, soiling or snow, discoloration, delamination, and hot spots, using auxiliary measurements [14]. However, this requires expensive and complicated equipment [15]. In recent years, many studies have employed methods using electrical variables via a data-driven approach. The electrical signal approaches are mainly referred to as maximum power point tracking (MPPT) with I–V characteristic analysis and power loss analysis. They are usually utilized to distinguish an open circuit, short circuit, degradation or aging, and shading faults that may typically occur on the DC side of a PV array [16–18].

For data-driven methods, automatic fault detection approaches can be categorized into conventional modeling-based methods and methods that utilize intelligent machine learning [19]. For the former case, the model can be built with respect to the physical attributes from the PV module specification for simulation settings to compare the desired output with the measured output [20]. Conventional statistical detection methods have been primarily presented in previous studies [16,21–25]. The exponentially weighted moving average has been used to identify DC side faults by comparing the one-diode model and estimated MPPs [21,22]. Lower and upper limits were set when the ratio of the measured to the modeled AC power exceeds 3-sigma [22]. In [23], outlier detection rules were proposed in their statistical details: The 3-sigma rule, Hampel identifier, and Boxplot rule using a PV string current. A symbolic aggregate approximation (SAX) scheme was used to convert the voltage profile, prior to performing clustering and anomaly detection [24].

With the advent of artificial intelligence (AI), which can be applied to various domains, particularly suited to the nonlinear behavior of PV systems, numerous studies have exploited AI-based monitoring systems [26]. The common artificial neural network (ANN) is widely used either to predict PV generation behavior or as a fault detection module based on several electrical parameters [15,27–31]. In comparison with a conventional back-propagation network, a probabilistic neural network (PNN) uses a probability density function as the activation function; thus, it is less sensitive to noisy and erroneous samples [32–34]. Fault detection by a support vector machine (SVM) has been used in several studies because it has the ability to separate objects by finding an optimal hyperplane that maximizes the margin in both binary and multiclass problems [35–37]. The decision tree (DT) builds repetitive decision rules within if/else instructions, which is intuitive. The model can be implemented conveniently with a large dataset [35,38]. The random forest (RF) has been applied to improve multiclass classification accuracy and to generalize performance [19]. Fuzzy classifications based on a fuzzy inference system (FIS) were developed by constructing logic rules [29,39]. A kernel extreme learning machine was investigated owing to its fast learning speed and good generalization [40]. Particle swarm optimization-back-propagation (PSO-BP) has been shown to improve the convergence and prediction accuracy of fault diagnosis systems [41].

Most of the data-driven approaches involve a supervised learning-based fault detection system that assigns a label for binary PV states as either normal or abnormal or as multi-class for corresponding fault types in advance. The detection or classification model learns the complex and unrevealed relation between input attributes and predefined labels in the training phase, and then the model is tested to determine whether it can distinguish PV states properly for new inputs. However, these processes require human effort to manually assign labels, and it is not easy to visualize the trained model. Graph-based semi-supervised learning (GBSSL) was proposed to detect line-to-line and open circuit faults using a few labeled data [42]. In [20], five types of faults were classified based

on a single diode model with five input vectors associated with IV characteristics, solar irradiance, and temperature. Gaussian-fuzzy C-means was conducted using the distribution of each cluster and faults were diagnosed through PNN based on previous cluster center information [33]. A fuzzy membership algorithm based on degrees of fault data and cluster centers has been proposed [43]. Density peak-based clustering has also been proposed [44]. The 3-sigma rule was applied to determine each cluster center using the normalized voltage and current at the MPPs. Similarly, the PV local outlier factor (PVLOF) was computed from the current of the PV array to identify the degree of faults [45]. A single diode model-based prediction was implemented, enabling the generation of the residual, which was applied to the one-class SVM by quantifying the dissimilarity between the normal and faulty features [46].

In this study, we propose a framework of two stages of self-fault detection and self-imputation in a fleet of PV systems using neighboring PV power generation units based on correlation analysis. Because insolation data is not available with sufficient geographical resolution, especially for a small-scale PV system [12,37], neighboring PV generation data in the same fleet can be used jointly with distanced weather data. Since daily PV generation captures generally include unidentified erroneous samples, faulty data candidates were first labeled in the proposed scheme by an unsupervised manner with several extracted features. K-means clustering was employed to find out fault data point in the daily PV power outputs obtained from all the sites in the fleet. When the profile was considered as an abnormal pattern, restoration was accomplished by the following imputation step. Imputation schemes were implemented by autoregressive (AR) and multiple regression models with optional neighboring PV data of normal candidates obtained from the previous clustering step. For evaluation, several types of fault patterns observed in actual PV power profiles were simultaneously injected into a single or multiple sites, and proposed schemes were tested without injection information.

The remainder of this paper is organized as follows: Section 2 describes the PV fleet power output relationship between distance and the correlation with actual data measured on campus. Section 3 proposes an efficient fault detection and imputation methods for use with a PV fleet. The simulation setup, including the injected fault pattern, is provided in Section 4. Section 5 details the detection and imputation results, and Section 6 concludes the paper.

#### **2. PV Fleet Power Data Analysis**

#### *2.1. Materials*

In this study, actual PV fleet data were utilized for analysis and simulation. Hourly power generation data were measured in rooftop solar installations from 13 buildings at the Gwangju Institute of Science and Technology (GIST) campus located in Gwangju, South Korea, as shown in Figure 1, with installation details given in Table 1. In this densely distributed PV fleet, only the facility management building (site 3) collects environmental information, such as ambient temperature, module temperature, slope solar irradiance, and horizontal solar irradiance. The data were collected in 2019 for approximately nine months. To investigate the influence of solar irradiance with respect to the distance between the weather station and PV installation, local environmental data retrieved from the nearest weather station provided by the Korea Meteorological Administration (KMA) website were used [47]. The weather data including local solar irradiance were recorded every hour, the same as for the PV data. We selected 159 days of PV power data from 13 sites without missing values for an overlapping period. The meteorological data were acquired at the site weather station (SWS). In addition, a local weather station (LWS), which was approximately 7 km away from the campus, was subsequently employed for the same time period.

**Figure 1.** Weather stations and photovoltaic (PV) fleet on the Gwangju Institute of Science and Technology (GIST) campus.


**Table 1.** Installation information for rooftop PVs in the GIST fleet.

#### *2.2. Cross-Correlation Analysis*

Because previous studies have demonstrated that the correlation coefficient decreases when the distance between the weather station and the PV system increases, we investigated the cross-correlation as a function of distance. The cross-correlation is computed using the Pearson's correlation coefficient as follows:

$$\rho\_{\mathbf{x},\mathbf{y}} = \frac{\text{cov}(\mathbf{x}, \mathbf{y})}{\sigma\_{\mathbf{x}} \sigma\_{\mathbf{y}}} \tag{1}$$

The covariance of two time-series data **x** and **y**, cov(**x**, **y**) is normalized by the product of their standard deviation *σ***<sup>x</sup>** and *σ***y**. The correlation between each PV site with solar irradiance, which was obtained at SWS and LWS, was analyzed. As expected, the SWS data showed a stronger positive correlation than the LWS data for every site on the campus, as given in Table 2. The correlation analysis was conducted between PV generation data for each site in the same manner. Figure 2 shows the result of the correlation analysis, which produced five groups according to the degree of correlation. Apart from other sites, site 1 belonged to the independent group G1. Sites 2–4 were grouped into group 2 (G2) due to high correlation because of the relatively short distance from SWS. Group 3 (G3) comprised sites 5–11, which had a medium distance from SWS but were clustered together. Groups 4 (G4) and 5 (G5) comprised site 12 and 13, respectively, where one site was far from SWS and other sites. Based on the cross-correlation analysis, we determined that the use of neighbor PV power data can provide additional information that can help perform fault detection and imputation in the fleet of PV systems.




**Figure 2.** Cross-correlation of PV power output for the fleet.

#### **3. PV Fleet Fault Detection and Imputation**

The overall scheme of the proposed system is shown in Figure 3. SWS data and 13 power output data sets from the PV fleet were provided, from which features were extracted to distinguish the daily binary condition of normal or faulty. At the clustering stage, daily states for each PV site were estimated, which is reflected by the final step of imputation.

**Figure 3.** Overall architecture of the proposed system.

#### *3.1. Operation Time*

Operation time was considered to extract meaningful properties from the PV pattern accordance with environmental factor. The daylight hours or insolation duration can be obtained by mathematical modeling based on geographical components, such as the latitude of the PV site, using the following equation:

$$G\_0 = G\_{5\circ} \left( 1 + 0.33 \cos \left( \frac{360d}{365} \right) \right) \left( \cos \delta \cos \phi \cos \omega + \sin \delta \sin \phi \right) \tag{2}$$

where *Gsc* is the solar constant, which is 1368 [W/m2], *d* is the day number in the year, *δ* is the solar declination, *φ* is the latitude of the site, and *ω* is the solar hour angle. During an hour after sunrise and an hour before sunset, the small amount of solar irradiance has a tidal effect on PV operation. We defined the operation time, *T*, by excluding these time periods from the insolation duration [37].

#### *3.2. Feature Extraction*

Several features were extracted from the given dataset (e.g., the PV fleet power data **x**, and the solar irradiance observation data **y**) to distinguish the fault patterns from normal data at the candidate PV site *n*. In this study, **y**1:*<sup>N</sup>* = **y**<sup>3</sup> as there is one SWS at site 3. We distinguished *Fs* and *Ff* , which represent the feature set derived from daily solar irradiance and daily neighboring PVs, respectively. Seven features were introduced in this study for two types of datasets, which generated 14 features. The extracted features of the candidate PV can be written as *<sup>F</sup>*(*n*) = [*F*(*n*) *<sup>s</sup>* , *<sup>F</sup>*(*n*) *<sup>f</sup>* ].

#### 3.2.1. Total Coefficient of Determination

The coefficient of determination used in this study refers to how well a certain profile can explain the PV power generation profile for each PV site. For simulation comparison, the profile can be either solar irradiance data or neighboring PV generations, as follows:

$$\begin{aligned} F\_{s,1}^{(n)} &= 1 - R\_{\mathbf{x}\_n, \mathbf{y}\_n}^2\\ F\_{f,1}^{(n)} &= 1 - \frac{1}{N-1} \sum\_{k} R\_{\mathbf{x}\_n, \mathbf{x}\_k}^2 \end{aligned} \tag{3}$$

where *R*<sup>2</sup> denotes the coefficients of determination of the regression depending on the other independent variable.

#### 3.2.2. Normalized Profile Distance

The distance between normalized profiles is related to the closeness of the pattern shape. Daily profiles were converted within the range of [0,1] through min–max normalization, and the distance for each time step was computed, as follows:

$$\begin{aligned} F\_{s,2}^{(n)} &= \frac{1}{T\_p} \sum\_{t \in T} |\mathbf{x}\_{norm \, n,t} - y\_{norm \, n,t}| \\ F\_{f,2}^{(n)} &= \frac{1}{N-1} \cdot \frac{1}{T\_p} \sum\_{k} \sum\_{t \in T} |\mathbf{x}\_{norm \, n,t} - \mathbf{x}\_{norm \, k,t}| \end{aligned} \tag{4}$$

where *xnorm n*,*<sup>t</sup>* = *xn*,*<sup>t</sup>* max*t*∈*T*{*xn*,*t*}−min*t*∈*T*{*xn*,*t*} and *ynorm n*,*<sup>t</sup>* <sup>=</sup> *yn*,*<sup>t</sup>* max*t*∈*T*{*yn*,*t*}−min*t*∈*T*{*yn*,*t*} .

#### 3.2.3. Degree of Consistency

To investigate the fluctuation consistency, the degree of consistency between two profiles was proposed in [48]. Regardless of the heterogeneous dataset, e.g., comparing solar irradiance and *Energies* **2020**, *13*, 737

PV generations or different capacities of the PV installations, it can reflect the variation tendency. During the operation time period, *fi*, *fo*, and *fz* count whether the variation tendency is identical, opposite, or zero, respectively. Then, they are converted into *F*3, *F*4, and *F*<sup>5</sup> to be computed as a ratio.

*<sup>F</sup>*(*n*) *<sup>s</sup>*,3 <sup>=</sup> <sup>1</sup> <sup>−</sup> <sup>1</sup> *Tp* ∑ *t*∈*T fi*(*xn*,*t*, *yn*,*t*) where *fi*(*xn*,*t*, *yn*,*t*) = 1 if (*xn*,*<sup>t</sup>* − *xn*,*t*−1)(*yn*,*<sup>t</sup>* − *yn*,*t*−1) > <sup>0</sup> *<sup>F</sup>*(*n*) *<sup>f</sup>* ,3 <sup>=</sup> <sup>1</sup> <sup>−</sup> <sup>1</sup> *<sup>N</sup>* <sup>−</sup> <sup>1</sup> · <sup>1</sup> *Tp* ∑ *k* ∑ *t*∈*T fi*(*xn*,*t*, *xk*,*t*) where *fi*(*xn*,*t*, *xk*,*t*) = 1 if (*xn*,*<sup>t</sup>* − *xn*,*t*−1)(*xk*,*<sup>t</sup>* − *xk*,*t*−1) > <sup>0</sup> (5) *<sup>F</sup>*(*n*) *<sup>s</sup>*,4 <sup>=</sup> <sup>1</sup> *Tp* ∑ *t*∈*T fo*(*xn*,*t*, *yn*,*t*) where *fo*(*xn*,*t*, *yn*,*t*) = 1 if (*xn*,*<sup>t</sup>* − *xn*,*t*−1)(*yn*,*<sup>t</sup>* − *yn*,*t*−1) < <sup>0</sup> *<sup>F</sup>*(*n*) *<sup>f</sup>* ,4 <sup>=</sup> <sup>1</sup> *<sup>N</sup>* <sup>−</sup> <sup>1</sup> · <sup>1</sup> *Tp* ∑ *k* ∑ *t*∈*T fo*(*xn*,*t*, *xk*,*t*) where *fo*(*xn*,*t*, *xk*,*t*) = 1 if (*xn*,*<sup>t</sup>* − *xn*,*t*−1)(*xk*,*<sup>t</sup>* − *xk*,*t*−1) < <sup>0</sup> (6) *<sup>F</sup>*(*n*) *<sup>s</sup>*,5 <sup>=</sup> <sup>1</sup> *Tp* ∑ *t*∈*T fz*(*xn*,*t*, *yn*,*t*) where *fz*(*xn*,*t*, *yn*,*t*) = 1 if (*xn*,*<sup>t</sup>* − *xn*,*t*−1)(*yn*,*<sup>t</sup>* − *yn*,*t*−1) = <sup>0</sup> *<sup>F</sup>*(*n*) *<sup>f</sup>* ,5 <sup>=</sup> <sup>1</sup> *<sup>N</sup>* <sup>−</sup> <sup>1</sup> · <sup>1</sup> *Tp* ∑ *k* ∑ *t*∈*T fz*(*xn*,*t*, *xk*,*t*) where *fz*(*xn*,*t*, *xk*,*t*) = 1 if (*xn*,*<sup>t</sup>* − *xn*,*t*−1)(*xk*,*<sup>t</sup>* − *xk*,*t*−1) = <sup>0</sup> (7)

#### 3.2.4. Relative Error Percentile of the Maximum Value

By comparing the relative maximum property of each profile, the relative error percentiles of the maximum values were determined. Two different attributes, i.e., the first order difference (*F*6) and standard value (*F*7), were used:

$$\begin{aligned} F\_{s,6}^{(n)} &= \frac{\mathbf{x}\_{n\ d\max} - \mathbf{y}\_{n\ d\max}}{\mathbf{y}\_{n\ d\max}}\\ F\_{f,6}^{(n)} &= \frac{1}{N-1} \sum\_{k} \frac{\mathbf{x}\_{n\ d\max} - \mathbf{x}\_{k\ d\max}}{\mathbf{x}\_{k\ d\max}} \end{aligned} \tag{8}$$

$$\text{where } \mathbf{x}\_{n\,dmax} = \max\_{\mathbf{t}\in\mathcal{T}} \left\{ \frac{|\mathbf{x}\_{n,t} - \mathbf{x}\_{n,t-1}|}{|\mathbf{x}\_{n,T} - \mathbf{x}\_{n,T-1}|} \right\} \text{ and } y\_{n\,dmax} = \max\_{\mathbf{t}\in\mathcal{T}} \left\{ \frac{|y\_{n,t} - y\_{n,t-1}|}{|\mathbf{y}\_{n,T} - \mathbf{y}\_{n,T-1}|} \right\}.$$

$$F\_{s,T}^{(n)} = \frac{\mathbf{x}\_{n\,smax} - y\_{n\,smax}}{\mathbf{y}\_{n\,smax}}$$

$$F\_{f,T}^{(n)} = \frac{1}{N-1} \sum\_{k} \frac{\mathbf{x}\_{n\,smax} - \mathbf{x}\_{k\,smax}}{\mathbf{x}\_{k\,smax}} \tag{9}$$
 $\text{where } \mathbf{x}\_{n\,smax} = \max\_{\mathbf{t}\in\mathcal{T}} \left\{ \frac{\mathbf{x}\_{\mathbf{t}\,t} - \mathbf{x}\_{\mathbf{t}\,t}}{\mathbf{x}\_{n}} \right\} \text{ and } y\_{n\,smax} = \max\_{\mathbf{t}\in\mathcal{T}} \left\{ \frac{y\_{n\,s} - y\_{n,t}}{\mathbf{x}\_{n}} \right\}.$ 

where *xn smax* = max*t*∈*<sup>T</sup> σ***x***n*,*<sup>T</sup>* and *yn smax* = max*t*∈*<sup>T</sup> σ***y***n*,*<sup>T</sup>*

#### *3.3. Clustering*

Because the condition of the PV system is not labeled in the real environment, unsupervised learning was applied in this study. K-means clustering is the simplest method and requires low computation based on calculation of the Euclidean distance. The objective function of K-means clustering, *J*, is given by Equation (10). Determining the number of clusters and importing the initial center have a crucial impact on the clustering accuracy. In our case, we set the number of clusters as two to distinguish between normal and fault patterns, and they were used in the imputation part for filtering the PV site candidates of the explanatory variables. Initial centroids were set to zero vectors and one vectors multiplied by 0.5 for the normal and fault condition, respectively. Optimal features were selected by exhaustive search for the objective function as follows:

$$J = \sum\_{j=1}^{s} \sum\_{n=1}^{N} \left|| F s^{(n)} - c\_j ||^2 \right. \tag{10}$$

where *s* is the number of clusters (*s* = 2, binary condition for the fault or normal state), *cj* is the center of cluster *j*, and *Fs*(*n*) are the selected features of site *n*. Feature selection results are shown in Section 5.

#### *3.4. Imputation*

Once the PV profile was labeled as a fault pattern, the fault data were restored using the regression method. In [49], the time series analysis of various meteorological data related to renewable energy systems was conducted using a statistical method. We built a linear regression model based on the linear minimum mean squared error (LMMSE) to estimate the regression parameter, as follows:

$$P = \beta X + \epsilon \tag{11}$$

$$\beta = (X^TX)^{-1}X^TP\tag{12}$$

where *P* is the expected imputation profile at the candidate site, *β* is the regression coefficient, *X* is the set of explanatory datasets, and is the random error.

In this study, several cases were studied to compare the explanatory dataset for the regression model and to investigate the impact of the labeling of neighboring PV profiles at the previous detection stage, as demonstrated in Table 3. In Case 1, we assumed that only PV historical power output data were available without any other measurements. In this case, we built an AR model for a single PV site to focus on its periodic power generation behavior. Case 2 exclusively considers local irradiance data. For our dataset, solar irradiance data provided at site 3 were used identically to construct the univariate simple regression model. Case 3 allows a neighbor PV profile that has unknown operation states in each PV system, and all of them are inputted into the multivariate regression model. Case 4 conducted a multiple regression model that was similar to Case 3 but the explanatory data were refined with the normal pattern classified at the previous detection stage. Case 5 merged Cases 2 and 4 that utilizes both local irradiance data and normal PV fleet data. Furthermore, as several previous works have confirmed that kNN is an efficient method for missing data imputation, we adopted kNN for fault data imputation as Case 6.

**Table 3.** Imputation case study.


#### **4. Simulations**

#### *4.1. Fault Pattern Injection*

To evaluate the performance, we primarily sorted 159 days of PV generation patterns for all sites for the normal condition simultaneously. The overall normal dataset comprised 159 × 13 = 2067 PV power profiles. To characterize the fault patterns, the actual fault patterns detected from different PV plants were investigated. Six fault patterns were extracted, which were composed of whole zero, part zero, whole shift, part shift, constant padding, and spike, as shown in Figure 4. The faults were injected arbitrarily into a single or multiple power profiles of PV sites, thus retaining target labels to describe the performances of fault detection and imputation.

#### *4.2. Evaluation Metric*

#### 4.2.1. Fault Detection Performance Metric

A confusion matrix is widely used to evaluate classification performance, as depicted in Table 4. It matches the output label and target label for testing samples one by one; thus, target output label sets are sorted as true positive (*TP*), false negative (*FN*), false positive (*FP*), and true negative (*TN*). Based on these attributes, accuracy is derived as the ratio of correctly classified samples among all samples. The error rate is the opposite of accuracy, which indicates the error proportion for all samples. Precision is the ratio of *TP*s for all positives that are categorized in the test phase. Lastly, recall indicates the ratio of *TP*s out of all pre-labeled positives, as follows:

**Table 4.** Confusion matrix.


$$\begin{aligned} Accuracy &= \frac{TP + TN}{P + N} \\ Errorrate &= \frac{FP + FN}{P + N} \\ Precision &= \frac{TP}{TP + FP} = \frac{TP}{P'} \\ Recall &= \frac{TP}{TP + FN} = \frac{TP}{P} \end{aligned} \tag{13}$$

#### 4.2.2. Imputation Performance Metric

To evaluate the accuracy of imputation performance, normalized root mean square error (NRMSE) is adopted since the PV capacities in the fleet are different. In this study, RMSE was normalized by mean

as described in Equations (14) and (15) Low NRMSE implies valuable imputation performance where abnormal pattern is restored similar to the real profile. The daily abnormal pattern for a corresponding site is then substituted by the reconstructed normal pattern.

$$RMSE = \sqrt{\frac{\sum\_{t \in T} (\mathbf{x}\_t - \mathbf{x}\_t)^2}{T\_p}} \ [k\mathsf{W}] \tag{14}$$

$$NRMSE = \frac{RMSE}{\overline{\mathbf{x}}\_{t \in T}} \tag{15}$$

where **x***t*∈*<sup>T</sup>* is the mean of **x** during operation hours.

#### **5. Results and Discussion**

#### *5.1. Feature Selection*

The feature characterized in Section 3.2 can be assembled in various combinations. To determine the optimal parameter combination, an exhaustive search was followed by division into two scenarios: In one, combinations from *Fs* were selected and the other involved *Ff* . Tables 5 and 6 list the detection performance for all datasets that included 2067 feature sets and labels with and without neighboring PV data, respectively. Accuracy, error rate, precision, and recall for all the combinations were surveyed. We selected high accuracy with a similar portion of false alarm and missing fault to increase detection accuracy and prevent inclination to one side. Features 1, 2, and 5 were selected from *Fs* in the first scenario. However, the combination of features 1, 5, and 7 from *Fs* and 1, 2, 3, and 5 from *Ff* were selected for the second scenario to detect the fault pattern.

**Table 5.** Feature selection from *Fs*.


**Table 6.** Feature selection from the combination of *Fs* and *Ff* .


#### *5.2. Analysis of Fault Detection Results*

To validate the proposed clustering-based method compared with previous approaches, SolarClique [50] was implemented, and the obtained results were evaluated. To give a brief information of SolarClique, it detects anomalies on the basis of prediction by employing solar generation data from

geographically nearby sites. Each candidate site is predicted for 100 iterations and implemented by random forest with bootstrapping. Then, the local factor is excluded from prediction error and used to detect anomalies. The data are categorized into fault when the prediction error is outside the threshold, i.e., 4-sigma.

By performing numerical evaluation, the overall results of fault detection are indicated in Table 7. The fault identification by the proposed method with and without PV fleet data show accuracies of 0.9874 and 0.9777, respectively, which are more accurate than accuracy of 0.9247 obtained by the SolarClique technique. In the same context, the detection error rates derived by clustering with and without PVs are 0.0126 and 0.0223 which are smaller than 0.0753 of error rate obtained by SolarClique. Precision and recall for the proposed method are also improved, compared with those of SolarClique. This is because the prediction-based approach is sensitive to random fluctuations that depend on subtle unknown changes; the detection depends only on a threshold applied to the univariate component. As a result, clustering-based fault detection using neighboring PVs shows the best detection performance.

**Table 7.** Fault detection comparison with SolarClique.


Fault detection performances for each site are provided in Table 8. Fault detection tested for all sites in a campus fleet showed a high accuracy rate that exceeded 98%, except for site 12. For interpretation, normal profiles and injected profiles that are distinguished as a fault are depicted in Figure 5. The green-shaded profile shown in Figure 5c indicates the missing fault that the clustering machine identified from an abnormal condition as a normal status. In contrast, the yellow-shaded area in Figure 5b represents false positives or false alarms with wrong identification despite the normal condition. However, the normal condition is not guaranteed in this case in that the false alarm profile seemed to spike, which showed a disparate profile with irradiance compared with Figure 5a. Hence, these false positives should be reassigned as true negatives to provide a more accurate detection rate.


**Table 8.** Fault detection of individual sites.

(**c**) Fault detection

**Figure 5.** Profiles of PV power and solar irradiance for (**a**) normal, (**b**) fault injected, and (**c**) fault detected days.The green-shaded profile in (**b**) denotes a missing fault and the yellow-shaded profiles in (**c**) correspond to false alarms occurring at site 12.

#### *5.3. Analysis of Fault Imputation Result*

The imputation accuracy computed with the NRMSE metric is presented in Table 9. As generally expected, most of the sites in the PV fleet show improved imputation performance when using each other self-labeled PV data, corresponding to Cases 4 or 5 as their overall NRMSEs are the lowest at 0.2359 and 0.2367, respectively. Comparing to Case 2 (only SWS data utilized), where overall imputation accuracy is 0.2925, the improvement is shown to be 19.35% and 19.08%, respectively. Likewise compared to Case 3 (PV without label) where NRMSE is 0.2888, Cases 4 and 5 have been improved by 18.32% and 18.04%, respectively. In addition, as depicted in Figure 6, Cases 4 and 5 are shown to have slightly tighter deviations in comparison with the other cases, which means more stable. In the following we present the details for each imputation case.

Case 1 supposed that only historical PV data from a certain site is applicable, which means that information is restricted when considering weather variations. Therefore, an AR model was constructed in this case that focused only on the periodicity of the operation time without reflecting climatic properties. The imputation result showed insufficient performance compared to the other case studies for all sites in the fleet. Case 2 was a general case that applied the nearest meteorological data. The weather stations were lacking in the fleet, which resulted in representative station data being used after all. We concluded that when the PV site was close to the weather station, Case 2 showed better performance than when using surrounding PV data. For example, the imputation results from sites 3 and 4 that belonged to G2 were the best in this case. Case 3 used additional data obtained from neighboring PV systems beyond the labels of the system conditions. Because abnormal patterns were not removed in the previous stage, they were fed into the imputation phase as provided. This case was feasible only when the assumption that faults do not occur was satisfied. Case 4 was used to examine the effectiveness of a neighbor PV generation profile with labeled status. Because normal profiles were selected as candidates of explanatory data for the regression model, they showed reliable imputation results. In particular, PV sites from G4 showed improved results to those used in Case 2. This was because, even though the groups were slightly away from the weather station, they were crowded collectively, which provided more relevant information on the PV power output than distant weather data. As mentioned earlier, Case 5 merged Cases 2 and 4. It is observed that only sites in close distance to SWS have effects on combining SWS data to neighbor PVs. Otherwise, it did not appear to have a significant effect in faraway PVs despite of belonging to PV fleet. This demonstrates that even neighboring PVs in the same fleet have subtle differences in solar irradiance. The kNN in Case 6 shows low performance in PV fleet fault imputation. Since it depends on similar historical patterns of solar irradiance, it may not reflect random fluctuation especially in overcasting day which results in low imputation accuracy.

**Table 9.** Imputation normalized root mean square error (NRMSE).


**Figure 6.** Boxplot of imputation NRMSE for each imputation case.

#### **6. Conclusions**

This paper presents a framework for PV fault detection and imputation method on PV fleets without the manual annotation of the state of the PV systems. We supplement the meteorological data measured at LWS, which had a relatively low value for the cross-correlation with PV generation, using the neighboring PV fleet data. Several features were derived to be used as input for K-means clustering to label normal or abnormal patterns. PV fleet on the campus and solar irradiance data measured at one of the PV sites in the fleet were utilized to extract the features for fault pattern detection. We arbitrarily injected a fault pattern based on actual observations, and the detection accuracy was evaluated using a confusion matrix. The detection error rate was compared for three cases: Using SolarClique (a conventional prediction-based detection method), a clustering-based method without PV fleet data, and a clustering-based method with PV fleet data which is the proposed method. The error rates for these three cases were 0.0753, 0.0223, and 0.0126, which means that the proposed clustering-based detection using neighboring PV fleet data can effectively detect the faults.

Data imputation was conducted for the distinguished abnormal patterns. Five cases of regression-based imputation and kNN were evaluated by NRMSE. In general, Cases 4 and 5, which utilized neighboring self-labeled PV data, showed better imputation performance than imputation without nearby sites or without labeled PV data by reducing NRMSE over 19% and 18%, respectively. In addition, according to earlier grouping information based on cross-correlation analysis, G3, which were close SWS, generally showed better performance when only solar irradiance data were used. However, the imputation result for G3 sites and neighboring PV profiles provided more relevant information than weather data obtained at SWS that was relatively far away. In summary, neighboring PV data are effective in improving fault detection and imputation accuracy in a dense PV fleet.

**Author Contributions:** All authors contributed to this work by collaboration. S.P. (Sunme Park) developed the idea and methodologies and is the first author in this manuscript. S.P. (Soyeong Park) reviewed previous literature and conducted data analysis for implementation, M.K. conducted detection performance comparisons and E.H. led and supervised the research and is the corresponding author. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the ETRI R&D program (19ZK1140) and the government of Korea and by the Korea Institute of Energy Technology Evaluation and Planning (KETEP).

**Acknowledgments:** This work was supported by the ETRI R&D program (19ZK1140), funded by the government of Korea and by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry and Energy (MOTIE) of the Republic of Korea (No. 20171210200810).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


c 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Energies* Editorial Office E-mail: energies@mdpi.com www.mdpi.com/journal/energies

MDPI St. Alban-Anlage 66 4052 Basel Switzerland

Tel: +41 61 683 77 34 Fax: +41 61 302 89 18