#### *4.2. Data Preparation*

In general, data need to be pre-processed so that they are in a proper format and free of irregularities such as missing values, outliers, and inaccurate data values. Missing values are typical in any dataset; they may arise during data collection, for example from sensor-connection issues, and must be handled by dropping their rows, estimating their values, or replacing them. In our case, missing values accounted for less than 1% of the total dataset, so dropping their rows was a safe choice. Outliers and noisy data emerge from data-entry and transmission errors. We discovered one outlier for "PV Energy", which we handled by smoothing its value.

Data scaling is typically required because many ML algorithms perform more accurately and converge faster when attributes are on a moderately similar scale and close to normally distributed. In this work, standardization (see Equation (1)) was applied to rescale data to have a mean *μ*(**A**,**p**) of zero and a standard deviation *σ*(**A**,**p**) of one, where the scaled **p** is shown in Table 6.

$$(a, p)\_{i,\text{scaled}} = \frac{(a, p)\_{i} - \mu\_{(\mathbf{A}, \mathbf{p})}}{\sigma\_{(\mathbf{A}, \mathbf{p})}} \tag{1}$$
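The standardization step of Equation (1) can be sketched as follows; the array `data` and its column layout are illustrative placeholders, not the actual dataset:

```python
# Sketch of the standardization step (Equation (1)), assuming the sensor
# readings are held in a NumPy array `data` with one column per attribute;
# the shapes and values here are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=25.0, scale=5.0, size=(100, 3))  # e.g. temperature, humidity, solar

mu = data.mean(axis=0)        # per-attribute mean  mu_(A,p)
sigma = data.std(axis=0)      # per-attribute std   sigma_(A,p)
scaled = (data - mu) / sigma  # zero mean, unit variance per column

assert np.allclose(scaled.mean(axis=0), 0.0, atol=1e-9)
assert np.allclose(scaled.std(axis=0), 1.0, atol=1e-9)
```

Each attribute is rescaled independently, so attributes measured in very different units end up on the same scale.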

#### *4.3. Feature Selection*

Feature selection is one of the core concepts in ML and profoundly affects the model's performance. Its principal objective is to select the feature set with minimum cardinality while maximizing the learning performance. We believe that, when predicting generated power in the PV system, not every feature equally contributes to the prediction performance. Features can be relevant, partially relevant, or even irrelevant. Feature selection algorithms aim to assign weight to each feature according to its pertinence. As illustrated in Figure 5, in this study, we applied two approaches to score each feature, namely, Pearson's correlation coefficient [36] (see Equation (2)) and Information Gain [37] (see Equation (3)). The former measures the amount of correlation between each variable and the target, while the latter quantifies the amount of information provided to the class by evaluating the impurity level of each variable using the entropy *H*(·) with respect to the target.

$$r\_{a,p} = \frac{\sum\_{i=1}^{n} (a\_i - \bar{a})(p\_i - \bar{p})}{\sqrt{\sum\_{i=1}^{n} (a\_i - \bar{a})^2} \sqrt{\sum\_{i=1}^{n} (p\_i - \bar{p})^2}} \tag{2}$$

$$IG(p, a) = H(p) - \sum\_{v \in Values(A)} \frac{|p\_v|}{|p|} H(p\_v) \tag{3}$$

Relevant attributes should be assigned a higher score than less relevant ones. Features were selected by correlating all input sensor parameters with the PV-generated power **p**. Pearson's correlation coefficient (Equation (2)) was used to evaluate the correlation between the sensor parameters and the PV-generated power, where *n* is the number of observations, *ai* and *pi* are the single observation points indexed by *i*, and *a*¯ and *p*¯ are the observation means. A strongly positive or negative correlation suggests higher predictive value, because an increase in the attribute's value increases or decreases the generated power, whereas a zero correlation coefficient indicates no relation. Figure 6 shows the correlation of each attribute with the generated power. The Solar Average has the strongest positive correlation (+ve) at 88%, while the Out Humidity has the strongest negative correlation (−ve) at about −42%. Meanwhile, the rain rate, rain, and arc exhibited zero correlation. Furthermore, redundant features that are directly derived from the generated power, such as Voltage, Current, PV Energy, and Solar Energy, were dropped, reducing the number of attributes to *m* = 38.
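The two scoring schemes of Equations (2) and (3) can be sketched on synthetic data; the feature names below are hypothetical stand-ins for the real sensor attributes, and the information gain is computed after discretizing the continuous feature into quantile bins:

```python
# Toy illustration of the two feature scores: Pearson's r (Equation (2))
# and information gain (Equation (3)) on synthetic sensor-like data.
import numpy as np

def entropy(labels):
    # Shannon entropy H(.) of a discrete label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, target_bins, n_splits=4):
    # IG(p, a) = H(p) - sum_v |p_v|/|p| * H(p_v), after discretizing the
    # continuous feature into quantile bins (a common approximation)
    edges = np.quantile(feature, np.linspace(0, 1, n_splits + 1)[1:-1])
    groups = np.digitize(feature, edges)
    ig = entropy(target_bins)
    for v in np.unique(groups):
        mask = groups == v
        ig -= mask.mean() * entropy(target_bins[mask])
    return ig

rng = np.random.default_rng(1)
solar_avg = rng.uniform(0, 1000, 500)          # strongly related to power
rain_rate = rng.uniform(0, 5, 500)             # unrelated to power
power = 0.9 * solar_avg + rng.normal(0, 20, 500)
p_bins = np.digitize(power, np.quantile(power, [0.25, 0.5, 0.75]))

r_solar = np.corrcoef(solar_avg, power)[0, 1]
r_rain = np.corrcoef(rain_rate, power)[0, 1]
assert r_solar > 0.9 > abs(r_rain)
assert information_gain(solar_avg, p_bins) > information_gain(rain_rate, p_bins)
```

The relevant feature scores highly under both criteria, while the irrelevant one scores near zero, mirroring the behavior described for Solar Average versus rain rate.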

**Figure 6.** Correlation Plots.

To evaluate the similarity between the two ranked feature sets, *r* produced by *ra*,*p* and *r*¯ produced by *IG*(*p*, *a*), Spearman's rank correlation coefficient [38] (see Equation (4)) was used to assess the significance of the relationship between them.

$$S\_R(r, \bar{r}) = 1 - \frac{6 \sum\_{i} (r\_i - \bar{r}\_i)^2}{m(m^2 - 1)} \tag{4}$$

Spearman's rank correlation coefficient lies in the range [−1, 1]. The maximum value is reached when the two rankings are identical, the minimum when they are exactly reversed, and zero means no correlation between *r* and *r*¯. After measuring the stability of the two feature sets, we found them to be stable, with a value of 0.96. Figure 7 compares the two ranked feature lists, where the *x*-axis and the *y*-axis represent Pearson's correlation coefficient and the information gain of the features, respectively; the fitted line shows the stability between them.
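Equation (4) can be implemented directly from the rank-difference formula; the two score lists below are made-up examples, not the paper's actual feature scores:

```python
# Spearman's rank correlation (Equation (4)) computed from scratch:
# convert scores to ranks, then apply S_R = 1 - 6*sum(d^2)/(m(m^2-1)).
import numpy as np

def spearman(scores_a, scores_b):
    # argsort of argsort turns raw scores into 0-based ranks
    ra = np.argsort(np.argsort(scores_a))
    rb = np.argsort(np.argsort(scores_b))
    m = len(ra)
    return 1 - 6 * np.sum((ra - rb) ** 2) / (m * (m**2 - 1))

# hypothetical Pearson and information-gain scores for five features;
# ranking by absolute value, since both measure strength of association
pearson_scores = np.abs([0.88, -0.42, 0.35, 0.10, 0.02])
ig_scores = np.array([0.90, 0.45, 0.30, 0.12, 0.01])

s = spearman(pearson_scores, ig_scores)
assert s == 1.0  # identical orderings give the maximum value
```

Reversing one of the rankings would instead drive the coefficient to −1, the minimum of the range.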

Backward elimination was applied after Pearson's correlation coefficient was calculated, in order to select the most appropriate attributes. We started with the complete set of attributes and recursively removed one attribute per iteration, namely the attribute with the lowest absolute correlation coefficient |*ra*,*p*|. At each iteration, we evaluated the loss using the remaining set of features, proceeding from the lowest correlated attribute to the highest until only one attribute remained.
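The elimination loop described above can be sketched as follows; the synthetic data, the simple least-squares loss, and the four-feature setup are all illustrative assumptions:

```python
# Minimal sketch of correlation-guided backward elimination: rank features
# by |r_{a,p}|, then repeatedly drop the lowest-ranked one and re-evaluate
# the loss on the remaining set. Data and loss are toy stand-ins.
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.1, n)  # cols 2, 3 irrelevant

def fit_mse(cols):
    # least-squares fit on the selected columns, return training MSE
    A = np.column_stack([X[:, cols], np.ones(n)])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((A @ w - y) ** 2)

corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(4)]
remaining = sorted(range(4), key=lambda j: corr[j], reverse=True)
history = []
while remaining:
    history.append((list(remaining), fit_mse(remaining)))
    remaining.pop()  # drop the attribute with the lowest |r| each iteration

# the loss barely changes while only irrelevant columns are being dropped
assert history[0][1] < 0.02 and history[1][1] < 0.02
```

Inspecting `history` shows at which point dropping a further attribute starts to hurt, which is how the reduced feature set is chosen.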

**Figure 7.** Spearman's Rank Correlation Coefficient.

#### *4.4. Model Selection and Evaluation*

The selection of appropriate ML algorithms to predict the amount of generated power **pˆ** from the sensors' readings is challenging, because each ML model performs differently on the same dataset, depending on the model's nature. A number of ML models need to be trained and tested to select the best-performing one. Prior to training, the dataset must be divided into a training set, used to build the model by fitting it to the extracted features, and a testing set, used to validate the model by predicting the outcome on unseen data. There are numerous methods of splitting the dataset, such as hold-out and cross-validation. As illustrated in the first part of Figure 5, in this experiment we used k-fold cross-validation with *k* = 10, which is known to reduce overfitting, improve generalization, and offer a better bias-variance trade-off. The models are therefore expected to perform comparably on unseen data and on the training data.
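The k-fold split can be written out in plain index arithmetic; the dataset size below is arbitrary:

```python
# Sketch of a 10-fold cross-validation split: shuffle the indices once,
# cut them into k folds, and use each fold exactly once as the test set.
import numpy as np

n, k = 1000, 10
indices = np.arange(n)
rng = np.random.default_rng(3)
rng.shuffle(indices)
folds = np.array_split(indices, k)

for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # train and test partitions always cover the whole dataset
    assert len(test_idx) + len(train_idx) == n

all_test = np.sort(np.concatenate(folds))
assert np.array_equal(all_test, np.arange(n))  # every sample tested once
```

Because every sample serves in a test fold exactly once, the averaged fold scores estimate generalization error with less variance than a single hold-out split.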

Many classical and modern regression and prediction models were examined in this study to estimate the generated power from the PV system. These include LASSO, RF, LR, PR, XGBoost, SVM, and DL.

The LR model [39] (see Equation (5)) is one of the simplest ML models, used to find a linear relationship between the generated power **p** and the input parameters **A**. Taking **y** as the response value lying in the best-fit regression plane, the intercept *b* in Equation (6) is the reference position of the plane, and **x**1, ... , **x***m* are the *m* predictor variables drawn from the most effective attributes; *w*1, ... , *wm* in Equation (7) are the slope coefficients. The response variable is the generated power **p**, and the predictor variables are selected from the most effective attributes **A**. Equation (5) can also express all the data points in matrix form (see Equation (8)). Next, PR [40] (see Equation (9)) is a well-known algorithm applied when the data are correlated but the relationship is non-linear. It is a particular case of LR in which polynomial attributes are created to fit the polynomial equation, where *d* is the PR degree. LASSO [41] is likewise a type of LR model trained with an L1 regularizer in the loss function, $J(w)\_{L\_1} = \frac{1}{n} \sum\_{i=1}^{n} \left( f\_w(\mathbf{x}\_i) - y\_i \right)^2 + \lambda \sum\_{j=1}^{m} |w\_j|$, to reduce overfitting by applying shrinkage: data values are shrunk toward a central point, with *λ* denoting the amount of shrinkage. It is well-suited to data that exhibit high multi-collinearity and to models with few parameters.

$$y \leftarrow f\_w(\mathbf{x}) = b + w\_1 \mathbf{x}\_1 + \dots + w\_m \mathbf{x}\_m \tag{5}$$

$$b = \frac{\left(\sum\_{i=1}^n y\_i\right)\left(\sum\_{i=1}^n \mathbf{x}\_i^2\right) - \left(\sum\_{i=1}^n \mathbf{x}\_i\right)\left(\sum\_{i=1}^n \mathbf{x}\_i y\_i\right)}{n\left(\sum\_{i=1}^n \mathbf{x}\_i^2\right) - \left(\sum\_{i=1}^n \mathbf{x}\_i\right)^2} \tag{6}$$

$$w\_m = \frac{n\left(\sum\_{i=1}^n \mathbf{x}\_i y\_i\right) - \left(\sum\_{i=1}^n \mathbf{x}\_i\right)\left(\sum\_{i=1}^n y\_i\right)}{n\left(\sum\_{i=1}^n \mathbf{x}\_i^2\right) - \left(\sum\_{i=1}^n \mathbf{x}\_i\right)^2} \tag{7}$$

$$
\begin{pmatrix} y\_1 \\ y\_2 \\ \vdots \\ y\_n \end{pmatrix} = \begin{pmatrix} b \\ b \\ \vdots \\ b \end{pmatrix} + \begin{pmatrix} x\_{1,1} & x\_{1,2} & \cdots & x\_{1,m} \\ x\_{2,1} & x\_{2,2} & \cdots & x\_{2,m} \\ \vdots & \vdots & & \vdots \\ x\_{n,1} & x\_{n,2} & \cdots & x\_{n,m} \end{pmatrix} \begin{pmatrix} w\_1 \\ w\_2 \\ \vdots \\ w\_m \end{pmatrix} \tag{8}
$$

$$y = b + w\_1 x\_1 + w\_2 x\_1^2 + \dots + w\_d x\_1^d \tag{9}$$
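The contrast between the linear fit of Equation (5) and the polynomial fit of Equation (9) can be shown on a toy curve; the coefficients and noise level below are made up:

```python
# Toy comparison of linear regression (Equation (5), m = 1) and polynomial
# regression (Equation (9), d = 2) on synthetic data with real curvature.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 200)
y = 2.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 0.5, x.size)

w_lin = np.polyfit(x, y, 1)   # y = b + w1*x
w_poly = np.polyfit(x, y, 2)  # y = b + w1*x + w2*x^2

mse_lin = np.mean((np.polyval(w_lin, x) - y) ** 2)
mse_poly = np.mean((np.polyval(w_poly, x) - y) ** 2)
assert mse_poly < mse_lin  # the quadratic captures the curvature
```

This is the situation PR targets: the variables are correlated, but a straight line systematically under- and over-shoots, while the degree-*d* polynomial absorbs the non-linearity.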

An RF [42] is an ensemble of randomized regression trees that combines the predictions of multiple trees to produce more accurate predictions and control overfitting. XGBoost [43] has become one of the most popular ML algorithms in recent years. It belongs to the family of boosting algorithms known as gradient-boosted decision trees (GBDT), a sequential ensemble technique that combines a set of weak learners to deliver increased prediction accuracy. The most prominent difference between XGBoost and plain GBDT is that the former uses advanced regularization, such as L1 (LASSO) and L2 (Ridge), making it faster and less prone to overfitting. An SVM [44] (see Equation (10)) performs a non-linear mapping of the training data into a higher-dimensional space through a kernel function *φ*, in which a linear regression can then be performed; the kernel choice determines how efficient the model is. The radial basis function (RBF), $e^{-\gamma \|\mathbf{x} - \mathbf{y}\|^2}$, is used as the mapping kernel function.

$$f\_w(\mathbf{x}) = \sum\_{i=1}^n w\_i^T \phi(\mathbf{x}^i) + b \tag{10}$$
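The RBF kernel mentioned above can be written out explicitly; the value of *γ* and the sample points are arbitrary choices for the sketch:

```python
# The RBF kernel e^{-gamma * ||x - y||^2}: similarity decays with the
# squared Euclidean distance between two points; gamma sets the decay rate.
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

a = np.array([1.0, 2.0])
b = np.array([1.0, 2.0])
c = np.array([5.0, 7.0])
assert rbf_kernel(a, b) == 1.0  # identical points: maximal similarity
assert rbf_kernel(a, c) < 1e-6  # distant points: near-zero similarity
```

Because the kernel depends only on distances, the SVM never computes *φ* explicitly; it works entirely through these pairwise similarities.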

NNs [45,46] have been extensively applied to solve numerous challenging AI problems. They surpass traditional ML models thanks to their non-linearity, variable interactions, and customizability. Building an NN starts with the perceptron: in simple terms, the perceptron receives inputs, multiplies them by weights, and passes the result through an activation function such as the rectified linear unit (ReLU) to generate an output. NNs are designed by stacking layers of these perceptrons together, in what is known as a multi-layer perceptron model. An NN has three kinds of layers: input, hidden, and output. The input layer receives the data directly, whereas the output layer produces the required output; the layers in between, called hidden layers, are where the intermediate computation takes place.
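The layered structure just described can be sketched as a single forward pass; the layer sizes and random weights are placeholders, not a trained model (the input width 38 matches the reduced attribute count *m* = 38):

```python
# Minimal forward pass of a multi-layer perceptron: one hidden ReLU layer
# and a linear output neuron. Weights are random placeholders, untrained.
import numpy as np

rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=(38, 16)), np.zeros(16)  # input  -> hidden
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)    # hidden -> output

def relu(z):
    return np.maximum(0.0, z)

def forward(x):
    h = relu(x @ W1 + b1)  # hidden layer: weighted sum + ReLU activation
    return h @ W2 + b2     # output layer: linear, the predicted power

x = rng.normal(size=(1, 38))  # one standardized sensor reading, m = 38
p_hat = forward(x)
assert p_hat.shape == (1, 1)
```

Training would then adjust `W1`, `b1`, `W2`, `b2` by backpropagating the loss gradient, which this sketch omits.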

Model evaluation is a critical ML task: it quantifies and validates the model's performance, makes the model easier to present to others, and ultimately guides the selection of the most suitable model. There are various evaluation metrics, but only a few of them apply to regression. In this work, the mean squared error (MSE), the most common metric for regression tasks, is used to compare the models' results. The MSE (see Equation (11)) is the average of the squared difference between the predicted power **pˆ** and the actual power **p**. It penalizes large errors and is convenient for optimization, as it is differentiable and convex.

$$MSE = \frac{1}{n} \sum\_{i=1}^{n} (p\_i - \hat{p}\_i)^2 \tag{11}$$
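Equation (11) translates directly into code; the actual and predicted values below are toy numbers:

```python
# MSE (Equation (11)): the mean of the squared residuals between actual
# power p and predicted power p_hat. Values are illustrative only.
import numpy as np

p = np.array([3.0, 5.0, 2.0])      # actual power
p_hat = np.array([2.5, 5.5, 2.0])  # predicted power
mse = np.mean((p - p_hat) ** 2)    # (0.25 + 0.25 + 0.0) / 3
assert abs(mse - 1 / 6) < 1e-12
```

Squaring the residuals makes a single large miss cost far more than several small ones, which is why MSE favors consistently close predictions.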

Figure 5 schematically presents the overall AI system and methodology used in the research and delineates all the steps from data collection until the computation of predicted power.
