**2. Feature Selection**

According to the previous analysis, the daily generating energy of a photovoltaic power station is related to environmental factors, and these environmental factors are also correlated with one another. Therefore, quantifying these correlations and selecting only the appropriate environmental factors as the input dataset can reduce the computational complexity of the prediction models.

Generally, environmental factors such as daily average temperature, maximum temperature, minimum temperature, daily sunshine duration, average cloud cover, average humidity, minimum humidity, and precipitation from 8:00 a.m. to 8:00 p.m. can affect power generation. Because these factors constitute the input feature vector, including more of them raises the dimensionality of the input, and the computational complexity increases greatly. To reduce the computational complexity, the environmental factors should be properly selected; in this paper, the Pearson correlation coefficient is used to evaluate the correlation between each environmental factor and the generating energy.

The Pearson correlation coefficient is a value between −1 and 1 that measures the strength and direction of the linear relationship between two datasets. For two random variables *X* and *Y*, the Pearson correlation coefficient can be expressed by:

$$\rho_{XY} = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\sqrt{E(Y^2) - E^2(Y)}} \tag{1}$$

where *cov*(*X*, *Y*) is the covariance between *X* and *Y*; *σ<sub>X</sub>* and *σ<sub>Y</sub>* are the standard deviations of *X* and *Y*, respectively; and *E*(·) denotes the expectation of a random variable.

In this paper, the Pearson correlation coefficient between each environmental factor and the generating energy is calculated from the historical samples by:

$$r = \frac{N\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{N\sum x_i^2 - \left(\sum x_i\right)^2}\sqrt{N\sum y_i^2 - \left(\sum y_i\right)^2}}\tag{2}$$

where *r* is the Pearson coefficient; *x<sub>i</sub>* and *y<sub>i</sub>* are the value of the environmental factor and the corresponding generating energy for the *i*-th sample, respectively; and *N* is the number of historical data samples.
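As an illustration of Equation (2), the following minimal sketch evaluates the sum form directly in plain Python. The function name `pearson_r` and the sample values are assumptions made for this example only; they are not taken from the Suzhou dataset.

```python
import math


def pearson_r(x, y):
    """Sample Pearson coefficient computed with the sum form of Eq. (2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        # A constant series has zero variance, so r is undefined.
        raise ValueError("correlation is undefined for a constant series")
    return numerator / denominator


# Made-up illustrative values, not measurements from the paper.
sunshine_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
energy = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(sunshine_hours, energy)
print(f"r = {r:.4f}")
```

For long series with large values, the sum form can lose precision through cancellation; a mean-centered formulation or a library routine may be preferable in practice.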

Hence, to select the optimal environmental factors for the input dataset, the Pearson coefficients between the environmental factors and the generating energy were calculated from data obtained from a photovoltaic power station in Suzhou, China (Supplementary Materials). The results are shown in Table 1.


**Table 1.** Pearson coefficients between environmental factors and generating energy.

According to Pearson coefficient theory, the closer a coefficient is to 1, the stronger the positive linear correlation between the factor and the generating energy, which makes the factor suitable as an input feature for predicting the generating energy. As can be seen in Table 1, factors such as average humidity, minimum humidity, precipitation from 8:00 a.m. to 8:00 p.m., and average cloud cover can be filtered out because they have a weak correlation with the generating energy. Hence, the remaining four environmental factors (daily average temperature, maximum temperature, minimum temperature, and daily sunshine duration) compose the input feature vector, which means the data feature vectors are four-dimensional.
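The selection step can be sketched as a simple filter on the computed coefficients. The paper does not state a numerical cutoff, so the `THRESHOLD` value, the helper names, and the data below are hypothetical and serve only to show the mechanics.

```python
import math


def pearson_r(x, y):
    """Sample Pearson coefficient (sum form of Eq. (2))."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    numerator = n * sum(a * b for a, b in zip(x, y)) - sum_x * sum_y
    denominator = (math.sqrt(n * sum(a * a for a in x) - sum_x ** 2)
                   * math.sqrt(n * sum(b * b for b in y) - sum_y ** 2))
    return numerator / denominator


def select_features(factors, energy, threshold):
    """Keep the factors whose coefficient with energy reaches the threshold."""
    coefficients = {name: pearson_r(values, energy)
                    for name, values in factors.items()}
    selected = [name for name, r in coefficients.items() if r >= threshold]
    return selected, coefficients


# Hypothetical cutoff: the paper does not specify one.
THRESHOLD = 0.5

# Made-up illustrative values, not data from the Suzhou station.
energy = [2.0, 4.0, 5.0, 4.0, 5.0]
factors = {
    "sunshine_duration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "average_humidity": [3.0, 1.0, 4.0, 2.0, 5.0],
    "average_cloud_cover": [5.0, 4.0, 3.0, 2.0, 1.0],
}

selected, coefficients = select_features(factors, energy, THRESHOLD)
for name, r in coefficients.items():
    print(f"{name}: r = {r:+.3f}")
print("selected:", selected)
```

This sketch keeps only factors with a sufficiently large positive coefficient, following the criterion described above. Filtering on the absolute value of *r* instead would also retain strongly negatively correlated factors, which is a common alternative.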
