*2.3. Data Preprocessing*

Real data measured with automatic sensors are often incomplete or contain errors and inconsistencies, so data preprocessing is crucial. Cleaning and organisation techniques prepare the data and make them suitable for use with machine learning models. In particular, the modifications applied to the monitoring dataset focus on fixing minor inconsistencies and removing erroneous or missing data. The dataset contains many empty records. In some of the night-time slots (between 10 p.m. and 3 a.m., both inclusive), no data are available because no measurements are collected. Solar irradiation is zero at night by definition, but the same is not true of air temperature. However, the absence of night-time temperature data is not an issue: there is no photovoltaic power production at night, so this period is irrelevant. Adding night-time data would provide redundant information that would increase the complexity of the model and the calculation time without producing relevant results.

To avoid empty records that could negatively affect the learning models, rows containing null data are removed. The same procedure is applied when duplicated values or incomplete records (those in which at least one of the variables is missing) are detected. Some discrepancies are also corrected; for example, some records have a time shift of a few hours, probably due to an error in the data dump. To correct this, the times of sunrise and sunset are used as a reference. These times are determined as indicated in [27], considering the date and the latitude and longitude of the facility, and calculating the declination by Spencer's method and the sunset hour angle. Reference dawn, noon, and dusk hours for 2013 at the study location are shown in Figure 1.

**Figure 1.** Dawn, noon and dusk for the study location.
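
As an illustration of the sunrise and sunset calculation described above, the following minimal sketch evaluates Spencer's series for the solar declination and the sunset hour angle, cos(ω_s) = −tan(φ)·tan(δ). It is not the implementation of [27]: the function names are hypothetical, the results are given in apparent solar time, and the conversion to clock time (longitude correction and equation of time) is deliberately omitted.

```python
import math


def declination_spencer(day_of_year: int) -> float:
    """Solar declination in radians from Spencer's Fourier series."""
    g = 2.0 * math.pi * (day_of_year - 1) / 365.0  # day angle
    return (0.006918
            - 0.399912 * math.cos(g) + 0.070257 * math.sin(g)
            - 0.006758 * math.cos(2 * g) + 0.000907 * math.sin(2 * g)
            - 0.002697 * math.cos(3 * g) + 0.00148 * math.sin(3 * g))


def sunset_hour_angle(latitude_deg: float, day_of_year: int) -> float:
    """Sunset hour angle in radians: cos(w_s) = -tan(lat) * tan(decl)."""
    phi = math.radians(latitude_deg)
    x = -math.tan(phi) * math.tan(declination_spencer(day_of_year))
    # Clamp so that polar day/night does not raise a domain error.
    return math.acos(max(-1.0, min(1.0, x)))


def sunrise_sunset_solar(latitude_deg: float, day_of_year: int):
    """Sunrise and sunset in hours of apparent solar time (noon = 12)."""
    # The Earth rotates 15 degrees per hour.
    half_day_h = math.degrees(sunset_hour_angle(latitude_deg, day_of_year)) / 15.0
    return 12.0 - half_day_h, 12.0 + half_day_h
```

Records whose irradiation profile starts or ends far from these reference times are candidates for the time-shift correction.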

The GDAS dataset initially contains discrete values for only one in every three hours (i.e., the hours in the [00:00, 03:00, ..., 21:00] group, hereafter called 'GDAS-hours'), while the monitoring dataset contains hourly measured values. A quick way to ensure compatible time coordinates between the datasets would be to filter the monitoring dataset to remove 'non-GDAS-hours' (i.e., any hour not belonging to the GDAS-hours group). However, this coarse time discretisation would fail to capture relevant points for the solar irradiation and PV power variables. As shown in Figure 1, neither noon (maximum irradiation in clear-sky conditions) nor most dawn (irradiation start) or dusk (irradiation end) times are close to any of the GDAS-hour instants.

To avoid losing information from the monitoring dataset, values from the GDAS dataset must be interpolated over their temporal coordinates to an hourly resolution. Second-order B-splines are used to generate continuous piecewise polynomial functions for both the global horizontal irradiation and air temperature variables of the GDAS dataset. These piecewise curves are fitted to the available GDAS-hours values of temperature and irradiation. For the latter, additional fitting points are included at the dawn and dusk instants of each day (with an irradiation value of 0 W/m²). After the fitted curves are generated, values for the non-GDAS-hours instants are extracted and added to the dataset. As all the original GDAS data correspond to forecasts of either 0 or 3 h, the interpolated GDAS dataset only contains forecasts within the 0–5 h range.
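
The interpolation step can be sketched as follows, assuming SciPy's `make_interp_spline` with `k=2` as the quadratic B-spline routine (the text does not state which implementation was used). The sample values, the dawn and dusk times, and the function name are made up for illustration, and the optional clipping of negative overshoot is an added assumption rather than a step described above.

```python
import numpy as np
from scipy.interpolate import make_interp_spline


def interpolate_to_hourly(times_h, values, extra_points=(), clip_min=None):
    """Fit a quadratic interpolating B-spline and sample it every hour."""
    times_h = np.asarray(times_h, dtype=float)
    values = np.asarray(values, dtype=float)

    # Optional extra fitting points, e.g. (dawn, 0.0) and (dusk, 0.0).
    fit_t = np.concatenate([times_h, [p[0] for p in extra_points]])
    fit_v = np.concatenate([values, [p[1] for p in extra_points]])
    order = np.argsort(fit_t)

    # The abscissae must be strictly increasing, so an extra point that
    # coincides with a sample time would have to be dropped beforehand.
    spline = make_interp_spline(fit_t[order], fit_v[order], k=2)

    hourly_t = np.arange(times_h[0], times_h[-1] + 1.0)
    hourly_v = spline(hourly_t)
    if clip_min is not None:
        # Assumption: remove non-physical overshoot below clip_min.
        hourly_v = np.maximum(hourly_v, clip_min)
    return hourly_t, hourly_v


# Hypothetical single-day example (all numbers are invented).
gdas_hours = np.array([0, 3, 6, 9, 12, 15, 18, 21], dtype=float)
gdas_ghi = np.array([0, 0, 0, 310, 720, 540, 60, 0], dtype=float)  # W/m²
dawn, dusk = 7.2, 18.9  # hours, hypothetical

hours, ghi_hourly = interpolate_to_hourly(
    gdas_hours, gdas_ghi,
    extra_points=[(dawn, 0.0), (dusk, 0.0)],
    clip_min=0.0,
)
```

For air temperature the same function would be called without `extra_points` or clipping. In practice the spline would be fitted over the whole time series rather than day by day, so that the curve stays continuous across midnight.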

As mentioned, the monitoring dataset contains two solar irradiation variables, each measured at a different tilt angle (3° and 15°). The GDAS dataset, however, only contains values of irradiation on a horizontal plane (the same convention as most general-purpose automatic weather stations). As increasing differences in tilt angle cause increasing divergences in irradiation values, care must be taken when comparing irradiation variables from the two datasets. Preliminary analyses showed that the irradiation variables with tilts of 3° (the first irradiation variable from the monitoring dataset) and 0° (the one from the GDAS dataset) behaved sufficiently similarly. Thus, the irradiation measured at 15° is removed from the monitoring dataset, while that measured at a tilt of 3° is compared directly against the global horizontal irradiation from GDAS.

Once both datasets are preprocessed, they are compared to find faulty time instants. If at least one of the variables has a missing value at a given instant, that instant is removed entirely for all variables of both datasets. This reduces the number of hours of data finally available, but ensures that all hours fed into the prediction models are complete. After this and the previous filters have been applied, 11,132 valid hours of data remain. This value corresponds to 69.6% of the total hours in the temporal span of the study, a total that includes all the night hours that are not present in the monitoring data included in [23].
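
A minimal sketch of this joint filter, assuming both datasets are held as pandas DataFrames indexed by timestamp, is given below. The column names and values are hypothetical. An inner join keeps only the instants present in both datasets, and `dropna` then discards any instant in which a variable is missing.

```python
import numpy as np
import pandas as pd


def keep_complete_instants(monitoring: pd.DataFrame, gdas: pd.DataFrame) -> pd.DataFrame:
    """Return only the instants at which every variable of both datasets is present."""
    # Drop duplicated timestamps first, keeping the first occurrence.
    monitoring = monitoring[~monitoring.index.duplicated(keep="first")]
    gdas = gdas[~gdas.index.duplicated(keep="first")]
    # Inner join on the time index; suffixes disambiguate shared column names.
    merged = monitoring.join(gdas, how="inner", lsuffix="_mon", rsuffix="_gdas")
    # Remove any instant with at least one missing value.
    return merged.dropna(how="any")


# Hypothetical example data (invented values and column names).
mon_index = pd.to_datetime([
    "2013-06-01 08:00", "2013-06-01 09:00", "2013-06-01 10:00",
    "2013-06-01 11:00", "2013-06-01 12:00",
])
monitoring = pd.DataFrame(
    {"ghi_tilt3": [100.0, np.nan, 300.0, 400.0, 500.0],
     "pv_power": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=mon_index,
)
gdas_index = pd.to_datetime([
    "2013-06-01 08:00", "2013-06-01 09:00",
    "2013-06-01 10:00", "2013-06-01 12:00",
])
gdas = pd.DataFrame(
    {"ghi": [90.0, 190.0, 290.0, 490.0],
     "temperature": [15.0, 16.0, np.nan, 19.0]},
    index=gdas_index,
)

valid = keep_complete_instants(monitoring, gdas)
valid_fraction = len(valid) / len(mon_index)  # share of hours that survive
```

In this toy case only the 08:00 and 12:00 instants survive: 09:00 lacks a monitoring value, 10:00 lacks a GDAS value, and 11:00 is absent from the GDAS data altogether.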
