The data-driven expected energy model training activities were supported by Sandia National Laboratories' PV Reliability, Operations, and Maintenance (PVROM) database [23]. Information about the PVROM database, as well as the data processing, model training, and model evaluation activities, is described in the following sections.
2.2. Preprocessing
Data quality issues stemming from measurement errors and anomalous system conditions (reflecting local field failures, such as communication loss) can introduce signal variations in field data that hinder model performance. Problematic data confound the relationships between features, making it more difficult to recover the true parameter estimates; these potential irreducible errors are reduced through numerous data quality filters (Figure 2). Missing values (i.e., NaN or None values) were removed prior to applying the data quality filters. An evaluation of these missing values revealed that a majority of them (~88%) occurred during nighttime hours (~7 p.m. to 8 a.m.), indicating that some sites record nighttime entries as null (Figure 2). After removing these missing values, ~900 K data points remained, which were then subjected to a series of data quality filtering steps.
Data were filtered to ensure they fell within nominal sensor ranges, using thresholds following [24] and the IEC 61724-1 standard [15]. Wind speed was not consistently available from partners and was thus excluded from the analysis. Although available temperature data were used in the preprocessing steps, they were not used as a predictor variable in the regression models, since they are not included in current standard models [15].
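As a minimal sketch of this range-filtering step, the bounds can be applied column-wise with pandas; the specific threshold values below are illustrative placeholders, not the exact bounds used in this study (those follow [24] and IEC 61724-1):

```python
import pandas as pd

# Illustrative nominal-range bounds (hypothetical values for demonstration).
NOMINAL_RANGES = {
    "irradiance": (0.0, 1500.0),  # W/m^2, assumed plausible sensor range
    "energy": (0.0, None),        # kWh, non-negative
}

def apply_range_filters(df: pd.DataFrame, ranges: dict) -> pd.DataFrame:
    """Drop rows where any monitored column falls outside its nominal range."""
    mask = pd.Series(True, index=df.index)
    for col, (lo, hi) in ranges.items():
        if lo is not None:
            mask &= df[col] >= lo
        if hi is not None:
            mask &= df[col] <= hi
    return df[mask]
```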
Flatlining values, identified as periods where consecutive data changed by less than a threshold, were flagged for removal using the pecos package [25], which follows the IEC 61724-3 standard [26]. Specifically, periods of four consecutive hours in which values changed by less than either 0.01% of the site's capacity or an absolute threshold of 10 were filtered. Lastly, inverter clipping, which occurs when the available DC power surpasses an inverter's rating, was addressed by mathematically detecting plateaus in the energy signal using the pvanalytics package [27]. Because clipped energy measurements manifest as a static value across high irradiance levels, dropping them yields a better linear fit. After data quality checks, 429 K data points across 150 sites remained (Figure 2).
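The flatline check can be approximated with a rolling-window test in pandas; this is a simplified stand-in for the pecos check used in the study, assuming hourly data and a hypothetical per-site threshold:

```python
import pandas as pd

def flag_flatlines(energy: pd.Series, threshold: float, window_hours: int = 4) -> pd.Series:
    """Return a boolean mask that is True where `energy` ends a window of
    `window_hours` consecutive changes all smaller than `threshold`.

    Assumes an hourly time series; `threshold` would be, e.g., 0.01% of the
    site's capacity (an illustrative configuration, not the study's exact one).
    """
    small_change = energy.diff().abs() < threshold
    # Count how many of the last `window_hours` changes were below threshold.
    run = small_change.rolling(window_hours).sum()
    return run == window_hours

# Usage: drop flagged timestamps before training.
# clean = df[~flag_flatlines(df["energy"], threshold=0.0001 * site_capacity)]
```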
Data points that passed all quality checks were also assessed for system-level anomalies. These anomalies likely reflect abnormal operating conditions (i.e., local failures) and thus require removal to ensure the trained baseline energy models reflect nominal system performance. Anomalous entries were detected using a comparison of observed energy to irradiance and site capacities (Figure 3). The observed energy and irradiance comparison focuses on removing data where the E–I ratio ($R = E/I$) falls outside its nominal distribution by more than 3 standard deviations (i.e., $|R - \mu_R| > 3\sigma_R$), where $\mu_R$ and $\sigma_R$ are the mean and standard deviation of the E–I ratio, respectively [28]. This filter was implemented for each site separately to capture site-specific variations (including system capacity) and resulted in the removal of 70 K data points (Figure 2). The second system anomaly filter focused on removing sites with mismatches between observed energy and site capacity. Namely, if a site's maximum recorded energy fell too far above or below a threshold based on its reported capacity, then all data points for that site were excluded from subsequent analysis. This method filtered out 23 sites; more than 50% of these sites were under 1000 kW, and only 1 was over 10,000 kW. Approximately 26 K data points were removed with this filter, resulting in a final dataset that contained 332 K data points across 127 sites for model training and testing activities (Figure 2). The age of the sites within the final dataset ranged from newly installed up to 10 years, with a majority being less than 5 years old (Figure A2).
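A minimal sketch of the per-site 3-sigma E–I ratio filter, assuming a pandas DataFrame with `site`, `energy`, and `irradiance` columns (the column names are hypothetical):

```python
import pandas as pd

def filter_ei_outliers(df: pd.DataFrame, n_sigma: float = 3.0) -> pd.DataFrame:
    """Remove rows whose E-I ratio is more than n_sigma standard deviations
    away from that site's mean E-I ratio."""
    df = df.copy()
    df["ei_ratio"] = df["energy"] / df["irradiance"]
    # Site-specific mean and standard deviation of the ratio.
    stats = df.groupby("site")["ei_ratio"].agg(["mean", "std"])
    joined = df.join(stats, on="site")
    within = (joined["ei_ratio"] - joined["mean"]).abs() <= n_sigma * joined["std"]
    return df[within]
```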
2.4. Model Design and Training
Similar to other machine learning models, regression techniques leverage input data to learn relationships and use those relationships to predict unseen quantities. These relationships are generally contained in model parameters ($\beta$), which map predictors, as summarized in a design matrix $X$, to an output $y$ with residual model error $\epsilon$. Many different regression techniques exist; these techniques typically vary in the structure of the cost function, which quantifies the error between predicted and expected values. This cost function ($C$) is usually captured as a summation of loss functions (calculated on each data point) across the training set. The set $\hat{\beta}$, which renders the smallest cost, is defined as the learned parameters, mathematically notated as:

$$\hat{\beta} = \underset{\beta}{\operatorname{arg\,min}}\; C(\beta)$$
A popular regression model is ordinary least squares (OLS), which defines its best model ($\hat{\beta}_{OLS}$) with an objective function equal to the sum of squared errors (SSE):

$$SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \hat{y}_i = \sum_{j=0}^{p} \beta_j x_{ij}$$

where $n$ is the number of samples, $p$ is the number of predictors, and $x_{ij}$ is the $i$-th value of the $j$-th explanatory variable ($x_{i0} = 1$ for the intercept term). As shown in the equation, the SSE sums the squared difference between each sample ($y_i$) and its associated model estimate ($\hat{y}_i$). High emphasis is naturally placed on reducing high-error samples; outliers can therefore have a large effect on the learned parameters, so data preprocessing steps are required for robust model development. Additionally, OLS renders non-zero coefficients on all predictors, which can create small, insubstantial parameters that are likely artifacts of the training dataset; such parameters contribute to model overfitting and should be removed from the model.
Alternate approaches to OLS include the Theil–Sen regressor [29], which is robust against outliers since it chooses the median of the slopes of all lines between pairs of points, as well as techniques such as Lasso regression [30] that explicitly address model overfitting by reducing model complexity (i.e., the number of parameters used). For this analysis, the latter was selected since Lasso regression models are able to incorporate both parameter regularization and the residual sum of squares into the loss function. The cost function for Lasso regression ($C_{Lasso}$) incorporates an L1 regularization term ($\alpha \sum_{j=1}^{p} |\beta_j|$), which penalizes the magnitude of the $\beta$ terms. This penalization tends to shrink coefficients toward zero, rendering a more parsimonious model; an $\alpha$ hyperparameter defines the impact of the regularization on the regression kernel. Specifically, the penalization acts as a bias, which in turn can reduce overall error due to the bias–variance tradeoff [31].
Standardized variables were passed into Lasso regression to learn a linear model relating the input variables to energy. Multiple combinations of input variables were used to train the regression models (more details below). For all models, a randomized (80–20%) split was utilized to partition the preprocessed, standardized data into train and test partitions, respectively.
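As a minimal sketch of this training setup, assuming scikit-learn, the standardization, 80–20 split, and Lasso fit could look like the following; the synthetic data and the $\alpha$ value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the study, X would hold the predictors
# (e.g., irradiance-based terms) and y the observed energy.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1000.0, size=(1000, 2))
y = 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 10.0, size=1000)

# Randomized 80-20 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardization feeds into Lasso; alpha is a hypothetical value here.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out test partition
```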
In addition to individual parameter influences, interactions and temporal factors were incorporated as input features to capture nuances within the datasets. Interaction parameters, which allow the effect of one parameter on the response variable to be weighted by the value of another variable, are introduced by including terms that are the product of two or more predictor variables. For example, Figure 4 shows that the relationship between $E$ and $I$ varies with a second variable; thus, the inclusion of an interaction term between $I$ and that variable may be helpful in predicting the generated energy. The suite of interaction combinations is instantiated using polynomial models up to the third order (i.e., degree $d \leq 3$). In a model with $d = 2$ and 2 covariates, the instantiated regression model would take the following form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \epsilon$$
Notice that a $d = 2$ model also includes the lower-order parameters (i.e., the $x_1$ and $x_2$ terms). This remains true for all values of the polynomial power (e.g., for a model instantiated with $d = 3$, terms from $d = 2$ and $d = 1$ are also included). Two interaction polynomial orders were tested: a second-order ($d = 2$) and a third-order ($d = 3$) (Table 2). The particular interaction noted above is captured in multiple models, including an additive model with a single interaction term (Table 2).
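A minimal sketch of how such interaction expansions can be generated, assuming scikit-learn's PolynomialFeatures (the two input columns are hypothetical covariates):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical covariates (e.g., x1 and x2 from the equation above).
X = np.array([[200.0, 10.0],
              [400.0, 20.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# The degree-2 expansion includes the lower-order terms as well:
# ['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
print(poly.get_feature_names_out())
```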
In addition to interactions, temporal factors are used to capture a variable's changing effect on the generated energy over time. For instance, the correlation between $I$ and $E$ changes over the course of the year due to spectral irradiance effects [32,33]. Therefore, allowing the model to capture time-variant nuances may be important for capturing such nonlinearities. Three temporal conditions were explored: seasonal (four per year), monthly, and hourly. A model with two predictor variables and monthly temporal conditions would be instantiated as:

$$y = \beta_0 + \sum_{m=1}^{12} a_m\, \mathbb{1}_m\, x_1 + \sum_{m=1}^{12} b_m\, \mathbb{1}_m\, x_2 + \epsilon$$

where the $a$ and $b$ parameters are coefficients describing the effects of predictors $x_1$ and $x_2$, respectively, when conditioned on a month of the year. For instance, $a_1$ describes the effect of $x_1$ on the $y$ response variable during the month of January. The indicator function $\mathbb{1}_m$, which equals 1 when an observation falls in month $m$ and 0 otherwise, masks the predictor variable to ensure it is within its timeframe. With the various combinations of interactions and temporal conditions, a total of 13 regression kernels were evaluated (Table 2; see Appendix A for some of the mathematical formulations).
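One way to realize this monthly conditioning, sketched here as an assumption about the implementation, is to multiply each predictor by month indicator (one-hot) columns before fitting:

```python
import pandas as pd

def add_monthly_conditioning(df: pd.DataFrame, predictors: list) -> pd.DataFrame:
    """Expand each predictor into 12 month-masked columns (indicator * value).

    Assumes `df` has a DatetimeIndex; column names are hypothetical.
    """
    months = pd.get_dummies(df.index.month, prefix="m")  # m_1 ... m_12
    months.index = df.index
    out = pd.DataFrame(index=df.index)
    for col in predictors:
        for m in months.columns:
            out[f"{col}_{m}"] = df[col] * months[m]
    return out

# Usage (hypothetical hourly dataset):
# features = add_monthly_conditioning(df, predictors=["irradiance"])
```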
2.5. Model Evaluation
Three metrics were used to evaluate the performance of the trained expected energy models: logarithmic root mean squared error ($\log(RMSE)$), coefficient of determination ($R^2$), and percent error ($PE$). Both partner-provided expected energy values and those calculated by the leading standardized expected energy model (i.e., IEC 61724) were used as reference values for model evaluations.
The root mean squared error (RMSE) is a common goodness-of-fit statistic used for model evaluation. The RMSE is expressed as:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ and $\hat{y}_i$ are the measured and predicted values of the response variable, and $n$ is the number of samples. RMSE is in the same units as the response variable (i.e., kWh). Lower RMSE values indicate lower prediction error and thus a better fit. Because the error can be quite large in magnitude, a logarithmic transform is applied to facilitate evaluations. Because the magnitude of the error is closely connected to a site's capacity, the $\log(RMSE)$ cannot be used to compare model performance between sites unless the sites are similar in size.
The coefficient of determination ($R^2$), however, can be used to compare model performance across different site sizes. Specifically, $R^2$ is calculated as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $\bar{y}$ is the average of the $y$ values. $R^2$ denotes the proportion of variability in the response explained by the model, with a value of 1 indicating a perfect fit. $R^2$ was used to compare trained model outputs with partner-generated expected energy values, whose underlying model structures were unknown.
Generally, however, $R^2$ is not well-suited for comparing models with varying numbers of parameters. Thus, when comparing the 13 trained models to one another, we utilize an adjusted $R^2$ metric, which checks whether the added parameters contribute to the explanation of the response variable and penalizes models with unnecessary complexity [34]. Low-effect parameters (i.e., $\beta_j \approx 0$) reduce the model's overall fit score. The adjusted $R^2$ is calculated as follows:

$$R^2_{adj} = 1 - \frac{\left(1 - R^2\right)(n - 1)}{n - p - 1}$$

where $n$ is the number of samples and $p$ is the number of predictors.
Finally, $PE$ was used to capture the directionality of error (i.e., overprediction vs. underprediction):

$$PE = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)}{\sum_{i=1}^{n} y_i} \times 100$$

The $\log(RMSE)$ and $R^2$ were implemented to evaluate model performance at both site and fleet (i.e., across multiple sites) levels, while $PE$ was only implemented at the fleet level; all metrics were reported on the test dataset. T-tests were used to evaluate the significance of performance variations between the trained and reference values.
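A minimal numpy sketch of these three metrics; the base-10 logarithm and the aggregated percent-error form are assumptions consistent with the reconstructed equations above:

```python
import numpy as np

def log_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Logarithm (base 10 assumed) of the root mean squared error (kWh)."""
    return float(np.log10(np.sqrt(np.mean((y_true - y_pred) ** 2))))

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def adjusted_r2(y_true: np.ndarray, y_pred: np.ndarray, p: int) -> float:
    """Adjusted R^2 for a model with p predictors."""
    n = len(y_true)
    return float(1.0 - (1.0 - r2(y_true, y_pred)) * (n - 1) / (n - p - 1))

def percent_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Signed percent error; positive values indicate overprediction."""
    return float(100.0 * np.sum(y_pred - y_true) / np.sum(y_true))
```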