1. Introduction
Alongside short-term load forecasting, short-term electricity price forecasting (EPF) has become a core process of an energy company's operational activities [1]. The reason is quite simple: a 1% improvement in the mean absolute percentage error (MAPE) of short-term price forecasts would result in cost reductions of about 0.1%–0.35% [2]. In dollar terms, this translates into savings of ca. $1.5 million per year for a typical medium-size utility with a 5-GW peak load [3].
As has been noted in a number of studies, whether statistical or rooted in computational intelligence, a key point in EPF is the appropriate choice of explanatory variables [1,4,5,6,7,8,9,10,11]. The typical approach has been to select predictors in an ad hoc fashion, sometimes using expert knowledge, seldom based on formal validation procedures. Very rarely has an automated selection or shrinkage procedure been carried out in EPF, especially for a large set of initial explanatory variables.
Early examples of formal variable selection in EPF include Karakatsani and Bunn [12] and Misiorek [13], who used stepwise regression to eliminate statistically insignificant variables in parsimonious autoregression (AR) and regime-switching models for individual load periods. Amjady and Keynia [4] proposed a feature selection algorithm that utilized the mutual information technique (for later applications, see, e.g., [11,14,15]). In an econometric setup, Gianfreda and Grossi [5] computed p-values of the coefficients of a regression model with autoregressive fractionally integrated moving average disturbances (Reg-ARFIMA) and in one step eliminated all statistically insignificant variables. In a study concerning the profitability of battery storage, Barnes and Balda [16] utilized ridge regression to compute forecasts of New York Independent System Operator (NYISO) electricity prices for a model with more than 50 regressors.
More recently, González et al. [17] used random forests to identify important explanatory variables among the 22 considered. Ludwig et al. [7] used both random forests and the least absolute shrinkage and selection operator (i.e., lasso or LASSO) as feature selection algorithms to choose the relevant out of the 77 available weather stations. In a recent neural network study, Keles et al. [11] combined the k-nearest-neighbor algorithm with backward elimination to select the most appropriate input variables out of more than 50 fundamental parameters or lagged versions of these parameters. Finally, Ziel et al. [9,18] used the lasso to sparsify very large sets of model parameters (well over 100). They used time-varying coefficients to capture the intra-day dependency structure, either using B-splines and one large regression model for all hours of the day [9] or, more efficiently, using a set of 24 regression models for the 24 h of the day [18].
However, a thorough study involving state-of-the-art parsimonious expert models as benchmarks, data from diverse power markets and, most importantly, a set of different selection or shrinkage procedures is still missing in the literature. In particular, to the best of our knowledge, elastic nets have not been applied in the EPF context at all. It is exactly the aim of this paper to address these issues. We perform an empirical study that involves:
- nine variants of three parsimonious autoregressive model structures with exogenous variables (ARX): one originally proposed by Misiorek et al. [19] and later used in a number of EPF studies [13,18,20,21,22,23,24,25,26,27], one which evolved from it during the successful participation of TEAM POLAND in the Global Energy Forecasting Competition 2014 (GEFCom2014; see [28,29,30]) and an extension of the former, which creates a stronger link with yesterday's prices and additionally considers a second exogenous variable (zonal load or wind power),
- three two-year-long, hourly resolution test periods from three distinct power markets (GEFCom2014, Nord Pool and the U.K.),
- nine variants of five classes of selection and shrinkage procedures: single-step elimination of insignificant predictors (without or with constraints), stepwise regression (with forward selection or backward elimination), ridge regression, the lasso and three elastic nets (with mixing parameter α = 0.25, 0.5 or 0.75),
- model validation in terms of the robust weekly-weighted mean absolute error (WMAE; see [1]) and the Diebold–Mariano (DM; see [31]) test,

and draw statistically significant conclusions of high practical value.
The remainder of the paper is structured as follows. In Section 2, we introduce the datasets. Next, in Section 3, we first discuss the iterative calibration and forecasting scheme, then describe the techniques considered for price forecasting: a simple naive benchmark, nine variants of three parsimonious ARX-type model structures and five classes of selection and shrinkage procedures. In Section 4, we summarize the empirical findings. Namely, we evaluate the quality of point forecasts in terms of WMAE errors, run the DM tests to formally assess the significance of differences in the forecasting performance and analyze variable selection for the best performing elastic net model. Finally, in Section 5, we wrap up the results and conclude.
2. Datasets
The datasets used in this empirical study include three spot market time series. The first one comes from the Global Energy Forecasting Competition 2014 (GEFCom2014), the largest energy forecasting competition to date [28]. The dataset includes three time series at hourly resolution: locational marginal prices, day-ahead predictions of system loads and day-ahead predictions of zonal loads, and covers the period 1 January 2011–14 December 2013; see Figure 1. The origin of the data has never been revealed by the organizers. The full dataset is now available as supplementary material accompanying [28] (Appendix A); however, during the competition, the information set was being extended on a weekly basis to prevent 'peeking' into the future. The dataset was preprocessed by the organizers and does not include any missing or doubled values.
The second dataset comes from one of the major European power markets: Nord Pool (NP). It comprises hourly system prices, hourly consumption prognoses for four Nordic countries (Denmark, Finland, Norway and Sweden) and hourly wind prognoses for Denmark, and covers the period 1 January 2013–29 March 2016; see Figure 2. The time series were constructed using data published by the Nordic power exchange Nord Pool (www.nordpoolspot.com) and preprocessed to account for missing values and changes to/from daylight saving time, analogously as in [20] (Section 4.3.7). The missing data values (corresponding to the changes to daylight saving/summer time; moreover, eight out of 28,392 hourly consumption figures were missing for Norway) were substituted by the arithmetic average of the neighboring values. The 'doubled' values (corresponding to the changes from daylight saving/summer time) were substituted by the arithmetic average of the two values for the 'doubled' hour.
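The two substitution rules can be sketched in a few lines. A minimal illustration in Python (the study itself worked in MATLAB; the function names below are our own):

```python
import numpy as np

def fill_missing_hours(series):
    """Replace isolated missing hourly values (NaN) by the arithmetic
    average of the two neighboring values, as done for the hour lost
    when clocks change to daylight saving time."""
    s = np.asarray(series, dtype=float).copy()
    for i in np.where(np.isnan(s))[0]:
        s[i] = 0.5 * (s[i - 1] + s[i + 1])  # assumes interior, isolated gaps
    return s

def merge_doubled_hour(first, second):
    """Replace the two observations of a 'doubled' hour (clocks moved
    back from daylight saving time) by their arithmetic average."""
    return 0.5 * (first + second)
```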
The third dataset comes from N2EX, the U.K. day-ahead power market operated by Nord Pool. It comprises hourly system prices for the period 1 January 2013–29 March 2016; see Figure 3. The time series was constructed using data published by Nord Pool (www.nordpoolspot.com) and, like the second dataset, preprocessed to account for changes to/from daylight saving time. Note that the U.K. dataset includes only prices, as no day-ahead forecasts of fundamental variables were available to us. Hence, models calibrated to the U.K. data are 'pure price' models. To better see the effect of excluding fundamentals from forecasting models, we use the GEFCom2014 dataset twice: once with fundamentals (system and zonal load forecasts; to compare with the results for Nord Pool) and once without them.
3. Methodology
It should be noted that although we use the terms short-term, spot and day-ahead interchangeably here, the former two do not necessarily refer to the day-ahead market. Short-term EPF generally involves predicting 24 hourly (or 48 half-hourly) prices in the day-ahead market, cleared typically at noon on the day before delivery, i.e., 12–36 h before delivery, in the adjustment markets, cleared a few hours before delivery, and in the balancing or real-time markets, cleared minutes before delivery [32]. The spot market, especially in the literature on European electricity markets, is often used as a synonym of the day-ahead market. However, in the U.S., the spot market is another name for the real-time market, while the day-ahead market is called the forward market [20,33]. Furthermore, some markets in Europe nowadays admit continuous trading for individual load periods up to a few hours before delivery. With the shifting of volume from the day-ahead to intra-day markets, also in Europe, the term spot is more and more often being used to refer to the real-time markets [1].
Throughout this article, we denote by $P_{d,h}$ the electricity price in the day-ahead market for day $d$ and hour $h$. Like many studies in the EPF literature [1], we use the logarithmic transform, $p_{d,h} = \log(P_{d,h})$, to make the price series more symmetric (see Figure 4 and compare with the top panels in Figure 1, Figure 2 and Figure 3). We can do this since all considered datasets are positive-valued. However, this is not a very restrictive property. If datasets with zero or negative values were considered, we could work with non-transformed prices. Furthermore, we center the log-prices by subtracting their in-sample mean prior to parameter estimation. We do this independently for each hour $h$:

$$p_{d,h} \rightarrow p_{d,h} - \frac{1}{T}\sum_{d=1}^{T} p_{d,h}, \qquad (1)$$

where $T$ is the number of days in the calibration window; hence the missing intercept ($\beta_{h,0}$) in our autoregressive models; for model parameterizations, see Section 3.2, Section 3.3 and Section 3.4.
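In code, the transform and the hour-by-hour centering amount to the following (a sketch; the array-shape convention is our own):

```python
import numpy as np

def center_log_prices(P):
    """P: (T, 24) array of positive day-ahead prices for T days and
    24 hours. Returns log-prices centered independently for each hour
    by subtracting the in-sample mean, as in Equation (1)."""
    p = np.log(P)
    return p - p.mean(axis=0, keepdims=True)
```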
For all three markets, the day-ahead forecasts of the hourly electricity price are determined within a rolling window scheme, using a 365-day calibration window. First, all considered models are calibrated to data from the initial calibration period (i.e., 1 January 2011–31 December 2011 for GEFCom2014 and 1 January 2013–31 December 2013 for Nord Pool and the U.K.), and forecasts for all 24 h of the next day (1 January) are determined. Then, the window is rolled forward by one day; the models are re-estimated, and forecasts for all 24 h of 2 January are computed. This procedure is repeated until the predictions for the 24 h of the last day in the sample (14 December 2013 for GEFCom2014 and 29 March 2016 for Nord Pool and the U.K.) are made.
For models requiring calibration of the regularization parameter (i.e., λ), we use a setup commonly considered in the machine learning literature. Namely, we divide our datasets into estimation (365 days), validation (91 days or 13 full weeks) and test periods (623 days for GEFCom2014, 728 days for Nord Pool and the U.K.; respectively 89 and 104 full weeks). For each of the five models (ridge regression, the lasso and elastic nets with α = 0.25, 0.5 and 0.75), 34 different 'sub-models' with 34 values of λ spanning the regularization parameter space (see Section 3.4.3 and Section 3.4.4 for details) are estimated in the 91-day validation period directly following the last day of the initial calibration period; see Figure 1, Figure 2 and Figure 3. For all hours of the day, only one value of λ is chosen for each of the five models: the one that yields the smallest error during this 91-day period; for error definitions, see Section 4.1. This value of λ is later used for computing day-ahead price forecasts in the whole out-of-sample test period. To ensure that all models are evaluated using the same data, predictions of all models are compared only in the out-of-sample test periods: 1 April 2012–14 December 2013 (623 days) for GEFCom2014 and 2 April 2014–29 March 2016 (728 days) for Nord Pool and the U.K. Obviously, such a simple procedure for the selection of the regularization parameter may not be optimal. Generally, better performance is to be expected from shrinkage models when λ is recalibrated at every time step. Such an approach has been recently taken by Ziel [18], who used the Bayesian information criterion to select one out of 50 values of λ for every day and every hour in the 969-day-long out-of-sample test period. The downside of such an approach is, however, the increased computational time.
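The rolling calibration scheme can be sketched as follows (generic Python; `fit` and `predict` are placeholders for any of the models described in Section 3 and are assumptions of this sketch):

```python
import numpy as np

def rolling_forecast(p, fit, predict, window=365):
    """p: (D, 24) array of (log-)prices. For every day d >= window,
    each of the 24 hourly models is re-estimated on the preceding
    `window` days and used to forecast day d; the window is then
    rolled forward by one day."""
    D = p.shape[0]
    forecasts = np.full_like(p, np.nan)
    for d in range(window, D):
        for h in range(24):
            model = fit(p[d - window:d, :], h)   # calibration window only
            forecasts[d, h] = predict(model, p[:d, :], h)
    return forecasts
```

For instance, plugging in a 'model' that simply returns the calibration-window mean of hour h reproduces a (very naive) hourly climatology forecast.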
Our choice of the model classes is guided by the existing literature on short-term EPF. Like in [12,18,25,26,27,30], the modeling is implemented separately across the hours, leading to 24 sets of parameters for each day the forecasting exercise is performed. As Ziel [18] notes, when we compare the forecasting performance of relatively simple models implemented separately across the hours and jointly for all hours (like in [9,34,35,36]), the latter generally perform better in the first half of the day, whereas the former are better in the second half of the day. At the same time, models implemented separately across the hours offer more flexibility by allowing for time-varying cross-hour dependency in a straightforward manner. Hence our choice of the modeling framework.
In the remainder of this section, we first define the benchmarks: a simple similar-day technique and a collection of parsimonious autoregressive models. Since the latter are usually built on some prior expert knowledge, like in [18], we refer to them as expert models. Then, we move on to describe the selection and shrinkage procedures used in this study.
3.1. The Naive Benchmark
The first benchmark, most likely introduced to the EPF literature in [34] and dubbed the naive method, belongs to the class of similar-day techniques (for a taxonomy of EPF approaches, see, e.g., [1]). It proceeds as follows: the electricity price forecast for hour h on Monday is set equal to the price for the same hour on Monday of the previous week, and the same rule applies for Saturdays and Sundays; the electricity price forecast for hour h on Tuesday is set equal to the price for the same hour on Monday, and the same rule applies for Wednesdays, Thursdays and Fridays. As was argued in [34,35], forecasting procedures that are not calibrated carefully fail to outperform the naive method surprisingly often. We denote this benchmark by Naive.
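A minimal sketch of this rule (the array layout is our own convention):

```python
import numpy as np

def naive_forecast(past_prices, target_weekday):
    """Similar-day naive forecast for all 24 hours of one target day.
    past_prices: (D, 24) array of observed prices, row -1 being the
    day before the target day; target_weekday: 0=Monday, ..., 6=Sunday.
    Mondays, Saturdays and Sundays copy the same hour one week earlier;
    all other days copy the same hour of the previous day."""
    if target_weekday in (0, 5, 6):       # Monday, Saturday, Sunday
        return past_prices[-7, :].copy()  # same weekday, previous week
    return past_prices[-1, :].copy()      # yesterday
```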
3.2. Autoregressive Expert Benchmarks
The second benchmark is a parsimonious autoregressive structure originally proposed by Misiorek et al. [19] and later used in a number of EPF studies [18,20,21,23,24,25,26,27]. Within this model, the centered log-price on day $d$ and hour $h$, i.e., $p_{d,h}$, is given by the following formula:

$$p_{d,h} = \beta_{h,1} p_{d-1,h} + \beta_{h,2} p_{d-2,h} + \beta_{h,3} p_{d-7,h} + \beta_{h,4} p_{d-1}^{min} + \beta_{h,5} z_{d,h} + \beta_{h,6} D_{Sat} + \beta_{h,7} D_{Sun} + \beta_{h,8} D_{Mon} + \varepsilon_{d,h}, \qquad (2)$$

where the lagged log-prices $p_{d-1,h}$, $p_{d-2,h}$ and $p_{d-7,h}$ account for the autoregressive effects of the previous days (the same hour yesterday, two days ago and one week ago), while $p_{d-1}^{min}$ is the minimum of the previous day's 24 hourly log-prices. The exogenous variable $z_{d,h}$ refers to the logarithm of hourly system load or Nordic consumption for day $d$ and hour $h$ (actually, to forecasts made a day before; see Section 2). The three dummy variables $D_{Sat}$, $D_{Sun}$ and $D_{Mon}$ account for the weekly seasonality. Finally, the $\varepsilon_{d,h}$'s are assumed to be independent and identically distributed (i.i.d.) normal variables. We denote this autoregressive benchmark by ARX1 to reflect the fact that the load (or consumption) forecast is used as the exogenous variable in Equation (2). The corresponding model with $\beta_{h,5} = 0$, i.e., with no exogenous variable, is denoted by AR1. The ARX1 and AR1 models, as well as all autoregressive structures considered in Section 3.2 and Section 3.3, are estimated in this study with least squares (LS), using MATLAB's regress.m function.
In what follows, we also consider two variants of Equation (2) that treat holidays as special days:

$$p_{d,h} = \beta_{h,1} p_{d-1,h} + \beta_{h,2} p_{d-2,h} + \beta_{h,3} p_{d-7,h} + \beta_{h,4} p_{d-1}^{min} + \beta_{h,5} z_{d,h} + \beta_{h,6} D_{Sat} + \beta_{h,7} D_{Sun} + \beta_{h,8} D_{Mon} + \beta_{h,9} D_{Hol} + \varepsilon_{d,h}, \qquad (3)$$

and that additionally utilize the fact that prices for early morning hours depend more on the previous day's price at midnight, i.e., $p_{d-1,24}$, than on the price for the same hour, as recently noted in [18,29]:

$$p_{d,h} = \beta_{h,1} p_{d-1,h} + \beta_{h,2} p_{d-2,h} + \beta_{h,3} p_{d-7,h} + \beta_{h,4} p_{d-1}^{min} + \beta_{h,5} z_{d,h} + \beta_{h,6} D_{Sat} + \beta_{h,7} D_{Sun} + \beta_{h,8} D_{Mon} + \beta_{h,9} D_{Hol} + \beta_{h,10} p_{d-1,24} + \varepsilon_{d,h}. \qquad (4)$$

We denote Models (3) and (4) by ARX1h and ARX1hm, respectively. Similarly, the corresponding models with $\beta_{h,5} = 0$ are denoted by AR1h and AR1hm. Note that when forecasting the electricity price for the last load period of the day, i.e., $h = 24$, models with suffix hm reduce to models with suffix h (this is true for all models considered in Section 3.2).

In Equations (3) and (4), $D_{Hol}$ is a dummy variable for holidays. The holidays were identified using the Time and Date AS (www.timeanddate.com/holidays) web page: U.S. federal holidays (for GEFCom2014), national holidays in Norway (for Nord Pool) and public holidays, bank holidays and major observances in the U.K. (option 'Holidays and some observances').
The third benchmark is an extension of the ARX1 model, which takes into account the experience gained during the GEFCom2014 competition that it may be beneficial to use different model structures for different days of the week, not only different parameter sets [29]. Hence, the multi-day ARX model (denoted later in the text by mARX1) is given by the following formula:

$$p_{d,h} = \beta_{h,1} p_{d-1,h} + \beta_{h,2} p_{d-2,h} + \beta_{h,3} p_{d-7,h} + \beta_{h,4} p_{d-1}^{min} + \beta_{h,5} z_{d,h} + \beta_{h,6} D_{Sat} + \beta_{h,7} D_{Sun} + \beta_{h,8} D_{Mon} + \beta_{h,9} D_{Mon}\, p_{d-3,h} + \varepsilon_{d,h}, \qquad (5)$$

where the term $D_{Mon}\, p_{d-3,h}$ accounts for the autoregressive effect of Friday's prices on the prices for the same hour on Monday. Note that to some extent, this structure resembles periodic autoregressive moving average (PARMA) models, which have seen limited use in EPF [37,38]. Like for the ARX1 model, also for mARX1, we consider two variants: mARX1h, which treats holidays as special days, i.e., with the $D_{Hol}$ term in Equation (5), and mARX1hm, which additionally implements the dependence on the previous day's price at midnight, i.e., with the $D_{Hol}$ and $p_{d-1,24}$ terms in Equation (5). The corresponding price only models, i.e., with $\beta_{h,5} = 0$, are denoted by mAR1, mAR1h and mAR1hm.
Misiorek et al. [19] noted that the minimum of the previous day's 24 hourly prices was the best link between today's prices and those of the entire previous day. Their analysis, however, was limited to one small dataset (California Power Exchange (CalPX) prices, 3–9 April 2000) and only one simple function at a time (maximum, minimum, mean or median of the previous day's prices). To check if using more than one function leads to a better forecasting performance, we introduce a benchmark that extends the ARX1 model by taking into account not only the minimum $p_{d-1}^{min}$, but also the maximum $p_{d-1}^{max}$ and the mean $p_{d-1}^{mean}$ of the previous day's 24 hourly prices. Additionally, we include a second exogenous variable $z_{d,h}^{(2)}$, which is taken as either the logarithm of the day-ahead zonal load forecast (GEFCom2014) or of the Danish wind power prognosis (Nord Pool). The resulting ARX2 model is given by the following formula:

$$p_{d,h} = \beta_{h,1} p_{d-1,h} + \beta_{h,2} p_{d-2,h} + \beta_{h,3} p_{d-7,h} + \beta_{h,4} p_{d-1}^{min} + \beta_{h,5} p_{d-1}^{max} + \beta_{h,6} p_{d-1}^{mean} + \beta_{h,7} z_{d,h} + \beta_{h,8} z_{d,h}^{(2)} + \beta_{h,9} D_{Sat} + \beta_{h,10} D_{Sun} + \beta_{h,11} D_{Mon} + \varepsilon_{d,h}. \qquad (6)$$

Like for the ARX1 and mARX1 models, also for ARX2, we consider two variants: ARX2h with the $D_{Hol}$ term in Equation (6), and ARX2hm with the $D_{Hol}$ and $p_{d-1,24}$ terms in Equation (6). The corresponding price only models, i.e., with $\beta_{h,7} = \beta_{h,8} = 0$, are denoted by AR2, AR2h and AR2hm.
3.3. Full Autoregressive Model
Finally, we define a much richer autoregressive model that includes as special cases all expert models discussed in Section 3.2 and call it the full ARX or fARX model. We consider all regressors that, in our opinion, possess non-negligible predictive power. The fARX model is similar in spirit to the general autoregressive model defined by Equation (2) in [18]. However, there are some important differences between them. On the one hand, fARX includes exogenous variables and a much richer seasonal structure. On the other, it does not look that far into the past and concentrates only on a handful of recent days. The fARX model is given by formula (7), where $D_1, \dots, D_7$ are dummies for the seven days of the week (we treat holidays as the eighth day of the week, hence $D_1 = \dots = D_7 = 0$ for holidays). The price only variant, fAR, is obtained by setting to zero all coefficients of the terms involving the exogenous variables.
Although we fit the fARX model to power market data and evaluate its forecasting performance, the main reason for including it in this study is to use it as the baseline model for the selection and shrinkage procedures discussed in Section 3.4. For this purpose, let us write the fARX model in a more compact form:

$$p_{d,h} = \sum_{i=1}^{n} \beta_{h,i} X_{d,i} + \varepsilon_{d,h}, \qquad (8)$$

where the $X_{d,i}$'s are the $n$ regressors in Equation (7) and the $\beta_{h,i}$'s are their coefficients.
3.4. Selection and Shrinkage Procedures
All autoregressive models considered in Section 3.2 and Section 3.3 are estimated in this study with least squares (LS). However, there are many alternatives to using LS in multi-parameter models, in particular [39]:

- variable or subset selection, which involves identifying a subset of predictors that we believe to be influential, then fitting a model using LS on the reduced set of variables,
- shrinkage (also known as regularization), which fits the full model with all predictors using an algorithm that shrinks the estimated coefficients towards zero, which can significantly reduce their variance.

Depending on what type of shrinkage is performed, some of the coefficients may be shrunk to exactly zero. As such, some shrinkage methods, like the lasso, de facto perform variable selection. It should be noted, however, that variable selection (or model sparsity) is beneficial for interpretability and faster simulation of model trajectories; for reducing the forecasting errors, only the shrinkage property is required.
3.4.1. Single-Step Elimination of Insignificant Predictors
This subset selection procedure is a simple alternative to the stepwise regression discussed in Section 3.4.2 and has been used, for instance, in [5]. The idea is to fit the full regression model, in our case fARX, and then, in a single step, set to zero all statistically insignificant coefficients. We use MATLAB's regress.m function with the commonly-used 5% significance level. Setting to zero all coefficients in Equation (7) whose 95% confidence intervals (CI) include zero yields the ssARX model for a particular day and hour (the ssAR model is obtained analogously from fAR; see Section 3.3). This procedure can also be conducted with additional constraints imposed, for instance, leaving in the model all coefficients of the basic ARX1 (or AR1) benchmark. This yields the ssARX1 and ssAR1 models. Of course, the most commonly-used significance level of 5% may not be optimal. We have additionally checked the performance of 90% and 97.5% CI. It turns out that the overall ranking of the ssAR-type models does not change much. However, ssARX and ssAR perform slightly better for the 90% CI, while ssARX1 and ssAR1 do so either for the 95% or the 97.5% CI.
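The procedure can be sketched as follows (Python in place of regress.m; for simplicity, the sketch uses the large-sample normal critical value 1.96 instead of the exact t-based intervals, which is an assumption of this illustration):

```python
import numpy as np

def single_step_elimination(X, y, crit=1.96):
    """Fit the full model by LS, then in a single step set to zero
    every coefficient whose approximate 95% confidence interval
    includes zero. Returns the sparsified coefficients and a boolean
    mask of the retained regressors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)             # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # LS covariance matrix
    se = np.sqrt(np.diag(cov))                   # standard errors
    keep = np.abs(beta) > crit * se              # CI excludes zero?
    return np.where(keep, beta, 0.0), keep
```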
3.4.2. Stepwise Regression
Although very fast, single-step elimination may remove too many explanatory variables at once and lead to a poorly-performing subset of predictors. On the other hand, selecting the best subset from among all $2^n$ subsets of the $n$ predictors is not computationally feasible for large $n$. Even if doable, it may lead to overfitting. For these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection [39]. In the context of EPF, they have been used, for instance, in [12,13,40].

There are two basic procedures: forward selection and backward elimination. Forward stepwise selection begins with a model containing no predictors and then iteratively adds variables to the model. At each step, the variable that gives the greatest additional improvement to the fit is added to the model, and the procedure continues until all important predictors are in the model. We use MATLAB's stepwisefit.m function, which computes the p-value of an F-statistic at each step to compare models with and without a potential term. If a variable is not currently in the model, the null hypothesis is that it would have a zero coefficient if added to the model. If there is sufficient evidence to reject the null hypothesis, the variable may be added to the model (we use stepwisefit's default 5% significance level for adding variables; naturally, this could be further fine-tuned, as for the single-step elimination procedures). In a given step, the function adds the variable with the smallest p-value. We denote the resulting models by fsARX and fsAR.

Backward stepwise elimination (or selection) begins with the full model containing all $n$ variables, i.e., fARX or fAR, and then iteratively removes the least useful predictor, one at a time. Here, MATLAB's stepwisefit.m function tests the null hypothesis that a given variable in the model has a zero coefficient. If there is insufficient evidence to reject the null hypothesis, the variable may be removed from the model (we use stepwisefit's default 10% significance level for removing variables). In a given step, the function removes the variable with the largest p-value. We denote the resulting models by bsARX and bsAR.
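A greedy sketch of the forward variant (our own reimplementation, not stepwisefit.m; the fixed critical value f_crit ≈ 4 roughly mimics a 5% entry level for moderate sample sizes and is an assumption of this sketch):

```python
import numpy as np

def forward_selection(X, y, f_crit=4.0):
    """Forward stepwise selection: start from the empty model (the
    data are assumed centered, so no intercept) and repeatedly add
    the candidate regressor with the largest partial F-statistic,
    stopping when no candidate exceeds f_crit."""
    n, k = X.shape
    selected = []
    rss = float(y @ y)  # RSS of the empty model
    while len(selected) < k:
        best_j, best_f, best_rss = None, f_crit, None
        for j in range(k):
            if j in selected:
                continue
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            r = y - X[:, cols] @ beta
            rss_j = float(r @ r)
            # partial F-statistic for adding regressor j
            f = (rss - rss_j) / (rss_j / (n - len(cols)))
            if f > best_f:
                best_j, best_f, best_rss = j, f, rss_j
        if best_j is None:
            break
        selected.append(best_j)
        rss = best_rss
    return sorted(selected)
```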
3.4.3. Ridge Regression
Ridge regression is a regularization method introduced in statistics by Hoerl and Kennard [41]. To the best of our knowledge, apart from a limited study of Barnes and Balda [16] in the context of evaluating the profitability of battery storage, the method has not been used for EPF. Ridge regression is very similar to least squares, except that the $\beta_{h,i}$'s in (8) are not estimated by minimizing the residual sum of squares (RSS), but by minimizing the RSS penalized by a quadratic shrinkage factor:

$$\hat{\beta}_h = \underset{\beta_h}{\arg\min} \left\{ \sum_{d=1}^{T} \Big( p_{d,h} - \sum_{i=1}^{n} \beta_{h,i} X_{d,i} \Big)^2 + \lambda \sum_{i=1}^{n} \beta_{h,i}^2 \right\}, \qquad (9)$$

where $T$ represents the calibration period and $\lambda \geq 0$ is a tuning or regularization parameter, to be determined separately. Note that for $\lambda = 0$, we get the standard LS estimator; for $\lambda \rightarrow \infty$, all $\beta_{h,i}$'s tend to zero; while for intermediate values of λ, we are balancing two ideas: minimizing the RSS and shrinking the coefficients towards zero (and each other).

Ridge regression produces a different set of coefficient estimates for each value of λ; hence, selecting a good value for λ is critical. Cross-validation provides a simple way to tackle this problem [39]. We choose a grid of λ values (here: 34 equally-spaced values spanning the range from 1–100; if the largest value, λ = 100, was selected, we additionally checked another set of 34 equally-spaced values spanning the range from 101–200) and, using MATLAB's ridge.m function (we scale the regressors), compute the prediction errors for each value of the tuning parameter in the 91-day validation period; see Section 2. We then select the λ for which the error (for the definition, see Section 4.1) is the smallest and use it for computing day-ahead price forecasts in the whole out-of-sample test period. The resulting model is denoted in the text by RidgeX, or Ridge when the baseline model is fAR.
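A compact sketch of the estimator and the validation-based choice of λ (closed-form Python instead of ridge.m; regressor scaling is omitted for brevity, and the function names are our own):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate minimizing RSS + lam * sum(beta_i^2), Eq. (9):
    beta = (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def select_lambda(X_cal, y_cal, X_val, y_val, grid):
    """Estimate one 'sub-model' per lambda on the calibration data and
    keep the lambda with the smallest mean absolute error over the
    validation period, mirroring the 91-day scheme of Section 3."""
    errors = [np.abs(y_val - X_val @ ridge(X_cal, y_cal, lam)).mean()
              for lam in grid]
    return grid[int(np.argmin(errors))]
```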
3.4.4. Lasso and Elastic Nets
Ridge regression has one unwanted feature when it comes to interpretation and model identification. Unlike stepwise regression, which will generally select models that involve just a subset of the variables, ridge regression will include all $n$ predictors in the final model [39]. The quadratic shrinkage factor in Equation (9) will shrink all $\beta_{h,i}$'s towards zero, but it will not set any of them exactly to zero. In 1996, Tibshirani [42] proposed the least absolute shrinkage and selection operator (i.e., lasso or LASSO), which overcomes this disadvantage. It is the only shrinkage procedure that has been applied in EPF to a larger extent, however only in the last two years [7,9,18,25,43].

The lasso is a shrinkage method just like ridge regression. However, it uses a linear penalty factor instead of a quadratic one:

$$\hat{\beta}_h = \underset{\beta_h}{\arg\min} \left\{ \sum_{d=1}^{T} \Big( p_{d,h} - \sum_{i=1}^{n} \beta_{h,i} X_{d,i} \Big)^2 + \lambda \sum_{i=1}^{n} |\beta_{h,i}| \right\}. \qquad (10)$$

This subtle change makes the solutions nonlinear in the $p_{d,h}$'s, and there is no closed-form expression as in the case of ridge regression. Because of the nature of the shrinkage factor in Equation (10), making λ sufficiently large will cause some of the coefficients to be exactly zero [44]. Thus, the lasso de facto performs variable selection, just like the methods discussed in Section 3.4.1 and Section 3.4.2. As in ridge regression, selecting a good value of λ for the lasso is critical. Here, we use MATLAB's lasso.m function and a grid of exponentially-decreasing λ's (the largest just sufficient to produce all $\beta_{h,i} = 0$; the function also automatically scales the regressors). We then select the λ for which the error (for the definition, see Section 4.1) in the 91-day validation period is the smallest. The resulting model is denoted in the text by LassoX, or Lasso when the baseline model is fAR.
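An illustrative coordinate-descent solver for Equation (10) (our own reimplementation of the textbook algorithm, not MATLAB's lasso.m, and it omits the automatic regressor scaling):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for min RSS + lam * sum(|beta_i|).
    With the squared-error loss written without a 1/2 factor, the
    per-coordinate threshold is lam/2."""
    n, k = X.shape
    beta = np.zeros(k)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(k):
            # partial residual excluding regressor j
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam / 2.0) / col_ss[j]
    return beta
```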
The lasso does not handle highly-correlated variables very well: the coefficient paths tend to be erratic and can sometimes show wild behavior [44]. This is not a critical issue for forecasting, but for interpretation and model identification, it has more serious consequences. In 2005, Zou and Hastie [45] proposed the elastic net, a regularization and variable selection method that can be seen as an extension of ridge regression and the lasso. It often outperforms the lasso, while exhibiting a similar sparsity of representation. The elastic net uses a mixture of linear and quadratic penalty factors:

$$\hat{\beta}_h = \underset{\beta_h}{\arg\min} \left\{ \sum_{d=1}^{T} \Big( p_{d,h} - \sum_{i=1}^{n} \beta_{h,i} X_{d,i} \Big)^2 + \lambda \sum_{i=1}^{n} \Big( \alpha |\beta_{h,i}| + \tfrac{1}{2}(1-\alpha) \beta_{h,i}^2 \Big) \right\}, \qquad (11)$$

where $\alpha \in [0,1]$. When $\alpha = 1$, the elastic net reduces to the lasso, and with $\alpha = 0$, it becomes ridge regression. The $\tfrac{1}{2}$ in the quadratic part of the elastic net penalty in Equation (11) leads to a more efficient and intuitive soft-thresholding operator in the optimization; the original formulation in [45] did not include this scaling. Note also that every elastic net problem can be rewritten as a lasso problem on augmented data. Hence, for fixed λ and α, the computational difficulty of the elastic net solution is similar to that of the lasso problem [44].

Compared to the lasso and ridge regression, the elastic net has an additional mixing parameter, α, that has to be determined. It can be set on subjective grounds, as we do here, or optimized within a cross-validation scheme. We use MATLAB's lasso.m function (with a grid of exponentially-decreasing λ's; the function also automatically scales the regressors) and three values of the mixing parameter: α = 0.25, 0.5 and 0.75. This yields six elastic net models: EN25X, EN50X and EN75X when the baseline model is fARX, and EN25, EN50 and EN75 when the baseline model is fAR. These span the space between the ridge regression (RidgeX, Ridge) and lasso models (LassoX, Lasso).
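The coordinate-descent update generalizes directly to the elastic net penalty of Equation (11) (again an illustrative reimplementation, not lasso.m; function names are our own):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=200):
    """Cyclic coordinate descent for
    min RSS + lam * sum(alpha*|beta_i| + 0.5*(1-alpha)*beta_i^2).
    alpha=1 recovers the lasso, alpha=0 ridge regression; the 1/2 on
    the quadratic part yields the simple update below."""
    n, k = X.shape
    beta = np.zeros(k)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(k):
            r = y - X @ beta + X[:, j] * beta[j]  # partial residual
            rho = X[:, j] @ r
            beta[j] = (soft_threshold(rho, lam * alpha / 2.0)
                       / (col_ss[j] + lam * (1.0 - alpha) / 2.0))
    return beta
```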
5. Conclusions
A key point in electricity price forecasting (EPF) is the appropriate choice of explanatory variables. The typical approach has been to select predictors in an ad hoc fashion, sometimes using expert knowledge, but very rarely based on formal selection or shrinkage procedures. However, is this the right approach? Can the application of automated selection and shrinkage procedures to large sets of explanatory variables lead to better forecasts than those of the commonly-used expert models?
Conducting an empirical study involving state-of-the-art parsimonious autoregressive structures as benchmarks, datasets from three major power markets and five classes of automated selection and shrinkage procedures (single-step elimination, stepwise regression, ridge regression, lasso and elastic nets), we have addressed these important questions. To this end, we have compared the predictive performance of 20 types of models over three two-year-long out-of-sample test periods in terms of the robust weekly-weighted mean absolute error (WMAE) and tested the statistical significance of the results using the Diebold–Mariano [31] test.
We have shown that two classes of selection and shrinkage procedures, the lasso and elastic nets, lead to better performance on average than any of the considered expert benchmarks. On the other hand, single-step elimination, stepwise regression and ridge regression are not recommended for EPF, as they do not yield significant accuracy gains compared to well-structured parsimonious autoregressive models. The lasso has recently been shown to perform well in EPF [9,18], but it is the more flexible elastic net that stands out as the best performing model overall. Given that both are automated procedures that do not require advanced expert knowledge or supervision, our results may have far-reaching consequences for the practice of electricity price forecasting.
We have also looked at the variables selected by the elastic net algorithm to gain insights for constructing efficient parsimonious models. In particular, we have confirmed the high explanatory power of the load forecasts for the target hour, of the previous day's prices for the same or neighboring hours and of the price for the same hour a week earlier. Somewhat surprisingly, we have found that not only the last available data point (the price for Hour 24), but also the prices for Hours 21–23 of the previous day should be considered when building expert models.