#### *2.5. Cross-Validation*

A common issue in this area of work is overfitting. A model can, in principle, be fitted so closely to the training data that it exhibits very high in-sample accuracy. Nonetheless, such a model would be of little use for forecasting, as it will likely fit the test (out-of-sample) data poorly: it has learned the training data rather than the underlying phenomenon. To avoid this, in the empirical part of the study we employed a *k*-fold cross-validation procedure. The in-sample data, which are used to train the model, are divided into *k* parts (folds) of equal size. In each of the *k* iterations, one fold serves as the testing set, while the remaining *k* − 1 folds are used as the training set, so that every fold is used as the testing set exactly once. In this scheme, the model's accuracy for each set of parameter values is evaluated by its average performance over the *k* folds. Figure 3 provides a graphical representation of a 3-fold cross-validation procedure.
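In code, the procedure amounts to averaging fold-level RMSEs for each candidate parameter set. The sketch below uses scikit-learn, which the paper does not specify; the linear SVM and the synthetic arrays `X` and `y` are purely illustrative.

```python
# Minimal k-fold cross-validation sketch (scikit-learn assumed; the paper does
# not state its implementation). X and y stand in for the in-sample data.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

def cv_rmse(model, X, y, k=3):
    """Average RMSE over k folds for one set of model parameters."""
    kf = KFold(n_splits=k, shuffle=False)
    fold_rmse = []
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return float(np.mean(fold_rmse))

# Illustrative use with synthetic data and a linear SVM.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)
print(cv_rmse(SVR(kernel="linear"), X, y, k=3))
```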

#### *2.6. The Dataset*

For the training and the testing of our models, we compiled a dataset consisting of 2423 daily natural gas spot price values from the Energy Information Administration (EIA) database and 21 related economic variables from the Federal Reserve Bank of Saint Louis and Yahoo Finance databases. They span the period from 3 December 2010 to 18 September 2020 (Table 1). In addition, the momentum of the last 5 and 10 days (Momentum 5 and Momentum 10 are defined as the number of times the natural gas spot price increased over the last 5 and 10 days, respectively) as well as the 5- and 10-day moving averages were calculated and added to the independent variable set. With the exception of interest rates, all of the variables were converted to natural logarithms.
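These derived features are straightforward to construct with rolling windows. The sketch below assumes pandas and a hypothetical Series `spot` of daily spot prices; the column names are illustrative, not the paper's.

```python
# Illustrative construction of the momentum and moving-average features.
# `spot` is a hypothetical pandas Series of daily natural gas spot prices.
import numpy as np
import pandas as pd

def add_features(spot: pd.Series) -> pd.DataFrame:
    df = pd.DataFrame({"spot": spot})
    up = (spot.diff() > 0).astype(int)        # 1 on days the price increased
    df["momentum_5"] = up.rolling(5).sum()    # number of up days over the last 5 days
    df["momentum_10"] = up.rolling(10).sum()  # number of up days over the last 10 days
    df["ma_5"] = spot.rolling(5).mean()       # 5-day moving average
    df["ma_10"] = spot.rolling(10).mean()     # 10-day moving average
    df["log_spot"] = np.log(spot)             # natural log (interest rates excepted)
    return df.dropna()
```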

**Figure 3.** A three-fold cross-validation for a given set of model parameter values. Each fold serves as a test sample, while the remaining folds are used to train the model. The average prediction accuracy for each set of parameters over the *k* folds is used to assess the model [17].


**Table 1.** List of explanatory variables with mean, standard deviation, skewness, kurtosis, and variance.

In order to test the generalization ability of the trained models, the dataset was divided into two parts: the first 90% was used as the training data set (in-sample, consisting of 2180 observations), and the remaining 10%, comprising the most recent observations, was used as the test data set (out-of-sample, consisting of 243 observations).
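Because the observations are ordered in time, the split is chronological rather than random. A minimal sketch, assuming the data are held in a date-ordered pandas DataFrame `data`:

```python
import pandas as pd

def chronological_split(data: pd.DataFrame, train_frac: float = 0.9):
    """Keep the earliest train_frac of rows for training, the rest for testing."""
    split = int(len(data) * train_frac)          # 2180 of the 2423 observations
    return data.iloc[:split], data.iloc[split:]  # last 243 rows are out-of-sample
```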

#### **3. Empirical Results**

The prediction accuracy of each model for both the out-of-sample and in-sample data was measured using the Root Mean Square Error (RMSE) metric. Thus, the optimal model was selected as the one that minimizes the RMSE:

$$\text{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} \left(\hat{y}_t - y_t\right)^2} \tag{1}$$

where *ŷt* = the forecasted value at time *t*, *yt* = the actual value, and *T* = the number of observations.
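In code, Equation (1) is a one-liner; the sketch below assumes NumPy arrays of forecasts and actual values.

```python
import numpy as np

def rmse(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Root mean square error between forecasts y_hat and actuals y (Equation (1))."""
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))
```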

Our forecasts were produced for several alternative forecasting horizons, i.e., *t* + 1, *t* + 3, *t* + 5, and *t* + 10. We completed the same task with a random walk model in order to compare our machine learning results to a naïve prediction model.
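A common way to set up such multi-step forecasts is to pair the features observed at time *t* with the target value *h* days ahead. The sketch below is one such arrangement, not necessarily the paper's exact setup; it reuses the hypothetical `log_spot` column from the feature sketch above.

```python
import pandas as pd

def make_horizon_target(df: pd.DataFrame, horizon: int) -> pd.DataFrame:
    """Pair the features at time t with the (log) spot price at t + horizon."""
    out = df.copy()
    out["target"] = out["log_spot"].shift(-horizon)  # value to forecast (h = 1, 3, 5, 10)
    return out.dropna(subset=["target"])
```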

Before moving to structural models (i.e., those that include the independent variables of our dataset), we first tried to identify the best autoregressive representation, i.e., to produce the best AR(q) model. The AR(q) model is a simple model that uses past (lagged) values of the natural gas spot price to forecast the future natural gas spot price:

$$X_t = c + \sum_{i=1}^{q} \varphi_i X_{t-i} + \varepsilon_t \tag{2}$$

where *Xt* is the natural gas spot price at time *t*, *q* is the maximum number of lags, *φi* are the lag coefficients to be estimated, and *εt* is the error term.

In order to identify the optimal number of lags, we trained several linear SVM models, varying the number of lags each time from an AR(1) up to an AR(15).

We concluded that using the first 14 lags minimizes the in-sample RMSE (0.04196). These results are presented in Figure 4.

**Figure 4.** RMSE for AR models.
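The lag search can be reproduced in spirit as follows; this is a sketch assuming scikit-learn's `SVR` with a linear kernel and a hypothetical Series `prices` of (log) spot prices. The helper names and default hyperparameters are illustrative.

```python
# Sketch of the lag search: fit a linear SVM on AR(1) through AR(15)
# representations and keep the lag order with the lowest in-sample RMSE.
import numpy as np
import pandas as pd
from sklearn.svm import SVR

def make_lags(prices: pd.Series, q: int) -> pd.DataFrame:
    cols = {f"lag_{i}": prices.shift(i) for i in range(1, q + 1)}
    cols["y"] = prices
    return pd.DataFrame(cols).dropna()

def in_sample_rmse_by_lag(prices: pd.Series, max_q: int = 15) -> dict:
    results = {}
    for q in range(1, max_q + 1):
        df = make_lags(prices, q)
        X, y = df.drop(columns="y").values, df["y"].values
        model = SVR(kernel="linear").fit(X, y)
        results[q] = float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))
    return results  # the paper reports a minimum at q = 14 (RMSE = 0.04196)
```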

After identifying the best autoregressive representation, we built structural models. These include the 14 lags and all of the explanatory variables described earlier as independent variables to produce forecasts one day ahead. To this end, we trained several alternative machine learning models and also produced results for the random walk model. The in-sample and out-of-sample RMSEs of these models are presented in Table 2. An important issue in such forecasting models is to avoid overfitting, i.e., fitting the in-sample data closely while performing poorly on the out-of-sample data; in the literature, this is known as the bias–variance trade-off. An efficient forecasting model is one that provides balanced performance both in-sample and out-of-sample, i.e., one in which bias and variance are kept comparable. For this reason, we rejected all of the models that provided evidence of overfitting and continued our empirical analysis with the rest. In the last column of Table 2, we note the models that overfit and are excluded from the rest of our analysis. Interestingly, the tree-based models (including bagged and boosted trees) do not overfit, whereas all of the GPR models and most of the SVM models (with the exception of the linear SVM) do.

**Table 2.** In-sample and out-of-sample (OOS) RMSE of all models.
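The screening behind Table 2 can be mimicked by fitting several model families and comparing their in-sample and out-of-sample RMSEs. The sketch below uses scikit-learn estimators as stand-ins for the paper's models; the particular estimators, their default settings, and the overfitting threshold are assumptions, not the paper's exact specifications.

```python
# Illustrative bias-variance screen: flag models whose out-of-sample RMSE is
# far above their in-sample RMSE. Models and threshold are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

def screen_models(X_train, y_train, X_test, y_test, ratio_limit=2.0):
    models = {
        "linear regression": LinearRegression(),
        "linear SVM": SVR(kernel="linear"),
        "bagged trees": BaggingRegressor(n_estimators=100),
        "boosted trees": GradientBoostingRegressor(),
        "GPR": GaussianProcessRegressor(),
    }
    report = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        rmse_in = np.sqrt(np.mean((model.predict(X_train) - y_train) ** 2))
        rmse_out = np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2))
        report[name] = (rmse_in, rmse_out, rmse_out > ratio_limit * rmse_in)  # True = overfits
    return report
```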


#### *3.1. Time Frame t + 1*

According to the results presented in Figure 5, we observed that for the time horizon *t* + 1, the optimal in-sample model was the linear regression model with RMSE = 0.038421 and that the best out-of-sample forecasting model was the linear SVM model with RMSE = 0.056694. The robust linear model also showed very good results, as it had the second lowest RMSE in the out-of-sample data and the third lowest in the in-sample data. Finally, the random walk model seemed to adequately predict the out-of-sample data. Therefore, we can generally conclude that linear models are able to predict the natural gas spot prices one day ahead with high accuracy and that the best model (linear regression) has good generalization ability (Figure 6).

#### *3.2. Time Frame t + 3*

The results for the forecasting window *t* + 3 are presented in Figure 7. We observed that for time horizon *t* + 3, the optimal in-sample model was the bagged trees model with RMSE = 0.057793 and that the best out-of-sample forecasting model was the boosted trees model with RMSE = 0.077136. According to the above, it is clear that the best models at time horizon *t* + 3 are tree-based models. The out-of-sample performance of the bagged trees model is presented in Figure 8.

**Figure 5.** RMSE for *t* + 1 forecasting.

**Figure 6.** Comparison of the actual natural gas spot prices and the predicted prices with the linear regression model for *t* + 1 in the out-of-sample part of the dataset.

#### *3.3. Time Frame t + 5*

The results for the forecasting window *t* + 5 are presented in Figure 9. In this window, we found that the optimal in-sample model was the bagged trees model with RMSE = 0.061787 and that the best out-of-sample forecasting model was the linear SVM model with RMSE = 0.083687. The bagged trees model also showed good generalization ability (Figure 10). It is worth noting that the random walk model also performed well, achieving the second lowest out-of-sample RMSE (0.087654).

**Figure 7.** RMSE for *t* + 3 forecasting.

**Figure 8.** Comparison of the actual natural gas spot prices and the predicted prices with the bagged trees model for *t* + 3 in the out-of-sample part of the dataset.

#### *3.4. Time Frame t + 10*

Finally, the results for the *t* + 10 forecasting window are presented in Figure 11. We observed that the optimal in-sample model was the bagged trees model with RMSE = 0.064968 and that the best out-of-sample forecasting model was the linear SVM model with RMSE = 0.102711. Additionally, the random walk model showed good results, as it achieved the second lowest out-of-sample RMSE (0.109871). The best model for time horizon *t* + 10 (bagged trees) also has good generalization ability (Figure 12).

**Figure 9.** RMSE for *t* + 5 forecasting.

**Figure 10.** Comparison of the actual natural gas spot prices and the predicted prices with the bagged trees model for *t* + 5 in the out-of-sample part of the dataset.

Interestingly, the random walk model showed very good out-of-sample results for all forecasting horizons. At the same time, all of the linear models were able to predict natural gas prices with high accuracy, achieving small RMSE values. The bagged trees models also showed very good predictive ability, producing the lowest in-sample RMSE at all horizons except *t* + 1.

**Figure 11.** RMSE for *t* + 10 forecasting.

**Figure 12.** Comparison of the actual natural gas spot prices and the predicted prices with the bagged trees model for *t* + 10 in the out-of-sample part of the dataset.
