1. Introduction
Gas procurement is everyday business for power utilities as well as mid-size to large industrial companies. To facilitate trade, virtual trading hubs have been established across Europe, with the Dutch Title Transfer Facility (TTF) currently being the most liquid European hub. Other major hubs are the recently created Trading Hub Europe in Germany and the British National Balancing Point. In order to avoid short-term price peaks or significant price increases, as seen in mid-2021, market participants source long-term contracts that involve gas delivery for a whole season or calendar year. Still, considerable amounts have to be balanced at shorter notice, as demand and supply might be subject to short-term changes. Here, the Front Month (FM), a gas market product that implies a constant gas delivery in the upcoming month, is among the most liquid trading products. Given this definition and a certain gas demand, we have around 20 trading days for our sourcing activities, with the price changing every day. Hence, the question remains as to when and how often to buy. Many companies avoid this question by opting for the index, which is defined as the average price over all observations in the respective trading period. One can either replicate the index by purchasing the same amount every day or buy it from large energy companies, which offer the index as a commercial product.
Our idea now is: Can we use methodological support to identify a few trading days whose average price is below the index? In plain words: Can we beat the index? To the best of the authors' knowledge, the existing scientific literature offers no solution on how to trade natural gas in order to achieve this goal. Extending the scope, we could consider solutions for other gas-related products such as storages. Here, when considering a mathematical approach, the least squares Monte Carlo method (LSMC) seems to be the natural choice. It serves as a blueprint for almost all sourcing strategies in theory and practice [1,2] and is the fundamental theory for various software applications. Originally designed for pricing American options [3], it was adapted to swing options and gas storages by Boogert and de Jong [4]. Ludkovski and Maheshwari [5] also propose a Monte-Carlo-based method (regression Monte Carlo analysis) for stochastic storage problems. The dependence on Monte Carlo price simulations is both the advantage (versatility) and the drawback of these methods: Currently, the European gas and energy markets are seeing structural changes at almost regular intervals. Until mid-2021, the short-term gas price ranged between EUR 5/MWh and EUR 30/MWh. In 2022, we have seen the price crossing EUR 100/MWh. Hence, the underlying stochastic price model has to be adapted frequently. An alternative is given by neural networks (NNs), a method that dates back to the 1940s and is based on the digital replication of the functionality of the human brain. Recently, due to the increase in computational power, application scenarios can be found in almost any field and especially in time series forecasting, a task which is also required for our application: we intend to use NNs in order to identify the best trading days for gas sourcing. Our approaches are tested in a case study using almost 11 years of TTF FM prices. As a benchmark, we use the simple and risk-averse myopic approach, which, as a greedy strategy, seeks the instantaneous profit. The idea of beating the index is not new, and professional traders are constantly trying to outperform the average price. Hence, if the application of our proposed methods is to be worth the effort, the profit (in the sense of cost savings) has to be significantly positive and also better than that of the myopic approach. For this purpose, we design four different NN-based strategies. Three are classification approaches, i.e., they estimate the probability that today's price is among the lowest in the considered period. The fourth one simply forecasts the prices until the end of the considered trading period. Based on this estimate, we come to a purchasing decision.
Networks are trained using the first part of the dataset and tested using TTF FM prices between August 2017 and December 2021. We evaluate the monthly performance by comparing the price of our strategy to the true index price. Additionally, we compute the cumulative performance over time. Thereby, we see that, contrary to the myopic approach, all NN strategies manage to show a positive cumulative payoff at the end. The median of the monthly differences to the index is positive (meaning the index price is higher) as well for all NN strategies, albeit rather close to zero. Despite this positive outcome, there is no perfect algorithm. All strategies have periods (often more than a year) in which the cumulative payoff is negative. Additionally, the significant price surge in mid-2021 helped most strategies to significantly increase their performance. If a rather risk-averse strategy is preferred, then one of the classification methods is the best choice. If one yearns for the highest payoff and also accepts temporary negative payoffs, the forecasting-based method is recommended.
The article is structured as follows: Section 2 gives a brief introduction to NNs, Section 3 describes the dataset, whereas Section 4 contains the case study including the results. An extensive discussion, including a critical evaluation in the light of current scientific research, is given in Section 5. Section 6 summarizes and concludes the article.
2. Neural Networks
There is a large number of articles and books concerned with neural networks; hence, we refrain from giving a detailed introduction. Please refer to Aggarwal [6] or Nielsen [7] for more information. Zgurovsky et al. [8] is another recommended reference that explicitly includes applications of neural networks. Various studies have applied NNs for forecasting purposes, from the 1990s (see, e.g., Hill et al. [9]) until today, where fairly complex NN setups are used [10]. Kreuzer et al. [11] and Liebermann et al. [12], for example, propose a combination of different NN types for meteorological forecasts. Further applications of NNs are given by Abdel-Nasser and Karar [13], who forecast solar power, and Kim and Cho [14], who use NNs for energy consumption prediction. Srivastava et al. [15] apply NNs to wind power forecasting. Fawaz et al. [16] give an overview of deep learning techniques for time series classification. Finally, Livieris et al. [17] forecast gold price developments using neural networks.
The basic building block of all NNs is the neuron: it receives a feature vector $x = (x_1, \dots, x_n)^\top$ and sends out an output $o$. Given a vector of weights $w = (w_1, \dots, w_n)^\top$ and a linear shift or bias $b$, the features are aggregated to a value $z$ via

$$z = w^\top x + b = \sum_{i=1}^{n} w_i x_i + b.$$

The final component of a neuron is, then, a so-called activation function $\varphi$, which generates the individual neuron's output $o = \varphi(z)$ [6]. This function might be linear; the hyperbolic tangent, the sigmoid, or the rectified linear unit, i.e., $\varphi(z) = \max(0, z)$, are among the most common alternatives [6]. An exemplary neuron with three input values is displayed in Figure 1. A neural network consists of multiple neurons organized in layers. In Figure 2, we display, as an example, a feedforward NN with one hidden layer, one input layer, and one output layer, assuming there is no bias (called $b$ above) at each layer. The results of the output layer's activation function are the desired outputs of the network, such as a forecasted value.
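The aggregation and activation steps of a single neuron can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the three input values, the weights, and the bias are hypothetical and not taken from the case study.

```python
import math


def relu(z: float) -> float:
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def sigmoid(z: float) -> float:
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation):
    """Aggregate the features to z = sum_i w_i * x_i + b, then apply the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


# Hypothetical neuron with three inputs.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.5

# Here z = 0.5 - 2.0 + 0.75 + 0.5 = -0.25.
print(neuron(x, w, b, lambda z: z))  # linear activation: -0.25
print(neuron(x, w, b, relu))         # ReLU clips the negative value: 0.0
print(neuron(x, w, b, math.tanh))    # hyperbolic tangent
print(neuron(x, w, b, sigmoid))      # sigmoid
```

A layer applies many such neurons to the same input, and a feedforward network feeds the outputs of one layer into the next.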
There are different types of NN setups, such as so-called convolutional neural networks (CNNs), which are often used for classification problems in voice or pattern recognition [18,19]. If time series data are involved, and especially if data are to be forecasted, recurrent NNs are applied, as these allow intertemporal links. A widely used recurrent NN is the long short-term memory (LSTM) network [20]: Instead of a simple neuron, the LSTM cell consists of a sequence of gates, namely, an input gate processing the regular input, a forget gate that controls how much information is discarded, and an output gate that produces the final value handed to the next layer. Besides the possibility to model long-term effects, the LSTM model helps to avoid vanishing gradients of the loss function [21]. A good fundamental description of LSTM networks is given by Sherstinsky [22].
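For orientation, one common textbook formulation of the three gates is given below. The notation is an assumption made here and is not taken from the article: $x_t$ is the input at time $t$, $h_t$ the cell output, $c_t$ the internal cell state, $\sigma$ the sigmoid function, $\odot$ the elementwise product, and $W$, $U$, $b$ the trainable weights and biases. The candidate state $\tilde{c}_t$ is part of this standard formulation, although it is not mentioned in the text.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In this formulation, the cell state is updated additively, which is the usual explanation for why gradients decay more slowly over long sequences than in a plain recurrent network.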
To sum up, there is a large variability in NN setups and we have quite a few hyperparameters to determine. These include, among others, the number of neurons, the number of layers, the split ratio of the dataset into training and test data, and the loss function. An optimal combination is often found only by trial and error combined with experience.
3. Dataset
Here, we consider the TTF Month Ahead product, which is traded solely on working days. In total, we have 2880 observations ranging from 8 September 2010 to 31 January 2022 (data are courtesy of Wintershall Dea GmbH). As displayed in Figure 3, we see a structural break around mid-2021. Before that, TTF FM prices were oscillating between a minimum of around EUR 5/MWh and a maximum of around EUR 30/MWh. Short- to mid-term trends are visible, but an annual pattern is not. However, in the second half of 2021, due to various geopolitical and economic reasons, the situation changed as both volatility and absolute price level increased significantly. This structural break, which has severe effects on the calibration and design of (stochastic) price models, is captured by the statistical properties of the price differences summarized in Table 1. Note that we compute the differences before deriving standard measures such as mean and variance in order to at least roughly eliminate trends. Moreover, as the time period after the structural break is significantly shorter, statements are to be treated with care. Here we see that, even after removing the five largest differences, the standard deviation of the second observation period is significantly higher than that of the first one: to be precise, about 10 times higher. Additionally, there is some skewness in both periods, although in different directions. Kurtosis is about 24 in the first period, which is only a very small difference to the Gaussian value. However, in the second period, values are significantly smaller, indicating a platykurtic distribution.
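The summary statistics of the price differences can be computed along the following lines. This is a sketch under stated assumptions: it uses population (biased) moment estimators and the non-excess kurtosis, for which a Gaussian distribution has the value 3; the estimators behind Table 1 are not specified in the text, and the price series below is hypothetical.

```python
import math


def diff(prices):
    """First differences of a price series, used to roughly remove trends."""
    return [b - a for a, b in zip(prices, prices[1:])]


def moments(values):
    """Mean, standard deviation, skewness and (non-excess) kurtosis.

    Assumption: population estimators (division by n) and at least two
    distinct values, so that the variance is positive.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "std": std,
        "skewness": m3 / std ** 3,
        "kurtosis": m4 / m2 ** 2,  # Gaussian reference value: 3
    }


# Hypothetical daily prices in EUR/MWh.
prices = [20.0, 21.0, 19.0, 22.0, 22.0]
d = diff(prices)   # [1.0, -2.0, 3.0, 0.0]
stats = moments(d)
print(stats)
```

Applying `moments` separately to the differences before and after the structural break would give a comparison of the kind summarized in Table 1.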
4. Case Study
Based on the dataset from Section 3, we simulate a sourcing scenario based on the Front Month product. The benchmark is the fairly risk-averse option of index sourcing, i.e., basically buying the same amount of gas every trading day. This price is compared to various alternatives presented in Section 4.1. Model calibration is discussed in Section 4.2, and results are discussed in Section 4.3.
4.1. Test Setup and Trading Algorithms
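The test structure is as follows: We purchase gas at the Front Month price and sell it at the index price $I_m = \frac{1}{N_m} \sum_{d=1}^{N_m} P_{m,d}$, where $P_{m,d}$ denotes the FM price on the $d$th trading day of month $m$. The first alternative is a myopic approach (see also [1]): Every day we compute the preliminary index as the average of the FM prices already realized in the respective month. If the price today is below the preliminary index, we purchase the total amount. If on the last day the current price is still above the index (positive market development), then we have to purchase on the last day, i.e., the purchasing price for the $m$th month under the myopic approach is given by

$$P_m^{\text{myopic}} = P_{m,\tau_m}, \qquad \tau_m = \min\Big( \Big\{ d \in \{2, \dots, N_m\} : P_{m,d} < \tfrac{1}{d-1} \sum_{j=1}^{d-1} P_{m,j} \Big\} \cup \{N_m\} \Big),$$

with $N_m$ being the number of trading days of the $m$th month. Note that in this strategy we do not purchase gas on the first day of the new month. We also compute the best and the worst purchasing point of time for each month $m$, in formulas

$$P_m^{\text{best}} = \min_{1 \le d \le N_m} P_{m,d}, \qquad P_m^{\text{worst}} = \max_{1 \le d \le N_m} P_{m,d}.$$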
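The index, the myopic rule, and the best and worst purchase prices of a single month can be sketched as follows. This is a minimal illustration, assuming that the preliminary index on a given day is the average of the prices of the preceding days of that month (including today's price in the average leads to the same decision); the function names and the sample prices are hypothetical.

```python
def index_price(prices):
    """Index of a month: average FM price over all trading days."""
    return sum(prices) / len(prices)


def myopic_price(prices):
    """Buy the total amount on the first day whose price is below the
    preliminary index; otherwise buy on the last trading day."""
    running_sum = prices[0]  # no purchase on the first day of the month
    for d in range(1, len(prices)):
        preliminary_index = running_sum / d  # average of the d earlier days
        if prices[d] < preliminary_index:
            return prices[d]
        running_sum += prices[d]
    return prices[-1]  # forced purchase at the end of the month


# Hypothetical month with five trading days (EUR/MWh).
month = [20.0, 21.0, 22.0, 19.0, 23.0]
print(index_price(month))     # 21.0
print(myopic_price(month))    # 19.0, bought on the fourth day
print(min(month), max(month)) # best and worst purchase price: 19.0 23.0

# In a steadily rising month the rule never triggers before the last day.
rising = [20.0, 21.0, 22.0, 23.0]
print(myopic_price(rising))   # 23.0, above the index of 21.5
```

The second example shows why rising prices hurt the myopic rule: the purchase is postponed until the most expensive day.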
Alternatively, we test four NN-based strategies. In order to limit the degrees of freedom, we split the month, as an example, into two intervals of equal length. We also tested other alternatives (one to four intervals) and found that one or two purchase decisions yield the best results.
Three NN-based methods are classification methods. The first one, which we call ClassifyMin, computes for each of the upcoming days of the respective interval the probability that this day sees the lowest price. If today lies in the best 25% of all choices, we act. If not, we wait until tomorrow and run the same classification. The second method, ClassifyLower, is very similar but uses the last price of the previous month as a benchmark. For each of the upcoming days, we compute the probability that the price is lower than the last one of the previous month. The third classification algorithm is called ClassifyQuantile. Here, we estimate the probability of a price being in the lower quantile of the previous $k$ observations. Finally, as above, if the current day is among the best 25% of all remaining trading days (of the respective interval), we act. If not, we run the same analysis on the next day. Note that if we are at the end of the trading period, we have to purchase anyway. This strategy offers a few degrees of freedom: Which quantile is used? How many of the previous trading days are considered? Should we focus not on the best 25% but on an alternative quantile value? The last tested trading strategy, which we call NN Forecast, is based on forecasting the prices until the end of the considered interval. If the forecast for today is in the lower 25% range of all forecasted prices, we act. If not, then we wait and update our forecast with today's observation.
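The wait-or-buy logic of NN Forecast can be sketched as follows. Several points are assumptions made for illustration: the "lower 25% range" is interpreted by rank (fewer than 25% of the compared prices are cheaper than today's), `forecaster` is a hypothetical placeholder for the trained network, and the oracle forecaster in the example simply returns the true future prices so that the decision rule can be checked in isolation.

```python
def should_buy(path, share=0.25):
    """path[0] is today's price, path[1:] are the forecasts for the remaining days.

    Assumption: buy if today ranks in the cheapest `share` of the path.
    On the last day of the interval (no remaining days) we always buy.
    """
    if len(path) == 1:
        return True
    cheaper = sum(1 for p in path if p < path[0])
    return cheaper < share * len(path)


def run_interval(prices, forecaster, share=0.25):
    """Step through one trading interval and return (day index, purchase price)."""
    for d, price in enumerate(prices):
        remaining = len(prices) - d - 1
        # The forecast is refreshed every day with the prices observed so far.
        path = [price] + list(forecaster(prices[:d + 1], remaining))
        if should_buy(path, share):
            return d, price


# Hypothetical interval of eight trading days (EUR/MWh).
interval = [22.0, 21.0, 19.5, 20.0, 23.0, 24.0, 25.0, 26.0]


def oracle(history, n):
    """Stand-in forecaster for testing only: returns the true future prices."""
    return interval[len(history):len(history) + n]


print(run_interval(interval, oracle))  # (2, 19.5)
```

The classification strategies follow the same loop, with the estimated probabilities taking the place of the forecasted prices.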
4.2. Model Calibration
For training the neural network, we split the dataset into two parts. The first 60% are used for training purposes, while the remaining 40% are test data. Common ratios are 70%/30% or 75%/25%. However, as we (a) have a sufficiently long dataset and (b) consider only monthly values in the final analysis, we opted for the 60%/40% split. This gives us a test dataset that is long enough for performing a proper analysis. Note that when splitting up the dataset we have to consider a certain offset, i.e., $p$ days. The reason is that our strategies need the previous $p$ observations, e.g., for generating price forecasts.
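The chronological split with an offset of p days can be sketched as follows. The helper names are hypothetical, and the windowing shown here (previous p prices as input, next price as target) is one plausible way to feed the data to a forecasting network, not necessarily the exact preprocessing used in the case study.

```python
def chronological_split(prices, p, train_percent=60):
    """Split a price series in time order.

    The test part starts p observations before the cut, so that the first
    test day already has its p previous prices available as input.
    """
    cut = len(prices) * train_percent // 100
    return prices[:cut], prices[cut - p:]


def make_windows(prices, p):
    """Pair each price with the p prices that precede it."""
    inputs = [prices[t - p:t] for t in range(p, len(prices))]
    targets = [prices[t] for t in range(p, len(prices))]
    return inputs, targets


# Hypothetical series of ten observations and an offset of p = 2.
series = [float(v) for v in range(10)]
train, test = chronological_split(series, p=2)
test_inputs, test_targets = make_windows(test, p=2)

print(train)         # first 60% of the series
print(test_targets)  # the last 40%: [6.0, 7.0, 8.0, 9.0]
print(test_inputs[0])  # [4.0, 5.0], taken from the end of the training part
```

Without the offset, the first p test days would have no complete input window and would have to be dropped from the evaluation.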
Various setups are tested. In all cases, a multilayer neural network with three layers and a neuron split of
is used. The hidden layer is an LSTM layer. These values may seem arbitrary but are the result of a parameter optimization: various setups were tested, and this one performed best. In addition, it turned out that the classification models perform best with 50 epochs and a binary cross-entropy loss function. All layers use the hyperbolic tangent as activation function, except for the last layer, where a sigmoid activation function is applied. For the regression model (NN Forecast), 200 epochs are chosen and the mean squared error is applied as loss function. Here, we also use the hyperbolic tangent as activation function in the first two layers, whereas a linear activation function is used in the last layer (which generates the output). For
ClassifyQuantile, different versions were tested and we ended up with
. We also tested different sizes of previous data fed to the algorithm for benchmark purposes (see
Section 4.1) and found that the previous 5 observations are sufficient.
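For reference, the two loss functions named above can be written out in plain Python. This is a sketch of the standard definitions, not of the training code used in the case study; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean squared error between observed and forecasted prices."""
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true)


# An uninformative classifier (probability 0.5 everywhere) has a loss of ln(2).
print(binary_cross_entropy([1, 0], [0.5, 0.5]))

# Hypothetical forecast that misses the second price by EUR 2/MWh.
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # 2.0
```

The sigmoid output layer of the classification models produces the probabilities that enter the cross-entropy, while the linear output layer of the forecasting model produces the prices that enter the squared error.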
4.3. Results
The test dataset comprises all months from August 2017 to January 2022. For each month, we apply all strategies to obtain a purchasing price. Then, we compute the difference to the index. First, as an additional motivation for this case study, we consider the best- and the worst-case scenario. Given the prices displayed in Figure 3 and the characteristics summarized in Table 1, the risk of trading on individual days instead of sourcing via the index has been increasing, especially since August 2021. In Figure 4, we show the difference between the best and the worst sourcing day of each month relative to the eventual index price. Even before August 2021, this relative difference could be as large as 20% of the index price or even 30%, as in December 2020. Hence, if one prefers individual sourcing over an index-based strategy, one has to accept a significantly increased risk.
A risk-averse approach is the myopic strategy, whose cumulative payoff is, among others, displayed in Figure 5, together with the historical true index price, which serves as a reference. A positive difference means we are able to beat the index. In order to evaluate the performance over time, we successively sum up the performance of each month to obtain the cumulative payoff. We see that the performance of the myopic strategy more or less evolves opposite to the index. A positive trend has, by construction (see Section 4.1), a negative effect on the strategy as we keep on postponing the purchase decision until we have to source at the end of the month. Especially in times of increasing prices, this strategy is no real alternative to index sourcing.
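The monthly and cumulative payoffs can be computed as sketched below. The sign convention follows the text (index price minus strategy price, so a positive value means the index was beaten); the function names and the three months of prices are hypothetical.

```python
from itertools import accumulate
from statistics import median


def monthly_payoffs(index_prices, strategy_prices):
    """Payoff per month: index price minus the price paid by the strategy."""
    return [i - s for i, s in zip(index_prices, strategy_prices)]


def cumulative_payoff(index_prices, strategy_prices):
    """Running sum of the monthly payoffs over the test period."""
    return list(accumulate(monthly_payoffs(index_prices, strategy_prices)))


# Hypothetical index and strategy prices for three months (EUR/MWh).
index_prices = [21.0, 21.5, 20.0]
strategy_prices = [19.0, 23.0, 19.5]

payoffs = monthly_payoffs(index_prices, strategy_prices)
print(payoffs)                                           # [2.0, -1.5, 0.5]
print(cumulative_payoff(index_prices, strategy_prices))  # [2.0, 0.5, 1.0]
print(median(payoffs))                                   # 0.5
```

In this toy example the strategy loses against the index in the second month, yet both the cumulative payoff and the median of the monthly payoffs stay positive.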
The strategy ClassifyLower is fairly balanced over time, showing some minor positive and negative cumulative payoffs until August 2021, i.e., when we see a considerable price increase. Here, the strategy actually performs well, resulting in a cumulative payoff of almost EUR 10/MWh. However, excluding times with steep price changes, this strategy shows no convincing performance. The quantile-based classification algorithm, again, performs fairly well until about April 2020, from which onwards the cumulative performance gradually decreases to a low of EUR /MWh in September 2021. From there on, the performance recovers. However, given common risk management guidelines in energy companies, it is very likely that this trading strategy would have been canceled beforehand. A strategy has to show a certain robustness against significant price changes, and this one does not. Hence, it can be considered inadequate for an automated trading scheme. At the end, from a cumulative perspective, both ClassifyMin and NN Forecast outperform all other strategies as they profit significantly from the steep price increase at the end. Between about February 2019 and May 2021, both strategies, which apparently make the same purchase decisions in most cases, are outperformed by all others except the myopic strategy. Looking at the true index, it is hard to identify any specific pattern that explains why this might be the case. ClassifyLower, again, was able to handle the price changes pretty well and showed a maximum cumulative loss of about EUR /MWh. It oscillates less than, e.g., NN Forecast and also shows a smaller cumulative loss at any time. In total, there is no strategy which outperforms all other ones or constantly outperforms the index. Given the cumulative payoff, NN Forecast, ClassifyMin, and ClassifyLower were able to handle the changing price regime, i.e., the steep price increase. At the end, ClassifyQuantile was able to achieve that as well; however, it shows a steep dip for a couple of months.
6. Conclusions
This article is concerned with index-based gas sourcing, which is a widespread form of short-term gas demand management in companies. We test whether it is possible to develop a neural-network-based trading algorithm that achieves a gas sourcing price below the index by identifying the optimal time of purchase. Two types of neural networks are tested: one set designed for classification and one for forecasting the upcoming prices. As a classic benchmark, a myopic trading approach is included as well.
All models are tested using observations from the Dutch TTF trading hub. Results show that there is no outstanding method. Considering cumulative performance numbers, all models show positive results over the test period. However, all models have periods with negative performance and some are not robust against substantial price changes. If one has to choose, then a forecasting-based approach would be the best recommendation.
Although no trading strategy is superior at every point in time, this analysis shows that neural networks have the potential to beat the index. We already tested various hyperparameters, such as different numbers of layers and/or neurons, different quantiles for the classification algorithm, and different numbers of purchase times, for each strategy. However, there is still a considerable number of open questions. For example, there are more sophisticated price forecasting mechanisms than the ones applied here. Furthermore, it might be beneficial to include additional information such as temperature, demand, or other trading products. As elaborated in the discussion section, further research is required.