Figure 3.
(a) Cumulative reward change during training for 4 km. (b) Cumulative reward change during training for 12 km. (c) Q-value functions and updates for 4 km. (d) Q-value functions and updates for 12 km. The Q-value functions (blue line and yellow line) are overlapping.
6.2.1. Performance
The cumulative profit over a period of 24 h (an episode) from exploiting the flexibility of pipeline energy storage is visualized on the left side of Figure 4 for an example day from the testing dataset, for which both the MINLP and Q-learning outputs are feasible. The performance of the MINLP (yellow line) and Q-learning (green line) is compared to the LP for the CHP economic dispatch without grid dynamics (blue line), as well as to the upper bound (red line) and the basic control strategy (black line). The LP always finds an optimal solution to the CHP economic dispatch model. Therefore, the profit gained by incorporating the pipeline energy storage can be inspected as the profit difference relative to the results without modeling storage in the network. These differences are shown in the left-hand plots of Figure 4.
In the first nine hours of the day for the 4 km pipe (top-left plot in Figure 4), the electricity price is low, and both the MINLP and Q-learning make use of it by producing more heat and less electricity. The electricity price increases over the following time steps, and the MINLP and Q-learning use part of the already produced heat to satisfy the consumers' heat demand. This facilitates an increase in the production of electricity and a surpassing of the LP benchmark, in the 12th hour for the MINLP and in the 14th hour for Q-learning. The MINLP enables savings of €2159.41 for the 4 km pipe compared to the LP. The increase in the pipe length to 12 km increases the volume of water that can be stored and consequently offers a greater opportunity for utilizing the thermal inertia of the network. Consequently, for the larger pipe length of 12 km, the MINLP saves €3271.02 compared to the LP. Q-learning saves €6.32 and €115.79 for the 4 km and 12 km cases, respectively.
The box plots on the right side of Figure 4 summarize the results for all days of the testing dataset. The median and quartile values show the lower performance of Q-learning over the entire dataset and the decrease of its performance when the pipe length is increased from 4 km to 12 km. The differences between the mean profits achieved by the LP and the MINLP, and by the LP and Q-learning, are statistically significant according to a paired t-test. We distinguish three possible explanations for the described behavior of Q-learning.
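As a minimal sketch of how such a significance check can be reproduced (the profit arrays are stand-in data, not the paper's results; `ttest_rel` is the paired test because the same day appears in both samples):

```python
# Minimal sketch: paired t-test on per-day profits (stand-in data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# In the paper these would be daily profits over the testing dataset.
profit_lp = rng.normal(10_000, 500, size=182)
profit_minlp = profit_lp + rng.normal(120, 40, size=182)  # MINLP gains over LP
profit_q = profit_lp + rng.normal(5, 20, size=182)        # smaller Q-learning gains

# Paired t-test: samples are dependent because each day yields one value per method.
t_minlp, p_minlp = stats.ttest_rel(profit_minlp, profit_lp)
t_q, p_q = stats.ttest_rel(profit_q, profit_lp)
print(f"LP vs MINLP: t={t_minlp:.2f}, p={p_minlp:.3g}")
print(f"LP vs Q-learning: t={t_q:.2f}, p={p_q:.3g}")
```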
The first comes from the discretization of the state and action space in tabular Q-learning. This discretization introduces an observability error: the nonlinear combination of the discretization errors of the temperatures in the supply and return network and of the mass flow. The observability error represents the difference between the information about the pipe state available to the Q-learning algorithm and the state of the simulator. This error introduces randomness (environment stochasticity) into the environmental information available to the Q-learning agent relative to the actual values in the simulator. A two-sample t-test confirms the effect of the observability error on the reward function during testing (when there is no exploration). When enlarging the pipe from 4 to 12 km, the discretization step of the chunk's mass is tripled to keep the state space at the same (feasible) size. In the reward function (Section 3.3), the punishment for violating one of the four feasibility points is greater than the worst-case profit collected over an episode. Therefore, the increased environment stochasticity leads to less exploration (Figure 3) and to choosing "safer" actions. This results in the performance degradation of Q-learning when the pipe is enlarged (Figure 4).
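To illustrate the mechanism (a hypothetical sketch; the bin counts and value ranges are illustrative, not taken from the paper), coarsening the discretization of a continuous pipe quantity enlarges the gap between what the agent observes and what the simulator uses:

```python
# Hypothetical sketch: a coarser discretization of the pipe state enlarges the
# observability error (the gap between agent state and simulator state).
import numpy as np

def discretize(value, lo, hi, n_bins):
    """Map a continuous value to the center of its bin."""
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(value, edges) - 1, 0, n_bins - 1)
    return (edges[idx] + edges[idx + 1]) / 2

rng = np.random.default_rng(1)
true_mass = rng.uniform(0, 300, size=10_000)  # chunk mass in the simulator

# Tripling the mass step (as for the 12 km pipe) keeps the Q-table feasible
# but triples the worst-case discretization error of the observed state.
for n_bins in (30, 10):  # fine grid vs. a 3x coarser grid
    observed = discretize(true_mass, 0, 300, n_bins)
    print(n_bins, "bins -> mean |error| =", np.mean(np.abs(observed - true_mass)))
```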
The second explanation for the profit difference between the MINLP and Q-learning lies in the time horizons of the heat demand and electricity price predictions available to the two algorithms. When creating the mathematical model of the DHN, the MINLP is provided with deterministic (perfect) knowledge of the heat demand and electricity prices 24 h (the length of the optimization horizon) in advance. In the Q-learning algorithm, the information about the environment is part of the state space. Because of the state space size limitation, Q-learning includes a perfect prediction of the heat demand and electricity price only one hour in advance, through the external part of the state space. Additional information about heat demand trends is provided to the agent through the inclusion of one-step-ahead season and time-of-day information in the external state. The MINLP can thus be seen as an optimization strategy for the day-ahead electricity market, whereas Q-learning delivers a trading strategy for the real-time electricity market. The suitability of the Q-learning algorithm for online use is important: in the future, real-time services will gain significance as the share of intermittent and distributed energy sources increases [2].
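A minimal sketch of how such an external state can be composed (the field names are illustrative; the paper's exact encoding may differ):

```python
# Illustrative sketch: the external part of the agent's state carries only
# one-step-ahead information, keeping the tabular state space tractable.
from typing import NamedTuple

class ExternalState(NamedTuple):
    heat_demand_next: int   # discretized heat demand forecast, 1 h ahead
    elec_price_next: int    # discretized electricity price, 1 h ahead
    season_next: int        # season index at the next time step
    hour_next: int          # time of day at the next time step

# Example: a winter (season 0) state at 7 a.m. with mid-range demand/price bins.
s_ext = ExternalState(heat_demand_next=3, elec_price_next=2,
                      season_next=0, hour_next=7)
```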
Both the MINLP and Q-learning have limitations when applied to the uncertain and the certain electricity markets, respectively. With the inclusion of future electricity prices, the state space of Q-learning grows exponentially, making it an unsuitable choice for trading on the day-ahead electricity market. The MINLP can be adjusted to operation on the real-time electricity market with a rolling horizon approach: optimize over a 24 h time horizon, apply the first action to the simulator model, initialize the mathematical model variables with the values from the simulator after the transition, and repeat the optimization. Due to the complex, nonlinear dependencies, the model is sensitive to the external input from the simulator used for the initialization. The interaction with the simulator results in an instability of the optimization (more details on the stability of the optimization are given in Section 6.2.2). When applying the rolling horizon approach to all the days in the testing dataset, the furthest reached time step in an episode is 22 (out of 24): not a single day in the dataset could be completed.
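The rolling horizon loop described above can be outlined as follows (a hypothetical sketch; `solve_minlp` and `simulator` stand in for the paper's solver and simulator interfaces and are not part of the original work):

```python
# Hypothetical outline of the rolling horizon approach for real-time operation.
HORIZON = 24  # optimization horizon in hours

def rolling_horizon(simulator, solve_minlp, forecasts, n_steps=24):
    """Run one day; return the furthest time step reached before solver failure."""
    state = simulator.initial_state()
    for t in range(n_steps):
        # Re-optimize over the next 24 h from the current simulator state.
        plan = solve_minlp(state, forecasts[t:t + HORIZON])
        if plan is None:  # the solver found no primal bound from this state
            return t      # furthest reached step (at best 22 in the paper)
        # Apply only the first action, then re-initialize from the simulator.
        state = simulator.step(plan[0])
    return n_steps
```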
To gain additional insight into the influence of the environment stochasticity and of the horizon of perfect heat demand and electricity price information on the performance of the algorithms, we plot the dependence of profit on variance. The heat demand variance does not have a distinguishable influence pattern on the algorithms' performance (left side of Figure 5). An increase in the electricity price variance increases the profit variance (right side of Figure 5). The performance difference between the MINLP and Q-learning is remarkable when analyzed with respect to the electricity price variance for the long pipe length of 12 km (bottom-right plot in Figure 5). We identify two possible reasons for the differences between the upper and lower quartiles and the maxima and minima of the profit box plots in Figure 5. The first is that the heat demand and/or electricity price on different days can have the same variance while the pattern of change from time step to time step varies. Specific scenarios (for example, lower heat demand in a few time steps followed by higher heat demand) facilitate the use of the pipeline energy storage flexibility; therefore, the profit for those scenarios lies in the upper quartile of the box plot. The second reason is that the impacts of heat demand and electricity price on the profit gain are intercoupled. Consequently, the profit achieved on days where the heat demand has the same variance can vary depending on the electricity price pattern.
The third explanation for the profit difference between the MINLP and Q-learning is the behavior of these algorithms at the end of the optimization time horizon. At the end of the time horizon, the MINLP can exploit the residual heat in the pipeline to maximize the profit gain. We call this phenomenon "draining the pipe". As the CHP economic dispatch is a continuous process, draining the pipe is undesirable. A Q-learning algorithm assigns values to state-action pairs through interactions with the environment. The state-action pair at the end of an episode might occur again at the beginning of the following episode. Therefore, a Q-learning algorithm does not learn to "drain the pipe", since it updates the same Q-values at different time steps. The chance to exploit the residual heat at the end of the episode gives the MINLP an advantage in achieved profit.
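This behavior follows from the standard tabular Q-learning update, which is agnostic to the position of a time step within an episode: the same entry $Q(s,a)$ is updated whether the pair $(s,a)$ occurs at the end of one episode or at the beginning of the next,

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right],$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. A policy that empties the pipe in the final hour is therefore not reinforced, because the value of that state also reflects the returns observed when the same state starts a new day.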
6.2.2. Stability
As explained in Section 2.1, the model of the CHP economic dispatch is discontinuous, nonconvex, and highly nonlinear because of the dependencies between the temperatures at the inlet and outlet of the pipeline in the supply and return networks and the mass flows. The complexity of the model represents a challenge for optimization with any state-of-the-art solver.
Figure 5.
(a) Heat demand variance for 4 km. (b) Electricity price variance for 4 km. (c) Heat demand variance for 12 km. (d) Electricity price variance for 12 km.
The mathematical model has 553 variables and 865 constraints. IPOPT from the SCIP optimization suite [45] is used for solving the CHP economic dispatch model on the 182 days of the test dataset. With ten iterations, each lasting six minutes, IPOPT does not find a primal bound of the model for 120 days for the 4 km pipe and for 113 days for the 12 km pipe (Figure 6). These days are omitted from the profit and feasibility analysis for both the MINLP and Q-learning (although Q-learning finds a solution for all days of the test dataset).
Moreover, the optimization procedure is sensitive to changes in the parameters. Increasing the upper temperature limit from 110 to 120 °C raises the number of stable days to 92 for 4 km and 101 for 12 km. Widening the range of allowed temperature values relaxes the constraints and enables more solutions to be found.
6.2.3. Feasibility
The feasibility of the CHP economic dispatch model and of Q-learning is assessed on five points by simulator evaluation: underdelivered heat to the consumer, maximum inlet supply temperature, minimum inlet supply temperature, minimum inlet return temperature, and maximum mass flow. In Figure 7, the so-called quantile plots show the percentage of days (vertical axis) for which a violation percentage lower than or equal to a given value occurred. No violations of the maximum mass flow or of the minimum return inlet temperature take place.
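Such a quantile plot is essentially an empirical cumulative distribution of the per-day violation percentages; a minimal sketch of its construction (with made-up data) is:

```python
# Minimal sketch: quantile (empirical CDF) plot of per-day violation percentages.
import numpy as np

rng = np.random.default_rng(2)
violations = rng.exponential(2.0, size=182)  # made-up per-day violation %

x = np.sort(violations)                      # violation percentage (horizontal axis)
y = np.arange(1, len(x) + 1) / len(x) * 100  # % of days with violation <= x (vertical axis)

for q in (50, 90, 100):
    print(f"{q}% of days have violation <= {np.percentile(violations, q):.2f}%")
```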
In the CHP economic dispatch model, the network safety guarantees and limitations are implemented as hard constraints, while in the Q-learning algorithm they are soft constraints integrated into the reward function. Therefore, we expected the MINLP to have a lower percentage of violations, especially regarding the underdelivered heat demand. However, this is not the case.
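The soft-constraint formulation amounts to subtracting a large penalty from the per-step profit (a hypothetical sketch; the penalty magnitude is illustrative, chosen only to satisfy the Section 3.3 condition that it exceed the worst-case episode profit):

```python
# Illustrative sketch: feasibility limits as soft constraints in the reward.
PENALTY = 50_000.0  # euros; must exceed the worst-case profit of an episode

def reward(profit_step: float, n_violations: int) -> float:
    """Profit for one time step minus a large penalty per violated limit."""
    return profit_step - PENALTY * n_violations

print(reward(1_200.0, 0))  # feasible step: reward equals profit
print(reward(1_200.0, 1))  # a single violated limit dominates the reward
```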
While Li et al. [12] provide a comprehensive mathematical formulation of the processes in the DHN pipeline, they assume a constant mass flow between the supply and return networks. That is unfavorable in DHN operation, as it can result in high return temperatures [46]. The heat exchange process in the HES is described accurately only by a set of complex mathematical equations. We hypothesize that the simplified model of the HES in the MINLP causes the divergence between the solution of the MINLP and the simulator. To evaluate this hypothesis, we designed two more experiments.
In the first experiment, the simulator uses an HES that controls the mass flow (realistic HES). In the second experiment, the simulator uses a simple HES model (ideal HES). The divergence is judged by the mean difference in the return outlet temperature between the CHP economic dispatch model and the simulator for both cases. The return outlet temperature is chosen because it lies at the end of the temperature propagation cycle, where the difference between the simulator and the optimizer should be most noticeable. The mean temperature difference between the simulator with the realistic HES and the optimizer is 9 °C (with the higher temperature in the optimizer), while it is 0.18 °C between the simulator with the ideal HES and the optimizer. We conclude that the absence of a more realistic HES model in the CHP economic dispatch model causes the divergence between the solution of the MINLP and the simulator.
By interacting with the environment, the Q-learning algorithm approximates the dynamics of the HES, leading to fewer violations. The pipe length increase deepens the environment stochasticity, as explained in Section 6.2.1, resulting in an increase of violations when moving from the 4 km to the 12 km pipe.
6.2.4. Time-Scale Flexibility
To determine the suitability of the algorithms for online use, we assess the time-scale flexibility of the MINLP and the Q-learning algorithm.
The maximum number of iterations for MINLP is ten, and the iteration length is six minutes. The optimization time for one day from the testing dataset is one hour.
The training times of the Q-learning algorithm for the 4 km and 12 km pipe lengths are 4744 and 4222 min, respectively. All experiments are performed on a computer with a 4-core Intel i7 8665 CPU. The training times are approximately the same because of the identical state space size. Q-learning responds to unseen scenarios from the testing dataset within a few seconds.
Therefore, Q-learning is the suitable choice for real-time energy trading and for encapsulation: the user only needs to input the operating state to obtain the control strategy, whereas the optimization algorithm needs to rewrite the constraints and other formulas for different situations and repeat the optimization.
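This encapsulation amounts to a constant-time greedy lookup in the trained Q-table (a minimal sketch; the state encoding and table layout are illustrative, not the paper's):

```python
# Minimal sketch: real-time control as a greedy lookup in a trained Q-table.
import numpy as np

def control_action(q_table: dict, state: tuple) -> int:
    """Return the greedy action for the current (discretized) operating state."""
    q_values = q_table[state]   # no re-optimization is needed online
    return int(np.argmax(q_values))

# Toy example: state -> Q-values of three candidate production actions.
q_table = {("demand=3", "price=2", "hour=7"): np.array([1.2, 4.0, -0.5])}
action = control_action(q_table, ("demand=3", "price=2", "hour=7"))  # -> 1
```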