Article

Hierarchical Model-Based Deep Reinforcement Learning for Single-Asset Trading

Department of Computing, Faculty of Engineering, Imperial College London, London SW7 2AZ, UK
Analytics 2023, 2(3), 560-576; https://doi.org/10.3390/analytics2030031
Submission received: 26 April 2023 / Revised: 4 July 2023 / Accepted: 7 July 2023 / Published: 11 July 2023

Abstract

We present a hierarchical reinforcement learning (RL) architecture that employs various low-level agents to act in the trading environment, i.e., the market. The highest-level agent selects from among a group of specialized agents, and then the selected agent decides when to sell or buy a single asset for a period of time. This period can be variable according to a termination function. We hypothesized that, due to different market regimes, a single agent is insufficient when learning from such heterogeneous data and that multiple agents, each specializing in a subset of the data, will perform better. We use k-means clustering to partition the data and train each agent with a different cluster. Partitioning the input data also helps model-based RL (MBRL), where models can be heterogeneous. We also add two simple decision-making models to the set of low-level agents, diversifying the pool of available agents and thus increasing overall behavioral flexibility. We perform multiple experiments showing the strengths of a hierarchical approach and test various prediction models at both levels. We also use a risk-based reward at the high level, which transforms the overall problem into a risk-return optimization. This type of reward shows a significant reduction in risk while only minimally reducing profits. Overall, the hierarchical approach shows significant promise, especially when the pool of low-level agents is highly diverse. The usefulness of such a system is clear, especially for human-devised strategies, which could be incorporated in a sound manner into larger, powerful automatic systems.

1. Introduction

Since its inception in 2015 [1], deep reinforcement learning (DRL) has found many applications in a wide set of domains [2]. In recent years, its use in trading has become quite popular, at least in academia (for an overview, see [3] or [4]).
Most of the interest lies in the portfolio optimization problem (see, for example, references [5,6,7] for some modern approaches), which can leverage multiple assets in both uptrends and downtrends. In single-asset trading, there is considerable risk involved, as argued, for example, in ref. [8]: if the asset goes down, then the best option is to hold. Additionally, if the market is in a strong bull trend, it is hard to beat it; see, for example, a recent confirmation of this in [9]. The market (in this case, the single asset) often changes regimes with respect to volatility, trend, smoothness, etc. Many works have shown and modeled this; see, for example, works on trading and market regime switches (e.g., [10,11]).
The applied methodology pursued by many (if not all) researchers working in DRL for trading is to evaluate the agent for a short period (test mode) and then retrain it to deal with another test period. Moreover, the training must occur before the testing so the agent can catch the current market dynamics.
This workflow is cumbersome and introduces additional constraints and computational costs. Moreover, the uncertainty about the change in the market regime and timing still needs to be addressed in real-time. We thus propose an alternative method, where a hierarchical system decides which of the regimes the market is in and assigns the most appropriate behavior by selecting a low-level agent to act in the actual market, i.e., to buy and sell the asset. We delve more into the architecture in subsequent sections. We now proceed to the RL problem formulation.
The reinforcement learning problem is usually formulated as a Markov decision process (MDP) [12], which is a discrete stochastic control process for decision-making in environments with uncertainty (there are also continuous MDPs [13]). Formally, a discrete MDP is given by:
  • S, a set of discrete states that the agent can be in, often called the state space.
  • A, a set of discrete actions the agent can take in the environment, also referred to as the action space.
  • p_a(s, s'), the probability of being in state s at timestep t, taking an action a, and then being in state s' at timestep t + 1. This is also known as the transition function.
  • r_a(s, s'), the reward function, which similarly depends on the current action a and the current state s, and is given by the environment upon reaching the next state s'. The policy of selecting an action a when in state s is usually denoted by π(a|s) and is often probabilistic.
In dynamic programming, which is in some sense the precursor of RL and where the transition and reward functions are given or known, two equations are employed alternately:
  • To update a value function V ( s ) , which encodes how good it is to be in state s;
  • To update the policy, which reflects the value function and is taken to be the action that maximizes this value function estimate, known as the greedy policy for V.
Formally:
V(s) = \sum_{s'} p_{\pi(s)}(s, s') \left[ r_{\pi(s)}(s, s') + \gamma V(s') \right]
\pi(s) = \arg\max_a \sum_{s'} p_a(s, s') \left[ r_a(s, s') + \gamma V(s') \right]
where γ is called the discount factor and 0 ≤ γ ≤ 1. A commonly used alternative in RL is a state-action value function, where we now have two identifiers, s and a, and where p and r are unknown:
Q(s, a) = \sum_{s'} p_a(s, s') \left[ r_a(s, s') + \gamma \cdot \max_{a'} Q(s', a') \right]
with the model-free, iterative version given by the following (which is, in a sense, an approximation of the above equation with the greedy policy selection step included):
Q_{new}(s, a) = Q_{old}(s, a) + \alpha \cdot \left[ r_a + \gamma \cdot \max_{a'} Q(s', a') - Q_{old}(s, a) \right]
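As a minimal illustration of this model-free update, the following sketch implements tabular Q-learning with a greedy policy and ε-exploration; the state/action counts and hyperparameter values are illustrative assumptions, not the trading setup used later in the paper.

```python
import numpy as np

# Illustrative sizes and hyperparameters (assumptions for this sketch).
n_states, n_actions = 10, 2
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(Q, s):
    # Greedy action selection with epsilon-probability exploration.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next):
    # Q_new(s,a) = Q_old(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q_old(s,a)]
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```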
There are two classes of algorithms for RL: value function algorithms, which try to accurately estimate the state-value function V or the state-action value function Q (Q-learning), and policy algorithms, which look directly at optimizing π (policy gradient), bypassing the value function estimation (others use both methods, e.g., actor-critic).
On a different set of axes, there are model-based RL agents, which model and learn the transition function p(s, s'), and model-free agents, which do not explicitly model the transition function. Model-based RL implies the existence of a simulation of the environment with which the agent can interact without incurring the costs associated with the real environment (though also without benefiting from its precision). If a model can be learned accurately and efficiently, then the advantages of having one are apparent: knowing the possible outcomes of some actions without actually taking them enables long-term planning and significantly enhances the agent’s capabilities.
Instead of using the summation over all possible following states to obtain a value for Q, we concatenate the prediction of the next state(s) to the current state, effectively adding another identifier to the Q function, namely the predicted future. This approach is more efficient, albeit less accurate, while still using the information from the prediction. Moreover, it can be applied to any reinforcement learning algorithm with minimal overhead. This methodology has been used previously for the same purpose; see, for example, [14]. Following [15], this technique could be classified as implicit model-based RL, where we do not explicitly model the transition function and the planning procedure but instead use the prediction as part of the RL agent representation.
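A minimal sketch of this state augmentation is given below: a prediction of the next state(s) is appended to the observation window before it is fed to the RL agent. The prediction function, window size, and array shapes are placeholders for illustration, not the exact models used later.

```python
import numpy as np

def augmented_state(history, predict_fn, n=50):
    """Concatenate the recent window x_{t-n:t} with the model's prediction of
    the next state(s), so the prediction becomes part of the agent's input."""
    window = np.asarray(history[-n:])
    prediction = np.atleast_1d(predict_fn(window))  # one or several predicted steps
    return np.concatenate([window, prediction])
```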
The purpose of this work was to leverage the newest technologies in the DRL domain to obtain better performance in highly volatile trading markets. The work described herein brings a number of novel contributions and overcomes some of the limitations of previous systems, such as reduced performance on testing sets that differ substantially from the training sets and the need for repeated retraining. Some benefits of using such a system are: lower computational cost, higher flexibility, the incorporation of sound domain knowledge into a complex high-performance automatic system, and the balancing of risk and reward using powerful deep networks.

2. Related Work

The idea of having an environment model has been used quite effectively in the past, both with analytical gradients [16] in the linear regime and with neural networks [17] in the nonlinear regime. As long as a model can be learned to some degree of accuracy without being too computationally intensive, its use generally improves performance and sample efficiency. Moreover, it gives the ability to generate, or roll out, future trajectories.
This ability is instrumental in trading, where accurate predictions could greatly enhance profits, especially if they are multi-step. A couple of recent works combine DRL and MB methods for trading [14,18]. In [18], they use a sequence-to-sequence RNN-MDN (mixture density network [19]) to model the transition model based on latent states given by a convolutional autoencoder. They show that the performance of an agent trained on the transition model is very similar to that of one trained on the actual data, with the ability to explore the transition model and exploit it in the real environment. In the second work [14], using model-based prediction with DRL for trading, the authors use a nonlinear DyBM (dynamic Boltzmann machine) to make predictions and embed these into the state of the agent; they call it an infused prediction module.
On the other end, hierarchical RL enables the partitioning of the state space in some meaningful way (usually performed automatically) and the solving of different problems individually with so-called skills or options, i.e., sub-policies directly interacting with the environment. A higher-level agent chooses between these sub-policies according to some decision criterion and stops them according to a termination function, simultaneously giving goals and rewards to the low-level agents. Some new approaches to HRL using deep networks can be found in [20,21,22]; however, the concepts are much older; see [23,24]. For a more detailed overview, see [25]. The goal is to learn specific behaviors well while combining the learned solutions adaptively. Hierarchical RL methods hold promise for general artificial intelligence. In trading, they have been used successfully, for example, in portfolio management [26,27], but not in the single-asset case.
The association of hierarchy with model-based RL seems fruitful; however, only a few works combine the two. It was mentioned as promising for the future in [28], but more from a philosophical or cognitive point of view than from a practical application to a real-world problem. More recently, the combination was explored in [29], where the authors use inductive logic programming to devise a transition model. This approach combines symbolic representations by employing an additional symbolic MDP, which augments the state with boolean predicates that can represent goals, relationships between objects, events, or properties of the environment. The rewards for the low-level agent are augmented with a positive number if the agent reaches the subgoal set by the high-level agent (using the symbolic transition model). The high-level agent uses as a reward the accumulated environmental rewards obtained when the low-level agent reaches the subgoal, with small penalizing factors for setting immature or unlearnable tasks as subgoals. The high-level agent alternates between interacting with the environment and with the simulated transition model, which results in increased sample efficiency. The overall agent is tested in two environments that are hard for flat agents (those not using hierarchy), showing good performance. It is unclear whether this holds for both environments, but for one of them (involving a robot reaching several circles in a specific order), the hierarchical agent uses tabular Q-learning for the high level and a PPO (Proximal Policy Optimization [30]) agent for the low level. Even though it is partly a symbolic approach and thus significantly engineered, it shows the potential of combining hierarchy with explicitly learning a transition model. A somewhat related work is [31], where, even though the market is different (energy), the authors use multi-scale deep reinforcement learning with great results. An even more interesting and similar work, also on the energy market, is [32], where multiple DRL agents are used at different levels.
Our architecture is similar to the options framework [33], where an option is defined by:
  • I_o, a set of initial states from which an option o can be started, with I_o ⊆ S;
  • π_o, a policy that is followed for the duration of the option;
  • β, a termination function that governs the termination of the policy when certain conditions are met.
We describe our architecture and how the above points correspond to it.

3. Overview and Methods

The idea is to leverage heterogeneous behaviors in a flexible manner using hierarchy. The assumption is that the market can go through multiple different dynamics, and thus hierarchy seems the most appropriate to tackle this issue in a feasible and efficient way. This assumption has been widely studied in the literature and was verified experimentally by the author as well. Even though hierarchy is well suited for multi-asset problems, we show that it can also be used in a single-asset case successfully.
In the work that follows, we combine the model-based approach with a hierarchical structure, such that we use a high-level agent to select among a set of pretrained low-level agents, then act out several steps in the environment. These steps are governed by a certain termination function β , which in the simplest case is given by:
\beta_t(s) = \begin{cases} 0 & \text{if } t \% k \neq 0 \\ 1 & \text{otherwise} \end{cases}
where % is the modulo operator, s is the current state, t is the current timestep in the environment, and k is a value significantly larger than 1; this is the resolution at which the high-level agent sees the world, i.e., the subsampling factor. We set k = 30 (we test for k = {10, 30, 50}). The options’ policies are the actual low-level policies pretrained beforehand or simple strategies governed by some technical indicators (such as a moving average crossover or a mean reversion strategy). We use five PPO agents and two simple strategies. The initiation sets are the same for all low-level agents and are constituted by the reduced set Ŝ = {x_t | x_t ∈ S, t % k = 0}, which comprises every kth element of S.
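A minimal sketch of this termination rule and of the resulting high-level decision points follows; the constant k reflects the values stated above, while the function names are illustrative.

```python
K = 30  # subsampling factor; tested values were 10, 30, and 50

def beta(t, k=K):
    """Termination function: the running option ends (1) every k environment
    steps and continues (0) otherwise."""
    return 0 if t % k != 0 else 1

def initiation_indices(num_steps, k=K):
    """Initiation set S^: the timesteps at which the high-level agent decides."""
    return [t for t in range(num_steps) if t % k == 0]
```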
The high-level agent is endowed with a classification mechanism C, which clusters incoming series into different classes and feeds different agents with samples from these different clusters. This mechanism is analogous to the subgoal discovery problem [34,35], where the high-level agent finds attractive states for the low-level agents to reach. Because our environment is different, reaching the subgoal does not have any real meaning in our case. We use k-means [36] from the Python library tslearn (https://github.com/tslearn-team/tslearn/, accessed on 19 January 2023) for clustering the time series, with an episode length of 50 and 5 clusters. We tested more clustering algorithms, but they all gave similar results, so we chose the simplest one. We also tested more values for the number of clusters, but we chose a small one that captures the significant trends that one can see in a market.
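A sketch of this clustering step with tslearn's TimeSeriesKMeans is shown below, using the episode length, cluster count, and metric stated here and in Table A3; the episode extraction and the synthetic price series are assumptions for illustration.

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

EPISODE_LEN = 50
N_CLUSTERS = 5

def make_episodes(prices, length=EPISODE_LEN):
    # Slice the 1-dimensional price series into fixed-length episodes.
    n = len(prices) // length
    return [prices[i * length:(i + 1) * length] for i in range(n)]

prices = np.random.randn(4000).cumsum()        # placeholder for the training prices
episodes = to_time_series_dataset(make_episodes(prices))

km = TimeSeriesKMeans(n_clusters=N_CLUSTERS, metric="euclidean",
                      max_iter=50, random_state=0)
labels = km.fit_predict(episodes)              # cluster index per training episode

def classify(episode):
    """C(.): classify a new (or predicted) episode into one of the clusters."""
    return int(km.predict(to_time_series_dataset([episode]))[0])
```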
The prediction function f is applied to the recent past to produce some probable future (same for the low level and function g):
\tilde{x}_{t:t+m} = f(x_{t-n:t}) \quad \text{(high-level)}
x_{t+1} = g(x_{t-n:t}) \quad \text{(low-level)}
To use the classification part to its full power, we classify the past as well as the future. Classification induces a smoothness in the representation (since neighboring points will be very similar and thus the classification will be the same), which is very desirable since we do not want the policy to change too often. The classification is given by:
\tilde{c}_{t:t+m} = C(\tilde{x}_{t:t+m}) \quad \text{(future)}
c_{t-n:t} = C(x_{t-n:t}) \quad \text{(past)}
For a detailed description of the variables, please see Section 3.1 and Section 3.2.1.
We can already see how model-based prediction helps: to assign an agent to act for the next window, we have to classify that window as belonging to one of the clusters, which means that we first have to predict it and then classify this prediction to match it with an agent. We denote by f the prediction function for the high level and by g the prediction function for the low level. For f, we use a deep autoregressive model based on recurrent neural networks [37] from the Python library gluonts (https://ts.gluon.ai/, accessed on 19 January 2023), a probabilistic time-series library focused on deep learning models. For g, we use a dynamic Boltzmann machine (https://github.com/ibm-research-tokyo/dybm, accessed on 19 January 2023), which is generally good for one-step prediction and also has a constant training time.
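A sketch of how f could be set up with GluonTS's DeepAR estimator is given below, using the hyperparameters of Table A1. The import paths and trainer interface differ between GluonTS versions, and the synthetic hourly series is a placeholder, so treat this as an assumption-laden outline rather than the exact training script.

```python
import numpy as np
import pandas as pd
from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

# Hourly close prices (synthetic placeholder for the BTCUSDT training series).
index = pd.date_range("2019-10-23 07:00", periods=4000, freq="1H")
close = pd.Series(np.random.randn(4000).cumsum() + 100, index=index)

train_ds = ListDataset(
    [{"start": close.index[0], "target": close.values}], freq="1H"
)

estimator = DeepAREstimator(
    freq="1H",
    prediction_length=50,        # as in Table A1
    num_cells=64,
    num_layers=2,
    dropout_rate=0.3,
    trainer=Trainer(epochs=10),  # illustrative training budget
)
predictor = estimator.train(training_data=train_ds)

# Probabilistic multi-step forecast; the median path plays the role of x~_{t:t+m}.
forecast = next(iter(predictor.predict(train_ds)))
median_path = forecast.quantile(0.5)
```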
For the RL part, we use RLlib (https://github.com/ray-project/ray/tree/master/rllib, accessed on 19 January 2023), which is a Python RL library that scales very well on multiple CPU cores and machines. It also has the latest state-of-the-art algorithms and RL tricks implemented and ready to use.

3.1. Data

We sample data at a 1 h frequency and use 4000 data points for training the low level, 8500 points for training the high level (including the previous 4000), and 10,000 points for testing (BTCUSDT taken from Binance (www.binance.com, accessed on 19 January 2023), 2019-10-23 07:00:00–2022-01-30 21:00:00). For details on the data splits, see Figure A2. For training the low level, we sample episodes of length 50. For the high level, we use episodes of length 1500 (which amounts to 50 decision points for the high-level agent, since it acts every 30 steps). For each performance measurement, we take 30 random samples.
Here, we deviate from the general literature, where the test sets are much smaller than the training sets, and often each testing set has to be preceded by a training set due to what is considered a requirement of the nature of the market and its different regimes. It is assumed that for an agent to perform well on some data, it needs to be trained on the period right before. Because we deal with the different regimes in another way (by employing heterogeneous agents), we can train systems that perform well on an extensive testing set and do not need to be trained multiple times for each testing period. Our agents are general enough to deal well with any market regime.
For the low level, the data are concatenated as x_{t-n:t} = [p^o_{t-n:t}; p^h_{t-n:t}; p^l_{t-n:t}; p^c_{t-n:t}; v_{t-n:t}], where p_{t-n:t} corresponds to the last n steps of the respective price return (open, high, low, and close). By price return, we mean that p_t = price_t / price_{t-1} − 1. v_t is the volume change associated with timestep t and is also a normalized difference: v_t = volume_t / volume_{t-1} − 1.
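A sketch of this preprocessing for the low-level state follows, assuming the raw candles come as a pandas DataFrame with open/high/low/close/volume columns; the column names and the window size n are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def low_level_state(ohlcv: pd.DataFrame, t: int, n: int = 50) -> np.ndarray:
    """Build x_{t-n:t} = [p^o; p^h; p^l; p^c; v] from relative changes."""
    # pct_change() computes exactly value_t / value_{t-1} - 1 for each column.
    rel = ohlcv[["open", "high", "low", "close", "volume"]].pct_change()
    window = rel.iloc[t - n:t]
    # Concatenate the five return/volume series into one flat state vector.
    return np.concatenate([window[c].to_numpy() for c in window.columns])
```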

3.2. Architecture

After running multiple types of agents on various parts of the dataset and obtaining poor performance, we noticed that performance was correlated with the training data; we thus concluded that partitioning the data in some meaningful way might be fruitful. We describe this next.
The preprocessing for the high-level agent is performed by the clustering mechanism, where we classify each running episode into one of the existing clusters (c_i ∈ {0, 1, 2, 3, 4}) and feed as state the last n such classifications. We consider this representation more appropriate for selecting the low-level agent to act, since the classes reflect, in some sense, the expertise of each agent. Moreover, neighboring episodes will have the same class, thus inducing smoothness in the representation and therefore in the action selection, which is in line with our goals. We show the overall architecture of our learning system in Figure 1. The gray rectangles depict the logical agents, and the arrows represent the data flow, with dashed arrows representing the actions. We see that both levels incorporate a prediction model.
The low-level prediction did not improve the performance of the low-level agents much (or at all; the results were inconclusive); thus, we do not include it in the final results. For the high-level state, we tried multiple approaches:
  • Using raw returns, as in the low-level agent;
  • Using the recent performance of each individual agent in its own simulated environment (similar to [7]) and concatenating these for all agents;
  • Using concatenated samples from the prediction model instead of median or mean;
  • Using the result of the classification of the recent past and the prediction, applying C .
The best version used both classification and prediction (see Section 4). In general, in our experience, larger state spaces induce more variability, which is also what happened when using raw samples from the prediction model. We show the full algorithm in Algorithm 1.
Algorithm 1: Hierarchical model-based deep reinforcement learning for trading in evaluation mode.
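Since the algorithm listing itself appears as an image in the published version, the following Python sketch reconstructs the evaluation loop it describes: every k steps, the high-level agent classifies the recent past and the predicted future, selects a low-level agent (or hold), and the selected agent then trades until the termination function fires again. All interfaces (env, high_agent, low_agents, predict_f, classify_C) and the window sizes are assumptions for illustration.

```python
def run_hierarchical_eval(env, high_agent, low_agents, predict_f, classify_C,
                          n_past=100, k=30, max_steps=10_000):
    """Evaluation-mode loop of the hierarchical model-based agent (sketch)."""
    obs = env.reset()
    history = [obs]
    choice = "hold"
    for t in range(max_steps):
        if t % k == 0:                                      # termination function beta fires
            past = history[-n_past:]                        # x_{t-n:t}
            future = predict_f(past)                        # x~_{t:t+m}  (multi-step model)
            s_high = classify_C(past) + classify_C(future)  # [c_{t-n:t}; c~_{t:t+m}] as class lists
            choice = high_agent.act(s_high)                 # low-level agent index or "hold"
        if choice == "hold":
            action = 0                                      # do nothing
        else:
            action = low_agents[choice].act(obs)            # buy / sell / hold in the market
        obs, reward, done, info = env.step(action)
        history.append(obs)
        if done:
            break
```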

3.2.1. Clustering as Preprocessing for the High-Level Agent

We endow the high-level agent with a preprocessing module that can cluster the incoming time series into a specific cluster (one of five discovered beforehand when clustering the training set, see Figure 2).
We denote by s^h the state of the high-level agent. The state at timestep t is given by s^h = [c_{t-n:t}; c̃_{t:t+m}], where c_{t-n:t} is a vector of the last n classifications; each point c_{t-i} with 0 ≤ i ≤ n denotes the classification of the episode that ends at t − i. c̃_{t:t+m} is the prediction of the following episodes’ classes, obtained by feeding the classifier with the prediction of the next timesteps; thus, each point c̃_{t+j} with 0 ≤ j ≤ m is the classification of the (predicted) episode ending at t + j. Because the high-level agent takes decisions at a coarser data granularity, having such a multiple-step prediction model makes sense. We use a DeepAR model for our multiple-step prediction; it generally preserves the shape of the future trend, even if it lacks fine-grained precision, which is sufficient for our needs. We show example predictions in Figure 3a.

3.2.2. Pretraining the Low-Level Agents

We pretrain the low-level agents on specific data clusters, so each one specializes in one type of trend. We show the clusters and their means in Figure 2. We see that the five clusters capture some of the main trends in the market.
In Figure 4, we show the best performance of individual agents on the test set consisting of 15,000 points (21 months). We select the best one (among 24 trials) for each class for comparison. We perform 30 repetitions for each and depict the mean and standard deviation. We can draw a number of conclusions from these results:
  • Performance and trade dynamics vary widely between agents trained with different types of data;
  • More iterations do not mean better performance; we should be careful not to overfit;
  • More data does not mean better performance;
  • We should be watchful for possible pathologies (e.g., a small number of sells).

3.3. Prediction

In the context of RL, prediction assumes the existence of a model of the environment with which the agent can interact without incurring the losses associated with the real environment. This complete model comprises a state model and a reward model, predicting the following states and rewards. We are only interested in the state model, as the reward model is completely dependent on it, in the sense that if we know the next state (price), we can easily compute the reward, given the action taken.
For the low level, we tested with the true values instead of the predictions to see if the performance improved, and we noticed a significant improvement. We repeated the value of the future step 10 times and then concatenated this to the state vector for the low level as a way of biasing the agent, such that this future value has a more considerable influence on the decision (having a single value did not seem to work well). So, it is clear that accurate predictions can significantly improve performance and training time. We thus add a one-step prediction for each of the OHLCV time series, thus having five different prediction models for each agent. As mentioned, this does not improve performance conclusively.
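A minimal sketch of this biasing trick is shown below: the next-step value (true, in the oracle test, or predicted otherwise) is repeated 10 times before being concatenated to the low-level state. The repeat count follows the text; everything else is illustrative.

```python
import numpy as np

def bias_state_with_next_value(state: np.ndarray, next_value: float,
                               repeats: int = 10) -> np.ndarray:
    """Append the next-step value `repeats` times so it weighs more heavily
    in the agent's decision than a single appended scalar would."""
    return np.concatenate([state, np.full(repeats, next_value)])
```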
For the high-level agent, the representation is in terms of classes (0 to 4), and thus the prediction should be in the same terms. We also note that the history window size has now doubled compared to the low level, since every point is now associated with the episode ending at that point. So, we predict multiple steps into the future and look at more steps in the past. We show an example in Figure 5.

3.4. Rewards

The rewards for the low-level agent are given by the usual profit and loss (PnL). We also standardize the rewards, subtracting the running mean and dividing by the running standard deviation of the current episode at each step. We perform this operation for all types of rewards used and all agents.
For the high-level agent, we feed the cumulative profit and loss for each low-level episode, while for the low-level agent, we feed the realized PnL (profit and loss):
r^{hl}_t = \sum_{k \in t} r^{ll}_k \quad \text{with} \quad r^{ll}_k = \text{price}_k - \text{buy\_price}_0
where the sum runs over the low-level timesteps k within the high-level period t, and buy_price_0 is the first open buy price (see Appendix A.1).
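A sketch of the running per-episode reward standardization described above follows; keeping the statistics per episode and the small epsilon are implementation assumptions for illustration.

```python
import numpy as np

class RunningRewardStandardizer:
    """Standardize rewards with the running mean/std of the current episode."""
    def __init__(self):
        self.rewards = []

    def reset(self):
        # Call at the start of each episode.
        self.rewards = []

    def __call__(self, r: float) -> float:
        self.rewards.append(r)
        mean = np.mean(self.rewards)
        std = np.std(self.rewards)
        return (r - mean) / (std + 1e-8)   # epsilon avoids division by zero early on
```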

3.5. Termination Function and Diversification

By looking at the low-level agents’ pretraining performance, we can draw some conclusions:
  • If a large loss has been incurred, there are chances that this is a period of losses for this agent (not only this agent; we see that some periods of losses coincide for all agents). Thus, we add a hold action for the high-level agent.
  • Recent agent performance could be relevant for the immediate future.
  • Agents are not complementary, but combining more of them can improve performance.
  • Improving diversification in the low-level dynamics might improve overall performance.
We hypothesized that the termination function would be important for trading and that the overall strategy could be improved by learning when to terminate, but this seems not to be the case. We tested with multiple values (10, 30, and 300), and the difference is small, especially among the lower values. We settled on 30, which seemed the most robust value.

4. Experiments and Results

Firstly, we must also mention the usual RL training instability and emphasize that the best hyperparameters were different in each selection (for the pretrained agents and the hierarchical agent), so running multiple trials with different hyperparameter configurations at each experimental stage is critical for obtaining reasonable solutions. We evaluate 30 samples for each curve after an initial run of 24 trials and select the best hyperparameter configuration (see Appendix A.2). A significant improvement was the addition of the ability to hold (or do nothing) for the high-level agent, thus allowing it to stay out of the market for an extended period (the terminating value k). This makes sense because, as we saw from the low-level pretraining, there are extended periods where it is better to do nothing, as it is almost impossible to be profitable.
Performance gains were hard to obtain with this setup due to the fact that agents were not very dissimilar when making profits; thus, we wanted to add more agents but with different trading dynamics. We proceed to describe these steps.

4.1. Hierarchy with Two Simple Strategies

We test two basic strategies: a mean reversion strategy and a moving average crossover, which can be considered a momentum strategy. Figure 6a shows the performance of the two basic strategies. We then check whether using a DRL agent to select between the two strategies is of any help. We also add the hold option, so the agent chooses from three discrete actions. We show the performance of the DRL agent (we use a PPO agent, as before) in Figure 6b. We see that performance is indeed better than that of each of the two individual strategies. We use a minimal [64, 32] network with a relu activation function. We try three decision intervals for the PPO: every step, once every five steps (PPO 5), and once every ten steps (PPO 10).
The mean reversion strategy uses two indicators: a relative strength index (RSI) with an exponential moving window over a period of 6 h, and a moving average over a period of 20 h. The strategy buys if the RSI value is less than a certain threshold (40 in our case) and the price is below the lower Bollinger band, and it sells if the RSI value is higher than a certain threshold (60 in our case) and the current price is above the upper Bollinger band. The upper and lower Bollinger thresholds are at 2 × σ from the moving average, with σ the rolling standard deviation of the time series (over the same period as the mean: 20 h). The strategy holds, or does nothing, if none of the above conditions are met.
The moving average crossover strategy is even more straightforward. We have two different moving averages (in our case, one works on a 100 h window and the other on a 400 h window); when the shorter one crosses above the longer one, it is a buy signal, and when the reverse happens, it is a sell signal.
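A compact pandas sketch of the two strategies on hourly closes is given below; the thresholds and window lengths follow the text, while the RSI and Bollinger implementations are standard textbook versions rather than the author's exact code.

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 6) -> pd.Series:
    # Exponentially weighted RSI over `period` hours.
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(span=period, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(span=period, adjust=False).mean()
    return 100 - 100 / (1 + gain / (loss + 1e-12))

def mean_reversion_signal(close: pd.Series) -> pd.Series:
    """+1 = buy, -1 = sell, 0 = hold (RSI thresholds 40/60, 20 h Bollinger bands)."""
    r = rsi(close, 6)
    ma = close.rolling(20).mean()
    sd = close.rolling(20).std()
    buy = (r < 40) & (close < ma - 2 * sd)
    sell = (r > 60) & (close > ma + 2 * sd)
    return buy.astype(int) - sell.astype(int)

def ma_crossover_signal(close: pd.Series, short: int = 100, long: int = 400) -> pd.Series:
    """+1 while the short MA is above the long MA (buy regime), -1 otherwise;
    actual orders would be triggered when the sign changes (the crossover)."""
    return np.sign(close.rolling(short).mean() - close.rolling(long).mean())
```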
We see that these models can be leveraged to perform better when combined through an additional layer of decision-making. Thus, we include these two simple models in our final set of low-level agents. The results in Figure 7 and Figure 8 were obtained with this setup. We can see in Figure 7 the performance of various models tested throughout, with the best one being the one that uses both classification and prediction. In Figure 8, we see a significant rise in the Sharpe ratio and a small drop in performance for the agent using risk-return rewards, which looks very promising for a first set of experiments.

4.2. Risk-Return Optimization

The Markowitz model implies that one wants to minimize risk and maximize returns, and a hierarchical approach naturally lends itself to such a dual optimization. We thus test this approach as well, with the high-level agent optimizing for risk. The high level is rewarded with the conditional value at risk (CVaR). To define the CVaR, we first need to define the value at risk (VaR).
VaR: Value at risk is a measure describing how much value is at risk during a specific period and with what probability of occurrence. Many have criticized it as being uninformative or as giving a false sense of certainty and security where there should be none. One of the main arguments against using VaR is that it is a measure describing rare events, which are inherently hard to characterize.
Formally, let X be a random variable (rv) representing profits (positive) and losses (negative), with some associated distribution. The VaR at level α ∈ (0, 1) gives the smallest number y such that the probability that Y := −X is not larger than y is at least 1 − α. We denote this as VaR_α(X), and the most general definition is given by:
VaR_\alpha(X) = -\inf\{ x \in \mathbb{R} : F_X(x) > \alpha \} = F_Y^{-1}(1 - \alpha)
The function F_X is the cumulative distribution function (cdf) of the random variable X (for any rv, its cdf is well defined). There are multiple ways of computing the VaR:
  • Either by looking at past returns, assuming that future outcomes will be similar to past ones;
  • By assuming a parametric distribution of the returns, such as a Gaussian distribution;
  • By using Monte Carlo simulations of predicted returns.
We use the first method as it is the simplest and quite efficient. We use α = 0.05 .
CVaR: Conditional value at risk is a measure related to VaR but more comprehensive as it looks at the expected value beyond the VaR threshold:
CVaR_\alpha(X) = \frac{1}{\alpha} \int_0^\alpha VaR_\gamma(X) \, d\gamma
Because losses are negative, their expectation would also be negative, and we want a positive risk measure (since it should represent the magnitude of losses); we thus add a minus sign to ensure that the final value is positive. If we assume that the distribution is continuous, then the above is equivalent to the conditional tail expectation given by:
-E\left[ X \mid X \leq -VaR_\alpha(X) \right]
CVaR is also known as the expected shortfall and can be seen as describing the average loss in the severe cases, i.e., those that occur only an α fraction of the time.
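A sketch of the historical (first-method) estimates used here follows, with the sign convention from the text (returns X are positive for profits, and the risk measures come out positive) and α = 0.05 as stated; the function names are illustrative.

```python
import numpy as np

ALPHA = 0.05

def var_historical(returns: np.ndarray, alpha: float = ALPHA) -> float:
    """Historical VaR: the loss magnitude exceeded with probability at most alpha."""
    return -np.quantile(returns, alpha)

def cvar_historical(returns: np.ndarray, alpha: float = ALPHA) -> float:
    """Historical CVaR (expected shortfall): average loss in the worst alpha tail."""
    threshold = np.quantile(returns, alpha)   # equals -VaR under this convention
    tail = returns[returns <= threshold]
    return -tail.mean()
```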
When comparing the original version, using a cumulative return for the high level, versus the risk-return version, using CVaR as a reward, we obtain a significant difference with a larger Sharpe ratio for the CVaR agent (running a t-test between the two populations gave a p-value of 0.02, so the null hypothesis can be rejected, i.e., the two populations’ means are not equal).
To evaluate the performance of the risk-return model, we compute the monthly Sharpe ratio, which will be independent of the frequency of trading or data used. Otherwise, we risk sampling bias, with the Sharpe ratio dependent on the period length used for calculation. The monthly Sharpe ratio is computed as follows:
S = \frac{R_e}{\sigma_e} \quad \text{with} \quad R_e = \frac{1}{n} \sum_{i=1}^{n} (R_i - R_f) \quad \text{and} \quad \sigma_e = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (R_i - R_f - R_e)^2 }
where R_e is the average monthly excess return, R_i is the return of the portfolio in month i, R_f is the risk-free excess return (which can be taken to vary each month but which we keep constant for simplicity), and n is the number of months. The risk-free return is generally quite low (initially taken as the return of US bonds); we take it as 5%.
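A sketch of this monthly Sharpe ratio computation is given below; the constant risk-free rate follows the text, and the monthly return series is assumed to be precomputed.

```python
import numpy as np

def monthly_sharpe(monthly_returns, risk_free: float = 0.05) -> float:
    """Sharpe ratio over monthly portfolio returns with a constant risk-free rate."""
    excess = np.asarray(monthly_returns) - risk_free
    r_e = excess.mean()
    sigma_e = excess.std(ddof=1)   # 1/(n-1) in the variance, as in the formula above
    return r_e / sigma_e
```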

5. Discussion

Our findings confirm the utility of both hierarchy and prediction in enhancing performance. In the hierarchical reinforcement learning framework, our results indicate that low-level one-step prediction did not yield significant performance improvements. However, an interesting avenue that emerges from this finding is the potential inadequacy of the current prediction model at this level. Therefore, further research is needed to investigate more efficient prediction models that could yield better performance at the low level.
For the high level, using predictions in conjunction with incremental classifications of these predictions significantly improved performance. The introduction of a risk-based reward measure such as CVaR altered the problem dynamic to a risk-return optimization, with the low level being incentivized for greater returns and the high level being rewarded for risk reduction. This balance of risk and reward is a crucial aspect of any market setting, and the inclusion of domain knowledge through clustering further sets our model apart from traditional hierarchical reinforcement learning approaches.
By leveraging hierarchy and prediction for trading a single asset, we were able to significantly improve performance in terms of overall returns and the Sharpe ratio. The large testing set used in this study also minimizes sampling bias, reinforcing the robustness of our findings.
Finally, the possibility of mapping the natural hierarchical structures seen in traditional portfolio management to this hierarchical system could provide a consistent and intuitive framework for managing heterogeneous assets. This integration could be the next frontier in the development of more effective and adaptable trading systems.

6. Conclusions

Both hierarchy and prediction can significantly improve performance. Even though we did not see any performance benefit from adding one-step prediction at the low level, we tested whether good prediction does improve performance, and indeed it does: using true values instead of predictions provides positive results. This means that the prediction model used for the low level is insufficient.
For the high level, we saw that prediction used in conjunction with incrementally classifying predictions does offer significant performance improvements, even though neither of the two methods used separately offered any improvement. Moreover, we saw that changing the reward of the high level to a risk-based measure such as CVaR transforms the whole problem into a risk-return optimization, where the low level is rewarded for increased returns and the high level receives rewards for lowering the risk (or the risk measure).
Some aspects in which our work differs from traditional hierarchical reinforcement learning are:
  • Easy use of prediction at both levels (this can be used by any RL algorithm);
  • Incorporation of domain knowledge (the clustering part);
  • Balancing of risk and reward between the two sets of agents (very desirable in any market setting).

7. Future Directions

The current work can be extended by considering specific goals that the high-level agents might give the low-level ones, such as, for example, a desired (approximate) number of trades or a desired amount of risk to seek for the period (Sharpe ratio or Sortino ratio). The closer to the goal, the bigger the reward for the low-level agent. In this way, the behavior of the low-level agents can be constructed around specific desiderata.
One interesting behavior that could be enacted in this type of architecture is for one agent to control the learning of another (meta-learning), where, for example, the high level controls a parameter α ∈ ℝ, with the reward of the low level being αr_1 + (1 − α)r_2.
Other ways in which the hierarchical decision-making setup could be further leveraged include devising complementary low-level agents, where the trading dynamics are as different as possible while each agent still retains reasonable individual performance. Moreover, training one level of agents with a certain risk measure and the other with the returns has not been explored fully in this article; for example, one could try using the risk measure for the low level and the returns for the high level. Alternatively, one could train both levels at the same time, without pretraining the low level; however, this is usually more computationally intensive and more unstable, adding to the usual instability of RL training.
In this article, we showed how leveraging hierarchy and prediction for trading a single asset can significantly improve performance, both in terms of overall returns and the Sharpe ratio. We used a much larger testing set to avoid the sampling bias usually encountered in such scenarios (cherry picking) and showed that even simple strategies, such as the mean reversion or moving average crossover, can be leveraged with the help of an additional decision-making layer.
Furthermore, natural hierarchical structures seen in the usual portfolio management literature (e.g., asset classes based on industry type or country) could be mapped to such a hierarchical system in a straightforward manner, providing a consistent way of dealing with heterogeneous assets.

Funding

This research was funded by EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems grant number EP/L016796/1.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are freely available as specified in the text.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

Appendix A.1. Implementation Details

We use a simple trading environment with a fixed quantity size for each buy and sell (e.g., 0.01 for BTCUSDT). We start with an initial balance of USDT 1000. We use a FIFO (first in first out) order queue, meaning that when we have multiple buys opened, we close the first one (first in) if a sell operation is requested. We do not allow selling short. We use transaction fees of 1 % ; however, recently, Bitstamp has introduced 0 % trading fees for all coins for up to USD 1000 or equivalent (https://blog.bitstamp.net/post/bitstamps-summer-of-discovery-is-underway-with-0-trading-fee-for-all-coins/ accessed on 19 January 2023).
Because we only trade at the hourly closing price in our simulations, the moment at which an open position is closed could be further optimized, and thus profits could increase significantly. We leave this for further work. The hierarchical approach can easily be extended to support another layer of decision-making on a smaller timescale.
We take only the close price (from the OHLCV data) for the high-level model and predict the following 50 steps with DeepAR. We use the high-level model predictions during high-level training as well, such that the agent can learn when, and to what extent, to rely on the model if its predictions are sometimes off. One could say that this allows the RL agent to learn a policy based on the prediction model and its errors.
We next show the hyperparameters. (Modified hyperparameters are shown; there are many other hyperparameters that are used with their default library values. The reader should check the original source code of the libraries used for all the hyperparameter values.).
Figure A1. Action distribution for the best HRL model.
Figure A2. Data splits.

Appendix A.2. Hyperparameters

Table A1. DeepAR prediction model.
num_cells: 64
num_layers: 2
dropout_rate: 0.3
prediction_length: 50
Table A2. DyBM prediction model.
delay: 2
decay: 0.9
prediction_length: 1
AdaGrad rate: 1.0
AdaGrad delta: 1e-6
Table A3. K-means clustering.
n_clusters: 5
max_iter: 50
metric: euclidean
tolerance: 1e-6
Table A4. Proximal Policy Optimization.
use_critic: True
use GAE: True
GAE lambda: 1.0
kl_coeff: 0.2
rollout_fragment_length: 200
train_batch_size: 200
sgd_minibatch_size: 128
shuffle_sequences: True
num_sgd_iter: 30
lr_schedule: None
vf_loss_coeff: 1.0
entropy_coeff_schedule: grid_search([[1, 0.1], [500,000, 0.0001]])
clip_param: 0.3
kl_target: 0.01
batch_mode: truncate_episodes
vf_clip_param: grid_search([1, 10])
clip_param: grid_search([0.1, 0.2])
grad_clip: grid_search([1, 10])
learning_rate_schedule: grid_search([[[1, 1e-4], [500,000, 1e-5]], [[1, 1e-3], [500,000, 1e-4]], [[1, 1e-3], [500,000, 1e-5]]])
neural_network: [1024, 512]

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  2. Li, Y. Deep reinforcement learning: An overview. arXiv 2017, arXiv:1701.07274. [Google Scholar]
  3. Millea, A. Deep reinforcement learning for trading—A critical survey. Data 2021, 6, 119. [Google Scholar] [CrossRef]
  4. Pricope, T.V. Deep reinforcement learning in quantitative algorithmic trading: A review. arXiv 2021, arXiv:2106.00123. [Google Scholar]
  5. Zhang, Z.; Zohren, S.; Roberts, S. Deep reinforcement learning for trading. J. Financ. Data Sci. 2020, 2, 25–40. [Google Scholar] [CrossRef]
  6. Betancourt, C.; Chen, W.H. Deep reinforcement learning for portfolio management of markets with a dynamic number of assets. Expert Syst. Appl. 2021, 164, 114002. [Google Scholar] [CrossRef]
  7. Millea, A.; Edalat, A. Using Deep Reinforcement Learning with Hierarchical Risk Parity for Portfolio Optimization. Int. J. Financ. Stud. 2022, 11, 10. [Google Scholar] [CrossRef]
  8. Ebens, H.; Kotecha, C.; Ypsilanti, A.; Reiss, A. Introducing the multi-asset strategy index. J. Altern. Investments 2008, 11, 6–25. [Google Scholar] [CrossRef]
  9. Cornalba, F.; Disselkamp, C.; Scassola, D.; Helf, C. Multi-Objective reward generalization: Improving performance of Deep Reinforcement Learning for selected applications in stock and cryptocurrency trading. arXiv 2022, arXiv:2203.04579. [Google Scholar]
  10. Dai, M.; Zhang, Q.; Zhu, Q.J. Trend following trading under a regime switching model. SIAM J. Financ. Math. 2010, 1, 780–810. [Google Scholar] [CrossRef] [Green Version]
  11. Ang, A.; Timmermann, A. Regime changes and financial markets. Annu. Rev. Financ. Econ. 2012, 4, 313–337. [Google Scholar] [CrossRef] [Green Version]
  12. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  13. Guo, X.; Hernández-Lerma, O. Continuous-time Markov decision processes. In Continuous-Time Markov Decision Processes; Springer: Berlin/Heidelberg, Germany, 2009; pp. 9–18. [Google Scholar]
  14. Yu, P.; Lee, J.S.; Kulyatin, I.; Shi, Z.; Dasgupta, S. Model-based deep reinforcement learning for dynamic portfolio optimization. arXiv 2019, arXiv:1901.08740. [Google Scholar]
  15. Moerland, T.M.; Broekens, J.; Jonker, C.M. Model-based reinforcement learning: A survey. arXiv 2020, arXiv:2006.16712. [Google Scholar]
  16. Deisenroth, M.; Rasmussen, C.E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011; pp. 465–472. [Google Scholar]
  17. Ha, D.; Schmidhuber, J. World models. arXiv 2018, arXiv:1803.10122. [Google Scholar]
  18. Wei, H.; Wang, Y.; Mangu, L.; Decker, K. Model-based reinforcement learning for predictions and control for limit order books. arXiv 2019, arXiv:1910.03743. [Google Scholar]
  19. Bishop, C.M. Mixture Density Networks; Neural Computing Research Group Report: NCRG/94/0041; Aston University: Birmingham, UK, 1994. [Google Scholar]
  20. Levy, A.; Platt, R.; Saenko, K. Hierarchical actor-critic. arXiv 2017, arXiv:1712.00948. [Google Scholar]
  21. Nachum, O.; Gu, S.S.; Lee, H.; Levine, S. Data-efficient hierarchical reinforcement learning. Adv. Neural Inf. Process. Syst. 2018, 31, 3303–3313. [Google Scholar]
  22. Vezhnevets, A.S.; Osindero, S.; Schaul, T.; Heess, N.; Jaderberg, M.; Silver, D.; Kavukcuoglu, K. Feudal networks for hierarchical reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3540–3549. [Google Scholar]
  23. Dietterich, T.G. The MAXQ Method for Hierarchical Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; Volume 98, pp. 118–126. [Google Scholar]
  24. Wiering, M.; Schmidhuber, J. HQ-learning. Adapt. Behav. 1997, 6, 219–246. [Google Scholar] [CrossRef]
  25. Pateria, S.; Subagdja, B.; Tan, A.H.; Quek, C. Hierarchical reinforcement learning: A comprehensive survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  26. Suri, K.; Shi, X.Q.; Plataniotis, K.; Lawryshyn, Y. TradeR: Practical Deep Hierarchical Reinforcement Learning for Trade Execution. arXiv 2021, arXiv:2104.00620. [Google Scholar]
  27. Wang, R.; Wei, H.; An, B.; Feng, Z.; Yao, J. Commission fee is not enough: A hierarchical reinforced framework for portfolio management. arXiv 2020, arXiv:2012.12620. [Google Scholar] [CrossRef]
  28. Botvinick, M.; Weinstein, A. Model-based hierarchical reinforcement learning and human action control. Philos. Trans. R. Soc. Biol. Sci. 2014, 369, 20130480. [Google Scholar] [CrossRef]
  29. Xu, D.; Fekri, F. Interpretable Model-based Hierarchical Reinforcement Learning using Inductive Logic Programming. arXiv 2021, arXiv:2106.11417. [Google Scholar]
  30. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  31. Yang, L.; Sun, Q.; Zhang, N.; Li, Y. Indirect multi-energy transactions of energy internet with deep reinforcement learning approach. IEEE Trans. Power Syst. 2022, 37, 4067–4077. [Google Scholar] [CrossRef]
  32. Wang, J.; Mishra, D.K.; Li, L.; Zhang, J. Demand Side Management and Peer-to-Peer Energy Trading for Industrial Users Using Two-Level Multi-Agent Reinforcement Learning. IEEE Trans. Energy Mark. Policy Regul. 2023, 1, 23–36. [Google Scholar] [CrossRef]
  33. Sutton, R.S.; Precup, D.; Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 1999, 112, 181–211. [Google Scholar] [CrossRef] [Green Version]
  34. Kulkarni, T.D.; Saeedi, A.; Gautam, S.; Gershman, S.J. Deep successor reinforcement learning. arXiv 2016, arXiv:1606.02396. [Google Scholar]
  35. Srinivas, A.; Krishnamurthy, R.; Kumar, P.; Ravindran, B. Option discovery in hierarchical reinforcement learning using spatio-temporal clustering. arXiv 2016, arXiv:1605.05359. [Google Scholar]
  36. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. (Originally a 1957 Bell Telephone Laboratories paper.) [Google Scholar]
  37. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
Figure 1. Depiction of the overall architecture.
Figure 2. The five clusters discovered.
Figure 3. Sample predictions. (a) Multi-step prediction of the high-level agent. (b) One-step prediction of the low-level agent.
Figure 4. Performance of the low-level agents, each trained on one cluster. (a) Portfolio value for the agents trained on subsets of the data. (b) Sharpe ratio for the agents trained on subsets of the data. (c) Number of sells for the agents trained on subsets of the data.
Figure 5. State representation for the high-level agent.
Figure 6. Performance of a hierarchical agent (PPO) using two basic strategies as low-level agents. (a) Mean reversion and moving average crossover strategies. (b) PPO selecting between the two strategies.
Figure 7. Final performance on BTCUSDT of major models tested.
Figure 8. Cumulative rewards versus CVaR.