The aim of this work was to create an interpretable AI for personal investment management. We used a policy regularisation method to instill inherent agent behaviours based on a prior action distribution, as in Equation (1), for which we detail the algorithm in Algorithm 1. Our underlying assumption is that our method finds a local optimum in close proximity to the regularisation prior, based on the fact that policy regularisation in general does not a priori prevent the exploration-exploitation process from finding an optimum [33,34].
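The regularised actor objective can be sketched as below. This is an illustrative form only; the exact loss, divergence term, and weighting are defined by Equation (1) in the paper, and the KL penalty and `reg_weight` here are assumptions:

```python
import numpy as np

def kl_divergence(pi, prior, eps=1e-8):
    """KL(pi || prior) between two discrete action distributions."""
    pi = np.clip(pi, eps, 1.0)
    prior = np.clip(prior, eps, 1.0)
    return float(np.sum(pi * np.log(pi / prior)))

def regularised_actor_loss(q_value, pi, prior, reg_weight=0.1):
    """Illustrative DDPG actor loss with a policy-regularisation term:
    maximise the critic's Q-value while penalising divergence from the
    prior action distribution (the exact form is given by Equation (1))."""
    return -q_value + reg_weight * kl_divergence(pi, prior)
```

When the policy matches the prior the penalty vanishes and the loss reduces to the plain DDPG actor objective, which is why the prior does not a priori prevent the agent from optimising returns.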
We selected five asset classes in which a customer could invest a monthly amount over a duration of 30 years: a savings account, property, a portfolio of stocks, luxury expenditure, and additional mortgage payments. We include luxury expenditure in the portfolio under the premise that it may increase customers' satisfaction with their portfolios [18]. We define luxury items as any expenditure that may appeal to a person's personality profile: people scoring high on openness might derive joy from spending money on travelling, while people scoring high on extraversion may prefer to spend money on festivities with other people [18]; other luxury items such as cars or artwork are also possible. While this investment class includes items typically listed on indices such as the Knight Frank luxury investment index [35] (art, fine wines, classic cars, etc.), it also includes luxury expenditures such as travel, fine dining, and consumer electronics. However, it excludes basic household spending such as groceries, insurance, and fuel. Finally, we modelled the growth rates of assets according to historical index data, which we describe below.
Algorithm 1: Policy regularisation algorithm from [15].

    Initialize the actor with random parameters
    Initialize the critic with random parameters
    Initialize the target actor with the actor's parameters
    Initialize the target critic with the critic's parameters
    Set the prior and the number of actions
    Set the regularisation weight hyperparameter
    Set the target update rate hyperparameter
    Initialize the replay buffer
    for e = 1, ..., episodes do
        Initialise a random exploration function
        Reset the environment and get the first state observation
        while not Done do                                  ▹ Gather experience
            Select the action and add exploration randomness
            Retrieve the environmental response: reward and next observation
            Store the transition tuple in the replay buffer
            if end of episode then
                Done ← true
            end if
        end while
        Sample a random batch from the replay buffer       ▹ Learn using experience replay
        Update the critic parameters by minimising the critic loss
        Update the actor parameters by minimising the loss ▹ From Equation (1)
        Update the target parameters with the soft update rate
    end for
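The final step of each iteration, the soft (Polyak) target update, can be sketched as follows, with tau the target update rate hyperparameter set at initialisation:

```python
def soft_update(target_params, source_params, tau):
    """Soft target update from the last step of Algorithm 1:
    theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

A small tau keeps the target networks changing slowly, which stabilises the bootstrapped critic targets during training.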
3.1. Modelling Assumptions
We continuously distribute funds into assets based on the indices of the S&P 500 [36], Norwegian property [37], and the Norwegian interest rate [38]. In addition, we invest in mortgages and luxury items. We show these data for a 30-year period in Figure 1.
We make a number of assumptions which limit the scope of the portfolio and simplify investment choices to make the characterization of agent behaviour and interpretation of investment strategies tractable.
Assumption 1. Asset growth rates can be modelled by their respective asset indices, i.e., a stock portfolio may be modelled by a major stock index such as the S&P 500, and an investment in property by its corresponding index.
Outright investment in indices such as the S&P 500 is very common and returns the growth rates of those indices. This is a conservative assumption, as stock portfolio optimization frequently outperforms indices, so index growth may serve as a baseline performance measure for an investment strategy [29].
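Under Assumption 1, the value of an index-tracked asset built from recurring deposits can be sketched by compounding each monthly contribution at the index's subsequent monthly returns (the returns here are placeholders, not the actual historical series):

```python
def portfolio_value(monthly_contribution, monthly_returns):
    """Value of a stream of equal monthly deposits into an index-tracked
    asset, each deposit growing at the index's subsequent monthly returns
    (a sketch of Assumption 1, not the paper's simulator)."""
    value = 0.0
    for r in monthly_returns:
        value = (value + monthly_contribution) * (1.0 + r)
    return value
```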
To give personalised advice, we depart from the premise that the relationship between spending behaviour and happiness is more than a mere correlation. We build on the notion of a causal relationship between spending patterns and customer satisfaction to chart an investment strategy and provide advice that is aligned with customer personality [18]. We enlisted a panel of experts from a major Norwegian bank to rate our asset classes according to a set of inherent properties: expected long-term risk and returns, liquidity, minimum investment limits, and perceived novelty. We used the Sharpe ratio (the difference between the expected daily return and the risk-free return, divided by the standard deviation of daily returns) to quantify risk, and historical data to gauge expected returns. These coefficients, the elements of a matrix P, are shown in Table 1.
The same panel of experts also assigned a matrix Q describing the likely associations between the prototypical personality traits and the asset classes, shown in Table 2. For instance, the conscientiousness trait might prefer asset classes with low expected risk, while the openness trait might prefer those perceived as novel.
From P and Q, we calculated a set of coefficients that describe the association that each personality trait might have with each of the asset classes. The resulting matrix of coefficients, normalized by column and scaled to a common range, is shown in Table 3.
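One plausible way to combine the two expert matrices is the column-normalised product sketched below. The shapes assumed here (P as asset-by-property coefficients, Q as trait-by-property preferences) and the exact product are assumptions; Tables 1-3 define the actual construction:

```python
import numpy as np

def trait_asset_coefficients(P, Q):
    """Combine asset-by-property coefficients P with trait-by-property
    preferences Q into trait-by-asset association coefficients, and
    normalise each asset column as described in the text.
    (Shapes and product are illustrative assumptions.)"""
    W = Q @ P.T                      # traits x assets
    W = W / np.abs(W).sum(axis=0)    # normalise each asset column
    return W
```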
We define an MDP for a multi-agent RL setting as follows:
States A set of 13 continuous values representing the customer age (between 30 and 60 years, normalized), six values for the asset class holdings and the total portfolio value (scaled), and two market indicators for each of the three indices: their moving average convergence divergence (MACD), the difference between the 12-month and the 26-month exponential moving averages of a trend, which predicts trend reversals; and their relative strength index, $\mathrm{RSI} = 100 - 100/(1 + U_x/D_x)$, where $U_x$ and $D_x$ are the average positive and negative changes to the index values, respectively, over $x$ periods, which corrects for potential false predictions by the MACD. The time horizon is 30 years.
Reward The change in portfolio value between time steps.
Actions The continuous distribution of funds across the five asset classes.
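The two market indicators per index can be computed as sketched below, using the standard MACD and RSI definitions; the EMA smoothing constant 2/(n+1) and the window lengths are the usual conventions, not values stated in the text:

```python
import numpy as np

def ema(values, n):
    """Exponential moving average with the usual smoothing 2/(n+1)."""
    alpha = 2.0 / (n + 1)
    out = values[0]
    for v in values[1:]:
        out = alpha * v + (1 - alpha) * out
    return out

def macd(values, fast=12, slow=26):
    """MACD: fast EMA minus slow EMA of the index series."""
    return ema(values, fast) - ema(values, slow)

def rsi(values, x=14):
    """Relative strength index over the last x periods:
    100 - 100 / (1 + U/D), with U and D the average up/down moves."""
    diffs = np.diff(values[-(x + 1):])
    up = diffs[diffs > 0].sum() / x
    down = -diffs[diffs < 0].sum() / x
    if down == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + up / down)
```

A steadily rising index saturates the RSI at 100 (overbought), while a flat index gives a MACD of zero, i.e., no predicted trend reversal.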
Assumption 2. The initial values for a portfolio consist of a mortgage of NOK 2 million and a property valued at NOK 2 million. All other assets have zero initial value.
It is easy to adjust these initial portfolio assignments for different individuals.
Assumption 3. We make consistent monthly investments of 10,000 Norwegian kroner (NOK).
This can be easily modified for individual customers’ contributions.
There is a priori no lower limit on the investment amounts:
Assumption 4. Property investment does not require bulk payments, i.e., smaller investments can be made through property funds, trusts, or crowdfunding.
While investment in physical real estate normally requires larger deposits, we allow our agents to invest smaller amounts into the property market, i.e., a fraction of the monthly investment contribution specified in Assumption 3. This is not a strong assumption as it is possible to invest smaller amounts in property indices, trusts, funds, etc.
We assign interest rates for savings accounts at 5–10% below, and those of mortgage accounts at 5–10% above, the interest rate index. Individuals younger than 35 years receive the more beneficial interest rate, as is common in Norwegian banks. Luxury items depreciate by 20% per year; the depreciation of luxury items is highly variable and depends on the item, e.g., while artwork may appreciate, cars typically depreciate rapidly:
Assumption 5. Luxury items depreciate at 20% per year.
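These rate adjustments can be sketched as follows. Mapping the under-35 benefit to the favourable end of the 5–10% band, and converting the 20% annual depreciation to an equivalent monthly factor, are illustrative choices within the stated ranges:

```python
def savings_rate(index_rate, age):
    """Savings: 5-10% below the interest index; under-35s are assumed
    here to get the more beneficial end of the band."""
    offset = 0.05 if age < 35 else 0.10
    return index_rate * (1.0 - offset)

def mortgage_rate(index_rate, age):
    """Mortgage: 5-10% above the interest index, same age benefit."""
    markup = 0.05 if age < 35 else 0.10
    return index_rate * (1.0 + markup)

def luxury_value_after_months(value, months):
    """Assumption 5: luxury items depreciate at 20% per year, applied
    here as an equivalent compounded monthly factor."""
    monthly_factor = 0.8 ** (1.0 / 12.0)
    return value * monthly_factor ** months
```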
Dividends are normally included in the calculation of indices, and monthly transactions are relatively infrequent compared to high-frequency trading:
Assumption 6. Any additional income from investments—such as dividend payouts or rental income—as well as costs such as transaction costs and fund management costs are ignored.
3.2. Agents
We train five DDPG agents, one for each of the five personality traits. Using Equation (1), we regularise their objective functions with priors derived from their respective personality traits in Table 3; e.g., the openness prior places the most weight on stocks and avoids mortgage repayments, property investment, and savings, while the conscientiousness prior places the most weight on mortgage repayments and avoids stocks and luxury expenditure. These priors, shown in Table 4, are probability distributions across the investment channels and therefore sum to one.
Our five agents have identical actor and critic network architectures, respectively. This is appropriate because they solve the same problem but aim to find locally optimal policies in specific regions of the state-action space, as given by their respective regularisation priors. The 10 neural networks for the agents' actors and critics each consist of two fully connected feed-forward layers with 2000 nodes per layer. Each actor network has a final softmax activation layer, while the critic networks have no final activation: the softmax ensures the values for the actions sum to one, while the critics' outputs need not be scaled. We tuned the hyperparameters using a one-at-a-time parameter sweep, resulting in learning rates of and for the actors and critics respectively, target network update parameters of , and regularisation coefficients of . Training batch sizes were 256 time steps and we sized the replay buffer to hold 2048 transitions. Each iteration collected 256 time steps and completed two training batches.
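The actor's softmax head can be sketched with a minimal numpy forward pass, showing why the output is a valid distribution of funds over the five investment channels. The initialisation scheme and ReLU hidden activations are assumptions; only the layer sizes (13 inputs, two 2000-node hidden layers, 5 outputs) come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """He-style initialisation for one fully connected layer (assumed)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_forward(state, layers):
    """Two ReLU hidden layers, then a softmax over the 5 actions,
    so the action vector sums to one by construction."""
    h = state
    for W, b in layers[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # ReLU hidden layers
    W, b = layers[-1]
    return softmax(h @ W + b)            # distribution over asset classes

# 13 state inputs -> 2000 -> 2000 -> 5 actions, as described in the text
layers = [init_layer(13, 2000), init_layer(2000, 2000), init_layer(2000, 5)]
action = actor_forward(rng.normal(size=13), layers)
```

The critic networks would use the same hidden stack but a single linear output with no final activation, since Q-values need not be scaled.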