Article

Reinforcement Learning-Based Multimodal Model for the Stock Investment Portfolio Management Task

College of Science, Northeastern University, Shenyang 110819, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3895; https://doi.org/10.3390/electronics13193895
Submission received: 30 August 2024 / Revised: 26 September 2024 / Accepted: 30 September 2024 / Published: 1 October 2024

Abstract

Machine learning has been applied by a growing number of scholars in the field of quantitative investment, but traditional machine learning methods cannot provide high returns and strong stability at the same time. In this paper, a multimodal model based on reinforcement learning (RL) is constructed for the stock investment portfolio management task. Most previous RL-based methods have used value-based RL; however, a growing body of research has shown that policy gradient-based RL methods outperform value-based ones. Commonly used policy gradient-based reinforcement learning methods are DDPG, TD3, SAC, and PPO. We conducted comparative experiments to select the method most suitable for the dataset in this paper; the final choice was DDPG. Furthermore, previous methods rarely refine the raw data before training the agent. The stock market produces a large amount of complex data, and if raw stock market data are fed directly to the agent, the agent cannot learn the information in the data efficiently and quickly. We use state representation learning (SRL) to process the raw stock data and then feed the processed data to the agent. Training the agent on stock data alone is not enough, so we also added comment text data and image data. The comment text data come from investors’ comments on stock bars, and the image data are derived from pictures that represent the overall direction of the market. We conducted experiments on three datasets and compared our proposed model with 11 other methods under three evaluation indicators. Taken together, our proposed model performs best.

1. Introduction

Quantitative investment in stocks has long been a popular area of research. Quantitative investment techniques are designed to predict the direction of stocks through models, thus helping many retail investors make decisions that can lead to gains [1]. Most previous stock forecasting methods treat the stock problem as a time series modeling problem and solve it with statistics-based methods [2,3]. In recent years, as the field of artificial intelligence has become popular, solving time series problems with machine learning-based methods has been shown to produce better results [4,5,6,7,8,9,10,11]. However, machine learning methods have various limitations. Some research [4,5] uses support vector machines (SVMs) to solve time series problems, but an SVM is difficult to apply to large-scale training samples. Research [6] uses support vector regression (SVR) to solve time series problems, but for linear data the performance of SVR is slightly inferior to linear regression. Research [7] uses extreme learning machines (ELMs) to solve time series problems; ELMs avoid local optima, overfitting, and long training times, but have limited applicability in multilayer structures. Research [8,9] uses relevance vector machines (RVMs) to solve time series problems, but when the dataset is large, RVM training time becomes excessive. The heuristic algorithm (HA) is an innovative technique in the field of machine learning [10], but it depends heavily on parameter choices and easily falls into local optima. For the stock market, with its large amount of complex data, traditional machine learning methods cannot consistently and accurately predict market direction. Therefore, these traditional machine learning methods are not well suited to the stock prediction problem.
Deep reinforcement learning (DRL) is another method for building quantitative investment strategies. Its interactive, trial-and-error learning matches how real-world organisms learn: the agent continuously interacts with a virtual financial market environment, obtains feedback by trying various trading actions, and adjusts its strategy accordingly. DRL-based methods have therefore yielded good results in the field of quantitative investment. Research [12] introduced a recursive deep neural network for real-time financial signal representation and trading, with the aim of training computers to beat experienced traders in financial trading. Research [13] proposed a reinforcement learning framework without financial modeling that provides a deep machine learning solution for investment portfolio management tasks. Research [14] applied deep reinforcement learning to stock trading and stock price prediction and demonstrated the reliability and usability of the model through experiments. Research [15] proposed two methods for stock trading decisions: first, a nested reinforcement learning approach based on three deep reinforcement learning models, and second, a weighted random selection with a confidence strategy.
In addition, more and more stock prediction studies incorporate sentiment analysis in their models. Research [16,17,18] argues that online comments related to the stock market affect retail investors’ judgment, which in turn affects stock trading and the direction of the stock market. Research [19,20,21] demonstrated that adding a sentiment analysis module can improve stock prediction accuracy.
In our previous work, we combined stock data and stock bar comment text and then built a stock investment portfolio management model using the DRL method. Although the experimental results show that large gains are obtained, the model does not adequately process information about the market environment, and the model data are not utilized efficiently, which leads to unstable results. Therefore, this paper addresses the shortcomings of the previous work.
Research [22] proposes a multimodal DRL method that combines a CNN and LSTM as feature extraction modules and uses a DQN as a decision model for trading. This paper utilizes stock data information to generate images and then utilizes modules of a CNN and LSTM to process the images. The results show that the model achieves a significant increase in profit. Research [23,24] refers to previous work and improves on it by utilizing multimodal data to more fully extract stock features.
SRL is one of the related areas of research to make the model effective for training data. The basic idea of SRL is to construct auxiliary training tasks to train the feature extractor and place it in front of the actor and critic to pre-process the raw state and action inputs of the environment so that the actor and critic can speed up the training and improve the results with more efficient and easy-to-process intermediate features as inputs.
The innovations of this paper are as follows:
  • A multimodal stock investment portfolio management model is proposed. Most previous stock investment portfolio management methods considered only raw stock data, or added comment text data to the raw stock data. The input data for the model proposed in this paper include stock raw data, comment text data, and image data. We collected nearly 10 million stock bar comments, from 1 January 2022 to 3 April 2024, for 118 stocks. The stock raw data and the sentiment analysis index of the comment text are used as state inputs. We use previous stock data to construct three kinds of image data that reflect the long-term trend in the stock market. The image processing module consists of a CNN and an LSTM, which extract the overall dynamic characteristics of the market and the long-term time series characteristics, respectively.
  • An SRL module is added to the model. Most previous stock investment portfolio management methods have not extracted information from the raw data before inputting it, and even fewer have used SRL to do so. This paper adds an SRL module, which helps the agent obtain more complete information about the stock market. The raw data first pass through the SRL module before being input to the critic and actor layers.
  • Reinforcement learning algorithms based on the policy gradient method are used. Among existing RL-based stock investment portfolio management methods, value-based RL methods are more common. In this paper, we choose policy gradient-based RL methods. The action space of a policy gradient-based RL algorithm is continuous and therefore better suited to the investment portfolio management task than a value-based RL algorithm.
This paper consists of five chapters. The first chapter is the introduction. In this chapter, we introduce some methods to solve the task of stock investment portfolio management. We also introduce the benefits of DRL in solving stock investment portfolio management tasks. Finally, we present the innovations of this paper. The second chapter is preliminaries. We introduce the basic knowledge of DRL and the basic knowledge of quantitative investing in this chapter. The third chapter is the methodology. We describe our methodology in detail. The fourth chapter is about experiments. We first describe the source of the dataset and the preprocessing of the dataset, and then we conduct comparative experiments. We select the most appropriate SRL method and the most appropriate RL method. Our method is then compared with 11 other methods. On all three datasets, our method obtained the best results. The fifth chapter is the conclusion. In this chapter, we discuss the strengths and limitations of our methodology and discuss the focus of future work.

2. Preliminaries

In this chapter, we introduce the basic knowledge of DRL and of quantitative investing. In Section 2.1.1, we introduce some basic mathematical definitions of DRL. In Section 2.1.2, we introduce value-based RL methods and policy gradient-based RL methods. In Section 2.2.1, we introduce the quantitative investment task. In Section 2.2.2, we introduce the evaluation indicators and technical indicators used in this paper.

2.1. DRL

2.1.1. Markov Decision Process (MDP)

RL is a computational method to understand and automate goal-directed learning and decision-making problems. From a mathematical point of view, the RL problem is an idealized Markov decision process (MDP) with a theoretical framework for achieving goals through interactive learning. The agent and MDP together give a sequence of trajectories. MDP can be modeled using Equation (1):
p(s', r \mid s, a) = \Pr\{ s_{t+1} = s', \; r_{t+1} = r \mid s_t = s, \; a_t = a \}. \quad (1)
There is continuous interaction between the agent and the environment: in the current state (s), the agent selects an action (a); the environment responds to this action, presents a new state (s') to the agent, and generates a reward (r). The goal of the agent is to maximize the cumulative discounted reward:
G_t = \sum_{l=1}^{\infty} \gamma^{l-1} r_{t+l}, \quad (2)
where γ is the discount factor.
The value functions are functions of states (or state–action pairs) and are used to estimate how “good” it is (in terms of expected return) for the agent to be in a given state (or state–action pair). The magnitude of the expected return depends on the actions chosen by the agent, so value functions are defined with respect to particular ways of acting, called policies (π). A policy is a mapping from states to the probabilities of choosing each possible action in that state. We denote the value function of state s under policy π as V_π(s):
V_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right]. \quad (3)
We denote the value of taking action a in state s under policy π as Q_π(s, a):
Q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s, A_t = a \right]. \quad (4)
With the introduction of the value function, the learning objective of RL can be formally defined. To do so, we first define the optimal state–value function and the optimal action–value function, shown below:
V_*(s) = \max_\pi V_\pi(s), \quad \text{for all } s \in \mathcal{S}, \quad (5)
Q_*(s, a) = \max_\pi Q_\pi(s, a), \quad \text{for all } s \in \mathcal{S},\; a \in \mathcal{A}. \quad (6)
Given the optimal value function, the optimal policy can be obtained directly using the greedy algorithm. The RL task is to solve for the optimal policy:
\pi_*(s) = \arg\max_{a} Q_*(s, a). \quad (7)

2.1.2. Q-Learning Algorithm and Policy Gradient Methods

The Q-learning algorithm is the theoretical basis for value-based methods; it is also used by the DQN family of models and by actor–critic methods to train action value functions. The Q-learning algorithm learns the optimal action value function directly, and its update rule is shown below:
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]. \quad (8)
By continuously applying Equation (8), the Q-learning algorithm performs both policy evaluation and policy improvement until the Q-function converges, i.e., the optimal action value function is obtained, from which the optimal policy follows.
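To make the update in Equation (8) concrete, the following is a minimal, self-contained sketch of tabular Q-learning; the environment interface (env.reset, env.step) and the discrete state/action sizes are illustrative assumptions and are not part of the method used later in this paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: repeatedly apply the update of Equation (8)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    # Greedy (optimal) policy from the learned Q-function, cf. Equation (7).
    return np.argmax(Q, axis=1), Q
```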
In addition to value-based methods, a growing number of policy gradient-based methods have been proposed.
DDPG [25] is based on DPG [26] and is trained using the actor–critic framework. The actor and critic are fitted with neural networks, and the parameters are updated using gradient descent. In addition, DDPG adds OU (Ornstein–Uhlenbeck) noise to increase exploration and uses soft updates to copy network parameters to the target network.
TD3 [27] is an improvement on the DDPG algorithm. TD3 introduces two critic networks to compute the TD target, which alleviates the overestimation problem of function approximation in the actor–critic framework. At the same time, TD3 adds smoothing noise to regularize the estimate of the action value and reduces the update frequency of the actor to increase training stability.
SAC [28] proposes a generalized policy iteration algorithm under the soft actor–critic framework, which obtains the optimal policy under the maximum entropy reinforcement learning framework. SAC optimizes the reward and the policy entropy at the same time, with the two objectives weighted by an entropy coefficient. SAC also proposes an algorithm that dynamically and adaptively adjusts the entropy coefficient, which makes SAC insensitive to this hyperparameter.
PPO [29] stabilizes training by constraining policy updates to a trust region, limiting how far the policy parameters can move in each update. PPO turns the constrained problem into an unconstrained optimization problem and updates the parameters using gradient descent.
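The paper does not state which implementations of these four algorithms were used. Purely as an illustrative sketch, the candidates could be instantiated and trained on a Gym-style portfolio environment with the Stable-Baselines3 library; both the library choice and the make_env helper are assumptions, not the authors' tooling.

```python
# Illustrative sketch only: assumes a Gym-style portfolio environment factory
# `make_env()` and the Stable-Baselines3 library (not stated in the paper).
from stable_baselines3 import DDPG, TD3, SAC, PPO

def train_candidates(make_env, total_timesteps=50_000):
    """Train each policy gradient-based candidate with shared key settings."""
    results = {}
    for name, algo in [("DDPG", DDPG), ("TD3", TD3), ("SAC", SAC), ("PPO", PPO)]:
        env = make_env()
        model = algo("MlpPolicy", env, gamma=0.90, learning_rate=1e-5, verbose=0)
        model.learn(total_timesteps=total_timesteps)
        results[name] = model
    return results
```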

2.2. Quantitative Investment

2.2.1. The Task of Quantitative Investment

Quantitative investment is divided into two categories: the trading task and the investment portfolio management task. The trading task is to make rational trading decisions based on the market environment and to translate those decisions into the number of shares to buy or sell for each stock. The portfolio management task requires readjusting, at each moment t, a weight vector that allocates the current assets among the stocks; the weight vector is a unit vector whose components represent the proportion of assets invested in each stock at moment t. The goal of the investment portfolio management task is to maximize the total assets asset_t at moment t. Since the portfolio management task more closely matches the investment behavior of most real-world investors, this paper focuses only on the investment portfolio management task.

2.2.2. Indicators

In this paper, we use three evaluation indicators to evaluate the goodness of the model results, and we also construct six technical indicators for modeling. These three evaluation indicators and six technical indicators are shown in Table 1.
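The three evaluation indicators in Table 1 can be computed from a curve of total assets over time. The sketch below is one common way to do so; whether the cumulative return is reported as the asset ratio itself or as the ratio minus one is an assumption, since the paper does not specify, and no annualization or risk-free rate is applied, matching the Sharpe ratio definition in Table 1.

```python
import numpy as np

def evaluate(assets):
    """assets: 1-D array of total assets over time (e.g., daily)."""
    assets = np.asarray(assets, dtype=float)
    returns = assets[1:] / assets[:-1] - 1.0
    # Cumulative return, here taken as final/initial - 1 (an assumption).
    cumulative_return = assets[-1] / assets[0] - 1.0
    # Sharpe ratio as defined in Table 1: mean return / std of returns.
    sharpe = returns.mean() / returns.std()
    # Max drawdown: largest relative drop from a running peak to a later trough.
    running_peak = np.maximum.accumulate(assets)
    max_drawdown = np.max(1.0 - assets / running_peak)
    return cumulative_return, sharpe, max_drawdown
```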

3. Methodology

In this section, we describe our methodology in detail. We regard the investment portfolio management task as an RL task. In Section 3.1, we describe the modeling of the RL task for this paper: the state, action, reward, and course of dealing. In Section 3.2.1 and Section 3.2.2, we introduce the role of SRL and the generation of image data, and in Section 3.2.3 we describe the architecture of our method. We use SRL to process the raw data and then input the processed data to the agent. Finally, the agent is trained by the RL method. The workflow of the overall method is shown in Figure 1.

3.1. Modeling the Environment for RL Task

  • State
N is the total number of stocks. The state is a two-dimensional matrix of shape (2N + 4) × N, which is divided into three parts. The first part is the covariance matrix between the close price vectors of the N stocks; the second part is the average sentiment analysis index for each stock at moment t (h_c^t, where c denotes the stock and t the moment); and the third part comprises four technical indicators (MACD, RSI, CCI, ADX). The shapes of these three parts are N × N, N × N, and 4 × N, respectively. The state at moment t is shown in Equation (9), where cov_{i,j} (1 ≤ i, j ≤ N) denotes the covariance of the vectors consisting of the closing prices of stock i and stock j over the past 28 days. The covariance matrix is well suited to investment portfolio management tasks, since investors usually use the standard deviation between stock prices to measure the risk of a particular asset allocation; this is why we include it as part of the state.
S_t = \begin{bmatrix}
cov_{1,1} & cov_{1,2} & \cdots & cov_{1,N} \\
cov_{2,1} & cov_{2,2} & \cdots & cov_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
cov_{N,1} & cov_{N,2} & \cdots & cov_{N,N} \\
h_1^t & h_2^t & \cdots & h_N^t \\
\vdots & \vdots & \ddots & \vdots \\
h_1^t & h_2^t & \cdots & h_N^t \\
MACD_1 & MACD_2 & \cdots & MACD_N \\
RSI_1 & RSI_2 & \cdots & RSI_N \\
CCI_1 & CCI_2 & \cdots & CCI_N \\
ADX_1 & ADX_2 & \cdots & ADX_N
\end{bmatrix}. \quad (9)
  • Action
We set up the action as a one-dimensional vector of length N. Each component of the vector represents the proportion of total assets allocated to the corresponding stock. We normalize the actions using the softmax function, so the components of each action sum to one. The agent maximizes total assets by adjusting the asset allocation weights. The action at moment t is given by Equation (10), where α_{t,i} ∈ [0, 1] (1 ≤ i ≤ N) denotes the weight of the assets assigned to stock i at moment t.
a_t = \left( \alpha_{t,1}, \alpha_{t,2}, \alpha_{t,3}, \ldots, \alpha_{t,N} \right). \quad (10)
  • Reward
We take the difference between the total assets at moment t and the total assets at moment t − 1 as the reward. Because some stocks are highly priced, this difference can be very large, so the reward needs to be scaled. We set the starting assets to RMB one million, so we multiply the reward by 10^{-6} to keep it on a small scale around 0. The reward function is shown below.
r_t = \left( asset_t - asset_{t-1} \right) \times 10^{-6}. \quad (11)
  • Course of dealing
The number of stocks is N and the initial assets are RMB 1 million. The agent outputs the action a_{t+1} at moment t + 1. The environment then updates the asset allocation by performing sell and buy operations on each stock in turn, based on the difference between the asset allocation at time t and at time t + 1. Note that the funds used for buying come from the cash received from selling, and this cash must be fully used. When the environment moves from moment t to moment t + 1, the total assets under the current allocation, valued at the new closing prices, are:
asset_{t+1} = asset_t \cdot P_{t+1}^{t} \cdot a_{t+1}^{T}, \quad (12)
where a_{t+1}^{T} is the transpose of the action a_{t+1}, and P_{t+1}^{t} is an N-dimensional row vector:
P_{t+1}^{t} = \left( \frac{close_{t+1,1}}{close_{t,1}}, \; \frac{close_{t+1,2}}{close_{t,2}}, \; \ldots, \; \frac{close_{t+1,N}}{close_{t,N}} \right), \quad (13)
where close_{t,i} (1 ≤ i ≤ N) is the closing price of stock i at time t and close_{t+1,i} is the closing price of stock i at time t + 1.
The goal of the investment portfolio management task is to maximize the total assets at the final time I:
asset_I = asset_0 \cdot \left( P_1^0 \cdot a_1^T \right) \cdot \left( P_2^1 \cdot a_2^T \right) \cdots \left( P_I^{I-1} \cdot a_I^T \right). \quad (14)
Since in reality some transaction costs are incurred when trading stocks, this paper sets a transaction fee of one-thousandth (0.1%).
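A minimal sketch of the environment logic described in this section — the covariance/sentiment/indicator state of Equation (9), the softmax-normalized action of Equation (10), the price-relative asset update of Equations (12) and (13), the scaled reward of Equation (11), and a proportional transaction cost — is shown below. The exact layout of the sentiment block, the fee model based on turnover, and the function signatures are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

TRANSACTION_FEE = 0.001  # assumed 0.1% of traded value

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def build_state(close_window, sentiment, macd, rsi, cci, adx):
    """State of shape (2N+4, N): covariance of past closes, sentiment rows,
    and four technical-indicator rows, cf. Equation (9)."""
    cov = np.cov(close_window.T)                          # close_window: (28, N) -> (N, N)
    sent_block = np.tile(sentiment, (len(sentiment), 1))  # (N, N); layout assumed
    tech = np.stack([macd, rsi, cci, adx])                # (4, N)
    return np.vstack([cov, sent_block, tech])

def step(asset_t, close_t, close_t1, raw_action, prev_weights):
    """One portfolio-rebalancing step from moment t to t+1."""
    weights = softmax(raw_action)                     # action a_{t+1}, Equation (10)
    turnover = np.abs(weights - prev_weights).sum()   # fraction of assets traded
    fee = TRANSACTION_FEE * turnover * asset_t        # assumed fee model
    price_relatives = close_t1 / close_t              # P_{t+1}^{t}, Equation (13)
    asset_t1 = (asset_t - fee) * (price_relatives @ weights)   # Equation (12)
    reward = (asset_t1 - asset_t) * 1e-6              # Equation (11)
    return asset_t1, reward, weights
```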

3.2. Model

3.2.1. SRL

The result of a machine learning task depends not only on the appropriateness of the algorithm but also on the quality and effective representation of the data. The purpose of SRL is to simplify the complex raw data, eliminate invalid or redundant information from the raw data, and refine the valid information to form features. Processing raw data using SRL can help the agent better understand the environment. Research [30,31,32] proposed three models, DenseNet, OFENet, and D2RL, respectively, which differ mainly in the way the inputs and outputs are connected at each layer. As shown in Figure 2, DenseNet combines the output of each layer with the inputs of all previous layers, OFENet combines the output of each layer of the network with the inputs of that layer, and D2RL combines the output of each layer with the original inputs. All three networks have their own domains of applicability, so this paper conducts comparative experiments between the three networks in the experimental section as a way of choosing the most suitable network for the dataset of this paper.
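To make the three connection patterns in Figure 2 concrete, the following PyTorch sketch implements the OFENet-style variant, in which the output of each layer is concatenated with that layer's input; the layer sizes are illustrative assumptions, and the DenseNet- and D2RL-style variants differ only in what is concatenated, as noted in the comments.

```python
import torch
import torch.nn as nn

class OFEBlock(nn.Module):
    """OFENet-style SRL block: each layer's output is concatenated with that
    layer's input before being passed on.
    (A DenseNet-style block would concatenate the outputs of all previous layers;
    a D2RL-style block would concatenate each layer's output with the original input.)"""

    def __init__(self, input_dim, hidden_dim=1024, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = input_dim
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU()))
            dim = dim + hidden_dim   # the next layer sees [input, output]
        self.output_dim = dim

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=-1)
        return x
```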

3.2.2. Image Data

It is not enough to generate state inputs to the agent using only numerical and textual data, so we added image data. This paper uses three types of images as image data, SSO indicator charts, DMI indicator charts, and candlestick charts. The SSO indicator chart and the DMI indicator chart represent SSO indicators and DMI indicators, respectively. The upper and lower parts of a candlestick chart are the candlestick timing chart and the volume chart, respectively, which are used to represent price fluctuations over a period of time. Candlesticks consist of the opening, closing, high, and low prices over a certain period of time.
The three types of images are plotted from the stock data of the Shanghai Stock Exchange Composite Index (SSEC); this is because one stock in isolation does not reflect the overall condition of the market. The three images in Figure 3 are plotted from the SSEC’s 4 January 2022 stock raw data. The two broken lines in the SSO chart represent the %K fast and %D slow lines, and the two horizontal lines represent the thresholds that signal a shift in market conditions. The three lines in the DMI chart represent the +DI, −DI, and ADX indicators, respectively. Candlestick charts in blue indicate that the closing price was higher than the opening price, while red indicates that the closing price was lower than the opening price. We also plot simple moving averages of the stock price for the last 5, 20, 60, and 120 moments in the candlestick chart.
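The candlestick chart in Figure 3, with its volume panel and 5/20/60/120-period moving averages, can be generated from the SSEC OHLCV data. The sketch below uses the mplfinance library as one possible tool; the paper does not state which plotting library was used, and the DataFrame layout (date index with Open/High/Low/Close/Volume columns) is an assumption.

```python
# Illustrative sketch: assumes `ohlcv` is a pandas DataFrame of SSEC data indexed
# by date with columns 'Open', 'High', 'Low', 'Close', 'Volume'.
import mplfinance as mpf

def plot_candlestick(ohlcv, path="ssec_candlestick.png"):
    mpf.plot(
        ohlcv,
        type="candle",
        mav=(5, 20, 60, 120),   # simple moving averages shown in Figure 3
        volume=True,            # volume panel below the candlesticks
        savefig=path,
    )
```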

3.2.3. Model Architecture

The model architecture of the model proposed in this paper is shown in Figure 4. The state information is first processed by the SRL module and then input as a vector to the multilayer perceptron (MLP) module. The image data generated by the SSEC is processed by the CNN–LSTM module. The agent’s training algorithm is a reinforcement learning algorithm. Since the action space is continuous, the value-based reinforcement learning algorithm is no longer applicable, so we choose to use the policy gradient-based reinforcement learning algorithm.
First, the image is input to the CNN module. The size of the image input is (4, 400, 240); the CNN module contains five convolutional and pooling layers; and the output feature map is of size (11, 4, 64). Next, the feature map is split in column order into a sequence of shape (4, 704), which is input to the LSTM module to obtain the final image feature data.
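A possible PyTorch realization of this CNN–LSTM image module is sketched below. The kernel sizes, channel counts, and the adaptive pooling step are assumptions chosen only so that a (4, 400, 240) input yields a feature map that can be read column-wise as a length-4 sequence of 704-dimensional vectors, as described above.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Sketch of the CNN-LSTM image module: a (4, 400, 240) image is encoded to a
    (64, 11, 4) feature map, read column-wise as a sequence of four 704-d vectors,
    and summarized by an LSTM."""

    def __init__(self, lstm_hidden_dim=1024):
        super().__init__()
        layers, in_ch = [], 4
        for out_ch in (16, 32, 64, 64, 64):          # five conv + pool stages
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers, nn.AdaptiveAvgPool2d((11, 4)))
        self.lstm = nn.LSTM(input_size=11 * 64, hidden_size=lstm_hidden_dim,
                            batch_first=True)

    def forward(self, img):                           # img: (B, 4, 400, 240)
        fmap = self.cnn(img)                          # (B, 64, 11, 4)
        seq = fmap.permute(0, 3, 1, 2).flatten(2)     # (B, 4, 704): column-wise split
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]                                # final image feature vector
```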
The image feature data and the environment information processed by the MLP module are input together to the agent for training. The agent outputs actions that assign asset weights to each stock; executing these actions aims to maximize total assets.

4. Experiments

In this section, we introduce the dataset and perform comparative experiments. In Section 4.1, we describe the dataset sources. In Section 4.2, we describe the data preprocessing. In Section 4.3, we perform comparative experiments. In Section 4.3, we first select the most appropriate RL method and then the most appropriate SRL method. Finally, we compare our method with 11 methods.

4.1. Dataset

The Chinese stock market is one of the most active and largest financial markets in the world today, with strong representativeness and research value. We chose the Chinese stock market as our dataset for the following three reasons:
  • For many years, most of the research on the stock market has used the US stock market as the dataset, and there is less research on stock prediction in the Chinese stock market.
  • The recovery of the Chinese economy is accelerating, and the total market capitalization of the Chinese stock market exceeded RMB 10 trillion in 2020. The Chinese stock market is large and representative enough to be a subject of research.
  • The US stock market is dominated by institutional investors, while the Chinese stock market is dominated by retail investors. From the inception of the Chinese stock market to 2023, the number of retail investors has exceeded 220 million. This makes the Chinese stock market highly predictable.
In addition, the dataset has limitations. The Chinese stock market has a relatively short history. Some companies have been listed for too short a time, resulting in missing stock data, and some companies have few investors, resulting in few stock bar comments. Such companies cannot provide valid data for the experiment and therefore need to be excluded; we exclude them in the data preprocessing step. Based on the list of the top 150 listed companies in China published by the Walton Institute of Economic Research, this paper selects the 118 companies that appear on the list for two consecutive years, 2022–2023, as the source of the stock trading dataset. The list of these 118 companies is given in Appendix A (Table A1).
East Money (https://www.eastmoney.com/) is one of the most visited and influential financial and securities portals in China. Since its launch in March 2004, East Money has always insisted on the authority and professionalism of its content, covering various financial fields in a multifaceted way, and updating tens of thousands of data points and information on a daily basis. Therefore, this paper chooses the stock bar under this website as the source of the dataset for sentiment analysis.
The SSEC is one of the most authoritative stock indexes in China. This is because companies listed on the Shanghai Stock Exchange, which are usually industry mainstays or even industry leaders, can have their stock prices change in a way that reflects broad changes in China’s stock market and can also influence the stock market. Therefore, the image data added to the model by this paper was plotted from the SSEC stock data.

4.2. Data Preprocessing

We use crawling techniques in Python to collect posts from the comments section of each stock bar. We collected all posts in the stock bar of each of the 118 stocks for every day from 1 January 2022 to 2 April 2024. These posts contain nearly 10 million comments, which we processed into text form and then fed to SnowNLP to obtain a sentiment analysis index. SnowNLP is a Python library, inspired by TextBlob, written to make Chinese text easier to process; it uses Bayesian machine learning methods trained on text and has a high accuracy rate on Chinese. When a piece of text is fed into SnowNLP, the output is a number in the interval [0, 1]; the closer the number is to 1, the more positive the sentiment expressed in the text. In this paper, we use the sentiment analysis module of SnowNLP to process the posts of the stock bars in our dataset. We define the average of the sentiment analysis indexes for each day for each of the 118 stocks as h_c^t, where c denotes the company and t denotes the date. Suppose a company is called A and today's date is T. We collect all the comments under A's stock bar on day T; suppose there are 100 of them. We feed these 100 comments into SnowNLP and obtain 100 sentiment indexes, which we then average; this average is h_A^T. The specific process is shown in Figure 5: we collect a company's comments from its stock bar, feed them into SnowNLP, and the average of the sentiment indexes of all comments from company c on day t is h_c^t.
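The computation of h_c^t described above — feeding one day's comments for one stock into SnowNLP and averaging the scores — can be sketched as follows; the data structure holding the crawled comments and the neutral fallback value for days without comments are assumptions.

```python
from snownlp import SnowNLP

def daily_sentiment_index(comments):
    """comments: list of comment strings for one stock on one day.
    Returns h_c^t, the average SnowNLP sentiment score in [0, 1]."""
    if not comments:
        return 0.5          # assumed neutral value when a day has no comments
    scores = [SnowNLP(text).sentiments for text in comments]
    return sum(scores) / len(scores)
```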
Among these 118 stocks, some have missing data and some have too few comments, so we excluded them. We eliminated 28 stocks with a low number of comments and then processed, for each remaining stock, the daily average of all sentiment analysis indexes (the As column) together with the raw stock data into a table. Part of one such table is shown in Table 2 (using stock 000001 as an example). A DataFrame represents a table and is a spreadsheet-like data structure: an ordered collection of columns, each of which can hold a different value type, with both row and column indexes; it can be viewed as a dictionary of Series. The input to the method proposed in this paper is such a DataFrame, and Table 2 shows what the inputs look like.
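As an illustration of this input format, a few rows of the per-stock DataFrame could be assembled as follows; the values are taken from Table 2, and the construction itself is only a sketch of the data layout.

```python
import pandas as pd

# A few rows of the per-stock input table (stock 000001, cf. Table 2).
df_000001 = pd.DataFrame(
    {
        "Open":   [16.48, 16.58, 17.11],
        "High":   [16.66, 17.22, 17.27],
        "Low":    [16.18, 16.55, 17.00],
        "Close":  [16.66, 17.15, 17.12],
        "Volume": [116_925_933, 196_199_817, 110_788_519],
        "As":     [0.3639, 0.3758, 0.3881],
    },
    index=pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-06"]),
)
```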

4.3. Comparison Experiment

We divide the 90 stocks into three datasets, called Group A, Group B, and Group C, with 30 stocks in each. As shown in Figure 6, we use the data from 2022 and 2023 as the training set, the data from January 2024 as the validation set, and the data from February to March 2024 as the test set.
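The date-based split in Figure 6 can be expressed directly on a date-indexed DataFrame of the kind shown in Table 2; the function below is a sketch under that assumption.

```python
def split_by_date(df):
    """Train on 2022-2023, validate on January 2024, test on February-March 2024."""
    train = df.loc["2022-01-01":"2023-12-31"]
    valid = df.loc["2024-01-01":"2024-01-31"]
    test  = df.loc["2024-02-01":"2024-03-31"]
    return train, valid, test
```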
For all the comparison experiments that follow, all methods except MDOSA (in subsequent experiments, we call the model proposed in this paper MDOSA) used the model structures and hyperparameters from their original papers, but were kept consistent with MDOSA in key parameters such as batch_size and n_updates. The hyperparameter settings for the MDOSA model are shown in Table 3.
Please note that we used Python version 3.7.13 and PyTorch version 1.13.1. If the versions of Python and PyTorch are different from those in this paper, there may be a large discrepancy with the training results in this paper.

4.3.1. Comparative Experiments with RL Algorithms

In Section 2.1.2 of this paper, we presented four policy gradient-based RL algorithms. Each of these four algorithms has its own advantages and has different areas of application. In order to find the algorithms that are most applicable to the dataset of this paper, we tested the four algorithms on three datasets. Please note that in this comparison experiment, we did not include image data in these four RL algorithms, only textual sentiment analysis data in the states.
We introduced the evaluation indicators in Section 2.2.2. Because of the randomness of RL training, we trained each algorithm six times and report, in the tables below, the mean and standard deviation of the three indicators over the six runs. For all three indicators, a smaller standard deviation is better; for the means, larger is better, except for the max drawdown, where smaller is better. Of the three indicators, the Sharpe ratio measures the excess return per unit of risk taken and represents the cost effectiveness of the investment, so it is the indicator we prioritize. The test results in Groups A, B, and C are shown in Table 4. The first place for each indicator is marked in red.
Since the full names of the indicators are long, in the first row of the tables we abbreviate the average of cumulative returns, the standard deviation of cumulative returns, the average of the Sharpe ratio, the standard deviation of the Sharpe ratio, the average of the max drawdown, and the standard deviation of the max drawdown as ACR, SDCR, ASR, SDSR, AMD, and SDMD, respectively. We continue to use the full names in the subsequent analysis.
We analyze the results as follows:
(1)
Cumulative returns: TD3 has the largest average of cumulative return in both Group A and Group C, and DDPG has the largest average of cumulative return in Group B. DDPG has the smallest standard deviation of cumulative returns in both Group A and Group C, and TD3 has the smallest standard deviation of cumulative returns in Group B. This shows that both DDPG and TD3 are able to have strong stability while obtaining high returns.
(2)
The Sharpe ratio: TD3 has the largest average of the Sharpe ratio in Group A, while DDPG has the largest average in both Groups B and C. DDPG also has the smallest standard deviation of the Sharpe ratio in all three datasets, whereas the standard deviation for TD3 is particularly large in Groups A and C. Given the complexity and erratic direction of stock market data, TD3 is therefore not a good choice for investors who want returns at low risk. As a result, DDPG performs best on the Sharpe ratio, the indicator we prioritize.
(3)
The max drawdown: The performance of the four algorithms in terms of the max drawdown is not very different. DDPG demonstrates a moderate level of competence in this area.
Thus, on a comprehensive basis, DDPG is the most appropriate algorithm.

4.3.2. Comparative Experiments on SRL Models

We have selected DDPG as the algorithm for agent training, and the SRL model is selected next. Please note that in this comparison experiment, we added not only textual sentiment analysis data but also image data.
In Section 3.2.1 of this paper, we introduced three SRL models. Each of these three models has its own merits, and in order to find out which model is most suitable for the dataset in this paper, we tested the three models on three datasets. We uniformly use DDPG as the training algorithm, while each SRL model uses a two-layer structure. The evaluation indicators are handled in the same way as in the RL algorithm comparison experiments. The test results in Groups A, B, and C are shown in Table 5. The first place for each indicator is marked in red.
We analyze the results as follows:
(1)
Cumulative returns: OFENet has the largest average of cumulative returns in all three datasets and has the smallest standard deviation of cumulative returns in both Group B and Group C. This shows that OFENet has strong stability while earning high returns.
(2)
The Sharpe ratio: OFENet has the largest average of the Sharpe ratio in all three datasets and has the smallest standard deviation of the Sharpe ratio in both Group B and Group C. This shows that OFENet is able to consistently achieve a high cost effectiveness of investment.
(3)
The max drawdown: OFENet also achieves a better score in this aspect of the max drawdown.
OFENet is clearly the most appropriate SRL model. Since we use the OFENet model for the SRL module and the DDPG algorithm for the RL algorithm, we call our model the multimodal DDPG model combining OFENet and sentiment analysis, or MDOSA for short.

4.3.3. Comparative Experiments of All Methods

In past work, many solutions to the investment portfolio management task have been designed using statistics-based methods. We therefore selected five statistics-based methods to compare with the MDOSA model [13,33]. After reviewing a large number of references, we found five such methods that are widely preferred: the best constant rebalanced portfolio, best stock, uniform buy and hold, the uniform constant rebalanced portfolio, and the universal portfolio, all of which are applicable to the stock investment portfolio management task. Research [13] uses the best constant rebalanced portfolio and the universal portfolio and shows experimentally that they perform well; research [33] uses best stock, uniform buy and hold, and the uniform constant rebalanced portfolio with good results. These five methods frequently appear as benchmarks in research on the stock investment portfolio management task, so we adopt them as the benchmark methods in this paper. A brief description of each is given below, followed by a small illustrative sketch of two of them:
(1)
Best constant rebalanced portfolio: using the given historical returns, find the optimal asset allocation weights.
(2)
Best stock: choose the best performing stock based on historical data and simply invest in that stock.
(3)
Uniform buy and hold: funds are evenly distributed at the initial moment and subsequently held at all times.
(4)
Uniform constant rebalanced portfolio: adjust the allocation of funds at every moment to always maintain an even distribution.
(5)
Universal portfolio: the returns of many kinds of investment portfolios are calculated based on statistical simulations, and the weighting of these portfolios is calculated based on the returns.
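As a minimal illustration of the two simplest baselines above, the sketch below computes the final assets of uniform buy and hold (UBH) and the uniform constant rebalanced portfolio (UCRP) from a matrix of price relatives; it is a sketch under these assumptions, not the exact benchmark implementation used in the experiments.

```python
import numpy as np

def ubh_and_ucrp(price_relatives, initial_asset=1_000_000.0):
    """price_relatives: (T, N) matrix, entry [t, i] = close_{t+1,i} / close_{t,i}."""
    T, N = price_relatives.shape
    w = np.full(N, 1.0 / N)

    # Uniform buy and hold: allocate evenly once, then never rebalance.
    ubh = initial_asset * (w * np.prod(price_relatives, axis=0)).sum()

    # Uniform constant rebalanced portfolio: rebalance to equal weights each step.
    ucrp = initial_asset
    for t in range(T):
        ucrp *= price_relatives[t] @ w
    return ubh, ucrp
```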
For convenience, we abbreviate these five methods as BCRP, BS, UBH, UCRP, and UP, respectively. In addition, we include all four RL models, DenseNet (based on DDPG), and D2RL (based on DDPG) in the comparison experiments. The test results for the three datasets are shown in Table 6, Table 7, and Table 8, respectively. Note that there is no randomness in the five statistics-based methods, so the standard deviation of all three indicators is zero for them. As before, the first place for each indicator is marked in red.
We analyze the results as follows:
(1)
Cumulative returns: MDOSA has the largest average of cumulative returns on all three datasets and has the smallest standard deviation of cumulative returns in Group B. DDPG has the smallest standard deviation of cumulative returns in both Group A and Group C. This shows that MDOSA has the highest yield and that both it and DDPG are strongly stable.
(2)
The Sharpe ratio: MDOSA has the largest average of the Sharpe ratio in all three datasets and has the smallest standard deviation of the Sharpe ratio in both Group B and Group C. This shows that MDOSA has the highest cost effectiveness of investment and has strong stability.
(3)
The max drawdown: MDOSA performs moderately well in this area.
MDOSA clearly has the best combined performance on the three datasets. To see the characteristics of each method more clearly, we visualize them with radar charts, shown for the three datasets in Figure 7, Figure 8, and Figure 9, respectively. All indicators have been max–min normalized. Additionally, for intuition, we multiply the smaller-is-better indicators by −1 so that all indicators are larger-is-better. Note that the standard deviations of the three indicators for the statistics-based methods are not processed.
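The normalization used for the radar charts — max–min scaling each indicator across methods and flipping the sign of smaller-is-better indicators — can be sketched as below; the function name and interface are illustrative.

```python
import numpy as np

def radar_normalize(values, smaller_is_better=False):
    """Max-min normalize one indicator across methods; flip smaller-is-better ones."""
    values = np.asarray(values, dtype=float)
    if smaller_is_better:
        values = -values            # multiply by -1 so that larger is better
    vmin, vmax = values.min(), values.max()
    return (values - vmin) / (vmax - vmin)
```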

5. Conclusions

In this paper, we develop a multimodal model for the stock investment portfolio management task (MDOSA). We use SRL to process the raw environmental information, and we feed the agent textual sentiment analysis data from stock reviews and image data representing the overall direction of the market. A policy gradient-based RL algorithm is chosen to train the agent. The biggest advantage of our method over previous methods is that it obtains high returns while maintaining strong stability, whereas most previous methods could not guarantee both at the same time. The method was tested on three datasets to verify this stability: some methods achieve high returns on one dataset but fail to do so on another, while our method obtains high returns on all three. In addition, our method is highly cost-effective: the Sharpe ratio represents the cost effectiveness of an investment portfolio, and our method has the largest Sharpe ratio on all three datasets. We compared MDOSA with 11 other methods, and MDOSA generally outperformed them.
The real-life constraint of transaction costs is considered in this paper, but other constraints, such as liquidity and bid–ask bounce, are not. Bid–ask bounce here refers to the adjustment phenomenon in which a stock price that has been falling continuously and too quickly eventually reverses back up to a certain level. These constraints can affect stock market trading; this is one of the limitations of the current research, and we will improve our methodology in future work to address it. The max drawdown measures a method's ability to resist risk in a market downturn. From the experimental results, MDOSA did not demonstrate the strongest risk resistance in terms of the max drawdown, and in future work we will continue to look for ways to strengthen it. In our next study, we will also consider the relationships between stocks, because doing so enhances a method's risk tolerance. There are various relationships between companies: some are competitive, some are mutually beneficial, and some are affiliations, so a change in one company's stock may change the direction of another company's stock. In future work, we will build a network of stock relationships to obtain stock prediction methods with better results.
Explainability, in a broad sense, means having access to enough comprehensible information when we need to understand or solve something. Explainable deep learning and explainable reinforcement learning are both subproblems of explainable artificial intelligence and are used to enhance human understanding of models. Since this paper uses DRL, a good explanation of the basis on which the model generates decisions would help investors understand our methodology. However, there is so far no unified explainability method for either explainable deep learning or explainable reinforcement learning. The issue remains challenging, but it certainly needs to be addressed. In future work, we will research explainability and try to explain clearly, from within the model, the basis on which the agent generates decisions.

Author Contributions

Conceptualization, S.D.; methodology, S.D.; software, S.D.; validation, S.D.; data curation, S.D.; writing—original draft preparation, S.D.; writing—review and editing, S.D. and H.S.; visualization, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw stock data used in this paper are available from https://www.eastmoney.com/. The stock comment data collected in this paper will be provided by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. The list of 118 companies.
Company Code | Company Code | Company Code | Company Code | Company Code | Company Code
000001 | 003816 | 600188 | 600919 | 601319 | 601825
000002 | 300059 | 600295 | 600926 | 601328 | 601838
000039 | 300122 | 600309 | 600989 | 601390 | 601857
000063 | 300750 | 600362 | 600999 | 601398 | 601868
000333 | 300760 | 600383 | 601006 | 601577 | 601881
000568 | 600000 | 600406 | 601009 | 601600 | 601888
000617 | 600015 | 600426 | 601012 | 601601 | 601898
000651 | 600016 | 600438 | 601066 | 601607 | 601899
000708 | 600018 | 600519 | 601077 | 601618 | 601916
000776 | 600019 | 600546 | 601088 | 601628 | 601919
000858 | 600025 | 600585 | 601138 | 601633 | 601939
000983 | 600028 | 600606 | 601166 | 601658 | 601985
001872 | 600030 | 600690 | 601169 | 601668 | 601988
001979 | 600036 | 600704 | 601186 | 601669 | 601995
002142 | 600048 | 600741 | 601211 | 601688 | 603288
002304 | 600050 | 600803 | 601225 | 601699 | 603993
002352 | 600089 | 600809 | 601229 | 601728 | 688981
002415 | 600104 | 600837 | 601238 | 601766 | 900948
002475 | 600153 | 600887 | 601288 | 601800
002714 | 600176 | 600900 | 601318 | 601818

References

  1. Bustos, O.; Pomares-Quimbaya, A. Stock market movement forecast: A systematic review. Expert Syst. Appl. 2020, 156, 113464. [Google Scholar] [CrossRef]
  2. Adebiyi, A.A.; Adewumi, A.O.; Ayo, C.K. Comparison of arima and artificial neural networks models for stock price prediction. J. Appl. Math. 2014, 2014, 614342. [Google Scholar] [CrossRef]
  3. Yan, X.; Guosheng, Z. Application of kalman filter in the prediction of stock price. In Proceedings of the 5th International Symposium on Knowledge Acquisition and Modeling (KAM 2015), London, UK, 27–28 June 2015; Atlantis Press: Amsterdam, The Netherlands, 2015; pp. 197–198. [Google Scholar]
  4. Adnan, R.M.; Dai, H.-L.; Mostafa, R.R.; Parmar, K.S.; Heddam, S.; Kisi, O. Modeling multistep ahead dissolved oxygen concentration using improved support vector machines by a hybrid metaheuristic algorithm. Sustainability. 2022, 14, 3470. [Google Scholar] [CrossRef]
  5. Zhang, Z.; Hong, W.-C. Application of variational mode decomposition and chaotic grey wolf optimizer with support vector regression for forecasting electric loads. Knowl. Based Syst. 2021, 228, 107297. [Google Scholar] [CrossRef]
  6. Zhang, Z.; Hong, W.-C. Electric load forecasting by complete ensemble empirical mode decomposition adaptive noise and support vector regression with quantum-based dragonfly algorithm. Nonlinear Dyn. 2019, 98, 1107–1136. [Google Scholar] [CrossRef]
  7. Adnan, R.M.; Dai, H.-L.; Mostafa, R.R.; Islam, A.R.M.T.; Kisi, O.; Heddam, S.; Zounemat-Kermani, M. Modelling groundwater level fluctuations by elm merged advanced metaheuristic algorithms using hydroclimatic data. Geocarto Int. 2023, 38, 2158951. [Google Scholar] [CrossRef]
  8. Adnan, R.M.; Mostafa, R.R.; Dai, H.-L.; Heddam, S.; Kuriqi, A.; Kisi, O. Pan evaporation estimation by relevance vector machine tuned with new metaheuristic algorithms using limited climatic data. Eng. Appl. Comput. Fluid Mech. 2023, 17, 2192258. [Google Scholar] [CrossRef]
  9. Mostafa, R.R.; Kisi, O.; Adnan, R.M.; Sadeghifar, T.; Kuriqi, A. Modeling potential evapotranspiration by improved machine learning methods using limited climatic data. Water 2023, 15, 486. [Google Scholar] [CrossRef]
  10. Adnan, R.M.; Mostafa, R.R.; Islam, A.R.M.T.; Kisi, O.; Kuriqi, A.; Heddam, S. Estimating reference evapotranspiration using hybrid adaptive fuzzy inferencing coupled with heuristic algorithms. Comput. Electron. Agric. 2021, 191, 106541. [Google Scholar] [CrossRef]
  11. Kumbure, M.M.; Lohrmann, C.; Luukka, P.; Porras, J. Machine learning techniques and data for stock market forecasting: A literature review. Expert Syst. Appl. 2022, 197, 116659. [Google Scholar] [CrossRef]
  12. Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; Dai, Q. Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 653–664. [Google Scholar] [CrossRef] [PubMed]
  13. Jiang, Z.; Xu, D.; Liang, J. A deep reinforcement learning framework for the financial portfolio management problem. arXiv 2017, arXiv:1706.10059. [Google Scholar]
  14. Li, Y.; Ni, P.; Chang, V. Application of deep reinforcement learning in stock trading strategies and stock forecasting. Computing 2020, 102, 1305–1322. [Google Scholar] [CrossRef]
  15. Yu, X.; Wu, W.; Liao, X.; Han, Y. Dynamic stock-decision ensemble strategy based on deep reinforcement learning. Appl. Intell. 2023, 53, 2452–2470. [Google Scholar] [CrossRef]
  16. Baker, M.; Wurgler, J. Behavioral corporate finance: An updated survey. In Handbook of the Economics of Finance; Elsevier: Amsterdam, The Netherlands, 2013; pp. 357–424. [Google Scholar]
  17. Rupande, L.; Muguto, H.T.; Muzindutsi, P.-F. Investor sentiment and stock return volatility: Evidence from the Johannesburg stock exchange. Cogent Econ. Financ. 2019, 7, 1600233. [Google Scholar] [CrossRef]
  18. Gite, S.; Khatavkar, H.; Kotecha, K.; Srivastava, S.; Maheshwari, P.; Pandey, N. Explainable stock prices prediction from financial news articles using sentiment analysis. PeerJ Comput. Sci. 2021, 7, e340. [Google Scholar] [CrossRef] [PubMed]
  19. Gumus, A.; Sakar, C.O. Stock market prediction by combining stock price information and sentiment analysis. Int. J. Adv. Eng. Pure Sci. 2021, 33, 18–27. [Google Scholar] [CrossRef]
  20. Mankar, T.; Hotchandani, T.; Madhwani, M.; Chidrawar, A.; Lifna, C. Stock market prediction based on social sentiments using machine learning. In Proceedings of the 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai, India, 5 January 2018; IEEE: New York, NY, USA, 2018; pp. 1–3. [Google Scholar]
  21. Rajendiran, P.; Priyadarsini, P. Survival study on stock market prediction techniques using sentimental analysis. Mater. Today Proc. 2023, 80, 3229–3234. [Google Scholar] [CrossRef]
  22. Shin, H.-G.; Ra, I. A deep multimodal reinforcement learning system combined with cnn and lstm for stock trading. In Proceedings of the 2019 International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Republic of Korea, 16–18 October 2019; IEEE: New York, NY, USA, 2019; pp. 7–11. [Google Scholar]
  23. Wang, Z.; Huang, B.; Tu, S.; Zhang, K.; Xu, L. DeepTrader: A deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. Proc. AAAI Conf. Artif. Intell. 2021, 35, 643–650. [Google Scholar] [CrossRef]
  24. Wang, J.; Zhang, Y.; Tang, K.; Wu, J.; Xiong, Z. Alphastock: A buyingwinners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19), Anchorage, AK, USA, 4–8 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1900–1908. [Google Scholar]
  25. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  26. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; PMLR: New York, NY, USA, 2014; pp. 387–395. [Google Scholar]
  27. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018; pp. 1587–1596. [Google Scholar]
  28. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  29. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  30. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  31. Ota, K.; Oiki, T.; Jha, D.; Mariyama, T.; Nikovski, D. Can increasing input dimensionality improve deep reinforcement learning? In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: New York, NY, USA, 2020; pp. 7424–7433. [Google Scholar]
  32. Sinha, S.; Bharadhwaj, H.; Srinivas, A.; Garg, A. D2rl: Deep dense architectures in reinforcement learning. arXiv 2020, arXiv:2010.09163. [Google Scholar]
  33. Lim, B.; Zohren, S.; Roberts, S. Enhancing time series momentum strategies using deep neural networks. arXiv 2019, arXiv:1904.04912. [Google Scholar]
Figure 1. Workflow of the overall methodology.
Figure 2. Connection of inputs and outputs for each layer of the three SRL models.
Figure 3. Three images generated from the SSEC.
Figure 4. The architecture of the model proposed in this paper.
Figure 5. The process of obtaining h_c^t.
Figure 6. Dataset segmentation.
Figure 7. Radar chart for all methods in Group A.
Figure 8. Radar chart for all methods in Group B.
Figure 9. Radar chart for all methods in Group C.
Table 1. Three evaluation indicators and six technical indicators.
Categories of Indicators | Name of Indicator | Meaning of Indicator
Evaluation indicators | Cumulative returns | The ratio of final and initial assets acquired over the entire period from inception to the end of the investment.
Evaluation indicators | Sharpe ratio | The ratio of the average rate of return to the standard deviation of the rate of return over the investment period.
Evaluation indicators | Max drawdown | The maximum value of the decline from any high point on the yield curve to its subsequent low point.
Technical indicators | MACD | Moving average convergence divergence
Technical indicators | RSI | Relative strength index
Technical indicators | CCI | Commodity channel index
Technical indicators | ADX | Average directional index
Technical indicators | DMI | Directional movement index
Technical indicators | SSO | Stochastic oscillator
Table 2. Partial presentation of data for stock 000001.
Date | Open | High | Low | Close | Volume | As
4 January 2022 | 16.48 | 16.66 | 16.18 | 16.66 | 116,925,933 | 0.3639
5 January 2022 | 16.58 | 17.22 | 16.55 | 17.15 | 196,199,817 | 0.3758
6 January 2022 | 17.11 | 17.27 | 17.00 | 17.12 | 110,788,519 | 0.3881
7 January 2022 | 17.10 | 17.28 | 17.06 | 17.20 | 112,663,070 | 0.3389
10 January 2022 | 17.29 | 17.42 | 17.03 | 17.19 | 90,977,401 | 0.3549
11 January 2022 | 17.26 | 17.54 | 17.14 | 17.41 | 158,199,940 | 0.3816
12 January 2022 | 17.41 | 17.45 | 16.90 | 17.00 | 150,216,355 | 0.3358
Table 3. Hyperparameter settings for MDOSA.
Hyperparameter | Meaning of Hyperparameter | Value
episode | Number of episodes in agent training | 20
gamma | The discount factor | 0.90
n_updates | Number of trainings per interaction | 5
batch_size | The batch size | 4
buffer_size | The capacity of the memory buffer | 10,000
learning_rate | The learning rate | 0.00001
tau | Soft update parameter for the target network | 0.005
srl_hidden_dim | SRL model hidden layer dimension | 1024
lstm_hidden_dim | LSTM model hidden layer dimension | 1024
Table 4. Comparison of test results of four RL algorithms in Groups A, B, and C.
Group | Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
A | DDPG | 0.0923 | 0.0086 | 4.2156 | 0.5440 | 0.0211 | 0.0043
A | PPO | 0.0922 | 0.0105 | 4.5951 | 0.9915 | 0.0243 | 0.0039
A | SAC | 0.0826 | 0.0106 | 3.9156 | 0.5628 | 0.0228 | 0.0032
A | TD3 | 0.0987 | 0.0163 | 4.7371 | 0.8736 | 0.0212 | 0.0058
B | DDPG | 0.0819 | 0.0142 | 3.7968 | 0.3591 | 0.0288 | 0.0048
B | PPO | 0.0750 | 0.0201 | 3.5134 | 0.8073 | 0.0338 | 0.0076
B | SAC | 0.0717 | 0.0161 | 3.4672 | 0.7080 | 0.0316 | 0.0041
B | TD3 | 0.0737 | 0.0091 | 3.5712 | 0.3637 | 0.0297 | 0.0052
C | DDPG | 0.0874 | 0.0033 | 3.9671 | 0.4071 | 0.0272 | 0.0046
C | PPO | 0.0847 | 0.0145 | 3.7884 | 0.6042 | 0.0249 | 0.0036
C | SAC | 0.0809 | 0.0162 | 3.7013 | 0.7186 | 0.0283 | 0.0091
C | TD3 | 0.0916 | 0.0251 | 3.9487 | 1.0485 | 0.0270 | 0.0036
Table 5. Comparison of test results of three SRL models in Groups A, B, and C.
Group | Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
A | DenseNet | 0.0947 | 0.0010 | 4.4698 | 0.4200 | 0.0226 | 0.0058
A | OFENet | 0.1056 | 0.0126 | 5.0163 | 0.4121 | 0.0198 | 0.0035
A | D2RL | 0.0946 | 0.0089 | 4.5565 | 0.3614 | 0.0211 | 0.0054
B | DenseNet | 0.0762 | 0.0174 | 3.7411 | 0.4801 | 0.0282 | 0.0055
B | OFENet | 0.0909 | 0.0078 | 4.0699 | 0.3524 | 0.0301 | 0.0089
B | D2RL | 0.0818 | 0.0115 | 3.8223 | 0.3653 | 0.0324 | 0.0072
C | DenseNet | 0.0865 | 0.0133 | 3.7859 | 0.4381 | 0.0310 | 0.0061
C | OFENet | 0.1002 | 0.0107 | 4.4552 | 0.3985 | 0.0251 | 0.0028
C | D2RL | 0.0921 | 0.0109 | 4.1309 | 0.4070 | 0.0283 | 0.0027
Table 6. Comparison of test results for all methods in Group A.
Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
MDOSA (ours) | 0.1056 | 0.0126 | 5.0163 | 0.4121 | 0.0198 | 0.0035
DDPG | 0.0923 | 0.0086 | 4.2156 | 0.5440 | 0.0211 | 0.0043
PPO | 0.0922 | 0.0105 | 4.5951 | 0.9915 | 0.0243 | 0.0039
SAC | 0.0826 | 0.0106 | 3.9156 | 0.5628 | 0.0228 | 0.0032
TD3 | 0.0987 | 0.0163 | 4.7371 | 0.8736 | 0.0212 | 0.0058
DenseNet | 0.0947 | 0.0010 | 4.4698 | 0.4200 | 0.0226 | 0.0058
D2RL | 0.0946 | 0.0089 | 4.5565 | 0.3614 | 0.0211 | 0.0054
BCRP | 0.0980 | 0 | 4.8263 | 0 | 0.0180 | 0
BS | 0.0960 | 0 | 4.5605 | 0 | 0.0174 | 0
UBH | 0.0975 | 0 | 4.6250 | 0 | 0.0177 | 0
UCRP | 0.0975 | 0 | 4.6245 | 0 | 0.0177 | 0
UP | 0.0978 | 0 | 4.6359 | 0 | 0.0177 | 0
Table 7. Comparison of test results for all methods in Group B.
Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
MDOSA (ours) | 0.0909 | 0.0078 | 4.0699 | 0.3524 | 0.0301 | 0.0089
DDPG | 0.0819 | 0.0142 | 3.7968 | 0.3591 | 0.0288 | 0.0048
PPO | 0.0750 | 0.0201 | 3.5138 | 0.8073 | 0.0338 | 0.0076
SAC | 0.0717 | 0.0161 | 3.4672 | 0.7080 | 0.0316 | 0.0041
TD3 | 0.0737 | 0.0091 | 3.5712 | 0.3637 | 0.0297 | 0.0052
DenseNet | 0.0762 | 0.0174 | 3.7411 | 0.4801 | 0.0282 | 0.0055
D2RL | 0.0818 | 0.0115 | 3.8223 | 0.3653 | 0.0324 | 0.0072
BCRP | 0.0710 | 0 | 3.4065 | 0 | 0.0335 | 0
BS | 0.0698 | 0 | 3.4996 | 0 | 0.0282 | 0
UBH | 0.0745 | 0 | 3.6312 | 0 | 0.0301 | 0
UCRP | 0.0744 | 0 | 3.6300 | 0 | 0.0301 | 0
UP | 0.0746 | 0 | 3.6380 | 0 | 0.0301 | 0
Table 8. Comparison of test results for all methods in Group C.
Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
MDOSA (ours) | 0.1002 | 0.0107 | 4.4552 | 0.3985 | 0.0251 | 0.0028
DDPG | 0.0874 | 0.0033 | 3.9671 | 0.4071 | 0.0272 | 0.0046
PPO | 0.0847 | 0.0145 | 3.7884 | 0.6042 | 0.0249 | 0.0036
SAC | 0.0809 | 0.0162 | 3.7013 | 0.7186 | 0.0283 | 0.0091
TD3 | 0.0916 | 0.0251 | 3.9487 | 1.0485 | 0.0270 | 0.0036
DenseNet | 0.0865 | 0.0133 | 3.7859 | 0.4381 | 0.0310 | 0.0061
D2RL | 0.0921 | 0.0109 | 4.1309 | 0.4070 | 0.0283 | 0.0027
BCRP | 0.0803 | 0 | 3.7073 | 0 | 0.0331 | 0
BS | 0.0856 | 0 | 3.9714 | 0 | 0.0233 | 0
UBH | 0.0897 | 0 | 4.1020 | 0 | 0.0227 | 0
UCRP | 0.0897 | 0 | 4.1016 | 0 | 0.0227 | 0
UP | 0.0896 | 0 | 4.1018 | 0 | 0.0227 | 0

Du, S.; Shen, H. Reinforcement Learning-Based Multimodal Model for the Stock Investment Portfolio Management Task. Electronics 2024, 13, 3895. https://doi.org/10.3390/electronics13193895
