Article

Reinforcement Learning-Based Multimodal Model for the Stock Investment Portfolio Management Task

College of Science, Northeastern University, Shenyang 110819, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3895; https://doi.org/10.3390/electronics13193895
Submission received: 30 August 2024 / Revised: 26 September 2024 / Accepted: 30 September 2024 / Published: 1 October 2024

Abstract

Machine learning has been applied by a growing number of scholars in the field of quantitative investment, but traditional machine learning methods cannot provide high returns and strong stability at the same time. In this paper, a multimodal model based on reinforcement learning (RL) is constructed for the stock investment portfolio management task. Most previous RL-based methods have used value-based RL; however, a growing body of research has shown that policy gradient-based RL methods outperform value-based ones. Commonly used policy gradient-based reinforcement learning methods are DDPG, TD3, SAC, and PPO. We conducted comparative experiments to select the method most suitable for the dataset in this paper; the final choice was DDPG. Furthermore, previous methods rarely refine the raw data before training the agent. The stock market produces a large amount of complex data, and if raw stock market data are fed directly to the agent, the agent cannot learn the information in the data efficiently and quickly. We use state representation learning (SRL) to process the raw stock data and then feed the processed data to the agent. Training the agent on stock data alone is not enough, so we also added comment text data and image data. The comment text data come from investors’ comments on stock bars, and the image data are derived from pictures that represent the overall direction of the market. We conducted experiments on three datasets and compared our proposed model with 11 other methods under three evaluation indicators. Taken together, our proposed model performs best.

1. Introduction

Quantitative investment in stocks has long been a popular area of research. Quantitative investment techniques are designed to predict the direction of stocks through models, thus helping many retail investors make decisions that can lead to gains [1]. Most previous stock forecasting methods treat the stock problem as a time series modeling problem and solve it with statistics-based methods [2,3]. In recent years, as the field of artificial intelligence has become popular, solving time series problems with machine learning-based methods has been shown to produce better results [4,5,6,7,8,9,10,11]. However, machine learning methods have various limitations. Some research [4,5] uses support vector machines (SVMs) to solve time series problems, but an SVM is difficult to apply to large-scale training samples. Research [6] uses support vector regression (SVR) to solve time series problems, but for linear data the performance of SVR is slightly inferior to linear regression. Research [7] uses extreme learning machines (ELMs) to solve time series problems; ELMs avoid local optima, overfitting, and long training times, but have limited applicability in multilayer structures. Research [8,9] uses relevance vector machines (RVMs) to solve time series problems, but when the dataset is large, RVM training time becomes excessive. The heuristic algorithm (HA) is an innovative technique in the field of machine learning [10], but it depends heavily on parameter choices and easily falls into local optima. For the stock market, with its large amount of complex data, traditional machine learning methods cannot consistently and accurately predict market direction. Therefore, these traditional machine learning methods are not well suited to the stock prediction problem.
Deep reinforcement learning (DRL) is another method for building quantitative investment strategies. Its interactive, trial-and-error learning matches how real-world organisms learn: the agent continuously interacts with a virtual financial market environment, obtains feedback by trying various trading actions, and adjusts its strategy accordingly. DRL-based methods have therefore yielded good results in the field of quantitative investment. Research [12] introduced a recursive deep neural network for real-time financial signal representation and trading, with the aim of training computers to beat experienced traders in financial trading. Research [13] proposed a reinforcement learning framework without financial modeling that provides a deep machine learning solution for investment portfolio management tasks. Research [14] applied deep reinforcement learning to stock trading and stock price prediction and demonstrated the reliability and usability of the model through experiments. Research [15] proposed two methods for stock trading decisions: first, a nested reinforcement learning approach based on three deep reinforcement learning models, and second, a weighted random selection with a confidence strategy.
In addition, more and more stock prediction studies incorporate sentiment analysis in their models. Research [16,17,18] argues that online comments related to the stock market affect retail investors’ judgment, which in turn affects stock trading and the direction of the stock market. Research [19,20,21] demonstrated that adding a sentiment analysis module can improve stock prediction accuracy.
In our previous work, we combined stock data and stock bar comment text and then built a stock investment portfolio management model using the DRL method. Although the experimental results show that large gains are obtained, the model does not adequately process information about the market environment, and the model data are not utilized efficiently, which leads to unstable results. Therefore, this paper addresses the shortcomings of the previous work.
Research [22] proposes a multimodal DRL method that combines a CNN and LSTM as feature extraction modules and uses a DQN as a decision model for trading. This paper utilizes stock data information to generate images and then utilizes modules of a CNN and LSTM to process the images. The results show that the model achieves a significant increase in profit. Research [23,24] refers to previous work and improves on it by utilizing multimodal data to more fully extract stock features.
SRL is one of the related areas of research to make the model effective for training data. The basic idea of SRL is to construct auxiliary training tasks to train the feature extractor and place it in front of the actor and critic to pre-process the raw state and action inputs of the environment so that the actor and critic can speed up the training and improve the results with more efficient and easy-to-process intermediate features as inputs.
The innovations of this paper are as follows:
  • A multimodal stock investment portfolio management model is proposed. Most previous stock investment portfolio management methods considered only raw stock data, or added comment text data to the raw stock data. The input data for the model proposed in this paper include stock raw data, comment text data, and image data. We collected nearly 10 million stock bar comments, from 1 January 2022 to 3 April 2024, for 118 stocks. The stock raw data and the sentiment analysis index of the comment text are used as state inputs. We use previous stock data to construct three kinds of image data that reflect the long-term trend in the stock market. The image processing module consists of a CNN and an LSTM, which extract the overall dynamic characteristics of the market and the long-term time series characteristics, respectively.
  • An SRL module is added to the model. Most previous stock investment portfolio management methods have not extracted information from the raw data before inputting it, and even fewer have used SRL to do so. This paper adds an SRL module, which helps the agent obtain more complete information about the stock market. The raw data first pass through the SRL module before being input to the critic and actor layers.
  • Reinforcement learning algorithms based on the policy gradient method are used. Among existing RL-based stock investment portfolio management methods, value-based RL methods are more common. In this paper, we choose policy gradient-based RL methods. The action space of a policy gradient-based RL algorithm is continuous and therefore better suited to the investment portfolio management task than a value-based RL algorithm.
This paper consists of five chapters. The first chapter is the introduction. In this chapter, we introduce some methods to solve the task of stock investment portfolio management. We also introduce the benefits of DRL in solving stock investment portfolio management tasks. Finally, we present the innovations of this paper. The second chapter is preliminaries. We introduce the basic knowledge of DRL and the basic knowledge of quantitative investing in this chapter. The third chapter is the methodology. We describe our methodology in detail. The fourth chapter is about experiments. We first describe the source of the dataset and the preprocessing of the dataset, and then we conduct comparative experiments. We select the most appropriate SRL method and the most appropriate RL method. Our method is then compared with 11 other methods. On all three datasets, our method obtained the best results. The fifth chapter is the conclusion. In this chapter, we discuss the strengths and limitations of our methodology and discuss the focus of future work.

2. Preliminaries

In this chapter, we introduce the basic knowledge of DRL and of quantitative investing. In Section 2.1.1, we introduce some basic mathematical definitions of DRL. In Section 2.1.2, we introduce value-based RL methods and policy gradient-based RL methods. In Section 2.2.1, we introduce the quantitative investment task. In Section 2.2.2, we introduce the evaluation indicators and technical indicators used in this paper.

2.1. DRL

2.1.1. Markov Decision Process (MDP)

RL is a computational method to understand and automate goal-directed learning and decision-making problems. From a mathematical point of view, the RL problem is an idealized Markov decision process (MDP) with a theoretical framework for achieving goals through interactive learning. The agent and MDP together give a sequence of trajectories. MDP can be modeled using Equation (1):
p(s', r \mid s, a) = \Pr\{ s_{t+1} = s', \; r_{t+1} = r \mid s_t = s, \; a_t = a \}. \quad (1)
There is continuous interaction between the agent and the environment: in the current state (s), the agent selects an action (a); the environment responds to this action, presents a new state (s') to the agent, and generates a reward (r). The goal of the agent is to maximize the cumulative discounted reward:
G_t = \sum_{l=1}^{\infty} \gamma^{l-1} r_{t+l}, \quad (2)
where γ is the discount factor.
The value functions are functions of states (or state–action pairs) and are used to estimate how “good” it is (in terms of expected return) for the agent to be in a given state (or state–action pair). The magnitude of the expected return depends on the actions chosen by the agent, so value functions are defined with respect to particular ways of acting, called policies (π). A policy is a mapping from states to the probabilities of choosing each possible action in that state. We denote the value function of state s under policy π as V_π(s):
V_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right]. \quad (3)
We denote the value of taking action a in state s under policy π as Q_π(s, a):
Q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s, A_t = a \right]. \quad (4)
With the introduction of the value function, the learning objective of RL can be formally defined. To do so, we first define the optimal state–value function and the optimal action–value function, shown below:
V_*(s) = \max_\pi V_\pi(s), \quad \text{for all } s \in \mathcal{S}, \quad (5)
Q_*(s, a) = \max_\pi Q_\pi(s, a), \quad \text{for all } s \in \mathcal{S},\; a \in \mathcal{A}. \quad (6)
Given the optimal value function, the optimal policy can be obtained directly using the greedy algorithm. The RL task is to solve for the optimal policy:
\pi_*(s) = \arg\max_{a} Q_*(s, a). \quad (7)

2.1.2. Q-Learning Algorithm and Policy Gradient Methods

The Q-learning algorithm is the theoretical basis for value-based methods; it is also used by the DQN family of models and by actor–critic methods to train action value functions. The Q-learning algorithm learns the optimal action value function directly, and its update rule is shown below:
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]. \quad (8)
By continuously applying Equation (8), the Q-learning algorithm performs both policy evaluation and policy improvement until the Q-function converges, i.e., the optimal action value function is obtained, from which the optimal policy follows.
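To make the update in Equation (8) concrete, the following is a minimal, self-contained sketch of tabular Q-learning; the environment interface (env.reset, env.step) and the discrete state/action sizes are illustrative assumptions and are not part of the method used later in this paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: repeatedly apply the update of Equation (8)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    # Greedy (optimal) policy from the learned Q-function, cf. Equation (7).
    return np.argmax(Q, axis=1), Q
```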
In addition to value-based methods, a growing number of policy gradient-based methods have been proposed.
DDPG [25] is based on DPG [26] and is trained using the actor–critic framework. The actor and critic are fitted with neural networks, and the parameters are updated using gradient descent. In addition, DDPG adds OU (Ornstein–Uhlenbeck) noise to increase exploration and uses soft updates to copy network parameters to the target network.
TD3 [27] is an improvement on the DDPG algorithm. TD3 introduces two critic networks to compute the TD target, which alleviates the overestimation problem of function approximation in the actor–critic framework. At the same time, TD3 adds smoothing noise to regularize the estimate of the action value and reduces the update frequency of the actor to increase training stability.
SAC [28] proposes a generalized policy iteration algorithm under the soft actor–critic framework, which obtains the optimal policy under the maximum entropy reinforcement learning framework. SAC optimizes the reward and the policy entropy at the same time, with the two objectives weighted by an entropy coefficient. SAC also proposes an algorithm that dynamically and adaptively adjusts the entropy coefficient, which makes SAC insensitive to this hyperparameter.
PPO [29] stabilizes training by constraining policy updates to a trust region, limiting how far the policy parameters can move in each update. PPO turns the constrained problem into an unconstrained optimization problem and updates the parameters using gradient descent.
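The paper does not state which implementations of these four algorithms were used. Purely as an illustrative sketch, the candidates could be instantiated and trained on a Gym-style portfolio environment with the Stable-Baselines3 library; both the library choice and the make_env helper are assumptions, not the authors' tooling.

```python
# Illustrative sketch only: assumes a Gym-style portfolio environment factory
# `make_env()` and the Stable-Baselines3 library (not stated in the paper).
from stable_baselines3 import DDPG, TD3, SAC, PPO

def train_candidates(make_env, total_timesteps=50_000):
    """Train each policy gradient-based candidate with shared key settings."""
    results = {}
    for name, algo in [("DDPG", DDPG), ("TD3", TD3), ("SAC", SAC), ("PPO", PPO)]:
        env = make_env()
        model = algo("MlpPolicy", env, gamma=0.90, learning_rate=1e-5, verbose=0)
        model.learn(total_timesteps=total_timesteps)
        results[name] = model
    return results
```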

2.2. Quantitative Investment

2.2.1. The Task of Quantitative Investment

Quantitative investment is divided into two categories: the trading task and the investment portfolio management task. The trading task is to make rational trading decisions based on the market environment and to translate those decisions into the number of shares to buy or sell for each stock. The portfolio management task requires readjusting, at each moment t, a weight vector that allocates the current assets among the stocks; the weight vector is a unit vector whose components represent the proportion of assets invested in each stock at moment t. The goal of the investment portfolio management task is to maximize the total assets asset_t at moment t. Since the portfolio management task more closely matches the investment behavior of most real-world investors, this paper focuses only on the investment portfolio management task.

2.2.2. Indicators

In this paper, we use three evaluation indicators to evaluate the goodness of the model results, and we also construct six technical indicators for modeling. These three evaluation indicators and six technical indicators are shown in Table 1.
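The three evaluation indicators in Table 1 can be computed from a curve of total assets over time. The sketch below is one common way to do so; whether the cumulative return is reported as the asset ratio itself or as the ratio minus one is an assumption, since the paper does not specify, and no annualization or risk-free rate is applied, matching the Sharpe ratio definition in Table 1.

```python
import numpy as np

def evaluate(assets):
    """assets: 1-D array of total assets over time (e.g., daily)."""
    assets = np.asarray(assets, dtype=float)
    returns = assets[1:] / assets[:-1] - 1.0
    # Cumulative return, here taken as final/initial - 1 (an assumption).
    cumulative_return = assets[-1] / assets[0] - 1.0
    # Sharpe ratio as defined in Table 1: mean return / std of returns.
    sharpe = returns.mean() / returns.std()
    # Max drawdown: largest relative drop from a running peak to a later trough.
    running_peak = np.maximum.accumulate(assets)
    max_drawdown = np.max(1.0 - assets / running_peak)
    return cumulative_return, sharpe, max_drawdown
```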

3. Methodology

In this section, we describe our methodology in detail. We regard the investment portfolio management task as an RL task. In Section 3.1, we describe the modeling of the RL task for this paper: the state, action, reward, and course of dealing. In Section 3.2.1 and Section 3.2.2, we introduce the role of SRL and the generation of image data, and in Section 3.2.3 we describe the architecture of our method. We use SRL to process the raw data and then input the processed data to the agent. Finally, the agent is trained by the RL method. The workflow of the overall method is shown in Figure 1.

3.1. Modeling the Environment for RL Task

  • State
N is the total number of stocks. The state is a two-dimensional matrix of shape (2N + 4) × N, which is divided into three parts. The first part is the covariance matrix between the close price vectors of the N stocks; the second part is the average sentiment analysis index for each stock at moment t (h_c^t, where c denotes the stock and t the moment); and the third part comprises four technical indicators (MACD, RSI, CCI, ADX). The shapes of these three parts are N × N, N × N, and 4 × N, respectively. The state at moment t is shown in Equation (9), where cov_{i,j} (1 ≤ i, j ≤ N) denotes the covariance of the vectors consisting of the closing prices of stock i and stock j over the past 28 days. The covariance matrix is well suited to investment portfolio management tasks, since investors usually use the standard deviation between stock prices to measure the risk of a particular asset allocation; this is why we include it as part of the state.
S_t = \begin{bmatrix}
cov_{1,1} & cov_{1,2} & \cdots & cov_{1,N} \\
cov_{2,1} & cov_{2,2} & \cdots & cov_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
cov_{N,1} & cov_{N,2} & \cdots & cov_{N,N} \\
h_1^t & h_2^t & \cdots & h_N^t \\
\vdots & \vdots & \ddots & \vdots \\
h_1^t & h_2^t & \cdots & h_N^t \\
MACD_1 & MACD_2 & \cdots & MACD_N \\
RSI_1 & RSI_2 & \cdots & RSI_N \\
CCI_1 & CCI_2 & \cdots & CCI_N \\
ADX_1 & ADX_2 & \cdots & ADX_N
\end{bmatrix}. \quad (9)
  • Action
We set up the action as a one-dimensional vector of length N. Each component of the vector represents the proportion of total assets allocated to the corresponding stock. We normalize the actions using the softmax function, so the components of each action sum to one. The agent maximizes total assets by adjusting the asset allocation weights. The action at moment t is given by Equation (10), where α_{t,i} ∈ [0, 1] (1 ≤ i ≤ N) denotes the weight of the assets assigned to stock i at moment t.
a_t = \left( \alpha_{t,1}, \alpha_{t,2}, \alpha_{t,3}, \ldots, \alpha_{t,N} \right). \quad (10)
  • Reward
We take the difference between the total assets at moment t and the total assets at moment t − 1 as the reward. Because some stocks are highly priced, this difference can be very large, so the reward needs to be scaled. We set the starting assets to RMB one million, so we multiply the reward by 10^{-6} to keep it on a small scale around 0. The reward function is shown below.
r_t = \left( asset_t - asset_{t-1} \right) \times 10^{-6}. \quad (11)
  • Course of dealing
The number of stocks is N and the initial assets are RMB 1 million. The agent outputs the action a_{t+1} at moment t + 1. The environment then updates the asset allocation by performing sell and buy operations on each stock in turn, based on the difference between the asset allocation at time t and at time t + 1. Note that the funds used for buying come from the cash received from selling, and this cash must be fully used. When the environment moves from moment t to moment t + 1, the total assets under the current allocation, valued at the new closing prices, are:
asset_{t+1} = asset_t \cdot P_{t+1}^{t} \cdot a_{t+1}^{T}, \quad (12)
where a_{t+1}^{T} is the transpose of the action a_{t+1}, and P_{t+1}^{t} is an N-dimensional row vector:
P_{t+1}^{t} = \left( \frac{close_{t+1,1}}{close_{t,1}}, \; \frac{close_{t+1,2}}{close_{t,2}}, \; \ldots, \; \frac{close_{t+1,N}}{close_{t,N}} \right), \quad (13)
where close_{t,i} (1 ≤ i ≤ N) is the closing price of stock i at time t and close_{t+1,i} is the closing price of stock i at time t + 1.
The goal of the investment portfolio management task is to maximize the total assets at the final time I:
asset_I = asset_0 \cdot \left( P_1^0 \cdot a_1^T \right) \cdot \left( P_2^1 \cdot a_2^T \right) \cdots \left( P_I^{I-1} \cdot a_I^T \right). \quad (14)
Since in reality some transaction costs are incurred when trading stocks, this paper sets a transaction fee of one-thousandth (0.1%).
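A minimal sketch of the environment logic described in this section — the covariance/sentiment/indicator state of Equation (9), the softmax-normalized action of Equation (10), the price-relative asset update of Equations (12) and (13), the scaled reward of Equation (11), and a proportional transaction cost — is shown below. The exact layout of the sentiment block, the fee model based on turnover, and the function signatures are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

TRANSACTION_FEE = 0.001  # assumed 0.1% of traded value

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def build_state(close_window, sentiment, macd, rsi, cci, adx):
    """State of shape (2N+4, N): covariance of past closes, sentiment rows,
    and four technical-indicator rows, cf. Equation (9)."""
    cov = np.cov(close_window.T)                          # close_window: (28, N) -> (N, N)
    sent_block = np.tile(sentiment, (len(sentiment), 1))  # (N, N); layout assumed
    tech = np.stack([macd, rsi, cci, adx])                # (4, N)
    return np.vstack([cov, sent_block, tech])

def step(asset_t, close_t, close_t1, raw_action, prev_weights):
    """One portfolio-rebalancing step from moment t to t+1."""
    weights = softmax(raw_action)                     # action a_{t+1}, Equation (10)
    turnover = np.abs(weights - prev_weights).sum()   # fraction of assets traded
    fee = TRANSACTION_FEE * turnover * asset_t        # assumed fee model
    price_relatives = close_t1 / close_t              # P_{t+1}^{t}, Equation (13)
    asset_t1 = (asset_t - fee) * (price_relatives @ weights)   # Equation (12)
    reward = (asset_t1 - asset_t) * 1e-6              # Equation (11)
    return asset_t1, reward, weights
```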

3.2. Model

3.2.1. SRL

The result of a machine learning task depends not only on the appropriateness of the algorithm but also on the quality and effective representation of the data. The purpose of SRL is to simplify the complex raw data, eliminate invalid or redundant information from the raw data, and refine the valid information to form features. Processing raw data using SRL can help the agent better understand the environment. Research [30,31,32] proposed three models, DenseNet, OFENet, and D2RL, respectively, which differ mainly in the way the inputs and outputs are connected at each layer. As shown in Figure 2, DenseNet combines the output of each layer with the inputs of all previous layers, OFENet combines the output of each layer of the network with the inputs of that layer, and D2RL combines the output of each layer with the original inputs. All three networks have their own domains of applicability, so this paper conducts comparative experiments between the three networks in the experimental section as a way of choosing the most suitable network for the dataset of this paper.
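To make the three connection patterns in Figure 2 concrete, the following PyTorch sketch implements the OFENet-style variant, in which the output of each layer is concatenated with that layer's input; the layer sizes are illustrative assumptions, and the DenseNet- and D2RL-style variants differ only in what is concatenated, as noted in the comments.

```python
import torch
import torch.nn as nn

class OFEBlock(nn.Module):
    """OFENet-style SRL block: each layer's output is concatenated with that
    layer's input before being passed on.
    (A DenseNet-style block would concatenate the outputs of all previous layers;
    a D2RL-style block would concatenate each layer's output with the original input.)"""

    def __init__(self, input_dim, hidden_dim=1024, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = input_dim
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU()))
            dim = dim + hidden_dim   # the next layer sees [input, output]
        self.output_dim = dim

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=-1)
        return x
```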

3.2.2. Image Data

It is not enough to generate state inputs to the agent using only numerical and textual data, so we added image data. This paper uses three types of images as image data, SSO indicator charts, DMI indicator charts, and candlestick charts. The SSO indicator chart and the DMI indicator chart represent SSO indicators and DMI indicators, respectively. The upper and lower parts of a candlestick chart are the candlestick timing chart and the volume chart, respectively, which are used to represent price fluctuations over a period of time. Candlesticks consist of the opening, closing, high, and low prices over a certain period of time.
The three types of images are plotted from the stock data of the Shanghai Stock Exchange Composite Index (SSEC); this is because one stock in isolation does not reflect the overall condition of the market. The three images in Figure 3 are plotted from the SSEC’s 4 January 2022 stock raw data. The two broken lines in the SSO chart represent the %K fast and %D slow lines, and the two horizontal lines represent the thresholds that signal a shift in market conditions. The three lines in the DMI chart represent the +DI, −DI, and ADX indicators, respectively. Candlestick charts in blue indicate that the closing price was higher than the opening price, while red indicates that the closing price was lower than the opening price. We also plot simple moving averages of the stock price for the last 5, 20, 60, and 120 moments in the candlestick chart.
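The candlestick chart in Figure 3, with its volume panel and 5/20/60/120-period moving averages, can be generated from the SSEC OHLCV data. The sketch below uses the mplfinance library as one possible tool; the paper does not state which plotting library was used, and the DataFrame layout (date index with Open/High/Low/Close/Volume columns) is an assumption.

```python
# Illustrative sketch: assumes `ohlcv` is a pandas DataFrame of SSEC data indexed
# by date with columns 'Open', 'High', 'Low', 'Close', 'Volume'.
import mplfinance as mpf

def plot_candlestick(ohlcv, path="ssec_candlestick.png"):
    mpf.plot(
        ohlcv,
        type="candle",
        mav=(5, 20, 60, 120),   # simple moving averages shown in Figure 3
        volume=True,            # volume panel below the candlesticks
        savefig=path,
    )
```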

3.2.3. Model Architecture

The model architecture of the model proposed in this paper is shown in Figure 4. The state information is first processed by the SRL module and then input as a vector to the multilayer perceptron (MLP) module. The image data generated by the SSEC is processed by the CNN–LSTM module. The agent’s training algorithm is a reinforcement learning algorithm. Since the action space is continuous, the value-based reinforcement learning algorithm is no longer applicable, so we choose to use the policy gradient-based reinforcement learning algorithm.
First, the image is input to the CNN module. The size of the image input is (4, 400, 240); the CNN module contains five convolutional and pooling layers; and the output feature map is of size (11, 4, 64). Next, the feature map is split in column order into a sequence of shape (4, 704), which is input to the LSTM module to obtain the final image feature data.
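A possible PyTorch realization of this CNN–LSTM image module is sketched below. The kernel sizes, channel counts, and the adaptive pooling step are assumptions chosen only so that a (4, 400, 240) input yields a feature map that can be read column-wise as a length-4 sequence of 704-dimensional vectors, as described above.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Sketch of the CNN-LSTM image module: a (4, 400, 240) image is encoded to a
    (64, 11, 4) feature map, read column-wise as a sequence of four 704-d vectors,
    and summarized by an LSTM."""

    def __init__(self, lstm_hidden_dim=1024):
        super().__init__()
        layers, in_ch = [], 4
        for out_ch in (16, 32, 64, 64, 64):          # five conv + pool stages
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers, nn.AdaptiveAvgPool2d((11, 4)))
        self.lstm = nn.LSTM(input_size=11 * 64, hidden_size=lstm_hidden_dim,
                            batch_first=True)

    def forward(self, img):                           # img: (B, 4, 400, 240)
        fmap = self.cnn(img)                          # (B, 64, 11, 4)
        seq = fmap.permute(0, 3, 1, 2).flatten(2)     # (B, 4, 704): column-wise split
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]                                # final image feature vector
```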
The image feature data and the environment information processed by the MLP module are input together to the agent for training. The agent outputs actions that assign asset weights to each stock; executing these actions aims to maximize total assets.

4. Experiments

In this section, we introduce the dataset and perform comparative experiments. In Section 4.1, we describe the dataset sources. In Section 4.2, we describe the data preprocessing. In Section 4.3, we perform comparative experiments. In Section 4.3, we first select the most appropriate RL method and then the most appropriate SRL method. Finally, we compare our method with 11 methods.

4.1. Dataset

The Chinese stock market is one of the most active and largest financial markets in the world today, with strong representativeness and research value. We chose the Chinese stock market as our dataset for the following three reasons:
  • For many years, most of the research on the stock market has used the US stock market as the dataset, and there is less research on stock prediction in the Chinese stock market.
  • The recovery of the Chinese economy is accelerating, and the total market capitalization of the Chinese stock market exceeded RMB 10 trillion in 2020. The Chinese stock market is large and representative enough to be a subject of research.
  • The US stock market is dominated by institutional investors, while the Chinese stock market is dominated by retail investors. From the inception of the Chinese stock market to 2023, the number of retail investors has exceeded 220 million. This makes the Chinese stock market highly predictable.
In addition, the dataset has limitations. The Chinese stock market has a relatively short history. Some companies have been listed for too short a time, resulting in missing stock data, and some companies have few investors, resulting in few stock bar comments. Such companies cannot provide valid data for the experiment and therefore need to be excluded; we exclude them in the data preprocessing step. Based on the list of the top 150 listed companies in China published by the Walton Institute of Economic Research, this paper selects the 118 companies that appear on the list for two consecutive years, 2022–2023, as the source of the stock trading dataset. The list of these 118 companies is given in Appendix A (Table A1).
East Money (https://www.eastmoney.com/) is one of the most visited and influential financial and securities portals in China. Since its launch in March 2004, East Money has always insisted on the authority and professionalism of its content, covering various financial fields in a multifaceted way, and updating tens of thousands of data points and information on a daily basis. Therefore, this paper chooses the stock bar under this website as the source of the dataset for sentiment analysis.
The SSEC is one of the most authoritative stock indexes in China. This is because companies listed on the Shanghai Stock Exchange, which are usually industry mainstays or even industry leaders, can have their stock prices change in a way that reflects broad changes in China’s stock market and can also influence the stock market. Therefore, the image data added to the model by this paper was plotted from the SSEC stock data.

4.2. Data Preprocessing

We use crawling techniques in Python to collect posts from the comments section of each stock bar. We collected all posts in the stock bar of each of the 118 stocks for every day from 1 January 2022 to 2 April 2024. These posts contain nearly 10 million comments, which we processed into text form and then fed to SnowNLP to obtain a sentiment analysis index. SnowNLP is a Python library, inspired by TextBlob, written to make Chinese text easier to process; it uses Bayesian machine learning methods trained on text and has a high accuracy rate on Chinese. When a piece of text is fed into SnowNLP, the output is a number in the interval [0, 1]; the closer the number is to 1, the more positive the sentiment expressed in the text. In this paper, we use the sentiment analysis module of SnowNLP to process the posts of the stock bars in our dataset. We define the average of the sentiment analysis indexes for each day for each of the 118 stocks as h_c^t, where c denotes the company and t denotes the date. Suppose a company is called A and today's date is T. We collect all the comments under A's stock bar on day T; suppose there are 100 of them. We feed these 100 comments into SnowNLP and obtain 100 sentiment indexes, which we then average; this average is h_A^T. The specific process is shown in Figure 5: we collect a company's comments from its stock bar, feed them into SnowNLP, and the average of the sentiment indexes of all comments from company c on day t is h_c^t.
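The computation of h_c^t described above — feeding one day's comments for one stock into SnowNLP and averaging the scores — can be sketched as follows; the data structure holding the crawled comments and the neutral fallback value for days without comments are assumptions.

```python
from snownlp import SnowNLP

def daily_sentiment_index(comments):
    """comments: list of comment strings for one stock on one day.
    Returns h_c^t, the average SnowNLP sentiment score in [0, 1]."""
    if not comments:
        return 0.5          # assumed neutral value when a day has no comments
    scores = [SnowNLP(text).sentiments for text in comments]
    return sum(scores) / len(scores)
```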
Among these 118 stocks, some have missing data and some have too few comments, so we excluded them. We eliminated 28 stocks with a low number of comments and then processed, for each remaining stock, the daily average of all sentiment analysis indexes (the As column) together with the raw stock data into a table. Part of one such table is shown in Table 2 (using stock 000001 as an example). A DataFrame represents a table and is a spreadsheet-like data structure: an ordered collection of columns, each of which can hold a different value type, with both row and column indexes; it can be viewed as a dictionary of Series. The input to the method proposed in this paper is such a DataFrame, and Table 2 shows what the inputs look like.
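As an illustration of this input format, a few rows of the per-stock DataFrame could be assembled as follows; the values are taken from Table 2, and the construction itself is only a sketch of the data layout.

```python
import pandas as pd

# A few rows of the per-stock input table (stock 000001, cf. Table 2).
df_000001 = pd.DataFrame(
    {
        "Open":   [16.48, 16.58, 17.11],
        "High":   [16.66, 17.22, 17.27],
        "Low":    [16.18, 16.55, 17.00],
        "Close":  [16.66, 17.15, 17.12],
        "Volume": [116_925_933, 196_199_817, 110_788_519],
        "As":     [0.3639, 0.3758, 0.3881],
    },
    index=pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-06"]),
)
```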

4.3. Comparison Experiment

We divide the 90 stocks into three datasets, called Group A, Group B, and Group C, with 30 stocks in each. As shown in Figure 6, we use the data from 2022 and 2023 as the training set, the data from January 2024 as the validation set, and the data from February to March 2024 as the test set.
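The date-based split in Figure 6 can be expressed directly on a date-indexed DataFrame of the kind shown in Table 2; the function below is a sketch under that assumption.

```python
def split_by_date(df):
    """Train on 2022-2023, validate on January 2024, test on February-March 2024."""
    train = df.loc["2022-01-01":"2023-12-31"]
    valid = df.loc["2024-01-01":"2024-01-31"]
    test  = df.loc["2024-02-01":"2024-03-31"]
    return train, valid, test
```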
For all the comparison experiments that follow, all methods except MDOSA (in subsequent experiments, we call the model proposed in this paper MDOSA) used the model structures and hyperparameters from their original papers, but were kept consistent with MDOSA in key parameters such as batch_size and n_updates. The hyperparameter settings for the MDOSA model are shown in Table 3.
Please note that we used Python version 3.7.13 and PyTorch version 1.13.1. If the versions of Python and PyTorch are different from those in this paper, there may be a large discrepancy with the training results in this paper.

4.3.1. Comparative Experiments with RL Algorithms

In Section 2.1.2 of this paper, we presented four policy gradient-based RL algorithms. Each of these four algorithms has its own advantages and has different areas of application. In order to find the algorithms that are most applicable to the dataset of this paper, we tested the four algorithms on three datasets. Please note that in this comparison experiment, we did not include image data in these four RL algorithms, only textual sentiment analysis data in the states.
We introduced the evaluation indicators in Section 2.2.2. Because of the randomness of RL training, we trained each algorithm six times and report, in the tables below, the mean and standard deviation of the three indicators over the six runs. For all three indicators, a smaller standard deviation is better; for the means, larger is better, except for the max drawdown, where smaller is better. Of the three indicators, the Sharpe ratio measures the excess return per unit of risk taken and represents the cost effectiveness of the investment, so it is the indicator we prioritize. The test results in Groups A, B, and C are shown in Table 4. The first place for each indicator is marked in red.
Since the full names of the indicators are long, in the first row of the tables we abbreviate the average of cumulative returns, the standard deviation of cumulative returns, the average of the Sharpe ratio, the standard deviation of the Sharpe ratio, the average of the max drawdown, and the standard deviation of the max drawdown as ACR, SDCR, ASR, SDSR, AMD, and SDMD, respectively. We continue to use the full names in the subsequent analysis.
We analyze the results as follows:
(1)
Cumulative returns: TD3 has the largest average of cumulative return in both Group A and Group C, and DDPG has the largest average of cumulative return in Group B. DDPG has the smallest standard deviation of cumulative returns in both Group A and Group C, and TD3 has the smallest standard deviation of cumulative returns in Group B. This shows that both DDPG and TD3 are able to have strong stability while obtaining high returns.
(2)
The Sharpe ratio: TD3 has the largest average of the Sharpe ratio in Group A, while DDPG has the largest average in both Groups B and C. DDPG also has the smallest standard deviation of the Sharpe ratio in all three datasets, whereas the standard deviation for TD3 is particularly large in Groups A and C. Given the complexity and erratic direction of stock market data, TD3 is therefore not a good choice for investors who want returns at low risk. As a result, DDPG performs best on the Sharpe ratio, the indicator we prioritize.
(3)
The max drawdown: The performance of the four algorithms in terms of the max drawdown is not very different. DDPG demonstrates a moderate level of competence in this area.
Thus, on a comprehensive basis, DDPG is the most appropriate algorithm.

4.3.2. Comparative Experiments on SRL Models

We have selected DDPG as the algorithm for agent training, and the SRL model is selected next. Please note that in this comparison experiment, we added not only textual sentiment analysis data but also image data.
In Section 3.2.1 of this paper, we introduced three SRL models. Each of these three models has its own merits, and in order to find out which model is most suitable for the dataset in this paper, we tested the three models on three datasets. We uniformly use DDPG as the training algorithm, while each SRL model uses a two-layer structure. The evaluation indicators are handled in the same way as in the RL algorithm comparison experiments. The test results in Groups A, B, and C are shown in Table 5. The first place for each indicator is marked in red.
We analyze the results as follows:
(1)
Cumulative returns: OFENet has the largest average of cumulative returns in all three datasets and has the smallest standard deviation of cumulative returns in both Group B and Group C. This shows that OFENet has strong stability while earning high returns.
(2)
The Sharpe ratio: OFENet has the largest average of the Sharpe ratio in all three datasets and has the smallest standard deviation of the Sharpe ratio in both Group B and Group C. This shows that OFENet is able to consistently achieve a high cost effectiveness of investment.
(3)
The max drawdown: OFENet also achieves a better score in this aspect of the max drawdown.
OFENet is clearly the most appropriate SRL model. Since we use the OFENet model for the SRL module and the DDPG algorithm for the RL algorithm, we call our model the multimodal DDPG model combining OFENet and sentiment analysis, or MDOSA for short.

4.3.3. Comparative Experiments of All Methods

In past work, many solutions to the investment portfolio management task have been designed using statistics-based methods. We therefore selected five statistics-based methods to compare with the MDOSA model [13,33]. After reviewing a large number of references, we found five such methods that are widely preferred: the best constant rebalanced portfolio, best stock, uniform buy and hold, the uniform constant rebalanced portfolio, and the universal portfolio, all of which are applicable to the stock investment portfolio management task. Research [13] uses the best constant rebalanced portfolio and the universal portfolio and shows experimentally that they perform well; research [33] uses best stock, uniform buy and hold, and the uniform constant rebalanced portfolio with good results. These five methods frequently appear as benchmarks in research on the stock investment portfolio management task, so we adopt them as the benchmark methods in this paper. A brief description of each is given below, followed by a small illustrative sketch of two of them:
(1)
Best constant rebalanced portfolio: using the given historical returns, find the optimal asset allocation weights.
(2)
Best stock: choose the best performing stock based on historical data and simply invest in that stock.
(3)
Uniform buy and hold: funds are evenly distributed at the initial moment and subsequently held at all times.
(4)
Uniform constant rebalanced portfolio: adjust the allocation of funds at every moment to always maintain an even distribution.
(5)
Universal portfolio: the returns of many kinds of investment portfolios are calculated based on statistical simulations, and the weighting of these portfolios is calculated based on the returns.
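As a minimal illustration of the two simplest baselines above, the sketch below computes the final assets of uniform buy and hold (UBH) and the uniform constant rebalanced portfolio (UCRP) from a matrix of price relatives; it is a sketch under these assumptions, not the exact benchmark implementation used in the experiments.

```python
import numpy as np

def ubh_and_ucrp(price_relatives, initial_asset=1_000_000.0):
    """price_relatives: (T, N) matrix, entry [t, i] = close_{t+1,i} / close_{t,i}."""
    T, N = price_relatives.shape
    w = np.full(N, 1.0 / N)

    # Uniform buy and hold: allocate evenly once, then never rebalance.
    ubh = initial_asset * (w * np.prod(price_relatives, axis=0)).sum()

    # Uniform constant rebalanced portfolio: rebalance to equal weights each step.
    ucrp = initial_asset
    for t in range(T):
        ucrp *= price_relatives[t] @ w
    return ubh, ucrp
```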
For convenience, we abbreviate these five methods as BCRP, BS, UBH, UCRP, and UP, respectively. In addition, we include all four RL models, DenseNet (based on DDPG), and D2RL (based on DDPG) in the comparison experiments. The test results for the three datasets are shown in Table 6, Table 7, and Table 8, respectively. Note that there is no randomness in the five statistics-based methods, so the standard deviation of all three indicators is zero for them. As before, the first place for each indicator is marked in red.
We analyze the results as follows:
(1)
Cumulative returns: MDOSA has the largest average of cumulative returns on all three datasets and has the smallest standard deviation of cumulative returns in Group B. DDPG has the smallest standard deviation of cumulative returns in both Group A and Group C. This shows that MDOSA has the highest yield and that both it and DDPG are strongly stable.
(2)
The Sharpe ratio: MDOSA has the largest average of the Sharpe ratio in all three datasets and has the smallest standard deviation of the Sharpe ratio in both Group B and Group C. This shows that MDOSA has the highest cost effectiveness of investment and has strong stability.
(3)
The max drawdown: MDOSA performs moderately well in this area.
MDOSA clearly has the best combined performance on the three datasets. To see the characteristics of each method more clearly, we visualize them with radar charts, shown for the three datasets in Figure 7, Figure 8, and Figure 9, respectively. All indicators have been max–min normalized. Additionally, for intuition, we multiply the smaller-is-better indicators by −1 so that all indicators are larger-is-better. Note that the standard deviations of the three indicators for the statistics-based methods are not processed.
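The normalization used for the radar charts — max–min scaling each indicator across methods and flipping the sign of smaller-is-better indicators — can be sketched as below; the function name and interface are illustrative.

```python
import numpy as np

def radar_normalize(values, smaller_is_better=False):
    """Max-min normalize one indicator across methods; flip smaller-is-better ones."""
    values = np.asarray(values, dtype=float)
    if smaller_is_better:
        values = -values            # multiply by -1 so that larger is better
    vmin, vmax = values.min(), values.max()
    return (values - vmin) / (vmax - vmin)
```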

5. Conclusions

In this paper, we develop a multimodal model for the stock investment portfolio management task (MDOSA). We use SRL to process the raw environmental information, and we feed the agent textual sentiment analysis data from stock reviews and image data representing the overall direction of the market. A policy gradient-based RL algorithm is chosen to train the agent. The biggest advantage of our method over previous methods is that it obtains high returns while maintaining strong stability, whereas most previous methods could not guarantee both at the same time. The method was tested on three datasets to verify this stability: some methods achieve high returns on one dataset but fail to do so on another, while our method obtains high returns on all three. In addition, our method is highly cost-effective: the Sharpe ratio represents the cost effectiveness of an investment portfolio, and our method has the largest Sharpe ratio on all three datasets. We compared MDOSA with 11 other methods, and MDOSA generally outperformed them.
The real-life constraint of transaction costs is considered in this paper, but other constraints, such as liquidity and bid–ask bounce, are not. Bid–ask bounce here refers to the adjustment phenomenon in which a stock price that has been falling continuously and too quickly eventually reverses back up to a certain level. These constraints can affect stock market trading; this is one of the limitations of the current research, and we will improve our methodology in future work to address it. The max drawdown measures a method's ability to resist risk in a market downturn. From the experimental results, MDOSA did not demonstrate the strongest risk resistance in terms of the max drawdown, and in future work we will continue to look for ways to strengthen it. In our next study, we will also consider the relationships between stocks, because doing so enhances a method's risk tolerance. There are various relationships between companies: some are competitive, some are mutually beneficial, and some are affiliations, so a change in one company's stock may change the direction of another company's stock. In future work, we will build a network of stock relationships to obtain stock prediction methods with better results.
Explainability, in a broad sense, means having access to enough comprehensible information when we need to understand or solve something. Explainable deep learning and explainable reinforcement learning are both subproblems of explainable artificial intelligence and are used to enhance human understanding of models. Since this paper uses DRL, a good explanation of the basis on which the model generates decisions would help investors understand our methodology. However, there is so far no unified explainability method for either explainable deep learning or explainable reinforcement learning. The issue remains challenging, but it certainly needs to be addressed. In future work, we will research explainability and try to explain clearly, from within the model, the basis on which the agent generates decisions.

Author Contributions

Conceptualization, S.D.; methodology, S.D.; software, S.D.; validation, S.D.; data curation, S.D.; writing—original draft preparation, S.D.; writing—review and editing, S.D. and H.S.; visualization, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw stock data used in this paper are available from https://www.eastmoney.com/. The stock comment data collected in this paper will be provided by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. The list of 118 companies.
Company Code | Company Code | Company Code | Company Code | Company Code | Company Code
000001 | 003816 | 600188 | 600919 | 601319 | 601825
000002 | 300059 | 600295 | 600926 | 601328 | 601838
000039 | 300122 | 600309 | 600989 | 601390 | 601857
000063 | 300750 | 600362 | 600999 | 601398 | 601868
000333 | 300760 | 600383 | 601006 | 601577 | 601881
000568 | 600000 | 600406 | 601009 | 601600 | 601888
000617 | 600015 | 600426 | 601012 | 601601 | 601898
000651 | 600016 | 600438 | 601066 | 601607 | 601899
000708 | 600018 | 600519 | 601077 | 601618 | 601916
000776 | 600019 | 600546 | 601088 | 601628 | 601919
000858 | 600025 | 600585 | 601138 | 601633 | 601939
000983 | 600028 | 600606 | 601166 | 601658 | 601985
001872 | 600030 | 600690 | 601169 | 601668 | 601988
001979 | 600036 | 600704 | 601186 | 601669 | 601995
002142 | 600048 | 600741 | 601211 | 601688 | 603288
002304 | 600050 | 600803 | 601225 | 601699 | 603993
002352 | 600089 | 600809 | 601229 | 601728 | 688981
002415 | 600104 | 600837 | 601238 | 601766 | 900948
002475 | 600153 | 600887 | 601288 | 601800
002714 | 600176 | 600900 | 601318 | 601818

References

  1. Bustos, O.; Pomares-Quimbaya, A. Stock market movement forecast: A systematic review. Expert Syst. Appl. 2020, 156, 113464. [Google Scholar] [CrossRef]
  2. Adebiyi, A.A.; Adewumi, A.O.; Ayo, C.K. Comparison of arima and artificial neural networks models for stock price prediction. J. Appl. Math. 2014, 2014, 614342. [Google Scholar] [CrossRef]
  3. Yan, X.; Guosheng, Z. Application of kalman filter in the prediction of stock price. In Proceedings of the 5th International Symposium on Knowledge Acquisition and Modeling (KAM 2015), London, UK, 27–28 June 2015; Atlantis Press: Amsterdam, The Netherlands, 2015; pp. 197–198. [Google Scholar]
  4. Adnan, R.M.; Dai, H.-L.; Mostafa, R.R.; Parmar, K.S.; Heddam, S.; Kisi, O. Modeling multistep ahead dissolved oxygen concentration using improved support vector machines by a hybrid metaheuristic algorithm. Sustainability. 2022, 14, 3470. [Google Scholar] [CrossRef]
  5. Zhang, Z.; Hong, W.-C. Application of variational mode decomposition and chaotic grey wolf optimizer with support vector regression for forecasting electric loads. Knowl. Based Syst. 2021, 228, 107297. [Google Scholar] [CrossRef]
  6. Zhang, Z.; Hong, W.-C. Electric load forecasting by complete ensemble empirical mode decomposition adaptive noise and support vector regression with quantum-based dragonfly algorithm. Nonlinear Dyn. 2019, 98, 1107–1136. [Google Scholar] [CrossRef]
  7. Adnan, R.M.; Dai, H.-L.; Mostafa, R.R.; Islam, A.R.M.T.; Kisi, O.; Heddam, S.; Zounemat-Kermani, M. Modelling groundwater level fluctuations by elm merged advanced metaheuristic algorithms using hydroclimatic data. Geocarto Int. 2023, 38, 2158951. [Google Scholar] [CrossRef]
  8. Adnan, R.M.; Mostafa, R.R.; Dai, H.-L.; Heddam, S.; Kuriqi, A.; Kisi, O. Pan evaporation estimation by relevance vector machine tuned with new metaheuristic algorithms using limited climatic data. Eng. Appl. Comput. Fluid Mech. 2023, 17, 2192258. [Google Scholar] [CrossRef]
  9. Mostafa, R.R.; Kisi, O.; Adnan, R.M.; Sadeghifar, T.; Kuriqi, A. Modeling potential evapotranspiration by improved machine learning methods using limited climatic data. Water 2023, 15, 486. [Google Scholar] [CrossRef]
  10. Adnan, R.M.; Mostafa, R.R.; Islam, A.R.M.T.; Kisi, O.; Kuriqi, A.; Heddam, S. Estimating reference evapotranspiration using hybrid adaptive fuzzy inferencing coupled with heuristic algorithms. Comput. Electron. Agric. 2021, 191, 106541. [Google Scholar] [CrossRef]
  11. Kumbure, M.M.; Lohrmann, C.; Luukka, P.; Porras, J. Machine learning techniques and data for stock market forecasting: A literature review. Expert Syst. Appl. 2022, 197, 116659. [Google Scholar] [CrossRef]
  12. Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; Dai, Q. Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 653–664. [Google Scholar] [CrossRef] [PubMed]
  13. Jiang, Z.; Xu, D.; Liang, J. A deep reinforcement learning framework for the financial portfolio management problem. arXiv 2017, arXiv:1706.10059. [Google Scholar]
  14. Li, Y.; Ni, P.; Chang, V. Application of deep reinforcement learning in stock trading strategies and stock forecasting. Computing 2020, 102, 1305–1322. [Google Scholar] [CrossRef]
  15. Yu, X.; Wu, W.; Liao, X.; Han, Y. Dynamic stock-decision ensemble strategy based on deep reinforcement learning. Appl. Intell. 2023, 53, 2452–2470. [Google Scholar] [CrossRef]
  16. Baker, M.; Wurgler, J. Behavioral corporate finance: An updated survey. In Handbook of the Economics of Finance; Elsevier: Amsterdam, The Netherlands, 2013; pp. 357–424. [Google Scholar]
  17. Rupande, L.; Muguto, H.T.; Muzindutsi, P.-F. Investor sentiment and stock return volatility: Evidence from the Johannesburg stock exchange. Cogent Econ. Financ. 2019, 7, 1600233. [Google Scholar] [CrossRef]
  18. Gite, S.; Khatavkar, H.; Kotecha, K.; Srivastava, S.; Maheshwari, P.; Pandey, N. Explainable stock prices prediction from financial news articles using sentiment analysis. PeerJ Comput. Sci. 2021, 7, e340. [Google Scholar] [CrossRef] [PubMed]
  19. Gumus, A.; Sakar, C.O. Stock market prediction by combining stock price information and sentiment analysis. Int. J. Adv. Eng. Pure Sci. 2021, 33, 18–27. [Google Scholar] [CrossRef]
  20. Mankar, T.; Hotchandani, T.; Madhwani, M.; Chidrawar, A.; Lifna, C. Stock market prediction based on social sentiments using machine learning. In Proceedings of the 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai, India, 5 January 2018; IEEE: New York, NY, USA, 2018; pp. 1–3. [Google Scholar]
  21. Rajendiran, P.; Priyadarsini, P. Survival study on stock market prediction techniques using sentimental analysis. Mater. Today Proc. 2023, 80, 3229–3234. [Google Scholar] [CrossRef]
  22. Shin, H.-G.; Ra, I. A deep multimodal reinforcement learning system combined with cnn and lstm for stock trading. In Proceedings of the 2019 International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Republic of Korea, 16–18 October 2019; IEEE: New York, NY, USA, 2019; pp. 7–11. [Google Scholar]
  23. Wang, Z.; Huang, B.; Tu, S.; Zhang, K.; Xu, L. DeepTrader: A deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. Proc. AAAI Conf. Artif. Intell. 2021, 35, 643–650. [Google Scholar] [CrossRef]
  24. Wang, J.; Zhang, Y.; Tang, K.; Wu, J.; Xiong, Z. Alphastock: A buyingwinners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19), Anchorage, AK, USA, 4–8 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1900–1908. [Google Scholar]
  25. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  26. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; PMLR: New York, NY, USA, 2014; pp. 387–395. [Google Scholar]
  27. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018; pp. 1587–1596. [Google Scholar]
  28. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  29. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  30. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  31. Ota, K.; Oiki, T.; Jha, D.; Mariyama, T.; Nikovski, D. Can increasing input dimensionality improve deep reinforcement learning? In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: New York, NY, USA, 2020; pp. 7424–7433. [Google Scholar]
  32. Sinha, S.; Bharadhwaj, H.; Srinivas, A.; Garg, A. D2rl: Deep dense architectures in reinforcement learning. arXiv 2020, arXiv:2010.09163. [Google Scholar]
  33. Lim, B.; Zohren, S.; Roberts, S. Enhancing time series momentum strategies using deep neural networks. arXiv 2019, arXiv:1904.04912. [Google Scholar]
Figure 1. Workflow of the overall methodology.
Figure 2. Connection of inputs and outputs for each layer of the three SRL models.
Figure 3. Three images generated from the SSEC.
Figure 4. The architecture of the model proposed in this paper.
Figure 5. The process of obtaining h_c^t.
Figure 6. Dataset segmentation.
Figure 7. Radar chart for all methods in Group A.
Figure 8. Radar chart for all methods in Group B.
Figure 9. Radar chart for all methods in Group C.
Table 1. Three evaluation indicators and six technical indicators.
Categories of Indicators | Name of Indicator | Meaning of Indicator
Evaluation indicators | Cumulative returns | The ratio of final and initial assets acquired over the entire period from inception to the end of the investment.
Evaluation indicators | Sharpe ratio | The ratio of the average rate of return to the standard deviation of the rate of return over the investment period.
Evaluation indicators | Max drawdown | The maximum value of the decline from any high point on the yield curve to its subsequent low point.
Technical indicators | MACD | Moving average convergence divergence
Technical indicators | RSI | Relative strength index
Technical indicators | CCI | Commodity channel index
Technical indicators | ADX | Average directional index
Technical indicators | DMI | Directional movement index
Technical indicators | SSO | Stochastic oscillator
Table 2. Partial presentation of data for stock 000001.
Date | Open | High | Low | Close | Volume | As
4 January 2022 | 16.48 | 16.66 | 16.18 | 16.66 | 116,925,933 | 0.3639
5 January 2022 | 16.58 | 17.22 | 16.55 | 17.15 | 196,199,817 | 0.3758
6 January 2022 | 17.11 | 17.27 | 17.00 | 17.12 | 110,788,519 | 0.3881
7 January 2022 | 17.10 | 17.28 | 17.06 | 17.20 | 112,663,070 | 0.3389
10 January 2022 | 17.29 | 17.42 | 17.03 | 17.19 | 90,977,401 | 0.3549
11 January 2022 | 17.26 | 17.54 | 17.14 | 17.41 | 158,199,940 | 0.3816
12 January 2022 | 17.41 | 17.45 | 16.90 | 17.00 | 150,216,355 | 0.3358
Table 3. Hyperparameter settings for MDOSA.
Hyperparameter | Meaning of Hyperparameter | Value
episode | Number of episodes in agent training | 20
gamma | The discount factor | 0.90
n_updates | Number of trainings per interaction | 5
batch_size | The batch size | 4
buffer_size | The capacity of the memory buffer | 10,000
learning_rate | The learning rate | 0.00001
tau | Soft update parameter for the target network | 0.005
srl_hidden_dim | SRL model hidden layer dimension | 1024
lstm_hidden_dim | LSTM model hidden layer dimension | 1024
Table 4. Comparison of test results of four RL algorithms in Groups A, B, and C.
Group | Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
A | DDPG | 0.0923 | 0.0086 | 4.2156 | 0.5440 | 0.0211 | 0.0043
A | PPO | 0.0922 | 0.0105 | 4.5951 | 0.9915 | 0.0243 | 0.0039
A | SAC | 0.0826 | 0.0106 | 3.9156 | 0.5628 | 0.0228 | 0.0032
A | TD3 | 0.0987 | 0.0163 | 4.7371 | 0.8736 | 0.0212 | 0.0058
B | DDPG | 0.0819 | 0.0142 | 3.7968 | 0.3591 | 0.0288 | 0.0048
B | PPO | 0.0750 | 0.0201 | 3.5134 | 0.8073 | 0.0338 | 0.0076
B | SAC | 0.0717 | 0.0161 | 3.4672 | 0.7080 | 0.0316 | 0.0041
B | TD3 | 0.0737 | 0.0091 | 3.5712 | 0.3637 | 0.0297 | 0.0052
C | DDPG | 0.0874 | 0.0033 | 3.9671 | 0.4071 | 0.0272 | 0.0046
C | PPO | 0.0847 | 0.0145 | 3.7884 | 0.6042 | 0.0249 | 0.0036
C | SAC | 0.0809 | 0.0162 | 3.7013 | 0.7186 | 0.0283 | 0.0091
C | TD3 | 0.0916 | 0.0251 | 3.9487 | 1.0485 | 0.0270 | 0.0036
Table 5. Comparison of test results of three SRL models in Groups A, B, and C.
Group | Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
A | DenseNet | 0.0947 | 0.0010 | 4.4698 | 0.4200 | 0.0226 | 0.0058
A | OFENet | 0.1056 | 0.0126 | 5.0163 | 0.4121 | 0.0198 | 0.0035
A | D2RL | 0.0946 | 0.0089 | 4.5565 | 0.3614 | 0.0211 | 0.0054
B | DenseNet | 0.0762 | 0.0174 | 3.7411 | 0.4801 | 0.0282 | 0.0055
B | OFENet | 0.0909 | 0.0078 | 4.0699 | 0.3524 | 0.0301 | 0.0089
B | D2RL | 0.0818 | 0.0115 | 3.8223 | 0.3653 | 0.0324 | 0.0072
C | DenseNet | 0.0865 | 0.0133 | 3.7859 | 0.4381 | 0.0310 | 0.0061
C | OFENet | 0.1002 | 0.0107 | 4.4552 | 0.3985 | 0.0251 | 0.0028
C | D2RL | 0.0921 | 0.0109 | 4.1309 | 0.4070 | 0.0283 | 0.0027
Table 6. Comparison of test results for all methods in Group A.
Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
MDOSA (ours) | 0.1056 | 0.0126 | 5.0163 | 0.4121 | 0.0198 | 0.0035
DDPG | 0.0923 | 0.0086 | 4.2156 | 0.5440 | 0.0211 | 0.0043
PPO | 0.0922 | 0.0105 | 4.5951 | 0.9915 | 0.0243 | 0.0039
SAC | 0.0826 | 0.0106 | 3.9156 | 0.5628 | 0.0228 | 0.0032
TD3 | 0.0987 | 0.0163 | 4.7371 | 0.8736 | 0.0212 | 0.0058
DenseNet | 0.0947 | 0.0010 | 4.4698 | 0.4200 | 0.0226 | 0.0058
D2RL | 0.0946 | 0.0089 | 4.5565 | 0.3614 | 0.0211 | 0.0054
BCRP | 0.0980 | 0 | 4.8263 | 0 | 0.0180 | 0
BS | 0.0960 | 0 | 4.5605 | 0 | 0.0174 | 0
UBH | 0.0975 | 0 | 4.6250 | 0 | 0.0177 | 0
UCRP | 0.0975 | 0 | 4.6245 | 0 | 0.0177 | 0
UP | 0.0978 | 0 | 4.6359 | 0 | 0.0177 | 0
Table 7. Comparison of test results for all methods in Group B.
Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
MDOSA (ours) | 0.0909 | 0.0078 | 4.0699 | 0.3524 | 0.0301 | 0.0089
DDPG | 0.0819 | 0.0142 | 3.7968 | 0.3591 | 0.0288 | 0.0048
PPO | 0.0750 | 0.0201 | 3.5138 | 0.8073 | 0.0338 | 0.0076
SAC | 0.0717 | 0.0161 | 3.4672 | 0.7080 | 0.0316 | 0.0041
TD3 | 0.0737 | 0.0091 | 3.5712 | 0.3637 | 0.0297 | 0.0052
DenseNet | 0.0762 | 0.0174 | 3.7411 | 0.4801 | 0.0282 | 0.0055
D2RL | 0.0818 | 0.0115 | 3.8223 | 0.3653 | 0.0324 | 0.0072
BCRP | 0.0710 | 0 | 3.4065 | 0 | 0.0335 | 0
BS | 0.0698 | 0 | 3.4996 | 0 | 0.0282 | 0
UBH | 0.0745 | 0 | 3.6312 | 0 | 0.0301 | 0
UCRP | 0.0744 | 0 | 3.6300 | 0 | 0.0301 | 0
UP | 0.0746 | 0 | 3.6380 | 0 | 0.0301 | 0
Table 8. Comparison of test results for all methods in Group C.
Model | ACR | SDCR | ASR | SDSR | AMD | SDMD
MDOSA (ours) | 0.1002 | 0.0107 | 4.4552 | 0.3985 | 0.0251 | 0.0028
DDPG | 0.0874 | 0.0033 | 3.9671 | 0.4071 | 0.0272 | 0.0046
PPO | 0.0847 | 0.0145 | 3.7884 | 0.6042 | 0.0249 | 0.0036
SAC | 0.0809 | 0.0162 | 3.7013 | 0.7186 | 0.0283 | 0.0091
TD3 | 0.0916 | 0.0251 | 3.9487 | 1.0485 | 0.0270 | 0.0036
DenseNet | 0.0865 | 0.0133 | 3.7859 | 0.4381 | 0.0310 | 0.0061
D2RL | 0.0921 | 0.0109 | 4.1309 | 0.4070 | 0.0283 | 0.0027
BCRP | 0.0803 | 0 | 3.7073 | 0 | 0.0331 | 0
BS | 0.0856 | 0 | 3.9714 | 0 | 0.0233 | 0
UBH | 0.0897 | 0 | 4.1020 | 0 | 0.0227 | 0
UCRP | 0.0897 | 0 | 4.1016 | 0 | 0.0227 | 0
UP | 0.0896 | 0 | 4.1018 | 0 | 0.0227 | 0

Du, S.; Shen, H. Reinforcement Learning-Based Multimodal Model for the Stock Investment Portfolio Management Task. Electronics 2024, 13, 3895. https://doi.org/10.3390/electronics13193895
