1. Introduction
Since 2003, global equity markets have nearly tripled in size, reaching a total market capitalization of USD 109 trillion. The largest stock market in the world is the U.S. stock market, with a total market capitalization of over USD 46.2 trillion in 2024. The U.S. stock market has matured over hundreds of years of development, and its regulatory system is sound and resistant to risk. The Chinese stock market, by contrast, has developed over a comparatively short period, its institutions are not yet fully mature, and it therefore lacks comparable stability. At the end of 2022, CCTV Finance Channel surveyed 764,600 retail investors and found that 92.51% of them lost money while only 4.34% made a profit. In 2022, 206 million retail investors lost CNY 76,000 per capita; in 2023, although the situation improved, retail investors still lost CNY 54,000 per capita. The stock market generates an enormous amount of data and is driven by many factors, and most market participants are retail investors who lack the expertise to extract useful information from it directly, which makes it difficult for them to profit. Retail investors need a stable, profitable model to guide their decisions, and stock exchanges can use stock price prediction to obtain more information for regulating the market. Predicting the direction of stock prices has therefore become increasingly important, leading many scholars to devote themselves to stock prediction research. In the past, a single model could capture only one aspect of the available information while ignoring the rest, and such methods cannot predict the stock market well; combining research results from different fields to build stock prediction models has thus become a trend.
Most previous stock forecasting methods treat the stock problem as a time series modeling problem and use statistical methods to solve it [1,2]. In recent years, as artificial intelligence has flourished, machine learning-based methods have proven to give better results on time series problems [3,4,5]. However, due to the volume and complexity of stock market data, traditional machine learning methods cannot consistently and accurately predict the direction of the stock market. Deep reinforcement learning (DRL) is another approach to building quantitative investment strategies. In 2015, DeepMind [6] combined deep learning (DL) and reinforcement learning (RL) in video games to achieve results beyond human level. This marked the beginning of DRL; since then, DRL has been applied to various fields and has shown strong learning ability and adaptability. In 2016, DeepMind applied DRL to machine gaming with its development of AlphaGo [7], which defeated the world Go champion. The follow-up AlphaZero [8] then beat AlphaGo through self-play. DRL algorithms have also been applied in a variety of fields, such as autonomous intelligent machines [9,10,11,12], text generation [13,14], text games [15], autonomous driving [16], and target localization [17], among others. Research in references [18,19,20,21,22] applies DRL to stock prediction, and the experimental results in these papers demonstrate the feasibility of constructing stock prediction models with the DRL method.
Research [18] applied deep reinforcement learning to global stock markets, greatly improving returns over previous methods. Research [19] applied a deep Q-network and a deep recurrent Q-network to stock trading, constructing an end-to-end daily trading system that decides whether to buy or sell on each trading day. Research [20] proposed a theory for applying deep reinforcement learning to stock trading decisions and stock price prediction, demonstrating the reliability and usability of the model with experimental data. Research [21] proposed a multi-layer, multi-ensemble stock trader based on deep neural networks and a meta-learner. Research [22] proposed two methods for stock trading decisions: first, a nested reinforcement learning approach built on three deep reinforcement learning models, and second, a weight random selection with confidence strategy. None of these methods use the Chinese stock market as their dataset, and stock prediction models based on the Chinese market remain scarce. Research [23] proposed a deep neural network model for predicting stock price movements, using knowledge graph and graph embedding techniques to select the stocks related to the target stock and construct market and trading information. Research [24] treated stock price charts as images and used a deep learning neural network for image modeling; the proposed method can predict short-term stock price movement. The datasets of research [23] and research [24] both come from the Chinese stock market.
The rapid rise of the Internet has led people to browse information and express opinions on major platforms every day. Most of these opinions carry positive or negative emotion, which in turn influences the emotions of readers. Remarks posted online by stock investors can sway the judgment of retail investors and thereby affect stock prices [25,26,27,28,29]. Research [30] uses the advanced NLP technique BERTopic to analyze topic sentiment derived from stock market comments, combines sentiment analysis with various deep learning models, and demonstrates through its results that adding sentiment analysis significantly improves the performance of these models. Thus, collecting such comments and analyzing their sentiment can help predict the future direction of stock prices; in other words, adding sentiment analysis to stock prediction methods can improve them.
In this paper, deep reinforcement learning and sentiment analysis are applied to the Chinese stock market. We combine deep reinforcement learning methods and sentiment analysis techniques to propose two new methods. These two methods are called SADQN-R (Sentiment Analysis Deep Q-Network-R) and SADQN-S (Sentiment Analysis Deep Q-Network-S) respectively. The contributions can be summarized as follows:
Data Innovation: Most datasets for stock prediction models come from the U.S. stock market, and there are few stock prediction models for China. However, there are substantial differences between the Chinese and U.S. stock markets, so models applicable to other countries are not necessarily applicable to China. To obtain stable gains in the Chinese stock market, it is necessary to construct stock prediction models based on Chinese datasets. In this paper, 118 stocks are selected as the dataset; these 118 stocks ranked among the top 150 stocks in China for two consecutive years, 2022 and 2023. The final experimental results show that our model can profit in the Chinese stock market.
Method Innovation: Few studies have combined DRL and sentiment analysis to form stock prediction models. This paper uses the Q-learning algorithm based on convolutional neural networks to train stock prediction models, and adds sentiment indices as rewards (R) and states (S) into DQN, respectively, to obtain two models, SADQN-R (Sentiment Analysis Deep Q-Network-R) and SADQN-S (Sentiment Analysis Deep Q-Network-S). We tested the trained SADQN-R and SADQN-S on the test set and compared them with several other methods, and the results show that SADQN-S has the best performance among all methods.
Application Innovation: Most previous stock prediction methods use a stock's historical data to predict that stock's future direction, but newly listed stocks have no historical data and cannot be predicted accurately this way. In this paper, the training set and test set come from different stocks. The test results show that our model can achieve high returns when applied to newly listed stocks.
This paper consists of seven chapters. The first chapter is the introduction, in which we present the applications of DRL in various fields, some DRL-based stock prediction methods, and, at the end, the innovations of this paper. The second chapter is related work, where we position our research within the field and describe how it differs from prior work. The third chapter is the preliminaries, where we introduce convolutional neural networks (CNN), DRL, and sentiment analysis. The fourth chapter presents the models: we explain the modeling of the reinforcement learning task in detail, illustrate the network setup for deep learning, and describe the methods proposed in this paper. The fifth chapter covers the experiments: we introduce the dataset and conduct comparative experiments, first comparing DQN and DDQN and then comparing several baseline methods with ours; the results show that our method achieves the highest gains on both test sets. The sixth chapter is the discussion, where we introduce the classification of stock prediction methods and discuss the advantages of our method over previous ones. The seventh chapter is the conclusion, where we discuss the shortcomings of this paper and the outlook for future work.
2. Related Work
With the continuous progress of DRL theory, DRL methods are increasingly used in stock prediction. Research [31] applied the combination of DL and RL to stock market trading for the first time, proposing a multilayer perceptron to enhance the feature extraction capability of recurrent neural networks and offering many DL engineering tips. Research [32] proposed a DRL trading framework centered on an ensemble of identical independent evaluators (EIIE). EIIE treats a neural network as an integration unit, processing the historical information of each asset with networks sharing the same parameters and finally integrating the results across all assets to generate the asset allocation weights. The paper also proposes efficient vector storage methods and training approaches suitable for both offline and online settings. Research [33] proposed a multimodal DRL method that combines CNN and LSTM as a feature extraction module and uses DQN as the trading decision model; stock timing information is used to generate three types of images, which the combined CNN-LSTM module processes. Research [34] proposed the DeepTrader framework, adding techniques such as TCN, GCN, and spatial attention mechanisms to research [35]; it improved the feature extraction of stock price components from temporal and spatial perspectives, respectively, and added market factors to compute market sector sentiment. As the advantages of DRL for solving stock prediction problems have become widely recognized, it has been discussed in many review articles. Research [36] discusses DRL modeling of stock prediction problems from several perspectives, such as problem classification, risk assessment, environment modeling, and model selection. Research [37] summarizes some of the more effective methods. Combining these reviews, DRL-based stock prediction methods can be broadly classified into three categories: critic-only, actor-only, and actor-critic.
Most previous studies use the US stock market as the dataset, and few methods use the Chinese stock market. China has a large number of shareholders and vast stock data, and Chinese retail investors need stock prediction methods that help them earn stable, high returns. It is therefore necessary to establish a method that uses the Chinese stock market as the dataset, which is what our research does. More and more scholars are proposing excellent stock prediction methods, yet few target newly listed stocks, which have no historical data and cannot be fitted or trained on past data to obtain their direction. This paper addresses newly listed stocks for which no past data are available. The relevant parameters in this paper are shown in Table A1.
3. Preliminaries
3.1. Neural Networks
A neural network is a computational model that mimics the way that the human brain works. It is fundamental to the fields of deep learning and machine learning. Neural networks consist of a large number of neurons that are interconnected in a network that can process complex data inputs and perform a variety of tasks. The basic components of a neural network consist mainly of neurons, hierarchies, weights, biases, and activation functions. These components work together to enable neural networks to learn and model complex non-linear relationships. Neural networks continuously improve their ability to recognize and process data by learning and adjusting connection weights. With learning, neural networks are able to show better and better performance on a variety of tasks such as image recognition, language understanding, and game play. Common neural networks include CNN, RNN, long short-term memory (LSTM), gated recurrent unit (GRU), generative adversarial networks (GAN), and so on.
3.2. CNN
CNN is a feedforward neural network. The default input to a CNN is an image, which allows us to encode specific properties into the structure of the network, making the feedforward function more efficient and reducing the number of parameters. A CNN usually contains the following layers: convolutional layer, rectified linear unit layer, pooling layer, and fully connected layer. The "convolution" in CNN is a mathematical operation that processes image data through convolutional kernels. These kernels are smaller than the input image and cover a partial region of it; the kernel generates an element of a new feature map by element-wise multiplication and summation of the pixel values in that region. The rectified linear unit (ReLU) is a very common and important activation function in deep learning and is widely used in numerous neural network models. The mathematical expression for ReLU is:

$$\mathrm{ReLU}(x) = \max(0, x) \tag{1}$$

This form allows the ReLU function to maintain a gradient of 1 for $x > 0$, which is an important advantage: gradient vanishing is a challenge faced during the training of deep networks, and the ReLU function alleviates this issue to some extent.
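As a quick illustration, ReLU and its gradient can be written in a few lines (a minimal sketch; `relu_grad` adopts the common convention of a zero gradient at $x = 0$, where the function is not differentiable):

```python
def relu(x):
    # ReLU(x) = max(0, x)
    return max(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0; we adopt 0 at x = 0.
    return 1.0 if x > 0 else 0.0

print([relu(v) for v in (-2.0, 0.0, 3.5)])       # [0.0, 0.0, 3.5]
print([relu_grad(v) for v in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 1.0]
```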
The purpose of pooling is to shrink the feature map; commonly used pooling methods include max pooling and mean pooling. A fully connected neural network is more precisely called a fully connected feedforward neural network: "fully connected" means that every neuron in one layer is connected to every neuron in the previous layer, and "feedforward" means that once the input signal enters the network, it flows in one direction with no feedback loops.
3.3. Deep Reinforcement Learning
3.3.1. Basic Knowledge
RL is a computational approach to understanding and automating goal-directed learning and decision making. It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment [38]. Mathematically, the RL problem is an idealized Markov decision process (MDP), a theoretical framework for achieving goals through interactive learning. In RL, the agent is responsible for learning and making decisions, and everything outside the agent that interacts with it is called the environment. The agent is in a "state" at every moment of its continuous exploration, and the interaction between the agent and the environment is called an "action"; in other words, an "action" is an action the agent can take. The agent and the environment interact over a series of discrete time steps, denoted by $t$, $t = 0, 1, 2, 3, \ldots$. The interaction between the agent and the environment is continuous: in the current state ($S_t$), the agent chooses an action ($A_t$); the environment responds to the action, presents a new state ($S_{t+1}$) to the agent, and generates a reward ($R_{t+1}$). The goal of the agent is to maximize the cumulative reward it receives in the long run. The cumulative reward refers to the return ($G_t$):

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \tag{2}$$
In Equation (2), $t$ refers to the current moment and $T$ is the moment when the interaction between the agent and the environment ends.
However, in many cases, the interaction between the agent and the environment continues without restriction, in which case $G_t$ does not necessarily converge. We therefore introduce another concept, called discounting. Adding the discount rate allows $G_t$ to converge even when the interaction goes on indefinitely, as shown below:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{3}$$

where $\gamma$ is a parameter, $0 \le \gamma \le 1$, called the discount rate.
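Equation (3) can be checked numerically; the short sketch below computes the discounted return of a finite reward sequence (the reward values and discount rates are illustrative, not from the paper):

```python
def discounted_return(rewards, gamma):
    # G_t = sum_k gamma^k * R_{t+k+1}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, 1.0))  # undiscounted: 4.0
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
```

With $\gamma < 1$, later rewards contribute geometrically less, which is what bounds $G_t$ for an infinite horizon.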
The value function is divided into the state-value function and the action-value function. The state-value function estimates the expected return of the agent in the current state; the action-value function estimates the expected return of the agent for the current state-action pair. The magnitude of the expected return depends on the actions chosen by the agent, so value functions are defined with respect to particular ways of acting, called policies ($\pi$). A policy is a mapping from states to the probabilities of choosing each possible action in that state. We denote the value function of state $s$ under policy $\pi$ as $v_\pi(s)$ (state-value function):

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \tag{4}$$

We denote the value of taking action $a$ in state $s$ under policy $\pi$ as $q_\pi(s, a)$ (action-value function). The mathematical expression is shown below:

$$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] \tag{5}$$
Finite MDP problems can be well solved by methods based on value functions such as dynamic programming (DP), Monte Carlo methods (MC), and temporal-difference learning (TD). However, most of the real-life RL problems are continuous with infinite states, so the methods based on value functions cannot evaluate the value function well.
To solve this problem, researchers proposed using function approximation for prediction, parameterizing the value function. Although this improved on traditional RL, experimental results showed that performance was still poor on complex problems; it was not until the introduction of DL that complex RL problems were solved well. Over the years, DL has developed rapidly, and with the excellent feature representation capability of deep neural networks it has solved many difficult problems in academia and industry and achieved important research results. DRL combines the framework of DL with the ideas of RL, and with the powerful feature representation of DL, RL technology has become truly practical. DeepMind published a paper in Nature in 2015 [6] proposing a DRL algorithm called DQN, which combines Q-learning from RL with DL; in experiments on Atari games it achieved better results than human players. Since then, DRL has been booming.
3.3.2. DQN and DDQN
DQN (Algorithm A1, shown in Appendix A), one of the most classical DRL algorithms, has been applied in several fields with good results. In RL, $Q(s, a)$ is called the action-value function and represents the expected future return when the agent is in state $s$ and takes action $a$. Using the Q-learning algorithm, we can obtain a Q-table (Table 1), which contains the Q-values for taking each action $a$ ($a \in A$) in every state $s$ ($s \in S$). After initializing the Q-table, the Q-learning algorithm uses the Bellman equation to update the Q-table iteratively during the interaction between the agent and the environment until convergence. Before explaining the DQN algorithm further, we explain the Bellman equation. An MDP can be modeled using Equation (6):

$$p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\} \tag{6}$$
where $p$ refers to the probability, $s$ refers to the state at a particular moment, $s'$ refers to the state at the next moment, $a$ is the action performed at that moment, and $r$ is the reward that the agent receives for performing this action. Equation (7) can then be obtained by expanding Equation (4):

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s' \in S} \sum_{r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right] \tag{7}$$

where $S$ is the state space, $s$ refers to the state at a particular moment, $s'$ refers to the state at the next moment, $a$ is the action performed at that moment, and $r$ is the reward that the agent receives for performing this action. The last line of Equation (7) is the Bellman equation for $v_\pi$.
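To make the Bellman equation concrete, the sketch below runs iterative policy evaluation on a toy two-state MDP with a single action (the transitions, rewards, and discount rate are invented for illustration); the fixed point of the sweep satisfies Equation (7):

```python
# Toy MDP: state 0 moves to state 1 with reward 1; state 1 stays in
# state 1 with reward 0. One action, deterministic transitions.
gamma = 0.9
# p[s] = list of (next_state, reward, probability) under the single action
p = {0: [(1, 1.0, 1.0)], 1: [(1, 0.0, 1.0)]}

v = {0: 0.0, 1: 0.0}
for _ in range(100):  # sweep the Bellman backup until convergence
    v = {s: sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[s]) for s in p}

print(round(v[0], 6), round(v[1], 6))  # v(1) = 0, so v(0) = 1 + 0.9 * 0 = 1.0
```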
With the final Q-table obtained, the agent can determine which action to take in a given state $s$ to obtain the maximum Q-value. The iteration is as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] \tag{8}$$

where $\alpha$ is the learning rate, which specifies the magnitude of the update, and $\gamma$ is the discount rate, which ensures that the iteration can converge. The $r$ in Equation (8) is the reward given to the agent by the environment after the agent performs the action. However, most real-life RL problems possess infinite states, and such problems cannot be solved using a Q-table. To solve this type of problem, we can parameterize the value function and approximate $Q(s, a)$ by $Q(s, a; \theta)$. The function $Q(s, a; \theta)$ can be generated using a neural network, which we call a deep Q-network, where $\theta$ is the parameter on which the neural network is trained. The parameter update process is as follows:
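A single tabular Q-learning step following Equation (8) can be sketched as follows (the transition, learning rate, and discount rate are illustrative values):

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    # Equation (8): Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q

# Q-table: Q[state][action], initialized to zero for 2 states x 2 actions
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
print(Q[0][1])  # 0.1 * (1.0 + 0.9 * 0.0 - 0.0) = 0.1
```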
Set up two networks, the behavior network $Q(s, a; \theta)$ and the target network $Q(s, a; \theta^-)$. The behavior network is responsible for controlling the agent and collecting experience; the target network is responsible for computing the target value $y_i$:

$$y_i = r + \gamma \max_{a'} Q(s', a'; \theta^-) \tag{9}$$

where $i$ is the number of iterations and $\gamma$ is the discount rate. The $r$ in Equation (9) is the reward given to the agent by the environment after the agent performs the action.

The parameters are updated using the gradient descent method. Taking the partial derivative of the squared-error loss $L_i(\theta) = \mathbb{E}\left[\left(y_i - Q(s, a; \theta)\right)^2\right]$ with respect to the parameter $\theta$ yields the following gradient:

$$\nabla_\theta L_i(\theta) = \mathbb{E}\left[\left(y_i - Q(s, a; \theta)\right) \nabla_\theta Q(s, a; \theta)\right]$$
During the update process, only the weight of the behavior network is updated and the weight of the target network remains unchanged. At regular intervals, the weights of the behavior network are copied to the target network so that the target network can also be updated. After the introduction of the target network, the target Q-value is kept constant over a period of time, which reduces the correlation between the current Q-value and the target Q-value to a certain extent and improves the stability of the algorithm.
Over the years, scholars have continued to improve DQN [39,40,41,42], designing new, more advanced algorithms with higher applicability based on it. The results show that while DQN may not match an improved algorithm in a specific area, it has always ranked highly in terms of combined scores across all applicable areas. However, DQN has an obvious drawback: overestimation of the Q-value, because it uses only the current Q-network both for action selection and for Q-estimation when updating Q-values. To address this problem, research [43] proposed using two networks, one for selecting actions and the other for estimating Q-values, calling the method DDQN. The model structure of DDQN is the same as that of DQN; the difference lies in the objective functions. The objective functions of the two are:

$$y_i^{\mathrm{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

$$y_i^{\mathrm{DDQN}} = r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^-\right)$$
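The difference between the two objectives shows up in a small numeric sketch (the Q-values below are invented): DQN both selects and evaluates the next action with the target network, while DDQN selects with the behavior network and evaluates with the target network, which damps overestimation:

```python
def dqn_target(r, gamma, q_target_next):
    # y = r + gamma * max_a' Q(s', a'; theta^-)
    return r + gamma * max(q_target_next)

def ddqn_target(r, gamma, q_behavior_next, q_target_next):
    # y = r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-)
    a_star = max(range(len(q_behavior_next)), key=lambda a: q_behavior_next[a])
    return r + gamma * q_target_next[a_star]

q_behavior_next = [1.0, 2.0, 0.5]  # behavior network prefers action 1
q_target_next = [1.5, 0.8, 3.0]    # target network overestimates action 2
print(round(dqn_target(0.1, 0.9, q_target_next), 6))                    # 2.8
print(round(ddqn_target(0.1, 0.9, q_behavior_next, q_target_next), 6))  # 0.82
```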
3.4. Sentiment Analysis
Natural language processing (NLP) is one of the hottest research areas at the moment, aiming to develop applications and services that can understand human language. An important branch of NLP is sentiment analysis, also known as opinion mining, which analyzes the emotions, opinions, and attitudes people express in text. The development and rapid rise of sentiment analysis was made possible by the rapid growth of social media on the web, such as online shopping software, short video software, WeChat (Version 8.0.50), and Weibo (Version 14.9.1); for the first time in history, opinions are being recorded digitally at such an enormous volume.
Sentiment analysis is usually divided into two steps: feature extraction and sentiment classification. Sentiment analysis techniques fall mainly into three types: DL-based methods, machine learning-based methods, and rule-based methods. DL-based methods use deep neural network models such as CNN and RNN to learn semantic information in text sequences; machine learning-based methods learn the correlation between text content and its emotional meaning through model training, with common methods including support vector machines, decision trees, and naive Bayes; rule-based methods rely on manually defined sentiment lexicons and grammar rules. SnowNLP is an open-source Python library, inspired by TextBlob and written to facilitate processing Chinese text, that trains on text using Bayesian machine learning methods. Its output lies between 0 and 1; the closer the number is to 1, the more positive the text sentiment. This paper uses SnowNLP's sentiment analysis module to process the stock forum posts in the dataset.
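We do not re-implement SnowNLP here, but the Bayesian idea behind its sentiment module can be sketched with a toy word-level naive Bayes scorer (the mini-corpus below is invented for illustration; SnowNLP itself is trained on a large corpus and exposes the score as `SnowNLP(text).sentiments`):

```python
import math

# Toy training corpus: (tokenized text, label); 1 = positive, 0 = negative.
corpus = [
    (["涨", "利好"], 1), (["大涨", "看多"], 1),
    (["跌", "利空"], 0), (["大跌", "看空"], 0),
]

def train(corpus):
    counts = {0: {}, 1: {}}   # word counts per label
    totals = {0: 0, 1: 0}     # total words per label
    vocab = set()
    for words, label in corpus:
        for w in words:
            counts[label][w] = counts[label].get(w, 0) + 1
            totals[label] += 1
            vocab.add(w)
    return counts, totals, vocab

def sentiment(words, counts, totals, vocab):
    # P(label | words) via Bayes' rule with Laplace smoothing,
    # returned as a score in (0, 1) like SnowNLP's `sentiments`.
    logp = {}
    for label in (0, 1):
        logp[label] = math.log(0.5)  # uniform prior
        for w in words:
            num = counts[label].get(w, 0) + 1
            den = totals[label] + len(vocab)
            logp[label] += math.log(num / den)
    pos, neg = math.exp(logp[1]), math.exp(logp[0])
    return pos / (pos + neg)

counts, totals, vocab = train(corpus)
print(sentiment(["涨", "看多"], counts, totals, vocab) > 0.5)  # True: positive
print(sentiment(["跌", "利空"], counts, totals, vocab) < 0.5)  # True: negative
```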
4. Methodology
For the subsequent experiments, we chose DQN as the training algorithm for the agent, so we call the two models proposed in this paper Sentiment Analysis Deep Q-Network-R (SADQN-R) and Sentiment Analysis Deep Q-Network-S (SADQN-S).
4.1. Modeling the Environment for RL Task
Most previous stock methods use raw stock data as model inputs, and few studies have used images as inputs for stock prediction. Research [44] used images as inputs and improved the performance of the model. Research [18] applied DRL to stock prediction using images as the model input and greatly improved the model's return. We therefore believe that further research on using images directly as input is warranted.
We call the input State the State-0-1 matrix. The State-0-1 matrix consists of the closing price and volume after max-min normalization. The closing price of a stock is the price at which the last trade in that stock was made on a business day; if there are no transactions that day, the previous business day's closing price is used as the closing price of the day. Stock volume is the number of shares traded between buyers and sellers of a stock, counted unilaterally: for example, a volume of 100,000 shares means the buyers bought 100,000 shares while the sellers sold 100,000 shares.
The State-0-1 matrix (Algorithm A2, shown in Appendix A) is an $n$-order square matrix of 0s and 1s (the $n$ columns represent data for the last $n$ days), and the matrix is divided into upper, middle, and lower matrices. The middle matrix (2 rows and $n$ columns) is all zeros and is used to help the agent distinguish between closing price and volume. The upper matrix ($m$ rows and $n$ columns, where $m = (n - 2)/2$) encodes the closing price: the normalized closing price is divided into $m$ intervals; if the closing price on day $j$ falls in the largest interval, the first row of column $j$ is 1 and each of the remaining entries of column $j$ is 0; if it falls in the second-largest interval, the second row of column $j$ is 1 and the rest of column $j$ is 0; and so on, until, if it falls in the smallest interval, the $m$th row of column $j$ is 1 and the rest of column $j$ is 0. The lower matrix ($m$ rows and $n$ columns) encodes the volume and is generated in the same way as the upper matrix. In this paper, we set $n$ to 32. The State-0-1 matrix is shown in Figure 1 (using the State-0-1 matrix for stock #601318 on 22 February 2024 as an example, $n = 32$). We use blue for the number 1 and white for the number 0.
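The construction above can be sketched in pure Python as follows (assuming $n = 32$, hence $m = 15$ intervals; the function and variable names are ours, not those of the paper's Algorithm A2):

```python
def to_intervals(values, m):
    # Max-min normalize to [0, 1], then map each value to one of m intervals:
    # index 0 = largest interval (top row), m - 1 = smallest interval.
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    rows = []
    for v in values:
        norm = (v - lo) / span
        idx = min(int((1.0 - norm) * m), m - 1)  # row index from the top
        rows.append(idx)
    return rows

def state_01_matrix(closes, volumes, n=32):
    # n-order square matrix: m price rows, 2 all-zero rows, m volume rows.
    m = (n - 2) // 2
    assert len(closes) == len(volumes) == n
    matrix = [[0] * n for _ in range(n)]
    for col, row in enumerate(to_intervals(closes, m)):
        matrix[row][col] = 1             # upper matrix: closing price
    for col, row in enumerate(to_intervals(volumes, m)):
        matrix[m + 2 + row][col] = 1     # lower matrix: volume
    return matrix

closes = [float(i) for i in range(32)]        # toy ascending prices
volumes = [float(32 - i) for i in range(32)]  # toy descending volumes
S = state_01_matrix(closes, volumes)
print(sum(sum(row) for row in S))  # 64: one 1 per column in each half
```

Each column carries exactly two 1s (one price row, one volume row), and rows 15-16 stay zero as the separator band.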
The State of company $i$ at moment $t$ is denoted as $S_t^i$ ($i$ denotes the company and $t$ denotes the date). The three actions are long ($A_t^i = 1$), neutral ($A_t^i = 0$), and short ($A_t^i = -1$), where $i$ denotes the company and $t$ denotes the date.
DQN, DDQN, SADQN-R, and SADQN-S all have different reward settings, as described in Section 4.3.
Input the stock's data for the past 32 days to the agent, which outputs a vector of value functions $Q_t^i$ ($i$ denotes the company and $t$ denotes the date). The value function vector $Q_t^i$ is a three-dimensional column vector whose three values approximate the value functions of the three actions. We can profit the next day by choosing the action with the largest value function. When investing in stocks, investing in only one stock is too risky, so it is necessary to set up a portfolio invested according to weights $w_t^i$. Assuming the total number of stocks in the portfolio is $N$, at moment $t$ our model processes the $N$ stocks simultaneously and constructs the weights from the output value function vectors $Q_t^i$. A weight is positive if the value function of the long action is the largest and negative if the value function of the short action is the largest. In order to have an intuitive result, we set the total assets to 1. A positive weight represents executing a long action; a negative weight represents executing a short action. To reduce investment risk, we make the daily long position equal to the short position, so all weights sum to 0. In addition, since the total assets are 1, the absolute values of the weights sum to 1. That is, the weights satisfy:

$$\sum_{i=1}^{N} w_t^i = 0, \qquad \sum_{i=1}^{N} \left|w_t^i\right| = 1$$

If $w_t^i > 0$, it indicates a long position of $\left|w_t^i\right|$; if $w_t^i < 0$, it indicates a short position of $\left|w_t^i\right|$ (Algorithm A3, shown in Appendix A).
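One way to realize these constraints is sketched below (our own construction, not necessarily the paper's Algorithm A3: each stock's sign is chosen by its best action, and the long and short halves are each normalized so the weights sum to 0 and their absolute values sum to 1):

```python
def build_weights(q_vectors):
    # q_vectors[i] = [q_long, q_neutral, q_short] for stock i.
    # Sign: +1 if long has the largest value, -1 if short does, 0 if neutral.
    signs = []
    for q_long, q_neutral, q_short in q_vectors:
        best = max(q_long, q_neutral, q_short)
        if best == q_long:
            signs.append(1)
        elif best == q_short:
            signs.append(-1)
        else:
            signs.append(0)
    n_long, n_short = signs.count(1), signs.count(-1)
    if n_long == 0 or n_short == 0:
        return [0.0] * len(signs)  # cannot balance long and short sides
    # Half the assets long, half short: sum(w) = 0 and sum(|w|) = 1.
    return [0.5 * s / (n_long if s > 0 else n_short) if s else 0.0
            for s in signs]

qs = [[0.9, 0.1, 0.2], [0.1, 0.2, 0.8], [0.7, 0.2, 0.1], [0.3, 0.9, 0.4]]
w = build_weights(qs)
print(w)                                # [0.25, -0.5, 0.25, 0.0]
print(sum(w), sum(abs(x) for x in w))   # 0.0 and 1.0
```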
The initial asset in this paper is 1. The agent predicts the direction of all stocks for the next day, assigning a weight to each stock. The environment then performs sell and buy operations on each stock in turn based on the weights. The environment moves from moment $t$ to moment $t+1$, at which point the total assets are updated at the new closing prices, and the agent's goal is to maximize the total assets at each moment. This process continues until the last day of the investment period. Since some transaction costs are incurred when trading stocks in reality, this paper sets a transaction fee of 5 per thousand (0.5%).
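A single step of the asset update can be sketched as follows (our simplification: each weight earns its stock's next-day percentage change with the weight's sign, and the 0.5% fee is charged on the traded value; the numbers are illustrative):

```python
def step_assets(assets, weights, pct_changes, fee=0.005):
    # Long weights gain when the price rises, short weights gain when it
    # falls; the fee is charged on the total traded value.
    gross = sum(w * z for w, z in zip(weights, pct_changes))
    traded = sum(abs(w) for w in weights)  # equals 1 under the constraints
    return assets * (1.0 + gross - fee * traded)

weights = [0.25, -0.5, 0.25, 0.0]
pct_changes = [0.02, -0.01, 0.00, 0.03]  # next-day percentage price changes
print(round(step_assets(1.0, weights, pct_changes), 6))  # 1.005
```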
4.2. Network Settings for Deep Learning
Since the input states are two-dimensional, CNN is chosen as the neural network structure for agent training. The core idea of CNN is to process the input data through convolutional kernels in the convolutional layer as a way to perform feature extraction. A CNN contains convolutional layers and pooling layers; to reduce the amount of computation, the pooling layer shrinks the input image, discarding pixel information and keeping only the important information. The information contained in the stock trading market is huge and intricate, and appropriately increasing the number of layers can help the agent learn more information, so we set up the CNN with 6 hidden layers. In addition, research [18] showed experimentally that using an input of size 32 × 32 × 1 for the CNN gives the agent the best training results.
Therefore, based on the previous work, we set up the CNN network structure in this paper. The input size of the CNN is 32 × 32 × 1. CNN has 6 hidden layers, the first 4 are convolutional layers and the last 2 are fully connected layers (FC layers). Each convolutional layer and the first FC layer are followed by Rectified Linear Unit (ReLU). The four convolutional layers consist of 16 convolutional kernels of size 5 × 5 × 1, 16 convolutional kernels of size 5 × 5 × 16, 32 convolutional kernels of size 5 × 5 × 16 and 32 convolutional kernels of size 5 × 5 × 32, respectively. The output size of the CNN is 3 × 1.
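Since the paper does not state the padding or strides, one consistent reading is "same" convolutions (stride 1, padding 2 for 5 × 5 kernels), which keeps the 32 × 32 spatial size through the stack. The sketch below walks the assumed shapes through the layers; the first FC layer's width is also our assumption:

```python
def conv2d_out(size, kernel=5, stride=1, padding=2):
    # Standard convolution output size: floor((size + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

size, channels = 32, 1
for out_channels in (16, 16, 32, 32):   # the four convolutional layers
    size = conv2d_out(size)
    channels = out_channels
    print(f"conv -> {size}x{size}x{channels}")

flat = size * size * channels           # input width of the first FC layer
print(f"flatten -> {flat}")             # 32 * 32 * 32 = 32768
print("fc1 -> 512 (assumed width), fc2 -> 3")
```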
4.3. Methods
- 1.
DQN and DDQN
DQN and DDQN are set up with the same CNN structure, State, Action, Reward, and course of dealing. State, Action, and course of dealing are described in Section 4.1, and the CNN setup is described in Section 4.2. The Reward is shown below:

$$R_t^i = z_t^i$$

where $i$ denotes the company, $t$ denotes the date, and $z_t^i$ is the percentage increase in closing price. There are two ways to add sentiment analysis to the model: we can add the sentiment analysis index average $\overline{SA}_t^i$ to the state, or we can add it to the reward. Until we obtain experimental results, we do not know which way is better. Therefore, we designed two methods: we first add $\overline{SA}_t^i$ to the reward, with the state remaining the same as in DQN; we then add $\overline{SA}_t^i$ to the state, with the reward remaining the same as in DQN. These two methods are referred to as SADQN-R and SADQN-S, respectively. Specific descriptions of the two methods are given below.
- 2.
SADQN-R
We add the sentiment analysis index average s_{i,t} to the Reward of DQN and call the resulting new model Sentiment Analysis Deep Q-Network-R (SADQN-R). State, Action and Course of dealing for SADQN-R are described in Section 3.1, and the CNN setup for SADQN-R is described in Section 3.2. Reward is shown below:
r_{i,t} = p_{i,t} + (s_{i,t} - 0.45)
where i denotes the company, t denotes the date, p_{i,t} is the percentage increase in closing price, and s_{i,t} is the sentiment analysis index average. Since s_{i,t} lies in the range [0, 1], we set 0.45 to be the sentiment-neutral value. Figure 2 shows the model diagram of SADQN-S; the model diagram of SADQN-R differs from it only in that the middle two rows of the input State are all zeros.
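A minimal sketch of this reward follows, assuming the sentiment term is combined additively with the price change after subtracting the neutral value 0.45; the paper's exact combination may differ, and the function name is ours.

```python
NEUTRAL = 0.45  # sentiment-neutral value; the sentiment index average lies in [0, 1]

def sadqn_r_reward(pct_increase, sentiment_avg):
    """Assumed SADQN-R reward: price change plus sentiment deviation from neutral."""
    return pct_increase + (sentiment_avg - NEUTRAL)

# a 2% price rise with positive crowd sentiment (0.65) yields a larger reward
print(sadqn_r_reward(0.02, 0.65))
```

Sentiment below 0.45 thus penalizes the reward, and sentiment above it adds a bonus, steering the agent toward stocks with favorable comment sentiment.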
- 3.
SADQN-S
We add the sentiment analysis index average s_{i,t} to the State of DQN and call the resulting new model Sentiment Analysis Deep Q-Network-S (SADQN-S). Action and Course of dealing for SADQN-S are described in Section 3.1, and the CNN setup for SADQN-S is described in Section 3.2. Reward is shown below:
r_{i,t} = p_{i,t}
where i denotes the company, t denotes the date, and p_{i,t} is the percentage increase in closing price. In SADQN-S, State is no longer the same as described in Section 3.1, because the sentiment analysis index averages are added to State while the Reward remains the same as in DQN. Based on the data preprocessing, we map s_{i,t} into 8 intervals. The sentiment index average is a number between 0 and 1, and we divide the interval [0, 1] into 8 equal parts, i.e., [0, 0.125), [0.125, 0.25), [0.25, 0.375), …, [0.875, 1]. If s_{i,t} belongs to the first interval, columns 1–4 of rows 16–17 of the State-0-1 matrix are set to 1 and the rest are unchanged; if s_{i,t} belongs to the second interval, columns 5–8 of rows 16–17 are set to 1; and so on. The State-0-1 matrix is shown in Figure 3 (using the State-0-1 matrix for stock #601318 on 22 February 2024 as an example, with matrix order n = 32). We use blue for the number 1 and white for the number 0; in the middle two rows, yellow indicates 1 and white indicates 0.
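A minimal numpy sketch of this interval encoding (0-based indices; the function and variable names are ours):

```python
import numpy as np

def encode_sentiment(state, s_avg):
    """Write the sentiment interval into rows 16-17 (1-indexed) of the
    32x32 State-0-1 matrix: interval k of [0, 1] lights columns 4k+1..4k+4."""
    k = min(int(s_avg * 8), 7)        # interval index 0..7; s_avg = 1.0 maps to 7
    state[15:17, :] = 0               # clear the two sentiment rows
    state[15:17, 4 * k:4 * k + 4] = 1
    return state

state = encode_sentiment(np.zeros((32, 32), dtype=int), 0.70)
print(np.flatnonzero(state[15]))  # [20 21 22 23]: 6th interval, columns 21-24 (1-based)
```

The eight 4-column blocks exactly tile the 32 columns of the matrix, so each sentiment level occupies a distinct, non-overlapping position that the convolutional layers can pick up.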
The model diagram for SADQN-S is shown in Figure 2 (for convenience, the diagram uses a 10th-order matrix as an example), with the State-0-1 matrix as the input. In the behavior network, the input State-0-1 matrix passes through the convolutional, ReLU and FC layers, and finally outputs a value vector Q, which predicts the value function for each of the three actions. In this process, the initial state is s_t, and the agent uses the value vector Q to select an action a_t. The agent then moves to the next state s_{t+1} while gaining a reward r_t. In the target network, s_{t+1} is input and, after passing through the neural network, a value vector is output, from which the agent obtains an action. After obtaining the outputs of both networks, the agent updates the network weights based on the loss function. The process is repeated until the number of iterations reaches the maximum value.
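The update described above is the standard DQN rule. A schematic of the loss computation for a single transition, with hypothetical values and an assumed discount factor, might look like:

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value, not specified in the text)

def dqn_loss(q_behavior, q_target_next, action, reward):
    """Squared TD error for one transition.
    q_behavior: Q(s_t, .) from the behavior network (3 actions)
    q_target_next: Q(s_{t+1}, .) from the target network
    """
    td_target = reward + GAMMA * q_target_next.max()  # r_t + gamma * max_a' Q_target
    td_error = td_target - q_behavior[action]
    return td_error ** 2

loss = dqn_loss(np.array([0.1, 0.5, -0.2]),
                np.array([0.3, 0.0, 0.1]), action=1, reward=0.02)
print(loss)  # (0.02 + 0.99 * 0.3 - 0.5)**2
```

In DDQN the only change is that the action inside the max is chosen by the behavior network and evaluated by the target network, which reduces overestimation of the value function.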
- 4.
Baseline methods
The training and test sets of our proposed model come from different stocks, which is one of the innovations of this paper. For the subsequent experiments we need baseline methods, but most previous methods have training and test sets from the same stocks and are therefore unsuitable. We found three statistically based methods [32,45]. These three methods do not need to be trained on past data from the test set; in other words, all three methods, like the method proposed in this paper, can be applied to newly listed stocks for which no past data are available. Therefore, these three methods can be used as baseline methods.
A brief description of the three methods is as follows:
Uniform Buy and Hold: Funds are evenly distributed at the initial moment and subsequently held at all times.
Uniform Constant Rebalanced Portfolio: The allocation of funds is adjusted at every moment so as to always maintain an even distribution.
Universal Portfolio: The returns of many candidate portfolios are calculated based on statistical simulations, and the portfolios are then weighted according to these returns.
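The first two baselines can be sketched directly from their definitions (Universal Portfolio is omitted, since it requires simulating many portfolios; transaction costs are ignored in this sketch, and the names are ours):

```python
import numpy as np

def ubah(price_relatives):
    """Uniform Buy and Hold: split funds evenly once, then never trade.
    price_relatives: T x N matrix, close[t+1]/close[t] for each of N stocks."""
    growth = price_relatives.prod(axis=0)  # cumulative growth per stock
    return growth.mean()                   # even initial split of 1 unit

def ucrp(price_relatives):
    """Uniform Constant Rebalanced Portfolio: rebalance to even weights daily."""
    daily = price_relatives.mean(axis=1)   # portfolio growth each day
    return daily.prod()

# two mean-reverting stocks over two days
x = np.array([[1.10, 0.90],
              [0.90, 1.10]])
print(ubah(x), ucrp(x))  # UCRP profits from rebalancing where UBAH loses
```

On this toy example UBAH ends below 1 while UCRP ends at 1, illustrating why constant rebalancing can outperform buy-and-hold on oscillating prices.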
6. Discussion
In recent years, stock prediction has been a hot research topic, and scholars have created a variety of models:
Traditional time series analysis methods: ARIMA model [47,48], moving average method [49], exponential smoothing method [50,51], etc.
Traditional machine learning methods: random forest [52,53], SVM algorithm [54], Boost algorithm [55], etc.
Although scholars continue to improve these methods, they can only predict stocks from past datasets and cannot predict the future price direction of newly listed stocks. In this paper, we add sentiment analysis to deep reinforcement learning and build the SADQN-R and SADQN-S models, which we test on two test sets different from the training set. We also test three statistically based methods that can be applied to newly listed stocks on the same two test sets. The initial asset is set to 1 and the agent selects trading actions over the investment timeframe; after the trading timeframe is over, we obtain a final asset. On the first test set, SADQN-R attained total assets of 1.0204 and SADQN-S attained 1.1229; SADQN-S achieved the largest final total assets and SADQN-R ranked third. On the second test set, SADQN-R attained 1.0751 and SADQN-S attained 1.1054; SADQN-S again achieved the largest final total assets and SADQN-R ranked fourth. Thus, SADQN-S outperforms the other methods, while SADQN-R does not perform particularly well. There are not many methods that can be applied to newly listed stocks; we found three. On the first test set, none of the three methods achieved final total assets exceeding 1; that is, they did not increase the initial assets. On the second test set, the final total assets of the three methods were 1.0782, 1.0782 and 1.0769, respectively, all below the 1.1054 of SADQN-S. The final results show that SADQN-S performs best, which indicates that our model can help stockholders obtain high returns on newly listed stocks.
7. Conclusions
Our method can help stock investors make stable, high returns. However, our model has shortcomings. In this paper, we use the raw data directly to construct the state, without any further processing. Even though we only use two years of data for training, this still contains a huge amount of information, and letting the agent learn all of it from such complex data is difficult. Therefore, in our next work, we will improve the model on this point: we will add a module that processes the raw data before it enters the state, to help the agent extract more information about the stock market. In this paper, we have used raw stock data and comment text data; in our next work, we plan to add images to the model that represent the overall direction of the market.
Regarding implementation, an advantage of the method in this paper is that it requires little data: for application, we only need stock data and comment text data for the last 32 days. On the other hand, the method has a limitation: it relies on the sentiments expressed in the comment text. If some people post nonsense comments, this can introduce a large bias into the sentiment data we obtain. Although such comments make up only a small percentage of all comments, they do affect the accuracy of the model. In our next work, we will address this issue by proposing a module that handles comments expressing overly extreme sentiments. At the same time, we expect more scholars to join stock prediction research and propose better methods.