A Stock Prediction Method Based on Deep Reinforcement Learning and Sentiment Analysis

Du, Sha; Shen, Hailong

doi:10.3390/app14198747

Open AccessArticle

A Stock Prediction Method Based on Deep Reinforcement Learning and Sentiment Analysis

by

Sha Du

and

Hailong Shen

^*

College of Science, Northeastern University, Shenyang 110819, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(19), 8747; https://doi.org/10.3390/app14198747

Submission received: 30 July 2024 / Revised: 14 September 2024 / Accepted: 24 September 2024 / Published: 27 September 2024

Download

Browse Figures

Versions Notes

Abstract

:

Featured Application

The model proposed in this paper can help stock investors to get high returns on newly listed stocks.

Abstract

Most previous stock investing methods were unable to predict newly listed stocks because they did not have historical data on newly listed stocks. In this paper, we use the Q-learning algorithm based on a convolutional neural network and add sentiment analysis to establish a prediction method for Chinese stock investment tasks. There are 118 companies that are ranked in the Chinese top 150 list for two consecutive years in both 2022 and 2023. We collected all comments under the stock bar of these 118 stocks for each day from 1 January 2022 to 1 July 2024, totaling nearly 10 million comments. There are 90 stocks left after the preprocessing of 118 stocks. We use these 90 stocks as the dataset. The stock’s closing price, volume, and comment text data are fed together to the agent, and the trained agent outputs investment behaviors that maximize future returns. We apply the trained model to two test sets that are completely different from the training set and compare it to several other methods. Our proposed method called SADQN-S obtains results of 1.1229 and 1.1054 on the two test sets. SADQN-S obtained higher final total assets than the other methods on both test sets. This shows that our model can help stock investors earn high returns on newly listed stocks.

Keywords:

stock prediction; deep reinforcement learning; sentiment analysis; convolutional neural network; DQN

1. Introduction

Since 2003, global equity markets have almost tripled in size, rising to a total market capitalization of USD 109 trillion. The largest stock market in the world is the U.S. stock market, with a total market capitalization of over USD 46.2 trillion in 2024. The U.S. stock market has undergone hundreds of years of development, and its management system is already very sound and risk-resistant. The Chinese stock market has developed for a relatively short period of time and its system is not sufficiently perfect, so it does not maintain good stability. At the end of 2022, CCTV Finance Channel conducted a survey of 764,600 retail investors, which showed that 92.51% of retail investors lost money, and only 4.34% made a profit. In 2022, 206 million retail investors lost CNY 76,000 per capita, and in 2023, although the situation had improved, retail investors still lost CNY 54,000 per capita. The stock market has a huge amount of data and many influencing factors, and most of them are retail investors who do not have enough expertise to get useful information directly from it, which makes it difficult to benefit. Retail investors need a stable and profitable model to help them make profits, and stock exchanges can regulate the market by predicting stock prices to attain more information. Predicting the direction of stock prices has become more and more important, which has led many scholars to devote themselves to the research of stock prediction. In the past, a single model could only capture a certain aspect of information and ignore other information, which is a method that is unable to predict the stock market well, so it is a trend to combine the research results of different fields to build a stock prediction model.

Most previous stock forecasting methods treat the stock problem as a time series modeling problem and choose to utilize statistically based methods to solve it [1,2]. In recent years, as the field of artificial intelligence has become hot, solving time series problems with machine learning-based methods has proven to have better results [3,4,5]. However, due to the large amount and complexity of stock market data, traditional machine learning methods cannot consistently and accurately predict the direction of the stock market. Deep reinforcement learning (DRL) is another method to building quantitative investment strategies. In 2015, Deepmind [6] combined deep learning (DL) and reinforcement learning (RL) in video games to achieve results beyond human levels. This was the beginning of deep reinforcement learning (DRL); since then, DRL began to be applied to various fields and showed strong learning ability and adaptability. In 2016, DeepMind applied DRL to the field of machine gaming with its development of AlphaGo [7], which defeated the world Go champion. The follow-up AlphaZero [8] beat AlphaGo again through self-learning. In addition, DRL algorithms have been applied by people in a variety of fields such as automatic intelligent machines [9,10,11,12], text generation [13,14], text games [15], autonomous driving [16], and target localization [17], among others. Research from references [18,19,20,21,22] applies DRL to stock prediction. The experimental results in these papers demonstrate the feasibility of constructing stock prediction models using the DRL method.

Research [18] applied deep reinforcement learning to global stock markets, greatly improving returns relative to before. Research [19] applied deep Q- network and deep recursive Q-network to stock trading. In the paper, an end-to-end daily stock trading system is constructed that decides to buy or sell on each trading day. Research [20] proposed a theory on the application of deep reinforcement learning in stock trading decisions and stock price prediction, demonstrating the reliability and usability of the model through experimental data. Research [21] proposed a multi-layer and multi-ensemble stock trader based on deep neural network and meta-learner to solve the stock prediction issue. Research [22] proposed two methods for stock trading decisions. Firstly, a nested reinforcement learning approach is proposed in this paper based on three deep reinforcement learning models. Then, a weight random selection with confidence strategy is proposed in the paper. The datasets of these methods are not the Chinese stock market. There are fewer stock prediction models based on the Chinese stock market. Research [23] proposed a deep neural network model for predicting stock price movements. In the paper, knowledge graph and graph embedding techniques are used to select the related stocks of the target stock to construct the market information and trading information. Research [24] treated stock price charts as images and uses deep learning neural network for image modelling. The method proposed in the paper can predict the stock price movement in the short term. The datasets of research [23] and research [24] are both Chinese stock market.

The rapid rise of the Internet has led people to browse information and express their opinions on major platforms every day, and most of these opinions are accompanied by the expression of positive or negative emotions, which will drive the emotions of the viewers. The remarks posted online by stock investors can influence the judgment of retail investors, which in turn affects stock prices [25,26,27,28,29]. Research [30] uses the advanced NLP technique BERTopic to analyze topic sentiment derived from stock market comments. Research [30] combines sentiment analysis with various deep learning models, and demonstrates through the paper’s results that adding sentiment analysis to the models significantly improves the performance of these models. Thus, collecting information about these comments and analyzing the sentiment in them can help to predict the future direction of stock prices. In other words, adding sentiment analysis to stock prediction methods can lead to improved methods.

In this paper, deep reinforcement learning and sentiment analysis are applied to the Chinese stock market. We combine deep reinforcement learning methods and sentiment analysis techniques to propose two new methods. These two methods are called SADQN-R (Sentiment Analysis Deep Q-Network-R) and SADQN-S (Sentiment Analysis Deep Q-Network-S) respectively. The contributions can be summarized as follows:

Data Innovation: Most of the datasets for stock prediction models come from the U.S. stock market, and there are fewer stock prediction models about China. However, there are huge differences between the Chinese stock market and the U.S. stock market, so the models applicable to other countries are not necessarily applicable to China. In order to get stable gains in the Chinese stock market, it is necessary to construct stock prediction models based on Chinese dataset. In this paper, 118 stocks are selected as the dataset. These 118 stocks are ranked among the top 150 stocks in China for two consecutive years in 2022 and 2023. The final experimental results show that our model can benefit from the Chinese stock market.
Method Innovation: Few studies have combined DRL and sentiment analysis to form stock prediction models. This paper uses the Q-learning algorithm based on convolutional neural networks to train stock prediction models, and adds sentiment indices as rewards (R) and states (S) into DQN, respectively, to obtain two models, SADQN-R (Sentiment Analysis Deep Q-Network-R) and SADQN-S (Sentiment Analysis Deep Q-Network-S). We tested the trained SADQN-R and SADQN-S on the test set and compared them with several other methods, and the results show that SADQN-S has the best performance among all methods.
Application Innovation: Most of the previous stock prediction methods use the historical data of a stock to predict the future direction of that stock. Newly listed stocks do not have historical data and cannot be predicted accurately. In this paper, the train set and test set are from different stocks. The test results show that our model applied to newly-listed stocks can achieve high returns.

There are seven chapters in this paper. The first chapter is the introduction. We present in this section the applications of DRL in various fields and some stock prediction methods based on DRL. At the end of this chapter, we present the innovations of this paper. The second chapter is related work. We present the position and differences of our research in the field in this section. The third chapter is the preliminaries. We introduce convolutional neural networks (CNN) and knowledge about DRL and sentiment analysis in this section. The fourth chapter is the models. We explain modelling for reinforcement learning tasks in detail and illustrate the network setup for deep learning in this chapter. Furthermore, we describe the methods proposed in this paper in this chapter. The fifth chapter is the experiments. We introduce the dataset in this chapter and conduct comparative experiments. We first compared DQN and DDQN and then compared several baseline methods to ours. The results show that our method possesses the highest gains on both test sets. The sixth chapter is the discussion. We introduce the classification of stock prediction methods in this chapter and discuss the advantages of our method over previous methods. The seventh chapter is conclusion. In this section, we talk about the shortcomings of the paper and the outlook for future work.

2. Related Work

With the continuous progress of DRL theory, DRL methods are increasingly used in stock prediction. Research [31] applied the combination of DL and RL to stock market trading for the first time. In the paper, it is proposed to introduce a multilayer perceptron to enhance the extraction capability of the feature recurrent neural networks. Many engineering tips for DL are given in the paper. Research [32] proposed a DRL transaction framework centered on ensemble of identical independent evaluators (EIIE). EIIE treats some kind of neural network as an integration unit, processing the historical information of each asset with a neural network of the same parameters, and ultimately integrating the results of all assets to generate the asset assignment weights. This paper also proposes efficient vector storage methods and training approaches suitable for both offline and online. Research [33] proposed a multimodal DRL method that combines CNN and LSTM as a feature extraction module and uses DQN as a decision model for trading. In the paper, stock timing information is used to generate three types of images and a module combining CNN and LSTM is used to process the images. Research [34] proposed the DeepTrader framework, adding techniques such as TCN, GCN, and spatial attention mechanisms to research [35]. Research [34] improved the feature extraction of the stock price component from temporal and spatial features, respectively, in the paper. In addition, market factors are added to the paper to calculate market sector sentiment. After people gradually realized the advantages of DRL for solving stock prediction problems, it has been discussed in many review articles. Research [36] discusses a number of issues from the perspective of DRL modelling of stock prediction problems, such as problem classification, risk assessment, environment modelling, and model selection. Research [37] summarizes some of the more effective methods. Combining these reviews, DRL-based stock prediction methods can be broadly classified into three categories: Critic-Only, Actor-Only, and Actor-Critic.

Most previous studies have used the US stock market as the dataset, and few methods have used the Chinese stock market as the dataset. China has a large number of stockholders and huge stock data. Chinese retail investors need some stock prediction methods to help them make stable and high returns. Therefore, it is necessary to establish a method to use the Chinese stock market as the dataset. This research of ours is to use the Chinese stock market as the dataset. More and more scholars are proposing excellent stock prediction methods. However, few methods are specific to newly listed stocks. Newly listed stocks do not have past datasets and cannot be fitted or trained with past data to get stock direction. This paper is aimed at newly listed stocks for which no past data are available. The relevant parameters in this paper are shown in Table A1.

3. Preliminaries

3.1. Neural Networks

A neural network is a computational model that mimics the way that the human brain works. It is fundamental to the fields of deep learning and machine learning. Neural networks consist of a large number of neurons that are interconnected in a network that can process complex data inputs and perform a variety of tasks. The basic components of a neural network consist mainly of neurons, hierarchies, weights, biases, and activation functions. These components work together to enable neural networks to learn and model complex non-linear relationships. Neural networks continuously improve their ability to recognize and process data by learning and adjusting connection weights. With learning, neural networks are able to show better and better performance on a variety of tasks such as image recognition, language understanding, and game play. Common neural networks include CNN, RNN, long short-term memory (LSTM), gated recurrent unit (GRU), generative adversarial networks (GAN), and so on.

3.2. CNN

CNN is a feed forward neural network. The default input to a CNN is an image, which allows us to encode specific properties into the structure of the network, making our feedforward function more efficient and reducing the number of parameters. CNN usually contain the following layers: convolutional layer, rectified linear unit layer, pooling layer, and fully-connected layer. The “convolution” in CNN is a mathematical operation that processes image data through convolutional kernels. These convolution kernels are smaller in size than the input image and they cover a partial region of the image. The convolution kernel generates an element of a new feature image by element-by-element multiplication and summation of the pixel values in that region. Rectified linear unit (ReLU) is a very common and important activation function in the field of deep learning. ReLU is widely used in numerous neural network models. The mathematical expression for ReLU is shown below:

R e L U (x) = \{\begin{matrix} x, i f x > 0 \\ 0, i f x \leq 0 \end{matrix} .

(1)

This form allows the ReLU function to maintain a gradient of 1 for

x > 0

, which is an important advantage of the ReLU function. This is because gradient vanishing is a challenge to be faced during the training process of deep learning and the ReLU function alleviates this issue to some extent.

The purpose of pooling is to reduce the feature image. Commonly used pooling methods include max pooling and mean pooling. Fully connected neural network should be called fully connect feedforward neural network. Fully connect means that every neuron in the next layer is connected to every neuron in the previous layer. Feedforward refers to the fact that after the input signal enters the network, the signal flow is unidirectional with no feedback in between.

3.3. Deep Reinforcement Learning

3.3.1. Basic Knowledge

RL is a computational approach to understanding and automating goal-directed learning and decision making. It is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment [38]. From a mathematical point of view, the RL problem is an idealized Markov Decision Process (MDP) with a theoretical framework for achieving goals through interactive learning. In RL, the agent is responsible for learning and making decisions, and everything outside the agent that interacts with it is called the environment. The agent is in a “state” at every moment in the process of continuous exploration. In RL, the interaction between the agent and the environment is called “action”. In other words, “action” is an action that the agent can take. The agent and the environment interact over a series of discrete time steps. The time step is denoted by

t

,

t = 1, 2, 3, \dots

. The interaction between the agent and the environment is continuous: in the current state (

S_{t}

), the agent chooses action (

A_{t}

), the environment responds to the action and presents a new state (

S_{t + 1}

) to the agent, and the environment generates a reward (

R_{t + 1}

), the goal of the agent is to maximize the cumulative reward it receives in the long run. Cumulative reward refers to the return (

G_{t}

):

G_{t} ≜ R_{t + 1} + R_{t + 2} + R_{t + 3} + \dots + R_{T} .

(2)

In Equation (2),

t

refers to the current moment and

T

is the moment when the interaction between the agent and the environment ends.

However, in many cases, the interaction between the agent and the environment continues without restriction. At this point

G_{t}

does not necessarily converge. Therefore, we need to introduce another concept called discounting. Adding the discount rate allows

G_{t}

to be convergent when the interaction goes on indefinitely. As shown below:

G_{t} ≜ R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + \dots = \sum_{k = 0}^{\infty} γ^{k} R_{t + k + 1},

(3)

where

γ

is a parameter,

0 \leq γ \leq 1

, called the discount rate.

The value function is divided into state-value function and action-value function. The state-value function is used to estimate the expected return of the agent in the current state. The action-value function is used to estimate the expected return of the agent in the current state-action pair. The magnitude of the expected return depends on the action chosen by the agent, so the value functions are related to the particular ways of acting, called policies (

π

). A policy is a mapping from the states to the probabilities of choosing each possible action in that state. We denote the value function of the state

s

under the policy

π

as

V_{π} (s)

(state-value function):

V_{π} (s) ≜ E_{π} [G_{t} |S_{t} = s] = E_{π} [\sum_{k = 0}^{\infty} γ^{k} R_{t + k + 1} |S_{t} = s] .

(4)

We denote the value of taking action

a

in state

s

under policy

π

as

Q_{π} (s, a)

(action-value function). The mathematical expression is shown below:

Q_{π} (s, a) ≜ E_{π} [G_{t} |S_{t} = s, A_{t} = a] = E_{π} [\sum_{k = 0}^{\infty} γ^{k} R_{t + k + 1} |S_{t} = s, A_{t} = a] .

(5)

Finite MDP problems can be well solved by methods based on value functions such as dynamic programming (DP), Monte Carlo methods (MC), and temporal-difference learning (TD). However, most of the real-life RL problems are continuous with infinite states, so the methods based on value functions cannot evaluate the value function well.

To solve this problem, researchers propose to use function approximation for prediction, using parameterization to approximate the value function. Although this is an improvement over traditional RL, experimental results show that the performance is still poor for complex problems. It was not until the introduction of DL that complex RL problems were well solved. Over the years, DL has developed rapidly, and with the excellent feature representation capability of deep neural networks, it has solved many difficult problems in academia and industry and achieved important research results. DRL combines the framework of DL and the idea of RL, and with the powerful feature representation capability of DL, RL technology has really become practical. DeepMind published a paper in Nature in 2015 [6]. In this paper, a DRL algorithm called DQN was proposed. It combines Q-learning in RL with DL, and experiments in Atari games, achieving better results than human players. Since then, DRL has been booming.

3.3.2. DQN and DDQN

DQN (Algorithm A1. This algorithm is shown in Appendix A), as one of the most classical DRL algorithms, has been applied in several fields with good results. In RL,

Q (s, a)

is called the action value function, which represents the expected value of the future return in the case of being in state

s

and taking action

a

. Using the Q-Learning algorithm, we can get a Q-table (Table 1), which contains the Q-values for taking different actions

a

(

a \in A

) in all states

s

(

s \in S

). After initializing the Q-table, the Q-learning algorithm uses the Bellman equation to update the Q-table by iterating during the interaction between the agent and the environment until convergence. Before explaining the DQN algorithm further, we explain the Bellman equation. MDP can be modeled using Equation (6).

p (s^{’}, r| s, a) = P r (s_{t + 1} = s^{’}, r_{t + 1} = r| s_{t} = s, a_{t} = a),

(6)

where

P r

refers to the probability,

s

refers to the state at a particular moment,

s^{’}

refers to the state at the next moment,

a

is the action performed at this particular moment, and

r

is the reward that the agent receives for performing

a

this action. Therefore, Equation (7) can be obtained by deriving Equation (4).

\begin{array}{l} v_{π} (s) & ≜ & E_{π} [G_{t} ∣ S_{t} = s] \\ = & E_{π} [R_{t + 1} + γ G_{t + 1} ∣ S_{t} = s] \\ = & \sum_{a} π (a | s) \sum_{s^{’}} \sum_{r} p (s^{’}, r | s, a) [r + γ E_{π} [G_{t + 1} | S_{t + 1} = s^{’}]] \\ = & \sum_{a} π (a | s) \sum_{s^{’}, r} p (s^{’}, r | s, a) [r + γ v_{π} (s^{’})], for all s \in S, \end{array},

(7)

where

S

is the state space,

s

refers to the state at a particular moment,

s^{’}

refers to the state at the next moment,

a

is the action performed at this particular moment, and

r

is the reward that the agent receives for performing

a

this action. The last line of Equation (7) is the Bellman equation for

v_{π} (s)

.

With the final Q-table obtained, the agent can then determine what action to take in a certain state

s

to obtain the maximum Q-value. The iteration is as follows:

\begin{matrix} Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [R_{t + 1} + γ \max_{a} Q (s_{t + 1}, a) - Q (s_{t}, a_{t})], \end{matrix}

(8)

where

α

is the learning rate, which is used to specify the magnitude of the update, and

γ

is the discount rate, which is used to ensure that the iteration can converge. The

r_{t + 1}

in Equation (8) is the reward given to the agent by the environment after the agent performs the action. However, most real-life RL problems possess infinite states and it is not possible to solve such RL problems using Q-table. To solve this type of problem, we can parameterize the value function and approximate

Q (s, a)

by

Q (s, a {; θ}_{i})

. The function

Q (s, a; θ_{i})

can be generated using a neural network, which we call Deep Q-network, where

θ_{i}

is the parameter on which the neural network is to be trained. The parameter update process is as follows:

Set up two networks, the behavior network $Q (s_{t}, a_{t}; θ_{i})$ and the target network $Q (s_{t + 1}, a_{t + 1}; θ_{i}^{-})$ . The behavior network is responsible for controlling agent and collecting experience; target network is responsible for computing $Y_{i}$ :

$Y_{i} = r_{t + 1} + γ \underset{a_{t + 1}}{m a x} Q (s_{t + 1}, a_{t + 1}; θ_{i}^{-});$

(9)

where i is the number of iterations and $γ$ is the discount rate. The $r_{t + 1}$ in Equation (9) is the reward given to the agent by the environment after the agent performs the action.
Define a loss function:

$\begin{matrix} L (θ_{i}) = E [{(r_{t + 1} + γ \max_{a_{t + 1}} Q (s_{t + 1}, a_{t + 1}; θ_{i}^{-}) - Q (s_{t}, a_{t}; θ_{i}))}^{2}]; \end{matrix}$

(10)
The parameters are updated using the gradient descent method. A partial derivation of the parameter $θ_{i}$ yields the following gradient:

$\nabla_{θ_{i}} L (θ_{i}) = E_{s_{t}, a_{t}, R_{t}, s_{t + 1}} [(Y_{i} - Q (s_{t}, a_{t}| θ_{i})) \nabla_{θ_{i}} Q (s_{t}, a_{t}| θ_{i})] .$

(11)

During the update process, only the weight

θ_{i}

of the behavior network

Q (s_{t}, a_{t}; θ_{i})

is updated and the weight

θ_{i}^{-}

of the target network

Q (s_{t + 1}, a_{t + 1}; θ_{i}^{-})

remains unchanged. At regular intervals, the weights of the behavior network are copied to the target network so that the target network can also be updated. After the introduction of the target network, the target Q-value is kept constant over a period of time, which reduces the correlation between the current Q-value and the target Q-value to a certain extent and improves the stability of the algorithm.

Over the years, scholars have continued to improve DQN [39,40,41,42], trying to design a new, more advanced algorithm with higher applicability based on DQN. The final results show that DQN may not be as good as the improved algorithm in a certain area, but it has always ranked high in terms of the combined scores of all applicable areas. However, DQN has an obvious drawback, which is the overestimation of the Q-value. This is because it uses only the current Q network for action selection and Q estimation when updating Q values. To address this problem, research [43] proposed the use of two networks, one for selecting actions and the other for estimating Q-values. Research [43] called the method proposed in the paper DDQN. The model structure of DDQN is the same as that of DQN, the difference lies in the different objective functions. The objective functions of the two are:

Y_{i}^{D Q N} = r_{t + 1} + γ \underset{a}{m a x} Q (s_{t + 1}, a; θ_{i}^{-}),

(12)

Y_{i}^{D D Q N} = r_{t + 1} + γ Q (s_{t + 1}, \underset{a}{ar gmax} Q (s_{t + 1}, a; θ_{i}); θ_{i}^{-}) .

(13)

3.4. Sentiment Analysis

Natural language processing (NLP) is one of the hottest research areas at the moment, aiming to develop applications or services that can understand human language. An important branch of NLP is sentiment analysis, also known as opinion mining, which is used to analyze the emotions, opinions, attitudes, etc. expressed by people in textual messages. The development and rapid start of sentiment analysis was made possible by the rapid growth of social media on the web, such as online shopping software, short video software, WeChat Version 8.0.50, Weibo Version 14.9.1, etc., which was the first time in the history of mankind that there was such a huge amount of digital volume in the form of records.

Sentiment analysis is usually divided into two steps: feature extraction and sentiment classification. Sentiment analysis techniques are mainly classified into three types, which are DL based methods, machine learning-based methods and rule-based methods. Sentiment analysis methods based on DL utilize deep neural network models such as CNN, RNN, etc. to learn semantic information in text sequences; machine learning-based sentiment analysis methods aim at autonomously deducing the correlation between the text content and its emotional meanings through the training of the model, and the commonly used methods include Support Vector Machine, Decision Tree, Naive Bayes, etc.; rule-based sentiment analysis methods are manually defined sentiment lexicon and grammar rules. SnowNLP is an official Python class library inspired by TextBlob and written to facilitate the processing of Chinese, using Bayesian machine learning methods to train the text. The output is between 0 and 1; the closer the number is to 1 indicates that the text sentiment is more positive. This paper uses SnowNLP’s sentiment analysis module to process the stock bar posts in the dataset.

4. Methodology

For the subsequent experiments, we chose DQN as the training algorithm for the agent, so we call the two models proposed in this paper Sentiment Analysis Deep Q-Network-R (SADQN-R) and Sentiment Analysis Deep Q- Network-S (SADQN-S).

4.1. Modeling the Environment for RL Task

State

Most previous stock methods use raw data from stocks as inputs to the model, and few studies used images as inputs in stock prediction. Research [44] tried to use images as inputs in their paper and improved the performance of the model. Research [18] applied DRL to stock prediction by using images as input to the model and greatly improved the model’s return. Therefore, we believe that further research on the direct use of images as input is necessary.

We call the input State the State-0-1 matrix. The State-0-1 matrix consists of the closing price, volume after max-min normalization. The closing price of a stock is the price at which the last trade in that stock was made on a business day. If there are no transactions on the day, the closing price of the previous business day will be the closing price of the day. Stock volume is the number of trades made between buyers and sellers of a stock and is unilateral. For example, a stock with a volume of 100,000 shares means that the buyer bought 100,000 shares while the seller sold 100,000 shares.

The State-0-1 matrix (Algorithm A2. This algorithm is shown in Appendix A) is an

N

-order square matrix of 0 s and 1 s (

N

columns represent data for the last

N

days), and the matrix is categorized into upper, middle, and lower matrices. The middle matrix (2 rows and

N

columns) is full of zeros and is used to help the agent distinguish between closing price and volume. The upper matrix (

(N - 2) / 2

rows and

N

columns) is the closing price: the normalized closing price is divided into m intervals, if the closing price on day

x

is in the maximum interval, the first row of column

x

is 1, and each of the rest of column

x

is 0; if the closing price on day

x

is in the second-largest interval, the second row of column

x

is 1, and each of the rest of the columns

x

is 0; if the closing price on day

x

is in the minimum interval, the

(N - 2) / 2

th row of column

x

is 1, and each of the rest of column

x

is 0; and so on. The lower matrix

(N - 2) / 2

rows and

N

columns) is the volume and is generated in the same way as the upper matrix. In this paper, we set

N

to 32. The State-0-1 matrix is shown in Figure 1 (using the State-0-1 matrix for stock #601318 on 22 February 2024 as an example,

N = 32

). We use blue for the number 1 and white for the number 0.

The State of company

c

at moment

t

is denoted as

s_{c}^{t}

(

c

denotes the company and

t

denotes the date).

Action

The three actions are long (

a_{c}^{t} = 1

), neutral (

a_{c}^{t} = 0

), and short (

a_{c}^{t} = - 1

) (

c

denotes the company and

t

denotes the date).

Reward

DQN, DDQN, SADQN-R, and SADQN-S all have different reward settings, as described in Section 3.3.

Course of dealing

Input the stock’s data for the past 32 days to the agent, which outputs a vector of value functions

ρ_{c}^{t}

(

c

denotes the company and

t

denotes the date). The value function vector

ρ_{c}^{t}

is a three-dimensional column vector with three values approximating the value functions of each of the three actions. We can benefit the next day by choosing the action that has the largest value function. When investing in stocks, it is too risky to invest in only one stock, so it is necessary to set up a portfolio that is invested according to weight

α_{c}

. Assuming that the total number of stocks at the time of conducting the portfolio is

M

, at moment

t

, our model processes

M

stocks simultaneously and constructs the weights using the output vector of value functions

ρ_{c}^{t}

. The weights are positive if the value function of the long action is the largest, and negative if the value function of the short action is the largest. In order to have an intuitive result, we set the total assets to 1. A positive weight represents the execution of a long action. A negative weight represents the execution of a short action. To reduce investment risk, we make the daily long position equal to the short position. Therefore, all weights add up to equal 0. In addition, since the total assets are 1, all the absolute values of the weights add up to equal 1. That is, the weights are satisfied:

\{\begin{matrix} \sum_{c = 1}^{M} |α_{c}| = 1 \\ \sum_{c = 1}^{M} α_{c} = 0 \end{matrix} .

(14)

If

α_{c} > 0

, it indicates a long position of

α_{c}

. If

α_{c} < 0

, it indicates a short position of

α_{c}

(Algorithm A3. This algorithm is shown in Appendix A).

The initial asset in this paper is 1. Agent predicts the direction of all stocks for the next day, assigning different weights to each stock. The environment then performs sell and buy operations on each stock in turn based on the weights. The environment moves from moment

t

to moment

t + 1

, when the total assets are updated at the new closing price of the stock, and the agent’s goal is to maximize the total assets at each moment. This process continues until the last day of the investment time period. Since in reality some transaction costs are incurred when trading stocks, this paper sets a transaction fee of 5 per thousand.

4.2. Network Settings for Deep Learning

Since the input states are two-dimensional, CNN is chosen as the neural network structure for agent training. The core idea of CNN is to process the input data through convolutional kernels in the convolutional layer as a way to perform feature extraction. CNN contains convolutional layer and pooling layer. In order to effectively reduce the amount of computation, the pooling layer shrinks the input image, reducing the pixel information and keeping only the important information. The information contained in the stock trading market is huge and intricate, and increasing the number of layers appropriately can help the agent learn more information. We set the CNN with 6 hidden layers. In addition, [18] conducted experiments in the paper, which showed that using a size of 32 × 32 × 1 as the input to the CNN enables the agent to have the best training results.

Therefore, based on the previous work, we set up the CNN network structure in this paper. The input size of the CNN is 32 × 32 × 1. CNN has 6 hidden layers, the first 4 are convolutional layers and the last 2 are fully connected layers (FC layers). Each convolutional layer and the first FC layer are followed by Rectified Linear Unit (ReLU). The four convolutional layers consist of 16 convolutional kernels of size 5 × 5 × 1, 16 convolutional kernels of size 5 × 5 × 16, 32 convolutional kernels of size 5 × 5 × 16 and 32 convolutional kernels of size 5 × 5 × 32, respectively. The output size of the CNN is 3 × 1.

4.3. Methods

1.: DQN and DDQN

DQN and DDQN are set up with the same CNN structure, State, Action, Reward, and Course of dealing. State, Action, and Course of dealing are described in Section 3.1, and the CNN setup is described in Section 3.2. Reward is shown below:

r_{c}^{t} = I_{c}^{t + 1} \times a_{c}^{t} .

(15)

where

c

denotes the company,

t

denotes the date, and

I_{c}^{t + 1}

is the percentage increase in closing price. There are two ways to add sentiment analysis to the model. We can add the sentiment analysis index average

h_{c}^{t}

to state or we can add the sentiment analysis index average

h_{c}^{t}

to reward. Until we get the experimental results, we do not know which joining way is better. Therefore, we designed two methods. We first add

h_{c}^{t}

to the reward, with the state remaining the same as the DQN. Then, we add the

h_{c}^{t}

to the state, with the reward remaining the same as the DQN. These two methods are referred to as SADQN-R and SADQN-S. Specific descriptions of the two methods are shown below.

2.: SADQN-R

We add the sentiment analysis index

h_{c}^{t}

to the Reward of DQN and call the resulting new model Sentiment Analysis Deep Q-Network-R (SADQN-R). State, Action and Course of dealing for SADQN-R are described in Section 3.1, and the CNN setup for SADQN-R is described in Section 3.2. Reward is shown below:

r_{c}^{t} = I_{c}^{t + 1} \times a_{c}^{t} + (h_{c}^{t + 1} - 0.45) \times a_{c}^{t},

(16)

where

c

denotes the company,

t

denotes the date, and

I_{c}^{t + 1}

is the percentage increase in closing price.

h_{c}^{t}

is in the range [0, 1], so we set 0.45 to be the sentiment-neutral value. As shown in Figure 2, it is the model diagram of SADQN-S. The model diagram of SADQN-R differs from that of SADQN-S in that the middle two rows of the input State are all zeros.

3.: SADQN-S

We add the sentiment analysis index

h_{c}^{t}

to the State of DQN, and call the resulting new model Sentiment Analysis Deep Q- Network-S (SADQN-S). Action and Course of dealing for SADQN-S are described in Section 3.1, and the CNN setup for SADQN-S is described in Section 3.2. Reward is shown below:

r_{c}^{t} = I_{c}^{t + 1} \times a_{c}^{t} .

(17)

where

c

denotes the company,

t

denotes the date, and

I_{c}^{t + 1}

is the percentage increase in closing price. In SADQN-S, State is no longer the same as described in Section 3.1. This is because in SADQN-S, the sentiment analysis index averages are added to State, and the Reward remains the same as DQN. We averaged

h_{c}^{t}

into 8 intervals based on data preprocessing. The sentiment index average is a number between 0 and 1. We divide the interval [0, 1] into 8 equal parts, i.e., [0, 0.125), [0.125, 0.25), [0.25, 0.375), …, [0.875, 1]. If

h_{c}^{t}

belongs to the first interval, columns 1–4 of rows 16–17 of the State-0-1 matrix are 1, and the rest are unchanged; if

h_{c}^{t}

belongs to the second interval, columns 5–8 of rows 16–17 of the State-0-1 matrix are 1, and the rest are unchanged; and so on. The State-0-1 matrix is shown in Figure 3 (using the State-0-1 matrix for stock #601318 on 22 February 2024 as an example,

N

= 32). We use blue for the number 1 and white for the number 0. The yellow color in the middle two rows indicates 1, and the white color in the middle two rows indicates 0.

The model diagram for SADQN-S is shown in Figure 2. As shown in Figure 2, the State-0-1 matrix is the input. For the sake of convenience, we take the 10th order matrix as an example to make a diagram. In the behavior network, the input State-0-1 matrix passes through the convolutional, ReLu and FC layers, and finally outputs a value vector

ρ_{c}^{t}

, which predicts the value function for each of the three actions. In this process, the initial state is

s_{t}

, and the agent uses the value vector

ρ_{c}^{t}

to obtain an action

a_{t}

. The agent then comes to the next state

s_{t + 1}

while gaining a reward

r_{t + 1}

. In the Target network,

s_{t + 1}

is input and after going through the neural network, the value vector is output. In this process, the agent gets an action

a_{t + 1}

. After getting the data output from both networks, the agent then updates the network weights based on the loss function. The process is repeated until the number of iterations reaches the maximum value.

4.: Baseline methods

The training and test sets of our proposed model are from different stocks, which is one of the innovations of this paper. In the subsequent experiments, we need to set up some baseline methods. Most of the previous methods have training and test sets from the same stocks and are not suitable as baseline methods. We find three methods which are statistically based methods [32,45]. These 3 methods do not need to be trained using past data from the test set. In other words, all 3 methods and the method proposed in this paper can be applied to newly listed stocks for which no past data are available. Therefore, these 3 methods can be used as baseline methods.

A brief description of the three methods is as follows:

Uniform Buy and Hold: Funds are evenly distributed at the initial moment and subsequently held at all times.
Uniform Constant Rebalanced Portfolio: Adjust the allocation of funds at every moment to always maintain an even distribution.
Universal Portfolio: The returns of many kinds of investment portfolios are calculated based on statistical simulations, and the weighting of these portfolios is calculated based on the returns.

5. Experiments

In this section, we conducted comparative experiments. We introduce the dataset and hyperparameter settings. The purpose of our experiments is to identify suitable RL algorithms. In addition, we are also to compare our method with the baseline methods as a way to prove that the method proposed in this paper can help stock investors to get stable and high returns.

5.1. Dataset

5.1.1. Data Source

Over the years, most of the studies on the stock market have their datasets originated from the U.S. stock market, and there are fewer studies on stock prediction for the Chinese stock market. Separately, there are other reasons why the Chinese stock market was chosen as the subject of this paper:

In 2020, Chinese accelerating economic recovery pushed Chinese stock market values past the peaks set during the stock market bubble of 2015, and the total market value of Chinese stock market surpassed USD 10 trillion. The Chinese stock market is large enough and representative enough to be the subject of study;
While the US stock market is mainly dominated by institutional investors, the Chinese stock market is mainly dominated by retail investors [46]. From the establishment of the Chinese stock market in 1990 to 2023, the number of retail investors has exceeded 220 million. This makes the Chinese stock market highly predictable.

The Walton Institute of Economic Research publishes the Chinese top 150 list every year. Among the companies that were top 150 in 2022, 118 remain top 150 in 2023. In this paper, these 118 companies are selected as the source of the stock trading dataset. East Money (https://www.eastmoney.com/, accessed on 10 July 2024) is one of the most visited and influential financial and securities portals in China. Since its launch in March 2004, East Money has always insisted on the authority and professionalism of its content, covering various financial fields in a multifaceted way, and updating tens of thousands of data and information on a daily basis. Therefore, this paper chooses the stock bar under the website as the source of the dataset for sentiment analysis. East Money has all the stock trading information of Chinese stock market, such as opening price, closing price, volume and so on. Stock bar (https://guba.eastmoney.com/) is a stock communication community in East Money. Netizens will express their opinions about the stock market in the Stock Bar. The trading data information in East Money is publicly available and can be downloaded by the researchers themselves.

5.1.2. Data Preprocessing

First, we collected all the comments for these 118 stocks from 1 January 2022 to 1 July 2024 from stock bar (Table 2). The total number of comments is nearly 10 million. Then the sentiment analysis module in the Python class library SnowNLP is used to analyze the sentiment of all the daily comments of each stock to obtain the average value of the sentiment index of all the daily comments of the 118 stocks, denoted by

h_{c}^{t}

(

c

denotes the company and

t

denotes the date) (Figure 4). Suppose a company is called

A

and today’s date is

T

. We collected all the comments under the stock bar of

A

on the day

T

. Suppose there are 100 of these comments. We fed these 100 comments into SnowNLP and got 100 sentiment indexes. These sentiment indexes range from 0 to 1, with higher values representing more positive sentiments. We then average these 100 indexes. This average of the sentiment indexes is

h_{A}^{T}

. As shown in Figure 4, stock investors post their comments on stock bar. We collect these comments and then use SnowNLP to get the sentiment index of each comment. We average the sentiment index of all the comments on a stock for the day to get the average sentiment index (

h_{c}^{t}

). Finally, we exclude the less active and noisy ones totaling 28 stocks and retain 90 stocks as our dataset. These 90 stocks have a long history. They have sufficient data and are relatively stable. In addition, the stocks of these 90 companies have a large amount of discussion in the Chinese stock market and have a large number of investors. The comment areas of these 90 stocks in the stock bar hold a large number of comments. These 90 stocks have sufficient stock data, tons of review data and investors. Therefore, we use these 90 stocks as the dataset for the experimental section.

Next, we downloaded the closing price, volume, and percentage increase for 90 stocks from 1 January 2022 to 1 July 2024 from the Choice Financial Terminal. We first max-min normalize the data of each stock, and then generate the State-0-1 matrix as the state input to the agent.

5.2. Comparison Experiments

We divide the 90 stocks into two parts, which have 60 stocks and 30 stocks, respectively. We then divide the 30 stocks into three groups, A, B, and C. Each of the three groups, A, B, and C, owns 10 stocks. We use the data of these 60 stocks from 1 January 2022 to 31 December 2023 as the training set, the data of Group A from 1 January 2022 to 31 December 2023 as the validation set, and the data of Group B and Group C from 1 January 2024 to 1 July 2024 as the test set. The dataset division is shown in Figure 5.

Next, we explain our reasons for dividing the dataset this way. Suppose today is 1 January 2024 and there are some stocks that just went public today. If we want to invest in these stocks, most of the previous methods will not help us invest. This is because those methods were trained with past data from the test set. For example, there are 10 stocks and they were listed in 2010. Most previous methods have proceeded this way: data from 2010 to 2022 was used as the training set and data from 2023 as the validation set, which was then applied to the 10 stocks. Our methods proceed in this way: our training set is data for 60 stocks from 2022 to 2023, and the validation set is data for another 10 stocks from 2022 to 2023. If the trained method can get high returns on the validation set, then it means that the trained method can be applied to the newly listed stocks. Even if the newly listed stocks lack past data, we can still get high returns by investing in these stocks. In addition, we set up two test sets in order to demonstrate the stability of our methods through this approach.

In the next comparison experiments, we use the validation set to tune the hyperparameters of all methods and set them the same on such key parameters as batch size, capacity of the memory buffer, and so on. The hyperparameter settings for the SADQN-S are shown in Table 3. The training process and testing process of the experiment are shown in Figure 6 and Figure 7, respectively (using SADQN-S as an example). As shown in Figure 6, we first process the comment text to obtain

h_{c}^{t}

.

h_{c}^{t}

and the stock raw data are then combined to form the State-0-1 matrix. The State-0-1 matrix, the actions performed by the agent and the percentage increase of the closing price, which are fed to both networks. The two networks are trained using the Q-learning algorithm. As shown in Figure 7, we merge the

h_{c}^{t}

and stock raw data to obtain the State-0-1 matrix. When the agent receives the State-0-1 matrix, it outputs the value function vector using the already trained network and selects the action based on the value function vector.

For the training and testing process of the experiment: the difference between DQN and SADQN-S is state; the difference between DDQN and SADQN-S is state and Q-learning algorithm; the difference between SADQN-R and SADQN-S is state and reward. In addition, since our method is based on deep reinforcement learning, the training process requires relatively large computational resources. We train through the server. The server’s configurations are:

GPU: RTX 3080 × 2(20 GB) × 1;
CPU: 10 vCPU Intel(R) Xeon(R) Gold 6148 CPU @ 2.40 GHz;
Random access memory (RAM): 60 GB.

5.2.1. Comparison Experiment between DQN and DDQN

In some domains, DDQN has significant advantages over DQN, but is not necessarily superior to DQN on all tasks. Therefore, before adding the sentiment analysis module, we compare the results of traditional DQN with traditional DDQN on the stock prediction dataset. We test the trained DQN and DDQN in Group B and Group C. Since the input state requires the data of previous 32 days, we set the initial asset to be 1 on 22 February 2024, and the Agent assigns weights to the 10 stocks in the test set and performs actions every day, which lasts until 1 July 2024, to get the final total asset. The test results of DQN and DDQN in Group B and Group C are shown in Figure 8. The initial total asset is 1. The agent executes trading actions every day and the total asset changes. Asset in Figure 8 is the final total asset.

The results show that DQN has higher final total assets than DDQN on both test sets. Crucially, DDQN incurred losses, while DQN made all profits. DQN clearly outperforms DDQN, so we chose DQN as the algorithm to train the agent in this paper.

5.2.2. Comparison Experiment for All Methods

We tested the trained SADQN-R and SADQN-S in Group B and Group C. Since the input state requires the data of previous 32 days, we set the initial asset to be 1 on 22 February 2024, and the Agent assigns weights to the 10 stocks in the test set and performs actions every day, which lasts until 1 July 2024, to get the final total asset. In previous work, most of the methods for solving stock portfolios need to be trained using historical data of the stocks in the test set, and there are fewer methods that can be used for stock portfolios without relying on historical data. We looked for three statistically based methods that do not require historical data in solving the stock portfolio task. These three methods have been described in Section 3.3. These 3 methods are called Uniform Buy and Hold, Uniform Constant Rebalanced Portfolio, and Universal Portfolio, respectively. For convenience, we abbreviate these three methods as UBAH, UCRP, and UP, respectively. We test these 3 methods on Group B and Group C. We show the test results of DQN, DDQN, SADQN-R, SADQN-S and three statistically based methods in Group B and Group C in Figure 9.

SADQN-R achieved positive gains on both test sets, but compared to DQN, SADQN-R did not significantly outperform DQN. While all three statistically based methods obtained high returns in Group C, all three statistically based methods obtained negative returns in Group B. SADQN-S achieves positive gains on both test sets, and both achieve maximum gains. In Group B, four methods - DDQN, UBAH, UCRP, and UP - failed to achieve positive returns, while SADQN-S achieved a final asset of 1.1229 in Group B, which is far more than the other six methods. Taken together, the SADQN-S has the best performance.

6. Discussion

In recent years, stock prediction has been a hot topic in the research field, and scholars have created a variety of models:

Traditional time series analysis methods: ARIMA model [47,48], moving average method [49], exponential smoothing method [50,51], etc.
Traditional-based machine learning methods: random forest [52,53], SVM algorithm [54], Boost algorithm [55], etc.

Although scholars continue to improve these methods, they can only predict stocks based on past datasets and not the future direction of prices of newly listed stocks. In this paper, we add sentiment analysis to deep reinforcement learning and build the SADQN-R and SADQN-S models, which we test on two test sets different from the training set. We also put three statistically based methods that can be applied to newly listed stocks to the test on two test sets. The initial asset is set to 1 and the agent selects the trading action in the investment timeframe. After the trading timeframe is over, we get a final asset. Two methods, SADQN-R and SADQN-S, are proposed in this paper. In the first test set, SADQN-R gained 1.0204 total assets and SADQN-S gained 1.1229 total assets. In the first test set, SADQN-S received the largest final total assets and SADQN-R ranked third. In the second test set, SADQN-R received 1.0751 total assets and SADQN-S received 1.1054 total assets. In the second test set, SADQN-S received the largest final total assets and SADQN-R ranked fourth. Thus, we can tell that SADQN-S outperforms several other methods, while SADQN-R does not get a particularly good performance. There are not many methods that can be applied to newly listed stocks, and we found three ways. In the first test set, none of the 3 methods had final total assets exceeding 1. That is, these three methods do not increase the initial assets. In the second test set, the final total assets for the three methods were 1.0782, 1.0782, 1.0769, respectively. The results of all three methods were less than 1.1054 for SADQN-S. The final results show that SADQN-S has the best performance. This shows that our model can help stockholders to get high returns in newly listed stocks.

7. Conclusions

Our method can help stock investors make stable and high returns. However, our model has shortcomings. In this paper, we directly use the raw data to get the state, without processing the raw data. In fact, even if we only used two years of data for training, it still contains a huge amount of information for the agent. Letting the agent learn all the information from the complex huge amount of data is difficult. Therefore, in our next work, we will improve our model for this problem. We will add a module used to process the raw data before inputting the state to help the agent get more information about the stock market. In this paper, we have used raw stock data and comment text data. In our next work, we plan to add images to the model. These images can represent the overall direction of the market.

For the implementation of the method in this paper, the advantage is that less data is required. For the application, we only need stock data and comment text data for the last 32 days. On the other hand, there is a limitation to the implementation of the method in this paper. The limitation is that our method relies on the sentiments expressed in the comment text. If some people make nonsense comments, it can lead to a large bias in the sentiment data we get. Although such comments make up only a small percentage of all comments, this does have an impact on the accuracy of the model. In our next work, we will address this issue by proposing a module to handle comments that express too extreme emotions. At the same time, we expect more scholars to join in stock prediction research and propose better methods.

Author Contributions

Conceptualization, S.D.; methodology, S.D.; software, S.D.; validation, S.D.; data curation, S.D.; writing—original draft preparation, S.D.; writing—review and editing, S.D. and H.S.; visualization, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw stock data used in this paper is available from https://www.eastmoney.com/ (accessed on 10 July 2024). The stock comment data collected in this paper will be provided by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Relevant parameters and meaning of parameters in the paper.

Name of Parameters	Meaning of Parameters
$S_{t}$	state at moment $t$
$A_{t}$	action at moment $t$
$R_{t}$	reward at moment $t$
$G_{t}$	return at moment $t$
$π$	policy
$V_{π}$	state-value function
$Q_{π}$	action-value function
$ρ_{c}^{t}$	value function vector of company $c$ at moment $t$
$α_{c}$	asset allocation weight
$I_{c}^{t + 1}$	percentage increase in closing price of company $c$ at moment $t + 1$
$h_{c}^{t}$	average sentiment analysis index for company $c$ at moment $t$

Algorithm A1. DQN

1. Initialize replay memory

D

to capacity

N

2. Initialize action-value function

Q

with random weights

θ

3. Initialize target action-value function

\hat{Q}

with weights

θ^{-} = θ

4. For episode

= 1, M

do

5. Initialize sequence

s_{1} = \{x_{1}\}

and preprocessed sequence

Φ_{1} = Φ (s_{1})

6. For

t = 1, T

do

7. With probability

ε

select a random action

a_{t}

8. otherwise select

a_{t} = {argmax}_{a} Q (Φ (s_{t}), a; θ)

9. Execute action

a_{t}

, in emulator and observe reward

r_{t}

, and image

x_{t + 1}

10. Set

s_{t + 1} = s_{t}, a_{t}, x_{t + 1}

and preprocess

Φ_{t + 1} = Φ (s_{t + 1})

11. Store transition

(Φ_{t}, a_{t}, r_{t}, Φ_{t + 1})

in

D

12. Sample random minibatch of transitions

(Φ_{j}, a_{j}, r_{j}, Φ_{j + 1})

from

D

13. Set

y_{j} = \{\begin{matrix} r_{j} & if episode terminates at step j + 1 \\ r_{j} + γ \max_{a} \hat{Q} (Φ_{j + 1}, a ’; θ^{-}) & otherwise \end{matrix}

14. Perform a gradient descent step on

{(y_{j} - Q (Φ_{j}, a_{j}; θ))}^{2}

with respect to the network parameters

θ

15. Every

C

steps reset

\hat{Q} = Q

16. End For

17. End For

Algorithm A2. State-0-1

Require: INPUT_FILE_PATH: Path to Input File

Ensure: Processed data is written to the output file in the specified directory

1. OUTPUT_DIRECTORY

\leftarrow

Path to Output Directory

2. ORDER

\leftarrow

32

3. NUM_INTERVALS

\leftarrow

(ORDER − 2)/2

4. INTERVAL_LENGTH

\leftarrow

1/NUM_INTERVALS

5. TRADE_START_INDEX

\leftarrow

ORDER-NUM_INTERVALS

6. excel_data

\leftarrow

ReadExcel(file_path, usecols =

7. [‘closing_price’, ‘volume’])

8. closing_price,volume excel_data[‘closing_price’], excel_data[‘volume’]

9. max_price, min_price

\leftarrow

max(closing_price), min(closing_price)

10. max_volume, min_volume

\leftarrow

max(volume), min(volume)

11. for i = 0, len(closing_price) − 1 do

12. closing_price[i]

\leftarrow

(closing_price[i] − min_price)/(max_price-min_price)

13. volume[i]

\leftarrow

(volume[i] − min_volume)/(max_volume − min_volume)

14. end for

15. matrix_list

\leftarrow

[ ]

16. length

\leftarrow

len(closing_price)

17. for day = 0, length − ORDER do

18. matrix

\leftarrow

zeros((ORDER, ORDER))

19. for i = 0, ORDER − 1 do

20. closing_interval

\leftarrow

determine_interval(closing_price[day+i], NUM_INTERVALS)

21. volume_interval

\leftarrow

determine_interval(volume[day+i], NUM_INTERVALS)

22. matrix[NUM_INTERVALS - closing_interval, i]

\leftarrow

1

23. matrix[TRADE_START_INDEX + NUM_INTERVALS − volume_interval, i]

\leftarrow

1

24. end for

25. matrix_list.append(matrix)

26. end for

27. with open(OUTPUT_DIRECTORY + ‘/output.txt’, ‘a’):

28. tmp

\leftarrow

‘ ’ if is_first_company else ‘F’

29. for matrix_index, matrix in enumerate(matrix_list) do

30. for i, row in enumerate(matrix) do line

\leftarrow

row.tolist()

31. tmp

\leftarrow

tmp + ‘ ’ .join([str(int(c)) For c in line]) + ‘ ’

32. end for

33. tmp

\leftarrow

tmp + ‘E’ if matrix_index

\neq

len(matrix_list) − 1 else ‘ ’

34. end for

35. f.write(tmp)

Algorithm A3. Portfolio

1. Initialize

α \leftarrow

0

2. for all

c

do

3. if

ρ_{c} [0] \geq ρ_{c} [1]

and

ρ_{c} [0] \geq ρ_{c} [2]

then

4.

α_{c} \leftarrow 1

5. else if

ρ_{c} [2] \geq ρ_{c} [0]

and

ρ_{c} [2] \geq ρ_{c} [1]

then

6.

α_{c} \leftarrow - 1

7. end if

8. end for

9.

μ_{n} \leftarrow \frac{1}{N} \sum_{c = 1}^{N} α_{c}

10.

α_{c} \leftarrow α_{c} - μ_{n}

for all

c

11.

\sum_{n} \leftarrow \sum_{c = 1}^{N} |α_{c}|

12.

α_{c} \leftarrow α_{c} / \sum_{n}

for all

c

References

Adebiyi, A.A.; Adewumi, A.O.; Ayo, C.K. Comparison of arima and artificial neural networks models for stock price prediction. J. Appl. Math. 2014, 2014, 614342. [Google Scholar] [CrossRef]
Yan, X.; Zhang, G. Application of kalman filter in the prediction of stock price. In 5th International Symposium on Knowledge Acquisition and Modeling (KAM 2015); Atlantis Press: Amsterdam, The Netherlands, 2015; pp. 197–198. [Google Scholar] [CrossRef]
Zhang, Z.; Hong, W.-C. Electric load forecasting by complete ensemble empirical mode decomposition adaptive noise and support vector regression with quantum-based dragonfly algorithm. Nonlinear Dyn. 2019, 98, 1107–1136. [Google Scholar] [CrossRef]
Zhang, Z.; Hong, W.-C. Application of variational mode decomposition and chaotic grey wolf optimizer with support vector regression for forecasting electric loads. Knowl.-Based Syst. 2021, 228, 107297. [Google Scholar] [CrossRef]
Adnan, R.M.; Dai, H.-L.; Mostafa, R.R.; Parmar, K.S.; Heddam, S.; Kisi, O. Modeling multistep ahead dissolved oxygen concentration using improved support vector machines by a hybrid metaheuristic algorithm. Sustainability 2022, 14, 3470. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef] [PubMed]
Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 2018, 37, 421–436. [Google Scholar] [CrossRef]
Zhang, M.; McCarthy, Z.; Finn, C.; Levine, S.; Abbeel, P. Learning deep neural network policies with continuous memory states. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 520–527. [Google Scholar] [CrossRef]
Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1–40. [Google Scholar]
Lenz, I.; Knepper, R.A.; Saxena, A. DeepMPC: Learning deep latent features for model predictive control. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015; pp. 10–25. [Google Scholar]
Guo, H. Generating text with deep reinforcement learning. arXiv 2015, arXiv:1510.09202. [Google Scholar]
Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; Jurafsky, D. Deep reinforcement learning for dialogue generation. arXiv 2016, arXiv:1606.01541. [Google Scholar]
Narasimhan, K.; Kulkarni, T.; Barzilay, R. Language understanding for text- based games using deep reinforcement learning. arXiv 2015, arXiv:1506.08941. [Google Scholar]
Sallab, A.E.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. arXiv 2017, arXiv:1704.02532. [Google Scholar] [CrossRef]
Caicedo, J.C.; Lazebnik, S. Active object localization with deep reinforcement learning. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2488–2496. [Google Scholar] [CrossRef]
Lee, J.; Kim, R.; Koh, Y.; Kang, J. Global stock market prediction based on stock chart images using deep Q-network. IEEE Access 2019, 7, 167260–167277. [Google Scholar] [CrossRef]
Chen, L.; Gao, Q. Application of deep reinforcement learning on automated stock trading. In Proceedings of the 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 18–20 October 2019; pp. 29–33. [Google Scholar] [CrossRef]
Li, Y.; Ni, P.; Chang, V. Application of deep reinforcement learning in stock trading strategies and stock forecasting. Computing 2020, 102, 1305–1322. [Google Scholar] [CrossRef]
Carta, S.; Corriga, A.; Ferreira, A.; Podda, A.S.; Recupero, D.R. A multi-layer and multi-ensemble stock trader using deep learning and deep reinforcement learning. Appl. Intell. 2021, 51, 889–905. [Google Scholar] [CrossRef]
Yu, X.; Wu, W.; Liao, X.; Han, Y. Dynamic stock-decision ensemble strategy based on deep reinforcement learning. Appl. Intell. 2023, 53, 2452–2470. [Google Scholar] [CrossRef] [PubMed]
Long, J.; Chen, Z.; He, W.; Wu, T.; Ren, J. An integrated framework of deep learning and knowledge graph for prediction of stock price trend: An application in Chinese stock exchange market. Appl. Soft Comput. 2020, 91, 106205. [Google Scholar] [CrossRef]
Liu, Q.; Tao, Z.; Tse, Y.; Wang, C. Stock market prediction with deep learning: The case of China. Financ. Res. Lett. 2022, 46, 102209. [Google Scholar] [CrossRef]
Tetlock, P.E. Cognitive style and political ideology. J. Personal. Soc. Psychol. 1983, 45, 118–126. [Google Scholar] [CrossRef]
Kim, S.-M.; Hovy, E. Determining the sentiment of opinions. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics; COLING: Geneva, Switzerland, 2004; pp. 1367–1373. [Google Scholar]
Baker, M.; Wurgler, J. Investor sentiment and the cross-section of stock returns. J. Financ. 2006, 61, 1645–1680. [Google Scholar] [CrossRef]
Rupande, L.; Muguto, H.T.; Muzindutsi, P.-F. Investor sentiment and stock return volatility: Evidence from the johannesburg stock exchange. Cogent Econ. Financ. 2019, 7. [Google Scholar] [CrossRef]
Gite, S.; Khatavkar, H.; Kotecha, K.; Srivastava, S.; Maheshwari, P.; Pandey, N. Explainable stock prices prediction from financial news articles using sentiment analysis. Peer J Comput. Sci. 2021, 7, 340. [Google Scholar] [CrossRef]
Zhu, E. BERTopic-Driven Stock Market Predictions: Unraveling Sentiment Insights. arXiv 2024, arXiv:2404.02053. [Google Scholar]
Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; Dai, Q. Deep Direct Reinforcement Learning for Financial Signal Representation and Trading. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 653–664. [Google Scholar] [CrossRef]
Jiang, Z.; Xu, D.; Liang, J. A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv 2017, arXiv:1706.10059. [Google Scholar]
Shin, H.G.; Ra, I.; Choi, Y.H. A Deep Multimodal Reinforcement Learning System Combined with CNN and LSTM for Stock Trading. In Proceedings of the 2019 International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Republic of Korea, 16–18 October 2019; pp. 7–11. [Google Scholar] [CrossRef]
Wang, Z.; Huang, B.; Tu, S.; Zhang, K.; Xu, L. DeepTrader: A Deep Reinforcement Learning Approach for Risk-Return Balanced Portfolio Management with Market Conditions Embedding. Proc. AAAI Conf. Artif. Intell. 2021, 35, 643–650. [Google Scholar] [CrossRef]
Wang, J.; Zhang, Y.; Tang, K.; Wu, J.; Xiong, Z. AlphaStock: A Buying-Winners-and-Selling-Losers Investment Strategy using Interpretable Deep Reinforcement Attention Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1900–1908. [Google Scholar] [CrossRef]
Millea, A. Deep Reinforcement Learning for Trading-A Critical Survey. Data 2021, 6, 119. [Google Scholar] [CrossRef]
Pricope, T.V. Deep Reinforcement Learning in Quantitative Algorithmic Trading: A Review. arXiv 2021, arXiv:2106.00123. [Google Scholar]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, UK, 2018; pp. 10–11. [Google Scholar]
Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning; PMLR: Birmingham, UK, 2016; pp. 1995–2003. Available online: https://proceedings.mlr.press/v48/wangf16.html (accessed on 10 July 2024).
Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning; PMLR: Birmingham, UK, 2016; pp. 1928–1937. Available online: https://proceedings.mlr.press/v48/mniha16.html (accessed on 10 July 2024).
Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. Proc. AAAI Conf. Artif. Intell. 2018, 32. [Google Scholar] [CrossRef]
Horgan, D.; Quan, J.; Budden, D.; Barth-Maron, G.; Hessel, M.; Van Hasselt, H.; Silver, D. Distributed prioritized experience replay. arXiv 2018, arXiv:1803.00933. [Google Scholar]
Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. Proc. AAAI Conf. Artif. Intell. 2016, 30. [Google Scholar] [CrossRef]
Cavalcante, R.C.; Brasileiro, R.C.; Souza, V.L.; Nobrega, J.P.; Oliveira, A.L. Computational intelligence and financial markets: A survey and future directions. Expert Syst. Appl. 2016, 55, 194–211. [Google Scholar] [CrossRef]
Lim, B.; Zohren, S.; Roberts, S. Enhancing time series momentum strategies using deep neural networks. arXiv 2019, arXiv:1904.04912. [Google Scholar]
Leippold, M.; Wang, Q.; Zhou, W. Machine learning in the Chinese stock market. J. Financ. Econ. 2022, 145, 64–82. [Google Scholar] [CrossRef]
Khan, S.; Alghulaiakh, H. ARIMA model for accurate time series stocks forecasting. Int. J. Adv. Comput. Sci. Appl. 2020, 11. [Google Scholar] [CrossRef]
Afeef, M.; Ihsan, A.; Zada, H. Forecasting stock prices through univariate ARIMA modeling. NUML Int. J. Bus. Manag. 2018, 13, 130–143. [Google Scholar]
Raudys, A.; Pabarškaitė, Ž. Optimising the smoothness and accuracy of moving average for stock price data. Technol. Econ. Dev. Econ. 2018, 24, 984–1003. [Google Scholar] [CrossRef]
Rahman, M.A. Forecasting Stock Prices through Exponential Smoothing Techniques in The Creative Industry of The UK Stock Market. Int. J. Asian Bus. Manag. 2024, 3, 323–338. [Google Scholar] [CrossRef]
Martin, R.; Johan, J. Integrated system in forecasting stocks of goods using the exponential smoothing method. J. Appl. Bus. Technol. 2021, 2, 13–27. [Google Scholar] [CrossRef]
Abraham, R.; Samad, M.E.; Bakhach, A.M.; El-Chaarani, H.; Sardouk, A.; Nemar, S.E.; Jaber, D. Forecasting a stock trend using genetic algorithm and random forest. J. Risk Financ. Manag. 2022, 15, 188. [Google Scholar] [CrossRef]
Yin, L.; Li, B.; Li, P.; Zhang, R. Research on stock trend prediction method based on optimized random forest. CAAI Trans. Intell. Technol. 2023, 8, 274–284. [Google Scholar] [CrossRef]
Li, X.; Sun, Y. Stock intelligent investment strategy based on support vector machine parameter optimization algorithm. Neural Comput. Appl. 2020, 32, 1765–1775. [Google Scholar] [CrossRef]
Nabi, R.M.; Soran Ab, M.S.; Harron, H. A novel approach for stock price prediction using gradient boosting machine with feature engineering (gbm-wfe). Kurd. J. Appl. Res. 2020, 5, 28–48. [Google Scholar] [CrossRef]

Figure 1. State-0-1 matrix for Stock #601318 at 22 February 2024.

Figure 2. For the sake of convenience, we take the 10th order matrix as an example to make a diagram. In the behavior network, the blue color in the top and bottom 4 rows of the input

S_{t}

represents the number 1, the yellow colour in the middle 2 rows represents the number 1, and the rest of the white color represents the number 0. The input State-0-1 matrix, through the convolutional, ReLu, and FC layers, finally outputs a value vector

ρ_{c}^{t}

, which predicts the value function of each of the three actions. The agent selects the action corresponding to the maximum value and then updates the network weights according to the loss functions.

Figure 2. For the sake of convenience, we take the 10th order matrix as an example to make a diagram. In the behavior network, the blue color in the top and bottom 4 rows of the input

S_{t}

represents the number 1, the yellow colour in the middle 2 rows represents the number 1, and the rest of the white color represents the number 0. The input State-0-1 matrix, through the convolutional, ReLu, and FC layers, finally outputs a value vector

ρ_{c}^{t}

, which predicts the value function of each of the three actions. The agent selects the action corresponding to the maximum value and then updates the network weights according to the loss functions.

Figure 3. State-0-1 matrix for Stock #601318 at 22 February 2024 (SADQN-S).

Figure 4. Flowchart for obtaining

h_{c}^{t}

.

Figure 4. Flowchart for obtaining

h_{c}^{t}

.

Figure 5. The 90 stocks are divided into two parts, which have 60 stocks and 30 stocks, respectively. Then, we divided these 30 stocks into three groups, A, B, and C. This grouping is to make the test set independent of both the training and validation sets.

Figure 6. The training process of the experiment (using SADQN-S as an example).

Figure 7. The testing process of the experiment (using SADQN-S as an example).

Figure 8. The test results of DQN and DDQN in Group B and Group C.

Figure 9. Test results of DQN, DDQN, SADQN-R, SADQN-S, and 3 statistically based methods in Groups B and C.

Table 1. The agent uses Q-table to select the action that maximizes the Q-value in state

s

.

Table 1. The agent uses Q-table to select the action that maximizes the Q-value in state

s

.

Q-Table	$a_{1}$	$a_{2}$	$a_{3}$	$a_{4}$	$a_{5}$
$s_{1}$	$Q (s_{1}, a_{1})$	$Q (s_{1}, a_{2})$	$Q (s_{1}, a_{3})$	$Q (s_{1}, a_{4})$	$Q (s_{1}, a_{5})$
$s_{2}$	$Q (s_{2}, a_{1})$	$Q (s_{2}, a_{2})$	$Q (s_{2}, a_{3})$	$Q (s_{2}, a_{4})$	$Q (s_{2}, a_{5})$
$s_{3}$	$Q (s_{3}, a_{1})$	$Q (s_{3}, a_{2})$	$Q (s_{3}, a_{3})$	$Q (s_{3}, a_{4})$	$Q (s_{3}, a_{5})$
$s_{4}$	$Q (s_{4}, a_{1})$	$Q (s_{4}, a_{2})$	$Q (s_{4}, a_{3})$	$Q (s_{4}, a_{4})$	$Q (s_{4}, a_{5})$
$s_{5}$	$Q (s_{5}, a_{1})$	$Q (s_{5}, a_{2})$	$Q (s_{5}, a_{3})$	$Q (s_{5}, a_{4})$	$Q (s_{5}, a_{5})$
$\dots$

Table 2. The number of stock codes (118) and total number of comments for each stock.

Securities Code	Amount of Comments	Securities Code	Amount of Comments	⋯	Securities Code	Amount of Comments
000001	79,722	000858	118,561	⋯	601916	22,295
000002	82,502	000983	42,376	⋯	601916	22,295
000039	13,332	001872	3735	⋯	601939	16,470
000063	121,742	001979	22,767	⋯	601985	38,764
000333	55,628	002142	23,694	⋯	601988	24,239
000568	41,006	002304	34,658	⋯	601995	36,199
000617	68,873	002352	33,436	⋯	603288	53,185
000651	144,022	002415	70,002	⋯	603993	41,070
000708	6328	002475	61,227	⋯	688981	65,854
000776	44,098	002714	68,077	⋯	900948	6934

Table 3. The hyperparameter settings for the SADQN-S.

Hyperparameter	Meaning of Hyperparameter	Value
maxiter	The maximum number of iterations	550,000
batch_size	The batch size	32
buffer_size	The capacity of the memory buffer	1000
learning_rate	The learning rate	0.001
gamma	The discount factor	0.99

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Du, S.; Shen, H. A Stock Prediction Method Based on Deep Reinforcement Learning and Sentiment Analysis. Appl. Sci. 2024, 14, 8747. https://doi.org/10.3390/app14198747

AMA Style

Du S, Shen H. A Stock Prediction Method Based on Deep Reinforcement Learning and Sentiment Analysis. Applied Sciences. 2024; 14(19):8747. https://doi.org/10.3390/app14198747

Chicago/Turabian Style

Du, Sha, and Hailong Shen. 2024. "A Stock Prediction Method Based on Deep Reinforcement Learning and Sentiment Analysis" Applied Sciences 14, no. 19: 8747. https://doi.org/10.3390/app14198747

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Stock Prediction Method Based on Deep Reinforcement Learning and Sentiment Analysis

Abstract

Featured Application

Abstract

1. Introduction

2. Related Work

3. Preliminaries

3.1. Neural Networks

3.2. CNN

3.3. Deep Reinforcement Learning

3.3.1. Basic Knowledge

3.3.2. DQN and DDQN

3.4. Sentiment Analysis

4. Methodology

4.1. Modeling the Environment for RL Task

4.2. Network Settings for Deep Learning

4.3. Methods

5. Experiments

5.1. Dataset

5.1.1. Data Source

5.1.2. Data Preprocessing

5.2. Comparison Experiments

5.2.1. Comparison Experiment between DQN and DDQN

5.2.2. Comparison Experiment for All Methods

6. Discussion

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI