1. Introduction
According to the efficient market hypothesis [1], stock prices already incorporate all available valuable information, which implies that predicting stock prices from historical price data is not feasible and that fundamental analysis or technical indicators may not be effective in an efficient market. However, there is also considerable evidence for the opposite view. For example, Pedersen’s study shows that those who are better at processing information have an edge in stock market investment [2]. Nevertheless, the random volatility of financial time series makes it difficult for researchers to characterize them comprehensively, and how to analyze market information comprehensively to produce more accurate forecasts remains an open issue in the field of stock price forecasting.
In the early stages of research, scholars predominantly employed conventional statistical models, including the autoregressive moving average (ARMA) model, the autoregressive integrated moving average (ARIMA) model, and the generalized autoregressive conditional heteroskedasticity (GARCH) model. These classical statistical models remain relevant in contemporary predictive research. For example, Rounaghi and Nassir Zadeh applied the ARMA model to forecast monthly and yearly stock return time series in the S&P 500 and London Stock Exchange [3]. Herwartz employed the GARCH model to predict stock returns and, through independence testing, obtained useful information for signaling one-step-ahead directions of stock price changes [4].
However, as financial markets have evolved and diversified, the complexity of financial time series has increased, and traditional econometric models appear inadequate for contemporary research. To cope with higher data precision and complexity, machine learning models have been employed to predict financial time series. Earlier works often used models such as support vector machines (SVM), artificial neural networks (ANN), random forest (RF), and extreme gradient boosting (XGBoost). For instance, Qiu et al. adapted an artificial neural network to predict the return of the Japanese Nikkei 225 index, and the result outperformed the traditional BP training algorithm [5]. Zhou et al. proposed a novel approach that integrates complete ensemble empirical mode decomposition with adaptive noise and XGBoost to forecast crude oil prices [6].
With continuous breakthroughs in computing power and data capacity, an increasing number of studies employ deep learning models for the prediction of financial time series [7,8]. Many studies indicate that deep neural networks can better handle financial time series, especially the long short-term memory (LSTM) network introduced by Hochreiter and Schmidhuber in 1997 [9]. The application of LSTM to stock price prediction and financial forecasting has further advanced the study of deep learning in the financial domain. For example, Wu et al. applied LSTM and its variant models to predict Bitcoin prices [10], and Kim and Won proposed a combined LSTM model to predict the volatility of financial markets [11]. To date, LSTM remains one of the most widely used techniques in time series prediction, and it still has significant untapped potential.
When deep learning is applied to financial time series prediction, the selection of model input features is one of the most crucial issues. The choice of input features directly affects the model’s ability to learn the inherent correlations between time series. In previous works, the input features of the models typically included stock volume and price data. For instance, Barua and Sharma introduced technical indicators based on market data and used a CNN-BiLSTM model to predict index close prices [12]. Wang, W.Y. et al. constructed multiple input features using price data and selected the optimal combination of input features for prediction [13].
Recent studies aim to improve and diversify the selection of input features. With the development of natural language processing technologies in particular, data collection and processing methods are becoming increasingly diverse. Researchers are no longer limited to analyzing stock fundamental information and technical indicators, and the study of market sentiment is receiving growing attention. Researchers have begun to collect text information, especially financial market news, to analyze the stock market. Many studies have shown the effectiveness of this approach for predicting stock trends (e.g., [14,15]). Results of previous works (e.g., [15,16,17,18]) suggest that using both market data and news-based information is helpful for the market prediction problem.
Researchers are not confined to the sentiment of news; the analysis of retail investors’ sentiment derived from social media has also become a focal point. For example, Poongodi et al. developed a tweet node algorithm to construct a network of tweet nodes, aiming to extract potential associations in Twitter data for stock market prediction [16]. Poongodi et al. also analyzed typical trends in online communities and social media platforms to extract insights that could be used to predict cryptocurrency price movement trends [17]. However, there is still room for improvement in enriching the data sources for sentiment analysis and in refining and standardizing market sentiment analysis methods.
Regarding technical applications, earlier sentiment analysis in financial markets relied mainly on manually annotated dictionaries to analyze the sentiment of financial texts [18,19]. With the development of deep learning, many deep learning models have been applied to text analysis and have achieved significant results. For example, Daudert introduced an adaptive feedforward neural network that utilizes recorded text and contextual information for fine-grained sentiment analysis [20]. Jing et al. used a CNN-based model for sentiment analysis of financial texts [21].
The transformer model in particular, owing to its ability to capture long-range dependencies and thus analyze semantics more effectively, has significantly advanced natural language processing technologies. In recent research, transformer-based natural language processing methods have shown promising results in financial text data analysis. In particular, Google’s BERT model [22], a transformer-based pre-trained model, made remarkable progress in natural language processing and has been applied to sentiment analysis of financial texts in many studies (e.g., [23,24]). For instance, Hiew et al.’s study shows that a BERT-based sentiment analysis approach is superior to models such as FastText or a multichannel convolutional neural network (CNN) [25]. However, there is still significant room for research and exploration of BERT’s application in the financial market.
Regarding analysis methods, previous works on market sentiment mainly focused on sentiment classification. Based on existing techniques for sentiment polarity analysis, text sentiment is classified into positive, neutral, or negative categories, and the number of texts with different sentiment tendencies is used to calculate sentiment scores as model inputs for stock price prediction [26].
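As a minimal illustration of such count-based scoring, the following Python sketch computes a daily sentiment score from polarity labels. The formula (positive count minus negative count, divided by the total number of texts) and the names polarity_score and daily_labels are assumptions made for illustration only; they are not taken from [26] and do not represent the sentiment index proposed in this paper.

```python
from collections import Counter


def polarity_score(labels):
    """Return (positive - negative) / total for a list of polarity labels.

    Assumed illustrative formula: neutral texts count toward the total
    but contribute nothing to the numerator. An empty list scores 0.0.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return (counts["positive"] - counts["negative"]) / total


# Hypothetical polarity labels for the texts published on each trading day.
daily_labels = {
    "2023-01-03": ["positive", "positive", "neutral", "negative"],
    "2023-01-04": ["negative", "negative", "neutral"],
}

# One score per day; such a series could serve as an input feature
# alongside price-based indicators.
daily_scores = {day: polarity_score(labels) for day, labels in daily_labels.items()}
```

A score of this kind lies in the interval [-1, 1], and it discards everything about a text except its class label.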
Although there have been many attempts to apply sentiment analysis to price prediction, current research still has several shortcomings. Previous works on market sentiment mainly focused on sentiment polarity (positive, negative, or neutral expression), and much research has built on this foundation. For example, Chou split news headlines into words and then analyzed the sentiment polarity of each word to calculate sentiment scores for stock price prediction [27]. Cristescu et al. analyzed the sentiment polarity of news headlines and used a regression model to predict prices [28]. These methods inevitably incur a loss of data accuracy, which is a significant limitation. Moreover, most existing research focused on the market sentiment of the target stock and ignored the sentiment impact of its related sectors. For example, Fazlija and Harder only used news related to an underlying asset to construct sentiment indicators for stock price trend prediction [29]. Deng et al. only used investor sentiment related to an underlying asset for prediction [30]. In addition, previous research mainly used a single source of news or post data, which is relatively limited (e.g., [26,31]). Furthermore, retail investors account for a large percentage of the stock market, yet existing research has largely ignored the impact of this group on market sentiment. How to extract market sentiment information more accurately and comprehensively, and how to make more accurate stock price predictions based on it, is an essential issue in current research.
To address these issues, we propose the BERT-LLA model, which combines sentiment analysis with technical indicators. Following Li, Q. et al. [32], Nassirtoussi et al. [33], and Wang, H. et al. [34], we combine news and investor reviews for market sentiment analysis, while using financial texts from upstream and downstream industries to form multi-channel data. We also propose a comprehensive sentiment index calculation method for combining news and investor comments. We leverage the BERT model for sentiment analysis and calculate the sentiment index series and the technical indicator time series for model prediction. The main contributions of this research are:
We propose a prediction model called BERT-LLA that leverages a pre-trained model for financial sentiment analysis and outperforms the baselines on the test sets.
We propose a comprehensive sentiment index calculation method for combining news and investor comments to standardize the use of these two types of text information.
We consider the impact of market sentiment in the company’s upstream and downstream sectors and propose combining the sentiment of these sectors for stock price prediction, which mitigates the problem of relatively limited text data sources.
We propose a related sector selection method based on semantic similarity and sector heat, which can help us screen related sectors for stock price prediction more intelligently and effectively.
We analyze the impact of the weights of investor comments and news on prediction accuracy and confirm our expectation that news has a more substantial effect on market sentiment. We also obtain relatively optimized values of these weights, which can inform subsequent research on the synergistic effect of investor sentiment and news on market sentiment.
The rest of the paper is organized as follows:
Section 2 describes our proposed model and corresponding details.
Section 3 describes the experimental design and presents the experimental results and discussions.
Section 4 summarizes our work and points out future directions for research.