1. Introduction
In terms of machine learning, time series forecasting, in a broad sense, involves the training and utilization of a model to predict the future values of variables that describe a phenomenon based on historical data. Time series are mathematical formalizations that include sequential and time-dependent observations. In this work, such sequential and time-dependent data are represented by stock market closing prices.
In recent times, financial forecasting seems to be a highly relevant field of research that has the potential to play a critical role in managing risks, making informed decisions, and achieving financial goals, given the increasingly complex and dynamic landscape that constitutes contemporary economies. However, besides this, the financial setting constitutes quite an interesting phenomenon in and of itself from both a modeling and a psychological point of view, inter alia, given the assumption that the various outcomes at their core are informed by social attitudes that are expressed linguistically. Nevertheless, it seems as if this stands out as a characteristic depiction of human agency capable of exemplifying an inherently paranoid aspect of humanity.
In this ever-evolving financial landscape, it is now common knowledge that harnessing the power of deep learning and relevant emotion-related information constitutes a promising path for investigating improvements regarding forecasting endeavors. Furthermore, the emotions and sentiment polarities extracted from posts on social media can be a significantly useful tool for modeling general behavior toward financial markets. Here, the framework to be presented integrates the above two rationales, with, on the one hand, the additional introduction of a thorough benchmarking of a multitude of deep learning and ensemble methods and, on the other, a space of emotion-related features that does not simply contain general sentiment polarities but integrates an extensive description through a subtle multi-label classification system of 28 distinct emotions, representing a variety of emotional attitudes towards each stock examined. Building on our previous work, we first compare the best-performing algorithmic schemes of a benchmarked set of 30 state-of-the-art methods and then present a method for incorporating the aforementioned fine-grained emotion feature exploitation together with a feature selection procedure.
Thus, this work is about comparing and benchmarking a number of state-of-the-art methods that incorporate both classical sentiment analysis and multi-label emotion classification in the task of financial forecasting, as well as proposing a derived methodology that exploits temporal convolutional networks (TCNs) and emotion analysis to improve medium-term stock market closing price forecasts. Specifically, regarding the latter, the proposed scheme consists of the following distinct modules: TCNs, feature selection, sentiment analysis, and a BERT-based [
1] multi-label emotion classifier, all under a multivariate-averaging ensemble scheme. Convolutional networks are a class of neural networks that learn hierarchical features from structured data by applying convolutional operations. Temporal convolutional networks (TCNs) are a neural network architecture designed for processing sequential data, such as time series. These networks capture temporal dependencies and patterns by applying convolutional operations along the temporal dimension, allowing them to learn from the sequential nature of such data. BERT, which stands for bidirectional encoder representations from transformers, is a state-of-the-art natural language processing (NLP) model, a transformer-based neural network architecture specifically designed for language tasks. The selected day-to-day sentiment and emotion scores extracted from related tweets are incorporated into the feature space and used in a multivariate setting to predict the closing prices of 15 stocks. The investigation builds on the results presented in [
2] in the sense that the aforementioned work, which shares the base learners and data of the present one, enables us to reject a fairly large number of methods, keeping only those that perform well. Thus, the experimental framework starts with five top-performing methods and includes the investigation of a number of possible weighted ensemble forecasting procedures. It will be shown that the proposed methodology prevails with respect to every evaluation metric, exhibiting the best overall performance in each of the evaluations. Furthermore, we will see that the use of multivariate inputs containing specific emotional features consistently improves the derived predictions.
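To make the idea of convolving along the temporal dimension concrete, the following is a minimal, purely illustrative sketch of a causal, dilated one-dimensional convolution, the operation commonly used inside TCN layers. It is not the implementation used in this work: the function names, the zero left-padding, and the toy kernel are assumptions made for illustration only.

```python
from typing import List, Sequence


def causal_dilated_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Causal 1-D convolution: output[t] uses only x[t], x[t-d], x[t-2d], ...

    Positions before the start of the series are treated as zeros
    (an illustrative left-padding choice).
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation  # look back k * dilation steps, never forward
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Past steps visible when stacking one convolution per dilation value."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Toy example (hypothetical values, not data from the study).
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = causal_dilated_conv1d(prices, kernel=[0.5, 0.5], dilation=2)
window = receptive_field(kernel_size=2, dilations=[1, 2, 4])
```

Because each output depends only on current and past inputs, no future closing price can leak into a prediction. Stacking layers with growing dilations widens the visible history quickly, which is what makes this family of models attractive for longer forecasting horizons.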
Given the above, in summary, this work can be seen as both the end piece of a rather extensive comparative study and as presenting a concrete, specific methodology. A novel methodology for improved medium-term stock market closing price forecasts that integrates TCNs and the emotion-related features extracted from tweets is introduced. The method presents the final outcome of a thorough benchmarking and comparison process. The latter includes the investigation of a large number of possible ensemble predictors that incorporate a variety of emotion-related multivariate inputs under a plethora of weighted combinatory schemes. It is shown that the presented methodology clearly outperforms every base or ensemble scheme. Additionally, the incorporation of deep learning as well as fine-grained specific emotion polarities under our simple averaging combinatory scheme not only stands out as good applied practice but has the potential to draw a path towards the creation of semantically rich and diverse feature spaces that represent subtle emotion polarities that can potentially be used in various modeling tasks. We show, through various charts and empirical performance validation, that the incorporation of feature selection, sentiment analysis, and multi-label emotion classification leads to significant prediction improvements. The results demonstrate that the inclusion of multivariate inputs containing specific emotional features consistently leads to improvements in accuracy. Hence, the creation of fine-grained, specific, and distinct emotion polarities stands out as a largely beneficial practice that, quite promisingly, could be utilized in various prediction tasks.
Concluding this introduction, the structure of the present work is as follows: First, some related works are listed. Then, in Section 3, the experimental and evaluation procedures are given. Section 4 contains elements of the proposed methodology. Finally, the results and a summary assessment follow.
2. Related Work
In this section, indicative works from the existing literature are briefly introduced. As already mentioned, emotion and sentiment-related representations have been the center of focus in a multitude of diverse research endeavors.
Starting with some indicative works regarding the latest trend in general sentiment and opinion mining, a novel labeling strategy, together with an effective model for structured sentiment analysis consisting of graph attention networks and an adaptive multi-label classifier, is introduced in [
3]. This approach demonstrates significant performance improvements over prior state-of-the-art models on five benchmark datasets across multiple languages. In [
4], a novel multiplex cascade framework for unified aspect-based sentiment analysis (ABSA) that maintains the interaction existing between the various ABSA subtasks is introduced. By hierarchically modeling the subtasks and integrating syntax-aware information, the proposed Syntax-aware Multiplex framework improves ABSA results across 28 subtasks with substantial gains. A method that exploits documents’ latent target-opinion distribution and then leverages fine-grained sentiment analysis principles to enhance document-level sentiment classification is proposed in [
5]. The method, consisting of a variational and a classification part, introduces a hierarchical approach with a variational autoencoder and a transformer-based module, respectively, effectively capturing latent fine-grained target and prior opinion information and achieving state-of-the-art performance on various benchmark datasets. Moreover, in [
6], a Three-hop Reasoning chain-of-thought (CoT) framework is presented for implicit sentiment analysis (ISA), both inspired by and targeting human-like reasoning processes. The method achieves significant improvements, surpassing the state of the art in both supervised and zero-shot setups.
Concerning research on sentiment classifications and economic data, in [
7], FinBERT, a domain-specific language model for natural language processing regarding financial-related tasks, is presented. FinBERT is a state-of-the-art BERT-based language model fine-tuned on financial textual datasets. Such fine-tuning procedures are now common practice, and, actually, the model is also used within the experimental framework of the present work. Furthermore, in [
8], a deep learning architecture that leverages managerial emotion representations formed by speech recognition using FinBERT-based sentiment analysis applied to earnings conference call transcripts is proposed. In [
9], text-based emotion recognition with a focus on deep learning techniques is explored. The work extends existing methods by addressing class imbalances and introducing transfer learning-based strategies, offering comprehensive benchmarking of text-based emotion recognition methods and demonstrating the superiority of deep learning approaches across various datasets. Sentiment polarities generated from tweets are used in [
10] to investigate the impact of Twitter on stock market decisions. For this, a methodology that utilizes financial-based sentiment analysis on relevant and influential Twitter accounts is employed. The study contains comparisons regarding the investigation of correlations between tweets and stock market behavior during the H1N1 and COVID-19 periods. A company-specific model for sentiment analysis in financial data is proposed in [
11]. The model’s architecture is composed of neural networks and aspires to generally detect trend variations in stock prices, transforming pretrained word embeddings that have no financial specificity into embeddings that capture important domain-specific characteristics. A knowledge base extends the financial-related embedding space by enriching the vocabulary. The topic has been investigated relatively extensively with earlier known architectures as well, where various neural network models, such as long short-term memory (LSTM) and convolutional neural networks (CNNs), are employed to model stock market opinions [
12]. A hierarchical data structure and a two-step model are used in [
13] for financial-related aspect classes and corresponding sentiment polarities in sentence prediction, whereas in [
14], a novel semantic and syntactic-enhanced neural model is introduced to improve target sentiment representation regarding bullish or bearish sentiments in the financial domain by incorporating dependency graphs and context words.
Regarding multi-label emotion analysis-related works, in [
15], an emotion prediction framework consisting of a prompt-based generative multi-label emotion prediction model is presented, demonstrating competitive results on two datasets. In [
16], a novel model called SpanEmo that treats multi-label emotion classification as a span prediction task is introduced. The introduced strategy, in broad terms, aims to present an enhanced model with the capacity to represent the underlying existing associations between emotions as labels and sentences. A topic-enhanced capsule network for multi-label emotion detection consisting of a variational autoencoder that learns latent topic information and a capsule module capturing the corresponding emotion features is introduced in [
17]. The proposed method significantly outperforms a variety of previous methods and strong baseline schemes on two benchmark datasets, demonstrating top-level performance. Additionally, a latent emotion memory network (LEM) for multi-label emotion classification that can learn latent emotion distribution without relying on external sources and can efficiently incorporate it into the classification network is presented in [
18]. The results from experiments on two benchmark datasets indicate that the suggested model demonstrates state-of-the-art behavior, outperforming well-established baselines.
Moving on to papers regarding financial forecasting and sentiment analysis, a comprehensive literature-based study on investor sentiment analytics and machine learning applied to predict stock prices is presented in [
19]. Additionally, review-wise, the work in [
20] presents a critical literature review of text mining and sentiment analysis for stock market prediction. A systematic review of works that apply machine learning and text mining techniques to news data to predict the stock market is presented in [
21]. The study identifies gaps and barriers in the field while highlighting the increasing use of artificial neural networks and advanced natural language processing methods and opportunities for future research. In [
22], a sentiment-annotated dataset containing textual data related to Bitcoin taken from Reddit is proposed. The dataset is used to evaluate relevant crypto price change forecasts by incorporating various architectures, such as recurrent neural networks (RNNs) and transformers. A work based on using stock-specific news synopses, together with extracted sentiment features to predict stock prices, is presented in [
23]. The study aspires to present a forecasting framework that positively exploits various stock-related aspects, such as discretized stock price movements, valence sentiment analysis, and sentiment polarities. Moreover, ref. [
24] introduces weak supervision in financial forecasting, investigating the incorporation of both sentiment analysis (performed on news and social media data) and machine learning methods to the task of cryptocurrency price prediction. Inter alia, the paper employs a BERT classifier to extract sentiment scores, which are then included in a model for predicting daily returns. In [
25], again, various past stock-price values and a pretrained BERT model are utilized in combination under a predictive scheme that employs LSTM neural networks. The setup introduces features that contain sentiment scores extracted from news and a relevant online forum, as well as other stock-related historical information such as the opening, closing, highest, and lowest prices. In [
26], a new dataset for stock market emotion detection is presented. The set contains data consisting of 12 fine-grained emotion classes concerning investor emotion. The impact of investor emotions extracted is investigated within a time series forecasting setup.
Regarding the architecture that is the core of the methodology proposed here, temporal convolutional networks (TCNs) are used in various forecasting endeavors. In [
27], a temporal convolution network model is proposed for multivariable time series prediction, with the authors presenting results that suggest prediction accuracy improvements. The model is employed in a sequence-to-sequence layout applied to nonperiodic datasets. Multichannel residual blocks in parallel with a deep convolution neural network-based asymmetric structure are presented. Moreover, regarding short-term energy load forecasting, a model based on a temporal convolutional network and a light gradient boosting machine (LightGBM) is proposed in [
28]. The TCN is used over the input features to model the underlying information and long-term temporal dependencies. Then, a LightGBM is utilized to predict energy loads. In [
29], state-of-the-art temporal convolutional networks are utilized to forecast weather, outperforming LSTMs and various other machine learning architectures. Lastly, in [
30], an investor attention factor is employed by combining various trading information as the input and utilizing a temporal convolutional network to predict volatility under high-frequency financial data, and a novel technique combining temporal convolutional networks and recurrent neural networks (RNNs) for greenhouse crop yield prediction is presented in [
31].
Closing this literature review, the above indicative listing of relevant works is far from exhaustive. The reader is therefore urged to consult the relevant literature further.
3. Experimental & Evaluation Framework
The central problem of this work is the modeling of specific financial time series containing stock market closing prices in order to predict their future fluctuations. The task is treated here as a regression problem.
A time series forecasting task can be formally described as follows: Given a set of time series observations $X = \{x_1, x_2, \ldots, x_T\}$, where $x_t$ is the observation at time $t$, and a set of timestamps $\{1, 2, \ldots, T\}$, the goal is to build a forecasting model, $F$, that can predict its future fluctuations $x_{T+h}$, where $h \geq 1$. This forecasting model, $F$, can be expressed as:

$$\hat{x}_{t+h} = F(X_{1:t}, h)$$

where $\hat{x}_{t+h}$ represents the prediction at time $t+h$, $X_{1:t} = \{x_1, \ldots, x_t\}$ is the historical time series data up to time $t$, and $h$ is the forecast horizon, that is, the number of future time steps. The objective here is to train and evaluate a model ($F$) in order to be able to extract predictions that minimize the differences between $\hat{x}_{t+h}$ and the actual observed values $x_{t+h}$ for various $h$.
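As a rough illustration of how such a formulation is typically turned into training examples, the sketch below pairs a window of past observations with the value that lies a given number of steps ahead. The fixed `lookback` window is an assumption introduced here for illustration (the formal description conditions on all data up to the current time), and the function name and toy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def make_supervised(
    series: Sequence[float], lookback: int, horizon: int
) -> Tuple[List[List[float]], List[float]]:
    """Pair each window of `lookback` past values with the value `horizon` steps ahead."""
    if lookback < 1 or horizon < 1:
        raise ValueError("lookback and horizon must be >= 1")
    inputs: List[List[float]] = []
    targets: List[float] = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        end = start + lookback  # exclusive; the window ends at time index end - 1
        inputs.append(list(series[start:end]))
        targets.append(series[end - 1 + horizon])  # observed value h steps later
    return inputs, targets


# Toy closing prices (hypothetical values, not data from the study).
closing = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
X, y = make_supervised(closing, lookback=3, horizon=2)
```

A model fitted on such pairs is then scored by how far its outputs fall from the held-out targets, which is the minimization objective stated above.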
As was already mentioned, the present investigation has its starting point in previously drawn conclusions in terms of creating a set of well-performing methods to test as base learners. Specifically, in [
2], from a comparison of the 30 state-of-the-art methods for time series forecasting, as depicted in
Table A1, a multivariate temporal convolutional network-based method exploiting sentiment analysis was proposed for the task of stock market forecasting. In addition, four more methods stood out. Furthermore, in the same work, we saw that, in general, and as the prediction time window widens, the use of sentiment modeling features improved the predictions.
Given the above, in this paper, a multivariate stock market forecasting methodology based on a variation of the aforementioned temporal convolutional network is proposed. The methodology now exploits both sentiment analysis and a multi-label emotion classification scheme based on BERT applied to stock-related data extracted from Twitter. A series of predictions incorporating various emotion-related time series is first produced and then integrated into an average-weighted scheme, the elements of which are obtained after a feature selection process. The results indicate a general dominance of the proposed method in every tested case and in all metrics. The latter resulted from an extensive evaluation of the outputs of a variety of ensemble configurations compared to our proposed methodology.
3.1. Framework Outline
In short, the experimental framework and evaluation process of the compared methodologies were as follows: We started with a set of five algorithms to be used as base learners, namely those that exhibited the best behavior in our aforementioned previous research. Then, using three sentiment analysis techniques (Vader, TextBlob, and FinBERT) as well as a multi-label classifier of 28 different emotions that we created by fine-tuning the BERT model, a multitude of sentiment polarities on the one hand, and emotion-related outputs on the other, were extracted from stock-related Twitter data. For each of these outputs, a daily average was calculated in order to create a corresponding time series with one observation per day. Then, for each stock, a dataset with 65 features was formed, consisting of the closing prices, the above sentiment and emotion-related features, and their weekly rolling mean versions. Moreover, for each stock and corresponding dataset, all possible two-feature combinations were extracted under the following rule: every combination had to include, as its main component, the time series of the closing price. Thus, for each stock and each base learner, we had 64 different multivariate versions to run, together with the univariate one. The final input dataset used in training therefore resulted, on the one hand, from the introduction into the feature space of the outputs of the sentiment analysis and emotion classification and, on the other, from the incorporation of a smoothed version of every feature used, that is, of both the closing price time series and the sentiment and emotion-related ones. This process is outlined schematically in
Figure 1.
In other words, based on the time series of the closing price and given, on the one hand, the three sentiment polarities from the outputs of Vader, TextBlob, and FinBERT and the 28 emotion features of the multi-label BERT classifier, and on the other hand, their smoothed versions resulting from the application of weekly rolling means, 64 features were created that characterized the multivariate layouts. Each of these features, together with the closing price, constituted an input feature setup.
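The feature construction described above can be sketched roughly as follows: tweet-level scores are averaged per day, every series receives a weekly rolling-mean companion, and each feature is paired with the closing price. This is a simplified illustration and not the authors' pipeline. The function names, the partial windows at the start of the rolling mean, and the toy values are assumptions. The "RM7" suffix mirrors the naming that appears in the results tables.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def daily_mean(scored_tweets: Sequence[Tuple[str, float]]) -> Dict[str, float]:
    """Average tweet-level scores per calendar day (ISO date string -> mean score)."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for day, score in scored_tweets:
        buckets[day].append(score)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


def rolling_mean(values: Sequence[float], window: int = 7) -> List[float]:
    """Trailing mean over up to `window` observations.

    Assumption: the first window - 1 points use the shorter history available.
    """
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def bivariate_setups(
    features: Dict[str, List[float]], target: str = "Close"
) -> List[Tuple[str, str]]:
    """Every two-feature combination that contains the closing-price series."""
    return [(target, name) for name in features if name != target]


# Toy data (hypothetical values, not data from the study).
tweets = [("2020-01-02", 0.25), ("2020-01-02", 0.75), ("2020-01-03", -0.5)]
fear_daily = daily_mean(tweets)

features = {"Close": [10.0, 12.0, 14.0], "fear": [0.5, -0.5, 0.0]}
smoothed = {name + " RM7": rolling_mean(series) for name, series in features.items()}
features.update(smoothed)
setups = bivariate_setups(features)
```

Applied to a 65-feature dataset, the pairing rule yields the 64 multivariate setups reported in the text, one for every feature other than the closing price itself.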
By using these setups, the first set of experiments was performed for each base learner. Then, according to six evaluation metrics, the best setups regarding each of the five methods investigated were extracted. Next, the possible blended and weighted-average ensemble versions were investigated in the direction of deriving a methodology. Each such ensemble could consist of two to five constituent methods and corresponding input feature setups, each of which was composed of the best-performing multivariate outputs extracted in the previous step. The experimental setting presupposes a first internal benchmarking of the set of 30 state-of-the-art methods presented in
Table A1 [
2] and continues further investigations of the ensemble methodologies from the best-performing ones. In this context, a new performance ranking is created containing the five base learners to be presented in
Table 1 and a number of weighted blended ensemble layouts. The models were trained on 80% of the data, with the remaining 20% reserved for testing. The following three time frames were added to the multitude of settings to be tested: single-day, 7-day, and 14-day time shifts.
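The weighted-average ensembles of two to five constituents mentioned above can be illustrated with a short sketch. The text does not state how the candidate weights were generated or selected, so the grid of step 0.1, the selection by mean absolute error, and all names below are assumptions made purely for illustration.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple


def weighted_average(
    predictions: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Blend aligned prediction series with weights that sum to 1."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("one weight per prediction series is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_steps = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions)) for i in range(n_steps)]


def weight_grid(n_models: int, steps: int = 10) -> Iterator[Tuple[float, ...]]:
    """Weight vectors on a 1/steps grid with positive entries that sum to 1."""
    for combo in product(range(1, steps), repeat=n_models):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_blend(
    y_true: Sequence[float], predictions: Sequence[Sequence[float]], steps: int = 10
) -> Tuple[float, ...]:
    """Grid point with the lowest MAE (assumed selection criterion)."""
    return min(
        weight_grid(len(predictions), steps),
        key=lambda w: mean_absolute_error(y_true, weighted_average(predictions, w)),
    )


# Toy example: one model overshoots and the other undershoots by the same amount.
actual = [10.0, 20.0, 30.0]
tcn_pred = [12.0, 22.0, 32.0]
xcm_pred = [8.0, 18.0, 28.0]
weights = best_blend(actual, [tcn_pred, xcm_pred])
blended = weighted_average([tcn_pred, xcm_pred], weights)
```

In practice the weights should be chosen on data that is not used for the final evaluation. Otherwise, the reported ensemble error would be optimistically biased.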
Thus, all the above methods, together with the proposed methodology to be presented in detail in
Section 4, were benchmarked, again, according to the six metrics utilized. Two types of final evaluations were performed: (a) first, the average performance value was calculated regardless of shift and dataset. Here, a ranking based on the average value for each metric was produced. (b) Then, the Friedman rankings [
32,
33] were calculated, incorporating 15 stock datasets × 3 shifts per dataset = 45 sets.
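The second type of evaluation can be illustrated with a small sketch of how average Friedman ranks are obtained: methods are ranked within each evaluation set, and the ranks are then averaged over all sets. This is a generic illustration rather than the code used in the study. It assumes a lower-is-better metric (for R2 the scores would be negated first), assumes every method appears in every set, and uses hypothetical method names and values.

```python
from typing import Dict, List


def rank_row(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank methods on one evaluation set (1 = lowest error); ties share the mean rank."""
    ordered = sorted(scores, key=scores.get)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for name in ordered[i: j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def friedman_average_ranks(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean rank per method over all evaluation sets (e.g., stock and shift pairs)."""
    totals: Dict[str, float] = {}
    for scores in results:
        for name, rank in rank_row(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(results) for name, total in totals.items()}


# Three hypothetical evaluation sets with an error metric per method.
results = [
    {"A": 1.0, "B": 2.0, "C": 3.0},
    {"A": 2.0, "B": 1.0, "C": 3.0},
    {"A": 1.0, "B": 1.0, "C": 2.0},
]
average_ranks = friedman_average_ranks(results)
```

With 15 stocks and 3 shifts, `results` would hold 45 such dictionaries per metric, and the method with the lowest average rank would head the ranking.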
Table 1.
Best-performing algorithms.
№ | Abbreviation | Algorithm |
---|---|---|
1 | TCN | Temporal Convolutional Network [34] |
2 | XCMPlus | Explainable Convolutional Neural Plus Network [35] |
3 | LSTM | Long Short-Term Memory Network [36] |
4 | LSTMPlus | Long Short-Term Memory Plus Network [37] |
5 | TSTPlus | Time Series Transformer Plus [38] |
5. Results
The results will now be presented. Here, critical difference (CD) diagrams, bar plots, and tables of the overall averaged results will be used. The CD diagrams will depict the top 10 of the overall Friedman rankings of the competing methods examined across all time shifts, arranged by metric. Bar plots will depict the numerical values of the Friedman ranks of the best-performing schemes.
The tables will contain the top three configurations regarding the average performance of each method in terms of the corresponding metric independent of time shift. This means that the tables are going to include information about the exact values of the metrics, whereas the CD diagrams and bar plots will show relative rankings.
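For reference, the six metrics reported in the tables (MAE, MAPE, MSE, RMSE, RMSLE, and R2) can be computed as in the following minimal sketch, which uses their standard textbook definitions. Two points are assumptions, since the text does not specify them: MAPE is returned as a fraction rather than a percentage, and RMSLE uses log(1 + x). The function name and toy values are hypothetical.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Standard regression metrics; assumes positive, non-constant true values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    msle = sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,  # fraction
        "MSE": ss_res / n,
        "RMSE": math.sqrt(ss_res / n),
        "RMSLE": math.sqrt(msle),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Toy example (hypothetical values, not data from the study).
scores = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

All metrics except R2 are errors, so lower is better, whereas a higher R2 indicates a better fit. This difference matters when the metrics are converted into rankings.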
Thus,
Table 5 depicts the average metric values of the three best-performing methods regardless of time shift. Looking at
Table 5, one first notices the general superiority of the proposed methodology, that is, the one we refer to as “TCN Mean”, which ranks first in every metric. Beyond that, there is not much to say here about the proposed configuration apart from the fact that its aforementioned prevalence is clear in every metric. Besides the performance of the proposed methodology, however, one can observe various ensemble layouts appearing in the top positions. These methods are the best-performing in the context of every possible weighted average combination version of the base learners presented in
Table 1. Regarding this, all the tables and illustrations presented in this section also include a description of the respective weights used in every weighted average ensemble. Specifically, beyond the TCN Mean, one can distinguish such an ensemble configuration: the ensemble consisting of a TCN that incorporates the fear feature—as extracted from the emotion classification process, together with an XCMPlus trained on the setup containing the weekly rolling mean version of the admiration emotion feature, in a linear arrangement with corresponding weights
. This method ranks second in every metric, with the exception of RMSLE, where it ranks third. Additionally, looking not only at the ranks but also at the metric values, we can further observe that this ensemble clearly trails TCN Mean on average, although in some metrics the gap is small.
Table 5.
Average performance per metric: top three.
№ | Method | MAE | № | Method | MAPE |
---|---|---|---|---|---|
1 | TCN Mean | 4.158 | 1 | TCN Mean | 0.171 |
2 | TCN fear & XCMPlus admiration RM7 | 4.309 | 2 | TCN fear & XCMPlus admiration RM7 | 0.174 |
3 | TCN disgust RM7 | 4.502 | 3 | TCN fear & XCMPlus gratitude RM7 | 0.175 |

№ | Method | MSE | № | Method | RMSE |
---|---|---|---|---|---|
1 | TCN Mean | 74.057 | 1 | TCN Mean | 5.120 |
2 | TCN fear & XCMPlus admiration RM7 | 75.891 | 2 | TCN fear & XCMPlus admiration RM7 | 5.321 |
3 | TCN disgust RM7 | 79.539 | 3 | TCN disgust RM7 | 5.430 |

№ | Method | RMSLE | № | Method | R2 |
---|---|---|---|---|---|
1 | TCN Mean | 0.093 | 1 | TCN Mean | 0.400 |
2 | TCN Close RM7 & XCMPlus gratitude RM7 | 0.098 | 2 | TCN fear & XCMPlus admiration RM7 | 0.209 |
3 | TCN fear & XCMPlus admiration RM7 | 0.101 | 3 | TCN nervousness RM7 | 0.185 |
In addition, when looking at the emotion features that appear in the first positions, we can also observe something rather expected: emotions such as fear, admiration, and disgust are the most effective. This is intuitive, given that the emotions in question are attitudes toward stocks that can be related to a general predisposition, and it is consistent with the hypothesis that what is said about stocks partly determines how stocks fluctuate. Still, our methodology performs best by far.
This can also be seen from the Friedman rankings presented in
Figure 5 and
Figure 6. The five best-performing methods are presented in
Figure 5. The CD diagrams in
Figure 6 contain the 10 best-performing schemes as well as aspects of their corresponding statistical mutual relations. Again, our methodology ranks first in every evaluation metric by a wide margin. However, here, the ensembles positioned below the proposed methodology are not the same as those included in the ranking of average values per metric presented in
Table 5. In this context, the following two additional features seem, in our opinion, quite interesting: gratitude and approval. These participate in configurations that generally occupy the top positions in the rankings. Each of these features seems able to capture a relevant attitude toward the corresponding stocks in our case study. Finally, observe that the CD diagrams also indicate that the proposed TCN Mean methodology is clearly statistically distinct from the competing methods.
We have two additional remarks. First, each of the best-performing configurations, regarding both the average metric values and the Friedman rankings, contains specific features extracted by the BERT-based multi-label emotion classifier. As we will see below, in the TCN Mean methodology, emotion features are chosen by the feature selection process, which suggests the multivariate setups whose TCN predictions constitute the final averaged ensemble of the proposed methodology. Thus, emotion classification is crucial in our configuration. Second, each of the best-performing methodologies has a TCN as its main component. This is particularly important if we remember the following: the investigation here includes the five best methods from a set of 30 methods tested in [
2], and the experiments involved the same context. Thus, the results here build on the results of [
2], and this indicates that the proposed methodology prevails over a very large number of individual and ensemble methods.
Now, in relation to the configuration presented in
Figure 4, the aforementioned ranking of TCN Mean does not include the feature selection process. Actually, the incorporation of the latter leads to a further increase in the efficiency of the proposed method, and this constitutes a further indication of the essentiality of introducing relevant emotion features—beyond, for example, the fact that the univariate version never appears in the higher places in the rankings. Here, in the same way as before, aggregated results, once again, regarding two aggregation tactics will be presented: one containing the average metric values depicted in
Table 6 and the other Friedman ranks included in
Figure 7 and
Figure 8. There are three competing tactics here: initially, TCN Mean without feature selection, a TCN Mean version that incorporates mutual information [
49] feature selection, and finally, the proposed methodology: a TCN Mean version that exploits the correlation feature selection procedure mentioned above.
Figure 5.
Aggregate Friedman rankings.
Figure 6.
CD diagrams: Friedman rankings.
Figure 7.
CD diagrams: Friedman rankings.
Figure 8.
Aggregate feature Friedman rankings.
In short, in both cases, the dominance—in terms of ranking—of the correlation feature selection strategy is easily observed. Both in the CD diagrams and bar plots, as well as in the average rankings, this strategy is placed at the top of the results. An exception is the MSE metric in the rankings of
Table 6. There, the use of mutual information-based feature selection ranks first. However, the general prevalence of the correlation strategy is clear, and it is therefore the recommended one. Specifically, in terms of average metric values, the difference in performance between the strategies is even more evident. This, among other things, also indicates that the performance of the methodology we propose increases considerably in absolute terms with the incorporation of the correlation feature selection strategy.
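To give a rough idea of what a correlation-based selection step can look like, the sketch below ranks candidate features by the absolute Pearson correlation of each series with the closing price and keeps the strongest ones. The exact criterion used in the proposed methodology is not restated in this section, so the scoring rule, the `top_k` cut-off, the function names, and the toy values should all be read as assumptions.

```python
import math
from typing import Dict, List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / math.sqrt(var_a * var_b)


def select_by_correlation(
    features: Dict[str, List[float]], target: str = "Close", top_k: int = 5
) -> List[str]:
    """Names of the top_k features most correlated (in absolute value) with the target."""
    scores = {
        name: abs(pearson(series, features[target]))
        for name, series in features.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy feature table (hypothetical values, not data from the study).
features = {
    "Close": [1.0, 2.0, 3.0, 4.0],
    "fear": [4.0, 3.0, 2.0, 1.0],       # perfectly anti-correlated
    "gratitude": [1.0, 3.0, 2.0, 4.0],  # partially correlated
    "constant": [5.0, 5.0, 5.0, 5.0],   # uninformative
}
selected = select_by_correlation(features, top_k=2)
```

Under such a scheme, only the selected features would supply multivariate TCN predictions to the final average. A mutual-information variant would replace `pearson` with a dependence measure that also captures nonlinear relations.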