1. Introduction
Predicting stock price behavior is a central goal for investors seeking to make sound decisions. A stock trader is an investor who attempts to profit from buying and selling stock, and must therefore predict stock price changes to decide whether to sell or hold currently owned stock. To earn money, stock traders must buy stocks whose prices are expected to rise over the prediction period and sell stocks whose prices are dropping. Traders who predict stock price trends correctly have the potential to make a profit, so trend prediction is central to their decision-making. However, the stock market shows highly complex trends and is influenced by a wide range of economic factors such as Market Capitalization (MAC), general economic conditions, and sentiment indices derived from social media and financial news [1,2]. Stock market prediction is therefore regarded as one of the most challenging problems in time series prediction due to its noise and volatility characteristics [3].
Previous research on predicting stock prices with machine learning models has largely followed two main approaches. The first builds a prediction model using only historical stock data as input features; the second incorporates related features, including external indicators (e.g., news sentiment and social sentiment) and technical indicators.
For the first approach, a vast number of methodologies have been used to build prediction models. Common techniques include Artificial Neural Networks (ANN), Support Vector Machines (SVM) and the Auto Regressive Integrated Moving Average (ARIMA). In addition, ANNs have specialized structures for different data types, such as Recurrent Neural Networks (RNN) for time series data and Convolutional Neural Networks (CNN) for image and video data [4,5]. Recently, Long Short-Term Memory (LSTM), a type of RNN, has attracted great attention with the rapid growth of machine learning in time series prediction, owing to its ability to handle long-term dependencies [6,7].
For the second approach, financial prediction based on machine learning frequently adopts technical analysis to construct input features; over 20% of financial market prediction models utilize technical indicators as input features [8]. Many researchers have also tried to demonstrate that media sentiment affects stock prices and to use it as an input feature for prediction models. The research in [9] proposed models using financial news and technical indicators to predict intraday directional movements in the stock price of Chevron Corporation (CVX), and [10] proposed a method for predicting future stock market patterns using news and social media.
However, a single machine learning method on its own is not effective enough to predict stock prices; additional processes and methods are required to increase model performance. Empirical Mode Decomposition (EMD), proposed by [11], is a popular method for transforming input features before building a prediction model. EMD decomposes a signal into physically meaningful components called Intrinsic Mode Functions (IMFs). It can analyze non-linear and non-stationary time series data by decomposing them into components at different resolutions, extracting the trends of the data and eliminating non-linearity and non-stationarity. Prediction models that apply EMD to the raw data during preprocessing have outperformed models that do not [12,13]. Therefore, many researchers have proposed hybrid models based on EMD to improve their predictions. The research in [14] proposed a multistep-ahead prediction methodology combining EMD and Support Vector Regression (SVR) for the S&P 500, and a hybrid model combining EMD and BiLSTM to enhance PM2.5 concentration prediction performance was proposed by [15].
The EMD method has proved useful for decomposing non-linear and non-stationary signals into components. However, EMD suffers from mode splitting and mode mixing [16]. To address this, advanced versions of EMD have been proposed. Ensemble Empirical Mode Decomposition (EEMD), a noise-assisted method, was proposed by [17]; its idea is to add different series of white Gaussian noise to the original signal. However, EEMD still leaves small numerical errors and, when different realizations of white Gaussian noise are added, the number of IMFs can change [18]. To resolve this problem, Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) was proposed by [19]; in this method, at each stage of the decomposition, a particular noise is added to the residue of the current stage rather than plain white Gaussian noise [20]. These advanced versions of EMD have been used in various applications, such as wind speed prediction based on EEMD and LSTM [21] and a crude oil price prediction approach based on CEEMDAN and LSTM with a news sentiment index developed by [22].
In this research, a hybrid framework combining Principal Component Analysis (PCA), EMD and LSTM is proposed to predict the closing price of the stock market in Thailand one step ahead. The proposed framework is divided into two parts, the feature engineering part and the prediction model part, with a total of five processes. Concretely, first, the news sentiment index score is created using Financial Sentiment Analysis with Bidirectional Encoder Representations from Transformers (FinBERT). After that, PCA is used to reduce the dimension of the technical indicators, and the resulting principal components serve as input features for the prediction model. Next, the closing price of the stock market is decomposed into several IMFs via EMD. LSTM is then applied to predict each IMF, along with the news sentiment score and the principal components from PCA. Finally, the predicted values of each IMF are combined to obtain the final closing price of the stock market.
2. Background Theories
This section describes the related theories used in this research. It is divided into seven sub-sections. The first two deal with processes to create input features: feature transformation and sentiment analysis. The next two cover processes to create a prediction model: Empirical Mode Decomposition and Long Short-Term Memory. Finally, the last three look at statistical methods for checking model performance: time series cross-validation, performance metrics and the Augmented Dickey-Fuller test.
2.1. Feature Transformation
The Curse of Dimensionality refers to the fact that prediction error can grow with the number of features; in other words, increasing the number of features does not always improve accuracy. This concept is now widely applied in machine learning. In theory, adding dimensions adds information to the dataset and improves its quality. In practice, however, it rarely improves model performance because real-world data contain noise and redundancy [23].
A model is likely to underfit when the dataset has too few features and to overfit when it has too many. Thus, many dimensionality reduction methods have been proposed to overcome this limitation. Dimensionality reduction eliminates some features of the dataset, creating a restricted feature set that contains the information needed to predict more efficiently and accurately. There are two kinds of dimensionality reduction: feature selection and feature transformation. The key difference between them is that feature selection keeps a subset of the original features, whereas feature transformation creates new features that capture most of the important information.
Principal Component Analysis (PCA) is one of the most well-known techniques for dimensionality reduction [24]. PCA is a feature transformation method used to reduce the dimension of massive data sets by transforming many variables into fewer while retaining most of the information in the original set. This technique saves the resources needed to run models and increases accuracy [25].
In the field of stock prediction, since technical indicators capture trend, volatility, volume, momentum and daily returns, they generalize to various scenarios. With PCA, a large number of technical indicators can be considered as input features without encountering the curse of dimensionality [26]. PCA has been applied to various data sources and applications, such as tourist behavior analysis [27] and offshore wind turbine selection [28]. In addition, some research indicates that combining machine learning and PCA yields significant model improvement, particularly in comparison with mature dimensionality reduction techniques [29]. The basic steps of PCA are as follows:
The first step is normalization of the original data to ensure that each feature contributes equally to the analysis. Mathematically, the normalization equation is

$x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$, (1)

where $x_{\min}$ and $x_{\max}$ denote the minimum and maximum values of a feature, $x$ denotes an original value and $x'$ denotes the normalized value.
The second step is establishing a covariance matrix according to the normalized data matrix. Since the dataset is n-dimensional, this will result in an n × n covariance matrix represented as matrix A.
The third step is to calculate the eigenvectors and eigenvalues of the covariance matrix to identify the principal components. The eigenvalues $\lambda$ of matrix $A$ are found by solving

$\det(A - \lambda I) = 0$, (2)

where $I$ denotes the identity matrix with the same dimensions as $A$, an essential requirement for matrix subtraction. For each $\lambda$, a corresponding eigenvector $v$ can be found by solving

$(A - \lambda I)\,v = 0$. (3)
The last step is reducing the original matrix by sorting the eigenvectors by their corresponding eigenvalues from largest to smallest. The eigenvector with the highest eigenvalue is the first principal component of the data. The first $p$ eigenvectors are then chosen to reduce the dimensions, yielding the principal components.
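For illustration, the following is a minimal NumPy sketch of these four steps; the indicator matrix X, its dimensions and the choice of p = 5 are assumptions made for the example.

```python
import numpy as np

def pca(X, p):
    # Step 1: min-max normalization, Equation (1).
    X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Step 2: covariance matrix A of the normalized data.
    A = np.cov(X_norm, rowvar=False)
    # Step 3: eigenvalues and eigenvectors of A, Equations (2) and (3).
    eigvals, eigvecs = np.linalg.eigh(A)   # eigh: A is symmetric
    # Step 4: sort by eigenvalue (largest first) and keep the first p eigenvectors.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:p]]
    # Project the normalized data onto the retained principal components.
    return X_norm @ components

X = np.random.rand(500, 20)   # e.g., 500 trading days, 20 technical indicators
scores = pca(X, p=5)
print(scores.shape)           # (500, 5)
```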
2.2. Sentiment Analysis
Sentiment Analysis is a method for determining whether data are positive, negative or neutral using Natural Language Processing (NLP). It is commonly applied to textual data to examine the attitudes, feelings, behaviors, decisions and emotions of speakers or writers concerning target topics. The basic task in sentiment analysis is classifying texts at the sentence or document level, where the class is determined by whether the expressed opinion is positive, negative or neutral.
Sentiment analysis techniques can be categorized into three approaches: lexicon-based approaches, machine learning-based approaches and hybrid approaches. First, lexicon-based approaches use a lexicon to perform sentiment classification by weighting and counting labeled words. Second, machine learning-based approaches use standard machine learning techniques such as Naive Bayes and Support Vector Machines; the model inputs include lexical features, sentiment lexicon-based features and parts of speech [30]. Last, hybrid approaches aggregate both lexicon-based and machine learning techniques [31]. In addition, sentiment analysis can generate profits for investors because it supports decision-making [32].
Financial Sentiment Analysis with Bidirectional Encoder Representations from Transformers (FinBERT), proposed by [33], is a language model based on Bidirectional Encoder Representations from Transformers (BERT) for financial NLP tasks. The FinBERT model involves two phases: pre-training and fine-tuning. During the pre-training phase, FinBERT uses a large variety of pre-training objectives to better capture language knowledge and semantic information; this phase trains the BERT language model on the finance domain using a large financial corpus together with a general corpus. During the fine-tuning phase, labeled datasets for financial sentiment classification are used. The main sentiment analysis dataset is the Financial PhraseBank: researchers extracted 4845 sentences containing financial terms, and 16 experts and master's students with finance backgrounds labeled each sentence as positive, negative or neutral. Given a text, the FinBERT model provides a polarity score and SoftMax outputs over the three labels: positive, negative or neutral.
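As an illustration, the following hedged sketch scores headlines with a FinBERT-style checkpoint through the Hugging Face transformers pipeline. The checkpoint name "ProsusAI/finbert" and the example headlines are assumptions made for the example; in this research, the Thai-news fine-tuned weights would be loaded instead.

```python
from transformers import pipeline

# Load a FinBERT-style sentiment classifier (checkpoint name is an assumption).
sentiment = pipeline("text-classification", model="ProsusAI/finbert")

headlines = [
    "Company X beats quarterly earnings expectations.",       # illustrative
    "Regulators launch a probe into Company Y's accounting.",
]
for h in headlines:
    result = sentiment(h)[0]   # e.g. {'label': 'positive', 'score': 0.93}
    print(h, "->", result["label"], round(result["score"], 3))
```

A daily news sentiment index can then be formed, for example, by mapping positive labels to positive scores, negative labels to negative scores and neutral to zero, and averaging over each trading day.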
2.3. Empirical Mode Decomposition (EMD)
EMD, proposed by [11], is used to divide a signal without leaving the time domain. It is comparable to other analysis methods such as the Fourier Transform and Wavelet Decomposition. EMD is beneficial for analyzing natural signals and is often applied to non-linear and non-stationary situations.
EMD separates the original signal into a series of Intrinsic Mode Functions (IMFs) with amplitude and a residual. IMFs satisfy the following two conditions:
An IMF has only one extremum between successive zero crossings; in other words, the number of extrema and the number of zero crossings differ by at most one.
The mean of the upper and lower envelopes of an IMF is zero.
The EMD decomposes the signal into IMFs through a sifting process. As shown in Figure 1, the sifting method decomposes a data set $x(t)$ into IMFs $x_n(t)$ and a residual $r(t)$, so that the signal can be described by

$x(t) = \sum_{n=1}^{N} x_n(t) + r(t)$. (4)
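A minimal sketch of this decomposition, assuming the PyEMD package (installed as EMD-signal), is given below; the synthetic composite signal stands in for a closing-price series.

```python
import numpy as np
from PyEMD import EMD  # pip install EMD-signal (assumed API)

t = np.linspace(0, 1, 500)
x = np.sin(40 * np.pi * t) + 0.5 * np.sin(8 * np.pi * t) + t  # composite test signal

emd = EMD()
emd.emd(x)
imfs, residue = emd.get_imfs_and_residue()
print("number of IMFs:", imfs.shape[0])

# Equation (4): the IMFs plus the residual reconstruct the original signal.
assert np.allclose(imfs.sum(axis=0) + residue, x)
```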
2.4. Long Short-Term Memory (LSTM)
Deep learning is a type of machine learning that simulates the way the human brain processes data and forms patterns for making decisions. Deep learning encompasses a wide variety of architectures and algorithms [34], and many of them, such as the Recurrent Neural Network (RNN), have been applied to NLP [35]. RNN is a variant of the Artificial Neural Network (ANN) designed to handle tasks with sequence data. The idea of RNNs is to use the output of the previous state as an input to the next state, allowing the model to recognize patterns in the input sequence. RNNs thus have the benefit of using past data to predict future events, so everything that occurred in the past can influence the future. However, RNNs are ineffective for very long-term dependencies, because gradients decrease exponentially and information about long-term dependencies decays. This is called the vanishing gradient problem.
LSTM, proposed by [36], is an improved version of RNN that avoids this problem. LSTM is specifically designed for tasks involving long-term dependencies because, with the support of a memory cell, it can forget irrelevant information or store information for long periods of time. The LSTM has a chain-like structure consisting of several subunits joined together; each unit of the architecture is a memory block with memory cells. These memory cells have three gate structures to control the flow of information: the forget gate layer, the input gate layer and the output gate layer. The forget gate layer determines what information from the previous cell state is carried into the current cell. The input gate layer determines the relevant information with which to update the cell state. The output gate layer determines the output value for the next hidden state based on the input and the memory of the block [37].
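For reference, the gate computations described above are conventionally written as follows, where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, and the $W$ and $b$ terms are learned weights and biases:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state / output)}
\end{aligned}
```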
Furthermore, LSTM is appropriate for time series prediction because it can learn and remember long-term dependencies such as market movements [38]. Advanced versions of LSTM have been used in various applications such as energy consumption [39], gas field production [40], chatbot message classification [41] and rice export price prediction [42].
2.5. K-Fold Cross-Validation with Time Series Data
Cross-validation is a data resampling method for estimating the true prediction performance of a model and tuning hyper-parameters. It is used to check overall model performance in order to detect overfitting, and to select appropriate hyper-parameters, such as the batch size and number of epochs in an ANN model.
K-fold cross-validation is one such method. The procedure begins by randomly splitting the dataset into k folds of equal size. The model is trained on the k − 1 folds that form the training set, then applied to the remaining fold, which serves as the testing set, and its performance is evaluated. This procedure is repeated until every fold has been used as the testing set. The final metric is the average of the errors obtained on each fold [43].
However, standard K-fold cross-validation cannot be used with time series because it splits the dataset randomly, and it is inconsistent with the real world to use values from the future to forecast values from the past. K-fold cross-validation with time series data follows a different procedure: each observation is first used in a testing set and then added to the training set of the model [44]. The procedure begins by splitting the dataset into k folds of equal size. In the initial iteration, only the first fold is used as the training set and the following fold as the testing set. In each subsequent iteration, the previous training and testing sets are merged to form the new training set. This continues until the last fold has been tested. Each training set therefore contains only observations that occurred before the corresponding testing observations, so no future observations are used to make the prediction [45].
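A short sketch of this expanding-window procedure, assuming scikit-learn's TimeSeriesSplit, is given below; the toy series is illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(10)                 # stand-in for a price series
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(y):
    # Every training window ends before its testing window begins.
    print("train:", train_idx, "test:", test_idx)
```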
2.6. Performance Metrics
In this research, the performance metrics are separated into two main parts. The first part evaluates the performance of the financial news sentiment analysis model and the second part evaluates the performance of the stock price prediction model.
In the first part, a confusion matrix is used to validate the financial news sentiment analysis model and to compare its performance with other models using precision, recall, F1-score and accuracy, given by (5)–(8), respectively:

$\text{Precision} = \dfrac{TP}{TP + FP}$, (5)

$\text{Recall} = \dfrac{TP}{TP + FN}$, (6)

$F1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, (7)

$\text{Accuracy} = \dfrac{TP + TN}{n}$, (8)

where $TP$ denotes true positives, $TN$ true negatives, $FP$ false positives, $FN$ false negatives and $n$ the number of observations.
In the second part, to evaluate the performance of the stock price prediction model, the Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE) and coefficient of determination ($R^2$) given by (9)–(12), respectively, are used to compare performance with the other models:

$\text{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, (9)

$\text{MAPE} = \dfrac{100}{n}\sum_{i=1}^{n}\left|\dfrac{y_i - \hat{y}_i}{y_i}\right|$, (10)

$\text{MAE} = \dfrac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$, (11)

$R^2 = 1 - \dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$, (12)

where $y_i$ denotes the actual value, $\hat{y}_i$ the predicted value and $\bar{y}$ the mean of the actual values.
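A minimal NumPy sketch of the regression metrics (9)–(12) is given below; the example values are illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))                                  # Eq. (9)
    mape = np.mean(np.abs(err / y_true)) * 100                         # Eq. (10)
    mae = np.mean(np.abs(err))                                         # Eq. (11)
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # Eq. (12)
    return rmse, mape, mae, r2

y_true = np.array([100.0, 102.0, 101.5, 103.0])
y_pred = np.array([100.5, 101.0, 102.0, 103.5])
print(regression_metrics(y_true, y_pred))
```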
2.7. Augmented Dickey-Fuller (ADF) Test
The ADF test is fundamentally a statistical significance test for determining whether a time series is stationary or non-stationary. It is suitable for testing the stationarity of a time series because it belongs to the category of tests called Unit Root Tests. The test is based on the autoregression

$Y_t = \rho Y_{t-1} + x_t' \delta + \epsilon_t$, (13)

where $Y_t$ denotes the value of the time series at time $t$, $x_t$ denotes exogenous variables, $\epsilon_t$ denotes white noise, and $\rho$ and $\delta$ denote estimated parameters.
If $|\rho| \geq 1$, $Y$ is a non-stationary series, while if $|\rho| < 1$, $Y$ is a stationary series; as a result, the stationarity hypothesis can be evaluated by testing whether the absolute value of $\rho$ is strictly smaller than 1.
The ADF test expands the Dickey-Fuller (DF) test equation to include a higher-order autoregressive process in the model. The DF test is a unit root test of the null hypothesis of a unit root. The standard DF test is carried out by subtracting $Y_{t-1}$ from both sides of (13), giving

$\Delta Y_t = \alpha Y_{t-1} + x_t' \delta + \epsilon_t$, (14)

where $\alpha = \rho - 1$. The null and alternative hypotheses, $H_0: \alpha = 0$ and $H_1: \alpha < 0$, are evaluated using the conventional $t$-ratio for $\alpha$. The ADF equation, which is the DF equation extended with a higher-order autoregressive process, can be written as

$\Delta Y_t = \alpha Y_{t-1} + x_t' \delta + \beta_1 \Delta Y_{t-1} + \cdots + \beta_p \Delta Y_{t-p} + \epsilon_t$, (15)

where $\beta_1, \ldots, \beta_p$ denote the coefficients of the lagged differences.
The $t$-ratio is then used to test the same null hypothesis as in the DF test. Under the null hypothesis of a unit root, $\alpha = 0$. If the $p$-value derived from (15) is greater than the significance level and the test statistic is greater than the critical value, the null hypothesis cannot be rejected and the series is inferred to be non-stationary [46,47].
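A brief sketch of the ADF test, assuming the statsmodels implementation, is given below; the simulated random walk (a unit-root series) is illustrative.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))   # non-stationary by construction

stat, p_value, *_ = adfuller(random_walk)
print(f"ADF statistic = {stat:.3f}, p-value = {p_value:.3f}")
# A p-value above the significance level means the unit-root null cannot be
# rejected, so the series is inferred to be non-stationary.
```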
4. Experimental Methods and Results
This section contains four sub-sections. The decomposition results of EMD are discussed in the first sub-section. In the second sub-section, the comparison between FinBERT with Thai news fine-tuning, the original FinBERT and other sentiment analysis models is presented to verify the effectiveness of FinBERT. In the third sub-section, different combinations of the PCA, EMD, LSTM and news components are presented to validate the proposed model from several perspectives. In the last sub-section, the comparison between different advanced versions of EMD is presented to identify the best version.
4.1. Experimental Methods and Results of the Decomposition Component by EMD
To construct the prediction model, the closing price series was transformed into new data using EMD. As shown in Figure 7, this experiment demonstrates the decomposition results that create the IMFs. Seven IMFs were decomposed from the original closing price sequence, ordered from high to low frequency. The number of IMFs depends on the raw data: the EMD process is repeated until only one global maximum and one global minimum remain, shown as IMF 7 in Figure 7. The number of IMFs changes if the raw data change, but it remains the same when EMD is applied to the same data.
The result shows that the IMFs can be divided into three groups. The first group contains the high-frequency components of the original data, represented by the first few IMFs with a lot of noise. The second group contains the middle-frequency components, represented by the central IMFs with medium noise. The last group contains the low-frequency components, represented by the last few IMFs with little noise; the last IMF is comparable to the trend of the stock. It is reasonable to hypothesize that the LSTM can accurately predict low-frequency IMFs but will struggle with high-frequency IMFs. To maximize prediction efficiency, the LSTM is trained individually on each IMF; thus, the hyper-parameters, number of hidden layers and weights differ for each IMF. This is the significant difference that makes a hybrid EMD-LSTM model perform better than a single LSTM model applied directly to the original closing price series, with its characteristic noise and volatility.
The IMFs are obtained by successive subtraction from the original closing price, so the summation of all IMFs is identical to the original series. For this reason, the sum of the prediction results for all IMFs can be taken as the prediction result for the original closing price.
4.2. Experimental Methods and Results of News Sentiment Analysis
FinBERT with Thai news fine-tuning was used as the financial news sentiment analysis model in the feature engineering part of the proposed framework. To assess its efficacy on Thai news analysis, it was compared to the original FinBERT (with default weights) and other popular sentiment analysis models such as Vader [50] and TextBlob [51].
This research manually annotated a financial news dataset. Articles were randomly sampled from the news data and labeled with three sentiment classes (positive, negative and neutral), totaling 1500 articles. The annotated dataset was used for FinBERT supervised fine-tuning and for model performance testing: the first 80% of the data served as the training dataset for supervised fine-tuning and the remaining 20% as the testing dataset, with each class equally represented in the testing set. The other models were evaluated on the same testing dataset as FinBERT with Thai news fine-tuning.
From Table 3, the results show that FinBERT with Thai news fine-tuning has the highest average accuracy and average F1-score among the compared models, as well as the highest F1-score in each class. Both FinBERT models perform similarly well at categorizing news as negative. In the positive and neutral classes, FinBERT with Thai news fine-tuning achieves a moderate F1-score. However, both Vader and TextBlob perform very poorly on this dataset.
4.3. Experimental Methods and Results of the Proposed Framework
The proposed framework was used to predict the closing price of the stock market. Since the framework contains many processes in both the feature engineering and prediction model parts, this sub-section verifies the efficacy of the proposed model in each process. A set of sensitivity experiments was established using various combinations of the EMD, LSTM, PCA and financial news components to validate the proposed model from several perspectives. The experimental design can be seen in Table 4; the output data for all experiments are the closing price of the stock market in Thailand.
Experiment 1 compares the EMD-LSTM model with other prediction models. In this experiment, the models are applied directly to the original closing price without additional input features. The comparison between the proposed model and the other models evaluates whether EMD-LSTM effectively improves prediction over state-of-the-art models for stock price time series.
Experiment 2 compares the effects of adding principal components versus technical indicators to EMD-LSTM. This experiment adds an extra input feature to the proposed model: either the original technical indicators or the principal components from PCA. The aim is to show whether the principal components from PCA can improve the prediction of EMD-LSTM. Comparing the principal components against the original technical indicators as input features examines whether applying PCA effectively improves prediction by mitigating the curse of dimensionality.
Experiment 3 is a comparison of the effects of adding news sentiment score to prediction models. This experiment applied an additional input feature from FinBERT to the proposed model. The experiments are evaluated to identify whether applying a news sentiment score improves model performance.
4.3.1. Comparison Results between the EMD-LSTM Model and Other Models
In this experiment, machine learning methods are applied to the original closing price to estimate prediction performance. The EMD-LSTM, LSTM and ARIMA models are compared. Table 5 shows that the EMD method provides great advantages in predicting the closing price of the stock market, with the MAE dropping by 56.13% compared to LSTM and by 85.67% compared to ARIMA. Moreover, LSTM and ARIMA performed similarly, implying that a single model cannot adequately capture the data patterns and make strong predictions. In addition, Figure 8 shows the predictive results of the LSTM and EMD-LSTM, revealing that the predicted values of the LSTM series visibly deviate from the original data while the EMD-LSTM predictions follow it closely.
4.3.2. Comparisons of the Effects of Adding Principal Component and Technical Indicator to EMD-LSTM
From the previous experiment, the EMD-LSTM model outperforms the other prediction models. This experiment adds an extra input feature to the proposed model: either the original technical indicators or the principal components from PCA. To make predictions with the EMD-LSTM model, individual IMFs were predicted with LSTM and the additional input feature. After tuning the LSTM model, the optimal hyper-parameters were obtained to achieve the best prediction results for the IMFs, as shown in Table 6. The batch size was between 8 and 32, and the number of epochs between 150 and 300. In addition, the other settings of the LSTM model were the hyperbolic tangent as activation function, ADAM as optimizer, mean squared error as loss function and a learning rate of 0.001.
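A hedged Keras sketch of one per-IMF LSTM with these reported settings is given below; the window length, layer width, feature count and training data are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from tensorflow import keras

window, n_features = 10, 6        # look-back and feature count (assumptions)

model = keras.Sequential([
    keras.Input(shape=(window, n_features)),
    keras.layers.LSTM(64, activation="tanh"),   # layer width is an assumption
    keras.layers.Dense(1),                      # one-step-ahead IMF value
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")

# Illustrative training data: sliding windows of one IMF plus principal components.
X = np.random.rand(200, window, n_features)
y = np.random.rand(200, 1)
model.fit(X, y, batch_size=16, epochs=5, verbose=0)  # paper: batch 8-32, 150-300 epochs

# One such model is trained per IMF; the final closing-price prediction is the
# sum of the per-IMF predictions.
```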
The experimental results of adding each input feature for each IMF are shown in Table 7. The results show that the principal components improve the efficiency of the model and outperform the IMF-only and technical indicator inputs for the first four IMFs. Nevertheless, from the fifth IMF onward, the model using only the IMF value outperforms the models with additional input features, as can be seen in Figure 9. Meanwhile, models with technical indicators as input features perform the worst across all IMFs. In addition, Figure 10 shows the prediction results for each IMF on the closing price testing set. Due to the high frequency of the components, the predicted values of the first several IMFs clearly diverge from the original data, whereas the predictions of the last IMFs nearly match it.
Next, the prediction results of each IMF are combined to compare final closing price predictions. The PCA-EMD-LSTM model uses principal components as input features to predict IMFs 1 to 4 but only the IMF value for IMFs 5 to 7. From Table 8, PCA-EMD-LSTM achieves the best prediction result, followed closely by the model using only the IMF values, while the models using technical indicators as input features have the worst results.
4.3.3. Comparisons of the Effects of Adding News Sentiment Score to Prediction Models
The preliminary model in the previous experiment used only input features from historical data. In this experiment, the news sentiment score was applied to a prediction model to identify whether applying a news sentiment score improves the prediction model.
From Table 9, the results show that the news sentiment score provides a clear advantage as an input feature in stock price prediction, with the MAE dropping by 20.82% compared to a single LSTM. On the other hand, the prediction results become worse when the news sentiment score is added to PCA-EMD-LSTM. Evidently, it is better to omit the news sentiment component from the proposed model, even though the news sentiment score does improve the performance of the original LSTM.
4.4. Comparison Results between Different Advanced Versions of EMD
From the previous three experiments, the best architecture of the proposed model was PCA-EMD-LSTM. However, there are advanced versions of EMD such as EEMD and CEEMDAN. Therefore, this experiment replaced the EMD part of the proposed model with EEMD and CEEMDAN, yielding PCA-EEMD-LSTM and PCA-CEEMDAN-LSTM, respectively. To create the prediction models, the closing price of the stock market in Thailand and the principal components from PCA were used as input features.
The experimental results in Table 10 show that PCA-CEEMDAN-LSTM had the lowest model performance and PCA-EEMD-LSTM a moderate performance; EMD is thus the most effective decomposition method to combine with PCA-LSTM. Finally, the PCA-EMD-LSTM architecture had the highest model performance, and its final form corresponds to Figure 3 with the news sentiment score part (shown with a blue background) excluded.
5. Discussion
In order to verify the effectiveness of the proposed hybrid framework, several experiments on various factors were examined. The observation results are as follows:
The EMD-LSTM model outperforms state-of-the-art benchmark models, indicating that decomposition with EMD decreases the complexity of the sequences and improves prediction performance. EMD decomposes the original signal into simpler components according to their frequencies. To maximize prediction effectiveness, the LSTM is trained individually on each component, so the hyper-parameters and weights differ for each component. This is the significant difference that makes a hybrid EMD-LSTM model perform better than a single LSTM.
The prediction results show that PCA helps reduce prediction errors in the first few IMFs when the principal components from PCA are applied to the EMD-LSTM. This indicates that the PCA method creates useful features from technical indicators for improving the prediction of high-frequency IMFs. From the results in Table 10, PCA-EMD-LSTM achieves the highest prediction performance for the closing price of the stock market. However, based on the experimental results in Table 5, the MAPE of the EMD-LSTM model is only slightly higher than that of the PCA-EMD-LSTM. Therefore, further experiments on different datasets are required to verify the performance improvement of using PCA in the EMD-LSTM model.
Applying the news sentiment score to the EMD-LSTM does not improve the prediction results for every IMF. On the other hand, adding the news sentiment score does improve the performance of the original LSTM. This means that news sentiment can be used to predict the closing price of the stock market itself, but not the decomposed components of the closing price.
To increase the efficiency of the proposed framework, a number of gaps need further development. For example, IMFs may be adaptively predicted by various traditional or hybrid machine learning models. Recently, a novel approach to selecting an effective combination of machine learning models for time series forecasting was proposed in [52]; this combination approach could be applied to the proposed framework to improve the prediction of each IMF and of the closing price of the stock market. In addition, based on a recently published research study [53], an interesting decomposition method, namely a hybrid time series decomposition strategy (HTD), could be applied instead of EMD for further improvement of the proposed framework.
6. Conclusions
In this research, a hybrid framework based on the combination of PCA, EMD and LSTM is proposed to predict the closing price of the stock market one step ahead. The proposed model is capable of combining both historical and textual data as input features. The overall design is separated into two parts: the feature engineering part and the prediction model part. The feature engineering part creates input features for the prediction model through two main processes: the finance and economics news sentiment score using FinBERT with Thai news fine-tuning, and the principal components of technical indicators using PCA. The prediction model part predicts the closing price of the stock market: historical data are decomposed into several IMFs via EMD, LSTM is utilized to predict each IMF along with the input features from the previous part, and the prediction values of each IMF are combined to produce the final stock price prediction. Based on the experimental results, the proposed framework using PCA, EMD and LSTM had the best prediction performance for the closing price of the stock market. Moreover, based on the obtained results for the LSTM model, the performance of the original LSTM improves when news sentiment analysis is applied.
Future research can be conducted in order to optimize the model’s efficiency. For example, different machine learning algorithms can be adaptively selected for the different IMFs after decomposing the original data. This process may improve the prediction results of each IMF and the prediction of the closing price of the stock market. In addition, the effect of different sets of technical indicators can be explored to find the best set for IMF prediction.