1. Introduction
Nowadays, social media is the most popular medium for sharing data; therefore, social media sites accumulate huge amounts of data. Twitter [1] is a commonly used social media data source. Twitter has nearly 700 million users and 58 million tweets on average per day [2]. People post messages called tweets, which may include text, videos, links, etc. Because of Twitter’s popularity and heavy usage, analyzing the tweets posted by users has become increasingly important. Therefore, automatically detecting tweets’ sentiments is an attractive research area for many researchers.
Sentiment analysis is the study of people’s emotions, attitudes, opinions, sentiments, etc., as expressed in what they share, whether written or spoken. It focuses especially on polarity detection [3], which identifies negative and positive opinions in the text.
Sentiment analysis is carried out at three levels: the word or phrase level, the sentence level and the document level [
4]. Generally, lexicon-based, learning-based and hybrid approaches [5] are used to solve sentiment classification problems.
Figure 1 shows different sentiment analysis approaches and algorithms.
Tweets are, in effect, microblogs or short texts, so our sentiment analysis is performed at the sentence level. In our work, we used learning-based approaches to identify sentiments in sentences. Financial news in tweets, and the sentiment of these tweets, may contain important information or indicators for the financial or stock market. Although many studies have been conducted in English in the fields of sentiment analysis and financial sentiment analysis, few studies have been published for Turkish so far. Turkish financial tweets were collected using keywords determined from the BIST 100 index with association rule mining [6], and the tweets were tagged as “POSITIVE”, “NEGATIVE” and “NEUTRAL”. A binary dataset including only the positive and negative classes and a multi-class dataset including the positive, negative and neutral classes were created.
Noisy or unclear sentences negatively affect the sentiment classification process. To prepare the tweets for analysis, we applied pre-processing, which included stop word removal, normalization, etc. The “ITU Turkish NLP Web Service API” was utilized for the Turkish text normalization process [7].
Deep learning algorithms and methods have brought great improvements in the fields of pattern recognition and image recognition. These improvements led Natural Language Processing (NLP) researchers to focus on deep learning methods. Dense vector representations based on Neural Networks have achieved better results for NLP tasks. The success of word embedding [8, 9] and deep learning methods [10] drove the trend toward using deep learning algorithms in NLP tasks. In contrast to traditional machine-learning-based NLP systems, which use handcrafted features, deep learning enables automatic feature representation learning. Handcrafted features have several bottlenecks [
11]. We used word embedding and pre-trained word embedding with fastText [
12] for feature representation in our work.
Neural Network, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU) and GRU-CNN models were used as sentiment classifiers in this study. The performances of these models were evaluated based on their accuracy.
The arrangement of this paper is as follows: the introduction is included in
Section 1; works related to sentiment analysis and Turkish financial tweet data are discussed in
Section 2;
Section 3 contains descriptions of the materials and methods used in our work; the results are presented in
Section 4; and
Section 5 includes our conclusions. Table 1 and Table 2 summarize the key highlights, distinguishing the contributions of this research from its practical applications for stakeholders in the financial industry.
2. Related Studies
The sentiment analysis of tweets related to finance can be a significant indicator for investors when analyzed and interpreted according to the stock market. Automatically determining tweets’ sentiments is an attractive research area for many researchers. Feature vectors for text representation, classification techniques such as SVM, CNN, LSTM, Naïve Bayes, etc., and relations between tweets and stock markets are just a few research areas in this field. Although many sentiment analysis studies have been conducted on Twitter data, there are not enough studies on these subjects in the Turkish language and on Turkish stock markets.
Nasukawa and Yi have studied sentiment extraction for specific subjects from a document, instead of document classification [
13]. A review of sentiment analysis is given in reference [
14].
Almohaimeed has studied sentiment analysis on English tweets in order to predict S&P 500 index movement. He used data mining to draw out the companies affecting the S&P 500 index, in order to rank these companies and to determine patterns. In his thesis, he showed that classifier ensembles perform better than classic classifiers in the process of classifying tweets; his prediction model has an accuracy rate above 80% [
15].
The relationship between the stock market index and Turkish tweets was studied by Şimşek and Özdemir. They used 113 words and eight classes for their emotion corpus. When these words were found in tweets, they counted them and calculated average happiness values. They showed that the relationship between the stock market and tweet data is approximately 45% [
16].
The relationship between social media and daily stock prices was investigated by Yıldırım and Yüksel. A telecommunication company from Borsa Istanbul was selected. For a given period, daily data (opening price, closing price, etc.) were collected. Sentiment analysis was applied for the same period. According to the Spearman’s rank correlation test results, a moderate negative correlation exists between the daily stock price and public sentiment in tweets [
17].
The prediction of exchange rate movements using tweets has been studied by Öztürk and Çiftçi. The keywords “#USD/TR”, “USD/TR”, “Dollar”, “#Dollar” were used for tweet collection. Collected tweets’ sentiments and the daily exchange rate of USD/TR were analyzed by them. They used value 1 for increasing exchange rate and 0 for the rest of the cases. They also categorized the collected tweets as Buy, Sell and Neutral. As a result, they found a remarkable relationship between the exchange rate and the sentiments of tweets [
18].
Eliaçık and Erdoğan studied sentiment analysis methods on microblogging sites that use new user metrics. They proposed the measurement of the financial community’s sentiment polarity on microblogging sites. In addition, they analyzed the correlation between the behavior of the Borsa Istanbul index and the mood of the financial community weekly using the Pearson correlation coefficient method [
19].
Akgül, Ertano and Diri studied sentiment analysis and Twitter. They used both n-gram and lexicon methods, implementing two different models. They concluded that the lexicon method has a better performance than the n-gram method [
20].
Bollen, Mao and Zeng studied stock market predictions using Twitter moods. The text content of daily tweets was analyzed using two mood tracking tools, OpinionFinder and Google-Profile of Mood States (GPOMS). They used Granger causality analysis and a self-organizing fuzzy Neural Network to explore their hypothesis that public mood states could be used to predict changes in DJIA closing values. They found that using specific public mood dimensions remarkably improves DJIA predictions [
21].
Velioglu, Yıldız and Savas studied “sentiment analysis using learning approaches over emojis for Turkish tweets”. They used bag-of-words and fastText representations to evaluate sentiment classification models, including sentiment analysis performed over emojis/emoticons. Their results show that there are no notable distinctions between these models [
22].
Smailovic et al. studied stream-based sentiment analysis in the financial domain. They explored the relationship between sentiments expressed in tweets related to selected companies and the companies’ stock price movements. They used an SVM classifier to categorize tweets as positive, negative or neutral. They found that there is a relationship between company-related tweets and stock price changes, and that tweets could be used as a measure for stock price directions [
23].
Bilgin and Şentürk studied “sentiment analysis of tweets based on document vectors using supervised learning and semi-supervised learning”. They carried out sentiment analysis using Turkish and English tweets [
24].
Ayata, Saraçlar and Özgür studied sentiment analysis using machine learning and word embedding for Turkish tweets. They used SVM and Random Forest classifiers for sentiment classification. They also used vector embedding for Turkish tweet representation. Their results show that sectoral-based tweet classification gives better results than general or non-domain tweet classification [
25].
A financial tweet refers to a message shared on the Twitter platform that delves into financial subjects, encompassing discussions on stock market trends, economic news, investment strategies, tips on personal finance and updates related to cryptocurrencies. Such tweets serve the purpose of disseminating information, offering commentary and initiating conversations among individuals with an interest in the field of finance [
26].
Categories of financial tweets:
Market updates: These tweets furnish current and immediate information regarding stock prices, market indices and economic indicators [
21].
Analysts’ perspectives: Financial analysts frequently convey their insights and predictions on Twitter, impacting investment decisions [
27].
Personal finance guidance: Authorities, bloggers and individuals disseminate practical advice and strategies for effectively managing personal finances [
28].
Cryptocurrency updates: Financial Twitter frequently features news and updates on cryptocurrency prices, trading activities and regulatory developments [
29].
Economic insights: Economists, policymakers and journalists often share their perspectives and analyses on various economic events and policies through financial tweets [
30].
Benefits of following financial tweets:
Remaining well-informed: Following financial tweets enables individuals to stay abreast of market movements, economic trends and timely news updates [
31].
Gaining knowledge from experts: Following financial tweets allows individuals to gain insights and knowledge from experienced financial professionals and analysts who share their expertise [
32].
Participating in conversations: Financial Twitter serves as a platform for individuals to actively participate in discussions with like-minded individuals interested in finance, facilitating the exchange of ideas and perspectives [
31]. A comparative analysis of sentiment analysis in finance, together with proactive recommendations, is shown in
Table 3.
In summary, market participants stand to gain advantages by incorporating sentiment analysis into their decision-making workflows, utilizing machine learning models and adjusting their strategies to align with the instantaneous insights offered by financial tweets.
3. Materials and Methods
We collected Turkish financial tweets at intervals between 13 January 2019 and 10 March 2020 using Python, the Tweepy library, the Twitter API and MySQL. The collected tweets were manually tagged as positive, negative, neutral or irrelevant using our Java-based tagging program.
In the tweet pre-processing phase, using our Python code, we removed unnecessary sections of tweets, transformed the tweet text to lowercase, fixed spelling/writing errors (normalization) and restored popular abbreviations to their full forms (e.g., mrb to merhaba). The ITU Turkish NLP Web Service API [7] was used for the normalization process.
Word embedding and fastText’s pre-trained word embedding [
12] were used as feature extractors. Deep learning algorithms (Neural Network, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU) and GRU-CNN) were used for sentiment classification. The configuration of the Neural Networks, including the number of hidden layers, the layer dimensions and the choice of activation functions, depended on the requirements of the task and the characteristics of the dataset.
Table 4 summarizes the architectures commonly used for the different categories of Neural Networks.
3.1. Datasets
In this study, we worked on a newly created Turkish tweet dataset, tagged by us, that included 2313 tweets. The dataset had 992 POSITIVE, 629 NEGATIVE and 691 NEUTRAL labelled tweets. We created two datasets: a binary dataset (“0-NEGATIVE”, “1-POSITIVE”) and a multi-class dataset (“NEGATIVE”, “POSITIVE” and “NEUTRAL”). Dataset distributions are shown in
Figure 2.
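The derivation of the two datasets from the tagged tweets can be sketched as follows. This is a minimal illustration, not the authors' code: the example texts, the helper `build_dataset` and the integer id chosen for NEUTRAL are assumptions; only the 0-NEGATIVE and 1-POSITIVE encoding comes from the text above.

```python
# Hypothetical (text, label) pairs standing in for the manually tagged tweets.
tagged = [
    ("hisse rekor kırdı", "POSITIVE"),
    ("borsa sert düştü", "NEGATIVE"),
    ("endeks yatay seyretti", "NEUTRAL"),
    ("bugün hava güzel", "IRRELEVANT"),
]

# The binary encoding follows the paper; the id 2 for NEUTRAL is an assumption.
BINARY_IDS = {"NEGATIVE": 0, "POSITIVE": 1}
MULTI_CLASS_IDS = {"NEGATIVE": 0, "POSITIVE": 1, "NEUTRAL": 2}


def build_dataset(pairs, label_ids):
    """Keep tweets whose label appears in label_ids and encode the label as an int."""
    return [(text, label_ids[label]) for text, label in pairs if label in label_ids]


binary = build_dataset(tagged, BINARY_IDS)        # drops NEUTRAL and IRRELEVANT
multi = build_dataset(tagged, MULTI_CLASS_IDS)    # drops only IRRELEVANT
```

Filtering by the label map keeps a single code path for both datasets, so irrelevant tweets are excluded from each without a separate step.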
3.2. Tweet Pre-Processing Phase
Before using tweets as an input in our Neural Network models, the tweets needed pre-processing. Tweet pre-processing included:
Removing unnecessary sections of tweets (external links and usernames (signified with @sign), URL (http://...), stop words, #tags, retweets (starts with “RT”), punctuations, unnecessary whitespaces, etc.);
Transforming characters to lowercase;
Removing numbers;
Correcting spelling/writing errors (normalization) and restoring popular abbreviations to their full forms (e.g., mrb to merhaba). ITU Turkish NLP Web Service API [
7] was used for the normalization process.
We developed a tweet pre-processing program with Python, which processed the tweets as shown in
Figure 3.
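The cleaning steps listed above can be sketched with the Python standard library. This is a hedged illustration rather than the program shown in Figure 3: the function name `preprocess_tweet`, the regular expressions and the tiny stop-word list are assumptions, and the normalization step performed through the ITU web service is deliberately left out.

```python
import re
import string

# Tiny illustrative stop-word list (assumption); a real run would use a full Turkish list.
STOP_WORDS = {"ve", "ile", "bu", "bir", "de", "da", "için"}

# Maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_tweet(text):
    """Apply the listed cleaning steps; spelling normalization is not included."""
    text = re.sub(r"^RT\s+", "", text)                    # retweet marker
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # URLs and external links
    text = re.sub(r"[@#]\w+", " ", text)                  # @usernames and #tags
    # Turkish-aware lowercasing: map dotted and dotless capital I explicitly.
    text = text.replace("İ", "i").replace("I", "ı").lower()
    text = re.sub(r"\d+", " ", text)                      # numbers
    text = text.translate(PUNCT_TO_SPACE)                 # punctuation
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)                               # also collapses whitespace


cleaned = preprocess_tweet(
    "RT @borsa_haber: Dolar bugün %2 yükseldi ve rekor kırdı! https://t.co/abc #dolar"
)
# cleaned == "dolar bugün yükseldi rekor kırdı"
```

Links and mentions are removed before punctuation is stripped, because stripping punctuation first would break `@user` and `http://` patterns into ordinary words that could no longer be recognized.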
3.3. Feature Extraction
Machine learning algorithms, needing numerical values as inputs, cannot directly run on text data. The process of converting text to numerical values is called feature extraction. There are numerous types of feature extraction methods. Some popular feature extraction methods for text are Bag of Words (BoW) and word embedding. We used the word embedding approach in our work.
3.3.1. Bag of Words (BoW)
Each document is represented as a vector $d$, and each dimension of vector $d$ consists of a unique term in the term space of the document collection. We express each vector $d$ as
$$d = (w_1, w_2, \ldots, w_n)$$
where $w_i$ is the weight of the $i$-th term of document $d$.
Boolean weighting and TF-IDF are the most commonly used weighting algorithms.
Boolean weighting has a binary representation for term weight. Its weight is considered as 1 if the document contains the term, otherwise it is considered as 0. The equation of Boolean weighting is
$$w_i = \begin{cases} 1, & tf_i > 0 \\ 0, & \text{otherwise} \end{cases}$$
where $tf_i$ is the frequency of term $t_i$ in the document [33].
The TF-IDF (Term Frequency-Inverse Document Frequency) weighting equation is as follows:
$$w_i = tf_i \times \log\left(\frac{N}{n_i}\right)$$
where $tf_i$ is the frequency of term $t_i$ in document $d$, $N$ is the total number of documents and $n_i$ is the number of documents that include term $t_i$ [33].
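As a worked illustration of the two weighting schemes, the sketch below computes Boolean and TF-IDF weights for a toy collection. The function names and the three example documents are assumptions, and the natural logarithm is used because the text does not fix the base.

```python
import math
from collections import Counter


def boolean_weights(doc_tokens, vocabulary):
    """Weight 1 if the term occurs in the document, otherwise 0."""
    present = set(doc_tokens)
    return [1 if term in present else 0 for term in vocabulary]


def tfidf_weights(docs, vocabulary):
    """Term frequency times log(total documents / documents containing the term)."""
    n_docs = len(docs)
    doc_freq = {term: sum(1 for doc in docs if term in doc) for term in vocabulary}
    rows = []
    for doc in docs:
        term_freq = Counter(doc)
        rows.append([
            term_freq[term] * math.log(n_docs / doc_freq[term]) if doc_freq[term] else 0.0
            for term in vocabulary
        ])
    return rows


# Toy tokenized documents (assumption).
docs = [["borsa", "yükseldi"], ["borsa", "düştü"], ["dolar", "yükseldi", "yükseldi"]]
vocabulary = sorted({term for doc in docs for term in doc})
# vocabulary == ["borsa", "dolar", "düştü", "yükseldi"]

bool_row = boolean_weights(docs[0], vocabulary)   # [1, 0, 0, 1]
tfidf_rows = tfidf_weights(docs, vocabulary)
```

In the third document, "yükseldi" occurs twice and appears in two of the three documents, so its weight is 2 × log(3/2), while "dolar" occurs once in a single document and receives log(3).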
3.3.2. Word Embedding
This is a text representation in which similar words have similar representations. In other words, in a coordinate system, corresponding words are placed close to each other [
34,
35]. Word2vec [
36], GloVe [
37] and fastText [
38] are the most common word embedding models. Mikolov et al. used Artificial Neural Networks (ANN) in the Word2vec model. Word2vec is based on the prediction of a word from its surrounding words (Continuous Bag of Words, CBOW) or the prediction of surrounding words from a given word (Skip-gram). We used word embedding and pre-trained word embedding with fastText in our study. The feature vector size was 300.
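To make the idea concrete, the sketch below reads vectors in the textual `.vec` layout used for pre-trained fastText vectors (a header line with the word count and dimension, then one word per line followed by its values), compares words by cosine similarity, and builds an embedding matrix for a vocabulary. The helper names, the toy three-dimensional vectors and the word index are assumptions; the study itself uses 300-dimensional vectors.

```python
import io
import math


def load_vec(handle, wanted):
    """Read vectors for the wanted words from a .vec-style text stream."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in wanted and len(values) == dim:
            vectors[word] = [float(v) for v in values]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing in similar directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is reserved for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:          # words without a pre-trained vector stay all-zero
            matrix[idx] = vectors[word]
    return matrix


# Toy file contents (assumption): two finance words and one unrelated word.
toy_file = io.StringIO("3 3\nborsa 0.9 0.1 0.0\nhisse 0.8 0.2 0.1\nhava 0.0 0.1 0.9\n")
dim, vectors = load_vec(toy_file, {"borsa", "hisse", "hava"})
word_index = {"borsa": 1, "hisse": 2, "kur": 3}   # "kur" has no pre-trained vector here
matrix = embedding_matrix(word_index, vectors, dim)
```

A matrix of this kind is what an embedding layer is typically initialized with when pre-trained vectors are used; when the embedding is learned from scratch, the rows start from random values instead.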
3.4. Classifier Models
Deep learning algorithms have made impressive advances in research areas like pattern recognition, image recognition, etc., in recent years. Because of deep learning algorithms’ results and developments in Neural Network-based word embedding [
8,
9] representations, recent Natural Language Processing (NLP) research has increasingly used deep learning algorithms and word embedding instead of SVM and logistic regression techniques.
3.4.1. Convolutional Neural Networks (CNN)
Convolutional Neural Networks have impressive results in computer vision and image processing areas [
39,
40,
41]. The model has come to be increasingly used in NLP research. The use of CNNs for text started with Collobert and Weston’s research [
42]. They used a look-up table to transform words into a vector representation. First, the text is tokenized into words, and these words are transformed into a word embedding matrix of a selected dimension. Next, convolution is applied to the embedding matrix with the selected kernels to create a feature map. A max-pooling operation follows the convolution step to reduce the dimension of the output and obtain a fixed-length output [
11,
43].
Figure 4 shows CNN modeling for text.
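The convolution and max-pooling steps described above can be sketched in plain Python for a single kernel. The toy embedding matrix, the kernel values and the ReLU activation are assumptions chosen for illustration; they do not reflect the configuration used in the study.

```python
def conv1d_over_words(embedded, kernel, bias=0.0):
    """Slide a kernel covering len(kernel) consecutive word vectors over the sentence."""
    width = len(kernel)
    feature_map = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        activation = bias + sum(
            weight * value
            for kernel_row, word_vec in zip(kernel, window)
            for weight, value in zip(kernel_row, word_vec)
        )
        feature_map.append(max(0.0, activation))   # ReLU (assumption)
    return feature_map


def max_over_time(feature_map):
    """Max-pooling over all positions gives one value per kernel, whatever the length."""
    return max(feature_map)


# Four words with two-dimensional embeddings, and one kernel of width two.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_over_words(embedded, kernel)   # [1.5, 0.0, 0.0]
pooled = max_over_time(feature_map)                 # 1.5
```

With several kernels, the pooled values are concatenated into a fixed-length vector that is passed to the classification layers, which is why tweets of different lengths can share one classifier.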
3.4.2. Recurrent Neural Networks (RNNs)
Recurrent Neural Networks rely on the principle of sequential information processing and are primarily based on the Elman network [
44]. An RNN recursively applies the previously computed results into a computation for every instance in an input sequence.
Figure 5 shows a simple RNN structure [
11,
43].
The capacity to memorize previous results is the main advantage of an RNN [
11]. This makes RNNs well suited to NLP tasks such as sentiment analysis and speech recognition. In practice, simple RNNs suffer from the vanishing gradient problem, which complicates learning and tuning the parameters of the earlier layers in the network [
11].
This problem has led to the development of various RNN derivative models like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU).
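The recurrence of a simple (Elman) RNN can be sketched as follows: each hidden state is computed from the current input and the previous hidden state. The weight names, the tanh activation and the toy values are assumptions for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, w_x, w_h, bias):
    """Return the hidden state after every time step of a simple RNN."""
    hidden = [0.0] * len(bias)
    states = []
    for x_t in inputs:
        pre_activation = [
            a + b + c
            for a, b, c in zip(matvec(w_x, x_t), matvec(w_h, hidden), bias)
        ]
        hidden = [math.tanh(p) for p in pre_activation]   # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
        states.append(hidden)
    return states


# Two time steps, input dimension 2, hidden dimension 2 (toy values).
w_x = [[0.5, 0.0], [0.0, 0.5]]
w_h = [[0.1, 0.0], [0.0, 0.1]]
bias = [0.0, 0.0]
states = rnn_forward([[1.0, 0.0], [0.0, 1.0]], w_x, w_h, bias)
```

At the second step, the first hidden unit receives no input signal, yet its value is non-zero because it carries over a scaled copy of the previous state. The same repeated multiplication by the recurrent weights is what shrinks gradients over long sequences.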
3.4.3. Long Short-Term Memory (LSTM)
LSTM adds “forget gates” to the simple RNN architecture to handle the vanishing and exploding gradient problems.
Figure 6 shows the LSTM structure.
Unlike the simple RNN, LSTM back-propagates errors through a limitless number of time steps [
11].
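For reference, one widely used formulation of the LSTM cell is given below. The notation is an assumption introduced here, since the text does not state which variant was used: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The cell state is updated additively, so when the forget gate stays close to 1 the error signal can pass through many time steps without being repeatedly squashed.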
3.4.4. Gated Recurrent Unit
The GRU is another RNN derivative model. It is less complex than LSTM but achieves similar performance. The GRU adds reset and update gates to the simple RNN.
Figure 7 shows a Gated Recurrent Unit.
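A single GRU time step can be sketched in plain Python to show how the update and reset gates act on the previous state. This follows one common formulation; the parameter names, the `zero_params` helper and the toy values are assumptions, and some libraries swap the roles of the update gate and its complement.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def affine(w, u, b, x, h):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(wk * xk for wk, xk in zip(w_row, x))
        + sum(uk * hk for uk, hk in zip(u_row, h))
        + b_k
        for w_row, u_row, b_k in zip(w, u, b)
    ]


def gru_step(x, h_prev, p):
    """One GRU step; p maps hypothetical parameter names to weights."""
    update = [sigmoid(a) for a in affine(p["W_z"], p["U_z"], p["b_z"], x, h_prev)]
    reset = [sigmoid(a) for a in affine(p["W_r"], p["U_r"], p["b_r"], x, h_prev)]
    gated_prev = [r * h for r, h in zip(reset, h_prev)]     # reset gate filters the old state
    candidate = [math.tanh(a) for a in affine(p["W_h"], p["U_h"], p["b_h"], x, gated_prev)]
    # The update gate interpolates between the old state and the candidate state.
    return [(1.0 - z) * h + z * c for z, h, c in zip(update, h_prev, candidate)]


def zero_params(input_dim, hidden_dim):
    """All-zero parameters, used only to make the example easy to verify by hand."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = [[0.0] * input_dim for _ in range(hidden_dim)]
        params["U_" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        params["b_" + gate] = [0.0] * hidden_dim
    return params


params = zero_params(input_dim=3, hidden_dim=2)
h_next = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], params)
# With zero weights both gates equal 0.5 and the candidate is 0, so h_next == [0.5, -1.0].
```

The GRU has two gates and no separate cell state, so it needs fewer parameters than an LSTM of the same hidden size.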
The high training accuracies (100% for some models) suggest overfitting, where the model memorizes the training data rather than learning generalizable patterns. This leads to poor performance on unseen data.
Regularization methods, such as L1, L2 and elastic net, impose penalties on excessive model complexity, serving as a deterrent against overfitting to particular data points in the training set [
45].
Dropout Layers: Randomly dropping out neurons during training forces the model to rely on other features and prevents overfitting to individual neurons [
46].
Balanced Dataset: An imbalanced dataset, wherein there is a prevalence of either positive or negative tweets, can result in the model exhibiting bias toward the majority class. This may lead to high training accuracy but might not ensure effective generalization [
47,
48].
Oversampling/Undersampling: Employing techniques such as oversampling, which involves replicating data points from the minority class, or undersampling, which entails removing data points from the majority class, aids in balancing the dataset. These approaches aim to alleviate bias, fostering a more equitable learning experience for the model from both classes [
49].
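Random oversampling, the simplest of the balancing techniques mentioned above, can be sketched as follows. The function name and the placeholder tweet texts are assumptions; the class sizes are those of the binary dataset in Section 3.1.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Replicate randomly chosen samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# 992 positive (1) and 629 negative (0) tweets; the texts are placeholders.
labels = [1] * 992 + [0] * 629
samples = [f"tweet_{i}" for i in range(len(labels))]

balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)   # both classes now have 992 samples
```

Resampling is normally applied to the training portion only, after the train/test split. Otherwise, copies of the same tweet can land on both sides of the split and inflate the test accuracy.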
5. Constraints
Dataset scale: The size of the Turkish financial tweets’ dataset is comparatively modest, which may constrain the applicability and reliability of the developed models.
Pre-processing complexity: Recognizing the intricacies in handling Turkish tweets, the authors concede the challenges arising from ambiguity and informal language during pre-processing. This may result in potential inaccuracies or biases in sentiment classification.
Binary versus multi-class classification: The discernible performance difference between binary and multi-class classifications underlines the complexities in effectively capturing more refined sentiment categories.
Domain specificity: Given that the models are specifically trained on financial tweets, there is a possibility that their effectiveness might not extend seamlessly to other domains or diverse sentiment analysis tasks.
Potential Biases
Data collection bias: Employing specific keywords for tweet collection may introduce selection bias, potentially skewing the representation of certain sentiment groups by either overemphasizing or underemphasizing them.
Labeling bias: The subjective nature of manual sentiment labeling makes it susceptible to individual biases, influencing the accuracy and reliability of sentiment categorization.
Model bias: The selection of algorithms and hyperparameters holds the potential to impact model performance, introducing biases that may affect the interpretation of sentiment analysis results.
Pre-trained word embedding bias: The biases inherent in the training data of pre-trained embeddings could be mirrored in sentiment analysis outcomes, potentially amplifying and perpetuating biases present in the initial word embedding data.
Although this research offers valuable perspectives on the sentiment analysis of Turkish financial tweets, both researchers and readers must remain cognizant of these limitations and biases. This awareness is crucial for the accurate interpretation and contextualization of this study’s findings.
6. Conclusions
Sentiment analysis research has been conducted extensively on social media data in the English language. However, a limited amount of sentiment analysis research has been conducted on social media data in the Turkish language. We created our datasets using Turkish financial tweets, and we tried five different machine learning algorithms (Neural Network, CNN, LSTM, GRU and GRU-CNN) to find sentiments on those datasets together with word embedding and pre-trained word embedding. The binary classification results were better than the multi-class classification results, as shown in
Table 10.
Our results reveal that, generally, all models perform better when they are run with pre-trained fastText word vectors. Also, binary classification results are better than multi-class classification results, as expected. Surprisingly, the results are close to each other. With pre-trained word embedding, CNN models had the best results of all. When we used word embedding, the GRU-CNN model gave better results for the binary classification and the Neural Network model gave better results for the multi-class classification.
We propose a CNN model with pre-trained word embedding for binary and multi-class classifications. Its maximum testing accuracy was 83.02% and the average of its maximum testing accuracies for all folds was 78.35% for binary classifications. For multi-class classifications, its maximum testing accuracy was 72.73% and the average of its maximum testing accuracies for all folds was 65.05%.
In future work, using additional layers in these models may improve their performance. The use of more specific pre-processing techniques could also improve model performance, as the collected Turkish tweets about the Turkish financial market contain many ambiguous words and phrases that make the pre-processing step difficult. In addition, enlarging the datasets could lead to better results.