1. Introduction
Sentiment analysis encompasses a series of methods and heuristics for detecting and extracting subjective information, such as opinions, emotions, and attitudes, from language [
1]. Although it originated in the text subjectivity analysis performed by computational linguists in the 1990s [
2,
3], and builds on earlier studies of public opinion from the beginning of the 20th century, the proliferation of publications on sentiment analysis did not start until the Web became widespread [
4]. The Web, and social media in particular, have created large corpora for academic and industrial research on sentiment analysis [
5]. Examples of this can be found in the various applications of sentiment analysis investigated thus far: product pricing [
6], competitive intelligence [
7], market analysis [
8], election prediction [
9], public health research [
10], syndromic surveillance [
11], and many others. The vast majority of papers on sentiment analysis have been written after 2004 [
12], making it one of the fastest growing research areas.
Sentiment analysis requires pre-processing components to structure the text and extract features that can later be exploited by machine learning algorithms and text mining heuristics. Generally, pre-processing partitions the characters of a text stream into classes of meaningful units, moving from one state to the next whenever particular characters occur. By careful consideration of the set of characters—punctuation, white spaces, emoticons, and emojis—arbitrary text sequences can be handled efficiently.
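To illustrate this character-driven splitting, the following is a minimal sketch of a tweet tokeniser in Python; the regular expressions and the sample tweet are illustrative assumptions rather than the patterns used in our experiments.

import re

# Minimal, illustrative tweet tokeniser: emoticons, user mentions, hashtags,
# URLs, and plain words each have a dedicated pattern so that the punctuation
# inside them is not split away. These patterns are a sketch only.
TOKEN_PATTERNS = [
    r"[:;=8][-o*']?[)\](\[dDpP/\\|@3]+",   # emoticons such as :) ;-( :D
    r"@\w+",                               # user mentions, e.g. @BarackObama
    r"#\w+",                               # hashtags, e.g. #happy
    r"https?://\S+",                       # URLs
    r"[A-Za-z]+(?:'[A-Za-z]+)?",           # words, optionally with an apostrophe
]
TOKENISER = re.compile("|".join(TOKEN_PATTERNS))

def tokenise(text):
    """Split a tweet into word-like elements, keeping emoticons and hashtags intact."""
    return TOKENISER.findall(text)

print(tokenise("So happy today!!! :) thanks @BarackObama #happy http://t.co/xyz"))
# -> ['So', 'happy', 'today', ':)', 'thanks', '@BarackObama', '#happy', 'http://t.co/xyz']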
Tokenising a stream of characters into a sequence of word-like elements is one of the most critical components of text pre-processing [
13]. For the English language, it appears trivial to split words by spaces and punctuation, but some additional knowledge should be taken into consideration, such as opinion phrases, named entities, and stop-words [
14]. Previous research suggests that morphological transformations of language—that is, an analysis of what we can infer about language based on structural features [
15]—can also improve our understanding of subjective text. Examples of these are stemming [
16] and lemmatisation [
17]. Lately, researchers working on word embeddings and deep-learning-based approaches have recommended the use of techniques such as word segmentation, part-of-speech tagging, and parsing [
18].
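To make the contrast between stemming and lemmatisation concrete, the short sketch below applies NLTK's Porter stemmer and WordNet lemmatiser to a few illustrative words; for example, stemming maps "studies" to the truncated stem "studi", whereas lemmatisation maps it to the dictionary form "study". The sketch assumes the WordNet corpus has been downloaded and is not part of our evaluation pipeline.

# Illustrative contrast between stemming and lemmatisation using NLTK.
# Assumes the WordNet corpus is available, e.g. via nltk.download('wordnet').
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

for word in ["studies", "was", "running"]:
    stem = stemmer.stem(word)                    # crude suffix stripping
    lemma = lemmatiser.lemmatize(word, pos="v")  # dictionary-based base form
    print(word, stem, lemma)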
All the decisions made about text pre-processing have proved crucial to capturing the content, meaning, and style of language. Therefore, pre-processing has a direct impact on the validity of the information derived from sentiment analysis. However, recommendations for best pre-processing practice have not been conclusive. Thus, we aim to evaluate various combinations of pre-processing components quantitatively. To this end, we have acquired a large collection of Twitter [
19] data through Kaggle [
20], the online data science platform, and tested a number of
pre-processing flows—sequences of components that implement methods to clean and normalise text. We also assessed the impact of each flow on the accuracy of two off-the-shelf sentiment analysis tools and one supervised algorithm implemented by us.
It is important to clarify that we are not intending to develop a new sentiment analysis algorithm. Instead, we are interested in identifying pre-processing components that can help existing algorithms to improve their accuracy. Thus, we have tested our various pre-processing components with two off-the-shelf classifiers and a basic naïve Bayes classifier [
21]. As the name indicates, the off-the-shelf classifiers are generic and not tailored to specific domains. However, those classifiers were chosen precisely because they are widely used, and therefore our conclusions may be relevant to a larger audience. Lastly, we chose the naïve Bayes classifier as a third option because it can easily be reproduced by others to obtain the benefits discussed later. Our results confirm that the order of the pre-processing components makes a difference. We also found that lemmatisation, while useful for reducing inflectional forms of a word to a common base, does not improve sentiment analysis significantly.
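For reference, the following is a minimal sketch of a naïve Bayes sentiment classifier of this kind built with scikit-learn; the file name and column names are illustrative assumptions, and the sketch is not our exact implementation.

# Minimal sketch of a naive Bayes sentiment classifier: bag-of-words counts
# fed into multinomial naive Bayes. File and column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("tweets.csv")                      # hypothetical labelled dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2, random_state=42)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))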
The remainder of this paper is organised as follows.
Section 2 reviews the related work.
Section 3 describes the corpus used for our experiments, and the text pre-processing components and flows tested.
Section 4 presents our results, and, finally,
Section 5 offers our conclusions.
2. Related Work
As the number of papers on sentiment analysis continues to increase—largely as a result of social media becoming an integral part of our everyday lives—the number of publications on text pre-processing has increased too. According to Mäntylä et al. [
12], nearly 7000 papers on sentiment analysis have been indexed by the Scopus database [
22] thus far. However, 99% of those papers were indexed after 2004 [
12].
Over the past two decades, YouTube [
23] and Facebook [
24] have grown considerably. Indeed, YouTube and Facebook have reached a larger audience than any other social platform since 2019 [
25]. Facebook, in particular, has held a steady dominance over the social media market in recent years. In the UK, for example, where we carried out our study, Facebook had a market share of approximately 52.40% as of January 2021, making it the most popular platform, while Twitter, with a market share of 25.45%, was the second leading platform [
26].
Although Twitter’s market share falls behind Facebook’s, the amount of publicly available Twitter data is far greater than that of Facebook. This makes Twitter remarkably attractive to the research community, and that is why we have undertaken all our work using it.
Table 1 lists four of the most cited papers on text pre-processing available on Scopus. These papers, together with their references, account for 279 publications, which are all represented in
Figure A1 in
Appendix A. Pink circles in
Figure A1 represent the four papers displayed in
Table 1, red circles represent the earliest publications on the subject—these are publications mostly related to the Lucene Search Engine [
27]—and blue circles represent the rest of the papers. The size of the circles depends on the number of citations of the corresponding paper: the larger the circle is, the more citations the paper has. The links between the papers denote the citation relationship—paper
A is linked to paper
B if
A cites
B.
Figure A1 was produced with Gephi [
28], an open-source network analysis and visualisation software package.
The size and connectivity of the network displayed in
Figure A1 in
Appendix A, which is based on only four papers, shows that the body of literature on text pre-processing is large and keeps growing. However, it is still unclear which pre-processing tools should be employed, and in which order. Thus, we have evaluated various pre-processing flows quantitatively and we will present our conclusions here.
Two influential publications that follow an approach similar to ours are Angiani et al. [
32] and Jianqiang et al. [
30]. These publications are included in
Table 1 and state the sequences of pre-processing components they examined and the order in which they applied them.
Table 2 displays these sequences as pre-processing flows. For each flow,
Table 2 first indicates the datasets used by the corresponding authors to test their approach, and then lists the actual steps included in the flow. As explained before, researchers have recommended the use of text pre-processing techniques before performing sentiment analysis—an example of this can be found in Fornacciari et al. [
33]. However, we are not only interested in using pre-processing techniques but also in comparing different pre-processing flows and identifying the best.
Pre-processing is often seen as a fundamental part of sentiment analysis [
32], but it is rarely evaluated in detail. Consequently, we wanted to assess the effect of pre-processing on some popular off-the-shelf sentiment analysis classifiers—namely
VADER [
34] and
TextBlob [
35]. To compare and contrast these off-the-shelf classifiers with other alternatives, we have implemented our own classifier, based on the
naïve Bayes algorithm [
21]. Additionally, we took advantage of this opportunity to examine some components that have not been researched broadly in the literature. For example, most of the existing literature on English sentiment analysis refers to stemming as a pre-processing step—see, for instance, Angiani et al. [
32]. However, we have also evaluated
lemmatisation [
36], which is a pre-processing alternative that has been frequently overlooked.
Despite recent NLP developments, determining the sentiment expressed in a piece of text has not been fully solved; issues such as sarcasm and negation remain largely open problems [
37]. Our main contribution lies precisely in identifying pre-processing components that can pave the way to improving the state of the art.
Table 2.
Two text pre-processing flows evaluated in the relevant literature.
Angiani et al.’s flow [32], tested on two SemEval Twitter datasets from 2015 [38] and 2016 [39]:
1. In order to keep only significant information, remove URLs, hashtags (for example, #happy), and user mentions (for example, @BarackObama).
2. Replace tabs and line breaks with a blank, and quotation marks with apostrophes.
3. Remove punctuation, except for apostrophes (because apostrophes are part of grammar constructs, such as the genitive).
4. Remove vowels repeated in sequence at least three times to normalise words. For example, the words cooooool and cool will become the same.
5. Replace sequences of a and h with a laugh tag—laughs are typically represented as sequences of the characters a and h.
6. Convert emoticons into corresponding tags. For example, convert :) into smile_happy. The list of emoticons is taken from Wikipedia [41].
7. Convert the text to lower case, and remove extra blank spaces.
8. Replace all negative constructs (can’t, isn’t, never, etc.) with not.
9. Use PyEnchant [43] for the detection and correction of misspellings.
10. Replace insults with the tag bad_word.
11. Use the Iterated Lovins Stemmer [44] to reduce nouns, verbs, and adverbs which share the same radix.
12. Remove stop words.

Jianqiang et al.’s flow [30], tested on five manually annotated Twitter datasets (reported in [30]):
1. Replace negative mentions—that is, replace won’t, can’t, and n’t with will not, cannot, and not, respectively.
2. Remove URLs. Most researchers consider that URLs do not carry any valuable information regarding the sentiment of a tweet.
3. Revert words that contain repeated characters to their original English forms. For example, revert cooooool to cool.
4. Remove numbers. In general, numbers are of no use when measuring sentiment.
5. Remove stop words. Multiple lists are available, but the classic Van Rijsbergen stop list [40] was selected by Jianqiang et al. [30].
6. Expand acronyms to the original words by using an acronym dictionary, such as the Internet & Text Slang Dictionary & Translator [42].
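To make the notion of a pre-processing flow concrete, the sketch below composes a few of the steps named in Table 2 into a single pipeline; the component implementations and their order are illustrative only and do not reproduce either flow exactly.

import re

# Illustrative composition of a pre-processing flow from individual components.
# Each component is a small function; a flow is simply an ordered list of them.
def remove_urls(text):
    return re.sub(r"https?://\S+", " ", text)

def remove_mentions(text):
    return re.sub(r"@\w+", " ", text)

def strip_hashtag_sign(text):
    return re.sub(r"#(\w+)", r"\1", text)

def squeeze_repeated_characters(text):
    # e.g. 'cooooool' becomes 'cool'
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def lower_case(text):
    return text.lower()

def apply_flow(text, components):
    for component in components:   # the order of the components matters
        text = component(text)
    return text

flow = [remove_urls, remove_mentions, strip_hashtag_sign,
        squeeze_repeated_characters, lower_case]
print(apply_flow("This is sooooo COOL!!! @user #happy http://t.co/x", flow))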
4. Results
Our experiments ensured that the entire dataset was processed by each of the three classifiers described above. However, we also tested the classifiers separately with the part of the dataset for which we had a high-confidence sentiment value, because we expected this part of the dataset to serve as a gold standard—a collection of reference labels against which the classifiers can be compared.
All the flows were assessed by determining the sentiment using the three classifiers on both the raw data—that is, the dataset without any pre-processing—and the pre-processed data—that is, the dataset processed by the different combinations of the components that constitute the proposed flows.
After determining the sentiment using the classifiers, we evaluated their accuracy.
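As a minimal illustration of this evaluation step, the sketch below compares a classifier's predicted labels against the reference labels; the labels shown are placeholders rather than real data.

# Sketch of the accuracy computation: the predictions produced by a classifier
# are compared against the reference (gold-standard) labels.
from sklearn.metrics import accuracy_score

gold      = ["positive", "negative", "positive", "negative"]   # reference labels
predicted = ["positive", "negative", "negative", "negative"]   # a classifier's output
print("accuracy:", accuracy_score(gold, predicted))            # prints 0.75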
Table 3 shows the accuracy of the classifiers when tested with the entire dataset: first without pre-processing—raw data—and then with each of the flows.
Table 3, as well as the rest of the tables in this section, shows in bold font the entry that corresponds to the highest accuracy achieved by each of the classifiers. For instance, TextBlob achieves its highest accuracy when using Flow 4; thus, the accuracy that corresponds to the combination of TextBlob and Flow 4 is shown using bold font.
Table 4 shows the accuracy of the classifiers when tested with the gold standard—the part of the dataset for which we have a high confidence value. First, we tested the classifiers without any pre-processing and then with each of the pre-processing flows.
According to
Table 3 and
Table 4, there is at least one flow that increases the accuracy of each of the three classifiers evaluated, which proves the importance and benefits of pre-processing. From
Table 3 and
Table 4, we can confirm that Flow 4 offers the greatest advantages for two of the three classifiers. TextBlob achieves its greatest accuracy with Flow 4 when tested with both the entire dataset and the gold standard. Similarly, naïve Bayes achieves its greatest accuracy with Flow 4 when tested with the entire dataset and the gold standard. Recall that Flow 4 includes all the pre-processing components listed in
Section 3.2, except for spelling correction and lemmatisation.
Given that naïve Bayes performed better than the other two classifiers in all cases, regardless of the pre-processing flow chosen, we opted to evaluate this classifier further. We started by testing naïve Bayes exclusively with negative tweets, then exclusively with positive tweets, and finally with all the tweets. The results of this additional evaluation are presented in
Table 5. Note that the results of testing exclusively with negative tweets are shown in the row titled “Naïve-Bayes Negative”, the results of testing exclusively with positive tweets are shown in the row titled “Naïve-Bayes Positive”, and the results of testing with all the tweets are shown in the row titled “Naïve-Bayes Total”. We also evaluated naïve Bayes when tested, separately, with the positive and negative tweets included in the gold standard. The results are presented in
Table 6. It should be observed that Flow 4 allows the naïve Bayes classifier to achieve its highest accuracy with the entire dataset, when tested exclusively with positive tweets and with all the tweets together—see
Table 5. Additionally, Flow 4 allows the naïve Bayes classifier to achieve the highest accuracy with the gold standard, when tested exclusively with positive tweets and with all the tweets together—see
Table 6.
Overall, Flow 4 appears to be the best pre-processing option. Our results can be summarised as follows:
The importance of pre-processing: For each of the three classifiers evaluated, there is at least one pre-processing flow that increases its accuracy, regardless of the dataset employed for testing—the entire dataset or the gold standard—which confirms the benefits of pre-processing.
Sensitivity: As can be seen in
Table 5 and
Table 6, naïve Bayes positive achieves its worst accuracy overall without pre-processing. Additionally, naïve Bayes positive appears to be the classifier most sensitive to the pre-processing flows—that is, it is the classifier whose accuracy improves the most with the help of pre-processing. Indeed, Flow 4 enhances its accuracy significantly.
Insensitivity: Both VADER and TextBlob achieve similar performance, and they seem to be
insensitive to pre-processing—in the sense that pre-processing does not increase their accuracy considerably. According to
Table 3 and
Table 4, the variations in their accuracy, with and without any pre-processing, are minimal.
Pre-processing loss: Naïve Bayes negative achieves the highest accuracy overall when tested with the entire dataset and the gold standard. However, such a high accuracy is achieved without any pre-processing.
The developers of VADER claim to have implemented a number of heuristics that people use to assess sentiment [
75]. Such heuristics include, among others, pre-processing punctuation, capitalisation,
degree modifiers—also called
intensifiers,
booster words, or
degree adverbs—and dealing with the conjunction “
but”, which typically signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being the dominant part [
75]. Hutto and Gilbert use the sentence “
The food here is great, but the service is horrible” to show an example of mixed sentiment, where the latter half of the sentence dictates the overall polarity [
75].
In other words, VADER performs its own text pre-processing, and this is likely to interfere with our flows. For instance, if we removed all occurrences of the word “but” from the dataset while using Flow 4, because “but” is a stop-word, we would be disturbing VADER’s own pre-processing and, consequently, damaging its accuracy. Hence, VADER’s accuracy is not improved by Flow 4 or any other flow in which we remove stop-words. Unsurprisingly, VADER achieved its best performance with Flow 5, which is the only flow that does not remove stop-words. We can expect that TextBlob also performs its own pre-processing. After all, TextBlob and VADER are both off-the-shelf tools that implement solutions to the most common needs in sentiment analysis, without expecting their users to perform any pre-processing. Therefore, they are insensitive to our flows.
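As a small illustration of this interference, the sketch below scores Hutto and Gilbert's example sentence with VADER before and after a naive stop-word removal step; the stripped sentence is only an approximation of what such a step would produce, and the exact scores are not asserted here.

# Removing stop-words such as 'but' disables VADER's polarity-shift heuristic,
# so the compound score of the stripped sentence typically differs from the
# score of the original sentence.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()
original = "The food here is great, but the service is horrible"
stripped = "food great, service horrible"   # rough result of stop-word removal

print(analyser.polarity_scores(original)["compound"])
print(analyser.polarity_scores(stripped)["compound"])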
Table 5 and
Table 6 show some high accuracy values for the naïve Bayes negative and total classifiers, which highlight the possibility of
overfitting [
78]. Overfitting is a fundamental issue in supervised machine learning: an overfitted model fails to generalise to unseen data. We need assurance that our classifier is not picking up too much noise. Thus, we have applied
cross-validation [
79]. We have separated our dataset into
k subsets
, so that each time we test the classifier, one of the subsets is used as the test set, and the other k − 1
subsets are put together to form the training set.
Table 7 shows the accuracy values after applying the
k-fold cross validation.
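A sketch of this check is shown below, using scikit-learn's cross_val_score on the same bag-of-words naïve Bayes pipeline as before; the file name, column names, and the choice of k = 10 are illustrative assumptions.

# Sketch of k-fold cross-validation for the naive Bayes classifier. Each fold
# in turn is held out as the test set while the remaining k - 1 folds form the
# training set; the mean accuracy indicates how well the classifier generalises.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("tweets_preprocessed.csv")        # hypothetical pre-processed data
model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, df["text"], df["sentiment"], cv=10, scoring="accuracy")
print("mean accuracy over 10 folds:", scores.mean())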
While the accuracy is reduced by a small percentage after cross-validation, we still have Flow 4 as the best pre-processing option for Naïve Bayes Total—see
Table 7. Of course, we are considering further training, testing, and validation with other datasets as part of our future work.
5. Conclusions
We have reviewed the available research on text pre-processing, focusing on how to improve the accuracy of sentiment analysis classifiers for social media. We have implemented several pre-processing components and evaluated various combinations of them. Our work has been tested with a collection of tweets obtained through Kaggle to quantitatively assess the accuracy improvements derived from pre-processing.
For each of the sentiment analysis classifiers evaluated in our study, there is at least one combination of pre-processing components that increases its accuracy, which confirms the importance and benefits of pre-processing. In the particular case of our naïve Bayes classifier, the experiments confirm that the order of the pre-processing components matters, and pre-processing can significantly improve its accuracy.
We have also discussed some challenges which require further research. Evidently, the accuracy of supervised learning classifiers depends heavily on the training data. Therefore, there is an opportunity to extend our work with new training samples. Our current results are promising and motivate further work in the future.