1. Introduction
In recent years, the internet has become a global platform for communication and the dissemination of information. Popular social media sites now have a huge global reach, with Facebook having more than 2.27 billion monthly active users [1] and YouTube boasting almost 1 billion users every month [2]. Similarly, Twitter has an average of 335 million monthly active users [3], a figure reported to have grown by 14% per day over the last few years [4]. Other popular social media platforms include Instagram, LinkedIn, Tumblr, WeChat and WhatsApp. The total number of registered users on these sites is in the billions [5].
Because social media is used so extensively, a large amount of data is generated every day. This data is unstructured, which creates both opportunities and challenges for processing and analysis aimed at understanding its underlying hidden patterns.
The data gathered from these sites are used for a variety of purposes.
Cyberspace offers freedom of communication and expression of opinion. However, some groups exploit the reach of social media to spread distorted beliefs and exert a negative influence on other people. A number of research studies have found that social media is regularly misused by hate groups to promote radicalization (also referred to as cyber-crime, cyber-extremism, and cyber-hate propaganda) [6,7,8,9]. Many people use these platforms to promote harmful ideology by spreading extreme content among their audience. These radical groups post offensive and violent messages, comments, and hate speech in pursuit of their objectives. They also communicate with other virtual communities that share a similar agenda in order to extend their network on social media. Social media has become the easiest way for these extremist groups to recruit new members by reaching a worldwide audience, after which the groups gradually influence the new recruits to spread violence and extremism. The studies in [10,11] show that in 2015 more than 125,000 Twitter accounts associated with terrorist activities were identified through human judgment. In response to the growth of extremist content on Twitter, the company set up teams of professionals in various countries (e.g., the United States of America (USA) and Ireland) to monitor suspected accounts and block them once identified.
The growing use of social media for expressing views calls for special attention from the language processing and machine learning research communities to devise techniques that can help government agencies automatically identify users who hold extreme views. Such efforts could help the agencies control crime, since the massive volume of such content on social media is one of the biggest concerns today. The issues discussed above make the investigation of extreme online content an important area of research.
Researchers from disciplines such as social science, psychology, and computer science are working to develop new tools and techniques for identifying and countering online extremism. In particular, there is a growing need for computer science researchers to develop automated systems that help identify users or groups who post extreme views on social media platforms. To accomplish this task, most researchers have used term frequency and lexicon-based dictionaries such as SentiStrength and SentiWordNet [12,13,14]. However, the outcome of such techniques is not reliable, because written language is more than word frequencies: it also carries semantic orientation. Traditional lexicon approaches (also referred to as bag-of-words) give good results on general sentiment analysis tasks such as topic modeling and product or movie review systems. They are less effective at detecting extremism because they infer the meaning of a text from term frequency alone and ignore its semantic orientation. As a result, these approaches struggle to distinguish tweets with extreme views from tweets with neutral views.
Machine learning can also be used to predict extreme content on social media. The problem is usually handled with supervised approaches, which require pre-defined labels for training a classification model. This requirement is often met by using sentiment dictionaries to label each sample. We observed that such dictionary-based labeling frequently produces false labels, because dictionaries cannot capture the context of a sentence. We applied dictionary-based approaches to label the tweets in our Twitter dataset as either extreme or neutral. After carefully inspecting each tweet and its label, we found that around 70% of the dictionary-assigned labels in a sub-sample of 1000 tweets were incorrect. This raises serious questions about earlier work that relied on dictionary-assigned labels before applying a machine learning classifier [10,14,15]. Therefore, to generate training data, we read each tweet carefully and assigned a suitable label manually. Improving automatic labeling procedures deserves more attention from researchers.
Over the last few decades, machine learning has been widely used for problems such as movie and product review systems, authorship attribution, proteomic and genomic studies, and sentiment analysis. However, very few studies analyze extreme or radicalized social media content. Earlier published work on identifying extremism on social media platforms mostly relates to the terrorist group called the Islamic State of Iraq and Syria (ISIS). Very little work has used social media analysis to identify radicalized content in the context of the Taliban (see Table 1). This study analyzes data from the microblogging site Twitter (crawled using a set of keywords) with a machine learning approach to determine whether a tweet or set of tweets has extreme/radicalized content. The study takes Afghanistan as the region of interest.
Among all social media platforms, Twitter was chosen because it is widely used by all communities, from ordinary people to celebrities, and because every tweet is a short piece of text. Tweets were originally limited to 140 characters; the limit was extended to 280 characters in November 2017. The short length of tweets makes the problem more challenging, because users often tweet on a topic with very limited contextual information, which makes sentiment analysis harder than on content from other social media platforms [25].
The main contributions of the proposed methodology are as follows:
Importance of Data Labeling: To build a sentiment classification model, a labelled dataset is needed to train and test the machine learning model. As mentioned earlier, researchers commonly use a dictionary-based method (e.g., SentiWordNet, SentiStrength) for this purpose. We show that in aspect-based sentiment analysis this labeling method lacks correctness and reliability.
Exploratory Data Analysis (EDA): EDA is a technique for in-depth statistical analysis that gives a better understanding of the dataset, usually through data visualization. Tweet content is unstructured and must be converted into a structured form before statistical analysis. The resulting structured form of textual data often has so many dimensions that conventional visualization approaches are not practical. To resolve this issue, an unsupervised approach called Principal Component Analysis (PCA) [26] is adopted to project the high-dimensional data onto a low-dimensional space that can be inspected on a 2-D or 3-D plot. The 2-D or 3-D visualization shows how each tweet relates to every other tweet in terms of neighborhood and assigned group labels.
Classification Model Building for Sentiment Analysis: Supervised machine learning algorithms are used to classify tweets with extreme content. Several classification methods were applied for predictive analytics, including commonly known ones such as naïve Bayes’, decision tree, random forest, K Nearest Neighbors (KNN) and Support Vector Machine (SVM), and lesser-known ones such as ensemble classification methods with boosting and bagging. Furthermore, to improve the accuracy of the classification models, feature sets such as n-grams and TF-IDF are empirically evaluated.
The remainder of the paper is organized as follows. Section 2 reviews work on detecting radicalized groups on the web using sentiment analysis. Section 3 presents the proposed methodology for detecting extreme content. Section 4 describes the experimental setup for evaluating the proposed framework and reports the results. Section 5 concludes with the effectiveness of the proposed framework for detecting extreme content in Twitter data in the context of Afghanistan.
2. Related Work
In the context of extremism and terrorism-related sentiment analysis, most previous work addresses right-wing extremism or the terrorist group ISIS. A report on terrorism published by the USA Center for Cyber and Homeland Security describes social media as a key tool for promoting radicalization and reveals that it is a widely adopted way of recruiting new members into extremist groups across the world, particularly targeting users from Europe and the USA [
27]. Another study [
28] reported that approximately 89% of organized terrorism on the internet take place through social media networks. In response to a terrorist attack in London on 3 June 2017, British Prime Minister blamed social media site in her dialogue: “We cannot allow this ideology the safe space it needs to breed—yet that is precisely what the internet and the big companies that provide Internet-based services provide” [
29]. Predicting terrorism from social media has therefore become an essential task, because terrorism poses a serious threat to society and to the security of citizens. To limit this cyber-terrorism, several researchers have worked on predicting radicalization content on social media platforms (see Figure 1).
The fields of Natural Language Processing (NLP) and sentiment analysis have evolved steadily over the last few decades. Figure 1 depicts, on a timeline, the machine learning techniques used in the existing literature to inspect extreme content on the web. KNN, naïve Bayes’, EDA, data clustering, decision tree, Gradient Boosted Decision Tree (GBDT), and Deep Neural Network (DNN) are the most widely adopted techniques for radicalization detection on social media sites [30,31,32,33,34,35,36].
In [12], the authors present a model for detecting radical content in web forums. The model was built using SentiWordNet, WordNet, and the Python-based Natural Language Toolkit (NLTK) to predict the polarity and intensity of radicalization on a web forum. Two Arabic web forums, Mantada and Qawem, were used for the experiments. A dataset of 500 sentences was selected from each forum and translated manually into English. Each sentence was broken into tokens and stored in bag-of-words form; after preprocessing, POS tagging was used to assign a tag to each word. A lexicon approach based on WordNet and SentiWordNet was used to assign a score to each word. The sentence score was then calculated as the average of the word scores in the sentence.
In [14], the authors describe a sentiment analysis approach for classifying radical text on the web, including radical right and radical Islamic content. The work uses machine learning to classify websites as pro-extremist or anti-extremist. A custom-written web crawler called Terrorism and Extremism Network Extractor (TENE) was used to collect data from 102 websites. POS tags were used to find frequently occurring nouns in the targeted webpages. To assign a sentiment to each page, the SentiStrength dictionary was used to evaluate sentiment around specific keywords. After labeling with SentiStrength, a decision tree algorithm was used for the classification task. The accuracy of the two-class classification model (i.e., two categories of extremists) was higher than that of the three-class (i.e., three categories of extremists) and four-class (i.e., four categories of extremists) classification tasks.
In [15], the authors identified the most radical users on the dark web using a lexicon-based approach. They investigated four popular web forums (Gawaher, Islamic Network, Islamic Awakening and Turn to Islam) that allow discussion of jihadists and terrorism. A list of 400 keywords was prepared from these forums using POS tagging and then used to compute a sentiment score for each post of a user. Sentiment analysis of users’ online posts and comments, based on this POS-generated dictionary, was carried out using statistical analysis, and the resulting tool was named Sentiment-based Identification of Radical Authors (SIRA). SIRA predicts the level of extremism of a user from the average sentiment score of their posts and from the volume, severity and duration of any negative posts.
In [22], the authors proposed a computational framework that combines social network analysis with sentiment analysis tools to analyze radical groups on YouTube. The data (i.e., comments and profiles) were crawled from a group of 700 suspected YouTube accounts. The authors examined the polarity of these users on different topics in order to identify signs of intolerance and extremism. A lexicon-based module was used to determine the target topic, and sentiment analysis was then applied to obtain the users’ opinions on these topics. Separate results for males and females were reported to show the most positive and most negative topics in each group. The results indicate that females are more positive toward al-Qaida and more negative toward Judaism, whereas males show higher positivity toward Islam.
In [23], the Brookings Institution conducted a study to understand the relationship between social media and terrorism. A combination of machine learning and manual processing was used to analyze the population of ISIS supporters on Twitter. The study investigated 20,000 suspected Twitter accounts and claims that 93% of the examined accounts supported ISIS. Furthermore, it estimated a lower bound of 46,000 and an upper bound of 90,000 ISIS-supporting accounts between October and November 2014. Each account had an average of 1000 followers, considerably more than an ordinary Twitter user.
A hybrid sentiment analysis approach was proposed in [10] to predict ISIS supporter accounts on Twitter. Real-time tweets were collected with the Twitter Streaming API using specific query words (e.g., ISIS, bomb, terrorist). After data collection, a lexicon-based approach was used for labelling, with the SentiWordNet dictionary used to calculate the score of each tweet. The naïve Bayes’ algorithm was used for the classification task. The work applied a two-fold approach: after classifying every user, more tweets were collected from the same user account and the assigned sentiment score was verified against the tweet history.
Another approach was presented in [24], where tweets were grouped into various radical groups through a lexical approach based on the presence of special keywords such as “Al-Qaida”, “Jihad”, “Terrorist Operations” and “Extremism”. A dictionary of semantically related terms (categories) was created by observing hashtags in tweets. For the classification task, tweets were vectorized using the dictionary words: a tweet that contains a dictionary word receives a score of 1, otherwise 0. A set of rules was defined for each category, and the vector representing each tweet was compared with all rules to assign the most appropriate category to a tweet with radical content.
Most machine learning work in the literature on detecting extremism through social media content analysis relates to ISIS. Much of Afghanistan has been at war for the last fifty years; as a result, the literacy rate is not very high and only a small community uses Twitter as a medium of communication. Our research analyzes Twitter data generated by people in Afghanistan in order to identify users with extreme or neutral views, which strongly affect society. The purpose of this analysis is to help organizations better understand extremism in society and define strategies to reduce hateful views in order to build a more peaceful society.
3. Materials and Methods
This study aims to build a state-of-the-art framework to assess the efficacy of technological advancement in the context of text-based content analysis. We collected data from Twitter using the Twitter Streaming API and applied standard NLP preprocessing techniques, including tokenization, stemming, lemmatization and computation of TF-IDF features, to generate a dataset. The main objective of this study is to develop a model that can predict whether a given tweet is neutral or extreme.
We propose a framework with a two-step process that uses machine learning methods to gain a better understanding of the data and better predictive ability. The first step is an Exploratory Data Analysis (EDA) [37,38], whose objective is to understand the underlying hidden patterns in the data. As part of the EDA, we apply Principal Component Analysis (PCA) to reduce the dimensions of our dataset so that hidden patterns can be observed visually. In the second step, we apply various machine learning classification models such as SVM, naïve Bayes’, decision tree, KNN and ensemble classification methods with boosting and bagging. Evaluating the classification models is a key step. We compute evaluation measures such as accuracy, precision, recall and F-score to assess the predictive performance of the algorithms. In our experiments, we also demonstrate the impact of data size on the performance of the classification models.
The following sub-sections explain the key stages of our analysis: data collection, data pruning, preprocessing, feature extraction, exploratory data analysis, and predictive model building and evaluation. Figure 2 shows a schematic of the proposed framework for classifying tweets as either extreme or neutral.
3.1. Data Collection and Preparation
Many Twitter-based datasets are available on the internet, either freely or commercially, for understanding public sentiment on social or political issues [39,40]. However, no dataset is available for assessing public sentiment on the war situation in Afghanistan. Therefore, we collected new data from Twitter relevant to our problem. Twitter provides APIs that allow researchers to extract real-time tweets for text analytics using different parameter settings. These APIs extract tweets based on given query terms, on the profile of a specified user, on geo-location or language constraints, or on any combination of these. A request to the API returns not only the tweet text but also other information such as username, user location and user mentions in tweets. The API supports the JavaScript Object Notation (JSON), Extensible Markup Language (XML) and Really Simple Syndication (RSS) formats.
Twitter generates more than 500 million tweets per day on different topics and events [41,42]. To fetch only the relevant tweets from all the available data, a query was carefully prepared, with Afghanistan as the target geo-location. Afghanistan was selected as the focus because of the extreme level of conflict among its public on the matter of the Taliban and the Afghan government. Many Afghans support Taliban groups and say so openly on social media. Similarly, there are people in Afghanistan who support the Afghan government and openly express their opinion against the Taliban on social media platforms. There is also a large community in Afghanistan that discusses these issues with fair and neutral opinions and a focus on bringing peace to the community. Considering this situation, we regard Afghanistan as a suitable case for identifying extreme and neutral content in tweets using machine learning techniques. We believe the same approach can be replicated for other geo-locations in future studies.
To extract relevant data, we used several Twitter trends related to Taliban matters, of which the #kunduz event was the hottest trending topic. The Kunduz trend followed an airstrike conducted by Afghan forces on a religious school in a Taliban stronghold on 2 April 2018. The Afghan government stated that the presence of Taliban leaders at the madrassa was the reason for the attack.
To collect Twitter data reflecting the sentiment of Afghan people after the Kunduz madrassa attack, a list of query words was prepared based on the trending topics on Twitter from 2 to 8 April 2018. In total, 60 query words were used for extracting tweets. Table 2 lists the 32 query words that returned most of the tweets in our dataset. The remaining query words, such as ‘Bomb’, ‘suicide’, ‘Force’, ‘Army’, ‘HungerStrike’ and ‘MullahOmer’, returned only a small number of tweets. A total of 7500 tweets were downloaded based on the query words in the geo-location of Afghanistan over the week following the Kunduz incident. All collected data were stored locally in Excel format for further processing and analysis.
3.1.1. Data Labeling
The accuracy of labeling directly affects model performance: a classifier trained on low-quality labels may fail, and there is no guarantee that it will predict correctly. Labelling a large dataset is a critical building block of supervised learning, and for some machine learning tasks it is a costly and time-consuming job. Because of this cost, researchers often adopt semi-supervised approaches for real-world applications, where a large amount of unlabelled data is given as input together with a few labelled samples to perform a classification task.
The two well-known approaches to data labeling adopted by the research community are manual annotation and automated annotation (data programming). In the automated approach, dictionary resources and data programming tools take a dataset as input and generate a label for each sample. Examples of such tools are the WordNet, SentiWordNet and SentiStrength dictionaries, which are readily available for data annotation. In [10,14], an automated approach was adopted for labeling data for supervised classifiers. As mentioned earlier, the limitation of automated labeling is that it relies only on a catalog of words: the class label is assigned from the predefined polarity of the dictionary words found in a given sample. In many applications, however, the meaning of a sentence depends not only on which words occur but also on their order. When the context of the overall sentence goes beyond word occurrence, automated labelling does not give accurate results, and researchers are working to improve automated labelling mechanisms. Because existing automated labelling ignores semantics, we chose manual labelling and labelled our dataset by carefully reading each tweet, considering not only the occurrence of each word but also its context and semantics. Although hand labelling requires more time and human resources, it is more accurate and reliable.
3.1.2. Data Pruning
The extracted tweets are in raw form and contain “noise” or “undesired data” such as irrelevant terms, symbols, links, and punctuation marks. These irrelevant terms are not useful for the model and may reduce the performance of classifiers. To remove them from the corpus, we perform the preprocessing tasks described in the following sections (see Figure 3).
Tokenization
To eliminate undesired terms and construct a word vector for the model, we parsed every tweet into tokens. Tokenization splits the tweet text into word segments, treating spaces and special characters as separators. For example, the tweet [@pajhwok Former president slams Pakistan’s airstrike in Kunar] is converted into the tokens [@pajhwok] [Former] [president] [slams] [Pakistan’s] [airstrike] [in] [Kunar].
Stop Words Removal
Tweets contain many unrelated words that increase the dimensionality of the features, which in turn increases the computational complexity of classification models. Some terms, such as “the”, “an”, “is”, “am”, “how” and “to”, are of no use to the learning algorithm. These stop words are frequent words that carry little meaning. Such tokens are removed because they are non-informative and unnecessarily increase training time and memory overhead.
Removing Irregular Terms
When posting tweets, many users attach URL links and images to the text, and these become part of the tweet message. Since we focus only on written content, we removed all content in such irregular forms.
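The three steps above (tokenization, stop-word removal and removal of irregular terms) could be sketched as follows. This is a minimal illustration rather than the exact pipeline used in this study: the stop-word list is a small sample, the regular expressions are assumptions, and the URL in the example tweet is hypothetical.

```python
import re

# Small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "an", "a", "is", "am", "how", "to", "in"}

# Assumed pattern for links attached to tweets.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Keep @mentions, #hashtags and apostrophes inside a token; split on everything else.
TOKEN_PATTERN = re.compile(r"[@#]?\w+(?:['\u2019]\w+)*")


def preprocess(tweet):
    """Strip URLs, split the tweet into tokens, and drop stop words."""
    text = URL_PATTERN.sub(" ", tweet)
    tokens = TOKEN_PATTERN.findall(text)
    return [token for token in tokens if token.lower() not in STOP_WORDS]


# The trailing link is a made-up example of an irregular term.
tokens = preprocess(
    "@pajhwok Former president slams Pakistan's airstrike in Kunar https://t.co/xyz"
)
```

The user mention is kept as a token here, matching the tokenization example in the text; whether mentions and hashtags should be kept or dropped is a design choice that depends on the features used later.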
Stemming
For grammatical reasons, people use different inflected forms of a word, such as kill, killed, and killing. Stemming is applied for token normalization, with the goal of reducing the inflected forms of a word to a common base form. Stemming removes affixes from a word and uses the beginning of the word to represent the common base form. For example, the stem of study, studies, and studying is studi.
3.1.3. Feature Extraction
Sentiment analysis on text usually requires hand-crafted features derived from word-level tokens (e.g., airstrike, terrorist), word-level n-grams (e.g., doing_good, good_job), character-level n-grams (e.g., be, beh, beha, behave), POS tags (e.g., noun, adjective, verb), word clusters (e.g., maybe, probably, prob collapse to the same cluster), hashtags (e.g., #Afghanistan, #Trump), emoticons (e.g., ☺, ☹), user tags (e.g., @Trump), abbreviations (e.g., WTH, ASAP, ROFL) and elongated words (e.g., yummy, hurrah). A machine learning algorithm needs a numerical representation in the form of a feature vector that enables the model to perform mathematical and statistical analysis. In the feature-building phase, text data must be converted into a manageable representation that the algorithm can process. For this purpose, a feature vector is constructed using Term Frequency-Inverse Document Frequency (TF-IDF), a weighting scheme that assigns a score to each token in the corpus. In its standard form it is defined as:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$
Here, t represents a term in a tweet and d represents a tweet (usually termed a document in text mining); tf(t, d) is the frequency of t in d, N is the total number of tweets, and df(t) is the number of tweets that contain t.
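As a rough illustration of the feature-building step, the sketch below builds word-level n-grams and computes TF-IDF weights in plain Python. It assumes the common form tf(t, d) × log(N / df(t)) with a length-normalized term frequency, which may differ in smoothing or normalization from the exact variant used in this study; the function names and sample tokens are made up for the example.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return word-level n-grams joined with underscores, e.g. good_job."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """Compute tf(t, d) * log(N / df(t)) for every term of every document.

    documents: list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up token lists standing in for preprocessed tweets.
docs = [
    ["airstrike", "in", "kunduz"],
    ["peace", "in", "kunduz"],
    ["airstrike", "condemned"],
]
weights = tf_idf(docs)
bigrams = word_ngrams(docs[0], 2)
```

To obtain TF-IDF features over bi-grams instead of single words, the output of `word_ngrams` can be passed to `tf_idf` in place of the raw token lists. A term that occurs in every document receives a weight of zero under this form.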
3.2. Exploratory Data Analysis (EDA)
EDA is a way of exploring the characteristics of a dataset by computing basic statistical summaries. These summaries are often combined with data visualization techniques to give a better understanding of the dataset. EDA encourages users to look beyond simply applying classical classification algorithms: it provides more detailed insight into the data and meaningful information that can help to define or refine a hypothesis.
In our analysis, we apply Principal Component Analysis (PCA) as part of the EDA, with the objective of transforming a high-dimensional data space onto a low-dimensional space (usually 2-D or 3-D). The 2-D or 3-D transformation produces a plot that improves our understanding of the dataset through visual observation and helps to find natural groupings in the data. A tweet-based dataset, like other natural text datasets, has a very large number of features (i.e., terms), and visualizing such high-dimensional data is not viable with classical information visualization tools. We therefore use PCA to obtain a low-dimensional space that can be shown on a scatter plot.
PCA-based dimensionality reduction is also often used to obtain a transformed space of more than three dimensions for datasets with a few hundred to a few thousand dimensions. In such cases, dimensionality is reduced before applying a classification model, so that the model works on a small set of transformed features and its complexity is reduced.
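The projection step can be sketched in plain Python as below. This is an illustrative implementation that finds the first two principal components by power iteration on the covariance matrix; it is an assumption about one way to compute PCA, not the implementation used in this study, and in practice a numerical library would normally be used instead.

```python
import math


def pca_2d(rows, iterations=200):
    """Project rows (equal-length feature lists) onto their first two principal components."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in rows]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]

    components = []
    for k in range(2):
        # Deterministic start vector; it differs for each component.
        vec = [1.0 / (i + 1 + k) for i in range(dim)]
        for _ in range(iterations):
            vec = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
            # Remove directions already found (Gram-Schmidt), then normalise.
            for comp in components:
                dot = sum(v * c for v, c in zip(vec, comp))
                vec = [v - dot * c for v, c in zip(vec, comp)]
            norm = math.sqrt(sum(v * v for v in vec))
            if norm < 1e-12:  # no variance left in this direction
                vec = [0.0] * dim
                break
            vec = [v / norm for v in vec]
        components.append(vec)

    return [[sum(c[j] * comp[j] for j in range(dim)) for comp in components]
            for c in centered]


# Made-up 3-D feature vectors: most variance lies along the first feature.
rows = [[4.0, 1.0, 0.0], [-4.0, 1.0, 0.1], [4.0, -1.0, 0.1], [-4.0, -1.0, 0.0]]
projected = pca_2d(rows)
```

Each row of `projected` holds the two coordinates that would be drawn on a scatter plot, typically colored by the assigned class label. The sign of each coordinate is arbitrary, since a principal component is only defined up to direction.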
3.3. Classification Algorithm
Tweet classification is the task of assigning a tweet to one of a set of pre-defined classes $C = \{c_1, \ldots, c_k\}$. Classification tasks are performed by supervised machine learning, where a supervised learning algorithm is used to train the classifier. Normally a set of N training samples $\{(x_i, y_i)\}_{i=1}^{N}$ is provided to the learning algorithm, from which it learns a function that maps tweets to classes. Here $x_i$ represents the ith training tweet and $y_i \in C$ is the corresponding class label of $x_i$.
Choosing a good classifier is important for building a robust predictive model. In this work, several machine learning algorithms are considered: Support Vector Machine (SVM) [43], naïve Bayes’ [44], decision tree [45], random forest [46], KNN [47] and ensemble classification methods (with bagging and boosting) [48]. The effectiveness of these algorithms on our Twitter-based dataset is demonstrated in the next section.
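To make the supervised setting concrete, the sketch below trains one of the listed classifiers, naïve Bayes, on a handful of labelled token lists. It is a minimal multinomial variant with add-one smoothing written for illustration; the class name, the toy training tweets and their labels are all hypothetical and are not drawn from the dataset used in this study.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior of the class
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-made training samples (x_i, y_i).
train_docs = [
    ["destroy", "them", "all"],
    ["join", "the", "fight", "destroy"],
    ["peace", "talks", "resume"],
    ["hope", "for", "peace"],
]
train_labels = ["extreme", "extreme", "neutral", "neutral"]

model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
prediction = model.predict(["destroy", "fight"])
```

The same fit and predict interface applies to the other classifiers mentioned above; only the learned function that maps a tweet to a class changes.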
3.4. Performance Evaluation
The metrics used to evaluate the effectiveness of the classifiers are described below.
• Accuracy
Accuracy is the proportion of correct predictions made by the model. In simple terms, it is the percentage of all input samples in the dataset that are correctly classified. For example, if the model correctly predicts 45 data samples out of 50, the accuracy of the model is 90%.
• Precision, Recall, and F-score
Precision is the ratio of true positives to all samples predicted as positive, so it measures how many of the positive predictions are correct. Higher precision means fewer false positives. Recall is the ratio of true positives to all actual positives in the dataset, i.e., the ability of the classifier to find as many of the expected positives as possible. Performance improves as precision and recall increase. The F-score is the harmonic mean of precision and recall and summarizes both in a single value.
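The four measures can be computed from predicted and true labels as in the following sketch. The function name and the sample labels are made up for illustration, and treating "extreme" as the positive class is an assumption.

```python
def evaluate(y_true, y_pred, positive="extreme"):
    """Return accuracy, precision, recall and F-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["extreme", "extreme", "extreme", "neutral", "neutral"]
y_pred = ["extreme", "extreme", "neutral", "extreme", "neutral"]
scores = evaluate(y_true, y_pred)
```

In this example three of five predictions are correct, so accuracy is 0.6, while precision, recall and F-score each equal 2/3.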
5. Conclusions and Future Work
The widespread use of social media has heavily affected the lives of people in communities. Extremist organizations often use social media to disseminate their viewpoints to a larger community, either to generate sympathy for their cause or to recruit people. In this work, our objective is to use tweets from the social media website Twitter in combination with machine learning approaches to automatically identify tweets with extreme content. Many such approaches have been proposed in the literature, and they often achieved a predictive accuracy of around 80% [10,14,15,16,17,18]. Most of the earlier reported work relates to ISIS; to our knowledge, this is the first reported research work in the context of the Afghanistan war zone to predict tweets with extreme and neutral content. The analysis involves processing tweet data to generate TF-IDF features extracted from 1-grams, 2-grams, 3-grams, etc., and PCA-based reduced features. A two-step analysis was performed: Exploratory Data Analysis (EDA) and classification modelling. In terms of EDA, we highlighted the importance of exploratory data analysis in defining a hypothesis that can help improve the predictive ability of the underlying classification problem. We also demonstrate that the simplest features extracted with natural language processing can help to distinguish extreme from neutral tweet content but are not good enough to differentiate between sub-classes of extreme content (i.e., pro-Afghan government or pro-Taliban). In the classification modelling step, various classification models were applied, and the SVM classification model achieved a predictive accuracy of 84% using TF-IDF features extracted from bi-grams of tweets.
The analysis suggests that better predictive ability for extreme sub-groups requires semantic knowledge of the tweet content. We plan to include semantic knowledge in our future work for this purpose. This is important because many tweets contain similar words but have a different context or even opposite opinions. In aspect-based sentiment analysis, semantics need to be included to reduce such limitations.