**Citation:** Alomari, E.; Katib, I.; Albeshri, A.; Yigitcanlar, T.; Mehmood, R. Iktishaf+: A Big Data Tool with Automatic Labeling for Road Traffic Social Sensing and Event Detection Using Distributed Machine Learning. *Sensors* **2021**, *21*, 2993. https://doi.org/10.3390/s21092993

Academic Editor: Alberto Gotta

Received: 19 March 2021; Accepted: 21 April 2021; Published: 24 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

### **1. Introduction**

### *1.1. Smart Cities, Transportation, and Social Sensing*

Smart cities and societies aim to revolutionize our daily lives and improve social, economic, and environmental sustainability through increased technology penetration, participatory governance, and wise use of natural and other resources [1]. Smart urban and rural developments require timely sensing and analysis of diverse data produced by various edge sensors, smart devices, GPS, cameras, and the Internet of Things (IoT) [2]. Social media platforms such as Twitter have become an important class of sensors for smart urban and rural developments [3], and in many sectors of smart cities and societies they are increasingly seen as a conveniently available and relatively inexpensive source of information compared to physical sensors [4]. Road transportation, considered the backbone of modern economies, is one such sector. Road traffic globally causes 1.25 million deaths and 50 million injuries annually, and it is therefore a research and development area of high significance.

Increased urbanisation is giving rise to the evolution of cities into megacities, where traffic congestion is a leading problem causing devastating economic, social, and ecological losses. The annual cost of congestion in the US alone is USD 305 billion, not to mention the damage to health and the number of deaths. Congestion is caused by steadily growing traffic in cities, road damage, roadworks, traffic accidents, bad weather, and other contingencies. There is a need to detect these causes or events to enable timely planning and operations.

Congestion is often caused by events that are beyond the direct scope of physical road sensors; such events therefore cannot be detected until their effects become visible on the roads and can be sensed by on-road sensors. For example, a football match in a city is likely to disrupt traffic and increase pressure on certain segments of the road network. Such an event can be detected through social media in advance, and timely intervention may reduce the aggravation of congestion in the city. A major football event may already be known to the authorities. However, social media can also reveal events that are arranged on an ad hoc basis, such as social gatherings and small sports events; though each is small, there can be many of them in a city, and in aggregate they can create large pressure on the city roads. Similarly, unpredictable events such as a fire in a city segment may also disrupt traffic, and these can be detected automatically on social media before their effects on the roads are visible. Moreover, historical analysis of social media data can reveal hidden information related to traffic that may not have been known otherwise and can be used for urban planning.

Twitter is one of the most popular microblogging platforms used for communication and sharing personal status, events, news, etc. [5]. Twitter allows users to post short text messages called tweets. A massive amount of real-time data is posted by millions of users on various topics, including transportation and real-time road traffic [4,6–8]. Over the recent decade, the use of Twitter and other social media by researchers and practitioners to study issues in many application domains and sectors has steadily increased [9–14]. Transportation is no exception: social media has been used to study various aspects such as analysing travel behaviours [15], recognizing mobility patterns [16], congestion detection [17], and event detection [4,6,9,18,19]. Due to the microblogging and real-time nature of Twitter, people are likely to communicate information about small and large-scale social gatherings, sports events, or events such as fires and severe weather, allowing such information to be extracted in real time [20]. Such information can enable the detection of transportation-related events and their causes for timely planning and operations. However, while social media manifests great potential, several major challenges need to be overcome before its wide adoption in transportation and other areas.

### *1.2. Summary of the Proposed Work*

The aim of this work is to develop big data technologies for detecting road traffic-related events (i.e., events that may affect road traffic) from Twitter data in the Arabic language, with a focus on Saudi Arabia. Over the past few years, we have continued to build a detailed literature review on social media analytics in transportation. We have learnt from this review that the state of the art on big data-enabled social media analytics for transportation-related studies is limited. Many more studies are needed to improve the breadth and depth of the research on the subject in several aspects to establish maturity in this area. The research gaps relate to the focus of the studies, the size and diversity of the data, the applicability and performance of the machine learning methods, the diversity of social media languages, the scalability of the computing platforms, and others [13,21]. The maturity of research in this area will allow the development, commercialization, and wide adoption of tools for transportation planning and operations (for the literature review and research gap, see Section 2).

This paper brings a range of technologies together to detect road traffic-related events using big data and distributed machine learning, and contributes to most of the above-mentioned research gaps. The most specific contribution of this research is an automatic labeling method for machine learning-based traffic-related event detection from Twitter data. In principle, the method is generic and can be applied to natural language processing (NLP) in any language; in this paper, however, it is applied to Twitter data in the Arabic language. One approach to detecting events from social media requires text classification using supervised classification algorithms, which in turn requires labeling data for the training phase. For big data, the manual labeling process is time-consuming and labor-intensive [22]. Using the automatic labeling techniques developed in this paper, we are able to handle a dataset over an order of magnitude larger than in our earlier work [6], and to detect several real events in Saudi Arabia without any prior knowledge, including a fire in Jeddah, rains in Makkah, and an accident in Riyadh. The proposed automatic labeling method uses predefined dictionaries to reduce the effort, time, and cost of manually labeling tweets. The dictionaries are generated automatically for each event type using the top vocabularies extracted from the manually labeled dataset, and are then adjusted manually to add synonyms and ensure that no important vocabulary is missed. After that, we divide the vocabularies into levels based on their importance and degree of relevance to the event type, and calculate a weight for each labeled tweet (see Section 3 for details of the tool design, including the automatic labeling method).
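The dictionary-based labeling steps above can be sketched in simplified form as follows. This is an illustrative sketch only: the keywords, level weights, and decision threshold are hypothetical placeholders (written in English for readability), not the actual dictionaries or parameters used by Iktishaf+.

```python
# Illustrative sketch of dictionary-based automatic labeling: each event
# type has a dictionary split into levels by relevance; a tweet receives a
# weighted score from the levels of the keywords it contains.
# All keywords, weights, and the threshold below are hypothetical.

ACCIDENT_DICT = {
    1: {"accident", "collision", "crash"},   # highly indicative terms
    2: {"injured", "ambulance"},             # supporting terms
    3: {"road", "traffic"},                  # weakly related terms
}
LEVEL_WEIGHTS = {1: 3.0, 2: 2.0, 3: 1.0}     # higher relevance -> higher weight
THRESHOLD = 3.0                              # minimum score to assign the label


def label_tweet(tokens, dictionary=ACCIDENT_DICT):
    """Return (label, score): label is 1 if the weighted keyword score
    reaches the threshold, else 0."""
    score = sum(
        LEVEL_WEIGHTS[level]
        for level, vocab in dictionary.items()
        for token in tokens
        if token in vocab
    )
    return (1 if score >= THRESHOLD else 0), score


print(label_tweet(["crash", "ambulance", "highway"]))  # -> (1, 5.0)
```

In this sketch a tweet matching one level-1 and one level-2 term scores 3.0 + 2.0 = 5.0 and is labeled as accident-related, while a tweet containing only weak terms falls below the threshold and remains unlabeled for that event type.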

The proposed method has been implemented in a software tool called Iktishaf+ (an Arabic word meaning discovery) that detects traffic events automatically from tweets in the Arabic language using distributed machine learning over Apache Spark. The tool comprises nine components, each serving a specific function: data collection and storage, data pre-processing, tweet labeling, feature extraction, tweet filtering, event detection, spatio-temporal information extraction, reporting and visualization, and internal and external validation. The architectural blocks of the Iktishaf+ system are depicted in Figure 1 (we describe the system architecture, including its nine components, in detail in Section 3). Iktishaf+ is built using a range of technologies including Apache Spark, Spark ML, Spark SQL, NLTK, PowerBI, Parquet, and MongoDB.

The Iktishaf+ tool uses the Iktishaf Stemmer that we introduced in [6], a light stemmer for the Arabic language designed to strip affixes based on the length of the tokens. It reduces the feature space while minimizing the number of letters removed from a token, to prevent changes in meaning or the loss of important words. We also use a location extractor, developed by us, that helps to find the locations of detected events using multiple methods. The location is extracted from the tweet text where the place name is explicitly mentioned in the message or included as a hashtag. Additionally, event locations are identified from account names using a predefined list of accounts that specialize in posting about traffic conditions in different cities in Saudi Arabia. If no information is found, other attributes of the tweet's JSON object, such as coordinates and user profiles, are checked for location information. These methods have allowed us to extract and visualize spatio-temporal information about the detected events.
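The idea of length-conditioned affix stripping can be illustrated with a minimal sketch. The affix lists, the minimum stem length, and the Latin-script example tokens below are hypothetical placeholders for readability; they are not the rules of the actual Iktishaf Stemmer, which operates on Arabic tokens.

```python
# Minimal sketch of length-conditioned light stemming: an affix is stripped
# only when the remaining token stays long enough, so short words are left
# intact and meaning is preserved. Affix lists and the length limit are
# illustrative placeholders, not the actual Iktishaf Stemmer rules.

PREFIXES = ["al", "wa"]     # e.g., definite article, conjunction
SUFFIXES = ["at", "ha"]     # e.g., plural/possessive markers
MIN_STEM_LEN = 3            # never reduce a token below this length


def light_stem(token):
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= MIN_STEM_LEN:
            token = token[len(prefix):]
            break                      # strip at most one prefix
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LEN:
            token = token[:-len(suffix)]
            break                      # strip at most one suffix
    return token


print(light_stem("alsayarat"))  # -> "sayar" (prefix and suffix stripped)
print(light_stem("wad"))        # -> "wad"   (too short to strip safely)
```

The length check is what distinguishes a light stemmer of this kind from an aggressive root stemmer: it trades some feature-space reduction for a lower risk of conflating unrelated words.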

The data used in this work comprise 33.5 million tweets collected from Saudi Arabia using the Twitter API over a period of more than a year. We have not used this data in any of our earlier works. The findings show the effectiveness of Twitter in detecting important events, and other information, in time, space, and information structure with no prior knowledge about them. The detected events are validated using internal and external sources.

**Figure 1.** Iktishaf+: The proposed system architecture.

Iktishaf+ is an enhanced version of our tool Iktishaf, introduced in [6]. It extends the functionality, capacity, and testing of our earlier work on traffic event detection, which reported analyses of 2.5 million tweets [6]; that number of tweets was limited by our ability to manually label them. We have also applied the Iktishaf tool to detect government measures and public concerns related to COVID-19 using unsupervised learning [13]. Our earlier work on big data social media analytics has also focused on application areas including public sentiment analysis of government services [23], logistics [24,25], and healthcare [7], in both the Arabic and English languages.

The Iktishaf+ tool uses open-source big data distributed computing technologies that enable the scalability and integration of transportation software systems with each other and with other smart city systems such as smart healthcare and urban governance systems. An elaboration of the novelty, contributions, and utilization of this work is given in Section 2.4.

The organization of the paper is as follows: Section 2 highlights the related work, reviewing different techniques for traffic event detection from social media as well as existing approaches for labeling large-scale datasets. Section 3 describes the proposed methodology and details the tool design and architecture. Section 4 explains the analysis results, followed by conclusions and future work in Section 5.

### **2. Literature Review**

Digital societies could perhaps be characterized by their increasing desire to express themselves and interact with others through various digital platforms [26]. The core ingredients of these digital societies, and of the digital platforms that enable them, include a range of emerging technologies and their convergence. The technologies include big data [27–30], high-performance computing (HPC) [31–34], artificial intelligence [35–38], cloud, fog, and edge computing [39–41], social sensing [42–45], and the Internet of Things (IoT) [36,40,46–48]. The applications include transportation [4,49–53], healthcare [7,54–56], and others [57–60]. The pulse for sensing and engaging with these environments is provided by social media and the IoT. Sentiment analysis, or opinion mining, is a vital tool in natural language processing (NLP) [61], and many notable works on sentiment analysis rely on artificial intelligence, Twitter, and other social media.

We review here the literature related to the topics of this paper, namely the detection of road traffic-related events using Twitter in the Arabic language. We begin with the literature on traffic event detection and then discuss solutions for automatic labeling. Section 2.1 covers works developed for event detection in any language, whether or not they use big data. Section 2.2 discusses works in the Arabic language; since such works are very limited, we include studies on detecting any type of event, not only traffic events. In Section 2.3, we present solutions for labeling large datasets. Finally, Section 2.4 reveals the research gap.

### *2.1. Traffic Events Detection Using Social Data (Any Language)*

Sakaki et al. [62] proposed an earthquake reporting system using Japanese tweets. They classified real-time tweets into positive (event-related) and negative (not related to events) classes using an SVM classifier. To prepare the training set, they used three groups of features for each tweet: keywords in the tweet, the number of words, and the words before and after the target-event words. They later extended their work to extract events from tweets referring to driving information [63]. They collected tweets using a list of keywords about traffic-related events such as heavy traffic, traffic restrictions, police checkpoints, parking, and rain mist. As in their previous work, they prepared the features using different methods and then selected the best features to train an SVM classifier. Moreover, Klaithin and Haruechaiyasak [10] analyzed tweets in the Thai language to extract traffic events. They trained a classifier based on the naïve Bayes model to classify tweets into six categories: accident, announcement, question, request, sentiment, and traffic condition.

Kumar et al. [64] trained a sentiment classification model to detect negative sentiment about road hazards from Twitter. The data is collected using search filtering with specific terms related to traffic. Then, naïve Bayes, K-nearest-neighbor, and the dynamic language model (DLM) are used to build models that classify tweets into hazard and non-hazard. Semwal et al. [65] applied real-time spatio-temporal analysis to Facebook data to extract traffic insights. They designed a module to detect the occurrence of events based on a spike in the number of posts at a specific time period and location; when the number exceeds a threshold, the posts associated with that time and location are analyzed to evaluate their sentiments. Further, a random forest classifier was used to predict the most dominant issue for the next day, and SMOTE was used to address the problem of an imbalanced dataset. Tejaswin et al. [66] also used a random forest classifier to predict traffic incidents. The traffic incidents are clustered and predicted using spatio-temporal data from Twitter. The location information is extracted using NLP and background knowledge via the Freebase API, a community-curated structured database containing a large number of entities, each defined by multiple properties and attributes, which helps in entity disambiguation.

Moreover, D'Andrea et al. [11] collected real-time Italian tweets and classified them after applying text mining techniques. The tweets are classified into three classes namely, traffic due to an external event, traffic congestion or crash, and non-traffic. They built a set of traffic events detected from official news websites or local newspapers and then they compared the time of detecting an event from these official sites with the time of detection from Twitter's stream fetched by their system.

None of the above-discussed approaches used big data technologies. Salas et al. [44] used Apache Spark to process tweets and trained a model using the SVM classification algorithm to classify them into traffic-related and non-traffic-related tweets. To extract location information, they used a combination of named entity recognition (NER) tools such as Stanford NER and a knowledge base such as Wikipedia.

Suma et al. [67] built a classification model using logistic regression with stochastic gradient descent to detect events related to road traffic from English tweets using Apache Spark. Lau [42] used latent Dirichlet allocation (LDA) topic modeling to filter traffic messages. In addition, they used the Spark MLlib library and trained classifiers using SVM, KNN, and NB to detect traffic events. A detailed survey of event detection techniques using Twitter data can be found in [68].

### *2.2. Traffic Events Detection Using Social Data (Arabic Language)*

A very limited number of studies have analyzed Arabic social media for traffic event detection, so we first review studies on detecting events not necessarily related to traffic, then review works that focus on transport and traffic events, and finally discuss works that use big data. Alkouz and Alghbari [69] analyzed English and Arabic data, including Standard Arabic and UAE dialectal posts from Twitter and Instagram, to detect and predict traffic jams. They filtered the collected data to keep only traffic-related data, using about 2.4 million tweets and 319,125 traffic-related image captions from Instagram. The text was cleaned, tokenized, and then stemmed using the NLTK root stemmer. They then used a predefined list of keywords to classify posts into reporting and non-reporting posts, where reporting posts contain at least one vocabulary item from the list. They developed a tool to identify locations from the text of posts and/or GPS locations, and employed a linear regression model to predict future traffic jams. Moreover, Alkhatib et al. [70] analyzed tweets written in Modern Standard Arabic and Dialectal Arabic for the purpose of incident and emergency reporting. To detect incidents and disasters occurring in the UAE, they collected tweets in real time using specific keywords: car accident, earthquake, drought, hailstorm, heatwave, building collapse, riot, and civil disorder. They labeled 8000 tweets manually to generate a training set and collected 82,150 tweets as a testing set. They built classification models using five machine learning algorithms: Polynomial Networks (PN), NB, KNN, Rocchio (RA), and SVM. They also applied a root stemmer, and the results showed that it improves classification accuracy. Further, they built an NER corpus using Wikipedia to identify certain types of named entities such as building names, event risk and impact levels, and numbers of casualties. To extract location information, they built a dictionary of terms related to location names in Dubai.

Other researchers have proposed solutions to detect events whose main focus was not traffic. AL-Smadi and Qawasmeh [71] extracted events about technology, sports, and politics using an unsupervised rule-based technique. Alsaedi and Pete [72] developed a solution using naïve Bayes and online clustering algorithms to detect disruptive events. Alabbas et al. [73] detected high-risk floods using an SVM classifier.

However, none of the above-discussed approaches for event detection from Arabic text have used big data technologies. Alomari and Mehmood [18] developed a dictionary-based approach using SAP HANA, an in-memory processing platform, to analyze Arabic tweets related to traffic congestion in Jeddah, and also extracted the causes of congestion. They subsequently extended their work and applied sentiment analysis to traffic-related tweets [19]. Moreover, they developed supervised classification models using the Apache Spark platform to detect eight types of traffic events: accident, roadwork, road closure, road damage, road condition, fire, weather, and social events. The results show that SVM achieves better results compared to logistic regression and naïve Bayes. Subsequently, they extended their work and validated the ability of the proposed Iktishaf tool [6] to detect various events, their locations, and their times, with no earlier knowledge of the events, from about 2.5 million tweets. Further, they designed a new light stemmer for Arabic text, the Iktishaf Stemmer, and studied its effect on performance. The results show that the performance of the trained model with and without the proposed stemmer is almost the same. On the other hand, compared to other light stemmers such as Tashaphyne and ISRI, the Iktishaf Stemmer minimizes the number of letters removed and avoids changes in meaning, especially for words related to transportation.

### *2.3. Solution for Labeling Large Scale Dataset (Any Language)*

Manual labeling is a very challenging and expensive process, especially for very large datasets, and thus supervised learning is hard to apply to big social data. One solution is crowdsourcing, cooperating with freelancers, but an issue with it is the quality of the work [74]; besides, crowdsourcing is not a fully automatic approach. In this section, we discuss existing works that are similar to ours and enable labeling text automatically, eliminating the need for human experts to label the entire training set.

Pandey and Natarajan [75] proposed a system to extract situation awareness (SA) information and location from Twitter during disaster events. They suggested using semi-supervised classification instead of the traditional supervised machine learning approach, whose labeling would be tedious and time-consuming. To create a semi-supervised model, they manually labeled a small set of tweets and fed them to an SVM to classify them into situation awareness and non-situation awareness. They then used the results of this initial classification to self-train the model. However, their model achieved very low precision and recall values for the situation awareness class.

Shafiabady et al. [76] suggested using unsupervised clustering approaches such as self-organizing maps (SOM) and the correlation coefficient (CorrCoef) to group unlabelled documents and use them as labelled data to train an SVM for text classification. However, their approach was applied to documents rather than to short texts such as tweets. Ghahreman and Dastjerdi [77] applied semi-automatic labelling by combining co-training algorithms with a similarity evaluation measure. They labelled a small set of data manually and then used the SVM algorithm to classify the unlabelled documents. After that, based on a threshold, part of the output was selected, and the similarity between the selected documents and the manually labelled documents was calculated.

Zewen et al. [78] suggested labeling a few documents automatically using external semantic resources, e.g., HowNet. They then combined the labeled data with most of the unlabeled training data to train the classifier by semi-supervised learning. To label the documents automatically, they obtained knowledge of each category name from lexical databases used as external semantic resources and generated a set of features for the corresponding category. They also extracted features from the documents as corresponding feature vectors. After that, the similarity between each category name and each text document is calculated to rank the documents and classify them into the corresponding category. Triguero et al. [79] provided a taxonomy of self-labeled techniques; one of its dimensions is the addition mechanism, which consists of a variety of schemes, including incremental, batch, and amending.

### *2.4. Research Gap, Novelty, Contributions, and Utilization*

It can be seen from the literature review provided in this section that works using big data technology for traffic-related event detection are limited. To the best of our knowledge, none of the existing works for Arabic have used big data technologies and platforms, and none have used automatic labeling to address the problem of manually labeling large datasets. As summarized in Section 1, the state of the art on big data-enabled social media analytics for transportation-related studies is limited, and many more studies are needed to improve the breadth and depth of research on the subject to establish maturity in this area. The research gaps relate to the focus of the studies, the size and diversity of the data, the applicability and performance of the machine learning methods, the diversity of social media languages, the scalability of the computing platforms, and others [13,21]. The maturity of research in this area will enable the development, commercialization, and wide adoption of tools for transportation planning and operations.

The range of technologies incorporated in the Iktishaf+ tool advances the state of the art on big data social media analytics in the Arabic language in a number of ways (some of these contributions, related to big data analysis, also apply more broadly to English and other languages). Firstly, the extended tool Iktishaf+ contributes multiple big data pipelines and architectures for event detection (in transportation and other sectors) from social media using cutting-edge technologies, including data-driven distributed machine learning and high-performance computing. Secondly, it incorporates a novel pre-processing pipeline for Saudi dialectal Arabic that includes irrelevant character removal, a tokenizer, a normalizer, stop word removal, and an Arabic light stemmer to improve event detection and overall performance; many other works in the Arabic language can benefit from this. Thirdly, the tool incorporates a range of lexicon-based, supervised, and unsupervised machine learning methods for event detection from social media in the Arabic language to enable smarter transportation and smarter societies. Using these methods, we have detected various physical and conceptual events such as congestion, fires, weather, government measures, and public concerns. Fourthly, the extended tool incorporates an automatic labeling method to reduce the effort, time, and cost of manually labeling large datasets; we are not aware of any other automatic labeling work in the Arabic language. Fifthly, we have developed and incorporated methods for extracting spatial and temporal information from Twitter data to allow spatio-temporal clustering and visualization of detected events. Sixthly, we have developed methods for validating the detected events using internal and external sources. None of the existing works in English and other languages, particularly Arabic, have reported a similar analysis of Twitter data for event detection in terms of the richness of the methods, depth of analysis, and significance of findings. To the best of our knowledge, no work in the Arabic language has used automatic labeling or big data tools, or has reported analysis of a number of tweets as large as in this paper.

The scalability of software systems for big data analytics is critical but is hampered by challenges related to the management, integration, and analysis of big data (the 4V challenges). The use of big data distributed computing technologies is important because it allows the scalability and integration of transportation software systems with each other and with other smart city systems. The ability of the Iktishaf+ tool to execute in parallel could save a month of computing time for the specific dataset size and problem addressed in our work and speed up the development process [13]. For larger datasets, executing sequential code may not even be possible, or distributed computing could save years of development time.

The utilization possibilities of our tool are many. For example, governments could learn about the various events, public concerns, and reactions related to certain government policies, measures, and actions (in pandemic and normal times) and develop policies and measures to address these concerns. The public could raise their concerns and give feedback on government policies. The public could also learn about various public and industry activities (such as fires, social events, and economic activities detected by our tool in earlier work [13]) and get involved in these to address financial, social, and other difficulties. The standardization and adoption of such tools could lead to real-time surveillance and detection of transportation-related and other events, or disease outbreaks (and other potentially dangerous phenomena), across the globe and allow governments to take timely actions to prevent various risks, the spread of diseases, and other disasters. The international standardization of such tools could allow governments to learn about the impact of the policies of various countries and develop best practices for national and international response.

### **3. Iktishaf+: Methodology and Design**

Figure 1 illustrates the Iktishaf+ architecture. It consists of nine components: (1) Data Collection and Storage Component, (2) Data Pre-Processing Component, (3) Tweets Labeling Component, (4) Feature Extractor Component, (5) Tweet Filtering Component, (6) Event Detection Component, (7) Spatio-Temporal Extractor Component, (8) Reporting and Visualization Component, and (9) External and Internal Validation Component. The next subsection explains the tools and libraries used to develop Iktishaf+; Sections 3.2–3.10 elaborate on each component in detail.

### *3.1. Tools and Libraries*

Iktishaf+ is built over the Apache Spark platform, which enables in-memory processing of distributed data. The main libraries used are Spark ML and Spark SQL. Spark.ML is a newer package introduced in Spark 1.2; unlike the Spark.MLlib package, which was built on top of RDDs, Spark.ML provides a higher-level API built on top of DataFrames for creating and tuning practical machine learning pipelines. The scripts were written in Python and run on the Aziz supercomputer, a 230 TFLOPS Fujitsu machine comprising around 500 nodes, each with 24 cores. The Fujitsu Exabyte File System (FEFS), a scalable parallel file system based on Lustre, is used to provide high-performance storage and scalable I/O performance. The Aziz supercomputer supports running Spark with YARN, which allocates resources across applications.

Figure 2 shows the architecture of Apache Spark with YARN. Spark applications run as independent sets of processes on a cluster, acquiring *executors* on the cluster nodes. The SparkContext coordinates the application and connects to YARN; it then sends *tasks* to the executors, which run computations for the application.

**Figure 2.** Spark Application Running on YARN.

### *3.2. Data Collection and Storage Component (DCSC)*

We collected the data using the Twitter REST API, which enables collecting historical data. The data returned by the Twitter API are encoded using JavaScript Object Notation (JSON). Each tweet object includes a unique ID, the text content itself, a timestamp representing when it was posted, and many child objects such as 'user' and 'place'. According to the Twitter documentation [80], a tweet can have over 150 attributes associated with it. Each child object encapsulates attributes that describe it. For instance, the 'user' object contains 'name', 'screen\_name', 'id', 'followers\_count', and others. Some of the attributes belonging to the 'user' object, such as 'description' and 'location', can be filled in manually by the user. Additionally, each tweet includes the 'entities' object, which encapsulates several attributes such as 'hashtags', 'user\_mentions', 'media', and 'links'. The example in Figure 3 shows the core attributes of a tweet object.
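The nesting of child objects is easiest to see on a concrete (made-up) example; the snippet below decodes a trimmed tweet object and accesses some of the attributes mentioned above.

```python
import json

# A trimmed tweet object with a few of the core attributes discussed above.
# All values are invented for illustration.
raw = '''
{
  "id": 1180000000000000000,
  "text": "Heavy traffic near King Abdulaziz Road",
  "created_at": "Mon Oct 07 09:15:00 +0000 2019",
  "user": {"id": 42, "screen_name": "traffic_jed",
           "followers_count": 1200, "location": "Jeddah"},
  "entities": {"hashtags": [{"text": "Jeddah"}], "user_mentions": []}
}
'''

tweet = json.loads(raw)
# Child objects ('user', 'entities') become nested dictionaries.
print(tweet["user"]["screen_name"])
print(tweet["entities"]["hashtags"][0]["text"])
```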

**Figure 3.** A Tweet Object.

We fetched Arabic tweets posted by users in Saudi Arabia using geo-filtering. In addition, we searched for tweets using hashtags that included city names. Both methods ensure collecting tweets about Saudi Arabia or posted from any place inside it, but not necessarily related to road traffic. Moreover, we created a list of specialized accounts that post about transportation and traffic conditions in Saudi Arabia and used them to obtain traffic-related tweets. We collected data in the period between September 2018 and October 2019. The total number of collected tweets is 33.5 million. After that, we cleaned the data by removing duplicates and retweets.
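The cleaning step can be sketched in plain Python as follows. Iktishaf+ performs this at scale on Spark; here, for illustration only, retweets are identified by the conventional "RT @" text prefix (the Twitter API also marks retweets explicitly via the 'retweeted_status' object).

```python
# Plain-Python sketch of the cleaning step: dropping retweets and
# exact-duplicate texts from a small, invented batch of tweets.
tweets = [
    {"id": 1, "text": "Accident on Ring Road"},
    {"id": 2, "text": "RT @traffic_jed: Accident on Ring Road"},
    {"id": 3, "text": "Accident on Ring Road"},   # duplicate of id 1
    {"id": 4, "text": "Roadworks near the airport"},
]

seen = set()
clean = []
for t in tweets:
    if t["text"].startswith("RT @"):   # heuristic retweet marker
        continue
    if t["text"] in seen:              # exact-duplicate text
        continue
    seen.add(t["text"])
    clean.append(t)

print([t["id"] for t in clean])  # [1, 4]
```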

For storing the collected tweets, we need a storage method that provides flexible schemas to store and retrieve the data. NoSQL databases are therefore more appropriate than the traditional table structures of relational databases. One common type of NoSQL database is the document-oriented database, in which each key is paired with a document in a format such as XML or JSON. One of the most widely used document-oriented databases is MongoDB; thus, we used MongoDB to store the tweets fetched through the Twitter API.

Moreover, we use Parquet file storage. Parquet is a column-oriented format supported by many data processing systems, including Apache Spark, and it enables very efficient compression. We selected it to store the output after each stage because it is efficient and provides good performance for both storage and processing. Furthermore, Spark SQL supports both reading and writing Parquet files while automatically preserving the schema of the original data. After being read from a Parquet file, the data are held in Spark DataFrames, which are equivalent to tables in a relational database or data frames in R/Python, but with richer optimizations.

### *3.3. Data Pre-Processing Component (DPC)*

We use the pre-processing component proposed in our earlier work, Iktishaf [6]. Algorithm 1 shows the Iktishaf+ pre-processing algorithm. It receives the collected tweets, the Arabic diacritics [D], punctuation marks [P], and Arabic stop words [SW] as input, while the output is clean, normalized, and stemmed tokens. The collected tweets are exported from MongoDB and stored in an Apache Spark DataFrame. The next subsections explain the main pre-processing steps.

### 3.3.1. Irrelevant Characters Removal

We removed Arabic diacritics, punctuation marks, English letters, and numbers. For Arabic diacritics, we created a list covering all three forms of diacritics suggested by Diab et al. [81]. The first form is vowel diacritics: the three main short vowels, named in Arabic Fatha ( َ ), Damma ( ُ ), and Kasra ( ِ ), as well as the Sukun diacritic ( ْ ), which indicates the absence of any vowel. The second form is nunation diacritics, named in Arabic Fathatan ( ً ), Dammatan ( ٌ ), and Kasratan ( ٍ ); they represent the doubled versions of the short vowels. The third form, called Shadda (gemination), is the consonant-doubling diacritic ( ّ ). It can also be merged with diacritics from the two previous forms, resulting in a new diacritic such as ( َّ ) or ( ًّ ). Therefore, the total number of Arabic diacritical marks is thirteen; all of them are removed from the text.

Furthermore, we created a list of all the punctuation marks such as commas, periods, colons, both Arabic and English semi-colons, and question marks, in addition to the different types of brackets, slashes, and mathematical symbols, as well as other signs such as \$, %, &, and @. For hashtags, we strip only the hash (#) and underscore (\_) symbols and keep the keywords, because they may contain useful information such as a place or event name.

```
Algorithm 1 Pre-Processing.
Input: tweets; [D]; [P]; [SW]
Output: Clean, normalized and stemmed tokens
1  spark ← createSparkSession()
2  tweets_DF ← spark.read(tweets)
3  ForEach tweet in tweets_DF['text'] do
     // Remove Irrelevant Characters
4    clean_tweets ← ""
5    For char in tweet do
6      If char not in [D] AND char not in [P] AND not char.isdigit()
         AND not char.isEnglishChar() then
7        clean_tweets.append(char)
8    end
     // Tokenize
9    tokens ← [ ]
10   tokens ← clean_tweets.split()
     // Normalize
11   normalized_tokens ← [ ]
12   For token in tokens do
13     For alif in ['أ', 'إ', 'آ'] do
14       If alif in token then
15         token ← token.replace(alif, 'ا')
16     end
17     If token.endswith('ي') then
18       token ← token.replaceLastCharWith('ى')
19     If token.endswith('ة') then
20       token ← token.replaceLastCharWith('ه')
       // Remove Stop Words
21     If token not in [SW] then
22       normalized_tokens.append(token)
23   end
     // Stemmer
24   stemm_tokens ← stemmer(normalized_tokens)
25 End
```
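For readers who prefer runnable code, the following is a plain-Python rendering of Algorithm 1. The diacritic, punctuation, and stop-word lists are small illustrative subsets of those used in the tool, the stop words are stored in their normalized forms, and the stemmer is stubbed out as the identity function.

```python
# Plain-Python rendering of Algorithm 1 (illustrative subsets of [D], [P], [SW]).
DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")   # [D]
PUNCTUATION = set(".,;:!?()[]{}/\\$%&@#_\u060c\u061b\u061f")           # [P]
# [SW]: normalized forms of a few stop words (e.g., في, من, على)
STOP_WORDS = {"\u0641\u0649", "\u0645\u0646", "\u0639\u0644\u0649"}

def is_english(ch):
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")

def preprocess(tweet, stemmer=lambda tokens: tokens):
    # Remove irrelevant characters (diacritics, punctuation, digits, English)
    clean = "".join(
        ch for ch in tweet
        if ch not in DIACRITICS and ch not in PUNCTUATION
        and not ch.isdigit() and not is_english(ch)
    )
    tokens = clean.split()                 # tokenize on whitespace
    normalized = []
    for token in tokens:
        # Normalize the Alif forms (أ, إ, آ) to bare Alif (ا)
        for alif in ("\u0623", "\u0625", "\u0622"):
            token = token.replace(alif, "\u0627")
        # Normalize a trailing Yaa (ي) to dotless Yaa (ى)
        if token.endswith("\u064a"):
            token = token[:-1] + "\u0649"
        # Normalize a trailing Taa marbutah (ة) to Haa (ه)
        if token.endswith("\u0629"):
            token = token[:-1] + "\u0647"
        # Remove stop words (the list is pre-normalized to match)
        if token not in STOP_WORDS:
            normalized.append(token)
    return stemmer(normalized)
```

For example, `preprocess("حادث في الطريق!")` strips the punctuation and the stop word في, returning the remaining two normalized tokens.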
### 3.3.2. Tokenizer and Normalizer

To divide the text into a list of words (tokens), we used the split() method in Python, which returns a list of substrings after breaking the given string by a specified separator; in our case, the separator is any white space in the text. After that, the tokens are passed to the normalizer to replace letters that have different forms with their basic shape. The letter (ا), pronounced Alif, has three forms (أ, إ, آ), which are normalized to the bare Alif (ا). Besides, the letter (ي), pronounced Yaa, is normalized to the dotless Yaa (ى). In addition, the letter (ة), pronounced Taa marbutah, is normalized to (ه).

### 3.3.3. Stop-Words Removal

The Natural Language Toolkit (NLTK) [82] provides a stop-words list for the Arabic language. However, it was designed for formal Modern Standard Arabic. Therefore, we modified the list and added new stop-words that are usually used in dialectal Arabic. Subsequently, we considered common grammar mistakes; for instance, a preposition ending with the Yaa (ي) might be written with the dotless Yaa (ى), and vice versa. Besides, we added the common words that are used in Du'aa (prayer), because they are frequently used and keeping them is not necessary in our particular case. Before using the final generated stop-words list, we normalized it, because it will be matched against normalized tweets.
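The reason the stop-word list itself must be normalized can be seen in a short sketch: once tweets have passed through the normalizer, a raw (un-normalized) list no longer matches them. The normalization rules below mirror those of the previous subsection; the two example stop words are illustrative.

```python
# Why the stop-word list must be normalized before use.
def normalize(token):
    # Alif forms (أ, إ, آ) to bare Alif (ا)
    for alif in ("\u0623", "\u0625", "\u0622"):
        token = token.replace(alif, "\u0627")
    # Trailing Yaa (ي) to dotless Yaa (ى)
    if token.endswith("\u064a"):
        token = token[:-1] + "\u0649"
    # Trailing Taa marbutah (ة) to Haa (ه)
    if token.endswith("\u0629"):
        token = token[:-1] + "\u0647"
    return token

raw_stop_words = {"\u0641\u064a", "\u0625\u0644\u0649"}    # e.g., في, إلى
stop_words = {normalize(w) for w in raw_stop_words}        # e.g., فى, الى

token = normalize("\u0641\u064a")   # a token from a normalized tweet
print(token in raw_stop_words)      # False: the raw list misses it
print(token in stop_words)          # True: the normalized list matches
```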
