*1.2. Findings*

We formulate and analyze the findings of this paper from three relationship perspectives: information-structural, temporal, and spatio-temporal.

**Information-Structural Perspective:** In terms of the information-structural (or subject-matter) perspective, using the tool we detected 15 government pandemic measures and public concerns (quarantine, loans, salaries, mobility, etc.) and grouped them into six macro-concerns (economic sustainability, social sustainability, containing the virus, etc.). Among the **pandemic measures** implemented by the Saudi government in relation to the COVID-19 pandemic, we detected the curfew and restrictions on mobility in the country, quarantine and fines, restrictions on praying in mosques, campaigns to stay home, COVID-19 prevention, and cleaning services provided to curb the spread of the coronavirus. For **economic sustainability**, we detected that the government provided financial incentives including loans and private-sector salary support. Businesses increased offers to boost their sales. People moved to, or increased, their online economic activities, such as activities related to prize draws for income earnings. For health, well-being, and **social sustainability**, we detected that blood donation and treatment at hospitals were major causes of concern. People also actively discussed the new numbers of cases. The **daily livelihood** issues in Saudi Arabia include the five daily congregational prayers at the mosques, which were suspended by the government. This was a major concern because praying the five daily prayers in congregation is compulsory in Islam (with certain exceptions). People usually pray in mosques in congregation, standing close to each other and aligning shoulders and ankles with the persons on their right and left, which is risky in a pandemic. People also increased their supplications for the safety of others. Roads were found to be empty or with abnormally low traffic during the pandemic, and this was also vigorously discussed.
A significant reduction in mobility was noted across the country, which relates to **environmental sustainability**, health, and well-being through the reduction in traffic congestion and air pollution. The detected events in the Kingdom of Saudi Arabia (KSA) also align with **international concerns**, such as various lockdown measures [14], reduced mobility [15], a reduction in blood donations [16], financial difficulties and related government incentives [17,18], and worries related to returning to normal times [19].

**Temporal Perspective:** Regarding the temporal perspective of the various pandemic-related events within the time period of the dataset (1 February 2020–1 June 2020), we are able to see temporal relationships in the progression of various events. Figure 1 shows the timeline of some of the detected government measures and public concerns. Some of the events whose Twitter activity remained high for a period are shown with their start and end times. The earliest detected events in the data are related to virus infection, prevention, the curfew, and staying home. Between **mid-March 2020** and the **end of May** (with some intermittent gaps), people also increased their Twitter activity related to the virus-infection concern (the spread of the coronavirus and the increase in the number of cases). The curfew (7 a.m.–6 p.m.) in Saudi Arabia was first ordered on **22 March** and applied from the next day. The events related to loans were detected with the highest peak on **22 March**, the same day the Saudi Arabian Monetary Agency (SAMA) announced them [20]. The Twitter activity related to the quarantine event (we use "measure", "concern", and "event" interchangeably, as appropriate) remained high from **the initial period** of the curfew to around **mid-April**. The "No Mobility" event (empty roads) was vigorously discussed on **24 March**, two days after the curfew was ordered. The curfew situation and the reduction in government services, along with people's fear of getting infected by the virus, caused a reduction in blood donations and blood supplies. Activity requesting blood donations was seen to increase from **late March** to **mid-April**. On **2 April**, a 24-hour curfew was enforced in Makkah and Medina (the two holiest cities in KSA and the Muslim world), which stirred heavy Twitter activity.
The salary-related events were detected on **3 April**, the day King Salman of Saudi Arabia ordered a contribution towards 60% of the salaries of Saudi private-sector employees with a financial incentive of 9 billion Riyals in total (this was verified through external sources [21]). A peak of activity related to the Five Daily Prayers concern was found on **26 May**, when the Ministry of Interior announced that daily prayers would be allowed in all mosques of the Kingdom (except the mosques in Makkah city). Finally, we detected the "Back to Normal" government measure on **29 May 2020**.

**Figure 1.** The timeline of some of the detected government pandemic measures and public concerns.

**Spatio-Temporal Perspective**: We extracted location information using different approaches including tweet text and hashtags, geo-coordinate attributes, and user profiles. We were able to detect important events in over 50 cities around the kingdom with major activities related to COVID-19 cases, curfew, etc., in the Makkah, Riyadh, and Eastern provinces.

We validated the detected government measures and public concerns, and their spatial and temporal nature, through external validation by searching online news media or through internal validation by checking the tweets. These findings show the effectiveness of Twitter in detecting important events, government measures, public concerns, and other information in both time and space with no prior knowledge about them.

The organization of the paper is as follows. Section 2 reviews the related works and elaborates on the research gaps. Section 3 explains our methodology and the design of the tool. Section 4 discusses the results and analysis. Section 5 gives the conclusions and directions for future work.

#### **2. Literature Review**

Smart cities and societies are driven by the need to provide highly competitive, productive, and smarter environments through the innovation and optimization of urban processes and life [22,23]. Artificial intelligence has taken us by storm [24], and has led to the emergence of concepts such as artificially intelligent cities [25]. A key to providing smartness for emerging urban and rural environments is to continuously sense and analyze these environments and make timely and effective decisions [6,10,24,26]. Social media analysis using machine learning has become a key method to provide the pulse for sensing and engaging with the environments [6] and is expected to provide smarter solutions for our fight against COVID-19 and future pandemics as well as during peace times.

We review here the literature relevant to the topic of this paper, which is the detection of COVID-19-related public concerns from social media (big) data in the Arabic language using machine learning, specifically the LDA topic modelling method. Firstly, in Section 2.1, we provide a background on the pre-COVID-19 use of social media in various application domains. Subsequently, in Section 2.2, we review the works on COVID-19 analysis that have used social media data, without limiting the reviewed works to any analysis method or language. In Section 2.3, we review the works on COVID-19 analysis and social media that have specifically used topic modelling for analysis purposes; these works are not limited to any language. We focus on the Arabic language in Section 2.4 and review the works related to COVID-19 analysis that use Twitter data. Finally, Section 2.5 discusses the research gap.

#### *2.1. Use of Social Media in Research (Pre-COVID-19)*

Digital societies could perhaps be characterized by their increasing desire to express themselves and interact with others, and this is done through various digital platforms such as social media. It is reported that roughly 58% of the global "eligible population" (70% of the eligible population in 100 countries around the world) uses social media [10,27]. Social media can provide a two-way communication channel for individuals, governments, businesses, and others to engage with their friends, communities, stakeholders, etc. [10]. The traditional methods of data collection and analysis using surveys and other means cannot capture such timely and large-scale data, and they have other disadvantages besides. Researchers in recent years have increasingly used social media, including Twitter, to study different issues in many application domains and sectors, and this trend has been ramping up in COVID-19-related research and other studies. Social media and the Internet of Things (IoT) provide the pulse for sensing and engaging with the environments [10]. Sentiment analysis, or opinion mining, which utilizes social and other textual media, is a vital tool in natural language processing (NLP), defined as "the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions toward entities such as products, services, organizations, individuals, issues, events, topics, and their attributes" [10,28].
Many of the notable works on sentiment analysis rely on machine learning and social media, with applications in logistics and urban planning [12,25,29–31]; categorizing tweets about road conditions into useful, nearly useful, and irrelevant complaint tweets [2]; identifying sources of noise pollution [32]; extracting traffic-related information from tweets [3]; general and traffic-related event detection [6,9,11,33–35]; public opinion mining for government services [13]; detecting health-related topics from the stream of tweets (without aiming to detect a particular illness) [36]; tracking the side effects of certain medications [37]; the detection of top symptoms, diseases, and medications and related awareness activities [10]; tracking flu infections on Twitter [38]; influenza surveillance from social media data [39–41]; and many more.

#### *2.2. COVID-19 and Social Media (General)*

We review here the works about COVID-19 analysis that have used social media data without regard to any modelling method or language. Singh et al. [42] analyzed tweets about the coronavirus in different languages including English, French, German, Italian, and others. Furthermore, they performed a spatio-temporal analysis of the data. They focused on three countries, the United States, Italy, and China, and showed the time series of tweets and the daily confirmed COVID-19 cases. They found that the countries that had a higher number of COVID-19 cases also had a higher number of tweets about COVID-19. Gencoglu [43] applied supervised classification to capture COVID-19-related discourse during the pandemic. They collected around 26 million tweets using the Twitter streaming API with keyword filtering. They trained classifiers using k-nearest neighbor, logistic regression, and support vector machine (SVM) to classify the tweets into 11 categories including donate, prevention, reporting, share, speculation, symptoms, and others. For training the machine learning classifiers, they utilized two annotated datasets of questions and comments related to COVID-19. The datasets covered several languages, including English, French, and Spanish, and were generated by native-speaker annotators based on an ontology. Then, they employed language-agnostic BERT (Bidirectional Encoder Representations from Transformers) sentence embeddings to obtain a pre-trained model. To extract the embeddings, they used the TensorFlow framework on a 64-bit Linux machine with an NVIDIA Titan Xp GPU. They found that Twitter activity increased as COVID-19 spread across the world.

Several other works on the use of social media for COVID-19 analysis have been reported. These include studies on American and Chinese peoples' views on COVID-19 [44], the mood of Indian people during the pandemic [45], the spread of anti-Asian hate speech [46], the political tension between Brazil and China [47], and the identification of emotional valence and predominant emotions [48]. Moreover, others have modelled social media data for COVID-19-related analysis to study the spread of misinformation about the coronavirus [49–51], discover political conspiracies in the U.S. that were posted by automated Twitter accounts during the COVID-19 outbreak [52], identify the causal relationship between daily Twitter activity and sentiments during the pandemic [53], and study the frequency of the phrases "Chinese virus" and "China virus" before and after the outbreak in the United States [54].

None of the works reported in this subsection have a focus or methods similar to our research reported in this paper. None of them have used the distributed big data computing framework Apache Spark. The discussed works did not support social media in the Arabic language, which, as mentioned earlier, has its own challenges, particularly since it is not based on the Latin script. Moreover, the size and period of the used data are also different.

#### *2.3. COVID-19 and Topic Modeling*

We review here the works about COVID-19 analysis using social media that have specifically used topic modelling as the modelling method. These works are not limited to any language. Liu et al. [55] studied the role of the Chinese mass media during the COVID-19 crisis using news articles from the WiseSearch database. They applied LDA, extracted 20 topics, and then classified them into nine themes. The topics include prevention and control policy, prevention and control measures, medical affiliation and staff, epidemiologic study, and others. The themes include confirmed cases, prevention and control procedures, medical treatment and research, detection at public transportation, and others. Kaila and Prasad [56] applied LDA and found topics related to the coronavirus in 18,000 tweets. They also applied sentiment analysis and found that most of the tweets were negative. Abd-Alrazaq et al. [57] identified twelve topics from 167,073 tweets, collected for the period 2 February 2020 to 15 March 2020, using LDA and grouped them into four themes: the origin of COVID-19, the source of the novel coronavirus, the impact of COVID-19 on people and countries, and the methods for decreasing the spread of COVID-19. Then, they used a simple string-matching technique to find tweets that contain the selected keywords of the topics. Additionally, they calculated the interaction rate for each topic after calculating the sentiment score and the number of retweets, likes, and followers for each topic. None of the works discussed in this paragraph have applied temporal or spatial analysis, supported social media in the Arabic language, or used distributed big data computing platforms such as Apache Spark.

Med [58] collected 94,467 posts from the Reddit website between 3 March and 31 March. They then applied LDA and found 50 topics, 10 of which were assigned to one of the following categories: public health measures, daily life impact, and sense of pandemic severity. After that, they measured daily changes in the frequency of the topics. Ordun et al. [59] applied keyword analysis to find the most frequent words. They analyzed around 5.5 million tweets in different Latin-script languages; Arabic, Chinese, and other languages based on non-Latin scripts were not included. They used term frequency–inverse document frequency (TF-IDF) and set max\_features to 10,000. In addition, they performed topic modeling and identified twenty topics using the default parameters of the Gensim LDA MultiCore model. For each topic, they extracted the top twenty terms and used the first three terms to label the topic. Further, they used Uniform Manifold Approximation and Projection (UMAP) to visualize how the 20 topics grouped together, and performed a temporal analysis to examine the trend of the topics over time. They also applied a time-to-retweet analysis, measuring the time between a tweet and its retweets. None of the works discussed in this paragraph have used distributed big data computing platforms such as Apache Spark, supported social media in the Arabic language, or applied spatial analysis.

Mackey et al. [60] applied the Biterm Topic Model (BTM) to detect topics related to COVID-19 symptoms, experiences with access to testing, and disease recovery. They collected around 4 million tweets after filtering by keywords, for the period 3 March 2020 to 20 March 2020. The tweets were then grouped into five main thematic categories: "conversations about first and secondhand reports of symptoms", "symptom reporting concurrent with lack of testing", "discussion of recovery", "confirmation of negative diagnosis", and "discussion about recalling symptoms". For the analysis, they used Python packages and RStudio. Additionally, they analyzed the time and location of the geotagged tweets (in our work, we use multiple methods for the location extraction of tweets). Li et al. [61] detected stress symptoms related to COVID-19 in the United States. They integrated a Correlation Explanation (CorEx) learning algorithm with a clinical Patient Health Questionnaire (PHQ) lexicon and proposed the CorExQ9 algorithm. They collected 80 million tweets for the period of January 2020 to April 2020 and used a Jupyter computing environment deployed on the Texas A&M High Performance Computer. They compared CorExQ9 with LDA and non-negative matrix factorization (NMF). Moreover, they visualized the symptoms of COVID-19-related stress at the county level for multiple two-week periods. These works differ from our work in multiple aspects, including the foci of the studies, the overall methodology, the specifics of the analysis, the time period of the data used, and particularly the processing of social media in the Arabic language.

#### *2.4. COVID-19 and Twitter (Arabic Language)*

We review here the works related to COVID-19 analysis that use Twitter data with a focus on tweets in the Arabic language. Alam et al. [62] analyzed Arabic and English tweets during the COVID-19 pandemic to find whether the tweets contained a factual claim, defining annotation guidelines for manual annotation. Alshaabi et al. [63] collected tweets in 24 languages, including Arabic. They created time series for the top thousand 1-grams for each language and then made basic observations about some of the time series, including the use of the word "virus" in the tweets of all languages. Alsudais and Rayson [64] collected around 1 million tweets about the coronavirus for the period December 2019 to April 2020 and clustered them using the K-means algorithm with the Python Scikit-learn package. They found five topics: "COVID-19 statistics", "prayers for God", "COVID-19 locations", "advice and education for prevention", and "advertising". Besides this, to identify rumors, they applied supervised classification and labeled 2000 tweets as false information, correct information, or unrelated. This review of the works on COVID-19 analysis using Twitter data in the Arabic language shows that works on the topic are scarce and limited in their variety and in the depth of the technologies, methods, and analyses used. For example, none of these works have used big data platforms, and none have reported spatio-temporal analysis.

#### *2.5. Research Gap, Novelty, and Contributions*

The literature review provided in this section clearly establishes the enormous potential of social media analytics for COVID-19-related studies. The traditional methods of data collection and analysis using surveys and other means cannot capture such timely and large-scale data, and they have other disadvantages besides. The state of the art in social media analytics for COVID-19-related studies is limited. Many more studies are needed to improve the breadth and depth of the research on the subject with regard to the focus of the studies, the size and diversity of the data, the applicability and performance of the machine learning methods, the diversity of the social media languages, the scalability of the computing platforms, etc. The maturity of research in this area will allow the development, commercialization, and wide adoption of tools for pandemic-related and general surveillance and other purposes.

The research reported in this paper differs from the existing works on social media analytics for COVID-19-related studies in several respects, including the focus of the studies, the methodology, the size of the data, the time period of the social media data, support for social media in the Arabic language, whether the studies have used big data distributed computing platforms, the breadth and depth of the reported analysis (such as spatial and temporal analysis), the geographical focus of the studies, and the specific findings. None of the existing works have reported a similar COVID-19 analysis of Twitter data in the Arabic language with regard to the modelling method used and the depth of the analysis. The Twitter data we have used, its time period, and the methodology of its collection and analysis are different. The methods used for the validation of the findings are also different. None of the existing works on COVID-19 analysis have used big data technologies for social media in Arabic. Even the works that use big data distributed computing platforms for the analysis of text in languages other than Arabic are very limited and differ in several aspects. The scalability of software systems for COVID-19 analysis is critical and is being hampered by the challenges related to the management, integration, and analysis of big data (the 4V challenges). We have developed a novel architecture and pipeline (see Figure 2) for big data management and analysis using distributed machine learning. We have also provided an analysis of the execution time complexity of the LDA algorithm for different numbers of iterations (between 5 and 1000) on varying numbers of computing cores (see Section 4.4). The use of big data distributed computing technologies is important because it will allow the scalability and integration of COVID-19-related software with each other and with other healthcare and smart city systems.

**Figure 2.** The tool architecture.

#### **3. The System Methodology and Design**

The architecture of the proposed system is depicted in Figure 2. It comprises five components, shown in the figure as five separate blocks and discussed in the following subsections after the overview below.

#### *3.1. The System Overview*

We built our tool on Apache Spark, a big data platform for in-memory computations on distributed data. Apache Spark provides the Spark ML package for machine learning and Spark SQL for data handling. Spark SQL acts as a distributed SQL engine. Additionally, it offers a programming abstraction called DataFrames, which is conceptually equivalent to a table in a relational database but is immutable, parallel, and distributed to handle big data. The proposed tool was developed in Python and runs on the Aziz supercomputer, which supports running Spark with YARN. Aziz consists of 380 regular compute nodes and 112 compute nodes with large memory, as well as 2 additional GPU compute nodes and 2 additional MIC compute nodes. All the compute nodes run CentOS 6.4 with dual Intel E5-2695v2 processors. Each node has 24 cores. Regular nodes provide 96 GB of memory, while large-memory nodes provide 256 GB. Aziz also provides the Fujitsu Exabyte File System (FEFS), which offers 7 petabytes of high-speed storage for the input/output data of running jobs.

Algorithm 1 shows the master algorithm. The inputs are the search queries and the geo-coordinates, which are required by the Data Collection and Storage Component (DCSC), in addition to the location dictionary, which is used during spatio-temporal information extraction. The dataset was collected using the Twitter REST API and stored in MongoDB. The tweets are then loaded into a Spark DataFrame (DF), a distributed data collection organized into named columns. After that, the tweet DataFrame is passed to the Data Pre-Processing Component (DPC), which removes noise from the text and produces cleaned, normalized, and stemmed tokens. The major concerns are then discovered using the Measures and Concerns Detector Component (MCDC), which applies an unsupervised LDA model to cluster the tweets. Subsequently, to enable spatial and temporal analysis, the date, time, and location information is extracted using the Spatio-Temporal Information Component (STIC). Finally, in the Validation and Visualization Component (VVC), the results are visualized and validated against external or internal sources.
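The component flow just described can be sketched as a simple orchestration function. This is an illustrative sketch only: the function and parameter names below are hypothetical stand-ins for the paper's DCSC, DPC, MCDC, STIC, and VVC components, which are actually implemented on Apache Spark.

```python
def run_pipeline(search_queries, geo_coordinates, location_dict,
                 collect, preprocess, detect, extract, validate):
    """Sketch of the Algorithm 1 flow: each callable stands in for one
    of the five components described in the following subsections."""
    raw_tweets = collect(search_queries, geo_coordinates)   # DCSC: API -> MongoDB
    tokens = preprocess(raw_tweets)                         # DPC: clean/normalize/stem
    concern_groups = detect(tokens)                         # MCDC: LDA clustering
    events = extract(concern_groups, location_dict)         # STIC: date/time/place
    return validate(events)                                 # VVC: visualize/validate
```

Passing the components in as callables keeps the sketch honest about what each stage consumes and produces without committing to any particular implementation.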

#### **Algorithm 1:** Master


#### *3.2. Data Collection and Storage Component (DCSC)*

The experimental dataset contains Arabic tweets collected using the Twitter REST API during the period from 1 February to 1 June 2020. The total number of fetched tweets is approximately 14.8 million. The tweets were acquired using two methods. First, we used keywords and hashtags related to the coronavirus, such as #corona and #covid19 and their Arabic equivalents, as well as official accounts that post about it, such as the account of the Saudi Ministry of Health (@SaudiMOH). The second method fetched tweets without keyword filtering to make sure that we did not miss any important tweets, because we wanted to see what topics people were talking about and how the pandemic had changed their lives. Subsequently, we used geolocation filtering to obtain only tweets posted in Saudi Arabia, because our main focus in this work is to find the major concerns during the pandemic in Saudi Arabia.

Algorithm 2 illustrates the data collection algorithm. To store the collected tweets, we searched for a storage method that supports flexible schemas. We therefore selected a NoSQL database, specifically MongoDB, which is a document-oriented database. Document-oriented databases can store various document data types, such as XML and JSON.

Moreover, to store the output of each component, we used Parquet file storage. One of the reasons for selecting Parquet is that it is supported by many data processing systems, including Apache Spark. It also automatically preserves the schema of the original data and provides good performance for both storage and processing. The files were stored using the Fujitsu Exabyte File System (FEFS), which is a scalable parallel file system based on Lustre. Finally, the duplicated tweets were removed before passing them to the next stage, which is pre-processing.
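The de-duplication step can be sketched in plain Python as follows. In Spark this would typically be a `dropDuplicates` on the tweet-id column; the `id` field name below is an assumption for illustration.

```python
def deduplicate(tweets):
    """Remove duplicate tweets by their id while preserving order.
    `tweets` is a list of dicts, each assumed to carry an 'id' field."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet["id"] not in seen:
            seen.add(tweet["id"])   # remember ids already emitted
            unique.append(tweet)
    return unique
```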


#### *3.3. Data Pre-Processing Component (DPC)*

The main pre-processing steps can be summarized as follows: (1) irrelevant-character removal, (2) tokenization, (3) normalization, (4) stop-word removal, and (5) stemming. In the first step, we removed all numbers, the English alphabet, and all punctuation marks. This includes removing @, with which every username starts, and # and \_, which are used in hashtags. However, we keep the hashtag name itself if it is not in English, because it might include useful information, such as a city name. Removing English characters and punctuation also removes links and all punctuation, including the Arabic semi-colon (؛) and the Arabic question mark (؟). Furthermore, we removed the thirteen forms of Arabic diacritics [65], which can be grouped under three categories: vowel, nunation, and shadda diacritics. Vowel diacritics include the three main short vowels, called in Arabic Fatha (ـَ), Damma (ـُ), and Kasra (ـِ), as well as the Sukun diacritic (ـْ), which indicates the absence of any vowel. Nunation diacritics represent the doubled versions of the short vowels, known in Arabic as Fathatan (ـً), Dammatan (ـٌ), and Kasratan (ـٍ). The last form of diacritics is Shadda (gemination), the consonant-doubling diacritic (ـّ). Shadda can also be merged with diacritics of the two previous types, resulting in a combined diacritic such as (ـَّ) or (ـًّ).

The second step divides the text into tokens. We used the split() method in Python with the white-space separator. The third step uses the Normalizer to reduce the words (tokens) that contain different forms of Alif (أ, إ, آ), Yaa (ي), and Taa Marbutah (ة) to their basic forms. To clarify, the letter Taa Marbutah (ة) is replaced with Haa (ه), while Yaa (ي) is replaced with dotless Yaa (ى). Additionally, the three forms of Alif (أ, إ, آ) are replaced with bare Alif (ا).
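Assuming the standard Unicode code points for the diacritics and letter variants named above, the diacritic-removal and normalization steps can be sketched in plain Python; the function name is illustrative, and the paper's tool performs these steps inside a Spark pipeline.

```python
import re

# The eight base Arabic diacritic marks occupy the contiguous Unicode
# range U+064B-U+0652 (Fathatan, Dammatan, Kasratan, Fatha, Damma,
# Kasra, Shadda, Sukun); combined forms are sequences of these marks.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

# Normalization map: Alif variants -> bare Alif, Taa Marbutah -> Haa,
# Yaa -> dotless Yaa (Alif Maqsura).
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # أ -> ا
    "\u0625": "\u0627",  # إ -> ا
    "\u0622": "\u0627",  # آ -> ا
    "\u0629": "\u0647",  # ة -> ه
    "\u064A": "\u0649",  # ي -> ى
})

def clean_and_normalize(text: str) -> str:
    """Strip diacritics, then reduce letter variants to their basic forms."""
    return DIACRITICS.sub("", text).translate(NORMALIZE)
```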

The fourth step removes stop-words. To do this, we extended the stop-word list provided by the Natural Language Toolkit (NLTK) with a new list of stop-words and normalized them. Since the NLTK stop-word list was designed for formal Modern Standard Arabic, we modified the list to include words that are usually used in dialectal Arabic, and we also accounted for common spelling mistakes, such as frequently misspelled prepositions. Besides this, we included words that are usually used in Du'aa (prayer). After that, we normalized the final stop-word list before using it, because it is matched against normalized text. This component is part of our earlier work, Iktishaf; for further details, see the pre-processing algorithm in [6].

Finally, we stem the tokens using the Iktishaf Light Stemmer [6]. Unlike existing Arabic light stemmers, the Iktishaf stemmer was designed to minimize the number of letters removed and to avoid changes in meaning. It uses a predefined list of prefixes and suffixes; based on the length of the word, the tool decides which affixes can be removed. This minimizes word confusion and the loss or change of word meaning. For further details, see the stemmer algorithm in [6].

#### *3.4. Measures and Concerns Detector Component (MCDC)*

To discover concerns, we used the Latent Dirichlet Allocation (LDA) topic modeling algorithm. LDA is a statistical model used to identify the main topics discussed in a collection of documents. It is an unsupervised method that models documents and topics based on the Dirichlet distribution. Each document is characterized by a probability distribution over the topics, while each topic is modeled as a probability distribution over words. The model receives a collection of documents and returns a set of topics, each comprising a set of words. This requires defining the number of topics, denoted by *k*, to model the distributions. In this work, the tweets are the documents, and we refer to topics as concerns. Apache Spark has supported LDA in the MLlib package since Spark 1.3.0, and it is also available in the ML package.

Algorithm 3 illustrates the algorithm of the Measures and Concerns Detector Component. The inputs for this component are the pre-processed tweets (tweets\_p), the set of concern numbers ([K]), the set of iteration numbers ([R]), and the threshold value. The output of the DPC is loaded from Parquet files and stored in a Spark DataFrame (tweet\_DF). For training the model, we need to pass the documents (tweets) as vectors of word counts, so we used the CountVectorizer function. Then, we applied TF-IDF (term frequency–inverse document frequency) weighting, a statistical measure used to evaluate how important a word is to a document (tweet) in a collection (of tweets). TF-IDF comprises two parts: Term Frequency (*TF*) and Inverse Document Frequency (*IDF*). *TF* measures how frequently a word occurs in a tweet. It is calculated using the following equation:

$$TF\_{w,t} = \frac{f\_{w,t}}{n\_t}, \tag{1}$$

where $f\_{w,t}$ is the frequency of word *w* in tweet *t* and $n\_t$ is the total number of words in that tweet.

$$IDF\_w = 1 + \log \frac{|T|}{|\{t : w \in t\}|}, \tag{2}$$

where $|T|$ is the total number of tweets and the denominator is the number of tweets that contain the word *w*. The product of *TF* and *IDF* then gives the weight of the word *w* in tweet *t*:

$$TF\text{-}IDF\_{w,t} = TF\_{w,t} \times IDF\_w. \tag{3}$$
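Equations (1)-(3) can be computed directly for a toy collection of tweets. This plain-Python sketch mirrors the formulas rather than Spark's CountVectorizer/IDF implementation, and the sample tweets are made-up word lists:

```python
import math

def tf(word, tweet):
    # Equation (1): frequency of the word divided by the tweet length.
    return tweet.count(word) / len(tweet)

def idf(word, tweets):
    # Equation (2): 1 + log of total tweets over tweets containing the word.
    containing = sum(1 for t in tweets if word in t)
    return 1 + math.log(len(tweets) / containing)

def tf_idf(word, tweet, tweets):
    # Equation (3): the product of TF and IDF.
    return tf(word, tweet) * idf(word, tweets)

tweets = [["curfew", "riyadh", "curfew"],
          ["loan", "salary"],
          ["curfew", "loan"]]
# "curfew" appears twice in a 3-word tweet and in 2 of the 3 tweets.
weight = tf_idf("curfew", tweets[0], tweets)
```

Words that occur in almost every tweet get an *IDF* near 1, so the weighting favors words that are frequent in a tweet but rare across the collection.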

After passing the collection of tweets as vectors to the LDA model, we need to specify the number of concerns (*k*), which can also be thought of as the number of cluster centers. To find a suitable number of concerns, we tested different values of *k* and calculated the perplexity. Perplexity is a statistical criterion of how well a probability model predicts a sample and is a standard metric for measuring generalization performance [66]; a lower perplexity score indicates a better model. Further, we tested different iteration numbers to find the best value.
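Perplexity is the exponentiated negative average log-likelihood of held-out words. The sketch below assumes we already have per-word predicted probabilities from some fitted model (the probability lists are invented for illustration) and shows why lower values indicate a better fit:

```python
import math

def perplexity(word_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over N held-out words."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that assigns higher probability to the observed words
# is less "perplexed" by them, so its score is lower.
good_model = [0.5, 0.4, 0.6]
bad_model = [0.05, 0.1, 0.02]
```

We sweep *k*, score each fitted model this way, and stop increasing *k* once the perplexity gain flattens out.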

**Algorithm 3:** Measures and Concerns Detector

**Input:** tweets\_p; [K]; [R]; threshold

**Output:** concerns[][], tweets\_g\_DF


Figure 3 shows the perplexity score against the number of concerns, *k*. The perplexity score decreases as *k* increases, with some minor exceptions. The gain in the perplexity score after *k* = 15 is relatively insignificant. Therefore, we use 15 as the value of *k*; i.e., the number of concerns to be detected by our tool is set to 15. We also carried out an empirical analysis of the concerns detected for different values of *k* and found that *k* = 15 produces the best results.

Moreover, the model that achieved the best results is trained to obtain the final concerns list. By calling the *describeTopics* function, we obtain a list of the top terms for each concern; from these terms, we can interpret the concern and define a label that represents it. For each tweet, we get an array of probabilities representing how strongly the tweet belongs to each cluster. The concern probabilities as well as the tweets are stored in concernsProb\_tw\_DF. Since each tweet should belong to exactly one concern (cluster), we pick the concern with the highest probability in the array and consider it the best concern to represent the tweet; thus, we obtain a group of tweets under each concern. Because we have a large number of tweets, some of them might be assigned to a concern merely because it has the highest probability among the concerns while the probability value itself is very low. To keep only the tweets that are highly related to a concern, we define a threshold and filter out the tweets whose probability is below it. This value depends on the data; in our case, most of the tweets have a probability higher than 0.8, as shown in Figure 4, so we set the threshold to 0.8. The outputs of this component are the lists of top keywords that explain each concern and the tweets grouped by concern. The detected concerns are explained later in the results section (see Section 4).
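The assignment-and-filtering step can be sketched as follows; the probability arrays are illustrative stand-ins for the per-tweet topic distributions returned by the trained model:

```python
def assign_concerns(tweet_probs, threshold=0.8):
    """Assign each tweet to its highest-probability concern, and keep it
    only if that probability reaches the threshold (0.8 in our case)."""
    assignments = {}
    for tweet_id, probs in tweet_probs.items():
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            assignments[tweet_id] = best
    return assignments

# Toy distributions over three concerns.
probs = {
    "t1": [0.9, 0.05, 0.05],   # clearly concern 0 -> kept
    "t2": [0.4, 0.35, 0.25],   # best concern is still weak -> filtered out
}
```

The threshold trades coverage for precision: a lower value keeps more tweets per concern but admits weakly related ones.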

In the MCDC component, we also compute the correlation matrix by calculating correlation coefficients between the keywords of the detected concerns. This helps in understanding the relationships between the keywords. There are three main types of correlation coefficients: Pearson, Kendall, and Spearman. The Pearson correlation coefficient is the most commonly used, and we selected it in this work. It measures the linear dependence between two variables. The Pearson correlation between two variables *x* and *y* is computed using Equation (4) below.

$$r = \frac{\sum\_i (x\_i - \overline{x})(y\_i - \overline{y})}{\sqrt{\sum\_i (x\_i - \overline{x})^2 \sum\_i (y\_i - \overline{y})^2}},\tag{4}$$

where $x\_i$ is the *i*th value of the *x* variable, $y\_i$ is the *i*th value of the *y* variable, and $\overline{x}$ and $\overline{y}$ are the means of the values of the *x* and *y* variables, respectively.

The correlation matrix (see Section 4.1) is a symmetric (K × K) square matrix whose entry AB shows the correlation between the keyword in row A and the keyword in column B. Each cell has a value between −1 and 1, where 1 represents a strong positive correlation and −1 represents a strong negative correlation.
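Equation (4) and the resulting K × K matrix can be sketched in plain Python; the keyword frequency series below are made-up illustrative data, not values from our corpus:

```python
import math

def pearson(x, y):
    # Equation (4): covariance over the product of standard deviations.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den

# Hypothetical daily frequencies of two concern keywords.
series = {"curfew": [5, 9, 13, 21], "quarantine": [4, 8, 12, 20]}
keys = list(series)
# Symmetric correlation matrix with ones on the diagonal.
matrix = [[pearson(series[a], series[b]) for b in keys] for a in keys]
```

Because pearson(a, b) equals pearson(b, a), only the upper (or lower) triangle of the matrix carries distinct information.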

#### *3.5. Spatio-Temporal Information Component (STIC)*

In this work, we first identify the concerns; we then call each tweet under a concern that carries location or time information an event. To apply spatio-temporal analysis, we need to know the time and the location of the extracted event.

The data obtained using the Twitter API are encoded in JavaScript Object Notation (JSON). Each tweet object we obtained can have over 150 attributes associated with it, according to the Twitter documentation [67]. Each child object, such as user and place, encapsulates attributes that describe it.

We extracted the time and date information from the "created\_at" attribute, which gives the UTC time at which the tweet was created. For location extraction from the tweet object, we applied several techniques.
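For reference, the created\_at value uses Twitter's fixed English-locale UTC timestamp format, which can be parsed as below (the sample timestamp is an illustrative value, not one of our tweets):

```python
from datetime import datetime, timezone

# Twitter's "created_at" uses a fixed English-locale UTC format.
raw = "Wed Mar 25 14:30:05 +0000 2020"   # illustrative sample value
created_at = datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y")
```

Parsing with `%z` keeps the timezone offset attached, so the timestamps can be safely bucketed by day for the temporal analysis.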

The first approach extracts location names from the "text" attribute, which contains the tweet message. The location name might be explicitly mentioned in the text or be part of the hashtags. We generated a dictionary of Saudi cities in English and Arabic along with their coordinates. Before using the dictionary to search for city names in the text, we passed the Arabic names list through the Iktishaf Light Stemmer, because the names are extracted from the text after pre-processing has been applied. If no city name is found in the text, we move to the next approach, which looks for geo-coordinate information.

The second approach obtains coordinates from the "coordinates" or "place" child objects. The "place" child object includes several attributes, such as "place\_type", "place\_name", and "country\_code"; the "place\_type" can be either a city or a point of interest (poi). We only fall back to this approach when no location is found in the text, because the geo-coordinates within the tweet object represent the location where the user was physically present at the time of posting, which is not necessarily the actual location of the event being discussed. Moreover, if users disable location services on their smartphones, the values of these attributes will be null.

Thus, the third approach extracts information from the "user" child object. This contains the user profile information, such as the screen name and bio, which includes a short description as well as the country and city name. Users fill in this information manually, so it can be written in English or Arabic, and they might use different spellings; for example, Makkah can be written as Makah or Mecca. Therefore, our location extractor was designed to recognize both the English and Arabic names as well as the common alternative names of Saudi cities. However, users usually fill in this information when they create their accounts and do not update it when they travel to another country or city. That is why we apply this approach last, only when the location information is not found by the previous two approaches.
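The three-step fallback described above can be sketched as follows. The field names mirror the tweet object, while the city dictionary, its entries, and the helper logic are simplified illustrations (the real dictionary also carries coordinates and Arabic names):

```python
# Minimal sketch of the three-step location cascade.
# The dictionary is a tiny illustrative subset of the real one.
CITIES = {"riyadh": "Riyadh", "jeddah": "Jeddah",
          "makkah": "Makkah", "makah": "Makkah", "mecca": "Makkah"}

def find_city(text):
    """Dictionary lookup over a lowercased text (or profile) string."""
    return next((c for w, c in CITIES.items() if w in text.lower()), None)

def extract_location(tweet):
    # 1) City name mentioned in the tweet text or hashtags.
    city = find_city(tweet.get("text", ""))
    if city:
        return city
    # 2) Geo information attached to the tweet, if location services were on.
    place = tweet.get("place")
    if place and place.get("place_name"):
        return place["place_name"]
    # 3) Last resort: the manually filled user profile location.
    user = tweet.get("user", {})
    return find_city(user.get("location", "")) if user else None
```

Ordering the cascade this way prefers the location the tweet talks about over where it was posted from, and trusts the stale profile field least.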

#### *3.6. Validation and Visualization Component (VVC)*

We followed two methods to validate the identified concerns as well as their spatial and temporal nature. The first method, which we consider external validation, is based on searching various official sources, reports, and news media on the web. The second method is based on the Twitter data we have, which can provide detailed information in addition to space and time information, particularly when a tweet was posted by an official news account such as @spagov or the account of the Ministry of Health.

After identifying the public concerns using the MCDC (see Section 3.4), we drew line charts to show the changes in the concerns over time. Further, to show the spatial nature of the concerns, we plotted them on top of a map of Saudi Arabia. For this purpose, we used Power BI and Tableau.
