*Article* **Spatio-Temporal Sentiment Mining of COVID-19 Arabic Social Media**

**Tarek Elsaka 1,2,\*, Imad Afyouni <sup>1</sup> , Ibrahim Hashem <sup>1</sup> and Zaher Al Aghbari <sup>1</sup>**


**Abstract:** Since the outbreak of COVID-19, many scientists have started working on challenges related to mining the large datasets available from social media as an effective asset for understanding people's responses to the pandemic. This study presents a comprehensive social data mining approach, applied to the Arabic language, that provides in-depth insights related to the COVID-19 pandemic. We first developed a technique to infer geospatial information from non-geotagged Arabic tweets. Secondly, we introduce a sentiment analysis mechanism that operates at various levels of spatial granularity and at separate topic scales: we applied sentiment-based classifications at various location resolutions (regions/countries) and separate topic abstraction levels (subtopics and main topics). In addition, we present a correlation-based analysis of Arabic tweets and the official health providers' data. Moreover, we implemented several mechanisms of topic-based analysis using occurrence-based and statistical correlation approaches. Finally, we conducted a set of experiments and visualized our results based on a combined geo-social dataset, official health records, and lockdown data worldwide. Our results show that the total percentage of location-enabled tweets increased from 2% to 46% (about 2.5M tweets). A positive correlation between top topics (lockdown and vaccine) and new COVID-19 cases was also recorded, while Arab Twitter users generally expressed negative feelings during this pandemic on topics related to lockdown, closure, and law enforcement.

**Keywords:** Arabic tweets; COVID-19 pandemic; sentiment analysis; social data mining; spatiotemporal correlation

#### **1. Introduction**

Global digital statistics [1] reveal that there were more than 4.2 billion active social media users by January 2021, which is 90% of the total number of internet users. In addition, social networks have become a venue for numerous real-life events that occur in our everyday life. The global COVID-19 pandemic has been spreading worldwide, and related topics have been trending ever since. Many scientists and companies have started working on challenges related to the processing and analysis of diverse types of data, including health data, medical images, Bluetooth and GPS data, and social data. From a data mining perspective, researchers have been trying to extract knowledge from the opinions, thoughts, and feelings that people express on social networks. Social data mining includes various associated fields such as Sentiment Analysis (SA) [2]. SA infers positive and negative mentions of people's thoughts, behaviors, and feelings based on their writings about trending topics [3]. The Twitter platform, with over 353 million monthly active users [1], is well suited for analysing users' sentiments during the COVID-19 period. Using data mining techniques, public opinion on COVID-19-related topics can be monitored and tracked in space and time.

From a different perspective, and according to the latest Internet world statistics, Arabic ranks fourth among the ten most used languages on the Internet [4], with more than 250 million Internet users [5] originating from Arab countries. Arabic is recognized as an official language by 22 Arabic-speaking countries [6]. Furthermore, millions of Arabic users use social media networks to communicate and contribute Arabic content daily. Therefore, our focus in this paper is to analyse Arabic social content available on Twitter, and to investigate people's opinions and sentiments about the COVID-19 pandemic. Recent research works have primarily focused on analyzing social data by extracting trending topics and inferring general sentiments from related topics, with a special focus on the English language. However, COVID-19-related sentiment analysis on Arabic social media has not been fully addressed. In addition, the few existing research works on Arabic social data do not consider the spatio-temporal aspect of sentiment analysis.

**Citation:** Elsaka, T.; Afyouni, I.; Hashem, I.; Al Aghbari, Z. Spatio-Temporal Sentiment Mining of COVID-19 Arabic Social Media. *ISPRS Int. J. Geo-Inf.* **2022**, *11*, 476. https://doi.org/10.3390/ijgi11090476

Academic Editors: Gloria Bordogna, Cristiano Fugazza and Wolfgang Kainz

Received: 1 June 2022 Accepted: 30 August 2022 Published: 2 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In this study, we focus on analyzing Arabic social content from Twitter related to the COVID-19 pandemic, to discover people's sentiments and correlations between COVID-19 related topics and subtopics, at different levels of spatio-temporal granularities. We aim to highlight correlations between insights extracted from social data and official health data records while investigating the impact of the global pandemic on multiple aspects with different spatial and temporal scales.
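To make the idea of correlating social signals with official health records concrete, the following is a minimal sketch that computes a correlation coefficient between a daily topic-tweet count and daily new cases. Pearson's coefficient is assumed here only as one common choice; the function name and the sample numbers are hypothetical and are not taken from the study's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily counts for one country (illustrative values only).
topic_tweets = [120, 150, 180, 210, 260]   # tweets mentioning a topic
new_cases = [1000, 1300, 1500, 1900, 2400]  # officially reported new cases

r = pearson(topic_tweets, new_cases)
print(f"Pearson r = {r:.3f}")
```

A value of `r` close to 1 would indicate that the topic's tweet volume rises and falls together with reported cases; it says nothing about causation.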

This study extends our previous work [7] by presenting a comprehensive social data mining approach for the Arabic language, which employs Arabic-specific word embedding techniques with a focus on the correlation between spatio-temporal social data and official health data. Our approach presents several unique contributions compared to existing works as follows:


This paper is organized as follows: Section 2 outlines a review of some related work. The description and implementation of the proposed methodology are presented in Section 3. The results and findings of the proposed methodology are discussed in Section 4. Section 5 presents concluding remarks and future research directions.

#### **2. Related Work**

Current literature on social data mining has benefited from considerable advances in the NLP and ML research fields [8]. Sentiment Analysis (SA) [2] extracts subjectivity and polarity from text [9] in which users express their opinions in various forms and with diverse linguistic styles, and it offers benefits such as supporting people in making their choices. From the early days of 2020, researchers began studying social media content related to COVID-19, focusing mostly on tweets in English or other Latin-script languages, while few researchers investigated Arabic content. Some research works applied topic analysis to identify the hot topics discussed on social media using word embeddings, word frequency, location frequency, language frequency, and character and word n-gram features weighted by TF-IDF. Meanwhile, other researchers applied feature extraction in feature-based sentiment analysis to determine sentiment polarity and forecast sentiment in social data [10]. Most of them used ML classifiers to verify results by semantic analysis. The following subsections organize our review of research studies on social streams, particularly in Arabic.
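As an illustration of the n-gram features weighted by TF-IDF mentioned above, the following minimal sketch builds word n-grams and character n-grams and weights them with a plain `tf * log(N / df)` scheme. The helper names, the toy English tweets, and the exact weighting variant are assumptions made for this example; the reviewed works may use different tokenization and smoothing.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs_features):
    """Weight each document's features by tf * log(N / df) (unsmoothed)."""
    n_docs = len(docs_features)
    doc_freq = Counter()
    for feats in docs_features:
        doc_freq.update(set(feats))  # count each feature once per document
    weighted = []
    for feats in docs_features:
        counts = Counter(feats)
        total = len(feats) or 1
        weighted.append({
            f: (c / total) * math.log(n_docs / doc_freq[f])
            for f, c in counts.items()
        })
    return weighted


# Toy tweets (hypothetical); features are word unigrams plus bigrams.
tweets = ["lockdown extended again", "vaccine arrives today", "lockdown rules today"]
features = [word_ngrams(t, 1) + word_ngrams(t, 2) for t in tweets]
weights = tfidf(features)
```

In this toy corpus, "extended" receives a higher weight than "lockdown" in the first tweet because it appears in fewer documents; a feature that occurs in every document would receive a weight of zero under this unsmoothed variant.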

#### *2.1. Data Collection and Classification*

Recent research works have principally focused on analyzing social data by extracting trending topics and inferring general sentiments from related topics, with a special focus on the English language and less attention to Arabic content. Many research works collected social data in order to share it with the research community and also used their datasets in statistical analyses, such as Alanazi et al. [11] and Haouari et al. [12]. Alharbi [13] identified a coronavirus dataset of Arabic tweets from three Saudi social streams and classified the dataset into conversations about precautionary steps taken by governments, conversations demonstrating social unity, and conversations endorsing government decisions. Additionally, some research works focused on analyzing tweet datasets for classification, such as Hamdy et al. [14], who studied different types of tweets collected from Twitter from different perspectives of analysis and machine learning classification. They combined different machine learning models to classify tweets as related or not related to the coronavirus.

#### *2.2. Geolocation Analysis*

Some researchers worked on the location-enabled features of social data, such as Qazi et al. [15], who introduced GeoCoV19, a large-scale Twitter dataset related to the COVID-19 pandemic. They derived geolocation information at different granularity levels from Nominatim (OpenStreetMap) data, using a gazetteer-based method to extract toponyms from the user location field and the tweet text. Likewise, Lamsal [16] introduced the COV19Tweets Dataset, a large-scale English-language tweet dataset with sentiment ratings. Filtering the geotagged tweets of the COV19Tweets Dataset yielded the GeoCOV19Tweets Dataset, which contains only 141k tweets (0.045 percent).
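The gazetteer-based idea described above can be sketched as a lookup of tokens against a table of known place names. This is a simplified illustration only: the tiny `GAZETTEER` dictionary, the function names, and the choice to check the user location field before the tweet text are assumptions made for this example, and the dictionary is a stand-in for a full resource such as Nominatim.

```python
import re

# Hypothetical mini-gazetteer: lowercase toponym -> (city, country).
GAZETTEER = {
    "dubai": ("Dubai", "United Arab Emirates"),
    "sharjah": ("Sharjah", "United Arab Emirates"),
    "cairo": ("Cairo", "Egypt"),
    "riyadh": ("Riyadh", "Saudi Arabia"),
    "دبي": ("Dubai", "United Arab Emirates"),
    "القاهرة": ("Cairo", "Egypt"),
}


def tokenize(text):
    """Split text into lowercase word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def infer_location(user_location, tweet_text):
    """Return the first gazetteer match, checking the profile field first."""
    sources = (("user_location", user_location), ("tweet_text", tweet_text))
    for source, text in sources:
        for token in tokenize(text or ""):
            if token in GAZETTEER:
                city, country = GAZETTEER[token]
                return {"city": city, "country": country, "source": source}
    return None  # no known toponym found in either field


print(infer_location("Dubai, UAE", ""))
print(infer_location("", "الوضع في القاهرة اليوم"))
```

A realistic pipeline would also need to handle multi-word place names, spelling variants, and ambiguous toponyms, none of which this single-token lookup addresses.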

#### *2.3. Topic Analysis and Semantic Analysis*

Alshalan et al. [17] used the ArCov-19 dataset [12], an ongoing dataset of Arabic tweets related to COVID-19, to find hate speech in the Arab world, as well as the most common topics addressed in hate speech tweets. They used a pre-trained convolutional neural network (CNN) model to evaluate tweets for hate speech. Similarly, Alsafari et al. [18] built an Arabic hate and offensive speech detection system in response to the increasing proliferation of hate speech on social media; however, the collected data are not related to COVID-19. They applied four robust extraction algorithms based on four forms of hate: religion, race, nationality, and gender. They then labeled the corpus using a three-hierarchical annotation methodology, ensuring ground truth at each level by verifying inter-annotator agreement, and evaluated the corpus by applying ML classifiers. Likewise, Hamoui et al. [19] examined the Arabic content on Twitter to see which topics were most popular among Arabic users. They used Non-negative Matrix Factorization (NMF) to find the most common unigrams, bigrams, and trigrams in a dataset of Arabic tweets. They presented and discussed the final discovered topics and divided them into several categories.

Likewise, Al-Laith et al. [20] analyzed the emotional reactions of people during the COVID-19 pandemic using a rule-based technique to classify tweets. They examined six forms of emotion to discover citizens' worries. Furthermore, they created a framework for tracking people's emotions and correlating emotions with tweets mentioning some of the COVID-19 symptoms. Similarly, Bahja et al. [21] reported initial results on identifying the relevancy of tweets and the feelings/emotions (Safety, Worry, and Irony) that Arab people expressed about COVID-19. They used ML and NLP techniques to discover what Arab people said about COVID-19 on Twitter. Meanwhile, Essam and Abdo [22] examined how Arabs are dealing with the COVID-19 pandemic on Twitter. They extracted specific keywords and n-grams to classify common themes in the compiled corpus. Their lexicon-based thematic analysis found that tweeters engaged in highly affective conversation full of negative emotions.

Some research works focused on the sentiment analysis of social data, such as Manguri et al. [23], who offered a graphical representation of the data after the sentiment analysis. Further, Chakraborty et al. [24] examined COVID-19 tweets and how health organizations have failed to guide people through this pandemic, using a model with Deep Learning (DL) classifiers. Furthermore, Kabir et al. [25] created a neural network model, trained on manually labeled data, to detect distinct emotions in COVID-19 tweets with fine-grained labeling. They constructed a bespoke Q&A RoBERTa model to extract the terms in tweets that are predominantly responsible for the accompanying emotions. Moreover, Hussain et al. [26] developed and used an AI-based technique to analyze public reaction on social media concerning COVID-19 vaccines in the United Kingdom and the United States, in order to understand public opinion and discover hot subjects. They employed NLP and DL algorithms to predict average feelings, sentiment trends, and conversation topics.

In addition, low-resource languages have witnessed recent efforts to investigate sentiment analysis, trying to bridge the gap by manually collecting and annotating social media data. ALBANA is a deep learning-based sentiment analyzer that performs sentiment analysis of around 10K Facebook comments in the Albanian language [27]. An attention mechanism along with a fastText word embedding model was used to discover the interdependence and meanings of words, while a BiLSTM was employed for sentiment classification. Furthermore, Imran et al. [28] examined how people from various cultural backgrounds responded to COVID-19 and how they felt about the ensuing steps that various countries took in response. They used deep long short-term memory (LSTM) models, trained to reach state-of-the-art accuracy, to estimate sentiment polarity and emotions from extracted tweets. They also demonstrated an original method for validating the supervised DL models using the extracted tweets.

#### *2.4. Misleading Information Detection*

Some researchers tried to handle the misleading information published on social media, such as Alsudias and Rayson [29], who collected and examined Arabic tweets about COVID-19 to identify topics using the k-means algorithm, to detect rumors, and to predict tweets' sources. They used ML algorithms to identify false, correct, and irrelevant information, with two sets of features: word frequency and word embedding. In a similar manner, Elhadad et al. [30] presented the bilingual (Arabic/English) COVID-19 Twitter dataset COVID-19-FAKES. They gathered COVID-19 pre-checked facts from several fact-checking websites to create a ground-truth database for annotating their collected dataset, and they used shared knowledge from official websites and Twitter accounts as a source of accurate information. They used ML algorithms and feature extraction techniques to annotate tweets in the COVID-19-FAKES dataset. Similarly, Hussein et al. [31] created an effective strategy based on the AraBERT language model for combating the COVID-19 infodemic on Twitter. They trained language models on plain texts rather than tweets, since pre-trained language models are widely available in many languages and available plain-text corpora are larger than tweet-only corpora, allowing for greater performance.

#### *2.5. Discussion*

Table 1 summarizes attempts to process COVID-19-related social data, with important information such as the number of tweets contained in each dataset, the language of the dataset, the time frame over which the data were collected, the techniques used in the research work, and the features used in that work.


