*1.1. Description of the Proposed Work*

In this paper, we propose a software tool comprising a collection of machine learning and other methods for analyzing Arabic-language Twitter data, with the aim of detecting government pandemic measures and public concerns during the COVID-19 pandemic. The methods used in the tool include an unsupervised Latent Dirichlet Allocation (LDA) topic modeling algorithm, natural language processing (NLP), correlation analysis, and other spatio-temporal information extraction and visualization methods. The tool was built using a range of technologies including MongoDB, Parquet, Apache Spark, Spark SQL, and Spark ML.

The tool comprises five software components (see Section 3). The Data Collection and Storage Component (DCSC) uses various search queries and geo-coordinates to collect data through the Twitter REST (Representational State Transfer) API (Application Programming Interface) and stores the data using MongoDB and Apache Spark DataFrames (DFs), which are distributed data collections organized into named columns. The Data Pre-Processing Component (DPC) removes noise from the tweet text and produces cleaned, normalized, and stemmed tokens. The Measures and Concerns Detector Component (MCDC) uses an unsupervised LDA model to cluster the tweets and detect government measures and public concerns; it also computes the correlations in the data. The Spatio-Temporal Information Component (STIC) performs spatial and temporal analysis by extracting the date, time, location, and other information from the tweets. The Validation and Visualization Component (VVC) visualizes the results spatially and temporally using maps and other tools, and validates the detected measures and concerns against internal or external sources such as news media.

The Twitter dataset used in this study comprises 14 million tweets, collected through the Twitter API from 1 February 2020 to 1 June 2020 for the Kingdom of Saudi Arabia.
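To make the role of the DPC more concrete, the following is a minimal sketch of the kind of noise removal and normalization such a component might perform on an Arabic tweet. It is an illustration under stated assumptions and not the Iktishaf implementation: the function name `preprocess`, the regular expressions, and the choice of normalization rules (dropping URLs, mentions, diacritics, and tatweel, and unifying alef variants) are all hypothetical, and stemming is omitted because the stemmer used by the tool is not specified here.

```python
import re
from typing import List

# All patterns below are illustrative assumptions, not the tool's actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")       # web links
MENTION_RE = re.compile(r"@\w+")                    # user mentions
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")  # Arabic short-vowel marks
TATWEEL_RE = re.compile(r"\u0640")                  # elongation character
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")   # anything but Arabic letters and whitespace

# Map alef with hamza above/below and alef with madda to bare alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627",
                               "\u0625": "\u0627",
                               "\u0622": "\u0627"})


def preprocess(text: str) -> List[str]:
    """Return cleaned, normalized (but not stemmed) tokens for one tweet."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = TATWEEL_RE.sub("", text)
    text = text.translate(ALEF_VARIANTS)
    # Digits, punctuation, hashtag symbols, and Latin text become spaces.
    text = NON_ARABIC_RE.sub(" ", text)
    return text.split()


if __name__ == "__main__":
    sample = "@user \u0623\u064E\u0647\u0652\u0644\u0627\u064B http://t.co/x #\u0643\u0648\u0631\u0648\u0646\u0627 19"
    print(preprocess(sample))
```

In a pipeline like the one described above, a function of this kind would typically be applied to each tweet before the tokens are passed to the topic model; in a Spark setting it could, for example, be wrapped as a user-defined function, although that wiring is likewise an assumption.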

The software developed for this work is part of Iktishaf [6–9], a tool that we have been developing over the last few years. Earlier work on Iktishaf focused mainly on mobility-related event detection using supervised learning. We have also developed other tools for big data social media analytics in healthcare [10], logistics [11,12], and public opinion mining for government services [13]. These works used Twitter data in Arabic or English.
