1. Introduction
Information from social networks has become an important resource for assessing public health, offering a powerful tool for real-time disease monitoring and the prediction of communicable diseases [
1]. The growing trend in the use of social networks to share personal situations from users’ daily lives generates large amounts of data that describe the behavior of the population, demographic data, geolocation data, images, text and interaction between users [
2]. If it were possible to identify users’ posts prior to an official confirmation of an Acute Respiratory Infection (issued by a hospital or laboratory), the collective information of an area or region would help to identify a possible outbreak. Traditionally, disease monitoring relies on Indicator-Based Surveillance Systems, which collect and report information to government health agencies. However, these processes often involve slow data flows and adherence to protocols that delay the publication of critical information [
3]. A second type of surveillance system, known as Event-Based Surveillance Systems, is gaining prominence with the availability of new information sources like social networks. The World Health Organization (WHO) defines these systems as the organized and rapid collection of information from sources such as social media, news outlets, and public health networks about events that may pose risks to public health [
4]. An example of the importance of these Event-Based Surveillance Systems occurred on 30 December 2019, when the network-based tool ProMED (Programme for Monitoring Emerging Infectious Diseases) reported posts on the Chinese social network Weibo regarding pneumonia cases in Wuhan, prior to the recognition of COVID-19 as a potential pandemic [
5]. WeChat is another social network used in China to predict COVID-19 trends, through the “Pandemic Forecast and Warning WeChat Mini Program”, which enabled early warnings of new outbreaks across 31 provinces in China [
2]. WeChat was also used to distribute surveys for gathering information on COVID-19-related symptoms and to generate epidemic time series based on the collected data [
6].
Another concept that has gained attention in recent research is Digital Epidemiology, an emerging field that leverages big data—including social networks—and digital technologies to analyze disease-related patterns. Its primary goal is the early detection and monitoring of viral outbreaks. Fallatah and Adekola [
7] suggest that integrating this approach into existing surveillance systems could enhance global health infrastructure and help save lives during viral epidemics.
Therefore, this research focuses on social media data; however, one of the significant challenges in utilizing data from these platforms is limited access. Charles et al. [
8] carried out a literature review on the use of social network information for public health monitoring and found that the main social network was Twitter (now X), used in 81% of the 33 studies reviewed. Gupta and Katarya [
9] performed a similar review, observing that Twitter data was used in 64% of the 26 selected investigations. The frequent use of this social network in research stems precisely from the availability of, and access to, current and past publications, although its access policies have changed due to recent changes in its administration. On the other hand, the use of social network data for surveillance presents some challenges. The main concern when working with such data to monitor public health is misinformation; Wang et al. [
10] conclude that misinformation is more popular than accurate information and frequently induces fear, anxiety, and mistrust in institutions. Consequently, using NLP to identify publications related to the diseases of interest for monitoring is a complex task, which can be addressed with different procedures and algorithms. Giancotti et al. [
11] highlight additional concerns when working with data from social networks, including the absence of an ethical framework, the lack of internationally recognized privacy protection policies, the risk of spreading conspiracy theories, and the potential for false-positive events due to geographic or cultural variations in language.
Selecting which algorithms to work with can be complicated and confusing, so we introduce a computational methodology for this task comprising three phases. In each phase, tests and comparisons are carried out among models to facilitate their selection and development. We use ML platforms such as Dataiku and the Python programming language. The main objective of this research is to investigate the relationship between tweets and the information reported by health authorities on ARI and COVID-19. Thus, we compare different ML prediction models, using confirmed cases of ARI and COVID-19 as the target (dependent variable) and tweets related to possible cases of these diseases as the feature (independent variable).
For each phase of the proposed computational methodology, we follow a data science workflow [
12], which consists of three steps: data, analysis and communication, see
Figure 1. The Data step involves data collection, cleaning, storage, and sharing; the Analysis step comprises the training and forecasting models; and the last step corresponds to the communication of results, that is, the ability to share knowledge with interested parties (visualization).
This paper is organized as follows. First, we start in
Section 2 with a review of the literature related to this research. Then, following the data science workflow:
Section 3 presents the data, the first step;
Section 4 describes the proposed computational methodology as part of the analysis step;
Section 5 presents the visualization of the results, which corresponds to the communication step. The paper ends with conclusions and directions for further research.
2. Related Work
The use of data from social networks to predict respiratory diseases is not a new research topic, and the complexity of the algorithms varies depending on their architecture and the degree of analysis of the tweets. Dai et al. [
13] describe a classification of four types of analyses according to the degree of assessment of the text obtained from social media posts and the complexity of the algorithms used: (1) Keyword-Based Approaches, (2) Learning-Based Approaches, (3) Lexicon-Based Approaches, and (4) Word Embedding-Based Approaches. The architecture of the algorithms used in this research includes elements of types (1), (2), and (4); different approaches were used in order to compare the results obtained. Previous investigations have described the selection of tweets through keyword-based approaches; for example, tweets containing words related to respiratory diseases, such as “sore throat”, “cough”, and “fever”, have been used [
14]. Subsequently, the quantity or frequency of tweets related to these words can be compared with national respiratory disease statistics. Talvis et al. [
14] calculated the linear correlation coefficient between selected tweets and Google Flu Trends queries. Hirose and Wang [
15] used multiple linear regression models to relate the dynamics of tweets containing influenza keywords to Influenza-Like Illness (ILI) data [
16]. One disadvantage of using keywords to select tweets is that the keywords may not refer to events directly associated with the person publishing the post; for example, tweets containing the word “influenza” could be part of a public health campaign rather than from people who have or have had a respiratory illness. Thus, one of the main challenges is the selection of social media posts. In this research, different methods were compared to determine how prediction accuracy changes depending on the filters applied to the tweets and the algorithms used for this. The second type of algorithm described by Dai et al. [
13], requires labeled data for training; an ML classifier is then used to predict whether a tweet is related to acute respiratory infections. Some of the most widely used ML techniques are Naïve Bayes, Random Forest, Decision Tree, k-Nearest Neighbors (KNN), XGBoost, Logistic Regression, and Support Vector Machines (SVM). Similar works were found in the investigations of Zuccon et al. [
17], Santos and Matos [
18], and Prieto et al. [
19], using different tools such as WEKA (
https://ml.cms.waikato.ac.nz/weka/index.html) and scikit-learn toolkit (
https://scikit-learn.org/) to compare different ML models, and the models with the highest classification accuracy were SVM and Naïve Bayes. Jiang et al. [
20] analyzed help-seeking messages on the Chinese social network Weibo during different stages of the COVID-19 pandemic. They used Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to approach the task as a binary topic classification problem.
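As an illustration of these learning-based approaches, the following minimal sketch compares two of the classifiers named above using the scikit-learn toolkit; the toy tweets and labels are hypothetical, and a real pipeline would be trained on a manually labeled dataset.

```python
# A minimal sketch (not the cited studies' code) of a learning-based approach:
# comparing Naïve Bayes and SVM classifiers on labeled tweets with scikit-learn.
# The toy tweets and labels below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "me duele la garganta y tengo tos",        # personal illness report
    "tengo fiebre desde ayer",                 # personal illness report
    "no puedo respirar bien, estoy enfermo",   # personal illness report
    "hoy inicia la campaña de vacunación",     # informative only
    "nuevas medidas sanitarias en la ciudad",  # informative only
    "el clima está muy frío esta semana",      # unrelated
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = related to respiratory illness, 0 = not

for name, clf in [("Naïve Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
    # TF-IDF features feed each classifier in a single pipeline.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    print(name, pipe.predict(["llevo dos días con tos y fiebre"]))
```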
Agrawal et al. [
21] used different ML models to classify people’s feelings and analyze the direction of polarity in tweets, with the support vector machine model obtaining the highest performance; Ueda et al. [
22] identified the changes in feelings before and after the declaration of a state of emergency in Japan due to the COVID-19 pandemic, relating tweets to different emotions such as anger, sadness, surprise, disgust, etc.; and Aldosery et al. [
23] evaluated the reaction sentiment (positive, neutral and negative) towards the COVID-19 pandemic in the UK, with a sentiment analysis model that combines a recurrent neural network and an embedded topic model. Guo et al. [
24] examined Twitter users’ beliefs about vaccination orders during the COVID-19 pandemic in the USA, with classification ML models; Kalanjati et al. [
25] did similar research to classify sentiments and opinions regarding COVID-19 and COVID-19 vaccination in Indonesian-language posts. Hamedani et al. [
26] analyzed Reddit posts during the COVID-19 pandemic to identify major discussion topics using the Latent Dirichlet Allocation (LDA) algorithm. They then classified the sentiment of each topic using the Syuzhet package for the R programming language.
As mentioned above, labeled data presents one of the main challenges for training ML models. In the reviewed investigations of this type, manual classification of tweets is the most common mechanism for building the training dataset, and it involves declaring which tweets are related to respiratory disease and which are not. In the dataset used by Kagashe et al. [
27], if a tweet is related to influenza infection, it is classified as “relevant”, otherwise it is classified as “irrelevant”. Zuccon et al. [
17] used a scale from 0 to 100 (0 being no flu, 100 being certainty of flu), on which evaluators rated the tweets related to ILI. The mechanism used by Santos and Matos [
18] to reduce errors consists of classification by three evaluators, with labels assigned by majority vote.
The unsupervised ML algorithms of the fourth type (Word Embedding-Based Approaches) in Dai et al. [
13] do not require labeled data, which makes human intervention in classification minimal. There are different processes to achieve this objective; one of them is the conversion of sentences (tweets) to numerical vectors; Word2Vec is one of the most used tools, and contains two types of algorithms: continuous bag-of-words and continuous skip-gram [
28]. Other researchers use this algorithm to discover relationships between words through context [
29,
30]; the comparison can be made using measures such as cosine similarity, and the results can serve different purposes, such as finding relationships between named entities in text corpora or simply between keywords. In this study, we use the Word2Vec output to train a Kmeans clustering model, with the aim of grouping tweets according to their context. Chen et al. [
31] used the Kmeans model to identify the relationship between viral publications (Twitter and Weibo) and 77 features related to different topics (violence, international affairs, vaccination, misinformation, etc.).
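As an illustration of this word embedding step, the following is a minimal sketch using the gensim implementation of Word2Vec; the tokenized tweet corpus is a hypothetical toy example, and the `sg` flag switches between the two algorithms mentioned above.

```python
# A minimal Word2Vec sketch with gensim; the tokenized tweet corpus is a
# hypothetical toy example.
from gensim.models import Word2Vec

tokenized_tweets = [
    ["tengo", "tos", "y", "fiebre"],
    ["me", "duele", "la", "garganta"],
    ["tos", "seca", "desde", "ayer"],
    ["fiebre", "alta", "toda", "la", "noche"],
]
# sg=0 trains continuous bag-of-words; sg=1 trains continuous skip-gram.
model = Word2Vec(tokenized_tweets, vector_size=100, window=5, min_count=1, sg=1)
# Cosine similarity between word vectors, used to compare contexts.
print(model.wv.similarity("tos", "fiebre"))
```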
Didi et al. [
32] integrate sequences of ML algorithms for different purposes: FastText and GloVe for feature extraction; SVM, XGBoost, and Logistic Regression for classification; and Prophet, LSTM, and SVR models for the subsequent predictions. In the literature analyzed, we found some articles that used deep learning models, namely Long Short-Term Memory (LSTM) [
24,
32,
33] and Convolutional Neural Networks (CNN) [
34]. In recent years, these algorithms have matured and become more accurate than traditional ML models (e.g., Naïve Bayes, Support Vector Machines, Random Forest). Part of our approach is to study deep learning algorithms for prediction: (i) DeepAR is a forecasting method based on autoregressive recurrent neural networks, which learns a global model from historical data [
35] and (ii) Transformer is an architecture based on an attention mechanism, embeddings and feed forward neural networks [
36]. In the same way, we integrate a prior analysis and selection of tweets using word embedding techniques such as Word2Vec [
13]. Other models that we use for classifying and grouping tweets are Bidirectional Encoder Representations from Transformers (BERT) [
37] and Kmeans [
38], respectively. Although these models have been used previously [
31,
39], the combination with deep learning prediction models in a pipeline is innovative in this research. Lande et al. [
40] used BERT models for tweet classification, obtaining better results than LDA models. To et al. [
41] used this same model to identify anti-vaccination tweets, comparing it with bidirectional long short-term memory networks with pre-trained GloVe embeddings (Bi-LSTM) and with more traditional ML models such as SVM; they obtained the best results with the BERT model. Yang et al. [
42] compared the BERT model with SVM in classifying tweets to identify users who self-reported chronic stress experiences, obtaining the best results with BERT. Chen et al. [
43] utilized a Chinese pre-trained BERT model to extract 20 key topics from posts on the Chinese social network Weibo, analyzing public perception of active aging after the COVID-19 pandemic. They suggest that these topics can serve as a guideline for long-term government planning, helping integrate public concerns into policy development.
4. Methods: Training and Forecasting Models (Analysis)
In
Figure 2, we introduce the computational methodology, which consists of three phases. The first phase mainly includes the use of the Dataiku platform for data cleaning, transformation, and modeling of ML algorithms. Dataiku is a collaborative data science and ML platform, selected among others for its ability to implement ML algorithms quickly and easily; it also helps with the unification of data formats, offers diverse functionalities for data transformation, and contains a wide variety of ML algorithms ready to implement in pipelines. The role of this phase is the rapid comparison of different ML prediction models, using confirmed cases of ARI and COVID-19 as the target and tweets related to possible cases of these diseases as the feature (independent variable). In this phase, a simple data cleaning and transformation process is carried out in Dataiku, omitting the tweet classification and selection algorithms. In the second phase, the comparison of deep learning predictive models is carried out, and the tweet selection stage is added with algorithms developed in the Python language. Most of this phase is developed in Dataiku, prioritizing agility and speed in testing, but Python code components are integrated into the pipeline to perform tasks that are not possible in Dataiku, such as the use of Word2Vec (a word embedding technique); the combination of Python code and Dataiku functionalities is the fundamental part of this phase. The third phase was carried out entirely in Python code, using the ML algorithms with the best performance from the previous stages. The main advantage of converting Dataiku pipelines to code is the use of recursion in the algorithms, allowing a large number of tests and verifications to be carried out.
4.1. Phase One
As mentioned in the “Twitter Data” section, see
Section 3.1.1, the tweets downloaded from the API were filtered with keywords such as asthma, bronchitis, respiratory, cough, and COVID19; posted in the state of San Luis Potosí; and restricted to the Spanish language. The objective of this phase was to identify the type of ML algorithms to use. We used prediction models for continuous variables, selecting the number of confirmed ARI and COVID-19 cases as the target (dependent) variable and the weekly count of selected tweets as the feature (independent variable). Seven different models were used to predict the variables COVID-19 cases and ARI cases; the independent variable for both diseases was the count of tweets related to them, see
Figure 3. Data cleaning, data transformation, training, and testing of the models were carried out on the Dataiku platform. One of the key advantages of this tool is its ability to test a wide range of algorithms and machine learning models simultaneously, allowing for quick comparison of results. The ML models evaluated in Dataiku were: Random Forest, Ordinary Least Squares, Lasso Regression, Decision Trees, SVM, KNN, and Artificial Neural Networks (hereafter referred to as Feed Forward). To compare the models with each other, we used the Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE) metrics, see Equations (
1) and (
2), respectively.
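For reference, the following is a sketch of the standard definitions of these metrics, where \(y_t\) is the observed weekly case count, \(\hat{y}_t\) the prediction, and \(n\) the number of weeks evaluated (the paper’s Equations (1) and (2) are assumed to follow these usual formulations):

\[
\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}
\]

Outside Dataiku, an equivalent Phase One comparison can be sketched with scikit-learn as follows; the weekly tweet and case counts are simulated stand-ins for the real series, and the neural network model is omitted for brevity.

```python
# A minimal sketch of the Phase One comparison using scikit-learn instead of
# Dataiku; the weekly tweet and case counts are simulated stand-ins, and the
# neural network (Feed Forward) model is omitted for brevity.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
weekly_tweets = rng.poisson(lam=40, size=150).reshape(-1, 1)        # feature
weekly_cases = 3 * weekly_tweets.ravel() + rng.normal(0, 20, 150)   # target

X_train, X_test, y_train, y_test = train_test_split(
    weekly_tweets, weekly_cases, test_size=0.2, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(random_state=0),
    "Ordinary Least Squares": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "SVM": SVR(),
    "KNN": KNeighborsRegressor(),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    mape = mean_absolute_percentage_error(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: MAPE={mape:.3f}, RMSE={rmse:.1f}")
```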
4.2. Phase Two
The second phase of our computational methodology incorporates more complex algorithms and processes that are not feasible to execute in Dataiku, such as word embedding algorithms like Word2Vec. The primary focus in this phase remains on maintaining agility and speed in conducting the necessary tests.
Three pipelines were built to generate three datasets used in the prediction models, see
Figure 4.
In the second part of the first pipeline (DEP-IND-C), we used the Kmeans algorithm to cluster the tweets. Dataiku provides various models for solving clustering problems, including KMeans, Gaussian Mixture, Mini-Batch KMeans, Agglomerative Clustering, Spectral Clustering, DBSCAN, Interactive Clustering, and Isolation Forest. KMeans was chosen for its simplicity, lower computational cost, and superior performance on the silhouette metric. This process allows tweets related to COVID-19 or ARI to be separated out, because there are publications from institutions or individuals that are not related to respiratory infectious diseases; some tweets were simply informative about the status of the pandemic or the diseases. The clustering is evaluated with silhouette analysis, in which the coefficient measures the similarity of a data point to its own cluster (cohesion) compared to other clusters (separation). This coefficient ranges from −1 to 1; values close to 1 indicate that the points in a cluster are tightly packed around their centroid and well separated from other clusters. A minimal code sketch of this clustering step follows the pipeline list below.
2. DEP-IND. This pipeline used the tweet dataset without any filtering beyond that applied when downloading through the Twitter API; a simple cleanup and a weekly tweet count are performed, see
Section 3.2.
3. DEP. This pipeline was created to verify the contribution of the “tweets” feature (the independent variable of the other two pipelines); only the historical weekly counts of confirmed COVID-19 and ARI cases are used for prediction.
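A minimal sketch of the clustering step of the first pipeline is shown below, assuming the tweets have already been converted to numerical vectors (here random stand-ins for the real Word2Vec embeddings) and using scikit-learn’s silhouette score to assess cohesion and separation.

```python
# A minimal sketch of clustering tweet vectors with KMeans and evaluating the
# grouping with the silhouette coefficient; the vectors are random stand-ins
# for real Word2Vec tweet embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
tweet_vectors = rng.normal(size=(500, 100))  # 500 tweets, 100-dim embeddings

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tweet_vectors)
    # Silhouette ranges from -1 to 1; values near 1 mean tight, well-separated clusters.
    print(k, silhouette_score(tweet_vectors, labels))
```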
The last step of each of the three pipelines was the use of ML models for time series prediction. Dataiku provides both statistical models (Trivial Identity, Seasonal Naive, AutoARIMA, Seasonal Trend, and Non-Parametric Time Series) and deep learning models (Feed Forward, DeepAR, and Transformer) for time series forecasting. Statistical models were excluded due to their lower accuracy compared to the deep learning models. The deep learning models (DeepAR [
35], Feed Forward [
49] and Transformer [
36]) were used to predict the target variable (COVID-19 or ARI cases); it is important to note that these models were not found in previous research related to Twitter. The Dataiku tool was used to train and test the prediction models, and the MAPE metric was used to compare them, see Equation (
1) below. The time series used correspond to a period between January 2020 and December 2022.
4.3. Phase Three
This third phase of the computational methodology is developed in the Python language. The main objective of this stage is to improve and adjust the processes in the pipelines that cannot be integrated in Dataiku. In this phase, four pipelines were developed, named Category 1 to Category 4; each corresponds to a unique combination of procedures and generates one of four datasets. These datasets are used by the prediction models, see
Figure 5.
Category 1. This pipeline performs the cleaning and processing of the target variable: the historical weekly count of confirmed COVID-19 and ARI cases. It is similar to the DEP pipeline from Phase two, but transformed into Python code. The predictions based on the output of this pipeline are the baseline for comparing the performance of the other pipelines, which use the tweet count as a feature.
Category 2. This pipeline used the tweet dataset without any filtering beyond that applied when downloading through the Twitter API; a simple cleanup and a weekly tweet count are performed (similar to the DEP-IND pipeline in Phase two).
Category 3. This pipeline is the same as the DEP-IND-C pipeline in the previous phase.
Category 4. This pipeline used a tweet classifier based on the BERT algorithm [
37], labeling tweets from users related to COVID-19 or ARI. The advantage of BERT over other word embedding methods like Word2Vec is its ability to generate vector representations of words based on the context of the entire sentence [
41]. The BERT version used is “bert-base-multilingual-cased”, since it supports the Spanish language; for comparison, Kalanjati et al. [
25] used the IndoBERT model for the Indonesian language. Moreover, stopword removal and tokenization were part of the cleaning process of this pipeline.
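A minimal sketch of loading this checkpoint for binary tweet classification, assuming the Hugging Face transformers library as tooling, is shown below; the classification head is untrained here, so fine-tuning on labeled tweets would be required before real use, and the example tweet is hypothetical.

```python
# A minimal sketch of binary tweet classification with the multilingual BERT
# checkpoint named above (Hugging Face transformers assumed as tooling).
# The classification head below is randomly initialized, so fine-tuning on
# labeled tweets would be required before real use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # related / not related
)

tweet = "Llevo tres días con tos y fiebre"  # hypothetical Spanish tweet
inputs = tokenizer(tweet, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # class probabilities
```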
The final step of each pipeline involved training and testing the DeepAR model, a forecasting method based on autoregressive recurrent neural networks. This model is based on LSTM cells, which handle sequential data efficiently [
35]. It was implemented using the Python GluonTS library [
50] to forecast ARI and COVID-19 time series.
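A minimal sketch of this final step is shown below, assuming the MXNet-based GluonTS API (around version 0.9); the CSV file and its column names are hypothetical.

```python
# A minimal DeepAR sketch with GluonTS; assumes the MXNet-based API of
# GluonTS ~0.9. The CSV file and its column names are hypothetical.
import pandas as pd
from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

df = pd.read_csv("ari_weekly.csv", parse_dates=["week_start"])

horizon = 8  # forecast eight weeks ahead
train_ds = ListDataset(
    [{"start": df["week_start"].iloc[0],
      "target": df["ari_cases"].values[:-horizon]}],
    freq="W",
)

estimator = DeepAREstimator(
    freq="W",
    prediction_length=horizon,
    trainer=Trainer(epochs=50),
)
predictor = estimator.train(train_ds)

# Probabilistic forecast for the held-out horizon.
for forecast in predictor.predict(train_ds):
    print(forecast.mean)  # point forecast averaged over sample paths
```

In the pipelines where the tweet count enters as a feature, the weekly tweet series could additionally be supplied as "feat_dynamic_real" in each dataset entry, with use_feat_dynamic_real=True in the estimator.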
6. Conclusions
We introduce a computational methodology to predict acute respiratory infections with social media (Twitter) data and ML algorithms, which is composed of three phases. The different phases allow us to compare various ML models applied to the processing of tweets and their subsequent use as predictors of respiratory diseases. Although the main idea was to establish the bases of tweet analysis so it can be extended to other scenarios and applications, the results obtained help us understand the possible difficulties and limitations of using tweets as a predictive variable. The results of the first phase showed that the tweet count alone, as an independent variable, is not enough to predict respiratory diseases (COVID-19 and ARI); the best result was a MAPE of 0.475 with Support Vector Machines on ARI cases. On the other hand, time series prediction models that involve the history of the dependent variable obtain better results, which corresponds to Phase two and Phase three. In Phase two, we found the following: (a) for COVID-19, in two of the three models the independent variable does not improve the model and even worsens it. (b) The only algorithm that shows an improvement is Feed Forward, see
Table 5. (c) The contribution of the tweet count as an independent variable is clearly better in the predictions for the ARI time series than for COVID-19. (d) In ARI time series predictions, the DeepAR algorithm is by far the best of the three models. (e) The process of clustering and filtering the tweets does not yield any prediction improvement in this phase.
Phase three focuses on the analysis of the DeepAR model with different preprocessing approaches for Twitter data (Categories 2 to 4), see
Figure 5. We found that, over the three years analyzed, the “tweets” variable is only representative in the predictions of confirmed ARI cases. Accordingly, the model that obtained the best result used the “Category 3” pipeline (Word Embedding and Kmeans procedure) for ARI case prediction, see
Table 6 and
Table 7. “Category 3” highlights that NLP and the selection of tweets improve the prediction results, although the improvement was only 1.01% compared to the prediction without the tweets variable. In the second half of 2020, the inclusion of the tweets variable improved prediction accuracy by approximately 3% for both ARI and COVID-19 confirmed cases. This suggests that information shared in tweets was representative at the onset of the pandemic but became less relevant over time.
Given the nature of the datasets, we cannot compare this research directly with other papers, because other studies use data from different regions, dates, and diseases. Nevertheless, several contributions arise. The broad study of Twitter, ARI, and COVID-19 data allows us to identify the best preprocessing techniques for Twitter data and the best ML predictive model, DeepAR. Twitter data presents challenges due to the need for text mining, making this research valuable for others working in similar areas.
Moreover, the proposed computational methodology can be used with other data sources to identify the best ML predictive models for time series. It can be applied to other fields that utilize time series data, such as finance and energy.
6.1. Limitations
There are several limitations when using social network data in research. The primary challenge we faced was restricted access to tweets, as the Twitter API account used had a download limit of 5000 tweets per month. Additionally, the downloaded tweets may not fully represent the analyzed population (San Luis Potosí, Mexico) in terms of both geographic distribution and demographic diversity.
The API query used to download tweets included seven keywords related to ARI and COVID-19. However, this presents a limitation in capturing relevant tweets, as there may be other terms or regional idioms used to refer to these diseases. Additionally, the vocabulary used by social media users to describe the same illness may change over time.
Tweet collection was restricted to Spanish-language posts, and specific libraries for text cleaning and processing were configured accordingly. The analysis of tweets in other languages is beyond the scope of this research, although it is feasible to adjust the parameters to support additional languages.
In this study, NLP techniques were incorporated alongside BERT (for classification) and K-means (for clustering) to identify tweets from individuals who have experienced or are experiencing respiratory diseases (ARI or COVID-19). However, some False Positives may still be present among the selected tweets. To improve accuracy in filtering out False Positives, a more in-depth analysis of tweet context using specialized NLP techniques is needed.
The selection of machine learning models in Phase 1 and 2 was limited to those available in Dataiku. While this platform offers a wide range of clustering and time series forecasting models, the inability to conduct comparative tests with models outside its framework remains a limitation.
Another constraint is the potential loss of contextual information in tweets due to the removal of emoticons, non-alphanumeric symbols, and pictorial characters, as these elements may carry additional or essential meaning for accurate tweet classification.
6.2. Future Research
The Twitter Data corresponds only to the state of San Luis Potosí, México, so a more extensive analysis including different cities in Mexico and other countries would provide more accurate conclusions about the usefulness of tweets as a variable to predict respiratory diseases. Also, time series longer than the 2020–2022 period could yield different results than those from the recent outbreak.
Analyzing tweets in other languages presents an opportunity for deeper comparative studies across different regions or within multilingual areas. A key challenge for multilingual support is the availability of text processing libraries, such as NLTK, for the target language.
Additionally, the vocabulary used to describe diseases is extensive and constantly evolving. A valuable research avenue involves analyzing word usage, expression patterns, and regional idioms. However, conducting such an analysis would require access to a larger volume of tweets.
On the other hand, those interested in conducting research with Twitter (now X) data should consider the new limitations on data downloading introduced by the company’s new policies. Other sources of social media data should therefore be explored, since social media data improves prediction. Additionally, classifying tweets is a difficult task that involves advanced NLP techniques; algorithms such as word embeddings and BERT classification models require large datasets to perform well, so finding algorithms that work well on small datasets remains an open research area.
Finally, we focused only on mining the text of tweets, but today they are frequently accompanied by images and emoticons. A joint analysis of these different forms of communication (text and images) is thus an opportunity to improve the interpretation of tweets and to classify them with better accuracy.