Arabic Twitter Conversation Dataset about the COVID-19 Vaccine
Abstract
1. Summary
- We built a dataset of 1.1 M Arabic Twitter posts streamed over one year, from January to December 2021. Data collection began when most countries around the world were starting their COVID-19 vaccination campaigns, so the dataset covers the initial, dynamic conversation on vaccine distribution.
- We performed a preliminary analysis of the raw data, which revealed topical insights and resulted in seven database tables. Further analysis can be carried out across multiple tables.
- We release the dataset freely to the research community through the Mendeley Data repository: https://data.mendeley.com/datasets/zmwfnsms9n (accessed on 31 October 2022). Researchers in different fields can use it to analyze people’s activity following the first announcements of vaccine distribution or to perform comparative analyses. Scientific communities, public health agencies, and analysts may also use it to obtain insights, make decisions, or design strategies for similar situations in the future.
2. Literature Review
3. Data Description
4. Results and Analysis
4.1. Hashtag
4.2. Media
4.3. Users
4.4. Retweet and Reply Analysis
4.5. Textual Analysis
5. Methods
5.1. Data Collection
5.2. Data Preprocessing
- Remove URLs from the text.
- Remove mentions (@user).
- Remove hashtags.
- Replace repeated letters with a single letter.
- Remove Arabic stop words such as pronouns, articles, and prepositions.
- Remove punctuation such as commas, brackets, and full stops.
- Replace emojis with special tokens.
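The preprocessing steps listed above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline actually used for the dataset: the stop-word subset, the emoji code-point ranges, the `<emoji>` placeholder token, and the function name `preprocess` are all assumptions made for the example. The steps are applied in the order listed, and the repeated-letter rule is read literally (any run of the same letter becomes one letter).

```python
import re
import string

# Illustrative stop-word subset (assumption); a full Arabic list would be used in practice.
ARABIC_STOP_WORDS = {"في", "من", "على", "إلى", "عن", "هو", "هي"}

# ASCII punctuation plus common Arabic punctuation marks, all mapped to deletion.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation + "،؛؟«»")

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# A letter followed by one or more copies of itself.
REPEATED_LETTER_RE = re.compile(r"([^\W\d_])\1+")
# Rough emoji ranges (assumption); they do not cover every emoji code point.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
EMOJI_TOKEN = "<emoji>"  # hypothetical placeholder token


def preprocess(text: str) -> str:
    """Apply the seven cleaning steps in the order listed in Section 5.2."""
    text = URL_RE.sub(" ", text)                # 1. remove URLs
    text = MENTION_RE.sub(" ", text)            # 2. remove mentions
    text = HASHTAG_RE.sub(" ", text)            # 3. remove hashtags
    text = REPEATED_LETTER_RE.sub(r"\1", text)  # 4. collapse repeated letters
    words = [w for w in text.split() if w not in ARABIC_STOP_WORDS]
    text = " ".join(words)                      # 5. remove stop words
    text = text.translate(PUNCTUATION_TABLE)    # 6. remove punctuation
    text = EMOJI_RE.sub(f" {EMOJI_TOKEN} ", text)  # 7. replace emojis
    return " ".join(text.split())               # tidy whitespace
```

Hashtags are removed before punctuation so that `#` and `_` do not leave hashtag fragments behind, and emojis are replaced last so that the placeholder token is not stripped by the punctuation step. Collapsing every doubled letter also shortens words that legitimately contain a double letter, so a real pipeline might only collapse longer runs.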
5.3. Implementation
6. Potential Research Applications
7. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sorensen, L. User managed trust in social networking - Comparing Facebook, MySpace and Linkedin. In Proceedings of the 2009 1st International Conference on Wireless Communication, Vehicular Technology, Information Theory and Aerospace & Electronics Systems Technology, Aalborg, Denmark, 17–20 May 2009; pp. 427–431.
- Kavada, A. Social Media as Conversation: A Manifesto. Soc. Media Soc. 2015, 1, 2056305115580793.
- Aslam, S. Twitter by the Numbers: Stats, Demographics & Fun Facts. Available online: https://www.omnicoreagency.com/twitter-statistics/ (accessed on 14 October 2022).
- Aldekhyyel, R.N.; Binkheder, S.; Aldekhyyel, S.N.; Alhumaid, N.; Hassounah, M.; AlMogbel, A.; Jamal, A.A. The Saudi Ministries Twitter communication strategies during the COVID-19 pandemic: A qualitative content analysis study. Public Health Pract. 2022, 3, 100257.
- Michael, H. The use of Twitter by state leaders and its impact on the public during the COVID-19 pandemic. Heliyon 2020, 6, e05540.
- Roy, M.; Moreau, N.; Rousseau, C.; Mercier, A.; Wilson, A.; Atlani-Duault, L. Ebola and Localized Blame on Social Media: Analysis of Twitter and Facebook Conversations During the 2014–2015 Ebola Epidemic. Cult. Med. Psychiatry 2020, 44, 56–79.
- Kagashe, I.; Yan, Z.; Suheryani, I. Enhancing Seasonal Influenza Surveillance: Topic Analysis of Widely Used Medicinal Drugs Using Twitter Data. J. Med. Internet Res. 2017, 19, e315.
- Muñoz-Sastre, D.; Rodrigo-Martín, L.; Rodrigo-Martín, I. The Role of Twitter in the WHO’s Fight against the Infodemic. Int. J. Environ. Res. Public Health 2021, 18, 11990.
- Abbas, A.; Eliyana, A.; Ekowati, D.; Saud, M.; Raza, A.; Wardani, R. Data set on coping strategies in the digital age: The role of psychological well-being and social capital among university students in Java Timor, Surabaya, Indonesia. Data Brief 2020, 30, 105583.
- Polack, F.P.; Thomas, S.J.; Kitchin, N.; Absalon, J.; Gurtman, A.; Lockhart, S.; Perez, J.L.; Pérez Marc, G.; Moreira, E.D.; Zerbini, C.; et al. Safety and Efficacy of the BNT162b2 mRNA COVID-19 Vaccine. New Engl. J. Med. 2020, 383, 2603–2615.
- Covid-19: Pfizer/BioNTech Vaccine Judged Safe for Use in UK. Available online: https://www.bbc.com/news/health-55145696 (accessed on 9 September 2022).
- Kim, J.H.; Marks, F.; Clemens, J.D. Looking beyond COVID-19 vaccine phase 3 trials. Nat. Med. 2021, 27, 205–211.
- Tekumalla, R.; Banda, J.M. A Large-Scale Twitter Dataset for Drug Safety Applications Mined from Publicly Existing Resources. arXiv 2020, arXiv:2003.13900v1.
- Stemmer, M.; Parmet, Y.; Ravid, G. What Are IBD Patients Talking about on Twitter? In ICT for Health, Accessibility and Wellbeing; Springer International Publishing: Cham, Switzerland, 2021; pp. 206–220.
- Saniei, R.; Rodríguez Doncel, V. PHDD: Corpus of Physical Health Data Disclosure on Twitter during COVID-19 Pandemic. SN Comput. Sci. 2022, 3, 212.
- Singh, L.; Bansal, S.; Bode, L.; Budak, C.; Chi, G.; Kawintiranon, K.; Wang, Y. A first look at COVID-19 information and misinformation sharing on Twitter. arXiv 2020, arXiv:2003.13907v1.
- Chen, E.; Lerman, K.; Ferrara, E. Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health Surveill. 2020, 6, e19273.
- Aguilar-Gallegos, N.; Romero-García, L.E.; Martínez-González, E.G.; García-Sánchez, E.I.; Aguilar-Ávila, J. Dataset on dynamics of Coronavirus on Twitter. Data Brief 2020, 30, 105684.
- Haouari, F.; Hasanain, M.; Suwaileh, R.; Elsayed, T. ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks. arXiv 2021, arXiv:2004.05861v4.
- Elhadad, M.K.; Li, K.F.; Gebali, F. COVID-19-FAKES: A Twitter (Arabic/English) dataset for detecting misleading information on COVID-19. In Advances in Intelligent Networking and Collaborative Systems; Barolli, L., Li, K.F., Miwa, H., Eds.; Springer: Cham, Switzerland, 2021; Volume 1263, pp. 256–268.
- Hayawi, K.; Shahriar, S.; Serhani, M.A.; Taleb, I.; Mathew, S.S. ANTi-Vax: A novel Twitter dataset for COVID-19 vaccine misinformation detection. Public Health 2022, 203, 23–30.
- Memon, S.A.; Carley, K.M. Characterizing COVID-19 misinformation communities using a novel twitter dataset. arXiv 2020, arXiv:2008.00791.
- Mubarak, H.; Hassan, S.; Chowdhury, S.; Alam, F. ArCovidVac: Analyzing Arabic Tweets about COVID-19 Vaccination. arXiv 2022, arXiv:2201.06496.
- Hu, T.; Wang, S.; Luo, W.; Zhang, M.; Huang, X.; Yan, Y.; Liu, R.; Ly, K.; Kacker, V.; She, B.; et al. Revealing public opinion towards COVID-19 vaccines with Twitter Data in the United States: A spatiotemporal perspective. J. Med. Internet Res. 2021, 23, e30854.
- Malagoli, L.G.; Stancioli, J.; Ferreira, C.H.G.; Vasconcelos, M.; da Silva, A.P.C.; Almeida, J.M. A look into COVID-19 vaccination debate on Twitter. In Proceedings of the 13th ACM Web Science Conference 2021, New York, NY, USA, 21–25 June 2021; pp. 225–233.
- Alshaabi, T.; Dewhurst, D.R.; Minot, J.R.; Arnold, M.V.; Adams, J.L.; Danforth, C.M.; Dodds, P.S. The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on twitter for 2009–2020. arXiv 2021, arXiv:2003.03667.
- Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M.B.; Pabbi, D.; Verma, K.; Lin, R. Mega-Cov: A billion-scale dataset of 100+ languages for covid-19. arXiv 2021, arXiv:2005.06012.
- Haouari, F.; Hasanain, M.; Suwaileh, R.; Elsayed, T. ArCOV19-Rumors: Arabic COVID-19 Twitter Dataset for Misinformation Detection. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 9 April 2021; pp. 72–81.
- Alam, F.; Shaar, S.; Dalvi, F.; Sajjad, H.; Nikolov, A.; Mubarak, H.; Martino, G.D.S.; Abdelali, A.; Durrani, N.; Darwish, K.; et al. Fighting the COVID-19 Infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 611–649.
- Yang, Q.; Alamro, H.; Albaradei, S.; Salhi, A.; Lv, X.; Ma, C.; Zhang, X. Senwave: Monitoring the global sentiments under the covid-19 pandemic. arXiv 2020, arXiv:2006.10842.
- Alsudias, L.; Rayson, P. COVID-19 and Arabic Twitter: How can Arab world governments and public health organizations learn from social media? In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, 9–10 July 2020.
- Zhou, X.; Mulay, A.; Ferrara, E.; Zafarani, R. ReCOVery: A multimodal repository for COVID-19 news credibility research. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, New York, NY, USA, 19–23 October 2020.
- Murić, G.; Wu, Y.; Ferrara, E. COVID-19 Vaccine Hesitancy on Social Media: Building a Public Twitter Dataset of Anti-vaccine Content, Vaccine Misinformation and Conspiracies. JMIR Public Health Surveill. 2021, 7, e30642.
- DeVerna, M.R.; Pierri, F.; Truong, B.T.; Bollenbacher, J.; Axelrod, D.; Loynes, N.; Bryden, J. CoVaxxy: A Collection of English-Language Twitter Posts About COVID-19 Vaccines. In Proceedings of the International AAAI Conference on Web and Social Media, Atlanta, GA, USA, 6–9 June 2021; Volume 15, pp. 992–999.
- Twitter. Developer Agreement and Policy. Available online: https://developer.twitter.com/en/developer-terms/agreement-and-policy (accessed on 9 September 2022).
- Pypi, Snscrape. Available online: https://pypi.org/project/snscrape/ (accessed on 9 September 2022).
- NLTK. Available online: https://github.com/linuxscout/pyarabic (accessed on 14 October 2022).
- Christodoulou, G.; Georgiou, C.; Pallis, G. The Role of Twitter in YouTube Videos Diffusion. In Web Information Systems Engineering–WISE 2012. WISE 2012. Lecture Notes in Computer Science; Wang, X.S., Cruz, I., Delis, A., Huang, G., Eds.; Springer: Heidelberg/Berlin, Germany, 2012; Volume 7651.
Study | Available Online | Period | Dataset | Language | Application |
---|---|---|---|---|---|
[16] | No | January 2020–March 2020 | COVID-19 tweet conversation | Multilingual | Content analysis and topic and prevalent myths detection |
[17] | Yes | January 2020–March 2020 | COVID-19 Twitter posts | Multilingual | Initial content analysis |
[18] | Yes | January 2020–February 2020 | COVID-19 Twitter posts | Multilingual | Statistical and content analysis |
[19] | Yes | January 2020–January 2021 | COVID-19 Twitter posts | Arabic | Statistical and content analysis |
[20] | Yes | February 2020–March 2020 | COVID-19 Tweets | English and Arabic | Misinformation detection |
[22] | Yes | December 2020–July 2021 | COVID-19 vaccine annotated tweets | English | Misinformation detection |
[23] | Yes | January 2021–February 2021 | COVID-19 vaccine annotated tweet dataset | Arabic | Vaccination stance detection and content analysis |
[24] | No | March 2020–February 2021 | COVID-19 vaccine tweets in the US | English | Sentiment and emotion analysis, topic modeling, and word cloud mapping |
[25] | Yes | December 2020–January 2021 | COVID-19 vaccines tweets | English | Sentiment and psycholinguistic analysis |
[27] | Yes | January 2020–July 2020 | COVID-19 tweets | Multilingual | Analysis and classification |
[28] | Yes | January 2020–January 2021 | COVID-19 tweets | Arabic | Misinformation detection |
[29] | Yes | January 2020–March 2021 | COVID-19 tweets | Multilingual | Disinformation analysis |
[30] | Yes | March 2020–May 2020 | COVID-19 tweets | English and Arabic | Sentiment analysis |
[31] | No | December 2019–April 2020 | COVID-19 tweets | Arabic | Rumor detection |
[32] | Yes | January 2020–May 2020 | COVID-19 vaccine news articles and related tweets | English | Reliable and unreliable news prediction |
[33] | Yes | October 2020–December 2020 | Anti-vaccine Twitter dataset | English | Anti-vaccination descriptive analysis |
[34] | Yes | December 2020–January 2021 | COVID-19 vaccine Twitter posts | English | Descriptive analysis and statistics visualization |
Database | Description | Fields |
---|---|---|
D1.General | Collection of tweets regarding the COVID-19 vaccine. Estimated size: 58.64 MB. | tweet_id: unique id for each post. datetime: the date and time of creation of the tweet. keyword: term used to extract the tweets. |
D2.Media | Collection of tweets with at least one media item. Estimated size: 24.25 MB. | tweet_id: unique id for each post. media_type: type of the media (photo, gif, or video). media_url: complete URL of the media. |
D3.Hashtag | Collection of hashtags in each tweet. Estimated size: 27.51 MB. | tweet_id: unique id for each post. datetime: the date and time of creation of the tweet. hashtag: term used as a hashtag within the tweet. |
D4.Reply | Collection of tweets that had at least one reply, with the count of all replies to each tweet. Estimated size: 20.12 MB. | tweet_id: unique id for each post. datetime: the date and time of creation of the tweet. twreply_count: number of replies to each tweet. |
D5.Retweet | Collection of tweets that had at least one retweet, with the count of all retweets of each tweet. Estimated size: 6.012 MB. | tweet_id: unique id for each post. datetime: the date and time of creation of the tweet. retweet_count: number of retweets for each tweet. |
D6.Vaccine_type | Collection of tweets about different types of vaccine. Estimated size: 26.64 MB. | tweet_id: unique id for each post. datetime: the date and time of creation of the tweet. vac_type: type of the vaccine. |
D7.Users | Collection of nodes of unique users. Estimated size: 5.684 MB. | user_id: the user’s account id. |
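Because most of the tables above share the tweet_id field, they can be combined for cross-table analysis. The sketch below joins rows shaped like D1.General and D3.Hashtag and groups hashtags by the search keyword that retrieved each tweet. The CSV layout, the sample rows, and the helper names are assumptions for illustration only; the released files may be stored in a different format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpts shaped like D1.General and D3.Hashtag (made-up rows).
GENERAL_CSV = """tweet_id,datetime,keyword
1,2021-01-05 10:00:00,لقاح
2,2021-02-01 12:30:00,فايزر
3,2021-02-02 08:15:00,لقاح
"""

HASHTAG_CSV = """tweet_id,datetime,hashtag
1,2021-01-05 10:00:00,كورونا
1,2021-01-05 10:00:00,لقاح_كورونا
3,2021-02-02 08:15:00,فايزر
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def hashtags_by_keyword(general_rows, hashtag_rows):
    """Join the two tables on tweet_id and group hashtags by search keyword."""
    keyword_of = {row["tweet_id"]: row["keyword"] for row in general_rows}
    grouped = defaultdict(list)
    for row in hashtag_rows:
        keyword = keyword_of.get(row["tweet_id"])
        if keyword is not None:  # skip hashtags whose tweet is not in D1
            grouped[keyword].append(row["hashtag"])
    return dict(grouped)


result = hashtags_by_keyword(read_rows(GENERAL_CSV), read_rows(HASHTAG_CSV))
```

The same pattern would apply to any pair of tables that carry tweet_id, for example D5.Retweet with D6.Vaccine_type to compare retweet counts across vaccine types.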
Hashtag | English Translation | Counts | Hashtag | English Translation | Counts |
---|---|---|---|---|---|
كورونا | Corona | 65,130 | توكلنا | Tawakkalna (app used in Saudi Arabia) | 2709 |
لقاح_كورونا | Corona vaccine | 38,778 | أكسفورد | Oxford | 2630 |
فايزر | Pfizer | 32,770 | اكسفورد | Oxford | 2585 |
عاجل | Urgent | 13,683 | صحتي | Sehhaty (app used in Saudi Arabia) | 2542 |
لقاح_فايزر | Pfizer vaccine | 11,907 | COVID19 | COVID19 | 2540 |
الصحة | Health | 9433 | مصر | Egypt | 2481 |
كوفيد_19 | COVID-19 | 9157 | المدينة_المنورة | Madinah | 2427 |
وزارة_الصحة | Ministry of health | 8786 | أوميكرون | Omicron | 2368 |
لقاح | Vaccine | 8612 | سينوفارم | Sinopharm | 2239 |
الكويت | Kuwait | 8419 | اخذتم_لقاح_كورونا_والا_باقي | Did you take the vaccine or not yet | 2204 |
السعودية | Saudi Arabia | 7399 | أسترازينكا | AstraZeneca | 2137 |
خذ_الخطوة | Take the step | 6470 | خادم_الحرمين_الشريفين | Custodian of the Two Holy Mosques | 2131 |
فيروس_كورونا | Coronavirus | 5347 | الرياض | Riyadh | 2095 |
كوفيد19 | COVID-19 | 5290 | يدا_بيد_نتعافى | Hand in hand we recover | 2035 |
الإمارات | Emirates | 5263 | بريطانيا | United Kingdom | 1895 |
موديرنا | Moderna | 5217 | المغرب | Morocco | 1870 |
أسترازينيكا | AstraZeneca | 4221 | الملك_يتلقي_لقاح_كورونا | The King receives the corona vaccine | 1852 |
اخذت_جرعه_لقاح_ولا_باقي | Did you take the vaccine dose or not yet | 4204 | صحة | Health | 1785 |
الأردن | Jordan | 4192 | الصين | China | 1770 |
لقاح_أسترازينك | AstraZeneca vaccine | 3370 | اللقاح | The vaccine | 1742 |
لبنان | Lebanon | 3200 | روسيا | Russia | 1708 |
الجرعة_الثانية | Second dose | 2845 | البحرين | Bahrain | 1650 |
لا_للتطعيم_الاجباري | No to compulsory vaccination | 2780 | العربية | Arabia | 1647 |
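Several hashtags in the table above are spelling variants of the same term, for example أكسفورد and اكسفورد (Oxford), or كوفيد_19 and كوفيد19 (COVID-19). The following sketch shows one possible way to merge such variants before ranking. It is not part of the published analysis: the normalization rules (unifying alef forms and dropping underscores) and the function names are assumptions chosen for illustration.

```python
from collections import Counter

# Map alef-with-hamza and alef-with-madda forms to the bare alef.
ALEF_VARIANTS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})


def normalize_hashtag(tag: str) -> str:
    """Return a canonical form: no leading '#', unified alef, no underscores."""
    return tag.lstrip("#").translate(ALEF_VARIANTS).replace("_", "")


def merged_counts(raw_counts: dict) -> Counter:
    """Sum the counts of hashtags that share the same canonical form."""
    merged = Counter()
    for tag, count in raw_counts.items():
        merged[normalize_hashtag(tag)] += count
    return merged


# Counts copied from the hashtag table for four variant spellings.
table_counts = {
    "أكسفورد": 2630,
    "اكسفورد": 2585,
    "كوفيد_19": 9157,
    "كوفيد19": 5290,
}
merged = merged_counts(table_counts)
```

With these rules the two Oxford spellings combine to 5215 occurrences and the two COVID-19 spellings to 14,447. Dropping underscores also fuses the words of multi-word hashtags, so the canonical form is suitable as a grouping key but not for display.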
Word (January–April) | English Translation | Word (May–August) | English Translation | Word (September–December) | English Translation |
---|---|---|---|---|---|
كورونا | Corona | جرعه | Dose | كورونا | Corona |
جرعه | Dose | كورونا | Corona | جرعه | Dose |
اولى | First | اخد | Got | اخذ | Got |
صحه | Health | ثانيه | Second | جرعتين | Two doses |
موعد | Appointment | صحه | Health | يوم | Day |
فيروس | Virus | تطعيم | Vaccination | صحه | Health |
يتلقى | Get | يوم | Day | جرعات | Doses |
حمدلله | Thank God | اولى | First | موجود | Exist |
سلام | Peace | سلام | Peace | ثانيه | Second |
تطبيق | Application | موعد | Appointment | حضور | Attendance |
صحتي | My health | وزاره | Ministry | ثالثه | Third |
مليون | Million | مملكه | Kingdom | تم | Done |
نفسي | Myself | حمدلله | Thank God | طلاب | Students |
حمايه | Protection | مليون | Million | عام | Year |
سجل | Recorded | جرعتين | Two doses | استكمال | Complete |
وطني | My country | ضد | Against | اعمار | Ages |
شركه | Company | ناس | People | طالبات | Female students |
نحمى | Protect | صيني | Chinese | حصلو | Got |
مجتمع | Community | مصر | Egypt | دراسه | Study |
يوم | Day | كويت | Kuwait | تنشيطيه | Booster |
Statistic | Count | Percentage (of filtered tweets) |
---|---|---|
Collected tweets | 1,125,446 | |
Filtered tweets | 1,101,349 | |
Tweets with location | 691,461 | 62.78% |
Tweets with hashtags | 288,803 | 26.22% |
Tweets with media | 220,797 | 20.045% |
Tweets with videos | 36,863 | 3.347% |
Tweets with photos | 186,571 | 16.94% |
Unique hashtags | 94,407 | |
Unique locations | 58,785 | |
Unique users | 344,328 | |
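The percentages in the table above are consistent with each count being divided by the number of filtered tweets (1,101,349). The short check below recomputes them to two decimal places. The dictionary keys and the helper name `share_of_filtered` are labels introduced for this example, not field names from the dataset.

```python
FILTERED_TWEETS = 1_101_349

# Counts copied from the statistics table.
SUBSET_COUNTS = {
    "with_location": 691_461,
    "with_hashtag": 288_803,
    "with_media": 220_797,
    "with_video": 36_863,
    "with_photo": 186_571,
}


def share_of_filtered(count: int, total: int = FILTERED_TWEETS) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {name: share_of_filtered(n) for name, n in SUBSET_COUNTS.items()}
for name, value in shares.items():
    print(f"{name}: {value}%")
```

The recomputed values agree with the reported ones to within about 0.01 percentage points. The photo and video counts add up to more than the media count, which is expected if a single tweet can carry more than one media item.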
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alhazmi, H. Arabic Twitter Conversation Dataset about the COVID-19 Vaccine. Data 2022, 7, 152. https://doi.org/10.3390/data7110152