The Effect of Training Data Size on Disaster Classification from Twitter
Abstract
1. Introduction
2. Related Work
- Algorithm performance analysis: This study systematically evaluates the performance of multiple machine learning algorithms for tweet classification in the context of disaster events. It provides valuable insights into the strengths and weaknesses of each algorithm, aiding practitioners in making informed choices. It also employs ensemble and stacking techniques to further boost performance.
- Hyperparameter tuning importance: By emphasizing the significance of hyperparameter tuning, particularly through Bayesian optimization, this work underscores the potential performance gains achievable by systematically exploring the hyperparameter space. This knowledge can guide practitioners in optimizing their models effectively.
- Occam’s razor in machine learning: The application of Occam’s razor to machine learning model selection is explored, emphasizing the advantages of simpler models in terms of interpretability and reduced overfitting risk.
- Impact of dataset size on model choice: This research establishes a practical guideline for selecting the most suitable algorithm based on dataset size, helping practitioners make efficient and effective algorithm choices appropriate to the scale of their data.
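Of the contributions above, the Bayesian-optimization point is the most implementation-heavy. As a minimal sketch (not the paper's actual tuning setup), the code below runs Gaussian-process Bayesian optimization with an upper-confidence-bound acquisition over a single hypothetical hyperparameter, log10(C); the objective function is a toy stand-in for a validation score, and all constants are illustrative assumptions:

```python
import numpy as np

# Toy stand-in for a validation score as a function of one hyperparameter
# (say, log10 of a regularization strength C). In practice this would
# train a classifier and score it on the validation split; the peak at
# 0.5 is arbitrary, chosen for illustration.
def objective(log_c):
    return -(log_c - 0.5) ** 2

def rbf(a, b, length_scale=1.0):
    a = np.asarray(a, float).reshape(-1, 1)
    b = np.asarray(b, float).reshape(-1, 1)
    return np.exp(-0.5 * (a - b.T) ** 2 / length_scale ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Gaussian-process posterior mean and std at candidate points Xs
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ np.asarray(y, float)
    var = np.clip(1.0 - np.sum(Ks * (K_inv @ Ks), axis=0), 1e-12, None)
    return mu, np.sqrt(var)

candidates = np.linspace(-3.0, 3.0, 121)   # search range for log10(C)
X = [-3.0, 0.0, 3.0]                       # small initial design
y = [objective(x) for x in X]

for _ in range(20):
    ys = (np.array(y) - np.mean(y)) / (np.std(y) + 1e-9)  # standardize targets
    mu, sigma = gp_posterior(X, ys, candidates)
    ucb = mu + 2.0 * sigma                 # upper-confidence-bound acquisition
    for xi in X:                           # never re-query a sampled point
        ucb[np.isclose(candidates, xi)] = -np.inf
    x_next = float(candidates[np.argmax(ucb)])
    X.append(x_next)
    y.append(objective(x_next))

best = X[int(np.argmax(y))]
print("best log10(C) found:", round(best, 2))
```

In practice one would replace `objective` with a train-and-validate call and use a maintained library (e.g., scikit-optimize or Optuna) rather than this hand-rolled loop; the point here is only the surrogate-model-plus-acquisition structure that distinguishes Bayesian optimization from grid or random search.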
3. Materials and Methods
3.1. Dataset
3.2. Classification Models
3.3. Setup
4. Results and Discussion
Further Improvements using Ensemble and Stacking Approaches
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Label | CrisisLex | CrisisNLP | SWDM13 | ISCRAM13 | DRD | DSM | CrisisMMD | AIDR | Total |
---|---|---|---|---|---|---|---|---|---|
Informative | 42,140 | 23,694 | 716 | 2443 | 14,849 | 3461 | 11,488 | 2968 | 101,759 |
Not informative | 27,559 | 16,707 | 141 | 78 | 6047 | 5374 | 4532 | 3901 | 64,339 |
Total | 69,699 | 40,401 | 857 | 2521 | 20,896 | 8835 | 16,020 | 6869 | 166,098 |
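The label distribution above is imbalanced toward informative tweets, which sets the floor any classifier must beat. A quick sanity check (counts copied verbatim from the table) recovers the per-class totals and the majority-class baseline:

```python
# Per-source label counts copied from the dataset table above.
informative = {
    "CrisisLex": 42140, "CrisisNLP": 23694, "SWDM13": 716, "ISCRAM13": 2443,
    "DRD": 14849, "DSM": 3461, "CrisisMMD": 11488, "AIDR": 2968,
}
not_informative = {
    "CrisisLex": 27559, "CrisisNLP": 16707, "SWDM13": 141, "ISCRAM13": 78,
    "DRD": 6047, "DSM": 5374, "CrisisMMD": 4532, "AIDR": 3901,
}

n_pos = sum(informative.values())       # 101,759 informative tweets
n_neg = sum(not_informative.values())   # 64,339 not-informative tweets
total = n_pos + n_neg                   # 166,098 tweets overall

# Majority-class baseline: always predicting "informative"
baseline = n_pos / total
print(f"{n_pos} / {total} informative -> baseline accuracy {baseline:.3f}")
```

Read against this roughly 0.613 baseline, the scores of about 0.86 to 0.89 in the results tables represent a substantial improvement over always predicting the majority class.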
Training Size (Validation Set) | CNN | Bernoulli NB | LGB | Linear SVC | Log. Regr. | XGB |
---|---|---|---|---|---|---|
0.01 (1094) | 0.767006 | 0.827914 | 0.795521 | 0.826859 | 0.819480 | 0.805749 |
0.05 (5472) | 0.817594 | 0.847006 | 0.842708 | 0.849749 | 0.847635 | 0.841576 |
0.10 (10,944) | 0.832310 | 0.851917 | 0.850162 | 0.860265 | 0.856589 | 0.850461 |
0.15 (16,416) | 0.844751 | 0.851077 | 0.861370 | 0.863743 | 0.860611 | 0.854576 |
0.20 (21,888) | 0.842131 | 0.853103 | 0.865964 | 0.869158 | 0.866019 | 0.856966 |
0.25 (27,360) | 0.845838 | 0.853619 | 0.865838 | 0.870425 | 0.868642 | 0.860029 |
0.30 (32,832) | 0.845977 | 0.854866 | 0.871616 | 0.872945 | 0.869069 | 0.863636 |
0.35 (38,304) | 0.849324 | 0.854992 | 0.871856 | 0.874139 | 0.869954 | 0.867107 |
0.40 (43,776) | 0.855468 | 0.854340 | 0.867247 | 0.875189 | 0.872209 | 0.868828 |
0.45 (49,248) | 0.846153 | 0.854289 | 0.869479 | 0.876488 | 0.873499 | 0.867771 |
0.50 (54,720) | 0.853218 | 0.854913 | 0.873175 | 0.876596 | 0.873894 | 0.870020 |
0.55 (60,192) | 0.857627 | 0.854928 | 0.875890 | 0.877865 | 0.874766 | 0.871631 |
0.60 (65,664) | 0.857755 | 0.854836 | 0.870250 | 0.878697 | 0.874896 | 0.871449 |
0.65 (71,136) | 0.858243 | 0.855855 | 0.877637 | 0.878350 | 0.876792 | 0.874482 |
0.70 (76,608) | 0.853668 | 0.856543 | 0.872727 | 0.878383 | 0.876389 | 0.873957 |
0.75 (82,080) | 0.863194 | 0.857458 | 0.880775 | 0.878759 | 0.875894 | 0.875208 |
0.80 (87,552) | 0.859496 | 0.857563 | 0.878524 | 0.878986 | 0.877193 | 0.874284 |
0.85 (93,024) | 0.857330 | 0.857608 | 0.875800 | 0.879284 | 0.877253 | 0.876186 |
0.90 (98,849) | 0.862121 | 0.857308 | 0.876491 | 0.880046 | 0.877699 | 0.877721 |
0.95 (103,968) | 0.859183 | 0.857128 | 0.880402 | 0.880067 | 0.878064 | 0.876060 |
1.00 (109,441) | 0.863807 | 0.857458 | 0.881492 | 0.880683 | 0.878582 | 0.877902 |
Training Size (Test Set) | CNN | Bernoulli NB | LGB | Linear SVC | Log. Regr. | XGB |
---|---|---|---|---|---|---|
0.01 (1094) | 0.760970 | 0.828699 | 0.793511 | 0.824018 | 0.820579 | 0.801083 |
0.05 (5472) | 0.825369 | 0.851177 | 0.845027 | 0.856757 | 0.853133 | 0.845988 |
0.10 (10,944) | 0.840800 | 0.854125 | 0.853006 | 0.865346 | 0.861614 | 0.853893 |
0.15 (16,416) | 0.850352 | 0.857158 | 0.864257 | 0.867850 | 0.865944 | 0.859451 |
0.20 (21,888) | 0.847575 | 0.857831 | 0.871351 | 0.872391 | 0.870868 | 0.861848 |
0.25 (27,360) | 0.851999 | 0.859652 | 0.870432 | 0.874624 | 0.872885 | 0.865389 |
0.30 (32,832) | 0.854714 | 0.860439 | 0.876219 | 0.876940 | 0.875810 | 0.869051 |
0.35 (38,304) | 0.853916 | 0.859715 | 0.874784 | 0.877344 | 0.877032 | 0.871913 |
0.40 (43,776) | 0.860278 | 0.859205 | 0.873649 | 0.876895 | 0.877959 | 0.873085 |
0.45 (49,248) | 0.851993 | 0.860048 | 0.875293 | 0.880186 | 0.878756 | 0.870071 |
0.50 (54,720) | 0.857606 | 0.860140 | 0.880370 | 0.880724 | 0.879013 | 0.874163 |
0.55 (60,192) | 0.861118 | 0.860143 | 0.880256 | 0.882413 | 0.880510 | 0.875076 |
0.60 (65,664) | 0.858978 | 0.860485 | 0.878439 | 0.882081 | 0.880602 | 0.873560 |
0.65 (71,136) | 0.861469 | 0.860549 | 0.882362 | 0.883176 | 0.881592 | 0.877876 |
0.70 (76,608) | 0.858348 | 0.861224 | 0.880650 | 0.882728 | 0.882752 | 0.877911 |
0.75 (82,080) | 0.868000 | 0.861219 | 0.884633 | 0.883806 | 0.883376 | 0.877862 |
0.80 (87,552) | 0.865731 | 0.861811 | 0.882780 | 0.884273 | 0.883718 | 0.878570 |
0.85 (93,024) | 0.859579 | 0.861736 | 0.881777 | 0.884310 | 0.884066 | 0.880593 |
0.90 (98,849) | 0.866165 | 0.861556 | 0.882326 | 0.884636 | 0.884272 | 0.879329 |
0.95 (103,968) | 0.863560 | 0.861340 | 0.885998 | 0.885465 | 0.884855 | 0.879588 |
1.00 (109,441) | 0.866116 | 0.862007 | 0.884912 | 0.885308 | 0.885435 | 0.879179 |
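The pattern in the two tables, rapid gains up to roughly 20–30% of the data followed by a long plateau, can be reproduced in miniature. The sketch below trains a from-scratch Bernoulli naive Bayes (one of the evaluated classifiers) on growing fractions of a synthetic "informative vs. chatter" tweet corpus; the vocabulary, noise rate, and corpus size are invented for illustration only:

```python
import random
from collections import defaultdict
from math import log

random.seed(0)

# Invented vocabulary for illustration; real tweets are far noisier.
INFORMATIVE_WORDS = ["flood", "rescue", "damage", "evacuate", "urgent"]
CHATTER_WORDS = ["coffee", "movie", "game", "music", "lunch"]
SHARED_WORDS = ["today", "city", "people", "really", "time"]

def make_tweet(informative):
    topic = INFORMATIVE_WORDS if informative else CHATTER_WORDS
    words = random.sample(SHARED_WORDS, 2) + random.sample(topic, 2)
    if random.random() < 0.15:  # occasional off-topic word as noise
        words.append(random.choice(CHATTER_WORDS if informative else INFORMATIVE_WORDS))
    return words

data = [(make_tweet(i % 2 == 0), i % 2 == 0) for i in range(2000)]
train, test = data[:1600], data[1600:]

def train_bernoulli_nb(samples):
    vocab = sorted({w for doc, _ in samples for w in doc})
    doc_counts = {True: defaultdict(int), False: defaultdict(int)}
    n = {True: 0, False: 0}
    for doc, label in samples:
        n[label] += 1
        for w in set(doc):
            doc_counts[label][w] += 1
    # Laplace-smoothed per-word presence probabilities
    probs = {c: {w: (doc_counts[c][w] + 1) / (n[c] + 2) for w in vocab}
             for c in (True, False)}
    priors = {c: n[c] / len(samples) for c in (True, False)}
    return vocab, probs, priors

def predict(model, doc):
    vocab, probs, priors = model
    present = set(doc)
    scores = {}
    for c in (True, False):
        lp = log(priors[c])
        for w in vocab:  # Bernoulli NB also scores absent words
            lp += log(probs[c][w]) if w in present else log(1 - probs[c][w])
        scores[c] = lp
    return max(scores, key=scores.get)

def accuracy_at(fraction):
    subset = train[: max(4, int(len(train) * fraction))]
    model = train_bernoulli_nb(subset)
    return sum(predict(model, doc) == y for doc, y in test) / len(test)

for fraction in (0.01, 0.10, 0.50, 1.00):
    print(f"{fraction:.2f} of training data -> accuracy {accuracy_at(fraction):.3f}")
```

As in the tables, most of the attainable accuracy is reached with a small fraction of the data, with diminishing returns thereafter; the exact curve depends on the noise level and vocabulary chosen here.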
Set | CNN | Bernoulli NB | LGB | Linear SVC | Log. Regr. | XGB | Ensemble | Stacking |
---|---|---|---|---|---|---|---|---|
Validation | 0.8421 | 0.8531 | 0.8659 | 0.8691 | 0.8660 | 0.8569 | 0.8729 | - |
Test | 0.8475 | 0.8578 | 0.8713 | 0.8723 | 0.8708 | 0.8618 | 0.8757 | 0.8761 |
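The "Ensemble" and "Stacking" columns above combine the six base classifiers; the exact scheme is described in the paper. As a generic sketch of the two ideas, the code below uses decision stumps on synthetic data as stand-ins for the base models: the ensemble is a hard majority vote, and stacking trains a logistic-regression meta-learner on out-of-fold base predictions. All data and learners are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic two-class data; the three features carry signal of
# decreasing strength (all numbers are invented for illustration).
n = 1000
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 3)) + (2 * y - 1)[:, None] * np.array([0.9, 0.6, 0.3])
Xtr, ytr, Xte, yte = X[:700], y[:700], X[700:], y[700:]

def fit_stump(x, labels):
    # Best threshold/polarity decision stump on a single feature
    best = (0.0, 1, 0.0)
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        for s in (1, -1):
            pred = (x > t).astype(int) if s == 1 else (x <= t).astype(int)
            acc = np.mean(pred == labels)
            if acc > best[2]:
                best = (t, s, acc)
    return best[:2]

def stump_predict(x, stump):
    t, s = stump
    return (x > t).astype(int) if s == 1 else (x <= t).astype(int)

# --- Ensemble: hard majority vote over the base learners ---
stumps = [fit_stump(Xtr[:, j], ytr) for j in range(3)]
votes = np.stack([stump_predict(Xte[:, j], stumps[j]) for j in range(3)])
ensemble_pred = (votes.sum(axis=0) >= 2).astype(int)

# --- Stacking: out-of-fold base predictions feed a meta-learner ---
oof = np.zeros((700, 3))
for fold in np.array_split(np.arange(700), 5):
    mask = np.ones(700, bool)
    mask[fold] = False
    for j in range(3):
        s = fit_stump(Xtr[mask][:, j], ytr[mask])
        oof[fold, j] = stump_predict(Xtr[fold][:, j], s)

# Meta-learner: logistic regression fitted by plain gradient descent
A = np.hstack([oof, np.ones((700, 1))])
w = np.zeros(4)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-A @ w))
    w -= 0.1 * A.T @ (p - ytr) / 700

meta_in = np.hstack([votes.T, np.ones((300, 1))])
stack_pred = (1.0 / (1.0 + np.exp(-meta_in @ w)) > 0.5).astype(int)

for name, pred in (("ensemble", ensemble_pred), ("stacking", stack_pred)):
    print(name, "accuracy:", round(float(np.mean(pred == yte)), 3))
```

The out-of-fold step is the essential detail of stacking: the meta-learner must never see predictions that a base model made on its own training data, or it learns to trust overfit outputs.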
Effrosynidis, D.; Sylaios, G.; Arampatzis, A. The Effect of Training Data Size on Disaster Classification from Twitter. Information 2024, 15, 393. https://doi.org/10.3390/info15070393