GenCo: A Generative Learning Model for Heterogeneous Text Classification Based on Collaborative Partial Classifications
Abstract
1. Introduction
Related Works
2. Materials and Methods
2.1. Conventional Approach
- The weather is worse today due to climate change.
- The increase in economic crises is due to the pandemic.
- World leaders are determined to end world crises.
- Major decisions to end climate change were made by world leaders at the climate summit.
- During the pandemic, economic activities were shut down, making world leaders struggle with the world economy.
- No world economy survives the pandemic.
- World climate change summit discusses how to tackle world climate change crises during a pandemic crisis.
- Most world leaders don’t have a large economy to tackle the pandemic and climate change crises.
- Without a sustainable economy, it may take longer to survive the pandemic shock.
- World economic crises and pandemics are headaches to world leaders.
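A corpus like the one above is heterogeneous because different documents draw on different, partially overlapping topics (climate, pandemic, economy). The first step in either approach is to turn each document into a binary term-occurrence vector. As a minimal sketch (the vocabulary below is an illustrative subset of recurring content words, not the paper's actual feature set, and `binary_vectorize` is a hypothetical helper):

```python
import re

# First three documents of the toy corpus from Section 2.1
docs = [
    "The weather is worse today due to climate change.",
    "The increase in economic crises is due to the pandemic.",
    "World leaders are determined to end world crises.",
]

# Illustrative vocabulary of recurring content words (an assumption,
# not the feature set used in the paper)
vocab = ["climate", "pandemic", "economic", "world", "leaders", "crises"]

def binary_vectorize(text, vocab):
    """Return a 0/1 vector marking which vocabulary terms occur in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if term in tokens else 0 for term in vocab]

# Binary vectorized input X, one row per document
X = [binary_vectorize(d, vocab) for d in docs]
```

Binarization (presence/absence rather than raw counts) matches the `X (binary vectorized input)` requirement of Algorithm 1 below and the second toy table later in the paper.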
2.2. Proposed Model
Algorithm 1 GenCo learning and classification of heterogeneous text corpora
Require: X (binary vectorized input), k (number of classes), m (number of observation instances).
Ensure: the final class posterior.
  n ← number of feature segments
  for i ← 1 to m do                 ▷ learning of class posteriors
      ▷ estimate the class priors
      for j ← 1 to n do             ▷ estimating partial posteriors
          ▷ estimate the observation priors
          ▷ compute the partial class posterior
          ▷ compute the partial heterogeneity
      end for
      ▷ update the class posterior
  end for
  ▷ final class posterior
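The flow of Algorithm 1 can be sketched in code. The following is an illustrative reconstruction, not the authors' implementation: it fits one Bernoulli naïve Bayes partial classifier per feature segment (with Laplace smoothing, as in Section 3.1.3) and combines the partial class posteriors multiplicatively; the paper's heterogeneity weighting is not reproduced, and all function names are hypothetical.

```python
import math

def train_partial(X, y, segment, alpha=1.0):
    """Fit Bernoulli naive-Bayes parameters on one feature segment.
    alpha=1.0 corresponds to Laplace smoothing."""
    classes = sorted(set(y))
    prior = {c: y.count(c) / len(y) for c in classes}
    theta = {}  # theta[c][j] = smoothed P(x_j = 1 | class c)
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        theta[c] = {j: (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                    for j in segment}
    return prior, theta

def partial_posterior(x, prior, theta):
    """Normalized class posterior using only this segment's features."""
    log_score = {}
    for c, p in prior.items():
        lp = math.log(p)
        for j, t in theta[c].items():
            lp += math.log(t) if x[j] else math.log(1.0 - t)
        log_score[c] = lp
    m = max(log_score.values())
    z = sum(math.exp(s - m) for s in log_score.values())
    return {c: math.exp(s - m) / z for c, s in log_score.items()}

def genco_predict(x, models):
    """Combine the partial posteriors of all segments into a final posterior.
    The shared class prior is divided out of each partial posterior so it is
    not counted once per segment."""
    prior = models[0][0]
    combined = {c: math.log(prior[c]) for c in prior}
    for seg_prior, seg_theta in models:
        post = partial_posterior(x, seg_prior, seg_theta)
        for c in combined:
            combined[c] += math.log(post[c]) - math.log(seg_prior[c])
    m = max(combined.values())
    z = sum(math.exp(s - m) for s in combined.values())
    return {c: math.exp(s - m) / z for c, s in combined.items()}
```

With multiplicative combination and equal segment weights this reduces to a factorized naïve Bayes classifier; GenCo's collaborative aspect lies in how the partial posteriors are weighted across segments, which this sketch does not attempt to reproduce.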
2.3. Performance Measure
3. Experimental Results and Discussions
3.1. Experimental Setup
3.1.1. Dataset and Feature Presentation
3.1.2. Dataset Pre-Processing
3.1.3. Model Definition
- Smoothing parameter: The smoothing parameter is fixed to 1, corresponding to Laplace smoothing.
- Number of segments: The number of segments used depends on the number of features (i.e., the dimension of the vocabulary) of the dataset concerned.
- Number of partialization layers: A single layer of partialization is used, and the number of partial class posteriors in the layer is equal to the number of feature segments.
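The feature-segmentation choice above (e.g., 100 vocabulary features split into 10 segments, per the dataset table below) can be realized with a small helper. This is a sketch under the assumption that segments are contiguous, equal-sized index ranges; the paper does not specify the partitioning scheme, and `make_segments` is a hypothetical name:

```python
def make_segments(n_features, n_segments):
    """Split feature indices 0..n_features-1 into contiguous segments,
    distributing any remainder over the leading segments."""
    base, rem = divmod(n_features, n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        size = base + (1 if i < rem else 0)
        segments.append(list(range(start, start + size)))
        start += size
    return segments
```

For the Twitter US Airline configuration this yields 10 segments of 10 features each; the SMS Spam configuration (50 features, 5 segments) yields 5 segments of 10.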
3.2. Results Discussions
3.2.1. Twitter US Airline Dataset Results
3.2.2. Conference Paper Dataset Results
3.2.3. SMS Spam Dataset Results
3.2.4. Performance and Comparison with Models from Other Studies
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
References
- Buchan, N.R.; Grimalda, G.; Wilson, R.; Brewer, M.; Fatas, E.; Foddy, M. Globalization and human cooperation. Proc. Natl. Acad. Sci. USA 2009, 106, 4138–4142.
- Goody, J. The Logic of Writing and the Organization of Society; Cambridge University Press: Cambridge, UK, 1986.
- Korde, V. Text Classification and Classifiers: A Survey. Int. J. Artif. Intell. Appl. 2012, 3, 85–99.
- Dogra, V.; Verma, S.; Kavita; Chatterjee, P.; Shafi, J.; Choi, J.; Ijaz, M.F. A Complete Process of Text Classification System Using State-of-the-Art NLP Models. Comput. Intell. Neurosci. 2022, 2022, 1883698.
- Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text Classification Algorithms: A Survey. Information 2019, 10, 150.
- Malvestuto, F.; Zuffada, C. The Classification Problem with Semantically Heterogeneous Data; Springer: Berlin/Heidelberg, Germany, 2006; pp. 157–176.
- Staš, J.; Juhár, J.; Hládek, D. Classification of heterogeneous text data for robust domain-specific language modeling. EURASIP J. Audio Speech Music Process. 2014, 2014, 14.
- Zhang, H.; Li, D. Naïve Bayes Text Classifier. In Proceedings of the 2007 IEEE International Conference on Granular Computing (GRC 2007), Fremont, CA, USA, 2–4 November 2007; p. 708.
- Xu, S. Bayesian Naïve Bayes classifiers to text classification. J. Inf. Sci. 2018, 44, 48–59.
- Mitra, V.; Wang, C.J.; Banerjee, S. Text classification: A least square support vector machine approach. Appl. Soft Comput. 2007, 7, 908–914.
- Qiang, G. An Effective Algorithm for Improving the Performance of Naive Bayes for Text Classification. In Proceedings of the 2010 Second International Conference on Computer Research and Development, Kuala Lumpur, Malaysia, 7–10 May 2010; pp. 699–701.
- Akhter, M.P.; Jiangbin, Z.; Naqvi, I.R.; Abdelmajeed, M.; Mehmood, A.; Sadiq, M.T. Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network. IEEE Access 2020, 8, 42689–42707.
- Li, W.; Gao, S.; Zhou, H.; Huang, Z.; Zhang, K.; Li, W. The Automatic Text Classification Method Based on BERT and Feature Union. In Proceedings of the 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS), Tianjin, China, 4–6 December 2019; pp. 774–777.
- Du, C.; Huang, L. Text Classification Research with Attention-based Recurrent Neural Networks. Int. J. Comput. Commun. Control 2018, 13, 50.
- Wilbur, W.J. Boosting naïve Bayesian learning on a large subset of MEDLINE. In Proceedings of the AMIA Symposium; American Medical Informatics Association: Bethesda, MD, USA, 2000; pp. 918–922.
- Xu, S.; Li, Y.; Wang, Z. Bayesian Multinomial Naive Bayes Classifier to Text Classification. In Advanced Multimedia and Ubiquitous Engineering: MUE/FutureTech 2017; Springer: Singapore, 2017.
- Manning, C.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008; pp. 253–286.
- Daly, R.; Shen, Q.; Aitken, S. Learning Bayesian networks: Approaches and issues. Knowl. Eng. Rev. 2011, 26, 99–157.
- Murphy, K.P. Machine Learning: A Probabilistic Perspective; The MIT Press: Cambridge, MA, USA, 2012.
- Vihinen, M. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genom. 2012, 13 (Suppl. S4), S2.
- Figure Eight. Twitter US Airline Sentiment Dataset. 2019. Available online: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment (accessed on 25 February 2023).
- Harun, R. Research Papers Dataset. 2018. Available online: https://www.kaggle.com/datasets/harunshimanto/research-paper (accessed on 25 February 2023).
- Almeida, T.A.; Hidalgo, J.M.G.; Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. In Proceedings of the 11th ACM Symposium on Document Engineering, Mountain View, CA, USA, 19–23 September 2011; pp. 259–262.
- Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’Reilly Media Inc.: Sebastopol, CA, USA, 2009.
- Tan, K.L.; Lee, C.P.; Lim, K.M. RoBERTa-GRU: A Hybrid Deep Learning Model for Enhanced Sentiment Analysis. Appl. Sci. 2023, 13, 3915.
- AlBadani, B.; Shi, R.; Dong, J. A Novel Machine Learning Approach for Sentiment Analysis on Twitter Incorporating the Universal Language Model Fine-Tuning and SVM. Appl. Syst. Innov. 2022, 5, 13.
- Basiri, M.E.; Nemati, S.; Abdar, M.; Cambria, E.; Acharya, U.R. ABCDM: An Attention-based Bidirectional CNN-RNN Deep Model for sentiment analysis. Future Gener. Comput. Syst. 2021, 115, 279–294.
- Li, S. Machine Learning SpaCy. 2018. Available online: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/machine%20learning%20spaCy.ipynb (accessed on 24 February 2023).
- Xia, T.; Chen, X. A Discrete Hidden Markov Model for SMS Spam Detection. Appl. Sci. 2020, 10, 5011.
- Ghourabi, A.; Mahmood, M.A.; Alzubi, Q.M. A Hybrid CNN-LSTM Model for SMS Spam Detection in Arabic and English Messages. Future Internet 2020, 12, 156.
- Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464.
Doc | x1 | x2 | x3 | x4 | x5 | x6 | x7 | y
---|---|---|---|---|---|---|---|---
1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | G |
2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | B |
3 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | B |
4 | 0 | 0 | 0 | 2 | 1 | 1 | 1 | G |
5 | 2 | 0 | 1 | 0 | 0 | 2 | 1 | B |
6 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | B |
7 | 0 | 2 | 1 | 2 | 2 | 2 | 0 | G |
8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | G |
9 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | B |
10 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | B |
Doc | x1 | x2 | x3 | x4 | x5 | x6 | x7 | y
---|---|---|---|---|---|---|---|---
1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | G |
2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | B |
3 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | B |
4 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | G |
5 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | B |
6 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | B |
7 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | G |
8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | G |
9 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | B |
10 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | B |
Datasets | Documents | Vocabulary Size | Vocabulary Segments | Categories
---|---|---|---|---
(1) Twitter US Airline dataset [21] | 14,640 | 100 | 10 | 3 |
(2) Conference Paper dataset [22] | 2507 | 100 | 10 | 5 |
(3) SMS Spam dataset [23] | 5574 | 50 | 5 | 2 |
Datasets | Models | Accuracy (%) | BIC
---|---|---|---
Twitter US Airline dataset | RoBERTa-GRU [25] | 91.52 | 2455.53
 | ULMFit-SVM [26] | 99.78 | 1352.18
 | ABCDM [27] | 92.75 | 1178.23
 | GenCo (our work) | 98.40 | 959.18
Conference Paper dataset | Linear SVM [28] | 74.63 | 1041.10
 | GenCo (our work) | 89.90 | 782.90
SMS Spam dataset | Discrete HMM [29] | 95.90 | 833.61
 | Hybrid CNN-LSTM [30] | 98.37 | 2103.36
 | GenCo (our work) | 99.26 | 431.31
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ekolle, Z.E.; Kohno, R. GenCo: A Generative Learning Model for Heterogeneous Text Classification Based on Collaborative Partial Classifications. Appl. Sci. 2023, 13, 8211. https://doi.org/10.3390/app13148211