Machine Learning-Based Text Classification Comparison: Turkish Language Context
Abstract
1. Introduction
2. Literature Review
2.1. Text Classification in Turkish Context
2.2. Natural Language Processing
2.3. Machine Learning
2.3.1. Supervised Learning
2.3.2. Unsupervised Learning
3. Research Method
3.1. Python Tool
3.2. Zemberek
4. Data Preparation
4.1. Data Preparation Steps
4.2. Data Exploration
4.2.1. Stop Word List
4.2.2. Unigram Data Comparison
4.2.3. Bigram Data Comparison
4.2.4. Trigram Data Comparison
4.2.5. Data Classification
4.3. Feature Extraction
5. Findings
5.1. Performance Metrics
5.2. Model Performance Evaluation
5.2.1. Model Evaluation of Raw Data
5.2.2. Model Evaluation after Data Preparation
5.2.3. Training Time Evaluation
5.2.4. Long Short-Term Memory Insights
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ajitha, P.; Sivasangari, A.; Immanuel Rajkumar, R.; Poonguzhali, S. Design of text sentiment analysis tool using feature extraction based on fusing machine learning algorithms. J. Intell. Fuzzy Syst. 2021, 40, 6375–6383.
2. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning-based text classification: A comprehensive review. ACM Comput. Surv. (CSUR) 2021, 54, 1–40.
3. Srinivasan, S.; Ravi, V.; Alazab, M.; Ketha, S.; Al-Zoubi, A.M.; Kotti Padannayil, S. Spam emails detection based on distributed word embedding with deep learning. In Machine Intelligence and Big Data Analytics for Cybersecurity Applications. Studies in Computational Intelligence; Maleh, Y., Shojafar, M., Alazab, M., Baddi, Y., Eds.; Springer: Cham, Switzerland, 2021; Volume 919, pp. 161–189.
4. Akhter, M.P.; Jiangbin, Z.; Naqvi, I.R.; Abdelmajeed, M.; Fayyaz, M. Exploring deep learning approaches for Urdu text classification in product manufacturing. Enterp. Inf. Syst. 2022, 16, 223–248.
5. Sarker, I.H. Machine learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021, 2, 160.
6. Mohammed, A.; Kora, R. An effective ensemble deep learning framework for text classification. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8825–8837.
7. Qasim, R.; Bangyal, W.H.; Alqarni, M.A.; Ali Almazroi, A. A fine-tuned BERT-based transfer learning approach for text classification. J. Healthc. Eng. 2022, 2022, 3498123.
8. Thirumoorthy, K.; Muneeswaran, K. Feature selection for text classification using machine learning approaches. Natl. Acad. Sci. Lett. 2022, 45, 51–56.
9. Luo, X. Efficient English text classification using selected machine learning techniques. Alex. Eng. J. 2021, 60, 3401–3409.
10. Altınel, B.; Ganiz, M.C. Semantic text classification: A survey of past and recent advances. Inf. Process. Manag. 2018, 54, 1129–1153.
11. Kadhim, A.I. Survey on supervised machine learning techniques for automatic text classification. Artif. Intell. Rev. 2019, 52, 273–292.
12. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A survey on text classification: From traditional to deep learning. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 1–41.
13. Hartmann, J.; Huppertz, J.; Schamp, C.; Heitmann, M. Comparing automated text classification methods. Int. J. Res. Mark. 2019, 36, 20–38.
14. Shah, K.; Patel, H.; Sanghvi, D.; Shah, M. A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augment. Hum. Res. 2020, 5, 12.
15. El Rifai, H.; Al Qadi, L.; Elnagar, A. Arabic text classification: The need for multi-labeling systems. Neural Comput. Appl. 2022, 34, 1135–1159.
16. Elnagar, A.; Al-Debsi, R.; Einea, O. Arabic text classification using deep learning models. Inf. Process. Manag. 2020, 57, 102121.
17. Dai, Y.; Guo, W.; Chen, X.; Zhang, Z. Relation classification via LSTMs based on sequence and tree structure. IEEE Access 2018, 6, 64927–64937.
18. Yuvaraj, N.; Chang, V.; Gobinathan, B.; Pinagapani, A.; Kannan, S.; Dhiman, G.; Rajan, A.R. Automatic detection of cyberbullying using multi-feature based artificial intelligence with deep decision tree classification. Comput. Electr. Eng. 2021, 92, 107186.
19. Yadav, B.P.; Ghate, S.; Harshavardhan, A.; Jhansi, G.; Kumar, K.S.; Sudarshan, E. Text categorization performance examination using machine learning algorithms. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Warangal, India, 9–10 October 2020; IOP Publishing: Warangal, India, 2020; p. 022044.
20. Yildiz, B. Efficient text classification with deep learning on imbalanced data improved with better distribution. Turk. J. Sci. Technol. 2022, 17, 89–98.
21. Köksal, Ö.; Yılmaz, E.H. Improving automated Turkish text classification with learning-based algorithms. Concurr. Comput. Pract. Exp. 2022, 34, e6874.
22. Karasoy, O.; Ballı, S. Spam SMS detection for Turkish language with deep text analysis and deep learning methods. Arab. J. Sci. Eng. 2022, 47, 9361–9377.
23. Bozyigit, F.; Dogan, O.; Kilinc, D. Categorization of customer complaints in food industry using machine learning approaches. J. Intell. Syst. Theory Appl. 2022, 5, 85–91.
24. Amasyalı, M.F.; Diri, B. Automatic Turkish text categorization in terms of author, genre and gender. In Natural Language Processing and Information Systems. NLDB 2006. Lecture Notes in Computer Science; Kop, C., Fliedl, G., Mayr, H.C., Métais, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3999, pp. 221–226.
25. Güran, A.; Akyokuş, S.; Bayazıt, N.G.; Gürbüz, M.Z. Turkish text categorization using n-gram words. In Proceedings of the International Symposium on Innovations in Intelligent Systems and Applications (INISTA 2009), Trabzon, Turkey, 29 June–1 July 2009; IEEE: Trabzon, Turkey, 2009; pp. 369–373.
26. Uysal, A.K.; Gunal, S. The impact of preprocessing on text classification. Inf. Process. Manag. 2014, 50, 104–112.
27. Yıldırım, S.; Yıldız, T. A comparative analysis of text classification for Turkish language. Pamukkale Univ. J. Eng. Sci. 2018, 24, 879–886.
28. Kuyumcu, B.; Aksakalli, C.; Delil, S. An automated new approach in fast text classification (fastText): A case study for Turkish text classification without pre-processing. In Proceedings of the 3rd International Conference on Natural Language Processing and Information Retrieval, ACM, Tokushima, Japan, 28–30 June 2019; pp. 1–4.
29. Çoban, Ö.; Özel, S.A.; İnan, A. Deep learning-based sentiment analysis of Facebook data: The case of Turkish users. Comput. J. 2021, 64, 473–499.
30. Dogru, H.B.; Tilki, S.; Jamil, A.; Hameed, A.A. Deep learning-based classification of news texts using doc2vec model. In Proceedings of the 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA), Riyadh, Saudi Arabia, 6–7 April 2021; IEEE: Riyadh, Saudi Arabia, 2021; pp. 91–96.
31. Zulqarnain, M.; Alsaedi, A.K.Z.; Ghazali, R.; Ghouse, M.G.; Sharif, W.; Husaini, N.A. A comparative analysis on question classification task based on deep learning approaches. PeerJ Comput. Sci. 2021, 7, e570.
32. Bektaş, J. Detection of economy-related Turkish tweets based on machine learning approaches. In Data Mining Approaches for Big Data and Sentiment Analysis in Social Media; El-Latif, A.A.A., Ed.; IGI Global: Hershey, PA, USA, 2022; pp. 171–195.
33. Eminagaoglu, M. A new similarity measure for vector space models in text classification and information retrieval. J. Inf. Sci. 2022, 48, 463–476.
34. Erkaya, A.E. Text Classification based on Organizational Data Using Machine Learning; Ankara Yıldırım Beyazıt Üniversitesi Fen Bilimleri Enstitüsü: Keçiören/Ankara, Türkiye, 2019.
35. Akın, A.A.; Akın, M.D. Zemberek, an open source NLP framework for Turkic languages. Structure 2007, 10, 1–5.
36. Kayabaş, A.; Schmid, H.; Topcu, A.E.; Kiliç, Ö. TRMOR: A finite-state-based morphological analyzer for Turkish. Turk. J. Electr. Eng. Comput. Sci. 2019, 27, 3837–3851.
37. Pandas. User Guide. NumFOCUS, Inc. Hosted by OVHcloud. 2022. Available online: https://pandas.pydata.org/docs/user_guide/index.html (accessed on 25 July 2022).
38. Matplotlib. Matplotlib: Visualization with Python. 2022. Available online: https://matplotlib.org (accessed on 27 July 2022).
39. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
40. Keras. Developer Guides. 2019. Available online: https://keras.io/guides/ (accessed on 29 July 2022).
41. Akın, A.A. zemberek-nlp. 2021. Available online: https://github.com/ahmetaa/zemberek-nlp (accessed on 15 August 2022).
42. Jaradat, A.; Safieddine, F.; Deraman, A.; Ali, O.; Al-Ahmad, A.; Alzoubi, Y.I. A probabilistic data fusion modeling approach for extracting true values from uncertain and conflicting attributes. Big Data Cogn. Comput. 2022, 6, 114.
43. Zhang, Z.-H.; Min, F.; Chen, G.-S.; Shen, S.-P.; Wen, Z.-C.; Zhou, X.-B. Tri-partition state alphabet-based sequential pattern for multivariate time series. Cogn. Comput. 2022, 14, 1881–1899.
44. Hossain, T.; Mauni, H.Z.; Rab, R. Reducing the effect of imbalance in text classification using SVD and GloVe with ensemble and deep learning. Comput. Inform. 2022, 41, 98–115.
45. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
Study | Technique/Algorithm Used | Findings |
---|---|---|
[24] | SVM, NB, RF | NB is the best for author identification; SVM is the best for gender and genre identification |
[25] | DT-J48, K-NN, Bayesian Probabilistic classifiers, N-gram method | K-NN achieved 65.5%, BN achieved 94%, and J48 achieved 75% |
[26] | SVM, Micro-F1 | The preprocessing phase is as critical as feature extraction and feature selection |
[27] | Bag-of-words approach, artificial neural network | Stop-word removal and morphological analysis had little effect on the outcome |
[28] | FastText tested using NB, K-NN, J48 | Multinomial NB classifier achieved the best at 90.12% |
[29] | DL techniques | Recurrent neural networks obtained the highest accuracy of 91.6% |
[30] | DL, ML techniques (NB, SVM, RF, and Gaussian NB) | CNN classification scored 94.17% on the Turkish sample compared with 96.41% on the English sample |
[31] | DL techniques (Gated Recurrent Unit, LSTM, CNN) | DL algorithms achieved an accuracy of 93.7% on the question dataset |
[32] | SVM, NB, LR, and LR integrated with SVM | Integrating SVM with LR produced the best results (82.9%) |
[23] | LR, NB, K-NN, SVM, RF applied on TF-IDF and word2vec | Extreme Gradient Boosting with TF-IDF weighting achieved the best F-measure (86%) |
[33] | Proposed a similarity metric that can be used for K-NN and k-means | The suggested metric might be employed in any applicable method or model for data acquisition and text classification |
[22] | ML (NB, RF, SVM, multilayer perceptron, Random Subspace, LR, K-NN), DL (CNN and LSTM) | CNN scored the best with a 99.86% accuracy rate |
[21] | NB, LR, K-NN, SVM, RF | The new technique outperformed earlier F1-score-based news classification experiments and achieved 96.00% accuracy. |
[20] | LSTM | A new data distribution methodology |
This study | k-means technique, TF-IDF, SVM, NB, LSTM, RF, LR | LSTM was found to be the most effective technique in terms of accuracy, and data preparation is important for the overall performance of the algorithm used |
Dataset | Maximum Number of Words | Minimum Number of Words | Average Number of Words |
---|---|---|---|
Raw | 3026 | 1 | 33 |
Prepared | 1171 | 1 | 18 |
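As a rough illustration of how per-document word counts like those in the table above might be derived, the following sketch computes the maximum, minimum, and average number of words over a list of texts. Whitespace tokenization, the function name `word_count_stats`, and the sample strings are assumptions made for this example; the study's own tokenization is not described here.

```python
def word_count_stats(texts):
    """Return (max, min, mean) word counts for a list of documents.

    Assumption: words are separated by whitespace; the study's own
    tokenizer may differ.
    """
    counts = [len(text.split()) for text in texts]
    if not counts:
        raise ValueError("texts must contain at least one document")
    return max(counts), min(counts), sum(counts) / len(counts)


# Hypothetical sample documents, not taken from the study's dataset.
sample = ["sokak lambası yanmıyor", "çukur", "su kesintisi"]
longest, shortest, average = word_count_stats(sample)
print(f"max={longest} min={shortest} avg={average:.1f}")
```

Running the same function on the raw and the prepared versions of a corpus would give one table row each.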
Dataset | SVM | RF | NB | LR | LSTM |
---|---|---|---|---|---|
Raw | 46,943.523 | 36.727 | 41.663 | 5850.307 | 60,214.325 |
Preprocessed without simplification step | 9879.506 | 23.387 | 20.232 | 5418.467 | 58,879.152 |
Preprocessed | 8865.088 | 7.445 | 0.227 | 39.614 | 10,426.658 |
Stop word | sokak, cadde, mahalle, mah, mh, istinaden, no, tel, cep, sk, faks, te, kap, iç, gerek, bulvar, ilçe, il, arz, sayın, etmek, başvuru, eski, meydan, gelmek, null, saat, fax, cad, sok, ara, civar, bura, ora, kişi, görev, başlamak, yaşamak, binmek, sıkıntı, ad, taraf, soy, acilen, çöz, bulunmak, müdahale, numara, bilgi, vermek, birim, söz, yarmak, iyi, sayın, günlemek, tarih, yetkili, başkan, mağdur, vatandaş, şikayet, nol, anmak, yeni, ivedilikle, mağdur, temiz, yolmak, zor, kalmak, demek, almak, bina, gitmek, patlak, konu, ev, durum, istemek, kontrol, geçmek, nol, ivedi, rica, mevcut, park, gün, site, kullanmak, büyükşehir, bey, beklemek, lütfen, yok, mağduriyet, gidermek, talep, şikâyet, belediye. |
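A stop-word list such as the one above is typically applied as a simple set-membership filter. The sketch below is a minimal example under stated assumptions: only a handful of the listed words are included, the tokens are assumed to be lowercase lemmas already (for instance, produced by a morphological analyzer), and the name `remove_stop_words` is hypothetical.

```python
# Assumption: a small subset of the stop words listed above, for brevity.
STOP_WORDS = frozenset(
    {"sokak", "cadde", "mahalle", "sayın", "lütfen", "belediye", "şikayet", "talep"}
)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop-word set.

    Assumption: tokens are already lowercase lemmas, so an exact match
    is enough and no Turkish-specific case folding is attempted here.
    """
    return [token for token in tokens if token not in stop_words]


# Hypothetical token sequence, not taken from the study's dataset.
tokens = ["sayın", "belediye", "çukur", "sokak", "su"]
print(remove_stop_words(tokens))
```

Exact matching is used deliberately: Python's default `str.lower()` does not follow Turkish casing rules for dotted and dotless i, so normalization is better handled before this step.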
Algorithm | Precision | Accuracy | F1-Score | Recall |
---|---|---|---|---|
RF | 9% | 10% | 10% | 12% |
NB | 29% | 35% | 29% | 29% |
LR | 50% | 56% | 52% | 54% |
SVM | 60% | 59% | 64% | 64% |
Algorithm | Precision | Accuracy | F1-Score | Recall |
---|---|---|---|---|
RF | 33% | 18% | 7% | 18% |
NB | 72% | 70% | 66% | 70% |
LR | 76% | 76% | 75% | 77% |
SVM | 77% | 78% | 77% | 78% |
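The four metrics reported in the two performance tables above can be computed from true and predicted labels as sketched below. Macro averaging (an unweighted mean over classes) is an assumption of this example; the tables do not state which averaging scheme was used, and the function name `macro_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration only.
scores = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print({name: round(value, 3) for name, value in scores.items()})
```

Because the macro F1 is the mean of per-class F1 values rather than the harmonic mean of the averaged precision and recall, it does not have to lie between those two averages.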
Data | RF | LR | NB | SVM | LSTM |
---|---|---|---|---|---|
Raw | 36.727 | 5850.307 | 41.663 | 46,943.523 | 60,214.325 |
Prepared | 7.445 | 39.614 | 0.227 | 8865.088 | 10,426.658 |
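To make the effect of data preparation on training time easier to compare across algorithms, the ratio of raw to prepared training time can be computed directly from the table above. This is a small arithmetic sketch; the values are copied from the table, the unit is not stated there and is assumed to be the same in both rows, and the helper name `speedup` is hypothetical.

```python
# Training times copied from the table above (unit not stated in the table).
RAW = {"RF": 36.727, "LR": 5850.307, "NB": 41.663, "SVM": 46943.523, "LSTM": 60214.325}
PREPARED = {"RF": 7.445, "LR": 39.614, "NB": 0.227, "SVM": 8865.088, "LSTM": 10426.658}


def speedup(raw, prepared):
    """Return raw/prepared training-time ratios per algorithm."""
    return {name: raw[name] / prepared[name] for name in raw}


ratios = speedup(RAW, PREPARED)
# List algorithms from the largest to the smallest reduction in training time.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: training time reduced by a factor of {ratio:.1f}")
```

On these figures the reduction is largest for NB and LR (two orders of magnitude) and roughly five- to six-fold for RF, SVM, and LSTM.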
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alzoubi, Y.I.; Topcu, A.E.; Erkaya, A.E. Machine Learning-Based Text Classification Comparison: Turkish Language Context. Appl. Sci. 2023, 13, 9428. https://doi.org/10.3390/app13169428