Multi-Keyword Classification: A Case Study in Finnish Social Sciences Data Archive
Abstract
1. Introduction
- We introduce the FSD dataset as a benchmark for the MLC task. Properties such as its size and bilingual metadata make it distinctive among MLC benchmarks.
- We evaluate the performance of several textual MLC approaches in the domain of social sciences, applying a naive baseline as well as the ML-kNN, ML-RF, X-BERT and Parabel techniques to the MLC problem.
- We compare a set of text representation techniques by combining each of them with the selected MLC methods. This quantitative evaluation provides an objective comparison of the text representation techniques.
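As a minimal sketch of what a naive multi-label baseline for keyword assignment could look like (an assumption for illustration only; the paper's Naive approach in Section 5.1 may differ, and all names and data here are hypothetical): ignore the study description entirely and always predict the globally most frequent training keywords.

```python
from collections import Counter

def naive_top_k(train_labels, k=3):
    """Build a baseline predictor that returns the k most frequent
    labels of the training set for every input, ignoring the text."""
    counts = Counter(label for labels in train_labels for label in labels)
    top = [label for label, _ in counts.most_common(k)]
    return lambda _text: top

# Hypothetical keyword sets for three training studies.
train = [
    {"employment", "income"},
    {"employment", "education"},
    {"income", "health"},
]
predict = naive_top_k(train, k=2)
print(predict("any study description"))  # the two most frequent keywords
```

Such a baseline sets a floor that any text-aware MLC method should beat.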
2. FSD and Analysis of Workflow
3. Dataset
4. Feature Extraction
4.1. Bag of Words
4.2. Term Frequency—Inverse Document Frequency
4.3. Pretrained Word Embeddings
5. Multi-Label Classification Methods
5.1. Naive Approach
5.2. Multi-Label k Nearest Neighbours
5.3. Multi-Label Random Forest
5.4. X-BERT
5.5. Parabel
6. Experimental Evaluation
6.1. Experimental Setup
6.2. Evaluation Metrics
6.3. Results
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Loza Mencía, E.; Fürnkranz, J. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5212, pp. 50–65.
- Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. Extreme Multi-Label Legal Text Classification: A case study in EU Legislation. arXiv 2019, arXiv:1905.10892.
- You, R.; Zhang, Z.; Wang, Z.; Dai, S.; Mamitsuka, H.; Zhu, S. AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification. Adv. Neural Inf. Process. Syst. 2019, 32, 5820–5830.
- Prabhu, Y.; Kag, A.; Harsola, S.; Agrawal, R.; Varma, M. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the World Wide Web Conference (WWW), Lyon, France, 23–27 April 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 993–1002.
- Papanikolaou, Y.; Tsoumakas, G.; Laliotis, M.; Markantonatos, N.; Vlahavas, I. Large-scale online semantic indexing of biomedical articles via an ensemble of multi-label classification models. J. Biomed. Semant. 2017, 8, 1–13.
- Al-Salemi, B.; Ayob, M.; Kendall, G.; Noah, S.A.M. Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms. Inf. Process. Manag. 2019, 56, 212–227.
- Fiallos, A.; Jimenes, K. Using Reddit data for multi-label text classification of Twitter users interests. In Proceedings of the 2019 6th International Conference on eDemocracy and eGovernment (ICEDEG), Quito, Ecuador, 24–26 April 2019; pp. 324–327.
- Chang, W.C.; Yu, H.F.; Zhong, K.; Yang, Y.; Dhillon, I. Taming Pretrained Transformers for Extreme Multi-label Text Classification. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 3163–3171.
- Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the stratification of multi-label data. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Athens, Greece, 5–9 September 2011; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6913, pp. 145–158.
- Thakur, N.; Han, C.Y. Country-Specific Interests towards Fall Detection from 2004–2021: An Open Access Dataset and Research Questions. Data 2021, 6, 92.
- Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 2003, 3, 1137–1155.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, Workshop Track Proceedings, Scottsdale, AZ, USA, 2–4 May 2013.
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146.
- Ali, F.; Ali, A.; Imran, M.; Naqvi, R.A.; Siddiqi, M.H.; Kwak, K.S. Traffic accident detection and condition analysis based on social networking data. Accid. Anal. Prev. 2021, 151, 105973.
- Mikolov, T.; Grave, E.; Bojanowski, P.; Puhrsch, C.; Joulin, A. Advances in Pre-Training Distributed Word Representations. In Proceedings of the LREC 2018, 11th International Conference on Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018; pp. 52–55.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation: San Francisco, CA, USA, 2017; Volume 2017, pp. 5999–6009.
- Virtanen, A.; Kanerva, J.; Ilo, R.; Luoma, J.; Luotolahti, J.; Salakoski, T.; Ginter, F.; Pyysalo, S. Multilingual is not enough: BERT for Finnish. arXiv 2019, arXiv:1912.07076.
- Khattak, F.K.; Jeblee, S.; Pou-Prom, C.; Abdalla, M.; Meaney, C.; Rudzicz, F. A survey of word embeddings for clinical text. J. Biomed. Inform. 2019, 100, 100057.
- Boutell, M.R.; Luo, J.; Shen, X.; Brown, C.M. Learning multi-label scene classification. Pattern Recognit. 2004, 37, 1757–1771.
- Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier chains for multi-label classification. Mach. Learn. 2011, 85, 333–359.
- Hüllermeier, E.; Fürnkranz, J.; Cheng, W.; Brinker, K. Label ranking by learning pairwise preferences. Artif. Intell. 2008, 172, 1897–1916.
- Tsoumakas, G.; Vlahavas, I. Random k-Labelsets: An Ensemble Method for Multilabel Classification. In Proceedings of the European Conference on Machine Learning, Warsaw, Poland, 17–21 September 2007; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4701, pp. 406–417.
- Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
- Schapire, R.E.; Singer, Y. Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 1999, 37, 297–336.
- Cheng, W.; Hüllermeier, E. Combining Instance-Based Learning and Logistic Regression for Multilabel Classification. In Proceedings of the LWA 2009, Workshop-Woche: Lernen-Wissen-Adaptivität, Darmstadt, Germany, 21–23 September 2009; p. 6.
- Spyromitros, E.; Tsoumakas, G.; Vlahavas, I. An Empirical Study of Lazy Multilabel Classification Algorithms. In Proceedings of the Hellenic Conference on Artificial Intelligence, Syros, Greece, 2–4 October 2008; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5138, pp. 401–406.
- Zhang, M.L.; Zhou, Z.H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Agrawal, R.; Gupta, A.; Prabhu, Y.; Varma, M. Multi-Label Learning with Millions of Labels; Association for Computing Machinery (ACM): New York, NY, USA, 2013; pp. 13–24.
- Wu, X.; Gao, Y.; Jiao, D. Multi-label classification based on Random Forest algorithm for non-intrusive load monitoring system. Processes 2019, 7, 337.
- Zhang, M.L.; Zhou, Z.H. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 2006, 18, 1338–1351.
- Kurata, G.; Xiang, B.; Zhou, B. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), San Diego, CA, USA, 12–17 June 2016; pp. 521–526.
- Yang, P.; Sun, X.; Li, W.; Ma, S.; Wu, W.; Wang, H. SGM: Sequence Generation Model for Multi-label Classification. arXiv 2018, arXiv:1806.04822.
- Chang, W.C.; Yu, H.F.; Zhong, K.; Yang, Y.; Dhillon, I.S. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from Transformers. arXiv 2020, arXiv:1905.02331.
- Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2018; pp. 2227–2237.
- Morin, F.; Bengio, Y. Hierarchical Probabilistic Neural Network Language Model; PMLR: Boulder, CO, USA, 2005; pp. 246–252.
- Babbar, R.; Schölkopf, B. DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification. In Proceedings of the WSDM 2017, 10th ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 721–729.
- Järvelin, K.; Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 2002, 20, 422–446.
- Poerner, N.; Schütze, H. Multi-View Domain Adapted Sentence Embeddings for Low-Resource Unsupervised Duplicate Question Detection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019.
- Eberhard, L.; Walk, S.; Posch, L.; Helic, D. Evaluating Narrative-Driven Movie Recommendations on Reddit. In Proceedings of the 24th International Conference on Intelligent User Interfaces, Los Angeles, CA, USA, 17–20 March 2019.
|  | Finnish Metadata | English Metadata | Total |
|---|---|---|---|
| Number of Studies | 1484 | 1484 | 2968 |
| Distinct Labels | 2896 | 1709 | 4605 |
| Labels per Study | 1.95 | 1.15 | 1.55 |
|  | Ground Truth: Positive (1) | Ground Truth: Negative (0) |
|---|---|---|
| Predicted: Positive (1) | True Positive | False Positive |
| Predicted: Negative (0) | False Negative | True Negative |
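From the confusion-matrix terms above, precision, recall and F1 follow directly. As a sketch (the averaging scheme is an assumption here; micro-averaging over all study-label decisions is shown, while the paper may average per example or per label), the scores can be computed from predicted and true keyword sets:

```python
def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 over all (study, label)
    decisions, using the confusion-matrix counts TP, FP, FN."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but not in ground truth
        fn += len(true - pred)   # ground-truth labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two hypothetical studies with true vs. predicted keyword sets.
p, r, f = micro_prf(
    [{"employment", "income"}, {"health"}],
    [{"employment"}, {"health", "education"}],
)
```

Note that F1, as the harmonic mean, can never exceed either precision or recall.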
**English metadata**

| Method | Features | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Naive | BoW | 0.19 | 0.13 | 0.32 |
| ML-kNN | TF-IDF | 0.31 | 0.57 | 0.40 |
| ML-kNN | FastText | 0.30 | 0.52 | 0.38 |
| ML-kNN | BERT | 0.28 | 0.46 | 0.35 |
| ML-kNN | Multi-BERT | 0.29 | 0.47 | 0.36 |
| ML-RF | TF-IDF | 0.58 | 0.75 | 0.65 |
| ML-RF | FastText | 0.52 | 0.68 | 0.59 |
| ML-RF | BERT | 0.47 | 0.70 | 0.57 |
| ML-RF | Multi-BERT | 0.46 | 0.71 | 0.56 |
| X-BERT | - | 0.54 | 0.71 | 0.62 |
| PARABEL | TF-IDF | 0.59 | 0.78 | 0.67 |

**Finnish metadata**

| Method | Features | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Naive | BoW | 0.15 | 0.11 | 0.24 |
| ML-kNN | TF-IDF | 0.28 | 0.56 | 0.38 |
| ML-kNN | FastText | 0.27 | 0.49 | 0.35 |
| ML-kNN | BERT | 0.27 | 0.50 | 0.35 |
| ML-kNN | Multi-BERT | 0.25 | 0.44 | 0.32 |
| ML-RF | TF-IDF | 0.60 | 0.76 | 0.67 |
| ML-RF | FastText | 0.53 | 0.69 | 0.60 |
| ML-RF | BERT | 0.49 | 0.69 | 0.58 |
| ML-RF | Multi-BERT | 0.43 | 0.68 | 0.52 |
| X-BERT | - | 0.41 | 0.58 | 0.48 |
| PARABEL | TF-IDF | 0.58 | 0.75 | 0.66 |
**English metadata**

| Method | Features | nDCG@1 | nDCG@3 | nDCG@5 |
|---|---|---|---|---|
| ML-kNN | TF-IDF | 0.65 | 0.60 | 0.58 |
| ML-kNN | FastText | 0.60 | 0.56 | 0.54 |
| ML-kNN | BERT | 0.57 | 0.53 | 0.51 |
| ML-kNN | Multi-BERT | 0.54 | 0.51 | 0.49 |
| ML-RF | TF-IDF | 0.77 | 0.74 | 0.71 |
| ML-RF | FastText | 0.71 | 0.66 | 0.64 |
| ML-RF | BERT | 0.65 | 0.62 | 0.60 |
| ML-RF | Multi-BERT | 0.62 | 0.60 | 0.58 |
| X-BERT | - | 0.79 | 0.75 | 0.72 |
| PARABEL | TF-IDF | 0.83 | 0.78 | 0.76 |

**Finnish metadata**

| Method | Features | nDCG@1 | nDCG@3 | nDCG@5 |
|---|---|---|---|---|
| ML-kNN | TF-IDF | 0.65 | 0.61 | 0.60 |
| ML-kNN | FastText | 0.60 | 0.57 | 0.55 |
| ML-kNN | BERT | 0.59 | 0.57 | 0.56 |
| ML-kNN | Multi-BERT | 0.54 | 0.52 | 0.50 |
| ML-RF | TF-IDF | 0.78 | 0.75 | 0.73 |
| ML-RF | FastText | 0.75 | 0.71 | 0.69 |
| ML-RF | BERT | 0.72 | 0.68 | 0.66 |
| ML-RF | Multi-BERT | 0.64 | 0.61 | 0.58 |
| X-BERT | - | 0.69 | 0.64 | 0.60 |
| PARABEL | TF-IDF | 0.80 | 0.76 | 0.74 |
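The ranking metric reported above, nDCG@k, follows the cumulated-gain formulation of Järvelin and Kekäläinen cited in the references. A sketch with binary keyword relevance (the exact gain and discount variant used by the authors is an assumption here):

```python
import math

def ndcg_at_k(ranked_labels, relevant, k):
    """nDCG@k with binary relevance: DCG@k sums 1/log2(i+1) over
    relevant items at ranks i = 1..k; the ideal DCG places all
    relevant labels at the top ranks."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, label in enumerate(ranked_labels[:k], start=1)
        if label in relevant
    )
    ideal = sum(
        1.0 / math.log2(i + 1)
        for i in range(1, min(k, len(relevant)) + 1)
    )
    return dcg / ideal if ideal else 0.0
```

For example, a predicted ranking that places one relevant keyword first and another third scores below 1.0 at k = 3, since the ideal ranking would put both relevant keywords in the top two positions.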
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Skenderi, E.; Huhtamäki, J.; Stefanidis, K. Multi-Keyword Classification: A Case Study in Finnish Social Sciences Data Archive. Information 2021, 12, 491. https://doi.org/10.3390/info12120491