Article

IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents

1 Graduate School of Information Science and Engineering, Ritsumeikan University, Ibaraki 5678570, Osaka, Japan
2 Ministry of National Development Planning/BAPPENAS, Jakarta 10310, Indonesia
3 College of Information Science and Engineering, Ritsumeikan University, Ibaraki 5678570, Osaka, Japan
4 Center for Democracy Studies Aarau (ZDA), University of Zurich, 8006 Zurich, Switzerland
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2024, 8(11), 153; https://doi.org/10.3390/bdcc8110153
Submission received: 3 October 2024 / Revised: 29 October 2024 / Accepted: 7 November 2024 / Published: 9 November 2024
(This article belongs to the Special Issue Artificial Intelligence and Natural Language Processing)

Abstract

Achieving the Sustainable Development Goals (SDGs) requires collaboration among various stakeholders, particularly governments and non-state actors (NSAs). This collaboration both produces and depends on a continually growing volume of documents that government officials need to analyze and process systematically. Artificial Intelligence and Natural Language Processing (NLP) could thus offer valuable support for progressing towards SDG targets, including automating government budget tagging, classifying NSA requests and initiatives, and helping uncover possibilities for matching these two categories of activities. Many non-English-speaking countries, including Indonesia, however, have limited NLP resources, such as domain-specific pre-trained language models (PTLMs). This makes it difficult to automate document processing and improve the efficacy of SDG-related government efforts. The presented study introduces IndoGovBERT, a Bidirectional Encoder Representations from Transformers (BERT)-based PTLM built with domain-specific corpora, leveraging the Indonesian government’s public and internal documents. The model is intended to automate various laborious tasks of SDG document processing by the Indonesian government. Different approaches to PTLM development known from the literature are examined in the context of typical government settings. The methodology that is most effective in terms of the resultant model performance, and also most efficient in terms of the computational resources required, is determined and deployed for the development of the IndoGovBERT model. The developed model is then scrutinized in several text classification and similarity assessment experiments, where it is compared with four Indonesian general-purpose language models, the non-transformer Multilabel Topic Model (MLTM) approach, and a Multilingual BERT model. The results obtained in all experiments highlight the superior capability of the IndoGovBERT model for Indonesian government SDG document processing. This suggests that the proposed PTLM development methodology could be adopted to build high-performance specialized PTLMs for governments around the globe that face SDG document processing and other NLP challenges similar to those dealt with in the presented study.
Keywords: pre-trained language model (PTLM); government document processing; PTLM development methodology; document classification; text similarity assessment

Share and Cite

MDPI and ACS Style

Riyadi, A.; Kovacs, M.; Serdült, U.; Kryssanov, V. IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents. Big Data Cogn. Comput. 2024, 8, 153. https://doi.org/10.3390/bdcc8110153

AMA Style

Riyadi A, Kovacs M, Serdült U, Kryssanov V. IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents. Big Data and Cognitive Computing. 2024; 8(11):153. https://doi.org/10.3390/bdcc8110153

Chicago/Turabian Style

Riyadi, Agus, Mate Kovacs, Uwe Serdült, and Victor Kryssanov. 2024. "IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents" Big Data and Cognitive Computing 8, no. 11: 153. https://doi.org/10.3390/bdcc8110153

APA Style

Riyadi, A., Kovacs, M., Serdült, U., & Kryssanov, V. (2024). IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents. Big Data and Cognitive Computing, 8(11), 153. https://doi.org/10.3390/bdcc8110153
