1. Introduction
Indonesia is one of the most populous countries in the world and thus faces various challenges in addressing health issues. According to the Ministry of Health of the Republic of Indonesia, as of 30 December 2021, Indonesia had a population of 273,879,750 people spread across 16,722 islands, with 26.5 million people categorized as poor in September 2021 [1]. The level of poverty in Indonesia is closely related to public health problems. People living in poverty tend to lack adequate access to healthcare services. Furthermore, Indonesia still has a high incidence of infectious diseases, especially tuberculosis (TB), pneumonia, hepatitis, diarrhea, COVID-19, measles, polio, dengue fever, and others. The national prevalence of non-communicable diseases has also shown an increasing trend in recent years [1].
Agustina et al. (2019) [2] stated that the Universal Health Coverage (UHC) program can be implemented to ensure safe, affordable, and effective access to healthcare services without financial hardship, in line with the Sustainable Development Goals (SDGs) adopted by the United Nations. To support the achievement of UHC, the Indonesian government enacted Law of the Republic of Indonesia No. 24 of 2011, which establishes Indonesia’s National Health Insurance (BPJS) as the implementing body for the social health insurance program. BPJS officially started its operation in 2014 and became a significant step in improving public access to affordable healthcare.
The President Director of BPJS for Health, Ali Ghufron Mukti, stated that the COVID-19 pandemic prompted BPJS for Health to develop the Mobile JKN application to transition from traditional face-to-face services to digital services [3]. Mobile JKN is an application developed by BPJS that provides features to view information and membership status, register, and make claims for healthcare treatment reimbursement for participants of the National Health Insurance-Healthy Indonesian Card (JKN-KIS) program. Mobile JKN was introduced in 2017 as a form of technological development that encourages the use of digital services [4]. The COVID-19 pandemic has driven the increased use and development of the Mobile JKN application. During the COVID-19 pandemic, BPJS provided online queue systems using the Mobile JKN application, remote consultations (teleconsultations), online prescriptions, and online referral services. From 20 March to 21 July 2021, the teleconsultation service of the Mobile JKN application was used by 9656 doctors in Primary Healthcare Facilities (FKTP) [3].
As of 15 February 2023, the Mobile JKN application had been downloaded by more than 10 million users and had received 470,000 user reviews on the Google Play platform. Google Play is an online store visited by users to find applications, games, movies, TV shows, books, and other content on smartphones that use the Android operating system (Source: play.google.com, accessed on 20 February 2023). Every application has its strengths and weaknesses, which are conveyed through user reviews in the review section. User reviews provide evaluations that the government can use to improve the quality of public health insurance services in the future. Therefore, sentiment analysis of user reviews becomes crucial in evaluating user satisfaction with the services and determining areas that need improvement.
Sentiment analysis is a technique to detect favorable and unfavorable opinions about a specific subject (such as organizations and their products) that can be used for various purposes [5]. According to Medhat et al. (2014) [6], sentiment analysis, or opinion mining, is the computational study of people’s opinions, attitudes, and emotions toward an entity. Shaik et al. (2022) [7] state that sentiment analysis is one of the most widely used applications of Natural Language Processing (NLP) to identify the intentions of individuals from their reviews. Sentiment analysis is performed using machine learning to classify text reviews as positive, neutral, or negative.
The field of sentiment analysis has witnessed significant advancements, with recent studies delving into various methodologies to enhance accuracy and applicability. Noteworthy contributions include the work of Wu et al. (2021), who integrated rich syntactic knowledge to improve aspect and opinion terms extraction through syntax fusion encoding and high-order scoring mechanisms [8]. In a different vein, Tian et al. (2023) proposed an end-to-end aspect-based sentiment analysis (EASA) approach utilizing combinatory categorial grammar (CCG) to capture both syntactic and semantic information, yielding state-of-the-art results [9].
Li et al. (2021) introduced supervised contrastive pre-training to recognize implicit sentiment orientation, enriching aspect-based sentiment analysis by capturing both explicit and implicit sentiments [10]. Shi et al. (2022) addressed limitations in structured sentiment analysis by proposing a novel labeling strategy and a graph attention network-based model, significantly surpassing previous state-of-the-art models [11]. Fei et al. (2022) focused on enhancing the robustness of ABSA models through multi-faceted improvements, spanning model design, data augmentation, and advanced training strategies [12].
Moreover, Huang et al. (2020) introduced a weakly supervised approach for aspect-based sentiment analysis, utilizing sentiment aspect joint topic embeddings and neural classifiers to overcome the absence of labeled examples [13]. Li et al. (2022) bridged the gap between sentiment analysis and dialogue contexts by introducing the conversational aspect-based sentiment quadruple analysis task, and providing a benchmark dataset and model for cross-utterance quadruple extraction [14]. Another contribution by Fei (2020) involved the development of a Latent Emotion Memory network for multi-label emotion classification, integrating latent emotion distribution and context information to achieve state-of-the-art results [15].
These prominent studies collectively shape the landscape of sentiment analysis, harnessing innovative techniques to enhance accuracy, adaptability, and robustness. By drawing insights from these methodologies, this study endeavors to bring a novel perspective to sentiment analysis within the context of user reviews for the Mobile JKN application.
According to Uysal and Gunal (2014) [16], the framework stages for text classification consist of preprocessing, word representation, feature selection, classification, and model performance evaluation. Furthermore, model performance can be enhanced through hyperparameter tuning. Hyperparameter tuning is the process of finding more optimal hyperparameter values for the model. Each stage of the framework affects the performance of the classification model created.
This structured approach consists of five fundamental stages, each playing a crucial role in shaping the process and outcomes of our analysis. The initial phase involves preprocessing, wherein the raw text data are refined and prepared for subsequent analysis. The subsequent stage encompasses word representation, wherein the text is transformed into a numerical format suitable for machine learning algorithms. Feature selection follows, where pertinent features are carefully chosen to enhance model efficacy. Classification, the subsequent stage, entails the application of machine learning techniques to categorize the text. Finally, the framework concludes with model performance evaluation, whereby the effectiveness of the classification model is rigorously assessed using established metrics such as Accuracy, Precision, Recall, and F1-Score. The choice of evaluation metric depends on the complexity and distribution of the data.
Şahin and Kılıç (2019) [17] used the F1-Score metric to evaluate a classification model on the imbalanced Reuters-21578 dataset. Padurariu and Breaban (2019) [18] also used the F1-Score metric in their study to evaluate a classification model on a dataset containing work experience. In the case of an imbalanced dataset, F1-Score is commonly used because it is the harmonic mean of precision and recall and therefore weights the two equally. This performance metric serves as a reference for hyperparameter tuning.
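For reference, the metric can be written out explicitly. The notation below (TP, FP, and FN for true positives, false positives, and false negatives) is the standard confusion-matrix notation and is assumed here, since it is not defined in this section:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

Because the harmonic mean is pulled toward the smaller of its two arguments, a model needs both high precision and high recall to obtain a high F1-Score.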
In the classification stage with machine learning, Mantovani et al. (2019) [19] state that most machine learning algorithms are sensitive to the values of hyperparameters, which directly affect the performance of the model. One of the commonly used machine learning algorithms for various problems is the Support Vector Machine (SVM). The performance of an SVM model is highly influenced by the values of hyperparameters such as the kernel function, gamma (γ), polynomial degree, and regularized constant. Hyperparameter tuning by changing these hyperparameter values can improve the performance of the SVM model.
Previous research on sentiment analysis has been conducted in the banking services domain by Sari and Irhamah (2020) [20]. Their study classified Twitter data into positive and negative sentiments using the Term Frequency Inverse Document Frequency (TF-IDF) word representation as the input for the Naïve Bayes Classifier (NBC) and SVM algorithms with SMOTE. In their research, Mahendrajaya (2019) [21] conducted sentiment analysis on user opinion tweets about Gopay services. The study used a lexicon-based method to label the sentiment as positive or negative. The word representation used was TF-IDF as the input for SVM algorithms with linear and polynomial kernels for classification.
Feature selection methods can be applied in classification to improve model performance by reducing the number of features used. Cahyono (2017) [22] states that feature selection is used to reduce a large feature set into a smaller subset of relevant features. Feature selection reduces computational time and improves model efficiency by using only the features considered relevant or most impactful on the model. Sentiment analysis research on COVID-19 vaccination using Naïve Bayes Classifier with Chi-Square feature selection and Particle Swarm Optimization has been conducted by Septiana et al. (2021) [23], with the Chi-Square feature selection yielding the best performance by improving the model’s accuracy from 63.69% to 69.13%. Furthermore, Luthfiana et al. (2020) [24] conducted sentiment analysis on user reviews of an application dataset consisting of 553 reviews for three classes: positive, neutral, and negative sentiments, using the SVM method and Chi-Square feature selection. Without feature selection, the research obtained an accuracy of 69%, precision of 48%, recall of 53%, and F1-Score of 50%. After applying feature selection, the model’s performance improved, with an accuracy of 77%, precision of 50%, recall of 55%, and F1-Score of 73%. The research also performed hyperparameter tuning on the regularized constant and gamma.
In this study, the domain of sentiment analysis is explored, building upon the SVM approach in conjunction with Chi-Square feature selection, as presented in the framework proposed by Luthfiana et al. (2020). The distinctiveness of our study lies in the utilization of advanced technical strategies to address specific challenges. The TF-IDF (Term Frequency-Inverse Document Frequency) methodology is harnessed for word representation, a robust technique that gauges word importance by considering their prevalence across the entire text corpus. Additionally, hyperparameter tuning is undertaken by optimizing the regularized constant, rooted in the F1-Score metric. This methodical calibration of parameters is a strategic effort aimed at enhancing model performance, thereby refining sentiment classification outcomes.
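To make the TF-IDF weighting concrete, the following is a minimal pure-Python sketch. The exact variant used in the study is not stated in this section, so the formula here (term frequency as count divided by document length, inverse document frequency as log(N / df)) is an assumption, and the function name `tf_idf` and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy tokenized reviews (hypothetical, not taken from the dataset).
docs = [
    ["aplikasi", "bagus", "mudah"],
    ["aplikasi", "error", "tidak", "bisa"],
    ["aplikasi", "bantu"],
]
weights = tf_idf(docs)
```

In this toy corpus, "aplikasi" appears in every document, so its weight is zero under this variant, while rarer words such as "bagus" receive a positive weight. Library implementations typically add smoothing and normalization, so their values would differ.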
A noteworthy departure from the methodology of Luthfiana et al. (2020) pertains to the expansion of the dataset, which encompasses a larger volume of user reviews obtained from the Google Play platform. This augmentation facilitates a more comprehensive grasp of user sentiments, contributing to a more nuanced and insightful analysis. The dataset adheres to a binary classification scheme, categorizing sentiments as either positive or negative. This two-class classification framework forms the fundamental basis of the sentiment analysis endeavor.
Through these methodological enhancements, the goal is not only to replicate but to elevate the effectiveness of the SVM-based sentiment analysis paradigm. By embracing advanced techniques and broadening the scope of data utilization, this study introduces an evolved methodology that transcends prior limitations, deriving strength from its advanced technical underpinnings.
1.1. Problem Statement
How does sentiment analysis of user reviews for the Mobile JKN application using SVM classification and Chi-Square feature selection method work?
How well does the model’s performance using SVM classification and the Chi-Square feature selection method fare in conducting sentiment analysis of user reviews for the Mobile JKN application?
What is the optimal value of the regularized constant hyperparameter for the SVM method in sentiment analysis of user reviews for the Mobile JKN application, as determined by the F1-Score metric?
1.2. Model Limitation
The model in this study is limited by the following conditions:
The method employed includes Chi-Square feature selection and the SVM classification method;
The data used comprise reviews of the Mobile JKN application from the Indonesian Google Play Store, with a total of 7020 reviews collected through scraping between 1 February 2023 and 20 March 2023;
Sentiment analysis is performed by categorizing review data into two classes: positive sentiment and negative sentiment;
Sentiment analysis and computations are conducted using the Python programming language with an interpreter in the DataSpell IDE;
Model performance improvement is based on the F1-Score metric with hyperparameter tuning for the regularized constant and the “linear” kernel.
1.3. Broad Objectives
Obtain sentiment analysis results of user reviews for the Mobile JKN application;
Attain model performance for sentiment analysis of user reviews for the Mobile JKN application;
Determine the optimal value of the regularized constant hyperparameter for sentiment analysis of user reviews for the Mobile JKN application.
1.4. Contributions of This Work
Advanced Framework Integration: This study integrates Support Vector Machine (SVM) classification and Chi-Square feature selection within a unified framework, building on the approach of Luthfiana et al. (2020). This combination aims to harness the strengths of both techniques, leading to improved sentiment analysis accuracy and robustness;
Hyperparameter-Tuned Model: A significant contribution lies in the introduction of hyperparameter tuning, specifically optimizing the regularized constant, to tailor the SVM model’s performance for sentiment analysis. This strategic optimization, based on the F1-Score metric, showcases a commitment to refining model predictions for imbalanced datasets;
Focused Domain Application: The applicability of this approach extends to user-generated content by employing a dataset of Mobile JKN application reviews. This application-focused approach addresses the nuances and challenges unique to sentiment analysis in the context of real-world user reviews;
Clear Experimental Insights: This study provides a clear and detailed overview of the experimental methodology, encompassing text preprocessing, feature selection, model training, and performance evaluation. By elucidating each step, it offers insights into the mechanics and effectiveness of the approach;
Model Limitations and Significance: Recognizing the boundaries of this work, a dedicated section on model limitations is presented. This candid exploration of potential constraints contributes to a well-rounded understanding of the scope and implications of the research.
3. Results
3.1. Data
The dataset utilized for this research comprises reviews of the Mobile JKN application, sourced from the Google Play Store’s digital distribution platform. Spanning from 1 February 2023 to 20 March 2023, the dataset encompasses a total of 7020 data points. Among these, 4777 instances manifest positive sentiment, while 2243 instances convey negative sentiment. This distribution illustrates an inherent class imbalance within the dataset, classifying it as an imbalanced dataset. Consequently, for the optimization of hyperparameters, the F1-Score emerges as the most pertinent metric, effectively addressing the intricacies of imbalanced classes.
The chosen dataset resonates with significance on several fronts. Its origin from a popular digital platform mirrors real-world user sentiments, rendering it authentic and indicative of user experiences. The imbalanced nature of the dataset parallels real-world scenarios where positive sentiments tend to outweigh negative sentiments, underscoring the relevance of handling imbalanced datasets within the domain of sentiment analysis. The dataset’s temporal scope encapsulates recent user feedback, aligning with contemporary user perceptions of the Mobile JKN application.
Furthermore, this dataset selection affords the opportunity to explore the challenges and strategies associated with class imbalance mitigation and hyperparameter optimization. The distinct characteristics of the dataset, including its volume, sentiment distribution, and relevance, converge to form a valuable foundation for investigating the efficacy of the proposed SVM and Chi-Square feature selection methodology. Through its intrinsic representation of real-world user sentiments, the dataset contributes both contextual authenticity and analytical depth to the research, ultimately enriching the study’s validity and applicability.
In essence, the dataset serves as a pivotal component of this research, epitomizing the interplay between authentic sentiment data, imbalanced class representation, and the proposed methodology’s effectiveness. The sample data can be seen in Table 2.
3.2. Preprocessed Data
Text preprocessing is performed on labeled data. This process involves several steps, including case folding to convert all letters to lowercase, stopword filtering to remove meaningless words, tokenizing to separate sentences into individual words, and stemming to derive the base form of words with affixes. Additionally, abbreviations are replaced with relevant full-length words, and special characters such as punctuation marks or emojis are removed. This stage also includes removing numbers that appear at the end of words, for example, transforming the word “masing2” to “masing”.
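The preprocessing steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the order of the steps, the tiny `ABBREVIATIONS` and `STOPWORDS` tables, and the function name `preprocess` are hypothetical, and stemming is omitted because it requires an Indonesian stemmer that is not specified in this section.

```python
import re

# Hypothetical, deliberately tiny lookup tables; the study's actual lists are not given.
ABBREVIATIONS = {"yg": "yang", "tdk": "tidak"}
STOPWORDS = {"yang", "dan", "di"}


def preprocess(review):
    """Apply case folding, cleaning, tokenizing, and stopword filtering to one review."""
    text = review.lower()                               # case folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # drop punctuation and emojis
    tokens = text.split()                               # tokenizing on whitespace
    tokens = [re.sub(r"\d+$", "", t) for t in tokens]   # "masing2" -> "masing"
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # expand abbreviations
    # Stopword filtering; tokens emptied by the digit removal are dropped too.
    return [t for t in tokens if t and t not in STOPWORDS]
```

A review that becomes an empty list after these steps corresponds to the empty (null) data that the study removes.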
Empty (null) review data are removed after this stage. Out of the total 7020 data points, there are 148 empty data points, resulting in 6872 data points after undergoing text preprocessing. Sample data that have undergone text preprocessing can be seen in Table 3.
3.3. Chi-Square Feature Selection
The preprocessed data are divided into a training dataset (80%) and a test dataset (20%). The training dataset consists of 5497 reviews, while the test dataset contains 1375 reviews. By class, the training dataset contains 3697 reviews with positive sentiment and 1800 reviews with negative sentiment.
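The split sizes can be reproduced with a simple sketch. How the study shuffles the data, which random seed it uses, and whether the split is stratified are not stated, so the random shuffle, the seed, and the function name `split_train_test` below are assumptions.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and cut them into a training part and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)          # seeded for reproducibility
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]


# 6872 placeholder review indices, matching the dataset size after preprocessing.
train, test = split_train_test(range(6872))
```

With 6872 reviews, rounding the 80% share down gives 5497 training reviews and leaves 1375 test reviews, which matches the sizes reported above.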
Feature selection using Chi-Square is performed by calculating the Chi-Square value for each term using Equations (1) and (2). There are 2996 unique words in the training data that will be potential features in the SVM classification model. The selection of features or words used in creating the classification model is achieved by taking the top 1000 words with the highest Chi-Square values. Sample results of the Chi-Square calculations can be seen in Table 4.
Based on Table 4, the words “tidak”, “bisa”, “daftar”, and “aplikasi” have the highest Chi-Square values. This indicates that these words are the most relevant in determining the classification class. On the other hand, the words “putar” and “aktip” have the lowest Chi-Square values, suggesting that these words are less relevant in determining the classification class.
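The scoring and selection step can be sketched as below. Equations (1) and (2) are not reproduced in this section, so the code assumes the standard Chi-Square statistic for a 2x2 term/class contingency table, which may differ in form from the study's equations. The function names and the toy scores are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-Square statistic for one term and one class (2x2 contingency table).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a * d - c * b) ** 2 / denominator


def top_k_terms(scores, k):
    """Return the k terms with the highest Chi-Square values."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Toy scores for illustration only; these are not the values in Table 4.
toy_scores = {"tidak": 20.0, "putar": 0.0, "bisa": 5.0}
selected = top_k_terms(toy_scores, 2)
```

A term that occurs only in one class gets a large value, while a term distributed identically across both classes gets zero. In the study, the same idea is applied with k = 1000 over the 2996 candidate words.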
3.4. SVM Classification Model and Hyperparameter Tuning
The test data obtained from the previous data splitting consist of 1375 reviews. The test data, which have undergone text preprocessing, feature selection and TF-IDF word representation, produce input vectors for the SVM classification method. Hyperparameter tuning is performed on the regularized constant, and the results are shown in Table 5.
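The tuning procedure amounts to trying several candidate values of the regularized constant and keeping the one with the highest F1-Score. A minimal sketch of that loop follows. The callable `f1_for` is a hypothetical stand-in for training a linear-kernel SVM with a given value and returning its F1-Score, and the placeholder scores are invented for illustration.

```python
def tune_regularized_constant(candidates, f1_for):
    """Return (best_value, best_f1) over the candidate regularized constants.

    f1_for is a hypothetical callable that trains a model with the given
    regularized constant and returns its F1-Score.
    """
    best_value, best_f1 = None, float("-inf")
    for value in candidates:
        score = f1_for(value)
        if score > best_f1:  # ties keep the earlier candidate
            best_value, best_f1 = value, score
    return best_value, best_f1


# Placeholder scores for illustration only; these are NOT the values in Table 5.
placeholder_scores = {0.1: 0.80, 1: 0.85, 10: 0.90, 100: 0.88}
best_value, best_f1 = tune_regularized_constant(
    placeholder_scores, placeholder_scores.get
)
```

In practice, `f1_for` would wrap the actual training and evaluation code; only the selection rule (maximize the F1-Score) is taken from the text.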
Based on the F1-Score metric, the best-performing model achieved an F1-Score of 96.82% with a regularized constant of 10. The model has an accuracy of 95.56% in classifying the test data. Additionally, the model has a precision of 96.98%, meaning that 96.98% of the reviews classified as positive by the model are truly positive. The recall obtained by this model is 96.67%, meaning that the model correctly identifies 96.67% of the truly positive reviews in the test data.
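As a consistency check on the figures above, the F1-Score can be recomputed from the reported precision and recall using the harmonic mean. The function name `f1_score` is illustrative, and the inputs are the rounded percentages quoted in the text, so the result is only accurate to that rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 96.98  # percent, as reported in the text
reported_recall = 96.67     # percent, as reported in the text
recomputed_f1 = round(f1_score(reported_precision, reported_recall), 2)
print(recomputed_f1)  # 96.82
```

The recomputed value agrees with the 96.82% figure reported for the best-performing model.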
In this study, the regularized constant was set to 100 to establish a baseline model for benchmarking the performance of other hyperparameter values. This choice provided a consistent reference point for evaluating alternative parameter configurations.
As part of the ablation study, the model’s performance was systematically assessed using different feature subsets. Notably, when employing the entire feature set consisting of 2996 words, the model achieved an F1-Score of 95.09%, highlighting its proficiency in capturing sentiment variations across a wide range of linguistic features.
The model’s performance further improved when feature selection reduced the feature set to 1000 words, resulting in an F1-Score of 96.43%. This increase of 1.34 percentage points in F1-Score underscores the impact of feature selection on enhancing the model’s discriminative capability. These findings suggest that the strategic curation of a more compact feature set, achieved through Chi-Square feature selection, can enhance sentiment analysis accuracy.
3.5. Label Prediction
The test data are classified using the tuned SVM model. Sample classification of the test data with the tuned classification model can be seen in Table 6.
The prediction results using the model can be visualized in Figure 3.
Figure 3 shows that out of 1375 reviews in the test data, 69.75% (959 reviews) are of positive sentiment and 30.25% (416 reviews) are of negative sentiment. Furthermore, the most frequently occurring words in each sentiment class are counted to understand the message conveyed by the users.
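Counting the most frequent words per sentiment class can be sketched with the standard library as follows. The function name `most_common_words` and the toy tokenized reviews are hypothetical, and the input is assumed to be already preprocessed and grouped by predicted class.

```python
from collections import Counter


def most_common_words(tokenized_reviews, n):
    """Return the n most frequent tokens across a list of tokenized reviews."""
    counts = Counter(token for review in tokenized_reviews for token in review)
    return counts.most_common(n)


# Toy positive-class reviews (hypothetical, not taken from the dataset).
positive_reviews = [
    ["aplikasi", "bantu"],
    ["mudah", "bantu"],
    ["bagus", "bantu", "mudah"],
]
top_words = most_common_words(positive_reviews, 2)
```

Running the same function separately on the positive and the negative reviews yields the word lists that are then visualized for each class.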
3.5.1. Positive Reviews Data
The analysis of the sentiment distribution within user reviews of the Mobile JKN application reveals noteworthy insights. As depicted in Figure 4, the visualization of the most frequently occurring words in positive reviews highlights prominent terms such as “bantu” (help), “mudah” (easy), “bagus” (good), “mantap” (excellent), and “aplikasi” (application). These recurring words signify the positive sentiment conveyed by users, indicating a favorable experience with the Mobile JKN application. Notably, the use of terms like “bantu” (help) and “mudah” (easy) suggests that users find the application helpful and user-friendly, enhancing their perception of the overall service quality. The appearance of words like “bagus” (good) and “mantap” (excellent) further corroborates the positive sentiment, indicating users’ satisfaction with the application’s performance. This linguistic analysis underscores the alignment between user expectations and the application’s actual utility. Such an interpretation emphasizes the successful implementation of the Mobile JKN application, as affirmed by users’ positive expressions.
3.5.2. Negative Reviews Data
The examination and interpretation of results pertaining to user reviews of the Mobile JKN application warrant a more comprehensive analysis. As illustrated in Figure 5, a closer examination of the most frequently encountered terms within negative reviews brings to light prominent words such as “tidak” (not), “bisa” (can), “aplikasi” (application), “daftar” (register), and “nomor” (number). These prevalent terms signify recurring themes in negative sentiment reviews, which often encompass specific grievances voiced by users. An overarching concern shared by users is the perceived difficulty in the registration process within the application and challenges associated with registered phone numbers. Such insights underscore the practical challenges users encounter during their interaction with the application. Furthermore, the appearance of terms like “masuk” (login), “susah” (difficult), “error” (error), and “harus” (must) highlights additional areas of frustration and discontent experienced by users. The lack of ease during the login process and the presence of errors contribute to user dissatisfaction, leading to the expression of negative sentiments in their reviews. This analysis elucidates the nuances of negative feedback, emphasizing the specific pain points faced by users while navigating the application. A more profound exploration of these findings enhances our understanding of user experience and provides valuable insights for potential enhancements to address the identified challenges.
5. Discussion
The amalgamation of Chi-Square feature selection and the SVM classification technique establishes a compelling innovation in the realm of sentiment analysis for Indonesia’s National Health Insurance mobile application reviews. The method’s effectiveness is underpinned by distinct factors.
Primarily, the incorporation of Chi-Square feature selection augments the model’s potency. The strategic curation of relevant features bolsters the classifier’s discriminatory prowess. By spotlighting salient linguistic indicators, the model becomes adept at unraveling intricate nuances of sentiment embedded in the dataset.
Furthermore, the adoption of the SVM classification algorithm suits the textual data prevalent in reviews. Although SVMs can model non-linear relationships between features and sentiments through kernel functions, this study uses the “linear” kernel, a pragmatic choice that is computationally efficient and proved effective in capturing sentiment from the TF-IDF features.
Hyperparameter tuning, specifically the calibration of the regularized constant with the “linear” kernel, is a significant contributor to the enhanced performance. Optimizing this parameter adapts the model’s behavior to the attributes of the dataset, facilitating robust generalization and higher classification accuracy.
Lastly, the extensive evaluation process and meticulous data collection, encompassing a substantial corpus of Indonesian reviews from the Google Play Store, enrich the model with a comprehensive and diverse dataset. This resourcefulness empowers the model to extrapolate adeptly, accommodating the spectrum of sentiment expressions intrinsic to user reviews.
Reviewing the results obtained from this research, the novelty of this paper can be summarized as follows. The novelty resides in the application of a comprehensive framework for sentiment analysis that combines Support Vector Machine (SVM) classification and Chi-Square feature selection in the context of user reviews. Additionally, we incorporate techniques such as TF-IDF representation and meticulous text preprocessing to enhance the effectiveness of our approach.