Article

Fine-Tuned RoBERTa Model for Bug Detection in Mobile Games: A Comprehensive Approach

1 Centro de Investigación en Computación, Instituto Politécnico Nacional (CIC-IPN), Mexico City 07738, Mexico
2 Department of Computer Science and Software Engineering, The Islamia University of Bahawalpur, Punjab 63100, Pakistan
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Computers 2025, 14(4), 113; https://doi.org/10.3390/computers14040113
Submission received: 19 December 2024 / Revised: 9 January 2025 / Accepted: 11 January 2025 / Published: 21 March 2025

Abstract

In the current digital era, the Google Play Store and the App Store are major platforms for the distribution of mobile applications and games. Billions of users regularly download mobile games and provide reviews, which serve as a valuable resource for game vendors and developers, offering insights into bug reports, feature suggestions, and documentation of existing functionalities. This study showcases an innovative application of fine-tuned RoBERTa for detecting bugs in mobile phone games, highlighting advanced classification capabilities. This approach will increase player satisfaction, lead to higher ratings, and improve brand reputation for game developers, while also reducing development costs and saving time in creating high-quality games. To achieve this goal, a new bug detection dataset was created. Initially, data were sourced from four top-rated mobile games from multiple domains on the Google Play Store and the App Store, focusing on bugs, using the Google Play API and App Store API. Subsequently, the data were categorized into two classes: binary and multi-class. The Logistic Regression, Convolutional Neural Network (CNN), and pre-trained Robustly Optimized BERT Approach (RoBERTa) algorithms were used to compare the results. We explored the strength of pre-trained RoBERTa, which demonstrated its ability to capture both semantic nuances and contextual information within textual content. The results showed that pre-trained RoBERTa significantly outperformed the baseline models (Logistic Regression), achieving superior performance with a 5.49% improvement in binary classification and an 8.24% improvement in multi-class classification, resulting in cross-validation scores of 96% and 92%, respectively.

1. Introduction

In the current digital era, the primary market players for mobile applications are the Google Play Store and the App Store. The Google Play Store was launched on 6 March 2012. Initially, the store offered 450,000 mobile applications and games. By 2023, the Google Play Store had significantly expanded, offering more than 2.43 million apps, with approximately 490,000 categorized as games [1,2]. In contrast, the App Store was launched on 10 July 2008, and initially offered 500 applications. Over time, its app inventory grew to approximately 2.2 million by 2017. However, this figure experienced a modest decline in subsequent years as Apple began removing older or 32-bit apps. As of 2021, the App Store hosted more than 1.8 million apps [3]. Furthermore, the App Store has undergone rapid expansion, reaching 1.81 million apps by 2024, including 472,000 games. Mobile gaming has become increasingly important in modern society as a ubiquitous source of relaxation, entertainment, and social interaction [4]. From an economic standpoint, mobile games contribute substantially to revenue growth through a diverse range of monetization models. Additionally, they create job opportunities and support a thriving ecosystem for developers, 2D/3D graphic designers, animators, and other content creators. Mobile games play a crucial role in enriching lives, connecting communities, and contributing to modern digital culture [5]. In 2023, the mobile gaming sector exhibited remarkable growth, generating a staggering USD 81 billion in revenue, which constituted an impressive 49% of the overall gaming revenue. This figure surpassed the combined revenue generated by PC and console games [6]. Specifically, iOS platforms contributed USD 47.7 billion to the revenue from mobile games, while Google Play contributed USD 33.3 billion. Notably, Chinese gamers emerged as the most significant contributors to mobile game revenue in 2023, accounting for 34% of total consumer spending on mobile games. 
However, the industry encountered difficulties related to bugs in the games, which negatively impacted the user experience and necessitated continuous optimization efforts [7]. Recent researchers have highlighted the importance of user reviews in determining the success of an application [8]. Positive reviews can lead to higher rankings within app stores, which can result in increased visibility, sales, and download figures [9]. Reviews play a crucial role in assisting users in navigating the vast array of available apps and in making informed decisions about which apps to use. Users can express their satisfaction or dissatisfaction, or recommend additional features through a combination of free-text commentary and star ratings [10]. Furthermore, recent studies have emphasized the potential significance of reviews for both app developers and vendors.
Many reviews contain important information related to the quality of mobile games, including reports of bugs or issues [11], summaries of user experiences with specific features [12], requests for improvements in the game [13], and even suggestions for new features [11,14]. Existing studies do not provide a detailed comparison of bug detection methodologies [15] in mobile games between the Google Play Store and the App Store. To close this gap, our research examines the bug detection approaches utilized on various platforms. Understanding these characteristics is vital to boosting app performance and enhancing the user experience in the gaming industry. This study employs transfer learning techniques to identify problems in mobile games [16] based on user reviews from both the Google Play Store and the App Store. The goals include developing efficient bug detection models, evaluating algorithm efficiency, and investigating the relationship between discovered faults and user reviews.
In this study, we collected user-generated reviews of top-rated mobile games (Minecraft, GTA San Andreas, Call of Duty: Mobile Season 4, and Lords Mobile) from both the Google Play Store and the App Store. Machine learning models, namely Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbor (KNN); deep learning models, namely Bidirectional Gated Recurrent Unit (BGRU), Bidirectional Long Short-Term Memory (BiLSTM), Enhanced Long Short-Term Memory (ELSTM), and Convolutional Neural Network (CNN); and pre-trained transfer learning models, namely Bidirectional Encoder Representations from Transformers (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), and Generative Pre-trained Transformer 2 (GPT-2), were employed to determine the most efficient solution. The newly annotated dataset was structured into two distinct subtasks: (i) binary classification and (ii) multi-class classification. The evaluation covers the models applied to both subtasks, which together address bug detection in games. The results demonstrated that the proposed model (RoBERTa) significantly outperformed the baseline models, achieving superior performance with a 5.49% improvement in binary classification and an 8.24% improvement in multi-class classification, resulting in accuracies of 96% and 92%, respectively. To the best of our knowledge, no prior work has utilized machine learning, deep learning, and transfer learning techniques for binary and multi-class classification on Google Play and App Store game reviews for bug detection. This study is the first to use a manually annotated dataset to detect bugs with the goal of improving game quality for both developers and vendors. We chose to focus on three specific bug categories, namely network, graphical, and performance issues, because they directly impact the player's experience.
By categorizing bugs into these key areas, this approach helps developers quickly identify and fix the most disruptive problems, ultimately enhancing user satisfaction and enjoyment.
This study makes the following contributions:
  • Dataset development: We develop a binary and multi-class dataset for bug detection in user reviews, establish guidelines for dataset annotation, and evaluate existing datasets to suggest enhancements;
  • Understanding of bug detection in user reviews: We examine the linguistic nuances of user reviews and model bug reports as distinct categories to inform the development of our dataset and classification tasks;
  • Text classification: We explore bug detection through a binary and multi-class text classification (TC) task, which is a relatively new approach. The binary classification involves classifying whether the review reports a bug or not. If the review reports a bug, the multi-class classification further classifies the specific type of bug into categories such as network, graphical, and performance issues;
  • Benchmarking: We conduct various experiments on learning approaches, offering a benchmark for future research on bug detection tasks;
  • Performance improvements: The proposed model (RoBERTa) achieved a 96% cross-validation score in binary classification and a 92% cross-validation score in multi-class classification, resulting in improvements of 5.49% and 8.24%, respectively, compared to traditional machine learning models (LR 91% in binary and 85% in multi-class classification).
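The two-stage scheme described in the contributions (binary bug detection, then bug-type classification) can be illustrated with a minimal sketch. The keyword sets below are hypothetical stand-ins for the paper's trained models (LR, CNN, RoBERTa) and are assumptions, not part of the study:

```python
# Illustrative two-stage bug classification pipeline. The keyword rules are
# hypothetical; the paper uses trained LR/CNN/RoBERTa models instead.

BUG_KEYWORDS = {"crash", "lag", "glitch", "bug", "freeze", "disconnect"}

CATEGORY_KEYWORDS = {
    "network": {"disconnect", "lag", "server", "connection"},
    "graphical": {"glitch", "texture", "screen", "render"},
    "performance": {"crash", "freeze", "slow", "fps"},
}

def classify_review(review: str) -> str:
    """Stage 1: bug vs. no bug. Stage 2: bug type (network/graphical/performance)."""
    tokens = set(review.lower().split())
    if not tokens & BUG_KEYWORDS:
        return "no-bug"
    # Stage 2: pick the category with the largest keyword overlap.
    return max(CATEGORY_KEYWORDS, key=lambda c: len(tokens & CATEGORY_KEYWORDS[c]))

print(classify_review("Great game, love the graphics"))       # no-bug
print(classify_review("Constant lag and server disconnect"))  # network
```

In the actual pipeline, both stages are learned classifiers rather than keyword lookups; the sketch only shows how the binary decision gates the multi-class one.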
The rest of this paper is organized as follows: Section 2 describes the literature review and background information. Section 3 describes the overall methodology and design. Section 4 illustrates the experimental results. Section 5 discusses the conclusion and future work.

2. Literature Review

Mobile gaming has become a crucial aspect of our day-to-day routines, offering recreation and amusement to countless users globally [17]. As the popularity of mobile games continues to grow, developers have been confronted with the task of creating flawless experiences for their audience. This challenge has prompted researchers to undertake various studies and literature reviews on defects in mobile games [18], delving into prevalent types of bugs, their causes, and potential remedies. These common bug categories include visual anomalies or artifacts during gameplay (graphic glitches), errors within the game mechanics (gameplay errors), and issues affecting performance [19].
Sadiq et al. [4] highlight the significance of online reviews in consumer decision-making and the challenges posed by fake reviews and ratings, particularly on platforms like the Google Play Store. They developed a deep learning-based approach to predict contradictions between numeric ratings and text reviews in Google App reviews. Their approach involves predicting sentiment polarity and utilizing deep learning models to infer star ratings from review texts, aiming to provide unbiased ratings based on genuine user feedback. The experimental results indicate the effectiveness of the proposed approach in accurately predicting app ratings.
Schumer et al. [5] assess the features of diet and nutrition apps available on the Google Play Store. Using search terms such as ‘diet apps’ and ‘nutrition apps’, a search was conducted in August 2017, resulting in the identification of 86 apps. All apps were freely available, with a majority targeting users of all ages. A comparative analysis revealed that diet apps had a slightly higher average rating (4.4) compared to nutrition apps (4.3). Diet apps were also more frequently updated and featured in-app purchases more often than nutrition apps.
Martens et al. [6] contribute by investigating fake reviews in app stores, analyzing their prevalence, characteristics, and impact. Through disguised questionnaires and policy analysis, they provide insights into fake review providers’ strategies and offerings. Developing a high-accuracy classifier for detecting fake reviews is another significant contribution, offering a practical tool for app store operators and analysts. Their findings highlight differences between fake and genuine reviews and offer implications for software engineering, app users, and app store operators, advancing understanding in the field.
Zhang et al. [15] introduce Carat, a method and implementation for detecting energy bugs, code misbehavior that leads to energy wastage, on mobile devices. Carat adopts a collaborative, black-box approach, where a non-invasive client app periodically sends coarse-grained measurements to a server. The server then correlates higher expected energy use with client properties such as running apps, device model, and operating system. In a controlled experiment, Carat successfully detected all energy bugs, and in a deployment to 883 users, it identified 5434 instances of apps exhibiting buggy behavior in real-world scenarios.
Jiang et al. [16] address energy bugs in Android apps, crucial for battery conservation due to limited smartphone battery capacity. Their approach focuses on detecting resource leaks and layout defects using SAAD, a static analysis technique. SAAD employs inter-procedural analysis for bug detection, achieving accuracies of 87% for both resource leaks and layout defects in experimental evaluations on 64 Android apps.
Wu et al. [20] implemented a testing approach called WDTEST, focused on widget detection and tailored for mobile games at NetEase Games. They collected a comprehensive dataset of graphical user interfaces for mobile games and conducted detailed assessments of advanced widget detection techniques within the mobile gaming environment. The evaluations showed that WDTEST performs better than the commonly used Monkey tool, achieving three times the coverage of unique user interfaces in gaming contexts.
Xu et al. [9] introduce a new approach called Cross-Triplet Deep Feature Embedding to predict JIT bugs in various mobile applications. The Cross-Triplet Deep Feature Embedding (CDFE) strives to enhance bug prediction accuracy by incorporating an advanced cross-triplet loss function into a deep neural network.
Van der Lee et al. [21] introduce an innovative testing approach that integrates state machine learning with specialized algorithms to reveal attack routes in mobile Android applications. State machine learning involves representing the application’s behavior using states and transitions, enabling a comprehensive understanding of its operations. By utilizing state machine learning and tailored algorithms together, the methodology aims to pinpoint potential attack paths that adversaries could exploit to compromise the application’s security. The research focused on enhancing mobile application security by presenting a testing method that detects vulnerabilities in inferred state machine models, thus contributing to overall app security improvements.
Tazuddin et al. [22] introduce an innovative gaming framework that utilizes indoor mobile Wi-Fi localization as game input without requiring additional infrastructure. This design aims to optimize gameplay by balancing responsiveness and accuracy levels.
Abbasi et al. [23] introduce the concept of Application Tail Energy Bugs (ATEBs) in smartphones, focusing on excessive energy consumption even after app termination. Through experiments on real Android apps, the study identifies potential causes and user actions triggering ATEBs, exploring app component behaviors like activities and services. By tracing wake locks and CPU-engaging services, the study examines the relationship between software changes and energy consumption. Additionally, a tool utilizing Android debug bridge commands is designed to detect ATEBs, providing a practical solution for developers without access to power meters.
Kim et al. [24] address the challenge of limited battery life in mobile devices by proposing a static optimization technique for energy-efficient app development, particularly focusing on graphics-intensive apps. By leveraging static analysis, the technique predicts app behavior to optimize drawing commands. Three key optimizations are highlighted: loop invariant texture analysis, packing images, and identical frames detection. Implemented against LibGDX, an Android game engine, the technique demonstrated significant energy savings, up to 44% of total device energy consumption, as observed in experiments conducted on open-source projects.
Existing work on bug detection in mobile games focuses mainly on energy efficiency or gameplay-related issues and lacks a comprehensive approach to leveraging user reviews. This motivated us to integrate semantic analysis and contextual understanding into our work.

3. Materials and Methods

In this study, supervised machine learning, deep learning, and pre-trained transfer learning algorithms were used for bug detection in Google Play Store and App Store reviews of four top-rated games: Minecraft, GTA San Andreas, Call of Duty: Mobile Season 4, and Lords Mobile. In the context of bug detection in mobile games, we followed these key steps:
  • Data Collection: This stage involved collecting 10,000 user reviews from both the Google Play Store and the App Store.
  • Data Pre-processing: The second stage entailed pre-processing the data to remove noise from the dataset.
  • Data Labeling: The third stage comprised labeling the data into binary and multi-class categories.
  • Application of Models: The fourth stage entailed applying the machine learning (LR, SVM, RF, and KNN), deep learning (BGRU, BiLSTM, CNN, and ELSTM), and pre-trained transfer learning (BERT, RoBERTa, and GPT-2) models to predict binary and multi-class categories.
  • Model Evaluation: In this phase, predictive models were evaluated using four metrics: accuracy, precision, recall, and macro F1-score. These metrics provided insights into the effectiveness of the models in predicting the near-optimal class.

3.1. Construction of Dataset

To identify bugs in mobile games, 10,000 user reviews were collected from both the Google Play Store and the App Store, focusing on four top-quality mobile games: Minecraft, GTA San Andreas, Call of Duty: Mobile Season 4, and Lords Mobile. The objective was to uncover potential bugs, glitches, or performance issues reported by users. The collected data were thoroughly analyzed to identify patterns and recurring bugs and to prioritize necessary fixes, ultimately improving the overall user experience. Figure 1 illustrates a typical ML/DL workflow, customized to address the unique challenges of bug detection in mobile games. This workflow encompasses data scraping, pre-processing, annotation, model training, bug detection, and performance evaluation. Meanwhile, Figure 2 visually represents the dataset through a Word Cloud generated in Python, highlighting frequent keywords and issues mentioned in user reviews, which further aids in identifying critical areas that require attention.
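A word cloud like the one in Figure 2 is driven by token frequencies. A minimal frequency-count sketch over hypothetical example reviews (the dataset itself is not public); a package such as `wordcloud` would render the resulting counts:

```python
from collections import Counter
import re

# Hypothetical stand-ins for scraped user reviews.
reviews = [
    "Game crashes on startup after the update",
    "Love this game but it crashes constantly",
    "Lag spikes make multiplayer unplayable",
]

STOP_WORDS = {"the", "on", "after", "but", "it", "this", "make"}

tokens = []
for review in reviews:
    # Keep alphabetic tokens only, lowercased, minus stop words.
    tokens += [w for w in re.findall(r"[a-z]+", review.lower()) if w not in STOP_WORDS]

# The most frequent terms would drive the word-cloud rendering.
print(Counter(tokens).most_common(3))
```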

3.2. Data Pre-Processing

Data pre-processing plays an important role in the context of textual data, as shown in Figure 3. It is essential for improving model performance, reducing computational complexity, and ensuring more accurate and reliable outcomes in natural language processing tasks. Real-world data often contain noise, missing values, language errors, and duplications, making pre-processing necessary to clean and prepare raw data for analysis or model training. Key pre-processing tasks include removing stop words, special characters, punctuation marks, digits, and reviews shorter than 20 characters; converting text to lowercase for uniformity; tokenizing text into smaller units such as words or phrases; and stemming to reduce words to their root forms. These steps ensure that the data is free from noise, consistent, and structured, allowing machine learning models or algorithms to focus on meaningful patterns and features. A sample of the process for removing unnecessary characters is shown in Table 1.
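The cleaning steps above can be sketched as a single function. The stop-word list and the naive suffix stemmer below are toy stand-ins (a real pipeline would use, e.g., NLTK's stop-word corpus and PorterStemmer):

```python
import re
from typing import Optional

# Toy stop-word list; a real pipeline would use a full corpus (e.g. NLTK's).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "to", "this"}

def preprocess(review: str) -> Optional[str]:
    """Clean one review following Section 3.2; returns None if the review is discarded."""
    text = review.lower()                       # lowercase for uniformity
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation, digits, special chars
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop-word removal
    # Naive suffix stemming, a stand-in for a real stemmer such as PorterStemmer.
    tokens = [t[:-3] if t.endswith("ing") else t[:-1] if t.endswith("s") else t
              for t in tokens]
    cleaned = " ".join(tokens)
    # Discard reviews shorter than 20 characters, as in the paper.
    return cleaned if len(cleaned) >= 20 else None

print(preprocess("The game keeps crashing!!! 100% broken and laggy"))
print(preprocess("Nice!"))  # too short after cleaning, so discarded
```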

3.3. Data Labeling

In this step, we employed trained human annotators to label the user reviews. The annotators meticulously reviewed each review and categorized it, based on the presence of bugs in the game, at two levels: binary and multi-class, as seen in Figure 4. The multi-class labels reflect the most common categories of bugs reported by users, namely network, graphical, and performance issues, which align well with the findings of prior research. These careful assessments ensure the accuracy and reliability of the dataset, which is essential for training binary and multi-class classification models effectively. This collaborative effort aims to build a comprehensive resource for identifying and understanding bugs in games while refining the dataset's precision and quality.

3.4. Application of Models Training and Testing Phase

After the dataset labeling process, we employed a diverse range of traditional machine learning algorithms such as LR, SVM, RF, and KNN, along with four deep learning models, including BGRU, BiLSTM, CNN, and ELSTM. Additionally, we utilized three pre-trained transfer learning methods, BERT, RoBERTa, and GPT-2, to identify the most effective solution for detecting bugs in player reviews on the Play and App stores, thereby improving user experience and providing developers with valuable feedback for mobile games. To ensure the robustness and generalization of our models, we evaluated their performance using cross-validation scores. This approach provided a reliable estimate of model accuracy by dividing the dataset into multiple folds and systematically training and testing the models on different subsets, as depicted in Figure 5. The aggregated cross-validation results allowed us to compare model efficacy comprehensively and select the best-performing methods for identifying and addressing user-reported issues effectively.
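The baseline setup can be sketched as a TF-IDF plus Logistic Regression pipeline evaluated with k-fold cross-validation. This is a minimal sketch assuming scikit-learn; the review strings and labels below are toy stand-ins for the actual 10,000-review dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for the annotated review dataset (label 1 = bug report).
reviews = [
    "game crashes on startup", "constant lag in multiplayer",
    "textures flicker after update", "freezes every few minutes",
    "love the new season", "great graphics and story",
    "best mobile game ever", "fun with friends daily",
] * 3  # repeated so every cross-validation fold contains both classes
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

# TF-IDF features feeding a Logistic Regression classifier, as in the baseline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation; the mean of the fold scores is reported.
scores = cross_val_score(model, reviews, labels, cv=5)
print(round(scores.mean(), 2))
```

The same evaluation loop applies unchanged to the other classical models (SVM, RF, KNN) by swapping the final pipeline step.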

3.5. Model Evaluation Phase

We utilized key metrics, such as cross-validation score, precision, recall, and F1-score (as seen in Equations (1)–(4)) to evaluate the performance of our constructed models. Through a thorough examination of these metrics, we gained invaluable insights into the capacity of the models to accurately classify user reviews, ultimately providing insightful feedback on their effectiveness in identifying bugs within mobile games.
Precision = TP / (TP + FP)    (1)
Recall = TP / (TP + FN)    (2)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (3)
The k-fold cross-validation score is defined as follows:
CV_k = (1/k) × Σ_{i=1}^{k} Score(S_i)    (4)
where
  • TP: true positives;
  • FP: false positives;
  • FN: false negatives;
  • CV_k: average cross-validation score over k folds;
  • Score(S_i): score computed on the i-th validation set.
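Equations (1)–(4) can be computed directly from the confusion counts and per-fold scores; a minimal sketch (the example counts are illustrative, not results from the paper):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Equations (1)-(3): precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def cv_score(fold_scores: list[float]) -> float:
    """Equation (4): the average of the per-fold validation scores."""
    return sum(fold_scores) / len(fold_scores)

# Illustrative counts: 90 true positives, 10 false positives, 6 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=6)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.9 0.94 0.918
```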

4. Experimental Results

This section examines the outcomes of our predictive models. We employed four machine learning, four deep learning, and three transfer learning-based approaches to identify the most effective solution for detecting mobile game bugs. The decision to use multiple models was aimed at identifying the specific contributions of each model, thereby enabling the selection of the optimal one. To ensure optimal performance across all models, we systematically analyzed and fine-tuned their parameters and hyperparameters. The detailed hyperparameters and corresponding search grids for the best-performing model (RoBERTa) are presented in Table 2 in the error analysis section.

4.1. Results for Machine Learning

Figure 6 presents the performance metrics for four machine learning models (LR, SVM, RF, and KNN) using TF-IDF feature extraction in a binary classification task. Each model’s precision, recall, F1-score, and cross-validation scores are reported as weighted averages. LR and SVM exhibit identical and superior performance, achieving scores of 0.91 across all metrics, indicating strong consistency and balance between precision and recall. RF performs slightly lower with scores of 0.89, reflecting good but less effective performance than LR and SVM. KNN shows the lowest scores at 0.87, suggesting relatively weaker predictive capability. Overall, the results highlight LR and SVM as the most effective models for this task, followed by RF and KNN, which may still be viable depending on the application context.
Figure 7 illustrates the performance metrics for four machine learning models (Logistic Regression, Support Vector Machine, Random Forest, and K-Nearest Neighbors) applied to a multi-class classification task using TF-IDF features. Weighted average scores for precision, recall, F1-score, and cross-validation are reported. Logistic Regression (LR) and Support Vector Machine (SVM) both achieve the highest scores across all metrics (0.85 for precision, recall, F1-score, and cross-validation), indicating strong and consistent performance. Random Forest (RF) shows slightly lower scores with a weighted average F1-score of 0.78 and cross-validation accuracy of 0.81, suggesting a moderate drop in balance between precision and recall. K-Nearest Neighbors (KNN) has the lowest scores, with a weighted average F1-score and cross-validation accuracy of 0.79 and 0.80, respectively, reflecting relatively weaker performance in handling the multi-class task.

4.2. Results for Deep Learning

Figure 8 summarizes the performance of four deep learning models (BGRU, BiLSTM, CNN, and ELSTM) on a binary classification task, reporting their weighted average precision, recall, F1-score, and cross-validation scores. CNN achieves the best overall performance, with a perfect balance across all metrics at 0.91, followed closely by BiLSTM, which scores consistently at 0.88. Both BGRU and ELSTM exhibit identical, significantly lower performance, with a precision of 0.29, recall of 0.54, F1-score of 0.38, and cross-validation accuracy of 0.54, indicating poor predictive capability and imbalance in handling the binary task.
Figure 9 presents the performance metrics of four deep learning models—Bidirectional GRU (BGRU), Bidirectional LSTM (BiLSTM), Convolutional Neural Network (CNN), and Extended LSTM (ELSTM)—applied to a multi-class classification task, using weighted average precision, recall, F1-score, and cross-validation scores. CNN outperforms the other models with consistently high scores of 0.82 across all metrics, indicating strong and balanced performance. BiLSTM follows with slightly lower, but competitive scores of 0.79 for precision and F1-score and 0.80 for recall and cross-validation. In contrast, BGRU and ELSTM demonstrate identical, poor performance with a precision of 0.29, recall of 0.54, F1-score of 0.38, and cross-validation accuracy of 0.54, reflecting limited effectiveness in addressing the multi-class classification problem.

4.3. Transformer Results

Figure 10 highlights the performance of three transfer learning models (BERT, RoBERTa, and GPT-2) on a binary classification task, with weighted average scores for precision, recall, F1-score, and cross-validation. RoBERTa achieves the highest scores across all metrics at 0.96, indicating superior performance in the task. BERT closely follows with consistent scores of 0.95, showcasing excellent effectiveness. GPT-2, while slightly behind, also demonstrates strong performance with uniform scores of 0.94, making it a highly competitive model for binary classification tasks. All three models exhibit outstanding balance and accuracy in handling the classification problem.
Figure 11 illustrates the performance of three transfer learning models—BERT, RoBERTa, and GPT-2—on a multi-class classification task, presenting their weighted average precision, recall, F1-score, and cross-validation scores. RoBERTa achieves the highest performance, with a score of 0.92 across all metrics, showcasing exceptional effectiveness and balance. BERT closely follows with consistent scores of 0.91, demonstrating robust and reliable performance. GPT-2, while slightly lower at 0.90 across all metrics, remains highly competitive, highlighting its strong capability for handling multi-class classification tasks. Overall, all three models exhibit remarkable accuracy and reliability in this domain.

4.4. Error Analysis

Table 2 outlines the hyperparameters explored during grid search optimization for training the proposed model (RoBERTa). Learning rate values ranged from 1 × 10⁻⁵ to 3 × 10⁻⁴, allowing for fine-tuning at various scales. Epochs were varied between 3 and 25 to assess model performance across short and extended training periods. Batch sizes included 8, 32, 64, and 128 to explore the impact on convergence and computational efficiency. Weight decay values ranged from 0.01 to 0.1 to regulate overfitting. Hidden dropout rates of 0.02 and 0.1 were tested to manage overfitting and improve generalization. Lastly, warm-up ratios were adjusted between 0.03 and 0.1, ensuring stable training by gradually increasing the learning rate at the beginning of the process. This grid search provided a robust exploration of hyperparameter combinations for optimal model performance.
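The size of such a grid search can be enumerated with `itertools.product`. In the sketch below, the endpoint values come from the text, while the intermediate learning rates and epoch counts are illustrative assumptions (the text gives only the ranges):

```python
from itertools import product

# Search space based on Table 2's description. Endpoints come from the text;
# the intermediate learning rates and the epoch value 10 are assumptions.
grid = {
    "learning_rate": [1e-5, 5e-5, 1e-4, 3e-4],
    "epochs": [3, 10, 25],
    "batch_size": [8, 32, 64, 128],
    "weight_decay": [0.01, 0.1],
    "hidden_dropout": [0.02, 0.1],
    "warmup_ratio": [0.03, 0.1],
}

# Every hyperparameter combination the grid search would evaluate.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 4 * 3 * 4 * 2 * 2 * 2 = 384
```

Each combination would then be used to fine-tune the model once, with the cross-validation score deciding the winner; exhaustive grids grow multiplicatively, which is why coarse grids like this one are typical for transformer fine-tuning.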
Figure 12 highlights the top-performing models across three learning approaches—machine learning, deep learning, and transfer learning—for binary and multi-class classification tasks. Figure 13, Figure 14 and Figure 15 present the confusion matrices of the top-performing models used in this study. Figure 16 represents the training and validation performance metrics of the proposed model over multiple epochs for the binary classification task, while Figure 17 depicts the training and validation performance across different epochs for the multi-class classification task. For binary classification, RoBERTa from transfer learning achieves the highest scores, with precision, recall, F1-score, and cross-validation all at 0.96. Both Logistic Regression (machine learning) and CNN (deep learning) follow with identical scores of 0.91 across all metrics. For multi-class classification, RoBERTa again leads with scores of 0.92 in all metrics, showcasing its superior adaptability. SVM (machine learning) scores slightly lower, with an F1-score of 0.84 and cross-validation of 0.85, while CNN (deep learning) achieves consistent scores of 0.82 across all measures. This indicates that transfer learning models outperform traditional and deep learning models in both binary and multi-class classification tasks.

5. Conclusions and Future Work

In the dynamic landscape of mobile applications, the Google Play Store and App Store play pivotal roles, hosting a plethora of games and apps that attract billions of users worldwide. This study presents a significant advancement in the field of bug detection for mobile games by leveraging user reviews from the Google Play Store and App Store. Using machine learning, deep learning, and transfer learning models, we successfully developed a highly accurate system for identifying and classifying bugs. The proposed methodology (RoBERTa) demonstrated superior performance, achieving accuracies of 96% in binary classification and 92% in multi-class classification, marking an improvement of 5.49% and 8.24% over traditional methods (Logistic Regression).
Our work makes several impactful contributions, including the development of a manually annotated dataset tailored for binary and multi-class bug detection tasks. By categorizing bugs into network, graphical, and performance issues, this study provides a structured approach that enables developers to quickly identify and address critical problems, ultimately enhancing user satisfaction. Furthermore, the exploration of linguistic nuances in user reviews has introduced a novel dimension to bug detection, bridging the gap between user feedback and actionable insights for developers.
The findings of this research have profound implications for the mobile gaming industry. By offering a benchmark for future studies and establishing a scalable and efficient bug detection system, this work sets a new standard for improving game quality and user experience. The methodology and results presented here pave the way for further innovations in leveraging user-generated content to enhance digital products.
In the future, we aim to expand the dataset to include additional bug categories and integrate real-time bug detection capabilities. Incorporating cross-platform analyses and user sentiment evaluations can further enrich the system, providing a holistic approach to improving mobile games. This study underscores the critical role of user feedback in driving innovation and elevating standards in the gaming industry, ensuring a more enjoyable and seamless experience for players worldwide.

Author Contributions

All authors contributed to this study’s conception and design. M.A., A.H. and M.U.: conceptualization, methodology, writing—review; M.A. and F.U.: reviewing and editing; M.A. and A.H.: data curation; M.A., M.J. and A.H.: writing—original draft preparation; M.A.: visualization, investigation; M.M. and A.G.: supervision, validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available on request.

Acknowledgments

This work was conducted with partial support from the Mexican Government through grant A1-S-47854 of CONACYT, Mexico, and grants 20241816, 20241819, and 20240951 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources provided through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America PhD Award.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Proposed methodology and design.
Figure 2. Word cloud of the most frequent words in the bug detection dataset.
Figure 3. Pre-processing approach used in the study.
Figure 4. Annotation procedure for binary and multi-class tasks.
Figure 5. Illustration of k-fold cross-validation.
Figure 6. Machine learning results in the binary classification task.
Figure 7. Machine learning results in the multi-class classification task.
Figure 8. Deep learning results in the binary classification task.
Figure 9. Deep learning results in the multi-class classification task.
Figure 10. Transfer learning results in the binary classification task.
Figure 11. Transfer learning results in the multi-class classification task.
Figure 12. Top-performing models from each learning approach.
Figure 13. Confusion matrices of Logistic Regression in the binary and multi-class tasks.
Figure 14. Confusion matrices of CNN in the binary and multi-class tasks.
Figure 15. Confusion matrices of RoBERTa in the binary and multi-class tasks.
Figure 16. Training and validation performance across epochs in the binary classification task.
Figure 17. Training and validation performance across epochs in the multi-class classification task.
Table 1. Sample of removing unnecessary characters.
Original review: "Fun game, but frequent lag and frame rate drops make it hard to enjoy, especially during intense moments.!"
Review after removing unnecessary characters: "fun game but frequent lag and frame rate drops make it hard to enjoy especially during intense moments"
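The transformation shown in Table 1 amounts to lowercasing, stripping punctuation, and normalizing whitespace. The sketch below is a plausible reproduction of that cleaning step (not the study's exact code):

```python
import re

def clean_review(text: str) -> str:
    """Lowercase the review, drop punctuation, and collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # remove punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

raw = ("Fun game, but frequent lag and frame rate drops make it hard "
       "to enjoy, especially during intense moments.!")
cleaned = clean_review(raw)
```

Applied to the review in Table 1, this yields exactly the cleaned text shown there.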
Table 2. Hyperparameter tuning for the proposed methodology (RoBERTa).

Hyperparameter      Grid Search Values
Learning Rate       1 × 10⁻⁵, 1 × 10⁻², 2 × 10⁻⁵, 3 × 10⁻⁵, 3 × 10⁻⁴
Epochs              3, 9, 20, 25
Batch Size          8, 32, 64, 128
Weight Decay        0.01–0.1
Hidden Dropout      0.02, 0.1
Warm-up Steps       0.03–0.1
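The discrete grids in Table 2 can be enumerated as a Cartesian product of candidate settings. The sketch below illustrates this enumeration (for illustration only; the weight-decay and warm-up entries are given as continuous ranges and would be sampled separately rather than enumerated):

```python
from itertools import product

# Discrete grids taken from Table 2; weight decay and warm-up steps
# are ranges in the table and are therefore omitted here.
grid = {
    "learning_rate": [1e-5, 1e-2, 2e-5, 3e-5, 3e-4],
    "epochs": [3, 9, 20, 25],
    "batch_size": [8, 32, 64, 128],
    "hidden_dropout": [0.02, 0.1],
}

# Each entry of `configs` is one candidate hyperparameter setting
# that a grid search would evaluate when fine-tuning the model.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

Enumerating the four discrete grids alone yields 5 × 4 × 4 × 2 = 160 candidate configurations, which is why grid searches over fine-tuning hyperparameters are usually pruned or combined with early stopping.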

Share and Cite

MDPI and ACS Style

Usman, M.; Ahmad, M.; Ullah, F.; Muzamil, M.; Hamza, A.; Jalal, M.; Gelbukh, A. Fine-Tuned RoBERTa Model for Bug Detection in Mobile Games: A Comprehensive Approach. Computers 2025, 14, 113. https://doi.org/10.3390/computers14040113

