1. Introduction
Healthcare is undergoing a global transformation with the increasing adoption of artificial intelligence (AI) technologies. These advancements are revolutionizing patient care, diagnostics, treatment, and healthcare management workflows. AI subfields such as computer vision, natural language processing, machine learning, and deep learning have made significant contributions to the evolution of healthcare services [
1]. While non-AI applications in healthcare are generally limited to basic functions such as symptom checking and appointment scheduling, AI-enabled healthcare applications provide advanced decision support, personalized care, and clinical insight based on data-driven algorithms [
2].
Given that patients are the end users of these AI-powered systems, it is essential that such applications are designed with a strong emphasis on usability, safety, and privacy. AI systems should be easy to use and should enhance patient navigation and satisfaction with healthcare services [
3]. Thus, assessing the usability of AI healthcare applications is a critical step toward ensuring their effective and safe implementation.
AI refers to technologies that can perform cognitive and physical tasks and make decisions without direct human intervention [
4]. Its component techniques, including machine learning, neural networks, genetic algorithms, and pattern recognition, support various medical processes. AI can exceed human limitations and enhance productivity, especially in human–computer interaction (HCI), where it improves interface design, user engagement, and system safety [
5,
6]. In healthcare, AI is commonly used in clinical support systems and robotic surgery, though many systems remain minimally usable, posing potential risks to patients [
7,
8]. mHealth applications supported by AI aim to improve care quality, cost-efficiency, and health system management [
2]. Despite these advantages, some users remain hesitant about AI’s role in healthcare due to concerns related to usability, trust, and transparency [
2].
While AI apps offer advantages in administrative tasks, diagnosis, treatment, and follow-up care [
3], users still encounter usability challenges [
9]. For example, voice-driven smart assistants often underperform on complex queries [
9,
10], and design failures in HCI have contributed to critical incidents such as self-driving vehicle accidents [
9]. In healthcare, design issues such as poor navigation, limited error recovery, and lack of transparency can compromise trust and usability, key elements in medical environments that demand precision and reliability [
11].
This study offers a comprehensive usability evaluation of three leading AI-based mHealth applications using both expert and user-based assessments. The focus is on the usability of these tools rather than the internal effectiveness of the underlying AI algorithms. The findings aim to support developers and designers with practical recommendations to improve the usability and user experience of AI-powered mobile health apps.
Few studies have systematically assessed the usability of AI-powered mHealth apps [
3,
12,
13,
14,
15], and fewer still have used integrated methods such as heuristic evaluation, user testing, and automated assessments [
16,
17,
18]. This study fills that gap by offering a combined usability evaluation framework, emphasizing user-centered design, and broadening existing models that traditionally prioritize algorithmic performance over human experience. Regarding the growth of the mHealth market, the fact that over 325,000 apps were available in 2017 alone [
19] underscores the need for improved usability standards. Despite advances in AI modeling, user-centered design remains underrepresented in development processes [
20]. Prominent AI health apps like Woebot [
21] and Babylon Health [
22] exemplify innovation in digital health but face challenges due to varied user health and digital literacy [
23]. Previous research has highlighted usability concerns such as confusing outputs and unexpected errors [
24], reinforcing the importance of transparent and intuitive design. In the following sections, we review the current literature, define our research objectives, and present a triangulated methodology to evaluate the usability of three AI-powered mHealth applications.
4. Materials and Methods
To evaluate the usability of AI features in mHealth apps, this study adopts an experimental approach. Inspired by [
13], which used ISO 9241-11 metrics [
25], and [
16], which applied Nielsen’s heuristics [
26], this study follows a similar structure, assessing effectiveness, efficiency, and satisfaction. The study begins with a literature review to shape its objectives and identify key evaluation factors related to usability and AI in mHealth. The evaluation approach comprises three phases: (1) a heuristic evaluation by experts using best practices; (2) user testing with tasks and think-aloud protocols; and (3) automated usability tests to detect technical flaws and assess accessibility compliance. The methodology provides a comprehensive assessment of usability from both expert and user perspectives, culminating in actionable recommendations.
Figure 1 illustrates the research methodology workflow.
4.1. Developing Customized Usability Heuristics
Heuristic evaluation, as introduced by [
34], involves experienced evaluators assessing an interface against usability principles. Evaluators individually document usability issues and then consolidate their findings, ranking issues by severity, frequency, and criticality [
35].
Table 1 lists Nielsen’s ten heuristics. Transparency and explainability are increasingly essential in AI design [
36,
37,
38,
39]. Users expect clear explanations and trustworthy systems [
23,
40,
41].
Table 2 presents the proposed heuristics tailored to AI-powered mHealth apps.
4.2. App Selection Criteria
This study evaluated three widely used AI-powered mHealth applications [
42]: ADA [
43], Mediktor [
44], and WebMD [
45]. These applications were selected based on the following inclusion criteria to ensure relevance to real-world, general-use AI-based healthcare tools:
Popularity and Availability: Each app ranked among the most downloaded in the Health and Fitness or Medical categories on both the Apple App Store and Google Play Store in recent market analytics [
42].
User Ratings: Apps with a minimum average rating of 4.0 stars on both platforms were selected to ensure established user acceptance and quality [
42].
AI Integration: Apps were required to incorporate artificial intelligence features such as symptom checking, triage support, or diagnostic suggestions, aligning with the study’s focus on AI-powered healthcare usability.
Language and Accessibility: Apps had to be available in English and provide core functionalities (e.g., symptom checker) without requiring paid subscriptions or institutional licensing, ensuring accessibility for general users.
4.3. Participant Recruitment and Selection
4.3.1. User Participants
The study employed both inclusion and exclusion criteria to guide participant selection. A total of 30 participants (18 males, 12 females), aged between 18 and 65 years (mean = 33.4 years; standard deviation = 11.2), were recruited through purposive sampling to ensure diversity in health and digital literacy levels.
Inclusion Criteria:
Exclusion Criteria:
Individuals with visual or motor impairments that could affect app interaction.
Current or prior employment in app development, usability research, or digital health sectors.
4.3.2. Expert Evaluators
In addition to user testing, a separate group of five expert evaluators participated in the heuristic evaluation phase. According to Nielsen’s guidance, involving 3 to 5 experts is considered optimal for identifying the majority of usability issues [
46]. The selected evaluators held professional backgrounds in information technology and demonstrated relevant experience in usability testing, AI-based mobile applications, and healthcare systems, in accordance with established recommendations for heuristic evaluation [
47].
4.4. The Evaluation Process
To meet the study’s goal, expert and user evaluations were performed in a dual-method approach. Five raters independently evaluated the apps against the 13 proposed heuristics in Table 2, recording both qualitative observations and severity ratings. The severity of each problem was rated on a structured scale from 0 (no issue) to 4 (usability catastrophe) [48], considering its frequency, impact, and persistence; the scale is presented in Table 3.
The expert evaluations addressed issues such as user interface consistency, transparency, and the reliability of AI features. User testing sessions were carried out in parallel, in which participants completed the five main tasks listed in Table 4. Every session monitored four main indicators: task success, time on task, number of errors, and satisfaction level. Task success was measured using the success scale [26], errors were defined as actions that differed from those intended [49], and satisfaction was evaluated with the System Usability Scale (SUS).
Each participant’s performance was assessed on the basis of these four metrics.
Testing occurred in distraction-free environments with reliable internet access. Tools such as Zoom and Excel supported coordination and data collection. Smartphones, tablets, and headsets were used to simulate realistic conditions and ensure audio clarity during remote sessions.
4.5. Pilot Validation, Analysis, and Outcomes
A pilot study was conducted to validate the evaluation procedures [
50]. Experts reviewed the heuristics (
Table 2), and a user test identified some tasks as unclear; the necessary refinements were then made. Data from both qualitative and quantitative methodologies were analyzed: the heuristic checklists produced qualitative findings, while user testing yielded numerical measures (errors, task success, time). The analysis informed concrete design recommendations. The purpose of the study was to identify the shortcomings of AI-powered mHealth apps and to provide clear, visual, and actionable design guidance for future development. The results were expected to position AI as a competent tool in the healthcare system and to strengthen the user experience.
5. Results
This section presents the results of both expert-based and user-based evaluations of three AI-powered mHealth applications: ADA, Mediktor, and WebMD. It includes findings from heuristic assessments by domain experts and usability testing with participants, covering metrics such as usability problems, task success, completion time, errors, and user satisfaction.
5.1. Overview of Selected Applications and Pilot Study
Three AI-powered mHealth applications were chosen based on the set criteria: ADA, Mediktor, and WebMD. ADA is an AI-powered tool for health guidance and symptom assessment [43]. It combines human medical knowledge with intelligent technology, helping users access care and find appropriate medical help [43]. Mediktor is an AI-driven app used for symptom assessment and care navigation [44]. WebMD is one of the most popular health consultation platforms, providing health and disease information and supporting users in decisions about their own health [45]. The pilot study helped ensure that the evaluation design and process were sound, effective, and informative. Two sets of activities were carried out. An expert was invited via email to review the heuristic evaluation method; the reply, received after five days, indicated the need for definitions of the heuristics, more explicit examples to support expert insights, and adjustment of some heuristics to better facilitate assessment. User testing was then performed with one participant, who was asked to complete the evaluation tasks so that potential difficulties such as unclear instructions or incorrect task flow could be identified. Based on the expert and user feedback, improvements were made to the clarity, coverage, and applicability of each heuristic and the task instructions.
5.2. Expert Evaluation Based on the Customized Heuristics
Four experts were selected based on predefined criteria and invited via email to participate in the evaluation. Each expert received two documents: an evaluation guide outlining study objectives and task instructions, and a separate file detailing the customized heuristics. The experts were given seven days to assess the selected applications. The expert evaluation covered three mHealth AI-powered applications—ADA, Mediktor, and WebMD—using the proposed heuristics. Each expert assessed usability issues and assigned severity ratings from 0 to 4.
Table 5 groups and summarizes the 32 usability problems identified in ADA, clustering similar issues under thematic categories to improve readability while preserving the original insights.
The findings in
Table 5 highlight several recurring challenges in ADA’s user interface, particularly in areas related to AI explainability, error prevention, and multilingual support. The presence of high-severity issues in AI transparency (H11), user expectation setting (H12), and the accuracy of automation (H13) suggests that users may struggle to understand or trust AI-generated health insights. Additionally, critical gaps in accessibility, personalization, and user-assistance mechanisms indicate that significant usability improvements are needed to effectively accommodate a wider range of users.
Table 6 presents the expert evaluation of Mediktor, grouping 29 identified usability issues under thematic categories to maintain reporting consistency.
The expert evaluations for Mediktor reveal moderate to severe issues in terms of AI communication and user interaction flow. The most critical problems revolve around inadequate explainability and insufficient personalization options for users with varying needs.
Table 7 summarizes 38 usability issues in WebMD by grouping related issues under clear themes for comparative clarity.
The expert evaluation for WebMD highlights numerous usability concerns, especially in areas involving AI transparency, diagnostic accuracy, and content clarity. Severe issues were found in AI explainability and user support, with critical gaps in how the system presents information and guides users through decision-making. These findings suggest that WebMD, despite its widespread recognition, may present risks to users due to opaque system logic, inconsistent navigation, and limited customization options.
5.3. Comparative Severity Ratings by Heuristic Category
A clearer picture of the usability concerns in these AI-powered mHealth applications can be obtained through a comparative analysis of the severity scores assigned by the experts to the 13 heuristic categories in ADA, Mediktor, and WebMD. In the evaluation, every heuristic problem was rated on a scale of 0 (no issue) to 4 (usability catastrophe). The cumulative score for each heuristic was determined by multiplying the frequency of each severity level by its grade and summing the results, as sketched below.
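To make the scoring rule concrete, the short Python sketch below computes a cumulative severity score from hypothetical counts of issues at each severity level; the counts are illustrative placeholders, not the study's data.

```python
# Minimal sketch of the cumulative severity calculation described above.
# Counts are hypothetical: number of issues reported at each severity level (0-4).

def cumulative_severity(counts_by_level: dict[int, int]) -> int:
    """Sum of (severity grade) x (frequency of issues at that grade)."""
    return sum(level * frequency for level, frequency in counts_by_level.items())

# Hypothetical example: for one heuristic in one app, experts reported
# two level-3 issues and one level-4 issue.
example_counts = {0: 0, 1: 0, 2: 0, 3: 2, 4: 1}
print(cumulative_severity(example_counts))  # 3*2 + 4*1 = 10
```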
Figure 2 illustrates the cumulative severity ratings for each heuristic across the applications.
Figure 2 illustrates notable distinctions in heuristic severity across the three evaluated mHealth applications. ADA demonstrated the highest severity scores in Transparency and Explainability (H11) and User Expectations (H12), indicating substantial challenges related to AI communication and user trust. Mediktor, on the other hand, revealed critical issues in User Control and Freedom (H3) and Error Prevention (H5), reflecting concerns with navigation and input validation. WebMD displayed consistently high severity in AI Explainability (H11–H13) and Recognition and Recall (H6), suggesting deficiencies in conveying automation processes and supporting personalized interactions. This comparative visualization provides a clearer understanding of the most problematic heuristics in each application and informs prioritization for future design improvements.
5.4. User Evaluation Based on the Selected Applications
To evaluate usability from an end-user perspective, a user-based evaluation was conducted on the three selected AI-powered mHealth applications: ADA, Mediktor, and WebMD. Twelve participants with varying backgrounds, age groups, and device experience levels were recruited and observed while completing a standardized set of five core tasks. Their performance was documented across four key metrics: task success, task completion time, number of errors, and overall satisfaction.
5.4.1. Task Completion Success Rates
Each participant was instructed to complete five predefined tasks on each application. Task success was recorded as 1 for successful completion on the first attempt, 0.5 for partial completion (e.g., completed with assistance or multiple attempts), and 0 for failure to complete the task. The average success score for each task was then calculated per application to determine the ease of task completion and identify areas where users faced difficulty.
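As a minimal illustration of how these per-task averages can be computed, the Python sketch below applies the 1/0.5/0 scoring scheme to hypothetical participant scores; the task names and numbers are placeholders, not the study's data.

```python
# Illustrative computation of average task success per application,
# using the 1 / 0.5 / 0 scoring scheme described above (placeholder data).
from statistics import mean

# task_scores[app][task] -> list of per-participant success scores
task_scores = {
    "ADA":      {"Task 1": [1, 1, 0.5], "Task 2": [1, 0.5, 1]},
    "Mediktor": {"Task 1": [1, 0.5, 1], "Task 2": [0.5, 1, 1]},
}

for app, tasks in task_scores.items():
    for task, scores in tasks.items():
        print(f"{app} - {task}: average success = {mean(scores):.2f}")
```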
Figure 3 explores the average task success rates across the three applications—ADA, Mediktor, and WebMD—based on the cumulative performance of all participants in each task.
Figure 3 illustrates the comparative task success rates across the ADA, Mediktor, and WebMD applications. ADA and Mediktor demonstrate high average success rates across most tasks. WebMD exhibits comparable performance except for Task 2, which was unsupported across participant devices, resulting in a data gap.
5.4.2. Task Completion Time
To understand user efficiency, the average time to complete each task was calculated for each application. Task duration serves as a metric of both usability and complexity.
Figure 4 introduces the average time (in seconds) spent per task, indicating efficiency and complexity for each application.
Figure 4 shows the average time in seconds for each task across the three applications. ADA demonstrated consistent times, while Mediktor showed longer durations for Task 2 and Task 3. WebMD had the shortest durations overall but again lacked data for Task 2.
5.4.3. Number of Errors
Errors represent usability breakdowns where users perform unintended actions. Each observed mistake was counted per task and application.
Figure 5 introduces the total number of errors observed during task execution per application.
Figure 5 depicts the number of errors users made while performing tasks on each app. ADA users committed the highest number of errors in Task 2 and Task 3. Mediktor showed fewer errors overall, while WebMD had moderate errors with fewer in later tasks.
5.4.4. Overall Satisfaction (SUS Score)
User satisfaction was assessed using the System Usability Scale (SUS), where each participant rated their experience with each application. The SUS score ranges from 0 to 100, with higher scores indicating better perceived usability.
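For reference, the sketch below shows the standard SUS scoring procedure assumed here: odd-numbered items contribute (rating − 1), even-numbered items contribute (5 − rating), and the sum is multiplied by 2.5 to map onto the 0–100 range. The example ratings are hypothetical.

```python
# Standard SUS scoring: odd items contribute (rating - 1), even items (5 - rating);
# the total is multiplied by 2.5 to yield a 0-100 score.

def sus_score(ratings: list[int]) -> float:
    """Compute the SUS score from ten Likert ratings (each 1-5)."""
    if len(ratings) != 10:
        raise ValueError("SUS requires exactly 10 item ratings")
    total = 0
    for i, rating in enumerate(ratings, start=1):
        total += (rating - 1) if i % 2 == 1 else (5 - rating)
    return total * 2.5

# Hypothetical participant response
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```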
Figure 6 below displays individual user scores across the three applications.
Figure 6 demonstrates that ADA received the highest average satisfaction score (80.4), followed by Mediktor (72.0), while WebMD trailed behind with an average score of 56.8, reflecting comparatively lower perceived usability.
5.5. Inferential Statistical Analysis
To validate the usability differences observed in the descriptive results, inferential statistics were applied using data collected from 30 participants. This sample size satisfies the minimum requirement for parametric testing and ensures adequate statistical power for within-subject comparisons.
Before performing the inferential tests, the normality assumption required for repeated-measures ANOVA was assessed using the Shapiro–Wilk test:
ADA: W = 0.975, p = 0.687;
Mediktor: W = 0.984, p = 0.913;
WebMD: W = 0.963, p = 0.365.
All p-values were above 0.05, indicating that the SUS scores for each application are approximately normally distributed. This justifies the use of parametric statistical methods. Since the same participants evaluated all three applications, a repeated-measures ANOVA was conducted to compare user satisfaction (SUS scores) across the applications. This within-subjects approach aligns with the study design and appropriately accounts for the dependence between measures. To explore specific differences between pairs of applications, paired-sample t-tests were conducted.
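For transparency about the analysis workflow, the following Python sketch reproduces the reported pipeline (Shapiro–Wilk normality checks, repeated-measures ANOVA, and pairwise paired t-tests) using SciPy and statsmodels. The SUS values are randomly generated placeholders with arbitrary means and spreads, not the study's data.

```python
# Sketch of the inferential analysis pipeline, using placeholder SUS scores.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
n = 30  # number of participants, matching the study's sample size

# Placeholder SUS scores per application (one value per participant).
ada = rng.normal(81, 10, n)
mediktor = rng.normal(77, 10, n)
webmd = rng.normal(71, 10, n)

# 1. Shapiro-Wilk normality check for each application's SUS scores.
for name, scores in [("ADA", ada), ("Mediktor", mediktor), ("WebMD", webmd)]:
    w, p = stats.shapiro(scores)
    print(f"{name}: W = {w:.3f}, p = {p:.3f}")

# 2. Repeated-measures ANOVA on SUS scores across the three applications.
long_df = pd.DataFrame({
    "participant": np.tile(np.arange(n), 3),
    "app": np.repeat(["ADA", "Mediktor", "WebMD"], n),
    "sus": np.concatenate([ada, mediktor, webmd]),
})
print(AnovaRM(long_df, depvar="sus", subject="participant", within=["app"]).fit())

# 3. Pairwise paired-sample t-tests between applications.
pairs = [("ADA", ada, "Mediktor", mediktor),
         ("ADA", ada, "WebMD", webmd),
         ("Mediktor", mediktor, "WebMD", webmd)]
for name_a, a, name_b, b in pairs:
    t, p = stats.ttest_rel(a, b)
    print(f"{name_a} vs {name_b}: t = {t:.2f}, p = {p:.4f}")
```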
Figure 7 illustrates the mean SUS scores across ADA, Mediktor, and WebMD, with error bars representing the standard deviations, highlighting differences in user satisfaction.
As shown in
Figure 7, ADA achieved the highest mean SUS score (M = 81.29), followed by Mediktor (M = 76.51), and WebMD (M = 70.61). The standard deviations are visually represented as error bars.
These results indicate statistically significant differences among the three applications, supporting the descriptive findings presented earlier. ADA demonstrated the highest level of user satisfaction, followed by Mediktor, with WebMD receiving the lowest scores. The inclusion of data from 30 participants enhances the statistical power and robustness of the analysis. Overall, the inferential findings align with the usability patterns observed throughout the evaluation and reinforce the study’s conclusions about the relative usability of AI-powered mHealth applications.
7. Conclusions
This study conducted a triangulated usability evaluation of three AI-powered mHealth applications—ADA, Mediktor, and WebMD—using heuristic analysis, user testing, and automated inspection. While ADA demonstrated relatively higher usability in terms of SUS scores and task success, all apps showed critical shortcomings in transparency, user guidance, and explainability features. The findings revealed that users encountered significant navigational and input-related difficulties, especially in WebMD. Moreover, none of the apps presented confidence scores, rationale explanations, or robust feedback mechanisms, indicating low compliance with transparency and explainable AI (XAI) principles. Although heuristic and user evaluations identified consistent usability trends, these findings are limited by the small sample size and the descriptive nature of some metrics. As such, our conclusions are intended to highlight areas of improvement rather than assert definitive superiority among the apps. Future work should involve larger and more diverse user samples, as well as deeper evaluation of clinical accuracy and long-term user engagement. Designers and developers are encouraged to adopt XAI principles, ensure interface consistency, and enhance feedback systems to improve trust and usability in AI-driven healthcare applications.