The results of the UX and usability tests of our health monitoring system are described in the following. An overview of the questionnaire results is shown in Table 1. For each system component, perspicuity was rated using the UEQ+ questionnaire; for sessions 2 and 3, the SUS questionnaire was additionally used. We calculated the means of the UEQ+ ratings and their corresponding rated importances, along with the component and overall system KPIs, as well as the overall SUS scores.
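For reference, the KPI in the UEQ+ framework is typically computed as the importance-weighted mean of the scale means (sketched here in our notation; the exact weighting follows the UEQ+ handbook). With scale means $m_i$ and rated importances $w_i$ for the $n$ rated scales:

\[
\mathrm{KPI} = \frac{\sum_{i=1}^{n} w_i \, m_i}{\sum_{i=1}^{n} w_i}
\]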
3.1. Session 1: User Group A in Residential Care
In session 1, we tested our system with user group A, which consisted of eight residents of a nursing home. In addition, one of their caregivers participated in this session. The results in Table 1 include only the eight older participants, as they belong to the target group of older adults. Overall, the user experience of the health monitoring system was rated highly in session 1. The KPIs, each on a scale from −3 to 3, are all above 2: the interaction with the robot Pepper was rated highest (2.88), the connected health devices lowest (2.25), and the app in between (2.71), resulting in an overall system KPI of 2.51. The caregiver also gave high ratings, with an overall system KPI of 2.50, indicating a high UX from the care perspective as well.
Regarding the use of the health devices for measurements, we observed that some participants struggled to put on the wearable and the blood pressure monitor by themselves. All participants were able to carry out the body temperature measurement correctly, while some had problems with sending the measured value to the server, which is done via a button on the thermometer. After another explanation and/or demonstration, most participants managed to both measure and send the value to the server correctly.
When using the data visualization app, some participants tapped the buttons in a manner the app did not respond to, e.g., by pressing a button for too long. They therefore had to tap the button repeatedly before it triggered the corresponding action or page. Overall, most participants were able to find the values we asked them to look for, i.e., their body temperature and blood pressure. One participant stated that they would prefer to use the app on a computer; this is also possible, as the application was developed using a multi-platform framework. Additionally, almost all participants stated that they had a positive impression of both the devices and the app.
In the context of a health assessment, which is intended to relieve care professionals of documentation tasks, both data accuracy and data completeness are crucial. We evaluated data accuracy by calculating the success rate, which we define as the number of error-free interaction turns in relation to the total number of turns in a session. In this evaluation, we considered only the actual health assessment and excluded the introduction. We defined a turn as the complete sequence of a robot question, the user response, and the subsequent robot feedback.
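In our notation, with $n_{\mathrm{ok}}$ error-free turns out of $n_{\mathrm{total}}$ recorded turns in a session, this reads:

\[
\text{success rate} = \frac{n_{\mathrm{ok}}}{n_{\mathrm{total}}} \times 100\%
\]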
In session 1, over the course of all assessments of the eight older users, we recorded a total of 148 turns, of which 114 were error-free. This results in an overall success rate of 77.03%. The total number of turns includes 24 items that were queried more than once in the same assessment due to faulty system routing. In three cases, an assessment had to be aborted and restarted. In each of these cases, the participants had directed questions at the research team; the system logs show that these off-topic remarks were picked up by the system, misinterpreted, and caused faulty routing. When these aborted assessments are excluded, a success rate of 80.15% results from a total of 136 turns with 109 error-free turns. Furthermore, the logs show that the most common source of error in session 1 was participants using invalid answer options. This occurred a total of 15 times and thus accounts for approximately 44% of the off-script events. We categorized the detected off-script events and present them, along with their absolute frequencies, in Table 2.
In terms of the completion rate, an assessment is considered complete if data on all required items were collected. Based on the system logs, we found that four of the eight valid assessments in session 1 were completed, resulting in a completion rate of 50.00%. The success and completion rates for all sessions are listed in Table 3. As five of the eight older participants in session 1 had previous experience with the robot Pepper, we also calculated the success and completion rates separately for the experienced (group A1) and the inexperienced (group A2) users, as listed in Table 4. Group A1 achieved 79 successful turns out of a total of 95, resulting in a success rate of 83.16%. Of the five interactions conducted, three were completed in full, resulting in a completion rate of 60.00%. Group A2, the users without previous experience with the robot, scored a lower success rate of 66.04%, with 35 successful turns out of a total of 53. One of these three interactions was completed, resulting in a completion rate of 33.33%. Additionally, the caregiver, who is not included in this calculation as they are outside the user group of older adults, achieved a success rate of 93.75% and completed the health assessment.
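To make the two metrics concrete, the following minimal Python sketch (our illustration only, not part of the evaluated system) reproduces the session 1 figures for the older participants from logged turn outcomes:

```python
def success_rate(turn_outcomes: list[bool]) -> float:
    """Share of error-free turns among all recorded turns."""
    return sum(turn_outcomes) / len(turn_outcomes)

def completion_rate(completed: int, total_assessments: int) -> float:
    """Share of assessments in which data on all required items were collected."""
    return completed / total_assessments

# Session 1 (older adults only): 114 error-free turns out of 148,
# and 4 of 8 assessments completed in full.
turns = [True] * 114 + [False] * 34
print(f"success rate:    {success_rate(turns):.2%}")    # 77.03%
print(f"completion rate: {completion_rate(4, 8):.2%}")  # 50.00%
```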
3.1.1. Evaluation 1: Conclusions
Overall, the user experience in session 1 was rated highly. Some participants struggled with the use of the devices and the handling of the tablet when using the app. Additionally, while the success rate of the interactions with the social robot was 77.03%, the completion rate of the health assessment was only 50.00%. We therefore introduced the following optimizations into the system, aiming to increase the completion rate in the subsequent test sessions.
3.1.2. Optimizations
Following an iterative development model in the UX design process, we implemented the optimizations before expert users were confronted with the system in session 2. The main factor influencing the completion rate of the health assessment was the overall number of off-script events (34), which led to wrong turns in the conversation with the robot and subsequently to aborted interactions. To improve this, we implemented the health assessment questionnaire using the Rasa forms feature, as well as custom actions, for test sessions 2 and 3. If, during an interaction with the robot Pepper, the user's verbal input cannot be recognized or matched to an intent, a question mark is displayed on the robot's tablet to prompt the user to repeat and/or rephrase their input. As part of the optimization, we added this information to the introduction of the health assessment. Furthermore, we observed that participants frequently gave premature voice commands while the robot was transitioning from the introduction to the assessment. As a result, only part of the input was captured by the robot, leading to false routing between items. To prevent this and improve the success rate of the interactions, we implemented a robot prompt that asks the user to wait two seconds while the robot switches to the health assessment. In session 1, we included only 14 of the health assessment questions plus one conditional question, in order to first test a simplified version of the assessment. After session 1, we implemented the full set of 31 questions for the subsequent tests, which are listed in Appendix A.
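To illustrate the forms-based approach, the following is a minimal sketch of a Rasa SDK validation action; the form name, slot name, answer options, and the utter_please_rephrase response are hypothetical stand-ins rather than our actual assessment items:

```python
from typing import Any, Dict, Text

from rasa_sdk import Tracker
from rasa_sdk.executor import CollectingDispatcher
from rasa_sdk.forms import FormValidationAction


class ValidateHealthAssessmentForm(FormValidationAction):
    """Validates the slots requested by a hypothetical health_assessment_form."""

    def name(self) -> Text:
        return "validate_health_assessment_form"

    def validate_pain_level(
        self,
        slot_value: Any,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> Dict[Text, Any]:
        # Accept only the predefined answer options; any off-script input
        # resets the slot, so the form re-asks the same item instead of
        # routing to the wrong question.
        allowed = {"none", "mild", "moderate", "severe"}
        if isinstance(slot_value, str) and slot_value.strip().lower() in allowed:
            return {"pain_level": slot_value.strip().lower()}
        dispatcher.utter_message(response="utter_please_rephrase")
        return {"pain_level": None}
```

Because a form keeps requesting a slot until it holds a valid value, off-script answers reset the slot and trigger a re-prompt instead of causing faulty routing between items.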
We also made several design adjustments to the mobile app. These include an introduction explaining how to use the app and visual support for using the app in both portrait and landscape mode. Finally, we implemented stronger haptic feedback when tapping buttons, to make users more aware of whether they had actually tapped a button correctly.
3.2. Session 2: User Group B in Lab Tests
In session 2, we conducted lab tests with five expert users who have previous experience not only with using but also with developing chatbots or robots. The user experience was again rated highly for all system components by this group. The KPIs for the health assessment (2.45) and the app (2.35) were rated higher than for the sensors (1.55), resulting in an overall system KPI of 2.1. The system usability, evaluated with the SUS, was rated at 85 out of a maximum possible value of 100, indicating very good usability. Additionally, based on our observations, the participants had only very minor difficulties using the health devices, and all of them were able to find their measured body temperature and blood pressure in the app. Some participants also stated that the app had a clear layout and was intuitive to use.
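For reference, the SUS score is computed in the standard way from the ten 5-point items, with ratings $r_i \in \{1, \ldots, 5\}$:

\[
\mathrm{SUS} = 2.5 \left( \sum_{i \in \{1,3,5,7,9\}} (r_i - 1) + \sum_{i \in \{2,4,6,8,10\}} (5 - r_i) \right)
\]

yielding scores between 0 and 100.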
In session 2, we recorded a total of 167 turns over the course of five assessments. Based on logs and transcripts, 144 turns can be considered error-free, resulting in a success rate of 86.23%. The most common causes of error were (1) participants deviating from the predefined answer options and (2) the system registering no speech input, each occurring seven times over the course of this session. No assessment had to be aborted; on the contrary, each of the five assessments was completed in full, resulting in a completion rate of 100%. Several participants stated that the health assessment was too long and sometimes lacked variety in the robot's replies. We also observed that some participants tried to test the robot by giving more complex answers than the predefined options. When this approach was unsuccessful, they reverted to short answers. Additionally, in session 2, we identified several errors in the training data, which we resolved before session 3.
3.3. Session 3: User Group C in Assisted Living
In session 3, we tested our system with user group C, who are residents of assisted living communities and have extensive previous experience with the social robot Pepper, as one of these robots lives in their community. Again, the KPIs of the robot (1.33), the devices (1.42), and the app (1.92) were rated highly, albeit lower than in sessions 1 and 2. The app was rated best, with the devices rated only slightly higher than the robot. The overall system KPI of 1.45 indicates a good rating, while the overall system usability of 67.5 out of 100 indicates average usability. Regarding the health devices, the participants had difficulties on their first measurement attempts. However, on a second attempt, all participants were able to conduct the measurements by themselves. Some participants also mentioned their interest in using the devices and liked their modern design. When using the mobile app, we observed that the participants were unsure in which category to find the measured values. However, after finding the first value we asked them to look for (body temperature), they quickly found the other values as well.
Session 3 was conducted with three participants, so a lower total of 103 turns was recorded. With 84 error-free turns, this results in a success rate of 81.55%. Based on transcripts and system logs, we identified that the most common cause of error differed from that of the previous sessions. In session 3, the system not recognizing speech input was the most prevalent issue, occurring 10 times and thus causing over 52% of the errors. The main consequence is that users have to repeat themselves. However, we identified two turns in which the system did not register any user input in the logs, yet triggered a robot utterance. In session 3, all assessments were again completed in full, resulting in a completion rate of 100%. Additionally, we occasionally observed slow responses from the robot, which can be indicative of a slow internet connection and also affects how well and how quickly the robot's speech recognition functions.
3.4. Overall Results
The overall results of the study are visualized in Figure 6. The participants of session 1 rated the user experience of each individual component and the overall system the highest. The UX ratings in session 2 were lower, with the lowest values in session 3. In particular, the rating of the robot-assisted health assessment was noticeably lower in session 3 than in sessions 1 and 2. Additionally, the system usability was rated as high in session 2 and as average in session 3.
Regarding the success rates of the conversation turns in the interaction with the social robot for the health assessment, only small differences can be seen. The success rates in sessions 1 and 3 are similar at around 80%. The success rate in session 2 with the expert users is slightly higher at 86.23%. Large differences can be seen in the completion rates of the health assessment after the first session. While only five of the nine assessments (including the caregiver's) could be completed in session 1, all assessments were completed in sessions 2 and 3. Overall, the results indicate good user experience and usability of our health monitoring system.
Additionally, the combined results of sessions 1 and 3 are shown in Figure 7. Minor changes, mostly of a technical nature, were introduced into the system after session 1, as described in Section 2.4. Both sessions 1 and 3 were conducted with the target group of older adults, thus giving a good indication of the system's rated UX for this group. The combined KPIs of all system components, as well as the overall system, are rated very highly in sessions 1 and 3, with all values above 2.