4.1. User Engagement Scale
The mean and standard deviation of Group A’s UES for the UIs in the four dimensions (“aesthetic appeal: AE”, “focused attention: FA”, “perceived usability: PU”, and “reward factor: RW”) are displayed in
Figure 5. AE, representing visual attractiveness, was rated higher in the TUI (M = 4.04, SD = 0.56) than in the ARUI (M = 4.01, SD = 0.81). PU, covering aspects of perceived usability, was also rated higher for the TUI (M = 4.36, SD = 0.42) than for the ARUI (M = 4.07, SD = 0.60). RW, which represents positive experiential outcomes (e.g., willingness to recommend our app to others and having fun with the interaction), likewise scored a higher mean in the TUI (M = 3.96, SD = 0.55) than in the ARUI (M = 3.93, SD = 0.76). The overall engagement score is the sum of the average scores of the four UES dimensions (i.e., AE, FA, PU, and RW). Its maximum is 20, and the TUI (M = 15.09, SD = 1.64) received a higher score than the ARUI (M = 15.01, SD = 2.51). Unlike the previous three dimensions and the overall engagement score, FA, which represents the level of concentration, was rated higher in the ARUI (M = 3.01, SD = 1.00) than in the TUI (M = 2.73, SD = 0.80). To determine whether these differences were statistically significant, we conducted a paired sample
t-test.
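The aggregation of the overall engagement score described above can be sketched in a few lines; the dimension means below are the TUI values reported in the text, and the variable names are illustrative only.

```python
# Overall UES engagement score: the sum of the four dimension means.
# Each dimension is rated on a 1-5 scale, so the maximum sum is 4 * 5 = 20.
# The means below are the TUI values reported in the text.
tui_means = {"AE": 4.04, "FA": 2.73, "PU": 4.36, "RW": 3.96}

overall_engagement = round(sum(tui_means.values()), 2)
print(overall_engagement)  # 15.09
```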
Table 4 presents the paired sample
t-test (right-tailed) results with 23 degrees of freedom.
We used quantile–quantile plots (Q-Q plots) to confirm the normal distribution of the differences in the data. We then checked the p-value (p) and test statistic (t) to decide whether to accept the null hypothesis (H0: There is no difference in user engagement between the ARUI and TUI). Since the p of AE and RW were higher than 0.05, we failed to reject the null hypotheses, meaning there were no significant differences between the ARUI and TUI in the AE and RW dimensions of user engagement. For PU, p was less than 0.05, which suggested we might reject the null hypothesis. The 95% confidence interval pointed the same way, since it did not cross zero. However, the critical value in our case was 1.71 (df = 23, α = 0.05, right-tailed test), and the t of PU was −2.52, far below the critical value. Therefore, we failed to reject the null hypothesis for PU because its t fell outside the rejection region. On the other hand, FA showed a p less than 0.05, a 95% confidence interval that did not include zero, and a t of 2.49. Since the critical value in our case (df = 23, α = 0.05, right-tailed test) was 1.71, which is less than FA's t, we rejected the null hypothesis for FA and accepted the alternative hypothesis (H1: The ARUI increases user engagement compared to the TUI) for the FA dimension. In summary, there was no significant increase for the ARUI in three of the four UES dimensions: AE (t(23) = −0.21, p = 0.42, 95% confidence interval (CI) [−0.31, 0.24]), PU (t(23) = −2.52, p = 0.01, CI [−0.49, −0.09]), and RW (t(23) = −0.41, p = 0.34, CI [−0.17, 0.11]). However, the paired sample t-test showed a statistically significant increase for the ARUI in FA, t(23) = 2.49, p = 0.01, CI [0.09, 0.47], meaning that the ARUI provided more engagement than the TUI in the aspect of FA.
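The right-tailed decision rule applied above (reject H0 only when t exceeds the critical value of 1.71 at df = 23, α = 0.05) can be expressed as a minimal sketch; the t statistics are the per-dimension values reported in the text, and the function name is illustrative only.

```python
# Right-tailed paired t-test decision rule (df = 23, alpha = 0.05):
# reject H0 only when the test statistic exceeds the critical value.
T_CRITICAL = 1.71  # critical value for t(df=23), right-tailed, alpha = 0.05

def reject_h0(t_stat: float, t_crit: float = T_CRITICAL) -> bool:
    """Reject H0 iff t falls in the right-tailed rejection region."""
    return t_stat > t_crit

# Per-dimension t statistics reported in the text.
t_stats = {"AE": -0.21, "FA": 2.49, "PU": -2.52, "RW": -0.41}
rejected = {dim: reject_h0(t) for dim, t in t_stats.items()}
print(rejected)  # only FA is True
```

Note how PU fails the rule despite its small p: a negative t can never enter the right-tailed rejection region.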
Overall, the paired sample t-test showed no significant increase in the overall engagement score of the ARUI compared to the TUI (t(23) = −0.21, p = 0.42, CI [−0.73, 0.57]), indicating that the ARUI did not increase overall user engagement; however, FA was significantly increased by the ARUI.
Table 5 shows the UES results of Group B, who had used the PHOENIX solution with the TUI mobile app for three months. Since the sample size for this group was too small (
n = 5) to conduct inferential statistical analysis, descriptive statistics (M and SD) were calculated to check the central tendency and dispersion of Group B's data. Within Group B, the TUI's overall engagement score was higher than that of the ARUI because of the TUI's higher scores in the AE, PU, and RW dimensions. However, the FA score of the ARUI was higher than that of the TUI, which is consistent with Group A's result.
Figure 6 depicts the distribution of UI preference among the participants in Group B based on the means of AE, FA, PU, and RW. The ARUI was preferred by four participants (80%) in FA, while the TUI was chosen more frequently (60%) in the other dimensions, i.e., AE, PU, and RW. 'Equal' refers to cases where both UIs received the same score from a participant.
4.2. System Usability Scale
Figure 7 shows the comparison of each participant's SUS score between the TUI and ARUI. The mean overall SUS score of the TUI was 84.17 (SD = 8.71), and that of the ARUI was 78.13 (SD = 16.07). The bars with thick outlines indicate cases where the ARUI received a higher score than the TUI. Accordingly, six participants (P3, P5, P8, P11, P22, and P24) rated the ARUI higher than the TUI. To determine whether this difference was statistically significant, a paired sample
t-test on SUS was conducted, and
Table 6 presents its results.
The Q-Q plot was used to confirm the normality of the differences, and the paired sample t-test indicated no statistically significant increase for the ARUI in the right-tailed test of SUS scores, t(23) = −2.35, p = 0.01, CI [−10.45, −1.64].
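For reference, each individual SUS score compared above is derived from the 10-item questionnaire using the standard SUS scoring formula; the sketch below illustrates that formula, and the example response sets are hypothetical.

```python
def sus_score(responses):
    """Standard SUS scoring for 10 items rated 1-5:
    odd-numbered (positively worded) items contribute (rating - 1),
    even-numbered (negatively worded) items contribute (5 - rating);
    the summed contributions are scaled by 2.5 to a 0-100 range."""
    assert len(responses) == 10
    total = 0
    for i, rating in enumerate(responses, start=1):
        total += (rating - 1) if i % 2 == 1 else (5 - rating)
    return total * 2.5

# Hypothetical response sets: all-neutral answers land on the midpoint,
# while the most favorable pattern reaches the maximum.
print(sus_score([3] * 10))                          # 50.0
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))    # 100.0
```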
Table 7 shows Group B's SUS scores for the TUI and ARUI. Three participants (P26, P27, and P28) perceived the mobile app with the TUI as having better usability than the app with the ARUI. Nevertheless, the average SUS score of the app with the TUI was 69.00 (SD = 26.80), whereas the app with the ARUI was rated 72.00 (SD = 14.83), owing to the large difference in P29's SUS scores. P29's data could be treated as an outlier in a statistical analysis. However, given the different lengths of time spent experiencing each UI and the small sample size of Group B, inferential statistical analysis was inappropriate due to the low power of a statistical test on Group B's data. Therefore, we kept P29's data to gain insight into what caused this difference between the two UIs.
4.4. Qualitative Findings
We analyzed the feedback that the participants noted during the user evaluation. The feedback was grouped into four categories: AR companion, ARUI, data and interface design, and personalization.
4.4.1. AR Companion
As
Figure 9 illustrates, 62.50% of participants in Group A rated the ARUI higher than the TUI in FA. Participants explicitly expressed their interest in the AR companion during the user evaluation. One participant (P9) noted that interacting with the AR companion was fun, even though they had rated the TUI higher in AE, PU, and RW. We also received various proposals for providing more engagement by adding content that users could enjoy. For example, participants suggested adding more animations (P7, P17, and P29), audio dialogues or sound effects (P3), an option to change the AR companion to other animals (P2 and P10), and visual effects, such as icons, to emphasize the AR companion's expressions (P9, P17, and P20). While several participants focused on the entertainment aspect of the AR companion, two participants (P5 and P13) were interested in AR's intuitiveness for presenting data effectively. P5 noted the following:
“The pet (AR companion) represents the (room) condition explicitly, which means more easy to understand and more engaging”.
(P5)
Due to these aspects of entertainment and intuitiveness, 54.17% of participants scored the ARUI higher than the TUI in RW. Participants perceived that the ARUI could be valuable for other users, such as kids [46] or older adults [47], who might prefer a UI with improved user engagement.
4.4.2. ARUI
Participants in both groups noted the inconvenience of using their hands for the ARUI. They especially complained about the requirement of aligning a hand with the camera to spawn the AR companion (P11, P22, P24, P26, P27, P28, and P29). Some participants in Group A commented that they preferred the TUI over the ARUI due to the familiarity of its UI design compared with apps designed for similar but different tasks, such as monitoring a robot cleaner, tracking parking prices, logging electricity usage, and managing a smart home (P1, P2, P6, P9, P10, and P17). In contrast to the ARUI, the common characteristic of those apps was that participants could see the data right after logging in. With the ARUI, participants had to take additional steps (i.e., holding up the camera and showing a hand) to reach the data, and we speculate that this affected their UI preference. P9 mentioned the following:
“Depending on the users and context of use, the process to get information should be quicker and simpler or engaging to play”.
(P9)
In addition, the participants were accustomed to reading numbers and graphs to gain information; therefore, understanding the condition of rooms through the AR companion could be fun but less satisfying, since details such as numeric data were missing. Hence, displaying the AR companion on an object in the room (i.e., a physical object) or in the air (i.e., markerless) would simplify the process of data acquisition compared with aligning a hand with the mobile camera (P9), and a dialogue box on the AR companion for displaying information (P9) could be a potential solution for the missing details.
4.4.3. Data and Interface Design
Participants in Group A complained about the quality of the notification messages. Due to unfamiliar terminology and the ambiguous meaning of the sentences (P5, P18, P20, and P23), participants in Group A had a hard time understanding the notification messages. Moreover, this ambiguity might cause energy optimization to fail through unintended effects. P23 noted the following:
“My action can still affect a failure in energy optimization if the advice from the (PHOENIX) server is not clear and appropriate. For example, if CO2 is too high, (the recommendation would ask to) open windows or use a ventilation system to refresh the indoor air. But that also leads to decrease the room temperature if I open a window in the winter season. This is not saving energy because if I feel cold after opening the windows, I would increase the temperature, so to use more energy”.
(P23)
To support understanding of room and apartment conditions, participants also mentioned the need for a wider variety of sensor data, since our apps provided temperature and CO2 only. Data such as electricity usage (P1 and P2), water usage (P11), temperature history (P14), sensor location (P17, P18, and P23), and the costs saved by energy optimization (P23) were suggested as additional information. In this context, additional data visualization methods, such as line graphs (P10 and P14), bar charts (P2), and 2D maps (P4, P11, P12, P17, P18, P20, and P23), were requested to deliver the data more effectively than a simple list. While some participants pointed out the missing sensor data, other participants (P4 and P7) noted the lack of supportive materials, such as a tutorial, for understanding the UIs.
In relation to the lack of tutorials, unexpected UI behavior caused confusion among participants. For example, whenever a room was selected in the ARUI, the app read the latest notification message for the selected room aloud. This behavior could only be prevented by clicking the microphone button to mute the audio and was never explained in the app (P7). Another example was the voice command feature, which required the participant to speak specific words to activate a command. The app showed the available words while the system listened for user input (i.e., a voice command) for 10 s. As a consequence, participants often gave a voice command after the listening session had already ended because of the time spent reading the on-screen instructions. Since they were unaware of whether the system was still waiting for their input, the participants were annoyed and confused by the failure. Some participants (P15 and P20) liked the voice command feature, whereas many participants experienced this malfunction. If there had been tutorials explaining in detail how to control the audio and use voice commands, the participants could have avoided this discomfort. Supporting natural language for the voice commands could resolve this inconvenience as well (P12).
4.4.4. Personalization
Participants in Group A left feedback regarding personalized system configuration for the ARUI. For example, although the apps already provided four font sizes and two interface theme colors, participants wanted to switch to other sizes and colors and set them as the default (P4, P10). In addition, participants wanted to adjust the room temperature threshold for receiving a notification message (P1, P19) to fit their residence and daily energy usage pattern. Furthermore, participants who were less fond of the cat wished to change the AR companion's species to another animal, such as a dog (P2, P10). Automatic energy management was one of the system features that participants in Group A demanded, as it would enable one-time configuration instead of periodic monitoring. For example, once a rule for the energy management system was set up based on time, date, current room temperatures, residents' presence, monthly energy cost, weather conditions, and outdoor temperatures (P1, P2, P8, and P10), residents could take advantage of the PHOENIX service without even being conscious of the notification messages.