## **3. Results**

The results focus on the participants' subjective questionnaire ratings of system usability, trust and workload, and are divided into one subsection per questionnaire. The subjective ratings are analysed using a multifactorial analysis of variance with repeated measurements (MANOVA) at a significance level of *punc* ≤ 0.05. The index of the *p*-value indicates the specific analysis: *pint* stands for the interaction of the relevant group factor with the factor time, *pt* for the factor time, and *pgr* for the group factor. In addition, Pearson correlations were calculated to examine the relationships between the different factors. The method used to process the gathered results is available in Clement et al. [44].

#### *3.1. Expression of Trust*

The evaluation of the expression of trust uses the mean value (Q0) to analyse the participants' trust within the sample throughout the experiment.

A tendency towards a significant interaction can be seen in combination with ADAS pre-experience (*F* = 2.308, *pint* = 0.136, *η*2 = 0.048). The group with ADAS pre-experience shows a smaller increase in the overall response than the group without ADAS pre-experience, as can be seen at the Q0 values on the far left of Figure 5. For the group with ADAS pre-experience, the ratings before and after the experiment are at a similar level, with only a slight increase in the post-test results. For the factor time, the rating changes highly significantly from 4.58 to 4.98 points (*F* = 10.160, *pt* = 0.003, *η*2 = 0.181). For the factor ADAS pre-experience, a tendency towards a significant difference can be seen (*F* = 2.940, *pgr* = 0.093, *η*2 = 0.060).
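As a plausibility check on the reported effect sizes, the sketch below recovers a partial eta squared from an *F*-value and its degrees of freedom. It rests on two assumptions that the text does not state: that the reported *η*2 is a partial eta squared, and that the error degrees of freedom are 46 (one effect degree of freedom for a two-level factor). The function name `partial_eta_squared` is hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    if df_effect <= 0 or df_error <= 0:
        raise ValueError("degrees of freedom must be positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Assumption: df_effect = 1 and df_error = 46 are NOT given in the text;
# they are chosen here only because they reproduce the reported values.
eta_time = partial_eta_squared(10.160, df_effect=1, df_error=46)
print(round(eta_time, 3))  # → 0.181
```

Under these assumed degrees of freedom, the interaction term (*F* = 2.308) yields 0.048 in the same way, which matches the value reported above.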

The mean values resulting from the Expression of Trust (EOT) questionnaire before and after the experiment are shown in Figure 5, including their standard deviation. The results are split into two key groups, with and without ADAS pre-experience. The figure shows this for the mean value (Q0) and for the questions Q1 to Q7. The details are given in Table 1. The single questions show various noticeable and significant differences. In Q1 "I understand how the system works—its goals, actions and output", the interaction (*F* = 10.619, *pint* = 0.002, *η*2 = 0.188) and the factor ADAS pre-experience itself (*F* = 6.462, *pgr* = 0.014, *η*2 = 0.123) show a significant difference. Further, in Q3, the interaction reveals a significant difference between the two groups of the factor (*F* = 6.384, *pint* = 0.015, *η*2 = 0.122), as indicated in the empirical analysis above. In Q5, the factor ADAS pre-experience itself narrowly misses significance (*F* = 3.962, *pgr* = 0.052, *η*2 = 0.079), which is also indicated by the different empirical levels of the group answers. Ratings of participants without ADAS pre-experience are higher for all answers in the post-test questionnaire compared to the pre-test questionnaire. The ratings provided by the participants with ADAS pre-experience are less consistent, with a higher variance before and after the experiment.

**Figure 5.** Expression of trust for question 0 (Q0—mean value) and detailed question 1–7 (Q1–7) with and without ADAS pre-experience, before and after the test.

**Table 1.** Mean values (m), standard deviation (sd) and ANOVA-results for ADAS pre-experience x Time.


Note: (*F* ... *F*-value, *pgr* ... *p*-value for Group ADAS pre-experience, *pint* ... *p*-value for interaction, *η*2 ... effect size) of the expression of trust for Q0 to Q7 for the factors ADAS pre-experience (with/without) and time of measurement (before/after).

The factor gender is illustrated in Figure 6 for the mean value (Q0) and the single questions Q1 to Q7. Results show a tendency towards the same significant difference for time of measurement (before/after) for both groups, male and female. However, although they reflect the overall change of trust through the simulator session, they do not show a significant interaction regarding differences in Q0 (*F* = 1.733, *pint* = 0.195, *η*2 = 0.036). The global picture shows that the female group tended to rate the system lower than the male group before the experiment. After the experiment, the increase of the female group's rating was higher than that of the male group, but the rating still did not surpass the level of the male group.

In Table 2, results for the analysis of variance for the expression of trust are given for the interaction between the factor time and the factor gender as well as the main effect for the factor gender. The main effect regarding the factor time is the same as mentioned in the analysis of ADAS pre-experience and not outlined separately. Significant differences can be seen for the interaction in Q4 (*F* = 5.230, *pint* = 0.026, *η*2 = 0.102) and for the factor gender in Q1 (*F* = 5.510, *pgr* = 0.023, *η*2 = 0.107).

**Figure 6.** Expression of trust for question 0 (Q0—mean value) and detailed question 1–7 (Q1–7) for male and female groups, before and after the test.

**Table 2.** Mean values (m), standard deviation (sd) and ANOVA-results for Gender x Time.


Note: (*F* ... *F*-value, *pgr* ... *p*-value for group gender, *pint* ... *p*-value for interaction, *η*2 ... effect size) of the expression of trust for Q0 to Q7 for the factors gender (m/f) and time of measurement (before/after).

Analyses of the remaining demographic subgroups on the mean value (Q0) show no significant differences for the factor age (*F* = 0.020, *pgr* = 0.980, *η*2 = 0.001) or for its interaction with the factor time (*F* = 0.183, *pint* = 0.834, *η*2 = 0.008). For the factor driving experience, the values likewise show no significant differences (*F* = 1.075, *pgr* = 0.350, *η*2 = 0.046), nor does its interaction with the factor time (*F* = 1.045, *pint* = 0.360, *η*2 = 0.044).

#### *3.2. NASA TLX*

The NASA TLX was performed as a raw test without the pairwise weighting of the questions. The evaluation of the questionnaire reveals that the factor age group (*F* = 3.481, *pgr* = 0.039) had a significant influence on participants' ratings of Q5. Older participants and those with a lower annual mileage report more effort to accomplish their performance. Moreover, the factor driving experience (*F* = 4.278, *pgr* = 0.019) had a significant influence on participants' ratings of Q6. These factors as well as the factor ADAS are depicted in Figure 7. No significant interactions were found in the two-way analysis of variance, as shown in Table 3. The differences are noteworthy for Q5. For the group with less than 5000 km/year, a descriptive analysis shows lower success for the questions Q4, Q5, and Q6.
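The raw (unweighted) scoring mentioned above can be sketched as follows: instead of weighting the six subscales by pairwise comparisons, the overall workload is taken as their plain mean. The subscale names and their mapping to Q1 to Q6 follow the standard NASA TLX ordering and are an assumption here, as are the function name `raw_tlx` and the example ratings, which are not data from the study.

```python
from statistics import mean

# Assumed mapping of Q1..Q6 to the standard NASA TLX subscales.
TLX_ITEMS = (
    "mental_demand",    # Q1
    "physical_demand",  # Q2
    "temporal_demand",  # Q3
    "performance",      # Q4
    "effort",           # Q5
    "frustration",      # Q6
)


def raw_tlx(ratings: dict) -> float:
    """Return the unweighted mean of the six subscale ratings (raw TLX)."""
    missing = [item for item in TLX_ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return mean(float(ratings[item]) for item in TLX_ITEMS)


# Hypothetical ratings for one participant, not taken from the study.
example = {
    "mental_demand": 20, "physical_demand": 10, "temporal_demand": 15,
    "performance": 30, "effort": 25, "frustration": 20,
}
print(raw_tlx(example))  # → 20.0
```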


**Table 3.** Results of the two-way ANOVA analysis for the NASA TLX.

Note: \*\* indicates *p* ≤ 0.01, and \* indicates *p* ≤ 0.05. X\* indicates a significant difference for the factor X, and (X vs. X)\* indicates a significant interaction.

**Figure 7.** NASA TLX for questions 1–6 (Q1–6) for the groups with and without ADAS pre-experience, the groups of driving experience per year, and the age groups.

#### *3.3. System Usability Scale*

The system usability evaluated for each demographic sample group shows results around the overall value of 77.49 out of 100. The value is generated from the sum of all single questions for each subject multiplied by 2.5 to correspond to the target maximum scale of 100 according to Rauer [49]. Figure 8 shows the results for each demographic subgroup within the participants. The groups are all in a similar range and distribution.
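The scoring rule described above can be illustrated with a short sketch. It assumes the standard SUS layout, which the text does not spell out: ten items rated from 1 to 5, with odd-numbered items worded positively and even-numbered items worded negatively, so that each item contributes 0 to 4 points before the sum is multiplied by 2.5. The function name `sus_score` is hypothetical.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten ratings on a 1-5 scale.

    Assumes the standard SUS polarity: odd items positive, even items negative.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten item ratings")
    total = 0
    for index, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must lie between 1 and 5")
        # Positive items contribute (rating - 1); negative items are inverted.
        total += (rating - 1) if index % 2 == 1 else (5 - rating)
    return total * 2.5


print(sus_score([3] * 10))  # → 50.0
```

With this convention, each inverted item contributes at most 4 points, which is consistent with single-item values being reported on a scale of 0 to 4.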

A two-way analysis of variance revealed only one significant interaction, between the age group and the driving experience (*F* = 4.347, *pgr* = 0.005). This may be explained by the small number of participants in the age group of 46+ years, which results in a biased *p*-value that is sensitive to the distribution within a small sample. Nevertheless, according to Bangor et al. [46], the system used reflects an overall score in the 3rd quartile, which is acceptable and corresponds to a rating between good and excellent for the whole sample.

**Figure 8.** SUS score for the demographic sample groups.

The results for single questions are shown in Figure 9. Since the questions are positively and negatively polarised, the negative ones were inverted prior to the analysis. Results suggest an easy-to-use and quickly adjustable system with no unnecessary complexities. Its usage is not cumbersome and it demands no assistance to be used. The results also reveal that the system has some inconsistency (2.67 out of 4) and is not fully convincing (2.64 out of 4) to the overall sample, but still far better than the 50% quantile.

**Figure 9.** SUS scores for the single questions for the groups with and without ADAS pre-experience, inverted for negative questions.

#### *3.4. Correlations between Workload, Trust and Usability*

To identify the correlation between the participants' ratings, Pearson correlations between all questionnaires and single items were calculated. In Table 4, an overview of the major calculations is given using the mean values (Q0) for the SUS and the EOT, as they have higher reliability due to being built on the sub-questions of the related questionnaires and therefore get more information into the variance. The values show the correlation *r* and their significance.


**Table 4.** Pearson correlations between system usability scale (SUS), expression of trust (EOT) and NASA TLX items.

Note: \*\*\* indicates *p* ≤ 0.001, \*\* indicates *p* ≤ 0.01, and \* indicates *p* ≤ 0.05.

The results reveal the following significant findings. SUS (Q0) and EOT before (Q0) showed a highly significant, moderate positive correlation (*r* = 0.37, *p* = 0.01). SUS (Q0) and EOT after (Q0) showed a highly significant, high positive correlation (*r* = 0.63, *p* = 0.00). EOT before (Q0) and EOT after (Q0) showed a highly significant, high positive correlation (*r* = 0.75, *p* = 0.00). Scatter plots of these correlations can be found in Figure 10. These positive correlations imply that participants who rated the system usability lower also rated the expression of trust (before and after) lower, and those who rated the system usability higher also have a higher expression of trust. The correlation between SUS and EOT after is higher than that between SUS and EOT before. Moreover, significant correlations between several NASA TLX single items and the SUS (Q0), the EOT (Q0) before, and the EOT (Q0) after were found, as illustrated in Table 4.
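For readers who want to reproduce this kind of analysis, the following minimal sketch computes a Pearson correlation coefficient from paired ratings and maps a *p*-value to the star notation used in the note of Table 4. The function names `pearson_r` and `significance_stars` are hypothetical, the example data are invented, and the *p*-value itself is assumed to come from a statistics package, since the standard library offers no *t*-distribution.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def significance_stars(p_value):
    """Map a p-value to the star notation of the table note."""
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""


# Hypothetical SUS and trust ratings, not data from the study.
sus = [60.0, 70.0, 77.5, 85.0, 92.5]
trust = [4.0, 4.5, 5.0, 5.0, 6.0]
print(round(pearson_r(sus, trust), 2), significance_stars(0.013))
```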

**Figure 10.** Pearson correlations for the combination of the accumulated questions Q0 with best fitted line. (*r* ... Pearson correlation coefficient, *p* ... significance of the correlation).

Where significant correlations exist between the NASA TLX and the SUS, and between the NASA TLX and the EOT after the experiment, they are consistently negative. This means that the higher participants rate the workload, the lower they rate the usability and trust. These results reflect a high internal validity of the experiment. Furthermore, these correlations suggest a high validity of the applied questionnaires, as the results point to a consistent and contextual evaluation within the measurement tools.

To check for spurious correlations and improve understanding of the correlations, especially regarding correlations' linearity and data outliers, the correlations are visually analysed using a scatter plot and the correlation line. The major correlations are depicted in Figure 10. The scatter plots do not show a tendency for a non-linearity and are well scattered around the correlation line. A visual comparison of the expression of trust (Q0) before and after the experiment reveals a general increase of trust, as there are more data points above the 45° dotted line. A point on the dotted line would mean an equal rating in Q0 before and after the experiment and indicates no change throughout the experiment. A data point below the dotted line indicates a decrease of trust, and a data point above it means an increase of trust. This approach supports the findings of the analysis of variance reported in Section 3.1. Apart from the inter-questionnaire correlations, the highest correlation and significance levels relate to the system usability scale with questions SUS Q1 (would use the system frequently) and SUS Q9 (found the system to be convincing) in correlation to the post-testing measurement of the expression of trust with questions Q2, Q4, Q6, and Q7.
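The visual reading of the 45° line can also be expressed numerically. The sketch below counts how many participants lie above, on, or below the identity line when the rating after the experiment is plotted against the rating before it. The function name `classify_trust_change` and the ratings are hypothetical and serve only as an illustration.

```python
def classify_trust_change(before, after, tolerance=0.0):
    """Count points above, on, and below the 45-degree identity line."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    counts = {"increase": 0, "no_change": 0, "decrease": 0}
    for pre, post in zip(before, after):
        delta = post - pre
        if delta > tolerance:
            counts["increase"] += 1      # point above the identity line
        elif delta < -tolerance:
            counts["decrease"] += 1      # point below the identity line
        else:
            counts["no_change"] += 1     # point on the identity line
    return counts


# Hypothetical Q0 ratings, not data from the study.
before = [4.0, 5.0, 3.5, 6.0, 4.5]
after = [4.5, 5.0, 4.5, 5.5, 5.0]
print(classify_trust_change(before, after))
# → {'increase': 3, 'no_change': 1, 'decrease': 1}
```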

## **4. Discussion**

We quantified the increase of trust in the "AD system under test" by exposing participants to a driving simulator and evaluating their subjective perception before and after that experiment session. As expressed by the mean value (Q0) of the Expression of Trust questionnaire, the increase of trust was significant for all groups between the two points of measurement. This study shows an appropriate way to increase and evaluate trust in a simulator study.

Participants with ADAS pre-experience entered the study with higher confidence in such systems compared to the group without ADAS pre-experience, as depicted in Figure 5. Despite the high starting confidence level, the simulator session increases their trust in the AD system, indicating a high level of validity between the driving simulation and the real-world experience. There is a significant increase in trust for both groups. The group without ADAS pre-experience shows a much higher growth than the pre-experienced group. Apart from the single questions Q1 and Q3 of the EOT, no significant difference can be seen between the groups. Q1 hints that the pre-experienced group could not gain further understanding, whereas the inexperienced group could significantly increase their knowledge of such a system. We found that pre-experienced participants might see a negative impact on their driving style the more they learn about the possibilities of automated driving systems and their driving behaviour in safety-critical situations, as a decrease in Q3 was found. In contrast, the group without ADAS pre-experience sees a positive impact on their driving style, as they express a willingness to use the system in the future.

The analysis of demographic differences and similarities between genders shows the same tendency towards gaining trust (Q0) in the system through the simulator experience. The subgroups behave similarly, apart from Q1, with only one significant difference in the interaction of the groups and the time of measurement, which reflects the faith in the system shown in Figure 6. Female participants are more sceptical about their faith in the system before the simulator session compared to the male group. The simulator session affected their attitude towards the applied automated driving systems, i.e., their trust increased, reaching a similar level as the male participants. This effect can also be observed as a tendency in all other questions, even though there is no significant difference. This trend can also be seen within a descriptive analysis. It may reflect that the female group was more sceptical before the experiment but reached a level similar to that of the male participants after the experiment.

The NASA TLX analysis reveals an overall low workload of the participants, as expected. As the questionnaire is provided after the complete test sequence, it reflects the workload of the entire simulator session and does not differentiate between the single scenarios. There are significant negative correlations between the NASA TLX and both the system usability scale and the expression of trust. Since both questionnaires measure similar aspects with different purposes, the effect can be assumed to be valid. The negative correlations between the trust in the system and the workload throughout the simulator session are significant, meaning that a higher workload correlates with less trust among the participants. The participants' ratings were below the 50% line for the overall workload evaluation, which suggests a low workload within the simulator session. This was expected, as full AD minimised the physical and mental demands. Nevertheless, some subgroups showed more strain within the experiment. As shown in Figure 7, the groups of higher ages and lower yearly mileage had a lower success rate, a higher personal effort and a higher frustration after the simulator session. However, the differences between the groups are only tendentially significant.

The participants' ratings within the System Usability Scale (SUS) show a usability that can be seen as good to excellent. As already discussed in Bangor et al. [46] and Brooke [48], this method is suitable for evaluating the overall system usability. We hypothesise that the participants' knowledge of the early developmental state of the system may have influenced their evaluations. With a mean of around 77.5 points out of 100, the response indicates a very good system, although it also implies that, assuming a normal distribution, around 50% of the participants rated the system lower than that and therefore only good instead of very good. The assessment of the mean and standard deviation of all demographic subgroups reveals that they are all on a similar level regarding their usability evaluation. This points to a neutral, not too complicated and realistic experimental design, which may be seen as an essential prerequisite for reliable results.

Regarding the Pearson correlation between the pre- and post-questionnaires for the expression of trust, one can derive that a lower initial level of trust results in a higher increase of trust after the simulator session. This is expressed in the middle of Figure 10 by the fitted line lying above the 45° line on the left side while touching the 45° line at the top end. This may also reflect a ceiling effect, as the scale is limited to 0–7 and cannot be extended in the post-session questionnaire, so an increase from a higher level may not be measured as accurately. A comparison of the pre- and post-correlations with the system usability shows a more homogeneous rating in the post-evaluation. The correlation with the pre-questionnaire is smaller (*r* = 0.37, *p* = 0.013), reflecting that the system's behaviour is unknown prior to the simulator session. Nevertheless, the lower part of the system usability scale shows a slight decrease in the expression of trust. In contrast, the upper part increased noticeably, visible as a clockwise turn of the fitted line in the left and right part of Figure 10.
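The relation between the fitted line and the 45° line can be made explicit with a small sketch. If the post-session rating is modelled as post = intercept + slope · pre, the expected change is intercept + (slope − 1) · pre, so a slope below 1 with a positive intercept means larger gains for lower initial ratings, and the fitted line meets the 45° line at pre = intercept / (1 − slope). The function names and the ratings below are hypothetical and not taken from the study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def crossing_with_identity(slope, intercept):
    """Rating at which the fitted line meets the 45-degree line, if any."""
    if slope == 1:
        return None  # parallel to the identity line, no single crossing
    return intercept / (1 - slope)


# Hypothetical pre/post trust ratings, not data from the study.
pre = [2.0, 3.0, 4.0, 5.0, 6.0]
post = [3.5, 4.0, 4.5, 5.0, 5.5]
slope, intercept = fit_line(pre, post)
print(slope, intercept, crossing_with_identity(slope, intercept))
```

In this invented example the slope is 0.5, so participants starting at 2 gain 1.5 points on average while those starting at 5 gain nothing, which mirrors the pattern described for the middle panel of Figure 10.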

Both correlations suggest different clusters, one below and one above the fitted line of Figure 10. An analysis of the correlations with a separation into the different demographic groups reveals noticeable findings, which can be seen in Figure 11. It shows that the group without ADAS pre-experience seems to be located closer to the cluster below the fitted line than the group with ADAS pre-experience. It also reveals that the group with a low annual driving experience increases its mean trust ratings throughout the simulator session and has the highest within-group delta with respect to its system usability rating.

The findings of the study respond to the demands of standardisation in a human-centric approach to manage handover and takeover between the vehicle and the human for SAE level 3 automation [1]. Besides the required compliance with the existing standards, the monitoring of the human state during the interaction with the AD functions also provides precious feedback on the machine's performance which can be used for improving the intelligent machine itself [19] and provides the foundations for further human-centred development with the flair of a humanistic AI control. Considering that ensuring dependability of such systems relies on AI approaches, it is still an open issue that lacks standard industrialisation solutions [15]. Ensuring dependability, despite the complexity and changing nature of the systems due to adaptation and learning, is a precondition for public trust and acceptance [58]. The findings of the study help to quantify human behaviour and define measurable parameters that are also working towards standardisation of the concept of safety-critical autonomous or AI-enhanced applications.

In addition, the presented work can support the establishment of trust and acceptance measures of AD in general and AI-based approaches in particular. Such measures may provide the foundation of necessary acceptance and standardisation related to human perception and trust in AD systems.

**Figure 11.** Pearson correlations with descriptive analysis of potential clusters regarding ADAS pre-experience and driving experience.
