**Usability Evaluation—Advances in Experimental Design in the Context of Automated Driving Human–Machine Interfaces**

**Deike Albers <sup>1,</sup>\*, Jonas Radlmayr <sup>1</sup>, Alexandra Loew <sup>1</sup>, Sebastian Hergeth <sup>2</sup>, Frederik Naujoks <sup>2</sup>, Andreas Keinath <sup>2</sup> and Klaus Bengler <sup>1</sup>**


Received: 27 March 2020; Accepted: 23 April 2020; Published: 28 April 2020

**Abstract:** The projected introduction of conditional automated driving systems to the market has sparked multifaceted research on human–machine interfaces (HMIs) for such systems. By moderating the roles of the human driver and the driving automation system, the HMI is indispensable in avoiding side effects of automation such as mode confusion, misuse, and disuse. In addition to safety aspects, the usability of HMIs plays a vital role in improving the trust and acceptance of the automated driving system. This paper aggregates common research methods and findings based on an extensive literature review. Empirical studies, frameworks, and review articles are included. Findings and conclusions are presented with a focus on study characteristics such as test cases, dependent variables, testing environments, or participant samples. These methods and findings are discussed critically, taking into consideration requirements for usability assessments of HMIs in the context of conditional automated driving. The paper concludes with a derivation of recommended study characteristics framing best practice advice for the design of experiments. The advised selection of scenarios and metrics will be applied in a future validation study series comprising a driving simulator experiment and three real driving experiments on test tracks in Germany, the USA, and Japan.

**Keywords:** conditionally automated driving; human–machine interface; usability; validity; method development

#### **1. Introduction**

The introduction of conditionally automated driving (CAD) vehicles drastically alters the role of the human in the car. Based on the definition of the Society of Automotive Engineers (SAE), CAD or Level 3 automated driving means that the automated driving system (ADS) is responsible for the entire driving task, while the human operator is ready to respond as necessary to ADS-issued requests to intervene and to system failures by resuming the driving task [1]. The transition of the human driver from the role of operator to that of passenger implies a paradigm change relative to the Level 2 or partially automated systems that are available today [1,2]. This paradigm change, including transitions back and forth to lower levels of automated driving, affects the human–machine interface. CAD implies that the human must take back control of the driving task whenever the system reaches a system boundary and, in doing so, resume manual driving. The resulting transition of the driving task from the automation system to the human requires an appropriate communication strategy as well as a human–machine interface (HMI) that supports the interaction between the two parties in general. New challenges in HMI design for automated driving in general and CAD in particular are addressed in this review paper.

This paper gives an overview of the status quo for usability assessments for automated driving HMIs. Current practice is presented by summarizing the methodological approaches of study articles. Additionally, theoretical articles such as literature reviews are included. Both are considered in the derivation of best practice advice for experimental design. This best practice advice will be applied in an international validation study for assessing the usability of CAD HMIs comprising four experiments in three countries and two testing environments. In Germany, a driving simulator experiment and a test track experiment are planned. Two further test track experiments are planned for Japan and the USA. All four experiments will apply the same study design, ensuring the comparability of the results. The articles in this paper have been aggregated using a predefined set of six categories. These categories were identified in the research phase of the validation project and represent differences in the methodological approaches.

Based on the existing literature, this paper aims to derive an experimental design that is both practically feasible and theoretically sound, and that will be validated in the study series described above. The developed experimental design serves as a best practice for future studies that aim to assess the usability of CAD HMIs.

#### **2. Paper Selection and Aggregation**

This paper reviews 16 scientific articles that cover the usability assessment of CAD HMIs. The selection includes study articles and theoretical articles. The selection process and the aggregated data are presented in the following sections.

#### *2.1. Paper Selection*

Literature searches were conducted in the reference manager Mendeley and on the platform ResearchGate, yielding seven articles.

Additionally, a systematic review was conducted via Google Scholar, a search engine for scientific literature. The process followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline and is visualized in Figure 1 [3]. This guideline enhances the transparency of the selection process by describing the stepwise narrowing of the chosen articles for the review. For the identification of potential articles, different combinations of keywords such as "Usability", "Human–Machine Interface", and "Conditionally Automated Driving" were applied. The first step in the process resulted in 553 articles. In the next step, the seven articles identified via the other libraries and databases were added. In total, 188 duplicates were removed. A first screening of the titles and the abstracts led to the exclusion of 346 further articles. After reading the remaining articles in full, 10 more articles were excluded due to a lack of relevance for this review. The resulting selection of 16 articles describes the usability assessment of ADS HMIs for CAD.
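As a quick sanity check, the selection counts reported above can be reproduced with a few lines of arithmetic (the variable names are ours, introduced only for illustration):

```python
# Sanity check of the PRISMA flow counts reported in the text.
identified_via_scholar = 553   # first search step via Google Scholar
identified_elsewhere = 7       # articles from Mendeley and ResearchGate
duplicates_removed = 188
excluded_by_screening = 346    # screening of titles and abstracts
excluded_after_reading = 10    # lack of relevance after full-text reading

remaining = (identified_via_scholar + identified_elsewhere
             - duplicates_removed
             - excluded_by_screening
             - excluded_after_reading)
print(remaining)  # 16, the final number of reviewed articles
```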

**Figure 1.** Process of the literature review based on the PRISMA guideline [3].

#### *2.2. Aggregation*

The final selection includes 16 articles. Nine articles present experiments, which are hereafter referred to as study articles [4–12]. The seven other articles are of theoretical nature and are therefore referred to as theoretical articles [13–19]. There are several characteristics that define a study design. By taking into account both common practice and theoretical considerations, this review paper aims to derive best practice advice for researchers interested in the usability of CAD HMIs.

Six experiment characteristics were chosen to meet the challenge of assessing usability in the development process. The literature search yielded different approaches for the usability testing of ADS HMIs. The differences identified in the first research phase resulted in the selection of six categories, which provide the structure of this paper. The definitions of the term usability applied in each of the selected articles are used to understand the research focus of each article. Furthermore, the testing environment, the sample characteristics, the test cases, the dependent variables, and the conditions of use, i.e., initial versus repeat contact (see below), are considered. The characteristics listed below provide an insight into the methodological approaches of the nine empirical study articles and the discussed and recommended methodologies of the seven theoretical articles:

- Definition of usability
- Testing environment
- Sample characteristics
- Test cases
- Dependent variables
- Conditions of use

The characteristics listed or applied in the 16 articles are summarized in the first paragraph of the following subsections and the respective tables. Every subsection closes with a critical discussion of the findings resulting in a recommendation of an experimental procedure or method. These recommendations form the best practice advice for usability assessments of CAD HMIs.

#### 2.2.1. Definition of Usability

The understanding of the term usability has a considerable influence on the experimental design that researchers choose. Different definitions and operationalizations may result in a different study design. To reflect these potential differences in design, the information on usability given in the selected articles is compared in this subsection. Table 1 shows 12 of the 16 articles. Four articles do not define or operationalize the term usability [4,5,8,15]. Five of the remaining articles [9,11,13,18,19] give an insight into the authors' understanding of the construct usability by the chosen dependent variable(s), e.g., the acceptance, or metrics, e.g., the System Usability Scale (SUS) [20], the Post-Study System Usability Questionnaire (PSSUQ) [21], or the acceptance scale of Van der Laan (VDL) [22]. Four articles [6,7,12,17] cite ISO Standard 9241 with its definition of usability as the "extent to which a system, product or service can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use" [23] (p. 2). However, the complete definition is used only once [7], whereas three articles focus on the effectiveness and efficiency while leaving out the construct satisfaction [6,12,17]. Ref. [12] adds the term "usefulness" to the constructs effectiveness and efficiency. Two other articles cite the minimum requirements of the National Highway Traffic Safety Administration (NHTSA) [16,17]. These requirements impose that the user of an ADS HMI must be able to understand if the ADS is "(1) functioning properly; (2) currently engaged in ADS mode; (3) currently 'unavailable' for use; (4) experiencing a malfunction; and/or (5) requesting control transition from the ADS to the operator" [24] (p. 10). Ref. [17] applies the NHTSA minimum requirements to the two constructs effectiveness and efficiency. 
The remaining two articles [10,14] cite Nielsen [25] who builds usability from five constructs: learnability, efficiency, memorability, errors, and satisfaction.



<sup>1</sup> Theoretical article.

In addition to the implicit operationalization through dependent variables, only three sources are cited across the 16 selected articles. These are the ISO Standard 9241 [23], the NHTSA minimum requirements [24], and the Nielsen model for usability [25]. For both theoretical articles and study articles, it was often difficult to work out the authors' understanding of usability. Considering that usability forms the focus of the research question, the underlying definition or at least the operationalization should be communicated to the readers. We strongly advise applying ISO Standard 9241, which comprises the constructs effectiveness, efficiency, and satisfaction [23]. Since the ISO Standard does not elaborate on the detailed testing procedure, further operationalizations are recommended, e.g., whether the effectiveness is tested in a setting with novice or experienced users. When citing the NHTSA requirements for usability tests, researchers choose a different approach to defining the term usability, one that considers the context of automated driving. Moreover, the usability is rated according to the user's comprehension that the ADS is "(1) functioning properly, (2) currently engaged in ADS mode, ... " [24] (p. 10). This narrows down the practical realization of the usability assessment. A combination of this approach and the definition of the general term usability based on ISO Standard 9241 seems to be most applicable.

#### 2.2.2. Testing Environment

Four of the theoretical articles provide no information on the testing environment in which the usability assessment should be conducted [13–16]. Of the remaining 12 articles (shown in Table 2), the use of an instrumented car is recommended twice [18,19], while the additional use of a high-fidelity driving simulator is recommended by [18]. Ten of the 12 articles recommend or use a driving simulator [4–11,17,18]. The details of the simulator are specified in most of the study articles. A fixed-base simulator is used in four articles [4,7,9,10], a moving-base simulator is used in two cases [5,6], and in one other case, a low-fidelity simulator is described [11]. Ref. [12] does not use an instrumented car or driving simulator; rather, desktop methods are applied in which paper and video prototypes are evaluated.


**Table 2.** Aggregation of the testing environments.

<sup>1</sup> Theoretical article.

Driving simulators are the prevalent testing environment in the field of usability assessments of ADS HMIs for CAD. Only two of the theoretical articles stress the need for real driving experiments, e.g., with instrumented cars. Driving simulators provide efficient and risk-free testing environments that provide valuable results [26]. For some research questions, they may even be the only realizable testing environment, e.g., for testing critical situations in automated driving such as near crashes and system failures in high-speed conditions. As the name implies, driving simulators do not equate with reality. High-fidelity driving simulators increase the match with reality and are to be preferred over low-fidelity simulators or desktop methods. The validity of driving simulators is assessed in several studies [27]. For research results obtained in driving simulators used to assess the usability of CAD HMIs, the validity is yet to be verified. For practical reasons, driving simulators constitute the best testing environment. However, a validation check for the research results is needed.

#### 2.2.3. Sample Characteristics

This subsection aggregates the sample characteristics. Usability tests can be conducted with experts or potential users [28,29]. Information on the participant group is provided by 14 articles of this review [4–14,16–18]. Three theoretical articles recommend including both sample groups, i.e., experts and participants in the development process of an ADS HMI [13,16,18]. Two other theoretical articles list users as participants [14,17]. Ref. [17] recommends a diverse age distribution as advised in [30]. Moreover, the authors emphasize that participants should not be affiliated with the tested system. Of the nine study articles, two conducted the usability assessment with 6 or 5–9 experts, respectively [11,12]. In these articles, experts were described as working in the "field of cognitive ergonomics" or "field of ergonomics, HMI, and function development from university and industry". The seven other study articles conducted their usability tests with potential users [4–10]. The reported sample size varies between 12 and 57. The age distribution ranges between 20 and 62, except for [10], where older adults between 47 and 88 years old were tested. Attention should be drawn to the fact that of the seven experiments with potential users, five experiments were conducted with employees of a car maker [4–8]. Table 3 shows an overview of the sample characteristics.


<sup>1</sup> Theoretical article.

Conducting tests with potential users is the predominant method in the articles of this review. Using experts as participants represents an efficient approach for identifying major usability issues early in the development process. At advanced stages, tests with potential users are indispensable. The participants should be selected with close attention to representativeness. The population of potential users of ADS has a high level of variability in its characteristics, e.g., prior experience or physical and cognitive abilities. User testing is most valid and productive when a sample representing potential users is tested. Research using subpopulations could lead to biased results [31]. Therefore, when testing the usability of CAD HMIs, efforts should be made to keep the number of participants with affiliations to technical or automotive domains to a minimum. Further characteristics such as age or gender should be selected according to the represented user group. The sample size varies greatly in the selected articles. The decision on sample size should be driven by the statistical procedure used to identify the potential effects of interest.
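The last point can be illustrated with a minimal sketch. Assuming a two-sample comparison between two HMI variants, the normal approximation to the power analysis of a two-sided t-test yields a per-group sample size. The function and all numeric inputs below are our own illustrative assumptions, not values taken from the reviewed articles; dedicated tools such as G*Power give slightly more exact figures because they use the t-distribution.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided, two-sample
    comparison, using the normal approximation to the t-test:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # critical value of the two-sided test
    z_beta = z(power)            # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Illustrative values: a medium effect (Cohen's d = 0.5) at the usual
# alpha = .05 and power = .80 yields roughly 63 participants per group.
print(n_per_group(0.5))  # 63
```

A larger assumed effect shrinks the required sample, which is one reason the sample sizes in the reviewed studies vary so widely.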

#### 2.2.4. Test Cases

The test cases in an experiment are strongly dependent on the research question. As the research questions in the selected articles of this review all focus on the usability assessment of ADS HMIs for CAD, the test cases are comparable. However, no details are considered; Table 4 shows only test case categories. Ten of the 13 articles that provide information on test cases list transition scenarios [4–7,11,12,15–18]. Downward transitions are found in each of these 10 articles. A more detailed view shows that seven of these articles describe transitions to manual driving [6,7,11,12,16–18]. Eight articles [4–7,12,16–18] list test cases with upward transitions, e.g., SAE Level 0 (L0) to SAE Level 3 (L3) [1]. The system mode as well as the availability of automated driving modes are listed as dedicated test cases in four articles [12,16–18]. Likewise, three experiments include test cases with information on planned maneuvers, e.g., lane changes [7,11,12]. Two articles include test cases that represent different traffic scenarios, e.g., traffic density [8,9]. Use of the navigation function is the focus of [10].



<sup>1</sup> Theoretical article. <sup>2</sup> [1].

In the articles considered in this review, most of the test cases comprise transitions between, or the availability of, different automation modes, which are mostly referred to as SAE levels [1]. Successful transitions and the operator's understanding of the automated driving modes are important for the safe and efficient handling of the ADS. If the usability is tested and the human operator fails to understand the information communicated by the HMI, improvement measures for the HMI become necessary. Therefore, the interaction of the operator with the ADS should be tested regarding these functions. In addition to test cases directly related to automation modes, another type of test case can be applied when assessing the usability. These are test cases where usability evaluations refer to the handling of additional systems such as navigation systems or the radio. Non-driving-related activities (NDRA) are of high importance for usability evaluations in which the human operator is involved in the driving task [2]. With the introduction of CAD, the focus of usability assessments is on transitions and the automation modes themselves. Additionally, this review concludes with a recommendation for testing non-critical scenarios. Critical situations are important for assessing safety aspects, but they have a low probability of occurring. In particular, situations with high criticality are not suitable for usability assessments, e.g., tests that determine the range of reaction times with a crash rate of 100%. For a thorough evaluation of usability, comprising constructs such as the satisfaction construct of ISO Standard 9241 [23], recurring non-critical situations are more appropriate.
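The distinction between upward and downward transitions used in the test-case categories above can be sketched as a minimal classifier over SAE levels (the enum and function names are ours, introduced only for illustration):

```python
from enum import Enum

class SAELevel(Enum):
    L0 = 0   # manual driving
    L2 = 2   # partial automation
    L3 = 3   # conditional automation (CAD)

def classify_transition(before: SAELevel, after: SAELevel) -> str:
    """Label a mode change the way the reviewed test cases do:
    upward (e.g., L0 -> L3), downward (e.g., L3 -> L0), or none."""
    if after.value > before.value:
        return "upward"
    if after.value < before.value:
        return "downward"
    return "none"

print(classify_transition(SAELevel.L0, SAELevel.L3))  # upward
print(classify_transition(SAELevel.L3, SAELevel.L0))  # downward
```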

#### 2.2.5. Dependent Variables

Three of the theoretical articles do not provide information on dependent variables [14–16]. The dependent variables stated in the theoretical articles or applied in the study articles of the remaining 13 articles are shown in Table 5. The dependent variables are categorized in constructs, while information on the specific metrics is added in the respective cells. More generally, the variables can be categorized into observational and subjective data. Six articles recommend or report the use of observational data [4,5,8,11,13,19]. Ref. [13] recommends collecting both data types: the interaction performance with a system or secondary task, as well as the visual behavior. Two other articles name visual behavior (e.g., the number of gaze switches) as a suitable metric [5,19]. The interaction performance is assessed either directly, based on the reaction time or the number of operating steps/errors, or indirectly, by expert assessments. In total, four articles list this type of dependent variable [4,8,11,13]. The SUS [20] is widely used and belongs to the subjective measures. The questionnaire is listed by six of the 13 articles [6,7,10–13]. Two other dedicated usability questionnaires are utilized in one article each: the Post-Study System Usability Questionnaire [21] by [13] and the standardized ISO 9241 Questionnaire [32], as cited by [12]. Other constructs that interrelate with usability, such as acceptance, which correlates with the satisfaction construct of ISO 9241 [23], are examined in several articles in this review using further questionnaires. Questionnaires on acceptance are used three times [7,9,19], e.g., the VDL [22] or the Unified Theory of Acceptance and Use of Technology (UTAUT) [33]. Questionnaires on trust such as the General Trust Scale (GTS) [34] or the Universal Trust in Automation scale (UTA) [35] are reported three times [7,10,19].
Constructs such as workload (cited by [10,19]), measured, for example, using the metric NASA Task Load Index (NASA-TLX) [36], situation awareness (cited by [10,19]), measured, for example, using the metric Situation Awareness Global Assessment Technique (SAGAT) [37], or the mental model of drivers (cited by [4,5]), measured, for example, using the mental model questionnaire by Beggiato [38], are each listed twice. Additional questionnaires that are reported only once can be found in Table 5. In addition to questionnaires, methods such as the Thinking Aloud Technique [39], applied by [8,9,11], or heuristic evaluations [40], applied by [12,17,18], are commonly used, especially for expert studies. Furthermore, interviews, expert evaluations, and spaces for suggestions and comments are often used to gain insights that standardized methods cannot provide [8,9,11,19].
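As a minimal sketch of an observational gaze metric of the kind named above (e.g., alongside the number of gaze switches, the share of time spent on an area of interest), the following assumes a gaze log of equidistant (timestamp, AOI label) samples; the data format, labels, and function name are our own illustrative assumptions, not a prescription from the reviewed articles:

```python
def percent_time_on_aoi(samples, aoi="instrument_cluster"):
    """Share of gaze samples falling on one area of interest (AOI).

    `samples` is assumed to be a list of (timestamp, aoi_label) tuples
    recorded at a fixed sampling rate; with equidistant samples, the
    percentage of samples equals the percentage of time.
    """
    if not samples:
        return 0.0
    hits = sum(1 for _, label in samples if label == aoi)
    return 100.0 * hits / len(samples)

# Hypothetical gaze log: 3 of 4 samples fall on the instrument cluster.
log = [(0.0, "instrument_cluster"), (0.1, "road"),
       (0.2, "instrument_cluster"), (0.3, "instrument_cluster")]
print(percent_time_on_aoi(log))  # 75.0
```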

**Table 5.** Aggregation of the dependent variables. NDRA: non-driving-related activities. UEQ: User Experience Questionnaire. meCUE: modular evaluation of key Components of User Experience. SART: Situation Awareness Rating Technique. ATCQ: Attitudes Towards Computers Questionnaire. DALI: Driving Activity Load Index.


<sup>1</sup> Theoretical article.

Summarizing the listed dependent variables, usability appears as a well-defined construct [23] that can be assessed via multifaceted metrics. Depending on the research questions, different dependent variables seem more applicable than others. Nevertheless, patterns can be detected. A combination of observational and subjective data is used by 6 of the 13 articles that provide information on dependent variables [4,5,8,11,13,19]. The SUS [20] is widely used by the researchers cited in this review. Where individual research questions are concerned, further questionnaires can be used to evaluate constructs such as trust, acceptance, or workload. If information for specific research interests cannot be extracted via standardized methods, interviews, the Thinking Aloud Technique, or heuristic evaluations can be applied. When combining these dependent variables, mutual impacts should be considered. For example, applying the Thinking Aloud Technique is not suitable in combination with interaction performance measurements such as reaction times. For tests with potential users, this review recommends a combination of observational metrics that measure the behavior and subjective metrics that gather the operator's personal impressions. For observational data, analysis of the visual behavior based on ISO 15007 [49] and of the interaction performance using the ADS HMI seem most applicable. Possible metrics are the number of operating errors, the reaction time for a button press, or the percent time on an area of interest, e.g., the instrument cluster. The SUS is recommended as a valid and widely used usability questionnaire. Supplementary questionnaires should be selected with regard to the specific research question. If usability is not the only construct of interest in an experiment, the link between the dependent variables and the constructs should be clearly stated. Standardized metrics should be used to enable comparisons between experiments and create transparency with other researchers. Short interviews provide valuable insights that can be tailored to the specific research question. Interviews should be conducted after test trials and questionnaires to avoid distorted results.
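Since the SUS is the recommended questionnaire, a short sketch of its standard scoring rule may be useful (the function name is ours; the scoring itself follows Brooke's original scheme): odd-numbered items contribute their rating minus one, even-numbered items contribute five minus their rating, and the summed contributions are multiplied by 2.5 to yield a 0–100 score.

```python
def sus_score(responses):
    """Standard scoring of the System Usability Scale (SUS).

    `responses` holds ten ratings on a 1-5 scale, in item order.
    Odd-numbered items are positively worded: contribution = rating - 1.
    Even-numbered items are negatively worded: contribution = 5 - rating.
    The sum of contributions is scaled by 2.5 to the 0-100 range.
    """
    assert len(responses) == 10, "the SUS has exactly ten items"
    total = 0
    for i, rating in enumerate(responses, start=1):
        total += (rating - 1) if i % 2 == 1 else (5 - rating)
    return total * 2.5

# A maximally positive response pattern yields the top score of 100.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```

Note that the resulting score is not a percentage; interpretation is usually done against published benchmark values.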

#### 2.2.6. Conditions of Use

When the usability of a system is tested, the duration of use and the prior experience need to be considered. The conditions can range between the first contact between a novice user and the system and everyday use by an experienced user. The first contact can be tested with prior experience, e.g., after reading the manual, after conducting an interactive tutorial, or after being instructed by an advisor. Prolonged use can be interpreted as a series of repeat contacts between the user and the system within a few hours or in the scope of a long-term study. The articles analyzed in this review generally do not provide detailed information on which conditions of use are of research interest. Table 6 shows an overview with aggregated information on the nature of the usability testing provided by 14 of the articles in this review. In all 14 articles, first contact is tested or, in the case of the theoretical articles, recommended to be tested [4–14,16–18]. In four of these cases, the testing circumstances are specified as testing intuitive use without detailed instructions having first been given [5,6,8,16]. Another article investigates the influence of different tutorials and therefore tests both tutorials and intuitive use [4]. Of the 14 articles that test first contact, seven also assess repeat contact with the system [5,6,10,12–14,17].


**Table 6.** Aggregation of the conditions of use.

<sup>1</sup> Theoretical article.

Testing the first contact when assessing the usability of an ADS HMI appears to be the predominant method. Only a few of the selected articles tested repeat contacts when assessing usability. Prolonged use in the form of a long-term study testing everyday use is not considered in the articles selected for this review. Both first contact and prolonged use are important aspects to consider when evaluating usability. A successful first contact is highly important from the point of view of safety. This means that the handling of a system is intuitively understandable without consulting the manual—similar, for example, to a human driver using a rental car without first familiarizing themselves with the car's handling. For research effects such as disuse, misuse, or even abuse of a system, the consideration of prolonged use in everyday situations is critical [50]. As it poses a different type of research question, it requires a different kind of experiment. In alignment with most of the articles, this review concludes by recommending first contact tests. The NHTSA minimum requirements state that an HMI must be designed in such a way that a user understands if the ADS is "(1) functioning properly; (2) currently engaged in ADS mode; (3) currently "unavailable" for use; (4) experiencing a malfunction; and/or (5) requesting control transition from the ADS to the operator" [24] (p. 10). The fulfillment of these requirements can be checked by assessing usability in a first contact situation. This requires that participants are not given detailed instructions, such as pictures of the HMI requesting a control transition, prior to the first contact. Instead, participants should only receive instructions with general information on ADS.

#### **3. Discussion**

In this review paper, 16 selected articles focusing on usability assessments for ADS HMIs for CAD are analyzed. Information on methodological approaches, study characteristics, as well as the understanding of the term usability has been aggregated. The insights gained are used to draw conclusions on best practice for researchers investigating the usability of CAD HMIs. In this section, the recommendations are discussed and incorporated in more general advice on usability testing.

Three different sources are cited for the understanding of the term usability [23–25]. Yet, many articles do not provide information on the definition used. Each of these definitions leads to a study design that differs from the one that would have been derived had another definition been selected. In order to assess the usability of CAD HMIs, we advise applying a combination of ISO 9241 and the NHTSA minimum requirements [23,24]. However, other definitions, e.g., [25], might be better suited for specific research questions. In general, it is important to provide an operationalization of the term usability when conducting assessments, especially where standards are not applied.

For practical reasons, the review concludes with the recommendation of high-fidelity driving simulators. Depending on the development stage, other testing environments may prove more applicable. For early prototypes, desktop methods provide valuable insights with minimal resource input. Real driving tests can help in the refinement process of preproduction products.

This review recommends that usability tests should be conducted with potential end users. These tests are indispensable for final usability evaluations. Other participants, such as experts or users that represent only a segment of the user population, e.g., students or participants with affiliations to technical or automotive domains, can provide valuable insights at earlier stages of the development process.

The test cases listed in the best practice advice of this review focus on transitions between, and the availability of, different automation levels in non-critical situations. These test cases are recommended for general usability assessments of ADS HMIs for CAD. Other test cases in this review cover usability assessments of HMIs displaying information on more complex scenarios, such as maneuvers, navigation systems, or dense traffic. These test cases are relevant for specific research questions, e.g., the design of integrated functions in the CAD HMI.

A set of metrics for testing the usability of CAD HMIs is listed in this paper. Depending on the study design and the research question, further metrics might prove suitable in obtaining valuable research findings. Researchers should clearly indicate the link between dependent variables and the respective definition or construct of interest.

This review recommends that the usability tests be performed in first contact situations without in-depth instructions on how to use the system having been provided prior to the testing situation. Where research questions not focusing on the NHTSA minimum requirements are concerned, the use of manuals or tutorials might be applicable in order to equalize the knowledge and experience level of the participants. In addition to testing first contacts, the everyday use of ADS is of great interest, especially in the context of CAD. The transition of the human driver from operator to passenger could generate side effects such as disuse, misuse, or abuse of the ADS, which might impair safety. The assessment of these effects poses an interesting and important topic for future research.

#### **4. Conclusions**

This paper reviews 16 articles, comprising both study and theoretical articles. These articles are analyzed in respect of six study characteristics. The insights into common practice and theoretical considerations lead to a derivation of best practice advice. This advice is aimed at helping researchers who are interested in usability assessments of CAD HMIs in the planning phase of a study. Furthermore, the comparability of studies in this field increases with the application of similar experimental designs. Table 7 summarizes the key statements of the derived best practice.

**Table 7.** Best practice advice for testing the usability of conditionally automated driving (CAD) human–machine interfaces (HMIs). ADS: automated driving system.


#### **5. Outlook**

In this review, driving simulators are identified as the prevalent testing environment in the field of usability assessments of ADS HMIs for CAD. As an efficient and risk-free alternative to real driving experiments, simulators offer a convenient and valuable testing environment. However, since the validity of driving simulators for this purpose has not yet been established, the transferability of results to the real world is not assured. A thorough validation study comparing a simulator and a test track experiment is therefore advisable. This forms the foundation for a future validation study series comprising a driving simulator experiment and three real driving experiments on test tracks in Germany, the USA, and Japan.

**Author Contributions:** Conceptualization, D.A., J.R., S.H., F.N., A.K., and K.B.; Methodology, D.A., J.R., A.L., S.H., and F.N.; Formal Analysis, D.A., J.R., A.L., and S.H.; Writing—Original Draft, D.A.; Writing—Review and Editing, D.A., J.R., A.L., S.H., F.N., A.K., and K.B.; Supervision, A.K., and K.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the BMW Group.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the collection, analyses, or interpretation of the data.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Discussion*

### **Checklist for Expert Evaluation of HMIs of Automated Vehicles—Discussions on Its Value and Adaptions of the Method within an Expert Workshop**

**Nadja Schömig 1,\*, Katharina Wiedemann 1, Sebastian Hergeth 2, Yannick Forster 2, Jeffrey Muttart 3, Alexander Eriksson 4, David Mitropoulos-Rundus 5, Kevin Grove 6, Josef Krems 7, Andreas Keinath 2, Alexandra Neukum <sup>1</sup> and Frederik Naujoks <sup>2</sup>**


Received: 28 February 2020; Accepted: 20 April 2020; Published: 24 April 2020

**Abstract:** Within a workshop on evaluation methods for automated vehicles (AVs) at the Driving Assessment 2019 symposium in Santa Fe, New Mexico, a heuristic evaluation methodology that aims at supporting the development of human–machine interfaces (HMIs) for AVs was presented. The goal of the workshop was to bring together members of the human factors community to discuss the method and to further promote the development of HMI guidelines and assessment methods for the design of HMIs of automated driving systems (ADSs). The workshop included hands-on experience with rented series-production partially automated vehicles, the application of the heuristic assessment method using a checklist, and intensive discussions about possible revisions of the checklist and the method itself. The aim of the paper is to summarize the results of the workshop, which will be used to further improve the checklist method and make the process available to the scientific community. The participants all had previous experience in the HMI design of driver assistance systems, as well as in development and evaluation methods. They brought valuable ideas into the discussion with regard to the overall value of the tool against the background of the intended application, concrete improvements of the checklist (e.g., categorization of items; checklist items that are currently perceived as missing or redundant in the checklist), when in the design process the tool should be applied, and improvements for the usability of the checklist.

**Keywords:** automated vehicles; automated driving systems; HMI; guidelines; heuristic evaluation; checklist; expert evaluation

#### **1. Background**

With the Federal Automated Vehicles Policy, the U.S. National Highway Traffic Safety Administration (NHTSA) has provided an outline that can be used to guide the development and validation of automated driving systems (ADS).

With regard to the human–machine interface (HMI), the policy proposes that an automated vehicle (AV) HMI at minimum shall inform the user that the system is either of the following [1]:


A suitable design of mode indicators should effectively support the driver in using an ADS and prevent a false understanding of the current driving mode. NHTSA encourages implementing and documenting a process for the testing, assessment, and validation of each element [1]. However, it does not propose details on how entities can assess and validate whether a specific HMI meets these requirements. Therefore, a test procedure was developed that serves to evaluate the conformity of SAE level 3 (conditional automation) ADS HMIs with the requirements outlined in NHTSA's Automated Vehicles policy (for an overview, see [2]). Before this publication, no standardized tools for the assessment of the usability and safety of ADS HMIs existed.

The proposed evaluation protocol includes (1) a method to identify relevant use cases for testing on the basis of all theoretically possible system states and mode transitions of a given ADS (see [3]); (2) an expert-based heuristic assessment to evaluate whether the HMI complies with applicable norms, standards, and best practices (the topic of the present paper); and (3) an empirical evaluation of ADS HMIs using a standardized design for user studies and performance metrics [2]. An overview of the complete test procedure can be seen in Figure 1 (for further information, see [2]).
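Step (1), deriving the relevant use cases from all theoretically possible system states and mode transitions, can be sketched as a simple enumeration. The state names and the enumeration logic below are illustrative assumptions for a minimal sketch, not the actual procedure from [3]:

```python
from itertools import permutations

# Hypothetical state set for an ADS offering manual driving, partial (L2),
# and conditional (L3) automation; the real set depends on the system under test.
STATES = ["manual", "L2_active", "L3_active"]

def derive_use_cases(states):
    """Enumerate candidate use cases: steady-state observation of each
    mode plus every ordered transition between two distinct modes."""
    steady = [f"observe HMI in state: {s}" for s in states]
    transitions = [f"transition: {a} -> {b}" for a, b in permutations(states, 2)]
    return steady + transitions

print(len(derive_use_cases(STATES)))  # 3 steady states + 6 transitions = 9
```

For a real ADS, the transition set would additionally be filtered by which transitions the system logic actually allows (e.g., no direct L3 activation from manual driving if L2 must be engaged first).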

**Figure 1.** Overview of the test procedure for the evaluation of automated driving system (ADS) human–machine interfaces (HMIs) based on U.S. National Highway Traffic Safety Administration (NHTSA) requirements.

The present paper reviews the heuristic evaluation method, which can be used by human factors and usability experts to evaluate and document whether an HMI meets the above-mentioned minimum requirements. In usability engineering, such heuristic assessment methods are commonly applied during the product development cycle and can be used as a quick and efficient tool to identify potential usability issues associated with the HMI [4].

The heuristic assessment method consists of a set of ADS HMI guidelines together with a checklist that can be used as a systematic HMI inspection and a problem reporting sheet. This version of the checklist and the considered HMI principles are reported in [5] and [6].

In comparison with existing approaches that test usability via user studies/car clinics, the heuristic evaluation can be applied in rapid iterations early in the product cycle, allowing identified issues to be corrected and late-stage design changes to be reduced. Using experts has the advantage that inadequate mental models, which might influence the evaluations of naïve users, can be better controlled. Furthermore, experts are trained to evaluate single HMI aspects separately from each other. In addition, by means of the checklist, experts can evaluate an HMI in absolute terms, independently of a comparison with other HMIs. However, heuristic evaluation and car clinics are recommended to be used as complementary methods in the evaluation protocol (see Figure 1).

The paper at hand aims to disseminate the already published work on the developed test procedure to the scientific community and to further adapt the checklist based on the results of the expert workshop. Suggestions for improvement from human factors experts and practitioners are discussed against the background of feasibility (keeping the checklist an easy-to-use tool) and the appropriateness of a checklist compared with other methods.

#### **2. Content and Usage of the Checklist**

#### *2.1. Checklist Items*

The aim of the assessment is to evaluate whether a set of pre-defined HMI principles (the "heuristics") are met. Thus, the checklist consists of 20 items summarizing the most important design recommendations for visual-auditory and visual-vibrotactile HMIs derived from existing norms, applicable standards, design guidelines, and empirical research, pertaining to in-vehicle interfaces. The complete list of items is presented in Table 1. The derivations of these items from the literature are elaborately described in [5].



#### *2.2. Test Procedure*

The method should be conducted by a pair of HMI experts. Preferably, the experts should have received formal training in human factors and usability engineering or have demonstrable practical experience in the assessment and evaluation of automotive HMIs. However, the evaluators should have no prior experience with the vehicle and features to be tested. The most suitable testing environment depends on the maturity of the product. In very early development stages, where only a prototype is available and series production is far away, it is recommended to use a driving simulator. For series production vehicles or high-fidelity prototypes, it is advised to conduct the study on-road, as this provides the most realistic conditions for testing. Each of the two evaluators completes a fixed set of use cases, observes the visual, auditory, and haptic HMI output, and records potential usability issues arising from non-compliance with the checklist items (see Figure 3 for an example). The use case set consists of the various system states and the transitions between them (e.g., activating the system, deactivating the system, switching between system modes, a required control transition from the system to the operator) and depends on the specific design of the ADS with respect to the available levels of automation (e.g., whether only manual driving or conditional automation is available, or whether partial automation (level 2) is also available within the same vehicle). While one of the evaluators drives, the other is seated in the passenger seat, providing step-wise instructions about the desired system state to the driver at appropriate times during the drive. To ensure that both observers experience each use case and the resulting system and user reactions in a comparable way, they switch positions after one driving session and repeat the drive. The aim of the heuristic assessment is twofold:


Checklist compliance and identified usability issues should be initially documented independently by each of the evaluators. Each of the checklist items should be answered using the following rating categories:


The reasons for "major" and "minor" concerns should be documented in a separate reporting sheet. After the individual assessment, the results should be discussed between the evaluators to come to a unanimous rating decision for each item, which should also be documented. Figure 2 shows a simplified flow chart of the test procedure.

**Figure 2.** Simplified flow chart of the test procedure.

Figure 3 shows an example of the format/appearance of the checklist and an item notionally judged by an evaluator. Each checklist item contains the requirement. Additionally, positive and/or negative examples of a good/insufficient HMI solution to the requirement are given below the heuristic. Please note that the handwritten notes in Figure 3 do not refer to one of the systems investigated within the workshop but serve as exemplary problems that could potentially be identified during the heuristic evaluation. The complete checklist can be found in Appendix A. It was used in a slightly adapted version in the workshop.

**Figure 3.** Example of the format of the checklist and an evaluated item.

#### *2.3. Application Domain of the Method*

In its current version, the checklist should be viewed as a living document that can be modified to account for gaps in a field where research is still emerging. The checklist expresses one set of guidelines and is intended as a first step towards guideline development and vehicle-verification methods (the first version of the checklist and three validation studies were published in [6]). In addition, it should be noted that the checklist was developed with the intention of evaluating level 3 ADSs. However, it may also be applied to L2 automation, as most of the heuristics refer to general design guidelines that should be met by the HMIs of all types of automated systems to ensure proper usage. Users could misunderstand or misuse the capabilities of an L2 system, treating it like an L3 system, when the automated driving mode is not clearly indicated as demanded by the checklist items. Vehicles equipped with L3 systems may also be usable in an L2 operational state. Understanding user interactions in lower modes of automation may inform best practices in higher modes. This justifies the application to L2 systems, as was done in the conducted workshop.

For the sake of practicability and efficiency, the list of guidelines was kept as short as possible; therefore, it is likely that it will not cover every aspect of the HMI for ADSs. At this point, it is important to emphasize that new and innovative HMI designs may rely on HMI elements other than the ones covered by the sources used to compile the checklist. In this case, the evaluators are encouraged to give a positive assessment when they judge the respective guideline to be fulfilled during the on-road test and their level of expertise on this topic allows it, even if this judgment cannot be based on the design recommendations. Given sufficient inter-evaluator agreement, verification of the guideline can be assumed. Otherwise, the evaluators should rate the item as "measurement necessary".
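"Sufficient inter-evaluator agreement" could be quantified with a simple percent-agreement score over the 20 checklist items; a minimal sketch follows. The rating labels and the example ratings are illustrative assumptions, not the checklist's official category wording:

```python
def percent_agreement(ratings_a, ratings_b):
    """Share of checklist items on which both evaluators chose the
    same rating category; the lists must be aligned item-by-item."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("rating lists must cover the same items")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Hypothetical ratings for 5 of the 20 items.
eval_1 = ["pass", "minor", "pass", "major", "pass"]
eval_2 = ["pass", "minor", "major", "major", "pass"]
print(percent_agreement(eval_1, eval_2))  # 0.8
```

For only two raters and few items, such a raw score is easy to interpret; a chance-corrected coefficient (e.g., Cohen's kappa) would be preferable if the rating distribution is heavily skewed.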

It should further be noted that the fulfilment of the HMI guidelines should facilitate regular and safe usage of the ADS, such as switching the automation on, checking the status of the ADS, or taking back manual control from the automation (either driver-initiated or as a result of a take-over request).

A more comprehensive evaluation of the ADS will most likely need to incorporate (a) usability testing with real users in instrumented vehicles or high-fidelity driving simulators and (b) investigation of other domains, such as the assessment of the controllability of system limits or failures and foreseeable misuse of the ADS as defined in the RESPONSE Code of Practice [7] and ISO 26262 [8]. The aim of the proposed method is thus not to replace usability testing with participant samples, but to complement empirical approaches with an economical heuristic evaluation tool. It may serve as a guide or sieve to identify and improve poorly performing HMIs before going into further user studies. Further limitations of the current approach are discussed in the corresponding sections of this paper.

#### **3. Intention of the Workshop**

The intention of the workshop conducted at the Driving Assessment 2019 symposium in Santa Fe, New Mexico was to present the developed method and its application to human factors experts in the scientific community of automated vehicle research in order to further improve the method. A higher-level goal was to stimulate discussion within a larger scientific and technical community towards future standards that may be appropriate for guiding the development of automated vehicle HMIs. Another goal was to facilitate a more rapid convergence towards an agreed-upon set of robust guidelines and verification methods that can serve the industry in the important evolution of automation within vehicles.

#### **4. Evaluation Procedure during the Workshop**

The workshop was organized by WIVW GmbH (a company providing research services in the field of human factors, especially in the automotive sector) and was held following the closing session of the conference, on 27 June 2019 at the El Dorado Hotel in Santa Fe. In total, 14 participants took part in the workshop. The workshop was announced via the conference website and was open for application to all conference participants. The workshop attendees were selected based on their backgrounds as practitioners or academics in the automotive domain with previous experience in the HMI design, development, or evaluation methods of driver assistance systems. The resulting workshop group consisted of representatives of the automotive industry, scientific institutes, and national agencies. In preparation for the workshop, the participants received publications describing the background of the checklist method [3] and its application [6].

Regarding the agenda, the workshop started with a short introduction of the organizers, who gave an outline of the workshop. Afterwards, the method and its background were presented. Then, the 14 participants were split into small groups of 3–4 people who would apply the method while riding together in two vehicles equipped with L2 driving automation systems: a Tesla Model 3 with the Autopilot system (AP) and a Cadillac with the GM Supercruise (SC) system.

These two systems were chosen because their system architectures, operating principles, and system performance (e.g., the threshold required for overriding the system by steering input) differ from each other. The Tesla had a large center-stack touch screen for the HMI display, while the Cadillac used a classical instrument cluster and a light bar on the steering wheel to communicate the current system mode. The Cadillac had many of the automation control buttons on the steering wheel, while the Tesla system is operated by a lever behind the steering wheel. The two systems also offered different warnings and alerts, differing both in the type of alert and in when they occurred. The systems also differed in their approaches to driver monitoring: while the Tesla used a hands-on detection system, the SC determines via a camera system whether the driver is paying sufficient visual attention to the road. Furthermore, the Cadillac had more constraints on use than the Tesla L2 system (for further descriptions of the systems, see Section 5).

Before starting the drive, each participant received a copy of the checklist and familiarized themselves with the items and the rating procedure. It was emphasized that the rating of the systems itself was not the relevant outcome of the evaluation process and that methodological issues emerging when applying the method were of greater importance.

The participants experienced both systems as passengers during a 30-min drive on Interstate 25 (Denver–Albuquerque). The first system was experienced while driving in one direction on the interstate. After a stop at an exit, the groups switched vehicles and experienced the second system on the way back to the interstate exit. The drive from the conference site to the interstate took about 10 min.

Owing to safety and insurance reasons, the workshop organizers drove the vehicles. Therefore, the evaluation process during the workshop did not exactly follow the proposed heuristic evaluation procedure, in which the evaluators drive the vehicle themselves (see [6]). On the way to the interstate, the drivers briefly explained the control elements (buttons, lever, and so on) used for operating the system and which HMI elements the evaluators had to observe. The test itself started when the vehicle reached the interstate. Up to this point, the Tesla theoretically permitted use of the Autopilot, while the SC could not be used before entering the interstate. The drivers conducted several use cases in which the system could be experienced:


After returning to the workshop room, the participants were asked to fill out the checklist for the second system they had experienced. After all four groups had experienced the two vehicles, all workshop participants jointly discussed the methodological issues they had noticed during the application of the method.

#### **5. Description of the Evaluated Systems**

Although rating the systems themselves was not within the scope of the evaluation, a short description of both systems is given here to aid understanding of the resulting discussions.

#### *5.1. The GM Supercruise System*

The GM Supercruise system is operated via buttons on the steering wheel. The system mode is indicated by graphics, icons, and text messages in the instrument cluster (see Figure 4). In addition, there is a light bar on the steering wheel that also indicates the current system mode via different colours and pulsation (static vs. flashing lights). The system is geofenced, meaning that it is only available on certain roads, such as interstates. Other preconditions for activation are that the driver has to drive in the center of the lane and that adaptive cruise control (ACC) is active. If the driver tries to activate the system outside these conditions, they receive a text message on the right side of the instrument cluster. The system state is indicated in a specific area on the left side of the instrument cluster as well as by telltales in the center of the cluster. In order to activate Supercruise, ACC first has to be shifted into standby mode (separate button) and activated by setting the speed. After that, Supercruise can be activated by a separate button. The activity of Supercruise is indicated by a green steering wheel symbol together with green horizontal bars for ACC shown in the cluster. A short-term degraded standby mode (meaning lateral control is not active) is indicated by a blue steering wheel symbol and the steering wheel light bar in blue. Lateral control is automatically resumed once the conditions are fulfilled again. The Supercruise system does not require the driver to keep their hands permanently on the steering wheel, as UNECE R79 [9], which regulates this matter, does not apply in the United States. However, if the driver tries to deactivate the system without having their hands on the wheel, a take-over request is triggered by a text message and red indicators. The driver monitoring system consists of a camera on the top of the steering wheel that determines whether the driver is looking towards the road. If the time spent not looking at the road exceeds a certain threshold, the steering wheel light bar first flashes green before it turns red, and lateral control is deactivated.

**Figure 4.** HMI elements of the GM Supercruise system.

#### *5.2. The Tesla Autopilot*

The Autopilot function by Tesla can theoretically be activated on all roads without restriction (other than legal ones). In order to use the additional lane change assistance function, the navigation system has to be active. The system state is indicated exclusively in the left area of the touch display in the center stack console, which is used for all driving-related and non-driving-related information (replacing the instrument cluster; see Figure 5, left). The active L2 system is indicated by a blue trajectory on the ego lane. The dynamic display additionally shows adjacent lanes and other vehicles surrounding the ego-vehicle. The system is activated by pulling the gear lever twice towards the driver (see Figure 5, right). After each activation, the driver is requested by a text message to keep their hands on the steering wheel. If the system detects no steering input from the driver for a longer interval, it requests the driver to exert a slight force on the steering wheel, with the display flashing blue and a symbol showing a steering wheel with hands on it. If the driver does not react to such a hands-on request, the system is switched off completely and can no longer be used for the remainder of the drive. Lateral control can easily be deactivated by a steering intervention of the driver. There is no standby mode, meaning that lateral control has to be reactivated by the driver each time after it has been deactivated.

**Figure 5.** HMI elements of Tesla Autopilot.

#### **6. Methodological Issues Discussed during the Workshop**

After experiencing the two vehicles and applying the checklist, several methodological aspects arose in the discussions among the workshop participants; these were grouped into the following topics.

#### *6.1. Design Issues of the Checklist*

With regard to improving the usability of the checklist itself, it was proposed to reorganize its design. For a better overview, some items could be grouped into higher-level categories (e.g., with regard to color usage). Another suggestion was to categorize the items by use case, for example, grouping together all items related to a change of system mode. However, given the checklist's intention to test only the minimum requirements set by the NHTSA policy, this idea would prove impractical, as it would mean repeating items that are valid for several use cases, which would unnecessarily lengthen the checklist. The idea of shifting the positive and negative design examples to an appendix as added material was judged inappropriate, as the raters profit from the current position of the examples, whereas an appendix tends to be overlooked during the rating process.

In addition to the current rating categories, it was proposed to add a category for "suggested improvements", not only in the final reporting sheet, but also on the item level to encourage experts to think about better solutions instead of simply marking "concerns".

#### *6.2. Missing and Redundant Items*

One concern regarding the selection of the checklist items was that some of them should not be evaluated subjectively by experts but should instead be measured objectively by technicians. These items comprise the following:


It was agreed that, for later stages of HMI development, objective measurement by a technician is necessary, while in an early stage, a heuristic assessment of these items might be acceptable. Therefore, the methods used for early-stage and later-stage assessment of the HMI might differ. It was discussed whether only those items that must be subjectively assessed by experts should remain in the checklist; the objectively measurable items could be deleted from the checklist and inserted into a separate technical checklist. Finally, there was some discussion about items that should be added to the checklist, as they seem to cover aspects that are currently not adequately addressed.

Regarding extensions of the checklist, the greatest benefit, but also the greatest challenge, would be to rate the overall complexity of the system/HMI. The (perceived) complexity of using a system will heavily influence acceptance of and trust in the system (e.g., [10]). The term covers two types of complexity. The first is system complexity, meaning the logic behind the various system modes and the transitions between them (e.g., are lateral and longitudinal control separate sub-functions that can be used in combination as well as independently of each other? Are standby modes included?). System complexity will likely influence the complexity of system operation (e.g., the sequence of operational steps to be performed or the number of possible operation steps needed to reach a certain system state) and the demands on the distinctiveness of the several system modes (how many different indicators are necessary and how they must be designed in order to clearly identify the current system mode). The latter is linked to the second type, display complexity, which can be described by the arrangement of the information elements on the display, for example, in terms of display layout, the number of display elements relative to display size (so-called visual clutter), the spatial proximity of elements (e.g., in terms of overlapping), and so on.
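The visual-clutter aspect of display complexity, the number of display elements relative to display size, could be operationalized as a simple density measure. The element counts and display areas below are made-up illustrative values, not measurements from the evaluated vehicles:

```python
def visual_clutter(n_elements, display_area_cm2):
    """Display elements per square centimetre: a crude density proxy
    for the visual-clutter aspect of display complexity."""
    if display_area_cm2 <= 0:
        raise ValueError("display area must be positive")
    return n_elements / display_area_cm2

# Hypothetical comparison: a large center-stack screen vs. a smaller cluster.
print(round(visual_clutter(24, 700), 3))  # 0.034 elements/cm^2
print(round(visual_clutter(10, 300), 3))  # 0.033 elements/cm^2
```

A fuller operationalization would also weight element overlap and spatial proximity, which a raw density measure ignores.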

Another possibility for operationalizing the term complexity would be to categorize the different types of demands placed on the operator of the system, which result in a certain level of perceived complexity in system usage. One possibility would be to define dimensions according to typical categorizations of workload based on Wickens' multiple resource model [11]. This model categorizes workload based on the following:


To sum up this issue: in the general discussion about the aim of the checklist, workshop participants tended to agree that complexity can reasonably be assessed by experts. Therefore, it was proposed to consider system complexity in future iterations of the checklist. It is recommended to reflect the multidimensionality of this item in the checklist. One way would be to define multiple items addressing the sub-dimensions defined above and to group them together under the more global category of "complexity". The other option (which might be more appropriate, as concrete standards for the assessment of complexity are missing) would be to formulate a single, more generic item with the defined sub-dimensions as positive/negative examples.

With regard to the evaluation of system operation, only one item is currently included, which deals with the avoidance of unintentional activation and deactivation of the system (see also, for example, the currently ongoing work by UNECE on the ACSF (automatically commanded steering function) regulation [12]). The reason for this limitation is that, at the creation of the checklist, there were no clear design guidelines or recommendations on how system operation should be designed. At the moment, some concrete specifications on activation, deactivation, and driver input principles are under consideration in the UNECE ACSF document (e.g., the system should be deactivated when the driver overrides it by steering, braking, or accelerating while holding the steering control). Issues concerning the system logic are outside the scope of the checklist; however, the design of operational devices might be added to the checklist when more research and valid guidelines on this topic become available.

Highly correlated with complexity is the learnability of the system's operating logic and the HMI. Learnability is considered one of the major attributes of usability (besides effectiveness, error tolerance, satisfaction, and memorization; see, for example, [4]) and is influenced by interface design (e.g., visibility of successful operations, feedback, continuity of task sequences, design conventions, information presentation, user assistance, error prevention) and the conformity of the HMI to users' expectations of the car manufacturer's philosophy (differences in functionality, differences in interaction style, concept clarity, and completeness of information [13]). Besides an intuitive first contact with a system, the concept of learnability should also include re-learning the use of the system after a longer interval of non-usage and the resources involved. However, it seems difficult for experts to provide a meaningful rating of learnability, as an expert involved in system design and assessment is likely biased owing to their experience. The same will be true for in-house testers who have extensive knowledge of currently developed products. This aspect should thus better be tested with naïve users. A small sample may be enough and may include people not involved in ADS design. According to Nielsen [4], most usability problems can be identified by five experts.
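Nielsen's five-expert claim rests on the Nielsen–Landauer problem-discovery model, in which each independent evaluator is assumed to find a fixed share λ of the existing problems (λ ≈ 0.31 on average in their data). A short sketch of the resulting discovery curve:

```python
def problems_found(n_evaluators, discovery_rate=0.31):
    """Expected proportion of usability problems uncovered by n
    independent evaluators under the Nielsen-Landauer model:
    1 - (1 - lambda)^n."""
    return 1 - (1 - discovery_rate) ** n_evaluators

for n in (1, 3, 5):
    print(n, round(problems_found(n), 2))  # five evaluators find ~84%
```

The per-evaluator discovery rate varies considerably between products and evaluator pools, so the 84% figure for five evaluators should be read as a rough planning heuristic rather than a guarantee.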

Another issue that could be considered by the checklist is the evaluation of display elements besides the conventional ones, such as the instrument cluster and head-up displays. This should cover not only the mere presence of peripheral displays, which are considered as an example in item 4 ("peripheral displays supporting noticing of mode changes, e.g., by movement or size of displays"), but also more concrete items referring to the design of those displays (e.g., steering wheel light bars). However, up to now, there are no concrete design guidelines, although a few empirical studies exist on the positive effects of ambient displays on mode awareness and take-over performance (e.g., [14]). Besides concrete design aspects, it can be required that such elements be congruent with the ones displayed in the instrument cluster, as incongruence might impair the understanding of system modes.

For the design of warnings, it was discussed whether to consider additional aspects besides those already included, which deal with the communication channels to be used (multiple modalities; item 18) and the desired effect of not distracting the driver (item 19). Such aspects are, for example, nomenclature choices and linguistic complexity (i.e., fault messages based on engineering nomenclature vs. easily comprehensible names of system modes). In addition, the content of the warning could be defined more explicitly. It was proposed to evaluate positively whether the potential consequences of a system limit are displayed (e.g., what would happen if the user does not intervene, and how the user can recover or reactivate the system, for example, in the case of repeated hands-off warnings).

The American National Standards Institute (ANSI) suggests that a safety warning should include the following (ANSI Z535 [15]):

- an identification of the hazard;
- instructions on how to avoid the hazard;
- the consequences of not avoiding the hazard.
While there is no need to adhere closely to the ANSI warning standard, it can serve as a guideline. In the context of automated driving, warnings can also arise from less time-critical hazards such as sensor failures, which, unlike a forward collision warning, do not inevitably require immediate action. Reaching a system limit, however, can be interpreted as an imminent hazard that requires the driver to take over the driving task immediately to avoid an accident. Signal words that can be used for the identification of the hazard are "Danger", "Warning", "Caution", or "Notice". This should be followed by an instruction on what to do to avoid the hazard.

In the case of an urgent take-over request, these first two aspects are probably the most critical ones. Typically, the HMI addresses them by displaying a short text such as "Take Over" together with a warning sign. The third aspect, conveying the consequence of inaction, seems to be the most problematic one for take-over requests, as it is not always clear what happens if the system is deactivated without an immediate reaction of the driver. This information might be better conveyed by the user manual of the system than by the HMI in the imminent situation.
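The three aspects discussed here (hazard identification via a signal word, an avoidance instruction, and the consequence of inaction) can be sketched as a small message builder. The urgency-to-signal-word mapping and the wording below are illustrative assumptions, not prescriptions from ANSI or this paper:

```python
from dataclasses import dataclass

# Illustrative urgency levels mapped to ANSI-style signal words.
SIGNAL_WORDS = {"imminent": "DANGER", "high": "WARNING",
                "moderate": "CAUTION", "info": "NOTICE"}

@dataclass
class SafetyWarning:
    urgency: str      # key into SIGNAL_WORDS
    hazard: str       # aspect 1: what the hazard is
    avoidance: str    # aspect 2: what to do to avoid it
    consequence: str  # aspect 3: what happens if the user does not act

    def render(self) -> str:
        return (f"{SIGNAL_WORDS[self.urgency]}: {self.hazard}. "
                f"{self.avoidance}. Otherwise: {self.consequence}.")

# Example: an urgent take-over request covering all three aspects.
tor = SafetyWarning("imminent", "System limit reached",
                    "Take over steering and braking now",
                    "the vehicle will initiate a minimum risk manoeuver")
print(tor.render())
```

In practice, as the text notes, an in-vehicle HMI usually conveys only the first two parts in the imminent situation and leaves the consequence of inaction to the user manual.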

Finally, there were some suggestions for new items to be included in the checklist. One example concerned the usage of a dynamic environment display that shows the surrounding traffic, as currently included in the Tesla Autopilot HMI. However, the benefit of such a display has not yet been established, and it is thus not clear how it should be evaluated. Does it have a positive effect on situation awareness, or might it distract the driver from extracting relevant status information from the display and promote overtrust in the system? Displays utilizing motion or animation may also blunt the driver's response to warnings: motion is a powerful attention grabber, so a driver may start filtering out display content that is more relevant in the respective situation (i.e., notices, state changes). Owing to the lack of empirical evidence on these potential effects, the formulation of a checklist item regarding such a display is problematic and will thus be postponed until more research is available.

Another example is a potential item on the ease of overriding the system. This item would address the controllability of a system, which was not within the initial scope of the checklist. Currently, it is not clear whether it is good or bad if a system can be easily overridden by the driver. In addition, such highly complex interactions between various factors (e.g., the degree of lateral control will play an important role here) seem to be better assessed in user studies. Therefore, no item regarding this issue will be included for now.

#### *6.3. Human Factors Aspects to be Considered by Other Methods*

All workshop participants agreed that other aspects, such as differences in system behaviour, system logic, and system operation logic (beyond some describable specific aspects in the item on complexity), and their effects on system usability require additional evaluation methods, as the interplay between these factors is not yet understood well enough to formulate concrete design recommendations. Moreover, their effects on system acceptance and system trust must be assessed in user studies. User experience (UX)-related aspects, such as the hedonic quality of the system and the HMI, are also recommended to be evaluated in user studies, where real users can experience the system and report their emotions and attitudes towards it.

There was also high agreement that user studies are needed, especially regarding the type of control (e.g., operation via steering wheel buttons vs. touch screen), to assess performance times or distraction potential before checklist items can be deduced.

#### *6.4. Test Procedure*

Regarding the test procedure, the workshop participants recommended putting more emphasis on the fact that the experts should take the perspective of naive users. A naive user can be defined as "a person that tests the ADAS under evaluation with no more experience and prior knowledge of the system than a later customer would have" ([7], p. 7). This should ensure that the requirements are valid for the average population. The inclusion of certain items also makes it possible to address the needs of specific driver groups, for example, drivers with colour-blindness.

Nevertheless, it should be kept in mind that this method should not replace, but rather supplement other approaches like user studies that allow for eye tracking, reaction time testing, and performance measurement on tasks dealing with the handling of the system. Both methods are proposed to be conducted within the complete evaluation protocol (see Figure 1).

The proposed test procedure (a team of two experts rates the system after having experienced the use cases once themselves as a driver and having watched the other evaluator driving) was rated as a reasonable approach. Its advantage is that both experts do not merely observe someone interacting with the system, but actually experience the interaction themselves. In addition, the fact that one evaluator can document their first impression directly while the other evaluator is driving (compared with retrospective documentation) avoids negative effects such as memory decay. For later reference, it is suggested to capture video of the driving experience (by scenario and system response) using small video cameras mounted in locations that cover the instrument cluster, head unit, and other displays and controls without obstructing the rater's view of these elements.

It was further proposed to conduct the evaluation with a larger group of experts if time and resource constraints permit. However, as this might complicate reaching an agreement in a joint discussion, we would recommend consulting a third external evaluator only if the two evaluators cannot reach an agreement even after a longer discussion.

Owing to the variety of systems that can be evaluated and the fact that new and innovative HMI designs are currently not covered by the checklist items, there will be situations in which the experts need to make adaptations to accommodate specific circumstances. In this case, we suggest that experts adopt the following approach.


#### *6.5. General Value of the Checklist*

It was jointly agreed that the developed method is a useful tool in the design process of AV HMIs. It is primarily intended to facilitate the assessment of system usability and can check whether the minimum requirements proposed in the NHTSA policy are fulfilled. It is also valuable because the current rapid evolution of automated systems makes it extremely difficult to identify the "best design" of such a system and its HMI. The method serves as a tool to guide and make quick changes during the development process, that is, to test several concepts and narrow down options, as well as to ensure "basic" compliance throughout the design loops.

As stated above, a global evaluation should also consider aspects such as different system logics, different concepts for system operation (e.g., longitudinal and lateral control as separate systems, L2 as an add-on to L1, stepwise activation of L1, then L2, and so on), and different design philosophies, which are better addressed by user studies.

#### **7. Conclusions and Outlook**

On the basis of the discussions with the workshop participants, the following adaptations of the checklist were decided:

- A split of the former item on complexity into separate items on:
	- the visual demands of the HMI in general;
	- the cognitive demands resulting from the complexity of the system's logic;
	- the motoric demands resulting from the number, positioning, and arrangement of operational devices;
	- the ease of learning the interaction with the system;
- An item on the appropriate design of other display elements;
- An item on the content of a warning/take-over request.

It is planned to transfer the checklist into a computer application that can be used, for example, on tablets, in order to support the experts in documenting the tests, discussing the output, and deriving recommendations for system improvements.

**Author Contributions:** Workshop organization: K.W., N.S., Workshop participation: N.S., K.W., Y.F., F.N., J.M., A.N., D.M.-R., K.G., J.K.; Writing – review and editing: All authors. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The development of the method was funded by BMW Group. The workshop was organized by WIVW GmbH and hosted by the Driving Assessment Conference. Thanks to all workshop participants for their valuable inputs to the methodological discussions during the workshop. Special thanks to all of those workshop participants who additionally contributed to the present publication with input and comments (Jeffrey Muttart, Alexander Eriksson, Kevin Grove, David Mitropoulos-Rundus, Josef Krems, Sebastian Hergeth, Frederik Naujoks, and Yannick Forster).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Table A1.** Extended checklist items (see [5]). NHTSA, U.S. National Highway Traffic Safety Administration. Abbreviations in the table: NDRT = Non-driving-related task; DDT = Dynamic Driving Task.




#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Human–Vehicle Integration in the Code of Practice for Automated Driving**

#### **Stefan Wolter 1,\*, Giancarlo Caccia Dominioni 2, Sebastian Hergeth 3, Fabio Tango 4, Stuart Whitehouse <sup>5</sup> and Frederik Naujoks <sup>3</sup>**


Received: 21 April 2020; Accepted: 22 May 2020; Published: 27 May 2020

**Abstract:** The advancement of SAE Level 3 automated driving systems requires best practices to guide the development process. In the past, the Code of Practice for the Design and Evaluation of ADAS served this role for SAE Level 1 and 2 systems. The challenges of Level 3 automation make it necessary to create a new Code of Practice for automated driving (CoP-AD) as part of the publicly funded European project L3Pilot. It provides the developer with a comprehensive guideline on how to design and test automated driving functions, with a focus on highway driving and parking. It covers a variety of areas such as Functional Safety, Cybersecurity, Ethics, and, finally, Human–Vehicle Integration. This paper focuses on the latter, the Human Factors aspects addressed in the CoP-AD. The process of gathering the topics for this category, which included thorough literature reviews and workshops, is outlined in the body of the paper. A summary is given of the draft content of the CoP-AD Human–Vehicle Integration topics. This includes general Human Factors related guidelines as well as Mode Awareness, Trust, and Misuse. Driver Monitoring is highlighted as well, together with the topic of Controllability and the execution of Customer Clinics. Furthermore, the Training and Variability of Users is included. Finally, the application of the CoP-AD in the development process for Human–Vehicle Integration is illustrated.

**Keywords:** automated driving; human factors; human machine interface; controllability; L3Pilot

#### **1. Introduction**

The European research project L3Pilot focuses on different activities with regard to automated driving. Split into seven subprojects, the main objective of the L3Pilot subproject 2 is to define a Code of Best Practice for Automated Driving (CoP-AD). The CoP-AD shall provide comprehensive guidelines for supporting the automotive industry and relevant stakeholders in the development of the automated driving technology. Thus, the CoP is meant to provide best practice guidance that can be used by designers and engineers throughout the lifecycle of automated driving systems. The guidelines are derived from knowledge gained in the industry as well as best practices collected on this topic.

Previously, for systems up to and including SAE Level 2 [1], the Code of Practice for advanced driver assistance systems, derived from the Response project [2], served as a guideline for the development of such functions. With the advent of SAE Level 3 systems and above, its application is no longer appropriate. Nonetheless, the existing Code of Practice was analyzed in order to apply the lessons learnt and to reuse the aspects that remain appropriate for SAE Level 3.

In order to define the scope of the document, a framework for the Code of Practice for Automated Driving was defined at the beginning of this project. It serves as a baseline for the work to be done for creating the CoP-AD. In the second section of this paper, the development process is outlined, culminating in the definition of the topics to be addressed, which were classified into four different categories. It also includes the applicable development phases and furthermore, the geographical regions, operational design domains, and SAE levels affected. The template on how to phrase and execute the questions that will form the checklist of aspects to consider when developing an Automated Driving Function (ADF) is also outlined and explained.

The third section shows the draft content of the Human–Vehicle Integration (HVI) category. This is one of the four main categories in the CoP-AD. It focuses on the topics related to the interaction between the vehicle and the user. This ranges across a broad area covering human factors, user experience, usability, and cognitive ergonomics. The section is divided into the areas of Guidelines for HVI, Mode Awareness, Trust and Misuse, Driver Monitoring, Controllability and Customer Clinics, and finally Driver Training and Variability of Users. The topics are explained, and examples are given on how to apply them as part of the CoP-AD.

In the final section, some general conclusions have been drawn, and further conclusions are highlighted with a focus on the HVI category. This paper is based on the L3Pilot deliverable D2.2 [3], which is a draft of the CoP-AD used to gather feedback from external partners outside of the L3Pilot consortium.

#### **2. Development Process**

At the beginning of the L3Pilot project, a survey was distributed to all L3Pilot partners in order to collect the requirements of all key stakeholders for the CoP-AD. This includes experts from both industry and research institutes. The relevant topics to be covered in best practices were derived using this feedback. The topics collected as part of the survey were selected based on predefined criteria during a subsequent workshop. The criteria for inclusion of a topic are listed in Table 1.

**Table 1.** Criteria for inclusion of topics into the Code of Practice for automated driving (CoP-AD).


With regard to the actual process of applying the CoP-AD, the decision was made to use the existing Code of Practice for Advanced Driver Assistance Systems as a baseline. Figure 1 shows the selected development phases for the CoP-AD. Compared to the Code of Practice for Advanced Driver Assistance Systems, the number of phases was reduced from six to four during the actual development. The second and fourth phases originally consisted of two separate stages each, but these were condensed into the Concept Selection Phase and the Validation and Verification Phase for greater simplicity. An additional phase for the time after the start of production was added to cover the entire lifecycle of the ADF. The conceptual stage consists of the Definition Phase and the Concept Selection Phase, while the Design Phase and the Validation and Verification Phase constitute the series development stage. During the Definition Phase, the basic requirements are defined; based on these, the best concept is chosen in the Concept Selection Phase. The Design Phase covers the detailed design of the system, which is then validated and verified in the final phase before the start of production. After the start of production, further data can be gathered and improvements can be applied. This process is not necessarily linear; iterative improvements with repetitions of important steps are possible. The process has been kept abstract on purpose, so that the CoP-AD can be applied to the many different development processes in place at various companies in the industry.

In order to clearly summarize the topics that were collected, a number of categories were defined to cluster them. Table 2 shows the categories finally chosen with the pertaining topics. They are based on extensive expert discussions, clustering all the available topics in a meaningful way. The last row on Human–Vehicle Integration is the key focus of this paper.


**Table 2.** Categories and topics. HMI: Human–Machine Interface, ODD: Operational Design Domain.

The first category is quite generic and focuses on overall guidelines and recommendations, such as a minimal risk manoeuver. The Operational Design Domain (ODD) on the Vehicle Level offers a description of the function and scenarios at the level of the vehicle. The category ODD on the Traffic System Level, including Behavioral Design, offers a description of the function at the level of the overall environment and a description of the behavior of other road users. Safeguarding Automation is about ensuring safe operation of the function, primarily functional safety, but also cybersecurity and data privacy aspects. Human–Vehicle Integration covers the interaction between the driver and the vehicle's displays and control elements.

The topics within each of the categories were distributed along the development process phases in a workshop. In order to better address the topics derived from previously held expert sessions, a thorough literature review was conducted to back up the topics with research results and existing best practices. On this basis, the questions for the CoP-AD checklist were phrased. These questions underwent a rigorous iterative improvement process, which improved overall quality and reduced the set of questions to the most important ones. This enabled the deliverable D2.2 [3] to be written, a draft used to gather feedback from external partners outside the L3Pilot consortium. The work will culminate in the deliverable D2.3, the final CoP-AD, to be presented in 2021.

In order to apply the CoP-AD appropriately, a template was defined for all questions; this can be seen in Table 3. The reference number for each question can be found in the top left cell of the table, and the development phases associated with the question have been marked in the top right. In the body of the table, the main question is on the left, supported where applicable by sub-questions on the right. Only the main question needs to be answered directly with yes or no. Ideally, independent evaluators (e.g., individuals from other departments or external sources such as research institutes) who have formal training or experience in the subject matter of the topics are also involved in the application of the CoP-AD. For example, for the Human–Vehicle Integration topic, the evaluator should have experience in human factors, usability engineering, or cognitive ergonomics.



Following the CoP means that all of the questions should be answered positively or that the issue raised by an item has been solved in another way. The sub-questions serve as an elaboration. The main question is phrased in such a way that an answer of yes always means that the question has been addressed sufficiently. However, a no may still be appropriate, as there might be good reasons why something could not be done or answered, or it may simply not be applicable in a given case, as long as the underlying problem is solved and documented. For some of the items, accepted pass/fail criteria are available (such as the number of participants that need to pass a controllability confirmation test); others rely on norms (e.g., legibility of displays) or on expert assessments if such thresholds are not available. In a further step, the questions may be transferred to an Excel file or another software tool for easy application and editing.
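As a sketch, the question template and the "yes, or no with a documented resolution" logic described above might be represented as follows. The field names, phase labels, and example content are illustrative assumptions, not taken from the CoP-AD itself:

```python
from dataclasses import dataclass, field
from typing import Optional

# Development phases of the CoP-AD lifecycle (plus post start of production).
PHASES = ("Definition", "Concept Selection", "Design",
          "Validation and Verification", "Post Start of Production")

@dataclass
class CopQuestion:
    ref: str                  # reference number (top-left cell of the template)
    main_question: str        # answered strictly with yes or no
    phases: tuple             # applicable development phases (top-right marks)
    sub_questions: list = field(default_factory=list)  # elaborations
    answer: Optional[bool] = None
    resolution: str = ""      # documentation required when the answer is no

    def is_addressed(self) -> bool:
        """Addressed if answered yes, or answered no with the underlying
        problem solved and documented."""
        if self.answer is True:
            return True
        return self.answer is False and bool(self.resolution)

# Hypothetical example item from the HVI guidelines topic.
q = CopQuestion(
    ref="HVI-01",
    main_question="Are unintentional activations and deactivations "
                  "of the ADF prevented?",
    phases=("Concept Selection", "Design"),
    sub_questions=["Can the driver inadvertently initiate a "
                   "transfer of control?"],
)
q.answer = False
q.resolution = "Resolved via a two-stage activation; documented."
print(q.is_addressed())
```

The design point is that a no is not an automatic failure: the check passes as long as the underlying problem is solved and the resolution is documented, mirroring the text above.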

The CoP-AD was scoped to cover motorway and parking scenarios for SAE level 3 and level 4 functions. Although only EU markets are currently in scope, it is assumed that the CoP-AD may also be applied to non-EU regions, as well as urban or rural traffic scenarios, and even driverless robot taxis. This needs to be investigated in further research.

In the third section of this paper, the HVI category is explained in detail. This also includes examples of the questions asked.

#### **3. Draft Content Human–Vehicle Integration**

The HVI category comprises all factors related to the interaction between the vehicle and the user. This ranges across a broad area covering human factors, user experience, usability, and cognitive ergonomics. The introduction of automated driving systems that allow fallback-ready users to disengage from driving and engage in non-driving-related tasks introduces a range of potential human factors problems that must be considered in the development process. First, the transitions from automated driving to manual driving must be supported so that users are capable of taking over the driving task in a safe way in case of system limits and malfunctions. Furthermore, the possibility of different automated driving modes being available within the same vehicle, each requiring different levels of responsibility from the user, creates the need to communicate the active driving mode unambiguously. Thus, the design of the Human–Machine Interface (HMI) is a central element in the design process to ensure proper mode awareness and controllable transitions to manual driving. Second, the "availability" of the driver to react to requests to intervene needs to be ensured, which is mainly a function of the non-driving-related tasks carried out during the automated ride. Thus, the design of the ADF should take into account foreseeable non-driving-related tasks that are likely to be carried out by users during the automated ride. Third, whether the ADF will be used in accordance with the intended usage, or whether users will misuse it (possibly because of overtrust in the ADF), will depend on the training and information users receive.

Display and control concepts, i.e., the HMI, must be developed in a way that they are easily and safely operated by the user of an ADF. The HVI is about the harmonious interaction between the user and the vehicle in a broader sense, whereas the HMI is more specifically about the hardware and software interface between them. In order to streamline the various aspects related to HVI, this category is divided into five different topics. The first topic covers the general guidelines on how to design the HMI. This includes the acceptance of the ADF, as well as the usability and the user experience-related aspects. The Mode Awareness, Trust, and Misuse topic is primarily about the driver's awareness of the ADF's current driving mode. This also relates to the users' trust in the ADF and their potential for misuse. Driver monitoring is about assessing the user's state when operating an ADF, which is a topic closely related to the users' mental models and their workload. An important aspect of this is the impact of non-driving-related tasks (in the following referred to as secondary tasks) carried out while driving with a highly automated function. The Controllability and Customer Clinics topic refers to the question of an ADF's controllability from the user's perspective on the one hand and how to conduct a study with participants to test the controllability and other properties of the ADF on the other. Driver Training and Variability of Users is the final topic. It covers the area of user training required for an ADF. Furthermore, it also relates to the variability of users to be taken into account. Together, these topics, comprising 39 main questions, form a comprehensive overview on the overall category of HVI. All the main questions from this and all other categories are available in [3].

#### *3.1. Guidelines for Human–Machine Interface*

Guidelines for the ADF's HMI are prominently addressed as a topic in the CoP-AD. Following appropriate guidelines is key to producing a well-executed user experience and usability, which in turn will create a much higher level of underlying safety in the ADF [4]. On a generic level, this topic is about using HMI design guidelines to define, assess, and validate an HMI concept. They should be followed during the whole development process of the HMI for an ADF. There are various HMI guidelines available (e.g., [5,6]), and the guidelines used during the ADF development should be selected carefully to ensure they are suitable for the SAE level 3 systems. Guidelines adapted to HMIs for conditionally automated vehicles were presented by Naujoks et al. [7] and validated in empirical studies [8,9]. The HMI should be standardized where possible following industry standards that are consistent with the user's mental models [10,11]. This will minimize the time required to familiarize oneself with the HMI, therefore improving the experience of first-time users. Still, guidelines may differ for certain demographics, as different groups of people may prefer different communication methods such as symbols or color coding.

Table 4 shows an example question from the Guidelines for HMI topic. The question aims to determine whether unintentional activations and deactivations of the ADF are prevented or not. Unintentional deactivation of an ADF by the user is an event that needs to be avoided. The driver may be focusing on a secondary task and will not be ready to take over control of the driving task if necessary. The HMI concept should be designed so that it is not possible for the driver to inadvertently initiate a transfer of control. At the same time, it is important to prevent unintentional activations of the ADF. Unexpected longitudinal or lateral input from the ADF may have a detrimental effect on the user's trust in the ADF.


**Table 4.** Example question Human–Vehicle Integration (HVI) guidelines. ADF: Automated Driving Function.

Furthermore, the visual interface shall be designed to be easy to read and interpret [12]. This item focuses on the importance of having a clear strategy for the visual HMI. Guidelines and standards need to be followed to ensure that the visual feedback is easy and intuitive to understand. Icons can be designed to be interpreted quickly if standard symbols and colors are used where possible. Where icons cannot be used, text messages shall be applied. However, it is important that the text can be understood in short glances, so that the driver is not forced to take their eyes off the road for extended periods of time [6,13,14]. Finally, it is important to cluster relevant HMI elements in similar locations so that the driver can intuitively understand where a given HMI element will appear [5,14,15].

The HMI shall be designed to portray the urgency of the message to be conveyed [11,12,16]. During the use of an ADF, the user may be subject to many types of HMI feedback with various levels of urgency. It is important that the driver understands which HMI elements are of high priority and are conveying urgent feedback to the driver [17]. Equally, it is important that the driver understands that other messages are provided primarily for informational purposes and therefore do not require immediate action. Assessing the user acceptance is also a key point. Customer clinics, heuristic expert assessments, and various other user trials can be carried out to gain both subjective and objective data on user acceptance.

#### *3.2. Mode Awareness, Trust, and Misuse*

This topic addresses the correct understanding of the role shared between the driver and the ADF, concerning the active mode, as well as the correct usage of and the trust in the ADF.

An example question is given in Table 5. This is about ensuring the drivers fully understand their responsibilities and the function's capabilities during each of the defined ADF modes. They may be informed by several means, such as in-product advertisements and written explanations in the owner's manual [18]. Drivers may get explicit information from the in-vehicle HMI, before, during, and after activation of the ADF itself. They may of course also learn by experience [19]. Additionally, a simple and intuitive HMI can improve the driver's situational awareness and help them to take the correct actions when necessary.


**Table 5.** Example question Mode Awareness, Trust, and Misuse.

All possible automated driving modes shall be explicitly defined in terms of how the driver should acknowledge them. The goal of this item is to ensure that the possible ADF modes are clearly defined from a user's perspective. It is important that a user is aware of the possible automated driving modes of the ADF to avoid any misunderstanding.

It is key to know whether the HMI modalities to communicate the relevant active (automated) driving modes are described. This item focuses on how the active automated driving modes are communicated to both the driver and the other road users, in terms of modalities (visual, auditory, haptic, etc.).

All reasonably foreseeable mistakes and misuse cases of the ADF in relation to the HMI shall be described. The purpose of this question is to ensure that possible driver mistakes, failures and misuses have been addressed in the best possible way, in order to be able to define countermeasures for them [2,20].

Communicating the automated driving modes to the driver in an appropriate and clear way shall be investigated and confirmed. For an ADF, a clear communication of the mode is crucial. This question focuses on the HMI to communicate the ADF modes, the consideration of a permanent display of the modes, how to communicate the mode changes, and how well these HMI elements are recognized by both the driver and other road users. A test procedure to assess whether basic mode indicators are capable of informing the driver about relevant modes and transitions has been proposed by Naujoks et al. [21]. Additional information regarding this topic is provided by JAMA [22], Albers et al. [23], and Schömig et al. [24].

A multimodal HMI to improve driver alertness and minimize the time to get back in the loop should be investigated. However, it should also be ensured that the HMI is no more intrusive than necessary. Therefore, it is necessary to find a balance between the effectiveness of the HMI and the level of annoyance that it may cause the users [25]. Speech is another possibility to communicate a take-over request. The impact of the HMI on relevant driver indicators such as eyes-on-road time should be investigated [26].
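A staged escalation is one way to balance effectiveness against annoyance. The following sketch illustrates how a take-over request could start with a visual cue and add auditory and haptic channels only while the driver has not yet responded; the stage timings and modality names are illustrative assumptions, not values from the cited literature:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One escalation stage of a take-over request (TOR)."""
    starts_after_s: float  # seconds since TOR onset
    modalities: tuple      # HMI channels active in this stage


# Hypothetical escalation scheme: begin unobtrusively, add channels
# over time (timings chosen for demonstration only).
ESCALATION = [
    Stage(0.0, ("visual",)),
    Stage(3.0, ("visual", "auditory")),
    Stage(6.0, ("visual", "auditory", "haptic")),
]


def active_modalities(t_since_tor_s: float, driver_responded: bool) -> tuple:
    """Return the modalities the HMI should use t seconds after TOR onset."""
    if driver_responded:
        return ()  # stop alerting once the driver is back in the loop
    current = ESCALATION[0]
    for stage in ESCALATION:
        if t_since_tor_s >= stage.starts_after_s:
            current = stage
    return current.modalities
```

A scheme like this makes the effectiveness/annoyance trade-off explicit: the intrusive haptic channel is only reached when the less intrusive stages have failed.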

Information shall be provided to the driver about an ADF-initiated minimum risk manoeuver [27]. A minimum risk manoeuver typically happens if the driver fails to appropriately take over the controls, or if the function does not have enough time to make a proper take-over request (for example, due to a sudden unexpected situation). This item aims to consider how to inform the driver in the event that the function has initiated the minimum risk manoeuver in order to provide the driver with the necessary information, such as what is going on, why, and what action the driver should take.

The communication to the driver of the driver's responsibilities in each defined automated driving mode should be investigated and confirmed. It shall be considered how and to what extent the operational design domain information will be displayed to the driver. The driver's awareness of automated driving modes shall be investigated as well.

Driver expectations regarding the ADF's features need to be considered. It is crucial to confirm whether user expectations are met. This is a broad subject that would need to be narrowed down to precise specifications, and this question is there to make sure that this process will be considered. For example, in terms of HVI, the balance between the amount of information and its conciseness or simplicity should be investigated.

The driver's trust in the ADF is an important aspect to consider [28]. It is necessary that the users trust the function, in order for them to feel comfortable using it. On the other hand, it is necessary to avoid over-trust, as this may lead to unintended misuse of the function [29]. Again, a good balance should be targeted in order to ensure the correct amount of trust. The appropriate usage of the ADF should be assessed and confirmed, encouraging the intended use and preventing misuse.

Long-term effects of the ADF on the users shall be investigated. Typically, the main risks of long-term effects are skill degradation and building over-trust in the function [30]. The impact of the HMI on driver workload and other aspects over long journeys shall be investigated as well.

#### *3.3. Driver Monitoring*

This topic addresses the correct application of driver monitoring, specifically the identification and classification of the driver's status and the recognition of the actions made inside the vehicle. Monitoring a driver's attention is a crucial topic, especially when discussing automated driving [31]. Since driving is a complex phenomenon, involving the performance of various tasks (including simultaneous quick and accurate decision making), fatigue, workload, and distraction drastically increase human response time, which may result in an inability to drive correctly or to respond properly to a take-over request [32].

Table 6 shows an example item for this topic. The question is assessing whether all relevant secondary tasks are considered when defining the driver monitoring requirements. This item addresses which secondary tasks are allowed during automated driving. The idea is to consider what is currently available and what will become available in the future. In addition, one sub-question focuses on metrics that shall be taken into account when a driver monitoring function is present within the vehicle. Moreover, the possibility to add additional apps or secondary tasks to the HMI in the future shall be considered as well.


**Table 6.** Example question on Driver Monitoring.

A further important question is whether the HMI is connected with the driver monitoring function. It is essential to provide crucial information on the driver's state directly to the driver, as an impairment may compromise the safety of the situation. Thus, unsafe driver states such as drowsiness need to be communicated effectively [33].
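As a minimal illustration of such a connection, a driver monitoring output could be mapped to graded HMI feedback. The sketch below assumes a drowsiness estimate on a KSS-like 1–9 sleepiness scale; the thresholds and feedback categories are hypothetical and would need validation against real driver-state models:

```python
def hmi_feedback(drowsiness_level: int) -> str:
    """Map a hypothetical 1-9 drowsiness estimate to an HMI action.

    Thresholds are illustrative only: a production system would derive
    them from validated driver monitoring research.
    """
    if not 1 <= drowsiness_level <= 9:
        raise ValueError("drowsiness_level must be in 1..9")
    if drowsiness_level <= 5:
        return "none"             # driver state acceptable, no feedback
    if drowsiness_level <= 7:
        return "visual_advisory"  # attention reminder, suggest a break
    return "multimodal_warning"   # critical drowsiness: escalate warning
```

The point of the sketch is the graded coupling itself: the HMI reacts proportionally to the monitored state rather than with a single binary alarm.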

Furthermore, the possibility of mirroring the user's devices on the HMI should be taken into account [34,35]. Where legally allowed, it is important to consider how to prompt the driver to take back control of the vehicle while their device is being mirrored, for example by overlaying a take-over request on the user's device. This way, the driver can be brought back into the control loop in an effective manner. Device pairing offers further benefits; for instance, the larger in-vehicle screens may be used instead of the relatively small smartphone screens, and the use of dedicated controls and displays minimizes driver distraction. The impact of typical secondary tasks on take-over time and quality should also be measured.

After the start of production, data may be gathered to assess the types of secondary tasks, the amount of time users spend doing them, and their impact on driving behavior, traffic safety, etc. This is related to measuring the long-term effects of secondary tasks on driver behavior.

#### *3.4. Controllability and Customer Clinics*

SAE Level 3 automated driving will still require the driver to take over the driving task in the case of system failures and malfunctions. Thus, it has to be ensured that drivers are able to control transitions to manual or assisted driving and avoid safety-critical consequences. Driver-initiated transitions should also be considered from this perspective. This topic is one of the key elements in the existing Code of Practice for Advanced Driver Assistance Systems [2].

Table 7 shows an example question for this topic. It is about the suitability of testing environments for controllability. In the verification phase, controllability assessments should be carried out in suitable test environments, ranging from laboratory to test tracks, etc. When these controllability assessments are carried out on test tracks or on public roads, precautions regarding the safety of participants and other road users should be taken.


**Table 7.** Example question for Controllability and Customer Clinics.

During the definition phase, it shall be ensured that user needs regarding controllability are taken into account. For example, the design of the HMI should consider the transition from automated driving to lower levels of automation with respect to function failures and system limits as well as driver-initiated transitions. Relevant and applicable guidelines for the design of the HMI should be considered in the design phase in order to ensure that they are in line with generally accepted standards and best practices in view of the targeted user population [7,36,37].

Limitations of the human driver should be taken into account. The driver's sensory and motor limitations (e.g., a restricted ability to move freely) need careful consideration. The concept selection should thus consider topics such as color-blindness, general vision impairments, sensory-motor impairments, and hearing impairments.
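One concrete, automatable check in this area is verifying that HMI color combinations remain legible for users with reduced vision. The sketch below computes the contrast ratio between two sRGB colors per the WCAG 2.x formula; applying WCAG thresholds to in-vehicle displays is our assumption for illustration, not a requirement stated in the CoP-AD:

```python
def _linearize(c8: int) -> float:
    """Convert an 8-bit sRGB channel value to linear light (WCAG 2.x)."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    """Relative luminance of an sRGB color given as an (R, G, B) tuple."""
    r, g, b = (_linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b) -> float:
    """WCAG contrast ratio between two sRGB colors (from 1:1 up to 21:1)."""
    lum_a, lum_b = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# White icon on a black background yields the maximum ratio of 21:1.
print(round(contrast_ratio((255, 255, 255), (0, 0, 0)), 2))  # → 21.0
```

Such a check can run automatically over an HMI's color palette during concept selection, flagging combinations that fall below a chosen legibility threshold.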

The development should also account for a clear and understandable description of the ADF and its limits. Most importantly, the driver should be informed about the function limits that will trigger requests to intervene [38]. These should be described in the user manual and other available multimedia-based information, together with a description of the expected reaction. This also comprises the selection of a transition-of-control concept. Furthermore, it shall be tested whether the vehicle is controllable in the case of a malfunction or when the driver overrules or switches off the function.

The behavior of the ADF should not lead to uncontrollable situations from the perspective of other road users. The design should also consider the limitations and perception of other traffic participants that are not equipped with an ADF. The automated vehicle's behavior shall be designed in a way that it is controllable for these traffic participants and does not exceed the motion ranges of drivers who are driving manually in non-emergency situations.

Even in the early design phase, a preliminary assessment of the controllability can be carried out, which is normally based on expert assessments. A suitable prototype should be used that allows for an assessment of function limits and failures, but also normal driver-initiated transitions [39,40]. The final controllability verification can be based on different evaluation methods such as expert assessments, controllability verification tests, or customer clinics [40].

A suitable post-production evaluation strategy should be implemented that assesses the impact of the ADF on possible negative behavioral adaptations such as skill degradation and misuse. This way, the ADF is adequately evaluated from a human factors perspective after the start of production.

#### *3.5. Driver Training and Variability of Users*

This topic covers the training required for ADF users and the variability of these users, which needs to be considered. The training aspect is about the issue of providing users with the appropriate knowledge and skills to operate an ADF. As there is a huge variability of users, different age groups, gender, cultural backgrounds, and different levels of previous experience need to be considered. Both topics are combined here, as they share various aspects.

Table 8 shows an example question for this topic, asking whether the information the user needs to operate the ADF is available to create a training course. Creating user training for the ADF requires a specification of the ADF's operation to serve as a baseline. Due to the general complexity of ADFs, a user training course may be required or at least recommended; ideally, it is unnecessary thanks to a well-executed, intuitive system design. The training methods shall be defined in more detail to produce a course that could use one or several of the following media: a training course provided by the dealer, user manuals integrated into the vehicle, online material for home training, or digital assistants. A reasonable combination of training methods shall be considered, taking individual learning preferences into account [20,41,42].


**Table 8.** Example question on Driver Training and Variability of Users.

There may be huge differences between user groups. The questions in the CoP-AD target the differences between countries and geographical regions. Infrastructural differences with regard to roads and traffic control functions, as well as driver behavior in general, have a major impact on the design of ADFs, so these differences need to be handled appropriately. Designing an ADF for a specific country or geographical region without taking into account the local infrastructure and the requirements of its user groups must be avoided. Another factor to be taken into account is elderly drivers: as physical abilities degrade with age, driving becomes more demanding. Therefore, the physical impairments of elderly drivers should be addressed during the definition of ADFs. There is also significant variability in users' physical dimensions and anthropometry. Size and strength differences between genders can play a role, so the ADF shall be designed to be operable by a variety of different users, including those with non-age-related disabilities.

There shall also be a representative test sample for user studies. Depending on the exact user study to be conducted, this may range from age, gender, and socio-cultural background to test candidates with previous experience with ADFs or technology in general. The test participants in a sample should be selected accordingly.

A solid mix of customer education and information shall be made available to the users post start of production. Developers need to ensure that there is enough information available for the users of an ADF to properly operate it. There should be sufficient training material available inside the vehicle to provide users with the required knowledge to operate the ADF safely on the road. To reduce the likelihood of people over-estimating the possibilities offered by the ADF, the marketing shall support user information and training with realistic information regarding its abilities.

#### **4. Conclusions**

The introductory part gave an overview of the development process applied to finalize the draft of the CoP-AD. This comprises all the main categories, such as the ODD Vehicle Level, the ODD Traffic System and Behavioral Design, as well as Safeguarding Automation. The draft results of the CoP-AD presented here with a focus on the HVI category offer a first insight into how the interaction between the driver and the automated driving system shall become part of a standardized development process. Whereas the first category focuses on available guidelines in general, the other topics concentrate more specifically on designing an appropriate interaction between the driver and the vehicle equipped with an automated driving system. Mode awareness, including the aspects of trust and misuse, is a cornerstone of making people aware of the automated system's abilities, improving trust and at the same time preventing misuse. Driver monitoring plays a major role when taking into account the driver's state and its importance for the safe operation of the automated driving function. Controllability and customer clinics focus on two distinct but interrelated topics: ensuring the controllability of a system is key, especially in the case of minimum risk manoeuvers, and this shall be tested in user studies, which in turn serve as a primary method for testing many of the guidelines and assumptions mentioned in this text. Driver training again emphasizes the importance of giving drivers the education they need, in a medium that they can consume and learn from most effectively. In addition, the variability of users is taken into account, including cultural and infrastructural differences between geographical regions.

It must be emphasized that the proposed CoP-AD is based on current best practices, research, and applicable norms. Many of the published studies have been conducted in driving simulators or on proving grounds; however, as automated vehicles have not yet been deployed, final proof that the proposed CoP-AD will be able to eliminate all possible design issues is not yet possible. The current publication is meant to stimulate the ongoing discussion in the technical and scientific community to further improve and converge current research and evaluation practice. It should also be noted that this paper lays out a draft version of the CoP-AD that will be further refined based on available feedback. This includes not only the HVI but also the other categories mentioned in this paper. The final CoP-AD needs to be available in an easy-to-use form, preferably as a software application, either Excel-based or standalone. During the development process of an ADF, the questions presented here as examples, and those that will be part of the final document, will guide engineers from the concept phase up to the time after the start of production.

The scope of this document currently covers highway driving and parking, primarily at SAE Level 3 and to a certain extent SAE Level 4, for the European regions. Further work is required to determine whether it can be applied to regions outside the EU as well; of particular interest are the USA and China. The CoP-AD should also be applicable to automated driving systems that operate in cities or rural areas; otherwise, future iterations will have to be adapted to cover these areas. This is also true for applications such as robot taxis, ranging from geo-fenced SAE Level 4 up to SAE Level 5 systems. Until then, the CoP-AD will serve as an important guideline for the development of automated driving functions.

**Author Contributions:** For research Conceptualization, S.W. (Stefan Wolter), G.C.D., S.H., F.T., S.W. (Stuart Whitehouse) and F.N.; Supervision, S.W. (Stefan Wolter); Writing—original draft, S.W. (Stefan Wolter); Writing—review and editing, G.C.D., S.H., F.T., S.W. (Stuart Whitehouse) and F.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This paper results from the L3Pilot project. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 723051. The sole responsibility of this publication lies with the authors. The authors would like to thank all partners within L3Pilot for their cooperation and valuable contribution.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Sleep Inertia Countermeasures in Automated Driving: A Concept of Cognitive Stimulation**

**Johanna Wörle 1,\*, Ramona Kenntner-Mabiala 1, Barbara Metz 1, Samantha Fritzsch 1, Christian Purucker 1, Dennis Befelein <sup>1</sup> and Andy Prill <sup>2</sup>**


Received: 26 May 2020; Accepted: 28 June 2020; Published: 30 June 2020

**Abstract:** When highly automated driving is realized, the role of the driver will change dramatically. Drivers will even be able to sleep during the drive. However, when awaking from sleep, drivers often experience sleep inertia, meaning they feel groggy and are impaired in their driving performance, which can be an issue with the concept of dual-mode vehicles that allow both manual and automated driving. Proactive methods to avoid sleep inertia like the widely applied 'NASA nap' are not immediately practicable in automated driving. Therefore, a reactive countermeasure, the sleep inertia counter-procedure for drivers (SICD), has been developed with the aim to activate and motivate the driver as well as to measure the driver's alertness level. The SICD was evaluated in a driving simulator study of highly automated driving with N = 21 drivers. The SICD was able to activate the driver after sleep and was perceived as "assisting" by the drivers. It was not capable of measuring the driver's alertness level. The interpretation of the findings is limited due to the lack of a comparative baseline condition. Future research is needed on direct comparisons of different countermeasures to sleep inertia that are effective and accepted by drivers.

**Keywords:** highly automated driving; sleep; sleep inertia; HMI design

#### **1. Introduction**

Highly automated driving systems (ADS) are about to be introduced to the market and they have the potential to change the way we travel fundamentally. Technologies such as Internet of Things, Big Data and Connected Vehicles further promote the progress in the development of ADS [1]. Surveys that were conducted on user requirements with regard to ADS reveal how potential users want to spend the gained time that they do not have to spend on controlling the vehicle. Among the most desired activities are phoning, mailing, interacting with passengers, eating and drinking, watching movies and resting [2]. In a more recent survey in five countries, "sleeping and relaxing" was stated as the preferred way to spend an automated drive [3]. This desire can be explained by the requirements of the modern lifestyle with long working hours and extended time spent on commuting. Some 30% of U.S. employees report sleeping less than six hours per night [4]. Thus, the option to use the commute to work for a nap appears highly promising for drivers.

Those ADS that are currently on the market (such as Tesla Autopilot, General Motors Super Cruise, or Mercedes-Benz Distronic Plus) do not offer the option for the driver to sleep. They are partially automated (level 2 according to the taxonomy of the Society of Automotive Engineers, SAE [5]) and therefore have to be supervised by the driver at all times. However, ADS technology is advancing fast, and once it reaches the level of high automation (SAE level 4), sleep can be implemented as a use case during the automated drive. The concept of "dual mode vehicles" in level 4 automation includes the option for a driver to engage in manual driving or to take over in situations the system cannot handle [5]. Hence, questions arise on the driver's state after sleep. After awakening from sleep, humans experience a "period of transitory hypovigilance, confusion, disorientation of behavior and impaired cognitive and sensory-motor performance" ([6], p. 834), called "sleep inertia".

Sleep inertia is widely recognized and well-regulated in operational domains. In aviation, e.g., where pilots are allowed to take a nap during a flight, standardized nap protocols are in place to avoid performance impairment due to sleep inertia. Pilots on long-haul flights are allowed and even advised to sleep to restore their alertness throughout the flight. In order to avoid performance decrements after sleep, and thus potential safety risks, a procedure called the "NASA nap" is implemented. The NASA nap is a standardized rest period of 40 min with the opportunity to sleep followed by a 20-min period of wakefulness to overcome sleep inertia before returning to duty [7]. The duration of sleep is restricted to avoid deep sleep which produces the highest magnitude of sleep inertia. The NASA nap is recommended with slight differences in various aviation operator guidelines [8,9].

In the AD domain, there are no guidelines and no common understanding of how to deal with a sleeping driver. The first driving simulator studies on human performance after sleep indicate that drivers are impaired in their ability to engage in vehicle control and that their driving performance is worsened [10,11]. Adverse driver states are already a major safety issue in conventional driving. In automated driving, the safety impact of an adverse driver state is especially crucial in take-over situations, i.e., in the period after taking back vehicle control from automated driving. EuroNCAP, the European car safety assessment program, has introduced reliable driver state monitoring and effective action upon detection of an adverse driver state as a primary safety measure [12]. This could mean that, if the driver monitoring system detects that the driver is getting too drowsy, it will warn the driver or even initiate a safety manoeuver. Drivers who awaken from sleep experience critical impairments in their take-over and driving performance [10]. The duration of sleep inertia depends on various factors, such as the duration of prior sleep or the sleep stage the driver is awakened from [13]. One approach is therefore to assess the driver's readiness to engage in vehicle control after sleep in order to avoid safety-critical situations. A second approach is to actively counter sleep inertia, reducing the severity and duration of impaired performance after awakening.

The aim of this paper is to propose a reactive strategy for dealing with sleep inertia in AD and to start a discussion on sleep inertia countermeasures in this new field of application.

#### *1.1. Sleep and Sleep Inertia*

Sleep is broadly defined as a "reversible behavioral state of perceptual disengagement from and unresponsiveness to the environment" [14] (p. 15). Sleep itself is not a constant state but rather characterized by an alternation of different sleep stages. The sleep stages according to the American Academy of Sleep Medicine (AASM) standard [15] are wakefulness (stage W), the non-REM stages N1, N2, and N3 (deep or slow-wave sleep), and REM sleep (stage R).
The transitional phase from sleep to wakefulness is also a distinct state characterized by "hypovigilance, confusion, disorientation of behavior and impaired cognitive and sensory-motor performance" ([6], p. 834), called "sleep inertia". Physiologically, the state of sleep inertia is characterized by a decreased cerebral blood flow [16]. Spectral analyses of the EEG show higher power in the delta-theta and alpha frequency ranges and lower power in the beta frequency range, which indicates low general alertness [17,18]. Hilditch and McHill [19] suggest that the function of sleep inertia might be for the organism to promote sleep upon awakening, so that sleep can be maintained when the awakening is undesired.

In the post-awakening period, performance impairment is evident in a wide range of tasks. Most laboratory studies investigate human performance after sleep with highly standardized tasks: the Psychomotor Vigilance Task (PVT) is widely used in studies on fatigue, but also on sleep inertia [20–22]. The PVT is a standardized test that measures alertness. Subjects have to respond to a visual stimulus as quickly as possible. One of the advantages of the PVT is that it has no learning curve. After sleep, subjects react slower to the stimuli [21,22] and they have more lapses [20]. Other studies assessed the working memory upon awakening: In the n-back task, a subject is presented with a sequence of stimuli and they have to react when the current stimulus matches the one from n steps earlier. Groeger and colleagues [23] applied a 1-, 2- and 3-back task to investigate impairments in working memory on tasks of rising difficulty after 90-min naps. They found stronger performance decrements on tasks which highly rely on executive functions. The Digit Symbol Substitution Test (DSST) assesses working memory and processing speed by presenting digit-symbol pairs followed by a list of digits. Subjects have to assign the correct symbol as fast as possible. The number of correct responses was lower after sleep than before [22,24].
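To make the scoring of such tasks concrete, the sketch below summarizes a PVT run from recorded reaction times, using the conventional 500 ms lapse threshold. The sample numbers are fabricated for demonstration only and are not data from the cited studies:

```python
def score_pvt(reaction_times_ms, lapse_threshold_ms=500):
    """Summarize a PVT run: mean reaction time and number of lapses.

    A 'lapse' is conventionally a response slower than 500 ms.
    reaction_times_ms: recorded stimulus-response intervals in ms.
    """
    lapses = sum(1 for rt in reaction_times_ms if rt >= lapse_threshold_ms)
    mean_rt = sum(reaction_times_ms) / len(reaction_times_ms)
    return {"mean_rt_ms": mean_rt, "lapses": lapses}


# Illustrative before/after-nap comparison (fabricated values):
before = [250, 270, 310, 240, 290]
after_sleep = [340, 520, 610, 380, 450]
print(score_pvt(before))       # faster mean RT, no lapses
print(score_pvt(after_sleep))  # slower mean RT, lapses present
```

In a study, these summary metrics (mean or median RT, lapse count) are what is compared between the pre-sleep baseline and the post-awakening measurement.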

The magnitude and duration of sleep inertia are shaped by many factors. A circadian influence seems apparent, with sleep inertia being stronger in the circadian low, i.e., during the biological night [21,25,26]. An important factor influencing the magnitude of sleep inertia is the sleep stage prior to awakening. Deep sleep (or slow-wave sleep, SWS) produces the highest impairments due to sleep inertia [25,27]. For other sleep stages, results are ambiguous: Cavallero and Versace [28] found a higher impairment of performance on a reaction time task after N2 sleep than after REM sleep, and reaction times were prolonged after N2 sleep compared to N1 sleep [29]. Scheer and colleagues [21] found no differences in the performance of an addition task between subjects awakening from N2, deep sleep, or REM sleep. The duration of sleep inertia ranges from 1 min up to 4 h depending on the study design; however, without major sleep deprivation, a duration of more than 30 min is unlikely [13].

#### *1.2. Countering Sleep Inertia*

Sleep inertia can be a serious safety issue, especially in settings where optimal human performance is crucial under adverse conditions, such as extended working hours, working during the circadian trough, or traveling through different time zones. In those operational domains, Fatigue Risk Management Systems (FRMS) are in place to avoid safety risks due to impaired alertness, e.g., fatigue or sleep inertia. Different strategies can be distinguished to counter sleep inertia: proactive strategies are commonly recommended in work guidelines that regulate, e.g., sleep schedules or nap durations to avoid sleep inertia, whereas reactive countermeasures are implemented after awakening, when sleep inertia is already present.

Proactive strategies to mitigate sleep inertia are well-established in operator guidelines and shift schedules, e.g., in hospitals or the transportation industry. They include recommendations on sleep schedules, avoiding awakenings during the circadian low, and strategic naps. A common proactive strategy to minimize sleep inertia is the 'NASA nap': the total sleep duration is restricted to avoid deep sleep, and after awakening, operators have to wait for 20 min before returning to duty to overcome sleep inertia. It has to be mentioned, however, that the 40-min rest opportunity of the NASA nap does not fully avoid deep sleep. Even in the original study on the NASA nap, some pilots entered deep sleep within this short period [7]. Since deep sleep occurs in cycles throughout the sleep period, awakening after 80−100 min could be an alternative; after this time, the first whole sleep cycle, including the first deep sleep period, is normally finished [13]. It was found, for instance, that sleep inertia magnitude was greater after a 40-min nap than after a 60-min nap [30]. Ferrara and De Gennaro [6] suggest that awakenings after extended periods of sleep deprivation and during the circadian trough (i.e., during the night) should be avoided. A limitation of proactive strategies is that they require a planned sleep opportunity and a planned wake-up time.

Reactive countermeasures are not commonly implemented, and empirical evidence on their effectiveness is incomplete. Examples of applied reactive countermeasures are light alarms that claim to wake the user more gently and thus minimize sleep inertia. Hilditch, Dorrian and Banks [31] give an overview of the literature on reactive sleep inertia countermeasures. The review includes studies on caffeine, post-waking light, pre-waking light, sound, temperature, self-awakening, and face-washing. The studies included in the review assess the impact of these countermeasures on subjective alertness, objective alertness (i.e., physiology or performance), or both. One main conclusion of the authors is that there is a gap in the evidence base of research on sleep inertia countermeasures. Caffeine administered before sleep is suggested as the most effective reactive sleep inertia countermeasure. Empirical evidence on the effectiveness of light or temperature is not sufficient to draw conclusions at this point.

#### *1.3. Implications of Sleep Inertia in Automated Driving*

Driving automation has not yet progressed to a level that allows drivers to sleep during the drive. Current ADS require the driver to supervise them at all times. Despite that, videos showing drivers sleeping behind the wheel of their automated vehicles are making the headlines [32,33]. At the current stage, sleep is a clear misuse and has to be avoided at any cost. However, with progress in the development of ADS, the systems will be able to execute all parts of the driving task reliably within their system boundaries. Even then, automated driving will not be available throughout all road sections of a trip, which is why at some point of the drive (e.g., at a motorway exit, or when boarding a ferry) the user will be required to execute the driving task manually. The concept of dual-mode vehicles outlined by the SAE [5] explicitly refers to this design option, where the user of a highly automated driving system has the option to request manual driving if she or he wants. The user can therefore switch actively between a user state and a driver state. Users will thus be allowed to sleep, but when they take back the driving task as drivers, it has to be ensured that they are fit to drive after awakening. An exemplary use case could be a business trip where a saleswoman starts her drive with a highly automated vehicle early in the morning. The trip consists of two hours of motorway driving, and after leaving the motorway, a rural road leads to her destination. The ADS only supports driving on the motorway, not on rural roads. The saleswoman uses the motorway section to get some more sleep and is alerted by the ADS before the motorway exit. This way, she is able to take back vehicle control before entering the rural road. The ADS ensures that after awakening, the driver's manual driving ability is not impaired due to sleep inertia.
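The scheduling logic implied by this use case can be sketched as a simple calculation: wake the driver early enough that an assumed sleep inertia dissipation window, plus a take-over margin, fits before the end of the automated driving section. The 20-min window below is borrowed from the NASA-nap protocol and the 60 s margin is our assumption; actual dissipation times vary with prior sleep duration and sleep stage:

```python
def latest_wakeup_s(seconds_to_odd_exit: float,
                    inertia_dissipation_s: float = 20 * 60,
                    takeover_margin_s: float = 60) -> float:
    """Latest wake-up time (in seconds from now) that still leaves the
    assumed sleep inertia dissipation window plus a take-over margin
    before the operational design domain (ODD) ends and manual driving
    is required. A negative result means it is already too late to nap.
    """
    return seconds_to_odd_exit - inertia_dissipation_s - takeover_margin_s


# Motorway exit (end of the automated section) in 2 h:
# wake the driver no later than 99 min from now.
print(latest_wakeup_s(2 * 3600) / 60)  # → 99.0
```

In a real ADS, the exit time would come from the navigation system and the dissipation window would ideally adapt to the monitored sleep duration and stage, as discussed in Section 1.1.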

When humans awake from sleep, they experience sleep inertia and are therefore impaired in their ability to drive [10,11]. Sleep inertia as a driver state has barely been an issue in road transport research so far. A study of take-over performance after sleep yielded clearly impaired performance when drivers took control back from the ADS [10]. Drivers' take-over reactions (i.e., glance at the road, glance at the mirrors, hands on the wheel) were all delayed by a few seconds after sleep compared to an awake baseline. The ADS's HMI display also appeared to become a more important source of information: after sleep, drivers first checked the information on the HMI display before taking over, whereas after wakefulness they ignored it. Most importantly, the drivers' overall performance in the take-over situation was worse after sleep. Drivers' lane keeping was clearly impaired and they performed fewer safety glances when changing lanes. Drivers subjectively perceived the take-over situations as more critical after sleep. Another driving simulator study focused on the drivers' manual driving behavior during the first 10 min after sleep [11]. After being awakened by a request to intervene, drivers had to drive manually on a monotonous motorway for about 10 min. Lane-keeping performance was clearly impaired after sleep, an effect mainly evident in the first two minutes of the manual drive. After sleep, drivers drove at a reduced speed and had problems keeping to a constant speed. The speed-keeping performance did not improve significantly over the course of the 10-min drive.

The presented findings from previous studies emphasize the necessity of a framework to minimize sleep inertia and the associated safety risks in automated driving. Established strategies from other operational domains might only be partly transferable to AD. One approach could be, similar to the NASA nap, to limit rest periods of the driver during the drive in order to avoid deep sleep. However, this might hardly be acceptable to the driver: if, for example, a two-hour period of uninterrupted AD is available, it would be difficult to justify to the driver that they are only allowed to rest for 40 min. Another strategy could be to awaken the driver early enough to let sleep inertia dissipate before they re-engage in driving. In this case, however, the ADS has to ensure that the driver does not go back to sleep in the meantime. Proactive approaches to deal with sleep inertia are usually designed for professionals like pilots or hospital workers who are trained in alertness management; regular drivers cannot be expected to adhere to such protocols. Therefore, we argue that technical solutions are to be preferred in AD.

Instead of proactive strategies to avoid sleep inertia, reactive strategies that counteract it seem highly promising in AD. While proactive strategies are thoroughly investigated and implemented, e.g., in industry guidelines, there is little research on strategies to minimize sleep inertia after awakening. Some authors [6] suggest that sleep inertia can be reduced by stimulating or activating the individual after sleep. Everyday strategies like washing one's face with cold water or intensive stretching are not applicable in the vehicle cabin (although one could think of physical activities that could be performed while seated, similar to the on-board fitness exercises suggested by some airlines).

Due to the mainly cognitive requirements of the task of vehicle control, a more promising approach is to cognitively stimulate the driver after sleep. A very popular approach in daily life is a task-based mobile alarm app for smartphones. The basic principle is that, after waking from sleep, one has to complete a task on the smartphone to ensure a reliable wake-up. Examples of such tasks are taking a picture or solving math problems [34]. Activation through cognitive stimulation can promote cerebral activity on the one hand and be motivating because of its playful character on the other. However, there is barely any empirical evidence for the effectiveness of such task-based alarms.

Besides cognitive stimulation and motivation of the driver, our concept had a third aim: similar to established alertness measurements such as the PVT, we aimed to assess the driver's alertness level after waking as an indicator of fitness to drive.

A prototype sleep inertia countermeasure was developed and tested in a driving simulator setup to evaluate its effectiveness in terms of:

1. activating the driver after awakening;
2. improving the driver's mood and motivation;
3. measuring the driver's alertness.


#### **2. Materials and Methods**

#### *2.1. The Concept of the Sleep-Inertia Counter-Procedure for Drivers*

Two expert workshops were conducted with the aim of developing prototype wake-up concepts and a framework for a concept to counter sleep inertia in automated driving. N = 8 and N = 9 experts with backgrounds in human factors, traffic psychology and HMI design participated in the two workshops. After several iterations, a concept for a sleep inertia counter-procedure for drivers (SICD) was developed and implemented as a tablet application in the driving simulation.

The basic idea of the SICD was to minimize sleep inertia by activating the driver after sleep, as suggested by [6]. Suggestions such as washing one's face with cold water or physical exercise were discussed but rejected since they are not practicable in a vehicle cabin. Other reliable methods to counter sleep inertia, such as caffeine administration [20], were rejected because they were judged to be too intrusive. One approach that was assessed to be feasible as an HMI solution was to cognitively activate the driver with a challenging task. Another advantage of a cognitive task is that it can also be used as a diagnostic tool to measure the driver's alertness level after sleep, similar to, e.g., the PVT. The SICD was designed with a gamification approach so that it would be perceived as motivating and create a positive feeling.

The SICD was implemented as a gaming application on a tablet, similar to a classical choice-reaction task, see Figure 1. Purple and turquoise dots appeared on the tablet screen every 1–2 s at random positions in the play area. Both the position and the time of appearance were defined by random number generators; frequency and position varied to avoid predictability and thus repetitive behavior and boredom. Drivers had to hit all target stimuli (purple dots) and avoid distractor stimuli (turquoise dots). To promote drivers' motivation during the task, they received motivating messages such as "You are doing great". The duration of the SICD was 10 min.

**Figure 1.** Screenshot of the sleep inertia counter-procedure for drivers (SICD, **left**) and participant equipped with EEG electrodes performing the SICD (**right**).

If participants hit a purple target dot, a "hit" was registered, and the reaction time from the appearance of the dot on the screen until it was hit was recorded. If participants hit a turquoise dot, a "fail" was counted; if a target dot was not hit, a "miss" was counted. The four parameters hits, fails, misses and reaction times were intended to serve as measures of alertness, similar to established alertness measurements such as the PVT.
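The stimulus generation and scoring logic described above can be sketched as follows. This is an illustrative reimplementation with our own function and variable names, not the original tablet application's code.

```python
import random

TARGET, DISTRACTOR = "purple", "turquoise"

def next_stimulus(rng):
    """Draw the next stimulus: random kind, random 1-2 s onset delay,
    random position in the (normalized) play area."""
    return {
        "kind": rng.choice([TARGET, DISTRACTOR]),
        "delay_s": rng.uniform(1.0, 2.0),     # appearance interval
        "pos": (rng.random(), rng.random()),  # normalized screen coords
    }

def score(events):
    """Aggregate hits, fails, misses and mean reaction time.

    `events` is a list of (stimulus_kind, reaction_time_or_None) tuples;
    the reaction time is measured from stimulus onset to touch, and None
    means the stimulus was never touched.
    """
    hits = fails = misses = 0
    rts = []
    for kind, rt in events:
        if kind == TARGET:
            if rt is None:
                misses += 1       # target not touched
            else:
                hits += 1
                rts.append(rt)
        elif rt is not None:
            fails += 1            # distractor touched
    mean_rt = sum(rts) / len(rts) if rts else None
    return {"hits": hits, "fails": fails, "misses": misses, "mean_rt": mean_rt}
```

An untouched distractor is simply a correct rejection and is not counted, matching the three error categories named in the text.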

#### *2.2. An L4 Concept Driving Simulator to Investigate Sleep*

The evaluation study was conducted in a driving simulator using the simulation software SILAB. The simulator was specifically designed for evaluating HMI concepts for automated driving. The main components were a dashboard with a steering wheel and a large display; an accelerator and brake pedal were available in manual driving mode. The driver was seated in a comfortable seat with a central infotainment touch display. All relevant components of the driving simulator were equipped with electric linear actuators and could be controlled via a computer, so that the seat could be moved to a lying position and the steering wheel and pedals could be retracted to give the driver more space. The cabin configuration therefore differed between modes: in "manual driving mode" the seat was upright and the steering wheel and pedals were extended; in "automated driving mode" the seat was moved backwards and the steering wheel and pedals were retracted; and in "sleep mode" the seat was moved to a horizontal position.

Two different wake-up procedures were developed and implemented. The first wake-up procedure focused on a reliable awakening with a loud and sharp sound and flashing lights. The second wake-up procedure focused on a comfortable and pleasant wake up with soft music and a warm yellow light concept. The two wake-up procedures were tested between-subjects in the driving simulator study. However, no results will be presented on the acceptance and effectiveness of the wake-up procedures since this is outside the scope of this paper.

#### *2.3. Study Design*

The study was conducted at the premises of the Würzburg Institute for Traffic Sciences (WIVW). N = 21 test participants (10 female; mean age = 33 years, SD = 8) completed two driving sessions in an L4 driving simulator using a highly automated driving system. All participants were recruited from the WIVW driver panel. Session 1 was scheduled during the daytime and session 2 at 6 a.m. after a night of partial sleep deprivation, i.e., drivers were allowed to sleep no more than 4 h. The aim was to get the drivers to fall asleep in the driving simulator. Each session started with a prequestionnaire and ended with a postquestionnaire.

In session 1, drivers first gave their informed consent, filled in the prequestionnaire and were then familiarized with the driving simulator, i.e., they learned the system handling and drove manually for 10 min. Then they practiced the SICD. After the familiarization, the test drive started; for a graphical representation, see Figure 2. The test drive started in a parking lot, from which drivers entered the highway. On the highway they activated the automated mode; the vehicle then drove automatically and the vehicle cabin changed: the steering wheel folded back, the wide screen moved closer to the participant and the driver's seat moved backwards and tilted back slightly so that the driver was in a more comfortable position. Then, the system offered the sleep mode, which the driver confirmed with a button press. The screen turned darker and the driver's seat tilted to a lying position. Drivers were instructed to close their eyes and relax but not to sleep. After two minutes, drivers were alerted with either of the two "wake-up procedures". They were then asked to rate their subjective arousal and subjective well-being on a slightly adapted version of the 9-point Self-Assessment Manikin (SAM) scale [35], a "non-verbal pictorial assessment technique" ([35], p. 49). For our purposes, the valence scale showed five manikins ranging from an unhappy to a happy facial expression; participants were asked "How good do you feel?". On the arousal scale, the manikins ranged from a relaxed-looking manikin with closed eyes to a very active manikin; participants were asked "How activated are you?"

**Figure 2.** Schematic sequence of the test drive.

After the rating, the SICD was offered on the tablet screen and was started by the drivers via button press. The SICD was executed for 10 min and when it was finished, drivers were asked again to rate their well-being and their arousal. Then they performed a 10-min manual drive on a 3-lane highway with low traffic volume. After the manual drive, drivers rated their subjective well-being and arousal.

Session 2 followed a similar procedure, with the only differences that it took place at 6 a.m. and drivers were sleep deprived. Participants arrived at the test facility by taxi and, after filling in the prequestionnaire, were equipped with the EEG. Electrodes were placed according to the international 10–20 system [36]. The procedure of the test drive was the same as in session 1, but drivers were awakened once a sleep expert confirmed sleep stage N2 via EEG evaluation. Sleep stage N2 was chosen since it is the "deepest" stage considered appropriate during a nap in most operational guidelines. After awakening, drivers engaged in the SICD for 10 min and then drove manually for 10 min. Then, the AD was available again and drivers tried to sleep again. If sleep stage N2 was confirmed a second time, drivers were awakened and the procedure of first the SICD and then the manual drive was triggered again. During both driving sessions, heart rate was measured with a Polar T34 chest belt as a measure of physiological activation.

Alertness is typically measured with self-report measures such as a visual analogue scale or the Karolinska Sleepiness Scale, with measures of cognitive performance, of which the PVT is arguably the most common, or with physiological measures [37]. These measures are highly intercorrelated, and it is advised to use a combination of them. We therefore chose a combination of different alertness measures: the SAM scale as a self-report measure, the cardiac parameters as physiological measures and the performance parameters of the SICD as measures of cognitive performance.

#### *2.4. Data Analysis*

For all indicators of arousal, repeated-measures ANOVAs were conducted with the factors state and time. For state, three manifestations of the driver state were compared: after wakefulness (wakefulness, session 1), after drivers had been asleep for the first time (after Sleep 1) and after drivers had been asleep for the second time (after Sleep 2, both session 2). The factor time captured the change of the indicators over time.
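To make the structure of such an analysis concrete, the following sketch hand-rolls a one-way repeated-measures ANOVA for a single within-subject factor (e.g., time). The actual analysis used a two-factor design and was presumably run in a statistics package; the data layout and names below are our own.

```python
def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA.

    `data` is a list of per-subject lists, one value per level of the
    within-subject factor. Returns (F, df_factor, df_error).
    """
    n = len(data)       # number of subjects
    k = len(data[0])    # number of factor levels
    grand = sum(sum(row) for row in data) / (n * k)

    # Partition the total sum of squares into subject, factor and error parts.
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subj = k * sum((sum(row) / k - grand) ** 2 for row in data)
    level_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_factor = n * sum((m - grand) ** 2 for m in level_means)
    ss_error = ss_total - ss_subj - ss_factor

    df_factor, df_error = k - 1, (n - 1) * (k - 1)
    F = (ss_factor / df_factor) / (ss_error / df_error)
    return F, df_factor, df_error
```

In practice a routine such as statsmodels' `AnovaRM` would be used for the full state × time design; the point here is only how the within-subject error term is separated from between-subject variance.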

For the subjective state, changes in arousal and well-being were compared at three points in time: directly after being awakened, after the SICD and after the manual drive. The mean and standard deviation of heart rate were analyzed as objective indicators of physiological arousal. Heart rate parameters were calculated for time segments of one minute duration, starting two minutes prior to the start of the SICD and covering the 10 min of the SICD and 8 min of the successive manual drive. In a similar approach, indicators measuring performance in the SICD (the proportion of hits and reaction times) were calculated for one-minute segments and analyzed over time.

#### **3. Results**

#### *3.1. Subjective Arousal and Well-Being*

For subjective arousal, there was a significant main effect of driver state [F(2, 36) = 9.898, *p* < 0.001], a significant main effect of time [F(2, 36) = 17.069, *p* < 0.001] and a significant interaction effect of driver state × time [F(4, 72) = 6.499, *p* < 0.001]. Tukey post hoc tests revealed that before the SICD (time point wake-up), arousal after wakefulness was higher than after sleep. During the SICD, arousal increased for all three states. After the SICD, the differences between wakefulness and after sleep were no longer significant and were reduced further until after the drive. The development of subjective arousal over time did not differ between after Sleep 1 and after Sleep 2.

For subjective well-being, there was a main effect of driver state [F(2, 36) = 9.537, *p* < 0.001], no effect of time [F(2, 36) = 0.159, *p* = 0.853] and an interaction effect of driver state × time [F(4, 72) = 3.719, *p* = 0.008]. As can be seen in Figure 3, the only effect that can be meaningfully interpreted is the interaction. After wake-up, subjective well-being was significantly lower after sleep than after wakefulness. Then, after sleep, there was a slight increase of well-being over time, whereas subjective well-being decreased in the awake baseline condition. After the drive, subjective well-being after sleep and after wakefulness was on a similar level. Again, there was no difference in subjective state between Sleep 1 and Sleep 2.

**Figure 3.** Subjective arousal (**left**) and well-being (**right**) after wake-up, after the SICD and after the manual drive for drivers after wakefulness and twice after sleep. The graph shows means and 95% confidence intervals.

#### *3.2. Physiological Activation*

For the mean heart rate, there was a significant effect of time [F(19, 361) = 20.7, *p* < 0.001] and a significant interaction effect [F(38, 722) = 1.6177, *p* = 0.012]. For all states, heart rate increased at the beginning of the SICD, and this increase was more pronounced after sleep. During the SICD, the mean heart rate stayed at a constant level. After wakefulness, the increase of mean heart rate during the SICD was followed by a decrease during the manual drive; this decrease was not found after sleep. For the standard deviation of heart rate, there was a significant effect of time [F(19, 361) = 31.22, *p* < 0.001], of state [F(2, 38) = 4.42, *p* = 0.019] and a significant interaction [F(38, 722) = 10.505, *p* < 0.001]. All effects were based on a strong increase of heart rate variability during the process of waking up and starting the SICD after sleep. After wakefulness, heart rate variability stayed at a constant level throughout the analyzed time frame. Means and standard deviations of heart rate during the SICD and the manual drive are shown in Figure 4.

**Figure 4.** Mean (**left**) and standard deviation (**right**) of heart rate in time segments of 1 min before the SICD, during the SICD and during the successive manual drive.

Figure 5 shows an example of one driver's heart rate over the course of the drive. It illustrates how the heart rate was low during the automated drive due to the driver's low arousal level. When the driver was awakened by the ADS, there was a sharp increase in heart rate and an overall higher level during the SICD and the manual drive. Arousal decreased as soon as the ADS was activated again.

**Figure 5.** Example for change of heart rate for automated drive (ADS), during the SICD and during the manual drive and back to ADS. The driver was asleep during the automated drive (ADS).

#### *3.3. Subjective Evaluation of the SICD*

To assess user acceptance of the SICD, drivers were asked to rate the SICD on the 9-point acceptance scale after the drive in the second session. One-sample *t*-tests were calculated against the scale mean (0). The SICD was perceived as assisting (M = −0.57, SD = 1.07, *p* = 0.024) and marginally as good (M = 0.38, SD = 0.8, *p* = 0.057). All other scales did not differ from the scale mean. Drivers' acceptance of the SICD is depicted in Figure 6.
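For illustration, the one-sample *t*-statistic against the scale midpoint can be computed from the textbook formula; in practice a library routine such as `scipy.stats.ttest_1samp` would be used. This sketch and its names are our own.

```python
import math

def one_sample_t(values, mu=0.0):
    """Return (t, df) for H0: mean(values) == mu.

    Uses the standard formula t = (mean - mu) / (s / sqrt(n)), where s is
    the sample standard deviation with n - 1 degrees of freedom.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    t = (mean - mu) / math.sqrt(var / n)
    return t, n - 1
```

The resulting t value is then compared against the t distribution with n − 1 degrees of freedom to obtain the reported p values.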

**Figure 6.** Means and 95% confidence intervals for the acceptance of the SICD. Significant differences from the scale mean are marked with \*, nearly significant differences with (\*).

In the postquestionnaire, drivers were asked to judge their driving behavior after sleep in an open question. Some answers revealed information about the evaluation of the SICD, e.g., "Directly at the beginning, I felt more awake, but this effect quickly changed after the game was finished", "When I first played the game, it was refreshing and activating, but when I had to play it again, it was rather sleep-inducing", "The game was sleep-inducing, because it is too long and not varied enough", and "At the beginning of the game, it is arousing and raises alertness, but after a while it becomes annoying and monotonous".

#### *3.4. Performance on the SICD*

For the parameters mean reaction time and standard deviation of reaction time, there were no significant effects. For the proportion of hits, there was a significant effect of time [F(9, 162) = 2.1521, *p* = 0.028], based on an increase during the first three minutes after the start of the SICD. Figure 7 shows two exemplary performance parameters of the SICD, the percentage of hits and the mean reaction time, over the course of the 10 min.

**Figure 7.** Proportion of hits of the SICD (**left**) and mean reaction times (**right**) for drivers after wakefulness, and twice after sleep.

#### **4. Discussion**

Sleeping drivers in automated vehicles are already an issue today, at a level of automation where the driver is clearly required to stay alert [32,33]. With the progression of automated driving technology and the development of dual-mode vehicles, sleep will arise as a use case and thus as a new driver state to be considered, e.g., in safety research and vehicle design. After sleep, human performance is impaired due to sleep inertia [6]. Initial studies on sleep in automated driving show that performance impairments are evident after awakening: take-over performance contains more errors and manual driving performance is impaired [10,11]. The aim of the presented study was to develop a first implementation of a countermeasure to sleep inertia for drivers who are awakened during an automated drive. A sleep inertia counter-procedure for drivers (SICD) was developed. The purpose of the SICD was threefold: first, to activate the driver after awakening; second, to improve the driver's mood and motivation; and third, to measure the driver's alertness.

The effectiveness and acceptance of the SICD was evaluated with N = 21 drivers who completed the SICD (a) after wakefulness and (b) after sleep. In both sessions we assessed physiological activation, subjective arousal and well-being, subjective evaluation of the SICD, as well as performance on the SICD.

#### *4.1. Activation of the Driver*

The SICD was specifically designed to be activating: drivers had to react quickly to achieve as many hits as possible while avoiding distractor cues. Drivers reported higher subjective arousal after the SICD than before, both after sleep and after wakefulness. The activating effect of the SICD lasted until the end of the successive manual drive for drivers who had slept before, but not for drivers who had been awake. The subjective arousal is also reflected in the drivers' physiological activation. For drivers who had not slept, activation, measured by the mean heart rate, was higher during the rather demanding SICD than during the monotonous manual drive, and the increase during the SICD was followed by a decrease during the manual drive. When drivers were awakened from sleep, the awakening process was highly activating, and the heart rate remained at a rather high level throughout the SICD and the successive manual drive; the arousal level reached during the SICD thus remained stable after it was finished. Both the subjective and the physiological activation effects after sleep were very stable and occurred in a similar way after both awakenings during the drive. Therefore, the aim of activating the driver with the SICD can be confirmed.

#### *4.2. Driver Mood and Attitude towards the SICD*

The drivers' subjective well-being was generally higher in the "awake" condition than after sleep, which can be explained by the affective component of sleep inertia, often described as "grogginess". The SICD did not improve the drivers' well-being after sleep. The drivers' subjective evaluation of the SICD on the acceptance scale [38] was neutral, although they perceived the SICD as "assisting". The drivers stated that the SICD was too long and monotonous; it could thus be improved by shortening its duration or by adding features that reduce monotony.

#### *4.3. Measuring Driver Alertness with the SICD*

One basic idea of the SICD was to use it as a diagnostic tool that reveals whether the driver's alertness has improved enough to consider her or him "ready to drive". Therefore, similar to classical measures of alertness like the PVT, the performance parameters "hits", "misses", "fails" and "reaction times" were measured. Unfortunately, there was no effect of the drivers' state on the parameters of the SICD. In its current implementation, the SICD can therefore not be considered a valid diagnostic tool to assess drivers' readiness to drive. A simpler task, for which no learning effects can be expected, is indicated; however, this goal might be challenging to combine with the aim of implementing a less monotonous task.

#### *4.4. Conclusions*

While the SICD proved to be subjectively and physiologically activating, the trade-off between motivating appeal and diagnostic capacity, as provided by classic alertness tasks, turned out to be the essential challenge. Standardized and validated tasks like the PVT have the advantage that, due to the simplicity of their design, there are no learning effects and the subject's alertness can be derived directly from the performance parameters. However, such a task can clearly not be considered motivating and is rather annoying to the subject. We therefore designed the SICD to be more varied and added motivating messages. The drivers considered the SICD to be of assistance, and the subjective arousal scales as well as the heart rate show that it was also activating. On the other hand, it was not capable of measuring alertness, and thus its diagnostic properties could not be confirmed. The SICD was accepted by drivers and at least did not worsen the drivers' well-being. However, since driving behavior is impaired after sleep [11], some kind of performance check is indicated before handing vehicle control over to the driver, and the SICD did not reveal information about the driver's alertness. In summary, the SICD fulfilled two of the three aims, to activate and to motivate the driver; the third aim, to measure the driver's alertness, was not accomplished.

#### *4.5. Limitations*

The main limitation for the interpretation of the results is the chosen study design. The two experimental conditions compared the full SICD in the state of sleep inertia to wakefulness as a baseline condition. However, to draw clear conclusions about the effectiveness of the SICD, the treatment (SICD) should be compared to a baseline (no SICD). The obtained physiological, subjective and performance data can only be evaluated from a temporal perspective: the variance in the data might depend not only on the SICD but also on time effects, and consequently a clear interpretation of the data is difficult. Future studies should directly compare different approaches to dealing with sleep inertia. One option could be to test the presented cognitive stimulation approach against the NASA nap paradigm, i.e., a reactive against a proactive approach. Other approaches, such as physical exercise or a combined physical-cognitive activation task (e.g., a cue-search task incorporating the whole vehicle cockpit), could also be introduced. Despite the limitations posed by the study design, we can conclude that cognitive stimulation is a promising framework for activating the driver after sleep. However, a direct comparison of different approaches is indicated to assess their effectiveness.

#### *4.6. Directions for Future Research*

The approach of cognitive stimulation proved effective in physiologically and subjectively activating the driver. However, the SICD was not capable of assessing the driver's readiness to drive. This could either be done by a sophisticated driver monitoring system that detects the driver state [37] or by a performance check as conceptualized in the SICD. The driver's performance capabilities are clearly reduced after sleep [9], and it is crucial for driving safety to detect the driver's readiness to drive [12]. If the driver is detected as not ready to drive, appropriate actions have to be taken. When the driver of a dual-mode vehicle is awakened from sleep, the ADS has to ensure that the driver is cognitively alert enough to engage in vehicle control. The SICD was developed along the lines of the PVT, a validated measure of alertness; the only difference was that it was conceptualized not as a simple reaction task but as a choice-reaction task to make it more varied and therefore more appealing and motivating. Measuring alertness with cognitive tasks is an established approach; however, our task was not able to measure alertness reliably. Future task designs should be more similar to established tasks, e.g., simpler single-choice tasks for which no learning effects can be expected.

Another promising approach is physiological activation instead of, or in combination with, cognitive activation [39]. Cerebral blood flow is decreased during the sleep inertia period, which delays the reinstatement of alertness. Physical exercises have the potential to increase overall blood circulation and thereby counter physiological sleep inertia. The implementation of physical exercises is constrained by the vehicle cabin; however, it is conceivable to instruct the driver to do stretching exercises. Physiological activation alone, on the other hand, does not ensure that vehicle control, a primarily cognitive task, can be safely executed. Therefore, a combination of physiological and cognitive stimulation seems promising. Future research is needed to compare the effectiveness of different SICD approaches and to develop a method that successfully activates the driver, measures alertness and is motivating at the same time.

It is clearly critical to establish a framework that prevents sleep inertia from becoming a safety issue in automated driving. In other operational areas, e.g., aviation, standardized guidelines are in place to avoid sleep inertia; in automated driving, no such framework exists. Our proposed approach of cognitive stimulation has the potential to activate the driver. However, a sleep inertia countermeasure can only be considered effective when the driver's alertness, and thus readiness to drive, can be determined reliably.

**Author Contributions:** Conceptualization, J.W., R.K.-M.; C.P., D.B., S.F. and A.P.; methodology, J.W. and R.K.-M.; software, S.F.; investigation, J.W. and R.K.-M.; resources, A.P.; data curation, R.K.-M., J.W. and B.M.; writing—original draft preparation, J.W.; writing—review and editing, B.M., C.P. and R.K.-M.; project management, C.P. and A.P.; project acquisition: C.P. and A.P.; funding acquisition, A.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Hyundai Motor Europe Technical Center GmbH.

**Acknowledgments:** The authors would like to thank the co-workers of the WIVW who supported the research team with providing a fruitful and creative research environment and with conceptualizing and building the driving simulator hardware and the SILAB driving simulation implementation used in the study: Alexandra Neukum, Mathias Gold, Michael Hanig, Stefan Ludwig, and Markus Tomzig.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

