**Mode Awareness and Automated Driving—What Is It and How Can It Be Measured?**

#### **Christina Kurpiers \*, Bianca Biebl \*, Julia Mejia Hernandez and Florian Raisch**

BMW Group, Knorrstrasse 147, 80788 München, Germany; Julia.Mejia-Hernandez@bmw.de (J.M.H.); Florian.Raisch@bmw.de (F.R.)

**\*** Correspondence: christina.kurpiers@bmw.de (C.K.); Bianca.Biebl@bmw.de (B.B.)

Received: 27 April 2020; Accepted: 18 May 2020; Published: 21 May 2020

**Abstract:** In SAE (Society of Automotive Engineers) Level 2, the driver has to monitor the traffic situation and system performance at all times, whereas the system assumes responsibility within a certain operational design domain in SAE Level 3. The different responsibility allocation in these automation modes requires the driver to always be aware of the currently active system and its limits to ensure a safe drive. For that reason, current research focuses on identifying factors that might promote mode awareness. There is, however, no gold standard for measuring mode awareness and different approaches are used to assess this highly complex construct. This circumstance complicates the comparability and validity of study results. We thus propose a measurement method that combines the knowledge and the behavior pillar of mode awareness. The latter is represented by the relational attention ratio in manual, Level 2 and Level 3 driving as well as the controllability of a system limit in Level 2. The knowledge aspect of mode awareness is operationalized by a questionnaire on the mental model for the automation systems after an initial instruction as well as an extensive enquiry following the driving sequence. Further assessments of system trust, engagement in non-driving related tasks and subjective mode awareness are proposed.

**Keywords:** mode awareness; measurement method; automated driving; SAE Level 2; SAE Level 3

#### **1. The Relevance of Automation**

Within the next few years, technical advances will enable the development of vehicles that can transport users to their destination without human input. This exclusion of drivers from the control and guidance tasks eliminates human errors and as a result leads to increased road safety [1,2]. The technical complexity of such systems, however, does not allow a direct switch from manual to fully autonomous driving. Consequently, various car manufacturers are currently developing semi-autonomous systems that can manage some but not all driving functions. The level of automation in these systems is called partially automated driving (PAD; Level 2) according to the taxonomy by SAE International [3]. PAD systems can control longitudinal and lateral acceleration. Nevertheless, these systems require constant monitoring of their performance, traffic and the surroundings by the driver. In contrast to fully autonomous driving, PAD is still prone to human errors like inattention or distraction since the driver has the role of a supervisory controller who acts in collaboration with the system. This automation system cannot detect all its limits and errors, which is why the driver is responsible for intervening if necessary even without a preceding warning or take-over request [4]. This responsibility allocation will change with the introduction of conditionally automated driving (CAD; Level 3). These systems also control longitudinal and lateral acceleration and thus resemble PAD. Contrary to Level 2, however, CAD can detect all system limits itself and will request the user to take over within a certain time frame if necessary. As such, the driver is not required to be attentive to the system's status or traffic when CAD is active and he or she is then allowed to engage in non-driving related tasks (NDRT). Taken together, the main difference between Level 2 and Level 3 systems is the driver's responsibility for the driving task and the concomitant obligation to pay attention to the traffic situation in PAD but not CAD.

The safety of assisted driving functions is therefore reliant on the user's awareness of the currently active system and the knowledge about his or her responsibilities in this automated driving mode. This understanding is naturally harder to maintain if PAD and CAD are available within the same vehicle and if both systems are repeatedly activated within one drive [5,6]. It is especially safety critical if the user neglects the monitoring task during PAD because he or she might not notice the system reaching a limit in time. The risk of such improper behavior is especially high if the system works perfectly because the user might not expect any system limits [7,8]. In conclusion, it is of great importance to secure a good understanding and clear differentiation of the responsibilities in PAD and CAD. Various measures are currently being developed and tested to promote so-called mode awareness, such as the issuance of attention requests [9], hands On/Off options [10], the inclusion of one or multiple automation modes within a drive [5] and manual drives in between periods of automated driving [11]. To assess the effect of such measures on mode awareness, it is, however, necessary to first define mode awareness and to develop appropriate measurement methods. This article aims to give an overview of the concept of mode awareness and to present a newly developed approach to measure mode awareness during alternating manual, PAD and CAD drives.

#### **2. Constructs Concerning Monitoring Behavior**

Before addressing the measurement of mode awareness, it is important to first define this complex construct and differentiate it from other variables. The following chapter provides an overview of all constructs relevant to mode awareness.

#### *2.1. Situation Awareness*

Mode awareness is similar but not identical to the concept of situation awareness. The latter is constituted of sufficient knowledge about the vehicle's surrounding, the current state of the automation, the system's task performance and the driver's own tasks and responsibilities [12]. If the driver lacks situation awareness, critical situations might be identified too late so that the driver cannot take compensating actions to resolve the situation [12]. According to Endsley and Kiris [13] the extent of situation awareness depends on three factors: automation information presentation; vigilance, monitoring and trust; engagement. Since systems with a higher reliability of autonomy go along with less attention on traffic and system performance, situation awareness is often reduced in higher automation levels [14]. That is why the driver should be given a sufficient amount of take-over time in order to get back in-the-loop before driving manually [15].

#### *2.2. Mode Awareness*

According to [16], there are two kinds of mode awareness: the awareness of the existence of different automation levels and the awareness of the currently active mode. While both aspects are necessary for mode compliant behavior, the latter is at particular risk when a vehicle incorporates two or more automation levels and is intended to be the focus of this paper [6]. Mode awareness is a subconstruct of situation awareness that merely excludes the knowledge about the current situation and surrounding [17]. It comprises the knowledge about the currently active automation system, its performance level and the driver's tasks and responsibilities [6]. Similar to situation awareness, mode awareness is established by the perception and correct interpretation of system information, the build-up of knowledge and finally the prediction of future system behavior [6,17,18]. A deficit can arise on any of these levels. Most common are however a misinterpretation of the systems' behavior and symbols (mode confusion) or a lack of knowledge about the systems (mental model).

#### *2.3. Mode Confusion and Mode Errors*

Mode confusion is one possible reason for deficient mode awareness [17,19]. It can be described as a kind of automation surprise, where the system does not behave according to the user's expectations. In the case of mode confusion, the user loses track of which system is currently active or what kind of behavior is appropriate for which mode. Mode confusion is safety critical [20] because it can lead to mode errors. This term describes behavior that fits the assumed but not the actual active automation level [6,21]. It results from an erroneous combination of information in the mental model [22]. Mode confusion can arise if a driver experiences two or more systems when changing between vehicles or when multiple systems are available within one vehicle, of which the latter represents a greater risk for mode confusion. The likelihood for mode confusion further increases if the systems appear similar for the user, e.g., in the case of PAD and CAD [6]. As a result, drivers might engage in NDRTs while driving in PAD and thus neglect their monitoring task. This can be highly dangerous if the system reaches its limit without the driver noticing, which can lead to collisions.

#### *2.4. Mental Model*

An awareness of the currently active automation mode itself is not sufficient for the creation of mode awareness. In addition, the user must have a correct mental model concerning the automation systems. Mental models are internal representations of a system that are formed by interacting with the system. These models do not need to contain correct technical details as long as users understand the functional characteristics of the system. Mental models can differ greatly in complexity depending on existing knowledge about the system, experience from interacting with the system and education [18,23].

#### *2.5. Overtrust*

As mentioned earlier, mode awareness is essential for a correct amount of monitoring and the controllability of system limits. Even if users have adequate declarative knowledge about the currently active system, its function and their own responsibility, they might nevertheless not behave according to the requirements of the automated system [24]. Next to fatigue, risk tolerance, boredom or extrinsic motivation, the greatest risk factor for such improper behavior is an inappropriate level of trust in the system. In general, trust describes the attitude of users to let a system support them in situations characterized by uncertainty and potential danger [25]. This trust influences the usage of the automated system. In case of under-trust, users will tend to disuse the system because of the subjectively increased workload and risk [26]. This state should be avoided because a disuse of the automated system will decrease the customer value of the vehicle [27]. Van Loon and Martens [28], for example, describe three factors that might be positively impacted by an increase in automated highway systems: a reduction of traffic congestion, a more economic driving style with a concomitant conservation of resources as well as increased traffic safety. According to the authors, the latter is especially improved after users get used to the new technologies or the system assumes increasing parts of the driving tasks.

In respect of safety in use, the more pertinent problem is over-trust. This blind trust in a seemingly perfect system can result in a misuse of the system beyond its functional limits and thus in a safety risk. In the case of PAD, users will presumably monitor the driving scene and the system performance less as their trust in automation increases because they assume that everything will be working properly [29,30]. This so-called complacency is especially provoked if participants have to perform multiple tasks simultaneously, which reduces the amount of cognitive resources available for monitoring [31]. Ironically, over-trust and the associated misuse are elicited by a highly reliable system because users will hardly ever experience the system limits they theoretically know about [8,32].

#### **3. Measurement of Mode Awareness**

The development of a measurement method for mode awareness is crucial for the serial implementation of automated systems since mode inappropriate behavior increases the risk for critical take-over scenarios and crashes. Principally, there are multiple methods to examine mode awareness. It is, however, very difficult to identify a technique which allows a measurement of all subjective and objective aspects of mode awareness. Various potential approaches will be presented and discussed in the following chapter.

#### *3.1. Subjective Measurement Methods*

The simplest way of getting insight into the user's mode awareness is to ask the driver via self-rating scales or interviews (e.g., [33,34]). Both methods give fast and explicit information about the user's state and can be used directly after a use case of interest or subsequent to the entire driving sequence. As with any subjective measurement method, they are, however, subject to personal bias. Self-ratings of the user's own mode awareness are furthermore insufficient because users might not understand the complexity of this construct in its entirety. Since misconceptions in the mental model can inherently not be detected by the users themselves, it is not advisable to use self-ratings as an indicator for mode awareness. An interview addresses some of these flaws by allowing a standardized, and thus partly objective, assessment of mode awareness. In order to cover all aspects of mode unawareness, however, all potential problems need to be identified beforehand, which is not only time-consuming but also unlikely to succeed. Additionally, interviews present multiple difficulties concerning study designs. If they are conducted while driving, the cognitive distraction might confound the driving performance. Conducting multiple interviews (e.g., before and after an experimental manipulation) also poses the risk of influencing the mental model because the mere reproduction of information increases the knowledge level [35]. If the interview is conducted subsequent to the drive to avoid these confounding factors, the time interval between the driving scenario and the interview may lead to memory distortions, which in turn reduce the validity of the information. One method to counteract the disadvantages of self-rating scales and interviews while still obtaining explicit information about the user's inner processes is a driver commentary (e.g., [36]). This method does not have the problem of memory loss or the need for predefined mode awareness deficits because it aims to gather all thoughts of a user directly while using the system. The greatest benefit of this method is its flexibility towards individual and situational differences. The lack of standardization, on the other hand, complicates quantitative and comparative analyses. Furthermore, it might lower the ecological validity of a study because the simple instruction to formulate all thoughts might change these thoughts and interfere with the driving task.

#### *3.2. Objective Measurement Methods*

Another approach to investigate mode awareness is the use of objective measurements, which avoid subjective distortions and focus on the actual user behavior. As illustrated previously, the main difference between PAD and CAD is the allocation of responsibility between the system and the driver and as such the required amount of monitoring [3,37]. Monitoring implies directing visual attention to the street or control instruments, which is most often accompanied by a corresponding eye movement [38]. Therefore, gaze behavior counts as a good indicator for mode awareness. To interpret gaze behavior in terms of mode awareness, however, a comparison value is needed, such as a drive in another automation level. Another indicator of the user's knowledge about his or her responsibility is the interaction with NDRTs. The engagement in tasks such as smartphone apps, in-vehicle information systems, phoning or eating [39] will reduce the time of gaze spent on the traffic or system functionality. Consequently, NDRT engagement covers aspects of mode awareness similar to those covered by gaze behavior but is restricted to the engagement in specific tasks. It must be noted that an engagement in NDRTs cannot always be implemented because of the study design or legal requirements.

Ultimately, the main interest in mode awareness does not lie in monitoring behavior, distractions or declarative knowledge itself, but in the consequential driving performance. This includes the take-over performance and the handling of critical situations, measured by, e.g., the reaction time (time until gaze redirects from the NDRT to the road or control instruments, time until the hands are on the steering wheel and time until the first take-over reaction is performed), the time-to-collision (TTC), the maximum lateral and longitudinal acceleration and the crash rate, among others [40]. Possible take-over situations can range from uncritical switches between automation levels to undetected system failures in PAD (e.g., following a tar track instead of the line marking). The controllability of such situations is of special interest because of its safety implication. It must, however, be noted that driver behavior in such take-over situations is not specific to mode awareness and cannot indicate mode awareness problems on its own. It might, for example, be influenced by momentary inattention, fatigue, the familiarity with take-over situations or the individual participant's driving skills. It can thus only be interpreted against the backdrop of the attention ratio during the drive, the pre-existing knowledge and the post-enquiry.
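The take-over metrics listed above can be illustrated with a minimal sketch. It assumes the common constant-speed definition of TTC (remaining gap divided by closing speed) and latencies measured relative to the take-over event; the function names, parameter names and timestamp format are hypothetical and not taken from any of the cited studies.

```python
import math


def time_to_collision(gap_m, ego_speed_mps, obstacle_speed_mps=0.0):
    """Time-to-collision in seconds under a constant-speed assumption.

    TTC = remaining gap / closing speed. If the ego vehicle is not
    closing in on the obstacle, no collision is predicted.
    """
    closing_speed = ego_speed_mps - obstacle_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


def reaction_times(event_t, gaze_on_road_t, hands_on_wheel_t, first_input_t):
    """Latencies in seconds relative to the take-over event.

    event_t is the time of the take-over request or, for a silent
    system error, the onset of the system limit (an assumption here).
    """
    return {
        "gaze_reaction_s": gaze_on_road_t - event_t,
        "hands_on_s": hands_on_wheel_t - event_t,
        "take_over_s": first_input_t - event_t,
    }


# Illustrative values only, not data from any study.
ttc = time_to_collision(gap_m=50.0, ego_speed_mps=30.0, obstacle_speed_mps=20.0)
latencies = reaction_times(100.0, 101.0, 102.0, 102.5)
```

As noted in the text, such values are not specific to mode awareness and would need to be interpreted together with gaze data and the knowledge tests.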

That is why some studies (e.g., [41,42]) look for certain behavior patterns that are likely to be specific to mode unawareness. Mode confusion, for example, could become apparent when the system reaches a system limit. In addition, a user might grab the steering wheel during CAD, press random buttons repeatedly or show facial cues of confusion. These behavioral characteristics can, however, vary between participants and might not always occur during a drive, which is why their comparability is reduced. Furthermore, this behavior cannot be attributed with certainty to a lack of mode awareness without a follow-up interview. A user might, for example, put the hands on the steering wheel for comfort or by habit and not because of a misunderstanding of the currently active automation mode.

#### *3.3. Combination of Measurement Methods*

Mode awareness is a complex construct that encompasses sufficient knowledge about the system and its limits as well as behavioral aspects while driving. Currently, there is no gold standard for measuring all aspects of mode awareness, which is why most authors use a combination of multiple methods. Victor et al. [43], for instance, examined mode awareness during a PAD drive and added a take-over situation at the end of the drive because of an obstacle on the road. Mode awareness was operationalized by subjective as well as objective variables. The former consisted of a questionnaire on trust and open interview questions on the impulse to intervene as well as the realization of the need to intervene. Objective data comprised response process variables and glance variables. Generally, a mixture of subjective and objective indicators is advisable for a valid interpretation of the declarative knowledge of users about the current system and its functionality as well as the behavior according to system requirements. In this case, the behavior aspect is assessed sufficiently by analyzing gaze behavior as well as take-over performance and the handling of critical situations. While the short interviews after the drive can give an impression of trust level and situation awareness, they do not allow an evaluation of the user's mental model and mode confusion.

Another approach to measuring mode awareness can be found in a study by Wang and Söffker [44]. The authors investigated six driving scenarios. The implementation of both Level 2 and Level 3 in these scenarios is reasonable for studying mode awareness in a worst case approach [5]. Mode awareness was operationalized by a situation awareness questionnaire, but the authors also measured take-over time and quality in case of system failures and the engagement in NDRTs. While these measures do provide subjective and objective information, the lack of monitoring data does not allow a full objective interpretation of mode awareness. Furthermore, it has to be noted that the mid-questionnaire between the scenarios only included six questions on mode awareness, which is very little for such a complex construct.

Othersen [18] conducted a study on situation and mode awareness. Objective measures consisted of driving parameters, specifically reaction and take-over time, the quality of reactions and potential deactivation of the system. Furthermore, the author examined gaze behavior and video data as well as the performance in an audio-verbal NDRT. Subjective data comprised items on mode confusion, monitoring behavior, responsibilities during the drive and critical situations as well as the user's take-over performance. This approach covers many aspects of mode awareness like knowledge about the user's monitoring task (objective and subjective), awareness of the currently active mode (subjective) as well as the resulting take-over performance. The analysis of gaze behavior was, however, conducted in absolute terms without systematically comparing different drives to a baseline. In addition, the closed self-rating scale does not provide detailed information about the user's responsibilities. A comprehensive assessment of the user's mental model is, however, crucial to define the cause of a potential lack of monitoring behavior.

#### **4. A Subjective and Objective Measurement Method for Mode Awareness**

In order to assess all aspects of mode awareness, we wanted to develop a new method that combines subjective and objective information in a worst case scenario. This approach allows the assessment of all major aspects of mode awareness (see Figure 1): the knowledge about which mode is currently active and the knowledge about the system's abilities and limits (knowledge pillar) as well as the resulting mode compliant behavior (behavior pillar).

**Figure 1.** Mode awareness can be subdivided into a knowledge and a behavior pillar, which are measured separately in our proposed study design. All white frames represent dependent variables.

#### *4.1. Knowledge Pillar*

One aspect of the definition of mode awareness according to [6] is sufficient knowledge about the system. Therefore, an assessment of mode awareness should include the measurement of the participant's mental model, which should be conducted before the experimental drive. In studies examining the effectiveness of certain methods to promote mode awareness, it is vital to first instruct the participants on the automation systems because differing amounts of prior knowledge can affect the effectiveness of such methods. The subsequent knowledge test can thus ensure a homogeneous level of existing knowledge before the drive. Such an extensive instruction is, however, not advisable when studies aim to verify mode awareness in order to get approval for an automated system. The initial knowledge test then serves as a first indicator of mode awareness during the drive.

It is, furthermore, important to include an extensive post-enquiry to test the amount of knowledge after the drive. This allows conclusions about the knowledge of the systems' limits, the human-machine interface (HMI) and the driver's responsibilities during the drive. This second administration of the test is also relevant for measuring the change in the mental model due to the driving sequence or (if applicable) a certain experimental manipulation.

The questionnaire for the mental model we developed consists of five parts. At the beginning, participants are asked to subjectively rate their knowledge about all assistance systems of interest on a 7-point Likert scale. This subjective rating is followed by an objective evaluation of the user's knowledge. They are first asked to formulate the two main aspects of each system. These statements are then evaluated by the examiner on the basis of a rating system, which categorizes information in mandatory and optional information. This is followed by detailed questions on various aspects of the assistance system, mainly the systems' limits and abilities as well as the responsibility of the driver. These statements have to be assigned to the respective assistance system or alternatively classified as true or false. Lastly, the participants are tested on their knowledge about the HMI and the handling of the systems' (de-)activation. That mainly functions as an indicator for mode confusion in the drive since insecurities about the corresponding icons for each mode as well as the correct button for activating and deactivating the systems can easily lead to confusion about the currently active system.

#### *4.2. Behavior Pillar*

#### 4.2.1. Design

Declarative knowledge about the systems and their capabilities is necessary but not sufficient for mode awareness. Drivers might, for instance, be well aware of the currently active system and their responsibilities but still neglect the monitoring task because of over-trust [29]. A distracted or inattentive driver might then not notice a PAD system reaching its limit and thus crash. This use case is just one potential scenario and certainly represents a worst case setting. In order to ensure the safety of driver assistance systems in studies, however, a worst case approach is necessary [45].

We propose the following study design to validly measure mode compliant behavior in a worst case scenario (see Figure 2).

**Figure 2.** A schematic depiction of the driving sequences. A first familiarizing drive and a manual baseline are followed by a sequence of driving partially automated driving (PAD), conditionally automated driving (CAD) and then PAD again. Mode awareness is operationalized by the comparison of attention to driving related areas of interest during the drives and the controllability of a system limit at the end of the second PAD drive.

The drive starts subsequent to the theoretical instruction and the questionnaire on the mental model with a familiarizing drive. This drive is crucial to eliminate the influence of prior experiences with driver assistance systems, the make of the car or potential situational factors (e.g., being in a driving simulator). Depending on the research question, the familiarizing drive can contain short drives in all assistance modes including the switches between them or just a manual drive. It is advisable to then start a short period of driving manually as a baseline. The gaze behavior and driving data during this drive serve as comparison values for all subsequent automated drives. As mentioned earlier, a frequent switch between these automation modes is especially challenging for maintaining mode awareness, since the functions seem very similar for the user. In line with a worst case approach, we thus recommend including multiple switches between automation modes within the study. Between CAD and PAD, the latter has fewer situational requirements, which is why a first switch from manual driving to PAD is the most ecologically valid option. It also allows the assessment of the first contact of drivers with PAD as a baseline value. After a certain period of driving PAD, the system should then enable the switch to CAD. This drive should be terminated by a take-over request to initiate the last driving sequence in PAD. Until this point, both Level 2 and Level 3 would have worked perfectly without reaching any unexpected system limits. This is in line with a worst case approach since the high performance level makes it difficult to distinguish between both systems. By definition, however, PAD systems might reach their limit without giving a warning or take-over request, e.g., because they accidentally follow a tar track instead of the actual lane. It is advisable to include such a scenario to assess the controllability of a potentially critical situation. That is crucial for driving safety and actually of higher importance than monitoring behavior. We recommend a silent system error at the end of the second PAD drive, in which the vehicle drives straight ahead instead of following the curved road. Without intervention of the driver, the vehicle would then collide with the crash barrier or drive onto the adjacent patch of grass.

The time frame of these drives can be chosen according to resources and research question. Generally, a longer time frame will lead to more reliable data. A longer time frame will, however, also lead to increased driver fatigue [46]. Multiple internal studies showed that participants need 5 to 10 min to get used to the system. To avoid the influence of fatigue but ensure a sufficient amount of data, we thus advise a duration of approximately 8 to 10 min per automated drive. Studies by Kurpiers et al. [9] and Feldhütter et al. [47] confirmed the assumption that this is an appropriate time frame to avoid insecurities when handling the system while simultaneously avoiding fatigue. Certain research questions and participant characteristics might, however, require the adaptation of these time slots. It must also be noted that the study design proposed in Figure 2 is only applicable in this form for studies in driving simulators. The uncontrollability of on-road studies may not allow strict adherence to the proposed time frames because of changing road conditions and environments. This study design is, however, an appropriate basis for measuring mode awareness in a simulated environment. As such, it serves as a good tool to test changes in the automated function during development to ensure their safety. Furthermore, it can be used to check the effectiveness of measures to increase mode awareness.

#### 4.2.2. Attention Ratio

The aim of the manual-PAD-CAD-PAD sequence is to assess the participants' behavior with respect to the mode dependent responsibilities of the user. The requirement to keep the attention on the traffic at all times in PAD but not CAD is the greatest difference between these two automation systems, and monitoring behavior is therefore a suitable operationalization of mode awareness. As most shifts in visual attention go along with a shift of gaze, glance behavior can be used for the operationalization of mode awareness [48]. The most interesting metric of glance behavior is the attention ratio, which represents the percentage of time that a participant's gaze is directed at a certain area of interest (AOI) in relation to the total duration of each driving phase. One AOI of particular interest is the road center slightly below the horizon [49,50], since hazard and event detection requires focal vision [51]. Further driving relevant areas within the visual field are the instrument cluster to monitor the system's status as well as the lanes to the left and right, the side mirrors and the rearview mirror for traffic monitoring. If NDRTs are used (e.g., smartphone apps or games in the central information display), their location should be evaluated as a relevant non-driving related AOI to track the attention ratio to the NDRT.
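The attention ratio defined above can be computed as in the following minimal sketch. It assumes that the eye tracker output has already been reduced to glance intervals labeled with an AOI; the tuple format, the AOI names and the function names are hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical labels for the driving related AOIs named in the text.
DRIVING_AOIS = {
    "road_center", "instrument_cluster", "left_lane",
    "right_lane", "side_mirror", "rearview_mirror",
}


def attention_ratios(glances, phase_start, phase_end):
    """Share of a driving phase spent looking at each AOI.

    glances: iterable of (start_s, end_s, aoi) tuples (assumed format).
    Returns a dict mapping each AOI to glance time / phase duration.
    """
    phase_duration = phase_end - phase_start
    if phase_duration <= 0:
        raise ValueError("phase must have a positive duration")
    time_per_aoi = defaultdict(float)
    for start, end, aoi in glances:
        # Count only the part of the glance that lies inside the phase.
        overlap = min(end, phase_end) - max(start, phase_start)
        if overlap > 0:
            time_per_aoi[aoi] += overlap
    return {aoi: t / phase_duration for aoi, t in time_per_aoi.items()}


def driving_related_ratio(ratios):
    """Summed attention ratio over all driving related AOIs."""
    return sum(r for aoi, r in ratios.items() if aoi in DRIVING_AOIS)


# Illustrative glance data for a 10 s phase, not taken from any study.
glances = [
    (0.0, 6.0, "road_center"),
    (6.0, 8.0, "instrument_cluster"),
    (8.0, 10.0, "ndrt_display"),
]
ratios = attention_ratios(glances, phase_start=0.0, phase_end=10.0)
```

Computing the ratio per driving phase (manual, first PAD, CAD, second PAD) yields the values that are compared between phases.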

#### 4.2.3. Target and Actual Values

When PAD is active, the driver is assisted in the lateral and longitudinal guidance of the vehicle. Similar to manual driving, all other responsibility lies with the driver [3]. Since the driver must be able to detect all system limits and take over at all times even without warning during PAD, the attention ratio to the traffic situation and the system's performance should not differ from that in manual driving. A mode aware driver should furthermore not show any significant differences in gaze behavior between the first and the second PAD drive despite the interposed CAD sequence. During CAD, on the other hand, a reduced amount of monitoring behavior is expected compared to manual and PAD driving, since the driver is allowed and instructed to engage in NDRTs [3]. While the amount of monitoring in PAD is safety critical, a comparison between CAD and a manual or a PAD drive mainly serves as an indicator for the quality of discrimination concerning the user's tasks. A lack of such a discrepancy in monitoring behavior is, however, not necessarily evidence for mode unawareness since users might also deliberately want to monitor the CAD function and their surroundings.

This proposed gaze behavior has been tested with similar designs in various studies [9,51]. In the static simulator study by Feldhütter et al. [47], for example, participants showed an attention ratio to the road center of 89% during manual driving, which was significantly reduced to 51% and 18% in the first and second PAD drive, respectively, with a significant decrease from the first to the second PAD drive. This is a characteristic example of a mode awareness deficit, since the monitoring task was neglected during the PAD drives compared to manual driving, a neglect that was intensified by the interposed CAD drive (schematic depiction in Figure 3).
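
To make the size of these declines tangible, the reported attention ratios can be put into a short worked calculation. The sketch below only restates the three percentages given above as absolute and relative drops; it is descriptive arithmetic and does not reproduce the significance tests of the cited study. The dictionary keys and the function name are illustrative.

```python
# Attention ratios to the road center in percent, as reported in the text.
ratios = {"manual": 89, "pad_1": 51, "pad_2": 18}


def drop(from_phase, to_phase):
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = ratios[from_phase] - ratios[to_phase]
    relative = absolute / ratios[from_phase]
    return absolute, relative


manual_to_pad1 = drop("manual", "pad_1")  # 38 points, about 43% relative
pad1_to_pad2 = drop("pad_1", "pad_2")     # 33 points, about 65% relative
```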

**Figure 3.** A schematic depiction of the target and actual gaze behavior during manual, PAD and CAD driving. Mode awareness can be assumed if there is neither a significant difference between the attention ratio in the manual and the first PAD drive nor a significant decline in monitoring behavior from the first to the second PAD drive.

#### 4.2.4. Controllability

Next to the monitoring behavior itself, the safety of automated vehicles in Level 2 and Level 3 is highly dependent on the user's ability to manage system limits. In the proposed worst-case scenario of a silent system error in the second PAD drive, the car keeps driving straight ahead while the track curves. The controllability of this situation can be assessed by the ability of the driver to keep the car on track. Potential parameters are the share of the vehicle's surface area that has crossed the track boundary before the driver intervenes and the crash rate. Feldhütter et al. [47], for example, found that only 16% of participants intervened before the car had left the track and 29% did not take over before the car had left the track completely. This clearly demonstrates the dangers of insufficient monitoring behavior during PAD that results from a deficit in mode awareness. It has to be noted, however, that poor performance in the take-over scenario cannot necessarily be ascribed to a lack of mode awareness. We therefore advise a qualitative inquiry into the take-over scenario after the drive.
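
One possible way to turn the surface-area parameter into outcome categories is sketched below. The three categories, the coding of a missing intervention as `None` and the sample values are assumptions made for illustration; they are not taken from the cited study.

```python
from collections import Counter

CATEGORIES = ("on_track", "partially_off_track", "left_track_completely")


def classify_takeover(area_off_track):
    """Classify one take-over.

    `area_off_track` is the fraction (0..1) of the vehicle's surface area
    outside the track when the driver intervened, or None if the driver
    never intervened (assumed coding).
    """
    if area_off_track is None or area_off_track >= 1.0:
        return "left_track_completely"
    if area_off_track <= 0.0:
        return "on_track"
    return "partially_off_track"


def outcome_shares(areas):
    """Share of participants per outcome category."""
    counts = Counter(classify_takeover(a) for a in areas)
    return {category: counts[category] / len(areas) for category in CATEGORIES}


# Hypothetical sample of eight participants (illustrative values only).
sample = [0.0, 0.2, 0.6, 1.0, None, 0.0, 0.4, 0.9]
shares = outcome_shares(sample)
```

Such shares are only a starting point, since the qualitative inquiry is still needed to decide whether a late intervention reflects a lack of mode awareness.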

#### *4.3. Additional Variables*

The main problem when using gaze data is its lack of specificity, since it can be influenced by factors like extrinsic motivation or boredom [24], risk tolerance [11], a faulty mental model and mode confusion [6] or over-trust [29]. Questionnaires before and after the drive are thus essential to ascribe a lack of monitoring and controllability to a concrete source. Next to the aforementioned knowledge test, one important assessment is the evaluation of trust in the automated system (e.g., the automation trust scale (ATS) by Jian, Bisantz, and Drury [52]; the questionnaire on human-computer trust by Madsen and Gregor [53]), because deficits in mode awareness and over-trust cannot be distinguished without background information on the user's experience and mindset. In addition, a subjective test for mode awareness (like the one used by Othersen et al. [18]) might be useful in many cases. Furthermore, it is of great value to add various questions, e.g., concerning the perception of the events during the silent system error, the engagement in NDRTs during PAD, a lack of engagement in NDRTs during CAD, automation surprises and other subjective data. The specific choice of questions should be based on the individual characteristics of the driving behavior and the examiner's observations.

#### **5. Limitations and Benefits**

Despite the theoretical soundness of this approach, the validity of the proposed study design has not yet been established. The data in [9,47], which result from study designs using the proposed method, can give a first impression of its applicability but allow no conclusion on the validity of the measurement approach. The main reason is that there is no best practice for measuring mode awareness against which the results of the suggested approach could be compared. Furthermore, the interpretation of mode awareness in our study design is based on a number of different variables that need to be considered as a whole. Because the approach mixes quantitative and qualitative measures, it is hardly possible to calculate a single parameter for mode awareness that might be used in a validation process. In addition, this design for measuring mode awareness is not applicable in all study settings. First, the addition of a critical take-over scenario at the end of the second PAD drive is obviously impossible in on-road studies. The only solution for evaluating controllability there is to look for naturally arising system limits of PAD and assess the take-over quality of the participants. Any study in real traffic is, furthermore, subject to uncontrollable circumstances like weather, traffic and road works that might influence the availability of the assistance systems. Second, some research questions might call for a variation of drives compared to the proposed design, which might change the values of mode awareness. Since the order of drives is essential to the assessment of mode awareness, however, it is not advisable to alter the sequence of the automated drives. As a consequence of this fixed order, the attention ratio in the second PAD drive might already be reduced because of tiredness or exhaustion. This should be factored in by collecting a sleepiness rating, e.g., with the Karolinska sleepiness scale (KSS; [54]).

When testing the human-machine interaction of automated functions, the aim of most studies is to predict user behavior in the field. To ensure the safety of a function, it is important to prove its robustness even in worst cases. Wickens [45] argues that accidents in aviation are often caused by worst-case performers in worst-case situations. That is why extreme cases should not be treated as outliers in a normal distribution but considered for safety issues. Consequently, the proposed study design is an appropriate approach to testing mode awareness in PAD and CAD. In addition, this design is suitable for measuring mode awareness in different scenarios because the switch from PAD to CAD and back to PAD allows a relative comparison of the attention ratio in Level 2. The use of absolute values, on the other hand, would lead to misinterpretations, since the attention ratio itself will differ greatly between a simulator study without any actual danger and a real car study on public highways.

#### **6. Conclusions**

We propose a study design to assess mode awareness by focusing on its behavioral aspect, more precisely the attention ratio while driving in PAD, CAD and then PAD again, in addition to the controllability of a critical take-over scenario at the end of the second PAD drive. Questionnaires and interviews on the mental model, trust, the engagement in NDRTs and other observations during the drive will enable the examiner to identify the source of a potential neglect of the monitoring behavior during PAD. Taken together, we feel positive about the potential of this approach to cover all aspects of mode awareness while differentiating it from similar constructs. Further validation of the proposed design and assessment technique is required.

**Author Contributions:** Conceptualization, C.K. and B.B.; Investigation, C.K.; Methodology, C.K., B.B. and J.M.H.; Project administration, C.K.; Writing—original draft, C.K. and B.B.; Writing—review & editing, C.K., B.B., J.M.H. and F.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We thank the Chair of Ergonomics of the Technical University Munich for the collaboration. Especially, we would like to thank Klaus Bengler, Anna Feldhütter, Moritz Körber and Michael Rettenmaier for providing the static driving simulator of the Technical University Munich and for conducting three studies that investigated different measures for maintaining mode awareness with the developed measurement method.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Engagement in Non-Driving Related Tasks as a Non-Intrusive Measure for Mode Awareness: A Simulator Study**

**Yannick Forster <sup>1,</sup>\*, Viktoria Geisel <sup>2</sup>, Sebastian Hergeth <sup>1</sup>, Frederik Naujoks <sup>1</sup> and Andreas Keinath <sup>1</sup>**


Received: 8 April 2020; Accepted: 23 April 2020; Published: 28 April 2020

**Abstract:** Research on the role of non-driving related tasks (NDRT) in the area of automated driving is indispensable. At the same time, the construct mode awareness has received considerable interest in regard to human–machine interface (HMI) evaluation. Based on the expectation that HMI design and practice with different levels of driving automation influence NDRT engagement, a driving simulator study was conducted. In a 2 × 5 (automation level × block) design, *N* = 49 participants completed several transitions of control. They were told that they could engage in an NDRT if they felt safe and comfortable to do so. The NDRT was the Surrogate Reference Task (SuRT) as a representative of a wide range of visual–manual NDRTs. Engagement (i.e., number of inputs on the NDRT interface) was assessed at the onset of a respective episode of automated driving (i.e., after transition) and during ongoing automation (i.e., before subsequent transition). Results revealed that over time, NDRT engagement increased during both L2 and L3 automation until stable engagement at the third block. This trend was observed for both onset and ongoing NDRT engagement. The overall engagement level and the increase in engagement are significantly stronger for L3 automation compared to L2 automation. These results outline the potential of NDRT engagement as an online non-intrusive measure for mode awareness. Moreover, repeated interaction is necessary until users are familiar with the automated system and its HMI to engage in NDRTs. These results provide researchers and practitioners with indications about users' minimum degree of familiarity with driving automation and HMIs for mode awareness testing.

**Keywords:** automated driving; human-machine interface; mode awareness

#### **1. Introduction**

The market introduction of vehicles equipped with SAE Level 3 (L3) automated driving systems (ADS) is only a matter of time. Automated driving promises numerous benefits: among others, it is expected to foster efficiency in terms of time usage. The driver may divert his/her attention to non-driving related activities while the ADS is executing vehicle guidance. SAE Level 2 (L2) driving automation, which is already commercially available, is also capable of controlling vehicle guidance while the driver still has to constantly monitor the system functioning [1]. L3 automated driving systems differ from L2 automation in that the driver has to be readily available as a fallback performer in case the system requests a transition to manual control. Thus, with the transition from L2 to L3 automation, the human driver's role shifts from that of an active system supervisor to a fallback-ready user who may engage in non-driving related tasks (NDRT). The availability of different driving modes (i.e., L1, L2, and L3) in one vehicle poses additional challenges to the driver to understand his/her role accordingly and not to confuse different automation modes and levels. Mode awareness as a critical issue in driving automation requires further research efforts for ensuring safe operation of different automated driving functions. Knowledge on the assessment of mode awareness, however, is scarce. Addressing this issue, the present study examines engagement in a representative visual–manual NDRT during different levels of automated driving as a non-intrusive measure for mode awareness. In the following, we first outline theoretical backgrounds on mode awareness and methodology to assess this construct. Subsequently, the research question and hypotheses are derived based on the preceding considerations.

#### **2. Background**

In the automotive context, the evaluation of HMIs has a long history. The distraction potential of in-vehicle information systems (IVIS) is the main focus for manual driving (SAE L0). Here, test procedures to assess visual workload associated with the IVIS have already been established [2,3]. However, the change of the driver's role from manual driver to supervisor in L2 and fallback performer in L3 automation renders the application of these methods unfeasible. For example, NHTSA distraction guidelines only permit 2 s per glance and 12 s total glance duration on IVIS. It is questionable whether these numbers, which were proposed for manual driving, are also suitable for L2 automation. In addition, with the driving automation executing longitudinal and lateral vehicle control, distance and lane keeping are not applicable measures for indicating the suitability of an HMI in this particular context. In contrast, a variety of constructs related to the safe driver–automation interaction such as trust [4–7], controllability [8–10], understanding in the form of mental models [11–13], or usability [14] could be used as criteria. Research has shown that these pose challenges to the design and evaluation of automated vehicle HMIs. For an outline of evaluation methods for automated vehicle HMIs see [15]. One further step towards an ADS method validation concerns the investigation of mode awareness. This term was proposed by Sarter and Woods [16]. The authors report that even pilots who can be considered highly skilled and trained operators of flight automation can face situations where they are not certain of roles and responsibilities for the aircraft operation task. Such situations can lead to dangerous outcomes and consequently a safety-related assessment is indispensable.

Mode awareness is a central aspect for appropriate and safe human–automation interaction in general and in the context of driving automation in particular. For example, Gopinath and Johansen [17] outline that mode awareness of operators is of crucial importance for safety when interacting with production robots. By appropriate design of the automation and the corresponding HMIs, safety risks can be mitigated (e.g., [18]). In the driving automation context, Feldhuetter, Segler and Bengler [19] provide evidence that drivers' mode awareness is reduced when the vehicle is equipped with additional driving automation functions (see also [20]). Similar to the proposal by Gopinath and Johansen [17], they investigated whether an adaptive HMI design could support mode awareness, but could not find an effect. Other research supports their hypothesis that HMI design can affect drivers' visual behavior. For example, Kraft, Naujoks, Woerle and Neukum [21] report the impact of the HMI design on glance distributions during active L2 automation. In this study, a reduced and simple display produced positive effects in terms of distraction on both a self-reported and behavioral level. In addition, familiarity-dependent practice effects occurred for glance patterns. In general, behavioral adaptation to automated driving can be expected as outlined in [22]. An appropriate design of L3 automated vehicle HMIs can support self-reported usability and trust in automation (Hergeth, 2016). Since trust is expected to determine reliance behavior [6,23], we assume that such HMI variations can also affect behavioral parameters concerning NDRT engagement. This influence of HMI design on user behavior is of high importance since it must convey information about the driver's role during active L2 and L3 functioning.

Investigating mode awareness between driving episodes, Feldhuetter and colleagues [24] tested whether manual driving episodes as intermittent features between transitions of L2 and L3 automation can help to promote mode awareness. In this experiment, they operationalized mode awareness via the visual attention towards driving-relevant areas and engagement in NDRTs. The study shows that there is a difference of visual attention allocation and NDRT engagement. However, it remains unknown whether this observation is stable or prone to changes over time. As there is research indicating behavioral changes in interaction with driving automation when interacting repeatedly [14,21], NDRT-related behavior might also change. Especially findings of more accurate mental models over time [11–13] lead to the question whether mode awareness is also dependent on the familiarity with the driving automation.

As indicated above, reliance behavior is suggested to be closely tied to NDRT engagement during automated driving [7]. The difference between L2 and L3 is that the driver is responsible for supervising the automation in L2 whereas he/she has to be readily available to perform driving task fallback in L3. For the HMI design, this indicates that L2 automation systems require a feature ensuring that drivers are attentive to the supervising task either by steering wheel input or gaze tracking to the forward roadway (see e.g., [25]). By issuing a so-called "hands-on request" or "attention request", the system draws the driver's attention back towards the supervising task. In comparison, such interface features are not part of an L3 system as it allows NDRT engagement. L3 systems only request driver input at operational design domain (ODD) limits or system malfunctions [26]. Thus, NDRT-related behavior should differ depending on the understanding of the current level of automation (i.e., mode awareness) given an interface is designed in accordance with the prior considerations. The design of automated vehicle HMIs is therefore a crucial aspect for the facilitation of visual attention towards relevant events inside or outside the vehicle [27,28]. A study by Llaneras and colleagues [29] found that drivers tend to engage in NDRTs during reliable L2 automation that does not monitor or restrict behavior. This leads to risky driving and diverts attention away from the roadway and supervision of the system. Therefore, investigation and comparison of NDRT engagement during L2 and L3 automation is of high importance. It is expected that HMI features such as hands-on or attention requests during L2 automation lead to improved mode awareness, with drivers better understanding their roles and responsibilities (i.e., supervising during L2). This understanding eventually translates into observable behavior of less NDRT engagement during L2 as compared to L3 automation.

The studies outlined above show that there is a growing body of research on mode awareness in the driving automation domain. Additionally, the HMI considerations outlined above suggest that NDRT engagement can serve as an indicator of mode awareness. However, commonly agreed methodological approaches are still missing. In relation to the theoretical and conceptual developments, the present study's aim was to investigate how mode awareness can be assessed in a non-intrusive way. It seeks to extend the findings on understanding as reported in [13]. Results of this publication showed that the general understanding of roles and responsibilities (i.e., mode awareness) was high for both L2 and L3 automation. However, the question remained whether this understanding also translates into observable behavior. Non-intrusive measurements of mode awareness bear advantages both for researchers and practitioners and for the real-world application of driver-monitoring systems. On the one hand, during the development and evaluation of automated vehicle HMIs, mode awareness represents a critical issue that needs to be assessed. With the availability of a non-intrusive measure, research methodology benefits from the present research. On the other hand, real-world application could use driver monitoring technology to detect potential losses of mode awareness based on the driver's current behavior. Thus, an ADS might undertake necessary precautions such as displaying warning messages, which are already in effect today for fatigue detection.

#### *Research Question and Hypotheses*

From theoretical considerations outlined above, the following research question is derived: How does NDRT engagement calibrate for different levels of automation (i.e., for different graphical HMI designs) and with rising system experience? The following two hypotheses are formulated for this research question:

**Hypothesis 1 (H1).** *Drivers change their engagement in NDRTs over time*;

**Hypothesis 2 (H2).** *There is more NDRT engagement during an active L3 ADS compared to an active L2 driving automation*.

#### **3. Method**

#### *3.1. Sample*

A total of *N* = 59 participants took part in the driving simulation experiment. *N* = 10 drop-outs occurred because four participants did not complete the experimental procedure and six incomplete datasets were collected. This left *N* = 49 (13 female, 36 male) participants for data analysis. Mean age of the final sample was 30.96 years (*SD* = 9.08, *MAX* = 62, *MIN* = 21). All participants were BMW Group employees, held a German driver's license, and had normal or corrected-to-normal vision.

#### *3.2. Driving Simulation and Non-Driving Related Task*

The study was conducted in a moving-base driving simulator (see Figure 1, left). The integrated vehicle's console contained all necessary instrumentation and was identical to a BMW 5 series with automatic transmission. Seven 1080p projectors provided a 240° horizontal × 45° vertical frontal field of view. One LCD screen positioned behind the back seats inside the vehicle mockup and two outside projections with the same specifications served as rear view. The motion system consisted of a hydraulic hexapod with six degrees of freedom, capable of up to 7 m/s² transitional acceleration and 4.9 m/s² continuous acceleration. The Surrogate Reference Task [30] was displayed on a 12.3" tablet mounted on the center stack console and was active during the entire experimental drive (see Figure 1, right). NDRT engagement is measured using a task that is representative of many NDRTs in terms of demands and distraction potential to obtain high external validity. The Surrogate Reference Task (SuRT, [31]) is such a representative task since it is used as a generic visual–manual secondary task in distraction studies. In addition, it has also been used as an NDRT in automated driving studies [7,9,32]. The SuRT requires participants to identify a target stimulus (i.e., large circle) within an array of distractors (i.e., small circles). By varying the number of distractors and the size difference between target and distractors, the NDRT demand and resulting workload can be adjusted specifically. An advantage of the SuRT is its potential to support high experimental control. On the downside, it is not a naturalistic NDRT, and motivation to extensively engage in the SuRT could thus be limited.

**Figure 1.** Dynamic driving simulator from the outside (**left**) and mockup interior with the Surrogate Reference Task (SuRT) tablet used in the current study (**right**).

The interface on which the SuRT was presented did not display a score to the drivers to make NDRT engagement completely voluntary and free of a potential competitive character. The circles could be selected by touching the surface with a finger. When the participant selected the correct circle, it turned green before the subsequent pattern emerged. In case the wrong target was selected, it turned red and the pattern stayed until it was solved correctly.
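
The trial logic described above can be sketched in a few lines. The sketch is not the software used in the study: the `Trial` record, the radii, the number of distractors and the function names are hypothetical, and only the feedback rule (green and next pattern after a correct touch, red and same pattern after a wrong touch) follows the text.

```python
import random
from dataclasses import dataclass


@dataclass
class Trial:
    radii: list        # one radius per circle, arbitrary units
    target_index: int  # position of the single large circle


def make_trial(n_distractors, rng, distractor_radius=1.0, target_radius=1.5):
    """Build one pattern with n_distractors small circles and one large target."""
    target_index = rng.randrange(n_distractors + 1)
    radii = [distractor_radius] * (n_distractors + 1)
    radii[target_index] = target_radius
    return Trial(radii, target_index)


def handle_touch(trial, touched_index):
    """Return (feedback colour, whether the next pattern is shown)."""
    if touched_index == trial.target_index:
        return "green", True
    return "red", False  # the pattern stays until it is solved correctly


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
trial = make_trial(n_distractors=20, rng=rng)
wrong_index = (trial.target_index + 1) % len(trial.radii)
```

In this sketch, task demand would be varied through `n_distractors` and the ratio of `target_radius` to `distractor_radius`.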

#### *3.3. Study Design and Procedure*

The study employed a 2 × 5 mixed within–between subjects design. The within-subject factor "block" had five levels from the first to the fifth block of use cases. The between-subjects factor "feedback" had two levels where participants either received feedback on their interaction success after each use case or not. Because the between-subjects factor was out of scope for the present research question, this research reports results of the within-subject factor "block".

Upon arrival, participants were welcomed and gave informed consent. After a brief explanation of the study purpose, the experimenter led them to the vehicle mockup. To familiarize themselves with the simulator setup, participants had to complete at least two correct trials with the SuRT at standstill. Subsequently, they completed a five-minute manual familiarization drive without NDRT engagement. Prior to the experimental drive, the experimenter outlined the procedure and explained that participants would encounter two automated systems, namely an L2 driving automation and an L3 ADS. They also received information stating that they would not have to constantly monitor the correct functioning of the L3 ADS. Concerning NDRT engagement, participants were instructed before each block that they could freely decide whether to engage in the NDRT when the automation was active. In doing so, the experimenter did not specify the level of automation or explicitly name either of the two functions. Furthermore, there was no additional incentive for executing the NDRT. The subsequent experimental drive included five blocks, each consisting of six driver-initiated control transitions. After the successful completion of each interaction, there was a 20-s time window in which users' NDRT-related behavior was observed. Table 2 additionally provides an overview of the windows of observation for NDRT-related behavior. Subsequently, there was a brief inquiry during the drive that occurred six times for each block [33]. After the use case specific questions, there was another time window of at least 20 s and up to one minute in which users could freely engage in the NDRT before the instruction for the next use case. After each block, participants were told to pull over to the right shoulder, stop there, and complete the block inquiry. Participants completed the drive on a three-lane highway with low to medium traffic density. Surrounding vehicles drove at an average of 150 km/h in the center lane and an average of 180 km/h in the left lane. Vehicles in the right lane drove at an average of 130 km/h. Visibility was clear, with daytime lighting and a dry road. The highway itself was in good condition without potholes or construction areas. The experimental drive lasted approximately 60 min. Figure 2 schematically depicts the procedure.

**Figure 2.** Schematic outline of experimental procedure.

#### *3.4. Use Cases*

The present experiment included driver initiated transitions between manual, L2, and L3 automated driving [34] as use cases (UCs). Considering both upward and downward transitions, one experimental block consisted of six use cases. For the present analysis, only transitions to an automated driving mode are of interest. Consequently, transitions to manual are not considered here. The use cases with transition type, automation level at use case initiation, target automation level, and use case numbering are shown in Table 1. To counteract sequential effects, participants were randomly assigned to one of six possible block sequences that were created using a Latin square. Each block consisted of six trials. In total, each participant completed 30 use cases. To standardize instructions, we recorded samples for each use case that were triggered by the experimenter.
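
The text does not state which Latin square construction was used. As a minimal sketch, a cyclic Latin square over six use case labels is one common choice; the labels `UC1` to `UC6`, the seed and the helper name are assumptions made for illustration.

```python
import random


def cyclic_latin_square(items):
    """Each row is the previous row shifted by one position."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


use_cases = ["UC1", "UC2", "UC3", "UC4", "UC5", "UC6"]
sequences = cyclic_latin_square(use_cases)

# Random assignment of one participant to one of the six sequences.
rng = random.Random(42)  # fixed seed only to make the illustration repeatable
assigned = rng.choice(sequences)
```

In a cyclic square every use case appears once per row and once per ordinal position, which balances position effects but not first-order carry-over effects.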


**Table 1.** Overview of use cases for one experimental block.

#### *3.5. Automated Driving System*

As soon as the driver activated the respective function, it carried out longitudinal and lateral vehicle guidance. The longitudinal and lateral vehicle guidance of the L2 and L3 automation was identical. The L3 ADS was capable of executing independent lane change maneuvers (e.g., overtaking slower vehicles ahead, pulling back to the right lane). The L2 driving automation set speed was the current velocity and could be adjusted without restrictions. The L3 ADS set speed was 130 km/h and could be adjusted to slower speeds. Adjusting the set speed above 130 km/h deactivated the L3 ADS and activated the L2 driving automation. Vehicle following distance (time headway) to a lead vehicle was 2 s.
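
The set speed rule can be written down as a small state function. This is a sketch of the rule as described in the text, not the simulator's implementation; the mode labels and the function name are hypothetical.

```python
L3_MAX_SET_SPEED_KMH = 130  # upper limit of the L3 ADS set speed per the text


def mode_after_set_speed_change(active_mode, new_set_speed_kmh):
    """Return the active automation mode after the driver adjusts the set speed."""
    if active_mode == "L3" and new_set_speed_kmh > L3_MAX_SET_SPEED_KMH:
        # Exceeding the limit deactivates the L3 ADS and activates L2.
        return "L2"
    # L2 accepts any set speed; L3 stays active at or below the limit.
    return active_mode
```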

#### *3.6. Human–Machine Interface*

The visual HMI was shown on the instrument cluster. It showed the vehicle and its surroundings in both L2 and L3 automated driving. The HMI for automated driving resembled a combination of adaptive cruise control and additional steering assistance [35]. The present HMI constitutes a representative solution for an automated system due to the conceptual similarity to solutions in prior research [4,36]. The L2 vehicle surroundings and L3 vehicle surroundings differed in (1) their informational content (i.e., higher level of detail in L3: visibility of adjacent lanes and vehicles) and (2) their perspective (i.e., larger field of view in L3). Thus, specifically the distance between the eye point and the vehicle, the angle between the direct line of sight and the road, and the opening angle of the field of view were manipulated. Figure 3 schematically depicts the configurations for L2 and L3 automation of the vehicle surround views from a profile perspective. An activated L2 automation was colored in green while an activated L3 ADS was colored in blue. In addition, during activated L3 ADS, the steering wheel was illuminated in blue color. The L2 driving automation displayed a hands-on request (HOR) after 15 s of hands-free driving. The HOR was displayed as hands grabbing a steering wheel [37,38] and yellow pulses on the illuminated steering wheel. The system functions could be activated with a button on the left side of the steering wheel for both levels of automation. For a more comprehensive description of the operating elements, see [14].
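
The hands-on request rule distinguishes the two modes and can be sketched as a simple predicate. The function name and mode labels are hypothetical, and the sketch assumes that the request becomes due exactly when 15 s of hands-free driving have elapsed.

```python
HOR_DELAY_S = 15.0  # hands-free time before the L2 hands-on request, per the text


def hands_on_request_due(mode, hands_off_duration_s):
    """True if a hands-on request should be displayed.

    Only the L2 driving automation issues the request; the L3 ADS does not.
    """
    return mode == "L2" and hands_off_duration_s >= HOR_DELAY_S
```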

**Figure 3.** Schematic depiction of vehicle surroundings point of view for L2 (**left**) and L3 automation (**right**). The gray dot represents the eye point.

#### *3.7. Dependent Variables*

The present study operationalized NDRT engagement as input with the finger on the NDRT surface. Table 2 visualizes the windows of observation for the dependent variables. To find out about the onset of engagement, we counted the total number of inputs on the surface for a time interval of 20 s after successful completion of each use case (NDRT observation window 1). Since it can be assumed that it takes some time for the NDRT engagement to set in and then to stabilize, we also investigated NDRT-related behavior at the end of an automated driving episode where the onset had most likely occurred and NDRT engagement was on a stable level. For that purpose, there was another window of observation covering the 20 s just before the onset of the subsequent use case (NDRT observation window 2).
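
The two observation windows can be illustrated with a short counting routine. It assumes that touch inputs are logged as timestamps in seconds and that both windows are half-open intervals; the function names and the sample timestamps are hypothetical.

```python
from bisect import bisect_left

WINDOW_S = 20.0  # length of both observation windows, per the text


def count_inputs(input_times_s, window_start_s, window_s=WINDOW_S):
    """Count inputs with window_start_s <= t < window_start_s + window_s."""
    times = sorted(input_times_s)
    first = bisect_left(times, window_start_s)
    last = bisect_left(times, window_start_s + window_s)
    return last - first


def onset_and_ongoing(input_times_s, uc_completed_s, next_uc_onset_s):
    """Window 1 starts at use case completion; window 2 ends at the next onset."""
    onset = count_inputs(input_times_s, uc_completed_s)
    ongoing = count_inputs(input_times_s, next_uc_onset_s - WINDOW_S)
    return onset, ongoing


# Hypothetical touch log for one automated driving episode.
touches = [101.5, 118.0, 125.0, 170.2, 181.0, 189.9, 190.0]
onset, ongoing = onset_and_ongoing(touches, uc_completed_s=100.0, next_uc_onset_s=190.0)
```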

**Table 2.** Schematic outline of experimental procedure for each use case. The two observation windows are colored in blue.


#### *3.8. Statistical Procedure and Data Analysis*

NDRT data were pre-processed and visualized using MATLAB Version 2015 (MathWorks Inc., Natick, MA, USA). Statistical tests were calculated using IBM SPSS Statistics Version 23 (IBM, Armonk, NY, USA). For observation window 1, means and standard deviations (SD) were computed for onset NDRT input frequency by use case and block. In contrast, when observation window 2 started, the transition of control dated back too far for a comparison of NDRT-related behavior on use case level (i.e., considering the respective previous level of automation) to be useful for that period of time. Therefore, we compared NDRT engagement during observation window 2 only in regard to the level of automation that was active at that time. For that purpose, the sum of NDRT inputs during active L2 automation (after UC2 and UC4) and active L3 ADS (after UC1 and UC3), respectively, was calculated for each participant and block. Means and standard deviations (SD) were computed for these ongoing input sums. A significance level of α = 0.05 was applied for inferential testing unless stated otherwise. To control for alpha inflation due to multiple testing, correction after [39] was applied if necessary.
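
The aggregation for observation window 2 can be sketched as follows. The mapping of use cases to automation levels follows the text; the record layout, the participant labels and the sample counts are hypothetical, and the sketch uses the Python standard library rather than the MATLAB and SPSS tools named above.

```python
from collections import defaultdict
from statistics import mean, stdev

# Level that is active after each analysed use case, per the text.
LEVEL_BY_UC = {"UC1": "L3", "UC3": "L3", "UC2": "L2", "UC4": "L2"}


def ongoing_sums(records):
    """Sum window-2 inputs per (participant, block, level).

    `records` holds tuples (participant, block, use_case, n_inputs).
    """
    sums = defaultdict(int)
    for participant, block, use_case, n_inputs in records:
        level = LEVEL_BY_UC.get(use_case)
        if level is None:
            continue  # transitions to manual driving are not analysed
        sums[(participant, block, level)] += n_inputs
    return dict(sums)


def describe(sums, block, level):
    """Mean and sample SD of the ongoing input sums for one block and level."""
    values = [v for (_, b, lvl), v in sums.items() if b == block and lvl == level]
    return mean(values), stdev(values)


# Hypothetical records for two participants in block 1.
records = [
    ("P1", 1, "UC1", 2), ("P1", 1, "UC3", 1), ("P1", 1, "UC2", 0),
    ("P1", 1, "UC4", 1), ("P1", 1, "UC5", 5),
    ("P2", 1, "UC1", 4), ("P2", 1, "UC3", 3), ("P2", 1, "UC2", 1),
    ("P2", 1, "UC4", 0),
]
sums = ongoing_sums(records)
l3_mean, l3_sd = describe(sums, block=1, level="L3")
l2_mean, l2_sd = describe(sums, block=1, level="L2")
```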

#### **4. Results**

#### *4.1. Onset Input Frequency*

Table 3 shows descriptive statistics (i.e., *M*, *SD*) of NDRT input frequency within the 20 s after UC completion by use case and block. Means and standard errors of onset input frequency by use case and block are depicted in Figure 4. Descriptive values revealed that the overall number of NDRT inputs during the 20 s after task completion was on a low level with mean input frequency not exceeding a number of two. Furthermore, there was a tendency towards more NDRT engagement with increasing system experience in all four use cases. However, the observed increase was stronger for transitions to L3 automation (UC1 and UC3) than for transitions to L2 automation (UC2 and UC4). Independent from the block, descriptive data showed considerably more NDRT engagement after transitions to L3 than after transitions to L2.


**Table 3.** Descriptive statistics (i.e., M, SD) of onset input frequency for the four use cases (UCs) by block.

**Figure 4.** Means and SE of onset input frequency by UC and block (blue: transitions to L3 automation, red: transitions to L2 automation).

A 4 × 5 (UC × block) repeated measures analysis of variance (ANOVA) was conducted for onset input frequency. Results revealed significant main effects for both use case and block as well as a significant interaction effect (see Table 4). These inferential results indicate that mean input frequency differed significantly over time and for the different use cases, but the effect of the block depended on the respective use case. The effect sizes showed large effects ([40]; see Table 4). To examine these effects in detail, planned contrast analyses were performed to compare onset input frequency for the two different levels of automation (L2: after UC2 and UC4; L3: after UC1 and UC3) and for consecutive blocks. Results are displayed in Table 5. Regarding the two levels of automation, results revealed that there was significantly more NDRT engagement during active L3 than during active L2 automation; the effect size (see Table 5) indicated a strong effect [40]. Comparisons between consecutive blocks showed a mixed picture: Mean NDRT input frequency was significantly higher in block 2 than in block 1. There were also significantly more NDRT inputs in block 3 as compared to block 2; medium to large effect sizes were obtained [40]. The remaining contrasts between successive blocks did not reach significance (see Table 5). The results of the planned contrast analyses indicate that NDRT engagement increased within the first three system encounters and stabilized in subsequent system encounters.
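For readers who want to relate the reported F, df1, df2, and ηp2 values to each other, partial eta squared can be recovered from an F ratio and its degrees of freedom. The sketch below uses hypothetical numbers, not values from Tables 4 or 5, and the labels follow the commonly cited benchmarks for eta squared (0.01, 0.06, 0.14); that these are the thresholds applied in [40] is an assumption.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def benchmark_label(eta_p2):
    """Label an effect size using the conventional 0.01 / 0.06 / 0.14 benchmarks."""
    if eta_p2 >= 0.14:
        return "large"
    if eta_p2 >= 0.06:
        return "medium"
    if eta_p2 >= 0.01:
        return "small"
    return "negligible"


# Hypothetical example: F(4, 192) = 10.0.
eta = partial_eta_squared(f_value=10.0, df_effect=4, df_error=192)
print(round(eta, 3), benchmark_label(eta))  # 0.172 large
```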


**Table 4.** Inferential statistics (i.e., F, df1, df2, p, ηp2-value) of main and interaction effects for onset input frequency. Statistically significant effects are colored in gray.

**Table 5.** Inferential statistics (i.e., F, df1, df2, p, ηp2-value, and 95% CI limits) of planned contrast analyses for L2 (after UC2 and UC4) vs. L3 automation (after UC1 and UC3) and successive blocks for onset input frequency. Statistically significant effects are colored in gray.


#### *4.2. Ongoing Input Frequency*

Descriptive statistics (i.e., *M*, *SD*) of ongoing NDRT input sums within the 20 s before the onset of the upcoming use case by level of automation (L2: after UC2 and UC4; L3: after UC1 and UC3) and block can be found in Table 6. Figure 5 depicts means and standard errors of ongoing NDRT inputs by level of automation and block. The descriptive values showed similar tendencies as for onset NDRT engagement: The overall number of inputs during the 20 s before onset of the upcoming use case summed for active L2 and L3 automation, respectively, was relatively small with means not exceeding a number of four. Furthermore, a trend towards more NDRT engagement with rising system experience could be observed for both levels of automation with a seemingly weaker upward trend for L2 automation. However, descriptive NDRT engagement tended to stabilize after the first three system encounters. Descriptive data also indicated notably more ongoing NDRT engagement during active L3 automation than during active L2 automation in all five blocks.

**Table 6.** Descriptive statistics (i.e., M, SD) of ongoing input frequency summed for L2 (after UC2 and UC4) and L3 automation (after UC1 and UC3) by block.


**Figure 5.** Means and SE of ongoing input frequency summed for L2 and L3 automation by block.

A 2 × 5 (level of automation × block) repeated measures ANOVA was performed for ongoing NDRT engagement to examine main and interaction effects of the level of automation. Results are displayed in Table 7. There was a significant main effect of level of automation as well as of block. This means that ongoing NDRT engagement was significantly higher during L3 automation than during L2 automation and differed over time. Furthermore, there was a significant interaction effect indicating that the effect of block on NDRT engagement depended on the level of automation that was active. The effect sizes (see Table 7) showed large effects [40].

**Table 7.** Inferential statistics (i.e., F, df1, df2, p, ηp2-value) of main and interaction effects for ongoing input frequency summed for L2 and L3 automation. Statistically significant effects are colored in gray.


#### **5. Discussion and Conclusions**

This research investigated NDRT engagement at different levels of automated driving. The results of *N* = 49 participants showed that the levels of driving automation and accordingly designed HMIs lead to differences in NDRT engagement. An increase of NDRT engagement over time was observed for both automation levels, although this increase was stronger in L3 as compared to L2 automation. These results indicate that users' behavioral adaptation occurs during initial system encounters. They also show that the HMI design that follows considerations for L2 and L3 driving automation leads to specific behavioral patterns. The following section discusses the obtained results and relates them to prior considerations about NDRT engagement and mode awareness.

Overall, there were differences in NDRT engagement between the L3 and the L2 automation with significantly more engagement in L3 as compared to L2 automation as indicated by statistically significant main effects in Tables 4 and 7. These differences can be traced back to two sources. First, the L3 HMI permitted hands-free driving while the L2 HMI included hands-on requests. Second, the HMI designs differed in adaptations of informational content and perspective. Ultimately, no final statement is possible as to which HMI variation led to the differences in the observed behavior between the automation levels. Referring back to initial considerations of the HMI design for automated vehicles, it is important to include a form of feedback for L2 automation that prompts the drivers to supervise the driving automation. If such feedback is not present (as in the present L3 case), there is high NDRT engagement. This observation supports the results by Llaneras and colleagues [29]. The difference between NDRT engagement during L2 and L3 automation was observed for both the onset (see Figure 4) and ongoing (see Figure 5) NDRT engagement. These observations are in accordance with the findings reported in [19]. The results reported herein extend their findings by repeatedly observing the engagement in an NDRT. Here, similar results were obtained for L2 and L3 automation. Namely, engagement in NDRTs at initial contacts with driving automation, independent of the level of automation, is on a low level. The engagement rises in both instances as indicated by significant main effects for the block factor in both Tables 4 and 7. However, the rise in NDRT engagement was much stronger for L3 automation as compared to L2 automation as indicated by the significant interaction effects in the same tables. These results show that mode awareness might be captured not only by users' NDRT engagement in one block but also by its course over time (e.g., five repetitions).
The behavioral adaptation of NDRT engagement corresponds to related research that investigated human–automation interaction across repeated interactions [13,14,21]. A closer investigation of differences between the blocks by means of planned contrast analysis (see Table 5) showed that a change over time is present from the first up to the third encounter. From then on, stable engagement in NDRTs can be assumed. This has implications for study designs concerning automated driving and engagements in NDRTs. When setting up a study, researchers should be aware that behavioral adaptation requires a certain number of repeated trials until reliable user behavior is present. One example is the study by Hergeth and colleagues [7], where the authors investigated whether NDRT engagement and according glance behavior could be an indicator of reliance behavior and marker for trust in automation. Indeed, they considered familiarization with NDRT and automated driving system including *N* = 8 repeated NDRT engagements.

NDRT engagement was also present at L2 driving automation. By definition, users of L2 driving automation are responsible for supervising the driving task at all times and may not leave the control loop [1]. Even though NDRT engagement during L2 automation was on a descriptively low level, there were participants who diverted their attention away from supervising the driving automation. This observation has implications for the design of L2 automation. It has to be noted that secondary task activities occur even in manual driving [41]. Such distraction during manual driving (i.e., engaging in NDRTs) is considered a safety risk and should be minimized [1]. In contrast, there is first evidence that this tendency can be used in a beneficial way during automated driving as it might be turned into controlled engagement. For example, Paetzold and colleagues [42] did not find differences in reaction time to automation errors between participants who were either engaged or not engaged in an NDRT. In the same vein, Hensch and colleagues [43] found effects of display position and secondary task on the driver's glance behavior in both automated and manual driving. In particular, they report longer eyes-on display time for NDRTs in head-up display configurations. However, due to its proximity to the driving environment, a head-up display might enable a faster identification of and reaction to critical situations such as system failures. Thus, there are still challenges for the conceptual development of HMI designs for L2 automated vehicles.

Overall, this study supports that NDRT-related behavior can be used to distinguish between levels of automation and their HMI conceptualization. Indeed, drivers' differences in behavior in NDRTs support the conclusion that mode awareness for the HMIs in L2 and L3 automation was on a high level. This difference is apparent not only overall, but also in the changes over time. Moreover, the study demonstrated a methodological approach to evaluating NDRT behavior during an episode (i.e., onset vs. ongoing), with both measures leading to similar results. In particular, the fact that NDRT engagement changes over time implies that research needs to focus on prolonged periods and that drivers need to adapt to this technology first before it can be used appropriately.

#### *Limitations and Future Research*

This study comes with a number of limitations. First, there were no incentives for engaging in the NDRT. In real-road driving, drivers might disengage only if the NDRT has a rewarding character. It therefore remains unknown whether NDRT engagement, especially in L2 automation, would have remained at such a low level if rewards had been applied in this study. Second, the NDRT consisted of the SuRT alone, which is a standardized method for visual–manual distraction. On the one hand, this NDRT covers only two modalities of distraction (i.e., visual and manual); on the other hand, it might not be a very motivating NDRT. For example, Purucker and colleagues [44] used a more naturalistic set of NDRTs for their study, which increases the external validity of the findings. Third, the NDRT was mounted in a fixed way in the center console. It might be that engagement is increased if the NDRT is located closer to the line of sight [43]. Thus, future research has to determine how NDRT-related behavior at different levels of automation evolves for differing activities, modalities, and locations in the vehicle interior. Moreover, the present research only provides insights on the group level that support the predictive character of the SuRT as a measure for mode awareness. However, this does not permit inferences on the individual level. There is still room for future research to determine whether and how predictive the engagement in the SuRT is for mode awareness on an individual level.

**Author Contributions:** Conceptualization, Y.F., S.H., F.N.; methodology, Y.F., S.H., and F.N.; formal analysis, Y.F., V.G.; data curation, Y.F., V.G.; writing—original draft preparation, Y.F., V.G.; writing—review and editing, Y.F.; visualization, Y.F., V.G.; supervision, S.H., A.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.



© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Methodological Considerations Concerning Motion Sickness Investigations during Automated Driving**

#### **Dominik Mühlbacher <sup>1,</sup>\*, Markus Tomzig <sup>1</sup>, Katharina Reinmüller <sup>2</sup> and Lena Rittger <sup>2</sup>**


Received: 22 April 2020; Accepted: 11 May 2020; Published: 13 May 2020

**Abstract:** Automated driving vehicles will allow all occupants to spend their time with various non-driving related tasks like relaxing, working, or reading during the journey. However, a significant percentage of people is susceptible to motion sickness, which limits the comfort of engaging in those tasks during automated driving. Therefore, it is necessary to investigate the phenomenon of motion sickness during automated driving and to develop countermeasures. As most existing studies concerning motion sickness are fundamental research studies, a methodology for driving studies is yet missing. This paper discusses methodological aspects for investigating motion sickness in the context of driving including measurement tools, test environments, sample, and ethical restrictions. Additionally, methodological considerations guided by different underlying research questions and hypotheses are provided. Selected results from own studies concerning motion sickness during automated driving which were conducted in a motion-based driving simulation and a real vehicle are used to support the discussion.

**Keywords:** motion sickness; automated driving; methodology; driving comfort

#### **1. Introduction and Overview**

Motion sickness is well known amongst users of any kind of transportation. Sea sickness, airplane sickness, even space sickness have been investigated over the past 100 years [1]. Today, depending on the considered reference, up to 60% of Americans suffer from car sickness [2,3]. At the same time, original equipment manufacturers (OEMs) are developing towards automated driving, allowing drivers to hand over full control to the vehicle and thereby engage in non-driving related tasks while driving. Moreover, automated vehicles may include new cabin designs that enable different human postures and thus support the execution of non-driving related activities. With that, the possibility of suffering from motion sickness expands from passengers to drivers. To use the "value of time" generated by automated vehicles, users expect to engage in a large variety of tasks during driving, such as reading, working, playing video games, or watching movies [4,5]. However, while enabling such tasks in the vehicle is promising in terms of user satisfaction, it is exactly those activities that increase the probability of motion sickness [2,6]. Hence, within the context of automated driving, higher incidence numbers and more severe symptoms of motion sickness can be expected [7,8], which will impair the user experience in the vehicle. Besides negative subjective experiences, it is, until now, unclear how motion sickness influences take-over and driving performance in case the automation system reaches its limitations and drivers have to take over the driving task. Regarding professionals at sea, a study found up to 60% impaired performance due to sea sickness [7]. Hence, motion sickness could not only lead to a decreased acceptance of automated vehicles but also to a decrease in driving safety. Consequently, there has been an increasing demand for research on motion sickness in the context of automated driving. In particular, two research questions are of interest:


To answer these research questions, controlled empirical studies are necessary. Yet, there has only been a small amount of research conducted in realistic vehicle settings.

This paper discusses basic methodological considerations for designing and conducting empirical studies on motion sickness in the vehicle context. It shall support applied researchers to decide on the right methods, measures, samples, and ethical considerations. The paper furthermore includes unpublished data confirming the methodological approaches.

The data presented in the subsequent chapters is based on a study conducted in the high-level driving simulator of the Wuerzburg Institute for Traffic Sciences (WIVW GmbH) and an AUDI A8 serial vehicle. A total of N = 24 participants took part in the study. The study had a within-subjects design, i.e., every participant took part in four separate sessions, of which two were drives in the real vehicle and two were drives in the driving simulator. In the real driving part, an Audi A8L equipped with SAE Level 2 functionality was used as the test vehicle. A trained experimenter drove the car on an Autobahn track while the participant was sitting in the front passenger seat. The experimenter used the Level 2 functions whenever possible. Passing maneuvers were performed in a manner similar to an autonomous vehicle. The simulator runs with the driving simulation software SILAB®. The motion system uses a hexapod with six degrees of freedom and can briefly display a linear acceleration up to 5 m/s<sup>2</sup> or 100°/s<sup>2</sup> on a rotary scale. It consists of 6 electro-pneumatic actuators (stroke ±60 cm; inclination ±10°). The mockup is based on a BMW 520i with automatic transmission. As the visual system of the WIVW simulator is defined for the driver only, the participants were sitting in the driver's seat during the run in the driving simulator. The driving behavior of the simulated vehicle was made as comparable as possible to the driving behavior of the real vehicle. Additionally, the road geometry of the real Autobahn track was implemented precisely in the driving simulation. The participant's task was to watch a video during the rides of approx. 40 min. The four runs occurred in a counterbalanced order with a minimum of two days between each day of participation. In the study, motion sickness was measured via the misery-scale (MISC) [9] every two minutes during the run. After the run, a symptom questionnaire was used. It included a list with symptoms of the motion sickness questionnaire (MSQ) [10] and from the simulator sickness questionnaire [11], which were rated on a scale with four categories ranging from "none" to "severe". After the last run, the participants had to compare both test settings (real vehicle vs. driving simulator) in the form of several questions. Physiological data (participants' temperature, electrodermal activity, electrogastrogram) were recorded with a Varioport Polygraph (Becker Meditec).

#### **2. The Phenomenon of Motion Sickness**

#### *2.1. Symptoms, Prevalence, and Time Course of Motion Sickness*

The main symptom of motion sickness is nausea leading up to vomiting [12]. However, nausea is typically preceded and accompanied by symptoms like burping, (cold) sweat, pallor, fatigue, headache, or dizziness [9,13,14]. Appearance and chronology of the symptoms vary considerably between different persons and between the different types of motion sickness (e.g., carsickness, seasickness, simulator sickness) [7]. For example, oculomotor symptoms (such as eye strain or difficulties in focusing) are found more often in situations in which sickness is induced by visual stimuli (e.g., simulators) than in situations in which sickness is primarily elicited by movements (e.g., sea travel) [12].

It is difficult to make statements about the prevalence of motion sickness because its occurrence depends on multiple factors such as the means of transport (car, bus, ship, train, etc.) but also on duration and intensity of provocation (driving through curves vs. driving on a straight highway). However, it has repeatedly been demonstrated in laboratory studies that motion sickness occurrence and intensity highly depend on the frequency of accelerations which act upon the passenger. Frequencies of about 0.16 to 0.2 Hz are particularly provocative [15–24]. Furthermore, there is an effect of age: young children up to two years are immune to motion sickness [25]. Afterwards, mean motion sickness susceptibility rises to a peak at the age of 16 to 20 years [26]. Subsequently, the susceptibility decreases with increasing age. Concerning gender, women are generally more prone to motion sickness than men [25,27–31]. The reasons for the gender difference are not clear. However, hormonal factors or a lower threshold to admit motion sickness symptoms in women are discussed as possible explanations [25,27–31].
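To make the frequency criterion concrete, the following sketch checks whether the dominant frequency of a recorded acceleration signal falls into the 0.16 to 0.2 Hz range. It is an illustration only: the function names and the synthetic signal are hypothetical, a plain discrete Fourier transform stands in for whatever spectral analysis the cited studies used, and real driving data would typically require windowing and a look at the full spectrum rather than a single peak.

```python
import cmath
import math

PROVOCATIVE_BAND_HZ = (0.16, 0.2)  # range reported in the text


def dominant_frequency(samples, sample_rate_hz):
    """Frequency (Hz) of the largest non-DC component, via a plain DFT."""
    n = len(samples)
    mean_value = sum(samples) / n
    centered = [s - mean_value for s in samples]
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate_hz / n


def in_provocative_band(freq_hz, band=PROVOCATIVE_BAND_HZ):
    """True if the frequency lies within the band (boundaries included)."""
    low, high = band
    return low <= freq_hz <= high


# Synthetic lateral acceleration: 0.18 Hz sine, sampled at 5 Hz for 100 s.
rate = 5.0
signal = [math.sin(2 * math.pi * 0.18 * i / rate) for i in range(500)]
peak = dominant_frequency(signal, rate)
print(peak, in_provocative_band(peak))
```

With 100 s of data the frequency resolution is 0.01 Hz, which is fine enough to separate the band from its neighbors; shorter recordings would blur this distinction.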

Similarly to the prevalence, the time course of motion sickness also depends on various factors like the individual susceptibility as well as the type and intensity of sickness provocation. The first symptoms may be perceived immediately after onset in highly provoking conditions. In less severe conditions, motion sickness may occur after 10 to 20 min in susceptible participants. Depending on the study design, symptoms often intensify linearly with progressing provocation duration and decrease rapidly after offset [32–34].

#### *2.2. Motion Sickness Theories*

There are various models and theories trying to explain the mechanisms leading to motion sickness, e.g., the toxin-hypothesis [35], postural instability theory [36], negative reinforcement model [37], or the rule of thumb [38]. More popular than these models is the theory of sensory conflict and rearrangement [14] or its revision, the neural mismatch theory [39]. It states that motion sickness occurs if there is a discord between different sensory inputs, i.e., the visual and the vestibular system. For example, if a passenger is reading a book during a drive, the eyes register a static environment and give feedback that the person is not moving. However, the vestibular organs register the longitudinal and lateral accelerations of the vehicle and give feedback that the person is moving. In this situation, the probability for motion sickness is higher than in a passenger who looks ahead and thus has no contradictory impressions [34]. In addition, motion sickness depends on the type of task [40,41].

Reason and Brand extended the theory by the component of expected sensory impressions [14]. The sensory rearrangement theory states that motion sickness increases when the actual visual and vestibular impressions differ from the expected ones, i.e., when future movements cannot be anticipated. Within this context, effects of habituation may also be relevant (neural mismatch theory) [39]. In general, the better the passenger's view ahead, the lower the risk of motion sickness [34]. For these reasons, motion sickness is more likely to occur when the passenger is sitting in the back seat compared to sitting in the front seat.

#### **3. Methods for Investigating Motion Sickness in Autonomous Vehicles**

#### *3.1. The Study Setting*

In general, two study methods are applicable for motion sickness studies concerning autonomous driving: field experiments with real vehicles and studies in a driving simulator.

In field experiments, the participant is a passenger in a real vehicle in a realistic road environment or on a test track. Naturally, the experimenter has no full control of the dynamic events happening and the experiences the participants make in a field study, resulting in reductions of internal validity. However, there are different ways to control for this. First of all, the driving style of the used vehicle should be standardized. In the future, this can be realized by using automated functions that perform driving maneuvers in the same way with high reliability. Until these automated vehicles are commonly available for this kind of study, human drivers need to drive the test vehicles. High levels of realism for future automated vehicles can be achieved by Wizard-Of-Oz settings, in which the automation is simulated by a human driver, e.g., [42–44]. It is necessary that human drivers are instructed or trained towards a specified and thus reproducible driving style [45]. In the presented setting (cf. Chapter 1), we used assistance systems like adaptive cruise control (ACC) and lane keeping for standardization. The trained drivers learned how to perform lane change maneuvers with the necessary sequence of actions (for example, setting the indicator before moving the steering wheel, changing lanes in six to seven seconds). Along with that, there should be a low number of experimental drivers in order to avoid inter-individual differences in driving style between experimenters. Finally, after completing data collection, it is recommended to analyze the dynamic driving data to identify any anomalies in the driving behavior actually realized. If possible, the data can be systematically compared to the vehicle dynamics measured (1) within the same study to check for internal validity and (2) in other settings or situations in order to check for external validity.

However, even if the vehicle dynamics are kept as standardized as possible, external factors like traffic or weather conditions cannot be kept constant or manipulated deliberately. Yet these aspects can affect motion sickness: A high traffic density can lead to an increased number of braking and overtaking maneuvers due to slower vehicles. This driving behavior can lead to stronger symptoms of motion sickness. In contrast, a low traffic density enables homogenous driving with fewer accelerations and decelerations, which reduces the probability of motion sickness.

In contrast to field studies, driving simulators enable conducting studies in a highly controlled environment. They have been used since the 1960s to investigate driving performance and behavior and are classified into three categories [46]:


As most researchers attribute motion sickness in vehicles to contradictory impressions between the vestibular system (which perceives motion) and the visual system (which perceives no motion, e.g., while reading a book), the use of a high-level simulator with a motion system is recommended. In mid-level and low-level simulators, in contrast, only visually induced motion sickness can be investigated. In principle, research questions concerning countermeasures or physiological correlates are conceivable in these simulators. However, it remains unclear to what extent the results of such studies in simulators without a motion system would be applicable to automated driving.

The most common motion platform of high-level simulators is a hexapod which provides motion in six degrees of freedom (x, y, z, roll, pitch, yaw). Compared to travelling in a real vehicle, longitudinal and lateral accelerations are different. The impression of realistic accelerations is generated by techniques such as tilting the presented scenery. More elaborate simulators mount the hexapod on an x-y table on which the simulation cabin is moved to produce more realistic accelerations. According to Carsten and Jamson [47], however, even a large motion system is not capable of providing realistic accelerations in special driving situations like negotiating a long curve.

Probably the most important benefit of driving simulation is the ability to create repeatable scenarios which are tailored to a certain research question. Depending on the research question, motion-sickness provoking scenarios with many strong lateral and longitudinal accelerations are possible as well as more homogenous driving scenarios with few accelerations only (e.g., highway scenarios). Additionally, the researcher is free in the selection of the driving behavior of the autonomous vehicle: each imaginable driving style is feasible even if this driving behavior is not possible in a real autonomous vehicle yet. Another benefit of driving simulation is the availability of data: the simulator provides all data that would be provided by a real test vehicle (e.g., velocity, acceleration) as well as data of the traffic environment (e.g., surrounding traffic, road geometry). Besides, the participant's behavior (e.g., head movement, glance behavior) and physiological data can be monitored and recorded in a simple way: the laboratory conditions make video recordings easier due to constant light conditions and physiological data recording more precise due to less disturbing artifacts of the environment (e.g., temperature, humidity).

On the other hand, there are disadvantages of driving simulation. Some participants of simulator studies suffer from simulator sickness, which is a subtype of motion sickness in simulated environments. The phenomenon occurs in all types of simulators; it also appears in fixed-base simulators without a motion system due to visual stimuli only. Similar to motion sickness, it is caused by a mismatch between the visual perception and the vestibular sensation of acceleration and deceleration [14,48]. For a motion sickness study, this means that the results for motion sickness can be confounded with simulator sickness. For studies regarding prevalence or development of motion sickness, it is recommended to exclude participants who have shown symptoms of simulator sickness in previous studies in order to diminish this artifact. However, as simulator sickness and motion sickness are related and show similar symptoms due to similar reasons, it is possible that some countermeasures are effective against both. Therefore, it has to be discussed whether participants with simulator sickness problems should be admitted to a study concerning motion sickness countermeasures. This issue has to be decided for each countermeasure or research question separately.

An important issue of driving simulation is validity. A common distinction regarding simulator validity is between absolute and relative validity [49]. Relative validity exists when effects in the simulator and under the same conditions on the road are in the same order and direction. In contrast, absolute validity is present when the numerical values are about equal in both systems. Many validation studies have been carried out in various simulators. They compared various parameters of the driver's behavior (e.g., velocity, lateral displacement, braking behavior, gaze direction) between driving in a simulator and driving in a real vehicle. In most cases, the studies showed that relative validity exists while absolute validity was only rarely verified [50]. However, these results do not provide evidence that validity is given for motion sickness studies. In a motion sickness study, the behavior of a driver is not relevant; rather, the occupants' visual and vestibular perceptions are important.
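The distinction between the two kinds of validity can be expressed as two simple checks. The sketch below is one possible reading, not a definition taken from [49]: it interprets "same order and direction" as identical sign and identical rank order of the effects, and the tolerance used for "about equal" is a free parameter that would have to be justified for each measure. All names and numbers are hypothetical.

```python
def rank_order(values):
    """Indices of the values sorted from smallest to largest."""
    return sorted(range(len(values)), key=values.__getitem__)


def relative_validity(sim_effects, road_effects):
    """True if the effects share sign and rank order in both settings."""
    same_direction = all(
        (s > 0) == (r > 0) for s, r in zip(sim_effects, road_effects)
    )
    return same_direction and rank_order(sim_effects) == rank_order(road_effects)


def absolute_validity(sim_values, road_values, tolerance):
    """True if every simulator value lies within `tolerance` of its road value."""
    return all(abs(s - r) <= tolerance for s, r in zip(sim_values, road_values))


# Hypothetical effects of three experimental conditions in both settings.
sim = [1.2, 0.4, -0.3]
road = [0.9, 0.2, -0.1]
print(relative_validity(sim, road), absolute_validity(sim, road, tolerance=0.1))
```

In this made-up example the effects point in the same direction and keep their ranking, so relative validity holds, whereas the numerical values differ by more than the chosen tolerance, so absolute validity does not.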

To the authors' knowledge, there have not yet been studies comparing an occupant's motion sickness in a driving simulator to his/her motion sickness in a real vehicle. Therefore, we conducted the study described above.

The results showed that the progress of motion sickness was comparable in both conditions. After a general rise at the beginning of the run (approx. first 12 min), the sickness ratings increased more slowly in the second and last third. Self-reported motion sickness was slightly higher in the simulator than in the real vehicle (Figure 1). However, the maximum sickness values during the runs did not differ significantly (Wilcoxon signed-rank test: *Z* = 1.40, *p* = 0.162). The sessions of *n* = 3 drivers had to be aborted due to high sickness ratings in the simulator. In the field study, the run of *n* = 1 driver was terminated before the end of the test course. According to the symptom questionnaire, most symptoms occurred with similar frequency and intensity in both runs (Figure 2 left). However, three symptoms differed significantly in intensity: in the driving simulator, participants had higher general discomfort, more difficulties concerning focusing, and increased appetite (Figure 2 right). In a final interview after both runs, the participants stated that the motion sickness symptoms were more pronounced in the driving simulator than in the real vehicle (*t*(23) = 5.65, *p* < 0.001).

These results indicate that relative validity is given for the high-level simulator of the WIVW GmbH concerning motion sickness as the progression during the runs was comparable and the occurrence of frequent symptoms was similar. In contrast, absolute validity cannot be verified, as some of the self-reported symptoms were more distinct in the simulator.

The recommendation for the most appropriate study setting depends on the research question: A field experiment offers the highest validity and should be used for studies which investigate the prevalence and the development of motion sickness. In this case, the study should be conducted on public roads. The realistic test track could represent a highway, rural road or inner-city track. Previous studies used driving on highways and inner-city roads to identify whether and how strongly motion sickness occurs. In these studies, the participants performed different tasks in the vehicle [51].

**Figure 1.** Misery scale (MISC) rating over time during the runs in a real vehicle and in the high-level driving simulator. Sessions were terminated when a value of 7 = moderate nausea was given. These participants were assigned continuing values of 7 for the purpose of this illustration. Boxplots are shown.

**Figure 2.** Frequencies of symptom judgments in field experiment and driving simulator. The color indicates the symptoms' intensity (**left**). Results of a Wilcoxon pair signed rank test, significant results (*p* < 0.05) are indicated by \* (**right**).

In contrast, if the research question covers countermeasures that avoid or reduce the symptoms of motion sickness, it is crucial to choose a test setting that causes motion sickness in the participants quickly and with a high probability. In the vehicle context, this setting was mainly realized on test tracks, on which highly provoking maneuvers were driven by the experimenters (e.g., driving in the shape of an eight, or constant stop and go). Other researchers seated the participant facing rearwards in a vehicle driving on urban roads [51,52]. Within this setting, comparing a baseline trial with a repetition of the same condition with potential countermeasures allows their effectiveness in avoiding symptoms to be investigated. In particular, considering the effort put into these kinds of participant studies, the study design needs to create provoking situations efficiently and reliably. Besides, a simulator study using a motion sickness provoking scenario can also be conducted when investigating countermeasures. A requirement for this option is the validity of the driving simulator. The presented study shows that a high-level driving simulator without an x-y table can also offer relative validity; however, as driving simulators are very different, this has to be tested for each simulator individually.

#### *3.2. The Participant's Task*

In general, automated driving will enable the driver to engage in various non-driving related activities. In motion sickness research, one relevant research question is how strongly different non-driving related tasks (NDRTs) cause motion sickness. In such investigations, subjects could either be free to engage in realistic everyday NDRTs of their choice or be presented with a specific NDRT. While many standardized tasks exist in the context of manual driving, such standardization is widely missing in the context of automated driving. Therefore, it would be desirable to also evaluate secondary tasks that cover certain groups of conceivable NDRTs in the future. For our setting, we chose a naturalistic NDRT. Based on previous research [5], it can be expected that the use case of watching a video in an automated vehicle has some external validity.

Concerning other research questions such as the evaluation of countermeasures, it may also be relevant to induce motion sickness in a targeted manner or to investigate an extreme scenario. In this case, NDRTs that are characterized by highly limited peripheral and external vision of motion are required as hints about the vehicle's future motions can counteract motion sickness [53–55]. Therefore, a mainly visual NDRT should be presented in a way that assures gazing away from the road scene. To ensure standardization of the amount of peripheral vision across participants, visual material should be presented at a fixed location, e.g., by means of displays instead of providing handheld devices such as tablets. Naturally, fixed display positions also lead to more standardized participant movements. Since peripheral vision can be manipulated by both display position and size [7], to prevent the participant from using peripheral vision, a visual NDRT could be presented at a downward angle or on a large display. Further, to promote continuous task engagement, it is recommended to choose an NDRT that is difficult to interrupt or provides instructions and incentives for subjects to focus on the task and refrain from road glances (e.g., concentrating on visual tasks like reading or watching a movie during the drive increases the risk of motion sickness). Artificial, standardized NDRTs can therefore be suitable for this. Please note that engaging in a visual NDRT may cause visual problems such as strained eyes or blurred vision, which cannot be differentiated from symptoms of motion sickness. To better control for this, visual task characteristics and the duration of the task engagement may be considered. Similarly, fatigue may occur due to the experimental session's duration or as a motion sickness symptom. Further research should examine both the relationship between motion sickness and fatigue as well as methods to control for confounding effects.

For the selection of an adequate task for empirical studies on motion sickness, classifications of NDRTs provide relevant dimensions such as the primary modality, the locality, the possibility of road glances, the need for sustained attention, and incentives to continue the task [56]. In addition, the presented material should be controlled for emotionality of content when motion sickness is measured using physiological correlates. Therefore, in our study, subjects watched a movie on a display positioned below the central information display. We further instructed subjects to refrain from road glances. The videos contained documentaries, which were interesting but not emotionally arousing. Other examples of such tasks are reading a text or answering a quiz that is presented visually. Finally, participant posture should be considered in motion sickness studies, given that the risk of motion sickness is higher when the passenger is sitting on a rearward-facing seat than on a forward-facing seat [52]. Moreover, for postures facing in the driving direction, a regular driving posture may increase the risk of motion sickness compared to a reclined posture [57].

#### *3.3. Sample and Recruitment*

In order to investigate motion sickness in autonomous vehicles a participant study is recommended. The requirements for the recruitment depend on the study's research question.

For a large variety of research questions, it is necessary that a significant part of the sample suffers from motion sickness during the study. For example, the effect of a countermeasure for motion sickness during travelling can only be demonstrated when a control condition in a between- or within-subjects design exists in which motion sickness occurs. In contrast, people who are not susceptible to motion sickness do not need countermeasures and are not relevant for the study question. It is only possible to identify physiological correlates of motion sickness when the participants have phases with and without motion sickness. Therefore, the selection of participants is crucial for the study's success as not all people are susceptible to motion sickness. This consideration leads to the next question regarding participants' recruitment: how to identify participants who are susceptible to motion sickness?

A common instrument to predict motion sickness susceptibility is the MSSQ (Motion Sickness Susceptibility Questionnaire) [14,58]. This tool queries how often several means of transport (e.g., cars, buses, airplanes) and amusement rides (e.g., carousels, rollercoasters) were used in the past and how often sickness occurred. The answers result in a motion sickness susceptibility score. However, the results of our own study indicate that the MSSQ total score is not appropriate to identify subjects who are susceptible to motion sickness while travelling in a car. There was no significant correlation (Spearman *r*(24) = 0.266; *p* = 0.210) between the MSSQ total score and the experienced motion sickness (measured via a misery scale according to [9]) in a real driving study on the Autobahn in which the *N* = 24 participants were passengers and had to watch a video during the drive (see Figure 3 left). The MSSQ probably covers too many means of transport: respondents with no motion sickness problems in cars can also achieve high MSSQ scores when having symptoms, for instance, in trains and airplanes. Conversely, respondents who compensate for their motion sickness in real driving situations might reach lower MSSQ scores than their actual susceptibility would suggest: people who know that they are susceptible to motion sickness might not engage in NDRTs in provoking situations and therefore may not have experienced any severe motion sickness in the past years.

However, the more specific MSSQ item "Over the last 10 years, how often you felt sick or nauseated in cars?" also showed no significant correlation (Spearman *r*(24) = 0.212; *p* = 0.319) with the experienced motion sickness in the study (see Figure 3 right). The question is very imprecise, as it does not distinguish between driving in an urban or rural area or on a highway. In addition, it lumps together travelling in a car while reading or texting on the back seat and being a co-driver who is attentive to the traffic situation. As the prevalence depends on the individual threshold to motion stimulation and varies under different situations [59], a curvy rural road can lead to symptoms for some people while other people suffer from motion sickness in urban scenarios only. Therefore, it is recommended to use a highly specific question with the exact test scenario as a screening question for the participants' recruitment (e.g., "Do you get symptoms of motion sickness as a co-driver while reading on the Autobahn?").
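The rank correlation used for this screening check can be illustrated with a minimal sketch. It assumes paired lists of questionnaire scores and maximum misery scale ratings; the function names (`average_ranks`, `spearman_rho`) and the example values are hypothetical and are not data from the study. Tied values receive the mean of their ranks, and the coefficient is computed as the Pearson correlation of the two rank vectors. The sketch does not compute a *p*-value and fails with a division by zero if one variable is constant.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical screening scores and maximum MISC ratings (not study data).
mssq_scores = [12.0, 30.5, 8.0, 22.0, 15.5, 40.0]
max_misc = [3, 2, 1, 6, 2, 4]
rho = spearman_rho(mssq_scores, max_misc)
```

The same function could be applied to a scenario-specific screening item instead of the total score, to check whether it predicts the observed ratings better.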

For other research questions, a more general sample is required. A representative sample is necessary to investigate the prevalence of motion sickness. The sample should be representative concerning all aspects which can affect the prevalence of motion sickness, e.g., age [60,61] and gender [27,29].

**Figure 3.** Spearman correlation between the maximum value on the misery scale during the session and (**a**) the MSSQ total score and (**b**) the single item concerning sickness in cars. Size and color of the dot indicate the number of respondents.

#### *3.4. Measurement of Motion Sickness*

#### 3.4.1. Subjectively Perceived Motion Sickness

Subjective participant ratings via questionnaires are the most common method to measure motion sickness and to validate other measurement tools like physiological or behavioral measures. Within the subjective measurement approaches, there are two basic principles: either the participants are asked to evaluate their overall motion sickness in a single rating or the participants are questioned in detail about multiple or even all potential motion sickness symptoms and their intensity. Short questionnaires allow for a continuous online assessment of motion sickness during the test drive, which enables describing the time course of motion sickness development. In contrast, detailed questionnaires are suitable for pre-post evaluations to determine if and to what extent a certain condition has led to motion sickness.

One example of a short overall rating is the fast motion sickness scale (FMS) [62]. The FMS is a verbal rating scale ranging from 0 (no motion sickness at all) to 20 (severe sickness). Participants are asked to evaluate the current motion sickness and to focus on nausea, general discomfort and stomach problems. However, the scale of the FMS is unanchored. Hence, it is not possible to verbally describe what the distinct values on the scale stand for. Further, it is uncertain whether the values on the scale actually represent the same degree of subjectively perceived motion sickness for each participant. It thus remains unclear whether, e.g., a value of 15 is associated with nausea and whether this holds for every participant of the sample. Therefore, unanchored scales do not deliver information about the characteristics of motion sickness. Due to its unspecific character, the rating may further be biased by other factors that reduce comfort, like boredom or fatigue.

Another tool to quickly measure subjective motion sickness is the misery scale (MISC) [9]. It is an 11-point scale that tries to capture the quantitative and qualitative degree of motion sickness within one combined rating. For this purpose, the scale's numeric values are assigned to more or less specific motion sickness symptoms and their intensity. The scale comprises the following gradation: 0 (no problems), 1 (uneasiness without specific symptoms), 2–5 (slightly to severely perceived specific symptoms like dizziness, headache, stomach awareness, etc.), 6–9 (nausea from slight to severe/retching), and 10 (vomiting). Thus, in contrast to scales like the FMS, the MISC values can be interpreted descriptively and it is assumed that every single value is interpreted similarly by all participants. Like the FMS, the MISC is able to assess motion sickness quickly, in short intervals and during motion sickness induction. The MISC suggests that nausea is perceived as more inconvenient than all other motion sickness symptoms. This, however, neglects that other symptoms like severe headache may also be perceived as very unpleasant. Without experiencing nausea, the MISC does not allow the participant to reach high motion sickness scores, even if the driving comfort has largely decreased. Therefore, strictly speaking, MISC data cannot be considered interval scaled. This impedes the analysis and interpretation of the results.
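The gradation of the MISC described above can be written down as a small lookup, which also makes the ordinal character of the scale visible. This is a minimal sketch; the function name `misc_category` and the shortened band labels are illustrative assumptions, not part of the published scale.

```python
def misc_category(score):
    """Map a MISC rating (integer 0..10) to the symptom band described in the text."""
    if not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("MISC ratings are integers from 0 to 10")
    if score == 0:
        return "no problems"
    if score == 1:
        return "uneasiness without specific symptoms"
    if score <= 5:
        # 2-5: slight to severe specific symptoms other than nausea
        return "specific symptoms (e.g., dizziness, headache, stomach awareness)"
    if score <= 9:
        # 6-9: nausea from slight to severe/retching
        return "nausea"
    return "vomiting"
```

A participant with a severe headache but no nausea cannot be rated above 5 on this scale, which is why differences between scores should not be treated as equal intervals.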

For these reasons, it may be useful to let the participants evaluate different specific symptoms on separate Likert scales. In addition to nausea, it would be plausible to include headache, general discomfort, dizziness, and—depending on the study design—also fatigue (especially during long or uneventful drives). In our study, these symptoms have been observed frequently after a 40-min Autobahn drive (71% of participants stated general discomfort, 96% fatigue) or are assumed to be perceived as particularly inconvenient (nausea, headache, dizziness). However, it should be ensured that the interrogation remains short.

In contrast to these quick and efficient methods, the motion sickness questionnaire (MSQ) [10] represents an approach to capture multiple or even all potential motion sickness symptoms and their intensity. There are different versions of the MSQ with different numbers of items [11]. The questionnaire consists of a checklist with items that are evaluated either concerning their presence (symptom present vs. not present) or concerning their intensity (none, slight, moderate, severe). Thus, the MSQ provides an extensive impression of the participant's current motion sickness. However, completing the questionnaire is relatively time-consuming and is thus not suitable for frequent motion sickness interrogations. It is, therefore, recommended to use it at the end of the driving study or during breaks (directly after provocation offset). Hence, the scale is rather suitable for pre-post evaluations and may be combined with a short online-questionnaire like the FMS, MISC or symptom-specific Likert scales. A comparative overview of the four discussed tools is given in Table 1.

**Table 1.** Comparison of the four discussed motion sickness assessment tools (++ very good; + good; **o** okay; − weak).


It is important to add that subjective ratings may be prone to several biases, such as demand characteristics or social desirability, as discussed in Chapter 4. Further, the participants' mental model of their own susceptibility may affect the ratings (i.e., self-fulfilling prophecy). For example, participants who believe they are highly susceptible may give higher motion sickness ratings, not only because they feel motion sick, but also because they expect to and thereby confirm their own beliefs. In addition, directly asking participants about their motion sickness symptoms may lead to a very conscious introspection of perceived motion sickness symptoms. Thus, participants may "discover" symptoms which would not have been perceived consciously otherwise. Further research is needed to determine if and to what extent these potential biases affect subjective motion sickness ratings. Nonetheless, we consider it important to directly ask participants about their sickness symptoms because motion sickness and discomfort depend highly on subjective evaluation.

#### 3.4.2. Physiological Correlates

Because subjective ratings may be prone to biases, research has tried to measure motion sickness objectively. Over the last decades, there have been many attempts to describe motion sickness with physiological correlates. Among others, heart rate, blood pressure, respiration rate, gastrointestinal reactions, and skin conductance parameters have been investigated, e.g., [63–67]. However, until now there has been no reliable success in correlating physiological measures with subjectively perceived motion sickness. Reasons are the high variability of motion sickness provoking stimuli as well as the high individual specificity of reactions. For example, there are rather individual correlations between subjectively reported motion sickness and heart rate or blood pressure [68].

Three measures in which a correlation with motion sickness has been shown across multiple laboratory studies are body temperature [69], skin conductance [69,70], and electrogastrogram [71,72]. Hereinafter it shall be discussed to what extent these three measures are applicable to capture motion sickness in a driving study under naturalistic conditions.

#### Temperature

In previous studies, it was shown that motion sickness affects human thermoregulation [69]. Nobel and colleagues demonstrated that in cold water, body temperature decreases faster in participants with induced motion sickness than in control participants [73]. Similarly, in a thermo-neutral environment, body temperature was lower in motion sick participants than in control participants [74]. In the latter study, for example, the mean difference was about 0.4 °C between control participants and those who reported being "very nauseous/almost vomiting". In the cited studies, body temperature was measured by a rectal thermistor. Not surprisingly, this procedure is perceived as an unreasonable imposition by many participants and may be questionable for ethical reasons. One of several alternatives that make temperature measurement more convenient for the participants is to place the thermistor under the armpit. The participants should not move their arm during the measurement. It should be considered that mean axillary temperature is a few tenths of a degree Celsius lower than rectal body temperature [75]. With this procedure, body and skin temperature cannot be clearly distinguished, although they should not be equated. In some previous studies, differences in body temperature were not necessarily accompanied by significant differences in skin temperature [67,73]. Further, skin temperature can be biased, e.g., by perspiration, environmental temperature, or participants' clothing (warm/light). However, most biases can be controlled easily by the experimenter. Temperature and ventilation in the test vehicle can be held constant by air conditioning, and participants can be instructed to wear comparable types of warm/light clothes. Further, the measured signal can be checked easily by the experimenter since the range of values is relatively constant across participants (approx. between 36 and 38 degrees Celsius), which makes it easy to detect technical signal disturbances. Moreover, the signal is relatively stable and hardly susceptible to artifacts (e.g., movements, speaking; see Figure 4). As body temperature seems to react relatively slowly to influences, it can be expected to react slowly to motion sickness as well. Consequently, to detect potential effects, heavy provocation and/or a long measurement period might be necessary.

**Figure 4.** Exemplary raw temperature data of a participant during a 44-min drive as passenger on a German Autobahn. In contrast to other physiological data, temperature is hardly affected by artifacts (see Figures 5 and 6).

**Figure 5.** Count of significant Spearman correlations between temperature and motion sickness for each test drive and each participant.

**Figure 6.** Exemplary raw electrodermal activity (EDA) data of a participant during a 44-min drive as a passenger on a German Autobahn. The numerous peaks in the chart indicate external events like braking, participant's movements, and motion sickness rating procedures. Since these events are not necessarily related to motion sickness in a naturalistic test setting, these peaks should be considered as artifacts.

In our study, the temperature's median was calculated for each interval of two minutes and served as the dependent measure for the subsequent analyses. The temperature was recorded under the armpit and correlated with the MISC ratings, which were likewise recorded every two minutes. Because a high inter-individual variability was expected [68], the number of significant positive or negative correlations between temperature and the two-minute subjective motion sickness ratings was counted for each participant and each run (two-tailed testing). In 57.6% of all cases, a significant positive (i.e., temperature increases with motion sickness rating) or negative correlation (i.e., temperature decreases with an increasing motion sickness rating) between temperature and subjective motion sickness was observed (Figure 5).

In order to estimate if the found correlations are stable within each participant, the possibility to replicate the found correlations was checked. However, only *n* = 3 participants showed significant negative correlations between temperature and motion sickness in more than two test drives (i.e., temperature decreased with increasing motion sickness ratings). The results indicate not only a high inter-individual, but also a high intra-individual variability of the found correlations. The variability may also derive from confounding factors like driving time or time of day.
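The two-minute aggregation used in this analysis can be sketched as follows. The sketch assumes an evenly sampled signal; the function name `interval_medians`, the sampling rate, and the example values are illustrative assumptions and not the study's actual processing pipeline.

```python
from statistics import median


def interval_medians(samples, sample_rate_hz, interval_s=120):
    """Median of each complete interval; a trailing partial interval is dropped."""
    per_interval = int(sample_rate_hz * interval_s)
    if per_interval <= 0:
        raise ValueError("interval must contain at least one sample")
    n_intervals = len(samples) // per_interval
    return [
        median(samples[i * per_interval:(i + 1) * per_interval])
        for i in range(n_intervals)
    ]


# Hypothetical 1 Hz axillary temperature trace covering five minutes.
trace = [36.5] * 120 + [36.7] * 120 + [36.9] * 60
medians = interval_medians(trace, sample_rate_hz=1)
```

Each resulting median can then be paired with the MISC rating given in the same interval before computing a per-participant rank correlation.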

#### Electrodermal Activity

Another measure which has frequently been investigated with regard to motion sickness is skin conductance. Derived from the observations of "cold sweating" [76], a positive correlation between motion sickness and electrodermal activity (EDA) seems quite plausible and has been shown in several studies [69,70]. Like temperature measurement, EDA recording is technically simple. The procedure is hardly unpleasant for the participants because the electrodes are fixed on the hands (frequently index and middle finger). The electrodes can be attached by the experimenter; hence, it is ensured that the electrodes are attached correctly and identically across all participants. The experimenter can monitor the measurement because the raw signal shows whether the recording is working properly.

However, EDA is very susceptible to external influences and artifacts. This is a major obstacle in recording EDA under natural driving conditions. Unexpected stimuli strongly affect the EDA. These include, for example, motion perceptions resulting from longitudinal and lateral accelerations, which emerge naturally during driving. Additionally, EDA is affected by speaking and movements of the participants (see Figure 6). Therefore, participants should not move or speak during the drive; this should particularly be considered when asking participants about their current motion sickness. Instead of orally answering questions, participants can give their responses via, e.g., a numeric keypad. Alternatively, intervals of motion sickness provocation and intervals of interrogation can be separated, and the latter be excluded from the statistical analysis. However, recording subjective and physiological data at separate times impairs correlation analyses. Besides artifacts, effects deriving from the driving time can bias EDA.

EDA measurement and analysis distinguish two components. The first is the (tonic) skin conductance level (SCL), which describes the slowly changing conductance of the skin and can be analyzed by computing and comparing means or medians per time interval. The tonic level is overlaid by the second component, the (phasic) skin conductance reactions (SCR), which are responses to discrete stimuli (e.g., sound, motion perception) and can be seen as sudden peaks in the raw signal. In a naturalistic setting, these phasic reactions frequently represent artifacts which are not directly associated with motion sickness but rather with surprise or arousal [77] and are therefore not a suitable measure to detect motion sickness in driving. Therefore, the more robust SCL should be analyzed if EDA is recorded.

To assess whether EDA is associated with motion sickness, our study also investigated the effects of motion sickness on skin conductance. EDA was recorded on the participants' index and middle fingers (left hand in simulator, right hand in real vehicle). The EDA's median was calculated for each interval of two minutes and served as the dependent measure for the correlations with the MISC ratings. A rise of the EDA was observed at the beginning of the test drive. Therefore, the first eight minutes of the 40 to 45-min test drive were excluded. Additionally, intervals with tight curves were excluded from the analysis to minimize artifacts deriving from the traffic scenario. As in the temperature analysis, for each participant and each condition it was counted whether there was a significant positive (i.e., EDA increases with motion sickness rating) or negative correlation (i.e., EDA decreases with increasing motion sickness rating) with subjectively measured motion sickness. In 38.6% of all cases, a significant positive or negative correlation between EDA and motion sickness was observed. Again, the possibility to replicate the found correlations was checked in order to estimate whether they are stable within each participant. However, as shown in Figure 7, no participant showed replicable positive or negative correlations between EDA and motion sickness in more than two test drives. Again, the results indicate not only a high inter-individual but also a high intra-individual variability of the found correlations. As described above, we observed that SCL rose at the beginning of the test drive (probably due to excitement) and then fell over time, independently of perceived motion sickness (probably due to habituation to the study setting). Thus, contrary to the temperature findings, it is highly probable that the found variability derives from confounding factors like driving time or the appearance of external events (e.g., sudden braking), which emerge naturally during a realistic test drive. These biases may conceal potential effects of motion sickness on SCL. Altogether, there are several confounding effects which affect EDA in a natural driving setting. These should be considered and carefully controlled within the study.
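The exclusion rules described for the EDA analysis (dropping the initial rise and intervals with tight curves) can be sketched as a simple filter over the two-minute intervals. The function name `keep_intervals`, its parameters, and the example values are hypothetical; which intervals count as containing tight curves would have to come from the route annotation.

```python
def keep_intervals(medians, ratings, interval_min=2, warmup_min=8, excluded=()):
    """Return (median, rating) pairs after dropping warm-up and flagged intervals.

    `excluded` holds zero-based indices of intervals to drop, for example
    those that contain tight curves.
    """
    if len(medians) != len(ratings):
        raise ValueError("one rating per interval is required")
    first_kept = -(-warmup_min // interval_min)  # ceiling division
    skip = set(excluded)
    return [
        (m, r)
        for i, (m, r) in enumerate(zip(medians, ratings))
        if i >= first_kept and i not in skip
    ]


# Hypothetical SCL medians (microsiemens) and MISC ratings for ten intervals.
scl = [4.1, 4.6, 4.9, 5.0, 4.8, 4.7, 5.5, 4.6, 4.5, 4.4]
misc = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3]
pairs = keep_intervals(scl, misc, excluded=(6,))  # interval 6: assumed tight curve
```

With two-minute intervals and an eight-minute warm-up, the first four intervals are removed before any correlation is computed.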

**Figure 7.** Count of significant Spearman correlations between EDA and motion sickness for each test drive and each participant.

#### Electrogastrography

Electrogastrography (EGG) is another method which has been investigated to measure motion sickness. The EGG measures pacemaker potentials in the stomach which coordinate the gastric contractions [78]. Thus, the EGG does not capture the actual motility of the stomach but rather the efforts to actuate it. Corresponding to typical motion sickness symptoms like nausea or awareness of the stomach, Stern and colleagues found changes in this pacemaker potential, namely a decrease in amplitude and an increase in frequency from 3 to 5–7 cycles per minute in motion sick participants [71,72]. Even if this correlation seems to be rather individual [79], it has nonetheless been shown across different studies, for example [71,72,79–81]. Therefore, the EGG seems to be a promising signal for a physiological measurement of motion sickness. However, the EGG is a very weak signal which is easily overlaid by movements (e.g., of the abdominal muscles; see Figure 8) [78,82]. Therefore, it is very important that participants do not move or speak during EGG recording. The studies cited above [71,72] used an optokinetic drum to induce motion sickness by vection. With this method it is possible to induce motion sickness without participants moving or being moved. In the context of driving, however, the application of EGG is naturally more challenging. In a naturalistic drive, participants are moved by the vehicle. The resulting acceleration forces may elicit unconscious movements of the participants, e.g., muscle tension to compensate for centrifugal forces in a curve. Similar to SCL, the requirement that participants should not speak or move makes it difficult to ask them about their current motion sickness. However, motion artifacts have a different impact on EGG analysis than on SCL analysis. SCL is analyzed by computing and comparing means or medians. Therefore, motion artifacts reduce the interpretability of the results. In contrast, EGG is analyzed by spectral analysis, which can be rendered entirely unusable by frequent or unnoticed motion artifacts [78,82]. In addition, the EGG raw signal is overlaid by other signals (e.g., from respiration, activity of the intestine, etc.) [78] which are filtered out later on. Hence, the experimenter cannot monitor any interpretable raw signal during the test drive.
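To illustrate what a spectral analysis of the EGG looks for, the following minimal sketch estimates the dominant frequency of a signal in cycles per minute (cpm) by evaluating a discrete Fourier sum on a grid of candidate frequencies. It is a didactic simplification under stated assumptions: the function name `dominant_cpm`, the frequency grid, and the synthetic test signals are hypothetical, and real EGG processing additionally involves filtering and artifact rejection that are not shown here.

```python
import math


def dominant_cpm(signal, sample_rate_hz, cpm_min=1.0, cpm_max=10.0, step=0.1):
    """Return the frequency (cycles per minute) with the largest spectral power."""
    mean_value = sum(signal) / len(signal)
    centered = [v - mean_value for v in signal]  # remove the constant offset
    best_cpm, best_power = None, -1.0
    n_steps = int(round((cpm_max - cpm_min) / step))
    for k in range(n_steps + 1):
        cpm = cpm_min + k * step
        omega = 2 * math.pi * (cpm / 60.0) / sample_rate_hz  # radians per sample
        re = sum(v * math.cos(omega * n) for n, v in enumerate(centered))
        im = sum(v * math.sin(omega * n) for n, v in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_cpm, best_power = cpm, power
    return best_cpm


# Synthetic 10-minute signals sampled at 1 Hz (not recorded data).
normal = [math.sin(2 * math.pi * (3 / 60) * n) for n in range(600)]
faster = [math.sin(2 * math.pi * (6 / 60) * n) for n in range(600)]
```

On clean synthetic signals the peak is unambiguous; broadband motion artifacts spread power across the whole grid, which is one way to see why the spectral estimate becomes unreliable under naturalistic driving conditions.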

**Figure 8.** Exemplary raw electrogastrography (EGG) data of a participant during a 44-min drive as passenger on a German Autobahn. As with EDA (see Figure 6), the peaks in the chart indicate artifacts. In EGG data, these derive mainly from the participant's movements.

In our study, EGG was recorded on the participant's abdominal surface. The electrodes were positioned according to the recommendations of Yin and Chen [82] and were attached by the participants themselves. Despite the instruction not to move, we found a high number and frequency of motion artifacts in most participants (an example is given in Figure 8). Therefore, a meaningful analysis was not possible and we refrain from reporting results.

Besides these methodological issues, some ethical aspects should be considered when EGG is recorded. The restriction not to move or speak might deter participants from reporting when they feel very ill or when they wish to quit the study. In addition, some participants may find it uncomfortable to have electrodes placed on the abdominal surface by an experimenter. To avoid this, it is possible to let the participants attach the electrodes themselves. Then, however, the experimenter has no control over whether the electrodes are placed correctly. Preparing the skin for attaching the electrodes [82] can also be unpleasant for participants. Additionally, the amount and time of the last meal have to be controlled because they affect the stomach's activity and the development of motion sickness [83]. To control for this, participants can be asked to fast before EGG is recorded, or a standardized meal can be provided at a defined time before the start of the test drive.

Altogether, given current measurement techniques, the EGG is hardly suitable for motion sickness studies under naturalistic driving conditions.

#### *3.5. Data Analyses in General*

For ethical reasons (see Section 4), participants must be able to terminate participation at any stage of the study. Furthermore, the experimenter has to terminate the session in cases of conspicuous suffering of the participant. Therefore, a researcher has to expect dropouts while conducting a motion sickness study. In driving studies concerning other topics (e.g., acceptance of a new driver assistance system), these dropout participants are often replaced by other participants so that each condition contains a sufficient and equal number of data sets, which facilitates the statistical analysis. In a motion sickness study, however, the occurrence of a dropout is itself an important result, as it indicates that motion sickness became too severe.

Concerning post-study questionnaires (e.g., MSQ), dropouts are not a problem for data analysis, as all participants, regardless of cancelling or completing the session, can fill them out. However, all data collected during the runs are sensitive to dropouts during the session. On the one hand, this influences the statistical data analysis and might necessitate the usage of tests which can handle dropouts and missing data. On the other hand, researchers can use dropout rates as dependent variables, investigating which conditions caused how many people to abort the trials due to sickness. Furthermore, dropouts enable time-based parameters describing the progress of motion sickness: How long does it take until the dropouts occur? Does this time differ between the test conditions? Therefore, researchers should not see dropouts as a problem (as in other research areas), but rather as a source of additional information.
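As one possible way to treat dropouts as data, the sketch below computes a dropout rate and a Kaplan-Meier estimate of the share of participants still in the session over time. This is an assumption about how such an analysis could look, not a method prescribed in this paper; the function names, the example data, and the 20-min session length are hypothetical. Participants who complete the session are treated as censored at the session end.

```python
from itertools import groupby


def dropout_rate(dropped_out):
    """Share of participants who aborted the session."""
    return sum(dropped_out) / len(dropped_out)


def kaplan_meier(durations_min, dropped_out):
    """Return [(time, share still participating)] at each dropout time.

    Participants with dropped_out == False completed the session and are
    treated as censored at their recorded duration.
    """
    observations = sorted(zip(durations_min, dropped_out))
    at_risk = len(observations)
    survival = 1.0
    curve = []
    for time, group in groupby(observations, key=lambda obs: obs[0]):
        group = list(group)
        dropouts = sum(1 for _, dropped in group if dropped)
        if dropouts:
            survival *= 1.0 - dropouts / at_risk
            curve.append((time, survival))
        at_risk -= len(group)  # dropouts and censored participants leave the risk set
    return curve


# Hypothetical data for one condition: minutes in session, dropout flag.
durations = [10, 15, 20, 20, 20]
dropped = [True, True, False, False, False]

print(f"dropout rate: {dropout_rate(dropped):.2f}")
for time, share in kaplan_meier(durations, dropped):
    print(f"after {time} min: {share:.2f} still participating")
```

Curves estimated separately for each test condition could then be compared, for example with a log-rank test.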

In general, time-based parameters describing the progress of motion sickness are important for motion sickness studies: if a continuous online assessment of motion sickness is conducted (e.g., via FMS, MISC, or symptom-specific Likert scales), it is possible to use parameters which define the time until a participant reaches a specific symptom (e.g., "time to nausea" or "time to sweating"). These data are helpful for the description of motion sickness and the effect of countermeasures.
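A parameter such as "time to nausea" can be derived directly from the online ratings. The following sketch assumes that ratings are stored as (minute, rating) pairs and that a symptom counts as reached once the rating meets a chosen threshold; the function name, the 0–10 scale, and the threshold values are hypothetical and would depend on the scale actually used.

```python
def time_to_symptom(ratings, threshold):
    """Return the first time at which a rating reaches the threshold, else None."""
    for time_min, rating in sorted(ratings):
        if rating >= threshold:
            return time_min
    return None  # the symptom level was never reached during the session


# Hypothetical online ratings: (minute, rating on a 0-10 scale).
ratings = [(0, 0), (5, 1), (10, 3), (15, 6), (20, 6)]

print(time_to_symptom(ratings, threshold=6))
print(time_to_symptom(ratings, threshold=9))
```

Returning `None` rather than the session length keeps participants who never reached the symptom distinguishable from those who reached it at the very end.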

#### **4. Ethics in Motion Sickness Studies**

The American Psychological Association has released a code of conduct that is relevant to research in psychology and other sciences [84]. It includes five fundamental principles which define how to treat participants in scientific investigations. The first principle "beneficence and nonmaleficence" states that researchers should take care of their participants and their wellbeing.

This principle is violated by studies concerning motion sickness, as unpleasant symptoms like headache, nausea, or sweating are provoked in these studies. In this respect, motion sickness research has similarities to pain research: research on a specific topic requires undesirable physical effects and uncomfortable situations for the study participants. Concerning pain studies, the Committee on Ethical Issues of the International Association for the Study of Pain (IASP) has published ethical guidelines for pain research [85]. According to the authors, "health, safety and dignity of human subjects have the highest priority in pain research"; of course, this also applies to motion sickness research. Researchers of motion sickness can align their procedures with these guidelines, in particular concerning the following principles:

"Potential participants should be informed fully of the goals, procedures, and risks of the study before giving their consent". In a motion sickness study, participants must sign an informed consent form prior to the study. In particular, research on motion sickness has to be stated as the study's aim (i.e., no cover story) and the participant has to be informed that undesirable physical effects of the study (e.g., headache, sickness, sweating) are likely.

"Participants must be able to decline, or to terminate, participation at any stage without risk or penalty. Stimuli should never exceed a subject's tolerance limit and subjects should be able to escape or terminate a painful stimulus at will". In a motion sickness study, the participant is allowed to leave the study at any time. The experimenter has to stop the run immediately or as soon as possible. Experimenters must not exert pressure on participants to continue the test session.

"The minimal intensity of noxious stimulus necessary to achieve goals of the study should be established and not exceeded." In a motion sickness study, the researcher must define criteria for when to break off a session: is it really necessary that the participants develop severe motion sickness to the point of vomiting? For most research questions it should be sufficient that participants feel first or moderate symptoms of nausea (e.g., for the evaluation of an intervention's effect), as several studies have shown that motion sickness progresses linearly with further provocation [32,33]. Besides, even weaker symptoms are experienced as uncomfortable and are not desired during autonomous driving. The break-off criterion could be a predefined participant judgement on a scale measuring well-being, which is given regularly during the session. Additionally, continuous monitoring by the experimenter can help to evaluate the participants' well-being: in cases of conspicuous suffering (e.g., moaning, convulsing) the experimenter has to terminate the experiment.

After deciding to stop the experiment due to the participant's wish or a participant's rating over a predefined threshold or conspicuous suffering, the experimenter must stop the session immediately. After the participant has left the sickness provoking situation, the experimenter has to offer various options to the participant in order to relieve her/his motion sickness: e.g., breathe fresh air, visit a restroom, have a cold or warm drink; for emergency cases like a circulatory collapse the participants should have the option to lie down.
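The three stop conditions named above can be written down as an explicit rule before the study starts, so that the break-off decision does not depend on the experimenter's judgement in the moment. The sketch below is only one way to encode such a rule; the class, the function name, and the threshold value are hypothetical, and treating a rating equal to the threshold as a stop condition is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SessionStatus:
    participant_wants_to_stop: bool
    latest_rating: int           # most recent well-being rating
    conspicuous_suffering: bool  # experimenter's observation


def stop_reason(status, rating_threshold):
    """Return the reason for stopping the session, or None to continue."""
    if status.participant_wants_to_stop:
        return "participant request"
    # Assumption: reaching the threshold already counts as exceeding it.
    if status.latest_rating >= rating_threshold:
        return "rating threshold reached"
    if status.conspicuous_suffering:
        return "conspicuous suffering"
    return None


# Hypothetical threshold of 6 on the rating scale in use.
print(stop_reason(SessionStatus(False, 3, False), rating_threshold=6))
print(stop_reason(SessionStatus(False, 7, False), rating_threshold=6))
```

Logging the returned reason for every terminated session also provides the dropout data discussed in Section 3.5.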

At the end of the study the participants' well-being should be evaluated again. If the participants still suffer from motion sickness symptoms, they should be strongly encouraged not to drive a car for safety reasons. In this case, the researcher should provide a shuttle back home or organize a taxi transfer and cover its costs.

The experimenter must be trained in all of these aspects to ensure good treatment of the participant. A high degree of empathy and training in the detection of motion sickness signals is especially important in order to avoid artifacts of the study situation. Some participants might play down the symptom severity because (1) they form an interpretation of the experiment's purpose and adjust their judgments to fit that interpretation (demand characteristics) or (2) they see high severity judgments as an indicator of weakness (social desirability). The experimenter must break off the experiment in both cases to prevent further suffering of the participant.

#### **5. Conclusions**

Automated vehicles have the potential to provide significant benefits for the occupants as they can spend their time with various non-driving related activities during the journey. However, this scenario increases the risk of motion sickness and requires an investigation of the phenomenon of motion sickness in the context of automated driving. The present paper discusses methodological aspects for studies investigating the two main research questions: (1) what is the prevalence of motion sickness in a specific scenario (e.g., autonomous driving on a highway) and how do the symptoms develop? (2) Which countermeasures are effective in the prevention and reduction of motion sickness?

If researchers are interested in the prevalence and development of motion sickness in a specific scenario, we suggest conducting a field study in a setting which is as natural as possible. The test vehicle should be driving autonomously or operated by a trained experimenter (Wizard-of-Oz setting) on public roads in order to achieve external validity. The participants should engage in an NDRT which is likely to be used in a future autonomous vehicle (e.g., reading or texting). This task should be self-paced so that the participants can interrupt it when they want to and are able to glance up at the road. As the prevalence of motion sickness in this scenario is of interest, the researchers should select a sample that is representative with respect to all aspects which can affect the prevalence of motion sickness, e.g., age and gender.

In contrast, the setting of a study investigating countermeasures for motion sickness is more standardized. This is necessary as the comparison between runs with the countermeasure (treatment run) and runs without the countermeasure (baseline run) has to be conducted under controlled conditions in order to achieve a high degree of internal validity. The influence of extraneous variables on the measurement should be minimized or removed. Therefore, the study must be conducted in a standardized setting, either on a test track or in a driving simulator. The scenario should provoke motion sickness in the baseline run, as a positive effect in the treatment run can only be detected under these conditions. On a test track, standardized maneuvers like driving in a figure eight or constant stop-and-go are recommended. The maneuvers should be driven by a trained experimenter. In the driving simulator, a more naturalistic test course like a winding rural road is possible. The participants should engage in a standardized NDRT which controls glances at the road or prevents them entirely. The researchers should select participants who are susceptible to motion sickness in the investigated setting. For this purpose, specific screening questions are more useful than general tools like the MSSQ. Table 2 gives an overview of the recommendations for studies concerning the two main research questions.


**Table 2.** Overview of the recommendations for studies concerning the two main research questions.

Of course, the two research questions concerning prevalence/development and countermeasures are not distinct opposites which require an "either-or decision" in the study design. Mixed research questions are imaginable, e.g., when investigating which of two countermeasures is the most effective one in a naturalistic setting. These studies require a mix of methods from both directions.

Independent of the research question, subjective measurement tools like questionnaires and inquiries are necessary to determine motion sickness. Quick and efficient tools like the MISC scale or symptom-specific Likert scales are recommended to assess the intensity of the symptoms during driving. In contrast, comprehensive questionnaires like the MSQ are appropriate to capture a wide range of motion sickness symptoms and their intensity after a run. The usage of physiological measurements to detect motion sickness is difficult under non-laboratory conditions. Existing literature reports a high degree of inter-individual variance in physiological reactions; additionally, we found a high intra-individual variance during the study with four test sessions. Furthermore, most data are affected by external events like braking or a change of posture. It will be challenging to detect physiological correlates of motion sickness which can be assessed reliably and practicably during autonomous driving in realistic settings.

When planning a study concerning motion sickness during autonomous driving, it is imperative that the researchers consider ethical principles. In particular, comprehensive informed consent, predefined break-off criteria, and protective treatment by trained experimenters are necessary to conduct the study in an appropriate manner.

In sum, more research is necessary for the investigation of motion sickness and possible countermeasures. This paper contributes to solving methodological questions during this research.

**Author Contributions:** Conceptualization, D.M., K.R., L.R., M.T.; methodology, D.M., K.R., L.R., M.T.; software, D.M. and M.T.; validation, D.M., K.R., L.R., M.T.; formal analysis, D.M. and M.T.; investigation, D.M. and M.T.; resources, K.R. and L.R.; data curation, D.M. and M.T.; writing – original draft preparation, D.M. and M.T.; writing – review and editing, K.R. and L.R.; visualization, D.M. and M.T.; supervision, not relevant; project administration, K.R. and L.R.; funding acquisition, not relevant. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded and supported by the AUDI AG.

**Acknowledgments:** We want to thank Alex Neukum, Michael Herter and Jörg Oberglock for administrative support in our experiment. We would like to thank Anna Posset for her support in data collection and for correcting English in the final version of this manuscript.

**Conflicts of Interest:** This study was supported by the AUDI AG. Katharina Reinmueller and Lena Rittger are employees of this company. They contributed to the design of the study, the review and editing process of the manuscript and the decision to publish the results.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
