Review

Bi-Directional Gaze-Based Communication: A Review

1 ZEISS Vision Science Lab, University of Tübingen, Maria-von-Linden-Straße 6, 72076 Tübingen, Germany
2 Carl Zeiss Vision International GmbH, Turnstrasse 27, 73430 Aalen, Germany
* Author to whom correspondence should be addressed.
Multimodal Technol. Interact. 2024, 8(12), 108; https://doi.org/10.3390/mti8120108
Submission received: 17 October 2024 / Revised: 20 November 2024 / Accepted: 26 November 2024 / Published: 4 December 2024

Abstract:
Bi-directional gaze-based communication offers an intuitive and natural way for users to interact with systems. This approach utilizes the user’s gaze not only to communicate intent but also to obtain feedback, which promotes mutual understanding and trust between the user and the system. In this review, we explore the state of the art in gaze-based communication, focusing on both directions: From user to system and from system to user. First, we examine how eye-tracking data is processed and utilized for communication from the user to the system. This includes a range of techniques for gaze-based interaction and the critical role of intent prediction, which enhances the system’s ability to anticipate the user’s needs. Next, we analyze the reverse pathway—how systems provide feedback to users via various channels, highlighting their advantages and limitations. Finally, we discuss the potential integration of these two communication streams, paving the way for more intuitive and efficient gaze-based interaction models, especially in the context of Artificial Intelligence. Our overview emphasizes the future prospects for combining these approaches to create seamless, trust-building communication between users and systems. Ensuring that these systems are designed with a focus on usability and accessibility will be critical to making them effective communication tools for a wide range of users.


1. Introduction

Artificial Intelligence (AI) has become a pervasive and influential field, with supporting systems playing an increasingly critical role in various domains [1]. These systems are particularly evident in the health and medical sectors, where significant advancements are being made [2], e.g., as an aid in diagnosis [3] or during surgery [4]. AI has notably enhanced accessibility for individuals with disabilities through innovations in rehabilitation [5,6,7], the development of exoskeletons [8], the integration in wearables [9,10], and advancements in wheelchair navigation technologies [11,12]. Beyond healthcare, AI is increasingly shaping other fields, including gaming [13], domestic automation [14], and education [15,16]. As we integrate AI more into our daily lives and certain tasks become more reliant on it, how we interact with AI and how we expect it to respond become important for our trust in these systems.
Trust is an important factor in the adoption and effectiveness of AI. This trust is influenced by several key elements, including human characteristics and abilities, the performance and attributes of the AI itself, and the specific context in which tasks are performed [17]. To ensure that AI systems can excel across these dimensions, it is essential to provide clear and precise communication of tasks. However, communicating with AI differs significantly from traditional human-to-human interactions, requiring a deeper understanding of these distinctions and the unique challenges they present [18]. These distinctions include the functional role of AI in communication, the relational dynamics it fosters between humans and technology, and the metaphysical implications of the blurring boundaries between humans, machines, and the nature of communication. Exploring these differences is vital for enhancing the effectiveness and reliability of AI systems [19].
Effective communication with a system often requires that the input data be formatted and structured according to the specific requirements of the system. This ensures that the machine can interpret and process the data correctly. However, achieving this level of precision can be time-consuming and complex, making it challenging for the average user who may lack the expertise or resources to prepare data in the required format consistently. This difficulty highlights the need for more user-friendly communication methods that do not rely heavily on complex data structuring. While Natural Language Processing (NLP) models enable communication with machines through text input [20,21,22], which can promote accessibility, they can also have drawbacks like computational demand [23,24]. More importantly, language-based input modalities exclude a proportion of the population that may not be able to effectively communicate with speech or keyboard input.
System communication can often be a barrier for individuals with motor impairments, for whom devices like keyboards may be cumbersome, slow, or even physically inaccessible. Speech impairments can lead to commands being poorly recognized, which can affect usage [25,26]. Persons who experience deteriorating motor abilities from neurodegenerative disorders, like ALS or stroke, can also struggle with various communication modalities. An overview by Story et al. [27] found that patients with motor impairments were not sufficiently taken into account regarding the usability of devices (e.g., human factors, ergonomics). One modality for more implicit and natural communication that has been refined over the past 40 years is gaze-based interaction with systems. Gaze-based interaction can offer a more intuitive and resource-efficient way to engage with machines, potentially addressing some challenges associated with text-based input. This paper explores the state of the art of using gaze as an alternative communication method.

2. Structure of the Review

In this review, we consider communication with a system as a two-way process: from the user to the system and back. To take this into account, we have organized the overview around key research questions that are central to understanding and improving this interaction. We start in Section 3 with the question: How is gaze data used in interactive systems, especially in immersive environments? We provide an overview of existing methods for capturing, processing, and analyzing gaze data, with a focus on applications in immersive scenes. In Section 4, we address the question: How can gaze be used effectively as an interaction tool, and what challenges does this pose? Here, we examine general applications of gaze-based interaction, followed by a detailed look at gaze-based object selection. We then assess the research on adapting these methods to immersive environments and consider the particular challenges that arise in such contexts. In Section 5, we explore the question: How can gaze be used to predict user intentions? Since effective communication requires an understanding of user intention, we start with general approaches to gaze-based intention prediction and then focus on studies in which gaze is used for specific applications of intention recognition. In Section 6, we consider the question: What types of feedback channels from the system to the user are most effective? We examine different feedback mechanisms and analyze studies on their strengths, limitations, and impact on the user experience. Finally, in Section 7, we address the overarching question: How can these gaze-based techniques be integrated to create a system that encourages intuitive, responsive communication? We discuss how combining these approaches can create a system that gives users a stronger sense of being understood and engaged.

3. Gaze Estimation and Usage in Immersive Scenes

For effective gaze-based communication, precise gaze estimation from a seemingly ambiguous signal is non-trivial: the gaze signal is sensitive to subtle head movements, lighting variations, and individual differences in eye physiology, factors that can affect the robustness and accuracy of eye-tracking systems [28]. Here, we provide a brief introduction to gaze estimation techniques and the pipeline for their integration into applications. We illustrate our general overview of the pipeline for gaze estimation and common applications in Figure 1.
Gaze estimation is commonly achieved through eye tracking technology, which involves using sensors, such as cameras, to capture and analyze eye movements to determine gaze direction. In typical screen-based setups, remote eye trackers are often employed. These devices are usually mounted on a screen or desk and include systems like RemoteEye [29]. In dynamic environments, such as outdoor settings or Virtual Reality (VR), head-mounted eye trackers are preferred. These devices generally use cameras to track both eyes, as seen in systems like PupilLabs [30]. Some advanced VR headsets, such as the HTC Vive Pro Eye, incorporate integrated eye tracking sensors within the headset itself [31]. Regardless of the system used, proper calibration is crucial to ensure accurate gaze estimation based on eye orientation [32,33,34,35]. During the calibration phase, the system establishes a relationship between the raw, uncalibrated data from the device and the real gaze points. To achieve this, users are usually asked to follow an easily recognizable target, such as an ArUco marker, so that the system can match the gaze data with the exact target positions. Calibration is essential to account for individual differences in eye anatomy and positioning and to ensure accurate gaze estimation.
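As an illustration of this mapping step, the following Python sketch fits a simple second-order polynomial, by least squares, from raw pupil positions to the known positions of calibration targets. The polynomial form, function names, and the requirement of at least six calibration points are assumptions chosen for illustration and do not correspond to the calibration routine of any particular eye tracker.

```python
import numpy as np

def fit_polynomial_calibration(raw_xy, target_xy):
    """Least-squares fit of a second-order polynomial mapping raw pupil
    positions (n x 2) to known calibration-target positions (n x 2).
    Requires at least six calibration points for a well-posed fit."""
    x, y = raw_xy[:, 0], raw_xy[:, 1]
    # Design matrix with terms 1, x, y, xy, x^2, y^2
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    coeffs, *_ = np.linalg.lstsq(A, target_xy, rcond=None)
    return coeffs

def apply_calibration(raw_xy, coeffs):
    """Map new raw pupil positions to estimated gaze points on the plane."""
    x, y = raw_xy[:, 0], raw_xy[:, 1]
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    return A @ coeffs
```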
The output of gaze estimation is typically represented in one of two forms: gaze coordinates, which are x and y coordinates relative to a plane (e.g., monitor or scene camera), or gaze vectors, which provide the direction in three-dimensional space (x, y, z) or as horizontal and vertical angles. This data allows for the determination of where participants are looking and enables the analysis of eye movements and their events. The most well-known eye movement events are fixations and saccades. Fixations occur when the eye remains stable and focused on a specific point, indicating where attention is directed. In contrast, saccades are rapid movements of the eye as it shifts focus from one location to another, playing a crucial role in scanning the visual environment [36]. Another important eye movement event is smooth pursuit, characterized by slow, continuous tracking as the eyes follow a moving object. There are other eye movement events, as well as ways of processing the pupil signal, though they are beyond the scope of this review.
Eye movement events can be detected using threshold-based algorithms like Identification by Dispersion Threshold (I-DT) and Identification by Velocity Threshold (I-VT). The I-DT algorithm effectively identifies fixations by measuring gaze dispersion, classifying a movement as a fixation when dispersion remains below a set threshold, thereby sensitively detecting points of stable focus. On the other hand, the I-VT algorithm detects saccades by analyzing eye movement velocity, classifying a movement as a saccade when velocity exceeds a predefined threshold, thereby effectively capturing rapid shifts in gaze [37]. In addition to these methods, probabilistic models like Hidden Markov Models (HMMs) provide a more advanced approach for identifying fixations and saccades, treating them as hidden states and using velocity as an observed feature to account for the probabilistic nature of eye movement sequences [37]. These algorithms are also applicable in VR environments, where they help to accurately track eye movements within immersive settings [38,39,40]. Smooth pursuits are distinct from saccades and fixations and can be detected using algorithms that analyze shape features, such as those developed by Vidal et al. [41]. Recent advances in eye movement event detection have expanded beyond traditional methods. Santini et al. [42] introduced a Bayesian model that detects fixations, saccades, and smooth pursuits, even in online settings. Fuhl et al. have further advanced the field with deep learning models, including tiny transformers, for highly accurate event classification [43]. A comparative study by Andersson et al. [44] highlights the strengths and weaknesses of various algorithms, guiding the selection of the most suitable method for specific applications.
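As a minimal illustration of these threshold-based detectors, the Python sketch below labels inter-sample movements with I-VT and checks a window of samples with I-DT. The velocity and dispersion thresholds, and the assumption that gaze is already expressed in degrees of visual angle, are illustrative choices rather than values prescribed by the cited algorithms.

```python
import numpy as np

def ivt_classify(gaze_xy, timestamps, velocity_threshold=30.0):
    """Minimal I-VT sketch: label each inter-sample movement as belonging to
    a fixation or a saccade by thresholding angular velocity (deg/s).
    Assumes gaze_xy is already expressed in degrees of visual angle."""
    dxy = np.diff(gaze_xy, axis=0)
    dt = np.diff(timestamps)
    velocity = np.linalg.norm(dxy, axis=1) / dt      # deg/s per sample pair
    labels = np.where(velocity < velocity_threshold, "fixation", "saccade")
    return velocity, labels

def idt_is_fixation(window_xy, dispersion_threshold=1.0):
    """Minimal I-DT check: a window of samples counts as a fixation if its
    dispersion (x-range plus y-range, in degrees) stays below the threshold."""
    dispersion = np.ptp(window_xy[:, 0]) + np.ptp(window_xy[:, 1])
    return dispersion < dispersion_threshold
```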
From the fixation and saccade events, metrics can be derived, such as the mean fixation duration or saccade amplitudes, which provide deeper insights into visual attention and behavior. These features can be used for statistical analysis to infer cognitive states [45], distinguish between individuals [46], identify the task being performed [47], or serve as indicators of usability [48]. The analysis of fixations and saccades can reconstruct the user’s scanpath—the temporal patterns they generate while exploring a scene. Scanpaths vary depending on the task and the interplay of top-down context and bottom-up saliency [49,50] (Figure 2). Additionally, metrics can be derived from eye movements, such as the frequency of revisiting specific Areas of Interest (AOI)—distinct regions within a scene that are pre-defined or dynamically identified based on their relevance—which indicate which parts of the scene are examined in detail [51,52,53]. The calculation of eye-tracking metrics is essential to gain quantitative insights into visual attention, cognitive processes, and user behavior. This allows researchers to interpret how people absorb visual information and react to different tasks or stimuli.
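The sketch below illustrates, under simplified assumptions, how such metrics could be computed from a list of detected fixations and a rectangular AOI; the tuple layout and the AOI representation are hypothetical conventions chosen only for this example.

```python
import numpy as np

def mean_fixation_duration(fixations):
    """fixations: list of (start_time, end_time, x, y) tuples (assumed layout)."""
    durations = [end - start for start, end, _, _ in fixations]
    return float(np.mean(durations)) if durations else 0.0

def aoi_revisits(fixations, aoi):
    """Count how often gaze re-enters a rectangular AOI (x0, y0, x1, y1):
    transitions from outside to inside across consecutive fixations."""
    x0, y0, x1, y1 = aoi
    inside = [x0 <= x <= x1 and y0 <= y <= y1 for _, _, x, y in fixations]
    return sum(1 for prev, cur in zip(inside, inside[1:]) if cur and not prev)
```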
With advancements in gaze estimation technology, particularly in head-mounted devices, the use of gaze tracking in Virtual Environments (VE) is becoming increasingly prevalent. VR's immersive nature offers a unique platform for examining gaze interactions [55], enabling applications in areas such as gaming [56,57], training simulations [58,59], and medical diagnostics [60]. This environment also presents new challenges, such as dealing with motion sickness, navigating complex 3D spaces, and ensuring accurate calibration, which are driving ongoing research and development [61]. Compared to traditional 2D screen-based gaze tracking, VR offers richer insights into user behavior by capturing interactions in a fully immersive setting. In VR, the precise control of an AOI within the scene simplifies the automatic detection of when a user is gazing at an object, a task that is more challenging in other mobile eye tracking systems [35,62,63]. Future directions may include improving accuracy, integrating AI for predictive gaze analytics, and enhancing user experiences by enabling more natural and intuitive gaze-based interactions [64,65]. In the following, we review current trends in gaze-based interaction methods and how they can support users in immersive environments.

4. Gaze-Based Interaction

Gaze-based interaction has a rich and extensive history, thoroughly reviewed in several survey papers [66,67]. One of the earliest and most futuristic depictions of a gaze-controlled, immersive environment is given in [68], which envisioned a room of multiple monitors where "…Eye-tracking technology, integrated with speech and manual inputs, controls the display's visual dynamics, and orchestrates its sound accompaniments". Since then, research has furthered the idea that eye movements can be used to control tasks in real time. The gaze data collected from eye trackers is utilized as an input method for user interfaces, offering an intuitive and hands-free means of interaction [69]. Additionally, this form of interaction allows for the evaluation and training of specialist knowledge through task-specific tools that use gaze data [50,70].

4.1. Gaze-Based Interaction with Screen Applications

Gaze-based interaction offers a powerful and intuitive way to engage with screen-based applications. It allows users to control and interact with digital interfaces simply by looking at specific areas of the screen. This method not only enhances accessibility but also provides a hands-free alternative for navigating and operating software, making it particularly valuable in scenarios where traditional input devices like keyboards and mice may be impractical or inefficient. In this subsection, we will explore the various techniques and applications of gaze-based interaction within screen environments, highlighting both the potential benefits and challenges associated with this technology.
A significant application of gaze-based interaction is using gaze as an input method for sequences of letters and numbers, such as writing text [71,72,73,74,75]. Gaze-based text entry systems, like those used for typing or selecting characters on a virtual keyboard, have shown considerable promise. For instance, gaze can be effectively utilized to write by selecting letters and numbers on an on-screen interface, though it often requires users to undergo a training period to achieve proficiency. Research demonstrates that with practice, users can significantly improve their typing speed using gaze-based systems. Tuisku et al. conducted a study where participants practiced gaze-based text entry over ten sessions, and the results were compelling: participants increased their typing speed from 2.5 words per minute to 17.3 words per minute using the Dasher [73] text entry system [76]. This increase highlights the potential of gaze-based writing as a viable alternative to traditional input methods, especially for users who might benefit from hands-free interaction. Similarly, gaze-based systems have been explored for secure tasks like entering PINs, where the user’s gaze can be used to select digits without physical input, offering both convenience and a degree of security [77,78]. These applications demonstrate the versatility of gaze as an input modality and its potential to improve user interaction in various digital contexts.
Another valuable application of gaze-based interaction is navigating hierarchical systems, where users can traverse complex menu structures using only their gaze. An exemplary system designed for this purpose is pEye [79,80], which facilitates hierarchical navigation through a rotary interface specifically optimized for eye tracking. This interface relies on two key parameters: pie segmentation, which divides the circular menu into selectable slices, and menu depth, which allows for multiple layers of hierarchical options. The design of pEye enables users to select items from these segmented slices, moving through the layers of the hierarchy with ease. The system is particularly effective for tasks that involve complex decision trees or nested options, where traditional input methods might be cumbersome. In a study conducted by Urbina et al. [81], the effectiveness of this gaze-based rotary interface was tested. The study demonstrated that users could accurately and quickly navigate through hierarchical menus with up to six slices across multiple depth layers. This finding underscores the potential of gaze-based navigation to streamline interactions in hierarchical systems, making it a powerful tool for both everyday users and individuals with specific accessibility needs.
Gaze-based communication in screen applications primarily involves using eye movements to select or interact with objects on a screen, where the gaze itself serves as a pointer to target specific elements. One common method to confirm a selection is the use of dwell time, a technique introduced by Jacob [82]. Dwell time refers to the duration for which a user must fixate on an object before it is registered as a selection. This method can also serve as a metric for measuring user interest, with longer fixations indicating greater interest or intention [83]. However, the reliance on dwell time brings about a significant challenge in gaze-based communication, known as the Midas Touch problem. This issue arises because, in gaze interaction, not every glance is intended to trigger an action; people naturally scan their environment, and these incidental gazes can lead to unintended selections. The Midas Touch problem thus refers to the difficulty of distinguishing between intentional and unintentional eye movements, which can result in a frustrating user experience. To address this challenge, various strategies have been developed (Figure 3). One common solution is to carefully adjust the dwell time threshold to balance responsiveness and accuracy. If the dwell time is too short, it may lead to accidental selections, exacerbating the Midas Touch problem. On the other hand, if it is too long, it may cause delays and make the interaction feel sluggish. Optimizing dwell time is crucial for ensuring that gaze-based systems are both efficient and user-friendly, allowing for accurate and intentional communication without the drawbacks of unintended actions.
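A minimal sketch of such a dwell-time mechanism is given below; the 0.6 s threshold and the reset behavior after a selection are illustrative assumptions, not parameters recommended by the cited work.

```python
class DwellSelector:
    """Minimal dwell-time selection sketch: a target is selected only after
    the gaze has stayed on it continuously for `dwell_time` seconds."""

    def __init__(self, dwell_time=0.6):
        self.dwell_time = dwell_time      # assumed threshold, tune per task
        self.current_target = None
        self.dwell_start = None

    def update(self, gazed_target, timestamp):
        """Call once per gaze sample; returns the selected target or None."""
        if gazed_target != self.current_target:
            self.current_target = gazed_target   # gaze moved: restart timer
            self.dwell_start = timestamp
            return None
        if gazed_target is not None and timestamp - self.dwell_start >= self.dwell_time:
            # Restart the timer so the same target is not re-selected immediately
            self.dwell_start = timestamp
            return gazed_target
        return None
```

In practice, the threshold trades off responsiveness against accidental selections, which is exactly the balance discussed above for the Midas Touch problem.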
An alternative to relying on dwell time for gaze-based selection is to incorporate head gestures as a means of confirming the user’s intention. This approach allows users to make selections with their gaze while using specific head movements to validate or execute those choices. Špakov and Majaranta [84] conducted a study exploring the effectiveness of various head gestures in conjunction with eye gaze for different types of tasks. In their research, participants tested a range of head gestures, such as nodding, turning, and tilting, each mapped to specific actions. The study found that different head gestures were better suited for different tasks. Participants reported that nodding was the most intuitive and effective gesture for confirming selections, providing a natural and efficient way to indicate a choice. Turning the head was preferred for navigation tasks, such as scrolling or moving through a menu, as it felt more aligned with the directional nature of such tasks. Finally, tilting the head was found to be useful for mode switching, allowing users to easily shift between different interaction modes or settings. This combination of gaze and head gestures offers a more nuanced and flexible interaction method, reducing the reliance on dwell time and addressing some of its inherent limitations, such as the Midas Touch problem. By decoupling the selection process from gaze alone, this method provides users with greater control and reduces the likelihood of accidental selections, enhancing the overall user experience.
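The following simplified sketch illustrates how a gaze-pointed target could be confirmed by a nod detected from head pitch; the amplitude and duration thresholds, and the sign convention for pitch, are assumptions for illustration and do not reproduce the gestures studied in [84].

```python
import numpy as np

def detect_nod(pitch_deg, timestamps, amplitude=10.0, max_duration=0.8):
    """Very simplified nod detector: a downward-then-upward pitch excursion
    larger than `amplitude` degrees completed within `max_duration` seconds.
    Assumes pitch decreases when the head tilts down; thresholds are illustrative."""
    pitch = np.asarray(pitch_deg, dtype=float)
    t = np.asarray(timestamps, dtype=float)
    low = int(np.argmin(pitch))                  # deepest downward point
    if pitch[0] - pitch[low] < amplitude:        # not enough downward motion
        return False
    if pitch[-1] - pitch[low] < amplitude:       # head did not come back up
        return False
    return (t[-1] - t[0]) <= max_duration

def confirm_selection(gazed_target, head_pitch_window, timestamps):
    """Gaze points at the target; a detected nod confirms it."""
    if gazed_target is not None and detect_nod(head_pitch_window, timestamps):
        return gazed_target
    return None
```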
Another approach to object selection in gaze-based interaction involves using the Pearson correlation between a user’s gaze and the movement of objects within a scene. This method was extensively explored by Vidal et al. [85], who developed a technique to calculate the correlation between the user’s gaze direction and the trajectories of multiple moving objects. In their approach, the system continuously monitors the gaze and compares it with the motion paths of all objects in the scene. If the correlation value for any object exceeds a predefined threshold, the object with the highest correlation is automatically selected. This method allows for dynamic and precise selection, particularly in environments where multiple objects are in motion. Building on this concept, Esteves et al. [86] introduced a novel mechanism for selecting items on a smartwatch using gaze in combination with object orbits. In their system, when a user focuses on an object, an orbit—essentially a circular or elliptical path—appears around it. The user can then confirm their selection by maintaining a gaze that correlates with the movement of the orbit. The system calculates the Pearson correlation between the gaze path and the trajectory of the orbit, and if the correlation is higher than a threshold value, the selection is confirmed. This method not only provides a robust way to ensure that selections are intentional but also enhances interaction with small devices like smartwatches, where traditional input methods may be cumbersome. By coupling gaze dynamics and object motion, these techniques offer an intuitive and efficient way to interact with complex visual environments.
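In the spirit of these correlation-based techniques, the sketch below compares a window of recent gaze samples against the trajectories of candidate objects and selects the best-correlated one; the window handling and the threshold of 0.8 are illustrative assumptions rather than the parameters used in the cited studies.

```python
import numpy as np

def pursuit_select(gaze_window, object_windows, threshold=0.8):
    """gaze_window: (n, 2) array of recent gaze positions.
    object_windows: dict mapping object id -> (n, 2) array of that object's
    positions over the same time window. Returns the id of the object whose
    motion correlates best with the gaze, or None if no correlation exceeds
    the (illustrative) threshold."""
    best_id, best_corr = None, threshold
    for obj_id, traj in object_windows.items():
        # Correlate x and y components separately, then average the two
        corr_x = np.corrcoef(gaze_window[:, 0], traj[:, 0])[0, 1]
        corr_y = np.corrcoef(gaze_window[:, 1], traj[:, 1])[0, 1]
        corr = np.nanmean([corr_x, corr_y])
        if corr > best_corr:
            best_id, best_corr = obj_id, corr
    return best_id
```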

4.2. Gaze-Based Interaction in Immersive Scenes

While many gaze-based interaction systems have traditionally focused on scenarios displayed on computer screens, the emergence of Extended Reality (XR)—including Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR)—has shifted research toward interactions within fully immersive virtual environments. These immersive scenes present unique challenges and opportunities for gaze-based interaction, as users are no longer confined to 2D screen interfaces but are instead navigating complex 3D virtual spaces. Moreover, the virtual realm allows for highly controlled, yet dynamic environments, where potentially dangerous scenarios can be played out (e.g., driving [87], surgical training [88,89], security training [90], etc.). A systematic review by Monteiro et al. [91] highlights that in immersive environments, interaction is most commonly driven by voice commands, followed by head and eye-gaze control. However, there has been a notable increase in research focusing on the potential of eye gaze as an interaction modality, especially in recent years. Still, one of the main challenges in these immersive environments is the Midas Touch. To address this, researchers often combine gaze with other input modalities. In many cases, gaze is used to select a target, while a secondary modality—such as voice—is employed to confirm the action. For instance, Klamka et al. [92] explored the use of foot pedals as a confirmation mechanism, showing that alternative methods can also be effective in avoiding accidental selections. Though this research successfully shows multi-modality for refined interaction, it does not fully address groups of users with specific mobility constraints.
In numerous scenarios, other input methods such as voice or physical controls are also not desired, thus gaze-based interaction can revert to using dwell time as the primary mechanism for target selection. In these cases, both head movements and eye gaze can be utilized for selecting a target, each offering distinct advantages. While eye gaze can precisely indicate a point of interest, head movements provide additional confirmation, adding robustness to the selection process. A comparative study by Qian and Teather [93] evaluated head-based versus gaze-based selection for target detection tasks. Their findings revealed that head-based selection generally outperformed gaze-based selection in terms of both selection time and error rates. Specifically, participants were able to make selections more quickly and accurately when using head movements as the primary input method, likely due to the larger and more controlled motions involved in head-based navigation, which reduce the risk of accidental selections. In contrast, a study by Blattgerste et al. [94] produced different results, showing that eye-gaze selection was superior across multiple metrics, including faster task completion times, fewer head movements, and lower error rates. The researchers suggested that the quality of the eye-tracking signal might account for the contrasting outcomes between the two studies. As eye-tracking technology continues to improve, with greater accuracy and reliability, gaze-based input may become more effective for rapid and precise target selection, potentially surpassing head-based methods in specific contexts. This divergence in findings highlights the importance of considering both the context of the task and the quality of the tracking systems when choosing between head and gaze inputs for immersive interactions. In scenarios where eye-tracking is highly accurate, gaze may offer more natural and efficient interaction, whereas head-based inputs may remain preferable when precision is critical, or eye-tracking is less reliable.
Another approach to enhancing gaze-based interaction is to combine both head movements and eye gaze for more accurate and efficient target selection. This integration leverages the strengths of each modality, offering greater control and reducing errors. Sidenmark and Gellersen [95] explored this combination by developing three distinct interaction methods that utilized both head-supported versus eyes-only gaze (Figure 4a). In their study, participants were tasked with answering questions by selecting from four multiple-choice options. Their results show that eye and head techniques enable a new type of gaze behavior that gives users more control and flexibility in rapid gaze alignment and selection. Wei et al. [96] took this integration further by introducing a probabilistic model that combines head and eye gaze endpoints to improve target selection accuracy. Rather than simply using a binary selection based on dwell time or direct gaze, their approach calculated the probability of selecting a particular target based on the distribution of both head and gaze positions. By analyzing where both modalities converge, the system could estimate which target the user was most likely focusing on. This probabilistic approach reduced the likelihood of incorrect selections due to slight variations in gaze or head movement, making the interaction more reliable in complex environments with multiple selectable objects. These methods demonstrate that combining head movements and eye gaze can significantly improve interaction accuracy, speed, and user satisfaction, particularly in immersive scenes where precise control is essential. The integration of these modalities not only addresses the limitations of using gaze or head movement alone but also opens new possibilities for more fluid and intuitive interaction in virtual and augmented reality environments.
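As a simplified illustration of such probabilistic fusion (not the published model), the following sketch scores each candidate target with Gaussian likelihoods centered on the eye-gaze and head-gaze endpoints and returns the most probable target; the variances are assumed values.

```python
import numpy as np

def fuse_head_eye_selection(targets, eye_endpoint, head_endpoint,
                            sigma_eye=1.0, sigma_head=2.5):
    """Score each candidate target by the product of two isotropic Gaussian
    likelihoods centred on the eye-gaze and head-gaze endpoints (angular
    coordinates in degrees). The variances are illustrative assumptions;
    this is a sketch of the idea, not the model proposed in the literature."""
    def gaussian(dist, sigma):
        return np.exp(-0.5 * (dist / sigma) ** 2)

    scores = {}
    for target_id, pos in targets.items():
        d_eye = np.linalg.norm(np.asarray(pos) - np.asarray(eye_endpoint))
        d_head = np.linalg.norm(np.asarray(pos) - np.asarray(head_endpoint))
        scores[target_id] = gaussian(d_eye, sigma_eye) * gaussian(d_head, sigma_head)
    return max(scores, key=scores.get)   # target with the highest joint score
```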
One recent interaction paradigm that is purely gaze-based is from Sidenmark et al. [97]. They proposed a method for selecting an object based on vergence (Figure 4b). This approach exploits the fact that the two eyes rotate in opposite directions (converge) to fuse a single, focused image. The angle between the two gaze vectors gives the vergence, which is an indicator of the distance of an object. By exploiting the vergence angle, the system can accurately determine which object is being focused on, even in scenarios where the gaze direction is nearly identical for multiple objects located at different distances. Their approach enhances precision in identifying the selected target object, distinguishing it from other nearby objects that may be in close proximity but at varying depths.
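A minimal sketch of the underlying geometry is given below: the vergence angle between the two gaze direction vectors is converted into an approximate fixation distance under a symmetric-vergence assumption. The interpupillary distance and the simplified geometry are illustrative assumptions, not the method of the cited work.

```python
import numpy as np

def vergence_depth(left_dir, right_dir, ipd_m=0.063):
    """Estimate fixation distance (in meters) from the vergence angle between
    the two gaze direction vectors. Uses simplified symmetric-vergence geometry:
    distance ~ (IPD / 2) / tan(vergence / 2). IPD and geometry are assumptions."""
    l = np.asarray(left_dir, dtype=float)
    r = np.asarray(right_dir, dtype=float)
    l, r = l / np.linalg.norm(l), r / np.linalg.norm(r)
    cos_angle = np.clip(np.dot(l, r), -1.0, 1.0)
    vergence = np.arccos(cos_angle)              # angle between gaze rays, radians
    if vergence < 1e-6:                          # (near-)parallel gaze: far away
        return float("inf")
    return (ipd_m / 2) / np.tan(vergence / 2)
```

The estimated depth can then be compared against the depths of candidate objects along the gaze direction to disambiguate targets at different distances.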
Gaze-based interaction has also attracted great interest in the gaming industry, where it offers unique opportunities to enhance the gaming experience. Research comparing gaze-based and hand-based interactions in games has produced insightful findings. Hülsmann et al. [98] conducted a comparative study on these two modalities and found that while both gaze-based and hand-based interactions were effective, hand-based input generally provided greater accuracy and ease of use. Their study indicated that hand-based controls were more intuitive and precise, which is crucial for games requiring fine motor skills and rapid responses. With the increased accessibility and affordability of eye-tracking technology, further research has emerged exploring the potential of gaze-based interaction in gaming contexts. Luro and Sundstedt [99] compared eye-tracking with traditional hand-tracking methods and observed that, although aiming with the eyes tended to be less accurate compared to hand-based controls, it resulted in lower cognitive load and shorter task completion times. Their findings suggest that while gaze aiming may require some adjustment in terms of precision, it offers advantages in reducing mental effort and speeding up interactions, which could be beneficial in certain gaming scenarios. In a different approach, Lee et al. [100] explored the effectiveness of gaze-based targeting compared to using a physical controller in a baseball-throwing game. Participants were asked to aim at targets either by using a controller or by directing their gaze. The results revealed that while the accuracy of hitting the targets was comparable between the two methods, gaze-based interaction excelled in terms of task completion time and overall usability. This study highlighted the potential of gaze-based controls to streamline gameplay and improve user experience by making interactions more direct and less cumbersome. These studies collectively indicate that while gaze-based interaction in gaming may not yet match the accuracy of traditional hand-based methods, it offers compelling benefits in terms of reduced cognitive load and improved usability. As eye-tracking technology continues to advance, its integration into gaming is likely to offer more engaging player experiences, leveraging its strengths in reducing physical input and enhancing accessibility.
To summarize, using gaze to interact with the environment is a difficult task. To avoid the Midas Touch problem, different approaches have been tested (see Table 1). These approaches usually use gaze as a pointing mechanism and other modalities for confirmation. In this way, unwanted selections can be minimized, making gaze-based interaction more practical in dynamic scenarios such as games.

5. Gaze-Based Intention Prediction

In good communication, it is helpful to recognize what the other person wants. While clear signals allow the user to tell the system what they want to interact with, it is still necessary to decide what exactly should be done. The feedback can be used to communicate the intention of the system to the user or to provide the user with information that is helpful for their intention. In order to achieve this, however, the intention must first be recognized. In gaze-based communication, eye movements are the main means of doing this. As Yarbus [49] has already shown, the scanpath differs depending on the underlying intention. Nowadays, gaze-based intention prediction is applied in several areas, such as driving scenarios [102,103], interaction with robots [104,105], and interaction with VEs [106]. A good summary of gaze-based intention prediction is given by Belardinelli [107]. This section gives a general overview of gaze-based intention prediction and how it can be used to communicate with an AI.

5.1. How to Use Gaze to Predict Intention

In intention prediction, a person’s intention is considered a latent, unobservable state that must be inferred through their observable actions. In this context, gaze-based intention prediction involves estimating a person’s intention by analyzing their eye movements. Gaze features such as fixation duration or dwell time are commonly extracted, and machine learning models are employed to classify these features. Common classifiers include Support Vector Machine (SVM) [108,109], random forests [47], logistic regression [106], and neural networks [102]. However, given that intention is a hidden state, machine learning models designed for estimating hidden states, such as Dynamic Bayesian Networks and Hidden Markov Models, have also been explored [110,111,112,113,114,115]. These models estimate the probability of an intention based on observable features, often by applying principles from Bayes’ theorem [116]:
$$P(I \mid F) = \frac{P(F \mid I) \cdot P(I)}{P(F)}$$
In this formula, I represents the intention and F the observed features. P(I) is the prior probability of the intention, P(F) is the probability of observing the features, P(I|F) is the probability of the intention given the observed features, and P(F|I) is the likelihood of observing the features given the intention. The goal is to find the intention that maximizes P(I|F), making the most likely prediction based on the observed eye movements. Since the intention I is hidden, directly estimating P(I|F) is difficult. However, it is not necessary to calculate P(I|F) directly. Instead, Bayes' theorem allows finding the intention that maximizes this probability:
$$\arg\max_I P(I \mid F) = \arg\max_I \frac{P(F \mid I) \cdot P(I)}{P(F)}$$
Because P(F) is constant across all intentions, it does not affect which intention maximizes the probability. Therefore, it is sufficient to maximize P(F|I) · P(I), which considers both the likelihood of the features and the prior probability of the intention. Since the features F are observable, estimating P(F|I) is feasible, allowing for accurate intention predictions using observed data and general prior probabilities.
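As a concrete, simplified instance of this maximization, the sketch below scores each candidate intention with a naive Gaussian likelihood over observed gaze features plus a log prior and returns the argmax. The Gaussian naive-Bayes form and the feature choices are modelling assumptions for illustration only.

```python
import numpy as np

def predict_intention(features, intention_models, priors):
    """Pick the intention I maximizing P(F|I) * P(I).
    features: 1-D array of observed gaze features (e.g., mean fixation
    duration, mean saccade amplitude). intention_models: dict mapping
    intention -> (mean vector, std vector) of a naive Gaussian likelihood
    fitted on training data. priors: dict mapping intention -> P(I)."""
    features = np.asarray(features, dtype=float)
    best_intention, best_score = None, -np.inf
    for intention, (mu, sigma) in intention_models.items():
        mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
        # Log-likelihood of independent Gaussian features plus log prior
        log_lik = -0.5 * np.sum(((features - mu) / sigma) ** 2
                                + np.log(2 * np.pi * sigma ** 2))
        score = log_lik + np.log(priors[intention])
        if score > best_score:
            best_intention, best_score = intention, score
    return best_intention
```

Working in log space avoids numerical underflow and makes the argmax over P(F|I) · P(I) explicit without ever computing P(F).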

5.2. Areas of Application

Gaze-based intention prediction has diverse applications in communication across different domains (Figure 5). One notable application is in human–agent interaction, where understanding a person’s intention is crucial, especially when dealing with ambiguous behaviors. Singh et al. [117] introduced a model for online human intention recognition that integrates gaze data with model-based AI planning. Their findings suggest that this approach could enhance human–agent interaction, particularly in scenarios where the person’s behavior might be uncertain, such as when they are being honest, deceitful, or semi-rational. This is even possible without directly classifying these behaviors, as gaze data can provide insights into the underlying cognitive processes and user behavior. This model could be instrumental in designing more adaptive and responsive systems that can handle these complex human behaviors.
Another significant area of application is in real-time interactions, particularly in immersive environments. Chen and Hou [118] conducted a VR experiment demonstrating that gaze data combined with hand-eye coordination data can be used to predict a user’s interaction intentions. This has promising implications for the development of predictive interfaces that can anticipate user actions, making interactions more seamless and intuitive. Similarly, Newn et al. [119] demonstrated that gaze can serve as a natural and intentional input in socially interactive systems. In their demo, they showed how gaze can be used by interactive agents to collaborate with humans by understanding and responding to their intentions, making the interaction more fluid and engaging.
In the field of assistive technology, gaze-based intention prediction holds significant potential for improving communication, particularly for individuals with impairments. Koochaki and Najafizade [120,121] explored how eye gaze patterns can be leveraged to predict a user’s intention to perform daily tasks. This approach can empower people with disabilities by enabling them to communicate their needs and intentions to assistive systems, thus improving their independence and quality of life.
Gaze-based intention prediction is also increasingly being utilized in human–robot interaction (HRI). For instance, Weber et al. [104,105] demonstrated how gaze-based intention prediction can be used to communicate a target to a robot, enabling more effective and natural interactions between humans and robots. This concept is further extended by Shi et al. [122], who proposed GazeEMD, a method for using gaze cues to communicate the intention to pick up objects in real-world scenarios. Such gaze-driven interactions are particularly beneficial in collaborative environments, where clear and efficient communication between humans and robots is essential. Dermy et al. [123] highlighted this in their work by developing a method that combines gaze with other modalities to control a robot in assisting a human during collaborative tasks. Their approach underscores the value of multimodal communication, where gaze plays a pivotal role in enhancing the coordination between human and robotic partners.

6. Feedback in Gaze-Based Communication

Communication is inherently a two-way process, and it is widely recognized that feedback from the system to the user plays a crucial role in ensuring effective interaction [124]. Visual feedback, for instance, can provide valuable information about the status and resolution of problems. However, it often lacks the ability to effectively capture the user’s attention in urgent situations. Conversely, in teleoperation settings, modalities such as vibration and sound are more effective at alerting users, though they may not convey detailed information as clearly as visual feedback [125]. Given these dynamics, it becomes evident that even in gaze-based communication, exploring different channels for feedback is essential. By incorporating multiple feedback mechanisms, systems can improve responsiveness and enhance the overall interaction experience, ensuring that users receive timely and accurate feedback across various contexts and tasks. As we look further into gaze-based interactions, it is important to explore how these feedback mechanisms can be optimized for a range of applications, especially in immersive and complex environments. In more immersive environments, such as VR or AR, feedback mechanisms must be carefully tuned to maintain the user’s sense of presence while enhancing interaction efficiency. For example, in virtual environments, visual feedback alone may become overwhelming or insufficient due to the complexity of the virtual scene. In such cases, combining feedback channels can significantly improve performance and user experience.

6.1. Haptic Feedback

Haptic feedback, such as vibration, is often used to inform the user when an event or action happens. The timing of this feedback is important; if delayed, users may misinterpret the source or cause of the feedback. Kangas et al. [126] demonstrated this in a study where participants were required to locate a target using their gaze, with haptic feedback signaling the moment the correct target was fixated. Their findings revealed that when the delay between the initial gaze and the feedback was too long, the number of errors increased significantly, indicating that prompt feedback is essential for accurate task performance. These findings are further supported by Rantala et al. [127] in their review of gaze interaction with vibrotactile feedback, which emphasizes the importance of delivering haptic cues in a timely manner. Delays in feedback disrupt the natural flow of interaction and can lead to confusion or mistakes, underscoring the necessity of swift, well-coordinated feedback mechanisms in gaze-based systems.
The need for timely and effective feedback becomes even more apparent in gaze-based typing. In these scenarios, visual feedback generally informs the user of the selected letter and helps monitor the dwell time, which is commonly used to confirm selections. However, once a letter is successfully typed, auditory or haptic feedback is often more effective in signaling completion, as it is not visually distracting. Among these, haptic feedback offers a more discreet and private form of communication, ensuring the user remains focused on the task without distracting external noise [128].
Concerning immersive environments, a study by Sakamaki et al. [129] illustrates the integration of gaze and haptic feedback for better interaction during robot navigation. In their research, a robot was guided towards a target using the user's gaze, with haptic feedback provided to indicate proximity to the target. This additional haptic cue allowed users to adjust their gaze more precisely, helping them reach the target more quickly and effectively, demonstrating the value of multisensory feedback in dynamic gaze-based interactions.

6.2. Audio Feedback

While haptic feedback effectively signals that an event has occurred, audio feedback offers a more versatile means of delivering complex information to the user. Audio cues can range from simple notifications, like clicks or error sounds, to longer tones that indicate specific actions, such as an incoming call. Additionally, audio feedback can convey highly detailed instructions through speech generation, offering real-time guidance or conveying contextual information that visual or haptic channels might not easily communicate. This versatility makes audio feedback a valuable tool for enhancing user experience, particularly in tasks that require immediate and clear responses.
Even in acoustically complex environments with multiple audio sources and potential distractors, participants have demonstrated the ability to accurately localize sounds, as highlighted by the study conducted by Moraes et al. [130], which evaluated how listeners perceive spatialized audio within a VE. Their findings revealed not only the effectiveness of spatial audio in guiding user attention but also that eye gaze can be correlated with varying levels of cognitive load and effort during sound localization tasks. This suggests that gaze patterns may offer insights into the cognitive demands placed on users when navigating acoustically rich environments, further emphasizing the potential of gaze–audio integration.
In the context of grasping tasks within a VE, audio feedback emerges as a promising alternative to haptic feedback, which is traditionally used to signal the successful grasp of a virtual object. In such tasks, recognizing when an object has been successfully grasped can be challenging without tactile cues. Canales and Jörg [131] conducted a study to explore the effectiveness of audio and visual feedback as substitutes for tactile feedback in virtual grasping interactions. Their results indicated that participants consistently preferred audio feedback over visual feedback, even though this preference seemed to come at the cost of a slight decrease in grasping performance. This preference for audio feedback highlights its potential to create more immersive and intuitive interaction experiences in VEs, especially when traditional tactile feedback is not available. By using audio cues for tasks like object manipulation, virtual environments can provide users with essential feedback, fostering a more engaging and responsive interaction. Although haptic feedback remains the gold standard for certain tasks, audio feedback presents a viable and sometimes preferred alternative, particularly in situations where physical touch is impractical or impossible.
In a virtual environment, speech generation can guide users through complex spaces, helping them reach their destination even without prior knowledge of the environment’s layout. By providing verbal cues, such as directional instructions or contextual hints, users can orient themselves more easily within the virtual space. This auditory feedback can mimic human-to-human type interaction for teaching and guiding. Moreover, it can be made even more effective by incorporating real-time gaze data into the process. When gaze information is integrated with speech generation, the system can adapt its verbal instructions based on the user’s current attention. For instance, if the user’s gaze is focused on an irrelevant part of the scene, the system could redirect their attention with more precise cues, such as “Look to your left” or “Focus on the red door”. This dynamic interaction personalizes the experience, making navigation more efficient and intuitive by responding to the user’s natural focus points. Studies by Staudte et al. and Garoufi et al. [132,133] have shown that exploiting gaze data for speech generation can significantly enhance the clarity and effectiveness of instructions in such environments, improving both user engagement and task performance. This combination of gaze tracking and audio feedback enables a more responsive system that tailors its guidance to individual user behavior, offering a more immersive and adaptive experience in virtual environments.
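A minimal sketch of such gaze-adaptive guidance is shown below: when the gaze leaves the target AOI, a directional verbal cue is generated. The left/right rule and the phrasing are illustrative assumptions, not the strategies evaluated in the cited studies.

```python
def gaze_adaptive_cue(gaze_xy, target_aoi, target_name):
    """Return a verbal cue when the gaze is outside the target AOI
    (x0, y0, x1, y1, screen coordinates); return None while the user is
    already attending to the target. Phrasing and rule are illustrative."""
    x, y = gaze_xy
    x0, y0, x1, y1 = target_aoi
    if x0 <= x <= x1 and y0 <= y <= y1:
        return None                                # gaze already on target
    if x < x0:
        return f"Look to your right, towards the {target_name}."
    if x > x1:
        return f"Look to your left, towards the {target_name}."
    return f"Focus on the {target_name}."
```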

6.3. Visual Feedback

As previously discussed, visual feedback is highly effective for conveying detailed and specific information within a scenario. In a VE, visual cues can provide general context, such as indicating which objects can be interacted with, or more specific details, like identifying the type of error that has occurred. However, while visual feedback excels at delivering precise information, it tends to be less effective at immediately capturing the user’s attention, especially in dynamic or immersive environments. For this reason, visual feedback is often combined with other modalities, such as auditory or haptic feedback, to ensure that users are both aware of and can respond to critical changes or events in the environment. For instance, Zhang et al. [134] conducted a study in which participants were tasked with completing a “peg-in-a-hole” task—a classic motor skill exercise that involves navigating a small object through a confined space. The study evaluated the effectiveness of both audio and visual feedback during the task. The results revealed that while both feedback types were beneficial, participants expressed a clear preference for the combination of audio and visual feedback. This multimodal approach allowed users to benefit from the precision of visual feedback in understanding the task requirements and the immediacy of auditory feedback to alert them of their actions.
While visual feedback proves to be effective, its usefulness is also task-dependent. In a study by Kangas et al. [135], the researchers compared haptic, audio, visual, and no feedback in a task where participants were required to adjust the gray level using smooth pursuit eye movements. Interestingly, despite the diverse feedback modalities, there were no statistically significant differences in performance across the different conditions. However, participant preferences told a different story: both haptic and audio feedback were rated more favorably compared to visual and no feedback. This suggests that while visual feedback can provide valuable information, other feedback types may offer greater immediacy or intuitiveness in certain tasks. Similarly, another study conducted by Lankes and Haslinger [136] explored the use of visual, audio, and haptic feedback in an exploratory game where participants had to locate objects using the different feedback modalities. Their findings revealed a clear preference for visual and audio feedback, with haptic feedback receiving notably lower ratings. The participants found the combination of visual and auditory feedback to be the most engaging and effective, leading the researchers to conclude that further exploration of this multimodal approach could offer significant benefits for feedback systems in similar tasks.
These studies underscore the complexity of feedback design, particularly in gaze-based or interactive systems. While visual feedback can provide detailed and context-rich information, its effectiveness varies depending on the task at hand. Combining it with auditory or haptic feedback often enhances user experience by supporting the strengths of each modality (Figure 6). Therefore, a deeper investigation into how these feedback mechanisms interact is essential to optimize user performance and satisfaction across various types of virtual environments and applications.
These feedback mechanisms are beneficial, but they often either inform the user of actions that have already occurred or attempt to guide them toward desired behavior. However, systems that can accurately predict a user’s next likely action have the potential to create a more seamless and efficient user experience—much like how text prediction accelerates messaging. In this context, gaze-based human–system interaction stands to gain significantly from intent prediction, enabling more proactive and intuitive support.

7. Discussion

As gaze-based communication technology continues to advance, it is increasingly becoming a viable solution for interacting with AI systems while minimizing computational effort. Figure 7 illustrates a two-way communication model that emphasizes the dynamic exchange between the user and the AI system. This loop begins with the user’s input, which is captured through modalities such as gaze or head movements and contextualized with information about the scene. The system processes this data to interpret the user’s intention and provides feedback via visual, auditory, or haptic channels. This bi-directional interaction mirrors natural human communication, where both parties mutually learn based on each other’s responses. In this discussion, we explore the components underlying this communication model, including gaze data acquisition, challenges in gaze-based interaction, intention prediction, and feedback strategies. Finally, we address how the integration of these elements can promote a seamless and intuitive interaction experience that builds trust and improves usability.
Starting with the question: How is gaze data used in interactive systems, especially in immersive environments? Gaze data is typically collected by eye-tracking devices, which record eye movements and determine the direction of the user’s gaze. The raw gaze coordinates collected over time are then processed to provide meaningful insights into gaze behavior. For example, algorithms analyze the duration and stability of the gaze at specific coordinates to determine whether a user is fixating on an object, following a moving target, or directing their gaze to a new point of interest. These inferred behaviors, including fixations, smooth pursuits, and saccades, provide fundamental data that can be used to calculate higher-level metrics. Such interpretations of gaze data are essential for applications in immersive environments, as they enable interactive systems to respond to the user’s focus and intent, creating a more natural and intuitive communication interface. A summary of that pipeline is shown in Figure 1. Despite progress, many issues remain, such as identifying key features for interpreting gaze in unconstrained environments and developing robust real-time analysis methods; increasingly, AI is being used to overcome these challenges and enable learning with minimal supervision and adaptability to real-world conditions [137].
Moving forward, we address the question: How can gaze be used effectively as an interaction tool, and what challenges does this pose? By combining gaze data with scene data, eye-tracking systems can accurately identify the user’s object of interest, enabling direct communication of it to the system. This capability opens up a wide range of applications: For example, gaze can facilitate tasks such as text entry, navigation through menus or interfaces or—especially in immersive environments—the selection of objects in augmented or virtual reality. However, the use of gaze as an interaction tool brings with it unique challenges. One well-known problem is the so-called Midas Touch problem, where unintended decisions are made simply because the user is looking at an object. To solve this problem, various approaches have been developed to ensure that gaze-based interaction is precise and intentional (see Figure 3). These solutions typically involve using gaze to select the target while requiring confirmation through an additional input, such as a head movement or an external control (e.g., a hand gesture or voice command). This multimodal approach helps to avoid accidental actions and makes gaze-based interaction more reliable and intuitive for users [69]. By utilizing these strategies, gaze-based systems can create responsive and engaging experiences for various applications, from hands-free control of devices to seamless interaction in immersive environments.
The next step in gaze-based communication is intention prediction: How can gaze be used to predict user intentions? In general, systems interpret gaze and scene data to classify likely intentions, often using machine learning models. Since intentions can be viewed as hidden states that influence user behavior, methods such as Hidden Markov Models (HMMs) are commonly applied to model the sequential nature of gaze data and predict the underlying intentions over time. However, predicting intentions based on gaze data alone can be challenging as accuracy decreases with the number of possible intentions, but when the set of possible intentions is small, high accuracies can be achieved [120]. Combining gaze-based interaction methods with intention prediction simplifies this process: Identifying the specific object of interest reduces the range of possible user intentions. This combined approach has shown success in predefined environments, where the limited set of intentions enables accurate predictions. Expanding this accuracy to more dynamic, open-ended settings remains an ongoing research challenge.
Once the system has interpreted the user’s intent, it needs to provide feedback so that the user understands the system’s response, which leads us to the question: What types of feedback channels from the system to the user are most effective? There are three primary feedback channels: visual, haptic, and auditory, each with unique strengths and limitations (see Figure 6). Visual feedback can quickly confirm the point of interest, haptic feedback offers a tactile confirmation, and audio feedback can convey additional context or instructions. In practice, the effectiveness of a feedback channel depends heavily on the task. For example, visual cues can emphasize the point of interest, while audio feedback can provide relevant information without disrupting the visual focus. Multimodal feedback, where two or more channels are combined, often increases clarity and improves user interaction by reinforcing the message across multiple senses [138]. However, it is important to balance the richness of feedback with user comfort, as excessive feedback can lead to sensory overload, especially in immersive environments.
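As an illustration of how such trade-offs might be encoded in a system, the toy policy below picks feedback channels per event based on context; the event names and rules are assumptions for demonstration, not empirically validated guidelines.

```python
from enum import Enum, auto

class Channel(Enum):
    VISUAL = auto()
    AUDIO = auto()
    HAPTIC = auto()

def choose_feedback(event: str, visually_loaded: bool, in_public: bool) -> set:
    """Toy policy for selecting feedback channels for a given event.

    visually_loaded: the user's visual channel is already busy (e.g., aiming).
    in_public: audio without headphones would expose information to bystanders.
    """
    channels = set()
    if event == "target_highlighted":
        # Lightweight confirmation: haptics do not disrupt visual focus.
        channels.add(Channel.HAPTIC)
        if not visually_loaded:
            channels.add(Channel.VISUAL)
    elif event == "selection_confirmed":
        channels.add(Channel.VISUAL)
        # Audio adds context, but leaks information in public settings.
        if not in_public:
            channels.add(Channel.AUDIO)
    return channels

# Example: confirming a selection while the user is in a public space.
print(choose_feedback("selection_confirmed", visually_loaded=False, in_public=True))
```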
This brings us to the final question: How can these gaze-based techniques be integrated to create a system that encourages intuitive, responsive communication? For users, expressing intent should be straightforward and natural, with clear and easy-to-interpret feedback from the AI. By enabling users to communicate their intentions effectively and receive meaningful feedback, the system creates a dynamic feedback loop that allows users to feel understood and to correct any misinterpretations. This interactive flow mirrors human-to-human communication and promotes trust between the user and the AI system [139,140]. This review shows that gaze-based input has proven successful in various applications, particularly in immersive environments. While much of the research to date has focused on healthy users, gaze-based communication systems also show promise for people with disabilities, as they offer a more accessible way of interacting. Future research should aim to make these systems accessible to a wider range of users, adapting to individual needs and optimizing comfort and engagement. By incorporating these findings, emerging AI systems can become more inclusive and user-friendly. Achieving reliable performance across diverse real-world environments remains a challenge, however, and future gaze-based systems must prioritize safety and ease of use, especially for people with mobility impairments. The advancement of technologies such as the Apple Vision Pro and HoloLens 2 has the potential to transform everyday interactions, especially for users with special needs. By focusing on accessibility and usability, these devices could make everyday interactions more efficient, inclusive, and impactful for a broad range of users.
Moreover, gaze-based communication systems for real-world applications must take ethical and privacy considerations into account, with attention to both the input and the feedback mechanisms. Gaze is an intuitive and natural way to communicate intentions, allowing users to convey their focus and goals without text or spoken language. This makes it comparatively private relative to voice or text input, as the risk of eavesdropping or unauthorized access is reduced [141]. Feedback channels, however, pose their own data-protection challenges. Audio feedback is clear and specific but becomes a public channel unless headphones are used, potentially exposing sensitive information. Haptic feedback, such as vibrations, is very private but lacks the precision required for detailed communication. Visual feedback, delivered through devices such as AR glasses, offers a middle ground that provides clarity while maintaining privacy, although it requires specialized hardware that is not always practical. Additionally, it is important to recognize that advanced data analytics can extract extensive information from gaze signals, including biometric identity, demographic details, personality traits, emotional states, interests, cognitive processes, and even indicators of physical or mental health conditions [142]. Balancing these factors is critical to developing systems that are not only effective but also aligned with ethical principles and user trust.
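One common mitigation, in the spirit of privacy-preserving streaming approaches such as [141], is to degrade the gaze signal before it leaves the device. The sketch below adds zero-mean Gaussian noise to streamed gaze angles; the noise scale is an arbitrary placeholder and would need to be tuned against both interaction accuracy and the intended privacy guarantee.

```python
import numpy as np
from typing import Optional

def privatize_gaze(samples: np.ndarray, sigma_deg: float = 0.5,
                   rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Add zero-mean Gaussian noise to gaze angles before streaming them.

    samples: array of shape (n, 2) with gaze direction in degrees.
    sigma_deg: noise scale; 0.5 deg is an illustrative value, not a recommendation.
    """
    rng = np.random.default_rng() if rng is None else rng
    return samples + rng.normal(0.0, sigma_deg, size=samples.shape)
```

Stronger protection generally comes at the cost of interaction precision, which is exactly the utility-versus-privacy balance discussed above.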

8. Conclusions

In this overview, we have analyzed the main methods and applications for gaze-based communication and highlighted their potential to improve interaction with AI systems. Bi-directional gaze-based communication provides users with a natural and intuitive way to communicate their intentions to technology, mimicking familiar interactions. By using gaze as the primary input, users can express their intentions with minimal effort and are less reliant on more complex interaction methods such as voice commands or manual input. This simplicity is one of the biggest advantages of gaze-based communication: it has minimal computational requirements and scales across different platforms. As AI systems become an integral part of everyday life, two-way communication via gaze enables more dynamic and responsive interactions that improve both the accuracy of intent recognition and the user experience by providing immediate, clear feedback. This feedback loop not only builds user confidence but also promotes satisfaction by making the system feel more intuitive and engaging. Gaze-based communication holds promise for a range of applications, from augmented reality to assistive robots, as its intuitive nature and efficiency make it an attractive solution for improving human-AI interaction. Future research and development should continue to focus on usability and accessibility to ensure that these systems serve a variety of users effectively. In this way, gaze-based communication can shape a more inclusive and seamless future for AI-driven technology.

Funding

This research is supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 951910 and by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP TRA, project No. 276693517.

Acknowledgments

We acknowledge support from the Open Access Publication Fund of the University of Tübingen. We acknowledge ChatGPT for its support in slight changes to sentence structure and wording.

Conflicts of Interest

We declare that Siegfried Wahl and Nora Castner are employees of Carl Zeiss Vision International GmbH, as stated in the list of affiliations. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Maedche, A.; Legner, C.; Benlian, A.; Berger, B.; Gimpel, H.; Hess, T.; Hinz, O.; Morana, S.; Söllner, M. AI-Based Digital Assistants. Bus. Inf. Syst. Eng. 2019, 61, 535–544. [Google Scholar] [CrossRef]
  2. Tran, B.; Vu, G.; Ha, G.H.; Vuong, Q.; Ho, M.T.; Vuong, T.T.; La, V.P.; Ho, M.T.; Nghiem, K.C.P.; Nguyen, H.L.T.; et al. Global Evolution of Research in Artificial Intelligence in Health and Medicine: A Bibliometric Study. J. Clin. Med. 2019, 8, 360. [Google Scholar] [CrossRef] [PubMed]
  3. Marullo, G.; Ulrich, L.; Antonaci, F.G.; Audisio, A.; Aprato, A.; Massè, A.; Vezzetti, E. Classification of AO/OTA 31A/B femur fractures in X-ray images using YOLOv8 and advanced data augmentation techniques. Bone Rep. 2024, 22, 101801. [Google Scholar] [CrossRef] [PubMed]
  4. Checcucci, E.; Piazzolla, P.; Marullo, G.; Innocente, C.; Salerno, F.; Ulrich, L.; Moos, S.; Quarà, A.; Volpi, G.; Amparore, D.; et al. Development of Bleeding Artificial Intelligence Detector (BLAIR) System for Robotic Radical Prostatectomy. J. Clin. Med. 2023, 12, 7355. [Google Scholar] [CrossRef]
  5. Lee, M.H.; Siewiorek, D.P.; Smailagic, A.; Bernardino, A.; Badia, S.B.i. Enabling AI and robotic coaches for physical rehabilitation therapy: Iterative design and evaluation with therapists and post-stroke survivors. Int. J. Soc. Robot. 2022, 16, 1–22. [Google Scholar] [CrossRef]
  6. Zhou, X.Y.; Guo, Y.; Shen, M.; Yang, G.Z. Application of artificial intelligence in surgery. Front. Med. 2020, 14, 417–430. [Google Scholar] [CrossRef]
  7. Andras, I.; Mazzone, E.; van Leeuwen, F.W.; De Naeyer, G.; van Oosterom, M.N.; Beato, S.; Buckle, T.; O’Sullivan, S.; van Leeuwen, P.J.; Beulens, A.; et al. Artificial intelligence and robotics: A combination that is changing the operating room. World J. Urol. 2020, 38, 2359–2366. [Google Scholar] [CrossRef]
  8. Zhu, C.; Liu, Q.; Meng, W.; Ai, Q.; Xie, S.Q. An Attention-Based CNN-LSTM Model with Limb Synergy for Joint Angles Prediction. In Proceedings of the 2021 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Delft, The Netherlands, 12–16 July 2021; pp. 747–752. [Google Scholar]
  9. Wang, K.J.; Liu, Q.; Zhao, Y.; Zheng, C.Y.; Vhasure, S.; Liu, Q.; Thakur, P.; Sun, M.; Mao, Z.H. Intelligent wearable virtual reality (VR) gaming controller for people with motor disabilities. In Proceedings of the 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), Nagoya, Japan, 23–25 November 2018; pp. 161–164. [Google Scholar]
  10. Wen, F.; Zhang, Z.; He, T.; Lee, C. AI enabled sign language recognition and VR space bidirectional communication using triboelectric smart glove. Nat. Commun. 2021, 12, 5378. [Google Scholar] [CrossRef]
  11. Zgallai, W.; Brown, J.T.; Ibrahim, A.; Mahmood, F.; Mohammad, K.; Khalfan, M.; Mohammed, M.; Salem, M.; Hamood, N. Deep learning AI application to an EEG driven BCI smart wheelchair. In Proceedings of the 2019 Advances in Science and Engineering Technology International Conferences (ASET), Dubai, United Arab Emirates, 26 March–10 April 2019; pp. 1–5. [Google Scholar]
  12. Sakai, Y.; Lu, H.; Tan, J.K.; Kim, H. Recognition of surrounding environment from electric wheelchair videos based on modified YOLOv2. Future Gener. Comput. Syst. 2019, 92, 157–161. [Google Scholar] [CrossRef]
  13. Nareyek, A. AI in Computer Games. Queue 2004, 1, 58–65. [Google Scholar] [CrossRef]
  14. Nhizam, S.; Zyarif, M.; Tuhfa, S.Z. Utilization of Artificial Intelligence Technology in Assisting House Chores. J. Multiapp 2021, 2, 29–34. [Google Scholar] [CrossRef]
  15. Chen, L.; Chen, P.; Lin, Z. Artificial Intelligence in Education: A Review. IEEE Access 2020, 8, 75264–75278. [Google Scholar] [CrossRef]
  16. Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
  17. Kaplan, A.D.; Kessler, T.; Brill, J.C.; Hancock, P. Trust in Artificial Intelligence: Meta-Analytic Findings. Hum. Factors J. Hum. Factors Ergon. Soc. 2021, 65, 337–359. [Google Scholar] [CrossRef]
  18. Guzman, A.L.; Lewis, S. Artificial intelligence and communication: A Human–Machine Communication research agenda. New Media Soc. 2019, 22, 70–86. [Google Scholar] [CrossRef]
  19. Hassija, V.; Chakrabarti, A.; Singh, A.; Chamola, V.; Sikdar, B. Unleashing the Potential of Conversational AI: Amplifying Chat-GPT’s Capabilities and Tackling Technical Hurdles. IEEE Access 2023, 11, 143657–143682. [Google Scholar] [CrossRef]
  20. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  21. Machiraju, S.; Modi, R. Natural Language Processing. In Developing Bots with Microsoft Bots Framework: Create Intelligent Bots Using MS Bot Framework and Azure Cognitive Services; Apress: Berkeley, CA, USA, 2018; pp. 203–232. [Google Scholar] [CrossRef]
  22. Wei, W.; Wu, J.; Zhu, C. Special issue on deep learning for natural language processing. Computing 2020, 102, 601–603. [Google Scholar] [CrossRef]
  23. Singh, S.; Mahmood, A. The NLP Cookbook: Modern Recipes for Transformer Based Deep Learning Architectures. IEEE Access 2021, 9, 68675–68702. [Google Scholar] [CrossRef]
  24. Strubell, E.; Ganesh, A.; McCallum, A. Energy and Policy Considerations for Modern Deep Learning Research. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13693–13696. [Google Scholar] [CrossRef]
  25. De Russis, L.; Corno, F. On the impact of dysarthric speech on contemporary ASR cloud platforms. J. Reliab. Intell. Environ. 2019, 5, 163–172. [Google Scholar] [CrossRef]
  26. Hawley, M.S.; Cunningham, S.P.; Green, P.D.; Enderby, P.; Palmer, R.; Sehgal, S.; O’Neill, P. A Voice-Input Voice-Output Communication Aid for People With Severe Speech Impairment. IEEE Trans. Neural Syst. Rehabil. Eng. 2013, 21, 23–31. [Google Scholar] [CrossRef] [PubMed]
  27. Story, M.F.; Winters, J.M.; Lemke, M.R.; Barr, A.; Omiatek, E.; Janowitz, I.; Brafman, D.; Rempel, D. Development of a method for evaluating accessibility of medical equipment for patients with disabilities. Appl. Ergon. 2010, 42, 178–183. [Google Scholar] [CrossRef] [PubMed]
  28. Niehorster, D.C.; Santini, T.; Hessels, R.S.; Hooge, I.T.; Kasneci, E.; Nyström, M. The impact of slippage on the data quality of head-worn eye trackers. Behav. Res. Methods 2020, 52, 1140–1160. [Google Scholar] [CrossRef]
  29. Hosp, B.; Eivazi, S.; Maurer, M.; Fuhl, W.; Geisler, D.; Kasneci, E. RemoteEye: An open-source high-speed remote eye tracker: Implementation insights of a pupil-and glint-detection algorithm for high-speed remote eye tracking. Behav. Res. Methods 2020, 52, 1387–1401. [Google Scholar] [CrossRef]
  30. Kassner, M.; Patera, W.; Bulling, A. Pupil: An Open Source Platform for Pervasive Eye Tracking and Mobile Gaze-based Interaction. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication (UbiComp’14 Adjunct), New York, NY, USA, 13–17 September 2014; pp. 1151–1160. [Google Scholar] [CrossRef]
  31. Sipatchin, A.; Wahl, S.; Rifai, K. Accuracy and precision of the HTC VIVE PRO eye tracking in head-restrained and head-free conditions. Investig. Ophthalmol. Vis. Sci. 2020, 61, 5071–5071. [Google Scholar]
  32. Nyström, M.; Andersson, R.; Holmqvist, K.; Van De Weijer, J. The influence of calibration method and eye physiology on eyetracking data quality. Behav. Res. Methods 2013, 45, 272–288. [Google Scholar] [CrossRef] [PubMed]
  33. Harezlak, K.; Kasprowski, P.; Stasch, M. Towards Accurate Eye Tracker Calibration—Methods and Procedures. Procedia Comput. Sci. 2014, 35, 1073–1081. [Google Scholar] [CrossRef]
  34. Severitt, B.R.; Kübler, T.C.; Kasneci, E. Testing different function fitting methods for mobile eye-tracker calibration. J. Eye Mov. Res. 2023, 16. [Google Scholar] [CrossRef]
  35. Niehorster, D.C.; Hessels, R.S.; Benjamins, J.S.; Nyström, M.; Hooge, I.T. GlassesValidator: A data quality tool for eye tracking glasses. Behav. Res. Methods 2024, 56, 1476–1484. [Google Scholar] [CrossRef]
  36. Hessels, R.S.; Niehorster, D.C.; Nyström, M.; Andersson, R.; Hooge, I.T. Is the eye-movement field confused about fixations and saccades? A survey among 124 researchers. R. Soc. Open Sci. 2018, 5, 180502. [Google Scholar] [CrossRef] [PubMed]
  37. Salvucci, D.D.; Goldberg, J.H. Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications (ETRA’00), Palm Beach Gardens, FL, USA, 6–8 November 2000; pp. 71–78. [Google Scholar] [CrossRef]
  38. Chen, X.L.; Hou, W.J. Identifying Fixation and Saccades in Virtual Reality. arXiv 2022, arXiv:2205.04121. [Google Scholar]
  39. Gao, H.; Bozkir, E.; Hasenbein, L.; Hahn, J.U.; Göllner, R.; Kasneci, E. Digital transformations of classrooms in virtual reality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8–13 May 2021; pp. 1–10. [Google Scholar]
  40. Gao, H.; Frommelt, L.; Kasneci, E. The Evaluation of Gait-Free Locomotion Methods with Eye Movement in Virtual Reality. In Proceedings of the 2022 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Singapore, 17–21 October 2022; pp. 530–535. [Google Scholar]
  41. Vidal, M.; Bulling, A.; Gellersen, H. Detection of smooth pursuits using eye movement shape features. In Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA’12), Santa Barbara, CA, USA, 28–30 March 2012; pp. 177–180. [Google Scholar] [CrossRef]
  42. Santini, T.; Fuhl, W.; Kübler, T.; Kasneci, E. Bayesian identification of fixations, saccades, and smooth pursuits. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications (ETRA’16), Charleston, SC, USA, 14–17 March 2016; pp. 163–170. [Google Scholar] [CrossRef]
  43. Fuhl, W.; Herrmann-Werner, A.; Nieselt, K. The Tiny Eye Movement Transformer. In Proceedings of the 2023 Symposium on Eye Tracking Research and Applications (ETRA’23), Tübingen, Germany, 30 May–2 June 2023. [Google Scholar] [CrossRef]
  44. Andersson, R.; Larsson, L.; Holmqvist, K.; Stridh, M.; Nyström, M. One algorithm to rule them all? An evaluation and discussion of ten eye movement event-detection algorithms. Behav. Res. Methods 2017, 49, 616–637. [Google Scholar] [CrossRef] [PubMed]
  45. Marshall, S. Identifying cognitive state from eye metrics. Aviat. Space Environ. Med. 2007, 78 (Suppl. 5), B165–B175. [Google Scholar]
  46. Yoon, H.J.; Carmichael, T.R.; Tourassi, G. Gaze as a biometric. In Proceedings of the SPIE 9037, Medical Imaging 2014: Image Perception, Observer Performance, and Technology Assessment, San Diego, CA, USA, 16–17 February 2014. [Google Scholar] [CrossRef]
  47. Boisvert, J.F.; Bruce, N.D. Predicting task from eye movements: On the importance of spatial distribution, dynamics, and image features. Neurocomputing 2016, 207, 653–668. [Google Scholar] [CrossRef]
  48. Castner, N.; Arsiwala-Scheppach, L.; Mertens, S.; Krois, J.; Thaqi, E.; Kasneci, E.; Wahl, S.; Schwendicke, F. Expert gaze as a usability indicator of medical AI decision support systems: A preliminary study. npj Digit. Med. 2024, 7, 199. [Google Scholar] [CrossRef] [PubMed]
  49. Yarbus, A.L. Eye Movements and Vision; Springer: New York, NY, USA, 1967; p. 171. [Google Scholar]
  50. Castner, N.; Kuebler, T.C.; Scheiter, K.; Richter, J.; Eder, T.; Hüttig, F.; Keutel, C.; Kasneci, E. Deep semantic gaze embedding and scanpath comparison for expertise classification during OPT viewing. In Proceedings of the ACM Symposium on Eye Tracking Research and Applications (ETRA’20), Stuttgart, Germany, 2–5 June 2020; pp. 1–10. [Google Scholar] [CrossRef]
  51. Anderson, N.C.; Bischof, W.F.; Laidlaw, K.E.; Risko, E.F.; Kingstone, A. Recurrence quantification analysis of eye movements. Behav. Res. Methods 2013, 45, 842–856. [Google Scholar] [CrossRef]
  52. Dewhurst, R.; Nyström, M.; Jarodzka, H.; Foulsham, T.; Johansson, R.; Holmqvist, K. It depends on how you look at it: Scanpath comparison in multiple dimensions with MultiMatch, a vector-based approach. Behav. Res. Methods 2012, 44, 1079–1100. [Google Scholar] [CrossRef] [PubMed]
  53. Li, A.; Chen, Z. Representative scanpath identification for group viewing pattern analysis. J. Eye Mov. Res. 2018, 11. [Google Scholar] [CrossRef]
  54. Geisler, D.; Castner, N.; Kasneci, G.; Kasneci, E. A MinHash approach for fast scanpath classification. In ACM Symposium on Eye Tracking Research and Applications; Association for Computing Machinery: New York, NY, USA, 2020. [Google Scholar]
  55. Pfeuffer, K.; Mayer, B.; Mardanbegi, D.; Gellersen, H. Gaze + pinch interaction in virtual reality. In Proceedings of the 5th Symposium on Spatial User Interaction (SUI’17), Brighton, UK, 16–17 October 2017; pp. 99–108. [Google Scholar] [CrossRef]
  56. Dohan, M.; Mu, M. Understanding User Attention In VR Using Gaze Controlled Games. In Proceedings of the 2019 ACM International Conference on Interactive Experiences for TV and Online Video (TVX’19), Salford, UK, 5–7 June 2019; pp. 167–173. [Google Scholar] [CrossRef]
  57. Kocur, M.; Dechant, M.J.; Lankes, M.; Wolff, C.; Mandryk, R. Eye Caramba: Gaze-based Assistance for Virtual Reality Aiming and Throwing Tasks in Games. In Proceedings of the ACM Symposium on Eye Tracking Research and Applications (ETRA’20 Short Papers), Stuttgart, Germany, 2–5 June 2020. [Google Scholar] [CrossRef]
  58. Harris, D.J.; Hardcastle, K.J.; Wilson, M.R.; Vine, S.J. Assessing the learning and transfer of gaze behaviours in immersive virtual reality. Virtual Real. 2021, 25, 961–973. [Google Scholar] [CrossRef]
  59. Neugebauer, A.; Castner, N.; Severitt, B.; Štingl, K.; Ivanov, I.; Wahl, S. Simulating vision impairment in virtual reality: A comparison of visual task performance with real and simulated tunnel vision. Virtual Real. 2024, 28, 97. [Google Scholar] [CrossRef]
  60. Orlosky, J.; Itoh, Y.; Ranchet, M.; Kiyokawa, K.; Morgan, J.; Devos, H. Emulation of Physician Tasks in Eye-Tracked Virtual Reality for Remote Diagnosis of Neurodegenerative Disease. IEEE Trans. Vis. Comput. Graph. 2017, 23, 1302–1311. [Google Scholar] [CrossRef] [PubMed]
  61. Adhanom, I.B.; MacNeilage, P.; Folmer, E. Eye tracking in virtual reality: A broad review of applications and challenges. Virtual Real. 2023, 27, 1481–1505. [Google Scholar] [CrossRef] [PubMed]
  62. Clay, V.; König, P.; Koenig, S. Eye tracking in virtual reality. J. Eye Mov. Res. 2019, 12. [Google Scholar] [CrossRef] [PubMed]
  63. Naspetti, S.; Pierdicca, R.; Mandolesi, S.; Paolanti, M.; Frontoni, E.; Zanoli, R. Automatic analysis of eye-tracking data for augmented reality applications: A prospective outlook. In Augmented Reality, Virtual Reality, and Computer Graphics: Proceedings of the Third International Conference, AVR 2016, Lecce, Italy, 15–18 June 2016; Proceedings, Part II 3; Springer: Cham, Switzerland, 2016; pp. 217–230. [Google Scholar]
  64. Mania, K.; McNamara, A.; Polychronakis, A. Gaze-aware displays and interaction. In Proceedings of the ACM SIGGRAPH 2021 Courses (SIGGRAPH’21), Virtual, 9–13 August 2021. [Google Scholar] [CrossRef]
  65. Alt, F.; Schneegass, S.; Auda, J.; Rzayev, R.; Broy, N. Using eye-tracking to support interaction with layered 3D interfaces on stereoscopic displays. In Proceedings of the 19th International Conference on Intelligent User Interfaces (IUI’14), Haifa, Israel, 24–27 February 2014; pp. 267–272. [Google Scholar] [CrossRef]
  66. Duchowski, A.T. Gaze-based interaction: A 30 year retrospective. Comput. Graph. 2018, 73, 59–69. [Google Scholar] [CrossRef]
  67. Plopski, A.; Hirzle, T.; Norouzi, N.; Qian, L.; Bruder, G.; Langlotz, T. The Eye in Extended Reality: A Survey on Gaze Interaction and Eye Tracking in Head-worn Extended Reality. ACM Comput. Surv. 2022, 55. [Google Scholar] [CrossRef]
  68. Bolt, R.A. Gaze-orchestrated dynamic windows. ACM SIGGRAPH Comput. Graph. 1981, 15, 109–119. [Google Scholar] [CrossRef]
  69. Kiefer, P.; Giannopoulos, I.; Raubal, M.; Duchowski, A. Eye tracking for spatial research: Cognition, computation, challenges. Spat. Cogn. Comput. 2017, 17, 1–19. [Google Scholar] [CrossRef]
  70. Bednarik, R. Expertise-dependent visual attention strategies develop over time during debugging with multiple code representations. Int. J. Hum.-Comput. Stud. 2012, 70, 143–155. [Google Scholar] [CrossRef]
  71. Majaranta, P.; Räihä, K.J. Twenty years of eye typing: Systems and design issues. In Proceedings of the 2002 Symposium on Eye Tracking Research & Applications (ETRA’02), New Orleans, LA, USA, 25–27 March 2002; pp. 15–22. [Google Scholar] [CrossRef]
  72. Wobbrock, J.O.; Rubinstein, J.; Sawyer, M.W.; Duchowski, A.T. Longitudinal evaluation of discrete consecutive gaze gestures for text entry. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications (ETRA’08), Savannah, GA, USA, 26–28 March 2008; pp. 11–18. [Google Scholar] [CrossRef]
  73. Ward, D.J.; MacKay, D.J. Fast hands-free writing by gaze direction. Nature 2002, 418, 838–838. [Google Scholar] [CrossRef]
  74. Majaranta, P.; Bates, R. Special issue: Communication by gaze interaction. Univers. Access Inf. Soc. 2009, 8, 239–240. [Google Scholar] [CrossRef]
  75. Hansen, J.P.; Tørning, K.; Johansen, A.S.; Itoh, K.; Aoki, H. Gaze typing compared with input by head and hand. In Proceedings of the 2004 Symposium on Eye Tracking Research & Applications (ETRA’04), San Antonio, TX, USA, 22–24 March 2004; pp. 131–138. [Google Scholar] [CrossRef]
  76. Tuisku, O.; Majaranta, P.; Isokoski, P.; Räihä, K.J. Now Dasher! Dash away! longitudinal study of fast text entry by Eye Gaze. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications (ETRA’08), Savannah, GA, USA, 26–28 March 2008; pp. 19–26. [Google Scholar] [CrossRef]
  77. Hoanca, B.; Mock, K. Secure graphical password system for high traffic public areas. In Proceedings of the 2006 Symposium on Eye Tracking Research & Applications (ETRA’06), San Diego, CA, USA, 27–29 March 2006; p. 35. [Google Scholar] [CrossRef]
  78. Best, D.S.; Duchowski, A.T. A rotary dial for gaze-based PIN entry. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications (ETRA’16), Charleston, SC, USA, 14–17 March 2016; pp. 69–76. [Google Scholar] [CrossRef]
  79. Huckauf, A.; Urbina, M. Gazing with pEYE: New concepts in eye typing. In Proceedings of the 4th Symposium on Applied Perception in Graphics and Visualization (APGV’07), Tübingen, Germany, 25–27 July 2007; p. 141. [Google Scholar] [CrossRef]
  80. Huckauf, A.; Urbina, M.H. Gazing with pEYEs: Towards a universal input for various applications. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications (ETRA’08), Savannah, Georgia, 26–28 March 2008; pp. 51–54. [Google Scholar] [CrossRef]
  81. Urbina, M.H.; Lorenz, M.; Huckauf, A. Pies with EYEs: The limits of hierarchical pie menus in gaze control. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications (ETRA’10), Austin, TX, USA, 22–24 March 2010; pp. 93–96. [Google Scholar] [CrossRef]
  82. Jacob, R.J.K. What you look at is what you get: Eye movement-based interaction techniques. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’90), Seattle, WA, USA, 1–5 April 1990; pp. 11–18. [Google Scholar] [CrossRef]
  83. Starker, I.; Bolt, R.A. A gaze-responsive self-disclosing display. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’90), Seattle, WA, USA, 1–5 April 1990; pp. 3–10. [Google Scholar] [CrossRef]
  84. Špakov, O.; Majaranta, P. Enhanced gaze interaction using simple head gestures. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (UbiComp’12), Pittsburgh, PA, USA, 5–8 September 2012; pp. 705–710. [Google Scholar] [CrossRef]
  85. Vidal, M.; Bulling, A.; Gellersen, H. Pursuits: Spontaneous interaction with displays based on smooth pursuit eye movement and moving targets. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp’13), Zurich, Switzerland, 8–12 September 2013; pp. 439–448. [Google Scholar] [CrossRef]
  86. Esteves, A.; Velloso, E.; Bulling, A.; Gellersen, H. Orbits: Gaze Interaction for Smart Watches using Smooth Pursuit Eye Movements. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology (UIST’15), Charlotte, NC, USA, 8–11 November 2015; pp. 457–466. [Google Scholar] [CrossRef]
  87. Sportillo, D.; Paljic, A.; Ojeda, L. Get ready for automated driving using Virtual Reality. Accid. Anal. Prev. 2018, 118, 102–113. [Google Scholar] [CrossRef] [PubMed]
  88. Piromchai, P.; Avery, A.; Laopaiboon, M.; Kennedy, G.; O’Leary, S. Virtual reality training for improving the skills needed for performing surgery of the ear, nose or throat. Cochrane Database Syst. Rev. 2015, 9, CD010198. [Google Scholar] [CrossRef] [PubMed]
  89. Stirling, E.R.B.; Lewis, T.L.; Ferran, N.A. Surgical skills simulation in trauma and orthopaedic training. J. Orthop. Surg. Res. 2014, 9, 126. [Google Scholar] [CrossRef] [PubMed]
  90. de Armas, C.; Tori, R.; Netto, A.V. Use of virtual reality simulators for training programs in the areas of security and defense: A systematic review. Multimed. Tools Appl. 2020, 79, 3495–3515. [Google Scholar] [CrossRef]
  91. Monteiro, P.; Gonçalves, G.; Coelho, H.; Melo, M.; Bessa, M. Hands-free interaction in immersive virtual reality: A systematic review. IEEE Trans. Vis. Comput. Graph. 2021, 27, 2702–2713. [Google Scholar] [CrossRef]
  92. Klamka, K.; Siegel, A.; Vogt, S.; Göbel, F.; Stellmach, S.; Dachselt, R. Look & Pedal: Hands-free Navigation in Zoomable Information Spaces through Gaze-supported Foot Input. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI’15), Seattle, WA, USA, 9–13 November 2015; pp. 123–130. [Google Scholar] [CrossRef]
  93. Qian, Y.Y.; Teather, R.J. The eyes don’t have it: An empirical comparison of head-based and eye-based selection in virtual reality. In Proceedings of the 5th Symposium on Spatial User Interaction (SUI’17), Brighton, UK, 16–17 October 2017; pp. 91–98. [Google Scholar] [CrossRef]
  94. Blattgerste, J.; Renner, P.; Pfeiffer, T. Advantages of eye-gaze over head-gaze-based selection in virtual and augmented reality under varying field of views. In Proceedings of the Workshop on Communication by Gaze Interaction (COGAIN’18), Warsaw, Poland, 15 June 2018. [Google Scholar] [CrossRef]
  95. Sidenmark, L.; Gellersen, H. Eye&Head: Synergetic Eye and Head Movement for Gaze Pointing and Selection. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology (UIST’19), New Orleans, LA, USA, 20–23 October 2019; pp. 1161–1174. [Google Scholar] [CrossRef]
  96. Wei, Y.; Shi, R.; Yu, D.; Wang, Y.; Li, Y.; Yu, L.; Liang, H.N. Predicting Gaze-based Target Selection in Augmented Reality Headsets based on Eye and Head Endpoint Distributions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI’23), Hamburg, Germany, 23–28 April 2023. [Google Scholar] [CrossRef]
  97. Sidenmark, L.; Clarke, C.; Newn, J.; Lystbæk, M.N.; Pfeuffer, K.; Gellersen, H. Vergence Matching: Inferring Attention to Objects in 3D Environments for Gaze-Assisted Selection. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI’23), Hamburg, Germany, 23–28 April 2023. [Google Scholar] [CrossRef]
  98. Hülsmann, F.; Dankert, T.; Pfeiffer, T. Comparing gaze-based and manual interaction in a fast-paced gaming task in virtual reality. In Proceedings of the Workshop Virtuelle & Erweiterte Realität 2011; Shaker Verlag: Aachen, Germany, 2011. [Google Scholar]
  99. Luro, F.L.; Sundstedt, V. A comparative study of eye tracking and hand controller for aiming tasks in virtual reality. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications (ETRA’19), Denver, CO, USA, 25–28 June 2019. [Google Scholar] [CrossRef]
  100. Lee, J.; Kim, H.; Kim, G.J. Keep Your Eyes on the Target: Enhancing Immersion and Usability by Designing Natural Object Throwing with Gaze-based Targeting. In Proceedings of the 2024 Symposium on Eye Tracking Research and Applications (ETRA’24), Glasgow, UK, 4–7 June 2024. [Google Scholar] [CrossRef]
  101. Sidorakis, N.; Koulieris, G.A.; Mania, K. Binocular eye-tracking for the control of a 3D immersive multimedia user interface. In Proceedings of the 2015 IEEE 1st Workshop on Everyday Virtual Reality (WEVR), Arles, France, 23 March 2015; pp. 15–18. [Google Scholar] [CrossRef]
  102. Lethaus, F.; Baumann, M.; Köster, F.; Lemmer, K. A comparison of selected simple supervised learning algorithms to predict driver intent based on gaze data. Neurocomputing 2013, 121, 108–130. [Google Scholar] [CrossRef]
  103. Wu, M.; Louw, T.; Lahijanian, M.; Ruan, W.; Huang, X.; Merat, N.; Kwiatkowska, M. Gaze-based Intention Anticipation over Driving Manoeuvres in Semi-Autonomous Vehicles. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macao, China, 4–8 November 2019; pp. 6210–6216. [Google Scholar] [CrossRef]
  104. Weber, D.; Kasneci, E.; Zell, A. Exploiting Augmented Reality for Extrinsic Robot Calibration and Eye-based Human-Robot Collaboration. In Proceedings of the 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Sapporo, Hokkaido, Japan, 7–10 March 2022; pp. 284–293. [Google Scholar] [CrossRef]
  105. Weber, D.; Santini, T.; Zell, A.; Kasneci, E. Distilling Location Proposals of Unknown Objects through Gaze Information for Human-Robot Interaction. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 11086–11093. [Google Scholar] [CrossRef]
  106. David-John, B.; Peacock, C.; Zhang, T.; Murdison, T.S.; Benko, H.; Jonker, T.R. Towards gaze-based prediction of the intent to interact in virtual reality. In Proceedings of the ACM Symposium on Eye Tracking Research and Applications (ETRA’21), Virtual, 25–27 May 2021. [Google Scholar] [CrossRef]
  107. Belardinelli, A. Gaze-based intention estimation: Principles, methodologies, and applications in HRI. arXiv 2023, arXiv:2302.04530. [Google Scholar] [CrossRef]
  108. Huang, C.M.; Mutlu, B. Anticipatory robot control for efficient human-robot collaboration. In Proceedings of the 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Christchurch, New Zealand, 7–10 March 2016; pp. 83–90. [Google Scholar] [CrossRef]
  109. Kanan, C.; Ray, N.A.; Bseiso, D.N.F.; Hsiao, J.H.; Cottrell, G.W. Predicting an observer’s task using multi-fixation pattern analysis. In Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA’14), Safety Harbor, FL, USA, 26–28 March 2014; pp. 287–290. [Google Scholar] [CrossRef]
  110. Bader, T.; Vogelgesang, M.; Klaus, E. Multimodal integration of natural gaze behavior for intention recognition during object manipulation. In Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI-MLMI’09), Cambridge, MA, USA, 2–6 November 2009; pp. 199–206. [Google Scholar] [CrossRef]
  111. Boccignone, G. Advanced Statistical Methods for Eye Movement Analysis and Modelling: A Gentle Introduction. In Eye Movement Research: An Introduction to Its Scientific Foundations and Applications; Klein, C., Ettinger, U., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 309–405. [Google Scholar] [CrossRef]
  112. Fuchs, S.; Belardinelli, A. Gaze-Based Intention Estimation for Shared Autonomy in Pick-and-Place Tasks. Front. Neurorobot. 2021, 15. [Google Scholar] [CrossRef]
  113. Haji-Abolhassani, A.; Clark, J.J. An inverse Yarbus process: Predicting observers’ task from eye movement patterns. Vision Res. 2014, 103, 127–142. [Google Scholar] [CrossRef]
  114. Tahboub, K.A. Intelligent human-machine interaction based on dynamic bayesian networks probabilistic intention recognition. J. Intell. Robot. Syst. 2006, 45, 31–52. [Google Scholar] [CrossRef]
  115. Yi, W.; Ballard, D. Recognizing behavior in hand-eye coordination patterns. Int. J. Humanoid Robot. 2009, 6, 337–359. [Google Scholar] [CrossRef] [PubMed]
  116. Malakoff, D. A Brief Guide to Bayes Theorem. Science 1999, 286, 1461–1461. [Google Scholar] [CrossRef]
  117. Singh, R.; Miller, T.; Newn, J.; Velloso, E.; Vetere, F.; Sonenberg, L. Combining gaze and AI planning for online human intention recognition. Artif. Intell. 2020, 284, 103275. [Google Scholar] [CrossRef]
  118. Chen, X.L.; Hou, W.J. Gaze-Based Interaction Intention Recognition in Virtual Reality. Electronics 2022, 11, 1647. [Google Scholar] [CrossRef]
  119. Newn, J.; Singh, R.; Velloso, E.; Vetere, F. Combining implicit gaze and AI for real-time intention projection. In Proceedings of the Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers (UbiComp/ISWC’19 Adjunct), London, UK, 9–13 September 2019; pp. 324–327. [Google Scholar] [CrossRef]
  120. Koochaki, F.; Najafizadeh, L. Predicting Intention Through Eye Gaze Patterns. In Proceedings of the 2018 IEEE Biomedical Circuits and Systems Conference (BioCAS), Cleveland, OH, USA, 17–19 October 2018; pp. 1–4. [Google Scholar] [CrossRef]
  121. Koochaki, F.; Najafizadeh, L. A Data-Driven Framework for Intention Prediction via Eye Movement with Applications to Assistive Systems. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 974–984. [Google Scholar] [CrossRef]
  122. Shi, L.; Copot, C.; Vanlanduit, S. GazeEMD: Detecting Visual Intention in Gaze-Based Human-Robot Interaction. Robotics 2021, 10, 68. [Google Scholar] [CrossRef]
  123. Dermy, O.; Charpillet, F.; Ivaldi, S. Multi-modal Intention Prediction with Probabilistic Movement Primitives. In Human Friendly Robotics; Ficuciello, F., Ruggiero, F., Finzi, A., Eds.; Springer: Cham, Switzerland, 2019; pp. 181–196. [Google Scholar]
  124. Pérez-Quiñones, M.A.; Sibert, J.L. A collaborative model of feedback in human-computer interaction. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, Vancouver, BC, Canada, 13–18 April 1996; pp. 316–323. [Google Scholar]
  125. aus der Wieschen, M.V.; Fischer, K.; Kukliński, K.; Jensen, L.C.; Savarimuthu, T.R. Multimodal Feedback in Human-Robot Interaction; IGI Global: Hershey, PA, USA, 2020; pp. 990–1017. [Google Scholar] [CrossRef]
  126. Kangas, J.; Rantala, J.; Majaranta, P.; Isokoski, P.; Raisamo, R. Haptic feedback to gaze events. In Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA’14), Safety Harbor, FL, USA, 26–28 March 2014; pp. 11–18. [Google Scholar] [CrossRef]
  127. Rantala, J.; Majaranta, P.; Kangas, J.; Isokoski, P.; Akkil, D.; Špakov, O.; Raisamo, R. Gaze Interaction With Vibrotactile Feedback: Review and Design Guidelines. Hum.–Comput. Interact. 2020, 35, 1–39. [Google Scholar] [CrossRef]
  128. Majaranta, P.; Isokoski, P.; Rantala, J.; Špakov, O.; Akkil, D.; Kangas, J.; Raisamo, R. Haptic feedback in eye typing. J. Eye Mov. Res. 2016, 9. [Google Scholar] [CrossRef]
  129. Sakamak, I.; Tavakoli, M.; Wiebe, S.; Adams, K. Integration of an Eye Gaze Interface and BCI with Biofeedback for Human-Robot Interaction. 2020. Available online: https://era.library.ualberta.ca/items/c00514a1-e810-4ddf-9e1b-af3a3d90c65a (accessed on 2 December 2024).
  130. Moraes, A.N.; Flynn, R.; Murray, N. Analysing Listener Behaviour Through Gaze Data and User Performance during a Sound Localisation Task in a VR Environment. In Proceedings of the 2022 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Singapore, 17–21 October 2022; pp. 485–490. [Google Scholar] [CrossRef]
  131. Canales, R.; Jörg, S. Performance Is Not Everything: Audio Feedback Preferred Over Visual Feedback for Grasping Task in Virtual Reality. In Proceedings of the 13th ACM SIGGRAPH Conference on Motion, Interaction and Games (MIG’20), North Charleston, SC, USA, 16–18 October 2020. [Google Scholar] [CrossRef]
  132. Staudte, M.; Koller, A.; Garoufi, K.; Crocker, M. Using listener gaze to augment speech generation in a virtual 3D environment. In Proceedings of the Annual Meeting of the Cognitive Science Society, Sapporo, Japan, 1–4 August 2012; Volume 34. [Google Scholar]
  133. Garoufi, K.; Staudte, M.; Koller, A.; Crocker, M. Exploiting Listener Gaze to Improve Situated Communication in Dynamic Virtual Environments. Cogn. Sci. 2016, 40, 1671–1703. [Google Scholar] [CrossRef]
  134. Zhang, Y.; Fernando, T.; Xiao, H.; Travis, A. Evaluation of Auditory and Visual Feedback on Task Performance in a Virtual Assembly Environment. PRESENCE Teleoper. Virtual Environ. 2006, 15, 613–626. [Google Scholar] [CrossRef]
  135. Kangas, J.; Špakov, O.; Isokoski, P.; Akkil, D.; Rantala, J.; Raisamo, R. Feedback for Smooth Pursuit Gaze Tracking Based Control. In Proceedings of the 7th Augmented Human International Conference 2016, Geneva, Switzerland, 25–27 February 2016. [Google Scholar] [CrossRef]
  136. Lankes, M.; Haslinger, A. Lost & Found: Gaze-based Player Guidance Feedback in Exploration Games. In Proceedings of the Extended Abstracts of the Annual Symposium on Computer-Human Interaction in Play Companion Extended Abstracts, Barcelona, Spain, 22–25 October 2019. [Google Scholar] [CrossRef]
  137. Ghosh, S.; Dhall, A.; Hayat, M.; Knibbe, J.; Ji, Q. Automatic Gaze Analysis: A Survey of Deep Learning Based Approaches. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 61–84. [Google Scholar] [CrossRef] [PubMed]
  138. Frid, E.; Moll, J.; Bresin, R.; Sallnäs Pysander, E.L. Haptic feedback combined with movement sonification using a friction sound improves task performance in a virtual throwing task. J. Multimodal User Interfaces 2019, 13, 279–290. [Google Scholar] [CrossRef]
  139. Cominelli, L.; Feri, F.; Garofalo, R.; Giannetti, C.; Meléndez-Jiménez, M.A.; Greco, A.; Nardelli, M.; Scilingo, E.P.; Kirchkamp, O. Promises and trust in human-robot interaction. Sci. Rep. 2021, 11, 9687. [Google Scholar] [CrossRef]
  140. Bao, Y.; Cheng, X.; de Vreede, T.; de Vreede, G.J. Investigating the relationship between AI and trust in human-AI collaboration. In Proceedings of the Hawaii International Conference on System Sciences, Kauai, HI, USA, 5 January 2021. [Google Scholar]
  141. David-John, B.; Hosfelt, D.; Butler, K.; Jain, E. A privacy-preserving approach to streaming eye-tracking data. IEEE Trans. Vis. Comput. Graph. 2021, 27, 2555–2565. [Google Scholar] [CrossRef] [PubMed]
  142. Kröger, J.L.; Lutz, O.H.M.; Müller, F. What Does Your Gaze Reveal About You? On the Privacy Implications of Eye Tracking. In Privacy and Identity Management. Data for Better Living: AI and Privacy: 14th IFIP WG 9.2, 9.6/11.7, 11.6/SIG 9.2.2 International Summer School, Windisch, Switzerland, 19–23 August 2019; Revised Selected Papers; Friedewald, M., Önen, M., Lievens, E., Krenn, S., Fricker, S., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 226–241. [Google Scholar] [CrossRef]
Figure 1. The process of interpreting gaze data involves extracting features such as fixations, saccades, and smooth pursuits, which can then be used for various applications. These applications range from gaming and training simulations to medical diagnostics, showcasing the versatility of gaze-based interaction. The data interpretation can be accomplished through various methods, including scanpath analysis, statistical approaches, and machine learning or deep learning techniques.
Figure 2. Examples of scanpath analyses. One of the first published scanpath analyses was by Yarbus [49]. An example can be seen in (a), which illustrates scanpaths of a Yarbus-like measurement by Geisler et al. [54], where the participants see the same images but the task differs: Task 1, indicate the age of the subjects; Task 2, remember the clothes the people are wearing; Task 3, estimate how long the visitor had been away from the family. This showed that the scanpath depends on the task. This knowledge can now be transferred to deep learning architectures, as shown in (b), a visualization of the analysis by Castner et al. [50], where the scanpath is reconstructed from the fixation data and processed with a VGG-16 network to compare the similarity of fixations.
Figure 3. The most intuitive way to select a target with the gaze is to use dwell time [82,83]. However, since this can trigger the Midas Touch problem, alternative solutions exist, such as using head gestures [84] or specific eye movements [85,86] to confirm the selection.
Figure 4. Possible solutions by Sidenmark et al. to various problems in gaze-based interaction. (a) visualizes the gaze-based interaction method of Sidenmark et al. [95] for avoiding Midas Touch: the selection is not based on gaze alone but also uses the head as confirmation. A target (square) is selected using the gaze point (red circle). The circle then expands, and the selection can be confirmed by moving the head direction point (green circle) into the gaze circle. (b) sketches the idea of Sidenmark et al. [97], an approach for selecting targets that lie in very similar directions from the user. The vergence angle can be used to decide which target the user wants to select, even if the combined gaze direction, shown by the orange dashed line, is the same. In A, the object of interest is close, so the vergence angle is large; in B, the target is far away, so the vergence angle is smaller.
Figure 5. Summary of areas of application for gaze-based intention prediction. These applications span across various domains, including human–agent interaction, immersive environments, assistive technology, and human–robot interaction.
Figure 6. Summary of feedback modalities and their strengths in gaze-based communication. Combining modalities can optimize user experience by enhancing awareness and improving interaction efficiency.
Figure 7. Diagram of a gaze-based communication system with AI. The process starts with the user providing input through modalities such as gaze or head movements, and scene information captured by sensors. The gaze-based interaction system processes this input to estimate which object the user intends to interact with. The intention prediction component then uses both the estimated interaction object and the raw data (modalities and scene information) to predict the user’s broader intention. The AI interprets this intention and provides feedback through various channels (visual, audio, haptic), creating a feedback loop that informs the user and enhances communication and interaction with the system.
Table 1. Overview of studies on gaze-based interaction techniques in virtual environments (VE), highlighting the use of eye, head, and hand movements across different tasks.
Publication | Eyes | Head | Hands | Task
[93]        | X    | X    |       | head vs. gaze selection
[94]        | X    | X    |       | head vs. gaze selection
[95]        | X    | X    |       | head-supported gaze selection
[96]        | X    | X    |       | probabilistic model based on head and gaze
[97]        | X    |      |       | vergence-based selection
[98]        | X    |      | X     | gaze vs. hand-based interaction in games
[99]        | X    |      | X     | gaze vs. hand-based aiming in games
[100]       | X    |      | X     | gaze vs. controller in baseball-throwing game
[101]       | X    |      |       | using gaze for common computer activities
