1. Introduction
Most current eye tracking methodologies use video to capture the position of the iris and/or reflected light sources (glints) [1]. As such, these methods can be affected by ambient light [2], which is particularly true for use cases such as augmented reality with eyeglasses. Other light-based methods such as scanning lasers [3], third Purkinje images [4] and directional light sensors [5] can likewise be affected. Speed can also be limited, especially in wearables, where operating a camera at high speed (on the order of 100 Hz or above) would imply high power consumption. At these speeds, camera-based sensors can capture fixations but not other parameters such as saccades, which have been implicated as markers of the schizophrenia spectrum in at-risk mental states [6] as well as of other neurological disorders [7]. Fast eye tracking is required for measuring saccades. Current devices capable of measuring saccades are designed for laboratory use and tend to lack portability [8]. The possibility of using ultrasound for eye tracking has been raised in a patent [9], and there exist studies that use eye tracking to assist ultrasound procedures [10]. However, to the best of our knowledge, there is no experimental study that empirically demonstrates the feasibility of gaze estimation using ultrasound sensors.
A recent paper explored the possibility of using non-contact ultrasound sensors to track fast eye movements [11]. The work focused on the development of a finite element simulation model to investigate the use of ultrasound time-of-flight data to track fast eye motions. The simulation model is based on a setup of four transducers positioned perpendicular to the cornea. Distances are measured with each transducer based on the time it takes to receive the reflection of its own signal. Given that the cornea protrudes, this time changes with the gaze angle. To implement this simulated setup in any glasses-form-factor device, the device would need to be precisely positioned relative to the eye. However, we are interested in applications for eye tracking in augmented reality (AR) and virtual reality (VR), where user-specific placement of the sensors is not possible (in AR/VR the eye tracking system is fixed and the position of the eye varies from user to user, which means alignment will vary).
It is also to be noted that the modeling in [11] was done in the absence of occlusions (such as eyelids). Eye occlusions are known to be problematic for eye tracking systems in general [12]. Furthermore, the authors of [11] chose to model standard 40 kHz transducers. While these would be advantageous in terms of minimizing attenuation in air, such a system may be subject to interference from range-finding applications (typically in the 40–70 kHz range). Common range-finding systems lack the resolution and short-distance sensing capabilities required for eye tracking (the typical sensing range is in meters, with a resolution of 1 cm). Another concern for our application of interest is size: devices would need to fit in glasses frames. Capacitive Micromachined Ultrasonic Transducers (CMUTs) operating at 500 kHz–2 MHz [13] provide the range, resolution and size suitable for use in VR and AR devices. This type of transducer has found numerous medical applications in both imaging and therapy [13], which are applications for contact ultrasound.
Here, we use CMUTs for remote sensing as airborne transmitters and receivers. In this mode, the impedance mismatch between air and tissue means that over 99 percent of the ultrasound signal is reflected by the eye surface [11]. Transducer size was therefore the primary driver of our choice of CMUTs for the proposed study; concerns related to test bench size and power consumption did not drive our investigations.
In order to systematically investigate the feasibility of near-field ultrasound sensing for eye tracking with an AR form-factor device, we first conducted our own finite-element-modeling study using acoustic rays for 1.7 MHz transducers configured on AR glasses. We compared directional and omnidirectional transmit and receive configurations for the sensors to determine where we would expect to see a meaningful signal around the glasses frame for a source placed near the glasses branch. We then built a series of tabletop test bench systems to (a) verify our ability to accurately measure distances in the appropriate range, (b) characterize the transducers, and (c) generate data to be used in a machine learning model to estimate gaze. As such, we focus on empirically testing the hypothesis that ultrasound sensors can be used for gaze estimation in the presence of occlusions. We note that in the context of our experiments, gaze is defined by the static orientation of the model eye on the goniometer. We demonstrate that ultrasound time-of-flight and amplitude signals can be leveraged to track gaze in such conditions. In particular, we train a regression model using gradient boosted decision trees to estimate the gaze vector given the set of ultrasound time-of-flight and amplitude signals captured by the CMUT receivers. The nonlinearities introduced by occlusion artifacts make the task of regressing gaze directly from the recorded signals non-trivial, and we believe that a nonlinear regression model trained on the collected data is best suited to extract the relevant signals for gaze estimation. Results show that the trained model produces a regression score of 90.2 ± 4.6% and a gaze RMSE error of 0.965 ± 0.178 degrees.
3. Results
In this section we present findings from our modeling study as well as experiments conducted using the three benchtop setups described in Section 2.2.
We begin by presenting our findings on the CMUT sensor characterization. Data collected using test bench setup 1 allowed us to investigate the decay characteristics of the ultrasound signal in air, see Figure 6A. As expected, the ultrasound signal decays exponentially as a function of distance, and an extrapolated fit shows it decaying towards zero. The distance axis shows the distance between the pair of transducers and the target (Figure 2A); the actual travel distance is twice this measurement. The range is similar to the distances expected for transducers mounted on eyeglasses frames, our use case scenario.
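For reference, the conversion from round-trip time of flight to transducer-to-target distance is straightforward; the short Python sketch below illustrates it, assuming a nominal speed of sound in air of 343 m/s (an illustrative value, not a calibrated one from our setup).

SPEED_OF_SOUND_AIR = 343.0  # m/s, nominal room-temperature value (assumption)

def target_distance_m(time_of_flight_s):
    """Transducer-to-target distance: the sound travels out and back, so halve the path."""
    return SPEED_OF_SOUND_AIR * time_of_flight_s / 2.0

# Example: a round trip of 175 microseconds corresponds to roughly 30 mm.
print(1000.0 * target_distance_m(175e-6))  # ~30.0 (mm)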
Data collected using test bench 2 (Figure 2B) allowed us to investigate whether the CMUT transducers exhibit directionality. Our findings are reported in Figure 6B. The CMUT transducers indeed exhibit directionality, with an emission cone of 10 degrees. This applies to the transducers in both transmit and receive mode.
Based on the above findings, we conclude that the strength of the ultrasound signal at the receiver CMUT transducer depends on two factors: distance and incident angle. As such, we believe that the amplitude of the ultrasound signal at the receiver contains relevant information for estimating gaze and, as shown below, our findings indeed support this claim.
Our modeling study explored two situations: an omnidirectional transducer and one that mimics the properties of our CMUTs, see Figure 7. In each case, 131,072 rays are released from a point source. The rationale for exploring the two situations is that, while our CMUTs fit our needs, single-crystal piezo transducers may provide a robust, inexpensive alternative. They are omnidirectional but can be turned into directional devices by adding baffles. In terms of size, they would be slightly larger (2.5 mm instead of 1 mm in our frequency range).
We implemented the sensor native curve by releasing rays with a uniform density distribution and assigning weight functions to the rays based on each ray's angle of emission and reception. The weight function of a ray is cos(min(α·(90/15), 90°)), where α is the angle between the ray and the transducer direction.
In the directional case we used a similar approach to account for the receiver native curve. We assigned a similar weight function to the acoustic rays that reach receivers based on the angle between the incoming acoustic rays and the sensor direction of each receiver. Therefore, each ray has two weight functions. One weight function is assigned initially when the ray is released, another weight function is assigned when the ray is detected by a receiver. The product of the two weight functions is applied. If the angle between a ray (that reaches a sensor) and the receiver direction is more than 15 degrees then the ray is not detected (its weight function is zero). If this angle is zero then the weight function is 1.
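The weighting scheme described above can be written compactly; the Python sketch below is our own illustration of it (function and variable names are ours, not from the simulation software), implementing the cos(min(α·(90/15), 90°)) weight with its 15-degree cutoff and the product of emission and reception weights.

import numpy as np

def directional_weight(alpha_deg):
    """Weight of a ray at angle alpha (degrees) from the transducer axis.

    cos(min(alpha * (90/15), 90 deg)): weight 1 on-axis, effectively zero
    beyond 15 degrees off-axis.
    """
    scaled_deg = np.minimum(alpha_deg * (90.0 / 15.0), 90.0)
    return np.cos(np.deg2rad(scaled_deg))

def ray_weight(emission_angle_deg, reception_angle_deg):
    """Combined weight: product of the emission and reception weight functions."""
    return directional_weight(emission_angle_deg) * directional_weight(reception_angle_deg)

# Example: emitted 5 degrees off-axis, received 12 degrees off-axis -> small but nonzero.
print(ray_weight(5.0, 12.0))
# Received 20 degrees off-axis -> weight is (numerically) zero; the ray is not detected.
print(ray_weight(5.0, 20.0))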
Figure 8 shows a comparison of the predicted signal at our sensor locations for directional and omnidirectional transducers. The left and right panels correspond to thirty-degree rotations of the eye to the left (towards the nose) or to the right. In the case of omnidirectional transducers, the differences between gazes are small; differences are more pronounced for directional transducers. Peaks are also better defined with directional transducers, and late peaks resulting from longer paths due to multiple reflections are minimized. It is to be noted that such late peaks would be ignored in our analysis, as we only use the time to peak and peak amplitude of the first peak detected in a given channel. With the same total number of rays (transducer power), receiver sensors in the directional configuration show higher signal strength than receiver sensors in the omnidirectional configuration. We ran the same models for a straight gaze as well as twenty-degree up/down rotations (data not shown) and obtained similar results. Taken together, these results indicate that directional transducers are better suited to resolve gaze.
Next we looked at where on the frame we might detect a signal, and why.
Figure 9, left, shows signal intensity around the frame. Areas in red have a higher chance of detecting rays reflected from the eye. Rays reflected off the glasses or skin are ignored. The center panel shows the path taken by the rays that reach receiver 6. Some of the rays arrive after multiple reflections from the skin and glasses. The right panel provides a detailed view of the direction of rays reaching receiver 6 (sphere). Sensor direction is shown with the solid black line. The majority of these rays will not be detected by the receiver due to the narrow angle of detection dictated by the receiver native curve. If we were using omnidirectional transducers, rays arriving after two or more reflections would broaden the signal or create multiple peaks. Directional transducers allow us to reject unwanted signals before they are counted.
We next report findings from training a GBRT model on data collected using the third test bench setup (see Figure 3). For each model eye position on the goniometer, for a fixed receiver transducer position (180 degrees), and for a set of 19 transmit transducer positions, we fire the ultrasound test signal 50 times at 2 kHz and record the raw receiver signal (see Figure 4, top row). In order to increase the strength of the ultrasound response at the receiver, we average 10 traces of the raw response signals at a time, effectively generating 5 averaged ultrasound response signals and acquiring data at 200 Hz. The averaged response signal is passed through a Butterworth bandpass filter and we extract two ultrasound signal features: the time of flight and the amplitude at peak (a), as explained in Section 2.3. In total, we generate 45 samples for each model eye position on the goniometer over the duration of the study. For the set of 36 model eye positions, we produce a total of 1620 data samples.
We train a GBRT model on these data samples, performing a 5-fold cross-validation study. The model performance is reported using an adjusted R² score [17] and the gaze RMSE error in degrees. The GBRT hyper-parameters, found via hyper-parameter search, that produced the best adjusted R² score under 5-fold cross-validation are reported in Table 2. We obtain a gaze RMSE error of 0.965 ± 0.178 degrees and a mean adjusted R² score of 90.2% with a standard deviation of 4.6, suggesting that almost 90% of the data fit the regression model. We also perform a similar analysis using a linear regression model, and the results are reported in Table 1. Nonlinear modeling of the problem through GBRT produced an RMSE improvement of ≈18% and a goodness-of-fit improvement of ≈5.7%, supporting our claim that the occlusions introduce nonlinearities in the ultrasound signals captured by the CMUT receivers that are best captured using a nonlinear regression model.
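For concreteness, the sketch below outlines this training and evaluation procedure, assuming the 1620 samples are arranged as a feature matrix X (time-of-flight and amplitude features per channel) and gaze targets y (horizontal and vertical angles in degrees). The hyper-parameters shown are placeholders, not the tuned values reported in Table 2.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.multioutput import MultiOutputRegressor

def adjusted_r2(r2, n_samples, n_features):
    # Adjusted R^2 penalizes the fit for the number of predictors used.
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

def evaluate_gbrt(X, y, n_estimators=200, max_depth=3, learning_rate=0.1):
    """5-fold cross-validation of a GBRT gaze regressor; returns score and RMSE statistics."""
    scores, rmses = [], []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = MultiOutputRegressor(GradientBoostingRegressor(
            n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate))
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(adjusted_r2(r2_score(y[test_idx], pred), len(test_idx), X.shape[1]))
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return np.mean(scores), np.std(scores), np.mean(rmses), np.std(rmses)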
Residuals analysis confirmed that the estimates obtained using the GBRT model are unbiased (data not shown). In Figure 10, we plot the fraction of GBRT-estimated gaze values that fall within an epsilon-ball of a given radius (degrees); in particular, it reports the fractions of estimated gaze values falling within epsilon-balls of radius 0.8 degrees and radius 2 degrees. Based on these findings, we conclude that using CMUT ultrasound sensors, we can expect a gaze resolution of up to 2 degrees.
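The curve in Figure 10 can be reproduced from the cross-validated predictions with a few lines of Python; in the sketch below, y_true and y_pred are assumed to hold the goniometer gaze angles and the corresponding GBRT estimates in degrees.

import numpy as np

def fraction_within_epsilon(y_true, y_pred, radii_deg):
    """Fraction of gaze estimates whose angular error falls within each epsilon radius."""
    errors = np.linalg.norm(np.asarray(y_pred) - np.asarray(y_true), axis=1)
    return np.array([np.mean(errors <= r) for r in radii_deg])

radii = np.linspace(0.0, 3.0, 31)  # epsilon-ball radii in degrees
# fractions = fraction_within_epsilon(y_true, y_pred, radii)
# Plotting fractions against radii yields a curve analogous to Figure 10.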
In Figure 11A,B, we show the feature importance for the GBRT models trained to estimate the model eye gaze coordinates for horizontal and vertical gaze. We can see that the top two features for both the horizontal and vertical gaze GBRT models are time-of-flight ultrasound features. We observe that, while the time-of-flight component of the ultrasound signal carries the dominant information for estimating gaze (a 95% contribution to the regression score), the amplitude signal is also an important contributor for the GBRT model to produce an adjusted R² score close to 90%. In order to test this observation, we trained one GBRT model using just the ultrasound time-of-flight feature and another GBRT model using just the ultrasound amplitude feature. The GBRT model trained using time-of-flight features produces an adjusted R² score of 85.4 ± 5.2, whereas the GBRT model trained using only the amplitude feature produces an adjusted R² score of 78.6 ± 8.2. In Figure 11C, we show the mean RMSE error (across all CV folds) for the GBRT model. The error is biased towards the lower half of vertical gaze, primarily resulting from occlusions.
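The per-axis feature importances and the single-feature ablations described above map directly onto the fitted estimators; the sketch below assumes the feature matrix columns are ordered as time-of-flight features followed by amplitude features (the column bookkeeping and names are ours), and reuses the evaluate_gbrt helper sketched earlier.

# model = MultiOutputRegressor(GradientBoostingRegressor(...)).fit(X, y)
def report_top_features(model, feature_names, top_k=2):
    """Print the top-k impurity-based feature importances per gaze axis."""
    for axis, estimator in zip(["horizontal", "vertical"], model.estimators_):
        ranked = sorted(zip(feature_names, estimator.feature_importances_),
                        key=lambda kv: kv[1], reverse=True)
        print(axis, ranked[:top_k])

# Ablation: retrain on a single feature group and compare adjusted R^2 scores.
# tof_cols = slice(0, n_channels)               # time-of-flight columns (assumed layout)
# amp_cols = slice(n_channels, 2 * n_channels)  # amplitude columns (assumed layout)
# evaluate_gbrt(X[:, tof_cols], y)   # time-of-flight only
# evaluate_gbrt(X[:, amp_cols], y)   # amplitude only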
4. Discussion
This study is the first experimental demonstration of the use of ultrasound sensors for gaze estimation. We show that ultrasonic transducers can effectively produce signals useful to resolve eye gaze, as defined by the static orientation of the model eye on a goniometer, within the range tested in both the up/down and left/right directions. This range reflects the full deflection of our goniometer. We plan on expanding the range in future studies.
Prior to embarking on our experiments with the bench-top setups, we conducted ray-tracing modeling. This modeling helped us refine our test bench design, procedures, and analysis. First, it pointed to the utility of directional over omnidirectional transducers. Second, it informed us of where, given a source location, we can expect a signal around the glasses frame. Finally, it identified the signals we need to measure in our bench-top experiments: the time to peak (indicative of distance traveled) and the amplitude. Due to attenuation in air, the amplitude decreases with travel distance. Our modeling indicated that amplitude also carries a signal based on the angle of incidence. This is further evidence for using directional instead of omnidirectional transducers.
Our GBRTs show that both amplitude and time of flight contribute to our ability to estimate gaze. This is a new finding, as previous modeling work dealt with time of flight alone. As mentioned in our modeling section, two factors contribute to amplitude: attenuation and the incident angle of the incoming sound. One way to compensate for attenuation is to use the time-gain correction built into our amplifier, increasing gain over time to compensate for the signal attenuation over longer distances. When we did this (data not shown), our model performed slightly worse. This indicates that attenuation plays a role in our ability to estimate gaze, and would favor the use of high frequency transducers.
For this proof of concept we chose to average ten individual traces prior to filtering the signal and extracting the time to peak and the peak amplitude. This reduces the eye tracking acquisition speed from a maximum of 2 kHz to 200 Hz, which may not be sufficient to track saccadic eye motion. While this study focused primarily on testing the hypothesis that ultrasound signals can be leveraged to estimate gaze, in future work we will explore avenues to investigate the use of ultrasound in tracking fast eye motion. Specifically, we plan on using a fast-moving model eye coupled with multiple receivers operating at 2 kHz. The GBRT models will be adapted so we can test the potential of ultrasound for fast eye tracking to resolve saccades.
We are interested in investigating the feasibility of using ultrasound sensors for eye tracking in virtual and augmented reality devices. In addition to sampling speed, power consumption is an important factor to consider. The transducers are very low power, in the milliwatt range. Our current system utilizes a high-speed A/D converter; this can be replaced with a low-power peak detection circuit. On the compute side, GBRTs are considered low compute, which is particularly true for run time on multi-core machines. Specifically, the run-time compute complexity for GBRT models is O(p·T/C), where p represents the number of input features, T is the number of regression trees, and C is the number of compute cores on a given machine. For T ≤ C, the run-time complexity for GBRT is on par with that of linear regression models, at O(p).
In summary, this study presents data-driven, proof-of-principle findings in support of the claim that ultrasound sensors can be used for gaze estimation.