1. Introduction
Humans receive information from the external environment through various sensory organs. Among these senses, vision plays a leading role and helps elicit or support the others. Most of the external information that humans acquire is received through vision [1].
Visual acuity, which indicates how well an external object can be seen at a certain distance, does not by itself fully describe human visual function [2,3,4]. In addition to visual acuity, visual functions include convergence, divergence, accommodation, color vision, stereopsis, and dynamic acuity.
Examinations of refraction and visual function are broadly divided into subjective and objective examinations. A subjective examination relies on the examinee’s feedback: the examiner asks questions as the examination proceeds, and the examinee answers them [5,6,7]. In actual clinical practice, it is treated as an important examination because it is necessary to consider how clearly the examinee wants to see and how the correction feels to wear, including dizziness and comfort. Conversely, an objective examination uses examination equipment to take measurements without the examinee’s feedback. It requires high proficiency with the examination and the equipment because the results are determined by the examiner’s judgment [8,9,10].
In clinical practice, accurate examination values are derived by performing subjective and objective examinations in parallel rather than relying on only one method [11]. When a patient who needs optical treatment has been using spectacles or contact lenses with an erroneous refractive power, a subjective examination must be performed at the last stage of the examination, because the patient may not be able to adapt if the power is changed to the correct value all at once.
In this study, we used a gaze-tracking-based kiosk to conduct near-point-of-convergence (npc), visual fixation, and 4-prism diopter (Δ) base-out (BO) examinations. The examinations were conducted by observing the micromovements of the eyes, to show that they can be performed as objective visual function examinations (proof of symmetrical characteristics with the existing examination method) and to verify their effectiveness in comparison with the existing examination method. The npc, visual fixation, and 4 Δ BO items were selected to confirm the eye-movement tracking performance of the gaze-tracking algorithm used in this paper.
Through the experiments in this paper, we aim to show three things: first, that a subjective test can be performed as an objective test; second, that the system can serve as an aid to the examiner in micro-eye-movement tracking tests that require examination skill; and third, that, given the preliminary nature of the test, examinees can perform an objective test themselves without an examiner.
3. Real-Time Gaze-Tracking Equipment Based on Eye Feature Points
3.1. Overview of Real-Time Eye-Tracking Equipment Based on Eye Feature Points
Eye tracking is one of the most important fields of computer vision because it provides essential information for identifying the visual elements that people are interested in. In this study, the positions of eye feature points are first predicted using HRNet [19], which performs excellently at extracting body feature points, and support vector regression (SVR) [20] is then used to calculate the gaze vector from the feature point distribution. UnityEyes, a synthetic eye image dataset close to real images, was used to train the artificial neural network and the SVR model [21]. Finally, the eye aspect ratio (EAR) [22] was introduced to reduce the eye-tracking error when the eyes are closed. When the measured EAR is smaller than a set value, the gaze vector and iris center coordinates are not output, which reduces false positives and improves eye-tracking accuracy.
The overall flowchart of the real-time gaze-tracking method based on eye feature points is shown in Figure 4.
First, the black-and-white eye frames received from the IR eye camera are input to the artificial neural network HRNet. HRNet extracts feature maps for eye feature point prediction through the effective fusion of the bottom-up and top-down paths. In each channel of the extracted feature map, the position of an eye feature point is predicted as a heat map, and the coordinates of the predicted feature points are used to predict the gaze vector through SVR. Finally, the eye-tracking system calculates the EAR value before outputting the gaze vector and iris center coordinates, and decides whether to output the prediction result by comparing the EAR with a preset value.
3.2. Feature Map Extraction Using HRNet
The artificial neural network plays a key role in the overall gaze-tracking method because it generates a feature map that includes information on the coordinates of the eye feature points. The performance of the eye-tracking method largely depends on how the artificial neural network extracts and fuses its feature maps. In the proposed method, HRNet is adopted as the artificial neural network to increase the accuracy and improve the inference speed. Existing deep-learning-based eye-tracking methods, such as the hourglass network [23] and SimpleBaseline [24], generate the final feature map using the encoder-decoder method. These methods have the advantage of effectively using a wide range of local information by focusing on a top-down path that decodes a low-resolution feature map into a high-resolution one. However, because the fusion process uses a low-resolution feature map that is relatively unprocessed compared with the high-resolution feature map, these methods have structural performance limitations.
HRNet fuses feature maps by focusing on both the bottom-up and top-down paths, thus increasing the accuracy and reducing the number of floating-point operations (FLOPs) by half. HRNet generates a low-resolution feature map at every stage while applying a 1 × 1 convolution to the high-resolution feature maps and maintaining them to enhance features within a small area. HRNet consists of a total of four sequential stages. At each stage, a feature map with 1/2 the size of the smallest-resolution feature map is created, and the residual block is then applied four times. Subsequently, as the last process of each stage, the exchange process is performed to fuse the feature maps of all resolutions in a fully connected manner. The exchange process at each stage consists of several repetitions of the exchange unit: one in the second stage, four in the third stage, and three in the fourth stage. In the exchange process, the bottom-up path uses a 1 × 1 convolution and nearest-neighbor interpolation for upsampling, and the top-down path uses a 3 × 3 convolution with a stride of two for downsampling. In the parallel path, a 1 × 1 convolution is used to maintain the resolution.
Figure 5 shows the HRNet structure applied to the gaze-tracking method. The residual block is omitted in this figure. HRNet-W32, a lightweight model, was used for real-time operation, and the numbers of channels from the highest-resolution to the lowest-resolution feature map were set to 32, 64, 128, and 256, respectively. The highest-resolution feature map is output, and the number of channels in the final feature map is set to 48, which is the number of eye feature points.
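As a rough illustration of the multi-resolution layout described above, the following sketch lists the feature-map shape of each branch. It assumes that each additional branch halves the spatial size and doubles the channel count, starting from 32 channels. The function name `branch_shapes` and the 48 × 48 example size are hypothetical and are not values taken from this study.

```python
def branch_shapes(height, width, base_channels=32, num_branches=4):
    """Return (channels, height, width) for each parallel branch.

    Assumes every additional branch halves the spatial resolution and
    doubles the channel count, matching the 32/64/128/256 layout.
    """
    shapes = []
    for i in range(num_branches):
        scale = 2 ** i
        shapes.append((base_channels * scale, height // scale, width // scale))
    return shapes


if __name__ == "__main__":
    # 48 x 48 is an illustrative size only, not a value from the paper.
    for channels, h, w in branch_shapes(48, 48):
        print(f"{channels:>3} channels at {h} x {w}")
```

Only the highest-resolution branch feeds the final 48-channel output, so the lower-resolution branches serve as context that is fused back during the exchange process.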
3.3. Prediction of Eye Features
Each channel of the output feature map extracted from HRNet represents a heat map of the location of one eye feature point. Of the 48 channels, 16 correspond to the eye edge points and 30 to the iris edge points. The remaining two channels represent the center of the eye and the center of the iris. Because the point with the highest value in each heat map is the most probable location of the eye feature point, its coordinates are output as the predicted value. In the learning process of the artificial neural network, a two-dimensional (2D) Gaussian distribution centered on the eye feature point with a standard deviation of 1 is used as the ground truth in each channel. An L2 loss (mean square error) is applied between the output heat map and the ground truth. Equation (1) expresses the overall loss function for the heat map for predicting the eye feature points:
$$L = \frac{1}{NK} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \hat{H}_n(k) - H_n(k) \right)^2 \quad (1)$$
In Equation (1), $N$ is the number of eye feature points, $K$ is the number of pixels in the output feature map, $\hat{H}_n$ is the predicted heat map, and $H_n$ is the ground-truth heat map.
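The heat-map steps described above can be sketched in a few lines of plain Python: building a Gaussian ground-truth map, reading the prediction off as the highest-valued pixel, and computing a mean-squared-error loss. This is a minimal sketch and not the authors' implementation; the function names are hypothetical, and averaging over all channels and pixels is an assumed normalization.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Ground-truth heat map: 2D Gaussian centred on the feature point (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def heatmap_argmax(heatmap):
    """Return (x, y) of the highest-valued pixel, i.e. the predicted point."""
    best_value, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


def mse_loss(predicted, target):
    """Mean squared error over all channels and pixels (assumed normalization)."""
    total, count = 0.0, 0
    for pred_map, true_map in zip(predicted, target):
        for pred_row, true_row in zip(pred_map, true_map):
            for p, t in zip(pred_row, true_row):
                total += (p - t) ** 2
                count += 1
    return total / count


if __name__ == "__main__":
    target = [gaussian_heatmap(8, 8, cx=3, cy=5)]
    print(heatmap_argmax(target[0]))
    print(mse_loss(target, target))
```

In practice one channel would exist per feature point, so `predicted` and `target` would each hold 48 maps.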
3.4. Gaze Vector Prediction Based on Eye Feature Points
A gaze vector can be predicted with a feature-based method, which uses the distribution of eye feature point positions; a model-based method, which fits an eye model to these feature points; or an appearance-based method, which uses the eye image directly. In this study, the feature map is extracted using an artificial neural network, as in the appearance-based method, and the gaze is then traced by applying the feature-based method to the predicted eye feature coordinates. Compared with other methods, the feature-based method is less affected by changes of user and performs well even when trained on a small amount of data. Conversely, because the positions of the eye feature points play a decisive role in the prediction of the gaze vector, its accuracy depends considerably on the result of the eye feature point detection. The feature-based method applied in this study is as follows. First, the eye feature point coordinates predicted through the heat map are centered on the eye center coordinates and normalized by the eye radius (the distance between the eye center coordinates and the farthest eye edge point). The normalized coordinates are then input to an SVR model, which predicts a three-dimensional gaze vector. In the training process of the SVR model, the pitch ($\theta$) and yaw ($\phi$) of the eyeball are used as the ground truth.
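The normalization step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `normalize_feature_points` is hypothetical, coordinates are plain (x, y) tuples, and the regression step itself is omitted.

```python
import math


def normalize_feature_points(points, eye_center, edge_points):
    """Centre points on the eye centre and scale them by the eye radius.

    The radius is the distance from the eye centre to the farthest
    eye edge point, as described in the text.
    """
    cx, cy = eye_center
    radius = max(math.hypot(x - cx, y - cy) for x, y in edge_points)
    if radius == 0:
        raise ValueError("eye radius is zero; edge points coincide with the centre")
    return [((x - cx) / radius, (y - cy) / radius) for x, y in points]


if __name__ == "__main__":
    # Illustrative coordinates only.
    edge = [(10.0, 20.0), (30.0, 20.0), (20.0, 15.0), (20.0, 25.0)]
    print(normalize_feature_points(edge, (20.0, 20.0), edge))
```

Because the output is expressed relative to the eye center and radius, it does not change when the eye appears larger, smaller, or shifted in the frame, which may be part of why the feature-based approach is less sensitive to user changes.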
3.5. Reduced False Positives during Eye Movements
The artificial neural network model proposed in this study predicts the positions of the set number of eye feature points regardless of whether they are actually present in the image. Therefore, even when the user closes his/her eyes, a false positive may be generated in tracking the user’s eyes. To solve this problem, the eye aspect ratio (EAR) [22] was introduced. The EAR measures the degree of eye closure as the ratio of the distance between the vertical end points of the eye’s edge to the distance between the horizontal end points.
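Figure 6 shows the horizontal ($p_{h1}$, $p_{h2}$) and vertical ($p_{v1}$, $p_{v2}$) endpoints of the eye’s edge, and the distance between the two endpoints in each pair. Equation (2) shows the formula for calculating the EAR.
$$\mathrm{EAR} = \frac{\lVert p_{v1} - p_{v2} \rVert}{\lVert p_{h1} - p_{h2} \rVert} \quad (2)$$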
When the eyes are fully opened, the EAR varies considerably from user to user. Therefore, in this eye-tracking system, the user’s EAR values are measured for 2 s when the program starts, and half of their median is set as the comparison value K. When the measured EAR is larger than K, the iris center coordinates and gaze vector are output. If the measured EAR is smaller than K, the eyes are judged to be more than half closed, and the result is not output.
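The per-user thresholding described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the EAR is taken as the vertical opening divided by the horizontal width (so that a smaller value means a more closed eye), and the calibration samples are assumed to have been collected during the initial 2 s window.

```python
import math
from statistics import median


def eye_aspect_ratio(left, right, top, bottom):
    """Vertical eye opening divided by horizontal eye width."""
    horizontal = math.dist(left, right)
    vertical = math.dist(top, bottom)
    return vertical / horizontal


def calibrate_threshold(ear_samples):
    """Comparison value K: half the median EAR from the calibration window."""
    return 0.5 * median(ear_samples)


def should_output(ear, threshold):
    """Output gaze results only when the EAR exceeds the comparison value."""
    return ear > threshold


if __name__ == "__main__":
    # Illustrative calibration samples only.
    k = calibrate_threshold([0.30, 0.32, 0.28])
    print(should_output(0.29, k), should_output(0.10, k))
```

Using the median rather than the mean keeps a blink during the calibration window from pulling the threshold down.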
3.6. Dataset and Data Preprocessing
Synthetic eye images created with UnityEyes were used as the dataset for training the artificial neural network for gaze tracking. UnityEyes is a three-dimensional (3D) eye model program built on the Unity software, and it is widely used as an alternative to real datasets, which lack sufficient eye image data and high-quality ground truth.
Figure 7 shows an example of a synthetic eye image created with UnityEyes and the coordinates of the eye feature points.
In total, there were 53 eye feature coordinates: 16 eye edge, seven caruncle, and 30 iris edge coordinates. In this study, the caruncle coordinates were excluded and only the eye and iris edge coordinates were used. In addition, the eyeball center and iris center coordinates were calculated as the averages of the edge coordinates and added. These coordinates were subsequently used as the ground truth of the heat maps of the eye feature point positions. The synthetic eye images were created with a size of 640 × 480, converted to black-and-white images, and cropped around the eye center coordinates according to the resolution of the infrared (IR) eye camera that was used.
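The assembly of the 48 training points from the 53 dataset coordinates can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes that the eye center averages the eye edge points while the iris center averages the iris edge points, which the text does not spell out.

```python
def centroid(points):
    """Average of a list of (x, y) coordinates."""
    n = len(points)
    if n == 0:
        raise ValueError("at least one point is required")
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def build_ground_truth_points(eye_edge, caruncle, iris_edge):
    """Assemble the feature points used for training.

    Caruncle points are dropped and the two centre points are appended.
    Assumption: each centre is the average of its own edge points.
    """
    del caruncle  # provided by the dataset but not used
    eye_center = centroid(eye_edge)
    iris_center = centroid(iris_edge)
    return list(eye_edge) + list(iris_edge) + [eye_center, iris_center]


if __name__ == "__main__":
    # Placeholder coordinates, only to show the resulting point count.
    eye_edge = [(float(i), 0.0) for i in range(16)]
    caruncle = [(0.0, float(i)) for i in range(7)]
    iris_edge = [(float(i), 1.0) for i in range(30)]
    print(len(build_ground_truth_points(eye_edge, caruncle, iris_edge)))
```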
Figure 8 shows the eye feature point detection results and gaze vector prediction of the eye-tracking system for an actual eye image. The 16 eye edge points are shown in red, the 30 iris edge points in blue, and the two center coordinates as white and green dots, respectively. In addition, the 3D gaze vector is visualized as a 2D vector projected onto the image plane and is indicated in yellow.
The 3D gaze vector provided by UnityEyes is converted into the pitch $\theta$ and yaw $\phi$, which are used as the ground truth of the SVR. Figure 9 shows the gaze vector and the angles $\theta$ and $\phi$ in the 3D coordinate system, and Equation (3) expresses mathematically the angle conversion of the gaze vector, that is, the transformation of the 3D gaze vector into $\theta$ and $\phi$.
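For readers who want to experiment with this kind of conversion, the sketch below converts a 3D gaze vector to pitch and yaw angles and back. The axis convention is an assumption made for illustration (x to the right, y up, z along the viewing direction) and may differ from the one used in Figure 9 and Equation (3); the function names are hypothetical.

```python
import math


def vector_to_pitch_yaw(gaze):
    """Convert a 3D gaze vector to (pitch, yaw) in radians.

    Assumed convention (not taken from the paper): pitch is the elevation
    above the x-z plane and yaw is the rotation about the y axis.
    """
    x, y, z = gaze
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0:
        raise ValueError("gaze vector must be non-zero")
    x, y, z = x / norm, y / norm, z / norm
    pitch = math.asin(max(-1.0, min(1.0, y)))  # clamp against rounding error
    yaw = math.atan2(x, z)
    return pitch, yaw


def pitch_yaw_to_vector(pitch, yaw):
    """Inverse of vector_to_pitch_yaw under the same assumed convention."""
    return (
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
        math.cos(pitch) * math.cos(yaw),
    )


if __name__ == "__main__":
    print(vector_to_pitch_yaw(pitch_yaw_to_vector(0.2, -0.1)))
```

Regressing two angles instead of three vector components removes the redundant unit-length constraint from the regression target.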
6. Discussion
Most of the information that humans receive from the external environment is visual. To maintain good visual acuity, adults over 20 years of age are advised to have their visual acuity and visual function checked regularly, about once a year. To this end, clinical refraction and visual function examinations are performed, with subjective and objective examinations conducted concurrently to derive accurate examination values. In general, objective examinations demand greater examination skill than subjective examinations.
In this study, a kiosk-type objective visual function examination system was implemented using real-time gaze-tracking equipment based on eye feature points. For the evaluation of its effectiveness, a data comparison analysis was performed between the existing examination method and the kiosk-type examination method for npc, visual fixation, and 4 Δ BO examinations.
The kiosk-type examination is performed in the same way as the existing examination method, but the examination results are determined by detecting the movement of the 30 feature points at the edge of the iris. Because the system tracks the movement of the eye captured through the webcam, the eye is difficult to see in a relatively dark lighting environment, and the accuracy is then poor. Therefore, the same lighting environment as in the existing examination method is needed. Because the gaze-tracking model takes a small input resolution of 192 × 192, any RGB camera with a resolution of 192 × 192 or more can be used, so the camera does not constrain the operating specifications. However, since the model was trained only on images of the eye itself, the performance tends to fall when the spectacle frame is visible in the image during the examination. In the future, the accuracy of the results for examinees wearing spectacles needs to be increased by adding training images that include spectacle frames.
One of the main capabilities of the gaze-tracking algorithm used in this paper is micro-eye-movement tracking. Examinees with deviations or abnormal eye movements were excluded from the experiment because it was judged that their large eye movements would prevent micro-eye movements from being examined or traced. Because such examinees show large eye movements during examination, examining them with the kiosk would be easier.
In the clinical protocol, visual function was first measured with the existing examination method, with the items examined in order. Because the examinees had by then experienced the measurement environment, the subsequent kiosk-type examinations were performed in random order. An order effect was prevented by inserting a 5-min break after the existing examination and randomizing the kiosk-type examinations.
In the npc examination, the kiosk-type examination method showed a slightly shorter near point than the existing examination method, but no statistically significant difference was observed. The npc value presented in the results is the average value; for some examinees, the npc measured in the kiosk examination was longer. During the npc examination, the adduction of the two eyes increased as the target approached, and one of the two eyes abducted when the near point was reached. Even with the existing examination method, this abduction was sometimes used to make an objective judgment, but the judgment was difficult because it involved micro-eye movements. Thus, the npc examination is generally performed as a subjective examination. With the kiosk-type examination method, it was judged that an objective examination could be performed because, through the detection of feature points on the edge of the iris, the abduction was easier to check with the naked eye than with the existing method.
In the visual fixation examination, the examinees showed the same results with the existing examination method and the kiosk-type examination method. Visual fixation is an essential skill for a comfortable and stable visual life. In reading ability in particular, it is considered an important visual function along with micro-eye springing and saccadic, pursuit, and vestibulo-ocular movements. If the visual fixation function declines, it becomes difficult to maintain fixation, and subjective symptoms are reported in tasks that require focusing for a long time, including reading.
For the human eye to see clearly and comfortably for a long time, the balance between the accommodation function to make the focus clear and the disjunctive movement (vergence) to make it into one image must be well maintained. At the same time, the visual fixation function that keeps the micromovement of the eye stable is important, and there is a training program to improve and maintain this function.
In the 4 Δ BO examination, as in the visual fixation examination, the same results were obtained with the existing and the kiosk-type examination methods. The 4 Δ BO examination is used to determine the presence or absence of a suppression area and a microscopic central scotoma. Because it requires the micromovement of the eye to be judged, it demands a proficient examiner. If there is a suppression area or microscopic scotoma in the eye, monocular vision and stereoacuity tend to decrease, and symptoms of monofixation syndrome [25,26] may appear due to small-angle strabismus [27,28]. Monocular fixation reduces stereopsis. In ocular physiology, when suppression occurs, the non-suppressed eye is weakened to make the suppressed eye focus so that binocular vision is preserved as much as possible. In the binocular rivalry that occurs in the brain, this increases the neural transmission power of the suppressed eye.
7. Conclusions
In clinical settings, such as ophthalmic clinics and opticians, it is very important to derive accurate examination values by performing subjective and objective examinations in parallel. This means matching the symmetric characteristics between the two examination methods. Even if the examination value is derived from accurate objective examinations, the prescription may fail owing to various factors if it is given by considering only the objective examination value without a subjective examination of the patient.
In this study, we demonstrated that visual function can be examined objectively in the npc, visual fixation, and 4 Δ BO examinations using the gaze-tracking-based kiosk-type examination method. The method’s effectiveness was verified based on comparisons with the existing examination method (derivation and proof of symmetric properties). The results show that the kiosk-type examination proposed in this study can easily be conducted in parallel with the existing method. In the future, a model can be developed with which examinees can conveniently conduct examinations themselves at clinical centers, such as ophthalmic clinics and opticians. In addition, if a clinical center installs the kiosk-type visual function examination system proposed in this paper, users can easily and conveniently perform a preliminary examination and check approximate visual function information before the main examination. Because subjective and objective examinations can be performed together and the accuracy of examination results can thereby be improved, many industrial uses are expected.