1. Introduction
Eye movement data analysis holds significant promise for uncovering insights into brain disorders and pathologies. Abnormalities in eye movements often serve as valuable markers of neurological conditions, making the examination of eye movement patterns crucial for early detection and intervention strategies. In daily life, we make roughly 150,000 eye movements per day, on the order of three movements per second during waking hours. This abundance of data underscores the potential of eye movement analysis for understanding underlying neurological conditions.
While statistical frameworks [1,2,3,4] and machine learning algorithms [5,6,7,8,9,10] have been employed to analyze eye movement data, recent advances in deep learning [11,12,13,14,15,16,17,18] have shown great potential for automating the analysis process. These approaches not only automate feature extraction but also enhance the resilience of learning algorithms to noise, thereby maximizing the information extracted from the data. However, training such algorithms requires annotated data, and the annotation requirement can drastically increase the difficulty of dataset construction. In response to this challenge, alternative approaches such as unsupervised learning [19] and self-supervised representation learning [20,21] are being explored.
However, two limitations arise, namely sample size and model size. While their effect may be modest when training on a research dataset, clinical datasets, which represent the real-world use case, involve a more complex screening task owing to greater input and output (pathology class) variability. For instance, clinical data exacerbate challenges such as a low signal-to-noise ratio and the inherent variability stemming from population diversity and data recording protocols. The high variability of the target class distribution poses an additional challenge. As a result, small models, with their limited expressivity, tend to underfit when confronted with these challenges. Notably, there is a gap in the literature concerning the evaluation of large architectures on pathology screening tasks from eye movement time series trained with deep learning algorithms, including self-supervised learning methods. In the absence of a large dataset, highly expressive models are prone to memorizing the training set without necessarily generalizing. This issue is particularly acute for Transformers, given their high expressivity and the absence of the weight sharing inherent in Convolutional Neural Network (CNN) architectures.
On the other hand, constructing a large annotated clinical dataset can be challenging, especially for rare illnesses. One approach is to use self-supervised learning (SSL) algorithms. SSL involves training the model on a non-annotated dataset to learn a pretext task that compresses the input space into a low-dimensional representation. The model weights are then fine-tuned on a smaller annotated dataset for the downstream task (the main task). A possible choice of pretext task is masked modeling, which was introduced with Bidirectional Encoder Representations from Transformers (BERT) [22] and has since been extensively explored in different fields, including natural language processing (NLP) [23,24,25] and computer vision [26,27,28,29].
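As a rough illustration of this two-stage procedure, the sketch below pretrains a toy encoder–decoder on a masked reconstruction pretext task using non-annotated signals and then fine-tunes the encoder with a classification head on a smaller annotated set. All module choices, shapes, masking ratio, and learning rates are illustrative placeholders, not the configuration used in this study.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the encoder and decoder; the real architecture differs.
encoder = nn.GRU(input_size=2, hidden_size=64, batch_first=True)
decoder = nn.Linear(64, 2)
unlabeled = torch.randn(32, 512, 2)  # (batch, time, channels): non-annotated signals

# Stage 1: pretext task -- reconstruct the hidden time steps from the visible ones.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
mask = torch.rand(32, 512, 1) < 0.75          # hide 75% of the time steps
hidden, _ = encoder(unlabeled * ~mask)        # the encoder only sees the visible part
recon = decoder(hidden)
loss = ((recon - unlabeled) ** 2)[mask.expand_as(unlabeled)].mean()
loss.backward(); opt.step()

# Stage 2: fine-tune the pretrained encoder on a small annotated dataset.
head = nn.Linear(64, 8)                                    # e.g., 8 pathology groups
labeled, labels = torch.randn(16, 512, 2), torch.randint(0, 8, (16,))
opt2 = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
hidden, _ = encoder(labeled)
loss2 = nn.functional.cross_entropy(head(hidden.mean(dim=1)), labels)
loss2.backward(); opt2.step()
```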
Our approach is inspired by the masked autoencoder of He et al. [29], although we employ a different pretext task tailored specifically to stimulus-driven eye movement time series data. Furthermore, we refine the masking heuristic to suit time series data and the characteristics of an eye movement position signal. A fundamental distinction between image and time series data lies in their information density [30]: in time series, masked regions can often be recovered through simple interpolation. Additionally, the temporal dependency inherent to time series implies a sequential ordering of data points, in contrast to the two-dimensional spatial structure of images. Moreover, in eye movement time series, local failures of the eye tracking system produce sparse regions with reduced sampling rates, and errors in the eye tracking predictions introduce noise into the input space, leading to a low signal-to-noise ratio.
To address these challenges, we adapt the masking process to simulate eye tracking failures and introduce noise corruption to emulate inaccuracies in eye tracking predictions. Our proposed masking heuristic generalizes patch-based and random masking strategies, incorporating variable masking densities during training to account for different scenarios.
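One way such a heuristic could be written is sketched below (NumPy; the function name, parameters, and values are hypothetical and chosen only to illustrate the idea). Sampling a variable number of contiguous segments of variable length recovers patch-based masking when segments are long and few, and random masking when they are short and numerous.

```python
import numpy as np

def sample_mask(n_steps, mask_ratio, mean_len, rng=None):
    """Boolean mask over time steps; True marks hidden samples (simulated tracker failure).

    mask_ratio sets the overall masking density, mean_len the expected segment length:
    long, few segments approximate patch masking; short, many segments approximate
    random masking.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros(n_steps, dtype=bool)
    target = int(mask_ratio * n_steps)
    while mask.sum() < target:
        length = max(1, int(rng.exponential(mean_len)))
        start = rng.integers(0, n_steps)
        mask[start:start + length] = True
    return mask

# Vary the masking density across training iterations, e.g., from 50% to 87.5%.
for ratio in (0.5, 0.75, 0.875):
    m = sample_mask(n_steps=1024, mask_ratio=ratio, mean_len=32)
    print(ratio, round(float(m.mean()), 3))
```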
Furthermore, we introduce a corruption heuristic tailored to our domain to increase the difficulty of the pretext task. This encourages the model to capture the mutual information between the different time series and the learned representations by maximizing a lower bound on their mutual information [31,32]. Additionally, we refine our noise modeling to better replicate the noise present in our data.
Finally, we feed the target signal into the decoder to mitigate the effects of hard masking and of the temporal dependencies in our data. Rather than a pure patch-filling task, this setup aims to capture the characteristics of both the pilot (brain) and the motor (eye) by analyzing the visible patches relative to the corresponding stimulus signal, thereby recovering the missing regions given the stimulus signal.
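The sketch below illustrates this conditioning in simplified form (PyTorch; the module, dimensions, and token layout are assumptions for illustration, not the exact architecture of this work): the embedded stimulus signal is concatenated with the encoder output of the visible patches before decoding back to a position signal.

```python
import torch
import torch.nn as nn

class StimulusConditionedDecoder(nn.Module):
    """Toy decoder that reconstructs the eye position signal from the embedding of the
    visible patches together with the stimulus (target) signal."""

    def __init__(self, embed_dim=64, stim_channels=1, out_channels=2):
        super().__init__()
        self.stim_proj = nn.Linear(stim_channels, embed_dim)  # embed the target signal
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(embed_dim, out_channels)

    def forward(self, visible_tokens, stimulus):
        # visible_tokens: (batch, n_visible, embed_dim) produced by the encoder
        # stimulus:       (batch, n_steps, stim_channels) target signal
        tokens = torch.cat([visible_tokens, self.stim_proj(stimulus)], dim=1)
        decoded = self.blocks(tokens)
        # Map the positions aligned with the stimulus back to a position signal.
        return self.out(decoded[:, -stimulus.shape[1]:, :])

decoder = StimulusConditionedDecoder()
recon = decoder(torch.randn(4, 16, 64), torch.randn(4, 128, 1))
print(recon.shape)  # torch.Size([4, 128, 2])
```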
As a result of these adjustments, we demonstrate that it is possible to reconstruct eye movement time series using only up to 12.5% of the initial eye movement signal, combined with the REMOBI target signal. Our study makes the following main contributions:
We propose an extension of the masked autoencoder (MAE) for modeling eye movement time series, aiming to learn meaningful high-dimensional representations through self-supervised learning on eye movement data.
We introduce a novel pretext task tailored to stimulus-driven eye movement analysis. First, we introduce a masking heuristic. Furthermore, we condition the reconstruction task on the stimulus signal, enabling the model to capture eye movement characteristics and use them to reconstruct the initial signal with up to 87.5% of the eye movement signal masked. Finally, we incorporate a noise injection heuristic to increase the robustness of the trained model by emulating the intrinsic noise of the eye movement position signal.
We compare our method with two different unsupervised learning methods, as well as with supervised learning.
We explore using the high-dimensional learned representations to screen eight groups of pathologies on four datasets with different characteristics.
We demonstrate the feasibility of reconstructing eye movement time series data from only a small portion, up to 12.5%, of the original signal.
Figure 1 offers a visual summary of the exploration presented in this study.
2. Related Work
Autoencoder: This is an unsupervised method for learning a low-dimensional representation [33,34,35] comprising two models, an encoder and a decoder. The encoder performs feature selection by compressing the input into a low-dimensional representation; the decoder then learns to reconstruct the original input features from the encoder output. The algorithm is trained by minimizing the mean-squared reconstruction error. Another variant is the variational autoencoder [36,37], which introduces a probabilistic perspective by regularizing the latent space under the assumption of a conditional Gaussian distribution in that space. As a result, the encoder predicts a mean and a standard deviation for each Gaussian component conditioned on the input, while the decoder samples from this learned distribution to reconstruct the initial input.
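A minimal, generic autoencoder in this spirit is sketched below (PyTorch; the dimensions are arbitrary placeholders, and the variational variant is omitted for brevity):

```python
import torch
import torch.nn as nn

# Plain autoencoder: compress, then reconstruct, trained with mean-squared error.
encoder = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 256))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(32, 256)                 # a batch of flattened input segments
z = encoder(x)                           # low-dimensional representation
loss = nn.functional.mse_loss(decoder(z), x)
loss.backward()
opt.step()
```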
Denoising autoencoder: This is a special form of autoencoder that aims to increase the robustness of the learned representation by training the model to remove noise from the input [38,39]. Each input datum is corrupted before being passed to the encoder, and the autoencoder then minimizes the reconstruction error between the uncorrupted input and the reconstructed output. We follow the same principle: to mimic the inherent noise produced by eye tracking system errors, we add noise proportional to the signal amplitude, encouraging the model to exploit the correlations between the different multivariate components, as well as the overall response of the eye signal relative to the target signal, in order to filter out the noise.
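A sketch of such a corruption step is given below (PyTorch; the function name and scale factor are hypothetical). The corrupted signal is fed to the model, while the reconstruction loss is computed against the clean signal.

```python
import torch

def corrupt(signal, scale=0.05):
    """Add zero-mean Gaussian noise whose standard deviation is proportional to the
    signal amplitude, loosely mimicking eye tracker estimation errors."""
    return signal + torch.randn_like(signal) * scale * signal.abs()

clean = torch.randn(8, 512, 2)           # (batch, time, channels) position signals
noisy = corrupt(clean)
# The training target remains the clean signal, e.g.:
# loss = torch.nn.functional.mse_loss(model(noisy), clean)
```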
Masked data modeling: This involves removing patches from each sample during training. The model must then use the representation learned from the unmasked regions to recover information in the masked regions. Initially used in NLP [22], this technique showed promising results when evaluated on downstream tasks. Kaiming He et al. [29] explored its application on ImageNet as a pretext task and reported 87.8% top-1 accuracy on ImageNet-1K without contrastive learning, relying on cropping as the only data augmentation.
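The core mechanism of [29], reduced to a sketch (PyTorch; shapes, module choices, and the stand-in decoder are assumptions): the encoder processes only the visible patches, and a lightweight decoder reconstructs the full sequence from the encoded visible tokens plus mask tokens.

```python
import torch
import torch.nn as nn

batch, n_patches, dim = 4, 64, 32
patches = torch.randn(batch, n_patches, dim)               # already-embedded patches

# Randomly keep 25% of the patches, i.e., mask 75%.
n_keep = n_patches // 4
perm = torch.rand(batch, n_patches).argsort(dim=1)
keep_idx = perm[:, :n_keep]                                # indices of visible patches
gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
visible = torch.gather(patches, 1, gather_idx)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
encoded = encoder(visible)                                 # the encoder sees visible tokens only

# Decoder input: encoded visible tokens scattered back, mask tokens elsewhere.
mask_tokens = torch.zeros(batch, n_patches, dim)           # a learned mask token in the real model
full = mask_tokens.scatter(1, gather_idx, encoded)
recon = nn.Linear(dim, dim)(full)                          # stand-in for the lightweight decoder
loss = nn.functional.mse_loss(recon, patches)              # MAE computes this on masked patches only
```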
Additionally, alongside our study, two other studies [30,40] have pursued similar objectives: Ti-MAE [40] trains an MAE on time series data by replacing the continuous masking strategy introduced in MAE with a random time-step masking strategy, and evaluates it on forecasting and classification tasks, achieving consistent improvements across the evaluated forecasting datasets; MTSMAE [30] introduces a patch embedding method better suited to the characteristics of time series.
In our approach, we investigate a different direction by designing a novel masking heuristic that can be seen as a generalization of patch-based and random masked modeling. The proposed masked modeling method is inspired by the problem of eye tracking system failure.
Self-supervised learning: This involves two stages of training. In the first stage, the model is trained on non-annotated data to learn a low-dimensional, meaningful representation. The second stage fine-tunes the model for a specific downstream task using an additional annotated dataset. The first stage can incorporate a pretext task such as regression with pseudo-labels [41,42,43], a contrastive method [44,45,46,47], or an adversarial method [48,49,50]. Finally, a more complete taxonomy of unsupervised representation learning is presented in [51].
Our study uses the mask filling task as a pretext task, while the denoising task serves a dual purpose: it acts as a pretext task in the first stage and subsequently as a preliminary step for the downstream tasks in the second stage.
Application to eye movement: To address the challenge of analyzing eye movements through eye movement data [19,20,21,52,53,54,55], one can first learn to estimate eye movement position coordinates. These estimated eye movements can then be used to learn to identify the corresponding pathologies within the eye movement data. Self-supervised learning techniques have been employed in different studies to address these phases, demonstrating promising outcomes. Bautista et al. [19] utilized a temporal convolutional network autoencoder to derive meaningful representations from eye movement position and velocity segments separately using unsupervised algorithms. They assessed the efficacy of this embedding by training a linear Support Vector Machine (SVM) for patient identification tasks, achieving accuracies of up to 93.9% for stimulus tasks and up to 87.8% for biometric tasks. In a second approach [21], they applied the same encoder architecture to self-supervised contrastive learning, achieving an accuracy of up to 94.5% for biometric tasks; however, there was a noticeable decrease in generalization performance when evaluating on datasets not included in the training/testing split. In another study, Lee et al. [52] investigated the detection of abnormal behavior during screen observation using self-supervised contrastive learning, achieving an accuracy of 91% in identifying abnormal eye movements associated with attention lapses.
On the other hand, several works [20,53,54,55] explore the application of SSL to eye movement coordinate estimation. In our study, we prefer the Pupil solution [56] for this estimation, owing to its physiological criteria and high accuracy, and because it has been used extensively in previous studies. Therefore, in this study, we focus on learning the screening task from eye movement position signals.
Author Contributions
Supervision, Z.K. and T.P.; methodology, A.E.E.H.; software, A.E.E.H.; validation, T.P. and Z.K.; formal analysis, A.E.E.H.; investigation, A.E.E.H.; resources, Z.K. and T.P.; data curation, A.E.E.H.; Conceptualization, A.E.E.H.; writing—original draft, A.E.E.H.; writing—review and editing, Z.K. and T.P.; visualization, A.E.E.H.; project administration, Z.K. and T.P.; funding acquisition, Z.K. All authors have read and agreed to the published version of the manuscript.
Funding
Alae Eddine El Hmimdi is funded by Orasis-Ear and ANRT, CIFRE.
Informed Consent Statement
This meta-analysis drew upon data sourced from Orasis Ear, in collaboration with clinical centers employing REMOBI and Aideal technology. Participating centers agreed to store their data anonymously for further analysis.
Data Availability Statement
The datasets generated and/or analyzed during the current study are not publicly available. This meta-analysis drew upon data sourced from Orasis Ear, in collaboration with clinical centers employing REMOBI and Aideal technology. Participating centers agreed to store their data anonymously for further analysis. However, upon reasonable request, they are available from the corresponding author.
Acknowledgments
This work was granted access to the HPC resources of IDRIS under the allocation 2024-AD011014231 made by GENCI.
Conflicts of Interest
Zoï Kapoula is the founder of Orasis-EAR.
Appendix A
Figure A1.
An overview of the left and right position signals during the saccade (right columns) and vergence (left columns) tests. The first and second rows correspond to the left eye and right eye position signals, respectively. Additionally, the last two rows correspond to the disconjugate and conjugate signals. Note that the conjugate signal allows for better analysis of the saccade recording, while the disconjugate signal facilitates better analysis of the vergence recording.
Figure A2.
Visual task distribution pie chart for the different eye movement recording types on Ora-M.
Table A1.
Train and test sets’ per-class counts for each of the four datasets used for the downstream task.
Class | Saccade, Reduced: Train | Saccade, Reduced: Test | Saccade, Non-Reduced: Train | Saccade, Non-Reduced: Test | Vergence, Reduced: Train | Vergence, Reduced: Test | Vergence, Non-Reduced: Train | Vergence, Non-Reduced: Test
---|---|---|---|---|---|---|---|---
0 | 5302 | 9170 | 17,458 | 33,090 | 4830 | 9076 | 16,714 | 33,676 |
1 | 3970 | 12,898 | 24,180 | 46,864 | 6696 | 13,136 | 24,714 | 47,998 |
2 | 7552 | 2832 | 5612 | 12,056 | 3622 | 2750 | 5616 | 12,088 |
3 | 4162 | 3212 | 6572 | 14,022 | 2960 | 3358 | 6134 | 13,734 |
4 | 5200 | 12,132 | 22,606 | 43,520 | 4496 | 11,840 | 21,066 | 42,862 |
5 | 2904 | 4752 | 7956 | 17,974 | 2212 | 5712 | 12,200 | 18,730 |
6 | 5156 | 7570 | 14,280 | 28,826 | 2552 | 6708 | 12,258 | 25,158 |
7 | 1802 | 1784 | 3490 | 7,178 | 1616 | 2812 | 4958 | 10,728 |
Figure A3.
An overview of the proposed architecture; note that the decoder takes as input the encoder embedding as well as the REMOBI target signal.
Figure A4.
An overview of the 8 explored head classifiers for the downstream task. For simplicity, we omit the activation layers, the dropout layers, and the layers that expand the normalization feature tensors to a shape compatible with the output feature map. The flattening layer reduces the temporal dimension. For the Transformer block (Block), we use the timm library implementation. Furthermore, block-LN corresponds to the Block implementation followed by a LayerNorm layer.
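For illustration, one of the heads could look roughly like the following sketch (PyTorch; dimensions and class names are hypothetical, and the timm Block used in the study is replaced here by a standard Transformer encoder layer to keep the example self-contained):

```python
import torch
import torch.nn as nn

class BlockLNHead(nn.Module):
    """Sketch of a 'block-LN' style head: a Transformer block followed by LayerNorm,
    then flattening of the temporal dimension before the linear classifier."""

    def __init__(self, dim=64, n_tokens=32, n_classes=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim * n_tokens, n_classes)

    def forward(self, x):                     # x: (batch, n_tokens, dim) encoder features
        x = self.norm(self.block(x))
        return self.fc(x.flatten(1))          # flatten the temporal dimension

logits = BlockLNHead()(torch.randn(4, 32, 64))
print(logits.shape)  # torch.Size([4, 8])
```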
Figure A5.
Overview of autoencoder performance in signal reconstruction. Columns represent (1) initial signal, (2) noise-injected signal, (3) target signal masked, (4) masked eye movement signal, and (5) reconstructed signal. Even rows (in red) show the position of conjugate signals within the X-axis, corresponding to mean eye values, while odd rows (in red) display disconjugate signals (the difference between the eyes). The first four rows depict samples with easy masking and hard noise injection, while the last four rows illustrate cases with hard masking and relatively low noise injection.
Table A2.
Second-stage optimization settings.
Training Type | Initial Learning Rate | Weight Decay | Initialization | Encoder Weights |
---|---|---|---|---
Freezing | | 0 | Pretext Weight | Frozen |
Fine-tuning | | | Pretext Weight | Trainable |
Sup. Learning | | | Random | Trainable |
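The three settings in Table A2 differ mainly in which parameters are trainable and how the encoder is initialized. A sketch of the distinction is given below (PyTorch; modules, learning rate, and the non-zero weight decay are illustrative placeholders, since the table's numeric values are not reproduced here):

```python
import torch
import torch.nn as nn

encoder, head = nn.Linear(128, 64), nn.Linear(64, 8)    # placeholders for encoder / classifier head

def build_optimizer(setting, pretext_state=None):
    if setting in ("freezing", "fine-tuning") and pretext_state is not None:
        encoder.load_state_dict(pretext_state)           # initialize from the pretext weights
    if setting == "freezing":
        for p in encoder.parameters():
            p.requires_grad = False                      # encoder frozen; only the head is trained
        params = list(head.parameters())
    else:                                                # fine-tuning or supervised learning
        params = list(encoder.parameters()) + list(head.parameters())
    wd = 0.0 if setting == "freezing" else 1e-2          # Table A2 sets weight decay to 0 when freezing
    return torch.optim.AdamW(params, lr=1e-4, weight_decay=wd)

opt = build_optimizer("freezing", pretext_state=encoder.state_dict())
```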
Table A3.
Alpha (class balancing) parameters of the focal loss for the non-reduced and reduced datasets. Note that these hyperparameters are shared across the two visual tasks.
Class | Non-Reduced | Reduced
---|---|---
Class 0 | 0.73 | 0.8 |
Class 1 | 0.61 | 0.8 |
Class 2 | 0.9 | 0.8 |
Class 3 | 0.88 | 0.8 |
Class 4 | 0.67 | 0.8 |
Class 5 | 0.83 | 0.8 |
Class 6 | 0.81 | 0.8 |
Class 7 | 0.31 | 0.2 |
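For reference, a multi-class focal loss with per-class alpha weights such as those in Table A3 could be written as follows (PyTorch sketch; the gamma value is illustrative, as it is not given in the table):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Multi-class focal loss: FL = -alpha_c * (1 - p_c)^gamma * log(p_c)."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log prob of the true class
    pt = log_pt.exp()
    at = alpha.to(logits.device)[targets]                       # per-class alpha weight
    return (-at * (1 - pt) ** gamma * log_pt).mean()

# Alpha values for the non-reduced datasets (from Table A3).
alpha = torch.tensor([0.73, 0.61, 0.90, 0.88, 0.67, 0.83, 0.81, 0.31])
loss = focal_loss(torch.randn(16, 8), torch.randint(0, 8, (16,)), alpha)
```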
Table A4.
Per-class sample macro F1 scores for each downstream method when training the vergence dataset. The best metrics for each head classifier are highlighted in bold. Note that Head0 corresponds to the linear probing method.
Technique | Head | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 |
---|---|---|---|---|---|---|---|---|---
| 1 | 60.5 | 64.9 | 54.1 | 53.1 | 61.0 | 63.0 | 59.8 | 67.4 |
| 2 | 61.4 | 63.6 | 56.2 | 53.1 | 59.0 | 61.9 | 59.8 | 67.0 |
Fine-tuning | 3 | 62.2 | 66.6 | 56.0 | 49.0 | 60.1 | 60.7 | 58.0 | 64.2 |
| 4 | 61.6 | 64.9 | 54.9 | 50.6 | 59.6 | 61.1 | 59.9 | 65.0
| 5 | 64.0 | 66.0 | 55.4 | 50.6 | 58.3 | 60.3 | 55.4 | 67.3 |
| 6 | 63.9 | 65.2 | 54.2 | 51.3 | 58.7 | 59.8 | 59.6 | 69.1 |
| 7 | 61.9 | 65.6 | 56.2 | 50.6 | 59.2 | 62.3 | 62.0 | 66.0 |
| 0 | 52.6 | 53.6 | 47.1 | 40.5 | 47.0 | 50.5 | 41.9 | 45.8 |
| 1 | 61.0 | 63.7 | 57.0 | 51.2 | 57.4 | 63.3 | 58.9 | 61.7 |
| 2 | 62.3 | 64.8 | 56.5 | 52.3 | 58.2 | 60.6 | 58.9 | 68.0 |
Freezing | 3 | 62.1 | 63.6 | 55.3 | 52.4 | 59.3 | 62.4 | 57.5 | 68.5 |
| 4 | 61.4 | 64.1 | 54.4 | 51.0 | 59.7 | 63.0 | 58.5 | 67.0
| 5 | 61.7 | 63.0 | 56.3 | 51.7 | 58.2 | 62.4 | 58.0 | 68.7 |
| 6 | 58.9 | 61.5 | 53.5 | 50.9 | 57.0 | 59.5 | 57.7 | 68.7 |
| 7 | 64.7 | 66.1 | 56.6 | 53.5 | 61.1 | 63.4 | 58.2 | 68.4 |
| 1 | 59.8 | 61.9 | 53.1 | 50.0 | 57.5 | 60.7 | 57.7 | 65.1 |
| 2 | 59.7 | 63.1 | 53.5 | 51.1 | 56.0 | 60.0 | 57.9 | 66.9 |
Sup. Learning | 3 | 60.4 | 61.7 | 54.3 | 50.7 | 55.8 | 60.9 | 56.8 | 67.2 |
| 4 | 62.2 | 63.4 | 55.4 | 52.7 | 57.7 | 61.5 | 58.5 | 66.9
| 5 | 60.0 | 62.9 | 55.2 | 51.0 | 57.3 | 62.7 | 58.2 | 67.8 |
| 6 | 59.3 | 62.4 | 53.1 | 49.7 | 57.3 | 60.7 | 57.9 | 67.5 |
| 7 | 59.6 | 63.8 | 55.8 | 51.8 | 58.1 | 61.0 | 58.3 | 65.3 |
Table A5.
Per-class sample macro F1 scores for each downstream method when training the saccade dataset. The best metrics for each head classifier are highlighted in bold. Note that Head0 corresponds to the linear probing method.
Technique | Head | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 |
---|---|---|---|---|---|---|---|---|---
| 1 | 64.6 | 64.8 | 57.3 | 54.4 | 60.2 | 63.3 | 61.2 | 61.3 |
| 2 | 63.9 | 65.6 | 57.5 | 56.3 | 60.2 | 63.2 | 61.7 | 60.9 |
Fine-tuning | 3 | 64.3 | 64.4 | 56.6 | 53.9 | 59.2 | 63.0 | 58.5 | 59.6 |
| 4 | 64.5 | 65.9 | 57.5 | 55.0 | 60.5 | 64.2 | 59.9 | 60.2
| 5 | 64.6 | 65.2 | 56.3 | 56.0 | 61.5 | 64.3 | 61.1 | 61.3 |
| 6 | 64.1 | 66.0 | 56.6 | 55.0 | 59.3 | 65.3 | 61.2 | 60.7 |
| 7 | 64.9 | 63.6 | 56.2 | 54.4 | 59.9 | 64.0 | 61.0 | 59.9 |
| 0 | 48.9 | 53.5 | 50.5 | 43.8 | 46.0 | 53.1 | 44.8 | 48.5 |
| 1 | 64.3 | 63.5 | 58.4 | 54.8 | 58.4 | 62.4 | 60.8 | 60.5 |
| 2 | 62.4 | 63.4 | 58.9 | 55.6 | 58.8 | 64.9 | 60.8 | 59.0 |
Freezing | 3 | 63.2 | 65.0 | 56.9 | 56.3 | 60.1 | 63.4 | 61.2 | 59.4 |
| 4 | 61.4 | 61.8 | 57.2 | 56.2 | 60.0 | 67.1 | 59.1 | 61.2
| 5 | 63.5 | 63.7 | 56.2 | 53.6 | 57.9 | 65.5 | 61.9 | 60.3 |
| 6 | 59.5 | 59.9 | 55.9 | 53.2 | 57.0 | 62.0 | 59.1 | 61.0 |
| 7 | 62.8 | 64.4 | 57.7 | 54.3 | 60.7 | 67.6 | 60.7 | 59.7 |
| 1 | 62.9 | 62.2 | 57.8 | 55.5 | 56.8 | 62.7 | 59.4 | 61.5 |
| 2 | 61.8 | 61.8 | 56.0 | 53.9 | 57.0 | 61.1 | 60.0 | 59.5 |
Sup. Learning | 3 | 63.5 | 63.0 | 57.2 | 53.8 | 58.1 | 61.6 | 60.9 | 59.4 |
| 4 | 61.9 | 62.5 | 56.8 | 54.2 | 55.5 | 62.7 | 59.3 | 58.2
| 5 | 63.6 | 61.8 | 56.8 | 53.1 | 59.8 | 64.0 | 60.5 | 62.1 |
| 6 | 61.1 | 59.2 | 56.1 | 53.1 | 56.4 | 59.7 | 57.9 | 60.6 |
| 7 | 63.6 | 65.0 | 57.1 | 54.9 | 57.6 | 61.7 | 59.5 | 62.1 |
Table A6.
Per-class sample macro F1 scores for each downstream method when training the reduced vergence dataset. The best metrics for each head classifier are highlighted in bold. Note that Head0 corresponds to the linear probing method.
Technique | Head | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 |
---|---|---|---|---|---|---|---|---|---
| 1 | 59.6 | 61.4 | 45.7 | 52.6 | 51.5 | 51.1 | 54.3 | 52.3 |
| 2 | 57.3 | 62.6 | 47.0 | 55.3 | 58.3 | 56.5 | 52.3 | 46.8 |
Fine-tuning | 3 | 61.9 | 67.0 | 52.3 | 50.6 | 54.6 | 54.3 | 52.4 | 48.9 |
| 4 | 55.7 | 62.8 | 48.8 | 52.2 | 56.2 | 52.2 | 53.5 | 52.2
| 5 | 58.6 | 62.8 | 57.1 | 54.1 | 59.7 | 53.2 | 52.6 | 47.1 |
| 6 | 57.8 | 62.1 | 45.0 | 52.9 | 54.3 | 57.1 | 49.0 | 47.9 |
| 7 | 55.5 | 61.7 | 46.8 | 52.3 | 53.5 | 54.2 | 48.8 | 48.1 |
| 0 | 55.7 | 56.7 | 49.1 | 46.7 | 46.3 | 50.7 | 42.4 | 51.2 |
| 1 | 60.5 | 61.0 | 49.0 | 48.7 | 49.8 | 52.1 | 46.6 | 47.5 |
| 2 | 59.7 | 61.2 | 48.2 | 47.8 | 47.0 | 52.7 | 54.0 | 47.5 |
Freezing | 3 | 58.6 | 59.4 | 57.7 | 49.5 | 49.1 | 49.1 | 48.3 | 47.5 |
| 4 | 57.3 | 59.0 | 52.6 | 48.8 | 54.2 | 51.4 | 51.4 | 47.0
| 5 | 60.5 | 63.9 | 58.6 | 49.6 | 53.1 | 51.5 | 53.6 | 53.5 |
| 6 | 57.0 | 54.5 | 49.8 | 47.2 | 49.5 | 46.8 | 50.5 | 47.5 |
| 7 | 60.8 | 64.1 | 51.0 | 51.2 | 56.1 | 50.0 | 53.8 | 49.3 |
| 1 | 57.1 | 56.4 | 49.5 | 48.9 | 51.2 | 52.7 | 54.0 | 53.7 |
| 2 | 55.2 | 58.7 | 48.9 | 51.1 | 47.1 | 49.6 | 45.6 | 52.4 |
Sup. Learning | 3 | 56.9 | 58.9 | 52.5 | 49.6 | 47.8 | 51.0 | 47.9 | 47.4 |
| 4 | 57.3 | 58.7 | 53.1 | 51.2 | 48.7 | 54.6 | 47.1 | 47.6
| 5 | 59.2 | 58.9 | 53.2 | 50.1 | 47.1 | 54.3 | 47.3 | 47.7 |
| 6 | 53.7 | 56.0 | 47.0 | 51.3 | 50.2 | 56.0 | 55.6 | 46.6 |
| 7 | 53.2 | 56.6 | 51.7 | 53.6 | 50.6 | 49.5 | 52.6 | 47.9 |
Table A7.
Per-class sample macro F1 scores for each downstream method when training the reduced saccade dataset. The best metrics for each head classifier are highlighted in bold. Note that Head0 corresponds to the linear probing method.
Technique | Head | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 |
---|---|---|---|---|---|---|---|---|---
| 1 | 56.4 | 55.7 | 49.8 | 48.5 | 49.3 | 52.7 | 47.7 | 51.7 |
| 2 | 56.8 | 60.6 | 50 | 50.2 | 55.1 | 52.6 | 51.8 | 48 |
Fine-tuning | 3 | 55.4 | 57.6 | 59.1 | 47.6 | 50.9 | 54.5 | 46 | 52.4 |
| 4 | 58.2 | 60.7 | 53.3 | 50.2 | 52.3 | 54.1 | 49.4 | 47
| 5 | 52.5 | 56.4 | 51.4 | 50.6 | 60.1 | 55.1 | 50.6 | 47.7 |
| 6 | 55.7 | 60.1 | 54.3 | 48.1 | 54.6 | 56.7 | 53.6 | 47.9 |
| 7 | 58.4 | 56.7 | 52.7 | 50.7 | 52.8 | 53.8 | 47.4 | 48.6 |
| 0 | 49.3 | 51.6 | 52 | 46.3 | 41.7 | 48.3 | 40.3 | 46.1 |
| 1 | 54.2 | 54.1 | 52.2 | 50.8 | 49 | 51 | 49.6 | 48 |
| 2 | 59.8 | 56.7 | 53.1 | 49.7 | 50.9 | 55 | 48.9 | 47.5 |
Freezing | 3 | 60.9 | 57.9 | 54.4 | 49.9 | 49.1 | 54.5 | 52.5 | 49.7 |
| 4 | 59 | 58.3 | 54.4 | 54.8 | 50.1 | 51.6 | 54.1 | 47.4
| 5 | 57 | 55.8 | 55.8 | 46.1 | 51.3 | 55.1 | 49.3 | 48 |
| 6 | 50.7 | 51.1 | 50.4 | 53.3 | 50.8 | 53.2 | 51 | 47.8 |
| 7 | 58.8 | 59.1 | 54.9 | 50.8 | 51.6 | 53.6 | 53.3 | 48 |
| 1 | 51.5 | 55.8 | 53.2 | 48.1 | 52.2 | 56 | 50.9 | 47.2 |
| 2 | 57.7 | 57.9 | 56.1 | 46.8 | 47.6 | 52.5 | 51 | 49.7 |
Sup. Learning | 3 | 53.1 | 56.7 | 50.2 | 45.6 | 51.5 | 52.7 | 49 | 47.9 |
| 4 | 54.9 | 53.8 | 51 | 49.5 | 51.9 | 51.4 | 49.1 | 47.5
| 5 | 53.5 | 59.3 | 49.7 | 53.5 | 49.6 | 51.4 | 48.8 | 49 |
| 6 | 55.4 | 54.7 | 48.7 | 50.6 | 48.8 | 50.2 | 51.9 | 49.6 |
| 7 | 58.1 | 55.5 | 50.9 | 46.4 | 49.2 | 51.1 | 46.5 | 48 |
References
- Ward, L.M.; Kapoula, Z. Dyslexics’ Fragile Oculomotor Control Is Further Destabilized by Increased Text Difficulty. Brain Sci. 2021, 11, 990. [Google Scholar] [CrossRef] [PubMed]
- Ward, L.M.; Kapoula, Z. Differential diagnosis of vergence and saccade disorders in dyslexia. Sci. Rep. 2020, 10, 22116. [Google Scholar] [CrossRef]
- Ward, L.M.; Kapoula, Z. Creativity, Eye-Movement Abnormalities, and Aesthetic Appreciation of Magritte’s Paintings. Brain Sci. 2022, 12, 1028. [Google Scholar] [CrossRef] [PubMed]
- Kapoula, Z.; Morize, A.; Daniel, F.; Jonqua, F.; Orssaud, C.; Bremond-Gignac, D. Objective evaluation of vergence disorders and a research-based novel method for vergence rehabilitation. Transl. Vis. Sci. Technol. 2016, 5, 8. [Google Scholar] [CrossRef]
- El Hmimdi, A.E.; Ward, L.M.; Palpanas, T.; Kapoula, Z. Predicting dyslexia and reading speed in adolescents from eye movements in reading and non-reading tasks: A machine learning approach. Brain Sci. 2021, 11, 1337. [Google Scholar] [CrossRef]
- El Hmimdi, A.E.; Ward, L.M.; Palpanas, T.; Sainte Fare Garnot, V.; Kapoula, Z. Predicting Dyslexia in Adolescents from Eye Movements during Free Painting Viewing. Brain Sci. 2022, 12, 1031. [Google Scholar] [CrossRef]
- Rizzo, A.; Ermini, S.; Zanca, D.; Bernabini, D.; Rossi, A. A machine learning approach for detecting cognitive interference based on eye-tracking data. Front. Hum. Neurosci. 2022, 16, 806330. [Google Scholar] [CrossRef] [PubMed]
- Bixler, R.; D’Mello, S. Automatic gaze-based user-independent detection of mind wandering during computerized reading. User Model. User-Adapt. Interact. 2016, 26, 33–68. [Google Scholar] [CrossRef]
- Asvestopoulou, T.; Manousaki, V.; Psistakis, A.; Smyrnakis, I.; Andreadakis, V.; Aslanides, I.M.; Papadopouli, M. Dyslexml: Screening tool for dyslexia using machine learning. arXiv 2019, arXiv:1903.06274. [Google Scholar]
- Nilsson Benfatto, M.; Öqvist Seimyr, G.; Ygge, J.; Pansell, T.; Rydberg, A.; Jacobson, C. Screening for dyslexia using eye tracking during reading. PLoS ONE 2016, 11, e0165508. [Google Scholar] [CrossRef]
- Vajs, I.A.; Kvaščev, G.S.; Papić, T.M.; Janković, M.M. Eye-tracking Image Encoding: Autoencoders for the Crossing of Language Boundaries in Developmental Dyslexia Detection. IEEE Access 2023, 11, 3024–3033. [Google Scholar] [CrossRef]
- El Hmimdi, A.E.; Kapoula, Z.; Sainte Fare Garnot, V. Deep Learning-Based Detection of Learning Disorders on a Large Scale Dataset of Eye Movement Records. BioMedInformatics 2024, 4, 519–541. [Google Scholar] [CrossRef]
- Chen, S.; Zhao, Q. Attention-based autism spectrum disorder screening with privileged modality. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1181–1190. [Google Scholar]
- Jiang, M.; Zhao, Q. Learning visual attention to identify people with autism spectrum disorder. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3267–3276. [Google Scholar]
- Tao, Y.; Shyu, M.L. SP-ASDNet: CNN-LSTM based ASD classification model using observer scanpaths. In Proceedings of the 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shanghai, China, 8–12 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 641–646. [Google Scholar]
- Vajs, I.; Ković, V.; Papić, T.; Savić, A.M.; Janković, M.M. Dyslexia detection in children using eye tracking data based on VGG16 network. In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1601–1605. [Google Scholar]
- Harisinghani, A.; Sriram, H.; Conati, C.; Carenini, G.; Field, T.; Jang, H.; Murray, G. Classification of Alzheimer’s using Deep-learning Methods on Webcam-based Gaze Data. Proc. ACM Hum.-Comput. Interact. 2023, 7, 1–17. [Google Scholar] [CrossRef]
- Sun, J.; Liu, Y.; Wu, H.; Jing, P.; Ji, Y. A novel deep learning approach for diagnosing Alzheimer’s disease based on eye-tracking data. Front. Hum. Neurosci. 2022, 16, 972773. [Google Scholar] [CrossRef] [PubMed]
- Bautista, L.G.C.; Naval, P.C. Gazemae: General representations of eye movements using a micro-macro autoencoder. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7004–7011. [Google Scholar]
- Jindal, S.; Manduchi, R. Contrastive representation learning for gaze estimation. In Proceedings of the Annual Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 37–49. [Google Scholar]
- Bautista, L.G.C.; Naval, P.C. CLRGaze: Contrastive Learning of Representations for Eye Movement Signals. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1241–1245. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. OpenAI Blog, 2018; preprint. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. Mass: Masked sequence to sequence pre-training for language generation. arXiv 2019, arXiv:1905.02450. [Google Scholar]
- Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 1691–1703. [Google Scholar]
- Bao, H.; Dong, L.; Piao, S.; Wei, F. Beit: Bert pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
- Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9653–9663. [Google Scholar]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
- Tang, P.; Zhang, X. MTSMAE: Masked Autoencoders for Multivariate Time-Series Forecasting. In Proceedings of the 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), Macao, China, 31 October–2 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 982–989. [Google Scholar]
- Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A.; Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408. [Google Scholar]
- Lee, W.H.; Ozger, M.; Challita, U.; Sung, K.W. Noise learning-based denoising autoencoder. IEEE Commun. Lett. 2021, 25, 2983–2987. [Google Scholar] [CrossRef]
- Hinton, G.E.; Zemel, R. Autoencoders, minimum description length and Helmholtz free energy. Adv. Neural Inf. Process. Syst. 1993, 6, 3–10. [Google Scholar]
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
- Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, JMLR Workshop and Conference Proceedings, Bellevue, WA, USA, 2 July 2011; pp. 37–49. [Google Scholar]
- Kingma, D.P.; Welling, M. An introduction to variational autoencoders. Found. Trends® Mach. Learn. 2019, 12, 307–392. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Bajaj, K.; Singh, D.K.; Ansari, M.A. Autoencoders based deep learner for image denoising. Procedia Comput. Sci. 2020, 171, 1535–1541. [Google Scholar] [CrossRef]
- Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
- Li, Z.; Rao, Z.; Pan, L.; Wang, P.; Xu, Z. Ti-MAE: Self-Supervised Masked Time Series Autoencoders. arXiv 2023, arXiv:2301.08871. [Google Scholar]
- Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430. [Google Scholar]
- Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 69–84. [Google Scholar]
- Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv 2018, arXiv:1803.07728. [Google Scholar]
- Yang, X.; Zhang, Z.; Cui, R. Timeclr: A self-supervised contrastive learning framework for univariate time series representation. Knowl.-Based Syst. 2022, 245, 108606. [Google Scholar] [CrossRef]
- Yue, Z.; Wang, Y.; Duan, J.; Yang, T.; Huang, C.; Tong, Y.; Xu, B. Ts2vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 8980–8987. [Google Scholar]
- Tonekaboni, S.; Eytan, D.; Goldenberg, A. Unsupervised representation learning for time series with temporal neighborhood coding. arXiv 2021, arXiv:2106.00750. [Google Scholar]
- Zhang, X.; Zhao, Z.; Tsiligkaridis, T.; Zitnik, M. Self-supervised contrastive pre-training for time series via time-frequency consistency. Adv. Neural Inf. Process. Syst. 2022, 35, 3988–4003. [Google Scholar]
- Yoon, J.; Jarrett, D.; Van der Schaar, M. Time-series generative adversarial networks. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Desai, A.; Freeman, C.; Wang, Z.; Beaver, I. Timevae: A variational auto-encoder for multivariate time series generation. arXiv 2021, arXiv:2111.08095. [Google Scholar]
- Esteban, C.; Hyland, S.L.; Rätsch, G. Real-valued (medical) time series generation with recurrent conditional gans. arXiv 2017, arXiv:1706.02633. [Google Scholar]
- Meng, Q.; Qian, H.; Liu, Y.; Xu, Y.; Shen, Z.; Cui, L. Unsupervised Representation Learning for Time Series: A Review. arXiv 2023, arXiv:2308.01578. [Google Scholar]
- Lee, S.W.; Kim, S. Detection of Abnormal Behavior with Self-Supervised Gaze Estimation. arXiv 2021, arXiv:2107.06530. [Google Scholar]
- Du, L.; Zhang, X.; Lan, G. Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition. arXiv 2023, arXiv:2309.04506. [Google Scholar]
- Yu, Y.; Odobez, J.M. Unsupervised representation learning for gaze estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7314–7324. [Google Scholar]
- Park, S.; Mello, S.D.; Molchanov, P.; Iqbal, U.; Hilliges, O.; Kautz, J. Few-shot adaptive gaze estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9368–9377. [Google Scholar]
- Pupil Capture Eye Tracker. Available online: https://pupil-labs.com/ (accessed on 29 March 2024).
- Pytorch Image Models (timm). Available online: https://timm.fast.ai/ (accessed on 29 March 2024).
- Tian, K.; Jiang, Y.; Diao, Q.; Lin, C.; Wang, L.; Yuan, Z. Designing bert for convolutional networks: Sparse and hierarchical masked modeling. arXiv 2023, arXiv:2301.03580. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, arXiv:1706.03762. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.L.; Chen, S.C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) 2018, 51, 1–36. [Google Scholar] [CrossRef]
- El Hmimdi, A.E.; Palpanas, T.; Kapoula, Z. Efficient Diagnostic Classification of Diverse Pathologies through Contextual Eye Movement Data Analysis with a Novel Hybrid Architecture. BioMedInformatics 2024, 4, 1457–1479. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).