We conducted evaluations on three datasets introduced in
Section 2. To this end, we designed an empirical evaluation protocol, which is described in the first part of this section. We then present the results of our proposed model and compare them with existing state-of-the-art results. Finally, we evaluate each component of our proposed preprocessing, augmentation, and model architecture to demonstrate the effectiveness of our approach.
4.1. Evaluation Protocol
We evaluated our model using leave-one-subject-out (LOSO) cross-validation under user-independent conditions. In this approach, the evaluation process iterates through each subject in the dataset. Data from all subjects except one is used for training, while the remaining subject’s data is used exclusively for testing. This methodology ensures the model is evaluated on completely unseen subjects, providing a rigorous assessment of its generalization capabilities across different individuals.
Figure 11 illustrates our evaluation framework. The protocol begins by splitting the source dataset into two separate portions: one containing data from a single subject (for testing) and another comprising data from all other subjects (for training). The training data undergoes data augmentation. We implemented three augmentation techniques, which are described in the next subsection. Both the augmented and original training data are preprocessed before model training. Similarly, the test data undergoes preprocessing before evaluation. This systematic approach ensures an unbiased assessment while maintaining strict separation between training and testing data, which is crucial for validating the model’s ability to generalize to new users.
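To make the protocol concrete, the following sketch outlines the LOSO evaluation loop under the flow just described. The function names (augment, preprocess, train_model, evaluate) are placeholders for the components presented later in this section, not the exact interfaces of our implementation.

```python
import numpy as np

def loso_evaluation(signals, labels, subject_ids, augment, preprocess, train_model, evaluate):
    """Leave-one-subject-out evaluation: each subject is held out once for testing."""
    subject_ids = np.asarray(subject_ids)
    scores = []
    for held_out in np.unique(subject_ids):
        test_mask = subject_ids == held_out
        # Training data: all subjects except the held-out one; test data: only that subject.
        X_train = [s for s, m in zip(signals, test_mask) if not m]
        y_train = [y for y, m in zip(labels, test_mask) if not m]
        X_test = [s for s, m in zip(signals, test_mask) if m]
        y_test = [y for y, m in zip(labels, test_mask) if m]

        # Augment only the training portion, then preprocess both portions.
        X_aug, y_aug = augment(X_train, y_train)
        X_train_pp = [preprocess(s) for s in X_train + X_aug]
        X_test_pp = [preprocess(s) for s in X_test]

        model = train_model(X_train_pp, y_train + y_aug)
        scores.append(evaluate(model, X_test_pp, y_test))
    return float(np.mean(scores))
```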
In preprocessing, we implemented a pipeline of steps illustrated in
Figure 12. Raw eye-writing sequences often contain noise and unwanted artifacts that reduce recognition accuracy. Following the approach of [
6], we applied a sequence of filters to the raw eye-writing data from two EOG-captured datasets. First, the input signal was processed with a median filter to remove sporadic noise. Next, a low-pass filter was applied to eliminate high-frequency components irrelevant to recognition. Then, the signal was passed through a direct current (DC) block filter that removes any constant offset in the signal, centering the values around zero. After these filtering steps, the Discrete Fourier Transform (DFT) length normalization was applied to standardize the signal length, followed by initial-point normalization to align the initial point of each eye-writing pattern to the origin. For the webcam-captured Arabic numbers dataset, the first three filters (dashed boxes in
Figure 12) were not applied due to the different nature of data capture.
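A minimal sketch of this pipeline is given below, assuming two-channel (horizontal/vertical) sequences, a fourth-order Butterworth low-pass filter, and mean subtraction as a simple stand-in for the DC block filter; the median-filter window and the resampling target length are illustrative placeholders, while the cutoff frequencies follow Table 3.

```python
import numpy as np
from scipy.signal import medfilt, butter, filtfilt, resample

def preprocess_eye_writing(signal, fs, cutoff_hz, target_len, apply_filters=True):
    """Sketch of the preprocessing pipeline: median filter -> low-pass filter ->
    DC removal -> DFT-based length normalization -> initial-point normalization.
    `signal` is a (T, 2) array of horizontal and vertical gaze values."""
    x = np.asarray(signal, dtype=float)
    if apply_filters:  # skipped for the webcam-captured Arabic numbers dataset
        x = medfilt(x, kernel_size=(5, 1))               # suppress sporadic spikes
        b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
        x = filtfilt(b, a, x, axis=0)                    # remove high-frequency components
        x = x - x.mean(axis=0, keepdims=True)            # center values around zero
    x = resample(x, target_len, axis=0)                  # DFT-based length normalization
    x = x - x[0]                                         # align the initial point to the origin
    return x
```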
4.2. Experimental Settings
The deep learning models were implemented using the PyTorch framework in Python 3.10. Experiments were conducted on a desktop computer equipped with an Intel Core i5-10400 CPU (Intel, Santa Clara, CA, USA), NVIDIA GeForce RTX 3060 GPU (CUDA 12.4) (NVIDIA, Santa Clara, CA, USA), and 48 GB RAM.
We employed the Adam optimizer with AMSGrad for model training and set the L2 weight decay to 0.1 to mitigate overfitting. The cross-entropy loss function was used to measure the discrepancy between predicted and true labels. The cross-entropy loss is defined as
\[
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\log(p_{ij}),
\]
where N is the number of samples, C is the number of classes, $y_{ij}$ is a binary indicator (0 or 1) of whether class label j is the correct classification for sample i, and $p_{ij}$ is the predicted probability that sample i belongs to class j. In all experiments, the model was trained for 400 epochs with a batch size of 512. We evaluated performance metrics such as accuracy, precision, recall, and F1-score after training.
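As a quick numerical check of this definition, the loss can be computed directly from one-hot labels and predicted probabilities; the values below are purely illustrative.

```python
import numpy as np

def cross_entropy(y_true_onehot, y_prob):
    """Mean cross-entropy: -(1/N) * sum_i sum_j y_ij * log(p_ij)."""
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob + eps), axis=1))

# N = 2 samples, C = 3 classes (illustrative values only).
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(cross_entropy(y_true, y_prob))  # -(log 0.7 + log 0.6) / 2 ≈ 0.434
```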
Hyperparameters such as the batch size, number of epochs, weight decay, and number of cosine annealing iterations were determined through grid search and empirical experiments to optimize performance across all datasets.
Common hyperparameter settings for all datasets are detailed in
Table 2. For the two types of distortion data augmentation, the normal distribution parameters differ between type I and type II distortion, each using its own mean and variance. The gap
g for both types of distortion is dataset-specific. The learning rate was dynamically adjusted during the training process, initialized at 0.001 and regulated through a cosine annealing [
21] scheduler. The period of the cosine annealing was set to 16 iterations.
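These settings map to the following PyTorch configuration. This is a minimal sketch: the model and dataset below are placeholders (the actual architecture is the proposed 1D CNN-TCN), and the scheduler is assumed to be stepped once per epoch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; the real model is the proposed 1D CNN-TCN.
model = nn.Sequential(nn.Flatten(), nn.LazyLinear(10))
dataset = TensorDataset(torch.randn(2048, 2, 256), torch.randint(0, 10, (2048,)))
train_loader = DataLoader(dataset, batch_size=512, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=0.1, amsgrad=True)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=16)
criterion = nn.CrossEntropyLoss()

for epoch in range(400):                      # 400 epochs, batch size 512
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # cosine annealing with a period of 16
```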
In the filtering process for the Japanese Katakana and EOG-captured Arabic datasets, dataset-specific cutoff frequencies were established based on each dataset’s sampling rate, as presented in
Table 3. The gap
g for the two types of distortion was set to 64 for the EOG-captured Arabic dataset and 32 for other datasets.
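As a rough illustration only, the sketch below shows one plausible gap-based distortion, assuming that offsets drawn from a normal distribution are placed at anchor points spaced g samples apart, interpolated over the full sequence, and added to each channel. The interpolation scheme and per-channel treatment are assumptions, not the exact procedure used for the two distortion types.

```python
import numpy as np

def gap_distortion(signal, gap, mean, std, rng=None):
    """Illustrative gap-based distortion: sample offsets from N(mean, std^2) at
    anchors spaced `gap` samples apart, interpolate them over the whole length,
    and add the resulting smooth curve to each channel (assumed behaviour)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(signal, dtype=float)
    n = x.shape[0]
    anchors = np.arange(0, n + gap, gap)
    out = x.copy()
    for ch in range(x.shape[1]):
        offsets = rng.normal(mean, std, size=anchors.shape)
        out[:, ch] += np.interp(np.arange(n), anchors, offsets)
    return out
```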
4.3. Performance Metrics
For evaluation, we use standard performance metrics including accuracy, precision, recall, and F1-score.
Furthermore, hypothesis testing is conducted to statistically analyze whether the proposed approach is significantly better than the baseline method. To calculate the
p-value, we employ the two-tailed Wilcoxon signed-rank test [
22] with a significance level of 0.05. A
p-value less than the significance level indicates that the null hypothesis can be rejected, suggesting that the observed difference is unlikely to be due to chance.
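The test itself can be reproduced with SciPy; the per-subject accuracy vectors below are placeholders, not our measured results.

```python
from scipy.stats import wilcoxon

# Per-subject LOSO accuracies for two methods (illustrative values only).
baseline = [0.71, 0.68, 0.74, 0.69, 0.72, 0.70, 0.66, 0.73]
proposed = [0.88, 0.85, 0.90, 0.86, 0.89, 0.87, 0.84, 0.91]

stat, p_value = wilcoxon(baseline, proposed, alternative="two-sided")
print(f"p = {p_value:.4f}")  # reject the null hypothesis if p < 0.05
```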
4.7. Ablation Experiments
We conducted ablation experiments to evaluate the individual contribution of each component in our proposed model. Specifically, we systematically evaluated the model by adding or removing three key components: DFT length normalization, initial-point normalization, and data augmentation techniques.
The baseline 1D CNN-TCN model was compared against various configurations to isolate the impact of each component and their combinations. Since eye-writing speed, duration, and motion range vary considerably across subjects and datasets, we standardized input lengths in the baseline model by zero-padding the raw signals to fixed lengths specific to each dataset. Although zero-padding may introduce some noise, we minimized its impact by applying the padding to the raw input signals before feature extraction, so that it occurred prior to the fully connected layer. In the baseline model, raw signals were zero-padded, without truncation, to fixed lengths of 1300, 1800, and 512 samples for the Japanese Katakana, EOG-captured Arabic numbers, and webcam-captured Arabic numbers datasets, respectively. This naive approach serves as a reference point for a fair assessment of the effectiveness of our proposed preprocessing and augmentation techniques.
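The padding itself is straightforward; a minimal sketch is shown below, assuming (T, 2) raw signals that never exceed the dataset-specific target length (consistent with the no-truncation setting above).

```python
import numpy as np

def zero_pad(signal, target_len):
    """Pad a (T, 2) raw signal with trailing zeros up to `target_len` samples.
    Used only in the baseline configuration; no truncation is applied."""
    x = np.asarray(signal, dtype=float)
    pad = target_len - x.shape[0]           # assumed non-negative (T <= target_len)
    return np.pad(x, ((0, pad), (0, 0)))    # e.g. target_len = 1300 for Japanese Katakana
```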
Table 9 presents the results of these experiments across different datasets. The configurations tested include: (1) the baseline 1D CNN-TCN model without any preprocessing or augmentation, (2) the model with DFT length normalization, (3) the model with DFT length normalization and initial-point normalization, (4) the model with DFT length normalization and data augmentations, and (5) the full model with DFT length normalization, initial-point normalization, and data augmentations. In all configurations, the filtering steps (median filter, low-pass filter, and DC block filter) were consistently applied to the Japanese Katakana and EOG-captured Arabic numbers datasets, as these filters are essential for enhancing signal quality. The webcam-captured Arabic numbers dataset did not include these filtering steps, as the raw signals were already of sufficient quality for effective processing by the model. Each measurement represents the average across LOSO cross-validation over all subjects in the corresponding dataset.
Analysis of the results reveals several findings across all datasets. The partial model with DFT length normalization achieved substantial improvements in accuracy over the baseline model: 15.99 pp for the Japanese Katakana dataset (87.74% vs. 71.75%), 27.97 pp for the EOG-captured Arabic numbers dataset (96.30% vs. 68.33%), and 34.42 pp for the webcam-captured Arabic numbers dataset (95.37% vs. 60.95%). This improvement is attributed to the length normalization, which standardizes input lengths and thereby facilitates feature extraction.
In addition, initial-point normalization and data augmentations also contributed to performance improvements complementary to DFT length normalization. The full model improved the accuracy by 2.76 and 4.94 pp for the Japanese Katakana dataset compared to models without initial-point normalization and without augmentations, respectively. For the EOG-captured Arabic numbers dataset, the full model improved the accuracy by 2.22 and 1.29 pp compared to models without initial-point normalization and without augmentations, respectively. For the webcam-captured Arabic numbers dataset, the full model improved the accuracy by 2.31 and 0.73 pp compared to models without initial-point normalization and without augmentations, respectively. These results indicate that both initial-point normalization and data augmentations effectively enhance the model’s robustness against a wider variety of eye-writing patterns.
Table 10 shows the calculated
p-values for the ablation study comparing the baseline model with each of the other configurations across all datasets. All
p-values were found to be less than the significance level of 0.05, indicating that improvements in accuracy were statistically significant relative to the baseline across all configurations.
Table 11 presents the results of the ablation experiments on the fusion components. Comparing the performance of the 1D CNN, the TCN, and their combination in both parallel and serial integration (where, in the serial case, the 1D CNN output is fed into the TCN), the parallel model outperforms both individual models and the serially fused model. Across all datasets, the parallel model achieved the highest accuracy, demonstrating the effectiveness of the proposed model and the benefits of the fusion architecture.
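For illustration, the structural difference between the two fusion schemes can be sketched as follows: in parallel fusion the two branches process the same input and their pooled features are concatenated before classification, whereas serial fusion would feed the CNN output into the TCN. The layer sizes, kernel sizes, and the single dilated convolution standing in for the TCN branch are placeholders and do not reproduce the proposed architecture.

```python
import torch
from torch import nn

class ParallelFusion(nn.Module):
    """Sketch of parallel fusion: a CNN branch and a TCN-like branch run on the
    same input; their pooled features are concatenated and classified."""
    def __init__(self, in_ch=2, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(in_ch, 32, 5, padding=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1))
        # A single dilated convolution stands in for the TCN branch (placeholder).
        self.tcn = nn.Sequential(nn.Conv1d(in_ch, 32, 3, padding=2, dilation=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1))
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, channels, time)
        features = torch.cat([self.cnn(x), self.tcn(x)], dim=1).flatten(1)
        return self.fc(features)
```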