2.2. Deep Learning Approaches
In the case of user classification based on typing patterns for variable text, as with fixed text, researchers commonly choose machine learning methods as classifiers. In 2014, a publication [10] introduced a method for user identification based on keystroke dynamics using multilayer perceptron (MLP) neural networks. The authors extracted features from any text, without restricting the nature of the key sequences typed by the study participants. The dataset was collected over a 5-month period, recording sequences of key presses by 53 users during their daily computer-related activities, resulting in a resource containing an average of 18,008 key press events per user. It is worth noting that the data used in the experiment were obtained in an uncontrolled manner.
The decision model constructed by the researchers was based on multilayer perceptron (MLP) neural networks. Similarly to other studies [11,12], two features were extracted during the preparation of the model's input data: the time between pressing and releasing a key, Htime (hold time), and the time between releasing the current key and pressing the next key, UDtime (up-to-down time). The decision model relied on two neural networks, each fed by a single feature extracted from the input data. Input and output data at the neural network level were normalized using the min–max algorithm. To evaluate the system's effectiveness for each user, the set of obtained sequences was divided into training and validation sets, with 1500 digraphs included in the training set; the remaining data for a given user were placed in the validation set. Cross-validation was applied during model evaluation using the leave-one-out approach, repeating the experiment 53 times for the same user. The experiment yielded an average false acceptance rate (FAR) of 0.0152%, an average false rejection rate (FRR) of 4.82%, and an equal error rate (EER) of 2.46%. The results obtained by the authors can be considered above average compared to previous works in the field of personal identification based on keystroke dynamics.
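The two timing features described above can be computed directly from raw keyboard events. The sketch below is illustrative only (not the authors' code, and the event format is an assumption): it derives Htime and UDtime for each consecutive key pair and applies the min–max normalization mentioned in the study.

```python
# Illustrative sketch (hypothetical event format): each event is a
# (key, press_time, release_time) tuple with timestamps in milliseconds.

def extract_features(events):
    """Return per-digraph (H_time, UD_time) pairs from an ordered event list."""
    features = []
    for cur, nxt in zip(events, events[1:]):
        h_time = cur[2] - cur[1]   # hold time: release - press of the same key
        ud_time = nxt[1] - cur[2]  # up-to-down: next press - current release
        features.append((h_time, ud_time))
    return features

def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] using the min-max algorithm."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical event stream: (key, press_ms, release_ms)
events = [("t", 0, 80), ("h", 150, 230), ("e", 300, 370)]
feats = extract_features(events)
# Htime of "t" is 80 ms; UDtime between "t" and "h" is 150 - 80 = 70 ms
```

Note that UDtime can be negative when a user presses the next key before releasing the previous one, which is common for fast typists.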
Despite the high interest in machine learning methods for analyzing typing patterns in variable text, statistical methods remain an area of active research for this problem. In 2019, Ayotte et al. presented a user authentication method based on keystroke dynamics for variable text, employing statistical methods in the decision model [13]. The authors utilized the Clarkson II dataset for their experiment. During the initial data processing and feature extraction, the researchers chose to use the following features: the first key code K1, the second key code K2, and only one of the available digraph time features, the time between pressing the first key and pressing the second key, DDtime (down-to-down time). Three statistical methods were used and compared in the study:
KDE (Kernel Density Estimation) [14];
ED (Energy Distance) [15];
Kolmogorov–Smirnov test [16].
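Of the three methods, the two-sample Kolmogorov–Smirnov test has the simplest form: it compares the empirical distribution of a claimant's digraph latencies against a stored reference profile. The sketch below is a minimal, dependency-free implementation of the KS statistic, not the authors' code; in practice a library routine such as `scipy.stats.ks_2samp` would be used.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs. Smaller values suggest the
    two latency samples are more likely to come from the same typist."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    d = 0.0
    for x in points:
        cdf_a = sum(1 for v in a if v <= x) / len(a)
        cdf_b = sum(1 for v in b if v <= x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A verifier would typically compute this statistic per digraph (e.g., for all "th" latencies) and combine the per-digraph scores into a single decision.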
Furthermore, the study presented results for classifiers using the fusion of these methods. The authors also explored the impact of the digraph vector length on classification accuracy. Experiments were conducted for digraph sequences of different lengths: 100, 200, 500, and 1000. For a feature vector consisting of only 100 digraphs, the lowest EER coefficient, at 35.1%, was achieved for the classifier based on the fusion of KDE and ED methods. For sequences of length 200, the best result was achieved for a model based on the fusion of all three methods analyzed by the authors, i.e., KDE, ED, and Kolmogorov–Smirnov test, setting the EER coefficient at 15.3%. For a feature vector of length 500, the lowest EER value, 6.3%, was achieved for the KDE method. The lowest EER value, at 3.6%, was obtained for a feature vector of length 1000 and a classifier based on the fusion of three methods, KDE, ED, and Kolmogorov–Smirnov test.
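The EER values reported throughout this review are the error rates at the operating point where the false acceptance and false rejection curves cross. The following sketch (an assumption about how such curves are post-processed, not code from any cited study) approximates the EER from FAR/FRR values sampled at a common set of thresholds.

```python
def equal_error_rate(far, frr, thresholds):
    """Approximate the EER: the error rate at the threshold where the
    false acceptance (FAR) and false rejection (FRR) curves cross.
    far and frr are lists of rates evaluated at the same thresholds."""
    # Pick the threshold index where the two curves are closest,
    # then average the two rates there as the EER estimate.
    best = min(range(len(thresholds)), key=lambda i: abs(far[i] - frr[i]))
    return (far[best] + frr[best]) / 2
```

With a finer threshold grid, the estimate converges to the true crossing point of the two curves.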
A year later, the same authors published a research paper [17], continuing the investigations from the 2019 article [13], where satisfactory results were obtained for the Clarkson II dataset. A significant drawback of their earlier solution was the minimum required length of a single sequence, which was 1000; in systems requiring rapid detection of unauthorized access, such an approach is not feasible. Building on the conclusions from the previous article, the authors explored other statistical methods and proposed their own method, called ITAD (Instance-based Tail Area Density), to reduce the required minimum sequence length for the Clarkson II dataset.
For comparison, the authors extracted the same features as before: the first key code K1, the second key code K2, and the time between pressing the first key and pressing the second key, DDtime. Additionally, compared to the previous study, the authors extracted the following features for the Clarkson II dataset: the time between releasing the first key and pressing the second key, UDtime (up-to-down time); the time between pressing the first key and releasing the second key, DUtime (down-to-up time); and the time between releasing the first key and releasing the second key, UUtime (up-to-up time). A comparative experiment was conducted based on the EER metric for sequences of length 50. The study compared the effectiveness of each method, also examining the impact of the feature vector size on classification accuracy. According to the authors, identification based on the fusion of all features allows for a reduction in the equal error rate (EER). For the Clarkson II dataset, the ITAD method with the fusion of all features achieved the lowest EER value among all approaches, at 12.3%. In another experiment, the effectiveness of the proposed classifier was examined for different sequence lengths, namely 10, 20, 100, and 200. For the shorter sequences of lengths 10 and 20, high EER errors of 22.1% and 17.7% were recorded, respectively. Nevertheless, compared to the previous article [13], a significant improvement was achieved for sequences of lengths 100 and 200, setting the EER at 9.07% and 7.8%, respectively. Additionally, the authors conducted an experiment on the impact of sequence length for another dataset, namely the Buffalo dataset [18]. For sequences of lengths 10, 20, 50, 100, and 200, the EER was 19.9%, 13.6%, 8%, 5.3%, and 3%, respectively. The results obtained by the authors for both the Clarkson II and Buffalo datasets can be considered above average compared to the results of other studies in the field of user identification based on keystroke dynamics.
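The four digraph timing features used in this study all derive from the same four timestamps of a key pair. As a minimal sketch (not the authors' code), given the press (p) and release (r) times of two consecutive keys:

```python
def digraph_timings(p1, r1, p2, r2):
    """All four digraph timing features from the press (p) and release (r)
    timestamps of two consecutive keys, in milliseconds."""
    return {
        "DD": p2 - p1,  # down-to-down: press to press
        "UD": p2 - r1,  # up-to-down: may be negative when keys overlap
        "DU": r2 - p1,  # down-to-up: first press to second release
        "UU": r2 - r1,  # up-to-up: release to release
    }
```

For example, for a digraph typed as press at 0 ms, release at 80 ms, press at 150 ms, release at 230 ms, this yields DD = 150, UD = 70, DU = 230, UU = 150.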
An area of intensive research in recent years is the feasibility of applying recurrent and convolutional neural networks to typing pattern analysis for variable text. In 2019, Lu et al. [19] presented a binary classifier based on RNN [20] and CNN [21] neural networks. The authors used the publicly available Buffalo dataset [18]. During dataset processing and preparation of the model's input data, the authors extracted K1, K2, H1time, H2time, UDtime, and DDtime, for a total of six features per digraph. Due to the input format required by RNN-type neural networks, the authors aggregated the feature vectors into sequences, each representing a two-dimensional feature vector for consecutively occurring digraphs of a given user. In the experiment, the impact of several variables on classification accuracy was analyzed.
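The aggregation step above can be sketched as follows: per-digraph six-feature vectors are grouped into fixed-length sequences, each a two-dimensional array of shape (sequence length, number of features), which is the standard input format for a recurrent network. This is an illustrative sketch, not the authors' implementation.

```python
def to_sequences(feature_rows, seq_len):
    """Group consecutive per-digraph feature vectors into non-overlapping
    sequences of length seq_len. Each sequence has shape
    (seq_len, n_features), the input format expected by an RNN.
    Trailing rows that do not fill a whole sequence are dropped."""
    sequences = []
    for start in range(0, len(feature_rows) - seq_len + 1, seq_len):
        sequences.append(feature_rows[start:start + seq_len])
    return sequences

# Hypothetical usage: 25 digraphs with 6 features each, sequence length 10
rows = [[float(i)] * 6 for i in range(25)]
seqs = to_sequences(rows, 10)  # yields 2 full sequences; 5 rows dropped
```

The choice of `seq_len` is exactly the sequence-length parameter whose influence the authors studied below.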
To determine the optimal sequence length, the authors examined the classification accuracy for sequences of different lengths: 10, 30, 50, 70, and 100. The experiment demonstrated that sequence length significantly influences classification accuracy. According to the authors, a sequence that is too short contains insufficient information about a user. For the experiment with a sequence length of 10, unsatisfactory results were obtained, where the values of the FRR, FAR, and EER coefficients were 16.02%, 3.48%, and 9.75%, respectively. Furthermore, the authors showed that a sequence that is too long contains noise that negatively affects identification effectiveness. For a sequence length of 100, satisfactory but not the best results for the entire experiment were achieved, where the FRR, FAR, and EER coefficients were set at 2.67%, 7.57%, and 5.12%, respectively. According to the authors, based on the lowest values of the FRR, FAR, and EER coefficients, the optimal sequence length for the Buffalo dataset is 30. For the experiment with the optimal sequence length of 30, the coefficient values were 1.95% for FRR, 4.12% for FAR, and 3.04% for EER. The authors also demonstrated that the feature set used to build the model affects classification accuracy. In the experiment, the optimal sequence length determined in the previous experiment was used, which was 30. As a result of the research, for a feature set containing only a subset of all extracted features, namely K1, K2, H1time, and H2time, the lowest classification accuracy was obtained, where the FRR and FAR coefficients were set at 12.39% and 5.96%, respectively, and the EER coefficient reached 9.17%. The highest classification accuracy was achieved in the experiment in which the model was powered by the full set of extracted features: K1, K2, H1time, H2time, UDtime, and DDtime, where the FRR, FAR, and EER coefficients were 1.95%, 4.12%, and 3.04%, respectively. 
In the last experiment, the study investigated and compared classification accuracy depending on the neural network architecture. The authors took into account the conclusions from the previous experiments, conducting studies for the optimal sequence length (30) and the optimal feature set (K1, K2, H1time, H2time, UDtime, and DDtime). The analysis showed that using only recurrent neural networks based on GRU cells as classifiers yields satisfactory results, with metric values of 4.05% for FRR, 6.01% for FAR, and 5.03% for EER. However, the CNN + RNN architecture achieved higher classification accuracy than the RNN network alone. According to the authors, the convolutional layer preceding the GRU units allows for the extraction of higher-order features from the input data, thus improving identification effectiveness. The FRR, FAR, and EER values for the classifier with the CNN + RNN architecture were 1.95%, 4.12%, and 3.04%, respectively. It should also be emphasized that the authors' use of the public Buffalo dataset allows for a reliable comparison of results with other studies in the field of user identification based on keystroke dynamics.
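The GRU cell at the core of these recurrent classifiers updates its hidden state once per digraph in the sequence. The following is a minimal NumPy sketch of the standard GRU update equations, with hypothetical sizes chosen to match the setting above (6 input features, 16 hidden units); it is not the authors' implementation, which used a deep learning framework.

```python
import numpy as np

def gru_step(x, h, p):
    """One GRU cell update following the standard gating equations.
    x: input vector, h: previous hidden state, p: weight dictionary."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sig(p["Wz"] @ x + p["Uz"] @ h + p["bz"])              # update gate
    r = sig(p["Wr"] @ x + p["Ur"] @ h + p["br"])              # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])  # candidate state
    return (1.0 - z) * h + z * h_cand                         # blended new state

# Hypothetical sizes: 6 features per digraph, 16 hidden units (as in the study)
rng = np.random.default_rng(0)
params = {name: rng.standard_normal(shape) * 0.1
          for name, shape in [("Wz", (16, 6)), ("Uz", (16, 16)), ("bz", (16,)),
                              ("Wr", (16, 6)), ("Ur", (16, 16)), ("br", (16,)),
                              ("Wh", (16, 6)), ("Uh", (16, 16)), ("bh", (16,))]}
h = np.zeros(16)
for x in rng.standard_normal((30, 6)):  # a sequence of 30 digraph vectors
    h = gru_step(x, h, params)
# h now summarizes the whole sequence and would feed a dense output layer
```

Because the new state is a convex combination of the old state and a tanh-bounded candidate, every hidden activation stays within (-1, 1).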
A year later, Lu et al., continuing their own research [19], reanalyzed the possibility of using neural networks in the CNN + RNN architecture as a binary classifier based on typing patterns, examining the impact of various factors on model effectiveness [22]. Two public datasets, Clarkson II and Buffalo, were used. The authors examined a series of parameters, both in the feature extraction domain and in the recurrent and convolutional networks, conducting experiments for different network architectures and their parameters. In the first experiment, the impact of the type of recurrent network was analyzed, comparing LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks and additionally varying the number of cells in each layer. As a result, the model using GRU-type recurrent neural networks with 16 units in a single layer achieved the lowest EER, at 14.28% for the Clarkson II dataset and 3.61% for the Buffalo dataset. In the second experiment, the impact of using convolutional networks for extracting higher-order features was examined. The lowest EER value for this experiment was obtained for the CNN + GRU (16) network variant, reducing the EER compared to the previous experiment to 2.67%. In the next analysis, the impact of the kernel size parameter on user verification accuracy was examined; the optimal value of this parameter for both datasets, Clarkson II (with an EER of 6.61%) and Buffalo (with an EER of 2.67%), turned out to be 2. In the fourth study, the accuracy of binary classification was analyzed depending on the sequence length, obtaining an optimal sequence length of 50 for both datasets and recording an EER of 5.97% for the Clarkson II dataset and 2.36% for the Buffalo dataset. In the last experiment, the effectiveness of the decision model was compared for different numbers of extracted features, assuming a constant sequence length of 50 and conducting studies only for the Buffalo dataset.
During the analysis, it was shown that the number of extracted features has a significant impact on classification accuracy, with the lowest EER value, 2.36%, achieved for an approach in which the feature vector contained the features K1, K2, H1time, H2time, UDtime, and DDtime. However, the authors note that the experiment was conducted for the Buffalo dataset only within the scope of the first task, i.e., the transcription of Steve Jobs's speech. For other datasets, such as Clarkson II, or for the second task in the Buffalo dataset, the extraction of the DDtime feature may result in a decrease in classification accuracy due to possible long breaks in typing.
The research conducted by Lu et al. in 2020 demonstrated that neural networks in the CNN + RNN architecture, used as a method for classifying users based on typing patterns, allow for promising results, thereby inspiring other researchers to further studies. Building on the previously proposed approach, Kasprowski et al. in 2022 analyzed the effectiveness of neural networks in the CNN + RNN architecture, examining the impact of the presence of individual layers and their parameters on classification accuracy [23]. The Buffalo dataset was used for the studies; however, in contrast to the base article, the dataset was limited to the second task, i.e., keyboard events recorded during arbitrary user activity, which better reflects real user behavior when operating a computer. It is also worth mentioning that the authors applied an overlapping window mechanism, setting the shift coefficient at 40%. The approach proposed in the paper assumed the construction of a multi-class classifier with 20 classes, where each class is represented by one user from the Buffalo dataset. The model's effectiveness was then evaluated depending on the architecture and its parameters. To reduce the time complexity of the experiments, data for each user were limited to 1500 keyboard events, and the resulting input set was divided into a training set (75%) and a testing set (25%). The study showed that the CNN + RNN architecture allows for higher effectiveness in identifying users based on typing patterns than models using only one type of network (either CNN or RNN). Furthermore, one of the studies conducted by the authors confirmed the conclusion of the base work [22] regarding the optimal sequence length, demonstrating that the correct sequence length for the Buffalo dataset falls within the range of 40 to 60. Additionally, the research extended the architecture of the base model described in [22] by adding additional CNN and GRU layers, achieving higher effectiveness compared to the base model. For the proposed model, the accuracy (ACC) was 87%. The study also examined the impact of kernel size and the number of filters in the CNN layer on classification accuracy, confirming that the optimal kernel size is 2, while the optimal number of filters, according to the authors, falls within the range of 64 to 256. It is also worth noting that, to reduce overfitting, the authors applied dropout layers between certain layers of the network. According to the authors, the optimal dropout rate is 0.5, with a noted decline in the model's ability to generalize for lower dropout rate values.
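The overlapping window mechanism with a 40% shift coefficient can be sketched as follows. This is an illustrative interpretation (the exact windowing details of the study are not reproduced here): each new window starts 40% of the window length after the previous one, so consecutive windows share 60% of their events, which multiplies the number of training samples per user.

```python
def overlapping_windows(events, window_len, shift_ratio=0.4):
    """Split an event stream into overlapping fixed-length windows.
    A shift_ratio of 0.4 means each window starts 40% of window_len
    after the previous one (60% overlap between neighbors)."""
    step = max(1, int(window_len * shift_ratio))
    return [events[i:i + window_len]
            for i in range(0, len(events) - window_len + 1, step)]

# Hypothetical usage: 100 events, windows of 50 with a 40% shift
wins = overlapping_windows(list(range(100)), 50)  # starts at 0, 20, 40
```

Compared to non-overlapping windows, this roughly multiplies the sample count by 1/shift_ratio, at the cost of correlated training examples.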
According to the above review of key scientific publications in this field, researchers' interest in user identification based on typing patterns began relatively early. However, only recently have studies been conducted that allow for a comparison of results obtained by individual authors. Additionally, the recent development of deep learning techniques has clearly come to dominate this field and has contributed to further improvements in results. These observations inspired the authors of this paper to make their own attempt to define the architecture of, and build, a user verification system, also using deep learning techniques.