Article

DoubleStrokeNet: Bigram-Level Keystroke Authentication

1 Computer Science and Engineering Department, National University of Science and Technology POLITEHNICA of Bucharest, 313 Splaiul Independentei, 060042 Bucharest, Romania
2 Research Technology, 19D Soseaua Virtutii, 060782 Bucharest, Romania
3 Academy of Romanian Scientists, Str. Ilfov, Nr. 3, 050044 Bucharest, Romania
* Author to whom correspondence should be addressed.
Electronics 2023, 12(20), 4309; https://doi.org/10.3390/electronics12204309
Submission received: 9 September 2023 / Revised: 12 October 2023 / Accepted: 17 October 2023 / Published: 18 October 2023
(This article belongs to the Special Issue Novel Approaches in Cybersecurity and Privacy Protection)

Abstract
Keystroke authentication is a well-established biometric technique that has gained significant attention due to its non-intrusive and continuous characteristics. The method analyzes the unique typing patterns of individuals to verify their identity while interacting with a keyboard, whether virtual or physical. Current deep-learning approaches like TypeNet and TypeFormer focus on generating biometric signatures as embeddings for the entire typing sequence. The authentication process is defined using the Euclidean distances between the new typing embedding and the saved biometric signatures. This paper introduces a novel approach called DoubleStrokeNet for authenticating users through keystroke analysis using bigram embeddings. Unlike conventional methods, our model targets the temporal features of bigrams to generate user embeddings. This is achieved using a Transformer-based neural network that distinguishes between different bigrams. Furthermore, we employ self-supervised learning techniques to compute embeddings for both bigrams and users. By harnessing the power of the Transformer's attention mechanism, the DoubleStrokeNet approach represents a significant departure from existing methods. It allows for a more precise and accurate assessment of user authenticity, specifically emphasizing the temporal characteristics and latent representations of bigrams in deriving user embeddings. Our experiments were conducted using the Aalto University keystroke datasets, which include 136 million keystrokes from 168,000 subjects using physical keyboards and 63 million keystrokes acquired on mobile devices from 60,000 subjects. DoubleStrokeNet outperforms the TypeNet-based authentication system using 10 enrollment typing sequences, achieving Equal Error Rate (EER) values of 0.75% and 2.35% for physical and touchscreen keyboards, respectively.

1. Introduction

The market for biometric authentication has been growing rapidly over the past few years owing to the increasing need for secure authentication methods in various industries such as banking, healthcare, and government. Biometric authentication has several advantages over traditional authentication methods, such as passwords or PINs, since biometric features are unique to each individual and cannot be easily replicated or stolen. The increasing demand for security and convenience in digital transactions has driven the adoption of biometric authentication methods. The global biometric system market size was valued at USD 29.09 billion in 2021 and USD 30.77 billion in 2022. The market is expected to reach USD 76.70 billion by 2029 with a Compound Annual Growth Rate (CAGR) of 13.9% during the forecast period [1].
Keystroke biometrics is a type of behavioral biometric authentication that uses the unique typing patterns of individuals to verify their identities. Keystroke biometrics has several advantages over other biometric authentication methods, such as facial or fingerprint recognition, because it does not require any additional hardware and can be implemented on almost any device with a keyboard. Keystroke biometrics has several use cases in industries such as banking, e-commerce, and healthcare, where it can be used to authenticate users securely and conveniently. The global keystroke dynamics market size reached USD 390.6 million in 2022 and is expected to reach USD 1.4 billion by 2028, growing at a CAGR of 23.4% during the forecast period (2023–2028) [2].
The concept of keystroke authentication assumes that each individual has a unique typing rhythm and timing pattern, including keypress duration, latency, and intervals between keystrokes. This information can be used to create a biometric template for the user, which can be compared to their subsequent typing for authentication. This study is motivated by the critical importance of enhancing the reliability and effectiveness of keystroke biometric authentication systems, which play a pivotal role in securing sensitive digital information and safeguarding against unauthorized access.
To the best of our knowledge, the existing approaches are centered on generating the most representative sequence embeddings. The training samples (enrolling typing sequences) are used to create an embedding gallery for each user. A score (average Euclidean distance for each sample in a gallery) is computed for each user’s gallery to authenticate a user. The performance of the model is evaluated using the Equal Error Rate (EER), which is the point where the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are equal. FAR represents the proportion of impostors incorrectly classified as genuine, while FRR represents the proportion of genuine users incorrectly classified as impostors. EER is calculated for each subject and then averaged across all subjects to obtain an overall performance metric.
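The EER computation described above can be sketched in a few lines of plain Python (an illustrative helper, not the paper's released code): given lists of genuine and impostor scores, where a higher score indicates a more likely genuine user, sweep candidate thresholds and report the operating point where FAR and FRR coincide.

```python
def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostor scores accepted (>= threshold);
    FRR: fraction of genuine scores rejected (< threshold)."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Sweep observed scores as thresholds; return the error rate at the
    point where |FAR - FRR| is smallest (the EER operating point)."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]
```

In the paper's protocol, this per-subject EER would then be averaged across all subjects.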
The intuition driving this research is that a concentrated focus on individual pairs of keys, instead of aggregated sequence embeddings, holds the potential for substantial improvement in keystroke biometric authentication. By homing in on the nuanced interactions between specific keys, this study seeks to uncover patterns and characteristics that may have been overlooked in previous approaches. This granular perspective aims to discriminate between genuine and impostor keystroke sequences, ultimately fortifying the security and accuracy of the authentication process.

Current Study Objectives

The primary focus of this study revolves around the self-supervised learning of keys and user embeddings to ensure keystroke biometric authentication. This endeavor adopts a training sample generation approach inspired by the ELECTRA method [3], involving the creation of intentionally corrupted typing sequences. A Transformer discriminator neural network discerns between the original and replaced elements. The Transformer’s attention mechanism allows DoubleStrokeNet to capture intricate dependencies and relationships between individual keys, enabling it to discern subtle variations in typing patterns. This capability is crucial for generating accurate user embeddings, enabling the model to differentiate between genuine and imitated keystrokes. Additionally, the Transformer’s proficiency in learning bigram embeddings, which represent the interactions between pairs of keys, provides a detailed understanding of the subtle dynamics within typing sequences.
The main contributions of this study are the following:
  • Providing a bigram-level authentication system that leverages each individual keystroke in the sequence instead of relying on a sequence embedding.
  • Introducing a self-supervised approach to learning representative bigram-level authentication patterns using sequence corruption.
  • Introducing a dual view on how we model authentication by splitting the user and sequence representations. The objective is defined as learning to determine if a user u corresponds to an enrollment sequence s. This enables fast fine-tuning times when integrating new users in the platform by solely learning new user embeddings without having to perform parameter updates on the Transformer network.
The code has been released as open source at https://github.com/readerbench/doublestrokenet (accessed on 9 October 2023).

2. Related Work

Keystroke biometric systems are widely used for authentication on various devices that support text entry, such as desktops, laptops, and mobile devices with virtual keyboards. The timing differences between KeyPressed and KeyReleased events for each keystroke are used to identify an individual’s keystroke pattern, making keystroke biometric systems easily deployable.
There are two categories of keystroke biometric systems: fixed-text, where the typed sequence is predefined, such as a username or password, and free-text, where the keystroke sequence is arbitrary, such as writing an email body or sentence with errors. Notably, free-text inputs result in different keystroke sequences between gallery and test samples compared to fixed-text inputs. Biometric authentication algorithms for desktop and laptop keyboards have been studied mainly in fixed-text scenarios where accuracies exceeding 95% are typical. The user authentication problem can be formulated as a multivariate time series classification. The sequence is usually based on the time differences between KeyPressed and KeyReleased events, and the labels are the users.
Deep-learning approaches focus on generating embeddings from typing sequences. The first typing sequence encoder for biometric authentication with good results, TypeNet [4], was based on a Recurrent Neural Network with Long Short-Term Memory (LSTM) cells [5]. TypeNet [4] outperformed traditional state-of-the-art free-text algorithms in both mobile and desktop scenarios, even with a reduced amount of enrollment data. The success of TypeNet was mainly owing to the rich embedding feature vector it generated, which required less data for enrollment. The TypeNet model generated the best results when training was paired with a TripletLoss [6]:
L_triplet = max(d(a, p) − d(a, n) + α, 0)
The TripletLoss was computed using samples that contained an anchor (a), a positive (p), and a negative (n) input; d measured the distance between pairs of samples, while  α  was a margin value reflecting the minimum acceptable distance between the anchor-positive and the anchor-negative pairs. The goal of the TripletLoss was to learn a feature representation for items such that similar items were mapped closer together in the learned space; in contrast, dissimilar items were mapped further apart. The TripletLoss in the keystroke authentication setting was employed such that the embeddings of the typing sequences for a user were mapped closer together in the learned space while typing sequences from other users were kept further apart.
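As a minimal sketch of the loss above (illustrative only, using plain Euclidean distance on small vectors in place of learned sequence embeddings):

```python
import math

def euclidean(u, v):
    """Euclidean distance d between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0): the loss is zero once the
    negative is at least `margin` farther from the anchor than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)
```

With well-separated embeddings the loss vanishes, which is exactly the geometry the enrollment gallery relies on.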
With the introduction of the Transformer [7], the performance of multivariate time series classification has improved significantly. The Transformer architecture, built on self-attention mechanisms and parallel processing, has proven highly versatile, especially on sequential data. The Gated Transformer Network [8] is considered a state-of-the-art model for the classification of multivariate time series. Its architecture comprises two encoders: (a) a stepwise encoder that encodes the temporal features relying on self-attention, and (b) a channel-wise encoder that computes attention weights among different channels across all timestamps. The former uses the time series as input, linearly projected and augmented with positional encodings; the latter receives as input the transpose of the time series, also linearly projected. Both branches contain encoders based on the vanilla Transformer architecture, and the result is generated using a gating mechanism that weights the output of each Transformer.
The first Transformer-based encoder [9] used for keystroke biometrics utilizes the Gated Transformer Network to generate sequence embeddings. Performance was significantly improved compared to the LSTM-based TypeNet model with a global EER (Equal Error Rate) of 6.26% and an average EER of 3.93% on the Aalto Mobile Dataset [10]. The current state-of-the-art model, the TypeFormer [11], is also based on a Gated Transformer Network, the difference being the insertion of a recurrent layer between the Transformer layers of the stepwise encoder. The recurrent layer is based on the Block-Recurrent Transformer [12].
All existing approaches focus on mapping the typing sequences to embeddings. In the learned space, the embeddings of typing sequences originating from the same user are mapped closer together, whereas sequences originating from different users have embeddings mapped further apart.

3. Method

Our work is centered on learning key and user embeddings in a self-supervised way with the goal of keystroke biometric authentication. The training sample generation process is based on the ELECTRA [3] method of constructing corrupted typing sequences, while the differentiation of the original and replaced elements is conducted using a Transformer discriminator neural network.

3.1. Datasets

Our models were trained and evaluated using the Aalto University datasets. The university released two typing datasets for scientific use: the Aalto Desktop Dataset [13] and the Aalto Mobile Dataset [10].
The Aalto Desktop Dataset contains samples from 168,000 participants, which amounts to 5.1 GB of data. The study analyzed the typing behavior of volunteers in an online study and reported findings regarding predictive letter pairs and rollover typing. The data allowed for unsupervised clustering of typists into eight groups based on performance, accuracy, hand/finger usage, and rollover. This dataset contains 15 samples for each user corresponding to 15 short phrases written in simple language. The study found that modern typing behavior has similarities to typewriting but also notable differences, particularly in hand usage. Although the average performance is slower, the uncorrected error rates are similar. Substitution errors are more frequent than insertion and omission errors. The lower bound for inter-key intervals is 60 ms. The left hand is generally slower than the right hand, but the effect is negligible for modern typists. The hand-alternation benefit is less pronounced than in typewriting studies. Using different fingers of the same hand provides similar benefits to using fingers of different hands.
The Aalto Mobile Dataset contains samples from 37,000 participants, which amounts to 5.6 GB of data. The study presented a large-scale dataset on mobile text entry collected via a web-based transcription task performed by volunteers. The average typing speed was 36.2 WPM with 2.3% uncorrected errors. The dataset enables statistical analysis of the correlation between typing performance and various factors, such as demographics, finger usage, and intelligent text entry techniques. The study analyzed performance metrics for text entry research using the standard definition. The metrics were computed during runtime and averaged per sentence for each participant. The study revealed distributions of key text entry metrics and classified participants into four technique categories. The results showed that autocorrect users tend to be faster, while prediction users tend to be slower. An important fact to consider is that mobile keystroke biometrics are influenced by external factors such as the user’s position and surroundings more significantly compared to physical keyboards, which are usually immobile.
The use of web-based logging for data collection presents certain limitations when compared to inputting data through desktop keyboards, which directly capture raw device-level events and funnel them into input fields. The browser-side logging method used by Palin et al. [10] for mobile data acquisition is subject to several constraints akin to those encountered in various online transcription tests. The limitations are the following:
  • Undefined keycodes. Certain Android devices, as highlighted by Palin et al. [10], report undefined keycodes. This discrepancy in reporting hinders the accurate recording of pressed keys, leading to incomplete data for those devices.
  • Merged touch events. Many devices manifest a behavior where touch-down and touch-up events occur almost simultaneously during touch-up instances. This compressed event timing, sometimes as brief as <10 ms, results in inaccurate interpretation of keystroke durations, as the system fails to effectively distinguish between individual touch-down and touch-up events.
  • Multi-touch and rollover. The intricacies of multi-touch interactions and rollover scenarios pose another challenge. Specifically, the transmission of events related to these scenarios is not consistently accurate. In cases where multiple touch actions occur in quick succession or overlapping manner, the system’s event dispatching falters. This leads to scenarios where a key-up event associated with the first keystroke is erroneously triggered as soon as a second finger makes contact with the screen. This particular flaw renders the keycode of the initially pressed key inaccessible, as it remains held down. Consequently, this anomaly affects the precision of timestamp accuracy.
The benchmark proposed by Stragapede et al. [11] is based on the Aalto Mobile Dataset and inherits the above limitations, none of which were mitigated. The benchmark dataset comprises 1000 users with specific session IDs. However, upon analyzing the users' typing sessions, we found that sequences with undefined keycodes were pervasive, making the dataset unusable in this study. Figure 1 presents, for each user, the percentage of keycodes logged as undefined (i.e., with the value 229) out of all keycodes typed. For 51% of users, more than 80% of keystrokes were logged as undefined, yielding a clear separation between correct and faulty key-logging. Given the significance of individual keycodes in our approach, we decided to evaluate our system in a manner distinct from the one proposed in the TypeFormer Benchmark, since half of the users in that benchmark have the same keycode assigned to all keystrokes. This inconsistency in the data distribution (see Figure 1) might lead to skewed quantitative results.
Moreover, the benchmark's assumptions might not align with the real-world conditions that the evaluated algorithms are intended to address; this misalignment could lead to unrealistic expectations and misrepresentations of their actual performance. The dataset employed for benchmarking appears skewed towards certain types of scenarios, limiting the generalizability of the results and potentially favoring some algorithms over others. Aligning the benchmark more closely with real-world conditions would enable a more meaningful evaluation of algorithmic performance in practical scenarios.

3.2. Data Preprocessing and Feature Extraction

The data preprocessing for the Aalto Mobile Dataset begins with the exclusion of all users whose sequences have undefined keycodes for more than 50% of the keys. For both Aalto Datasets, two further preprocessing steps are performed: filtering out sequences with fewer than 25 keystrokes and those containing a time difference larger than 5 s between consecutive keystrokes, which can represent a pause in typing.
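The filtering steps above can be sketched as follows (an illustrative helper, assuming each sequence is a press-ordered list of (press_ms, release_ms, keycode) tuples; the mobile-only undefined-keycode filter is omitted for brevity):

```python
MIN_KEYSTROKES = 25   # sequences shorter than this are discarded
MAX_GAP_MS = 5000     # inter-press gaps over 5 s count as pauses in typing

def is_valid_sequence(seq):
    """Return True if the sequence passes both length and pause filters.
    seq: list of (press_ms, release_ms, keycode) tuples, ordered by press time."""
    if len(seq) < MIN_KEYSTROKES:
        return False
    for prev, cur in zip(seq, seq[1:]):
        if cur[0] - prev[0] > MAX_GAP_MS:
            return False
    return True
```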
Each element from the raw sequences has three dimensions: the press timestamp, the release timestamp, and the keycode of the key that was used. Timestamps are in UNIX format with millisecond resolution, and the keycodes are integers between 0 and 255 according to the JavaScript keycodes. The decision to exclude additional device features like tilt, display dimensions, and others from the mobile dataset is deliberate to ensure a device-agnostic authentication mechanism. By focusing solely on the raw sequences consisting of press timestamps, release timestamps, and keycodes, our research aims to develop a robust authentication system that is not reliant on specific hardware characteristics or configurations. The feature extraction process involves extracting four features commonly utilized in both fixed-text and free-text keystroke systems for each pair of consecutive keys (i.e., a key bigram) [14]. The features based on temporal differences as described in Figure 2 are the following:
  • HL (Hold Latency)—the time between key press and release events
  • IL (Inter-key Latency)—the time between releasing a key and pressing the next key
  • PL (Press Latency)—the time between two consecutive press events
  • RL (Release Latency)—the time between two consecutive release events
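The four temporal features above can be computed directly from consecutive keystroke events (a sketch, again assuming (press_ms, release_ms, keycode) tuples):

```python
def bigram_features(k1, k2):
    """k1, k2: (press_ms, release_ms, keycode) for two consecutive keys.
    Returns (HL, IL, PL, RL) in milliseconds."""
    p1, r1, _ = k1
    p2, r2, _ = k2
    hl = r1 - p1   # Hold Latency of the first key
    il = p2 - r1   # Inter-key Latency (can be negative on rollover typing)
    pl = p2 - p1   # Press Latency
    rl = r2 - r1   # Release Latency
    return hl, il, pl, rl

def extract_features(seq):
    """One (HL, IL, PL, RL) tuple per key bigram in the sequence."""
    return [bigram_features(a, b) for a, b in zip(seq, seq[1:])]
```

Note that IL can be negative when the second key is pressed before the first is released, which is precisely the rollover behavior observed in the desktop dataset.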
Previous approaches [4,11] used the normalized keystroke codes as an input feature, resulting in a time series with 5 features as input. In this work, an alternative approach is presented, which focuses on learning embeddings for each key to capture the inherent characteristics of each key, making it easier to extract meaningful features from the typing sessions. Learning key embeddings enables the model to recognize and differentiate between users based on their typing patterns. To encode each unique ASCII bigram, we would require an embedding table with  256 × 256  entries. Given that particular key combinations occur very rarely or not at all in certain datasets, this encoding scheme might lead to highly sparse training signals during optimization. To counteract this, the choice is made to utilize an embedding table that contains entries for each ASCII character, effectively reducing its size to 256. At runtime, the bigram is split into its 2 individual key codes, and their embeddings are concatenated to obtain the bigram embedding.
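The table-lookup-and-concatenate scheme described above can be sketched as follows (illustrative only: a randomly initialized toy table stands in for the learned embeddings, with the 256-dimension-per-key size from this work):

```python
import random

KEY_DIM = 256  # per-key embedding size used in this work

random.seed(0)
# One vector per ASCII code: 256 entries instead of a sparse 256 x 256
# table with one entry per bigram.
key_table = [[random.gauss(0.0, 0.02) for _ in range(KEY_DIM)] for _ in range(256)]

def bigram_embedding(code1, code2):
    """Split the bigram into its two keycodes and concatenate their
    per-key embeddings to form the bigram embedding."""
    return key_table[code1] + key_table[code2]
```

Because every bigram reuses the same 256 per-key vectors, even rare key combinations receive a meaningful (non-sparse) training signal.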
The input construction process transforms each bigram with 4 temporal features and the 2 keycodes into a higher dimensional embedding, as shown in Figure 3. The 4 raw features are transformed using a linear projection into a vector of dimension 256, followed by a normalization layer to stabilize the training and reduce the sensitivity to initialization. Batch normalization further stabilizes training by normalizing the activations within mini-batches, improving gradient flow, and reducing internal covariate shift. Each keycode is transformed using the key embedding table into a vector of 256 dimensions, and all 3 vector representations are concatenated, resulting in a vector of dimension 768. If a typing sequence has M elements, the resulting sequence after the input construction has the dimension  M × 768 . To authenticate the users based on the typing sequence, the user embedding is learned by appending a user representation vector at the beginning of the typing sequence.

3.3. Electra-Style Training

The ELECTRA [3] approach is a variant of the Transformer model architecture designed initially for pretraining on text data to learn word embeddings. The main idea behind ELECTRA is to learn to discriminate between original tokens and corrupt ones while generating improved representations of the words. Transforming the key authentication problem from a sequence classification problem into a token-level classification problem enables us to consider the ELECTRA approach. One of the key strengths of ELECTRA in this context is its ability to learn rich, contextualized representations of key bigrams and user embeddings while also capturing complex temporal patterns important for typing sequences. The discriminator-only ELECTRA loss used in our task is the following:
L_ELECTRA = −E_(x,y)[ log σ(D(x, y)) + log(1 − σ(D(x̃, y))) ]
where D(·,·) represents the DoubleStrokeNet discriminator network, σ represents the sigmoid function, which outputs the probabilities of the original/replaced labels, x represents the original input, x̃ represents the corrupted input, and y denotes the label.
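Numerically, this is a binary cross-entropy over per-token original/replaced decisions. A minimal sketch (scalar logits stand in for the discriminator's outputs; not the paper's released code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminator_loss(logits, labels):
    """Mean binary cross-entropy over per-token labels:
    label 1 = original token, label 0 = replaced token."""
    total = 0.0
    for logit, label in zip(logits, labels):
        p = sigmoid(logit)  # probability the token is original
        total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return total / len(logits)
```

A confident, correct discriminator (large positive logits on originals, large negative logits on replacements) drives the loss toward zero.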
During dataset preprocessing, all bigrams and their corresponding temporal features for every enrollment sequence available for a user are retained. This bigram-user mapping is used to generate corrupted or invalid typing sequences for users during training. As illustrated in Figure 4, the generation of training samples depends on a random probability that determines the type of training sample to be constructed. The preprocessing procedure can result in three training samples from an existing user-sequence example: positive, negative, or corrupt. A positive sample is essentially a user-sequence pairing already present in the dataset. For negative samples, a bigram sequence is assigned to a user even if they were not the original author. In this scenario, negative labels are assigned to both the user prediction token and all the bigram tokens in the sequence. Finally, corrupted tokens are constructed, containing both positive and negative training signals. To achieve this, a pivot position is randomly sampled in the typing sequence of a random user. The temporal features of all bigrams on the right side of the pivot are then replaced with temporal features sampled from other users who have those bigrams as part of their enrollment sequences. These are considered to be much more challenging samples, as the model is compelled to rely on temporal feature irregularities that may emerge in a user’s typing behavior at the bigram level.
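The positive/negative/corrupt generation scheme can be sketched as below (illustrative and simplified: donors for replaced timings are drawn from any other user, whereas the paper restricts them to users who share the bigram; the default probabilities follow the best-reported [0.1, 0.8, 0.1] split):

```python
import random

def make_training_sample(user, sequences_by_user, p_pos=0.1, p_neg=0.8):
    """Returns (user_id, bigram_sequence, labels), with label 1 for genuine
    bigram timings and 0 for replaced ones.
    sequences_by_user: dict mapping user id -> list of bigram sequences."""
    r = random.random()
    seq = random.choice(sequences_by_user[user])
    if r < p_pos:
        # Positive: a genuine user-sequence pairing from the dataset.
        return user, seq, [1] * len(seq)
    if r < p_pos + p_neg:
        # Negative: assign another user's sequence; all tokens labeled 0.
        other = random.choice([u for u in sequences_by_user if u != user])
        stolen = random.choice(sequences_by_user[other])
        return user, stolen, [0] * len(stolen)
    # Corrupt: keep timings up to a random pivot, then replace the rest
    # with timings sampled from other users' sequences.
    pivot = random.randrange(1, len(seq))
    corrupted, labels = list(seq[:pivot]), [1] * pivot
    donors = [u for u in sequences_by_user if u != user]
    for _ in seq[pivot:]:
        donor_seq = random.choice(sequences_by_user[random.choice(donors)])
        corrupted.append(random.choice(donor_seq))
        labels.append(0)
    return user, corrupted, labels
```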

3.4. DoubleStrokeNet Architecture

The core of our DoubleStrokeNet architecture (see Figure 5) is a Transformer encoder that fulfills the function of a discriminator between genuine and impostor users, as well as original and replaced key bigrams. A major advantage of the Transformer encoder is its capability to capture rich contextual information within data sequences, making it suitable for analyzing keystroke patterns. By encoding key bigrams with context-aware embeddings, the Transformer encoder facilitates a deeper understanding of typing dynamics and dependencies within sequences, enabling more accurate predictions and key representations.
The multi-head attention operation employed in the Transformer encoder relies on linear projections of input vectors from the same embedding manifold. Simply using the concatenated embedding would be ill-posed due to various scale differences or covariances caused by the domain mixture of this composed vector. To alleviate this, the first step is to pass the concatenated embeddings through a Multi-Layer Perceptron network to construct a homogeneous embedding space.
During our experiments with various algorithms for positional encodings, we discovered that the optimal choice for adding positional information is tightly linked to the type of keystroke dataset the model is being trained on. The preference for classical positional encoding in the context of a desktop keystroke authentication dataset arises from the inherent characteristics of desktop usage. Desktop computers offer a relatively stable and consistent environment, where users typically interact with physical keyboards in a stationary position. Classical positional encoding, which relies on fixed sinusoidal functions to represent token positions, aligns well with this stability. It effectively captures the spatial relationships between keystrokes and their temporal sequences, enabling accurate authentication.
On the other hand, rotary positional embeddings [15] (RoPE) perform better when applied to mobile datasets due to the dynamic nature of mobile device usage. Mobile users often employ varied interaction patterns, including tapping, swiping, and changing device orientation. These diverse actions can result in a highly variable input space. RoPE is better suited for accommodating the fluid and context-dependent nature of mobile interactions, leading to improved performance in tasks like authentication on mobile devices. RoPE, unlike Sinusoidal [16] or Learned Positional encodings [17], models the relative distance between queries and keys by applying a fixed rotation matrix to both queries and keys at each attention layer.
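The core RoPE operation can be sketched as below (a simplified single-vector version with the base 10000 from the original formulation; in practice the rotation is applied to queries and keys inside every attention layer):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive dimension pairs of `vec` by a position-dependent
    angle. Dot products between rotated queries and keys then depend only
    on their relative distance, not their absolute positions."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # slower rotation for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

Because each pair is rotated (not rescaled), vector norms are preserved, and shifting both query and key positions by the same offset leaves their dot product unchanged, which is the relative-distance property that suits variable mobile typing patterns.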
As with the broader Transformer literature, we opt for a simple linear projection on top of the final hidden layer for each learnable task.

3.5. Authentication Protocol

In our experimental methodology, we work with a dataset where each subject contributes 15 sequences. To form the test set, we randomly select 5 sequences per subject, creating 5 genuine test scores for each individual. Our objective is to systematically assess system performance while varying the number of enrollment sequences, denoted as G, within the range of 1 to 10.
We employ a rigorous procedure to generate impostor scores for our experimental trials. For each subject enrolled in the system, we randomly select one test sample from every remaining subject in the dataset. This approach ensures a comprehensive set of impostor scores, with each enrolled subject having 5 genuine test scores and k−1 impostor scores. The parameter k represents the number of subjects included in the evaluation, and we manipulate k in our experiments over a range extending from 100 to 100,000.
Notably, our experimental setup results in a surplus of impostor scores compared to genuine ones, a characteristic frequently observed in keystroke dynamics authentication research. To quantitatively assess the efficacy of our authentication system, the Equal Error Rate (EER) is employed as our primary performance metric. The EER is the point at which the False Acceptance Rate (FAR), representing the proportion of impostors incorrectly classified as genuine users, is equal to the False Rejection Rate (FRR), signifying the proportion of genuine subjects erroneously classified as impostors. These error rates are computed for each subject and subsequently averaged across the entire cohort of k subjects in our experiments.
Computing EER using logits of the user prediction is a fundamental step in evaluating the performance of biometric authentication systems. Logits represent the raw, unnormalized outputs of a model prior to the application of any activation functions, providing a direct measure of confidence in a particular prediction. The threshold is methodically fine-tuned to ascertain the Equal Error Rate (EER), therefore discerning between legitimate and fraudulent authentication attempts. This iterative process ensures the selection of an optimal threshold that balances FAR and FRR, which are computed at each threshold value. EER is the point at which these two rates intersect, signifying the optimal trade-off between accepting unauthorized users and rejecting legitimate ones. This metric is pivotal in assessing the effectiveness of an authentication system, allowing for fine-tuning and optimization to achieve an equilibrium between security and user convenience. A lower EER indicates a more accurate and reliable authentication system, offering higher confidence in user identification.

3.6. Experimental Setup and Results

The pretraining and fine-tuning approach is effective in our case due to transfer learning, enabling the discriminator network to leverage pre-learned key embeddings to authenticate new users. It offers data and computational efficiency, requiring less user-specific data and reduced training time. This approach helps prevent overfitting by adapting pretrained key representations to learn embeddings for unseen users.
DoubleStrokeNet is trained on the desktop and mobile versions of the Aalto datasets. Since the goal of the model is to learn bigram embeddings, 50,000 users are used for training in the desktop setting and 15,000 in the mobile setting, ensuring that enough bigrams are available for learning key embeddings. Regarding the probability distribution over the training samples produced by our preprocessing, the best performance was obtained with probabilities of [0.1, 0.8, 0.1] for positive, negative, and corrupt samples, respectively.
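The weighted draw over the three sample kinds can be sketched in a few lines; the helper name is hypothetical, while the probabilities are the ones reported above:

```python
import random

# Sampling probabilities reported in the paper for the three sample kinds.
SAMPLE_TYPES = ["positive", "negative", "corrupt"]
SAMPLE_PROBS = [0.1, 0.8, 0.1]

def draw_sample_type(rng=random):
    """Pick which kind of training sample the preprocessing emits next."""
    return rng.choices(SAMPLE_TYPES, weights=SAMPLE_PROBS, k=1)[0]
```

With these weights, roughly 80% of generated samples are negatives, which matches the discriminative training objective described earlier.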
The optimal hyperparameter configuration features a learning rate of lr = 0.0005 and the Adam optimizer [18] with β1 = 0.9, β2 = 0.999, and ε = 10^-8. A dual learning rate scheduling regimen, combining a constant scheduler with a cosine annealing scheduler, resulted in a discernible improvement in performance. Within the model architecture, the key embeddings have a size of 256, while the user embeddings have a dimensionality of 768. Gradient clipping [19] was used to mitigate training instability by capping the magnitude of the gradients at 1. Our method was implemented using the PyTorch Lightning framework [20].
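The dual schedule can be sketched in plain Python; the phase lengths below are illustrative assumptions, as the paper does not report them:

```python
import math

BASE_LR = 5e-4          # learning rate from the paper
CONSTANT_STEPS = 1_000  # assumed length of the constant phase
TOTAL_STEPS = 10_000    # assumed total number of training steps

def learning_rate(step, eta_min=0.0):
    """Constant learning rate followed by cosine annealing toward eta_min."""
    if step < CONSTANT_STEPS:
        return BASE_LR
    progress = min((step - CONSTANT_STEPS) / (TOTAL_STEPS - CONSTANT_STEPS), 1.0)
    return eta_min + 0.5 * (BASE_LR - eta_min) * (1 + math.cos(math.pi * progress))
```

In PyTorch, the same regimen can be obtained by chaining `ConstantLR` and `CosineAnnealingLR` via `SequentialLR`, with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` applied for the gradient clipping.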
The highest performance for the Transformer encoder employed for bigram discrimination was achieved with a configuration of 12 Transformer layers, each with 12 attention heads. The feedforward dimension within each layer is set to 512, and a dropout rate of 0.2 is applied. In terms of attention mechanisms, a full attention approach is adopted, encompassing both the sequence and user components. As outlined in the architectural specifications, the choice of positional encoding depends on the setting: classic sinusoidal positional encoding is employed for the desktop setting, while RoPE is used for the mobile setting.
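The reported configuration can be collected into a small sanity-checked bundle; the model dimension of 768 is an assumption made here to match the user-embedding size, since multi-head attention requires the model dimension to divide evenly across heads:

```python
# Hyperparameters reported for the bigram-discrimination encoder; d_model is
# an assumption (set to the user-embedding size of 768, i.e., 64 dims/head).
ENCODER_CONFIG = {
    "num_layers": 12,
    "num_heads": 12,
    "d_model": 768,
    "feedforward_dim": 512,
    "dropout": 0.2,
    "positional_encoding": {"desktop": "sinusoidal", "mobile": "rope"},
}

def head_dim(cfg):
    """Per-head dimension; multi-head attention needs an even split."""
    assert cfg["d_model"] % cfg["num_heads"] == 0
    return cfg["d_model"] // cfg["num_heads"]
```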

4. Results

4.1. Intra-Database Results

The impact of two variables is investigated: the length of the keystroke sequences (referred to as M) and the number of sequences used for enrolling each user (referred to as G). This experiment aims to uncover the extent to which variations in M and G influence the authentication performance of the model. Note that the models receive input of a fixed size M, and the value of k, representing the number of enrolled test users, is set to 1000. Table 1 and Table 2 summarize the error rates achieved in the desktop and mobile scenarios for different combinations of sequence length (M) and number of enrollment sequences per user (G).
In the desktop scenario, performance improves only marginally for sequences longer than M = 50. Doubling the number of keystrokes (from M = 50 to M = 100) results in an average EER reduction of only 0.3% across all G values. However, enlarging the enrollment gallery leads to more substantial gains, with approximately a 50% relative reduction in error regardless of M.
Similar trends emerge in the mobile scenario (Table 2) as in the desktop scenario (Table 1). Once again, extending the sequence length beyond M = 50 keystrokes does not significantly enhance performance, whereas increasing the number of sequences per subject yields a noteworthy improvement. Optimal results are attained with M = 100 and G = 10, where the DoubleStrokeNet model achieves a 2.35% error rate.

4.2. Inter-Database Results

The importance of generalization to other datasets in Machine Learning cannot be overstated: it ensures that models trained on one dataset can effectively handle new, unseen data from different sources or under varied conditions. In this experiment, we investigate the interoperability of the DoubleStrokeNet architecture, assessing how well it functions across different devices (i.e., desktop and mobile) and adapts to diverse input sources and databases. As such, we evaluate performance on a keystroke dataset that differs from the one used in the initial training, while maintaining the same train/test subject partitioning to ensure equitable comparisons. For methodological consistency with previous experiments, we adhere to a protocol of G = 5 enrollment sequences per subject, M = 50 keystrokes per sequence, and k = 1000 test subjects. The results in Table 3 highlight the adaptability of the DoubleStrokeNet and emphasize the importance of the pretraining/fine-tuning configuration. Compared to the TypeNet results, we observe an approximately 40% reduction in EER, underscoring the effectiveness of the DoubleStrokeNet model.

5. Discussion

We present a comprehensive analysis highlighting the importance and impact of the DoubleStrokeNet components within the keystroke authentication task, considering the vanilla Transformer as the baseline. By systematically deconstructing and evaluating the performance of these components, we provide insights into the design and optimization of keystroke authentication mechanisms.
Table 4 presents the comparative results on the Aalto Mobile Dataset. The analysis considers a variable number of enrollment sessions (G = 5 and 10) and a fixed sequence length of M = 50. We explore how the Transformer architecture effectively models keystroke sequences, even using only the standard attention mechanism. As presented in Table 4, we first establish the baseline performance of the vanilla Transformer and then add the proposed components to show the resulting EER improvements. Modifying the input and adding the key and user embeddings resulted in a modest improvement to the EER.
The pivotal factor that enhanced the performance of the authentication system was the homogenization of the concatenated bigram representations using the MLP, which reduced the EER by approximately 2 percentage points, corresponding to relative improvements of 27% and 31% for G = 5 and G = 10, respectively. As detailed in the previous section, the positional encoding used in the Transformer greatly impacts performance: replacing the sinusoidal positional encoding with the rotary positional encoding further reduced the EER by 0.89 and 1.23 percentage points.
Lastly, Table 4 compares the proposed DoubleStrokeNet with other systems from the literature that were not originally evaluated under the protocol adopted in this work: Digraphs and SVM [21], POHMMs [22], and a combination of CNNs and RNNs [23]. This broader comparison provides a comprehensive view of DoubleStrokeNet's performance relative to various existing methods, offering valuable insights into its effectiveness in keystroke sequence modeling and authentication.
Table 4. Comparative performance (bold denotes the best results).
Architecture                      G = 5     G = 10
Vanilla Transformer               7.62%     6.56%
Transformer w/ Key Embeddings     6.92%     5.97%
Transformer w/ MLP                5.04%     4.08%
DoubleStrokeNet                   4.15%     2.85%
TypeNet [4]                       9.2%      8.0%
CNN + RNN [23]                    12.2%     -
Digraphs and SVM [21]             29.20%    -
POHMM [22]                        40.40%    -

6. Conclusions

This study focuses on advancing keystroke biometric authentication through self-supervised learning of user and key embeddings. We propose a dual-view authentication modeling approach, separating user and sequence representations. This enables efficient integration of new users into the platform by exclusively learning new user embeddings without the need for parameter updates on the Transformer network.
Inspired by the ELECTRA method, we employ a training sample generation approach that intentionally corrupts typing sequences. The DoubleStrokeNet architecture uses a Transformer encoder as a discerning tool for user authentication and keystroke analysis. This encoder's strength lies in its capability to capture intricate contextual details within data sequences, making it particularly effective for analyzing keystroke patterns. Each token of the input sequence represents a key bigram and is constructed as a composite of two key embeddings and a projection of the temporal features of the specific pair of keys.
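The token construction described above can be sketched with NumPy; the table size, projection dimension, and random weights are hypothetical placeholders rather than the trained parameters:

```python
import numpy as np

KEY_DIM = 256       # key-embedding size reported in the paper
NUM_TEMPORAL = 4    # HL, IL, PL, RL
PROJ_DIM = 256      # assumed output size of the temporal-feature projection

rng = np.random.default_rng(0)
key_table = rng.normal(size=(128, KEY_DIM))         # hypothetical keycode table
W_proj = rng.normal(size=(NUM_TEMPORAL, PROJ_DIM))  # hypothetical linear projection

def bigram_token(key_a, key_b, temporal):
    """One input token: both key embeddings plus the projected timing features."""
    projected = np.asarray(temporal, dtype=float) @ W_proj
    return np.concatenate([key_table[key_a], key_table[key_b], projected])
```

For example, `bigram_token(65, 66, [0.12, 0.08, 0.20, 0.16])` would build the token for an (A, B) bigram from its four timing features.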
Our experiments argue for the critical importance of tailoring positional encodings to the specific keystroke dataset. Classical positional encoding, reliant on stable sinusoidal functions, proves most effective for desktop authentication, aligning seamlessly with the stationary nature of desktop computer usage. Conversely, rotary positional embeddings are better suited for mobile datasets, excelling in accommodating the variability of mobile interactions. RoPE’s adaptability proves invaluable, leading to improved performance, especially in authentication tasks on mobile devices.
Throughout our evaluation of various authentication scenarios, the DoubleStrokeNet achieved superior performance, particularly in scenarios with numerous subjects but limited enrollment samples per subject. Our results surpass those of previous state-of-the-art algorithms. We observed EER values ranging from 4.72% to 0.75% in desktop settings and from 8.94% to 2.35% in mobile settings, contingent on the volume of subject data enrolled. A favorable balance between performance and the quantity of enrollment data per subject was attained with 5 enrollment sequences and 50 keystrokes per sequence. This configuration yielded an EER of 1.38% (desktop) and 4.15% (mobile) for a group of 1000 test subjects. If the number of enrollment sequences is raised to 10, the network yields 0.84% for the desktop setting and 2.85% for the mobile one.
Despite the notable advancements achieved in keystroke biometric authentication through self-supervised learning of user and key embeddings, certain limitations must be acknowledged. While previous models require fine-tuning of the full network when introducing new users, our approach still requires learning a new user embedding for each enrolled subject, even though no parameter updates on the Transformer network are needed. This additional step may incur some operational overhead and could hinder the seamless onboarding of new users. Furthermore, the DoubleStrokeNet architecture requires substantial computational resources: using a Transformer encoder for user authentication and keystroke analysis demands significant processing power, which may be a limiting factor for organizations with constrained computational capabilities. This underscores the need for further optimization or alternative approaches to make the system more accessible for a broader range of applications and environments.
Although this study has provided valuable insights into using self-supervised learning to generate key and user embeddings, several avenues remain for future research. In the context of keystroke authentication, Random Fourier Features (RFF) [24] applied to the temporal characteristics could generate a more complex and expressive representation of the key bigrams. A second direction is to assign a key embedding table to each user and utilize only the keys that appear in the enrollment sequences during authentication. This latter approach would reduce the noise of unused keys and ensure a more robust model for user authentication. Additionally, it would be valuable to evaluate the performance of this method in real-world applications across a range of heterogeneous devices, considering both accuracy and latency, as these factors are crucial for practical usability and deployment.
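As a sketch of the RFF idea, the four temporal features could be mapped into a space whose inner products approximate an RBF kernel; the feature count and bandwidth below are illustrative choices, not values from the paper:

```python
import numpy as np

def random_fourier_features(x, num_features=64, gamma=1.0, seed=0):
    """Map temporal feature vectors to random Fourier features whose
    inner products approximate an RBF kernel with bandwidth gamma."""
    rng = np.random.default_rng(seed)
    x = np.atleast_2d(np.asarray(x, dtype=float))
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(x.shape[1], num_features))
    b = rng.uniform(0.0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(x @ W + b)
```

Such a mapping could replace the plain linear projection of the temporal features before concatenation with the key embeddings.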

7. Patents

Part of this work is covered by Patent Application A/00315/21.06.2023, submitted to the State Office for Inventions and Trademarks of Romania [25].

Author Contributions

Conceptualization, T.N., T.P., S.R. and M.D.; methodology, T.N., T.P., S.R. and M.D.; software, T.N.; validation, T.N.; formal analysis, T.N.; data curation, T.N.; writing—original draft preparation, T.N.; writing—review and editing, T.P., S.R. and M.D.; visualization, T.N.; supervision, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the “Innovative Solution for Optimizing User Productivity through Multi-Modal Monitoring of Activity and Profiles—OPTIMIZE”/“Soluție Inovativă de Optimizare a Productivității Utilizatorilor prin Monitorizarea Multi-Modala a Activității și a Profilelor—OPTIMIZE” project, Contract number 366/390042/27.09.2021, MySMIS code: 121491.

Institutional Review Board Statement

Ethical review and approval were not required for this study since the data originated from previously published typing datasets, namely the Aalto Desktop Dataset [13] and the Aalto Mobile Dataset [10]. The datasets are publicly available for scientific use.

Informed Consent Statement

Not applicable since the data originated from previous studies.

Data Availability Statement

All data are publicly available: Aalto Desktop Dataset at https://userinterfaces.aalto.fi/136Mkeystrokes/ (accessed on 10 June 2023) and Aalto Mobile Dataset at https://userinterfaces.aalto.fi/typing37k/ (accessed on 22 June 2023). All the code is available at https://github.com/readerbench/doublestrokenet (accessed on 10 October 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CAGR: Compound Annual Growth Rate
EER: Equal Error Rate
ELECTRA: Efficiently Learning an Encoder that Classifies Token Replacements Accurately
FAR: False Acceptance Rate
FRR: False Rejection Rate
MLP: Multi-Layer Perceptron
RoPE: Rotary Positional Embedding
WPM: Words Per Minute

References

  1. Biometric System Market Size Worth USD 76.70 Billion by 2029. 2023. Available online: https://www.globenewswire.com/news-release/2023/01/17/2589810/0/en/Biometric-System-Market-Size-Worth-USD-76-70-Billion-by-2029-Report-by-Fortune-Business-Insights.html (accessed on 10 August 2023).
  2. Global Keystroke Dynamics Market to Reach US$ 1413.1 Million by 2028, Impelled by Increasing Incidences of Fraudulent Digital Transactions. 2022. Available online: https://www.imarcgroup.com/global-keystroke-dynamics-market (accessed on 10 August 2023).
  3. Clark, K.; Luong, M.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  4. Acien, A.; Morales, A.; Monaco, J.V.; Vera-Rodríguez, R.; Fiérrez, J. TypeNet: Deep Learning Keystroke Biometrics. IEEE Trans. Biom. Behav. Identity Sci. 2021, 4, 57–70. [Google Scholar] [CrossRef]
  5. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  6. Dong, X.; Shen, J. Triplet Loss in Siamese Network for Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018. [Google Scholar]
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  8. Liu, M.; Ren, S.; Ma, S.; Jiao, J.; Chen, Y.; Wang, Z.; Song, W. Gated Transformer Networks for Multivariate Time Series Classification. arXiv 2021, arXiv:2103.14438. [Google Scholar]
  9. Stragapede, G.; Delgado-Santos, P.; Tolosana, R.; Vera-Rodriguez, R.; Guest, R.; Morales, A. Mobile Keystroke Biometrics Using Transformers. arXiv 2022, arXiv:2207.07596. [Google Scholar]
  10. Palin, K.; Feit, A.; Kim, S.; Kristensson, P.O.; Oulasvirta, A. How do People Type on Mobile Devices? Observations from a Study with 37,000 Volunteers. In Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI’19), Taipei, Taiwan, 1–4 October 2019. [Google Scholar]
  11. Stragapede, G.; Delgado-Santos, P.; Tolosana, R.; Vera-Rodriguez, R.; Guest, R.; Morales, A. TypeFormer: Transformers for Mobile Keystroke Biometrics. arXiv 2022, arXiv:2212.13075. [Google Scholar]
  12. Hutchins, D.; Schlag, I.; Wu, Y.; Dyer, E.; Neyshabur, B. Block-Recurrent Transformers. arXiv 2022, arXiv:2203.07852. [Google Scholar]
  13. Dhakal, V.; Feit, A.; Kristensson, P.O.; Oulasvirta, A. Observations on Typing from 136 Million Keystrokes. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18), Montreal, QC, Canada, 21–26 April 2018. [Google Scholar] [CrossRef]
  14. Alsultan, A.; Warwick, K. Keystroke dynamics authentication: A survey of free-text. Int. J. Comput. Sci. Issues 2013, 10, 1–10. [Google Scholar]
  15. Su, J.; Lu, Y.; Pan, S.; Murtadha, A.; Wen, B.; Liu, Y. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv 2022, arXiv:2104.09864. [Google Scholar]
  16. Dufter, P.; Schmitt, M.; Schütze, H. Position Information in Transformers: An Overview. arXiv 2021, arXiv:2102.11090. [Google Scholar] [CrossRef]
  17. Ramasinghe, S.; Lucey, S. Learning Positional Embeddings for Coordinate-MLPs. arXiv 2021, arXiv:2112.11577. [Google Scholar]
  18. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR) 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  19. Zhang, J.; He, T.; Sra, S.; Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv 2020, arXiv:1905.11881. [Google Scholar]
  20. Falcon, W.; The PyTorch Lightning Team. PyTorch Lightning. 2019. [Google Scholar] [CrossRef]
  21. Çeker, H.; Upadhyaya, S. User authentication with keystroke dynamics in long-text data. In Proceedings of the 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), Niagara Falls, NY, USA, 6–9 September 2016; pp. 1–6. [Google Scholar] [CrossRef]
  22. Vertanen, K.; Kristensson, P.O. A Versatile Dataset for Text Entry Evaluations Based on Genuine Mobile Emails. In Proceedings of the MobileHCI ’11 13th International Conference on Human Computer Interaction with Mobile Devices and Services, Lisbon, Portugal, 5–9 September 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 295–298. [Google Scholar] [CrossRef]
  23. Xiaofeng, L.; Shengfei, Z.; Shengwei, Y. Continuous authentication by free-text keystroke based on CNN plus RNN. Procedia Comput. Sci. 2019, 147, 314–318. [Google Scholar] [CrossRef]
  24. Rahimi, A.; Recht, B. Random Features for Large-Scale Kernel Machines. Adv. Neural Inf. Process. Syst. 2007, 20. [Google Scholar]
  25. Neacsu, T.; Ruseti, S.; Dascalu, M.; Banica, C.K. Sistem de Autentificare prin Secvențe de Tastare folosind Perechi de Taste și Rețele Neurale Profunde [Authentication System Based on Typing Sequences Using Key Pairs and Deep Neural Networks]. OSIM Patent Application A/00315/21.06.2023, 2023.
Figure 1. User distribution by the percentages of undefined keycodes in the benchmark sessions (0 means that there are no undefined keycodes in the user sessions, and 1 means that all the keys from all the sequences are undefined).
Figure 2. Example of temporal characteristics of a keystroke bigram composed of the keys A and B: Hold Latency (HL), Inter-key Latency (IL), Press Latency (PL), and Release Latency (RL).
Figure 3. The transformation of each element from the time series after feature extraction by concatenating the key embeddings with the result of a linear projection of the temporal features.
Figure 4. The process of generating training samples which include positive, negative, and corrupt samples. The user bigram dictionary contains all typing sequences of all users and all temporal features of bigrams extracted from the typing sequences. (A, B), (B, C), … (E, F) represent the key pairs that construct the typing sequence.
Figure 5. DoubleStrokeNet discriminator architecture.
Table 1. Equal Error Rates (%) achieved in the desktop scenario using TypeNet/DoubleStrokeNet for different values of the parameters M (sequence length) and G (number of enrollment sequences per subject) (bold denotes the best results).
M \ G    1           3           5           7           10
30       8.6/4.72    6.4/3.11    4.6/2.43    4.1/2.40    3.7/1.57
50       5.4/2.84    3.6/1.63    2.2/1.38    1.8/1.12    1.6/0.84
70       4.5/2.47    2.8/1.38    1.7/1.25    1.4/0.97    1.2/0.78
100      4.2/2.32    2.7/1.23    1.6/1.22    1.4/0.91    1.2/0.75
Table 2. Equal Error Rates (%) achieved in the mobile scenario using TypeNet/DoubleStrokeNet for different values of the parameters M (sequence length) and G (number of enrollment sequences per subject; bold denotes the best results).
M \ G    1            3            5            7            10
30       14.2/8.94    12.5/5.87    11.3/5.34    10.9/4.92    10.5/4.30
50       12.6/5.77    10.7/5.28    9.2/4.15     8.5/3.35     8.0/2.85
70       11.3/5.51    9.5/4.56     7.8/3.63     7.2/2.83     6.8/2.53
100      10.7/4.97    8.9/4.13     7.3/3.40     6.6/2.53     6.3/2.35
Table 3. Equal Error Rates (EER) achieved in the inter-database scenario for the DoubleStrokeNet.
Train \ Test     Desktop    Mobile
Aalto Desktop    1.38%      12.73%
Aalto Mobile     9.49%      4.15%

Neacsu, T.; Poncu, T.; Ruseti, S.; Dascalu, M. DoubleStrokeNet: Bigram-Level Keystroke Authentication. Electronics 2023, 12, 4309. https://doi.org/10.3390/electronics12204309

AMA Style

Neacsu T, Poncu T, Ruseti S, Dascalu M. DoubleStrokeNet: Bigram-Level Keystroke Authentication. Electronics. 2023; 12(20):4309. https://doi.org/10.3390/electronics12204309

Chicago/Turabian Style

Neacsu, Teodor, Teodor Poncu, Stefan Ruseti, and Mihai Dascalu. 2023. "DoubleStrokeNet: Bigram-Level Keystroke Authentication" Electronics 12, no. 20: 4309. https://doi.org/10.3390/electronics12204309

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.
