1. Introduction
Smart devices, such as smartphones and tablets, play a key role in everyday life. Despite the technological advancements that have enabled such devices to dominate our lives, even the latest security measures are becoming outdated and expose users to various threats. For example, personal identification numbers, passwords, or patterns, while providing some degree of security, are especially prone to shoulder-surfing. Whereas biometrics-based methods such as fingerprint [
1], facial [
2], or iris recognition [
3] do provide more security, most are limited to authentication at the point of entry.
Unlike physical biometrics, which primarily authenticate, i.e., one-to-one matching, at the point of entry, continuous authentication involves ongoing verification. Behavioral biometrics facilitates a seamless and smooth authentication process in this regard. This body of literature involves keystrokes [
4], gait features [
5], and touchstroke dynamics [
6]. However, behavioral biometrics inherently include varying characteristics within a single identity, depending on factors such as mood. Therefore, machine learning approaches have been widely adopted, especially for touchstroke dynamics [
6,
7,
8,
9]. For clarity, we note that touchstroke dynamics are coupled with gait features; in this work, we refer to the touchstroke-gait multimodal signals as touchstrokes for brevity.
The touchstroke authentication problem can be framed as a traditional binary classification (closed-set) [
6,
7,
8,
10,
11,
12,
13,
14], anomaly detection (closed-set) [
7,
15], or open-set authentication [
16] problems. Among the most primitive is traditional binary classification, where access to abundant amounts of genuine and impostor data is assumed, and the model is trained as a binary classifier upon such data. However, [
7,
15] point out that the accessibility of impostor data is unrealistic, thus reformulating the problem as anomaly detection, where only genuine data are taken to train an anomaly detector. Unfortunately, anomaly detectors are widely known to suffer from the problem of feature representation collapse [
17]. Moreover, though framed as anomaly detectors, most rely upon limited generated or reinforced impostor data [
7,
13,
14,
18,
19,
20,
21]. Also, training a model individually on each genuine user's data can be resource-consuming, makes further updates and maintenance inefficient, and is prone to model parameter breaches.
To resolve the problems above, ref. [
16] introduces open-set authentication (OSA) for touchstroke biometrics, where the model is trained upon a multiple-identity pretraining dataset. Upon distribution of the model, the enrollment phase involves a simple extraction of the touchstroke features from the user inputs, where the extracted features are stored as the template, in contrast to the per-user model training of binary classification or anomaly-detection frameworks. Subsequently, in the authentication phase, the claimant's input is likewise forwarded through the model, and its features are compared to the stored template via the nearest-neighbor distance.
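The enrollment and authentication phases just described can be sketched as follows; the feature extractor, feature dimension, and threshold are illustrative stand-ins rather than the actual model or settings of [16]:

```python
import numpy as np

def enroll(extract, samples):
    """Enrollment: forward the user's inputs through the pretrained model
    and store the extracted features as the template (no training)."""
    return np.stack([extract(x) for x in samples])          # (n_samples, d)

def authenticate(extract, template, claimant_input, threshold):
    """Authentication: accept iff the nearest-neighbor distance between
    the claimant's feature and the stored template is below a threshold."""
    feat = extract(claimant_input)
    dists = np.linalg.norm(template - feat, axis=1)         # one distance per stored feature
    return float(dists.min()) <= threshold

# Toy check with an identity "extractor" on 4-dimensional features.
rng = np.random.default_rng(0)
extract = lambda x: x
genuine = [rng.normal(0.0, 0.1, 4) for _ in range(5)]
template = enroll(extract, genuine)
print(authenticate(extract, template, genuine[0], threshold=1.0))        # genuine: accepted
print(authenticate(extract, template, genuine[0] + 10.0, threshold=1.0)) # impostor: rejected
```

Note that the model itself never changes per user: only the template is user-specific, which is what distinguishes OSA from the per-user training of binary classification or anomaly-detection frameworks.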
By nature, OSA removes the shortcomings of binary classification and anomaly detection, enabling more secure and efficient authentication. Specifically, its application with transformer-based architectures [
22,
23], regarding the temporal nature of touchstroke dynamics, showed robust performance. State-of-the-art was achieved by a model called PIEformer [
16], which attempts to resemble the effect of an ensemble of multiple Transformers while incurring only a minimal increment in model parameters and computational complexity. While OSA with PIEformer sets a notable benchmark, we make the following observations.
In reality, the touchstroke input from the user may be substantial, leading to a bulky template on which resource-limited mobile devices may struggle to compute the nearest-neighbor distance. Hence, a method of extracting a user-representative, robust subset of the original template should be investigated.
The PIEformer intends to resemble the effect of having multiple models by having multiple learnable embedding inputs, and various experiments show the soundness of this design. However, the global attention mechanism introduces dependency between these embeddings, thus marginally limiting the original objective.
In this work, we primarily address the observations above. First, realizing the necessity of reducing the template to a user-representative, robust subset, we propose CoreTemp: coreset-reduced templates. CoreTemp is motivated by the memory-reduction technique of [
24], where the combination of iterative greedy approximation [
25,
26] and the Johnson–Lindenstrauss (JL) theorem [
27] results in an effective reduction of templates. Moreover, based on the integrity of the dimensionality reduction attributed to the JL theorem, we further put forth a quick-authentication mechanism, in which the claimant's features are also reduced and compared to a template reduced both in size (CoreTemp) and in dimension, that is, a dimension-reduced CoreTemp.
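The distance-preservation property underlying this quick-authentication mechanism can be sketched minimally as below; the dimensions, the Gaussian projector, and the synthetic features are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_star = 256, 64                    # original and reduced dimensions (illustrative)
# JL-style Gaussian random projector: approximately preserves pairwise distances.
P = rng.normal(0.0, 1.0 / np.sqrt(d_star), size=(d, d_star))

template = rng.normal(size=(10, d))    # synthetic stored template features
claimant = rng.normal(size=(d,))       # synthetic claimant feature

# Nearest-neighbor distance in the original vs. the projected space:
full = np.linalg.norm(template - claimant, axis=1).min()
fast = np.linalg.norm(template @ P - claimant @ P, axis=1).min()
print(full, fast)  # the projected distance approximates the original one
```

Because the projection approximately preserves pairwise distances, the threshold comparison can be carried out against the dimension-reduced template at a fraction of the cost.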
Achieving a user-representative template of much smaller size and dimensionality leads us to explore its new capabilities further. Noticing that some mobile devices are shared, such as educational tablets, we utilize the smaller capacity of CoreTemp to enable identification, i.e., one-to-many matching, by simply tagging multiple (identity) CoreTemps with their corresponding identities. We note that a naive application of the original templates within this OSA framework leads to a substantial computational overhead, deviating from the objective of behavioral biometric authentication.
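The identification-by-tagging idea can be sketched as follows; the identities, feature dimension, and synthetic templates below are hypothetical:

```python
import numpy as np

def identify(tagged_templates, feat):
    """One-to-many matching: compute the nearest-neighbor distance to each
    identity-tagged (reduced) template and return the closest identity."""
    return min(tagged_templates,
               key=lambda uid: np.linalg.norm(tagged_templates[uid] - feat, axis=1).min())

rng = np.random.default_rng(0)
tagged = {"user_a": rng.normal(0.0, 0.1, (8, 4)),   # hypothetical reduced template of user_a
          "user_b": rng.normal(3.0, 0.1, (8, 4))}   # hypothetical reduced template of user_b
print(identify(tagged, rng.normal(3.0, 0.1, 4)))    # prints "user_b"
```

Since the per-identity templates are coreset-reduced, the argmax scan over all enrolled identities stays cheap even on a shared device.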
Second, we improve the PIEformer by introducing a masking layer between the learnable embedding inputs. The main idea of PIEformer comes from resembling an explicit ensemble [
28] of multiple Transformers by adopting multiple aggregated learnable embedding inputs.
However, in an explicit ensemble, the performance enhancement depends on the diversity of the individual sets of model parameters, attributable predominantly to the independent random initialization and independent training of each model. In PIEformer, this independence between the embedding inputs is jeopardized by the attention mechanism itself.
Thus, by including an extra masking layer in the attention between the learnable embeddings, we achieve a clearer partitioning of information across the embedding inputs and thus a closer approximation to the behavior of an explicit ensemble. This modified architecture, referred to as PIEformer+ and illustrated in Figure 2, reaches state-of-the-art performance while maintaining computation and model complexity identical to those of PIEformer.
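One plausible realization of such a masking layer is an additive attention mask in which sequence tokens and embeddings interact freely, except that each learnable embedding is blocked from attending to the other embeddings; the layout below is an assumption for illustration, and the exact pattern in PIEformer+ may differ:

```python
import numpy as np

def embedding_partition_mask(n_tokens, n_embeds):
    """Additive attention mask (0 = attend, -inf = blocked): the last
    n_embeds positions hold the learnable embeddings, and each embedding
    is blocked from attending to the other embeddings, so the partitions
    behave closer to independently trained ensemble members."""
    n = n_tokens + n_embeds
    mask = np.zeros((n, n))
    block = np.full((n_embeds, n_embeds), -np.inf)
    np.fill_diagonal(block, 0.0)          # an embedding may still attend to itself
    mask[n_tokens:, n_tokens:] = block    # blocking applies only among embeddings
    return mask

m = embedding_partition_mask(n_tokens=3, n_embeds=2)
print(m[3, 4], m[3, 3], m[3, 0])  # -inf 0.0 0.0
```

Added before the softmax, the -inf entries zero out the cross-embedding attention weights while leaving the parameter count and per-layer computation unchanged.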
Our main contributions are as follows:
We propose utilizing greedy coreset sampling upon the touchstroke template. Extracting a user-representative yet reduced template makes efficient use of the limited memory on mobile devices. Moreover, we also propose a fast-match algorithm conceived from the JL theorem adopted in greedy coreset sampling, leading to much more efficient authentication.
We introduce a novel variant of PIEformer suitable for initially generating user-discriminative templates, PIEformer+. While [
16] has deeply investigated simulating an ensembled effect of Transformers with multiple learnable embedding inputs at a low computational cost, PIEformer+ simulates this phenomenon more closely by introducing a masking block in between the learnable parameters.
We demonstrate the feasibility of authentication and identification with the reduced templates. This highlights the importance of extracting a robust, user-representative touchstroke template. The proposed methods are evaluated on two datasets, HMOG and BBMAS, where they reach state-of-the-art performance in authentication and identification tasks.
Building on these foundations, our study aims to enhance the efficiency and effectiveness of mobile biometric authentication systems through several key innovations, namely CoreTemp and PIEformer+. These enhancements address the primary challenges of traditional methods by providing a more secure and efficient approach to continuous authentication and identification using touchstroke-gait behavioral biometrics.
2. Related Works
Authentication via touch gestures and gait features, or touchstrokes, involves identifying users by their touch interactions on the screen, which include touch trajectories, device movement, and orientation. Previous studies have primarily utilized data like touch coordinates, pressure, and motion sensor data from accelerometers and gyroscopes [
18,
19,
20,
21,
29,
30,
31,
32,
33,
34,
35].
Some primitive approaches manually handcraft features from the touchstroke signals, such as the trajectory length and the median pressure [
18,
19,
20,
21,
30,
31,
32,
33,
34,
35]. This dependence is due to the sensitivity of touch gestures to behavioral variances. Typically, these features are extracted and used to train various predictive models based on traditional machine learning, such as one-class support vector machines (SVM) [
21,
30], kernel ridge regression [
33], random forest [
19], temporal regression forest [
32], or SVM [
18,
20]. A major drawback of such methods is the reliance on task- or user-specific knowledge for their creation, making them challenging to design. Additionally, these features are often not fine-grained enough to effectively identify the subtle patterns of impostors. Some studies have further explored the realm of deep learning, where integration of deep models with handcrafted features takes place [
36], or the development of models that learn features directly from raw touch-gesture data in an end-to-end manner [
29,
37] to overcome the limitations of manual features.
Realizing the weak correlation of behavioral biometrics with the user's identity, and with the rise of deep learning, more of the literature has focused on utilizing deep models for authentication with touchstroke-gait biometrics. Specifically, most works commonly employ sensory data from the touchscreen, accelerometer, and gyroscope. We hereafter limit our scope to deep learning-based approaches.
A typical approach begins by framing this problem as binary classification between genuine and impostor users. The reader may refer to a recent survey in [
38]. However, acquiring complete datasets of impostor data for mobile devices is often unattainable. To overcome this limitation, the issue can be approached as a few-shot binary classification problem, as proposed by [
9]. This method seeks to enhance the robustness of detection against unseen impostors by utilizing the limited impostor data available. Ideally, a one-class classification model would be used, training solely on authentic user data [
7,
15]. Yet, this method performs poorly because it is particularly susceptible to feature representation collapse, which undermines the model’s ability to generalize and maintain resilience, as identified by [
17].
Traditionally, classification models for mobile behavioral biometrics authentication have been developed within a closed-set framework, utilizing either known genuine and impostor data or only genuine data for training. This traditional approach, however, falls short in practical security and usability scenarios. In response, ref. [
16] introduced an open-set authentication (OSA) strategy that allows for zero-shot inference, where the model can predict unseen data. Moreover, in [
16], in a realization that the exploration of various deep learning architectures for mobile biometrics authentication has been limited to LSTM networks [
6,
7], Convolutional LSTM [
8,
14], Autoencoders [
15], and Temporal Convolutional Networks (TCN) [
9], the authors attempt their solution with Transformers [
22,
23]. Furthermore, an implicitly ensembled model named PIEformer is proposed, noting the capability of ensembled models to exhibit robustness in open-set scenarios, mainly due to their enhanced generalization performance [
39].
In this work, we realize that such a framework requires a bank of templates, whose size would likely present a computational overhead in practical scenarios. Realizing the necessity of shrinking this bank into a user-representative, robust set of templates, we propose performing greedy coreset sampling on the initial template. The reduced template further extends its capabilities to identification, achieved by tagging templates with multiple identities, as detailed in
Section 4.2. Furthermore, we propose PIEformer+, which mimics the effect of ensembles more closely; its comparison with the Transformer and PIEformer is given in
Section 4.3.
3. Preliminary
3.1. Open-Set Authentication and Identification
Open-set methodologies are gaining traction in biometric security research. In this context, ref. [
16] is noted for pioneering Open Set Authentication (OSA) in touchstroke biometrics. At the same time, our study is the inaugural application of identification techniques within an open-set environment specifically for touchstroke-gait security. We delineate the scope of the “open set” used in our research to clarify the terminology and enhance the presentation.
We first assume a model $f_\theta$ pretrained upon a pretraining set that consists of samples from $K$ distinct classes, that is, $\mathcal{Y}_{\mathrm{pre}} = \{1, \ldots, K\}$. For authentication, a set of input samples $\mathcal{X}_u$ of a user $u$ is given to $f_\theta$ for the construction of templates, where the associated identity $y_u$ is considered disjoint from the pretraining set, that is, $y_u \notin \mathcal{Y}_{\mathrm{pre}}$.
During authentication, the claimant's sample $x$ is likewise passed to the system, whose associated, unknown identity $y$ likewise comes from an open set; thus, without loss of generality, it is assumed that $y \notin \mathcal{Y}_{\mathrm{pre}}$. Subsequently, a similarity score is derived upon comparison with the template, where the decision to accept or reject is made upon a certain threshold. We point out that a model with profound generalization ability is required because of the assumption that $y \notin \mathcal{Y}_{\mathrm{pre}}$.
For identification, we likewise assume a set of users $\mathcal{U} = \{u_1, \ldots, u_M\}$ whose identities are pairwise disjoint from $\mathcal{Y}_{\mathrm{pre}}$, that is, $y_{u_m} \notin \mathcal{Y}_{\mathrm{pre}}$ for all $m$, where the samples of each identity are utilized to generate the templates of that identity in enrollment. Likewise, following the aforementioned notation, we assume a claimant's sample $x$ whose associated identity comes from $\mathcal{U}$, that is, $y \in \{y_{u_1}, \ldots, y_{u_M}\}$, whose identity is derived by an argmax of the similarity scores spanning the entire set of templates.
3.2. Greedy Coreset Sampling
To our knowledge, ref. [
24] is one of the leading works to implement greedy coreset sampling for generating a reduced yet representative memory bank in machine learning; the ubiquity of memory banks used in conjunction with pretrained networks accentuates the importance of structuring a generalized, robust, yet practical (especially in size) memory bank. That realistic applications are hindered by memory-bank size has already been pointed out in [
40]. Consequently, a primitive approach to this problem would be understanding the generalizability and size of the memory bank as a tradeoff.
However, both concerns may be addressed simultaneously by reducing the memory bank with the coreset subsampling algorithm. To this end, let us assume training samples $\{x_i\}_{i=1}^{N}$ and a pretrained network $f_\theta$, where the initial memory bank would be simply defined as
$$\mathcal{M} = \{\, f_\theta(x_i) \mid i = 1, \ldots, N \,\}.$$
Here, we aim to search a representative subset $\mathcal{M}_C \subset \mathcal{M}$ such that $|\mathcal{M}_C| \ll |\mathcal{M}|$.
Observing that ref. [16] contemplates the use of the nearest-neighbor distance in harmony with [24], we likewise adopt minimax facility-location coreset selection via iterative greedy approximation [25,26], assuming a flowing memory bank randomly initialized as $\mathcal{M}_C \leftarrow \{m_0\}$ with $m_0 \in \mathcal{M}$. Subsequently, our objective is to search $\mathcal{M}_C^*$ such that
$$\mathcal{M}_C^* = \operatorname*{arg\,min}_{\mathcal{M}_C \subset \mathcal{M}} \max_{m \in \mathcal{M}} \min_{n \in \mathcal{M}_C} \lVert \psi(m) - \psi(n) \rVert_2, \tag{1}$$
where $\mathcal{M}_C$ is iteratively given as
$$\mathcal{M}_C \leftarrow \mathcal{M}_C \cup \Big\{ \operatorname*{arg\,max}_{m \in \mathcal{M} \setminus \mathcal{M}_C} \min_{n \in \mathcal{M}_C} \lVert \psi(m) - \psi(n) \rVert_2 \Big\}, \tag{2}$$
with $\mathcal{M}_C$ being continually updated by Equation (2) each iteration until it reaches the predefined size, and $\psi$ a random projector for computational efficiency [41], its usage justified by the JL theorem [27]. The integrity of such subsampling has been explored in [24], where the results indicate a significant enhancement in efficiency with minimal deterioration in generalizability.
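The iterative greedy approximation described above can be sketched as follows; the bank size, feature dimensions, and Gaussian projector are illustrative assumptions:

```python
import numpy as np

def greedy_coreset(memory, target_size, d_star=32, seed=0):
    """Minimax facility-location coreset via iterative greedy approximation:
    start from a random element, then repeatedly add the point farthest
    from the current coreset, measuring distances after a JL-style random
    projection for computational efficiency."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(0.0, 1.0 / np.sqrt(d_star), (memory.shape[1], d_star))
    z = memory @ proj                                  # projected features
    idx = [int(rng.integers(len(z)))]                  # random initialization
    min_d = np.linalg.norm(z - z[idx[0]], axis=1)      # distance to current coreset
    while len(idx) < target_size:
        nxt = int(np.argmax(min_d))                    # farthest point from the coreset
        idx.append(nxt)
        min_d = np.minimum(min_d, np.linalg.norm(z - z[nxt], axis=1))
    return memory[idx]

bank = np.random.default_rng(1).normal(size=(500, 128))   # synthetic template bank
core = greedy_coreset(bank, target_size=50)
print(core.shape)  # (50, 128)
```

Maintaining the running minimum distance `min_d` keeps each iteration linear in the bank size, so the whole selection costs O(|M| * target_size) projected-distance evaluations.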