The five main elements of the system are described below.
3.1.1. Raw Keystroke Data Collection
Many experiments have been conducted in keystroke biometrics. However, different studies have used different data sets and different features, and they also differ in the evaluation procedures used. Some researchers have used benchmark data sets, while others have used their own, so it is hard to compare model performance across studies. To evaluate the proposed model, the CMU keystroke dynamics benchmark data set [14] was used.
The CMU data set was chosen for evaluation because it was published together with a performance analysis of existing keystroke dynamics algorithms, enabling objective comparisons. Its authors evaluated the data set with fourteen existing keystroke dynamics classifiers, including Euclidean, Euclidean-normed, Manhattan-filtered, neural network-standard, fuzzy logic, and one-class SVM.
Keystrokes are detected by a keylogger that records and stores the sequence of keys typed by the users along with key-press and key-release timing information. Event times are measured in milliseconds with roughly 16-millisecond precision [58]. There were 51 subjects (typists) in the CMU keystroke benchmark data set, each typing the static strong password string “.tie5Roanl”. The data set also considers the Enter key to be part of the password, making the 10-character password 11 keystrokes long. There were eight data-collection sessions for each subject, with at least one day between consecutive sessions. Fifty repetitions of the password string were collected in each session, resulting in 400 samples per subject and a total of 20,400 samples for the 51 users.
The CMU keystroke data set contains keystroke dynamics consisting of the dwell time (hold time) for each key, as well as the flight time between two successive keys—key press latency (KPL) and key interval (KI).
Figure 2 shows the hold times of the keystrokes “.tie5Roanl” and Enter for the first two subjects and their first two sessions from the CMU keystroke data set. The lines in each chart represent the 50 samples of one data-collection session. The features are highly correlated, with large-scale variations, and some are linearly dependent.
3.1.2. Feature Selection
In the two-state POHMM, a user can be either in an active state or a passive state of typing. The key names “.tie5Roanl Enter” are event types that partially reveal information about the hidden state. The timing feature vector for the CMU keystroke data set is formed by the 11 key-hold (KH) times, 10 key-press latencies (KPL)/down-down, and 10 up-down/key interval (KI) latencies of the 11-keystroke sample, creating a total of 31 timing features. Ten key-release latency (KRL) and ten release-press latency (RPL) features were also derived from the KH, KI, and KPL features. Therefore, a 51-dimensional feature vector was extracted for each sample of the CMU keystroke benchmark data set.
For user identification and verification, the hybrid POHMM/SVM model uses the hold-time and key-press latency features. The hold time and key-press latency for each of the 11 characters are modeled by a lognormal distribution conditioned on the hidden state and the key name, and the values are multiplied by 1000 for normalization. Similar feature selection and normalization were used in [13]. The other features were extracted for comparison purposes in different experimental setups. Different studies have used different features or combinations of features; hold time is the most used feature in keystroke biometrics. M.S. Obaidat [59] suggested that hold-time-based classification is superior to interkey-time-based classification and that a combined hold-time and interkey-time approach gave the lowest misclassification error. This research explores different features from the CMU keystroke data set and compares identification accuracy using different classifiers.
From two consecutive keystroke events, five types of features were extracted using the following formulas:

$$KH_n = R_n - P_n$$
$$KI_n = P_n - R_{n-1}$$
$$KPL_n = P_n - P_{n-1}$$
$$KRL_n = R_n - R_{n-1}$$
$$RPL_n = R_n - P_{n-1}$$

where $KH_n$ denotes the key hold time, $KI_n$ is the key interval time, $KPL_n$ denotes the key press latency, $KRL_n$ is the key release latency, and $RPL_n$ is the release-press latency of the $n$th keystroke. $R_n$ and $P_n$ are the release and press timestamps of the $n$th keystroke. Similarly, $R_{n-1}$ and $P_{n-1}$ are the release and press timestamps of the $(n-1)$th keystroke.
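To make these definitions concrete, here is a minimal Python sketch (not from the original study) that derives the five features from raw press/release timestamps; the timestamp arrays are fabricated for illustration.

```python
import numpy as np

# Illustrative press (P) and release (R) timestamps in milliseconds for a
# short keystroke sequence; these values are fabricated, not CMU data.
P = np.array([0.0, 180.0, 350.0, 540.0])
R = np.array([95.0, 260.0, 455.0, 630.0])

KH = R - P               # key hold time of each keystroke
KPL = P[1:] - P[:-1]     # key press (down-down) latency
KI = P[1:] - R[:-1]      # key interval (up-down) latency
KRL = R[1:] - R[:-1]     # key release (up-up) latency
RPL = R[1:] - P[:-1]     # release-press latency

# As noted above, KRL and RPL are derivable from KH, KI, and KPL:
assert np.allclose(KRL, KH[1:] + KI)
assert np.allclose(RPL, KH[1:] + KPL)
```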
3.1.3. Training the POHMM and Extracting POHMM Parameters
After feature extraction, the POHMM is trained and its parameters are collected. The POHMM was developed and implemented by Monaco et al. [13], and the Python code can be downloaded from [60]. The POHMM is an extension of the hidden Markov model (HMM), and the structure of the model is shown in Figure 3.
In Figure 3, $O = [O_1, O_2, \ldots, O_T]$ represents the sequence of observation vectors (emission vectors), $X = [X_1, X_2, \ldots, X_T]$ is the sequence of observed values (event types), $\Theta = [\Theta_1, \Theta_2, \ldots, \Theta_T]$ is the sequence of hidden values (system states), and $T$ denotes the total number of observations. In the POHMM, the hidden state and the emission depend on an observed independent Markov chain $X$. The emission $O_{t+1}$ depends on the event type $X_{t+1}$ in addition to the hidden state $\Theta_{t+1}$, and the hidden state $\Theta_{t+1}$ depends on $X_t$ and $X_{t+1}$ in addition to $\Theta_t$ [13]. For each sample, the POHMM is trained, and its parameters are collected and stored in a file. The complete parameter estimation, using a modified Baum–Welch reestimation algorithm, marginal distributions, and parameter smoothing, proceeds as follows:
- (a) Initialization: choose initial model parameters $\lambda^{(1)}$ and let $k = 1$.
- (b) Expectation: compute the forward variable $\alpha_{j|X_n}(n)$, the backward variable $\beta_{j|X_n}(n)$, and the posterior probabilities $\gamma_{j|X_n}(n)$ and $\xi_{ij|X_n,X_{n+1}}(n)$. Let $L^{(k)} = P(O \mid \lambda^{(k)}, X)$, where $O$ is the emission sequence, $\lambda^{(k)}$ is the current model parameter, and $X$ is the event type.
- (c) Maximization: using the reestimation formulas presented in [2,13], update the model parameters (the initial state distribution $\pi$, the state transition probability matrix $A$, and the state emission probability matrix $B$) to obtain $\lambda^{(k+1)}$.
- (d) Marginal distributions: find the marginal distributions.
- (e) Parameter smoothing: find the smoothing weights and smooth the parameters with the marginal distributions.
- (f) Termination: if $k > 1$ and $|L^{(k)} - L^{(k-1)}| < \epsilon$, terminate and let $\lambda = \lambda^{(k+1)}$; otherwise, let $k = k + 1$ and go to step (b). Here, $\epsilon$ is the convergence-criterion threshold.
For each training sample, the POHMM provides a total of 130 parameters: 104 emission parameters and 26 transition parameters. These parameters form a parameter vector for each sample, which is used for identification and verification in the hybrid models. For experimental purposes, three parameter sets were collected: emission parameters, transition parameters, and combined emission–transition parameters.
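As a sketch of this training step, the snippet below fits a two-state POHMM with lognormal emissions on a single fabricated password sample using the pohmm Python package [60]. The constructor arguments, the fit_df method, the pstate_col argument, and the column names are assumptions based on our reading of that package's documented example; verify them against the current release before use.

```python
import pandas as pd
from pohmm import Pohmm  # POHMM implementation released by Monaco et al. [60]

# One illustrative password repetition: per-keystroke event type (key name),
# hold time ('duration'), and press latency ('tau'); values are fabricated.
sample = pd.DataFrame({
    'event':    ['period', 't', 'i', 'e', 'five'],
    'duration': [95.0, 80.0, 105.0, 90.0, 88.0],
    'tau':      [0.0, 180.0, 170.0, 190.0, 175.0],
})

# Two hidden states (active/passive typing) with lognormal emissions
# conditioned on the hidden state and the key name, plus smoothing.
model = Pohmm(n_hidden_states=2,
              init_spread=2,
              emissions=['lognormal', 'lognormal'],
              smoothing='freq',
              init_method='obs',
              thresh=1)
model.fit_df([sample], pstate_col='event')

# The fitted emission and transition parameters form the per-sample
# parameter vector consumed by the downstream classifiers.
```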
3.1.4. Classifier Selection and Building Models
To address the classification problem, a support vector machine classifier was used in the proposed POHMM/SVM approach. The literature search revealed that generative models perform better than discriminative models on smaller data sets, whereas discriminative models perform better on larger ones. To balance this trade-off, this research proposes a hybrid POHMM/SVM model: the POHMM handles missing or irregular training data, while the SVM provides faster and more accurate classification. The POHMM was trained using the key-press latency and key-hold time features for each sample from the CMU data set, after which the POHMM parameters were collected for each sample. These parameters were classified with the support vector machine; for comparison purposes, the POHMM parameters were also examined with other popular discriminative classifiers, such as k-NN, random forest, MLP NN, and logistic regression.
- (a) Support Vector Machine:
The support vector machine is a well-known supervised machine-learning algorithm generally used to solve two-class or multiclass classification problems. For user verification, however, only genuine data are available for training, so a model has to be built for the genuine user only and then used to detect imposters [50]. Schölkopf et al. [61] extended the two-class or multiclass SVM to the one-class SVM to solve the one-class classification problem. The Scikit-learn 0.18.1 Python package and the sklearn.svm module were used in the experiment [62]. We used the one-class SVM classifier (svm.OneClassSVM) in the proposed model for user verification. OneClassSVM is an unsupervised outlier detector based on the support vector machine library libsvm [63]. The classifier was configured with a radial basis function (RBF) kernel, a training-error tolerance ν of 0.5 (an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so roughly half the samples become support vectors), and an RBF kernel coefficient (gamma) of 0.9. The CMU data set contains 51 unique users. For identification tasks, the multiclass SVM (svm.SVC) with a linear kernel function was used. SVC is a C-support vector classifier based on the support vector machine library libsvm [64].
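The following minimal scikit-learn sketch mirrors the configuration described above. The feature matrices are random placeholders standing in for the POHMM parameter vectors, and the per-user data handling is simplified.

```python
import numpy as np
from sklearn.svm import OneClassSVM, SVC

rng = np.random.default_rng(0)
X_genuine = rng.normal(size=(50, 130))  # placeholder: one user's parameter vectors
X_all = rng.normal(size=(200, 130))     # placeholder: all users' parameter vectors
y_all = np.repeat(np.arange(50), 4)     # placeholder user labels
X_query = rng.normal(size=(5, 130))

# Verification: one-class SVM trained on the genuine user's samples only.
# nu=0.5 upper-bounds the training-error fraction and lower-bounds the
# support-vector fraction; gamma=0.9 is the RBF kernel coefficient.
verifier = OneClassSVM(kernel='rbf', nu=0.5, gamma=0.9).fit(X_genuine)
decision = verifier.predict(X_query)    # +1 = genuine, -1 = imposter

# Identification: multiclass C-support vector classifier, linear kernel.
identifier = SVC(kernel='linear').fit(X_all, y_all)
predicted_user = identifier.predict(X_query)
```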
- (b) k-Nearest Neighbor:
The k-nearest neighbor (k-NN) classifier is another frequently used classifier in biometrics. It is a nonparametric classification method in which a new class label is assigned to the input pattern based on the nearest training samples in feature space. The k-NN is a simple classifier that requires only reference data points for both genuine and imposter classes; it uses the data directly for classification without first building a model and does not require a specific training phase. For a given unknown sample $f$ and a distance measure, the nearest-neighbor rule for classifying $f$ among $N$ classes is presented below [64]:
- Find the $k$ nearest neighbors among the $M$ training vectors without considering class labels. Generally, $k$ is not a multiple of $N$ and is chosen to be odd for a two-class problem.
- Find the number of samples $k_i$ out of the $k$ neighbors that belong to class $n_i$, where $\sum_{i=1}^{N} k_i = k$.
- Assign the unknown sample $f$ to the class label $n_i$ with the maximum number of samples $k_i$.
The k-NN classifier was used for user identification with the POHMM parameters. The Python implementation of the k-NN classifier (KNeighborsClassifier) from the sklearn.neighbors module was applied with all default parameters (number of neighbors = 5). The major drawbacks of any k-NN-based classifier are that its computing time is longer than that of other classifiers and its performance is generally worse on high-dimensional data.
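A corresponding sketch of the identification step with scikit-learn's KNeighborsClassifier and its defaults (n_neighbors = 5); the arrays are again placeholders for the POHMM parameter vectors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 130))  # placeholder POHMM parameter vectors
y_train = np.repeat(np.arange(50), 4)  # placeholder user labels
X_test = rng.normal(size=(5, 130))

# Defaults: 5 neighbors, uniform weights, Minkowski (Euclidean) distance.
knn = KNeighborsClassifier().fit(X_train, y_train)
predicted_user = knn.predict(X_test)
```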
- (c) Random Forest:
Random forest [65], or random-decision forest, is an ensemble learning method for classification built from decision-tree classifiers. A random forest is a combination of tree predictors in which every tree depends on the values of a random sample drawn with the same distribution for all trees in the forest. Random forest has become popular in keystroke dynamics in recent years: training and testing are fast, and it achieves good accuracy in many applications. Random forest is an effective prediction tool but has been observed to overfit on some data sets with noisy classification. The RandomForestClassifier classifier and the IsolationForest algorithm from the sklearn.ensemble module were used for identification and verification, respectively, in this experiment. We used the default parameters for RandomForestClassifier, where the default number of trees in the forest is 10.
Isolation Forest: Isolation forest [66] outlier detection uses a random forest of decision trees for anomaly detection. The isolation forest, or iForest, builds an ensemble of iTrees for the given samples; samples with short average path lengths on the iTrees are considered anomalies. The algorithm isolates observations in two steps: (a) randomly select a feature, and then (b) randomly select a split value between the maximum and minimum values of the selected feature. Isolation forest is well suited to large data sets because it has linear time complexity with a low constant and a low memory requirement [66]. It also converges quickly with a small ensemble size, allowing high efficiency in anomaly detection. We used all default parameters for the iForest classifier (IsolationForest) from the sklearn.ensemble module.
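A short sketch of both ensemble variants as configured above, with the 10-tree default of the scikit-learn release used in the paper made explicit; the data are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 130))  # placeholder POHMM parameter vectors
y_train = np.repeat(np.arange(50), 4)  # placeholder user labels
X_test = rng.normal(size=(5, 130))

# Identification: random forest with the 10-tree default of scikit-learn 0.18.
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
predicted_user = rf.predict(X_test)

# Verification: isolation forest trained on the claimed user's samples;
# predict() returns +1 for inliers (genuine) and -1 for outliers (imposters).
iforest = IsolationForest(random_state=0).fit(X_train)
decision = iforest.predict(X_test)
```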
Logistic regression: The logistic regression, or logit, model [67] is a nonlinear transformation of linear regression. The model is useful when the dependent variable is limited to two classes [68] and generally calculates the class-membership probability for one of the two categories in the data set. The relationship between the predictors and the dependent variable in logistic regression can be written as:

$$p(y = 1 \mid x) = \frac{1}{1 + e^{-\theta^{T} x}}$$

The prediction can be written in terms of $\theta^{T} x$, which is a linear function of $x$; $p(y = 1 \mid x)$ is used to predict genuine and imposter users. The decision boundary for logistic regression is also linear, $\theta^{T} x = 0$. If the value of $p$ is less than the threshold for a claimant user, the user is considered genuine; otherwise, the user is considered an imposter. Logistic regression is a popular classifier in medicine and bioinformatics, and it performs better than decision trees and k-NN on continuous data sets [69]. This study used the logistic regression classifier LogisticRegression (aka logit, MaxEnt) from the Python sklearn.linear_model module for user identification.
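A sketch of the identification step with LogisticRegression; predict_proba exposes the class-membership probabilities discussed above, and the data are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 130))  # placeholder POHMM parameter vectors
y_train = np.repeat(np.arange(50), 4)  # placeholder user labels
X_test = rng.normal(size=(5, 130))

logit = LogisticRegression().fit(X_train, y_train)
probs = logit.predict_proba(X_test)    # class-membership probabilities
predicted_user = logit.predict(X_test)
```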
- (d) Multilayer Perceptron Neural Networks (MLP NNs):
Multilayer perceptron (MLP) neural networks are feed-forward ANNs used in pattern recognition, classification, and prediction. The backpropagation (BP) algorithm is the most popular training technique for MLPs and has been applied in various fields, including network security, visual pattern recognition, handwriting recognition, medicine [70], intrusion detection, management, and finance. The performance of MLP NNs depends on various elements, such as the number of hidden layers, the number of neurons in each layer, the activation functions used by the neurons, and the choice of initial weights.
MLP is a supervised learning algorithm that learns a function by training on a data set. Given a set of features and a label, it can learn a nonlinear function estimator for classification. The difference between MLP and logistic regression is that MLP can have one or more hidden nonlinear layers between the input and output layers. The most important advantages of MLP are that it can learn a nonlinear model and can learn in real time. This experiment used the MLPClassifier class from [62] for identification. MLPClassifier implements a multilayer perceptron algorithm trained with the backpropagation technique.
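A sketch of the MLPClassifier usage; the hidden-layer size and solver are scikit-learn defaults (one hidden layer of 100 neurons), max_iter is raised only so the toy example converges, and the data are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 130))  # placeholder POHMM parameter vectors
y_train = np.repeat(np.arange(50), 4)  # placeholder user labels
X_test = rng.normal(size=(5, 130))

# Defaults: one hidden layer of 100 neurons; weights learned by
# backpropagation (gradient-based 'adam' solver in scikit-learn).
mlp = MLPClassifier(max_iter=500, random_state=0).fit(X_train, y_train)
predicted_user = mlp.predict(X_test)
```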
Besides the five discriminative models, the identification accuracy of the proposed POHMM/SVM model was also compared with that of the generative models HMM, POHMM, naïve Bayes, Gaussian mixture model (GMM), and Bayesian Gaussian mixture model (BGMM). To train and test the HMM and POHMM, the procedures described in [13] were followed. For the naïve Bayes classifier, the GaussianNB class with default parameters was used from scikit-learn. For GMM and BGMM, the GaussianMixture and BayesianGaussianMixture classes were used, respectively, with a maximum of 100 iterations.
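For the generative baselines, a sketch of the scikit-learn classes named above; the per-user scoring loop (fitting one mixture model per user and predicting the user whose model assigns the query the highest likelihood) is our reading of the setup, and the data are placeholders.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

rng = np.random.default_rng(0)
X = {u: rng.normal(size=(40, 21)) for u in range(3)}  # placeholder per-user samples
x_query = rng.normal(size=(1, 21))

# Naive Bayes with default parameters, as in the text.
nb = GaussianNB().fit(np.vstack(list(X.values())),
                      np.repeat(np.arange(3), 40))

# One GMM (or BGMM) per user, capped at 100 EM iterations; the user whose
# model assigns the query the highest average log-likelihood is predicted.
gmms = {u: GaussianMixture(n_components=2, max_iter=100, random_state=0).fit(Xu)
        for u, Xu in X.items()}
bgmms = {u: BayesianGaussianMixture(n_components=2, max_iter=100, random_state=0).fit(Xu)
         for u, Xu in X.items()}
predicted_user = max(gmms, key=lambda u: gmms[u].score(x_query))
```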
3.1.5. Model Training and Testing
A keystroke biometric system’s performance is evaluated by how correctly the model can differentiate genuine users from attackers. The classification accuracy (ACC) was measured for model testing and evaluation. Identification with POHMM/SVM, POHMM/k-NN, POHMM/random forest, POHMM/MLP NN, and POHMM/logistic regression was performed as follows: key-hold time and key-press latency (KPL)/down-down features were extracted from the CMU data set, the POHMM was trained on the 21-dimensional feature vector for each sample, and the POHMM parameters were collected. Two types of parameters are extracted from the POHMM: emission parameters and transition parameters.
The parameters are then split into training and testing data sets. Stratified k-fold cross-validation (SCV) is used for the split: it provides training/testing indices that randomly partition the data into training and testing sets. In k-fold cross-validation, the data set is partitioned into k equal subsets; each of the k subsets is used once as the testing set, with the remaining (k − 1) subsets used as the training set. The cross-validation object StratifiedKFold is a variation of KFold that returns stratified folds, i.e., each fold preserves the percentage of samples of each class. The accuracy of each fold is determined, and the average accuracy over the k folds gives the overall accuracy. An example of stratified fourfold cross-validation is shown in Figure 4.
By training a multiclass linear SVM with the training parameters, a system of N models is created. For an unknown testing sample, the class label with the highest likelihood is predicted. Finally, the accuracy score is computed from the testing labels and data. The same evaluation procedure was followed for POHMM/k-NN, POHMM/random forest, POHMM/MLP NN, and POHMM/logistic regression, where the POHMM parameters were used as features and k-NN, random forest, MLP NN, and logistic regression were used as classifiers. We also evaluated the identification performance of k-NN, random forest, logistic regression, SVM (linear kernel), SVM (RBF kernel), MLP NN, and naïve Bayes directly on the CMU keystroke data set; instead of the POHMM parameters, dwell time and flight time were used as features to train and test the models, as sketched below.
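Putting the evaluation together, the sketch below runs stratified fourfold cross-validation with the linear-SVM identifier and averages the per-fold accuracy, as described above. The feature matrix is a placeholder for the POHMM parameter vectors (or the raw timing features), with a fabricated 51-users-by-8-samples layout.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(408, 130))  # placeholder: 51 users x 8 parameter vectors
y = np.repeat(np.arange(51), 8)  # placeholder user labels

# Stratified folds preserve each user's share of samples in every fold.
skf = StratifiedKFold(n_splits=4)
fold_acc = []
for train_idx, test_idx in skf.split(X, y):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    fold_acc.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

# Overall accuracy is the mean accuracy across the k folds.
overall_accuracy = float(np.mean(fold_acc))
print(f'Average accuracy over {skf.get_n_splits()} folds: {overall_accuracy:.3f}')
```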