1. Introduction
Recognizing certain characteristics of a user is important for a number of processes, such as improving the performance of authentication systems. Various techniques are proposed in the literature [
1] for how this can be achieved and are mainly divided into two categories, physiological biometrics and behavioral biometrics. The first are associated with the shape or measurements of the human body, while the second are associated with the behavior of an individual. The former include the measurement and analysis of fingerprints, iris, palm geometry, etc., while the latter includes the measurement and analysis of handwriting, gait, voice, etc.
Keystroke dynamics is a behavioral biometric which exploits data derived from the way users use the keyboard [
2], physical or virtual. Its main advantage over other biometric methods is that it does not require any specialized equipment, while its main disadvantage, like all behavioral biometrics, is that a user can change the way he/she types. Research into the keystroke dynamics, which began in the 1970s, has resulted in the implementation of systems that perform user authentication, recognize some inherit and/or acquired user characteristics and recognize mental and/or physical states of the users.
The term “the way a user uses the keyboard” means, among other things, the typing speed, the duration the keys remain pressed, the duration needed for using a series of specific keys, the number and frequency of pauses during typing, the number of typing errors and the way they are corrected, the frequency of use of specific keys, the time of day the typing is performed and the applications in which the typing is performed.
The keystroke dynamics features that have been used in research so far and those that could be used total up to number in the order of millions, but each one of them encloses a small amount of information. This creates a pleasant headache for the researcher, who is asked to choose from a huge number of features, so that his system has the highest possible accuracy in the shortest possible operating time.
This paper attempts to answer two questions. First, which of the most commonly used types of keystroke dynamics features work best. To the best of our knowledge, this analysis has not been done for this field. Second, if the simultaneous and combined characteristics of users are presenting a more appropriate technique than utilization of these characteristics individually. Specifically, in the problem of simultaneously finding of the gender and age of an unknown user, with the help of five machine learning models, the most widely used keystroke dynamics features are used and the system performance is compared for different feature sets.
The rest of the paper is structured as follows: First, a review of the literature that frames the topic of this work is made. The methodology followed is then explained and analyzed. In the
Section 4, the system performance results for each feature set and for each of the five machine learning models are presented. Finally, the work is concluded and possible future-work directions of this research are presented.
2. Theoretical Background
Although the user authentication with passwords is an easy to implement security mechanism it has many vulnerabilities such as shoulder surfing, keyloggers, phishing and brute force [
3]. For this reason, many alternatives to user authentication have been proposed, including the use of keystroke dynamics. This technology counts almost five decades of research [
4] and has now matured, showing authentication systems with very low equal error rate (EER).
Recognizing the aforementioned problems, Lin et al. [
5] proposed an authentication system that takes advantage of keystroke dynamics features, including the duration when a key is pressed (keystroke duration (KD)) and the time in between the release of one key and the pressure of the next (up-down diagram latency (UDDL)), in order to detect an unlawful user, even when he/she knew the genuine password for an account. After the data are obtained and the necessary features are extracted, a convolutional neural network is used, performing with the accuracy of 99% for the detection of the legit users.
In another study, Venugopal and Viji [
6] used keystroke durations and the four different ways in which diagram latencies can be calculated for their authentication system. This is defined as the time in between pressing one key and pressing the next (down-down diagram latency, DDDL) and, respectively, down-up diagram latency (DUDL), up-up diagram latency (UUDL) and UDDL. The researchers collected data from the passwords entered by the volunteers and developed their system in MATLAB. They managed to achieve an EER of 0.5%.
For a similar purpose, Young et al. [
7] in their work attempt to verify the identity of users participating in online courses. To achieve their goal, they collected data from 78 volunteers during typing by copying two texts and answering two questions. They used keystroke durations and up-down diagram latencies as features and with the help of them they extracted the “keyprint signature” and “keyprint profile” for each volunteer. After conducting the experiments, they showed that there is no absolute consistency in the way a user types and, therefore, in order for the authentication to be more successful, a lot of data is required.
Moreover, Sun et al. [
8] focus on and present results for novel keystroke dynamics features such as, the keystroke durations of Space, Backspace, arrow keys and the down-down diagram latencies of Shift-“I”, Shift-“N”, period–Space, comma–Space. They used data from the typing recording of 34 users and after studying the behavior of each of the feature that had isolated, as well as sets of them, with the help of an SVM in user identification, they came up with a system that uses only 13 features and presents an EER of 2.94% and an AUC of 0.994.
In addition to user authentication, which has piqued interest of most researchers, keystroke dynamics also offer solutions to user classification in relation to some of their inherent or acquired characteristics, such as gender, age, handedness, educational level, etc. There are several reasons for this kind of classification and this is because text is the dominant medium through which the billions of Internet users today communicate [
9]. So, for a complete unknown user, who types to communicate with other users, to post announcements and to perform Internet searches, it is possible to identify some of his/her characteristics. In return, results are transferred to, firstly, the optimization of the targeted advertisements so that the user is not to be overwhelmed by messages that do not interest him/her, secondly, the facilitation of the use of Internet services, for example by automatically filling in fields on forms and by making suggestions for participation in discussion groups and visiting websites related to user characteristics and, thirdly, the collection of information on the identity of malicious users, which will help a forensic investigation.
In their work Buker et al. [
10] recognize the importance of keystroke dynamics with citing the projected increase in investment in live chat applications, as well as the increase of their usage. Resulting in the consequence that collecting the information of how people type plays a vital social and economic role. In an attempt to find out the gender of a user, they collected data from 60 users, who were communicating in pairs within a live chat application, on a specific topic of discussion. They used only 15 features, including the frequency of use of Backspace, the frequency of use of the question mark and the frequency of use of non-alphabetic characters. They implemented random forest for classification performing with an accuracy of over 95%. Among their findings, an interesting result is that male users have a greater tendency to correct typos and misspellings.
In another study, Bandeira et al. [
11] researched on the gender of an individual from data derived by handwriting and by recording of typing. In their endeavor involved 100 volunteers and regarding keystroke dynamics, the data are collected from four tasks completed by the participants that included copying text and typing at will. These two ways of data acquisition in keystroke dynamics are called fixed-text and free-text, respectively. The researchers took advantage of 29 features and used four classifiers. The best results came from SVM with 64% accuracy.
Research on the data from different sources, namely keystroke dynamics, touchscreen dynamics and handwritten signature data are conducted in the work of Da Costa-Abreu and Goncalves [
12]. The researchers used data from the recordings of 76 volunteers when typing a specific text. From the data, they extracted from the up-down diagram latencies of 14 diagrams and the keystroke durations of the characters that make them up. Finally, they were able to determine the gender of the user, with the help of an SVM, with an accuracy of 83.5%.
In the field of user age search, Tsimperidis et al. [
13] used a dataset consisting of 387 logfiles and extracted 700 features from it, which were keystroke durations and down-down diagram latencies. Utilizing five classifiers, experiments are conducted with different sets of features. The experiments resulted in creation of a system that can distinguish with an accuracy of about 90% the age group of an unknown user, among four options.
In a different field, Tzafilkou and Protogeros [
14] tried to relate mouse movements and keystroke dynamics to states that software developers may be found, namely self-efficacy, risk-perception, willingness to learn, perceived usefulness and perceived easy to use. They developed a software for recording the mouse movements and keyboard usage, which also includes a questionnaire to assess the user’s status. They collected data from 30 participants and the keystroke dynamics features they extracted were typing speed, keystroke durations, UDDL and DDDL. Their experiments show a correlation between the typing mode and some of the states that are defined for a software developer.
In another paper, Ulinskas et al. [
15] used an existing keystroke dynamics dataset that came from recordings from 53 people typing the same password, in order to recognize user fatigue. From the data, they extracted as features the keystroke durations and all the different forms of diagram latencies. They used six classifiers and found that the best results came from the up-up diagram latencies, which made it possible to recognize fatigue at 91%.
A more concise picture of the literature is presented in
Table 1, where in the field “Scope” with UA is marked the user authentication, with UV the user verification, with GC the gender classification, with AC the age classification and with SA the state assessment.
Two conclusions drawn from the literature review of keystroke dynamics: (i) it has been observed in the literature that most studies use as key features the keystroke durations and one or more forms of diagram latencies; (ii) there is a diversity and inconsistency in the terminology for these features. Thus, the keystroke duration is found as dwell time, or hold time, or press hold, or key press time. Down-down diagram latency is found as flight time, or press-to-press, or press latency. Up-down diagram latency occurs as interval, or flight time, or release-to-press, or latency. Down-up diagram latency is known as latency, or press-to-release. In addition, down-down diagram latency is met as flight time, or up-to-up, or release-to-release, or release latency.
3. Methodology
In the present study, the dataset created for [
16] is used. This work describes the keylogger used, the process of recording volunteers during their daily computer use, the size and format of logfiles, the number of participants and logfiles and the process of extracting keystroke durations and down-down diagram latencies used as features to create four systems, each of which recognizes the gender, age group, handedness and educational level of the users with high accuracy.
In contrast, in the present work, a system is designed for the simultaneous identification of the gender and age group of the user. In the available dataset, which consists of 387 logfiles, users are divided into two genders and four age groups. Therefore, eight classes are created. The number of logfiles with respect to these age and gender groups are presented in
Table 2.
As shown in
Table 2, the dataset is not balanced, but each class is adequately represented.
The extraction of keystroke durations and down-down diagram latencies led to a feature set with over one hundred thousand features. In order to select the appropriate features that include the most information in the separation of users according to their gender and age, a procedure was followed, which is described in detail in [
17], in which the information gain (
IG) for each feature is calculated. In a brief description, the
IG of a feature
f is calculated from the reducing of the entropy that causes to a system
x, as follows:
The entropy
H(
x) of system
x is calculated as:
with
m being the number of classes and
P(
xi) being the probability of class
xi.
The term
H(
x|
f) is calculated by splitting the dataset into groups according to the value of the particular feature
f. Then, the entropy of each group is calculated and
H(
x|
f) is given by:
with
N being the number of instances of the initial dataset,
k being the number of groups that the initial dataset was split,
nj being the number of instances of the
j-th group and
H(
xj) being the entropy of the
j-th group, which can be calculated from Equation (2).
The results showed that there are 811 features with non-zero
IG and the first 30 are presented in
Table 3, where features with one number, corresponding to the virtual key code (VKC) of the key used, are the keystroke durations and features with two numbers, corresponding to the VKCs of the diagram used, are the down-down diagram latencies.
An important observation made in
Table 3 is that keystroke durations seem to play a more important role in classifying users according to their gender and age.
Table 4 shows the number and percentage of KD and DDDL per 100 features.
Considering that there are around 100 KD and 100,000 DDDL, the striking thing is that 0.1% of the available features represent 32% in the top 100 with the highest IG. One explanation that can be given for this phenomenon is that in the data recorded from the volunteers the KD of a key is found many more times than the DDDL of a diagram. However, in the present study the value of the feature is taken as the average value and therefore the number of appearances of the feature does not play a direct role. Therefore, if there is no hidden correlation between the number of appearances of each feature in the raw data and the enclosed information, an explanation should be sought for how important they are in keystroke dynamics studies, or at least in user classification according to their gender and age.
The ability to simultaneously find of the gender and age of an unknown user, using the features highlighted by the feature selection process, as well as all available features, all available KDs and all available DDDLs, is tested in this research using five known machine learning models, namely, support vector machine (SVM), simple logistic (SL), naïve Bayes (NB), Bayesian network classifier (BNC), and radial basis function network (RBFN). The selection of these classifiers was made with two criteria. Firstly, to make a direct comparison with the results presented in the work [
16] and secondly because these models showed the best performance over many others tested, such as multilayer perceptron (MLP), random forest (RF), naïve Bayes tree (NBtree), etc., which showed low accuracy and/or very high training time.
Experiments were performed for each of the classifiers in order to find the best performance, using as feature sets the 100 features with the highest IG, the 200 features with the highest IG, etc., up to 800 features with the highest IG, which is the hundred closest to the number of features with non-zero IG. In addition, for the best performing classifiers, all available features, all available KDs and all available DDDLs were used as additional feature sets, so that a direct comparison can be made and conclusions can be drawn as to which type of features is most suitable for solving the problem of user classification.
The comparison of the performance between the classifiers and the different feature sets was done with the criterion of accuracy (Acc.) and the required training time of the system (time to build model (TBM)). In addition, the F-score (F1), which is the harmonic means between precision and recall [
18] and is a safe measurement even for unbalanced datasets and the area under the ROC curve (AUC), which is the area below the receiver operating characteristic curve [
19] were used as comparison measures.
Each experiment was performed with the 10-fold cross-validation method [
20] in order to obtain a more unbiased picture of its statistical measures, since it is performed ten times with a different training set and testing set each time and their average value is calculated. This avoids the possibility of calculating an outlier as accuracy, F-score and AUC.
In our case, where the dataset consists of 387 logfiles, each fold consists of 38 or 39 files. Thus, the 348 or 349 files are used to train the model and the rest as a testing set. This is repeated in a round robin manner. In addition, another approach is the leave-one-out mode, where in our case the dataset would be divided into 387 folds, so that only one logfile is examined at a time. This approach is more appropriate for the better estimation of model performance, but it costs in computational time; this is planned to be used in future research.
4. Results and Discussion
For each of the five machine learning models and for each of the eight feature sets of 100 to 800 features, a number of experiments was performed in order to find those classifier parameters that lead to the best performance.
The best results in simultaneously finding the gender and age of an unknown user, for each different feature set, using SVM, along with classifier parameters, are presented in
Table 5.
SVM shows the best performance using the polynomial kernel in each different feature set, while the highest accuracy is achieved in a feature set with less than 800 features, which is the largest tested.
Table 6 shows the results and classifier parameters for SL, for each different feature set.
The best SL performance is achieved in the feature set with the largest number of features. Regarding NB, the results of the experiments showed better performance for each feature set, as presented in
Table 7.
The accuracy and time complexity of the NB increase as more features are used, while the best performance is the one of 600 features since has the higher AUC.
Similarly, the best results and BNC settings that lead to them are shown in
Table 8.
BNC performance, as expected, improves as the number of features used increases, while time complexity depends on the number of features and the max number of parents.
Finally,
Table 9 presents the best performance and the corresponding RBFN settings for each different feature set.
As in SVM, so in RBFN, the best performance is presented in a feature set with less than 800 features, while the time complexity increases steadily with the number of features. To explain why the accuracy of the system slightly decreases as the number of features in the feature set increases, meticulous experiments over the subsets of the feature set which lie beyond the scope of this work need to be conducted. This task is left as a future research goal.
At this point two comparisons can be made. One regards the performance of the classifiers used.
Figure 1 shows the optimal performance of each classifier.
As shown in
Figure 1, the RBFN is superior to all other classifiers in accuracy exceeding 78%, in the F-score and in the area under the ROC curve, with SVM second in each case having an optimal accuracy approaching 67%, while NB has the lowest values in all three sizes. Regarding the time required to train the model, the NB runs faster than other classifiers, followed by the SVM, while the SL presents the greatest time complexity.
The confusion matrix of the best run, which is that of the RBFN model in the feature set of 200 features, in order to give a picture of the distribution of predictions, is presented in
Table 10.
One observation made in
Table 10 is that a large proportion of erroneous predictions, 27 out of 84, are in adjacent age groups. This, in part, was to be expected, as the classification of users according to the age group they belong to is not clear enough. For example, a 26-year-old user probably has more in common with a 25-year-old user than with a 35-year-old user. However, while the user is in the same group with the latter, they will be in a different age group than the former.
The second comparison is between the performance of the system presented in the present work in simultaneously searching for a user’s gender and age group and the performance of the systems for separately searching for these two characteristics in the study [
16]. This comparison is made in
Table 11.
The column “Both” in the results [
16] in
Table 11 is calculated from the product of the corresponding values in the columns “Gender” and “Age”. Indeed, gender and age of a person are two events statistically independent of each other and thus, to calculate the probability of a user belonging to a specific gender and age group, it is sufficient to multiply the individual probabilities [
21]. The conclusion is that all machine learning models, except RBFN, have a higher accuracy of 3 to 6% in simultaneously finding the gender and age of an unknown user, compared to the finding of these characteristics separately. Although the phenomenon is not universal, it seems that the choice of such features that do not focus only on the segregation of users according to their gender, or only according to their age, leads to systems with better performance in the simultaneous finding of user characteristics.
In addition, in the work [
16], 514 features were used for the gender classification and 690 for the age classification. Some of these features are common to both classification problems, but many others are found in only one of them, with the result that in order to achieve the performance presented in
Table 11, a total of 947 keystroke dynamics features are needed. This is a number of features 18% larger than the largest feature set used in this study and it is almost five times more than the 200 features that give the best result in RBFN. The consequence of this is that a shorter processing time is required to extract the appropriate features. This phenomenon is observed in the simultaneous search for additional user characteristics, such as handedness, educational level, etc. Therefore, one of the novel contributions of this work, to the best of our knowledge, the simultaneous search of more than one user characteristics using keystroke dynamics. Conducting experiments for this effort shows that, the simultaneously searching for multiple characteristics of an unknown user, using a single feature set for all characteristics, leads to similar or better results, but while spending less time extracting features.
4.1. Comparing Keystroke Durations and Down-Down Digram Latencies
The findings of
Table 4 lead to the investigation of the importance of the two most frequently used types of features in keystroke dynamics, namely KD and DDDL, and to their direct comparison. Specifically, the two best performing classifiers, SVM and RBFN, were used to test which of the two types of features best separates users by gender and age.
In the available dataset there are actions recorded from 108 keys and from about 11,500 diagrams. Additional experiments were performed to find the classifier parameters that lead to the best performance for the dataset with all the available features, all KD and all DDDL. The results are presented in
Table 12.
From
Table 12, it is clear that including all available DDDLs leads to higher performing systems than those using all available KDs. However, using such a large feature set, such as one that contains all DDDLs, results in very high time complexity. Thus, using all available DDDLs requires tens of times more training time than using all available KDs. In the case of RBFN, an increase in accuracy of 1.5% requires 20 times more execution time. The results are not so diverse in SVM, where increasing the accuracy by 10% takes 10 times more time. So, on the one hand, using only all KDs leads to systems that run much faster than those that use only all DDDLs, but on the other hand, they have lower accuracy and in some cases much lower.
The claim that far fewer KDs are used and that the comparison should be made on the same number of features is refuted by the fact that there are many more available DDDLs that can contribute with information that contain. Therefore, it is not possible to draw a safe conclusion, with the existing data and results, as to which of the two types of features is considered the most important. However, what can be safely extracted is that the combined use of both types of features leads to even better results. In fact, in the case of RBFN the system has an accuracy of 79.3%, which is even higher than those shown in
Figure 1. In addition, further proof of this claim, as shown in
Table 5,
Table 9 and
Table 12, is that experimenting with the same number of features, systems that use both KDs and DDDLs perform much better than those that use only KDs.
The fact that the combination of KDs and DDDLs leads to systems with better accuracy sets the basis for a research that will involve even more types of keystroke dynamics features.
4.2. Improving the Accuracy
The main requirement of a system that attempts to recognize certain characteristics from a completely unknown user is the accuracy it presents and in our research is the most important criterion for its evaluation. For this reason, some of the algorithms which use the results of various classifiers apply numerous techniques to improve the accuracy and usually are called meta-algorithms, have been utilized in this research.
As such, AdaBoost [
22] was initially used with “weak” classifier the RBFN, which showed the best results, as shown in
Figure 1. The experiments were conducted in the feature set of 200 features, in which the highest accuracy of 78.3% is presented and in the feature set of 800 features, which is the largest of the feature sets formed with the features that have non-zero
IG. The best performance of the model, for every 10 iterations of the AdaBoost algorithm, is presented in
Table 13, which does not include experiments with the number of iterations where the training time was prohibitively long.
As shown in
Table 13, improving the accuracy of the RBFN using AdaBoost is possible, but the cost is very high in terms of the required computational time. Indicatively, for an improvement of 3.6% it takes about 900 times more time.
Another algorithm used is Rotation Forest [
23] with base classifier C4.5 decision tree. The experiments were conducted in the feature set of 800 features and the best performance of the model, for every 10 iterations of the Rotation Forest algorithm, up to 100 iterations, is presented in
Table 14.
Among the classifiers that are tested for the requirements of the presented study is the C4.5 decision tree, which presents 36.7% as higher accuracy with training time of 1.75 s, which was the reason for not being included in the models with the best performance. However,
Table 14 shows that using the rotation forest algorithm, the accuracy of C4.5 is significantly improved reaching 80%, although the processing time has to be multiplied due to the iterative approach.
Although many more experiments have to be conducted, using more boosting algorithms, more classifiers and more fine-tuning, which goes beyond the scope of this research, it is concluded that the accuracy of the system can be improved, however, with the cost of the much longer training time.
4.3. Method Limitations
As can be seen from the results of the present study, as well as others in the field of keystroke dynamics, it is possible to find some user characteristics with quite high percentages in accuracy. However, there are some issues to be discussed regarding limitations and objections to the use of the proposed method.
The user typing pattern is quite hidden and it is not clear what separates the males from the females, the users of one age group from those of another, etc. However, it is not known whether a user can modify the way they type to hide their characteristics. If this is possible, then new techniques should be sought to overcome this problem.
In addition, the collection of keystroke dynamics data is a limitation of the method. This is because, while the data exploited by keystroke dynamics do not reveal personal or sensitive user data, the way they are obtained leaves questions as to whether a malicious user can exploit them.
5. Conclusions and Future Work
Keystroke dynamics are a subject of research mainly in the field of user authentication, but also in the detection of the physical and mental condition of users, as well as in their classification according to some inherent or acquired characteristics. The latter can find a variety of applications, such as obtaining valuable information about the person who committed a cybercrime in a digital forensics investigation, facilitating Internet users in exploiting useful services on a case-by-case basis, improving targeted advertising and warning of unsuspecting users about the danger of becoming victims of deception. In addition, keystroke dynamics can be added as a side authentication and verification mechanisms for the cases where a continuous authentication is needed. The contributions from this research can be applied to user sessions dynamically to provide security measures for an unattended computer terminal that is left unlocked to verify the user activity to raise alerts or flags.
In almost all keystroke dynamics studies the features used were keystroke durations and the four forms of diagram latencies. Due to this fact, in this work, it is investigated which type of features includes the most information. For this purpose, this research contributes to the existing keystroke dynamics literature by introducing a system that simultaneously recognizes the gender and age of unknown users and by exploiting the features of an existing keystroke dynamics dataset with five known classifiers. The results showed that (i) the identification of the class, among eight, to which a user belongs according to his gender and age group can be achieved with a percentage approaching 80% and (ii) it is not possible to draw a safe conclusion as to which of the two types of keystroke dynamics features examined contains the most information and is most appropriate for user segregation. However, one conclusion reached was that the combined use of both types gives significantly better results.
As a conclusion, the extension of the research will be directed towards the extraction of other types of features and the implementation of systems that will use a combination of them. Such features can come from trigrams, tetragrams, etc., typing pauses, typing corrections, etc. Other possible extensions are to approach the problem with Dempster-Shafer theory [
24] to extend the existing dataset by recording volunteers which have different native languages in order to examine the possibility of recognizing this characteristic and to test the robustness of our methodology using different keystroke dynamics datasets.