4.1. Experiment Results Based on Feature 1 (Index and Scancode)
To avoid overfitting and underfitting, we split the data into a training set, a validation set, and a test set for the experiment. Table 4 shows the experiment results for the training set, validation set, test set, and cross-validation using Dataset 1.
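For illustration, such a split can be produced with scikit-learn. This is a minimal sketch; the file name, column names, and split ratios are assumptions for illustration, not details reported in this paper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset1.csv")  # hypothetical file name
X = df.drop(columns=["label"])    # features, e.g., index and scancode
y = df["label"]                   # assumed encoding: 1 = real keystroke, 0 = random scancode

# Hold out a test set first, then split the remainder into training and
# validation sets (a common 60/20/20 scheme; the paper does not state its ratios).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)
```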
Specifically, random forest had the best training-set score at 0.93, while the remaining models scored approximately 0.90. On the validation set, random forest had the worst score at 0.86, while the remaining models scored approximately 0.88. On the test set, linear SVC had the worst score at 0.21, while the remaining models scored approximately 0.88. In cross-validation, MLP had the worst score at 0.746, while the remaining models, except for linear SVC, scored approximately 0.86.
Figure 4 shows the performance evaluation results for the real keyboard data input from the keyboard device according to Datasets 1 to 3. Cross-validation repeatedly splits the data and trains a model on each split. Accuracy denotes the proportion of correctly predicted samples (true positives and true negatives), and precision denotes the proportion of true positives among all predicted positives (true positives and false positives). Recall denotes the proportion of true positives among all actual positives (true positives and false negatives), and the F1-score is the harmonic mean of precision and recall. The Area Under the Curve (AUC) summarizes the Receiver Operating Characteristic (ROC) curve, with scores ranging from 0 (worst) to 1 (best).
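These measures can be computed with scikit-learn as sketched below, reusing the split from the previous sketch; the random forest model here is illustrative, not the paper's tuned configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

acc = accuracy_score(y_test, y_pred)    # (TP + TN) / all samples
prec = precision_score(y_test, y_pred)  # TP / (TP + FP)
rec = recall_score(y_test, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_test, y_score)    # area under the ROC curve, between 0 and 1
```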
To be more specific about the results for each dataset, gradient boosting achieved high performance on Dataset 1, while random forest achieved high performance on Datasets 2 and 3. Nevertheless, for most of the models, precision, recall, and F1-score could not be measured. In Dataset 1, these metrics could not be measured for the KNN, logistic regression, decision tree, SVM, and MLP models. In Dataset 2, they could not be measured for the KNN, logistic regression, linear SVC, decision tree, gradient boosting, SVM, and MLP models, and in Dataset 3 for the KNN, logistic regression, linear SVC, decision tree, gradient boosting, and SVM models. In other words, a dataset with only index and scancode as features cannot distinguish scancodes input from the keyboard from randomly generated scancodes, which means that an attacker is very unlikely to succeed in a password-stealing attack.
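One way these metrics become unmeasurable (our illustration, not a mechanism stated in the paper) is a model that predicts only one class: with no predicted positives, precision is 0/0 and undefined, as the toy example below shows.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # degenerate model: never predicts the positive class

# TP = 0 and FP = 0, so precision is 0/0; scikit-learn must be told
# how to report the division by zero.
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0 (no true positives)
```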
4.2. Experiment Results Based on Feature 2 (Elapsed Time and Scancode)
As in Section 4.1, we split the data into a training set, a validation set, and a test set. Table 5 shows the experiment results for the training set, validation set, test set, and cross-validation using Dataset 1.
As a result, random forest had the best training-set score at 1.0, and the remaining models had similar scores of approximately 0.97, all above 0.90. On the validation set, linear SVC had the worst score at 0.89, while the remaining models scored approximately 0.96. On the test set, linear SVC had the worst score at 0.90, while the remaining models scored approximately 0.95. In cross-validation, linear SVC had the worst score at 0.895, while the remaining models scored approximately 0.96.
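Per-model scores of this kind could be produced along the following lines; this is a minimal sketch reusing the split from Section 4.1, and the default hyperparameters are illustrative rather than the paper's settings.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

models = {
    "KNN": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVC": LinearSVC(),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=1000, random_state=42),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name,
          clf.score(X_train, y_train),              # training-set score
          clf.score(X_val, y_val),                  # validation-set score
          clf.score(X_test, y_test),                # test-set score
          cross_val_score(clf, X, y, cv=5).mean())  # cross-validation score
```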
When comparing the dataset with index and scancode described in Section 4.1 to the dataset with elapsed time described in this section, the training-set score of random forest increased from 0.93 to 1.0, and those of the remaining models increased from 0.90 to 0.91. In the validation-set score, linear SVC increased from 0.86 to 0.89, and the remaining models increased from 0.88 to 0.96. In the test-set score, linear SVC increased from 0.21 to 0.90, and the remaining models increased from 0.88 to 0.95. In the cross-validation score, the worst score rose from 0.746 (MLP) to 0.895 (linear SVC), and the remaining models increased from 0.86 to 0.96. Therefore, all performance evaluation scores for the datasets with elapsed time are higher than those for the datasets without it, indicating that including the elapsed time yields high performance. In other words, by utilizing a dataset with elapsed time, the real keyboard data input by the user can be classified more effectively.
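An elapsed-time feature of this kind could be derived as the time delta between consecutive keystroke records; the sketch below assumes a hypothetical capture file with "timestamp", "scancode", and "label" columns, which are our naming assumptions.

```python
import pandas as pd

df = pd.read_csv("keystrokes.csv")  # hypothetical keystroke capture

# Elapsed time as the delta between consecutive keystroke timestamps;
# the first record has no predecessor, so its delta is set to 0.
df["elapsed_time"] = df["timestamp"].diff().fillna(0)

X = df[["elapsed_time", "scancode"]]  # feature 2: elapsed time and scancode
y = df["label"]
```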
Figure 5 shows the detailed performance evaluation results, such as cross-validation, accuracy, precision, recall, F1-score, and AUC, for Datasets 1 to 3 with elapsed time.
Specifically, on Datasets 1 and 2, linear SVC and random forest showed low performance, while on Dataset 3, linear SVC, decision tree, and random forest showed low performance; the remaining models performed at a similar level. Above all, when the datasets with elapsed time are compared to those without it, every performance measure could be evaluated for every model on the datasets that include the elapsed time. Consequently, feature 2 is an appropriate choice for classifying benign from malicious keyboard data.
4.3. Experiment Results Based on Feature 3 (Elapsed Time, Scancode, and Flag)
Finally, we compared the proposed method with the attack technique that uses the C/D bit to steal keyboard data, although this attack technique causes system overload as well as abnormal behavior and access. We constructed a dataset containing the flag used by the C/D-bit attack technique and analyzed the experimental results. As before, we split the data into a training set, a validation set, and a test set, and Table 6 shows the experiment results for each set and the cross-validation using Dataset 1.
As a result, except for special cases, all scores of all machine-learning models were close to 1.0. This means that, using the flag (C/D bit), the actual keyboard data can be classified effectively and completely.
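Feature 3 simply extends the elapsed-time sketch above with the flag column; the "flag" column name is an assumption for illustration.

```python
# Feature 3: elapsed time, scancode, and flag (C/D bit); "flag" is a
# hypothetical column name, assumed to be recorded during capture.
X = df[["elapsed_time", "scancode", "flag"]]
y = df["label"]
```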
Figure 6 shows the detailed performance evaluation results, such as cross-validation, accuracy, precision, recall, F1-score, and AUC, for Datasets 1 to 3 with elapsed time and flag.
Specifically, except for logistic regression on Datasets 1 and 2 and the cross-validation score of random forest, all models achieved a perfect score of 1.0 in accuracy, precision, recall, F1-score, and AUC. In short, if the flag is used, the data input from the actual keyboard can be classified, which means that the user's password can be stolen easily.
4.4. Comparison of Performance Evaluation Results by Features
In this study, we evaluated the performance of datasets with index and scancode, datasets with elapsed time and scancode, and datasets with elapsed time, scancode, and flag. We demonstrated that the performance of the dataset with elapsed time and scancode and that of the dataset with elapsed time, scancode, and flag are higher than that of the dataset with index and scancode. Furthermore, collecting the flag causes system overload as well as abnormal behavior and access, whereas the proposed method effectively classifies the real keyboard data without this drawback. Consequently, the performance evaluation results differ depending on the features, and the benefit of each feature can be analyzed by comparing the corresponding evaluations.
Figure 7 shows the comparison of the performance evaluation results for each set and the cross-validation.
In the figure, the left side shows the result of the dataset with index and scancode, the middle shows the result of the dataset with elapsed time and scancode, and the right side shows the result of the dataset with elapsed time, scancode, and flag. The results show that performance tends to be higher toward the right, and the dataset with elapsed time and scancode scores significantly higher than the dataset with index and scancode. Moreover, the performance evaluation values approach 1, which means that collecting data and constructing features according to the proposed method leads to performance improvement.
Analyzing the changes across each set and the cross-validation revealed significant differences. All machine-learning models using the datasets without elapsed time changed significantly, with MLP, linear SVC, and random forest showing marked differences. Conversely, all machine-learning models using the datasets with elapsed time changed relatively little, although linear SVC and random forest still showed significant changes.
As shown in Figure 7, the performance evaluation results differed significantly depending on the features, and we verified that the dataset with elapsed time achieved high performance.
Figure 8 shows the comparison results for the more practical performance measures, such as accuracy, precision, recall, F1-score, and AUC.
In the figure, the left shows the result of the dataset with index and scancode, the middle shows the result of the dataset with elapsed time and scancode, and the right shows the result of the dataset with elapsed time, scancode, and flag. Performance tends to be higher toward the right, and the performance of the datasets with elapsed time and scancode is significantly better than that of the datasets with index and scancode. Moreover, performance increases from Dataset 1 to Dataset 3, and the accuracy of the dataset with elapsed time and scancode is close to 1. An accuracy close to 1 means that most passwords can be obtained by effectively differentiating between random scancodes and real scancodes.
In terms of the practical performance measures, most models using the datasets without elapsed time changed considerably; among them, gradient boosting, linear SVC, and random forest showed notable differences. Conversely, all models using the datasets with elapsed time changed relatively little, although linear SVC, decision tree, and random forest showed significant differences.
Finally, to analyze the rate of increase or decrease according to the features, the performance differences across all datasets were analyzed, as shown in Table 7.
Specifically, in Dataset 1, the best model in accuracy was linear SVC, which increased by 420.4%, while the worst was decision tree, which increased by 108.1%. The best model in precision was linear SVC, which increased by 781.2%, while the worst was gradient boosting, which increased by 135.2%. The best model in recall was gradient boosting, which increased by 3642.1%; on the other hand, in recall, the linear SVC model was evaluated poorly on the datasets with elapsed time, decreasing to 21.8%. The best model in F1-score was gradient boosting, which increased by 2175%, while the worst was linear SVC, which increased by 150.8%. The best model in AUC was KNN, which increased by 295.1%, while the worst was gradient boosting, which increased by 143.3%. Except for the linear SVC model in recall, all performance results increased in Dataset 1, and the highest increase rate was 3642.1%.
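These percentages appear to be the ratio of the score with elapsed time to the score without it, expressed as a percentage (this is our reading of Table 7, not an equation stated in the paper). A minimal sketch under that assumption:

```python
def change_rate(score_without, score_with):
    """Ratio of the score with elapsed time to the score without it, as a percentage."""
    return score_with / score_without * 100.0

# Illustrative values only: a score rising from 0.25 to 0.95 gives a rate of
# 380%, while one falling from 0.95 to 0.25 gives about 26.3%.
print(change_rate(0.25, 0.95))  # 380.0
print(change_rate(0.95, 0.25))  # ~26.3
```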
In conclusion, using the dataset with elapsed time, we can effectively classify random keyboard data with up to 96.2% accuracy. This means that an attacker can steal the real keyboard data input by the user in the real world. Consequently, the proposed attack technique discussed in this paper reveals a security threat and a new vulnerability that allow user authentication information to be stolen effectively in password authentication.