1. Introduction
Cybersecurity is one of the most important and sensitive topics in the world, as various types of hacking usually have dire consequences. Today, research is being conducted in this area. This includes developing data protection methods or models [
1], detecting GDPR violations [
2,
3], developing or analyzing cyberattacks [
4,
5], security, detecting phishing emails [
6], malware detection [
7], and many other similar applications.
The human factor plays a crucial role in security assurance [
8]. For an ordinary user, the biggest challenge is to protect their data, devices, and accounts. This is because we live in a time where important information can be leaked and used against us. To balance data and functionality access and security, trusted user authentication is necessary. Password authentication is the most widely used authentication method [
9]. It is easy to use and implement in various systems and is, moreover, inexpensive. Authentication methods can include biometrics [
10,
11,
12,
13] (fingerprints, face recognition, eye iris recognition, or blinking speed), smart cards, and others, but password authentication remains the most popular. However, password authentication is also one of the most insecure forms of authentication. This is because users often choose easy-to-remember passwords that are easy to decipher [
14].
Strong passwords are one of the fundamental ways that password authentication can be considered secure. Imperishable passwords are essential, especially given the increasing number of cyberattacks and data leaks [
15]. Password strength is significantly dependent on the user, but password-strength meters are crucial tools that help users create more secure passwords. Passwords are evaluated using password-strength meters based on various factors, for example, password length, complexity, or comparison to common password lists.
Nowadays, there are many different password-strength meters that have been developed. Password-strength-estimation models focus on English passwords by including different dictionaries (popular names, places), common password lists, leaked password lists, or other specific dictionaries. Most of them do not consider the passwords from other languages that could be used in the systems. Usually, statistical-based approaches are used to estimate the strength of a password, without taking into account the language in which the password could be written. Meanwhile, many different investigations can be found to confirm the fact that user passwords depend on the human factor [
16,
17] of the user. They are closely related to the language or environment used by the user. The extension of existing tools by providing language-specific data (the language dictionary, common names, passwords) is not enough, as some languages are more complex. For example, the Lithuanian language in comparison to the English language is more difficult morphologically, as the words in Lithuanian have more different forms. Furthermore, the Lithuanian language uses diacritics, which are considered special symbols when estimating the strength of the password. However, in the context of the Lithuanian language, it is not a special symbol. All these facts must be taken into account when evaluating the strength of the password.
In this study, the main objective is to adjust the password-strength-estimation model to accommodate the requirements of the Lithuanian language. Existing password-strength-estimation tools analyze the usage of English words in the password. Therefore, if a Lithuanian word is used in the password, the password-strength tools provide a positively too high security level. Thus, it is important to incorporate the Lithuanian language to estimate the strength of Lithuanian user passwords. Due to the complexity of the Lithuanian language, traditional deterministic approaches are not suitable. Therefore, the novelty of this research is based on several aspects: a machine learning model is proposed to evaluate the strength of passwords of Lithuanian users, and a password-strength dataset is developed, based on separated subsets of passwords in different languages (English and Lithuanian). For dataset development, a heuristic method has been proposed to evaluate the strength of passwords written in the Lithuanian language.
The proposed approach is suitable for English and Lithuanian passwords, but the logic can be easily modified and used for other less popular languages.
The structure of the manuscript is as follows.
Section 2 reviews related works. In
Section 3, the proposed approach scheme is presented, and all steps are described. In addition, the selected dataset is described. In
Section 4, we describe the experimental investigation and validation of the proposed approach to estimate password strength. Five classification algorithms have been used to train the machine learning model and estimate password strength. In
Section 5, the discussion and possible limitations of the proposed approach are presented.
Section 6 concludes the paper.
2. Related Works
Today, password-strength estimation is usually performed by two different techniques: based on statistical rules and machine learning models. When it comes to estimating the cracking time of a password, brute-force attacks are usually not an option [
18]. As a result, many different password-strength meters can be found and used publicly. Taking into account today’s given threat environment, password strength can be defined as a measure of the effectiveness of a password against brute-force attacks. Most password-strength meters use mathematical-based approaches, such as the password’s length, number of special characters, digits, lowercase and uppercase characters, and dictionary matching.
The research performed by Golla et al. [
19] created a list of characteristics that password-strength estimation must possess to be considered accurate. In addition, the authors compared the accuracy of several password-strength meters using a weighted Spearman correlation. The Spearman rank correlation coefficient is used to discover the strength of a link between two sets of data. Both online and standalone solutions were compared to estimate password strength. The comparative analysis of password-strength meters showed the differences between them by extracting their type, method, and evaluation. The results obtained by various password-strength meters based on Markov models, neural networks, and fuzzyPSM allowed us to obtain the highest accuracy.
The zxcvbn password-strength-estimation model [
20] calculates the strength of a password using a combination of pattern matching and entropy estimation. It provides feedback to users on how to improve their password’s strength. The zxcvbn model evaluates passwords based on a variety of criteria, such as length, character diversity, and common patterns or phrases. To indicate the level of security provided, the tool assigns a score to each password ranging from 0 to 4. The zxcvbn password-strength meter performed better than the Eleven- and LPSE-based password-strength meters, described in the research by Golla et al. The performance of Comprehensive8 based on the Spearman correlation results was the worst in the research. Therefore, the zxcvbn tool is generally used as a framework to estimate the strength of a password.
However, the systematic literature and password-strength-tool analysis has shown that widely known password-strength-estimation models are usually used to assess password strength in English-only environments. There are not many studies that adapt password-strength meters to other languages. In the research by Daucek et al. [
21], the strength of the Czech and Slovak language password was analyzed. The researchers proposed an approach and reported on the results of adapting the password-strength meter zxcvbn to the Czech and Slovak languages.
Daucek et al. modified the zxcvbn password-strength meter by adding several types of additional dictionaries that fit Czech- and Slovak-language password detection. The authors used a large set of leaked passwords from the Czech environment (approximately 3.1 million), divided into 12 categories, to test the results obtained. The direct results for the Czech and Slovak languages and the method described in the research can be used as a methodology to adapt the zxcvbn password-strength meter to other less widespread European languages.
In the research by Hong et al. [
22], Korean-language passwords were analyzed. According to their research, existing password-strength meters or proposed models for password-strength estimation are based on the English alphabet. The authors created a Korean-language-based password dictionary and proposed a password-strength-evaluation model based on this for Korean users. As in the research of Daucek et al., Hong et al. conducted experiments to evaluate the security of Korean-language-based passwords using a database of passwords that have been leaked. As a result, the proposed model showed 99.38% accuracy for leaked passwords based on the Korean language. This is superior to the 80.06% accuracy shown by the existing model.
Sarkar et al. [
23] proposed to predict the strength of new passwords using machine learning models. Therefore, several machine learning algorithms were chosen and experimentally investigated, such as logistic regression, gradient-boosted trees, random forest, multilayer perceptron, and Naïve Bayes. Approximately 80,000 passwords were collected [
24] and used in the experimental investigation, which was divided into training and testing datasets, respectively, 80/20%. All passwords of the datasets were assigned to three classes: 0–weak, 1–medium, and 2–strong. The research results showed that the highest accuracy was obtained using the gradient-boosted tree algorithm (99%). A slightly smaller accuracy (95%) was obtained by multilayer perceptron, and the smallest accuracy was obtained by Naïve Bayes (only 75% was achieved). Kim et al., in their research [
25], used the same dataset as Sarkar et al., but only the largest number of passwords was included. The authors proposed a password-strength-estimation model based on deep learning algorithms and multiclass classification, which solves the existing problem that leaked frequency is not considered during the evaluation. To evaluate the performance of the proposed model, an experiment was conducted that compared the password leaked frequency stored in a database using a password list. As a result, the proposed model correctly evaluated 99% of the 345 leaked passwords.
The analysis performed of related works has shown that there is no research focused on Lithuanian-language password analysis, neither using mathematical approaches nor models based on machine learning. The most suitable solution for Lithuanian password-strength estimation is to use existing password meters, such as zxcvbn, and modify them for the language by adding dictionaries. However, in the case of the Lithuanian language, the language is more specific, and common words have many different forms, so it is not enough to use traditional dictionaries such as in English password detection. For this reason, the new approach based on machine learning, similarity measures, and the modification of the password-strength meter zxcvbn has been proposed and is described in the next section.
3. The Proposed Approach and Data Preparation
3.1. Password-Strength Dataset
As the focus of the research is to estimate the strength of a password specifically in a Lithuanian context, two datasets were constructed and later combined into one used to train models. One is a password dataset taken from international-context passwords and the other from Lithuanian-context passwords. This is because Lithuanian users can use both English and Lithuanian words in their passwords. To build the Lithuanian-context password dataset, the part of passwords that belongs to the most used passwords list in Lithuanian, prepared by Mantas Sasnauskas [
26], has been taken. The primary list of the most common Lithuanian passwords consists of approximately 500,000 passwords, but in this experimental research, only a part of the list has been used, approximately 110,000. The main reason is that the full Lithuanian password list presented by Mantas Sasnauskas [
26] contains English passwords, and also, there are many passwords that have been formed by random characters, so it is not possible to understand the context of passwords; it is obvious as the passwords are not word-based. Regarding this problem, the password list has been manually reviewed, eliminating the English-language-word-based passwords and other noise found in the dataset. It was important to keep only unequivocally Lithuanian-context passwords. In the case of Lithuanian-context passwords, the dataset is not labeled; it is a plain list of passwords, with no password-strength labels assigned to it. The dataset of international-context passwords used in this experimental investigation is taken from Kaggle [
24]. The dataset has been chosen because it is well known and widely used in various related works. The full dataset contains a total of almost 700,000 passwords. It is important to maintain the balance between the Lithuanian- and international-password dataset subsets, so the same number of passwords (approximately 110,000) has been randomly chosen. All passwords in the international dataset have initially been assigned to one of three classes (0–weak, 1–medium, 2–strong). Later, in the next steps of the research, both datasets were combined into one, labeled, and used in experimental investigation.
As we can see in
Figure 1, only the passwords from the two datasets (international and Lithuanian) are labeled using a password-strength meter (zxcvbn) to measure the strength of the password. In
Section 3.2, the description of the labeling process is presented in more detail. For international data, it was decided that zxcvbn is suitable for measuring password strength, as it takes into account many different English dictionaries (
Table 1) and is effective according to related works.
For the Lithuanian dataset, a modified version of the zxcvbn model is applied. The modifications applied to zxcvbn aim to take into account the morphological complexities that make up Lithuanian words (the four similarity measures have been used to find the similarity between the input password and one of the dictionaries). Once the passwords are labeled, the password features are extracted. These features include the maximum similarity of each dictionary used in the tool and four different similarity functions to compare. In addition, it includes simple password features, such as password length, number of lowercase and uppercase characters, digits, and special symbols in a password. Finally, the password dataset is combined into one.
The combined dataset is used for machine learning methods to predict password strength. There is no automated method to detect whether the password has been written in the Lithuanian language, so a machine-learning-based approach is needed (the accuracy of the password language estimation was executed, but the accuracy was very limited and not suitable for practical application). The principle of the proposed approach is presented in the following steps (
Figure 2). For each similarity measure, the data is cleaned, and different machine learning models are trained and tested using the cross-validation method. Machine learning models are evaluated by calculating accuracy, precision, recall, and the F1 score. Furthermore, the performance of each machine learning model has been tested against international and Lithuanian subsets. These subsets were not used in the training process to evaluate the effectiveness of the model in simulating a real environment. The international subset consists of 9999 passwords, and the Lithuanian subset, 16,269.
3.2. Labeling of Dataset Items
First, the passwords for the international password dataset are newly labeled using the zxcvbn password-strength meter. The main reason is that the original dataset source provides passwords that are assigned just to one of three classes (weak, medium, and strong), and it does not provide a clear view of password strength. Therefore, in our experimental investigation, we added a new label (0–too guessable, 1–very guessable, 2–somewhat guessable, 3–safely unguessable, 4–very unguessable) by using the zxcvbn password-strength meter. The zxcvbn password meter uses five classes and becomes the main source for the initial estimation of the strength of the password. Therefore, the usage of five classes is more in compliance with the existing password-strength systems based on the usage of zxcvbn. The modification made to zxcvbn can be seen in
Figure 3, where the tool has an additional text-matching and one additional fragment-scoring element. For the international dataset, the similarity-scoring component is turned off. Text similarity to defined dictionaries is estimated for further dataset feature construction, but the measures do not affect the password-strength score. For Lithuanian-word-based passwords, similarity metric-based scoring is activated.
One of the key components of the zxcvbn tool is dictionaries. A total of six default dictionaries are used, and three additional dictionaries are added, reflecting Lithuanian-language-specific information (see
Table 1). Additionally, in the modified version of the tool, the similarity of the password to each dictionary item is measured. The similarity is calculated by four different similarity functions: Fuzz, Levenshtein Jaro, Levenshtein Jaro Winkler, and Levenshtein Ratio. Only the most similar word from each dictionary was selected and used for further password-strength estimation.
The tool has a structure consisting of text matching, scoring, selecting the weakest combination, setting the cracking time, and a password-strength class. The original version of zxcvbn calculates the password strength based on the place of the string of characters located inside a dictionary: the closer the string is to the beginning of the list, the weaker the password. For the Lithuanian password dataset, the similarity of a string in a dictionary is used as the first criterion, followed by the rank in the dictionary, if necessary. This is relevant for Lithuanian words, as there are so many morphological cases of a word. It would be resource-intensive and close to impossible to have a dictionary of Lithuanian words that includes every possible case of each word that is also ranked as the most common. Only the most similar dictionary item is used for scoring. To get the number of guesses needed for the estimated password, the length of the similar fragment
L must be estimated. It is achieved by taking the shortest of the compared words, password, or dictionary item:
where
L is the length of the text fragment, which is determined by the shortest length of the evaluated password and the most similar word
;
password is the password in question; and
is the maximum similarity of the dictionary item to the password.
For scoring, the
and appropriate scorings are determined for each dictionary. The next phase of zxcvbn is the estimation of the guesses needed to recover the password, on which the password score is based. To estimate how many guesses
K will be needed to guess the password according to the rules and specifics of the Lithuanian language, an approximate formula was derived (2). It takes into account how many words in different forms exist in the language and multiplies it by the ratio between the length of the text fragment and the complexity of guessing the fragment by the already known matching part:
where
K is the needed guesses, and
R is the approximate number of the language words in different forms. Taking into account the variety of words and their forms in the Lithuanian language, the initial value of the
R is assigned to 2.5 million,
L is the length of the coincident word, and
is the similarity of the most similar word in the dictionary. Value 10 was selected taking into account the average number of possible letters to continue the word sequence.
The calculated value of
K is used as one of the alternatives to calculate the total number of attempts required to guess the password and, accordingly, the guess time. The password-guessing-time model is the same as in the original zxcvbn model and is performed on the basis of the needed guesses for certain fragments of the passwords. Finally, based on the calculated password-guessing time, the password-security class is evaluated. The distribution of the combined dataset is presented in
Figure 4. As we can see, the dataset is not balanced because only a small number of users use very weak passwords. Most of the passwords in the dataset are assigned to Class 2, and the class distribution is similar to a normal distribution, reflecting the real-world situation. A deep analysis of the balanced dataset has not been performed. Class 1 consists of just 405 passwords, so using the same number of passwords to make the dataset balanced leads to the problem that the size of the dataset (2025) will not be enough to train the machine learning model. Artificially generated passwords were not included to balance the dataset either to maintain real-environment conditions.
4. Password-Strength-Estimation Model
4.1. The Selection of Features for Dataset Items
As mentioned above, there are multiple password features that influence and describe the strength of the password. The description of the characteristics of the analyzed dataset is presented in
Table 2. The first five password features can be easily calculated and represented as integer values. The other features are the maximum similarity measure of each dictionary (the dictionaries are presented in
Table 1) used by the modified zxcvbn password-strength meter.
The features were selected taking into account the metrics used by zxcvbn, to prevent the use of the same ones and express them in a more general, ensured form of greater availability. Therefore, the maximum similarities to the items of the analyzed dictionary were selected as metrics. These metrics can help identify weak passwords susceptible to dictionary and brute-force attacks but do not reflect the full set of metrics used by the zxcvbn model.
It was also decided to select four different similarity functions: Fuzz, Levenshtein Jaro, Levenshtein Jaro Winkler, and Levenshtein Ratio. This was to compare whether one or the others can provide better results in terms of machine learning algorithm effectiveness for password-strength estimation. Therefore, the four datasets were prepared and used in the experimental investigation , where is one of the four similarity measures used, , and is the number of dataset items analyzed.
The descriptive statistics for each password feature are presented in
Figure 5. As we can see, the use of symbols and uppercase characters is rare and is usually not used in passwords at all. Based on the sources of the password dataset, this could show that the password policy implemented on the sites did not have the requirement to include symbols or uppercase characters. The length of the passwords vary, but the most typical password length in the dataset is nine characters. It is interesting that there are passwords that include less than eight characters, which means that the minimum length of a password might also not have been included in the policy.
4.2. The Results of the Experimental Investigation
Data classification tasks are widely used in various fields, especially when machine learning models must be created. There exist many types of classification algorithms [
29,
30], such as neural networks, decision-tree-based, probabilistic algorithms, deep learning algorithms [
31], or transformers [
32]. The performed related works analysis showed that traditional machine learning algorithms are still frequently used because of the simplicity in the model training or retraining processes and because of their speed when the new dataset item is fed to the model when the model is used in a real environment. Our experimental investigation examined five widely used machine learning algorithms: Naïve Bayes, linear, decision tree, kNN, and SVM. The effectiveness of these methods has already been proven in terms of speed.
All selected classification methods have been trained using the combined dataset constructed from English and Lithuanian passwords. It is necessary to mention that in total, four datasets have been analyzed because some features differ according to the one similarity measure that has been used (Fuzz, Levenshtein Jaro, Levenshtein Jaro Winkler, and Levenshtein Ratio). In the training process, cross-validation has been used, where the k-fold number is equal to 10, and the stratified option has been selected. The accuracy, precision, recall, and F1 scores of each model have been calculated.
Figure 6 presents the results of each model’s performance in terms of accuracy.
As we can see, the lowest accuracy was obtained using the Naïve Bayes algorithm. Better results are obtained when the Levenshtein Ratio similarity measure is chosen (61%). The linear and SVM algorithms perform similarly and reach 70% accuracy. There is not much influence on the results of the different similarity measures used. The results of the kNN (about 75%) and decision tree algorithms (about 77%) are higher than those of other methods. The best accuracy performance is obtained by the decision tree algorithm, where accuracy reaches 78%. The experimental investigation has shown that the Levenshtein Ratio similarity measure allows us to obtain slightly better accuracy for all algorithms analyzed compared to the other similarity measures used.
Figure 7 presents a deeper analysis of the results of the decision tree algorithm. We can see the results of the precision, recall, and F1 scores of the decision tree algorithm using the Levenshtein Ratio similarity measures. It is obvious that the results of Class 0 (too guessable) are the worst in terms of all the measurements of the machine learning models. This class had a small number of dataset items (
Figure 4) because it is not common for users to create and use such simple passwords.
The confusion matrix of the decision tree model using the Levenshtein Ratio similarity measure is presented in
Figure 8. As we can see, the majority of Class 0 (too guessable) data items have not been predicted correctly, and usually dataset items have been considered as Class 1 (very guessable). Class 4 (very unguessable) has been predicted the best, where the model has made the smallest number of incorrect predictions. The prediction of Classes 2 and 3 have also been correctly predicted in a high ratio.
As was mentioned, to evaluate the machine learning model’s effectiveness in simulating a real environment, two separate subsets of international (9999 passwords) and Lithuanian (16,269) passwords have been hidden from the model’s training and fed to the highest-accuracy-obtained model (decision tree).
Figure 9 presents the results of the model’s accuracy. As we can see, the accuracy of the model using testing subsets differs slightly from that of the trained model. In the case of the international-password subset, the accuracy becomes 2–3% lower, and in the case of the Lithuanian-password subset, it slightly increases (1–2%). The results of the model testing using new passwords show that the model is properly trained because the difference is not that high from the accuracy obtained in the model creation process.
The results presented in
Figure 10 point to the same problem that shows that the prediction of Class 0 (very guessable) passwords is very bad, especially when the international-password subset is used. The results of the Lithuanian-password subset show that international passwords are more effective in predicting Class 4 (very unguessable) passwords than Lithuanian passwords. Based on the Lithuanian-password subset, the most accurate prediction results are obtained using Class 1 (very guessable) passwords.
5. Discussion
The performed experimental investigation of the proposed approach for password-strength estimation has shown that it is suitable to evaluate Lithuanian-context passwords. The proposed modification of the zxcvbn passwords’ strength meter can be used for other less popular languages as well, combining it with machine learning models.
A comparison of the results obtained with other password-strength-estimation models or tools is not possible, as none of the existing models is adapted to Lithuanian users who use the Lithuanian language in their passwords. Consequently, the accuracy of English-only password solutions is higher because of the mentioned complexity of the Lithuanian language and the model’s adaptation to it. This research also differs from other comparable works in that it applies the developed model to previously unexplored data for the model in addition to using a larger dataset. Taking all of the above into account, we can state that the achieved 78% password-strength-estimation accuracy for 5-class password strength is acceptable as a high result. It can be supported even by the fact that the Lithuanian-language text analysis usually obtains very similar results (the recent results in a Lithuanian-text sentiment analysis achieved up to 79% accuracy [
33]).
It should be noted that the experimental investigation has not taken into account all possible classification algorithms and hyperparameter optimizations in the training process. In further studies, this could be improved by implementing other algorithms, for example, deep learning, transformers. The modification of the zxcvbn password-strength meter could also be improved by adding more dictionaries that could be important for analyzing language.
The primary experimental investigation using the balanced dataset but with a small number of dataset items has shown that the data-balance problems do not influence the results much, but a deeper analysis should also be performed in the future as well. All limitations of the experimental investigation do not detract from the research and its results. The results are promising and could be useful for other same-type research, especially for non-English password-strength estimation.
6. Conclusions
The systematic literature analysis revealed the lack of solutions for estimating password strength in a non-English language. This part can be explained by the usage of English-only, most frequent password- or brute-force-based password-cracking solutions. However, hacking tools are improving, and the usage of weak passwords based on non-English-language word usage should not be ignored. This insight is supported by recently published research papers, where password-strength models are proposed for Czech, Korean, and other languages.
The existing password datasets are not suitable for the Lithuanian language; therefore, a Lithuanian-word password dataset was constructed. The estimated password-strength class for the dataset indicates similarity to real-world situations where normal password-strength distributions can be inspected. The integration of main password-strength-estimating features and maximum password-similarity scores for nine selected dictionaries based on four different text-similarity metrics ensures the dataset’s application extensibility.
The experiments executed with five machine learning methods for the prediction of password strength reveal that the text-similarity metric of text does not have a significant effect on the accuracy of the model. As a result, the classification algorithms show different accuracy scores, which vary by more than 20% when comparing the worst and best models. For the designed password-strength dataset, the decision tree algorithm with Levenshtein Ratio for password similarity to the provided dictionaries demonstrated the highest accuracy of 78%. The results are not comparable to those of other research works. However, the score is very similar to the results that can be achieved for the Lithuanian language in other machine learning tasks.
The password-strength-class estimation model obtained for Lithuanian users was tested with an additional password dataset. This dataset was not used for model training and testing in previous experiments. The results obtained indicate very similar results and show 76% accuracy for international passwords and 79% accuracy for Lithuanian-word-based passwords. This is not a statistically significant difference and shows the stability of the model, and the size of the training data is sufficient to ensure it.