1. Introduction
Phishing, a deceptive method that combines social and technical engineering, poses a severe threat to online security, aiming to obtain illicit user identities, personal account details, and bank credentials [1]. It is a primary concern within criminal activity, with phishers pursuing objectives such as selling stolen identities, extracting cash, exploiting vulnerabilities, or deriving financial gains [2,3]. The nuanced landscape of phishing techniques, showcasing both symmetry and asymmetry, includes algorithms, domain spoofing, HTTPS phishing, SMS phishing, link manipulation, email phishing, and pop-ups. Attributes such as prefixes, suffixes, subdomains, IP addresses, URL lengths, the ‘@’ symbol, spear phishing, dual-slash attributes, ports, HTTPS tokens, request URLs, URL anchors, tag links, and domain age contribute to the multifaceted nature of phishing attacks [4]. Phishing perpetrators adeptly mimic legitimate websites, particularly those related to online banking and e-commerce. This creates a symmetrical illusion that induces users to unwittingly divulge sensitive information, leading to various fraudulent actions [5,6].
A phishing attacker’s role involves three specific duties: influencing target selection, sociological aspects, and technological infiltration [7]. As of March 2006, the Anti-Phishing Working Group reported 18,480 significant phishing assaults and 9666 distinct phishing domains, resulting in substantial financial repercussions for businesses and affecting billions of site visitors [8]. Microsoft estimates the potential cost of computerized offenses on the global network to be a staggering USD 500 billion, underscoring the symmetrical impact of cyber threats on the financial ecosystem [9]. A single data breach could incur an average cost of approximately USD 3.8 million for organizations in 2018, highlighting the symmetrical consequences of security lapses. Data from the Anti-Phishing Working Group (APWG) reveal a notable increase in attack networks, with 180,768 identified during the first quarter of 2019, up from 138,328 in the fourth quarter of 2018 and 151,014 in the third quarter of 2018 [10]. The visual symmetry between benign and deceptive websites challenges human perception, making it difficult to distinguish between them. When visitors access these mimicked sites, critical information is stolen through scripting, underscoring the symmetrical vulnerability in human–computer interaction. The exponential growth in e-commerce consumers contributes to the escalating frequency of phishing attacks, carried out through various means such as malware, online platforms, and emails, creating a symmetrical escalation in cyber threats [11].
Researchers propose varied solutions to enhance symmetry in phishing detection. Some use a blacklist to identify phishing sites [12]. However, this method fails to detect phishing websites that are not on the blacklist, such as zero-day attacks, introducing asymmetry. Heuristic-based detection analyzes website content and third-party service features, but potential service restrictions create asymmetry, and exploring online content and third-party features introduces temporal asymmetry due to its time-consuming nature [13]. Similarly, a hierarchical clustering method groups DOM vectors based on distance, limiting detection efficiency and suggesting a need for symmetrical analysis of URL features to enhance throughput [14].
URLs play a pivotal role in phishing attacks; they are transmitted to users through various channels such as emails and social media, presenting a facade of symmetry by appearing to be genuine URLs [15]. Among the available approaches for evaluating URLs, machine learning-based techniques emerge as symmetrical solutions. By training categorization algorithms on malicious URLs, these techniques effectively differentiate between phishing and benign URLs, introducing a symmetrical balance in the categorization process [16]. URL-based studies leverage the PhishTank database, a comprehensive collection of phishing URLs reported by various online security companies. While this database offers organized data categorization patterns, asymmetries arise when applying categorization algorithms or machine learning to URL data, necessitating additional symmetrical URL management techniques [17]. Standard techniques such as blacklisting, regular expressions, and signature matching, although employed to identify phishing attempts, exhibit asymmetry by falling short in detecting unfamiliar URLs [4]. The need to continuously update database signatures to detect unexpected patterns in malicious URLs underscores the value of symmetrical machine learning-based research, particularly with deep learning models, for robust identification of malicious URLs [18].
Machine learning and deep neural networks have been pivotal in various research endeavors, showcasing substantial performance improvements [19,20,21,22]. In the context of phishing detection, the authors of [19] proposed a multidimensional feature engineering approach, harnessing a deep learning model (CNN-LSTM) and machine learning algorithms. This method integrated predictions using the XGBoost (eXtreme Gradient Boosting) algorithm, offering a solution to extract features from diverse dimensions for swiftly effective attack detection. However, the reported results indicated a decline in the false positive rate to 59%, signaling a reduction in the level of attack prediction. Another study [20] introduced an end-to-end deep learning architecture grounded in natural language processing techniques to combat malicious URL phishing. The model aimed to classify benign and malicious URLs using character-level and word-level embeddings in CNN networks. However, the model exhibited a lack of generalization on test data, indicating a need for improved accuracy and malicious URL detection ability. Wang et al. [21] presented the PDRCNN approach, designed to enhance phishing-detection efficiency by eliminating reliance on feature crawling from third-party services. Based on an LSTM network, this approach selects optimal features from the URL, employs a CNN to distinguish the characters that influence phishing, and generates predictions with machine learning classifiers. While reporting efficient performance, the mechanism’s dependency on existing knowledge of phishing detection raises concerns about its susceptibility to errors in identifying the latest vulnerabilities.
In contrast to traditional machine learning methods that rely on explicitly hand-crafted features, deep learning approaches prove advantageous when faced with professional phishers exploiting the multilayer features of URLs. To address this, stacking, an ensemble learning methodology integrating various machine learning algorithms and deep learning models, employs a metamodel to amalgamate predictions, enhancing overall performance. Initially employed for malware identification on mobile devices, the stacking approach demonstrated improved accuracy and F-measure [23]. We extend this stacking mechanism by designing two distinct phases, leveraging the symmetrical integration of other methods to enhance detection impact.
This paper leverages a deep learning neural network, long short-term memory (LSTM), and introduces a novel stack generalization model named AntiPhishStack. The proposed model employs five optimizers across two phases to detect phishing URLs effectively. In the first phase, machine learning classifiers, coupled with k-fold cross-validation to mitigate overfitting, generate a mean prediction. The second phase utilizes a two-layered LSTM-based stack generalized model optimized for a premier prediction in phishing site detection. Merging the mean prediction from Phase I with the premier prediction from Phase II, a meta-classifier, specifically XGBoost, delivers the final prediction. This stacking model significantly enhances phishing-detection accuracy by learning URL and character-level TF-IDF features, showing symmetrical capabilities. The AntiPhishStack model intelligently identifies new phishing URLs previously unidentified as fraudulent. Experimental evaluations on two benchmark datasets [24,25] of benign and phishing sites demonstrate robust performance, assessed through various metrics, including the AUC-ROC curve, precision, recall, F1, mean absolute error (MAE), mean square error (MSE), and accuracy. Comparative analysis with baseline models and traditional machine learning algorithms, such as support vector machine, decision tree, naïve Bayes, logistic regression, k-nearest neighbor, and sequential minimal optimization, highlights the AntiPhishStack model’s superior phishing-detection efficiency. Notably, this model offers the following significant advantages in achieving symmetrical advancements in cybersecurity:
Prior feature knowledge independence: The approach taken in this work embraces the concept of symmetry by treating URL strings as character sequences, serving as natural features that require no prior feature knowledge for our proposed model to learn effectively.
Strong generalization ability: The URL character-based features are utilized for more robust generalization and check-side accuracy, and the multi-level or low-level features are combined in the hidden layers of the neural network to attain effective generalization.
Independence of cybersecurity experts and third-party services: Our proposed stack generalization model autonomously extracts necessary URL features, eliminating the reliance on cybersecurity experts. Additionally, the AntiPhishStack model, reliant on URLs and character-level TF-IDF features, demonstrates independence from third-party features such as page rank or domain age.
The significant contributions of this paper are:
Presentation of a two-phase stacked-based generalization model (AntiPhishStack) that breaks free from the necessity of prior feature knowledge for phishing site detection. The model achieves this by learning URL and character-level TF-IDF features.
In Phase I, features are trained on the base machine learning classifiers to generate the mean prediction. Meanwhile, Phase II employs two-layered stacked LSTM networks and five adaptive optimizers to obtain the premier prediction.
The final prediction is established by developing a meta-classifier (XGBoost) that classifies URLs into benign and phishing categories. Experimental results showcase the AntiPhishStack model’s noteworthy performance over baseline models, utilizing symmetrically structured Alexa and PhishTank datasets.
The structure of the rest of the article is as follows: Section 2 reviews background research on phishing detection; Section 3 introduces the proposed AntiPhishStack model; Section 4 describes the experiments; Section 5 presents the results and their evaluation; and Section 6 elaborates on the conclusion and future work.
3. AntiPhishStack Proposed Model
The primary purpose of this model is to determine the best output through evaluation by applying the stacking technique and a deep neural network to the processed dataset and to propose an optimized model based on that output. The AntiPhishStack model of stack generalization is illustrated in Figure 1.
Our model’s flow has a five-level approach. The key steps are as follows:
Collection of datasets and feature distribution into URL features and character-level features.
Division of the dataset into training and testing sets at a 70:30 ratio.
Construction of the first phase (Phase I) of the stack generalization model based on machine learning models and calculation of the mean prediction with the test dataset.
Construction of the second phase (Phase II) of the stack generalization model with the LSTM model based on adaptive optimizers and computation of the performance evaluation with the test set.
Merging of the predictions and evaluations from Phase I and Phase II for the ultimate prediction, symmetrically enhancing the recognition and determination of phishing web pages.
The notations and meanings used in this paper are described in Table 2.
3.1. Datasets
The URLs were collected from a variety of sources (Alexa and PhishTank) [24,25]. URLs that were duplicated or no longer live were deleted before the dataset was created. The typical URL prefixes, such as “http://”, “https://”, and “www.”, were removed, since inconsistent URL forms can easily impair the model’s quality during training if the prefixes are not trimmed. The database management system (pgAdmin) was utilized in conjunction with Python to import the preprocessed data, and the dataset was then divided into two parts: 70% for training and 30% for testing. The distribution of legitimate and phishing URLs was as follows:
Dataset 1 (DS1): Benign sites from the Alexa database and phishing sites from PhishTank [24].
Dataset 2 (DS2): Benign sites from Common Crawl, the Alexa database, and phishing sites from PhishTank [25].
The datasets were selected for their diverse and current mix of benign and phishing URLs, ensuring robust model training. DS1 and DS2 offer a balanced representation of typical internet environments and specialized sources, respectively. This variety enhances the model’s applicability and accuracy in real-world phishing detection. The feature dataset was likewise split 70:30, with 70% used for training the AntiPhishStack model and 30% reserved for robust testing on unseen data, in line with standard machine learning practice.
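As an illustration of this preprocessing and splitting step, the following Python sketch strips the common prefixes, removes duplicates, and performs the 70:30 split; the input file name, column labels, and use of scikit-learn are assumptions for illustration, not the authors’ exact pipeline.

```python
import re
import pandas as pd
from sklearn.model_selection import train_test_split

def strip_prefixes(url: str) -> str:
    """Remove the typical prefixes ("http://", "https://", "www.") from a URL."""
    url = re.sub(r'^https?://', '', url.strip().lower())
    return re.sub(r'^www\.', '', url)

# Hypothetical input file with columns "url" and "label" (0 = benign, 1 = phishing).
df = pd.read_csv('urls.csv')
df['url'] = df['url'].apply(strip_prefixes)
df = df.drop_duplicates(subset='url').reset_index(drop=True)

# 70:30 split for training and testing, stratified to preserve the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    df['url'], df['label'], test_size=0.30, stratify=df['label'], random_state=42
)
```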
3.2. Feature Distribution
The features, and the capacity to use them, must be examined before the feature selection stage [48]. There are four major features and a total of 30 sub-features. Each characteristic provides information on whether a website is phishing, legitimate, or suspect. This section outlines how these characteristics are highlighted.
3.2.1. URL Features
A Uniform Resource Locator (URL) provides the location of online resources such as pictures, files, hypertext, and videos. In general, attackers attempt to build phishing URLs that look like reputable websites to users. Attackers use URL jamming tactics to mislead users into disclosing personal information that can be exploited against them. This research aims to detect phishing websites quickly, utilizing lightweight characteristics, i.e., the weight factor URL token system, inspired by [49]. For example, the segmentation of a URL (Figure 2) yields the distinct tokens and their final weights; the weight of a distinct word is computed from the length of that word, the total number of steps available for tokens, the number of URLs from the web pages, and the total number of occurrences of the word in a given step at the corresponding level. Calculating this weight delivers the weight value of each URL, which is assigned to the neural network gates for phishing prediction. This was accomplished by extracting characteristics only from the URL rather than accessing the website’s content. Figure 2 shows an example of URL characteristics for the weight calculation.
The first component of the URL is the protocol (https, http, ftp, etc.), a set of rules that regulates how data are transported during transmission. The second component is the location of the host, given as an IP address or resource name. The hostname is separated into two parts: the principal (second-level) domain and the top-level domain (TLD). The hostname may be followed by an optional port number. The third component, the path, identifies the specific resource within the domain accessed by a user. An optional field, such as a query, follows the path. The protocol, hostname, and path together form the base URL. The combination of the second-level domain and top-level domain names, known as the host domain, makes the URL unique. As a result, cybersecurity firms work hard to identify by name the fraudulent websites used for phishing offenses. If a hostname is designated as phishing, its IP address can be banned to prevent access to the web pages hosted under it.
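A minimal sketch of this URL decomposition into protocol, hostname, port, path, and query, using Python’s standard urllib.parse (the tooling is an assumption and the example URL is fictitious):

```python
from urllib.parse import urlparse

url = 'https://login.example-bank.com:8080/account/verify?session=42'
parts = urlparse(url)

print(parts.scheme)    # protocol, e.g. 'https'
print(parts.hostname)  # host, e.g. 'login.example-bank.com'
print(parts.port)      # optional port, e.g. 8080
print(parts.path)      # path to the resource, e.g. '/account/verify'
print(parts.query)     # optional query, e.g. 'session=42'

# The host domain is the second-level domain plus the TLD (e.g. 'example-bank.com');
# here it is approximated from the last two labels of the hostname.
host_domain = '.'.join(parts.hostname.split('.')[-2:])
print(host_domain)
```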
According to the dataset, the URL feature has the following sub-features (an illustrative extraction sketch is given after the list):
IP Address: If an IP address is used instead of a domain name in the URL, the client can be virtually certain that someone is attempting to steal their credentials. In this dataset, 570 URLs containing an IP address were discovered, accounting for 22.8 percent of the dataset, and the rule is that a URL containing an IP address is termed phishing; otherwise, it is suggested to be legitimate.
Use of the ‘@’ symbol: Web browsers usually ignore the section preceding the ‘@’ sign because it is maintained separately from the real address. The dataset contains 90 URLs with the ‘@’ sign, just 3.6 percent of the total.
Use of the “//” symbol: In valid URLs, the “//” sign appears only after HTTP or HTTPS. If “//” appears again after the initial protocol declaration, the URL is considered a phishing URL, since the “//” sign is used to redirect to other websites.
Domain name prefixes and suffixes separated by the “-” sign: A URL with the “-” sign in its domain name is treated as a phishing URL; in general, verified URLs do not include the “-” sign.
Use of the “.” sign in the domain: Adding a sub-domain to the domain name requires a dot. A URL with more than one additional sub-domain is considered suspect, and anything beyond that indicates phishing.
HTTPS (secure sockets layer): The majority of legitimate sites use the HTTPS protocol. The age of the certificate is therefore quite important when utilizing HTTPS, and a trustworthy certificate is required.
Favicon: A favicon is a graphic image used mainly on websites; when loaded from an external domain, it might redirect clients to dubious sites.
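The following Python sketch illustrates how several of these lightweight sub-features could be computed directly from a URL string; the function name, thresholds, and exact rules are illustrative readings of the descriptions above, not the dataset’s precise extraction code.

```python
import re
from urllib.parse import urlparse

def url_sub_features(url: str) -> dict:
    """Illustrative lightweight sub-features derived only from the URL string."""
    parsed = urlparse(url if '://' in url else 'http://' + url)
    host = parsed.hostname or ''
    return {
        # IP address used instead of a domain name.
        'has_ip': bool(re.fullmatch(r'(\d{1,3}\.){3}\d{1,3}', host)),
        # '@' symbol: browsers ignore everything before it.
        'has_at': '@' in url,
        # '//' appearing again after the protocol declaration (redirection trick).
        'double_slash_redirect': url.rfind('//') > 7,
        # '-' sign in the domain name (prefix/suffix trick).
        'has_dash_in_domain': '-' in host,
        # Number of additional sub-domains, approximated by extra dots in the hostname.
        'subdomain_count': max(host.count('.') - 1, 0),
        # Whether the HTTPS protocol is used.
        'uses_https': parsed.scheme == 'https',
    }

print(url_sub_features('http://192.168.4.1//login'))
print(url_sub_features('https://secure-login.example.com/@verify'))
```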
3.2.2. Character-Level Features
Term Frequency-Inverse Document Frequency is abbreviated as TF-IDF. The TF-IDF score indicates a term’s relative significance in the document and throughout the whole corpus. The TF-IDF score is made up of two terms: the first computes the normalized Term Frequency (TF), and the second computes the Inverse Document Frequency (IDF), calculated as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears [25,50].
TF-IDF vectors may be produced at many levels of input tokens (words, characters, and n-grams):
Word-level TF-IDF: A matrix indicating the TF-IDF scores of each term in distinct texts.
Character-level TF-IDF: A matrix indicating the TF-IDF scores of character-level n-grams in the corpus.
N-gram-level TF-IDF: N-grams are the collection of N terms. This matrix indicates the TF-IDF scores of N-grams.
It should be mentioned that TF-IDF has been used in numerous studies to identify website phishing by examining URLs [25] and to obtain indirectly related connections, target websites, and the validity of suspicious websites [51]. TF-IDF retrieves prominent keywords from textual content, but it has certain limitations; in particular, the approach fails when extracting misspelled terms. Because a URL might contain nonsensical words, this work used a character-level TF-IDF method with a maximum feature count of 5000.
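A minimal sketch of such a character-level TF-IDF representation, assuming scikit-learn’s TfidfVectorizer; the n-gram range is an illustrative choice, while the 5000-feature cap follows the text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

urls = [
    'paypa1-secure-login.com/verify',
    'en.wikipedia.org/wiki/Phishing',
    'update-account.example.net/confirm',
]

# Character-level TF-IDF with n-grams of 1-3 characters, capped at 5000 features.
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5000)
X = vectorizer.fit_transform(urls)

print(X.shape)                      # (number of URLs, number of character n-grams)
print(len(vectorizer.vocabulary_))  # at most 5000 features
```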
Furthermore, we treated the URL strings as character sequences, following the idea from the literature [52]. This provides the advantage that the URL character sequences can be trained as natural features that require no prior feature knowledge to be learned by our proposed model. Our proposed AntiPhishStack model uses the stack generalization model to extract the local URL features from the URL character sequences. Finally, the URL is classified by a meta-classifier designed for the final prediction.
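To make this concrete, the sketch below encodes raw URL strings as fixed-length integer character sequences of the kind an LSTM can consume; the vocabulary construction and padding length are illustrative assumptions rather than the exact encoding used in this work.

```python
import numpy as np

def encode_urls(urls, max_len=100):
    """Map each URL to a fixed-length sequence of character indices (0 = padding)."""
    charset = sorted({c for u in urls for c in u})
    char_to_idx = {c: i + 1 for i, c in enumerate(charset)}  # reserve 0 for padding
    encoded = np.zeros((len(urls), max_len), dtype=np.int64)
    for row, url in enumerate(urls):
        for col, ch in enumerate(url[:max_len]):
            encoded[row, col] = char_to_idx[ch]
    return encoded, char_to_idx

urls = ['paypa1-login.com/verify', 'example.org/index.html']
X, vocab = encode_urls(urls)
print(X.shape)    # (2, 100)
print(X[0, :10])  # first ten character indices of the first URL
```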
3.3. Stack Generalization Model
The stack generalization model is divided into two phases, as illustrated in the flow model (Figure 1).
3.3.1. Phase I
Based on the abovementioned characteristics, existing machine learning models were utilized directly to distinguish phishing and legitimate web pages. This paper proposes a stacking model (illustrated in Figure 3) for this purpose by merging various machine learning models, including support vector machine (SVM), naïve Bayes (NB), decision tree (DT), logistic regression (LR), K-nearest neighbors (KNN), sequential minimal optimization (SMO), and XGBoost.
The training set was split into k copies, with k − 1 copies utilized for training and one copy used for testing. The training process was not terminated until each base model had predicted all samples. The suggested system employs k-fold cross-validation to avoid overfitting on this training set, so that each fold of the training part can be predicted out-of-fold. The suggested model uses a value of three to ten for k-fold cross-validation and then delivers output using the test set. Following the acquisition of the temporary predictions (TP), the mean prediction is obtained and strengthened by validation on the test dataset. The procedure is comprehensive, with the fold approach required for estimating all figures on all folds utilized.
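A sketch of this out-of-fold procedure, assuming scikit-learn: the base models mirror those listed above (SMO has no direct scikit-learn counterpart; SVC, which uses SMO internally, stands in for it), and the mean prediction is taken here as the average of their probability estimates. This is an illustrative reading of the text, not the authors’ exact code.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def phase_one_mean_prediction(X_train, y_train, X_test, k=10):
    """Average the k-fold out-of-fold (train) and test-set probabilities of the base models.

    X_train / X_test are numeric feature matrices (e.g., the character-level TF-IDF vectors).
    """
    base_models = [
        SVC(probability=True),
        GaussianNB(),
        DecisionTreeClassifier(),
        LogisticRegression(max_iter=1000),
        KNeighborsClassifier(),
    ]
    oof, test_preds = [], []
    for model in base_models:
        # Out-of-fold probabilities on the training set avoid overfitting at the meta level.
        oof.append(cross_val_predict(model, X_train, y_train, cv=k,
                                     method='predict_proba')[:, 1])
        test_preds.append(model.fit(X_train, y_train).predict_proba(X_test)[:, 1])
    return np.mean(oof, axis=0), np.mean(test_preds, axis=0)
```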
3.3.2. Phase II
The training segment was fed to a two-layer LSTM neural network architecture once the features from the training dataset were loaded. Because sequential phishing webpage data contain dependencies on immediately preceding entries, LSTM is well suited to model phishing detection in this investigation. It is explicitly designed to avoid the long-term dependency problem by storing feature information in its memory cell. Information can be removed from or added to these cell states, regulated by structures called gates. These gates and their corresponding operations/functions are presented in [53], while Phase II of the integrated stack generalized model is illustrated in Figure 4.
In the first gate (the forget gate), the information from the current input $x_t$ and the previous hidden state $h_{t-1}$ is passed through the sigmoid activation function; an output value closer to 0 means forget, and closer to 1 means retain. The second gate, the input gate, decides what relevant feature (phishing or benign) can be added from the current step. The third gate, the control gate, decides which values will be updated (either 0 or 1), for which a $\tanh$ layer creates a vector of candidate values $\tilde{c}_t$. The last gate, the output gate, determines the value of the next hidden state [54].
At time $t$, the LSTM cell’s components are updated as follows:

$f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$

$i_t = \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$

$\tilde{c}_t = \tanh\!\left(W_c \cdot [h_{t-1}, x_t] + b_c\right)$

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

$o_t = \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$

$h_t = o_t \odot \tanh(c_t)$

where $\sigma$ is the sigmoid function; $h_t$ represents the hidden state at instant $t$; $b_f$, $b_i$, $b_c$, and $b_o$ represent the bias of each gate; $f_t$, $i_t$, $o_t$, and $c_t$ are the forget gate, input gate, output gate, and unit status, respectively; and $W_f$, $W_i$, $W_o$, and $W_c$ are the corresponding weight matrices for the connections. The three gates of the LSTM cell govern the flow of information and hence define the cell’s state. The gradient vanishing problem may be efficiently handled with LSTM [55].
The suggested model, in this instance, comprises two LSTM layers. The first LSTM layer outputs a sequence that serves as the input to the LSTM layer above it. As explained previously, the internal design of both LSTM layers is the same. The LSTM cell was also chosen over the GRU cell because the network with the LSTM cell outperformed the network with the GRU cell. This study constructs an LSTM network with a hidden vector of 128 elements. After the first LSTM layer, a dropout layer is added; dropout reduces overfitting and enhances the model’s generalization [56]. The LSTM’s last layer generates a vector $h_i$, which is supplied as the input to a fully connected multilayer network. Each layer has an activation function. Because the dataset is binary, a nonlinear activation function was used to solve the binary classification issue: the rectified linear unit (ReLU) function was employed for the hidden layers of neurons, while the sigmoid function was used for the output layer of neurons.
After the training process, the parameters were adjusted to correct wrong predictions and make the predictions as accurate as possible through optimization. We used an optimizer and designed the model for the most accurate prediction possible with the given parameters (or weights). The amount by which the weights are updated during training is called the learning rate, a configurable hyperparameter for training deep neural networks, typically a small value in the 0.0–1.0 range. However, the learning rate varies due to overfitting [53]; as a result, the model can predict accurately on the given dataset yet remain unsuitable for new or real-world data. We used regularization to overcome such overfitting errors by fitting the functions appropriately on the training sets, which helped to attain optimal solutions. These optimizers modify the neural network’s attributes, i.e., the weights and learning rates, to improve the accuracy.
Thus, we utilized the following five adaptive optimizers to generalize the LSTM networks, reduce the overall loss, and improve the accuracy. The rationale for selecting each optimizer is given below:
AdaDelta: This optimizer adapts the learning rate per dimension instead of per parameter. It addresses the continual decay of learning rates during training and the reliance on a manually selected learning rate.
Adam: This utilizes estimates of the first and second moments to adapt the learning rate for the neural network. It uses the momentum concept, adding a portion of the previous gradients to the current one. It is a faster optimizer and requires fewer parameters for tuning.
RMSprop: The root mean square propagation optimizer avoids oscillations in the vertical direction and can increase the learning rate with feasible steps in the horizontal direction.
AdaGrad: This deals explicitly with individual features, using different learning rates for different weights of sparse datasets to achieve a high learning rate. It avoids the manual tuning of the learning rate for individual features.
SGD (Stochastic Gradient Descent): Plain gradient descent has a drawback for large datasets. SGD, a variant of gradient descent, is generalized to make neural networks learn faster on large-scale datasets.
These optimizers were implemented using the packages and function calls of the PyTorch framework; for instance, we utilized torch.optim.<Optimizer>, where <Optimizer> indicates the name of the optimizer, e.g., Adam or SGD. The model was then compiled with each of these adaptive optimizers and trained with several epochs and early stopping strategies to avoid overfitting. The output was obtained by assessing the model on the test set. The stack generalization technique was applied to the dataset once this strategy was in place.
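The sketch below illustrates such an optimizer sweep with early stopping. It is written with Keras-style optimizer identifiers for consistency with the experimental setup in Section 4 (the same five optimizers also exist in torch.optim); the model builder, the training data names, the epoch count, and the batch size are illustrative assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

def build_lstm(vocab_size=128, max_len=100):
    """Rebuild the two-layer LSTM sketched above so each optimizer starts from fresh weights."""
    return Sequential([
        Embedding(vocab_size, 32, input_length=max_len),
        LSTM(128, return_sequences=True),
        Dropout(0.2),
        LSTM(128),
        Dense(64, activation='relu'),
        Dense(1, activation='sigmoid'),
    ])

# The five adaptive optimizers compared in Phase II (Keras spellings).
OPTIMIZERS = ['adadelta', 'adam', 'rmsprop', 'adagrad', 'sgd']
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

histories = {}
for name in OPTIMIZERS:
    model = build_lstm()
    model.compile(optimizer=name, loss='binary_crossentropy', metrics=['accuracy'])
    # X_train_seq / y_train: encoded URL sequences and labels from the earlier sketches.
    histories[name] = model.fit(X_train_seq, y_train, validation_split=0.1,
                                epochs=30, batch_size=128,
                                callbacks=[early_stop], verbose=0)
```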
3.4. Final Prediction
Two outputs were generated using the aforementioned multilayer stacked methods, and a model was chosen based on the value of the initial predictions. The mean prediction was considered and combined with the anticipated outcomes from the premier prediction. Finally, the mean and premier predictions of the stacking models were combined into the final prediction using a meta-estimator classifier.
The meta-estimator involves constructing a robust classifier by applying the boosting method. Boosting combines multiple weak yet precise classifiers to create a powerful and resilient classifier for identifying phishing crimes. Additionally, boosting aids in integrating multiple features, resulting in improved classification performance. One notable boosting classifier is the XGBoost classifier, which transforms weak learners into potent contributors. It is well suited for our proposed stack generalization model for identifying phishing sites, introducing a sense of symmetry to the classification process. Implemented on integrated feature sets of URLs and character-level features, it acts as a robust classifier within our proposed AntiPhishStack model for phishing identification, emphasizing the importance of symmetry in enhancing detection capabilities.
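A sketch of this final stacking step, assuming the xgboost Python package: the Phase I mean prediction and the Phase II premier prediction are stacked column-wise and fed to an XGBoost meta-classifier. The feature layout passed to the meta level and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from xgboost import XGBClassifier

def final_prediction(mean_pred_train, premier_pred_train, y_train,
                     mean_pred_test, premier_pred_test):
    """Combine Phase I (mean) and Phase II (premier) predictions with an XGBoost meta-classifier."""
    meta_train = np.column_stack([mean_pred_train, premier_pred_train])
    meta_test = np.column_stack([mean_pred_test, premier_pred_test])

    meta_clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    meta_clf.fit(meta_train, y_train)
    return meta_clf.predict(meta_test)  # 0 = benign, 1 = phishing
```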
Suppose there are $n$ URLs in a set $D = \{(x_i, y_i)\}$, where $x_i$ represents the set of selected features corresponding to the $i$-th URL, while $y_i$ is a class label, e.g., $y_i = 1$ if the URL is considered a malicious or phishing website. The final outcome of the XGBoost model was computed using the following objective [57]:

$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$

where $\hat{y}_i^{(t)}$ is the model’s prediction at step $t$, $l$ represents the training loss function, and $x_i$ represents the input features used in the XGBoost model. The regularization term $\Omega$ is defined as

$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2},$

where $T$ is the number of leaf nodes in the base learner $f$, $\gamma$ is the complexity cost of each leaf, $\lambda$ represents the regularization parameter, controlling the strength of regularization in the XGBoost model, and $w_j$ is the output value at each final leaf node.

At step $t$, considering the base learners from the previous steps ($f_1, \ldots, f_{t-1}$) as fixed, the loss function can be expanded using Taylor’s series [57,58]:

$Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t),$

where $g_i$ and $h_i$ are the first and second derivatives of the loss function $l$ with respect to $\hat{y}_i^{(t-1)}$, computed as

$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big).$

This formulation defines the model’s optimization process at each step, incorporating both the loss function and the regularization term to balance model complexity and fit. The integrated features are then categorized into phishing and benign classes, based on the weights assigned by the meta-estimator for the final prediction. Furthermore, XGBoost offers several advantages, including (i) the ability to handle missing values within the training set, (ii) the ability to work with extensive data that do not fit into memory, and (iii) the use of multiple CPU cores for faster computation.
Deep learning involves large datasets and significant time for model training. The efficiency of these models depends on the system resource specifications and the complexity of the datasets. In identifying phishing assaults, time complexity is a crucial factor [43]. The proposed method’s computational cost is based on how the characteristics are generated and extracted. The URL and character-level features extracted by our proposed method require logarithmic time, $O(\log n)$. The cost of extracting such features during model training depends on the number of samples $n$ and the number of feature dimensions $d$, and the overall time complexity of the proposed work scales accordingly.
4. Experiments
This paper utilized Python 2.7 to develop the suggested model and TensorFlow GPU v1.8.0 as the machine learning framework. The operating system was Windows 10 Pro Education. The Python packages and libraries used in this project to detect phishing URLs included Keras (built in with TensorFlow), SciPy, Pandas, NumPy, Matplotlib, and Seaborn.
Support Vector Machine (SVM), Decision Tree (DT), Gaussian Naive Bayes (GNB), Logistic Regression (LR), Sequential Minimal Optimization (SMO), and K-Nearest Neighbor (KNN) algorithms were evaluated for stacking in this work. In Phase I, these base classifiers were used for stacked generalization with 10-fold cross-validation; in Phase II, LSTM was employed; and the XGBoost classifier was utilized as the meta-estimator for the final prediction.
For the model’s effectiveness, the following statistical metrics were used to assess the proposed work for different purposes [59].
Precision–Recall Curve: A graph utilized to show the trade-off between the true positive rate and the true negative rate (or vice versa) when assessing the predictive model [59]. The positive and negative precision and recall are defined as:

Positive precision: $TP/(TP + FP)$

Negative precision: $TN/(TN + FN)$

Positive recall: $TP/(TP + FN)$

Negative recall: $TN/(TN + FP)$
where TP (true positive) is the number of URLs correctly classified as phishing; TN (true negative) is the number of URLs correctly determined to be benign; FP (false positive) is the number of benign URLs wrongly classified as phishing; and FN (false negative) is the number of phishing URLs wrongly classified as benign.
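For instance, these quantities can be computed from a confusion matrix as in the following sketch (scikit-learn and dummy labels are assumed for illustration):

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = phishing, 0 = benign (dummy labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp))            # positive precision
print(tn / (tn + fn))            # negative precision
print(tp / (tp + fn))            # positive recall (true positive rate)
print(tn / (tn + fp))            # negative recall (true negative rate)
print(f1_score(y_true, y_pred))  # F1 for the phishing class
```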
Mean Absolute Error (MAE): The average of all absolute errors [60].
Mean Square Error (MSE): The average of all squared errors [60].
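For completeness, the standard definitions of these two error measures over $N$ predictions $\hat{y}_i$ with true labels $y_i$ are:

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2}.$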