This section presents the experimental setup and performance comparison results for web-based attack detection.
4.1. Experimental Setup
We used a dataset of 39,443 malicious and 40,000 benign JS codes in the experimental setup. The malicious JS codes were obtained from Hynek Petrak's dataset [51], and the benign codes were obtained from the Majestic Million Service [7]. We used a JS code parser to preprocess the dataset into AST-JS nodes, resulting in 32,430 malicious and 38,891 benign JS codes. This preprocessing yielded two groups of 25 and 23 AST-JS node features, i.e., 48 features in total, which formed the first input set.
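As an illustration of this preprocessing step, the following is a minimal sketch that counts AST node types per JS sample; it assumes the esprima Python port, since this section does not name the parser actually used:

```python
# A minimal sketch of the AST-JS preprocessing step, assuming the esprima
# Python port (pip install esprima); the paper's actual parser is not
# named in this section.
from collections import Counter

import esprima


def ast_node_counts(js_code: str) -> Counter:
    """Count occurrences of each AST node type in a JS code sample."""
    counts = Counter()

    def walk(obj):
        # Child nodes may appear as dict values or inside lists.
        if isinstance(obj, dict):
            if "type" in obj:
                counts[obj["type"]] += 1
            for value in obj.values():
                walk(value)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)

    walk(esprima.parseScript(js_code).toDict())
    return counts


# Each JS sample becomes a vector of node-type counts (the AST-JS features).
print(ast_node_counts("eval(unescape('%61%6c%65%72%74%28%31%29'))"))
```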
Next, we visualized and investigated each AST-JS feature's contribution to the malicious JS code detection model's performance using SHAP values. This analysis indicated the need to define more concrete features. Using sample JS attack codes, we added ten more AST-JS features based on combinations of the original ones. These features represent JS-based attacks and contribute to the model's detection performance. We then used association rule mining with a confidence metric to measure how often AST-JS nodes appear together in benign and malicious JS codes. This resulted in 33 features whose confidence values associated them with malicious JS codes, with benign codes, or with both benign and malicious codes. This selection reduced our original feature set and formed the second input set. Finally, we selected a final set of AST-JS features using the SHAP values. This selection resulted in forty-four features, comprising thirty-four AST-JS nodes and ten node combinations.
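The confidence metric can be sketched as follows; this is illustrative only, assuming the mlxtend library for the rule mining, since the exact mining setup and thresholds are not given here:

```python
# Illustrative sketch of the confidence metric over AST-JS node
# co-occurrence, assuming mlxtend (the paper's exact mining setup and
# thresholds are not shown in this section).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot rows: which AST-JS node types appear in each JS sample, with
# the class label included as an extra item.
transactions = pd.DataFrame(
    [
        {"CallExpression": 1, "ForStatement": 0, "malicious": 1},
        {"CallExpression": 1, "ForStatement": 1, "malicious": 0},
        {"CallExpression": 0, "ForStatement": 1, "malicious": 0},
    ],
    dtype=bool,
)

itemsets = apriori(transactions, min_support=0.1, use_colnames=True)
# confidence(A -> B) = support(A and B) / support(A): how often node set
# A appears together with B across the dataset.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "confidence"]])
```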
Performance evaluation was conducted using the three input feature sets, i.e., the 48 AST-JS features obtained after the initial preprocessing, the 33 features obtained using association rule mining, and the 44 features obtained by feature selection using the SHAP values. The hyperparameters obtained by tuning the two best-performing models, XGBClassifier and LGBMClassifier, on the three AST-JS feature sets are listed in Table 1.
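A sketch of such a search follows; it assumes scikit-learn's GridSearchCV, since the tuning tool is not named above, and the grid and synthetic data are illustrative, not the settings behind Table 1:

```python
# Hedged sketch of the hyperparameter search, assuming GridSearchCV;
# grid values and data are illustrative only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 48)).astype(float)  # 48 AST-JS features
y = rng.integers(0, 2, size=200)                      # 1 = malicious

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}

search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="f1",  # balances precision and recall
    cv=10,         # matches the paper's 10-fold cross-validation
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```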
We found that these parameters are optimal for malicious JS code content detection using the three input feature sets. We performed 10-fold cross-validation to evaluate the performance of our approach and computed the precision, recall, and F1-score evaluation metrics. Precision is the classifier's ability to not label a benign JS code as malicious, and recall is the classifier's ability to find all malicious JS codes. The F1-score can be interpreted as the weighted harmonic mean of precision and recall. Given that $TP$ is the number of malicious JS codes correctly classified as malicious, $TN$ is the number of benign JS codes correctly classified as benign, $FN$ is the number of malicious JS codes classified as benign, and $FP$ is the number of benign JS codes classified as malicious, precision, recall, and F1-score are given by:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
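The evaluation protocol can be sketched with scikit-learn's cross_validate on synthetic stand-in data (the actual AST-JS matrices are described above):

```python
# Sketch of the 10-fold evaluation with the three metrics defined above;
# the data here is a synthetic stand-in for the AST-JS feature sets.
import numpy as np
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 48)).astype(float)  # AST-JS node counts
y = rng.integers(0, 2, size=500)                      # 1 = malicious

scores = cross_validate(
    XGBClassifier(eval_metric="logloss"),
    X, y, cv=10,
    scoring=("precision", "recall", "f1"),
)
for metric in ("precision", "recall", "f1"):
    print(metric, scores[f"test_{metric}"].mean())
```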
4.2. Performance Comparisons
The results of these experiments are presented in this section.
Table 2 presents the precision, recall, and F1-score values obtained using features defined as AST-JS nodes and 10-fold cross-validation. Each model performed well on the recall metric, especially the tree ensemble methods. The best-performing model, XGBoost, detected malicious JS code content with error rates of 0.0019 for recall, 0.0302 for precision, and 0.0162 for F1-score. The lower precision revealed that the model misclassified some benign JS codes as malicious. Even though this scenario is not harmful, it may lead to threat-alert fatigue if users and security analysts receive many false alarms. Therefore, the model needed further improvement.
We analyzed the AST-JS node features using the SHAP values.
Figure 2 presents a SHAP summary plot that shows the relative impact of each AST-JS feature on the model's output over the JS code dataset.
The SHAP values are plotted on the x-axis for each AST-JS feature, with rows sorted by the sum of SHAP value magnitudes. The vertically piled points represent the feature density, and the colors show the feature values: red and blue represent high and low AST-JS feature values, respectively. The plot thus gives the distribution of each AST-JS feature's impact on the model's output. The coloring allows us to visualize how a change in the value of an AST-JS feature would change the prediction; for example, high SHAP values for a given feature indicate a high risk of maliciousness for a JS code.
Using such plots, we can deduce that some features influence the model's prediction far more than others. A feature with a significant, positive effect on the AST-JS class prediction indicates, through a high SHAP value, a higher risk of maliciousness; conversely, for a feature with a significant, negative effect, a low SHAP value may indicate a higher risk of maliciousness.
Additionally, interesting patterns can be observed: for some features, high values cluster in a very dense region (a red blob), while for others, low values cluster in a very dense region (a blue blob). Other features have a much more uniform distribution, with high and low SHAP values, respectively, pushing the prediction toward 1; red and blue blobs on both the left and right indicate an even distribution of that feature in the JS code dataset. Some features are not crucial for most JS codes, yet they significantly impact a subset of JS codes in the dataset. This scenario highlights how a globally important feature is not necessarily the most critical feature for attack detection in every JS code.
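The following sketch reproduces this kind of summary plot with the shap library on a small synthetic stand-in; the model, data, and AST-JS feature names are illustrative, not those behind Figure 2:

```python
# Sketch of the SHAP summary plot described above; feature names and
# data are illustrative stand-ins for the real AST-JS matrices.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.integers(0, 5, size=(500, 4)),
    columns=["CallExpression", "MemberExpression",
             "ForStatement", "TryStatement"],  # illustrative node names
)
y = rng.integers(0, 2, size=500)  # 1 = malicious

model = XGBClassifier(eval_metric="logloss").fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Rows sorted by sum of |SHAP| magnitudes; color encodes feature value
# (red = high, blue = low), as in Figure 2.
shap.summary_plot(shap_values, X)
```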
Figure 3 shows the SHAP dependence plot for the top AST-JS feature before (a) and after (b) feature selection.
Every dot represents a JS code. Vertical dispersion at a given AST-JS feature value results from interaction effects in the model, and the color highlights the high or low forces behind those interactions. The y-axis represents the SHAP values; the SHAP summary plot is obtained by projecting the SHAP dependence plot points onto the y-axis and recoloring them by feature value. The coloring features were automatically selected based on a potential interaction in the model. Plot (b) shows that low SHAP values of the top feature influence the model's output more significantly for observations where the interacting feature has high SHAP values.
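Such a plot can be produced as follows, reusing the objects from the summary-plot sketch; the feature name is illustrative, since the top feature appears only in Figure 3:

```python
# Sketch of a SHAP dependence plot; the feature name is illustrative.
shap.dependence_plot("CallExpression", shap_values, X)
# interaction_index="auto" (the default) picks the coloring feature with
# the strongest potential interaction, as described above.
```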
Figure 4 shows the SHAP interaction values plot for the top two AST-JS features before (a) and after (b) feature selection.
It shows the main effects and interaction effects for the top feature; these effects capture all of the vertical dispersion. Plot (a) shows that high SHAP values for both top features significantly influence the model's output, whereas plot (b) shows that low SHAP values for one feature combined with high SHAP values for the other significantly influence the model's output.
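SHAP interaction values are available for tree ensembles through TreeExplainer; the sketch below reuses `explainer` and `X` from the summary-plot sketch, with an illustrative feature pair:

```python
# Sketch of SHAP interaction values (supported by TreeExplainer for
# tree ensembles); the feature pair is illustrative.
interaction_values = explainer.shap_interaction_values(X)

# Main effects lie on the diagonal of the interaction matrix; plotting a
# feature pair shows their off-diagonal interaction effect.
shap.dependence_plot(
    ("CallExpression", "MemberExpression"), interaction_values, X
)
```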
Table 3 presents precision, recall, and F1-scores obtained using features defined by association rule mining and 10-fold cross-validation.
By including these features, precision was improved by 0.0127, 0.0129, 0.0131, 0.0125, 0.0188, and 0.2352 for the XGBoost, LightGBM, RandomForest, DecisionTree, LogisticRegression, and GaussianNB models, respectively. This improvement indicates a reduction in the number of misclassified benign JS codes. However, the recall metric decreased, indicating an increase in misclassified malicious JS codes. To improve the detection performance further, we therefore leveraged the features derived from the malicious JS code samples and used SHAP to select features based on their contributions to the model's output.
Table 4 presents precision, recall, and F1-scores obtained using features selected based on SHAP values and 10-fold cross-validation.
Compared to the association rule mining features, each model's detection performance improved in all three evaluation metrics, with notable improvements in recall and F1-score. The best-performing model, XGBoost, had error rates of 0.0011, 0.0168, and 0.0091 for recall, precision, and F1-score, respectively. It outperformed the LightGBM and DecisionTree models by 0.03% and the RandomForest model by 0.04% in the recall metric. A high recall rate translates to low misclassification and false-negative rates. AST-JS features selected using the SHAP values capture both global and local feature importance; therefore, these features enhance the machine learning model's feature learning process and lead to a better prediction model. As evidenced by the experiments, all of the malicious JS code detection models achieved high performance, with the tree ensemble methods yielding the best results and, consequently, the lowest false-positive and false-negative rates.
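One plausible reading of this selection step, ranking features by mean absolute SHAP value and keeping the top k, is sketched below; it is an assumed reconstruction, reusing `shap_values` and `X` from the summary-plot sketch:

```python
# Hedged sketch of SHAP-based feature selection: rank AST-JS features by
# mean |SHAP| over the dataset and keep the top k (k = 44 in the paper).
import numpy as np

mean_abs_shap = np.abs(shap_values).mean(axis=0)  # global importance
order = np.argsort(mean_abs_shap)[::-1]           # most important first

k = 44  # on the toy 4-column X this simply keeps all columns
selected = X.columns[order[:k]]
X_selected = X[selected]  # third input feature set, used for Table 4
```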
Figure 5 shows the feature importance assigned to the AST-JS features by other feature selection methods: Boruta, ELI5, RandomForest, and SelectKBest [47,52].
Each method assigned different importance values to each AST-JS feature, and therefore, each feature selection method selected a different number of features. Boruta, a wrapper around RandomForest, finds all relevant features carrying information that can be used to predict malicious JS codes; AST-JS features shown by a statistical test to be less relevant are rejected iteratively. ELI5 is used for explaining predictions and is also referred to as permutation importance or Mean Decrease Accuracy; it measures how the model's score decreases when an AST-JS feature's values are randomly shuffled. The RandomForest feature importance method measures each AST-JS feature's importance using an information gain function. SelectKBest ranks AST-JS features and keeps the k highest-scoring ones, measuring the dependency between each feature and the target class using the mutual information score function.
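The four baselines can be sketched as follows, assuming the boruta package (BorutaPy) and scikit-learn, with scikit-learn's permutation_importance standing in for ELI5's equivalent routine; `X` and `y` are reused from the summary-plot sketch:

```python
# Hedged sketches of the four baseline selectors; BorutaPy and
# scikit-learn are assumptions for the tooling behind Figure 5.
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                            random_state=0).fit(X, y)

# RandomForest impurity-based importance (information gain).
print("RandomForest:", dict(zip(X.columns, rf.feature_importances_)))

# Permutation importance (Mean Decrease Accuracy): score drop when a
# feature's values are shuffled.
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("Permutation:", dict(zip(X.columns, perm.importances_mean)))

# Boruta: iterative statistical test against shadow features (it refits
# its own forest, so give it a fresh estimator and numpy arrays).
boruta = BorutaPy(RandomForestClassifier(n_jobs=-1),
                  n_estimators="auto", random_state=0)
boruta.fit(X.values, y)
print("Boruta accepted:", X.columns[boruta.support_].tolist())

# SelectKBest: mutual information between each feature and the label.
kbest = SelectKBest(mutual_info_classif, k=2).fit(X, y)  # k=2 for toy X
print("SelectKBest:", X.columns[kbest.get_support()].tolist())
```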
Table 5 presents the precision, recall, and F1-scores obtained using SHAP-selected features compared to the other feature selection methods: Boruta, ELI5, RandomForest, and SelectKBest. Our proposed detection model performed better than the other feature selection methods in all three evaluation metrics, with notable differences in precision and F1-score. The other feature selection methods have limitations because different model iterations assign different importance values to the AST-JS features. Additionally, Boruta assigns only a binary accept-or-reject decision to each AST-JS feature, as shown in Figure 5a. The permutation-based methods are computationally expensive and can have problems with highly correlated AST-JS features, resulting in the loss of important information. SHAP-selected AST-JS features have consistency in the values assigned to each feature. Unlike Boruta, SHAP values show the degree and manner of each AST-JS feature's contribution to the model prediction and are model-agnostic. These features also provide interaction graphs that are instrumental in obtaining information on AST-JS feature combinations, further boosting the model's detection performance.
Table 6 shows the training and detection times for the various classifiers on the JS code dataset using SHAP selected features. XGBoost yielded the highest detection performance and the third-fastest detection time, making it the best classifier for JS-based attack detection. KNeighbors yielded the lowest training time; however, it achieved the lowest recall rate, rendering it ineffective for this detection task. DecisionTree yielded the lowest detection time; however, XGBoost outperformed DecisionTree in all three evaluation metrics.
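Timings of this kind can be measured with simple wall-clock timers around fit and predict; this is an assumption, as the timing harness is not described, and `X` and `y` are reused from the earlier sketches:

```python
# Hedged sketch of how the Table 6 timings can be measured.
import time

from xgboost import XGBClassifier

clf = XGBClassifier(eval_metric="logloss")

start = time.perf_counter()
clf.fit(X, y)                        # training time
train_time = time.perf_counter() - start

start = time.perf_counter()
clf.predict(X)                       # detection time
detect_time = time.perf_counter() - start

print(f"training: {train_time:.4f}s, detection: {detect_time:.4f}s")
```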