This section describes the experimental setup, including the two malware datasets used, and then analyzes and discusses the results obtained with the proposed framework. The framework, implemented in Python 3.11.12, classifies malware in executable files within web applications running on unsecured systems, and the analysis assesses how effectively it detects malware.
4.3. Results Analysis
This section discusses the results of the proposed approach for detecting malware.
Table 3 presents the performance of the machine learning models used in the MAL-XSEL system on ClaMP (Dataset 1) in terms of four metrics: accuracy, precision, recall, and F1-score. The stacking ensemble model achieves the highest performance, with an accuracy of 0.9962, which supports combining multiple base learners for malware detection. Among the individual models, LightGBM (0.9952 accuracy), CatBoost (0.9942 accuracy), and random forest (RF) (0.9933 accuracy) perform best, showing the strength of tree-based ensemble techniques in malware classification. Decision tree (DT) (0.9789 accuracy) and AdaBoost (0.9683 accuracy) demonstrate moderate performance and are outperformed by the more advanced boosting models, LightGBM and CatBoost. K-nearest neighbors (KNN) (0.928 accuracy) and linear discriminant analysis (LDA) (0.7639 accuracy) achieve the lowest accuracy, suggesting that these traditional models struggle with complex malware patterns. Overall, the stacking-based approach delivers the highest classification accuracy on this dataset, about 0.001 above the best single model (LightGBM), which supports MAL-XSEL as an effective and interpretable solution for web-based malware detection.
Table 4 presents the performance of the MAL-XSEL system on MalwareDataSet (Dataset 2), using accuracy, precision, recall, and F1-score as performance metrics. The stacking ensemble model again achieves the highest accuracy (0.9916). Among individual models, decision tree (DT) (0.9889 accuracy), LightGBM (0.9881 accuracy), and CatBoost (0.9866 accuracy) exhibit the strongest classification performance. K-nearest neighbors (KNN) (0.9844 accuracy) is competitive: it remains slightly below these three models but outperforms random forest (RF) (0.9758 accuracy) and AdaBoost (0.9694 accuracy), which perform moderately well. Linear discriminant analysis (LDA) (0.8548 accuracy) records the lowest accuracy, suggesting its limitations in handling complex malware classification tasks. Overall, these results confirm that the stacking-based ensemble model delivers the most accurate and balanced performance.
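The stacking idea behind these results can be illustrated with a minimal sketch that uses only the Python standard library. It is not the MAL-XSEL implementation: the decision-stump base learners, the logistic-regression meta-learner, the synthetic data, and every function name below are assumptions chosen for illustration. The sketch shows the general mechanism, in which base learners produce out-of-fold predictions and a meta-learner is trained on those predictions.

```python
import math
import random


def make_data(n, rng):
    """Synthetic two-class data: three noisy features that shift upward for class 1."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 1.5 if label else -1.5
        X.append([rng.gauss(center, 1.0) for _ in range(3)])
        y.append(label)
    return X, y


def fit_stump(X, y, feature):
    """Base learner: predict class 1 when one feature reaches a learned threshold."""
    best_t, best_hits = None, -1
    for t in sorted(row[feature] for row in X):
        hits = sum(int(row[feature] >= t) == label for row, label in zip(X, y))
        if hits > best_hits:
            best_t, best_hits = t, hits
    return lambda row: int(row[feature] >= best_t)


def fit_logistic(Z, y, epochs=500, lr=0.5):
    """Meta-learner: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, label in zip(Z, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
            for j, zj in enumerate(z):
                grad_w[j] += (p - label) * zj
            grad_b += p - label
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(Z)
    return lambda z: int(sum(wi * zi for wi, zi in zip(w, z)) + b >= 0.0)


def stacking_fit(X, y, k=5):
    """Train the meta-learner on out-of-fold base predictions, then refit the base learners."""
    n_features = len(X[0])
    Z = [None] * len(X)
    for fold in range(k):
        train_idx = [i for i in range(len(X)) if i % k != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        stumps = [fit_stump(X_tr, y_tr, j) for j in range(n_features)]
        for i in range(fold, len(X), k):  # held-out rows of this fold
            Z[i] = [stump(X[i]) for stump in stumps]
    meta = fit_logistic(Z, y)
    base = [fit_stump(X, y, j) for j in range(n_features)]
    return lambda row: meta([stump(row) for stump in base])


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(100, rng)
model = stacking_fit(X_train, y_train)
accuracy = sum(model(row) == label for row, label in zip(X_test, y_test)) / len(y_test)
print(f"stacked accuracy on held-out synthetic data: {accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for rows the base learners have already seen, is the usual way to keep the second stage from simply trusting overfitted base models.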
Table 5 presents the class-wise performance of the proposed MAL-XSEL system on Dataset 1, evaluating each model’s precision, recall, and F1-score for Class 0 (benign files) and Class 1 (malware files). The stacking ensemble model achieves the highest performance across both classes, with a precision of 0.998 (Class 0) and 0.9944 (Class 1), recall of 0.9941 (Class 0) and 0.9981 (Class 1), and F1-scores of 0.996 (Class 0) and 0.9963 (Class 1). These results highlight the stacking model’s superior ability to correctly classify both benign and malware files with minimal misclassification. Among individual classifiers, LightGBM, CatBoost, and random forest (RF) perform exceptionally well, with precision, recall, and F1-scores above 0.99 for both classes, indicating strong generalization and robustness. Decision tree (DT) and AdaBoost follow closely behind, achieving slightly lower but still competitive scores, particularly in recall and F1-score. On the other hand, k-nearest neighbors (KNN) and linear discriminant analysis (LDA) exhibit lower performance. LDA has the weakest performance, particularly in recall for Class 1 (0.6922), indicating that it struggles to correctly detect malware samples. KNN performs better but is still outperformed by tree-based models, reinforcing the effectiveness of ensemble learning for malware classification.
Generally, these results confirm that the stacking-based approach significantly enhances malware detection accuracy for both classes, reducing the risk of false positives (incorrectly classifying benign files as malware) and false negatives (failing to detect actual malware). This further solidifies MAL-XSEL as a highly reliable and interpretable solution for web malware detection.
Table 6 provides a class-wise evaluation of the MAL-XSEL system on Dataset 2, analyzing the precision, recall, and F1-score of each model for Class 0 (benign files) and Class 1 (malware files). The stacking ensemble model demonstrates the highest overall performance, achieving a precision of 0.9943 (Class 0) and 0.9852 (Class 1), recall of 0.99358 (Class 0) and 0.9868 (Class 1), and F1-scores of 0.9939 (Class 0) and 0.9860 (Class 1). These results highlight the stacking model’s capability to distinguish accurately between benign and malware samples. Among individual classifiers, LightGBM, CatBoost, and random forest (RF) deliver strong results, with precision, recall, and F1-scores exceeding 0.98 for both classes, demonstrating good generalization across different malware variations. Decision tree (DT) and AdaBoost also perform well, though AdaBoost exhibits a lower recall for Class 1 (0.9306), meaning that it misses some malware samples. In contrast, k-nearest neighbors (KNN) and linear discriminant analysis (LDA) show comparatively weaker performance. LDA records the lowest scores, especially for Class 1 (precision: 0.7407, recall: 0.79874, F1-score: 0.7686), indicating that it struggles to classify malware accurately. While KNN outperforms LDA, it remains less effective than ensemble-based models, reaffirming the advantage of stacking and boosting techniques for malware detection. Overall, these findings confirm that the stacking-based ensemble model achieves the best detection performance for both benign and malicious files, reducing misclassification risks. The results further support MAL-XSEL as an interpretable cybersecurity solution for protecting web applications against evolving malware threats.
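Because the F1-score is the harmonic mean of precision and recall, the class-wise values quoted for Dataset 2 can be cross-checked by hand. The worked check below uses only numbers stated in the text; the symbols P and R for precision and recall are shorthand introduced here.

```latex
\begin{align*}
F_1 &= \frac{2PR}{P + R} \\[4pt]
\text{Stacking, Class 0:}\quad
F_1 &= \frac{2\,(0.9943)(0.99358)}{0.9943 + 0.99358}
     = \frac{1.97583}{1.98788} \approx 0.9939 \\[4pt]
\text{LDA, Class 1:}\quad
F_1 &= \frac{2\,(0.7407)(0.79874)}{0.7407 + 0.79874}
     = \frac{1.18325}{1.53944} \approx 0.7686
\end{align*}
```

Both values agree with the reported F1-scores to four decimal places.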
Likewise, Figure 4 and Figure 5 show the confusion matrices for Dataset 1 and Dataset 2, respectively.
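The relationship between a confusion matrix and the class-wise metrics reported in this section can be sketched as follows. The counts in `example` are hypothetical and are not read from Figure 4 or Figure 5, and the function name `class_report` is an illustrative choice.

```python
def class_report(matrix):
    """Accuracy plus per-class precision, recall, and F1 from a confusion matrix.

    matrix[i][j] counts samples whose true class is i and predicted class is j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[c][c] for c in range(n_classes))
    report = {"accuracy": correct / total}
    for c in range(n_classes):
        predicted_c = sum(matrix[i][c] for i in range(n_classes))  # column sum
        actual_c = sum(matrix[c])                                   # row sum
        precision = matrix[c][c] / predicted_c if predicted_c else 0.0
        recall = matrix[c][c] / actual_c if actual_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Hypothetical counts, for illustration only.
# Row 0 = benign, row 1 = malware; column 0 = predicted benign, column 1 = predicted malware.
example = [[90, 10],   # 10 benign files flagged as malware (false positives)
           [5, 95]]    # 5 malware files missed (false negatives)
report = class_report(example)
for key, value in report.items():
    print(key, value)
```

In this layout, false positives lower the precision of Class 1 and the recall of Class 0, while false negatives lower the recall of Class 1 and the precision of Class 0.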
Figure 6 compares the results on Dataset 1 and Dataset 2 for the proposed approach and the other machine learning models. Stacking outperforms all other models on both datasets, achieving the highest accuracy (0.9962 on Dataset 1 and 0.9915 on Dataset 2), along with the best precision, recall, and F1-scores. This confirms that the stacking model is the most effective approach compared to LDA, KNN, DT, RF, AdaBoost, LightGBM, and CatBoost.
Interpretability enhances confidence in AI models by clarifying predictions and exposing model weaknesses. This work aims for a transparent analysis of the classification decisions by assessing the contribution of individual features with two XAI techniques, SHAP and LIME, as shown in Figure 7, Figure 8 and Figure 9. These visual explanations help cybersecurity analysts understand why a sample was classified as malware rather than benign, which supports better decision-making and policy adjustments. For example, Figure 8 features SHAP summary plots that highlight specific features, for instance, Feature 3 or Feature 5, that strongly affect the classification decision in both datasets. Similarly, Figure 9 ranks features by their average impact, guiding analysts on which behaviors or signals to watch most closely. Such insights can support real-time applications of the framework (e.g., adaptive firewall rules, automatic alert thresholds, or policy writing for industrial intrusion prevention systems (IPSs)). Using these interpretability tools converts raw predictions into actionable intelligence, which makes the system more usable and more practical for real-world cybersecurity deployments.
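SHAP attributions are based on Shapley values, and the idea of ranking features by average impact can be illustrated with a small standard-library sketch. It does not use the SHAP or LIME libraries and does not reproduce Figures 7 to 9: the scoring function `toy_score`, the feature names, the baseline, and the samples are all hypothetical. The sketch computes exact Shapley values by averaging each feature's marginal contribution over all feature orderings, then ranks features by mean absolute contribution.

```python
from itertools import permutations


def shapley_values(predict, sample, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(sample)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = sample[j]          # switch feature j from baseline to its real value
            score = predict(current)
            phi[j] += score - previous
            previous = score
    return [value / len(orderings) for value in phi]


def toy_score(features):
    """Hypothetical malware score with one interaction term (not the paper's model)."""
    entropy, import_count, file_size = features
    return 0.6 * entropy + 0.3 * import_count + 0.1 * entropy * file_size


feature_names = ["entropy", "import_count", "file_size"]  # illustrative names
baseline = [0.0, 0.0, 0.0]
samples = [[1.0, 1.0, 1.0], [0.5, 1.0, 0.0], [1.0, 0.0, 1.0]]

all_phi = [shapley_values(toy_score, s, baseline) for s in samples]

# Global ranking: mean absolute contribution per feature across the samples.
mean_abs = [sum(abs(phi[j]) for phi in all_phi) / len(all_phi) for j in range(3)]
ranking = sorted(range(3), key=lambda j: mean_abs[j], reverse=True)
for j in ranking:
    print(f"{feature_names[j]}: {mean_abs[j]:.4f}")
```

For each sample, the contributions sum to the difference between the score of the sample and the score of the baseline, which is the property that makes per-feature attributions easy to read. Exact enumeration grows factorially with the number of features, so practical tools rely on approximations or model-specific shortcuts.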
Table 7 and Figure 10 present a comparative analysis between the proposed MAL-XSEL framework and an existing malware detection model [27] based on their performance on two datasets: ClaMP (Dataset 1) and MalwareDataSet (Dataset 2). The comparison evaluates accuracy as the primary performance metric. The existing approach utilizes a 1D-CNN-LSTM model, a deep learning-based hybrid architecture, achieving an accuracy of 98.75% on the ClaMP dataset and 97.07% on the MalwareDataSet dataset. While these results indicate strong performance, deep learning models often operate as “black boxes”, lacking interpretability, which can limit their adoption in security-critical applications.
In contrast, the proposed approach employs a hybrid stacking-based ensemble of machine learning models and outperforms the existing work on both datasets. It achieves an accuracy of 99.62% on ClaMP and 99.15% on MalwareDataSet, gains of 0.87 and 2.08 percentage points, respectively. The stacking ensemble combines the predictions of multiple base learners, so it can draw on the strengths of different machine learning models. Overall, this comparison highlights the effectiveness of the MAL-XSEL framework, which not only improves accuracy but also incorporates explainability through XAI techniques. By making its decisions more transparent, MAL-XSEL provides a more interpretable solution for malware detection in industrial web applications, which supports the security of critical infrastructure, industrial control systems, and enterprise networks.
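The size of the accuracy gap can also be expressed as a reduction in error rate. The short calculation below uses only the four accuracies quoted above; the relative error reduction is simple derived arithmetic and is not a statistical significance test.

```python
# Accuracies (%) quoted in the text for the comparison with [27].
baseline = {"ClaMP": 98.75, "MalwareDataSet": 97.07}   # 1D-CNN-LSTM
proposed = {"ClaMP": 99.62, "MalwareDataSet": 99.15}   # MAL-XSEL stacking ensemble

summary = {}
for name in baseline:
    gain_pp = proposed[name] - baseline[name]            # percentage-point gain
    error_before = 100.0 - baseline[name]                # error rate of the baseline
    error_after = 100.0 - proposed[name]                 # error rate of the proposed model
    relative_cut = 100.0 * (error_before - error_after) / error_before
    summary[name] = (round(gain_pp, 2), round(relative_cut, 1))
    print(f"{name}: +{summary[name][0]} pp, error rate reduced by {summary[name][1]}%")
```

On both datasets the error rate falls by roughly 70% relative to the 1D-CNN-LSTM baseline, which is a more informative view than the raw percentage-point gain when accuracies are already close to 100%.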