This section explains the methodology adopted to implement the proposed model. It briefly describes the datasets used to train the wide and deep parts of the model and then presents the experimental considerations, methods, and tools.
4.1. Dataset
This work utilized two different datasets to train the proposed model (WWDEM). The behavioral dataset prepared in our previous work [52] consists of pre-encryption data and synthetic data that mimic pre-encryption data. The synthetic data represent potential patterns and follow a distribution similar to that of the pre-encryption data, which mitigates the scarcity of data available during the pre-encryption phase. The behavioral dataset consists of significant features and is split into two sets. The first set contains 50 non-redundant informative features obtained by applying the TU-MRMR feature selection technique to the behavioral dataset. The second set contains the redundant significant features that remain after the informative features are omitted from the behavioral dataset.
In our dataset, many of the extracted features, which are derived from observed APIs and system calls, are inherently interpretable because they map directly to system-level operations (e.g., file creation, registry modification, and cryptographic library use). Intuitively, an API that handles encryption keys has clear relevance to ransomware behavior, offering a semantic clue as to how the malware is operating. For instance, calls like CryptAcquireContext, CryptGenKey, and CryptEncrypt offer insights into how encryption keys are generated and used, which is an essential part of ransomware functionality. Similarly, CreateFileW and WriteFile help indicate file manipulation behaviors, which are relevant for ransomware attempting to encrypt or overwrite user files. Registry modifications can also be detected through RegCreateKeyEx or RegSetValueEx calls. While some higher-level or aggregated features may appear more abstract, most of the core features we rely on carry domain-specific semantic meanings that cybersecurity analysts can interpret to understand the underlying behavior of each ransomware sample.
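To illustrate how such API-level features remain interpretable, the following minimal sketch turns an observed call trace into a vector with one count per API. The API names are those mentioned above; the count-based encoding, the function name `api_count_vector`, and the example trace are assumptions made for illustration and are not necessarily the representation used in the behavioral dataset.

```python
from collections import Counter

# API names taken from the text; the count-based encoding is an assumption.
TRACKED_APIS = [
    "CryptAcquireContext", "CryptGenKey", "CryptEncrypt",  # key handling and encryption
    "CreateFileW", "WriteFile",                            # file manipulation
    "RegCreateKeyEx", "RegSetValueEx",                     # registry modification
]


def api_count_vector(trace, tracked=TRACKED_APIS):
    """Return one call count per tracked API, in the order given by `tracked`."""
    counts = Counter(trace)
    return [counts[name] for name in tracked]


# Hypothetical trace, invented for illustration only.
example_trace = [
    "CreateFileW", "CryptAcquireContext", "CryptGenKey",
    "CryptEncrypt", "WriteFile", "CryptEncrypt", "WriteFile",
]

vector = api_count_vector(example_trace)
for name, count in zip(TRACKED_APIS, vector):
    print(f"{name}: {count}")
```

Because each position in the vector corresponds to a named API, an analyst can read a high count for CryptEncrypt or WriteFile directly as encryption or file-overwrite activity.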
In this study, we selected ransomware families that utilize encryption methods commonly found in real-world ransomware campaigns, primarily variants of symmetric (e.g., AES) and asymmetric (e.g., RSA) ciphers. In practice, the most prevalent approach involves using a strong symmetric cipher (e.g., AES-256) for the actual file encryption, combined with an asymmetric algorithm for key exchange [2]. Our dataset reflects this industry reality by incorporating samples that rely on such hybrid mechanisms to lock user data. For example, we included well-known ransomware families like ‘WannaCry’ (which uses RSA and AES), ‘Locky’ (typically RSA-2048 plus AES), and several others that frequently appear in threat intelligence reports. This ensures that the encryption methods present in our experimental setup align with those most often encountered in current ransomware attacks, thereby providing a realistic basis for evaluating detection performance.
4.4. Evaluation Metrics
We evaluated the model’s performance using precision, recall, F1-score, accuracy, and false-positive rate (FPR).
Precision measures the reliability of the model’s positive predictions, i.e., the proportion of samples flagged as ransomware that are actually ransomware. High precision therefore means that few benign samples are mislabeled as ransomware. Precision is calculated according to Equation (4).
Recall represents the sensitivity of the model, i.e., the proportion of actual ransomware samples that the model detects. A model with strong detection results will have a high recall value. Recall is calculated using Equation (5).
The F1-score combines precision and recall into a single value and therefore incorporates both model relevance and sensitivity. It is calculated according to Equation (6).
Accuracy represents the correctly identified samples. It is the ratio of correct predictions to the total number of predictions made, as described by Equation (7).
The false positive rate (FPR), described by Equation (8), measures the proportion of negative instances incorrectly classified as positive by a model. It evaluates the rate of false alarms in a system.
Another important evaluation metric is the detection rate (DR), shown in Equation (9), which highlights the significance of the proposed work. The DR is calculated by dividing the number of detected ransomware samples by the total number of both ransomware and non-ransomware samples.
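The metrics described above can be sketched from confusion-matrix counts as follows. Precision, recall, F1-score, accuracy, and FPR use their standard definitions, which are assumed to match Equations (4) to (8). The DR line follows the sentence above literally, treating "detected ransomware samples" as true positives; this reading is an assumption, and Equation (9) may differ. The function name and the example counts are hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute evaluation metrics from confusion-matrix counts.

    tp: ransomware flagged as ransomware    fp: benign flagged as ransomware
    tn: benign flagged as benign            fn: ransomware flagged as benign
    """
    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    total = tp + fp + tn + fn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, total),
        "fpr": ratio(fp, fp + tn),
        # Assumption: literal reading of the DR sentence (true positives / all samples).
        "dr": ratio(tp, total),
    }


# Hypothetical counts, not taken from the experiments.
metrics = detection_metrics(tp=90, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```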
4.5. Experimental Results
The performance of the proposed WWDEM, evaluated using varying numbers of ensembles, highlights its effectiveness in detecting evolving ransomware variants. To thoroughly assess its capabilities, a set of evaluation metrics, including precision, recall, F1-score, accuracy, and FPR, was applied. WWDEM achieved the highest accuracy using a model consisting of seven ensembles trained on 50 features, whereas the lowest accuracy was observed in a model using three ensembles trained on 10 features. Similarly, the model demonstrated strong precision across the different ensemble and feature configurations, with the lowest precision observed in the model with three ensembles trained on 10 features. The variations in performance with different ensemble configurations are further discussed in Section 5. The details of the results obtained by the proposed model are as follows.
The results in Table 3 present the performance of the proposed WWDEM when using three ensembles (C3) and varying numbers of features (10, 20, 30, 40, and 50). The metrics include precision, recall, F1-score, accuracy, and false positive rate (FPR). Notably, the highest accuracy reaches 0.960 for the model trained on 50 features, accompanied by high precision (0.923) and recall (0.999). Conversely, the lowest reported metrics appear when only 10 features are used, with precision dipping to 0.845. Nevertheless, even the lower-range results maintain reasonable performance levels, underscoring the model’s robustness across different feature subsets. These results illustrate the significance of incorporating both critical and less impactful features under an ensemble approach, especially in cases where ransomware exhibits varying behavioral traits. The gradual improvement in accuracy, precision, and recall as more features are included indicates that expanding feature coverage helps the model capture more nuanced patterns of malicious behavior. Furthermore, the relatively low FPR values suggest that the joint memorization and generalization framework effectively discriminates benign from malicious instances, a key advantage of blending linear and deep ensemble components to address the evolving signatures of ransomware.
Table 4 shows the model’s performance using four ensembles (C4) across the same incremental sets of features. As before, 50 features yield a high accuracy of 0.960, and 10 features result in an accuracy of 0.914. Precision peaks at a perfect 1.000 for 10 features, although the corresponding recall is lower (0.822), balancing the F1-score around 0.903. Similar patterns emerge for the other feature increments, generally showing improved recall and accuracy as the feature set grows. These results suggest that while certain minimal subsets of features can yield strong precision, they may not thoroughly capture the breadth of ransomware behaviors, hence the lower recall in some cases. In contrast, the expanded feature sets (particularly 40 and 50 features) help the model more comprehensively detect diverse ransomware activities, thus achieving a more favorable balance across all metrics. The fact that the FPR is maintained close to zero underscores the method’s consistency in correctly identifying legitimate processes.
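As a quick consistency check on the 10-feature C4 result, the F1-score can be recomputed from the reported precision and recall, assuming Equation (6) is the standard harmonic mean of the two:

```latex
\begin{align*}
F_1 &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
     = \frac{2 \times 1.000 \times 0.822}{1.000 + 0.822}
     = \frac{1.644}{1.822} \approx 0.902
\end{align*}
```

This agrees with the reported value of about 0.903 up to the rounding of the three-decimal inputs, and it shows how the lower recall pulls the F1-score down even when precision is perfect.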
Table 5 summarizes the performance of WWDEM with five ensembles (C5) for different feature sizes (10 to 50). Accuracy steadily improved from 0.915 with 10 features to 0.960 with 50 features. Similarly, precision increased from 0.850 to 0.924, indicating that a larger share of the samples flagged as ransomware were identified correctly. Recall remained consistently high, ranging between 0.886 and 1.000 across the different feature sets. The F1-score, combining precision and recall, also showed a clear upward trend, improving from 0.919 to 0.960. Meanwhile, the false positive rate (FPR) dropped from 0.164 with 10 features to as low as 0.001 with 30 features before rising again to 0.076 with 50 features.
The results in Table 5 demonstrate that increasing feature size generally enhanced the performance of WWDEM. A larger set of selected features improved the model’s precision, indicating fewer false alerts and better reliability in distinguishing ransomware attacks from normal activities. Notably, the high recall values indicate consistent capability to detect actual ransomware instances, with only slight fluctuations. The optimal balance between low FPR and high accuracy occurs around 30–40 features, suggesting this range provides the best compromise between precision and recall. Overall, WWDEM effectively handles behavioral drift, maintaining robust and accurate detection performance as the feature set increases.
In Table 6, WWDEM leverages six ensembles (C6), again reporting metrics for 10, 20, 30, 40, and 50 features. Notably, the model trained on 10 features displays perfect precision (1.000) and a recall of 0.824, culminating in an accuracy of 0.915. As the feature set expands to 50, precision remains near-perfect at 0.997, and recall increases to 0.932, leading to an accuracy of 0.966. The F1-score also climbs accordingly, indicating a sound balance between catching ransomware and avoiding false alarms. Such a pattern demonstrates the consistent enhancement of model effectiveness through a more diverse feature base. The increasing recall rates indicate that the model becomes more adept at capturing subtle ransomware behaviors, while precision remains high even when a substantial number of features are involved. Throughout all feature scenarios, the FPR values remain low, confirming that the system does not sacrifice specificity to achieve improved recall.
Table 7 presents metrics under the configuration with seven ensembles (C7). Once again, distinct feature settings are provided, with 50 features achieving the highest accuracy at 0.971, a precision of 0.948, and a recall of 0.994. At the lower end, with only 10 features, the model still attains an accuracy of 0.918 and a perfect precision of 1.000, although recall is comparatively lower. Across all entries, the FPR remains below 0.120, illustrating the method’s ability to avoid frequent false positives. These results highlight that increasing the number of ensembles can bolster detection rates. When the model is configured with C7, it better integrates knowledge from diverse subsets of features, pushing overall accuracy and recall to higher levels than the smaller ensemble configurations. Consequently, as with previous tables, the combination of more comprehensive feature sets and additional ensembles further refines the detection of evolving ransomware variants.
Table 8 presents the detection rate (DR) for WWDEM across all ensemble configurations (C3, C4, C5, C6, and C7) and varying feature counts (10, 20, 30, 40, and 50). The DR consistently increases as both the ensemble size and the feature subset increase. The lowest detection rate of 0.912 is observed when the model is configured with C3 on a 10-feature set, while the highest detection rate of 0.971 is obtained with C7 on 50 features. This progression reaffirms the collective findings that larger ensemble configurations and more extensive feature sets enhance the detection of sophisticated ransomware behaviors. The table clearly shows how each incremental addition of features or ensembles helps the model recognize a wider array of malicious traits, culminating in near-optimal coverage of known and emerging variants. The fact that all detection rates exceed 0.900 in each tested scenario further underscores the overall consistency of WWDEM.
Table 9 compares WWDEM when using different ensemble sizes (C3 through C7) with two state-of-the-art models: the Enhanced Anomaly Behavioral Detection Model and the Hybrid Distinct Ensemble Model. WWDEM outperforms both baselines across all listed metrics: precision, recall, F1-score, accuracy, false positive rate (FPR), and detection rate (DR). Notably, when WWDEM uses C7 and 50 features, the detection rate peaks at 0.971, surpassing the 0.508 and 0.525 from the two baseline approaches by a significant margin. These results confirm the robustness and general superiority of the proposed method. While the baseline models show moderate performance, particularly in handling complex or variant-intensive attacks, WWDEM consistently reports better precision and recall. Additionally, the FPR is drastically lower in the proposed model (0.051 at best), implying fewer misclassifications of benign processes as malicious. This combination of high effectiveness and low false alarms is a hallmark of a reliable ransomware detection system.
4.6. Comparison with Related Solutions
Figure 3 compares precision scores across different ensemble configurations for the proposed WWDEM model and the two baseline models. The bars clearly demonstrate that WWDEM achieves consistently higher precision, with configurations involving larger ensembles (C6 and C7) outmatching the smaller ones (C3). In all setups, it stands notably above Enhanced Anomaly Behavioral Detection Model and Hybrid Distinct Ensemble Model. This superior precision is crucial from a practical standpoint because it indicates that WWDEM seldom flags benign processes as malicious. The high precision metrics reflect the model’s ability to learn highly discriminative features, thus minimizing the risk of interrupting legitimate user activities or business processes due to false positives.
The improvement in precision in Figure 3 stems from the weighted deep ensemble component of WWDEM, which methodically assigns importance to key features while filtering out less indicative ones. By ‘ensembling’ multiple deep networks, each with a unique view of the input space, the model elevates indicators that reliably distinguish malicious from normal behavior. In comparison, older or simpler anomaly-based systems may define benign baselines in a static manner, leaving them susceptible to misclassifying normal variations as threats. Consequently, the approach showcased here suggests a more sophisticated path forward, leveraging wide-and-deep synergy to enhance indicator-specific weighting and, thus, produce fewer errant alerts, a pressing concern in real-world security operations.
Figure 4 focuses on recall, highlighting how effectively each model detects ransomware among all malicious instances. WWDEM, especially when using higher ensemble numbers, consistently achieves stronger recall than the two baseline methods. This is most evident for the C7 configuration, which approaches or surpasses the 0.90–0.95 range, indicating the model’s proficiency at catching the majority of ransomware attacks in the dataset. Such a high recall is pivotal in cybersecurity, where failing to detect a ransomware threat can lead to severe operational and financial damage. Even a small gap in detection capability can be exploited, making the margin of improvement here particularly valuable in practical scenarios.
Analytically, the recall strength of WWDEM is tied to the ensemble-based generalization strategy. As ransomware evolves, certain features become temporarily dominant, and others recede in importance. The weighted deep networks within the model can dynamically re-assign emphasis to these shifts. In contrast, classical detection solutions, which rely heavily on a narrower set of historically prominent features, may miss signals tied to nascent ransomware variants. This outcome also mirrors the advantage of the wide part, which integrates well-established indicators and thereby sustains recall for known threats, alongside deep ensembles that discover unfamiliar or obscure characteristics. Therefore, Figure 5 substantiates how the comprehensive approach helps the proposed model achieve near-complete coverage of malicious activities.
In Figure 5, the F1-scores are compared for WWDEM under various ensemble configurations and for the baseline models. The F1-score, combining both precision and recall, reveals the overall detection effectiveness of each approach. The results confirm that the proposed solution leads in F1-score, with the best-performing configuration nearing perfect equilibrium between avoiding false positives (precision) and capturing actual threats (recall). The margin by which WWDEM outperforms the existing methods is consistent across different ensemble sizes. Even the smaller ensembles (C3) show respectable F1-scores, signifying a robust foundational approach, but the metric improves further as the number of ensembles increases. This demonstrates that added ensemble diversity refines the ability to detect various strains of ransomware.
The high F1-scores illustrated in Figure 5 reinforce the fact that balancing memorization of key features and generalization for unforeseen attacks is pivotal in maintaining an all-around strong detection performance. This synergy spares the system from heavily skewing towards either precision or recall, a common pitfall in simpler models, where emphasizing one metric often compromises the other. Such balance is especially significant in organizational security environments, where both missed detections and numerous false alarms carry high stakes. By consistently excelling in F1-score, WWDEM exhibits a practical readiness to handle real-world ransomware threats. It also provides empirical evidence that addressing the shortcomings of existing methods, such as narrow feature reliance and static detection timelines, yields more balanced detection outcomes.
Figure 6 displays the accuracy rates of all tested models under various ensemble configurations, with WWDEM notably achieving the highest scores. In particular, configurations with a larger number of ensembles (C6 and C7) approach accuracy levels above 0.95. In contrast, Enhanced Anomaly Behavioral Detection Model and Hybrid Distinct Ensemble Model exhibit comparatively modest accuracies. Accuracy is a straightforward yet essential metric, depicting how many predictions out of all attempts are correct. The success of WWDEM here signifies that it can effectively distinguish benign from malicious behavior in most scenarios, validating the model’s capacity for correct overall classification across a broad ransomware sample set.
The improved accuracy seen in Figure 6 reveals the ability of WWDEM to unify diverse feature sets and classification strategies. In cases where older models might rely on rigid, phase-specific features or purely anomaly-based thresholds, the proposed approach dynamically weighs multiple deep networks, preventing misclassifications that arise from short-lived or misleading feature signals. The consistency of high accuracy across ensembles also suggests resilience against adversarial behaviors that attempt to mimic benign processes. By relying on a wide breadth of features and the synergy of memorization and generalization, the model remains less prone to being deceived. Consequently, this adaptability represents a direct solution to the limitations of static, single-phased detection frameworks that struggle with rapidly shifting ransomware code patterns.
Figure 7 shows the comparison of the false positive rates (FPRs) for WWDEM at different ensemble sizes against the baseline methods. The graph shows that WWDEM systematically achieves much lower FPRs, especially when operating at higher ensemble counts (C6 or C7). Both baseline solutions produce substantially higher FPRs, indicating that they are more likely to label benign processes as malicious. Keeping the FPR low is critical for maintaining normal system operations. Security teams often rely on the FPR to gauge how frequently the system triggers unnecessary alerts. Excessive false positives can lead to “alert fatigue”, in which genuine threats might eventually be overlooked.
The reduced FPR in Figure 7 underscores the model’s refined approach of assigning precise weights to each ensemble’s judgment. By reconciling multiple classifiers, WWDEM offsets the tendency of any single, possibly overfitted ensemble to mistakenly flag benign behaviors. This method diverges from classical anomaly detection, which often casts a broader net at the cost of more frequent false alarms. This attribute also has direct implications for real-time ransomware defense, where false positives could disrupt legitimate application operations or tarnish the model’s credibility among users. Hence, the proposed ensemble design stands out as not only effective but also judicious in preserving the stability of everyday system usage, a key benefit over older detection systems.
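The effect of weighting each ensemble’s judgment can be sketched with a normalized weighted average of per-member ransomware probabilities. The averaging rule, the 0.5 decision threshold, the function names, and the example numbers are all assumptions for illustration; the text does not specify how WWDEM derives or combines its weights.

```python
def weighted_ensemble_score(member_probs, weights):
    """Combine per-member ransomware probabilities into one weighted score."""
    if not weights or len(member_probs) != len(weights):
        raise ValueError("need one weight per ensemble member")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalized weighted average: members with larger weights count more.
    return sum(p * w for p, w in zip(member_probs, weights)) / total


def is_ransomware(member_probs, weights, threshold=0.5):
    """Flag a sample when the weighted score reaches the (assumed) threshold."""
    return weighted_ensemble_score(member_probs, weights) >= threshold


# Hypothetical benign sample: one possibly overfitted member raises a false alarm.
probs = [0.9, 0.2, 0.3]
weights = [0.2, 0.4, 0.4]  # the unreliable member carries the smallest weight

print(weighted_ensemble_score(probs, weights))  # weighted score, below 0.5 here
print(is_ransomware(probs, weights))            # the sample is not flagged
```

In this example, the single member voting 0.9 would raise an alert on its own, but its low weight lets the other two members keep the combined score below the threshold, which is the mechanism the paragraph above credits for the reduced FPR.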
Figure 8 shows the detection rate (DR) for WWDEM when using various ensemble sizes and for the comparative models. The proposed model consistently outperforms the others, achieving detection rates close to or exceeding 0.90 in all ensemble configurations. As expected, the highest DR occurs with configurations that utilize both a larger number of ensembles and a more inclusive feature set, reaching peaks above 0.95 and reinforcing the findings from the tabular data. The DR metric highlights the proportion of the total samples—both benign and malicious—that are correctly labeled as ransomware. In scenarios where there are advanced, persistent threats, timely and accurate detection is essential, and a high DR ensures that new or variant-heavy families of ransomware are not missed.
The higher DR displayed in Figure 8 is invaluable. It indicates that despite the constantly shifting landscape of ransomware behaviors, WWDEM is highly adept at catching even unconventional or newly emerging attack vectors. This success is attributable to the model’s fundamental design principle: leveraging memorized cues in the wide part and exploring unexplored feature combinations through multiple deep ensembles. Furthermore, the comparative advantage over existing detection techniques suggests that WWDEM effectively addresses historical challenges, such as reliance on static signatures or single-phase data. By uniting memorization, generalization, and weighted ensembling, WWDEM furnishes an advanced, future-ready tool for cybersecurity, mitigating both the risk of missing novel attacks and the disruptions from misclassifications.
Figure 9 presents a side-by-side comparison of the models’ average performance metrics, including precision, recall, F1-score, accuracy, FPR, and DR. The performance curves of WWDEM remain consistently higher than the baselines across the majority of metrics, illustrating its dominant presence in detection accuracy and recall. Notably, the proposed model’s FPR bars remain on the lower end, signifying a strong ability to minimize false alarms while still capturing malicious activities. The bar chart underscores that, regardless of the metric used for comparison, WWDEM generally retains its lead over the competing models. The visual representation also clarifies how each ensemble configuration (C3 through C7) contributes to incremental gains. This holistic snapshot demonstrates the efficiency of combining varied feature subsets with multiple ensemble learners, ultimately leading to heightened reliability.
From a broader perspective, the aggregated performance in Figure 9 highlights how WWDEM systematically addresses limitations identified in older techniques. Rather than relying on a static cluster of historically relevant features, the model capitalizes on a twofold pipeline: memorizing established malicious signatures in the wide network while deploying ensemble-based deep learners to generalize to potential new attack vectors. In practical terms, this combination aligns with the industry’s growing recognition that adaptive, multi-module architectures can offer stronger resilience against the rapid evolution of ransomware. The results, thus, confirm that the proposed approach effectively closes the following gap in the literature: while many systems excel at either memorization or generalization, few have successfully integrated both in a single pipeline to tackle concept drift and the varied nature of ransomware threats.
To validate the effectiveness of the proposed wide and weighted deep ensemble model (WWDEM), a comparative analysis against state-of-the-art approaches was conducted. Table 10 summarizes the comparative performance of WWDEM against baseline models using three key parameters: accuracy, F1-score, and false positive rate (FPR). This comparison shows that the proposed WWDEM outperformed the baseline models across all three evaluation metrics, demonstrating its effectiveness in detecting evolving ransomware behaviors while minimizing false alarms. Specifically, the results show that WWDEM achieved notably higher accuracy (0.937), recall (0.971), and precision (0.937), reflecting its balanced performance in correctly identifying ransomware instances while limiting false positives. WWDEM also achieved a lower false positive rate (FPR) of 0.095 compared to 0.120 and 0.195 from the Enhanced Anomaly Behavioral Detection Model and the Hybrid Distinct Ensemble Model, respectively. This indicates that WWDEM is both more accurate and more reliable, effectively detecting ransomware while minimizing false alerts. Overall, these averages confirm that WWDEM consistently outperforms the baseline methods across all measured parameters.
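To make the FPR gap concrete, the averages reported above can be converted into expected false-alert counts. The workload of 1000 benign processes is a hypothetical figure chosen for illustration and is not taken from the experiments.

```latex
\begin{align*}
\text{false alerts} &= \mathrm{FPR} \times N_{\text{benign}}, \qquad N_{\text{benign}} = 1000 \ \text{(hypothetical)} \\
\text{WWDEM:} \quad & 0.095 \times 1000 = 95 \\
\text{Enhanced Anomaly Behavioral Detection Model:} \quad & 0.120 \times 1000 = 120 \\
\text{Hybrid Distinct Ensemble Model:} \quad & 0.195 \times 1000 = 195
\end{align*}
```

Under this assumption, WWDEM would raise 25 fewer false alerts than the first baseline and 100 fewer than the second for every 1000 benign processes examined.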