**1. Introduction**

The rapid proliferation of ransomware attacks has emerged as one of the most significant cybersecurity threats facing organizations today. In recent years, ransomware has become an increasingly popular tool with which cybercriminals extort money from victims by encrypting their data and demanding payment for a decryption key. The impact of ransomware attacks has been felt across all industries, from healthcare and finance to government and education. Given the high stakes involved, it is crucial to understand the nature of ransomware attacks, how they spread, and the potential consequences of falling victim to one [1]. As the threat of ransomware attacks continues to grow, there is a pressing need for scholars and practitioners to examine the problem more closely and identify effective strategies for prevention and mitigation. This paper aims to contribute to this effort by providing a comprehensive overview of the ransomware threat landscape, analyzing the factors that contribute to the spread of ransomware, and exploring potential avenues for future research. By shedding light on this critical issue, we hope to help individuals and organizations better protect themselves against ransomware attacks and mitigate the potential damage caused by these malicious programs [1].

This paper is organized as follows. Section 2 introduces the concept of ransomware and how it works. It also discusses the different types of ransomware attacks, such as encrypting ransomware, locker ransomware, and scareware. Section 3 describes the methodology used for this paper. Section 4 reviews machine-learning-based ransomware-detection systems developed by researchers. It discusses the methodology used, the performance achieved, and the limitations of each system, as well as the challenges of collecting and preprocessing data for ransomware detection using machine learning. Section 5 provides an in-depth analysis of the evolution of ransomware over the last twelve years. Section 6 provides an overview of the existing ransomware detection techniques, including signature-based detection, behavior-based detection, and machine-learning-based detection, and discusses the evaluation metrics used for measuring the performance of machine learning models for ransomware detection. It also covers the machine learning algorithms used for this purpose, such as decision trees, random forests, support vector machines, and neural networks, together with the features used for ransomware detection and the techniques used for feature selection. Section 7 discusses the challenges of developing effective machine-learning-based ransomware-detection systems. It also highlights future directions in this field, such as developing more robust and accurate models, incorporating real-time detection capabilities, and addressing the issue of adversarial attacks. Section 8 concludes the paper. This research offers a valuable resource for researchers and practitioners interested in developing effective ransomware-detection systems using machine learning techniques.

**Citation:** Alraizza, A.; Algarni, A. Ransomware Detection Using Machine Learning: A Survey. *Big Data Cogn. Comput.* **2023**, *7*, 143. https://doi.org/10.3390/bdcc7030143

Academic Editors: Peter R.J. Trim, Yang-Im Lee and Min Chen

Received: 18 May 2023 Revised: 7 August 2023 Accepted: 11 August 2023 Published: 16 August 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **2. Background**

Ransomware encrypts information or locks computer systems and prevents authorized users from accessing them. Ransomware attacks use tactics, techniques, and procedures that can lock computers or encrypt data and that are difficult even for a computer professional to undo. They may also steal private information from victims' PCs and network systems. Individual PCs, commercial systems (and the data and software they contain), and industrial control systems are all potential targets for ransomware attacks, as are the many kinds of sensors that Internet of Things (IoT) users employ [1]. A ransomware attack employs private key encryption to prevent authorized users from accessing a system or data unless they pay a ransom, typically in Bitcoin [2]. Ransomware operations may include data exfiltration techniques: attackers steal private information from vulnerable networks and threaten to release it if the owner does not pay a ransom. The infection is spread through malicious advertising, email attachments, and links to rogue websites. The attacker also delivers a file (or files) with instructions for paying the ransom. Once the attacker has verified that the ransom has been paid, the victim can access the decryption key [3]. Files encrypted by ransomware frequently carry extensions such as Locky, Cryptolocker, Vault, Micro, Encrypted, TTTT, XYZ, ZZZ, and Petya. Each file's extension indicates the type of ransomware that affected it. Examples of ransomware include WannaCry, WannaCry.F, Fusob, TorrentLocker, CryptoWall, CryptoTear, and Reveton [4]. Figure 1 illustrates the classification of ransomware into three categories: scareware, locker ransomware, and crypto-ransomware [2,4].

Crypto-ransomware is the most prevalent type of ransomware targeting computer systems and networks. It encrypts files and data using symmetric and asymmetric encryption algorithms. Even if the malicious software is removed from an infected computer, or a compromised storage device is connected to another system, the data encrypted by crypto-ransomware remain unusable. Because the malware typically leaves essential system data intact, the compromised device can still be used to pay the ransom [4]. Figure 2 provides a visual representation of crypto-ransomware, a form of malicious software that is becoming increasingly prevalent in cyberattacks [4].

Locker ransomware, by contrast, prevents owners from using a computer or other device by locking it and demanding money. The workstation is affected by the locker ransomware, but saved data are not rendered inaccessible, and the data remain unaltered once the malicious program has been removed. The data are often recoverable by connecting the infected storage device, such as a hard drive, to another machine. For this reason, locker ransomware is less attractive to attackers who want to extort money from victims. Figure 3 provides a visual representation of locker ransomware, a form of malicious software that is becoming increasingly prevalent in cyberattacks [4].

**Figure 2.** Crypto-ransomware [4].


**Figure 3.** Locker ransomware [4].

Scareware preys on its victims by informing them that their machines have been compromised and promising to remove the ransomware using a fake antivirus program supplied by the attacker. Because scareware alerts appear so frequently, many unsuspecting users buy and install the fake antivirus software [5]. Human-operated ransomware and fileless ransomware differ from the ransomware types described above. Cybercriminals employ human-operated ransomware to break into networks or cloud infrastructure, carry out privilege escalation, and launch attacks on sensitive data. Rather than a single system, the attack actively targets an entire organization. Attackers typically gain access to a whole IT system, move laterally, and exploit flaws arising from improper security configurations. Ultimately, unauthorized access to privileged user credentials leads to ransomware attacks on the IT systems that support crucial business activities [3,4]. Figure 4 provides a visual representation of scareware, a form of malicious software that is becoming increasingly prevalent in cyberattacks [4].

Fileless ransomware, by contrast, uses native, trusted system tools to launch attacks. The attack is difficult to identify because no code needs to be placed on the victim's machine for it to work. As a result, anti-ransomware technologies do not find any suspicious files to trace during an attack. Depending on the attacker's intentions, file-based and human-operated ransomware can encrypt, lock, or leak data from files [2]. Ransomware poses a danger to businesses' technology and files. Until the ransom is paid, typically in Bitcoin, infected files or compromised devices remain locked. The decryption key is frequently withheld even after a victim pays the ransom the attackers demand, and attempts to decrypt the data with the attacker's key can damage the system's stored files. Developments such as ransomware development kits, ransomware-as-a-service, and Bitcoin have driven the ongoing rise in ransomware attacks on desktop PCs, networks, and mobile devices [2]. Ransomware attacks cost businesses and individuals hundreds of millions yearly [3]. The enormous financial gains that attackers obtain from ransomware continually motivate the creation of new malware, and numerous ransomware variants have appeared since 2013. New, effective, and reliable techniques are therefore needed to detect, prevent, and mitigate ransomware attacks, because conventional antivirus software and other intrusion-detection systems cannot detect the different ransomware strains. People and companies experience significant financial losses as a result of ransomware attacks. The encryption of files or devices until a ransom is paid can result in the permanent loss of important data, which can have severe consequences for individuals and businesses alike [1,6].


**Figure 4.** Scareware ransomware [4].

#### **3. Survey Planning**

The present research involved several phases to achieve its overall objectives, including data collection and information gathering, data extraction and analysis, information synthesis, and reporting. A visual representation of the research process flow is presented in Figure 5, which depicts the activities involved in each phase and their interrelation.

The data collection process was carried out by selecting relevant and up-to-date journal and conference papers from reputable databases such as IEEE, Springer, MDPI, Elsevier, IET, and Archive.org, as well as other sources including university-based journals, theses/dissertations, and blogs published by reputable organizations such as Microsoft, Crowdstrike, Symantec, and Techspot. The collected materials were then categorized into two main groups: non-technical sources and technical sources. Non-technical sources contained general information on ransomware and were used to provide reliable information while writing the introduction and detailing the history of ransomware and the chronology of attacks. Technical papers proposing solutions for ransomware attacks were divided into groups based on the nature and purpose of the proposed solution. Papers focusing on detection were further sub-categorized into artificial-intelligence-based methods and non-AI-based approaches. AI-based approaches were classified into machine learning methods, deep learning approaches, and artificial neural network approaches, while non-AI-based papers were grouped into packet and traffic analysis categories. The data extraction phase involved a detailed analysis and summary of each technical paper by identifying the problem it addressed, its objectives, the method or technique used, the results obtained, and the limitations of the research. Information synthesis was applied to identify similarities or relationships among papers in each group and to determine whether and how one work improved upon or addressed the limitations of another. The reporting phase placed papers that addressed similar problems or used similar techniques in the same group and presented their reviews in the same paragraph. This approach improves the flow and readability of the paper and gives readers a clear understanding of the concepts discussed in the research.

**Figure 5.** Research process flow.

#### **4. Literature Review**

Preventing ransomware is challenging for several reasons. Ransomware behaves in much the same way as benign software and acts covertly, so detecting ransomware in zero-day attacks is crucial. The primary objectives are to avoid ransomware-caused system damage, identify zero-day (previously unidentified) malware, and minimize detection errors, which means reducing the number of false positives while still detecting all instances of ransomware. False positives are instances where the system flags a harmless program or file as ransomware, leading to unnecessary alerts and actions. Ransomware can be detected using a variety of tools and methodologies. Methods based on static analysis examine code without running it. They generate many false positives and cannot find ransomware that is disguised, and attackers frequently create new variants and modify their code using various packing techniques. To address these issues, researchers use dynamic behavior analysis methods that monitor interactions between the executed code and a virtual environment. However, these detection methods are cumbersome and memory-intensive. Machine learning is well suited to analyzing the behavior of a process or application.
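The two objectives named above, detecting all ransomware instances while limiting false positives, correspond to two standard rates computed from a detector's confusion counts. The following is a minimal sketch of that arithmetic; the `ConfusionCounts` container and the sample counts are illustrative assumptions and are not taken from any of the surveyed studies.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for a detector on labeled samples (hypothetical container)."""
    true_positives: int   # ransomware correctly flagged
    false_positives: int  # benign software wrongly flagged as ransomware
    true_negatives: int   # benign software correctly passed
    false_negatives: int  # ransomware that was missed


def detection_rate(c: ConfusionCounts) -> float:
    """Share of ransomware samples that were detected (also called recall)."""
    total = c.true_positives + c.false_negatives
    return c.true_positives / total if total else 0.0


def false_positive_rate(c: ConfusionCounts) -> float:
    """Share of benign samples that were wrongly flagged as ransomware."""
    total = c.false_positives + c.true_negatives
    return c.false_positives / total if total else 0.0


# Illustrative counts only: 100 ransomware and 100 benign samples.
counts = ConfusionCounts(true_positives=95, false_positives=2,
                         true_negatives=98, false_negatives=5)
print(f"detection rate: {detection_rate(counts):.2f}")
print(f"false-positive rate: {false_positive_rate(counts):.2f}")
```

A detector that flags every sample would reach a detection rate of 1.0 but also a false-positive rate of 1.0, which is why the two rates are reported together.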

Machine learning is considered ideal for analyzing the behavior of processes or applications because it can effectively learn patterns and anomalies in large datasets, which can be difficult for humans to detect. In the context of ransomware detection, machine learning algorithms can be trained on large datasets of both benign and malicious software to learn the behavioral characteristics that distinguish ransomware from legitimate software. This training can be used to identify new and previously unseen variants of ransomware, including zero-day attacks, based on their behavioral patterns.

Moreover, machine learning can be used to continuously learn and adapt to new threats, making it an effective approach to keep up with the constantly evolving tactics of ransomware attackers. Machine learning can also reduce false positives by accurately distinguishing between benign software and ransomware based on their behavioral patterns.

Compared with traditional signature-based detection and static analysis methods, machine learning is considered ideal because it can provide a more comprehensive and accurate analysis of the behavior of software, making it a powerful tool for ransomware detection. However, it is important to note that machine learning models need to be properly trained and validated to ensure their effectiveness and avoid biases or errors. The following are some machine-learning-based detection systems that follow highly traditional methodologies.

Table 1 summarizes previous studies on machine learning techniques (behavioral techniques) for ransomware detection from 2017 to 2022.

**Table 1.** Studies on machine learning techniques (behavioral techniques) for ransomware detection from 2017 to 2022.


An application's normal behavior is assessed from a user and resource perspective. A baseline for normal behavior is established based on what is thought to be the typical or routine operation of a computer system or network. Indicators of usual activity include logins, file access, user and file behaviors, resource utilization, and other significant indicators [1].

The length of the learning process is determined by the amount of data needed to build a baseline to represent typical system behavior. The tool investigates behavioral outliers from the baseline's depiction of the typical behavioral pattern. A ransomware-detection and -prevention model was created for unstructured datasets derived from Ecuadorian Control and Regulatory Institution (EcuCERT) logs [12].
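The baseline-and-outlier idea described above can be illustrated with a simple statistical sketch. It assumes a single numeric indicator (files modified per minute) and a z-score threshold; both the indicator and the threshold are hypothetical choices for illustration and are not the method used in the EcuCERT study.

```python
from statistics import mean, stdev


def build_baseline(observations):
    """Summarize normal behavior as (mean, standard deviation)."""
    return mean(observations), stdev(observations)


def is_outlier(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # With no variation in the baseline, any different value is an outlier.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical indicator: files modified per minute during routine operation.
normal_activity = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
baseline = build_baseline(normal_activity)

print(is_outlier(6, baseline))    # within the usual range
print(is_outlier(250, baseline))  # burst of file modifications
```

The more observations the baseline is built from, the more stable the mean and deviation become, which reflects the point that the length of the learning process depends on the amount of data needed.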

The methodology is designed to spot unusual behavioral patterns associated with Windows malware. Feature selection is applied to the log data to extract the most useful and discriminating information that indicates a ransomware attack. Using the extracted feature set as input, learning algorithms that model abnormal behavioral patterns identify ransomware quickly and accurately. Ransomware attacks are constantly evolving, and code obfuscation tools and new polymorphic variants continue to be developed to evade signature-based identification [8].

Since generic malware attack vectors cannot effectively capture the particular behavioral traits of cryptographic ransomware, they are insufficient or inaccurate for ransomware detection. The suggested approach, RansomWall, is a hybrid system that uses static and dynamic analysis to build a set of features that characterize ransomware activity. The technique allows for early ransomware detection while utilizing a strong trap layer to detect zero-day attacks. RansomWall with the Gradient Tree Boosting Algorithm demonstrated a detection rate of 98.25% and an extremely low (almost nil) false-positive rate when tested against 574 samples from 12 cryptographic ransomware families running on the Microsoft Windows operating system. It also detected 30 zero-day attack samples that had a detection rate of less than 10% across 60 VirusTotal security engines. One version of behavioral detection methodologies uses a machine learning baseline model for simulating and forecasting the specific network user behavior pattern at the micro level to identify potential scenarios that could indicate a vulnerability or a true ransomware attack [9].

The goal was to find a simple network system's vulnerability to a ransomware attack. Comparing the outcomes from the simulated network and the log data from the server in the existing network system revealed a realistic model with a correlation above 0.8. This method's drawback was that it only adequately captured the activity of a small percentage of users. Future studies should focus on mimicking user behavior over a large user base using big data analytics tools. A more recent method of behavioral ransomware detection used two parallel classifiers [10].

To distinguish between the several Locky ransomware variants, one technique focused on early detection based on the behavioral analysis of ransomware network traffic, with the aim of preventing ransomware from connecting to command-and-control servers and carrying out damaging payloads. The study employed a dedicated network to collect information and extract important details from network traffic. Using data at the packet and datagram levels, two different (parallel) classifiers were used to analyze the extracted properties of the Locky ransomware family. The results show that the technique is highly successful at detecting ransomware activity on the network while producing a low percentage of false positives. Another approach blocklists the command-and-control (C&C) servers that ransomware uses as its means of communication and conducts behavioral analysis of the ransomware in an IoT environment [7].

A domain-specific strategy for identifying Cryptowall ransomware attacks is provided. The operation obtains the TCP/IP header from the web proxy server, which serves as the TCP/IP traffic gateway. It then retrieves the source and destination IPs and compares them with the IPs of blocklisted command-and-control servers. If the source or destination IP matches a blocklisted server, a ransomware attack targeting Internet of Things devices is identified. However, the model was not evaluated to demonstrate how well it could spot ransomware and its attack vectors in different operating system environments. A very recent behavior-based detection technique that uses access privileges in process memory allows ransomware to be detected quickly and accurately [11,13].
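The comparison of header addresses against a command-and-control blocklist can be sketched in a few lines. This is an illustrative sketch only: the blocklist entries are placeholder addresses from documentation ranges, the function name is hypothetical, and the extraction of headers from a web proxy is assumed to have already happened.

```python
from ipaddress import ip_address

# Hypothetical blocklist of C&C servers; the addresses are placeholders
# taken from ranges reserved for documentation.
C2_BLOCKLIST = {ip_address("203.0.113.10"), ip_address("198.51.100.77")}


def matches_blocklist(src_ip: str, dst_ip: str, blocklist=C2_BLOCKLIST) -> bool:
    """Return True if either endpoint of a connection is a blocklisted server."""
    return ip_address(src_ip) in blocklist or ip_address(dst_ip) in blocklist


# Source and destination IPs as they might be read from TCP/IP headers.
print(matches_blocklist("192.0.2.5", "203.0.113.10"))  # destination is blocklisted
print(matches_blocklist("192.0.2.5", "192.0.2.9"))     # neither endpoint is listed
```

Parsing each string with `ip_address` rather than comparing raw text avoids mismatches between different spellings of the same address. A lookup of this kind can only recognize servers that are already known, which is a limitation of any blocklist-based method.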

It is possible to categorize new ransomware attacks and find malware families that have not yet been recognized by looking at a file or application's access privileges and the area of memory it intends to access. Examining the behavior and ascertaining the purposes of legitimate files and applications before executing them is beneficial. The experimental results obtained with these approaches show good detection accuracy, ranging from 81.38% to 96.28%.

Table 2 summarizes previous studies on machine learning techniques (static and dynamic analysis) for ransomware detection from 2017 to 2022.

**Table 2.** Studies on machine learning techniques (static and dynamic analysis) for ransomware detection from 2017 to 2022.


Several improved machine learning approaches have been applied for accurate and efficient ransomware detection. These methods are meant to address the drawbacks of the current ML-based ransomware-detection tools. One of these advancements regards the challenges detection systems (such as sandbox analysis and pipelines) face in isolating a sample and handling the wait time for isolated ransomware samples to be evaluated [20].

The approach predicts ransomware using a dataset containing 30,000 attributes as independent variables. Five features obtained through feature selection were used in the support vector machine technique. The approach provides a respectable 88.2% accuracy rate in ransomware detection. To reduce the number of false positives, a hybrid technique combines the "guilt by association" hypothesis with content-, metadata-, and behavior-based analysis. Giving the user control over recovery is necessary, and file versioning in cloud storage is used to halt the process. The only duty of the end user is to keep track of the recovery. Users are given classification information so that they can make informed decisions and prevent false positives. The method results in more accurate detection and reliable recovery. An innovative method for detecting network-level ransomware uses machine learning, certificate information, and network connection information [21].

This technique can be used with system-level monitoring to detect ransomware outbreaks early. This method uses connection-, encryption-, and certificate-based network traffic characteristics to extract and model ransomware features. It is a feature model that uses support vector machines, logistic regression, and random forest to distinguish ransomware traffic. According to experimental findings on various datasets, random forest has the best detection rate of 99.9% and the lowest rate of false positives. Another more-effective detection method is a decision tree model based on big data technology that uses Argus for packet preprocessing, combining, and malware file identification [21].

The flow replaced the packet data, resulting in a 1000-fold (1000:1) reduction in data size. Feature selection and concatenation were used to extract and aggregate the attributes of the actual network traffic. In order to improve classification accuracy, the technique made use of six feature selection techniques. Machine learning has recently been creatively applied to monitor Android device power usage as a ransomware-detection technique [13].
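The data reduction obtained by replacing packets with flows can be illustrated by grouping packets on their connection identifiers. The sketch below is not Argus and does not reproduce its output format; the `Packet` record, the five-field flow key, and the synthetic traffic are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Minimal packet record (hypothetical fields)."""
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    size: int  # bytes


def packets_to_flows(packets):
    """Aggregate packets into unidirectional flows keyed by the 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


# Synthetic traffic: two connections carrying 1500 packets in total.
packets = (
    [Packet("10.0.0.2", "10.0.0.9", 50000, 443, "tcp", 100)] * 1000
    + [Packet("10.0.0.3", "10.0.0.9", 50001, 53, "udp", 60)] * 500
)
flows = packets_to_flows(packets)
print(len(packets), "packets ->", len(flows), "flow records")
```

Each flow record keeps only aggregate attributes such as packet and byte counts, so the number of records to classify depends on the number of connections rather than the number of packets.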

The suggested method measures how much energy particular Android processes use to distinguish ransomware from benign programs. To accomplish this, data on the ransomware's unique local energy fingerprint are gathered and analyzed. According to experimental findings, the approach offers high detection and precision rates of 95.6% and 89%, respectively. Additionally, it outperforms k-nearest neighbor, neural network, support vector machine, and random forest in terms of accuracy, recall, precision, and F-measure.

Another option is the portable RanDroid approach for automatically detecting polymorphic ransomware [22]. The RanDroid approach uses both static and dynamic analyses to detect polymorphic ransomware. To detect new ransomware variants on Android devices, the method compares the structural similarity of elements obtained from an application with a collection of threat information from well-known ransomware variants. Image similarity measurements (ISMs) and string similarity measurements (SSMs) are the two similarity measures used. Using linguistic analysis, the app's behavioral attributes and the text strings in its images are mined for additional information. The strategy reduces ransomware threats without changing the Android OS or its underlying security module while addressing the constraints of static analysis. According to an analysis based on 950 malware samples, the methodology can detect ransomware that uses evasive tactics such as complex code or dynamic payloads. According to a related study, a strategy combining static and dynamic analysis can help identify and separate Android ransomware from other malware [16].
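The idea behind a string similarity measurement can be sketched by scoring text extracted from an application against phrases known from earlier ransomware. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure; it is not the SSM defined in the RanDroid study, and the reference phrases are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical threat strings of the kind seen in ransom notes.
KNOWN_RANSOM_STRINGS = [
    "your files have been encrypted",
    "pay the ransom to unlock your device",
]


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match_score(candidate: str, known=KNOWN_RANSOM_STRINGS) -> float:
    """Highest similarity between a candidate string and any known threat string."""
    return max(string_similarity(candidate, k) for k in known)


# Strings as they might be extracted from an application's resources.
for text in ["Your files have been encrypted!", "Welcome to the weather app"]:
    print(f"{text!r}: {best_match_score(text):.2f}")
```

In a setup like this, an application would be flagged when its best score exceeds a chosen threshold, and the threshold would need to be tuned on labeled samples.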

The study examined network-based features, text, and permissions using static analysis, and performed dynamic analysis on the system call, CPU, and memory logs. Experiments using features from malicious and benign samples demonstrate the strategy's effectiveness in reducing evasive ransomware attacks. The study also reports 100% accuracy in classifying and identifying unknown ransomware.

#### **5. Evolution of Ransomware**

Ransomware attacks have been around since the late 1980s, when Joseph Popp released the first instance of ransomware. This attack utilized symmetric-key encryption to take control of victims' hard drives and request a ransom. The flaw in this system was that the same key was used for encryption and decryption, making it vulnerable. As a result, it was possible to analyze the AIDS ransomware (also known as PC Cyborg), find the decryption key, and reverse the malware's encryption. Ransomware attacks have continued to evolve and have become more sophisticated in recent years, making them a significant threat to individuals and organizations [23]. A brief timeline of various potent ransomware attacks is shown in Table 3. The table, an excerpt from a timeline of the most significant ransomware attacks from 2012 to 2023, contains essential information on the evolution of ransomware: the year the ransomware first appeared, its name, and its primary description [2,3,23].


**Table 3.** Brief chronology of major ransomware attacks from 2012 to 2022.

Ransomware has become a popular tool for cybercriminals to extort money from individuals and organizations. As technology advances, preventing such attacks becomes more challenging. It is essential to remain vigilant and take appropriate measures to protect against these threats, such as keeping software up to date and regularly backing up important data [5]. A ransomware attack can be described in six levels, adapted from [29] and shown in Figure 6.



**Figure 6.** Six levels of ransomware attacks [29].
