1. Introduction
The advent of autonomous driving has sparked a technological revolution, promising a future where vehicles navigate complex environments with minimal human intervention. This transformation hinges on the ability of autonomous agents to perceive their surroundings accurately, make informed decisions, and execute precise control maneuvers. Despite significant progress, the development of reliable and safe autonomous driving systems remains a formidable challenge, particularly in dealing with the intricacies of real-world traffic scenarios.
Reinforcement learning has emerged as a pivotal approach in training agents to solve a variety of control tasks. Algorithms such as Deep Q-Networks (DQNs) [1], Proximal Policy Optimization (PPO) [2], and Soft Actor–Critic (SAC) [3] have been instrumental in enabling agents to learn from experience and improve their decision making over time. In the context of autonomous driving, reinforcement learning has been applied to learn optimal control policies through interaction with the environment. For instance, Ref. [4] used DQN to learn steering control for an autonomous vehicle in a simulated environment. Ref. [5] developed a continuous-control deep RL algorithm to learn a deep neural policy for driving a vehicle on a simulated racing track. Ref. [6] developed a hierarchical deep RL framework to handle driving scenarios with intricate decision-making processes, such as navigating traffic lights.
However, traditional reinforcement learning algorithms face two main challenges. (1) They often assume a low-dimensional, structured state space; yet in real-world applications such as robotic vision or autonomous driving, state observations are frequently high-dimensional (e.g., images or videos), complicating the learning of effective strategies. (2) In safety-critical applications, considering safety merely at the trajectory level is insufficient, and traditional methods may fail to ensure safety at each state. Below, we elaborate on current research from these two perspectives and position our work accordingly.
Regarding high-dimensional observations, the high-dimensional and noisy nature of sensor data in autonomous driving poses a significant challenge for traditional RL algorithms. The “curse of dimensionality” often leads to inefficient learning and suboptimal policies. To tackle this, the concept of a latent state space has been introduced, offering a solution by compressing high-dimensional sensory inputs into a lower-dimensional representation. Ref. [7] introduces the Deep Planning Network (PlaNet), a model-based agent that learns latent environment dynamics from pixel observations and performs online planning to solve complex control tasks with partial observability and sparse rewards. The Dreamer [8,9,10] series of algorithms represents a progression of model-based reinforcement learning approaches designed to improve the efficiency and effectiveness of planning in complex, high-dimensional environments. This approach has been successfully applied in RL for autonomous driving, enhancing sample efficiency and improving generalization. For instance, Ref. [11] introduces a model-free deep reinforcement learning framework tailored for urban autonomous driving, employing specialized input representations and visual encoding to capture low-dimensional latent states, demonstrated on complex roundabout navigation. Ref. [12] presents an interpretable autonomous driving framework based on latent deep reinforcement learning, which generates a semantic bird's-eye-view mask to explain the decision-making process in complex urban scenarios. Ref. [13] maps high-dimensional images into implicit affordances using a ResNet-18 encoder pre-trained in a supervised setting.
Safety remains a paramount concern in autonomous driving. Traditional RL algorithms often lack inherent safety guarantees, which is unacceptable for safety-critical applications. To this end, State-Wise Safe Reinforcement Learning (SRL) [14] has been proposed, aiming to ensure that learned policies adhere to safety constraints at every state. These methods integrate safety directly into the learning process, preventing the agent from taking actions that could lead to unsafe states, which is vital for applications where safety violations have severe consequences. Ref. [15] optimizes the merging process of automated vehicles at traffic intersections, ensuring state-wise safety and efficiency by utilizing control barrier functions. Ref. [16] presents a framework that integrates differentiable control barrier functions into a neural network architecture, enabling end-to-end training with guaranteed safety in various driving tasks. Several related studies apply these concepts to autonomous driving control.
Targeting the aforementioned challenges, prior research [17,18] combines a latent state space with state-wise safety to create a reinforcement learning framework capable of handling high-dimensional observational data while ensuring safety at each state. These methods improve the efficiency, safety, and generalization of the underlying algorithms, making them more suitable for complex, safety-critical application scenarios. However, these studies were implemented within the simulated games of Safety Gym [19]. To date, no research in the field of autonomous driving has combined a latent state space with state-wise safety. Therefore, in our study, we introduce a novel framework that bridges this gap by developing a latent space model and a barrier function that encodes state-wise safety constraints for PPO. Our approach combines the benefits of model-based learning with the safety guarantees required for autonomous driving, resulting in a control policy that is not only efficient but also compliant with safety requirements. In contrast to existing works such as Roach PPO [20], which uses custom BEV images as input for PPO-based agents, and the works of Chen et al. [11,12], which introduce model-free latent-state-space and interpretable frameworks for autonomous driving without state-wise safety guarantees, our framework addresses the limitations of existing methods by providing a more comprehensive and proactive approach to safety without compromising the learning efficiency or the performance of the autonomous agent.
Our contributions encompass several aspects. Firstly, we propose a novel framework that effectively integrates latent state space modeling with state-wise safety constraints, addressing the limitations of high-dimensional sensory inputs and ensuring safety at each state. Secondly, our approach enhances the efficiency and generalization capabilities of autonomous driving systems, outperforming existing methods in both driving performance and safety. Thirdly, we conduct extensive experiments in the CARLA simulator, demonstrating the robustness and applicability of our framework in diverse driving scenarios. Lastly, our work contributes to the advancement of safe and reliable autonomous driving systems, paving the way for a future with minimal human intervention and enhanced safety on the roads.
2. Related Work
2.1. State-Wise Safe Reinforcement Learning
State-Wise Safe Reinforcement Learning (SRL) is an advanced paradigm within the field of reinforcement learning that emphasizes the enforcement of safety constraints at every step of the learning process. This is particularly crucial in applications where safety is of paramount importance, such as autonomous vehicle navigation and robotic manipulation.
In SRL, the notion of state-wise safety is pivotal. It requires that, for any given state, the actions taken by the agent do not violate predefined safety criteria. These criteria are often encapsulated by a set of cost functions $c_i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which quantify the safety of state–action transitions. An action $a$ in state $s$ is considered safe if it satisfies the constraints imposed by these cost functions.
Control Barrier Functions (CBFs) are mathematical tools derived from control theory that are used to define a region of the state space where the system is guaranteed to remain safe. The CBF-based methods in SRL leverage these functions to ensure that the agent’s actions maintain the system within a safe operating region.
The CBF-based SRL can be articulated through the following steps:
Safe Region Specification: Define a safe region in the state space using a Lyapunov-like energy function $V(s)$. The safe region is characterized by the level set $\{ s : V(s) \le c \}$, where $c$ is a chosen threshold.
CBF Construction: Construct a CBF $B(s)$ that quantifies the distance to the boundary of the safe region. The CBF should satisfy $B(s) > 0$ for all $s$ within the safe region.
Policy Design: Design a policy that optimizes the expected cumulative reward while ensuring that the CBF remains positive, thus keeping the system within the safe region.
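The three steps above can be sketched on a toy one-dimensional system. Everything here (the quadratic energy function `V`, the threshold `c`, and the class-K function `alpha`) is an illustrative assumption, not the construction used later in the paper:

```python
import numpy as np

# Toy sketch of the three CBF-based SRL steps on a 1D state.
# V, c, and alpha are illustrative placeholders.

c = 1.0                      # level-set threshold for the safe region

def V(s):
    """Lyapunov-like energy function: squared distance from the origin."""
    return float(np.dot(s, s))

def B(s):
    """Barrier function: positive inside the safe region {s : V(s) <= c}."""
    return c - V(s)

def alpha(x):
    """A simple class-K function (strictly increasing, alpha(0) = 0)."""
    return 0.5 * x

def is_safe_transition(s, s_next):
    """Discrete-time CBF decrease condition: B(s') >= B(s) - alpha(B(s))."""
    return B(s_next) >= B(s) - alpha(B(s))

s = np.array([0.5])          # inside the safe region: V = 0.25 <= 1
s_next = np.array([0.6])     # still inside after one step
print(B(s) > 0)              # True: state is in the safe region
print(is_safe_transition(s, s_next))
```

A policy optimizer would then reject (or penalize) any candidate action whose predicted next state fails `is_safe_transition`.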
The barrier function serves as a formal safety certificate associated with a control policy, guaranteeing the state-wise safety of a dynamical system [21,22]. Classical control theory often relaxes the stringent conditions of the barrier function into optimization formulations such as linear programs [23,24] and quadratic programs [25,26].
Recent research has explored the joint learning of control policies and neural barrier functions to optimize state-wise safety constraints in reinforcement learning [27,28,29]. In the context of autonomous driving, ShieldNN [30] leverages CBFs to design a safety-filter neural network, providing safety assurances for environments with a known bicycle dynamics model. Ref. [31] adopts an architecture akin to ShieldNN, employing a safety filter to furnish demonstrations for RL algorithms, consequently enhancing sample efficiency.
However, a major challenge for these approaches is their limited scalability to higher-dimensional systems, particularly those with pixel observations.
2.2. Latent Dynamic Models
Latent dynamic models [32,33,34] represent a class of methods for modeling time-series data, with widespread applications in reinforcement learning (RL). These models are typically employed to capture the relationship between the hidden states of a system and the observed data, a crucial aspect of RL tasks, as they assist agents in understanding the environment and making appropriate decisions.
In reinforcement learning, latent dynamic models are utilized to model the dynamic changes and hidden states of the environment. These models are often based on probabilistic frameworks, employing methods such as Bayesian inference or maximum likelihood estimation to learn model parameters and hidden states [35,36].
Mathematically, latent dynamic models can be represented as follows:

$$s_{t+1} = f(s_t, a_t, \epsilon_t)$$

This equation describes the state transition of the system, where $s_t$ represents the hidden state at time $t$, $a_t$ denotes the action chosen by the agent at time $t$, and $\epsilon_t$ represents environmental noise or randomness. The function $f$ can be deterministic or incorporate some degree of randomness.

$$o_t = g(s_t, \nu_t)$$

This equation describes the observation $o_t$ obtained when the state $s_t$ is observed, where $g$ is the observation function and $\nu_t$ represents observation noise.

$$r_t = r(s_t, a_t)$$

This equation represents the immediate reward obtained by the agent when taking action $a_t$ in state $s_t$.
The objective of latent dynamic models is to infer the hidden state sequence $s_{1:T}$ from observed data $o_{1:T}$, action data $a_{1:T}$, and possible rewards $r_{1:T}$. This facilitates a better understanding of environmental dynamics and enables informed decision making.
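A minimal sketch of these three equations, assuming a hypothetical linear-Gaussian system (the coefficients `A`, `Bm`, `C`, the noise scales, and the feedback policy are all illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian instance of the model:
#   s_{t+1} = f(s_t, a_t, eps_t),  o_t = g(s_t, nu_t),  r_t = r(s_t, a_t)
A, Bm, C = 0.9, 0.1, 2.0        # illustrative dynamics/observation coefficients

def f(s, a, eps):
    return A * s + Bm * a + eps           # state transition with process noise

def g(s, nu):
    return C * s + nu                     # noisy observation of the hidden state

def reward(s, a):
    return -(s ** 2) - 0.01 * (a ** 2)    # quadratic cost expressed as a reward

s, traj = 1.0, []
for t in range(5):
    a = -0.5 * s                               # a simple feedback policy
    o = g(s, rng.normal(scale=0.05))           # what the agent actually sees
    r = reward(s, a)
    s = f(s, a, rng.normal(scale=0.01))        # hidden state advances
    traj.append((o, a, r))

print(len(traj))   # 5 (o_t, a_t, r_t) tuples available for latent-state inference
```

Given only the `(o_t, a_t, r_t)` tuples, the inference task is to recover the hidden sequence `s_{1:T}` that generated them.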
These models are utilized to learn the dynamic model of the environment from observation data, enabling agents to predict the next state or observation. Additionally, they are employed in optimizing policies by integrating environmental dynamics modeling to maximize cumulative rewards [12,37,38]. Furthermore, latent dynamic models help infer the current hidden state of the environment from observation data, thereby enhancing the agent's understanding of the environment and facilitating decision making [39,40]. In conclusion, latent dynamic models play a crucial role in reinforcement learning by assisting agents in comprehending complex environments and making effective decisions.
3. Problem Modeling
In the field of reinforcement learning (RL), an autonomous agent faces the challenge of interacting with an uncertain environment in a sequential manner to optimize a given utility signal. This interaction is typically modeled using a finite-horizon Markov Decision Process (MDP), represented by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$. In this formulation, $\mathcal{S}$ represents a continuous state space, $\mathcal{A}$ denotes the continuous action space, and the environment's transition dynamics are governed by $\mathcal{P}(s_{t+1} \mid s_t, a_t)$, where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$. The observation space $\mathcal{O}$ is derived from the state space and is captured by the agent's camera module. The reward function is defined as $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and $\gamma \in (0, 1)$ represents the discount factor. In this setting, the agent's control policy $\pi$ generates actions based on observed images $o_t \in \mathcal{O}$, i.e., $a_t \sim \pi(\cdot \mid o_t)$, reflecting real-world scenarios where agents must act without direct access to the true state.
A crucial aspect of this MDP framework is the emphasis on state-wise safety, which is acknowledged by the presence of potentially hazardous states within a subset $\mathcal{S}_u \subset \mathcal{S}$. A safety violation occurs when the agent enters a state $s \in \mathcal{S}_u$, which can be detected by a safety mechanism $\mathcal{F}(s) \in \{0, 1\}$. As a result, the objective of state-wise safe RL with pixel observations is to optimize the control policy to maximize the expected cumulative discounted reward, subject to a constraint on the total number of safety violations:

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \mathbb{1}(s_t \in \mathcal{S}_u)\right] \le \delta$$

Here, $\mathbb{1}(s_t \in \mathcal{S}_u)$ serves as an indicator of safety violations, and $\delta$ represents the allotted budget for such violations. In practical safety-critical systems, it is crucial to mitigate safety violations, ideally aiming for $\delta = 0$ during the learning process.
This formulation is similar to the concept of constrained MDPs (CMDPs), where the agent must learn to navigate an environment while minimizing a safety cost, analogous to the scalar cost variable $c$ in CMDPs. The agent's policy is optimized to ensure that the expected sum of these costs remains below a predefined safety threshold $d$, highlighting the dual objectives of maximizing rewards and satisfying safety constraints.
4. Methods
We propose a novel framework for state-wise safe PPO with a latent state in the context of autonomous driving.
Figure 1 depicts the high-level architecture, encompassing latent state modeling, barrier function learning, and policy optimization.
To mitigate the challenges associated with high-dimensional BEV (bird's-eye-view) input, we introduce a latent space representation. This is achieved by compressing the BEV image into a low-dimensional latent vector using a Variational Autoencoder (VAE)-like approach. We further learn latent dynamics within this space, enabling the model to capture temporal dependencies; this process is shown in Figure 2. By leveraging learned dynamics, this approach can effectively handle the complexities of the driving environment, including non-smooth contacts and rich interaction dynamics. Crucially, a latent safety predictor is incorporated within the framework to identify unsafe regions in the latent space. As shown in Figure 2a, this learned safety information is then utilized to construct an MDP-like latent model, functioning as a generative model for producing synthetic training data. Consequently, the approach operates in a model-based fashion, minimizing interactions with the real environment and reducing the risk of safety violations during training.
Building upon the foundation of the latent dynamics model, a latent barrier-like function is introduced. This function encodes state-wise safety constraints within the latent space. Safety labels are generated from the learned latent safety predictor and used to train the barrier function on synthetic data. Notably, the training gradients from the barrier function can propagate back to the control policy, encouraging it to select safer actions. Concurrently, policy optimization is performed to maximize the total expected return in a model-based manner. Algorithm 1 summarizes the overall framework. Subsequent sections will delve deeper into each component.
Algorithm 1 State-Wise Safe PPO with Latent Dynamics
Require: Initial policy $\pi_\theta$, generation horizon $H$, action repeat $R$, collect interval $C$, batch size $B$, chunk length $L$, total episodes $E$, episode length $T$
Ensure: Policy $\pi_\theta$ with barrier-like function $B_\psi$ and the latent model with parameters $\phi$
Initialize dataset $\mathcal{D}$ with random seed episodes
Initialize models with parameters $\phi$, $\theta$, and $\psi$
for epoch $= 1, \dots, E$ do
    for update step $= 1, \dots, C$ do
        Sample a batch of $B$ sequence chunks of length $L$ from $\mathcal{D}$
        Train the latent model and calculate its loss from Equation (5)
        Update $\phi$ // Update the latent model
        Generate trajectories of horizon $H$ using the current policy in latent space
        Compute the barrier loss from Equation (9) and the policy objective from Equation (12)
        Update $\psi$
        Update $\theta$
    end for
    for $t = 1, \dots, T$ do
        Compute the latent state and action from the latent model and $\pi_\theta$; add exploration noise to the action
        env.step(action), with action repeat $R$
    end for
    Add the new trajectory to $\mathcal{D}$
end for
4.1. Input Representation
In autonomous driving systems, a BEV semantic mask is leveraged to provide a comprehensive overview of road conditions and nearby objects. Following similar approaches [11,12,20], we employ a mask structured as a 64 × 64 × 3 tensor, as visualized in Figure 3. The semantic mask encompasses five key elements:
Map: The tensor encompasses a map segment that portrays the road network’s layout. Drivable areas and lane markings are rendered in the map.
Routing: Information regarding the planned path, composed of waypoints determined by a route planner, is integrated into the mask. This information is represented as a bold blue line, guiding the autonomous vehicle along its designated route.
Detected Objects: The tensor includes bounding boxes representing detected surrounding road participants. These participants can include vehicles, bicycles, and pedestrians.
Ego State: The position and orientation of the autonomous vehicle itself are indicated by a red box within the tensor.
Traffic Control: Components that inform the vehicle of traffic rules, like traffic lights and stop signs, are depicted with varying levels of brightness to signal their status. Active stop signs and red traffic lights are shown with the brightest colors for visibility. Yellow lights use a medium brightness level, and green lights are displayed with the darkest shade.
This semantic mask simplifies the high-dimensional raw sensor data, distilling it into a format that retains the essential information required for the vehicle’s navigation and decision-making processes. By transforming complex visual and spatial data into this bird’s-eye view tensor, the system can efficiently process and act upon the information necessary for safe and effective autonomous driving.
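As a rough illustration, a mask of this shape can be assembled channel by channel. The colors, positions, and box sizes below are placeholders, not the exact rendering scheme used in the paper:

```python
import numpy as np

# Hypothetical 64x64x3 BEV semantic mask assembled from the five elements.
# All positions and intensities are illustrative placeholders.
mask = np.zeros((64, 64, 3), dtype=np.uint8)

mask[20:44, :, :] = 90                 # map: drivable road rendered in gray
mask[30:34, :, 2] = 255                # routing: bold blue line along the route
mask[22:28, 40:48, 1] = 200            # detected object: a vehicle bounding box
mask[30:34, 28:36, 0] = 255            # ego state: red box for the ego vehicle
mask[20:24, 56:60, :] = 255            # traffic control: active light, brightest

print(mask.shape)                      # (64, 64, 3)
```

The resulting tensor is what the encoder described in the next section compresses into the latent state.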
4.2. Latent State Space with Latent Dynamics
In order to address the complexity of high-dimensional input data and mitigate the risk of overfitting during the reinforcement learning process, our methodology employs a dimensionality reduction strategy. We leverage a Variational Autoencoder (VAE) architecture to derive a low-dimensional latent representation of the environment. The VAE comprises an encoder network $q_\phi(z_t \mid o_t)$, which compresses high-dimensional pixel observations $o_t$ into a low-dimensional latent space $z_t \in \mathcal{Z}$, and a decoder network $p_\phi(o_t \mid z_t)$, which aims to reconstruct observations from the latent space. The encoder and decoder are parametrized by $\phi$, which is optimized to minimize a loss function composed of several components:

$$\mathcal{L}(\phi) = \mathbb{E}\Big[\mathrm{KL}\big(q_\phi(z_t \mid o_t) \,\|\, p_\phi(z_t \mid z_{t-1}, a_{t-1})\big) + \|o_t - \hat{o}_t\|^2 + (r_t - \hat{r}_t)^2 + (f_t - \hat{f}_t)^2\Big]$$

In the above formulation, the Kullback–Leibler (KL) divergence quantifies the discrepancy between the probability distribution of the latent states predicted by the transition model and the distribution of the states compressed from the real observations $o_t$. This component is essential for refining our transition model. The Mean Squared Error (MSE) terms are utilized to train the reward predictor $\hat{r}$ and safety predictor $\hat{f}$, as well as to capture the fidelity of the observation reconstruction $\hat{o}_t$ from the latent space.
The encoder and decoder are Convolutional Neural Networks (CNNs) and transposed CNNs, respectively, while the reward and safety predictors are modeled using Deep Neural Networks (DNNs). The transition model $p_\phi(z_{t+1} \mid z_t, a_t)$ is designed as a Recurrent Neural Network (RNN), enabling the capture of temporal dependencies and dynamics. Our latent model structure builds upon the Recurrent State-Space Model (RSSM) proposed by [7], but it introduces a novel interpretation of the latent state space and incorporates a safety predictor.
Our approach transforms the environment's Markov Decision Process (MDP) into an MDP-like latent model characterized by a low-dimensional latent space, as illustrated in Figure 1. The latent space $\mathcal{Z}$ is designed to reflect the dynamics of the environment in a compressed manner while also integrating reward and safety signals. We posit the existence of a safety detector $\mathcal{F}$, which may be an amalgamation of various sensor modalities. The purpose of the latent safety predictor $\hat{f}$ is to estimate this detector's outputs such that potential safety violations within the latent space can be identified.
The transition model allows us to emulate the environment's dynamics within the latent space, providing Gaussian predictions for subsequent latent states based on current states and actions, that is, $z_{t+1} \sim \mathcal{N}\big(\mu_\phi(z_t, a_t), \sigma_\phi(z_t, a_t)\big)$. This latent model, which retains the same control policy as the actual MDP of the environment, can function as a generative model to synthesize data for training the control policy. By sampling latent trajectory data $(z_t, a_t, \hat{r}_t, \hat{f}_t)$, the interaction with the real environment during training is minimized, thereby reducing exposure to unsafe conditions.
The training of the latent model utilizes trajectory chunks of time length T from the real environment MDP’s data buffer, adopting the loss function defined earlier. In doing so, the model learns to navigate transitions that may not be smooth, thereby addressing the challenges posed by the complex raw sensor data.
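The loss described above can be sketched numerically. This is a simplified stand-in (a diagonal-Gaussian KL term plus MSE terms), not the actual network training code; all shapes, names, and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def model_loss(obs, obs_hat, r, r_hat, f, f_hat, post, prior):
    """Sketch of the combined objective: KL + reconstruction/reward/safety MSE."""
    kl = kl_diag_gauss(*post, *prior)            # encoder posterior vs. prior
    recon = np.mean((obs - obs_hat) ** 2)        # observation reconstruction
    rew = np.mean((r - r_hat) ** 2)              # reward predictor term
    safe = np.mean((f - f_hat) ** 2)             # safety predictor term
    return kl + recon + rew + safe

obs = rng.normal(size=(64, 64, 3)); obs_hat = obs + 0.01
post = (np.zeros(8), np.zeros(8))                # (mean, log-variance)
prior = (np.zeros(8) + 0.1, np.zeros(8))
loss = model_loss(obs, obs_hat, np.array([1.0]), np.array([0.9]),
                  np.array([0.0]), np.array([0.1]), post, prior)
print(loss > 0)   # True: every term is non-negative and the means differ
```

In practice, each term would be computed over sampled trajectory chunks and backpropagated through the encoder, decoder, predictors, and transition model jointly.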
4.3. Control Barrier Functions as Safety Certificates
In the context of machine learning and the formulation of a safe latent state space, we establish a barrier-like function that serves as a demarcation between safe and unsafe states, which is guided by a safety predictor delineated within the latent model.
For a given policy $\pi_\theta$, the barrier-like function $B_\psi$ is defined in the latent state space, emphasizing the following safety conditions:
Safety Condition for Safe States: For all safe states $z \in \mathcal{Z}_s$, the barrier function $B_\psi(z)$ must exceed a positive threshold $\eta$, denoting the state's location within the safe region:

$$B_\psi(z) \ge \eta, \quad \forall z \in \mathcal{Z}_s$$

Continuity Condition for State Transitions: The model mandates that the barrier function's value remain positive while transitioning from a current safe state $z_t$ to the next expected state $z_{t+1}$. This continuity condition ensures that the barrier function's value for the expected next state is no smaller than that of the current state, adjusted by a term involving $\alpha(\cdot)$, which is a positive class-$\mathcal{K}$ function:

$$\mathbb{E}_{z_{t+1}}\big[B_\psi(z_{t+1})\big] \ge B_\psi(z_t) - \alpha\big(B_\psi(z_t)\big), \quad \forall z_t \in \mathcal{Z}_s$$

Safety Condition for Unsafe States: In contrast, for all unsafe states $z \in \mathcal{Z}_u$, the barrier function $B_\psi(z)$ should fall below the negative threshold $-\eta$, signifying that the state is within the unsafe region:

$$B_\psi(z) \le -\eta, \quad \forall z \in \mathcal{Z}_u$$
Here, the safe and unsafe latent states are elements of the subsets $\mathcal{Z}_s$ and $\mathcal{Z}_u$ of the latent space $\mathcal{Z}$, respectively. The transition model governs the distribution of $z_{t+1}$. The parameters $\psi$ and $\theta$ encapsulate the characteristics of the barrier-like function and the policy network, reinforcing the safety framework within the latent space.
Within this framework, safe and unsafe latent states are distinguished by the learned safety predictor $\hat{f}$ such that $\hat{f}(z) = 0$ for $z \in \mathcal{Z}_s$ and $\hat{f}(z) = 1$ otherwise. The essence of the latent barrier-like function is to provide a state-wise safety measure, ensuring that an agent remains in the safe subset by maintaining a positive barrier function value, as approximated by the second condition.
However, partial observability can lead to instances where the agent inadvertently enters unsafe regions. Partial observability in autonomous driving stems primarily from limitations in sensory technologies and environmental complexities that obstruct complete data acquisition. Sensors might fail to detect hidden or obscured hazards due to angle, distance, or adverse weather conditions. Furthermore, the unpredictable dynamics of road traffic, such as sudden stops or unexpected pedestrian movements, can go undetected until they pose immediate risks. These gaps in sensory information mean that the vehicle’s decision-making algorithms may not have access to all necessary data to make safe choices. Consequently, the system might make decisions based on incomplete or outdated information, thus inadvertently steering the vehicle into scenarios that increase the likelihood of accidents or safety breaches. This highlights the critical need for robust models that can infer the full scope of the environment from partial inputs and predict potential hazards with high accuracy. The barrier-like function aids in guiding the agent back to safety by leveraging the difference in the consecutive state function values, which is in contrast to the cumulative cost minimization focus of CMDP approaches.
Our study incorporates a stochastic component into the mean of the transition model's distribution, which is a common practice in model-based approaches. Our experimental results indicate that neglecting this randomness degrades both reconstruction quality and policy performance, since it leaves a purely deterministic path from the encoder output to the decoder input.
To operationalize the latent barrier-like function, we employ a dense neural network and derive a loss vector inspired by prior work [17,41]:

$$\mathcal{L}_{\text{barrier}}(\psi) = \big[\, \ell_s,\; \ell_d,\; \ell_u \,\big], \qquad
\ell_s = \mathbb{E}_{z \in \mathcal{Z}_s}\big[\max\big(0,\, \eta - B_\psi(z)\big)\big], \qquad
\ell_d = \mathbb{E}_{z_t \in \mathcal{Z}_s}\big[\max\big(0,\, B_\psi(z_t) - \alpha(B_\psi(z_t)) - \mathbb{E}[B_\psi(z_{t+1})]\big)\big], \qquad
\ell_u = \mathbb{E}_{z \in \mathcal{Z}_u}\big[\max\big(0,\, \eta + B_\psi(z)\big)\big]$$

This loss vector penalizes non-positive barrier values at safe states, enforces the positivity of the time-derivative condition, and penalizes non-negative barrier values at unsafe states. A small positive learning margin $\eta$ is introduced to facilitate optimization. Due to the inherent partial observability of the system, the formulation can only promote forward invariance without absolute guarantees.
In this formulation, $\ell_s$ is the evaluation function for safe states, assessing the safety of state $z$. $\ell_d$ is the positive evaluation of the time-derivative condition, reflecting whether the transition from state $z_t$ to $z_{t+1}$ adheres to the requirements for maintaining state safety; it explicitly incorporates the class-$\mathcal{K}$ function $\alpha(\cdot)$ within the derivative evaluation. $\ell_u$ is the evaluation function for unsafe states, measuring the degree of unsafety of state $z$.
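A hinge-style implementation of the three penalty terms might look as follows. The margin `eta` and class-K function `alpha` are assumed values, and batching is simplified to plain arrays of barrier outputs:

```python
import numpy as np

eta = 0.1                    # assumed learning margin

def alpha(x):
    return 0.5 * x           # illustrative class-K function for the descent term

def barrier_loss(B_safe, B_unsafe, B_t, B_next):
    """Hinge penalties for the three barrier conditions.

    B_safe:   barrier values at labeled-safe latent states
    B_unsafe: barrier values at labeled-unsafe latent states
    B_t, B_next: barrier values at consecutive states along safe transitions
    """
    l_safe = np.mean(np.maximum(0.0, eta - B_safe))               # want B >= eta
    l_desc = np.mean(np.maximum(0.0, B_t - alpha(B_t) - B_next))  # descent cond.
    l_unsafe = np.mean(np.maximum(0.0, eta + B_unsafe))           # want B <= -eta
    return l_safe + l_desc + l_unsafe

loss = barrier_loss(B_safe=np.array([0.5, 0.2]),
                    B_unsafe=np.array([-0.4, -0.3]),
                    B_t=np.array([0.5]), B_next=np.array([0.45]))
print(loss == 0.0)   # all three conditions satisfied with margin eta = 0.1
```

Because each term is a hinge on the barrier network's outputs, gradients flow through $B_\psi$ (and, via the predicted next state, back to the policy), as described above.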
4.4. State-Wise Safe PPO
To optimize the total reward while respecting state-wise safety, we formulate an actor–critic approach with barrier-like function learning in the loop, trained within the latent model using trajectories generated by that model. With the encoder network embedded inside, the policy network $\pi_\theta(a_t \mid z_t)$ (or, equivalently, $\pi_\theta(a_t \mid o_t)$) outputs actions as a Gaussian distribution, which is sampled randomly during training and provides the mean action value for evaluation.
The value (critic) function of RL can be expressed as

$$V^{\pi}(z_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, z_t\right]$$

where $V^{\pi}(z_t)$ represents the expected cumulative discounted reward starting from state $z_t$ and following policy $\pi$. The discount factor $\gamma$ is used to balance the importance of immediate and future rewards.
In the Proximal Policy Optimization (PPO) algorithm, the optimization objective is modified to incorporate a clipped surrogate objective, which can be expressed as shown below:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\hat{A}_t,\; \mathrm{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\big)\Big]$$

where $\rho_t(\theta) = \pi_\theta(a_t \mid z_t) / \pi_{\theta_{\text{old}}}(a_t \mid z_t)$ is the probability ratio between the current policy $\pi_\theta$ and the previous policy $\pi_{\theta_{\text{old}}}$. The advantage function $\hat{A}_t$ estimates how much better the current action is compared to the average action in state $z_t$. The clip function limits the ratio within the range $[1-\epsilon, 1+\epsilon]$ to prevent excessive policy updates.
Combining the clipped surrogate objective with the barrier loss function as a regularization term for safety, the overall optimization objective becomes

$$\mathcal{L}(\theta) = \mathcal{L}^{\text{CLIP}}(\theta) - \beta\, \mathcal{L}_{\text{barrier}}$$

where $\beta$ is a hyperparameter controlling the trade-off between reward optimization and safety-constraint satisfaction, and $\mathcal{L}_{\text{barrier}}$ is the barrier loss function that penalizes unsafe actions.
The critic network $V$ is optimized by minimizing the squared error between the estimated value and the target value obtained through Monte Carlo estimation:

$$\mathcal{L}_{\text{critic}} = \mathbb{E}_t\Big[\big(V(z_t) - \hat{V}_t\big)^2\Big]$$

where $\hat{V}_t$ is the target value estimated using the discounted sum of rewards and the value of the state at the end of the trajectory.
To estimate the advantage function, we use Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(z_{t+1}) - V(z_t)$$

where $\delta_t$ is the temporal-difference error, and $\lambda$ is a hyperparameter controlling the trade-off between bias and variance in the advantage estimation.
The optimization process involves sampling a batch of trajectories using the current policy, estimating the advantage function, optimizing the critic network by minimizing its squared-error loss, and optimizing the actor network by maximizing the combined clipped-surrogate and barrier objective using stochastic gradient ascent. This process is repeated until convergence, allowing the agent to learn a policy that maximizes the expected cumulative discounted reward while satisfying safety constraints.
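The update described above can be sketched as follows. The hyperparameters and trajectory values are illustrative, and the policy and critic networks are abstracted into precomputed log-probabilities and value estimates:

```python
import numpy as np

gamma, lam, clip_eps, beta = 0.99, 0.95, 0.2, 1.0   # assumed hyperparameters

def gae(rewards, values, last_value):
    """Generalized Advantage Estimation over one trajectory (backward pass)."""
    adv, next_val, running = [], last_value, 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_val - v            # temporal-difference error
        running = delta + gamma * lam * running     # discounted running sum
        adv.append(running)
        next_val = v
    return np.array(adv[::-1])

def ppo_objective(logp_new, logp_old, adv, barrier_loss):
    """Clipped surrogate minus the barrier regularizer (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)                       # probability ratio
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = np.mean(np.minimum(ratio * adv, clipped * adv))
    return surrogate - beta * barrier_loss

rewards = [1.0, 0.5, 0.2]
values = [0.8, 0.6, 0.3]
adv = gae(rewards, values, last_value=0.0)
obj = ppo_objective(np.log([0.4, 0.5, 0.6]), np.log([0.5, 0.5, 0.5]),
                    adv, barrier_loss=0.0)
print(adv.shape)   # (3,)
```

In the full framework, `barrier_loss` would come from the latent barrier network and `obj` would be ascended with stochastic gradients over latent-model rollouts.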
6. Results
To comprehensively evaluate the performance of our autonomous driving system, we performed extensive comparisons with existing methods, including PPO [2], SAC [3], and Roach PPO [20]. PPO and SAC are well-established deep reinforcement learning algorithms. Roach PPO builds upon PPO by representing the action space with a beta distribution and introducing a novel exploration loss to enhance sample efficiency and policy quality, achieving promising results on the CARLA benchmark. This algorithm serves as the primary baseline for our experiments, and performance on autonomous driving tasks is evaluated through comparative analysis against it.
We introduce Latent SW-PPO, which is a novel algorithm that integrates a latent model with a state-wise safe Proximal Policy Optimization (PPO) algorithm. To assess the independent contributions of each component, we conducted ablation studies. These studies employed two variants: Latent PPO, which utilizes only the latent model without the SRL constraint satisfaction function, effectively reducing the reinforcement learning component to a standard PPO algorithm; and SW-PPO, which utilizes the state-wise reinforcement learning CBF but omits the latent model, resulting in a conventional safe reinforcement learning algorithm. Given the shared foundation of reinforcement learning for all methods, we first evaluated their reward and cost trajectories during the training process. Subsequently, we assessed their performance in specific autonomous driving scenarios. This two-pronged approach facilitates a comprehensive and holistic evaluation of the algorithms.
6.1. Evaluating Reward and Safety
In conventional reinforcement learning, the performance of an algorithm is typically evaluated based on its reward. The reward function defines the immediate feedback that an agent receives for taking a specific action in a given state. The agent’s goal is to learn a policy that maximizes its cumulative reward over time. While this approach works well in many problems, it may be insufficient in scenarios involving safety-critical operations where relying solely on reward functions can be risky.
In safe reinforcement learning, the notion of cost becomes crucial. By introducing a cost function, safe reinforcement learning emphasizes not only maximizing reward but also maintaining safety during exploration and exploitation. This dual-objective optimization framework allows algorithms to pursue performance while ensuring the safety of their behavior, which is critical for safety-critical applications. In this way, safe reinforcement learning enables the development of intelligent systems that are both efficient and safe.
In safe reinforcement learning, AverageEpCost (Average Episode Cost) and CostRate [19] are two key metrics used to evaluate the performance of algorithms on safety-critical exploration problems.
AverageEpCost measures the average accumulated cost over a single episode, quantifying the agent’s total cost of interacting with unsafe elements during a complete episode. Assuming that each time step $t$ of an episode incurs a cost signal $c_t$ from the agent’s interaction with unsafe elements in the environment, AverageEpCost can be calculated using the following formula:
$$\mathrm{AverageEpCost} = \sum_{t=1}^{T} c_t,$$
where $T$ is the total number of time steps in the episode.
CostRate represents the average cost per time step during the agent’s training process, providing a more fine-grained measure of the safety of the agent’s behavior during training. CostRate is calculated by accumulating the cost signals from each time step in all episodes and then dividing by the total number of environment interaction steps. If there are $N$ episodes, each with $T_i$ time steps, CostRate can be calculated using the following formula:
$$\mathrm{CostRate} = \frac{1}{I} \sum_{i=1}^{N} \sum_{t=1}^{T_i} c_{i,t},$$
where $N$ is the number of episodes, $T_i$ is the number of time steps in the $i$-th episode, $c_{i,t}$ is the cost at step $t$ of episode $i$, and $I = \sum_{i=1}^{N} T_i$ is the total number of interaction steps across all episodes.
AverageEpCost emphasizes the model’s ability to interact safely in a single trial, while CostRate provides a more comprehensive view of the model’s safety and robustness over the long-term learning process. These metrics provide important tools for evaluating and comparing the safety of different algorithms.
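As a concrete illustration, both metrics can be computed directly from logged per-step cost signals. The sketch below is our own minimal example; the function names are illustrative and not taken from any specific safe-RL library:

```python
def average_ep_cost(episode_costs):
    """Mean accumulated cost per episode: (1/N) * sum_i sum_t c_{i,t}."""
    return sum(sum(ep) for ep in episode_costs) / len(episode_costs)

def cost_rate(episode_costs):
    """Total accumulated cost divided by the total number of
    interaction steps I = sum_i T_i."""
    total_steps = sum(len(ep) for ep in episode_costs)
    return sum(sum(ep) for ep in episode_costs) / total_steps

# Two episodes with binary cost signals (1 = unsafe interaction at that step)
episodes = [[0, 1, 0, 1], [1, 0, 0, 0, 0]]
print(average_ep_cost(episodes))  # (2 + 1) / 2 = 1.5
print(cost_rate(episodes))        # 3 / 9 ≈ 0.333
```

With binary cost signals, as above, CostRate is simply the fraction of all environment steps on which an unsafe interaction occurred.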
6.1.1. Evaluating Reward
As shown in Figure 6a, the Latent PPO algorithm achieves the highest reward among all algorithms, and its slope indicates a rapid convergence speed. Latent SW-PPO also converges relatively fast despite not achieving the highest reward. Both algorithms begin to converge at approximately 10 M steps, while Roach and SW-PPO require nearly 20 M steps to converge.
This demonstrates that by adopting a latent state space and generating synthetic data in the latent space, the algorithm can be trained without direct interaction with the real environment. This approach reduces the time and resources required for environmental interaction, thereby improving sample efficiency [11,12].
A comparison of the rewards of Latent PPO and PPO reveals that PPO converges prematurely to a low reward value, as does SAC. This suggests that conventional reinforcement learning algorithms still have limited capabilities in handling high-dimensional observations.
The dimensionality reduction performed by the latent state space helps to alleviate the computational burden of directly processing the original high-dimensional data while preserving sufficient information for effective decision making. Dimensionality reduction can improve the learning efficiency of the algorithm by reducing the amount of data that needs to be processed, and it can also potentially help to avoid the “curse of dimensionality”.
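To make this concrete, the sketch below shows, in a deliberately simplified form, how a VAE-style encoder maps a high-dimensional observation to a low-dimensional latent state via the reparameterization trick. The linear weights and the dimensions are illustrative stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, LATENT_DIM = 64 * 64, 16  # hypothetical sizes for illustration
W_mu = rng.normal(0.0, 0.01, (LATENT_DIM, OBS_DIM))
W_logvar = rng.normal(0.0, 0.01, (LATENT_DIM, OBS_DIM))

def encode(obs):
    # Linear stand-in for the encoder network: predict the mean and
    # log-variance of the latent distribution, then sample z with the
    # reparameterization trick so gradients could flow through mu/logvar.
    mu = W_mu @ obs
    logvar = W_logvar @ obs
    return mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT_DIM)

obs = rng.random(OBS_DIM)  # stand-in for a flattened BEV image
z = encode(obs)
print(z.shape)  # (16,)
```

The policy then operates on the 16-dimensional latent state `z` instead of the 4096-dimensional observation, which is the dimensionality reduction discussed above.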
We observed that the rewards obtained by the safe reinforcement learning algorithms SW-PPO and Latent SW-PPO are not as high as those of the traditional reinforcement learning algorithms Latent PPO and Roach. This is because conventional reinforcement learning algorithms can achieve high rewards by taking unsafe actions, which also incur higher costs. In contrast, safe reinforcement learning algorithms obtain lower rewards but keep the cost within the desired range [19,46]. We discuss the cost in more detail in the next subsection.
6.1.2. Evaluating Safety
Figure 6b,c shows the AverageEpCost and CostRate of our approach and the other algorithms. In principle, model-free reinforcement learning algorithms incur more safety violations. As the figures show, the SW-PPO and Latent SW-PPO algorithms keep both AverageEpCost and CostRate at relatively low levels. These results affirm the advantages of our latent barrier-like function learning for encoding state-wise safety constraints. Additionally, Latent SW-PPO converges faster than SW-PPO because, during training, our latent model quickly identifies and captures the majority of unsafe latent states through supervised learning. With more interactions, the latent barrier-like encoding of hard state-wise safety constraints progressively forces the agent to take safer actions, leading to a lower cost return.
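The state-wise mechanism can be illustrated with a toy, discrete-time CBF-style check. This is our own simplified sketch, not the paper's learned latent barrier: the barrier function `h` is hand-crafted and the decay rate `ALPHA` is an assumed constant:

```python
import math

ALPHA = 0.5  # assumed decay rate of the barrier condition, 0 < ALPHA <= 1

def h(z):
    # Hypothetical barrier-like function: positive when the 2D latent
    # state z lies outside a unit "unsafe ball" around the origin.
    return math.hypot(*z) - 1.0

def action_is_safe(z, z_next):
    # Discrete-time CBF-style condition: the barrier value may shrink by
    # at most a factor of (1 - ALPHA) per step, so h never crosses below
    # zero from a safe state.
    return h(z_next) >= (1.0 - ALPHA) * h(z)

z = (2.0, 0.0)                        # h(z) = 1.0 -> currently safe
print(action_is_safe(z, (1.8, 0.0)))  # h = 0.8 >= 0.5 -> True
print(action_is_safe(z, (1.2, 0.0)))  # h = 0.2 <  0.5 -> False
```

Screening each candidate action's predicted next latent state with such a condition is what enforces safety at every state, rather than only in expectation over a trajectory.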
In autonomous driving, safety violations are inevitable. Considering vehicle dynamics and the unpredictability of the future environment, achieving zero violations is fundamentally a difficult problem to solve. Additionally, due to learning errors, our latent model may not always accurately distinguish between safe and unsafe images, which can lead to safety violations.
6.2. Driving Performance
In this section, we compare the driving performance of our proposed algorithm against several baseline methods and then conduct ablation studies to analyze the significance of the various components of our approach.
6.2.1. Metrics
The following outlines the metrics employed to evaluate the driving behavior of each agent in the Carla Leaderboard [47], a public leaderboard that ranks agents based on their performance in the Carla simulation environment. These metrics collectively provide a comprehensive understanding of an agent’s performance across various aspects of autonomous driving, including safety, efficiency, comfort, and compliance with traffic rules.
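As a rough illustration of how these metrics combine, the leaderboard-style driving score for a single route can be sketched as the completed route fraction scaled by a multiplicative infraction penalty. The numeric values below are illustrative only; the official leaderboard defines the exact penalty coefficients per infraction type:

```python
def driving_score(route_completion, infraction_penalty):
    # Leaderboard-style score for one route: the completed fraction of the
    # route (in [0, 1]) scaled by the multiplicative infraction penalty
    # (1.0 means no infractions occurred).
    return route_completion * infraction_penalty

# Illustrative: 90% of the route completed, one infraction with penalty 0.6
print(round(driving_score(0.90, 0.60), 2))  # 0.54
```

This multiplicative structure explains why an agent can have a high route completion but still a low driving score when it accumulates infractions.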
6.2.2. Performance and Ablation
We first trained all models in Town03 and Town04 and then evaluated them in Town01, Town02, Town05, and Town06. In each town, we generated 50 episodes with an average route length of 1.5 km, close to the 1.7 km average route length of the official leaderboard. In each episode, the scene, driving route, and surrounding vehicles were randomly generated to ensure the complexity and diversity of the evaluation, making the test data comparable in diversity and richness to the leaderboard and to other algorithms. We recorded the performance of all models on the entire test set and in each town.
Table 1 shows the performance of all models in the test environment.
As can be seen from Table 1, Latent SW-PPO achieved the best results. It obtained the highest driving score (DS), 60% higher than that of the Roach PPO algorithm, as well as the best route completion (RC) and infraction score (IS) among all models. In terms of safety, both SW-PPO and Latent SW-PPO achieved higher IS scores than the other algorithms. This confirms the advantages of our latent barrier-like function learning for encoding state-wise safety constraints over the CMDP formulation used in the baselines, and it is consistent with the cost trends observed during the reinforcement learning process.
Compared with SW-PPO, Latent SW-PPO uses a latent state space, which significantly improves performance: RC improved by 35% and IS by 21%. This indicates that modeling latent dynamics enhances the understanding and prediction of complex environmental dynamics while improving data efficiency and generalization, allowing the model to learn more effectively from the same data.
The Roach PPO algorithm also achieved good results, even surpassing SW-PPO in route completion (RC); however, its infraction score (IS) was lower, and the Latent PPO algorithm exhibited a similar problem. This indicates that without safety constraints, Roach PPO and Latent PPO may tend to violate rules in order to complete the driving task, which is consistent with the earlier observation that reinforcement learning tends to take unsafe actions in pursuit of higher rewards. In our experiments, we also found that in extremely congested traffic scenes, the Roach PPO agent collides with vehicles or pedestrians in order to complete the task, whereas SW-PPO and Latent SW-PPO choose to wait out the congestion, which can cause the episode to end before the route is completed. This is why SW-PPO’s route completion is lower than Roach PPO’s.
Experiments in the CARLA simulator demonstrate the importance of latent dynamics and safety constraints in reinforcement learning-based driving systems. The introduction of latent dynamics modeling and safety-constraint mechanisms enables autonomous driving algorithms to learn safer and more intelligent strategies in complex environments.
6.2.3. Evaluating Generalization
In this section, we analyze the generalization ability of our proposed model and compare its performance with other algorithms under different scenarios. We use three main metrics for evaluation: Driving Score (DS), Route Complete (RC), and Infraction Score (IS). We exclude SAC and PPO due to their poor performance and focus on comparing our algorithm with Roach PPO.
Table 2 shows the performance of different models on the training set (Town03 and Town04), and Table 3 shows their performance on the test set (Town01, Town02, Town05, and Town06). Town03 is a city-block scene with more complex road features, such as roundabouts and overpasses, while Town04 is a small town with a relatively simple road structure. Accordingly, all models achieve lower Driving Scores in Town03 than in Town04, indicating that scene complexity affects model performance.
Table 1 shows the average results of all models on the test set. Compared to Table 2, the Driving Scores of all models decrease on the test set, indicating that the models still have some difficulty adapting to new scenes.
From the insights presented in Table 3, it is evident that the complexity of the environment plays a crucial role in the generalization performance of autonomous driving models. In simpler settings, such as Town01 and Town02, the Driving Scores do not decline as sharply, with models like Latent PPO and Latent SW-PPO even demonstrating improved performance in Town02. However, in more complex scenarios, such as Town05 and Town06, which feature highways, expansive multi-lane roads, and intricate intersections, the Driving Scores drop markedly, particularly in Town05. These trends are consistent with the broader observation in deep learning that generalization ability is strongly influenced by environmental complexity.
We analyze the generalization ability of the different models in new environments, emphasizing the impact of scene complexity. The Latent SW-PPO model shows only a minor decline in Driving Score, from 51.62 to 48.22 (a 6.6% drop), when transitioning from the training set to the test set. In contrast, Roach PPO’s score falls from 34.40 to 24.54, a substantial 28.7% reduction.
The divergence in performance is more pronounced in complex scenarios, such as Town05. Here, Latent SW-PPO’s score drops by 13%, while Roach PPO’s plummets by 42.3%.
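The per-scenario declines quoted above are plain relative reductions. For clarity, they can be computed as follows (the example scores are hypothetical):

```python
def relative_drop(train_score, test_score):
    # Percentage decline in Driving Score when moving from the training
    # towns to previously unseen towns.
    return 100.0 * (train_score - test_score) / train_score

# Hypothetical scores: a model scoring 50.0 in training and 45.0 at test
print(relative_drop(50.0, 45.0))  # 10.0
```

A smaller relative drop indicates a model whose policy transfers better to unseen towns.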
Figure 7 illustrates the Driving Score drop, showing that the models with a latent state space, Latent PPO and Latent SW-PPO, experience a much smaller reduction than SW-PPO and Roach PPO.
The employment of latent state space in these models contributes to their improved generalization, which is evidenced by their robust performance against variations in input data. This capability enhances the models’ predictive and decision-making abilities in unfamiliar environments.
6.2.4. Infraction Analysis
Through the preceding analysis, our model achieved good results across the various metrics of the CARLA test. Nevertheless, we still observed some driving problems during the experiments, which point to directions for future research.
We found that when pedestrians or obstacles approach the agent, the SW-PPO-controlled agent stops, sometimes even when the distance is still relatively large and no collision would occur. SW-PPO thus occasionally overemphasizes safety, which consumes considerable time; this is why SW-PPO’s route completion underperforms in many scenarios. Latent SW-PPO mitigates this problem, although it persists in some cases. How to balance efficiency and safety therefore remains an open question for future work.
We also examined the failure cases of Latent SW-PPO and found that it is more prone to errors in heavy rain. The main reason is that the BEV map we currently use carries no weather information. In rainy weather in CARLA, the vehicle’s mechanical properties and the friction coefficient with the ground change, which should push the agent toward a more conservative strategy, such as driving slowly or turning gently. However, because the agent relies on the BEV map and cannot perceive the change in weather, collisions occur. Improving environmental perception is therefore a direction for future work.
7. Conclusions
This research introduces a novel approach to safe reinforcement learning (SRL) for autonomous driving, combining latent space modeling with state-wise safety constraints. Our framework addresses the critical challenge of ensuring safety while optimizing performance in complex environments. By utilizing variational autoencoders, we have developed a latent space representation that enhances sample efficiency and reduces interactions with the real environment, thereby mitigating safety risks.
Our innovative barrier function encodes state-wise safety constraints, ensuring the policy maintains safety at each state. This integration of model-based learning with safety enforcement provides a robust solution for autonomous driving systems. Experimental results in the CARLA simulator demonstrate the effectiveness of our method, outperforming existing approaches in driving score and safety metrics.
The improved generalization capability of our model, evidenced by its performance across various scenarios, highlights the potential for real-world application. Future work will focus on refining the model to handle diverse conditions and bridge the gap between simulation and real-world performance. This study contributes to the progress of safe and reliable autonomous driving systems, aiming for a safer and more efficient transportation future.