1. Introduction
Vision-based autonomous driving requires robust visual representations that can capture the complex dynamics of urban environments. Learning effective representations from visual inputs is crucial for understanding the driving scene, including other vehicles’ behaviors, traffic conditions and road geometry [1]. While traditional computer vision approaches focus on extracting low-level features, the key challenge lies in developing representations that capture semantic and temporal relationships in driving scenarios [2]. World models have emerged as a promising framework for learning such representations, as they can predict future states and enable autonomous vehicles to anticipate and plan for potential scenarios [3,4].
The development of future state prediction in world models has primarily followed two technical paths: the Recurrent State Space Model (RSSM) [4] and the Joint-Embedding Predictive Architecture (JEPA) [5]. While both approaches have shown promise, they face significant challenges in autonomous driving applications. RSSM relies on variational inference that requires decoding predictions back to the high-dimensional, low-level pixel space for training, as shown in Figure 1a. This not only introduces computational overhead but also risks losing important semantic information through the low-level reconstruction bottleneck. This is particularly problematic for autonomous driving, where precise preservation of critical driving-relevant features is essential for safety. JEPA, while avoiding low-level reconstruction, requires careful design of masking strategies to learn meaningful representations effectively. This is an especially challenging task in the complex and dynamic domain of autonomous driving, where determining what information to mask requires deep domain knowledge. These limitations motivate our novel approach, which fundamentally reimagines how future states are predicted in world models.
To address these challenges, we propose BYOL-Drive, a novel approach that introduces the self-supervised learning framework BYOL (Bootstrap Your Own Latent) [6] to implement world modeling. As shown in Figure 1b, our model enforces consistency constraints directly in a low-dimensional latent space through a target network updated by exponential moving average (EMA), avoiding the computational overhead and potential information loss associated with high-dimensional reconstruction in RSSM. By operating in latent space, we can better preserve critical driving features while maintaining computational efficiency. Additionally, unlike JEPA, which requires careful masking design, our approach learns meaningful representations through self-supervised prediction targets that naturally emerge from the temporal structure of driving sequences, eliminating the need for complex masking strategies. Furthermore, our model learns spatial-aware state representations that capture the geometric relationships and spatial dependencies in the driving environment. Experiments demonstrate that our learned representations enable competitive driving performance on downstream autonomous driving tasks with fewer labeling requirements than state-of-the-art methods. The key contributions of our work are threefold:
We propose BYOL-Drive, a novel representation-learning framework for autonomous driving that avoids pixel-space reconstruction while achieving competitive performance with less labeled data compared to existing methods, using only monocular camera input which significantly reduces hardware costs and complexity.
We pioneer the integration of BYOL’s self-supervised learning paradigm into world modeling, extending BYOL from static images to temporal sequences for robust world modeling in autonomous driving.
We develop a spatially aware representation approach that preserves crucial spatial structural information in driving scenes, enabling more effective world modeling in autonomous driving.
2. Related Work
Recent advances in world models have demonstrated significant progress in autonomous driving applications, as comprehensively reviewed in two recent surveys [7,8]. Two dominant technical approaches have emerged in this field. The Recurrent State Space Model (RSSM) [4] innovatively decomposes states into stochastic and deterministic components, enabling robust temporal modeling through a hybrid structure that balances predictive stability with adaptability. However, RSSM faces challenges in computational efficiency due to its reliance on high-dimensional reconstruction and variational inference. The Joint-Embedding Predictive Architecture (JEPA) [5] represents an alternative approach that operates directly in latent space through dual encoders, avoiding expensive pixel-space reconstruction. However, JEPA’s effectiveness depends heavily on carefully designed masking strategies, which proves particularly challenging in the dynamic and complex domain of autonomous driving.
There have been several notable advances in specialized areas. Multi-modal approaches like MUVO [9] and UniWorld [10] have integrated LiDAR and camera data, though they require extensive labeled training data. Drive-WM [11] introduced multi-view world-modeling capabilities. In handling uncertainty and temporal dynamics, GAIA-1 [12] and DriveDreamer [13] have made strides in future state prediction, while TrafficBots [14] improved scalability through Conditional Variational Autoencoders. The integration of language understanding has been explored by ADriver-I [15] and WorldDreamer [16], while semantic understanding has been advanced by SEM2 [17] and OccWorld [18] through various forms of 3D and semantic modeling. A common limitation across these approaches is their heavy reliance on labeled data and computational resources. Multi-modal methods require synchronized sensor data and precise calibration. Language-integrated models need paired driving video and natural language supervision. Semantic understanding approaches demand pixel-level segmentation or dense 3D occupancy annotations. While these methods have shown promising results, their dependence on extensive human-annotated datasets and computational overhead poses significant challenges for real-world deployment and scalability.
Compared to recent self-supervised world models, BYOL-Drive exhibits several key architectural and methodological differences. First, unlike GAIA-1 [12], which relies heavily on large-scale pre-training with paired video–text data, our approach eliminates the need for language supervision by leveraging purely visual self-supervision. While GAIA-1 demonstrates impressive zero-shot generalization capabilities through its language–vision alignment, the requirement for paired annotations limits its applicability to unlabeled driving data. In contrast, BYOL-Drive’s bootstrapped self-supervised learning framework enables it to learn meaningful representations from raw sensor streams alone. Similarly, TrafficBots [14] employs Conditional Variational Autoencoders (CVAEs) to model the diverse behaviors of different traffic participants through personality and destination conditioning. While it achieves strong performance in both a priori and a posteriori simulation, the model requires access to future traffic light states and destinations that could be obtained via V2X or V2V communication. Our latent-space prediction approach learns directly from raw sensor data without requiring such additional information. Furthermore, unlike OccWorld [18], which requires dense 3D occupancy annotations, BYOL-Drive learns spatial-aware representations directly from 2D images through its matrix-variate state formulation. This makes our method more amenable to real-world deployment, where 3D ground truth is scarce.
Recent works like DriveDreamer [13] and WorldDreamer [16] have explored diffusion-based world models that can generate high-fidelity future predictions. While these approaches excel at long-term forecasting, they demand substantial computational resources for both training and inference. BYOL-Drive strikes a better balance between predictive power and efficiency through its hybrid RSSM-BYOL architecture: the deterministic and stochastic state components enable robust temporal modeling, while the bootstrapped prediction objective ensures that task-relevant features are learned. Additionally, compared to multi-modal approaches like MUVO [9] and UniWorld [10], our method achieves competitive performance using only monocular camera input. While sensor fusion can provide complementary information, the added complexity of synchronization and calibration poses challenges for widespread adoption. BYOL-Drive’s camera-centric design offers an economically viable path to scaling autonomous driving technology.
Existing world models face challenges in data labeling and computational complexity. We propose BYOL-Drive, a self-supervised framework that learns in low-dimensional latent space, preserving driving features while reducing computational costs and labeling requirements, enabling more practical deployment in autonomous driving systems.
3. Preliminaries
3.1. Recurrent State Space Model
The RSSM forms the foundation of many modern world models in autonomous driving. It combines deterministic and stochastic paths to model temporal dynamics while managing uncertainty. Given a sequence of observations $o_{1:T}$ and actions $a_{1:T}$, where $o_t$ represents the observation at time $t$ and $a_t$ represents the action at time $t$, RSSM models the observations and state transitions through the following generative process:

$$p(o_{1:T}, s_{1:T} \mid a_{1:T}) = p(s_{1:T} \mid a_{1:T})\, p(o_{1:T} \mid s_{1:T}),$$

where $s_{1:T}$ represents the sequence of stochastic latent states. The prior, likelihood and posterior can be further decomposed into factorized forms that reflect their temporal dependencies:

$$p(s_{1:T} \mid a_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{<t}, a_{<t}), \qquad p(o_{1:T} \mid s_{1:T}) = \prod_{t=1}^{T} p(o_t \mid s_t), \qquad p(s_{1:T} \mid o_{1:T}, a_{1:T}) = \prod_{t=1}^{T} p(s_t \mid o_{\le t}, a_{<t}).$$

Since computing the true posterior $p(s_{1:T} \mid o_{1:T}, a_{1:T})$ requires evaluating the intractable integral $p(o_{1:T} \mid a_{1:T}) = \int p(o_{1:T}, s_{1:T} \mid a_{1:T})\, \mathrm{d}s_{1:T}$, where the denominator involves integrating over all possible latent state trajectories, RSSM employs a predictor network $q$ to approximate it. This approximate posterior is defined as:

$$q(s_{1:T} \mid o_{1:T}, a_{1:T}) = \prod_{t=1}^{T} q(s_t \mid o_{\le t}, a_{<t}).$$

To efficiently handle the conditioning on previous states $s_{<t}$ and actions $a_{<t}$, RSSM employs a shared Gated Recurrent Unit (GRU) to compress this information into a deterministic encoding $h_t$:

$$h_t = \mathrm{GRU}\bigl(h_{t-1}, \mathrm{MLP}(\mathrm{concat}(s_{t-1}, a_{t-1}))\bigr),$$

where MLP denotes a Multi-Layer Perceptron and concat represents vector concatenation. The deterministic encoding is then used to compute the sufficient statistics of the prior, likelihood and posterior:

$$p(s_t \mid s_{<t}, a_{<t}) = \mathcal{N}\bigl(\mu_{\mathrm{prior}}(h_t), \sigma_{\mathrm{prior}}(h_t)\bigr), \qquad p(o_t \mid s_t) = \mathcal{N}\bigl(\mu_{\mathrm{dec}}(h_t, s_t), \mathbf{I}\bigr), \qquad q(s_t \mid o_{\le t}, a_{<t}) = \mathcal{N}\bigl(\mu_{\mathrm{post}}(h_t, o_t), \sigma_{\mathrm{post}}(h_t, o_t)\bigr),$$

where $\mathcal{N}$ denotes a Gaussian distribution. The model is trained by maximizing the ELBO:

$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{t=1}^{T} \mathbb{E}_{q}\bigl[\log p(o_t \mid s_t, h_t)\bigr] - \mathbb{E}_{q}\bigl[ D_{\mathrm{KL}}\bigl( q(s_t \mid o_{\le t}, a_{<t}) \,\big\|\, p(s_t \mid s_{<t}, a_{<t}) \bigr) \bigr].$$
This ELBO objective consists of two key terms: a reconstruction term that forces the model to reconstruct high-dimensional observations from latent states, and a KL divergence term that regularizes the learned posterior to stay close to the prior. Through this objective, RSSM learns to compress high-dimensional observations into a compact latent space while maintaining temporal consistency—the KL term ensures smooth state transitions by keeping the posterior close to the prior, while the reconstruction term provides supervision for learning meaningful representations. This enables RSSM to both encode the current state and predict future states by rolling out the learned dynamics in latent space. However, this formulation has two key limitations. First, the reconstruction term necessitates mapping back to the high-dimensional observation space at each timestep—a computationally expensive process that can lose critical driving-relevant features when compressing and decompressing the full sensory input. Second, the KL divergence term only provides a soft constraint between the prior and posterior distributions, allowing potential discrepancies between predicted and actual states to accumulate over time. This can lead to degraded long-term predictions, as the model may learn to satisfy the KL constraint while still producing inconsistent state transitions. These issues are particularly problematic in autonomous driving, where both computational efficiency and precise long-horizon prediction are crucial for safe operation.
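To make the recurrence concrete, the following PyTorch sketch illustrates one RSSM step; it is a simplified illustration with assumed module names and sizes (e.g., RSSMCell), not a specific published implementation. The GRU produces the deterministic encoding, separate heads output the prior and posterior statistics, and the per-step loss combines a reconstruction term with a KL term.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class RSSMCell(nn.Module):
    """One RSSM step: deterministic GRU path + stochastic prior/posterior heads."""
    def __init__(self, state_dim=32, action_dim=2, hidden_dim=256, obs_embed_dim=256):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(state_dim + action_dim, hidden_dim), nn.ELU())
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.prior_head = nn.Linear(hidden_dim, 2 * state_dim)               # -> (mu, log_sigma)
        self.post_head = nn.Linear(hidden_dim + obs_embed_dim, 2 * state_dim)

    def forward(self, h_prev, s_prev, a_prev, obs_embed):
        # Deterministic encoding: h_t = GRU(h_{t-1}, MLP(concat(s_{t-1}, a_{t-1})))
        h_t = self.gru(self.pre(torch.cat([s_prev, a_prev], dim=-1)), h_prev)
        # Prior p(s_t | s_{<t}, a_{<t}) computed from h_t alone
        mu_p, log_sigma_p = self.prior_head(h_t).chunk(2, dim=-1)
        prior = D.Normal(mu_p, log_sigma_p.exp())
        # Posterior q(s_t | o_{<=t}, a_{<t}) additionally sees the observation embedding
        mu_q, log_sigma_q = self.post_head(torch.cat([h_t, obs_embed], dim=-1)).chunk(2, dim=-1)
        posterior = D.Normal(mu_q, log_sigma_q.exp())
        s_t = posterior.rsample()        # reparameterized sample
        return h_t, s_t, prior, posterior

# Per-step ELBO terms (negated for minimization):
# loss = -decoder_log_prob(o_t | h_t, s_t) + torch.distributions.kl_divergence(posterior, prior)
```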
3.2. Bootstrap Your Own Latent
BYOL represents a novel approach to self-supervised learning that prevents representation collapse through an asymmetric architecture design and stop-gradient operations, without requiring reconstruction to the original observation space. The architecture consists of two neural networks: an online network (parameterized by $\theta$) and a target network (parameterized by $\xi$), with critically different update mechanisms.
As shown in Figure 2, given an input image $x$, we first apply two different random augmentations to create views $v$ and $v'$. Common augmentations include random cropping, color jittering, Gaussian blur and random grayscale conversion. The augmented views are then processed through the networks as follows:

$$y_\theta = f_\theta(v), \quad z_\theta = g_\theta(y_\theta), \quad p_\theta = q_\theta(z_\theta), \qquad y'_\xi = f_\xi(v'), \quad z'_\xi = g_\xi(y'_\xi),$$

where:
$f_\theta, f_\xi$ represent the encoder networks for the online and target networks, respectively;
$g_\theta, g_\xi$ represent the projector networks for the online and target networks, respectively;
$q_\theta$ represents the predictor network for the online network;
$v, v'$ are differently augmented views of input $x$;
$\theta$ and $\xi$ are the parameter sets for the online and target networks.
The target network parameters $\xi$ are updated using an EMA of the online parameters, providing stable learning targets:

$$\xi \leftarrow \tau \xi + (1 - \tau)\,\theta,$$

where $\tau$ is the EMA decay rate. The learning objective uses normalized predictions and target projections:

$$\mathcal{L}_{\mathrm{BYOL}} = \left\lVert \frac{p_\theta}{\lVert p_\theta \rVert_2} - \mathrm{sg}\!\left(\frac{z'_\xi}{\lVert z'_\xi \rVert_2}\right) \right\rVert_2^2,$$

where $\lVert \cdot \rVert_2$ denotes the L2 norm and sg explicitly shows the stop-gradient operation. This asymmetric design provides several key advantages:
Collapse Prevention: The combination of asymmetric architecture, stop-gradient and EMA updates prevents the networks from converging to trivial solutions
Stable Training: The target network, updated via EMA, provides consistent learning targets that evolve smoothly during training
Feature Quality: The predictor network learns meaningful representations by trying to match the stable targets, without collapsing to constant features
Figure 2.
The BYOL framework for self-supervised learning. The asymmetric design between online and target branches (parameterized by $\theta$ and $\xi$, respectively), combined with stop-gradient operations (sg), prevents representation collapse. The online branch updates through backpropagation while the target branch provides stable learning targets through EMA updates.
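As a concrete illustration, the sketch below computes the BYOL loss and the EMA update for a pair of augmented views. It is a minimal example under the assumption that the encoders, projectors and predictor are standard modules; it is not the reference implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(online, target, tau=0.99):
    """xi <- tau * xi + (1 - tau) * theta, applied parameter-wise (tau is illustrative)."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_(p_online.detach(), alpha=1 - tau)

def byol_loss(online_encoder, online_projector, online_predictor,
              target_encoder, target_projector, v, v_prime):
    # Online branch: encoder -> projector -> predictor
    p = online_predictor(online_projector(online_encoder(v)))
    # Target branch: encoder -> projector, no gradient (stop-gradient)
    with torch.no_grad():
        z_prime = target_projector(target_encoder(v_prime))
    # L2-normalize both sides and take the squared Euclidean distance
    p = F.normalize(p, dim=-1)
    z_prime = F.normalize(z_prime, dim=-1)
    return ((p - z_prime) ** 2).sum(dim=-1).mean()
```

In the full BYOL recipe, the loss is symmetrized by additionally passing $v'$ through the online branch and $v$ through the target branch.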
3.3. Integration of RSSM and BYOL
While RSSM has demonstrated remarkable capabilities in world modeling, it faces certain limitations that can impact its effectiveness in autonomous driving scenarios. A key challenge lies in its reliance on observation reconstruction, which requires the model to decode high-dimensional visual observations from latent states. This reconstruction process can be computationally intensive and may force the model to capture irrelevant details that do not contribute to effective driving policy learning.
BYOL offers a promising solution to these limitations through its self-supervised representation learning approach. By avoiding explicit reconstruction in the observation space, BYOL enables more efficient and focused learning of relevant driving features. The predictor network in BYOL learns to capture meaningful representations without being constrained by pixel-level reconstruction objectives, allowing the model to concentrate on features that are most relevant for driving decision-making.
However, BYOL in its original form is primarily designed for static image processing and lacks mechanisms for handling temporal dynamics inherent in driving scenarios. This is where RSSM’s strengths in modeling temporal dependencies through its recurrent state-space architecture become particularly valuable. RSSM excels at capturing the sequential nature of driving data, enabling the model to understand and predict the evolution of driving scenarios over time.
By integrating BYOL’s representation learning with RSSM’s temporal modeling capabilities, we can address the limitations of both approaches. The combined framework leverages BYOL’s efficient feature learning while maintaining RSSM’s ability to model temporal dynamics, creating a more robust and computationally efficient world model for autonomous driving. This integration allows us to:
Eliminate the computational overhead of observation reconstruction while maintaining high-quality representation learning;
Preserve temporal consistency in learned representations through RSSM’s recurrent architecture;
Focus the model’s capacity on features that are most relevant for driving policy learning.
In the subsequent sections, we detail how BYOL and RSSM are integrated, highlighting the architectural modifications and learning strategies that enable a seamless fusion of these two frameworks. Our integration leverages BYOL’s robust representation-learning capabilities within RSSM’s probabilistic state-space modeling framework, enhancing temporal understanding and representation stability in autonomous driving world models.
4. Methods
4.1. Model Architecture
The overall architecture is illustrated in Figure 3. Our model consists of two key components: state representation and state prediction. For state representation, we propose a spatially aware approach that represents states as matrices rather than traditional vector-based representations, allowing us to preserve crucial spatial information. This matrix-based representation enables the model to maintain the spatial relationships inherent in driving scenes, which is essential for understanding the geometric structure of the environment. For state prediction, we adopt BYOL’s learning paradigm which uses an online encoder and a target encoder to learn robust representations through bootstrapping, avoiding the need for negative samples or explicit reconstruction objectives. This self-supervised approach allows the model to focus on learning meaningful temporal dynamics while maintaining computational efficiency.
4.1.1. Spatial-Aware State Representation
The encoder network $e_\theta$ plays a crucial role in transforming high-dimensional visual inputs into a compact latent representation while preserving essential geometric and semantic information. Our encoder design is inspired by MILE [19], which demonstrated the effectiveness of incorporating explicit 3D geometric understanding through a multi-stage architecture that processes visual information in both perspective and bird’s-eye view (BeV) spaces.
The first stage processes the input RGB image through a ResNet-18 backbone to extract rich visual features. Concurrent with feature extraction, we employ a depth prediction network to estimate a depth probability distribution over predefined depth bins for each spatial location. This depth distribution enables us to lift the 2D image features into 3D space using the pinhole camera model with known intrinsics $K$ and extrinsics $M$, transforming features from camera coordinates to vehicle-centric 3D coordinates. While this lifting operation introduces additional computation compared to directly processing 2D features, the increase in inference time is modest because the lifting transformation consists primarily of matrix multiplications that can be efficiently batched and accelerated on GPU hardware. These architectural advantages, combined with the critical benefits of explicit 3D understanding for autonomous driving tasks, make the small additional latency a worthwhile trade-off for improved model performance.
Following the 3D lifting, we employ a BeV transformation [19] that projects the 3D features onto a ground-plane grid. This transformation uses a predefined grid with fixed spatial dimensions and a fixed resolution in meters per cell. The features are aggregated through sum-pooling within each grid cell, resulting in BeV features $b_t$. This BeV representation provides a geometrically consistent view of the scene that is particularly advantageous for driving tasks, as it naturally captures spatial relationships between objects and the road layout.
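The sketch below shows the core of this lift-and-splat step in simplified form: image features are weighted by the predicted depth distribution, placed into 3D, and sum-pooled into BeV grid cells. Tensor shapes and the precomputed cell indices are assumptions for illustration; out-of-grid points are assumed to have been clamped beforehand.

```python
import torch

def lift_and_splat(feats, depth_logits, cell_coords, grid_hw):
    """
    feats:        (C, H, W)     image features from the backbone
    depth_logits: (D, H, W)     per-pixel logits over D depth bins
    cell_coords:  (D, H, W, 2)  precomputed BeV cell index (x, y) of every
                                (depth bin, pixel) after applying K and M
    grid_hw:      (X, Y)        BeV grid size
    Returns BeV features of shape (C, X, Y) via sum-pooling.
    """
    C, H, W = feats.shape
    depth_prob = depth_logits.softmax(dim=0)                           # (D, H, W)
    # Outer product: each pixel feature is spread over its depth bins
    lifted = depth_prob.unsqueeze(0) * feats.unsqueeze(1)              # (C, D, H, W)
    lifted = lifted.reshape(C, -1)                                     # (C, D*H*W)
    # Flatten each 3D point's BeV cell into a single linear index
    X, Y = grid_hw
    idx = (cell_coords[..., 0] * Y + cell_coords[..., 1]).reshape(-1).long()
    bev = torch.zeros(C, X * Y, dtype=feats.dtype, device=feats.device)
    bev.index_add_(1, idx, lifted)                                     # sum-pool per cell
    return bev.view(C, X, Y)
```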
To achieve a spatially aware representation suitable for world modeling, we process the BeV features $b_t$ through convolutional layers, maintaining the spatial dimensions while reducing the channel dimension, producing an intermediate latent tensor $u_t$. As illustrated in Figure 4, this latent representation is then used to parameterize a matrix-variate Gaussian distribution for the final state $s_t$, where:

$$\mu_t = f_\mu(u_t), \qquad \sigma_t = f_\sigma(u_t), \qquad s_t \sim \mathcal{N}(\mu_t, \sigma_t).$$

Here, $f_\mu$ and $f_\sigma$ are convolutional networks that output the mean and standard deviation matrices of the state distribution, respectively, preserving the spatial structure of the representation as depicted in Figure 4 (right). To enable backpropagation through the sampling process, we employ the reparameterization trick:

$$s_t = \mu_t + \sigma_t \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

where $\odot$ denotes element-wise multiplication and $\mathbf{I}$ is an identity matrix. The spatial elements in the state matrix are modeled with a diagonal covariance structure, assuming independence between components. This design choice is motivated by physical reality: real-world objects exhibit free spatial positioning where their locations are not inherently correlated, making the independence assumption a reasonable approximation of actual driving scenarios. Additionally, the diagonal structure reduces the quadratic complexity of full covariance estimation to linear scale, enabling real-time implementation. While the state representation assumes independence between spatial components, downstream tasks such as policy learning and BeV map prediction networks are able to model the correlations and dependencies between these components through their neural architectures, allowing the system to capture important spatial relationships during decision making. The final state embedding $s_t$ maintains its spatial dimensions and is enriched with additional contextual information by concatenating encoded route information along the channel dimension to provide necessary information for autonomous driving.
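A minimal sketch of this state head is given below, assuming illustrative layer sizes and the hypothetical name SpatialStateHead; it keeps the BeV spatial layout and samples the state matrix with the reparameterization trick.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialStateHead(nn.Module):
    """Parameterize a spatially structured Gaussian state from BeV features."""
    def __init__(self, in_channels=64, state_channels=16):
        super().__init__()
        # 1x1 convolutions keep the (X, Y) spatial layout of the BeV grid
        self.mu_conv = nn.Conv2d(in_channels, state_channels, kernel_size=1)
        self.sigma_conv = nn.Conv2d(in_channels, state_channels, kernel_size=1)

    def forward(self, bev_feats):
        mu = self.mu_conv(bev_feats)                                   # (B, C_s, X, Y)
        sigma = F.softplus(self.sigma_conv(bev_feats)) + 1e-4          # positive std
        eps = torch.randn_like(mu)
        s = mu + sigma * eps                                           # reparameterized sample
        return s, mu, sigma
```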
4.1.2. BYOL-Style State Prediction
As shown in Figure 3, the core idea of our state prediction approach is to learn from pairs of consecutive frames by predicting the state distribution of the next frame given the current frame’s state, action and historical information. Specifically, for each pair of consecutive frames $(o_t, o_{t+1})$ and action $a_t$, we have an online encoder $e_\theta$ that outputs the state distribution of the current frame, and a target encoder $e_\xi$ that provides the state distribution of the next frame as a supervision signal. Following BYOL [6], we adopt an asymmetric architecture, and the parameters of the target encoder are updated using EMA rather than backpropagation. Additionally, we use stop-gradient operations on the target encoder outputs to prevent representation collapse during training.
Formally, given the current frame $o_t$ and action $a_t$, the online encoder $e_\theta$ first produces a state representation $(\mu_t, \sigma_t) = e_\theta(o_t)$. The state predictor $f_h$ takes the stochastic state matrix $s_t$ sampled from the state distribution and the current latent state $h_t$ to predict the next latent state $h_{t+1}$:

$$h_{t+1} = f_h(h_t, s_t).$$

Then, the action-conditioned state predictor network $f_a$ takes the predicted latent state $h_{t+1}$ and the action $a_t$ as inputs to predict the state distribution of the next frame:

$$(\hat{\mu}_{t+1}, \hat{\sigma}_{t+1}) = f_a(h_{t+1}, a_t),$$

where $\hat{\mu}_{t+1}$ and $\hat{\sigma}_{t+1}$ are the predicted mean and standard deviation matrices of the next state distribution. The $f_a$ network enriches our state prediction framework by providing action-specific insights into state transitions. By explicitly modeling how different driving actions (such as lane changes, acceleration or braking) impact the future state based on the predicted latent state, $f_a$ enables more precise and context-aware state predictions.
Meanwhile, the target encoder $e_\xi$ processes the next frame $o_{t+1}$ to obtain its state distribution parameters, with stop-gradient operations applied to prevent gradient flow:

$$(\bar{\mu}_{t+1}, \bar{\sigma}_{t+1}) = \mathrm{sg}\bigl(e_\xi(o_{t+1})\bigr),$$

where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation and $\bar{\mu}_{t+1}, \bar{\sigma}_{t+1}$ are the target mean and standard deviation matrices. The prediction loss is computed as the Wasserstein distance between the predicted and target state distributions:

$$\mathcal{L}_{\mathrm{pred}} = D\bigl(\mathcal{N}(\hat{\mu}_{t+1}, \hat{\sigma}_{t+1}),\ \mathcal{N}(\bar{\mu}_{t+1}, \bar{\sigma}_{t+1})\bigr).$$
Here, $D$ represents a metric measuring the distance between probability distributions; the specific choice and implementation considerations of this metric are discussed in subsequent sections. The target encoder’s parameters $\xi$ are updated using an EMA of the online encoder’s parameters $\theta$:

$$\xi \leftarrow \tau \xi + (1 - \tau)\,\theta,$$

where $\tau$ is the EMA decay rate. This EMA update, combined with the stop-gradient operations and asymmetric architecture, prevents representation collapse while allowing the model to learn meaningful state predictions.
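To make the data flow concrete, the sketch below runs one prediction step for a pair of consecutive frames. Module names such as online_encoder and state_predictor are illustrative stand-ins for the components described above, not our exact implementation.

```python
import torch

def prediction_step(online_encoder, target_encoder, state_predictor,
                    action_predictor, o_t, o_next, a_t, h_t):
    # Online branch: current frame -> state distribution -> sampled state matrix
    mu_t, sigma_t = online_encoder(o_t)
    s_t = mu_t + sigma_t * torch.randn_like(mu_t)        # reparameterized sample
    # Deterministic rollout of the latent state, then action conditioning
    h_next = state_predictor(h_t, s_t)
    mu_hat, sigma_hat = action_predictor(h_next, a_t)
    # Target branch: next frame, no gradient (stop-gradient + EMA-updated weights)
    with torch.no_grad():
        mu_bar, sigma_bar = target_encoder(o_next)
    return (mu_hat, sigma_hat), (mu_bar, sigma_bar), s_t, h_next
```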
Note that our model’s future prediction is designed for constructing an effective representation-learning paradigm. During policy learning, only the current representation and memory information are needed for decision making, without requiring future predictions. Therefore, while our model has the capability to predict future states, we focus on leveraging this predictive framework to learn better state representations rather than evaluating the quality of future predictions themselves. This design choice aligns with our goal of learning robust and meaningful state representations that can directly support downstream autonomous driving tasks.
4.2. Learning Objectives
Our model employs a carefully designed multi-objective loss function to achieve effective world modeling and policy learning. The first component is the prediction loss $\mathcal{L}_{\mathrm{pred}}$, which is computed as the Wasserstein distance between the predicted and target state distributions: $\mathcal{L}_{\mathrm{pred}} = W_2\bigl(\mathcal{N}(\hat{\mu}_{t+1}, \hat{\sigma}_{t+1}),\ \mathcal{N}(\bar{\mu}_{t+1}, \bar{\sigma}_{t+1})\bigr)$. This choice is motivated by the Wasserstein distance’s advantages in both convergence speed and numerical stability. Compared to KL or JS divergences, which suffer from vanishing gradients when distributions have non-overlapping supports, the Wasserstein metric provides smoother gradient landscapes and remains differentiable even in such cases [20]. The resultant stable gradient signals enable faster parameter updates while maintaining training stability, which is particularly crucial for learning complex environment dynamics through multi-step predictions. This loss term encourages the model to accurately predict environment state evolution given current latent states and actions.
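For Gaussians with diagonal covariance, as used here, the squared 2-Wasserstein distance has a simple closed form, $W_2^2 = \lVert \mu_1 - \mu_2 \rVert_2^2 + \lVert \sigma_1 - \sigma_2 \rVert_2^2$, which the small sketch below evaluates; the exact implementation in our code base may differ in reduction and weighting.

```python
import torch

def wasserstein2_diag(mu_p, sigma_p, mu_q, sigma_q):
    """Squared 2-Wasserstein distance between diagonal Gaussians:
    W2^2 = ||mu_p - mu_q||^2 + ||sigma_p - sigma_q||^2 (summed over all elements)."""
    return ((mu_p - mu_q) ** 2).sum() + ((sigma_p - sigma_q) ** 2).sum()
```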
To enhance the model’s understanding of human-defined traffic elements crucial for driving, we incorporate a BeV supervision loss $\mathcal{L}_{\mathrm{bev}} = \mathrm{CE}(\hat{y}^{\mathrm{bev}}_t, y^{\mathrm{bev}}_t)$, where CE denotes the cross-entropy loss that aligns the predicted BeV segmentation maps $\hat{y}^{\mathrm{bev}}_t$ with the ground truth $y^{\mathrm{bev}}_t$. The BeV supervision provides essential prior knowledge about traffic infrastructure like traffic lights, crosswalks and lane markings that are human-defined conventions rather than natural physical objects. This additional learning signal helps the model develop semantic understanding of these critical driving-specific elements that cannot be learned purely from world dynamics, enabling it to make decisions that respect traffic rules and conventions.
For policy learning, we concatenate the learned state representation $s_t$ with the hidden state $h_t$ as input to a policy network $\pi$ that outputs driving actions: $\hat{a}_t = \pi\bigl(\mathrm{concat}(s_t, h_t)\bigr)$. The policy loss $\mathcal{L}_{\mathrm{policy}}$ is computed using the mean squared error (MSE) between the predicted actions $\hat{a}_t$ and ground truth actions $a_t$:

$$\mathcal{L}_{\mathrm{policy}} = \lVert \hat{a}_t - a_t \rVert_2^2.$$

This loss term helps optimize the driving behavior by minimizing the difference between predicted and actual actions. The complete training objective is:

$$\mathcal{L} = \lambda_{\mathrm{pred}}\,\mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{bev}}\,\mathcal{L}_{\mathrm{bev}} + \lambda_{\mathrm{policy}}\,\mathcal{L}_{\mathrm{policy}},$$

where $\lambda_{\mathrm{pred}}$, $\lambda_{\mathrm{bev}}$ and $\lambda_{\mathrm{policy}}$ are weighting coefficients. The training procedure is outlined in Algorithm 1.
Algorithm 1 Training Procedure
Require: Training dataset $\mathcal{D}$, online encoder parameters $\theta$, target encoder parameters $\xi$, predictor parameters $\psi$, policy parameters $\phi$, EMA decay rate $\tau$, stop-gradient operation $\mathrm{sg}$
1: while not converged do
2:  Sample batch $(o_t, a_t, o_{t+1}, y^{\mathrm{bev}}_t)$ from $\mathcal{D}$
3:  Encode current observation: $(\mu_t, \sigma_t) = e_\theta(o_t)$, sample $s_t = \mu_t + \sigma_t \odot \epsilon$
4:  Encode next observation: $(\bar{\mu}_{t+1}, \bar{\sigma}_{t+1}) = \mathrm{sg}\bigl(e_\xi(o_{t+1})\bigr)$
5:  Predict next state distribution: $(\hat{\mu}_{t+1}, \hat{\sigma}_{t+1}) = f_a\bigl(f_h(h_t, s_t), a_t\bigr)$
6:  Compute Wasserstein distance: $\mathcal{L}_{\mathrm{pred}} = W_2\bigl(\mathcal{N}(\hat{\mu}_{t+1}, \hat{\sigma}_{t+1}), \mathcal{N}(\bar{\mu}_{t+1}, \bar{\sigma}_{t+1})\bigr)$
7:  Predict BeV: $\hat{y}^{\mathrm{bev}}_t$
8:  Compute BeV loss: $\mathcal{L}_{\mathrm{bev}} = \mathrm{CE}(\hat{y}^{\mathrm{bev}}_t, y^{\mathrm{bev}}_t)$
9:  Predict action: $\hat{a}_t = \pi\bigl(\mathrm{concat}(s_t, h_t)\bigr)$
10: Compute policy loss: $\mathcal{L}_{\mathrm{policy}} = \lVert \hat{a}_t - a_t \rVert_2^2$
11: Total loss: $\mathcal{L} = \lambda_{\mathrm{pred}}\mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{bev}}\mathcal{L}_{\mathrm{bev}} + \lambda_{\mathrm{policy}}\mathcal{L}_{\mathrm{policy}}$
12: Update $\theta$, $\psi$ and $\phi$ using gradient descent on $\mathcal{L}$
13: Update target encoder: $\xi \leftarrow \tau\xi + (1 - \tau)\theta$
14: end while
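A compact PyTorch rendition of one iteration of Algorithm 1 is sketched below. It reuses the illustrative helpers defined earlier (prediction_step, wasserstein2_diag, ema_update) and assumes hypothetical module names such as bev_decoder; loss weights and which head consumes which features are illustrative choices, not a prescription.

```python
import torch
import torch.nn.functional as F

def training_step(batch, nets, optimizer, h_t, lambdas=(1.0, 1.0, 1.0), tau=0.99):
    """One optimization step mirroring Algorithm 1 (illustrative only).
    The optimizer is assumed to hold the online parameters (encoder, predictors,
    BeV decoder, policy); the EMA target encoder is excluded."""
    o_t, a_t, o_next, bev_gt = batch
    (mu_hat, sigma_hat), (mu_bar, sigma_bar), s_t, h_next = prediction_step(
        nets["online_enc"], nets["target_enc"], nets["state_pred"],
        nets["action_pred"], o_t, o_next, a_t, h_t)

    loss_pred = wasserstein2_diag(mu_hat, sigma_hat, mu_bar, sigma_bar)
    bev_logits = nets["bev_decoder"](h_next)            # BeV head input is an assumption
    loss_bev = F.cross_entropy(bev_logits, bev_gt)
    a_hat = nets["policy"](torch.cat([s_t.flatten(1), h_t.flatten(1)], dim=1))
    loss_policy = F.mse_loss(a_hat, a_t)

    loss = lambdas[0] * loss_pred + lambdas[1] * loss_bev + lambdas[2] * loss_policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(nets["online_enc"], nets["target_enc"], tau)   # from the BYOL sketch
    return loss.item(), h_next.detach()
```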
5. Experiments
5.1. Experiment Setup
5.1.1. Training Data and Testing Environment
We evaluate our method on the CARLA simulator, with detailed training and testing data settings shown in Table 1. The table provides comprehensive specifications of our experimental setup:
The training data are collected from four different CARLA towns (Town01, Town03, Town04, Town06) under four diverse weather conditions (ClearNoon, WetNoon, HardRainNoon, ClearSunset). We gather approximately 0.58 M frames, equivalent to 32 h of driving data, collected by an expert RL agent at 5 Hz. The data format includes RGB images, route maps, continuous action vectors (acceleration and steering) and bird’s-eye view segmentation maps.
To rigorously evaluate our model’s robustness and generalization capabilities, we construct highly diverse and challenging test scenarios. For each evaluation route, we dynamically populate the environment with varying numbers of vehicles (10–20) and pedestrians (10–20), creating complex, multi-agent interactions. These agents exhibit realistic behaviors including lane changes, sudden stops, jaywalking and other unpredictable movements. The test environment features diverse road types (highways, urban streets, intersections) and varying traffic conditions (light to heavy). The simulation concludes either when the agent deviates significantly (>30 m) from the designated route or remains stationary for an extended period (180 s). To ensure thorough evaluation and statistical reliability, we conduct three independent trials for each route with different random seeds.
The testing is performed in the previously unseen challenging Town05 environment, which presents 10 distinct routes with unique navigation challenges, under 4 novel weather conditions that test the model’s ability to handle varying lighting, visibility and road surface conditions. We select Town05 as our test environment because it offers a comprehensive mix of challenging scenarios not present in training towns, including complex intersections, varying elevations and diverse road types, making it ideal for evaluating model generalization.
5.1.2. Evaluation Metrics
We adopt the following metrics for comprehensive evaluation:
Route Completion (RC): The percentage of the designated route successfully completed by the autonomous agent. This metric directly measures the agent’s ability to navigate and follow the planned path.
Infraction Score (IS): A metric that quantifies the agent’s adherence to traffic rules and safe driving practices. The score accounts for various traffic violations such as red light infractions, off-road driving and collisions, with higher scores indicating better compliance with traffic regulations.
Driving Score (DS): The primary performance metric calculated as the product of Route Completion (RC) and Infraction Score (IS). This comprehensive metric evaluates both the agent’s ability to complete routes and its adherence to safe driving practices.
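For a single route, the metric composition is a simple product; the toy calculation below uses illustrative values (not results from our experiments) to show how RC and IS combine into DS.

```python
route_completion = 0.80   # 80% of the designated route completed
infraction_score = 0.50   # penalty multiplier from accumulated infractions
driving_score = route_completion * infraction_score
print(f"DS = {driving_score:.2f}")  # DS = 0.40
```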
5.1.3. Implementation Details
Our model is implemented in PyTorch 2.1, with detailed hyperparameter settings as shown in Table 2. We chose ResNet-18 over more complex Transformer-based architectures as our backbone to ensure that the performance improvements can be attributed to our novel spatial-aware representation-learning approach and BYOL-inspired framework design, rather than being confounded by the inherent capabilities of a more sophisticated backbone architecture. The input images are resized to 600 × 960 (height × width). The experimental configuration includes training on 2 NVIDIA RTX 4090 GPUs, employing an Adam optimizer with the learning rate listed in Table 2 and a batch size of 8 (determined through empirical tuning to balance computational efficiency and model performance), and running for 3 epochs. The EMA decay rate (also listed in Table 2) was selected based on the original BYOL paper [6], as this value has been shown to provide stable training dynamics for self-supervised representation learning.
5.2. Comparison with Mainstream Methods
Table 3 presents a comprehensive comparison between our BYOL-Drive and mainstream end-to-end autonomous driving methods. Our analysis reveals several significant findings regarding the effectiveness of our approach.
In terms of driving performance and supervision requirements, BYOL-Drive achieves remarkable results with a driving score of 68.6, while only requiring expert trajectories and map segmentation as supervision. This represents a substantial improvement over traditional approaches like Transfuser, which, despite utilizing extensive supervision including depth estimation, semantic segmentation, map segmentation and 3D detection, only achieves a driving score of 31.0. Even a more recent method such as LAV, which employs semantic segmentation, map segmentation and detection supervision, falls short with a driving score of 46.5. Our performance is particularly noteworthy when compared to methods with similar supervision levels, such as TCP, which uses only expert demonstrations yet achieves a lower driving score of 57.2.
The robustness of our approach is evidenced by the exceptional route-completion rate of 96.5%. This performance is comparable to state-of-the-art methods like MILE (97.4%) and ThinkTwice (95.5%), while significantly outperforming methods with similar supervision like TCP (80.4%). The consistency of our approach is further demonstrated by the minimal variance (±0.8) across different evaluation runs, indicating stable and reliable performance across various driving scenarios.
In addition, we find that some methods, such as ThinkTwice and TCP, occasionally exhibit causal confusion issues [30], where models that use ego-vehicle speed as input tend to learn a shortcut: applying brakes when speed is zero, due to the high correlation between zero speed and braking in training data. However, zero speed alone should not determine braking decisions. This can lead to problematic behaviors where ego-vehicles remain stationary indefinitely once stopped, as the model continuously applies brakes because the speed is zero. This issue likely affects ThinkTwice and TCP since they use ego-vehicle speed as input. In contrast, our model deliberately avoids using ego-vehicle speed as input, thereby circumventing this causal confusion problem. While the quantitative performance differences may appear modest, our approach demonstrates more robust and theoretically sound behavior by avoiding such systematic failure modes.
Safety metrics further validate the effectiveness of our approach, with BYOL-Drive achieving an infraction score of 0.71. This matches the performance of state-of-the-art approaches like DriveAdapter (0.72) and surpasses some heavily supervised methods like ThinkTwice (0.69). The low standard deviation of ±0.03 in the infraction score demonstrates the consistent safety performance of our system across different driving conditions. In our analysis of infraction types, we find that red light violations are most common, primarily due to the small size of traffic light objects making them challenging for the model to reliably detect. Despite this limitation, the overall infraction rate remains competitive with state-of-the-art methods.
When examining the overall landscape of autonomous driving methods, BYOL-Drive stands out for its balanced performance across all metrics. Our approach achieves competitive results with the latest state-of-the-art methods such as Interfuser (DS: 68.3) and DriveAdapter (DS: 65.9), while showing dramatic improvements over early approaches like CILRS (DS: 7.8) and LBC (DS: 12.3). The combination of high route-completion rate (96.5%) and strong safety metrics (IS: 0.71) demonstrates that our method successfully balances the critical aspects of autonomous driving performance.
These comprehensive results demonstrate that BYOL-Drive successfully achieves state-of-the-art performance while significantly reducing the annotation burden compared to existing methods. Our approach proves that well-designed self-supervised learning can not only match but often exceed the performance of heavily supervised approaches across all key metrics, representing a significant step forward in efficient autonomous driving systems.
5.3. Visualization of Learned Spatial-Aware State Representation
To analyze the characteristics of our learned spatial-aware state representations, we randomly selected eight samples and visualized their latent embeddings (formulated as matrix-variate Gaussian distributions in Section 4.1.1), as shown in Figure 5. Our visualizations show that the learned representations effectively capture essential features corresponding to observed objects. This indicates the interpretability of our representations; for instance, road structures are consistently visible across all samples in the latent space. In the first image, we can clearly observe the vehicle’s manifestation in the latent representation, while in the second image, when the vehicle disappears from view, its corresponding latent features also vanish accordingly. Similar phenomena can be observed in the third and fourth images, as well as in subsequent samples. Through this intuitive demonstration, we can appreciate both the potential of our learned representations in capturing environmental factors and their capability to enhance the interpretability of autonomous driving systems. Compared to methods like MILE [19] that represent states as flat vectors, our approach significantly reduces the model’s burden of learning spatial relationships. This allows the model to focus more on predicting future states rather than having to implicitly learn spatial encoding, ultimately leading to better environmental understanding.
To further validate the effectiveness of our spatial-aware representation-learning approach, we examine the model’s ability to preserve driving-critical information without explicitly reconstructing images. As shown in Figure 6, our model successfully captures and retains essential semantic and geometric information in the latent space through the supervision of BeV labels and our spatial-aware representation design. The generated semantic BeV visualizations demonstrate that our learned representations effectively encode key driving elements, including vehicles, pedestrians, traffic lights, lane markings and road structures. This is achieved while avoiding the computational overhead and potential information loss associated with full image reconstruction approaches.
The qualitative results illustrate how our model maintains precise spatial relationships and semantic understanding in the latent space. The clear delineation of road boundaries, accurate positioning of dynamic objects and preservation of critical traffic elements in the generated BeV images validate that our representation-learning strategy effectively distills and retains the most relevant information for autonomous driving. This focused approach to information preservation, targeting specifically what matters for the driving task, contributes to the strong quantitative performance demonstrated in our experiments while maintaining computational efficiency.
5.4. Ablation Studies
To validate the effectiveness of each component in our framework, we conduct extensive ablation studies.
Table 4 shows the results of different model variants on the Town05 Long benchmark.
The ablation study results in Table 4 provide several key insights into the effectiveness of our framework components. First, the baseline model with only expert supervision achieves modest performance with a driving score of 39.1, a route-completion rate of 65.6% and an infraction score of 0.53. This indicates that expert demonstrations alone provide a reasonable foundation but leave significant room for improvement.
Adding BeV supervision leads to substantial gains across all metrics, with the driving score increasing by 16.5 points to 55.6, route completion improving by 17.6% to 83.2% and infraction score rising by 0.14 to 0.67. This dramatic improvement demonstrates that explicitly modeling spatial relationships in bird’s-eye view is crucial for robust driving performance. The BeV representation helps the model better understand the geometric relationships between objects and the environment, leading to more informed decision-making.
Incorporating BYOL self-supervised learning to the baseline yields more modest but still notable improvements, with the driving score increasing to 41.6, route-completion rate to 67.6% and infraction score to 0.62. While the gains are smaller compared to BeV supervision, they suggest that self-supervised learning helps the model learn more robust and generalizable features from the input data, even without additional supervision signals.
Most significantly, combining both BeV supervision and BYOL self-supervised learning (our full model) produces dramatic synergistic improvements. The driving score increases substantially to 68.6, representing a 75% improvement over the baseline. The route-completion rate reaches an impressive 96.5%, and the infraction score improves to 0.71. The reduced standard deviations across all metrics (±1.0 for DS, ±0.8 for RC, ±0.03 for IS) also indicate that the full model achieves more stable and consistent performance.
These results demonstrate that BeV supervision and BYOL self-supervised learning provide complementary benefits. While BeV supervision helps the model learn crucial spatial relationships, BYOL enables more robust feature learning from unlabeled data. When combined, these components create a framework that significantly outperforms simpler variants while maintaining relatively modest supervision requirements.
6. Discussion
From an industrial applicability perspective, our framework’s camera-centric approach offers significant economic advantages over multi-sensor solutions. By achieving comparable performance using only monocular cameras rather than expensive LiDAR systems, the approach reduces hardware costs while maintaining simpler calibration workflows. This cost-effectiveness could accelerate commercial deployment in mid-range vehicles and robotaxi fleets, particularly when integrated with existing ADAS architectures.
While our framework demonstrates strong performance in end-to-end autonomous driving, there are several limitations worth discussing. First, although we aim to reduce reliance on expensive annotations, our approach still requires expert driving trajectories as supervision signals. This dependence on expert demonstrations means we cannot fully leverage the vast amount of naturalistic driving data available that may lack expert labels. Second, while the map prediction auxiliary task helps improve performance, it introduces additional supervision requirements through map annotations. Future work could explore ways to learn from raw sensor data alone without any form of supervision.
Furthermore, our current approach relies mainly on visual data, potentially missing valuable 3D spatial information that could be obtained from additional sensors such as LiDAR and radar. These alternative sensing technologies could help address challenges related to varying lighting conditions and the restricted camera field of view, so the model’s performance could potentially be enhanced by incorporating multi-modal fusion strategies. In addition, while our BYOL-based self-supervised learning shows promising results, the bootstrapped prediction objective may not capture all the nuanced temporal dynamics present in driving scenarios; more sophisticated self-supervised learning techniques specifically designed for sequential decision-making tasks could be explored.
Another limitation is that our evaluation focuses mainly on the CARLA simulator environment. While simulation provides a controlled testing ground, real-world driving conditions present additional challenges like varying weather conditions, lighting changes and complex multi-agent interactions that may not be fully captured in our current framework. To address this limitation, we plan to explore semi-supervised and unsupervised adaptation techniques that can leverage real-world driving data without requiring expert demonstrations. Specifically, we aim to investigate domain adaptation methods that can transfer knowledge from simulation to real-world scenarios by learning domain-invariant features. Additionally, we will explore self-training approaches where the model can gradually improve itself by generating pseudo-labels on unlabeled real-world data. These techniques, combined with carefully designed data augmentation strategies to bridge the sim-to-real gap, would help ensure that the economic advantages demonstrated in simulation translate effectively to real-world deployment while reducing reliance on expert supervision.
7. Conclusions
In this paper, we presented BYOL-Drive, a novel representation-learning framework for autonomous driving from a world-modeling perspective that operates using only monocular camera input. By leveraging the temporal continuity inherent in driving video data, we introduced BYOL-based self-supervised training to learn robust representations without requiring expensive pixel-space reconstruction. To preserve crucial spatial structural information in driving scenes, we proposed a spatially aware representation approach that enables effective world modeling. Our experimental results demonstrate that our method achieves strong performance using just a single camera, avoiding the need for expensive multi-sensor setups while providing interpretability into the learned representations. Overall, our work makes meaningful contributions toward improving the robustness and interpretability of autonomous driving systems through better representation learning and world-modeling capabilities.
Author Contributions
Conceptualization, H.C. and Y.L.; Data curation, H.C.; Formal analysis, H.C.; Funding acquisition, D.H.; Investigation, H.C.; Methodology, H.C. and Y.L.; Project administration, D.H.; Software, H.C.; Supervision, Y.L. and D.H.; Validation, H.C.; Visualization, H.C.; Writing—original draft, H.C.; Writing—review & editing, Y.L. and D.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by The Science and Technology Innovation Program of Hunan Province (Grant No. 2024QK2006).
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Epstein, R.A.; Patai, E.Z.; Julian, J.B.; Spiers, H.J. The cognitive map in humans: Spatial navigation and beyond. Nat. Neurosci. 2017, 20, 1504–1513.
- Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 3rd ed.; Prentice Hall Press: Upper Saddle River, NJ, USA, 2009.
- Ha, D.; Schmidhuber, J. Recurrent world models facilitate policy evolution. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 3–8 December 2018; pp. 2455–2467.
- Hafner, D.; Lillicrap, T.; Fischer, I.; Villegas, R.; Ha, D.; Lee, H.; Davidson, J. Learning latent dynamics for planning from pixels. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: Brookline, MA, USA, 2019; Volume 97, pp. 2555–2565.
- Assran, M.; Duval, Q.; Misra, I.; Bojanowski, P.; Vincent, P.; Rabbat, M.; LeCun, Y.; Ballas, N. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 15619–15629.
- Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv 2020, arXiv:2006.07733.
- Guan, Y.; Liao, H.; Li, Z.; Hu, J.; Yuan, R.; Li, Y.; Zhang, G.; Xu, C. World Models for Autonomous Driving: An Initial Survey. arXiv 2024, arXiv:2403.02622.
- Feng, T.; Wang, W.; Yang, Y. A Survey of World Models for Autonomous Driving. arXiv 2025, arXiv:2501.11260.
- Bogdoll, D.; Yang, Y.; Zöllner, J.M. MUVO: A multimodal generative world model for autonomous driving with geometric representations. arXiv 2023, arXiv:2311.11762.
- Min, C.; Zhao, D.; Xiao, L.; Nie, Y.; Dai, B. UniWorld: Autonomous driving pre-training via world models. arXiv 2023, arXiv:2308.07234.
- Wang, Y.; He, J.; Fan, L.; Li, H.; Chen, Y.; Zhang, Z. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. arXiv 2023, arXiv:2311.17918.
- Hu, A.; Russell, L.; Yeo, H.; Murez, Z.; Fedoseev, G.; Kendall, A.; Shotton, J.; Corrado, G. GAIA-1: A generative world model for autonomous driving. arXiv 2023, arXiv:2309.17080.
- Wang, X.; Zhu, Z.; Huang, G.; Chen, X.; Zhu, J.; Lu, J. DriveDreamer: Towards real-world-driven world models for autonomous driving. In Proceedings of the Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 55–72.
- Zhang, Z.; Liniger, A.; Dai, D.; Yu, F.; Gool, L.V. TrafficBots: Towards world models for autonomous driving simulation and motion prediction. arXiv 2023, arXiv:2303.04116.
- Jia, F.; Mao, W.; Liu, Y.; Zhao, Y.; Wen, Y.; Zhang, C.; Zhang, X.; Wang, T. ADriver-I: A general world model for autonomous driving. arXiv 2023, arXiv:2311.13549.
- Wang, X.; Zhu, Z.; Huang, G.; Wang, B.; Chen, X.; Lu, J. WorldDreamer: Towards general world models for video generation via predicting masked tokens. arXiv 2024, arXiv:2401.09985.
- Gao, Z.; Mu, Y.; Shen, R.; Chen, C.; Ren, Y.; Chen, J.; Li, S.E.; Luo, P.; Lu, Y. Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model. arXiv 2022, arXiv:2210.04017.
- Zheng, W.; Chen, W.; Huang, Y.; Zhang, B.; Duan, Y.; Lu, J. OccWorld: Learning a 3D occupancy world model for autonomous driving. arXiv 2023, arXiv:2311.16038.
- Hu, A.; Corrado, G.; Griffiths, N.; Murez, Z.; Gurau, C.; Yeo, H.; Kendall, A.; Cipolla, R.; Shotton, J. Model-based imitation learning for urban driving. Adv. Neural Inf. Process. Syst. 2022, 35, 20703–20716.
- Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875.
- Codevilla, F.; Santana, E.; Lopez, A.; Gaidon, A. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9328–9337.
- Chen, D.; Zhou, B.; Koltun, V.; Krähenbühl, P. Learning by cheating. In Proceedings of the Conference on Robot Learning (CoRL), Virtual, 16–18 November 2020; PMLR: Brookline, MA, USA, 2020; pp. 66–75.
- Prakash, A.; Chitta, K.; Geiger, A. Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 7073–7083.
- Zhang, Z.; Liniger, A.; Dai, D.; Yu, F.; Gool, L.V. End-to-end urban driving by imitating a reinforcement learning coach. arXiv 2021, arXiv:2108.08265.
- Chen, D.; Krähenbühl, P. Learning from all vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–20 June 2022; pp. 17222–17231.
- Wu, P.; Jia, X.; Chen, L.; Yan, J.; Li, H.; Qiao, Y. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 6119–6132.
- Jia, X.; Wu, P.; Chen, L.; Xie, J.; He, C.; Yan, J.; Li, H. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 21983–21994.
- Jia, X.; Gao, Y.; Chen, L.; Yan, J.; Liu, P.L.; Li, H. DriveAdapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–5 October 2023; pp. 7919–7929.
- Shao, H.; Wang, L.; Chen, R.; Li, H.; Liu, Y. Safety-enhanced autonomous driving using interpretable sensor fusion transformer. In Proceedings of the 6th Conference on Robot Learning, Paris, France, 14–18 December 2023; Liu, K., Kulic, D., Ichnowski, J., Eds.; PMLR: Brookline, MA, USA, 2023; Volume 205, pp. 726–737.
- de Haan, P.; Jayaraman, D.; Levine, S. Causal confusion in imitation learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).