1. Introduction
Autonomous driving has been a topic of interest for many years, with various advancements being made to improve its safety and efficiency [1]. However, developing a fully autonomous system presents several challenges, such as the need for real-time decision-making and adaptability to changing environments. One of the most promising technologies in this field is deep reinforcement learning (DRL), a branch of machine learning that uses neural networks and reward-based training to enable decision-making [2]. Despite these promising advancements, DRL also presents challenges and ethical implications [3]. For example, it raises concerns about responsibility if an autonomous vehicle causes an accident: determining fault involves assessing whether the manufacturer, the software developer, or the vehicle owner is responsible. There are also concerns about the impact of autonomous driving on employment and the economy. Nonetheless, the potential benefits of DRL in this field are vast, and the technology continues to evolve. For instance, developing more advanced neural networks and integrating other technologies, such as computer vision and natural language processing, can further enhance the decision-making process of autonomous systems.
Deep reinforcement learning enhances autonomous driving systems by allowing agents to learn from their mistakes and improve decision-making over time [4]. For example, an agent can receive a reward for staying within the speed limit and a penalty for breaking it [5]. This reward-based training helps the agent learn from past experiences, making it possible to adapt to different road and weather conditions [6]. As a result, DRL has the potential to revolutionize autonomous driving by enhancing the system's ability to respond effectively to diverse situations.
This work aims to assess autonomous driving tasks in urban environments using DQN agents. To achieve this, several approaches based on DQN agents are investigated. The DQN agent learns a policy, i.e., a set of behaviors, for lane-following tasks by applying visual and driving characteristics obtained from in-vehicle sensors together with a model-based trajectory planner [7,8,9]. This method provides a comprehensive analysis of how deep reinforcement learning is transforming autonomous driving technology.
An end-to-end autonomous system based on the Deep Q-learning algorithm offers several advantages over traditional approaches [10]. Its simplicity lies in the seamless integration of perception, prediction, and planning into a unified model that can be trained jointly. Our study proposes an RL-based autonomous driving system, emphasizing more informative exploration and improved reward signaling. We evaluate the performance of this system in urban environments using the DDQN approach combined with Long Short-Term Memory (LSTM) [11].
We develop an intelligent driving agent capable of navigating complex environments along predetermined routes [12,13], such as those in the CARLA simulator, to validate our approach [14]. We also analyze various design decisions to determine the best configurations for training autonomous driving agents using reinforcement learning. Additionally, we demonstrate how training methods can significantly impact the performance of a deep RL agent. Finally, we validate our approach in various challenging traffic scenarios and show that our method outperforms previous state-of-the-art techniques.
2. Overview of Deep Reinforcement Learning in Autonomous Driving
2.1. Reinforcement Learning
RL is a key artificial intelligence technique for tackling complex problems like robotics [15], industry automation, natural language processing, and autonomous driving. In autonomous driving, RL trains vehicles to make decisions in dynamic environments. The vehicle, acting as the agent, interacts with its surroundings, including roads, traffic, and pedestrians, by taking actions such as steering, accelerating, and braking. Each action earns rewards or penalties based on its safety and effectiveness, guiding the vehicle to improve over time.
Autonomous driving with RL involves balancing exploration, i.e., trying new maneuvers, and exploitation, i.e., using proven strategies, to maximize rewards [16]. Simulations play a vital role in training RL models safely before real-world deployment, addressing the challenges of complex, dynamic environments. Unlike supervised learning, where explicit examples are provided, RL relies on trial and error to optimize actions, using feedback to refine strategies and ensure safe, efficient performance. The general agent–environment interaction process in RL is shown in Figure 1.
Learning and decision-making in RL are driven by the interaction between the agent and the environment. At a given time step t, the environment is in a certain state, S_t, from which the agent takes an action, A_t, according to its policy, which maps states to actions. The execution of action A_t causes the environment to transition into the next state, S_{t+1}, and yields a reward R_t for the agent. The agent then observes this reward and the new state S_{t+1}. At the next time step, t+1, the agent takes an updated action, A_{t+1}, in response to its updated state. This cycle repeats: the agent selects an action, receives a reward, and observes the next state, trying to maximize rewards over time, measured as a sum of rewards, discounted rewards, or another metric of long-term benefit. This creates an iterative feedback loop through which the agent can improve its policy and refine its decision-making strategy [18].
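To make this interaction loop concrete, the minimal sketch below runs a random policy in a standard Gym-style environment (CartPole is used purely as an illustration and is not the CARLA setup used in this work):

import gymnasium as gym

# Illustrative agent-environment loop: a random policy interacting with CartPole.
env = gym.make("CartPole-v1")
num_episodes = 5

for episode in range(num_episodes):
    state, _ = env.reset()                                   # initial state S_0
    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()                   # A_t (random policy here)
        state, reward, terminated, truncated, _ = env.step(action)  # S_{t+1}, R_t
        episode_return += reward                             # accumulate the return
        done = terminated or truncated
    print(f"episode {episode}: return = {episode_return}")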
2.2. Deep Reinforcement Learning
DRL has become a key approach for training autonomous vehicles to make complex decisions in dynamic environments [19]. By combining reinforcement learning and deep learning, DRL allows vehicles to navigate, plan, and control effectively while handling tasks like lane changes, intersections, and obstacle avoidance. Training often occurs in simulated environments, enabling the safe exploration and refinement of strategies. Using reward systems, DRL optimizes decision-making by encouraging safe behaviors and penalizing unsafe actions [20]. Techniques like Convolutional Neural Networks process sensory data to enhance perception and planning. Despite its promise, challenges remain, such as ensuring generalization to new scenarios, improving safety mechanisms, and addressing computational efficiency. DRL's hierarchical structures help tackle complex traffic scenarios, improving speed, trajectory, and collision avoidance, while algorithms like DQN are evaluated for maneuver planning. Simulations significantly reduce real-world testing needs, minimizing risks and enhancing algorithm performance.
2.3. Deep Q-Networks (DQNs)
DQNs combine Q-learning with deep neural networks to enable reinforcement learning agents to handle high-dimensional state spaces. By approximating the Q-value function using a neural network, DQNs allow agents to learn optimal policies directly from raw inputs, such as images or sensor data [8]. This approach has been successfully applied in various domains, including playing video games, robotics, and autonomous driving, showcasing its ability to solve complex decision-making problems.
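For reference, the standard DQN objective minimizes the temporal-difference error between the predicted Q-value (online parameters \theta) and a bootstrapped target computed with a periodically updated target network (parameters \theta^{-}) over transitions sampled from a replay buffer \mathcal{D}:

L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]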
2.4. Double Deep Q-Networks (DDQNs)
DDQN is an enhancement of the original DQN algorithm designed to address the overestimation bias in Q-learning, which arises when the same network is used to both select and evaluate actions, leading to inflated Q-values [21]. Unlike DQN, DDQN employs two separate networks: an online network for action selection and a target network for Q-value evaluation. This decoupling reduces bias and improves learning stability, enhancing performance in complex environments by ensuring more accurate Q-value updates. Recent advancements in autonomous driving have therefore leveraged Double Deep Q-Network (DDQN)-based deep reinforcement learning (DRL) algorithms to enhance decision-making and control, and several novel approaches have emerged, each addressing specific challenges. Notable examples include Dueling DDQN for lane-keeping, which improves stability by separately estimating state value and advantage functions [22], and Game-Theoretic DDQN for intersection control, enabling cooperative vehicle interactions [23]. The Double Broad Q-Network (DBQN) enhances overtaking decisions using a broad learning system [24], while Hierarchical Dueling DDQN improves sparse-reward learning for complex driving tasks [25]. Lastly, LK-TDDQN applies transfer learning to adapt lane-keeping strategies across environments [26]. These innovations demonstrate the versatility of DDQN-based DRL in autonomous driving.
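The key difference from standard DQN can be summarized by the DDQN target, in which the online network (parameters \theta) selects the next action and the target network (parameters \theta^{-}) evaluates it:

y_t^{DDQN} = r_t + \gamma \, Q\left( s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta); \theta^{-} \right)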
3. Benchmarking—Urban Driving Simulator
Simulation is a cost-effective and safe alternative to real-world testing for autonomous systems. It allows developers to quickly create and evaluate prototypes, iterate on designs, and explore various scenarios. Simulators also provide precise performance measurement tools, enabling the in-depth analysis and optimization of algorithms. This approach accelerates development and enhances safety by allowing for comprehensive testing in controlled virtual environments.
CARLA (Car Learning to Act) is an open-source simulator for urban driving specifically designed for the study of autonomous vehicles [27]. It was built from the ground up to simplify the creation, training, and evaluation of autonomous driving systems in urban environments. CARLA provides tools to perform detailed simulations and evaluate system performance. In addition to its open-source code and protocols, CARLA offers open digital assets, such as urban layouts, buildings, and vehicles, that users can use independently. The deep customization of sensor sets and environmental conditions is possible through the platform, enabling accurate simulation [28,29].
Each town in CARLA possesses its own unique characteristics. The towns shown in Figure 2 were utilized as the primary platform for training our agent and evaluating its performance.
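As a point of reference, connecting to a CARLA server, loading a town, and spawning an ego vehicle with a front-facing camera follows the pattern below. This is a minimal sketch using the CARLA 0.9.13 Python API; the host, port, blueprint choice, and camera placement values are illustrative rather than our exact configuration.

import random
import carla

# Connect to a running CARLA server (default host/port assumed).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)

# Load an urban map and fetch the blueprint library.
world = client.load_world("Town05")
blueprint_library = world.get_blueprint_library()

# Spawn an ego vehicle at a random spawn point.
vehicle_bp = blueprint_library.filter("vehicle.tesla.model3")[0]
spawn_point = random.choice(world.get_map().get_spawn_points())
vehicle = world.spawn_actor(vehicle_bp, spawn_point)

# Attach a front-facing RGB camera and stream its frames.
camera_bp = blueprint_library.find("sensor.camera.rgb")
camera_bp.set_attribute("image_size_x", "640")
camera_bp.set_attribute("image_size_y", "480")
camera_transform = carla.Transform(carla.Location(x=2.5, z=0.7))   # placement is illustrative
camera = world.spawn_actor(camera_bp, camera_transform, attach_to=vehicle)
camera.listen(lambda image: image.save_to_disk("_out/%06d.png" % image.frame))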
4. Related Work
In recent years, deep reinforcement learning (DRL) has achieved great success in the field of autonomous vehicles, and several researchers have consequently chosen to use it in their work. Their main challenge is to develop an intelligent agent that can, on the one hand, avoid obstacles and mitigate collisions of autonomous vehicles and, on the other hand, correct errors in autonomous driving pipeline tasks such as decision-making and motion planning. In this context, several approaches have used DQN-based DRL algorithms, which have demonstrated great effectiveness in ensuring safe navigation in various simulated dynamic environments, notably CARLA [30].
Elallid et al. (2022) present an approach that uses the CARLA simulator, designed to imitate real-world streets, to train and validate autonomous vehicles [9]. The method controls an AV in complicated environments using a DQN model. The car is equipped with a front-facing camera that takes real-time pictures. The captured images, originally 640 × 480 pixels in RGB, are first converted to grayscale and then resized to 192 × 256 pixels. These processed images are then passed through two dense layers with 512 and 256 neurons, enabling the model to generate 190 alternative actions. In the CARLA environment, a 389 m course with right turns and intersections was designed. Ten pedestrians and five vehicles were added to make the environment more dynamic and realistic. The model was trained for 5000 episodes, with a mini-batch size of 16, using a replay memory to learn from past experiences. The results show that the model learns effectively across episodes: average rewards increase as the success rate of actions improves, and the collision rate gradually decreases until it reaches almost zero. This demonstrates that the AV learns to avoid accidents with other vehicles and pedestrians in the environment, ensuring safe driving.
Hossain et al. (2023) proposed a model based on a deep neural network to implement the DQN algorithm, used to approximate the Q-value function [31]. This function evaluates the quality of each possible action in each state. The agent, representing the autonomous car, interacts with a simulated environment, where it receives observations, chooses actions, and learns to maximize cumulative rewards over time. The model architecture consists of several layers designed to process the observations of the environment and produce the Q-values associated with each possible action. The observations, constituting the input space, are represented by a 5 × 5 array describing the vehicles in the vicinity of the autonomous car. Each row of the array corresponds to a vehicle, with columns indicating the following characteristics: position (x, y) and speed (Vx, Vy). This information is processed by the neural network to determine the optimal action. The neural network is a Multi-Layer Perceptron (MLP) comprising several fully connected layers. These layers learn to identify complex relationships between vehicles, such as proximity and relative speed, to assess the quality of possible actions. Activation functions such as ReLU (Rectified Linear Unit) introduce non-linearity and enable the model to better approximate the Q-function. As an output, the network produces a vector of Q-values, one for each possible action in the environment. The action space comprises five possible actions: change lanes to the left (LANELEFT), stay put (IDLE), change lanes to the right (LANERIGHT), accelerate (FASTER), and slow down (SLOWER). Each Q-value represents the quality of the associated action in the current state of the environment. The agent then selects the action with the highest Q-value, corresponding to the one that maximizes the expected cumulative reward. The reward function is designed to encourage fast and safe driving behavior: the agent receives a reward of 0.1 points when staying in the right lane and a reward of 0.4 points when maintaining a high speed. There is no specific penalty or incentive for lane changes, which therefore do not directly affect the reward (0 points). For each action performed, the agent receives these rewards, which incentivizes it to adopt a behavior that maximizes speed while avoiding collisions.
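To make this kind of architecture concrete, the sketch below shows a minimal PyTorch MLP Q-network for a 5 × 5 kinematic observation and five discrete actions. The hidden-layer widths are illustrative assumptions, not the configuration of the cited work.

import torch
import torch.nn as nn

class MlpQNetwork(nn.Module):
    """Q-network for a 5 x 5 kinematic observation and 5 discrete actions."""
    def __init__(self, obs_dim: int = 5 * 5, num_actions: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                 # (batch, 5, 5) -> (batch, 25)
            nn.Linear(obs_dim, 128),      # hidden sizes are illustrative
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),  # one Q-value per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Usage: greedy action selection from a batch of observations.
q_net = MlpQNetwork()
obs = torch.randn(1, 5, 5)               # dummy observation (ego vehicle + neighbours)
action = q_net(obs).argmax(dim=1)        # index into {LANELEFT, IDLE, LANERIGHT, FASTER, SLOWER}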
Tammewar et al. (2023) studied the improvement of autonomous driving performance using DRL [32]. The approach involves training a simulated vehicle to navigate autonomously on a racing track using the DQN algorithm. The system uses the CarRacing-v2 simulator, which provides a top view of a randomly generated track. The vehicle receives visual information from a front camera and interacts with the environment via actions. The model receives input images from the front camera as RGB pixels (96 × 96 pixels in this case). The images are subsequently converted to grayscale to reduce computational complexity and focus the model's attention on important structural aspects of the image, such as the contours of the track. These images capture information about the vehicle's environment (speed, direction, etc.). The inputs are then processed using CNNs to extract relevant features. To capture temporal dependencies and understand the dynamics of vehicle movement across images, a recurrent neural network (LSTM) is used after the CNN layers. The goal of this component is to allow the model to retain information over multiple time steps and to adjust its actions based on past trajectories. Rewards are assigned based on the coverage of track tiles, while penalties are applied when the vehicle goes off track. As an output, the model chooses among the possible actions (acceleration, braking, steering) based on the extracted features. In the continuous version, the actions are represented by three parameters: direction (from −1 to 1), acceleration (from 0 to 1), and braking (from 0 to 1). These actions aim to help the vehicle navigate the track while maximizing the reward obtained. The results show that the DQN algorithm with epsilon decay (ε-decay) performed well, providing excellent stability, efficiency, and cumulative scores over episodes for the autonomous navigation task.
In these approaches, although the agents were tested in various simulation environments, their performance may not generalize to other real-world environments with very different driving scenarios. Likewise, the policies learned may be too specialized for the specific conditions of the simulation (traffic, weather, road infrastructure), which limits their applicability in varied real-life situations.
5. Materials and Methods
The main methodology of this work is to introduce a novel approach based on deep reinforcement learning that enables a car to drive autonomously in a virtual environment. Since the system is based on a computer-generated environment, the CARLA 0.9.13 autonomous driving simulator was used. Our research focused on the impact of various hyperparameters: the effect of the learning rate of each model on the convergence and robustness of learning was studied. Our main performance measures were the consistency of training each model over a given number of episodes, with episode reward and learning stability as the primary indicators. This approach facilitated a controlled and reproducible comparison between models, ensuring that any observed performance disparities were related to the intrinsic characteristics of the models and the chosen hyperparameters.
In this work, we propose a novel architecture that combines CNNs, LSTM, and Deep Q-learning with a DDQN to tackle reinforcement learning tasks [33,34,35]. This hybrid model leverages CNNs to extract spatial features from image data and LSTMs to capture the temporal dependencies in sequential data, making it particularly well suited for environments where inputs are image sequences (e.g., video frames).
The integration of DDQN enhances the reinforcement learning component by addressing the overestimation bias common in standard Q-learning [14]. This allows the model to make more stable and accurate decisions in dynamic and complex environments. To prevent overfitting, dropout is applied within the convolutional layers, which is crucial when working with high-dimensional input data and limited training samples.
The architecture is modular and flexible, enabling the configuration of key parameters such as the number of LSTM layers, hidden units, and CNN filters to adapt to various task complexities. By combining the strengths of CNNs, LSTMs, and DDQN, this approach presents a robust and efficient solution for reinforcement learning tasks involving sequential image data. The model's structure is shown in Figure 3.
5.1. Model Architecture and State Space
In our scenario, we process a stack of four RGB images captured by the front camera of the autonomous vehicle (AV). Initially, each image has 640 × 480 × 3 pixels. We resize it to 84 × 84 × 3 pixels and then convert it to grayscale. This transformation yields a new state, denoted S_t, with dimensions of 84 × 84 × 1, which is fed into the input of the neural network. The model combines convolutional layers with an LSTM, followed by fully connected layers for the final predictions. By correcting the overestimation of action values that can occur in the original algorithm, DDQN enhances the conventional Q-learning algorithm.
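A minimal sketch of this preprocessing step is shown below, using OpenCV and NumPy. The specific library calls, the [0, 1] normalization, and the frame-padding at episode start are assumptions for illustration, not the authors' exact code.

import cv2
import numpy as np
from collections import deque

def preprocess(frame_bgr: np.ndarray) -> np.ndarray:
    """Resize a 640 x 480 x 3 camera frame to 84 x 84 and convert it to grayscale."""
    resized = cv2.resize(frame_bgr, (84, 84), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)           # 84 x 84 single channel
    return gray.astype(np.float32) / 255.0                     # normalize to [0, 1]

# Keep a rolling stack of the four most recent frames as the state S_t.
frame_stack = deque(maxlen=4)

def build_state(new_frame: np.ndarray) -> np.ndarray:
    frame_stack.append(preprocess(new_frame))
    while len(frame_stack) < 4:                                 # pad at episode start
        frame_stack.append(frame_stack[-1])
    return np.stack(frame_stack, axis=0)                        # shape (4, 84, 84)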
In standard Q-learning, the algorithm learns the Q-value of each action in each state, which represents the anticipated future benefit of performing that action in that state. However, when function approximation is used, as is common in large state spaces, these Q-values can become overestimated, which may lead to suboptimal decisions.
DDQN resolves this problem by introducing a second network: the primary (online) network is used to select the action with the highest estimated Q-value, while the secondary (target) network is used to evaluate the Q-value of the selected action. Decoupling action selection from action evaluation reduces the overestimation of Q-values.
Our architecture is mainly based on multiple layers such as convolutional layers. The number of convolutional layers is four, as shown in the architecture figure. The first convolutional layer applies 32 convolutional filters with a kernel size of 8 × 8 and a stride size of 4. It reduces the spatial dimensions of the input while extracting the initial feature maps. A dropout with a probability of 0.4 is applied after the first convolutional layer to mitigate overfitting. The second convolutional layer uses 64 convolutional filters with a kernel size of 4 × 4 and a stride size of 2 to extract more features and reduce the spatial dimensions. A dropout with a probability of 0.4 is applied after the second convolutional layer. For the third convolutional layer, 64 convolutional filters with a kernel size of 3 × 3 and a stride size of 1 are used to add complexity to feature extraction. Two max pooling layers, each with a kernel size and stride of 2, are used to further downsample the feature maps and capture the most salient features. A fourth convolutional layer with 64 filters, a kernel size of 3 × 3, and a stride of 1 is used to refine the features.
As shown in Figure 4, after the convolution and pooling operations, the spatial dimensions are flattened to prepare the data for the LSTM layer. The tensor is reshaped from (batch_size * seq_len, c, h, w) to (batch_size, seq_len, −1), where −1 automatically calculates the size of the flattened features. Afterward, an LSTM layer captures the temporal dependencies in the sequence of image frames; it consists of lstm_hidden_size hidden units and num_lstm_layers layers, and its input_size is the flattened size of the output of the convolutional layers. Next, a first fully connected layer transforms the LSTM output into a 512-dimensional vector. Finally, the last fully connected layer maps the 512-dimensional vector to a vector of size num_actions, representing the Q-values for each possible action in the reinforcement learning task.
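The description above can be summarized in the following PyTorch sketch of the CNN-LSTM Q-network. This is a simplified reading of the architecture in Figure 4: the padding of the 3 × 3 convolutions, the placement of the pooling layers, and the default sizes are assumptions made to keep the sketch runnable, not the authors' exact implementation.

import torch
import torch.nn as nn

class CnnLstmDDQN(nn.Module):
    """CNN-LSTM Q-network: per-frame spatial features, temporal fusion over the frame stack."""
    def __init__(self, num_actions: int = 6, lstm_hidden_size: int = 256, num_lstm_layers: int = 1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(), nn.Dropout2d(0.4),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(), nn.Dropout2d(0.4),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Infer the flattened feature size with a dummy 84 x 84 grayscale frame.
        with torch.no_grad():
            feat_dim = self.conv(torch.zeros(1, 1, 84, 84)).flatten(1).shape[1]
        self.lstm = nn.LSTM(feat_dim, lstm_hidden_size, num_lstm_layers, batch_first=True)
        self.fc1 = nn.Linear(lstm_hidden_size, 512)
        self.fc2 = nn.Linear(512, num_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, 1, 84, 84), a sequence of preprocessed frames.
        b, s, c, h, w = x.shape
        feats = self.conv(x.view(b * s, c, h, w)).flatten(1)    # (b*s, feat_dim)
        feats = feats.view(b, s, -1)                            # (b, s, feat_dim)
        out, _ = self.lstm(feats)                               # temporal dependencies
        q_values = self.fc2(torch.relu(self.fc1(out[:, -1])))   # Q-values from the last step
        return q_values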
For the forward pass, the input x is processed through the convolutional layers, followed by the LSTM layer to capture temporal dependencies, and finally through the fully connected layers to produce the Q-values. In this model, we aim to design a robust architecture for reinforcement learning tasks involving image sequences, exploiting the strengths of convolutional and recurrent neural networks to process and learn from complex temporal data. The architecture of our system is a DDQN with the inputs and outputs shown in Table 1:
5.2. Reward Function
The reward function is structured to guide the agent's behavior based on its interactions with the environment. The primary factors influencing the reward are collisions, the duration of the episode, and whether the agent avoids obstacles during its driving task. The reward function in this method is based on several specific scenarios, described below. Firstly, in the event of a collision, detected by the presence of items in the collision_hist list, the reward is set to −20, reflecting a severe penalty for this incident; the episode immediately ends with done equal to True, marking the end of the episode following an accident. Then, if no collision occurs and the car continues to run smoothly, a +5 reward is awarded. This encourages collision-free driving, and the episode continues with done equal to False, allowing the agent to continue without interruption. Finally, if the episode lasts longer than 30 s, a significant reward of 250 is awarded and the episode ends with done equal to True, reflecting a bonus for driving through the entire episode without major incident. This scenario offers an additional incentive to maintain prolonged, safe driving. In summary, this reward structure strongly penalizes collisions while promoting continuous, accident-free driving, enabling the agent to learn to avoid collisions and drive stably. Here is a summary of the reward function in pseudo-code form:
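The following Python reconstruction follows the rules just described; the helper name and the time handling are illustrative rather than the exact implementation.

import time

def compute_reward(collision_hist, episode_start, now):
    """Reward shaping reconstructed from the rules described above."""
    if len(collision_hist) != 0:       # a collision was recorded by the collision sensor
        return -20, True               # severe penalty, episode ends (done = True)
    if now - episode_start > 30:       # the episode lasted longer than 30 s
        return 250, True               # large bonus for prolonged safe driving, episode ends
    return 5, False                    # collision-free step, episode continues (done = False)

# Usage inside the driving loop:
# reward, done = compute_reward(collision_hist, episode_start, time.time())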
This encourages the agent to avoid collisions and continue driving until the episode ends.
5.3. Mounted Sensors and Hyperparameters
Sensors mounted on autonomous vehicles play a crucial role in their ability to perceive and understand their environment in real time to make safe and efficient decisions. To achieve this, several types of sensors are integrated into autonomous vehicles, each providing specific data to analyze different aspects of the environment. The following table introduces the types of data commonly captured by these sensors and their respective functions:
Table 2 outlines the two sensors used in our simulation, with details on their attributes, functions, and roles in controlling the autonomous vehicle.
It is also necessary to define the set of hyperparameters that serve as external configuration variables used to control model training. These hyperparameters determine important model properties such as design, learning rate, and complexity. Table 3 below details the hyperparameters:
These hyperparameters control various aspects of the neural network architecture, the reinforcement learning algorithm, and the training dynamics.
5.4. Action Space
Our model takes forward-facing RGB camera images from the CARLA simulator as input. Each image is converted to grayscale and resized to 84 × 84 for processing. The output of our system is an action: there are 6 possible actions (combinations of steering and throttle). In the CARLA simulator environment, the AV interacts with its environment using four main control commands: steer left, steer right, go straight, and slow down. These commands are represented as integer values in the range 0 to 5. Since DDQN is a discrete DRL algorithm, the agent must make discrete action choices as per Table 4. The possible actions to be taken by the car are shown in the following table:
Algorithm
The following pseudo-code presents a clear and concise implementation of the DDQN for autonomous navigation:
Initialize CARLA environment and sensors
For each episode:
    Reset environment and variables (collisions, camera, etc.)
    As long as the episode has not ended:
        Choose an action (epsilon-greedy)
        Execute action in environment
        Obtain new observation, reward, collision status, etc.
        Add transition to buffer
        If buffer is sufficient:
            Sample a batch of transitions
            Calculate Q-targets
            Calculate loss between current Q-values and targets
            Update model weights by minimizing loss
        If episode is over or time is up, stop
    Every N episodes:
        Update target network with online network weights
        Record model weights
    Track rewards and display results
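A condensed Python sketch of the core update step in this loop is given below, using PyTorch. The replay buffer is assumed to return batched tensors, the online and target networks are instances of the Q-network described in Section 5, and the batch size and discount factor are placeholder values.

import torch
import torch.nn.functional as F

def ddqn_update(online_net, target_net, optimizer, replay_buffer,
                batch_size=32, gamma=0.99):
    """One gradient step of Double DQN on a sampled mini-batch of transitions."""
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

    # The online network produces the current Q-values and selects the next actions.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)

    # The target network evaluates the actions chosen by the online network.
    with torch.no_grad():
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)   # Huber loss between Q-values and targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()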
6. Results and Discussion
The suggested approach was validated using the free autonomous driving simulator CARLA [3,9,34]. Several training simulations were carried out in the Town 05 environment using CARLA 0.9.13. The simulation environment is shown in Figure 5.
Using the Town 05 environment of CARLA 0.9.13, we simulated the behavior of an expert human driver to control an autonomous vehicle using the actions generated by our model. After the model was well trained, we integrated it into the actor network for RL. This approach reduces the gap between states and actions, thus accelerating the training and convergence of the model.
In the CARLA simulator, the starting and destination points were kept consistent, while the traffic conditions varied across the training episodes in Town 05. The autonomous vehicle was entrusted with the task of driving safely and effectively, turning right and left at several intersections in an urban environment.
The experimental simulations and training tests in this work were conducted on a reconditioned HP gaming PC with an Intel® Core™ i5-11800H processor (2.30 GHz), 16 GB of RAM, a 512 GB SSD, an NVIDIA GeForce RTX 3050 GPU, and a 15.6″ HD LED display.
To test our work, the vehicle was initially spawned at a random location in the starting area and had to follow the designated route in the chosen town until reaching its destination, while avoiding collisions with other vehicles in dense traffic. The environment of the chosen town also contains other fixed and mobile objects, including signaling failures, pedestrians, cyclists, motorcycles, and other vehicles.
In Figure 6, the graph represents the reward per episode for the Deep Q-Network (DQN) agent trained in CARLA over 2000 episodes; this method is used in state-of-the-art methodologies. The x-axis denotes the number of episodes, while the y-axis represents the reward value. The reward values exhibit significant fluctuations and spikes throughout the episodes, indicating inconsistent learning. There is no clear trend of continuous improvement, suggesting that the agent struggles to optimize its policy effectively. Additionally, the large variance in rewards implies that the model does not achieve stable learning.
In addition, DQN has several limitations when used for autonomous vehicle (AV) training in reinforcement learning. One major issue is the lack of stability in learning, as the high variance in rewards suggests that DQN struggles to maintain stable Q-values. Furthermore, DQN is designed for discrete action spaces, whereas AVs require continuous control over acceleration, braking, and steering; discretizing these actions limits fine control and reduces overall performance. Another challenge is the exploration versus exploitation trade-off: DQN often suffers from insufficient exploration, leading to suboptimal policies.
Moreover, the algorithm is highly sensitive to hyperparameter tuning, such as learning rate and experience replay buffer size, making training inefficient for AVs. Small changes in parameters can result in catastrophic forgetting or suboptimal learning outcomes. Another crucial limitation is DQN’s inability to handle safety constraints effectively. The high fluctuations in rewards suggest frequent collisions or unsafe driving events, highlighting DQN’s inadequacy for real-world AV training without additional modifications.
Figure 7 shows the epsilon decay over 2000 episodes during the training of the Deep Q-Network (DQN) agent. The x-axis represents the number of episodes, while the y-axis represents the epsilon values.
Epsilon (ε) is the parameter of ε-greedy exploration used in DQN to balance exploration (trying new actions) and exploitation (choosing the best-known action). The graph shows that epsilon starts at 1.0, meaning that the agent initially explores randomly. Over time, epsilon decays exponentially, approaching near-zero values after around 1500 episodes, indicating that the agent shifts from exploration to exploitation.
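An exponential decay of this kind is typically implemented as in the short sketch below; the decay rate and minimum value shown here are illustrative placeholders, not the exact values used in our training.

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.997    # illustrative values

for episode in range(2000):
    # ... run one episode with epsilon-greedy action selection ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)   # shrink exploration over time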
Figure 8 represents the reward per episode for a reinforcement learning (RL) agent trained in CARLA over 2000 episodes. The x-axis represents the number of episodes, while the y-axis represents the reward value obtained by the agent in each episode.
The graph shows the stability and performance of the Double Deep Q-Network (DDQN) method in the CARLA simulator, which is used for autonomous driving tasks. The upward trend in the episode rewards indicates that the agent is improving its driving skills over time. This improvement suggests that the agent is learning to stay in its lane, make the right turns, and avoid collisions more efficiently as it gains experience over the episodes.
Figure 9 shows the epsilon decay over 2000 episodes during the training of the Double Deep Q-Network (DDQN) agent. The x-axis represents the number of episodes, while the y-axis represents the epsilon values.
7. Conclusions
In this study, we developed an autonomous driving system utilizing Deep Q-Networks with Double Q-Learning (DDQN) in the CARLA simulator. Our model, which integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) layers, effectively processes sequential visual input to control a Tesla Model 3 in a complex urban environment. Through reinforcement learning, the system learned to navigate Town 05 by making informed driving decisions, including steering, accelerating, and braking.
One of the key contributions of this work is the implementation of DDQN, which mitigates the overestimation bias commonly found in traditional Deep Q-Networks (DQNs). By using a separate target network and decoupling action selection from value estimation, DDQN significantly enhances training stability and improves policy performance. The combination of a replay buffer and an epsilon-greedy exploration strategy further ensured a balance between learning from past experiences and discovering new driving behaviors.
Our results demonstrate that reinforcement learning can be an effective approach for autonomous driving, with the trained model exhibiting improved decision-making capabilities, reduced collision rates, and smoother driving patterns over time. These findings highlight the potential of deep reinforcement learning for developing intelligent self-driving systems that can adapt to dynamic environments without human intervention.
Despite these promising outcomes, several challenges remain. This study is constrained by the limitations of simulated environments, which do not fully capture the complexities of real-world driving. Additionally, fine-tuning hyperparameters, expanding the training dataset, and incorporating more diverse driving scenarios could further enhance model robustness. Future research should explore the integration of real-world sensory data, adaptive learning techniques, and multi-agent interactions to advance the applicability of reinforcement learning in autonomous driving. By continuing to refine these models and bridge the gap between simulation and real-world deployment, this research contributes to the broader effort of developing safer and more reliable autonomous driving technologies.