Article

Intelligent Robot in Unknown Environments: Walk Path Using Q-Learning and Deep Q-Learning

by Mouna El Wafi 1,*, My Abdelkader Youssefi 1, Rachid Dakir 2 and Mohamed Bakir 1
1 Engineering Laboratory, Industrial Management and Innovation, Faculty of Sciences and Technics, Hassan First University of Settat, Settat 26000, Morocco
2 Laboratory of Computer Systems & Vision, Polydisciplinary Faculty of Ouarzazate, Ibnou Zohr University, Ouarzazate 45000, Morocco
* Author to whom correspondence should be addressed.
Automation 2025, 6(1), 12; https://doi.org/10.3390/automation6010012
Submission received: 6 February 2025 / Revised: 27 February 2025 / Accepted: 6 March 2025 / Published: 18 March 2025

Abstract:
Autonomous navigation is essential for mobile robots to efficiently operate in complex environments. This study investigates Q-learning and Deep Q-learning to improve navigation performance. The research examines their effectiveness in complex maze configurations, focusing on how the epsilon-greedy strategy influences the agent’s ability to reach its goal in minimal time using Q-learning. A distinctive aspect of this work is the adaptive tuning of hyperparameters, where alpha and gamma values are dynamically adjusted throughout training. This eliminates the need for manually fixed parameters and enables the learning algorithm to automatically determine optimal values, ensuring adaptability to diverse environments rather than being constrained to specific cases. By integrating neural networks, Deep Q-learning enhances decision-making in complex navigation tasks. Simulations carried out in MATLAB environments validate the proposed approach, illustrating its effectiveness in resource-constrained systems while preserving robust and efficient decision-making. Experimental results demonstrate that adaptive hyperparameter tuning significantly improves learning efficiency, leading to faster convergence and reduced navigation time. Additionally, Deep Q-learning exhibits superior performance in complex environments, showcasing enhanced decision-making capabilities in high-dimensional state spaces. These findings highlight the advantages of reinforcement learning-based navigation and emphasize how adaptive exploration strategies and dynamic parameter adjustments enhance performance across diverse scenarios.

1. Introduction

With advances in machine learning and artificial intelligence, mobile robots are becoming increasingly popular. To improve their decision-making capabilities, various reinforcement learning algorithms are used. Deep Q-learning, a type of reinforcement learning, is particularly popular in the field of mobile robotics because it can learn the best actions directly from sensor data.
The foundation of mobile robotics includes four main areas: locomotion, understanding the environment (perception), decision-making (cognition), and orientation (navigation) [1]. Locomotion is about using mechanics and control theory to determine how the robot should move. Perception is about analyzing signals and using techniques such as computer vision. Cognition is about interpreting sensor data and deciding what actions to take. Navigation uses planning algorithms and artificial intelligence to help the robot reach its destination [2]. A robot that is expected to perform autonomous mapping and navigation must be endowed with the abilities of environment perception, self-localization, and path planning. There exist various path planning methods, which can be classified into four categories: template matching, artificial potential field, map construction, and artificial intelligence [3]. Each method works best under certain conditions but also has its limitations. Currently, most robot path planning relies heavily on prior knowledge of the environment, which is difficult to obtain in complex settings. Therefore, finding an environment-independent path planning methodology that can adapt to changes in as short a time as possible is very important.
DQN is an abbreviation for Deep Q-Network, which is one possible way to model the environment and compute a cost function for collision energy, a primary cause of functionality loss. To minimize the loss function, gradient descent must be performed to train the neural network at each step of path planning. The generalization ability of the network can be improved by using varied sample data during its learning and training process. However, working with a very large dataset considerably extends the training time [4].
Recent works have focused on using new algorithms, or combinations of existing ones, to improve the performance of mobile robots. Lei, Zhang, and Dong showed that by supplementing the reinforcement learning process with Q-Learning, robots become proficient at dynamic obstacle avoidance and pathfinding within the environment [5]. Wang found that TDDQN enjoys the advantages of fast convergence and low loss, and that for mobile robots, hierarchical reinforcement learning and neural networks used in path planning show superior results compared to other methods [6]. The overall performance is discussed in the research of Yu, Su, and Liao: their approach reduces planning time, diminishes the number of path steps, and speeds up convergence while improving the recognition and movement functions of mobile robots [7]. Ni et al. present a comparative analysis of coverage efficiency between a single mobile robot and multiple mobile robots. They propose introducing an enhanced K-Means clustering algorithm into the process of reorganizing the map areas that need coverage, and then apply deep reinforcement learning with a dueling network. The reward function is further improved by enabling multiple mobile robots to collaborate and cooperate to cover the target area. They concluded that this reduces redundant coverage and shortens the coverage path length, thus enhancing the efficiency and quality of the path planning [8]. Guan, Yang, Jiao, and Chen argued that DQN path planning is improved by incorporating heuristic rewards into adaptive exploration strategies, which helps the algorithm converge quickly and find an effective path rapidly, thus enabling efficient optimal path-finding [9].
Despite significant advancements in reinforcement learning for autonomous navigation, several research gaps remain. There are very few studies comparing how well Q-learning works with and without epsilon-greedy, especially in large and complex environments such as industrial settings. Also, researchers have not fully explored how adjusting different parameters affects Q-learning's efficiency in these situations. Similarly, Deep Q-learning has been studied extensively, but there is still a need for a comprehensive analysis of how variations in neural network architectures, batch sizes, and target network update frequencies influence its stability and performance in dynamic and large-scale environments. By addressing these gaps, this study provides a deeper understanding of how reinforcement learning methods perform in progressively complex worlds and contributes to their optimization for real-world applications.
In this research, we explore how Q-learning and Deep Q-learning perform in a complex MATLAB-simulated environment, focusing on their speed, accuracy, and adaptability under different conditions. In Q-learning, we systematically adjust the key hyperparameters: the learning rate (α), discount factor (γ), and exploration rate (ε). To enhance the stability and adaptability of the learning process, α and γ are adjusted dynamically based on two key factors: the learning rate α is adapted according to the variance of the Q-table updates, and the discount factor γ is adapted according to the agent's performance over time. For Deep Q-learning, we vary the neural network architecture, batch size, and target network update frequency, and we analyze how these variations impact learning efficiency and stability. The goal is to understand when and why one method outperforms the other, helping us fine-tune the algorithms for more challenging and dynamic environments. This study builds on previous work in complex environments and extends it to larger and more realistic scenarios, aiming to develop smarter and more adaptable autonomous systems.
This study extends existing reinforcement learning research toward broader application in complex and dynamic worlds, advancing knowledge of how these algorithms scale. Our results offer a comparative analysis of Q-learning and Deep Q-learning and reveal their strengths and shortcomings in diverse navigation scenarios. Moreover, we demonstrate how crucial choices of hyperparameters impact efficiency and stability in learning, and we offer valuable insights into optimizing these parameters to maximize performance. By addressing scalability problems and testing reinforcement learning methods in progressively complex worlds, we gain insight into how adaptable these algorithms are. Lastly, we outline future research on integrating adaptive learning paradigms and sensor fusion to better suit realistic settings such as robot automation, factory automation, and autonomous vehicle driving. Through these contributions, this research advances reinforcement learning methods for intelligent navigation and creates a foundation for advancing autonomous decision-making systems.
This article is organized into five sections. Section 2 is devoted to reinforcement learning, Q-Learning, and Deep Q-Learning algorithms: basic concepts, their integration with neural networks, and different types of path planning. Section 3 and Section 4 are dedicated to presenting and discussing the results obtained from maze simulations performed using MATLAB. Finally, the last section presents the conclusions of the study and perspectives for future work.

2. Overview of Q-Learning and Deep Q-Learning Methods

2.1. Materials and Methods

A series of experiments was conducted using MATLAB (R2021a) in a simulated environment to compare how Q-learning and Deep Q-learning perform. Q-learning is a value-table-based algorithm that guides an agent through a discrete environment. Deep Q-learning estimates action-values using a deep neural network in order to cope with higher-dimensional state spaces. The methodology is structured as follows and in Figure 1:
  • Algorithm Setup: Q-learning and Deep Q-learning algorithms were implemented on the same premises, so they start on an equal footing. Each algorithm has its own set of hyperparameters.
  • Hyperparameter Testing: We systematically changed key parameters, such as the learning rate and the exploration factor epsilon, and observed how each influences performance. These variations were used to test how robust the algorithms are and how well they adapt across different conditions.
  • Running Simulations: The algorithms were tested in a complex simulated environment to measure their convergence time, navigation accuracy, and overall effectiveness. Performance metrics were then collected for each combination of hyperparameters.
  • Data Analysis: We analyzed the results to compare the performance of the two algorithms. Graphs and tables were used to show how performance varied across different hyperparameter settings.
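As an illustration of this data-collection step, the following MATLAB sketch averages the convergence time of repeated runs for a few hyperparameter cases. The training run is replaced by a placeholder stub, and the (α, γ) pairs and repeat count are illustrative; only the loop structure is intended to reflect the protocol described above.

```matlab
% Sketch of the data-collection step: each (alpha, gamma) case is run
% repeatedly and the convergence times are averaged. The training run is
% replaced by a placeholder stub (runCase) returning a synthetic time,
% so only the loop structure is illustrated here.
runCase = @(alpha, gamma) 1000 + 500 * rand();   % stand-in for one training run (s)

cases      = [0.2 0.5; 0.5 0.7; 0.7 0.9];        % illustrative (alpha, gamma) pairs
numRepeats = 50;                                 % repetitions per case, as in the study
avgTime    = zeros(size(cases, 1), 1);

for c = 1:size(cases, 1)
    t = zeros(numRepeats, 1);
    for k = 1:numRepeats
        t(k) = runCase(cases(c, 1), cases(c, 2));   % one simulated run
    end
    avgTime(c) = mean(t);                           % average convergence time per case
end
```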

2.2. Reinforcement Learning

Due to its model-free nature, reinforcement learning can be regarded as a type of unsupervised learning. This gives it the capability to deal with complex environments and learn an optimal strategy. Reinforcement learning problems can normally be modeled as Markov Decision Processes. Generally speaking, reinforcement learning consists of four basic components: state, action, transition probability, and reward function [10]. As one of the most powerful learning approaches, RL computes the best way to act by directly interacting with the environment, with little or no prior knowledge. In general, an agent learns the value of various actions for a given strategy by trial and error in the environment. It builds on this learned information to predict the best moves, so the algorithm gradually improves and learns the optimal strategy over time [11]. The TD learning method is one of the most common approaches in reinforcement learning applications. We treat discrete timesteps in the observation–action cycle. While learning, the agent receives a reward value for any action it takes: positive rewards indicate desirable actions, and negative rewards indicate actions that should be avoided. The purpose of every RL algorithm is to learn a policy that maximizes the reward accumulated as the simulation continues, more commonly known as an episode, as shown in Figure 2.
The standard method, Q-Learning, stores the actions, the states of the environment, and the rewards in a tabular fashion. When an environment has many state changes and many behavioral strategies, this tabular approach requires so much data that basic table searches become very difficult. As deep learning has advanced, reinforcement learning has too, replacing the traditional value-estimation tables with neural networks. This transition is responsible for the recent interest in deep reinforcement learning methods, of which Deep Q-Learning [11] is one of the early examples. In this case, value estimation is performed by convolutional neural networks. The architecture of such networks takes image data as input, represents states from the images, and predicts the future expected rewards corresponding to different actions. This significantly improves the efficiency and accuracy of value estimation, which in turn makes the transition from discrete to continuous action strategies much smoother.

2.3. Comparative Analysis of Q-Learning and Deep Q-Learning

2.3.1. Q-Learning

Q-Learning is a TD learning algorithm classified as Off-Policy. Mathematical proofs establish that through sufficient training while adhering to an ε-soft policy, the algorithm will converge with a probability of 1 to an action-value function closely approximating any target policy [11]. Notably, Q-Learning can effectively learn the optimal policy, even if actions are chosen using a more exploratory or random approach. The procedural depiction of the algorithm is outlined as follows:
Q_new(s,a) ← Q(s,a) + α [r + γ max_{a′} Q(s′,a′) − Q(s,a)]
where Q(s,a) is the current Q-value; α is the learning rate; r is the reward; γ is the discount rate; and max_{a′} Q(s′,a′) is the maximum expected future reward.
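To make the update rule concrete, the following MATLAB sketch applies it once on a toy Q-table; the state, action, reward, and hyperparameter values are illustrative and not taken from the experiments.

```matlab
% Minimal sketch of the tabular Q-learning update on a toy 4-state, 4-action task.
numStates  = 4;
numActions = 4;                       % e.g., up, down, left, right
Q = zeros(numStates, numActions);     % Q-table initialised to zero

alpha = 0.5;                          % learning rate
gamma = 0.7;                          % discount factor

s = 1;  a = 2;                        % current state and chosen action (illustrative)
r = -1; sNext = 3;                    % observed reward and next state (illustrative)

% Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
tdTarget = r + gamma * max(Q(sNext, :));
Q(s, a)  = Q(s, a) + alpha * (tdTarget - Q(s, a));
```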
In the supervised learning setting, each data point is provided with the correct label. The task lies in finding an optimal hypothesis function, denoted h(x), that can predict the label correctly [12]. In other words, the quality of a prediction improves if the predicted label is closer to the actual label. In reinforcement learning, the counterpart of the hypothesis function in Q-learning is the action-value function, denoted Q(s,a) [12]. That is the agent's policy. The idea is to minimize the gap between the optimal policy and the learned policy defined by Q(s,a). If the agent can achieve the best Q(s,a), then it is said to have learned the policy. Therefore, the goal here is to estimate the optimal Q(s,a), also referred to as the TD-target.
  • ε-greedy
The ε-greedy Q-learning algorithm is a powerful reinforcement learning technique that helps an agent learn the best actions to take within its environment. Building on the standard Q-learning approach, ε-greedy Q-learning introduces a balance between exploration—trying out new actions to discover better options—and exploitation—choosing the actions that currently seem most effective based on past experiences. This balance allows the agent to adapt and improve over time, ultimately finding an optimal action selection strategy that works even in complex or uncertain environments [12].
  • The exploration–exploitation dilemma
In Q-learning, an agent has two primary ways to interact with its environment: exploitation and exploration.
Exploitation involves using the information in the Q-table, where the agent evaluates all potential actions for a given state and selects the action with the highest Q-value. This approach relies on what the agent has learned so far, aiming to maximize rewards based on known information [12].
Exploration, on the other hand, involves choosing actions randomly, allowing the agent to discover new pathways and potentially uncover higher rewards that were not initially apparent. Balancing these two approaches is key to effective Q-learning. The challenge is to decide, at any given state st, whether to choose the action expected to yield the optimal outcome or to try a different action that might lead to higher cumulative rewards in the long run [12].
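A minimal sketch of ε-greedy action selection follows, reusing the toy Q-table, current state s, and numActions from the earlier Q-learning snippet; the value of ε is illustrative.

```matlab
% Epsilon-greedy action selection for the current state s, assuming the toy
% Q-table, s, and numActions defined in the previous snippet.
epsilon = 0.1;                      % probability of exploring
if rand < epsilon
    a = randi(numActions);          % explore: pick a random action
else
    [~, a] = max(Q(s, :));          % exploit: pick the best-known action for state s
end
```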

2.3.2. Deep Q-Learning

Figure 3 shows that the most basic difference between Deep Q-Learning and Q-Learning is that the usual Q-table is replaced by a neural network. Rather than directly mapping state–action pairs to Q-values, the network maps an input state to the set of possible actions along with their associated Q-values [13]. We formulate a loss function in Deep Q-Learning to calculate the difference between the predicted Q-values and the target Q-values, and we employ gradient descent to update the weights of the Deep Q-Network so that it better approximates the Q-values [14].
During neural network training, the goal is to reduce the difference between predicted and target Q-values, derived from the Bellman equation [14].
  • Neural Network
Figure 4 shows that a neural network is a system of layers of interconnected nodes, with inputs and outputs and a "black box" in between. It may be unsupervised or supervised, depending on whether the expected output is fed into the system. In supervised learning, the network is trained toward certain outputs, whereas in unsupervised learning, the network categorizes input data without predefined outputs. In either case, the network requires training in order to adapt and make predictions from data [15].
The architecture of an artificial neural network (ANN) is composed of three main layers:
Input Layer: This is the first layer, where artificial input neurons introduce data into the network.
Hidden Layer: Located between the input and output layers, this layer contains artificial neurons connected by weighted links. The strength of each connection depends on the inputs each neuron receives.
Output Layer: As the final layer, it produces outputs tailored to the programmer’s specifications. Neurons in this layer serve as the network’s primary actors, delivering the results.
A neural network is a type of machine learning model, with various forms like feed-forward networks, cascade networks, shallow networks, recurrent networks, and convolutional networks [17]. Neural networks learn by adjusting the connections (weights) between neurons, which can be enhanced by adding more neurons or connections [18]. The adjustment of weights is based on the error’s derivative with respect to each weight. After training to reduce errors, the network is tested by inputting data and calculating the outputs. Successful learning is achieved when the network accurately predicts outputs, though it may not be perfect for all data [19].
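As a toy illustration of this weight-adjustment idea, the following sketch updates a single weight of a linear neuron along the negative gradient of a squared error; all numeric values are illustrative.

```matlab
% Toy illustration of weight adjustment by the error derivative: one weight
% of a linear neuron is moved along the negative gradient of a squared error.
w      = 0.2;                 % current weight
lr     = 0.05;                % learning rate
x      = 1.5;                 % input to the neuron
target = 1.0;                 % desired output

yHat  = w * x;                % neuron output
err   = yHat - target;        % prediction error
gradW = 2 * err * x;          % derivative of err^2 with respect to w
w     = w - lr * gradW;       % gradient-descent weight update
```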
  • Loss function
The Deep Q-Learning training process involves two key stages:
Sampling: Actions are taken, and the resulting experience tuples (state, action, reward, and next state) are stored in a replay memory.
Training: A random mini batch of tuples is selected from the replay memory, and the algorithm learns from this sample through a gradient descent update.
The parameters, often represented by θ in neural networks, are also known as ‘weights’. Deciding whether to include bias units depends on the problem’s specifics [20].
Deep Q-Learning continuously interacts with the environment, saving experiences in replay memory and updating the Q-network, which helps to estimate the optimal action-value function. This experience replay mechanism stabilizes learning by reducing correlations between consecutive samples, making better use of past experiences.
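The replay-memory mechanics described above can be sketched as follows in MATLAB; the buffer capacity, batch size, and the stored transition are illustrative, and a full agent would push one such tuple per environment step.

```matlab
% Sketch of experience replay: transitions are stored as
% (state, action, reward, next state, done) tuples in a circular buffer and a
% random mini-batch is drawn for each gradient update.
capacity  = 5000;
batchSize = 32;
memory    = cell(capacity, 1);    % replay buffer
count     = 0;                    % number of transitions stored so far

% Storing one (illustrative) transition:
experience = struct('s', 1, 'a', 2, 'r', -1, 'sNext', 3, 'done', false);
count = count + 1;
memory{mod(count - 1, capacity) + 1} = experience;   % overwrite oldest when full

% Sampling a mini-batch once enough experiences are available:
if count >= batchSize
    idx   = randi(min(count, capacity), batchSize, 1);   % random indices, with replacement
    batch = memory(idx);                                 % transitions used for the update
end
```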
When selecting a loss function, several important factors come into play:
Convexity: A convex loss function has a single minimum, which aids in finding optimal solutions more efficiently [20].
Differentiability: Smooth and differentiable loss functions allow the use of specific optimization techniques effectively [20].
Robustness: A robust loss function can handle outliers, maintaining stability despite extreme values in the data [20].
Smoothness: Smooth loss functions have steady gradient changes without abrupt spikes, which aids the optimization process [20].
Sparsity: A sparse loss function encourages the model to focus on only a few significant values, which is useful in high-dimensional data with many irrelevant features [20].
Multi-modality: Multi-modal loss functions have multiple minima, which can be beneficial when the model needs to represent data in diverse ways [20].
Monotonicity: Monotonic loss functions decrease consistently as predictions become closer to actual values, helping the optimization to move towards an optimal solution [20].
Invariance: Invariant loss functions remain unchanged even after certain transformations of input or output, which is helpful for data transformations like rotation, scaling, or translation [20].
One commonly used loss function, the Mean Squared Error (MSE), calculates the average of squared differences between predicted and actual values. The MSE loss function is mathematically represented as follows [21]:
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²
where n is the number of samples, y_i is the real value of the i-th sample, and ŷ_i is the predicted value of the i-th sample.
Applied to Deep Q-Learning, the loss for a transition is
L = (r + γ max_{a′} Q(s′, a′; θ_target) − Q(s, a; θ))²
Written in terms of the TD-target, the loss is
LossFunction = Q_best(s_t, a_t) − Q(s_t, a_t)
where Q_best(s_t, a_t) is the (unknown) TD-target and Q(s_t, a_t) is the current Q-value. Since the true target is unknown, it is estimated from the observed reward:
LossFunction = R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, a_t)
where R_{t+1} + γ max_a Q(S_{t+1}, a) is the estimated TD-target; Q(S_t, a_t) is the current Q-value; r denotes rewards, a actions, s states, and γ the discount rate.
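A small numeric sketch of this loss computation for a mini-batch follows; the reward vector, terminal flags, and Q-value predictions stand in for outputs of the online and target networks and are illustrative only.

```matlab
% Numeric sketch of the DQN loss for a mini-batch of three transitions.
% qPred stands in for Q(s,a; theta) from the online network and qNextMax for
% max_a' Q(s',a'; theta_target) from the target network.
gammaDQ   = 0.9;
rewards   = [-1; -1; 10];               % mini-batch rewards
doneFlags = [false; false; true];       % terminal transitions have no future term

qPred    = [0.4; 0.2; 0.7];             % predicted Q-values for the taken actions
qNextMax = [0.9; 0.5; 0.0];             % maximum target-network Q-values for s'

tdTargets = rewards + gammaDQ .* qNextMax .* ~doneFlags;   % r + gamma * max Q(s',a')
mseLoss   = mean((tdTargets - qPred).^2);                  % Mean Squared Error loss
```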

2.4. Software Architecture of Autonomous Robot

Figure 5 describes how mobile robot systems are built around three core components: sensing, path planning, and movement control. Path planning plays a crucial role by linking the robot's sensory data with its actions, directly impacting how the robot operates [22].
Path planning is essentially the process of determining the optimal route for a robot or autonomous vehicle. This process can work in environments that are fully known, partially known, or completely unknown. In familiar areas, the system has pre-existing information to aid in planning. In unfamiliar spaces, however, the robot or vehicle relies on its sensors to collect data and dynamically update its map as it moves, allowing it to determine the best path forward [23].
Figure 5. Software architecture of autonomous robot [24].
In mobile robotics, navigation planning generally falls into two main categories: global path planning and local path planning [25,26].
Global Path Planning: Also known as offline or static path planning, this method is used when the robot has prior knowledge of its surroundings and can follow a predefined route to reach its destination [26].
Local Path Planning: While the robot is in motion, it performs local path planning by using real-time data from its sensors. This approach enables the robot or autonomous vehicle to adapt its path in response to changes in its immediate environment [26]. Local path planning is crucial when the robot is navigating an unfamiliar area and needs to make decisions on-the-go, reacting dynamically to any obstacles it encounters.
In practice, robots typically combine these two approaches for effective navigation, as shown in Table 1. Global path planning focuses on creating an overall environmental map and evaluating potential routes, while local path planning is dedicated to real-time obstacle avoidance. Working together, these strategies help robots determine the safest and most efficient paths for movement [26].

3. Results

Solving maze problems with Q-learning or Deep Q-learning (DQL) can be challenging, as it depends on factors like the maze’s size, complexity, available actions, and the number of trials needed to find the optimal path.
In Q-learning, the agent tries different routes through the maze, learning which paths are better by updating a large table that records expected rewards for each move. This table can grow very large in big mazes, making it time-intensive to find the best path, especially in complex environments. Deep Q-learning works differently: instead of relying on a large table, it uses a neural network to predict the best moves. This approach is more memory-efficient and often learns faster.
This section evaluates the performance of Q-learning for autonomous navigation, particularly in maze environments. First, we compare two approaches: Q-learning with epsilon-greedy exploration and Q-learning without exploration, analyzing their impact on learning efficiency and the average time required for the agent to reach its goal. Additionally, we explore how hyperparameter tuning influences convergence speed and overall performance.

To enhance adaptability, the proposed algorithm dynamically adjusts α, γ, and ε during training, ensuring their values stay within the [0, 1] range. It begins with predefined values but adapts them as the agent learns, responding to its progress rather than sticking to a rigid schedule and allowing the agent to discover the most effective settings on its own. This adaptability helps the agent tackle various learning challenges more effectively. To ensure stability, these adjustments are carefully managed, preventing abrupt changes that could disrupt the learning process. Updates occur either at the start of an episode or in response to performance feedback, maintaining a balance between exploration and exploitation. While this approach enhances adaptability and efficiency, further research is needed to refine the adjustment strategy for optimal learning outcomes. This adaptive approach makes the algorithm more flexible, enabling it to handle a wide range of challenges rather than being tailored to specific scenarios.

For evaluation, we use 61 × 21 mazes as a test environment to assess how Q-learning performs under different levels of complexity. However, the algorithm is not restricted to this specific environment: thanks to its ability to adjust key hyperparameters, it can be applied to any environment, regardless of its size or complexity. This adaptability underscores the robustness of the proposed approach in optimizing autonomous navigation across various scenarios. In reinforcement learning, achieving a balanced tradeoff between exploration and exploitation is essential for optimal decision-making. The ε-greedy strategy plays a key role in this process by introducing a probability ε, which determines how often the agent chooses a random action instead of selecting the best-known move based on its current knowledge. By fine-tuning this parameter, we can enhance learning efficiency and ensure the agent effectively navigates complex environments.

In addition to evaluating Q-learning, we also applied Deep Q-learning (DQL) in the same maze environment, modifying the number of neurons in the neural network to assess its impact on learning efficiency. By adjusting the network architecture, we observed how different neuron counts influenced the agent's ability to reach its goal, analyzing the average time required and the stability of learning through regression analysis. This comparison is essential for identifying an optimal balance between model complexity and computational efficiency. By evaluating different configurations, we aim to determine the most efficient neural network structure that ensures both fast convergence and adaptability across various scenarios.
To implement Q-learning and Deep Q-Learning in MATLAB for maze navigation, we will divide the process into several steps:
  • Maze Representation: We need to represent the maze as a matrix where each cell can be a wall, an empty space, the starting point, or the goal. Here, 0 represents walls (obstacles); 1 represents open paths (navigable spaces); 2 marks the starting position; and 3 represents the goal location. An example maze represented as a matrix is shown in Figure 6, and quantitative complexity metrics are given in Table 2; a minimal encoding is also sketched after this list.
  • Environment Creation: We will create an environment where the agent can interact. This environment will provide functionalities for the agent to take actions and receive rewards.
  • DQN or Q-Learning Agent Implementation: We will first implement Q-learning; the DQL agent then learns to navigate the maze using a neural network to estimate the Q-function, and we will compare the two algorithms.
  • Agent Training: We will train the agent by allowing it to explore the maze and learn the best actions to take at each state.
  • Agent Evaluation: We will evaluate the agent's performance once it is trained by testing it with new parameters in the same maze to see if it can find the optimal path to the goal.
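As referenced in the first item above, here is a minimal sketch of the matrix encoding, using the 0/1/2/3 convention; the maze shown is a small illustrative example, not the 61 × 21 maze used in the experiments.

```matlab
% Illustrative encoding of a small maze as a matrix:
% 0 = wall, 1 = open path, 2 = start, 3 = goal.
maze = [ ...
    0 0 0 0 0 0 0;
    0 2 1 1 0 1 0;
    0 0 0 1 0 1 0;
    0 1 1 1 1 3 0;
    0 0 0 0 0 0 0 ];

[startRow, startCol] = find(maze == 2);   % locate the starting cell
[goalRow,  goalCol]  = find(maze == 3);   % locate the goal cell
```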

3.1. Q-Learning Results

In each simulation, we will adjust the values of α and γ and measure two metrics:
(1) The time taken to reach the goal without epsilon-greedy, recorded over the training steps.
(2) The average time taken to reach the goal with ε-greedy.
To ensure robustness, we will repeat the entire process for 50 iterations for each case.
  • The learning rate α is reduced as the Q-values stabilize to prevent excessive updates and oscillations. It follows an exponential decay based on the magnitude of Q-value changes:
    α_{t+1} = α_t · exp(−β|ΔQ_t|)
    where ΔQ_t represents the average change in Q-values between iterations and β is a decay factor controlling the rate of adaptation.
This ensures that α remains high during early exploration and gradually decreases as the agent converges to an optimal policy.
  • The discount factor γ is adjusted dynamically based on the agent's cumulative reward. The idea is to encourage long-term planning when performance improves and to prioritize immediate rewards when performance declines. The update rule is defined as follows:
    γ_{t+1} = γ_t + η, if the cumulative reward increases; γ_{t+1} = γ_t − η, otherwise
    where η is a small step-size parameter to prevent drastic fluctuations. This adaptive approach allows the agent to prioritize immediate rewards in unstable conditions while gradually favoring long-term rewards as it learns a stable policy.
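The two adaptive rules can be sketched together as follows; the decay factor β, step size η, and the illustrative inputs (average Q-value change, reward-improvement flag) are assumptions chosen for demonstration only.

```matlab
% Sketch of the adaptive hyperparameter rules. deltaQ (mean absolute Q-value
% change) and the cumulative-reward comparison are illustrative inputs that a
% full agent would compute from its training history.
alpha = 0.5;  gamma = 0.7;
beta  = 0.1;                 % decay factor for the learning rate
eta   = 0.01;                % small step size for the discount factor

deltaQ = 0.25;                               % average |change in Q| this iteration
alpha  = alpha * exp(-beta * abs(deltaQ));   % learning-rate decay as Q stabilises

rewardImproved = true;                       % did the cumulative reward increase?
if rewardImproved
    gamma = gamma + eta;                     % favour long-term rewards
else
    gamma = gamma - eta;                     % favour immediate rewards
end
gamma = min(max(gamma, 0), 1);               % keep gamma within [0, 1]
```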
Table 3 and Table 4 are summary tables presenting the outcomes of the agent’s experiments conducted in the 61 × 21 maze.

3.1.1. Impact of α and γ on Convergence Time

- α (Learning Rate): A high learning rate (α = 0.7) does not always guarantee faster learning. In fact, in cases 3, 6, and 9, the agent still takes a long time to converge (1790.75 s to 1855.75 s), suggesting that rapid updates can sometimes lead to instability. On the other hand, moderate learning rates (α = 0.5), seen in cases 4, 5, and 7, result in noticeably faster convergence. This indicates that a more balanced learning rate allows the agent to adapt efficiently without overwhelming the system.
- γ (Discount Factor): According to the proposed study, a high discount factor (γ = 0.9) encourages the agent to prioritize long-term rewards, but this often prolongs convergence; as observed in cases 4 and 6, the agent takes significantly more time to reach the goal. Conversely, moderate values of γ (around 0.5–0.7) lead to better results, as seen in cases 4, 7, and 8. This suggests that an agent performs best when it balances short-term and long-term rewards, allowing it to learn efficiently in dynamic environments.

3.1.2. Comparison of Times with and Without Epsilon-Greedy

- When ε Improves Convergence: In case 8 (α = 0.7, γ = 0.7), introducing epsilon-greedy significantly improves convergence, reducing the time from 1793.60 s (Table 4) to just 1061.60 s (Table 3). This demonstrates how strategic exploration helps the agent discover better policies faster. A similar improvement is seen in case 7 (α = 0.5, γ = 0.7), where epsilon-greedy accelerates learning by enabling the agent to find optimal paths more efficiently. This suggests that a well-balanced combination of moderate learning rate and discount factor, paired with epsilon-greedy exploration, helps achieve faster convergence.
- When ε Increases Convergence Time: In certain cases, exploration is not always beneficial. In case 1 (α = 0.2, γ = 0.5), using epsilon-greedy actually increases the time from 1127.50 s to 1803.50 s. This suggests that excessive exploration can sometimes be counterproductive, preventing the agent from exploiting the knowledge it has already gained. In such cases, the agent keeps exploring unnecessary paths instead of using what it has learned to make better decisions.

3.1.3. Interpretation of Best Performances

- Optimal Configurations: The most effective configurations are case 8 (α = 0.7, γ = 0.7) and case 7 (α = 0.5, γ = 0.7). These setups achieve the fastest convergence times when combined with epsilon-greedy, proving that a moderate learning rate and a balanced discount factor provide the ideal mix of exploration and exploitation. This allows the agent to learn efficiently while maintaining stability.
- Suboptimal Configurations: Case 3 (α = 0.7, γ = 0.9) performs the worst, with a convergence time of 1790.75 s without ε and 968.25 s with ε. This highlights a key issue: when both α and γ are high, excessive exploration and delayed rewards can make learning inefficient. Here, the agent struggles to exploit good policies, leading to only marginal improvements despite extensive exploration.

3.2. Deep Q-Learning Results

In this study, we interpret the results obtained from MATLAB simulations to evaluate the performance of the Deep Q-Learning agent in a specific environment. The network architecture is defined as follows:
  • Input Layer: Represents the environment’s state space.
  • Hidden Layers: The network consists of two hidden layers. The number of neurons varies from 5 to 30 as part of the experimental study.
  • Activation Function: ReLU is applied after each hidden layer to introduce non-linearity and prevent the vanishing gradient problem.
    The ReLU function is defined as follows:
    f(x) = max(0,x)
    What this means is that if the input x is positive, the output remains x.
    If the input x is negative, the output is zero.
  • Output Layer: Produces Q-values for all possible actions, using a linear activation function for stability.
  • Loss Function: Mean Squared Error (MSE) minimizes the difference between predicted and target Q-values.
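Under the assumption that the state is encoded as a small numeric feature vector, the architecture above could be expressed with MATLAB's Deep Learning Toolbox roughly as follows; the state size, number of actions, and neuron count are illustrative.

```matlab
% Possible expression of the described network with the Deep Learning Toolbox.
% stateSize, numActions, and numNeurons are illustrative assumptions.
stateSize  = 2;      % e.g., the agent's (row, column) position
numActions = 4;      % up, down, left, right
numNeurons = 10;     % varied from 5 to 30 in the experiments

layers = [
    featureInputLayer(stateSize)          % input layer: environment state
    fullyConnectedLayer(numNeurons)       % first hidden layer
    reluLayer                             % ReLU non-linearity
    fullyConnectedLayer(numNeurons)       % second hidden layer
    reluLayer
    fullyConnectedLayer(numActions)       % linear output: one Q-value per action
    regressionLayer];                     % mean-squared-error-type regression loss
```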

3.2.1. Neuron Count and Iteration Time

The proposed study shows the following:
- There is a general trend where an increase in the number of neurons in the network leads to longer iteration times. Table 5 shows that cases with a higher neuron count (25, 30 neurons) show significantly longer iteration times (up to 40 s), while smaller networks (5, 10 neurons) have much shorter times (6 s, 8 s).
- This reflects the increased complexity of the model, where more neurons involve more calculations at each iteration.

3.2.2. Neuron Count and Regression Quality

The data demonstrate the following:
- A network that is too small (five neurons) has a poor regression performance (0.620), indicating limited capacity to generalize the data.
- A network with too many neurons (like 30) exhibits better regression (0.880), but at the cost of significantly longer execution times. This may reflect overfitting, where the model learns the training data too well but becomes less effective at quick generalization.
- A moderate number of neurons (between 10 and 20) seems to offer the best balance between execution time and regression quality. For instance, 10 neurons yield the highest regression (0.900) while maintaining a low iteration time (8 s).

3.2.3. General Trends

- Efficiency and Neuron Count: It appears that moderately sized models, with around 10 to 20 neurons, are the most efficient in this configuration, providing a good balance between speed and prediction accuracy.
- Limits of Very Large Networks: Increasing the neuron count beyond a certain point (such as 30) results in disproportionately longer iteration times without significant performance gains.
The analysis reveals that Deep Q-Learning requires minimal time to reach the goal, particularly in environments with larger dimensions.
The iteration times in a neural network are influenced by the number of neurons, as this affects the computational workload, memory requirements, and how quickly the network converges. Designing an optimal neural network involves balancing these aspects to ensure both efficiency and effectiveness in training. The number of neurons in a neural network also significantly impacts regression performance by shaping the model's learning ability, computational efficiency, and generalization capability. Achieving optimal results requires balancing these factors through careful experimentation and the use of regularization techniques. The size of a regression task influences how complex the model needs to be, how much computing power is required, and how well the model can generalize to new data. Small regression tasks are simpler and quicker to manage but have a higher chance of overfitting. On the other hand, large regression tasks require more advanced models, more computational resources, and careful techniques to avoid overfitting and underfitting.
In summary, the results of Deep Q-Learning can include effective policies learned by the agent, convergence of the Q-function, better performance compared to other methods, sensitivity to hyperparameters, and stability of learning. These results can vary based on many factors and often require extensive experimentation to achieve the best outcomes. Although Q-Learning (Figure 7) and Deep Q-Learning (Figure 8) both aim to acquire optimal policies through trial and error, Deep Q-Learning is distinguished by its ability to handle high-dimensional state spaces effectively, its enhanced learning efficiency from data, and its superior ability to generalize across environments. Consequently, it is well suited for addressing intricate and varied tasks.

4. Discussion

In this study, we observed that the convergence time in maze-solving tasks depends heavily on the choice of learning rate (α) and discount factor (γ), two essential parameters in reinforcement learning. The learning rate, which determines how much the algorithm adjusts after each step, greatly influences how quickly the agent can find the optimal solution. When the learning rate is set too high, the learning process can become unstable: large updates cause the agent to make big, erratic changes, often overshooting optimal solutions and getting stuck in suboptimal ones. This instability tends to slow down convergence, especially in complex mazes that require steady, controlled adjustments. In contrast, a moderate learning rate helps the agent learn at a more stable pace, allowing it to respond gradually to new information and leading to both faster and more reliable convergence.
The discount factor, which affects how much future rewards matter compared to immediate ones, also plays a crucial role. A high discount factor makes the agent prioritize long-term rewards, which can slow down learning because it takes longer to recognize beneficial immediate actions. This effect is especially noticeable in more difficult mazes, where balancing short-term gains with long-term strategies is essential. A moderate discount factor, on the other hand, allows the agent to weigh immediate and future rewards more equally, making it easier to find efficient routes in dynamic environments.
Together, the learning rate and discount factor create a balance between adaptability and stability. A well-chosen learning rate and discount factor let the agent integrate short- and long-term rewards effectively, improving its learning and convergence times. These findings underscore the importance of carefully tuning parameters like α and γ in reinforcement learning, as this balance is essential for efficient and stable learning in complex decision-making tasks like maze navigation.
Future studies could focus on adapting learning strategies to constantly evolving environments and on optimizing Deep Q-learning to handle these real-time variations, based on the performance comparison of Q-Learning and Deep Q-Learning shown in Table 6. The regression scores from the experiments (e.g., 0.88 for 30 neurons) indicate that the model fits the training data well. However, this raises a concern about overfitting where the neural network memorizes patterns rather than truly learning to generalize. To assess its ability to generalize, it would be useful to track validation loss over training epochs to check for divergence from training loss. Additionally, testing the model on new mazes with different obstacle arrangements would provide a clearer picture of how well it adapts to unseen environments. To reduce the risk of overfitting, several techniques can be applied. Adding dropout layers, for example, helps prevent the model from relying too much on specific neurons. L2 regularization (weight decay) can also be used to limit excessive weight updates, improving overall robustness. Another effective approach is early stopping, where training is halted once validation loss stops improving, preventing unnecessary overfitting. Implementing these strategies would enhance the scalability of Deep Q-Learning, making it more suitable for real-world applications, particularly in dynamic environments.
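The mitigations mentioned above (dropout, L2 regularization, early stopping) could be expressed with Deep Learning Toolbox options roughly as follows; the layer sizes and option values are illustrative, and ValidationPatience only takes effect when validation data is supplied to the training routine.

```matlab
% Sketch of the discussed mitigations: a dropout layer, L2 regularization,
% and validation-based early stopping. Values are illustrative.
layersReg = [
    featureInputLayer(2)
    fullyConnectedLayer(10)
    reluLayer
    dropoutLayer(0.2)                     % randomly zeroes 20% of activations
    fullyConnectedLayer(4)
    regressionLayer];

opts = trainingOptions('adam', ...
    'L2Regularization', 1e-4, ...         % weight decay on the network parameters
    'ValidationPatience', 5, ...          % stop after 5 stalled validation checks
    'MaxEpochs', 50);                     % upper bound on training epochs
```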
Second, the impact of different sensor fusion techniques on algorithm performance should be explored in more detail. Assessing how various hardware configurations and fusion methods influence results could provide valuable insights into improving the reliability and efficiency of perception systems in real-world applications.
Finally, it would be interesting to study the application of these learning techniques to specific tasks such as image recognition, motion prediction, or autonomous driving in even more complex and varied scenarios. Particular attention could be given to how algorithms can be adapted to process data from diverse sources and integrate online learning mechanisms to enhance performance in unknown or unforeseen environments.

5. Conclusions

Recent findings indicate that Deep Q-learning generally outperforms Q-learning in complex environments. The results provide strong evidence that this advantage is due to Deep Q-learning’s superior ability to manage large state spaces and improve navigation accuracy, rather than merely reflecting a higher number of successful navigation attempts. This suggests that the advanced capabilities of Deep Q-learning make it a more effective choice for scenarios that require sophisticated state space management.
In Q-learning, the learning rate (α) is crucial, as it dictates the balance between integrating new information and maintaining existing knowledge when updating Q-values. A high learning rate enables the agent to quickly adapt to new experiences but can cause instability and convergence issues. Meanwhile, a low learning rate improves stability by making the agent less sensitive to sudden changes in Q-values, though it may slow down the learning process. Determining the optimal learning rate (α optimal) depends on the problem and the exploration-exploitation strategy used by the agent. This often requires trial and error and parameter tuning to find the best learning rate for a particular application. Additionally, techniques like learning rate annealing, where the learning rate gradually decreases over time, can be beneficial in practical implementations. In real-world applications, Q-learning is typically combined with an exploration strategy such as epsilon-greedy, which balances the agent's exploration of new actions against the exploitation of currently estimated optimal actions. The choice of exploration strategy significantly impacts the algorithm's performance and the dynamics of the learning rate.
The results of this study indicate that Deep Q-learning, with its advanced capabilities for handling complex state spaces, offers significant advantages over Q-learning in complex environments. However, several areas still warrant further exploration. First, it would be valuable to examine how these algorithms perform in dynamic environments, where conditions and obstacles are constantly changing.

Author Contributions

Conceptualization, M.E.W. and M.A.Y.; methodology, M.E.W. and M.A.Y.; software, M.E.W.; validation, M.E.W., M.A.Y., R.D. and M.B.; formal analysis, M.E.W.; investigation, M.E.W.; resources, M.E.W. and M.A.Y.; data curation, M.E.W.; writing—original draft preparation, M.E.W.; writing—review and editing, M.E.W.; visualization, M.E.W.; supervision, M.A.Y. and R.D.; project administration, M.E.W. and M.A.Y.; funding acquisition, M.E.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

RL    Reinforcement Learning
TD    Temporal Difference
ReLU  Rectified Linear Unit

References

  1. Rubio, F.; Valero, F.; Llopis-Albert, C. A review of mobile robots: Concepts, methods, theoretical framework, and applications. Int. J. Adv. Robot. Syst. 2019, 16, 1729881419839596. [Google Scholar] [CrossRef]
  2. Raj, R.; Kos, A. A comprehensive study of mobile robot: History, developments, applications, and future research perspectives. Appl. Sci. 2022, 12, 6951. [Google Scholar] [CrossRef]
  3. Sánchez-Ibáñez, J.R.; Pérez-Del-Pulgar, C.J.; García-Cerezo, A. Path planning for autonomous mobile robots: A review. Sensors 2021, 21, 7898. [Google Scholar] [CrossRef]
  4. Ohnishi, S.; Uchibe, E.; Yamaguchi, Y.; Nakanishi, K.; Yasui, Y.; Ishii, S. Constrained deep q-learning gradually approaching ordinary q-learning. Front. Neurorobot. 2019, 13, 103. [Google Scholar] [CrossRef] [PubMed]
  5. Lei, X.; Zhang, Z.; Dong, P. Dynamic path planning of unknown environment based on deep reinforcement learning. J. Robot. 2018, 2018, 5781591. [Google Scholar] [CrossRef]
  6. Wang, P.; Li, X.; Song, C.; Zhai, S. Research on dynamic path planning of wheeled robot based on deep reinforcement learning on the slope ground. J. Robot. 2020, 2020, 7167243. [Google Scholar] [CrossRef]
  7. Yu, J.; Su, Y.; Liao, Y. The path planning of mobile robot by neural networks and hierarchical reinforcement learning. Front. Neurorobot. 2020, 14, 63. [Google Scholar] [CrossRef]
  8. Ni, J.; Gu, Y.; Tang, G.; Ke, C.; Gu, Y. Cooperative Coverage Path Planning for Multi-Mobile Robots Based on Improved K-Means Clustering and Deep Reinforcement Learning. Electronics 2024, 13, 944. [Google Scholar] [CrossRef]
  9. Guan, M.; Yang, F.X.; Jiao, J.C.; Chen, X.P. Research on path planning of mobile robot based on improved Deep Q Network. J. Phys. Conf. Ser. 2021, 1820, 012024. [Google Scholar] [CrossRef]
  10. Chen, T.; Jia, W.; Yuan, J.; Ma, S.; Cheng, L. Continuity and Smoothness Analysis and Possible Improvement of Traditional Reinforcement Learning Methods. In Proceedings of the 2020 IEEE International Conference on Mechatronics and Automation (ICMA), IEEE, Beijing, China, 13–16 October 2020; pp. 1722–1727. [Google Scholar] [CrossRef]
  11. Jang, B.; Kim, M.; Harerimana, G.; Kim, J.W. Q-learning algorithms: A comprehensive classification and applications. IEEE Access 2019, 7, 133653–133667. [Google Scholar] [CrossRef]
  12. Hu, Y.; Yang, L.; Lou, Y. Path planning with q-learning. J. Phys. Conf. Ser. 2021, 1948, 012038. [Google Scholar] [CrossRef]
  13. Duryea, E.; Ganger, M.; Hu, W. Exploring Deep Reinforcement Learning with Multi Q-Learning. Intell. Control. Autom. 2016, 7, 129–144. [Google Scholar] [CrossRef]
  14. Ben Hazem, Z. Study of Q-learning and deep Q-network learning control for a rotary inverted pendulum system. Discov. Appl. Sci. 2024, 6, 49. [Google Scholar] [CrossRef]
  15. Aziz, I. Deep Learning: An Overview of Convolutional Neural Network (CNN). Master’s Thesis, Tampere University, Tampere, Finland, 2020. [Google Scholar]
  16. Bahi, M.; Batouche, M. Deep learning for ligand-based virtual screening in drug discovery. In Proceedings of the 2018 3rd International Conference on Pattern Analysis and Intelligent Systems (PAIS), IEEE, Tebessa, Algeria, 24–25 October 2018; pp. 1–5. [Google Scholar] [CrossRef]
  17. Batina, L.; Bhasin, S.; Jap, D.; Picek, S. CSI NN: Reverse engineering of neural network architectures through electromagnetic side channel. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19), Santa Clara, CA, USA, 14–16 August 2019; pp. 515–532. [Google Scholar]
  18. Terven, J.; Cordova-Esparza, D.M.; Ramirez-Pedraza, A.; Chavez-Urbiola, E. Loss functions and metrics in deep learning. A review. arXiv 2023, arXiv:2307.02694. [Google Scholar] [CrossRef]
  19. Rahman, M.; Rashid, S.M.H.; Hossain, M.M. Implementation of Q learning and deep Q network for controlling a self balancing robot model. Robot. Biomim. 2018, 5, 1–6. [Google Scholar] [CrossRef] [PubMed]
  20. Qamar, R.; Zardari, B.A. Artificial Neural Networks: An Overview. Mesopotamian J. Comput. Sci. 2023, 2023, 130–139. [Google Scholar] [CrossRef]
  21. Liu, L.; Wang, X.; Yang, X.; Liu, H.; Li, J.; Wang, P. Path planning techniques for mobile robots: Review and prospect. Expert Syst. Appl. 2023, 227, 120254. [Google Scholar] [CrossRef]
  22. Qin, H.; Shao, S.; Wang, T.; Yu, X.; Jiang, Y.; Cao, Z. Review of Autonomous Path Planning Algorithms for Mobile Robots. Drones 2023, 7, 211. [Google Scholar] [CrossRef]
  23. Liu, F.; Chen, C.; Li, Z.; Guan, Z.-H.; O Wang, H. Research on path planning of robot based on deep reinforcement learning. In Proceedings of the 2020 39th Chinese Control Conference (CCC), IEEE, Shenyang, China, 27–29 July 2020; pp. 3730–3734. [Google Scholar] [CrossRef]
  24. Karur, K.; Sharma, N.; Dharmatti, C.; Siegel, J.E. A survey of path planning algorithms for mobile robots. Vehicles 2021, 3, 448–468. [Google Scholar] [CrossRef]
  25. Yang, L.; Li, P.; Qian, S.; Quan, H.; Miao, J.; Liu, M.; Hu, Y.; Memetimin, E. Path Planning Technique for Mobile Robots: A Review. Machines 2023, 11, 980. [Google Scholar] [CrossRef]
  26. Patle, B.K.; Pandey, A.; Parhi, D.; Jagadeesh, A. A review: On path planning strategies for navigation of mobile robot. Def. Technol. 2019, 15, 582–606. [Google Scholar] [CrossRef]
Figure 1. Block diagram of the methodology and design.
Figure 2. RL principle [11].
Figure 3. Difference between Q-Learning (a) and Deep Q-Learning (b).
Figure 4. Deep neural network architecture [16].
Figure 6. Maze representation.
Figure 7. Results of Q-Learning algorithm using ε (a) and without using ε (b).
Figure 8. Impact of neurons’ number on iteration times (a) and impact of neurons’ number on regression (b).
Table 1. Methods of path planning [26].
Global Planning:
  • C-Space: Graph Search; Sampling Based
  • Optimal Control: Global Optimization; PDE Solving
Local Planning:
  • Reactive Computing: Local Optimization; Reactive Maneuver
  • Soft Computing: Artificial Intelligence; Evolutionary Computation
Table 2. Quantitative complexity metrics of the 61 × 21 maze environment.
Metric | Measured Value (61 × 21 Maze) | Definition
Branching Factor | ~2.5 (indicating moderate complexity) | The average number of possible moves at each step (degree of path choices).
Obstacle Density | ~42% (ensuring significant navigation challenges) | The percentage of the grid occupied by walls.
Shortest Path Length | ~93 steps | Minimum number of steps from start to goal, assuming optimal policy.
Open Space Ratio | ~58% (balancing exploration opportunities) | Ratio of navigable cells to total cells in the maze.
Table 3. Q-learning results using epsilon-greedy.
Case | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th
α | 0.2 | 0.5 | 0.7 | 0.5 | 0.7 | 0.2 | 0.5 | 0.7 | 0.2
γ | 0.5 | 0.7 | 0.9 | 0.9 | 0.5 | 0.9 | 0.7 | 0.7 | 0.5
ε | 0.1 | 0.3 | 0.5 | 0.1 | 0.3 | 0.5 | 0.1 | 0.3 | 0.1
Average time using ε (s) | 1127.50 | 1046.25 | 968.25 | 1152.22 | 1048 | 975.75 | 1156.2 | 1061.60 | 955.75
Table 4. Q-learning results without using epsilon-greedy.
Case | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th
α | 0.2 | 0.5 | 0.7 | 0.5 | 0.7 | 0.2 | 0.5 | 0.7 | 0.2
γ | 0.5 | 0.7 | 0.9 | 0.9 | 0.5 | 0.9 | 0.7 | 0.7 | 0.5
Average time without ε (s) | 1803.50 | 1818.25 | 1790.75 | 1813.95 | 1795.60 | 1855.75 | 1822.23 | 1793.60 | 1799.50
Table 5. Deep Q-learning results for 61 × 21 maze.
Case | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th
Number of neurons | 20 | 20 | 15 | 10 | 25 | 5 | 30 | 18 | 15
Number of iterations | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000
Time of iterations (s) | 19 | 20 | 13 | 8 | 28 | 6 | 40 | 24 | 12
Regression | 0.84126 | 0.74918 | 0.84455 | 0.90177 | 0.69201 | 0.62031 | 0.88963 | 0.88704 | 0.86965
Table 6. Performance comparison of Q-Learning and Deep Q-Learning.
Metric | Q-Learning (With ε-Greedy) | Q-Learning (Without ε-Greedy) | Deep Q-Learning | DQL Improvement vs. QL (%)
Average Learning Time (s) | 1048.04 | 1801.57 | 18.88 | 98.2% faster
Iterations to Goal | - | - | 2000 | -
Regression (R²) | - | - | 0.84126–0.90177 | Higher stability
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
