1. Introduction
Many bio-inspired robotic applications and learning tasks require agents to adapt to uncertainties in their environment. In simulated environments, Deep Reinforcement Learning (DRL) has proven successful at learning tasks across a wide range of domains such as games [1], rehabilitation [2], locomotion [3], production optimization [4], and control [5,6,7]. However, in real-world environments, the deployment of DRL is often limited by changing environmental conditions, an intrinsic feature of most real-world applications. It is therefore crucial that DRL agents are able to generalize to environmental conditions they have never encountered during training [8]. Although training DRL agents directly in real-world environments may seem a plausible alternative, it is rarely practiced due to issues of safety, time, and cost. The success of DRL in simulated environments rests on the time and stability advantages that simulation offers. Furthermore, during training, behaviors that are unsafe or practically impossible in real-world scenarios, such as robot singularities, can be explored safely in simulation. Sim-to-real transfer of DRL agents has therefore become a major topic in the DRL community [9], where the aim is to develop agents that can generalize over a range of real-world scenarios using the knowledge gained during training [9]. In DRL, the aim of generalization is to learn representations that are stable under dynamic environments [10,11,12] and/or to avoid overfitting during training [13]. The literature on generalization under dynamic environments in DRL can be broadly classified into: (1) works that develop dedicated DRL algorithms that are robust under changing environmental conditions [14,15]; and (2) works that introduce environmental uncertainties or domain randomization during training to obtain agents that are more robust to mismatches between simulation and reality [10,16,17]. Domain randomization is primarily defined as the use of an environment (source domain) with randomized properties or parameters during the training phase of an RL agent, with the expectation that the resulting agent will generalize under uncertainties in the test environment (target domain) [18].
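To make this concrete, the following minimal sketch (not taken from any cited work) shows one common way to implement domain randomization: a Gym wrapper that re-samples a dynamics parameter each time the environment is reset. The `set_friction` setter and the sampling range are illustrative assumptions.

```python
import gym
import numpy as np

class DomainRandomizationWrapper(gym.Wrapper):
    """Illustrative wrapper: re-sample a dynamics parameter on every reset.

    The `set_friction` attribute and the sampling range are hypothetical;
    a real source domain would expose its own randomizable parameters.
    """

    def __init__(self, env, low=0.5, high=1.5):
        super().__init__(env)
        self.low, self.high = low, high

    def reset(self, **kwargs):
        # Sample a new parameter value for this episode so the agent
        # sees many variants of the source-domain dynamics during training.
        friction = np.random.uniform(self.low, self.high)
        if hasattr(self.env.unwrapped, "set_friction"):
            self.env.unwrapped.set_friction(friction)  # hypothetical setter
        return self.env.reset(**kwargs)
```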
Comparative studies in the literature [8,12,19] have shown that dedicated generalization DRL algorithms are not superior to base DRL algorithms such as PPO and A2C in terms of generalization under the same training resources and conditions. This may be due to the significant challenges that the highly non-convex–concave objective functions employed to achieve robustness pose to the underlying optimization methods. In game-theoretical terms, these methods search for pure Nash equilibria (trade-off solutions) that might not even exist [20]. According to one study [8], simply adding uncertainties to the environments during training can yield agents that generalize to environments with similar uncertainties. Generalization through varied environmental conditions, or domain randomization, during training has been reported in the field of games, where variations in paddle and ball size were applied to selected Atari games during training [17]. To facilitate DRL research in areas such as sim-to-real transfer and generalization, DeepRacer, an educational autonomous racing testbed, was proposed [21]. In this platform, simulation parameters such as tracks, lighting, and sensor and actuator noise can be randomized. Agents trained with such domain randomization were reported to generalize across multiple cars and tracks as well as to variations in speed, background, lighting, track shape, color, and texture. Using the robot arm reacher task in PyBullet, researchers [22] investigated custom perturbations as a means of domain randomization. A number of works have also evaluated environmental variation in robot navigation and locomotion tasks. For example, the authors of [10] evaluated two robot locomotion tasks and introduced dynamic environmental conditions by varying the operational parameters or characteristics of the robot body. The authors of [8] performed an empirical study on generalization using classical control environments (CartPole, MountainCar, and Pendulum, where features such as force, length, and mass were randomly varied) as well as two locomotion tasks from Roboschool (HalfCheetah and Hopper, where robot operational features such as power, torso density, and friction were varied).
In the above works on generalization in navigation and locomotion environments, the focus is on varying the robot's operation to mimic environmental changes rather than on varying features of the environment itself, such as the terrain. Although the authors of [8] combined terrain changes with changes in robot operation in two Roboschool locomotion environments, it is difficult to draw reliable conclusions from only two hand-picked environments. A rigorous evaluation across several complex navigation and locomotion environments is therefore needed. The main contribution of the current work is a rigorous evaluation of the generalization capabilities of base DRL algorithms trained under dynamic environmental conditions, where uncertainty is introduced only through the environmental terrain, without modifying the robot's operation, in complex navigation and locomotion RL environments. The evaluations are performed using six complex benchmark PyBullet locomotion tasks [23] with varying environmental conditions, where the dynamics of the environment are varied through the friction settings of the terrain. An application scenario corresponding to this experimental design is a robot trained to navigate a normal terrain or floor that is then deployed on a slippery terrain. The evaluations were carried out using two proven state-of-the-art DRL algorithms: (1) from the actor–critic class, Soft Actor–Critic (SAC); and (2) from the policy gradient class, the Twin Delayed Deep Deterministic policy gradient algorithm (TD3).
The remainder of the paper is organized as follows. Section 2 presents a detailed overview of the existing literature on generalization in DRL. Section 3 highlights the proven DRL algorithms used in this study. Section 4 describes the benchmark locomotion environments used in this study and the dynamic changes introduced in each of them, while Section 5 details the experimental methodology. Section 6 presents the results and discussion, and Section 7 presents the conclusions and future directions.
2. Related Works
In the literature, several DRL approaches have been proposed to achieve generalization in RL. One such approach is meta-learning, where the ability to adapt to unseen test environments is acquired through a learning process performed on multiple training tasks. Often referred to as deep meta-RL, it usually involves a recurrent neural network policy whose inputs also include the action selected and the reward received in the previous time step [24]. Another class of DRL algorithms developed to achieve generalization comprises robust RL algorithms, which are designed to handle perturbations in the transition dynamics of the model. In one study [25], a robust RL algorithm based on Maximum a posteriori Policy Optimization (MPO) was developed for continuous control; a policy optimizing the worst-case, entropy-regularized expected return was learned, and a corresponding robust entropy-regularized Bellman contraction operator was derived. Training DRL agents on a collection of risk-averse environments was evaluated in another study [10] using four benchmark locomotion environments, where linear and RBF parameterizations were introduced to realize robustness in the DRL algorithm. Recently, introducing variations or uncertainties into RL environments during training to realize generalization has gained considerable traction, as recent studies have shown that vanilla DRL algorithms generalize better than their dedicated robust DRL counterparts [8]. In an empirical study [8], four classical control tasks were evaluated with changes in force, mass, and length introduced into the environments. This was extended to two locomotion environments, where variations were introduced through changes in robot power, density, and ground friction. These environments were evaluated using PPO and A2C against their corresponding robust DRL variants for the comparative analysis. The introduction of variation or uncertainty into DRL environments is mostly achieved in control tasks through domain randomization, where the dynamic properties of either the associated system or its environment are varied. In a recent study [26], dexterous in-hand manipulation for a five-fingered robot was learned, with environmental variations introduced by randomly varying parameters such as the object dimensions, surface friction, and robot link and object masses; generalization was further demonstrated in this work by a successful sim-to-real transfer. Our work adopts a similar approach, achieving domain randomization through environmental variation. However, rather than varying the robot dynamics, or a combination of the robot dynamics and its environment, we randomize only the surface friction of the terrain, to investigate the scenario of a robot moving from a smooth terrain to a slippery one. We avoid changes to the robot dynamics because, in the real world, such dynamics cannot be changed directly but are rather a function of the robot's terrain.
4. Environments
The environments used for the evaluations in this study are modified versions of six locomotion environments from PyBullet. The modification involves varying the surface friction of the terrain. The resulting environmental terrains can be summarized as follows (a sketch of the corresponding sampling rules is given after Table 1):
Normal Terrain (NT): The friction parameters of the environments under this terrain are kept constant at the default implementation values d from PyBullet. This implies that only the state variables are reset each time the environment is reset;
Random Terrain (RT): Each time an episode terminates and the environment is re-initialized (reset), the friction coefficient is sampled randomly from a k-dimensional uniform distribution (box) containing the default values d;
Extreme Terrain (ET): The friction coefficient of the terrain is reset every time an episode terminates by sampling uniformly from one of two k-dimensional uniform distributions (boxes). Specifically, the friction coefficient is sampled from the union of two intervals that straddle the corresponding RT interval.
The search spaces corresponding to NT, RT, and ET are illustrated schematically in Figure 1, where the search space of RT is indicated by the bounded white box and those of ET are illustrated by the disconnected black boxes. The ranges of the actual friction parameter values for each environment are presented in Table 1.
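The three schemes can be summarized by the following illustrative sampling rule. The default friction vector d, the RT box half-width, and the ET offset below are placeholder values chosen for illustration only; the actual ranges are those listed in Table 1.

```python
import numpy as np

d = np.array([0.8, 0.8, 0.8])  # placeholder default friction vector (see Table 1)
delta = 0.2                    # placeholder half-width of the RT box
gap = 0.1                      # placeholder offset separating the ET boxes from RT

def sample_friction(terrain):
    """Return a k-dimensional friction vector for one episode."""
    if terrain == "NT":
        # Normal Terrain: keep the PyBullet defaults on every reset.
        return d.copy()
    if terrain == "RT":
        # Random Terrain: uniform sample from a box containing the defaults.
        return np.random.uniform(d - delta, d + delta)
    if terrain == "ET":
        # Extreme Terrain: uniform sample from one of two boxes that
        # straddle (lie on either side of) the RT interval.
        if np.random.rand() < 0.5:
            return np.random.uniform(d - delta - gap, d - delta)
        return np.random.uniform(d + delta, d + delta + gap)
    raise ValueError(terrain)
```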
Based on each of these environmental terrains, six locomotion tasks with different levels of difficulty and dynamics were trained. All six tasks, Hopper [32,33], 2D Walker [32,33], Ant [34], HalfCheetah [34,35,36], Humanoid [34,37], and Humanoid Flagrun Harder (Flagrun) [38], were implemented based on the existing locomotion environments in PyBullet, which are modified and more realistic versions of those from MuJoCo. For example, in the PyBullet version of the Ant environment, the ant is much heavier, ensuring that it keeps two or more legs on the ground to sustain its weight.
Figure 2 shows a graphical illustration of each environment. The goal in all tasks except Humanoid Flagrun Harder is to learn to move forward quickly without falling. In Humanoid Flagrun Harder, the humanoid must move to a specific target whose position varies randomly while the humanoid is constantly bombarded with cubes that push it off its trajectory. The detailed features, goals, and reward systems for each environment are presented in [23].
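For reference, the PyBullet versions of these tasks are registered as standard Gym environments by the pybullet_envs package. The snippet below is a minimal sketch, assuming the classic Gym step API, that instantiates them by their registered IDs.

```python
import gym
import pybullet_envs  # noqa: F401  (registers the *BulletEnv-v0 environments)

# The six tasks used in this study, under their PyBullet Gym IDs.
TASKS = [
    "HopperBulletEnv-v0",
    "Walker2DBulletEnv-v0",
    "AntBulletEnv-v0",
    "HalfCheetahBulletEnv-v0",
    "HumanoidBulletEnv-v0",
    "HumanoidFlagrunHarderBulletEnv-v0",
]

env = gym.make(TASKS[2])  # e.g., the heavier PyBullet Ant
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```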
5. Experimental Design
To evaluate the effect of training on each of the environmental terrains discussed in Section 4, we perform a series of train–test scenarios in which each agent is trained on a specific terrain and tested both on the same terrain (in-distribution) and on all other terrains (out-of-distribution). Specifically, we train two DRL algorithms (SAC and TD3) under the NT, RT, and ET variations on six environments (Ant, 2D Walker, Hopper, HalfCheetah, Humanoid, and Humanoid Flagrun Harder). We then tested all the resulting agents on all the terrains (NT, RT, and ET) and compared them across three testing scenarios. Both SAC and TD3 were trained for 1 × 10⁶
timesteps in all the training scenarios. For testing, the best models returned during training were used, and each testing scenario was run for 25 independent episodes (each with an episode length of 1000). For fairness, the hyperparameters used for each algorithm were kept the same across all environments and training scenarios, fixed at the values provided in stable-baselines, a collection of improved implementations of RL algorithms [29].
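The sketch below condenses one cell of this train–test matrix using the Stable Baselines API. Here `make_terrain_env` is a hypothetical helper that builds a PyBullet task with the NT/RT/ET friction variation of Section 4; the timestep budget and the 25-episode evaluation protocol follow the design described above.

```python
import numpy as np
from stable_baselines import SAC  # TD3 is trained analogously

# Train one (algorithm, environment, terrain) cell of the matrix.
# make_terrain_env is a hypothetical helper, not a stable-baselines API.
train_env = make_terrain_env("AntBulletEnv-v0", terrain="RT")
model = SAC("MlpPolicy", train_env, verbose=0)  # default hyperparameters
model.learn(total_timesteps=1_000_000)

# Test the trained agent on every terrain: 25 independent episodes each.
for terrain in ("NT", "RT", "ET"):
    test_env = make_terrain_env("AntBulletEnv-v0", terrain=terrain)
    returns = []
    for _ in range(25):
        obs, done, ep_return = test_env.reset(), False, 0.0
        while not done:  # the PyBullet tasks cap episodes at 1000 steps
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = test_env.step(action)
            ep_return += reward
        returns.append(ep_return)
    print(terrain, np.mean(returns))
```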
7. Conclusions and Future Work
An assessment of domain randomization in DRL for locomotion tasks is presented in this work. Specifically, we evaluated two state-of-the-art deep RL algorithms (SAC and TD3) on six locomotion tasks under three different terrain conditions (NT, RT, and ET). The adopted framework is motivated by previous studies on generalization from the OpenAI Retro contest [19] as well as the CoinRun benchmark [12], which concluded that vanilla deep RL algorithms trained with environmental stochasticity may be more effective for generalization than specialized algorithms. Similar to the authors of [8], we introduced a system of testbeds and an experimental protocol to evaluate the capability of DRL algorithms trained with or without domain randomization to generalize to environments both similar to and different from those seen during training. Furthermore, we introduced a common real-world scenario, a slippery terrain, on which the performance of all the trained DRL agents is compared.
Overall, agents trained with domain randomization showed better generalization performance, in terms of accumulated returns, than those trained without any form of domain randomization. However, the question of what type and level of domain randomization is necessary and sufficient for a specific task remains open. In future work, we therefore plan to introduce an optimization framework that treats the domain randomization parameter ranges as hyperparameters. This is important because, as the results demonstrate, different environments or tasks benefit from different levels of domain randomization, and an optimal setting of the domain randomization parameters would lead to better generalization results.