Article

A Deep Reinforcement Learning Approach to Optimal Morphologies Generation in Reconfigurable Tiling Robots

by Manivannan Kalimuthu 1, Abdullah Aamir Hayat 1,*, Thejus Pathmakumar 1, Mohan Rajesh Elara 1 and Kristin Lee Wood 2

1 ROAR Lab, Engineering Product Development Pillar, Singapore University of Technology and Design (SUTD), Singapore 487372, Singapore
2 College of Engineering, Design and Computing, University of Colorado Denver, 1200 Larimer St, Ste. 3034, Denver, CO 80204, USA
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3893; https://doi.org/10.3390/math11183893
Submission received: 3 August 2023 / Revised: 5 September 2023 / Accepted: 9 September 2023 / Published: 13 September 2023

Abstract

Reconfigurable robots have the potential to perform complex tasks by adapting their morphology to different environments. However, designing optimal morphologies for these robots is challenging due to the large design space and the complex interactions between the robot and the environment. An in-house robot named Smorphi, having four holonomic mobile units connected with three hinge joints, is designed to maximize area coverage with its shape-changing features using transformation design principles (TDP). A reinforcement learning (RL) approach is used to identify the optimal morphologies out of a vast combination of hinge angles for a given task by maximizing a reward signal that reflects the robot’s performance. The proposed approach involves three steps: (i) modeling the Smorphi design space with a Markov decision process (MDP) for sequential decision-making; (ii) a footprint-based complete coverage path planner to compute coverage and path length metrics for various Smorphi morphologies; and (iii) optimizing policies through proximal policy optimization (PPO) and asynchronous advantage actor–critic (A3C) reinforcement learning techniques, resulting in the generation of energy-efficient, optimal Smorphi robot configurations by maximizing rewards. The proposed approach is applied and validated using two different environment maps, and the results are also compared with suboptimal random shapes along with the Pareto-front solutions obtained using NSGA-II. The study contributes to the field of reconfigurable robots by providing a systematic approach for generating optimal morphologies that can improve the performance of reconfigurable robots in a variety of tasks.

1. Introduction

Reconfigurable robotic systems have gained significant attention due to their versatility and potential applications in a variety of domains. A notable framework for the taxonomy and evaluation of self-reconfigurable robotic systems is introduced in [1]. In recent years, considerable progress has been made in the exploration of advanced control methods, autonomy, and complex mechanisms for reconfigurable robots. For example, Refs. [2,3] introduce a staircase climbing robot capable of performing cleaning tasks such as vacuuming and sweeping by navigating over stairs. Similarly, Refs. [4,5] present a reconfigurable facade-accessing robot with autonomous capability to cross over the frames. Although there are numerous applications for reconfigurable robots, floor or horizontal area coverage tasks for cleaning and maintenance (CnM) are considered to be the most common and important.
The global floor cleaning equipment market is a rapidly growing industry, valued at approximately USD 7.53 billion in 2023 [6]. This growth has catalyzed the introduction of innovative cleaning solutions for both home and industrial uses. Service providers are increasingly adopting green cleaning practices, emphasizing eco-friendly techniques and reducing energy consumption. Advances in research have led to the development of reconfigurable robots that can accomplish a wide range of tasks efficiently. These robots can change their morphology by connecting multiple modules or by varying their size. As a result, they are widely adopted across various domains of maintenance and surveillance [7,8,9,10,11]. In the context of floor cleaning, these reconfigurable robots outperform fixed morphology robotic systems as they can vary their morphology and cover difficult-to-access regions [12]. However, the energy consumption of reconfigurable robots is a major challenge for their widespread adoption. This is because reconfigurable robots often need to change their shape during the area coverage, leading to significant energy use. This underscores the importance of devising energy-optimized strategies for area coverage using reconfigurable robots.
Design heuristics for reconfigurable robotic systems, exploring their applicability and efficiency in real-world scenarios, are highlighted in [7]. The implementation of a reconfigurable system for floor cleaning area coverage is demonstrated through hTetro, a Tetris-inspired shape-shifting robot reported in [12]. Moreover, energy efficiency in the design of self-reconfigurable robots, particularly those based on Tetris blocks, has become an important area of study, as highlighted in [13]. Notably, optimizing area coverage and path planning in these robots presents intriguing challenges. The hTetro robot, as highlighted in [12], demonstrates the ability to reconfigure into seven different morphologies for effective cleaning. However, its limitation of only seven morphologies led to the conceptualization of hTetro-infi [14], which can adapt its morphology indefinitely for greater area coverage potential.
The trade-offs between area coverage and energy usage based on user preferences were studied in [15]. Modified A* and genetic algorithms for efficient complete path planning were proposed in [16,17]. Further advancements in energy-efficient path planning in complex scenarios were explored to enhance the capabilities of reconfigurable robots [18]. In [19], the authors proposed a graph theory-based approach for planning paths that ensure complete coverage for reconfigurable robots. This approach uses dynamic programming to find the optimal path with the fewest reconfigurations, but it may be computationally expensive for real-world scenarios. A major contribution is made in [20] for energy-efficient area coverage by identifying a range of optimal morphologies using a metaheuristic algorithm in a reconfigurable robot. This approach generates morphology with an optimal balance between area coverage and energy consumption. Energy efficiency in reconfigurable robots has also been studied in other domains, such as underwater self-organizing robots [21] and uneven terrain [22]. Identifying the optimal morphology based on the map layout, where the system can learn from its environment and experiences to adapt and generate more effective morphologies that improve the performance of the reconfigurable tiling robot, remains an open area of study.
Based on the above discussion, the methodologies and objectives of this work are listed below:
  • Deep RL-based morphology generation: We investigate a deep reinforcement learning (RL) framework to generate optimal morphologies for the Smorphi robot. This framework allows the robot to modify its morphology by reorienting its blocks relative to one another through changes in the hinge angles.
  • MDP modeling: We model the generation of optimal Smorphi robot morphologies as a Markov decision process (MDP). This provides a structured approach to sequential decision-making for morphology optimization.
  • RL training with PPO and A3C: We train the MDP model using proximal policy optimization (PPO) and asynchronous advantage actor–critic (A3C) in simulated environments. These reinforcement learning algorithms enable efficient policy optimization for the Smorphi robot.
  • Evaluation and comparison: We evaluate the learned policy in various simulated environments and assess the effectiveness of our approach by comparing it with the non-dominated sorted genetic algorithm (NSGA-II), a metaheuristic optimization scheme. This comparative analysis helps demonstrate the advantages of our deep RL-based methodology in morphology optimization for the Smorphi robot.
The rest of the article is structured as follows. Section 2 discusses related works, and the concept of DRL is introduced in Section 3. Section 4 presents the problem formulation, including the mechanical design of the robot and the coverage planner. Section 5 describes the Markov decision process and the reinforcement learning algorithms used: proximal policy optimization (PPO) and asynchronous advantage actor–critic (A3C). Section 6 discusses the training and simulation results, followed by the limitations of the proposed approach. The conclusion is presented in Section 7.

2. Literature Survey

Reinforcement Learning in Robotics: Reinforcement learning has shown great potential for training robots to perform complex tasks in a variety of environments. One major area of research has focused on navigation and mapping. For example, in [23], the authors propose a reinforcement learning-based approach to explore unknown environments with a mobile robot, while [24] uses RL algorithms to map cluttered environments autonomously. In [25], the authors exploited an RL algorithm to solve the maze problem, exploring a modular RL-based approach that decomposes monolithic tasks into several chunks, where agents learn in parallel to achieve the goal. While these approaches have shown promise in mapping and navigation applications, it is important to note that they may require a large amount of data and computational resources, which may limit their applicability in real-time scenarios. RL is also widely adopted in complex tasks such as object recognition and manipulation. For example, in [26], the authors solved the object grasping problem by applying RL with visuomotor feedback, while [27] applies RL to learn how to manipulate objects in a simulated environment. Furthermore, deep reinforcement learning (DRL) is widely applied in multi-robot collaboration to efficiently coordinate homogeneous and heterogeneous systems. For example, in [28], the authors present a framework for multi-robot coordination, enabling multiple mobile robots to perform collectively in a decentralized manner. Also, in [29], the authors propose a cooperative exploration strategy for multiple mobile robots. Their approach uses dynamic Voronoi partitions to minimize duplicated exploration areas and a deep reinforcement learning-based collision avoidance algorithm to avoid sudden obstacles. However, it is crucial to evaluate the scalability and robustness of such DRL-based multi-robot coordination in complex, real-world scenarios.
Despite these limitations, several studies have explored RL approaches for improving energy efficiency in different robotic domains. Algorithms such as proximal policy optimization (PPO), asynchronous advantage actor–critic (A3C), deep deterministic policy gradient (DDPG), recurrent DQN, and trust region policy optimization (TRPO) have been exploited to solve path planning and navigation tasks in mobile robots [30]. Among these algorithms, PPO and A3C have gained prominence due to their stability, ease of implementation, and ability to handle high-dimensional action spaces, whereas TRPO restricts policy updates through hard constraints and DDPG, with its deterministic policy, is less suited to such complex tasks. For example, in [31], the authors proposed an energy-efficient path planning approach using a Q-learning algorithm in mobile robots, while [32] uses proximal policy optimization (PPO) to produce an energy-efficient gait. However, that work did not include real trials of the physical robot to validate the proposed theoretical framework. Similarly, in [33], the authors incorporated a deep inverse RL method in a legged robot, introducing preference labels based on trajectory ranking; the inferred reward map results in a more energy-efficient policy.
Reinforcement learning (RL) has also been studied in reconfigurable robots owing to its simplicity and adaptability. For example, in [34], the authors introduce a deep RL approach for the hardware design of a reconfigurable legged robot. This approach identified the best crawling motion for the robot in a given environment by modifying its structure. Similarly, in [35], the authors present an RL-based approach for a staircase-climbing robot. The robot learned to safely climb staircases of different sizes using this approach. Additionally, the authors of [36] present a unified approach for locomotion and manipulation in a reconfigurable robotic limb using RL. This approach allows skills to be shared between manipulation and locomotion. The RL approach has also been widely adopted for area coverage tasks in reconfigurable robots. For instance, in [37], the authors present strategies to optimally overcome obstacles with a reconfigurable robot named RSTAR using Q-learning. Energy-aware complete coverage for a reconfigurable robot named hRombo is reported in [38]. The authors identified the optimal trajectories to cover the given map using RL by modeling the problem as a traveling salesman problem (TSP). Similarly, in [39], the authors use an LSTM with the actor–critic experience replay (ACER) RL algorithm to generate optimal trajectories to cover the given environment. A few other coverage approaches using RL have also been explored in reconfigurable robots [40,41]. However, the RL-based approaches studied in self-reconfigurable robots have primarily focused on identifying optimal trajectories for a fixed set of configurations. Moreover, the existing approaches did not focus on finding the optimal morphology for a given map that maximizes area coverage and minimizes energy consumption, which is the focus of the present work.
Shape Optimization Using Reinforcement Learning: In recent years, reinforcement learning (RL) has been extensively used to address the shape optimization problem. For example, in [42], the authors use deep RL for airfoil shape optimization. The RL agent modifies the shape of a 2D airfoil within geometric constraints, and the modified shape is evaluated with a low-fidelity external solver to calculate the reward. Through training, the agent is able to generate high-efficiency airfoils. Similarly, in [43], the authors present a novel optimizer for aerodynamic shape optimization that uses RL. The optimizer, based on reinforcement learning and transfer learning, can significantly reduce the computational time required for aerodynamic shape optimization. In [44], the authors proposed a novel RL-based shape optimization algorithm for aerodynamic mitigation of wind-sensitive structures. The algorithm uses transfer learning and meta-learning to learn a policy that can quickly adapt to new environments. These are just a few examples of the many recent studies that have applied RL to shape optimization in fluid mechanics [45,46,47,48,49]. These studies have showcased the adaptability of RL algorithms, demonstrating promising results for shape optimization tasks. However, it is important to note that they have only been applied and studied extensively in fluid mechanics. The effectiveness of RL algorithms in the context of optimal morphology generation for a reconfigurable robot remains an open question.
Our work builds upon the advancements in reinforcement learning (RL)-driven shape optimization, aiming to extend these capabilities into the domain of reconfigurable robotics. By leveraging RL methodologies, our approach aims to address the unique challenge of generating optimal robot morphologies that maximize area coverage while minimizing energy consumption—a problem that remains largely unexplored within this field. We use the proximal policy optimization (PPO) and asynchronous advantage actor–critic (A3C) RL algorithms to achieve this goal.

3. Deep Reinforcement Learning (DRL)

Reinforcement learning is an iterative process in which an agent learns how to behave in an environment by performing a certain set of actions and observing the result or rewards of those actions [50]. At each time step, the agent observes the current state $s_t$ of the environment and chooses an action $a_t$. This action prompts the transition to the next state $s_{t+1}$, and the agent receives the reward $r_t$. This process continues until the agent reaches a target mean reward or a termination criterion, as shown in Figure 1. The agent’s main objective is to learn a sequence of steps or actions that maximizes the reward over an episode or scenario.
Deep reinforcement learning (DRL) extends the RL framework by integrating deep learning techniques, typically in the form of deep neural networks (DNNs) [51]. In deep RL, a policy is represented by a neural network. This means that the neural network is used to map states to actions. A state is a representation of the environment at a particular time. It can include information about the agent’s position, the objects in the environment, and the agent’s own state. An action is something that the agent can perform in the environment.
Training a DNN involves repeatedly adjusting the weights and biases to reduce the difference between the network’s output and the expected output [52]. This is typically achieved through a process known as back-propagation. In the context of DRL, the DNN learns to map the relationship between states and actions (inputs) and the corresponding expected rewards or values (outputs). Over time, this allows the agent to learn a policy that can effectively select actions to maximize cumulative reward.
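The interaction loop described above can be summarized in a short, generic sketch using the OpenAI Gym interface (the toolkit used later in Section 6). This is only an illustration of the observe–act–reward cycle, not the training code used in this work; the environment name and the random policy are placeholders, and the return signatures depend on the Gym version (the classic API is assumed here).

```python
import gym

# Generic agent-environment interaction loop (classic Gym API assumed).
# "CartPole-v1" is only a placeholder environment for illustration.
env = gym.make("CartPole-v1")

def policy(observation):
    # Placeholder policy: sample a random action.
    # In deep RL, a neural network would map the observation to an action.
    return env.action_space.sample()

for episode in range(5):
    state = env.reset()                               # initial state s_0
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                        # agent picks a_t from s_t
        state, reward, done, info = env.step(action)  # environment returns s_{t+1}, r_t
        total_reward += reward                        # accumulate the episode return
    print(f"episode {episode}: return = {total_reward}")
```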

4. Problem Formulation

In this work, we formulate the problem of identifying the optimal morphology of the reconfigurable robot Smorphi from its vast array of configurations as a reinforcement learning (RL) problem. The overall framework for the RL-based morphology optimization problem is depicted in Figure 2.
Agents and Environments: In this formulation, the Smorphi robot is treated as the agent. The agent begins by observing the current state of the environment, which is represented by the morphology of the robot (specifically, the configuration of the three hinges, denoted as $\theta_1$, $\theta_2$, $\theta_3$) and the 2D map of the environment, represented by binary values: 0 (obstacle space) and 1 (free space). Based on this observation, the agent chooses to perform an action in the environment by modifying the robot morphology. Subsequently, the modified morphology of the Smorphi robot is evaluated by an external footprint-based coverage planner. This planner computes the percentage of the area covered $P_c$ and the associated path length $P_{len}$. These metrics serve as the basis for calculating the reward for the agent, as detailed in Section 4.2. Furthermore, in this work, we assume that the path length is directly proportional to the energy consumption of the robot, as supported by the study in [53].
The agent continues to cycle through these stages of observation, action, and reward calculation until it either achieves a threshold percentage of area covered or until a predefined number of iterations is reached, which is considered as one episode or scenario. At the end of each episode, the mean reward is computed to evaluate the overall performance of the agent. This process is repeated until the algorithm converges. The agent interaction with the designed environment over one cycle is depicted in Figure 2.
The following subsections describe the mechanical design of the agent and the footprint-based complete coverage path planner (FBCPP).

4.1. Mechanical Design of the Agent

The design of the Smorphi robot is derived from transformation design principles (TDP) and facilitators [54,55]. The authors of those studies explored a wide range of reconfigurable products, toys, and biological species and generalized their reconfiguration behaviour into three principles and twenty facilitators. Transformation principles provide guidelines that, when embodied, create transformation, and facilitators enable the creation of mechanical transformation. Work that exploits these principles for designing reconfigurable robots is reported in [7,56,57,58,59]; for example, by exploiting these facilitators, the authors designed a reconfigurable pavement robot with different subsystems. By adapting these TDP, we developed the Smorphi robot. The list below shows the mapping of transformation principles and facilitators to our Smorphi robot:
  • Expose/Cover—Reveal or conceal a new surface to alter functionality [54] ⇒ The Smorphi robot exposes/covers its side surfaces while reconfiguring to a different configuration.
  • Generic connection—Employ internal or external connections (structural, power) that can be used by different modules to perform different or identical functions [54] ⇒ The Smorphi robot uses hinges at the corners of each block to allow identical modules to be connected, enabling a greater number of configurations.
  • Shared Drive Transmission—Transmit power from a common source to perform different functions in different configurations [54] ⇒ The Smorphi robot uses the same drive motor for locomotion and reconfiguration.
The design of Smorphi allows it to generate a plethora of unique configurations based on hinge angles. For instance, with a 180-degree step size, Smorphi can achieve eight distinct configurations, namely: [0, 0, 0], [0, 0, 180], [0, 180, 0], [0, 180, 180], [180, 0, 0], [180, 0, 180], [180, 180, 0], and [180, 180, 180]. As the step size decreases, the number of possible morphologies increases substantially. Specifically, with step sizes of 90, 45, 5, and 1 degree, Smorphi can generate 27, 125, 5.06 × 10^4, and 5.929 × 10^6 morphologies, respectively.
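These counts follow from allowing each of the three hinges to take any angle from 0° to 180° at the given step size, i.e., (180/step + 1)^3 combinations. The short sketch below merely reproduces the figures quoted above as a sanity check; it is an illustration of the counting argument, not part of the proposed framework.

```python
# Number of distinct Smorphi morphologies for a given hinge-angle step size.
# Each of the three hinges can take any angle in {0, step, 2*step, ..., 180}.
def morphology_count(step_deg: int, num_hinges: int = 3) -> int:
    angles_per_hinge = 180 // step_deg + 1
    return angles_per_hinge ** num_hinges

for step in (180, 90, 45, 5, 1):
    print(f"step = {step:3d} deg -> {morphology_count(step):,} morphologies")
# step = 180 deg -> 8 morphologies
# step =  90 deg -> 27 morphologies
# step =  45 deg -> 125 morphologies
# step =   5 deg -> 50,653 morphologies     (≈ 5.06 × 10^4)
# step =   1 deg -> 5,929,741 morphologies  (≈ 5.929 × 10^6)
```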
The design of the reconfigurable Smorphi robot is shown in Figure 3. The Smorphi robot is designed modularly, where each unit is identical and can be connected serially through hinges, allowing the blocks to reconfigure. Each block carries the control unit, locomotion mechanism, and other electrical accessories. A 7.4 V DC motor with an encoder, coupled to Mecanum wheels, drives the robot. The Mecanum wheels facilitate two functionalities: they provide locomotion in one configuration and help to reconfigure the robot in another configuration by driving the wheels at different velocities. Block 2 is considered the core unit, which carries the main processing unit, including an NVIDIA Jetson computer for control and a VectorNav IMU to maintain the robot’s orientation by compensating for the error caused by wheel slippage. Additionally, a 2D lidar is integrated into the robot for mapping and localization.

4.2. Footprint-Based Complete Coverage Path Planner (FBCPP) of the Environment

Our previous work proposed a footprint-based coverage planner that can perform complete coverage for the different morphologies of a reconfigurable robot [20]. In this work, we adapt that coverage planner to perform complete coverage in a given environment. The coverage planner outputs the percentage of area covered ($P_c$) and the total path length ($P_{len}$) for every unique morphology, which serves as the basis for calculating the reward function for the RL agent. For clarity, we briefly explain the footprint-based coverage planner in this section. The approach for performing area coverage for any morphology based on its footprint consists of two layers: footprint generation and complete coverage path planning based on the footprint.

4.2.1. Footprint Generation

The Smorphi robot’s footprint is taken to be the surface area it occupies at any given time. It is modelled as four independent square blocks connected by hinges on either the left or right side of the blocks, as shown in Figure 4a. Each square block has four vertices, for a total of 16 vertices across the four blocks. The coordinates of these vertices vary depending on the hinge angles, and they are projected onto the X and Y axes to compute the occupied area, as detailed in Equations (1) and (2).
$$\mathbf{p}_x^m = \left[\, p_1^m.x \;\; p_2^m.x \;\; \cdots \;\; p_m^m.x \,\right]^T \qquad (1)$$
$$\mathbf{p}_y^m = \left[\, p_1^m.y \;\; p_2^m.y \;\; \cdots \;\; p_m^m.y \,\right]^T \qquad (2)$$
The footprint size of the morphology is then determined using Equations (3) and (4):
$$w_{fp} = \left|\, \min(\mathbf{p}_x^m) - \max(\mathbf{p}_x^m) \,\right| \qquad (3)$$
$$l_{fp} = \left|\, \min(\mathbf{p}_y^m) - \max(\mathbf{p}_y^m) \,\right| \qquad (4)$$
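To make the projection step concrete, the sketch below computes the footprint size from a set of block vertex coordinates, which are assumed to have already been computed from the hinge angles by the robot's kinematic model; the example vertex layout at the end is purely hypothetical.

```python
import numpy as np

def footprint_size(vertices_xy):
    """Bounding-box footprint of a morphology from its block vertices.

    `vertices_xy` is an (N, 2) array of the X/Y coordinates of the blocks'
    corner points (16 points for the four Smorphi blocks), assumed to be
    already computed from the hinge angles. Implements Equations (1)-(4):
    project the vertices onto the axes and take the extent along each axis.
    """
    pts = np.asarray(vertices_xy, dtype=float)
    p_x, p_y = pts[:, 0], pts[:, 1]            # Eqs. (1) and (2): axis projections
    w_fp = abs(p_x.min() - p_x.max())          # Eq. (3): footprint width
    l_fp = abs(p_y.min() - p_y.max())          # Eq. (4): footprint length
    return float(w_fp), float(l_fp)

# Hypothetical example: four unit blocks laid out in a straight row.
blocks_in_a_row = [(x + dx, dy) for x in range(4) for dx in (0, 1) for dy in (0, 1)]
print(footprint_size(blocks_in_a_row))         # -> (4.0, 1.0)
```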

4.2.2. Complete Coverage Path Planning Based on the Footprint

The wavefront coverage algorithm calculates complete area coverage based on the generated footprint. It propagates in a wave-like pattern from the starting point and visits all nodes in the grid or map [60].
Upon receiving the hinge angles ($\theta_1$, $\theta_2$, $\theta_3$), a new configuration is generated, and the initial map is resized accordingly. The wavefront algorithm then calculates the complete area coverage on the resized map and transforms the resulting waypoints back to the original map. Subsequently, the algorithm sweeps through the footprint matrix to estimate the area covered.
The output includes the path length and the percentage of the area covered, which are optimized to identify the suitable robot morphology. Figure 4b–d shows an example of footprint-based coverage planning for the configuration ($\theta_1$, $\theta_2$, $\theta_3$) = (0°, 180°, 0°). The original map is resized based on the footprint value, and the wavefront complete coverage planner is applied to the resized map. The waypoints are then mapped back to the original map to calculate $P_c$ and $P_{len}$. The pseudocode for the FBCPP is given in Algorithm 1. The following section details the modeling of the problem of optimal morphology generation.
Algorithm 1: footprint-based complete coverage path planner (FBCPP)
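Algorithm 1 is provided in the source only as a figure. As a rough illustration of the wavefront idea it builds on, the sketch below assigns each free cell a wave distance from the start via breadth-first search and then visits cells in order of the wave values. This is a generic wavefront coverage pass over a binary grid, not the authors' FBCPP, which additionally resizes the map according to the footprint and maps waypoints back to the original resolution.

```python
from collections import deque
import numpy as np

def wavefront_coverage(grid, start):
    """Generic wavefront coverage on a binary grid (1 = free, 0 = obstacle).

    Builds a BFS "wave" of distances from `start`, then greedily steps to the
    unvisited neighbour with the highest wave value, jumping to the nearest
    unvisited cell when boxed in. Returns the visiting order of all reachable
    free cells.
    """
    grid = np.asarray(grid)
    rows, cols = grid.shape
    wave = -np.ones_like(grid, dtype=int)
    wave[start] = 0
    queue = deque([start])
    while queue:                                        # breadth-first wave propagation
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == 1 and wave[nr, nc] < 0:
                wave[nr, nc] = wave[r, c] + 1
                queue.append((nr, nc))

    unvisited = {(r, c) for r in range(rows) for c in range(cols) if wave[r, c] >= 0}
    path, cell = [], start
    while unvisited:
        unvisited.discard(cell)
        path.append(cell)
        neighbours = [(cell[0] + dr, cell[1] + dc)
                      for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        candidates = [n for n in neighbours if n in unvisited]
        if candidates:                                  # prefer the highest remaining wave value
            cell = max(candidates, key=lambda n: wave[n])
        elif unvisited:                                 # jump to the nearest uncovered cell
            cell = min(unvisited, key=lambda n: abs(n[0] - cell[0]) + abs(n[1] - cell[1]))
    return path

grid = np.ones((5, 5), dtype=int)
grid[2, 1:4] = 0                                        # a wall of obstacles
print(len(wavefront_coverage(grid, (0, 0))))            # 22 reachable free cells visited
```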

5. Markov Decision Process

The Markov decision process (MDP) is a fundamental model in the field of reinforcement learning (RL) [61]. An MDP is a mathematical model used to describe a physical or abstract system that evolves over time. It is typically represented as a tuple ( S , A , R , P , γ ) .
In this work, we model the generation of optimal morphologies for the reconfigurable robot as a Markov decision process (MDP), in line with the conventional RL approach.
$S$ is the state space. For our task, the state of the system is defined by the three hinge angles of the robot together with the binary 2D map $M$, which represents obstacle space (0) and free space (1):
$$S = \{\theta_1, \theta_2, \theta_3, M\}. \qquad (5)$$
$A$ is the action space. In our case, it is a multi-discrete action space, with separate actions for each hinge. The allowable range for each hinge is between 0 and 180 degrees. Actions are defined as follows:
$$A = \{\Delta\theta_1, \Delta\theta_2, \Delta\theta_3\}, \qquad (6)$$
where each $\Delta\theta_i \in \{-1, 0, +1\}$.
As per the MDP, the agent gains a reward from the environment based on the observations and the action performed in a state. A real-valued function $R$ determines the reward based on the current state of the robot, the action taken, and the resulting state. The reward function maps every state–action–next-state triplet to a real number, formally defined as $R: S \times A \times S \rightarrow \mathbb{R}$. The total reward is calculated from two main components, $R_1$ and $R_2$:
$R_1$ is designed to guide the agent’s behavior towards maximizing area coverage while assigning a negative reward for higher energy consumption. Weighting factors ($w_1$, $w_2$) are assigned to these two terms and can be varied depending on the task and its specific requirements:
$$R_1(S, A, S') = w_1 P_c - w_2 P_{len}. \qquad (7)$$
$R_2$ is a penalty term for maintaining the same hinge angles across two consecutive time steps, encouraging the agent to explore different configurations. With $\beta$ as the penalizing factor, it is defined as
$$R_2(S, A, S') = \begin{cases} -\beta & \text{if } S' = S, \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$
The total reward function $R$ is calculated as
$$R(S, A, S') = R_1 + R_2. \qquad (9)$$
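As a small illustration of Equations (7)–(9), the snippet below computes the reward for a single transition from the planner's outputs. The weight and penalty values are placeholders chosen for the example; the paper does not report the exact values used during training.

```python
def reward(p_c, p_len, state, next_state, w1=1.0, w2=0.5, beta=10.0):
    """Reward for one transition, following Equations (7)-(9).

    p_c: percentage of area covered and p_len: path length, both returned by
    the FBCPP for the new morphology. w1, w2, and beta are placeholder values.
    """
    r1 = w1 * p_c - w2 * p_len                  # coverage vs. energy trade-off, Eq. (7)
    r2 = -beta if next_state == state else 0.0  # penalty for an unchanged morphology, Eq. (8)
    return r1 + r2                              # total reward, Eq. (9)

# Example: the agent keeps the same hinge angles across two consecutive steps.
print(reward(p_c=95.0, p_len=100.0, state=(90, 45, 30), next_state=(90, 45, 30)))
# -> 1.0 * 95.0 - 0.5 * 100.0 - 10.0 = 35.0
```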
$P: S \times A \times S \rightarrow [0, 1]$ is the state transition function. Given the deterministic nature of our task, if action $a$ is applied in state $s$, we always obtain the resulting state $s'$ (i.e., $P = 1$). Thus, the transition function can be considered deterministic:
$$P = \begin{cases} 1 & \text{if } S' = f(S, A), \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$
In our context, the reconfigurable robot is treated as an agent that learns a behavior or policy that maximizes the reward obtained from its actions. The behavior of the agent is known as a policy $\pi$, which is a function that maps states and actions to real numbers, formally represented as $\pi: S \times A \rightarrow \mathbb{R}$. The policy is a probability distribution over actions given a state, denoted as $\pi(a|s)$. In the deterministic case, however, the policy function simplifies to a mapping from states to actions, denoted as $\pi: S \rightarrow A$; the policy then directly determines the action to be taken in a given state. In this work, we utilize two advanced reinforcement learning techniques, namely proximal policy optimization (PPO) and asynchronous advantage actor–critic (A3C). Both are reinforcement learning (RL) algorithms employed to determine the optimal morphology (shape) of the Smorphi robot with four blocks by optimizing the angles between the blocks. These algorithms are discussed in the following subsections.

5.1. Proximal Policy Optimization (PPO)

PPO is a family of policy gradient methods proposed by Schulman et al. [62]. PPO performs multiple epochs of stochastic gradient ascent on each batch of collected experience to update the policy, and collects experiences from the current policy using multiple parallel agents during training. The objective function of PPO ($J^{PPO}$), the clipped surrogate objective, is shown in Equation (11):
$$J^{PPO} = \hat{\mathbb{E}}\left[\min\left(r_t(\theta)\hat{A},\; \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}\right)\right], \qquad (11)$$
where $r_t(\theta)$ is the policy ratio, defined as the ratio of action probabilities as shown in Equation (12):
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}, \qquad (12)$$
where $\pi_\theta$ is the action probability under the current policy and $\pi_{\theta_{old}}$ is the action probability under the old policy used for the rollout. In Equation (11), $r_t(\theta)$ is clipped within the interval $[1-\epsilon, 1+\epsilon]$, and $\hat{A}$ is evaluated using generalized advantage estimation.
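To make Equation (11) concrete, the following is a minimal PyTorch sketch of the clipped surrogate loss (negated so that minimizing it maximizes the objective). It is a generic illustration, not the RLlib implementation used for training, and the batch tensors are placeholders.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate objective of Equation (11), negated for gradient descent.

    All inputs are 1-D tensors over a batch of (state, action) samples.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)          # r_t(theta), Eq. (12)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.mean(torch.min(unclipped, clipped))         # empirical expectation

# Placeholder batch; in practice these come from rollouts and the policy network.
log_new = torch.tensor([-0.9, -1.1, -0.5])
log_old = torch.tensor([-1.0, -1.0, -1.0])
adv = torch.tensor([1.0, -0.5, 2.0])
print(ppo_clip_loss(log_new, log_old, adv))
```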

5.2. Asynchronous Actor–Critic Agents (A3C)

Mnih et al. proposed asynchronous variants of conventional RL methods for DRL [63]. Among these asynchronous methods, A3C demonstrated superior performance on an Atari benchmark. The A3C method maintains a policy $\pi(a_t \mid s_t; \theta)$ and a value function estimate $V(s_t; \theta_v)$. Periodic updates are performed based on Equation (13):
$$\Delta\theta = \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, A(a_t, s_t, \theta, \theta_v), \qquad (13)$$
where $A(a_t, s_t, \theta, \theta_v)$ is the advantage function estimate, computed as
$$A(a_t, s_t, \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v). \qquad (14)$$
In Equation (14), $k$ is upper-bounded by a fixed number of actions and varies from state to state. A3C employs multiple agents (networks) with copies of the weights that interact with the environment in parallel and update the global network periodically. The results obtained using the PPO and A3C approaches are discussed in the following section.
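For clarity, a short sketch of the k-step advantage estimate in Equation (14) is given below, assuming a k-step rollout with a known bootstrap value; the numbers are placeholders.

```python
def k_step_advantage(rewards, v_s_t, v_s_t_plus_k, gamma=0.99):
    """k-step advantage estimate of Equation (14).

    rewards: [r_t, r_{t+1}, ..., r_{t+k-1}] collected during the rollout;
    v_s_t and v_s_t_plus_k: critic estimates V(s_t) and V(s_{t+k}).
    """
    k = len(rewards)
    discounted_return = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted_return + gamma ** k * v_s_t_plus_k - v_s_t

# Placeholder 3-step rollout.
print(k_step_advantage([1.0, 0.5, 0.2], v_s_t=2.0, v_s_t_plus_k=1.5, gamma=0.9))
```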

6. Training and Simulation Results

This section details the training and validation of the proposed framework for optimal morphologies generation. The training is performed using two different 2D maps, which resemble an indoor setting, as shown in Figure 5a,b. The black rectangular primitives in the maps represent obstacles, and the remaining spaces are considered free for the robot to explore. Each 2D map has a size of 5 m × 5 m, with a resolution of 0.1 m².
The environment for the RL training is created using the OpenAI Gym library [64], which provides a custom Application Programming Interface (API) to communicate between environments and different learning algorithms. It has four core functions:
  • Initialization: Initializes the environment, action, and observation space.
  • Step Function: Executes the action chosen by the agent, calculates the reward, and returns the resulting observation of the agent’s state.
  • Reset: Resets the robot morphology, reward criteria, and number of iterations at the end of each episode.
  • Render: Provides visualization of the environment.
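A skeleton of such a custom environment is sketched below, assuming the classic Gym API, a multi-discrete action space over the three hinges, and a dictionary observation of hinge angles and map. The class name, observation layout, thresholds, and reward wiring are illustrative placeholders rather than the authors' exact implementation, and the FBCPP call is a stub.

```python
import numpy as np
import gym
from gym import spaces

class SmorphiMorphologyEnv(gym.Env):
    """Sketch of a morphology-optimization environment (classic Gym API)."""

    def __init__(self, occupancy_map):
        super().__init__()
        self.map = np.asarray(occupancy_map, dtype=np.float32)   # binary 2D map
        # One sub-action per hinge: 0 -> -1 deg, 1 -> keep, 2 -> +1 deg.
        self.action_space = spaces.MultiDiscrete([3, 3, 3])
        self.observation_space = spaces.Dict({
            "hinges": spaces.Box(0.0, 180.0, shape=(3,), dtype=np.float32),
            "map": spaces.Box(0.0, 1.0, shape=self.map.shape, dtype=np.float32),
        })
        self.hinges = np.zeros(3, dtype=np.float32)
        self.steps = 0

    def reset(self):
        self.hinges = np.zeros(3, dtype=np.float32)              # reset morphology
        self.steps = 0
        return {"hinges": self.hinges.copy(), "map": self.map}

    def step(self, action):
        prev = self.hinges.copy()
        delta = np.asarray(action, dtype=np.float32) - 1.0       # {0,1,2} -> {-1,0,+1}
        self.hinges = np.clip(self.hinges + delta, 0.0, 180.0)
        self.steps += 1
        p_c, p_len = self._run_fbcpp(self.hinges)                # external planner (stub)
        reward = 1.0 * p_c - 0.5 * p_len                         # placeholder weights w1, w2
        if np.array_equal(self.hinges, prev):
            reward -= 10.0                                       # placeholder penalty beta
        # Episode ends at a coverage threshold or an iteration limit (placeholders).
        done = p_c >= 98.0 or self.steps >= 200
        obs = {"hinges": self.hinges.copy(), "map": self.map}
        return obs, reward, done, {}

    def _run_fbcpp(self, hinges):
        # Stub standing in for the footprint-based coverage planner of Section 4.2.
        return 90.0, 100.0
```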
The Gym library only provides the environment to interact with, not the DRL agent. Multiple libraries have been developed to implement DRL. In this work, we adopted RLlib from Ray, an open-source library that provides state-of-the-art algorithms for training [65]. It allows for scalable and optimized implementations of various reinforcement learning models. We also used the PyTorch framework to train the model.
Table 1 shows the list of hyperparameters used for training with the RLlib library. The hyperparameters used in this study were chosen based on the results of previous studies [62,63] and have been shown to be effective in a variety of tasks. Model training was performed on a machine running Ubuntu 20.04 with an Intel Core i7 CPU and an NVIDIA RTX 3080 GPU.
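Training with RLlib might then be launched along the lines of the sketch below. The entry point shown is the classic `tune.run` API, whose exact form depends on the Ray version (newer releases use `PPOConfig`/`Tuner`), and the hyperparameter values are placeholders rather than the ones listed in Table 1.

```python
import numpy as np
import ray
from ray import tune
from ray.tune.registry import register_env

# Register the custom environment sketched above (name and map are placeholders).
register_env("smorphi_morphology",
             lambda cfg: SmorphiMorphologyEnv(cfg["occupancy_map"]))

ray.init()
tune.run(
    "PPO",                                      # or "A3C" for the second algorithm
    stop={"timesteps_total": 300_000},          # roughly the 3 × 10^5 iterations reported (placeholder)
    config={
        "env": "smorphi_morphology",
        "env_config": {"occupancy_map": np.ones((50, 50), dtype=np.float32)},
        "framework": "torch",                   # the paper trains with PyTorch
        "num_workers": 4,                       # parallel rollout workers (placeholder)
        "lr": 3e-4,                             # placeholder learning rate
        "gamma": 0.99,                          # placeholder discount factor
    },
)
```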

6.1. Analysis of Simulation Results

We trained the model for optimal morphologies generation by implementing the proximal policy optimization (PPO) and asynchronous advantage actor–critic (A3C) algorithms. The results for Map-I and Map-II are shown in Figure 5c–f, which show the agent’s cumulative reward over the first 3 × 10^5 iterations for both algorithms.
As shown in Figure 5, both algorithms demonstrate a consistent increase in mean reward, indicating successful learning and adaptation by the agent. For Map-I, the PPO algorithm reached convergence in 2.5 × 10^5 iterations, and for Map-II, it reached convergence in 2.7 × 10^5 iterations. The A3C algorithm also showed increases in mean reward for both maps, but it did not reach steady-state convergence in either map.
The PPO algorithm produces a higher mean reward than the A3C algorithm in both maps. For example, in Map-I, the PPO algorithm achieved a mean reward of 122.5, while the A3C algorithm achieved a mean reward of 120. The difference in mean reward between the two algorithms is more pronounced in Map-II, where the PPO algorithm achieved a mean reward above 105, while the A3C algorithm only reached a peak reward below 90 within 3 × 10^5 iterations.
The PPO algorithm outperformed the A3C algorithm in generating optimal morphologies in both maps. This indicates the potential of the PPO algorithm for generating optimal morphologies and shape optimization in reinforcement learning problems.
Furthermore, the PPO algorithm demonstrates a smooth learning curve, which indicates its stability. This is because the PPO algorithm aims to provide reliable updates to the policy, avoiding fluctuations that could destabilize the learning process. The A3C algorithm, in contrast, exhibits a more volatile learning curve, displaying periods of rapid improvement followed by occasional performance drops. This behaviour can be attributed to the A3C algorithm’s asynchronous nature, which can lead to increased variability but potentially faster overall learning.

6.2. Evaluation of Optimal Morphologies

This section examines the performance of the top four morphologies produced by the proximal policy optimization (PPO) and asynchronous advantage actor–critic (A3C) algorithms. Table 2 and Table 3 summarize the results, which include the percentage of area coverage ($P_c$), path length ($P_{len}$), and the respective hinge angles for each morphology. Even though all of the morphologies listed in Table 2 and Table 3 can be considered optimal, the objective of this analysis is to identify the most effective morphologies for area coverage tasks.
The PPO algorithm consistently generates the most energy-efficient optimal morphologies for both Map-I and Map-II. Specifically, in Map-I (Table 2, row 1), the morphology with hinge angles ($\theta_1$, $\theta_2$, $\theta_3$) = (95°, 59°, 103°) achieved a coverage percentage ($P_c$) of 98.5136% and a path length ($P_{len}$) of 99. In Map-II (Table 3, row 1), the optimal configuration with hinge angles ($\theta_1$, $\theta_2$, $\theta_3$) = (15°, 15°, 39°) resulted in a $P_c$ of 96.9414% and a $P_{len}$ of 134. Similarly, the A3C algorithm's best morphology in Map-I (Table 2, row 5), with ($\theta_1$, $\theta_2$, $\theta_3$) = (37°, 33°, 137°), achieved a $P_c$ of 97.6692% with a $P_{len}$ of 99. In Map-II, the best morphology produced by A3C, with hinge angles ($\theta_1$, $\theta_2$, $\theta_3$) = (75°, 69°, 83°) (Table 3, row 5), produced a $P_c$ of 90.1382% and a $P_{len}$ of 72.
Comparing the results produced by the PPO and A3C algorithms, the PPO algorithm generates highly energy-efficient morphologies in both maps. For example, morphology set-1 in Table 2 (row 1) achieved a coverage percentage ($P_c$) of 98.5136% and a path length ($P_{len}$) of 99, while all the morphologies generated by the A3C algorithm for the same path length ($P_{len}$) have a lower percentage of area covered. In Map-II, the morphology produced by PPO (set-2, Table 3, row 2) achieved a $P_c$ of 96.6926% and a $P_{len}$ of 134, while all the morphologies generated by A3C in Map-II have significantly higher path lengths and lower percentages of area coverage than PPO. Figure 6 depicts the top optimal morphologies generated by the PPO and A3C algorithms in both maps, with the footprint-based coverage planner.
This comparative evaluation indicates that the optimal morphologies produced by the PPO algorithm are more energy-efficient compared to those generated by A3C. Therefore, based on the results obtained, we can conclude that the PPO algorithm demonstrates superior performance for tasks that aim to generate energy-efficient optimal morphologies.

6.3. Optimal vs. Sub-Optimal Morphologies

As highlighted in Section 4.1, the Smorphi robot has a wide range of configurations. Our RL algorithm generates a handful of optimal morphologies for area coverage tasks. This comparison aims to distinguish the optimal morphologies generated by the RL algorithm from the other morphologies in the design space. Table 4 shows a list of sub-optimal morphologies from the design space.
Comparing Table 2, Table 3 and Table 4 shows that the morphologies generated by the RL algorithm are superior in terms of both percentage of area covered ($P_c$) and path length ($P_{len}$). For example, in Map-I (Table 4, row 2), the random morphology with hinge angles of ($\theta_1$, $\theta_2$, $\theta_3$) = (70°, 170°, 86°) has a $P_c$ and $P_{len}$ of 88.350% and 121, respectively. In contrast, the morphology generated by PPO for Map-I (Table 2, row 3) has a higher percentage of area coverage ($P_c$ = 98.1036%) and a shorter path length ($P_{len}$ = 99). Similarly, the morphology with hinge angles of ($\theta_1$, $\theta_2$, $\theta_3$) = (71°, 113°, 125°) has lower energy consumption ($P_{len}$ = 68) but also a lower percentage of area coverage ($P_c$ = 86.518%), making it a sub-optimal morphology for area coverage tasks (Table 4, row 4).
Furthermore, the morphology with hinge angles of ($\theta_1$, $\theta_2$, $\theta_3$) = (0°, 180°, 0°) (Table 4, row 8) has the highest percentage of area covered ($P_c$ = 100%) compared to all the morphologies produced by PPO and A3C; however, its path length ($P_{len}$ = 224) is significantly higher, making it sub-optimal for energy-efficient area coverage tasks.
These results suggest that the RL algorithm is superior in producing energy-efficient optimal morphologies for area coverage tasks, with an optimal balance between area coverage and energy consumption.

6.4. Policy Validation

Policy validation is a crucial step in evaluating the generalization of a trained reinforcement learning (RL) agent. This process involves testing the trained agent in new environments that it was not trained on [66]. For this purpose, we created new 2D environments with varying obstacle positions and densities.
We evaluated agents based on the proximal policy optimization (PPO) algorithm in three newly created environments, comparing the pre-trained model with a baseline trained from scratch. The results are shown in Table 5. In all maps, the trained PPO agent demonstrated the ability to converge to an optimal result with a relatively small number of iterations. For example, in Validation Environment-1 (Table 5, row 1), the trained PPO agent took only 5 × 10^4 iterations to converge to the mean reward, while the baseline took 2.7 × 10^5 episodes. Similarly, in Validation Environment-3 (Table 5, row 3), the trained agent took only 6 × 10^4 iterations, leading to faster convergence, while the baseline model took 3.2 × 10^7 iterations. These results show that the PPO algorithm is able to learn effective policies for a variety of environments.
These policy validation results indicate that the proposed framework, leveraging PPO, exhibits scalability to diverse environments with efficient convergence. Given these results, it is plausible that, with more varied maps included in the training phase, the agents could converge even faster. These results highlight the importance and usefulness of reinforcement learning algorithms for optimal morphologies generation.

6.5. Comparison with the Optimization Approach

To further assess the effectiveness of our proposed approach, we compared the RL results with a state-of-the-art metaheuristic optimization scheme, the non-dominated sorting genetic algorithm (NSGA-II) [67]. We treated our problem as a multi-objective optimization task with the objectives of maximizing area coverage ($f_1$) and minimizing path length ($f_2$). The Pareto Frontier [68] results obtained using the NSGA-II algorithm for Map-I (Figure 5a) are presented in Figure 7 and Table 6.
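For reference, a multi-objective setup of this kind could be expressed as in the sketch below using the pymoo library; the library choice, the FBCPP stub, and all settings are assumptions for illustration, since the paper does not state which NSGA-II implementation was used.

```python
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize

class MorphologyProblem(ElementwiseProblem):
    """Two objectives: maximize area coverage (negated) and minimize path length."""

    def __init__(self):
        super().__init__(n_var=3, n_obj=2, xl=0.0, xu=180.0)  # three hinge angles in [0, 180]

    def _evaluate(self, x, out, *args, **kwargs):
        p_c, p_len = run_fbcpp(x)            # stub for the footprint-based coverage planner
        out["F"] = [-p_c, p_len]             # f1 = -coverage (maximized), f2 = path length

def run_fbcpp(hinge_angles):
    # Placeholder returning synthetic values; the real planner is described in Section 4.2.
    return 90.0 + 5.0 * np.cos(np.radians(hinge_angles)).sum(), 100.0 + hinge_angles.sum() / 10.0

result = minimize(MorphologyProblem(), NSGA2(pop_size=40), ("n_gen", 50), verbose=False)
print(result.F[:5])                          # a slice of the Pareto front approximation
```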
Analyzing the results, it is evident that our RL approach outperforms the multi-objective optimization in all considered scenarios. For example, the morphology with hinge angles of ($\theta_1$, $\theta_2$, $\theta_3$) = (55°, 17°, 3°) has a percentage of area coverage ($P_c$) of 95.8784% and a path length ($P_{len}$) of 107 (Table 6, row 8), whereas a morphology generated by reinforcement learning has a shorter path length and a higher percentage of area covered: the morphology with hinge angles of ($\theta_1$, $\theta_2$, $\theta_3$) = (105°, 51°, 81°) (Table 2, row 4) has a $P_c$ of 97.422% and a $P_{len}$ of 92, which is better than the morphology generated by the NSGA-II algorithm. Similarly, another morphology produced by the optimization algorithm, with hinge angles of ($\theta_1$, $\theta_2$, $\theta_3$) = (97°, 12°, 150°), has a shorter path length ($P_{len}$ = 68) (Table 6, row 1); however, its percentage of area covered is only $P_c$ = 86.9368%, making it a sub-optimal morphology for area coverage tasks. This indicates that the morphologies generated by the reinforcement learning algorithm are superior to those obtained by optimization.
We conclude that our approach generates morphologies with higher area coverage and lower energy consumption than the metaheuristic optimization scheme, making it a promising choice for real-world applications.

6.6. Limitations

Despite the promising results of our work, there are some limitations to our approach:
  • The proposed framework is applied and validated only on the reconfigurable tiling robot Smorphi. The results may differ for other classes of robots.
  • The current model is trained in a static environment. It does not consider dynamic or unpredictable environments. Therefore, the effectiveness of our proposed approach in such scenarios remains an open question.
  • The proposed framework is computationally expensive, as it requires training a reinforcement learning agent. This may limit its applicability to real-world applications with tight time constraints.
  • Our simulation-based study does not explicitly consider real-world physical constraints and hardware limitations.
While the above limitations exist in the work, we believe that the proposed framework is a valuable contribution to the field of reconfigurable robots. It provides a systematic approach for generating optimal morphologies that can be used to improve the performance of reconfigurable robots in a variety of tasks.

7. Conclusions

The study presented in this paper demonstrates the potential of deep reinforcement learning (RL) in generating optimal morphologies for reconfigurable robots. Our findings reveal that the design of the Smorphi robot, inspired by design principles, enables the robot to produce an array of configurations, which can be exploited to perform energy-efficient area coverage. Furthermore, our study shows that the PPO algorithm outperforms the A3C algorithm in generating energy-efficient optimal morphologies for a given map. The PPO algorithm showcases a consistent learning process, with a continuous increase in the mean rewards. In contrast, the A3C algorithm displayed inconsistent learning patterns due to its asynchronous nature, causing periodic fluctuations.
Our comparison of RL-generated morphologies with random morphologies revealed that the morphologies produced by the RL algorithm are superior in performance. This further reveals the advantage of exploring a vast array of configurations with the RL algorithm. Policy validation revealed that agents trained on one map could efficiently adapt and converge more quickly on an unseen map, underscoring RL’s scalability and its ability to learn and transfer across varying environments. Additionally, the comparative study with the metaheuristic optimization scheme NSGA-II revealed that the morphology produced by the RL algorithm dominated the Pareto Frontier results produced by the NSGA-II algorithm.
The proposed approach in this paper has some limitations, as outlined in Section 6.6. These limitations suggest avenues for future work to enhance the framework’s applicability and robustness across various scenarios and conditions. Furthermore, future work aims to explore the potential of multi-robot collaborative learning for optimal morphology generation using reinforcement learning and to validate the proposed approach in a real-world scenario. Additionally, we aim to explore the sensitivity of our approach to different hyperparameters. This could involve conducting a sensitivity analysis to assess how variations in hyperparameters, such as the generalized advantage estimation (GAE) parameter and discount factor, affect the performance of our method. Such an analysis can provide valuable insights into the robustness and adaptability of our approach under different settings. We will extend the proposed framework to generate optimal robot design using the generative adversarial network (GAN). By incorporating GANs into the design process, optimal and novel reconfigurable robot designs can be generated.

Author Contributions

Conceptualization, M.K., T.P. and M.R.E.; methodology, M.K., T.P. and A.A.H.; software, M.K. and T.P.; validation, M.K., T.P. and A.A.H.; formal analysis, M.K., T.P. and A.A.H.; supervision, M.R.E. and K.L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Robotics Programme under its National Robotics Programme (NRP) BAU, Ermine III: Deployable Reconfigurable Robots, Award No. M22NBK0054 and also supported by A*STAR under its “RIE2025 IAF-PP Advanced ROS2-native Platform Technologies for Cross sectorial Robotics Adoption (M21K1a0104)” programme. We wish to acknowledge the support of the SUTD DesignZ Center, the Comcast Media and Technology Center (CU Denver), and the College of Engineering, Design, and Computing (CU Denver).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

List of Symbols

$\theta$: Hinge angle
$P_c$: Percentage of area covered
$P_{len}$: Path length
$w_{fp}$: Width of morphology footprint
$l_{fp}$: Length of morphology footprint
$M$: Simulation environment
$M_R$: Resized map
$F_N$: Footprint matrix
$B_N$: Bounding box
$S$: The set of all possible states of the system
$A$: The set of all possible actions that the agent can take
$R$: Reward function
$P$: State transition probability
$\gamma$: Discount factor
$\pi$: Policy function
$r_t(\theta)$: Policy ratio
$\hat{A}$: Advantage function estimate
$\epsilon$: Hyperparameter that controls the amount of clipping performed on the policy ratio
$J^{PPO}$: Clipped surrogate objective function
$\hat{\mathbb{E}}$: Expected value operator

References

  1. Tan, N.; Hayat, A.A.; Elara, M.R.; Wood, K.L. A framework for taxonomy and evaluation of self-reconfigurable robotic systems. IEEE Access 2020, 8, 13969–13986. [Google Scholar] [CrossRef]
  2. Ilyas, M.; Yuyao, S.; Mohan, R.E.; Devarassu, M.; Kalimuthu, M. Design of sTetro: A modular, reconfigurable, and autonomous staircase cleaning robot. J. Sens. 2018, 2018, 8190802. [Google Scholar] [CrossRef]
  3. Veerajagadheswar, P.; Yuyao, S.; Kandasamy, P.; Elara, M.R.; Hayat, A.A. S-Sacrr: A staircase and slope accessing reconfigurable cleaning robot and its validation. IEEE Robot. Autom. Lett. 2022, 7, 4558–4565. [Google Scholar] [CrossRef]
  4. Tun, T.T.; Elara, M.R.; Kalimuthu, M.; Vengadesh, A. Glass facade cleaning robot with passive suction cups and self-locking trapezoidal lead screw drive. Autom. Constr. 2018, 96, 180–188. [Google Scholar] [CrossRef]
  5. Vega-Heredia, M.; Mohan, R.E.; Wen, T.Y.; Siti’Aisyah, J.; Vengadesh, A.; Ghanta, S.; Vinu, S. Design and modelling of a modular window cleaning robot. Autom. Constr. 2019, 103, 268–278. [Google Scholar] [CrossRef]
  6. Floor Cleaning Equipment Market Size, Share & Trends Analysis Report by Product (Scrubber, Vacuum Cleaner, Sweeper), by Application (Residential, Commercial), by Region, and Segment Forecasts, 2019–2025. 2023. Available online: https://www.grandviewresearch.com/industry-analysis/floor-cleaning-equipment-market# (accessed on 15 July 2023).
  7. Hayat, A.A.; Yi, L.; Kalimuthu, M.; Elara, M.; Wood, K.L. Reconfigurable robotic system design with application to cleaning and maintenance. J. Mech. Des. 2022, 144, 063305. [Google Scholar] [CrossRef]
  8. Kwon, Y.S.; Jung, E.J.; Lim, H.; Yi, B.J. Design of a reconfigurable indoor pipeline inspection robot. In Proceedings of the 2007 International Conference on Control, Automation and Systems, Seoul, Republic of Korea, 17–20 October 2007; pp. 712–716. [Google Scholar]
  9. Qiao, G.; Song, G.; Wang, Y.; Zhang, J.; Wang, W. Autonomous network repairing of a home security system using modular self-reconfigurable robots. IEEE Trans. Consum. Electron. 2013, 59, 562–570. [Google Scholar] [CrossRef]
  10. Zhang, H.; Wang, W.; Deng, Z.; Zong, G.; Zhang, J. A novel reconfigurable robot for urban search and rescue. Int. J. Adv. Robot. Syst. 2006, 3, 48. [Google Scholar] [CrossRef]
  11. Liang, G.; Luo, H.; Li, M.; Qian, H.; Lam, T.L. Freebot: A freeform modular self-reconfigurable robot with arbitrary connection point-design and implementation. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 6506–6513. [Google Scholar]
  12. Prabakaran, V.; Elara, M.R.; Pathmakumar, T.; Nansai, S. hTetro: A tetris inspired shape shifting floor cleaning robot. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 6105–6112. [Google Scholar]
  13. Hayat, A.A.; Karthikeyan, P.; Vega-Heredia, M.; Elara, M.R. Modeling and assessing of self-reconfigurable cleaning robot htetro based on energy consumption. Energies 2019, 12, 4112. [Google Scholar] [CrossRef]
  14. Samarakoon, S.B.P.; Muthugala, M.V.J.; Le, A.V.; Elara, M.R. HTetro-infi: A reconfigurable floor cleaning robot with infinite morphologies. IEEE Access 2020, 8, 69816–69828. [Google Scholar] [CrossRef]
  15. Muthugala, M.V.J.; Samarakoon, S.B.P.; Elara, M.R. Tradeoff between area coverage and energy usage of a self-reconfigurable floor cleaning robot based on user preference. IEEE Access 2020, 8, 76267–76275. [Google Scholar] [CrossRef]
  16. Le, A.V.; Prabakaran, V.; Sivanantham, V.; Mohan, R.E. Modified a-star algorithm for efficient coverage path planning in tetris inspired self-reconfigurable robot with integrated laser sensor. Sensors 2018, 18, 2585. [Google Scholar] [CrossRef] [PubMed]
  17. Le, A.V.; Arunmozhi, M.; Veerajagadheswar, P.; Ku, P.C.; Minh, T.H.Q.; Sivanantham, V.; Mohan, R.E. Complete path planning for a tetris-inspired self-reconfigurable robot by the genetic algorithm of the traveling salesman problem. Electronics 2018, 7, 344. [Google Scholar] [CrossRef]
  18. Kyaw, P.T.; Le, A.V.; Veerajagadheswar, P.; Elara, M.R.; Thu, T.T.; Nhan, N.H.K.; Van Duc, P.; Vu, M.B. Energy-efficient path planning of reconfigurable robots in complex environments. IEEE Trans. Robot. 2022, 38, 2481–2494. [Google Scholar] [CrossRef]
  19. Cheng, K.P.; Mohan, R.E.; Nhan, N.H.K.; Le, A.V. Graph theory-based approach to accomplish complete coverage path planning tasks for reconfigurable robots. IEEE Access 2019, 7, 94642–94657. [Google Scholar] [CrossRef]
  20. Kalimuthu, M.; Pathmakumar, T.; Hayat, A.A.; Elara, M.R.; Wood, K.L. A metaheuristic approach to optimal morphology in reconfigurable tiling robots. Complex Intell. Syst. 2023, 1–20. [Google Scholar] [CrossRef]
  21. Furno, L.; Blanke, M.; Galeazzi, R.; Christensen, D.J. Self-reconfiguration of modular underwater robots using an energy heuristic. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 6277–6284. [Google Scholar]
  22. Norouzi, M.; Miro, J.V.; Dissanayake, G. Planning stable and efficient paths for reconfigurable robots on uneven terrain. J. Intell. Robot. Syst. 2017, 87, 291–312. [Google Scholar] [CrossRef]
  23. Chaplot, D.S.; Gandhi, D.; Gupta, S.; Gupta, A.; Salakhutdinov, R. Learning to explore using active neural slam. arXiv 2020, arXiv:2004.05155. [Google Scholar]
  24. Zhu, D.; Li, T.; Ho, D.; Wang, C.; Meng, M.Q.H. Deep reinforcement learning supervised autonomous exploration in office environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 7548–7555. [Google Scholar]
  25. Wang, J.; Elfwing, S.; Uchibe, E. Modular deep reinforcement learning from reward and punishment for robot navigation. Neural Netw. 2021, 135, 115–126. [Google Scholar] [CrossRef]
  26. Joshi, S.; Kumra, S.; Sahin, F. Robotic grasping using deep reinforcement learning. In Proceedings of the 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), Hong Kong, China, 8 October 2020; pp. 1461–1466. [Google Scholar]
  27. Andrychowicz, O.M.; Baker, B.; Chociej, M.; Jozefowicz, R.; McGrew, B.; Pachocki, J.; Petron, A.; Plappert, M.; Powell, G.; Ray, A.; et al. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 2020, 39, 3–20. [Google Scholar] [CrossRef]
  28. Foerster, J.; Assael, I.A.; De Freitas, N.; Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. arXiv 2016, arXiv:1605.06676. [Google Scholar]
  29. Hu, J.; Niu, H.; Carrasco, J.; Lennox, B.; Arvin, F. Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning. IEEE Trans. Veh. Technol. 2020, 69, 14413–14423. [Google Scholar] [CrossRef]
  30. Zhu, K.; Zhang, T. Deep reinforcement learning based mobile robot navigation: A review. Tsinghua Sci. Technol. 2021, 26, 674–691. [Google Scholar] [CrossRef]
  31. Low, E.S.; Ong, P.; Cheah, K.C. Solving the optimal path planning of a mobile robot using improved Q-learning. Robot. Auton. Syst. 2019, 115, 143–161. [Google Scholar] [CrossRef]
  32. Bing, Z.; Lemke, C.; Cheng, L.; Huang, K.; Knoll, A. Energy-efficient and damage-recovery slithering gait design for a snake-like robot based on reinforcement learning and inverse reinforcement learning. Neural Netw. 2020, 129, 323–333. [Google Scholar] [CrossRef]
  33. Gan, L.; Grizzle, J.W.; Eustice, R.M.; Ghaffari, M. Energy-based legged robots terrain traversability modeling via deep inverse reinforcement learning. IEEE Robot. Autom. Lett. 2022, 7, 8807–8814. [Google Scholar] [CrossRef]
  34. Ha, S.; Kim, J.; Yamane, K. Automated deep reinforcement learning environment for hardware of a modular legged robot. In Proceedings of the 2018 15th International Conference on Ubiquitous Robots (UR), Honolulu, HI, USA, 26–30 June 2018; pp. 348–354. [Google Scholar]
  35. Mitriakov, A.; Papadakis, P.; Garlatti, S. Staircase traversal via reinforcement learning for active reconfiguration of assistive robots. In Proceedings of the 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  36. Sun, H.; Yang, L.; Gu, Y.; Pan, J.; Wan, F.; Song, C. Bridging Locomotion and Manipulation Using Reconfigurable Robotic Limbs via Reinforcement Learning. Biomimetics 2023, 8, 364. [Google Scholar] [CrossRef]
  37. Yehezkel, L.; Berman, S.; Zarrouk, D. Overcoming obstacles with a reconfigurable robot using reinforcement learning. IEEE Access 2020, 8, 217541–217553. [Google Scholar] [CrossRef]
  38. Le, A.V.; Parween, R.; Kyaw, P.T.; Mohan, R.E.; Minh, T.H.Q.; Borusu, C.S.C.S. Reinforcement learning-based energy-aware area coverage for reconfigurable hRombo tiling robot. IEEE Access 2020, 8, 209750–209761. [Google Scholar] [CrossRef]
  39. Lakshmanan, A.K.; Mohan, R.E.; Ramalingam, B.; Le, A.V.; Veerajagadeshwar, P.; Tiwari, K.; Ilyas, M. Complete coverage path planning using reinforcement learning for tetromino based cleaning and maintenance robot. Autom. Constr. 2020, 112, 103078. [Google Scholar] [CrossRef]
  40. Le, A.V.; Veerajagadheswar, P.; Thiha Kyaw, P.; Elara, M.R.; Nhan, N.H.K. Coverage path planning using reinforcement learning-based TSP for hTetran—A polyabolo-inspired self-reconfigurable tiling robot. Sensors 2021, 21, 2577. [Google Scholar] [CrossRef] [PubMed]
  41. Kyaw, P.T.; Paing, A.; Thu, T.T.; Mohan, R.E.; Le, A.V.; Veerajagadheswar, P. Coverage path planning for decomposition reconfigurable grid-maps using deep reinforcement learning based travelling salesman problem. IEEE Access 2020, 8, 225945–225956. [Google Scholar] [CrossRef]
  42. Dussauge, T.P.; Sung, W.J.; Pinon Fischer, O.J.; Mavris, D.N. A reinforcement learning approach to airfoil shape optimization. Sci. Rep. 2023, 13, 9753. [Google Scholar] [CrossRef] [PubMed]
  43. Yan, X.; Zhu, J.; Kuang, M.; Wang, X. Aerodynamic shape optimization using a novel optimizer based on machine learning techniques. Aerosp. Sci. Technol. 2019, 86, 826–835. [Google Scholar] [CrossRef]
  44. Li, S.; Snaiki, R.; Wu, T. A knowledge-enhanced deep reinforcement learning-based shape optimizer for aerodynamic mitigation of wind-sensitive structures. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 733–746. [Google Scholar] [CrossRef]
  45. Viquerat, J.; Rabault, J.; Kuhnle, A.; Ghraieb, H.; Larcher, A.; Hachem, E. Direct shape optimization through deep reinforcement learning. J. Comput. Phys. 2021, 428, 110080. [Google Scholar] [CrossRef]
  46. Bhola, S.; Pawar, S.; Balaprakash, P.; Maulik, R. Multi-fidelity reinforcement learning framework for shape optimization. J. Comput. Phys. 2023, 482, 112018. [Google Scholar] [CrossRef]
  47. Rabault, J.; Ren, F.; Zhang, W.; Tang, H.; Xu, H. Deep reinforcement learning in fluid mechanics: A promising method for both active flow control and shape optimization. J. Hydrodyn. 2020, 32, 234–246. [Google Scholar] [CrossRef]
  48. Ghraieb, H.; Viquerat, J.; Larcher, A.; Meliga, P.; Hachem, E. Single-step deep reinforcement learning for two-and three-dimensional optimal shape design. AIP Adv. 2022, 12, 085108. [Google Scholar] [CrossRef]
  49. Yonekura, K.; Hattori, H. Framework for design optimization using deep reinforcement learning. Struct. Multidiscip. Optim. 2019, 60, 1709–1713. [Google Scholar] [CrossRef]
  50. Wiering, M.A.; Van Otterlo, M. Reinforcement learning. Adapt. Learn. Optim. 2012, 12, 729. [Google Scholar]
  51. Li, Y. Deep reinforcement learning: An overview. arXiv 2017, arXiv:1701.07274. [Google Scholar]
  52. Montavon, G.; Samek, W.; Müller, K.R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 2018, 73, 1–15. [Google Scholar] [CrossRef]
  53. Kalimuthu, M.; Pathmakumar, T.; Hayat, A.A.; Veerajagadheswar, P.; Elara, M.R.; Wood, K.L. Optimal Morphologies of n-Omino-Based Reconfigurable Robot for Area Coverage Task Using Metaheuristic Optimization. Mathematics 2023, 11, 948. [Google Scholar] [CrossRef]
  54. Singh, V.; Skiles, S.M.; Krager, J.E.; Wood, K.L.; Jensen, D.; Sierakowski, R. Innovations in design through transformation: A fundamental study of transformation principles. J. Mech. Des. Trans. ASME 2009, 131, 081010. [Google Scholar] [CrossRef]
  55. Weaver, J.; Wood, K.; Crawford, R.; Jensen, D. Transformation design theory: A meta-analogical framework. J. Comput. Inf. Sci. Eng. 2010, 10, 031012. [Google Scholar] [CrossRef]
  56. Hayat, A.A.; Parween, R.; Elara, M.R.; Parsuraman, K.; Kandasamy, P.S. Panthera: Design of a reconfigurable pavement sweeping robot. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7346–7352. [Google Scholar]
  57. Kalimuthu, M.; Hayat, A.; Elara, M.; Wood, K. Transformation design Principles as enablers for designing Reconfigurable Robots. In Proceedings of the International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, American Society of Mechanical Engineers, Virtual, 17–19 August 2021; Volume 85420, p. V006T06A008. [Google Scholar]
  58. Kalimuthu, M.; Hayat, A.A.; Elara, M.R.; Wood, K.L. Reconfigurable Robot Design Aided with Design Cards. In Proceedings of the International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, American Society of Mechanical Engineers, St. Louis, MO, USA, 14–17 August 2022; Volume 86267, p. V006T06A010. [Google Scholar]
  59. Ong, J.H.; Hayat, A.A.; Manimuthu, M.A.A.; Elara, M.R.; Wood, K. Transforming Spherical Robots. In Proceedings of the International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Boston, MA, USA, 20–23 August 2023; pp. 1–12. [Google Scholar]
  60. Zelinsky, A.; Jarvis, R.A.; Byrne, J.; Yuta, S. Planning paths of complete coverage of an unstructured environment by a mobile robot. In Proceedings of the International Conference on Advanced Robotics, Atlanta, GA, USA, 2–6 May 1993; Volume 13, pp. 533–538. [Google Scholar]
  61. Van Otterlo, M.; Wiering, M. Reinforcement learning and markov decision processes. In Reinforcement Learning: State-of-the-Art; Springer: Berlin/Heidelberg, Germany, 2012; pp. 3–42. [Google Scholar]
  62. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  63. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  64. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar]
  65. Liang, E.; Liaw, R.; Nishihara, R.; Moritz, P.; Fox, R.; Goldberg, K.; Gonzalez, J.; Jordan, M.; Stoica, I. RLlib: Abstractions for distributed reinforcement learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3053–3062. [Google Scholar]
  66. Long, P.; Fan, T.; Liao, X.; Liu, W.; Zhang, H.; Pan, J. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 6252–6259. [Google Scholar]
  67. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197. [Google Scholar] [CrossRef]
  68. Van Veldhuizen, D.A.; Lamont, G.B. Evolutionary computation and convergence to a pareto front. In Late Breaking Papers at the Genetic Programming 1998 Conference; Citeseer: University Park, PA, USA, 1998; pp. 221–228. [Google Scholar]
Figure 1. General reinforcement learning (RL) framework [50].
Figure 2. Proposed morphology generation framework using reinforcement learning for a reconfigurable robot: (a) simplified block diagram, (b) detailed overview.
Figure 3. Mechanical design of the Smorphi robot.
Figure 4. (a) Footprint modelling of the Smorphi robot with the four blocks (B1–B4) and the relative hinge angles labelled θ1, θ2, θ3; the footprint dimensions are denoted as l and w. (b) Initial map. (c) Smorphi's footprint configuration with (θ1, θ2, θ3) = (0°, 180°, 0°). (d) Resized map based on the footprint with the implemented complete coverage planning.
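The footprint-based resizing illustrated in Figure 4b–d can be viewed as a coarse re-gridding of the occupancy map: each coarse cell spans roughly the bounding box of the current footprint and is marked occupied if any underlying fine cell contains an obstacle. The snippet below is a minimal sketch of that idea for intuition only; the helper name, grid values, and cell size are assumptions and do not reproduce the coverage planner used in the study.

```python
import numpy as np

def resize_map(occupancy, cell_px):
    """Coarsen a binary occupancy grid (1 = obstacle, 0 = free) so that one
    coarse cell spans cell_px x cell_px fine cells, i.e., roughly the bounding
    box of the current Smorphi footprint. A coarse cell becomes an obstacle
    if any fine cell inside it is occupied."""
    h, w = occupancy.shape
    # Pad with obstacles so the map divides evenly into footprint-sized cells.
    pad_h, pad_w = (-h) % cell_px, (-w) % cell_px
    padded = np.pad(occupancy, ((0, pad_h), (0, pad_w)), constant_values=1)
    H, W = padded.shape
    blocks = padded.reshape(H // cell_px, cell_px, W // cell_px, cell_px)
    return blocks.max(axis=(1, 3))

# Toy 6 x 8 map with a small obstacle patch (illustrative values only).
fine = np.zeros((6, 8), dtype=int)
fine[2:4, 3:5] = 1
print(resize_map(fine, cell_px=2))
```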
Figure 5. Representation of simulation Map-I (a) and Map-II (b). Observed mean reward vs. number of iterations for PPO and A3C on Map-I (c,d) and on Map-II (e,f).
Figure 6. Top optimal morphologies generated by the PPO and A3C algorithms for Map-I (a,b) and Map-II (c,d). (e,f) show the footprint-based coverage plans for the morphologies in (a,b); the green shade indicates the area covered by the robot, while the red line shows the path taken by the robot to cover the map.
Figure 7. Pareto frontier results obtained from the optimization algorithm for Map-I (a); morphologies produced by the optimization with hinge angles of (θ1, θ2, θ3) = (159°, 20°, 152°) (b) and (θ1, θ2, θ3) = (51°, 114°, 136°) (c).
Table 1. Hyperparameters for PPO and A3C.
Hyperparameter | PPO | A3C
Learning Rate | 3 × 10⁻⁴ | 7 × 10⁻⁴
Batch Size | 4096 | 32
Discount Factor (γ) | 0.99 | 0.99
GAE Parameter (λ) | 0.95 | 0.95
PPO Epochs | 15 | –
Max Gradient Norm | 0.5 | 0.5
Entropy Coefficient | 0.01 | 0.01
Value Loss Coefficient | 0.5 | 0.5
Optimization Steps | 5 | 5
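For readers who want a comparable training setup, the sketch below shows one way the Table 1 values could be passed to an RLlib [65] PPO trainer using the dictionary-style configuration of older RLlib releases; the environment id is a placeholder for a custom Gym [64] environment, and exact key names can differ between RLlib versions.

```python
import ray
from ray import tune

# Table 1 values mapped onto RLlib's (older, dict-style) PPO configuration keys.
ppo_config = {
    "env": "Smorphi-Morphology-v0",  # placeholder id for a custom Gym environment
    "lr": 3e-4,                      # learning rate
    "train_batch_size": 4096,        # batch size
    "gamma": 0.99,                   # discount factor
    "lambda": 0.95,                  # GAE parameter
    "num_sgd_iter": 15,              # PPO epochs per training batch
    "grad_clip": 0.5,                # max gradient norm
    "entropy_coeff": 0.01,
    "vf_loss_coeff": 0.5,
}

if __name__ == "__main__":
    ray.init()
    # A3C follows the same pattern with the "A3C" trainer and lr = 7e-4.
    tune.run("PPO", config=ppo_config, stop={"training_iteration": 200})
```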
Table 2. Optimal morphologies generated by PPO and A3C for Map-I.
Sl. No | Algorithm | Map | Hinge Angles (θ1, θ2, θ3) in Degrees | Area Covered (Pc) in % | Path Length (Plen)
1 | PPO | Map-I | (95, 59, 103) | 98.5136 | 99
2 | PPO | Map-I | (97, 61, 107) | 98.446 | 99
3 | PPO | Map-I | (96, 61, 109) | 98.1036 | 99
4 | PPO | Map-I | (105, 51, 81) | 97.422 | 92
5 | A3C | Map-I | (37, 33, 137) | 97.6692 | 99
6 | A3C | Map-I | (41, 31, 129) | 97.5516 | 99
7 | A3C | Map-I | (37, 35, 125) | 97.438 | 107
8 | A3C | Map-I | (33, 31, 125) | 97.1308 | 102
The % of area covered (Pc) reported in all the tables includes the obstacle region in the map. The obstacle density is 18.7756% for Map-I and 22.2142% for Map-II.
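Coverage and path length pull in opposite directions, so the rows of Tables 2 and 3 can only be ranked once the two metrics are traded off against each other. The snippet below ranks the PPO candidates for Map-I from Table 2 with a simple weighted score; the weights are arbitrary placeholders and are not the reward formulation used in the paper.

```python
# PPO candidates for Map-I from Table 2: (hinge angles, Pc in %, Plen)
candidates = [
    ((95, 59, 103), 98.5136, 99),
    ((97, 61, 107), 98.4460, 99),
    ((96, 61, 109), 98.1036, 99),
    ((105, 51, 81), 97.4220, 92),
]

def score(pc, plen, w_cov=1.0, w_len=0.1):
    """Illustrative scalar trade-off: reward coverage, penalise path length.
    The weights are placeholders, not the paper's reward."""
    return w_cov * pc - w_len * plen

ranked = sorted(candidates, key=lambda c: score(c[1], c[2]), reverse=True)
for angles, pc, plen in ranked:
    print(f"{angles}: Pc = {pc:.2f}%, Plen = {plen}, score = {score(pc, plen):.2f}")
```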
Table 3. Optimal morphologies generated by PPO and A3C for Map-II.
Sl. No | Algorithm | Map | Hinge Angles (θ1, θ2, θ3) in Degrees | Area Covered (Pc) in % | Path Length (Plen)
1 | PPO | Map-II | (15, 15, 39) | 96.9414 | 134
2 | PPO | Map-II | (15, 21, 39) | 96.6926 | 134
3 | PPO | Map-II | (41, 43, 95) | 95.2774 | 96
4 | PPO | Map-II | (43, 39, 95) | 95.203 | 96
5 | A3C | Map-II | (75, 69, 83) | 90.1382 | 72
6 | A3C | Map-II | (83, 67, 97) | 89.9714 | 79
7 | A3C | Map-II | (87, 71, 111) | 88.9814 | 79
8 | A3C | Map-II | (85, 71, 103) | 88.7038 | 72
Table 4. Randomly selected morphologies from the design space for Map-I.
Sl. No | Hinge Angles (θ1, θ2, θ3) in Degrees | Area Covered (Pc) in % | Path Length (Plen)
1 | (171, 49, 29) | 79.2364 | 92
2 | (70, 170, 86) | 88.350 | 121
3 | (120, 171, 27) | 88.6871 | 128
4 | (71, 113, 125) | 86.518 | 68
5 | (2, 55, 17) | 82.1452 | 75
6 | (146, 179, 58) | 88.0688 | 135
7 | (165, 93, 2) | 83.1008 | 116
8 | (0, 180, 0) | 100 | 224
Table 5. Policy validation in three new environments.
Environment | Iterations to Convergence (Trained Model) | Iterations to Convergence (Baseline Model)
Environment-1 | 5 × 10⁴ | 2.7 × 10⁵
Environment-2 | 4.5 × 10⁴ | 2.1 × 10⁵
Environment-3 | 6 × 10⁴ | 3.2 × 10⁵
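Read as speed-up factors, Table 5 indicates that warm-starting from the trained policy converges roughly five times faster than the baseline in every new environment; the short computation below derives those ratios directly from the table values.

```python
# Iterations to convergence from Table 5: (trained model, baseline model)
results = {
    "Environment-1": (5.0e4, 2.7e5),
    "Environment-2": (4.5e4, 2.1e5),
    "Environment-3": (6.0e4, 3.2e5),
}
for env, (trained, baseline) in results.items():
    print(f"{env}: trained model converges {baseline / trained:.1f}x faster")
```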
Table 6. Optimization results for Map-I using the NSGA-II algorithm.
Sl. No | Hinge Angles (θ1, θ2, θ3) in Degrees | Area Covered (Pc) in % | Path Length (Plen)
1 | (97, 12, 150) | 86.9368 | 68
2 | (159, 20, 152) | 94.9972 | 92
3 | (53, 39, 167) | 92.5272 | 85
4 | (62, 96, 65) | 82.5484 | 59
5 | (73, 90, 124) | 85.1824 | 66
6 | (51, 114, 136) | 83.6408 | 65
7 | (70, 27, 136) | 90.6232 | 75
8 | (55, 17, 3) | 95.8784 | 107
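Because NSGA-II [67] returns a set of trade-off solutions rather than a single optimum, the rows of Table 6 can be screened for Pareto optimality directly: a candidate is retained only if no other row covers at least as much area with a path that is no longer, while being strictly better in one of the two objectives. The snippet below applies that dominance test to the Table 6 values as a sanity check; it is an illustration of the Pareto criterion, not the NSGA-II implementation used in the study.

```python
# Rows from Table 6: (hinge angles, Pc in %, Plen); Pc is maximised, Plen minimised.
rows = [
    ((97, 12, 150), 86.9368, 68),
    ((159, 20, 152), 94.9972, 92),
    ((53, 39, 167), 92.5272, 85),
    ((62, 96, 65), 82.5484, 59),
    ((73, 90, 124), 85.1824, 66),
    ((51, 114, 136), 83.6408, 65),
    ((70, 27, 136), 90.6232, 75),
    ((55, 17, 3), 95.8784, 107),
]

def dominates(a, b):
    """a dominates b if it covers at least as much area with a path no longer,
    and is strictly better in at least one of the two objectives."""
    return a[1] >= b[1] and a[2] <= b[2] and (a[1] > b[1] or a[2] < b[2])

pareto = [r for r in rows if not any(dominates(o, r) for o in rows if o is not r)]
for angles, pc, plen in pareto:
    print(f"{angles}: Pc = {pc:.4f}%, Plen = {plen}")
```

Running the check on these values keeps all eight rows, consistent with their placement on the Pareto frontier shown in Figure 7a.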