Article

Velocity Control of a Multi-Motion Mode Spherical Probe Robot Based on Reinforcement Learning

1 Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology, Beijing 100094, China
2 China Academy of Aerospace Science and Innovation, Beijing 102600, China
3 College of Engineering, Peking University, Beijing 100871, China
4 Beijing Institute of Control Engineering, Beijing 100190, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work and should be considered co-first authors.
Appl. Sci. 2023, 13(14), 8218; https://doi.org/10.3390/app13148218
Submission received: 11 June 2023 / Revised: 11 July 2023 / Accepted: 12 July 2023 / Published: 15 July 2023
(This article belongs to the Section Robotics and Automation)

Abstract

As deep space exploration tasks become increasingly complex, the mobility and adaptability of traditional wheeled or tracked probe robots with high functional density are constrained in harsh, dangerous, or unknown environments. A practical solution is to design a probe robot for preliminary exploration of unknown areas that is characterized by robust adaptability, a simple structure, light weight, and minimal volume. Compared with traditional deep space probe robots, a spherical robot with a geometrically symmetrical structure adapts better to complex ground environments. Considering the uncertain detection environment, the spherical robot should brake rapidly after jumping to avoid reentering obstacles. Moreover, since it is equipped with optical modules for deep space exploration missions, the spherical robot must maintain motion stability while rolling to ensure the quality of the photos and videos captured. However, due to the nonlinear coupling and parameter uncertainty of the spherical robot, tuning controller parameters is tedious, and the adaptability of controllers with fixed parameters is limited. This paper proposes an adaptive proportion–integration–differentiation (PID) control method based on reinforcement learning for a multi-motion mode spherical probe robot (MMSPR) capable of rolling and jumping. The method uses the soft actor–critic (SAC) algorithm to adjust the parameters of the PID controller and introduces a switching control strategy to reduce the static error. As the simulation results show, in terms of motion stability, the method enables the MMSPR to converge within 0.02 s; in terms of braking, it enables an MMSPR with a random initial speed to brake within a convergence time of 0.045 s and a displacement of 0.0013 m. Compared with a PID method with fixed parameters, the braking displacement of the MMSPR is reduced by about 38% and the convergence time by about 20%, showing better universality and adaptability.

1. Introduction

The exploration of extraterrestrial planets is widely regarded as a cutting-edge pursuit in contemporary science and technology, owing to its remarkable foresight and innovative potential. However, since these planets involve considerable uncertainty, it is dangerous for humans to explore them directly. Therefore, replacing humans with unmanned detection equipment (such as mobile robots and detection vehicles) is the main development trend in deep space exploration missions.
Due to their geometrically symmetrical structure, spherical robots have better balance and flexibility [1]. Thus, they are highly resistant to the rollover problems that wheeled or tracked robots suffer. Consequently, they have attracted many researchers and have been widely used in the detection of unknown environments [2], environmental monitoring [3,4], surveillance [5], and other fields [6]. Building on single-motion modes [7,8,9], researchers have gradually developed spherical robots with multi-motion modes, such as rolling and jumping, rolling and crawling, and others [10,11,12,13]. The multi-motion mode further expands the application of spherical robots in environmental detection tasks. During exploration missions, the spherical robot is commonly outfitted with specialized, internally contained optical functional modules designed to capture images and record video. However, owing to structural discrepancies in the MMSPR, its mass distribution is non-uniform. Any deviation in the MMSPR's movement during the rolling process could trigger excessive motion in its rotational direction, disrupting the robot's postural stability. This instability could severely impact the quality of photos and videos captured during exploration missions. In addition, such motion instability increases the force exerted on the optical modules and other associated equipment, leading to a reduced service life and higher maintenance costs. Therefore, maintaining the motion stability of the MMSPR during its rolling process is of great significance. In addition, the MMSPR can switch to jump mode to cross obstacles while performing exploration missions. In such a situation, it is crucial for the MMSPR to brake rapidly and adjust its posture after landing to allow for user intervention for the next course of action. This prevents the MMSPR from reentering obstacles, ensuring the efficiency and safety of the task. In conclusion, the successful implementation of the MMSPR in deep space exploration hinges on addressing two critical issues: motion stability and rapid braking.
The spherical robot is a strongly coupled, underactuated system with nonholonomic constraints [14], which poses considerable challenges to its control. Moreover, unmodeled dynamics and environmental measurement errors affect the closed-loop stability of the system [15]. Studies on the posture control and trajectory tracking of spherical robots are still relatively limited. The primary control algorithms used include the sliding mode algorithm [16,17,18], the PID algorithm [19], the fuzzy algorithm [20,21], and others [22,23]. Kayacan decoupled the spherical robot into two subsystems and designed a feedback controller with fuzzy logic to achieve linear motion [24]. The fuzzy controller enhanced the system's adaptability to nonlinear characteristics and uncertainties, resulting in superior control performance compared with traditional proportion–differentiation (PD) control. Kamis designed a fuzzy PD controller to address the trajectory tracking problem of a spherical robot [25]. The controller uses fuzzy logic to adjust the parameters of the PD controller, which effectively eliminates overshoot and improves the settling time, thus ensuring more accurate control of the robot's position. Considering the asymmetry of their spherical robot prototype, Zhang proposed an adaptive hierarchical sliding mode controller (AHSMC) [26]. This controller combines hierarchical sliding mode control (HSMC) with adaptive laws to address the spherical robot's balance and velocity control during motion.
The PID algorithm is extensively used in industrial applications, and its parameter tuning is critical to achieving effective control. However, it is challenging to obtain the desired control effect for nonlinear and strongly coupled systems with fixed controller parameters alone. Furthermore, the controller's performance may deteriorate if the system's parameters move outside the controller's design range [27], indicating poor adaptability and generalization. Reinforcement learning is a method for finding the optimal solution through the interaction between an agent and its environment [28]. Combining it with the traditional PID algorithm can improve the generalization and adaptability of a PID controller. Zheng proposed a PID controller based on reinforcement learning to enhance the trajectory tracking performance of a quadrotor [29]. By combining the proximal policy optimization (PPO) algorithm with the traditional PID controller, this method improves the response time, reduces the overshoot and control errors, enhances stability, and provides strong anti-interference capability compared with the traditional PID controller. Wang developed a Q-learning PID controller for trajectory tracking in mobile robots [30]. This controller adjusts the output of the PID controller with the output from Q-learning, achieving precise trajectory tracking. Guan introduced an adaptive PID controller based on radial basis function (RBF) networks for complex nonlinear systems [31]. The controller uses the RBF network to adaptively adjust the PID controller's parameters based on the system's current state errors, achieving stable control without modeling the system in advance. Park adjusted proportion–integration (PI) parameters online through reinforcement learning so that the vehicle could follow its target more accurately [32]. According to the above research, combining reinforcement learning with the traditional PID controller improves the performance of control systems, which is useful for practical engineering applications.
Due to the nonlinear coupling and parameter uncertainty in the dynamic model of the MMSPR, adjusting the controller parameters manually is complex. Furthermore, controllers with fixed parameters have constrained adaptability. Considering the requirements of deep space exploration tasks and the integrated optical functional modules, this article investigates a control method for the MMSPR. The aim is to address the challenges of the robot's motion stability during the rolling process and rapid braking after crossing obstacles. The feasibility and effectiveness of the proposed controller are substantiated through simulation.
The structure of this paper is organized as follows. Section 2 introduces the mechanical design and dynamic model of the spherical robot. Section 3 describes the design of the adaptive PID controller based on reinforcement learning. Section 4 shows and discusses the simulation results in motion stability and rapid braking situations. Finally, Section 5 provides conclusions and future research prospects.

2. Materials and Methods

2.1. Design of the Structure

The structure of the MMSPR discussed in this paper is shown in Figure 1. It comprises the spherical shell, the rolling drive module, and the jumping drive module. The shell consists of two hemispheres connected by flanges, forming a closed spherical environment. The rolling drive module consists of two racks and three motors, which drive the robot forward, backward, and in steering. The jumping drive module primarily consists of a mechanism box and two six-bar mechanisms, which enable the robot to jump. The specific structural design can be found in our previous work [11].

2.2. Dynamic Modeling

The dynamic model of MMSPR in this paper is based on the following assumptions:
  • During movement, the MMSPR rolls without slipping, and its mass is decomposed into the mass of the shell and that of the pendulum, $M_s$ and $m_p$;
  • The centroid of MMSPR coincides with its geometric center;
  • The two degrees of freedom do not mutually interfere, and the longitudinal axis does not rotate.
The schematic diagram of the MMSPR during rolling is shown in Figure 2a. Specifically, $O_A$, $O_B$, and $O_C$ denote the coordinate systems $O_A X_A Y_A Z_A$, $O_B X_B Y_B Z_B$, and $O_C X_C Y_C Z_C$, where $O_A$ is the world coordinate system. $O_B$ is fixed at the geometric center of the sphere and can only translate relative to $O_A$, while $O_C$ is also fixed at the geometric center of the sphere but can only rotate relative to $O_A$. $R$ and $L$ denote the sphere's radius and the pendulum's length, respectively. $\theta$ and $\phi$ are the rolling angles of the MMSPR about its $X_B$ and $Y_B$ axes, $\alpha$ and $\beta$ are the rotation angles of the pendulum around the $X_B$ and $Y_B$ axes, and $\tau_x$ and $\tau_y$ are the driving torques acting on the $X_B$ and $Y_B$ axes.
To facilitate the controller design, the MMSPR system needs to be decoupled. Following Kayacan [24], we decouple the robot into a forward motion and a steering motion (Figure 2b,c). According to the Lagrange equation, taking the state variable $x = [\theta \ \alpha \ \phi \ \beta]^T$, the dynamic model of the MMSPR can be formulated as:
$$M(x(t))\,\ddot{x}(t) + V(x(t),\dot{x}(t)) = u(t), \qquad
\begin{bmatrix} M_{11} & M_{12} & M_{13} & M_{14} \\ M_{21} & M_{22} & M_{23} & M_{24} \\ M_{31} & M_{32} & M_{33} & M_{34} \\ M_{41} & M_{42} & M_{43} & M_{44} \end{bmatrix}
\begin{bmatrix} \ddot{x}_1 \\ \ddot{x}_2 \\ \ddot{x}_3 \\ \ddot{x}_4 \end{bmatrix}
+ \begin{bmatrix} V_1 \\ V_2 \\ V_3 \\ V_4 \end{bmatrix}
= \begin{bmatrix} \tau_x \\ -\tau_x \\ \tau_y \\ -\tau_y \end{bmatrix}$$
where:
$$\begin{aligned}
& M_{11} = M_s R^2 + m_p R^2 + m_p L^2 + I_s + I_p + 2 m_p R L \cos(x_2 - x_1), \\
& M_{12} = -m_p L^2 - I_p - m_p R L \cos(x_2 - x_1), \quad M_{13} = M_{14} = 0, \\
& M_{21} = -m_p L^2 - I_p - m_p R L \cos(x_2 - x_1), \quad M_{22} = m_p L^2 + I_p, \\
& M_{23} = M_{24} = M_{31} = M_{32} = 0, \\
& M_{33} = M_s R^2 + m_p R^2 + m_p L^2 + I_s + I_p + 2 m_p R L \cos(x_4 - x_3), \\
& M_{34} = -m_p L^2 - I_p - m_p R L \cos(x_4 - x_3), \quad M_{41} = M_{42} = 0, \\
& M_{43} = -m_p L^2 - I_p - m_p R L \cos(x_4 - x_3), \quad M_{44} = m_p L^2 + I_p, \\
& V_1 = m_p R L \sin(x_2 - x_1)\,\dot{x}_1^2 + m_p R L \sin(x_2 - x_1)\,\dot{x}_2^2 - 2 m_p R L \sin(x_2 - x_1)\,\dot{x}_1 \dot{x}_2 - m_p g L \sin(x_2 - x_1), \\
& V_2 = m_p g L \sin(x_2 - x_1), \\
& V_3 = m_p R L \sin(x_4 - x_3)\,\dot{x}_3^2 + m_p R L \sin(x_4 - x_3)\,\dot{x}_4^2 - 2 m_p R L \sin(x_4 - x_3)\,\dot{x}_3 \dot{x}_4 - m_p g L \sin(x_4 - x_3), \\
& V_4 = m_p g L \sin(x_4 - x_3), \\
& I_s = \tfrac{2}{3} M_s R^2, \quad I_p = m_p L^2
\end{aligned}$$
where $I_s$ and $I_p$ are the moments of inertia of the sphere and the pendulum, respectively, and $\tau_x$ and $\tau_y$ are the input torques about the $X_B$ and $Y_B$ axes, respectively.
It is worth noting that the $X_B$ axis is perpendicular to the ground when $\beta$ equals $\pi/2$ or $-\pi/2$, which causes the robot to rotate in place. For this reason, the range of $\beta$ in the model is limited to $(-\pi/3, \pi/3)$.
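For reference, the decoupled dynamics above can be integrated numerically. The following sketch is a minimal illustration only, not the authors' simulation code: it assumes the parameter values listed in Table 2, the state ordering $x = [\theta \ \alpha \ \phi \ \beta]^T$, and the sign conventions written above, and it advances the state with a simple explicit Euler step.

```python
import numpy as np

# Physical parameters (values as listed in Table 2)
Ms, mp, R, L, g = 0.2, 0.6, 0.08, 0.023, 9.81
Is, Ip = (2.0 / 3.0) * Ms * R**2, mp * L**2

def mass_matrix(x):
    """Assemble M(x) for the state x = [theta, alpha, phi, beta]."""
    c12, c34 = np.cos(x[1] - x[0]), np.cos(x[3] - x[2])
    d = Ms * R**2 + mp * R**2 + mp * L**2 + Is + Ip
    M = np.zeros((4, 4))
    M[0, 0] = d + 2 * mp * R * L * c12
    M[0, 1] = M[1, 0] = -(mp * L**2 + Ip + mp * R * L * c12)
    M[1, 1] = mp * L**2 + Ip
    M[2, 2] = d + 2 * mp * R * L * c34
    M[2, 3] = M[3, 2] = -(mp * L**2 + Ip + mp * R * L * c34)
    M[3, 3] = mp * L**2 + Ip
    return M

def bias_vector(x, xd):
    """Assemble V(x, x_dot): centrifugal, Coriolis, and gravity terms."""
    s12, s34 = np.sin(x[1] - x[0]), np.sin(x[3] - x[2])
    V = np.zeros(4)
    V[0] = mp * R * L * s12 * (xd[0]**2 + xd[1]**2 - 2 * xd[0] * xd[1]) - mp * g * L * s12
    V[1] = mp * g * L * s12
    V[2] = mp * R * L * s34 * (xd[2]**2 + xd[3]**2 - 2 * xd[2] * xd[3]) - mp * g * L * s34
    V[3] = mp * g * L * s34
    return V

def step(x, xd, tau_x, tau_y, dt=0.005):
    """One explicit Euler step of M(x) x_ddot + V(x, x_dot) = u."""
    u = np.array([tau_x, -tau_x, tau_y, -tau_y])
    xdd = np.linalg.solve(mass_matrix(x), u - bias_vector(x, xd))
    return x + dt * xd, xd + dt * xdd
```

The 0.005 s step size matches the sampling time used in the simulations of Section 4.1.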

3. Controller Design

In this section, we will introduce the structure of the adaptive PID controller based on reinforcement learning.

3.1. Reinforcement Learning

Reinforcement learning is a learning process that does not rely on labeled data and follows a Markov decision process (MDP). Generally, a finite MDP has four components, $(S, A, p, R)$: $S$ represents the state space of the environment, $A$ denotes the action space, $p$ represents the unknown state-transition probability, and $R$ is the reward function. The specific process of reinforcement learning is shown in Figure 3.
The agent interacts with the environment at each discrete time step. At each step $t$, the agent receives the environment's state $s_t$, selects an action $a_t$ according to the current policy $\pi$, and then receives a reward $r_t$ from the environment to evaluate the quality of $a_t$, where $s_t \in S$, $a_t \in A$, $r_t \in R$. The decision-making process can be expressed as follows:
$$a_t = \pi(s_t)$$
The environment's state changes to $s_{t+1}$ after the agent performs the action $a_t$, with $p(s_{t+1} \mid s_t, a_t)$ as the transition probability. Reinforcement learning aims to maximize the agent's cumulative reward in order to find the optimal policy $\pi^*$.
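The interaction cycle described above can be summarized as a short loop. The sketch below is a generic illustration of the MDP rollout, where `env` and `policy` are hypothetical placeholder objects rather than components of the authors' implementation.

```python
def rollout(env, policy, steps=200):
    """Generic agent-environment interaction loop of an MDP."""
    s = env.reset()                     # initial state s_0
    trajectory = []
    for t in range(steps):
        a = policy(s)                   # a_t = pi(s_t)
        s_next, r, done = env.step(a)   # environment returns s_{t+1}, r_t, termination flag
        trajectory.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return trajectory
```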

3.2. SAC Algorithm

The SAC algorithm, proposed by Haarnoja [33], is characterized by introducing entropy regularization into the policy. The algorithm maximizes the sum of the expected reward and the policy entropy, which improves the exploration ability of the policy; thus, it can converge to the optimum more efficiently. The optimal policy $\pi^*$ can be expressed as follows:
$$\pi^* = \arg\max_\pi \; \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ \sum_t R(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$
SAC is a reinforcement learning algorithm based on the actor–critic structure. The actor is the policy function, represented by $\pi_\phi(a_t \mid s_t)$, which generates the actions the agent uses to interact with the environment. The policy update equation can be obtained by minimizing the KL divergence:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim D,\, a_t \sim \pi_\phi} \left[ \log \pi_\phi(a_t \mid s_t) - \frac{1}{\alpha} Q_\theta(s_t, a_t) + \log Z(s_t) \right]$$
The critic is the value function, represented by $Q_\theta(s_t, a_t)$, which evaluates the actions output by the actor and guides the updating of the policy. The TD error method is used to update the value function of the critic:
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim D,\, a_{t+1} \sim \pi_\phi} \left[ \frac{1}{2} \Big( Q_\theta(s_t, a_t) - \big( r(s_t, a_t) + \gamma \left( Q_\theta(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \right) \big) \Big)^2 \right]$$
As an off-policy algorithm, SAC also stores the experience samples $(s_t, a_t, s_{t+1}, R(s_t, a_t))$ collected by the agent in previous episodes in a replay buffer. During training, these samples are randomly sampled to update the neural networks, which improves the sample efficiency of reinforcement learning.
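The core of the two updates above can be sketched as follows. This is a simplified illustration under assumed interfaces (`actor(s)` returning an action and its log-probability, `critic(s, a)` returning a Q-value, and a separate target critic); a full SAC implementation typically adds twin critics, target-network soft updates, and automatic entropy tuning, which are omitted here.

```python
import torch

def soft_q_target(reward, next_state, actor, critic_target, gamma=0.99, alpha=0.2):
    """Soft TD target: r + gamma * (Q'(s', a') - alpha * log pi(a'|s'))."""
    with torch.no_grad():
        next_action, next_logp = actor(next_state)
        q_next = critic_target(next_state, next_action)
        return reward + gamma * (q_next - alpha * next_logp)

def critic_loss(critic, state, action, target):
    """TD-error loss J_Q(theta) = E[0.5 * (Q(s, a) - y)^2], averaged over the batch."""
    return 0.5 * ((critic(state, action) - target) ** 2).mean()

def actor_loss(actor, critic, state, alpha=0.2):
    """Policy loss: minimize alpha * log pi(a|s) - Q(s, a), the KL-derived objective up to scaling."""
    action, logp = actor(state)
    return (alpha * logp - critic(state, action)).mean()
```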

3.3. Adaptive PID Controller Based on Reinforcement Learning

The structure of the adaptive PID controller based on reinforcement learning in this paper is shown in Figure 4. The entire controller can be divided into two parts. One part is the PID controller, which directly controls the motor torques of the MMSPR and realizes the velocity adjustment. The other part is the reinforcement learning module, which outputs the PID parameters for adaptive parameter adjustment.
The PID controller used in this paper is a position-form PID controller, whose mathematical expression can be formulated as:
$$u_t = K_p e_t + K_i \sum_{n=0}^{t} e(n) + K_d (e_t - e_{t-1})$$
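A minimal discrete position-form PID of this kind is sketched below. It is an illustration rather than the authors' exact controller code; the `set_gains` method reflects the fact that, in the proposed architecture, the gains are overwritten online by the reinforcement learning agent.

```python
class PositionPID:
    """Discrete position-form PID: u_t = Kp*e_t + Ki*sum(e) + Kd*(e_t - e_{t-1})."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.err_sum = 0.0
        self.err_prev = 0.0

    def set_gains(self, kp, ki, kd):
        # Gains can be updated online, e.g. by the SAC agent described below.
        self.kp, self.ki, self.kd = kp, ki, kd

    def update(self, error):
        self.err_sum += error
        u = (self.kp * error
             + self.ki * self.err_sum
             + self.kd * (error - self.err_prev))
        self.err_prev = error
        return u
```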
For reinforcement learning, choosing an appropriate state space, action space, and reward function is essential. To solve the velocity control problem, we select the dynamic characteristics of the MMSPR as observations; specifically, the rotation angles and angular velocities of the MMSPR and the pendulum around the X and Y axes are included. In addition, to help the SAC algorithm establish a mapping between the MMSPR's velocity errors and the controller parameters more effectively, the velocity errors of the MMSPR in the X and Y directions, $e_x$ and $e_y$, are added to the state space. As a result, the state space of reinforcement learning is ten-dimensional. According to the algorithm architecture proposed in this paper, the actions of reinforcement learning are the parameters of the PID controller, as shown below:
$$a_t = \pi(s_t) = \left[ K_p^x, \ K_p^y, \ K_i^x, \ K_i^y, \ K_d^x, \ K_d^y \right]$$
To ensure the stability of the control system, the action ranges are set to $K_p^s \in [0, 3]$, $K_i^s \in [0, 1]$, $K_d^s \in [0, 0.5]$, $s \in \{x, y\}$. Furthermore, since the motor torques are limited in real-world applications, the outputs of the PID controller are limited to within 10 Nm, i.e., $\tau_x, \tau_y \in (-10\ \mathrm{Nm}, 10\ \mathrm{Nm})$.
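In code, the gain ranges and the torque limit can be enforced as in the following sketch. It assumes, for illustration, that the SAC policy outputs a tanh-squashed action in $[-1, 1]^6$ that is then scaled to the gain ranges; the exact action parameterization used by the authors is not specified in the text.

```python
import numpy as np

# Gain ranges from the text, ordered as [Kp_x, Kp_y, Ki_x, Ki_y, Kd_x, Kd_y]
GAIN_LOW = np.zeros(6)
GAIN_HIGH = np.array([3.0, 3.0, 1.0, 1.0, 0.5, 0.5])
TORQUE_LIMIT = 10.0  # Nm

def action_to_gains(raw_action):
    """Map a tanh-squashed action in [-1, 1]^6 to the PID gain ranges (assumed scaling)."""
    return GAIN_LOW + 0.5 * (np.asarray(raw_action) + 1.0) * (GAIN_HIGH - GAIN_LOW)

def saturate_torque(tau):
    """Clip a PID output to the +/-10 Nm motor torque limit."""
    return float(np.clip(tau, -TORQUE_LIMIT, TORQUE_LIMIT))
```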
An important part of reinforcement learning is designing an appropriate reward function, which determines whether the agent can converge to the optimal solution. In this paper, our purpose is to control the MMSPR's rolling velocities, that is, to reduce the errors between the actual and desired velocities of the MMSPR in the X and Y directions. The velocity errors $e_x$ and $e_y$ are formulated as:
$$e_x = \dot{x}_1 \times R - v_x^d, \qquad e_y = \dot{x}_3 \times R - v_y^d$$
where $\dot{x}_1$ and $\dot{x}_3$ are the rolling angular velocities of the MMSPR in the X and Y directions, respectively, and $v_x^d$ and $v_y^d$ are the corresponding desired velocities.
Therefore, the smaller $e_x$ and $e_y$ are, the larger the reward of the agent should be. To obtain smaller static errors when the system is stable, $e_x$ and $e_y$ in the reward function are multiplied by a scaling factor. When the velocity errors are still large, the reward should change sharply to expedite convergence; when the errors are small, the reward should be stable so that the agent keeps outputting actions that maintain small errors. Therefore, tanh is used to make the 2-norm of the velocity error nonlinear, and the reward function is defined as:
$$reward_1 = 1 - \tanh\left(10 \times \lVert e_x \rVert\right), \quad reward_2 = 1 - \tanh\left(10 \times \lVert e_y \rVert\right), \quad reward = 0.5 \times reward_1 + 0.5 \times reward_2$$
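Assembled in code, the error and reward computation could look like the sketch below (a minimal illustration; since the errors are scalars, the 2-norm reduces to the absolute value, and the factor of 10 is the scaling from the formula above).

```python
import math

def velocity_errors(theta_dot, phi_dot, R=0.08, vx_d=0.0, vy_d=0.0):
    """e_x and e_y: differences between the actual rolling speeds and the desired speeds."""
    return theta_dot * R - vx_d, phi_dot * R - vy_d

def reward(e_x, e_y):
    """Saturating reward: approaches 1 as both velocity errors approach zero."""
    r1 = 1.0 - math.tanh(10.0 * abs(e_x))
    r2 = 1.0 - math.tanh(10.0 * abs(e_y))
    return 0.5 * r1 + 0.5 * r2
```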
Due to the use of the tanh function, the velocity errors are difficult to reduce further once they fall below a certain level. Switching control is therefore introduced into our controller to obtain smaller errors. After the errors fall below the threshold value $\alpha$, the controller switches from the variable-parameter PID controller adjusted by reinforcement learning to a PID controller with fixed parameters.
The threshold value $\alpha$ is set to 0.1, which means that the variable-parameter PID controller works while $e_x$ or $e_y$ is greater than 0.1; once both $e_x$ and $e_y$ are less than 0.1, the controller switches to the PID controller with fixed parameters. The logic of the switching control is shown in Figure 5.
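The switching rule in Figure 5 can be expressed compactly as below. This is a sketch; `fixed_gains` stands for the manually tuned gains used after switching, whose values are not given in the text.

```python
SWITCH_THRESHOLD = 0.1  # the threshold alpha in the text

def select_gains(e_x, e_y, rl_gains, fixed_gains):
    """Use RL-adjusted gains while errors are large; switch to fixed gains near the setpoint."""
    if abs(e_x) > SWITCH_THRESHOLD or abs(e_y) > SWITCH_THRESHOLD:
        return rl_gains      # variable-parameter PID driven by the SAC agent
    return fixed_gains       # fixed-parameter PID to remove the residual static error
```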
The MMSPR is considered to have achieved the desired velocities successfully when both $e_x$ and $e_y$ are less than 0.0001 and the reward value is larger than 0.999.
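This success criterion translates directly into a simple check, sketched here for completeness.

```python
def is_success(e_x, e_y, reward_value):
    """Success criterion: both velocity errors below 1e-4 and the reward above 0.999."""
    return abs(e_x) < 1e-4 and abs(e_y) < 1e-4 and reward_value > 0.999
```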

4. Simulations

4.1. Simulation Environment

In this paper, the simulation environment is built in PyCharm 2022 under Windows 10; the CPU is an AMD 5600X with 16 GB of RAM. The reinforcement learning algorithm is implemented with PyTorch, and its parameters are listed in Table 1.
Since the maximum velocity of the MMSPR prototype is 0.24 m/s, the initial velocity of the MMSPR is set to a random value in the range [0, 0.24] m/s in the simulation environment, and the desired velocity $v^d$ is set to 0 m/s. We aim to train a controller that enables the MMSPR to brake rapidly from different initial postures and velocities. In the simulation, the sampling time is 0.005 s and the total simulation time is 1 s, i.e., 200 timesteps. The related parameters of the MMSPR are listed in Table 2.
Considering the uncertain interference caused by an inaccurate dynamic model, sensor noise, and environmental noise, a 5% model error and a 2% measurement error were added to the MMSPR's mass and to the observed state values in both the training and testing environments. Additionally, pulse interferences with values of 0.08, 0.04, 0.08, and 0.04 were added to $\dot{x}_1$, $\dot{x}_2$, $\dot{x}_3$, and $\dot{x}_4$ at 0.1 s to simulate collision scenarios.
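The uncertainty injection described above can be sketched as follows. This is a simplified illustration of the stated 5% model error, 2% measurement error, and pulse disturbances at 0.1 s; the authors' exact noise model (e.g., whether errors are resampled at every step) may differ.

```python
import numpy as np

rng = np.random.default_rng()

def perturbed_mass(nominal_mass, model_error=0.05):
    """Apply up to +/-5% model error to a nominal mass parameter."""
    return nominal_mass * (1.0 + rng.uniform(-model_error, model_error))

def measure(state, noise=0.02):
    """Return the observed state with up to +/-2% measurement error per component."""
    state = np.asarray(state, dtype=float)
    return state * (1.0 + rng.uniform(-noise, noise, size=state.shape))

def pulse_disturbance(angular_velocities, t, t_pulse=0.1, dt=0.005):
    """Add the pulse interference (0.08, 0.04, 0.08, 0.04) to x1_dot..x4_dot at 0.1 s."""
    if abs(t - t_pulse) < 0.5 * dt:
        return np.asarray(angular_velocities) + np.array([0.08, 0.04, 0.08, 0.04])
    return np.asarray(angular_velocities)
```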

4.2. Simulation Results and Analysis

We argue that our algorithm’s performance improvement depends on three pivotal factors: the hierarchical control structure, the switching control strategy, and adjustable controller parameters. To substantiate the significance of these factors in the design method, this section conducts a comparative analysis across two scenarios: the spherical robot’s motion stability and rapid braking. We provide the abbreviation of each method: the adaptive PID control method based on reinforcement learning with switching control (ARLPID), fixed-parameter PID control method (PID), pure reinforcement learning control method that directly outputs motor torque (Torque_RL), and adaptive PID control method based on reinforcement learning without switching control (RLPID).
This section first presents the average reward curves during training for ARLPID, RLPID, and Torque_RL (Figure 6). Since the concept of an average reward applies only to the reinforcement learning methods, PID is not included in this comparison.
The average reward of ARLPID is higher than that of RLPID, and its reward fluctuates less after the algorithm converges. In contrast, the average reward curve of Torque_RL fluctuates sharply and finally settles at a lower value.
The success rates of the different algorithms in the test environment are presented in Table 3. As shown in the table, both ARLPID and PID realize the braking of the MMSPR with a 100% success rate under random initial velocities and various interferences, whereas RLPID achieves only a 91% success rate. The success rate of Torque_RL is 0%, which shows that it cannot solve the velocity control problem of the MMSPR after training.

4.2.1. Motion Stability

To verify the effectiveness of our method in maintaining the robot's motion stability, in this section we simulate a scenario in which the MMSPR moves in a straight line on a flat surface. We apply disturbances in the turning direction to simulate posture disturbances resulting from structural errors, non-uniform mass distribution, and collisions.
The robot's desired speed in the turning direction is set to $v_y^d = 0$ m/s. The simulation duration is 5 s, and the results are shown in Figure 7 and Figure 8.
Figure 7 shows the speed changes in the turning direction of the MMSPR under the different control methods. By adjusting the controller parameters, ARLPID reduces the overshoot by approximately 93% compared with PID with fixed parameters while requiring a shorter convergence time: it takes only 0.02 s to converge, compared with 0.03 s for PID. RLPID converges fastest, requiring just 0.01 s, and its overshoot is also lower than that of PID. However, a small fluctuation of 0.06% remains in the robot's speed after convergence, so it does not stabilize completely. This may be attributable to the tanh function used in the reward design: the reward saturates and stabilizes when the error is small, so it cannot continue to drive optimization. This also results in the lower success rate of RLPID in the testing environment. In comparison, ARLPID introduces a switching control strategy that eliminates the static error and thus avoids such speed fluctuations.
Figure 8 displays the robot's trajectory during the simulation. Under PID control, the maximum offset displacement of the robot reaches 0.00048 m. Under RLPID control, the trajectory fluctuates significantly and cannot remain stable, with a maximum offset displacement of 0.00052 m. In contrast, thanks to the variable parameters and the switching control strategy, the trajectory under ARLPID control changes smoothly, with a maximum offset displacement of merely 0.00041 m, the smallest of the three methods.
The simulation results show that the ARLPID method, which introduces variable controller parameters and the switching control strategy, significantly enhances control performance in the problem of the spherical robot’s motion stability. This improvement is especially notable compared to the traditional PID with fixed parameters and the RLPID method, which lacks the switching control strategy.

4.2.2. Rapid Braking

In this section, we simulate the task of the MMSPR braking rapidly after landing. To describe the robot's post-landing state in more detail, we simulate the robot's different postures and different velocities after landing separately.

Postures

Three distinct initial states of the robot are set to fully cover its possible random postures after landing. In these states, the robot's heading is consistent with the forward movement direction (north), but its acceleration points in different directions: northwest, north, and northeast (as shown in Figure 9). These three acceleration directions represent the different directions in which the robot may deviate after landing, providing a more comprehensive scenario for algorithm validation.
In the first scenario, the related parameters are set as $v_{x1} = 0.24$ m/s, $v_{y1} = 0$ m/s, $\alpha_1 = 0.1$ rad, $\beta_1 = 0.2$ rad, and the remaining parameters are zero. In the second scenario, $v_{x2} = 0.24$ m/s, $v_{y2} = 0.24$ m/s, $\alpha_2 = 0.1$ rad, $\beta_2 = 0.2$ rad, and the remaining parameters are zero. In the third scenario, $v_{x3} = 0$ m/s, $v_{y3} = 0.24$ m/s, $\alpha_3 = 0.1$ rad, $\beta_3 = 0.2$ rad, and the remaining parameters are zero. The desired speeds of the MMSPR are set as $v_x^d = v_y^d = 0$ m/s. The results of each control method under the different states are shown in Figure 10 and Figure 11.
To compare the effects of the various control methods more intuitively, this section mainly analyzes the simulation results of the first scenario. As seen in Figure 10a, there is substantial overshoot during convergence with PID. The convergence times of $v_x$ and $v_y$ are 0.04 s and 0.01 s, respectively, slower than the ARLPID and RLPID methods that use variable parameters. With RLPID, the speed error converges more rapidly, requiring only 0.01 s for both $v_x$ and $v_y$; however, the robot's speed fluctuates significantly when disturbances occur. Under sliding mode control (SMC), the robot's speed changes relatively smoothly, but it has the longest convergence time, with $v_x$ and $v_y$ taking 1.04 s and 0.535 s, respectively, and a periodic speed fluctuation appears during convergence. Compared with the above methods, the overshoot under ARLPID is reduced by about 81% relative to PID, and the speed fluctuation observed in RLPID is eliminated. ARLPID allows $v_x$ and $v_y$ to converge within 0.03 s and 0.01 s, respectively, and shows strong resistance to disturbances. The convergence times for the other scenarios are given in Table 4.
Figure 11 shows the braking trajectories of the different control methods in the different scenarios. In the first scenario, the braking displacement under PID is the longest, at 0.0021 m, compared with RLPID and ARLPID, both of which employ variable parameters. The braking displacement under RLPID is 0.0013 m; however, the robot's speed is not completely stable under external disturbances, resulting in substantial fluctuation in its trajectory. The trajectory under ARLPID changes relatively smoothly, with a braking displacement of 0.0013 m, the smallest of the three methods. In the second scenario, the braking displacements of the three methods are 0.0019 m, 0.0014 m, and 0.0013 m, respectively; in the third scenario, they are 0.0019 m, 0.0015 m, and 0.0013 m.

Velocities

The results for the MMSPR with different initial velocities under the different controllers are shown in Figure 12. As Torque_RL cannot solve the control problem, only the results of the other three algorithms are displayed. As shown in Figure 12, although the velocity errors eventually converge with PID, there is obvious jitter during the error reduction, and the convergence time is longer than that of ARLPID. RLPID has the shortest convergence time, but small error fluctuations appear in the subsequent timesteps, and the system cannot become completely stable. The reason may again be the tanh function in the reward: when the error is small, the reward tends to saturate and therefore cannot continue to guide the policy update.
With the ARLPID method proposed in this paper, the velocity errors converge rapidly and remain stable, outperforming the other controllers. The error convergence time of each algorithm is listed in Table 5.
In summary, ARLPID performs best among the three control methods in maintaining motion stability and achieving rapid braking. It has a fast convergence speed, a small displacement offset, and stable control performance. Relying on its high adaptability and robustness, this method can solve the problems of motion stability during rolling and of rapid braking after jumping over obstacles.

5. Conclusions

This paper presents an adaptive PID controller based on reinforcement learning to address the motion stability of the MMSPR during rolling and its rapid braking after jumping over obstacles. The controller adopts the SAC algorithm to adaptively tune the parameters of the PID controller and introduces a switching control strategy to reduce the system's static errors.
The adaptive PID control method based on reinforcement learning proposed in this paper outperforms the baseline algorithms in optimality and robustness. It ensures the motion stability of the MMSPR during rolling and enables rapid braking after crossing obstacles, preventing the robot from reentering them. Regarding motion stability, ARLPID enables the robot to converge within 0.02 s; furthermore, it significantly reduces the overshoot observed with PID and eliminates the static error of RLPID. For random landing postures, ARLPID brakes the robot within 0.0013 m, which is about 38% better than PID, while maintaining strong robustness and eliminating the oscillation seen with RLPID. For random landing speeds, the simulation results show that ARLPID enables the MMSPR to brake rapidly when its initial velocity after landing ranges from 0 to 0.24 m/s, attaining a 100% success rate and a braking time of less than 0.045 s, which is 20% faster than PID. In addition, ARLPID maintains better stability than RLPID under the external interference and collisions that might occur in practical applications.
In future work, we will further study the performance of this control method on the prototype of MMSPR. Furthermore, we will modify and optimize our control algorithm according to the experimental results.

Author Contributions

Conceptualization, W.M. and B.L.; data curation, C.C. and S.P.; methodology, W.M. and Y.C.; software, W.M. and Y.C.; supervision, B.L. and P.W.; writing—original draft, W.M., B.L. and M.L.; writing—review and editing, W.M., B.L., Y.C. and P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by Technology 173 Program Technical Field Fund (2019-JCJQ-JJ-459).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to confidentiality regulations of the laboratory.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sagsoz, I.H.; Eray, T. Design and Kinematics of Mechanically Coupled Two Identical Spherical Robots. J. Intell. Robot. Syst. 2023, 108, 12. [Google Scholar] [CrossRef]
  2. Li, M.; Sun, H.; Ma, L.; Gao, P.; Huo, D.; Wang, Z.; Sun, P. Special spherical mobile robot for planetary surface exploration: A review. Int. J. Adv. Robot. Syst. 2023, 20. [Google Scholar] [CrossRef]
  3. Chi, X.; Zhan, Q. Design and modelling of an amphibious spherical robot attached with assistant fins. Appl. Sci. 2021, 11, 3739. [Google Scholar] [CrossRef]
  4. Shi, L.; Zhang, Z.; Li, Z.; Guo, S.; Pan, S.; Bao, P.; Duan, L. Design, Implementation and Control of an Amphibious Spherical Robot. J. Bionic Eng. 2022, 19, 1736–1757. [Google Scholar] [CrossRef]
  5. Rangapur, I.; Prasad, B.K.S.; Suresh, R. Design and Development of Spherical Spy Robot for Surveillance Operation. Procedia Comput. Sci. 2020, 171, 1212–1220. [Google Scholar] [CrossRef]
  6. Michaud, F.O.; Caron, S. Roball, the Rolling Robot. Auton. Robot. 2002, 12, 211–222. [Google Scholar] [CrossRef]
  7. Azizi, M.R.; Naderi, D. Dynamic modeling and trajectory planning for a mobile spherical robot with a 3Dof inner mechanism. Mech. Mach. Theory 2013, 64, 251–261. [Google Scholar] [CrossRef]
  8. Borisov, A.V.; Kilin, A.A.; Mamaev, I.S. An omni-wheel vehicle on a plane and a sphere. Nelin. Dinam. 2011, 7, 785–801. [Google Scholar] [CrossRef] [Green Version]
  9. Moazami, S.; Palanki, S.; Zargarzadeh, H. Design, Modeling, and Control of Norma: A Slider & Pendulum-Driven Spherical Robot. arXiv 2020, arXiv:1908.02243. [Google Scholar]
  10. Wu, H.; Li, B.; Wang, F.; Luo, B.; Jiao, Z.; Yu, Y.; Wang, P. Design and Analysis of the Rolling and Jumping Compound Motion Robot. Appl. Sci. 2021, 11, 10667. [Google Scholar] [CrossRef]
  11. Wang, F.; Li, C.; Niu, S.; Wang, P.; Wu, H.; Li, B. Design and Analysis of a Spherical Robot with Rolling and Jumping Modes for Deep Space Exploration. Machines 2022, 10, 126. [Google Scholar] [CrossRef]
  12. Hu, Y.; Wei, Y.; Liu, M. Design and performance evaluation of a spherical robot assisted by high-speed rotating flywheels for self-stabilization and obstacle surmounting. J. Mech. Robot. 2021, 13, 061001. [Google Scholar] [CrossRef]
  13. Chang, W.-J.; Chang, C.-L.; Ho, J.-H.; Lin, P.-C. Design and implementation of a novel spherical robot with rolling and leaping capability. Mech. Mach. Theory 2022, 171, 104747. [Google Scholar] [CrossRef]
  14. Chen, S.-B.; Beigi, A.; Yousefpour, A.; Rajaee, F.; Jahanshahi, H.; Bekiros, S.; Martinez, R.A.; Chu, Y. Recurrent Neural Network-Based Robust Nonsingular Sliding Mode Control With Input Saturation for a Non-Holonomic Spherical Robot. IEEE Access 2020, 8, 188441–188453. [Google Scholar] [CrossRef]
  15. Fortuna, L.; Frasca, M.; Buscarino, A. Optimal and Robust Control: Advanced Topics with MATLAB®; CRC Press: Boca Raton, FL, USA, 2021. [Google Scholar]
  16. Zhou, T.; Xu, Y.-G.; Wu, B. Smooth Fractional Order Sliding Mode Controller for Spherical Robots with Input Saturation. Appl. Sci. 2020, 10, 2117. [Google Scholar] [CrossRef] [Green Version]
  17. Ma, W.; Chang, C.; Cao, Y.; Wang, F.; Wang, P.; Li, B. Attitude control of multi-motion mode spherical probe robots based on decoupled dynamics. In Proceedings of the Advances in Guidance, Navigation and Control; Yan, L., Duan, H., Deng, Y., Eds.; Springer Nature: Singapore, 2023; pp. 5850–5861. [Google Scholar]
  18. Ma, L.; Sun, H.; Song, J. Fractional-order adaptive integral hierarchical sliding mode control method for high-speed linear motion of spherical robot. IEEE Access 2020, 8, 66243–66256. [Google Scholar] [CrossRef]
  19. Shi, L.; Hu, Y.; Su, S.; Guo, S.; Xing, H.; Hou, X.; Liu, Y.; Chen, Z.; Li, Z.; Xia, D. A Fuzzy PID Algorithm for a Novel Miniature Spherical Robots with Three-dimensional Underwater Motion Control. J. Bionic Eng. 2020, 17, 959–969. [Google Scholar] [CrossRef]
  20. Guo, J.; Li, C.; Guo, S. A novel step optimal path planning algorithm for the spherical mobile robot based on fuzzy control. IEEE Access 2020, 8, 1394–1405. [Google Scholar] [CrossRef]
  21. Guo, J.; Li, C.; Guo, S. Path optimization method for the spherical underwater robot in unknown environment. J. Bionic Eng. 2020, 17, 944–958. [Google Scholar] [CrossRef]
  22. Liu, Y.; Wang, Y.; Guan, X.; Wang, Y.; Jin, S.; Hu, T.; Ren, W.; Hao, J.; Zhang, J.; Li, G. Multi-terrain velocity control of the spherical robot by online obtaining the uncertainties in the dynamics. IEEE Robot. Autom. Lett. 2022, 7, 2732–2739. [Google Scholar] [CrossRef]
  23. Liu, Y.; Wang, Y.; Guan, X.; Hu, T.; Zhang, Z.; Jin, S.; Wang, Y.; Hao, J.; Li, G. Direction and trajectory tracking control for nonholonomic spherical robot by combining sliding mode controller and model prediction controller. IEEE Robot. Autom. Lett. 2022, 7, 11617–11624. [Google Scholar] [CrossRef]
  24. Kayacan, E.; Bayraktaroglu, Z.Y.; Saeys, W. Modeling and control of a spherical rolling robot: A decoupled dynamics approach. Robotica 2012, 30, 671–680. [Google Scholar] [CrossRef] [Green Version]
  25. Kamis, N.N.; Embong, A.H.; Ahmad, S. Modelling and Simulation Analysis of Rolling Motion of Spherical Robot. IOP Conf. Ser. Mater. Sci. Eng. 2017, 260, 012014. [Google Scholar] [CrossRef] [Green Version]
  26. Zhang, L.; Ren, X.; Guo, Q. Balance and velocity control of a novel spherical robot with structural asymmetry. Int. J. Syst. Sci. 2021, 52, 3556–3568. [Google Scholar] [CrossRef]
  27. Qin, Y.; Zhang, W.; Shi, J.; Liu, J. Improve PID controller through reinforcement learning. In Proceedings of the 2018 IEEE CSAA Guidance, Navigation and Control Conference (CGNCC), Xiamen, China, 10–12 August 2018; pp. 1–6. [Google Scholar]
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  29. Zheng, Q.; Tang, R.; Gou, S.; Zhang, W. A PID Gain Adjustment Scheme Based on Reinforcement Learning Algorithm for a Quadrotor. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 6756–6761. [Google Scholar]
  30. Wang, S.; Yin, X.; Li, P.; Zhang, M.; Wang, X. Trajectory Tracking Control for Mobile Robots Using Reinforcement Learning and PID. Iran. J. Sci. Technol. Trans. Electr. Eng. 2020, 44, 1059–1068. [Google Scholar] [CrossRef]
  31. Guan, Z.; Yamamoto, T. Design of a Reinforcement Learning PID Controller. IEEJ Trans. Elec. Electron. Eng. 2021, 16, 1354–1360. [Google Scholar] [CrossRef]
  32. Park, J.; Kim, H.; Hwang, K.; Lim, S. Deep Reinforcement Learning Based Dynamic Proportional-Integral (PI) Gain Auto-Tuning Method for a Robot Driver System. IEEE Access 2022, 10, 31043–31057. [Google Scholar] [CrossRef]
  33. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
Figure 1. The structure schematic of the MMSPR: (a) structural diagram of MMSPR; (b) simplified model of MMSPR.
Figure 2. Establishment of spherical robot coordinate system: (a) spatial coordinate system of spherical robot; (b) projection of coordinate system on Y A Z A plane; (c) projection of coordinate system on X A Z A plane.
Figure 3. The specific process of reinforcement learning.
Figure 4. The structure of the designed controller. The red part represents the reinforcement learning module, which adjusts the parameters of the PID controller. The blue part represents the controller module, which outputs the motor torques.
Figure 5. The logic of switching control.
Figure 6. The mean reward curve in the training process of different algorithms.
Figure 7. Control effects of different control methods.
Figure 8. Trajectory of different control methods.
Figure 9. Three initial scenarios of MMSPR after landing: (a) the first scenario; (b) the second scenario; (c) the third scenario.
Figure 10. Velocity variation of MMSPR under different scenarios: (a) the change in velocity in the first scenario (with external interference); (b) the change in velocity in the second scenario (with external interference); (c) the change in velocity in the third scenario (with external interference).
Figure 11. Brake trajectory of MMSPR under different control methods: (a) the change in trajectory in the first scenario; (b) the change in trajectory in the second scenario; (c) the change in trajectory in the third scenario.
Figure 12. The results of different controllers: (a) initial speed is 0.08 m/s; (b) initial speed is 0.16 m/s; (c) initial speed is 0.24 m/s.
Table 1. Parameters of SAC algorithm.
Parameters | Value
Learning rate of actor | 0.001
Actor network | 128 × 128
Learning rate of critic | 0.0005
Critic network | 128 × 128
Discount ($\gamma$) | 0.99
Batch size | 128
Max_epoch | 1000
Optimizer | Adam
Length of an episode | 200 steps
Soft target update ($\tau$) | 0.005
Replay_buffer_size | 1,000,000
Table 2. Parameters of the MMSPR.
Parameters | Value
M, the mass of the shell | 0.2 kg
m, the mass of the pendulum | 0.6 kg
R, the radius of the sphere | 0.08 m
L, the length of the swing arm | 0.023 m
g, the acceleration of gravity | 9.81 m/s²
Table 3. Test success rate of different algorithms.
Algorithms | Success Rate
ARLPID (ours) | 100%
Torque_RL | 0%
RLPID | 91%
Table 4. The braking time of robot speed.
Scenarios | Velocity | PID | RLPID | ARLPID
The first scenario | $v_x$ | 0.04 s | 0.01 s | 0.03 s
The first scenario | $v_y$ | 0.01 s | 0.01 s | 0.01 s
The second scenario | $v_x$ | 0.035 s | 0.01 s | 0.025 s
The second scenario | $v_y$ | 0.00 s | 0.00 s | 0.00 s
The third scenario | $v_x$ | 0.035 s | 0.01 s | 0.025 s
The third scenario | $v_y$ | 0.01 s | 0.08 s | 0.01 s
Table 5. The convergence time of each algorithm.
Initial Velocity | Error | PID | RLPID | ARLPID
0.08 m/s | $e_x$ | 0.065 s | 0.065 s | 0.045 s
0.08 m/s | $e_y$ | 0.065 s | 0.025 s | 0.045 s
0.16 m/s | $e_x$ | 0.06 s | 0.02 s | 0.04 s
0.16 m/s | $e_y$ | 0.06 s | 0.03 s | 0.04 s
0.24 m/s | $e_x$ | 0.055 s | 0.03 s | 0.045 s
0.24 m/s | $e_y$ | 0.055 s | 0.035 s | 0.045 s
