1. Introduction
The deepening evolution of the Industry 4.0 era is progressively reshaping the face of industrial production, with a pronounced impact in areas like building control [1] and the textile industry [2]. Textile processes, in particular, stand at the forefront of this change. In the realm of intelligent computerized flat knitting, robotic arms bear about 30% of the operational load, undertaking tasks such as yarn fetching, feeding, and knitting. This shift necessitates innovative approaches to robotic arm path planning [3]. The complexity inherent in these tasks not only raises the difficulty of planning and execution but also increases the likelihood of errors, thereby adversely affecting both efficiency and overall production [3]. Traditional path-planning methodologies, which are heavily dependent on offline programming and online instruction, prove inadequate for complex knitting operations, demanding significant labor and time investments [4,5,6]. Activities like yarn fetching and feeding require multiple adjustments in arm posture, each mandating a reevaluation and replanning of the path. Consequently, the pursuit of efficient, accurate, and safe path planning for knitting robotic arms has emerged as a vital area of research within the sphere of automated and intelligent manufacturing.
In industrial settings, the trajectory planning of robotic arms typically employs forward and inverse kinematic techniques [7]. Forward kinematics involves manipulating the end effector, or tool center point (TCP), to execute specific tasks by modulating joint angles, velocities, and accelerations. In contrast, inverse kinematics centers on the TCP's path planning, translating the planned trajectory from the path space to the operational space of the robotic arm's joints through inverse kinematic processes [8]. This paper focuses on exploring model-free path-planning strategies for knitting robotic arms, underpinned by experiential knowledge in inverse kinematics.
Deep reinforcement learning (DRL) distinguishes itself from traditional reinforcement learning methods [9] by adeptly navigating high-dimensional, continuous state and action spaces, thereby exhibiting enhanced representational and generalization skills. This capability enables DRL to identify latent patterns and structures in intricate tasks, facilitating the learning of highly non-linear and complexly associated decisions in unfamiliar environments, as demonstrated in the soft robotics sector [10]. DRL primarily operates through two paradigms: value-function-based and policy-gradient-based reinforcement learning algorithms [11]. The former concentrates on learning the value functions of states or state–action pairs and evaluating action or state values under the prevailing policy; because continuous action spaces admit infinitely many possible action values, such methods struggle with continuous-domain problems. Q-Learning and the Deep Q-Network (DQN) are notable examples. Conversely, policy-gradient-based methods, which aim to learn policies directly, are better suited to continuous domains. They produce a probability distribution over the possible actions in a given state, refining the policy through parameter optimization to increase the probabilities of actions yielding higher rewards. Representative algorithms include TRPO [12], DDPG [13], A2C [14], HER [15], PPO [16], SAC [17], and TD3 [18].
These algorithms have all shown remarkable efficacy in path planning for target points. Nevertheless, the non-uniqueness and multiplicity inherent in inverse kinematics can lead to various solutions in the robotic arm joints' operational space [19], creating "multiple solution conflicts". This issue may result in the same processing method being applied to disparate joint states, leading to impractical motion outcomes and compromised stability. Moreover, an excessive emphasis on the TCP point could trap algorithms in local optima, potentially causing unsafe movements, increased mechanical wear, and higher energy consumption. This scenario underscores a lack of robustness and an inability to realize optimal global outcomes [20].
In tackling the challenges posed by the multiplicity of inverse kinematics solutions and the tendency towards local optima due to an excessive focus on the TCP point, this study innovatively introduces Environment Augmentation (EA) and Multi-Information Entropy Geometric Optimization (MIEGO) methodologies within the realm of DRL algorithms. The EA strategy, in particular, enriches an environmental state by incorporating the angles of each robotic arm joint and treats the stability of solutions as an additional reward factor. This technique effectively guides agents towards exploring paths with enhanced stability.
To circumvent the "local optima trap" associated with an overemphasis on the TCP point, we propose the MIEGO approach. This method couples the TCP optimization problem with broader global optimization goals, creating a comprehensive multi-objective optimization framework. By shifting the focus from a narrow TCP-centric view to a wider multi-objective perspective, the likelihood of falling into local optima is substantially reduced. In optimizing these multi-objective elements, the study leverages information-geometric techniques to regulate multi-information entropy, thereby not only advancing the efficacy of DRL algorithms but also improving their convergence behavior.
Subsequently, the paper delineates the principal work, offering an overview of existing policy-gradient-based reinforcement learning algorithms, followed by an exhaustive delineation of both the EA strategy and MIEGO method. Empirical assessments are conducted to thoroughly validate each algorithm. Moreover, the simultaneous implementation of these enhancements in the DDPG and SAC algorithms reveals that their integration effectively harmonizes aspects like path length, motion safety, and equipment longevity.
3. Reinforcement Learning Algorithms
This section delves into existing policy-gradient-based reinforcement learning algorithms, which were selected as foundational algorithms for the textile robotic arms discussed in this study. These algorithms are explored and experimented upon in conjunction with the strategies introduced in this research.
3.1. Trust Region Methods
Trust Region Policy Optimization (TRPO), introduced by Schulman et al. in 2015 [12], represents a sophisticated method in the domain of reinforcement learning, designed to optimize control policies with a guarantee of monotonic advancements. TRPO's primary principle involves identifying a trajectory that enhances policy efficacy while ensuring minimal deviations in policy alterations. This approach is exceptionally effective for the optimization of extensive, non-linear policy frameworks.
Within the TRPO framework, a policy $\pi_\theta$ is conceptualized as a probability distribution governing the selection of an action $a$ in a given state $s$, expressed as $\pi_\theta(a \mid s)$. The overarching aim of this policy is the maximization of the expected cumulative rewards. TRPO primarily focuses on optimizing the following objective function:

$$\max_{\theta}\; \mathbb{E}_{s,\,a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]$$

Here, $\theta$ and $\theta_{\text{old}}$ represent the policy parameters of the current and previous iterations, respectively, and $A^{\pi_{\theta_{\text{old}}}}(s, a)$ is the advantage function for state $s$ and action $a$ under policy $\pi_{\theta_{\text{old}}}$.
In addition, TRPO employs a Kullback–Leibler (KL) divergence constraint to ensure that the magnitude of policy updates remains moderate [12]. The KL divergence quantifies the difference between two probability distributions; in this context, it measures how much the policy changes between updates. Specifically, TRPO controls the magnitude of policy updates through the following KL divergence constraint:

$$\mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$

Here, $D_{\mathrm{KL}}$ denotes the Kullback–Leibler divergence, $\pi_{\theta}(\cdot \mid s)$ denotes the probability distribution over actions obtained in state $s$ under the policy with parameters $\theta$, and $\delta$ is a predefined threshold. By employing this approach, TRPO can effectively control the magnitude of changes during policy updates, preventing excessively drastic alterations. This ensures the stability and reliability of the learning process.
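To make the trust-region constraint concrete, the following minimal Python sketch computes the KL divergence between two diagonal-Gaussian action distributions, as would arise for a continuous-control policy, and checks it against a threshold. The function name, the example means and standard deviations, and the value of the threshold are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def kl_diag_gaussians(mu_old, std_old, mu_new, std_new):
    """KL(old || new) between two diagonal-Gaussian action distributions."""
    var_old, var_new = std_old ** 2, std_new ** 2
    return np.sum(
        np.log(std_new / std_old)
        + (var_old + (mu_old - mu_new) ** 2) / (2.0 * var_new)
        - 0.5
    )

# Illustrative check of the trust-region constraint KL <= delta.
delta = 0.01
mu_old, std_old = np.array([0.10, -0.20]), np.array([0.30, 0.30])
mu_new, std_new = np.array([0.12, -0.18]), np.array([0.29, 0.31])
kl = kl_diag_gaussians(mu_old, std_old, mu_new, std_new)
print(f"KL = {kl:.5f}, within trust region: {kl <= delta}")
```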
Proximal Policy Optimization (PPO), proposed by Schulman et al. in 2017 [16], is an improvement on TRPO. PPO simplifies TRPO's optimization process and enhances sample utilization efficiency. PPO first defines the probability ratio

$$r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})} \tag{3}$$

and then optimizes the policy through the following clipped objective function:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{t}\!\left[ \min\!\left( r_{t}(\theta)\,\hat{A}_{t},\; \mathrm{clip}\!\left( r_{t}(\theta),\, 1-\epsilon,\, 1+\epsilon \right)\hat{A}_{t} \right) \right] \tag{4}$$

In this context, $r_t(\theta)$ is the probability ratio, representing the relative likelihood of selecting the same action under the new and old policies, $\hat{A}_t$ is the estimated value of the advantage function, and $\epsilon$ is a small constant utilized to limit the magnitude of policy updates. The clip function confines the probability ratio $r_t(\theta)$ within the predetermined bounds $1-\epsilon$ and $1+\epsilon$. This constraint on the probability ratio aids in preventing excessive oscillations during policy updates, thereby ensuring stability and reliability in the learning process. When $r_t(\theta)$ falls below $1-\epsilon$, the clip function adjusts it to $1-\epsilon$; conversely, when it exceeds $1+\epsilon$, it is corrected to $1+\epsilon$; within these boundaries, $r_t(\theta)$ remains unchanged.

Equation (3) defines the probability ratio $r_t(\theta)$, which measures the ratio of the probabilities of choosing the same action under the new and old policies. This ratio is the foundation for calculating the objective function of the PPO algorithm. Equation (4) then presents the core objective function $L^{\mathrm{CLIP}}(\theta)$ of the PPO algorithm, combining the probability ratio from Equation (3) with the estimated advantage $\hat{A}_t$ and the clip function to optimize the policy.
PPO utilizes this approach to balance exploration and exploitation while diminishing reliance on extensive datasets, enhancing its suitability for practical applications.
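As a concrete illustration of Equations (3) and (4), the short Python sketch below evaluates the clipped surrogate objective for a toy batch of samples. The log-probabilities and advantage estimates are made-up numbers, and the default clip range of 0.2 is a commonly used value rather than one prescribed by this paper.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, averaged over a batch of samples."""
    ratio = np.exp(logp_new - logp_old)             # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)  # confine the probability ratio
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy batch: log-probabilities under the new/old policies and advantage estimates.
logp_new = np.array([-1.05, -0.70, -2.30])
logp_old = np.array([-1.20, -0.65, -2.00])
adv      = np.array([ 0.80, -0.30,  1.10])
print("L^CLIP =", ppo_clip_objective(logp_new, logp_old, adv))
```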
3.2. Deterministic Policy Methods
In addition to stochastic policies, the Deep Deterministic Policy Gradient (DDPG) emerges as a deterministic policy gradient approach [13], extending the success of DQN to the realm of continuous control. DDPG adheres to the following update mechanism:

$$Q(s_t, a_t) = r_t + \gamma\, Q\!\left( s_{t+1}, \mu_{\theta}(s_{t+1}) \right)$$

This equation conveys that the estimated value of the function $Q$, given the current state $s_t$ and action $a_t$, is the sum of the immediate reward $r_t$ and the discounted value of the expected future return. The discount factor $\gamma$ plays a critical role in weighting the significance of immediate versus future rewards. The term $Q(s_{t+1}, \mu_{\theta}(s_{t+1}))$ denotes the anticipated return associated with the action selected by the deterministic policy $\mu_{\theta}$ in the subsequent state $s_{t+1}$. DDPG operates as an off-policy actor–critic algorithm adept at formulating effective policies across a spectrum of tasks.
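The update mechanism can be sketched as the computation of the critic's bootstrapped target. In the hedged Python example below, actor_target and critic_target are lambda stand-ins for the target networks a DDPG implementation would actually maintain; only the structure of the target y = r + gamma * Q'(s', mu'(s')) is meant to be faithful.

```python
import numpy as np

def ddpg_target(reward, next_state, done, critic_target, actor_target, gamma=0.99):
    """Bootstrapped critic target y = r + gamma * Q'(s', mu'(s'))."""
    next_action = actor_target(next_state)              # deterministic policy mu'(s')
    future_q = critic_target(next_state, next_action)   # Q'(s', mu'(s'))
    return reward + gamma * (1.0 - done) * future_q

# Placeholder target networks (stand-ins for trained neural networks).
actor_target  = lambda s: np.tanh(s.sum()) * np.ones(3)       # 3-dim TCP translation action
critic_target = lambda s, a: float(s @ s * 0.1 - a @ a * 0.05)
y = ddpg_target(reward=1.0, next_state=np.array([0.2, -0.1, 0.4]), done=0.0,
                critic_target=critic_target, actor_target=actor_target)
print("critic target:", y)
```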
The Twin Delayed DDPG (TD3) builds upon DDPG [18], aiming to mitigate some of its inherent constraints. TD3 utilizes a dual-structured framework with two actor networks and two critic networks. These networks are co-trained to synergize and optimize the agent's performance. The twin actor networks are designed to yield disparate action outputs for each state, subsequently appraised by the dual critic networks [22]. Such an arrangement allows TD3 to more accurately navigate environmental uncertainties and curtail the propensity for value overestimations typical in DDPG.
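The overestimation-curbing idea can be sketched as a clipped double-Q target: the target action is perturbed with clipped noise and the smaller of the two target critics' estimates is bootstrapped. The network arguments below are placeholders, and the noise scale, clip range, and action limit are illustrative defaults rather than values used in this paper.

```python
import numpy as np

def td3_target(reward, next_state, done, actor_target, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 critic target: smoothed target action, then the minimum of the twin target critics."""
    noise = np.clip(np.random.normal(0.0, noise_std, size=3), -noise_clip, noise_clip)
    next_action = np.clip(actor_target(next_state) + noise, -act_limit, act_limit)
    min_q = min(q1_target(next_state, next_action), q2_target(next_state, next_action))
    return reward + gamma * (1.0 - done) * min_q

# Placeholder target networks for a 3-dimensional TCP-translation action.
actor_target = lambda s: np.tanh(s)[:3]
q1_target = lambda s, a: float(0.9 * s.sum() - 0.1 * a @ a)
q2_target = lambda s, a: float(0.8 * s.sum() - 0.2 * a @ a)
y = td3_target(1.0, np.array([0.2, -0.1, 0.4]), 0.0, actor_target, q1_target, q2_target)
print("TD3 target:", y)
```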
3.3. Entropy-Regularized Policy Gradient
Soft Actor–Critic (SAC) [17] is a sophisticated off-policy gradient technique that forms a nexus between DDPG and stochastic policy optimization strategies. SAC integrates the concept of clipped Double Q-learning, with its maximum-entropy DRL objective function articulated as:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\!\left( \pi(\cdot \mid s_t) \right) \right]$$

In this context, $r(s_t, a_t)$ denotes the immediate reward obtained from executing action $a_t$ in state $s_t$, and $\mathcal{H}(\pi(\cdot \mid s_t))$ symbolizes the policy's entropy, reflecting the inherent randomness or uncertainty within a specific state. The parameter $\alpha$ serves as a crucial balance factor, calibrating the significance of the reward against entropy, thus effectively managing the policy's exploration–exploitation dynamics. The core objective of this approach is to optimize the policy $\pi$, maximize expected returns, and facilitate the optimal choice of action in any given state $s_t$. SAC's objective function uniquely features entropy regularization, aiming to optimize a policy through a strategic equilibrium between entropy and anticipated returns. Entropy here quantifies the level of stochasticity in policy decisions, mirroring the critical balance between exploration and exploitation. Elevating entropy levels not only fosters increased exploration but also expedites the learning trajectory. Crucially, this mechanism also safeguards against a policy's convergence to suboptimal local extremums.
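A minimal numerical sketch of the entropy-regularized target follows; it combines the clipped double-Q value with the entropy bonus governed by the temperature parameter. The numbers and the temperature value of 0.2 are illustrative only.

```python
def soft_value_target(reward, next_logp, next_q1, next_q2,
                      alpha=0.2, gamma=0.99, done=0.0):
    """Entropy-regularized SAC target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s'))."""
    soft_q = min(next_q1, next_q2) - alpha * next_logp
    return reward + gamma * (1.0 - done) * soft_q

# Toy numbers: the -alpha * log pi term rewards keeping the policy stochastic.
print(soft_value_target(reward=0.5, next_logp=-1.3, next_q1=2.1, next_q2=1.9))
```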
4. Environment Augmentation Strategy
In the field of robotics, trajectory planning for tasks is a pivotal endeavor. It encompasses the movement of objects from one point to another or the execution of intricate maneuvers. For this purpose, robots require an efficient trajectory planning strategy. Utilizing DRL for trajectory planning presents a highly promising approach. It leverages the dynamics of Markov Decision Processes (MDPs) to derive optimal strategies from multifaceted environments. Such a strategy enables robots to make optimal decisions based on real-time status during task execution, thereby crafting trajectories that fulfill specified criteria.
In conventional path-planning algorithms integrating inverse kinematics, intelligent textile robotic arms typically align directly above the center of a target yarn spool, subsequently inserting their grippers into the central hole of the spool and then opening them. However, this method can result in collisions between the robot arm and the upper inside of the yarn spool, as illustrated in Figure 2a. Such collisions can become unacceptably hazardous due to environmental variations. To counter this, we introduced an EA strategy tailored to refining the path-planning algorithm of DRL under the paradigm of inverse kinematics integration. This approach addresses the issue of overconcentration on the TCP and the consequent neglect of environmental data, thereby facilitating safer and more efficient trajectory planning. By incorporating MIEGO, the focus is expanded from a singular TCP issue to a broader spectrum of multi-objective optimization, leveraging environmental data to circumvent scenarios of local optima, as shown in Figure 2b.
It is crucial first to clarify that in DRL, the definition of the environmental state is of utmost importance. In typical reinforcement-learning-based trajectory planning tasks, the environmental state encompasses: (1) the position of the TCP of the end effector and (2) relative positional data between the TCP and the target location. Typically, the state $s$ for trajectory planning is defined as:

$$s = \left[\, p_{\mathrm{TCP}},\; \Delta p,\; d \,\right]$$

In this framework, $p_{\mathrm{TCP}}$ is represented as a vector in the global coordinate system, capturing the position of the end effector. The term $\Delta p$ represents the offset vector between the end effector and its targeted object. The boolean variable $d$ is set to 1 when the end effector successfully reaches its target position; otherwise, it remains 0.
Throughout the task execution phase, the DRL model makes action decisions based on both the prevailing state and the strategy it has learned. To ensure adaptability across various robots and end effectors, the action $a$ is defined in terms of the end effector's translational movement:

$$a = \left[\, \Delta x,\; \Delta y,\; \Delta z \,\right]$$
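These two definitions can be sketched in a few lines of Python. The field layout, the reach tolerance, and the per-step displacement bound below are illustrative assumptions; the coordinates of points A and B are taken from the experimental setup in Section 6.2 purely as sample inputs.

```python
import numpy as np

def make_state(tcp_pos, target_pos, reach_tol=0.01):
    """Baseline state: TCP position, offset to the target, and a reached flag."""
    offset = target_pos - tcp_pos
    done = float(np.linalg.norm(offset) < reach_tol)
    return np.concatenate([tcp_pos, offset, [done]])      # 7-dimensional state vector

def apply_action(tcp_pos, action, step_limit=0.05):
    """Translational action (dx, dy, dz), bounded to keep each step small."""
    return tcp_pos + np.clip(action, -step_limit, step_limit)

tcp  = np.array([1.3019, 0.0319, 1.5801])    # point A from the experimental setup
goal = np.array([0.7500, -1.3000, 1.2000])   # point B (above the yarn spool center)
state = make_state(tcp, goal)
tcp = apply_action(tcp, 0.05 * (goal - tcp) / np.linalg.norm(goal - tcp))
```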
This approach to defining states and actions steers the continuous modification in the position and orientation of the robot’s end TCP, crafting a seamless motion trajectory. However, despite facilitating the autonomous planning of task trajectories by the agent, the inherent multiplicity in inverse kinematics suggests multiple potential configurations of robotic arm joint angles (states) for any given TCP position and orientation.
Figure 3 depicts multiple-solution conflict scenarios, a consequence of inverse kinematics, in two distinct cases.
In the realm of DRL, models typically formulate a distinct policy (action) for each specified state. However, a complication arises when joint angles are not factored in and states are solely defined by the TCP position. This can lead to different states of the robotic arm corresponding to the same TCP position, yet the DRL model may produce identical policies. Such a scenario introduces several challenges:
1. For robotic arms with identical TCP positions but varying joint states, the ideal action could vary. Generating the same policy for these divergent states by the DRL model might compromise task execution efficiency or even result in task failure.
2. In tasks necessitating maneuvers in complex settings or confined spaces, the specific joint angle configurations of the robotic arm can critically influence task success. Policies derived solely from TCP positions may fall short of meeting these nuanced requirements.
3. Given that a robotic arm's state is influenced by both the TCP position and joint angles, overlooking joint angles means the state cannot comprehensively represent a robot's actual condition, potentially impeding the learning outcomes.
To surmount these DRL challenges, this study introduces a novel EA strategy, as delineated in Figure 4. This strategy centers around enhancing the state representation of the robotic arm within its environment, synergized with a thoughtfully constructed reward function. Initially, the enhanced state and the predefined reward function are fed into the agent. This step is followed by updating the actor–critic network parameters within the agent. Subsequently, the deep network, utilizing these updated parameters, computes the information for the robotic arm's subsequent target pose (TCP). Post-inverse kinematics processing, these data inform the action angles for the robotic arm's joints. These angles are then communicated to the robotic arm, directing it to perform the planned actions. This iterative process facilitates the ongoing refinement of the robotic arm's control strategy.
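The iterative process of Figure 4 can be outlined in pseudocode-like Python, using the augmented state introduced below. Everything here is a placeholder interface: agent.act, ik_solver, and the arm object with tcp_position, joint_angles, and execute methods are hypothetical stand-ins used only to show the order of operations (augmented state, then agent, then TCP target, then inverse kinematics, then joint command).

```python
import numpy as np

def ea_control_step(agent, ik_solver, arm, target_pos, reach_tol=0.01):
    """One EA iteration: build the augmented state, query the agent, solve IK, move the arm."""
    tcp = arm.tcp_position()                     # current TCP position (3-vector)
    joints = arm.joint_angles()                  # six joint angles (the EA augmentation)
    offset = target_pos - tcp
    done = float(np.linalg.norm(offset) < reach_tol)
    state = np.concatenate([tcp, offset, [done], joints])

    delta_tcp = agent.act(state)                 # policy outputs a TCP displacement
    next_joints = ik_solver(tcp + delta_tcp)     # inverse kinematics for the next pose
    reward = arm.execute(next_joints)            # command the joints, collect the task reward
    return state, delta_tcp, reward, done
```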
In DRL, the State Augmentation strategy involves introducing the states of the robotic arm's joint angles into the environmental state. This approach provides a more accurate reflection of a robotic arm's actual state, thereby generating more effective strategies. The augmented state $s'$ is defined as:

$$s' = \left[\, p_{\mathrm{TCP}},\; \Delta p,\; d,\; \boldsymbol{\theta} \,\right]$$

Here, the definitions of $p_{\mathrm{TCP}}$, $\Delta p$, and $d$ remain consistent with the earlier descriptions, while $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_6]$ represents the angles of the robotic arm's joints, providing a more accurate representation of the robot's current state.
Given that the robotic arm used in the scenarios of this paper is the six-DOF ER10-2000 model (EFORT Intelligent Equipment Co., Ltd., Foshan, China), the state incorporates the rotational angles of the robotic arm’s six axes to offer additional information for learning. In DRL, the effectiveness of model learning often depends on the amount of available information. Introducing joint angles provides more data for the model, potentially aiding in learning better strategies. Furthermore, for robots with redundant degrees of freedom, multiple joint angle configurations can achieve the same TCP position. Incorporating joint angles allows the strategy to better utilize these redundant degrees of freedom, potentially resulting in a greater variety of actions.
This paper also uses the change in inverse solutions as a reference index for measuring the stability of solutions. This index serves as an additional reward element, guiding the agent to prioritize exploring paths with higher stability and achieving smoother trajectories that avoid collisions and enhance efficiency.
Initially, $\boldsymbol{\theta}_t$ represents the robotic arm's joint angles at the current time step, and $\boldsymbol{\theta}_{t-1}$ represents the joint angles at the previous time step. The corresponding planned TCP positions are $p_t$ and $p_{t-1}$. The following relationship is established:

$$\Delta p = p_t - p_{t-1}, \qquad \Delta \boldsymbol{\theta} = \boldsymbol{\theta}_t - \boldsymbol{\theta}_{t-1}$$

with the corresponding differential relationship:

$$\dot{p} = J(\boldsymbol{\theta})\, \dot{\boldsymbol{\theta}}, \qquad \Delta p \approx J(\boldsymbol{\theta}_t)\, \Delta \boldsymbol{\theta}$$

Here, $J(\boldsymbol{\theta})$ is the Jacobian matrix of the six-DOF robotic arm, and $\dot{\boldsymbol{\theta}}$ represents the rotational velocity of each joint; in the discrete setting of the planner, $\Delta\boldsymbol{\theta}$ signifies the per-step change in the joint angles, i.e., the rate of change in the angles of each joint.
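The differential relationship can be checked numerically, as in the sketch below: the change between two consecutive inverse-kinematics solutions is mapped through the positional Jacobian to a first-order estimate of the TCP displacement. The joint values are arbitrary examples, and the 3x6 Jacobian is a random stand-in for the one obtained from the ER10-2000 kinematic model.

```python
import numpy as np

def joint_change_and_tcp_estimate(theta_t, theta_prev, jacobian):
    """Change between consecutive IK solutions and the TCP displacement it implies (dp ~= J * dtheta)."""
    d_theta = theta_t - theta_prev            # change in the six joint angles
    d_tcp = jacobian @ d_theta                # first-order estimate of the TCP displacement
    return d_theta, d_tcp

# Illustrative values: two consecutive IK solutions (radians) and a 3x6 positional Jacobian.
theta_prev = np.array([0.10, -0.52, 1.05, 0.00, 0.40, 0.00])
theta_t    = np.array([0.12, -0.50, 1.02, 0.01, 0.42, 0.00])
J = np.random.default_rng(0).normal(scale=0.3, size=(3, 6))
d_theta, d_tcp = joint_change_and_tcp_estimate(theta_t, theta_prev, J)
print("||d_theta|| =", np.linalg.norm(d_theta), " estimated d_tcp =", d_tcp)
```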
To measure the stability of solutions without excessively rewarding minor changes, this paper employs a variant of the sigmoid function for the nonlinear mapping of the stability reward, defined as:

$$S = \frac{1}{1 + e^{-k\left( \lVert \Delta \boldsymbol{\theta} \rVert - c \right)}}$$

Here, $k$ is a positive parameter controlling the slope of the function, and $c$ is a parameter controlling the center of the function. When $\lVert \Delta \boldsymbol{\theta} \rVert$ is close to 0, $S$ will be a small value; as $\lVert \Delta \boldsymbol{\theta} \rVert$ increases, $S$ will smoothly increase. The advantage of this function is that it does not overly weight very small changes, and once the change magnitude reaches a certain level, $S$ continues to increase but at a slower rate.
If the stability measure dominates the reward function, the agent might minimize the change in solutions during learning to maximize rewards, leading to local optima or inaction. To avoid this, the paper introduces a weight parameter, $w$, and incorporates the stability measure into the overall reward function:

$$R = R_{\text{base}} - w \cdot S$$

Here, $R_{\text{base}}$ is the basic reward, such as the reward for reaching the target position or other task-related rewards, and $w$ is a weight parameter controlling the importance of the stability measure in the overall reward.
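Read together, the sigmoid measure and its weighted combination with the task reward can be sketched as follows. This is a minimal sketch, assuming the stability term enters the reward as a penalty on large jumps between consecutive inverse solutions; the values of k, c, and w are illustrative only.

```python
import numpy as np

def stability_measure(d_theta, k=10.0, c=0.3):
    """Sigmoid-variant measure S of the change between consecutive inverse solutions."""
    return 1.0 / (1.0 + np.exp(-k * (np.linalg.norm(d_theta) - c)))

def total_reward(base_reward, d_theta, w=0.5, k=10.0, c=0.3):
    """Overall reward combining the task reward with the weighted stability term."""
    return base_reward - w * stability_measure(d_theta, k, c)

# Small joint changes incur almost no penalty; large jumps are penalized but the penalty saturates.
small = np.full(6, 0.01)
large = np.full(6, 0.40)
print(total_reward(1.0, small), total_reward(1.0, large))
```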
6. Experimental Analysis
This chapter first establishes a virtual twin experimental environment for the intelligent textile robotic arm, as illustrated in Figure 6, and defines the task objectives. Subsequently, this paper employs four policy-gradient-based DRL algorithms, DDPG, TD3, SAC, and PPO, and integrates our EA strategy and MIEGO method into these algorithms. A series of experiments are then conducted to evaluate the proposed optimization strategies and methods, demonstrating their superior performance.
6.1. Intelligent Textile Robotic Arm and Its Virtual Twin
The intelligent textile robotic arm model comprises the geometric structure of the industrial robot arm, its joint parameters, and its kinematic model. Table 1, Table 2, and Figure 7 present the parameters and motion range of the real-world textile robot hand ER10-2000. Based on these parameters, a model of the textile robot hand is developed in VREP at a 1:1 scale, as shown in Figure 8, achieving a high-fidelity simulation of the real-world scenario.
To acquire more learnable data in realistic textile yarn-gripping robotic hand scenarios, this study constructs a virtual twin system. This system accurately replicates the 1:1 kinematic model of the robotic arm and hand, enabling extensive training within the simulation environment. This approach allows for the collection of a substantial amount of high-quality training data, which can be directly applied to the trajectory generation of the actual robot.
The introduction of the virtual twin system not only optimizes the efficiency and effectiveness of DRL training for the robotic arm and hand but also reduces the costs and safety risks associated with physical training. By simulating different production scenarios and environmental conditions, the virtual twin system can provide more comprehensive and accurate training data for the actual robot, testing its robustness and reliability in various settings. Furthermore, safety constraints can be tested and optimized in the virtual environment, ensuring the safety and stability of the robot in actual operations. Thus, the virtual twin technology offers a more efficient and reliable solution for the textile production and processing industry.
This paper utilizes VREP as the simulation platform, specifically V-REP PRO EDU Version 3.6.2. This software provides a powerful simulation environment, supporting a variety of robot models and algorithms and facilitating high-quality training of the robotic arm and hand in the virtual environment. Using VREP, this study can quickly set up and test the virtual twin system, laying a solid foundation for the practical application of the robotic arm and hand.
6.2. Experimental Setup
Figure 9 displays the continuous trajectory planning scenario for the textile robotic hand within the virtual twin platform, with the task described as follows.
At the start of the task, the robotic arm is positioned at the initial location, point A, and its vicinity. To the right of the robotic hand is a table with various obstacles of different sizes and dimensions, which will be detailed in subsequent chapters. Also placed on the table is a yarn spool, the object to be grasped and relocated. Above the yarn spool is point B, where the robotic hand’s end effector needs to be positioned vertically downward during grasping. The end effector inserts its grippers into the central hollow of the yarn spool and opens them, lifting the spool using friction before placing it in the designated area, point C, on the left platform with the correct posture. To generalize the problem, only one yarn spool in the middle is selected on the virtual twin platform, but it can be positioned in different areas before each simulation to accommodate more possibilities.
Assume the end effector of the robotic hand is initially at point A, and the center of the yarn spool is located above point B (and its vicinity). To successfully lift the spool, the end effector of the robotic hand must first move a short distance along the −x axis, then move along the −y axis, continue moving along the −x axis, and approach the yarn spool with the appropriate posture before lifting it along the z-axis. It is foreseeable that if the robotic hand moves directly from point A to point B, the robotic arm’s joints (or links) will inevitably collide with the beams in the environment. Therefore, the grasp–place task is divided into the following steps, with the complete trajectory composed of continuous segments:
1. The yarn spool appears at any location on the table, and the TCP of the robotic arm moves from the initial position, A, along trajectory segment 1 to the preparation position, B, with the appropriate posture. Point B is located directly above the center of the yarn spool along the z-axis and may appear anywhere in Region 1.
2. The robotic arm moves along trajectory segment 2 from the preparation position, B, to the placement position, C (which can be randomly specified in Region 2), retracts its gripper, and places the yarn spool on the platform.
3. The arm returns to the vicinity of the initial position, A, from the placement position, C.
In this paper's experiments, the six-DOF intelligent textile robotic arm model on the virtual twin platform is used to complete the task along trajectory 1. In this environment, the coordinates of point A (the starting point) are (1.3019, 0.0319, 1.5801), and those of point B (the center of the yarn spool) are (0.7500, −1.3000, 1.2000). The distance between A and B is approximately 0.7228 m. A fixed step length is chosen such that covering the straight-line distance between A and B requires at least 24 steps (all distance units are in meters).
6.3. Experimental Results
In our experiments, we compared the performance of PPO, SAC, DDPG, and TD3 and their corresponding environment-augmented strategy algorithms, EA-PPO, EA-SAC, EA-DDPG, and EA-TD3. The specific parameters for each algorithm are detailed in Table 3.
The training comparison data presented in Figure 10 show that both PPO and EA-PPO were successful in learning appropriate paths over the course of training, ultimately converging to similar reward levels. Notably, because EA-PPO must explore the environment more extensively, its convergence rate was relatively slower. For SAC and EA-SAC, performance was generally similar, but the step count results indicate that EA-SAC required more steps in the process of exploring suitable paths. The difference between DDPG and EA-DDPG was more pronounced, with EA-DDPG spending more time exploring quality paths due to the influence of inverse solution stability rewards and the uncertainty of environmental information. Additionally, EA-DDPG exhibited a larger error band compared to DDPG, indicating increased fluctuation in reward values across multiple experiments due to the EA strategy. Since TD3 is an improved version of DDPG, both TD3 and EA-TD3 converged faster than DDPG in the experiments, although EA-TD3 still required more steps to find ideal paths.
Figure 11 reveals the impact of the EA strategy on the stability metric of the solutions, especially the stability measure referenced by the S value. As training progressed, a gradual decrease in the S value was observed. This trend indicates that the algorithms produced more stable solutions over time. Furthermore, this stability was reflected in the smoothness of the search paths. In other words, as the S value decreased, the exploration paths became smoother. This is particularly important in practical applications, as smoother paths typically mean more predictable and controllable decision-making processes.
Figure 12 illustrates significant differences between the environment-augmented strategy versions of PPO, SAC, DDPG, and TD3 and their original DRL algorithms across various performance indicators. At first glance, the paths planned by the versions with the EA strategy seem longer with extended execution times, giving an initial impression of lower efficiency. However, this reflects a key feature of the EA strategy: a deeper and more comprehensive exploration of the environment. The strategy aims to thoroughly analyze the environmental structure, uncover potential risks, and thus provide more comprehensive and refined decision making.
The experiments validate the effectiveness of our EA strategy. During training, our strategy continually optimizes stability indicators, producing higher-quality solutions. Ultimately, this enhancement in stability and improvement in path smoothness provide strong support for the practical application of our EA strategy.
Furthermore, the EA strategy emphasizes robust control policies, prioritizing precise and cautious movements over speed. This strategy not only reduces joint speeds (with the exception of EA-PPO, all environment-augmented versions have lower joint speeds than their original versions) but also enhances inverse kinematics stability. By choosing more robust and precise inverse kinematics solutions, the environment-augmented versions can achieve smoother path planning and more stable joint movements, effectively reducing energy consumption. Specifically, the EA strategy brings benefits in terms of energy saving and also helps to reduce the overall lead time. Although this strategy may result in more steps and longer execution times in the initial stages of a task, it effectively reduces repetitions or adjustments due to environmental changes or imprecise path planning over the entire task cycle. This ability to reduce repetitive actions, in the long run, not only lowers energy consumption but also reduces the overall completion time of a task. In other words, the EA strategy enhances efficiency and safety in the long term by improving the accuracy and stability of path planning, thereby avoiding operational failures and safety incidents due to environmental changes or misinterpretations [30].
In summary, while the EA strategy may show weaker performance in terms of steps and execution time compared to the original algorithms, its deep exploration and understanding of the environment, along with its enhanced focus on operational safety and stability, reveal its significant advantages and resilience in dealing with complex environments and unknown risks.
Additionally, this paper selects the widely applied DDPG and SAC algorithms in continuous space path planning as benchmarks for comparison and conducts an in-depth analysis of the path planning results. We designed and implemented an improved algorithm combining the EA strategy and MIEGO method (referred to as the EM algorithm), specifically for training on the first path. Comparative observations of the final paths planned by these algorithms reveal the effectiveness and innovation of the EM algorithm in path optimization, further confirming the practical value of the EA strategy in intelligent textile robotic arm path planning.
Figure 13 shows the paths planned by four different algorithms, from left to right: EM-SAC, SAC, DDPG, and EM-DDPG. It is observed that the paths planned by EM-SAC and EM-DDPG are more circuitous and maintain a greater distance from potential obstacles. This strategy reduces safety risks that could arise due to minor environmental changes, electrical control factors, etc., during actual movement. More importantly, the paths planned by these two algorithms are more streamlined, better suited for smooth motor operation, and can effectively avoid frequent motor starts and stops, thus extending the motor's lifespan [31].
Figure 14 presents the operational conditions of joint angles in the optimal paths planned by the original algorithms and the EM algorithm. In the figure, (a) and (b) correspond to the EM-SAC and SAC algorithms, while (c) and (d) correspond to DDPG and EM-DDPG, respectively. In comparison, it is noticeable that the variation curves in (a) and (d) are significantly smoother and exhibit almost negligible jitter. This indicates that the paths planned through the EM algorithm have a higher degree of smoothness. In practical applications, this high smoothness means the robotic arm's movement will be considerably less jittery, thus achieving more precise positioning. Additionally, smoother motion paths reduce the robotic arm's instantaneous acceleration, thereby mitigating wear and tear and prolonging its lifespan. Furthermore, smoother movement paths also reduce the robotic arm's demand for electricity, consequently lowering energy consumption [4]. Therefore, the excellent performance of the EM algorithm in this regard further confirms its superior performance and broad application prospects in real-world applications.
In conclusion, the EA strategy’s ability to deeply explore and understand the environment, ensuring operational safety and stability, makes it highly adaptable and superior for tasks involving complex environments and unknown risks. The integration of the EA strategy with the MIEGO method (EM algorithm) in path planning effectively balances factors such as path length, motion safety, and equipment durability, providing an effective strategy and methodology for achieving efficient and safe robotic arm movements in complex environments.