Article

A Risk-Sensitive Intelligent Control Algorithm for Servo Motor Based on Value Distribution

Depeng Gao, Tingyu Xiao, Shuai Wang, Hongqi Li, Jianlin Qiu, Yuwei Yang, Hao Chen, Haifei Zhang, Xi Lu and Shuxi Chen
1 School of Computer and Information Engineering, Nantong Institute of Technology, Nantong 226001, China
2 School of Software, Northwestern Polytechnical University, Xi’an 710000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5618; https://doi.org/10.3390/app14135618
Submission received: 30 April 2024 / Revised: 11 June 2024 / Accepted: 21 June 2024 / Published: 27 June 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract: With the development of artificial intelligence, reinforcement-learning-based intelligent control algorithms, which generally learn control strategies through trial and error, have received increasing attention in the automation equipment and manufacturing fields. Although they can adjust their control strategies intelligently without human effort, most existing algorithms for servo motors only consider the overall performance while ignoring the risks in special cases. As a result, overcurrent problems are often triggered during the training process of the reinforcement learning agent, which can shorten the motor’s service life or even burn out the motor directly. To solve this problem, in this study we propose a risk-sensitive intelligent control algorithm based on value distribution, which uses the quantile function to model the probability distribution of cumulative discount returns and employs the conditional value at risk to measure the loss caused by overcurrent. The agent can accordingly learn a control strategy that is more sensitive to environmental restrictions and avoid the overcurrent problem. The performance is verified on three different servo motors with six control tasks, and the experimental results show that the proposed method triggers fewer overcurrent occurrences than other methods in most cases.

1. Introduction

Servo motors play an important role in the composition of automation equipment. They can convert the voltage signal into torque and speed to drive the controlled object with high control precision. Because servo motors can meet the torque and speed control requirements of most tasks, they have been widely used in many manufacturing fields, such as in aerospace, automotive manufacturing, and robotics, among others [1].
The most important part of a servo motor control system is the control algorithm, which directly affects the performance. Traditional control algorithms, such as proportion-integration-differentiation (PID), mainly use ordinary differential equations, transfer functions, and dynamic structure diagrams to describe the relationship between the input and output of the system [2]. Traditional algorithms mainly target linear time-invariant systems with a single input and a single output; their simple structure and ease of understanding have led to wide use in various industrial scenarios. However, these algorithms can generally only describe the relationship between the input and output of the system, not the internal structure or the changes in the internal states of the system, so many nonlinear systems use modern control algorithms [3].
A set of modern control methods based on the related control theory have been applied to the servo motor control task [4]. In modern control theory, the analysis and design of control systems are mainly carried out through the description of the state variables of the system. Modern control theory can deal with a wider range of problems than classical control theory, including linear and nonlinear systems, time-invariant and time-varying systems, and univariate and multivariable systems. Modern control algorithms have excellent control performance for nonlinear systems and can change controller parameters along with time-varying systems [5,6]. They are often used in nonlinear systems with high performance requirements. However, modern control algorithms cannot perform approximate local optimization; therefore, intelligent control algorithms are needed for high-performance and high-precision control systems.
At present, the main intelligent control algorithms are based on reinforcement learning [7], which can prompt the servo motor control system to adapt to the complex control environment by automatically adjusting the output of the control system through the imitation of human decision-making behavior [8,9,10,11]. By interacting with the environment, the intelligent controller learns optimal action selection strategies to maximize cumulative rewards. In each interaction, the intelligent controller outputs the duty cycle according to the current operating state of the motor and the target trajectory to obtain optimal control performance. These intelligent control algorithms can adapt to different environmental conditions and task requirements, and have strong generalization ability [12,13,14,15].
However, almost all existing intelligent control methods only consider maximizing the expected value of the cumulative discount return in the interaction process, while usually ignoring its complete probability distribution. This means that only the average performance of the motor is considered, and the performance in special cases is not included in the calculation. Therefore, the motor may frequently trigger the overcurrent problem (i.e., the current exceeds the rated value), which increases the temperature of the motor and accelerates the aging of the insulation, further shortening the motor’s service life or even burning out the motor.
In view of the aforementioned problems faced by existing control algorithms, this study intends to design a risk-sensitive intelligent control algorithm based on value distribution to avoid the overcurrent problem in the process for training a reinforcement learning agent. Specifically, the quantile function is used to model the probability distribution of cumulative discount returns. By means of quantile regression, the output of the value network is transformed from the expected value of the cumulative discount return to the entire probability distribution, thus providing a more accurate estimate of the value. In the process of learning the control strategy, the risk evaluation index in the financial field is used to effectively measure the loss caused by overcurrent, so that the reinforcement learning agent learns a control strategy that is more sensitive to environmental restrictions. As a result, it can effectively alleviate the interference of overcurrent on motor control. Six different control tasks are used to verify the performance of the developed method. The results show that the proposed risk-sensitive control algorithm not only triggers fewer overcurrents, its control performance is also better than others in most cases.
The key contributions of this work can be summarized as follows: (1) for the first time, this paper discusses the possible damage to the motor caused by a reinforcement-learning-based control method during its interaction with the environment; (2) a risk-sensitive intelligent control algorithm is proposed to avoid the overcurrent problem when training a reinforcement learning agent; (3) the effectiveness of the proposed control algorithm is verified on six different control tasks.

2. Basic Knowledge

2.1. Motion Control System

As shown in Figure 1, a servo motor control system typically consists of a closed-loop circuit, which mainly includes seven components: human–machine interface, motion controller, driver, actuator, driving mechanism, load, and feedback.
The motion controller is the “brain” of the entire system: it receives the feedback from the motor, calculates the servo error between the actual and the given state, and then generates a control signal according to its control strategy to adjust the state of the motor and reduce the error. After that, the driver amplifies these signals to a high-power voltage and current to meet the needs of motor operation [2].
In this work, we mainly focus on the control strategy in the controller, and our goal is to design a risk-sensitive control algorithm to avoid the overcurrent problem during the training process.

2.2. Value Distribution Reinforcement Learning

Reinforcement learning is a type of method that enables continuous learning from interaction with the environment [16,17]. Different from supervised learning, reinforcement learning only needs the reward of its strategy instead of the “correct” supervised information, and then, adjusts its strategy to obtain the maximum reward [18].
In most traditional reinforcement learning methods, the agent learns the value function and chooses an action with a higher value by interacting with the environment, to maximize the expected value of cumulative discount returns [19,20]. As a random variable, the cumulative discount return has a complete probability distribution. However, these methods often use the expected value of the cumulative discount return to express it as a value function or an action value function, and this complete distribution information is lost. To solve this problem, Marc G. Bellemare et al. proposed the concept of value-distributed reinforcement learning algorithms in 2017, which use complete distribution information to represent cumulative discount returns [16]. By taking the expected value of the complete distribution information, we can obtain the estimated value of the action value function used by the classical reinforcement learning method.
In current value-distribution reinforcement learning algorithms, the probability distribution of cumulative discount returns can be expressed in many ways, the most commonly used being the segmentation (quantile) method [21,22]. It fixes several quantile positions and makes the model output the quantile value corresponding to each position. As shown in Figure 2, the quantile function is equivalent to the inverse of the probability distribution function. In the usual probability distribution function, the independent variable is the value of the random variable, and the output of the function is the cumulative probability at that value. The quantile function is the opposite: the independent variable is the cumulative probability, and the function outputs the value of the random variable [23,24].

2.3. Risk Measurement

Risk measurement mainly comes from the financial sector, where the loss of an investment is the risk, and its size is proportional to the magnitude or amount of the asset reduction. Three common risk measures are volatility, value at risk (VaR), and conditional value at risk (CVaR) [25].
Volatility generally refers to the change in a financial asset over a period of time, usually measured by the standard deviation of the return of the asset over a past period of time. The higher the volatility of an asset, the riskier it can be considered.
The value at risk is the threshold such that, at a certain confidence level, the expected loss that may occur over a certain future period does not exceed it. As shown in Figure 3, suppose the sum of the probability values of the shaded region is 5%; then x is the value at risk at the 95% confidence level. The value at risk can be calculated with Equation (1):
$VaR = \mu + \sigma \times Z(\alpha)$ (1)
where $\mu$ is the expected return, $\sigma$ represents the volatility of events, and $Z(\alpha)$ represents the quantile at the confidence level $\alpha$.
Calculating the value at risk amounts to reading off the quantile at a specific position, which can be effectively combined with the value-distribution reinforcement learning algorithm, since the quantile estimates at specific positions are directly output by the model. However, it still has some shortcomings: (1) it may underestimate the risk in many cases; (2) it only applies to the normal distribution; when the distribution of the random variable is fat-tailed, the value at risk is unlikely to reflect the risk of events accurately. This is because, in a fat-tailed distribution, the probability of some extreme events does not decrease as the events become more extreme; instead of decaying in the tail, the probability retains a noticeable mass, as shown in Figure 4.
The conditional value at risk makes up for this deficiency of the value at risk; it is defined as the expected loss that a random event can produce at a given confidence level over a period of time. As shown in Figure 4, the conditional value at risk is the expected value of the shaded part, calculated with Equation (2).
$CVaR = \mu + \sigma \, \dfrac{e^{-Y^2/2}}{\sqrt{2\pi}\,(1 - X)}$ (2)
where $Y$ is the critical value at the $(1 - X)\%$ confidence level under the standard normal distribution.
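As a minimal numerical check of Equations (1) and (2) (assuming NumPy and SciPy are available; the return, volatility, and confidence values are arbitrary illustrative numbers):

import numpy as np
from scipy.stats import norm

mu, sigma = 0.02, 0.05        # assumed expected return and volatility
confidence = 0.95             # confidence level as in Figure 3

# Equation (1): the value at risk is a quantile of the return distribution.
z = norm.ppf(1.0 - confidence)            # standard normal quantile of the 5% lower tail
value_at_risk = mu + sigma * z

# Equation (2) gives the analytic tail expectation for a normal variable; here the
# same quantity is checked numerically as the average over the shaded tail of Figure 4.
samples = np.random.normal(mu, sigma, 1_000_000)
cvar = samples[samples <= value_at_risk].mean()

print(f"VaR(95%)  = {value_at_risk:.4f}")
print(f"CVaR(95%) = {cvar:.4f}")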

3. Problem Description

The training process of the reinforcement-learning-based control algorithm is as follows: (1) input the current state of the motor; (2) the agent outputs an action according to its control strategy; (3) the motor performs the action and moves to the next state; (4) evaluate and optimize the control strategy; (5) repeat this process until convergence. The control strategy is stochastic at the beginning of this process, so the output actions are stochastic as well [20]. A stochastic action can make the motor’s current too high or too low; an excessively high current, called overcurrent, damages the motor. This is the problem addressed in this work.

3.1. Control Mode of Servo Motor

The general servo motor has three control modes: position control, torque control, and speed control. Position control is the most common mode for a servo motor: the frequency of the external input pulses determines the rotation speed, and the number of pulses determines the rotation angle. This mode is suitable for applications with strict position requirements, such as CNC machine tools and printing machinery.
The speed control mode controls the speed of rotation through the analog input or the frequency of the pulse. This mode is suitable for scenarios where speed is required but the requirement for positional accuracy is not high, such as speed regulating devices or applications requiring smooth speed changes.
The torque control mode sets the output torque of the motor shaft through an external analog input or direct address assignment. This mode is suitable for situations where strict torque control is required, such as winding and unwinding devices that have strict requirements for material forces.
Since position control can be achieved via speed control, in this work we only consider the torque control mode and the speed control mode. The speed and torque of the motor are determined by the duty cycle, which denotes the ratio of the high-level time to the whole cycle time within a pulse period. The range of the duty cycle is [0, 1], and its value is proportional to the velocity and torque.

3.2. State Information of Servo Motor

Three commonly used servo motors are considered in this work:  an externally excited DC motor (ExtEx), a series DC motor (Series), and a three-phase permanent magnet synchronous motor (PMSM). To describe the operational state of a servo motor, it is necessary to use different physical quantities such as the motor’s angular velocity, actual current, actual voltage, and motor torque.
For the ExtEx motor, its state $O_{ExtEx}$ can be described by the angular velocity $\omega$, torque $T$, armature current $i_A$, excitation circuit current $i_E$, armature voltage $u_A$, excitation circuit voltage $u_E$, and supply voltage $u_{ser}$ [2], as shown in Equation (3):
$O_{ExtEx} = [\omega, T, i_A, i_E, u_A, u_E, u_{ser}]$ (3)
In the series DC motor, the armature and the excitation circuit are connected in series, so their currents are the same, i.e., $i_A = i_E$, and both of them are denoted as $i$. According to the basic principles of a series circuit, the input voltage is the sum of the armature voltage and the excitation circuit voltage, denoted as $u = u_A + u_E$ [2]. Therefore, the state of the series DC motor $O_{Series}$ is given by Equation (4):
$O_{Series} = [\omega, T, i, u, u_{ser}]$ (4)
A PMSM has three phases, A, B, and C, and each of them has a phase voltage and a phase current, denoted as $u_a, u_b, u_c$ and $i_a, i_b, i_c$, respectively. To simplify the dynamic mathematical model of a permanent magnet synchronous motor in the three-phase stationary coordinate system, it is common to convert it into the dq two-phase rotor coordinate system, which reduces the inductance matrix to constants. In dq coordinates, the voltage and current components in the d and q directions are denoted as $u_{sd}$, $u_{sq}$, $i_{sd}$, and $i_{sq}$, respectively [26]. Moreover, the rotor flux of a PMSM is denoted by $\varepsilon$. Therefore, the state of a PMSM is denoted as $O_{PMSM}$, which is shown in Equation (5).
$O_{PMSM} = [\omega, T, i_a, i_b, i_c, i_{sq}, i_{sd}, u_a, u_b, u_c, u_{sq}, u_{sd}, u_{ser}, \varepsilon]$ (5)

3.3. Task of Agent

The objective of a servo control system is to operate the motor at a specified speed or torque. To achieve this, a reinforcement learning agent must output an appropriate duty cycle to minimize the servo error, which is the difference between the actual state and the target [27].
To mitigate the impact of observation errors, and to capture some motor running states that cannot be represented by instantaneous observations, in this work, both the current observations of motor operation and the historical information over a period are taken as the input of the proposed control algorithm. Specifically, it includes the observations of the motor’s running state and the control outputs of the agent in the past period. Therefore, the input state s t of the reinforcement learning agent at time t can be expressed as Equation (6).
$s_t = [o_{t-h}, a_{t-h}, o_{t-h+1}, a_{t-h+1}, \ldots, o_t]$ (6)
where $o_t$ represents all operational state observations of the motor at time $t$, i.e., Equations (3)–(5); $a_t$ represents the action output by the agent at time $t$; and $h$ indicates the length of the history information contained in the state input.
Then, the reinforcement learning agent takes $s_t$ as input and outputs the duty cycle as its action. For an ExtEx motor, the action includes the armature voltage’s duty cycle $a_A$ and the excitation circuit voltage’s duty cycle $a_E$, denoted as $a_{ExtEx} = [a_A, a_E]$. For a series DC motor, the action only includes the duty cycle of the input voltage, $a_{Series}$. For a PMSM, the duty cycles of the A, B, and C phases together form the output action, denoted as $a_{PMSM} = [a_A, a_B, a_C]$. Each output action is a real number between 0 and 1, and the duty cycle increases with the action value.
Therefore, at time $t$, to control the motor running under the set target $g_t$, the agent has to output an appropriate duty cycle $a_t$ based on the state input $s_t$, where $g_t$ denotes the target speed or target torque.
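To make Equation (6) concrete, the following sketch shows one way the history-augmented state could be assembled; the buffer layout and the zero padding at the start of an episode are illustrative assumptions rather than details taken from the paper.

import numpy as np
from collections import deque

class HistoryState:
    """Builds s_t = [o_{t-h}, a_{t-h}, ..., o_t] from recent observations and actions."""

    def __init__(self, obs_dim, act_dim, h):
        self.h = h
        # Pre-fill with zeros so s_t has a fixed length even at episode start (assumption).
        self.obs = deque([np.zeros(obs_dim)] * (h + 1), maxlen=h + 1)
        self.act = deque([np.zeros(act_dim)] * h, maxlen=h)

    def push(self, o_t, a_prev=None):
        # Append the newest observation and, if available, the action that led to it.
        self.obs.append(np.asarray(o_t))
        if a_prev is not None:
            self.act.append(np.asarray(a_prev))

    def state(self):
        # Interleave past observations and actions, ending with the current observation.
        parts = []
        for i in range(self.h):
            parts.append(self.obs[i])
            parts.append(self.act[i])
        parts.append(self.obs[-1])
        return np.concatenate(parts)

With h = 2 and the ExtEx observation of Equation (3), for example, the resulting input vector would have length 3 × 7 + 2 × 2 = 25.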

The Interaction between Agent and Environment

Given a target space $G$ including all possible speeds or torques, the reward function evaluating the performance of the control strategy is expressed as $r(s, a, g)$, which calculates the real-time reward from the input state $s$, the output duty cycle $a$, and the set target speed or torque $g$. After the agent outputs a duty cycle in the current state under a given target, the environment returns a reward that reflects the agent’s performance. In detail, the agent obtains an initial environment state $s_0$ and target $g_0$ at the beginning of each episode. At each subsequent step, the agent makes a decision according to the environment state $s_t$ and the target $g_t$ to be tracked at time $t$. When the agent outputs the duty cycle, the environment feeds back a reward signal $r_t = r(s_t, a_t, g_t)$ to the agent. At the same time, the environment transitions to the next state $s_{t+1}$ and issues a new target $g_{t+1}$.
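The interaction loop described above can be sketched in gym-style Python as follows; the env, agent, and sample_goal objects and their interfaces are assumptions for illustration, and the reward is written as the negative absolute servo error, matching the description in Section 4.1.1.

def run_episode(env, agent, sample_goal, steps=1000):
    """One episode of the agent-environment interaction loop (illustrative interfaces)."""
    s = env.reset()
    g = sample_goal()                      # initial target g_0 (speed or torque)
    for t in range(steps):
        a = agent.act(s, g)                # duty cycle from the control strategy
        s_next, achieved = env.step(a)     # the motor executes the duty cycle
        r = -abs(achieved - g)             # reward: negative absolute servo error (Section 4.1.1)
        agent.buffer.store(s, a, r, s_next, g)
        s, g = s_next, sample_goal()       # move to the next state and target g_{t+1}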

4. The Proposed Risk-Sensitive Intelligent Control Method

4.1. The Structure of the Control Method

As shown in Figure 5, the risk-sensitive intelligent control method based on value distribution consists of four main components: (1) the strategy network models the control strategy; it receives the motor’s current state $s_t$ and outputs the specific duty cycle $a_t$; (2) the value network evaluates the performance of the strategy network; (3) the temperature coefficient balances the proportional relationship between the reward and the policy entropy in the loss function, which affects the degree of emphasis on exploration versus exploitation in the algorithm; (4) the experience buffer stores the interaction data between the reinforcement learning agent and the servo motor control environment, which is used as training data [28,29].
Specifically, the strategy network and value network interact with each other as follows: the strategy network receives the current state and outputs an action; the motor performs this action, and then, moves to the next state; the value network evaluates the performance and improves itself; the strategy network improves the control strategy; this process is repeated until convergence. At the beginning of training, the strategy network randomly outputs actions, and the value network randomly evaluates the performance. With continuous learning, the evaluations of the value network become more and more accurate, and the actions output by the strategy network become better and better.

4.1.1. Structure and Loss Function of Value Network

The value network of the proposed risk-sensitive intelligent control method includes two parts: the coding part and the integrated part. The former outputs the hidden-layer codes of the quantiles, and the latter calculates the value of each action. As shown in Figure 6, the inputs of the value network also include two parts. The first one is the motor running state $s_t = [o_t, a_t]$, the target speed or torque $g_t$, and the duty ratio output of the agent $a_t$; all of this information is fed into a fully connected network with an activation function. The second one is a real value $\tau_i$ sampled from a uniform distribution $\tau \sim U(0, 1)$, which represents the position of the $i$th quantile. After being cosine-encoded, this real value is transformed by a fully connected network into an output of the same length as the first part. Each time an estimate is needed, the positions of the quantiles are sampled uniformly and act as an input to the neural network, which outputs the action value corresponding to each quantile. After receiving the two parts of the input, the network combines them by element-wise multiplication and passes the result to the next fully connected layer and activation function to obtain the hidden-layer codes of the quantiles.
With this input method, the location of each quantile can be different, which improves the accuracy of fitting the probability distribution of the neural network. On the other hand, the network can use arbitrary precision quantiles to describe the probability distribution, and the number of quantiles can be flexibly adjusted according to the actual needs and the computing performance of the specific computing equipment. After that, the coding part will generate the corresponding hidden-layer coding information for each input quantile position, and the amount of coding information is consistent with the number of quantile positions [30,31].
After extracting all the hidden-layer coding information of the quantiles, the value network collects and concatenates it and then inputs it into the integrated part. As shown in Figure 7, there are two branches in the integrated part. In the first branch, all the hidden-layer coding information of the quantiles successively passes through a fully connected layer and a softmax layer. Then, through a cumulative summation operation, a set of monotonically increasing non-negative real values $\psi$ between 0 and 1 is obtained, which represents the proportional relationship of the distances between the quantiles. In the second branch, the hidden-layer coding information of the quantiles passes through a fully connected layer and outputs two real values. The first real value is exponentiated to obtain a proportional coefficient $\alpha$ ($\alpha \geq 0$), which represents the proportional relationship between the output of the first branch and the actual quantiles. The second, unprocessed real value is the offset coefficient $\beta$, which represents the offset of the whole set of quantile values on the number line. Using the two coefficients $\alpha$ and $\beta$, the estimate of the probability distribution of cumulative discount returns for the entire network is obtained by linearly transforming the whole set of quantiles from the first branch, which can be expressed as follows:
$Z_i(s, a, g) = \alpha(s, a, g) \times \psi_i + \beta(s, a, g), \quad i = 1, 2, \ldots, N$ (7)
where $Z_i(s, a, g)$ denotes the output of the $i$th quantile.
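A condensed PyTorch sketch of this two-part value network is given below. The 512-unit hidden layers and the 64-dimensional cosine embedding follow Section 5.1, but the exact module layout (in particular, pooling the quantile codes before producing the scale and offset) is an assumption for illustration, not the authors' released implementation.

import math
import torch
import torch.nn as nn

class QuantileValueNet(nn.Module):
    """Coding part + integrated part of the value network (Figures 6 and 7), sketched."""

    def __init__(self, state_dim, goal_dim, act_dim, hidden=512, n_cos=64):
        super().__init__()
        # Coding part: state/goal/action embedding and cosine embedding of the quantile position.
        self.state_fc = nn.Sequential(nn.Linear(state_dim + goal_dim + act_dim, hidden), nn.ReLU())
        self.tau_fc = nn.Sequential(nn.Linear(n_cos, hidden), nn.ReLU())
        self.merge_fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.register_buffer("k", torch.arange(1, n_cos + 1).float())
        # Integrated part, branch 1: proportions psi of the distances between quantiles.
        self.psi_fc = nn.Linear(hidden, 1)
        # Integrated part, branch 2: scale alpha (exponentiated) and offset beta.
        self.ab_fc = nn.Linear(hidden, 2)

    def forward(self, s, g, a, tau):
        # tau: (batch, N) quantile positions sampled uniformly from (0, 1).
        x = self.state_fc(torch.cat([s, g, a], dim=-1))              # (batch, hidden)
        cos = torch.cos(tau.unsqueeze(-1) * self.k * math.pi)        # (batch, N, n_cos)
        phi = self.tau_fc(cos)                                       # (batch, N, hidden)
        h = self.merge_fc(x.unsqueeze(1) * phi)                      # element-wise product, (batch, N, hidden)
        # Branch 1: softmax + cumulative sum yields monotonically increasing psi in (0, 1].
        psi = torch.cumsum(torch.softmax(self.psi_fc(h).squeeze(-1), dim=-1), dim=-1)
        # Branch 2: pool the quantile codes (assumed), then produce scale and offset.
        alpha_beta = self.ab_fc(h.mean(dim=1))
        alpha = torch.exp(alpha_beta[..., :1])                       # alpha >= 0
        beta = alpha_beta[..., 1:]                                   # unconstrained offset
        return alpha * psi + beta                                    # Z_i(s, a, g), Equation (7)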
When using the value network, to solve the overestimation problem of the duty ratio, the algorithm adopts a pair of twin networks with the same structure and takes the minimum of the two networks as the final output every time a quantile is calculated. When training the value network, the positions of $N$ quantiles are first obtained by uniform sampling in the range $[0, 1]$. Then, the operating state of the servo motor $s$, the reference trajectory of the control task $g$, the duty cycle output of the strategy network $a$, and the previously generated quantile positions $\tau_i$ are fed to the neural network. Through the above calculations of the value network, a set of monotonically increasing quantile outputs $Z_i$ is obtained, which together with the input quantile positions $\tau_i$ describes the probability distribution of the cumulative discount returns. In this way, both the probability distribution at the current time step $Z(s_t, g_t, a_t)$ and the probability distribution at the next time step $Z(s_{t+1}, g_t, a_{t+1})$ can be obtained. The target probability distribution is obtained by adding the reward signal of the current time step $r(s_t, g_t, a_t)$ to the discounted probability distribution of the next time step $Z(s_{t+1}, g_t, a_{t+1})$. By minimizing the Wasserstein distance between the probability distribution of the current time step and the target probability distribution, the approximate distribution output by the network gradually approaches the true probability distribution. Therefore, the objective function of the value network $J(\theta)$ can be expressed as Equation (8).
$J(\theta) = \mathbb{E}_{(s_t, g_t, a_t, s_{t+1}) \sim D,\; a_{t+1} \sim \pi} \left[ \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \left| \frac{\operatorname{sign}(u_{ij}) + 1}{2} - \hat{\tau}_j \right| \operatorname{smooth}_{L1}(u_{ij}) \right]$
$u_{ij} = Z_i(s_t, g_t, a_t) - \left[ r(s_t, g_t, a_t) + \gamma \left( Z_j(s_{t+1}, g_t, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1}, g_t) \right) \right]$
$\operatorname{smooth}_{L1}(u_{ij}) = \begin{cases} 0.5\, u_{ij}^2, & \text{if } |u_{ij}| < 1 \\ |u_{ij}| - 0.5, & \text{otherwise} \end{cases}$
$\hat{\tau}_j = \dfrac{\tau_{j-1} + \tau_j}{2}, \quad \tau_0 = 0$ (8)
where $D$ represents the experience buffer that stores the interaction data with the environment, $\gamma$ is the discount factor, $\alpha$ is the temperature coefficient, and $\pi(a_{t+1} \mid s_{t+1}, g_t)$ is the policy function described next.
Equation (8) is the objective used to update the value network; the training samples are drawn from the replay buffer $D$. The reward signal is calculated as the negative of the absolute difference between the target of the control trajectory and the actual control result.
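The double sum in Equation (8) is a quantile Huber (smooth L1) regression loss; a generic PyTorch sketch is shown below. The entropy term and the minimum over the twin value networks are omitted for brevity, and the index convention follows the standard quantile regression loss rather than the paper's exact notation.

import torch
import torch.nn.functional as F

def quantile_huber_loss(z_current, z_target, tau_hat):
    """
    z_current: (batch, N) quantiles Z_i(s_t, g_t, a_t)
    z_target:  (batch, N) target quantiles r + gamma * Z_j(s_{t+1}, g_t, a_{t+1}), detached
    tau_hat:   (batch, N) midpoints of the quantile positions at which z_current was evaluated
    """
    # Pairwise TD errors between every target quantile j and current quantile i.
    u = z_target.unsqueeze(1) - z_current.unsqueeze(2)               # (batch, N_i, N_j)
    huber = F.smooth_l1_loss(z_current.unsqueeze(2).expand_as(u),
                             z_target.unsqueeze(1).expand_as(u),
                             reduction="none")                        # 0.5*u^2 or |u| - 0.5
    # |tau_hat - 1{u < 0}| weights: over- and under-estimation are penalized asymmetrically.
    weight = torch.abs(tau_hat.unsqueeze(2) - (u.detach() < 0).float())
    # Averaged over both quantile dimensions and the batch (a common simplification).
    return (weight * huber).mean()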

4.1.2. Structure and Loss Function of Strategy Network

As shown in Figure 8, the inputs of the strategy network are the motor state $s_t = [o_t, a_t]$ and the target $g_t$, and its output is the normalized duty ratio, denoted as $\pi(s_t, g_t)$.
In order to obtain a control strategy sensitive to the overcurrent problem, the risk-sensitive motor control algorithm based on the value distribution combines the expected value of the cumulative discount return with the conditional value at risk. When training the strategy network, a confidence level $\varphi$ must be assigned to the algorithm in advance, so that the value network can calculate the conditional value at risk of the probability distribution of the cumulative discount return at the confidence level $1 - \varphi$. Since the neural network’s representation of the probability distribution differs from the usual parametric representation, the calculation of the conditional value at risk differs from Equation (2). The specific procedure is as follows: (1) obtain the locations of the $N$ quantiles $\tau_1, \ldots, \tau_N$ by uniformly sampling in the range $[0, \varphi]$; (2) obtain the partial probability distribution of the cumulative discount return in this range through the value network $Z(s_t, g_t, a_t, \tau_n)$ with the current state $s_t$, reference target $g_t$, action $a_t$, and each $\tau_n$, respectively; (3) calculate its expected value to obtain the conditional value at risk at confidence level $\varphi$, which can be denoted as follows:
$CVaR = \sum_{n=1}^{N} \hat{\tau}_n \, Z(s_t, g_t, a_t, \tau_n)$ (9)
where $\hat{\tau}_n = \frac{\tau_{n-1} + \tau_n}{2}$ denotes the mean of adjacent quantile positions.
In addition to calculating the conditional value at risk, the algorithm also needs to calculate the expectation of the whole probability distribution as a second strategy-evaluation index. Similar to the training of the value network, the positions of the quantiles are first uniformly sampled in the range $[0, 1]$; then the probability distribution of the cumulative discount returns is obtained and its expected value is calculated. Therefore, the value network’s actual evaluation of the duty cycle output by the agent can be expressed as Equation (10), where $\rho \in [0, 1]$ weights the risk term.
$I_a = (1 - \rho)\, \mathbb{E}\left[ Z(s_t, g_t, a_t) \right] + \rho \, CVaR\left[ Z(s_t, g_t, a_t, \varphi) \right]$ (10)
Therefore, the objective function of the strategy network can be expressed as Equation (11).
$J(\phi) = \mathbb{E}_{s_t \sim D} \left[ \log \pi(a_t \mid s_t, g_t) - I_a \right]$ (11)
Equation (11) is the objective used to update the policy network. It adapts the objective of the proposed control algorithm to the structure and input–output of the value network and policy network described above, while the training method remains basically the same.
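Under the same assumptions as the sketches above, the risk-sensitive evaluation of Equations (9)–(11) could be computed as follows. The confidence level phi and the risk weight rho are user-chosen hyperparameters not specified in the text, and the tail expectation is approximated here by a simple average of quantiles sampled in [0, φ] rather than the weighted sum written in Equation (9).

import torch

def risk_sensitive_value(value_net, s, g, a, n_quantiles=32, phi=0.25, rho=0.5):
    """I_a = (1 - rho) * E[Z] + rho * CVaR_phi[Z], Equation (10) (illustrative)."""
    batch = s.shape[0]
    # Expectation of the full return distribution: quantile positions over (0, 1).
    tau_full = torch.rand(batch, n_quantiles, device=s.device)
    expectation = value_net(s, g, a, tau_full).mean(dim=-1, keepdim=True)
    # CVaR: restrict the quantile positions to the lower tail [0, phi], Equation (9).
    tau_tail = torch.rand(batch, n_quantiles, device=s.device) * phi
    cvar = value_net(s, g, a, tau_tail).mean(dim=-1, keepdim=True)
    return (1.0 - rho) * expectation + rho * cvar

def policy_loss(policy, value_net, s, g):
    # Equation (11): maximize I_a while keeping the policy stochastic.
    # In practice the value network's parameters would be held fixed during this step.
    a, log_prob = policy.sample(s, g)          # reparameterized action and its log-probability
    return (log_prob - risk_sensitive_value(value_net, s, g, a)).mean()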
In order to train the policy network with gradient descent, the random policy output of the network must be differentiable. Therefore, the algorithm uses the reparameterization technique to calculate the output of the strategy network. The strategy network itself outputs two real values, $\mu$ and $\sigma$, corresponding to the mean and standard deviation of a normal distribution, respectively. The value of the output action $a_t$ can then be calculated via Equation (12).
$a_t = \varepsilon_t \times \sigma(s_t, g_t) + \mu(s_t, g_t)$ (12)
where $\varepsilon_t$ is noise sampled from a standard normal distribution, as shown in Figure 8.
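A minimal sketch of the reparameterized policy of Equation (12) follows; squashing the Gaussian sample into [0, 1] with a sigmoid is an assumption made so that the output is a valid duty cycle, since the paper only states that the duty ratio is normalized.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, in_dim, act_dim, hidden=512):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def sample(self, s, g):
        h = self.body(torch.cat([s, g], dim=-1))
        mu, std = self.mu(h), self.log_std(h).clamp(-20, 2).exp()
        eps = torch.randn_like(mu)                 # epsilon_t ~ N(0, I)
        raw = eps * std + mu                       # Equation (12), differentiable in mu and std
        duty = torch.sigmoid(raw)                  # squash into [0, 1] (assumption)
        # Log-probability with the change-of-variables correction for the sigmoid squash.
        log_prob = torch.distributions.Normal(mu, std).log_prob(raw)
        log_prob -= torch.log(duty * (1.0 - duty) + 1e-6)
        return duty, log_prob.sum(dim=-1, keepdim=True)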

4.1.3. Temperature Coefficient and Its Loss Function

The temperature coefficient α balances the ratio between real-time rewards and policy entropy at each time step, and therefore, can affect the randomness of the optimal strategy. α is a learnable parameter. At the beginning, an initial value is given, and then, the value of α can be dynamically adjusted via training of the algorithm. The objective function for optimizing the temperature coefficient α is shown in Equation (13).
$J(\alpha) = \mathbb{E}_{(s_t, g_t) \sim D,\, a_t \sim \pi} \left[ -\alpha \log \pi(a_t \mid h_t, s_t, g_t) - \alpha H_0 \right]$ (13)
where H 0 represents the target value of policy entropy. In this paper, H 0 is set to the negative of the duty cycle’s dimension.
Similar to the value network and the strategy network, the temperature coefficient is updated iteratively. When the entropy of the current strategy is lower than $H_0$, the value of $\alpha$ is increased; otherwise, it is decreased. By adjusting the importance of policy entropy in the objective function of the strategy network, the direction of policy improvement can be controlled.
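A sketch of the temperature update implied by Equation (13) is given below; optimizing log α rather than α itself is a common soft actor–critic implementation detail and is an assumption here, as is the example action dimension.

import torch

act_dim = 2                                            # e.g., two duty cycles for the ExtEx motor (assumed)
log_alpha = torch.zeros(1, requires_grad=True)         # alpha = exp(log_alpha) starts at 1
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-5)
target_entropy = -float(act_dim)                       # H_0: negative of the duty cycle's dimension

def update_temperature(log_prob):
    # J(alpha) = E[-alpha * log pi(a|s,g) - alpha * H_0], Equation (13).
    alpha_loss = -(log_alpha.exp() * (log_prob.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()                      # alpha grows when entropy falls below H_0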
The pseudocode of the proposed risk-sensitive intelligent control method is summarized in Algorithm 1.
Algorithm 1 The proposed risk-sensitive intelligent control method.
Require: Parameters of value networks θ1 and θ2, parameter of policy network ϕ, and experience pool D.
1:  for each iteration do
2:     for each environment step do
3:        obtain the duty cycle a_t according to the state s_t, the current target g_t, and the sampling strategy π_ϕ;
4:        execute duty cycle a_t, and then obtain the next state s_{t+1}, the next target g_{t+1}, and the risk evaluation I_t;
5:        store sample (s_t, a_t, I_t, s_{t+1}, g_t) in experience pool D;
6:        for each gradient step do
7:           uniformly sample a batch from D for training;
8:           update value network parameters θ1 and θ2;
9:           update policy network parameter ϕ;
10:          update temperature coefficient α;
11:       end for
12:    end for
13: end for

5. Experiment and Analysis

5.1. Experimental Environment and Setting

The Python-based Gym Electric Motor (v.0.1.1) is an open-source simulation platform that offers a straightforward algorithmic interface for reinforcement learning agents. It provides a favorable condition for researchers to develop and test reinforcement learning control algorithms, as the interaction between servo motor and agent can be simulated conveniently [32].
As shown in Table 1, six common tasks provided by Gym Electric Motor are selected to verify the effectiveness of the proposed risk-sensitive intelligent control method.
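For reference, the generic pattern for creating and stepping a Gym Electric Motor environment looks like the sketch below; the environment ID and the exact reset/step signatures differ between GEM releases (the paper uses v0.1.1), so they should be checked against the installed package.

import gym_electric_motor as gem

# The environment ID is illustrative and version-dependent; consult the installed
# GEM documentation for the exact IDs (e.g., a continuous-control series DC drive
# would correspond to the SC-SeriesDc task in Table 1).
env = gem.make("DcSeriesCont-v1")

done = True
for _ in range(10000):
    if done:
        state, references = env.reset()
    action = env.action_space.sample()               # a trained agent would output the duty cycle here
    (state, references), reward, done, _ = env.step(action)
    # done is set, for example, when a limit such as the current rating is violated.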
Two different deep reinforcement learning algorithms are selected for comparison: the twin delayed deep deterministic policy gradient (TD3) and the deep deterministic policy gradient (DDPG). Meanwhile, two different feature extraction methods are also taken into account: the difference value input (DVI) and the state input without special processing (SI). Therefore, four comparison methods are obtained by combining the reinforcement learning algorithms and feature extraction methods, termed DDPG-DVI, DDPG-SI, TD3-DVI, and TD3-SI, respectively. In addition to these methods, our previous method (RESAC) [33] is also used for performance comparison.
The strategy network of the proposed risk-sensitive intelligent control method contains two hidden layers. For the value network, there are two hidden layers in the coding part and one in the integrated part. Each hidden layer has 512 neurons and uses the ReLU activation function. The strategy network’s input length is set to 20, and the output length corresponds to the number of duty cycle dimensions of each environment. The number of quantiles in the probability distribution is set to 32. When the quantile positions are input into the neural network, they are embedded with 64 input neurons, and the cos function is used as the auxiliary embedding function. The positions of the quantiles are randomly generated from a uniform distribution. The final output layer has 32 neurons, where each output represents one of the quantiles of the distribution.
The temperature coefficient is initialized to 1. The total number of training time steps for each control task is 1 million. The batch size is set to 256, and samples are drawn uniformly at random from the experience buffer, whose capacity is 200,000. The Adam optimizer [34] is used for gradient descent on the network parameters, and its learning rate is set to 0.00003. The discount factor is 0.99. The update ratio for the soft update of the target network parameters is set to 0.005.

5.2. Comparison Experiment for Safety Performance

In order to evaluate the safety performance, the number of times that each control algorithm triggers the overcurrent protection in each task is counted to verify its sensitivity to the current overload problem. The total number of training time steps for each control task is 1 million. During the training process, the cumulative number of times the servo motor control system triggered the overcurrent protection was recorded. In each control task, 10 random seeds were used for 10 repetitions of the training process.
Figure 9 shows the safety performances of the different compared algorithms on each task. The horizontal coordinate represents the number of time steps taken in training, in millions of steps. The ordinate represents the number of times that the motor model triggered the overcurrent protection mechanism from the start of training to the current time step. The curve represents the mean value of the experimental results, and the shaded area represents the standard deviation of the experimental results.
From Figure 9 we can see that the proposed risk-sensitive control algorithm achieves the best result in avoiding the current overload state. As shown in Figure 9a–c, in the early stage of training on the TC-ExtExDc task, SC-ExtExDc task, and TC-PMSM task, the cumulative trigger times of each algorithm all increased rapidly. But after about 50,000 steps, 400,000 steps, and 50,000 steps, respectively, our algorithm’s cumulative number starts to grow more slowly than the others until the end of the training process. As shown in Figure 9d, in the early stage of training on the SC-PMSM task the cumulative times of all algorithms increase uniformly, and their rate of increase is basically the same. However, after a long period of time, the cumulative numbers of other algorithms begin to be higher than that of the proposed algorithm, which finally achieves the best performance. As shown in Figure 9e, all the compared algorithms do not frequently trigger the overcurrent protection mechanism during the training process on the TC-SeriesDc task. It may be that the probability of triggering the overcurrent protection mechanism is lower on this task than the others. Especially on the SC-SeriesDc task, none of the compared algorithms triggers the overcurrent protection mechanism.
In conclusion, except for the SC-SeriesDc task, the proposed risk-sensitive intelligent control algorithm has fewer overcurrent trigger times on the other five control tasks. In other words, it is safer.

5.3. Comparison Experiment for Control Performance

To fairly compare the control performance, all comparison algorithms have to use the same control target in each control task. The total length of each test was 10,000 time steps, and each control algorithm repeated the process under 10 identical random seeds. When testing, we calculated the mean absolute error (MAE), and used it to measure the control accuracy.
In this work, the random seed is set to 0, and the control performances of each algorithm on the six tasks are shown in Figure 10, where the vertical coordinate represents the actual value of speed or torque, and the horizontal coordinate represents time steps. The blue lines represent the reference control targets, while other lines represent the control results of different algorithms, respectively. The smaller the gap, the stronger the ability of the algorithm to control the motor operating with the set target; on the other hand, the larger the gap, the weaker this ability. By measuring the gap between the actual running state and the target, we can roughly see the difference in the control effect among various algorithms. Moreover, the mean and standard deviation of the MAE in tests based on 10 different random seeds are shown in Table 2.
As shown in Figure 10a,c,e, the performance of all the control methods is similar in the torque control mode. This may be because the response speed of the current loop is fast, and all algorithms can achieve torque control well, so there is no significant performance difference between different algorithms. From Figure 10b,d,f, it can be clearly seen that the actual speed controlled by the proposed algorithm is close to the target speed, which shows that our algorithm has certain advantages on the three speed control tasks. Since the response of speed control is slow, the performance difference between algorithms may be caused by the changes in the control trajectory. The proposed risk-sensitive algorithm uses target trajectory coding to reduce the servo error to the target trajectory. Therefore, it can quickly adjust the running speed of the motor and respond to the change.
Table 2 shows that the mean absolute error of the proposed algorithm is smaller than others in both torque control mode and speed control mode. However, it has a larger standard deviation in some cases. This indicates that the proposed algorithm is less stable than others in individual tasks.
In conclusion, the proposed risk-sensitive intelligent control algorithm is not only safe, but its control performance is also better than the other comparison algorithms in general.

5.4. Comparison Experiment for Training Speed

To compare the training speed, we recorded the convergence of the mean absolute error over $10^6$ time steps during the training process in Figure 11. To obtain a stable performance during the test, 10 fixed random seeds were used to repeat the experiment, and their mean value was calculated. The parameters provided by Gym Electric Motor were used for random initialization.
To better reflect the trend in the algorithms’ control capability, a sliding window of size 1000 is used and the MAE within it is calculated when plotting the experimental results. All the learning processes of the compared algorithms are shown in Figure 11, where the horizontal coordinate represents the number of time steps in the training process, in millions. The vertical coordinate represents the MAE between the actual running state of the motor and the target. The curve represents the mean value, and the shaded part represents the standard deviation of the experimental results.
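The sliding-window MAE used for these curves can be computed as in the short sketch below; the window size of 1000 follows the text, while the target and actual trajectories are placeholders.

import numpy as np

def sliding_mae(targets, actuals, window=1000):
    """Mean absolute servo error over a sliding window of recent time steps."""
    errors = np.abs(np.asarray(targets) - np.asarray(actuals))
    # Moving average via a cumulative sum; entry k covers steps [k, k + window).
    csum = np.cumsum(np.insert(errors, 0, 0.0))
    return (csum[window:] - csum[:-window]) / window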
Figure 11 shows that the convergence rate of the proposed method is slower than that of RESAC but faster than the others on all tasks. On the one hand, this is because there is a certain conflict between risk avoidance and control performance improvement. On the other hand, it may be because the value network structure of the proposed algorithm is more complex, which slows down the convergence of the network.

6. Conclusions

Existing reinforcement-learning-based intelligent control algorithms only consider the overall performance and ignore the risk in special cases, which may frequently trigger the overcurrent problem and damage the motor. To solve these problems, we proposed a risk-sensitive intelligent control algorithm for servo motors based on value distribution. In order to calculate cumulative discount returns more accurately, this algorithm uses the quantile function to model their entire probability distribution, which provides a more accurate estimate of the value of each action. In addition, the conditional value at risk is used to measure the loss caused by overcurrent when learning the control strategy, so that the reinforcement learning agent can learn a control strategy that is more sensitive to environmental restrictions. The experimental results on six different control tasks show that the proposed risk-sensitive control algorithm triggers fewer overcurrents, i.e., it is safer. Due to the increased risk aversion, the training time and inference time of the algorithm are slightly increased. Therefore, a main direction of future research is to improve the time efficiency of the risk-sensitive intelligent control algorithm.

Author Contributions

D.G. wrote the manuscript; T.X. revised the manuscript. S.W. performed the experiment and simulation; H.L. provided the technical guidance; J.Q. provided financial support; Y.Y. and H.Z. processed the experimental data and participated in the revision of the manuscript; H.C. assisted with the experiments and analysis; X.L. and S.C. analyzed the format. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Nantong Natural Science Foundation project (grant numbers JC2023023 and JC2023075), the Natural Science Fund for Colleges and Universities in Jiangsu Province (grant numbers 21KJD210004, 22KJB520032, 22KJD520007, 22KJD520008, 22KJD520009, 23KJD520010 and 23KJD520011), the Nantong Key Laboratory of Virtual Reality and Cloud Computing (grant numbers CP2021001), the Electronic Information Master’s project of Nantong Institute of Technology (grant numbers 879002), and the Software Engineering Key Discipline Construction project of Nantong Institute of Technology (grant numbers 879005).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, J. MATLAB Simulation of Advanced PID Control, 2nd ed.; Electronic Industry Press: Beijing, China, 2004. [Google Scholar]
  2. Baojun, G.; Yanping, L.; Dajun, T. Electromechanics; Higher Education Press: Beijing, China, 2020. [Google Scholar]
  3. Tamizi, M.G.; Yaghoubi, M.; Najjaran, H. A review of recent trend in motion planning of industrial robots. Int. J. Intell. Robot. Appl. 2023, 7, 253–274. [Google Scholar] [CrossRef]
  4. Cheng, M.; Zhou, J.; Qian, W.; Wang, B.; Zhao, C.; Han, P. Advanced Electrical Motors and Control Strategies for High-quality Servo Systems-A Comprehensive Review. Chin. J. Electr. Eng. 2024, 10, 63–85. [Google Scholar] [CrossRef]
  5. Peng, S.; Jiang, Y.; Lan, Z.; Li, F. Sensorless control of new exponential adaptive sliding mode observer for permanent magnet synchronous motor. Electr. Mach. Control. 2022, 26, 104–114. [Google Scholar]
  6. Wen, C.; Wang, Z.; Zhang, Z. Robust Model Predictive Control for Position Servo Motor System. J. Electr. Eng. 2021, 16, 59–67. [Google Scholar]
  7. Hoel, C.J.; Wolff, K.; Laine, L. Ensemble quantile networks: Uncertainty-aware reinforcement learning with applications in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2023, 24, 6030–6041. [Google Scholar] [CrossRef]
  8. Tian, M.; Wang, K.; Lv, H.; Shi, W. Reinforcement learning control method of torque stability of three-phase permanent magnet synchronous motor. J. Physics Conf. Ser. 2022, 2183, 12–24. [Google Scholar] [CrossRef]
  9. Cai, M.; Aasi, E.; Belta, C.; Vasile, C.I. Overcoming exploration: Deep reinforcement learning for continuous control in cluttered environments from temporal logic specifications. IEEE Robot. Autom. Lett. 2023, 8, 2158–2165. [Google Scholar] [CrossRef]
  10. Song, Z.; Yang, J.; Mei, X.; Tao, T.; Xu, M. Deep reinforcement learning for permanent magnet synchronous motor speed control systems. Neural Comput. Appl. 2021, 33, 5409–5418. [Google Scholar] [CrossRef]
  11. Din, A.F.; Mir, I.; Gul, F.; Al Nasar, M.R.; Abualigah, L. Reinforced Learning-Based Robust Control Design for Unmanned Aerial Vehicle. Arab. J. Sci. Eng. 2023, 48, 1221–1236. [Google Scholar] [CrossRef]
  12. Coskun, M.Y.; İtik, M. Intelligent PID control of an industrial electro-hydraulic system. ISA Trans. 2023, 139, 484–498. [Google Scholar] [CrossRef]
  13. Chen, P.; He, Z.; Chen, C.; Xu, J. Control Strategy of Speed Servo Systems Based on Deep Reinforcement Learning. Algorithms 2018, 11, 65. [Google Scholar] [CrossRef]
  14. Zhang, M.; Jie, D.; Xi, X. Control strategy of electro-mechanical actuator based on deep reinforcement learning-PI control. Appl. Sci. Technol. 2022, 49, 18–22. [Google Scholar]
  15. Hamed, R.N.; Abolfazl, Z.; Holger, V. Actor–critic learning based PID control for robotic manipulators. Appl. Soft Comput. 2023, 151, 111153. [Google Scholar]
  16. Bellemare, M.G.; Dabney, W.; Munos, R. A Distributional Perspective on Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Ithaca, NY, USA, 17 July 2017; pp. 1–19. [Google Scholar]
  17. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  18. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; AAAI: Menlo Park, CA, USA, 2016; pp. 2094–2100. [Google Scholar]
  19. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 2000, 12, 1057–1063. [Google Scholar]
  20. Ståhlberg, S.; Bonet, B.; Geffner, H. Learning general policies with policy gradient methods. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, Rhodes, Greece, 2–8 September 2023; Volume 19, pp. 647–657. [Google Scholar]
  21. Dabney, W.; Rowland, M.; Bellemare, M.G. Distributional Reinforcement Learning with Quantile Regression. arXiv 2017, arXiv:1710.10044. [Google Scholar]
  22. Voigtlaender, F. The universal approximation theorem for complex-valued neural networks. Appl. Comput. Harmon. Anal. 2023, 64, 33–61. [Google Scholar] [CrossRef]
  23. Schaul, T.; Horgan, D.; Gregor, K.; Silver, D. Universal Value Function Approximators. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; WCP: Lille, France, 2015; Volume 37. [Google Scholar]
  24. Liu, J.; Gao, F.; Luo, X. A Review of Deep Reinforcement Learning Based on Value Function and Strategy Gradient. Chin. J. Comput. 2019, 42, 1406–1438. [Google Scholar]
  25. Chow, Y.; Ghavamzadeh, M. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems; Morgan Kaufmann: San Mateo, CA, USA, 2014; pp. 3509–3517. [Google Scholar]
  26. Wang, C.S.; Guo, C.W.; Tsay, D.M.; Perng, J.W. PMSM Speed Control Based on Particle Swarm Optimization and Deep Deterministic Policy Gradient under Load Disturbance. Machines 2021, 9, 343. [Google Scholar] [CrossRef]
  27. Schenke, M.; Kirchgässner, W.; Wallscheid, O. Controller Design for Electrical Drives by Deep Reinforcement Learning: A Proof of Concept. IEEE Trans. Ind. Inform. 2020, 16, 4650–4658. [Google Scholar] [CrossRef]
  28. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the International Conference on Machine Learning, Ithaca, NY, USA, 3 July 2018; pp. 1–15. [Google Scholar]
  29. Kumar, H.; Koppel, A.; Ribeiro, A. On the sample complexity of actor-critic method for reinforcement learning with function approximation. Mach. Learn. 2023, 112, 2433–2467. [Google Scholar] [CrossRef]
  30. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the International Conference on Machine Learning, Ithaca, NY, USA, 3 July 2018. [Google Scholar]
  31. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  32. Balakrishna, P.; Book, G.; Kirchgässner, W.; Schenke, M.; Traue, A.; Wallscheid, O. Gym-electric-motor (GEM): A python toolbox for the simulation of electric drive systems. J. Open Source Softw. 2021, 6, 2498. [Google Scholar] [CrossRef]
  33. Gao, D.; Wang, S.; Yang, Y.; Zhang, H.; Chen, H.; Mei, X.; Chen, S.; Qiu, J. An Intelligent Control Method for Servo Motor Based on Reinforcement Learning. Algorithms 2024, 17, 14. [Google Scholar] [CrossRef]
  34. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 13–19. [Google Scholar]
Figure 1. Structure diagram of servo motor control system.
Figure 2. The difference between probability distribution function and quantile function. (a) Probability distribution function; (b) quantile function.
Figure 3. The diagram of value at risk.
Figure 4. The diagram of conditional value at risk.
Figure 5. The structure of the control method.
Figure 6. The structure of the coding part of the value network.
Figure 7. The structure of the integrated part of the value network.
Figure 8. The structure of the strategy network.
Figure 9. The trigger times of overcurrent protection on six tasks: (a) TC-ExtExDc; (b) SC-ExtExDc; (c) TC-PMSM; (d) SC-PMSM; (e) TC-SeriesDc; and (f) SC-SeriesDc.
Figure 10. The control performance on six tasks: (a) TC-ExtExDc; (b) SC-ExtExDc; (c) TC-PMSM; (d) SC-PMSM; (e) TC-SeriesDc; and (f) SC-SeriesDc.
Figure 11. The convergence procedures of each method on different tasks: (a) TC-ExtExDc; (b) SC-ExtExDc; (c) TC-PMSM; (d) SC-PMSM; (e) TC-SeriesDc; and (f) SC-SeriesDc.
Table 1. The six common tasks used for the performance evaluation of the proposed algorithm.
Name | State’s Dimension | Mode | Duty Cycle’s Dimension | Motor
TC-ExtExDc | 7 | Torque control | 2 | ExtEx DC motor
SC-ExtExDc | 7 | Speed control | 2 | ExtEx DC motor
TC-PMSM | 14 | Torque control | 3 | Three-phase PMSM motor
SC-PMSM | 14 | Speed control | 3 | Three-phase PMSM motor
TC-SeriesDc | 5 | Torque control | 1 | Series DC motor
SC-SeriesDc | 5 | Speed control | 1 | Series DC motor
Table 2. Mean and standard deviation of control error.
Control Task | TD3-SI | TD3-DVI | DDPG-SI | DDPG-DVI | RESAC | Ours
TC-ExtExDc | 0.029 ± 0.002 | 0.021 ± 0.006 | 0.022 ± 0.004 | 0.035 ± 0.006 | 0.016 ± 0.002 | 0.016 ± 0.003
SC-ExtExDc | 13.509 ± 1.351 | 16.785 ± 1.262 | 14.827 ± 1.316 | 18.949 ± 1.401 | 12.60 ± 1.229 | 11.995 ± 1.137
TC-PMSM | 0.320 ± 0.014 | 0.474 ± 0.016 | 0.312 ± 0.012 | 0.434 ± 0.019 | 0.308 ± 0.014 | 0.302 ± 0.011
SC-PMSM | 29.591 ± 4.053 | 38.271 ± 4.231 | 32.670 ± 4.122 | 43.365 ± 5.032 | 25.590 ± 3.653 | 23.946 ± 3.059
TC-SeriesDc | 0.035 ± 0.004 | 0.048 ± 0.002 | 0.045 ± 0.003 | 0.029 ± 0.004 | 0.013 ± 0.002 | 0.013 ± 0.003
SC-SeriesDc | 12.478 ± 1.133 | 19.503 ± 1.231 | 18.549 ± 1.393 | 21.798 ± 1.538 | 11.251 ± 1.513 | 8.466 ± 1.020