1. Introduction
Unmanned aerial vehicles (UAVs) have been widely adopted in industrial and civil applications, such as aerial photography and plant protection, as well as in military operations, owing to their simple structure and high mobility. However, a UAV is a highly intricate system characterized by strong coupling and nonlinear behaviors. Additionally, environmental factors such as wind, air pressure, and temperature changes introduce further interference. Over the past decades, numerous methods have been proposed to address these challenges.
Nonlinearity and unknown external disturbances are inherent in UAV systems. To enhance the robustness of proportional–integral–derivative (PID) control [1], several robust methods have been proposed, such as intelligent control [2,3,4], fuzzy control [5,6,7], model predictive control [8], and sliding mode control (SMC) [9,10,11,12,13]. Among them, SMC is a model-based method with strong robustness to external disturbances and uncertainties; it comprises an equivalent control that realizes target tracking and a switching control that suppresses the nonlinear parts. The switching control drives the system state to slide along the sliding surface according to the state error, and the nonlinear part behaves approximately linearly near the sliding surface, which enables its suppression and control [14].
Thus, this algorithm is effective for nonlinear systems and fits UAVs well. However, unmodeled higher-order dynamics and input delays prevent the switching control from acting instantaneously, making it difficult to keep the state exactly on the sliding surface; the state instead crosses repeatedly between the two sides of the surface, a phenomenon called chattering [15]. The chattering frequency depends on the controller, and the resulting high-frequency switching output can harm the system. Over the years, numerous solutions have been proposed to deal with this problem.
Matouk et al. proposed a super-twisting algorithm based on second-order SMC to design the UAV’s position and attitude controllers [16]. Jayakrishnan et al. [17] used a cascaded inner–outer structure and a super-twisting SMC to design a quadrotor controller and demonstrated its superiority by comparing it with a linear quadratic regulator PD method. Adaptive methods are often used to deal with chattering. Lei and Li [18] used the sigmoid function to design an adaptive gain-tuning law that effectively relaxes the requirement for prior information about system uncertainty. Yang and Yan [19] proposed a fuzzy system based on Gaussian membership functions to adaptively tune the switching gain in the attitude controller. Additionally, disturbance observers are widely used in controller design [20]. To deal with external disturbances, Zhang et al. proposed a chattering-free discrete SMC based on disturbance observers [21]. Xu et al. combined an extended state observer with fast terminal SMC, proposing a composite method that improves tracking performance while reducing chattering [22]. Subsequently, the sign function was replaced with a smoothing function to resolve chattering [23]. Le and Kang proposed a terminal SMC (TSMC) to guarantee that the tracking error converges to zero in finite time [24]. However, this method suffers from a low convergence rate and singularity. Consequently, fast TSMC [25], nonsingular TSMC [26], and nonsingular fast TSMC [27] were proposed to overcome these problems.
Nonetheless, these approaches commonly encounter a challenge: identifying the optimal mechanism for a specific system. For instance, in various methods, challenges arise in designing adaptive rules, determining the symbolic function strategy, and choosing optimal parameters for each mechanism. Consequently, intelligent learning methods were considered to explore the optimal mechanisms applicable to the system. By employing intelligent learning methods, these studies aim to overcome the aforementioned challenges and identify the most suitable agents for the determined system.
Reinforcement learning (RL) is a mechanism that learns to map states to actions so as to maximize rewards. It has been widely used in many fields [28,29,30]. As the potential of learning algorithms has been explored, some learning-based methods have been proposed to reduce chattering in SMC. Farjadian et al. [31] introduced an adaptive gradient saturation function to replace the discontinuous switching function and used RL to adjust its slope, trading off control accuracy against a reduction in chattering. Their study is based on the saturation function and uses RL to realize adaptive control; however, the selected saturation function limits the final effect. Another study [32] proposed a method to solve the optimal control problem under constraints by applying hyperbolic tangent and symmetric radial basis functions to design the saturation function; integral Q-learning was then used to approximate it and reduce chattering. However, the Q-learning algorithm relies on a Q-table, which limits the state dimension and makes the algorithm unsuitable for complex systems. Existing intelligent chattering reduction methods are always accompanied by custom functions, so the network is limited by the selected function, which restricts the advantage of reinforcement learning.
Based on the above discussion, to avoid the state-dimension limitation [32] and auxiliary-function limitation [31] of existing works, this study combines a policy-based RL method with SMC to solve the chattering problem. Specifically, a reference model-based SMC was adopted as the base controller. For the nonlinear part, the reward function is designed based on the state tracking error, and the network output that produces the maximum reward participates directly in the controller as the switching control. Notably, no additional functions are employed to restrict the network output. The contributions of this study can be summarized as follows:
A method using RL to reduce chattering in SMC was proposed. The reference model-based SMC was designed and implemented based on an easily obtained fitted model.
The policy-based deep deterministic policy gradient (DDPG) algorithm was employed to explore the optimal switching control and produce continuous nonlinear outputs. In contrast to [31], no auxiliary function was utilized; the actions generated by the network were regarded as the switching outputs and contributed to the final control output.
Two classical methods were selected to improve the same basic SMC and were compared with the proposed method. Experimental results revealed that the proposed method resolves chattering well and better tolerates system delay and disturbance than the two classical methods.
The remainder of this paper is organized as follows: Section 2 analyzes the dynamics of the UAV system. Section 3 describes the proposed RL-based controller design process in detail. Section 4 demonstrates the validation of the proposed algorithm. Finally, the conclusions are summarized in Section 5.
3. Controller Design
3.1. System Overview
In this study, a reference model sliding mode controller was adopted as the basic controller; its outstanding performance was verified in our previous work [34]. The proposed method consists of a reference model, a Kalman filter, a neural network, and a sliding mode controller; its structure is shown in Figure 1.
The reference model can expand the set of target states beyond the given original target, such as targets for the attitude rate and acceleration. The attitude model used herein is based on the fitted system; thus, the attitude acceleration is considered in the controller. However, as directly measuring the attitude acceleration is difficult, a steady Kalman filter was adopted to estimate this difficult-to-measure state. Finally, the reference signals and the estimated feedback states were used to calculate the equivalent control, which, combined with the output of the neural network, allows the target to be tracked well without chattering.
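As a rough illustration of the steady (fixed-gain) Kalman filter's role, the sketch below estimates an unmeasured attitude acceleration from angle and rate measurements. The model matrices and the constant gain `L` are illustrative placeholders, not the paper's fitted values or its actual steady Kalman gain.

```python
import numpy as np

# Illustrative discrete-time attitude chain: x = [angle, rate, accel].
dt = 0.01
A = np.array([[1.0, dt, 0.0],
              [0.0, 1.0, dt],
              [0.0, 0.0, 0.9]])   # acceleration decays toward the input
B = np.array([[0.0], [0.0], [0.1]])
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # only angle and rate are measured

# A precomputed constant gain stands in for the steady Kalman gain.
L = np.array([[0.5, 0.0],
              [0.1, 0.5],
              [0.0, 0.3]])

def observer_step(x_hat, u, y):
    """One fixed-gain predict/correct step."""
    x_pred = A @ x_hat + B @ u
    return x_pred + L @ (y - C @ x_pred)

# Run against a simulated "true" system to check convergence of the
# acceleration estimate, whose initial value is unknown to the observer.
x_true = np.array([[0.0], [0.0], [1.0]])
x_hat = np.zeros((3, 1))
for _ in range(500):
    u = np.array([[0.0]])
    x_true = A @ x_true + B @ u
    y = C @ x_true
    x_hat = observer_step(x_hat, u, y)

err = abs(x_true[2, 0] - x_hat[2, 0])  # acceleration estimate error
```

With a stable observer matrix (A − LC), the estimate of the unmeasured third state converges even though only the first two states are measured.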
The significant advantage of this structure is that it separates the tracking and regulation stages. The time domain evaluation indicators, such as the rise time and settling time, and the general indicators, such as stability and tracking performance, can be designed individually by the reference model and the controller.
3.2. Reference Model
The reference model is closely related to the UAV model. According to the attitude model expressed in (
3), the reference model can be designed as follows:
is the original attitude target,
is the reference of each state, and its structure is similar to
X, and the output matrix
. Similar to (
4), the structure of
in the reference model can be expressed as follows:
where
,
, and
are negative constants that are tuned to realize the desired target trajectory. It is assumed that the designed reference model converges to the target within a finite time and that the state of the reference model remains constant after convergence. Moreover, the output of (
5) equals the input. Then,
Assume that
and
satisfy the conditions expressed in (
8) and (
9),
When (
8) is combined with (
7),
can be expressed as follows:
Then, the input matrix
in the reference model can be calculated by
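To make the reference-model idea concrete, here is a minimal sketch of a third-order linear reference model that generates consistent attitude, rate, and acceleration references converging to a step target. The gains `k1`–`k3` are hypothetical tuning constants standing in for the negative constants mentioned above, not the paper's values.

```python
import numpy as np

# Hypothetical reference-model gains (illustrative, not the paper's).
k1, k2, k3 = -8.0, -6.0, -4.0        # negative constants shaping the trajectory
Ar = np.array([[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [k1,  k2,  k3]])
Br = np.array([[0.0], [0.0], [-k1]])  # chosen so the steady-state attitude equals r

dt, r = 0.001, 0.3                    # step attitude target of 0.3 rad
x_r = np.zeros((3, 1))                # [attitude, rate, acceleration] references
for _ in range(20000):                # 20 s of simulated time, Euler integration
    x_r = x_r + dt * (Ar @ x_r + Br * r)
```

At steady state the rate and acceleration references are zero and the attitude reference equals the target, matching the assumption that the reference model converges in finite time and then remains constant.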
3.3. RL-Based Controller Design
The output of the neural network is directly utilized as the nonlinear control part contributing to the total control output. Hence, the base controller is introduced first.
If the tracking error of the feedback and reference states is denoted as
, then the switching function can be designed.
The selection parameter
is a three-dimensional column vector corresponding to each state error. By combining the time derivative of the switching function,
, with Equations (
3) and (
5), and subject to the conditions (
8) and (
9), the resulting expression is
When the condition of the sliding mode is satisfied, the system states switch near the sliding surface; once the sliding surface is reached, the states remain on it and satisfy
. The equivalent control, denoted by
, can be calculated as follows:
Herein, optimal control theory was used to calculate . The optimal feedback gain was chosen as the hyperplane , which satisfies the condition . The matrix P can be calculated using the Riccati equation, while Q is a designed positive diagonal matrix.
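A dependency-free way to obtain P from the Riccati equation is to integrate the Riccati ODE to steady state; the sketch below does this for an illustrative third-order model (the A, B, Q, R values are placeholders, not the paper's fitted matrices).

```python
import numpy as np

# Illustrative third-order attitude model (placeholder values).
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, -2.0, -3.0]])
B = np.array([[0.0], [0.0], [1.0]])
Q = np.diag([10.0, 1.0, 0.1])    # designed positive diagonal matrix
R = np.array([[1.0]])

# Integrate dP/dt = A'P + PA - P B R^-1 B' P + Q from P(0) = 0; for a
# stabilizable pair (A, B) this converges to the stabilizing solution of
# the continuous algebraic Riccati equation.
P = np.zeros((3, 3))
Rinv = np.linalg.inv(R)
for _ in range(50000):           # dt = 0.01, i.e., 500 time units
    dP = A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q
    P = P + 0.01 * dP

residual = A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q
C_surface = Rinv @ B.T @ P       # hyperplane (sliding-surface) gains
```

In practice a library solver (e.g., SciPy's `solve_continuous_are`) would be used instead; the iteration above merely makes the computation explicit.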
Switching control (
16) was introduced to suppress the nonlinear parts in the system, such as model error, interference, and uncertainty.
where
is the gain coefficient,
is the adjustable gain related to the unknown disturbance and model error, and
is typically adopted in the conventional scheme.
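The chattering mechanism that sign-based switching induces under input delay can be reproduced in a toy simulation; the first-order dynamics and constants below are purely illustrative, not the paper's UAV model.

```python
# Toy first-order sliding dynamics with a one-step input delay: s' = d + u.
dt, rho, d = 0.01, 1.0, 0.3      # rho > |d| satisfies the reaching condition

def sign(x):
    return (x > 0) - (x < 0)

s, u_prev = 0.2, 0.0
signs = []
for _ in range(2000):
    u = -rho * sign(s)            # conventional sign-based switching control
    s = s + dt * (d + u_prev)     # delayed input makes s overshoot the surface
    u_prev = u
    signs.append(sign(s))

# Once the surface is reached, s keeps crossing it: that is chattering.
flips = sum(1 for a, b in zip(signs, signs[1:]) if a != b and a and b)
```

Even this minimal delay produces hundreds of sign flips of the sliding variable, illustrating why input delays exacerbate chattering.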
By combining (
14) and (
16), the result of reference model-based SMC can be obtained as follows:
To reduce chattering in a specific system, the selection of
always depends on the researchers’ experience. In this work, the RL method was combined with powerful exploration abilities to obtain the optimal output. Note that the output of the neural network is denoted as
, and the RL-improved nonlinear control
is
By combining (
14) and (
18), the proposed RL-based sliding mode controller with improved chattering reduction can be expressed as follows:
The method of obtaining the nonlinear control output is described in detail as follows:
3.4. Nonlinear Output by RL
RL algorithms are often described in the context of a Markov decision process (MDP), which can be represented by a tuple whose elements are the set of states, the actions available in each state, the transition probabilities between states, the reward after a state transition, and the discount factor.
Under policy and state s, the cumulative discounted return obtained while taking a series of actions until the terminal state is reached is called the value function , where denotes the expectation. Starting from s, performing action a, and thereafter following , the expected return over all possible decision sequences is recorded as the action-value function . Each policy corresponds to a value function and an action-value function, whereas the optimal policy corresponds to the optimal value function and the optimal action-value function . Letting denote the next time step, the Bellman equation can be expressed as follows:
The UAV system studied here is complex and continuously time-varying, and the output of the controller must be deterministic and unique during operation. Under such demanding conditions, the deep deterministic policy gradient (DDPG) [36] algorithm, which performs well in continuous systems, is typically used. Based on the actor-critic architecture, an RL scheme comprising an actor-network, an actor-target network, a critic-network, and a critic-target network was designed herein. A simplified training structure diagram is shown in Figure 2.
Combined with the reference state, the reward corresponding to the current action can be calculated according to the designed reward function (21). The aim of this study was to track the target state as quickly as possible while reducing chattering. Therefore, when designing the reward, in addition to the attitude, the attitude rate, which is more sensitive to changes, deserves particular attention. Here, represents the target value of each state, which is in Section 3.2; denotes Y; and is the weight assigned to each state.
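The reward described above (the negative of the weighted absolute tracking errors, maximized at zero error, with extra weight on the attitude rate) can be sketched as follows; the weights are illustrative assumptions, not the paper's values.

```python
# Hedged sketch of the reward: zero error gives the maximum reward of 0,
# and larger weighted errors give smaller (more negative) rewards.
def reward(state, reference, weights=(1.0, 2.0)):
    """state/reference: (attitude, attitude_rate); the rate is weighted
    higher because it is more sensitive to chattering."""
    return -sum(w * abs(s - r) for w, s, r in zip(weights, state, reference))

r_perfect = reward((0.3, 0.0), (0.3, 0.0))   # exact tracking
r_bad = reward((0.0, 0.5), (0.3, 0.0))       # attitude and rate errors
```

This structure matches the later observation that the cumulative reward approaches its maximum of 0 as training drives the tracking error toward zero.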
In the DDPG, the network update is based on the difference between predicted and actual values. Under the deterministic strategy
, the Bellman equation can be rewritten as follows:
To facilitate the distinction,
and
are used to represent the parameters of the critic and actor-networks, respectively. Regarding the critic, the update of the neural network parameter adopts the temporal difference (TD)-error method, which is realized by the mean square error. Its loss function is defined as follows:
where
represents the output of the critic-target network with action reward, calculated as expressed in (
24).
Thus, the gradient descent method is used to solve the gradient of the loss function in (
23); subsequently, the critic-network parameters are updated according to the calculation results, as shown in (
26), where
is the critic-network learning rate.
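The critic's TD-error update can be sketched with a linear-in-features critic; the paper uses deep networks, and the features, learning rate, and transition values below are illustrative only.

```python
import numpy as np

gamma, lr_c = 0.9, 0.05

w = np.zeros(3)          # critic weights
w_target = np.zeros(3)   # critic-target weights (held fixed in this demo)

def q(weights, feat):
    return weights @ feat

def td_update(w, w_target, feat, r, feat_next):
    y = r + gamma * q(w_target, feat_next)   # TD target from the target net
    td_error = q(w, feat) - y
    loss = 0.5 * td_error ** 2               # mean-square (TD) error
    grad = td_error * feat                   # dloss/dw for the linear critic
    return w - lr_c * grad, loss

# Replaying one fixed transition many times: the loss should shrink.
feat = np.array([1.0, 0.5, -0.2])
feat_next = np.array([0.9, 0.4, -0.1])
r = -0.3
losses = []
for _ in range(200):
    w, loss = td_update(w, w_target, feat, r, feat_next)
    losses.append(loss)
```

With the target network fixed, gradient descent drives the critic's prediction toward the TD target, which is the essence of the update in (25) and (26).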
Similarly, the actor-network is updated as shown in (
27) and (
28), where
is the update rate of the actor-network.
Selecting the appropriate soft update rate
, the actor- and critic-target networks can be soft-updated by their respective non-target networks in the manner shown in (
29).
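The soft update itself is Polyak averaging; the rate `tau = 0.01` below is an illustrative value, not necessarily the one in Table 1.

```python
# Soft (Polyak) update used for both target networks: each target
# parameter moves a small fraction tau toward its online counterpart.
def soft_update(target_params, params, tau=0.01):
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]

target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(500):
    target = soft_update(target, online)
# The targets drift slowly toward the online parameters, stabilizing training.
```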
The pseudocode of the entire reinforcement learning training process is presented in Algorithm 1.
Algorithm 1 Training process of the RL-based reference model SMC

Initialize the four network weights and the replay buffer D
Load the UAV model and the reference model
while episode < MaxEpisode do
    Initialize the noise process
    Initialize the UAV states
    Initialize the reference model states
    while timestep < MaxTimeStep do
        Calculate the reference states
        Select action
        Calculate , and run the model
        Obtain and
        Save the quadruple to D
        Sample quadruples from D randomly
        Update the critic-network using (25) and (26)
        Update the actor-network using (27) and (28)
        Update the target networks using (29)
        if not safe then
            Break
        end if
    end while
end while
4. Simulation and Experiment
The chattering reduction effect of the proposed RL-based SMC was verified by simulation and actual flight experiments.
4.1. Experimental Verification Platform
The experimental platform for the proposed algorithm application was a positive X-type quadrotor UAV with a 0.5 m wheelbase, and a self-developed flight control system based on the STM32F4 was adopted. The overall system, shown in
Figure 3, consisted of a main control module, an inertial measurement unit (IMU) module, a global navigation satellite system (GNSS) module, and a data logging module.
The power system was composed of four 900 KV U2810 rotors and 11-inch propellers.
By collecting the attitude, attitude rate, and target attitude rate during flight, an approximate model can be fitted. The fitted attitude model is , , , and the reference model is designed as , , .
4.2. RL Network Training
Fully connected deep networks were used to build all four networks. The actor-network took the reference and feedback states as input, and its output was a single value. Given that the actor-network runs on the control system, the computational burden must be considered. Because the network has four inputs and the environment is simple, a structure with two hidden layers of eight neurons each was selected. The trained network took only 0.003 s to complete a calculation on the STM32F4.
The rectified linear unit (ReLU) function was used between the hidden layers, and the Tanh function was used at the output. The ReLU function speeds up convergence during training and is robust to hyperparameter changes. The Tanh function outputs values in an interval of unit amplitude centered on zero; thus, it bounds the network output well. The number of hidden-layer neurons in the networks used during training, which depends on the complexity of the environment and is related to the number of network inputs, was set to 64.
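A sketch of the deployed actor's forward pass, matching the structure described above (four inputs, two hidden layers of eight ReLU units, and a single Tanh-bounded output); the weights are random placeholders rather than the trained values.

```python
import numpy as np

rng = np.random.default_rng(42)
# Placeholder weights for a 4-8-8-1 actor network.
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 8)), np.zeros(8)
W3, b3 = rng.standard_normal((1, 8)), np.zeros(1)

def actor(x):
    h1 = np.maximum(0.0, W1 @ x + b1)   # ReLU between hidden layers
    h2 = np.maximum(0.0, W2 @ h1 + b2)
    return np.tanh(W3 @ h2 + b3)        # Tanh bounds the switching output

# Example call with hypothetical reference/feedback state inputs.
a = actor(np.array([0.3, 0.0, 0.1, -0.05]))
```

A forward pass of this size is a few hundred multiply-adds, which is consistent with the millisecond-scale execution time reported on the STM32F4.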
The training was implemented in an Ubuntu 20.04 environment, and Python and PyTorch were used to complete the network and model construction. The simulation environment was equipped with an AMD Ryzen 7 5800X processor @ 3.8 GHz 8 cores with an Nvidia 2080Ti graphics card to accelerate neural network calculations. The parameters used in the training are presented in
Table 1.
‘’ and ‘’ correspond to the update rates of the actor and critic-networks; they are set to small constants to promote stability and reduce parameter fluctuations, and the selection of ‘’ follows the same consideration. The ‘Batch size’ must balance various factors, such as hardware resources and environmental complexity, and is usually set between tens and hundreds. The ‘Replay buffer’ stores state and action experience, and its size determines how much experience can be saved; when computing resources are abundant, a larger size is usually chosen. The ‘Discount factor’ is a number between 0 and 1; it is set to 0.9 here because more emphasis is placed on long-term cumulative rewards, corresponding to stable attitude tracking over the entire process. ‘Noise variance’ introduces randomness and exploration in RL; it must be chosen to perturb the system without destroying its stability. ‘Time step’ and ‘Maximum step’ are the control cycle and the maximum number of steps, meaning that 6 s of tracking is considered in our case.
Following the parameters, 4000 training episodes were conducted, depending on the complexity of the training environment and the network size. The average cumulative reward in each episode is shown in
Figure 4.
In Figure 4, the training system uses random outputs to generate initial samples during episodes 0–170: according to the defined reward function, the random output results in a large error and, correspondingly, a small reward. It is worth noting that the designed reward function (21) takes the negative of the cumulative absolute tracking error, so a larger cumulative tracking error yields a smaller reward and a smaller error yields a larger reward. In the ideal case of zero tracking error, the reward function reaches its maximum value of 0. The reward increases noticeably during episodes 700–1000. During episodes 1500–4000, the reward continues to increase and gradually approaches 0; therefore, the network training is complete.
Subsequently, the trained network was imported into the simulation environment. Based on a standard step signal of 0.3 rad, the simulation results were compared with those of conventional SMC, as shown in
Figure 5.
The attitude and attitude rate tracking are presented in
Figure 5a,b, respectively. In
Figure 5a, the green line is the original target signal, whereas the pink line represents the reference value calculated from the reference model. Evidently, both the proposed and conventional methods could adequately track the reference model output. However, in Figure 5b, the traditional method exhibited noticeable chattering in the attitude rate during tracking; by contrast, this phenomenon did not occur with the proposed method. The same conclusion can be drawn from Figure 5c, which shows the controller output.
4.3. Comparison Schemes Implement
Numerous methods have been proposed to reduce chattering, as described in Section 1. This work selected two representative classes of chattering reduction methods, Classes A and B, for comparison with the proposed method. Considering the diversity of controlled objects and SMC variants, we extracted their improved parts and applied them to the same basic controller (17).
Regarding improvement based on Class A, ref. [
18] designed an SMC scheme with adaptive gain, and then used a sigmoid function to solve the chattering problem. Its sliding mode surface is consistent with (
12), and the adaptive nonlinear controller is:
Among them, the designed adaptive gain
,
, and the sigmoid function is:
where
and
are positive constants related to the steepness of the sigmoid function. In this situation,
, and
are selected.
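A common sigmoid-type replacement for the sign function can be sketched as follows; the exact form and the steepness constants used in ref. [18] may differ, so treat this as an illustrative stand-in.

```python
import math

# Smooth approximation of sign(s): saturates at +/-1 far from the surface
# but is continuous through s = 0. a and b are illustrative steepness
# constants, not the values selected in ref. [18].
def sigmoid_switch(s, a=2.0, b=5.0):
    return 2.0 / (1.0 + math.exp(-a * b * s)) - 1.0

near = sigmoid_switch(0.001)   # nearly linear close to the surface
far = sigmoid_switch(5.0)      # behaves like sign(s) far from it
```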
Then, the improvement based on Class B was realized: ref. [23] replaced the sign function with the smooth function (32) to alleviate chattering.
represents the weight of the smooth function.
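As a hedged illustration of the Class B idea, the widely used smooth approximation s/(|s| + δ) stands in here for the exact smooth function (32) of ref. [23]; δ plays the role of the smoothing weight.

```python
# Smooth replacement for sign(s): continuous at s = 0, approaching +/-1
# as |s| grows. delta (illustrative value) weights the smoothness.
def smooth_switch(s, delta=0.05):
    return s / (abs(s) + delta)

vals = [smooth_switch(x) for x in (-1.0, -0.01, 0.0, 0.01, 1.0)]
```

Larger δ gives a gentler transition (less chattering, slower disturbance rejection); smaller δ approaches the discontinuous sign function.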
The tracking performance of the two improved methods is shown in Figure 6, and the tracking errors of each method are presented in Table 2. The attitude and attitude rate tracking errors are the primary indicators.
Figure 6 presents the step responses of the two improved methods after parameter tuning. The results depicted in
Figure 6a,b demonstrate that both methods yield similar outcomes, with each state accurately tracking its reference target.
Figure 6d displays the switching output during the tracking process; no chattering occurs, and the switching outputs exhibit comparable magnitudes. Thus, both improved methods achieve satisfactory results for the specific system, accurately tracking the target while avoiding chattering. In summary, this subsection implemented two classic chattering reduction methods. Considering the differences among target systems, the parameter selection for each method followed the recommendations of its reference and was adjusted to an ideal chattering-free state to ensure a fair comparison.
4.4. Simulation under Input Delay and Disturbance
Chattering is inherent in sliding mode control and appears more noticeable under input delays and external disturbances.
This section compares the proposed method with the comparison methods in the MATLAB environment. Considering the system (1), an attitude target of 0.3 rad is set. The existing equivalent control (14) is combined with the switching control calculated by the proposed RL-based method (16), the improved Class A method (30), and the improved Class B method (32) for target tracking. The system’s initial state is .
To verify the chattering suppression performance, a 20-cycle delay was applied to the controller output to imitate the actual system; the delay depends on the UAV’s size and maneuverability. Another factor that exacerbates chattering, the unknown disturbance, was also considered. On top of the delayed input, a disturbance of rad was added to the system input. The responses of each method and the controller outputs are shown in Figure 7 and Figure 8.
Figure 7 displays the target tracking performance in the presence of input delay and external disturbance, while
Figure 8 records the switching output during this process. Notably, the switching outputs of all three methods exhibit the same magnitude, approximately 0.003 rad/s. Moreover, both Class A and Class B output waveforms depict similar curves, while the RL network produces a distinct curve that closely resembles a square wave, which can be considered an optimal mechanism for the current system. Finally, the tracking errors of attitudes and attitude rates for the two simulation experiments are summarized in
Table 3.
Compared with the initial condition, after introducing the input delay and disturbance, the mean errors of Class A and Class B were maintained. However, the attitude variance of Class A and the attitude rate variance of Class B increased by 28.5% and 48.6%, respectively, indicating the occurrence of chattering. The variance of the proposed method’s tracking error changed by less than 5%. Therefore, while reducing chattering, the RL-based method better tolerates system input delay.
After the disturbance was increased, the tracking performance of each method declined; nonetheless, chattering was still suppressed. Overall, the proposed method reduces chattering better and tolerates input delay and disturbance better than the comparison methods.
4.5. Experimental Verification
The simulated and actual environments differ significantly; hence, simulation alone is insufficient to demonstrate the effectiveness of the proposed method. This section presents the performance of the discussed methods on a real quadrotor, including the proposed method and the two classes of comparison methods. During verification, the environment and tracking target were identical for each method, and the parameters were consistent with those used in the simulation.
The experiment was conducted in an outdoor natural environment, as shown in
Figure 9, and the open area allowed the free flight of the UAV. A self-designed trajectory was set as the attitude target, and no human intervention was allowed during tracking.
It is worth mentioning that the controller was designed and implemented based on the fitted model; modeling errors and input delays were expected as the primary reasons for chattering. The attitude and attitude rate tracking results were recorded for evaluation, and the experimental results are shown in
Figure 10,
Figure 11 and
Figure 12.
Figure 10a, Figure 11a, and Figure 12a show the designed attitude tracking performance, including the original target, the reference target, and the actual state feedback. Figure 10b, Figure 11b, and Figure 12b show the attitude rate tracking. Figure 10c, Figure 11c, and Figure 12c show the switching output of each method. Since the equivalent control part is entirely consistent across methods, the switching output is the more meaningful basis for comparison.
Based on the experimental results, it is evident that all three methods effectively track the planned reference attitude and rate. However, the distinct output of the switching output reveals notable differences among the three methods.
Figure 10c shows that the switching output amplitude of the Class A improvement reached approximately 0.02 rad/s, significantly larger than that of the other two methods. Nonetheless, compared with the other methods, the Class A improvement was the least satisfactory, and a distinct tracking error was observed. To reflect the experimental results more clearly, the tracking errors of attitude and rate are shown separately in Figure 13, and the statistical results are listed in Table 4. It is worth noting that an additional evaluation metric, the integral square error (ISE) [37,38], was adopted to further evaluate the experimental results, so that they can be judged from different indicators.
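The ISE metric can be computed from logged tracking errors as a discrete sum; a minimal sketch (the sample time and error traces below are illustrative):

```python
# Discrete approximation of the integral square error:
# ISE = integral of e(t)^2 dt ~ sum(e_k^2) * dt.
def ise(errors, dt):
    return sum(e * e for e in errors) * dt

dt = 0.01
steady = ise([0.1] * 100, dt)                          # persistent 0.1 rad error
decaying = ise([0.1 * (0.9 ** k) for k in range(100)], dt)  # decaying error
```

Because errors are squared, ISE penalizes large, persistent deviations more heavily than the mean error, which is why it complements the mean and variance statistics in Table 4.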
The comparison is depicted in Figure 13a,b; it is evident that the proposed method exhibits the smallest tracking error for both attitude and rate, particularly during 3–6 s. This conclusion is further supported by the statistical results presented in Table 4: during the actual flight, the mean value, variance, and ISE of the proposed method’s target tracking error were all smaller than those of the two comparison methods. Therefore, the RL-based method achieves better tracking accuracy and stability. Additionally, it is noteworthy that none of the three schemes experienced chattering during flight.
Thus, according to the actual flight verification, the proposed improved reference sliding mode controller based on RL can reduce chattering and accurately track the attitude and attitude rate target.