Prediction Horizon-Varying Model Predictive Control (MPC) for Autonomous Vehicle Control

Chen, Zhenbin; Lai, Jiaqin; Li, Peixin; Awad, Omar I.; Zhu, Yubing

doi:10.3390/electronics13081442

Open AccessArticle

Prediction Horizon-Varying Model Predictive Control (MPC) for Autonomous Vehicle Control

by

Zhenbin Chen

,

Jiaqin Lai

,

Peixin Li

^*,

Omar I. Awad

and

Yubing Zhu

School of Mechanics and Electronics Engineering, Hainan University, Haikou 570228, China

^*

Author to whom correspondence should be addressed.

Electronics 2024, 13(8), 1442; https://doi.org/10.3390/electronics13081442

Submission received: 4 March 2024 / Revised: 3 April 2024 / Accepted: 9 April 2024 / Published: 11 April 2024

(This article belongs to the Special Issue Intelligent Control of Unmanned Vehicles)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

The prediction horizon is a key parameter in model predictive control (MPC), which is related to the effectiveness and stability of model predictive control. In vehicle control, the selection of a prediction horizon is influenced by factors such as speed, path curvature, and target point density. To accommodate varying conditions such as road curvature and vehicle speed, we proposed a control strategy using the proximal policy optimization (PPO) algorithm to adjust the prediction horizon, enabling MPC to achieve optimal performance, and called it PPO-MPC. We established a state space related to the path information and vehicle state, regarded the prediction horizon as actions, and designed a reward function to optimize the policy and value function. We conducted simulation verifications at various speeds and compared them with an MPC with fixed prediction horizons. The simulation demonstrates that the PPO-MPC proposed in this article exhibits strong adaptability and trajectory tracking capability.

Keywords:

trajectory tracking; MPC; prediction horizon; PPO reinforcement

1. Introduction

Nowadays, advancements in autonomous driving technology have led to significant enhancements in road safety and driving efficiency [1,2,3]. As a pivotal technology, the effectiveness of trajectory tracking control directly affects the overall performance and safety of vehicles during operation [4]. Despite significant progress in autonomous driving technology, optimizing trajectory tracking control remains a challenging but crucial task [5]. Moreover, as the demand for autonomous vehicles continues to increase, it is crucial to refine trajectory tracking control algorithms to effectively address complex and various road scenarios [6,7], achieving a good balance between responsiveness, accuracy, and computational efficiency in dynamic driving environments and ensuring real-time adaptability.

One main method to solve trajectory tracking control problems is model predictive control (MPC) [7,8]. MPC is a method for solving optimal control problems (OCPs) using the current state of the controlled system as the initial condition. This method uses a controlled object model to predict the controlled variable’s response [9,10]. Solving an OCP involves finding a sequence of control inputs that minimize the objective function within the specified prediction horizons. Simultaneously, it maintains feasibility while the trajectory stays within the defined limitations. The linear time-varying model predictive control (LTV-MPC) proposed by [11,12] is an MPC based on the discrete linear state space model, which can reduce the amount of calculation and improve efficiency. The LPV-MPC presented in [13] takes into account future inputs and scheduling parameters, predicting future outputs accordingly. Ref. [14] presents a customized genetic algorithm for real-time optimization of a nonlinear model predictive control (NMPC) path-tracking controller, specifically designed for lower vehicle speeds.

With increasing computing power, sensing, and communication capabilities and advances in the field of machine learning, automating controller design and adaptation based on data collected during operations are being studied [15], such as by improving performance, facilitating deployment, and reducing the need for manual controller tuning. In refs. [16,17,18], Gaussian Process Regression and neural networks were used as predictive models of controlled systems, which were prediction models adaptively adjusted in a data-driven manner to improve control accuracy and reduce computing costs. The concept of using reinforcement learning (RL) to learn MPC cost function parameters is introduced in ref. [19]. Ref. [20] proposes a weights-varying MPC using a deep reinforcement learning (DRL) algorithm to adjust cost function weights in different situations. Ref. [21] proposed a novel approach limiting DRL actions within a safe learning space, and the proposed DRL algorithm can automatically learn context-dependent optimal parameter sets and dynamically adapt for a weights-varying MPC. Ref. [22] has introduced a novel control algorithm based around an event-triggered MPC, using RL with a configurable objective to automatically tune the control algorithm’s meta-parameters: the prediction horizon and the re-computation (triggering) problem. In ref. [23], a RLMPC scheme was introduced integrating MPC and RL through policy iteration (PI), where MPC is a policy generator and the RL technique is employed to evaluate the policy. This RLMPC scheme has great potential in reducing computational burden.

Prediction horizon is the key parameter affecting both performance and computational burden of the control system in MPC. It denotes the time range used to predict the system’s future behavior during the control process [24]. Shorter horizons offer better control but less stability, while longer horizons provide stability but should not be excessively long [25]. In ref. [26], a dual-mode receding horizon controller eliminates the danger of interference that is always present in nonlinear optimal control algorithms and greatly reduces the amount of online calculation required. Two adaptive prediction horizons for MPC were proposed in ref. [27]. One is based on heuristics, which is idealized but not feasible, and the other is more practical and uses iterative deepening, where each iteration will check stability criteria and find out the minimum horizon of stability. Ref. [28] adjusted the prediction horizon range to a positive integral discrete time variable and introduced an NMPC to achieve velocity control, incorporating a self-correction method for the prediction horizons. Studies in refs. [29,30] proposed an event-triggered MPC, which dynamically adjusts the prediction horizon using event-triggering mechanisms to facilitate the optimization process. A finite control set MPC with an adaptive prediction horizon is proposed [31], and neural networks are trained to calculate the optimal prediction horizon at runtime. Ref. [9] proposed an RLMPC which learns the optimal prediction horizon length of an MPC scheme using RL.

Proximal policy optimization (PPO) is a policy-based RL algorithm, clipping the probability ratio to modify the agent’s objective and constraining the magnitude of policy change at each step to enhance training stability. In this paper, we propose an adaptive MPC, combining the PPO algorithm and MPC to adjust the prediction horizon. To accommodate dynamic conditions such as road curvature and vehicle speed, we leverage the PPO algorithm to adaptively adjust the prediction horizon within the MPC framework. This integration empowers MPC to achieve optimal performance under varying environmental circumstances.

Unlike other research that combines reinforcement learning with MPC for autonomous driving—such as using RL for planning, MPC for control, and RL to adjust MPC weight parameters—our proposed PPO-MPC deeply considers the impact of prediction horizon selection on control performance, offering a dynamic prediction horizon that enhances MPC’s adaptability. And the PPO-MPC strategy we propose enables the vehicle to achieve adaptive tracking control at different speeds and curvatures.

2. Methodology

This section outlines the methodology of our proposed PPO-MPC framework, encompassing the establishment of the vehicle dynamics model, the design of the MPC strategy, and the adaptation of prediction horizons within the PPO algorithm. Specifically, we introduce a novel approach called “prediction horizon-varying model predictive control” to optimize the prediction horizon for MPC. This involves formulating a hybrid PPO-MPC prediction horizon optimization problem. To improve the adaptive performance of autonomous driving trajectory tracking control, we use the PPO algorithm to dynamically adjust the MPC prediction horizon, exploring the intricate relationship between the prediction horizon and variables such as vehicle speed, lateral deviation, and curvature. A visual representation of the architectural concept of PPO-MPC is provided in Figure 1. Figure 1 shows that the vehicle dynamics model is used to analyze the vehicle’s motion state; the lateral error model and the longitudinal acceleration model are coupled, with the front wheel steering angle and acceleration serving as control variables. And the model predictive control algorithm is used to solve the problem. In this structure, a state space related to the MPC controller and vehicle motion status is established, and the prediction horizon is set as the action space. Based on the PPO algorithm, the optimal prediction horizon is dynamically adjusted through iterative training. In addition, the speed control strategy can be interpreted as calculating the desired acceleration through model predictive control, switching the driving mode through the desired acceleration corresponding control strategy and then controlling the vehicle’s accelerator opening and brake pressure to achieve speed control.

2.1. Vehicle Dynamic Model

As shown in Figure 2, this study considers the vehicle dynamics model of both longitudinal and lateral motion.

2.1.1. Lateral Dynamics

In this paper, the model for lateral movement is constructed on the foundational principles of the bicycle model. This approach presumes symmetry in the steering angles on both the left and right sides, which effectively reduces the complexity of the vehicle’s dynamic model to that of a two-wheeled bicycle. Leveraging Newton’s second law of motion, simplified Equations (1a)–(1c) can be derived to quantify the lateral force exerted by the vehicle.

m \ddot{y} = m \dot{x} \ddot{φ} + 2 F_{x f} + 2 F_{x r}

(1a)

m \ddot{x} = m \dot{y} \dot{φ} + 2 F_{y f} + 2 F_{y r}

(1b)

I_{z} \ddot{φ} = 2 l_{f} F_{y f} - 2 l_{r} F_{y r}

(1c)

where

m

is the vehicle mass and

\dot{x}

and

\dot{y}

are the longitudinal and lateral velocity, respectively.

\ddot{x}

and

\ddot{y}

denote the longitudinal and lateral acceleration, respectively.

F_{y f}

and

F_{y r}

are the lateral tire forces at the front and the rear wheels, respectively;

F_{x f}

and

F_{x r}

are the longitudinal tire forces at the front and the rear wheels, respectively;

\dot{φ}

is the yaw rate;

I_{z}

denotes the yaw inertia of the vehicle; and

l_{f}

and

l_{r}

are the distances from the front and rear axles to the center of gravity, respectively.

Considering the tire turning characteristics, Equations (2a) and (2b) represent the lateral force of the front and rear tires at a small sideslip angle:

F_{y f} = 2 C_{f} a_{f}

(2a)

F_{y r} = 2 C_{r} a_{r}

(2b)

where

C_{f}

and

C_{r}

are the cornering stiffness of the front tire and rear tire, respectively;

α_{f}

is the sideslip angle of the front tire; and

α_{r}

is the sideslip angle of the rear tire.

Given that the angles of the two front wheels of the vehicle are equal, the vehicle’s lateral acceleration should meet the criteria for the small angle assumption. In this case, the subsequent approximate relationship can be utilized:

a_{f} = δ_{f} - \frac{l_{f} \dot{φ} + \dot{y}}{\dot{x}}

(3a)

a_{r} = \frac{b \dot{φ} - \dot{y}}{\dot{x}}

(3b)

Upon substituting Equations (2a), (2b) and Equations (3a), (3b) into Equations (1a)–(1c), the resulting expressions are obtained:

m \ddot{y} = - m V_{x} \dot{φ} + 2 [C_{c f} (δ_{f} - \frac{\dot{y} + l_{f} \dot{φ}}{\dot{x}}) + C_{c r} \frac{l_{r} \dot{φ} - \dot{y}}{\dot{x}}]

(4a)

m \ddot{x} = m \dot{y} \dot{φ} + 2 [C_{c f} (δ_{f} - \frac{\dot{y} + l_{f} \dot{φ}}{\dot{x}}) + C_{l f} S_{f} + C_{l r} S_{f}]

(4b)

I_{z} \ddot{φ} = 2 [l_{f} C_{c f} (d_{f} - \frac{\dot{y} + l_{f} \dot{φ}}{\dot{x}}) - l_{r} C_{c r} \frac{l_{r} \dot{φ} - \dot{y}}{\dot{x}}]

(4c)

2.1.2. Longitudinal Dynamics

Vehicle longitudinal dynamics studies the motion and mechanical characteristics of a vehicle in the longitudinal direction (i.e., the direction of the vehicle’s forward motion). In longitudinal dynamics, the vehicle’s acceleration, braking, traction, resistance, and other factors are primarily considered. Newton’s second law provides a framework for understanding these forces and their impact on motion. Considering the computational complexity, the effect of partial resistance is not taken into account.

m \ddot{x} = F_{t} - F_{w} - R_{x}

(5)

where

F_{t}

,

F_{w}

, and

R_{x}

respectively represent driving force, air resistance, and rolling resistance. The expression for air resistance

F_{w}

is as follows.

F_{w} = \begin{matrix} \frac{1}{2} ρ_{air} C_{d} A_{f} ν_{x}^{2} \end{matrix}

(6)

where

ρ_{air}

represents air density.

C_{d}

stands for the drag coefficient.

A_{f}

denotes the frontal area of the vehicle.

Torque is based on the sum of the forces multiplied by the wheel radius; according to Gillespie’s “Fundamentals of Vehicle Dynamics” [32], we can obtain the calculation of torque in the context of vehicle dynamics, denoted as Equation (7).

T_{t} = r_{t i r e} (m \ddot{x} + \begin{matrix} \frac{1}{2} ρ_{air} C_{d} A_{f} ν_{x}^{2} \end{matrix} + R_{x})

(7)

where

T_{t}

is the driving torque and

r_{t i r e}

represents the effective tire radius.

Taking into account the transmission ratio and motor efficiency, the relationship between driving force and motor torque is:

T_{m} = \frac{r_{t i r e} (m \ddot{x} + \begin{matrix} \frac{1}{2} ρ_{air} C_{d} A_{f} ν_{x}^{2} \end{matrix} + R_{x})}{i_{0} η_{t}}

(8)

where

T_{m}

denotes the motor torque,

i_{0}

is the transmission ratio, and

η_{t}

is the motor efficiency.

The braking force can be expressed as:

F_{d} = - m a_{d e s} - \frac{1}{2} ρ_{a i r} C_{d} A_{f} v_{x}^{2} - R_{x}

(9)

Considering that the braking and driving modes cannot work at the same time, and the braking and driving modes cannot be switched frequently, the driving and braking switching strategy of the vehicle is designed as:

mode = {\begin{array}{l} 1 (driving), a_{des} ⩾ a_{thre} + 0.1 m / s^{2}; \\ 0 (nochange), a_{thre} - 0.1 m / s^{2} < a_{des} < a_{thre} + 0.1 m / s^{2}; \\ - 1 (braking), a_{des} ⩽ a_{thre} - 0.1 m / s^{2} \end{array}

(10)

2.2. MPC System Definition

MPC is a control system approach employing a predictive model. In each discrete sampling time, MPC solves an open-loop OCP over a predetermined finite horizon. This is an iterative process of predicting the future system behavior within the future horizon through the plant’s predictive model. The current state of the system is considered the initial condition for each optimization cycle. An optimizer solves the optimization problem and returns a control sequence for the prediction horizon. The plant only adopts the initial control in the optimal sequence, and on subsequent samples, the initial control is used to solve the system optimization problem again [33]. This refers to the “receding horizon” principle [34]. Indeed, the horizon recedes as time passes. A key feature of MPC is its ability to integrate hard constraints on control variables and states during the design phase.

In this study, we divide longitudinal control into two parts: upper-level and lower-level control. Upper-level control uses MPC, producing acceleration as its output. Lower-level control is deduced using Equation (7), which establishes a throttle and brake control map.

According to Ref. [35], the longitudinal accelerating system is modeled as a linear first-order system; the relationship between the desired vehicle acceleration

a_{d e s}

and the actual acceleration is as follows in Equation (11).

\ddot{x} = \frac{K}{τ s + 1} a_{d e s}

(11)

where

K = 1

is the gain coefficient,

τ

is the delay time, and

τ = 0.5

.

The longitudinal accelerating and lateral steering combined system model of the vehicle can be described as Equation (12).

\frac{d}{d t} [\begin{matrix} \ddot{x} \\ \dot{x} \\ y \\ \dot{y} \\ φ \\ \dot{φ} \end{matrix}] = [\begin{matrix} - \frac{1}{τ} & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & - \frac{2 C_{f} + 2 C_{r}}{m \dot{x}} & 0 & - \dot{x} - \frac{2 l_{f} C_{f} - 2 l_{r} C_{r}}{m \dot{x}} \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & - \frac{2 l_{f} C_{f} - 2 l_{r} C_{r}}{I_{z} \dot{x}} & 0 & - \frac{2 l_{f}^{2} C_{f} + 2 l_{r}^{2} C_{r}}{I_{z} \dot{x}} \end{matrix}] [\begin{matrix} \ddot{x} \\ \dot{x} \\ y \\ \dot{y} \\ φ \\ \dot{φ} \end{matrix}] + [\begin{matrix} \frac{1}{τ} & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & \frac{2 C_{f}}{m} \\ 0 & 0 \\ 0 & \frac{2 l_{f} C_{f}}{I_{z}} \end{matrix}] [\begin{matrix} a_{d e s} \\ δ_{f} \end{matrix}]

(12)

The full system model consists of the longitudinal and lateral model. Equation (13) represents the system’s state space expression,

{\begin{matrix} \dot{ξ} = A ξ + B u \\ y (ξ) = C ξ \end{matrix}

(13)

where

A = [\begin{matrix} - \frac{1}{τ} & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & - \frac{2 C_{f} + 2 C_{r}}{m \dot{x}} & 0 & - \dot{x} - \frac{2 l_{f} C_{f} - 2 l_{r} C_{r}}{m \dot{x}} \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & - \frac{2 l_{f} C_{f} - 2 l_{r} C_{r}}{I_{z} \dot{x}} & 0 & - \frac{2 l_{f}^{2} C_{f} + 2 l_{r}^{2} C_{r}}{I_{z} \dot{x}} \end{matrix}]

,

B = [\begin{matrix} \frac{1}{τ} & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & \frac{2 C_{f}}{m} \\ 0 & 0 \\ 0 & \frac{2 l_{f} C_{f}}{I_{z}} \end{matrix}]

,

C = [\begin{matrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{matrix}]

.

Where

ξ = {[\begin{matrix} \ddot{x} & \dot{x} & e_{1} & {\dot{e}}_{1} & e_{2} & {\dot{e}}_{2} \end{matrix}]}^{T}

is the state vector and

u = {[a_{d e s}, δ]}^{T}

is the control input vector of the vehicle model.

Equation (13) can be linearized and discretized as follows:

{\begin{matrix} x (k + 1) = (I + T_{1} A) x (k) + T_{1} B u (k) \\ h (k) = C x (k) \end{matrix}

(14)

where

T_{1}

denotes the sample time;

I

is the identity matrix; and

A

,

B

, and

C

are matrices of coefficients.

Based on the known vehicle model and the deviation between the current measurement value and the expected value, the MPC controller predicts the output of the system within

N_{p}

. By solving the objective function and optimizing the output, a control quantity array in

N_{c}

is obtained, and control elements in the first-time interval

Δ t

are used as the output quantity. This process is repeated to achieve vehicle tracking along the desired trajectory.

According to the control requirements, the basic principle of MPC is to minimize the performance evaluation function while satisfying the control constraints. Equations (15a)–(15c) represent the objective function.

\min_{Δ u (k)} J (η (k), Δ u (k))

(15a)

J (\begin{matrix} η (k), u (k) \end{matrix}) = J_{1} + J_{2} + ρ ε^{2}

(15b)

J_{1} = \sum_{i = 1}^{N_{p}} {‖ η (k + i ∣ k) - η_{r e f} (k + i) ‖}_{Q}^{2}, J_{2} = \sum_{i = 0}^{N_{c} - 1} {‖ \begin{matrix} Δ u (k + i ∣ k) \end{matrix} ‖}_{R}^{2}

(15c)

where

J_{1}

reflects the system’s ability to track reference trajectory within prediction horizon

N_{p}

. This reflects the system’s requirement for a smooth change in the control increment within control horizon

N_{c}

.

Q

and

R

are the weight matrices,

ρ

is the weight coefficient,

ε

is the relaxation factor, and

Δ u

is the control input increment.

Considering safety constraints and vehicle actuator constraints, the constraint conditions can be expressed as Equation (16).

s . t . {\begin{matrix} u_{\min} \leq u (k + i | k) \leq u_{\max}, i = 0, 1, \dots, N_{c} - 1 \\ Δ u_{\min} \leq Δ u (k + i | k) \leq Δ u_{\max}, i = 0, 1, \dots, N_{c} - 1 \\ η_{\min} \leq η (k + i | k) \leq η_{\max}, i = 1, 2, \dots, N_{p} \end{matrix}

(16)

2.3. PPO Horizon Policy

2.3.1. Proximal Policy Optimization (PPO)

The PPO algorithm is an online policy gradient RL algorithm which can deal with problems in continuous state–action spaces. This DRL algorithm learns optimal strategies in interaction with the environment and uses stochastic gradient ascension to optimize the agent objective function [36]. To achieve high cumulative rewards, the PPO algorithm seeks optimal decisions in complex environments by constructing and optimizing policies, generally represented by neural networks. The policy, a parameterized function, maps states to probability distributions of actions. To improve the policy, PPO uses proximal policy optimization and maximizes the objective function. The objective function comprises a loss term for current policy updates and a KL divergence term, ensuring stability by controlling the magnitude of policy updates.

In PPO, the problem is represented as a Markov Decision Process (MDP), which is defined as

(S, A, T_{2}, R, γ)

.

S

represents the state space, which encompasses the complete set of potential states that the environment can occupy. And

A

represents the action space, which contains a collection of all possible actions.

T_{2}

represents the state transition function, which defines the probability distribution of the environment transitioning to the next state.

R

is the reward function, which is specified in the given state and represents an immediate reward.

γ

is the discount factor, which determines the importance of future rewards. The pseudocode of PPO is shown in Algorithm 1.

The goal of PPO is to maximize the anticipated total reward while limiting changes in the policy during each update. In this study, Equation (17) is the objective function expression of PPO.

L^{P P O} (θ) = E [\min (r_{t_{2}} (θ) \cdot {\hat{A}}_{t_{2}}, c l i p (r_{t_{2}} (θ), 1 - ϵ, 1 + ϵ) \cdot {\hat{A}}_{t_{2}})]

(17)

where

r_{t_{2}} (θ)

denotes the ratio of the new strategy to the old strategy and

{\hat{A}}_{t}

indicates the advantage function.

Algorithm 1 PPO algorithm

For iteration

i = 1 : N

do

For episode

j = 1 : M

do

Initialize the weight parameters of the policy network (actor) and value function network (critic). The initial state is determined by the discount factor

γ

and the greedy factor

g

;

Collect experiences

D = {(s, a, r, s^{'}) \dots}

;

For optimization step

k = 1 : K

do

Calculate the expected advantage function of the current strategy

A (s, a; θ)

;

Calculate the advantage function

A (s, a; θ) - r

for each experience

(s, a, r, s^{'})

;

Update the policy

θ

;

Update proximal policy

π_{θ}

;

End

End for

The framework of PPO-MPC is shown in Figure 3, which includes the environment and PPO network. The state given in the environment is input into the PPO network for learning and training. The state quantity is scored through the critic network, and then the appropriate action is selected.

2.3.2. Action and State Space

Considering that the prediction horizon is related to historical trajectory information and road curvature, a state space

S (t) = [c (t), v (t), δ (t), a c c (t), e (t), \cos t (t)]

was established in this section.

c (t), v (t), δ (t), a c c (t), e (t), \cos t (t)

represent the curvature of the reference trajectory, velocity, steering angle, acceleration, lateral error, and cost of the MPC system at time

t

, respectively. Our strategy trains a PPO policy

π_{θ}^{N}

to determine

N_{p}

for MPC. At each time step, the system’s state is measured, and the policy outputs

N_{p}

is used to solve the MPC problem.

The prediction horizon, denoted as

N_{p}

, is defined as a positive integer representing the maximum value

N_{m a x}

of the prediction horizon. In order to adjust the output of the PPO policy, we employ linear scaling. Specifically, we limit the output to [−1, 1], which is associated with the hyperbolic tangent (

t a n h

) function, to a new range of 1 to

N_{m a x}

. This adjustment ensures that the policy output aligns with the requirements of the MPC scheme.

2.3.3. Reward Function

The policy and value function in PPO are learned directly from the reward signal. Thus, an appropriate reward function plays a crucial role in enabling the neural network in PPO to effectively converge towards the optimal solution. Our designed reward function aims to strike a balance between promoting smooth driving and maintaining an acceptable range of tracking deviation. Tracking error is closely related to control performance, and limiting the control output within the constraint range is related to the stability of MPC control. To coordinate MPC and PPO to achieve optimal performance, we design a reward function that takes into account tracking error and control output, and the reward function is denoted as follows:

R (t) = w_{1} e^{- (λ_{1} | e_{1} | + λ_{2} | e_{2} | + λ_{3} | e_{3} |)} - w_{2} H_{1} - w_{3} H_{2}

(18)

where e is a natural index,

e_{1}

is the lateral tracking deviation,

e_{2}

is the longitudinal velocity deviation, and

e_{3}

is the relative yaw angle error.

H_{1}

and

H_{2}

are penalty terms.

H_{1} = 1

when either the steering angle or acceleration exceeds the constraint; if both exceed the constraint,

H_{1} = 2

, else

H_{1} = 0

.

H_{2} = 1

when the lateral tracking deviation is greater than 0.15, else

H_{2} = 0

.

λ_{1}

,

λ_{2}

, and

λ_{3}

are the weights of the tracking error, respectively, and

w_{1}

,

w_{2}

, and

w_{3}

are the weights of each reward, respectively. Equation (18) adopts the form of an exponential function, which makes the gradient change more drastic, which is beneficial to the training process. The reward increases when the total tracking error decreases, and when errors are 0, the instant reward can be obtained by corresponding points.

3. Simulation and Training

In this section, to verify the validity of the strategy we proposed, we conduct training and verification at various speeds using the MATLAB/Simulink simulation platform. In addition, trajectory tracking comparisons are conducted between the PPO-MPC algorithm proposed in this article and MPC with fixed horizons. The vehicle parameters used in the simulation are outlined in Table 1.

The PPO algorithm involves a set of hyperparameters that significantly influence the algorithm’s performance and training stability. The hyperparameter settings in reinforcement learning must be customized to the specific problem and environment to achieve optimal performance. In this study, the agent collects experiences based on the training set and stops when it reaches a 500-steps experience horizon or the terminal episode. Then it is trained for three epochs using mini-batches of 128 experiences. The objective function clip factor is set to 0.2 to enhance the stability of training, while the discount factor is set to 0.998 to promote long-term rewards. The Generalized Advantage Estimate (GAE) method reduces the variance in critic output with a GAE factor of 0.95.

Based on the aforementioned simulation environment, training was conducted for up to 10,000 episodes, with each episode spanning up to 500 steps. Figure 4 depicts the training results. The light blue line represents the cumulative reward obtained by the agent at the end of each round; the thick line describes the average reward value of all rounds during the training process. In the early stages of training, PPO reinforcement learning explores various actions through interactions with the environment to achieve the overall optimal outcome. In this experiment, after 300 rounds, the model finally converged stably, showing good training effects.

4. Results and Discussion

PPO-MPC is simulated and verified at different speeds, and its trajectory tracking performance is analyzed in this section. In addition, PPO-MPC was compared with MPC under fixed prediction horizons at different speeds, and their performance was discussed. The pre-calculated reference path is shown in Figure 5, with a total length of 12,000 m.

4.1. Performance of Trajectory Tracking Using PPO-MPC

In this section, simulation verification of PPO-MPC was performed at various velocities (

v = 10 m / s

,

v = 15 m / s

, and

v = 20 m / s

, respectively), and an analysis of its trajectory tracking performance during operation was conducted.

Figure 6, Figure 7 and Figure 8 illustrate the vehicle control results of our proposed control strategy at

v = 10 m / s

,

v = 15 m / s

, and

v = 20 m / s

. The PPO-MPC controller showed excellent performance by effectively adjusting the acceleration and steering angle output within the predefined constraint range, except for some fluctuations at the beginning of the simulation in Figure 8, which quickly stabilized. Figure 6a, Figure 7a and Figure 8a reveal the vehicle’s lateral deviation changes. Large lateral errors may occur where the curvature suddenly changes, and otherwise remain near 0, indicating that the control system has a good ability to keep the vehicle close to the desired trajectory. Figure 6b, Figure 7b and Figure 8b illustrate heading angle error changes, with the maximum value not exceeding 0.05. This shows that the vehicle has good performance in tracking the expected direction. Changes in the yaw rate, as depicted in Figure 6c, Figure 7c and Figure 8c, generally demonstrate low values, suggesting smooth steering operations, mitigated risk of abrupt rolling or sharp turning, and enhanced driving stability. Moreover, Figure 6d, Figure 7d and Figure 8d illustrate speed error changes; even at maximum speed, the speed error for stable operation always remains small, with the error limited within 0.2 m/s.

From the above analysis, it can be concluded that the PPO-MPC controller shows good performance at various speeds. And PPO-MPC effectively maintains the trajectory, direction, and stability of the vehicle, indicating that the strategy has good adaptability in different scenarios.

4.2. Performance Comparison of the Trajectory Tracking Using PPO-MPC and Model Predictive Control

To further illustrate the advantages of the PPO-MPC strategy, we conducted a comparison between PPO-MPC and the conventional MPC strategy with fixed prediction horizons. In the simulated design, we conducted a comparison between PPO-MPC and MPC with fixed horizons of 10, 20, and 30 and the control horizon set to 3.

Comparisons of the simulation data are shown in Figure 9, Figure 10 and Figure 11. As shown in Figure 11, it is easy to see that MPC with a fixed range of 10 fails to converge and is unstable at

v = 20 m / s

. Figure 9a, Figure 10a and Figure 11a clearly show that compared with MPC with static prediction horizons, PPO-MPC generally has better lateral deviation and exhibits superior trajectory tracking capabilities. The maximal lateral deviation of MPC may even be twice that of PPO-MPC, suggesting that the MPC with fixed prediction horizons may be subject to have greater lateral disturbance or challenges under certain circumstances. Figure 9b, Figure 10b and Figure 11b show that the heading error of PPO-MPC is almost the same as MPC with static prediction horizons at

v = 10 m / s

, and the PPO-MPC heading error at

v = 15 m / s

is smaller. Moreover, although PPO-MPC jittered at the beginning at

v = 20 m / s

, it quickly stabilized and resulted in a smaller heading error. Figure 9c, Figure 10c and Figure 11c show that, except for the jitter that occurs at the beginning of the simulation when

v = 20 m / s

, the overall performance of the PPO-MPC speed error is smaller.

According to [37], we introduce an index which quantitatively measures the tracking performance and is achieved as

Q_{t r a c k_i} = \sqrt{\frac{\sum_{j = 1}^{T / T_{s}} {(y_{r e f} (j) - y (j))}^{2}}{T / T_{s} - 1}}

(19)

where

T

represents the simulation duration time and

T_{s}

represents the controller sampling step. The tracking performance indexes of simulations are provided in Table 2.

Q_{t r_{-} l a t}

,

Q_{t r_y a w}

, and

Q_{t r_v}

denote the lateral, heading, and speed tracking accuracy, respectively. While ensuring dynamic stability, it is not difficult to find that, compared with MPC, the PPO-MPC tracking accuracy is improved, except that the speed tracking accuracy of PPO-MPC at

v = 20 m / s

is smaller than MPC, with

N_{p} = 30

. Analysis shows the superiority of the proposed PPO-MPC path-tracking controller.

5. Conclusions

In this paper, we present a novel PPO-MPC strategy, which integrates proximal policy optimization (PPO) with model predictive control (MPC), using the PPO reinforcement learning algorithm to dynamically adapt the prediction horizon of MPC. The proposed strategy was evaluated and validated using the MATLAB/Simulink simulation environment across three distinct operating speeds. Additionally, comparations were conducted against conventional MPC employing static prediction horizons under analogous conditions. From the analysis of simulation results, it can be seen that the PPO-MPC framework is better than the traditional model predictive controller with a fixed prediction range, and has superior tracking performance and robustness.

In the future, PPO-MPC can be explored through multi-objective optimization methods. Co-optimizing the prediction horizon with other key MPC parameters (such as control weights and constraints) is expected to achieve more powerful and efficient vehicle control strategies. And it may achieve adaptive trajectory control in various scenarios. In addition, adding the RL differential prediction model to the MPC prediction model to achieve better adaptive control is also a major idea for improving automatic driving control performance.

Author Contributions

Z.C.: design of methodology and writing; J.L.: conducted train and validation simulation and wrote the manuscript; P.L.: data analysis and reviewed the manuscript; O.I.A.: edited and reviewed the manuscript; Y.Z.: wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

Open access funding provided by the Hainan Province Key R&D Plan Project, China (No. ZDYF2024GXJS020, No. ZDYF2021GXJS002) and Hainan Natural Science Foundation, China (No. 523RC441).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no competing interests.

References

Yao, Q.; Tian, Y. A model predictive controller with longitudinal speed compensation for autonomous vehicle path tracking. Appl. Sci. 2019, 9, 4739. [Google Scholar] [CrossRef]
Wang, H. Control System Design for Autonomous Vehicle Path Following and Collision Avoidance. Ph.D. Thesis, The Ohio State University, Columbus, OH, USA, 2018. [Google Scholar]
Bimbraw, K. Autonomous cars: Past, present and future a review of the developments in the last century, the present scenario and the expected future of autonomous vehicle technology. In Proceedings of the 2015 12th International Conference on Informatics in Control, Automation and Robotics (ICINCO), Colmar, France, 21–23 July 2015. [Google Scholar]
Ren, L.; Xi, Z. Bias-Learning-Based Model Predictive Controller Design for Reliable Path Tracking of Autonomous Vehicles Under Model and Environmental Uncertainty. J. Mech. Des. 2022, 144, 091706. [Google Scholar] [CrossRef]
Rokonuzzaman, M.; Mohajer, N.; Nahavandi, S.; Mohamed, S. Review and performance evaluation of path tracking controllers of autonomous vehicles. IET Intell. Transp. Syst. 2021, 15, 646–670. [Google Scholar] [CrossRef]
Chen, G.; Zhao, X.; Gao, Z.; Hua, M. Dynamic drifting control for general path tracking of autonomous vehicles. IEEE Trans. Intell. Veh. 2023, 8, 2527–2537. [Google Scholar] [CrossRef]
Stano, P.; Montanaro, U.; Tavernini, D.; Tufo, M.; Fiengo, G.; Novella, L.; Sorniotti, A. Model predictive path tracking control for automated road vehicles: A review. Annu. Rev. Control 2022, 55, 194–236. [Google Scholar] [CrossRef]
Zhang, C.; Chu, D.; Liu, S.; Deng, Z.; Wu, C.; Su, X. Trajectory planning and tracking for autonomous vehicle based on state lattice and model predictive control. IEEE Intell. Transp. Syst. Mag. 2019, 11, 29–40. [Google Scholar] [CrossRef]
Bøhn, E.; Gros, S.; Moe, S.; Johansen, T.A. Reinforcement learning of the prediction horizon in model predictive control. IFAC-PapersOnLine 2021, 54, 314–320. [Google Scholar] [CrossRef]
Mohammadi, A.; Asadi, H.; Mohamed, S.; Nelson, K.; Nahavandi, S. MPC-based motion cueing algorithm with short prediction horizon using exponential weighting. In Proceedings of the 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Budapest, Hungary, 9–12 October 2016. [Google Scholar]
Yuan, H.; Sun, X.; Gordon, T. Unified decision-making and control for highway collision avoidance using active front steer and individual wheel torque control. Veh. Syst. Dyn. 2018, 57, 1188–1205. [Google Scholar] [CrossRef]
Huang, Y.; Ding, H.; Zhang, Y.; Wang, H.; Cao, D.; Xu, N.; Hu, C. A motion planning and tracking framework for autonomous vehicles based on artificial potential field elaborated resistance network approach. IEEE Trans. Ind. Electron. 2019, 67, 1376–1386. [Google Scholar] [CrossRef]
Morato, M.M.; Normey-Rico, J.E.; Sename, O. Model predictive control design for linear parameter varying systems: A survey. Annu. Rev. Control 2020, 49, 64–80. [Google Scholar] [CrossRef]
Du, X.; Htet, K.K.K.; Tan, K.K. Development of a genetic-algorithm-based nonlinear model predictive control scheme on velocity and steering of autonomous vehicles. IEEE Trans. Ind. Electron. 2016, 63, 6970–6977. [Google Scholar] [CrossRef]
Hewing, L.; Wabersich, K.P.; Menner, M.; Zeilinger, M.N. Learning-based model predictive control: Toward safe learning in control. Annu. Rev. Control Robot. Auton. Syst. 2020, 3, 269–296. [Google Scholar] [CrossRef]
Wang, G.; Jia, Q.-S.; Qiao, J.; Bi, J.; Zhou, M. Deep Learning-Based Model Predictive Control for Continuous Stirred-Tank Reactor System. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3643–3652. [Google Scholar] [CrossRef] [PubMed]
Rokonuzzaman, M.; Mohajer, N.; Nahavandi, S.; Mohamed, S. Model predictive control with learned vehicle dynamics for autonomous vehicle path tracking. IEEE Access 2021, 9, 128233–128249. [Google Scholar] [CrossRef]
Bao, H.; Kang, Q.; Shi, X.; Zhou, M.; Li, H.; An, J.; Sedraoui, K. Moment-Based Model Predictive Control of Autonomous Systems. IEEE Trans. Intell. Veh. 2023, 8, 2939–2953. [Google Scholar] [CrossRef]
Mehndiratta, M.; Camci, E.; Kayacan, E. Automated tuning of nonlinear model predictive controller by reinforcement learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar]
Zarrouki, B.; Klos, V.; Heppner, N.; Schwan, S.; Ritschel, R.; Voswinkel, R. Weights-varying mpc for autonomous vehicle guidance: A deep reinforcement learning approach. In Proceedings of the 2021 European Control Conference (ECC), Delft, The Netherlands, 29 June–2 July 2021. [Google Scholar]
Zarrouki, B.; Spanakakis, M.; Betz, J. A Safe Reinforcement Learning driven Weights-varying Model Predictive Control for Autonomous Vehicle Motion Control. arXiv 2024, arXiv:2402.02624. [Google Scholar]
Bøhn, E.; Gros, S.; Moe, S.; Johansen, T.A. Optimization of the model predictive control meta-parameters through reinforcement learning. Eng. Appl. Artif. Intell. 2023, 123, 106211. [Google Scholar] [CrossRef]
Lin, M.; Sun, Z.; Xia, Y.; Zhang, J. Reinforcement learning-based model predictive control for discrete-time systems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 3312–3324. [Google Scholar] [CrossRef]
Brown, M.; Funke, J.; Erlien, S.; Gerdes, J.C. Safe driving envelopes for path tracking in autonomous vehicles. Control Eng. Pract. 2017, 61, 307–316. [Google Scholar] [CrossRef]
Raff, T.; Huber, S.; Nagy, Z.K.; Allgower, F. Nonlinear model predictive control of a four tank system: An experimental stability study. In Proceedings of the 2006 IEEE Conference on Computer Aided Control System Design, 2006 IEEE International Conference on Control Applications, 2006 IEEE International Symposium on Intelligent Control, Munich, Germany, 4–6 October 2006. [Google Scholar]
Michalska, H.; Mayne, D.Q. Robust receding horizon control of constrained nonlinear systems. IEEE Trans. Autom. Control 1993, 38, 1623–1633. [Google Scholar] [CrossRef]
Krener, A.J. Adaptive horizon model predictive control. IFAC-PapersOnLine 2018, 51, 31–36. [Google Scholar] [CrossRef]
Wei, Y.; Wei, Y.; Gao, Y.; Qi, H.; Guo, X.; Li, M.; Zhang, D. A variable prediction horizon self-tuning method for nonlinear model predictive speed control on PMSM rotor position system. IEEE Access 2021, 9, 78812–78822. [Google Scholar] [CrossRef]
Hashimoto, K.; Adachi, S.; Dimarogonas, D.V. Event-triggered intermittent sampling for nonlinear model predictive control. Automatica 2017, 81, 148–155. [Google Scholar] [CrossRef]
Ma, A.; Liu, K.; Zhang, Q.; Liu, T.; Xia, Y. Event-triggered distributed MPC with variable prediction horizon. IEEE Trans. Autom. Control 2020, 66, 4873–4880. [Google Scholar] [CrossRef]
Gardezi, M.S.M.; Hasan, A. Machine learning based adaptive prediction horizon in finite control set model predictive control. IEEE Access 2018, 6, 32392–32400. [Google Scholar] [CrossRef]
Gillespie, T. Fundamentals of Vehicle Dynamics; SAE International: Warrendale, PA, USA, 2021. [Google Scholar]
Yu, S.; Böhm, C.; Chen, H.; Allgöwer, F. MPC with one free control action for constrained LPV systems. In Proceedings of the 2010 IEEE International Conference on Control Applications, Yokohama, Japan, 8–10 September 2010. [Google Scholar]
García, C.E.; Prett, D.M.; Morari, M. Model predictive control: Theory and practice—A survey. Automatica 1989, 25, 335–348. [Google Scholar] [CrossRef]
Guo, J.; Luo, Y.; Li, K. Adaptive neural-network sliding mode cascade architecture of longitudinal tracking control for unmanned vehicles. Nonlinear Dyn. 2017, 87, 2497–2510. [Google Scholar] [CrossRef]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Zhang, B.; Zong, C.; Chen, G.; Li, G. An adaptive-prediction-horizon model prediction control for path tracking in a four-wheel independent control electric vehicle. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2019, 233, 3246–3262. [Google Scholar] [CrossRef]

Figure 1. The concept architecture.

Figure 2. Vehicle dynamics model.

Figure 3. PPO-MPC framework.

Figure 4. The training results.

Figure 5. Reference path.

Figure 6. Performance of the PPO-MPC controller at

v = 10 m / s

. (a) Lateral deviation. (b) Yaw angle error. (c) Yaw rate. (d) Velocity error.

Figure 6. Performance of the PPO-MPC controller at

v = 10 m / s

. (a) Lateral deviation. (b) Yaw angle error. (c) Yaw rate. (d) Velocity error.

Figure 7. Performance of the PPO-MPC controller at

v = 15 m / s

. (a) Lateral deviation. (b) Yaw angle error. (c) Yaw rate. (d) Velocity error.

Figure 7. Performance of the PPO-MPC controller at

v = 15 m / s

. (a) Lateral deviation. (b) Yaw angle error. (c) Yaw rate. (d) Velocity error.

Figure 8. Performance of the PPO-MPC controller at

v = 20 m / s

. (a) Lateral deviation. (b) Yaw angle error. (c) Yaw rate. (d) Velocity error.

Figure 8. Performance of the PPO-MPC controller at

v = 20 m / s

. (a) Lateral deviation. (b) Yaw angle error. (c) Yaw rate. (d) Velocity error.

Figure 9. Performance comparison of trajectory tracking using PPO-MPC model predictive control at

v = 10 m / s