**1. Introduction**

The optimal control problem with phase constraints often has a multi-modal functional. Therefore, when it is solved numerically by a direct approach, one may obtain several control functions that move the object along different trajectories in the state space with approximately the same, near-optimal value of the control quality criterion.

A numerical solution to the optimal control problem involves several difficulties. As a rule, in most optimal control problems it is necessary to minimize not one but at least two criteria: to reach the control goal, i.e., to minimize the error of reaching the terminal state, and at the same time to minimize the given quality criterion. Introducing weight coefficients into the criteria does not significantly simplify the problem, since the problem of choosing the weights arises.

Another difficulty is the loss of unimodality of the functional on the space of parameters of the approximating function. Even a piecewise linear approximation of the control function, in which only one parameter per interval has to be found for each control component, does not guarantee a single minimum of the goal functional on the parameter space.

The problem becomes more complicated in the presence of phase constraints that describe the areas of the state space forbidden for the optimal trajectory. Most likely due to these reasons, and despite numerous attempts [1,2], a universal computational method for the optimal control problem has not yet been created.

Further studies have shown that if a strictly optimal solution is not needed and solutions close to the optimal are quite satisfactory, then evolutionary algorithms can be successfully applied to optimal control problems [3].

**Citation:** Diveev, A.; Sofronova, E.; Konstantinov, S.; Moiseenko, V. Reinforcement Learning for Solving Control Problems in Robotics. *Eng. Proc.* **2023**, *33*, 29. https://doi.org/10.3390/engproc2023033029

Academic Editors: Ivan Zelinka, Arutun Avetisyan and Alexander Ilin

Published: 15 June 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Sometimes in practice the researcher knows how the object should move along the optimal trajectory, i.e., approximately knows the areas in the state space the optimal trajectory should pass through. If additional requirements, in the form of passing through the given areas, are introduced into the quality criterion, then the evolutionary algorithm changes the search area and looks for a solution that satisfies these additional requirements. This approach is effective when evolutionary algorithms are used. Due to the inheritance property, the improvement of the criterion value at each generation is based on small evolutionary transformations of the possible solutions of the previous generation. Therefore, if at some generation one of the possible solutions passes through the areas specified by the researcher, then with high probability the evolutionary algorithm will continue searching for an optimal solution that preserves the obtained properties. A similar technique is used in reinforcement learning [4,5], where the researcher rewards the object, through a change of the target functional value, for the right actions. Currently, reinforcement learning is actively used in practice for solving control problems [6]. This paper contains a formal description and a practical application of reinforcement learning for solving the optimal control problem.

#### **2. The Optimal Control Problem and Reinforcement Learning**

Consider a formal statement of the optimal control problem.

The mathematical model of the control object is given as a system of ordinary differential equations in the Cauchy form

$$
\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}, \mathbf{u}),
\tag{1}
$$

where $\mathbf{x}$ is the state vector, $\mathbf{u}$ is the control vector, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{u} \in \mathrm{U} \subseteq \mathbb{R}^m$, and $\mathrm{U}$ is a compact set.

The initial state is given by

$$\mathbf{x}(0) = \mathbf{x}^0 \in \mathbb{R}^n. \tag{2}$$

The terminal state is given by

$$\mathbf{x}(t\_f) = \mathbf{x}^f \in \mathbb{R}^n,\tag{3}$$

where $t_f$ is the time to reach the terminal state (3). The time $t_f$ is not given, but it is bounded, $t_f \le t^+$, where $t^+$ is a given positive value.

The quality criterion is given by

$$J = \int\_0^{t\_f} f\_0(\mathbf{x}, \mathbf{u})dt \to \min\_{\mathbf{u} \in \mathcal{U}}.\tag{4}$$

Assume that the researcher knows the areas in the state space where the optimal trajectory should lie. Then additional conditions are included in the quality criterion

$$J\_1 = \int\_0^{t\_f} f\_0(\mathbf{x}, \mathbf{u})dt + p \sum\_{i=1}^r \psi\_i(\mathbf{x}(t)) \to \min\_{\mathbf{u} \in \mathcal{U}},\tag{5}$$

where

$$\psi\_{i}(\mathbf{x}) = \theta\Big(\min\_{t} \{ \|\mathbf{z}^{i} - \mathbf{x}(t)\| \} - \varepsilon\_{i}\Big) \Big(\min\_{t} \{ \|\mathbf{z}^{i} - \mathbf{x}(t)\| \} - \varepsilon\_{i}\Big),\tag{6}$$

$p$ is a penalty coefficient, and $\theta(\alpha)$ is the Heaviside step function

$$\theta(\alpha) = \begin{cases} 1, \text{if } \alpha > 0 \\ 0, \text{otherwise} \end{cases} \tag{7}$$

$\varepsilon_i$, $i = 1, \dots, r$, are given small positive values, and $\mathbf{z}^i$, $i = 1, \dots, r$, are the centres of the known areas.

According to the introduced additional conditions, if the optimal trajectory does not pass near some given point $\mathbf{z}^i$, then the value of criterion (5) grows.
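As an illustration, the penalty terms (5)–(7) are straightforward to evaluate on a sampled trajectory. The following sketch is not from the paper; the trajectory, the centre coordinates, and the tolerance values are hypothetical:

```python
import numpy as np

def heaviside(a):
    """Heaviside step function (7): 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0

def area_penalty(traj, z_i, eps_i):
    """Penalty psi_i from (6): positive only if the trajectory never
    comes within eps_i of the area centre z_i."""
    d_min = min(np.linalg.norm(z_i - x) for x in traj)
    return heaviside(d_min - eps_i) * (d_min - eps_i)

# hypothetical trajectory sampled along the diagonal of the plane
traj = [np.array([t, t]) for t in np.linspace(0.0, 10.0, 101)]
p_hit = area_penalty(traj, np.array([5.0, 5.0]), 0.5)   # passes through the area
p_miss = area_penalty(traj, np.array([0.0, 4.0]), 0.5)  # misses the area
```

A trajectory that enters the $\varepsilon_i$-neighbourhood of a centre contributes nothing; a trajectory that misses it is penalized by the shortfall distance, which is the reward-shaping role the paper assigns to criterion (5).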

#### **3. Computational Experiment**

Consider the optimal control problem for the spatial movement of a quadcopter. The mathematical model of the control object is

$$\begin{array}{rcl}
\dot{x}_1 &=& x_4,\\
\dot{x}_2 &=& x_5,\\
\dot{x}_3 &=& x_6,\\
\dot{x}_4 &=& u_4(\sin(u_3)\cos(u_2)\cos(u_1) + \sin(u_1)\sin(u_2)),\\
\dot{x}_5 &=& u_4\cos(u_3)\cos(u_1) - g_c,\\
\dot{x}_6 &=& u_4(\cos(u_2)\sin(u_1) - \cos(u_1)\sin(u_2)\sin(u_3)),
\end{array} \tag{8}$$

where $g_c = 9.80665$ is the gravitational acceleration.
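A direct transcription of the model (8) into code is useful for simulation. In this sketch the angle arguments of (8) are taken to be the control components $u_1$–$u_3$, with $u_4$ the thrust; this reading is an assumption of the illustration (it is consistent with the angular control bounds given in (9)):

```python
import numpy as np

G_C = 9.80665  # gravitational acceleration

def f(x, u):
    """Right-hand side of the quadcopter model (8).
    x = (x1..x6): positions and velocities.
    u = (u1..u4): u1..u3 read as the angle arguments of (8),
    u4 as thrust -- an assumption of this sketch."""
    u1, u2, u3, u4 = u
    return np.array([
        x[3],
        x[4],
        x[5],
        u4 * (np.sin(u3) * np.cos(u2) * np.cos(u1) + np.sin(u1) * np.sin(u2)),
        u4 * np.cos(u3) * np.cos(u1) - G_C,
        u4 * (np.cos(u2) * np.sin(u1) - np.cos(u1) * np.sin(u2) * np.sin(u3)),
    ])

# sanity check: zero angles with thrust balancing gravity gives hover
x0 = np.array([0.0, 5.0, 0.0, 0.0, 0.0, 0.0])
dx = f(x0, (0.0, 0.0, 0.0, G_C))
```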

The control is constrained:

$$u\_i^- \le u\_i \le u\_i^+, \quad i = 1, \dots, 4,\tag{9}$$

where $u_1^- = -\pi/12$, $u_1^+ = \pi/12$, $u_2^- = -\pi$, $u_2^+ = \pi$, $u_3^- = -\pi/12$, $u_3^+ = \pi/12$, $u_4^- = 0$ and $u_4^+ = 12$.

The initial state is given by

$$\mathbf{x}^{0} = \begin{bmatrix} 0 \ 5 \ 0 \ 0 \ 0 \ 0 \end{bmatrix}^{T}. \tag{10}$$

The terminal state is given by

$$\mathbf{x}^{f} = \begin{bmatrix} 10 \ 5 \ 10 \ 0 \ 0 \ 0 \end{bmatrix}^{T}. \tag{11}$$

The phase constraints are given by

$$\varphi\_k(\mathbf{x}) = r\_k - \sqrt{(\mathbf{x}\_1 - \mathbf{x}\_{k,1})^2 + (\mathbf{x}\_3 - \mathbf{x}\_{k,3})^2} \le 0,\tag{12}$$

where $k = 1, 2$, $r_1 = r_2 = 2$, $x_{1,1} = 2.5$, $x_{1,3} = 2.5$, $x_{2,1} = 7.5$, $x_{2,3} = 7.5$.
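The phase constraints (12) reduce to a point-in-disc test in the $\{x_1; x_3\}$ plane. A minimal check, using the obstacle data from (12), might look like this (the function name is ours):

```python
import math

# obstacle data from (12): (centre in the {x1, x3} plane, radius)
OBSTACLES = [((2.5, 2.5), 2.0), ((7.5, 7.5), 2.0)]

def phi(x1, x3, centre, r):
    """Constraint function (12): phi_k <= 0 means the point
    lies outside obstacle k."""
    c1, c3 = centre
    return r - math.sqrt((x1 - c1) ** 2 + (x3 - c3) ** 2)

violated_inside = phi(2.5, 2.5, *OBSTACLES[0]) > 0  # obstacle centre: violated
ok_outside = phi(0.0, 5.0, *OBSTACLES[0]) > 0       # initial state: safe
```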

It is necessary to find a control function, taking into account the constraints in (9), that minimizes the following criterion

$$J\_1 = t\_f + p\_1 \sum\_{k=1}^2 \int\_0^{t\_f} \theta(\varphi\_k(\mathbf{x})) dt \to \min\_{\mathbf{u} \in \mathbf{U}},\tag{13}$$

where $t_f \le t^+ = 5.6$ and $p_1 = 3$.

To solve the control problem numerically, a piecewise linear approximation is used. The time axis is divided into equal intervals $\Delta t$, and constant parameters are sought at the interval boundaries for each control component. The control is a piecewise linear function consisting of segments that connect the points at the interval boundaries. Taking the control constraints into account, the desired control function is

$$u\_{i} = \begin{cases} u\_{i}^{+}, \text{if } \hat{u}\_{i} \ge u\_{i}^{+} \\ u\_{i}^{-}, \text{if } \hat{u}\_{i} \le u\_{i}^{-} \\ \hat{u}\_{i}, \text{otherwise} \end{cases}, i = 1, \dots, 4,\tag{14}$$

where

$$\hat{u}\_i = d\_{i+(j-1)m} + \left(d\_{i+jm} - d\_{i+(j-1)m}\right)\frac{t - (j-1)\Delta t}{\Delta t},\ i = 1, \dots, 4,\ j = 1, \dots, K,\tag{15}$$

where $m = 4$ is the number of control components, $j$ is the index of the interval containing the current time, $(j-1)\Delta t \le t < j\Delta t$, and $K$ is the number of time interval boundaries

$$K = \left\lfloor \frac{t^+}{\Delta t} \right\rfloor + 1 = \left\lfloor \frac{5.6}{0.4} \right\rfloor + 1 = 15. \tag{16}$$
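The approximation (14)–(16) maps a parameter vector $\mathbf{d}$ of $4 \cdot K = 60$ values to a control signal. A sketch with clipping per (14) and node selection per (15) follows; the reshape layout of $\mathbf{d}$ is an implementation choice of this illustration:

```python
import numpy as np

M = 4                       # number of control components
DT = 0.4                    # interval length Delta t
T_PLUS = 5.6                # time limit t+
# (16): floor(5.6/0.4) + 1 = 15; round() guards against floating-point error
K = round(T_PLUS / DT) + 1
U_MIN = np.array([-np.pi / 12, -np.pi, -np.pi / 12, 0.0])
U_MAX = np.array([np.pi / 12, np.pi, np.pi / 12, 12.0])

def control(t, d):
    """Piecewise linear control (14)-(15): d packs M node values per
    boundary; linear interpolation inside the active interval, then
    saturation to the bounds (9)."""
    nodes = np.asarray(d, dtype=float).reshape(-1, M)  # row j-1 = node at (j-1)*DT
    j = min(int(t / DT) + 1, nodes.shape[0] - 1)       # 1-based interval index as in (15)
    lo, hi = nodes[j - 1], nodes[j]
    u_hat = lo + (hi - lo) * (t - (j - 1) * DT) / DT
    return np.clip(u_hat, U_MIN, U_MAX)                # saturation (14)
```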

When solving the problem by a direct approach, the condition of reaching the terminal state is included in the quality criterion

$$J\_2 = t\_f + p\_1 \sum\_{k=1}^2 \int\_0^{t\_f} \theta(\varphi\_k(\mathbf{x})) dt + p\_2 \|\mathbf{x}^f - \mathbf{x}(t\_f)\| \to \min\_{\mathbf{u} \in \mathbf{U}} \tag{17}$$

where *p*<sup>2</sup> = 1,

$$t\_f = \begin{cases} t, \text{if } t \le t^+ \text{ and } \|\mathbf{x}^f - \mathbf{x}(t)\| \le \varepsilon = 0.01\\ t^+, \text{otherwise} \end{cases} \tag{18}$$
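Putting (17) and (18) together, one evaluation of a candidate control by the evolutionary algorithm amounts to a single simulation run. Below is a minimal sketch; the Euler integrator, the step size, and the generic callable interfaces are our choices, not the paper's:

```python
import numpy as np

def simulate_J2(f, control, x0, xf, constraints, t_plus=5.6,
                p1=3.0, p2=1.0, eps=0.01, h=1e-3):
    """Euler integration with the stopping rule (18) and criterion (17).
    f(x, u) is the model right-hand side, control(t) the candidate
    control, constraints a list of functions phi_k(x) as in (12)."""
    t, x = 0.0, np.array(x0, dtype=float)
    penalty = 0.0
    while t < t_plus and np.linalg.norm(np.asarray(xf) - x) > eps:
        u = control(t)
        # accumulate the phase-constraint penalty integral of (17)
        penalty += sum(1.0 for phi in constraints if phi(x) > 0) * h
        x = x + h * np.asarray(f(x, u))  # Euler step
        t += h
    # (18): t_f is the hit time if the terminal state was reached, else t+
    t_f = t if np.linalg.norm(np.asarray(xf) - x) <= eps else t_plus
    return t_f + p1 * penalty + p2 * np.linalg.norm(np.asarray(xf) - x)
```

In the experiment, `f` would be the quadcopter model (8) and `control` the piecewise linear function (14).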

To solve the problem, a hybrid evolutionary algorithm [7] is used.

Figures 1 and 2 show the projections of the two found optimal trajectories on the horizontal plane $\{x_1; x_3\}$. The big circles represent the phase constraints (12).

**Figure 1.** Projection of optimal trajectory 1 on the horizontal plane.

**Figure 2.** Projection of optimal trajectory 2 on the horizontal plane.

The criterion for the solutions found had the following values: $J_2 = 5.6434$ for the solution in Figure 1 and $J_2 = 5.6330$ for the solution in Figure 2.

As can be seen from the experiment, the criterion values practically coincide, and both solutions ensure the movement of the object from the given initial state (10) to the given terminal state (11) without violating the phase constraints. In the series of experiments, the hybrid evolutionary algorithm found solutions that bypass the phase constraints either from above, as in Figure 1, or from below, as in Figure 2.

Now suppose that the control object is required to move between the obstacles. For this purpose, desired areas on the horizontal plane are defined. It is known that in the presence of interfering phase constraints, the optimal trajectory should pass close to the boundaries of these constraints. For the given problem, four desired areas are defined as

$$\begin{array}{l} \mathbf{z}^1 = [2.5 \, 0.4]^T, \, \varepsilon\_1 = 0.6, \\ \mathbf{z}^2 = [4.5 \, 2.5]^T, \, \varepsilon\_2 = 0.6, \\ \mathbf{z}^3 = [5.5 \, 7.5]^T, \, \varepsilon\_3 = 0.6, \\ \mathbf{z}^4 = [7.5 \, 9.6]^T, \, \varepsilon\_4 = 0.6. \end{array} \tag{19}$$

The conditions for passing through the desired areas (19) are included in the quality criterion

$$J\_3 = t\_f + p\_1 \sum\_{k=1}^2 \int\_0^{t\_f} \theta(\varphi\_k(\mathbf{x}))dt + p\_2 \|\mathbf{x}^f - \mathbf{x}(t\_f)\| + p\_3 \sum\_{i=1}^4 \psi\_i(\mathbf{x}) \to \min\_{\mathbf{u} \in \mathbf{U}},\tag{20}$$

where *p*<sup>3</sup> = 3.

The hybrid evolutionary algorithm found the following optimal solution: $\mathbf{d} = [d_1 \dots d_{60}]^T$ = [−11.1092, 5.9957, −0.0532, 7.0045, 17.5091, 18.1172, −1.6764, 18.7012, −16.0121, 10.5543, −19.9307, 12.1721, −6.0892, 0.5339, −0.8616, 19.2556, −13.7218, 15.1266, 0.3982, 14.2650, −1.1768, 2.9832, 4.3286, 15.1508, −8.9240, −19.6814, 4.5363, 15.9879, −0.0026, 1.1203, 13.2592, −6.6358, −6.2012, −0.5328, −0.0354, 4.2548, 11.6764, −4.3345, −6.7336, 19.8643, 0.3360, −8.9741, −2.6648, 12.5608, 19.6577, −19.9308, −1.6252, 19.3797, −1.1954, 2.2625, 5.9582, 16.0807, −0.8272, 2.3167, 0.9842, 14.2695, −6.3767, 2.3895, 0.3742, 16.2710]$^T$.

Figure 3 shows the projection of the found optimal trajectory for the solution with quality criterion $J_3 = 5.5730$. The small dashed circles are the desired areas, while the big circles are the phase constraints.

**Figure 3.** Projection on the horizontal plane of the optimal trajectory found by reinforcement learning.

To implement the obtained solution according to the extended statement of the optimal control problem, it is necessary to build a system that stabilizes the movement of the object along the optimal trajectory [8]. For this purpose, machine learning control is used [9]. The search for the control function structure is carried out by symbolic regression [10].

The obtained solution is

$$u\_i = \begin{cases} u\_i^+, \text{ if } \tilde{u}\_i \ge u\_i^+ \\ u\_i^-, \text{ if } \tilde{u}\_i \le u\_i^- \\ \tilde{u}\_i, \text{ otherwise} \end{cases}, \ i = 1, \dots, 4,\tag{21}$$

where

$$\tilde{u}_1 = \mu(G),\tag{22}$$

$$\tilde{u}_2 = (\tilde{u}_1 - \tilde{u}_1^3)\,\rho_{17}(A + \mu(G))\,\theta(F)\,\rho_{17}(x_4^* - x_4),\tag{23}$$

$$\tilde{u}_3 = \tilde{u}_2 + \tanh(\tilde{u}_1) + \rho_{19}(A + \mu(G)) + \rho_{17}(W),\tag{24}$$

$$\begin{array}{rcl} \tilde{u}_4 &=& \tilde{u}_3 + \ln|\tilde{u}_2| + \mathrm{sgn}(A + \mu(G))\sqrt{|A + \mu(G)|} + \rho_{19}(A) + \arctan(C) + {}\\ && \mathrm{sgn}(E) + \arctan(F) + \exp(q_2(x_2^* - x_2)) + \sqrt{q_1}\,\rho, \end{array}\tag{25}$$

$$\begin{array}{l}
A = B\,\mathrm{sgn}(D)\sqrt{|D|}\tanh(D)\exp(H),\\
B = \exp(C) + \rho_{17}(F) + \cos(q_6(x_6^* - x_6)),\\
C = D + \tanh(E) + \rho_{18}(V),\\
D = E + \sqrt[3]{F} + \sin(W),\\
E = F + G + \exp(H) - V,\\
F = H + \mathrm{sgn}(x_5^* - x_5) + (x_2^* - x_2)^3,\\
G = q_6(x_6^* - x_6) + q_3(x_3^* - x_3) + \mathrm{sgn}(x_2^* - x_2)\sqrt{|x_2^* - x_2|},\\
H = \rho_{17}(q_6(x_6^* - x_6) + q_3(x_3^* - x_3)) + V^3 + W + q_6 q_5^2 (x_5^* - x_5)^2 + (x_5^* - x_5)^2,\\
V = \sin(q_6(x_6^* - x_6)) + q_5(x_5^* - x_5) + q_2(x_2^* - x_2) + \cos(q_1) + \exp(x_5^* - x_5) + \theta(x_2^* - x_2),\\
W = q_4(x_4^* - x_4) + q_1(x_1^* - x_1) + \sin(q_6),
\end{array}$$

$$\mu(\alpha) = \begin{cases} \alpha, & \text{if } |\alpha| \le 1 \\ \mathrm{sgn}(\alpha), & \text{otherwise} \end{cases},\quad
\rho_{17}(\alpha) = \mathrm{sgn}(\alpha)\ln(|\alpha| + 1),$$

$$\rho_{18}(\alpha) = \mathrm{sgn}(\alpha)(\exp(|\alpha|) - 1),\quad
\rho_{19}(\alpha) = \mathrm{sgn}(\alpha)\exp(-|\alpha|),$$

where $q_1 = 13.02930$, $q_2 = 11.21509$, $q_3 = 15.91016$, $q_4 = 14.33447$, $q_5 = 14.67798$, $q_6 = 9.91431$, and $\mathbf{x}^* = [x_1^* \dots x_6^*]^T$ is the state vector of the reference model.
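The elementary functions used in (22)–(25) are simple to implement. The sketch below mirrors their definitions; the function names are ours:

```python
import math

def mu(a):
    """Saturation mu(alpha) from (22)-(25): identity on [-1, 1],
    sign of the argument outside it."""
    return a if abs(a) <= 1 else math.copysign(1.0, a)

def rho17(a):
    """rho_17(alpha) = sgn(alpha) * ln(|alpha| + 1)."""
    return math.copysign(math.log(abs(a) + 1.0), a)

def rho18(a):
    """rho_18(alpha) = sgn(alpha) * (exp(|alpha|) - 1)."""
    return math.copysign(math.exp(abs(a)) - 1.0, a)

def rho19(a):
    """rho_19(alpha) = sgn(alpha) * exp(-|alpha|)."""
    return math.copysign(math.exp(-abs(a)), a)
```

These are the standard "protected" unary operations of symbolic regression: each is odd, monotone, and defined on the whole real line, which keeps the evolved expressions (22)–(25) well defined for any argument.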

Figure 4 shows the trajectories from eight initial states on the horizontal plane.

**Figure 4.** Projection on the horizontal plane of the optimal trajectories from eight initial states.

#### **4. Results**

The paper presents the use of reinforcement learning to solve the optimal control problem with phase constraints by an evolutionary algorithm. To implement reinforcement learning, additional conditions defining the form of the optimal trajectory are introduced into the quality criterion: the optimal trajectory should pass through specified areas whose positions depend on the phase constraints. An example of solving the optimal control problem for a quadcopter by reinforcement learning was given.

#### **5. Discussion**

The use of reinforcement learning technology to solve the optimal control problems of robotic devices is advisable, since in most cases the developer approximately knows the form of the optimal trajectory for the problem being solved.

**Author Contributions:** Conceptualization, A.D. and E.S.; methodology, A.D. and S.K.; software, A.D., S.K. and E.S.; validation, A.D. and V.M.; formal analysis, E.S.; investigation, S.K.; data curation, A.D., S.K. and E.S.; writing—original draft preparation, A.D, E.S. and V.M.; writing—review and editing, E.S.; visualization, V.M.; supervision, A.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work was performed with partial support from the Russian Science Foundation, Project No 23-29-00339.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data that support the findings of this study are openly available at https://cloud.mail.ru/public/pKW4/fmrfqjkv3 (for optimal trajectory 1) and https://cloud.mail.ru/public/mVZ5/Kr2yb9jJn (for optimal trajectory 2).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
