# **Advances in Artificial Intelligence Methods Applications in Industrial Control Systems**

Edited by

Emanuele Carpanzano

Printed Edition of the Special Issue Published in *Applied Sciences*

www.mdpi.com/journal/applsci

## **Advances in Artificial Intelligence Methods Applications in Industrial Control Systems**

## **Advances in Artificial Intelligence Methods Applications in Industrial Control Systems**

Editor

**Emanuele Carpanzano**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Emanuele Carpanzano University of Applied Sciences and Arts of Southern Switzerland Switzerland

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Applied Sciences* (ISSN 2076-3417) (available at: https://www.mdpi.com/journal/applsci/special issues/ai industrial control systems).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-6808-9 (Hbk) ISBN 978-3-0365-6809-6 (PDF)**

Cover image courtesy of Laboratory of Automation, Robotics and Machines (ARM-Lab) of the University of Applied Sciences and Arts of Southern Switzerland (SUPSI).

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


## **About the Editor**

#### **Emanuele Carpanzano**

Emanuele Carpanzano is Director of Research, Development, and Knowledge Transfer at the University of Applied Sciences and Arts of Southern Switzerland. He has managed numerous R&D initiatives at international, national and regional level in addition to industrial research and knowledge transfer projects. He is active in different federal and international associations and institutions dedicated to education, research, and innovation programs and initiatives. At international level, he is member of CIRP (The International Academy for Production Engineering) and of EIT (European Institute of Technology) Manufacturing. At federal level, he is a member of SATW (Schweizerische Akademie der Technischen Wissenschaften), of the board of directors of FTAL (Fachkonferenz Technik, Architektur und Life Sciences), of Euresearch, and of the AM-TTC Alliance (Advanced Manufacturing Technology Transfer Centers Alliance). At regional level, he is a member of the boards of the Lifestyle Tech Competence Center and of SDBC, Swiss Drone Base Camp, both part of Innovation Park Ticino. As a scientific expert, he supervises, acting also as reviewer and independent scientist, research and innovation activities and projects at local, national, and European level. His research interests and activities are focused on industrial control and automation systems as well as on the digitalization of production systems and value chains, including evolution of related human aspects. He is Professor of industrial plants at SUPSI and author of more than 140 scientific papers as well as of different industrial patents in his applied research fields."

## **Preface to "Advances in Artificial Intelligence Methods Applications in Industrial Control Systems"**

Artificial intelligence methods are being increasingly applied at different industrial control systems levels, from single automation devices up to the real-time control of complex machines, production processes, and overall factories in terms of supervision and optimization. Such innovative solutions are being exploited with reference to different industrial control applications from sensor fusion methods to novel model predictive control techniques, from self-optimizing machines to collaborative robots, and from factory adaptive automation systems to production supervisory control systems. The aim of this Special Issue is to provide an overview of novel applications of AI methods to industrial control systems, so as to improve the production systems self-learning capacities, their overall performance, the related process and product quality, the optimal use of resources, and industrial systems safety and resilience to varying boundary conditions and production requests.

> **Emanuele Carpanzano** *Editor*

## *Editorial* **Editorial of the Special Issue "Advances in Artificial Intelligence Methods Applications in Industrial Control Systems"**

**Emanuele Carpanzano**

Department of Innovative Technologies, University of Applied Sciences and Arts of Southern Switzerland, 6928 Manno, Switzerland; emanuele.carpanzano@supsi.ch

#### **1. Motivation for the Special Issue**

Today, Artificial Intelligence (AI) applications are considered to be of increasing relevance for the future of industrial control systems [1–3]. AI methods are increasingly being applied at different industrial control systems levels, from single automation devices up to the real-time control of complex machines, production processes and overall factories supervision and optimization.

AI solutions are exploited with reference to different industrial control applications, from sensor fusion methods to novel model predictive control techniques, from selfoptimizing machines to collaborative robots, and from factory adaptive automation systems to production supervisory control systems [1,4–13].

The motivation for the present Special Issue is to provide an overview of novel applications of AI methods to industrial control systems by means of selected best practices, to highlight how such methodologies can be used to improve the production systems self-learning capacities, their overall performance, the related process and product quality, the optimal use of resources and the industrial systems safety, and resilience to varying boundary conditions and production requests [1,4,12,13].

#### **2. Summary of the Special Issue Contents**

This Special Issue includes seven papers: a perspective paper to frame the ongoing advances in AI methods applications in Industrial Control Systems to realize self-optimizing manufacturing systems [1], and six papers presenting novel applications of AI methods with reference to specific control problems and different application sectors [5–10]. Further details about the seven articles are provided in the following.

The perspective paper "Advances in Artificial Intelligence Methods Applications in Industrial Control Systems: Towards Cognitive Self-Optimizing Manufacturing Systems", by Emanuele Carpanzano and Daniel Knüttel [1], provides an overview of the recent progress on different hierarchic and functional levels of industrial control solutions, discussing how AI-based methods enable further enhancements in many applications and sectors. Promising algorithms and methods are presented and discussed for every control level following a bottom-up approach, starting from the sensor level and data fusion to more complex applications such as self-optimizing machines. It is outlined how the combination of advanced monitoring, modeling, control and scheduling methods will, in the future, allow for the development of self-optimizing machines, leading to production machine improvements in terms of product quality, productivity and resource efficiency, and representing a crucial point for the next generation of human-centric manufacturing systems.

The article "*Adaptive Optimal Robust Control for Uncertain Nonlinear Systems Using Neural Network Approximation in Policy Iteration*", by Dengguo Xu, Qinglin Wang and Yuan Li [5], presents an optimal adaptive control approach to solve the robust control problems of nonlinear systems with internal and input uncertainties, based on the policy iteration

**Citation:** Carpanzano, E. Editorial of the Special Issue "Advances in Artificial Intelligence Methods Applications in Industrial Control Systems". *Appl. Sci.* **2023**, *13*, 16. https://doi.org/10.3390/ app13010016

Received: 14 December 2022 Accepted: 19 December 2022 Published: 20 December 2022

**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

(PI) in reinforcement learning (RL). First, the robust control is converted into solving an optimal control containing a nominal or auxiliary system with a predefined performance index. It is demonstrated that the optimal control law enables the considered system to become globally asymptotically stable for all admissible uncertainties. Second, based on the Bellman optimality principle, online PI algorithms are proposed to calculate robust controllers for the matched and mismatched uncertain systems. The approximate structure of the robust control law is obtained by approximating the optimal cost function with the neural network in PI algorithms. Finally, to illustrate the availability of the proposed algorithm and theoretical results, some numerical examples are provided.

The paper "*Digital Twin for Designing and Reconfiguring Human–Robot Collaborative Assembly Lines*", by Niki Kousi, Christos Gkournelos, Sotiris Aivaliotis, Konstantinos Lotsaris, Angelos Christos Bavelos, Panagiotis Baris, George Michalos and Sotiris Makris [6], discusses the use of DTs (Digital Twins) for designing and redesigning flexible production systems. These systems employ mobile dual arm robots that move across the factory undertaking multiple tasks, assisting humans. Exploiting this hardware ability, the DT based re-design system generates optimal configurations in terms of layout and task plans. The solution allows for online reconfiguration of the system by (a) dynamically re-assigning the tasks when unexpected events occur, and (b) making real-time adjustments of robots' behavior to ensure collision-free trajectories generation in unstructured environments. These are achieved in addition to the sensor-based real-time scene reconstruction that is provided by the Digital Twin by synthesizing multiple sensor data. The discussed DT-based system was deployed, tested and validated in a case study from the automotive sector.

The contribution "Manufacturing Execution System Integration through the Standardization of a Common Service Model for Cyber-Physical Production Systems", by Richárd Beregi, Gianfranco Pedone, Borbála Háy and József Váncza [7], proposes a generalized common service model and architecture of CPS (Cyber Physical Systems)-based manufacturing execution systems. The core model is minimalist as far as its underlying assumptions are concerned. Hence, it does not constrain the decision autonomy of collaborating cyberphysical entities and "only" provides channels for transferring and synchronizing the information that ensues from their decisions. The proposed approach identified the elementary concepts, such as functions, calls, variables, and reports, as the basis for modeling and providing I4.0-compliant, CPS-based services in a manufacturing environment. They were developed via standardized technologies enabling semantic interoperability and openness (OPC UA, MQTT). The universal CPS-based service integration mechanism was validated in an experimental pilot production and logistics system that included a variety of heterogeneous and autonomous resources, such as manufacturing cells, AGVs, robots, and human–robot collaborative cells. These CPS components were connected and controlled via the plug and collaborate mechanism of a MES system in a number of complex scenarios.

The manuscript "*Decentralized Multi-Agent Control of a Manipulator in Continuous Task Learning"*, by Asad Ali Shahid, Jorge Said Vidal Sesin, Damjan Pecioski, Francesco Braghin, Dario Piga and Loris Roveda [8], formulates a decentralized and multi-agent approach in the form of an RL (Reinforcement Learning) problem, demonstrating the possibility of decentralizing a single manipulator controller by applying multiple agents in learning continuous actions. The purpose of this paper was to compare the feasibility of decentralization for a single robot to show the modularity in the joints, as well as to make a comparison between the centralized and decentralized approach. The results show that it is possible to decentralize the control action on the robot. Using multiple agents shows the stagnation of the learning process when using the same reward function that is used for a single agent. It is believed that this is due to the lack of communication between the agents and the generality of the reward when considering two agents.

The paper "*Predictive Control for Small Unmanned Ground Vehicles* via *a Multi-Dimensional Taylor Network*", by Yuzhan Wu, Chenlong Li, Changshun Yuan, Meng Li and Hao Li [9], presents an improved predictive control scheme based on the MTN (multi-dimensional Taylor network) for tracking the control of SUGVs (Small Unmanned Ground Vehicles). The traditional objective function was improved to obtain a predictive objective function with the differential term. The optimal control quantity was given in real time through iterative optimization. A tracking control experiment was carried out on an SUGV to verify the effectiveness of the proposed scheme. The results show that the proposed scheme is effective and has good a real-time, robustness, and convergence performance, which ensure that the vehicle can quickly and accurately track the desired yaw velocity signal and that it is superior to the traditional MTN and RBF (Radial Basis Function) predictive control schemes.

The article "*Design of a NARX-ANN-Based SP Controller for Control of an Irrigation Main Canal Pool*", by Ybrain Hernandez-Lopez, Raul Rivas-Perez and Vicente Feliu-Batlle [10], proposes a modification of the well-known Smith predictor controller, in which the internal linear model was substituted by the combination of a NARX-ANN-based model and a TD-NARX-ANN-based model, in order to take into account the dynamic nonlinearities in the effective control of water distribution in an irrigation main canal pool. By the application of system identification procedures, NARX-ANN and a TD-NARX-ANN with recurrent architectures were obtained which describe with high accuracy the non-linear dynamic behavior of the water distribution in the studied canal pool. The NARX-ANN structure with an input layer with 13 memory blocks for the input and output signals, three neurons in the hidden layer and one neuron in the output layer, provided the best model performance. A third model—which was linear and represented by a time-delay first-order transfer function—was obtained using an identification procedure. The validation results of the three models illustrate that the FIT performance indexes of the NARX-ANN-based models are higher than that of the linear model.

#### **3. Discussion and Concluding Remarks**

By means of its seven scientific papers, the present Special Issue clearly illustrates the increasing added value of the introduction of AI methods to improve the performance of control solutions with reference to different control and automation problems in different industrial applications and sectors [1,4–10], ranging from single manipulators [8] or small unmanned ground vehicles [9], up to complex manufacturing plants [7] or irrigation systems [10]. Additionally, the role of AI to improve the performance of relevant engineering methodologies and digital instruments, such as Cyber Physical Systems [7], Digital Twins and Human-Robot Collaboration [6], are also effectively addressed in the contributions to the Special Issue.

Even if many research initiatives are ongoing, such as those discussed in the present Special Issue, many relevant challenges remain to be faced in the near future to exploit the high potential of AI methods for industrial control applications. In particular, it is of major importance to support the adoption of such novel solutions in real-world industrial applications, as well as by small- and medium-sized companies [2,3,11], and to integrate AI-based techniques to steer the co-evolution of human workers' tasks and manufacturing systems frameworks, in order to improve production systems' performance, as well as human workers' safety and well-being [12,13].

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Perspective* **Advances in Artificial Intelligence Methods Applications in Industrial Control Systems: Towards Cognitive Self-Optimizing Manufacturing Systems**

**Emanuele Carpanzano 1,\* and Daniel Knüttel 2,3**


**Abstract:** Industrial control systems play a central role in today's manufacturing systems. Ongoing trends towards more flexibility and sustainability, while maintaining and improving production capacities and productivity, increase the complexity of production systems drastically. To cope with these challenges, advanced control algorithms and further developments are required. In recent years, developments in Artificial Intelligence (AI)-based methods have gained significantly attention and relevance in research and the industry for future industrial control systems. AI-based approaches are increasingly explored at various industrial control systems levels ranging from single automation devices to the real-time control of complex machines, production processes and overall factories supervision and optimization. Thereby, AI solutions are exploited with reference to different industrial control applications from sensor fusion methods to novel model predictive control techniques, from self-optimizing machines to collaborative robots, from factory adaptive automation systems to production supervisory control systems. The aim of the present perspective paper is to provide an overview of novel applications of AI methods to industrial control systems on different levels, so as to improve the production systems' self-learning capacities, their overall performance, the related process and product quality, the optimal use of resources and the industrial systems safety, and resilience to varying boundary conditions and production requests. Finally, major open challenges and future perspectives are addressed.

**Keywords:** control systems; industrial automation; artificial intelligence; machine learning; self-learning machine tools; adaptive production systems

#### **1. Introduction**

Currently, manufacturing companies have to deal with many challenges in order to remain competitive within the rapidly changing market dynamics and framework conditions, including socio-economic, environmental and cultural aspects.

Short delivery times and customization along with comparable production costs and high-quality parts are order-winning properties and therefore vital for business. Additionally, accelerated product developments and industrializations are important attributes to quickly cover market developments with brief time to markets. Moreover, upcoming regulations as well as growing social and environmental requirements represent additional challenges to be faced by the industry. Key enabling factors to face such challenges are production machines, whose developments are crucial to meet the cited and constantly growing demands of modern manufacturing companies.

To keep pace with these ongoing evolutions and master upcoming challenges, industrial control systems are a key factor for advanced production machines and industrial

**Citation:** Carpanzano, E.; Knüttel, D. Advances in Artificial Intelligence Methods Applications in Industrial Control Systems: Towards Cognitive Self-Optimizing Manufacturing Systems. *Appl. Sci.* **2022**, *12*, 10962. https://doi.org/10.3390/ app122110962

Academic Editor: Alessandro Gasparetto

Received: 29 September 2022 Accepted: 24 October 2022 Published: 29 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

plants [1]. Closed loop and real time control solutions are already widely used and are well established on various levels of production systems. However, with increasing complexity of the manufacturing systems and control requirements, further developments of control algorithms are indispensable to address the demand for future manufacturing systems.

Promising approaches for improved industrial control systems are based on recent developments in artificial intelligence (AI). AI has obtained wide popularity in the past years and is one of the most pursued topics in research and industry. Progress in computing hardware and software has facilitated the collection of data and its exploration with AI for various use cases. Initiatives such as Industry 4.0 have further highlighted the potential of AI-based methods and have pushed them to the centre of attention in many fields [2]. A report from McKinsey & Company states that AI could contribute with a value of USD 3.6–6.6 trillion annually in the supply-chain management and manufacturing sector [3].

Especially in the area of manufacturing, AI-based developments further expand the capabilities and pave the way for more advanced industrial control systems. Current applications and trends are shown in the following section for various levels.

In addition to existing works about applications of AI or machine learning in manufacturing and Industry 4.0 [4–7], this work provides a perspective overview of the current state of the art, recent developments and future challenges for each manufacturing process and control level.

Therefore, this paper is structured following a hierarchical and functional bottomup approach: starting from the processing and elaboration of sensor signals to the job scheduling operations of the whole production plant. Thereby, the focus lies on identifying and illustrating most promising artificial intelligence-based methods coping with current challenges at each industrial control and automation level, including the related data acquisition, process modelling, real-time control and task scheduling operations. Due to the plethora of existing algorithms, the aim is not to present a complete overview of existing approaches and methods, but to highlight the most promising ones and the potential of their future development and applications to face modern industrial control and automation challenges, by means of emerging AI-based methodologies. In particular, the key role of the integration of AI methods in industrial control solutions is to strive for future selfoptimizing production systems. For the successful implementation of self-optimizing machines, all components over all hierarchies must be optimally designed and operational. Figure 1 schematically illustrates the main element of self-optimizing production systems and forms the basis for the focus of the first sections in the present review and survey paper. Starting from a controller dedicated to a set of well-defined process variables, multiple aspects must be considered to enable an accurate and reliable control that is elementary for subsequent higher-level control structures. The first step of each controller is the direct or indirect measurement of the process variable under consideration. In order to obtain the most relevant information and controller inputs, sensor signals must be processed and elaborated accordingly. Since amount, type and quality of sensor signals can vary significantly, different methods are required to process them in an appropriate manner. In particular, trends towards the utilization of multiple sensors and the consequently increasing data availability require new developments towards suitable approaches for sensor and data fusion. Related current approaches and AI-based methods are discussed in Section 2.

A further important element within optimized controllers and manufacturing applications are process models. Depending on the use case, these process models can be generated conventionally with multi-physics simulations as finite element methods or more recently, data-driven approaches enable the generation of process models based on the previously acquired data. However, not all process models are suitable for control applications. An overview of available approaches and recent developments concerning AI-based process models is given in Section 3. By processing the measurement values and querying process models, the optimizer determines the optimal controller output. Depending on the application, these processes must be performed very fast or even in

real time and with as much autonomy as possible, requiring intelligent and specialized solutions. Therefore, promising AI-based methods to implement effective optimizers are presented in Section 4.

**Figure 1.** Model-based optimized controller.

A further concatenation of multiple optimal controllers enables an advanced control strategy over all hierarchical process levels. The main elements of self-optimizing machines are their capability to autonomously adapt to changing requirements and circumstances. Therefore, the task scheduler coupled with multiple cascaded control chains allows one to redefine and modify control objectives of control chains towards higher-level optimal objectives. Thereby, an example for the cascaded control chains could be the process control of an additive manufacturing process, where the first controller guarantees a constant energy density during the deposition process and the second controller acts layerwise by modulating the process parameter for the subsequent layer to obtain an equal deposition height. In an exemplary additive manufacturing process, the two control loops could represent the regulation of two laser sources, simultaneously building various features of a more complex part. A simplified overview of a self-optimizing framework is displayed in Figure 2.

**Figure 2.** Self-optimizing production system control framework.

AI methods can enhance the self-optimization capacity of a cognitive industrial control system at different hierarchical and functional levels. The combination of models and realtime data, as well as of real-time control and job scheduling functionalities, allows for the development of self-optimizing industrial machines and systems with new elevated degree of autonomy. Current methods and developments of job scheduling functionalities are described in Section 5, whereas approaches and frameworks towards cognitive selfoptimizing machines are outlined in Section 6.

#### **2. Methods for Industrial Data Fusion at Sensor, Feature and Decision Level**

Efficiency, repeatability, robustness, and quality of manufacturing processes are of great importance in today's production sites. Monitoring systems enable to capture parameters and system states in real time, allowing to effectively control manufacturing processes. The quality and performance of these automation systems stand and fall with the reliability and precision of available information. As a result, robust and accurate process measurements are of crucial relevance to satisfy higher-quality requirements.

A common approach to tackle these issues is a fast developing and huge research field called sensor fusion. The scope of data fusion, also known as sensor fusion, multisensor data fusion, multi-sensor integration, data combination or data aggregation is to obtain improved information with higher accuracy by merging multiple information sources. It fosters the robustness, reliability and confidence of the analysed information and improves the temporal and spatial resolution [8]. Initial developments of sensor-fusionbased systems go back to the military industry, where the authentication and tracking of dynamic objects is of crucial importance. Recently, further sectors such as the automotive sector experience a strong increase in sensor fusion for autonomous vehicles to accurately map their environment.

There have been multiple attempts to define a definition and framework for sensor data fusion. One of the most popular definitions has been presented by the Joint Directors of Laboratories (JDL) [9] describing sensor fusion as a multi-level process handling the association, correlation and combination of data and information from single and multiple sources. This section focuses on the application of AI-based algorithms for industrial data fusion. For a deeper insight in further data fusion architectures, the reader is referred to [10–12].

To capitalize on the benefits of sensor data fusion, some challenges must be overcome for a successful implementation. There are various data properties affecting the feasibility and success of the data fusion process. The main challenges of data fusion arise from the heterogeneity of data sources and individual imperfections of each data source such as uncertainty, imprecision or granularity. An overview of typical data fusion issues is displayed in Figure 3.

**Figure 3.** Taxonomy describing data-fusion-related issues.

Depending on the considered data type and its fusion issues, suitable fusion algorithms must be selected. Today's available algorithms only address a subset of these aspects. Due to the multitude of applications and available algorithms, the following section does not claim to be complete, but identifies promising methodologies for applications in industrial control solutions. Furthermore, there are many ways to classify fusion algorithms. Meng et al. [13] presented the algorithms according to their abstraction level as signal level, feature level and decision level, whereas Sasiadek et al. [14] divides fusion approaches in probabilistic models, least-square techniques and intelligent fusion.

On a *signal level*, the most intuitive and easiest way to increase the signal-to-noise ratio of redundant sources is to apply the weighted average or the least squares method on captured data. A more sophisticated and widespread method is the Kalman Filter. The Kalman Filter considers the statistical characteristics of the measurements while still allowing for real-time processing of the data. Kalman filtering is a discrete and time invariant approach and only holds for linear models. Unfortunately, practical problems often have strong non-linear relations. To overcome these limitations, Extended Kalman Filter (EKF) [15], Unscented Kalman Filter (UKF) or the Particle Filter (PF) [16] also known as Sequential Monte Carlo (SMC) have been developed to handle non-linear relationships. A drawback of the latter method is the increasing computational effort and rising amount of required data.

Regarding the *feature level*, unsupervised learning algorithms such as K-means clustering or self-organizing maps can be used to classify relationships between entities and extract further information. These methods are particularly suitable for applications where patterns or geometrical relations need to be detected. Further methods to address pattern recognition problems are adaptive resonance theory (ART), ARTMAP or Fuzzy ART.

For applications with limited bandwidth, Challa et al. [17] applied support vector machines (SVM) to compress the information in sensor fusion for large amounts of samples. Further algorithms and their applications are presented by Fung et al. [18]. Depending on the use case and its requirements, the suitability of the data fusion algorithm must be evaluated individually. Thereby, the target application defines the required accuracy, computational complexity, processing power and available amount of data.

King et al. [16] introduced a classification depending on the degree of complexity. Thereby, low-complexity algorithms such as k-means allow for an enhanced battery life for wearable health monitoring systems due to their reduced computational complexity and resource requirements.

In Table 1, the presented algorithms are distinguished regarding their application or abstraction level as sensor fusion, feature fusion and decision fusion. In general, there is a wide range of possible data fusion applications.


**Table 1.** Overview of data fusion algorithms at sensor, feature and decision level.

Typical examples of the military industry are the location and tracking of moving objects. Depending on the target, multiple sensor platforms are used, and various sources as acoustic signals or electromagnetic radiation can be observed [38]. More recent applications are in the health care [39] or telecommunication sectors as wireless sensor networks. The widespread machine learning algorithms such as ANN, SVM or GMM lead to an increased adoption in sensor fusion approaches and enhance the possibilities of sensor fusion.

However, the efficient handling of big amounts of data in industrial applications remains an important challenge for the future. Approaches called consensus filters try to address this issue by implementing distributed filters to create scalable algorithms [40,41]. More recent developments were published in the automotive sector, where fast developments are noticeable. Future self-driving or autonomous vehicles require reliable information about their environment and are therefore predestined for sensor fusion [42–44]. Wei et al. [45] fusioned LiDAR and a vision system to reduce the false positive detection of the LiDAR system utilizing neural networks to project the camera to the LiDAR space. A more detailed review in this sector is presented by Wang et al. [42] and Alatise et al. [44].

These developments will probably lead to further adoption and improvements in manufacturing industries. Besides data-related issues, the interaction and data fusion of human inputs and machine data will play an important role in the future and requires more research to be developed in the near future.

#### **3. Empirical Data Driven Methods for Production Process Modelling**

Production process modelling enables the development of stable and repeatable process models, which are of crucial relevance for manufacturing companies. Production processes can be viewed from different perspectives ranging from the overall production process chain including raw material delivery, manufacturing steps and the final stock exploitation with the final product to simulations of the microscopic material behaviour during a manufacturing step [46]. This section focuses on the modelling of manufacturing processes for control purposes. The following sections consider high-level perspectives accounting for multiple manufacturing machines and systems.

In the domain of manufacturing processes, as for example additive manufacturing, models can be classified in analytical, numerical or empirical approaches [47]. Analytical models consist of the mathematical analysis of a physical problem described by partial differential equations (PDE) and given boundary conditions. Given the assumptions made, the obtained solution accurately represents the reflected domain. However, the process to determine analytical solutions is not trivial, and for complex processes, advanced numerical methods and demanding computational efforts are necessary to solve the considered physical equations.

Numerical methods are typically finite element method (FEM), boundary element method (BEM), finite difference method (FDM) or finite volume method (FVM). The solving of PDE is an essential element for multi-physics simulations enabling the simulation of complex manufacturing processes. Offline process simulations allow to perform parameter or design engineering without performing costly experiments and are therefore widely spread in R&D departments. An important limitation of these methods is the computational burden for complex processes including multiple physical phenomena. Furthermore, the computational effort scales drastically with increasing simulation domain sizes. Besides computational drawbacks, setting up a model is not trivial and requires highly skilled employees to define model parameters. The model definition and its assumptions are crucial to obtain representative results for a predefined process window.

In certain applications, it can be more suitable to create empirical models by performing a series of experiments and analysing the observations. Thus, relations between process inputs and obtained measurements can be captured without knowing the detail about the physical phenomena. This method is in particular suitable for low-cost experiments, where numerical or analytical models would require more resources. To approximate the captured observations in terms of an exploitable model, various approaches exist. One of the most straightforward methods in statistical analysis is linear regression, where a line is fitted to capture the relation between two variables. For more complex non-linear applications, machine learning based methods as neural networks (NN) have experienced a widespread adoption in recent years [48–53]. Due to the increased availability of computational resources, the increasing development of machine learning libraries and the possibility to approximate highly non-linear relationships lead to many applications in industry, as for example machining operations and additive manufacturing [54]. The approximation capacities can be customized by varying NN hyperparameters as the number of hidden layers or the number of neurons of each layer. A further advantage of the NN-based model is the possibility to apply the model for similar processes. Therefore, only few additional

data points are required to fine tune the already trained model by adapting the model to the changed conditions [55]. However, the main drawbacks of NN are the required amount of data to train the network and the reliability of the predictions. The structure of NNs makes it difficult to interpret and predict the behaviour, in particular for unseen data points.

Recent developments combine the advantages of physics-based simulations and neural networks by including the physical knowledge in the training of the neural network [56–59]. In contrast to conventional neural networks, physics-informed neural networks contain additional loss terms. Besides the prediction error, terms satisfying the partial differential equation of the simulated domain penalize solutions disagreeing with the physical equations. As a result, less data are required to train and optimize the weights of the network, and predictions are more stable. Moreover, the additional loss terms shrink the solution space and prevent overfitting of the model. For data scare applications, this method is very promising but needs further research for industrial applications. In particular, the scalability needs to be investigated since the additional computations for each iteration and the coordinates dependent on PDE are potentially limiting factors for an adoption on larger part dimensions.

With a similar objective, but without directly implementing the physics of the considered process, a recent method has been presented by Li et al. [60]. The authors presented the Fourier neural operator (FNO) to learn the underlying multiple partial differential equations (PDE) by mapping from a functional parametric dependence to the solution. The proposed architecture method appears to be significantly faster than traditional PDE solvers. To train the FNO, a sequence of timeframes of a physical phenomenon is given as input to predict the following time frames. The inputs are processed by transforming the input into the frequency space, filtering the higher frequency modes and transforming the filtered data back to the time space; furthermore, the inputs are multiplied with weights and summed up with the filtered results. Subsequently, the results are passed through an activation function, and the described layer is repeated multiple times. The advantages of the method are the mesh invariance and the capability of learning the relevant physics behind the process. Therefore, this method could be a promising approach for the modelling of manufacturing processes. First adoptions for solving the PDE for two-phase flow fields have been demonstrated by Zhang et al. [61]. However, the transfer from theoretical physical problems to empirical data is still missing.

Besides AI-based algorithms inspired by the structure of NN, probabilistic approaches as Bayesian Optimization (BO) have also been proven to be suitable for data scare applications. Due to the probabilistic nature of BO, probabilities and uncertainties of the prediction are computed. These values enable the development of data sampling methods, minimizing the uncertainty of the predictions. As a result, less experimental data are required to obtain a model of the manufacturing process. Maier et al. [62] successfully applied this method to optimize grinding parameters after a few trials for an unknown workpiece and tool combination. Table 2 summarizes the mentioned process modelling methods.

However, a general challenge of process simulations remains the lack of generalizability of models and the increased computational burden for complex processes. Simulation domains are often tiny volumes and only represent a small element of the manufacturing step. Previously mentioned recent developments could be promising to handle both larger simulation domains and complex processes. Particularly, data-driven methods have recently shown great potential. Further research regarding the explainability and generalizability of data-driven methods must be carried out to enable an imminent transition to industry.


**Table 2.** Overview of sensor fusion algorithms.

#### **4. Machine Learning Applications for Production Machines Real Time Control**

Real-time control of manufacturing machines is a further essential element of production machines to achieve repeatable and stable processes with high-quality parts. Applications range from accurate positioning of machine axes to adaptive control of process parameters as the laser power during laser welding. Today's industrial control systems are mainly based on highly consolidated solutions, as for example PID controllers for axis control. The simple structure, ease of implementation and long experience in parameter tuning lead to an established technology for control applications. However, there were many further developments in the controller design and automated parameter tuning [63]. AI-based methods have been increasingly adopted in recent years to optimize control parameters tuning or to develop new types of control methods.

In general, machine learning methods can be divided into supervised, unsupervised and reinforcement learning (RL) [64]. Supervised learning trains for example an NN by reducing the mean squared error between predictions and the corresponding labelled solution. In unsupervised learning, no information about the ground truth is available. Unsupervised learning often consists in clustering or reducing data dimensionality by grouping correlated data points. In contrast to supervised and unsupervised learning, RL forms a semi-supervised learning approach, where an agent interacts with the environment and a reward is given for performed actions. The algorithms explore the environment by executing actions and receiving rewards for the actions and decisions taken. The optimal execution order of the actions, the policy, is not known and will be determined during the learning process. A good introduction into Deep RL is given by Mousavi et al. [65]. RL can be further subdivided into dynamic programming, Monte Carlo methods and temporal difference methods [66].

Regarding PID parameter tuning, RL has been successfully applied to find suitable controller parameters. Besides traditional approaches such as trial and error or Ziegler– Nichols [67], Wang et al. [68] utilized an actor critic RL approach to determine the PID parameters. The results demonstrated the algorithm's ability to track non-linear systems while ensuring adaptability and robustness. Besides parameter tuning, RL also experiences an increased adoption within process control applications. Table 3 lists the currently most adopted RL algorithms for control applications.


**Table 3.** Overview of common and new control algorithms.

Nian et al. [66], Shin et al. [63] and Lee et al. [69] recently published reviews of machine learning and process control methodologies. In particular, Shin et al. [63] compared reinforcement learning with model predictive control algorithms. For complex and dynamical systems, model predictive controllers (MPC) represent an effective alternative to PID controllers. Due to their ability to explicitly consider constraints and predictively optimize the controller output, MPC solutions are already well established in the industry. Important for the applicability of MPC are sufficient slow system dynamics so as to execute the model online during every sampling step. Reinforcement learning for process control appears to be a promising method to reduce online computation times. The m ain differences lie in the modelling approach, where RL algorithms learn the model through trial and error, whereas MPC requires a model developed with first principles or process identifications. Furthermore, the online execution time of model-free RL is faster than MPC, since only a forward run of the policy network is required, whereas the MPC requires solving an optimization problem at each execution. A general comparison between deep RL and MPC for optimal control is given by Lin et al. [70] and is displayed in Table 4. More details about the comparison of MPC and RL can be found in [70,71].

**Table 4.** Comparison of deep reinforcement learning and model predictive control for optimal control [72].


Due to the possibility of offline training, reinforcement learning is suitable for real-time control. Famous results of reinforcement learning were achieved by DeepMind beating human players in games such as Go [72]. However, these kinds of simulated environments are perfectly known and are therefore particularly suitable for the learning of reinforcement algorithms. In contrast to simulated environments, industrial processes suffer from high levels of imprecision and uncertainty (see sensor fusion; Section 2). The training requires a lot of data and interactions with the environment, which can be performed by interacting with a model. However, accurate models covering the entire manufacturing process over all parameter combinations are extremely complex, and experiments are costly. As discussed in Section 3, AI-based modelling approaches could therefore be suitable to model the process and serve as an environment for the RL algorithm. In contrast to traditional control approaches, there are no stability guarantees for RL controllers. Compared to traditional control methods, the development of RL control algorithms is still in an early stage. Nevertheless, the general learning capability of RL is promising for the future. Furthermore, research is needed to search for methods to improve stability and convergence of RL algorithms. For further applications as control approaches, safe RL will be, in the future, an important topic to enhance the reliability of RL-based controllers [73,74].

#### **5. Real-Time Scheduling Methods for Flexible Manufacturing Systems**

Trends towards customized products and shorter delivery times lead to a major challenge for manufacturing factories automation and management systems. The increase in product variants introduces significant non-production times due to system reconfiguration and set up needs at each product changeover. Besides internal production uncertainties, external factors such as a varying customer demand, supply chain ruptures or even raw material price dynamics increase the complexity level of the manufacturing configuration, planning and scheduling tasks [75]. To cope with these challenges, today's manufacturing companies need not only high-performance and flexible processes but also robust concepts to handle foreseen variances and unforeseen disturbances with minimal impact and small additional production costs [76,77]. Approaches to raise the production flexibility can range from improvements and development of adaptive work cells with self-adaptive control functions [78,79] to optimized production planning and scheduling solutions [80].

The mapping and optimization of these increasingly complex production processes is an important field of research. Recent works underline the relevance of production scheduling and show a tendency towards the following topics in the last years [81]:


These topics highlight the importance of autonomous and self-optimizing scheduling approaches to handle the overall growing production dynamics. Frequently applied scheduling optimization objectives are the maximization of machine utilization and the minimization of the lead time to produce a product [82]. Additionally, an increased sensibility and awareness of the climatic impact lead to further requirements such as the minimization of utilized resources and the reduced emission of greenhouse gases.

Once the scheduling objectives are defined, various approaches exist to implement a job shop scheduling policy. A general classification divides them into centralized, distributed or more recent hybrid scheduling and control architectures (HCA) [83]. These hybrid structures aim to combine the advantages of both approaches, while reducing their drawbacks. State-of-the-art architectures are multi-agent systems (MAS), holonic manufacturing systems (HMS) or product-driven systems (PDS) [83]. Another classification presented by Zhang et al. [84] groups the job shop scheduling models by the problem space in the following five types: basic type job shop scheduling problem (JSP), flexible JSP, multi-resources FJSP model (MrFJSP), multi-plants-based MrFJSP (MpFJSP) and MpFJSP with smart factory.

Independently of the scheduling architecture or model, dynamic scheduling approaches are required to cover and react on production events in real time. Events can be classified in resource- or job-related events [85]. Thereby resource-related events include process-linked incidents, as for example machine breakdowns or material shortages. Examples for job-related events are job cancellation or a change in priorities. Handling these dynamic and unforeseen events, while considering constraints and variations of each production process, is a key challenge of industrial production scheduling solutions. To cope with the challenge, accurate production models are required to obtain realistic results. However, with increasing model complexity, computation times and resulting optimization times increase significantly, limiting the real-time application potential of

advanced scheduling methods. To overcome these issues, new technologies and algorithms are required to model and optimize the industrial operation scheduling problem.

Promising approaches are dynamic HCA with self-adaptive mechanisms to improve the agility and reactivity [86]. Cardin et al. [75] presented an optimized reactive control architecture (ORCA) consisting of a global and local control layer. Important is the distinction between normal and disrupted mode. In the case of a local perturbation, the local optimizer triggers the switch to disrupted mode, where the local optimizer controls the entity without the input of the global control layer. As a result, the system is able to handle reactive and local disturbances, while maintaining a global optimal solution at the start of each time window. The overall architecture is shown in Figure 4, illustrating the different control levels and the disrupted mode.

**Figure 4.** Optimized reactive control architecture (ORCA) [75].

However, a crucial element for effective industrial scheduling solutions is an algorithm that optimizes the objective functions in dynamic rescheduling operations. An overview of scheduling algorithms is shown in Figure 5 [84]. With increasing complexity, approximate methods are more suitable to optimize the production scheduling problem.

**Figure 5.** Scheduling algorithms classification.

In recent years, AI-based methods have received increasing attention and have been more commonly applied to support the scheduling process in high-performing industrial applications. Predictive maintenance enables to predict and even anticipate production disturbances. The research in this field has increased drastically due to the exponentially increasing amount of available production data and the adoption of AI-based techniques [87,88]. Developments in this area are especially helpful for scheduling solutions to anticipate disturbances and future changes, so as to proactively adapt the scheduling solution. Within the developments under Industry 4.0, new frameworks such as the MpFJSP with smart factory are presented [84]. Thereby the actual job shop situation can be measured in real time by various sensors and identification technologies. As a result, the actual situation can be compared to the planned scheduling solution. AI-based methods such as deep learning technologies enable to extract information and learn from the production processes autonomously to provide the necessary basis for decision making and new scheduling solutions identification.

Priore et al. [89] presented an approach to predict the suitable decision rule with machine learning based methods. Thereby, ensemble methods have been successfully applied to exploit the prediction of multiple algorithms and increase the prediction accuracy. Another work using ensemble methods has been presented by Lubosch et al. [90]. Thereby, a Monte Carlo Tree Search was used to develop a flexible scheduling algorithm. To achieve a good tradeoff between exploration and exploitation, the algorithms' hyper parameters are tuned with a Gradient Boosted Decision Tree (GBDT), avoiding a manual parameter tuning. Recently, reinforcement learning-based production scheduling systems have gained increasing attention in research [91]. Thereby a clear research trend towards more reliabilityaware algorithms can be identified.

Besides further research in modelling and optimization algorithms, human aspects need to be considered in future adaptive automation solutions as well [92]. In particular, human–robot collaboration is a promising approach for higher productivity and product quality [93,94]. Worker-aware adaptive solutions enable the possibility to harmonize human well-being to obtain a synergic human-automation solution, including the related job scheduling aspects. Bettoni et al. [95] and Valente et al. [96] demonstrated an approach of adaptive huma–machine collaboration, enabling an effective reduction of the physical and mental stress of the operator. The monitoring of physiological parameters and classification with random forest algorithms allow for estimating the state of the operator and adapting the behaviour of the human-aware manufacturing systems, also considering the system evolution over time, as outlined in Bracco et al. [97].

#### **6. Emerging Cognitive Approaches for Self-Optimizing Machines**

The key idea behind self-optimizing machines (SOM) or self-optimizing production systems (SOPS) is to enable industrial production systems to produce customized highquality parts while satisfying productivity demands [98]. Therefore, highly autonomous production machines are required to maximize the counteracting targets of efficient production and customization. Enhanced autonomy enables the system to produce complex parts, while handling and reacting to unforeseen events autonomously without any intervention of external operators [99].

Adapting the machine state automatically towards the objectives requires increased decision-making capabilities [100]. Thereby, the decision-making process can be compared to the human decision-making requirements [101]. These cognitive processes consist of the perception of the environment, its interpretation, the crosslink with existing knowledge and the subsequent decision with a coupled action. The general cognitive process is displayed in Figure 6 and illustrates the high-level control strategy.

The illustrated control structure includes the elements of each control loop. Cascading these process control loops forms a fundamental element to achieve this elevated degree of autonomy. These process control loops enable the SOM to react on changing objectives and subsequently adapt the machine state levelwise top–down. The cascaded implementation allows one to act on various influences on different impact levels, ranging from highlevel product modifications to microscopic tool wear during the manufacturing process. Figure 7 visualizes a cascaded self-optimization loop for an additive manufacturing process. Thereby, the control loops are linked top–down to consider production objectives within the process control and act on each process level. Based on the Viable System model of Stafford Beer, Permin et al. [102] presented a concept of the Viable System model, including different levels of production and production management. Thereby, objectives are divided

into normative production management, strategic production management and tactical production management.

**Figure 6.** The cognitive process supporting self-optimization.

**Figure 7.** An exemplary cascaded loop of a self-optimizing additive manufacturing machine.

A main element of the SOM and its cascaded process loops is the exploitation of multiple information sources to capture and control the machine state. Moreover, the link between all processing elements allows for the process control over multiple levels. Thereby, process models as described in Section 3 support the control process by predicting target parameters or by optimizing control outputs with model-based controllers. Thombandsen et al. [103] presented a framework of a model-based self-optimizing manufacturing machine. Within this concept, accurate observations and the understanding of their relation to the product quality are crucial prerequisites for self-optimizing machines. Based on sensor signals, machine states and operating points are identified, and control parameters are computed by a model-based optimization.

A more higher-level perspective of a SOM is presented by Möhring et al. [104] and is displayed according to the framework in Figure 8. Within this concept, the process starts with generation of the trajectory for the imported part to manufacture. During this process, the planning process considers already processed and machine properties to optimize the toolpath. Further optimizations of the settings are performed by modelling and simulation of the process and the machine. These models enable the prediction of various product target properties, as for example the surface quality or part geometry. Afterwards, the actual machining process is performed, and monitoring solutions record the real-time process behaviour and interact with real-time controllers. The actual process measurements allow one to automatically update and improve the process model. Furthermore, the inline measurements form the input of real-time controllers to perform accurate control tasks, such as axis positioning. Considering the machine characteristics and measured process deviations, real-time controllers are constantly refined. Further details and examples can be found in Möhring et al. [104].

**Figure 8.** Self-optimizing manufacturing machine [104].

Essential prerequisites for SOM are developments in data fusion, process modelling and real-time control mentioned in the previous sections. In particular, the implementation of suitable measurement devices is crucial for the subsequent process steps. Only if the machine state is captured properly by accurate and robust measurements can valid process models be generated and subsequent optimal actions selected. Key enabling technologies as artificial intelligence and machine learning are promising approaches to process and analyse sensor data. These techniques are necessary to build cascaded process control loops based on comprehensive process observations, accurate process models and the knowledge about quality-influencing factors. Even if there have been promising advancements in various applications, there is still a lack in industrial implementations of comprehensive SOM solutions. A general overview of AI applications in manufacturing is given by Monostori [105]. Mayr et al. [106] presented an adaptive self-learning control approach to compensate for thermal errors on five-axis machine tools. Based on data captured during the process, a thermal autoregressive with exogenous input (ARX) model is identified and updated during the process to reduce thermal errors. Dittrich et al. [107] presented a selfoptimizing process planning approach for five-axis milling to automatically compensate for tool deflections. Combining a simulation model with support vector regression enabled the prediction of shape errors. The subsequent automated toolpath adaption resulted in a shape error reduction by 50%.

The overall promising benefits of SOM are becoming more and more realistic with recent progresses in computational power, sensor techniques and AI algorithms. Autonomous selection of data sampling points enables an efficient generation of process models [64]. Based on these models, integrated optimization processes pave the way for autonomous self-tuning of parameters to guarantee a constant product quality.

To enable further widespread in industrial applications, faster and data-scarce simulation and analysis methods will be needed. AI algorithms are promising techniques to generate fast models and to handle process uncertainty, while updating the model during manufacturing processes. However, these approaches suffer from low data efficiency and reliability of its predictions and are therefore still critical regarding safety requirements.

Further challenges are the process dynamics, requiring fast optimized control values to control the process. Further developments are needed to cope with the complexity and dimensionality of the optimization problems.

Moreover, more research will be needed to integrate the human interaction and transparency of the decisions made and an interactive machine user interface.

#### **7. Major Ongoing Research Directions and Future Perspectives**

The presented work introduced recent progress on different hierarchic and functional levels of industrial control solutions, discussing how AI-based methods enable further enhancements in many applications and sectors. Promising algorithms and methods were mentioned and discussed for every level following a bottom-up approach, starting from the sensor level and data fusion to more complex applications such as self-optimizing machines.

Thereby, the (self)-learning properties of AI-based algorithms are promising approaches to deal with complex processes. Compared to conventional FEM-based modelling approaches, AI-based methods come with short inference times, which are important for real-time production systems. However, future research is required to overcome the drawbacks of AI-based methods as far as the model reliability and explainability are concerned. Approaches such as neural networks are difficult to interpret for data points outside the training set. A linked issue is the required amount of data for the training process, which can be a burden for data scare applications. Furthermore, data quality in terms of uncertainty, outliers or missing data points is crucial for accurate training. The data preparation can be time-expensive and requires specialized skills. Depending on the application, AI approaches should be evaluated individually and compared to conventional methods in terms of effort and resulting costs. The mentioned factors can be currently limiting factors for a widespread exploitation of AI methods in industrial control and automation solutions, particularly in small to medium companies (SME). To assess the suitability of AI algorithms, Bettoni et al. [108] proposed a conceptual framework for SMEs. However, ongoing research and developments in this field are constantly showcasing new applications and pushing the technology towards industrial applications. These developments foster the wide spread of AI-based methods and facilitate the accessibility for companies. For data scare applications, transfer learning or physics-informed neural networks are promising methods to deal with this limitation and will potentially increase the interpretability of the models. Regarding control applications, reinforcement learning approaches are constantly gaining an interest in research, and new approaches and use cases will permit to understand and assess their suitability for industrial use. The combination of advanced monitoring, modelling, control and scheduling methods will in the future allow for the development of self-optimizing machines, leading to production machine improvements in terms of product quality, productivity and resource efficiency, and representing a crucial point for the next generation of human-centric manufacturing systems.

**Author Contributions:** Conceptualization, E.C. and D.K.; methodology, E.C. and D.K.; investigation, E.C. and D.K.; writing—original draft preparation, D.K.; writing—review and editing, E.C.; visualization, D.K.; supervision, E.C.; project administration, E.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Adaptive Optimal Robust Control for Uncertain Nonlinear Systems Using Neural Network Approximation in Policy Iteration**

**Dengguo Xu 1,2, Qinglin Wang <sup>1</sup> and Yuan Li 1,\***


**Abstract:** In this study, based on the policy iteration (PI) in reinforcement learning (RL), an optimal adaptive control approach is established to solve robust control problems of nonlinear systems with internal and input uncertainties. First, the robust control is converted into solving an optimal control containing a nominal or auxiliary system with a predefined performance index. It is demonstrated that the optimal control law enables the considered system globally asymptotically stable for all admissible uncertainties. Second, based on the Bellman optimality principle, the online PI algorithms are proposed to calculate robust controllers for the matched and the mismatched uncertain systems. The approximate structure of the robust control law is obtained by approximating the optimal cost function with neural network in PI algorithms. Finally, in order to illustrate the availability of the proposed algorithm and theoretical results, some numerical examples are provided.

**Keywords:** policy iteration; uncertain nonlinear system; robust control; adaptive optimal control

#### **1. Introduction**

It is ineluctable to contain uncertain parameters and disturbances in practical systems due to modeling errors, external disturbances, and so on [1]. Thus, it is of great practical significance to study robust control of the uncertain systems. In recent years, the control problem of uncertain systems has been extensively studied. The literature on robust control of uncertain systems mainly includes linear systems and nonlinear systems. For the uncertain linear systems, algebraic Riccati equations (ARE) or linear matrix inequalities (LMI) were mostly used to deal with them in the classical research methods [2–6]. In this literature, both matched and mismatched systems were involved. For nonlinear systems, the early research methods include feedback linearization, fuzzy modeling, nonlinear *H*∞ control, and so on [7–11]. However, in recent decades, neural networks (NN) and PI in RL were used to approximate the robust control law numerically [12–14].

The PI method in RL was initially utilized to calculate the optimal control law for some deterministic systems. Werbos first proposed an idea of approximating the solution in Bellman's equation using approximate dynamic programming (ADP) [15]. There have been many results of using the PI method to calculate the optimal control law for some deterministic systems [16–19]. There are two major benefits of using the PI algorithm to deal with such optimal problems. On the one hand, it can effectively solve some problems of the "curse of dimensionality" in engineering control [20]. On the other hand, it can be utilized to calculate the optimal control law without knowing the system dynamics. In the practice of engineering control, it is difficult to obtain system dynamics accurately. Therefore, it is a good choice to use the PI algorithm to solve unknown model control problems.

Within the last ten years, the PI method was also developed to calculate the robust controller for some uncertain systems, which is based on the optimal control method of

**Citation:** Xu, D.; Wang, Q.; Li, Y. Adaptive Optimal Robust Control for Uncertain Nonlinear Systems Using Neural Network Approximation in Policy Iteration. *Appl. Sci.* **2021**, *11*, 2312. https://doi.org/10.3390/ app11052312

Academic Editor: Emanuele Carpanzano and Giovanni Petrone

Received: 21 January 2021 Accepted: 28 February 2021 Published: 5 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

robust control [21]. For an input constraint nonlinear system with continuous time, a novel RL-based algorithm was prosposed to deal with the robust control problem in [22]. Based on network structure approximation, an online PI algorithm was developed to solve robust control of a class of nonlinear discrete-time uncertain systems in [23]. Using a data-driven RL algorithm, a robust control scheme was developed for a class of completely unknown dynamic systems with uncertainty in [24]. In addition, there are many other examples of literature on robust control based on RL, such as [25–27]. In all the literature listed above, the solution to the Hamilton Jacobi Bellman (HJB) equation was approximated by neural network. In fact, solving the HJB equation is a key problem in optimal control problem [28]. The HJB equation is difficult to solve because it is a nonlinear partial differential equation. For a nonlinear system, the HJB equation is solved with neural network approximation in many cases. Meanwhile, for the linear system, ARE is used to solve it instead of a neural network. However, for all we know, most of the current studies have not considered the input uncertainty in system. The input uncertainty does exist in the actual control system.

In this study, a class of continuous-time nonlinear systems with internal and input uncertainties is considered. The main objective is to establish robust control laws for the uncertain systems. By solving the optimal control problem constructed, the robust control problem is converted into calculating an optimal controller. The online PI algorithms are proposed to calculate robust control by approximating the optimal cost with neural network. The convergence of the proposed algorithms is proved. Numerical examples are given to illustrate the availability of the method.

Our main contributions are as follows. First, more general uncertain nonlinear systems are considered, in which the uncertainty entered both the system and the input. For the matched and the mismatched uncertain systems, it is proved that the robust control can be converted into calculating an optimal controller. Second, the online PI algorithms are developed to solve the robust control problem. The neural network is utilized to approximate the optimal cost in PI algorithm, which fulfilled a difficult task of solving the HJB equation.

The rest of this paper is arranged as follows. We formulate the robust control problems and propose some basic results for the issues under consideration in Section 2. Solving the robust control problem is converted to calculate an optimal control law of a nominal or auxiliary system in Sections 3 and 4. Based on approximating optimal cost with neural network, the online PI algorithms are developed to solve the robust control problem in Section 5. To support the proposed theoretical framework, we provide some numerical examples in Section 6. In Section 7, the study is concluded, and the scope for future research is discussed.

#### **2. Preliminaries and Problem Formulation**

Consider an uncertain nonlinear system as follows:

$$\dot{\mathbf{x}}(t) = f(\mathbf{x}(t)) + \Delta f(\mathbf{x}(t)) + [\mathbf{g}(\mathbf{x}(t)) + \Delta \mathbf{g}(\mathbf{x}(t))] \mathbf{u}(t), \mathbf{x}(0) = \mathbf{x}\_0 \tag{1}$$

where *<sup>x</sup>*(*t*) <sup>∈</sup> <sup>R</sup>*<sup>n</sup>* is the system state, *<sup>u</sup>*(*t*) <sup>∈</sup> <sup>R</sup>*<sup>m</sup>* is the control input, *<sup>f</sup>*(*x*(*t*)) <sup>∈</sup> <sup>R</sup>*n*, *<sup>g</sup>*(*x*(*t*)) <sup>∈</sup> <sup>R</sup>*n*×*<sup>m</sup>* are known function, <sup>Δ</sup>*f*(*x*(*t*)) <sup>∈</sup> <sup>R</sup>*n*, <sup>Δ</sup>*g*(*x*(*t*)) <sup>∈</sup> <sup>R</sup>*n*×*<sup>m</sup>* are uncertain disturbing function.

The control objective is to establish a robust control law *u* = *u*(*x*) in order that the closed-loop system is asymptotically stable for all allowed uncertain disturbances Δ*f*(*x*(*t*)) and Δ*g*(*x*(*t*)).

As a general case, we first make the following assumptions to ensure that the state Equation (1) is well defined [1,29].

**Assumption 1.** *In* (1)*, f*(*x*) + *g*(*x*)*u is Lipschitz continuous with respect to x and u on the set* <sup>Ω</sup> <sup>⊆</sup> <sup>R</sup>*<sup>n</sup> containing the origin.*

**Assumption 2.** *For the free vibration system, f*(0) = 0, Δ*f*(0) = 0*, that is, x* = 0 *is the equilibrium point of the free vibration system.*

**Definition 1.** *The system* (1) *is called to satisfy the system dynamics matched condition if there is a function matrix h*(*x*) <sup>∈</sup> *<sup>R</sup>m*×<sup>1</sup> *such that*

$$
\Delta f(\mathbf{x}) = \mathcal{g}(\mathbf{x}) h(\mathbf{x}).\tag{2}
$$

**Definition 2.** *System* (1) *is called to satisfy the input matched condition if there is a function <sup>m</sup>*(*x*) <sup>∈</sup> *<sup>R</sup>m*×*<sup>m</sup> such that*

$$
\Delta \mathcal{g}(\mathbf{x}) = \mathcal{g}(\mathbf{x}) m(\mathbf{x}),
\tag{3}
$$

*where m*(*x*) <sup>≥</sup> <sup>0</sup>*.*

**Definition 3.** *If the system* (1) *satisfies the conditions* (2) *and* (3) *for any allowed disturbances* Δ*f*(*x*) *and* Δ*g*(*x*)*, then the system* (1) *is called a matched uncertain system.*

**Definition 4.** *If the system* (1) *does not satisfy the condition* (2) *or* (3) *for any allowed disturbances* Δ*f*(*x*) *and* Δ*g*(*x*)*, then the system* (1) *is called a mismatched uncertain system.*

Next, we consider the robust control problem of nonlinear system (1) with matched and mismatched conditions, respectively.

#### **3. Robust Control of Matched Uncertain Nonlinear Systems**

This section considers the problem of robust control when the system (1) meets the matched conditions (2) and (3). By constructing appropriate performance indexes, the problem of robust control is transformed into calculating the optimal control law of a corresponding nominal system. Based on the optimal control of the nominal system, a PI algorithm is proposed to obtain robust feedback controller.

For the nominal system

$$
\dot{\mathbf{x}} = f(\mathbf{x}) + \mathbf{g}(\mathbf{x})\mathbf{u},\tag{4}
$$

find the controller *u* = *u*(*x*) to minimize performance index

$$J(\mathbf{x}\_{0\prime}u) = \int\_0^\infty [f\_{\text{max}}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + u^T u] dt \tag{5}$$

where *fmax*(*x*) is the supremum function of uncertainty *<sup>h</sup>*(*x*), that is *h*(*x*) ≤ *fmax*(*x*). The definition of admissible control in optimal control problem is given below [26].

**Definition 5.** *The control policy u*(*x*) *is called an admissible control of the system* (4) *with regard to the performance function* (5) *on compact set* <sup>Ω</sup> <sup>⊆</sup> <sup>R</sup>*<sup>n</sup> if <sup>u</sup>*(*x*) *is continuous on* <sup>Ω</sup>*, <sup>u</sup>*(0) = <sup>0</sup>*, it can stabilize the system* (4) *on* Ω*, and the performance function* (5) *is limited for any x* ∈ Ω*.*

According to the performance index (5), the cost function corresponding to the admissible control *u*(*x*) is given by

$$V(\mathbf{x}, \boldsymbol{\mu}) = \int\_{t}^{\infty} [f\_{\max}^{2}(\mathbf{x}) + \mathbf{x}^{T}\mathbf{x} + \mathbf{u}^{T}\boldsymbol{\mu}]dt\tag{6}$$

Taking time derivative on both side of (6), it follows the Bellman equation

$$f\_{\max}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + \boldsymbol{\mu}^T \boldsymbol{\mu} + \nabla V^T [f(\mathbf{x}) + \mathcal{g}(\mathbf{x}) \boldsymbol{\mu}] = \mathbf{0},\tag{7}$$

where <sup>∇</sup>*<sup>V</sup>* is the gradient vector of the cost function *<sup>V</sup>*(*x*, *<sup>u</sup>*) with respect to *<sup>x</sup>*.

Definite Hamiltonian function

$$H(\mathbf{x}, \boldsymbol{\mu}, \nabla V) = f\_{\max}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + \boldsymbol{\mu}^T \boldsymbol{\mu} + \nabla V^T [f(\mathbf{x}) + \mathbf{g}(\mathbf{x})\boldsymbol{\mu}] \tag{8}$$

Determining the extremum of the Hamiltonian function yields the optimal control function

$$u^\*(\mathbf{x}) = -\frac{1}{2}\mathbf{g}^T(\mathbf{x})\nabla V\tag{9}$$

By substituting (9) into (7), it follows that optimal cost *V*∗(*x*) satisfies the following HJB equation

$$f\_{\max}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + \nabla V^{\*T} f(\mathbf{x}) - \frac{1}{4} \nabla V^{\*T} g(\mathbf{x}) \mathbf{g}^T(\mathbf{x}) \nabla V^\* = 0 \tag{10}$$

and initial conditions *V*∗(0) = 0.

Solving the optimal cost *V*∗(*x*) from the HJB Equation (10), we can get the solution to the optimal control problem. Thus, the robust control problem can be solved.

The following theorem shows that optimal control *<sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*<sup>∗</sup> is a robust controller for matched uncertain systems.

**Theorem 1.** *Assume that the conditions* (2) *and* (3) *hold in system* (1) *and the solution V*∗(*x*) *in HJB Equation* (10) *exists. Considering the nominal nonlinear system* (4) *with performance index* (5)*, then the optimal control policy <sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*<sup>∗</sup> *can globally asymptotically stabilize the nonlinear uncertain system* (1)*. That is to say, the closed-loop uncertain system x*˙(*t*) = *f*(*x*(*t*)) + Δ*f*(*x*(*t*)) + [*g*(*x*(*t*)) + Δ*g*(*x*(*t*))]*u*∗(*x*) *is globally asymptotically stable.*

**Proof.** In order to prove the stability with controller *<sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*∗, *<sup>V</sup>*∗(*x*) is chosen as the Lyapunov function. Considering the performance index (5), *V*∗(*x*) is obviously positive, and *V*∗(0) = 0. Taking time derivative of the function *V*∗(*x*) along closed-loop system (1), it follows that

$$\frac{dV^\*}{dt} = \nabla V^{\*T}[f(\mathbf{x}) + \Delta f(\mathbf{x})] - \frac{1}{2}\nabla V^{\*T}[\mathbf{g}(\mathbf{x}) + \Delta \mathbf{g}(\mathbf{x})] \mathbf{g}^T(\mathbf{x}) \nabla V^\* \tag{11}$$

Using the matched conditions (2)and (3), it follows from (11) that

$$\begin{split} \frac{dV^\*}{dt} &= \nabla V^{\*T} f(\mathbf{x}) + \nabla V^{\*T} \mathbf{g}(\mathbf{x}) h(\mathbf{x}) - \frac{1}{2} \nabla V^{\*T} \mathbf{g}(\mathbf{x}) \mathbf{g}^T(\mathbf{x}) \nabla V^\* \\ &- \frac{1}{2} \nabla V^{\*T} \mathbf{g}(\mathbf{x}) m(\mathbf{x}) \mathbf{g}^T(\mathbf{x}) \nabla V^\* \end{split} \tag{12}$$

From HJB Equation (10), one can obtain

$$
\nabla V^{\*T} f(\mathbf{x}) = -f\_{\text{max}}^2(\mathbf{x}) - \mathbf{x}^T \mathbf{x} + \frac{1}{4} \nabla V^{\*T} g(\mathbf{x}) g^T(\mathbf{x}) \nabla V^\* \tag{13}
$$

Substituting (13) into (12) yields

$$\begin{split} \frac{dV^{\*}}{dt} &= -f\_{\text{max}}^{2}(\mathbf{x}) - \mathbf{x}^{T}\mathbf{x} + \frac{1}{4}\nabla V^{\*T}\mathbf{g}(\mathbf{x})\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} + \nabla V^{\*T}\mathbf{g}(\mathbf{x})h(\mathbf{x}) - \frac{1}{2}\nabla V^{\*T}\mathbf{g}(\mathbf{x})\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} \\ &- \frac{1}{2}\nabla V^{\*T}\mathbf{g}(\mathbf{x})m(\mathbf{x})\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} \\ &= -f\_{\text{max}}^{2}(\mathbf{x}) - \mathbf{x}^{T}\mathbf{x} - \frac{1}{4}\nabla V^{\*T}\mathbf{g}(\mathbf{x})\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} + \nabla V^{\*T}\mathbf{g}(\mathbf{x})h(\mathbf{x}) - \frac{1}{2}\nabla V^{\*T}\mathbf{g}(\mathbf{x})m(\mathbf{x})\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} \end{split} \tag{14}$$

It follows from *<sup>m</sup>*(*x*) <sup>≥</sup> 0 that <sup>−</sup><sup>1</sup> <sup>2</sup>∇*V*<sup>∗</sup>*Tg*(*x*)*m*(*x*)*gT*(*x*)∇*V*<sup>∗</sup> <sup>≤</sup> 0. Therefore, from (14), we can have

$$\begin{split} \frac{dV^{\*}}{dt} &\leq -f\_{\text{max}}^{2}(\mathbf{x}) - \mathbf{x}^{T}\mathbf{x} - \frac{1}{4}\nabla V^{\*T}\mathbf{g}(\mathbf{x})\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} + \nabla V^{\*T}\mathbf{g}(\mathbf{x})h(\mathbf{x}) \\ &= -f\_{\text{max}}^{2}(\mathbf{x}) - \mathbf{x}^{T}\mathbf{x} - \frac{1}{4}[\nabla V^{\*T}\mathbf{g}(\mathbf{x})\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} - 4\nabla V^{\*T}\mathbf{g}(\mathbf{x})h(\mathbf{x}) + 4h(\mathbf{x})^{T}h(\mathbf{x})] + h(\mathbf{x})^{T}h(\mathbf{x}) \\ &= -\mathbf{x}^{T}\mathbf{x} + h(\mathbf{x})^{T}h(\mathbf{x}) - f\_{\text{max}}^{2}(\mathbf{x}) - \frac{1}{4}[\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} - 2h(\mathbf{x})]^{T}[\mathbf{g}^{T}(\mathbf{x})\nabla V^{\*} - 2h(\mathbf{x})] \\ &\leq -\mathbf{x}^{T}\mathbf{x} \end{split} \tag{15}$$

Therefore, by Lyapunov stability theory [30], the optimal control *<sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*<sup>∗</sup> can make the matched uncertain system (1) asymptotically stable. Thus, for a constant *<sup>c</sup>* <sup>&</sup>gt; 0, there is a neighborhood *<sup>N</sup>* <sup>=</sup> {*<sup>x</sup>* : *x* <sup>&</sup>lt; *<sup>c</sup>*} such that if the state *<sup>x</sup>*(*t*) enters the neighborhood *<sup>N</sup>*, then *<sup>x</sup>* <sup>→</sup> 0 when *<sup>t</sup>* <sup>→</sup> <sup>∞</sup>. However, *<sup>x</sup>*(*t*) cannot stay out of the domain *<sup>N</sup>* forever; otherwise, for all *<sup>t</sup>* <sup>&</sup>gt; 0, there is *x*(*t*) ≥ *<sup>c</sup>*. This implies that

$$\begin{aligned} V^\*[\mathbf{x}(t)] - V^\*[\mathbf{x}(0)] &= \int\_0^t \dot{V}^\*(\mathbf{x}(\tau))dt \\ &\leq \int\_0^t (-\mathbf{x}^T \mathbf{x})dt \\ &\leq -\int\_0^t c^2 dt \\ &= -c^2 t \end{aligned}$$

Therefore, when *<sup>t</sup>* <sup>→</sup> <sup>∞</sup>, *<sup>V</sup>*∗[*x*(*t*)] <sup>≤</sup> *<sup>V</sup>*∗[*x*(0)] <sup>−</sup> *<sup>c</sup>*2*<sup>t</sup>* → −∞. This contradicts that *V*∗[*x*(*t*)] is positive definite. Consequently, the system (1) is globally asymptotically stable.

**Remark 1.** *For matched nonlinear systems, the robust controller can be obtained by solving the optimal cost function V*∗[*x*(*t*)] *from HJB Equation* (10)*. In Section 4, we will use the PI algorithm to solve the HJB equation, which is a difficult partial differential equation.*

#### **4. Robust Control of Nonlinear Systems with Mismatched Uncertainties**

In this section, we consider the robust control problem when the system (1) does not satisfy the matched condition (2). At this time, the system is a mismatched nonlinear uncertain system. By constructing the appropriate auxiliary system and performance index, the robust control for the mismatched uncertain system is transformed into solving optimal control law of an auxiliary system.

Firstly, the following assumptions are given.

**Assumption 3.** *Suppose that the uncertainty of system* (1) *satisfies* Δ*f*(*x*) = *c*(*x*)*h*(*x*)*,* Δ*g*(*x*) = *g*(*x*)*m*(*x*)*, where c*(*x*) *is a known function matrix of appropriate dimensions, h*(*x*) *and m*(*x*) *are uncertain functions, and m*(*x*) <sup>≥</sup> <sup>0</sup>*.*

The goal of robust control is to find a control function *u*(*x*), which makes the closedloop system

$$\dot{\mathbf{x}} = f(\mathbf{x}) + \mathbf{c}(\mathbf{x})h(\mathbf{x}) + [\mathbf{g}(\mathbf{x}) + \mathbf{g}(\mathbf{x})m(\mathbf{x})]u(\mathbf{x})\tag{16}$$

globally asymptotically stable for all uncertainties *h*(*x*) and *m*(*x*).

In order to obtain the robust controller, an optimal control problem is constructed as follow. For the following auxiliary systems

$$\dot{\mathbf{x}} = f(\mathbf{x}) + \mathbf{g}(\mathbf{x})u + [I - \mathbf{g}(\mathbf{x})\mathbf{g}(\mathbf{x})^{+}]\mathbf{c}(\mathbf{x})v \tag{17}$$

find the controller *u* = *u*(*x*), *v* = *v*(*x*), such that the performance index

$$f(\mathbf{x}\_0, \mathbf{u}) = \int\_0^\infty [f\_{\text{max}}^2(\mathbf{x}) + g\_{\text{max}}^2(\mathbf{x}) + \boldsymbol{\beta}^2 \mathbf{x}^T \mathbf{x} + \boldsymbol{u}^T \boldsymbol{u} + \boldsymbol{\upsilon}^T \boldsymbol{\upsilon}] d\mathbf{t} \tag{18}$$

is minimized, where *<sup>β</sup>* is the design parameter, *<sup>g</sup>*(*x*)<sup>+</sup> = [*gT*(*x*)*g*(*x*)]−1*gT*(*x*) is a pseudo inverse of the matrix function *g*(*x*). Moreover, *fmax*(*x*) and *gmax*(*x*) are nonnegative functions and satisfy the conditions

$$\|\|g(\mathbf{x})^{+}c(\mathbf{x})h(\mathbf{x})\|\| \le f\_{\max}(\mathbf{x}),\\\|\|h(\mathbf{x})\|\| \le g\_{\max}(\mathbf{x})\tag{19}$$

According to the performance index (18), the cost function corresponding to the admissible control (*u*(*x*), *v*(*x*)) is

$$V(\mathbf{x}) = \int\_{t}^{\infty} [f\_{\text{max}}^{2}(\mathbf{x}) + g\_{\text{max}}^{2}(\mathbf{x}) + \beta^{2}\mathbf{x}^{T}\mathbf{x} + u^{T}u + v^{T}v]dt\tag{20}$$

The following Bellman equation is obtained by taking the time derivation on both sides of (20)

$$f\_{\max}^2(\mathbf{x}) + g\_{\max}^2(\mathbf{x}) + \beta^2 \mathbf{x}^T \mathbf{x} + \mathbf{u}^T \mathbf{u} + \nabla V^T [f(\mathbf{x}) + \mathbf{g}(\mathbf{x}) \mathbf{d}] = \mathbf{0},\tag{21}$$

where <sup>∇</sup>*<sup>V</sup>* is the gradient vector of *<sup>V</sup>*(*x*) with respect to *<sup>x</sup>*, *<sup>g</sup>*¯(*x*)=[*g*(*x*),(*<sup>I</sup>* <sup>−</sup> *<sup>g</sup>*(*x*)*g*(*x*)+)*c*(*x*)], *u*¯ = [*uT*, *vT*] *T*.

Defining Hamiltonian functions as

$$H(\mathbf{x}, \boldsymbol{\mu}, \nabla V) = f\_{\max}^2(\mathbf{x}) + g\_{\max}^2(\mathbf{x}) + \boldsymbol{\beta}^2 \mathbf{x}^T \mathbf{x} + \boldsymbol{\mu}^T \boldsymbol{\mu} + \nabla V^T [f(\mathbf{x}) + \boldsymbol{\mathcal{g}}(\mathbf{x}) \boldsymbol{\mu}] \tag{22}$$

Assuming that the minimum value exists and is unique in (22), the optimal control law is given by

$$\begin{split} \boldsymbol{u}^\*(\mathbf{x}) &= \begin{bmatrix} \boldsymbol{u}^\*(\mathbf{x}) \\ \boldsymbol{v}^\*(\mathbf{x}) \end{bmatrix} = -\frac{1}{2} \boldsymbol{\bar{g}}^T(\mathbf{x}) \nabla V^\* \\ &= -\frac{1}{2} \begin{bmatrix} \boldsymbol{g}^T(\mathbf{x}) \nabla V^\* \\ \boldsymbol{c}^T(\mathbf{x}) [\boldsymbol{I} - \boldsymbol{g}(\mathbf{x}) \boldsymbol{g}(\mathbf{x})^+]^T \nabla V^\* \end{bmatrix} \end{split} \tag{23}$$

By substituting (23) into (21), the HJB equation is given by

$$f\_{\max}^2(\mathbf{x}) + g\_{\max}^2(\mathbf{x}) + \beta^2 \mathbf{x}^T \mathbf{x} + \nabla V^{\*T} f(\mathbf{x}) - \frac{1}{4} \nabla V^{\*T} \bar{\mathbf{g}}(\mathbf{x}) \bar{\mathbf{g}}^T(\mathbf{x}) \nabla V^\* = 0 \tag{24}$$

and the initial value *V*∗(0) = 0.

**Remark 2.** *Generally, the pseudo-inverse of <sup>g</sup>*(*x*)*, <sup>g</sup>*(*x*)<sup>+</sup> *will exist if its columns are linearly independent when Assumptions 1 and 2 are true [31]. In practical control systems, the function, g*(*x*)*, is usually column full-rank. Therefore, the pseudo-inverse of the function g*(*x*) *is generally satisfied. Furthermore, the pseudo-inverse <sup>g</sup>*(*x*)<sup>+</sup> *satisfies <sup>g</sup>*(*x*)+*g*(*x*) = *I. However, it does not satisfy <sup>g</sup>*(*x*)*g*(*x*)<sup>+</sup> = *I. In addition, the auxiliary system constructed above is not a nominal system, but a compensation control term v*(*x*) *is added to the nominal system.*

If we can choose an appropriate parameter *β*, the optimal cost *V*∗(*x*) can be computed from HJB Equation (24). Then, we can get the optimal control law of system (17) with performance index (18). The following theorem shows that optimal control *<sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*<sup>∗</sup> is a robust controller for uncertain systems.

**Theorem 2.** *Assume that the mismatched uncertain system* (16) *satisfies Assumptions 4.1, 4.2 and* (19)*. Consider the auxiliary system* (17) *corresponding to the performance index* (18)*. There* *exists a solution V*∗(*x*) *in HJB Equation* (24) *for a selected parameter β, and for a constant β satisfying* |*β* | < |*β*|*, such that*

$$2v^{\*T}(\mathbf{x})v^\*(\mathbf{x}) \le \beta^{\*2}\mathbf{x}^T\mathbf{x}$$

*Then, the optimal control policy <sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*<sup>∗</sup> *can globally asymptotically stabilize the nonlinear uncertain system* (16)*. That is to say, the closed-loop uncertain system x*˙(*t*) = *f*(*x*) + *c*(*x*)*h*(*x*)+[*g*(*x*) + *g*(*x*)*m*(*x*)]*u*∗(*x*) *is globally asymptotically stable.*

**Proof.** In order to prove the global asymptotic stability of the closed-loop system, *V*∗(*x*) is chosen as the Lyapunov function. Considering the performance index (18), *V*∗(*x*) is obviously positive, and *V*∗(0) = 0. Taking the time derivative of the function *V*∗(*x*) along the system (16), we have

$$\frac{dV^\*}{dt} = \nabla V^{\*T}[f(\mathbf{x}) + c(\mathbf{x})h(\mathbf{x}) + g(\mathbf{x})u^\*(\mathbf{x})] + \nabla V^{\*T}\mathbf{g}(\mathbf{x})m(\mathbf{x})u^\*(\mathbf{x})\tag{25}$$

Using *<sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*<sup>∗</sup> yields

$$\begin{split} \frac{dV^{\*}}{dt} &= \nabla V^{\*T} f(\mathbf{x}) + \nabla V^{\*T} c(\mathbf{x}) h(\mathbf{x}) + \nabla V^{\*T} g(\mathbf{x}) u^{\*}(\mathbf{x}) - \frac{1}{2} \nabla V^{\*T} g(\mathbf{x}) m(\mathbf{x}) g^{T}(\mathbf{x}) \nabla V^{\*} \\ &\leq \nabla V^{\*T} f(\mathbf{x}) + \nabla V^{\*T} c(\mathbf{x}) h(\mathbf{x}) + \nabla V^{\*T} g(\mathbf{x}) u^{\*}(\mathbf{x}) \\ &= \nabla V^{\*T} [f(\mathbf{x}) + g(\mathbf{x}) u^{\*}(\mathbf{x}) + c(\mathbf{x}) h(\mathbf{x})] - \nabla V^{\*T} (I - g(\mathbf{x}) g(\mathbf{x})^{+}) c(\mathbf{x}) v^{\*}(\mathbf{x}) \\ &\quad + \nabla V^{\*T} (I - g(\mathbf{x}) g(\mathbf{x})^{+}) c(\mathbf{x}) v^{\*}(\mathbf{x}) \\ &= \nabla V^{\*T} [f(\mathbf{x}) + g(\mathbf{x}) u^{\*}(\mathbf{x}) + (I - g(\mathbf{x}) g(\mathbf{x})^{+}) c(\mathbf{x}) v^{\*}(\mathbf{x})] + \nabla V^{\*T} g(\mathbf{x}) g(\mathbf{x})^{+} c(\mathbf{x}) h(\mathbf{x}) \\ &\quad - \nabla V^{\*T} (I - g(\mathbf{x}) g(\mathbf{x})^{+}) c(\mathbf{x}) v^{\*}(\mathbf{x}) + \nabla V^{\*T} (I - g(\mathbf{x}) g(\mathbf{x})^{+}) c(\mathbf{x}) h(\mathbf{x}) \end{split}$$

by *<sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*<sup>∗</sup> and *<sup>v</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>c</sup>T*(*x*)(*<sup>I</sup>* <sup>−</sup> *<sup>g</sup>*(*x*)*g*(*x*)+)*<sup>T</sup>*∇*V*∗,

$$\begin{split} \frac{dV^\*}{dt} \leq & \nabla V^{\*T} f(\mathbf{x}) - 2u^\*(\mathbf{x})^T u^\*(\mathbf{x}) - 2v^{\*T}(\mathbf{x}) v^\*(\mathbf{x}) + \nabla V^{\*T} \mathbf{g}(\mathbf{x}) \mathbf{g}(\mathbf{x})^+ c(\mathbf{x}) h(\mathbf{x}) \\ \qquad - \nabla V^{\*T} (I - \mathbf{g}(\mathbf{x}) \mathbf{g}(\mathbf{x})^+) c(\mathbf{x}) v^\*(\mathbf{x}) + \nabla V^{\*T} (I - \mathbf{g}(\mathbf{x}) \mathbf{g}(\mathbf{x})^+) c(\mathbf{x}) h(\mathbf{x}) \end{split}$$

It follows from (24) that

$$\nabla V^{\*T} f(\mathbf{x}) = -f^2\_{\max}(\mathbf{x}) - g^2\_{\max}(\mathbf{x}) - \beta^2 \mathbf{x}^T \mathbf{x} + u^\*(\mathbf{x})^T u^\*(\mathbf{x}) + v^{\*T}(\mathbf{x}) v^\*(\mathbf{x})$$

As a result,

$$\begin{split} \frac{dV^\*}{dt} \le & -f\_{\text{max}}^2(\mathbf{x}) - g\_{\text{max}}^2(\mathbf{x}) - \beta^2 \mathbf{x}^T \mathbf{x} - \boldsymbol{u}^\*(\mathbf{x})^T \boldsymbol{u}^\*(\mathbf{x}) + \boldsymbol{v}^{\*T}(\mathbf{x}) \boldsymbol{v}^\*(\mathbf{x}) \\ & - 2\boldsymbol{u}^\*(\mathbf{x})^T \boldsymbol{g}(\mathbf{x})^+ \boldsymbol{c}(\mathbf{x}) \boldsymbol{h}(\mathbf{x}) - 2\boldsymbol{v}^{\*T}(\mathbf{x}) \boldsymbol{h}(\mathbf{x}) \end{split} \tag{26}$$

On the other hand

$$-2u^\*(\mathbf{x})^T \mathbf{g}(\mathbf{x})^+ c(\mathbf{x}) h(\mathbf{x}) \le [\mathbf{g}(\mathbf{x})^+ c(\mathbf{x}) h(\mathbf{x})]^T [\mathbf{g}(\mathbf{x})^+ c(\mathbf{x}) h(\mathbf{x})] + u^\*(\mathbf{x})^T u^\*(\mathbf{x})$$

Therefore,

$$\begin{aligned} -u^\*(\mathbf{x})^T u^\*(\mathbf{x}) - 2u^\*(\mathbf{x})^T \mathbf{g}(\mathbf{x})^+ c(\mathbf{x}) h(\mathbf{x}) &\le [\mathbf{g}(\mathbf{x})^+ c(\mathbf{x}) h(\mathbf{x})]^T [\mathbf{g}(\mathbf{x})^+ c(\mathbf{x}) h(\mathbf{x})] \\ &\le f\_{\text{max}}^2(\mathbf{x}) \end{aligned} \tag{27}$$

It follows from the basic matrix inequality that

$$\begin{aligned} -2v^{\*T}(\mathbf{x})h(\mathbf{x}) &\leq v^{\*T}(\mathbf{x})v^{\*}(\mathbf{x}) + h^{T}(\mathbf{x})h(\mathbf{x}) \\ &\leq v^{\*T}(\mathbf{x})v^{\*}(\mathbf{x}) + g\_{\max}^{2}(\mathbf{x}) \end{aligned} \tag{28}$$

So, it can be obtained from (26)–(28) that

$$\begin{aligned} \frac{dV^\*}{dt} &\leq -\beta^2 \mathbf{x}^T \mathbf{x} + 2\mathbf{v}^{\*T}(\mathbf{x})\mathbf{v}^\*(\mathbf{x}) \\ &= 2\mathbf{v}^{\*T}(\mathbf{x})\mathbf{v}^\*(\mathbf{x}) - \beta'^2 \mathbf{x}^T\mathbf{x} - (\beta^2 - \beta'^2)\mathbf{x}^T\mathbf{x} \\ &\leq -(\beta^2 - \beta'^2)\mathbf{x}^T\mathbf{x} \end{aligned}$$

Therefore, by Lyapunov stability theory, the optimal control *<sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*<sup>∗</sup> can make the closed-loop uncertain nonlinear system asymptotically stable. Thus, for a constant *<sup>c</sup>* <sup>&</sup>gt; 0, there is a neighborhood *<sup>N</sup>* <sup>=</sup> {*<sup>x</sup>* : *x* <sup>&</sup>lt; *<sup>c</sup>*} such that if the state *<sup>x</sup>*(*t*) enters the neighborhood *<sup>N</sup>*, then *<sup>x</sup>* <sup>→</sup> 0 when *<sup>t</sup>* <sup>→</sup> <sup>∞</sup>. However, *<sup>x</sup>*(*t*) cannot stay out of the domain *<sup>N</sup>* forever; otherwise, for all *<sup>t</sup>* <sup>&</sup>gt; 0, there is *x*(*t*) ≥ *<sup>c</sup>*. This implies that

$$\begin{aligned} \left[V^\*[\mathbf{x}(t)] - V^\*[\mathbf{x}(0)]\right] &= \int\_0^t \dot{V}^\*(\mathbf{x}(\tau))dt \\ &\leq \int\_0^t -(\beta^2 - \beta'^2)\mathbf{x}^T\mathbf{x}dt \\ &\leq \int\_0^t -(\beta^2 - \beta'^2)\mathbf{c}^2dt \\ &= -(\beta^2 - \beta'^2)\mathbf{c}^2t \end{aligned}$$

Hence, when *<sup>t</sup>* <sup>→</sup> <sup>∞</sup>, *<sup>V</sup>*∗[*x*(*t*)] <sup>≤</sup> *<sup>V</sup>*∗[*x*(0)] <sup>−</sup> (*β*<sup>2</sup> <sup>−</sup> *<sup>β</sup>* <sup>2</sup>)*c*2*<sup>t</sup>* → −∞. This contradicts the positivity of *V*∗[*x*(*t*)]. Therefore, system (16) is globally asymptotically stable. We complete the proof.

#### **5. Neural Networks Approximation in PI Algorithm**

In the first two sections, the robust control of uncertain nonlinear systems was transformed into solving the optimal control of an auxiliary system. However, whether the uncertain system is matched or mismatched, the key issue is how to obtain the solution to corresponding HJB equation. As is well known, it is a nonlinear partial differential equation that is hard to solve. Moreover, solving the HJB equation may lead to the curse of dimensionality [21]. In this section, an online PI algorithm is used to solve the HJB equation iteratively, and neural networks are utilized to approximate the optimal cost in PI algorithm.

#### *5.1. PI Algorithms for Robust Control*

For the system with matched uncertainty, the optimal control problem (4) with (5) is considered. For any admissible control, the corresponding cost function can be expressed as

$$\begin{aligned} V[\mathbf{x}(t)] &= \int\_{t}^{\infty} [f\_{\text{max}}^{2}(\mathbf{x}) + \mathbf{x}^{T}\mathbf{x} + \mathbf{u}^{T}\mathbf{u}]dt \\ &= \int\_{t}^{t+T} [f\_{\text{max}}^{2}(\mathbf{x}) + \mathbf{x}^{T}\mathbf{x} + \mathbf{u}^{T}\mathbf{u}]dt + \int\_{t+T}^{\infty} [f\_{\text{max}}^{2}(\mathbf{x}) + \mathbf{x}^{T}\mathbf{x} + \mathbf{u}^{T}\mathbf{u}]dt \end{aligned}$$

where *T* > 0 is a selected constant. Therefore, it follows that

$$V[\mathbf{x}(t)] = \int\_{t}^{t+T} [f\_{max}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + \mathbf{u}^T \mathbf{u}] dt + V[\mathbf{x}(t+T)] \tag{29}$$

Based on the integral reinforcement relationship (29) and optimal control (9), the PI algorithm of robust control for matched uncertain nonlinear systems is given below.

The convergence of Algorithm 1 is illustrated as follows. The following conclusion gives an equivalent form of the Bellman Equation (30).

**Algorithm 1** PI algorithm of robust control for matched uncertain nonlinear systems


$$V\_l[\mathbf{x}(t)] = \int\_t^{t+T} [f\_{max}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + \boldsymbol{\mu}\_l^T(\mathbf{x}) \boldsymbol{u}\_l(\mathbf{x})] dt + V\_l[\mathbf{x}(t+T)] \tag{30}$$

(4) Policy improvement: compute the control law *ui*+1(*x*) using

$$
\mu\_{l+1}(\mathbf{x}) = -\frac{1}{2} \mathbf{g}^T(\mathbf{x}) \nabla V\_l. \tag{31}
$$

By repeatedly iterating between (30) and (31), until the control input is convergent.

**Proposition 1.** *Suppose that ui*(*x*) *is a stabilization controller of nominal system* (4)*. Then the optimal cost Vi*(*x*) *solved from* (30) *is equivalent to solving the following equation*

$$f\_{\max}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + u\_i^T(\mathbf{x})u\_i(\mathbf{x}) + \nabla V\_i[f(\mathbf{x}) + \mathbf{g}(\mathbf{x})u\_i(\mathbf{x})] = \mathbf{0}.\tag{32}$$

**Proof.** Dividing both sides of (30) by *T* and finding the limit yields

$$\lim\_{T \to 0} \frac{V\_i[\mathbf{x}(t+T)] - V\_i[\mathbf{x}(t)]}{T} + \lim\_{T \to 0} \frac{\int\_t^{t+T} [f\_{\text{max}}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + \boldsymbol{\mu}\_i^T(\mathbf{x})\boldsymbol{\mu}\_i(\mathbf{x})]dt}{T} = 0$$

From the definition of function limit and L'Hospital's rule, we can get

$$\frac{d V\_i[\mathbf{x}(t)]}{dt} + \lim\_{T \to 0} \frac{d}{dT} \int\_t^{t+T} [f\_{\text{max}}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + \boldsymbol{\mu}\_i^T(\mathbf{x}) \boldsymbol{\mu}\_i(\mathbf{x})] dt = 0 \tag{33}$$

It follows that

$$f\_{\max}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + u\_i^T(\mathbf{x}) u\_i(\mathbf{x}) + \nabla V\_i[f(\mathbf{x}) + \mathbf{g}(\mathbf{x}) u\_i(\mathbf{x})] = 0$$

Thus, we can deduce (32) from (30). On the other hand, along the stable system *x*˙ = *<sup>f</sup>*(*x*) + *<sup>g</sup>*(*x*)*ui*(*x*), finding the time derivative of *Vi*(*x*) yields

$$\frac{dV\_i[\mathbf{x}(t)]}{dt} = \nabla V\_i[f(\mathbf{x}) + \mathbf{g}(\mathbf{x})u\_i(\mathbf{x})],$$

Integrating both sides from *t* to *t* + *T*, yields

$$V\_i[\mathbf{x}(t+T)] - V\_i[\mathbf{x}(t)] = \int\_t^{t+T} \nabla V\_i[f(\mathbf{x}) + \mathbf{g}(\mathbf{x})u\_i(\mathbf{x})]dt$$

Therefore, we can get the following result from (32)

$$V\_i[\mathbf{x}(t)] = \int\_t^{t+T} [f\_{max}^2(\mathbf{x}) + \mathbf{x}^T \mathbf{x} + \boldsymbol{u}\_i^T(\mathbf{x})\boldsymbol{u}\_i(\mathbf{x})]dt + V\_i[\mathbf{x}(t+T)],$$

This proves that (32) can deduce (30).

According to [32–34], if the initial stabilization control policy is given *<sup>u</sup>*0(*x*), then the follow-up control policy calculated by the iterative relations of (30) and (31) is also a stabilizing control policy, and cost sequence *Vi*[*x*(*t*)] calculated by iteration converges to the optimal cost. By Proposition 1, it is known that (30) and (32) are equivalent, so the iterative relations (30) and (31) in Algorithm 1 converge to the optimal control and optimal cost.

Similarly, we give a PI algorithm of robust control for nonlinear systems with mismatched uncertainties.

The steps of policy evaluation (34) and policy improvement (35) are iteratively calculated until the policy improvement step does not change the current policy. The optimal cost function is calculated as *<sup>V</sup>*∗(*x*), then *<sup>u</sup>*∗(*x*) = <sup>−</sup><sup>1</sup> <sup>2</sup> *<sup>g</sup>T*(*x*)∇*V*∗(*x*) is the robust control law.

The convergence proof of Algorithm 2 is similar to Algorithm 1, which will not be repeated here.

**Algorithm 2** PI algorithm of robust control for nonlinear systems with mismatched uncertainties


$$V\_l[\mathbf{x}(t)] = \int\_t^{t+T} [f\_{\max}^2(\mathbf{x}) + \mathbf{g}\_{\max}^2(\mathbf{x}) + \beta \mathbf{x}^T \mathbf{x} + \mathbf{a}\_l^T(\mathbf{x}) \mathbf{a}\_l(\mathbf{x})] dt + V\_l[\mathbf{x}(t+T)];\tag{34}$$

(4) Policy improvement: Calculate the control policy using the following update law

$$
\bar{u}\_{l+1}(\mathbf{x}) = -\frac{1}{2}\bar{\mathbf{g}}^T(\mathbf{x})\nabla V\_{l\succ} \tag{35}
$$

(5) Check if the condition 2*v*∗*T*(*x*)*v*∗(*x*) <sup>≤</sup> *<sup>β</sup>* <sup>2</sup>*xTx* is satisfied. Return to step (1) and select the larger constants *β* and *β* when it does not hold.

**Remark 3.** *In Step (3) of Algorithm 1 or Algorithm 2, solving Vi*[*x*(*t*)] *from* (30) *or* (34) *can be transformed into a least squares problem [17]. By reading enough data online along the system trajectory, the cost function V*(*x*) *can be calculated by using the least square principle. However, the cost Vi*[*x*(*t*)] *has no specific expressions. In next subsection, along the system trajectory, online reading of sufficient data on the interval* [*t*, *<sup>t</sup>* + *<sup>T</sup>*]*, the cost Vi*[*x*(*t*)] *can be approximated by neural network in PI algorithms. Moreover, implementation of the algorithm does not need to know the system dynamics function f*(*x*)*.*

#### *5.2. Neural Network Approximation of Optimal Cost in PI Algorithm*

In the implementation of the PI algorithms, we need to use the data of the nominal system and use the least square method to solve the cost function. However, the cost function of nonlinear optimal control problem has no specific form. Therefore, it is necessary to use neural network structure to approximate the cost function, carry out policy iteration, update weights, and then obtain the approximate optimal cost function. In this subsection, neural network is utilized to approximate the optimal cost in the corresponding HJB equation.

Based on the continuous approximation theory of neural network [35], a single neural network is utilized to approximate the optimal cost in HJB equation. For matched uncertain systems, suppose that the solution *V*∗(*x*) of HJB Equation (10) is smooth positive definite, and the optimal cost function on compact set Ω is expressed as

$$W^\*(\mathbf{x}) = \mathcal{W}^T \boldsymbol{\phi}(\mathbf{x}) + \boldsymbol{\varepsilon}(\mathbf{x}) \tag{36}$$

where *<sup>W</sup>* <sup>∈</sup> <sup>R</sup>*<sup>L</sup>* is an unknown ideal weight, and *<sup>φ</sup>*(.) : <sup>R</sup>*<sup>n</sup>* <sup>→</sup> <sup>R</sup>*<sup>L</sup>* is a linear independent basis vector function. It is assumed that *φ*(*x*) is continuous, *φ*(0) = 0, and *ε*(*x*) is the error vector of neural network reconstruction. Thus, the gradient of the optimal cost function (36) can be expressed as

$$
\nabla V^\* = \frac{\partial V^\*}{\partial \mathbf{x}} = \nabla \phi^T(\mathbf{x}) \mathcal{W} + \nabla \varepsilon(\mathbf{x}) \tag{37}
$$

where <sup>∇</sup>*ε*(*x*) = *∂ε <sup>∂</sup><sup>x</sup>* . On the basis of approximation property of neural network [35,36], when the number of neurons in hidden layer *<sup>L</sup>* <sup>→</sup> <sup>∞</sup>, the approximation error *<sup>ε</sup>*(*x*) <sup>→</sup> 0, <sup>∇</sup>*ε*(*x*) <sup>→</sup> 0. Substituting (36) and (37) into (9), the optimal control is rewritten as follows

$$u^\*(\mathbf{x}) = -\frac{1}{2}g^T(\mathbf{x})[\nabla\phi^T(\mathbf{x}) + \nabla\varepsilon(\mathbf{x})] \tag{38}$$

Assume that *W*ˆ is an estimated value of the ideal weight *W*. Since the ideal weight *W* in (36) is unknown, the cost function of the *i* − *th* iteration in Algorithm 1 is expressed as

$$
\hat{\mathcal{V}}\_i(\mathbf{x}) = \hat{\mathcal{W}}\_i^T \boldsymbol{\phi}(\mathbf{x}) \tag{39}
$$

Using the approximation of neural network in cost function, the Bellman Equation (30) in Algorithm 1 is rewritten as follows

$$
\hat{\mathcal{W}}\_i^T \boldsymbol{\Phi}(\mathbf{x}(t)) = \boldsymbol{\Psi} + \hat{\mathcal{W}}\_i^T \boldsymbol{\Phi}(\mathbf{x}(t+T)) \tag{40}
$$

where <sup>Ψ</sup> <sup>=</sup> *<sup>t</sup>*+*<sup>T</sup> t f* 2 *max*(*x*) + *<sup>x</sup>Tx* <sup>+</sup> *<sup>u</sup><sup>T</sup> <sup>i</sup>* (*x*)*ui*(*x*) *dt*. Since the above formula uses neural network to approximate the cost function, the residual error caused by neural network approximation is

$$\varepsilon\_i(\mathbf{x}(t), T) = \Psi + \mathcal{W}\_i^T \boldsymbol{\phi}(\mathbf{x}(t+T)) - \mathcal{W}\_i^T \boldsymbol{\phi}(\mathbf{x}(t)) \tag{41}$$

In order to obtain the neural network weight parameters of approximation function, the following objective functions can be minimized in the meaning of least square

$$E = \int\_{\Omega} \varepsilon\_i(\mathbf{x}(t), T)^T \varepsilon\_i(\mathbf{x}(t), T) d\mathbf{x},\tag{42}$$

that is Ω *<sup>d</sup>εi*(*x*(*t*),*T*) *dW*ˆ *<sup>i</sup> <sup>ε</sup>i*(*x*(*t*), *<sup>T</sup>*)*dx* = 0. Using the definition of inner product, it can be rewritten as

$$\langle \frac{d\varepsilon\_i(\mathbf{x}(t),T)}{d\hat{W}\_i}, \varepsilon\_i(\mathbf{x}(t),T) \rangle\_{\Omega} = 0 \tag{43}$$

It follows from properties of the internal product that

$$
\Phi \hat{\mathcal{W}}\_{\text{i}} + \langle [\phi(\mathbf{x}(t+T)) - \phi(\mathbf{x}(t))], \Psi \rangle\_{\Omega} = 0 \tag{44}
$$

where <sup>Φ</sup> <sup>=</sup> [*φ*(*x*(*<sup>t</sup>* <sup>+</sup> *<sup>T</sup>*)) <sup>−</sup> *<sup>φ</sup>*(*x*(*t*))], [*φ*(*x*(*<sup>t</sup>* <sup>+</sup> *<sup>T</sup>*)) <sup>−</sup> *<sup>φ</sup>*(*x*(*t*))]*<sup>T</sup>*. Therefore,

$$\mathcal{W}\_i = -\Phi^{-1} \langle [\phi(\mathbf{x}(t+T)) - \phi(\mathbf{x}(t))], \Psi \rangle\_{\Omega} \tag{45}$$

So far, the neural network weight parameters of approximation function *Vi*(*x*) can be calculated. Thus, the update control policy can be obtained from (35)

$$\mathcal{U}\_{i+1}(\mathbf{x}) = -\frac{1}{2} \mathbf{g}^T(\mathbf{x}) \nabla \boldsymbol{\phi}^T(\mathbf{x}) \mathcal{W}\_i. \tag{46}$$

According to [32,33,35,36], using the policy iteration of RL algorithm, the cost sequence *Vi*(*x*) converges to the optimal cost *<sup>V</sup>*∗(*x*), and the control sequence *ui*(*x*) converges to the optimal control function *u*∗(*x*).

For mismatched uncertain systems, similar neural network approximation can be used.

#### **6. Simulation Examples**

Some simulation examples are presented to verify the feasibility of the robust control design method for uncertain nonlinear systems in this section.

#### **Example 1.** *Consider the following uncertain nonlinear systems*

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 6 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x}\_1 \\ \mathbf{x}\_2 \end{bmatrix} + \begin{bmatrix} 0 \\ p\_1 \mathbf{x}\_1 \cos(\mathbf{x}\_2^2) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 + p\_2 \mathbf{x}\_2^2 \end{bmatrix} \mu \tag{47}$$

*where <sup>x</sup>* = [*x*1, *<sup>x</sup>*2] *<sup>T</sup> is the system state,* <sup>Δ</sup>*f*(*x*) = <sup>0</sup> *<sup>p</sup>*1*x*1*cos*(*x*<sup>2</sup> 2) *is the uncertain disturbance*

*function of the system,* <sup>Δ</sup>*g*(*x*) = <sup>0</sup> *p*2*x*<sup>2</sup> 2 *is input uncertainty function, <sup>p</sup>*<sup>1</sup> <sup>∈</sup> [−2, 2]*, <sup>p</sup>*<sup>2</sup> <sup>∈</sup> [0, 10]*. Obviously,*

Δ*f*(*x*) = *g*(*x*)*h*(*x*), Δ*g*(*x*) = *g*(*x*)*m*(*x*) (48)

*where <sup>g</sup>*(*x*) = <sup>0</sup> 1 *, <sup>h</sup>*(*x*) = *<sup>p</sup>*1*x*1*cos*(*x*<sup>2</sup> <sup>2</sup>)*, <sup>m</sup>*(*x*) = *<sup>p</sup>*2*x*<sup>2</sup> <sup>2</sup>*. Moreover,* <sup>|</sup>*h*(*x*)|≤|2*x*1<sup>|</sup> <sup>=</sup> *fmax*(*x*)*. Thus, the original robust control problem is converted into calculating optimal control law. For nominal system*

$$
\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 6 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} x\_1 \\ x\_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u\_\prime \tag{49}
$$

*find the control function u, such that the performance index*

$$\begin{split} f(\mathbf{x}, \boldsymbol{\mu}) &= \int\_{0}^{\infty} [f\_{\max}^{2}(\mathbf{x}) + \mathbf{x}^{T}\mathbf{x} + \mathbf{u}^{T}\boldsymbol{u}] dt \\ &= \int\_{0}^{\infty} [5\mathbf{x}\_{1}^{2} + \mathbf{x}\_{2}^{2} + \mathbf{u}^{2}] dt \end{split} \tag{50}$$

*is minimized.*

*In order to solve the robust control problem by using Algorithm 1, it is assumed that the optimal cost function V*∗(*x*) *has a neural network structure: V*∗(*x*) = *WTφ*(*x*)*, where W* = [*W*1, *<sup>W</sup>*2, *<sup>W</sup>*3] *T, φ*(*x*)=[*x*<sup>2</sup> <sup>1</sup>, *<sup>x</sup>*1*x*2, *<sup>x</sup>*<sup>2</sup> 2] *T. The initial weight is taken as <sup>W</sup>*<sup>0</sup> = [−1, 5, 1.5] *T, and the initial state of system <sup>x</sup>*<sup>0</sup> = [2, <sup>−</sup>0.5] *T. The neural network weights are calculated iteratively by MATLAB. In each iteration, 10 sets of data samples are collected along the nominal system trajectory to perform the batch least squares problem. After five iterations, the weight converges to* [1.9645, 2.8990, 5.4038]*. The robust control law of uncertain system* (47) *is <sup>u</sup>*<sup>∗</sup> <sup>=</sup> <sup>−</sup>1.4495*x*<sup>1</sup> <sup>−</sup> 5.4038*x*2*. The convergence process of neural network weight is shown in Figure 1, while the changing process of control signal is shown in Figure 2. The uncertain parameter p*<sup>1</sup> *and p*<sup>2</sup> *in uncertain system* (47) *take different values, the state trajectories of the closed-loop system are obtained by the robust control law. Figure 3 shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>2*, <sup>p</sup>*<sup>2</sup> <sup>=</sup> <sup>1</sup>*. Figure <sup>4</sup> shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>1*, <sup>p</sup>*<sup>2</sup> <sup>=</sup> <sup>4</sup>*. Figure 5 shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> = 0*, <sup>p</sup>*<sup>2</sup> = 7*. Figure 6 shows the trajectory of the closed-loop system is <sup>p</sup>*<sup>1</sup> = 2*, <sup>p</sup>*<sup>2</sup> = 10*. From these figures, we can see that the closed-loop system is stable, which shows the effectiveness of the robust control law.*

*In this example, because of the linear property of the nominal system, MATLAB software can be used to solve LQR problem directly. With this method, the optimal control is calculated as <sup>u</sup>*<sup>∗</sup> <sup>=</sup> <sup>−</sup>1.4496*x*<sup>1</sup> <sup>−</sup> 5.4038*x*2*. It is almost the same as the result of neural network approximation, which shows the validity of Algorithm 1.*

**Figure 1.** Neural network weight.

**Figure 2.** Robust control signal.

**Figure 3.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>2, *<sup>p</sup>*<sup>2</sup> <sup>=</sup> 1.

**Figure 4.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>1, *<sup>p</sup>*<sup>2</sup> <sup>=</sup> 4.

**Figure 5.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> = 0, *<sup>p</sup>*<sup>2</sup> = 7.

**Figure 6.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> = 2, *<sup>p</sup>*<sup>2</sup> = 10.

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 8\\ -5 & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x}\_1\\ \mathbf{x}\_2 \end{bmatrix} + \begin{bmatrix} 0\\ 0.5 + p\_2 \mathbf{x}\_2^2 \end{bmatrix} \boldsymbol{u} + \begin{bmatrix} p\_1 \mathbf{x}\_1 \cos(\mathbf{x}\_2^2) + p\_3 \mathbf{x}\_2 \sin(\mathbf{x}\_1 \mathbf{x}\_2)\\ 0 \end{bmatrix} \tag{51}$$

*where <sup>x</sup>* = [*x*1, *<sup>x</sup>*2] *<sup>T</sup> is the system state, <sup>p</sup>*<sup>1</sup> <sup>∈</sup> [−2, 2]*, <sup>p</sup>*<sup>2</sup> <sup>∈</sup> [0, 5]*, <sup>p</sup>*<sup>3</sup> <sup>∈</sup> [−1, 1]*. Let* <sup>Δ</sup>*f*(*x*) = *<sup>p</sup>*1*x*1*cos*(*x*<sup>2</sup> <sup>2</sup>) + *<sup>p</sup>*3*x*2*sin*(*x*1*x*2) 0 *,* <sup>Δ</sup>*g*(*x*) = <sup>0</sup> *p*2*x*<sup>2</sup> 2 *. It is easy to know that the system* (51) *is a mismatched system. The uncertain disturbance of the system is decomposed as*

$$
\Delta f(\mathbf{x}) = \mathcal{c}(\mathbf{x}) h(\mathbf{x}), \\
\Delta \mathbf{g}(\mathbf{x}) = \mathcal{g}(\mathbf{x}) m(\mathbf{x}) \tag{52}
$$

*where, <sup>g</sup>*(*x*) = <sup>0</sup> 0.5 *, <sup>c</sup>*(*x*) = <sup>1</sup> 0 *, <sup>h</sup>*(*x*) = *<sup>p</sup>*1*x*1*cos*(*x*<sup>2</sup> <sup>2</sup>) + *<sup>p</sup>*3*x*2*sin*(*x*1*x*2)*, <sup>m</sup>*(*x*) = *<sup>p</sup>*2*x*<sup>2</sup> 2*. Moreover, fmax*(*x*) *and gmax*(*x*) *are calculated as follows.*

$$\|\|g(\mathbf{x})^+c(\mathbf{x})h(\mathbf{x})\|\| = \|\|\begin{bmatrix} 0 & 2 \ \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix}h(\mathbf{x})\|\| = 0 = f\_{\max}(\mathbf{x}),$$

*and*

$$||h(\mathbf{x})|| = ||p\_1 \mathbf{x}\_1 \cos(\mathbf{x}\_2^2) + p\_3 \mathbf{x}\_2 \sin(\mathbf{x}\_1 \mathbf{x}\_2)|| \le |2\mathbf{x}\_1 + \mathbf{x}\_2| = \mathfrak{g}\_{\text{max}}(\mathbf{x})$$

*Select the parameter β* = 1*. Then the original robust control problem is converted into solving an optimal control problem. For the auxiliary system*

$$
\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 8 \\ -5 & 1 \end{bmatrix} \begin{bmatrix} x\_1 \\ x\_2 \end{bmatrix} + \begin{bmatrix} 0 & 1 \\ 0.5 & 0 \end{bmatrix} \ddot{u}\_\prime \tag{53}
$$

*find the control policy, u, such that the following performance index is minimized* ¯

$$\begin{split} J(\mathbf{x}, \boldsymbol{\mu}) &= \int\_{0}^{\infty} [f\_{\max}^{2}(\mathbf{x}) + g\_{\max}^{2}(\mathbf{x}) + \beta^{2} \mathbf{x}^{T}\mathbf{x} + \boldsymbol{\mu}^{T}\boldsymbol{\bar{\boldsymbol{\alpha}}}] dt \\ &= \int\_{0}^{\infty} [5\mathbf{x}\_{1}^{2} + 2\mathbf{x}\_{2}^{2} + 4\mathbf{x}\_{1}\mathbf{x}\_{2} + \boldsymbol{\bar{\boldsymbol{\alpha}}}^{T}\boldsymbol{\bar{\boldsymbol{\alpha}}}] dt. \end{split} \tag{54}$$

*In order to obtain the obust control law by using Algorithm 2, it is assumed that the optimal cost function <sup>V</sup>*∗(*x*) *has a neural network structure: <sup>V</sup>*∗(*x*) = *<sup>W</sup>Tφ*(*x*)*, where <sup>W</sup>* = [*W*1, *<sup>W</sup>*2, *<sup>W</sup>*3] *T, φ*(*x*)=[*x*<sup>2</sup> <sup>1</sup>, *<sup>x</sup>*1*x*2, *<sup>x</sup>*<sup>2</sup> 2] *T. The initial weight is taken as <sup>W</sup>*<sup>0</sup> = [1, <sup>−</sup>3, 0.5] *T, and the initial state of system is chosen as <sup>x</sup>*<sup>0</sup> = [−2, 0.5] *T. The neural network weights are calculated iteratively by MATLAB. In each iteration, 10 sets of data samples are collected along the nominal system trajectory to perform the batch least squares problem. After six iterations, the weight converges to <sup>W</sup>* = [2.8983, <sup>−</sup>0.6859, 5.2576] *T. The optimal control of the auxiliary system is calculated as u*¯<sup>∗</sup> = 0.1715*x*<sup>1</sup> − 2.6288*x*<sup>2</sup> <sup>−</sup>2.8983*x*<sup>1</sup> <sup>+</sup> 0.3429*x*<sup>2</sup> *. The robust control law of the original uncertain system is <sup>u</sup>*<sup>∗</sup> <sup>=</sup> 0.1715*x*<sup>1</sup> <sup>−</sup> 2.6288*x*2*. The convergence process of neural network weight is shown in Figure 7, while the changing process of control signal is shown in Figure 8. The uncertain parameters p*1*, p*<sup>2</sup> *and p*<sup>3</sup> *in uncertain system* (51) *take different values, the state trajectories of the closed-loop system are obtained by the robust control law. Figure 9 shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>1*, <sup>p</sup>*<sup>2</sup> <sup>=</sup> <sup>1</sup> *and <sup>p</sup>*<sup>3</sup> <sup>=</sup> <sup>1</sup>*. Figure <sup>10</sup> shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>1*, <sup>p</sup>*<sup>2</sup> <sup>=</sup> <sup>2</sup> *and <sup>p</sup>*<sup>3</sup> <sup>=</sup> <sup>0</sup>*. Figure <sup>11</sup> shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> <sup>=</sup> 0.3*, <sup>p</sup>*<sup>2</sup> <sup>=</sup> <sup>3</sup> *and <sup>p</sup>*<sup>3</sup> <sup>=</sup> <sup>−</sup>1*. Figure <sup>12</sup> shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>2*, <sup>p</sup>*<sup>2</sup> <sup>=</sup> <sup>5</sup> *and <sup>p</sup>*<sup>3</sup> <sup>=</sup> <sup>1</sup>*. From these figures, we can see that the closed-loop system is stable, which shows the effectiveness of the robust control law.*

*The nominal system is also a linear system, so MATLAB software can be used to solve LQR problem directly. With this method, the optimal control is calculated as u*¯<sup>∗</sup> = 0.1713*x*<sup>1</sup> − 2.6286*x*<sup>2</sup> <sup>−</sup>2.8983*x*<sup>1</sup> <sup>+</sup> 0.3430*x*<sup>2</sup> *. It has little difference with the approximate result of neural network, which shows the validity of Algorithm 2.*

**Figure 7.** Neural network weight.

**Figure 8.** Robust control signal.

**Figure 9.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>1, *<sup>p</sup>*<sup>2</sup> <sup>=</sup> 2, *<sup>p</sup>*<sup>3</sup> <sup>=</sup> 1.

**Figure 10.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>1, *<sup>p</sup>*<sup>2</sup> <sup>=</sup> 2, *<sup>p</sup>*<sup>3</sup> <sup>=</sup> 0.

**Figure 11.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> <sup>=</sup> 0.3, *<sup>p</sup>*<sup>2</sup> <sup>=</sup> 3, *<sup>p</sup>*<sup>3</sup> <sup>=</sup> <sup>−</sup>1.

**Figure 12.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>2, *<sup>p</sup>*<sup>2</sup> <sup>=</sup> 5, *<sup>p</sup>*<sup>3</sup> <sup>=</sup> 1.

The nominal systems corresponding to the above two examples are linear systems. The following is an example with nonlinear nominal system.

**Example 3.** *Consider the following uncertain nonlinear systems*

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 1 & 1 \\ -1 & -2 \end{bmatrix} \begin{bmatrix} \mathbf{x}\_1 \\ \mathbf{x}\_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 2\mathbf{x}\_1^2 \cos^2(\mathbf{x}\_2) \end{bmatrix} + \begin{bmatrix} 0 \\ p\_1\mathbf{x}\_2\cos^2(\mathbf{x}\_1) \end{bmatrix} + \begin{bmatrix} 0 \\ 1 + p\_2\mathbf{x}\_2^2 \end{bmatrix} \boldsymbol{\mu} \tag{55}$$

*where <sup>x</sup>* = [*x*1, *<sup>x</sup>*2] *<sup>T</sup> is the system state,* <sup>Δ</sup>*f*(*x*) = <sup>0</sup> *<sup>p</sup>*1*x*2*cos*2(*x*1) *is the uncertain disturbance function of the system,* <sup>Δ</sup>*g*(*x*) = <sup>0</sup> *p*2*x*<sup>2</sup> 2 *is input uncertainty function, <sup>p</sup>*<sup>1</sup> <sup>∈</sup> [−2, 2]*, <sup>p</sup>*<sup>2</sup> <sup>∈</sup> [0, 2]*. Obviously,*

$$
\Delta f(\mathbf{x}) = \mathbf{g}(\mathbf{x}) h(\mathbf{x}), \\
\Delta \mathbf{g}(\mathbf{x}) = \mathbf{g}(\mathbf{x}) m(\mathbf{x}) \tag{56}
$$

*where <sup>g</sup>*(*x*) = <sup>0</sup> 1 *, <sup>h</sup>*(*x*) = *<sup>p</sup>*1*x*2*cos*2(*x*1)*, <sup>m</sup>*(*x*) = *<sup>p</sup>*2*x*<sup>2</sup> <sup>2</sup>*. Moreover,* <sup>|</sup>*h*(*x*)|≤|2*x*2<sup>|</sup> <sup>=</sup> *fmax*(*x*)*. Thus, the original robust control problem is converted into calculating optimal control law. For nominal system*

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 1 & 1\\ -1 & -2 \end{bmatrix} \begin{bmatrix} \mathbf{x}\_1\\ \mathbf{x}\_2 \end{bmatrix} + \begin{bmatrix} 0\\ 2\mathbf{x}\_1^2 \cos^2(\mathbf{x}\_2) \end{bmatrix} + \begin{bmatrix} 0\\ 1 \end{bmatrix} \boldsymbol{\mu},\tag{57}$$

*find the control function u, such that the performance index*

$$\begin{split} f(\mathbf{x}, \boldsymbol{\mu}) &= \int\_{0}^{\infty} [f\_{\text{max}}^{2}(\mathbf{x}) + \mathbf{x}^{T}\mathbf{x} + \mathbf{u}^{T}\mathbf{u}] dt \\ &= \int\_{0}^{\infty} [\mathbf{x}\_{1}^{2} + \mathbf{5}\mathbf{x}\_{2}^{2} + \mathbf{u}^{2}] dt \end{split} \tag{58}$$

*is minimized.*

*In order to solve the robust control problem by using Algorithm 1, it is assumed that the optimal cost function V*∗(*x*) *has a neural network structure: V*∗(*x*) = *WTφ*(*x*)*, where W* = [*W*1, *<sup>W</sup>*2, *<sup>W</sup>*3] *T, φ*(*x*)=[*x*<sup>2</sup> <sup>1</sup>, *<sup>x</sup>*1*x*2, *<sup>x</sup>*<sup>2</sup> 2] *T. The initial weight is taken as <sup>W</sup>*<sup>0</sup> = [−2, 5, 0.5] *T, and the initial state of system <sup>x</sup>*<sup>0</sup> = [2, <sup>−</sup>0.5] *T. The neural network weights are calculated iteratively by MATLAB. In each iteration, 10 sets of data samples are collected along the nominal system trajectory to perform the batch least squares problem. After five iterations, the weight converges to* [25.5830, 12.5830, 2.6458]*. The robust control law of uncertain system* (55) *is <sup>u</sup>*<sup>∗</sup> <sup>=</sup> <sup>−</sup>6.2915*x*<sup>1</sup> <sup>−</sup> 2.6458*x*2*. The convergence process of neural network weight is shown in Figure 13, while the changing process of control signal is shown in Figure 14. The uncertain parameter p*<sup>1</sup> *and p*<sup>2</sup> *in uncertain system* (55) *take different values, the state trajectories of the closed-loop system are obtained by the robust control law. Figure 15 shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>1</sup>*, <sup>p</sup>*<sup>2</sup> <sup>=</sup> 0.8*. Figure <sup>16</sup> shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>0.5*, <sup>p</sup>*<sup>2</sup> = <sup>1</sup>*. Figure <sup>17</sup> shows the trajectory of the closed-loop system when <sup>p</sup>*<sup>1</sup> = <sup>1</sup>*, <sup>p</sup>*<sup>2</sup> = <sup>2</sup>*. Figure <sup>18</sup> shows the trajectory of the closed-loop system is <sup>p</sup>*<sup>1</sup> = 2*, <sup>p</sup>*<sup>2</sup> = 1*. From these figures, we can see that the closed-loop system is stable, which shows the effectiveness of the robust control law.*

**Figure 13.** Neural network weight.

**Figure 14.** Robust control signal.

**Figure 15.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> = 1, *<sup>p</sup>*<sup>2</sup> = 0.8.

**Figure 16.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> <sup>=</sup> <sup>−</sup>0.5, *<sup>p</sup>*<sup>2</sup> <sup>=</sup> 1.

**Figure 17.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> = 1, *<sup>p</sup>*<sup>2</sup> = 2.

**Figure 18.** Closed loop system trajectory, *<sup>p</sup>*<sup>1</sup> = 2, *<sup>p</sup>*<sup>2</sup> = 1.

#### **7. Conclusions**

In this paper, the PI algorithms in RL are proposed to solve robust control problem for a class of nonlinear continuous time uncertain system. The robust control law is obtained without knowing the internal dynamics of the nominal system. The considered robust control problem is converted into solving an optimal control problem containing a nominal or auxiliary system with a predefined performance index. The online PI algorithms are established to calculate the robust controller of matched and mismatched system. The numerical examples are given to show the availability of the theoretical results. The proposed method may be extended to solve robust tracking problems for some nonlinear systems with uncertainty entering output, which may be the subject of our future research.

**Author Contributions:** D.X.: investigation, methodology, software; Q.W.: formal analysis; Y.L.: investigation; All the authors contributed equally to the development of the research. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China under Grant No. 61463002.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is contained within the article.

**Acknowledgments:** The authors thank to the Journal editors and the reviewers for their helpful suggestions and comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Digital Twin for Designing and Reconfiguring Human–Robot Collaborative Assembly Lines**

**Niki Kousi, Christos Gkournelos, Sotiris Aivaliotis, Konstantinos Lotsaris, Angelos Christos Bavelos, Panagiotis Baris, George Michalos and Sotiris Makris \***

> Laboratory for Manufacturing Systems and Automation, University of Patras, 26504 Patras, Greece; kousi@lms.mech.upatras.gr (N.K.); gkournelos@lms.mech.upatras.gr (C.G.); saival@lms.mech.upatras.gr (S.A.); lotsaris@lms.mech.upatras.gr (K.L.); bavelos@lms.mech.upatras.gr (A.C.B.); pmparis@lms.mech.upatras.gr (P.B.); michalos@lms.mech.upatras.gr (G.M.)

**\*** Correspondence: makris@lms.mech.upatras.gr; Tel.: +30-2610-910160

**Abstract:** This paper discusses a digital twin-based approach for designing and redesigning flexible assembly systems. The digital twin allows modeling the parameters of the production system at different levels including assembly process, production station, and line level. The approach allows dynamically updating the digital twin in runtime, synthesizing data from multiple 2D–3D sensors in order to have up-to-date information about the actual production process. The model integrates both geometrical information and semantics. The model is used in combination with an artificial intelligence logic in order to derive alternative configurations of the production system. The overall approach is discussed with the help of a case study coming from the automotive industry. The case study introduces a production system integrating humans and autonomous mobile dual arm workers.

**Keywords:** digital twin; human robot collaboration; reconfiguration

#### **1. Introduction**

Traditional production systems are facing difficulties in accommodating the market demand for customized products [1]. This is attributed to the rigidity of their structure and control logic [2] as well as the isolation between design and production phases [3]. This has led to a manufacturing trend toward enabling hybrid and more flexible production systems [4].

New types of factories that facilitate the cooperation among multipurpose resources achieving the system's reconfigurability have been emerging in the last decades [2]. The current work aims to enable the deployment of such reconfigurable production systems employing mobile dual arm workers that can navigate across the shopfloor to undertake different operations while acting as assistants to human operators. To facilitate the autonomous and dynamic behavior in those systems, it is critical to implement methods and tools to optimally design and plan the activities among the available humans and robots and re-allocate the work when changes are needed or unexpected events occur [5].

Today's industrial practice lacks a systematic approach to automate the process of designing the end-to-end operation of a manufacturing system [6]. In most of the cases, a manual design is generated by the engineers based on their experience using isolated engineering tools. In case of disturbances during the execution, such as changes in the production mix or the resources' availability, the system setup and the task sequence are manually re-designed, leading to the extended use of manual resources as well as stoppage of the production [7].

Digital manufacturing aspires to bridge this gap between the design and operation phases of a product [8]. The integration of smart Digital Twins (DT) combined with Artificial Intelligence (AI) methods is a key step for the evolution of decision-making

**Citation:** Kousi, N.; Gkournelos, C.; Aivaliotis, S.; Lotsaris, K.; Bavelos, A.C.; Baris, P.; Michalos, G.; Makris, S. Digital Twin for Designing and Reconfiguring Human–Robot Collaborative Assembly Lines. *Appl. Sci.* **2021**, *11*, 4620. https://doi.org/ 10.3390/app11104620

Academic Editor: Emanuele Carpanzano

Received: 23 April 2021 Accepted: 15 May 2021 Published: 18 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

systems [9]. Over the last decades, DT based decision-making frameworks have been suggested advancing multiple aspects in Smart Manufacturing: (a) business data integration [10], surveillance and predictive maintenance [11], offline robot work cell design [12] and programming [13], and online robot behavior adaptation [14].

Over the last few years, several definitions for the term "Digital Twin" have been presented [15]. Different simulation-based techniques for the designing and the performance assessment of flexible manufacturing systems have been presented [16,17]. The connection of a DT with the sustainability validation of an intelligent manufacturing system has been highlighted, and a framework to achieve this has been proposed [18]. An ISO standard for designing and implementing DTs is available to help SMEs develop a DT for their use cases [19]. A detailed documentation on the deployment of a DT for discrete manufacturing systems has been provided [20].

Digital Twinning techniques involve the deployment of multiple sensors for virtual reconstructing the real-time status of the factory [3,21]. To capture and process the data coming from multiple sensing devices, various sensor fusion techniques have been developed [22]. Sensor fusion methods in the field of wearable robotics (prostheses and exoskeletons) have been investigated [23]. Additionally, the navigation ability of mobile and aerial robots has been developed based on data fusion techniques [24,25].

The use of DT for the design and control of Human–Robot Collaborative (HRC) operations has also been investigated [26]. However, the existing approaches are limited only in design and planning for HRC systems that incorporate stationary robots. Therefore, they do not consider the use of mobile robots and its effect in layout reconfigurability. Furthermore, there is a lack of end-to-end integrated systems to monitor the workplace and resources status and dynamically re-design the system in case of unexpected events.

This work introduces a DT-based system to close the loop between the design and operation phase enabling the dynamic behavior of HRC reconfigurable systems. The DT is integrated with an AI decision-making logic to design the layout and the task plans of the system. During runtime, the DT smart models are dynamically updated through unified interfaces with the physical assets synthesizing the real-time planning scene that is used when re-design is needed. The goal of the proposed approach is to overcome the existing limitations in terms of execution orchestration and online reconfiguration of hybrid production systems by implementing;


The overall concept for the suggested DT-based reconfiguration system is illustrated in Figure 1.

**Figure 1.** Proposed DT-based reconfiguration concept.

The paper structure comprises an overview of the suggested DT-based re-design process (Section 2) and its implementation for HRC reconfigurable workplaces (Section 3). The performance of the system is analyzed and validated in an automotive case study (Section 4). Finally, Section 5 is dedicated to drawing the conclusions and providing an outlook on future research areas.

#### **2. Approach**

The proposed method supports the continuous design in flexible manufacturing systems exploiting digital twin technology. Based on Stark et al.'s DT definition [28], a DT infrastructure, digitalizing multiple assets, is proposed to close the loop between the design tools and the physical world, as shown in Figure 2. Three steps are followed to set up the DT system.

**Figure 2.** The continuous re-design process.

The first step includes the configuration of the DT during the design phase. Currently, production designers use various simulation tools to design the production system. However, the uncertainty on the shop floor during execution such as human operators' unpredicted behavior or unexpected variations in the demand profile or in the resources' availability is not captured sufficiently through these tools. The suggested DT includes a unified representation of the Product–Process–Resource system aiming to link the production design and execution phases. The configuration process involves the creation of smart

digital models to represent multiple assets of the shopfloor, including the parts, processes, resources, and sensors. Having as a starting point the CAD models of the parts and of the production line, these models comprise (a) the semantic representation of each entity including its characteristic and capabilities as well as of the tasks that need to be executed, (b) the geometrical information that defines the spatial relation between the entities, and (c) uniform data structures able to synthesize multiple sensor data types. The target is to use these models for data sharing between multiple phases of the product lifecycle.

The second step involves the deployment of decision-making tools to derive alternative configurations of the production system using as input the DT models. These tools, wrapped as planning services, use AI to draw decisions on (a) how to optimally allocate the tasks among the available resources, (b) where to locate the resources for realizing optimal task plans, and (c) how to ensure collision-free resource trajectories, minimizing the stoppages of the task due to the existence of obstacles. The integration of planning services to the DT aims to bridge the gap between the physical world and its digital representation enabling production system's reconfigurable behavior [4].

The third step involves the deployment of control services for interfacing the involved hardware components, namely the robots' controller, the human–machine interfaces, the sensors, and the tools. The target is to simplify the control integration and sensor data sharing. This results in an easier distribution of derived plans to the relevant resources. During execution, the DT is dynamically updated, retrieving real-time sensor data, resources' status, and location as well as human input.

#### *HRC Workplaces Re-Design Workflow*

The proposed method aims to enable the operation of systems that employ human operators and flexible mobile robot workers. The mobility of these resources creates two-fold added value for the system: (a) the layout of the system can be autonomously reconfigured by re-directing the mobile robots in different workstations, and (b) the robots act as assistants to human operators increasing flexibility on the process execution [5].

The design process of HRC workplaces operation commonly proposed in the literature follows the sequential workflow presented in Figure 3 [29]. The result of this process is a fixed HRC layout where the tasks and robot motions are pre-determined so to comply with the safety regulations [30]. Human uncertainty is handled by isolated safety systems, leading to stoppages of production when the safety is compromised by human behavior.

**Figure 3.** Current practice on the design workflow of HRC applications.

The proposed DT based re-design process aims to facilitate this dynamic system reconfiguration by transforming the design workflow, as shown in Figure 4. The DT models include the information needed for optimizing the system's layout by assigning human operators and mobile robots to perform tasks in the different workstations. During execution, a set of perception modules is deployed to monitor the execution, the environment, and human behavior through multiple sensors. These sensors capture 2D and 3D images. Consuming the relevant data structures of the DT, these data are synthesized to populate the real-time planning scene in the form of an occupancy map [31].

The decision-making logic implemented for the layout and task planning is visualized in Figure 5. An intelligent search method [1] is used to generate alternative configurations of the system consuming the 3D planning scene and the resources, tools, and task DT models. Each task model selected in the task layer is linked to the involved part model and its real-time location in the shopfloor. In the second layer, the availability of resources is

assessed based on their real-time status, considering human operators (H), mobile robot workers (M.R.), and human–robot teams (H.MR.). In the third layer, based on resource capabilities including payload and available tooling, the suitability for the selected task is assessed. In the Tools layer, the fitting tooling is selected. In the last layer, the robots' motion for each alternative is calculated consuming the 3D planning scene. Finally, the alternative with the minimum execution duration is selected and sent for execution to the physical system. In cases of disturbances in runtime, the planning services are triggered to re-design the workflow.

**Figure 4.** Proposed paradigm on the design workflow of HRC applications.

**Figure 5.** Decision-making logic for reconfigurable factories.

#### **3. Implementation**

The overall architecture of the Digital Twin including the components deployed for the construction of the DT is visualized in Figure 6. The integration and communication interfaces were established following the principles of ROS [31] to enable data exchange among the different modules as well as scalability of the system in terms of adding new robot resources. The system is deployed in a PC running Ubuntu 16.04 and ROS Kinetic. The involved modules are described below.

**Figure 6.** Digital Twin system architecture.

The Resource Manager module is responsible for storing and broadcasting the Resources' digital models. These models involve attributes such as payload, minimum velocity, location, status, kinematics transform configuration (.urdf), path (.yaml), and motion (.srdf) configuration. The Workload Manager stores the task models in a JSON format following a hierarchical modeling approach [12]. The involved sensors' configuration data such as ID, location, and Unified Robot Description Format (URDF) description are registered in the DT through the Sensor Manager. The captured data are published through dedicated topics. On top of these, the 2D and 3D data are combined and published in an additional topic so to facilitate their consumption by the planning services. The Layout Manager stores all CADs files related to static fixtures, parts, and products in .sdf format also defining the collisions, the inertia, and the mass parameters. The 3D environment constructor retrieves the locations of all parts, fixtures, sensors, and resources to construct an environment with a global world frame (using ROS Tf—library [31]). It also provides an interface with ROS services and a topic to track and update the position of all parts inside the shopfloor based on the real-time updates coming from shopfloor sensors.

The integration of the DT with the planning and control services is visualized in Figure 7. The layout and task planning modules are developed in Java. Regarding the robot motions, state-of-the-art algorithms for mobile platform path planning and robot arm motion planning are integrated in the system (gmapping, amcl, and ompl [31]). The DT provides dedicated interfaces to these modules so for the latter to have access in the 3D planning scene and exchange data. The connection with the robot controller is established through ROS drivers, while the instructions for the tasks are sent to human operators in smart devices interfaces through ROSBridge.

**Figure 7.** Digital Twin based re-design system architecture.

During execution, the DT is used for publishing the planning scene constructed through the elements described above. The component diagram presented in Figure 8 details the connection and the established interaction among the different planning and control services with the DT.

**Figure 8.** DT-based system component diagram.

In order to efficiently accommodate the interaction among these modules, a serviceoriented approach has been implemented. Under this approach, the exchange of information among the different resources is performed using peer-to-peer services. In Figure 9, the provided sequence diagram details the communication among the different resources indicatively for the execution of a Navigation Task by a mobile robot.

**Figure 9.** DT-based system sequence diagram for the execution of a Navigation Task.

#### **4. Case Study**

The section describes how the DT based re-design system has been deployed, tested, and validated in a real industrial case study. In particular, the system has been tested in an assembly line of an automotive company producing the front axle of a passenger vehicle. The current assembly line, deployed in the automotive factory, is configured in a fixed layout employing human operators, compression machines, and an automated screwing machine. The base part is loaded in an Automated Guided Vehicle (AGV) that moves along the assembly line. The process takes place in three working areas: (a) the preassembly working area, (b) the compression working area, and (c) the AGV working area. However, this set up raises two main challenges for the end user. (a) The first is hardware equipment limitations to accommodate the production of multiple product variants since the automated screwing machines can accommodate the screwing of only one axle model. When new models are added in production, a manual re-design and re-deployment of the system is needed. (b) The second is the strain caused in human operators, since they need to lift 480 compressed dampers of 6 kg weight in an 8 h shift. To overcome those limitations, the DT based re-design system is used to enable the production paradigm that employs mobile dual arm workers. The workers can (a) be re-located among workstations and by automatically changing tools perform operations for multiple parts and (b) undertake the heavy tasks, reducing operators' strain.

The DT-based system is used to generate alternative configurations of the system. Depending on the alternative task plan, the resources are located in different workstations, forming the initial layout of the system. Figure 10 presents three indicative system configurations for the assembly of the axle that takes place in three working areas.

**Figure 10.** Indicative alternative configurations of the assembly system: (**a**) alternative 1, (**b**) alternative 2, (**c**) alternative 3.

The generated alternatives are evaluated toward minimizing the processing time and optimizing the use of resources. To validate the alternatives against those metrics, the decision-making system is integrated with a GAZEBO 3D simulator [32]. This simulator provided the values on these metrics for each configuration alternative. After all alternatives were validated, the optimal configuration was selected by the system. The selected task plan that is dispatched for execution is listed in Table 1.


**Table 1.** Assembly tasks allocation to resources.

The selected configuration of the layout has been physically deployed in the Laboratory for Manufacturing Systems and Automation (LMS) Machine Shop as visualized in Figure 11. This set up has been used for the execution of the assembly task plan as generated by the decision-making module.

**Figure 11.** The selected configuration physically deployed.

In this study, we consider two cases that raise the need for system re-design during execution. The first case concerns execution disturbances that call for resource behavior level re-design when the robots' motions need to be adjusted in online changes that occur in the environment. Currently, as mentioned above, AGVs are used in the factory. When a human operator gets close of one of the AGVs, the latter stops, and it only resumes its operation when the human moves away. In such a fenceless environment, these stops occur quite often during a shift, creating eventual delays in the process. This creates the need for the mobile robots to be able detect human operators, and instead of stopping, re-plan its path so to avoid them.

The second case concerns execution disturbances that call for Station Level re-design by re-organizing the remaining tasks to the involved resources. This case is of interest considering that in actual production, resources' breakdown occur during execution. When these occur, the production stops, and typically, the contribution of the maintenance department is needed to resume the execution. The target is to use the DT-based system to minimize the stoppages of production due to resources' breakdown by dynamically re-designing the tasks when these occur.

The next sub-section presents a quantification of the effect of using the DT-based system in the initial design of the system and in the occurrence of the two type of disturbances.

#### *Results*

Regarding the design of the production system, Discrete Event Simulation (DES) has been used for calculating the performance of the DT-based generated design assembly paradigm against current practice. Two different DES models were deployed in WITNESS Discrete Event Simulation Software, version 23 [33]. The first model represented the current assembly line structure, modeling the execution time for each workstation. The tasks involved in each workstation were executed by human operators and by a compression machine. The execution time for these tasks was modeled based on relevant data provided by the automotive end user. The second model simulated the new set up of the production system that employs mobile robots to support human operators. The structure of the second model was designed based on the outcome of the DT-based system. This model as well incorporated the execution time of the tasks performed by the mobile robots, the human operators, and the compression machine. The execution time for human operators and the compression machine was modeled using the same source of information as in the first model. For modeling the time required for the mobile robots to execute their assigned tasks, the execution of the tasks by the robots was simulated in GAZEBO 3D simulator recording their duration.

The simulation time for both models was one year. The production throughput and the utilization of resources were recorded for both models during this simulated year of production. The results, shown in Table 2, indicate a considerable improvement in the production throughput and utilization of operators' availability by facilitating the reconfigurable HRC system.

**Table 2.** Comparison of the two systems—designed system performance.


Regarding the re-design of the system, below, we present the responsiveness of the system in two cases of execution disturbances against current industrial practice. Figure 12 visualizes the 3D planning scene reconstructed by the DT based on the physical world real-time status. This 3D planning scene is generated by the combination of data captured by two laser scanners placed on the mobile platform and one stereo camera placed on the robot's torso. By combining this information, the human presence and state is identified and included in the planning scene.

**Figure 12.** Automotive pilot cell (**a**) real world, (**b**) 3D planning scene.

Case 1: The operator enters the common workspace, and his path intersects the mobile robot's planned trajectory for 10 s. In current industrial practice, the mobile robot would stop, waiting for the human to move away and needing 1 s to resume its path. In the same case, the DT-based system will detect his NO INTERACTION intention and add human presence and state information in the planning scene. The mobile robot, instead of stopping, will re-plan its path in around 2 s, as shown in Figure 13. To quantify this effect, in an 8 h shift, we consider that the human will block the robot path 200 times. As shown in Table 3, the latter practice may reduce by 80% the overall delay caused by human presence.

**Figure 13.** Human NO INTERACTION intention detected.


**Table 3.** Comparison of the two systems—responsiveness to disturbances.

Case 2: Robot arm fails to grasp the damper. In a realistic set up, the robot may fail to execute a task due to occlusions in the cameras' field of view, lighting conditions, etc. Through experimentation, we consider that in an 8 h shift, the robot fails 50 times. In current practice, the production manager would manually re-start the robot and correct its position, needing up to 70 s. Through our system, the operator himself can easily resume the execution by manually guiding the arm in the grasping position, as shown in Figure 14, needing up to 21 s. Table 3 shows the time saved by using the DT model functions.

**Figure 14.** Human PHYSICAL INTERACTION intention detected.

#### **5. Conclusions**

This work discusses the use of DTs for designing and redesigning flexible production systems. These systems employ mobile dual arm robots that move across the factory undertaking multiple tasks, assisting humans. Exploiting this hardware ability, the DT based re-design system generates optimal configurations in terms of layout and task plans. The solution allows for online reconfiguration of the system by (a) dynamically reassigning the tasks when unexpected events occur, (b) real time adjusting robots' behavior so to ensure collision-free trajectories generation in unstructured environments. These are achieved reasoning on top of the sensor-based real-time scene reconstruction that is provided by the Digital Twin by synthesizing multiple sensor data.

The discussed DT-based system has been deployed, tested, and validated in a case study from the automotive sector. In particularly, mobile robot workers are introduced to assist human operators in the assembly of a vehicle's front axle, which is performed fully manually in current practice. Discrete Event Simulation has been used to validate the performance of the suggested hybrid production system against current practice. The results show that by introducing mobile robot resources in the system, the production throughput as well as the resources' utilization can be increased. To validate the effect of the DT-based system that enables this hybrid production paradigm, its effect in two cases of needed reconfiguration has been analyzed and quantified. This quantification proves that the suggested system enables the faster adaptability of the system when unexpected events occur against current practice, minimizing the stoppages of production

Future work will aim to a higher Technology Readiness Level (TRL) of the application so that it may be deployed in an industrial environment. Furthermore, the target is to enhance the digital models and decision-making tools so to incorporate the integration of dynamic risk assessment tools ensuring compliance to safety regulations. In addition, the DT-based system may be extended so to include the modeling and planning of the intra factory logistics operations required for supporting the assembly as well as for respecting the desired by the end user flowtime for the products.

**Author Contributions:** Conceptualization, S.M., G.M. and N.K.; methodology, N.K. and C.G.; software, C.G., K.L., A.C.B. and S.A.; validation, S.A. and P.B.; formal analysis, N.K. and S.M.; investigation, N.K., C.G. and K.L.; resources, S.M.; data curation, A.C.B.; writing—original draft preparation, N.K. and S.A.; writing—review and editing, N.K. and S.M.; visualization, C.G. and P.B.; supervision, N.K. and S.M.; project administration, N.K., G.M. and S.M.; funding acquisition, G.M. and S.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by European Union's Horizon 2020 research and innovation programme under grant agreement No. 723616 (www.thomas-project.eu/, accessed on 13 May 2021) and 825196 (https://trinityrobotics.eu/, accessed on 13 May 2021).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Manufacturing Execution System Integration through the Standardization of a Common Service Model for Cyber-Physical Production Systems**

#### **Richárd Beregi \*, Gianfranco Pedone, Borbála Háy and József Váncza**

Centre of Excellence in Production Informatics and Control, Institute for Computer Science and Control, Kende u. 13-17, 1111 Budapest, Hungary; gianfranco.pedone@sztaki.hu (G.P.); borbala.hay@sztaki.hu (B.H.); jozsef.vancza@sztaki.hu (J.V.)

**\*** Correspondence: richard.beregi@sztaki.hu

**Abstract:** Digital transformation and artificial intelligence are creating an opportunity for innovation across all levels of industry and are transforming the world of work by enabling factories to embrace cutting edge Information Technologies (ITs) into their manufacturing processes. Manufacturing Execution Systems (MESs) are abandoning their traditional role of legacy executing middle-ware for embracing the much wider vision of functional interoperability enablers among autonomous, distributed, and collaborative Cyber-Physical Production System (CPPS). In this paper, we propose a basic methodology for universally modeling, digitalizing, and integrating services offered by a variety of isolated workcells into a single, standardized, and augmented production system. The result is a reliable, reconfigurable, and interoperable manufacturing architecture, which privileges Open Platform Communications Unified Architecture (OPC UA) and its rich possibilities for information modeling at a higher level of the common service interoperability, along with Message Queuing Telemetry Transport (MQTT) lightweight protocols at lower levels of data exchange. The proposed MES architecture has been demonstrated and validated in several use-cases at a research manufacturing laboratory of excellence for industrial testbeds.

**Keywords:** interoperability; industry 4.0; manufacturing execution system; cyber-physical production system; OPC UA

#### **1. Introduction**

There is beauty in how complex the human body is, yet seemingly how naturally and effortlessly it operates day-to-day. For the untrained eyes, a manufacturing plant can seem such a complicated, yet well-functioning entity. Both have a system which connects faculties of sensing, decision-making, and acting, which is essential for their proper functioning [1]. For the human body, this is obviously the nervous system. For a production facility, this is the *Manufacturing Execution System*, or MES in short.

Many definitions and approaches of MES appeared, yet somehow all of them tried to modularize and incorporate every aspect of manufacturing and to combine them into one monolithic system [2]. Thus far, there has not been a minimalist, core concept of MES. One which only has the most basic, *core functionalities* to execute a manufacturing plan. The aim was to take out every intelligent and decision-making aspects of MES to create a *nervous system for manufacturing*, which connects both low-level resources and their operations together, in addition to linking them to the high-level orchestrators, like a scheduler or an Enterprise Resource Planner (ERP). This approach operates with the expectation that both low- and high-level components have built-in intelligence to enable smart production through the digitalization of their environment. The basis for this expectation is the raising phenomena of Cyber-Physical Systems (CPSs) in the manufacturing shop-floor and the spread of generally accepted distributed decision-making supporting the *Cyber-Physical Production System* (CPPS) concept [3,4].

**Citation:** Beregi, R.; Pedone, G.; Háy, B.; Váncza, J. Manufacturing Execution System Integration through the Standardization of a Common Service Model for Cyber-Physical Production Systems. *Appl. Sci.* **2021**, *11*, 7581. https://doi.org/10.3390/ app11167581

Academic Editor: Emanuele Carpanzano

Received: 16 July 2021 Accepted: 16 August 2021 Published: 18 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Manufacturing digitalization prevails more and more, and industrialized countries now have their own initiatives to support it [5]. Industry 4.0 (I4.0) is all about integration and digitalization [6,7], and, given the crucial role of data and Industrial Internet of Things (IIoT) in its technological scenarios, there is a common view that integrated MES are still crucial, although moving toward the broader Manufacturing Operations Management (MOM) and service-oriented approach [8]. Next to this, robotics is expanding in magnitude around the developed world, also in executing manufacturing processes, and is expected to rise from a 15 billion US dollar sector now to 67 billion by 2025 [9].

Overall, I4.0 is demanding a gradual but continuous evolution towards interoperable standards and systems, distributed and self-learning MES services [10]. In this view, modular (not monolithic) MES, with functions becoming services, acquires a central role. Customers in manufacturing require support to multi-plant enterprise orchestration, easy scalability, optimal interplay with modern technologies, and process innovation support through the integration with additive manufacturing, robotics, and Plant Lifecycle Management (PLM) systems. In order to digitally transform industrial automation, significant efforts have to be dedicated to the isolated systems [11].

As I4.0 grows, MES need to grow alongside it [12]. Global networking of production resources and processes, as well as the use of globally interconnected applications, crucially require the adoption of unified standards. *Standardization* in manufacturing architectures evolved from ISA95 to Manufacturing Enterprise Solutions Association International (MESA) [13] in the past decades and, more recently, to service-oriented industrial reference architectures, such as RAMI 4.0 (Reference Architecture Model for Industry 4.0) [14] and IIRA (Industrial Internet Reference Architecture) [15]. All these have proved their applicability also to MES and CPPS, with the intention to organize, define, and integrate automated functions (services) that can be connected to the *core*. These two major standardization proposals are *de-facto* considered reference guidelines for conceiving future business organizations, even though a final consensus or agreement in the scientific and industrial communities has not been reached yet. An introduction to the standardization efforts is reported in Section 2.4, whereas a more comprehensive, cloud- and service-oriented presentation, and comparison of both reference architectures can be found in [16].

This evolutionary process is not straightforward, and an analysis reported also in [10] investigated the evolution of *distributed intelligence architectures* and their expected path of development, what many today identify with Multi-Agent Systems (MAS) [17]. While IIoT is a growth factor for MES in some industries, the availability of newer technologies and approaches can hinder MES deployments. Companies are trying to reduce their future risks and understand future directions in the plethora of I4.0 standards [18]. They are willing to adopt a technology which will be part of the new standards, which is easy to integrate, use, and expand, as well as which can flexibly accommodate new technologies and applications as they are developed. Research work outlined in this paper aimed ultimately at identifying the minimum set of service concepts and technologies that every CPPS should commonly understand and leverage in their standardized service provision.

In line with the argumentation reported in [19], I4.0 is requiring also a new style of MES for several reasons. Firstly, modern consumers are demanding customization. I4.0 offers an unprecedented opportunity for this, but companies need a plant floor software that is ready to adapt to the underlying technologies. Secondly, I4.0 is a concept revolving around CPSs, which is about the merging of virtual and real worlds and the creation of links between information and operations technologies. Thirdly, reference architectures are not always easily nor efficiently applicable in small to mid sized development projects due to their complexity and design requirements.

However, all the above issues cannot be resolved overnight and MES needs to be upgraded alongside the challenges and opportunities of I4.0. In particular, the shop-floor has to be optimized through decentralized management, where autonomous agents may be seen as a marketplace with demand (the products being built) and supply (the equipment). Future MES needs to accommodate connectivity, mobile, cloud, and advanced serviceoriented computing, as well as human collaboration in order to face the challenges of a new industrialization era [19]. Supporting Human–Robot Collaboration (HRC) in a multi-agent setting is a current issue of special relevance [20]. Furthermore, MES needs contextresolution capabilities in order to enable components of a CPPS to act autonomously and release their own intelligence even in a decentralized setting. In this sense, an MES is an orchestrator of complex production processes, typically over multiple layers of abstraction. Finally, IIoT sensory activity and CPPS communications create new data flows and their vertical integration is crucial to ensure an effective response in the enterprise.

#### *1.1. Main Contributions*

In the face of new requirements, typical MES functions still remain necessary in the I4.0 era as well, even though in a renewed architecture. Just as the traditional automation pyramid is evolving, the MES should also be changed accordingly, towards providing services. Figure 1 shows how the development process of the MESA automation pyramid is extended with the current proposal in this paper.

**Figure 1.** Different functional positioning of Manufacturing Execution System within a company.

Following the above considerations, research presented in this paper targeted the design and realization of a novel concept of *Manufacturing Execution System as a Service* (MESS in the following). This is a new attempt to model, integrate, and orchestrate essential CPS services that make up a modern digitized manufacturing facility. The architecture of MESS facilitates a novel, generic, and simple way for the collaboration of distributed, autonomous manufacturing entities. The focus of the paper is on the core functionality of MESS and the standardized integration of CPS services.

Major points of this research work can be summarized as follows:


#### *1.2. Paper Outline*

After an introduction of I4.0 industrial requirements and the importance to migrate from traditional MES to CPPS-based interoperable environments, motivation and contributions are presented in short. Section 2 gives a short overview of related works and aims at specifying the goals and requirements of the system targeted in this research. In Section 3, the basic concepts underpinning the methodology are enumerated, and the methodological approach is presented in detail, highlighting the fundamentals of the service integration model, its OPC UA counterpart, as well as the overall MESS system architecture. Section 4 is centered around the realization of the MESS; meanwhile, Section 5 details the application and demonstration of this work in a pilot case study. Discussion and conclusions close the paper.

#### **2. Review of Literature and Standards**

The aim of this review is to find the core manufacturing and IT functionalities of a MES that can meet and serve the requirements of CPPS. An overview of recent MES solutions has been performed with a special emphasis on those which could support distributed smart manufacturing. Standardization efforts are reviewed both in regard to expected manufacturing functionalities and applied IT.

#### *2.1. Cyber-Physical Production Systems*

An overview of MES's specific role in I4.0 is presented in [21], where it is seen as the digital twin of future factories with an ability to connect real(-time) manufacturing processes with their digital images. I4.0 is the turning point which marks the end of conventional centralized applications: these new developments are scanned in [22], along with the analysis of so-called enabling technologies. Authors in [23] provide an example of basic resource virtualization for the development of CPPS, as a viable process for companies to create digital twins by specifying the digital twin hierarchy, the information to be modeled, and the modeling methods. Anyway, the referred work is not oriented to a possible generalization of the CPPS service interoperability. In [24], a framework for the classification of CPPS developments provided by the scientific society is introduced. As evidenced also in [25], connectivity represents a prior challenge in establishing CPPS on the basis of heterogeneous IT-software environments. An attempt to realize an OPC-UA based service-oriented communication model for CPPS can be found in [26], whose core is centered around the concept of a service bus. In [27], the authors focus on the automated creation of the CPPS digital twin in a body-in-white production system. The outcome of [28] elaborates on the advantages of migrating legacy IT systems for manufacturing operations to a microservice architecture, as an important step towards platform-based ecosystems. Authors in [29] present an Open CPPS automation architecture with the intention to enable vertical integration to become a reality through a CPPS architecture and IEC-61499 standard. Research work in [30] proposes an open architecture for collaborative edge and cloud processing mode, which promises fast operation and upgrading of cloud manufacturing systems through smart monitoring-analysis-planning-execution in a closed loop. An example of architecture that implements CPPS in the cloud with distributed control and the promise to provide fault tolerance are proposed in [31], but authors primarily focus on load balancing and clustering. Research work presented in [32] refers to a CPPS integrated architecture with OPC UA server for 5G-based Smart Manufacturing, but the approach is centralized. Authors in [33] propose a conceptual model of structural and behavioral elements in CPPS, but the attempt to address the gaps of semantic description tools to be included in behavioral elements and in conceptual description remains more at a theoretical stage. A common ontology model in web-based digital twin modeling and remote control of CPPS is proposed in [34]. Finally, an attempt of dynamic reconfiguration of serviceoriented resources in CPPS via process-independent approach and multiple criteria for resources and operations can be found in [35].

#### *2.2. Distributed Intelligence Architectures in Manufacturing*

*Distributed artificial intelligence*(DAI) has emerged as a branch of Artificial Intelligence (AI) over thirty years ago as a powerful paradigm for representing and solving complex problems. The growth of the field has been spurred by the advances in distributed computing environments and the widespread connectivity. Representing a confluence of ideas from several

disciplines, DAI became an independent research discipline with two main sub-fields: distributed problem solving and multi-agent systems.

*Distributed problem solving* deals with the issues related to solving a problem by dividing it among a number of cooperative problem solvers which share the computational burden and knowledge of their partial solutions. *Multi-agent systems*, on the other hand, are concerned with the behavior of loosely coupled autonomous problem solvers, or agents that work together to solve a problem beyond their individual capabilities. Since the DAI field is closely associated with the notion of agents, it is often referred to as multi-agent system research in literature. An important characteristic of DAI is that it combines computational resources of a group of agents such that the group intelligence is more than the sum of the individual agents' capabilities.

Agents made a real break-through two decades ago or so when the emphasis in mainstream AI research shifted from goal-seeking, logic-centered to utility-oriented, rational behavior—from single to multiple, collaborative problem-solving entities. It only mattered that an agent did the "right thing", even with bounded computational resources [36]. Accordingly, after making observations, it changed its environment by its actions in a way which served best its own interest. The impacts of actions were qualified, and not the mechanisms that generated them. This pragmatic, albeit theoretically founded concept of AI intensified further research in machine learning (when improving the performance of an agent-based on its accumulated experience), in multi-agent systems (when the environment is populated by other agents, too), and also in robotics (when integrating collaborative human and machine agents) [20].

The take-up of agent technologies was accelerated by the development of powerful multi-agent design methodologies, multi-agent simulation as well as programming environments. The agent-based paradigm of computing was early welcome in manufacturing systems control as well [37], with the promise of such much-sought properties like autonomy, cooperation, responsiveness, redundancy, distributedness, and openness [17]. Indeed, it provided an alternative to the traditional, by design centralized, hierarchical, and rigid planning and control schemes. Its current status, achieved advancements, applicability and future outlook are continuously evolving, due to the fact that today's manufacturing systems are becoming increasingly complex, dynamic, and connected [38].

However, even though MAS as a paradigm seems to be suitable to address many of the current requirements imposed by cyber-physical production systems, it still has a long and difficult path for a wider acceptance in industry [39]. Manufacturing services must comply with strong requirements in terms of reliability, robustness, and latency, and solution providers are expected to ensure that agents will operate within certain boundaries of the production, and mitigate unattended behaviors during the execution of manufacturing activities. The so-called Holonic Manufacturing Systems (HMS) addressed these issues in some way, by introducing autonomous, collaborative entities (named holons) to represent the main aspects of manufacturing in terms of product, resource, order, and staff holons. The Product–Resource–Order–Staff Reference Architecture for HMS (PROSA) gained its name from this concept [40]. Holons, as wrappers, hid unnecessary details of the main entities and could realize a spectrum of control schemes, from strictly hierarchical via heterarchical to fully distributed ones [41]. Later, the so-called Activity Resource Type Instance Reference Architecture (ARTI) was born, which combined the strength of the holonic concept and the ideas of typing and instantiating from the information and computer sciences.

On the basis of the PROSA reference architecture, a holonic manufacturing execution system was implemented to support networked production [42]. Here, smart objects representing products found their way of realization in a continuous interplay with intelligent resources. This basically self-organized design opened new research frontiers for investigating issues of distributed intelligence, collaboration, and trust in the context of manufacturing, but in terms of scalability, standardization, and performance guarantees it could not match industrial expectations. In the context of networked production, recently, Ref. [43] suggested a mutualistic resource sharing approach, where the matching between resource offers and requests were made possible by an intermediate *platform*. It is important to note that all decisions were made locally, by the participating autonomous entities. The platform provided only means for information exchange, logical matching of bids demand and supply, and for evaluating and keeping track of the trustworthiness of the partners. Agent-based simulation was used to compare different information exchange mechanisms with respect to utilization rate, service level, and communication load. The findings were applied in the design of an industry-motivated crowdsourced manufacturing platform as well [44].

As an alternative response to the concerns of distributed control in manufacturing, recently a Manufacturing Agent Accountability Framework has been proposed, which is a dynamic authorization framework that defines and enforces boundaries in which agents are freely permitted to exploit their intelligence to reach individual and collective objectives [45].

#### *2.3. MES from the Manufacturing Viewpoint*

A MES can be defined by its functionalities according to editor from MPDV Mikrolab (DE) in [2]. Many different MES views and definitions are present in parallel; however, there are some very prominent international or nationally de-facto well-accepted industrial foundations and standardization associations, whose role is outstanding in the MES scene. The biggest three are the Manufacturing Enterprise Solutions Association International (MESA, USA), the National Institute of Standards and Technology (NIST, USA), and the Zentralverband Elektrotechnik- und Elektronikindustrie eV (ZVEI), which are the most well-known in industry.

MESA's MESA-11 model stood the test of time. It was published in 1997, yet, over the years, it obtained many extensions (integrated MES, collaborative MES); however, the eleven *core manufacturing functionalities* stayed the same. They defined MES as an information delivering component from order launch to finished goods, which enables the optimization of production activities and based on the data acquisition functionality it (i.) guides, (ii.) initiates, (iii.) responds to, and (iv.) reports on plant activities as they occur in nearly real time [46]. The numbered activities highlight the crucial functionality expected from a MES.

According to the NIST, MES is a system that uses network computing to automate production control and process automation by downloading routings and schedules, furthermore, uploading production reports. Simply put, the MES bridges the information gap between the business and the shop-floor (control) layer [47]. Here, again, this bridge role reappears and the demand to operate automatically. This is a less specific functionality requirement list; however, NIST released the SIMA (Systems Integration of Manufacturing Applications) Reference Architecture, which contains the accurate manufacturing activities necessary to realize a MES [48].

The ZVEI's VDI (Verein Deutscher Ingenieure) Guideline 5600 categorizes MES in the area of discrete manufacturing as a "process-oriented manufacturing management system" and as the "comprehensive driving force for the organization and execution of the production process" [49]. SEMATECH's (Semiconductor Manufacturing Technology Consortium) CIM (Computer Integrated Manufacturing) Framework defines Manufacturing Information and Execution Systems (MIES), which focus primarily on operations in a factory: resource integration and workpiece management and movement [50].

The ideal MES according to [2], from a functional point of view, consists of guarantees of communication with corporate management applications and with the manufacturing environment, in addition to available function groups that can be activated and used depending on the requirements. These three uncovered function groups are (i.) production, (ii.) quality, and (iii.) personnel, from which the production is relevant in this case and its seven specified sub functionalities. The previously presented MES concepts and their functionalities are listed in Table 1.


**Table 1.** List of major functional requirements of the Manufacturing Execution System.

#### *2.4. MES from IT Standpoint*

With the rise of the fourth industrial revolution, many attempts to design a general IT modeling and implementation standard have been made. None of the proposed architectures have gained a unique, well-accepted reference role (mainly due to their application-domain specific nature); however, they have become a sort of de-facto expected requirement in the industrial panorama. Hereafter, three of the most recognized and researched technologies nowadays will be briefly presented, which are thus worthy of attention.

Industrial Internet Reference Architecture (*IIRA*) is a standards-oriented industrial internet system open architecture of the Industrial Internet Consortium (IIC), which aims to expand industry interoperability and guide the development of technical standards. IIRA is a highly abstract general-purpose architecture whose aim is to support a wide range of industrial applications based on use-cases defined by the IIC. The Architectural Framework (IIAF) is based on the ISO/IEC/IEEE 42010 (2011) standard specifications and codifies the conventions and common practices that are the core of the IIRA Internet Information Service (IIS). The architectural concepts include concerns (subjects of interest in the system), stakeholders (entities interested in the system), and viewpoints (conventions to construct a description and analysis of specific system concerns).

The core of Reference Architecture Model Industrie 4.0 (*RAMI 4.0*) is the concept of CPS (which is somehow similar to what IIS is in IIRA), where autonomy is localized and the component systems can make their own decisions. Its reference architecture model is more manufacturing-oriented and represents the integration of multiple stakeholder visions on how to realize I4.0-based on existing communication standards and functional descriptions. There are six vertical layers defining the nature of IT components in I4.0: business applications, functional aspects, information processing, comm unication, and integration capabilities, and the ability of assets to implement I4.0 features. The central concept of RAMI 4.0 concerns the I4.0 compliance of components through the deployment of Administration Shell (AS), which essentially provides a virtual representation of the entire life-cycle of an object or asset.

The Open Platform Communications Unified Architecture (*OPC UA*) is a product of the OPC Foundation ([51]), whose aim is to provide a service-oriented, platformindependent standard whose specification is organized into several documents, ranging from concepts and security model to services, data access, alarms, and so forth. OPC UA defines the relations and semantics of services that servers can provide, mapped onto different communication and transport protocols. Interoperability is enabled by mapping concepts between property sets from different domains and/or frameworks. There is an interesting research work presented in [16], which shows a service model similarity evidence and an interoperability affinity between OPC UA and RAMI 4.0 Administration Shell concepts.

#### *2.5. Requirements of the Realized System*

The MESS will be utilized to execute production activities in a manufacturing facility, as well as to enable service-oriented orchestration and production improvement with other external technologies. The aim of the system is not only to support and improve communication among the CPSs inside the facility, but also to improve the communication capability between the production and the other activities in the enterprise (such as process planning, process simulation, resource planning, etc.), to provide updated communication and information analysis to the management, and offer a clear and simple interface to the end-users, in order to monitor and operate the system. Briefly, it is expected of the implemented MESS to provide an interoperable, reconfigurable, and reliable information system to meet these requirements.

Several fundamental design considerations have been taken into account in specifying the basic functionality of the MESS, as listed hereafter:


(vii) guaranteeing openness to predefined exception handling, as well as to unpredictable execution changing.

A careful selection of the required functionalities has been done for the achievement of an operation-focused and functionally minimalist, but useful MES. The manufacturing execution functions were decomposed according to their specific context and use of interfaces: production process management, data acquisition, and system monitoring. It is expected by the system to initiate, control, and report on production activities as they occur. To support these expectations, a summarized list of major (i.e., considered mandatory) and *core functionalities* for the MESS are catalogued in the following Table 2 based on the previously listed functionality collections.


**Table 2.** List of major functional requirements of the MESS.

#### **3. Methodology**

The research work presented in this paper embodies a re-elaboration of the Networked Embedded Systems (NES) definition of a CPS [52,53] and its information and service model, with the intention to leverage it for the development and integration of I4.0-compliant CPSs in a distributed MES environment. This was also carried out following the RAMI 4.0 and OPC UA interoperability guidelines. To this end, the necessary industrial equipment and tools have been identified, whereas the CPS common service concepts and its core functionality has been outlined, as detailed in this section.

#### *3.1. Basic Concepts*

Manufacturing is fundamentally the process to plan and deploy an optimum way of transforming materials into goods, by means of integrating a variety of resources (people, capital, processes, systems, and enterprises), in order to deliver end-products of value to the market. Production starts with a product design. It always has a Bill of Materials (BoM), which can be translated into a *Process Plan*, which is a sequence of selected manufacturing or logistic related *Capabilities* to be performed by specific production *Resources* on the

components of the BoM at the shop-floor level. A production task can be defined as the reservation of specific Resource based on its required Capability. Execution is the procedure to perform a Process Plan in the production system.

After defining the procedure of production, when a production order arrives, a planner can create a *Routing* by finding the feasible combination of available Resources with the necessary Capabilities to be performed, as well as the required materials and parts, together denominated as *Workpieces*. These one-by-one combinations of Resource, Capability, and Workpiece are called *Operations*, which are the basic units of a Routing. With a scheduler, a Routing's time course (called *Schedule*) can also be determined to finish before the order's deadline. The list of basic concepts leveraged in this work's methodology is reported in Table 3.


**Table 3.** List of manufacturing concepts utilized.

#### *3.2. Cps Service Model and Architecture of the Mess*

At the heart of the developed system, there is the concept of CPS, which is in charge to perform production activities. As mentioned, the CPS conceptualization utilized in this manuscript is based on NES definition and comprises the following four basic abilities: sensing, acting, computing, and networking. A CPS can be abstracted to almost anything: a robot, a human worker with network connected device, a camera, a PLC controlled manufacturing unit, a pool of tools, etc. To the MESS, a CPS is like a black box that makes the physical layer seamless, providing a set of production related capabilities and variables. The production capabilities of the CPS can be reached by the production process manager (from now on MESS Core) directly or through a digital counterpart—digital twin—with standardized interface. From the point of view of an external component to the MESS (i.e., user interface, scheduler, and so forth, out of scope in this paper), only these twinned CPSs are visible and discoverable, while physical devices are kept hidden for security and competency reasons. A schematic representation of the system is illustrated in Figure 2.

In order to realize the idea of generally embeddable I4.0-compliant CPSs providing manufacturing services, a "minimalistic" CPS service model has been conceptualized. A CPS is basically expected to embody a set of core service concepts whose selection is necessary to guarantee the following: the core *digital representation* of a CPS; the *service interface* to the MESS collaborative environment; and the *compliance* of the CPS with NES definition and I4.0 components in RAMI 4.0 [16].

**Figure 2.** High-level concept of the MESS architecture.

All CPSs publish their capability in terms of (micro or macro) *services*, which can be invoked by means of parameterized *functions*. Invoking a function *call* triggers the instantiation of execution related tasks, which are necessary to track its execution advancement on the basis of even-triggered *reports*. CPSs also have operating parameters called *variables*, which can point to any exposed signal of the specific equipment and whose values can be utilized in the decision-making process of a routing. Functions are organized and linked together in a process plan via precedence edges, which represent the necessary conditions for the specific function to execute (details and an illustrative example in Section 5).

The interface of a CPS identifies the following common functionality:


*Design assumptions*:


Any dedicated, low-level controller is allowed inside the CPS, but each CPS should have a unified interface toward the MESS. The definition of this minimal set of pillar CPS concepts is reported in Table 4.

#### *3.3. Common OPC UA Model of a CPS*

Figure 3 illustrates the concept map defined in the OPC UA information space, which is shared by a CPS when connected to the core engine of the MESS. The OPC UA address space is structured hierarchically to foster the interoperability of clients and servers, but only top levels are standardized for all servers (nodes in grey).

All other nodes in the address space follow this initial hierarchy but their specific information model is left to the system designers' freedom to interpret the environmental description. To resolve this initial model uncertainty, a set of common OPC UA nodes is proposed as the basis for every CPS service model. These nodes reflect the concepts

introduced in the previous section and can be either statically or dynamically generated elements of the address space. For instance, objects organizing the overall OPC UA model structure are static (i.e., CPS, Functions, Calls, Variables, and Maintenance Functions), whereas OPC UA nodes related to the actual list of CPS functions, their call statuses, as well as the list of CPS variables are generated according to the specific configuration of the CPS and the execution logic of the process plan (dotted nodes). The core concept of a CPS variable should not be confused with the native OPC UA node of type *variable*, which is proper of the OPC UA specification. CPS and maintenance functions (yellow nodes) are OPC UA *methods*: the former expresses the service granularity of a CPS, while the latter are intended to provide the CPS administrator with a list of utilities regarding the overall process plan, the function calls, and the CPS configuration updates. The capabilities of a CPS follow the naming convention of the OPC UA technology and are specified and extended via a dedicated OPC UA namespace, which is structured hierarchically starting from the main CPS folder. The MESS can handle different reporting mechanisms for tracking the status of a CPS function call execution, but, by default, it requires that each function call generate a proper set of status reporting variables.

**Table 4.** List of CPS service concepts.


**Figure 3.** OPC UA information model of CPS core service elements.

#### **4. System Implementation**

The MESS is a set of integrated software and hardware components that provide functions for managing production activities, from job order launch to finished products. By the use of current and accurate data, the MESS initiates, guides, responds to, and reports on production activities as they occur (as expected by MESA). It can also interface with other

production information systems, as well as support engineering and business activities (such as scheduling, simulation, process planning, and low-level task dispatching and process control on the physical layer of the CPS; all out of scope in this paper). The major architectural components of the MESS are depicted in Figure 4 and presented in the following sections.

**Figure 4.** Overall view of the MESS architecture.

#### *4.1. Mess Core*

This is the heart of the system: a time-invariant, event-based sequence control of processes, based on the monitored statuses of resources (e.g., management of production processes). The MESS Core controls the synchronization with the CPSs and acknowledges the other system components about the occurring events. As illustrated in Figure 4, its major components can be listed as follows (from left to right):


– *CPS Mapper* is finally entitled to provide a unified mechanism to synchronize and consistently manage the connection and the information exchanged between the MESS Manager and the CPS Twinning layer (and its various CPS components).

The overall MESS also embodies a *Process Plan Editor*, which is part of the MESS Graphical User Interface (GUI). It provides a visual tool for the editing of process plan, to be executed by the MESS Process Manager later on. On the front-end side, the function inventory of every registered CPS is accessible and transparent with drag-and-drop feature to help users' interaction. Furthermore, the user interface provides a nearly real-time system overview during the process execution with the presentation of up-to-date production data. Beyond all of this, the admin users have a dedicated and secured interface for the configuration of the production and the MESS system.

#### *4.2. CPS Twinning*

As already mentioned, at the core of the designed system, there is the concept of CPS and the exploitation of its *Digital Twin* counterpart in (also virtually augmented) production scenarios. A tool can be any kind of manufacturing unit, an instrument, a device, a robot, a machine, a production line, a transport unit, or even a human worker. These tools are contained in (belong to) a CPS and the MESS Core takes care and manages the connections with all the CPSs. The MESS provides several ways to integrate CPSs and their tools into the system, according to the level of IT expertise and the communication control requested by their designers. Basically, it offers two interface solutions for this connection: by means of an externally served CPS (capable of one between OPC UA—the preferred solution—and HTTP/JSON equivalent service Application Programming Interface, in short API) or by the adoption of a lower-level wrapping Production Administration Shell (PAS) for CPS (part of this research work's contribution and described in detail in the following), respectively. The PAS represents the I4.0 compliant twinning counterpart of a generic production CPS, in which the latter is in control of the physical devices. The great advantage of this CPS Twinning solution is that, regardless of the specific method chosen by the designer for their CPS integration, once connected to the MESS Core, the overall system will seamlessly provide a standardized OPC UA-equivalent transformation of the CPS service and information model to the external world. It is worth recalling how the OPC UA is the technology sitting on top of the MESS (Core and CPS Twinning) standardization and semantic interoperability. It is the technological umbrella that is capable not only of transporting machine data but also semantically describing them in a machine-readable way. An example for the interoperability of a Twinned CPS is shown in Figure 5 through an external, commercial OPC UA client software.


**Figure 5.** View of Twinned CPS instance via a commercial OPC UA client software.

The externally served CPS solution provides the opportunity to the CPS designers to create their own CPS controller with one of the MESS supported communication methods. The advantage of this solution is that the CPS developer is in charge of the complete

implementation and communication, either via an OPC-UA server or an HTTP client and server pair.

The second solution is one of the peculiar outcomes of this research work and refers to the exploitation of the MESS PAS (details in the next section). Once its configuration is finalized, the PAS enables the construction of complex functionality thanks to the combination of the newly integrated tools' functions. The PAS internally models tools as devices, which can cover any kind of equipment, both individually or in a form of fleet (details in the demo use-case section).

#### *4.3. Production Administration Shell*

PAS takes inspiration from the design of the RAMI 4.0 AS (the similarity in representation can also be observed in Figure 6). Simply put, it represents a wrapping component that embodies, organizes, and exposes, in a standardized manner, the CPS functionality towards the overall production execution environment (refer to Figure 4 for architectural details). The goal of the PAS is to provide an easy-to-use tool which helps with transforming typically ad-hoc developed (legacy) manufacturing cells into interoperable, service-based I4.0 CPSs for distributed production environments. PAS embodies a dedicated OPC UA server (extension of the MILO Project, serving both the MESS core and third party OPC UA clients) and an HTTP server/client interface, which is capable of providing the same functionality and services defined for the MESS Core but specified/customized for the local intelligence and complexity of the single CPS.

**Figure 6.** Production Administration Shell for CPS modeling and functionalities.

The PAS embodies a *Device Communication Layer* (DCL) to communicate with the physical devices, which is responsible for sending task execution requests, receiving reports on task execution statuses, sending special (maintenance) service requests to the devices (in order to query or cancel active tasks), as well as collecting the device variables' current values. Sending executable and maintenance tasks, as well as receiving inherent reports and variable updates from the CPS physical layer, can occur through different channels (HTTP or TCP), whereas intensive (Big Data) physical signal variables can be sent over UDP or MQTT, both to internal components or to external, cloud-based services for analytical purposes. As previously mentioned, the CPS Twinning layer of the MESS is able to generate a standardized OPC UA-equivalent transformation of the CPS information model. This is the core feature of the PAS too (details on the interfaces between the MESS Core, the CPS Twinning/PAS, and the low-level CPS communications can be found in Figure 7.

OPC UA was not chosen (although capable) for the intensive, low-level communications between the PAS and the CPS physical layer, due to the following reasons: first of all, the OPC UA is a complex information modeling technology to be deployed, and not just a communication and transport protocol; this can possibly result in communication stack overload and so response time issues at low-level, especially in resource-limited devices; in

other words, OPC UA is not (primarily) intended for IoT bottom-up big data publishing (as also reported in [54]). To this end, the MQTT lightweight publish/subscribe messaging transport technology was selected for bottom-up data interoperability, even considering the possibility of pairing this technology with OPC UA in a publisher–subscriber model, eventually. Alternatively, the PAS has also built-in UDP data exchange interface.

Task execution requests can occur in two different modes, according to the specific device capabilities. By default, this layer handles task queues for the devices and dispatches tasks one-by-one. If a device is natively able to handle execution queues or to process more than one task at a time (what can be called *flooded* task execution), this setting can be omitted. The PAS also contains a built-in planner and a device controller, which provide the same realization of the OPC UA and HTTP interfaces proper of the MESS core but with the main goals to transform the high-level command tasks into low level, device-specific ones, and to control the execution of this new task graph.

**Figure 7.** Details on the OPC UA interface between CPS and MESS Core.

The types of CPS utilized in this work comprise a variety of elements: single robot arms (1), production assembly line (2), Automated Guided Vehicle (AGV) fleet, with and without robot arms (3, 5), collaborative robots (4) and human operated components, such as the warehouse and a digital work assistance system (6). There is also a built-in *utility* CPS, whose *internal* capabilities are related to timing, monitoring, and changing conditions in the execution of process plans. A numbered correspondence of CPSs both for the planned and physically realized layout is reported in the following Figures 8 and 9.

**Figure 8.** Physical components of the CPSs in the demonstrated use case.

**Figure 9.** Realized MESS layout: involved CPSs, human activity, and implemented logistics.

#### **5. Demonstration Use-Case**

The MESS system was deployed in a pilot cyber-physical production and logistics system which was dedicated to research, education, training, and demonstrations in the academic, industrial, and public sectors [55]. In this *learning factory*, the goal was to verify and validate the MESS system and demonstrate its main services and facilities. The scenarios in the demonstration use-case were rather straightforward by design, in order to clearly exhibit the following features of the MESS: (1) interlinking of heterogeneous resources both (2) of production and of internal logistics; (3) inclusion of physical resources and their full-fledged digital twins; (4) integration of both legacy and state-of-the-art I4.0 ready, as well as (5) of fully automated and human-assisted system components; (6) and demonstration of their easy interoperability. Here, the emphasis is on the fact that some of the integrated elements—such as an assembly line [56], human-robot collaborative workcells [57], or a small fleet of AGVs [58]—were complete, complex CPSs themselves, accompanied by their specific, custom-tailored digital twins.

#### *5.1. The Experimental Scenario*

The use-case scenario presented here addresses the production (assembly and delivery) of ball–valves and thermometers in one of the MESS premises. The physical components of the pilot system participating in the use-case scenario are shown in Figure 8. The layout of the realized MESS, the sequence of HRC dynamics, and the involved logistics for the material handling (noted with *a*1..*a*3, *b*1..*b*2, *c*1..*c*3 routes) are depicted in Figure 9.

The main steps of the use-case scenario can be summarized as follows:


The list of targets and relative services implemented via OPC UA for each CPS in the demonstrated use-case scenario is reported in Table 5, whilst the graph in Figure 10 illustrates the concatenation of CPS services in the form of a schematic operation graph (routing), with their execution preconditions (precedences between nodes).


**Table 5.** List of published services for each CPS.

The previously mentioned *Process Plan Editor* provided by the MESS GUI is where, similarly to the schematic graph, the graph of a process plan can be created by selecting the available operations and connecting them with arrows to form precedence constraints. The process plan for this use-case scenario is shown in Figure 11. Additional information is available for every node derived from the common service model by clicking the info button, as shown in the blue box.

The created plan can be executed and its status can be tracked continuously on the *Process Manager* tab of the MESS GUI. The status of the execution can be visualized by the coloring of each individual task based on their own statuses. This is a quick and easy to understand approach to report on the overall status of the process. The coloring is the following: (i.) grey displays the task is waiting for start signal, (ii.) yellow means the task is under execution; meanwhile, (iii.) the green indicates that the task is finished and (iv.) the red color represents the error status. Except for the error status, the other three can be observed in Figure 12. The communication which reports on the shown status change is visible inside the box in the middle of the figure. The message contains the most necessary data to update the information model of the system: type of the message, process identifier, timestamp, and the type specific parameters. This lightweight communication between the CPS and the MESS Core allows a reliable operation. The detailed use-case scenario has been executed flawlessly in an automatic and trackable manner.

**Figure 10.** Schematic graph of the demonstrated use-case scenario.

**Figure 11.** Process plan of the demonstrated use-case scenario.

#### *5.2. Discussion and Lessons Learned*

The above experimental setup and scenarios of the use-case proved to be appropriate to demonstrate all the above six features of the MESS. Hence, system components could be (de-)activated in a plug-and-play manner. Production and internal logistics could speak the same language, just like legacy robotic systems and brand new HRC workcells. Production and logistics processes could be run both in the virtual and physical worlds, even in parallel. This opens new opportunities both for right-first-time system design and configuration, as well as anticipating and avoiding failure situations. In settings where humans are closely involved in production, such a service is essential, whereas its lack could be detrimental [20,55].

**Figure 12.** Process execution example with communication.

The implemented MESS and related (PAS-based) CPS resulted in a valid solution for an easier (re)configuration, as well as extensibility and management of the production and logistics processes, thanks to the conceptualized CPS common service model. This led to the conversion of isolated, legacy industrial equipment into I4.0 ready CPSs, which are capable of exposing their functionalities in terms of services. On the other hand, some CPSs evidenced specific issues. Factory line #1, for instance (no. 2 on Figure 8), could natively handle only a predefined number of *tasks*, and this number could not be changed dynamically during the operation. When parameterizing variables, moreover, they could only have read-only properties, due to system security and to avoid unwanted intervention. Nevertheless, by the protocol for writable variables, server-side and clientside writing were also allowed, which might cause unattended security risks. To overcome this, it was necessary to create cohesiveness between the field programmer level and the control programmer level. The establishing of the standardized synchronization of these programming levels requires further development and research work. It was also necessary, for safety reasons, to configure the network permissions in a way to restrict and route communication traffic only from accredited CPSs. After this initial setup, a variety of process plans were successfully carried out by the various CPSs, in a way that single production cells could be incrementally linked to much wider production processes.

#### **6. Conclusions**

Industry 4.0, or in general next generation of smart factory developments, will require an unprecedented level of interoperability and standardization, with the intent to facilitate the collaboration of connected cyber-physical entities. The latter, however, are all characterized by a high degree of flexibility, adaptability, and autonomy. In this paper, we suggested a generalized common service model and architecture of CPS-based manufacturing execution systems. The core model is minimalist as far as its underlying assumptions are concerned. Hence, it does not constrain the decision autonomy of collaborating cyberphysical entities, and "only" provides channels for transferring and synchronizing the information which ensue from their decisions. The proposed approach identified the elementary concepts, such as *functions, calls, variables,* and *reports* as the basis for modeling and providing I4.0 compliant, CPS-based services in a manufacturing environment. They have been developed via standardized technologies enabling semantic interoperability and openness (OPC UA, MQTT).

The universal CPS-based service integration mechanism has been validated in an experimental pilot production and logistics system which included a variety of heterogeneous and autonomous resources, such as manufacturing cells, AGVs, robots, and human–robot

collaborative cells. These CPS components were connected and controlled via the plug and collaborate mechanism of our MESS system in a number of complex scenarios.

Next investigations look forward to the extension of the MESS in particular with including the semantics of the OPC UA common service model to support the adaptation and embodiment of new equipment, like 3D printers, palletizers, and taggers for internal positioning and logistics. The MESS system was prepared also with a distributed intelligent control in mind. Our future research will focus on realizing such a control scheme on the basis of the MESS architecture presented here.

**Author Contributions:** Conceptualization, R.B. and G.P.; methodology, R.B. and G.P.; software and system realization, B.H. and G.P.; validation, R.B., G.P. and B.H.; formal analysis, R.B. and J.V.; resources, R.B. and J.V.; data curation, G.P. and B.H.; writing—original draft preparation, R.B. and G.P.; writing—review and editing, B.H. and J.V.; visualization, R.B., G.P. and B.H.; supervision, J.V.; project administration, G.P. and J.V.; funding acquisition, J.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research has been supported by the GINOP-2.3.2-15-2016-00002 grant on an "I4.0 research and innovation center of excellence" and by the ED\_18-2-2018-0006 grant on a "Research on prime exploitation of the potential provided by the industrial digitalization".

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:



#### **References**


## *Article* **Decentralized Multi-Agent Control of a Manipulator in Continuous Task Learning**

**Asad Ali Shahid 1, Jorge Said Vidal Sesin 2, Damjan Pecioski 2, Francesco Braghin 2, Dario Piga <sup>1</sup> and Loris Roveda 1,\***


**Abstract:** Many real-world tasks require multiple agents to work together. When talking about multiple agents in robotics, it is usually referenced to multiple manipulators in collaboration to solve a given task, where each one is controlled by a single agent. However, due to the increasing development of modular and re-configurable robots, it is also important to investigate the possibility of implementing multi-agent controllers that learn how to manage the manipulator's degrees of freedom (DoF) in separated clusters for the execution of a given application (e.g., being able to face faults or, partially, new kinematics configurations). Within this context, this paper focuses on the decentralization of the robot control action learning and (re)execution considering a generic multi-DoF manipulator. Indeed, the proposed framework employs a multi-agent paradigm and investigates how such a framework impacts the control action learning process. Multiple variations of the multi-agent framework have been proposed and tested in this research, comparing the achieved performance w.r.t. a centralized (i.e., single-agent) control action learning framework, previously proposed by some of the authors. As a case study, a manipulation task (i.e., grasping and lifting) of an unknown object (to the robot controller) has been considered for validation, employing a Franka EMIKA panda robot. The MuJoCo environment has been employed to implement and test the proposed multi-agent framework. The achieved results show that the proposed decentralized approach is capable of accelerating the learning process at the beginning with respect to the singleagent framework while also reducing the computational effort. In fact, when decentralizing the controller, it is shown that the number of variables involved in the action space can be efficiently separated into several groups and several agents. This simplifies the original complex problem into multiple ones, efficiently improving the task learning process.

**Keywords:** reinforcement learning; decentralized control; multi-agent; continuous control; robotic grasping; policy optimization

#### **1. Introduction**

*1.1. Context*

Artificial intelligence and machine learning are quickly becoming core tools in the robotics domain, allowing incorporation of intelligence in robots, making them able to analyze, learn, and improve their behavior with minimum human intervention [1,2]. In particular, reinforcement learning (RL) [3] offers great potential to benefit from the experience to learn control tasks autonomously, as it is successfully shown in complex environments and scenarios [4,5]. However, reinforcement learning typically requires collecting a large number of samples to accumulate such an experience (i.e., training data), which still limits the application of RL in real physical systems such as robots,

**Citation:** Shahid, A.A.; Sesin, J.S.V.; Pecioski, D.; Braghin, F.; Piga, D.; Roveda, L. Decentralized Multi-Agent Control of a Manipulator in Continuous Task Learning. *Appl. Sci.* **2021**, *11*, 10227. https:// doi.org/10.3390/app112110227

Academic Editor: Dario Richiedei

Received: 2 October 2021 Accepted: 27 October 2021 Published: 1 Novermber 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

and pushes the current computer technology to the limit. Moreover, the control problems in robotics are best represented with high-dimensional matrices, continuous state, and action spaces, which actually make the application of this type of technique expensive even for highly efficient processors [6]. Therefore, it is important to consider new ways to simplify the learning problem, to reduce the computational efforts required by the RL. Recent works have considered ways to speed up the training process by learning contextual affordances [7], where the robot learns to perceive the effect of performing an action on an object using the current state. This is particularly important for learning vision-based manipulation skills since it greatly reduces the training time [8]. Some other works have instead focused on the explainability of RL agents, allowing a learning agent to explain the decision of selecting an action over other possible actions [9]. Exploration is another main challenge in RL agents that refers to the problem of choosing a policy that allows a learning agent to discover high-reward regions of the state space. In [10], the performance of different exploration strategies have been compared for on-policy and off-policy algorithms showing that the value-difference-based exploration strategy combined with softmax action selection improve the learning efficiency. Providing interactive feedback about the agent's actions is another approach to deal with the exploration issue [11].

In this paper, a decentralized learning approach is proposed for the robot to learn to perform a target task in a simulation environment through the use of a multi-agent framework (being implemented as On or Off policy RL algorithms), and transfer such knowledge to the real robot for real-task execution. Results and achieved performance are then analyzed in terms of how this approach can improve the computational effort and learning curve speed, while processing the learning algorithms, by comparing the proposed multi-agent framework with a centralized learning schema (used as a baseline).

#### *1.2. State of the Art and Related Works*

Within the state-of-the-art methods related to robotic task learning, the two main learning approaches that can be identified are (i) reinforcement learning and (ii) learning from demonstration (LfD) or imitation learning. The main difference between these approaches is that RL discovers optimal behavior through trials, whereas LfD aims to learn the task based on the presented trajectories, such as in [12]. A combination of these approaches can be used as presented in [5], where an algorithm learned to play the game of Go and won against an international champion by first learning from playing games and later improving through RL. In order to implement autonomous task-learning approaches, RL is commonly adopted. Recent works have emerged in the use of deep RL techniques in robotic grasping as in [13], and more notably in [14]. In [15], deep RL is used in the execution of a robotic task in a continuous domain. This is of great importance, since most real-life robots are simulated in continuous space. In [16], major advancements in large convolutional neural networks used for deep RL are highlighted. These networks have been used to learn hand-eye coordination for robotic grasping from monocular images and can predict the probability of successful grasping using only camera images.

With the increasing complexity of systems and tasks, a new efficient learning method, decentralized RL, has been proposed [17]. Additional work in this field is presented in [18], where decentralized reinforcement learning is implemented in learning robot behavior. Within the field of multi-agent systems, ref. [19] provides a survey for cooperative multi-agent learning literature where the work is divided into two categories. These categories are applying a single learner to discover joint solutions to multi-agent problems (team learning), or using multiple simultaneous learners, often one per agent (concurrent learning). The complexity of continuous space made the first algorithms used in RL obsolete, presenting the need for creating new approaches. These new approaches include networks such as Hindsight Experience Replay, which uses deep Q-networks (DQN) and deterministic gradients (DDPG), which can be found in [20]. Another approach is a distributed Q-learning algorithm which was introduced in [21]. It is noted that by using this algorithm, agents might neglect a penalty due to non-coordination in their update. Ref. [22] points out the limitations of distributed Q-learning when dealing with stochastic environments, presenting a modified exploration strategy. Additionally, the Hysteretic Q-Learning algorithm [23] can be used for decentralized reinforcement learning. This algorithm is an extension of Q-learning and does not require any additional communication between the robots.

When studying multi-agent systems, the learning environment might be non-Markovian from an agent's perspective. This property is explored in [24], where the multi-agent systems are investigated from an evolutionary dynamical perspective. In [25], some known intuitions about concurrent learning agents are formalized, providing formal conditions that make the environment non-Markovian from an independent (non-communicative) learner's perspective. Cooperative control can be used to improve both the quality and speed of learning [26]. Additional parameterized control policies, which are used to reduce the computational load of high dimensional continuous state-action spaces, are presented in [27]. Ref. [28] combines deep learning with multi-agent systems, developing a deep multi-agent RL approach for decentralized continuous cooperative control. In [29], the distributed control of mobile robots is presented for environmental sensing applications. Closed-loop locomotion policies are learned for articulated mobile robots (snake and hexapod) using the multi-agent extension of asynchronous advantage actor-critic (A3C) algorithm [30]. The A3C algorithm acts as a global meta-agent storing a shared policy that multiple worker agents implement and improve in a distributed manner. Coordinating manipulation skills using two robots is also presented in [31], where a multi-agent approach using skill behavior diversification is implemented. The learning of skills is done separately and put together in order to achieve a complex task execution. In [32], a decentralized multi-arm motion planner is implemented using up to 10 robotic arms. The decentralized policy is trained to control one robot arm in the multi-arm system to reach its target end-effector position, given the observations of its workspace state.

#### *1.3. Paper Contribution*

The main objective of this work is to decentralize a centralized controlled system of a multi-DoF robotic arm. By decentralizing and splitting the actions into multiple separate agents, the proposed approach gives modularity to the system. Specifically, in the presented approach, two agents are used in the multi-agent system. The number of used agents, as well as the number of motors that a single agent controls, can be varied, providing flexibility and generalizible capabilities to the framework. Multiple variations of the recent RL algorithms (i.e., Proximal Policy Optimization—PPO—and Soft Actor-Critic—SAC) are used for the training. In the first case, the agents are coordinated by an overseer (a meta-agent), which gives importance to the actions of one of the two agents. In the other case, the agents are not coordinated, and instead, both the agents follow their own policy, either the PPO policy, the SAC policy, or the combination of both. In the uncoordinated case, the agents are not given any information about the output of the other agent, thus, not knowing that their actions affect the other agent and the overall system. In the final case, a single agent in a decentralized system is used. The manipulator's action are divided into two parts, and a single agent following one policy is used to learn the task (as shown in [33]).

In order to evaluate the proposed multi-agent framework, a 7-DoF Franka EMIKA panda robot is employed to perform a grasping task. The robot is required to position the end-effector above the object to be grasped, grasp the object, and finally lift it off of the surface. The overall performance of the grasping task is compared with the centralized system proposed in [33]. In more detail, the decentralization of the system is done by splitting the motors of the system into two equal parts, and giving the control of one part of the system to one agent and the rest to the other agent. Specifically, one agent controls the first four motors of the manipulator, while the second agent controls the last three motors of the manipulator and the gripper. The information of the system is shared between the agents, each one following a specific policy, with four output control actions that affect the next state. Training with multiple agents demonstrates the modularity of the process. The learned behavior can then be transferred to the real robot, to execute the target task and prove the scalability of the algorithm.

All of the achieved results show an acceleration in the learning process at the beginning of the episodes, even though the multi-agent approach leads to a stagnation of the learning curve. It is also shown that (with the correct setup of the reward function, the correct combination of On and Off policy algorithms, the correct number of agents, and the decentralization of the action space) a fewer amount of episodes are required to achieve the same reward of the baseline algorithm in [33] during the execution of the target manipulation task. This leads to an improvement of the overall training wall clock time, in combination with the efficiency gained for processing. The main scheme of the proposed multi-agent framework for the manipulator is shown in the Figure 1.

**Figure 1.** Main conceptual scheme of the proposed approach using a multi-agent and decentralized reinforcement learning approach.

#### *1.4. Paper Layout*

The paper is organized as follows. In Section 2, the addressed problem is described and principal challenges faced in the paper are explained, along with a schematic representation of the developed solution. Section 3 serves as an introduction to the methodology behind some of the key concepts of the proposed multi-agent framework decentralizing the robots control action learning. It provides details about the multi-agent and decentralized RL, together with a more detailed scheme of the key algorithms that have been employed. Section 4 presents the implementation of the proposed solution, providing details related to the learning environment, the training, and the defined reward function. In Section 5, a deep analysis of the experimental evaluation is provided, comparing the decentralized and multi-agent approach with the centralized algorithm in [33], which has been used as a baseline. The obtained results are discussed in Section 6. Finally, Section 7 states the conclusions, highlighting the limitations of the proposed approach, together with further possible future works to address them.

#### **2. Problem Description**

The primary objective of this paper is two-fold: (1) to provide an approach for decentralizing a single manipulator into multiple parts to show the possibility of using multiple agents in modular robots, and (2) to compare the results of the decentralized framework with the centralized framework proposed in [33]. To do that, the robot needs to be simulated in continuous space, as most of the real-world robotics problems are represented in this way. For such a purpose, the Mujoco multiphysics simulation engine has been employed as it is better detailed in the following sections. Multiple variations of model-free reinforcement learning algorithms are employed in the learning process. The on-policy algorithm, Proximal Policy Optimization (PPO), off-policy algorithm, and Soft Actor-Critic (SAC) are proposed to train the robot controller. Another algorithm that is proposed is the use of a meta-controller that coordinates the multiple agents in the system.

In order to implement, validate, and test the proposed multi-agent framework, a manipulation task (i.e., grasping and lifting) has been considered. The target task is the same as in [33], to have a fair comparison between the centralized and decentralized approaches. In more detail, the task consists of the robot interacting with the object (a 6 cm cube) placed on a table. The goal is to successfully grasp and lift the cube off the table. The task can be split into several phases that need to be completed in a sequence in order to have a successful execution of the task. The phases are:


The complete task can be thought of as accomplishing the sub-tasks that lead to achieving the main objective.

#### **3. Methodology**

#### *3.1. Preliminaries*

In this paper, the RL problem is modeled as a discrete time-continuous Markov decision process (MDP). Markov decision processes are the mathematical foundation for RL [3,34]. A Markov decision process is a tuple (*S*, *P*, *R*, *A*), where *S* is a state set, *P* is the transition probability matrix, *R* is a reward function of the agent based on the current state and the chosen action, and *A* is a finite set of actions available to the agent. The objective of the agent is to learn a policy *π* that maximizes the expected return, whilst achieving the maximum cumulative reward. A policy *π* represents the description of behaviors of an agent. A stationary policy *π* : *S* → *A* is a probability distribution over actions to be taken in each state. A deterministic policy has a probability of one for some action in a state.

Reinfocement learning methods can be divided into three main categories:


In this paper, the actor-critic method is used to learn both the policy and the value function. Based on the way the algorithms generate and use experience data, a further classification can be made into *On-policy* and *Off-policy* algorithms. On-policy methods evaluate the same policy that is used to select actions; whereas, off-policy methods learn the target policy that is different from the behavioral policy used to generate experience. In this paper, two main algorithms are used, the on-policy PPO algorithm and the off-policy SAC algorithm. Further information on these algorithms can be found in [36,37].

#### *3.2. Modular Framework for Decentralized Learning*

The multi-agent RL method (MARL) consists of N policies that operate on the state and actions of corresponding agents, as illustrated in Figure 2. Instead of having a single policy with the full observation and action space, multi-agent reinforcement learning (MARL) suggests an explicit split in the state and action space according to the agents (e.g., robots or end-effectors), which allows efficient low-level policy training as well as flexible skill composition [31]. The agents in the system can have information about the whole system (all the states) or just one part of the system (the states that the agent directly influences).

**Figure 2.** Multi-agent reinforcement learning framework.

In this paper, a multi-agent RL method is proposed, where each of the N policies takes as input the observation of the whole system and outputs an action for the corresponding agent. Without a loss of generality,N=2 policies are considered in this paper.

All policies share the global critic learned from a single task reward, as mentioned in [38]. The multi-agent system can be composed of an additional component, which is the meta-policy, as illustrated in Figure 3. This framework is comprised of a set of N agents *π*1, ... *π*<sup>N</sup> and a meta-policy *πmeta*. In this configuration, the action of two low-level agents (PPO or SAC agents) learn to control the motors, each having an output of four (the control of the first or second part of the robot), and the meta-agent that oversees the process, gives importance to one of the agents. The meta-agent represents a high-level policy that decides which low-level policy to execute among a set of available policies. The low-level agents can first learn the primitive skills separately, which are then combined later on by a meta-agent to execute a complex task. In this paper, the low-level agents have not learned any specific primitive skills, such as positioning, orienting, or grasping. The learning process for both the low-level agents and the meta-agent is executed jointly. This choice is made in order to have comparable results with the single-agent system. In future work, the agent-specific tasks and rewards can first be learned by the primitive agents and later coordinated temporally by the meta-agent.

**Figure 3.** Multi-agent reinforcement learning framework with a meta-agent.

#### **4. Implementation of Proposed Approach**

Reinforcement learning requires an interaction between the agent and the environment. In order for this interaction to be successful and for learning to be possible, the environment and task need to be completely defined. The environment is everything outside the agent. In the case of a robotic manipulator, the environment consists of the robot, the object that needs to be picked up, the table on which the object rests, and the full arena where the robot is placed. The other part that needs to be defined is the task. The task, in this case, is a pick-up task. The rewards that are given to the agent are associated with this task. This section presents a detailed description of the learning environment, the training details, and the network architectures. The final part of this section is used to explain the design of the reward function. This paper is based on the use of a MuJoCo multiphysics simulation engine [39] to model the environment and Pytorch library to train the agent.

#### *4.1. Learning Environment*

The learning environment was designed in MuJoCo physics engine in order to simulate the physical system. MuJoCo is a model-based optimization platform that allows fast and accurate simulations, including contact geometries. In order to speed up the simulation, MuJoCo allows the simulation without rendering. A comparison of simulation performance for multiple physics engines is presented in [40], showing that MuJoCo is the fastest environment for robotics-related tasks. The simulation of the robot being used is taken from [41]. With the combination of different models in the simulation, the simulation setup achieved is shown in Figure 4.

**Figure 4.** Franka EMIKA panda robot employed in the target grasping task in the MuJoCo environment.

#### *4.2. State-Space*

The considered state space is continuous and consists of two main input modalities, the robot information and the target object information:


#### *4.3. Action Space*

The action space consists of an 8-dimensional vector, which represents the 7 joints of the panda robot and the gripper. The last dimension of the action space represents the actuation control of the gripper and is mirrored to have a symmetrical action for the fingers. The action space is split into two equal parts. One part of the manipulator corresponds to the motors of the first 4 joints, while the second part corresponds to the last 3 joints and the gripper's actuation. Dividing the action space in such a way allows for the decentralization of the system and shows the modularity of the robot. Joint velocities coming from the agents are fed into a controller that converts the desired joint velocities into joint torques. The robot is considered to be equipped with torque control compensating for the manipulator dynamics [42]:

$$
\boldsymbol{\pi} = \mathbf{B}(\mathbf{q})\boldsymbol{\pi}\_{lamn} + \boldsymbol{\pi}\_f(\dot{\mathbf{q}}) + \mathbf{C}(\mathbf{q}, \dot{\mathbf{q}}) + \mathbf{g}(\mathbf{q}).\tag{1}
$$

In this equation, **B**(**q**) represents the robot inertia matrix, **C**(**q**, **q˙**) is the robot Coriolis vector, **<sup>g</sup>**(**q**) is the robot gravitational vector, *<sup>τ</sup> <sup>f</sup>*(**q**˙ ) is the robot joint friction vector, **<sup>q</sup>** is the current joint position vector, and *τ* is the control torque vector. *τlearn* is the learned task-related control torque vector mapping the reference joint velocities to joint torques for the target task execution.

$$\mathbf{\pi}\_{learn} = \mathbf{k}\_v(\dot{\mathbf{q}}^d - \dot{\mathbf{q}}) \,\tag{2}$$

where **q**˙ *<sup>d</sup>* and **q**˙ are the reference (i.e., the objective of the learning) and current joint velocities respectively, and:

$$\mathbf{k}\_v = \text{diag}(8, 7, 6, 4, 2, 0.5, 0.1). \tag{3}$$

**k***<sup>v</sup>* is a diagonal matrix of fixed proportional gains for joint trajectory-tracking purposes. These control gains need to be tuned to allow the robot to track the reference joint velocities **q**˙ . The values that are selected are based on the variable impedance control [43], which results in accurate trajectory tracking performance. If needed, such control gains can be either learned or experimentally tuned to improve trajectory tracking performance.

#### *4.4. Training Details*

The algorithms described in Section 3 are used to train the target grasping task. The episodes are divided into 600 time-steps where each of them corresponds to roughly 0.1 s of real-time. At every step of the episode, the actions to be executed in the environment are sampled from Gaussian distributions parameterized by the mean and standard deviation for each action dimension. The agents use a replay buffer that stores the transition experience of the episodes. At every update iteration, random samples of a batch size of 512 are collected from the buffer to update the network parameters. While using a PPO agent, the buffer is cleared and the samples are discarded after performing each update because this is an on-policy algorithm. Both agents have the same parameters, the same state-space is fed into both of the agents, and both agents follow a given policy where the output of each agent is a vector of four values, each corresponding to the given motors. Each agent gives half of the output needed for the system and these outputs are combined and fed into the system. The policy network, as well as the value network, are encoded by MLP—multi-layer perceptron. The policy network consists of three layers and Rectified Linear Unit (ReLU) non-linearity. The input layer, representing the state of the system is 46-dim. The hidden layer size is of 64-dim and the output layer is 8-dim. The output layer of a policy network is two-fold, producing the mean and standard deviation (parametrizing the Gaussian distribution) for each action dimension. The value network outputs only a scalar value, which specifies the corresponding value function of the state. All inputs to the policy and value networks are normalized with running estimates of the mean and variance. Both the policy and value networks are shown in Figure 5.

**Figure 5.** Schematic representation of: (**top**) policy network with 46-dim input (environment states) and 8-dim output (actions); (**bottom**) value network with considered 46-dim input and 1-dim output.

#### *4.5. Reward Shaping*

The reward function helps the agents determine if their actions were desirable or not, thus allowing the agents to decide what they need to do. The dense reward signal guides the agent's exploration process and enables efficient learning; however, shaping a reward function is an exhaustive process. Another approach to overcome the exploration issue is to provide an interactive feedback to the agent where the external teacher may overwrite the actions prescribed by the policy [44]. This approach is termed policy shaping, and allows the agent to learn even in the presence of binary reward signals. In this work, the reward shaping approach has been used to overcome the exploration issue for multi-agent learners in order to compare the results with a centralized single-agent that uses the same reward function. Shaping the multi-agent policies using interactive feedback, either from external human teachers or artificial trainers, is a promising direction for future work.

The reward function is divided into two phases, and a separate reward is given for each phase. The reward for the reaching phase is as follows:

$$
\sigma\_{t, \text{reach}} = w\_{dist} r\_{dist} + w\_{\text{vel}} r\_{\text{vel}} + w\_{\text{grip}} r\_{\text{grip}}.\tag{4}
$$

In the above equation three different rewards are represented:


The combination of *rvel* and *rgrip* incentivizes the agents to approach the object with an open gripper and a small velocity. All three contributions (*rdist*, *rvel*, *rgrip*) are weighted with respective weights *wdist* = 0.6, *wvel* = 0.3, and *wgrip* = 0.1, and defined as:

$$r\_{dist} = 1 - \tanh(\mathbf{p}\_{grip} - \mathbf{p}\_{cubc})\_\prime$$

$$r\_{rel} = \begin{cases} 1 - \tanh(abs(\mathbf{v}\_{cc} / N\_{dim})) & \text{if } \quad r\_{dist} < 8 \text{ cm} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

$$r\_{grip} = \begin{cases} abs(action\_{gripper}) & \text{if } \quad action\_{gripper} < 0 \\ 0 & \text{otherwise} \end{cases}$$

where **p***grip* and **p***cube* denote the position vector of the gripper and the cube in the world reference frame; **v***ee* represents the velocity of the end-effector, which contains the linear and angular components of the velocity with *Ndim* = 6; *actiongripper* specifies the action of the gripper with negative values indicating that the gripper is opening.

The lifting phase of the reward is initiated when the robot's gripper is in close proximity to the object. The distance threshold for switching between the grip and lift rewards is set to 2.5 cm. This is because the nominal cube measures 6 cm, and the robot should be able to grasp the cube when its grip site is within the cube boundaries. The reward for the lifting phase is defined as:

$$r\_{t,lift} = r\_{dist} + r\_{\text{rel}} + w\_{\text{grip}}r\_{\text{grip}} + r\_{\text{grap}} + r\_{\text{success}} \tag{6}$$

and consists of the following five contributions: *rdist* and *rvel* are the same as the previous equation; *rgip* reward is now reversed so it gives a reward when the gripper closes in order to grasp the cube. This reward is weighted by *wgrip* = 0.1. Additional rewards are *rgrasp* and *rsuccess*. *rgrasp* is given if both fingers of the gripper make contact with the walls of the cube and *rsuccess* is an additional reward if the episode is successful, i.e., the task is completed. A successful episode constitutes if the gripper holds the cube at a certain height above the table. *rgrip*; *rgrasp*; *rsuccess* in the lifting case are defined as:

$$r\_{grip} = \begin{cases} abs(action\_{gripper}) & \text{if } \operatorname{action}\_{gripper} > 0\\ 0 & \text{otherwise,} \end{cases}$$

$$r\_{grasp} = \begin{cases} 0.5 & \text{if } \operatorname{fingsrs in contact} \\ 0 & \text{otherwise,} \end{cases} \tag{7}$$

$$r\_{\text{success}} = \begin{cases} 2 & \text{if } \operatorname{z\_{cube}} > 0.1 + \text{z\_{cube}} \\ 0 & \text{otherwise.} \end{cases}$$

Different iterations were performed to set the values of the grasping reward *rgrasp* and the success reward *rsuccess*. In particular, setting higher values led to a very high variance in the reward, causing training inefficiency. The selected values encourage the learning agent to accumulate more reward in the lifting phase by accomplishing grasping and lifting of the cube, while avoiding too high of a variance in training results. The total reward *rt* is then computed as the accumulated sum of *rt*,*reach* in Equation (4) and *rt*,*lif t* in Equation (6).

$$r\_t = r\_{t, \text{reach}} + r\_{t, lift}.\tag{8}$$

#### *4.6. Software Implementation*

The software implementation is available at the following GitHub repository [45].

#### **5. Results**

In this section, the algorithms used for the training are evaluated and compared with the centralized approach. The reward function presented in Section 4.5 has been shaped and optimized for the single-agent control. For the decentralized and multi-agent control, the same reward function is used to provide a proper comparison with the centralized approach. The following multi-agent algorithms are considered:


In the first configuration, the two low-level agents learned to control the first or second part of the robot (each having an output of four), and the meta-agent oversees the process and gives importance to one of the agents. The other three configurations involve two separate agents, where each one controls one part of the action space, i.e., each controls four motors of the robot. The two agents do not communicate with each other, and each one strives to make the optimal decision knowing the observation space, but not knowing the actions of the other agent. The two agents follow either the PPO policy, the SAC policy, or the combination of both.

A final configuration is considered where only one agent is used, but the action space remains divided. Having the action space divided in such a way allows the agent to look at the actions it takes as two separate vectors. The output of the Policy network is thus two separate four-element vectors, each corresponding to the first and last four motors of the robot. In this way, only one agent is used, which looks at the action space in a decentralized way. The algorithms considered for decentralized single-agent control are the following:


A video showing the progress of the training employing the decentalized single-agent SAC controller is available at the GitHub link [45]. In the video, the improvements in the learning of the target task (i.e., grasping and lifting of the part) are shown, highlighting the capabilities of the method.

#### *5.1. Decentralized Multi-Agent Approach*

5.1.1. Meta-PPO Agent with Two Low Level SAC Agents

This type of baseline is a hierarchical framework that composes of a meta-policy that coordinates the two low-level agents, which control the robot joints following the SAC policy. In Figure 6, results are shown for the behavior of this algorithm. At the beginning of the training, a steady and relatively fast increase in the rewards is seen; this can be explained by the fact that everything in the experience is new to the robot, and the functional algorithms make the proper use of this experience for learning. However, after approximately three million time steps, the rewards plummet, and the agents enter a local minimum, which they do not escape, regardless of the number of iterations. The metaagent's interaction with the low-level agents comes to a point where it thinks the maximum successful reward has been achieved. This training has been run for 10 million time steps, and the reward stays in the same range of 50–100, without a successful episode.

**Figure 6.** Progression of mean accumulated reward for the meta-agent with two low-level SAC agents. Training results are reported for three experiments with different random seeds. The standard deviation of three runs is shown in the plot.

#### 5.1.2. Two PPO Agents

The results for the case of two PPO agents are shown in Figure 7. As expected, the PPO agent has a more significant non-monotonic behavior. This experiment shows that a high reward of around 500 is achieved near the two million time steps, whereas the same algorithm, when used in the centralized single-agent setting, reaches the maximum reward of 250 at two million steps [46].

**Figure 7.** Progression of mean accumulated reward for the two PPO agents.Training results are reported for four experiments with different random seeds. The standard deviation of four runs is shown in the plot.

At certain times before the two million steps, very high rewards can be seen. At these points, the robot manages to grasp the cube and the learning agent transitions to the second phase of the reward described in Section 4.5. However, unfortunately, it does not result in the lifting of the cube. This can be explained by the fact that the two agents prioritize the actions coming from their own policies instead of cooperating towards a common goal.

#### 5.1.3. Two SAC Agents

In this configuration, two SAC agents are used in the multi-agent setting. The mean accumulated reward is plotted in Figure 8. A very similar conclusion can be inferred from the two SAC agents multi-agent controller. A clear rise in the total reward obtained can be read from the graph with a much more robust and monotonic increase, as expected to obtain an off-policy algorithm.

**Figure 8.** Progression of mean accumulated reward for the two SAC agents. Training results are reported for four experiments with different random seeds. The standard deviation of four runs is shown in the plot.

#### 5.1.4. One SAC Agent One PPO Agent

In Figure 9, the training results for the combination of an SAC and PPO agent are shown. Using two separate algorithms, which follow two separate policies does not yield promising results. There is a slight uptrend in the reward at the beginning but the whole curve is relatively flat in the region of 50–150 for the duration of the whole run, which lasted for eight million timesteps.

**Figure 9.** Progression of the accumulated reward for one SAC and one PPO agent.

#### *5.2. Decentralized Single-Agent*

5.2.1. Decentralized Single-Agent SAC

When using only one SAC agent with a divided action space, the following results achieved are shown in Figures 10 and 11.

**Figure 10.** Progression of the accumulated reward for the decentralized single-agent SAC.

**Figure 11.** Episode success rate for the decentralized single-agent SAC.

The figure shows a steady uptrend of the reward. At 400K timesteps, the robot learns to grasp the cube, while at 600K, the training results in a successful lifting of the cube. Comparing these results with the single-agent centralized approach in [46], the iterations needed for successful task completion are halved, achieving a reward of 1200 at 600K timesteps, leading to a much faster and efficient learning of the task. The division of the action space in single-agent helps to save the memory, and the training efficiency improves due to the fact that a common learning goal is shared through the same reward function, while the actions are redirected to the corresponding motors in a decentralized manner for the execution.

#### 5.2.2. Decentralized Single-Agent PPO

In this configuration, a single PPO agent is used with a decentralized action space and the results are shown in Figure 12.

**Figure 12.** Progression of the mean accumulated reward for the decentralized single-agent PPO. Training results are reported for three experiments with different random seeds. The standard deviation of three runs is shown in the plot.

The comparison of the results achieved through decentralized single-agent SAC and PPO is shown in Figure 13. It is shown that the decentralized PPO agent required more iterations in order to reach a successful episode in comparison to the SAC agent. The PPO agent also tends to stagnate at certain areas, which can be seen after the two million steps where the agent is performing positively, but later falls to a local minimum.

**Figure 13.** Comparison of accumulated reward for the decentralized single-agent SAC and PPO.

#### *5.3. Comparison of Centralized and Decentralized Approach*

The training comparison between the centralized and decentralized approach for PPO and SAC agents is shown in Figures 14 and 15. The decentralized PPO algorithm accumulates higher average reward than the centralized PPO; however, it also shows a non-monotonic behavior. Whereas, the learning curve of the centralized PPO agent is smooth, as demonstrated in Figure 14.

**Figure 14.** Comparison of accumulated reward for the centralized and decentralized PPO agents.

**Figure 15.** Comparison of accumulated reward for the centralized and decentralized SAC agents.

On the other hand, the decentralized single-agent SAC in Figure 15 demonstrates a much faster and efficient learning process as it accumulates a very high average reward and shows the task success in just 600K timesteps while the centralized SAC agent needs at least 2M steps to reach success. The multi-agent SAC curve instead is relatively flat and does not reach any task success in the first 2M timesteps.

Table 1 summarizes the results of all analyzed approaches and shows the comparison of maximum accumulated reward achieved in an episode considering the first 2M steps of training. The decentralized single-agent SAC is the most efficient approach of all, since the agent accumulates the highest reward, while the combination of multi-agent SAC and PPO achieves the lowest reward. This might be due to the lack of coordination among agents in the multi-agent configuration as there are no explicit terms in the reward to encourage such coordination. Therefore, an interesting direction for the future work is to shape the reward function that encourages coordination among agents in a multi-agent setting.

**Table 1.** Maximum accumulated reward in the first 2M time steps.


#### **6. Discussion**

Similar results and behaviors can be seen for all multi-agent approaches. The learning efficiency is improved at the beginning of the training before the stagnation of the learning curve. Therefore, the multi-agent division doesn't improve the training efficiency for the completed task. However, it is still better from the computational point of view as the division of the action space provides modularity to the overall system, allowing to train the agents in a distributed manner. Moreover, larger expected rewards can be achieved by the off-policy SAC multi-agent compared to the PPO multi-agent, since it takes advantage of past data storage. The PPO agent is slower than the SAC agent in both the centralized and decentralized approach. Most of the higher rewards achieved during the multi-agent runs can be correlated with the gripper touching the cube with a good approaching velocity. Nevertheless, a lack of coordination and a sparse reward setup for the second phase in charge of lifting the cube prevents it from completing the task. In the reaching phase, a very high reward was achieved by the first agent that was, however, overturned in the lifting phase where the actions of the second agent had more significance. This sort of behavior can be attributed to the fact that the reward is shaped to accommodate the centralized single-agent learning. Since the reward is highly sensitive, and small variations can lead to the difference between a successful episode and an unsuccessful run, the reward function can be augmented with different aspects to accommodate the needs of multi-agent learning as was done for the centralized single-agent. In addition, having a separate reward function for each of the agents and augmenting the agents' input with the information about prior sub-tasks could improve the performance of the decentralized multi-agent learning.

The video available at the GitHub link [45] shows the evolution of the learning process for the decentalized single-agent SAC controller, having the robot able to learn the target task (i.e., grasping and lifting of the target part). The policy of the training run with the highest accumulated reward is selected as the control actions to be applied to the robot to execute the learned task. As in [33], the learned control torques or joint positions/velocities can be transferred to the real robot to perform the given application.

#### **7. Conclusions**

In this paper, a decentralized and multi-agent approach has been formulated in the form of an RL problem, demonstrating the possibility of decentralizing a single manipulator controller by applying multiple agents in learning continuous actions. The idea of this paper has been to compare the feasibility of decentralization for a single robot to show the modularity in the joints, as well as the comparison between the centralized and decentralized approach. Results show that it is possible to decentralize the control action on the robot. Using multiple agents has shown the stagnation of the learning process when using the same reward function that is used for a single-agent. It is believed that this is due to the lack of communication between the agents and the generality of the reward when considering two agents. A sparser reward needs to be added to the system and agent specific rewards can be considered.

Future work will be devoted to deeper investigating of the proposed approaches, implementing improved methodologies based on the achieved results. A deeper investigation and analysis of the results (e.g., increasing the number of runs of each method to better highlight local minima issues, etc.) will also be conducted in order to provide further insights. In addition, the reward function will be tuned for each agent separately to improve the performance of the methods. Then, adding a sparse reward function, as well as agent-specific rewards, could result in a successful execution of the task using the decentralized approach. Meta-learning will be deeply investigated in order to further generalize the methodology. The transfer of the learned policies to manipulators (e.g., to the Franka EMIKA panda robot) will be performed to show the applicability of the proposed methods in real tasks.

**Author Contributions:** Conceptualization, L.R., A.A.S., J.S.V.S., D.P. (Damjan Pecioski) and D.P. (Dario Piga); methodology, L.R., A.A.S., J.S.V.S., D.P. (Damjan Pecioski) and D.P. (Dario Piga); software, J.S.V.S., D.P. (Damjan Pecioski) and A.A.S.; validation, J.S.V.S., D.P. (Damjan Pecioski), A.A.S. and L.R.; formal analysis, J.S.V.S., D.P. (Damjan Pecioski), A.A.S. and L.R.; investigation, L.R., A.A.S. and D.P. (Damjan Pecioski); resources, L.R.; data curation, J.S.V.S., D.P. (Damjan Pecioski) and A.A.S.; writing—original draft preparation, L.R., A.A.S., J.S.V.S., D.P. (Damjan Pecioski) and D.P. (Dario Piga); writing—review and editing, A.A.S. and L.R.; visualization, L.R., A.A.S., J.S.V.S., D.P. (Damjan Pecioski) and D.P. (Dario Piga); supervision, L.R., D.P. (Dario Piga) and F.B.; project

administration, L.R.; funding acquisition, L.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the project ASSASSINN, funded from H2020 CleanSky 2, grant agreement n. 886977.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The software to replicate the achieved results can be found at the link in [45].

**Acknowledgments:** The work has been developed within the project ASSASSINN, funded from H2020 CleanSky 2 under grant agreement n. 886977.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Predictive Control for Small Unmanned Ground Vehicles via a Multi-Dimensional Taylor Network**

**Yuzhan Wu 1, Chenlong Li 2,\*, Changshun Yuan 2, Meng Li 2,3 and Hao Li <sup>4</sup>**


**Abstract:** Tracking control of Small Unmanned Ground Vehicles (SUGVs) is easily affected by the nonlinearity and time-varying characteristics. An improved predictive control scheme based on the multi-dimensional Taylor network (MTN) is proposed for tracking control of SUGVs. First, a MTN model is used as a predictive model to construct a SUGV model and back propagation (BP) is taken as its learning algorithm. Second, the predictive control law is designed and the traditional objective function is improved to obtain a predictive objective function with a differential term. The optimal control quantity is given in real time through iterative optimization. Meanwhile, the stability of the closed-loop system is proved by the Lyapunov stability theorem. Finally, a tracking control experiment on the SUGV model is used to verify the effectiveness of the proposed scheme. For comparison, traditional MTN and Radial Basis Function (RBF) predictive control schemes are introduced. Moreover, a noise disturbance is considered. Experimental results show that the proposed scheme is effective, which ensures that the vehicle can quickly and accurately track the desired yaw velocity signal with good real-time, robustness, and convergence performance, and is superior to other comparison schemes.

**Keywords:** multi-dimensional Taylor network; predictive control; nonlinear system; SUGV; predictive model

#### **1. Introduction**

With the rapid development of artificial intelligence, big data, and data fusion technology, car driving tends to be intelligent and unmanned. Recently, research on Unmanned Ground Vehicles (UGVs) has become a hot topic at home and abroad [1–3]. Its core technologies include environmental awareness, vehicle positioning, path planning, and path tracking control [4–7]. Path tracking control, as a key core issue of unmanned driving technology, has attracted much attention from scholars and it ensures that the car drives on the specified path [7–9]. So, research on path tracking control has both theoretical and practical significance.

However, UGV path tracking control is a strongly nonlinear process, which contains time-varying characteristics, uncertainty, etc., and a precise mathematical model is difficult to obtain [10]. As artificial intelligence develops, neural networks can solve the above problem well due to their nonlinear approximation capabilities [11]. For example, a new end-to-end autonomous control method was proposed to simplify the separate modules in the traditional control pipeline into a single neural network [12]. A novel application of the biologically inspired computing paradigm was presented for solving the initial value problem (IVP) of electric circuits based on a nonlinear RL model by exploiting the

**Citation:** Wu, Y.; Li, C.; Yuan, C.; Li, M.; Li, H. Predictive Control for Small Unmanned Ground Vehicles via a Multi-Dimensional Taylor Network. *Appl. Sci.* **2022**, *12*, 682. https://doi.org/10.3390/ app12020682

Academic Editor: Emanuele Carpanzano

Received: 16 December 2021 Accepted: 8 January 2022 Published: 11 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

competency of accurate modeling with a feed forward artificial neural network, the global search efficacy of genetic algorithms, and a rapid local search with sequential quadratic programming [13]. Moreover, predictive control, rooted in industrial engineering, is widely used. Usually, the schemes that combine a neural network model and predictive control are more popular and useful to solve the nonlinear control problem. For example, physicsbased recurrent neural network modeling approaches were proposed for a general class of nonlinear dynamic process systems to improve the prediction accuracy by incorporating a priori process knowledge, and the proposed physics-based RNN models were utilized in model predictive controllers [14]. A neural-network-based technique for developing nonlinear dynamic models from empirical data for a model predictive control (MPC) algorithm was presented [15]. However, for the neural networks, they easily reach the local minimum [16].

To solve the issues with nonlinearity, time-varying characteristics, uncertainty, real time, etc. and overcome the problem of neural networks, we adopt a multi-dimensional Taylor network (MTN) and collective solutions. The collective solutions are the most effective as they take advantage of each method. The MTN is an inclusive platform that was first proposed by Zhou and Yan and can be used to describe the nonlinear dynamic characteristics of a system without prior knowledge of the structural information on the system [17,18]. It has been successfully applied to system modeling [19,20] and nonlinear control [21–24]. In this paper, an improved predictive control scheme based on a MTN is proposed to realize real-time output tracking control of Small Unmanned Ground Vehicles (SUGVs) relative to a given reference path without state feedback. First, the MTN is used as a predictive model to construct the SUGV model and back propagation (BP) is taken as its learning algorithm. Second, a predictive control law is designed to control the SUGV and the traditional objective function is improved to obtain a predictive objective function with a differential term. The optimal control quantity is given in real time through iterative optimization. Meanwhile, the stability of the closed-loop system is proved according to the Lyapunov stability theorem. Based on that, an adjustment method for MTN control law parameters is obtained using the predictive accuracy of the MTN model and the control coefficient. Finally, a tracking control experiment on the SUGV is used to verify the effectiveness of the proposed scheme. For comparison, traditional MTN and RBF predictive control schemes are introduced.

By the above analysis, the main contributions are as follows:


The rest of the paper is arranged as follows. A literature review is presented in Section 2. The materials and methods are given in Section 3. The MTN predictive control scheme with the differential term is summarized in Section 4. Section 5 presents the simulation results and a discussion. Section 6 presents our conclusions.

#### **2. Literature Review**

An effective tracking control scheme is critical to UGV path tracking control. Because UGV path tracking control is a strongly nonlinear process, the control methods for previous linear systems seem to be inadequate [25]. At present, the common control methods include adaptive control, sliding mode control, fuzzy control, neural network control, and predictive control [26–35]. For instance, for formation control of multiple vehicles, the leader–follower error model was built and an adaptive controller was designed to address

the uncertain relative distance using the dynamic estimation of the leader–follower distance [26]. A multivariable model reference adaptive control (MRAC) scheme was studied for the automatic carrier-landing control problem of unmanned aerial vehicles with the system dynamics of nonlinearity, multivariable coupling, and parametric uncertainty [27]. A vehicle lateral controller was designed for autonomous vehicles based on higher-order sliding mode control [28]. An automatic steering control strategy for unmanned vehicles based on robust backstepping sliding mode control theory was proposed [29]. For the accuracy and stability of driverless buses, a path-tracking controller based on fuzzy pure pursuit control with a front axle reference (FPPC-FAR) was proposed [30]. An obstacle avoidance strategy consisting of path planning and robust fuzzy output feedback control was proposed for unmanned ground vehicles [31]. A novel adaptive neural network robust lateral motion control method was presented that can maintain the yaw stability of an autonomous vehicle while minimizing the lateral path tracking error at the limits of driving conditions [32]. A learning-based predictive control approach was presented for an autonomous racing car [33]. A nonlinear model predictive control optimization algorithm based on the collocation method was proposed for unmanned vehicle trajectory tracking [34]. A trajectory tracking error-based model was used to design a linear model predictive controller and its control action was combined with feed forward and robust control actions [35].

However, all of the above-mentioned control technologies have their own inherent bottlenecks both in theory and in application. For example, for adaptive control, there are many stability methods to be chosen from when designing the control rate, but most of them require strong assumptions. For fuzzy control, simple fuzzy processing will lead to a reduction in control precision and poor system dynamic quality. For neural network control, the closed-loop system is stable only at a certain point, and the exponential function of neurons will also lead to a large number of computations and poor real-time performance. Sliding mode variable structure control can easily fall into a chattering state, and it also needs to solve for factors such as velocity, inertia, acceleration, and the switching surface when near the sliding mode surface. For predictive control based on a mathematical model, it is difficult to establish an accurate mathematical model, or even to build a model. Therefore, collective solutions were adopted in this study.

The multi-dimensional Taylor network is an inclusive platform and applies to nonlinear control particularly effectively. For example, an adaptive predictive control approach based on the MTN was proposed for the real-time tracking control of single-input singleoutput nonlinear systems with input time-delay [21]. A control scheme based on the MTN was proposed to achieve the real-time output feedback tracking control of multi-input multi-output non-affine nonlinear time-varying discrete systems relative to the given reference signals with online training [22]. An adaptive control approach based on the MTN was proposed to control nonlinear uncertain time-varying systems with noise disturbances [23]. A predictive control scheme on the basis of the MTN, named MTN predictive compensation control, was proposed for single-input single-output nonlinear systems [24]. However, the above-mentioned studies focus on theoretical research and they do not have an application background. Therefore, an improved predictive control scheme based on the MTN is proposed for the tracking control of SUGVs.

#### **3. Materials and Methods**

#### *3.1. System Description of the SUGV*

A simplified SUGV model is shown in Figure 1. Two front wheels and two rear wheels are combined into a single front wheel and rear wheel, which are simplified into a twodegrees-of-freedom vehicle model. The following assumptions were made: we ignored the effect of steering and suspension; we kept the vehicle's longitudinal speed constant and only considered the lateral movement along the *y* axis and the yaw movement around the *z* axis; we ignored the effects of horizontal and vertical aerodynamics; and we considered the side characteristics of the tire in the analysis of the tire force [36–38].

**Figure 1.** The SUGV model.

The *XOY* coordinate system in Figure 1 is a fixed geodetic coordinate system, and *xoy* is the structure of the coordinate system, which changes with the movement of the body. The force analysis of the above model along the *y* axis and around the *z* axis is obtained.

$$\begin{cases} Ma\_{\mathcal{Y}} = F\_{\mathcal{C}f} \cos \delta\_f + F\_{\mathcal{C}r} \\\ I\_z \dot{w} = I\_f F\_{\mathcal{C}f} \cos \delta\_f - I\_r F\_{\mathcal{C}r} \end{cases} \tag{1}$$

where *M* is the vehicle's weight; *If* is the distance between the center of the front axle and the center of mass of the vehicle; *Ir* is the distance between the center of the rear axle and the center of mass of the vehicle; *Iz* is the moment of inertia of the vehicle around the *z* axis; *δ<sup>f</sup>* is the front wheel angle of the vehicle; *w* is the yaw velocity of the vehicle; and *ay* is the accelerated velocity of the vehicle along the *y* axis based on the *xoy* coordinate system, which mainly consists of the movement along the *y* axis and the centripetal acceleration *Vxw*. We can obtain:

$$a\_y = \ddot{y} + V\_x w \tag{2}$$

where *Fc f* and *Fcr* are the lateral force of the tire applied to the front and rear wheels of the vehicle, respectively. From the defined dynamic model, we can see that the side characteristic of the tire is considered in the analysis of the tire force. When the front wheel angle *δ<sup>f</sup>* and the sideslip angle *β* are small, the sideslip characteristic of the tire is in a linear range. So, we can obtain

$$F\_{cf} = 2C\_f \alpha\_f \tag{3}$$

$$F\_{rf} = \mathcal{Z}C\_r\alpha\_r\tag{4}$$

where *Cf* and *Cr* are the cornering stiffnesses of the front and rear tires, respectively. Since there are two front and rear wheels, the force is twice that of a single tire. *α<sup>f</sup>* and *α<sup>r</sup>* are the tire sideslip angle and the angle between the direction of the tire velocity vector and the x axis, respectively. At a small angle, the size of the two angles can be approximately expressed as:

$$
\tan \alpha\_f \approx \theta < V\_{f'} \text{ x } > = \frac{V\_Y + l\_f \dot{\varphi}}{V\_{\chi}} \tag{5}
$$

$$
\tan \alpha\_{\ell} \approx \theta < V\_{r\_{\ell}} \ge \implies \frac{V\_{\mathcal{Y}} - l\_{r} \dot{\Phi}}{V\_{\mathcal{X}}} \tag{6}
$$

According to the coordinate system relationship, the side slip angle is *β* = tan *β* = *Vy Vx* . From (1)–(6), the differential equation of the vehicle dynamics model is obtained.

$$\begin{cases} \dot{w} = \frac{l\_f^2 \mathbb{C}\_f + l\_r^2 \mathbb{C}\_r}{I\_z \mathbb{P}\_x} w - \frac{l\_f \mathbb{C}\_f + l\_r \mathbb{C}\_r}{I\_z} \beta + \frac{l\_f \mathbb{C}\_f}{I\_z} \delta\_f\\ \dot{\beta} = (-\frac{l\_f \mathbb{C}\_f + l\_r \mathbb{C}\_r}{m v\_x^2} - 1)w - \frac{\mathbb{C}\_f + \mathbb{C}\_r}{m v\_x} \beta + \frac{\mathbb{C}\_f}{m v\_x} \delta\_f \end{cases} \tag{7}$$

The control input and output variables of the controlled vehicle are the front wheel steering angle *δ<sup>f</sup>* and the yaw velocity *w*, respectively. According to the vehicle dynamics model, a relation between the output and the input of the controlled system is deduced

$$
\ddot{w}(t) = -a\_1 \dot{w}(t) - a\_2 w(t) + b\_1 \dot{\delta}\_f(t) + b\_2 \delta\_f(t) \tag{8}
$$

where *<sup>a</sup>*<sup>1</sup> <sup>=</sup> *<sup>m</sup>*(*<sup>l</sup>* 2 *<sup>f</sup> Cf* <sup>+</sup>*<sup>l</sup>* 2 *<sup>r</sup> Cr*)+*Iz* (*Cf* <sup>+</sup>*Cr*) *mvx Iz* , *<sup>a</sup>*<sup>2</sup> <sup>=</sup> <sup>−</sup>*mv*<sup>2</sup> *<sup>x</sup>*(*lf Cf* <sup>−</sup>*lrCr*)+*Cf Cr mv*<sup>2</sup> *<sup>x</sup> Iz* , *<sup>b</sup>*<sup>1</sup> <sup>=</sup> *lf Cf Iz* , *<sup>b</sup>*<sup>2</sup> <sup>=</sup> *Cf CrL mvx Iz* .

So, the vehicle control system can be described as the following second-order system:

$$\begin{cases}
\dot{\mathbf{x}}\_1 = \mathbf{x}\_2 \\
\dot{\mathbf{x}}\_2 = F(\mathbf{x}\_1, \mathbf{x}\_2, \dot{\boldsymbol{\delta}}\_{f'}, \gamma(t)) + b\_0 \boldsymbol{\delta}\_f \\
y = \mathbf{x}\_1
\end{cases} \tag{9}$$

where *<sup>x</sup>*<sup>1</sup> <sup>=</sup> *<sup>w</sup>*(*t*), *<sup>x</sup>*<sup>2</sup> <sup>=</sup> . *<sup>w</sup>*(*t*), *<sup>b</sup>*<sup>0</sup> = *<sup>b</sup>*2, *<sup>F</sup>*(*x*1, *<sup>x</sup>*2, . *<sup>δ</sup> <sup>f</sup>* , *<sup>γ</sup>*(*t*)) = <sup>−</sup>*a*<sup>1</sup> . *<sup>w</sup>*(*t*) <sup>−</sup> *<sup>a</sup>*2*w*(*t*) + *<sup>b</sup>*<sup>1</sup> . *<sup>δ</sup> <sup>f</sup>*(*t*) + *<sup>γ</sup>*(*t*), and *<sup>F</sup>*(·) is defined as a generalized disturbance that includes the modeled and unmodeled parts of the vehicle and an unknown external disturbance.

#### *3.2. Multi-Dimensional Taylor Network Model*

If the function is *m*ˆ + 1-order differentiable everywhere in the neighborhood at some point, the power series of the eigenfunction expansion at this point is not greater than *m*ˆ power in terms of the principle of the multivariate Taylor formula. Based on the MTN model, as shown in Figure 2, the general dynamic equation of an ˆ*t*-dimensional system can be expressed as [17]:

$$\mathfrak{z}(k+1) = \sum\_{\rho=-1}^{\hat{N}(\boldsymbol{\ell},\boldsymbol{\vartheta})} \mathfrak{w}\_{\rho}(k) \prod\_{\emptyset=1}^{\hat{\mathbb{P}}} \mathfrak{z}\_{\emptyset}^{\hat{\mathbb{X}}(\boldsymbol{\vartheta},\boldsymbol{\vartheta})}(k) \tag{10}$$

**Figure 2.** Diagram of the MTN.

In (10), *<sup>N</sup>*<sup>ˆ</sup> (ˆ*t*, *<sup>m</sup>*ˆ) represents the total number of product items of the <sup>ˆ</sup>*t*-ary functionˆ[·], which is expanded into the approximate polynomial with *<sup>m</sup>*<sup>ˆ</sup> powers; *<sup>w</sup>*<sup>ˆ</sup> *<sup>p</sup>*ˆ(*k*) represents the

weight coefficient of the *p*ˆ-th product item in the formula; *λ*ˆ(*p*ˆ, *q*ˆ) represents the power of the variable *<sup>z</sup>*ˆ*q*ˆ(*k*) in the *<sup>p</sup>*ˆ-th product item; and ˆ*t* ∑ *q*ˆ=1 *<sup>λ</sup>*ˆ(*p*ˆ, *<sup>q</sup>*ˆ) <sup>≤</sup> *<sup>m</sup>*<sup>ˆ</sup> , where *<sup>p</sup>*<sup>ˆ</sup> <sup>=</sup> 1, 2, ··· , *<sup>N</sup>*<sup>ˆ</sup> (ˆ*t*, *<sup>m</sup>*ˆ).

As can be seen from Figure 2, the MTN is composed of three main layers, including the input layer, the middle layer, and the output layer. Its structure is simple, it contains only addition and multiplication operations, it is equivalent to only one neuron of the neural network, and it has low algorithm complexity, namely real-time performance. Otherwise, the model parameters are obtained through continuous optimization, which has a "selflearning" ability and good robustness.

The MTN model has a powerful approximation capability for nonlinear systems and can approximate any model with sufficient precision as long as *N*ˆ (ˆ*t*, *m*ˆ) is large enough [21,39].

#### **4. MTN Predictive Control Scheme with the Differential Term**

*4.1. Scheme of the MTN Predictive Model*

#### 4.1.1. MTN Predictive Model

The MTN model is essentially a one-step-ahead predictive model. Therefore, the MTN model can be regarded as a predictive model. We adopt the MTN model as shown in (10).

#### 4.1.2. Learning Algorithm of the MTN Predictive Model

The BP algorithm is used to adjust the weight coefficients of the MTN model [40,41]. Define the cost function as

$$J\_{\mathbb{P}}(k) = \frac{1}{2} \mathfrak{e}\_{\mathbb{P}}^2(k) \tag{11}$$

where *<sup>e</sup>*P(*k*) represents the predictive model error and *<sup>e</sup>*P(*k*) = *<sup>y</sup>*(*k*) <sup>−</sup> *<sup>y</sup>*ˆ(*k*); *<sup>y</sup>*(*k*) is the system output; and *y*ˆ(*k*) is the predictive model output.

Let

$$\mathfrak{a}\mathfrak{v}\_{\mathbf{P}}(k) = \left(\mathfrak{v}\_{\mathbf{P}1}(k), \mathfrak{v}\_{\mathbf{P}2}(k), \cdot, \cdot, \mathfrak{v}\_{\mathbf{P}\hat{N}(\vec{\mathbb{F}}, \mathfrak{v}\mathfrak{h})}(k)\right)^{\mathbf{T}}$$

The BP algorithm can be obtained as follows

$$\mathfrak{w}\_{\mathcal{P}}(k+1) = \mathfrak{w}\_{\mathcal{P}}(k) + \mu \frac{\partial f\_{\mathcal{P}}}{\partial \mathfrak{w}\_{\mathcal{P}}(k)} \tag{12}$$

where *μ* is the learning factor and *μ* > 0.

*4.2. Design of the Predictive Control Law with the Differential Term*

Consider the following quadratic performance index:

$$J\_{\mathbb{C}}(k) = \min \left\{ \frac{1}{2} (r(k) - \mathfrak{J}(k))^2 + \frac{1}{2} \varepsilon \Delta u^2(k) \right\} \tag{13}$$

where *r*(*k*) is the reference input, *y*ˆ(*k*) is the output of the MTN predictive model, *ε* is the control effort weighting factor, <sup>Δ</sup>*u*(*k*) = *<sup>u</sup>*(*k*) <sup>−</sup> *<sup>u</sup>*(*<sup>k</sup>* <sup>−</sup> <sup>1</sup>), and *<sup>u</sup>*(*k*) is the control input.

Denote *<sup>∂</sup>J*C(*k*) *<sup>∂</sup>u*(*k*) <sup>=</sup> 0. We can now obtain the MTN predictive control law:

$$
\mu(k) = \mu(k-1) + \frac{1}{\varepsilon}(r(k) - \mathcal{Y}(k))\frac{\partial \mathcal{Y}(k)}{\partial u(k)}\tag{14}
$$

where *u*(*k*), *r*(*k*), *y*ˆ(*k*), and *ε* have the same meaning as in Equation (13).

The differential term is introduced under the traditional quadratic performance index. A differential term can help the system to move in advance, reduce the overshoot, and shorten the adjustment time. Meanwhile, it can reduce the influence of time variation and

disturbance factors. The following quadratic performance index with the differential term was adopted:

$$J\_{\mathbb{C}}(k) = \min\left\{\frac{1}{2}\varepsilon\varepsilon^2(k) + \frac{1}{2}\varepsilon\Delta\mu^2(k) + \frac{1}{2}\gamma[\varepsilon\_{\mathbb{C}}(k) - \varepsilon\_{\mathbb{C}}(k-1)]^2\right\}\tag{15}$$

where *<sup>γ</sup>* is the differential weighting factor and *<sup>γ</sup>* <sup>&</sup>gt; 0, *<sup>e</sup>*C(*k*) = *<sup>r</sup>*(*k*) <sup>−</sup> *<sup>y</sup>*ˆ(*k*), *<sup>r</sup>*(*k*), *<sup>y</sup>*ˆ(*k*), and *ε* have the same meaning as in Equation (13).

According to (15),

$$\begin{array}{ll} \frac{\partial \mathcal{J}\_{\mathbb{C}}(k)}{\partial u(k)} &= \mathfrak{e}\_{\mathbb{C}}(k) \frac{\partial \mathfrak{e}\_{\mathbb{C}}(k)}{\partial u(k)} + \varepsilon \Delta u + \gamma[\mathfrak{e}\_{\mathbb{C}}(k) - \mathfrak{e}\_{\mathbb{C}}(k-1)] \frac{\partial \mathfrak{e}\_{\mathbb{C}}(k)}{\partial u(k)} \\ &= -\mathfrak{e}\_{\mathbb{C}}(k) \frac{\partial \mathcal{J}(k)}{\partial u(k)} + \varepsilon \Delta u - \gamma[\mathfrak{e}\_{\mathbb{C}}(k) - \mathfrak{e}\_{\mathbb{C}}(k-1)] \frac{\partial \mathcal{J}(k)}{\partial u(k)} \end{array} \tag{16}$$

where *ε*, *γ*, *r*(*k*), and *y*ˆ(*k*) have the same meaning as in (15).

Denote

$$\frac{\partial \|\!\_{\mathbb{C}}(k)}{\partial u(k)} = 0$$

We can now obtain the MTN predictive control law with the differential term:

$$u(k) = u(k-1) + \frac{1}{\varepsilon} \varepsilon\_{\mathbb{C}}(k) \frac{\partial \mathcal{Y}(k)}{\partial u(k)} + \frac{\gamma}{\varepsilon} [\varepsilon\_{\mathbb{C}}(k) - \varepsilon\_{\mathbb{C}}(k-1)] \frac{\partial \mathcal{Y}(k)}{\partial u(k)}\tag{17}$$

where *ε*, *γ*, *r*(*k*), and *y*ˆ(*k*) have the same meaning as in (15).

*4.3. Stability Analysis*

**Theorem 1.** *Consider the quadratic performance index shown in Equation (13). Denote h*max = max*k*(*h*(*k*)) *and h*(*k*) = *∂y*ˆ(*k*) *∂u*(*k*) . *When <sup>ε</sup> satisfies* <sup>0</sup> <sup>&</sup>lt; <sup>1</sup> *<sup>ε</sup>* <sup>&</sup>lt; <sup>2</sup> *h*2 max , *the closed-loop system is stable.*

**Proof.** Define *V*(*k*) = <sup>1</sup> <sup>2</sup> *<sup>e</sup>*C2(*k*). So:

$$\begin{array}{l} \Delta V(k) &= V(k+1) - V(k) \\ &= \frac{1}{2} [\mathfrak{e}\_{\mathbb{C}}^{2}(k+1) - \mathfrak{e}\_{\mathbb{C}}^{2}(k)] \\ &= \frac{1}{2} [\mathfrak{e}\_{\mathbb{C}}(k+1) - \mathfrak{e}\_{\mathbb{C}}(k)] [\mathfrak{e}\_{\mathbb{C}}(k+1) + \mathfrak{e}\_{\mathbb{C}}(k)] \\ &= \frac{1}{2} \Delta \mathfrak{e}\_{\mathbb{C}}(k) [\mathfrak{e}\_{\mathbb{C}}(k+1) + \mathfrak{e}\_{\mathbb{C}}(k)] \end{array} \tag{18}$$

<sup>Δ</sup>*e*C(*k*) can be denoted by [42,43]

$$
\Delta x\_{\mathbb{C}}(k) = -\frac{\partial \hat{\mathcal{Y}}(k)}{\partial \mu(k)} \Delta \mu(k) \tag{19}
$$

Substitution of (19) into (18) yields

$$\begin{split} \Delta V(k) &= \frac{1}{2} \Big[ -\frac{\partial \mathcal{Y}(k)}{\partial u(k)} \Delta u(k) \Big] \left[ \mathcal{e}\_{\mathbb{C}}(k+1) + \mathcal{e}\_{\mathbb{C}}(k) \right] \\ &= -\frac{1}{2} \frac{\partial \mathcal{Y}(k)}{\partial u(k)} \Delta u(k) \Big[ 2\mathcal{e}\_{\mathbb{C}}(k) - \frac{\partial \mathcal{Y}(k)}{\partial u(k)} \Delta u(k) \Big] \end{split} \tag{20}$$

According to Equation (14), we can obtain

$$\begin{split} \Delta u(k) &= \frac{1}{\varepsilon} (r(k) - \mathcal{Y}(k)) \frac{\partial \mathcal{Y}(k)}{\partial u(k)} \\ &= \frac{1}{\varepsilon} e\_{\mathbb{C}}(k) \frac{\partial \mathcal{Y}(k)}{\partial u(k)} \end{split} \tag{21}$$

Substituting (21) into (20), we can obtain

$$\begin{split} \Delta V(k) &= -\frac{1}{2} \frac{1}{\varepsilon} [\varepsilon\_{\mathbb{C}}(k)]^2 \left[\frac{\partial \psi(k)}{\partial u(k)}\right]^2 \left\{ 2 - \frac{1}{\varepsilon} \left[\frac{\partial \psi(k)}{\partial u(k)}\right]^2 \right\} \\ &\leq -\frac{1}{2} \frac{1}{\varepsilon} [\varepsilon\_{\mathbb{C}}(k)]^2 \left[\frac{\partial \psi(k)}{\partial u(k)}\right]^2 \left(2 - \frac{1}{\varepsilon} \left|\frac{\partial \psi(k)}{\partial u(k)}\right|\_{\text{max}}^2\right) \\ &= -\frac{1}{2} \frac{1}{\varepsilon} [\varepsilon\_{\mathbb{C}}(k)]^2 \left[\frac{\partial \psi(k)}{\partial u(k)}\right]^2 \left(2 - \frac{1}{\varepsilon} h\_{\text{max}}^2\right) \end{split} \tag{22}$$

It can be found that <sup>Δ</sup>*V*(*k*) <sup>&</sup>lt; 0 if 2 <sup>−</sup> <sup>1</sup> *ε h*2 max > 0 holds. That is, 0 < <sup>1</sup> *<sup>ε</sup>* <sup>&</sup>lt; <sup>2</sup> *h*2 max . By the Lyapunov stability theorem, the closed-loop system is stable.

What needs to be noted is that for a quadratic performance index with a differential term, the differential term can help the system to act in advance, reduce the overshoot, and shorten the adjustment time. That is, it can reach the stable region as soon as possible.

It should be noted that an arbitrary number of learning weighting factors was selected according to the stability condition. In an actual situation, the appropriate parameters should be selected according to the actual situation, and the stability condition should be satisfied. -

#### *4.4. Scheme of Predictive Control with the Differential Term Based on the MTN Model*

An improved predictive control scheme on the basis of the MTN is proposed for tracking control of SUGVs. First, a predictive model is constructed as follows: (1) the MTN is used as a predictive model to construct a SUGV model; and (2) the BP algorithm is used as its learning algorithm. Second, a predictive control law is designed to control the SUGV. The frame diagram of the improved MTN predictive control system is shown in Figure 3. The front wheel steering angle *δ<sup>f</sup>* is used as the control input, which is equivalent to the *u*(*k*) mentioned above. The yaw velocity *w* is used as the system output, which is equivalent to the *y*(*k*) mentioned above. The predictive model output *w*ˆ is equivalent to the *<sup>y</sup>*ˆ(*k*) mentioned above. The desired yaw velocity *wdes* is used as the system input, which is equivalent to the *r*(*k*) mentioned above.

The specific steps are as shown in Algorithm 1.

**Algorithm 1** Scheme of MTN Predictive Control with the Differential Term

(1) Construct the quadratic performance index, shown as (13);

(2) Update the *u*(*k*) with Equation (17);

Step 1: Construct MTN predictive model;

<sup>(1)</sup> Initialize system input and output;

<sup>(2)</sup> Ascertain structure of MTN predictive model;

<sup>(3)</sup> Update *<sup>w</sup>*P(*k*) with Equation (12);

Step 2: Construct MTN predictive control law;

<sup>(2)</sup> Update *u*(*k*) with Equation (14);

Step 3: Construct an improved predictive control law to control SUGV based on MTN predictive control law;

<sup>(1)</sup> Construct the quadratic performance index with differential term, shown as (15);

Step 4. Implement the tracking control for the SUGV.

**Figure 3.** Block diagram of the improved MTN predictive control system.

#### **5. Simulation Results and Discussion**

#### *5.1. Simulation Results*

SUGVs have a wide range of applications, such as delivery, sanitation, and mining. A tracking control experiment on a SUGV is described here [44,45]. Moreover, the traditional MTN predictive control scheme [24] and the recurrent radial basis (RBF) model predictive control scheme [46] are used for comparison. According to the vehicle dynamics model, as shown in (7), the tracking effect is demonstrated by accurately tracking the desired yaw velocity signal. The control input and output variables are the front wheel steering angle *δ<sup>f</sup>* and the yaw velocity *w*, respectively. The mathematical relationship between the desired path curve *y* = *f*(*x*) and the desired yaw velocity signal is as follows [44,45]:

$$w\_{dcs} = v\_{dcs} k\_{dcs} \tag{23}$$

$$k\_{dcs} = \frac{d^2y/dx^2}{\left(1 + (dy/dx)^2\right)^{3/2}}\tag{24}$$

where *wdes* is the desired yaw velocity, *vdes* is the desired speed, and *kdes* represents the curvature of the desired road.

Specific parameters of the SUGV were set as follows: *<sup>M</sup>* = 0.1 kg, *lf* = 0.1 m, *lr* = 0.1 m, *Iz* = 0.4 kg/m2, *vx* = 1 m/s, *<sup>L</sup>* = 0.6 m, *Cf* = 0.6 N/rad, and *Cr* = 0.6 N/rad.

Two cases were considered: without noise and with noise. The same parameters were adopted for each of them to verify the performance of the proposed scheme.

**Remark 1.** *In the following, D-MTNC is our proposed scheme; MTNC is the traditional MTN predictive control scheme; and RBFC is the RBF model predictive control scheme.*

**Remark 2.** *In the following figures, wdes represents the given reference signal; yMTNC is the traditional MTN predictive control scheme; yD-MTNC is the proposed predictive control scheme; yRBFC is the RBF model predictive control scheme; and eMTNC, eD-MTNC, and eRBFC represent the corresponding tracking errors.*

We consider the control effect of the yaw velocity tracking step signal. The given reference signal is *wdes* = 1. The parameters of D-MTNC, MTNC, and RBFC were set as follows.

The MTN predictive model adopts a 4-15-1 structure with four inputs and two powers. The initial value of the weight coefficient vector for the BP algorithm is *<sup>w</sup>*<sup>ˆ</sup> <sup>P</sup>(0) = 0 and the learning factor is *μ* = 0.5.

For the D-MTNC control law parameters, the control effort weighting factor was set to 1/*ε* = 0.1 and the differential term weighting factor was set to *γ* = 0.005.

For the MTNC control law parameters, the control effort weighting factor was set to 1/*ε* = 0.1.

The RBF predictive model adopts a 4-18-1 structure. The initial value of the weight coefficient vector for the BP algorithm is ˆ*w*P(0) = 0 and the learning factor is *<sup>μ</sup>* = 0.9.

For the RBFC control law parameters, the control effort weighting factor was set to 1/*ε* = 0.9.

#### 5.1.1. Without Noise

Figure 4 shows the tracking trajectory and Figure 5 shows the tracking error. Figures 4 and 5 show that the SUGV reaches the desired speed quickly and accurately. Our proposed scheme tracks the given reference signal at *k* = 5, whereas the traditional MTN predictive control scheme and the RBF predictive control scheme track the reference signal at *k* = 70 and *k* = 250, respectively. These results verify that the improved predictive control scheme has a better control effect and faster convergence than the other schemes.

**Figure 4.** The tracking trajectory without noise.

**Figure 5.** The tracking error without noise.

#### 5.1.2. With Noise

To verify the robustness of the proposed control scheme, Gaussian white noise and 100 times the noise were added to the system. Regarding the Gaussian white noise, it exhibits a mean of 0 and a standard deviation of 0.2 at the 300th time point.

Figure 6 shows the tracking trajectory and Figure 7 shows the tracking error. From Figures 6 and 7, when 100 times the Gaussian white noise is added at the 300th time point to the system, there is a sharp change at the 300th time point, but the proposed control scheme has a quick response, which retracks the reference more quickly than the other schemes. These results verify that the improved predictive control scheme has a good control effect and good robustness.

**Figure 6.** The tracking trajectory with noise.

**Figure 7.** The tracking error with noise.

#### *5.2. Discussion*

For tracking control of SUGVs, the above results show that the proposed method has good performance and tracks the reference signal quickly and accurately. To verify the robustness of the proposed scheme, sharp noise was added. The results of the experiment in terms of tracking, robustness, and real-time performance all show that the proposed scheme has better performance than MTNC and RBFC.

Regarding the choice of model structure, what needs illustration is that the MTN predictive model adopted a 4-15-1 structure and the RBF predictive model adopted a 4-18-1 structure in this paper. The MTN structure depends on the middle layer's nodes when the input and output layer are confirmed. The middle layer's nodes rely on the input layer's nodes and the power item. The MTN predictive model's structure is shown in Figure 1. From Figure 1, when the input layer had four nodes and the number of power items was 2, the number of product items was 15. The RBF hidden layer's nodes depend on the history of multiple experiments, and we chose 18 nodes for the hidden layer here.

The proposed scheme showed good performance in tracking control of the SUGV. However, SUGVs have a wide range of applications, such as delivery, sanitation, and mining. When facing these practical scenarios, the proposed method needs to be experimentally validated and the appropriate parameters need to be selected according to the actual situation.

#### **6. Conclusions**

An improved predictive control scheme based on the MTN was proposed for tracking control of SUGVs. The traditional objective function was improved to obtain a predictive objective function with the differential term. The optimal control quantity was given in real time through iterative optimization. A tracking control experiment on a SUGV was carried out to verify the effectiveness of the proposed scheme. The results show that the proposed scheme is effective and has good real-time, robustness, and convergence performance, which ensure that the vehicle can quickly and accurately track the desired yaw velocity signal, and is superior to the traditional MTN and RBF predictive control schemes.

The proposed improved MTN predictive control scheme developed in the present paper shows great promise but requires further study. For example, the proposed scheme could be applied to different complex driving scenarios, and some complicated conditions could be considered, such as an actuator failure or input constraints.

**Author Contributions:** Writing—original draft preparation, Y.W.; methodology, C.L.; writing review and editing, C.Y.; validation, M.L.; formal analysis, H.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors thank the Editor-in-Chief, the Associate Editor, and the anonymous reviewers for their valuable comments and suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Design of a NARX-ANN-Based SP Controller for Control of an Irrigation Main Canal Pool**

**Ybrain Hernandez-Lopez 1, Raul Rivas-Perez 1,\* and Vicente Feliu-Batlle 2,\***


**Abstract:** The management of irrigation main canals are studied in this research. One way of improving this is designing an efficient automatic control system of the water that flows through the canal pools, which is usually carried out by PI controllers. However, since canal pools are systems with large time delays and nonlinear hydrodynamics, these PIs are tuned in a very conservative way so that the closed-loop instability that may appear depending on the chosen operation regime is avoided. These controllers are inefficient because they have slow time responses. In order to obtain faster responses that remain stable independently of the operation regime, a control system that combines a Smith predictor, which is appropriate to control linear systems with large time delays, with a NARX artificial neural network (ANN), that models the nonlinear dynamics of the pools, is proposed. By applying system identification procedures, two nonlinear NARX-ANN-based models and a linear mathematical model of a real canal pool were obtained. These models were applied to implement a modified NARX-ANN-based SP controller and a conventional linear SP controller. Experimental results on our real canal pool showed that our modified NARX-ANNbased SP controller overcomes conventional linear SP controllers in both setpoint tracking and load disturbance rejection.

**Keywords:** NARX-ANN-based models; modified SP controller design; irrigation main canal pool automation; system identification; management of water resources

#### **1. Introduction**

Water is a vital and essential resource for human life, environment, as well as for sustainable development [1]. However, millions of people currently live in conditions of extreme water scarcity [2]. Water scarcity is the consequence of two particularities: physical scarcity, and economic scarcity [3,4]. Physical scarcity is determined as the insufficiency of natural resources to supply a region's demand for water, while economic scarcity is demarcated as the consequence of poor investment in science, technologies, and infrastructures [5,6].

In irrigation systems, most of the water resources are transported by the networks of irrigation canals [7–9]. In most countries, irrigation canals are manually operated [10–12].

Currently, one of the reliable solutions to face the problems related to water scarcity in irrigated agriculture is the automatic control, which is considered a powerful tool to improve the operation and efficiency of water distribution in the irrigation main canals [13,14].

The application of conventional PID (proportional integral derivative) controllers is one of the most widespread solutions to control the water distribution in irrigation main canal pools, see e.g., [15–17]. However, this control strategy usually does not yield satisfactory results because irrigation main canal pools are complex systems that are characterized by nonlinear dynamic behaviors with dominant time delays [18–20], and PID

**Citation:** Hernandez-Lopez, Y.; Rivas-Perez, R.; Feliu-Batlle, V. Design of a NARX-ANN-Based SP Controller for Control of an Irrigation Main Canal Pool. *Appl. Sci.* **2022**, *12*, 9180. https://doi.org/10.3390/ app12189180

Academic Editor: Emanuele Carpanzano

Received: 28 July 2022 Accepted: 8 September 2022 Published: 13 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

controllers are not adequate for this class of processes [21–24]. Therefore, the design and implementation of controllers aimed at increasing operability and safety in irrigation main canal pools, represents an important and challenging research field [25–27].

One of the most widely used controllers for processes with dominant time delays (processes in which the time delay is much greater than the largest of the time constants) is the Smith predictor (SP) [28–30]. It is characterized by having a linear mathematical model of the process to be controlled in its structure, and it shows a low effectiveness in rejecting load disturbances [31,32].

The SP has been the subject of intensive research and modifications aimed at improving its performance, robustness, and disturbance rejection features, as well as enabling its application in diverse processes (unstable, time-varying, nonlinear, etc.), see e.g., [33–35].

The SP has been applied to water distribution control in different irrigation main canal pools. Among the first proposals are those made by M.J. Shand [36], R. Rivas-Perez [9], J.L. Del-tour [37] and F. Sanfilippo [38]. They are characterized by combining the conventional structure of the SP with PI or PID controllers, which do not guarantee an effective water distribution control. The adaptive SP was proposed also to improve the effectiveness of water distribution control in irrigation main canal pools with large variations in their operating conditions [18,26]. Furthermore, in recent years, the SP has been combined with fractional order controllers to increase the robustness of the control of irrigation canal pools, see e.g., [19,34].

One of the essential problems that present these SP-based control systems is their poor performance due to the use (in the SP structure) of linear mathematical models of the water distribution dynamic behavior in irrigation main canal pools [6]. They also have a low rejection of load disturbances, which originates insufficient accuracy in water distribution control.

Irrigation main canals are nonlinear complex processes [8,9]. The physical dynamics of the water distribution in irrigation main canal pools has traditionally been modeled by the Saint-Venant equations, nonlinear hyperbolic partial differential equations with complex constraints, whose analytical solution is very complicated in the case of an arbitrary geometry [39,40]. These equations are derived from mass and momentum balances and are very difficult to use for the design of a controller [8]. Consequently, it is used in the application of different linearization methods [20] or the obtention of linear mathematical models by using systems identification tools [23,40]. The linear mathematical models yielded by these procedures do not characterize accurately the nonlinear dynamic behavior of the water distribution in irrigation main canal pools and, accordingly, their application to the controller's design does not yield satisfactory results in many cases [6,18].

In general, nonlinear processes can only be adequately characterized by nonlinear models [41]. Thus, in spite of the abundant research in this field, the development of nonlinear control-oriented models of irrigation main canal pools remains an open problem.

Recently, artificial neural networks (ANNs) are being used to describe effectively and accurately the dynamic behavior of processes with nonlinear dynamic behaviors. They are computational models that process information inspired by the behavior observed in biological neurons [42–44].

ANNs receive increasing acceptance and are applied in diverse fields because they use non-physics-based algorithms, handle appropriately nonlinear relationships, and incorporate user experience into their structure [45–47]. Therefore, they constitute a powerful tool for obtaining adequate dynamic models of nonlinear and time-varying processes that can be applied to the design of controllers based on nonlinear process models, including SP and model predictive controllers (MPC), see e.g., [48–52].

An important and useful class of discrete-time nonlinear models is the Nonlinear Autoregressive with eXogenous Inputs (NARX) model, which was introduced by S.A. Billings in 1980s as a new representation of a wide class of discrete nonlinear systems [41]. Different ANN architectures have been considered as subsets of the NARX model [43].

The NARX-ANN and TD-NARX-ANN (Time Delay-NARX-ANN) are recurrent dynamic neural network architectures used to model different nonlinear dynamic systems that are characterized by a good learning, fast convergence, and well generalization [53,54].

Diverse authors have proposed the use of NARX-ANN and TD-NARX-ANN for modelling different nonlinear processes and implementing their control systems. A NARX-ANN model for the prediction of the daily direct solar radiation was used in [55]. A NARX-ANN based modeling approach to establish a robust water management tool to aid urban water managers in controlling the development of shallow water tables induced by artificial recharge activity was presented in [56]. A NARX-ANN model for accurate discharge prediction through lateral orifices of irrigation main canals was employed in [57]. A NARX-ANN to predict moisture sorption kinetics and consequently equilibrium moisture content of shiitake mushrooms over a wide range of relative humidity and temperature was developed in [58]. A NARX model reference adaptive control system of a magnetic levitation process was introduced in [59]. A NARX-based model predictive controller for grain dryers was designed in [60]. A NARX-based model predictive controller for turboshaft direct fuel control of an aircraft engine to improve its response ability in the presence of degraded conditions was implemented in [61]. A novel control system for a shunt active power filter based on a NARX-ANN was proposed in [62]. Based on a NARX model, two intelligent controllers to guarantee the desired growth conditions of crops, such as humidity and temperature, in a greenhouse were implemented in [63]. A general-purpose genetic algorithm (GA)-optimized NARX-ANN-based controller for control of nonlinear chemical and biochemical processes was introduced in [64]. A NARX-ANN-based predictive controller of personalized heating systems was designed in [65]. A NARX-ANN-based Smith predictor controller for time-delay compensation of nonlinear processes was developed in [46,66].

The main objective of this paper is the design of modified SP controllers that use NARX-ANN models as nonlinear internal models for the accurate and reliable control of water distribution in an irrigation main canal pool.

The major contributions of this paper are: (1) the proposal for the first time of a water distribution controller in an irrigation main canal pool founded on a modified SP with a NARX-ANN based model that describes the nonlinear dynamic behavior of the canal pool with an arbitrarily small error and (2) the improvements in the SP rejection features to load disturbances by means of including a gain tuning block *K*(*t*) that allows an adequate estimation of the load disturbances to be removed.

The proposed controller has the advantages of increasing the operability and reliability of the water distribution control systems, as well as reducing current water losses due to operation in the irrigation main canals pools.

This article is organized as follows. Section 2 proposes mathematical models of the studied process through the application of systems identification procedures. Section 3 develops the design of the NARX-ANN-based SP controller. Section 4 shows the discussions of the obtained results. Finally, some conclusions are offered in Section 5.

#### **2. ANN-Based System Identification of an Irrigation Main Canal Pool**

The research presented in this paper is based on the Zaza-Ciego de Avila Main Canal (ZCAMC), which is one of the most important hydraulic constructions in Cuba [67]. This canal is in the central zone of the country and connects the Zaza reservoir (with a capacity of 1,020,106 m3) with the south of the Ciego de Avila County. The surface irrigated is about 148,200 ha that is used for the cultivation of rice, sugar cane, citrus fruits, soybeans, horticulture, as well as for the development of fishing and livestock. The canal delivers a total volume of 900,106 m<sup>3</sup> of water/year that alleviates the overexploitation of the phreatic level in the region [68].

This canal is a 44.5 km long cross-structure with variable water discharge between 33.0 and 23.0 m3/s, overall canal height from 4.80 to 3.87 m, and normal water depth varying from 4.30 to 3.10 m. It has a trapezoidal cross-section, the bottom width varies from 6 to 3.2 m, and the average slope is 0.07 m/km [6]. It has four pools separated by undershot gated cross-structures.

This canal is mostly operated in a downstream end regulation mode [8,26]. All the pools of this canal are electrified and equipped with downstream end water level and gate position sensors, SIMATIC S7-300 PLCs, motors for the gates' positioning, and systems for data acquisition, processing, and storage.

Likewise, this canal has a remote supervisory control and data acquisition system (SCADA) with communication by radio and field buses that provide real time data of the controlled variables. Figure 1 is a view of ZCAMC.

**Figure 1.** A view of Zaza-Ciego de Avila Main Canal (ZCAMC).

Mathematical models have a vital role in systems analysis and design [41]. The controller design habitually requires a dynamic mathematical model that accurately characterizes the behavior of the processes to be controlled [69,70]. Therefore, one of the significant challenges of the controller design methodology is to obtain a suitable mathematical model of the process.

Several mathematical models have been developed for the irrigation main canal pools, see e.g., [8,40,50]. Since the irrigation main canal pools have complex nonlinear dynamics, linear approximation models of their dynamic processes are often developed [8,9], which have been often attained by applying system identification procedures, see e.g., [9,18,23].

System identification is the approach used to infer and construct process models from a set of experimental input-output data [41,69].

Since 1986, neural networks have been applied in the system identification of nonlinear dynamic processes [42].

Data and results reported in this paper were obtained from the first pool in the ZCAMC. It is a trapezoidal cross section main canal pool 12.5 km long, operated by using the distant downstream operation method [6,70]. This pool is equipped with an industrial personal computer, and a SIMATIC S7-300 PLC. The downstream end water level is controlled by four upstream undershoot gates situated at the beginning of the pool. Figure 2 is a

schematic representation of this canal pool, in which the four control gates are represented by an equivalent gate.

**Figure 2.** Schematic diagram of the first pool in ZCAMC, where LS is a level sensor, GPS is a gate position sensor, PLC is a programmable logic controller, *Qi*(*t*) are the water flows; *<sup>u</sup>*1(*t*) is the gate position, *<sup>q</sup>*1(*t*) is the lateral discharge, *rL*<sup>1</sup> is the water level setpoint.

The available measurements are the downstream end water levels *<sup>y</sup>*1(*t*), and the gate positions *<sup>u</sup>*1(*t*). Thus, the controlled variable is the downstream end water level *<sup>y</sup>*1(*t*), and the manipulated variable is the position of the upstream equivalent gate *<sup>u</sup>*1(*t*).

In this paper, using the system identification procedures [69] and the real-time experimental data attained through the SIMATIC S7-300 PLC system in open loop operating points, three models are obtained that describe the nominal dynamic behavior of the water distribution in the studied canal pool. The first and second models are nonlinear and are based on a NARX-ANN and a TD-NARX-ANN with recurrent architectures [53] due to the advantages of this type of structures, among which are the possibility of approximating any function with the desired precision, provided that it presents a limited number of discontinuities [43].

For nonlinear system identification the Matlab Neural Network-based System Identification Toolbox was used.

Recurrent ANNs are characterized by having feedback (or connections) between neurons in one layer and neurons in previous layers and in the same layer, in addition to forward connections, as well as maintaining a certain dependence on events that occurred in past moments (memory effect), which is convenient in the system identification procedure [43].

The development of the best architecture for the NARX-ANN-based models requires the determination of the number of hidden layers, the activation functions, the process time delays, and the efficient training algorithm [53].

The mathematical formulation of the NARX-ANN-based model can be written by the following discrete time expression [71]:

$$y(t) = \begin{array}{c} f(u(t-1), u(t-2), \dots, u(t-n\_u)) \\ y(t-1), y(t-2), \dots, y(t-n\_y)) \end{array} \tag{1}$$

where *u*(*t*) and *y*(*t*) denote respectively the input and output of the model at the discrete time step *<sup>t</sup>*, *nu*, *ny* are the input-memory and output-memory orders respectively, and *<sup>f</sup>*(·) is a nonlinear function represented by the ANN. The sampling period of the input and output data was *T* =30 s.

The TD-NARX-ANN-based model is expressed as [72,73]:

$$y(t) = \begin{array}{c} f(u(t-d-1), u(t-d-2), \dots, u(t-d-n\_u)) \\ y(t-1), y(t-2), \dots, y(t-n\_y)), \end{array} \tag{2}$$

where *d* (*d* > 0) is the number of process time delay sampling periods. We consider *d* = 10 which implies a time delay of 300 s.

Figures 3 and 4 illustrate the diagrams with the recurrent architectures of the NARX-ANN and TD-NARX-ANN-based models that have been designed, which have an input layer with 13 memory blocks for the input and output signals, a hidden layer with three neurons, and an output layer with one neuron. The activation functions in the hidden and output layers are hyperbolic tangent sigmoid and piecewise-linear (symmetric saturating linear) respectively.

**Figure 3.** Architecture of the NARX-ANN-based model designed.

**Figure 4.** Architecture of the TD-NARX-ANN-based model designed.

The difference between the architecture of these models lies in the TD-NARX-ANNbased model has the time delay block *Z*−*<sup>d</sup>* highlighted in red line.

The mathematical expressions of the NARX-ANN- and TD-NARX-ANN-based models are, respectively, [43]:

$$y\_{m1}(t) = f\_0 \left(\sum\_{h=1}^{n\_h} w\_{0,h} f\_h \left(\sum\_{i=1}^{n\_u} w\_{h,i} u(t-i) + \sum\_{j=1}^{n\_g} w\_{h,j} y\_{m1}(t-j) + b\_h \right) + b\_0 \right), \tag{3}$$

$$y\_{m2}(t) = f\_0 \left(\sum\_{h=1}^{n\_h} w\_{0,h} f\_h \left(\sum\_{i=1}^{n\_u} w\_{h,i} u(t - d - i) + \sum\_{j=1}^{n\_y} w\_{h,j} y\_{m2}(t - j) + b\_h \right) + b\_0 \right), \tag{4}$$

where *wh*,*<sup>i</sup>* are the synaptic weights of each *<sup>h</sup>* = 1, ... , 3 neuron of the hidden layer from the *<sup>i</sup>* = 1, 2, ... , *nu* = 13 inputs *<sup>u</sup>*(*t*), *wh*,*<sup>j</sup>* are the synaptic weights of each *<sup>h</sup>* neuron of the hidden layer from the *<sup>j</sup>* = 1, 2, ... , *ny* = 13 outputs (*y*(*t*)), *<sup>w</sup>*0,*<sup>h</sup>* are the synaptic weights of the output layer neuron from each *h* neurons of the hidden layer, *bh* are the bias weights of each *h* neuron of the hidden layer, *b*<sup>0</sup> is the bias weight of the output layer neuron, *fh* is the hyperbolic tangent activation function of each *h* neuron of the hidden layer, which is defined by the expression:

$$f\_h(\mathbf{x}) = \frac{e^{\mathbf{x}} - e^{-\mathbf{x}}}{e^{\mathbf{x}} + e^{-\mathbf{x}}} \tag{5}$$

*nh* is the neuron number of the hidden layer, and *f*<sup>0</sup> is a piecewise-linear activation function of the output layer, which is formed by three linear pieces joined to form a non-linear function (symmetric saturating linear) defined as:

$$f\_0(\mathbf{x}) = \begin{cases} -1, \mathbf{x} < -1, \\ \mathbf{x}, -1 < \mathbf{x} < 1, \\ 1, \mathbf{x} > 1. \end{cases} \tag{6}$$

The optimal number of neurons in the hidden layer was attained by using the Akaike Information Criterion (AIC). The lowest value of AIC <sup>=</sup>−6.29 was obtained for *<sup>h</sup>* <sup>=</sup>3. The input-memory and output-memory orders (*nu* = *ny* =13) were obtained from the time-series modeling using the Box–Jenkins approach.

The data set *Z<sup>N</sup>* that describes the irrigation canal pool studied in its entire range of operation was attained by exciting the upstream equivalent undershoot gate, i.e., the process input *<sup>u</sup>*1(*t*), with a pseudo random binary sequence (PRBS), and measuring the downstream end water level *<sup>y</sup>*1(*t*):

$$Z^N = \{ [\mu\_1(t), y\_1(t)] | t = 1, \dots, N \} \tag{7}$$

Figure 5 illustrates the dataset *Z<sup>N</sup>* obtained, which was split into three parts, 50% for the training set, 25% for the validation set, and 25% for the testing set.

**Figure 5.** The obtained data set *Z<sup>N</sup>* of the irrigation canal pool studied.

Training is an iterative process that consists of determining the synaptic weights *wh* and bias *bh* of each of the *h* neurons of the ANNs designed, so that their output signals converge towards the desired states [72]. The training algorithm should adjust the ANN parameters to minimize the following performance index (mean square error) [71]:

$$F(\mathbf{x}\_k) = \frac{1}{N} \sum\_{i=1}^{N} \varepsilon\_i (\mathbf{x}\_k)^2 = \frac{1}{N} \sum\_{i=1}^{N} (y\_i - y\_{mi})^2 \tag{8}$$

where *xk* is a vector of current parameters (weights and biases) of the ANN, *ei*(*xk*) is the training error, *yi* is the observed (measured) data of the downstream end water level, *ymi* is the ANN output (estimation of the downstream end water level developed by the ANN), and *N* is the total number of samples used to train the ANN.

For the training of the two designed NARX-ANNs, a modified Levenberg-Marquardt optimization algorithm was used, which is a non-linear optimization algorithm that combines the Gauss-Newton method and the Steepest-Descend Gradient method in a single expression and finds the minimum of the function (either linear or nonlinear) over a space of parameters and optimizes the solution [74,75]. This algorithm is characterized by its high convergence speed and reliability, although the complexity of its calculations is greater than in other algorithms [43]. Therefore, the synaptic weights and biases of the NARX-ANNs were adjusted by the algorithm [76]:

$$\mathbf{x}\_{k+1} = \mathbf{x}\_k - \left[\boldsymbol{f}^T(\mathbf{x}\_k)\boldsymbol{f}(\mathbf{x}\_k) + \boldsymbol{\lambda}\_k \boldsymbol{I}\right]^{-1} \boldsymbol{f}^T(\mathbf{x}\_k)\boldsymbol{e}(\mathbf{x}\_k) \tag{9}$$

where, *J* is the Jacobian matrix, *J<sup>T</sup> J* is the Hessian matrix, *e* is the error vector of ANN between the real output and the estimated ANN output, *I* represents the identity matrix of appropriate size, *k* is the number of iterations, and *λ<sup>k</sup>* is the learning coefficient that is automatically updated based on the error at each iteration to secure the convergence.

The FIT crossover performance index was used to quantify the degree of precision of the obtained models. It is defined by the expression [69]:

$$FFT = \left\{ 1 - \frac{\left\| y\_i - \hat{y}\_i \right\|^2}{\left\| y\_i - mean\left(y\_i\right) \right\|} \right\} \times 100\% \tag{10}$$

where *yi* are the observed (measured) data of the downstream end water level, *y*ˆ*<sup>i</sup>* are the estimates of the downstream end water level developed by the model, and *mean*(*yi*) is the means of the downstream end water level.

The simplest procedure to evaluate the quality of the derived ANN model is to train it with the training dataset and validate its performance with different test datasets, e.g., with the validation and testing data sets [77]. The validation data set is used to improve the model performance by fine-tuning the model, while the testing data set is used to provide an unbiased assessment of the final model fit [41].

Validation results of these models with the validation and training data sets are given in Figure 6, which show that the two models adequately describe the dynamic behavior of the water distribution in the canal pool under study with FIT performance indices of 91.0% for the NARX-ANN- based model and 94.0% for the TD-NARX-ANN-based model.

The third model is linear. It was obtained by applying the step response method [69] and is represented by the following transfer function with time delay:

$$G\_p(s) = \frac{K\_p}{\tau\_p s + 1} e^{-Ls} = \frac{0.835}{250s + 1} e^{-300s} \tag{11}$$

where *Kp* is the static gain, *τ<sup>p</sup>* is the time constant, and *L* is the time delay.

For linear system identification the Matlab system identification toolbox was used.

Figure 7 illustrates the validation results of the linear model (11) with the estimated nominal parameters with a FIT performance index of 87.0%, which indicates that (11) is a suitable model since models with FIT greater than 80% are considered adequate for process control [69].

**Figure 6.** Validation results of the NARX-ANN and TD-NARX-ANN-based models with the validation and testing data sets.

**Figure 7.** Validation results of the obtained linear mathematical model.

The FIT indexes obtained with the three models demonstrate the better performance of NARX-ANN-based models due to their ability to model the non-linear dynamic behaviors.

#### **3. Design of the NARX-ANN-Based SP Controller**

Model (11) shows that the dynamics of the water distribution in our canal pool is characterized by a dominant time delay [28].

Since the time delay reduces the gain crossover frequency, the phase margin, and the speed of response of the control systems, and even can cause the loss of the stability of the closed-loop system [78], the design of a modified SP controller founded on the NARX-ANN-based models previously obtained is proposed for the effective control of water distribution in our canal pool.

The main advantage of the SP is that it removes the real process from the control loop, and it allows then to design the controller from a mathematical model that does not include the process time delay [29].

Therefore, if the mathematical model accurately describes the dynamics of the process to be controlled, the SP eliminates the time delay from the characteristic equation of the closed-loop control system [28].

One of the main problems of the SP controller is the degradation of its performance and robustness when the processes are subjected to the effect of external load disturbances [28,30].

Figure 8 illustrates a diagram of the conventional SP structure, made up of the controller *C*(*s*), the real process *P*(*s*), and the complete linear mathematical model of the process (integrated by the model without time delay *Gp*(*s*) and the time delay *e*−*Ls*, which is connected in parallel with the real process. Usually, *C*(*s*) is a standard PI or PID controller [21].

**Figure 8.** Conventional structure of the SP.

If the mathematical model accurately describes the dynamic behavior of the real process, that is *<sup>P</sup>*(*s*) = *Gp*(*s*)*e*−*Ls*, the output signal *Ym*2(*s*) has no time delay and constitutes a prediction of the output signal *Y*(*s*) of the real process without time delay [31].

Therefore, based on the mathematical model, it is possible to obtain a prediction of the real process output signal ahead of the time delay. The real process output signal *Y*(*s*) is fed back to the controller to remove the negative effect of the load disturbance *D*(*s*).

The transfer function of the load disturbance *D*(*s*) is the following:

$$G\_D(\mathbf{s}) = \frac{K\_D}{\mathbf{s} + 1} \tag{12}$$

where *KD* is a time varying parameter in the following range: *KD*min < *KD*(*t*) < *KD*max.

If the complete mathematical model *Gp*(*s*)*e*−*Ls* accurately describes the dynamic behavior of the real process *P*(*s*) and there is no load disturbance, the error signal *e*(*s*) is zero and the transfer function of the closed-loop control system is represented as:

$$G\_{L\mathbb{C}}(s) = \frac{\mathbb{C}(s)P(s)}{1 + \mathbb{C}(s)G\_p(s)} = \frac{\mathbb{C}(s)G\_p(s)e^{-Ls}}{1 + \mathbb{C}(s)G\_p(s)}\tag{13}$$

If *C*(*s*) is a PI controller, the control algorithm is defined by the expression [79]:

$$\mathcal{C}(\mathbf{s}) = \frac{\mu(\mathbf{s})}{c\_2(\mathbf{s})} = K\_\mathbb{C} \left( 1 + \frac{1}{T\_I \mathbf{s}} \right) \tag{14}$$

where *KC* is the proportional gain, *TI* is the integral time constant, *<sup>e</sup>*2(*s*) is the error signal and *u*(*s*) is the control signal. Consequently, the controller *C*(*s*) is designed for the real process without time delay and the response of the closed-loop control system is time delayed.

The right performance of the SP depends on the exact knowledge of the process model *Gp*(*s*) and the time delay *e*−*Ls* [28]. Any model uncertainty and/or load disturbance can lead to imprecise time responses of the control system, which can even cause its instability [32].

Our contribution consists of: (1) replacing the linear mathematical model of the conventional SP structure by the non-linear NARX-ANN (3) and the TD-NARX-ANN (4) based models obtained in the previous Section; and (2) adding to the controller output signal *u*(*t*) a signal equivalent to the estimate *D*ˆ (*t*) of the load disturbance *D*(*t*) (for example, an increase in the lateral discharge *<sup>q</sup>*1(*t*)) in order to reject its negative effect on the control system and, thus, increase the controller performance.

Figure 9 illustrates the diagram of our proposal for the modified SP structure with the non-linear NARX-ANN-based models for the water distribution control in the studied canal pool, where *K*(*t*) is a gain tuning block to achieve an adequate estimate *D*ˆ (*t*) of the load disturbance *D*(*t*). If there is no load disturbance, *D*(*t*) = 0, and the error signal is *<sup>e</sup>*(*t*) <sup>∼</sup><sup>=</sup> 0. However, when the load disturbance *<sup>D</sup>*(*t*) appears, we have that *<sup>e</sup>*(*t*) <sup>=</sup> 0. If the error signal returns to *<sup>e</sup>*(*t*) ∼= 0 by tuning the *<sup>K</sup>*ˆ*D*(*t*) of the gain tuning block, then the output signal *D*ˆ (*t*) of the gain tuning block is a suitable estimation of the load disturbance *D*(*t*).

**Figure 9.** Proposed modified structure of the SP.

The autotuning algorithm of the gain block *K*(*t*) is the following:

$$
\hat{K}\_D(t) = \hat{K}\_D(t-1)\frac{y(t-1)}{y\_{m1}(t-1)}\tag{15}
$$

If *<sup>y</sup>*(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>) = *ym*1(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>)the error signal*e*(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>) = 0 and consequently *<sup>K</sup>*ˆ*D*(*t*) = *<sup>K</sup>*ˆ*D*(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>). But if *<sup>y</sup>*(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>) <sup>=</sup> *ym*1(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>) the error signal *<sup>e</sup>*(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>) <sup>=</sup> 0 and therefore, the new *<sup>K</sup>*ˆ*D*(*t*) is obtained through multiplying *<sup>K</sup>*ˆ*D*(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>) by a tuning gain obtained as *<sup>y</sup>*(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>)/*ym*1(*<sup>t</sup>* <sup>−</sup> <sup>1</sup>). The linear PI controller in the discrete time domain is represented as [80]:

$$u(t) = K\_{\mathbb{C}} e\_2(t) + K\_I T\_s \sum\_{j=0}^t e\_2(j) \tag{16}$$

where *KI* = *KC*/*TI*, *Ts* is the sampling period.

In the proposed modified structure of the SP, the design of the controller *C*(*t*) was carried out based on the following non-linear PI control law [46]:

$$u(t) = f\left[c\_2(t), T\_s \sum\_{i=0}^{t} c\_2(i)\right] \tag{17}$$

where *<sup>f</sup>*(·) is a nonlinear function represented by an ANN, *<sup>e</sup>*2(*t*) is the control system error, *<sup>e</sup>*2(*t*) = *<sup>r</sup>*(*t*) <sup>−</sup> *<sup>y</sup>*(*t*) + *ym*1(*t*) <sup>−</sup> *ym*2(*t*), and *<sup>r</sup>*(*t*) is the setpoint.

A multilayer feedforward with tree layers ANN was used to design the nonlinear ANN PI controller. The input layer has two nodes (the system error *<sup>e</sup>*2(*t*) and the sum of the system error *<sup>e</sup>*2(*i*)), the hidden layer has 5 nodes, and the output layer 1 node. The neuron activation function of the input layer is the linear function *fi*(*x*) = *<sup>x</sup>*, of the hidden layer is the sigmoid function *fh*(*x*) = 1/1 + *<sup>e</sup>*−*x*, and of the output layer is *<sup>f</sup>*0(*x*) = *<sup>x</sup>*. Figure <sup>10</sup> illustrates the diagram with the architecture of the designed nonlinear ANN PI controller.

**Figure 10.** Architecture of the designed nonlinear ANN PI controller.

The data for training this controller was obtained from a discrete-time linear PI controller installed in the canal pool under study.

In addition, using the frequency domain method, a conventional PI controller embedded in a SP structure was designed with specifications: gain crossover frequency *<sup>ω</sup><sup>c</sup>* <sup>=</sup> <sup>10</sup>−<sup>3</sup> rad/s to achieve a settling time around *Ts* <sup>=</sup> *<sup>L</sup>* <sup>+</sup> *<sup>π</sup>*/*ω<sup>c</sup>* <sup>≈</sup> 3400 s and phase

margin *φ<sup>m</sup>* = 80◦ to obtain a nearly zero overshoot. The controller tuned with these specifications was:

$$C(s) = 0.087(1 + \frac{1}{70.56s})\tag{18}$$

#### **4. Results and Discussions**

For the evaluation of the performance of the modified SP control system with the NARX-ANN-based models, a comparative experimental analysis was carried out with another SP with a conventional structure, that included an internal linear model (13) and did not include the gain tuning block *K*(*t*). Both control systems were designed to fulfill the same specifications, considering the following real operation scenarios of our canal pool.

**Scenario (1):** Experimental performance evaluation of the water distribution control system with both designed controllers when a positive setpoint change Δ*r*(*t*) is applied from 3.10 to 3.55 m (setpoint tracking). Table 1 and Figure 11 illustrate the comparative experimental time responses to that setpoint change yielded by the water distribution control system with the two controllers, where *MP* is the overshoot, *Eee* is the steady-state error, *Tr* is the rise time and *Ts* is the settling time. Figure 12 shows the experimental comparative output signals of the controllers designed.

**Table 1.** Experimental comparative results to setpoint change yielded by the water distribution control system with both designed controllers.


**Figure 11.** Experimental comparative time responses to a setpoint change of the water distribution control system with both designed controllers.

**Figure 12.** Experimental comparative output signals of the controllers designed.

From Figure 11 and Table 1 it is observed that the control system with the NARX-ANNbased SP controller designed reaches the new reference water level practically without overshoot and with zero steady state error—like the conventional SP—but the rise time and the settling time are approximately the 50% of the times yielded by the conventional SP. After time 3500 s, the responses provided by both control systems have reached the new setpoint with zero steady state error and, therefore, their steady time responses are similar. We highlight that the same responses attained after 3500 s is consequence of the integral terms included in the structures of controller (17) and the PI controller (18), which is a common term in both control systems. However, the main difference is produced in the transient, in which our proposal achieves the final value much faster than the conventional SP controller.

Specifications used to design the conventional PI controller embedded in the SP structure are quite conservative. In theory, much faster responses could be achieved using this control system by tuning controller (18) with higher values of the gain crossover frequency *ωc*. However, the value of *ω<sup>c</sup>* used here—though it is not too fast—yields a robust controller (18) that maintains a stable closed-loop response in all the range of operation of the canal, and not only in the nominal operation point at which the controller has been designed. In fact, the highly nonlinear dynamics of the canal makes aggressive linear controllers (controllers that yield very fast damped responses) designed based on a linearized model around an operation point become unstable when the canal operates in a point far away from the linearized one. However, our nonlinear control system based on our TD-NARX-ANN nonlinear model allows a much faster response in the nominal operation point without neither unstabilizing or excessively degrading the closed-loop response of the canal when operating at points far away from nominal. This is caused by the fact that our TD-NARX-ANN nonlinear model captures better than linear model (11) the nonlinear behaviour of our canal pool allowing us to design a controller that takes into account at some extent such nonlinearity.

These results confirm that the NARX-ANN-based SP controller designed improves the management of the water flow through the canal pool and satisfies the users water demand in a shorter time.

**Scenario (2):** Experimental performance evaluation of the water distribution control system with both designed controllers against the rejection of an external load disturbance *<sup>D</sup>*(*t*). In this case, an increment of 10% in the lateral discharge *<sup>q</sup>*1(*t*) of the canal pool to satisfy the users water demand was considered as an external load disturbance *D*(*t*). Figure 13 illustrates the experimental comparative time responses obtained by the water distribution control system with both designed controllers against the effect of an increment in the lateral discharge (*q*1(*t*)) originated in *<sup>t</sup>* =3800 s, causing a decrement in the water level of approximately 5.0 cm. The value obtained in the gain tuning block of the designed NARX-ANN-based SP controller was *K*(*t*) =0.1.

**Figure 13.** Experimental comparative time responses to the effect of external load disturbance of the water distribution control system with the controllers designed.

These results support the assumption that the best performance of the control system is obtained with the designed NARX-ANN-based SP controller, which makes it possible to fully reject the negative effect of the load disturbance and restore the setpoint water level of 3.55 m in the canal pool in an approximate time of 700 s, while with the conventional SP this load disturbance is fully rejected in an approximate time of 2200 s, which represents an approximate decrement in the rejection time of 70%.

Though conventional SP structures allow very fast responses to setpoint changes of linear systems, their disturbance rejection performance is not good because, though the steady state error caused by a constant disturbance can be driven to zero, the settling time required to achieve this is large. In fact, the basic structure of the SP has been modified several times in the last decades to improve this issue. However, the compromise between the disturbance rejection speed and the closed-loop robustness limits the improvements that can be achieved. An alternative is using the Active Disturbance Rejection Control (ADRC) combined with a SP structure. However, this technique is not well suited to deal with systems with time delay, and the versions that try to copy with this are relatively complicate. We recently developed a modification of the SP structure that included a new feedback term of the error between the real and the predicted outputs [81]. This scheme showed some advantages in rejecting disturbances as well as a better stability robustness than other SP structures. Then, this scheme is a suitable candidate to be used in our canal pool because it yields a faster disturbance rejection, and its improved robustness allows the control system to deal with the nonlinearities of our canal better than other SP schemes. Moreover, simulations proved that this scheme (in which the linear blocks were substituted by the NARX-ANN blocks in order to better account for the canal nonlinearities) could be further improved by making the gain of that feedback term vary in function of the quotient between the real and the predicted outputs. Then the control scheme implemented in our real canal is [81] but including the feedback term with variable gain that changes according to law (15). Experiments on our canal pool have shown its effectiveness.

These experimental results obtained in the considered real operation scenarios of the studied canal pool show that the designed NARX-ANN-based SP controller outperforms the conventional SP in tracking setpoint changes and rejecting disturbances. Moreover, they show that the proposed NARX-ANN and TD-NARX-ANN models accurately reproduce the canal pool nonlinearities and can be implemented in real time, being more accurate than linearized models while being less computationally involved than more exact models based on the numerical solution of the Saint-Venant equations that cannot be implemented in real time applications.

The NARX-ANN-Based SP controller designed was implemented in the industrial PC installed in the canal pool under study through a program developed in Matlab. This controller communicates via Matlab/OPC (OLE for Process Control) with the SIMATIC S7-300 PLC for real-time data exchange. The OPC technology is based on a Server/Client architecture and provides standards-based communication between PC and PLC.

#### **5. Conclusions**

The contribution of this work is a modification of the well-known Smith predictor controller in which the internal linear model has been substituted by the combination of a NARX-ANN-based model and a TD-NARX-ANN-based model, in order to take into account the dynamic nonlinearities in the effective control of water distribution in an irrigation main canal pool.

By application of system identification procedures, NARX-ANN and a TD-NARX-ANN with recurrent architectures have been obtained which describe with high accuracy the non-linear dynamic behavior of the water distribution in the studied canal pool. The NARX-ANN structure with an input layer with 13 memory blocks for the input and output signals, three neurons in the hidden layer and one neuron in the output layer provided the best models performance. A third model which is linear and is represented by a timedelay first order transfer function was obtained using an identification procedure. The validation results of the three models illustrate that the FIT performance indexes of the NARX-ANN-based models are higher than that of the linear model.

In the context of hydraulic engineering, one of the contributions of this paper consists therefore of the fact that accurate nonlinear models based on the NARX-ANN could be obtained for an irrigation main canal pool under different real operation conditions using real-time data and system identification procedures. The results attained are very promising and show that there is great potential for the use of NARX-ANN-based models in depicting the dynamic behavior of irrigation main canal pools. We remark that improving the accuracy of these nonlinear dynamic models has a significant impact on improving the performance of water distribution control systems.

Besides modifying the SP conventional structure by substituting the linear mathematical models by the non-linear NARX-ANN-based models, a gain tuning block of gain *K*(*t*) has been added that estimates the load disturbance *D*ˆ (*t*), which is summed to the controller output signal *u*(*t*) in order to reject the negative effect of the load disturbance *D*(*t*) on the control system and, thus, increase the controller performance.

The experimental results obtained on the real operation scenarios of the studied canal pool illustrate that our modified NARX-ANN-based SP controller improves the control system performance and therefore satisfies the users water demands in time.

The benefits that are obtained with the increment of water distribution control performance in the main irrigation canal pools are reverted to a better management and rational

use of the available hydraulic resources, as well as to a greater environment protection due to the reduction of the current operation water losses.

The proposed NARX-ANN-based SP controller constitutes a simple and attractive option for the accurate water distribution control in the irrigation main canal pools and, therefore, it reveals that there is a great potential in the application of the artificial intelligence to design non-linear control systems for this class of hydraulic processes, as well as to design prediction and decision support systems to optimize the operation of irrigation systems.

This research will be expanded in the future to obtain non-linear NARX-ANN-based models of the remaining canal pools of the ZCAMC, and to use these models in the design and implementation of an integral intelligent water distribution control system for the complete irrigation main canal.

**Author Contributions:** Conceptualization, R.R.-P. and V.F.-B.; methodology, V.F.-B. and R.R.-P.; software, Y.H.-L.; validation, Y.H.-L.; formal analysis, R.R.-P. and V.F.-B.; writing, review and editing, Y.H.- L., V.F.-B. and R.R.-P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors would like to acknowledge the support provided by the Consejería de Educación, Cultura y Deportes de la Junta de Comunidades de Castilla-La Mancha (Spain) with the Project SBPLY/21/180501/000052.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Applied Sciences* Editorial Office E-mail: applsci@mdpi.com www.mdpi.com/journal/applsci

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel: +41 61 683 77 34

ISBN 978-3-0365-6809-6