**Computational Optimizations for Machine Learning**

Editor

**Freddy Gabbay**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Freddy Gabbay Faculty of Engineering Ruppin Academic Center Israel

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Mathematics* (ISSN 2227-7390) (available at: https://www.mdpi.com/journal/mathematics/special_issues/Comput_Optim_Mach_Learn).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-3186-1 (Hbk) ISBN 978-3-0365-3187-8 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


#### **Ruba Abu Khurma, Ibrahim Aljarah\*, Ahmad Sharieh, Mohamed Abd Elaziz, Robertas Damaševičius\*, Tomas Krilavičius**

A Review of the Modification Strategies of the Nature-Inspired Algorithms for Feature Selection Problem

Reprinted from: *Mathematics* **2022**, *10*, 464, doi:10.3390/math10030464, p. 219

## **About the Editor**

**Freddy Gabbay** received his B.Sc., M.Sc. and Ph.D. in Electrical Engineering from Technion – Israel Institute of Technology, Haifa, Israel. In 1998, he worked as a researcher at Intel's Microprocessor Research Lab. In 1999, he joined Mellanox Technologies and held various positions in leading switch product line architecture and ASIC design. In 2003, he joined Freescale Semiconductor as a senior design manager and led the design of baseband ASIC products. In 2012, he rejoined Mellanox Technologies, where he served as Vice President of Chip Design. Today, he is the Dean of the Engineering Faculty and an associate professor at the Ruppin Academic Center, Emek Hefer, Israel. His research interests include VLSI design, computer architecture, machine learning and domain-specific accelerators. Prof. Gabbay holds 19 patents and is a senior member of IEEE.

## **Preface to "Computational Optimizations for Machine Learning"**

In the past decade, machine learning has emerged as a powerful tool for an incredible number of applications, such as computer vision, medicine, fintech, autonomous systems, speech recognition, traffic management and social media, among many others. Machine learning models provide state-of-the-art and robust accuracy in various applications. Beyond the major impact of machine learning applications on our lives and environment, machine learning introduces a revolutionary approach to developing algorithms. While past approaches relied on humans to develop new algorithms, machine learning uses powerful computers to train algorithms on large datasets. Machine learning can thereby identify complex connections and relations between features that cannot be handled using conventional methods.

The increasing deployment of machine learning algorithms introduces major computational challenges due to the explosive growth in their model size and complexity. These challenges have been further emphasized due to the diversity of hosting computational platforms, from edge devices and cloud systems to high-performance computing. Given that each platform introduces different computational and cost constraints, the need for computational optimizations that are fine-tuned to the application and platform is crucial.

The present book contains the 10 articles accepted for publication among the 15 submissions to the Special Issue "Computational Optimizations for Machine Learning" of the MDPI journal *Mathematics*.

The 10 articles, which appear in the present book in the order in which they were published in Volume 9 (2021) of the journal, cover a wide range of topics connected to the theory and applications of machine learning computational optimization. These topics include, among others, elements from convolutional neural networks, nature-inspired algorithms, neural network training, quantization, predictive control of nonlinear processes, weather prediction and adaptive online learning.

It is hoped that the book will be interesting and useful to those developing mathematical algorithms and applications in the domain of artificial intelligence and machine learning, as well as to readers with the appropriate mathematical background who wish to become familiar with recent advances in the mathematics of machine learning computational optimization, which has nowadays permeated almost all sectors of human life and activity.

As the Guest Editor of the Special Issue, I am grateful to the authors of the papers for their quality contributions, to the reviewers for their valuable comments toward the improvement of the submitted works, and to the administrative staff of MDPI for their support in completing this project. Special thanks are due to the Managing Editor of the Special Issue, Dr. Syna Mu, for the excellent collaboration and valuable assistance.

> **Freddy Gabbay** *Editor*

## *Article* **Adaptive Online Learning for the Autoregressive Integrated Moving Average Models**

**Weijia Shao <sup>1,\*</sup>, Lukas Friedemann Radke <sup>1</sup>, Fikret Sivrikaya <sup>2</sup> and Sahin Albayrak <sup>1,2</sup>**


**Abstract:** This paper addresses the problem of predicting time series data using the autoregressive integrated moving average (ARIMA) model in an online manner. Existing algorithms require model selection, which is time consuming and unsuitable for the setting of online learning. Using adaptive online learning techniques, we develop algorithms for fitting ARIMA models without hyperparameters. The regret analysis and experiments on both synthetic and real-world datasets show that the performance of the proposed algorithms can be guaranteed in both theory and practice.

**Keywords:** ARIMA model; time series analysis; online optimization; online model selection

#### **1. Introduction**

The autoregressive integrated moving average (ARIMA) model is an important tool for time series analysis [1], and has been successfully applied to a wide range of domains including the forecasting of household electric consumption [2], scheduling in smart grids [3], finance [4], and environment protection [5]. It specifies that the values of a time series depend linearly on their previous values and error terms. In recent years, online learning (OL) methods have been applied to estimate the univariate [6,7] and multivariate [8,9] ARIMA models for their efficiency and scalability. These methods are based on the fact that any ARIMA model can be approximated by a finite dimensional autoregressive (AR) model, which can be fitted incrementally using online convex optimization algorithms. However, to guarantee accurate predictions, these methods require a proper configuration of hyperparameters, such as the diameter of the decision set, the learning rate, the order of differencing, and the lag of the AR model. Theoretically, these hyperparameters need to be set according to prior knowledge about the data generation, which is impossible to obtain. In practice, the hyperparameters are usually tuned to optimize the goodness of fit on the unseen data, which requires numerical simulation (e.g., cross-validation) on a previously collected dataset. The numerical simulation is notoriously expensive, since it requires multiple training runs for each candidate hyperparameter configuration. Furthermore, a previously collected dataset containing ground truth is needed for validation of the fitted model, which is unsuited for the online setting. Unfortunately, the expensive tuning process needs to be regularly repeated if the statistical properties of the time series change over time in an unforeseen way.
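As an illustration of the tuning burden described above, the following sketch (hypothetical names, synthetic data) fits an AR(*m*) model to the *d*-th differenced series with plain online gradient descent under the squared loss; note the hand-picked learning rate `lr`, exactly the kind of hyperparameter that adaptive methods aim to remove.

```python
import numpy as np

def online_ar_ogd(X, m=3, d=1, lr=0.5):
    """Fit an AR(m) model to the d-th differences of X with online
    gradient descent on the squared loss (illustrative sketch only)."""
    dX = np.diff(np.asarray(X, dtype=float), n=d)
    gamma = np.zeros(m)                 # AR coefficients, updated online
    losses = []
    for t in range(m, len(dX)):
        ctx = dX[t - m:t][::-1]         # dX_{t-1}, ..., dX_{t-m}
        pred = gamma @ ctx              # one-step prediction of dX_t
        err = pred - dX[t]
        losses.append(err ** 2)
        gamma -= lr * 2 * err * ctx     # OGD step; lr must be hand-tuned
    return gamma, np.array(losses)

# Synthetic integrated AR(1): the differences follow dx_t = 0.9 dx_{t-1} + noise
rng = np.random.default_rng(0)
dx = np.zeros(600)
for t in range(1, 600):
    dx[t] = 0.9 * dx[t - 1] + 0.1 * rng.standard_normal()
X = np.cumsum(dx)
gamma, losses = online_ar_ogd(X, m=3, d=1, lr=0.5)
```

The learner's average squared error ends up below that of the trivial zero predictor on the differences, but only because `lr` happens to suit this data; a poorly chosen rate would negate the benefit, which is the dependence the paper seeks to eliminate.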

Given a new problem of predicting time series values, it appears that tuning the hyperparameters of the online algorithms can negate the benefits of the online setting. This paper addresses this problem in the online learning framework by proposing new parameter-free algorithms for learning ARIMA models whose performance can still be guaranteed in both theory and practice. A naive attempt would be to directly apply parameter-free online convex optimization (PF-OCO) algorithms to the AR approximation. However, the theoretical performance of the AR approximation and the parameter-free

**Citation:** Shao, W.; Radke, L.F.; Sivrikaya, F.; Albayrak S. Adaptive Online Learning for the Autoregressive Integrated Moving Average Models. *Mathematics* **2021**, *9*, 1523. https://doi.org/10.3390/ math9131523

Academic Editors: Freddy Gabbay, Ioannis K. Argyros and Mihai Postolache

Received: 19 April 2021 Accepted: 24 June 2021 Published: 29 June 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

algorithms rely on bounded gradient vectors of the loss function, which is an unreasonable assumption for the widely used squared error on an unbounded domain.

The key contribution of this paper is the design of online learning algorithms for ARIMA models that avoid regular and expensive hyperparameter tuning without limiting the power of the models. Our algorithms update the model incrementally, with a per-iteration computational complexity linear in the size of the model parameters and the number of candidate models. To obtain a solid theoretical foundation, we first show that, for any locally Lipschitz-continuous loss function, ARIMA models with a fixed order of differencing can be approximated using an AR model of the same order of differencing with a large enough lag. Based on this, new algorithms are proposed for learning the AR model adaptively without requiring any prior knowledge about the model parameters. For Lipschitz-continuous loss functions, we apply a new algorithm based on the adaptive follow the regularized leader (FTRL) framework [10] and show that our algorithm achieves a sublinear regret bound depending on the data sequence and the Lipschitz constant. A special treatment of the commonly used squared error is required due to its lack of Lipschitz continuity. To obtain a data-dependent regret bound, we combine a polynomial regularizer [11] with the adaptive FTRL framework. Finally, to find the proper order and lag of the AR model in an online manner, multiple AR models are maintained simultaneously, and an adaptive hedge algorithm is applied to aggregate their predictions. In previous attempts [12,13] to solve this online model selection (OMS) problem, the exponentiated gradient (EG) algorithm was directly applied to aggregate the predictions, which not only requires tuning the learning rate, but also yields a regret bound depending on the loss incurred by the worst model. Our adaptive hedge algorithm is parameter-free and guarantees a regret bound depending on the time series sequence. Table 1 provides a comparison of the online learning algorithms applied to the learning of ARIMA models.
In addition to the theoretical analysis, we also demonstrate the performance of the proposed algorithm using both synthetic and real-world datasets.


For non-Lipschitz-continuous loss functions, the gradient norm can be unbounded. These algorithms with performance depending on the gradient norm can fail without making further assumptions on the data generation. For OGD, the learning rate and the diameter of the decision set need to be tuned in practice. ONS has an additional hyperparameter controlling the numerical stability. Applying SF-MD to ARIMA, the diameter of the model parameter has to be tuned. To obtain optimal performance, the learning rate of EG has to be tuned.

The rest of the paper is organized as follows. Section 2 reviews the existing work on the subject. The notation, learning model, and formal description of the problem are introduced in Section 3. Next, we present and analyze our algorithms in Section 4. Section 5 demonstrates the empirical performance of the proposed methods. Finally, we conclude our work with some future research directions in Section 6.

#### **Algorithm 1** ARIMA-AdaFTRL.

```
Input: L1 > 0
Initialize θi,1 arbitrarily, ηi,1 = 0, Gi,0 = 0 for i = 1, . . . , m
for t = 1 to T do
    for i = 1 to m do
        Gi,t = max{Gi,t−1, ‖∇^d Xt−i‖2}
        ηi,t = ‖θi,1‖F + sqrt(Σ_{s=1}^{t−1} ‖gi,s‖F^2 + (Lt Gi,t)^2)
        if ηi,t ≠ 0 then
            γi,t = θi,t / ηi,t
        else
            γi,t = 0
        end if
    end for
    Play X̃t(γt)
    Observe Xt and gt ∈ ∂lt(X̃t(γt))
    Lt+1 = max{Lt, ‖gt‖2}
    for i = 1 to m do
        gi,t = gt (∇^d Xt−i)^T
        θi,t+1 = θi,t − gi,t
    end for
end for
```
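A scalar (*n* = 1) Python sketch of Algorithm 1 may help make the bookkeeping concrete: with one-dimensional observations the Frobenius norms reduce to absolute values, and we use the squared loss so the subgradient is available in closed form. The function and variable names are ours, not the authors'.

```python
import numpy as np

def arima_adaftrl(X, m=5, d=1, L1=1.0):
    """Scalar sketch of ARIMA-AdaFTRL: AdaFTRL on the d-th differences."""
    dX = np.diff(np.asarray(X, dtype=float), n=d)
    theta = np.zeros(m)        # dual accumulators theta_{i,t}
    theta1 = theta.copy()      # theta_{i,1}; initialized to zero here
    sq_grad = np.zeros(m)      # running sums of g_{i,s}^2
    G = np.zeros(m)            # running maxima of |dX_{t-i}|
    L = L1                     # running Lipschitz estimate
    losses = []
    for t in range(m, len(dX)):
        ctx = dX[t - m:t][::-1]                   # dX_{t-1}, ..., dX_{t-m}
        G = np.maximum(G, np.abs(ctx))
        eta = np.abs(theta1) + np.sqrt(sq_grad + (L * G) ** 2)
        gamma = np.divide(theta, eta, out=np.zeros(m), where=eta > 0)
        pred = float(gamma @ ctx)                 # play the prediction
        losses.append((pred - dX[t]) ** 2)
        g = 2.0 * (pred - dX[t])                  # subgradient of squared loss
        L = max(L, abs(g))
        gi = g * ctx                              # g_{i,t} = g_t * dX_{t-i}
        sq_grad += gi ** 2
        theta -= gi                               # theta update
    return np.array(losses)

losses = arima_adaftrl(np.sin(np.linspace(0.0, 10.0, 200)), m=5, d=1)
```

Note how no learning rate appears: the effective step size is derived from the observed gradients and differences, mirroring the adaptive regularizer in the algorithm.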
**Algorithm 2** ARIMA-AdaFTRL-Poly.

```
Input: G0 > 0
Initialize θ1 arbitrarily, G1 = max{G0, ‖∇^d X0‖2, . . . , ‖∇^d X−m+1‖2}
for t = 1 to T do
    ηt = ‖θ1‖F + sqrt(Σ_{s=1}^{t−1} ‖∇^d Xs xs^T‖F^2 + (Gt ‖xt‖2)^2)
    λt = sqrt(Σ_{s=1}^{t} ‖xs‖2^4)
    if ‖θt‖F ≠ 0 then
        Select c ≥ 0 satisfying λt c^3 + ηt c = ‖θt‖F
        γt = c θt / ‖θt‖F
    else
        γt = 0
    end if
    Play X̃t(γt)
    Observe Xt and gt = γt xt − ∇^d Xt
    Gt+1 = max{Gt, ‖∇^d Xt‖2}
    θt+1 = θt − gt xt^T
end for
```
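The only non-trivial step in Algorithm 2 is selecting *c* ≥ 0 with λ<sub>t</sub> c³ + η<sub>t</sub> c = ‖θ<sub>t</sub>‖<sub>F</sub>. Since the left-hand side is strictly increasing in *c* when λ<sub>t</sub>, η<sub>t</sub> ≥ 0, the nonnegative root is unique and a simple bisection suffices; the sketch below (hypothetical names, nonnegative inputs assumed) shows one way to compute it.

```python
def solve_scale(lam, eta, r):
    """Solve lam*c**3 + eta*c = r for the unique c >= 0 (lam, eta, r >= 0).

    The left-hand side is strictly increasing in c, so we can bracket
    the root and bisect; no library-specific root finder is needed.
    """
    if r == 0.0:
        return 0.0
    if lam == 0.0:                        # degenerate: purely linear equation
        return r / eta if eta > 0 else 0.0
    lo, hi = 0.0, 1.0
    while lam * hi ** 3 + eta * hi < r:   # grow the bracket until it covers r
        hi *= 2.0
    for _ in range(100):                  # bisect down to machine precision
        mid = 0.5 * (lo + hi)
        if lam * mid ** 3 + eta * mid < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = solve_scale(2.0, 3.0, 22.0)   # 2*c**3 + 3*c = 22 has the root c = 2
```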

#### **Algorithm 3** ARIMA-AO-Hedge.

```
Input: predictors A1, . . . , AK, d
Initialize θk,1 = 0, η1 = 0 for k = 1, . . . , K
for t = 1 to T do
    Get prediction X̃t^i from Ai for i = 1, . . . , K
    Set Yt = Σ_{i=0}^{d−1} ∇^i Xt−1
    Set hi,t = l(Yt, X̃t^i) for i = 1, . . . , K
    if ηt = 0 then
        Set wi,t = 1 for some i ∈ arg min_{j∈{1,...,K}} hj,t
    else
        Set wi,t = exp(ηt^{−1}(θi,t − hi,t)) / Σ_{j=1}^{K} exp(ηt^{−1}(θj,t − hj,t)) for i = 1, . . . , K
    end if
    Predict X̃t = Σ_{i=1}^{K} wi,t X̃t^i
    Observe Xt, update Ai, and set zi,t = l(Xt, X̃t^i) for i = 1, . . . , K
    θt+1 = θt − zt
    ηt+1 = sqrt((1 / (2 log K)) Σ_{s=1}^{t} ‖hs − zs‖∞^2)
end for
```
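The weighting step of Algorithm 3 amounts to a softmax over optimistically shifted cumulative losses, falling back to the expert with the smallest loss estimate when the adaptive scale η is zero. A minimal sketch with hypothetical names, not the authors' code:

```python
import numpy as np

def hedge_weights(theta, h, eta):
    """One weight update of an adaptive optimistic hedge.

    theta : accumulated negative losses per expert,
    h     : optimistic loss estimates for the coming round,
    eta   : adaptive learning-rate scale (0 means fully confident).
    """
    K = len(theta)
    if eta == 0.0:
        w = np.zeros(K)
        w[int(np.argmin(h))] = 1.0    # all mass on the best estimate
        return w
    s = (theta - h) / eta
    s -= s.max()                      # shift by the max for numerical stability
    e = np.exp(s)
    return e / e.sum()

# Three experts; expert 1 has the smallest estimated loss for this round
w0 = hedge_weights(np.zeros(3), np.array([0.9, 0.1, 0.5]), 0.0)
w1 = hedge_weights(np.array([-1.0, -0.2, -0.5]), np.array([0.9, 0.1, 0.5]), 2.0)
```

With η = 0 all mass goes to expert 1; with η = 2 the weights are smoothed but still favor expert 1.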
#### **2. Related Work**

An ARIMA model can be fitted using statistical methods such as recursive least squares and maximum likelihood estimation, which are not only based on strong assumptions such as Gaussian distributed noise terms [18], linear dependencies [19], and data generated by a stationary process [20], but also require solving non-convex optimization problems [21]. Although these assumptions can be relaxed by considering non-Gaussian noise [22,23], non-stationary processes [24], or a convex relaxation [21], the pre-trained models still cannot deal with concept drift [7]. Moreover, retraining is time consuming and memory intensive, especially for large-scale datasets. The idea of applying regret minimization techniques to autoregressive moving average (ARMA) prediction was first introduced in [6]. The authors propose online algorithms incrementally producing predictions close to the values generated by the best ARMA model. This idea was extended to ARIMA(*p*, *q*, *d*) models in [7] by learning the AR(*m*) model of the higher-order differencing of the time series. Further extensions to multiple time series can be found in [8,9], while the problem of predicting time series with missing data was addressed in [25].

In order to obtain accurate predictions, the lag of the AR model and the order of differencing have to be tuned, which has been well studied in the offline setting. In some textbooks [20,26,27], Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are recommended for this task. Both require prior knowledge and strong assumptions about the variance of the noise [20], and are time and space consuming as they require numerical simulation such as cross-validation on previously collected datasets. Nevertheless, given a properly selected lag *m* and order *d*, online convex optimization techniques such as online Newton step (ONS) or online gradient descent (OGD) can be applied to fitting the model in the regret minimization framework [6–9]. However, both algorithms introduce additional hyperparameters to control the learning rate and numerical stability.

The idea of selecting hyperparameters for online time series prediction was proposed in [12,13]. Regarding the online AR predictors with different lags as experts, the authors aggregate over predictors by applying a multiplicative weights algorithm for prediction with expert advice. The proposed algorithm is not optimal for time series prediction, since the regret bound of the chosen algorithm depends on the largest loss incurred by the experts [28]. Furthermore, each individual expert still requires that the parameters are taken from a compact decision set, the diameter of which needs to be tuned in practice. A series of recent works on parameter-free online learning have provided possibilities of achieving sublinear regret without prior information on the decision set. In [14], the unconstrained online learning problem is modeled as a betting game, based on which a parameter-free algorithm is developed. The algorithm was further extended in [15], so that a better regret bound can be achieved for strongly convex loss functions. However, the coin betting algorithm requires that the gradient vectors are normalized, which is unrealistic for unbounded time series and the squared error loss. In [16,17], the authors introduced parameter-free algorithms without requiring normalized gradient vectors. Unfortunately, the regret upper bounds of the proposed algorithms depend on the norm of the gradient vectors, which could be extremely large in our setting.

The main idea of the current work is based on the combination of the adaptive FTRL framework [10] and the idea of handling relative Lipschitz continuous functions [11], which makes it possible to devise an online algorithm with a data-dependent regret upper bound. To aggregate the results, an adaptive optimistic algorithm is proposed, such that the overall regret depends on the data sequence instead of the worst-case loss.

#### **3. Preliminary and Learning Model**

Let $X_t$ denote the value observed at time $t$ of a time series. We assume that $X_t$ is taken from a finite dimensional real vector space $\mathbb{X}$ with norm $\|\cdot\|$. We denote by $\mathcal{L}(\mathbb{X}, \mathbb{X})$ the vector space of bounded linear operators from $\mathbb{X}$ to $\mathbb{X}$, and by $\|\alpha\|_{\mathrm{op}} = \sup_{x \in \mathbb{X}, x \neq 0} \frac{\|\alpha x\|}{\|x\|}$ the corresponding operator norm. An AR($p$) model is given by

$$X_t = \sum_{i=1}^p \alpha_i X_{t-i} + \epsilon_t,$$

where $\alpha_i \in \mathcal{L}(\mathbb{X}, \mathbb{X})$ is a linear operator and $\epsilon_t \in \mathbb{X}$ is an error term. The ARMA($p$, $q$) model extends the AR($p$) model by adding a moving average (MA) component as follows:

$$X_t = \sum_{i=1}^p \alpha_i X_{t-i} + \sum_{i=1}^q \beta_i \epsilon_{t-i} + \epsilon_t,$$

where $\epsilon_t \in \mathbb{X}$ is the error term and $\beta_i \in \mathcal{L}(\mathbb{X}, \mathbb{X})$. We define the $d$-th order differencing of the time series as $\nabla^d X_t = \nabla^{d-1} X_t - \nabla^{d-1} X_{t-1}$ for $d \geq 1$ and $\nabla^0 X_t = X_t$. The ARIMA($p$, $q$, $d$) model assumes that the $d$-th order differencing of the time series follows an ARMA($p$, $q$) model. In this section, this general setting suffices for introducing the learning model. In the following sections, we fix the basis of $\mathbb{X}$ to obtain implementable algorithms, for which different kinds of norms and inner products for vectors and matrices are needed. We provide a table of required notation in Appendix C.
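For a scalar series, the $d$-th order differencing defined above is simply $d$ applications of the first difference (equivalently, `numpy.diff(X, n=d)`); a small worked example:

```python
import numpy as np

def difference(X, d):
    """d-th order differencing: apply X_t - X_{t-1} d times; d = 0 is the identity."""
    dX = np.asarray(X, dtype=float)
    for _ in range(d):
        dX = dX[1:] - dX[:-1]   # one first-difference pass
    return dX

X = np.array([1.0, 4.0, 9.0, 16.0, 25.0])   # X_t = (t + 1)^2
d1 = difference(X, 1)   # first differences: [3, 5, 7, 9]
d2 = difference(X, 2)   # second differences: [2, 2, 2] -- quadratic trend removed
```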

In this paper, we consider the setting of online learning, which can be described as an iterative game between a player and an adversary. In each round $t$ of the game, the player makes a prediction $\tilde{X}_t$. Next, the adversary chooses some $X_t$ and reveals it to the player, who then suffers the loss $l(X_t, \tilde{X}_t)$ for some convex loss function $l : \mathbb{X} \times \mathbb{X} \to \mathbb{R}$. The ultimate goal is to design a strategy for the player to minimize the cumulative loss $\sum_{t=1}^T l(X_t, \tilde{X}_t)$ over $T$ rounds. For simplicity, we define

$$l_t : \mathbb{X} \to \mathbb{R}, \quad X \mapsto l(X_t, X).$$

In classical textbooks about time series analysis, the signal is assumed to be generated by a model, based on which the predictions are made. In this paper, we make no assumptions on the data generation. Therefore, minimizing the cumulative loss is generally impossible. An achievable objective is to keep a possibly small regret of not having chosen some ARIMA($p$, $q$, $d$) model to generate the prediction $\tilde{X}_t$. Formally, we denote by $\tilde{X}_t(\alpha, \beta)$ the prediction using the ARIMA($p$, $q$, $d$) model parameterized by $\alpha$ and $\beta$, given by (in this paper, we do not directly address the problem of cointegration, where the third term should be applied to a low-rank linear operator):

$$\tilde{X}_t(\alpha, \beta) = \sum_{i=1}^p \alpha_i \nabla^d X_{t-i} + \sum_{i=1}^q \beta_i \epsilon_{t-i} + \sum_{i=0}^{d-1} \nabla^i X_{t-1}. \tag{1}$$

The cumulative regret of *T* rounds is then given by

$$\mathcal{R}_T(\alpha, \beta) = \sum_{t=1}^T l_t(\tilde{X}_t) - \sum_{t=1}^T l_t(\tilde{X}_t(\alpha, \beta)).$$

The goal of this paper is to design a strategy for the player such that the cumulative regret grows sublinearly in $T$. In the ideal case, in which the data are actually generated by an ARIMA process, the prediction generated by the player yields a small loss. Otherwise, the predictions are always close to those produced by the best ARIMA model, independent of the data generation. Following the adversarial setting in [6], we allow the sequences $\{X_t\}$, $\{\epsilon_t\}$ and the parameters $\alpha$, $\beta$ to be selected by the adversary. Without any restrictions on the model, this is no different from the impossible task of minimizing the cumulative loss, since $\epsilon_t$ can always be selected such that $X_t = \tilde{X}_t(\alpha, \beta)$ holds for all $t$. Therefore, we make the following assumptions throughout this paper:

**Assumption 1.** $X_t = \epsilon_t + \tilde{X}_t(\alpha, \beta)$*, and there is some* $R > 0$ *such that* $\|\epsilon_t\| \leq R$ *for all* $t = 1, \ldots, T$.

**Assumption 2.** *The coefficients* $\beta_i$ *satisfy* $\sum_{i=1}^q \|\beta_i\|_{\mathrm{op}} \leq 1 - \epsilon$ *for some* $\epsilon > 0$.

Since we are interested in competing against predictions generated by ARIMA models, we assume that $\epsilon_t$ is selected as if $X_t$ were generated by the ARIMA process. Furthermore, we assume the norm $\|\epsilon_t\|$ is upper bounded within $T$ iterations. Assumption 2 is a sufficient condition for the MA component to be invertible, which prevents it from going to infinity as $t \to \infty$ [27].

Our work is based on the fact that we can compete against an ARIMA(*p*, *q*, *d*) model by taking predictions from an AR(*m*) model of the *d*-th order differencing for large enough *m*, which is shown in the following lemma, the proof of which can be found in Appendix A.

**Lemma 1.** *Let* $\{X_t\}$, $\{\epsilon_t\}$, $\alpha$*, and* $\beta$ *be as assumed in Assumptions 1 and 2. Then there is some* $\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})^m$ *with* $m \geq q \frac{\log T}{\log \frac{1}{1-\epsilon}} + p$ *such that*

$$\left\| \nabla^d \tilde{X}_t(\gamma) - \nabla^d \tilde{X}_t(\alpha, \beta) \right\| \leq (1 - \epsilon)^{\frac{t}{q}} R + \frac{2R}{T}$$

*holds for all* $t = 1, \ldots, T$*, where we define* $\nabla^d \tilde{X}_t(\gamma) = \sum_{i=1}^m \gamma_i \nabla^d X_{t-i}$.

As can be seen from the lemma, a prediction $\tilde{X}_t(\gamma)$ generated by the process

$$\tilde{X}_t(\gamma) = \sum_{i=1}^m \gamma_i \nabla^d X_{t-i} + \sum_{i=0}^{d-1} \nabla^i X_{t-1}$$

is close to the prediction $\tilde{X}_t(\alpha, \beta)$ generated by the ARIMA process. In the previous works [6,7], the loss function $l_t$ is assumed to be Lipschitz continuous to control the difference of loss incurred by the approximation. In general, this does not hold for the squared error. However, from Assumption 1 and Lemma 1, it follows that both $\tilde{X}_t(\alpha, \beta)$ and $\tilde{X}_t(\gamma)$ lie in a compact set around $X_t$ with a bounded diameter. Given the convexity of $l$, which is locally Lipschitz continuous in the compact convex domain, we obtain a similar property:

$$l(X_t, \tilde{X}_t(\gamma)) - l(X_t, \tilde{X}_t(\alpha, \beta)) \leq L(X_t) \left\| \nabla^d \tilde{X}_t(\gamma) - \nabla^d \tilde{X}_t(\alpha, \beta) \right\|,$$

where $L(X_t)$ is some constant depending on $X_t$. For the squared error, it is easy to verify that the Lipschitz constant depends on $\|\nabla^d X_t\|$, the boundedness of which can be reasonably assumed. To avoid extraneous details, we simply add the third assumption:

**Assumption 3.** *Define the set* $\mathbb{X}_t = \{X \in \mathbb{X} \mid \|X - X_t\| \leq 4R\}$*. There is a compact convex set* $\mathcal{X} \supseteq \bigcup_{t=1}^T \mathbb{X}_t$*, such that* $l_t$ *is* $L$*-Lipschitz continuous in* $\mathcal{X}$ *for* $t = 1, \ldots, T$.

The next corollary shows that the losses incurred by the ARIMA and its approximation are close, which allows us to take predictions from the approximation.

**Corollary 1.** *Let* $\{X_t\}$, $\{\epsilon_t\}$, $\alpha$, $\beta$*, and* $l$ *be as assumed in Assumptions 1–3. Then there is some* $\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})^m$ *with* $m \geq q \frac{\log T}{\log \frac{1}{1-\epsilon}} + p$*, such that*

$$\sum_{t=1}^T l_t(\tilde{X}_t(\gamma)) - l_t(\tilde{X}_t(\alpha, \beta)) \leq LR\left(\frac{1}{1 - (1 - \epsilon)^{\frac{1}{q}}} + 2\right)$$

*holds.*

**Proof.** It follows from Assumption 1 and Lemma 1 that $\tilde{X}_t(\gamma), \tilde{X}_t(\alpha, \beta) \in \mathcal{X}$ holds for all $t = 1, \ldots, T$. Together with Assumption 3, we obtain

$$\sum_{t=1}^T \left(l_t(\tilde{X}_t(\gamma)) - l_t(\tilde{X}_t(\alpha, \beta))\right) \leq L \sum_{t=1}^T \left\| \nabla^d \tilde{X}_t(\gamma) - \nabla^d \tilde{X}_t(\alpha, \beta) \right\|.$$

Applying Lemma 1, we obtain the claimed result.
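Concretely, the integrated AR prediction of Lemma 1 combines an AR step on the $d$-th differences with the partial sums $\sum_{i=0}^{d-1} \nabla^i X_{t-1}$ that undo the differencing. A scalar sketch with hypothetical names:

```python
import numpy as np

def integrated_ar_predict(X, gamma, d):
    """Compute an integrated AR prediction: an AR(m) step on the d-th
    differences plus the trend term summing grad^i X_{t-1} for i < d."""
    m = len(gamma)
    diffs = [np.asarray(X, dtype=float)]
    for _ in range(d):
        diffs.append(diffs[-1][1:] - diffs[-1][:-1])
    dX = diffs[-1]                                    # d-th order differences
    ar_part = sum(gamma[i] * dX[-1 - i] for i in range(m))
    trend = sum(diffs[i][-1] for i in range(d))       # undoes the differencing
    return ar_part + trend

# With d = 1 and gamma = [1], the predictor extrapolates a straight line
pred = integrated_ar_predict([1.0, 2.0, 3.0, 4.0], np.array([1.0]), d=1)  # 5.0
```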

#### **4. Algorithms and Analysis**

From Corollary 1, it follows clearly that an ARIMA(p, q, d) model can be approximated by an integrated AR model with large enough *m*. However, neither the order of differencing *d* nor the lag *m* is known. To circumvent tuning them using a previously collected dataset, we propose a framework with a two-level hierarchical construction, which is described in Algorithm 4.


#### **Algorithm 4** The two-level prediction framework.

```
Input: K instances of the slave algorithm A1, . . . , AK; an instance of the master algorithm M
for t = 1 to T do
    Get X̃t^i from each Ai
    Get wt ∈ ΔK from M            ▷ ΔK is the standard K-simplex
    Integrate the prediction: X̃t = Σ_{i=1}^{K} wt^i X̃t^i
    Observe Xt
    Define zt ∈ R^K with zi,t = lt(X̃t^i)
    Update Ai using zi,t for i = 1, . . . , K
    Update M using zt
end for
```
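The loop above can be sketched as follows, with a toy constant-prediction slave standing in for the online AR learners and plain exponential weights standing in for the master; all names are illustrative, not the authors' implementation.

```python
import numpy as np

class ConstantSlave:
    """Toy slave that always predicts a fixed value; a real slave would
    be an online AR learner updated from its per-round loss."""
    def __init__(self, value):
        self.value = value
    def predict(self):
        return self.value
    def update(self, loss):
        pass                             # no-op for the toy slave

def run_framework(series, slaves, T):
    """Minimal master/slave loop: hedge-style weights over the slaves."""
    K = len(slaves)
    theta = np.zeros(K)                  # accumulated negative losses
    preds = []
    for t in range(T):
        xs = np.array([s.predict() for s in slaves])
        e = np.exp(theta - theta.max())  # master weights on the simplex
        w = e / e.sum()
        x_hat = w @ xs                   # integrated prediction
        preds.append(x_hat)
        x_t = series[t]                  # adversary reveals X_t
        z = (xs - x_t) ** 2              # per-slave squared losses
        for s, zi in zip(slaves, z):
            s.update(zi)
        theta -= z                       # master update
    return np.array(preds)

series = np.full(50, 2.0)
slaves = [ConstantSlave(0.0), ConstantSlave(2.0), ConstantSlave(5.0)]
preds = run_framework(series, slaves, 50)
```

On this constant series the master's weight quickly concentrates on the slave whose prediction matches the data, so the integrated prediction converges to 2.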

The idea is to maintain a master algorithm $\mathcal{M}$ and a set of slave algorithms $\{\mathcal{A}_k \mid k = 1, \ldots, K\}$. At each step $t$, the master algorithm receives predictions $\tilde{X}_t^k$ from $\mathcal{A}_k$ for $k = 1, \ldots, K$. Then it comes up with a convex combination $\tilde{X}_t = \sum_{i=1}^K w_t^i \tilde{X}_t^i$ for some $w_t \in \Delta_K$ in the simplex. Next, it observes $X_t$ and computes the loss $l_t(\tilde{X}_t^k)$ for each slave $\mathcal{A}_k$, which is then used to update $\mathcal{A}_k$ and $w_{t+1}$. Let $\{\tilde{X}_t^k\}$ be the sequence generated by some slave $k$. We define the regret of not having chosen the prediction generated by slave $k$ as

$$R_T(k) = \sum_{t=1}^T l_t\left(\sum_{i=1}^K w_t^i \tilde{X}_t^i\right) - \sum_{t=1}^T l_t(\tilde{X}_t^k),$$

and the regret of slave $k$ as

$$R_T(\mathcal{A}_k) = \sum_{t=1}^T l_t(\tilde{X}_t^k) - \sum_{t=1}^T l_t(\tilde{X}_t(\gamma_k)),$$

where $\tilde{X}_t(\gamma_k)$ is the prediction generated by an integrated AR model parameterized by $\gamma_k$. Let $\mathcal{A}_k$ be some slave. Then the regret of this two-level framework can obviously be decomposed as

$$R_T(\alpha, \beta) = R_T(k) + R_T(\mathcal{A}_k) + \underbrace{\sum_{t=1}^T l_t(\tilde{X}_t(\gamma_k)) - \sum_{t=1}^T l_t(\tilde{X}_t(\alpha, \beta))}_{\text{Corollary 1}}.$$

For $\gamma_k$, $\alpha$, and $\beta$ satisfying the condition in Corollary 1 (this is not a condition for the correctness of the algorithm: with more slaves, there are more $\alpha$, $\beta$ satisfying the condition, so increasing the number of slaves increases the freedom of the model), the marked term above is upper bounded by a constant, that is,

$$\sum_{t=1}^T l_t(\tilde{X}_t(\gamma_k)) - \sum_{t=1}^T l_t(\tilde{X}_t(\alpha, \beta)) \in \mathcal{O}(1).$$

If the regrets of the master and the slaves grow sublinearly in $T$, we can achieve an overall sublinear regret upper bound, which is formally described in the following corollary.

**Corollary 2.** *Let* $\mathcal{A}_i$ *be an online learning algorithm against an* AR($m_i$) *model parameterized by* $\gamma_i$ *for* $i = 1, \ldots, K$*. For any* ARIMA *model parameterized by* $\alpha$ *and* $\beta$*, if there is a* $k \in \{1, \ldots, K\}$ *such that* $\tilde{X}_t(\gamma_k)$, $\tilde{X}_t(\alpha, \beta)$ *and* $\{X_t\}$ *satisfy Assumptions 1–3, then running Algorithm 4 with* $\mathcal{M}$ *and* $\mathcal{A}_1, \ldots, \mathcal{A}_K$ *guarantees*

$$\sum_{t=1}^{T} \left( l_t(\tilde{X}_t) - l_t(\tilde{X}_t(\alpha, \beta)) \right) \le R_T(k) + R_T(\mathcal{A}_k) + \mathcal{O}(1).$$

Next, we design and analyze parameter-free algorithms for the slaves and the master.
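The two-level protocol described above can be sketched in a few lines. This is a minimal illustration, not the paper's Algorithm 4: the slave interface (`predict`/`update`), the squared-error loss, and the `update_weights` rule are assumptions made for demonstration.

```python
def master_slave_predict(X, slaves, update_weights):
    """Two-level protocol sketch: the master keeps weights w over K slave
    forecasters, predicts their convex combination, observes X_t, then
    updates every slave and the weights from the incurred losses."""
    K = len(slaves)
    w = [1.0 / K] * K                    # w_1: uniform point in the simplex
    predictions = []
    for x_t in X:
        slave_preds = [s.predict() for s in slaves]            # X~_t^k
        x_hat = [sum(w[k] * slave_preds[k][i] for k in range(K))
                 for i in range(len(x_t))]                     # convex combination
        predictions.append(x_hat)
        # squared error as an example loss l_t(X~_t^k)
        losses = [sum((p - x) ** 2 for p, x in zip(pred, x_t)) / 2
                  for pred in slave_preds]
        for s, loss in zip(slaves, losses):
            s.update(x_t, loss)          # update each slave A_k
        w = update_weights(w, losses)    # update w_{t+1}
    return predictions
```

Here `update_weights` can be, for instance, an exponential-weights (Hedge-style) rule; the paper's master is the optimistic variant analyzed in Section 4.2.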

*4.1. Parameter-Free Online Learning Algorithms*

#### 4.1.1. Algorithms for Lipschitz Loss

Given fixed $m$ and $d$, an integrated AR($m$) model can be treated as an ordinary linear regression model. In each iteration $t$, we select $\gamma_t = (\gamma_{1,t}, \dots, \gamma_{m,t}) \in \mathcal{L}(\mathbb{X}, \mathbb{X})^m$ and make the prediction

$$\tilde{X}_t(\gamma_t) = \sum_{i=1}^m \gamma_{i,t} \nabla^d X_{t-i} + \sum_{i=0}^{d-1} \nabla^i X_{t-1}.$$

Since $l_t$ is convex, there is some subgradient $g_t \in \partial l_t(\tilde{X}_t(\gamma_t))$ such that

$$l_t(\tilde{X}_t(\gamma_t)) - l_t(\tilde{X}_t(\gamma)) \le g_t\left(\sum_{i=1}^m (\gamma_{i,t} - \gamma_i) \nabla^d X_{t-i}\right),$$

for all $\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})^m$. Define $g_{i,t} : \mathcal{L}(\mathbb{X}, \mathbb{X}) \to \mathbb{R},\; v \mapsto g_t(v \nabla^d X_{t-i})$. The regret can be further upper bounded by

$$\sum_{t=1}^{T} l_t(\tilde{X}_t(\gamma_t)) - l_t(\tilde{X}_t(\gamma)) \le \sum_{t=1}^{T} \sum_{i=1}^{m} g_{i,t}(\gamma_{i,t} - \gamma_i). \tag{2}$$

Thus, we can cast the online linear regression problem as an online linear optimization problem. Unlike previous work, we focus on the unconstrained setting, where $\gamma_t$ is not picked from a compact decision set. In this setting, we can apply an FTRL algorithm with an adaptive regularizer. To obtain an efficient implementation, we fix a basis for both $\mathbb{X}$ and $\mathbb{X}^*$. We can then assume $\mathbb{X} = \mathbb{X}^* = \mathbb{R}^n$ and work with the matrix representation of $\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})$. It is easy to verify that (2) can be rewritten as

$$\sum_{t=1}^{T} l_t(\tilde{X}_t(\gamma_t)) - l_t(\tilde{X}_t(\gamma)) \le \sum_{t=1}^{T} \sum_{i=1}^{m} \langle g_t (\nabla^d X_{t-i})^\top, \gamma_{i,t} - \gamma_i \rangle_F,$$

where $\langle A, B \rangle_F = \operatorname{tr}(A^\top B)$ is the Frobenius inner product. It is well known that the Frobenius inner product can be viewed as a dot product of vectorized matrices, with which we obtain the simple first-order algorithm described in Algorithm 1 (the computational complexity per iteration depends linearly on the dimension of the parameter, i.e., $\mathcal{O}(n^2 m)$).
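To make this reduction concrete, the following sketch (scalar case for brevity; the paper works with vector-valued series and matrix-valued parameters, and the function names here are ours) computes the $d$-th-order differences and the integrated AR($m$) prediction $\tilde{X}_t(\gamma_t) = \sum_{i=1}^m \gamma_{i,t} \nabla^d X_{t-i} + \sum_{i=0}^{d-1} \nabla^i X_{t-1}$:

```python
def differences(history, d):
    """Return [∇^0 X, ∇^1 X, ..., ∇^d X] computed from `history`."""
    diffs = [list(history)]
    for _ in range(d):
        prev = diffs[-1]
        diffs.append([b - a for a, b in zip(prev[:-1], prev[1:])])
    return diffs

def integrated_ar_predict(history, gamma, d):
    """Prediction of an integrated AR(m) model (scalar case):
    X~_t = sum_i gamma_i * ∇^d X_{t-i} + sum_{i<d} ∇^i X_{t-1}."""
    diffs = differences(history, d)
    dd = diffs[d]                                  # d-th order differences
    ar_part = sum(g * dd[-i] for i, g in enumerate(gamma, start=1))
    drift = sum(diffs[i][-1] for i in range(d))    # ∑_{i=0}^{d-1} ∇^i X_{t-1}
    return ar_part + drift
```

For a linear trend and $d = 1$, the single coefficient $\gamma = (1)$ reproduces the next value exactly, e.g. `integrated_ar_predict([1.0, 2.0, 3.0], [1.0], 1)` gives `4.0`.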

The cumulative regret of Algorithm 1 can be upper bounded using the following theorem.

**Theorem 1.** *Let $\{X_t\}$ be any sequence of vectors taken from $\mathbb{X}$. Algorithm 1 guarantees*

$$\begin{split} &\sum_{t=1}^{T} l_t(\tilde{X}_t(\gamma_t)) - l_t(\tilde{X}_t(\gamma)) \\ &\le \sum_{i=1}^{m} \left(\frac{\|\gamma_i\|_F^2 L_{T+1}}{2} + L_{T+1} + \frac{L_{T+1}^2}{L_1}\right) \sqrt{\sum_{t=1}^{T} \|\nabla^d X_{t-i}\|_2^2} \\ &\quad + \sum_{i=1}^{m} \frac{(L_{T+1} G_{i,T+1} + \|\theta_{i,1}\|_F) \|\gamma_i\|_F^2 + \|\theta_{i,1}\|_F}{2}. \end{split}$$

For an $L$-Lipschitz loss function $l_t$, for which $L_{T+1}$ is upper bounded by $L$, we obtain a sublinear regret upper bound depending on the sequence of $d$-th-order differences $\{\nabla^d X_t\}$. If $L$ is known, we can set $L_0 = L$; otherwise, picking $L_0$ arbitrarily from a reasonable range (e.g., $L_0 = 1$) does not have a devastating impact on the performance of the algorithms.

#### 4.1.2. Algorithms for Squared Errors

For the commonly used squared error given by

$$l_t(\tilde{X}_t(\gamma_t)) = \frac{1}{2} \|\tilde{X}_t(\gamma_t) - X_t\|_2^2,$$

it can be verified that *gt* can be represented as a vector

$$g_t = \sum_{i=1}^{m} \gamma_{i,t} \nabla^{d} X_{t-i} - \nabla^{d} X_t$$

for all $t$. Existing algorithms, whose regret upper bounds depend on $\|g_t\|_2$, could fail, since $\|g_t\|_2$ can be made arbitrarily large by the adversarially selected data sequence $X_1, \dots, X_t$. To design a parameter-free algorithm for the squared error, we equip FTRL with the time-varying polynomial regularizer described in Algorithm 2.

Define

$$x_t = \begin{pmatrix} \nabla^{d} X_{t-1} \\ \vdots \\ \nabla^{d} X_{t-m} \end{pmatrix}$$

and consider the matrix representation $\gamma_t = (\gamma_{1,t} \; \cdots \; \gamma_{m,t})$. Then we have $g_t = \gamma_t x_t - \nabla^d X_t$, and the upper bound on the regret can be rewritten as

$$\sum_{t=1}^{T} l_t(\tilde{X}_t(\gamma_t)) - l_t(\tilde{X}_t(\gamma)) \le \sum_{t=1}^{T} \langle (\gamma_t x_t - \nabla^d X_t) x_t^\top, \gamma_t - \gamma \rangle_F.$$

The idea of Algorithm 2 is to run the FTRL algorithm with a polynomial regularizer

$$\frac{\lambda_t}{4} \|\gamma\|_F^4 + \frac{\eta_t}{2} \|\gamma\|_F^2$$

for increasing sequences $\{\lambda_t\}$ and $\{\eta_t\}$, which leads to the update rule

$$\gamma_t = \arg\max_{\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})^m} \langle \theta_t, \gamma \rangle_F - \frac{\lambda_t}{4} \|\gamma\|_F^4 - \frac{\eta_t}{2} \|\gamma\|_F^2 = \frac{c\,\theta_t}{\|\theta_t\|_F}$$

for $c$ satisfying $\lambda_t c^3 + \eta_t c = \|\theta_t\|_F$. Since $\lambda_t \ge 0$ and $\eta_t > 0$, such a $c$ exists and has a closed-form expression (with $\theta_1 = 0$, the first update is simply $\gamma_1 = 0$). The computational complexity per iteration depends linearly on the dimension of $\mathcal{L}(\mathbb{X}, \mathbb{X})^m$. The following theorem provides a regret upper bound for Algorithm 2.
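Because $\lambda_t c^3 + \eta_t c$ is strictly increasing in $c$ for $\lambda_t \ge 0$ and $\eta_t > 0$, the scaling factor $c$ can be found by bisection (or in closed form via Cardano's formula). A sketch of one such update, with the vectorized parameter represented as a flat list (function name ours):

```python
def ftrl_poly_step(theta, lam, eta):
    """One update of FTRL with regularizer (lam/4)||γ||_F^4 + (eta/2)||γ||_F^2:
    returns γ = c·θ/||θ||_F where c ≥ 0 solves lam·c³ + eta·c = ||θ||_F."""
    r = sum(x * x for x in theta) ** 0.5           # ||θ_t||_F
    if r == 0.0:
        return [0.0] * len(theta)
    # lam·c³ + eta·c is strictly increasing, so bisection converges.
    lo, hi = 0.0, r / eta                          # at hi: lam·c³ + eta·c ≥ r
    for _ in range(100):
        mid = (lo + hi) / 2
        if lam * mid ** 3 + eta * mid < r:
            lo = mid
        else:
            hi = mid
    c = (lo + hi) / 2
    return [c * x / r for x in theta]
```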

**Theorem 2.** *Let $\{X_t\}$ be any sequence of vectors taken from $\mathbb{X}$ and*

$$l_t(\tilde{X}_t(\gamma)) = \frac{1}{2} \|X_t - \tilde{X}_t(\gamma)\|_2^2 = \frac{1}{2} \|\nabla^d X_t - \nabla^d \tilde{X}_t(\gamma)\|_2^2$$

*be the squared error. We define $x_t$ as the stacked vector of $\nabla^d X_{t-1}, \dots, \nabla^d X_{t-m}$ and $\gamma = (\gamma_1 \; \cdots \; \gamma_m)$, the matrix representation of $\gamma_1, \dots, \gamma_m \in \mathcal{L}(\mathbb{X}, \mathbb{X})$. Then, Algorithm 2 guarantees*

$$\begin{split} \sum_{t=1}^{T} \left( l_t(\tilde{X}_t(\gamma_t)) - l_t(\tilde{X}_t(\gamma)) \right) &\le \frac{(\sqrt{m}\, G_{T+1}^{2} + \|\theta_{1}\|_{F}) \|\gamma\|_{F}^{2}}{2} \\ &\quad + \|\theta_{1}\|_{F} + \left(1 + \frac{\|\gamma\|_{F}^{4}}{4}\right) \sqrt{\sum_{t=1}^{T} \|x_{t}\|_{2}^{4}} \\ &\quad + \left(1 + \frac{G_{T+1}}{G_{0}} + \frac{\|\gamma\|_{F}^{2}}{2}\right) \sqrt{\sum_{t=1}^{T} \|\nabla^{d} X_{t} x_{t}^{\top}\|_{F}^{2}} \end{split}$$

*for all $\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})^m$.*

For the squared error, Algorithm 2 does not require a compact decision set and ensures a sublinear regret bound depending on the data sequence. As with Algorithm 1, one can set $G_0$ according to prior knowledge about the bounds of the time series; alternatively, simply setting $G_0 = 1$ yields reasonable performance.

#### *4.2. Online Model Selection Using Master Algorithms*

The straightforward choice of the master algorithm would be the exponentiated gradient algorithm for prediction with expert advice. However, this algorithm requires tuning of the learning rate and losses bounded by a small quantity, which cannot be assumed in our case. The AdaHedge algorithm [29] solves these problems; however, it yields a worst-case regret bound depending on the largest observed loss, which can be much worse than a data-dependent regret bound.

Our idea is based on the adaptive optimistic follow-the-regularized-leader (AO-FTRL) framework [10]. Given a sequence of hints $\{h_t\}$ and loss vectors $\{z_t\}$, AO-FTRL guarantees a regret bound related to $\sum_{t=1}^T \|z_t - h_t\|_t^2$ for some time-varying norm $\|\cdot\|_t$. In our case, where the loss incurred by slave $k$ at iteration $t$ is $l(X_t, \tilde{X}_t^k)$, we simply choose $h_{k,t} = l(\sum_{i=0}^{d-1} \nabla^i X_{t-1}, \tilde{X}_t^k)$. If $l$ is $L$-Lipschitz in its first argument, then we have $|z_{k,t} - h_{k,t}| \le L \|\nabla^d X_t\|$, which leads to a data-dependent regret bound. The obtained algorithm is described in Algorithm 3. Its regret is upper bounded by the following theorem, the proof of which is provided in Appendix B.
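A hedge-style instantiation of this optimistic idea can be sketched as follows. This illustrates the principle — weights computed from cumulative losses plus a hint anticipating the next loss — and is not the paper's exact Algorithm 3; the function name and the fixed learning rate `eta` are assumptions.

```python
import math

def ao_hedge_weights(cum_losses, hints, eta):
    """Optimistic-hedge-style master step: each slave is weighted
    exponentially by its cumulative loss plus a hint for the next loss."""
    scores = [L + h for L, h in zip(cum_losses, hints)]
    m = min(scores)                      # subtract min for numerical stability
    w = [math.exp(-eta * (s - m)) for s in scores]
    z = sum(w)
    return [wi / z for wi in w]
```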

**Theorem 3.** *Let $\{\tilde{X}_t\}$, $\{\tilde{X}_t^k\}$, $\{z_t\}$, $\{h_t\}$, and $\{w_t\}$ be as generated in Algorithm 3. Assume $l$ is $L$-Lipschitz in its first argument and convex in its second argument. Then for any sequence $\{X_t\}$ and slave algorithm $\mathcal{A}_k$, we have*

$$R_T(k) \le \left(\sqrt{2\log K} + \sqrt{\frac{8}{\log K}}\right) \sqrt{\sum_{t=1}^T L^2 \|\nabla^d X_t\|_2^2}.$$

By Corollary 2, combining Algorithm 3 with Algorithm 1 or 2 guarantees a data-dependent regret upper bound sublinear in $T$. Note that Algorithm 3 takes an input parameter $d$, which can be adjusted according to prior knowledge of the dataset such that $\|\nabla^d X_t\|_2^2$ is bounded by a small quantity. If no prior knowledge is available, we can set $d$ to the maximal order of differencing used in the slave algorithms. Arguably, Lipschitz continuity is not a reasonable assumption for the squared error on an unbounded domain. With bounded $\|\nabla^d X_t\|_2^2$, however, we can assume that the loss function is locally Lipschitz, with a Lipschitz constant depending on the prediction. In the next section, we show the performance of Algorithm 3 in combination with Algorithms 1 and 2 in different experimental settings.

#### **5. Experiments and Results**

In this section, we carry out experiments on both synthetic and real-world data to show that the proposed algorithms can generate promising predictions without tuning hyperparameters.

#### *5.1. Experiment Settings*

The synthetic data were generated randomly. We run 20 trials for each synthetic experiment and average the results. For numerical stability, we scale the real-world data down so that the values lie between 0 and 10. Note that the range of the data is neither assumed nor used by the algorithms.

#### Setting 1: Sanity Check

For a sanity check, we generate a stationary 10-dimensional ARIMA(5, 2, 1) process using randomly drawn coefficients.

#### Setting 2: Time-Varying Parameters

Aimed at demonstrating the effectiveness of the proposed algorithm in the non-stationary case, we generate a non-stationary 10-dimensional ARIMA(5, 2, 1) process using time-varying parameters. We draw $\alpha_1$, $\alpha_2$ and $\beta_1$, $\beta_2$ randomly and independently, and generate data at iteration $t$ with the ARIMA(5, 2, 1) model parameterized by $\alpha_t = \frac{t}{10^4} \alpha_1 + (1 - \frac{t}{10^4}) \alpha_2$ and $\beta_t = \frac{t}{10^4} \beta_1 + (1 - \frac{t}{10^4}) \beta_2$.

#### Setting 3: Time-Varying Models

To obtain more adversarially selected time series values, we generate the first half of the values using a stationary 10-dimensional ARIMA(5, 2, 1) model and the second half using a stationary 10-dimensional ARIMA(5, 2, 0) model. The model parameters are drawn randomly.
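The synthetic processes of Settings 1–3 can be generated along the following lines; this sketch is scalar for brevity (the experiments use 10-dimensional processes) and the helper name is ours.

```python
import random

def generate_arima(alpha, beta, d, T, noise=0.1, seed=0):
    """Generate a scalar ARIMA(p, d, q) series: the d-th differences follow
    ∇^d X_t = Σ α_i ∇^d X_{t-i} + Σ β_i ε_{t-i} + ε_t, then integrate d times."""
    rng = random.Random(seed)
    p, q = len(alpha), len(beta)
    diffs, eps = [0.0] * max(p, q), [0.0] * max(p, q)   # zero pre-history
    for _ in range(T):
        e = rng.gauss(0.0, noise)
        x = sum(a * diffs[-1 - i] for i, a in enumerate(alpha)) \
            + sum(b * eps[-1 - i] for i, b in enumerate(beta)) + e
        diffs.append(x)
        eps.append(e)
    series = diffs[max(p, q):]                          # drop the pre-history
    for _ in range(d):                                  # integrate d times
        acc, out = 0.0, []
        for v in series:
            acc += v
            out.append(acc)
        series = out
    return series
```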

#### Stock Data: Time Series with Trend

Following the experiments in [8], we collect the daily stock prices of seven technology companies together with the S&P 500 index from Yahoo Finance, covering more than twenty years; these series have an obvious increasing trend and are believed to exhibit integration.

#### Google Flu Data: Time Series with Seasonality

We collect estimates of influenza activity in northern-hemisphere countries, which exhibit an obvious seasonal pattern. In this experiment, we examine the performance of the algorithms in handling regular and predictable changes that occur over a fixed period.

#### Electricity Demand: Trend and Seasonality

In this setting, we collect monthly load, gross electricity production, net electricity consumption, and gross demand in Turkey from 1976 to 2010. The dataset contains both trend and seasonality.

#### *5.2. Experiments for the Slave Algorithms*

We first fix $d = 1$ and $m = 16$ and compare our slave algorithms with ONS and OGD from [9] for the squared error $l_t(\tilde{X}_t) = \frac{1}{2}\|X_t - \tilde{X}_t\|_2^2$ and the Euclidean distance $l_t(\tilde{X}_t) = \|X_t - \tilde{X}_t\|_2$. ONS and OGD stack and vectorize the parameter matrices and incrementally update the vectorized parameter using the following rules, respectively:

$$w_{t+1} = \Pi_{\mathcal{W}}\left(w_t - \eta \left(\sum_{s=1}^t g_s g_s^\top + \lambda I\right)^{-1} g_t\right)$$

and

$$w_{t+1} = \Pi_{\mathcal{W}}(w_t - \eta g_t),$$

where $g_t$ is the vectorized gradient at step $t$, $\mathcal{W}$ is the decision set satisfying $\sup_{u \in \mathcal{W}} \|u\|_2 \le c$, and the operator $\Pi_{\mathcal{W}}(v)$ projects $v$ onto $\mathcal{W}$. We select a list of candidate values for each hyperparameter, evaluate their performance on the whole dataset, and select the best-performing configuration for comparison. Since the synthetic data are generated randomly, we average the results over 20 trials for stability. The corresponding results are shown in Figures 1–6 (to amplify the differences between the algorithms, we use log plots for the $y$-axis in all settings; for the synthetic datasets, we also use a log plot for the $x$-axis, so that the behavior of the algorithms in the first 1000 steps can be better observed). To show the impact of the hyperparameters on the performance of the baseline algorithms, we also plot their performance using sub-optimal configurations. Note that since the error term $\varepsilon_t$ cannot be predicted, an ideal predictor would suffer an average error of at least $\|\varepsilon_t\|_2^2$ and $\|\varepsilon_t\|_2$ for the two kinds of loss function. This quantity is known for the synthetic datasets and is plotted in the figures.
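When $\mathcal{W}$ is the Euclidean ball of radius $c$, the projection $\Pi_{\mathcal{W}}$ in the OGD baseline reduces to rescaling, so one OGD step can be sketched as (function name ours):

```python
def ogd_step(w, g, eta, c):
    """OGD baseline update w_{t+1} = Π_W(w_t − η g_t), with W the Euclidean
    ball of radius c, so the projection is a simple rescaling."""
    w_new = [wi - eta * gi for wi, gi in zip(w, g)]
    norm = sum(x * x for x in w_new) ** 0.5
    if norm > c:                          # project onto ||u||_2 ≤ c
        w_new = [x * c / norm for x in w_new]
    return w_new
```

This per-step rule still leaves the learning rate $\eta$ and radius $c$ to be tuned, which is exactly the hyperparameter sensitivity the proposed parameter-free algorithms avoid.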

In all settings, both AdaFTRL and AdaFTRL-Poly perform on par with well-tuned OGD and ONS, which can perform extremely badly under sub-optimal hyperparameter configurations. In the experiments using synthetic datasets, AdaFTRL suffers large losses at the beginning while generating accurate predictions after 1000 iterations. The relative performance of the proposed algorithms after the first 1000 iterations, compared to the best-tuned baseline algorithms, is plotted in Appendix D. AdaFTRL-Poly has more stable performance compared to AdaFTRL. In the experiment with the Google Flu data, all algorithms suffer huge losses around iteration 300 due to an abrupt change in the dataset. OGD and ONS with sub-optimal hyperparameter configurations, despite good performance for the first half of the data, generate very inaccurate predictions after the abrupt change in the dataset. This could lead to a catastrophic failure in practice, when certain patterns do not appear in the dataset collected for hyperparameter tuning. Our algorithms are more robust against this change and perform similarly to OGD and ONS with optimal hyperparameter configurations.

**Figure 1.** Results for setting 1 (sanity check), using a stationary ARIMA(5,2,1) model.

**Figure 2.** Results for setting 2 (time-varying parameters), using a non-stationary ARIMA(5,2,1) model.

**Figure 3.** Results for setting 3 (time-varying models), using a combination of stationary ARIMA(5,2,1) and ARIMA(5,2,0) models.

**Figure 6.** Results for electricity demand data.

#### *5.3. Experiments for Online Model Selection*

The performance of the two-level framework and Algorithm 3 for online model selection is demonstrated in Figures 7–12. We simultaneously maintain 96 AR($m$) models of $d$-th-order differencing for $m = 1, \dots, 32$ and $d = 0, 1, 2$, which are updated by Algorithms 1 and 2 for the squared error and the Euclidean distance, respectively. The predictions generated by the AR models are aggregated using Algorithm 3 and the aggregation algorithm (AA) introduced in [13] with the learning rate set to $\sqrt{T}$. We compare the average losses incurred by the aggregated predictions with those incurred by the best AR model. To show the impact of $m$ and $d$, we also plot the average loss of some other sub-optimal AR models.

In all settings, AO-Hedge outperforms AA, although the differences are very slight in some of the experiments. We would like to stress again that the choice of the hyperparameters has a great impact on the performance of the AR model. In settings 1–3, the AR model with 0-th-order differencing has the best performance, although the data are generated using $d = 1$, which suggests that prior knowledge about the data generation may not be helpful for model selection in all cases. The experimental results also show that AO-Hedge has performance similar to the best AR model.

**Figure 11.** Model selection for Google Flu.

**Figure 12.** Model selection for electricity demand.

#### **6. Conclusions**

We proposed algorithms for fitting ARIMA models in an online manner without requiring prior knowledge or tuning hyperparameters. We showed that the cumulative regret of our method grows sublinearly with the number of iterations and depends on the values of the time series. The comparison study on both synthetic and real-world datasets suggests that the proposed algorithms have a performance on par with the well-tuned state-of-the-art algorithms.

There are still several remaining issues that we want to address in future research. Firstly, it would be interesting to also develop a parameter-free algorithm for the cointegrated vector ARMA model. Secondly, we believe that the strong assumption on the *β* coefficient can be relaxed for multi-dimensional time series by generalizing Lemma 2 in [7]. Furthermore, we are also interested in applying online learning to other time series models such as the (generalized) ARCH model [30]. Finally, the proposed algorithms need to be empirically analyzed using more real-world datasets and loss functions, and compared with more recent predictive models such as recurrent neural networks and the models combining neural networks and ARIMA models [31].

**Author Contributions:** Conceptualization, W.S.; methodology, W.S. and L.F.R.; validation, W.S., L.F.R., and F.S.; formal analysis, W.S.; investigation, W.S. and L.F.R.; writing—original draft preparation, W.S. and L.F.R.; writing—review and editing, W.S., L.F.R., F.S., and S.A.; visualization, L.F.R.; supervision, F.S. and S.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** We acknowledge support by the German Research Foundation and the Open Access Publication Fund of TU Berlin.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The source code for generating the synthetic data set, the implementation of the algorithms, and detailed information about our experiments are available on GitHub: https://github.com/OnlinePredictorTS/AOLForTimeSeries (accessed on March 2021). The stock data are collected from https://finance.yahoo.com/ (accessed on March 2021). The Google Flu data are available at https://github.com/datalit/googleflutrends/ (accessed on March 2021). Detailed information about the electricity demand data can be found in [32].

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Appendix A**

We prove Lemma 1 in this section. Consider the ARIMA model given by

$$\nabla^d X\_t(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum\_{i=1}^p \alpha\_i \nabla^d X\_{t-i} + \sum\_{i=1}^q \beta\_i \boldsymbol{\varepsilon}\_{t-i} + \boldsymbol{\varepsilon}\_t$$

with $\nabla^d X_t(\alpha, \beta) = \nabla^d X_t$ for $t \le 0$. Let

$$X\_t(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \nabla^d X\_t(\boldsymbol{\alpha}, \boldsymbol{\beta}) + \sum\_{i=0}^{d-1} \nabla^i X\_{t-1}$$

be the *t*-th value generated by the ARIMA process. To prove Lemma 1, we generalize the proof provided in [6]. To remove the MA component, we first recursively define a growing process of the *d*-th-order differencing

$$\nabla^{d}X\_{t}^{\infty}(\boldsymbol{\alpha},\boldsymbol{\beta}) = \sum\_{i=1}^{p} \boldsymbol{\alpha}\_{i} \nabla^{d}X\_{t-i} + \sum\_{i=1}^{q} \beta\_{i} (\nabla^{d}X\_{t-i} - \nabla^{d}X\_{t-i}^{\infty}(\boldsymbol{\alpha},\boldsymbol{\beta})),$$

with $\nabla^d X_t^{\infty}(\alpha, \beta) = \nabla^d X_t$ for $t \le 0$. Let

$$X\_t^{\infty}(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \nabla^d X\_t^{\infty}(\boldsymbol{\alpha}, \boldsymbol{\beta}) + \sum\_{i=0}^{d-1} \nabla^i X\_{t-1}$$

be the *t*-th value generated by this process.

The next lemma shows that it approximates an ARIMA(*p*, *q*, *d*) process.

**Lemma A1.** *For any $\alpha$, $\beta$, and $\{\varepsilon_t\}$ satisfying Assumptions 1 and 2, we have, for $t = 1, \dots, T$,*

$$\|X_t^{\infty}(\alpha,\beta) - \tilde{X}_t(\alpha,\beta)\| \le (1 - \epsilon)^{\frac{t}{q}} R.$$

**Proof.** First of all, we have

$$\begin{aligned} X_t^{\infty}(\alpha, \beta) - \tilde{X}_t(\alpha, \beta) &= \nabla^d X_t^{\infty}(\alpha, \beta) - \nabla^d \tilde{X}_t(\alpha, \beta) \\ &= \sum_{i=1}^q \beta_i (\nabla^d X_{t-i} - \nabla^d X_{t-i}^{\infty}(\alpha, \beta) - \varepsilon_{t-i}) \end{aligned}$$

for $t \ge 0$. Define $Y_t = \nabla^d X_t - \nabla^d X_t^{\infty}(\alpha, \beta) - \varepsilon_t$. W.l.o.g. we can assume $\|\varepsilon_t\| \le R$ for $t \le 0$. Next, we prove by induction on $t$ that $\|Y_\tau\| \le (1 - \epsilon)^{\frac{\tau}{q}} R$ holds for all $\tau \le t$. For the induction basis, we have

$$\|Y_{\tau}\| = \|{-\varepsilon_{\tau}}\| \le R$$

for all $\tau \le 0$. Assume the claim holds for some $t$; then we have

$$\begin{split} \|Y_{t+1}\| &= \|\nabla^{d}X_{t+1} - \nabla^{d}X_{t+1}^{\infty}(\alpha,\beta) - \varepsilon_{t+1}\| \\ &\le \Big\|\nabla^{d}X_{t+1} - \sum_{i=1}^{p}\alpha_{i}\nabla^{d}X_{t+1-i} - \sum_{i=1}^{q}\beta_{i}\varepsilon_{t+1-i} - \varepsilon_{t+1}\Big\| + \Big\|\sum_{i=1}^{q}\beta_{i}Y_{t+1-i}\Big\| \\ &\le \sum_{i=1}^{q} \|Y_{t+1-i}\| \|\beta_{i}\|_{\mathrm{op}} \\ &\le (1-\epsilon)^{\frac{t+1-q}{q}}R\sum_{i=1}^{q} \|\beta_{i}\|_{\mathrm{op}} \\ &\le (1-\epsilon)^{\frac{t+1}{q}}R, \end{split}$$

which concludes the induction. Finally, we have

$$\begin{aligned} \|X_t^{\infty}(\alpha,\beta) - \tilde{X}_t(\alpha,\beta)\| &= \Big\|\sum_{i=1}^q \beta_i(\nabla^d X_{t-i} - \nabla^d X_{t-i}^{\infty}(\alpha,\beta) - \varepsilon_{t-i})\Big\| \\ &\le \sum_{i=1}^q \|\beta_i\|_{\mathrm{op}} \|Y_{t-i}\| \\ &\le (1-\epsilon)(1-\epsilon)^{\frac{t-q}{q}}R \\ &= (1-\epsilon)^{\frac{t}{q}}R, \end{aligned}$$

which is the claimed result.

Next, we recursively define the following process:

$$\nabla^{d}X_{t}^{m}(\alpha,\beta) = \sum_{i=1}^{p} \alpha_{i} \nabla^{d}X_{t-i} + \sum_{i=1}^{q} \beta_{i} (\nabla^{d}X_{t-i} - \nabla^{d}X_{t-i}^{m-i}(\alpha,\beta)), \tag{A1}$$

where $\nabla^d X_t^m(\alpha, \beta) = \nabla^d X_t$ for $m \le 0$. Let $\{X_t^m(\alpha, \beta)\}$ be the sequence generated as follows:

$$X\_t^{\mathfrak{m}}(\mathfrak{a}, \boldsymbol{\beta}) = \nabla^d X\_t^{\mathfrak{m}}(\mathfrak{a}, \boldsymbol{\beta}) + \sum\_{i=0}^{d-1} \nabla^i X\_{t-1}. \tag{A2}$$

We show in the next lemma that it is close to $\{X_t^{\infty}(\alpha, \beta)\}$.

**Lemma A2.** *For any $\alpha$, $\beta$, $\{l_t\}$, and $\{\varepsilon_t\}$ satisfying Assumptions 1 and 2, we have*

$$\|X_t^{m}(\alpha,\beta) - X_t^{\infty}(\alpha,\beta)\| \le \frac{2R}{T}$$

*for $m \ge \frac{q \log T}{\log \frac{1}{1-\epsilon}}$.*

**Proof.** Define $Z_t^m = \nabla^d X_t^m(\alpha, \beta) - \nabla^d X_t^{\infty}(\alpha, \beta)$. We prove by induction on $m$ that

$$\|Z_t^{\tilde{m}}\| \le (1 - \epsilon)^{\frac{\tilde{m}}{q}} 2R$$

holds for all $t = 1, \dots, T$ and $0 \le \tilde{m} \le m$. For $\tilde{m} = 0$, we have, for $t = 1, \dots, T$,

$$\begin{aligned} \|Z_t^0\| &= \|\nabla^d X_t^0(\alpha, \beta) - \nabla^d X_t^\infty(\alpha, \beta)\| \\ &= \|\nabla^d X_t - \nabla^d X_t^\infty(\alpha, \beta)\|. \end{aligned}$$

By the definition of the process $\{\nabla^d X_t^{\infty}(\alpha, \beta)\}$, we have

$$\begin{split} & -\nabla^{d}X_{t} + \nabla^{d}X_{t}^{\infty}(\alpha, \beta) \\ &= -\nabla^{d}X_{t} + \sum_{i=1}^{p} \alpha_{i}\nabla^{d}X_{t-i} + \sum_{i=1}^{q} \beta_{i}(\nabla^{d}X_{t-i} - \nabla^{d}X_{t-i}^{\infty}(\alpha, \beta)) \\ &= -\nabla^{d}X_{t} + \sum_{i=1}^{p} \alpha_{i}\nabla^{d}X_{t-i} + \sum_{i=1}^{q} \beta_{i}\varepsilon_{t-i} + \sum_{i=1}^{q} \beta_{i}(\nabla^{d}X_{t-i} - \nabla^{d}X_{t-i}^{\infty}(\alpha, \beta) - \varepsilon_{t-i}) \\ &= \nabla^{d}\tilde{X}_{t}(\alpha, \beta) - \nabla^{d}X_{t} + \sum_{i=1}^{q} \beta_{i}(\nabla^{d}X_{t-i} - \nabla^{d}X_{t-i}^{\infty}(\alpha, \beta) - \varepsilon_{t-i}) \\ &= \nabla^{d}\tilde{X}_{t}(\alpha, \beta) - \nabla^{d}X_{t} + \sum_{i=1}^{q} \beta_{i}Y_{t-i}, \end{split}$$

where $Y_{t-i}$ is defined as in the proof of Lemma A1. From the assumption, we have $\|\nabla^d \tilde{X}_t(\alpha, \beta) - \nabla^d X_t\| = \|\varepsilon_t\| \le R$, and, as proved in Lemma A1, $\|Y_t\| \le R$ holds. Therefore, we obtain $\|Z_t^0\| \le 2R$, which is the induction basis. Next, assume the claim holds for all $0, \dots, m - 1$. Then we have

$$\begin{split} \|Z_t^m\| &= \Big\|\sum_{i=1}^q \beta_i (\nabla^d X_{t-i} - \nabla^d X_{t-i}^{m-i}(\alpha, \beta) - \nabla^d X_{t-i} + \nabla^d X_{t-i}^{\infty}(\alpha, \beta))\Big\| \\ &= \Big\|\sum_{i=1}^q \beta_i (\nabla^d X_{t-i}^{\infty}(\alpha, \beta) - \nabla^d X_{t-i}^{m-i}(\alpha, \beta))\Big\| \\ &\le \sum_{i=1}^m \|\beta_i (\nabla^d X_{t-i}^{\infty}(\alpha, \beta) - \nabla^d X_{t-i}^{m-i}(\alpha, \beta))\| \\ &\quad + \sum_{i=m+1}^q \|\beta_i (\nabla^d X_{t-i}^{\infty}(\alpha, \beta) - \nabla^d X_{t-i})\| \end{split}$$

From the induction hypothesis, we have

$$\|\nabla^{d}X_{t-i}^{\infty}(\alpha,\beta)-\nabla^{d}X_{t-i}^{m-i}(\alpha,\beta)\| \le (1-\epsilon)^{\frac{m-i}{q}} 2R.$$

From the proof of the induction basis, we have

$$\sum_{i=m+1}^{q} \|\beta_i(\nabla^d X_{t-i}^{\infty}(\alpha, \beta) - \nabla^d X_{t-i})\| \le 2R \sum_{i=m+1}^{q} \|\beta_i\|_{\mathrm{op}}.$$

Therefore, $\|Z_t^m\|$ can be further bounded as

$$\begin{split} \|Z_{t}^{m}\| &\le 2R \sum_{i=1}^{m} \|\beta_{i}\|_{\mathrm{op}} (1-\epsilon)^{\frac{m-i}{q}} + 2R \sum_{i=m+1}^{q} \|\beta_{i}\|_{\mathrm{op}} \\ &\le 2R \sum_{i=1}^{m} \|\beta_{i}\|_{\mathrm{op}} (1-\epsilon)^{\frac{m-i}{q}} + 2R \sum_{i=m+1}^{q} \|\beta_{i}\|_{\mathrm{op}} (1-\epsilon)^{\frac{m-i}{q}} \\ &\le (1-\epsilon)^{\frac{m-q}{q}} 2R \sum_{i=1}^{q} \|\beta_{i}\|_{\mathrm{op}} \\ &\le (1-\epsilon)^{\frac{m}{q}} 2R. \end{split}$$

Choosing $m \ge \frac{q \log T}{\log \frac{1}{1-\epsilon}} = q \log_{1-\epsilon}(T^{-1})$, we have

$$\|X_t^m(\alpha, \beta) - X_t^\infty(\alpha, \beta)\| \le \frac{2R}{T},$$

which is the claimed result.
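As a quick numerical sanity check of this choice, the helper below (name ours) computes the smallest integer $m$ with $m \ge q \log T / \log\frac{1}{1-\epsilon}$, and one can verify that it indeed yields $(1-\epsilon)^{m/q} \le 1/T$:

```python
import math

def min_lag_order(q, T, eps):
    """Smallest integer m with m >= q*log(T)/log(1/(1-eps)), the choice in
    Lemma A2 that makes (1-eps)^(m/q) <= 1/T."""
    return math.ceil(q * math.log(T) / math.log(1.0 / (1.0 - eps)))
```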

This process of the $d$-th-order differencing is actually an integrated AR($m + p$) process of order $d$, as shown in the following lemma.

**Lemma A3.** *For any data sequence $\{X_t^m(\alpha, \beta)\}$ generated by a process of the $d$-th-order differencing given by (A1) and (A2), there is a $\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})^{m+p}$ such that*

$$\sum_{i=1}^{m+p} \gamma_i \nabla^d X_{t-i} + \sum_{i=0}^{d-1} \nabla^i X_{t-1} = X_t^m(\alpha, \beta)$$

*holds for all t.*

**Proof.** Let $\{\nabla^d X_t^m(\alpha, \beta)\}$ be the sequence generated by (A1). We prove by induction on $m$ that for all $\tilde{m} \le m$ there is a $\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})^{\tilde{m}+p}$ such that

$$\nabla^{d}X_{t}^{\tilde{m}}(\alpha,\beta) = \sum_{i=1}^{\tilde{m}+p} \gamma_{i}\nabla^{d}X_{t-i}$$

holds for all *α* and *β*. The induction basis follows directly from the definition that

$$\nabla^d X\_t^0(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum\_{i=1}^p \alpha\_i \nabla^d X\_{t-i}.$$

Assume that the claim holds for some $m$. Let $\alpha_i$ be the zero linear functional for $i > p$ and $\beta_i$ be the zero linear functional for $i > q$. Then we have

$$\begin{split} &\nabla^{d}X_{t}^{m+1}(\alpha,\beta) \\ &= \sum_{i=1}^{p}\alpha_{i}\nabla^{d}X_{t-i}+\sum_{i=1}^{q}\beta_{i}(\nabla^{d}X_{t-i}-\nabla^{d}X_{t-i}^{m+1-i}(\alpha,\beta)) \\ &= \sum_{i=1}^{p}\alpha_{i}\nabla^{d}X_{t-i}+\sum_{i=1}^{m+1}\beta_{i}\nabla^{d}X_{t-i}-\sum_{i=1}^{m+1}\beta_{i}\nabla^{d}X_{t-i}^{m+1-i}(\alpha,\beta) \\ &= \sum_{i=1}^{p}\alpha_{i}\nabla^{d}X_{t-i}+\sum_{i=1}^{m+1}\beta_{i}\nabla^{d}X_{t-i}-\sum_{i=1}^{m+1}\beta_{i}\sum_{j=1}^{m+1-i+p}\gamma_{j}^{m+1-i}\nabla^{d}X_{t-i-j} \\ &= \sum_{i=1}^{p}\alpha_{i}\nabla^{d}X_{t-i}+\sum_{i=1}^{m+1}\beta_{i}\nabla^{d}X_{t-i}-\sum_{i=1}^{m+p+1}\Big(\sum_{j=1}^{m+1}\beta_{j}\gamma_{i-j}^{m+1-j}\Big)\nabla^{d}X_{t-i}, \end{split}$$

where the second equality follows from the fact that $\beta_i(\nabla^d X_{t-i} - \nabla^d X_{t-i}^{m+1-i}(\alpha, \beta)) = 0$ for $i > m + 1$, the third equality uses the induction hypothesis, and the last line is obtained by rearranging, using the convention that $\sum_{i=m}^{n} a_i = 0$ for $m > n$. The induction step is obtained by setting

$$\gamma_i^{m+1} = \alpha_i + \beta_i - \sum_{j=1}^{m+1} \beta_j \gamma_{i-j}^{m+1-j}$$

for *i* = 1, . . . , *m* + *p* + 1, and the claimed result follows.

Finally, we prove Lemma 1 by combining the results.

**Proof of Lemma 1.** From Lemmas A1–A3, there is some $\gamma \in \mathcal{L}(\mathbb{X}, \mathbb{X})^m$ with $m \ge \frac{q \log T}{\log \frac{1}{1-\epsilon}} + p$ such that

$$\begin{aligned} \|\nabla^d \tilde{X}_t(\gamma) - \nabla^d \tilde{X}_t(\alpha, \beta)\| &= \|\nabla^d X_t^{m}(\alpha, \beta) - \nabla^d \tilde{X}_t(\alpha, \beta)\| \\ &\le \|\nabla^d X_t^{m}(\alpha, \beta) - \nabla^d X_t^{\infty}(\alpha, \beta)\| + \|\nabla^d X_t^{\infty}(\alpha, \beta) - \nabla^d \tilde{X}_t(\alpha, \beta)\| \\ &\le (1 - \epsilon)^{\frac{t}{q}} R + \frac{2R}{T}, \end{aligned}$$

which is the claimed result.

#### **Appendix B**

In this section, we prove the theorems in Section 4. The required notation is summarized in Appendix C. We apply some important properties of convex functions and their convex conjugates defined on a general vector space, which can be found in [17]. The proposed algorithms are instances of adaptive optimistic follow-the-regularized-leader (AO-FTRL) [10], which is described in Algorithm A1.


**Algorithm A1** AO-FTRL

Input: closed convex set $\mathcal{W} \subseteq \mathcal{X}$
Initialize: $\theta\_1$ arbitrary
**for** $t = 1$ to $T$ **do**
&nbsp;&nbsp;Get hint $h\_t$
&nbsp;&nbsp;$w\_t = \nabla\psi\_t^\*(\theta\_t - h\_t)$
&nbsp;&nbsp;Observe $g\_t \in \mathcal{X}^\*$
&nbsp;&nbsp;$\theta\_{t+1} = \theta\_t - g\_t$
**end for**
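As a concrete illustration, the sketch below instantiates AO-FTRL in the unconstrained case $\mathcal{W} = \mathbb{R}^n$ with a fixed quadratic regularizer $\psi\_t(w) = \frac{\eta}{2}\|w\|\_2^2$, for which $\nabla\psi\_t^\*(\theta) = \theta/\eta$, and with the previously observed gradient as the optimistic hint. The regularizer choice, the hint rule, and all names here are our own illustrative assumptions, not the paper's exact instantiation.

```python
import numpy as np

def ao_ftrl(grad_fn, dim, T, eta=2.0):
    """Sketch of AO-FTRL (Algorithm A1) with psi_t(w) = (eta/2)*||w||^2,
    so that w_t = grad psi_t^*(theta_t - h_t) = (theta_t - h_t) / eta.
    grad_fn maps the iterate w_t to the observed gradient g_t."""
    theta = np.zeros(dim)   # theta_1
    hint = np.zeros(dim)    # hint h_1: no gradient observed yet
    iterates = []
    for _ in range(T):
        w = (theta - hint) / eta   # prediction step w_t
        iterates.append(w)
        g = grad_fn(w)             # observe g_t
        theta = theta - g          # theta_{t+1} = theta_t - g_t
        hint = g                   # optimistic hint: gradients change slowly
    return iterates
```

On a quadratic loss $l\_t(w) = \frac{1}{2}\|w - c\|\_2^2$ the iterates settle at the minimizer $c$: once the hint matches the incoming gradient, the prediction step anticipates it exactly.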

**Lemma A4.** *Suppose AO-FTRL is run with closed convex regularizers $\psi\_1, \dots, \psi\_T$ defined on $\mathcal{W} \subseteq \mathcal{X}$ satisfying $\psi\_t(w) \le \psi\_{t+1}(w)$ for all $w \in \mathcal{W}$ and $t = 1, \dots, T$. Then, for all $u \in \mathcal{W}$, we have*

$$\sum\_{t=1}^{T} g\_t(w\_t - u) \le \psi\_{T+1}(u) + \psi\_1^\*(\theta\_1) + \sum\_{t=1}^{T} \mathcal{B}\_{\psi\_t^\*}(\theta\_{t+1}, \theta\_t - h\_t),$$

*where* $\mathcal{B}\_{\psi\_t^\*}(\theta\_{t+1}, \theta\_t - h\_t)$ *is the Bregman divergence associated with* $\psi\_t^\*$.
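For completeness, recall that for a differentiable convex function $\phi$ the Bregman divergence is

$$\mathcal{B}\_{\phi}(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle,$$

i.e., the gap between $\phi$ at $x$ and its first-order approximation around $y$.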

**Proof.** W.l.o.g. we assume *hT*+<sup>1</sup> = 0, since it is not involved in the algorithm. Then we have

$$\begin{split} &\sum\_{t=1}^{T} \left( \psi\_{t+1}^{\*} (\theta\_{t+1} - h\_{t+1}) - \psi\_{t}^{\*} (\theta\_{t} - h\_{t}) \right) \\ &= \psi\_{T+1}^{\*} (\theta\_{T+1} - h\_{T+1}) - (\theta\_{1} - h\_{1}) w\_{1} + \psi\_{1} (w\_{1}) \\ &\geq (\theta\_{T+1} - h\_{T+1}) u - \psi\_{T+1} (u) + h\_{1} w\_{1} - \theta\_{1} w\_{1} + \psi\_{1} (w\_{1}) \\ &\geq \theta\_{T+1} u - \psi\_{T+1} (u) + h\_{1} w\_{1} - \sup\_{w \in \mathcal{W}} \left( \theta\_{1} w - \psi\_{1} (w) \right) \\ &= - \sum\_{t=1}^{T} g\_{t} u - \psi\_{T+1} (u) + h\_{1} w\_{1} - \psi\_{1}^{\*} (\theta\_{1}). \end{split}$$

Furthermore, we have

$$\begin{aligned} &\psi\_{t+1}^\*(\theta\_{t+1} - h\_{t+1}) - \psi\_t^\*(\theta\_t - h\_t) \\ &= \psi\_{t+1}^\*(\theta\_{t+1} - h\_{t+1}) - \psi\_t^\*(\theta\_{t+1}) + \psi\_t^\*(\theta\_{t+1}) - \psi\_t^\*(\theta\_t - h\_t) \\ &\leq (\theta\_{t+1} - h\_{t+1})w\_{t+1} - \psi\_{t+1}(w\_{t+1}) - \theta\_{t+1}w\_{t+1} + \psi\_t(w\_{t+1}) + \psi\_t^\*(\theta\_{t+1}) - \psi\_t^\*(\theta\_t - h\_t) \\ &\leq \psi\_t^\*(\theta\_{t+1}) - \psi\_t^\*(\theta\_t - h\_t) - h\_{t+1}w\_{t+1} \end{aligned}$$

Combining the inequalities above, rearranging, and adding $\sum\_{t=1}^{T} g\_t w\_t$ to both sides, we obtain

$$\begin{aligned} &\sum\_{t=1}^{T} g\_{t}(w\_{t} - u) \\ &\leq \psi\_{T+1}(u) + \psi\_{1}^{\*}(\theta\_{1}) + \sum\_{t=1}^{T} (\psi\_{t}^{\*}(\theta\_{t+1}) - \psi\_{t}^{\*}(\theta\_{t} - h\_{t}) + g\_{t}w\_{t} - h\_{t}w\_{t}) \\ &= \psi\_{T+1}(u) + \psi\_{1}^{\*}(\theta\_{1}) + \sum\_{t=1}^{T} (\psi\_{t}^{\*}(\theta\_{t+1}) - \psi\_{t}^{\*}(\theta\_{t} - h\_{t}) - (\theta\_{t+1} - \theta\_{t} + h\_{t}) \nabla \psi\_{t}^{\*}(\theta\_{t} - h\_{t})) \\ &= \psi\_{T+1}(u) + \psi\_{1}^{\*}(\theta\_{1}) + \sum\_{t=1}^{T} \mathcal{B}\_{\psi\_{t}^{\*}}(\theta\_{t+1}, \theta\_{t} - h\_{t}), \end{aligned}$$

which is the claimed result.

**Proof of Theorem 1.** First of all, since we have

$$\begin{aligned} \sum\_{t=1}^{T} l\_t(\bar{X}\_t(\gamma\_t)) - l\_t(\bar{X}\_t(\gamma)) &\leq \sum\_{t=1}^{T} \sum\_{i=1}^{m} g\_{i,t} (\gamma\_{i,t} - \gamma\_i) \\ &= \sum\_{i=1}^{m} (\sum\_{t=1}^{T} g\_{i,t} (\gamma\_{i,t} - \gamma\_i)), \end{aligned}$$

the overall regret can be considered as the sum of the per-coordinate regrets $\sum\_{t=1}^{T} g\_{i,t}(\gamma\_{i,t} - \gamma\_i)$. Next, we analyse the regret for each $i = 1, \dots, m$. Define $\psi\_{i,t}(\gamma\_i) = \frac{\eta\_{i,t}}{2}\|\gamma\_i\|\_F^2$. It is easy to verify that $\gamma\_{i,t} \in \partial\psi\_{i,t}^\*(\theta\_{i,t})$ for $t = 1, \dots, T$. Applying Lemma A4 with $h\_t = 0$, we obtain

$$\sum\_{t=1}^{T} g\_{i,t}(\gamma\_{i,t} - \gamma\_i) \le \psi\_{i,T+1}(\gamma\_i) + \psi\_{i,1}^\*(\theta\_{i,1}) + \sum\_{t=1}^{T} \mathcal{B}\_{\psi\_{i,t}^\*}(\theta\_{i,t+1}, \theta\_{i,t}).$$

From the updating rule of $G\_{i,t}$, we have $g\_{i,t} = 0$ for $G\_{i,t} = 0$. Let $t\_0$ be the smallest index such that $G\_{i,t\_0} > 0$. Then we have

$$\sum\_{t=1}^{T} \mathcal{B}\_{\psi\_{i,t}^{\*}} (\theta\_{i,t+1}, \theta\_{i,t}) = \sum\_{t=t\_0}^{T} \mathcal{B}\_{\psi\_{i,t}^{\*}} (\theta\_{i,t+1}, \theta\_{i,t}).$$

For *Gi*,*<sup>t</sup>* > 0, *ψi*,*<sup>t</sup>* is *ηi*,*t*-strongly convex with respect to -·-*<sup>F</sup>*. From the duality of strong convexity and strong smoothness (see Proposition 2 in [17]), we have

$$\sum\_{t=t\_0}^T \mathcal{B}\_{\boldsymbol{\Psi}\_{i,t}^\*} (\theta\_{i,t+1}, \theta\_{i,t}) \le \sum\_{t=t\_0}^T \frac{1}{2\eta\_{i,t}} \|\mathcal{g}\_{i,t}\|\_F^2 = \sum\_{t=t\_0}^T \frac{\|\mathcal{g}\_{i,t}\|\_F^2}{2\sqrt{\sum\_{s=1}^{t-1} \|\mathcal{g}\_{i,s}\|\_F^2 + (L\_t\mathcal{G}\_{i,t})^2}}.$$
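The duality step used here can be stated explicitly: if $\psi$ is $\eta$-strongly convex with respect to a norm $\|\cdot\|$, then $\psi^\*$ is $\frac{1}{\eta}$-smooth with respect to the dual norm $\|\cdot\|\_\*$, so that

$$\mathcal{B}\_{\psi^\*}(\theta', \theta) \le \frac{1}{2\eta}\|\theta' - \theta\|\_\*^2.$$

Applied with $\theta' = \theta\_{i,t+1} = \theta\_{i,t} - g\_{i,t}$, this yields the $\frac{1}{2\eta\_{i,t}}\|g\_{i,t}\|\_F^2$ terms (the Frobenius norm being self-dual).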

From the definition of the Frobenius norm, we have

$$\|g\_{i,t}\|\_{F}^{2} = \|h\_{t}\nabla^{d}X\_{t-i}^{\top}\|\_{F}^{2} = \|h\_{t}\|\_{2}^{2} \|\nabla^{d}X\_{t-i}\|\_{2}^{2} \leq \frac{\|h\_{t}\|\_{2}^{2}}{L\_{t}^{2}}L\_{t}^{2}G\_{i,t}^{2}.$$

Then, we obtain

$$\begin{split} \sum\_{t=t\_0}^T \frac{\|g\_{i,t}\|\_F^2}{2\sqrt{\sum\_{s=1}^{t-1} \|g\_{i,s}\|\_F^2 + (L\_tG\_{i,t})^2}} &\leq \sum\_{t=t\_0}^T \max\Big\{1, \frac{\|h\_t\|\_2}{L\_t}\Big\} \frac{\|g\_{i,t}\|\_F^2}{2\sqrt{\sum\_{s=1}^t \|g\_{i,s}\|\_F^2}} \\ &\leq \max\Big\{1, \frac{\|h\_1\|\_2}{L\_1}, \dots, \frac{\|h\_T\|\_2}{L\_T}\Big\} \sqrt{\sum\_{t=1}^T \|g\_{i,t}\|\_F^2} \\ &\leq \Big(1 + \frac{L\_{T+1}}{L\_1}\Big) \sqrt{\sum\_{t=1}^T \|g\_{i,t}\|\_F^2} \\ &\leq \Big(L\_{T+1} + \frac{L\_{T+1}^2}{L\_1}\Big) \sqrt{\sum\_{t=1}^T \|\nabla^d X\_{t-i}\|\_2^2}, \end{split}$$

where the second inequality uses Lemma 4 in [17] and the last inequality follows from the fact that $\|g\_{i,t}\|\_F \le L\_t\|\nabla^d X\_{t-i}\|\_2 \le L\_{T+1}\|\nabla^d X\_{t-i}\|\_2$. Furthermore, we have

$$\begin{aligned} \psi\_{i,T+1}(\gamma\_i) &\leq \frac{||\gamma\_i||\_F^2}{2} \sqrt{\sum\_{t=1}^T ||g\_{i,t}||\_F^2} + \frac{L\_{T+1} G\_{i,T+1} ||\gamma\_i||\_F^2}{2} \\ &\leq \frac{||\gamma\_i||\_F^2 L\_{T+1}}{2} \sqrt{\sum\_{t=1}^T ||\nabla^d X\_{t-i}||\_2^2} + \frac{L\_{T+1} G\_{i,T+1} ||\gamma\_i||\_F^2}{2}, \end{aligned}$$

and $\psi\_{i,1}^\*(\theta\_{i,1}) \le \frac{\|\theta\_{i,1}\|\_F}{2}$. Summing over $i = 1, \dots, m$, we have

$$\begin{aligned} &\sum\_{t=1}^{T} l\_t(\bar{X}\_t(\gamma\_t)) - l\_t(\bar{X}\_t(\gamma)) \\ &\leq \sum\_{i=1}^{m} (\frac{||\gamma\_i||\_F^2 L\_{T+1}}{2} + L\_{T+1} + \frac{L\_{T+1}^2}{L\_1}) \sqrt{\sum\_{t=1}^{T} ||\nabla^d X\_{t-i}||\_2^2} \\ &+ \sum\_{i=1}^{m} \frac{L\_{T+1} G\_{i,T+1} ||\gamma\_i||\_F^2 + ||\theta\_{i,1}||\_F}{2} \end{aligned}$$

**Proof of Theorem 2.** Define $\psi\_t(\gamma) = \frac{\lambda\_t}{4}\|\gamma\|\_F^4 + \frac{\eta\_t}{2}\|\gamma\|\_F^2$. First of all, it is easy to verify that $\gamma\_t \in \partial\psi\_t^\*(\theta\_t)$. Applying Lemma A4 with $h\_t = 0$, we have

$$\sum\_{t=1}^{T} \langle g\_t \mathbf{x}\_t^\top, \gamma\_t - \gamma \rangle\_F \le \psi\_{T+1}(\gamma) + \psi\_1^\*(\theta\_1) + \sum\_{t=1}^{T} \mathcal{B}\_{\psi\_t^\*}(\theta\_{t+1}, \theta\_t). \tag{A3}$$

Define $v\_t \in \partial\psi\_t^\*(\theta\_{t+1})$. Then we have

$$\begin{aligned} \mathcal{B}\_{\psi\_t^\*}(\theta\_{t+1}, \theta\_t) &= \psi\_t^\*(\theta\_{t+1}) - \psi\_t^\*(\theta\_t) - \langle\gamma\_t, \theta\_{t+1} - \theta\_t\rangle\_F \\ &= \langle\theta\_{t+1}, v\_t\rangle\_F - \psi\_t(v\_t) - \langle\theta\_t, \gamma\_t\rangle\_F + \psi\_t(\gamma\_t) - \langle\gamma\_t, \theta\_{t+1} - \theta\_t\rangle\_F \\ &= \langle\theta\_{t+1}, v\_t\rangle\_F - \psi\_t(v\_t) + \psi\_t(\gamma\_t) - \langle\gamma\_t, \theta\_{t+1}\rangle\_F \\ &= \langle\theta\_{t+1}, v\_t - \gamma\_t\rangle\_F - \psi\_t(v\_t) + \psi\_t(\gamma\_t) \\ &= \langle g\_t \mathbf{x}\_t^{\top}, \gamma\_t - v\_t\rangle\_F - \psi\_t(v\_t) + \psi\_t(\gamma\_t) + \langle\theta\_t, v\_t - \gamma\_t\rangle\_F \\ &= \langle g\_t \mathbf{x}\_t^{\top}, \gamma\_t - v\_t\rangle\_F - \mathcal{B}\_{\psi\_t}(v\_t, \gamma\_t) \\ &= \langle\gamma\_t \mathbf{x}\_t \mathbf{x}\_t^{\top}, \gamma\_t - v\_t\rangle\_F + \langle-\nabla^d X\_t \mathbf{x}\_t^{\top}, \gamma\_t - v\_t\rangle\_F - \mathcal{B}\_{\psi\_t}(v\_t, \gamma\_t) \\ &= \langle\gamma\_t \mathbf{x}\_t \mathbf{x}\_t^{\top}, \gamma\_t - v\_t\rangle\_F - \mathcal{B}\_{\tilde{\psi}\_t}(v\_t, \gamma\_t) + \langle-\nabla^d X\_t \mathbf{x}\_t^{\top}, \gamma\_t - v\_t\rangle\_F - \mathcal{B}\_{\bar{\psi}\_t}(v\_t, \gamma\_t), \end{aligned} \tag{A4}$$

where we define $\tilde{\psi}\_t(\gamma) = \frac{\lambda\_t}{4}\|\gamma\|\_F^4$ and $\bar{\psi}\_t(\gamma) = \frac{\eta\_t}{2}\|\gamma\|\_F^2$. From the properties of the Frobenius norm, we have

$$\begin{aligned} \langle \gamma\_t \mathbf{x}\_t \mathbf{x}\_t^\top, \gamma\_t - \upsilon\_t \rangle\_F &\leq ||\gamma\_t \mathbf{x}\_t \mathbf{x}\_t^\top||\_F ||\gamma\_t - \upsilon\_t||\_F\\ &\leq ||\mathbf{x}\_t||\_2^2 ||\gamma\_t||\_F ||\gamma\_t - \upsilon\_t||\_F \end{aligned}$$

Following the idea of [33], we can upper bound $\frac{\lambda\_t}{2}\|\gamma\_t\|\_F^2\|\gamma\_t - v\_t\|\_F^2$ by $\mathcal{B}\_{\tilde{\psi}\_t}(v\_t, \gamma\_t)$ as follows:

$$\begin{aligned} \frac{\lambda\_t}{2}\|\gamma\_t\|\_F^2\|\gamma\_t - v\_t\|\_F^2 &= \frac{\lambda\_t}{2}\|\gamma\_t\|\_F^2\big(\|\gamma\_t\|\_F^2 + \|v\_t\|\_F^2 - 2\langle\gamma\_t, v\_t\rangle\_F\big) \\ &\le \frac{\lambda\_t}{4}\big(\|\gamma\_t\|\_F^4 + \|v\_t\|\_F^4 - 2\|\gamma\_t\|\_F^2\|v\_t\|\_F^2\big) + \frac{\lambda\_t}{2}\|\gamma\_t\|\_F^2\big(\|\gamma\_t\|\_F^2 + \|v\_t\|\_F^2 - 2\langle\gamma\_t, v\_t\rangle\_F\big) \\ &= \frac{\lambda\_t}{4}\|v\_t\|\_F^4 + \frac{3\lambda\_t}{4}\|\gamma\_t\|\_F^4 - \lambda\_t\|\gamma\_t\|\_F^2\langle\gamma\_t, v\_t\rangle\_F \\ &= \frac{\lambda\_t}{4}\|v\_t\|\_F^4 - \frac{\lambda\_t}{4}\|\gamma\_t\|\_F^4 + \lambda\_t\|\gamma\_t\|\_F^2\langle\gamma\_t, \gamma\_t\rangle\_F - \lambda\_t\|\gamma\_t\|\_F^2\langle\gamma\_t, v\_t\rangle\_F \\ &= \frac{\lambda\_t}{4}\|v\_t\|\_F^4 - \frac{\lambda\_t}{4}\|\gamma\_t\|\_F^4 - \lambda\_t\|\gamma\_t\|\_F^2\langle\gamma\_t, v\_t - \gamma\_t\rangle\_F \\ &= \mathcal{B}\_{\tilde{\psi}\_t}(v\_t, \gamma\_t). \end{aligned}$$

Thus, for $\lambda\_t > 0$, we have

$$\begin{aligned} \langle \gamma\_t \mathbf{x}\_t \mathbf{x}\_t^\top, \gamma\_t - v\_t \rangle\_F - \mathcal{B}\_{\tilde{\Psi}\_t}(v\_t, \gamma\_t) &\leq 2\sqrt{\frac{||\mathbf{x}\_t||\_2^4}{2\lambda\_t}} \mathcal{B}\_{\tilde{\Psi}\_t}(v\_t, \gamma\_t) - \mathcal{B}\_{\tilde{\Psi}\_t}(v\_t, \gamma\_t) \\ &\leq \frac{||\mathbf{x}\_t||\_2^4}{2\lambda\_t}, \end{aligned}$$

where the second inequality uses the fact that $2ab - b^2 \le a^2$. Let $t\_0$ be the smallest index such that $\lambda\_{t\_0} > 0$. Then we have

$$\begin{split} &\sum\_{t=1}^{T} \left( \langle \gamma\_{t} \mathbf{x}\_{t} \mathbf{x}\_{t}^{\top}, \gamma\_{t} - v\_{t} \rangle\_{F} - \mathcal{B}\_{\tilde{\psi}\_{t}}(v\_{t}, \gamma\_{t}) \right) \\ &\leq \sum\_{t=t\_{0}}^{T} \frac{\|\mathbf{x}\_{t}\|\_{2}^{4}}{2\lambda\_{t}} \\ &= \sum\_{t=t\_{0}}^{T} \frac{\|\mathbf{x}\_{t}\|\_{2}^{4}}{2\sqrt{\sum\_{s=1}^{t} \|\mathbf{x}\_{s}\|\_{2}^{4}}} \\ &\leq \sqrt{\sum\_{t=1}^{T} \|\mathbf{x}\_{t}\|\_{2}^{4}}, \end{split} \tag{A5}$$

where the last inequality uses Lemma 4 in [17]. Similarly, let $t\_1$ be the smallest index such that $\eta\_{t\_1} > 0$. Then we obtain the upper bound

$$\begin{split} &\sum\_{t=1}^{T} \big( \langle -\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}, \gamma\_{t} - v\_{t} \rangle\_{F} - \mathcal{B}\_{\bar{\psi}\_{t}}(v\_{t}, \gamma\_{t}) \big) \\ &\leq \sum\_{t=1}^{T} \big( \|\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}\|\_{F} \|\gamma\_{t} - v\_{t}\|\_{F} - \mathcal{B}\_{\bar{\psi}\_{t}}(v\_{t}, \gamma\_{t}) \big) \\ &\leq \sum\_{t=t\_{1}}^{T} \Big( \sqrt{\frac{2\|\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}\|\_{F}^{2}}{\eta\_{t}} \mathcal{B}\_{\bar{\psi}\_{t}}(v\_{t}, \gamma\_{t})} - \mathcal{B}\_{\bar{\psi}\_{t}}(v\_{t}, \gamma\_{t}) \Big) \\ &= \sum\_{t=t\_{1}}^{T} \Big( 2\sqrt{\frac{\|\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}\|\_{F}^{2}}{2\eta\_{t}} \mathcal{B}\_{\bar{\psi}\_{t}}(v\_{t}, \gamma\_{t})} - \mathcal{B}\_{\bar{\psi}\_{t}}(v\_{t}, \gamma\_{t}) \Big) \\ &\leq \sum\_{t=t\_{1}}^{T} \frac{\|\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}\|\_{F}^{2}}{2\eta\_{t}} = \sum\_{t=t\_{1}}^{T} \frac{\|\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}\|\_{F}^{2}}{2\sqrt{\sum\_{s=1}^{t-1}\|\nabla^{d}X\_{s}\mathbf{x}\_{s}^{\top}\|\_{F}^{2} + L\_{t}^{2}\|\mathbf{x}\_{t}\|\_{2}^{2}}} \\ &\leq \max\Big\{1, \frac{\|\nabla^{d}X\_{1}\mathbf{x}\_{1}^{\top}\|\_{F}}{G\_{1}}, \dots, \frac{\|\nabla^{d}X\_{T}\mathbf{x}\_{T}^{\top}\|\_{F}}{G\_{T}}\Big\} \sum\_{t=t\_{1}}^{T} \frac{\|\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}\|\_{F}^{2}}{2\sqrt{\sum\_{s=1}^{t}\|\nabla^{d}X\_{s}\mathbf{x}\_{s}^{\top}\|\_{F}^{2}}} \\ &\leq \max\Big\{1, \frac{\|\nabla^{d}X\_{1}\mathbf{x}\_{1}^{\top}\|\_{F}}{G\_{1}}, \dots, \frac{\|\nabla^{d}X\_{T}\mathbf{x}\_{T}^{\top}\|\_{F}}{G\_{T}}\Big\} \sqrt{\sum\_{t=1}^{T}\|\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}\|\_{F}^{2}} \\ &\leq \Big(1 + \frac{G\_{T+1}}{G\_{1}}\Big) \sqrt{\sum\_{t=1}^{T}\|\nabla^{d}X\_{t}\mathbf{x}\_{t}^{\top}\|\_{F}^{2}}. \end{split} \tag{A6}$$

Combining (A3)–(A6), we obtain

$$\begin{split} \sum\_{t=1}^{T} \langle g\_{t} \mathbf{x}\_{t}^{\top}, \gamma\_{t} - \gamma \rangle\_{F} &\leq \frac{(\sqrt{m}G\_{T+1}^{2} + \|\theta\_{1}\|\_{F}) \|\gamma\|\_{F}^{2}}{2} + \psi\_{1}^{\*}(\theta\_{1}) + \Big(1 + \frac{\|\gamma\|\_{F}^{4}}{4}\Big) \sqrt{\sum\_{t=1}^{T} \|\mathbf{x}\_{t}\|\_{2}^{4}} \\ &\quad + \Big(1 + \frac{G\_{T+1}}{G\_{1}} + \frac{\|\gamma\|\_{F}^{2}}{2}\Big) \sqrt{\sum\_{t=1}^{T} \|\nabla^{d} X\_{t} \mathbf{x}\_{t}^{\top}\|\_{F}^{2}}. \end{split}$$

It is easy to verify that $\psi\_{1}^{\*}(\theta\_{1}) \le \langle w\_{1}, \theta\_{1}\rangle\_F \le \frac{\|\theta\_{1}\|\_F^2}{\eta\_1} \le \|\theta\_{1}\|\_F$ (for $\theta\_1 = 0$ the bound holds trivially). Substituting this into the inequality above, we obtain the claimed result.

**Proof of Theorem 3.** Define

$$\psi\_t : \Delta \to \mathbb{R}, \quad w \mapsto \eta\_t \sum\_{k \in I\_w} w\_k \log w\_k + \eta\_t \log K,$$

where $I\_w = \{i = 1, \dots, K \mid w\_i \neq 0\}$. It can be verified that $w\_t \in \partial\psi\_t^\*(\theta\_t - h\_t)$. Applying Lemma A4, we obtain

$$\sum\_{t=1}^{T} z\_t^\top (w\_t - u) \le \psi\_{T+1}(u) + \psi\_1^\*(\theta\_1) + \sum\_{t=1}^{T} \mathcal{B}\_{\psi\_t^\*} (\theta\_{t+1}, \theta\_t - h\_t).$$
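For intuition, the prediction step with this entropic regularizer has the familiar exponential-weights (softmax) form; assuming the convention that $\theta\_t$ accumulates the negated loss vectors and $h\_t$ is the optimistic hint of Algorithm A1, one can check that

$$w\_{k,t} = \frac{\exp((\theta\_{k,t} - h\_{k,t})/\eta\_t)}{\sum\_{j=1}^{K} \exp((\theta\_{j,t} - h\_{j,t})/\eta\_t)}, \qquad k = 1, \dots, K,$$

i.e., the algorithm is an optimistic, adaptively tuned variant of the classical Hedge update.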

From the definition of $\psi\_t$, it follows that $\psi\_{T+1}(u) \le \sqrt{\frac{\log K}{2} \sum\_{t=1}^{T} \|z\_t - h\_t\|\_\infty^2}$ and $\psi\_1^\*(\theta\_1) = 0$ hold. Define $v\_t \in \partial\psi\_t^\*(\theta\_{t+1})$. Next, we bound the third term as follows:

$$\begin{aligned} \mathcal{B}\_{\psi\_t^\*}(\theta\_{t+1}, \theta\_t - h\_t) &= \psi\_t^\*(\theta\_{t+1}) - \psi\_t^\*(\theta\_t - h\_t) - (h\_t - z\_t)^{\top} w\_t \\ &= \theta\_{t+1}^{\top} v\_t - \psi\_t(v\_t) - (\theta\_t - h\_t)^{\top} w\_t + \psi\_t(w\_t) - (h\_t - z\_t)^{\top} w\_t \\ &= (h\_t - z\_t)^{\top}(v\_t - w\_t) - \big(\psi\_t(v\_t) - \psi\_t(w\_t) - (\theta\_t - h\_t)^{\top}(v\_t - w\_t)\big) \\ &= (h\_t - z\_t)^{\top}(v\_t - w\_t) - \mathcal{B}\_{\psi\_t}(v\_t, w\_t) \\ &= (h\_t - z\_t)^{\top}(v\_t - w\_t) - \eta\_{t+1}\|v\_t - w\_t\|\_1^2 + \eta\_{t+1}\|v\_t - w\_t\|\_1^2 - \mathcal{B}\_{\psi\_t}(v\_t, w\_t) \\ &\le (h\_t - z\_t)^{\top}(v\_t - w\_t) - \eta\_{t+1}\|v\_t - w\_t\|\_1^2 + (\eta\_{t+1} - \eta\_t)\|v\_t - w\_t\|\_1^2 \\ &\le \|h\_t - z\_t\|\_\infty \|v\_t - w\_t\|\_1 - \eta\_{t+1}\|v\_t - w\_t\|\_1^2 + 4(\eta\_{t+1} - \eta\_t) \\ &\le \frac{\|h\_t - z\_t\|\_\infty^2}{4\eta\_{t+1}} + 4(\eta\_{t+1} - \eta\_t), \end{aligned}$$

where the first inequality uses the fact that $\psi\_t$ is $2\eta\_t$-strongly convex w.r.t. $\|\cdot\|\_1$, and the second uses $\|v\_t - w\_t\|\_1 \le 2$ for points of the simplex. Summing from $t = 1$ to $T$, we have

$$\begin{aligned} \sum\_{t=1}^{T} \mathcal{B}\_{\psi\_t^\*}(\theta\_{t+1}, \theta\_t - h\_t) &\leq \sum\_{t=1}^{T} \Big(\frac{\|h\_{t} - z\_{t}\|\_{\infty}^{2}}{4\eta\_{t+1}} + 4(\eta\_{t+1} - \eta\_{t})\Big) \\ &\leq \sqrt{\frac{\log K}{2} \sum\_{t=1}^{T} \|h\_{t} - z\_{t}\|\_{\infty}^{2}} + 4\eta\_{T+1} \\ &\leq \sqrt{\frac{\log K}{2} \sum\_{t=1}^{T} \|h\_{t} - z\_{t}\|\_{\infty}^{2}} + \sqrt{\frac{8}{\log K} \sum\_{t=1}^{T} \|h\_{t} - z\_{t}\|\_{\infty}^{2}}. \end{aligned}$$

Combining the inequalities, we obtain

$$\begin{aligned} &\sum\_{t=1}^{T} l\Big(X\_{t}, \sum\_{i=1}^{K} w\_{i,t} \tilde{X}\_{t}^{i}\Big) - \sum\_{t=1}^{T} l(X\_{t}, \tilde{X}\_{t}^{k}) \\ &\le \sum\_{t=1}^{T} \sum\_{i=1}^{K} w\_{i,t}\, l(X\_{t}, \tilde{X}\_{t}^{i}) - \sum\_{t=1}^{T} l(X\_{t}, \tilde{X}\_{t}^{k}) \\ &= \sum\_{t=1}^{T} w\_{t}^{\top} z\_{t} - \sum\_{t=1}^{T} l(X\_{t}, \tilde{X}\_{t}^{k}) \\ &\le \Big(\sqrt{2\log K} + \sqrt{\frac{8}{\log K}}\Big) \sqrt{\sum\_{t=1}^{T} \|h\_{t} - z\_{t}\|\_{\infty}^{2}}, \end{aligned}$$

where the first inequality follows from Jensen's inequality. Furthermore, if *l* is *L*-Lipschitz in its first argument, then we have

$$\|h\_t - z\_t\|\_{\infty} = \max\_{i \in \{1, \dots, K\}} |z\_{i,t} - h\_{i,t}| \le L \|\nabla^d X\_t\|\_2.$$

Finally, we obtain the regret upper bound

$$\sum\_{t=1}^{T} l\Big(X\_{t}, \sum\_{i=1}^{K} w\_{i,t} \tilde{X}\_{t}^{i}\Big) - \sum\_{t=1}^{T} l(X\_{t}, \tilde{X}\_{t}^{k}) \le \Big(\sqrt{2\log K} + \sqrt{\frac{8}{\log K}}\Big) \sqrt{\sum\_{t=1}^{T} L^{2} \|\nabla^{d} X\_{t}\|\_{2}^{2}},$$

which is the claimed result.

#### **Appendix C**

We summarize the main notations used throughout the article in Table A1.

**Table A1.** Nomenclature.


#### **Appendix D**

For the synthetic data, the relative performance of the proposed algorithms after the first 1000 iterations is plotted in Figures A1–A3. For each setting, we calculate the average loss after the first 1000 iterations and plot the difference between each proposed algorithm and the average loss incurred by the best baseline algorithm.

**Figure A1.** Relative performance for setting 1.

**Figure A2.** Relative performance for setting 2.

**Figure A3.** Relative performance for setting 3.

Similarly, we plot the relative performance for the real-world data over the time horizon in Figures A4–A6.

**Figure A6.** Relative performance for electricity demand.

#### **References**


## *Article* ***AutoNowP*: An Approach Using Deep Autoencoders for Precipitation Nowcasting Based on Weather Radar Reflectivity Prediction**

**Gabriela Czibula 1,\*,†, Andrei Mihai 1,†, Alexandra-Ioana Albu 1,†, Istvan-Gergely Czibula 1,†, Sorin Burcea <sup>2</sup> and Abdelkader Mezghani <sup>3</sup>**


**Abstract:** Short-term quantitative precipitation forecasting is a challenging topic in meteorology, as the number of severe meteorological phenomena is increasing in most regions of the world. Weather radar data are of utmost importance to meteorologists for issuing short-term weather forecasts and warnings of severe weather phenomena. We propose *AutoNowP*, a binary classification model intended for precipitation nowcasting based on weather radar reflectivity prediction. Specifically, *AutoNowP* uses two convolutional autoencoders trained on radar data collected in both stratiform and convective weather conditions to learn to predict whether the radar reflectivity values will be above or below a certain threshold. *AutoNowP* is intended as a proof of concept that autoencoders are useful in distinguishing between convective and stratiform precipitation. Real radar data provided by the Romanian National Meteorological Administration and the Norwegian Meteorological Institute are used for evaluating the effectiveness of *AutoNowP*. Results showed that *AutoNowP* surpassed other binary classifiers used in the supervised learning literature in terms of probability of detection and negative predictive value, highlighting its predictive performance.

**Keywords:** precipitation nowcasting; deep learning; autoencoders; radar data

#### **1. Introduction**

Forecasting severe weather phenomena, including quantitative precipitation forecasting (QPF), is a challenging topic in meteorology. Due to the increase in the number of heavy rainfall events in most regions of the world, population safety could be affected and significant damage may occur. Short-term weather forecasting is known as nowcasting and is of particular interest, as it plays an important role in risk management and crisis control. Weather nowcasting is a complex and difficult problem due to its high dependence on numerous environmental conditions. Precipitation nowcasting, which refers to predicting rainfall intensities over a certain region in the near future, is a challenging and active research topic that plays an important role in daily life [1].

At the global scale, the flood threat is increasing because of the impact of climate change on heavy precipitation; for instance, the total urban area exposed to floods has increased dramatically in Europe over the past century. Various socioeconomic sectors are also affected by climate-change-induced hazards, such as extreme rainfall, which amplify both the intensity and the probability of floods [2]. Research on flood hazard exposure using climate model simulations showed that climate change has the potential to actively change the exposure of people, assets, and urban areas to flood hazard; nevertheless, considerable uncertainty remains in the magnitude of the climate change impact in different regions around the globe [3].

**Citation:** Czibula, G.; Mihai, A.; Albu, A.-I.; Czibula, I.G.; Burcea, S.; Mezghani, A. *AutoNowP*: An Approach Using Deep Autoencoders for Precipitation Nowcasting Based on Weather Radar Reflectivity Prediction. *Mathematics* **2021**, *9*, 1653. https://doi.org/10.3390/math9141653

Academic Editor: Freddy Gabbay

Received: 31 May 2021; Accepted: 30 June 2021; Published: 14 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Nowadays, integrating crowdsourced observations into research studies can contribute to reducing the risk and the costs related to extreme events. Citizens around the world currently have at their disposal numerous sources of information and many possibilities to report and study meteorological phenomena. These volunteers, who collect, report, and/or process the data they observe, are citizen scientists. They are active not only in the field of meteorology, but also in sciences such as astronomy, archeology, and natural history [4]. Their contribution to science can have a practical effect, especially by increasing awareness and perception of climate-change-related risks, thus helping to mitigate their effects.

Although significant progress has been made recently on nowcasting systems in general, and precipitation nowcasting in particular, challenges remain: for instance, severe convective storms are localized, occur over a small spatial area (i.e., the mesoscale), and have an overall short lifecycle. Due to its high spatiotemporal resolution, radar data is used both in the so-called expert nowcasting systems and in less complex systems that process the radar data solely [5,6]. Expert systems blend radar data and other observations with numerical weather prediction (NWP) models to generate forecasts up to 6 h ahead [7]. Although NWP significantly improves precipitation nowcasting, there are still issues to be resolved, such as the predictability of precipitation systems, the improvement of rapid-update NWP, and the need for improved mesoscale observation networks [8].

Some of the most used radar products in weather nowcasting are reflectivity (R) and Doppler radial velocity (V). For instance, operational meteorologists mainly use the values of reflectivity and radial velocity to monitor the spatiotemporal evolution of precipitating clouds, while operational radar algorithms use the reflectivity for rainfall estimation and for storm tracking and classification: R values above a certain threshold (e.g., 35 dBZ [5,9]) indicate possible convective storm occurrence associated with heavy rainfall. Estimating the values of the radar products based on their historical values is important for QPF. NWP models [10] represent the main techniques for QPF, but there are still errors in rainfall forecasting due to difficulties in modeling cloud dynamics and microphysics [11].

Deep learning methods [12–14] are believed to have the potential to overcome the limitations of NWP methods by modeling patterns in large amounts of historical meteorological data. Deep learning methods offer data-driven solutions for the nowcasting problem by learning dependencies between radar measurements at consecutive time steps [15]. A central characteristic of deep neural networks is their ability to learn abstract representations of the input data by stacking multiple layers, thus forming deep architectures. Autoencoders (AEs) are a type of neural network that can be trained to learn low-dimensional representations that capture the relevant characteristics of the input data [16]. AEs learn data representations by reconstructing their inputs. They consist of two components: an encoder that maps the input to a latent representation and a decoder that uses this representation to reconstruct the input. Typically, the dimensionality of the latent representation is chosen to be smaller than the input space dimensionality, thus obtaining a so-called undercomplete autoencoder. Autoencoders can be trained using gradient descent methods to minimize the error between the input data and the predicted reconstruction [16]. Convolutional autoencoders (ConvAEs) are able to capture spatial patterns in the input data by using convolutions as their building blocks. Convolutional encoder–decoder architectures have been extensively used in various computer vision tasks, and they are the typical choice for modeling the spatial characteristics of meteorological measurements gathered across geographical locations [15,17,18].
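To make the encoder–decoder idea concrete, the sketch below trains a deliberately tiny *linear* undercomplete autoencoder with plain gradient descent on the reconstruction error. It is only an illustration of the principle described above: a real ConvAE replaces the two weight matrices with convolutional encoder and decoder layers, and all sizes and names here are our own illustrative choices.

```python
import numpy as np

def train_autoencoder(X, latent_dim=2, lr=0.02, epochs=2000, seed=0):
    """Train a linear undercomplete autoencoder on X (n_samples, n_features)
    by gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, latent_dim))  # encoder weights
    W_dec = rng.normal(scale=0.1, size=(latent_dim, d))  # decoder weights
    for _ in range(epochs):
        Z = X @ W_enc                 # latent (bottleneck) representation
        X_hat = Z @ W_dec             # reconstruction of the input
        err = (X_hat - X) / n         # gradient of 0.5 * mean squared error
        grad_dec = Z.T @ err          # d(loss)/d(W_dec)
        grad_enc = X.T @ (err @ W_dec.T)  # d(loss)/d(W_enc)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec
```

On data that actually lies in a low-dimensional subspace, the bottleneck forces the network to discover that subspace, which is exactly the "relevant characteristics" intuition above.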

The contribution of the paper is threefold. First, we introduce a supervised classifier, *AutoNowP*, that uses two convolutional autoencoders for distinguishing between convective and stratiform rainfall based on radar reflectivity prediction. *AutoNowP* is based on two ConvAEs trained on radar data collected in both stratiform and convective weather conditions. After the training step, *AutoNowP* learns to predict whether the radar reflectivity values will be higher than a certain threshold, thus indicating whether a convective storm is likely to occur. *AutoNowP* is intended as a proof of concept that AEs applied to radar data are useful in distinguishing between convective and stratiform rainfall. Secondly, the effectiveness of *AutoNowP* is empirically proven on two case studies consisting of real radar data collected from the Romanian National Meteorological Administration (NMA) and the Norwegian Meteorological Institute (MET). The obtained results are compared to the results of recent similar approaches in the field of precipitation nowcasting. As an additional goal, we analyze the relevance of the obtained results from a meteorological perspective, as a proof of concept that autoencoders are able to capture relevant meteorological knowledge. To the best of our knowledge, an approach similar to *AutoNowP* has not been proposed in the nowcasting literature so far.

To summarize, the research conducted in the paper is oriented toward answering the following research questions:


The rest of the paper is organized as follows. A literature review on recent deep learning methods for precipitation nowcasting is presented in Section 2. Section 3 introduces our binary classification model *AutoNowP* for predicting if the radar reflectivity values are above or below a specific threshold. The performed experiments and the obtained results are described in Section 4, while a discussion on the results and a comparison to related approaches is provided in Section 5. Section 6 presents the conclusions of our research and highlights directions for future work.

#### **2. Literature Review on Machine-Learning-Based Precipitation Nowcasting**

Considerable work has recently been carried out in the field of machine-learning-based precipitation nowcasting. In the following, we review several recent approaches in the field.

Shi et al. [19] approached precipitation nowcasting by introducing an extension of the long short-term memory (LSTM) network, named ConvLSTM, suitable for handling spatiotemporal data, as its convolutional structure preserves spatiotemporal features. Their architecture is composed of two networks, a ConvLSTM encoder and a ConvLSTM decoder. As precipitation nowcasting performance indicators, a Rainfall Mean Squared Error (Rainfall-MSE) of 1.420, a Critical Success Index (CSI) of 0.577, a False Alarm Rate (FAR) of 0.195 and a Probability of Detection (POD) of 0.660 have been obtained.
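The verification scores quoted above and in the remainder of this section (CSI, FAR, POD) are standard contingency-table metrics for binary nowcasts, computed from hits, false alarms, and misses. A minimal reference implementation using the common meteorological definitions (note that FAR here denotes the false alarm *ratio*, FP/(TP + FP)) might look as follows; the function and variable names are our own:

```python
import numpy as np

def nowcast_scores(y_true, y_pred):
    """Compute CSI, FAR and POD for binary event forecasts.

    y_true, y_pred: array-likes of 0/1 flags, where 1 marks an observed
    (resp. forecast) event, e.g. reflectivity above a 35 dBZ threshold.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_pred & y_true)     # hits
    fp = np.sum(y_pred & ~y_true)    # false alarms
    fn = np.sum(~y_pred & y_true)    # misses
    csi = tp / (tp + fp + fn)        # Critical Success Index
    far = fp / (tp + fp)             # False Alarm Ratio
    pod = tp / (tp + fn)             # Probability of Detection
    return csi, far, pod
```

For example, with three observed events of which two are detected and one spurious alarm issued, `nowcast_scores` gives CSI = 0.5, FAR = 1/3, and POD = 2/3.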

Heye et al. [20] investigated a precipitation nowcasting approach based on a 3D ConvLSTM architecture. In their experiments, a vanilla sequence-to-sequence model achieved better performance than a model using attention layers. Overall, the CSI varied between 0.40 and 0.43, the FAR ranged from 0.28 to 0.31, and the POD fluctuated between 0.46 and 0.51.

A method for precipitation nowcasting combining the advantages of convolutional gated recurrent networks (ConvGRU) and adversarial training was introduced by Tian et al. [21]. The method aimed at improving the sharpness of the predicted precipitation maps by means of adversarial training. The system is composed of a generator network, represented by the ConvGRU, which learns to generate realistic-looking precipitation maps, and a discriminator, represented by a convolutional neural network that is trained to distinguish between predicted and ground truth maps. Their method achieved better performance in terms of probability of detection than an optical flow algorithm and the original ConvGRU. Han et al. [22] used 3D convolutions to build a neural network for convective storm nowcasting. The task was formulated as a binary classification problem, and their multisource approach achieved a CSI of 0.44, a FAR of 0.45, and a POD of 0.69 for 30 min forecasts, outperforming a Support Vector Machine using hand-crafted features.

The MetNet model [15] has been introduced by Sønderby et al. using both radar and satellite data for precipitation forecasting with a lead time of up to 8 h. The model incorporates three components—a feature extractor formed of a succession of downsampling convolutional layers, a ConvLSTM component used for modeling dependencies on the past time steps and an attention module composed of several axial self-attention blocks that aim to capture relationships among geographic locations situated far away in the map. By including the forecasted time in the data given as input and thus conditioning the entire model on it, predictions for multiple time steps can be obtained in parallel. The loss function was computed only for points on good quality maps from the data set in order to account for possible noisy or incorrect labels. MetNet outperformed the persistence model, an optical flow-based algorithm, as well as the High-Resolution Rapid Refresh (HRRR) for forecasts up to 8 h in the future. By performing ablation studies, they pointed out that using a large spatial context leads to better performance than using a smaller context on long-term predictions. However, reducing the temporal context up to 30 min did not decrease the model's performance. Moreover, the authors pointed out that radar data plays a more important role in the overall model performance for short-term predictions than for long-term ones. These results can be explained by the fact that long-term predictions need to take into account a larger spatial context that cannot be typically captured by radar, thus highlighting the importance of incorporating satellite data for this type of predictions.

The model proposed by Franch et al. [23] aimed to improve the performance of nowcasting systems on extreme events prediction by training an ensemble of Trajectory Gated Recurrent Units (TrajGRUs), each optimized by over-weighting the objective for a specific precipitation threshold. In addition to the ensemble components, a model stacking strategy that consists of training an additional model using the outputs of the ensemble components is employed. Moreover, their approach enhances the radar data with orographic features. The proposed model achieved overall better performance than several TrajGRU baselines and two models obtained by using only part of the components—an ensemble model without orographic features, and a single model trained with orographic features.

Chen et al. [1] improved upon the training of ConvLSTMs by introducing a multisigmoid loss function tailored for the precipitation nowcasting task and incorporating residual connections in the recurrent architecture. Additionally, the group normalization mechanism proved to be beneficial for the model's performance. The model was trained on radar images and predictions were evaluated for lead times of up to one hour.

The Small Attention-Unet (SmaAt-Unet) [17] precipitation nowcasting model introduced by Trebing et al. is a modified U-Net architecture, in which traditional convolutions have been replaced by depthwise separable convolutions and convolutional block attention modules have been added to the encoder. The proposed approach achieved an overall comparable performance to the original U-Net, while using a quarter of the number of parameters. The nowcasting is done for up to 30 min in the future using 1 h of past radar data, sampled at a frequency of 5 min. Similarly to other U-Net-based methods, different time stamps are concatenated channelwise and given as input to the network. Patterns across the channel dimension are captured by the attention modules. As precipitation nowcasting performance indicators, a CSI of 0.647, a FAR of 0.270, and an *F*-*score* of 0.768 have been obtained.

An approach for weather forecasting using ConvLSTMs and attention was introduced in [18]. Their proposed method was tested on the ECMWF (European Centre for Medium-Range Weather Forecasts) Reanalysis v5 (ERA5) data set, which contains several weather measurements such as temperature, geopotential, humidity and vertical velocity at a time resolution of one hour. The approach was shown to outperform other methods such as Simple Moving Average, U-Net, and ConvLSTM, achieving MSE values between 1.32 and 2.47.

Jeong et al. [24] alternatively proposed a weighted broadcasting strategy for ConvLSTMs, which is based on the idea of overweighting the last time stamp in the input sequence. Their approach reached generally better performance than the baseline ConvLSTM architecture, with CSI values ranging between 0.0108 and 0.5031, FAR between 0.2960 and 0.5653, POD values in the range 0.0110–0.6403, and Heidke skill score (HSS) between 0.01 and 0.3.

A deep learning approach for precipitation estimation from reflectivity values was introduced by Yo et al. [25]. The proposed approach was compared to an operational precipitation estimation method used by the Central Weather Bureau in Taiwan and was shown to slightly outperform it, especially in predicting extreme meteorological events. However, the improvement was not statistically significant; the proposed method obtained an average POD of 0.8 and a FAR of 0.0134.

#### **3. Methodology**

With the goal of answering research question **RQ1**, this section introduces our binary classification model proposal, *AutoNowP*, which consists of two ConvAEs trained on radar data collected under rainfall conditions of different severity classes, for recognizing severe phenomena. More specifically, *AutoNowP* is trained to predict whether the radar reflectivity values will be above or below a specific threshold. ConvAE models are used due to their ability to preserve the structure of the input data and to detect underlying structural relationships within the data.

*AutoNowP* aims to empirically demonstrate that autoencoders are able to learn, by self-supervision, features that are relevant for distinguishing structural relationships in radar data collected in both stratiform and convective weather conditions. The model is designed to classify whether a radar product *Rp* is below or above a threshold *τ*. In the experiments we will use two radar products, the reflectivity at the first elevation level (R01) and the composite reflectivity, and different values for the threshold *τ* (e.g., 5, 20, 35 dBZ). *AutoNowP* consists of three stages depicted in Figure 1: data representation and preprocessing, training, and testing (evaluation). These stages are detailed below.

**Figure 1.** Overview of *AutoNowP*.

#### *3.1. Data Representation and Preprocessing*

The raw radar data used in our experiments is converted into two-dimensional arrays, with a grid cell representing a geographical location. A cell in the matrix stores the value of a specific radar product at a given time stamp. A sequence of such matrices is available for a given day, each matrix storing the values for a specific radar product *p* at a time moment *t*. We assume that *np* radar products are available and thus, the radar data at a time moment *t* may be visualized as a data grid with *np* channels.

In our previous works [11,26] we highlighted that similar values for the radar products in a specific location *l* at a time *t* are encoded in similar neighborhoods of the location *l* at time *t*−1. For a specific location *l* at time *t*, a *d*<sup>2</sup>-dimensional vector containing the values of a radar product *Rp* from the sub-grid of diameter *d* centered on *l* (at time *t*−1) will be assigned. The *d*<sup>2</sup>-dimensional instance will be labeled with the value of *Rp* for the location *l* at time *t* [11]. A sample data grid containing the values for the product R01 at time *t* is shown in Figure 2a, while Figure 2b depicts the data grid at time *t*−1.



(**a**) The data matrix at time stamp *t*. In red is the value of R01 at location *l* = (3, 3).

(**b**) The data grid at time stamp *t*−1. In blue is the neighborhood of the location *l* = (3, 3) of diameter *d* = 3.

**Figure 2.** Sample data grids at time stamp *t* and *t*−1 highlighting an instance sample at location *l* = (3, 3) and a diameter *d* = 3 for the neighborhood.

For the example from Figure 2, the instance corresponding to the location (3,3) at time *t* is the vector (15,10,20,10,15, 20,5,10,10) and is labeled with 10 (the value of R01 at location (3,3) and time *t*).
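The instance-building scheme described above can be sketched in a few lines of NumPy. The function below is illustrative only; the function name and the choice to skip border cells without a full neighborhood are our assumptions, not the paper's implementation:

```python
import numpy as np

def build_instances(grid_prev, grid_curr, d):
    """Build d*d-dimensional instances from the data grid at time t-1,
    each labeled with the radar product value at time t (Section 3.1).
    Border cells without a complete neighborhood are skipped (assumption)."""
    r = d // 2                        # neighborhood radius
    h, w = grid_curr.shape
    instances, labels = [], []
    for i in range(r, h - r):
        for j in range(r, w - r):
            neigh = grid_prev[i - r:i + r + 1, j - r:j + r + 1]
            instances.append(neigh.flatten())
            labels.append(grid_curr[i, j])
    return np.array(instances), np.array(labels)
```

For *d* = 3, each instance is a 9-dimensional vector, matching the example above.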

Consequently, considering a specific diameter *d* for the neighborhood, a data set *R* is built from the instances (*d*<sup>2</sup>-dimensional points) associated to each location from the data grid and all available time moments [11]. The radar data set *R* will be divided into two classes: the positive class (denoted as "+"), composed of the instances whose label (i.e., the value of the radar product *Rp* at a certain time *t*) is higher than a threshold *τ*, and the negative class (denoted as "−"), containing the instances whose label is lower than or equal to the threshold *τ*. The data set representing the positive class is denoted by *R*<sup>+</sup>, while *R*<sup>−</sup> denotes the set of instances belonging to the negative class. We note that the cardinality of *R*<sup>−</sup> is significantly larger than the cardinality of *R*<sup>+</sup>, as the number of severe weather events is often small.

Both data sets are then normalized so that the value *Rp* of a radar product is transformed to be in the [0, 1] range. For normalization purposes, we use the classic min/max normalization formula:

$$Rp'(l,t) = \frac{Rp(l,t) - Rp\_{\min}}{Rp\_{\max} - Rp\_{\min}},$$

where:

- *Rp*(*l*, *t*) is the value of the radar product *Rp* at location *l* and time *t*;
- *Rp*<sub>min</sub> and *Rp*<sub>max</sub> are the minimum and maximum values of the radar product's domain.
It should be noted that we are using the minimum and maximum values from a radar product's domain to ensure that both *R*<sup>+</sup> and *R*<sup>−</sup> data sets are normalized in the same way (i.e., the same value in different data sets is mapped to the same normalized value), as the positive data set may have different minimum and maximums than the negative data set. *AutoNowP* is trained and tested on the normalized data.
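This domain-based normalization can be sketched as below; the reflectivity bounds used here are hypothetical placeholders, while the paper uses the actual domain of each radar product:

```python
import numpy as np

# Hypothetical domain bounds for a reflectivity product, in dBZ
# (illustrative only; the paper uses the product's actual domain).
RP_MIN, RP_MAX = 0.0, 75.0

def normalize(values, rp_min=RP_MIN, rp_max=RP_MAX):
    """Min/max normalization using the radar product's full domain,
    so that R+ and R- map the same raw value to the same normalized value."""
    return (np.asarray(values, dtype=float) - rp_min) / (rp_max - rp_min)
```

Using the full domain (rather than per-data-set minima and maxima) is what guarantees that *R*<sup>+</sup> and *R*<sup>−</sup> are normalized identically.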

#### *3.2. AutoNowP Classification Model*

Considering the notations from Section 3.1, the classification problem is formalized as the approximation of two target functions (i.e., one target function for each class) *tc* : R<sup>+</sup> ∪ R<sup>−</sup> → [0, 1] (∀*c* ∈ {+, −}) that express the probability of instances from R<sup>+</sup> ∪ R<sup>−</sup> of belonging to the "+" or "−" class. Thus, the learning goal of *AutoNowP* will be to approximate the functions *t*<sup>+</sup> and *t*<sup>−</sup>. *AutoNowP* consists of two ConvAEs, one for the "+" class (*CA*+) and one for the "−" class (*CA*−). For training an autoencoder *CAc* (*c* ∈ {+, −}), 47% of the data set *Rc* (i.e., 70% of the data not used for testing) will be used for training, 20% for model validation, and the remaining 33% of *Rc* will be used for testing, following a 3-fold cross-validation testing methodology.

#### 3.2.1. Training

As previously stated, *AutoNowP* classifier will be trained to predict, based on the radar products values from the neighborhood of a geographical location at time *t* − 1, whether the value of a radar product *Rp* at time *t* will be higher than a threshold *τ*. For instance, if *Rp* is chosen as R01 and *τ* as 35 dBZ, then *AutoNowP* will be trained to predict if, in a certain geographical location or area, a convective storm is likely to occur (i.e., if the value of R01 will be higher than 35 dBZ in that geographical location).

*AutoNowP* is trained to recognize both normal and severe weather events, and thus it will learn to predict whether a certain instance is likely to indicate stormy or normal weather. Each of the two autoencoders *CA*<sup>+</sup> and *CA*<sup>−</sup> will be trained in a self-supervised manner on the data sets of positive and negative instances, respectively (R<sup>+</sup> and R<sup>−</sup>).

The prediction is based on estimating the probabilities (denoted by *p*<sup>+</sup> and *p*−) that a high-dimensional instance corresponding to a particular geographic location (as described in Section 3.1) belongs to the positive and negative classes. The method for computing these probabilities will be detailed in Section 3.2.2.

#### Autoencoders Architecture

The current study uses convolutional undercomplete AEs to learn meaningful lower-dimensional representations for radar data. The autoencoders were implemented in Python, using the Keras framework with the TensorFlow backend. Both autoencoders (*CA*<sup>+</sup> and *CA*<sup>−</sup>) have the same architecture. The input data of the AEs is the 2D grid of the neighborhood of diameter *d* for one location (as exemplified in Figure 2b), i.e., the 2D grid representing the values of an instance from R<sup>+</sup> ∪ R<sup>−</sup>. As we have to choose a different diameter *d* for our experiments on different data sets (see Section 4.1), we designed the architecture so that it changes minimally with *d*: while the number, type and hyper-parameters of each layer of the network remain the same, the number of neurons on each layer changes proportionally, depending on *d*.

Even if the architecture of the autoencoder may be adapted to the diameter *d* of the neighborhood (i.e., the dimensionality *d*<sup>2</sup> of the input data), the value of *d* may influence the performance of the *AutoNowP* model. Intuitively, high values of *d* make it harder for the AEs to distinguish between positive and negative instances. This may happen since, hypothetically speaking, two neighboring points at time *t* (one positive and one negative) could have a large number of identical neighbors at time *t* − 1 (i.e., the data instances representing the two locations would be similar), so that the AEs are unable to distinguish between them. On the other hand, a small number of neighbors for a data point (i.e., small values of *d*) is not enough for the *AutoNowP* classifier to discriminate between the input instances. For determining the most appropriate value of the diameter *d*, a grid search was performed to select the value that provides the best performance for *AutoNowP*.

In the following, we present the architecture of the autoencoders and the hyperparameters used, without mentioning the number of neurons, so that the description is valid for the *AutoNowP* model in general, regardless of the specific experiment. Figure 3 illustrates the architecture of the autoencoder (as mentioned above, both autoencoders, *CA*<sup>−</sup> and *CA*<sup>+</sup>, have the same architecture). This is a Convolutional Autoencoder, thus the main layers are the Conv2D layers (2-dimensional convolution layers), represented in yellow in the figure. These layers reduce the input data grid in three steps, leading to an encoding layer (the blue layer in the figure). From the encoding, the autoencoder needs to recreate the input, thus the inverse of Conv2D is needed: Conv2DTranspose (the orange layers). Using the Conv2DTranspose layers, we apply the reverse transformation to recreate the data grid as it was before the convolutions. When using convolutions, we need to reduce the size of the image, and this works best if the size of the image is even. However, our input layer always has an odd size: since the input represents the neighborhood of one point, with that point in the center, for a given radius *r* the size will be (2*r* + 1, 2*r* + 1); that is, we take *r* neighbors on each side of the center (for example, *r* neighbors on the right and *r* neighbors on the left plus the center itself result in a length of 2*r* + 1). Since the input always has an odd size, we need to adjust it so that we can perform the convolutions. For this, we use a ZeroPadding2D layer: after the Input layer (the first gray layer), we pad the margins of the data grid with zeros until it reaches the desired size, using the ZeroPadding2D layer (the red layer). Afterwards, the convolutions can occur. The transpose convolutions will recreate the data grid as it was before the convolutions (that is, after padding), so it is not the same size as the input. Since it is an autoencoder, we want to match the output to the input; thus, we need to adjust the output of the transpose convolutions so that the final size of the autoencoder output fits the size of its input. To readjust the size, we use a Cropping2D layer, which is also the output layer of the autoencoder (the second gray layer represented in the figure).

**Figure 3.** Architecture of a Convolutional Autoencoder (*CAc*).

As with other neural networks, while the architecture is the principal element of the network, there are other hyperparameters that need to be tuned and that change the network's behavior. One of these is the number of neurons on each hidden layer; as mentioned above, this number may differ among the experiments if the input size changes, but while the absolute number changes, the proportion of neurons on the hidden layers is preserved. Then, there is the activation used for the layers: for all convolutional layers, transpose convolutional layers and dense layers, except for the last transpose convolutional layer, we use the SELU activation function (Scaled Exponential Linear Unit [27]). For the last transpose convolutional layer, we used the sigmoid activation function, so that the output of the autoencoder is between 0 and 1, as is the input. For all convolutional and transpose convolutional layers, we used a kernel size of 4 and a stride of 2.
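Under these constraints, the architecture could be sketched in Keras roughly as follows. This is a minimal sketch, not the paper's exact configuration: the filter counts and the precise padding/cropping placement are our assumptions; only the layer types, kernel size, stride and activations follow the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_conv_ae(d=7, filters=(32, 16, 8)):
    """Sketch of a CA_c autoencoder for an odd d x d input:
    pad to an even size, three strided Conv2D layers down to the encoding,
    mirrored Conv2DTranspose layers back, then crop to d x d.
    Filter counts are illustrative, not the paper's values."""
    pad = 1 if d % 2 else 0                    # make the spatial size even
    inp = layers.Input(shape=(d, d, 1))
    x = layers.ZeroPadding2D(((0, pad), (0, pad)))(inp)
    for f in filters:                          # encoder: kernel 4, stride 2, SELU
        x = layers.Conv2D(f, 4, strides=2, padding="same", activation="selu")(x)
    for f in reversed(filters[:-1]):           # decoder: transpose convolutions
        x = layers.Conv2DTranspose(f, 4, strides=2, padding="same",
                                   activation="selu")(x)
    x = layers.Conv2DTranspose(1, 4, strides=2, padding="same",
                               activation="sigmoid")(x)  # output in [0, 1]
    out = layers.Cropping2D(((0, pad), (0, pad)))(x)     # back to d x d
    return keras.Model(inp, out)
```

For *d* = 7 the padded 8 × 8 grid is reduced to a 1 × 1 encoding in three stride-2 steps and reconstructed symmetrically, with the final Cropping2D restoring the 7 × 7 input size.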

The training configuration was the following: we used a batch size of 1024 and trained each autoencoder for 500 epochs on the NMA data set and for 200 epochs on the MET data set; the Adam optimizer [28] was used with learning rates of 0.01 and 0.001 for the NMA and MET data sets, respectively, and an *epsilon* of 0.00001.

#### Loss Functions

As explained in Section 3.1, the high-dimensional input instance *x* may be visualized as a data grid, i.e., the neighborhood around the location of the value we want to predict. The autoencoders learn to encode and decode each instance, the output of the autoencoder being the reconstruction of the instance. The loss functions represent the difference between the original instances and their reconstructions; lower values of the loss indicate better reconstructions (i.e., closer to the input), with a loss equal to 0 meaning no difference. The loss is based on a modified mean squared error (MSE) that assigns a priority to the values greater than the threshold *τ* relative to the other values. More specifically, we wanted the autoencoders to be able to prioritize values in the neighborhood that are either greater than, or less than or equal to, the given threshold *τ*. We also wanted to be able to change this prioritization between *CA*<sup>−</sup> and *CA*<sup>+</sup> (i.e., *CA*<sup>−</sup> is trained to prioritize negative points, while *CA*<sup>+</sup> is trained by over-weighting positive points in the neighborhood) and between experiments, so we introduced a parameter, *α*, that controls this prioritization. We split the computation of the MSE into two parts: the MSE for values greater than *τ* (Formula (1)) and the MSE for values less than or equal to *τ* (Formula (2)). The final loss value (Formula (3)) is expressed as a linear combination of the two separately computed MSEs; the *α* parameter decides how to prioritize the values greater than *τ* relative to the values less than or equal to *τ*. The exact way to compute the loss function *L*(*x*, *x*′) for a given instance *x* ∈ R<sup>+</sup> ∪ R<sup>−</sup> is given by Formulae (1)–(3):

$$MSE\_{\text{greater}}(\mathbf{x}, \mathbf{x}') = \frac{1}{d^2} \sum\_{\substack{1 \le i \le d^2 \\ \mathbf{x}\_i > \tau}} (\mathbf{x}\_i - \mathbf{x}\_i')^2 \tag{1}$$

$$MSE\_{\text{lesser}}(\mathbf{x}, \mathbf{x}') = \frac{1}{d^2} \sum\_{\substack{1 \le i \le d^2 \\ \mathbf{x}\_i \le \tau}} (\mathbf{x}\_i - \mathbf{x}\_i')^2 \tag{2}$$

$$L(\mathbf{x}, \mathbf{x}') = \alpha \cdot MSE\_{\text{greater}}(\mathbf{x}, \mathbf{x}') + (1 - \alpha) \cdot MSE\_{\text{lesser}}(\mathbf{x}, \mathbf{x}') \tag{3}$$

where:

- *x* is an instance from R<sup>+</sup> ∪ R<sup>−</sup> and *x*′ its reconstruction by the autoencoder;
- *x<sub>i</sub>* and *x*′*<sub>i</sub>* denote the *i*-th components of *x* and *x*′;
- *τ* is the considered threshold;
- *α* ∈ [0, 1] is the parameter that controls the prioritization of the values greater than *τ*.
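Formulae (1)–(3) can be implemented directly. The NumPy sketch below operates on flattened instances; it is only an illustration, as the paper implements the loss inside Keras:

```python
import numpy as np

def weighted_mse_loss(x, x_rec, tau, alpha):
    """Loss of Formulae (1)-(3): the squared errors are split at the
    threshold tau (on the original instance values) and recombined
    with weight alpha; both parts are normalized by d^2."""
    x, x_rec = np.asarray(x, float), np.asarray(x_rec, float)
    d2 = x.size                          # d^2, the instance dimensionality
    sq = (x - x_rec) ** 2
    mse_greater = sq[x > tau].sum() / d2    # Formula (1)
    mse_lesser = sq[x <= tau].sum() / d2    # Formula (2)
    return alpha * mse_greater + (1 - alpha) * mse_lesser  # Formula (3)
```

With *α* > 0.5 the reconstruction of values above *τ* is prioritized (as for *CA*<sup>+</sup>), while *α* < 0.5 over-weights the values below the threshold (as for *CA*<sup>−</sup>).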
#### 3.2.2. Classification Using *AutoNowP*

After *AutoNowP* has been trained as described in Section 3.2.1, when an unseen query instance *q* has to be classified, the probabilities *p*+(*q*) (that *q* belongs to the positive class) and *p*−(*q*) (that *q* belongs to the negative class) are computed. As shown above, a query instance *q* is a high-dimensional vector (Section 3.1) consisting of radar products values from the neighborhood of a specific geographical location *l* at time *t*. *AutoNowP* will classify *q* as "+" (i.e., the value of the radar product *Rp* at time *t*+1 is likely to be higher than the threshold *τ*) iff *p*+(*q*) ≥ *p*−(*q*), i.e., *p*+(*q*) ≥ 0.5.

The underlying idea behind deciding that a query instance *q* is likely to belong to the "+" class (i.e., *p*+(*q*) ≥ *p*−(*q*)) is the following. We started from the assumption that an AE is able to encode well the structure of the class of instances it was trained on and, consequently, to accurately reconstruct data similar to its training data. Conversely, the AE will be unable to reconstruct, through its learned latent space representation, instances that are dissimilar to the training data (i.e., likely to belong to a class other than the one on which the AE was trained). Thus, if for a certain instance *q* the MSE between *q* and the reconstruction of *q* by *CA*<sup>+</sup> is less than the MSE between *q* and the reconstruction of *q* by *CA*<sup>−</sup>, then the query instance is likely to belong to the "+" class, as it is more similar to the information encoded for the positive class.

**Definition 1.** *Let us denote by MSEc*(*q*ˆ, *q*) *the MSE between q and its reconstruction q*ˆ *by the autoencoder CAc (c* ∈ {+, −}*) and by τ the threshold considered. The probabilities p*+(*q*) *and p*−(*q*) *are computed as given in Formulae (4) and (5).*

$$p\_{+}(q) = 0.5 + \frac{MSE\_{-}(\hat{q}, q) - MSE\_{+}(\hat{q}, q)}{2 \cdot (MSE\_{-}(\hat{q}, q) + MSE\_{+}(\hat{q}, q))}\tag{4}$$

$$p\_{-}(q) = 1 - p\_{+}(q). \tag{5}$$

From Formula (4) we observe that 0 ≤ *p*+(*q*) ≤ 1 and that if *MSE*+(*q*ˆ, *q*) ≤ *MSE*−(*q*ˆ, *q*), then *p*+(*q*) ≥ 0.5, meaning that *q* is classified by *AutoNowP* as positive. Moreover, we note that *p*+(*q*) = 1 when *MSE*+(*q*ˆ, *q*) = 0 (i.e., *q* is perfectly reconstructed by *CA*+) and, symmetrically, *p*+(*q*) = 0 when *MSE*−(*q*ˆ, *q*) = 0.


After the probabilities *p*+(*q*) and *p*−(*q*) have been computed, the classification *c*(*q*) of *q* is obtained as shown in Formula (6).

$$c(q) = \begin{cases} + & \text{if } p\_+(q) \ge 0.5 \\ - & \text{otherwise.} \end{cases} \tag{6}$$
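The decision rule of Formulae (4)–(6) reduces to a comparison of the two reconstruction errors. A minimal sketch follows; the handling of the degenerate case where both errors are zero is our assumption, as the paper does not discuss it:

```python
def classify(mse_pos, mse_neg):
    """Formulae (4)-(6): class probabilities from the reconstruction
    errors of CA+ and CA-, and the resulting class label."""
    total = mse_pos + mse_neg
    if total == 0:            # both reconstructions perfect (our assumption)
        p_pos = 0.5
    else:
        p_pos = 0.5 + (mse_neg - mse_pos) / (2 * total)  # Formula (4)
    p_neg = 1 - p_pos                                     # Formula (5)
    return ("+" if p_pos >= 0.5 else "-"), p_pos, p_neg   # Formula (6)
```

Note that *p*+(*q*) ≥ 0.5 exactly when *MSE*+ ≤ *MSE*−, i.e., when *CA*<sup>+</sup> reconstructs the query better than *CA*<sup>−</sup>.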

#### *3.3. Testing*

After *AutoNowP* was trained as described in Section 3.2.1, it is evaluated on 33% of the instances from each data set *R*<sup>+</sup> and *R*<sup>−</sup> that were unseen during the training stage. The classification of a query instance *q* is made as described in Section 3.2.2.

For evaluating the performance of *AutoNowP* on a testing data set, the confusion matrix is computed [29], comprising the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Then, based on the values from the confusion matrix, evaluation measures used for assessing the performance of supervised classifiers and weather predictors are employed, including the positive and negative predictive values (*PPV*, *NPV*), specificity (*Spec*), probability of detection (*POD*), false alarm ratio (*FAR*), and *AUC*.


All these measures take values in the [0, 1] range, with higher values indicating better predictors, except for *FAR*, which should be minimized for better performance.

A three-fold cross-validation testing methodology is then applied. The values of each of the performance measures previously described are averaged over the three runs. The mean values are computed together with their 95% confidence intervals (CI) [31].
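For reference, the confusion-matrix-based scores commonly used in forecast verification can be computed as below. These are the conventional definitions; the paper's exact measure set may differ in detail:

```python
def verification_scores(tp, tn, fp, fn):
    """Standard confusion-matrix-based forecast verification scores
    (conventional definitions; may differ from the paper's exact set)."""
    return {
        "POD":  tp / (tp + fn),       # probability of detection (recall)
        "FAR":  fp / (tp + fp),       # false alarm ratio (lower is better)
        "CSI":  tp / (tp + fp + fn),  # critical success index
        "Spec": tn / (tn + fp),       # specificity (true negative rate)
        "PPV":  tp / (tp + fp),       # positive predictive value (precision)
        "NPV":  tn / (tn + fn),       # negative predictive value
    }
```

As noted above, *FAR* is the only score for which lower values are better.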

#### **4. Data and Experiments**

In this section, we answer research question **RQ2** by describing the experiments conducted for evaluating the performance of *AutoNowP* and analyzing the obtained experimental results.

#### *4.1. Data Sets*

For assessing the performance of *AutoNowP*, experiments were conducted on real radar data provided by the Romanian National Meteorological Administration (NMA) and the Norwegian Meteorological Institute (MET).

#### 4.1.1. NMA Radar Data Set

The NMA radar data set was collected over central Romania by a single polarization S-band Weather Surveillance Radar-98 Doppler (WSR-98D) located near the village of Bobohalma. The radar completes a full volume scan every 6 min, gathering data about the location, intensity, movement direction, and speed of atmospheric cloud systems. Volume scan data is collected by employing a scan strategy consisting of 9 elevation angles, the raw data being afterwards processed to compute a large variety of radar products. For the *AutoNowP* experiments, we used the base Reflectivity product (R) sampled at the lowest elevation angle (R01), expressed in decibels relative to the reflectivity factor Z (dBZ). Using the so-called Z-R relationships, the base reflectivity is used to derive the rainfall rate and, further, the radar-estimated precipitation accumulation over a given area and time interval.

The radar data set used herein contains the quality controlled (cleaned) values of the raw R01 product. The cleaning is needed because, during radar scans, both meteorological and nonmeteorological targets can be detected. Various clutter sources (e.g., terrain, buildings), biological targets (e.g., insects, birds) and external electromagnetic sources (e.g., the sun) can impact the data quality within the volume scan, and although the signal processing can effectively mitigate the effects of this data contamination, additional processing is required to identify and remove the residual nonmeteorological echoes. Herein, the quality control algorithm is applied as a two-step process: first, detecting and removing the contaminated radar data; second, tuning the key variables to mitigate the effects of the first step on good data. The method used to clean and filter the reflectivity data is based on the three-dimensional structure of the measured data, in terms of computing horizontal and vertical data quality parameters. The algorithm is executed on radar data projected on a polar grid, so as not to alter the measurements and to remain at the level of data recording, and it is built considering various key quality issues such as ground clutter echoes and external electromagnetic interference. First, the radar data is passed through a noise filter to remove isolated ground clutter reflectivity bins; then, the algorithm identifies and removes echoes generated by external signals and calculates the horizontal texture and the vertical gradient of reflectivity. The outputs of these steps (i.e., sub-algorithms) are finally used to reconstruct the quality-controlled reflectivity field.

Within *AutoNowP*, the NMA radar data was processed using a value of 7 for the diameter *d* of the neighborhood (introduced in Section 3.1), representing about 7 km on the physical map, a distance over which the meteorological parameters typically show small gradients [30]. The value of 7 for *d* provided the best performance for *AutoNowP*.

#### 4.1.2. MET Radar Data Set

The MET radar data set used in our experiments consists of composite reflectivity values gathered from the MET Norway Thredds Data Server [32].

The reflectivity product, available at [33] was derived from the raw reflectivity values by considering the best radar scan out of all considered elevations. Thus, it is a composite product, obtained by applying an interpolation scheme that weights radar volume sources differently based on their quality flags and various properties that may influence the measurement. The considered properties include ground or sea clutter, ships or airplanes, beam blockage, RLAN, sun flare, height above CAPPI level (typically 1000 m msl), range, and azimuth displacement. The measurements used in our experiments were collected by the radar at a time resolution of 7.5 min.

The dimension *d* of the neighborhood data grid was set to 15 for the MET experiment, since this dimensionality provided the best performance for *AutoNowP*.

Table 1 describes the data sets used as our case studies. The second column in the table indicates the radar product *Rp* of interest. The next three columns contain the number of instances in the data sets (both "+" and "−") and the percentages of positive and negative instances obtained using a threshold of 10 dBZ. The last column illustrates the entropy of each data set. The entropy is used for measuring the imbalance of each data set [34]: lower entropy values indicate a higher degree of imbalance.


**Table 1.** Description of the data sets.

From Table 1 we can see that the NMA data set is severely imbalanced: only 3.44% of the instances belong to the positive class, leading to a negative to positive ratio of about 28:1. Another element that highlights the high degree of data imbalance is the entropy: where an entropy value of 1 reflects a perfectly balanced data set, the NMA data set entropy of 0.216 reflects a data set with low diversity, heavily weighted in favor of one class to the detriment of the other. The MET data set, on the other hand, showed a higher proportion of positive samples for this choice of threshold, as reflected by a higher entropy. In this setting, the negative to positive ratio is approximately 2:1.
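The entropy-based imbalance measure can be checked directly: for a binary class distribution, the base-2 Shannon entropy of the NMA class proportions reproduces the 0.216 value quoted above.

```python
import math

def class_entropy(p_pos):
    """Base-2 Shannon entropy of a binary class distribution, used as
    an imbalance measure: 1 means perfectly balanced, values near 0
    mean one class dominates."""
    p_neg = 1 - p_pos
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)
```

For example, `class_entropy(0.0344)` (3.44% positive instances, as in the NMA data set) evaluates to approximately 0.216, while a balanced 50/50 split gives an entropy of 1.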

The two-dimensional PCA [35] projections of the instances from both NMA and MET data sets from Figure 4 highlight the difficulty of the classification task. For both data sets, there is a low degree of separation between the class of negative instances (blue colored) and the class of positive instances (red colored).

**Figure 4.** 2D PCA visualization of the NMA data set (**a**) and MET data set (**b**).

The NMA data sets used in our experiments are publicly available at [36], while the MET data is publicly available at [37].

#### *4.2. Results*

This section presents the experimental results obtained by applying the *AutoNowP* classifier on the data sets described in Section 4.1. For the ConvAEs, the implementation from the Keras deep learning API [38] using the TensorFlow framework was employed.

The experiments were performed on a workstation laptop, with an Intel i9-10980HK CPU, 32 GB RAM and Nvidia RTX 2080 Super for GPU acceleration; and on a Google cloud instance with 12 vCPUs, 64 GB RAM and access to a Nvidia Tesla V100 for GPU acceleration.

The evaluation measures and the testing methodology described in Section 3.3 were employed. Table 2 depicts the obtained results for both data sets used in our case studies, for various values of the threshold *τ*. The 95% confidence intervals (CIs) are used for the results.

The thresholds we decided to use were chosen considering both computational and meteorological factors. In the literature, there is no convention on thresholds for R. For example, Han et al. [9,39] chose to use the 35 dBZ threshold while Tran and Song [40] studied their prediction performance using the 5, 20 and 40 dBZ thresholds. Thus, the values 10, 20 and 30 were chosen for *τ* for the NMA data and 10, 15, 20 for the MET data set. Since the MET data contains few instances whose values are higher than 30 dBZ, *AutoNowP* could not be applied for this threshold. The best values obtained for the evaluation measures are highlighted for both data sets.


**Table 2.** Experimental results, using 95% CIs.

As shown in Table 2, the values of most evaluation measures decrease as the threshold *τ* increases. This is expected behavior, as the prediction becomes more difficult for higher thresholds. The precision values (for both the positive and negative classes, *PPV* and *NPV*) and the true negative rate (*Spec*) increase for higher thresholds, denoting that the negative class is easier to predict for high values of *τ* and that the number of false predictions decreases. However, the number of true positives decreases significantly for higher thresholds, and this is reflected in the other performance metrics, which decrease. High values (around 0.9) were obtained for sensitivity (*POD*), specificity, and *AUC* for *τ* = 10, denoting a good performance of *AutoNowP*. In addition, the small 95% CIs reveal the stability of the model.

#### **5. Discussion**

With the goal of better highlighting the performance of *AutoNowP*, this section discusses the obtained results and then provides a comparison between *AutoNowP* and similar approaches from the nowcasting literature.

#### *5.1. Analysis of AutoNowP Performance*

As shown in Table 2, *AutoNowP* succeeds in recognizing the negative class (high specificity) and in detecting the positive class (probability of detection higher than 0.85 for *τ* = 10). This ability to detect severe phenomena well is a strength of *AutoNowP*. However, we observed false predictions for both the positive and negative classes, and these occur mostly close to the decision boundary. The performance of *AutoNowP* is mainly impacted by a fairly large number of false positive predictions, most of which appear near the edges of radar echoes. In these areas, the difference between classes becomes blurred: the neighborhood contains some high values, not enough for the instance to resemble the center of the event, but too many for it to lie clearly outside the event. Such neighborhoods are close to both classes, the dissimilarity between them and either class being small, and it is for these instances that *AutoNowP* makes the most prediction errors. To better understand the areas where these instances appear, we created the visualization in Figure 5. This figure shows the actual R01 values read by the radar in two consecutive time steps, color-coded by the dBZ value at each location. The white and black regions in the figure represent the regions where most of the errors made by *AutoNowP* appear. These regions were found by studying the erroneous predictions of the model and identifying the common elements of the problematic neighborhoods, both for false negative and for false positive errors. In Figure 5, a pixel was changed to white or black if its neighborhood is problematic, depending on whether it belongs to the false negative or, respectively, the false positive problems. In short, black points mark the locations that the model is highly likely to erroneously predict as positive and, similarly, white points mark those it tends to wrongly predict as negative. The black and white areas in the image account for more than 98% of *AutoNowP*'s errors.

(**a**) Error area analysis of R01 at time *t*. (**b**) Error area analysis of R01 at time *t* + 1.

**Figure 5.** Visualization of the *AutoNowP* error area analysis for two consecutive time steps. White marks areas where the model usually produces false negatives; black marks areas where it usually produces false positives.

In Figure 5, it can be observed that most errors appear either at the edges of meteorological events (mostly in the case of false positives) or in areas where there are few positive values (in the case of false negatives). In the case of false positives (in black), the problem areas reveal a tendency of the model to smooth out its predictions, i.e., to create much more uniform shapes. This effect is not specific to *AutoNowP*; it is a general problem affecting radar reflectivity prediction models (e.g., the RadRAR model [11]). In Figure 5, a region containing false positives is exemplified in the first highlighted region (the bigger one, around the pixel at (75,50)); the black region surrounds the actual meteorological event, smoothing it out and creating much more homogeneous shapes. This tendency persists from one time step to the next, with the smoothed shape closely following the real one.

In the case of false negatives (in white), the problems generally appear in areas where there are few positive values, i.e., where the neighborhoods of locations contain many zero or near-zero values and few values higher than the threshold. For these kinds of neighborhoods it is hard to differentiate between the classes, as they appear both at the start and at the end of meteorological events. The beginning of a meteorological event is especially hard to predict, as there is no indication of whether and where an event will form; for this reason, the model generally predicts locations with these kinds of neighborhoods as negative, introducing some false negative errors. In Figure 5, an example of a false negative region can be observed in the second highlight (the small one, around the pixel at (125,50)). In the first time step (left side) the meteorological event is small, while in the next time step (right side) its region has more than doubled in size. Since the event region is so small in the first time step, the model has difficulty predicting the relatively large changes that occur by the next time step, thus introducing false negative errors, visualized as white regions.

Analyzing the false negative predictions of *AutoNowP*, we also noticed (in both the NMA and MET experiments) situations such as the one depicted in Figure 6. The figure presents the composite reflectivity for two consecutive radar acquisitions from MET data. The red rectangles highlight a region that illustrates a sample case where *AutoNowP* provides false predictions.

**Figure 6.** Actual composite reflectivity values on two consecutive acquisitions (*t*—(**left**) side image— and *t* + 1—(**right**) side image) from MET data.

From Figure 6 one observes that at time *t* (left side image) there are no composite reflectivity values in the highlighted region, but at *t* + 1 (the next data received from the radar—right side image) high values for composite reflectivity are suddenly detected. Some of the data points inside the rectangle should be classified as positive instances (higher values are displayed in red), but the model fails to predict the correct class (i.e., the positive one) because the input for *AutoNowP* (the data at *t*) contained mostly zero-valued data. While these situations are relatively infrequent in real life (the values usually increase slowly between consecutive time stamps), they still contribute to a lower prediction accuracy. However, even if *AutoNowP* is unable to detect the positive instances at time step *t* + 1, in the next step, at time *t* + 2, the model will correctly classify the data points. This is not a limitation of *AutoNowP*, as such unexpected events cannot be detected by a learning model trained to predict time *t* + 1 based on time *t*. A possible solution would be to include more previous time steps in the prediction (*t* − 1, *t* − 2, etc.).

In order to assess how the cleaning of the raw radar data impacts the predictive performance of our model, *AutoNowP* was trained on the uncleaned NMA data as well. A threshold *τ* = 10 and the methodology introduced in Section 3 were applied for building the *AutoNowP* classification model on the uncleaned data. Table 3 depicts the obtained results. One observes a significant performance improvement on the cleaned data. For a specific evaluation measure *P*, the performance improvement is computed as (*P<sub>cleaned</sub>* − *P<sub>uncleaned</sub>*)/*P<sub>uncleaned</sub>* and is shown in the last row of the table.

**Table 3.** Experimental results obtained applying *AutoNowP* on the uncleaned NMA data and the improvement achieved on the cleaned data, for all performance measures.


Table 3 highlights an average improvement of 42% in the performance measures when using the cleaned data. The highest improvements are observed on *TSS* (96%), *POD* (89%) and *CSI* (69%), while the lowest improvements are on *PPV*, *NPV* and *Spec* (less than 5%). These variations occur because the uncleaned data introduces many false negative errors while introducing few false positive errors; thus, for measures reliant on false negatives, such as *POD*, the difference is large, while for measures reliant on false positives, such as *Spec*, the difference is small. We can speculate why this happens by analyzing the uncleaned data and how it might affect the model: as explained in Section 4.1, the cleaning of the NMA data removes noise and clutter introduced by the interference of nonmeteorological targets during the scan. Effectively, this means that in the uncleaned data there are many locations with wrong values, higher than zero instead of zero. Because of this, during training, the model receives many locations labeled as negative whose neighborhoods still contain a large number of high-valued locations (the erroneous values), thus leading the model to make false negative predictions (i.e., it will predict "−" even where there were actual meteorological events with a pattern similar to the erroneous training instances).
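As a sanity check, the relative-improvement computation above can be expressed in a few lines of Python; the helper name and the sample values are illustrative, not the actual Table 3 entries.

```python
def relative_improvement(p_cleaned: float, p_uncleaned: float) -> float:
    """Relative improvement of a measure: (P_cleaned - P_uncleaned) / P_uncleaned."""
    return (p_cleaned - p_uncleaned) / p_uncleaned

# Illustrative (cleaned, uncleaned) values only, not the Table 3 figures.
measures = {"POD": (0.87, 0.46), "Spec": (0.95, 0.93)}
for name, (clean, unclean) in measures.items():
    print(f"{name}: {relative_improvement(clean, unclean):+.0%}")
# POD: +89%
# Spec: +2%
```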

#### *5.2. Comparison to Related Work*

As shown in Section 2, most of the approaches introduced in the literature target precipitation nowcasting. The existing methods based on radar reflectivity nowcasting were applied to radar data collected from various geographical regions, using various parameter settings, testing methodologies and thresholds for the radar reflectivity values. The analysis of the recent literature highlighted *CSI* values ranging from 0.40 [20] to 0.647 [17], *POD* values ranging from 0.46 [20] to 0.71 [21], and *F*-*score* values ranging from 0.58 [15] to 0.786 [15]. The performance of *AutoNowP* on both data sets used in our experiments (Table 2) compares favorably with the literature results, considering the magnitude of the evaluation measures for a threshold of 10 (*CSI* higher than 0.61, *POD* higher than 0.87, *F*-*score* higher than 0.8).

As the literature approaches for nowcasting do not use the same data model as our approach, an exact comparison with these methods cannot be made. For a more precise comparison, we decided to apply four well-known machine learning classifiers to the data sets described in Section 4.1, using *τ* = 10 and following the testing methodology used for evaluating the performance of *AutoNowP* (the performance measures were computed as shown in Section 3.3 and the testing was repeated 3 times for each training–validation split): logistic regression (LR), linear support vector classifier (linear SVC), decision trees (DT), and nearest centroid classification (NCC). We selected these classifiers as baseline methods so as to cover a diverse set of approaches: linear, rule-based, and distance-based classifiers.

These classifiers were implemented in Python using the scikit-learn [41] machine learning library. The comparative results are depicted in Table 4, with 95% confidence intervals (CIs) for the values averaged over the three runs of the classifiers. The best values obtained for each performance metric are highlighted.
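A minimal sketch of this baseline setup follows, using the four scikit-learn classifiers named above; synthetic data stands in for the radar-derived features, which are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the radar-derived feature vectors.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

baselines = {
    "LR": LogisticRegression(max_iter=1000),
    "linear SVC": LinearSVC(max_iter=5000),
    "DT": DecisionTreeClassifier(random_state=0),
    "NCC": NearestCentroid(),
}

# The testing is repeated 3 times, as in the methodology described above.
for run in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=run)
    for name, clf in baselines.items():
        pod = recall_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))  # POD = recall
        print(f"run {run}, {name}: POD = {pod:.3f}")
```

The remaining measures of Table 4 (*Spec*, *CSI*, *TSS*, etc.) can be derived in the same way from the confusion-matrix counts.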

**Table 4.** Comparative results between *AutoNowP* and other classifiers. 95% CIs are used for the results.


The comparative results from Table 4 reveal that *AutoNowP* obtained the best results in terms of *POD* and *NPV* for both data sets. In addition, for the NMA data set, our classifier provided the highest *TSS* and *AUC* values. Figures 7 and 8 illustrate the ROC curves for the classifiers from Table 4 on the NMA and MET data sets.

**Figure 7.** ROC curves for the classifiers from Table 4 on NMA data set.

**Figure 8.** ROC curves for the classifiers from Table 4 on MET data set.
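ROC curves such as those in Figures 7 and 8 can be produced with scikit-learn's `roc_curve`; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]   # positive-class probabilities
fpr, tpr, _ = roc_curve(y_te, scores)    # points of the ROC curve
auc = roc_auc_score(y_te, scores)        # area under the curve
print(f"AUC = {auc:.3f}")
# Plotting the curve would follow with, e.g., matplotlib: plt.plot(fpr, tpr).
```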

Table 5 summarizes the results of the comparison between *AutoNowP* and the classifiers from Table 4. The table indicates, for both the NMA and MET data sets, the number of comparisons **won** (first row) and **lost** (second row) by *AutoNowP* considering all the evaluation measures and the classifiers from Table 4. More specifically, a comparison between our approach and a classifier *c*, considering a specific performance measure *p*, is won by *AutoNowP* if the value for *p* provided by *AutoNowP* is greater than the one provided by the classifier *c*. Similarly, the comparison is lost by *AutoNowP* if the value for *p* provided by *AutoNowP* is lower than the one provided by the classifier *c*.


**Table 5.** Summary of the comparison between *AutoNowP* and existing classifiers.

The results from Table 5 highlight that *AutoNowP* outperforms similar classifiers in 66% of the cases for the NMA data set and in 50% of the cases for the MET data set. Overall, out of 64 comparisons, our *AutoNowP* approach wins in 37 cases, i.e., in 58% of the cases.

One of the main current limitations of *AutoNowP* is the training data: in order for the model to achieve high performance, it needs to be trained on large amounts of relevant data. While there are large amounts of historical meteorological data, finding a cohesive set of relevant, high-quality data is not trivial. Due to the large training data set needed, the training process of *AutoNowP* tends to take considerable time, which may hamper the practicality of the model. The data model may be another drawback of *AutoNowP*: as it is currently designed, it might lead to the confounding of the two classes in some special cases, as presented in Section 5.1. Nevertheless, these limitations can be addressed, which we plan to do in the future: the long training time can be reduced by parallelizing the training process, while the data model can be improved, for example, by extending it to contain more than one previous time step.

#### **6. Conclusions and Future Work**

The paper introduced *AutoNowP*, a new binary classification model for precipitation nowcasting based on radar reflectivity. *AutoNowP* uses two convolutional autoencoders, trained on radar data collected in both stratiform and convective weather conditions, to learn to predict whether the radar reflectivity value at a specific location will be above or below a certain threshold. *AutoNowP* was introduced in this paper as a proof of concept that autoencoders are helpful in distinguishing between convective and stratiform rainfall. Experiments performed on radar data provided by the Romanian National Meteorological Administration and the Norwegian Meteorological Institute highlighted that the ConvAEs used in *AutoNowP* are able to learn structural characteristics from radar data and, thus, that the lower-dimensional radar data encoded in the ConvAEs' latent space is consistent with the meteorological evidence.

The generality of the *AutoNowP* classifier should be noted. Even though it was introduced and evaluated in the context of precipitation nowcasting, it may be extended and applied to other meteorological data sources and binary classification tasks.

*AutoNowP* is one step toward the end goal of our research: to create machine-learning-based prediction models to be integrated into existing national weather nowcasting systems. The integration of these models aims to improve Early Warning System frameworks, as the predictions create the possibility of issuing more accurate early warnings. Better early warnings can help avoid loss and damage due to heavy precipitation, for example in events such as flash floods in densely populated areas [42].

Future work will be conducted to extend the data sets used in the experimental evaluation. In addition, we aim to apply *AutoNowP* to other meteorological data sources (such as satellite data) and thus use the model in other nowcasting scenarios.

**Author Contributions:** Conceptualization, G.C., A.-I.A., A.M. (Andrei Mihai) and I.-G.C.; methodology, G.C., A.-I.A., A.M. (Andrei Mihai) and I.-G.C.; software, A.-I.A., A.M. (Andrei Mihai) and I.-G.C.; validation, G.C., A.-I.A., A.M. (Andrei Mihai) and I.-G.C.; formal analysis, G.C., A.-I.A., A.M. (Andrei Mihai) and I.-G.C.; investigation, G.C., A.-I.A., A.M. (Andrei Mihai) and I.-G.C.; resources, G.C., A.-I.A., A.M. (Andrei Mihai), I.-G.C., S.B. and A.M. (Abdelkader Mezghani); data curation, S.B.; writing—original draft preparation, G.C.; writing—review and editing, G.C., A.-I.A., A.M. (Andrei Mihai), S.B. and A.M. (Abdelkader Mezghani); visualization, G.C., A.-I.A., A.M. (Andrei Mihai) and I.-G.C.; funding acquisition, G.C., A.-I.A., A.M. (Andrei Mihai), I.-G.C., S.B. and A.M. (Abdelkader Mezghani). All authors have read and agreed to the published version of the manuscript.

**Funding:** The research leading to these results has received funding from the NO Grants 2014–2021, under Project contract No. 26/2020.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The NMA data sets used in our experiments are publicly available at [36], while the MET data is publicly available at [37].

**Acknowledgments:** The authors would like to thank the editor and the anonymous reviewers for their useful suggestions and comments that helped to improve the paper and the presentation. The research leading to these results has received funding from the NO Grants 2014–2021, under Project contract No. 26/2020.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


## *Article* **Statistical Machine Learning in Model Predictive Control of Nonlinear Processes**

**Zhe Wu <sup>1</sup>, David Rincon <sup>1</sup>, Quanquan Gu <sup>2</sup> and Panagiotis D. Christofides <sup>1,3,</sup>\***


**Abstract:** Recurrent neural networks (RNNs) have been widely used to model nonlinear dynamic systems using time-series data. While the training error of neural networks can be rendered sufficiently small in many cases, there is a lack of a general framework to guide the construction and determine the generalization accuracy of RNN models to be used in model predictive control systems. In this work, we employ statistical machine learning theory to develop a methodological framework of generalization error bounds for RNNs. The RNN models are then utilized to predict state evolution in model predictive controllers (MPC), under which closed-loop stability is established in a probabilistic manner. A nonlinear chemical process example is used to investigate the impact of training sample size, RNN depth, width, and input time length on the generalization error, along with analyses of probabilistic closed-loop stability through closed-loop simulations under Lyapunov-based MPC.

**Keywords:** generalization error; recurrent neural networks; machine learning; model predictive control; nonlinear systems

#### **1. Introduction**

Modeling large-scale, complex nonlinear processes has been a long-standing research problem in process systems engineering. The traditional approaches to modeling nonlinear processes include the data-driven modeling approach, with parameters identified from industrial/simulation data [1,2], and the first-principles modeling approach, based on a fundamental understanding of the underlying physico-chemical phenomena. While the traditional first-principles modeling approach has been used extensively in monitoring, control and optimization of chemical processes, modeling complex nonlinear processes with first-principles tools can be time-demanding and inaccurate. Machine learning methods have been increasingly adopted to model complex nonlinear systems due to their ability to represent a rich set of nonlinear functions and to efficiently handle big datasets from processes [3–10]. Among the many machine learning modeling techniques, the recurrent neural network (RNN) is widely used to model nonlinear dynamic systems using time-series data [11–13]. While the history of machine learning methods in chemical process control can be traced back to the 1990s [14–18], machine learning has become popular again this decade due to a number of reasons, such as cheaper computation (mature and efficient libraries/hardware), the availability of large datasets, and advanced learning algorithms. Designing MPC systems that utilize machine learning models with well-characterized accuracy is a new frontier in control systems that will impact the next generation of industrial control systems.

Despite the success of machine learning methods in modeling nonlinear chemical processes in the context of MPC, there remain fundamental challenges that limit the

**Citation:** Wu, Z.; Rincon, D.; Gu, Q.; Christofides, P.D. Statistical Machine Learning in Model Predictive Control of Nonlinear Processes. *Mathematics* **2021**, *9*, 1912. https://doi.org/ 10.3390/math9161912

Academic Editor: Freddy Gabbay

Received: 29 July 2021 Accepted: 9 August 2021 Published: 11 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

implementation of machine-learning-based MPC in real chemical processes. One important challenge is to characterize the generalization ability on unseen data of machine learning models trained using finite training samples. Furthermore, a theoretical analysis of closed-loop stability for MPC using machine learning models needs to be developed via machine learning and control theory. Typically, theoretical developments on machine-learning-based MPC have derived closed-loop stability properties based on the assumption of bounded modeling errors. For example, in [9], a Lyapunov-based MPC scheme using RNN models as the prediction model was developed with guaranteed closed-loop stability by assuming that the RNN models are able to achieve a sufficiently small and bounded testing error. Similarly, a neural Lyapunov MPC that trains a stabilizing nonlinear MPC based on a surrogate model and a neural-network-based terminal cost was proposed in [19], with stability properties derived by assuming the boundedness of the modeling error. Additionally, in [20], a nonparametric machine learning model was implemented together with MPC, for which input-to-state stability was evaluated. In [21], a learning-based MPC targeting deterministic linear models was proposed, for which safety, stability, and robustness were proved. However, the fundamental question regarding the generalization accuracy of machine learning models in MPC has not been addressed.

Probably approximately correct (PAC) learning theory is a framework that mathematically analyzes the generalization ability of machine learning models [22]. Specifically, in PAC learning, given a set of training data, the learner is supposed to choose, from a certain class of hypotheses, the optimal hypothesis (i.e., machine learning model) that yields a low generalization error with high probability. Therefore, PAC learning theory provides a useful tool for determining under what conditions a learning algorithm will probably output an approximately correct hypothesis. For example, in [23], PAC learning theory was used to study the learnability of a compression learning algorithm for the optimization problem of stochastic MPC using a finite number of realizations of the uncertainty. In [24], PAC learning was used to analyze the generalization performance of a convex piecewise linear classifier that classifies thermal comfort in an HVAC system. However, to the best of our knowledge, the use of statistical machine learning theory in analyzing the stability properties of machine learning models in MPC and in guiding machine learning model structure and training data collection has not been fully explored.

Many recent works have characterized the learnability of neural networks in terms of sample complexity and generalization error [25–32]. The generalization error bound is a common methodology in statistical machine learning for evaluating the predictive performance of machine learning algorithms [33]. This bound depends on a number of factors, such as the number of data samples, the number of layers and neurons, the bounds of the weight matrices, and the initialization method, among others. For example, in [29], a generalization error bound was developed for a family of RNN models including vanilla RNNs, long short-term memory networks, and minimal gated units. The generalization error bound was established for multiclass classification problems and depends on the total number of network parameters and the spectral norms of the weight matrices. In [27], a sample complexity bound that is fully independent of network depth and width under some assumptions was developed for feedforward neural networks. In [34], an expected risk bound was developed for RNNs that model single-output nonlinear dynamic systems. However, at this stage, generalization error bounds for RNNs that model multiple-input and multiple-output (MIMO) nonlinear dynamic systems using time-series data have not been studied.

Motivated by the above, in this work, we develop the methodological framework of generalization error bounds from machine learning theory for the development and verification of RNN models with specific theoretical accuracy guarantees and integrate these models into model predictive control system design for nonlinear chemical processes. Specifically, in Section 2, the class of nonlinear systems, the formulation of RNNs, along with some general assumptions on system stabilizability and RNN development are presented. In Section 3, preliminaries including some important definitions and lemmas are

first presented, followed by the development of a probabilistic generalization error bound for RNN models that accounts for the impact of training data size and of the number of neurons and layers on accuracy, and that guides network structure selection and training. In Section 4, the RNN models are incorporated in the MPC formulation, under which probabilistic closed-loop stability is derived based on the RNN generalization error bound. Finally, in Section 5, a chemical reactor example is used to demonstrate the impact of training sample size, RNN depth and width, and input time length on the generalization error. Additionally, closed-loop simulations are carried out to analyze the probabilistic closed-loop stability and performance.

#### **2. Preliminaries**

#### *2.1. Notation*

The Frobenius norm of a matrix $A$ is denoted by $\|A\|\_F$. The Euclidean norm of a vector is denoted by the operator $|\cdot|$ and the weighted Euclidean norm of a vector by the operator $|\cdot|\_Q$, where $Q$ is a positive definite matrix. $\mathbf{R}^+$ denotes the set of nonnegative real numbers. $\mathbf{x}^T$ denotes the transpose of $\mathbf{x}$. The notation $L\_f V(\mathbf{x})$ denotes the standard Lie derivative $L\_f V(\mathbf{x}) := \frac{\partial V(\mathbf{x})}{\partial \mathbf{x}} f(\mathbf{x})$. Set subtraction is denoted by "$\backslash$", i.e., $A \backslash B := \{x \in \mathbf{R}^n \mid x \in A, x \notin B\}$. A function $f(\cdot)$ is of class $\mathcal{C}^1$ if it is continuously differentiable. A continuous function $\alpha : [0, a) \to [0, \infty)$ belongs to class $\mathcal{K}$ if it is strictly increasing and is zero only when evaluated at zero. A function $f : \mathbf{R}^n \to \mathbf{R}^m$ is said to be $L$-Lipschitz, $L \ge 0$, if $|f(a) - f(b)| \le L|a - b|$ for all $a, b \in \mathbf{R}^n$. $P(A)$ denotes the probability that event $A$ will occur. $E[X]$ denotes the expected value of a random variable $X$.

#### *2.2. Class of Systems*

The class of continuous-time nonlinear systems considered is described by the following state-space form:

$$\dot{x} = F(x, u) := f(x) + g(x)u, \ x(t\_0) = x\_0 \tag{1}$$

where $x \in \mathbf{R}^n$ and $u \in \mathbf{R}^k$ are the state vector and the manipulated input vector, respectively. The control action is constrained by $u \in U := \{u\_{\min} \le u \le u\_{\max}\} \subset \mathbf{R}^k$, where $u\_{\min}$ and $u\_{\max}$ represent the minimum and maximum input value vectors, respectively. $f(\cdot)$ and $g(\cdot)$ are sufficiently smooth vector and matrix functions of dimensions $n \times 1$ and $n \times k$, respectively. Without loss of generality, the initial time $t\_0$ is taken to be zero ($t\_0 = 0$), and it is assumed that $f(0) = 0$; thus, the origin is a steady state of the system of Equation (1).

We assume the system of Equation (1) is stabilizable in the sense that there exists a stabilizing controller $u = \Phi(x) \in U$ that renders the origin exponentially stable. The stabilizability assumption implies that there exists a $\mathcal{C}^1$ control Lyapunov function $V(x)$ such that, for all $x$ in an open neighborhood $D$ around the origin, the following inequalities hold:

$$c\_1|\mathbf{x}|^2 \le V(\mathbf{x}) \le c\_2|\mathbf{x}|^2,\tag{2}$$

$$\frac{\partial V(\mathbf{x})}{\partial \mathbf{x}} F(\mathbf{x}, \Phi(\mathbf{x})) \le -c\_3 |\mathbf{x}|^2,\tag{3}$$

$$\left|\frac{\partial V(x)}{\partial x}\right| \le c\_4|x|\tag{4}$$

where $c\_1$, $c\_2$, $c\_3$ and $c\_4$ are positive constants. Additionally, the Lipschitz property of $F(x, u)$ and the boundedness of $u$ imply that there exist positive constants $M\_F$, $L\_x$, $L'\_x$ such that the following inequalities hold for all $x, x' \in D$ and $u \in U$:

$$|F(x, u)| \le M\_F \tag{5}$$

$$|F(x, u) - F(x', u)| \le L\_x |x - x'| \tag{6}$$

$$\left|\frac{\partial V(x)}{\partial x} F(x, u) - \frac{\partial V(x')}{\partial x} F(x', u)\right| \le L'\_x |x - x'| \tag{7}$$

Following the data generation method in [9], open-loop simulations of the nonlinear system of Equation (1) are first conducted to generate a large dataset that captures the system dynamics for $x \in \Omega\_\rho$ and $u \in U$, where $\Omega\_\rho := \{x \in \mathbf{R}^n \mid V(x) \le \rho\}$, $\rho > 0$, is a compact set within which the system stability is guaranteed under the controller $u = \Phi(x) \in U$. Specifically, we sweep over all the values that $(x, u)$ can take by running extensive open-loop simulations of the system of Equation (1) under various $x\_0 \in \Omega\_\rho$ and inputs $u$ to generate a large number of dynamic trajectories. The open-loop simulation of the continuous system of Equation (1) under a sequence of inputs $u \in U$ is carried out in a sample-and-hold fashion (i.e., the inputs are fed into the system of Equation (1) as a piecewise constant function, $u(t) = u(t\_k)$, $\forall t \in [t\_k, t\_{k+1})$, where $t\_{k+1} := t\_k + \Delta$, and $\Delta$ is the sampling period). The nonlinear system of Equation (1) is integrated via the explicit Euler method with a sufficiently small integration time step $h\_c < \Delta$. Using the open-loop simulation data, recurrent neural network (RNN) models are developed to predict future states for (at least) one sampling period based on the current state measurements and the manipulated inputs that will be applied over the next sampling period. In other words, the RNN model is developed to predict $x(t)$, $\forall t \in [t\_k, t\_{k+1})$, based on the measurement $x(t\_k)$ and the inputs $u$ applied over $[t\_k, t\_{k+1})$. Finally, the time-series dataset is partitioned into three subsets for the purposes of training, validation and testing.
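The sample-and-hold open-loop simulation described above can be sketched in a few lines; the dynamics `f` and `g` below are illustrative placeholders, not the chemical process model of Section 5.

```python
import numpy as np

def f(x):                        # illustrative drift term f(x)
    return -x

def g(x):                        # illustrative input gain g(x)
    return np.ones_like(x)

def open_loop_trajectory(x0, u_seq, delta=0.01, hc=1e-4):
    """Explicit Euler simulation of x_dot = f(x) + g(x) u, with the input
    held piecewise constant over each sampling period delta."""
    steps_per_period = round(delta / hc)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for u in u_seq:              # u is held constant over [t_k, t_{k+1})
        for _ in range(steps_per_period):
            x = x + hc * (f(x) + g(x) * u)   # explicit Euler step
        traj.append(x.copy())
    return np.array(traj)

traj = open_loop_trajectory(x0=[0.2], u_seq=[0.0, -0.1, 0.1])
print(traj.shape)                # one row per sampling instant: (4, 1)
```

Running this for many initial conditions $x_0$ and input sequences yields the kind of trajectory dataset described above.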

#### *2.3. Recurrent Neural Network Model*

Consider an RNN model that approximates the nonlinear dynamics of the system of Equation (1) with $m$ sequences of $T$-time-length data points $(\mathbf{x}\_{i,t}, \mathbf{y}\_{i,t})$, where $\mathbf{x}\_{i,t} \in \mathbf{R}^{d\_x}$ is the RNN input and $\mathbf{y}\_{i,t} \in \mathbf{R}^{d\_y}$ is the RNN output, $i = 1, ..., m$ and $t = 1, ..., T$ (Figure 1). It should be noted that the RNN inputs and outputs do not necessarily represent the nonlinear system inputs and states/outputs in Equation (1). Therefore, to differentiate the notations for RNN inputs/outputs from those for the nonlinear system of Equation (1), all the vectors for RNN models are written in boldface. Additionally, to simplify the discussion, the RNN model of Equations (8) and (9) is developed to predict states over one sampling period with total time steps $T = \Delta / h\_c$ (i.e., the RNN model predicts future states at every integration time step $h\_c$ within one sampling period $\Delta$). As a result, the RNN input $\mathbf{x}\_{i,t}$ consists of the current state measurements and the manipulated inputs that will be applied over $t = 1, ..., T$, and the RNN output $\mathbf{y}\_{i,t}$ consists of the predicted states over $t = 1, ..., T$. Note that $\mathbf{x}\_{i,t}$ remains unchanged over $t = 1, ..., T$ due to the sample-and-hold implementation of the manipulated inputs.

The dataset consists of $m$ data sequences drawn independently from some underlying distribution over $\mathbf{R}^{d\_x \times T} \times \mathbf{R}^{d\_y \times T}$. In this work, we consider a one-hidden-layer RNN with hidden states $\mathbf{h}\_{i,t} \in \mathbf{R}^{d\_h}$ computed as follows:

$$\mathbf{h}\_{i,t} = \sigma\_h(U\mathbf{h}\_{i,t-1} + W\mathbf{x}\_{i,t}) \tag{8}$$

where $\sigma\_h$ is the element-wise nonlinear activation function (e.g., ReLU), and $U \in \mathbb{R}^{d\_h \times d\_h}$ and $W \in \mathbb{R}^{d\_h \times d\_x}$ are the weight matrices connected to the hidden states and the input vector, respectively. The output layer $\mathbf{y}\_{i,t}$ is computed as follows:

$$\mathbf{y}\_{i,t} = \sigma\_y(V\mathbf{h}\_{i,t})\tag{9}$$

where $V \in \mathbb{R}^{d\_y \times d\_h}$ is the output weight matrix, and $\sigma\_y$ is the element-wise activation function in the output layer (typically a linear unit for regression problems).
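A minimal forward pass of the one-hidden-layer RNN of Equations (8) and (9) can be sketched as follows; the dimensions, the ReLU hidden activation, and the linear output unit are illustrative assumptions consistent with the text above.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, sigma_h=lambda z: np.maximum(z, 0.0)):
    """One-hidden-layer RNN of Equations (8)-(9):
    h_t = sigma_h(U h_{t-1} + W x_t),  y_t = V h_t (linear output)."""
    d_h = U.shape[0]
    h = np.zeros(d_h)           # h_0 = 0, as assumed later in the proof of Lemma 7
    ys = []
    for x_t in x_seq:           # x_t is held constant over one sampling period here
        h = sigma_h(U @ h + W @ x_t)
        ys.append(V @ h)
    return np.array(ys)

# Hypothetical dimensions: d_x = 3, d_h = 4, d_y = 2, T = 5.
U = np.zeros((4, 4)); W = 0.1 * np.ones((4, 3)); V = np.ones((2, 4))
ys = rnn_forward(np.ones((5, 3)), U, W, V)   # shape (5, 2)
```

In training, the weights `U`, `W`, `V` would be fitted by minimizing the empirical risk over the simulation dataset.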

**Figure 1.** Recurrent neural network structure.

We consider the loss function $L(\mathbf{y}, \bar{\mathbf{y}})$, which calculates the squared difference between the predicted value $\mathbf{y}$ and the true value $\bar{\mathbf{y}}$ (i.e., the $L\_2$ loss). Without loss of generality, we make the following assumptions on the RNN model and dataset.

**Assumption 1.** *The RNN inputs are bounded, i.e.,* $|\mathbf{x}\_{i,t}| \le B\_X$ *for all* $i = 1, ..., m$ *and* $t = 1, ..., T$*.*

**Assumption 2.** *The Frobenius norms of all the weight matrices are bounded as follows:*

$$\|U\|\_F \le B\_{U,F}, \quad \|V\|\_F \le B\_{V,F}, \quad \|W\|\_F \le B\_{W,F} \tag{10}$$

**Assumption 3.** *Training, validation, and testing datasets are drawn from the same distribution.*

**Assumption 4.** *The nonlinear activation function $\sigma\_h$ is 1-Lipschitz continuous and positive-homogeneous, i.e., $\sigma\_h(\alpha z) = \alpha\sigma\_h(z)$ for all $\alpha \ge 0$ and $z \in \mathbb{R}$.*

**Remark 1.** *All the assumptions made are standard in machine learning theory and can be stated in system-theoretic language as follows. Assumption 1 requires the RNN inputs to be bounded, which is consistent with the fact that the process states $x$ and inputs $u$ are bounded by $x \in \Omega\_\rho$ and $u \in U$. Assumption 2 requires the RNN weight matrices to be bounded, which implies that only a finite class of neural network hypotheses is considered for modeling the nonlinear system of Equation (1). Assumption 3 is a natural and necessary assumption for generalization performance analysis; it implies that the machine learning models built from industrial operation data will be applied to the same process with the same data distribution. An example of an activation function that satisfies Assumption 4 is the Rectified Linear Unit (ReLU), a nonlinear activation function that has gained popularity in the machine learning domain.*

#### **3. RNN Generalization Error**

Since any learning algorithm is evaluated on finite training samples only, which provide no information on its predictive performance for unseen data, the generalization error is an important measure of how accurately a neural network model predicts outputs for input data that were not used in training. To apply machine learning models to real chemical processes, it is necessary to demonstrate that the models achieve a desired generalization error, such that they can be applied to any reasonable operating conditions beyond those in the training dataset while maintaining a sufficiently small modeling error. In this section, we develop an upper bound for the generalization error of RNN models and demonstrate that this error can be bounded with high probability, provided that the training data samples and the neural network structure meet a few requirements.

#### *3.1. Preliminaries*

We first present some important definitions and lemmas that will be used in the derivation of the RNN generalization error. Random variables satisfying a sub-Gaussian distribution, which is a probability distribution with strong tail decay, are defined as follows:

**Definition 1.** *A centered random variable $x \in \mathbb{R}$ is said to be sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E}[x] = 0$ and the moment generating function satisfies*

$$\mathbb{E}[\exp(ax)] \le \exp\left(\frac{a^2\sigma^2}{2}\right), \quad \forall a \in \mathbb{R} \tag{11}$$

**Lemma 1** (McDiarmid's inequality [35])**.** *Consider independent random variables $X\_1, ..., X\_n \in X$ and a function $f : X^n \to \mathbb{R}$ with the bounded difference property, i.e., there exist positive numbers $c\_i$ such that the following inequality holds for all $x\_1, ..., x\_n, x\_i' \in X$, $x\_i \neq x\_i'$, and $i \in \{1, ..., n\}$:*

$$|f(\mathbf{x}\_1, \dots, \mathbf{x}\_i, \dots, \mathbf{x}\_n) - f(\mathbf{x}\_1, \dots, \mathbf{x}\_i', \dots, \mathbf{x}\_n)| \le c\_i \tag{12}$$

*then the following probability holds for any a* > 0*:*

$$\mathbb{P}(f(X\_1, \ldots, X\_n) - \mathbb{E}[f(X\_1, \ldots, X\_n)] \ge a) \le \exp\left(-\frac{2a^2}{\sum\_{i=1}^n c\_i^2}\right) \tag{13}$$
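As a quick numerical illustration (not part of the original derivation), the sketch below evaluates the right-hand side of Equation (13) for the sample mean of $n$ i.i.d. variables bounded in $[0, 1]$, for which the bounded-difference constants are $c\_i = 1/n$, and checks it against a Monte Carlo estimate of the deviation probability; the distribution and all numbers are hypothetical.

```python
import numpy as np

def mcdiarmid_bound(a, c):
    """RHS of Equation (13): exp(-2 a^2 / sum_i c_i^2)."""
    c = np.asarray(c, dtype=float)
    return np.exp(-2.0 * a**2 / np.sum(c**2))

# f = sample mean of n i.i.d. Uniform(0,1) variables, so c_i = 1/n and
# the bound reduces to exp(-2 a^2 n).
rng = np.random.default_rng(0)
n, a, trials = 50, 0.1, 5000
means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
freq = np.mean(means - 0.5 >= a)            # empirical deviation probability
bound = mcdiarmid_bound(a, [1.0 / n] * n)   # = exp(-1) for these values
```

The empirical frequency is far below the bound, as expected: McDiarmid's inequality is distribution-free and therefore conservative for any particular distribution.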

Let $L(\mathbf{y}\_t, \bar{\mathbf{y}}\_t)$ be the loss function, where $\mathbf{y}\_t = h(\mathbf{x}\_t)$ is the predicted RNN output and $h(\cdot)$ represents an RNN function in the hypothesis class $\mathcal{H}$ mapping the input $\mathbf{x} \in \mathbb{R}^{d\_x}$ to the output $\mathbf{y} \in \mathbb{R}^{d\_y}$. The following error definitions are commonly used in machine learning theory.

**Definition 2.** *Given a function h that predicts output values y for each input x, and an underlying distribution D, the expected loss/error or generalization error is*

$$L\_D(h) \triangleq \mathbb{E}[L(h(\mathbf{x}), y)] = \int\_{X \times Y} L(h(\mathbf{x}), y)\rho(\mathbf{x}, y)d\mathbf{x}dy \tag{14}$$

*where $\rho(\mathbf{x}, y)$ is the joint probability distribution of $\mathbf{x}$ and $y$, and $X$, $Y$ are the spaces of all possible inputs and outputs, respectively.*

Since in general the joint probability distribution $\rho$ is unknown, we use data samples drawn from this unknown distribution to compute the empirical error, which is a proxy measure for the expected loss.

**Definition 3.** *Given a dataset with m data samples S* = (*s*1, ...,*sm*)*, where si* = (*xi*, *yi*)*, the empirical error or risk is*

$$\mathbb{E}\_{\mathbb{S}}[L(h(\mathbf{x}), y)] = \frac{1}{m} \sum\_{i=1}^{m} L(h(\mathbf{x}\_i), y\_i) \tag{15}$$

The RNN model is developed by minimizing the empirical risk of Equation (15) using a set of *m* data sequences. To ensure that the RNN model achieves a desired generalization performance in the sense that it well captures the nonlinear dynamics of the system of Equation (1) for various operation conditions, the objective of this work is to show that the generalization error E[*L*(*h*(**x**), **y**)] can be bounded provided that the empirical risk is sufficiently small and bounded.
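The empirical risk of Equation (15) with a squared loss is straightforward to compute; a minimal sketch follows, where the hypothesis `h` and the toy data are hypothetical placeholders.

```python
import numpy as np

def empirical_risk(h, X, Y):
    """Empirical risk of Equation (15): the average of L(h(x_i), y_i)
    over the m data samples, with L the squared (MSE) loss."""
    losses = [np.sum((h(x) - y) ** 2) for x, y in zip(X, Y)]
    return np.mean(losses)

# Hypothetical check with h = identity on two 2-dimensional samples.
risk = empirical_risk(lambda x: np.asarray(x),
                      [[1.0, 0.0], [0.0, 1.0]],
                      [[0.0, 0.0], [0.0, 1.0]])
```

Training selects the hypothesis $h\_S$ that minimizes this quantity over the training set; the analysis below asks how far that minimized value can be from the expected loss of Equation (14).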

We consider the mean squared error (MSE) as the loss function in this work. It is readily shown that the MSE loss function $L(\mathbf{y}, \bar{\mathbf{y}})$ is not Lipschitz continuous for all $\mathbf{y}, \bar{\mathbf{y}} \in \mathbb{R}^{d\_y}$. However, since we consider a finite hypothesis class that satisfies Assumptions 1–4, we can show that the RNN output is bounded. This is consistent with the fact that the nonlinear system of Equation (1) is operated in the stability region $\Omega\_\rho$, and therefore, the RNN outputs are bounded within a compact set.

Let $r\_t > 0$ denote the upper bound of $\mathbf{y}\_t$, i.e., $|\mathbf{y}\_t| \le r\_t$, $t = 1, ..., T$. Without loss of generality, we assume that the true outputs are also bounded by $r\_t$. Therefore, the MSE loss function is locally Lipschitz continuous, satisfying the following inequality for all $|\mathbf{y}\_t|, |\bar{\mathbf{y}}\_t| \le r\_t$:

$$|L(\mathbf{y}\_1, \bar{\mathbf{y}}) - L(\mathbf{y}\_2, \bar{\mathbf{y}})| \le L\_r|\mathbf{y}\_1 - \mathbf{y}\_2| \tag{16}$$

where $L\_r$ is the local Lipschitz constant.

The generalization error of a neural network function *hS* chosen from a hypothesis class H based on a certain learning algorithm and a training dataset *S* drawn from distribution *D* can be decomposed into the approximation error and the estimation error as follows:

$$L\_D(h\_S) - L\_D(h^\*) = \left(\min\_{h \in \mathcal{H}} L\_D(h) - L\_D(h^\*)\right) + \left(L\_D(h\_S) - \min\_{h \in \mathcal{H}} L\_D(h)\right) \tag{17}$$

where the first and second terms in parentheses represent the **approximation error** and the **estimation error**, respectively. Specifically, $L\_D(h\_S)$ is the error of the hypothesis $h\_S$ evaluated over the underlying data distribution $D$, and $h^\*$ is the optimal hypothesis (possibly outside of the finite hypothesis class $\mathcal{H}$) for the data distribution $D$; $\min\_{h\in\mathcal{H}} L\_D(h)$ is the loss of the best hypothesis within $\mathcal{H}$ over the distribution $D$. It can be seen that the approximation error depends on how close the hypothesis class $\mathcal{H}$ is to the optimal hypothesis $h^\*$; a larger hypothesis class $\mathcal{H}$ generally leads to a lower approximation error, since it is more likely that the optimal hypothesis $h^\*$ is included in $\mathcal{H}$. The estimation error depends on both the size of the hypothesis class and the training data, and characterizes how good the hypothesis $h\_S$ selected using the training dataset $S$ is with respect to the best hypothesis within $\mathcal{H}$. As a result, a larger hypothesis class $\mathcal{H}$ may in turn lead to a higher estimation error, since it is more difficult to find the optimal hypothesis within $\mathcal{H}$ over the distribution $D$. The error decomposition of Equation (17) thus demonstrates the dependence of the generalization error on the training dataset size and the complexity of the hypothesis class. In the next section, we will take advantage of the Rademacher complexity technique to derive a generalization error bound that accounts for these dependencies quantitatively. The results will also provide a guide for the design of neural network structures and the collection of training data in order to achieve a desired generalization performance for a specific modeling task.

#### *3.2. Rademacher Complexity Bound*

Rademacher complexity quantifies the richness of a class of functions, and is often used in machine learning theory to bound the generalization error. The definition of empirical Rademacher complexity is given below.

**Definition 4** (Empirical Rademacher Complexity)**.** *Given a hypothesis class* F *of real-valued functions, and a set of data samples S* = {*s*1, ...,*sm*}*, the empirical Rademacher complexity of* F *is defined as*

$$\mathcal{R}\_S(\mathcal{F}) = \mathbb{E}\_{\epsilon}\left[\sup\_{f \in \mathcal{F}} \frac{1}{m} \sum\_{i=1}^m \epsilon\_i f(s\_i)\right] \tag{18}$$

*where $\epsilon = (\epsilon\_1, ..., \epsilon\_m)^T$ with $\epsilon\_i$ being independent and identically distributed (i.i.d.) Rademacher random variables satisfying $\mathbb{P}(\epsilon\_i = 1) = \mathbb{P}(\epsilon\_i = -1) = 0.5$.*
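For small, finite function classes, the empirical Rademacher complexity of Equation (18) can be estimated by Monte Carlo sampling of the Rademacher variables. The sketch below is a hypothetical illustration that assumes the function class is given as a table of values $f(s\_i)$, one row per function.

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of Equation (18). F_values is a (|F|, m) array
    whose rows hold f(s_1), ..., f(s_m) for each function f in the class."""
    F_values = np.asarray(F_values, dtype=float)
    m = F_values.shape[1]
    rng = np.random.default_rng(seed)
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher draws
    # For each draw, take the sup over the function class, then average.
    sups = np.max(eps @ F_values.T / m, axis=1)
    return sups.mean()
```

For example, the class of the two constant functions $\{+1, -1\}$ on $m$ points has complexity $\mathbb{E}\_\epsilon|\frac{1}{m}\sum\_i \epsilon\_i| \approx \sqrt{2/(\pi m)}$, which shrinks as the sample size grows, consistent with the role of $\mathcal{R}\_S$ in the bounds below.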

We also have the following contraction inequality for the hypothesis class $\mathcal{H}$ of vector-valued functions mapping into $\mathbb{R}^{d\_y}$.

**Lemma 2** (c.f. Corollary 4 in [36])**.** *Consider a hypothesis class $\mathcal{H}$ of vector-valued functions mapping into $\mathbb{R}^{d\_y}$, and a set of data samples $S = \{s\_1, ..., s\_m\}$. Let $L(\cdot)$ be an $L\_r$-Lipschitz function mapping $\mathbb{R}^{d\_y}$ to $\mathbb{R}$; then we have*

$$\mathbb{E}\_{\epsilon}\left[\sup\_{h\in\mathcal{H}}\sum\_{i=1}^{m}\epsilon\_{i}L(h(\mathbf{x}\_{i}),\mathbf{y}\_{i})\right] \leq \sqrt{2}L\_r\,\mathbb{E}\_{\epsilon}\left[\sup\_{h\in\mathcal{H}}\sum\_{i=1}^{m}\sum\_{k=1}^{d\_y}\epsilon\_{ik}h\_{k}(\mathbf{x}\_{i})\right] \tag{19}$$

*where $h\_k(\cdot)$ is the k-th component of the vector-valued function $h(\cdot)$, and the $\epsilon\_{ik}$ form an $m \times d\_y$ matrix of independent Rademacher variables. In the following text, we will omit the subscript of the expectation for simplicity.*

Since the RHS of Equation (19) is generally difficult to compute, we can reduce it to scalar classes, and derive the following bound [36]:

$$\mathbb{E}\left[\sup\_{h\in\mathcal{H}}\sum\_{i=1}^{m}\sum\_{k=1}^{d\_y}\epsilon\_{ik}h\_{k}(\mathbf{x}\_{i})\right] \leq \sum\_{k=1}^{d\_y}\mathbb{E}\left[\sup\_{h\in\mathcal{H}\_k}\sum\_{i=1}^{m}\epsilon\_{i}h(\mathbf{x}\_{i})\right] \tag{20}$$

where H*k*, *k* = 1, ..., *dy*, are classes of scalar-valued functions that correspond to the components of vector-valued functions in H. Equation (20) will later be used in the derivation of the generalization error bound for RNN models approximating the nonlinear system of Equation (1).

Let $\mathcal{G}\_t$ be the family of loss functions associated with $\mathcal{H}$, mapping the first $t$ time-step inputs $\{\mathbf{x}\_1, \mathbf{x}\_2, ..., \mathbf{x}\_t\} \in \mathbb{R}^{d\_x \times t}$ to the $t$-th output $\mathbf{y}\_t \in \mathbb{R}^{d\_y}$:

$$\mathcal{G}\_t = \{g\_t : (\mathbf{x}, \bar{\mathbf{y}}) \to L(h(\mathbf{x}), \bar{\mathbf{y}}),\ h \in \mathcal{H}\} \tag{21}$$

where $\mathbf{x}$ is the RNN input vector and $\bar{\mathbf{y}}$ is the true output vector. The following lemma characterizes the upper bound for the generalization error using the Rademacher complexity $\mathcal{R}\_S(\mathcal{G}\_t)$.

**Lemma 3** (c.f. Theorem 3.3 in [37])**.** *Given a set of $m$ i.i.d. data samples $S = \{(\mathbf{x}\_{i,t}, \mathbf{y}\_{i,t})\_{t=1}^{T}\}$, $i = 1, ..., m$, with probability at least $1 - \delta$ over $S$, the following inequality holds for all $g\_t \in \mathcal{G}\_t$:*

$$\mathbb{E}[g\_t(\mathbf{x}, \mathbf{y})] \le \frac{1}{m} \sum\_{i=1}^m g\_t(\mathbf{x}\_i, \mathbf{y}\_i) + 2\mathcal{R}\_S(\mathcal{G}\_t) + 3\sqrt{\frac{\log(\frac{2}{\delta})}{2m}} \tag{22}$$

**Proof.** While the full proof can be found in many machine learning books, e.g., [37], a proof sketch is presented below to help readers understand the derivation of Equation (22). To simplify the notation, let $\mathbb{E}[g\_t]$ and $\hat{\mathbb{E}}\_S[g\_t]$ denote the expected loss $\mathbb{E}[g\_t(\mathbf{x}, \mathbf{y})]$ and the empirical loss $\frac{1}{m}\sum\_{i=1}^m g\_t(\mathbf{x}\_i, \mathbf{y}\_i)$ based on a dataset $S$ with $m$ data samples, respectively. Additionally, we assume without loss of generality that $g\_t(\mathbf{x}, \mathbf{y})$ is bounded in $[0, 1]$ (if not, we can scale the RNN output layer or the loss function). We define $\beta(S)$ to be the following function of the data samples $S = (s\_1, s\_2, ..., s\_m)$, where $s\_i$ denotes the data sample $(\mathbf{x}\_{i,t}, \mathbf{y}\_{i,t})$, $i = 1, ..., m$:

$$\beta(\mathcal{S}) = \sup\_{\mathcal{g}\_t \in \mathcal{G}\_t} (\mathbb{E}[\mathcal{g}\_t] - \mathbb{E}\_{\mathcal{S}}[\mathcal{g}\_t]) \tag{23}$$

Given two datasets $S = (s\_1, ..., s\_i, ..., s\_m)$ and $S' = (s\_1, ..., s\_i', ..., s\_m)$ that differ in only one data point, i.e., $s\_i \neq s\_i'$, the following inequality holds for any $g\_t(\mathbf{x}\_i, \mathbf{y}\_i) \in [0, 1]$:

$$\begin{split} |\beta(S) - \beta(S')| &= \left|\sup\_{g\_t\in\mathcal{G}\_t}(\mathbb{E}[g\_t] - \hat{\mathbb{E}}\_S[g\_t]) - \sup\_{g\_t\in\mathcal{G}\_t}(\mathbb{E}[g\_t] - \hat{\mathbb{E}}\_{S'}[g\_t])\right| \\ &\le \left|\sup\_{g\_t\in\mathcal{G}\_t}(\hat{\mathbb{E}}\_{S'}[g\_t] - \hat{\mathbb{E}}\_S[g\_t])\right| \\ &= \left|\sup\_{g\_t\in\mathcal{G}\_t}\frac{g\_t(s\_i') - g\_t(s\_i)}{m}\right| \\ &\le \frac{1}{m} \end{split} \tag{24}$$

Then, using McDiarmid's inequality in Lemma 1 and letting $a \ge \sqrt{\frac{\log(\frac{2}{\delta})}{2m}}$, we have

$$\mathbb{P}[\beta(S) - \mathbb{E}\_S[\beta(S)] \ge a] \le \exp\left(\frac{-2a^2}{\sum\_{i=1}^m \frac{1}{m^2}}\right) = \exp(-2a^2m) \le \frac{\delta}{2} \tag{25}$$

where $\mathbb{E}\_S[\beta(S)]$ denotes the expectation of $\beta(S)$ with respect to the dataset $S$ of $m$ data samples. Equivalently, the following inequality holds with probability at least $1 - \frac{\delta}{2}$, for any $\delta > 0$:

$$\beta(S) \le \mathbb{E}\_S[\beta(S)] + \sqrt{\frac{\log(\frac{2}{\delta})}{2m}} \tag{26}$$

Next, we derive the upper bound for E*S*[*β*(*S*)] as follows:

$$\begin{split} \mathbb{E}\_S[\beta(S)] &= \mathbb{E}\_S\left[\sup\_{g\_t\in\mathcal{G}\_t}\left(\mathbb{E}[g\_t] - \hat{\mathbb{E}}\_S[g\_t]\right)\right] \\ &\le \mathbb{E}\_{S,S'}\left[\sup\_{g\_t\in\mathcal{G}\_t}\left(\hat{\mathbb{E}}\_{S'}[g\_t] - \hat{\mathbb{E}}\_S[g\_t]\right)\right] \\ &= \mathbb{E}\_{\epsilon,S,S'}\left[\sup\_{g\_t\in\mathcal{G}\_t}\left(\frac{1}{m}\sum\_{i=1}^m \epsilon\_i(g\_t(s\_i') - g\_t(s\_i))\right)\right] \\ &\le \mathbb{E}\_{\epsilon,S'}\left[\sup\_{g\_t\in\mathcal{G}\_t}\left(\frac{1}{m}\sum\_{i=1}^m \epsilon\_i g\_t(s\_i')\right)\right] + \mathbb{E}\_{\epsilon,S}\left[\sup\_{g\_t\in\mathcal{G}\_t}\left(\frac{1}{m}\sum\_{i=1}^m -\epsilon\_i g\_t(s\_i)\right)\right] \\ &= 2\,\mathbb{E}\_{\epsilon,S}\left[\sup\_{g\_t\in\mathcal{G}\_t}\frac{1}{m}\sum\_{i=1}^m \epsilon\_i g\_t(s\_i)\right] = 2\,\mathbb{E}\_S[\mathcal{R}\_S(\mathcal{G}\_t)] \end{split} \tag{27}$$

where the first line substitutes the definition of Equation (23) into $\mathbb{E}\_S[\beta(S)]$. The second line uses the fact that $\mathbb{E}[g\_t] = \mathbb{E}\_{S'}[\hat{\mathbb{E}}\_{S'}(g\_t)]$ and the property of the supremum: $\sup\_{g\_t\in\mathcal{G}\_t}\mathbb{E}\_{S'}(f(S', g\_t)) \le \mathbb{E}\_{S'}[\sup\_{g\_t\in\mathcal{G}\_t} f(S', g\_t)]$ for any function $f$. The third line introduces the Rademacher variables $\epsilon\_i$, which do not affect the outcome since the $\epsilon\_i$ are i.i.d. random variables taking values in $\{-1, +1\}$. The fourth line separates the supremum as $\sup(f + g) \le \sup(f) + \sup(g)$, and the last line uses the fact that the Rademacher variables $\epsilon\_i$ have a symmetric distribution. Note that $\mathbb{E}\_S[\mathcal{R}\_S(\mathcal{G}\_t)]$ in the last line of Equation (27) represents the expectation of the empirical Rademacher complexity $\mathcal{R}\_S(\mathcal{G}\_t)$ over all samples of size $m$ drawn from the same distribution. In order to bound this term, we apply McDiarmid's inequality again with confidence $\frac{\delta}{2}$, which yields a result similar to Equation (25). Finally, using the union bound, which states that $\mathbb{P}(\cup\_i A\_i) \le \sum\_i \mathbb{P}(A\_i)$ for any finite or countable set of events $A\_i$, $i = 1, 2, ...$, the following inequality holds with probability at least $1 - \delta$:

$$\begin{split} \beta(S) &\le 2\left(\mathcal{R}\_S(\mathcal{G}\_t) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}\right) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}} \\ &= 2\mathcal{R}\_S(\mathcal{G}\_t) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}} \end{split} \tag{28}$$

By substituting the definition of *β*(*S*) of Equation (23) into the above equation, we obtain the result in Equation (22). This completes the proof of Lemma 3.

It can be seen from Equation (22) that the generalization error bound depends on the empirical error (the first term), the Rademacher complexity (the second term), and an error function associated with the confidence $\delta$ and the number of samples $m$ (the last term). Since the first and last terms are known given a set of $m$ training data, in order to characterize the upper bound for the generalization error $\mathbb{E}[g\_t(\mathbf{x}, \mathbf{y})]$, we need to determine the upper bound for the Rademacher complexity $\mathcal{R}\_S(\mathcal{G}\_t)$. Since most of the established Rademacher complexity results concern feedforward neural networks modeling real-valued functions only, we start with a lemma for the hypothesis class of real-valued functions.

**Lemma 4.** *Given a hypothesis class $\mathcal{H}\_k$ of real-valued functions corresponding to the k-th component of the vector-valued function class $\mathcal{H}$, and a set of $m$ i.i.d. data samples $S = \{(\mathbf{x}\_{i,t}, \mathbf{y}\_{i,t})\_{t=1}^{T}\}$, $i = 1, ..., m$, the following inequality holds for the scaled empirical Rademacher complexity $m\mathcal{R}\_S(\mathcal{H}\_k) = \mathbb{E}[\sup\_{h\in\mathcal{H}\_k}\sum\_{i=1}^m \epsilon\_i h(\mathbf{x}\_i)]$:*

$$\begin{split} m\mathcal{R}\_S(\mathcal{H}\_k) &= \frac{1}{\lambda} \log \exp \left( \lambda \mathbb{E} \left[ \sup\_{h \in \mathcal{H}\_k} \sum\_{i=1}^m \epsilon\_i h(\mathbf{x}\_i) \right] \right) \\ &\leq \frac{1}{\lambda} \log \left( \mathbb{E} \left[ \sup\_{h \in \mathcal{H}\_k} \exp \left( \lambda \sum\_{i=1}^m \epsilon\_i h(\mathbf{x}\_i) \right) \right] \right) \end{split} \tag{29}$$

*where λ* > 0 *is an arbitrary parameter.*

**Proof.** Equation (29) follows readily from Jensen's inequality, which states that for a random variable $X$ and a convex function $\beta(\cdot)$, $\beta(\mathbb{E}[X]) \le \mathbb{E}[\beta(X)]$. Equation (29) will be used in the derivation of the upper bound for the Rademacher complexity $\mathcal{R}\_S(\mathcal{H})$ in Lemma 7.

We can see from the definition of Rademacher complexity in Equation (18) that the value of $\mathcal{R}\_S(\mathcal{G}\_t)$ depends on the complexity of the hypothesis class $\mathcal{G}\_t$. However, since the RNN model of Equations (8) and (9) is a complex nonlinear function whose learning capacity is difficult to measure directly, we need to peel off the nonlinear activation functions and weight matrices layer by layer. The following lemma shows the "peeling" step used in the derivation of the Rademacher complexity for the output layer of RNNs.

**Lemma 5** (c.f. Lemma 1 in [27])**.** *Given a hypothesis class $\mathcal{H}$ of vector-valued functions that map the RNN inputs $\mathbf{x} \in \mathbb{R}^{d\_x}$ to the hidden states $\mathbf{h} \in \mathbb{R}^{d\_h}$, and any convex and monotonically increasing function $p : \mathbb{R} \to \mathbb{R}\_+$, the following inequality holds for the RNN model of Equations (8) and (9) with a 1-Lipschitz, positive-homogeneous activation function $\sigma\_y(\cdot)$:*

$$\mathbb{E}\left[\sup\_{h\in\mathcal{H},\,\|V\|\_F\le B\_{V,F}} p\left(\left|\sum\_{i=1}^m \epsilon\_i\sigma\_y(V\mathbf{h}\_i)\right|\right)\right] \le 2\cdot\mathbb{E}\left[\sup\_{h\in\mathcal{H}} p\left(B\_{V,F}\cdot\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_i\right|\right)\right] \tag{30}$$

**Proof.** The proof is omitted here as it is similar to the proof for the next lemma, which will be presented in detail. Interested readers can refer to [27] for the proof of Lemma 5.

Lemma 5 peels off the weight matrix *V* between the RNN hidden layer and output layer. To further peel off the weight matrices in the RNN hidden layers, we provide the following lemma.

**Lemma 6.** *Given a hypothesis class $\mathcal{H}$ of vector-valued functions that map the RNN inputs $\mathbf{x} \in \mathbb{R}^{d\_x}$ to the hidden states $\mathbf{h} \in \mathbb{R}^{d\_h}$, and any convex and monotonically increasing function $p : \mathbb{R} \to \mathbb{R}\_+$, the following inequality holds for the RNN model of Equations (8) and (9) with a 1-Lipschitz, positive-homogeneous activation function $\sigma\_h(\cdot)$:*

$$\begin{split} &\mathbb{E}\left[\sup\_{h\in\mathcal{H},\,\|U\|\_F\le B\_{U,F},\,\|W\|\_F\le B\_{W,F}} p\left(\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t}\right|\right)\right] \\ &= \mathbb{E}\left[\sup\_{h\in\mathcal{H},\,\|U\|\_F\le B\_{U,F},\,\|W\|\_F\le B\_{W,F}} p\left(\left|\sum\_{i=1}^m \epsilon\_i\sigma\_h(U\mathbf{h}\_{i,t-1} + W\mathbf{x}\_{i,t})\right|\right)\right] \\ &\le 2\,\mathbb{E}\left[\sup\_{h\in\mathcal{H}} p\left(B\_{U,F}\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t-1}\right| + B\_{W,F}\left|\sum\_{i=1}^m \epsilon\_i\mathbf{x}\_{i,t}\right|\right)\right] \end{split} \tag{31}$$

**Proof.** We first define an augmented weight matrix $Z = [U \mid W] \in \mathbb{R}^{d\_h \times (d\_h + d\_x)}$ and an augmented vector $\bar{\mathbf{h}}\_{i,t} = [\mathbf{h}\_{i,t-1} \mid \mathbf{x}\_{i,t}] \in \mathbb{R}^{d\_h + d\_x}$. To simplify the discussion, we assume that the Frobenius norm of $Z$ is bounded by $\|Z\|\_F \le B\_{Z,F}$, given that $U$ and $W$ are bounded by $\|U\|\_F \le B\_{U,F}$ and $\|W\|\_F \le B\_{W,F}$. Then, the hidden-layer vector at the $t$-th time step, $\mathbf{h}\_{i,t}$, can be written as follows:

$$\mathbf{h}\_{i,t} = \sigma\_h(U\mathbf{h}\_{i,t-1} + W\mathbf{x}\_{i,t}) = \sigma\_h(Z\bar{\mathbf{h}}\_{i,t}) \tag{32}$$

Letting $\mathbf{z}\_1, \mathbf{z}\_2, ..., \mathbf{z}\_{d\_h}$ denote the rows of the matrix $Z$, and using the positive homogeneity of $\sigma\_h$, we have

$$\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t}\right|^2 = \sum\_{j=1}^{d\_h}|\mathbf{z}\_j|^2\left(\sum\_{i=1}^m \epsilon\_i\sigma\_h\Big(\frac{\mathbf{z}\_j^T}{|\mathbf{z}\_j|}\bar{\mathbf{h}}\_{i,t}\Big)\right)^2 \tag{33}$$

The supremum of Equation (33) over all weight matrices $Z$ with rows $\mathbf{z}\_1, \mathbf{z}\_2, ..., \mathbf{z}\_{d\_h}$ satisfying $\|Z\|\_F \le B\_{Z,F}$ is attained when $|\mathbf{z}\_j| = B\_{Z,F}$ for some $j$ and $|\mathbf{z}\_i| = 0$ for all $i \neq j$. Therefore, we have

$$\begin{split} &\mathbb{E}\left[\sup\_{h\in\mathcal{H},\,\|U\|\_F\le B\_{U,F},\,\|W\|\_F\le B\_{W,F}} p\left(\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t}\right|\right)\right] \\ &= \mathbb{E}\left[\sup\_{h\in\mathcal{H},\,|\mathbf{z}|=B\_{Z,F}} p\left(\left|\sum\_{i=1}^m \epsilon\_i\sigma\_h(\mathbf{z}^T\bar{\mathbf{h}}\_{i,t})\right|\right)\right] \end{split} \tag{34}$$

Since *p*(·) is a convex and monotonically increasing function, *p*(|*a*|) ≤ *p*(*a*) + *p*(−*a*) holds, and the above equation can be further bounded as follows:

$$\begin{split} \mathbb{E}\left[\sup\_{h\in\mathcal{H}, |\mathbf{z}|=B\_{\mathcal{Z},\mathcal{F}}} p\left(\left|\sum\_{i=1}^{m}\boldsymbol{\varepsilon}\_{i}\boldsymbol{\sigma}\_{h}(\mathbf{z}^{T}\mathbf{\bar{h}}\_{i,t})\right|\right)\right] &\leq \mathbb{E}\left[\sup\_{h\in\mathcal{H}, |\mathbf{z}|=B\_{\mathcal{Z},\mathcal{F}}} p\left(\sum\_{i=1}^{m}\boldsymbol{\varepsilon}\_{i}\boldsymbol{\sigma}\_{h}(\mathbf{z}^{T}\mathbf{\bar{h}}\_{i,t})\right)\right] \\ &+ \mathbb{E}\left[\sup\_{h\in\mathcal{H}, |\mathbf{z}|=B\_{\mathcal{Z},\mathcal{F}}} p\left(-\sum\_{i=1}^{m}\boldsymbol{\varepsilon}\_{i}\boldsymbol{\sigma}\_{h}(\mathbf{z}^{T}\mathbf{\bar{h}}\_{i,t})\right)\right] \\ &= 2\mathbb{E}\left[\sup\_{h\in\mathcal{H}, |\mathbf{z}|=B\_{\mathcal{Z},\mathcal{F}}} p\left(\sum\_{i=1}^{m}\boldsymbol{\varepsilon}\_{i}\boldsymbol{\sigma}\_{h}(\mathbf{z}^{T}\mathbf{\bar{h}}\_{i,t})\right)\right] \end{split} (35)$$

where the last equality follows from the fact that the random variables $\epsilon\_i$ have a symmetric distribution, i.e., $\mathbb{P}(\epsilon\_i = 1) = \mathbb{P}(\epsilon\_i = -1) = 0.5$. Following the proof in [27] and Theorem 4.12 in [38], the RHS of Equation (35) can be further bounded by

$$\begin{split} 2\mathbb{E}\left[\sup\_{h\in\mathcal{H},\,|\mathbf{z}|=B\_{Z,F}} p\left(\sum\_{i=1}^m \epsilon\_i\sigma\_h(\mathbf{z}^T\bar{\mathbf{h}}\_{i,t})\right)\right] &\le 2\mathbb{E}\left[\sup\_{h\in\mathcal{H},\,|\mathbf{z}|=B\_{Z,F}} p\left(\sum\_{i=1}^m \epsilon\_i\mathbf{z}^T\bar{\mathbf{h}}\_{i,t}\right)\right] \\ &\le 2\mathbb{E}\left[\sup\_{h\in\mathcal{H},\,|\mathbf{u}|=B\_{U,F},\,|\mathbf{w}|=B\_{W,F}} p\left(|\mathbf{u}|\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t-1}\right| + |\mathbf{w}|\left|\sum\_{i=1}^m \epsilon\_i\mathbf{x}\_{i,t}\right|\right)\right] \\ &= 2\mathbb{E}\left[\sup\_{h\in\mathcal{H}} p\left(B\_{U,F}\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t-1}\right| + B\_{W,F}\left|\sum\_{i=1}^m \epsilon\_i\mathbf{x}\_{i,t}\right|\right)\right] \end{split} \tag{36}$$

Based on Lemmas 5 and 6, the following lemma provides an upper bound for the Rademacher complexity of the RNN hypothesis class.

**Lemma 7.** *Let $\mathcal{H}\_{k,t}$, $k = 1, ..., d\_y$, be the class of real-valued functions that corresponds to the k-th component of the RNN output at the t-th time step, with weight matrices and activation functions satisfying Assumptions 1–4. Given a set of $m$ i.i.d. data samples $S = \{(\mathbf{x}\_{i,t}, \mathbf{y}\_{i,t})\_{t=1}^{T}\}$, $i = 1, ..., m$, the following bound holds for the Rademacher complexity:*

$$\mathcal{R}\_S(\mathcal{H}\_{k,t}) \le \frac{M(\sqrt{2\log(2)t} + 1)B\_X}{\sqrt{m}} \tag{37}$$

*where $M = B\_{V,F}B\_{W,F}\frac{(B\_{U,F})^t - 1}{B\_{U,F} - 1}$.*
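The bound of Equation (37) is easy to evaluate numerically. The sketch below computes it for hypothetical norm bounds and illustrates the $O(1/\sqrt{m})$ decay with the dataset size; the placement of $t$ under the square root follows the derivation via the choice of $\lambda$ in the proof, and all numerical values are placeholders.

```python
import numpy as np

def rnn_rademacher_bound(B_V, B_W, B_U, B_X, t, m):
    """Upper bound of Equation (37) on R_S(H_{k,t}), with
    M = B_V * B_W * (B_U**t - 1) / (B_U - 1)  (assumes B_U != 1)."""
    M = B_V * B_W * (B_U**t - 1.0) / (B_U - 1.0)
    return M * (np.sqrt(2.0 * np.log(2.0) * t) + 1.0) * B_X / np.sqrt(m)

# Hypothetical bounds: tightening the weight norms or enlarging the dataset
# shrinks the complexity term; 100x more samples gives a 10x smaller bound.
b_small = rnn_rademacher_bound(1.0, 1.0, 2.0, 1.0, 3, 100)
b_large = rnn_rademacher_bound(1.0, 1.0, 2.0, 1.0, 3, 10000)
```

Note the exponential growth of $M$ in $t$ when $B\_{U,F} > 1$, which is why the model is developed to predict only one sampling period at a time.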

**Proof.** Let $\mathbf{v}\_k$ be the $k$-th row of the weight matrix $V$. Using Equations (29) and (30), the scaled Rademacher complexity $m\mathcal{R}\_S(\mathcal{H}\_{k,t})$ can be bounded as follows:

$$\begin{split} m\mathcal{R}\_S(\mathcal{H}\_{k,t}) &= \mathbb{E}\left[\sup\_{h\in\mathcal{H}\_{k,t},\,\|V\|\_F\le B\_{V,F}} \sum\_{i=1}^m \epsilon\_i\sigma\_y(\mathbf{v}\_k\mathbf{h}\_{i,t})\right] \\ &\le \frac{1}{\lambda}\log\mathbb{E}\left[\sup\_{h\in\mathcal{H}\_{k,t},\,\|V\|\_F\le B\_{V,F}} \exp\left(\lambda\sum\_{i=1}^m \epsilon\_i\sigma\_y(\mathbf{v}\_k\mathbf{h}\_{i,t})\right)\right] \\ &\le \frac{1}{\lambda}\log\mathbb{E}\left[\sup\_{h\in\mathcal{H}\_{k,t}}\exp\left(B\_{V,F}\lambda\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t}\right|\right)\right] \end{split} \tag{38}$$

where exp(·) corresponds to the monotonically increasing function *p*(·) in Lemmas 5 and 6. Then, we use Equation (31) and further derive the bound for the RHS of the above equation as follows:

$$\begin{split} &\frac{1}{\lambda}\log\mathbb{E}\left[\sup\_{h\in\mathcal{H}\_{k,t}}\exp\left(B\_{V,F}\lambda\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t}\right|\right)\right] \\ &\le \frac{1}{\lambda}\log\left(2\cdot\mathbb{E}\left[\sup\_{h\in\mathcal{H}\_{k,t-1}}\exp\left(B\_{V,F}\lambda\cdot\left(B\_{U,F}\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t-1}\right| + B\_{W,F}\left|\sum\_{i=1}^m \epsilon\_i\mathbf{x}\_{i,t}\right|\right)\right)\right]\right) \end{split} \tag{39}$$

Assuming that the initial hidden states satisfy $\mathbf{h}\_{i,0} = 0$, and recursively applying Lemma 6 to the term $\left|\sum\_{i=1}^m \epsilon\_i\mathbf{h}\_{i,t-1}\right|$ in Equation (39), we obtain

$$\begin{split} m\mathcal{R}\_S(\mathcal{H}\_{k,t}) &\le \frac{1}{\lambda}\log\left(2^t\cdot\mathbb{E}\left[\exp\left(B\_{V,F}\lambda\cdot B\_{W,F}\cdot\left|\sum\_{i=1}^m \epsilon\_i\mathbf{x}\_{i,t}\right|\cdot\sum\_{j=0}^{t-1}(B\_{U,F})^j\right)\right]\right) \\ &= \frac{1}{\lambda}\log\left(2^t\cdot\mathbb{E}\left[\exp\left(B\_{V,F}\lambda\cdot B\_{W,F}\cdot\left|\sum\_{i=1}^m \epsilon\_i\mathbf{x}\_{i,t}\right|\cdot\frac{(B\_{U,F})^t - 1}{B\_{U,F} - 1}\right)\right]\right) \end{split} \tag{40}$$

It is noted that the RNN model in this work is developed to predict over one sampling period, during which the RNN inputs $\mathbf{x}\_{i,t}$ remain the same. If the RNN inputs vary over time, Equation (40) can be modified by taking the maximum value of $\left|\sum\_{i=1}^m \epsilon\_i\mathbf{x}\_{i,t}\right|$ within the prediction period. Subsequently, we define the following random variable $q$:

$$q = M \left| \sum\_{i=1}^{m} \varepsilon\_i \mathbf{x}\_{i,t} \right| \tag{41}$$

where the randomness comes from the Rademacher variables $\varepsilon\_i$, and *M* denotes the product of all weight matrix bounds, i.e., $M = B\_{V,F} B\_{W,F} \frac{(B\_{U,F})^t - 1}{B\_{U,F} - 1}$. Then, Equation (40) can be written as

$$\begin{split} m\mathcal{R}\_S(\mathcal{H}\_{k,t}) &\leq \frac{1}{\lambda} \log(2^t \cdot \mathbb{E}[\exp(\lambda q)]) \\ &= \frac{t \log(2)}{\lambda} + \frac{1}{\lambda} \log(\mathbb{E}[\exp(\lambda(q - \mathbb{E}[q]))]) + \mathbb{E}[q] \end{split} \tag{42}$$

Using Jensen's inequality, we can bound E[*q*] as follows:

$$\mathbb{E}[q] = \mathbb{E}\left[M\left|\sum\_{i=1}^{m}\varepsilon\_{i}\mathbf{x}\_{i,t}\right|\right] \leq M\sqrt{\mathbb{E}\left[\left|\sum\_{i=1}^{m}\varepsilon\_{i}\mathbf{x}\_{i,t}\right|^{2}\right]} = M\sqrt{\sum\_{i=1}^{m}|\mathbf{x}\_{i,t}|^{2}} \leq \sqrt{m}MB\_{X}\tag{43}$$

where the second equality comes from the fact that the $\varepsilon\_i$ are i.i.d. Rademacher random variables, and the last inequality is due to the assumption that $|\mathbf{x}\_{i,t}| \leq B\_X$. Subsequently, following the results in [38], we can show that *q* is sub-Gaussian with the variance factor *v* below, since *q* satisfies a bounded-difference condition with respect to its random variables $\varepsilon\_i$, i.e., $q(\varepsilon\_1, ..., \varepsilon\_i, ..., \varepsilon\_m) - q(\varepsilon\_1, ..., -\varepsilon\_i, ..., \varepsilon\_m) \leq 2M|\mathbf{x}\_{i,t}|$.
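The chain of bounds in Equation (43) can be checked numerically. The sketch below uses hypothetical sizes ($m = 50$, input dimension 4, and $M = 1$ for simplicity) and verifies that a Monte Carlo estimate of $\mathbb{E}[q]$ over Rademacher draws stays below the Jensen bound $M\sqrt{\sum\_i |\mathbf{x}\_{i,t}|^2}$, which in turn stays below $\sqrt{m}MB\_X$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 4                          # number of samples and input dimension (hypothetical)
M = 1.0                               # weight-product constant, set to 1 for this check
x = rng.uniform(-1.0, 1.0, (m, d))    # stand-ins for the inputs x_{i,t}

# Monte Carlo estimate of E[q] = M * E| sum_i eps_i x_{i,t} | over Rademacher eps_i
trials = 20000
eps = rng.choice([-1.0, 1.0], size=(trials, m))
q = M * np.linalg.norm(eps @ x, axis=1)       # |sum_i eps_i x_{i,t}| per trial
E_q = q.mean()

jensen_bound = M * np.sqrt((x ** 2).sum())    # M * sqrt(sum_i |x_{i,t}|^2)
B_X = np.linalg.norm(x, axis=1).max()         # B_X = max_i |x_{i,t}|
crude_bound = np.sqrt(m) * M * B_X            # sqrt(m) * M * B_X

print(E_q, jensen_bound, crude_bound)
```

The middle equality of Equation (43) is what makes the Jensen bound computable: the cross terms of the Rademacher sum vanish in expectation, leaving exactly $\sum\_i |\mathbf{x}\_{i,t}|^2$.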

$$v = \frac{1}{4} \sum\_{i=1}^{m} (2M|\mathbf{x}\_{i,t}|)^2 = M^2 \sum\_{i=1}^{m} |\mathbf{x}\_{i,t}|^2 \tag{44}$$

According to the property of sub-Gaussian random variables in Definition 1, the following inequality holds for *q*:

$$\frac{1}{\lambda} \log \left( \mathbb{E} [\exp(\lambda (q - \mathbb{E}[q]))] \right) \le \frac{\lambda M^2 \sum\_{i=1}^m |\mathbf{x}\_{i,t}|^2}{2} \tag{45}$$

Let $\lambda = \frac{\sqrt{2\log(2)t}}{M\sqrt{\sum\_{i=1}^{m}|\mathbf{x}\_{i,t}|^{2}}} > 0$. The Rademacher complexity $m\mathcal{R}\_S(\mathcal{H}\_{k,t})$ in Equation (42) can then be bounded as follows:

$$\begin{split} m\mathcal{R}\_S(\mathcal{H}\_{k,t}) &\leq \frac{t\log(2)}{\lambda} + \frac{1}{\lambda} \log(\mathbb{E}[\exp(\lambda(q-\mathbb{E}[q]))]) + \mathbb{E}[q] \\ &\leq M(\sqrt{2\log(2)t} + 1)\sqrt{\sum\_{i=1}^m |\mathbf{x}\_{i,t}|^2} \\ &\leq M(\sqrt{2\log(2)t} + 1)\sqrt{m}B\_X \end{split} \tag{46}$$

Lemma 7 develops the Rademacher complexity upper bound for the hypothesis class H*<sup>k</sup>* of real-valued functions that map RNN inputs to the *k*-th output. Subsequently, we derive the generalization bound for the loss function associated with the vector-valued functions that map the RNN inputs to the output vector by taking advantage of the contraction inequality of Equations (19) and (20).

**Theorem 1.** *Let* $\mathcal{G}\_t$ *be the family of loss functions associated with the hypothesis class* $\mathcal{H}\_t$ *of vector-valued functions that map the RNN inputs to the RNN output at the t-th time step, with weight matrices and activation functions satisfying Assumptions 1–4. Given a set of m i.i.d. data samples* $S = (\mathbf{x}\_{i,t}, \mathbf{y}\_{i,t})\_{t=1}^{T}$, $i = 1, ..., m$, *with probability at least* $1 - \delta$ *over S, we have*

$$\mathbb{E}[g\_t(\mathbf{x}, \mathbf{y})] \le \frac{1}{m} \sum\_{i=1}^m g\_t(\mathbf{x}\_i, \mathbf{y}\_i) + 3 \sqrt{\frac{\log(\frac{2}{\delta})}{2m}} + \mathcal{O}\left(L\_r d\_y \frac{M(\sqrt{2\log(2)t} + 1)B\_X}{\sqrt{m}}\right) \tag{47}$$

*where* $M = B\_{V,F} B\_{W,F} \frac{(B\_{U,F})^t - 1}{B\_{U,F} - 1}$.

**Proof.** Using the results in Lemma 7 and Equations (19) and (20), we can derive the following upper bound for the loss function $\mathcal{L}(h(\mathbf{x}\_i), \mathbf{y}\_i)$ with $h(\mathbf{x}\_i)$ being vector-valued functions:

$$\begin{split} \mathcal{R}\_{S}(\mathcal{G}\_{t}) = \mathbb{E}\left[\sup\_{h \in \mathcal{H}} \frac{1}{m} \sum\_{i=1}^{m} \varepsilon\_{i} \mathcal{L}(h(\mathbf{x}\_{i}), \mathbf{y}\_{i})\right] &\leq \sqrt{2} L\_{r} \mathbb{E}\left[\sup\_{h \in \mathcal{H}} \frac{1}{m} \sum\_{i=1}^{m} \sum\_{k=1}^{d\_{y}} \varepsilon\_{ik} h\_{k}(\mathbf{x}\_{i})\right] \\ &\leq \sqrt{2} L\_{r} d\_{y} \frac{M(\sqrt{2\log(2)t} + 1)B\_{X}}{\sqrt{m}} \end{split} \tag{48}$$

Then, substituting Equation (48) into Equation (22), we derive the generalization error bound in Equation (47).

**Remark 2.** *As stated in [27], the assumption of positive-homogeneity for the nonlinear activation function can be loosened in some cases, under which a similar result of generalization error bound can be derived. Interested readers are referred to Lemma 2 and Theorem 2 in [27].*

**Remark 3.** *The generalization error bound of Equation (47) implies that the following steps can be taken to reduce the generalization error: (1) minimize the empirical loss* $\frac{1}{m}\sum\_{i=1}^{m} g\_t(\mathbf{x}\_i, \mathbf{y}\_i)$ *over the training data samples S through a careful design of the neural network, and (2) increase the number of training samples m. Additionally, as discussed in the error decomposition of Equation (17), increasing the complexity of the hypothesis class in terms of larger weight matrix bounds M could decrease the approximation error, but may also increase the estimation error, which corresponds to the last term* $\mathcal{O}(\cdot)$ *in Equation (47). Therefore, in practice, we generally start with a simple neural network and gradually increase its complexity in terms of more neurons, more layers and larger weight matrix bounds to improve the training and testing performance. The whole process stops when the testing error starts increasing, which indicates the occurrence of overfitting.*
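The model-selection recipe in Remark 3 can be illustrated with a toy example. The sketch below is hypothetical throughout: it uses polynomial degree as a stand-in for network complexity on a noisy regression task. The training error decreases monotonically as the hypothesis class grows, while the testing error eventually turns up, marking the onset of overfitting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data from a nonlinear map with noise, split into train/test
x = np.linspace(-1.0, 1.0, 60)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.size)
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]

def fit_eval(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coef = np.polyfit(x_tr, y_tr, degree)
    mse = lambda xs, ys: float(np.mean((np.polyval(coef, xs) - ys) ** 2))
    return mse(x_tr, y_tr), mse(x_te, y_te)

# Grow the hypothesis class and track the test error; select the complexity
# with the smallest test error (in practice: stop when the test error turns up)
test_errs = {deg: fit_eval(deg)[1] for deg in range(1, 12)}
best_deg = min(test_errs, key=test_errs.get)
print("selected degree:", best_deg, "test MSE:", test_errs[best_deg])
```

The selected complexity sits well above the underfitting degree-1 model but below the degrees where the test error starts to rise, mirroring the trade-off between approximation and estimation error in Equation (17).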

**Remark 4.** *While the actual generalization error is difficult to obtain due to the unknown data distribution and the complexity of the hypothesis class, Equation (47) characterizes the upper bound for the gap between the generalization error and the empirical error by moving the term* $\frac{1}{m}\sum\_{i=1}^{m} g\_t(\mathbf{x}\_i, \mathbf{y}\_i)$ *to the LHS of Equation (47). Since the neural network training process itself minimizes the training error only, this generalization gap is more useful in practice, as it shows how well the neural network will perform on unseen data under the same data distribution. In terms of modeling the nonlinear system of Equation (1), this generalization gap provides an upper bound on the modeling error for all the states in the operating region, and can be used in the design of model-based controllers that probabilistically ensure closed-loop stability while accounting for bounded modeling errors.*

**Remark 5.** *It is noticed that the generalization error bound also depends on the time length t of the RNN inputs, which is different from the results derived for feedforward neural networks in [27]. Additionally, unlike other deep neural networks that use different parameters for each hidden layer, RNNs share the same weight matrix U at each time step, and therefore, the bound for the product of weight matrices is derived in the form of* $M = B\_{V,F} B\_{W,F} \frac{(B\_{U,F})^t - 1}{B\_{U,F} - 1}$. *From Equation (47), it can be seen that as the data sequence length t increases, the network hypothesis becomes more complex, which leads to a larger generalization error bound. Therefore, a shorter time-sequence prediction is preferred from the perspective of prediction accuracy. However, a short prediction period is not always desirable from the control perspective, especially in model predictive control (MPC) schemes. In Section 5, we will demonstrate that RNN models predicting a short period of time achieve the desired prediction performance in open-loop tests, but perform poorly in closed-loop simulations due to the error accumulated during successive execution of RNN predictions within the MPC prediction horizon.*

#### **4. RNN-Based MPC with Probabilistic Stability Analysis**

In this section, we present the formulation of Lyapunov-based MPC (LMPC) that uses RNN models to predict the evolution of future states, along with the closed-loop stability analysis showing that, with high probability, the closed-loop state of Equation (1) is bounded in the stability region for all times.

#### *4.1. Lyapunov-Based Control Using RNN Models*

To simplify the discussion of RNN stability properties for the continuous-time nonlinear system of Equation (1), we represent the RNN model in the following continuous-time form [9]:

$$\dot{\hat{\mathbf{x}}} = F\_{nn}(\hat{\mathbf{x}}, \mathbf{u}) := A\hat{\mathbf{x}} + \Theta^T \mathbf{z} \tag{49}$$

where $\hat{\mathbf{x}} \in \mathbf{R}^n$ and $\mathbf{u} \in \mathbf{R}^k$ are the RNN state vector and the manipulated input vector, respectively. $\mathbf{z} = [z\_1, ..., z\_n, z\_{n+1}, ..., z\_{k+n}] = [\sigma(\hat{x}\_1), ..., \sigma(\hat{x}\_n), u\_1, ..., u\_k] \in \mathbf{R}^{n+k}$ is a vector of both the input $\mathbf{u}$ and the network state $\hat{\mathbf{x}}$, where $\sigma(\cdot)$ represents the nonlinear activation function. *A* is a diagonal coefficient matrix with all diagonal elements being negative, and $\Theta = [\theta\_1, ..., \theta\_n] \in \mathbf{R}^{(k+n)\times n}$ with $\theta\_i = b\_i[w\_{i1}, ..., w\_{i(k+n)}]$, $i = 1, ..., n$, where $w\_{ij}$ denotes the weight connecting the *j*-th input to the *i*-th neuron, $i = 1, ..., n$ and $j = 1, ..., (k+n)$. The weight matrices and activation functions satisfy Assumptions 1–4. To simplify the notation, we use Equation (49) to represent a one-hidden-layer RNN model, and bias terms are not explicitly included in Equation (49); however, the results derived in this section are not restricted to one-hidden-layer RNN models, and can be extended to deep RNNs with multiple hidden layers.
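For concreteness, the continuous-time RNN vector field of Equation (49) can be sketched as follows. The dimensions, weights, the choice of tanh as $\sigma(\cdot)$, and the forward-Euler integrator are all hypothetical illustrations of how such a model generates state predictions, not the paper's specific implementation.

```python
import numpy as np

n, k = 2, 1                                   # state and input dimensions (hypothetical)
rng = np.random.default_rng(2)

A = np.diag([-1.0, -2.0])                     # diagonal A with all diagonal elements negative
Theta = rng.uniform(-0.5, 0.5, (n + k, n))    # Theta = [theta_1, ..., theta_n]

def F_nn(x_hat, u):
    """RNN vector field of Equation (49): A x_hat + Theta^T z."""
    z = np.concatenate([np.tanh(x_hat), u])   # z = [sigma(x_1),...,sigma(x_n), u_1,...,u_k]
    return A @ x_hat + Theta.T @ z

def predict(x0, u, hc=1e-3, steps=100):
    """Forward-Euler integration of the RNN model over one sampling period."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + hc * F_nn(x, u)
    return x

x_next = predict([0.5, -0.3], np.array([0.1]))
print("predicted state:", x_next)
```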

We assume that there exists a stabilizing feedback controller $u = \Phi\_{nn}(x) \in \mathcal{U}$ that can render the origin of the RNN model of Equation (49) exponentially stable in an open neighborhood $\hat{D}$ around the origin. The stabilizability assumption implies the existence of a $\mathcal{C}^1$ control Lyapunov function $\hat{V}(x)$ such that the following inequalities hold for all *x* in $\hat{D}$:

$$\hat{c}\_1|\mathbf{x}|^2 \le \hat{V}(\mathbf{x}) \le \hat{c}\_2|\mathbf{x}|^2,\tag{50}$$

$$\frac{\partial \hat{V}(\mathbf{x})}{\partial \mathbf{x}} F\_{nn}(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) \le -\hat{c}\_3 |\mathbf{x}|^2 \tag{51}$$

$$\left|\frac{\partial \hat{V}(\mathbf{x})}{\partial \mathbf{x}}\right| \le \hat{c}\_4 |\mathbf{x}|\tag{52}$$

where $\hat{c}\_1, \hat{c}\_2, \hat{c}\_3, \hat{c}\_4$ are positive constants. The closed-loop stability region for the RNN model of Equation (49) is characterized as a level set of the Lyapunov function embedded in $\hat{D}$ as follows: $\Omega\_{\hat{\rho}} := \{x \in \hat{D} \mid \hat{V}(x) \le \hat{\rho}\}$, where $\hat{\rho} > 0$. Additionally, there exist positive constants $M\_{nn}$ and $L\_{nn}$ such that the following inequalities hold for all $x, x' \in \Omega\_{\hat{\rho}}$ and $u \in \mathcal{U}$:

$$|F\_{nn}(\mathbf{x}, \mathbf{u})| \le M\_{nn} \tag{53}$$

$$\left| \frac{\partial \hat{V}(\mathbf{x})}{\partial \mathbf{x}} F\_{nn}(\mathbf{x}, \mathbf{u}) - \frac{\partial \hat{V}(\mathbf{x}')}{\partial \mathbf{x}} F\_{nn}(\mathbf{x}', \mathbf{u}) \right| \leq L\_{nn} |\mathbf{x} - \mathbf{x}'| \tag{54}$$

Due to the model mismatch between the nonlinear system of Equation (1) and the RNN model of Equation (49), the following proposition is developed to demonstrate that the feedback controller *u* = Φ*nn*(*x*) ∈ *U* is able to stabilize the system of Equation (1) with high probability if the modeling error is sufficiently small.

**Proposition 1.** *Consider the RNN model trained using a set of m i.i.d. data samples* $S = (\mathbf{x}\_{i,t}, \mathbf{y}\_{i,t})\_{t=1}^{T}$, $i = 1, ..., m$, *and satisfying Assumptions 1–4. Under the assumption that the feedback controller* $u = \Phi\_{nn}(x) \in \mathcal{U}$ *renders the origin of the RNN system of Equation (49) exponentially stable for all* $x \in \Omega\_{\hat{\rho}}$, *if for all* $x \in \Omega\_{\hat{\rho}}$ *and* $u \in \mathcal{U}$ *the modeling error can be constrained by* $|F(x, u) - F\_{nn}(x, u)| \le \gamma|x|$, *where* $\gamma$ *is a positive real number satisfying* $\gamma < \hat{c}\_3/\hat{c}\_4$, *then the controller* $u = \Phi\_{nn}(x) \in \mathcal{U}$ *also renders the origin of the nonlinear system of Equation (1) exponentially stable with probability at least* $1 - \delta$ *for all* $x \in \Omega\_{\hat{\rho}}$.

**Proof.** To demonstrate that the origin of the nominal system of Equation (1) can be rendered exponentially stable $\forall x \in \Omega\_{\hat{\rho}}$ with probability at least $1 - \delta$ under the controller $u = \Phi\_{nn}(x) \in \mathcal{U}$ designed for the RNN model of Equation (49), we prove that the time-derivative of $\hat{V}$ associated with the state *x* of Equation (1) can be rendered negative in probability under $u = \Phi\_{nn}(x) \in \mathcal{U}$. Based on Equations (51) and (52), $\dot{\hat{V}}$ is derived as follows:

$$\begin{split} \dot{\hat{V}} &= \frac{\partial \hat{V}(\mathbf{x})}{\partial \mathbf{x}} F(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) \\ &= \frac{\partial \hat{V}(\mathbf{x})}{\partial \mathbf{x}} (F\_{nn}(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) + F(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) - F\_{nn}(\mathbf{x}, \Phi\_{nn}(\mathbf{x}))) \\ &\leq -\hat{c}\_3 |\mathbf{x}|^2 + \hat{c}\_4 |\mathbf{x}| \cdot |F(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) - F\_{nn}(\mathbf{x}, \Phi\_{nn}(\mathbf{x}))| \end{split} \tag{55}$$

where the last term $|F(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) - F\_{nn}(\mathbf{x}, \Phi\_{nn}(\mathbf{x}))|$ represents the error between the RNN model and the process model of Equation (1). Since the RNN model is trained using sampled data with a sufficiently small time interval (i.e., integration time step $h\_c$), the modeling error term for the same initial state $\mathbf{x}(t) = \hat{\mathbf{x}}(t)$ can be approximated as follows:

$$\begin{aligned} &\left| F(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) - F\_{nn}(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) \right| \\ \leq & \left| \frac{\mathbf{x}(t + h\_{c}) - \mathbf{x}(t)}{h\_{c}} - \frac{\hat{\mathbf{x}}(t + h\_{c}) - \hat{\mathbf{x}}(t)}{h\_{c}} \right| + \mathcal{O}(h\_{c}) \\ \leq & \left| \frac{\mathbf{x}(t + h\_{c}) - \hat{\mathbf{x}}(t + h\_{c})}{h\_{c}} \right| + \mathcal{O}(h\_{c}) \end{aligned} \tag{56}$$

where $\hat{\mathbf{x}}$ is the state predicted by the RNN model, and $\mathbf{x}$ is the state of the actual nonlinear system of Equation (1). $\mathcal{O}(h\_c)$ is the truncation error of the finite-difference method. Since $|\mathbf{x}(t + h\_c) - \hat{\mathbf{x}}(t + h\_c)|$ represents the Euclidean norm of the prediction error, while the generalization error bound in Theorem 1 is derived using the MSE loss function, the modeling error can be bounded as follows:
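The finite-difference argument of Equation (56) can be illustrated on a scalar example. Below, the true dynamics $F(x) = -x$ and a deliberately mismatched model $F\_{nn}(x) = -0.9x$ (both hypothetical) are integrated exactly over one step $h\_c$ from the same initial state; the finite-difference quotient recovers the vector-field mismatch $|F - F\_{nn}| = 0.1|x\_0|$ up to the $\mathcal{O}(h\_c)$ truncation error.

```python
import numpy as np

hc = 1e-3                      # integration time step h_c
x0 = 1.0                       # shared initial condition x(t) = x_hat(t)

# Exact one-step solutions of x_dot = -x (plant) and x_dot = -0.9 x (model)
x_true = np.exp(-1.0 * hc) * x0
x_pred = np.exp(-0.9 * hc) * x0

# Finite-difference estimate of |F(x0) - F_nn(x0)| as in Equation (56)
fd_estimate = abs(x_true - x_pred) / hc
exact_mismatch = abs(-1.0 * x0 - (-0.9 * x0))   # = 0.1

print(fd_estimate, exact_mismatch)
```

Shrinking `hc` tightens the agreement, which is exactly the $\mathcal{O}(h\_c)$ behavior used in the derivation of $E\_M$ below.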

$$|F(\mathbf{x}, \Phi\_{\rm{nn}}(\mathbf{x})) - F\_{\rm{nn}}(\mathbf{x}, \Phi\_{\rm{nn}}(\mathbf{x}))| \le E\_M \tag{57}$$

where

$$E\_M = \frac{1}{h\_c} \sqrt{\frac{1}{m} \sum\_{i=1}^{m} g\_t(\mathbf{x}\_i, \mathbf{y}\_i) + 3 \sqrt{\frac{\log(\frac{2}{\delta})}{2m}} + \mathcal{O}\left(L\_r d\_y \frac{M(\sqrt{2\log(2)t} + 1)B\_X}{\sqrt{m}}\right)} + \mathcal{O}(h\_c)\tag{58}$$

By choosing the number of samples $m \ge m\_N(\delta, h\_c, |x|)$, where $m\_N(\delta, h\_c, |x|)$ is the minimum data sample size satisfying $E\_M \le \gamma|x|$, $\gamma < \hat{c}\_3/\hat{c}\_4$, the following equation shows that $\dot{\hat{V}}$ can be rendered negative for all $x \in \Omega\_{\hat{\rho}}$ and $x \ne 0$ with probability at least $1 - \delta$, i.e., $\mathbb{P}[\dot{\hat{V}} \le 0] \ge 1 - \delta$:

$$\begin{split} \dot{\hat{V}} &\leq -\hat{c}\_3|\mathbf{x}|^2 + \hat{c}\_4|\mathbf{x}| \cdot |F(\mathbf{x}, \Phi\_{nn}(\mathbf{x})) - F\_{nn}(\mathbf{x}, \Phi\_{nn}(\mathbf{x}))| \\ &\leq -\hat{c}\_3|\mathbf{x}|^2 + \hat{c}\_4 \gamma |\mathbf{x}|^2 \\ &= \tilde{c}\_3 |\mathbf{x}|^2 \\ &\leq 0 \end{split} \tag{59}$$

where $\tilde{c}\_3 = -\hat{c}\_3 + \hat{c}\_4\gamma < 0$ for any $\gamma < \hat{c}\_3/\hat{c}\_4$. Therefore, with probability at least $1 - \delta$, the closed-loop state of the system of Equation (1) converges to the origin under $u = \Phi\_{nn}(x) \in \mathcal{U}$ for all $x\_0 \in \Omega\_{\hat{\rho}}$.

**Remark 6.** *The modeling error constraint* $E\_M \le \gamma|x|$, $\forall x \in \Omega\_{\hat{\rho}}$ *implies that more data are needed for states closer to the origin. This is because when x approaches the origin, the upper bound* $\gamma|x|$ *is close to zero, and therefore, the prediction of* $\hat{x}$ *should be more accurate in order to yield a desired approximation of the system dynamics* $\dot{x} = F(x, u)$ *using numerical methods. As a result, it may seem that an infinite number of data samples is needed when the state converges to the origin (i.e., x is infinitely close to zero). However, we will show in the next subsection that such a large dataset for the states in a small neighborhood around the origin is not necessary for operation under MPC. This is because under the sample-and-hold implementation of control actions, the states are forced to be bounded in a small ball around the origin, instead of converging to the exact steady state. Therefore, the modeling error constraint* $E\_M \le \gamma|x|$, $\forall x \in \Omega\_{\hat{\rho}}$ *can be loosened for states in this small ball, which could improve the computational efficiency of the training process.*

#### *4.2. Stabilization of Nonlinear System under Lyapunov-Based Controller*

Subsequently, the following propositions are developed to demonstrate the impact of the sample-and-hold implementation of control actions on system stability. Specifically, Proposition 2 demonstrates that, in the presence of mismatch between the plant model of Equation (1) and the RNN model of Equation (49), the error between the predicted state and the actual state is bounded over a finite period of time. Then, we consider the Lyapunov-based controller $u = \Phi\_{nn}(x)$ applied to the nonlinear system of Equation (1) in a sample-and-hold fashion, and demonstrate in Proposition 3 that, with high probability, the nonlinear system of Equation (1) can be stabilized using the controller $u = \Phi\_{nn}(x)$ designed for the RNN model of Equation (49).

**Proposition 2** (c.f. Proposition 3 in [9])**.** *Consider the nonlinear system* $\dot{x} = F(x, u)$ *of Equation (1) and the RNN model* $\dot{\hat{x}} = F\_{nn}(\hat{x}, u)$ *of Equation (49) with the same initial condition* $x\_0 = \hat{x}\_0 \in \Omega\_{\hat{\rho}}$. *There exist a class* $\mathcal{K}$ *function* $f\_w(\cdot)$ *and a positive constant* $\kappa$ *such that the following inequalities hold* $\forall x, \hat{x} \in \Omega\_{\hat{\rho}}$*:*

$$|\mathbf{x}(t) - \hat{\mathbf{x}}(t)| \le f\_{w}(t) := \frac{E\_M}{L\_{x}} (e^{L\_{x}t} - 1) \tag{60}$$

$$\hat{V}(\mathbf{x}) \le \hat{V}(\hat{\mathbf{x}}) + \frac{\hat{c}\_4 \sqrt{\hat{\rho}}}{\sqrt{\hat{c}\_1}} |\mathbf{x} - \hat{\mathbf{x}}| + \kappa |\mathbf{x} - \hat{\mathbf{x}}|^2 \tag{61}$$

**Proof.** The proof can be found in [9], and is omitted here. Note that the proof in [9] considers the nonlinear system subject to bounded disturbances, while in this work, we consider only the nominal system without disturbances. However, the stability results derived in this section can be readily generalized to disturbed systems provided that the disturbances are sufficiently small and bounded. Additionally, the modeling error term in [9] is replaced in Equations (60) and (61) by $E\_M$ (see the definition of $E\_M$ in Equation (58)), which accounts for the RNN generalization error derived in a probabilistic manner.
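The error bound of Equation (60) can be checked numerically on a scalar example. The vector fields, the mismatch bound $E\_M = 0.05$, and the Lipschitz constant $L\_x = 1.5$ below are all hypothetical; the simulation verifies that the gap between the plant state and the model state stays below $f\_w(t) = \frac{E\_M}{L\_x}(e^{L\_x t} - 1)$ along the trajectory.

```python
import numpy as np

F = lambda x: -x + 0.5 * np.sin(x)             # plant vector field (hypothetical)
F_nn = lambda x: -x + 0.5 * np.sin(x) + 0.05   # model with constant mismatch <= E_M
E_M, L_x = 0.05, 1.5                           # mismatch bound, Lipschitz constant of F

h, T = 1e-3, 2.0
x = x_hat = 0.8                                # same initial condition x_0 = x_hat_0
t = 0.0
while t < T:
    x, x_hat = x + h * F(x), x_hat + h * F_nn(x_hat)
    t += h
    f_w = (E_M / L_x) * (np.exp(L_x * t) - 1.0)
    assert abs(x - x_hat) <= f_w + 1e-6        # Equation (60) holds along the trajectory
print("state gap stayed below f_w(t) up to T =", T)
```

The bound is tight at short times (both sides grow like $E\_M t$) and conservative later, since it ignores any contraction in the dynamics.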

The following proposition is developed to show probabilistic closed-loop stability of the nonlinear system of Equation (1) under sample-and-hold implementation of the controller *u* = Φ*nn*(*x*) ∈ *U*.

**Proposition 3** (c.f. Proposition 4 in [9])**.** *Consider the nonlinear system of Equation (1) with the controller* $u = \Phi\_{nn}(\hat{x}) \in \mathcal{U}$ *that meets the conditions of Equations (50)–(52), and the RNN model of Equation (49) that meets all the conditions in Theorem 1. Under the sample-and-hold implementation of control actions, i.e.,* $u(t) = \Phi\_{nn}(\hat{x}(t\_k))$, $\forall t \in [t\_k, t\_{k+1})$, *where* $t\_{k+1} := t\_k + \Delta$, *there exist* $\varepsilon\_w > 0$, $\Delta > 0$ *and* $\hat{\rho} > \rho\_{min} > \rho\_{nn} > \rho\_s$ *that satisfy*

$$\frac{\tilde{c}\_3}{\hat{c}\_2} \rho\_s + L\_{x}' M\_F \Delta \le -\varepsilon\_w \tag{62}$$

*and*

$$\rho\_{nn} := \max \{ \hat{V}(\hat{\mathbf{x}}(t+\Delta)) \mid \hat{\mathbf{x}}(t) \in \Omega\_{\rho\_s}, u \in \mathcal{U} \}\tag{63}$$

$$\rho\_{min} \ge \rho\_{nn} + \frac{\hat{c}\_4 \sqrt{\hat{\rho}}}{\sqrt{\hat{c}\_1}} f\_w(\Delta) + \kappa (f\_w(\Delta))^2 \tag{64}$$

*such that for any x*(*tk*) ∈ Ω*ρ*ˆ\Ω*ρs, with probability at least* 1 − *δ, the following inequality holds:*

$$\hat{V}(\mathbf{x}(t)) \le \hat{V}(\mathbf{x}(t\_k)), \ \forall t \in [t\_k, t\_{k+1}) \tag{65}$$

*and the state x*(*t*) *of the nonlinear system of Equation (1) is bounded in* Ω*ρ*<sup>ˆ</sup> *for all times and ultimately bounded in* Ω*ρmin .*

**Proof.** The key steps for the proof of Proposition 3 are presented below, and the full proof is omitted here as it is similar to the proof of Proposition 4 in [9]. The only difference is that Equation (65) now holds in probability due to the probabilistic nature of the modeling error bound.

To show that the state will move towards $\Omega\_{\rho\_s}$, which is a sufficiently small level set of $\hat{V}$ around the origin, we show that the time derivative of $\hat{V}$ can be rendered negative for any $x(t\_k) \in \Omega\_{\hat{\rho}}\backslash\Omega\_{\rho\_s}$ under $u = \Phi\_{nn}(x) \in \mathcal{U}$.

$$\begin{split} \dot{\hat{V}}(\mathbf{x}(t)) &= \frac{\partial \hat{V}(\mathbf{x}(t))}{\partial \mathbf{x}} F(\mathbf{x}(t), \Phi\_{nn}(\mathbf{x}(t\_k))) \\ &= \frac{\partial \hat{V}(\mathbf{x}(t\_k))}{\partial \mathbf{x}} F(\mathbf{x}(t\_k), \Phi\_{nn}(\mathbf{x}(t\_k))) + \frac{\partial \hat{V}(\mathbf{x}(t))}{\partial \mathbf{x}} F(\mathbf{x}(t), \Phi\_{nn}(\mathbf{x}(t\_k))) \\ &\quad - \frac{\partial \hat{V}(\mathbf{x}(t\_k))}{\partial \mathbf{x}} F(\mathbf{x}(t\_k), \Phi\_{nn}(\mathbf{x}(t\_k))) \end{split} \tag{66}$$

As shown in Proposition 1, by choosing the number of samples $m \ge m\_N(\delta, h\_c, |x|)$ such that $E\_M \le \gamma|x|$, where $\gamma < \hat{c}\_3/\hat{c}\_4$, it holds that $\mathbb{P}[\dot{\hat{V}} \le 0] \ge 1 - \delta$ under $u = \Phi\_{nn}(x) \in \mathcal{U}$. Then, using the Lipschitz conditions in Equations (5)–(7) and the conditions for the Lyapunov function in Equations (50)–(52), Equation (66) can be further bounded as follows:

$$\begin{split} \dot{\hat{V}}(\mathbf{x}(t)) &\leq \frac{\tilde{c}\_3}{\hat{c}\_2} \rho\_s + \frac{\partial \hat{V}(\mathbf{x}(t))}{\partial \mathbf{x}} F(\mathbf{x}(t), \Phi\_{nn}(\mathbf{x}(t\_k))) - \frac{\partial \hat{V}(\mathbf{x}(t\_k))}{\partial \mathbf{x}} F(\mathbf{x}(t\_k), \Phi\_{nn}(\mathbf{x}(t\_k))) \\ &\leq \frac{\tilde{c}\_3}{\hat{c}\_2} \rho\_s + L\_{x}'|\mathbf{x}(t) - \mathbf{x}(t\_k)| \\ &\leq \frac{\tilde{c}\_3}{\hat{c}\_2} \rho\_s + L\_{x}' M\_F \Delta \end{split} \tag{67}$$

Therefore, if Equation (62) is satisfied, we can find a negative real number $-\varepsilon\_w$ that bounds the time derivative of $\hat{V}$. This implies that for any state $x(t\_k) \in \Omega\_{\hat{\rho}}\backslash\Omega\_{\rho\_s}$, with probability at least $1 - \delta$, the Lyapunov function value will decrease within one sampling time, and therefore, the state can ultimately reach the set $\Omega\_{\rho\_s}$ under $u = \Phi\_{nn}(x) \in \mathcal{U}$ with a certain probability. Additionally, since $\dot{\hat{V}}$ may not be rendered negative within $\Omega\_{\rho\_s}$ under the sample-and-hold implementation of the Lyapunov-based control law $u = \Phi\_{nn}(x) \in \mathcal{U}$, the predicted state of the RNN model of Equation (49) is only required to be bounded in $\Omega\_{\rho\_{nn}}$, which is a slightly larger level set that includes $\Omega\_{\rho\_s}$ (see the definition of $\Omega\_{\rho\_{nn}}$ in Equation (63)). In this case, we can show that the state of the actual nonlinear system of Equation (1) is bounded in $\Omega\_{\rho\_{min}}$, which is a superset of $\Omega\_{\rho\_{nn}}$ that accounts for the modeling error within one sampling period (see the definition of $\Omega\_{\rho\_{min}}$ in Equation (64)). As a result, we do not impose any constraints on $\dot{\hat{V}}$ for $x \in \Omega\_{\rho\_s}$. This explains why the modeling error constraint $E\_M \le \gamma|x|$ is not necessary for $x \in \Omega\_{\rho\_s}$, as stated in Remark 6.
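The sample-and-hold mechanism discussed above can be illustrated on a scalar example. The system $\dot{x} = x + u$, the feedback law $\Phi\_{nn}(x) = -2x$, and the sampling period $\Delta = 0.1$ below are all hypothetical; the simulation shows the Lyapunov function $\hat{V}(x) = x^2$ decreasing at every sampling instant even though the input is held constant between them.

```python
import numpy as np

F = lambda x, u: x + u            # open-loop unstable scalar system (hypothetical)
Phi_nn = lambda x: -2.0 * x       # stabilizing feedback law
V_hat = lambda x: x ** 2          # Lyapunov function

Delta, h = 0.1, 1e-3              # sampling period and integration step
x = 1.0
for k in range(100):              # 100 sampling periods
    u = Phi_nn(x)                 # input computed at t_k and held over [t_k, t_k + Delta)
    V_k = V_hat(x)
    for _ in range(int(round(Delta / h))):
        x = x + h * F(x, u)       # plant evolves under the held input
    assert V_hat(x) < V_k         # V_hat decreases at each sampling instant
print("final |x|:", abs(x))
```

With model mismatch or disturbances, the decrease would only hold outside a small ball around the origin, which is exactly why the analysis above targets ultimate boundedness in $\Omega\_{\rho\_{min}}$ rather than convergence to the steady state.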

#### *4.3. Lyapunov-Based MPC Using RNN Models for Nonlinear Systems*

The Lyapunov-based model predictive control design is given by the following optimization problem [9,10]:

$$\mathcal{J} = \min\_{u \in \mathcal{S}(\Delta)} \int\_{t\_k}^{t\_{k+N}} L\_{MPC}(\tilde{\mathbf{x}}(t), u(t)) dt \tag{68}$$

$$\text{s.t.}\quad \dot{\tilde{\mathbf{x}}}(t) = F\_{nn}(\tilde{\mathbf{x}}(t), u(t)) \tag{69}$$

$$u(t) \in \mathcal{U}, \; \forall \; t \in [t\_k, t\_{k+N}) \tag{70}$$

$$\tilde{\mathbf{x}}(t\_k) = \mathbf{x}(t\_k) \tag{71}$$

$$\dot{\hat{V}}(\mathbf{x}(t\_k), u) \le \dot{\hat{V}}(\mathbf{x}(t\_k), \Phi\_{nn}(\mathbf{x}(t\_k))), \text{ if } \mathbf{x}(t\_k) \in \Omega\_{\hat{\rho}} \backslash \Omega\_{\rho\_{nn}} \tag{72}$$

$$\hat{V}(\tilde{\mathbf{x}}(t)) \le \rho\_{nn}, \forall t \in [t\_k, t\_{k+N}), \text{ if } \mathbf{x}(t\_k) \in \Omega\_{\rho\_{nn}} \tag{73}$$

where $\tilde{x}$, *N* and $\mathcal{S}(\Delta)$ are the predicted state trajectory, the prediction horizon length, and the set of piecewise-constant functions with period $\Delta$, respectively. We use $\dot{\hat{V}}(x, u)$ to represent the time derivative of the Lyapunov function $\hat{V}$, i.e., $\dot{\hat{V}}(x, u) = \frac{\partial \hat{V}(x)}{\partial x} F\_{nn}(x, u)$. After solving the optimization problem of Equations (68)–(73) at $t = t\_k$, we apply the first control action $u(t)$, $t \in [t\_k, t\_{k+1})$ of the optimal input trajectory $u^*(t)$, $t \in [t\_k, t\_{k+N})$ to the system of Equation (1). Then the horizon is rolled one sampling period forward, and the LMPC is solved again at the next sampling time with the new state measurement available at $t = t\_{k+1}$.

The optimization problem of Equations (68)–(73) minimizes the objective function of Equation (68), which is the integral of $L\_{MPC}(\tilde{x}(t), u(t))$ over the prediction horizon, subject to the constraints of Equations (69)–(73). In the constraint of Equation (69), the RNN model of Equation (49) is used to predict the evolution of the closed-loop states over $t \in [t\_k, t\_{k+N})$ given the state measurement at $t = t\_k$ in Equation (71). The constraint of Equation (70) ensures that the inputs are bounded over the entire prediction horizon. Finally, the constraints of Equations (72) and (73) drive the predicted state towards the origin and ultimately maintain it inside $\Omega\_{\rho\_{nn}}$. It should be noted that despite the probabilistic nature of the RNN generalization error bound, the neural network prediction of Equation (69) is deterministic after training is completed. In other words, given the same initial state $x(t\_k)$ and the manipulated inputs $u(t)$, $\forall t \in [t\_k, t\_{k+N})$, the RNN model of Equation (69) produces deterministic results that statistically approximate the evolution of the states over $t \in [t\_k, t\_{k+N})$. This is different from stochastic MPC, which uses a stochastic process model in the MPC formulation and therefore requires the calculation of uncertainty propagation and accounts for probabilistic constraint satisfaction. The LMPC formulation of Equations (68)–(73) is solved with a deterministic RNN model, based on which recursive feasibility is guaranteed and probabilistic stability results can be developed.
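The receding-horizon mechanism can be sketched on a scalar example. Everything below is hypothetical: the plant, a surrogate model with deliberate mismatch standing in for the trained RNN, a quadratic stage cost, and a brute-force grid search standing in for the nonlinear program of Equations (68)–(73); the contractive constraints of Equations (72) and (73) are omitted for brevity.

```python
import numpy as np

F = lambda x, u: x + u                  # hypothetical plant x_dot = x + u
F_nn = lambda x, u: 1.05 * x + u        # surrogate model with small mismatch

Delta, N = 0.1, 5                       # sampling period and prediction horizon
U_grid = np.linspace(-3.0, 3.0, 61)     # discretized admissible input set U

def integrate(x, u, f, steps=10):
    """Hold input u constant and integrate f over one sampling period."""
    h = Delta / steps
    for _ in range(steps):
        x = x + h * f(x, u)
    return x

def lmpc(x):
    """Pick the constant input minimizing the predicted cost over N periods."""
    best_u, best_J = 0.0, np.inf
    for u in U_grid:
        xt, J = x, 0.0
        for _ in range(N):
            xt = integrate(xt, u, F_nn)   # predict with the surrogate model
            J += xt ** 2 + 0.1 * u ** 2   # quadratic stage cost (stand-in for L_MPC)
        if J < best_J:
            best_u, best_J = u, J
    return best_u

x = 1.0
for k in range(30):          # receding horizon: apply first input, then re-solve
    u = lmpc(x)
    x = integrate(x, u, F)   # true plant evolves under the held input
print("closed-loop |x| after 30 periods:", abs(x))
```

Even with the plant/model mismatch, the re-optimization at every sampling time keeps the closed-loop state near the origin, which is the feedback mechanism invoked in the proof of Theorem 2 below.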

The following theorem establishes that the LMPC ensures closed-loop stability for the nonlinear system of Equation (1) with high probability, provided that the RNN model is well constructed and satisfies the modeling error constraint in Proposition 1.

**Theorem 2.** *Consider the closed-loop system of Equation (1) under the LMPC of Equations (68)–(73) based on the controller* $\Phi\_{nn}(x)$ *that satisfies Equations (50)–(52). Let* $\Delta > 0$, $\varepsilon\_w > 0$ *and* $\hat{\rho} > \rho\_{min} > \rho\_{nn} > \rho\_s$ *satisfy Equations (62)–(64). Then, given any initial state* $x\_0 \in \Omega\_{\hat{\rho}}$, *if the RNN model is developed satisfying the conditions in Propositions 2 and 3, there always exists a feasible solution for the optimization problem of Equations (68)–(73). Additionally, by choosing the number of samples* $m \ge m\_N(\delta, h\_c, |x|)$ *such that* $E\_M \le \gamma|x|$ *holds, then for each time step, with probability at least* $1 - \delta$, *closed-loop stability is guaranteed for the system of Equation (1) under the LMPC of Equations (68)–(73) in the sense that* $x(t) \in \Omega\_{\hat{\rho}}$, $\forall t \ge 0$, *and* $x(t)$ *ultimately converges to* $\Omega\_{\rho\_{min}}$.

**Proof.** The proof consists of two parts. In the first part, we prove the recursive feasibility of the LMPC optimization problem of Equations (68)–(73). This part follows closely the proof of Theorem 2 in [9], which shows that the stabilizing controller $u(t) = \Phi\_{nn}(x(t)) \in \mathcal{U}$, $t \in [t\_k, t\_{k+N})$ is a feasible solution to the LMPC optimization problem. Specifically, when $x(t\_k) \in \Omega\_{\hat{\rho}}\backslash\Omega\_{\rho\_{nn}}$ at $t = t\_k$, it is readily shown that the control action $u(t) = \Phi\_{nn}(x(t\_k))$ is a feasible solution that satisfies the constraint of Equation (72) with equality. When $x(t\_k) \in \Omega\_{\rho\_{nn}}$, as shown in [9], $u(t) = \Phi\_{nn}(x(t)) \in \mathcal{U}$, $t \in [t\_k, t\_{k+N})$ is again a feasible solution that maintains the predicted states within $\Omega\_{\rho\_{nn}}$ over the prediction horizon.

In the second part, we prove that closed-loop stability is guaranteed in probability for the nonlinear system of Equation (1) under the LMPC. Specifically, when $x(t\_k) \in \Omega\_{\hat{\rho}}\backslash\Omega\_{\rho\_{nn}}$ at $t = t\_k$, we have shown in Proposition 3 that, for each sampling time, $\hat{V}(x(t)) \le \hat{V}(x(t\_k))$ holds under $u(t) = \Phi\_{nn}(x(t)) \in \mathcal{U}$ for $t \in [t\_k, t\_{k+1})$ with probability at least $1 - \delta$. This implies that the state of the actual nonlinear system of Equation (1) can be driven towards the origin under the LMPC using RNN models for prediction, provided that the modeling error is sufficiently small and satisfies $E\_M \le \gamma|x|$, $\forall x \in \Omega\_{\hat{\rho}}$. When $x(t\_k) \in \Omega\_{\rho\_{nn}}$, the input sequences are optimized to minimize the objective function of Equation (68) while meeting the constraint of Equation (73). However, due to the existence of modeling error, the true states may leave $\Omega\_{\rho\_{nn}}$ while the predicted states remain inside $\Omega\_{\rho\_{nn}}$. In Proposition 3, we have shown that, with probability at least $1 - \delta$, the true state of the system of Equation (1) can be bounded within $\Omega\_{\rho\_{min}}$, which is a superset of $\Omega\_{\rho\_{nn}}$ designed to account for the modeling error within one sampling period. Additionally, it is noted that depending on the prediction horizon of the RNN models, we may need to perform RNN predictions successively to obtain the full prediction of the state trajectory over the entire prediction horizon, $t \in [t\_k, t\_{k+N})$. For example, in this work, the RNN model of Equation (49) is developed to predict one sampling period forward, and thus, in order to predict the state trajectory over $t \in [t\_k, t\_{k+N})$, we need to carry out RNN predictions *N* times. After the initial prediction at $t = t\_k$, each prediction uses the previously predicted state as the initial state, along with the manipulated input *u*, to predict the state at the next sampling time. This inevitably accumulates the modeling error over successive calculations, which may lead to a probability lower than $1 - \delta$ that the final state prediction error is bounded by $E\_M \le \gamma|x|$.

As a result, the true states may further deviate from the predicted states, and ultimately leave Ω*ρmin* within finite time. Despite the degradation of prediction performance over time, closed-loop stability is not affected, since the LMPC is implemented in a rolling horizon manner with feedback state measurements available at every sampling time. The input sequences are re-optimized using new state measurements at every sampling time to meet the desired closed-loop performance. Additionally, since the modeling error condition *EM* ≤ *γ*|*x*| holds for the first sampling period, the state of the actual nonlinear system of Equation (1) is guaranteed not to leave Ω*ρmin* within one sampling period with probability at least 1 − *δ*, as shown in Proposition 3. At the next sampling period, the constraint of Equation (72) or of Equation (73) will be activated depending on the measurement of *x*(*tk*+1). Regardless of where *x*(*tk*+1) is, the LMPC of Equations (68)–(73) will drive the predicted state into Ω*ρnn*, and correspondingly maintain the true state within Ω*ρmin* in probability. Therefore, for any state *x*(*tk*) ∈ Ω*ρ*ˆ, with probability at least 1 − *δ*, the closed-loop state of the system of Equation (1) is bounded in Ω*ρ*ˆ at each sampling time, and is ultimately bounded within Ω*ρmin*. This completes the proof of Theorem 2.

**Remark 7.** *It is noted that in Theorem 2, the probability of closed-loop stability (i.e., at least* 1 − *δ) is derived for each sampling time, since the probability of the modeling error being bounded by γ*|*x*| *is at least* 1 − *δ for one sampling period only. It is difficult to compute the overall probability of closed-loop stability for the entire state trajectory because, given an initial state x*<sup>0</sup> ∈ Ω*ρ*ˆ*, we do not know beforehand how many time steps it will take to drive the state into* Ω*ρmin. Additionally, the actual probability of closed-loop stability for each time step could be higher than the lower bound* 1 − *δ for several reasons. For example, (1) the RNN model may be well trained, yielding a modeling error far below its upper bound, and (2) closed-loop stability may be unaffected if the next state does not leave* Ω*ρ*ˆ *even if the modeling error exceeds its upper bound during one sampling period. Therefore, the probability* 1 − *δ is conservative in many cases, and only provides a lower bound for the probability of closed-loop stability.*

#### **5. Application to a Chemical Process Example**

We use the same chemical process example as in [10] to illustrate the application of LMPC using RNN models. However, in this work, we primarily demonstrate the use of the generalization error bound framework to provide accuracy estimates in the development of RNN models for nonlinear dynamic processes. Specifically, we carry out five case studies to evaluate the relation between the RNN generalization error and a number of factors that impact performance, such as data sample size, RNN depth/width, and data time length. Additionally, after the RNN model is incorporated in the LMPC formulation, we demonstrate the closed-loop performance under RNN models developed with different data sample sizes and structures, and evaluate their probabilistic closed-loop stability properties. We consider a well-mixed, non-isothermal continuous stirred tank reactor (CSTR) with an irreversible second-order exothermic reaction in this example. The reaction transforms a reactant *A* to a product *B* (*A* → *B*), where *CA*0, *T*0 and *F* denote the inlet concentration of *A*, the inlet temperature and the feed volumetric flow rate of the reactor, respectively. A heating jacket is used to supply/remove heat to/from the CSTR at a rate *Q*. The CSTR dynamic model is represented by the following material and energy balance equations:

$$\begin{aligned} \frac{dC\_A}{dt} &= \frac{F}{V}(C\_{A0} - C\_A) - k\_0 e^{\frac{-E}{RT}} C\_A^2\\ \frac{dT}{dt} &= \frac{F}{V}(T\_0 - T) + \frac{-\Delta H}{\rho\_L C\_p} k\_0 e^{\frac{-E}{RT}} C\_A^2 + \frac{Q}{\rho\_L C\_p V} \end{aligned} \tag{74}$$

where *CA* and *T* are the concentration of reactant *A* and the temperature in the reactor, respectively. *Q* denotes the heat input rate, and *V* is the volume of the reacting liquid in the reactor. *F*, *T*0, and *CA*0 are the volumetric flow rate, the feed temperature and the feed concentration of reactant *A*, respectively. We assume that the reacting liquid has a constant density *ρL* and a heat capacity *Cp*. Δ*H*, *k*0, *E*, and *R* represent the enthalpy of reaction, the pre-exponential constant, the activation energy, and the ideal gas constant, respectively. The list of process parameter values can be found in [10].

The objective of the LMPC is to stabilize the CSTR at its unstable equilibrium point (*CAs*, *Ts*) = (1.95 kmol/m<sup>3</sup>, 402 K), corresponding to (*CA*0*s*, *Qs*) = (4 kmol/m<sup>3</sup>, 0 kJ/hr), by manipulating the inlet concentration of species *A* and the heat input rate. All the process states (*CA*, *T*) and manipulated inputs (*CA*0, *Q*) are represented in deviation variable form, i.e., Δ*CA*0 = *CA*0 − *CA*0*s*, Δ*Q* = *Q* − *Qs*, Δ*CA* = *CA* − *CAs*, and Δ*T* = *T* − *Ts*. To simplify the notation, we use *x<sup>T</sup>* = [Δ*CA* Δ*T*] and *u<sup>T</sup>* = [Δ*CA*0 Δ*Q*] to represent the CSTR states and inputs, respectively. By using deviation variables, the equilibrium point of the CSTR of Equation (74) is at the origin of the state space. The following positive definite matrix *P* is used to characterize the closed-loop stability region Ω*ρ*ˆ (i.e., a level set of the Lyapunov function *V*(*x*) = *x<sup>T</sup>Px*) with *ρ*ˆ = 368:

$$P = \left[ \begin{array}{cc} 1060 & 22\\ 22 & 0.52 \end{array} \right] \tag{75}$$

Additionally, the manipulated inputs are required to be bounded as |Δ*CA*0| ≤ 3.5 kmol/m<sup>3</sup> and |Δ*Q*| ≤ 5 × 10<sup>5</sup> kJ/hr to meet physical constraints. The integration of RNN models in MPC follows the method in [10,39]. Specifically, the RNN models are developed offline using Keras (version 2.4) [40], and then used to predict future states based on the state measurement at each sampling time in the real-time implementation of MPC. Then, the nonlinear optimization problem of the LMPC of Equations (68)–(73) is solved under the sampling period Δ = 10<sup>−2</sup> hr using PyIpopt, the Python module of the IPOPT software package (version 3.9.1) [41]. The dynamic model of Equation (74) is integrated numerically using the explicit Euler method with a sufficiently small integration time step of *hc* = 10<sup>−4</sup> hr.
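As a concrete illustration of the numerical integration described above, the following sketch implements the explicit Euler scheme for the CSTR of Equation (74) in deviation-variable form. The parameter values below are placeholders chosen for illustration only (the actual values are listed in [10]); only the steady state, sampling period, and integration step are taken from the text.

```python
import numpy as np

# Placeholder CSTR parameters (illustrative only; actual values are in [10]).
F, V, T0 = 5.0, 1.0, 300.0                 # flow [m^3/hr], volume [m^3], inlet temperature [K]
k0, E, R = 8.46e6, 5.0e4, 8.314            # pre-exponential, activation energy, gas constant
dH, rho_L, Cp = -1.15e4, 1000.0, 0.231     # reaction enthalpy, liquid density, heat capacity

CAs, Ts, CA0s, Qs = 1.95, 402.0, 4.0, 0.0  # unstable steady state given in the text

def cstr_rhs(x, u):
    """Right-hand side of Equation (74), written in deviation variables
    x = [CA - CAs, T - Ts], u = [CA0 - CA0s, Q - Qs]."""
    CA, T = x[0] + CAs, x[1] + Ts
    CA0, Q = u[0] + CA0s, u[1] + Qs
    r = k0 * np.exp(-E / (R * T)) * CA**2  # second-order Arrhenius reaction rate
    dCA = F / V * (CA0 - CA) - r
    dT = F / V * (T0 - T) - dH / (rho_L * Cp) * r + Q / (rho_L * Cp * V)
    return np.array([dCA, dT])

def integrate(x, u, delta=1e-2, hc=1e-4):
    """Explicit Euler over one sampling period Delta with step h_c = 1e-4 hr."""
    for _ in range(int(round(delta / hc))):
        x = x + hc * cstr_rhs(x, u)
    return x

# One sampling period forward from a perturbed state under zero deviation input.
x_next = integrate(np.array([0.2, -10.0]), np.array([0.0, 0.0]))
```

With the real parameter set from [10], the same loop reproduces the plant simulation used to generate training data and to evaluate the closed-loop trajectories.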

#### *5.1. RNN Generalization Performance*

In this section, we carry out a number of RNN trainings with different RNN structures and data samples to show the relation between RNN generalization performance and factors such as RNN input length, width, depth, weight bounds, and data sample size.

#### 5.1.1. Case Study 1: Data Sample Size

In the first case study, we trained RNN models using different data sample sizes. Specifically, we follow the data generation method in [10] to initially generate a large dataset from open-loop simulations of Equation (74) under various control actions *u* ∈ *U* and initial conditions within the stability region, i.e., *x*0 ∈ Ω*ρ*ˆ. The dataset consists of 200,000 time-series data samples, and is separated into 140,000 training, 30,000 validation, and 30,000 testing samples. The RNN models are developed by gradually increasing the training sample size, and are tested using unseen data from the testing dataset. It should be noted that only the data sample size is changed in this case study, while all the other parameters, such as the RNN structure (i.e., the number of layers, neurons, and other hyper-parameters) and the training algorithm, remain the same for all RNN models. The RNN models are developed with one hidden layer of 50 neurons, using mean squared error (MSE) as the loss function.
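The data generation step above can be sketched as follows: rejection-sample states uniformly inside the ellipse Ω*ρ*ˆ = {*x* : *x<sup>T</sup>Px* ≤ 368}, pair them with admissible inputs, and split the samples in the same 70/15/15 proportions as the 140,000/30,000/30,000 split. The sample count and the uniform sampling scheme here are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[1060.0, 22.0], [22.0, 0.52]])   # from Equation (75)
rho_hat = 368.0
u_bounds = np.array([3.5, 5e5])                # |dCA0| <= 3.5, |dQ| <= 5e5

def sample_states(n):
    """Rejection-sample n states uniformly inside the level set x^T P x <= rho_hat."""
    out = []
    # Bounding box of the ellipse: |x_i| <= sqrt(rho_hat * (P^-1)_ii).
    box = np.sqrt(rho_hat * np.diag(np.linalg.inv(P)))
    while len(out) < n:
        x = rng.uniform(-box, box)
        if x @ P @ x <= rho_hat:
            out.append(x)
    return np.array(out)

X = sample_states(2000)                                 # initial states x(t_k)
U = rng.uniform(-u_bounds, u_bounds, size=(2000, 2))    # admissible inputs
# 70/15/15 split, mirroring the 140,000/30,000/30,000 proportions in the text.
idx = rng.permutation(len(X))
train, val, test = np.split(idx, [int(0.7 * len(X)), int(0.85 * len(X))])
```

Each sampled (state, input) pair would then be propagated through the plant model to produce the one-step-ahead targets for RNN training.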

Figure 2 shows the variation of RNN training and testing performance with respect to the training sample size. In the top plot of Figure 2, it is observed that both the testing and training MSEs increase as the amount of training data decreases; in the bottom plot, we show the generalization gap $\mathbb{E}[g_t(\mathbf{x}, \mathbf{y})] - \frac{1}{m}\sum_{i=1}^{m} g_t(\mathbf{x}_i, \mathbf{y}_i)$ of Equation (47), where the expected error $\mathbb{E}[g_t(\mathbf{x}, \mathbf{y})]$ is approximated using the testing dataset. The trend in Figure 2 is consistent with the result in Theorem 1, which demonstrates that more training data is needed in order to obtain a lower generalization gap between the expected loss and the training loss. Additionally, it is noticed that when the training sample size is greater than 3000, both training and testing MSEs approach zero, and no significant improvement is observed for the models using more training data. The trend in Figure 2 also follows the relation between generalization error and data sample size in Equation (47), i.e., the generalization gap is roughly proportional to $\frac{1}{\sqrt{m}}$, which shows that the gap initially decreases quickly as the sample size *m* increases from zero, and changes slowly when *m* becomes large.

**Figure 2.** RNN generalization performance vs. training sample size.
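The $1/\sqrt{m}$ behavior of the generalization gap discussed above can be checked numerically. The sketch below fits the constant *c* in gap ≈ *c*/√*m* by least squares; the gap values are synthetic numbers that follow the trend of Equation (47), not the measured gaps of Figure 2.

```python
import numpy as np

# Hypothetical (sample size, generalization gap) pairs shaped like the trend of Eq. (47).
m = np.array([50, 100, 500, 1000, 3000, 6000, 14000], dtype=float)
gap = 0.9 / np.sqrt(m)          # synthetic data: gap = c / sqrt(m) with c = 0.9

# Least-squares fit of gap = c / sqrt(m): c = <gap, phi> / <phi, phi> with phi = 1/sqrt(m).
phi = 1.0 / np.sqrt(m)
c = (gap @ phi) / (phi @ phi)
pred = c * phi                  # fitted curve, exact here since the data is noiseless
```

Fitting the same one-parameter curve to measured training/testing gaps is a quick way to check whether an experiment is in the regime where Theorem 1's bound is informative.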

#### 5.1.2. Case Study 2: RNN Depth and Width

In the second case study, we train RNN models with various depths and widths. A total of 1400 training, 300 validation, and 300 testing samples are used for all models. We first develop RNN models by fixing the network depth at one hidden layer and increasing the number of neurons. As shown in Figure 3, both training and testing errors decrease as the network width increases up to 250 neurons. However, as more neurons are added (i.e., 270 and 280 neurons in Figure 3), the testing MSE increases while the training MSE remains close to zero throughout, which implies that overfitting has occurred during training. As a result, the generalization gap in Figure 3 shows a similar pattern: it decreases initially and increases again when a large number of neurons is used. While theoretically the expected error of Equation (47) does not explicitly depend on the network width, the results in Figure 3 are consistent with the fact that increasing the capacity of a model by adding more layers and/or more nodes per layer can improve the network's learnability, but may also lead to overfitting.

Subsequently, we train RNN models by increasing the number of layers while fixing five neurons per layer. Figure 4 shows that the testing MSE starts at around 0.02 for one hidden layer, gradually decreases with more layers, and finally increases again as the neural network becomes deeper. Meanwhile, the training MSE remains close to zero at the beginning, yet also slightly increases as the number of hidden layers increases. From Figure 4, it is concluded that one hidden layer is not sufficient to learn the process dynamics well, and with two, three, and four layers, the RNN models achieve the best training and generalization performance among all the models. Similar to Figure 3, the increase of the generalization gap in Figure 4 implies that deeper RNN models overfit the training data. Additionally, it is interesting to notice that the training error slightly increases in deeper networks. While in general the generalization performance deteriorates and the training error remains unaffected when increasing the capacity of a model, the worse training performance in Figure 4 is actually common in neural network development due to the difficulty of training deep networks. Specifically, the optimization problem of neural network training is highly non-convex, and may get stuck at a local minimum as the network becomes deeper. This was noticed during the training of the RNN models in Figure 4, where both the training and validation losses exhibit a sharp increase at a certain epoch and then remain stuck around that point until the end of training. Additionally, with more hidden layers, the number of parameters to be trained grows, which could lead to poor training performance without careful tuning of the other hyperparameters.

**Figure 3.** RNN generalization performance vs. RNN width (One hidden layer with increasing number of neurons).

**Figure 4.** RNN generalization performance vs. RNN depth (Increasing the number of hidden layers and fixing 5 neurons for each layer).

**Remark 8.** *At first glance, the generalization error trend in Figures 3 and 4 seems to contradict the result in Equation (47), which shows that the generalization error bound is proportional to the complexity of the RNN hypothesis class. However, it should be noted that Equation (47) only gives an upper bound for the generalization error of RNN models from the hypothesis class; it does not mean that all RNN models from the hypothesis class have a generalization error as large as its upper bound. From the error decomposition of Equation (17), which shows the interplay between approximation and estimation errors, we have learned that as we enlarge the hypothesis class, the approximation error decreases, but the estimation error may increase. In this case study, by increasing the complexity of the RNN hypothesis class in terms of more layers and neurons, the generalization performance improves overall; however, as the RNN models become deeper, overfitting also occurs due to a large estimation error. Therefore, in practice, we can perform a grid search, as in Figures 3 and 4, to determine the optimal number of layers and neurons.*

#### 5.1.3. Case Study 3: Different Regions in Ω*ρ*ˆ

As discussed in Remark 6, to meet the modeling error constraint *EM* ≤ *γ*|*x*|, ∀*x* ∈ Ω*ρ*ˆ, more data is needed as the state approaches the origin, i.e., *x* → 0. Equivalently, under the same data density for different regions within the stability region Ω*ρ*ˆ, a larger constant *γ* is needed to bound the modeling error *EM* ≤ *γ*|*x*| for states close to the origin. Therefore, in this case study, we develop multiple RNN models for different regions inside Ω*ρ*ˆ with the same data density, and demonstrate the variation of generalization performance. Specifically, we choose nine level sets of the Lyapunov function, $\Omega_{\rho_i} := \{x \in \mathbf{R}^n \mid \hat{V}(x) \leq \rho_i\}$, *i* = 0, ..., 8, within Ω*ρ*ˆ, with *ρ*ˆ = 368 and *ρi* = [40, 88, 115, 138, 159, 177, 195, 213, 244]. For example, the first RNN model (model 0 with *ρ*0) is developed and tested using the data within Ω*ρ*0; the second RNN model (model 1 with *ρ*1) uses the data between Ω*ρ*0 and Ω*ρ*1, and so on. Figure 5 shows a schematic of the training regions considered for the CSTR of Equation (74), where *xs* is the steady state and Ω*ρ*ˆ is the stability region. The training datasets are generated for each region (i.e., the elliptical annuli in Figure 5) with the same data density, where the data density is defined as the ratio of sample size to the area of each elliptical annulus. As before, only the training region is varied in this case study, while all the other parameters remain the same. The RNN models are developed with one hidden layer of 20 neurons, using MSE as the loss function.

**Figure 5.** Schematic of different regions inside Ω*ρ*ˆ.
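A minimal sketch of the region partitioning in Figure 5: assign each state to the elliptical annulus between consecutive level sets of *V*(*x*) = *x<sup>T</sup>Px*, and size each region's dataset for a fixed data density using the annulus areas. The density value below is hypothetical.

```python
import numpy as np

P = np.array([[1060.0, 22.0], [22.0, 0.52]])
rho = [40, 88, 115, 138, 159, 177, 195, 213, 244]   # level sets from the text

def region_index(x):
    """Return i such that x lies in the annulus between Omega_{rho_{i-1}} and
    Omega_{rho_i} (region 0 is the innermost ellipse); None if outside all."""
    v = x @ P @ x                                   # Lyapunov function V(x) = x^T P x
    for i, r in enumerate(rho):
        if v <= r:
            return i
    return None

# Area of the level set {x : x^T P x <= rho_i} is pi * rho_i / sqrt(det P),
# so each annulus area is proportional to the level-set increment.
areas = np.pi * np.diff([0.0] + rho) / np.sqrt(np.linalg.det(P))
density = 1000.0                                    # hypothetical samples per unit area
sizes = np.round(density * areas).astype(int)       # equal-density sample counts
```

For example, `region_index` places a state with *V*(*x*) = 95.4 in region 2, between the ρ = 88 and ρ = 115 level sets.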

To compute the modeling error $|F(x,u) - F_{nn}(x,u)| = |\frac{dx}{dt} - \frac{d\hat{x}}{dt}|$, where *x* and *x*ˆ denote the true state and the predicted state, respectively, we carry out the prediction for one integration time step and use the finite difference method to approximate the derivatives following Equation (56). Specifically, we first calculate the training and testing mean absolute errors (MAE) and divide them by the integration time step *hc*, i.e., $|\frac{x(t+h_c) - \hat{x}(t+h_c)}{h_c}|$. Subsequently, to obtain an approximate value of *γ* for each model, i.e., $\frac{E_M}{|x|} \leq \gamma$, we divide those MAEs by the maximum value of |*x*| in each elliptical annulus in Figure 5. Figure 6 shows the training and testing errors for the RNN models trained for different regions inside Ω*ρ*ˆ. It is observed that under the same data density, the models trained for the regions close to the origin (i.e., models 0, 1, and 2 for Ω*ρ*0, Ω*ρ*1 and Ω*ρ*2) produce larger generalization gaps. This implies that a larger *γ*, or equivalently, more data, is needed to meet the constraint *EM* ≤ *γ*|*x*| for *x* in these regions. Additionally, it is observed that the generalization gap settles at around 2 × 10<sup>−5</sup> from model 4 onward, because those RNN models have achieved the best they can under the current neural network training settings and data density.

**Figure 6.** RNN generalization performance vs. different regions in Ω*ρ*ˆ.
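The finite-difference estimate of *γ* described above can be sketched as follows, assuming arrays of true and predicted one-step-ahead states are available for a region; the example inputs are synthetic.

```python
import numpy as np

def estimate_gamma(x_true, x_pred, x_region, hc=1e-4):
    """Approximate the modeling-error constant gamma for one region:
    E_M ~ |x(t+hc) - x_hat(t+hc)| / hc (finite difference of the derivatives),
    then gamma ~ max per-state MAE divided by max |x| over the region.
    x_true, x_pred: (n_samples, n_x) states one integration step ahead.
    x_region: (n_points, n_x) states spanning the annulus (hypothetical input)."""
    mae = np.mean(np.abs(x_true - x_pred), axis=0) / hc   # per-state MAE / h_c
    max_norm = np.max(np.linalg.norm(x_region, axis=1))   # max |x| in the region
    return np.max(mae) / max_norm

# Synthetic example: a uniform 1e-5 one-step prediction error, region radius 2.
x_true = np.array([[1.0, 0.0], [0.0, 1.0]])
x_pred = x_true - 1e-5
g = estimate_gamma(x_true, x_pred, np.array([[2.0, 0.0]]))
```

Dividing by the region's maximum |*x*| is why the inner annuli, where |*x*| is small, demand a larger *γ* for the same absolute error.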

#### 5.1.4. Case Study 4: Weight Matrix Bound

From Equation (48), it is seen that the generalization gap also depends on the weight matrix bound. To evaluate the relation between generalization performance and the weight matrix bound, in this case study we train RNN models with different weight matrix bounds. Specifically, we impose an upper bound constraint on each element of the RNN weight matrices, using the following bound values: [0.8, 1.3, 1.8, 2.5, 3.0, 3.4, 3.9, 4.3].

The Frobenius norms of all the weight matrices are therefore also bounded. The training and testing errors are calculated following the approach in Case study 1, and are shown in Figure 7. It is observed that as the weight matrix bound becomes larger, the generalization gap gradually increases and settles at around 8 × 10<sup>−4</sup>. This behavior implies that the RNN model overfits when trained with a large weight bound. The reason for the trend in Figure 7 is similar to that for Case study 2: as the size of the neural network hypothesis class grows with increasing weight bounds, it becomes easier to find a hypothesis that fits the training data well, but this can also lead to a large testing error (i.e., overfitting).

**Figure 7.** RNN generalization performance vs. weight matrix bound.
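The element-wise weight bound of Case study 4 can be illustrated with projected gradient descent on a toy linear regression: after each update, the weights are clipped to the box |*w*| ≤ bound, which in turn bounds the Frobenius norm. This is a simple stand-in for the constrained RNN training, not the authors' training code; the data and bounds are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_clipped(X, y, bound, lr=0.1, epochs=200):
    """Least-squares regression by gradient descent with an element-wise
    weight bound |w_i| <= bound, enforced by projection after each step."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = np.clip(w - lr * grad, -bound, bound)   # project onto the box
    return w

X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -0.5])            # true weights; first one exceeds a 0.8 bound
w_tight = train_clipped(X, y, bound=0.8) # constrained fit, stuck at the box boundary
w_loose = train_clipped(X, y, bound=4.3) # loose bound recovers the true weights
```

A tight bound restricts the hypothesis class (higher training error, smaller gap), while a loose bound recovers the unconstrained fit, mirroring the trade-off in Figure 7.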

#### 5.1.5. Case Study 5: RNN Input Length

Lastly, we study the dependency of the RNN generalization error on the input time length *t* according to Equation (47). If we unfold a vanilla RNN over time to form a multi-layer feedforward neural network, this relation can also be interpreted as a deep feedforward neural network having a large generalization error. In this example, we train RNN models with the following input time lengths: *t* = 10<sup>−3</sup> × [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] hr.

Figure 8 shows the training and testing errors for the different time lengths. Specifically, as the RNN input time length increases, the training error remains at a very low level for all models, but the testing error gradually increases and finally settles at around 6 × 10<sup>−3</sup>. It is concluded from Figure 8 that a shorter input sequence yields better generalization performance, which is consistent with the theoretical result in Equation (47). However, it should be noted that a shorter input sequence does not necessarily yield better prediction in the MPC formulation because, as discussed in Theorem 2, in order to predict future states over a long prediction horizon, the RNN prediction needs to be executed successively, which inevitably accumulates error during the calculations. Therefore, when used in MPC, the RNN input length should be chosen carefully to account for the MPC prediction horizon while simultaneously maintaining the desired generalization performance.

**Remark 9.** *A small training dataset was chosen in Case studies 2–5 for demonstration purposes. Specifically, it was demonstrated in Case study 1 that with more than 3000 data samples, both training and testing errors are rendered sufficiently small. Therefore, to better demonstrate the relation between the RNN generalization error bound and RNN depth/width and data time length in the other case studies, we chose a small training dataset such that significant differences can be observed when varying RNN depths, widths, and time sequence lengths. However, it is noted that in practice, the sample size and all the other factors studied in this manuscript should be carefully chosen in order to improve the RNN generalization performance.*

**Figure 8.** RNN generalization performance vs. input time length.

#### *5.2. Closed-Loop Performance Analysis*

In this section, we carry out closed-loop simulations of the CSTR under the LMPC of Equations (68)–(73) using the different RNN models derived from the previous case studies. Additionally, we demonstrate the probabilistic closed-loop stability properties of RNN-based LMPC through extensive closed-loop simulations for the CSTR of Equation (74) with different initial conditions.

Figures 9–12 show the simulation results using 48 different initial conditions within Ω*ρ*ˆ for a few RNN models trained in Case study 1. Specifically, we first discretize the stability region Ω*ρ*ˆ and choose 48 initial conditions *x*0 ∈ Ω*ρ*ˆ that are evenly spread within the stability region. Then, we run closed-loop simulations for all initial conditions using the following settings: (1) the whole simulation period *tp* is twenty sampling periods (i.e., 20 × 0.01 = 0.2 hr), (2) the stability region Ω*ρ*ˆ and the terminal region Ω*ρmin* are characterized by *ρ*ˆ = 368 and *ρmin* = 2, respectively, and (3) the simulations are carried out on the UCLA Hoffman2 cluster, and the optimization problem is solved using the Python module of the IPOPT software package (i.e., PyIpopt). After obtaining the closed-loop profiles for each initial condition, the following policies are used to determine whether the closed-loop system is stable. Specifically, the closed-loop system is considered unstable if (1) the closed-loop state leaves the stability region Ω*ρ*ˆ at any point during the simulation, or (2) the closed-loop state remains inside Ω*ρ*ˆ but stays outside of Ω*ρmin* until the end of the simulation, or leaves Ω*ρmin* after entering it for the first time.
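The two stability policies above can be written as a small classifier over a simulated closed-loop trajectory; the matrix and level-set values are those of Equation (75) and the settings listed above, while the example trajectories are synthetic.

```python
import numpy as np

P = np.array([[1060.0, 22.0], [22.0, 0.52]])
rho_hat, rho_min = 368.0, 2.0

def is_closed_loop_stable(traj):
    """Apply the two policies from the text to a trajectory of shape (n_steps, 2):
    unstable if the state ever leaves Omega_rho_hat (policy 1), or fails to
    enter Omega_rho_min and stay there after first entry (policy 2)."""
    V = np.einsum('ti,ij,tj->t', traj, P, traj)   # V(x) = x^T P x along the trajectory
    if np.any(V > rho_hat):
        return False                              # policy (1): left the stability region
    entered = np.flatnonzero(V <= rho_min)
    if entered.size == 0:
        return False                              # policy (2): never reached Omega_rho_min
    return bool(np.all(V[entered[0]:] <= rho_min))  # policy (2): must stay after entry

# Synthetic examples: a converging trajectory and one that escapes Omega_rho_hat.
good = np.array([[0.5, 0.0], [0.05, 0.0], [0.01, 0.0], [0.01, 0.0]])
bad = np.array([[0.5, 0.0], [0.7, 0.0]])
```

Counting `True` outcomes over the 48 initial conditions gives the stability probabilities plotted in Figures 9, 13, 17, 19 and 21.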

Figure 9 shows the probability of closed-loop stability calculated following the above policies. It is seen that with more training data, the probability of the CSTR of Equation (74) being stabilized at its steady state becomes higher, and the probability settles at around 0.78 for a sufficiently large dataset. The probability results in Figure 9 for the RNN models in Case study 1 are consistent with the generalization performance plot in Figure 2, which shows that the generalization error decreases with more training data. In addition to the calculation of the probability of closed-loop stability, we also use the MPC cost function of Equation (68) as an indicator for comparing control performance in terms of convergence speed and energy consumption. Specifically, the MPC cost function of Equation (68) in this example is designed in the following form:

$$L\_{MPC}(x, u) = x^T P x + u^T Q u \tag{76}$$

where *P* = [1000 0; 0 1] and *Q* = [1 0; 0 3 × 10<sup>−10</sup>] are chosen such that the two states and the two inputs are each of the same order of magnitude. Also, in this example, we put more penalty on the states *x* to allow the states to be driven to the steady state more quickly. For each RNN model, we calculate the total cost $\int_0^{t_p} L_{MPC}(x, u)\,dt$ over the entire simulation period *tp* = 0.2 hr, and sum the cost values over all the trajectories initiated from the 48 different initial conditions. Figure 10 shows the MPC total costs for the RNN models trained with different data sample sizes. It is demonstrated that with less training data, the MPC incurs a higher total cost, representing a slower convergence to the steady state and/or a higher energy consumption. With a large amount of training data (i.e., ≥ 6000 samples), the MPC total costs remain at around 1420, and no significant improvement is noticed with more data added in training. Additionally, Figures 11 and 12 show the closed-loop state trajectory and state profiles for one of the 48 initial conditions. As shown in Figure 11, the state trajectory using the RNN model trained with 50 training samples (dashed line) leaves the stability region due to poor predictions in solving the MPC optimization problem. On the contrary, the state trajectory using the RNN model with 14,000 training samples (solid line) moves towards the steady state smoothly and is ultimately bounded in the terminal set Ω*ρmin*. This can also be seen in the closed-loop state profiles of Figure 12, where the temperature under 50 training samples shows a sharp increase at 0.03 hr.
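The total-cost computation can be sketched as a Riemann sum of the stage cost of Equation (76) over sampled closed-loop profiles, assuming piecewise-constant state and input samples at each sampling period:

```python
import numpy as np

# Cost weights of Equation (76); named *_cost to avoid clashing with the Lyapunov P.
P_cost = np.array([[1000.0, 0.0], [0.0, 1.0]])
Q_cost = np.array([[1.0, 0.0], [0.0, 3e-10]])

def total_cost(xs, us, dt=1e-2):
    """Riemann-sum approximation of the integral of L_MPC(x, u) = x^T P x + u^T Q u
    over the simulation period, for sampled profiles xs, us of shape (n_steps, 2)."""
    stage = np.einsum('ti,ij,tj->t', xs, P_cost, xs) \
          + np.einsum('ti,ij,tj->t', us, Q_cost, us)
    return float(np.sum(stage) * dt)

# Synthetic check: constant state x = [1, 1] over 20 steps with zero input gives
# 20 * (1000 + 1) * 0.01 = 200.2.
cost = total_cost(np.ones((20, 2)), np.zeros((20, 2)))
```

Summing this quantity over the 48 trajectories reproduces the total-cost comparison plotted in Figures 10, 14, 18, 20 and 22.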

**Figure 9.** Probability of closed-loop stability vs. training sample sizes.

**Figure 10.** MPC total costs vs. training sample size.

**Figure 11.** Closed-loop state trajectory under LMPC using two RNN models trained with different data sample sizes.

**Figure 12.** Closed-loop state profiles under LMPC using two RNN models trained with different data sample sizes.

Similar to the analysis for Case study 1, Figures 13–16 show the probability of closed-loop stability, the MPC total costs, as well as the state-space trajectory and state profiles for one of the initial conditions for the RNN models in Case study 2. In Figure 13, it is shown that the probability starts from 0.5, and settles at around 0.7 for wider RNN models (i.e., more neurons). Figure 14 shows the MPC total costs for the different models, from which it is demonstrated that the first model, with only 5 neurons, has an extremely high value, while all the other models achieve a total cost of around 1500. Figures 13 and 14 demonstrate that all the RNN models except the first one achieve the desired closed-loop performance in terms of a high probability of closed-loop stability and low total costs. This is due to the low generalization error (around 0.005) for nearly all the models in Figure 3. Figure 15 shows the comparison of the closed-loop state trajectories under the two RNN models using 5 and 350 neurons, respectively, from which it is demonstrated that the model with 5 neurons (dashed line) drives the state out of the stability region, while the one with 350 neurons successfully stabilizes the system in the terminal set. The corresponding state profiles (i.e., *CA* − *CAs* and *T* − *Ts*) can be found in Figure 16.

**Figure 13.** Probability of closed-loop stability vs. RNN width.

**Figure 14.** MPC total costs vs. RNN width.

**Figure 15.** Closed-loop state trajectory under LMPC using two RNN models trained with different widths.

**Figure 16.** Closed-loop state profiles under LMPC using two RNN models trained with different widths.

To simplify the discussion for the remaining case studies, we show the probability plot and the MPC total cost plot only. Figure 17 shows the probability of closed-loop stability with respect to different RNN depths. It is demonstrated that the probability starts close to zero for one layer, increases up to 0.7 for four layers, and then decreases to almost zero for six layers and beyond. This trend follows exactly the generalization error plot in Figure 4, which shows that the models with two, three and four layers achieve the lowest generalization error, and the models with more than five layers show worse generalization performance due to overfitting. Compared to the closed-loop results for the RNNs with various widths in Figures 13 and 14, it is not surprising that the overall probability of closed-loop stability in this case study is worse, because the open-loop generalization performance for the RNNs developed with different depths (Figure 4) is worse than that for the RNNs developed with different widths (Figure 3). Additionally, in Figure 18, we observe a similar pattern showing that the MPC total costs have the lowest values for two, three and four layers, and rise for more layers.

**Figure 17.** Probability of closed-loop stability vs. RNN depth.

**Figure 18.** MPC total costs vs. RNN depth.

Closed-loop simulations for Case study 3 of different regions in Ω*ρ*<sup>ˆ</sup> are not carried out in this work, since the MPC formulation of Equations (68)–(73) only uses a single RNN model for prediction. Additionally, it is demonstrated from previous case studies that a single RNN model is sufficient to capture the process dynamics in the stability region, and therefore, there is no need to use different RNN models for different regions in Ω*ρ*<sup>ˆ</sup> from the control perspective.

Figure 19 shows the probability of closed-loop stability for the RNN models with different weight matrix bounds in Case study 4. It is shown that all the RNN models achieve a probability up to 0.7. The high probability of closed-loop stability is expected, since the open-loop generalization error plot in Figure 7 shows that all the models with different weight matrix bounds have a sufficiently small generalization error of around 8 × 10<sup>−4</sup>. As a result, the MPC total costs in Figure 20 are stable at around 1000 for all models.

**Figure 19.** Probability of closed-loop stability vs. weight matrix bound.

**Figure 20.** MPC total costs vs. weight matrix bound.

Lastly, Figures 21 and 22 show the closed-loop simulation results for Case study 5. As shown in Figure 21, the probability of closed-loop stability increases as the RNN input time length increases, and settles at around 0.9 for input time length greater than 6 <sup>×</sup> <sup>10</sup>−<sup>3</sup> hr. This seems inconsistent with the generalization performance of Figure 8 which shows the generalization error increases for longer input sequences at first glance. However, as we have discussed earlier, a low open-loop generalization error for short input sequences does not guarantee a desired closed-loop performance under MPC. Specifically, with shorter input sequences, the RNN prediction needs to be executed successively in each MPC iteration to predict all the future states within the prediction horizon. For example, in order to predict one sampling time <sup>Δ</sup> <sup>=</sup> <sup>10</sup>−<sup>2</sup> hr, the first RNN model with 1 <sup>×</sup> <sup>10</sup>−<sup>3</sup> input length in Figure 21 needs to run 10 times, and each time uses the previous predicted state as the initial state. The error accumulates during the calculation, which ultimately leads to poorer closed-loop performance. Therefore, for RNN models used in MPC, the input time length should be chosen carefully accounting for the system sampling time and MPC prediction horizon. Additionally, Figure 22 shows the MPC total costs with respect to different RNN input time lengths. It is seen that the first RNN model achieves the worst cost value, and all the other models have similar cost values around 2000. Through the closed-loop simulation of all the case studies investigated in the previous section, we demonstrate that the closedloop performance is consistent with the open-loop generalization performance in the way that lower generalization errors typically leads to higher probability of closed-loop stability and lower MPC total costs. 
Therefore, the generalization error bound proposed in this work provides an efficient method for choosing neural network structure and data sample size to meet the closed-loop stability requirements.
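The error accumulation from recursive one-step prediction can be illustrated with a toy sketch (the linear dynamics, per-step error magnitude, and function names below are illustrative assumptions, not the paper's RNN or reactor model): with input length 10<sup>−3</sup> hr and sampling time Δ = 10<sup>−2</sup> hr, one sampling period requires 10 recursive calls.

```python
# Toy illustration (not the paper's model): a one-step predictor with a small
# additive model error, applied recursively so that 10 calls cover one
# sampling period Delta = 1e-2 hr when the input length is 1e-3 hr.

def one_step_predict(x, per_step_error=0.01):
    """Stand-in for an RNN one-step prediction with additive model error."""
    true_next = 0.95 * x  # hypothetical linear dynamics
    return true_next + per_step_error

def rollout(x0, n_steps):
    """Feed each prediction back as the next initial state."""
    x = x0
    for _ in range(n_steps):
        x = one_step_predict(x)
    return x

x_true = 0.95 ** 10 * 1.0   # true state after one sampling period
x_pred = rollout(1.0, 10)   # recursive prediction over 10 sub-steps
print(abs(x_pred - x_true)) # accumulated error, roughly 8x the per-step error
```

The accumulated error after 10 recursive calls is several times the single-step error, mirroring the degraded closed-loop performance of the short-input-length model.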

**Figure 21.** Probability of closed-loop stability vs. input time length.

**Figure 22.** MPC total costs vs. input time length.

**Remark 10.** *The RNN models are trained offline, and the RNN-based MPC is solved in real time with new state measurements available at each sampling time. The averaged computation time for solving RNN-based MPC per sampling step is around 10 s, which is less than one sampling period* Δ = 0.01 *hr* = 36 *s in this example. Therefore, the RNN-based MPC scheme can be implemented in real time without any computational issues.*

#### **6. Conclusions**

In this work, we developed a probabilistic generalization error bound for RNN models by taking advantage of the Rademacher complexity method for vector-valued functions. The RNN models were incorporated in the design of MPC, and probabilistic closed-loop stability properties were derived based on the RNN generalization error bounds. A number of case studies were simulated using a nonlinear chemical reactor example to demonstrate the impact of training sample size, the number of neurons and layers, the regions where the data were generated, and the input time length on the RNN generalization performance. Closed-loop simulations were carried out to further demonstrate the probabilistic closed-loop stability properties derived for the RNN-based LMPC.

**Author Contributions:** Z.W. developed the main results, performed the simulation studies and prepared the initial draft of the paper. D.R. contributed to the simulation studies in this manuscript. Q.G. and P.D.C. developed the idea of RNN generalization error, oversaw all aspects of the research and revised this manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare that they have no conflict of interest regarding the publication of the research article.

#### **References**


## *Article* **NICE: Noise Injection and Clamping Estimation for Neural Network Quantization**

**Chaim Baskin 1,\*,†, Evgenii Zheltonozhkii 1,†, Tal Rozen 2,†, Natan Liss 2, Yoav Chai 3, Eli Schwartz 3, Raja Giryes 3, Alexander M. Bronstein 1 and Avi Mendelson 1**


**Abstract:** Convolutional Neural Networks (CNNs) are very popular in many fields including computer vision, speech recognition, natural language processing, etc. Though deep learning leads to groundbreaking performance in those domains, the networks used are very computationally demanding and are far from being able to perform in real-time applications even on a GPU, which is not power efficient and therefore does not suit low power systems such as mobile devices. To overcome this challenge, some solutions have been proposed for quantizing the weights and activations of these networks, which accelerate the runtime significantly. Yet, this acceleration comes at the cost of a larger error unless spatial adjustments are carried out. The method proposed in this work trains quantized neural networks by noise injection and a learned clamping, which improve accuracy. This leads to state-of-the-art results on various regression and classification tasks, e.g., ImageNet classification with architectures such as ResNet-18/34/50 with as low as 3 bit weights and activations. We implement the proposed solution on an FPGA to demonstrate its applicability for low-power real-time applications. The quantization code will become publicly available upon acceptance.

**Keywords:** neural networks; low power; quantization; CNN architecture

#### **1. Introduction**

Deep neural networks are important tools in the machine learning arsenal. They have shown spectacular success in a variety of tasks in a broad range of fields such as computer vision, computational and medical imaging, signal, image, speech, and language processing [1–3].

However, while deep learning models' performance is impressive, the computational and storage requirements of both training and inference are harsh. For example, ResNet-50 [4], a popular choice for image detection, has 98 MB of parameters and requires 4 GFLOPs of computation for a single inference. Common devices do not have such resources, which makes deep learning infeasible, especially on low-power devices such as smartphones and Internet of Things (IoT) devices.

In an attempt to solve these problems, many researchers have recently proposed less demanding models, often at the expense of more complicated training procedures. Since training is usually performed on servers with significantly larger resources, this is usually an acceptable trade-off. Some methods include pruning weights and feature maps, which reduces the model's memory footprint and compute requirements [5,6]; low-rank decomposition, which removes the redundancy of parameters and feature maps [7,8]; and efficient architecture design, which requires less communication and is more feasible to deploy [9,10].

**Citation:** Baskin, C.; Zheltonozhkii, E.; Rozen, T.; Liss, N.; Chai, Y.; Schwartz, E.; Giryes, R.; Bronstein, A.M.; Mendelson, A. NICE: Noise Injection and Clamping Estimation for Neural Network Quantization. *Mathematics* **2021**, *9*, 2144. https:// doi.org/10.3390/math9172144

Academic Editor: Alessandro Niccolai

Received: 12 August 2021 Accepted: 31 August 2021 Published: 2 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

One prominent approach is to quantize the networks. This approach reduces the size of memory needed to keep a large number of parameters while also reducing the computation resources. The default choice for the data type of the neural networks' weights and feature maps (activations) is 32 bit (single-precision) floating point. Gupta et al. [11] have shown that quantizing the pre-trained weights to a 16 bit fixed point has almost no effect on the accuracy of the networks. Moreover, minor modifications allow performing an integer-only 8 bit inference with reasonable performance degradation [12], which is utilized in DL frameworks, such as TensorFlow. One of the current challenges in network quantization is reducing the precision even further, up to 1–5 bits per value. In this case, straightforward techniques may result in unacceptable quality degradation.

**Contribution.** This paper introduces NICE (noise injection and clamping estimation), a novel, simple approach to neural network quantization that relies on the following two easy-to-implement components: (i) noise injection during training that emulates the quantization noise introduced at inference time and (ii) statistics-based initialization of parameter and activation clamping for faster model convergence. In addition, the activation clamp is learned during training. We also propose an integer-only scheme for an FPGA on a regression task [13].

Our proposed strategy for network training leads to an improvement over state-of-the-art quantization techniques in the performance vs. complexity trade-off. Our approach can be applied directly to existing architectures without the need to modify them at training time (as opposed, for example, to the teacher–student approaches [14] that require training a bigger network, or the XNOR networks [15] that typically increase the number of parameters by a significant factor in order to meet accuracy goals).

Moreover, our new technique allows quantizing all the parameters in the network to fixed point (integer) values, including the batch-norm component that is usually left unquantized in other works. Thus, our proposed solution makes the integration of neural networks into dedicated hardware devices such as FPGAs and ASICs easier. As a proof of concept, we also present a case study of such a hardware implementation.

#### **2. Related Work**

**Expressiveness-based methods.** The quantization of neural networks to extremely low-precision representations (down to 2 or 3 possible values) has been actively studied in recent years [15–18]. To overcome the accuracy reduction, some works proposed to use a wider network [14,19,20], which compensates for the expressiveness reduction of the quantized networks; for example, 32 bit feature maps were regarded as 32 binary ones. Another way to improve expressiveness, adopted by Zhu et al. [19] and Zhou et al. [21], is to add a linear scaling layer after each of the quantized layers.

**Keeping a full-precision copy of quantized weights.** Lately, the most common approach to training a quantized neural network [15,16,22–24] is to keep two sets of weights: the forward pass is performed with the quantized weights, while updates are performed on the full-precision ones, i.e., the gradients are approximated with the straight-through estimator (STE) [25]. For quantizing the parameters, either a stochastic or a deterministic function can be used.

**Distillation.** One of the leading approaches used today for quantization relies on the idea of distillation [26]. In distillation a teacher–student setup is used, where the teacher is either the same or a larger full precision neural network, and the student is the quantized one. The student network is trained to imitate the output of the teacher network. This strategy is successfully used to boost the performance of existing quantization methods [14,27,28].

**Model parametrization.** Zhang et al. [18] proposed to represent the parameters with learned basis vectors that allow acquiring an optimized non-uniform representation. In this case, MAC operations can be computed with bitwise operations. Choi et al. [29] proposed to learn the clamping value of the activations to find the balance between clamping and quantization errors. In this work, we also learn this value, but with the difference that we learn the clamping value directly using the STE backpropagation method, without any regularization on the loss. Jung et al. [28] created a more complex parameterization of both weights and activations and approximated them with a symmetric piecewise linear function, learning both the domains and the parameters directly from the loss function of the network.

**Optimization techniques.** Zhou et al. [21] and Dong et al. [30] used the idea of not quantizing all the weights simultaneously but rather gradually increasing the number of quantized weights to improve the convergence. McKinstry et al. [31] demonstrated that 4 bit fully integer neural networks can achieve full-precision performance by applying simple techniques to combat variance of gradients: larger batches and proper learning rate annealing with longer training time. However, 8 bit and 32 bit integer representations were used for the multiplicative (i.e., batch normalization) and additive constants (biases), respectively.

**Generalization bounds.** Interestingly, the quantization of neural networks has been used recently as a theoretical tool to understand better the generalization of neural networks. It has been shown that while the generalization error does not scale with the number of parameters in over-parameterized networks, it does so when these networks are being quantized [32].

**Hardware implementation complexity.** While the quantization of CNN parameters leads to a reduction in power and area, it can also cause unexpected changes in the balance between communication and computation. Karbachevsky et al. [33] studied the impact of CNN quantization on the hardware implementation of computational resources; building on the research conducted in Baskin et al. [34], they proposed a computation and communication analysis for quantized CNNs.

#### **3. Method**

In this work, we propose a training scheme for quantized neural networks designed for fast inference on hardware with integer-only arithmetic. To achieve maximum performance, we applied a combination of several well-known and novel techniques. Firstly, in order to emulate the effect of quantization, we injected additive random noise into the network weights. A uniform noise distribution is known to approximate the quantization error well for fine quantizers; our experiments show that it is also suitable for relatively coarse quantization. As seen in Figure 1, the distribution of the noise is almost uniform for 4 and 5 bits and only starts to deviate from the uniform model at 3 bits, which corresponds to only 8 bins.

**Figure 1.** Weight quantization error histogram for a range of bitwidths.

Furthermore, some amount of random weight perturbation seems to have a regularization effect beneficial for the overall convergence of the training algorithm. Secondly, we used a gradual training scheme to limit the amount of parameter perturbation applied simultaneously. In order to give the quantized layers as many gradient updates as possible, we used the STE approach to pass gradients to the quantized layers. After the gradual phase, the whole network was quantized and trained for a number of fine-tuning epochs. Thirdly, we propose to clamp both the activations and the weights in order to reduce the quantization bin size (and, thus, the quantization error) at the expense of some sacrifice of the dynamic range. The clamping values were initialized using the statistics of each layer. In order to truly optimize the trade-off between the reduction of the quantization error and that of the dynamic range, we learned the optimal clamping values by defining a loss on the quantization error.

Lastly, following the common approach proposed by Zhou et al. [23], we did not quantize the first and last layers of the networks, which have significantly higher impacts on network performance.

Algorithm 1 summarizes the proposed training method for network quantization. The remainder of the section details these main ingredients of our method.

**Algorithm 1** Training a neural network with NICE. *N* denotes the number of layers; *S* is the number of epochs in which each layer's weights are noised; *T* is the total number of training epochs; *c* is the current noised layer; *i* denotes the *i*th layer; *W* is the weights of the layer; *f* denotes the layer's function, i.e., convolution or fully connected; and *α* and *β* are hyper-parameters.


#### *3.1. Uniform Noise Injection*

We propose to inject uniform additive noise into the weights and biases during model training to emulate the effect of quantization incurred at inference. Prior works have investigated the behavior of the quantization error [35,36] and concluded that for sufficiently fine-grained quantizers it can be approximated as a uniform random variable. We observed the same phenomenon and empirically verified it for weight quantization as coarse as 5 bits.

The advantage of the proposed method is that the updates performed during the backward pass immediately influence the forward pass, in contrast to strategies that directly quantize the weights, where small updates often leave them in the same bin, thus, effectively unchanged.

In order to achieve a dropout-like effect in the noise injection, we use a Bernoulli distributed mask *M*, quantizing part of the weights and adding noise to the others. From empirical evidence, we chose *M* ∼ Ber(0.05), as it gave the best results for the range of bitwidths in our experiments. Instead of using the quantized value *w*ˆ = *Q*Δ(*w*) of a weight *w* in the forward pass, *w*ˆ = (1 − *M*)*Q*Δ(*w*) + *M*(*w* − *e*) is used, with *e* ∼ Uni(−Δ/2, Δ/2), where Δ denotes the size of the quantization bin.
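The masked noise injection above can be sketched in a few lines of numpy (helper names and array shapes are illustrative; this is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, delta):
    """Uniform quantizer Q_Delta with bin size delta."""
    return np.round(w / delta) * delta

def noisy_weights(w, delta, p=0.05):
    """w_hat = (1 - M) * Q_Delta(w) + M * (w - e),
    with M ~ Ber(p) and e ~ Uni(-delta/2, delta/2)."""
    m = rng.binomial(1, p, size=w.shape)                  # Bernoulli mask
    e = rng.uniform(-delta / 2, delta / 2, size=w.shape)  # uniform noise
    return (1 - m) * quantize(w, delta) + m * (w - e)
```

Entries with *M* = 0 are quantized; entries with *M* = 1 receive additive uniform noise, so small gradient updates still change the forward pass.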

#### *3.2. Gradual Quantization*

In order to improve the scalability of the method for deeper networks, it is desirable to avoid the significant change of the network behavior due to quantization. Thus, we start by gradually adding a subset of weights to the set of quantized parameters, allowing the rest of the network to adapt to the changes.

The gradual quantization is performed in the following way: the network is split into *N* equally-sized blocks of layers {*B*1, ... , *BN*}. At the *i*-th stage, we inject the noise into the weights of the layers from block *Bi*. The previous blocks {*B*1,..., *Bi*−1} are quantized, while the following blocks {*Bi*+1, ... , *BN*} remain at full precision. We apply the gradual process only once, i.e., when the *N*-th stage finishes, in the remaining training epochs we quantize and train all the layers using the STE approach.

This gradual process of increasing the number of quantized layers is similar to the one proposed by Xu et al. [37]. By limiting the number of parameters affected at once, it reduces the amount of simultaneously injected noise and improves convergence. Since we start from the earlier blocks, the later ones have an opportunity to adapt to the quantization error affecting their inputs, and thus, the network does not change drastically during any phase of quantization. After finishing the training with noise injection into the block of layers *BN*, we continue training the fully quantized network for several epochs until convergence. In the case of a pre-trained network destined for quantization, we found that the optimal block size is a single layer with its corresponding activation, while using more than one epoch of training with noise injection per block does not improve performance.
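The staged schedule can be sketched as follows (function and mode names are ours; in the paper each block *Bi* is a set of consecutive layers):

```python
def stage_modes(n_blocks, stage):
    """Per-block training mode at a given stage (1-indexed): earlier blocks
    are already quantized, the current block gets noise injection, and later
    blocks remain at full precision."""
    modes = []
    for i in range(1, n_blocks + 1):
        if i < stage:
            modes.append("quantized")
        elif i == stage:
            modes.append("noise")
        else:
            modes.append("full_precision")
    return modes
```

For example, at stage 2 of a 4-block network, block 1 is quantized, block 2 is noised, and blocks 3–4 stay at full precision.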

#### *3.3. Clamping and Quantization*

In order to quantize the network weights, we clamp their values in the range [−*cw*, *cw*]:

$$w\_c = \text{Clamp}(w, -c\_w, c\_w) = \max\left(-c\_w, \min\left(w, c\_w\right)\right). \tag{1}$$

The parameter *cw* is defined per layer and is initialized with *cw* = mean(*w*) + *β* × std(*w*), where *w* values are the weights of the layer, and *β* is a hyper-parameter. Given *cw*, we uniformly quantize the clamped weight into *Bw* bits according to

$$
\widehat{w} = \left[ w\_c \frac{2^{B\_w - 1} - 1}{c\_w} \right] \frac{c\_w}{2^{B\_w - 1} - 1},
$$

where [·] denotes the rounding operation.
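A minimal numpy sketch of the clamp-then-quantize procedure above (function names are ours; `np.round` stands in for the [·] rounding operation):

```python
import numpy as np

def init_weight_clamp(w, beta):
    """Per-layer clamp initialization: c_w = mean(w) + beta * std(w)."""
    return np.mean(w) + beta * np.std(w)

def quantize_weights(w, c_w, b_w):
    """Clamp w to [-c_w, c_w] (Equation (1)), then quantize uniformly
    to B_w bits with 2**(B_w - 1) - 1 positive levels."""
    levels = 2 ** (b_w - 1) - 1
    w_c = np.clip(w, -c_w, c_w)
    return np.round(w_c * levels / c_w) * (c_w / levels)
```

Every output of `quantize_weights` lies on the uniform grid of multiples of *cw*/(2<sup>*Bw*−1</sup> − 1) inside [−*cw*, *cw*].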

The quantization of the network activations is performed in a similar manner. The conventional ReLU activation function in CNNs is replaced by the clamped ReLU,

$$a\_{\mathfrak{c}} = \text{Clamp}(a, 0, \mathfrak{c}\_{\mathfrak{a}}),\tag{2}$$

where *a* denotes the output of the linear part of the layer, *ac* is the nonnegative value of the clamped activation prior to quantization, and *ca* is the clamping range. The constant *ca* is set as a local parameter of each layer and is learned with the other parameters of the network via backpropagation. We used the initialization *ca* = mean(*a*) + *α* × std(*a*) with the statistics computed on the training dataset and *α* set as a hyper-parameter.

A quantized version of the truncated activation is obtained by quantizing *ac* uniformly to *Ba* bits,

$$\hat{a} = \left[ a\_c \frac{2^{B\_a} - 1}{c\_a} \right] \cdot \frac{c\_a}{2^{B\_a} - 1}. \tag{3}$$

Since the Round function is non-differentiable, we used the STE approach to propagate the gradients through it to the next layer. For the update of *ca*, we calculated the derivative of *a*ˆ with respect to *ca* as

$$\frac{\partial \hat{a}}{\partial c\_a} = \begin{cases} 1, & a \geq c\_a \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$
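A numpy sketch of the forward activation quantization together with STE-style gradients (the gradient conventions shown here follow the common PACT-style straight-through choice and are our assumption, not code from the paper):

```python
import numpy as np

def quantize_act(a, c_a, b_a):
    """Forward pass: clamp a to [0, c_a], then round to 2**b_a - 1 levels."""
    levels = 2 ** b_a - 1
    a_c = np.clip(a, 0.0, c_a)
    return np.round(a_c * levels / c_a) * (c_a / levels)

def ste_grad_wrt_input(a, c_a):
    """STE gradient w.r.t. a: rounding is treated as identity, so only the
    clamp contributes (1 inside [0, c_a], 0 outside)."""
    return ((a >= 0.0) & (a <= c_a)).astype(float)

def ste_grad_wrt_clamp(a, c_a):
    """STE gradient w.r.t. the learned clamp c_a: inputs in the saturated
    region a >= c_a move together with c_a; others do not."""
    return (a >= c_a).astype(float)
```

In a framework such as PyTorch, the same effect is obtained with a custom autograd function that rounds in the forward pass and passes gradients straight through in the backward pass.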

Figure 2 depicts the evolution of the activation clamp values throughout the epochs. In this experiment, *α* was set to 5. It can be seen that activation clamp values converge to values smaller than the initialization. This shows that the layer prefers to shrink the dynamic range of the activations, which can be interpreted as a form of regularization similar in its purpose to weight decay on weights.

**Figure 2.** Activation clamp values during ResNet-18 training on CIFAR10 dataset.

The quantization of the layer biases is more complex, since their scale depends on the scales of both the activations and the weights. For each layer, we initialize the bias clamping value as

$$c\_b = \left( \underbrace{\frac{c\_a}{2^{B\_a} - 1}}\_{\text{Activation scale}} \cdot \underbrace{\frac{c\_w}{2^{B\_w - 1} - 1}}\_{\text{Weight scale}} \right) \cdot \left( \underbrace{2^{B\_b - 1} - 1}\_{\text{Maximal bias value}} \right), \tag{5}$$

where *Bb* denotes the bias bitwidth. The biases are clamped and quantized in the same manner as the weights.
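The bias clamp initialization of Equation (5) can be computed directly from the activation and weight scales (argument names are illustrative):

```python
def bias_clamp(c_a, c_w, b_a, b_w, b_b):
    """Equation (5): c_b = (activation scale) * (weight scale)
    * (maximal integer bias code)."""
    activation_scale = c_a / (2 ** b_a - 1)
    weight_scale = c_w / (2 ** (b_w - 1) - 1)
    return activation_scale * weight_scale * (2 ** (b_b - 1) - 1)
```

Because the bias scale is the product of the activation and weight scales, it is linear in each clamp value.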

#### **4. Results**

To demonstrate the effectiveness of our method, we implemented it in PyTorch and evaluated it using image classification datasets (ImageNet and CIFAR-10) and a regression scenario (the MSR joint denoising and demosaicing dataset [38]). In all the experiments, we used a pre-trained FP32 model, which was then quantized using NICE.

#### *4.1. CIFAR-10*

We tested NICE with ResNet-18 on CIFAR-10 for various quantization levels of the weights and activations. Table 1 reports the results. Notice that for the case of 3 bit weights and activations, we obtain the same accuracy as the full-precision baseline, and for the 2 bit case, only a small degradation. Moreover, observe that when we quantize only the weights or only the activations, we get a nice regularization effect that improves the achieved accuracy.

**Table 1.** NICE accuracy (% top-1) on CIFAR-10 for range of bitwidths.


#### *4.2. ImageNet*

For quantizing the ResNet-18/34/50 networks for ImageNet, we fine-tuned a given pre-trained network using NICE. We trained each network for a total of 120 epochs, following the gradual process described in Section 3.2, with the number of stages *N* set to the number of trainable layers. We used an SGD optimizer with a learning rate of 10<sup>−4</sup>, momentum of 0.9, and weight decay of 4 × 10<sup>−5</sup>.

Table 2 compares NICE with other leading approaches to low-precision quantization [18,28,29,31]. Various quantization levels of the weights and activations are presented. As a baseline, we used a pre-trained full-precision model.

**Table 2.** ImageNet comparison. We report top-1 and top-5 accuracy on ImageNet compared with state-of-the-art prior methods. For each DNN architecture, rows are sorted by the number of bits. Baseline results were taken from the PyTorch model zoo. Compared methods: JOINT [28], PACT [29], LQ-Nets [18], FAQ [31].


Our approach achieves state-of-the-art results for 4 and 5 bit quantization and comparable results for 3 bit quantization on the different network architectures. Moreover, notice that our results for the 5,5 setup, on all the tested architectures, slightly outperform the FAQ 8,8 results.

#### *4.3. Regression—Joint Denoising and Demosaicing*

In addition to the classification tasks, we applied NICE to a regression task, namely joint image denoising and demosaicing. The network we used is the one proposed in [13]. We slightly modified it by adding Dropout with *p* = 0.05, removing the tanh activations, and adding skip connections between the input and the output images. These skip connections improve the quantization results since, in this case, the network only needs to learn the necessary modifications to the input image. Figure 3 shows the whole network, with the modifications marked in red. The three channels of the input image are quantized to 16 bits, while the output of each convolution, when followed by an activation, is quantized to 8 bits (marked in Figure 3). The first and last layers are also quantized.

We applied NICE to a full-precision pre-trained network for 500 epochs with the Adam optimizer and a learning rate of 3 × 10<sup>−5</sup>. The data were augmented with random horizontal and vertical flipping. Since we are not aware of any other quantization work for this task, we implemented WRPN [17] as a baseline for comparison. Table 3 reports the test-set PSNR for the MSR dataset [38]. It can be clearly seen that NICE achieves significantly better results than WRPN, especially for low weight bitwidths.

**Table 3.** PSNR [dB] results on joint denoising and demosaicing for different bitwidths.


**Figure 3.** Model used in denoising/demosaicing experiment.

#### *4.4. Ablation Study*

In order to show the importance of each part of our NICE method, we used ResNet-18 on ImageNet. Table 4 reports the accuracy for various combinations of the NICE components. Notice that for high bitwidths, i.e., 5,5, the noise addition and gradual training contribute to the accuracy more than the clamp learning. This happens since (i) the noise distribution is indeed uniform in this case, as we show in Figure 1 and (ii) the relatively high number of activation quantization levels almost negates the effect of clamping. For low bitwidths, i.e., 3,3, we observe the opposite. The uniform noise assumption is no longer accurate. Moreover, due to the small number of bits, clamping the range of values becomes more significant.

**Table 4.** Ablation study of the NICE scheme. Accuracy (% top-1) for ResNet-18 on ImageNet for different setups.


#### **5. Hardware Implementation**

*5.1. Optimizing Quantization Flow for Hardware Inference*

Our quantization scheme fits an FPGA implementation well for several reasons. Firstly, the uniform quantization of both the weights and activations induces uniform steps between the quantized bins, which means that we can avoid the use of a resource-costly codebook (look-up table) of size *Ba* × *Bw* × *Ba* for each layer. This also saves calculation time.

Secondly, our method enables integer-only arithmetic. In order to achieve that, we start, following (5), by representing each activation and network parameter in the form *X* = *N* × *S*, where *N* is the integer code and *S* is a pre-calculated scale. We then reformulate the scaling factors *S* into the form *S*ˆ = *q* × 2<sup>*p*</sup>, where *q* ∈ N and *p* ∈ Z. In practice, we found it sufficient to constrain these values to *q* ∈ [1, 256] and *p* ∈ [−32, 0] without an accuracy drop. This representation allows the replacement of costly floating-point operations with a combination of cheap shift operations and integer arithmetic.
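Finding an admissible (q, p) pair for a given floating-point scale can be done offline with a brute-force search; the sketch below follows the constraints stated above (the function name is ours, and the paper does not specify its search procedure):

```python
def approx_scale(s, q_max=256, p_min=-32):
    """Search integers q in [1, q_max] and p in [p_min, 0]
    minimizing |s - q * 2**p|."""
    best, best_err = (1, 0), abs(s - 1.0)
    for p in range(p_min, 1):
        q = round(s / 2.0 ** p)     # best integer code for this exponent
        if 1 <= q <= q_max:
            err = abs(s - q * 2.0 ** p)
            if err < best_err:
                best, best_err = (q, p), err
    return best                     # (q, p) such that q * 2**p ~= s
```

At inference, multiplying by *S*ˆ then reduces to an integer multiplication by *q* followed by a right shift of |*p*| bits.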

#### *5.2. Hardware Flow*

In the hardware implementation, for both the regression and the classification tasks, we adopt the PipeCNN [39] implementation released by its authors (https://github.com/doonny/PipeCNN, accessed on 12 August 2021). In this implementation, the FPGA is programmed with an image containing data-moving, convolution, and pooling kernels. The layers are calculated sequentially. Figure 4 illustrates the flow of feature maps in a residual block from the previous layer to the next one. *Sai* and *Swi* are the activation and weight scale factors of layer *i*, respectively. All these factors are calculated offline and loaded into memory along with the rest of the parameters. Note that we use the FPGA for inference only.

We compiled the OpenCL kernel for Intel's Arria 10 FPGA and ran it with the regression architecture in Figure 3. Weights were quantized to 4 bits, activations to 8 bits, and the biases and the input image to 16 bits. The resource utilization amounts to 222 K LUTs, 650 DSP blocks, and 35.3 Mb of on-chip RAM. With a maximum clock frequency of 240 MHz, the processing of a single image takes 250 ms. In terms of power, the FPGA requires 30 W, while an NVIDIA Titan X GPU requires 160 W. From standard hardware design practices, we can project that a dedicated ASIC manufactured in a similar process would be more efficient by at least one order of magnitude.

**Figure 4.** Residual block in hardware.

#### **6. Conclusions**

We introduced NICE, a training scheme for quantized neural networks. The scheme is based on using uniformly quantized parameters, additive uniform noise injection, and learning the quantization clamping range. The scheme is amenable to efficient training by backpropagation in full-precision arithmetic. One advantage of NICE is the ease of its implementation on existing networks: in particular, it does not require changes to the architecture of the network, such as increasing the number of filters as required by some previous works. Moreover, NICE can be used for various types of tasks, such as classification and regression.

We report state-of-the-art results on ImageNet for a range of bitwidths and network architectures. Our solution outperforms current works on both the 4,4 and 5,5 setups, for all tested architectures, including non-uniform solutions such as [18]. It shows comparable results in the 3,3 setup.

We showed that the quantization error for 4 and 5 bits is distributed uniformly, which explains the greater success of our method at these bitwidths compared to the 3 bit case. This implies that the results for fewer than 4 bits may be further improved by adding non-uniform noise to the parameters. However, 4 bit quantization is of special interest since, being a power of 2, it is considered more hardware friendly, and INT4 matrix multiplications are supported by the Tensor Cores in NVIDIA's recently announced inference-oriented Tesla GPUs.

**Author Contributions:** Conceptualization, C.B., E.Z., T.R., N.L., Y.C. and E.S.; Methodology, C.B., E.Z., T.R., N.L., Y.C. and E.S.; Software, validation and formal analysis, C.B., E.Z., T.R., N.L., Y.C. and E.S.; Writing, C.B., E.Z., T.R., N.L., R.G.; Resources, A.M.B., A.M. and R.G. Project administration, A.M.B., A.M. and R.G. Funding acquisition, A.M.B., A.M. and R.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The image classification datasets (CIFAR-10 and ImageNet) are available in torchvision.datasets. The MSR dataset is available in the Microsoft Demosaicing Dataset (folder MSR-Demosaicing).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Genetic and Swarm Algorithms for Optimizing the Control of Building HVAC Systems Using Real Data: A Comparative Study**

**Alberto Garces-Jimenez 1,\*,†, Jose-Manuel Gomez-Pulido 2, Nuria Gallego-Salvador 2 and Alvaro-Jose Garcia-Tejedor 1,†**


**Abstract:** Buildings consume a considerable amount of electrical energy, the Heating, Ventilation, and Air Conditioning (HVAC) system being the most demanding. Saving energy while maintaining comfort still challenges scientists, as the two objectives conflict. The control of HVAC systems can be improved by modeling their behavior, which is nonlinear, complex, and dynamic and works in uncertain contexts. The scientific literature shows that Soft Computing techniques require fewer computing resources, at the expense of some controlled accuracy loss. Metaheuristic-search-based algorithms show positive results, although further research will be necessary to resolve new, challenging multi-objective optimization problems. This article compares the performance of selected genetic and swarm-intelligence-based algorithms with the aim of discerning their capabilities in the field of smart buildings. MOGA, NSGA-II/III, OMOPSO, and SMPSO, with Random Search as a benchmark, are compared in terms of hypervolume, generational distance, ε-indicator, and execution time. Real data from the Building Management System of the Teatro Real de Madrid have been used to train a data model used for the multiple-objective calculations. The novelty of the analysis of the different proposed dynamic optimization algorithms in the transient time of an HVAC system also includes the addition, to the conventional optimization objectives of comfort and energy efficiency, of the coefficient of performance and of the rate of change in ambient temperature, aiming to extend the equipment lifecycle and minimize the overshooting effect when passing to the steady state. The optimization works impressively well for energy savings, although the results must be balanced with other real considerations, such as realistic constraints on the chillers' operational capacity. The intuitive visualization of the performance of the two families of algorithms in a real multi-HVAC system adds to the novelty of this proposal.

**Keywords:** multi-objective optimization; genetic algorithms; evolutionary computation; swarm intelligence; Heating, Ventilation and Air Conditioning (HVAC); metaheuristics search; bio-inspired algorithms; smart building; soft computing

#### **1. Introduction**

Global energy consumption has been growing at 1.4% annually over the last 10 years [1], and 94% of it is produced by combustion [2], which generates greenhouse gas emissions with adverse effects on the environment and society and cannot yet be completely replaced. Buildings consume on average 40% of the electrical energy in European Union cities and 32% in world cities [3], where the Heating, Ventilation, and Air Conditioning (HVAC) system requires 32.7% of the supplied electricity and up to 40.3% in public buildings [4].

**Citation:** Garces-Jimenez, A.; Gomez-Pulido, J.-M.; Gallego-Salvador, N.; Garcia-Tejedor, A.-J. Genetic and Swarm Algorithms for Optimizing the Control of Building HVAC Systems Using Real Data: A Comparative Study. *Mathematics* **2021**, *9*, 2181. https://doi.org/ 10.3390/math9182181

Academic Editor: Freddy Gabbay

Received: 14 July 2021 Accepted: 3 September 2021 Published: 7 September 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Advanced control systems improve energy management by adapting quickly to unforeseen events or by predicting system behavior. There are some examples of this, such as the application of neural networks with genetic algorithms in building management systems, reaching savings of 27% [5]. Other studies improve the cost by 19.7% by adding an optimization module to the ambient controller [6]. Some researchers have proven that it is possible to save 30% on cold days by embedding a machine-learning-based MPC controller [7]. Moreover, the faster the controller reaches its goals, the better the energy efficiency obtained; for example, Adaptive LAMDA-PI (Learning Algorithm for Multivariable Data Analysis—Proportional Integral) controllers improve the Integral Absolute Error (IAE) of the response time by more than 140% compared with conventional PI and Fuzzy-PI controllers [8]. Optimization is embedded in different tasks or problems of HVAC systems in both design and operations: it is used to adjust Proportional, Integral, and Derivative (PID) controllers, to improve the logic of Model Predictive Controllers (MPCs), or to enhance the supervision tasks in Building Management Systems (BMSs) or Multi-Agent Controllers (MACs) [9]. There is significant interest in embedding advanced, Artificial Intelligence (AI)-based control architectures in the BMS [10] that provide acceptable results in uncertain contexts and complex systems while allowing the adoption of multi-objective optimization policies. There are two visible advanced control strategies: (1) predicting the system behavior with machine-learning-based simulations to obtain the optimal sequence of instructions or (2) adapting the system parameters in case of context perturbations so that the system quickly returns to the zero-error state, as with fuzzy logic control.
AI, together with other technologies, such as Big Data, the Internet of Things (IoT), or Cloud Computing, enhances the ubiquity, accessibility, mobility, knowledge extraction, and autonomy of new software tasks. The traditional multi-objective problem in operations is to improve energy efficiency while maintaining comfort for the users, i.e., the ideal temperature, humidity, or Indoor Environmental Quality (IEQ), objectives that mutually conflict. Comfort, health, or maintenance add other objectives to the optimization problem, such as the CO2 concentration, reducing the efficiency achievable with fewer objectives [11].

Zadeh conceptually grouped under the umbrella of "soft computing" (SC) the technologies that outperform traditional deterministic approaches [12] at the expense of losing some accuracy and generalization. Thus, SC is tolerant to imprecision and uncertain approximation and is today widely used for complex problems where moderate precision and generalization capability are acceptable, given its high resolution speed. SC covers three main fields: (1) Machine Learning (ML), (2) metaheuristics-based optimization, and (3) Fuzzy Logic (FL) for decision-making. Metaheuristics-based optimization [11] offers good tradeoffs between consumed resources and accuracy for achieving global goals but brings challenges to face, such as algorithm convergence, stability, parameter tuning, a mathematical framework, benchmarking, generalization, and performance assessment [13]. SC also offers fitness estimation for optimization with data-based models that require fewer computer resources [14].

Digital transformation and the social trend towards standardization allow functionality to be shared among different fields, which requires testing their approaches and suitability for specific applications. This conceptual 'liquidity' brings new challenges for optimization, such as the smart city, smart district, and smart building, which lead to scaling the control and supervision capabilities to upper layers (e.g., ISA 95 and IEC 62264 L2) while constraining the lower layers. More conflicting objectives, such as the Coefficient of Performance (COP), allow for the monitoring of subtle equipment degradations, achieving considerable savings over the life cycle of the installations [15]. The system management can thus become autonomous through a self-optimization organic function.

Thus, society, while aiming to enhance people's wealth and comfort, is forced to save energy and reduce costs. Multi-objective optimization strategies can be applied at several levels in building systems, especially HVAC, which can run at bare equipment control, at subsystems management, or at a superuser level integrating systems, buildings, blocks, or districts. In this scenario, there is a greenfield to explore, including, among others, autonomic building management architectures that automatically adapt their decisions to contextual changes and continuously improve with experience. The proposed study demonstrates different multi-objective optimization techniques under this scenario that include the conventional conflicting goals of comfort, observable in the ambient temperature, and energy saving, quantifiable from the subsystems' consumption, and adds two new objectives: (1) the maximization of the absolute value of the COP, allowing for optimal performance in saving energy and, at the same time, an enhancement of the lifecycle of the equipment, something rarely explored in operations before [16]; (2) the minimization of the rate of change in ambient temperature, which allows the system to enter steady-state mode from startup at nearly critical damping. The possibility for the system to automatically select the most appropriate algorithm is also proposed as a future research outcome. Although it was expected that the addition of conflicting objectives could reduce the efficiency of the optimization, the results show evidence of a wide field to be explored.

This comparative study shows the pros and cons of using different population-based multi-objective optimization algorithms for an HVAC control system. Current practices limit operation to ensuring the comfort of building inhabitants, dodging other objectives such as energy savings. The study covers (1) Swarm Intelligence (SI) algorithms and (2) Genetic Algorithms (GAs) and uses real data from the HVAC system of Teatro Real de Madrid (Opera House). The individuals in the decision space are mapped into the objective space with cost functions empirically obtained with ML's Random Forest Regressors (RFRs) to assess their dominance. The RFRs have been trained with a selection of data obtained from a historic database kindly provided by the Board of Teatro Real. The selected GAs are the Multi-Objective Genetic Algorithm (MOGA) and the Non-dominated Sorting Genetic Algorithm versions 2 and 3 (NSGA-II and NSGA-III), and the selected SI-based algorithms are Optimized Multi-objective Particle Swarm Optimization (OMOPSO) and Speed-constrained Multi-objective Particle Swarm Optimization (SMPSO). In the experiment, the Strength Pareto Evolutionary Algorithm version 2 (SPEA2) was discarded, as its execution time was excessive compared to the others. Random Search (RS) results are exhibited as a reference point.

The paper is organized as follows: In Related Work, the authors bring to light significant research related to this study. Materials and Methods explain how the experiment was built and the metrics for comparing the algorithms. The Results section visualizes and discusses the outcomes. Finally, the Conclusions section compares the obtained results with other studies, outlines the novelty, and proposes possible future research lines for this work.

#### **2. Related Work**

#### *2.1. Towards a Clear Ontology*

Recent literature on SC and multi-objective optimization often takes for granted the approach followed in this work; the absence of an effective classification has hindered the formation of an adequate body of knowledge. Although a full taxonomy is beyond the scope of this study, it is prudent to indicate some examples of confusing terms and try to position them.

Non-preference multi-objective optimization, i.e., optimization finishing with a set of non-dominated solutions, is sometimes classified as a subset of 'a posteriori' decision-making, and sometimes the two terms are treated as synonyms. It is often associated with multimodal optimization, although only the latter also includes local search. It is also difficult to differentiate Evolutionary Computation from GAs. While sharing a similar process, a GA includes mating and crossover to improve the search. For some articles, they are synonyms and come grouped either as evolutionary or genetic. They are sometimes considered a subset of different approaches, such as bio-inspired algorithms.

Particle Swarm Optimization (PSO) can be classified on its own [17] or together with GAs [18] under Multi-Objective Evolutionary Algorithms (MOEAs). MOGA is sometimes considered a separate GA [19] or the family of multi-objective GAs [20]. MOEA and MOGA may include the whole metaheuristics-based search family or only those algorithms based on the population approach.

The new algorithms based on the observation of nature can be named bio-inspired, bio-search heuristics, or metaphor-based metaheuristics, among others, reflecting their different inspirations, be they biological, chemical, or physical. GAs or SI-based algorithms can be found included in the bio-inspired family, excluding the evolutionary algorithms [19].

With regard to optimization performance, it is possible to confuse concepts such as 'convergence', which could mean either ending the search at any point (lumps) or ending it at the true global optima. The diversity feature sometimes indicates the uniformity in the distribution of the solutions, how they spread, or both.

The classification by Ahmad et al. [9] and Oliva et al. [21], together with Zadeh's original SC definition [12], supports the position of this study. At the top, algorithms are split into stochastic techniques and intelligent agents (deterministic). Stochastic techniques then split into population-based algorithms and single-individual algorithms (trajectory metaheuristics), which include Simulated Annealing (SA) and Tabu Search (TS). Population-based algorithms then split into SI and evolutionary algorithms. This study compares GAs (part of evolutionary algorithms) and SI-based particle swarms (PSO).

#### *2.2. Research Interest*

According to Wang, G. [22], at an early stage, optimization methods diversified into different fields of study: (1) linear or nonlinear programming, (2) constraints, (3) single- or multi-objective optimization, and (4) dynamic programming. The first generation introduced iteration and gradients. The second generation brought the metaheuristics-based search for global multi-objectives that reduced computing resources and allowed for parallel computation. Soft Computing (SC) AI approaches support surrogate-based or metamodel generation, replacing computer-aided simulation software with ML models. The next generation links and hybridizes the above approaches.

Nabaei et al. [23] provide a good reference for research interest in SI algorithms and GAs over time. GAs have attracted interest since before 2000, with a peak from 2006 to 2010. PSO algorithms started to become comparable between 2006 and 2010, but far fewer articles were published on them than on GAs, half of them spanning from 2011 to 2018. Another comprehensive study by Shaikh et al. [11] illustrates the research interest in optimization for building HVAC systems, in which GA articles are 24% of the total and MOGA represents 3%. PSO is present in 5%, and MOPSO in 7%. The shares of Scheduling Optimization, Hooke and Jeeves, and Linear Quadratic methods range between 3% and 6%.

Optimization can be used for designing systems or in real-time operations [24]. A GA is used for both design and operations, and so is NSGA-II, but only in a third of the articles reviewed. There are more articles about PSO in operations than in design, but a Differential Evolutionary (DE) algorithm is only used in designing, and the number of articles about the combination of these algorithms is similar to the number of articles related to NSGA-II.

#### *2.3. Genetic and Swarm Intelligence Outcomes*

Algorithms based on metaheuristics are good options for characterizing the behavior of complex, dynamic, and nonlinear systems [25].

A GA puts together a set of individuals (chromosomes) 'coded' with genes (variables), marking them with fitness functions. It then uses a selection strategy to obtain a new population ready for the next iteration. Mutation and crossover operators regulate the speed and variety of chromosome changes in the GA. While the crossover 'exploits' the search, the mutation widens the explored space. One key point is the adjustment of the parameters to the specific problem. The mutation operator can generate solutions with polynomial or uniform probability distributions. The non-uniform probability prevents the population from decaying in the early stages of the evolution by generating distant solutions with a random probability. Simulated Binary Crossover (SBX) generates offspring from two parents attending to their probability distributions.
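The two operators above can be sketched in a few lines of Python (a minimal illustration; the function names and the distribution-index default `eta=20` are common choices, not values taken from this study):

```python
import random

def sbx_crossover(p1, p2, eta=20.0):
    """Simulated Binary Crossover: offspring spread around the two parents
    with a density controlled by the distribution index eta."""
    c1, c2 = [], []
    for x1, x2 in zip(p1, p2):
        u = random.random()
        # beta is drawn from the SBX probability distribution
        beta = (2 * u) ** (1 / (eta + 1)) if u <= 0.5 else (1 / (2 * (1 - u))) ** (1 / (eta + 1))
        c1.append(0.5 * ((1 + beta) * x1 + (1 - beta) * x2))
        c2.append(0.5 * ((1 - beta) * x1 + (1 + beta) * x2))
    return c1, c2

def polynomial_mutation(x, low, high, eta=20.0, p_m=0.1):
    """Polynomial mutation: perturbs each gene with probability p_m,
    keeping the result inside [low, high]."""
    y = list(x)
    for i, xi in enumerate(y):
        if random.random() < p_m:
            u = random.random()
            delta = (2 * u) ** (1 / (eta + 1)) - 1 if u < 0.5 else 1 - (2 * (1 - u)) ** (1 / (eta + 1))
            y[i] = min(max(xi + delta * (high - low), low), high)
    return y
```

Note that SBX preserves the mean of the two parents, so the offspring explore symmetrically around them.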

GAs discover the optimal set in three different ways [26]. (1) The first approach is known as Pareto-based dominance, with a two-level ranking scheme: one level obtains the dominance and diversity assessment, and the other, containing such metrics as the total non-dominated vectors generated, the hypervolume, the generational distance or spacing, and the error rate [27], determines the convergence to local or global minima [28]. NSGA-II and SPEA2 make use of these principles. (2) The second approach uses unary or binary indicators to check performance, for example, the coefficient of determination in the R2-like S-Metric Selection Evolutionary Multi-Objective Optimization Algorithm (SMS-EMOA) that maximizes the hypervolume (HV). (3) The third approach is based on decomposition, splitting the overall problem into smaller problems for the search. There is no common procedure for these algorithms. Splitting up complicated Pareto Fronts (PFs) to apply a local search with Tchebycheff's scalarization is one of these methods, as in the Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D) and NSGA-III.
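The Pareto-dominance test underlying the first approach can be sketched as follows (a minimal Python illustration for minimization problems; the function names are illustrative, not from the paper):

```python
def dominates(a, b):
    """a dominates b (minimization): a is no worse in every objective
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(points):
    """First Pareto front: the solutions not dominated by any other."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]
```

For example, among the objective vectors (1, 5), (2, 2), (5, 1), (3, 3), and (4, 4), the first three are non-dominated and form the front.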

The advantages of GAs are that they (1) have simple fitness arrangement schemes; (2) do not need derivatives or gradients; (3) are relatively robust; and (4) are easy to parallelize. However, although they require less information about the problem, (1) designing an objective function, (2) finding a representation, and (3) adjusting the operators can be difficult tasks. In addition, they are computationally expensive compared with other methods. NSGA and NSGA-II perform niching, decide the tournaments deterministically, and avoid chaotic perturbations of the population composition with updated fitness sharing. However, the niching function is too complex and scales poorly as the number of objectives increases [24].

SI-based optimization is also population-based, with individuals bio-inspired by natural ecosystem metaphors, such as ants, bees, or particles [29]. Swarm algorithms still generate some skepticism because of the metaphoric ornaments used to describe their operators [30].

In the case of PSO, the particles move around the decision space following simple mathematical equations that yield their position and velocity. Each particle's best-known local position and the swarm's best-known global position determine its movement towards the optimum. PSO (1) is easy to adjust; (2) can be implemented easily and provides results quickly; (3) is capable of finding the global optimal solutions in most cases. However, (1) strict convergence cannot be assured; (2) PSO is relatively weak in terms of local search abilities; (3) in multi-modal problems, it is prone to obtaining local optima [23].
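The canonical position/velocity update can be sketched as follows (a minimal Python illustration of the textbook PSO equations; the inertia weight `w` and learning factors `c1`, `c2` are assumed defaults, not parameters from this study):

```python
import random

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO update: inertia plus random attraction towards the
    particle's own best (pbest) and the swarm leader (gbest)."""
    new_vel = [w * v
               + c1 * random.random() * (pb - x)   # cognitive pull
               + c2 * random.random() * (gb - x)   # social pull
               for x, v, pb, gb in zip(pos, vel, pbest, gbest)]
    new_pos = [x + v for x, v in zip(pos, new_vel)]
    return new_pos, new_vel
```

When a particle already sits at both its personal and the global best with zero velocity, the update leaves it in place, which is the zero-error fixed point of the equations.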

#### *2.4. Research Activity*

There are two schools of thought for improving the efficiency of population-based optimization. One focuses on balancing the exploration and exploitation strategies, with many variations, such as the elitist strategy found in some GAs. The other seeks simplification, as the algorithms' decisions cannot be well understood, especially for large search spaces, discontinuities, noise, or algorithms with time-varying parameters, such as PSO. The revision of the research activity is guided by the following goals:


Sharif et al. [31] included the assessment of the lifecycle cost (LCC) in addition to the energy consumption and environmental impact as a new optimization objective in the passive and active building design with a GA. They managed conflicting objectives such as renovating the envelope (passive structure) or the systems (active structure). Lee [32] also combined a GA with Computational Fluid Dynamics (CFD) for the building geometry (passive design) and the HVAC system (active design), having the temperature, energy consumption, and the Index of Air Quality (IAQ) as objectives.

Gagnon et al. [33] compared the computational resources spent in sequential and holistic approaches of a net-zero building design, using NSGA-II to optimize the carbon footprint, lifecycle cost, and thermal comfort. The experiment proved that the holistic approach achieved 59% of the optimal solutions in 100 h, and the sequential approach achieved 41% in 765 h.

The work of Haniff et al. [34] is representative of introducing minor changes to an algorithm that improves the addressed problem. They modified the Global PSO so that it can outperform the optimization of the energy consumption and the temperature, considering the weather forecast, an estimation of the characteristics of the building, and the Predicted Mean Vote (PMV) for Air Conditioning scheduling.

Cai et al. [35] proposed hybridizing a multi-objective evolutionary algorithm with a quantum-behaved PSO after dividing the problem into subproblems with Tchebycheff's decomposition: Decomposition-based Multi-Objective Binary Quantum-behaved Particle Swarm Optimization (MOMBQPSO/D). The algorithm minimizes the temperature mean and deviation in area-to-point heat conduction.

Zhai et al. [36] enhanced MOGA for the secondary cooling process in continuous casting by dynamically tuning the mutation and crossover operators with the probability method. They compared it with MOPSO and MOGA and showed a 10% water reduction.

Oliva et al. [21] reviewed different metaheuristics-based algorithms applied to the estimation of solar cell parameters. They outlined the advantages and disadvantages of the GA, Harmony Search (HS), Artificial Bee Colony (ABC), SA, Cat Swarm Optimization, Differential Evolutionary, PSO, Advanced Bee Swarm Optimization, Whale Optimization Algorithm (WOA), Gravitational Search Algorithm, Flower Pollination Algorithm, Shuffled Complex Evolution, and Wind-Driven Optimization. They concluded that WOA performs better than the others regarding the accuracy and convergence speed and avoided local minima trapping.

Aguilar et al. [37] proposed a new flexible architecture for Building Management Systems (BMSs), with an Autonomic Cycle of Data Analysis Tasks (ACODAT) that makes use of banks of optimization algorithms for HVAC system control and hinted at its use for supervisory and self-optimization tasks. In fact, in a later study, they developed a Fault Detect and Diagnosis (FDD) system optimized with MOPSO, also capable of detecting long-term equipment degradation, using the COP [15].

Awan et al. [17] analyzed the design of a solar tower plant using fuzzy goals with PSO, showing significant improvements in most of the design parameters (solar multiple, tower height, and others).

Afzal et al. [38] compared the results of applying Fuzzy Logic (FL) in both a GA and PSO to optimize the Nusselt number, friction coefficient, and maximum temperature of a battery thermal management, observing that GAs provide better results, though they are less widespread than PSO.

Suthar et al. [39] compared NSGA-II, NSGA-III, and MOPSO, applying the Technique for the Order of Preference by the Similarity to Ideal Solution (TOPSIS) for tuning the parameters of a 2 Degree-of-Freedom (DoF) controller: the setpoint track, flow variation, and input fluid. The performance was measured with IAE, ISE, ITAE errors, and the execution time, and the step function reaction was analyzed.

Waseem Ahmad et al. [9] assessed several optimization methods and indicated that GAs perform global searches well but show poor convergence. Swarm-based algorithms are good for local searches but are slower than genetic algorithms for global searches. However, Ant Colony Optimization (ACO) is faster at searching compared to others and at converging compared to simple genetic algorithms. In an HVAC system's control, the most studied multi-objective optimization techniques are GAs, in 29% of the related literature, and MOPSO, in 10%. MOGA also stands out among them.

Behrooz et al. [40] confirmed that GAs provide optimization for comfort and energy savings because of their good behavior with nonlinear systems but are challenged by variable context information and perturbations [41]. They are sometimes combined with fuzzy control [8].

Previous and current research does not fully cover the topics addressed in this article, which constitutes a novelty. Most studies demonstrate GA and SI optimization in HVAC systems in both design and operations, but few compare them. Some research deals with dynamic adaptation, such as dynamic PID tuning, but none of it includes optimization of the COP to extend the lifecycle or of the rate of change in ambient temperature at the end of the transient state to moderate the damping into the steady state. Table 1 shows all cited works related to this section.

**Table 1.** This research's topics addressed in the cited articles.


#### **3. Materials and Methods**

#### *3.1. Teatro Real: The Opera House of Madrid*

The case study is the HVAC system of the emblematic Opera House of Madrid (Spain), known as Teatro Real. The building has a floor size of 65,000 m² (700,000 ft²) over 10 levels above ground and 6 below. The 1430 m² (15,400 ft²) stage includes the most advanced scenic technology and hosts operas and concerts for 1746 seated people in the stalls, the boxes, the balcony, and the paradise areas. The building has 11 lounges, four rehearsal rooms, and seven studios, and the scenic 'box' is surrounded by offices, warehouses, and technical premises. Figure 1 is a recent photo of the building.

**Figure 1.** Main façade of the Opera House of Madrid. Courtesy of Fundacion del Teatro Real.

The Opera House is open from September to July and closed in August every year. Madrid's climate changes abruptly, with cold winters averaging 0 °C (32 °F) and hot summers averaging 35 °C (95 °F), requiring both heating and cooling. Teatro Real is also used outside performances for rehearsals, celebrations, and product launches, making HVAC operation a complex task.

The HVAC system of Teatro Real is an iconic example of a heterogeneous HVAC system built through several refurbishments, comprising two 195 kW water–air heat pumps for both heating and cooling and two 350 kW water–water chillers for extra cooling, all managed by the same BMS. There is also a boiler and an ice accumulator that are falling into disuse.

The database provided by the Administration of Teatro Real contains historical data registered in the BMS between 1 January 2016 and 4 June 2018.

#### *3.2. Selection of the Optimization Algorithms*

The selection of the multi-objective optimization algorithms for HVAC analyzed in this study is based on the observations of Ekici et al.'s comprehensive review [42]. The initial selection of evolutionary algorithms is MOGA, NSGA-II, NSGA-III, and SPEA2.

#### 3.2.1. The Multi-Objective Genetic Algorithm (MOGA)

Fonseca et al. [27] proposed in 1993 computing the fitness of each individual as a weighted sum of the objective functions with random weights, obtaining the probability of either selecting or discarding it. MOGA yields interesting results but is not yet widely adopted in real building HVAC systems.

#### 3.2.2. The Non-Dominated Sorting Genetic Algorithm Version 2 (NSGA-II)

Deb et al. [43] proposed in 2002 sorting the individuals into categories based on non-dominance. Thus, the non-dominated individuals are in the first category. The individuals dominated by those in upper levels belong to the second and subsequent categories. Figure 2 shows how the algorithm works.

**Figure 2.** NSGA-II algorithm flowchart.

At the end of each iteration, the algorithm computes the distances among the individuals, known as crowding distance, for ranking.
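The crowding distance can be sketched as follows (a minimal Python illustration of the NSGA-II diversity measure; the function name and the list-of-tuples front representation are illustrative, not from the paper):

```python
def crowding_distance(front):
    """NSGA-II crowding distance: for each solution, the normalized
    perimeter of the cuboid spanned by its nearest neighbours in
    each objective; boundary solutions get infinite distance."""
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        fmin, fmax = front[order[0]][k], front[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")  # always keep the extremes
        if fmax == fmin:
            continue
        for j in range(1, n - 1):
            dist[order[j]] += (front[order[j + 1]][k] - front[order[j - 1]][k]) / (fmax - fmin)
    return dist
```

Solutions with a larger crowding distance lie in less populated regions of the front and are preferred when truncating the population, which preserves diversity.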

3.2.3. The Non-Dominated Sorting Genetic Algorithm Version 3 (NSGA-III)

NSGA-III is a variant of NSGA-II that Deb et al. proposed later, in 2014 [44], with an adaptive selection of operators and a set of pre-specified (or manually supplied) reference points that generate a hyper-plane improving the diversity of the population. It is conceived to improve performance when the number of objectives is larger.

#### 3.2.4. The Strength Pareto Evolutionary Algorithm Version 2 (SPEA2)

Zitzler et al. [45] proposed in 2001 a fitness function to sort the individuals by identifying how many were dominated by a given solution and how many dominate it. The density is estimated with the k-Nearest Neighbor (k-NN) technique that prunes the elitist set (non-dominated) so that the algorithm delivers the desired number of solutions. Figure 3 shows how SPEA2 works.

**Figure 3.** SPEA2 algorithm flowchart.

The other side of this analysis considers the SI-based algorithms, OMOPSO and SMPSO.

3.2.5. Optimized Multi-Objective Particle Swarm Optimization (OMOPSO)

OMOPSO is one of the MOPSO versions proposed by Reyes-Sierra et al. [46] in 2006 that uses Pareto non-dominance to identify the leaders and the crowding distance to regulate their maximum number. Each iteration proclaims a leader, modifying the speed of the rest to head towards it. The leaders of the current generation are kept apart from the global leaders. The algorithm splits the population into groups with different mutation operators. Figure 4 shows how the algorithm works.

**Figure 4.** OMOPSO algorithm flowchart.

3.2.6. Speed-Constrained Multi-Objective Particle Swarm Optimization (SMPSO)

SMPSO, proposed by Nebro et al. in 2009, is another version of MOPSO [47] that includes a speed-constraint mechanism for each individual, which is useful when individuals are excessively accelerated. The optimization is no-preference, bringing an important Degree of Freedom (DoF) for making tactical and strategic decisions. The result consists of "non-dominated" solutions located on the hyper-plane of optimum values, the Pareto Front (PF). Thus, for instance, the operation can take optimal values increasing the ventilation to reduce the risk of disease transmission, e.g., COVID-19, or aiming toward maximum comfort, allowing the manager or the system to pick the best value of the PF to accomplish the goal.
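The speed-constraint mechanism can be sketched as follows (a minimal Python illustration based on the constriction coefficient and per-variable velocity bounds described by Nebro et al. [47]; the function names and defaults are illustrative):

```python
import math

def constriction(c1, c2):
    """Constriction coefficient chi: applied when the combined learning
    factor phi = c1 + c2 exceeds 4, otherwise left at 1."""
    phi = c1 + c2 if c1 + c2 > 4 else 0.0
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

def constrict_velocity(vel, lower, upper):
    """Clamp each velocity component to +/-(upper-lower)/2, the speed
    bound enforced per decision variable."""
    delta = [(u - l) / 2.0 for l, u in zip(lower, upper)]
    return [max(-d, min(d, v)) for v, d in zip(vel, delta)]
```

Together, the two mechanisms keep particles from overshooting the decision-space bounds when the attraction terms would otherwise accelerate them excessively.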

In any case, diversity is preserved by either density estimation or truncation. The fitness of the ith individual, F(i), computed with the k-NN technique, is

$$\mathbf{F}(\mathbf{i}) = \mathbf{R}(\mathbf{i}) + \mathbf{D}(\mathbf{i})$$

When F(i) < 1, the individual is non-dominated. R(i) is the raw fitness, obtained from

$$\mathcal{R}(\mathbf{i}) = \sum\_{\mathbf{j} \in (\text{Population} + \text{Archive}), \mathbf{j} \succ \mathbf{i}} \mathcal{S}(\mathbf{j}),$$

where S(j) is the strength value of j, representing the number of solutions in both Population and Archive that j dominates:

$$\mathcal{S}(\mathbf{i}) = |\{ \mathbf{j} \mid \mathbf{j} \in (\text{Population} + \text{Archive}) \land \mathbf{i} \succ \mathbf{j} \}|$$

D(i) is the density that allows the discrimination between individuals with identical fitness values, and it is obtained from

$$\mathcal{D}(\mathbf{i}) = \frac{1}{\sigma\_{\mathbf{i}}^{\mathbf{k}} + 2}$$

$$\mathbf{k} = \sqrt{|\text{Population}| + |\text{Archive}|}$$

where $\sigma_{\mathbf{i}}^{\mathbf{k}}$ is the distance in the objective space to the kth nearest neighbor in both Population and Archive. In the case of truncation,

$$\text{i is removed, if i } \lhd \text{j, } \forall \text{ j}$$
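Putting the above equations together, the fitness computation can be sketched as follows (a minimal Python illustration for minimization; the function name and the list-of-tuples representation of Population + Archive are assumptions, not from the paper):

```python
import math

def spea2_fitness(points):
    """F(i) = R(i) + D(i) over the combined Population + Archive,
    given as a list of objective vectors (minimization)."""
    def dom(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    n = len(points)
    # S(i): number of solutions that i dominates
    strength = [sum(dom(points[i], points[j]) for j in range(n)) for i in range(n)]
    # R(i): sum of the strengths of all solutions dominating i
    raw = [sum(strength[j] for j in range(n) if dom(points[j], points[i])) for i in range(n)]
    k = int(math.sqrt(n))  # k-th nearest neighbour, k = sqrt(|Pop| + |Archive|)
    fitness = []
    for i in range(n):
        dists = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        sigma_k = dists[min(k, len(dists)) - 1]
        fitness.append(raw[i] + 1.0 / (sigma_k + 2.0))  # D(i) = 1 / (sigma_k + 2)
    return fitness
```

Since D(i) is always below 0.5, any individual with F(i) < 1 necessarily has R(i) = 0 and is therefore non-dominated, matching the rule stated above.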

The performances of these metaheuristics are compared with Random Search, which acts as a baseline because it has no specific mechanism for speeding up the exploration and exploitation of the decision space.

#### *3.3. Selection of Metrics*

The "no free lunch" theorem applies when assessing the optimization [9], as improvements in one feature reduce effectiveness in another. The algorithm performance is a balance between achieving solutions with values close to the PF and the runtime resources required, so the algorithms must be proven empirically. Riquelme et al. [48] identified up to 54 metrics to prove (1) the cardinality, or the number of solutions in the approximation set; (2) the accuracy, convergence, or distance to the PF; and (3) the diversity, which measures the distribution of the fitness values and how they spread. Another classification of metrics is given by the generic definition of Zitzler et al. [49]: unary if only one approximation set is received and binary if two are received. This analysis takes the top three metrics in the ranking, namely the hypervolume (HV), the generational distance (GD), and the ε-indicator, plus the execution time, which records the runtime [48]:
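For intuition, the hypervolume of a two-objective minimization front can be computed with a simple sweep (a minimal Python illustration; the study itself optimizes more objectives, for which dedicated hypervolume algorithms are needed):

```python
def hypervolume_2d(front, ref):
    """Hypervolume (minimization, 2 objectives): the area dominated by
    the front and bounded above by the reference point ref."""
    # keep only points that actually dominate the reference point,
    # then sweep them in ascending order of the first objective
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:  # dominated points add no new area
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

A larger hypervolume means the approximation set covers more of the objective space below the reference point, i.e., it is both closer to the PF and better spread.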


#### *3.4. Auxiliary Tools*

The simulation was coded in Python, using the basic NumPy, Pandas, and Datetime libraries for managing vectors, matrices, and time series. The simulation module's RFR is implemented with scikit-learn, recommended for machine learning [51]. The optimization is built with the JMetalPy framework [52], well proven for solving multi-objective optimization problems with metaheuristics [41]. The visualization of the obtained results is built with Matplotlib.

#### **4. Problem Formulation**

The HVAC system of Teatro Real is set to follow the mechanical and comfort setpoints required for an upcoming event. The climatization lead time and several HVAC parameters remain those that the chillers' manufacturer recommended just after installation. The BMS sends commands to the HVAC system to start/stop the chillers in a certain sequence to ensure that, at the time of the event, the comfort parameters will be appropriate.

The proposed control loop for the multi-HVAC system performance optimization is depicted in Figure 5.

The Control Module, with the same functions as today, initiates the process by requesting instructions from the Optimization Module to improve its operation. The Optimization Module, which performs a metaheuristic search in the space of possible solutions, returns the best candidate obtained with the algorithm used in each model run (either GA or SI). The fitness functions of the candidates are evaluated by the Simulation Module, which receives every individual of the population and simulates the HVAC behavior (a non-linear system) [53] as defined by the candidate control parameters. The simulation is carried out with an ML algorithm, specifically a Random Forest Regressor (RFR), previously trained on historical data from the Teatro Real database by minimizing the Mean Squared Error (MSE) and maximizing the coefficient of determination (R²). The RFR also requests contextual information, provided by external sources, to compute the simulation. Finally, the Control Module translates the optimal recommendations into instructions for the actuators.

**Figure 5.** Advanced control optimized with a predicting context-driven model.
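A minimal sketch of how such a simulation module could be trained and scored, using scikit-learn's `RandomForestRegressor` with MSE and R² as described above. The feature layout and the synthetic target below are hypothetical stand-ins for the Teatro Real data, not the authors' actual pipeline:

```python
# Illustrative sketch (not the authors' code): training an RFR simulation
# module and scoring it with MSE and R^2, as the text describes.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: [capacity %, setpoint, initial room temp, occupants, OAT]
X = rng.uniform([0, 18, 10, 0, -5], [100, 26, 30, 1700, 40], size=(2000, 5))
# Synthetic target: room temperature after one simulation step (toy relation)
y = (0.7 * X[:, 2] + 0.3 * X[:, 1] - 0.02 * X[:, 0]
     + 0.001 * X[:, 3] + rng.normal(0, 0.2, 2000))

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
rfr = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

pred = rfr.predict(X_va)
print(f"MSE={mean_squared_error(y_va, pred):.3f}  R2={r2_score(y_va, pred):.3f}")
```

The 80/20 split mirrors the training/validation proportions reported in Section 5.1.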

Each experiment carried out in this study executes one control cycle (request) and addresses the optimization without delving into the control stage. Inspired by the ACO-DAT management architecture for HVAC systems [37], an autonomous cycle updates the model offline, maintaining its accuracy under real operational conditions, as shown by the green arrow in Figure 5.

The primary objectives are to maximize comfort and minimize the consumed energy.

$$\text{Comfort} = |\mathbf{T}\_0 - \mathbf{T}\_{\rm r}|$$

$$\mathbf{E} = \sum\_{i=1}^{N} \mathbf{E}\_i$$

where T0 is the setpoint temperature, and Tr is the indoor room temperature, both in ◦C. The maximum comfort for the optimization is therefore 0. The consumed electrical energy, E, is the sum of the energy consumed, in kW.h, by each chiller group (the multi-HVAC concept [37]), where N is the number of chiller groups. The energy of one chiller group, Ei, is

$$\mathbf{E}\_{\rm i} = \mathbf{E}\_{\rm chiller, i} + \mathbf{E}\_{\rm CT, i} + \mathbf{E}\_{\rm CWP, i} + \mathbf{E}\_{\rm CHWP, i}$$

where Echiller,i is the energy consumed by the chiller machine, ECT,i by the cooling tower, ECWP,i by the cooling water pump, and ECHWP,i by the chilled water primary pump.

As a novelty, this study includes two new objectives in the optimization. The first is the Coefficient of Performance (COP). The higher the COP, the better the performance of the equipment, resulting in better energy efficiency and lower maintenance costs:

$$\text{COP} = \frac{\text{W}}{\text{P}}$$

The COP is the engineering ratio of the supplied thermal power, W, to the consumed electric power, P. Optimizing the COP brings two important advantages for the HVAC system. First, HVAC equipment is designed to work at maximum performance, and in this regime the system obtains its best energy efficiency. Second, with the appropriate autonomous cycle of data tasks [8], the supervisory system can detect the degradation of the system, providing predictive maintenance [15].

The second novel objective is the rate of change of the ambient temperature, Ṫr, that is, the rate at which the temperature varies as it approaches the setpoint. This objective leads the system to rapidly reach the steady state with suitable damping that minimizes overshooting:

$$\dot{\mathbf{T}}\_{\rm r} = \frac{\mathrm{d}\mathbf{T}\_{\rm r}}{\mathrm{d}t}$$

This parameter is important at sudden startups, when there is a transient time before the steady state [16]. The lower the slope of the derivative, the less impact there is on overshooting and steady-state noise in the next control phase.
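As a plain-Python sketch, the four objectives defined above can be written as follows; the sample trajectory and energy values are hypothetical, standing in for the output of the RFR simulation:

```python
# Illustrative sketch of the four objective functions; names and sample
# values are hypothetical, not the authors' implementation.

def comfort(t_setpoint, t_room):
    """Comfort = |T0 - Tr|; 0 is the best attainable value."""
    return abs(t_setpoint - t_room)

def total_energy(group_energies):
    """E = sum of the energy consumed by each chiller group [kW.h]."""
    return sum(group_energies)

def cop(thermal_power, electric_power):
    """COP = supplied thermal power W / consumed electric power P."""
    return thermal_power / electric_power

def temp_rate(temps, dt_minutes=15.0):
    """Approximate dTr/dt [degC/min] over the last simulation slot."""
    return (temps[-1] - temps[-2]) / dt_minutes

# Hypothetical simulated trajectory approaching a 23.5 degC setpoint
trajectory = [17.0, 19.5, 21.8, 23.1]
print(comfort(23.5, trajectory[-1]))           # ~0.4 degC from the setpoint
print(total_energy([120.0, 95.0, 0.0, 60.0]))  # 275.0 kW.h
print(cop(300.0, 100.0))                       # 3.0
print(temp_rate(trajectory))                   # ~0.087 degC/min
```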

The optimization requires Comfort, E, and Ṫr to be minimized and COP to be maximized. The decision space is formed by the chillers' capacities, Ci [%], the setpoint, T0 [◦C], and the schedule, i.e., the date and time at which the system is expected to reach the setpoint, tstart. The indoor ambient temperature when the system starts, Tr(t = 0) [◦C], the number of occupants, N, and the outdoor ambient temperature, OAT [◦C], are the contextual information that determines the system. This study uses the capacities of the chillers as actuators on the subsystems, which is justified with this simplified model:

$$\mathbf{P\_i} = \mathbf{P\_{max}} \mathbf{C\_i}$$

where Pi is the electrical power actually supplied by the ith chiller, and Pmax is the maximum power of the chiller. Each chiller generates thermal power according to its machine performance, and these powers add up to the total thermal power of the system, WHVAC.

$$\mathbf{W}\_{\rm i} = \mathbf{COP}\_{\rm i} \, \mathbf{P}\_{\rm i}$$

$$\mathbf{W}\_{\rm HVAC} = \sum\_{i=1}^{N} \mathbf{W}\_{\rm i}$$

The thermal power conditioning the indoor space compensates for the outdoor weather conditions and the body heat of the occupants:

$$\mathcal{W} = \mathcal{W}\_{\text{HVAC}} + \mathcal{W}\_{\text{SUN}} + \mathcal{W}\_{\text{OCC}}$$

The thermal energy, Q, is then obtained from the power, and Tr is obtained from ΔTr, the indoor temperature variation.

$$\mathbf{Q} = \int\_0^{\mathbf{t}\_{\rm end}} \mathbf{W} \, \mathrm{d}t$$

$$\mathbf{Q} = \mathbf{C}\_{\rm e} \, \mathbf{m} \, \Delta\mathbf{T}\_{\rm r}$$

Figure 6 shows the model with the required inputs, grouped into contextual and control variables, and the outputs, differentiating the normal optimization objectives from the thermal inertia used for the next control plan [37].

An individual in the population consists of a sequence of four operational modes of the chillers based on their capacities, Ci [%], at certain times, ti, before the event starts at tend [37]. Each operational mode is a 5-tuple consisting of the proposed capacities for the four chillers, ranging from 0% to 100%, and the time at which they start. Thus, a single individual contains four of these 5-tuples. The RFR performs a simulation for each 5-tuple, chaining them according to their start-up times. The last 5-tuple indicates the operational values applied to the chillers until the system reaches the steady state at tend.
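A minimal sketch of this encoding (names are illustrative, not the authors' code): an individual is four 5-tuples, each holding a start time and the proposed capacities of the four chillers, chained in order of start-up time before simulation:

```python
# Sketch of the individual encoding described above. Values are hypothetical.
from dataclasses import dataclass

@dataclass
class OperationalMode:
    start_minute: int   # start time of this mode, in minutes from t0
    capacities: tuple   # (C1, C2, C3, C4) in %, one per chiller

def as_schedule(individual):
    """Chain the operational modes by start time, as the RFR simulation does."""
    return sorted(individual, key=lambda m: m.start_minute)

individual = [
    OperationalMode(45, (80, 0, 0, 0)),
    OperationalMode(0,  (100, 100, 0, 0)),
    OperationalMode(30, (100, 50, 0, 0)),
    OperationalMode(15, (100, 100, 50, 0)),
]
for mode in as_schedule(individual):
    print(mode.start_minute, mode.capacities)
```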

**Figure 6.** Simulation module's functionality to compute the cost functions for the optimization.

The multi-objective optimization problem would be formally defined as follows:

1. Find the vector x in the decision space:

$$
\overline{\mathbf{x}} = \begin{bmatrix}
\mathbf{t}\_1 & \mathbf{t}\_2 & \mathbf{t}\_3 & \mathbf{t}\_4 \\
\mathbf{C}^1\_1 & \mathbf{C}^2\_1 & \mathbf{C}^3\_1 & \mathbf{C}^4\_1 \\
\mathbf{C}^1\_2 & \mathbf{C}^2\_2 & \mathbf{C}^3\_2 & \mathbf{C}^4\_2 \\
\mathbf{C}^1\_3 & \mathbf{C}^2\_3 & \mathbf{C}^3\_3 & \mathbf{C}^4\_3 \\
\mathbf{C}^1\_4 & \mathbf{C}^2\_4 & \mathbf{C}^3\_4 & \mathbf{C}^4\_4
\end{bmatrix}.
$$

ti, where i = 1, 2, 3, 4, represents the starting dates and times at which the capacities of every subsystem are configured, while C^i\_j, where j = 1, 2, 3, 4 indexes the chiller, represents the capacity of chiller j during the period that starts at ti and ends at ti+1. The last period runs between t4 and tend.

2. x will satisfy the following inequality constraints:

$$\mathbf{C}^{\rm i}\_{\rm j} \le 100$$

$$\mathbf{t}\_{\rm i+1} \ge \mathbf{t}\_{\rm i}$$

3. x will optimize the vector function f(x) in the objective space:

$$
\overline{\mathbf{f}} = \begin{bmatrix}
\text{Comfort}(\overline{\mathbf{x}}) \\
\text{E}(\overline{\mathbf{x}}) \\
\text{COP}(\overline{\mathbf{x}}) \\
\dot{\mathbf{T}}\_{\rm r}(\overline{\mathbf{x}})
\end{bmatrix}.
$$

COP must be maximized, while Comfort, the consumed energy, E, and the rate of change of the ambient temperature, Ṫr, must be minimized.

#### **5. Results**

#### *5.1. Dataset*

The BMS is connected to 1824 digital and analog sensors, reporting the ambient and return temperatures, chilled water flow rates, valve states, chiller performance, secondary circuit values, air flow rates, fan speeds, pump rotational speeds, controller status, etc., and allows the operator to send instructions to the actuators from the centralized platform. However, the historical data only keep 169 variables: outdoor temperature, room temperatures, electrical supplied power, thermal energy generated by each of the four HVAC subsystems, and their COPs, grouped in several tables with different sampling rates (10 min, 15 min, 1 h, daily). Usable records run from January 2016 to June 2018. The data were cleaned to improve accuracy by removing nonessential fields and records with outliers, nulls, and/or zeros, leaving 9898 records (80%) for training and 2475 (20%) for validation.

The Department of Engineering prepares the work order for the field operators based on the HVAC operational mode (HOM), the events schedule, and the weather forecasts; the order consists of pre-programmed routines. This is, however, inefficient because the complexity of the system operation reduces all possible variations to a small set of HOMs, based on the primitive recommendations of the installers. The occupancy of the building can reach up to 1700 during performances, while the number of people on working days is around 600.

#### *5.2. Data Model*

The multi-objective estimation is computed with the RFR, with a good balance between accuracy and speed. The model simulates the outputs in intervals of 15 min, a trade-off between the system inertia and the discretization of the system dynamics. The model receives the time at which the HVAC system starts up, t0; the time of the event, tend, i.e., the moment by which the room temperature must reach the setpoint, T0; the room temperature at the beginning, Tr(t = t0); the number of people, N; and the outdoor temperature forecast, a vector of temperatures from t0 to tend every 15 min. Table 2 represents an example where the temperature, initially at 17 ◦C, must reach the setpoint, 23.5 ◦C, in an hour.

**Table 2.** Control request & context data.


The model also requires the outdoor temperature from the weather forecast. The optimization algorithm then releases the proposed individual for fitness evaluation.
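The slot-by-slot simulation can be sketched as follows; `predict_next_temp` is a deliberately simple toy surrogate standing in for the trained RFR, and all values are hypothetical:

```python
# Sketch of the 15-min slot-by-slot simulation. The surrogate below is a toy
# model, not the paper's RFR: the room moves a fraction of the way to the
# setpoint, proportional to the applied capacity.

def predict_next_temp(t_room, t_setpoint, capacity_pct):
    """Toy one-slot predictor (hypothetical stand-in for the RFR)."""
    return t_room + 0.25 * (capacity_pct / 100.0) * (t_setpoint - t_room)

def simulate(t_room, t_setpoint, capacities_per_slot):
    """Chain one prediction per 15-min slot, returning the trajectory."""
    trajectory = [t_room]
    for c in capacities_per_slot:
        t_room = predict_next_temp(t_room, t_setpoint, c)
        trajectory.append(t_room)
    return trajectory

# Example from the text: 17 degC approaching the 23.5 degC setpoint in one
# hour (four 15-min slots), here at full capacity in every slot.
traj = simulate(17.0, 23.5, [100, 100, 100, 100])
print([round(t, 2) for t in traj])
```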

In addition, the simulation receives the set of HOMs, searched by the algorithms, that will operate in each interval. Each candidate solution is a sequence of HOMs proposed for the slots in the interval from t0 to tend, each consisting of the power capacities of the chillers. Following the example, Table 3 shows one of these candidate solutions.

**Table 3.** Individual consisting of a sequence of four operational modes.


A negative capacity indicates that the chiller is cooling, while a positive one indicates that it is heating. Real implementations will impose restrictions that are not considered here, such as smoothing the capacity transitions from one slot to another or preparing the chiller for cooling or heating modes. Table 4 depicts the result of the optimization for this example.


**Table 4.** Model prediction applying optimal operational modes.

#### *5.3. Algorithm Analysis*

The analysis compares the performance and execution time (ET) of the algorithms. They start with the same expected number of solutions, i.e., the population size for the GAs and the swarm size for the SI algorithms. The experiment involved population/swarm sizes ranging from 100 to 350 in steps of 50. The mutation probabilities were the same for all algorithms, and the SBX crossover probabilities and distribution index were the same for the GAs. The mutation scheme followed a polynomial probability distribution, except for OMOPSO, which combined uniform and nonuniform distributions with the same perturbation index, 0.5.
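As an illustration of the mutation scheme named above, a minimal plain-Python sketch of Deb's polynomial mutation operator; the parameter values are illustrative, not the exact experimental settings:

```python
# Sketch of polynomial mutation (Deb): each real-valued gene is perturbed
# with probability p_m using a polynomial probability distribution with
# distribution index eta_m. Parameter values here are illustrative.
import random

def polynomial_mutation(x, lower, upper, eta_m=20.0, p_m=0.1):
    child = []
    for xi, lo, hi in zip(x, lower, upper):
        if random.random() < p_m:
            u = random.random()
            if u < 0.5:
                delta = (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
            else:
                delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta_m + 1.0))
            # Perturb and clip to the variable's bounds
            xi = min(max(xi + delta * (hi - lo), lo), hi)
        child.append(xi)
    return child

# Mutate a hypothetical vector of chiller capacities in [0, 100]
random.seed(42)
parent = [80.0, 0.0, 100.0, 50.0]
print(polynomial_mutation(parent, [0.0] * 4, [100.0] * 4, p_m=0.5))
```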

The algorithms stopped after 5000 iterations, and the GAs stopped earlier if the dominance threshold was triggered. In order to obtain stable results, each algorithm was run 10 times and the obtained values were averaged. Figures 7–9 represent the objective space for the variables Comfort, Consumed Energy, and COP in 2D diagrams.

**Figure 8.** 2D objective space of Consumed Energy [kW.h] vs. Comfort [◦C].

**Figure 9.** 2D objective space of COP [kW/kW] vs. Comfort [◦C].

Metrics used in the comparison were computed with the jMetal framework, and the ETs were recorded. The SPEA2 algorithm was dropped from the analysis, as it takes 22 times more runtime than MOGA [45]. Figure 10 shows the obtained ET values.

**Figure 10.** Average execution time for each algorithm.

The GAs ran faster than the SI-based algorithms: MOGA improved on Random Search by 13%, while OMOPSO improved on it by 9%. NSGA-III takes more time to execute than NSGA-II because of the extra computation required for the adaptive operator and the generation of hyperplanes. On the other hand, the speed constraint mechanism seems to increase the ET of SMPSO compared with OMOPSO. All algorithms outperformed RS.

GD shows how close the fitness of the set of solutions is to the ideal PF, as depicted in Figure 11.

**Figure 11.** Average GD to the Pareto's ideal front by each algorithm.
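For reference, one common form of the GD metric (the mean Euclidean distance from each obtained solution to its nearest reference point) can be sketched as follows, with hypothetical 2D objective vectors:

```python
# Sketch of the Generational Distance (GD) metric in one common form:
# the mean Euclidean distance from each solution in the obtained front to
# its nearest point in the reference (quasi-ideal) Pareto front.
import math

def generational_distance(front, reference):
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sum(min(dist(s, r) for r in reference) for s in front) / len(front)

# Hypothetical 2D objective vectors (Comfort, Energy), both minimized
reference = [(0.0, 300.0), (0.2, 250.0), (0.5, 200.0)]
front = [(0.1, 310.0), (0.6, 210.0)]
print(round(generational_distance(front, reference), 3))
```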

An approximate PF was constructed by running NSGA-II 20,000 times, simulating limit behavior. OMOPSO and MOGA achieved accuracy improvements of 75% and 65%, respectively, compared to Random Search. The quasi-ideal PF construction was, unavoidably, insufficient for the rest of the algorithms. HV and EI are shown in Figures 12 and 13.

**Figure 12.** Average hypervolume of each algorithm.

In both metrics, it is possible to identify significant improvements in all the algorithms compared with Random Search. The ε-Indicator shows NSGA-III and MOGA as the best algorithms, outperforming Random Search by 42% and 40%, while the SI algorithms were worse (31–35%). The HV does not show significant differences among the algorithms but shows an improvement of 5% on average.
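For reference, the unary additive ε-Indicator against a quasi-ideal reference front can be sketched as follows (hypothetical objective vectors, minimization assumed):

```python
# Sketch of the unary additive epsilon-indicator (Zitzler et al.): the
# smallest eps such that shifting every solution of the obtained front by
# eps makes it weakly dominate every reference point (minimization assumed).

def additive_epsilon(front, reference):
    return max(
        min(max(a - r for a, r in zip(sol, ref)) for sol in front)
        for ref in reference
    )

# Hypothetical 2D objective vectors (Comfort, Energy), both minimized
reference = [(0.0, 300.0), (0.5, 200.0)]
front = [(0.2, 310.0), (0.7, 205.0)]
print(additive_epsilon(front, reference))
```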

#### *5.4. Visualization*

Regarding the question of whether one algorithm outperforms another under a combination of quality measures such as those above, Zitzler concluded that no such combination exists, although it can be treated as equivalent to the concept of dominance [54]. Thus, Figures 14–16 show 2D maps formed with the metrics of this study, where the algorithms closest to the bottom-left corner are the most appropriate. The charts also show the distances among the algorithms, providing an intuitive method for deciding which performs better. Figure 14 shows the behavior of the algorithms when prioritizing ET and GD.

This case yields the selection of either MOGA or OMOPSO as the best algorithms for optimization accuracy. Both metrics penalize SMPSO, which obtains a GD even worse than that of RS. Figure 15 prioritizes the HV (inverted here to obtain a homogeneous visualization) together with the ET.

In this case, SMPSO still performs worse than the others in terms of accuracy, but much better than RS, likely due to diversity. All the rest behave similarly, the GA family standing out. Figure 16 prioritizes the ε-Indicator and ET.

The ε-Indicator also measures cardinality; it keeps SMPSO at the back, followed by OMOPSO, while the GAs show better behavior.

**Figure 14.** Plot chart mapping the studied algorithms according to the ET and the GD.

**Figure 16.** Plot chart mapping the studied algorithms according to the ET and the ε-Indicator.

#### *5.5. Energy Efficiency Improvements*

To complete the experiment, four distinct events available in the building's historical data were randomly selected to compare the recorded performance of the HVAC equipment, in terms of energy efficiency, with the results that would have been obtained by applying the proposed optimization. This indicates what can be expected from this approach. The events are defined in Table 5.


**Table 5.** Model prediction applying optimal HOMs.

To illustrate the example, a second decision-making process based on a weighted sum was set up to select one of the solutions with values on the PF. The weights slightly favored Consumed Energy savings over the other objectives. Table 6 shows the results.


**Table 6.** Results obtained with MOGA optimization and comparison with real data.

The right column shows the theoretical energy savings in each case with the optimized HOMs compared with what was actually recorded in the dataset. This column also highlights the achievements in comfort, with expected deviations of less than 0.5 ◦C, and HVAC subsystems working with COPs above 3.00, which is considered a good value. These impressive energy savings of 60–80%, obtained while preserving comfort and system performance, must be adjusted by further research considering real restrictions, but they point toward a promising line of research.

#### *5.6. Comparison with Other Works*

Several authors have proposed comparisons between NSGA-II and MOPSO, which may contribute to the comprehension of the results. Keshavarz et al. [55] compared NSGA-II and MOPSO for the stochastic optimization of an inventory control system, showing that NSGA-II has better performance in spacing and in the number of Pareto optimal solutions, while MOPSO better spreads the fitness of the solution set and consumes fewer computational resources. Niyomubyeyi et al. [56] studied optimization in evacuation planning, obtaining better convergence and spread with MOPSO, but the algorithm execution took five times longer than NSGA-II. Saldanha et al. [18] obtained similar results in convergence and spread for MOPSO and NSGA-II, although MOPSO yielded better results in spacing. Elgammal et al. [57] studied the integration of hybrid wind photovoltaic and fuel cells, obtaining similar system operating costs with both, but in this case, the MOPSO execution time was shorter than NSGA-II.

#### **6. Conclusions**

This study shows the performance of several genetic and SI-based algorithms when optimizing the control of a building HVAC system. The study works with real historical data from a complex and singular building, adapting the control logic to the available sensed measures and the individual chiller actuators. The results show that the simple MOGA and NSGA-II/III run faster than the MOPSOs, confirming pure Random Search as the slowest. The best convergence is obtained with OMOPSO according to GD and HV.

The achievement in energy consumption is impressive, as shown with several events randomly selected from the data, reaching savings from 60% to 80%. These results will be tested for generalization in further research that will include the new model restrictions.

This study is the first to incorporate two new objectives into the optimization problem: the HVAC subsystems' performance (COP) and the rate of change of the ambient temperature at the end of the system startup stage. The first objective enables advanced supervisory policies that improve the maintenance of the equipment and extend its lifecycle. Minimizing the second allows for a smooth transition to the steady stage of HVAC operation, reducing the overshoot or underdamping effects in the room temperature values. In future work, the dominance variation produced when adding new conflicting objectives, and how it affects control system decision-making, will be analyzed.

The proposed simple visualization of the algorithms not only allows for an intuitive understanding of which algorithm performs better but also opens the possibility of the automatic real-time instantiation of the most convenient algorithm from a bank of optimizers according to given contextual information. This is important because there are no rigid rules, but rather, existing or new strategies, such as running out of time, operations when the building is closed, etc.

The article also calls for consensus in optimization, with a body of knowledge that integrates the contributions of the different disciplines that theorize about or are applicable to the case.

This study requires generalization to demonstrate its scope with other different buildings, HVAC systems, and overall different variables extracted from the control logic. It is also of interest to work on parameter tuning to characterize the inherent "no free lunch" theorem.

The use of real data has made the study more reliable. The singularity of the building and the heterogeneous equipment that forms the HVAC system represents a demanding test for this research.

This research will contribute to the development of the smart city with autonomic management systems capable of learning from experience and improving with the context, using AI to overcome the complexity of the managed systems and the changing requirements of users.

**Author Contributions:** Conceptualization, N.G.-S., J.-M.G.-P. and A.G.-J.; methodology, A.-J.G.-T.; software, A.G.-J.; validation, A.G.-J. and N.G.-S.; formal analysis, A.G.-J.; investigation, A.G.-J. and A.-J.G.-T.; resources, N.G.-S., A.-J.G.-T. and J.-M.G.-P.; data curation, A.G.-J.; writing—original draft preparation, A.G.-J.; writing—review and editing, A.G.-J., A.-J.G.-T. and J.-M.G.-P.; visualization, A.G.-J.; supervision, A.G.-J., A.-J.G.-T. and J.-M.G.-P.; project administration, A.-J.G.-T. and A.G.-J.; funding acquisition, A.-J.G.-T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by the Vice Rectorate for Research of the Universidad Francisco de Vitoria with grant code MOGA-TR, Reference: UFV2020-34.

**Acknowledgments:** This work was possible thanks to the contribution of the Management Committee of Teatro Real de Madrid by providing the database of the HVAC BMS System. The authors wish to thank Raul Jiménez-Juarez for his help in preparing the code as part of his end-of-degree project in the Universidad Francisco de Vitoria.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

#### **References**


## *Article* **Effect of Initial Configuration of Weights on Training and Function of Artificial Neural Networks**

**Ricardo J. Jesus 1,2, Mário L. Antunes 1,3,\*, Rui A. da Costa 4, Sergey N. Dorogovtsev 4, José F. F. Mendes <sup>4</sup> and Rui L. Aguiar 1,3**


**Abstract:** The function and performance of neural networks are largely determined by the evolution of their weights and biases in the process of training, starting from the initial configuration of these parameters to one of the local minima of the loss function. We perform the quantitative statistical characterization of the deviation of the weights of two-hidden-layer feedforward ReLU networks of various sizes trained via Stochastic Gradient Descent (SGD) from their initial random configuration. We compare the evolution of the distribution function of this deviation with the evolution of the loss during training. We observed that successful training via SGD leaves the network in the close neighborhood of the initial configuration of its weights. For each initial weight of a link we measured the distribution function of the deviation from this value after training and found how the moments of this distribution and its peak depend on the initial weight. We explored the evolution of these deviations during training and observed an abrupt increase within the overfitting region. This jump occurs simultaneously with a similarly abrupt increase recorded in the evolution of the loss function. Our results suggest that SGD's ability to efficiently find local minima is restricted to the vicinity of the random initial configuration of weights.

**Keywords:** training; evolution of weights; deep learning; neural networks; artificial intelligence

### **1. Introduction**

Training of neural networks is based on the progressive correction of their weights and biases (model parameters) performed by such algorithms as gradient descent, which compare actual outputs with the desired ones for a large set of input samples [1]. Consequently, the understanding of the internal operation of neural networks should be intrinsically based on the detailed knowledge of the evolution of their weights in the process of training, starting from their initial configuration. Recently, Li and Liang [2] revealed that, during training, weights in neural networks only slightly deviate from their initial values in most practical scenarios. In this paper, we explore in detail how training changes the initial configuration of weights, and the relations between those changes and the effectiveness of the networks' function. We track the evolution of the weights of networks consisting of two Rectified Linear Unit (ReLU) hidden layers trained on three different classification tasks with Stochastic Gradient Descent (SGD), and measure the dependence of the distribution of deviations from an initial weight on this initial value. In all of our experiments, we observe no inconsistencies in the results of the three tasks.

**Citation:** Jesus, R.J.; Antunes, M.L.; da Costa, R.A.; Dorogovtsev, S.N.; Mendes, J.F.F.; Aguiar, R.L. Effect of Initial Configuration of Weights on Training and Function of Artificial Neural Networks. *Mathematics* **2021**, *9*, 2246. https://doi.org/10.3390/ math9182246

Academic Editor: Freddy Gabbay

Received: 20 August 2021 Accepted: 7 September 2021 Published: 13 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

By experimenting with networks of different sizes, we have observed that, to reach an arbitrarily chosen loss value, the weights of larger networks tend to deviate less from their initial values than those of smaller networks. This suggests that larger networks tend to converge to minima which are closer to their initialization. On the other hand, we observe that for a certain range of network sizes, the deviations from initial weights abruptly increase at some moment during their training within the overfitting regime.

This effect is illustrated in Figure 1 by the persistence and disappearance of an initialization mask in panels (a) and (b), respectively, for two network sizes. (The letters are stamped into a network's initial configuration of weights by creating a bitmap of the same shape as the weight matrix of the layer being marked, rasterizing the letter to the bitmap, and using the resulting binary mask to set to zero the weights lying outside the mark's area.) We find that the sharp increase in the deviations of the weights closely correlates with the crossover between the two regimes of the network, trainability and untrainability, that occurs in the course of training.

**Figure 1.** Train and test loss of networks consisting of two equally sized hidden layers of nodes, trained on HASYv2. Some of the weights connecting the two hidden layers were initially set to zero so that the weight matrix of these layers resembles the letter a (at initialization). The evolution of the weight matrix is shown in the subplots. (**a**) Loss of a stable learner network with 512 nodes in each hidden layer. (**b**) Training loss of an unstable network with 256 nodes in each hidden layer, illustrating the effect of crossing over from trainability to untrainability regimes on the network's weights (disappearance of the initialization mark). A single curve is shown for clarity, but networks of the same width show the same behavior. Experiments with other symbols (that serve as mask for the initialization) exhibit similar behavior.
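The masking procedure described above can be sketched as follows; the 5×5 mark is a hypothetical miniature of the rasterized letter used in the paper:

```python
# Sketch of stamping an initialization mask: a binary bitmap with the same
# shape as a layer's weight matrix zeroes every weight outside the mark.
# The tiny 5x5 mark below is hypothetical; the paper rasterizes a letter.
import random

def apply_mask(weights, mask):
    """Zero the weights lying outside the mark's area (mask value 0)."""
    return [
        [w if m else 0.0 for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]

random.seed(0)
weights = [[random.gauss(0.0, 0.1) for _ in range(5)] for _ in range(5)]
mask = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]  # a crude letter "A"
masked = apply_mask(weights, mask)
print(sum(1 for row in masked for w in row if w == 0.0))  # weights zeroed outside the mark
```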

The main contributions of this work are the following: (I) a quantitative statistical characterization of the deviation of the weights of two-hidden-layer ReLU networks of various sizes from their initial random configuration, and (II) a demonstration of the correlation between the magnitude of the deviations of the weights and the successful training of a network. Recent works [2–4] showed that, in highly over-parametrized networks, the training process amounts to a fine-tuning of the initial configuration of weights, significantly adjusting only a small portion of them. Our quantitative statistical characterization describes this phenomenon in greater detail and empirically verifies the small deviations that occur when the training process is successful. Furthermore, our analysis allows us to draw some insights regarding the training process of neural networks and paves the way for future research.

Our paper is organized as follows. In Section 2, we summarize some background topics on neural network initialization and review a series of recent papers related to ours. Section 3 presents the problem formulation, experimental settings, and datasets used in this paper. In Section 4, we explore the shape of the distribution of the deviations of weights from their initial values and its dependence on the initial weights. We continue these studies in Section 5 by experimenting with networks of different widths and find that, whenever a network's training is successful, the network does not travel far from its initial configuration. Finally, Section 6 provides concluding remarks and points out directions for future research.

#### **2. Background and Related Work**

#### *2.1. Previous Works*

It is widely known that a neural network's initialization is instrumental in its training [5–8]. The works of Glorot and Bengio [7], Chapelle and Erhan [9] and Krizhevsky et al. [10], for instance, showed that deep networks initialized with random weights and optimized with methods as simple as Stochastic Gradient Descent could, surprisingly, be trained successfully. In fact, by combining momentum with a well-chosen random initialization strategy, Sutskever et al. [11] managed to achieve performance comparable to that of Hessian-free methods.

There are many methods to randomly initialize a network. Usually, they consist of drawing the initial weights of the network from uniform or Gaussian distributions centered at zero, and setting the biases to zero or some other small constant. While the choice of the distribution (uniform or Gaussian) does not seem to be particularly important [12], the scale of the distribution from which the initial weights are drawn does. The most common initialization strategies—those of Glorot and Bengio [7], He et al. [8], and LeCun et al. [5]—define rules based on the network's architecture for choosing the variance that the distribution of initial weights should have. These and other frequently used initialization strategies are mainly heuristic, seeking to achieve some desired properties at least during the first few iterations of training. However, it is generally unclear which properties are kept during training or how they vanish [12] (Section 8.4). Moreover, it is also not clear why some initializations are better from the point of view of optimization (i.e., achieve lower training loss), but are simultaneously worse from the point of view of generalization.
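For concreteness, the variance rules behind the three cited strategies can be sketched as follows (standard deviations for a zero-mean initializer; the 512-unit layer width matches the stable network of Figure 1):

```python
# Sketch of the variance rules behind the cited initialization strategies.
# Each returns the standard deviation for a zero-mean weight distribution.
import math

def lecun_std(fan_in):
    # LeCun et al.: Var(w) = 1 / fan_in
    return math.sqrt(1.0 / fan_in)

def glorot_std(fan_in, fan_out):
    # Glorot & Bengio: Var(w) = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He et al. (for ReLU): Var(w) = 2 / fan_in
    return math.sqrt(2.0 / fan_in)

# For a 512 -> 512 hidden layer:
print(round(lecun_std(512), 4), round(glorot_std(512, 512), 4), round(he_std(512), 4))
```

Note that for a square layer (fan_in = fan_out) the Glorot and LeCun rules coincide, while the He rule scales the variance up by a factor of two to compensate for ReLU zeroing half of its inputs on average.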

Frankle and Carbin [13] recently observed that randomly initialized dense neural networks typically contain subnetworks (called winning tickets) that are capable of matching the test accuracy of the original network when trained for the same amount of time in isolation. Based on this observation, they formulate the Lottery Ticket Hypothesis, which essentially states that this effect is general and manifests with high probability in this kind of network. Notably, these subnetworks are part of the network's initialization, as opposed to an organization that emerges throughout training. The subsequent works of Zhou et al. [14] and Ramanujan et al. [15] corroborate the Lottery Ticket Hypothesis and propose that winning tickets may not even require training to achieve quality comparable to that of the trained networks.

In their recent paper, Li and Liang [2] established that two-layer over-parameterized ReLU networks, optimized with SGD on data drawn from a mixture of well-separated distributions, provably converge to a minimum close to their random initializations. Around the same time, Jacot et al. [3] proposed the neural tangent kernel (NTK), a kernel that characterizes the dynamics of the training process of neural networks in the so-called infinite-width limit. These works instigated a series of theoretical breakthroughs, such as the proof that SGD can find global minima under conditions commonly found in practice (e.g., over-parameterization) [16–23], and that, in the infinite-width limit, neural networks remain in an *O*(1/√*n*) neighborhood of their random initialization (*n* being the width of the hidden layers) [24,25]. Lee et al. [4] make a similar claim about the distance a network may deviate from its linearized version. Chizat et al. [26], however, argue that such wide networks operate in a regime of "lazy training" that appears to be incompatible with the many successes neural networks are known for in difficult, high-dimensional tasks.

#### *2.2. Our Contribution*

From distinct perspectives, these previous works have shown that, in highly overparametrized networks, the training process consists of a fine-tuning of the initial configuration of weights, adjusting significantly just a small portion of them (the ones belonging to the winning tickets). Furthermore, Frankle et al. [27] recently showed that the winning ticket's weights are highly correlated with each other.

The previous investigations on the role of the initial weights configuration focus on networks with potentially infinite width, in which, as our results also show, the persistence of the initial configuration is more noticeable. In contrast, we explore a wide range of network sizes from untrainable to trainable by varying the number of units in the hidden layers. This approach allows us to explore the limits of trainability, and characterize the trainable–untrainable network transition that occurs at a certain threshold of the width of hidden layers.

A few recent works [28,29] indicated the existence of 'phase transitions' from narrow to wide networks associated with qualitative changes in the set of loss minima in the configuration space. These results resonate with ours, although neither the relation to trainability nor the role of the initial configuration of weights was explored.

On the one hand, we observe that, when the networks are trainable (large networks), they always converge to a minimum in the vicinity of the initial weight configuration. On the other hand, when the networks are untrainable (small networks), the weight configuration drifts away from the initial configuration. Moreover, in our simulations, we found an intermediate size range for which the networks train reasonably well for a while, decreasing the loss consistently, but whose loss, later in the overfitting region, abruptly increases (due to overshooting). Past this point of divergence, the loss can no longer be reduced by further training. The behavior of these ultimately untrainable networks further emphasizes the connection between trainability (the ability to reduce train loss) and proximity to the initial configuration: the distance to the initial configuration remains small in the first stage of training, while the loss is reduced, and later increases abruptly, simultaneously with the loss.

We hypothesize that networks initialized with random weights and trained with SGD can only find good minima in the vicinity of the initial configuration of weights. This kind of training procedure is unable to effectively explore more than a relatively small region of the configuration space around the initial point.

#### **3. Problem Formulation**

Our aim in this work is to contribute to the conceptual understanding of the influence that random initializations have on the solutions of feedforward neural networks trained with SGD. In order to avoid undesirable effects specific to particular architectures, training methods, etc., we set up our experiments with very simple, vanilla settings.

We trained feedforward neural networks with two layers of hidden nodes (three layers of links) with all-to-all connectivity between adjacent layers. In our experiments, we vary the widths (i.e., numbers of nodes) of the two hidden layers between 10 and 1000 nodes simultaneously, always keeping them equal to each other. The numbers of nodes of the input and output layers are determined by the dataset, specifically by the number of pixels of the input images and the number of classes, respectively. This architecture is largely based on the multilayer perceptron created by Keras for MNIST (https://raw.githubusercontent.com/ShawDa/Keras-examples/master/mnist_mlp.py, accessed on 8 September 2021).
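As a rough sketch of how the network size grows with width, the parameter count of this architecture can be tallied layer by layer. The function below is our own illustration, assuming MNIST-like dimensions (28 × 28 = 784 input pixels, 10 classes):

```python
def mlp_param_count(width, n_inputs=784, n_classes=10):
    """Parameters (weights + biases) of a dense MLP with two equal hidden layers."""
    layer1 = n_inputs * width + width        # input -> hidden layer 1
    layer2 = width * width + width           # hidden layer 1 -> hidden layer 2
    layer3 = width * n_classes + n_classes   # hidden layer 2 -> output
    return layer1 + layer2 + layer3

# Widths explored in the experiments range from 10 to 1000.
for w in (10, 100, 1000):
    print(w, mlp_param_count(w))
```

Note the quadratic term `width * width` from the hidden-to-hidden layer, which dominates for large widths.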

Let us denote the weight of the link connecting node *i* in a given layer and node *j* in the next layer by *wij*. The output of a node *j* in the hidden and output layers, denoted by *oj*, is determined by an activation function of the weighted sum of the outputs of the previous layer plus the node's bias, *bj*, i.e., *xj* ≡ *bj* + ∑<sub>*i*</sub> *wijoi*. The nodes of the two hidden layers employ the Rectified Linear Unit (ReLU) activation function

$$f(x_j) = \begin{cases} x_j & \text{if } x_j \ge 0, \\ 0 & \text{if } x_j < 0. \end{cases} \tag{1}$$

The ReLU is a piecewise linear function that will output the input directly if it is positive; otherwise, it will output zero. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.

The nodes in the output layer employ the softmax activation function

$$f(x_j) = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}, \tag{2}$$

where *K* is the number of elements in the input vector (i.e., the number of classes of the dataset). The softmax activation function is a generalization of the logistic function to multiple dimensions. It is used in multinomial logistic regression and is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.
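Both activations translate directly from Equations (1) and (2) into plain Python; the max-shift inside the softmax is a common numerical-stability trick and is our addition, not part of the formulation above:

```python
import math

def relu(x):
    """ReLU activation, Equation (1)."""
    return x if x >= 0.0 else 0.0

def softmax(xs):
    """Softmax activation, Equation (2), shifted by max(xs) for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The shift by `max(xs)` leaves the result unchanged mathematically (numerator and denominator are scaled by the same factor) while avoiding overflow for large inputs.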

Unless otherwise stated, the biases of the networks are initialized at zero and the weights are initialized with Glorot's uniform initialization [7]:

$$w\_{ij} \sim \mathcal{U}\left(-\frac{\sqrt{6}}{\sqrt{m+n}}, \frac{\sqrt{6}}{\sqrt{m+n}}\right),\tag{3}$$

where *U*(*α*, *β*) is the uniform distribution in the interval (*α*, *β*), and *m* and *n* are the number of units of the layers that weight *wij* connects. In some of our experiments, we apply various masks to these uniformly distributed weights, setting to zero all weights *wij* not covered by a mask (see Figure 1). The loss function to be minimized is the categorical cross-entropy, i.e.,

$$L = -\sum_{i=1}^{C} y_i \ln o_i, \tag{4}$$

where *C* is the number of output classes, *yi* ∈ {0, 1} is the *i*-th target output, and *oi* is the *i*-th output of the network. The neural networks were optimized with Stochastic Gradient Descent with a learning rate of 0.1 and mini-batches of size 128. The networks were defined and trained in Keras [30] using its TensorFlow [31] back-end.
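Equation (4) amounts to a single negative log term per one-hot target; the sketch below is our own, and the `eps` clamp guarding against log(0) is an addition not present in the formulation:

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Equation (4): L = -sum_i y_i * ln(o_i), with a clamp to avoid log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))
```

For a one-hot target, the loss reduces to minus the log-probability assigned to the correct class, e.g. a prediction of 0.5 for the true class yields a loss of ln 2.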

Throughout this paper, we use three datasets to train our networks: MNIST, Fashion MNIST, and HASYv2. Figure 2 displays samples of them. These are some of the most standard datasets used in research papers on supervised machine learning.

MNIST (http://yann.lecun.com/exdb/mnist/, accessed on 5 September 2021) [32] is a database of gray-scale handwritten digits. It consists of 6 × 10<sup>4</sup> training and 1 × 10<sup>4</sup> test images of size 28 × 28, each showing one of the numerals 0 to 9. It was chosen due to its popularity and widespread familiarity.

Fashion MNIST (https://github.com/zalandoresearch/fashion-mnist, accessed on 5 September 2021) [33] is a dataset intended to be a drop-in replacement for the original MNIST dataset for machine learning experiments. It features 10 classes of clothing categories (e.g., coat, shirt, etc.) and is otherwise very similar to MNIST. It also consists of 28 × 28 gray-scale images, with 6 × 10<sup>4</sup> samples for training and 1 × 10<sup>4</sup> for testing.

HASYv2 (https://github.com/MartinThoma/HASY, accessed on 5 September 2021) [34] is a dataset of 32 × 32 binary images of handwritten symbols (mostly LaTeX symbols, such as *α*, *σ*, %, etc.). It mainly differs from the previous two datasets in that it has many more classes (369) and is much larger (containing around 150,000 train and 17,000 test images).

In this paper, the number of epochs elapsed in the training process is denoted by *t*. We typically trained networks for very long periods (up to *t* = 1000), and, consequently, for most of their training, the networks were in the overfitting regime. However, since we are studying the training process of these networks and making no claims concerning the networks' ability to generalize to different data, overfitting does not affect our conclusions. In fact, our results are usually even stronger prior to overfitting. For similar reasons, we consider only the loss function of the networks (and not other metrics such as their accuracy), since it is the loss function that the networks are optimizing.

**Figure 2.** Samples of the datasets used in our experiments. Top: MNIST. Middle: Fashion MNIST. Bottom: HASYv2 (colors reversed).

#### **4. Statistics of Deviations of Weights from Initial Values**

To illustrate the reduced scale of the deviations of weights during the training, let us mark a network's initial configuration of weights using a mask in the shape of a letter, and observe how the marking evolves as the network is trained. Naturally, if the mark is still visible after the weights undergo a large number of updates and the networks converge, it indicates that the training process does not shift the majority of the weights of a network far from their initial states.
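The marking procedure described above amounts to elementwise multiplication of the initial weight matrix by a binary mask, as in the setup of Figure 1 (weights not covered by the mask are set to zero). A minimal sketch with our own helper name; the letter-shaped masks of Figure 1 would simply be larger binary matrices:

```python
def apply_mask(weights, mask):
    """Zero every weight not covered by the binary mask (Figure 1 style marking)."""
    return [[w if keep else 0.0 for w, keep in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

# Tiny example: an 'L'-shaped mask on a 3x3 weight matrix.
W = [[0.3, -0.1, 0.2],
     [0.5, 0.4, -0.3],
     [-0.2, 0.1, 0.6]]
M = [[1, 0, 0],
     [1, 0, 0],
     [1, 1, 1]]
marked = apply_mask(W, M)
```

Tracking whether such a mark remains visible after many epochs then reduces to comparing the masked entries of the trained weight matrix against their initial values.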

Figure 1a shows typical results of training a large network whose initial configuration is marked with the letter 'a'. One can see that the letter is clearly visible after training for as many as 1000 epochs. In fact, one observes the initial mark during all of the network's training, without any sign that it will disappear. Even more surprisingly, these marks do not affect the quality of training. Independently of the shape marked (or whether there is a mark at all), the network trains to approximately the same loss across different realizations of initial weights. This demonstrates that randomly initialized networks preserve features of their initial configuration throughout their whole training, features that are ultimately transferred into the networks' final applications.

Figure 1b demonstrates an opposite effect for midsize networks that cross over between the regimes of trainability and untrainability. As it illustrates, the initial configuration of weights of these unstable networks tends to be progressively lost, suffering the largest changes when the networks diverge (i.e., when the loss function sharply increases at some moment).

By inspecting the distribution of the final (i.e., after training) values of the weights of the network of Figure 1a versus their initial values, portrayed in Figure 3, we see that weights that start with larger absolute values are more likely to suffer larger updates (in the direction that their sign points to). This trend can be observed in the plot by the tilt of the interquartile range (yellow region in the middle) with respect to the median (dotted line). The figure demonstrates that weights with initially large absolute values have a tendency to become even larger, keeping their original sign; it also shows the maximum concentration of weights near the line *wf* = *wi*, indicating that most weights change very little or not at all throughout training.

**Figure 3.** Distribution of the final values of the weights of the network of Figure 1a, trained for 1000 epochs on HASYv2, as a function of their initial value. The peak of the distribution is at *wf* = *wi*, which is extremely close to the median. The skewness of the distribution for large absolute values of *wi* is evidenced in the histograms at the top.

This effect may be explained by the presence of a winning ticket in the network's initialization. Our results suggest that the role of the over-parametrized initial configuration is decisive in successful training: when we reduce the level of over-parametrization to a point where the initial configuration stops containing such winning tickets, the network becomes untrainable by SGD.

The skewness in the distribution of the final weights can be explained by the randomness of the initial configuration, which initializes certain groups of weights with more appropriate values than others, making them better suited for certain features of the dataset. This subset of weights does not need to be particularly good; as long as it provides slightly better or more consistent outputs than the rest of the weights, the learning process may favor its training, improving it further than the rest. Over the course of many epochs, the small preference that the learning algorithm keeps giving these weights adds up and causes them to become the best recognizers for the features that they initially, by chance, happened to be better at. Under this hypothesis, it is highly likely that weights with larger initial values are more prone to be deemed important by the learning algorithm, which will try to amplify their 'signal'. This effect parallels, for instance, the real-life effect of the month of birth in sports [35].

#### **5. Evolution of Deviations of Weights and Trainability**

One may understand the relationship between the success of training and the fine-tuning process observed in Section 4, during which a large fraction of the weights of a network suffer very tiny updates (and many are not changed at all), in the following way. We suggest that the neural networks typically trained are so over-parameterized that, when initialized at random, their initial configuration has a high probability of being close to a proper minimum (i.e., a global minimum where the training loss approaches zero). Hence, to reach such a minimum, the network needs to adjust its weights only slightly, which causes its final configuration of weights to retain strong traces of the initial configuration (in agreement with our observations).

This hypothesis raises the question of what happens when we train networks that have a small number of parameters. At some point, do they simply start to train worse? Or do they stop training at all? It turns out that, during the course of their training, neural networks cross over between two regimes—trainability and untrainability. The trainability region may be further split into two distinct regimes: a regime of *high trainability* where training drives the networks towards global minima (with zero training loss), and a regime of *low trainability* where the networks converge to sub-optimal minima of significantly higher loss. Only high trainability allows a network to train successfully (i.e., to be trainable), since in the remaining two regimes, of untrainability and low trainability, either the networks do not learn at all, or they learn but very poorly. Figure 4 illustrates these three regimes. Note that we use the term trainability/untrainability referring to the regimes of the training process, in which loss and deviations of weights are, respectively, small/large. We reserve the terms trainable-untrainable to refer to the capability of a network to keep a low train loss after infinite training, which depends essentially on the network's architecture.

We measure the dependence of the time at which these crossovers happen on the network size and build a diagram showing the network's training regime for each network width and training time. This diagram, Figure 4c, resembles a phase diagram, although the variable *t*, the training time, is not a control parameter but rather a measure of the duration of the 'relaxation' process that SGD training represents. One may speak about a phase transition in these systems only in respect of their stationary state, that is, the regime in which they finally end up after being trained for a very long (infinite) time. Figure 4c shows three characteristic instants (times) of training for each network width: (i) the time at which the minimum of the test loss occurs, (ii) the time of the minimum of the train loss, and (iii) the time at which the loss abruptly increases ('diverges'). Each of these times differs between runs, and, for some widths, these fluctuations are strong or even diverge. The points in this plot are the average values over ten independent runs, with error bars showing the scale of the fluctuations between runs. Notice that the times (ii) and (iii) approach infinity as we approach the threshold of about 300 nodes from below (a threshold specific to the network's architecture and dataset). Therefore, wide networks (wider than about 300 nodes in each hidden layer) never cross over to the untrainability regime; such networks should stabilize in the trainability regime as *t* → ∞. The untrainability region of the diagram exists only for widths smaller than the threshold, which is in the neighborhood of 300 nodes. Networks with such widths initially demonstrate a consistent decrease in the train loss. However, at some later moment during the training process, the systems abruptly cross over from the trainability regime, with small deviations of weights from their initial values and decreasing train loss, to the untrainability regime, with large loss and large deviations of weights.

**Figure 4.** The regimes of a neural network over the course of its training: Evolution of (**a**) train and (**b**) test loss functions of networks of various widths. Panels (**a**,**b**) show single typical training runs of the full set which we explored. (**c**) Average times (in epochs) taken by the networks to reach the minima of the train loss and of the test loss functions, and to diverge (i.e., reach the plateau of the train loss). For each network width, we calculate the averages and standard deviations of these times (represented by error bars) over ten independent runs trained on HASYv2. (**d**) Average values of minimum loss at the train and test sets reached during individual runs (different runs reach their minima at different times). These averages were measured over the same ten independent realizations as in panel (**c**).

By gradually reducing the width, and looking at the trainability regime in the limit of infinite training time, we find a phase transition from trainable to untrainable networks. In the diagram of Figure 4c, this transition corresponds to a horizontal line at *t* = ∞, or, equivalently, to the projection of the diagram on the horizontal axis (notice that the border between regimes is concave).

The phase diagram in Figure 4 (the bottom left panel) suggests the existence of three different classes of networks: weak, strong, and, in between them, unstable learners. Weak learners are small networks that, throughout their training, do not tend to diverge, but only train to poor minima. They mostly operate in a regime of low trainability, since they can be trained, but only to ill-suited solutions. On the other side of the spectrum of trainability are large networks. These are strong learners, as they train to very small loss values and they are not prone to diverge (they operate mostly in the regime of high trainability). In between these two classes are unstable learners, which are midsize networks that train to progressively better minima (as their size increases), but that, at some point, in the course of their training, are likely to diverge and become untrainable (i.e., they tend to show a crossover from the regime of trainability to the one of untrainability).

Remarkably, we observe the different regimes of operation of a network not only in the behavior of its loss function, but also in the distance it travels from its initial configuration of weights. We have already demonstrated in Figure 1 how the mark of the initial configuration of weights of a network persists in large networks (i.e., strong learners that over the course of their training were always in the regime of high trainability), and vanishes for midsize networks that ultimately cross over to the regime of untrainability. In Appendix A, we supply a detailed description of the evolution of the statistics of weights during the training of the networks used to draw Figure 4. Figures A1 and A2 show that, as the network width is reduced, the highly structured correlation between initial and final weights, illustrated by the coincidence of the median with the line *wf* = *wi* (see Figure 3), remains in effect in all layers of weights down to the trainability threshold. Below that point, the structure of the correlations eventually breaks down, given enough training time. The reliability of the observation of this breakdown in Figures A1 and A2, for widths below ∼300, is reinforced by the robust fitting method based on cumulative distributions that is explained in Appendix B.

To quantitatively describe how distant a network becomes from its initial configuration of weights we consider the root mean square deviation (RMSD) of its system of weights at time *t* with respect to its initial configuration, i.e.,

$$\mathrm{RMSD}(t) \equiv \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left[ w_j(t) - w_j(0) \right]^2}, \tag{5}$$

where *m* is the number of weights of the network (which depends on its width), and *wj*(*t*) is the weight of edge *j* at time *t*.
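Equation (5) translates to a one-liner over the flattened list of a network's weights; the function name is ours:

```python
import math

def rmsd(w_t, w_0):
    """Equation (5): root mean square deviation of weights w_t from initial weights w_0."""
    assert len(w_t) == len(w_0) and len(w_0) > 0
    m = len(w_0)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w_t, w_0)) / m)
```

In practice, `w_t` and `w_0` would be the concatenation of all weight matrices of the network at epoch *t* and at initialization, respectively.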

Figure 5 plots, for three different datasets, the evolution of the loss function of networks of various widths alongside the deviation of their configuration of weights from its initial state. These plots evidence the existence of a link between the distance a network travels away from its initialization and the regime in which it is operating, which we describe below.

**Figure 5.** Top: Temporal evolution of the loss function of networks of various widths. Bottom: Evolution of the root mean square deviation (RMSD) between the initial configuration of weights of a network and its current configuration. From left to right: networks trained on the MNIST, Fashion MNIST, and HASYv2 datasets. Five independent test runs are plotted (individually) for each value of width and each dataset.

For all the datasets considered, the blue circles (•) show the training of networks that are weak learners; hence, they only achieve very high losses and continuously operate in a regime of low trainability. These networks experience very large deviations of their configuration of weights, moving further and further away from their initial state. In contrast, the red left-pointing triangles show the training of large networks that are strong learners (in fact, for MNIST all the networks marked with triangles are strong learners; in our experiments we could not identify unstable learners on this dataset). These networks always operate in the regime of high trainability, and over the course of their training they deviate only slightly from their initial configuration (compare with the results of Li and Liang [2]). Finally, for the Fashion MNIST and HASYv2 datasets, orange down-pointing and green up-pointing triangles show unstable networks of different widths (the former being smaller than the latter). While in the regime of trainability, these networks deviate much further from their initial configuration than strong learners (but less than weak learners). However, as they diverge and cross over into the untrainability regime (which could only be observed on networks trained with the HASYv2 dataset), the RMSD suffers a sharp increase and reaches a plateau. These observations highlight the persistent coupling between the network's trainability (measured as train loss) and the distance it travels away from the initial configuration (measured as RMSD), as well as their dependence on the network's width.

To complete the description of the behavior of these networks in the different regimes, Figure 6 plots, for networks of different widths, the time at which they reach a loss below a certain value *θ*, and the RMSD between their configuration of weights at that time and the initial one. It shows that networks that are small and operating under the low trainability regime fail to reach even moderate losses (e.g., on Fashion MNIST, no network of width 10 reaches a loss of 0.1, whereas networks of width 100 reach losses that are three orders of magnitude smaller). Moreover, even when they do reach these loss values, they take a significantly longer time to do so, as the plots for MNIST demonstrate. Finally, the figure also shows that, as the networks grow in size, the displacement each weight has to undergo for the network to reach a particular loss decreases, meaning that the networks are progressively converging to minima that are closer to their initialization. We can treat this displacement as a measure of the work the optimization algorithm performs during the training of a network to make it reach a particular quality (i.e., value of loss). One can then say that using larger networks eases training by decreasing the average work the optimizer has to spend on each weight.
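The threshold-crossing time *t<sup>θ</sup>* used in Figure 6 can be extracted from a recorded per-epoch loss history with a simple scan; the function name is our own:

```python
def time_to_loss(loss_history, theta):
    """First epoch t at which the loss drops below theta, or None if it never does."""
    for t, loss in enumerate(loss_history):
        if loss < theta:
            return t
    return None
```

A return value of `None` corresponds to the absent points in Figure 6, i.e., networks that never reach the loss threshold during the entire period of training.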

**Figure 6.** Top: Time *t<sup>θ</sup>* (in epochs) at which the networks first reach a given value *θ*. Bottom: Root mean square deviation (RMSD) between the initial configuration of weights of a network and its configuration of weights at time *t<sup>θ</sup>* . From left to right: networks trained on the MNIST, Fashion MNIST, and HASYv2 datasets. Each point in the figure is the average of five independent test runs. The absence of a point in a plot indicates that the network does not reach this loss during the entire period of training.

#### **6. Conclusions**

In this paper, we explored the effects of the initial configuration of weights on the training process and function of shallow feedforward neural networks. We performed a statistical characterization of the deviation of the weights of two-hidden-layer networks of various sizes trained via Stochastic Gradient Descent from their initial random configuration. Our analysis has shown that there is a strong correlation between the successful training of a shallow feedforward network and the magnitude of the weights' deviations from their initial values. Furthermore, we were able to observe that the initial configuration of weights typically leaves recognizable traces on the final configuration after training, which provides evidence that the learning process is based on fine-tuning the weights of the network.

We investigated the conditions under which a network travels far from its initial configuration. We observed that a neural network learns in one of two major regimes: trainability and untrainability. Moreover, its size (number of parameters) largely determines the quality of its training process and its chance of entering the untrainability regime. By comparing the evolution of the distribution function of the deviations of the weights with the evolution of the loss function during training, we have shown that a network only travels far away from its initial configuration of weights if it is either (i) a poor learner (meaning that it never reaches a good minimum) or (ii) one that crosses over from the trainability to the untrainability regime. In the alternative (good) case, in which the network is a strong learner and does not become untrainable, the network always converges to the neighbourhood of its initial configuration (keeping extensive traces of its initialization); in all of our experiments, we never observed a network converging to a good minimum outside the vicinity of the initial configuration. The results and analysis of our simulations indicate that the typical black-box model, used in most applications of neural networks, hides the trainability capacity of the networks. For a set of three typical classification problems, these results indicate a range of network sizes where the training process is successful. Our conclusions are consistent with recent findings, specifically the Lottery Ticket Hypothesis [13].

Finally, it is important to mention that most of our analysis was conducted when overfitting was already taking place. At shorter times, the deviations of weights from their initial values are even smaller, and our conclusions remain valid. Our conclusions were based on the analysis of the loss function of the networks, since this was the actual function that the networks were optimizing. However, we argue that equivalent results can be obtained by using the accuracy or another similar metric. Our analysis was conducted on a specific set of conditions. To fully generalize our findings, different initialization methods and datasets should be considered in order to validate the hypothesis stated in Section 2.2. The generalization of the results is outside the scope of this paper and is intended as future work.

**Author Contributions:** Conceptualization, R.J.J., M.L.A., R.A.d.C., S.N.D., J.F.F.M. and R.L.A.; methodology, R.J.J., M.L.A., R.A.d.C., S.N.D. and R.L.A.; software, R.J.J.; validation, R.J.J., M.L.A., R.A.d.C. and R.L.A.; formal analysis, R.A.d.C. and S.N.D.; investigation, R.J.J., R.A.d.C., M.L.A.; writing—original draft preparation, R.J.J., M.L.A. and R.A.d.C.; writing—review and editing, S.N.D., J.F.F.M. and R.L.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was developed within the scope of the project i3N, UIDB/50025/2020 and UIDP/50025/2020, financed by national funds through the FCT/MEC. RAC acknowledges the FCT Grant No. CEECIND/04697/2017. The work was also supported under the project YBN2020075021- Evolution of mobile telecommunications beyond 5G inside IT-Aveiro; by FCT/MCTES through national funds and when applicable co-funded EU funds under the project UIDB/50008/2020- UIDP/50008/2020; under the project PTDC/EEI-TEL/30685/2017 and by the Integrated Programme of SR&TD "SOCA" (Ref. CENTRO-01-0145-FEDER-000010), co-funded by Centro 2020 program, Portugal 2020, European Union, through the European Regional Development Fund.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Statistics of Weights across the Trainable–Untrainable Transition**

The distribution of deviations from the initial weight is qualitatively similar across the trainable phase. The proximity between the line *wf* = *wi* and the median of the distributions of *wf* for fixed *wi*, observed in Figure 3, is a distinctive feature of that kind of distribution. Figure A1 shows the slope obtained by fitting a straight line to the peaks (or modes) of the distributions of trained weights *wf* for fixed *wi*. In large networks, the fitted slope of the peaks, *copt*, is very close to 1 in all layers (even for very large training times), independently of the width. Below the width threshold for the network to be trainable, of about 300 nodes per hidden layer, the slope *copt* shows significant deviations from 1, and its value fluctuates strongly among realizations of the initialization and training. (The borders of the shaded areas in the plots of Figures A1 and A2 represent the standard deviation measured over ten independent realizations.) To emphasize the coupling between trainability and proximity to the initial configuration of weights, we used data from the same ten realizations to plot Figures 4, A1 and A2. Combined, these figures show the simultaneity of the abrupt increase in the loss and of the deviation from the initial configuration.

**Figure A1.** Fitting of the line across the maxima of the distribution of final weights in the networks of Figure 4. The parameter *copt* is the slope obtained by fitting a straight line to the peaks of the distributions of trained weights *wf* for fixed *wi*. These results are averages measured in ten independent realizations, and the shaded areas represent the dispersion (standard deviation across realizations).

For the sake of completeness, we also perform linear fits to the mean trained weight as a function of the initial weight. Figure A2 shows the results of these fits: *a*<sup>0</sup> and *a*<sup>1</sup> denote the constant and the slope, respectively. As with *copt*, the values of *a*<sup>0</sup> and *a*<sup>1</sup> are stable above a width of about 300 nodes, whereas below the threshold they undergo an abrupt change at some moment of training. The dispersion of the trained weights around their initial value, measured by the standard deviation, is also shown in Figure A2 for the set of weights initialized with the value *wi* = 0, displaying the same transition at a width of about 300. We observed that the distribution of trained weights for other values of *wi* behaves similarly with the variation of the network's width. Notice that, since the weights are initialized from a continuous distribution, we measure the mode (peaks), average, and standard deviation of the weights as functions of the initial value *wi* by applying the procedure described in Appendix B, which is less affected by the presence of noise (fluctuations) in the data than the standard binning methods.

**Figure A2.** Mean and standard deviation of the trained weights in the networks of Figure 4. Top and middle rows: results of fitting a straight line *a*<sup>0</sup> + *a*1*wi* to the mean of the trained weights *wf*(*wi*). Bottom row: standard deviation of the distribution of trained weights for the set of weights that are initialized with the value *wi* = 0. These results are averages measured in ten independent realizations, and the shaded areas represent the dispersion (standard deviation across realizations).

#### **Appendix B. Fitting the Statistics of Weights in a Single Realization**

This appendix briefly describes the methods used in this work to characterize the statistics of the displacements of weights produced by training with SGD. The problem is that we cannot directly obtain the distributions *Pwi* (*wf*) of the final weights *wf* for each value of the initial weight *wi*, because the *wi*s are drawn from a continuous (uniform) distribution, see Equation (3). In practice, for a single realization of training, we have a set of points (*wi*, *wf*), one for each link, in the continuous plane, as shown in Figure 3. In this situation, calculating the mean *wf* (*wi*) and standard deviation *σwf*(*wi*) of the distribution *Pwi* (*wf*) may follow one of two approaches: either using a binning procedure or cumulative distributions. In our analysis, we employed the latter, which is less affected by random fluctuations than the standard binning methods.

We assume a linear fit *wf* (*wi*) = *a*<sup>0</sup> + *a*1*wi*, and obtain the constants *a*<sup>1</sup> and *a*<sup>0</sup> as follows. Let us define the function

$$\mathcal{W}\_f(w) = \int\_{\min(w\_i)}^w \langle w\_f \rangle(x) dx = \mathcal{C} + a\_0 w + \frac{a\_1}{2} w^2,\tag{A1}$$

where *C* is a constant resulting from the lower limit of the integral. For one given realization, we can estimate this function from the following cumulative sum

$$\mathcal{W}\_f(w) \approx \frac{\max(w\_i) - \min(w\_i)}{N} \sum\_{j: w\_j(0) \le w} w\_j(t\_f),\tag{A2}$$

where *wj*(*tf*) denotes the value of the weight of link *j* at time *tf* , *N* is the number of links of the network, and min(*wi*)/max(*wi*) is the minimum/maximum value of the initialization weights. (The sum in the right-hand side of Equation (A2) runs over all links whose initial weight is not larger than *w*.) Finally, we fit a second degree polynomial to *Wf*(*w*), and get the constants *a*<sup>1</sup> and *a*<sup>0</sup> from its coefficients.
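As a concrete illustration, the cumulative estimate of Equation (A2) followed by a quadratic fit to the form of Equation (A1) can be sketched as follows. This is a minimal NumPy sketch under our own naming; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def fit_mean_line(w_i, w_f):
    """Estimate a0 and a1 in <w_f>(w_i) = a0 + a1*w_i by building the
    cumulative sum of Equation (A2) and fitting the quadratic of
    Equation (A1) to it."""
    order = np.argsort(w_i)
    wi_s, wf_s = w_i[order], w_f[order]
    n = len(wi_s)
    # Cumulative estimate of W_f(w), evaluated at each sorted w_i
    W = (wi_s[-1] - wi_s[0]) / n * np.cumsum(wf_s)
    # Fit C + a0*w + (a1/2)*w^2; np.polyfit returns highest degree first
    c2, c1, c0 = np.polyfit(wi_s, W, 2)
    return c1, 2.0 * c2  # a0, a1
```

Because the fit is performed on a cumulative (integrated) quantity, point-wise noise in the (*wi*, *wf*) cloud is averaged out, which is exactly the advantage over binning claimed in the text.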

We use the same 'cumulative-based' approach to find the second moment of *Pwi* (*wf*), denoted by $\langle w\_f^2 \rangle(w\_i)$. In this case, we assume the polynomial $\langle w\_f^2 \rangle(w\_i) = b\_0 + b\_1 w\_i + b\_2 w\_i^2$. We again define the cumulative $\mathcal{W}\_f^{(2)}(w) = \int\_{\min(w\_i)}^{w} \langle w\_f^2 \rangle(x)\, dx$. Similarly to $\mathcal{W}\_f(w)$, we estimate $\mathcal{W}\_f^{(2)}(w)$ as

$$\mathcal{W}\_f^{(2)}(w) \approx \frac{\max(w\_i) - \min(w\_i)}{N} \sum\_{j: w\_j(0) \le w} w\_j^2(t\_f),\tag{A3}$$

and fit a third degree polynomial to get the coefficients *b*0, *b*1, and *b*2. Then, we calculate *σwf*(*wi*) as

$$
\sigma\_{w\_f}(w\_i) = \sqrt{\langle w\_f^2 \rangle(w\_i) - \left[ \langle w\_f \rangle(w\_i) \right]^2}. \tag{A4}
$$

The method for fitting the peak (or the mode) of the distribution *Pwi* (*wf*) is also based on a cumulative distribution. In our experiments we observe that, in the trainability regime, the peak of the distribution of *wf* as a function of *wi* is indistinguishable from a straight line, see Figure 3. Accordingly, we define

$$N\_c(b) = \left| \left\{ (w\_{i'}, w\_f) : w\_f \le b + cw\_i \right\} \right|,\tag{A5}$$

which is a function that counts the number of points (*wi*, *wf*) below or at the line *b* + *cwi*. Then, we fit the peak of *Pwi* (*wf*) by optimizing the expression

$$\max\_{c}\, \max\_{b} \left( \frac{d}{db} N\_{c}(b) \right). \tag{A6}$$

In other words, we look for the slope that causes the largest rate of change in the function *Nc*(*b*). This slope, *copt*, is the slope of the linear function that best aligns with the peak of the distribution *Pwi* (*wf*).
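This search can be implemented by using a histogram of residuals as a finite-difference proxy for *d*/*db* *Nc*(*b*). The sketch below is our own illustration; the grid of candidate slopes and the bin count are arbitrary choices, not values from the paper.

```python
import numpy as np

def fit_peak_slope(w_i, w_f, slopes=None, bins=200):
    """For each candidate slope c, N_c(b) of Equation (A5) counts the points
    below the line b + c*w_i, so dN_c/db is the density of the residuals
    w_f - c*w_i. Equation (A6) picks the slope whose residual density has
    the highest peak; a histogram serves as the finite-difference proxy."""
    if slopes is None:
        slopes = np.linspace(0.5, 1.5, 101)  # illustrative search grid
    best_c, best_peak = slopes[0], -1
    for c in slopes:
        counts, _ = np.histogram(w_f - c * w_i, bins=bins)
        if counts.max() > best_peak:
            best_peak, best_c = counts.max(), c
    return best_c
```

Intuitively, when the line is well aligned with the peak of the distribution, all residuals pile up near a single value of *b*, producing a sharp maximum in the residual density.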

#### **References**


## *Article* **Compression of Neural Networks for Specialized Tasks via Value Locality**

**Freddy Gabbay 1,\* and Gil Shomron <sup>2</sup>**


**\*** Correspondence: freddyg@ruppin.ac.il

**Abstract:** Convolutional Neural Networks (CNNs) are broadly used in numerous applications, such as computer vision and image classification. Although CNN models deliver state-of-the-art accuracy, they require heavy computational resources that are not always affordable or available on every platform. Limited performance, system cost, and energy consumption, such as in edge devices, argue for the optimization of computations in neural networks. Toward this end, we propose herein the value-locality-based compression (VELCRO) algorithm for neural networks. VELCRO is a method to compress general-purpose neural networks that are deployed for a small subset of focused specialized tasks. Although this study focuses on CNNs, VELCRO can be used to compress any deep neural network. VELCRO relies on the property of value locality, which suggests that the activation functions exhibit values in close proximity through the inference process when the network is used for specialized tasks. VELCRO consists of two stages: a preprocessing stage, which identifies output elements of the activation functions with a high degree of value locality, and a compression stage, which replaces these elements with their corresponding arithmetic average values. As a result, VELCRO not only saves the computation of the replaced activations but also avoids processing their corresponding output feature map elements. Unlike common neural network compression algorithms, which require computationally intensive training processes, VELCRO introduces significantly fewer computational requirements. An analysis of our experiments indicates that, when CNNs are used for specialized tasks, they exhibit a high degree of value locality relative to the general-purpose case. In addition, the experimental results show that, without any training process, VELCRO produces a compression-saving ratio in the range of 13.5–30.0% with no degradation in accuracy. Finally, the experimental results indicate that, when VELCRO is used with a relatively low compression target, it significantly improves accuracy, by 2–20%, for specialized CNN tasks.

**Keywords:** machine learning; deep neural networks; convolutional neural network; deep compression

#### **1. Introduction**

Convolutional Neural Networks (CNNs) are broadly employed by numerous computer vision applications such as autonomous systems, healthcare, retail, and security. Over time, the processing requirements and complexity of CNN models have significantly increased. For example, AlexNet [1], which was introduced in 2012, has eight layers, whereas ResNet-101 [2], which was released in 2015, uses 101 layers and requires an approximately sevenfold-greater computational throughput [3]. The increasing model complexity in conjunction with large datasets used for model training has endowed CNNs with phenomenal performance for various computer vision tasks [4]. Typically, large complex networks can further extend their capacity to learn complex image features and properties. The growing model size of CNNs and the requirement of significant processing power have become major deployment challenges for migrating CNN models into mobile, Internet of Things, and edge applications. Such applications operate under limited computational and memory

**Citation:** Gabbay, F.; Shomron, G. Compression of Neural Networks for Specialized Tasks via Value Locality. *Mathematics* **2021**, *9*, 2612. https:// doi.org/10.3390/math9202612

Academic Editor: Oliviu Matei

Received: 30 September 2021 Accepted: 15 October 2021 Published: 16 October 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

resources, energy constraints, and system cost and, in many cases, cannot rely on cloud computational resources due to privacy, online communication network availability, and real-time considerations.

The compression of CNN models without excessive performance loss significantly facilitates their deployment by a variety of edge systems. Such compression has the potential to reduce computational requirements, save energy, reduce memory bandwidth and storage requirements, and shorten inference time. Various techniques have been suggested to compress CNN models, one of the most common of which is pruning [5–7], which exploits the tendency to over-parameterize CNNs [8]. Pruning trades off degradation in model prediction accuracy for model size by removing weights, Output Feature Maps (OFMs), or filters that make minor or no contribution to the inference of a network. Quantization [9–12] is another common technique that attempts to further compress network size by reducing the number of bits used to represent weights, filters, and OFMs with only a minor impact on accuracy. These methods and other compression approaches are discussed in more detail in Section 2.

This paper focuses on machine learning models that are used for specialized tasks. A specialized neural network is typically a general-purpose model that has been adjusted and optimized to carry out a set of specific tasks. Specialized neural networks have recently become common not only in edge devices but also in datacenters [13–15]. Unlike general-purpose neural networks, which are used for a diverse range of classification tasks, specialized neural networks are used for a small number of specific classification tasks. For example, a CNN model that is used to detect vehicles does not use its animal-classification capabilities. A common usage of a specialized CNN is as a fast filter in front of a heavy general-purpose CNN model. A typical example of such usage is offline video analytics [14], where video is processed by a specialized CNN model, and only when the model has a low level of confidence are the corresponding frames sent to a general-purpose CNN. Another example is game scraping, where specialized CNNs classify video-stream events by scraping in-game text appearing in frames. A cascaded CNN [16] is another approach that employs multiple specialized CNNs; the result of each specialized CNN is combined to produce a complete prediction map. The mixture-of-experts model [17] employs a combination of expert models, where each expert is a neural network specialized in specific tasks. Hierarchical classification is yet another example of specialized CNN usage: since image categories are typically organized in a hierarchical manner, a prediction can start from a superclass and perform detailed classification only within that superclass.

In this study, we introduce the value-locality-based compression (VELCRO) algorithm.
VELCRO is a method to compress deep neural networks that were originally trained for a large set of diverse classification tasks but are deployed for a smaller subset of specialized tasks. Although this work focuses on CNN models, VELCRO can be used to compress any deep neural network. The main principle of VELCRO is based on the property of value locality, which we introduce herein in the context of neural networks. This property suggests that, when the network is used for specialized tasks, the activation functions produce a proximal range of values during the inference process. VELCRO consists of two stages: a preprocessing stage, which identifies activation-function output elements with a high degree of value locality, and a compression stage, which replaces these activation elements with their corresponding arithmetic averages. As a result, VELCRO avoids not only the computation of these activation elements but also the convolution computation of their corresponding OFM elements. VELCRO also requires significantly fewer computational resources than common pruning techniques because it avoids backpropagation training. For our experimental analysis, we use three CNN models, ResNet-18 [2], MobileNet V2 [18], and GoogLeNet [19], with the ILSVRC-2012 (ImageNet) [20] dataset to examine compression capabilities and model accuracy. Lastly, we implement VELCRO in hardware on a Field-Programmable Gate Array (FPGA) and demonstrate the computational and energy savings.

The contributions of this paper are summarized as follows:

	- a. VELCRO produces a compression-saving ratio of computations in the range 20.0–27.7% for ResNet-18, 25–30% for GoogLeNet, and 13.5–20% for MobileNet V2 with no impact on model accuracy;
	- b. VELCRO significantly improves accuracy by 2–20% for specialized-task CNNs when given a relatively small compression-savings target.

The remainder of this paper is organized as follows: Section 2 reviews previous work. Section 3 introduces the proposed method and algorithm. Section 4 presents the experimental results. Finally, Section 5 summarizes the conclusions and suggests future research directions.

#### **2. Prior Works**

Numerous recent studies have proposed various techniques to optimize CNN computations, reduce redundancy, and improve computational efficiency and memory storage. This section describes the following related methods: pruning, quantization, knowledge distillation, deep compression, CNN folding, ablation, and CNN filter compression.

Pruning is one of the most common methods used for CNN optimization and was introduced in Refs. [5–7]. The concept of pruning, which is inspired by neuroscience, assumes that some network parameters are redundant and may not contribute to network performance. Various pruning techniques [5,21–25] suggest the removal of activations, weights, OFMs, or filters that make a minor or no contribution to the inference process of an already-trained network. Thereby, pruning can significantly reduce the network size and the number of computations. Traditional pruning techniques typically require fine-tuned training on the full model, which may involve significant computational overhead [26].

Pruning techniques can be classified into unstructured and structured classes. Unstructured pruning imposes no constraints on the activations or weights with respect to the network structure (i.e., individual weights or activations are removed by replacing them with zero). Structured pruning [27], in contrast, restricts the pruning process to a set of weights, channels, filters, or activations. Whereas structured pruning incurs limitations on the sparsity that can be exploited in the network due to its coarse pruning granularity, unstructured pruning uses a broader scope of the available sparsity. Conversely, unstructured pruning may involve additional overhead for representing the pruned elements and may not always fit parallel processing elements such as GPUs.

The process of pruning is typically performed by ranking the network elements according to their contribution. The rank can be determined by using various functions, such as the L1 or L2 norms [28–31] of weights or activations, or other metrics [32]. Activation pruning requires dynamic mechanisms to monitor activation values because activation importance may depend on the model input. For example, Ref. [33] employs reinforcement learning to prune channels, and Refs. [34,35] leverage spatial correlations of CNN OFMs to predict and prune zero-value activations. Further pruning techniques based on weight magnitudes were recently introduced in Refs. [21,36,37], which demonstrate that computation efficiency and network scale can be improved significantly. Various gradual pruning approaches [38], given memory footprints and computational bounds, were studied by examining the accuracy and size tradeoffs. The neuron importance score propagation introduced by Ref. [39] suggests jointly pruning neurons based on a unified goal. Other approaches, such as random neuron pruning and random grouping of weight connections into hash buckets, were introduced in Refs. [40,41]. Pruning based on a Taylor-expansion criterion [42] focuses on transfer learning by optimizing a network trained on a large dataset of images into a smaller, more efficient network specialized in a subset of classes. Their pruning method performs iterative backpropagation pruning by removing the feature maps with the least importance. Ref. [42] evaluated their pruning method using various criteria, such as weight pruning using the l2 norm, and activation pruning using mean, variance, mutual information, and Taylor-expansion criteria. Their results indicate that the importance of OFMs decreases with layer depth and that each layer has feature maps with both high and low degrees of importance. Ref. [43] introduced compression using residual connections and limited data (CURL) for residual CNN compression when relying on small datasets that represent specialized tasks.
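As a toy illustration of magnitude-based ranking, convolution filters can be scored by the L1 norm of their weights and the lowest-ranked ones zeroed out. This sketch is our own; the tensor shape and keep ratio are illustrative assumptions, not the method of any specific cited work.

```python
import numpy as np

def l1_filter_prune(weights, keep_ratio=0.75):
    """Structured pruning sketch: score each conv filter by the L1 norm of
    its weights (shape: out_channels x in_channels x kH x kW) and zero out
    the lowest-scoring filters."""
    scores = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(keep_ratio * weights.shape[0])))
    keep = np.argsort(scores)[::-1][:n_keep]
    pruned = np.zeros_like(weights)
    pruned[keep] = weights[keep]
    return pruned, sorted(keep)
```

In practice, such structured pruning is usually followed by fine-tuning to recover the accuracy lost when whole filters are removed.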

Quantization methods attempt to reduce the number of bits used to represent the values of weights, filters, and OFMs from 32-bit floating point to 8 bits or fewer, with only a slight degradation in model accuracy and simplified computational complexity. Employing quantization with fewer than 8 bits, however, is not trivial because quantization noise excessively degrades model accuracy. Quantization-aware training incorporates quantization into the training process to reduce quantization noise and recover model accuracy [44–46]. This approach is limited when training cannot be used due to a lack of dataset availability or computational resources. Various fixed-point and vector quantization methods, introduced in Refs. [9–12], present tradeoffs between network accuracy and quantization-compression ratios. A combination of pruning and quantization was introduced in Ref. [22]. Post-training quantization methods [47–50] avoid the need for retraining by searching for the optimal tensor-cutting values that reduce quantization noise after the network model has been trained.
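A minimal sketch of symmetric uniform post-training quantization illustrates the general idea; it is our own simplification, not a specific method from the cited works, and the max-magnitude scale stands in for the tensor-cutting value.

```python
import numpy as np

def quantize_dequantize(x, n_bits=8):
    """Map a float tensor to n-bit signed integers and back. The scale is
    derived from the tensor's maximum magnitude."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = float(np.abs(x).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int32)
```

The round-trip error of each value is bounded by half the quantization step, which is why a well-chosen (smaller) cutting value can reduce the noise for the bulk of the distribution at the cost of clipping outliers.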

Knowledge distillation is another machine learning optimization [51,52] that transfers knowledge from a large machine learning model into a smaller compact model that mimics the original model (instead of being trained on the original dataset) to perform competitively. These systems consist of three main elements: knowledge, an algorithm for knowledge distillation, and a teacher–student model. A broad survey of knowledge distillation is available in Ref. [53].
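The teacher–student objective can be sketched as a weighted sum of hard-label cross-entropy and a KL-divergence term between temperature-softened teacher and student distributions. This follows the common formulation; the temperature `T`, weight `alpha`, and the `T**2` rescaling are illustrative hyperparameters, not values from the surveyed papers.

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t - (z / t).max(axis=1, keepdims=True)  # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student, teacher, labels, T=4.0, alpha=0.5):
    """Hard-label cross-entropy plus KL(teacher || student) on
    temperature-softened logits, averaged over the batch."""
    p_s, p_t = softmax(student, T), softmax(teacher, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1).mean()
    probs = softmax(student)[np.arange(len(labels)), labels]
    ce = -np.log(probs + 1e-12).mean()
    return alpha * ce + (1.0 - alpha) * T * T * kl
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains, which is the intended training signal once distillation has converged.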

Deep compression was introduced in Ref. [22] and consists of a three-stage pipeline: pruning, trained quantization, and Huffman coding, which operate simultaneously to optimize model size. The first stage prunes the model by learning the important connections, the second stage performs weight quantization and sharing, and the last stage uses Huffman coding. Ref. [54] extends the deep compression idea and introduces the once-for-all network, which can be installed under diverse architectural constraints and configurations, such as performance, power, and cost. The once-for-all approach introduces the progressive shrinking techniques that generalize pruning. Whereas pruning shrinks the network width, progressive shrinking operates on four dimensions: image resolution, kernel size, depth, and width, thereby achieving a higher level of flexibility.

FoldedCNN [15] is another approach to optimize CNNs for specialized-inference tasks. Unlike compression techniques, FoldedCNN does not aim at compressing the CNN model but rather attempts to increase the inference throughput and hardware utilization. The FoldedCNN approach suggests CNN model transformations to increase their arithmetic intensity when processing a large batch size without increasing processing requirements.

Additional studies have attempted to understand the internal mechanisms of CNNs and their contribution to classification tasks. From various CNN models, Refs. [55,56] created visualized images based on the OFMs of different layers and units. Their results indicate that OFMs extract features that detect patterns, textures, shapes, concepts, and various other elements related to the classified images. Ablation techniques were used by Ref. [57] to further quantify the contribution of OFM units to the classification task. Their results indicate that elements that are selective to certain classes may be excluded from the network without necessarily impacting the overall model performance. The impact of ablation on a subset of classes was further studied in Ref. [58], which found that single-OFM-unit ablation can significantly impact the model accuracy for a subset of classes, leading them to suggest different methods for measuring the importance of internal OFM units to specific classification accuracy.

CNN filter compression techniques attempt to remove kernels and filters that make a small contribution to network performance. Removal of specific convolution filters based on their importance was introduced in Ref. [59]. The authors suggest considering two consecutive network layers as a coupled function, where the weights are used to compute the coupling factors, and using these coupling factors to prune filters and maximize the variance of the feature maps. Another study on convolution filter compression [60] highlighted that certain feature maps within and across CNN layers may contribute differently to the accuracy of the inference process. The authors indicate that the first model layers typically extract simple features, while the deeper layers may extract semantic features; understanding the importance of a feature map can therefore help compress the network. They investigate the relationship between input feature maps and filter kernels and suggest Kernel Sparsity and Entropy (KSE) as a quantitative indicator of feature-map importance.

These recent studies [55–60] provide the motivation for the present study by suggesting that, when using a CNN model for specialized tasks, we can eliminate unrelated computations and thereby compress the model, all with minimal impact on classification accuracy.

#### **3. Method and Algorithm**

Our proposed VELCRO compression algorithm relies on the fundamental property of value locality. We start our discussion by first presenting qualitative and quantitative aspects of value locality, following which we describe the VELCRO compression algorithm for specialized neural networks.

#### *3.1. Value Locality of Specialized Convolutional Neural Networks*

The method proposed to compress specialized CNNs is based on the property of value locality. Value locality suggests that, when a CNN model runs specialized tasks, the output values of the activation tensors remain in close proximity across the inference of different images. The rationale behind this property relies on the assumption that the inferred images, which already have a certain level of similarity, exhibit common features such as patterns, textures, shapes, and concepts. As a result, the intermediate layers of the model produce similar values. Figure 1 illustrates the property of value locality using the activation-function output tensors in each convolution layer k and channel c. In this example, the set of elements A(m)[k][c][i][j] for images m = 0, 1, ... , N − 1 in the activation tensor is populated with values that remain in proximity across the inference of the images.

For every convolution layer k, we define a variance tensor V[k], where each element V[k][c][i][j] in the variance tensor is defined as

$$\begin{split} \mathbf{V}[\mathbf{k}][\mathbf{c}][\mathbf{i}][\mathbf{j}] &= \text{Var}(\mathbf{A}[\mathbf{k}][\mathbf{c}][\mathbf{i}][\mathbf{j}]) = \text{E}(\mathbf{A}[\mathbf{k}][\mathbf{c}][\mathbf{i}][\mathbf{j}]^2) - \text{E}(\mathbf{A}[\mathbf{k}][\mathbf{c}][\mathbf{i}][\mathbf{j}])^2 \\ &= \frac{1}{N} \sum\_{m=0}^{N-1} \mathbf{A}^{(m)}[\mathbf{k}][\mathbf{c}][\mathbf{i}][\mathbf{j}]^2 - \left(\frac{1}{N} \sum\_{m=0}^{N-1} \mathbf{A}^{(m)}[\mathbf{k}][\mathbf{c}][\mathbf{i}][\mathbf{j}]\right)^2, \end{split} \tag{1}$$

where c is the channel index and i and j are the element coordinates.
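In NumPy terms, Equation (1) amounts to a per-element variance taken across the image axis. In this sketch (our own naming), `acts` stacks the activation tensors A(m)[k] of one layer over the N preprocessing images.

```python
import numpy as np

def variance_tensor(acts):
    """acts has shape (N, c_k, w_k, h_k): activation outputs of layer k for
    the N preprocessing images. Returns V[k] = E[A^2] - (E[A])^2 per element,
    as in Equation (1)."""
    return (acts ** 2).mean(axis=0) - acts.mean(axis=0) ** 2
```

An element whose value is identical across all images has variance exactly zero, i.e., maximal value locality.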

We use the variance tensor as a measure to quantify the proximity of values for every activation tensor element A[k][c][i][j]. Thus, a small value of V[k][c][i][j] suggests that the corresponding activation element has a high degree of value locality. The proposed compression algorithm leverages such activation elements for compression. Section 4 presents an experimental analysis of the distribution of the variance tensor for various specialized CNN models.

**Figure 1.** Value locality: The elements with coordinates i, j of the activation-function output tensor in convolutional layer k, channel c, are populated with values in proximity across the inference of images 0 to N − 1. The variance tensor V serves to measure the degree of value locality.

*3.2. VELCRO Algorithm for Specialized Neural Networks*

The VELCRO algorithm consists of two stages: preprocessing and compression.


The compression-saving ratio C, i.e., the fraction of computations saved by compression, is given by

$$\mathbf{C} = 1 - \frac{\text{Compressed model computations}}{\text{Original model computations}} = \frac{\sum\_{k=0}^{K-1} T\_k c\_k w\_k h\_k}{\sum\_{k=0}^{K-1} c\_k w\_k h\_k},\tag{2}$$

where the tuple T = {T0, T1, ... , TK−1} contains the threshold values for the activations in each convolution layer. In addition, ck, wk, and hk are the number of channels, the width, and the height of the activation-function output tensor for convolution layer k, respectively.
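Under the reading that Tk is the fraction of activation elements replaced in layer k, the compression-saving ratio reduces to a size-weighted average of the thresholds. The sketch below encodes this interpretation of Equation (2); the function name and example shapes are our own.

```python
def compression_saving_ratio(thresholds, shapes):
    """thresholds: T_k per convolution layer; shapes: (c_k, w_k, h_k) per
    layer. Returns the fraction of activation computations saved."""
    sizes = [c * w * h for (c, w, h) in shapes]
    saved = sum(t * s for t, s in zip(thresholds, sizes))
    return saved / sum(sizes)
```

For two equally sized layers with T = (0.2, 0.3), the ratio is simply the mean, 0.25.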

The complete and formal definition of the algorithm is given in Algorithm 1.

A simple example demonstrating the VELCRO algorithm is illustrated in Figure 2, which shows the activation output tensor in convolution layer k for a preprocessing dataset of N = 3 images. The dimensions of the activation tensor are ck = 1, wk = 3, and hk = 3. The VELCRO preprocessing stage performs inference on the preprocessing dataset to create a variance tensor V[k] and an arithmetic-average tensor B[k]. The threshold hyperparameter for layer k is defined in this example as Tk = 0.33, which means that the three elements in the activation-function output tensor with the lowest variance (highlighted in red) are replaced with their arithmetic averages; the remaining elements are unchanged. The outcome of the VELCRO compression stage is the compressed activation-function output tensor A′[k], where the computation of three elements (highlighted in green) is replaced by the arithmetic averages.

**Algorithm 1:** VELCRO algorithm for specialized neural networks

Input: A CNN model M with K activation-function outputs (each in a different convolution layer), N preprocessing images, and a threshold tuple T = {T0, T1, ... , TK−1}, where 0 ≤ Tk < 1 for every 0 ≤ k < K.
Output: A compressed CNN model MC.

Preprocessing stage:
Step 1: Let A[k] be the activation-function output tensor in convolution layer k, and let A(m)[k] be the corresponding activation-tensor values at the inference of image m, 0 ≤ m < N. The tensors A[k] and A(m)[k] have dimensions ck × wk × hk, where ck, wk, and hk are the number of channels, the width, and the height of the tensor at convolution layer k, respectively.
Step 2: For every 0 ≤ k < K, 0 ≤ c < ck, 0 ≤ i < wk, and 0 ≤ j < hk: initialize the tensors S and Q such that S[k][c][i][j] = 0 and Q[k][c][i][j] = 0.
Step 3: For each image 0 ≤ m < N: perform inference with model M on image m; for every convolution layer 0 ≤ k < K and every 0 ≤ c < ck, 0 ≤ i < wk, and 0 ≤ j < hk, update
S[k][c][i][j] = S[k][c][i][j] + A(m)[k][c][i][j],
Q[k][c][i][j] = Q[k][c][i][j] + (A(m)[k][c][i][j])2.
Step 4: Let B[k] be the arithmetic-average tensor of convolution layer k, with elements B[k][c][i][j] = (1/N) S[k][c][i][j] for every 0 ≤ c < ck, 0 ≤ i < wk, and 0 ≤ j < hk.
Step 5: Let V[k] be the variance tensor of convolution layer k, with elements V[k][c][i][j] = (1/N) Q[k][c][i][j] − (B[k][c][i][j])2 for every 0 ≤ c < ck, 0 ≤ i < wk, and 0 ≤ j < hk.

Compression stage:
Step 6: For each convolution layer 0 ≤ k < K: let p(x, Y) be the percentile function, which returns the percentile of element x with respect to all elements of tensor Y. Define the tensor A′[k] such that, for every 0 ≤ c < ck, 0 ≤ i < wk, and 0 ≤ j < hk,
A′[k][c][i][j] = A[k][c][i][j] if p(V[k][c][i][j], V[k]) > Tk, and A′[k][c][i][j] = B[k][c][i][j] if p(V[k][c][i][j], V[k]) ≤ Tk.
Step 7: Let the compressed CNN model MC be model M with every activation-function output tensor A[k] replaced by A′[k], for every convolution layer 0 ≤ k < K.
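The two stages of Algorithm 1 can be sketched for a single layer as follows. This is a minimal NumPy sketch under our assumptions: the percentile test is taken over the variance tensor V[k], and the preprocessing activations are stacked in one array rather than accumulated image by image; names are ours, not from the paper.

```python
import numpy as np

def velcro_preprocess_layer(acts, T_k):
    """Steps 2-6 for one convolution layer. acts: shape (N, c, w, h), the
    activation outputs for the N preprocessing images; T_k in [0, 1).
    Returns the average tensor B[k] and a mask of 'frozen' elements."""
    N = acts.shape[0]
    S = acts.sum(axis=0)            # Step 3: sum over images
    Q = (acts ** 2).sum(axis=0)     # Step 3: sum of squares
    B = S / N                       # Step 4: arithmetic-average tensor
    V = Q / N - B ** 2              # Step 5: variance tensor
    cutoff = np.quantile(V, T_k)    # Step 6: percentile threshold
    return B, V <= cutoff

def velcro_forward_layer(a, B, frozen):
    """Steps 6-7 at inference: replace frozen elements by their averages."""
    return np.where(frozen, B, a)
```

In a real deployment, the frozen elements would not be computed at all (neither the activation nor its convolution inputs), which is the source of the reported compression savings; the `np.where` here only illustrates the functional effect.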

**Figure 2.** Example of VELCRO preprocessing and compression stages.

#### **4. Experimental Results and Discussion**

Our experimental study consists of a comprehensive analysis of both value locality and the performance of various CNN models when used for specialized tasks. In the following, we first describe the experimental environment and then introduce the value locality experimental measurements. Next, we discuss the performance of the VELCRO compression algorithm. Finally, we demonstrate the computational and energy savings of VELCRO by designing hardware that implements the compression algorithm on an FPGA.

#### *4.1. Experimental Environment*

Our experimental environment is based on PyTorch [62], the ILSVRC-2012 dataset (also known as "ImageNet") [20,60], and the ResNet-18, MobileNet V2, and GoogLeNet CNN models [18,19,55] with their PyTorch pretrained models. The VELCRO algorithm, described in Algorithm 1, has been fully implemented in the PyTorch environment. Table 1 summarizes the specialized tasks used for our experimental analysis. The experiments examine five groups of specialized tasks: the groups Cats-2, Cats-3, and Cats-4 include two, three, and four classes from the ILSVRC-2012 dataset, respectively, and the groups Dogs and Cars include four classes each. Throughout the experimental analysis, we do not modify the first layer of the model, which is a common practice used in numerous studies [46].

**Table 1.** Specialized tasks summary.


#### *4.2. Experimental Analysis of Value Locality*

The distribution of the variance tensor elements in each layer (skipping the first layer) is a measure that quantifies the proximity of the activation-function outputs. Figure 3 shows the distribution of the variance tensor elements for the selected activation-function outputs in convolution layers 1, 3, 7, 10, and 14 of ResNet-18. The distribution is shown for the groups of classes Cats-2, Cats-3, and Cats-4, which include two, three, and four classes of cats from the dataset, respectively. The group "all" contains a mixture of all ILSVRC-2012 dataset classes and represents the case when the CNN model is used for general tasks. When the CNN model is used for specialized tasks (Cats-2, -3, and -4), the distribution of the variance tensor elements clearly shifts toward zero with respect to the distribution when the model is used for general tasks (all), which indicates that the CNN model produces values of closer proximity (i.e., a higher degree of value locality) for specialized tasks. Another important outcome made apparent in Figure 3 is that the three groups of specialized tasks behave similarly regardless of the number of classes. The distribution of variance tensor elements in all ResNet-18 layers is presented in Figure A1 (Appendix A) and behaves similarly to the distribution presented herein.

Figure 4 illustrates the same experimental analysis but for the GoogLeNet CNN model for selected layers 1, 6, 12, 21, 32, 38, 47, 51, and 56. The variance tensor elements of GoogLeNet behave very similarly to those of ResNet-18. When the model is used for specialized tasks, the variance distribution shifts left with respect to the general-purpose use, indicating a higher degree of value locality. The distribution in all GoogLeNet layers is presented in Figure A2 (Appendix A).

**Figure 3.** Distribution of ResNet-18 variance tensor elements in layers 1, 3, 7, 10, 14, and 16 for specialized tasks: all ImageNet classes, Cats-2, Cats-3, and Cats-4.

**Figure 4.** Distribution of GoogLeNet variance tensor elements in layers 1, 6, 12, 21, 32, 38, 47, 51, and 56 for specialized tasks: all ImageNet classes, Cats-2, Cats-3, and Cats-4.

Figure 5 presents a similar experimental analysis for MobileNet V2 layers 1, 6, 12, 19, 28, 30, and 35; the distribution in all MobileNet V2 layers is presented in Figure A3 (Appendix A). The results indicate that a lower degree of value locality occurs relative to ResNet-18 and GoogLeNet when MobileNet V2 is used for specialized tasks: the shift of the variance-tensor-element distribution is smaller than in the other CNN models. These observations reflect the highly compact nature of the MobileNet V2 network with respect to ResNet-18 and GoogLeNet, which results in a lower potential for leveraging value locality.

**Figure 5.** Distribution of MobileNet-V2 variance tensor elements in layers 1, 6, 12, 19, 28, 30, and 35 for specialized tasks: all ImageNet classes, Cats-2, Cats-3, and Cats-4.

Figures 6–8 extend our experimental analysis to additional groups of specialized tasks, Dogs and Cars, each of which includes four classes from the ILSVRC-2012 dataset. Note that the Cats group corresponds to the group Cats-4. The results further confirm those shown in Figures 3–5. In all the examined CNN models and in the majority of activation-function outputs in all convolution layers, the distribution of variance tensor elements for the specialized tasks clearly shifts toward zero relative to the distribution when the model is used for general tasks (all). As with the results presented in Figure 5, we also observe that MobileNet V2 can leverage value locality, but to a smaller extent than ResNet-18 and GoogLeNet.

These experimental results support our expectations that CNN models that are used for specialized tasks exhibit a high degree of value locality. Figures A4–A6 (Appendix A) show the experimental results for all layers of all models. The complete experimental results for all layers behave similarly to the distribution presented in Figures 6–8.

**Figure 6.** Distribution of ResNet-18 variance tensor elements in layers 1, 3, 7, 10, 14, and 16 for specialized tasks: Cats, Dogs, Cars, and all ImageNet classes.

**Figure 7.** Distribution of GoogLeNet variance tensor elements in layers 1, 6, 12, 21, 32, 38, 47, 51, and 56 for specialized tasks: Cats, Dogs, Cars, and all ImageNet classes.

**Figure 8.** Distribution of MobileNet V2 variance tensor elements in layers 1, 6, 12, 19, 28, 30, and 35 for specialized tasks: Cats, Dogs, Cars, and all ImageNet classes.

#### *4.3. Performance of Compression Algorithm*

As part of our experimental analysis, we examine the compression-saving ratio of the VELCRO algorithm on three groups of specialized tasks: Cats, Cars, and Dogs (see Table 1). Only a very small subset (<2%) of images from the preprocessing dataset was used for the preprocessing stage of the algorithm, while the remaining images were used for the validation of the compressed model. This approach is essential in order to perform an unbiased evaluation of the model performance and to preserve the generalization property of the model. Figure 9a–c present the top-1 prediction accuracy versus the compression-saving ratio for Cars, Dogs, and Cats, respectively. The experimental analysis is applied to the ResNet-18, GoogLeNet, and MobileNet V2 CNN models. For each compression-saving ratio, we examine different thresholds through trial and error and choose those that produce the highest top-1 accuracy. Tables A1–A3 in Appendix B summarize the tuples of threshold values. Table 2 summarizes, for each group of specialized tasks and each CNN model, the maximum compression-saving ratio that produces the same accuracy as the original uncompressed model.

The experimental results indicate that VELCRO produces a compression-saving ratio of 20.00–27.73% in ResNet-18 and 25.46–30.00% in GoogLeNet. The higher compression-saving ratio in GoogLeNet is attributed to the fact that GoogLeNet uses a significantly greater number of parameters and thereby has a higher potential to leverage value locality when the network is employed for specialized tasks. Conversely, MobileNet V2 produces a smaller compression-saving ratio, 13.50–19.76%, for the specialized tasks examined. These results comply with our previous measurements of the distribution of the variance tensor elements, which imply that the potential for leveraging value locality in MobileNet V2 is smaller than that of the other CNNs examined; this is explained by the fact that MobileNet V2 is a much more compact model than the other CNNs examined.

**Figure 9.** Accuracy for ResNet-18, GoogLeNet, and MobileNet V2 versus compression-saving ratio for specialized tasks: (**a**) Cars, (**b**) Dogs, and (**c**) Cats.


**Table 2.** Maximum compression-saving ratio achieved while maintaining the accuracy of the original uncompressed CNN model.

Note that VELCRO does not aim to compress the network memory footprint but rather to reduce the computational requirements. Therefore, any comparison of VELCRO to pruning approaches should consider computation aspects rather than the number of parameters in the network. Table 3 compares the VELCRO algorithm with other pruning approaches for both specialized CNNs and general-purpose ones. Although VELCRO achieves smaller computation savings, it requires significantly fewer computational resources than common pruning techniques [61] because it avoids backpropagation training.

**Table 3.** Comparison summary of VELCRO with respect to pruning techniques.


We also examine the output of the activation functions compressed by VELCRO. Table 4 presents the percentage of compressed activation elements with zero value out of all the compressed activation elements. The results in Table 4 correspond to the compression-saving ratios in Table 2 (i.e., when the network achieves maximum compression without losing accuracy). With ResNet-18 and GoogLeNet, the fraction of compressed zero values is in the range 0.08–0.31% and 0.56–0.64%, respectively. In contrast, MobileNet V2 produces a significantly larger fraction of compressed zero values, 10.48–14.91%, which is attributed to the fact that MobileNet V2 is a much more compact model than the other CNNs. These results indicate that VELCRO offers an extended level of compression with respect to pruning, which aims to remove weak connections with zero values.

**Table 4.** Compressed activation elements with zero value as a percent of all compressed activation elements.


Another important result gained from Figure 9a–c is that, when VELCRO is used with a relatively moderate compression ratio, it produces a significant increase in accuracy. The results are presented in Table 5, which summarizes the maximum top-1 accuracy achieved by VELCRO. These results are attributed to the fact that a relatively moderate level of compression helps the network leverage value locality to strengthen connections, thereby increasing the probability of favoring the prediction of classes that are in the scope of the specialized tasks.

**Table 5.** The maximum top-1 accuracy increase produced by VELCRO with respect to the uncompressed model when used for specialized tasks.


#### *4.4. Hardware Implementation*

In the last part of our experimental analysis, we demonstrate the computational optimization and energy savings of VELCRO through a hardware implementation on the Xilinx® Alveo™ U280 Data Center accelerator card [63]. Our hardware implementation, which is illustrated in Figure 10, consists of 16 instance modules, each of which comprises a two-dimensional convolution layer with a 64 × 64 input feature map (IFMAP), a 3 × 3 filter, and a ReLU activation function. In addition, each module includes compression control logic that skips the compressed computations and replaces them with their corresponding arithmetic averages. Our hardware implementation was designed in Verilog and implemented using the Xilinx® Vivado™ design suite [64].

**Figure 10.** VELCRO compression implementation on the Xilinx® Alveo™ U280 Accelerator Card.

Figure 11 presents the normalized throughput and energy consumption of a single module instance, denoted conv2d, which consists of the hardware implementation of a two-dimensional convolution layer and ReLU activation. As expected, the computational throughput of the conv2d layer, measured as the number of conv2d operations per second, grows proportionally to 1/(1 − C), where C is the compression-saving ratio. In addition, the energy consumption related to the computation of a single conv2d layer decays linearly with the compression-saving ratio. Thereby, for the compression-saving results presented in Table 2, VELCRO can achieve 13.5–30% energy savings while maintaining the same accuracy as the uncompressed model.

**Figure 11.** VELCRO throughput and energy consumption on the Xilinx® Alveo™ U280 Accelerator Card.
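The throughput and energy relations above can be checked numerically. This short sketch (our own; the function names are illustrative) projects the throughput gain 1/(1 − C) and the linear energy reduction for compression-saving ratios in the range reported in Table 2:

```python
def projected_throughput_gain(c):
    """Throughput grows as 1/(1 - C): a fraction C of the MACs is skipped."""
    return 1.0 / (1.0 - c)

def projected_energy_fraction(c):
    """Energy decays linearly with C: only the uncompressed fraction is computed."""
    return 1.0 - c

# Compression-saving ratios spanning the range reported in Table 2
for c in (0.135, 0.20, 0.30):
    print(f"C={c:.3f}: throughput x{projected_throughput_gain(c):.2f}, "
          f"energy {projected_energy_fraction(c):.1%} of baseline")
```

For C = 0.30 this gives roughly a 1.43x throughput gain and 70% of the baseline energy, consistent with the 13.5–30% energy savings quoted above.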

#### **5. Conclusions**

We presented herein the value-locality-based compression (VELCRO) algorithm, a compression approach for general-purpose deep neural networks deployed for a small subset of specialized tasks. We introduced the notion of value locality in the context of neural networks for specialized tasks and showed that CNNs used for specialized tasks produce a high degree of value locality. An analysis of the experimental results indicates that VELCRO leverages value locality to compress the network and thereby saves up to 30% of the computations in ResNet-18 and GoogLeNet and up to 20% in MobileNet V2. The analysis also indicates that, for specialized tasks, VELCRO significantly improves the accuracy, by 2–20%, when given a relatively small compression-saving target. Finally, a major advantage of VELCRO is that it offers a fast compression process based on inference rather than backpropagation training, thereby liberating VELCRO from a significant computational load. We demonstrated the feasibility of VELCRO by implementing the algorithm in hardware on the Xilinx® Alveo™ U280 Data Center accelerator card. Our hardware implementation indicates that VELCRO translates the computation compression into energy consumption savings of 13.5–30%, corresponding to the compression-saving ratio.

**Author Contributions:** Conceptualization, F.G.; methodology, F.G. and G.S.; software, G.S. and F.G.; validation, F.G. and G.S.; formal analysis, F.G. and G.S.; investigation, F.G. and G.S.; resources, F.G. and G.S.; data curation, F.G. and G.S.; writing—original draft preparation, F.G. and G.S.; writing review and editing, F.G. and G.S.; visualization, F.G. and G.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The ImageNet data sets used in our experiments are publicly available at https://image-net.org (accessed on 11 March 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Figure A1.** Distribution of ResNet-18 variance tensor elements for specialized tasks: all ImageNet classes, Cats-2, Cats-3, and Cats-4.


**Figure A2.** Distribution of GoogLeNet variance tensor elements for specialized tasks: all ImageNet classes, Cats-2, Cats-3, and Cats-4.

**Figure A3.** Distribution of MobileNet V2 variance tensor elements for specialized tasks: all ImageNet classes, Cats-2, Cats-3, and Cats-4.

**Figure A4.** Distribution of ResNet-18 variance tensor elements for specialized tasks: Cats, Dogs, Cars, and all ImageNet classes.

**Figure A5.** Distribution of GoogLeNet variance tensor elements for specialized tasks: Cats, Dogs, Cars, and all ImageNet classes.

**Figure A6.** Distribution of MobileNet V2 variance tensor elements for specialized tasks: Cats, Dogs, Cars, and all ImageNet classes.

#### **Appendix B**


**Table A1.** Threshold tuple for Cats.

**Table A2.** Threshold tuple for Dogs.



**Table A3.** Threshold tuple for Cars.



#### **References**


## *Article* **Early Prediction of DNN Activation Using Hierarchical Computations**

**Bharathwaj Suresh 1,\*, Kamlesh Pillai 1,\*, Gurpreet Singh Kalsi 1, Avishaii Abuhatzera <sup>2</sup> and Sreenivas Subramoney <sup>1</sup>**


**Abstract:** Deep Neural Networks (DNNs) have set state-of-the-art performance numbers in diverse fields such as electronics (computer vision, voice recognition), biology, and bioinformatics. However, both learning from data (training) and applying the learnt information (inference) require huge computational resources. Approximate computing is a common method to reduce computation cost, but it introduces a loss in task accuracy, which limits its applicability. Using an inherent property of the Rectified Linear Unit (ReLU), a popular activation function, we propose a mathematical model to perform the MAC operation at reduced precision in order to predict negative values early. We also propose a method to perform hierarchical computation that achieves the same results as full-precision IEEE-754 compute. Applying this method to ResNet50 and VGG16 shows that up to 80% of ReLU zeros (which is 50% of all ReLU outputs) can be predicted and detected early using just 3 out of 23 mantissa bits. The method is equally applicable to other floating-point representations.

**Keywords:** DNN; ReLU; floating-point numbers; hardware acceleration

#### **1. Introduction**

Ever since its inception, deep learning has evolved into one of the most widely used techniques to solve problems in the areas of speech recognition [1], pattern recognition [2], and natural language processing [1]. The effectiveness of Deep Neural Networks (DNNs) is pronounced when there is a huge amount of data with minimal features that are not easily apparent to humans [2]. This makes DNNs valuable tools to meet future data-processing needs. However, producing accurate results using a large dataset comes at a cost. DNN inference requires a huge amount of computing power and, as a result, consumes a large amount of energy. In a study by Strubell et al., it was estimated that training a single deep learning model can emit the same amount of CO2 as five cars do throughout their lifetimes [3]. Due to this fact, optimizing DNN implementations has become an urgent requirement and has been receiving widespread attention from the research community [4–6].

In their basic form, DNNs consist of simple mathematical operations like addition and multiplication, which are combined to form the multiply-and-accumulate (MAC) operation. In fact, up to 95% of the computational workload of a DNN is due to MAC operations [7]. In a typical DNN, about a billion MAC operations are required to process each input sample [8]. This fact suggests that improving the efficiency of the MAC operations would contribute significantly towards reducing the computational requirements of DNNs. One way to do this is to reduce the number of bits used to perform the MAC operations, an idea that has been widely explored in the field of approximate computing [9]. Some studies have shown that using approximate computing techniques for DNN implementation can reduce power consumption by as much as 88% [10]. However,

**Citation:** Suresh, B.; Pillai, K.; Kalsi, G.S.; Abuhatzera, A.; Subramoney, S. Early Prediction of DNN Activation Using Hierarchical Computations. *Mathematics* **2021**, *9*, 3130. https:// doi.org/10.3390/math9233130

Academic Editor: Ezequiel López-Rubio

Received: 11 October 2021 Accepted: 1 December 2021 Published: 4 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

the majority of approximate computing techniques result in a decrease in accuracy, which may not be acceptable for some applications. In particular, the training of DNNs, which can take many days even using GPUs, requires high-precision floating-point values to achieve the best results [11]. Hence, it is important to come up with methods that make the computation of DNNs more efficient without reducing the accuracy of the output.

A typical DNN consists of many convolution and fully-connected layers. Each of these layers performs a MAC operation on the input using weights that are trained to generate a unique feature representation as the output [1]. Many such layers placed in succession can be used to approximate a target function. While the convolution and fully-connected layers alone are sufficient to represent linear functions, they cannot be used directly for applications that need nonlinear representations. To introduce nonlinearity into the model, the outputs of the convolution and fully-connected layers are passed through a nonlinear operator called an activation function [12]. As every output value is required to pass through an activation function, choosing the right activation function is an important factor in the effectiveness of DNNs [13].

One of the most widely used activation functions is the Rectified Linear Unit (ReLU) [14]. The simple, piecewise-linear nature of ReLU can enable faster learning and maintain stable values when gradient-descent methods are used [12]. The output of a ReLU function is the same as the input when the input is positive, and is zero for negative inputs. This means that the precision of the output is important only when the input is a positive value. The input to a ReLU function is usually the output of a fully-connected or convolution layer of the DNN, which consists of a large number of MAC operations [8]. Studies have found that between 50% and 95% of ReLU outputs in DNNs are zero [15]. Hence, a lot of high-precision compute in DNNs is wasted on output elements that are reduced to zero by the ReLU function. Early detection of these negative values can reduce the energy spent on high-precision MAC operations, which would ultimately result in an efficient DNN implementation.
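The idea can be sketched as follows. This is our own simplified illustration, not the authors' exact model: accumulate a MAC using operands whose mantissas are truncated to a few bits, bound the worst-case truncation error, and declare the ReLU output zero whenever even the most optimistic full-precision result would still be negative. The helper names and the (conservative) error bound are ours:

```python
import struct

def truncate_mantissa(x, keep_bits):
    """Zero out all but the top `keep_bits` mantissa bits of a float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    (y,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return y

def relu_zero_guaranteed(inputs, weights, keep_bits=3):
    """Return True when the MAC output is provably negative at reduced
    precision, so ReLU(output) == 0 without a full-precision compute."""
    approx = 0.0
    err_bound = 0.0
    eps = 2.0 ** -keep_bits  # relative truncation error of each operand
    for a, w in zip(inputs, weights):
        ta = truncate_mantissa(a, keep_bits)
        tw = truncate_mantissa(w, keep_bits)
        approx += ta * tw
        # |a*w - ta*tw| <= |a*w| * (2*eps + eps^2); for illustration we bound
        # with the full-precision magnitude (hardware would use an
        # exponent-based bound instead).
        err_bound += abs(a * w) * (2 * eps + eps * eps)
    # If even the upper end of the uncertainty interval is negative,
    # the full-precision MAC is guaranteed to be negative.
    return approx + err_bound < 0.0
```

When `relu_zero_guaranteed` returns True, the remaining mantissa bits never need to be computed; when it returns False, the MAC falls back to full precision, so accuracy is unaffected.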

To this end, our work proposes a method for the early detection of negative input values to the ReLU function that accounts for the maximum possible error while performing the MAC at reduced precision. Using these values, we develop a mathematical model that provides a threshold below which a negative output value is guaranteed, irrespective of the remaining bits to be computed. We show that the proposed model can detect up to 80% of negative values for popular CNN models using just three mantissa bits of the floating-point representation. This mathematical model can serve as the basis for implementing low-precision MAC operations in DNNs that adopt ReLU functions, resulting in an efficient DNN implementation without a loss in accuracy. In summary, our contributions are threefold:


#### **2. Literature Review**

The training and inference of DNNs are compute-intensive tasks and have given rise to various hardware accelerators [18–23]. Memory performance can be optimized through data locality by maximizing the reuse of data in buffers close to the compute block, as shown by Chen et al. [20]. A bit-serial approach was considered by Judd et al. to reduce the overall computation required by reducing activation precision [21]. Unnecessary multiplications with zero values were eliminated in Cnvlutin, which resulted in improved performance [19]. TETRIS used a high-bandwidth 3D memory, which led to a reduced internal buffer size, to overcome the memory bottleneck [23]. Pruning techniques have also been studied to maximize compute savings by exploiting the sparsity in DNNs [24]. However, these methods are typically used for very specific applications and are expensive to generalize.

Approximate computing has emerged as one of the most effective solutions for generic DNNs: it can exploit the inherent resilience of a CNN model (i.e., its ability to handle variations in data and still recognize the pattern) to reduce computation costs [25]. Because the level of approximation can be varied for different DNN models and datasets, approximate computing has gained popularity [9]. This has led many researchers to investigate methods to perform low-precision computations in DNNs [26–33].

One of the most commonly applied techniques is quantization, which is the process of replacing floating-point numbers with numbers of reduced bit width. A study by Gupta et al. [26] demonstrated DNN training using 16-bit-wide floating-point numbers with a very small reduction in accuracy compared to 32-bit floating-point numbers. Another study explored the effect of variable precision across different CNN layers and demonstrated accuracy close to the benchmarks [27]. Venkatesh et al. studied the possibility of using 2-bit weights and sparse computing methods to produce state-of-the-art results; the study employed a few iterations of full-precision training, followed by reduced-precision training and inference [30]. A study on compute-complexity reduction using a 1D kernel factorized network is presented in [34].

Another approach in approximate computing is the use of multipliers and adders that compute results in a simplified manner. The work by Sarwar et al. [29] highlighted the use of simplified add and shift operations for power savings in DNNs. Another study explored the use of alternate full-adder implementations for efficient CNN hardware [28]. Stochastic-computing-based circuits have also been studied as potential candidates for implementing a low-power DNN hardware accelerator [32].

Approximate computing has also been pursued at the software level, by simplifying DNN architectures to reduce compute. Pruning the synaptic weights, reducing the bit width of the synapses, and minimizing the number of hidden layers or neurons within these layers were demonstrated as effective methods to develop energy-efficient DNNs [29]. Wei et al. came up with a more structured approach, with pattern-based weight pruning for real-time DNN execution [33].

While all these studies have highlighted the relevance and requirements of approximate computing, they also noted that it comes at the cost of reduced accuracy. However, DNNs often require high-precision floating-point values during training to achieve high accuracy and reduced training time [35–37]. A reduction in accuracy may be unacceptable in real-life applications like self-driving cars [38] or medical diagnosis [39,40], where errors could be life-threatening; hence, most commercial DNNs still use floating-point precision in their computations [41,42]. It is therefore important to come up with a method to perform low-precision computations in DNNs without reducing the accuracy of the model.

Shormann et al. proposed a method to reduce convolution operations in CNNs by dynamically predicting zero-valued outputs [43]. SnaPEA performs a reordering of weights and keeps track of the partial sum to predict zero outputs early [44]. A similar method was employed by Asadikouhanjani et al. to propose an efficient DNN accelerator [45]. By considering the spatial surroundings of an output feature map, Mosaic-CNN performs reduced precision compute to predict zero values early [46]. Other studies have explored methods to predict the zero values in an output feature map using the sign values [47–49]. Our study attempts further research in this direction by proposing a novel method to predict ReLU zeros with reduced precision compute.

#### **3. Background**

#### *3.1. Convolutional Neural Networks*

Among the different types of DNNs, Convolutional Neural Networks (CNNs) are extensively used in image processing, computer vision, and speech processing applications, often resulting in superior performance [50]. The convolution layer, which converts the input image into a form that is easier to process by the next layer, is at the heart of a CNN. Convolution is the application of a filter to an input to produce an output feature map that indicates a detected feature in the input data. Both the input values and the filter values are represented as matrices, with the filter dimensions typically being much smaller than those of the input. The values in the filter matrix are multiplied with the corresponding values in the input matrix, and the products are added to produce a single output value. This MAC operation is repeated by shifting the filter by a fixed amount each time, resulting in an output feature map. The number of element shifts of the weight matrix over the input matrix is called the stride. This convolution process is demonstrated in Figure 1: each term from the input window is multiplied with the corresponding term in the filter matrix, and these values are added together (accumulate) to generate one value in the output feature map. This process is repeated by moving the filter matrix across the input matrix until it has been traversed completely. Once the output feature map is generated, it is passed through an activation function (like ReLU) to introduce nonlinearity.

**Figure 1.** Example of the convolution operation. In this example, the stride is assumed to be 1. A 5 × 5 output is produced from the 7 × 7 input when a 3 × 3 weight matrix is considered.
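The sliding-window MAC described above can be sketched directly (a minimal NumPy sketch of a "valid"-padding 2D convolution with a configurable stride; the function and variable names are ours):

```python
import numpy as np

def conv2d(inp, weight, stride=1):
    """'Valid' 2D convolution: slide the filter over the input and, at each
    position, multiply element-wise and accumulate into one output value."""
    kh, kw = weight.shape
    oh = (inp.shape[0] - kh) // stride + 1
    ow = (inp.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = inp[i * stride:i * stride + kh,
                         j * stride:j * stride + kw]
            out[i, j] = np.sum(window * weight)  # one MAC per output element
    return out
```

With a 7 × 7 input, a 3 × 3 filter, and stride 1, this yields a 5 × 5 output feature map, matching the example in Figure 1.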

#### *3.2. ReLU Activation Function*

The ReLU activation function is one of the most popular activation functions used in DNNs today [14]. The function returns zero for all negative inputs, and returns the input if it is a non-negative value. It can be written as:

$$f(\mathbf{x}) = \max(0, \mathbf{x}) \tag{1}$$

where max returns the larger of the two inputs. The graphical representation is shown in Figure 2. The success of ReLU can be attributed to its simple implementation, which in turn reduces the computation time of the DNN model [51]. In addition, a majority of the ReLU outputs are zero [15], which makes the output matrix sparse and results in better prediction and reduced chances of overfitting [52]. Both the ReLU function and its derivative are monotonic, which ensures that the vanishing-gradient problem is avoided when the gradient-descent training process is employed [53]. These factors have contributed to the widespread use of the ReLU activation function in DNNs. Hence, the study of the ReLU activation function is important for implementing DNNs more efficiently.

**Figure 2.** Graphical representation of the ReLU function. If *x* is the input and *y* is the output, then *y* = 0 for *x* < 0, and *y* = *x* for *x* ≥ 0.

#### *3.3. Floating Point Number Representation*

In any typical DNN, the input, output, and intermediate values are stored in floating-point format. The standard used in a majority of applications is the IEEE-754 floating-point format [54]. In this format, the Most Significant Bit (MSB) is the Sign bit (S), which is 0 for positive numbers and 1 for negative numbers. It is followed by a fixed number of bits assigned to store the Exponent E, and the remaining bits are allotted to the Mantissa M. The fractional part is stored in normalized form, i.e., the actual value in binary is 1 plus the fractional value represented by M. In order to accommodate negative exponents, a bias of 127 is added to the exponent (excess-127 representation); hence, the actual exponent is E − 127. Based on these rules, the floating-point value represented by the *S*, *E*, and *M* fields in the IEEE-754 format is:

$$F = (-1)^S \times 2^{(E-127)} \times (1+M) \tag{2}$$

The two commonly used forms of the IEEE-754 format are the single- and double-precision formats. In the single-precision representation, there are 8 exponent bits and 23 mantissa bits, for a total of 32 bits. Double precision is a 64-bit representation with 11 exponent bits and 52 mantissa bits [54]. Figure 3 graphically depicts both the single- and double-precision representations.
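Equation (2) can be verified directly by unpacking the bit fields of a float32 (a small sketch using Python's standard library; the function name is ours, and it handles normal numbers only):

```python
import struct

def decode_float32(x):
    """Split a float32 into (S, E, M) per IEEE-754 and rebuild it with Eq. (2)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    s = bits >> 31                    # 1 sign bit
    e = (bits >> 23) & 0xFF           # 8 exponent bits (excess-127)
    m = (bits & 0x7FFFFF) / 2 ** 23   # 23 mantissa bits as a fraction
    value = (-1) ** s * 2.0 ** (e - 127) * (1 + m)  # Eq. (2), normal numbers
    return s, e, m, value

s, e, m, v = decode_float32(-6.5)
# -6.5 = (-1)^1 x 2^2 x 1.625, so s = 1, e = 129 (= 2 + 127), m = 0.625
```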

**Figure 3.** IEEE-754 floating-point representation [54]. The total bits are divided into sign, exponent, and mantissa. The single-precision format has 1 sign bit, 8 exponent bits, and 23 mantissa bits, while the double-precision format has 1 sign bit, 11 exponent bits, and 52 mantissa bits.

#### **4. Methodology**

#### *4.1. Dataset and Framework*

As image recognition is one of the most widely used and researched applications of CNNs, we focus our analysis on models within this domain. VGG-16 is one of the pioneering CNN models for large-scale image recognition tasks [16]. It takes a 224 × 224 RGB image as input and passes it through different convolution, max-pooling, and fully-connected layers. The final classification is implemented using a softmax layer. Figure 4 describes the VGG-16 architecture. As evident from the figure, there are 13 convolution layers, each followed by a ReLU activation layer. Sets of convolution layers are followed by pooling layers to reduce the dimensions of the input before sending it to the next convolution set. Finally, a set of fully-connected layers is added to produce the output classification probabilities.

As DNNs like VGG-16 became difficult to train, Residual Networks (ResNets) emerged as improved alternatives. In ResNets, shortcut (or identity) connections were introduced between different layers to perform identity mapping with no additional model parameters [17]. One such ResNet model is ResNet-50, which has 50 different convolution and fully-connected layers along the path from input to output. Like VGG-16, ResNet-50 also takes 224 × 224 RGB images as its input. The ResNet-50 architecture is shown in Figure 5. A convolution operation is applied to the input, and the layer size is reduced before it is sent to the residual layers. Each of the residual layers comprises three sets, each with a convolution layer followed by a ReLU activation layer. Before the last ReLU operation, an identity connection is added to train identity mappings in some of the layers. The Res 2–1, Res 3–1, and Res 4–1 groups shown in Figure 5 have a convolution layer in the identity path. These residual layers are followed by a pooling and a fully-connected layer, which give the classification probabilities as the output.

**Figure 4.** VGG-16 CNN architecture. There are 16 computation layers (13 convolution −3 × 3 kernel and three fully connected layers without dropout). Pooling layers are present in the intermediate stages to reduce the layer size as the network gets deeper. Regularization, normalization, and other layers may be present but have not been shown in this figure for simplicity.

**Figure 5.** ResNet-50 architecture. There are 50 computation layers (excluding convolution layers in the identity path) between the input and output. This includes 49 convolution layers and the fully-connected layer at the end. Res 2–1 (conv with 1 × 1, 64; 3 × 3, 64; 1 × 1, 256), Res 3–1 (1 × 1, 128; 3 × 3, 128; 1 × 1, 512) and Res 4–1 (1 × 1, 256; 3 × 3, 256; 1 × 1, 1024) are shown with a dotted boundary to indicate that they include a convolution layer along their identity path (also shown with a dotted boundary in the elaboration below). Regularization, normalization, and other layers may be present but have not been shown in this figure for simplicity.

These models were tested using the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) inference dataset, which includes 50,000 images belonging to 1000 different classes [55]. These images were converted to the 224 × 224 RGB format, and the pixel values were normalized. To ensure that the training methods were standard, the pretrained models of ResNet-50 and VGG-16 were used from the Keras library [56] running on top of the TensorFlow [57] backend.
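The image preparation step can be sketched with NumPy as follows; the per-channel mean values are the ImageNet means used by the 'caffe'-style `preprocess_input` that the pretrained Keras weights expect, and are stated here as an assumption rather than taken from the paper:

```python
import numpy as np

# Per-channel ImageNet means (BGR order) assumed for the pretrained weights.
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess(image_rgb):
    """Convert a 224x224x3 RGB uint8 image into the zero-centered BGR float
    batch that the pretrained VGG-16/ResNet-50 models consume."""
    x = image_rgb.astype(np.float32)[..., ::-1]   # RGB -> BGR channel flip
    x -= IMAGENET_MEAN_BGR                        # zero-center per channel
    return x[np.newaxis, ...]                     # add the batch dimension

img = np.zeros((224, 224, 3), dtype=np.uint8)
batch = preprocess(img)
assert batch.shape == (1, 224, 224, 3)
```

In practice the Keras `preprocess_input` helper of each model performs this step, so the exact constants above should be treated as illustrative.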

#### *4.2. Proposed Hierarchical Computation*

It is evident that each convolution layer involves MAC operations between the input values and a filter with the trained weight values. The result of this MAC operation is passed through the ReLU activation function, and negative values are set to zero. Our implementation includes an intermediate step that predicts negative values early using a reduced number of mantissa bits. The MAC operation is performed using reduced mantissa bits, and the output is obtained. Then, based on the number of mantissa bits used for the computation, the proposed model predicts whether the estimated value is definitely negative. If the value is determined to be negative, the output is set to zero. For the other values, we perform the MAC using the full precision and obtain the output as in typical implementations. Hence, we reduce the total number of cases for which the expensive full-precision compute must be performed, while simultaneously ensuring no loss in accuracy.

The steps are described using a flowchart in Figure 6. In the case presented, for every workload, we first perform the computation without any mantissa bits. Since only exponent values are present, this can be achieved directly by adding the exponent bits. If the output can be predicted to be negative at this step, the output is set to zero and we move on to the next set of elements of the input workload. If inconclusive, the 8 MSB mantissa bits (bits 23 to 16) are considered for further computation. Once again, we predict the accumulated negative values and set those outputs to zero. For cases where the sign of the accumulated element is still ambiguous, the remaining mantissa bits (bits 15 to 0) are also used and the full-precision compute is performed. The remaining outputs are obtained after this step, and this whole cycle is repeated for the other input workloads. This way, the total compute can be split into multiple levels by adding additional mantissa bits at each level. At each level, some negative values can be detected with reduced-precision compute. At the same time, full-precision compute is performed for all positive outputs, ensuring no loss in accuracy. The levels of compute and the bits selected for each level can be determined based on the model, workload, and underlying compute hardware availability. The impact of the selected mantissa bits on correctly predicted negative values is described later in the Results section.
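The three-level flow above can be sketched in Python; `truncate_mantissa` is our software emulation of reduced-mantissa FP32, and the safety margin used at each level is illustrative (the exact detection condition is derived in Section 4.3):

```python
import struct

def truncate_mantissa(x, n):
    """Keep only the n most significant of the 23 FP32 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & ~((1 << (23 - n)) - 1)))[0]

def hierarchical_mac(inputs, weights, levels=(0, 8, 23)):
    """Three-level hierarchical compute: try 0, then 8, then all 23 mantissa
    bits; return (relu_output, mantissa_bits_used). The 2^(1-n) margin below
    is the worst-case per-term error bound of Theorem 2."""
    for n in levels:
        terms = [truncate_mantissa(i, n) * truncate_mantissa(w, n)
                 for i, w in zip(inputs, weights)]
        acc = sum(terms)                      # reduced-precision MAC result
        pos = sum(t for t in terms if t > 0)  # positive-term sum
        if acc < -(2.0 ** (1 - n)) * pos:     # definitely negative at level n
            return 0.0, n                     # ReLU zero, early exit
    full = sum(i * w for i, w in zip(inputs, weights))
    return max(full, 0.0), 23                 # full-precision fallback
```

For example, an all-negative accumulation such as `hierarchical_mac([1.0, 2.0], [-10.0, -10.0])` is resolved as a ReLU zero already at the exponent-only level (n = 0).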

Figure 7 intuitively describes the proposed hierarchical computation approach to estimate ReLU output with reduced precision compute. Here, the "Ideal" is the value that is computed with full precision, while "Reduced" is the output with only a few MSB mantissa bits considered. If "Reduced" is a large negative value, the output can be estimated to be negative irrespective of the mantissa bits. Our model detects such values until it reaches a threshold, where "Reduced" is negative but close to zero. To estimate these values correctly, more mantissa bits (next set of MSB bits) need to be considered. Similarly, when "Reduced" is a large positive value, it can be estimated to be positive without using the mantissa bits. However, as the value approaches zero, more mantissa bits are required to correctly estimate the sign of the element. A threshold is estimated along the positive axis too, beyond which values are always positive. Our model determines both the positive and negative thresholds, which gives rise to the region of interest where full precision compute needs to be performed, as shown in Figure 7. These thresholds are obtained by considering the maximum error contribution from each mantissa bit of a floating point number. As shown in Figure 7, the error is inversely proportional to the number of mantissa bits "*n*", which means that the region of interest gets smaller as the value of "*n*" increases.

The next section derives a mathematical model that can perform the ReLU checks shown in Figure 6, based on the intuitive model proposed in Figure 7. We use error calculations to prove that the model can determine ReLU zeros with no loss in accuracy.

**Figure 6.** Flow chart depicting the steps to perform hierarchical compute (three steps) and detect ReLU zeros with reduced precision. The first step is to perform the MAC using the exponent and predict the ReLU output; if undetermined, compute with the most significant 8 bits of the mantissa and check the ReLU output again; if still not conclusive, perform the compute using the remaining mantissa bits (every next step reuses previously computed values). Here, the red arrows depict writing to memory, and blue arrows indicate reading from memory. Black arrows indicate that the computation has been completed for the given input.

**Figure 7.** Intuition behind estimating ReLU zeros based on reduced precision compute. In the hierarchical compute method, the value of "*n*" (number of MSB mantissa bits) is increased at each step, resulting in a decrease in the region of interest, until only positive values are remaining.

#### *4.3. Mathematical Model*

In this section, the mathematical model of the proposed solution is presented. There are three theorems, which cover all scenarios of the proposed solution. Theorem 1 shows that errors introduced by the addition of positive values are the only ones that can change the sign of the result from positive to negative, which determines the threshold calculation. Theorem 2 derives the maximum error that is needed to detect negative values out of MAC operations, and Theorem 3 states the condition that must be satisfied to predict the ReLU output.

**Theorem 1.** *Let*

$$X_S(a) = \sum_{k=-n}^{n} \left( ifm(k) \times wt(a-k) \right) \tag{3}$$

*where a = number of terms involved in the convolution, ifm(k) and wt(k) are the input feature map and weight kernel in single precision floating point representation (FP32) with a reduced number of mantissa bits (number of mantissa bits after reduction = m).*

$$X_{SPOS}(n) = \sum_{k=0}^{n} \left( ifm(k) \times wt(n-k) \right) \tag{4}$$

*is responsible for converting a positive XS(n) with m = 23 (FP32) into a negative XS(n) with m < 23.*

**Proof of Theorem 1.** Let $X_S(b)$, where *b* < *a*, be a partial sum with *m* < 23. Let $X_{Ideal}$ and $X_{Reduced}$ be the values of the next term to be added to the convolution sum, with *m* = 23 and *m* < 23, respectively. When this term is added to the existing sum $X_S$, two different sums are obtained depending on the presence or absence of all mantissa bits. Let these be called $X_S^{Ideal}$ and $X_S^{Reduced}$, respectively. That is,

$$X_S^{Ideal} = X_S + X_{Ideal} \tag{5}$$

$$X_S^{Reduced} = X_S + X_{Reduced} \tag{6}$$

It is evident that reducing the number of mantissa bits in a floating point number results in a number having lower magnitude. However, the sign remains unaffected as the sign bit is unchanged. Hence, if

$$\begin{aligned} &X_{Ideal} < 0\\ &\implies X_{Reduced} > X_{Ideal}\\ &\implies X_S + X_{Reduced} > X_S + X_{Ideal} \end{aligned}$$

From (5) and (6), we have

$$X_S^{Reduced} > X_S^{Ideal} \tag{7}$$

From (7), it is evident that, if $X_S^{Reduced} < 0$, it can be concluded that $X_S^{Ideal} < 0$. In other words, the error due to the addition of a negative value cannot alter the sign of the sum from positive to negative. On the contrary, if

$$\begin{aligned} &X_{Ideal} > 0\\ &\implies X_{Reduced} < X_{Ideal}\\ &\implies X_S + X_{Reduced} < X_S + X_{Ideal} \end{aligned}$$

From (5) and (6), we have

$$X_S^{Reduced} < X_S^{Ideal} \tag{8}$$

In the case of (8), $X_S^{Reduced} < 0$ does not guarantee that $X_S^{Ideal} < 0$. Hence, errors due to the addition of positive values contribute towards a sign change from positive to negative, and are the ones that determine the threshold for concluding that the convolution sum is negative when a reduced mantissa is considered.
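The fact underlying this proof, that dropping mantissa bits shrinks a value's magnitude without touching its sign bit, can be verified numerically; `truncate_mantissa` below is our own software emulation of reduced-mantissa FP32:

```python
import struct

def truncate_mantissa(x, n=8):
    """Keep only the n most significant of the 23 FP32 mantissa bits; the
    sign and exponent fields are left untouched."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & ~((1 << (23 - n)) - 1)))[0]

# Truncation never increases the magnitude and never flips the sign.
for x in (1.9999999, -1.9999999, 6.5, -0.3333333):
    t = truncate_mantissa(x)
    assert abs(t) <= abs(x) and (t < 0) == (x < 0)

# Theorem 1's first case: adding a truncated NEGATIVE term yields a sum that
# is >= the ideal sum (Eq. (7)), so a negative reduced sum certifies that the
# ideal sum is negative as well.
x_s = 1.0
x_ideal = -1.9999999
assert x_s + truncate_mantissa(x_ideal) >= x_s + x_ideal
```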

**Theorem 2.** *If a positive term in the convolution sum is given by $C_{Mul} = 2^{E_{Mul}} \times M_{Mul}$, where $E_{Mul}$ and $M_{Mul}$ are the unbiased exponent and mantissa value of the term, the maximum error that is possible when the number of mantissa bits is reduced to n is given by $C_{ErrMax} = 2^{E_{Mul}-n+1} \times M_{Mul}$.*

**Proof of Theorem 2.** For any floating point number given by

$$N = (-1)^S \times 2^E \times M$$

where *S*, *E*, *M* represent the sign, unbiased exponent, and mantissa value, the maximum possible error when only *n* mantissa bits are included is given by

$$E_{Max} = -2^{(E-n)} \times (-1)^S \tag{9}$$

Consider an activation input (*I*) and weight (*W*) of a convolution layer. They are represented as

$$I = (-1)^{S_I} \times 2^{E_I} \times M_I \tag{10}$$

$$W = (-1)^{S_W} \times 2^{E_W} \times M_W \tag{11}$$

From (9), the most erroneous values that could result from reducing the number of mantissa bits to *n* in *I* (10) and *W* (11) are given by

$$I_{Reduced} = (-1)^{S_I} \times 2^{E_I} \times M_I - 2^{(E_I - n)} \times (-1)^{S_I} \tag{12}$$

$$W_{Reduced} = (-1)^{S_W} \times 2^{E_W} \times M_W - 2^{(E_W - n)} \times (-1)^{S_W} \tag{13}$$

The convolution term when *I* (10) and *W* (11) are multiplied is given by

$$C_{Ideal} = (-1)^{S_I + S_W} \times 2^{E_I + E_W} \times (M_I \times M_W) \tag{14}$$

With reduced mantissa in the convolution step, (12) and (13) give

$$\begin{split} C_{Reduced} &= I_{Reduced} \times W_{Reduced} \\ &= (-1)^{S_I + S_W} \times 2^{E_I + E_W} \times (M_I \times M_W) \\ &\quad - (-1)^{S_I + S_W} \times 2^{E_I + E_W - n} \times (M_I + M_W) \\ &\quad + (-1)^{S_I + S_W} \times 2^{E_I + E_W - 2n} \end{split}$$

Hence, for the positive terms considered in the theorem,

$$C_{Reduced} = 2^{E_I + E_W} \times \left( M_I \times M_W - 2^{-n} \times \left( M_I + M_W - 2^{-n} \right) \right) \tag{15}$$

The error in convolution terms due to reduced mantissa can be obtained from (14) and (15)

$$\begin{aligned} C_{Error} &= C_{Ideal} - C_{Reduced} \\ &= 2^{E_I + E_W - n} \times (M_I + M_W - 2^{-n}) \end{aligned}$$

As $2^{-n}$ is always positive,

$$C_{Error} \le 2^{E_I + E_W - n} \times (M_I + M_W) \tag{16}$$

Since *MI* and *MW* represent the mantissa values,

$$\begin{aligned} 1 &\le M_I, M_W \le 2\\ \implies M_I + M_W &\le 2 \times M_I \times M_W \end{aligned}$$

Hence, (16) can be rewritten as

$$\begin{aligned} C_{Error} &\le 2^{E_I + E_W - n} \times (2 \times M_I \times M_W) \\ &= 2^{E_I + E_W - n + 1} \times (M_I \times M_W) \end{aligned}$$

From (14), we get

$$C_{Error} \le 2^{-n+1} \times C_{Ideal} \tag{17}$$

It is evident from Theorem 1 that only positive terms contribute errors that can lead to incorrectly identifying a value as negative. Hence, $S_I + S_W$ is even, i.e., $(-1)^{S_I + S_W} = 1$ (either both *I* and *W* are positive, or both are negative). Including this in (14), we can rewrite $C_{Ideal}$ as

$$C_{Ideal} = 2^{E_{Mul}} \times M_{Mul} \tag{18}$$

where $E_{Mul} = E_I + E_W$ and $M_{Mul} = M_I \times M_W$. Hence, the maximum error in a positive term of the convolution sum is

$$C_{ErrMax} = 2^{E_{Mul} - n + 1} \times M_{Mul} \tag{19}$$

Hence, we obtain the maximum error, which is needed to detect negative values from a MAC operation.
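The bound in Eq. (17) can be spot-checked empirically for positive operands; `truncate_mantissa` is our software emulation of reduced-mantissa FP32, and the sampling range is arbitrary:

```python
import random
import struct

def truncate_mantissa(x, n):
    """Round x to FP32, then keep only the n most significant mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & ~((1 << (23 - n)) - 1)))[0]

# Empirical check of Eq. (17): for positive operands the truncation error of
# a product satisfies 0 <= C_Error <= 2^(-n+1) * C_Ideal.
random.seed(0)
for _ in range(1000):
    i = truncate_mantissa(random.uniform(0.01, 100.0), 23)  # exact FP32 input
    w = truncate_mantissa(random.uniform(0.01, 100.0), 23)  # exact FP32 weight
    for n in (0, 4, 8):
        c_ideal = i * w
        c_error = c_ideal - truncate_mantissa(i, n) * truncate_mantissa(w, n)
        assert 0.0 <= c_error <= 2.0 ** (-n + 1) * c_ideal
```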

**Theorem 3.** *If the convolution sum before the ReLU activation layer is given by $C_{Tot} = (-1)^{S_{Tot}} \times 2^{E_{Tot}} \times M_{Tot}$, and the sum of positive terms in the summation (including the bias value) is given by $C_{Pos} = 2^{E_{Pos}} \times M_{Pos}$, then the value of $C_{Tot}$ can be concluded to be negative if $S_{Tot} = 1$ and $E_{Tot} > E_{Pos} - n$, where n is the number of mantissa bits used in the computation.*

**Proof of Theorem 3.** Let the sum of all product terms in the convolution be given by

$$C_{Tot} = \sum_{i} (-1)^{S_i} \times 2^{E_i} \times M_i = (-1)^{S_{Tot}} \times 2^{E_{Tot}} \times M_{Tot} \tag{20}$$

From (19) in Theorem 2, the maximum error due to positive terms in the convolution is given by $C_{ErrMax}^i = 2^{E_i - n + 1} \times M_i$. Hence, when these errors are accumulated over all positive terms (including the bias), we get

$$C_{ErrTot} = \sum_{i:S_i=0} C_{ErrMax}^i = \sum_{i:S_i=0} 2^{E_i - n + 1} \times M_i \tag{21}$$

Note that, unlike other terms in the convolution sum, the bias does not involve multiplication of reduced mantissa numbers. Hence, the maximum error for bias values will be lower. However, the same error has been considered (as an upper bound) to simplify calculations. We can represent the sum of positive terms (including bias) in the convolution sum as

$$C_{Pos} = \sum_{i:S_i = 0} 2^{E_i} \times M_i = 2^{E_{Pos}} \times M_{Pos} \tag{22}$$

Using (22), the total error in (21) can be rewritten as

$$C_{ErrTot} = 2^{-n+1} \times C_{Pos} \tag{23}$$

To conclude that a convolution sum is zero/negative, the following two conditions should hold:

$$|C_{Tot}| \ge C_{ErrTot} \tag{24}$$

$$S_{Tot} = 1 \tag{25}$$

(24) can be expanded using (20), (22), and (23) to give

$$2^{E_{Tot}} \times M_{Tot} \ge 2^{E_{Pos} - n + 1} \times M_{Pos} \tag{26}$$

Note that, if $E_{Tot} = E_{Pos} - n + 1$, the condition $M_{Tot} \ge M_{Pos}$ must also hold for (26) to be satisfied, while for any larger exponent (26) holds regardless of the mantissa values. As a consequence, (26) now becomes a condition on the exponents:

$$E_{Tot} \ge E_{Pos} - n + 1 \tag{27}$$

$$\implies E_{Tot} > E_{Pos} - n \tag{28}$$

Hence, from (25) and (28), we can conclusively say that a convolution sum computed using reduced mantissa bits is negative (and, consequently, that its ReLU output is zero) if $S_{Tot} = 1$ and $E_{Tot} > E_{Pos} - n$.

#### *4.4. Early Negative Value Prediction*

The theorems derived above can be used to implement the proposed model for hierarchical computation. The steps to find out if a reduced precision value is a ReLU zero can be represented as an algorithm, as shown here:
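Based on Theorems 1 to 3, the check can be sketched in Python as follows; the helper names are ours, and `truncate_mantissa` emulates reduced-mantissa FP32 in software:

```python
import math
import struct

def truncate_mantissa(x, n):
    """Keep only the n most significant of the 23 FP32 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & ~((1 << (23 - n)) - 1)))[0]

def is_relu_zero(inputs, weights, bias, n):
    """Certify a ReLU zero from a reduced-precision MAC (Theorem 3):
    the output is definitely negative iff S_Tot = 1 and E_Tot > E_Pos - n."""
    terms = [truncate_mantissa(i, n) * truncate_mantissa(w, n)
             for i, w in zip(inputs, weights)]
    c_tot = sum(terms) + bias                         # reduced-precision sum
    c_pos = sum(t for t in terms if t > 0) + max(bias, 0.0)
    if c_tot >= 0:
        return False   # S_Tot = 0: cannot certify a negative result
    if c_pos == 0.0:
        return True    # no positive terms, so no error can flip the sign (Theorem 1)
    e_tot = math.frexp(c_tot)[1] - 1   # unbiased exponent of |C_Tot|
    e_pos = math.frexp(c_pos)[1] - 1   # unbiased exponent of C_Pos
    return e_tot > e_pos - n
```

For example, with n = 0, inputs [1.0, 1.0], and weights [1.0, −100.0], the reduced sum is −63 with $E_{Tot} = 5$ and $E_{Pos} = 0$, so the output is certified as a ReLU zero without full-precision compute.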


#### **5. Results**

In order to motivate the use of the hierarchical compute method to detect ReLU zeros early, we first identify the number of ReLU zeros that are present when a typical image is processed using the ResNet-50 and VGG-16 CNN models; the findings are shown in Figure 8. It is evident from the figure that, in a majority of the layers, more than 50% of the ReLU outputs are zero, with many of the deeper layers having up to 90% ReLU zeros. Considering all the layers, we found that, on average, 61.77% of the ReLU outputs were zeros in VGG-16, while 61.24% ReLU zeros were seen for ResNet-50. These results indicate that a large portion of compute is wasted on computing ReLU zeros, which can be avoided using the proposed method.
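The statistic above can be sketched as a simple measurement over a layer's pre-activation feature map; the Gaussian stand-in data below is ours, and real statistics come from running images through the network:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for one layer's pre-activation feature map; a real measurement
# would use the convolution output of an actual image instead.
pre_activation = rng.normal(loc=-0.3, scale=1.0, size=(56, 56, 256))

# ReLU zeros are exactly the non-positive pre-activation values.
relu_zero_pct = 100.0 * np.mean(pre_activation <= 0)
print(f"{relu_zero_pct:.1f}% of ReLU outputs are zero")
```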

**Figure 8.** Percentage of ReLU zeros present in (**a**) VGG16; (**b**) ResNet50 when a typical image is processed through the models. Only a few layers of ResNet-50 are shown for clarity—a similar trend is observed in all the layers.

In addition to the percentage of ReLU zeros, it is also important to understand the distribution of values seen by the ReLU layer. The results from ResNet-50 layers are shown in Figure 9. A total of 10 bins were chosen: values below −8, −8 to −4, −4 to −2, −2 to −1, −1 to 0, 0 to 1, 1 to 2, 2 to 4, 4 to 8, and values above 8. We notice that, in all layers, about 50% of the values fall between −1 and 1, and more than 80% between −2 and 2. This implies that the majority of the values are close to zero. As a result, it is not practical to use a fixed threshold value along with reduced-precision compute. A large negative threshold (say −2) can ensure that a value computed with reduced precision will have the correct sign. However, we can see from the distribution that only a few values (under 20%) can be detected with such a fixed threshold. If the threshold is pushed closer to zero, the chances of incorrectly detecting ReLU zeros increase. This study demonstrates the importance of the variable threshold derived using our model.
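The ten-bin tally described above can be reproduced with `numpy.histogram`; the standard-normal sample below is only a stand-in for real ReLU inputs:

```python
import numpy as np

# The ten bin edges used for the ReLU-input distribution.
EDGES = [-np.inf, -8, -4, -2, -1, 0, 1, 2, 4, 8, np.inf]

def relu_input_histogram(values):
    """Percentage of pre-ReLU values falling into each of the ten bins."""
    counts, _ = np.histogram(values, bins=EDGES)
    return 100.0 * counts / counts.sum()

rng = np.random.default_rng(0)
pct = relu_input_histogram(rng.normal(0.0, 1.0, 100_000))
assert abs(pct.sum() - 100.0) < 1e-9   # the bins cover the whole real line
```

With this stand-in distribution, most of the mass falls in the central bins, mirroring the concentration near zero observed in the paper.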

**Figure 9.** Distribution of ReLU inputs in different layers of ResNet-50. Here, Val is the input to the ReLU function. A total of 10 bins have been considered, and the range of each bin is mentioned in the figure. The different layers shown in the figure are: (**a**) the first convolution layer from the input image; (**b**) the first convolution layer in the Res 1-1 block; (**c**) the first convolution layer in the Res 2-1 block; (**d**) the first convolution layer in the Res 3-1 block.

The proposed model was tested by evaluating the ReLU output values at different layers of both the VGG-16 and ResNet-50 CNN implementations. This was done by checking the outputs of the convolution layer against (25) and (28). The total number of negative values detected using our model was noted and compared with the total number of output values to provide the percentage of negative values that are detected early. This was repeated for different layers, with different numbers of mantissa bits. Figure 10 shows the percentage of ReLU zeros detected by our model across different layers of ResNet-50 with different numbers of mantissa bits. It is evident that, as the number of mantissa bits considered increases, our model is able to detect the majority of ReLU zeros in all layers.

**Figure 10.** Percentage of ReLU values detected using our model across different ResNet-50 layers. The first 33 convolution layers are shown in the figure. The number of MSB mantissa bits used were (**a**) 0; (**b**) 1; (**c**) 2; and (**d**) 3.

To get a closer look at the impact of increasing the number of mantissa bits, we plotted the percentage of ReLU zeros detected with 0, 1, 2, and 3 mantissa bits for randomly chosen layers in VGG-16 and ResNet-50. This is shown in Figure 11. As expected, the fraction of negative values detected increases as the number of mantissa bits used for computation is increased. Close to 80% of negative values can be detected early using just three mantissa bits, which can result in a significant increase in the efficiency of the network. Due to the nature of weights, range of values, and so on, we observe that the results across different layers vary. However, as seen in Figure 10, the amount of variation decreases as we use more mantissa bits. Additionally, we note similar effectiveness of our model for both VGG-16 and ResNet-50, which shows that the model does not depend on the type of CNN implementation—it works based on the fundamental characteristics of MAC operations and floating-point numbers, which makes it a generalized solution for any CNN layer with a ReLU activation function.

From the results presented, we see that about 60% of the outputs of the ReLU activation function are zero values in CNNs like VGG-16 and ResNet-50. If three mantissa bits are used for computation and our model is deployed, 80% of these ReLU zeros can be detected. Hence, we can expect about 50% of all ReLU outputs to be estimated early. This way, almost half of the total computations can be carried out in low precision and the other half can be computed in full precision, while ensuring no loss in accuracy.
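The headline figure of roughly 50% follows from simple arithmetic on the two reported measurements:

```python
# Back-of-the-envelope estimate from the reported measurements:
relu_zero_fraction = 0.6177   # share of ReLU outputs that are zero (VGG-16)
detected_fraction = 0.80      # share of those detected with 3 mantissa bits

early_exit_fraction = relu_zero_fraction * detected_fraction
print(f"{early_exit_fraction:.1%} of all ReLU outputs can be resolved early")
```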

**Figure 11.** Percentage of ReLU zeros identified by our model when different mantissa bits were considered. The figure shows the results in (**a**) Conv 1-2; (**b**) Conv 3-3; and (**c**) Conv 5-3 layers of VGG-16, and (**d**) second convolution layer of Res 1-1 block; (**e**) first convolution layer of Res 2-4 block; and (**f**) third convolution layer of Res 3-5 block. Similar results were observed in other layers of both ResNet-50 and VGG-16.

#### **6. Discussion**

#### *6.1. Generalization to Other DNNs*

The results presented in this work utilize CNNs as the end application due to their ubiquitous nature and applicability to various fields. However, the model we have proposed is built on fundamental properties of floating-point numbers, MAC operations, and the ReLU activation function. Hence, the model can be extended to other applications too. When there are no negative values in the whole process, the algorithm will not predict any outcome and will pass all MAC outputs through as valid. However, the computations performed during prediction are not wasted: they are reused as partial products when computing the actual output with the remaining mantissa bits. For cases that are slightly uncertain, the algorithm conservatively treats the value as positive (no approximation) and passes it on, so that it always goes through the full-precision compute. Hence, no accuracy drop is expected with the use of the proposed solution.

The proposed solution is applicable across various networks whose activation function displays a nonlinear behavior for either positive or negative inputs (but not both), while the output for the other side is zero or some constant value. To support activation functions like the sigmoid, the mathematical constraints would need to be re-derived to predict values within a bounded [negative, positive] range, while all values outside this range are set to a constant.

#### *6.2. Implementation on GPU/Other Accelerators*

This method is implementable on any compute engine that supports DNN workload. Since the proposed solution supports the reusing of partially computed elements that were used for early prediction, this will not impose a heavy tax on the existing hardware. On GPU and other accelerators, the proposed solution will need fine-tuning of data flow, data storage pattern and control logic, etc.

#### *6.3. Extension to Training*

The results presented in this work demonstrate the effectiveness of our model during the inference stage of a DNN. However, the process of training also involves the same set of steps, along with the additional step of adjusting the parameters. Hence, our model can be used in every layer with a ReLU activation layer. Since DNN training is a time-consuming and compute-intensive process, this method can provide a significant improvement. It is also noteworthy that, unlike inference, training must be done with high-precision values. As a result, many of the approximate computing methods that have been studied cannot be extended to training. However, since our method ensures that there is no loss in accuracy, it can be applied to training as well.

#### **7. Conclusions**

In this work, we proposed a mathematical model that can detect zero outputs of the ReLU activation function using low-precision MAC operations. Our model takes into account the error resulting from the reduction of the number of bits in a floating-point representation, and identifies values that would be negative even when full-precision compute is performed. Our model can adapt based on the number of mantissa bits considered in the computation, ensuring its suitability for different number formats used in DNNs. We show that around 80% of ReLU zeros can be detected using just three mantissa bits, which corresponds to a total of 50% of all ReLU outputs in VGG16 and ResNet50 CNN implementations. As the model is developed with no assumption about the nature of the network or the application, we claim that the model can be extended to all DNNs that use the ReLU activation function. In addition, as the MAC operation and the activation layer in DNN training is identical to inference, this model can be adopted to make the compute-hungry training process more efficient. We also propose a system level model to implement this method and perform hardware acceleration of DNNs. The widespread use of DNNs with the ReLU activation function means that our model can be used as an error-free way to reduce computations in numerous applications.

**Author Contributions:** Conceptualization, B.S., K.P. and G.S.K.; writing—original draft preparation, B.S.; writing—review and editing, K.P. and G.S.; supervision, A.A. and S.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The dataset used for this study is publicly available and can be downloaded at https://www.image-net.org (accessed on 1 May 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


## *Article* **Prediction of Hydraulic Jumps on a Triangular Bed Roughness Using Numerical Modeling and Soft Computing Methods**

**Mehdi Dasineh 1, Amir Ghaderi 2,\*, Mohammad Bagherzadeh 3, Mohammad Ahmadi <sup>4</sup> and Alban Kuriqi 5,\***

	- bagherzadeh.mbz96@gmail.com <sup>4</sup> Department of Civil Engineering, Faculty of Engineering, Shabestar Branch, Islamic Azad University, Shabestar 1584743311, Iran; sthfar@gmail.com
	- <sup>5</sup> CERIS, Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
	- **\*** Correspondence: amir\_ghaderi@znu.ac.ir (A.G.); alban.kuriqi@tecnico.ulisboa.pt (A.K.); Tel.: +98-93845-03512 (A.G.)

**Abstract:** This study investigates the characteristics of free and submerged hydraulic jumps on triangular bed roughness in various *T*/*I* ratios (i.e., height and distance of roughness) using CFD modeling techniques. The accuracy of the numerical modeling outcomes was checked and compared using artificial intelligence methods, namely Support Vector Machines (SVM), Gene Expression Programming (GEP), and Random Forest (RF). The results of the FLOW-3D® model and experimental data showed that the overall mean value of relative error is 4.1%, which confirms the numerical model's ability to predict the characteristics of free and submerged jumps. The SVM model, with a minimum Root Mean Square Error (RMSE) and a maximum correlation coefficient (*R*2) compared with the GEP and RF models in the training and testing phases for predicting the sequent depth ratio (*y*2/*y*1), submerged depth ratio (*y*3/*y*1), tailwater depth ratio (*y*4/*y*1), length ratio of jumps (*Lj*/*y*2∗), and energy dissipation (Δ*E*/*E*1), was recognized as the best model. Moreover, the best result for predicting the length ratio of free jumps (*Ljf*/*y*2∗) is obtained at the optimal gamma *γ* = 10, and for the length ratio of submerged jumps (*Ljs*/*y*2∗) at *γ* = 0.60. Based on sensitivity analysis, the Froude number has the greatest effect on predicting (*y*3/*y*1) compared with the submergence factor (*SF*) and *T*/*I*; by omitting this parameter, the prediction accuracy is significantly reduced. Finally, relationships with good correlation coefficients for the mentioned parameters in free and submerged jumps were presented based on the numerical results.

**Keywords:** artificial intelligence; energy dissipation; FLOW-3D; hydraulic jumps; bed roughness; sensitivity analysis

#### **1. Introduction**

The hydraulic jump is a natural phenomenon in an open channel, sometimes regarded as an effective method of energy dissipation near structures such as gates, chutes, and spillways [1]. The hydraulic jump is characterized by the development of large-scale turbulence, surface waves and spray, energy dissipation, and air entrainment [2]. If the tailwater depth equals the subcritical sequent depth, the jump is called a free hydraulic jump; if the tailwater depth is greater than the subcritical sequent depth, the jump is submerged (submerged hydraulic jump). Hydraulic jumps have been widely studied, but only a few investigations have considered the effect of bed roughness on their characteristics. Numerous studies dealing with free and submerged hydraulic jumps, such as those of McCorquodale and Khalifa [3], Smith [4], Graber et al. [5], Vallé and Pasternack [6], Dey and Sarkar [7], Tokyay et al. [8], and Samadi-Boroujeni et al. [9], have been carried out. Ead and Rajaratnam [10] experimentally studied hydraulic jumps on corrugated beds.

**Citation:** Dasineh, M.; Ghaderi, A.; Bagherzadeh, M.; Ahmadi, M.; Kuriqi, A. Prediction of Hydraulic Jumps on a Triangular Bed Roughness Using Numerical Modeling and Soft Computing Methods. *Mathematics* **2021**, *9*, 3135. https://doi.org/10.3390/math9233135

Academic Editors: Freddy Gabbay and Florin Leon

Received: 13 September 2021 Accepted: 3 December 2021 Published: 5 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

The results showed that the length of the jumps was about half that on smooth beds. Carollo et al. [11] investigated hydraulic jump properties on a bed roughened by gravel particles. The results indicate that the roughness reduces both the sequent depth and the length of the jump. Pagliara et al. [12] studied the hydraulic jump on homogeneous and non-homogeneous rough beds. The results matched the experimental data satisfactorily, and new equations were presented to estimate the jump length and sequent depth. Abbaspour et al. [13] investigated the impact of a corrugated bed on hydraulic jumps. The results indicated that the jump length and tailwater depth on corrugated beds are smaller than on a smooth bed. Chanson [14] observed the effect of flow resistance in decreasing the sequent depth ratio for a given Froude number. The results indicated that the Bélanger equation is not appropriate in this case. In addition, the cross-sectional properties of irregular channels have an important influence on the flow characteristics. Ahmed et al. [15] investigated the effect of bed roughness on the submerged jump. Their conclusions show that the jump length and tailwater depth on a rough bed are smaller than on a smooth bed. Palermo and Pagliara [16] produced two general equations for evaluating relative energy dissipation across various hydraulic and geometrical conditions. Pourabdollah et al. [17] studied free and submerged jumps in different stilling basins. They showed that the sequent depth, the submerged depth, and the length of the jump decreased compared to the classical jump. Moreover, the average energy dissipation of the submerged jump on the rough bed was greater than that of the classical jump. Habibzadeh et al. [18] investigated characteristics of hydraulic jumps with and without blocks. The mean longitudinal velocity, turbulence intensity, Turbulent Kinetic Energy (TKE), shear stress, and water surface fluctuations were studied and compared for various flow regimes.

In addition to laboratory research, numerical studies have been conducted on hydraulic jumps. Gharangik and Chaudhry [19] solved the 1D Boussinesq equations to simulate a hydraulic jump in a rectangular channel. The results showed that the equation terms have little influence in determining the location of the hydraulic jump. Ma et al. [20] investigated the turbulence characteristics of 2D submerged hydraulic jumps using the k–ε turbulence model; the results compared acceptably with available experimental data. Mousavi et al. [21] modeled the pressure of free hydraulic jumps using advanced statistical methods, verifying that the maximum and minimum pressure fluctuations are located near the spillway toe and downstream of the jump, respectively. Abbaspour et al. [22] numerically studied hydraulic jumps on a corrugated bed using the standard k-ε and RNG turbulence models. Their results indicated that the k-ε model was suitable for predicting the jump characteristics. Chern and Syamsuri [23] applied the Smoothed Particle Hydrodynamics (SPH) model to evaluate the characteristics of the hydraulic jump on different corrugated beds and classified the jumps. Bayon et al. [24] investigated the performance of OpenFOAM and FLOW-3D® software in the numerical investigation of the hydraulic jump. Nikmehr and Aminpour [25] investigated the characteristics of a hydraulic jump over bed roughness with trapezoidal blocks using a CFD model. The results state that increasing the spacing and height of the roughness decreases the velocity near the bed and increases the shear stress. Ghaderi et al. [26] numerically investigated the characteristics of hydraulic jumps over various roughness shapes using the FLOW-3D® model; the results were compared with previous studies, and relationships with good correlation coefficients for the studied parameters in free and submerged jumps were presented based on the numerical results. Ghaderi et al. [27] studied the effects of triangular microroughness on the characteristics of the submerged jump with the help of the FLOW-3D® model. To validate the model, comparisons between numerical simulations and experimental results were performed for the smooth bed and the triangular microroughness [27].

Recent advancements in data-driven models, e.g., Gene Expression Programming (GEP) and Artificial Neural Networks (ANN), and their application in hydraulics engineering have challenged conventional analysis techniques. Several researchers have shown that soft computing techniques are more feasible and accurate than conventional techniques.

Karbasi and Azamathulla [28] studied free hydraulic jump characteristics on rough beds using Support Vector Regression (SVR), GEP, and ANN methods. The results showed that the GEP model had better accuracy than the other methods. Roushangar and Ghasempour [29] studied hydraulic jump characteristics in suddenly expanding channels using GEP. The results showed that the GEP model was more accurate than existing empirical equations. Roushangar and Ghasempour [30] predicted the hydraulic jump energy dissipation using an SVM with channel geometry and roughness boundary conditions. The sensitivity analysis results stated that the Froude number had the most important impact on the modeling. Roushangar and Homayounfar [31] investigated the characteristics of hydraulic jumps on horizontal and sloping beds using the SVM method. Their results verify that the upstream Froude number is the most critical and influential parameter for predicting the sequent depth in free and submerged jumps. Naseri and Othman [32] predicted the jump length on smooth beds using ANN. Nasrabadi et al. [33] studied submerged hydraulic jump characteristics using machine learning methods. According to their evaluation, the Developed Group Method of Data Handling (DGMDH) model is more accurate than the Group Method of Data Handling (GMDH) model and previous models in predicting the submergence depth, jump length, and relative energy dissipation.

Many studies have been carried out on hydraulic jumps over smooth beds. Nevertheless, few studies have numerically investigated the effect of bed roughness on the characteristics of free and submerged jumps and predicted the outcomes of the numerical models using novel soft computing techniques. Hence, the main objectives of this study are (i) to investigate the effects of bed roughness parameters, considering various roughness arrangements, on the characteristics of free and submerged jumps (sequent depth, submerged depth, jump length, and energy dissipation) on triangular bed roughness under different hydraulic conditions using the CFD technique (a numerical methodology commonly used in engineering [34]), and (ii) to verify the predictions of the numerical model with the help of soft computing methods (SVM, GEP, and RF).

#### **2. Materials and Methods**

#### *2.1. Dimensional Analysis*

The characteristics of hydraulic jumps on bed roughness depend on the fluid properties, the bed dimensions, and the hydraulic state of the flow. Therefore, the subcritical depth of the free jump (*y*2) and the submerged depth of the submerged jump (*y*3) are functions of the following parameters:

$$y\_2 = f\_1(y\_1, u\_1, g, \mu, \rho, T, I) \tag{1}$$

$$y\_3 = f\_2(y\_1, y\_2, y\_4, u\_1, g, \mu, \rho, T, I) \tag{2}$$

Using the dimensional analysis, the following relationships are obtained:

$$\frac{y\_2}{y\_1} = f\_3(Fr\_1 = \frac{u\_1}{\sqrt{gy\_1}}, \text{Re}\_1 = \frac{y\_1 u\_1}{\nu}, \frac{T}{y\_1}, \frac{T}{I}) \tag{3}$$

$$\frac{y\_3}{y\_1} = f\_4(Fr\_1 = \frac{u\_1}{\sqrt{gy\_1}}, \text{Re}\_1 = \frac{y\_1u\_1}{\nu}, SF = \frac{y\_4 - y\_2}{y\_2}, \frac{T}{y\_1}, \frac{T}{I})\tag{4}$$

where *y*1 and *y*4 are the supercritical depth of the free jump and the tailwater depth of the submerged jump, respectively; *u*1 is the inlet velocity; and *g*, *ρ*, *μ*, *SF*, and *ν* are the gravitational acceleration, mass density of water, dynamic viscosity of water, submergence factor, and kinematic viscosity of water, respectively. *T* and *I* are the height and spacing of the roughness elements, and *Fr*1 and *Re*1 are the Froude and Reynolds numbers, respectively. The values of the Reynolds number (*Re*1) were in the range of 39,884–59,825. For large values of the Reynolds number, viscous effects can be neglected [35–37]. Based on the studies of Ead and Rajaratnam [10] and Abbaspour et al. [22], *T*/*y*1 does not significantly affect the depth ratios *y*2/*y*1 and *y*3/*y*1 of hydraulic jumps. Relationships (3) and (4) then become:

$$\frac{y\_2}{y\_1} = f\_5(Fr\_1, \frac{T}{I})\tag{5}$$

$$\frac{y\_3}{y\_1} = f\_6(Fr\_1, SF = \frac{y\_4 - y\_2}{y\_2}, \frac{T}{I})\tag{6}$$

Using the Buckingham Π theorem, the following relationships are obtained for the length ratios of the free and submerged jumps (*Ljf*/*y*2 and *Ljs*/*y*2):

$$\frac{L\_{jf}}{y\_2} = f\_7(Fr\_1, \frac{T}{I})\tag{7}$$

$$\frac{L\_{js}}{y\_2} = f\_8(Fr\_1, SF = \frac{y\_4 - y\_2}{y\_2}, \frac{T}{I}) \tag{8}$$
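The dimensionless groups that drive Equations (3)–(8) can be computed directly from the flow variables. The following minimal Python sketch illustrates this; the inlet values used in the example are hypothetical and are not taken from the paper's tables:

```python
import math

def flow_numbers(u1, y1, y2, y4, nu=1.0e-6, g=9.81):
    """Return (Fr1, Re1, SF) for a hydraulic jump.

    u1: inlet velocity (m/s); y1: supercritical depth (m);
    y2: subcritical sequent depth (m); y4: tailwater depth (m);
    nu: kinematic viscosity of water (m^2/s); g: gravity (m/s^2).
    """
    Fr1 = u1 / math.sqrt(g * y1)   # Froude number, as in Eq. (3)
    Re1 = y1 * u1 / nu             # Reynolds number, as in Eq. (3)
    SF = (y4 - y2) / y2            # submergence factor, as in Eq. (4)
    return Fr1, Re1, SF

# Hypothetical example values (not from the paper):
Fr1, Re1, SF = flow_numbers(u1=2.1, y1=0.02, y2=0.10, y4=0.13)
```

With these assumed inputs, *Re*1 = 42,000, which falls inside the 39,884–59,825 range reported above.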

Figure 1 shows a schematic view of free and submerged jumps on the triangular bed roughness, along with the key hydraulic parameters of the present study. In this figure, *d* is the gate opening.

**Figure 1.** Definition sketch of the free and submerged hydraulic jumps on a triangular bed roughness after Ghaderi et al. [26].

#### *2.2. The FLOW-3D® Model*

Numerical simulations were carried out using FLOW-3D®, a well-known and established computational fluid dynamics (CFD) software package. The software uses the finite volume method on a Cartesian staggered grid to solve the Reynolds-Averaged Navier–Stokes (RANS) equations, which describe continuity and momentum and are expressed as:

$$\frac{\partial}{\partial x}(uA\_X) + \frac{\partial}{\partial y}(vA\_Y) + \frac{\partial}{\partial z}(wA\_Z) = 0\tag{9}$$

$$\frac{\partial u\_i}{\partial t} + \frac{1}{V\_F} \left( u\_j A\_j \frac{\partial u\_i}{\partial x\_j} \right) = -\frac{1}{\rho} \frac{\partial P}{\partial x\_i} + G\_i + f\_i \tag{10}$$

where *u*, *v*, and *w* represent the velocity components in the *x*, *y*, and *z* directions; *VF* is the volume fraction of fluid in each cell; *Ax*, *Ay*, and *Az* are the fractional areas open to flow in the subscript directions; *ρ* is the fluid density; *P* is the hydrostatic pressure; *Gi* is the gravitational acceleration in the subscript direction; and *fi* is the Reynolds stress. In FLOW-3D, free surfaces are modeled with the Volume of Fluid (VOF) technique developed by Hirt and Nichols [37]. The VOF transport equation is:

$$\frac{\partial F}{\partial t} + \frac{1}{V\_F} \left[ \frac{\partial (FA\_x u)}{\partial x} + \frac{\partial (FA\_y v)}{\partial y} + \frac{\partial (FA\_z w)}{\partial z} \right] = 0 \tag{11}$$

Here, *F* denotes the fluid fraction function: if a cell is empty, *F* = 0, and if a cell is full, *F* = 1 [38]. The free surface is located at a position corresponding to an intermediate value of *F* (the user typically takes *F* = 0.5, though another intermediate value may be chosen).
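The surface-location rule above can be illustrated with a short sketch. Assuming a single vertical column of cells with known fraction values (the numbers below are hypothetical, not FLOW-3D output), linear interpolation between cell centers gives the elevation where *F* crosses 0.5:

```python
def free_surface_z(F, dz, iso=0.5):
    """Locate the free surface in a vertical column of VOF fractions.

    F: fluid fractions per cell, ordered bottom to top (1 = full, 0 = empty);
    dz: cell height (m); iso: fraction value defining the surface (F = 0.5).
    Returns the elevation where F crosses `iso`, by linear interpolation
    between adjacent cell centers, or None if there is no crossing.
    """
    centers = [(i + 0.5) * dz for i in range(len(F))]
    for i in range(len(F) - 1):
        if F[i] >= iso > F[i + 1]:  # surface lies between cells i and i+1
            t = (F[i] - iso) / (F[i] - F[i + 1])
            return centers[i] + t * dz
    return None

# Hypothetical column: fully wet near the bed, dry above.
z = free_surface_z([1.0, 1.0, 0.8, 0.2, 0.0], dz=0.01)
```

Here the crossing falls halfway between the third and fourth cell centers, i.e., at *z* = 0.030 m.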

#### 2.2.1. Turbulence Model

In this study, the RNG k-ε turbulence model is used to simulate the turbulence in the water flow. The RNG k-ε model improves on the standard k-ε model (Equations (12) and (13)) by accounting for the effects of small-scale motion on the large-scale flow through modified viscosity terms, and it handles flows with a large degree of streamline curvature well [39]. The model has shown satisfactory results in previous hydraulic engineering studies involving complex geometries and flow fields [26,27,40–46].

$$\frac{\partial(\rho k)}{\partial t} + \frac{\partial(\rho k u\_i)}{\partial x\_i} = \frac{\partial}{\partial x\_j}(\alpha\_k \mu\_{eff} \frac{\partial k}{\partial x\_j}) + G\_k - \rho \varepsilon \tag{12}$$

$$\frac{\partial(\rho\varepsilon)}{\partial t} + \frac{\partial(\rho\varepsilon u\_i)}{\partial x\_i} = \frac{\partial}{\partial x\_j}(\alpha\_\varepsilon \mu\_{eff} \frac{\partial \varepsilon}{\partial x\_j}) + \frac{C\_{1\varepsilon}^\* \varepsilon}{k} G\_k - C\_{2\varepsilon} \rho \frac{\varepsilon^2}{k} \tag{13}$$

Here, *k* is the turbulent kinetic energy (TKE); *ε* is the turbulence dissipation rate; *Gk* is the generation of turbulent kinetic energy caused by the mean velocity gradients; *Gb* is the generation of turbulent kinetic energy caused by buoyancy; *Sk* and *Sε* are source terms; *αk*, *αε*, *C*<sup>∗</sup>1*ε*, and *C*2*ε* are model constants; and *μeff* is the effective viscosity.

#### 2.2.2. Boundary Conditions

Corresponding to the physical conditions of the problem, four different boundary conditions were considered. The inlet and outlet boundaries of the first mesh block were set in the flow direction. The inlet boundary condition was set as a discharge flow rate (*Q*) with the flow depth at the beginning of the channel. The boundary condition at the downstream end of the domain was a pressure boundary condition (*P*) corresponding to the tailwater depth in the flume. No-slip conditions were applied at the wall boundaries and the bottom, which were treated as non-penetrative boundaries. Wall roughness was neglected because of the slight roughness of the material of the experimental facility used for validation. An atmospheric boundary condition was set at the upper boundary of the channel; this allows flow to enter and leave the domain, as null Neumann (zero-gradient) conditions are imposed on all variables except the pressure, which is set to zero (i.e., atmospheric pressure). A symmetry condition (*S*) was used at the inner boundaries. Figure 2 shows the computational domain of the present study and the boundary conditions governing the simulation.

**Figure 2.** The boundary conditions governing the simulation, (**a**) smooth bed, (**b**) the triangular bed roughness.

#### 2.2.3. Checking Stability and Convergence Criterion

To obtain correct numerical model values, the solution must reach a stable state. A stability criterion based on the Courant number is used to calculate the allowed time-step size. The Courant number indicates how fast the fluid passes through a cell; if it is greater than 1, the fluid crosses a cell in less than one time step, which leads to numerical instabilities. The stability criterion led to time steps between 0.001 s and 0.0016 s. The evolution in time was used as a relaxation toward the final steady state. During the simulations, steady-state convergence was checked by monitoring the variation of the flow discharge at the inlet and outlet boundaries. Figure 3 shows that *t* = 16 s is sufficient to reach a near-steady-state condition for *Q* = 0.03 m<sup>3</sup>/s and *Q* = 0.045 m<sup>3</sup>/s. The computational time for the simulations was between 14 and 18 h on a personal computer with eight CPU cores (Intel Core i7-7700K @ 4.20 GHz) and 16 GB RAM.
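The time-step restriction described above follows from the Courant number C = uΔt/Δx. A minimal sketch is given below; the characteristic velocity of 4 m/s and the safety factor of 0.8 are assumptions for illustration, while Δx ≈ 0.0065 m corresponds to the nested-block cell size of 0.65 cm used in this study:

```python
def courant(u, dt, dx):
    """Courant number: fraction of a cell the flow crosses per time step."""
    return u * dt / dx

def max_stable_dt(u, dx, c_max=0.8):
    """Largest time step keeping the Courant number below c_max (< 1)."""
    return c_max * dx / u

# Assumed characteristic velocity of 4 m/s (illustrative, not from the paper):
dt = max_stable_dt(u=4.0, dx=0.0065)
```

With these assumed values, the resulting Δt = 0.0013 s lands inside the 0.001–0.0016 s range reported above.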

**Figure 3.** CFD flow discharge time variation at the inlet and outlet boundaries, (**a**) *Q* = 0.03 m<sup>3</sup>/s, (**b**) *Q* = 0.045 m<sup>3</sup>/s.

#### 2.2.4. Numerical Domain

The numerical model is compared with the laboratory test results provided by Ahmed et al. [15]. Although the experimental flume was 24.5 m long, the computational domain in the present study was set to 4.5 m to reduce the computational effort and the overall number of cells [26] (for more details, see Ahmed et al. [15]). Table 1 shows the parameters of the numerical models.

**Table 1.** The parameters of the numerical models.


The geometry of the models is represented through an STL (stereolithography) file. The numerical mesh adopts two mesh blocks: a containing mesh block for the entire spatial domain and a nested block with refined cells for the area of interest, where the hydraulic jump occurs (Figure 4). Best practice is to align the mesh boxes with fixed points and to keep cell aspect ratios no greater than 2.

**Figure 4.** Structured rectangular hexahedral mesh with two different mesh blocks, (**a**) smooth bed, (**b**) the triangular bed roughness.

#### 2.2.5. Mesh Size Sensitivity Analysis

For the mesh sensitivity analysis, numerical solutions with five different mesh sizes were compared in terms of the *y*3/*y*1 and *y*2/*y*1 ratios at *Fr*1 = 4.5 for the submerged and free hydraulic jumps. Table 2 summarizes the results for three of these mesh sizes. Figure 5 shows that the simulated *y*3/*y*1 and *y*2/*y*1 ratios agree better with the measured values for the finer cell size of 0.60 cm. In addition, the variation of the mean relative error becomes negligible when the cell size is decreased from 0.65 cm to 0.60 cm. As a result, the selected mesh consists of a containing block with 1.3 cm cells and a nested block with 0.65 cm cells. The same mesh was used for all models to reduce the effect of the computational mesh on the simulation results. The distance of the first cell from the walls was selected so as to avoid computations in the viscous sub-layer.


<sup>1</sup> Mean Absolute Percentage Error $= 100 \times \frac{1}{n}\sum\_{1}^{n}\left|\frac{X\_{Exp}-X\_{Num}}{X\_{Exp}}\right|$. $X\_{Exp}$: the experimental value of X; $X\_{Num}$: the numerical value of X; and n: the total amount of data.

**Figure 5.** Variations of the relative error of *y*3/*y*<sup>1</sup> and *y*2/*y*<sup>1</sup> at *Fr*<sup>1</sup> versus cell size.

#### *2.3. Artificial Intelligence Methods*

2.3.1. Support Vector Machine (SVM)

The SVM algorithm is a data mining algorithm that uses regression to solve classification and prediction problems. As with artificial neural networks, problem-solving is divided into two phases: training and testing (i.e., validation). First, the system is trained on part of the data; then, the solution is evaluated on the test data. The SVM is based on linear classification of the data and tries to select the separating hyperplane with the largest margin of confidence. The training data closest to the separating hyperplane are called support vectors, and the hyperplane with the maximum distance between the two classes is known as the optimal separating hyperplane [47]. Based on the limited information of the samples, the SVM algorithm seeks the best option among models of different complexity that can be trained [48]. The SVM algorithm uses four different kernels, which are presented in Table 3. The most widely used kernel functions in support vector machine problems are the Gaussian (RBF) and exponential RBF (ERBF) functions [49]. These functions are used when information on the type and nature of the data is not available [50]. In the present study, the RBF function was used to predict the parameters.

Here, *Xi* and *Xj* are two vectors in directions *i* and *j*, and *a*, *c*, and *d* are kernel parameters. According to Figure 6, the input data are first entered into the statistical software. Based on the dimensional analysis, the dependent and independent parameters are defined in the software environment, the RBF kernel is selected, and its main parameter γ is set by trial and error. Selecting an appropriate value of γ makes the results accurate and close to reality.
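The role of γ in the RBF kernel can be made concrete. Below is a minimal sketch of the Gaussian kernel in its common form, K(Xi, Xj) = exp(−γ‖Xi − Xj‖²); the exact parameterization listed in Table 3 may differ, and the vectors used are illustrative:

```python
import math

def rbf_kernel(xi, xj, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||xi - xj||^2).

    A larger gamma makes the similarity decay faster with distance,
    which is why its value strongly affects SVM predictions.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

# Identical vectors always give similarity 1, regardless of gamma:
same = rbf_kernel([4.5, 0.5], [4.5, 0.5], gamma=10.0)
# For distinct vectors, a larger gamma gives a smaller similarity:
near = rbf_kernel([4.5, 0.5], [5.0, 0.5], gamma=0.6)
far = rbf_kernel([4.5, 0.5], [5.0, 0.5], gamma=10.0)
```

This decay behavior is why the trial-and-error choice of γ, discussed above, matters: too large a γ makes the model overly local, too small a γ makes it overly smooth.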

**Table 3.** Types of kernel functions [50].


**Figure 6.** Schematic of the Support Vector Machine (SVM).

#### 2.3.2. Gene Expression Programming (GEP)

The GEP method is a combination and development of the Genetic Algorithm (GA) and Genetic Programming (GP), introduced by Ferreira [51]. The method combines simple linear chromosomes of constant length, as in genetic algorithms, with branching structures of different sizes and shapes, as in the parse trees of genetic programming. The first step in GEP is to form the initial population of candidate solutions. The chromosomes are then expressed as trees (ETs). The fitness function determines the degree of fitness of each chromosome in the population. Next, the number of genes and chromosomes must be set to run the GEP model. One strength of GEP is that generating genetic diversity is very simple, since the genetic operators act at the chromosome level. Another strength is its unique multi-genic nature, which provides a basis for evaluating complex simulations [52]. The GEP algorithm consists of five steps: determining the fitness function; selecting the set of terminals and the set of functions used to create the chromosomes; selecting the structure of the chromosomes; selecting the linking function; and selecting the genetic operators and their rates [50,53]. In the present study, the GeneXproTools program was used to predict the parameters. The main steps of the GEP method are shown schematically in Figure 7.
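The expression-tree (ET) encoding at the heart of GEP can be sketched in a few lines. In GEP, a gene is a fixed-length string in Karva notation that is decoded breadth-first into a tree. The following toy decoder and evaluator is a simplified illustration (not the GeneXproTools implementation) that handles binary operators and single-character variable terminals:

```python
def eval_kexpression(expr, variables):
    """Decode a Karva-notation gene breadth-first and evaluate it.

    expr: string such as "+a*ab"; '+', '-', '*', '/' are binary operators,
    any other character is looked up as a terminal in `variables`.
    """
    ops = {'+': lambda l, r: l + r, '-': lambda l, r: l - r,
           '*': lambda l, r: l * r, '/': lambda l, r: l / r}
    root = {'sym': expr[0], 'kids': []}
    queue, i = [root], 1
    while queue:                       # fill the tree level by level
        node = queue.pop(0)
        if node['sym'] in ops:
            for _ in range(2):         # binary operators take two children
                child = {'sym': expr[i], 'kids': []}
                i += 1
                node['kids'].append(child)
                queue.append(child)

    def evaluate(node):
        if node['sym'] in ops:
            return ops[node['sym']](evaluate(node['kids'][0]),
                                    evaluate(node['kids'][1]))
        return variables[node['sym']]

    return evaluate(root)

# "+a*ab" decodes breadth-first to the tree a + (a * b):
result = eval_kexpression("+a*ab", {'a': 2.0, 'b': 3.0})
```

The fixed-length string with a breadth-first decode is what lets GEP apply simple string-level genetic operators while still producing trees of varying shape.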

**Figure 7.** Schematic of the Gene Expression Programming (GEP).

#### 2.3.3. Random Forest (RF)

The RF algorithm is an ensemble learning algorithm for regression and classification problems based on growing decision trees [54]. An RF is a collection of unpruned trees, each generated by a recursive partitioning algorithm [51]. In other words, an RF combines several decision trees, each built on a bootstrap sample of the data. Bootstrapping is sampling with replacement: no selected observation is removed from the input set before drawing the next one, so some observations may appear more than once in a training subset while others are left out entirely. For each bootstrap sample, a classification or regression tree is grown using the recursive partitioning algorithm, with the split at each node based on a random subset of the predictor variables. The recursive partitioning continues until each tree reaches its maximum size, without pruning [54]. The operation of the RF algorithm is shown in Figure 8.
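The bootstrap ("sampling with replacement") step described above can be sketched directly. Some observations appear several times in a bootstrap sample while, on average, roughly 1/e ≈ 36.8% are never drawn ("out-of-bag"). A minimal illustration in Python (not the actual RF implementation used in the study; the seed is arbitrary):

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices with replacement from range(n), as RF does per tree."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(42)          # fixed seed for reproducibility
n = 620                          # same size as the study's data set
indices = bootstrap_sample(n, rng)
in_bag = set(indices)
oob_fraction = 1 - len(in_bag) / n   # observations never drawn
```

Each tree in the forest is fitted on its own such sample, and averaging the trees' predictions gives the RF regression output.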

**Figure 8.** Performance of Random Forest (RF).

#### *2.4. Evaluation Criteria*

In the present study, the evaluation criteria of the correlation coefficient (*R*2), Root Mean Square Error (RMSE), Normalized Root Mean Square Error (NRMSE), and Mean Absolute Percentage Error (MAPE) were used to compare the results of the models predicting the hydraulic jump parameters (Equations (14)–(17)).

$$R^2 = \left(\frac{n\sum X\_{Num}X\_{Pre} - \left(\sum X\_{Num}\right)\left(\sum X\_{Pre}\right)}{\sqrt{n\left(\sum X\_{Num}^2\right) - \left(\sum X\_{Num}\right)^2}\sqrt{n\left(\sum X\_{Pre}^2\right) - \left(\sum X\_{Pre}\right)^2}}\right)^2\tag{14}$$

$$RMSE = \sqrt{\frac{1}{n} \sum\_{1}^{n} \left(X\_{Num} - X\_{Pre}\right)^{2}} \tag{15}$$

$$NRMSE(\%) = 100 \times \frac{\sqrt{\frac{1}{n} \sum\_{1}^{n} (X\_{Num} - X\_{Pre})^2}}{\sum\_{1}^{n} X\_{Num}} \tag{16}$$

$$MAPE(\%) = 100 \times \frac{1}{n} \sum\_{1}^{n} \left| \frac{X\_{Num} - X\_{Pre}}{X\_{Num}} \right| \tag{17}$$

Here, *XPre* and *XNum* are the predicted and the numerical values, respectively. Note that the best model is one for which RMSE approaches zero and *R*2 approaches one, and for which the NRMSE and MAPE values are less than 10%.
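Equations (14)–(17) translate directly into code. A self-contained sketch of these evaluation criteria in pure Python, assuming equal-length lists of numerical and predicted values (the example vectors are illustrative):

```python
import math

def metrics(x_num, x_pre):
    """Return (R2, RMSE, NRMSE %, MAPE %) per Equations (14)-(17)."""
    n = len(x_num)
    sx, sy = sum(x_num), sum(x_pre)
    sxy = sum(a * b for a, b in zip(x_num, x_pre))
    sxx = sum(a * a for a in x_num)
    syy = sum(b * b for b in x_pre)
    r = (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2))
    r2 = r ** 2                                              # Eq. (14)
    mse = sum((a - b) ** 2 for a, b in zip(x_num, x_pre)) / n
    rmse = math.sqrt(mse)                                    # Eq. (15)
    nrmse = 100 * rmse / sx                                  # Eq. (16)
    mape = 100 / n * sum(abs((a - b) / a)
                         for a, b in zip(x_num, x_pre))      # Eq. (17)
    return r2, rmse, nrmse, mape

r2, rmse, nrmse, mape = metrics([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Note that a perfectly linear but biased prediction (as in the example) gives *R*2 = 1 yet a nonzero RMSE, which is why the criteria are used together.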

#### **3. Results**

In the present study, the output results of the FLOW-3D® model were investigated using the SVM, GEP, and RF methods. For this purpose, a total of 620 output data points from the numerical model were used to predict the parameters *y*2/*y*1, *y*3/*y*1, *y*4/*y*1, *Lj*/*y*2<sup>∗</sup>, and Δ*E*/*E*1 with the artificial intelligence methods. To achieve accurate predictions and better results, the training process was repeated several times. Finally, a pattern of 75% of the data for training and 25% for testing was used for all methods.
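The 75%/25% training/testing partition described above can be sketched as a simple shuffled split (illustrative only; the actual partitioning tool used in the study is not specified, and the seed is arbitrary):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle the data and split it into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = data[:]               # copy so the input stays untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# 620 samples, as in the present study:
samples = list(range(620))
train, test = train_test_split(samples)
```

With 620 samples this yields 465 training and 155 testing points; shuffling before splitting avoids any ordering bias in the numerical output data.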

#### *3.1. Validity of the FLOW-3D® Model Results*

Although the CFD technique has been developing for more than half a century, only in the last decade have computers allowed solving more complex 3D geometries, which makes validating CFD results very important [55]. Hence, a comparison between numerical and experimental results for the basic parameters, including the submerged depth ratio (*y*3/*y*1), tailwater depth ratio (*y*4/*y*1), and relative jump length (*Ljs*/*y*1) of a submerged hydraulic jump and the sequent depth ratio (*y*2/*y*1) of a free hydraulic jump on a smooth bed, was used to validate the numerical model; the results are plotted in Figure 9.

Moreover, the essential flow variables are summarized in Table 4.

From the graphs, substantial agreement can be observed between the numerical results and the experimental results of Ahmed et al. [15] as a function of *Fr*1. The overall mean value of the relative error is 4.1%, which confirms the ability of the numerical model to predict the characteristics of free and submerged jumps. In general, the CFD model is in excellent agreement with the experimental data [56].

**Figure 9.** Numerical versus basic experimental parameters of submerged and free hydraulic jumps. (**a**) *y*3/*y*1, (**b**) *y*4/*y*1, (**c**) *Ljs*/*y1*, and (**d**) *y*2/*y*1.

**Table 4.** Basic flow variables for the numerical and physical models after Ahmed et al. [15].


#### *3.2. Sequent Depth Ratio in the Free Jump (y2/y1)*

The *y*2/*y*1 ratio, which represents the height of the jump, is directly related to changes in *Fr*1 and the spacing of the roughness elements; increasing these parameters increases *y*2/*y*1. According to the results of the FLOW-3D® model, the most significant decrease in *y*2/*y*1 with increasing Froude number relative to the smooth bed occurs at *T*/*I* = 0.50, with a mean reduction of 17.83%. The results showed that *y*2/*y*1 for jumps on the rough bed was smaller than for the corresponding jumps on a smooth bed [26,27]. Table 5 summarizes the results of estimating *y*2/*y*1. Comparing the three models, the SVM model, with the lowest RMSE (0.2075) and the highest *R*2 (0.9966) in the training phase and RMSE = 0.2990 and *R*2 = 0.9960 in the testing phase, was selected as the best model for predicting *y*2/*y*1.

Figures 10 and 11 compare the results of the FLOW-3D® model and the SVM model for estimating *y*2/*y*1 in the training and testing phases. The SVM model performs well in predicting this parameter, and its output is in good agreement with the FLOW-3D® values. It is also observed that, when predicting *y*2/*y*1 in the testing phase, the SVM model estimates higher values than the FLOW-3D® model at the maximum points.


**Table 5.** Prediction results for the sequent depth ratio (*y*2/*y*1).

**Figure 10.** FLOW-3D® model versus SVM model predicted for the *y*2/*y*1.

**Figure 11.** Comparison of FLOW-3D® model and SVM model for estimating the *y*2/*y*1.

In general, based on the numerical data of the present study, the equation for *y*2/*y*1 in the free jump, with a correlation coefficient of 0.997, is:

$$\frac{y\_2}{y\_1} = 1.338 Fr\_1 - 2.458(\frac{T}{I}) + 0.0528 \tag{18}$$
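Equation (18) is straightforward to apply. A short sketch evaluating the fitted relation is shown below; the inputs *Fr*1 = 4.5 and *T*/*I* = 0.5 are illustrative values chosen from the ranges discussed in the text:

```python
def y2_over_y1(fr1, t_over_i):
    """Sequent depth ratio of the free jump, per Equation (18)."""
    return 1.338 * fr1 - 2.458 * t_over_i + 0.0528

ratio = y2_over_y1(fr1=4.5, t_over_i=0.5)
```

The negative coefficient on *T*/*I* reflects the finding above that roughness reduces the sequent depth ratio relative to a smooth bed.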

#### *3.3. Submerged Depth Ratio in Submerged Jump (y3/y1)*

Based on the dimensional analysis, the submerged depth ratio (*y*3/*y*1) and the tailwater depth ratio (*y*4/*y*1) depend on *Fr*1, *T*/*I*, and *SF*. According to the FLOW-3D® results, the most significant decreases in *y*3/*y*1 and *y*4/*y*1 with increasing Froude number relative to the smooth bed occur at *T*/*I* = 0.50, with mean reductions of 20.88% and 23.34%, respectively [26,27]. Comparing the results of the three models in Table 6 shows that, for *y*3/*y*1, the SVM model, with RMSE = 0.3391 and *R*2 = 0.9964 in the testing phase, is closest to the FLOW-3D® numerical model. The SVM model also performed better in predicting *y*4/*y*1, with very little error. After the SVM model, the GEP model also provided acceptable results in estimating *y*3/*y*1 and *y*4/*y*1.



Figures 12 and 13 compare the FLOW-3D® results with the predictions of the SVM, GEP, and RF models in the testing phase for *y*3/*y*1 and *y*4/*y*1. The graphs show that the SVM model predicts better than the other two models. At the maximum and minimum points of *y*3/*y*1 and *y*4/*y*1, which are always accompanied by turbulence at the water surface, the SVM model has the highest efficiency and the lowest error of the three models. The values predicted by the SVM model adapt well to, and overlap with, the output values of the numerical model.

**Figure 12.** Comparison of the numerical results and the predicted models of (*y*3/*y*1) for the testing phase.

**Figure 13.** Comparison of the numerical results and the predicted models of (*y*4/*y*1) for the testing phase.

In general, based on the results of this study, the following equations were obtained for *y*3/*y*1 and *y*4/*y*1 in the submerged jump on the triangular bed roughness, with correlation coefficients of 0.993 and 0.989, respectively:

$$\frac{y\_3}{y\_1} = 1.538Fr\_1 + 3.263SF - 3.219(\frac{T}{I}) - 0.915 \tag{19}$$

$$\frac{y\_4}{y\_1} = 1.909 Fr\_1 + 3.015SF - 3.961(\frac{T}{I}) - 0.977 \tag{20}$$

#### *3.4. The Length Ratio of Jumps (Lj/y2<sup>∗</sup>)*

In the present study, the subcritical depth of the classical hydraulic jump (*y*2<sup>∗</sup>) is obtained from the Bélanger equation, as explained by French [57]:

$$y\_2^\* = \frac{y\_1}{2} \left[ \sqrt{(1 + 8Fr\_1^2)} - 1 \right] \tag{21}$$
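Equation (21) can be applied directly. For example, a supercritical depth *y*1 = 0.02 m at *Fr*1 = 4.5 (illustrative values, not the paper's data) gives a classical sequent depth of about 0.118 m:

```python
import math

def belanger_y2(y1, fr1):
    """Subcritical sequent depth of the classical jump, Equation (21)."""
    return y1 / 2 * (math.sqrt(1 + 8 * fr1 ** 2) - 1)

y2_star = belanger_y2(y1=0.02, fr1=4.5)
```

This classical value is the normalizing depth used in the jump length ratios that follow.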

According to the FLOW-3D® results, *Lj*/*y*2<sup>∗</sup> on the rough bed is smaller than on the smooth bed, and it is larger for the submerged jump than for the free jump. For *T*/*I* = 0.5, the length ratio of the free and submerged jumps decreases by about 25.52% and 21.65% on average, respectively [27]. An accurate estimate of the jump length reduces the volume of construction work and thus the overall cost of a project, and it is essential for designing the length of the stilling basin. The results of predicting *Lj*/*y*2<sup>∗</sup>, along with the evaluation criteria, are presented in Table 7. According to the results, the SVM model has good statistical criteria compared with the other models and predicts the relative length of free and submerged hydraulic jumps with high accuracy.


**Table 7.** Prediction results for the length ratio of the jumps (*Lj*/*y*2<sup>∗</sup>).

Graphs of the variation of *R*2 and RMSE with gamma are presented for the best models of *Ljf*/*y*2<sup>∗</sup> and *Ljs*/*y*2<sup>∗</sup> in the testing phase (Figure 14). In the support vector machine, selecting an appropriate gamma is one of the main steps in determining the best model, and this was done by trial and error. The best result for predicting *Ljf*/*y*2<sup>∗</sup> was obtained at γ = 10, and for *Ljs*/*y*2<sup>∗</sup> at γ = 0.60.

**Figure 14.** Variation of *R*2 and RMSE versus gamma for the best SVM model in jump length estimation.

Figures 15 and 16 show the FLOW-3D® results and the predictions of *Ljf*/*y*2<sup>∗</sup> and *Ljs*/*y*2<sup>∗</sup> for the best SVM model in the training and testing phases. Figure 15 shows that the prediction accuracy of the SVM model decreases when *Ljf*/*y*2<sup>∗</sup> reaches its maximum and minimum points; in other words, the prediction error of the SVM model increases at the extreme jump lengths. Moreover, as Figure 16 shows, for *Ljs*/*y*2<sup>∗</sup> the SVM model remains close to the FLOW-3D® values and performs better than for *Ljf*/*y*2<sup>∗</sup>. Most of the SVM model errors for both parameters occurred in the initial range of the testing data; from the middle to the end of the data, the prediction error decreased.

The following equations express $L_j/y_2^*$, with correlation coefficients of 0.724 and 0.944 for the free and submerged jumps, respectively:

$$\frac{L\_{jf}}{y\_2^\*} = 0.065 Fr\_1 - 3.757(\frac{T}{I}) + 6.103\tag{22}$$

$$\frac{L\_{js}}{y\_2^\*} = 0.037 Fr\_1 + 5.568SF - 2.556(\frac{T}{I}) + 5.579 \tag{23}$$
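Equations (22) and (23) can be transcribed directly into code; the coefficients below are taken from the text, and the sample values in the test are illustrative only:

```python
def jump_length_free(fr1, t_over_i):
    """Eq. (22): relative free-jump length L_jf / y2*."""
    return 0.065 * fr1 - 3.757 * t_over_i + 6.103

def jump_length_submerged(fr1, sf, t_over_i):
    """Eq. (23): relative submerged-jump length L_js / y2*."""
    return 0.037 * fr1 + 5.568 * sf - 2.556 * t_over_i + 5.579
```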

**Figure 15.** Comparison of FLOW-3D® and SVM model values for estimating $L_{jf}/y_2^*$.

**Figure 16.** Comparison of FLOW-3D® and SVM model values for estimating $L_{js}/y_2^*$.

#### *3.5. The Energy Dissipation (ΔE/E1)*

The energy dissipation of free and submerged hydraulic jumps is calculated as follows, after Pourabdollah et al. [17]:

$$\left(\frac{\Delta E}{E\_1}\right)\_f = \left(\frac{E\_1 - E\_2}{E\_1}\right)\_f = \left(\frac{(y\_1 + V\_1^2/2g) - (y\_2 + V\_2^2/2g)}{y\_1 + V\_1^2/2g}\right)\_f\tag{24}$$

$$\left(\frac{\Delta E}{E\_1}\right)\_s = \left(\frac{E\_3 - E\_4}{E\_3}\right)\_s = \left(\frac{(y\_3 + V\_1^2/2g) - (y\_4 + V\_4^2/2g)}{y\_3 + V\_1^2/2g}\right)\_s\tag{25}$$

$E_1$, $E_2$, $E_3$, and $E_4$ are the specific energies upstream and downstream of the free and submerged jumps, respectively (see Figure 1). According to the FLOW-3D® results, $\Delta E/E_1$ increases with increasing $Fr_1$. The highest $\Delta E/E_1$ occurs at *T*/*I* = 0.50 for both free and submerged jumps, compared with the other roughness spacings of the corresponding *T*/*I* ratios [26,27]. Determining the $\Delta E/E_1$ produced by hydraulic jumps leads to a more efficient and economical stilling basin design. The results of predicting the energy dissipation due to the free jump $(\Delta E/E_1)_f$ and the submerged jump $(\Delta E/E_1)_s$ are presented in Table 8. In the testing phase, the SVM model was recognized as the best model, with R² = 0.9848 and RMSE = 0.0313 for $(\Delta E/E_1)_f$ and R² = 0.9843 and RMSE = 0.0238 for $(\Delta E/E_1)_s$. Therefore, among the three models, the SVM model gives the best prediction with the least possible error.
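A minimal sketch of the specific-energy computation behind Equations (24) and (25); the depth and velocity values used for checking are illustrative, not taken from the study:

```python
G = 9.81  # gravitational acceleration, m/s^2

def specific_energy(y, v, g=G):
    """Specific energy of open-channel flow: E = y + V^2 / (2g)."""
    return y + v ** 2 / (2 * g)

def relative_energy_dissipation(y_up, v_up, y_down, v_down, g=G):
    """Eq. (24)/(25): (E_up - E_down) / E_up across a hydraulic jump."""
    e_up = specific_energy(y_up, v_up, g)
    e_down = specific_energy(y_down, v_down, g)
    return (e_up - e_down) / e_up
```

For a supercritical inflow (small depth, high velocity) dissipating into a subcritical outflow, the ratio falls between 0 and 1.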


**Table 8.** Prediction results for the energy dissipation ($\Delta E/E_1$).

Two radar graphs of R² and RMSE for the energy dissipation due to free and submerged jumps are presented for the testing phase (Figure 17). Radar graphs show the prediction accuracy of the different models relative to each other; the SVM model provides acceptable performance and a much better prediction than the GEP and RF models. Furthermore, because the RMSE values are small and their variation would not be visible in the graph, they were multiplied by 10 to broaden the plotted range and make the differences easier to read.
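The two evaluation criteria used throughout the comparison (R² and RMSE) can be written in a few lines of pure Python; this is a generic sketch, not the authors' evaluation code:

```python
from math import sqrt

def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination R^2 (1 = perfect fit)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```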

**Figure 17.** Radar graphs of R² and RMSE for energy dissipation due to free and submerged jumps in the testing phase.

The distribution graphs between numerical and predicted values are plotted for the best energy dissipation models for the free and submerged jumps (Figures 18 and 19). The changes in energy dissipation during the testing and training phases indicate good agreement and overlap between the values of the numerical model and the predicted ones. According to the figures, the numerical model data show little dispersion relative to the predicted data; in other words, the outputs match each other very well. Additionally, during the simulation process the network training did not fail, and the training scores were always higher than the testing scores.

The following equations show the relationship between $\Delta E/E_1$ and $Fr_1$, with correlation coefficients of 0.963 and 0.946 for the free and submerged jumps, respectively:

$$\left(\frac{\Delta E}{E\_1}\right)\_f = -0.009Fr\_1^2 + 0.184Fr\_1 - 0.177\tag{26}$$

$$\left(\frac{\Delta E}{E\_1}\right)\_s = -0.007Fr\_1^2 + 0.146Fr\_1 - 0.143\tag{27}$$

**Figure 18.** FLOW-3D® model versus SVM predicted for the free jump.

**Figure 19.** FLOW-3D® model versus SVM predicted for the submerged jump.

#### *3.6. Sensitivity Analysis*

Sensitivity analysis is the best way to determine the influence of the input variables of a statistical model. In sensitivity analysis, the inputs of a statistical model are changed in an organized way to observe the effect of the presence or absence of each variable on the model's predictive output. In the present study, the input parameters were omitted one by one when predicting the submerged depth ratio ($y_3/y_1$); the parameter with the greatest impact was identified, and the results are presented in Table 9.


**Table 9.** Sensitivity analysis results for the submerged depth ratio ($y_3/y_1$).

The best prediction of $y_3/y_1$ is obtained when all three parameters $Fr_1$, *T*/*I*, and *SF* are involved. Based on the sensitivity analysis, $Fr_1$ has the greatest effect on predicting $y_3/y_1$; omitting this parameter significantly reduces the prediction accuracy. *SF* and *T*/*I* also contribute to predicting $y_3/y_1$, but each has less impact than $Fr_1$.
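The omit-one-parameter procedure can be sketched as follows; a least-squares fit stands in for the SVM model, and the synthetic data, weights, and feature names are illustrative assumptions:

```python
import numpy as np

def r2_score(y, p):
    """Coefficient of determination R^2."""
    return 1.0 - ((y - p) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def loo_feature_sensitivity(X, y, names):
    """Refit a least-squares model with each input omitted in turn and
    report the drop in R^2 (a simple stand-in for rerunning the SVM)."""
    def fit_r2(Xs):
        A = np.column_stack([Xs, np.ones(len(Xs))])   # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return r2_score(y, A @ coef)
    full = fit_r2(X)
    drops = {name: full - fit_r2(np.delete(X, j, axis=1))
             for j, name in enumerate(names)}
    return full, drops

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
# Target depends strongly on Fr1, weakly on SF and T/I (illustrative weights).
y = 3.0 * X[:, 0] + 0.4 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 0.1, 100)
full_r2, drops = loo_feature_sensitivity(X, y, ["Fr1", "SF", "T/I"])
```

The largest R² drop flags the most influential input, mirroring the conclusion drawn for $Fr_1$ in Table 9.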

#### **4. Conclusions**

This paper presented and discussed the characteristics of free and submerged hydraulic jumps over triangular bed roughness in various roughness arrangements of the corresponding *T*/*I* ratios using CFD techniques, and compared the predictions of this numerical model with artificial intelligence methods (SVM, GEP, and RF). The Volume of Fluid (VOF) method was used to simulate the free flow surface, and the RNG k-ε model in the FLOW-3D® software was used for turbulence. The key findings of the comparative analysis are given below:


Finally, the methodology presented in this study and its solution-oriented results help hydraulic engineers design and construct cost-effective spillways, stilling basins, and other hydraulic structures that experience hydraulic jumps. Indeed, an accurate estimation of the hydraulic jump length, especially for high-head spillways, reduces the volume of construction operations and ultimately the overall cost of the stilling basin built to dissipate the hydraulic jump.

**Author Contributions:** Conceptualization, M.D., A.G., M.B., M.A. and A.K.; methodology, M.D., A.G., M.B., M.A. and A.K.; software, M.D., A.G., M.B., M.A. and A.K.; validation, M.D., A.G., M.B., M.A. and A.K.; formal analysis, M.D., A.G., M.B., M.A. and A.K.; investigation, M.D., A.G., M.B., M.A. and A.K.; resources, M.D., A.G., M.B., M.A. and A.K.; data curation, M.D., A.G., M.B., M.A. and A.K.; writing—original draft preparation, M.D., A.G., M.B., M.A. and A.K.; writing—review and editing, M.D., A.G., M.B., M.A. and A.K.; visualization, M.D., A.G., M.B., M.A. and A.K.; supervision, M.D., A.G., M.B., M.A. and A.K.; project administration, M.D., A.G., M.B., M.A. and A.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data are contained within the article.

**Acknowledgments:** Alban Kuriqi acknowledges the support of the Portuguese Foundation for Science and Technology (FCT) through the project PTDC/CTA-OHR/30561/2017 (WinTherface).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Notation**


#### **References**


## *Article* **A Review of the Modification Strategies of the Nature Inspired Algorithms for Feature Selection Problem**

**Ruba Abu Khurma 1, Ibrahim Aljarah 1,\*, Ahmad Sharieh 1, Mohamed Abd Elaziz 2,3,4, Robertas Damaševičius 5,\* and Tomas Krilavičius <sup>5</sup>**


**\*** Correspondence: i.aljarah@ju.edu.jo (I.A.); robertas.damasevicius@vdu.lt (R.D.)

**Abstract:** This survey is an effort to provide a research repository and a useful reference to guide researchers planning to develop new Nature-inspired Algorithms tailored to solve Feature Selection problems (NIAs-FS). We identified and performed a thorough literature review in three main streams of research: the feature selection problem, optimization algorithms (particularly meta-heuristic algorithms), and the modifications applied to NIAs to tackle the FS problem. We provide a detailed overview of 156 different articles about NIA modifications for tackling FS. We support our discussion with analytical views, visualized statistics, applied examples, and open-source software systems, and we discuss open issues related to FS and NIAs. Finally, the survey summarizes the main foundations of NIAs-FS, with approximately 34 different operators investigated. The most popular operator is chaotic maps. Hybridization is the most widely used modification technique, and it takes three forms: integrating an NIA with another NIA, integrating an NIA with a local search operator, and integrating an NIA with a classifier. The most widely used hybridization is the one that integrates a classifier with the NIA. Microarray and medical applications are the dominant applications in which modified NIA-FS methods are used. Despite the popularity of NIAs-FS, many areas still need further investigation.

**Keywords:** feature selection; evolutionary algorithms; nature inspired algorithms; meta-heuristic optimization; computational intelligence; soft computing

#### **1. Introduction**

As data accumulate rapidly in databases and data warehouses, a dimensionality problem becomes the main challenge for machine learning tasks (e.g., classification or clustering) [1]. Many negative effects may result from scaling up the dimensionality of a data set. These include the existence of irrelevant and redundant features that may adversely affect the learning algorithm or cause data over-fit [2]. Thus, the development of effective data mining techniques becomes an urgent necessity in various fields such as medicine [3], bioinformatics [4], text mining [5], image processing [6], design of smart infrastructures and smart homes [7], financial estimation [8,9], coastal engineering [10], and sustainability [11]. Their significance depends on their ability to turn huge amounts of data into an acceptable form. This will simplify knowledge discovery and make huge data sets more understandable, analyzable, and predictable.

Feature Selection (FS) is a pre-processing data mining technique for dimensionality reduction [12]. In recent years, research in FS has been rapidly developed in line with the

**Citation:** Abu Khurma, R.; Aljarah, I.; Sharieh, A.; Abd Elaziz, M.; Damaševičius, R.; Krilavičius, T. A Review of the Modification Strategies of the Nature Inspired Algorithms for Feature Selection Problem. *Mathematics* **2022**, *10*, 464. https://doi.org/10.3390/math10030464

Academic Editors: Freddy Gabbay and Ripon Kumar Chakrabortty

Received: 2 December 2021 Accepted: 21 January 2022 Published: 31 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


era of big data and huge data sets. This subject has attracted the attention of researchers who have become more interested in developing novel FS techniques and improving current technologies [13]. FS manages the dimensionality problem by finding the most representative feature subset. The essence of FS is to choose features that are highly correlated with the class concept (relevant features) and weakly correlated with each other (complementary, non-redundant features) [14]. Removing irrelevant and redundant features from a data set brings improvements in several directions. For the modeling process, it promotes generalization, which improves the quality of the generated model so that it becomes less complicated and more understandable; as a result, the inductive learner becomes more efficient. FS is categorized by evaluation strategy into filters and wrappers [15]. The main difference between them is whether a learning algorithm is integrated into the evaluation stage. Wrappers use a learning algorithm to evaluate the selected feature subset; hence, wrappers are more accurate but more expensive. In contrast, filters do not rely on learning algorithms but use certain data properties for evaluation. Filters can be univariate or multivariate: univariate filters rank single features individually, while multivariate filters evaluate an entire feature subset as a combination. The generation of a feature subset in multivariate filters depends on the search strategy and the starting point of the generation: forward selection, backward elimination, bidirectional selection, and heuristic feature subset selection. Forward selection starts with an empty feature subset and then adds features; backward elimination starts with the whole feature set and eliminates one or more features; and bidirectional search starts from both sides, from an empty feature subset and from the whole feature set at the same time [16]. Common filter evaluation criteria include the F-statistic [17] and information gain [18].
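A minimal sketch of greedy forward selection as described above: start empty, repeatedly add the feature that improves the subset score the most, and stop when nothing improves. The scoring function here is a toy stand-in for a real subset evaluator:

```python
def forward_selection(features, score, max_size=None):
    """Greedy forward selection over a list of feature names, guided by
    an arbitrary subset-scoring function `score(frozenset) -> float`."""
    selected, best = [], score(frozenset())
    while features and (max_size is None or len(selected) < max_size):
        gains = {f: score(frozenset(selected + [f])) for f in features}
        f_best = max(gains, key=gains.get)
        if gains[f_best] <= best:          # no candidate improves the score
            break
        selected.append(f_best)
        features = [f for f in features if f != f_best]
        best = gains[f_best]
    return selected, best

# Toy score: summed relevance minus a size penalty (illustrative only).
relevance = {"A": 0.9, "B": 0.6, "C": 0.55, "D": 0.1}
def toy_score(subset):
    rel = sum(relevance[f] for f in subset)
    return rel - 0.3 * max(0, len(subset) - 2)   # penalize large subsets

chosen, final = forward_selection(list(relevance), toy_score)
```

Backward elimination is the mirror image: start from the full set and greedily drop the feature whose removal hurts the score least.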

FS is not only a variable shrinkage process, and the target is not just to perform arbitrary cardinality reduction on a data set. FS is a multi-objective optimization problem that searches for the (near-)optimal subset of features in terms of certain evaluation criteria; its main target is to find trade-offs between various conflicting objectives [19]. FS tries to achieve the minimum number of selected features with maximum performance [20]. Relative to its search space, FS is considered a combinatorial nondeterministic polynomial-time-hard (NP-hard) problem, because its large search space requires exponential running time to exhaustively traverse all the generated subsets of features [12]. The $2^N$ runtime complexity grows exponentially with *N*, the number of dimensions (features/variables) in the data set. This means that traditional brute-force methods are impractical and other, more advanced search methods should be used.
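The $2^N$ growth can be illustrated by enumerating subsets directly; for *N* = 50 this would already be about $10^{15}$ candidates, which is why brute force is impractical:

```python
from itertools import combinations

def all_feature_subsets(features):
    """Enumerate every non-empty feature subset: 2^N - 1 candidates."""
    for k in range(1, len(features) + 1):
        for combo in combinations(features, k):
            yield combo

# For N = 10 features there are already 2^10 - 1 = 1023 candidate subsets.
n_subsets = sum(1 for _ in all_feature_subsets(range(10)))
```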

Meta-heuristic search techniques are promising alternative solutions. They have shown superior performance in various optimization scenarios and thus have a great opportunity to be suitable solutions for the FS problem. Meta-heuristics include Nature Inspired Algorithms (NIAs), which are further divided into two main subcategories, namely Swarm Intelligence (SI) and Evolutionary Algorithms (EA) [21]; the two categories simulate, respectively, the social behavior and the biological evolution of agents in nature. Examples of EAs are Genetic Algorithms (GA) [22] and Differential Evolution (DE) [23]. The SI category includes algorithms such as Particle Swarm Optimization (PSO) [24], Ant Colony Optimization (ACO) [25], the Artificial Bee Colony (ABC) algorithm [26], memetic algorithms [27], artificial ecosystem-based optimization [28], the marine predators algorithm [29], polar bear optimization [30], and red fox optimization [31].

Despite the effectiveness of NIAs in solving the FS problem, finding the optimal solution is still not guaranteed. The main challenges that affect meta-heuristic optimization are stagnation in local minima, premature convergence, parameter tuning, exploitation and exploration imbalance, the diversity problem, dynamicity, multi-objectivity, constraints, and uncertainty [32].

Several kinds of modifications have been proposed in the literature to enhance the performance of NIAs in optimization. Examples of these modification techniques include new operators, hybridization [33], updated mechanisms, new initialization strategies, new fitness functions, new encoding schemes, modified population structures, multi-objective formulations, state flipping [34,35], and parallelism [36]. Each modification addresses a weakness of the NIA in some respect without harming the essence of the algorithm and its logic. The research field of NIA-FS has witnessed considerable development. To show the expansion of NIAs-FS models in the literature, Figure 1 illustrates the number of publications per year that combine modified NIAs with FS. In the early years, research was volatile, with some years of disruption; since 2006, the number of publications has increased remarkably, reaching its peak in 2018, and research in this area has become very active in the last five years. An intensive search for surveys in this area found very few NIAs-FS surveys [20]. Some FS surveys did not refer to meta-heuristics at all but focused on other issues, such as data perspectives [19] and supervised/unsupervised FS approaches [15], while other FS surveys were tailored to specific applications or limited to certain domains [37]. The analysis of FS surveys showed that they either refer only briefly to meta-heuristic FS or do not refer to it at all. To our knowledge, there is no survey about modified NIAs-FS, and this finding was one of the main motivations for this work. Unlike previous FS surveys, FS will not be discussed in isolation from other related issues; the main objective is to bridge the gap in FS surveys by providing a review of the important aspects and design issues of NIA-based FS approaches. The main modification strategies that have been adopted to enhance NIAs for solving the FS problem are categorized and discussed.

**Figure 1.** Development of research field regarding Nature Inspired Algorithms (NIA) modifications for tackling Feature Selection (FS).

In this review, a set of research questions will be asked and answered:


Based on the aforementioned research questions, we have constructed this review based on three primary issues:


The review will refer to various well-regarded publishers such as ACM, Elsevier, Springer, IEEE, World Scientific, Hindawi, and others. Figure 2a shows the number of

publications for each NIA in main publishers regarding modifications for tackling FS. Figure 2b shows the number of citations for popular NIAs articles in the main publishers regarding modifications for tackling FS.

**Figure 2.** Statistics of the number of publications and citations for papers on NIAs modifications for FS. (**a**) Statistics of publications on modified NIAs-FS; (**b**) Statistics of citations for papers about modified NIAs-FS.

Section 2 discusses the problem of feature space symmetry in data sets and frames feature selection as a disentanglement of that symmetry. A description of meta-heuristic optimization is presented in Section 3. Section 4 discusses the feature selection problem and its related issues. A review of the different NIAs-FS modification techniques is presented in Section 5. Section 6 highlights the main applications of modified NIAs-FS. An assessment of NIA-FS is provided in Section 7. Finally, Section 8 discusses the outlook for the NIA-FS research field and possible future directions.

#### **2. Feature Selection as a Task of Disentangling the Symmetry of Feature Space**

The aim of supervised machine learning is to estimate a function *f* that fits well with the features of the training data and allows predicting the outputs on previously unseen inputs. The number of samples required for training grows exponentially with the dimension of the feature space, which is known as the "curse of dimensionality" [38]. For example, to approximate a Lipschitz-continuous function composed of Gaussian kernels placed in the quadrants of a *d*-dimensional unit hypercube with error $\epsilon$, one requires $\mathcal{O}(1/\epsilon^d)$ samples [39].

Intuitively, a symmetry of an object is a transformation that leaves certain properties of the object invariant. For example, translation and rotation are symmetries of objects that do not change their representations [40]. The geometric structure of the feature space imposes structure on the class of functions *f* that we are trying to learn. One can have invariant functions that are unaffected by the action of the group, i.e., *f*(*ρ*(*g*)*x*) = *f*(*x*) for any *g* ∈ G and *x*, where G is the symmetry group, *g* is a symmetry transformation of the feature space, *ρ*(*g*) is the group representation, and *x* is an input in the space of input signals G(Ω) that acts as a point in the feature space. Such symmetry transformations (e.g., translation, rotation, shifting) are commonly used for data (image) augmentation to increase the number of data instances for effective training of machine learning models.

The goal of feature selection is to eliminate uninformative and/or redundant features from the feature space, leaving only the relevant (i.e., predictive) features [41]. For a dataset with *N* samples and *M* dimensions (or features), feature selection seeks to reduce *M* to *M*′ with *M*′ << *M*. In other words, feature selection produces a disentangled representation [40] with respect to a particular decomposition of a feature space with some symmetry group, which may be useful for subsequent tasks, such as reducing the complexity of training a machine learning classifier. Such disentanglement, in fact, is performed by a neural network as part of the classification process by learning the weights of the network's nodes [42], which produce asymmetric activations for the separation of classes.

The redundant features are characterized by a high level of inter-correlation. Such correlated features result in the symmetrical distribution of instances in feature space. Feature selection aims to reduce feature dimensionality by reducing the symmetry in feature space. The resulting distribution of classes in the lower-dimensional feature space should be as asymmetrical as possible to allow for easy separability of classes [43]. Furthermore, a strong correlation in features might result in numerous near-optimal feature subsets, making traditional feature selection approaches unstable and lowering the trust in selected features [44]. As many different feature space decompositions are possible, the problem of finding an optimal feature subspace in a high-dimensional feature space is known to be NP-hard [45]. In this paper, the nature-inspired meta-heuristic optimization algorithms are studied for solving the feature selection problem.
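A minimal sketch of removing redundant (highly inter-correlated) features, as described above; the 0.95 threshold and the toy columns are illustrative assumptions:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (pure Python)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def drop_redundant(columns, threshold=0.95):
    """Greedily drop any feature whose |correlation| with an already
    kept feature exceeds `threshold` (a simple redundancy filter)."""
    kept = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

cols = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],     # perfectly correlated with f1 (redundant)
    "f3": [5, 3, 8, 1, 9],
}
kept = drop_redundant(cols)
```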

#### **3. Meta-Heuristic Optimization**

Meta-heuristic algorithms are characterized by flexibility, simplicity, and low computational cost, and they are derivation-free methods. The guiding principle of meta-heuristics is reasonability versus completeness: they give up completeness in order to provide approximate solutions for complex, otherwise unsolved problems. Meta-heuristics are further categorized, based on the number of candidate solutions handled during the optimization process, into trajectory-based and population-based methods.

#### *3.1. Trajectory-Based Optimization*

A trajectory algorithm begins with one random solution and tries to optimize it until a stop condition is satisfied. The computational overhead is reduced significantly because only one solution is improved and evaluated during the optimization process. Equation (1) expresses the number of function evaluations needed in trajectory algorithms, where *T* is the number of iterations:

$$\text{\#Evaluations (in trajectory based)} = 1 \times T. \tag{1}$$

Trajectory algorithms are local search techniques. They depend on making a few changes to the components of the current solution to find a better one: a potential solution is picked, and its neighboring solutions are checked to see whether any is better. Local search implies searching within a limited region (exploitation). This process suffers from potential entrapment in local minima because of weak diversity and a lack of information exchange. Examples of trajectory algorithms are Simulated Annealing (SA) [46] and Tabu Search (TS) [47].
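A trajectory search can be sketched as a simple hill climber that keeps a single solution and evaluates one neighbor per iteration, matching the 1 × T evaluation count of Equation (1); the integer toy objective is an illustrative assumption:

```python
import random

def hill_climb(start, neighbors, fitness, iters=200, seed=0):
    """Trajectory (local) search: keep one current solution and move to a
    randomly picked neighbor only if it improves the fitness."""
    rng = random.Random(seed)
    current, best = start, fitness(start)
    evaluations = 0
    for _ in range(iters):
        cand = rng.choice(neighbors(current))
        evaluations += 1                      # one evaluation per iteration
        f = fitness(cand)
        if f > best:
            current, best = cand, f
    return current, best, evaluations

# Toy problem: maximize -(x - 3)^2 over the integers via +/-1 moves.
sol, val, evals = hill_climb(
    start=-10,
    neighbors=lambda x: [x - 1, x + 1],
    fitness=lambda x: -(x - 3) ** 2,
    iters=300,
)
```

Because only improving moves are accepted, the walk stalls at the first local optimum it reaches, which is exactly the entrapment weakness noted above (SA escapes it by occasionally accepting worse moves).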

#### *3.2. Population-Based Optimization*

A population algorithm begins with a set of randomly generated solutions and tries to enhance them during the optimization process. Each candidate solution fluctuates outward or converges toward the best solution following a certain mathematical framework. These algorithms predominate because of their simplicity and flexibility: they are built upon simple methodologies, evolve from simple concepts, and can be adapted to real-world problems without structural modifications. All that is required is an accurate representation of the problem, and the structure of the optimizer is left untouched. Population algorithms are more efficient at escaping local minima than trajectory algorithms because more individuals are involved and more information is shared between them. However, the multiplicity of solutions increases the computational burden because more evaluations are required. The number of calls to the fitness function is driven by the number of individuals and the number of iterations. Equation (2) identifies the number of function evaluations in population algorithms, where *N* is the number of individuals and *T* is the number of iterations:

$$\#Evaluations(\text{in population based}) = N \times T.\tag{2}$$

A population algorithm begins with the initialization step, in which a set of candidate solutions is generated; a solution is a candidate if it satisfies the constraints of the problem. The next step is the evaluation of the individuals, carried out using a specified fitness function in terms of predefined evaluation criteria: the fitness function is called for each individual, so each individual receives a fitness value. After the evaluation, the update process refines and improves the current solutions, which requires updating the positions of the individuals in the search space. This iterative process of evaluating and updating individuals continues until a predefined criterion is satisfied and the global optimal solution is best approximated.
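The initialize-evaluate-update loop can be sketched generically; the update rule here (drift toward the best solution with Gaussian noise) is an illustrative choice, not a specific NIA, and the evaluation counter matches the N × T cost of Equation (2):

```python
import random

def population_optimize(fitness, dim, n=20, iters=50, seed=42):
    """Generic population loop: initialize, evaluate, update. Performs
    N x T fitness evaluations in total (Eq. (2))."""
    rng = random.Random(seed)
    # Initialization: n random candidate solutions in [-5, 5]^dim.
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    evaluations = 0
    best, best_f = None, float("-inf")
    for _ in range(iters):
        # Evaluation: one fitness call per individual per iteration.
        for ind in pop:
            evaluations += 1
            f = fitness(ind)
            if f > best_f:
                best, best_f = list(ind), f
        # Update: move each individual toward the best, with noise.
        pop = [[x + 0.5 * (b - x) + rng.gauss(0, 0.1)
                for x, b in zip(ind, best)] for ind in pop]
    return best, best_f, evaluations

# Maximize the (negated) sphere function in 3 dimensions.
best, best_f, evals = population_optimize(
    fitness=lambda v: -sum(x * x for x in v), dim=3)
```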

Population-based algorithms comprise NIAs, which result from the union of nature with different scientific fields, including physics, biology, mathematics, and engineering. Computer science utilized these relations between science and nature and turned them into a well-defined discipline for optimizing different challenging problems. NIAs are categorized, based on the source of inspiration, into EA- and SI-based algorithms [21].

#### 3.2.1. Evolution-Based Optimization (EA)

This category includes different computational systems that share an emulation of biological evolution. EAs model natural processes such as reproduction, mutation, recombination, and selection.

EAs are typically designed by generating a population of possible solutions $\vec{I}_1, \vec{I}_2, \vec{I}_3, \ldots, \vec{I}_{n-1}, \vec{I}_n$, called chromosomes. Each chromosome is split into smaller units called genes. The length of the chromosome (the number of genes) determines the dimensionality of the problem. The relation between gene, chromosome, and population can be expressed as *gene* ⊂ *chromosome* ⊂ *population*. Most current evolutionary frameworks implement the chromosome and the population as a 1-d array (vector) and a 2-d array, respectively. Equation (3) identifies the individual $I_i$ with length *d*, and Equation (4) identifies a population *P* in which each individual is a row of the matrix. Each solution is evaluated by a certain objective function $\vec{O}_1, \vec{O}_2, \vec{O}_3, \ldots, \vec{O}_{n-1}, \vec{O}_n$ to determine its quality and decide whether it is fit or unfit. The highest-evaluated solution (best individual) is preserved at each iteration, while the unfit solutions (worst individuals) are candidates to be replaced by newly generated offspring. This allows the average fitness value to increase substantially over the iterations. Common EA examples are GA [22] and DE [23]; GA is undoubtedly the most widespread and typical example of EAs:

$$I\_i = \begin{bmatrix} \mathbf{x}\_i^1 & \mathbf{x}\_i^2 & \mathbf{x}\_i^3 & \cdots & \mathbf{x}\_i^{d-1} & \mathbf{x}\_i^d \end{bmatrix} \tag{3}$$

$$P = \begin{bmatrix} \mathbf{x}\_1^1 & \mathbf{x}\_1^2 & \mathbf{x}\_1^3 & \cdots & \mathbf{x}\_1^{d-1} & \mathbf{x}\_1^d\\ \mathbf{x}\_2^1 & \mathbf{x}\_2^2 & \mathbf{x}\_2^3 & \cdots & \mathbf{x}\_2^{d-1} & \mathbf{x}\_2^d\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ \mathbf{x}\_n^1 & \mathbf{x}\_n^2 & \mathbf{x}\_n^3 & \cdots & \mathbf{x}\_n^{d-1} & \mathbf{x}\_n^d \end{bmatrix}. \tag{4}$$
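The matrix view of Equations (3) and (4) maps directly onto array code; the binary encoding and the toy fitness below are illustrative assumptions for the FS setting:

```python
import numpy as np

d, n = 5, 8                       # chromosome length and population size
rng = np.random.default_rng(7)

# Eq. (3)/(4): an individual is a 1-d array of d genes; the population
# is an n x d matrix with one chromosome per row.
population = rng.integers(0, 2, size=(n, d))   # binary FS mask encoding

def fitness(chromosome):
    """Toy objective: count of selected features (ones in the mask)."""
    return chromosome.sum()

scores = np.array([fitness(ind) for ind in population])
best_individual = population[scores.argmax()]  # preserved each iteration
```

In a real NIA-FS setting, `fitness` would instead train a classifier on the selected feature subset and combine its accuracy with the subset size.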

#### 3.2.2. Swarm-Based Optimization (SI)

SI algorithms share a common behavior that is very similar to the social behavior of creatures. A swarm system comprises a large number of agents distributed in the environment to achieve a global target; intelligence emerges from the actions the agents take to coexist. The main characteristics of swarm systems are adaptability, self-organization, distributed control, scalability, and flexibility [20]. The most common SI examples are Particle Swarm Optimization (PSO) [24] and Ant Colony Optimization (ACO) [25]. PSO's source of inspiration is flocks of birds searching for food. The search procedure is guided by two main factors: pbest and gbest. Pbest represents the best experience gained by the particle itself, and gbest represents the best individual in the whole swarm. Particles also have a position and a velocity, both of which are updated in each iteration.
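A minimal PSO sketch showing the pbest/gbest bookkeeping and the velocity and position updates described above; the parameter values and the toy sphere objective are illustrative assumptions:

```python
import random

def pso(fitness, dim, n=15, iters=100, w=0.7, c1=1.5, c2=1.5, seed=3):
    """Minimal PSO (maximization): each particle tracks pbest (its own best
    position) and the swarm tracks gbest; the velocity blends inertia (w),
    a cognitive pull toward pbest (c1), and a social pull toward gbest (c2)."""
    rng = random.Random(seed)
    X = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n)]
    V = [[0.0] * dim for _ in range(n)]
    pbest = [list(x) for x in X]
    pbest_f = [fitness(x) for x in X]
    g = max(range(n), key=lambda i: pbest_f[i])
    gbest, gbest_f = list(pbest[g]), pbest_f[g]
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (pbest[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                X[i][d] += V[i][d]
            f = fitness(X[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = list(X[i]), f
                if f > gbest_f:
                    gbest, gbest_f = list(X[i]), f
    return gbest, gbest_f

# Maximize the negated 2-d sphere function (optimum at the origin).
sol, val = pso(lambda v: -sum(x * x for x in v), dim=2)
```

A large inertia weight *w* favors exploration; a small one favors exploitation, which is the balance discussed in Section 3.3.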

#### *3.3. Challenges of Meta-Heuristic Optimization*

Despite the efficiency of meta-heuristics in tackling challenging optimization problems, some obstacles impact their performance. These include dynamicity, multi-objectivity, constraints, and uncertainty. In multi-objective optimization, there are multiple conflicting objectives to be optimized until trade-offs (the Pareto optimal set) are achieved, which makes the search space considerably more complex. The optimization problem becomes highly challenging when the number of objectives grows beyond four [48]; the field of many-objective optimization has emerged to deal with such cases. Constraints of real problems create gaps in the search space by dividing it into feasible and infeasible regions: feasible regions satisfy the constraints, while infeasible regions violate them [49]. Accordingly, the optimization algorithm should follow certain mechanisms to approach the promising region and avoid the infeasible regions until an optimal solution is found. The other main issue of meta-heuristic optimization is uncertainty: for example, in dynamic problems the global optimum frequently changes its position in the search space, which requires more attention from the optimization algorithm; some operators are used for recording the history and memorizing the locations of the global optima over time. Other severe challenges are related to the problem search space, such as many holes or valleys that lead to stagnation in local minima, discontinuities in the search space, a global optimum that lies on the boundary of the search space (the boundary of constraints), and the isolation of the global optimum [32].

Population algorithms are characterized by two conflicting milestones called exploration (diversification) and exploitation (intensification) [32]. In exploration, the candidate solutions change abruptly, which allows the algorithm to examine more regions and find diverse solutions. Exploitation changes solutions gently and causes a less sudden stir among the candidate solutions. GA realizes these processes through the crossover and mutation operators: crossover intermixes combinations of solutions, while mutation squeezes certain regions and searches locally. PSO configures the inertia weight with large values for more exploration and small values for more exploitation.

The main challenges of exploration and exploitation are as follows. Firstly, since they have conflicting purposes, increasing one decreases the other. Secondly, the transition between the two milestones is not well defined because the search spaces of optimization problems are usually unknown. Thirdly, pure exploration yields a less accurate approximation of the optimal solution because different regions are explored without focusing on a promising one, while pure exploitation leads to entrapment in local optima. Balancing exploration and exploitation can produce better results and increases the chance of approaching the optimal solution. Recently, this balance has become an active research problem, and several studies have tried to attain it by integrating random and adaptive operators into the structure of the algorithms.

#### **4. Feature Selection**

This section introduces FS in two parts: The dimensionality problem and the FS system based on the NIA search strategies.

#### *4.1. Dimensionality Problem*

Due to the incremental growth of information and the abundance of data, data sets have grown in both samples (number of instances) and dimensions (number of features). The increased dimensionality introduces several negative effects into data mining tasks. One of these is the curse of dimensionality, which describes how data become sparser in a high-dimensional space [12]. This raises the need for more instances to train the classifier, which increases the learning time. Learning algorithms were designed to build their models from rules inferred over a small number of dimensions and cannot generalize well in a high-dimensional space. High dimensionality also implies the presence of noisy features, such as redundant and irrelevant features, that mask the informative features, mislead the classifier, and cause the data to overfit. Overfitting [2] occurs when a classifier is overtrained on the data and learns all examples, including outliers; treating noise and random fluctuations as meaningful leads to overly complex models. Logically, learning from relevant features allows the classifier to be more accurate. Another negative effect of increasing dimensions is the increased demand for specialized hardware, such as large memory storage and high-speed processors, which raises cost.

#### *4.2. FS Preliminaries*

Features are defined as measurable properties of the observation under study. The complexity of the problem is determined by its features. In real-world applications, discovering the relevant features is a major challenge. In 1997, the first papers about relevance and feature selection were published [14]. Feature relevance can be formalized as follows. For 1 ≤ *i* ≤ *n*, let *Ei* be the domain of feature *xi*, and let *X* = {*x*1, *x*2, . . . , *xn*} be the set of all features. *E* = *E*<sup>1</sup> × *E*<sup>2</sup> × · · · × *En* is the instance space from which instances derive their values. Each instance can be represented as a point in this space, and the distribution of these data points has a probability *P*. If we consider the class (label) space to be *T*, then we can define an objective function *c* that maps an instance in *E* to a label/class in *T*: *c*: *E* → *T*. Arguably, a data set with |*S*| instances is the result of sampling |*S*| times from *E* with probability *P* and obtaining a label from *T*. A feature *xi* in *X* is relevant with respect to the class concept if there exist two instances *A* and *B* in *E* that differ only in their assignment to *xi* (all their feature values are the same except for feature *xi*) and *c*(*A*) ≠ *c*(*B*). In contrast, a variable with no or weak correlation with the target concept is called an irrelevant feature. Another type of noisy feature is the redundant feature: a feature that is highly related to other features and adds nothing new to the classification decision.

In the literature, FS has been defined in different ways that are all close in meaning and intuition [15]. FS is a search process that tries to find the subset of features that best describes the data. In terms of relevance discovery, FS aims to determine the most meaningful subset of features, the one with the largest relevance and minimum redundancy. Even though these features are fewer than the original ones, they carry the maximum discriminative information. Classically, FS selects a subset of *M* features from a set of *N* features, where *M* < *N*, such that the value of an evaluation function is optimized over all subsets of size *M*. The essence of FS is to select or discard features intelligently so that the resulting class distribution is as close as possible to the class distribution obtained with the complete set of features. In other words, FS is not merely a technique for reducing data set cardinality; it should find a trade-off between conflicting objectives. As a multi-objective optimization problem, there are two primary objectives to be optimized: the performance and the number of selected features. These objectives conflict because the optimization algorithm seeks maximum performance with the minimum number of selected features.

Typically, the standard process of FS consists of four primary stages of subset generation, subset evaluation, stopping criterion, and results validation [15].

Regarding subset generation and the search procedure, FS is considered an NP-hard problem. When the number of features equals *n*, the search space comprises 2*<sup>n</sup>* subsets of features. Brute-force search over such a huge space requires exponential running time to traverse all the candidate subsets of features.
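To see why brute-force subset generation is intractable, consider a minimal sketch (the feature names are hypothetical) that enumerates every non-empty candidate subset; the count doubles with each added feature:

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty subset of the feature set: 2**n - 1 candidates."""
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            yield subset

# 4 features already give 15 candidate subsets; 40 features would give ~10**12.
features = ["f1", "f2", "f3", "f4"]
print(len(list(all_subsets(features))))  # 15
```

This is why wrapper methods rely on meta-heuristic search rather than exhaustive enumeration.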

Concerning subset evaluation, there are different methods to assess the goodness of a feature subset, such as filters and wrappers. A stopping criterion is a condition that halts the FS process and prevents an infinite loop; examples include the completion of the search (all feature subsets have been examined), the learning performance reaching its highest limit, a feature subset of a specified size being obtained, a pre-defined number of iterations being reached, or convergence, in which results become stable and no further improvement is achieved. A direct way to validate the obtained results is based on prior domain knowledge. Unfortunately, such knowledge of the features is usually unavailable, so other methods have to be used instead: FS can be validated by comparing the system performance using the whole set of features with its performance using the selected features. FS has many advantages that positively affect the data mining task, including improving the quality of the generated model, speeding up the learning time of the classifier, making the data set easier to interpret, and reducing the need for hardware resources.

#### *4.3. NIAs for Feature Selection*

Two important points should be focused on: the representation of a solution and its evaluation. Normally, a feature subset is represented by a binary vector whose dimensionality equals the number of features in the data set. If a gene value is set to 1, the corresponding feature is selected; otherwise, it is not. The quality of a feature subset is evaluated against two contradictory objectives simultaneously: the classification accuracy (minimum error rate) and the minimal number of selected features. These two criteria are combined in the fitness function shown in Equation (5), where *γR*(*D*) is the error rate of the classification produced by a classifier, |*R*| is the number of selected features in the reduced data set, |*C*| is the number of features in the original data set, and *α* ∈ [0, 1] and *β* = (1 − *α*) are two parameters representing the significance of the classification performance and the length of the feature subset, respectively:

$$Fitness = \alpha \gamma\_R(D) + \beta \frac{|R|}{|C|}. \tag{5}$$
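Equation (5) can be sketched directly in code; the helper name and the default weight α = 0.99 (a value often recommended in the wrapper-FS literature) are illustrative assumptions:

```python
def fs_fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Equation (5): alpha * gamma_R(D) + beta * |R|/|C|, with beta = 1 - alpha.
    Lower is better: it rewards both a low error rate and a small subset."""
    beta = 1.0 - alpha
    return alpha * error_rate + beta * (n_selected / n_total)

# A subset achieving 5% error while keeping 8 of 40 features:
print(round(fs_fitness(0.05, 8, 40), 4))  # 0.0515
```

With α close to 1, accuracy dominates and the subset-size term acts as a tie-breaker between subsets of similar accuracy.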

#### **5. NIAs FS Modifications**

This section highlights the main modification techniques applied in the literature to enhance NIAs as wrapper FS methods. From a review of 156 articles in the domain of modified NIAs-FS, the modification techniques can be classified into nine categories, as depicted in Figure 3: new operators, hybridization, update mechanism, modified population structure, different encoding scheme, new initialization, new fitness function, multi-objective, and parallelism.

**Figure 3.** NIAs FS modification categories.

#### *5.1. New Operators*

This modification integrates a new operator into the original NIA structure to achieve certain targets, such as improving the algorithm's performance, increasing diversity among the population, enhancing the exploitation and exploration processes, facilitating information sharing between the population's individuals, repositioning the worst individuals in the population, and searching along various vectors in the search space [36]. In the literature, several operators have been used to enhance NIA wrappers; some of them are discussed next.

#### 5.1.1. Chaotic Maps

Chaos denotes a state of disorder. In mathematics, a chaotic map is a formula that describes a time-dependent dynamic system. A chaotic system is highly sensitive to its initial conditions: even a small modification of the initial conditions leads to large changes in the outcomes. Although a chaotic system is deterministic and incorporates no randomness, its results are not always predictable [50].

Chuang in [51] integrated two kinds of chaotic maps, logistic maps and tent maps, with Binary Particle Swarm Optimization (BPSO). Equation (6) gives the general mathematical formula of the logistic map, where *Xn* is a number between 0 and 1 representing the ratio of the current population size to the maximum population size and *μ* is a constant between 0 and 4. Equation (7) describes how Chuang used Equation (6) to modify the inertia weight, where *w* is the inertia value in (0, 1) and *t* is the iteration number. The same procedure was followed for the tent map: Equation (8) is its general mathematical formula and Equation (9) is the modified inertia weight using the tent map. Large inertia weight values facilitate more exploration, while small values facilitate more exploitation; hence, chaos theory can be used to balance the two types of search in the search space. In addition, the study showed that Chaos Binary Particle Swarm Optimization (CBPSO) with a tent map achieved higher classification accuracy than CBPSO with a logistic map:

$$X\_{n+1} = \mu X\_n (1 - X\_n) \tag{6}$$

$$w(t+1) = 4.0\, w(t)(1 - w(t))\tag{7}$$

$$X\_{n+1} = \begin{cases} \mu X\_n, & \text{if } X\_n < 0.5\\ 1 - \mu X\_n, & \text{otherwise} \end{cases} \tag{8}$$

$$w(t+1) = \begin{cases} w(t)/0.7, & \text{if } w(t) < 0.7\\ (10/3)\, w(t)(1 - w(t)), & \text{otherwise.} \end{cases} \tag{9}$$
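The chaotic inertia-weight updates of Equations (7) and (9) can be sketched as follows (a minimal illustration; the starting weight 0.48 is an arbitrary value in (0, 1)):

```python
def logistic_w(w):
    """Equation (7): logistic-map chaotic inertia weight, mu fixed at 4.0."""
    return 4.0 * w * (1.0 - w)

def tent_w(w):
    """Equation (9): tent-map chaotic inertia weight."""
    return w / 0.7 if w < 0.7 else (10.0 / 3.0) * w * (1.0 - w)

w_log = w_tent = 0.48
for _ in range(5):
    w_log, w_tent = logistic_w(w_log), tent_w(w_tent)
# Both trajectories wander chaotically but stay inside the unit interval.
print(0.0 < w_log < 1.0 and 0.0 < w_tent < 1.0)  # True
```

Because the sequence is deterministic yet erratic, the inertia weight repeatedly swings between exploration-favoring and exploitation-favoring values without needing a random number generator.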

In the same year, Chuang presented another FS model [52], a filter-wrapper approach based on a correlation-based filter (CFS) and Taguchi chaotic BPSO (TCBPSO). In [53], chaotic maps were applied with BPSO for FS in text clustering. Ahmad in [54] used chaotic maps to modify the SSA algorithm, replacing the C3 random parameter with chaotic sequences, namely the logistic, piecewise, and tent maps; the impact of the chaotic maps in improving SSA was clear. In the same year, the influence of chaotic operators on SSA was investigated in [55], where experiments showed that the logistic map achieved the best performance for SSA among nine chaotic maps. The chaotic multiverse optimization (MVO) FS model was proposed in [56] to cope with some limitations of MVO; tent, logistic, singer, sinusoidal, and piecewise chaotic maps were used, and the results showed that the logistic map increased MVO performance more than the other maps. Sayed in [57] developed a new wrapper FS approach named CWOA, based on the Whale Optimization Algorithm (WOA) and chaos theory, using 10 chaotic maps: Chebyshev, circle, Gauss/mouse, iterative, logistic, piecewise, sine, singer, sinusoidal, and tent. The results showed that the circle chaotic map performed best. In [3], a model based on chaotic Moth Flame Optimization (CMFO) and a Kernel Extreme Learning Machine (KELM) was proposed. In [58], Sayed developed a new FS system composed of the Crow Search Algorithm (CSA) and chaos theory to enhance the performance and convergence speed of CSA. Lately, in [59], the Binary Black Hole optimization Algorithm (BBHA) was modified by embedding 10 chaotic maps into the movement of the stars; this model was called CBBA. The results on three chemical data sets demonstrated that CBBA outperformed BBHA in terms of the number of selected features, classification performance, and computational time.

#### 5.1.2. Rough Set

Rough Set (RS) theory was first described by Zdzislaw Pawlak at the beginning of the 1980s [60]. It is a mathematical concept related to topological operations. In mathematics, RS theory tries to find two approximating sets for the original conventional set (crisp set). The first gives the lower approximation of the crisp set, which comprises the elements that surely belong to the target subset; the second gives the upper approximation, which comprises the elements that possibly belong to the target subset. The pair of rough sets are themselves either crisp sets or fuzzy sets. Rather than elements strictly belonging or not belonging, as in crisp sets, fuzzy sets use a membership function for gradual assessment of elements. Unlike fuzzy sets, RSs rely on finding the positive region, not a membership function, to deal with uncertainty and vagueness. RS theory has many advantages, including the approximation of concepts, reduction of spaces, discovery of equivalence relations, and finding minimal sets of data in vague and uncertain domains. In FS, RS theory is used to define attribute dependency. Zainal in [61] proposed the RS-PSO model for a better representation of data. Another RS-PSO-FS model was proposed in [62] based on PSO-based Relative Reduct (PSO-RR) and PSO-based Quick Reduct (PSO-QR); both tools rely on the dependency measure to compare sets of attributes. In [63], the authors proposed a model for FS in nominal data sets based on BCS and rough sets. Another CS model was introduced in [64] by incorporating RS with different classifiers. In [65], a new model was developed based on two incremental techniques, QuickReduct and CEBARKCC. These are two filtering methods: the former is a rough-set-based filter that simulates the forward generation method, and the latter is a conditional entropy-based method. They were integrated with the Ant Lion Optimization (ALO) algorithm to improve the quality of the initial population. The RS-FA model was developed in [66]. Hassanien in [67] developed a new system based on rough sets and MFO. Lately, in [68], a hybrid model called BPSOFPA, composed of the Flower Pollination Algorithm (FPA) and PSO, was developed and integrated with the RS approach for the FS problem. Ropiak in [69] integrated RSs with deep learning as rough mereological granular computing.
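The lower/upper approximation idea can be sketched with plain sets (the equivalence classes and target set below are hypothetical toy data):

```python
def approximations(equiv_classes, target):
    """Rough-set lower/upper approximation of a target set, given the
    equivalence classes induced by the kept attributes."""
    lower, upper = set(), set()
    for cls in equiv_classes:
        if cls <= target:   # every element of the class surely belongs
            lower |= cls
        if cls & target:    # some element of the class possibly belongs
            upper |= cls
    return lower, upper

# Hypothetical: objects 1..5 grouped by identical attribute values
classes = [{1, 2}, {3}, {4, 5}]
target = {1, 2, 3, 4}          # objects carrying the target label
lo, up = approximations(classes, target)
print(sorted(lo), sorted(up))  # [1, 2, 3] [1, 2, 3, 4, 5]
```

Object 5 is indistinguishable from object 4 under the kept attributes, so the class {4, 5} falls outside the lower approximation (the positive region) even though 4 carries the target label; attribute dependency in RS-based FS is measured from exactly this positive region.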

#### 5.1.3. Selection Operators

Inspired by Darwin's theory [70], which explained the evolution and change of species through the natural selection mechanism, the genetic algorithm incorporates selection operators to select individuals from the population for later breeding. A conventional strategy implements selection using the fitness values of the solutions. In other methods, the fitness values are normalized by dividing each individual's fitness by the sum of all fitness values. Another method sorts all individuals in the population by fitness in descending order. In other studies, selection was applied by computing the accumulated fitness of each individual so that the final individual's accumulated value is one [71]. All such methods become computationally expensive and may degrade the performance of GA as the population grows. Other widely implemented selection methods for GA are Tournament Selection (TS) and Roulette Wheel Selection (RWS). Their stochastic nature makes them simpler to implement and better performing than the aforementioned methods. TS is the most commonly applied selection operator with GA because of its simplicity: it randomly selects a set of solutions from the population, and the best of them is used for breeding the next generation. RWS differs in that no agent in the population is discarded. The RWS strategy creates a roulette wheel on which each individual's fitness score is represented as an area or sector: an individual with a large fitness value occupies a large sector, giving it a larger selection probability, while individuals with small fitness scores occupy small sectors. The final selection is made by spinning the roulette; the selected individual is the one at which the pointer stops. 
Mafarja in [46] developed a new model that combines TS with the WOA optimizer to enhance the exploration of the search. One year later, Mafarja presented in [72] an FS model based on the Grasshopper Optimisation Algorithm (GOA) with RWS and TS. In the same year, Mafarja developed a new wrapper FS model based on WOA while studying the effect of TS and RWS [73]. In [26], selection operators were incorporated to improve the ABC optimizer. In [74], the method comprised a DE optimizer and an RWS structure for the selection of the Wavelet Packet Transform.
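Both operators can be sketched as follows (a minimal illustration with hypothetical toy individuals scored by identity; maximization assumed):

```python
import random

def tournament(population, fitness, k=3):
    """TS: draw k random contestants; the fittest among them wins."""
    return max(random.sample(population, k), key=fitness)

def roulette(population, fitness):
    """RWS: selection probability proportional to each individual's share
    of the total fitness; no individual is ever discarded outright."""
    total = sum(fitness(ind) for ind in population)
    pick = random.uniform(0.0, total)
    running = 0.0
    for ind in population:
        running += fitness(ind)
        if running >= pick:
            return ind
    return population[-1]  # guard against floating-point shortfall

pop = [0.2, 0.5, 0.9, 0.1]
fit = lambda x: x
print(tournament(pop, fit, k=len(pop)))  # with k = |pop|, always 0.9
print(roulette(pop, fit) in pop)         # True
```

The tournament size `k` tunes selection pressure: larger tournaments favor the fittest more strongly, while `k = 2` keeps the population diverse.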

#### 5.1.4. Sigmoidal Function

A sigmoid function is a mathematical function in the S-shaped family and is a special case of the more general logistic function, which is defined by Equation (10), where *e* is the base of the natural logarithm (Euler's number), *x*<sub>0</sub> is the sigmoid's midpoint, *L* is the sigmoid's maximum value, and *k* is the logistic growth rate of the curve [75]. The sigmoid function itself is defined by Equation (11):

$$f(\mathbf{x}) = L/\left(1 + e^{-k(x - x\_0)}\right) \tag{10}$$

$$S(\mathbf{x}) = 1/(1 + \exp(-\mathbf{x})) = \exp(\mathbf{x})/(\exp(\mathbf{x}) + 1). \tag{11}$$

The sigmoid function has some special characteristics, including monotonic behavior: it is defined on all real numbers, and its output increases either from 0 to 1 or from −1 to 1. Moreover, the sigmoid function is differentiable and has a bell-shaped first derivative whose value at each point is non-negative. There are several variations of the sigmoid function, such as the hyperbolic tangent, the arctangent function, and algebraic functions, which are defined by Equations (12)–(14), respectively. The sigmoid function is widely applied as the activation function of Neural Networks (NNs). Another useful application of the sigmoid function is as a discretization method that converts a continuous space into a binary one, as in feature selection:

$$f(\mathbf{x}) = \tanh(\mathbf{x}) = (\mathbf{e}^{\mathbf{x}} - \mathbf{e}^{-\mathbf{x}}) / (\mathbf{e}^{\mathbf{x}} + \mathbf{e}^{-\mathbf{x}}) \tag{12}$$

$$f(\mathbf{x}) = \arctan(\mathbf{x})\tag{13}$$

$$f(\mathbf{x}) = \mathbf{x} / \sqrt{(1 + \mathbf{x}^2)}.\tag{14}$$

For solving the FS problem, Aneesh developed a modified BPSO called Accelerated BPSO (ABPSO); the particles were accelerated through a new velocity update function based on a sigmoidal function [76]. In [6], the sigmoidal function was used with BGWO to solve FS. In [77], different transfer functions that map continuous solutions to binary ones were applied in combination with the CS algorithm; CS-sigmoid and CS-hyperbolic tangent were evaluated on five data sets. In [78], the effect of different transfer functions on the Bat optimization Algorithm (BA) was studied: sigmoid and hyperbolic tangent functions were used to analyze their influence on FS, and the results proved that the sigmoid function reduced features better than the hyperbolic function on almost all data sets. Mafarja, in [79], presented new versions of the Grasshopper Optimization Algorithm (GOA) based on sigmoid and V-shaped TFs in the context of FS.

#### 5.1.5. Transfer Functions

Transfer functions (TFs) are mathematical formulas that play a significant role in mapping a continuous search space to a discrete one. The discrete search space can be viewed as a hyper-cube in which solutions move in different directions within its boundaries by flipping their bit values. TFs are among the most efficient ways to convert continuous meta-heuristic algorithms into their corresponding binary versions [80]; their mathematical formulations can be found in [80]. The update procedure in a binary meta-heuristic algorithm switches solution elements between 0 and 1 based on a mapping formula, the TF, that links the original continuous update procedure with a new binary one. In other words, TFs define the probability of updating each element (gene/feature) of a solution to either selected (1) or not selected (0).

Equations (15) and (16) define the general update formulas of a solution using S-TFs and V-TFs, respectively, where *X<sub>i</sub><sup>d</sup>*(*t* + 1) represents the *i*th element (gene/feature value) of the solution *X* (feature subset) at dimension *d* (feature number/index) in iteration *t* + 1, and rand ∈ [0, 1] is generated using a random probability distribution:

$$X\_i^d(t+1) = \begin{cases} 0, & \text{if } rand < S\_TF(X\_i^d(t+1)) \\ 1, & \text{if } rand \ge S\_TF(X\_i^d(t+1)) \end{cases} \tag{15}$$

$$X\_{t+1} = \begin{cases} X\_t, & \text{if } rand < V\_TF(X\_{t+1})\\ \neg X\_t, & \text{if } rand \ge V\_TF(X\_{t+1}). \end{cases} \tag{16}$$

These can be reformulated to preserve the search concepts of any specific meta-heuristic algorithm. As an example, PSO was converted by Kennedy and Eberhart [81] from a real-valued algorithm to a binary algorithm. The binary conversion of PSO starts by employing a sigmoid function to convert the velocity values into probability values bounded in the interval [0, 1], as in Equation (17), where *T*(*v<sub>i</sub><sup>d</sup>*(*t*)) maps the velocity of particle *i* at dimension *d* in iteration *t*. In the next step, the computed probabilities are used to update the position vector using Equation (18). To preserve PSO's continuous searching method and keep the concepts of pbest/gbest, the TF gives a high probability of switching gene values to genes with high velocity values, since they are far from the best solution; a small probability is given to genes with small velocity values, since they are considered close to the best solution [80]:

$$T(v\_i^d(t)) = 1/\left(1 + e^{-v\_i^d(t)}\right) \tag{17}$$

$$X\_i^d(t+1) = \begin{cases} 0, & \text{if } rand < TF(v\_i^d(t+1)) \\ 1, & \text{if } rand \ge TF(v\_i^d(t+1)). \end{cases} \tag{18}$$
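Equations (17) and (18) can be sketched as follows (a minimal illustration that follows the sign convention of Equation (18) as written; the example velocities are hypothetical):

```python
import math
import random

def sigmoid_tf(v):
    """Equation (17): map a velocity value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def binarize(velocities, rand=random.random):
    """Equation (18): decide each bit by comparing rand with TF(v)."""
    return [0 if rand() < sigmoid_tf(v) else 1 for v in velocities]

random.seed(0)
bits = binarize([-6.0, 0.0, 6.0])   # e.g. a feature mask: 1 = feature selected
print(all(b in (0, 1) for b in bits))  # True
```

A large |v| pushes the TF output toward 0 or 1, making the bit decision nearly deterministic, while v ≈ 0 leaves it close to a fair coin flip.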

In the literature, several studies adopted TF operators for the FS problem. Mirjalili in [80] improved the performance of BPSO using S-shaped and V-shaped TFs; the V-TFs improved the performance of BPSO more than the S-TFs. In [82], a new wrapper was developed by modifying the Salp Swarm Algorithm (SSA) using TFs; the proposed approach achieved significant superiority over other competitive approaches on 90% of the data sets. Mafarja in [83] presented a new wrapper FS method based on a modified Dragonfly Algorithm (DA) using time-varying S-shaped and V-shaped TFs. Recently, in the context of Internet of Things (IoT) attack detection, a new wrapper-based approach using WOA was developed; the augmented WOA used both V-shaped and S-shaped transfer functions.

#### 5.1.6. Crossover

In living organisms, chromosomal crossover is a recombination process between non-sister chromatids that exchanges genetic material during sexual reproduction, producing new recombinant chromosomes. Inspired by this biological process, genetic algorithms and evolutionary computation use crossover to exchange information between solutions in the population and generate new offspring in the next generation. In the genetic algorithm, recombination (crossover) is defined as a stochastic operator that enforces diversity in the population by swapping the bits after a random cutting point (crossover point) between the parent vectors (selected individuals) to produce new children (offspring). Equation (19) shows how a crossover operator combines solutions, where ✶ performs the crossover scheme on the two binary solutions *Xi* and *Xi*−1. In a binary space, crossover can be realized by exchanging the binary bits of two solutions to obtain an intermediate solution. Equation (20) shows a crossover mechanism that switches between two input vectors with equal probability, where *Xd* is the value of the *d*th dimension in the vector yielded by applying the crossover operator to *Xi* and *Xi*−1:

$$X\_{i}^{t+1} = \star(X\_{i}, X\_{i-1}) \tag{19}$$

$$X^d = \begin{cases} X\_1^d, & \text{if } rand \ge 0.5\\ X\_2^d, & \text{otherwise.} \end{cases} \tag{20}$$
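Equation (20) amounts to a uniform crossover, which can be sketched as follows (the parent vectors are hypothetical):

```python
import random

def uniform_crossover(parent_a, parent_b, rand=random.random):
    """Equation (20): each gene is taken from one parent or the other
    with equal probability 0.5."""
    return [a if rand() >= 0.5 else b for a, b in zip(parent_a, parent_b)]

random.seed(1)
child = uniform_crossover([1, 1, 1, 1, 1], [0, 0, 0, 0, 0])
print(len(child) == 5 and set(child) <= {0, 1})  # True
```

Because the choice is made independently per dimension, the child mixes both parents' feature masks far more thoroughly than a single-point cut would.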

In [84], a crossover operator was applied in combination with the sigmoid function to modify the Binary Grey Wolf Optimizer (BGWO). The BGWO1 approach converts the continuous version of GWO (CGWO) into a binary version: the first steps toward the three best solutions are converted to binary, and a random crossover is then applied among them to find the updated position. This approach positively affected the performance of GWO. In [82], the crossover operator was applied to improve the Salp Swarm Algorithm (SSA) in solving the FS problem; its job was to increase the diversity of the model and improve exploration of the search space. The study in [73] incorporated many modification strategies into WOA, with priority given to solving WOA's limitations of local minima and slow convergence; crossover was used to achieve this target. Mafarja in [79] applied multiple operators with GOA; in the BGOA-M approach, the combination operator was applied together with mutation to achieve more exploration.

#### 5.1.7. Mutation

In organisms, a mutation is an error that occurs during DNA replication (meiosis); specifically, it results from a permanent deletion, insertion, or alteration of a DNA segment (a nucleotide sequence of the genome). Even though this is a small genomic error, it causes notable changes in the characteristics of the organism. Evolutionary and genetic algorithms borrow the same idea to make changes and increase diversity in the population. The advantage of mutation is that it prevents solutions from becoming similar and thus ensures that evolution does not stop. Mutation operators alter one or more gene values (bits in the chromosome vector), changing the solution from its previous state. Besides diversity, mutation can help mitigate the local minima problem. Equation (21) defines the mutation process, where *X<sub>i</sub><sup>d</sup>*(*t* + 1) is the element at the *d*th dimension of solution *Xi* in iteration *t* + 1:

$$X\_i^d(t+1) = \begin{cases} 0, & \text{if } rand \ge 0.5\\ 1, & \text{otherwise.} \end{cases} \tag{21}$$

In [85], mutation was applied to a PSO solution after it was updated; with a probability of commonly 1/*n*, one bit of the solution is mutated (flipped). The model proved the effectiveness of the modified PSO-FS approach. In [53], the authors developed a hybrid intelligent algorithm that combined mutation with BPSO and other operators to solve FS in text clustering; the model attained higher clustering accuracy and improved the convergence speed of BPSO. In [79], the mutation operator was applied with the GOA optimizer, and the BGOA-M approach achieved superiority over the compared approaches. In [86], an Improved Harris Hawks Optimization (IHHO) was proposed based on elite opposite-based learning, mutation neighborhood search, and rollback strategies to increase the search performance.
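A bit-flip mutation with the commonly used probability 1/*n* can be sketched as follows (a minimal illustration; the solution vector is hypothetical):

```python
import random

def mutate(solution, p=None, rand=random.random):
    """Flip each bit with probability p; p defaults to 1/n for an n-bit solution."""
    if p is None:
        p = 1.0 / len(solution)
    return [1 - bit if rand() < p else bit for bit in solution]

random.seed(2)
before = [0, 1, 0, 1, 0, 1]
after = mutate(before)
# Same length, still binary; roughly one bit flipped on average per call.
print(len(after) == len(before) and set(after) <= {0, 1})  # True
```

With p = 1/n the expected number of flipped bits per solution is exactly one, which perturbs the feature mask without destroying most of the information it carries.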

#### 5.1.8. Levy Flight

Levy flight has its roots in chaos theory. It describes a random walk whose step lengths follow a heavy-tailed probability distribution, taking place either on a discrete grid or in a continuous space. In mathematics, according to a generalized central limit theorem, the steps of a random walk from the origin follow a stable distribution, which can be modeled using the equations of Levy flights. Researchers found that Levy flights describe animal hunting patterns in nature, especially when prey is sparsely distributed and not easily detected, as opposed to Brownian motion, which only approximates movement when hunting near abundant and predictable prey [87]. In [64], a novel Cuckoo Search (CS) algorithm was developed using Levy flight with rough sets; the Levy flight random probability distribution was integrated into the equation that generates new solutions, as shown in Equation (22), where ⊕ denotes entry-wise multiplication, *α* > 0 is the step size, and *Levy*(*λ*) is the Levy distribution described in Equation (23). In [88], Levy flight was used in combination with transfer functions to enhance the performance of the MFO algorithm and increase diversity:

$$X\_{i}^{t+1} = X\_{i}^{t} + \mathfrak{a} \oplus Levy(\lambda) \tag{22}$$

$$Levy(\lambda) \sim u = t^{-\lambda}, \quad 1 < \lambda \le 3. \tag{23}$$
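The update of Equation (22) can be sketched using Mantegna's algorithm, a common way (not necessarily the one used in [64]) to draw steps from the heavy-tailed distribution in Equation (23); the exponent 1.5 and step size 0.01 below are typical illustrative choices:

```python
import math
import random

def levy_step(lam=1.5):
    """Draw one Levy-flight step via Mantegna's algorithm for exponent lam."""
    num = math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
    den = math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2)
    sigma = (num / den) ** (1 / lam)
    u = random.gauss(0.0, sigma)
    v = random.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / lam)

def cuckoo_update(x, alpha=0.01):
    """Equation (22): X(t+1) = X(t) + alpha (entry-wise) Levy(lambda)."""
    return [xi + alpha * levy_step() for xi in x]

random.seed(3)
new = cuckoo_update([0.5, 0.5, 0.5])
print(len(new) == 3 and all(math.isfinite(v) for v in new))  # True
```

Most steps are small (local exploitation), but the heavy tail occasionally produces a very long jump, which is what lets Levy-based search escape local optima.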

#### 5.1.9. Other Operators

A local search operator was incorporated with GA to mitigate the weakness of standard GA in fine-tuning near local minima [89]. In [90], local search was used to improve BPSO. A new local search and gbest-resetting strategy called PSO-LSRG was proposed in [24] to facilitate exploitation. A Uniform Combination (UC) operator was used in [80] to improve the performance of BPSO; UC was later adopted in [91] to balance the exploitation and exploration of bare-bones PSO. The DE evolutionary operator, which includes mutation, crossover, and selection, was used in [5] to solve the local optima problem of standard WOA. Boolean algebra (the AND operator) was used in BPSO [92]. The bacterial evolutionary algorithm and the PSO algorithm, both in a plain and a memetic variant complemented with gradient-based local search and fuzzy numbers, were used in [93] to solve various resource allocation problems.

A catfish strategy was applied in [94] to improve the performance of BPSO by introducing new particles into the search space when the search process shows no improvement, for example, when the gbest is unchanged over a consecutive number of iterations. The catfish particles replace the particles with the worst fitness and initialize a new search from the extreme positions of the search space. Feature subset ranking was introduced in [95]. The idea was to compute the significance of each feature according to its classification accuracy, compute the accuracy for some combinations of these ranks, and then use the BPSO wrapper approach to search over the top-ranked feature subsets instead of the whole feature set.

A Gaussian operator was introduced in [96], motivated by the observation that FS is highly influenced by feature interaction. Features highly relevant to a class label may have strong interactions with other features, which makes them redundant; on the other hand, features irrelevant to a class label may have weak interactions with other features. As feature interaction is a challenge for classification and FS, a statistical clustering method based on the Gaussian distribution was adopted. It groups homogeneous features based on the interactions between features, and then the PSO algorithm selects one feature from each cluster. A threshold was adopted in [97]. The idea was to set a nonzero threshold based on the number of trials for which BPSO was run. The significance of a particular dimension is measured by the frequency with which that dimension appears in the gbest vector across all runs. The final gbest after thresholding contains only the most recurrent features.
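The thresholding idea of [97] reduces to counting how often each dimension appears in the gbest vectors over several runs. A small illustrative sketch (function and parameter names are ours, not from [97]):

```python
def threshold_features(gbest_runs, threshold):
    # gbest_runs: list of binary gbest vectors, one per BPSO run.
    # Keep only dimensions whose frequency of appearance across runs
    # meets the chosen nonzero threshold.
    n_runs = len(gbest_runs)
    n_dims = len(gbest_runs[0])
    freq = [sum(run[d] for run in gbest_runs) / n_runs for d in range(n_dims)]
    return [1 if f >= threshold else 0 for f in freq]
```

For example, with gbest vectors from three runs, a threshold of 0.5 keeps only the features selected in at least half of the runs.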

Zhang in [91] used Gaussian sampling based on pbest and gbest, instead of velocity, to compute the positions of particles. Another operator, called reinforced memory, was incorporated. Reinforced memory is based on the idea of enhancing the survival probability of outstanding genes, i.e., the important features with high fitness value in the current iteration. Consequently, the update of the local leaders (pbest) of each particle avoids gene degradation and preserves these genes in the next iteration. Hamming distance was used in [98] to replace the Euclidean distance in BPSO. In particular, it measures the distance between two binary vectors by applying the Exclusive-OR (XOR) operator and counting the number of ones in the resulting vector. In [99], a new model called Hybrid Particle Swarm Optimization Local Search (HPSO-LS) was proposed based on using local search with correlation information. The correlation information was used to guide the local search in PSO. This was carried out by including the most dissimilar (low-correlated) features as a feature subset in the newly generated particles. Consequently, similar (highly correlated) features have less chance of being selected in a feature subset. Moreover, HPSO-LS used a specific subset-size determination scheme that allows PSO to search within a bounded region and find a smaller number of features.
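The XOR-based Hamming distance of [98] is a one-liner over binary feature vectors; this sketch (name and signature are ours) shows the operation described above:

```python
def hamming_distance(a, b):
    # XOR each pair of bits, then count the ones in the resulting vector.
    assert len(a) == len(b)
    return sum(x ^ y for x, y in zip(a, b))
```

On binary vectors this equals the squared Euclidean distance, but it is computed with cheap bitwise operations only.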

Binary quantum was used in [100] to modify and improve PSO. The idea was to sample around the personal bests, compute the mean best (mbest) of the sampled points, and then introduce this value into BQPSO. Any bit position of the mean best equals 1 if 1 appears more often than 0 in the corresponding bit positions of all pbests; if 1 and 0 have the same frequency, the element of the mbest is set randomly to either 0 or 1. A re-initialization strategy was applied to PSO-mGA in [101]. The idea was to use a small population (3–6 chromosomes) with a re-initialization strategy to achieve convergence. A non-replaceable memory operator was added to keep the original swarm intact during the optimization process, which helps increase swarm diversity. Moreover, the non-replaceable memory was used to maintain a secondary swarm with a leader and followers. Zhang in [102] developed a new wrapper-based approach utilizing the Firefly Algorithm (FA) with return-cost, Pareto dominance-based, and adaptive movement operators. A return-cost indicator was used to compute attractiveness: the firefly is cloned based on return cost instead of distance, so that a firefly with a large return and small cost has a greater chance of being cloned. A Pareto dominance-based operator was added; Pareto dominance, commonly used in multi-objective optimization, serves here as a selection strategy to identify the more attractive of two fireflies based on cost and return. An adaptive jump was used in place of the fixed uniform jump; it changes the jump probability as a linear function of the number of iterations to allow for more exploration.
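The bitwise mean-best rule described for BQPSO [100] is a per-position majority vote over all pbests, with random tie-breaking. A minimal sketch (function name and tie-break RNG are illustrative):

```python
import random

def mean_best(pbests, rng=None):
    # Bit d of mbest is 1 when 1 is the majority at position d across all
    # pbests, 0 when 0 is the majority, and random on a tie.
    rng = rng or random.Random(0)
    n = len(pbests)
    mbest = []
    for bits in zip(*pbests):
        ones = sum(bits)
        if 2 * ones > n:
            mbest.append(1)
        elif 2 * ones < n:
            mbest.append(0)
        else:
            mbest.append(rng.randint(0, 1))
    return mbest
```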

In [103], a greedy search was used to enhance the local search, and three modified versions of the Lion Algorithm (LA) (Lion M1, Lion M2, and Lion M1+M2) were proposed to improve it further. Mafarja in [72] applied a new methodology based on BGOA and the Evolutionary Population Dynamics (EPD) operator. EPD makes a local change in the population instead of applying an external force, an idea that comes from the theory of Self-Organized Criticality (SOC). Hancer in [26] developed a new version of the DisABC algorithm for FS by introducing a DE-based neighborhood mechanism into the similarity-based search of DisABC. DE evolutionary operators, including mutation, crossover, and selection, were also used in [5] to solve the local-optima problem of native WOA. Khushaba in [74] developed a modified FS method called DEFS using a repair mechanism based on feature distribution measures and the RWS structure. A new model was developed in [104] based on GA and m-features (the OR operator). The OR operator performed a search-space reduction and improved GA performance and convergence. Zeng in [105] developed a novel GA with a new population structure and a new operator called dynamic neighboring, a selection strategy used to boost the capabilities of GA for the FS problem. In [106], Guo proposed a new repair operator that allowed GA to transform feature subsets from arbitrary combinations into valid combinations that conform to the feature-model constraints and domain-specific objective function.

#### *5.2. Hybridization*

Hybridization means integrating more than one algorithm to build a powerful predictive framework that combines the strengths of the integrated algorithms. The expectation is that combining the complementary features of different optimization strategies achieves better performance than implementing them separately as pure paradigms. Several categories of NIA hybridization techniques have been investigated in the literature, such as combining an NIA with another NIA, or combining an NIA with algorithmic components from other areas of optimization, such as tree search, dynamic programming, and constraint programming [107].

#### 5.2.1. NIA-NIA Hybridization

In memetic models, a single-solution algorithm is embedded in a population-based algorithm to enhance local search and exploitation of the search space. These algorithms operate in two search stages. In the first stage, the algorithm captures a global view of the search space; in the second stage, it focuses on the most promising area to perform a successive process of local search. The exploration/exploitation balance is thus maintained and premature convergence is avoided. In [4], Zawbaa developed a novel hybrid GWO-ALO system that exploits the global search ability of GWO and the local search performance of the Ant Lion Optimization (ALO) algorithm. In [65], Mafarja developed a hybrid model based on BALO and hill-climbing techniques called HBALO. A new hybrid algorithm was presented in [108] by combining the Clonal Selection Algorithm (CSA) with the Flower Pollination Algorithm (FPA): CSA was good at exploitation, while FPA was good at exploration via Levy flight. In [109], the Mine Blast Algorithm (MBA) was used to support the exploration phase; MBA was integrated with simulated annealing, which optimizes the local search in the exploitation phase to get closer to the optimal solutions. Ibrahim in [110] designed a hybrid SSA-PSO model. He integrated the update strategy of PSO into the structure of SSA so that the current population is updated by either SSA or PSO depending on the quality of the fitness function. A PSO-mGA (micro Genetic Algorithm) model was presented in [101], and the ACO-DE model was developed in [23]. A novel SA-MFO model was presented by Sayed in [111]; SA was used to slow the convergence rate, reach the global optima, and escape local minima. A new MFO-based hybrid model was developed in [112] by combining the MFO and Levy FA (LFA) algorithms.
Another target of NIA-NIA hybridization is to refine the best solutions by implementing the NIAs sequentially as a pipeline, where the operators of the first algorithm are applied first and then the operators of the other integrated algorithms are applied in turn. These models often suffer from a slow search process. This hybridization strategy was applied in [113] to develop the PSO-GA model. In [46], the WOA-SA model was developed: in WOASA-1 (Low-Level Teamwork Hybrid (LTH)), SA was used as an operator inside WOA to enhance exploitation, whereas in WOASA-2 (High-Level Relay Hybrid (HRH)), SA was used after WOA to enhance the final solution. In 2020 [114], SA was hybridized with the HHO algorithm and the AND and OR bitwise operations; SA was used to help the HHO optimizer escape local minima in the feature search space. A new hybrid binary version of the Bat Algorithm (BA) was suggested to solve FS problems: in [115], BA was hybridized with an enhanced version of the DE algorithm to reach the global solution. Hybridizing different NIAs to perform parallel exploration of the search space was also a primary target of other studies. Each algorithm generates its initial population and iteratively explores and evaluates the feature subsets; this strategy increases the speed of the search process. ACO-GA is an example of these hybrid models [116,117]. Recently, in [118], an enhanced hybrid approach using GWO and WOA was proposed to alleviate the drawbacks of both algorithms.
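The relay-style hybrids above end with a refinement stage applied to the best solution found by the global optimizer. The sketch below shows such a refinement with simulated-annealing bit-flips, in the spirit of the HRH variant of [46]; the function name, cooling schedule, and defaults are illustrative assumptions, not the published algorithm:

```python
import math
import random

def sa_refine(solution, fitness, iters=300, t0=1.0, seed=0):
    # Refine a binary feature vector returned by a global optimizer
    # (e.g., WOA). `fitness` is maximized; worse moves are accepted with
    # a probability that falls as the temperature cools linearly.
    rng = random.Random(seed)
    cur = list(solution)
    cur_f = fitness(cur)
    best, best_f = list(cur), cur_f
    for k in range(iters):
        t = t0 * (1 - k / iters) + 1e-9          # linear cooling toward 0
        cand = list(cur)
        cand[rng.randrange(len(cand))] ^= 1      # flip one feature bit
        f = fitness(cand)
        if f > cur_f or rng.random() < math.exp(min(0.0, (f - cur_f) / t)):
            cur, cur_f = cand, f
            if f > best_f:
                best, best_f = list(cand), f
    return best, best_f
```

Because the refinement only perturbs one bit at a time, it exploits the neighborhood of the relayed solution rather than re-exploring the whole space.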

Another target of NIA-NIA hybridization is to enhance the initialization of the search using different NIAs. In these models, one algorithm is used to generate the initial solutions, and the other combined algorithm is used to update them. An example of these models is GA-IGWO, presented in [119]. In [120], the hybridization of two Immune Firefly Algorithms (IFA1 and IFA2) was proposed. In IFA1, the FFA and the Artificial Immune System (AIS) are used simultaneously to increase the global search of fireflies and select the best feature subset. IFA2 was used to study the influence of the initial population on the search progress of the AIS algorithm.

#### 5.2.2. NIA-Classifier Hybridization

Different classifiers, such as SVM, Artificial Neural Networks (ANN), Radial Basis Function (RBF) networks, Optimum Path Forest (OPF), bagging, and Bayesian statistical models, have been hybridized with NIAs for evaluating the solutions. Since classifiers differ in training speed, computational complexity, and generalization capability, many studies investigated their influence when used within wrapper frameworks. Other studies performed simultaneous FS and parameter optimization to enhance classifier performance; the NIA in these hybrid models works as a tuner that optimizes the training parameter setup and selects the optimal feature subset. In [121], a new wrapper approach was built to perform FS in parallel with the optimization of SVM parameters by exploiting the merits of MVO. Another hybrid model was presented in [122] for optimizing the SVM parameters simultaneously with selecting the best feature subsets using a GOA optimizer.

#### 5.2.3. NIA-Filter (Wrapper-Filter) Hybridization

The filter-wrapper hybrid model is applied in two ways. First, a filter is applied to eliminate redundant and irrelevant features, reduce the dimensionality, and produce a reduced data set ready to be used by a wrapper. The second way is to use the filter inside the structure of a wrapper to evaluate the generated feature subsets. In [123], Information Gain and correlation-based filters were integrated with BPSO in models called IG-IBPSO and CB-IBPSO, respectively, to solve FS. In [17], an MSPSO-F-score model was developed. A mutual information filter was integrated with PSO and presented as a model called MI-PSO in [124]. PSO-MI and PSO-Entropy were developed in [125]. CS-MI was developed in [126]. BALO with the QuickReduct and CEBARKCC filtering approaches was developed in [65]. In [5], IWOA-IG was developed. The ACO-MI model was presented in [127], and ACO with a multivariate filter in [16]. The GA-MI model was presented in [128], GA-IG in [18,129], and GA-Entropy in [130]. In [131], Relief-F was used with DE to rank the most significant features. Lately, in [132], an Embedded Chaotic Whale Survival Algorithm (ECWSA) was proposed as a combined wrapper process and filter method. In [133], an efficient hybrid model combining a filter and an evolutionary wrapper approach was proposed for sentiment analysis of various topics on Twitter; the classification system was based on an SVM classifier and two FS methods using the ReliefF and MVO algorithms. The authors in [134] proposed a filter-wrapper approach using Sequential Floating Forward Search (SFFS) to acquire features for activity recognition. The model was validated using a benchmark dataset with a multiclass Support Vector Machine (SVM), and the results show that the system remains effective even with limited hardware resources.

#### *5.3. Update Mechanism*

The update modification aims to balance the exploration and exploitation processes. The update strategy either enhances the update process of individuals or dynamically controls the NIA parameters. A new variant of ACO was presented in [25,135]; its update strategy used performance and the number of selected features as heuristic information for ACO, with no need for prior information about features. In [136], the gbest was updated conditionally: the strategy determines when to reset the gbest based on the number of epochs (iterations) over which the value of the gbest did not change. The same strategy was applied in [24,90]. Martinez in [137] argued that initializing and updating all particles is not beneficial in a high-dimensional space; hence, only a small subset of particles is randomly selected to be updated. A particle is updated by filling it with active features from the current particle, the local best, and the global best. This strategy was applied to the original PSO to obtain a new variant called CuPSO.

In [138], a new rule to update particle positions was proposed. The original rule in BPSO gives equal probabilities to selecting or not selecting a feature, i.e., $P(x_i^d(t) = 0) = P(x_i^d(t) = 1) = 0.5$, where $x_i^d(t)$ is the gene in dimension *d* of the position vector at iteration *t*. The new rule increases the probability of $x_i^d(t+1) = 0$ and reduces the probability of $x_i^d(t+1) = 1$. The idea in [139] starts from the observation that pbest is usually updated based on the fitness value; however, if a new position has the same fitness value as the current pbest, the pbest is not updated even if the new solution corresponds to a smaller feature subset. This is a limitation of PSO. The proposed PSO updates pbest and gbest in two stages, where priority is given first to classification accuracy; then, if the new particle position has the same performance as the current pbest but a smaller number of features, the pbest is updated and replaced by the new position.
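The two-stage pbest update of [139] can be written as a simple tie-breaking rule: accuracy first, subset size second. A minimal sketch (function name and signature are ours):

```python
def update_pbest(pbest, pbest_acc, new_pos, new_acc):
    # Stage 1: a strictly higher classification accuracy always wins.
    if new_acc > pbest_acc:
        return new_pos, new_acc
    # Stage 2: equal accuracy, prefer the smaller feature subset
    # (fewer ones in the binary position vector).
    if new_acc == pbest_acc and sum(new_pos) < sum(pbest):
        return new_pos, new_acc
    return pbest, pbest_acc
```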

In [96], the objective was to update PSO based on a clustering approach. The new GPSO uses a Gaussian distribution: homogeneous features are grouped based on the interactions between features, and then PSO selects one representative feature from each cluster. Mafarja in [140] proposed five update strategies for the inertia weight (*w*) parameter: linear decreasing, non-linear decreasing, coefficient decreasing, oscillating, and logarithmic decreasing. His idea was to apply exploration more than exploitation at the beginning of the search and then search the promising regions carefully to find the global optima. The conclusion was that gradually decreasing the inertia weight (*w*), either linearly or non-linearly, improves BPSO. Mafarja in [141] studied the influence of the inertia weight (*w*) parameter on the performance of BPSO and suggested adapting exploration and exploitation by using a rank-based update of the inertia weight (*w*). The same author presented in [83] a time-varying update strategy to improve the performance of the DA optimizer. In [142], Aljarah applied several asynchronous update strategies to solve the FS problem; an adaptive update strategy based on a descending linear function was used to update the SSA c1 parameter.
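Two of the decreasing inertia-weight schedules discussed above can be sketched as follows; the bounds 0.9 and 0.4 are common PSO defaults rather than values from [140], and the quadratic form of the non-linear schedule is one illustrative choice:

```python
def inertia_weight(t, t_max, w_max=0.9, w_min=0.4, mode="linear"):
    # Decrease w from w_max to w_min as iterations progress, shifting the
    # swarm from exploration (large w) toward exploitation (small w).
    frac = t / t_max
    if mode == "linear":
        return w_max - (w_max - w_min) * frac
    if mode == "nonlinear":
        # One possible non-linear decrease: quadratic decay.
        return w_min + (w_max - w_min) * (1 - frac) ** 2
    raise ValueError("unknown mode: " + mode)
```

Both schedules start at `w_max` and end at `w_min`; the non-linear one drops faster early on, spending more of the run in exploitation.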

Recently, in [143], a Binary DA (BDA) was proposed with new mechanisms to update its main coefficients. The main target is to apply the survival-of-the-fittest principle using different functions such as linear, quadratic, and sinusoidal ones. Three variants of BDA, namely linear-BDA, quadratic-BDA, and sinusoidal-BDA, were introduced and compared with the standard DA. Recently, in [144], a time-varying number of leaders and followers in a binary SSA (TVBSSA) with a Random Weight Network (RWN) was proposed. In 2020, the CSA algorithm was enhanced in [145] using three strategies to solve the FS problem: an adaptive awareness probability to balance exploration and exploitation, a dynamic local neighborhood to improve local search, and a global search strategy to increase the global exploration of the crow.

In [146], an enhanced Binary Global Harmony Search algorithm, called IBGHS, was proposed to solve FS problems; an improved step was proposed to enhance the global search ability. In [147], a new update strategy based on ranking the individuals was proposed. Each moth in the MFO algorithm is given a rank based on its fitness value: a moth with a small fitness value receives a high rank, leading to a large change in its position, whereas a moth with a high fitness value receives a small rank, leading to a small change in its position. This adaptive update strategy enhanced the performance of the optimizer. In [148], a time-varying flame strategy was proposed to enhance the MFO algorithm. The number of flames represents the number of best solutions, and it decreases gradually across iterations. Different mathematical formulas were tested to decide the best formula that ensures exploitation around the best solution in the late stages.
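A linearly decreasing flame count is the simplest instance of the time-varying flame idea; this sketch follows the standard MFO formula for the number of flames (the function name is ours, and [148] experimented with several alternative formulas):

```python
def flame_count(t, t_max, n_moths):
    # Shrink the pool of best solutions (flames) linearly from n_moths at
    # the first iteration down to 1 at the last, so that late iterations
    # exploit only around the single best flame.
    return round(n_moths - t * (n_moths - 1) / t_max)
```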

#### *5.4. Modified Population Structure*

Zeng in [105] developed a novel GA with a dynamic chain-like agent population structure (CAGA), aimed at enhancing population structure and diversity. This was better than the lattice-like agent population structure, in which agents perform genetic operations only with neighboring agents. In [101], Mistry used a new population structure for PSO-mGA: a small-population secondary-swarm strategy, in which the secondary swarm performs a collaborative role to avoid stagnation and overcome premature convergence.

#### *5.5. Different Encoding Scheme*

Galbally in [149] tried to minimize the verification error rate in an online signature system. Different encoding schemes were used, including binary and integer coding: GA with binary coding was used to search the complete search space, while GA with integer coding was used to search a subset of the search space. A GA with an optimized descriptor weight and/or optimal descriptor subset was developed in [150] over MPEG-7. There were three different encoding schemes: a real-coded chromosome for weight optimization, a binary-coded chromosome for the selection of the optimal feature descriptor subset, and bi-coded chromosomes for simultaneous weight optimization and optimal feature descriptor selection. A new ensemble classifier was proposed in [151], based on AdaBoost learning and a parallel GA; a hybrid parallel-GA-AdaBoost model with different encoding schemes, BGAFS and BCGAFS, was proposed.

#### *5.6. New Initialization*

In [53], the authors developed a hybrid model based on BPSO to solve the FS problem. A new initialization strategy called Opposition-Based Learning (OBL) was proposed. The OBL strategy enhances the initialization of particles and enforces diversity among solutions by considering each solution and its opposite simultaneously. OBL was also used to generate the opposite position of the gbest particle to escape stagnation. A novel framework based on IGWO and a Kernel Extreme Learning Machine (KELM) was developed in [119]. In the GA-IGWO-KELM model, GA was applied first to generate high-quality and diversified initial positions, and then GWO was used to update the positions of the individuals in the discrete search space. Tubishat in [5] developed a hybrid model called IWOA-SVM-IG; the OBL strategy was applied to increase the level of diversity in the initial solutions generated by standard WOA. In [152], a quasi-oppositional learning-based Multi-Verse Optimization (MVO) algorithm was used to improve the initial setup of solutions.
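The core of OBL initialization is pairing each random solution with its opposite, `lb + ub - x`, and keeping the better candidates. A minimal continuous-space sketch (function name, minimization convention, and selection of the best half are our illustrative choices):

```python
import numpy as np

def obl_init(pop_size, dim, lb, ub, fitness, rng=None):
    # Draw a random population, build the opposite population lb + ub - x,
    # evaluate both, and keep the best pop_size solutions (minimization).
    rng = rng or np.random.default_rng()
    x = rng.uniform(lb, ub, (pop_size, dim))
    pool = np.vstack([x, lb + ub - x])
    scores = np.array([fitness(p) for p in pool])
    return pool[np.argsort(scores)[:pop_size]]
```

Evaluating both a point and its mirror image doubles the chance that the initial swarm starts near a good region, at the cost of one extra evaluation per individual.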

#### *5.7. New Fitness Function*

Chakraborty [153] proposed a PSO algorithm in which the fitness evaluation of each particle is based on ambiguity. The new fuzzy evaluation function measures the fuzziness of a fuzzy set; the best feature is the one with minimum intraclass ambiguity and maximum interclass ambiguity. In [154], GA was combined with Fisher's Linear Discriminant function in a model called GA-FLD. The new evaluation function estimates the probability distribution of the class in the N-dimensional feature space. It also uses the cardinality of the feature subset via covariance matrices, an extension of FLD, to measure the statistical properties of the feature subset. The authors in [53] developed BPSO with a new fitness function based on a dynamic inertia weight: high inertia weights are assigned to particles with low fitness values to facilitate more exploration of the search space, while low inertia weights are assigned to particles with high fitness to facilitate more exploitation. In [6], GWO was modified using several fitness functions: accuracy, Hausdorff distance, Jeffries–Matusita (JM) distance, the weighted sum of accuracy and Hausdorff distance, and the weighted sum of accuracy and JM distance. In [155], different fitness functions were used to enhance the performance of the MFO algorithm. The best fitness function was the one applied across two stages: the first stage optimizes the classification performance only, while the second stage also takes the number of genes into consideration. The results show that the proposed fitness functions achieve better classification results than a fitness function that accounts only for classification performance.
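The weighted-sum fitness functions mentioned above typically combine the classification error with the relative subset size. A generic sketch of this pattern (the function name and the weight `alpha=0.99` are common illustrative choices, not values from [6]):

```python
def fs_fitness(accuracy, n_selected, n_total, alpha=0.99):
    # Weighted sum of the classification error rate and the fraction of
    # selected features; lower is better, so the optimizer is rewarded
    # both for accuracy and for compact subsets.
    return alpha * (1.0 - accuracy) + (1.0 - alpha) * n_selected / n_total
```

With `alpha` close to 1, accuracy dominates and subset size only breaks near-ties between equally accurate subsets.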

#### *5.8. Multi Objective*

Zio [156] developed a system for nuclear plants based on GA to select among the many measured plant parameters. The first approach was a single-objective GA with a fuzzy k-Nearest Neighbor (KNN) classifier; multi-objective approaches were then applied. Mandal in [157] developed a prediction system based on a multi-objective PSO that satisfies the Pareto front and trades off between the non-dominated solutions based on different objectives. The proposed multi-objective PSO FS algorithm performed a dual task: the first objective was maximizing the mutual information between a feature and the class label (relevance), and the second was minimizing the mutual information among the features (redundancy). A Dynamic Locality Multi-Objective SSA for FS was proposed in [158]. In [159], a multi-objective FS method was proposed based on bacterial foraging optimization. In [160], a multi-objective PSO modified by Levy flight was proposed for intrusion detection in the Internet of Things (IoT); an RWS mechanism was used to remove redundant features, and information exchange mechanisms were used to avoid local minima. A systematic review of the multi-objective FS problem covering the related studies in the period 2012–2019 was introduced in [161].
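The Pareto-front machinery used by these multi-objective methods rests on the dominance relation; a minimal sketch for minimization objectives (the function name is ours):

```python
def dominates(a, b):
    # a dominates b (minimization) when a is no worse than b in every
    # objective and strictly better in at least one.
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))
```

The non-dominated solutions, those that no other solution dominates, form the Pareto front from which a final trade-off (e.g., relevance vs. redundancy) is chosen.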

#### *5.9. Parallelism*

In [162], Punch applied a wrapper FS based on GA to biological datasets. KNN was modified to work on weighted features (features multiplied by weights according to their importance). The new approach was run on a parallel distributed machine (Sparc and HP). A new ensemble classifier was proposed in [151], based on AdaBoost learning and a parallel GA: a parallel version of GA was applied on 16 processors with a master-slave paradigm, and KNN was used as the base classifier. Ghamisi in [163] applied the parallelism strategy to PSO. Darwinian PSO (DPSO) is based on running many PSO algorithms simultaneously, each as a different swarm on the same problem. A natural selection process was applied by rewarding a swarm that obtained better results, extending its particles' lives so that new descendants were spawned. On the other hand, a swarm with suboptimal (stagnating) results was punished: its search area was discarded and its life was reduced by deleting its particles.

#### **6. NIAs FS Applications**

This section provides an extensive discussion on the use of modified NIA algorithms in different applications.

#### *6.1. Microarray Gene Expression Classification*

In [164], a hybrid model of GA and SVM was developed to perform FS and kernel parameter optimization. GA-SVM is a recommended approach for FS, especially when the kernel parameters are optimized and the number of selected features is not known beforehand. Huang in [128] developed a new GA-based wrapper approach with two optimization stages. The outer stage (global search) applied a fitness function based on the mutual information between actual and predicted classes; the inner stage implements a local search (in a filter manner) based on feature ranking. A gene selection approach based on ACO was developed in [165]. High-dimensional multi-class cancer gene expression (GCM) and colon cancer data sets were used, and comparisons were conducted with several rank-based models. The simulation results demonstrated the validity of the proposed ACO approach for FS in high-dimensional data sets.

A reliable FS technique was developed in [136] for selecting relevant features in gene expression data sets. The proposed methodology was IBPSO-KNN; its accuracy increased by 2.85% compared with other methods in the literature. Yang [92] presented a new modified model of BPSO and applied it to six multi-category cancer-related human gene expression data sets. Yang [18] developed a hybrid filter-wrapper method for FS in microarray data sets using GA and IG, with the ranking of features performed using a decision tree. Experiments showed that the IG-GA algorithm simplified the number of gene expression levels and either achieved higher accuracy or used fewer features compared to other methods. A hybrid filter-wrapper model based on Information Gain (IG), Correlation-based Feature Selection (CFS), and IBPSO was proposed in [123]. Kabir [166] developed a new hybrid model based on GA, NN, MI, and local search operators. A new PSO model capable of discovering biomarkers from microarray data was designed in [137].

Chuang [52] developed a hybrid model for FS and classification of large-dimensional microarray data sets. Mohammad [138] developed a diagnostic medical model based on IBPSO to find the smallest possible number of discriminative genes. One year later, Kabir developed an ACO-based FS model in [167]. The ACOFS target was to select the salient features with the smallest subset size; the model combined ACO, a neural network, and a filter, and included an update of the rules based on a subset-size determination scheme. In [24], the PSO variant was superior to other methods in terms of performance, number of features, and cost.

A new filter-based approach based on the CS optimizer, a Mutual Information (MI) filter, entropy, and an Artificial Neural Network (ANN) classifier was proposed in [126]. The entropy and mutual information were applied in the fitness function to calculate the relevance and redundancy of the feature subsets. Banka developed a new modified version of the PSO algorithm in [98]. Three benchmark data sets were used, for colon cancer, diffuse B-cell lymphoma, and leukemia; the model achieved a minimal number of features and a higher classification accuracy. In [100], a model for cancer gene selection and cancer classification was developed based on BQPSO and SVM with LOOCV. Five DNA microarray data sets were used, and experiments showed better results for BQPSO/SVM compared with BPSO/SVM and GA/SVM in terms of accuracy, robustness, and the number of genes selected. Zawbaa [4] handled the complexity of the FS problem in data sets with large dimensionality and few instances by developing a novel hybrid system called GWO-ALO. A total of 27 different microarray and image processing data sets were used; some were very complex, with 50,000 features and fewer than 200 instances. The experiments showed promising results compared with GA and PSO. Ibrahim in [168] developed a novel wrapper approach combining SVM with the GOA optimizer and applied the hybrid model to three biomedical data sets from Iraqi cancer patients and UCI.

#### *6.2. Facial Expression Recognition*

A new modified ACO-based FS approach with no need for prior knowledge about features was presented in [25]. The experiments were applied to the ORL gray-scale face image database. One year later, the same author proposed another ACO-based FS approach [135], which showed superior performance compared with GA-based and other ACO FS approaches. Aneesh in [76] proposed a new face recognition technique using a modified version of BPSO called Accelerated BPSO (ABPSO); ORL database images taken at the AT&T Laboratories and the Cropped Yale B database-4 were used in the experiments. A biometric technique for Face Recognition (FR) based on BPSO was developed in [97]. Seven benchmark databases, namely Cambridge ORL, UMIST, Extended YaleB, CMU-PIE, Color FERET, FEI, and HP, were used in the experiments.

Zhang [112] developed a facial recognition system based on the MFO-LFA-SA hybrid model to avoid premature stagnation and to guide the search procedure towards the global optima. The logarithmic spiral search behavior of MFO increased the exploitation power, while the LFA used the attractiveness function for more exploration of the search space, and the SA empowered exploitation around the most promising solution. Experiments used frontal-view images extracted from CK+, JAFFE, MMI, and BU-3DFE. The MFO-LFA FS outperformed other facial expression recognition models. Mistry [101] incorporated several update mechanisms in one model, including the hybridization of PSO and mGA (micro Genetic Algorithm), a modified population structure, a new velocity update strategy, a diversity maintenance strategy, and a subdimension-based regional facial feature search strategy. Cross-domain images from the extended Cohn-Kanade and MMI benchmark databases were used in the experiments, alongside multiple classifiers including an NN with back-propagation, a multi-class SVM, and ensemble classifiers.

In [169], a system for Facial Emotion Recognition (FER) was developed based on GWO-NN. The hybridization was used to tune the network weights with less training error; the network then classified the emotions from the selected features. The proposed FER system was evaluated using the JAFFE and Cohn–Kanade databases, and the results showed higher accuracy compared with conventional methods.

#### *6.3. Medical Applications*

A new recognition system for skin tumor diagnosis was developed by Handels in [170]. A GA was used to extract the most suitable features from 2D images that characterize the structure of the skin surface. An NN with back-propagation was used as the learning paradigm and trained on the selected feature sets. Different network topologies and parameter settings were investigated for optimization purposes, and the GA was compared with heuristic greedy algorithms. The GA-based skin tumor recognition system achieved the highest classification performance of 97.7%. An optimized mass detection system for digitized mammograms was developed by Zheng [171]. A GA-BBN hybrid model was used to classify positive and negative regions for masses depicted in digitized mammograms. The results showed that the GA achieved the same feature reduction ratio as an exhaustive search but reduced the total computation time by a factor of 65. In [113], a hybrid PSO-GA FS system was developed to improve cancer classification performance and reduce the cost of medical diagnoses. Chakraborty [153] proposed a modified version of PSO using a new fuzzy evaluation.

In [172], different hybridization models were developed using the GA algorithm with different neural classifiers to obtain the best feature subset while preserving accuracy. A comparison was conducted between GA-KNN, GA-BP-NN, GA-RBF-NN, and GA-LVQ-NN. The results showed that GA with neural classifiers was more robust and effective. In [173], Babaoglu investigated the effectiveness of both BPSO and GA as FS models for determining the existence of Coronary Artery Disease (CAD). BPSO-SVM and GA-SVM were applied to a data set obtained from patients who had undergone Exercise Stress Testing (EST) and coronary angiography. The results showed that the BPSO-FS method was more successful than GA-FS and SVM in determining CAD. An automatic breast cancer diagnosis framework was designed by Ahmad [174]. The developed hybrid Genetic Algorithm Multilayer Perceptron (GA-MLP) model performed simultaneous FS and parameter optimization of the ANN.

Three different variations of the backpropagation training algorithm were investigated, namely resilient backpropagation (GAANN-RP), Levenberg–Marquardt (GAANN-LM), and gradient descent with momentum (GAANN-GD). The Wisconsin Breast Cancer Database (WBCD) was used. The experiments showed that the best accuracy was achieved by the RP variant. Sheikhpour developed a hybrid model to distinguish between benign and malignant breast tumors [175].

PSO-KDE was used to minimize the kernel density estimation error and avoid the time needed for a surgical biopsy. The Wisconsin Breast Cancer Data set (WBCD) and Wisconsin Diagnosis Breast Cancer Database (WDBC) were used. Sayed [176] developed an automatic system based on MFO for Alzheimer's Disease (AD) diagnosis. It was able to distinguish three classes: Normal, AD, and Cognitive Impairment. A benchmark data set consisting of 20 patients from the National Alzheimer's Coordinating Center (NACC) was used. Experiments showed that the SVM polynomial kernel function was the best in terms of accuracy, precision, recall, and F-score. A novel medical diagnosis framework based on IGWO and KELM was developed in [119].

The model was investigated on Parkinson's disease and breast cancer data sets. The comparison was performed between IGWO-KELM, GWO-KELM, and GA-KELM. The experimental results showed that the proposed method was better than the two competitive counterparts. One year later, Sayed developed a new approach for mitosis detection in breast cancer histopathology slide images based on the MFO FS algorithm [177]. MFO was used to extract the best discriminating features of mitosis cells, such as statistical, shape, texture, and energy features; the selected features were then fed to a Classification and Regression Tree (CART) to classify cells as either mitosis or non-mitosis. Wang [3] developed an efficient medical diagnosis tool based on CMFO and KELM to minimize the number of features and to perform parameter optimization for KELM.
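Most of the wrappers above minimize an objective of the same general form, trading classification error against subset size. A minimal sketch of this common fitness (the weight `alpha = 0.99` is an illustrative value, not taken from any specific paper cited here):

```python
def wrapper_fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Weighted sum of classification error and the selected-feature ratio.
    Lower is better: alpha close to 1 prioritizes accuracy, while the
    remaining (1 - alpha) weight penalizes large feature subsets."""
    return alpha * error_rate + (1 - alpha) * (n_selected / n_total)
```

Under this objective, two subsets with equal error are ranked by size, which is how approaches such as CMFO-KELM can simultaneously minimize feature count and maximize accuracy.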

#### *6.4. Handwritten Letter Recognition*

The target in [178] was to study which machine learning algorithm had the right bias to solve specific natural language processing tasks. GA achieved the best results on a language processing WSD data set. In [89], the authors developed a hybrid GA to mitigate the weakness of the standard GA in fine-tuning near local minima. The proposed approach was validated using a data set obtained by extracting gray-mesh features from the CENPARMI handwritten numeral samples. Galbally [149] tried to find a way to minimize the verification error rate in online signature verification systems. A GA-based approach with new modifications was proposed. Experiments were conducted on the MCYT signature database with 330 users and 16,500 signatures. The new approach showed remarkable performance in all the experiments carried out. Zeng [105] developed a novel GA with a dynamic chain-like agent population structure and a dynamic neighboring competitive selection strategy. He used a letter-recognition database from UC Irvine (UCI). The experimental results showed that the feature subset generated by CAGA achieved a higher classification rate, more stability, and lower classification complexity in comparison with four other GAs. A novel FS algorithm based on ACO was presented in [179] to improve performance in text categorization. Comparisons were conducted with GA, information gain, and the Chi-Square test (CHI) on the Reuters-21578 data set, where the proposed approach proved superior. In [129], Principal Component Analysis (PCA) was used with the IG filter method and a GA optimizer in a model called IG-GA-PCA. In the first stage, the IG method was applied to rank the terms of the document according to their importance. In the second stage, the GA and PCA feature selection and feature extraction methods were applied separately to the ranked terms. Experiments used both the Reuters-21578 and Classic3 data sets.
The experiments showed that the IG-GA-PCA model could achieve high categorization results as measured by precision, recall, and F-measure. In [154], a GA-FLD-based FS approach was used to find feature subsets that could optimally discriminate samples from different classes without prior knowledge about feature dimensionality. Another modification based on the fitness function was also proposed. Three standard databases of handwritten digits and one of handwritten letters were used in the experiments. In [53], the authors developed a hybrid intelligent algorithm using BPSO and other operators to solve the FS problem in text clustering. A new initialization strategy, a new fitness function, and a new operator were proposed. The Reuters-21578, Classic4, and WebKB benchmark text data sets were used. The results showed higher clustering accuracy and improved convergence speed for BPSO. Ewees, in [180], introduced a new approach for Arabic handwritten letter recognition (AHLR) called MFO-AHLR. A data set of Arabic handwritten letter images (CENPARMI) was used. Results showed that MFO-AHLR achieved 99.25% accuracy, the highest ratio achieved among all AHLR approaches. Tubishat, in [5], developed a novel hybrid model for Arabic Sentiment Analysis (SA). The targets of the study were to mitigate the limitations of the WOA, such as local minima, slow convergence, loss of diversity, and over-fitting. A hybrid model, IWOA-SVM-IG, was applied to four Arabic benchmark data sets for sentiment analysis. IWOA was compared with six well-known optimization algorithms and two deep learning algorithms, namely Convolutional NN (CNN) and Long Short-Term Memory (LSTM). The results showed that the IWOA algorithm outperformed all other algorithms.
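BPSO variants such as those in [53,85] keep the real-valued PSO velocity update and binarize positions through a transfer function. A generic sketch of one iteration for a single particle (the coefficients `w`, `c1`, `c2` are typical illustrative values, not taken from a specific cited paper):

```python
import random
from math import exp

def bpso_step(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One BPSO iteration for a single particle: the usual real-valued
    velocity update, then an S-shaped transfer function (sigmoid) converts
    each velocity into the probability that the corresponding bit is 1."""
    new_x, new_v = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        v = w * v + c1 * random.random() * (p - x) + c2 * random.random() * (g - x)
        s = 1.0 / (1.0 + exp(-v))          # S-shaped transfer function
        new_v.append(v)
        new_x.append(1 if random.random() < s else 0)
    return new_x, new_v
```

Each bit of `position` marks whether a feature is selected; `pbest` and `gbest` are the particle's and the swarm's best masks so far, evaluated by whatever fitness function the wrapper uses.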

#### *6.5. Hyper Spectral Images Processing*

Tackett, in [181], worked on extracting statistical features from a large, noisy US Army NVEOD Terrain Board imagery database using GP. In [182], a new model was proposed based on GA, Bayesian classification, and a newly proposed fitness function to discriminate targets from clutter in SAR images. Jarvis [183] developed a novel approach based on GA and DFA for selecting important discriminatory variables from Fourier Transform Infrared (FT-IR) spectroscopy data. The GA achieved a 16% reduction in model error. The GA-SVM model for hyper-spectral data classification was proposed in [184]. The proposed GA-SVM was tested on an HYPERION hyper-spectral image. Experiments demonstrated that the number of bands was reduced from 198 to 13, while accuracy increased from 88.81% to 92.51%. A GA-based image annotation system with optimized descriptor weights and/or an optimal descriptor subset over MPEG-7 was developed in [150]. The Corel image database, consisting of 2000 images in 20 categories, was used. Experiments showed that the binary-coded GA and the bi-coded GA improved the accuracy of the image annotation system by 7%, 9%, and 13.6%, respectively, compared to commonly used methods.
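Band selection as in the GA-SVM of [184] is typically encoded as a binary mask over the bands, evolved with standard genetic operators. A hypothetical minimal sketch (the operator choices and mutation rate are illustrative, not taken from the cited paper):

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Exchange the tails of two binary band masks at a random cut point."""
    point = random.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def bit_flip_mutation(mask, rate=0.01):
    """Flip each band bit independently with a small probability."""
    return [1 - bit if random.random() < rate else bit for bit in mask]

def selected_bands(mask):
    """Indices of the bands kept by a mask."""
    return [i for i, bit in enumerate(mask) if bit]
```

A 198-bit mask would encode the 198 HYPERION bands; a fit individual with 13 ones corresponds to a 13-band subset like the one reported, with fitness supplied by SVM classification accuracy on the selected bands.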

A new ensemble classifier was proposed in [151]. It was based on AdaBoost and a parallel GA in the context of the FS problem for image annotation in the MPEG-7 standard. The experiments were performed over 2000 classified Corel images. In [185], a new approach based on GA, SVM, MI, and BB was developed to search for the best combination of bands in hyper-spectral images. MI was used as a pre-processing step for band grouping based on the correlation between bands and classes. GA-SVM was used to search for the optimal combinations of bands that increase accuracy. A post-processing step based on BB was used to filter out irrelevant band groups. Ghamisi [163] applied the FODPSO-SVM approach to determine the most informative bands in the Hekla and Indian Pines hyper-spectral data sets using the parallelism modification technique. In the same year, Ghamisi [186] presented a new hybrid approach based on GA, PSO, and SVM. His target was to detect roads from the background in complex urban images. He integrated the standard velocity and update rules of PSO with selection, crossover, and mutation from GA. In [6], Medjahed developed a novel GWO framework for the Pavia and AVIRIS hyper-spectral image data sets.

#### *6.6. Protein and Related Genome Annotation*

In [116], a new FS model based on ACO-GA was proposed. ACO and GA generated feature subsets in parallel, and the generated subsets were then evaluated by a fitness function. ACO used GA operators to update the solutions. The challenging GPCR-PROSITE and ENZYME-PROSITE protein sequence data sets were used. Mandal [157] developed a prediction system to identify the possible subcellular location of a protein based on a multi-objective PSO.

#### *6.7. Biochemistry and Drug Design*

Raymer [187] developed a system that integrates FE, FS, and classifier training using GA and KNN. This approach was applied in biochemistry and drug design for the identification of favorable water-binding sites on protein surfaces, and was validated using protein-water interactions from the biochemistry field. Another model was developed by Salcedo [104]. The proposed FS model was based on GA and the m-features operator (OR operator). The new approach was evaluated using two machine learning classification problems: the first used two artificial data sets, and the second was a real application in molecular bioactivity for drug design taken from the KDD Cup. The m-features operator improved GA performance over the other existing approaches.

#### *6.8. Electroencephalogram (EEG) Application*

Palani [188] used GA and a Fuzzy ARTMAP (FA) NN for FS. GA-FA-NN was used with VEP data recorded from 10 alcoholics and 10 controls. The target was to classify alcoholics and controls using multi-channel EEG signals. The discriminatory spectral bands were reduced from 7 to 2, and the identification of useful spectral power ratios produced better performance. In [189], a hybrid GA-SVM model was used to extract favorable patterns from the noisy multidimensional time series obtained from EEG, which form the basis for Brain-Computer Interfaces (BCIs). The data set was collected by a procedure in which subjects were placed in a dim, sound-controlled room. The proposed nonlinear system was slightly better than other linear approaches. A novel ACO-DE FS system called ANTDE was presented in [23]. It could cope with the limitations of ACO regarding the sequential generation of solutions. ANTDE was used in EEG and Myoelectric Control (MEC) biosignal applications. Wang [130] developed a BCI system using a hybrid GA-SVM-entropy model and 28 EEG channels. Noori designed an effective BCI in [190]. He used a new version of GA based on SVM to obtain smaller optimal feature sets from functional Near-Infrared Spectroscopy (fNIRS) signals. The experiments recruited seven subjects without any psychological disorder. Subjects were seated in a quiet room and asked to relax and settle their responses before performing mental arithmetic tasks for a certain period.

#### *6.9. Financial Prediction*

In [191], a new financial prediction model was proposed. A hybrid SVM-GA model was evaluated using 15 business data sets, each consisting of 186 sampled firms. GA-SVM achieved a prediction accuracy of up to 95.56% across all the tested business data. In [192], the authors developed a hybrid fuzzy-GA approach for stock selection. A fuzzy-based scoring mechanism was applied to score a set of stocks, and the topmost stocks were selected. GA was applied to perform the dual task of FS and parameter optimization. The constituent stocks of the 200 largest market capitalizations listed on the Taiwan Stock Exchange were used in the experiments.

#### *6.10. Software Product Line Estimation*

Oliveira [22] investigated the use of the GA method for simultaneous FS and parameter optimization of Support Vector Regression (SVR) applied to software effort estimation. GA, SVR, MLP, and model trees were used. Six benchmark data sets of software projects, namely Desharnais, NASA, COCOMO, Albrecht, Kemerer and Koten, and Gray, were used in the experiments. In [106], Guo presented a new methodology for FS in software applications. The target of the new modified GA was to optimize FS in Software Product Lines (SPLs), finding a feature subset with optimal product capability subject to feature model constraints and resource constraints. The results showed that the GA FS algorithms produced a high-performance system in 45–99% less time than existing heuristic FS techniques.

#### *6.11. Spam Detection in Emails*

Temitayo [193] developed a new approach for classifying emails as either spam or legitimate. GA was used to perform simultaneous FS and parameter optimization. The hybrid GA-SVM spam detection model was evaluated using a SpamAssassin data set (6000 emails). Experiments showed that GA-SVM improved on SVM by achieving a higher recognition rate with only a few feature subsets. In [85], a mutation-based BPSO FS model was developed for an email application. A data set of 6000 emails manually collected during the year 2012 was used. The proposed model was able to effectively reduce the false-positive error. In [194], a hybrid GA-RWN was used to identify the most relevant features in spam emails and automatically tune the hidden neurons. The GA-RWN achieved promising results in terms of spam detection rate and optimization of the configuration of its core classifier. Recently, in [195], a novel Northern Bald Ibis Algorithm (NOA) was used with an SVM classifier to obtain an optimal feature subset of the Enron-spam data set.

#### *6.12. Other Various Applications*

Zio [156] developed an efficient transient diagnosis system for nuclear power plants based on GA to select among the several measured plant parameters. In [61], the target was to address the problem of a lengthy Intrusion Detection (ID) process based on attributes of network packets. Rough-PSO was used and evaluated on the KDD Cup 1999 data set. An automatic FS model that can choose the most relevant features from password typing patterns was designed in [196]. The data sets were captured on a Sun SparcStation by a program in an X Window environment that measured keystroke duration times. Rodrigues, in [197], proposed a CS-OPF model for theft detection in power distribution systems. The proposed model was evaluated using two data sets from a Brazilian electrical power company. Experiments demonstrated the robustness of the CS-OPF model, which increased theft recognition by up to 40%. Zhang [127] developed a new forecaster FS model based on combining MI and ACO. The ACO-MI model was applied to forecaster data sets at the Australian Bureau of Meteorology. A system for diagnosing different types of faults in a gearbox was designed in [198]. Hassanien [67] developed an automatic tomato disease detection system based on integrating rough sets with MFO.

#### *6.13. An Open-Source EvoloPy-FS Framework*

EvoloPy-FS [199] is an open-source FS software tool developed by our team and publicly available at www.evo-ml.com. It serves as an explicit white-box NIAs-FS optimization framework. The main objective was to provide researchers from different disciplines with an easy-to-use, transparent, and automated NIAs-FS optimization tool. The framework contains several recent NIA algorithms written in Python and a set of different operators such as transfer functions (S-TFs and V-TFs). Moreover, the framework supports wrappers, filters, and hybrid filter-wrappers, offers different evaluation metrics, and allows loading data from different resources. EvoloPy-FS continues our effort to build an integrated optimization environment: the work started with EvoloPy [200] for global optimization problems, followed by EvoloPy-NN for optimizing MLPs, and most recently EvoloPy-FS for optimizing the feature selection process. In [199], the authors constructed experiments based on 30 different well-regarded data sets from common repositories such as UCI and Kaggle. Comparisons were conducted between wrapper FS, filter FS, and hybrid filter-wrapper approaches. Wrapper and hybrid filter-wrapper approaches were shown to be superior and more trustworthy in dealing with large-dimensionality data sets; however, the filter approach was faster, generating results in a shorter time with less computational effort.
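The S-TFs and V-TFs the framework mentions are the two standard families of transfer functions for binarizing continuous NIAs. A sketch of one representative of each, with their differing update semantics — an S-TF sets a bit directly from the probability, while a V-TF flips the current bit with that probability (the specific function choices here are illustrative, not necessarily those shipped in EvoloPy-FS):

```python
import random
from math import exp, tanh

def s_tf(velocity):
    """S-shaped transfer function: sigmoid of the velocity."""
    return 1.0 / (1.0 + exp(-velocity))

def v_tf(velocity):
    """V-shaped transfer function: absolute value of tanh."""
    return abs(tanh(velocity))

def update_bit_s(bit, velocity):
    """S-TF semantics: draw the new bit directly from the probability."""
    return 1 if random.random() < s_tf(velocity) else 0

def update_bit_v(bit, velocity):
    """V-TF semantics: flip the current bit with the given probability."""
    return 1 - bit if random.random() < v_tf(velocity) else bit
```

The practical difference is that under a V-TF a near-zero velocity leaves the bit unchanged (favoring exploitation late in the search), whereas under an S-TF a zero velocity still yields a 50/50 coin flip.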

#### **7. Assessment and Evaluation of NIAs FS Modification Techniques**

As discussed, NIAs-FS approaches have made significant contributions and achieved clear success in solving the FS problem in different domains. This section presents the results of the analysis of modified NIAs-FS studies. Table 1 summarizes the main studies in the literature that adopted new operators as modification techniques for NIAs-FS; Table 2 summarizes the main studies that adopted the hybridization modification technique; Table 3 summarizes the main studies that adopted the remaining modification techniques; Tables 4 and 5 summarize the main modifications applied in the literature on the main NIAs (applied/not applied, and by number of studies, respectively); Table 6 summarizes the main studies that applied modified NIAs-FS in applications; and Table 7 summarizes the modifications applied on NIAs-FS in the main applications. It was observed that 34 different operators were applied to NIA wrappers in 48 different papers; some references adopted more than one operator. The most applied operator is the chaotic map, applied in 10 references, followed by the rough set in 6 references, selection operators (RWS, TS) in 5, and S-shaped and V-shaped transfer functions and crossover in 4 references each. Mutation was applied in 3 references; UC, DE, and local search operators were each adopted in 2 references; and a single reference adopted each of the remaining operators. The PSO wrapper was the most modified optimizer using newly adopted operators for tackling the FS problem: it was modified with a new operator in 21 references. In addition, GA was modified in 6 references, WOA in 4, CS in 3, SSA in 3, and GWO, GOA, and MFO in 2 references each. For DA, FFA, LA, BA, MVO, ABC, CSO, DE, and CSA, the number of references was 1.
For FPA and ACO, no work applied new operators to these algorithms for solving the FS problem.

It is clear from the analysis that the hybridization modification technique was applied in 75 references to solve FS. This count shows that hybridization is the most widely applied modification technique for enhancing NIA wrappers in the FS domain. The high number is driven by GA, which is the NIA with the most wrapper hybridization works: GA was hybridized in 38 different works, far more than the 6 works that adopted new operators for GA. We can infer from these counts and from the contributions of the studies that hybridization is the most suitable modification technique for GA. ACO was also hybridized in 7 references, while no work adopted a new operator to modify ACO. Conversely, PSO hybridization works numbered 11, fewer than the 21 works with new operators; thus, we can again infer that adopting a new operator for PSO is more suitable than hybridization. It was also noticed that hybridization with different kinds of classifiers was the most prominent hybridization technique.

There are 49 studies that investigated the influence of the classification technique on the performance of wrappers for optimizing FS; some of these studies applied simultaneous optimization of FS and a classification/prediction task by tuning the parameters of the classifier while applying FS. The next most widespread hybridization technique is the filter-wrapper, which was applied in 14 studies and was very effective in dealing with large-dimensionality feature spaces. Hybridization techniques that try to balance exploration and exploitation of the search space were also adopted by a considerable number of works. In summary, PSO and GA are the most widely modified NIAs-FS approaches. They were equally modified and used: each was adopted and modified for FS in 56 of the gathered studies.
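The filter-wrapper hybrids counted above share a simple two-stage shape: a cheap filter score prunes the feature space, and the expensive NIA wrapper searches only the survivors. A schematic sketch (the `score` and `wrapper_search` callables are placeholders for any filter metric, such as MI or IG, and any NIA wrapper):

```python
def filter_then_wrapper(features, labels, score, wrapper_search, top_k=20):
    """Two-stage hybrid FS.
    Stage 1: rank every feature column by a cheap filter score.
    Stage 2: run the (expensive) wrapper search only over the top-k survivors."""
    ranked = sorted(range(len(features)),
                    key=lambda i: score(features[i], labels),
                    reverse=True)
    return wrapper_search(ranked[:top_k])
```

This structure explains why the hybrids handle large-dimensionality data well: the wrapper's per-evaluation classifier training runs over `top_k` features instead of the full feature space.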

On the other side, regarding applications of NIAs-FS, microarray gene expression classification is evidently the most dominant application: NIAs-FS approaches were applied there in 18 studies, a ratio of 24% relative to the other applications. Medical applications were the second most prominent, with a ratio of 21%; these include different medical branches such as sonar, tumor, mass, and various disease detection, medical diagnosis, medical data, and bio-signal analysis. Next follow hyper-spectral imaging with a ratio of 17%, Arabic handwritten recognition with 13%, facial expression recognition with 9%, EEG applications with 7%, financial diagnosis with 5%, and spam detection with 4%. Furthermore, GA is the most dominant NIA optimizer for FS in applications, with a ratio of 45%. PSO is the second most widespread optimizer with 26%, followed by ACO with 11%, MFO with 7%, GWO with 6%, CS with 2%, WOA with 2%, and GOA with 1%.

The No Free Lunch (NFL) theorem [201] states that no optimization algorithm can solve all optimization problems equally well: success on a specific problem does not guarantee that the algorithm will perform similarly on other problems, and on average, all optimization algorithms perform equally. This theorem has motivated researchers to develop new algorithms or improve existing ones to address other wide areas of optimization problems, such as feature selection. Researchers are advised to read the following references, which are the most cited papers after 2010 in the field of NIAs-FS: [17,51,80,90,94,124,202].

**Table 1.** Summary of main studies in the literature that adopted new operators as modification techniques for Nature-Inspired Algorithms Feature Selection (NIAs-FS).



**Table 2.** Summary of main studies in the literature that adopted hybridization modification technique for NIAs-FS.


**Table 2.** *Cont.* **Table 3.** Summary of main studies in the literature that adopted the remaining modifications techniques for NIAs-FS.


**Table 4.** Summary of main modifications applied in the literature on main NIAs-FS (applied/not applied).



**Table 5.** Summary of main modifications applied in the literature on main NIAs-FS (by number of studies).


**Table 6.** Summary of main studies in the literature that applied modified NIAs-FS in applications.

#### **Table 6.** *Cont.*



**Table 7.** Summary of modifications applied on NIAs-FS in the main applications.



#### **Table 7.** *Cont.*

#### **8. Conclusions and Future Research Directions**

In this study, a survey of modifications of NIAs for tackling the FS optimization problem is presented. The review is based on a solid theoretical, applied, and technical foundation. Three main research streams are identified in this review: meta-heuristic optimization, feature selection, and modifications of NIAs for tackling FS. This review aims to draw the map for researchers and guide them when creating new research in this area. The survey is based on 156 articles, collected and studied, on modifications of NIAs for solving the FS problem. The sources of the information search came mainly from six well-regarded scientific databases: Elsevier, Springer, Hindawi, ACM, World Scientific, and IEEE. From the review, it can be seen that NIAs have been extensively investigated over the past years to improve solutions to the FS problem. About 34 different operators were investigated; the most popular operator is the chaotic map. Hybridization is the most widely used modification technique. There are three types of hybridization: integrating an NIA with another NIA, integrating an NIA with a classifier, and integrating an NIA with a filter method. The most widely used hybridization is the one that integrates a classifier with the NIA. Microarray and medical applications are the dominant applications in which most NIAs-FS are modified and used. Despite the popularity of NIAs-FS, there are still many areas that need further investigation:


Based on the above trends, the size of the NIAs-FS research area can be appreciated. Moreover, it can be expected that thorough investigation and improvement of NIAs will further improve the FS process in various high-dimensional areas. This review paper is intended to help researchers gain a clear view of the modification strategies of nature-inspired algorithms for tackling the feature selection problem.

**Author Contributions:** Conceptualization, R.A.K., I.A. and A.S.; methodology, R.A.K., I.A. and A.S.; formal analysis, R.A.K., I.A. and A.S.; resources, R.A.K.; validation, M.A.E., R.D. and T.K.; writing—original draft preparation, R.A.K., I.A. and A.S.; writing—review and editing, M.A.E., R.D. and T.K.; supervision, M.A.E.; funding acquisition, T.K. All authors read and agreed to the published version of the manuscript.

**Funding:** No funding was received for this research.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** Not applicable.

**Conflicts of Interest:** All authors declare that they have no conflicts of interest.

#### **References**

