**Nonsmooth Optimization in Honor of the 60th Birthday of Adil M. Bagirov**

Editors

**Napsu Karmitsa Sona Taheri**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* Napsu Karmitsa University of Turku Finland

Sona Taheri RMIT University Australia

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Algorithms* (ISSN 1999-4893) (available at: https://www.mdpi.com/si/algorithms/Nonsmooth Optimization).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-03943-835-8 (Hbk) ISBN 978-3-03943-836-5 (PDF)**

© 2020 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **About the Editors**

**Napsu Karmitsa** has been a Docent (Adjunct Professor) of Applied Mathematics at the Department of Mathematics and Statistics at the University of Turku, Finland, since 2011. She obtained her MSc degree in Organic Chemistry in 1998 and Ph.D. degree in Scientific Computing in 2004, both from the University of Jyväskylä, Finland. At the moment, she holds a position of Academy Research Fellow granted by the Academy of Finland. Her research is focused on nonsmooth optimization and analysis. Special emphasis is given to nonconvex, global, and large-scale cases with applications in machine learning and data analysis. Dr. Karmitsa has published two books on nonsmooth optimization and one book on clustering as well as several journal papers and research reports. In addition, her webpage http://napsu.karmitsa.fi is one of the leading sources of nonsmooth optimization solvers available.

**Sona Taheri** received her Ph.D. in Information Technology from Federation University Australia in 2012. Currently, she holds a lecturer position at RMIT University, Australia. She has conducted research in the area of optimization, particularly nonsmooth, nonconvex, and DC optimization; data mining and machine learning, particularly cluster analysis and regression in large data sets; and cybersecurity, mainly in alert analysis and malicious multisource data sets. The results of her research have been published as a book, an edited book, book chapters, and journal and conference articles.

## **Preface to "Nonsmooth Optimization in Honor of the 60th Birthday of Adil M. Bagirov"**

Nonsmooth optimization (NSO) refers to the general problem of minimizing (or maximizing) functions that are typically not differentiable at their minimizers (maximizers). Such functions can be found in various applications such as image denoising; optimal shape design; computational chemistry and physics; water management; cybersecurity; machine learning; and data mining, including cluster analysis, classification, and regression. As the classical optimization theory presumes differentiability of the functions to be optimized, it cannot be directly applied, nor can the methods introduced for smooth problems.

The aim of this book was to gather the most recent developments in NSO techniques and applications. The book opens with the foreword by the Guest Editors and then presents six articles in the area of NSO and its applications.

The Guest Editors are grateful to Professor Adil Bagirov, with whom they have had the privilege of conducting research in the area of NSO and its applications, and wish him all the best in his career and personal life.

The Guest Editors would like to thank all the authors for their contributions to this book, all reviewers who provided constructive comments, and the editorial staff of the MDPI journal *Algorithms* for their support.

> **Napsu Karmitsa, Sona Taheri** *Editors*

## *Editorial* **Special Issue "Nonsmooth Optimization in Honor of the 60th Birthday of Adil M. Bagirov": Foreword by Guest Editors**

#### **Napsu Karmitsa <sup>1,\*</sup> and Sona Taheri <sup>2</sup>**


Received: 4 November 2020; Accepted: 4 November 2020; Published: 7 November 2020

**Abstract:** Nonsmooth optimization refers to the general problem of minimizing (or maximizing) functions that have discontinuous gradients. This Special Issue contains six research articles that collect together the most recent techniques and applications in the area of nonsmooth optimization. These include novel techniques utilizing some decomposable structures in nonsmooth problems—for instance, the difference-of-convex (DC) structure—and interesting important practical problems, like multiple instance learning, hydrothermal unit-commitment problem, and scheduling the disposal of nuclear waste.

#### **1. Introduction**

In this Special Issue, we take the opportunity to acknowledge the outstanding contributions of Professor Adil Bagirov (Figure 1) to nonsmooth optimization (NSO), in both its theoretical foundations and its practical aspects, during his 35-year-long research career. This Special Issue collects together the most recent techniques and applications in the area of NSO. It contains six excellent research papers by well-known mathematicians. Some of the authors have at some point collaborated with Adil Bagirov, and all of them would like to show their respect to him and his work.

**Figure 1.** Professor Adil Bagirov.

Adil Bagirov received a master's degree in Applied Mathematics from Baku State University, Azerbaijan in 1983, and the Candidate of Sciences degree in Mathematical Cybernetics from the Institute of Cybernetics of Azerbaijan National Academy of Sciences in 1989. Then he worked at the Space Research Institute (Baku, Azerbaijan), Baku State University (Baku, Azerbaijan) and Joint Institute for Nuclear Research (Moscow, Russia) until 1998.

Bagirov has been with Federation University Australia since 1999. He completed his PhD in Optimization under the supervision of Professor Alexander Rubinov at Federation University Australia (formerly the University of Ballarat) in 2002. Currently, he holds a full Professor position at this university. Professor Bagirov has contributed exceptionally to NSO and its applications to real-life problems. These contributions include books on NSO [1] and its applications in clustering [2], an edited book on NSO methods [3], and more than 170 journal papers, book chapters and papers in conference proceedings in the area of NSO and its applications (see, e.g., [4–12]). He has also supervised more than 28 PhD students.

Professor Bagirov has been successful in securing five grants from the Australian Research Council's Discovery and Linkage schemes to conduct research in nonsmooth and global optimization and their applications. He was awarded the Australian Research Council Postdoctoral Fellowship and the Australian Research Council Research Fellowship. In addition, he is EUROPT Fellow 2009.

The Guest Editors are grateful to Professor Adil Bagirov, with whom they have had the privilege to do research in the area of NSO and its real-life applications. On behalf of the journal, the Guest Editors wish him all the best in his career and personal life.

#### **2. Nonsmooth Optimization**

NSO refers to the general problem of minimizing (or maximizing) functions that have discontinuous gradients. These types of functions arise in many applied fields, for instance, in image denoising, optimal shape design, computational chemistry and physics, water management, cybersecurity, machine learning, and data mining including cluster analysis, classification and regression. In most of these applications, the number of decision variables is very large, and their NSO formulations allow us to reduce these numbers significantly. Thus, the application of NSO approaches facilitates the design of efficient algorithms for their solutions, the more realistic modeling of various real-world problems, the robust formulation of a system, and even the solving of difficult smooth (continuously differentiable) problems that require reducing the problem's size or simplifying its structure. These are some of the main reasons for the increased interest in nonsmooth analysis and optimization during the past few years. This Special Issue collects some of the most recent methods in NSO and its applications. These include novel techniques for solving NSO problems by utilizing, for instance, the decomposable (difference of convex (DC)) structure of the objective, the nonsmooth Gauss-Newton algorithm, the biased-randomized algorithm, and also interesting practical problems such as multiple instance learning, the hydrothermal unit-commitment problem, and scheduling the disposal of nuclear waste.

In the first article, "A Mixed-Integer and Asynchronous Level Decomposition with Application to the Stochastic Hydrothermal Unit-Commitment Problem" by Bruno Colonetti, Erlon Cristian Finardi and Welington de Oliveira [13], the authors develop an efficient algorithm for solving uncertain unit-commitment (UC) problems. The efficiency of the algorithm is based on the novel asynchronous level decomposition of the UC problem and the parallelization of the algorithm.

In the second article, "On a Nonsmooth Gauss-Newton Algorithm for Solving Nonlinear Complementarity Problems" by Marek J. Śmietański [14], the author proposes a new nonsmooth version of the generalized damped Gauss-Newton method for solving nonlinear complementarity problems. In the proposed algorithm, the Bouligand differential plays the role of the derivative. The author presents two types of algorithms (usual and inexact), which have superlinear and global convergence for semismooth cases.

In the third article, "Polyhedral DC Decomposition and DCA Optimization of Piecewise Linear Functions" by Andreas Griewank and Andrea Walther [15], the abs-linear representation of piecewise linear functions is extended, yielding their DC decomposition as well as a pair of generalized gradients that can be computed using the reverse mode of algorithmic differentiation. The DC decomposition and the two subgradients are used to drive DCA algorithms in which the (convex) inner problem can be solved in finitely many iterations and the gradients of the concave part can be updated using a reflection technique.

The fourth article, "On the Use of Biased-Randomized Algorithms for Solving Non-Smooth Optimization Problems" by Angel Alejandro Juan, Canan Gunes Corlu, Rafael David Tordecilla, Rocio de la Torre and Albert Ferrer [16], introduces the use of biased-randomized algorithms as an effective methodology to cope with NP-hard and NSO problems in many practical applications, in particular, those including so-called soft constraints. Biased-randomized algorithms extend constructive heuristics by introducing a nonuniform randomization pattern into them. Thus, they can be used to explore promising areas of the solution space without the limitations of gradient-based approaches that assume the existence of a smooth objective.

In the fifth article, "Planning the Schedule for the Disposal of the Spent Nuclear Fuel with Interactive Multiobjective Optimization" by Outi Montonen, Timo Ranta and Marko M. Mäkelä [17], the very important problem of the scheduling of nuclear waste disposal is modelled as a multiobjective mixed-integer nonlinear NSO problem with the minimization of nine objectives. A novel method using the two-slope parameterized achievement scalarizing functions is introduced for solving this problem, and a case study adapting the disposal in Finland is given.

Finally, the article "SVM-Based Multiple Instance Classification via DC Optimization" by Annabella Astorino, Antonio Fuduli, Giovanni Giallombardo and Giovanna Miglionico considers the binary classification of the multiple instance learning problem [18]. The problem is formulated as a nonconvex unconstrained NSO problem with a DC objective function, and an appropriate nonsmooth DC algorithm is used to solve this problem.

The Guest Editors would like to thank all the authors for their contributions to this Special Issue. They would also like to thank all the reviewers for their timely and insightful comments on the submitted articles as well as the editorial staff of the MDPI journal *Algorithms* for their assistance in managing the review process in a prompt manner.

**Funding:** This work was financially supported by Academy of Finland grant #289500.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Mixed-Integer and Asynchronous Level Decomposition with Application to the Stochastic Hydrothermal Unit-Commitment Problem**

#### **Bruno Colonetti <sup>1,\*</sup>, Erlon Cristian Finardi <sup>1,2</sup> and Welington de Oliveira <sup>3,\*</sup>**


Received: 3 August 2020; Accepted: 14 September 2020; Published: 18 September 2020

**Abstract:** Independent System Operators (ISOs) worldwide face the ever-increasing challenge of coping with uncertainties, which requires sophisticated algorithms for solving unit-commitment (UC) problems of increasing complexity in less-and-less time. Hence, decomposition methods are appealing options to produce easier-to-handle problems that can hopefully return good solutions at reasonable times. When applied to two-stage stochastic models, decomposition often yields subproblems that are embarrassingly parallel. Synchronous parallel-computing techniques are applied to the decomposable subproblem and frequently result in considerable time savings. However, due to the inherent run-time differences amongst the subproblem's optimization models, unequal equipment, and communication overheads, synchronous approaches may underuse the computing resources. Consequently, asynchronous computing constitutes a natural enhancement to existing methods. In this work, we propose a novel extension of the asynchronous level decomposition to solve stochastic hydrothermal UC problems with mixed-integer variables in the first stage. In addition, we combine this novel method with an efficient task allocation to yield an innovative algorithm that far outperforms the current state-of-the-art. We provide convergence analysis of our proposal and assess its computational performance on a testbed consisting of 54 problems from a 46-bus system. Results show that our asynchronous algorithm outperforms its synchronous counterpart in terms of wall-clock computing time in 40% of the problems, providing time savings averaging about 45%, while also reducing the standard deviation of running times over the testbed in the order of 25%.

**Keywords:** stochastic programming; stochastic hydrothermal UC problem; parallel computing; asynchronous computing; level decomposition

#### **1. Introduction**

The unit-commitment (UC) problem aims at determining the optimal scheduling of generating units to minimize costs or maximize revenues while satisfying local and system-wide constraints [1]. In its deterministic form, UC still poses a challenge to operators and researchers due to the large sizes of the systems and the increasing modeling details necessary to represent the system operation. For instance, in the Brazilian case, the current practice is to set a limit of 2 h for the solution of the deterministic UC [2], while the Midcontinent Independent System Operator (MISO) sets a time limit of 20 min for its UC [3]. (Note that the Brazilian system and the MISO are different from a physical, as well as from a market-based, viewpoint, but the problems being solved in these two cases share the same classical concept of the UC.) Nonetheless, the growing presence of intermittent generation has added yet more difficulty to the problem, giving rise to what is called uncertain UC [4]. The latter is considerably harder to solve than its deterministic counterpart, and one of the reasons for its lack of adoption in the industry is precisely its computational burden: Large-scale uncertain UC takes a prohibitively long time to be solved. In this context, efficient solution methods for the uncertain UC that can take full advantage of the computational resources at hand are both desirable and necessary to help system operators cope with uncertain resources.

In particular, to model the uncertainty arising from renewable sources, one of two approaches is generally employed: robust optimization or stochastic programming [4]. The latter is by far the most employed, both in its chance-constrained and recourse variants. In stochastic programs with recourse, uncertainty is, in general, represented by finite-many scenarios, and the problem is formulated either in a two-stage or multistage setting. In two-stage stochastic problems, the first-stage variables must be decided before uncertainty is revealed. Once the uncertain information becomes known, recourse actions are taken to best accommodate the first-stage decisions [5]. In stochastic hydrothermal unit-commitment (SHTUC) problems, the sources of uncertainties are related to renewable resources, spot prices, load, and equipment availability [1,4].

The commitment decisions are usually modeled as first-stage variables, while dispatch decisions are the recourse actions (second-stage variables). Given the mixed-integer nature of commitment decisions, SHTUC problems in a two-stage formulation give rise to large-scale mixed-integer optimization models whose numerical solution by off-the-shelf solvers is often prohibitive due to time requirements or limited computing resources. Consequently, decomposition techniques must come into play [1,4,6,7]. Benders decomposition (BD) and Lagrangian relaxation (LR) are the most used techniques to handle SHTUC problems. While the BD deals with the primal problem [8], LR is a dual procedure employed to compute the best lower bound for the SHTUC problem [7,9]. Primal-recovery heuristics are employed to compute primal-feasible points, which are not, in general, optimal solutions. This is the main shortcoming of LR-based techniques.

Decomposition techniques yield models that are amenable for parallelization [5]. A common strategy for solving problems simultaneously is to use a master/worker framework with pre-specified synchronization points [10], which we call synchronous computing (SYN). In this framework, the master chooses new iterates and sends them to workers, who, in turn, are responsible for solving one or more subproblems. Examples of SYN implementations for UC are given in [11–14]. An aspect of SYN is that, at predetermined points of the algorithm, the master must wait for all workers to respond to resume the iterative process: the synchronization points. However, the times for workers to finish their respective tasks might vary significantly. This results in idle times, both for the master and for workers who respond quickly [10]. One way to reduce idle times is to use asynchronous computing (ASYN).
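The idle-time argument can be made concrete with a small sketch. It computes, for a single synchronization round, how long each worker waits for the slowest one. The solve times are made-up numbers and `idle_times_sync` is a hypothetical helper introduced only for illustration; it is not taken from any of the cited implementations.

```python
def idle_times_sync(durations):
    """Idle time of each worker in one synchronous round.

    In a synchronous scheme the round ends only when the slowest worker
    responds, so worker j idles for max(durations) - durations[j].
    """
    round_length = max(durations)
    return [round_length - d for d in durations]


# Hypothetical solve times (in seconds) of four workers in one round.
durations = [2.0, 5.0, 3.0, 10.0]

idle = idle_times_sync(durations)
total_idle = sum(idle)
# Fraction of the available worker time spent on useful computation.
utilization = sum(durations) / (len(durations) * max(durations))

print("idle per worker:", idle)
print("total idle time:", total_idle)
print("utilization:", utilization)
```

With one slow worker, half of the available worker time in this toy round is spent waiting, which is the kind of loss an asynchronous scheme tries to avoid.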

In contrast to SYN, in ASYN, there are no synchronization points, so the master and workers do not need to wait until all workers respond to continue their operations. Thus, in an iterative process, e.g., in BD, the master would compute the next iterate based on information of possibly only a proper, but nonempty, subset of the workers. Based on this possibly incomplete information, the master sends a new iterate to available workers, while slower workers are still carrying their tasks on an outdated iterate. Because in ASYN iterates might not be evaluated by all workers, the evaluation of the objective function (yielding bounds on the optimal values) is precluded. Hence, a fundamental step in ASYN is the (scarce) coordination of workers to produce valid bounds.

ASYN implementations have been proposed in the UC literature mainly to solve the dual problems (issued by LRs) via either subgradient algorithms or cutting-plane-based methods [15–17]. In References [15,16], a queue of iterates is created and its elements are gradually sent to the workers. Auxiliary lists keep track of the evaluation status of each worker with respect to the elements in the queue. Once an element has been evaluated by all workers, a valid bound to the original problem is available. The authors of Reference [15] demonstrate that their algorithm converges to a dual global solution regardless of the iterate-selection policy used to choose the iterates from the queue (first-in-first-out or last-in-first-out). In Reference [17], the authors keep a list of all the iterates to compute valid bounds. In addition to solving the dual problem asynchronously, Reference [17] also conducts the primal recovery asynchronously. While References [15,16] employ a convex trust-region bundle method, Reference [17] implements an incremental subgradient method. Asynchronous implementations of BD for convex problems can be found in References [18–20]. In Reference [18], the dual dynamic-programming algorithm is handled asynchronously in a hydrothermal scheduling problem. In Reference [19], the stochastic dual dynamic-programming algorithm is used for addressing the long-term planning problem of a hydro-dominated system: The authors propose to compute Benders cuts in an asynchronous fashion. This is also the case in Reference [20], where the authors consider an asynchronous Benders decomposition for convex multistage stochastic programming.

Despite being successfully applied in a variety of fields, e.g., References [18,19] and the references in Reference [21], the classical BD is well-known to suffer from slow convergence due to the oscillatory nature of Kelley's cutting-plane method [22,23]. Regularized BDs have been proven to outperform the classical one in several problems: See Reference [24] for (convex) two-stage linear programming, Reference [25] for (nonconvex) chance-constrained problems, and Reference [26] for robust design of stations in water distribution networks. Several types of regularization exist [25,27,28]: proximal, trust-region, and level sets. Among the regularization methods, the level bundle method [29], also known as level decomposition (LD) in two-stage programming [24], stands out for its flexibility in dealing with convex or nonconvex feasible sets, stability functions and centers, and inexact oracles [25,26,30]. Recently, asynchronous level bundle methods for convex optimization were proposed in Reference [31]. The paper presents two algorithms. The first one does not employ coordination, but it makes use of upper bounds on the Lipschitz constants of the involved functions to compute upper bounds for the problem. The second algorithm does not make use of the latter assumption but requires scarce coordination. The authors of Reference [31] focus on the convergence analysis of their proposals (suitable only for the convex setting) and present limited numerical experiments. In this work, we build on Reference [31] and extend its asynchronous algorithm with scarce coordination (Algorithm 3 of Reference [31]) to the mixed-integer setting. Moreover, we consider a more general setting in which tasks can be assigned to workers in a dynamic fashion, as described in Section 3. We highlight that the convergence analysis given in Reference [31] relies strongly on elements of convex analysis such as Smulian's theorem and the Painlevé–Kuratowski set convergence. Such key theoretical results are no longer valid in the setting of nonconvex sets, and hence the convergence analysis developed in Reference [31] does not apply to our mixed-integer setting. For this reason, the convergence analysis of our asynchronous LD must be done anew. We not only provide convergence analysis of our method but also assess its numerical performance on a test set consisting of 54 instances of two-stage UC problems with mixed-integer variables in the first stage.

We care to mention that other asynchronous bundle methods exist in the literature, but they are all designed for convex optimization problems [15,16,32]. The latter reference proposes an asynchronous proximal bundle method, whereas References [15,16] consider a trust-region variant for polyhedral functions. Our approach, which follows the lines of the extended level bundle method of Reference [30], does not require the involved functions to be polyhedral or the feasible set to be convex. As an additional advantage, our algorithm is easily implementable.

This work is organized as follows. Section 2 presents a generic formulation of our two-stage SHTUC problem. The extended asynchronous LD and its convergence analysis are presented in Sections 2.1 and 2.2, respectively. Section 3 presents more details of the considered SHTUC problem and states our case studies. Numerical experiments assessing the benefits of our proposal are given in Section 4. Finally, in Section 5, we present our final remarks.

#### **2. Materials and Methods**

We address the problem of an Independent System Operator (ISO) in a hydro-dominated system with a loose-pool market framework. The ISO decides the day-ahead commitment considering operation costs, forecast errors in wind generation, and inflows; and the usual generation and system-wide constraints. The uncertainties in wind and inflows are represented by a finite set of scenarios, S, and the decisions are made in two stages. At the first stage, the ISO decides on the commitment of units, whereas, at the second stage, the operator determines the dispatch according to the random-variable realization. Full details on the considered stochastic hydrothermal unit-commitment (SHTUC) are given shortly. For presenting our approach, which is not limited to (stochastic) unit-commitment (UC) problems, we adopt the following generic formulation.

$$f_* := \min_{\mathbf{x}, y} \left\{ \mathbf{c}^\top \mathbf{x} + \sum_{s \in \mathcal{S}} \mathbf{q}_s^\top y_s \,\middle|\, \begin{array}{l} \mathbf{x} \in \mathcal{X},\ \mathbf{T}\mathbf{x} + \mathbf{W} y_s \le \mathbf{h}_s, \\ y_s \in \mathcal{Y}_s,\ s \in \mathcal{S} \end{array} \right\}. \tag{1}$$

In this formulation, the $n$-dimensional vector $\mathbf{x}$ represents the first-stage variables with associated cost vector $\mathbf{c}$. The second-stage variables, $y_s$, and their associated costs, $\mathbf{q}_s$, depend on the scenario $s \in \mathcal{S}$. The cost vector $\mathbf{q}_s$ is assumed to incorporate the positive probability of scenario $s$. The first- and second-stage variables are coupled by the constraints $\mathbf{T}\mathbf{x} + \mathbf{W} y_s \le \mathbf{h}_s$: $\mathbf{T}$ is the technology matrix, and $\mathbf{W}$ and $\mathbf{h}_s$ are, respectively, the recourse matrix and a vector of appropriate dimensions. While $\mathcal{X} \neq \emptyset$ is a compact, possibly nonconvex set, the scenario-dependent set $\mathcal{Y}_s$ is a convex polyhedron.

As previously mentioned, depending on the UC problem and number of scenarios, the mixed-integer linear programming (MILP) Problem (1) cannot be solved directly by an off-the-shelf solver. The problem is thus decomposed by making use of the recourse functions.

$$Q_s(\mathbf{x}) := \min_{y \in \mathcal{Y}_s} \mathbf{q}_s^\top y \quad \text{s.t.} \quad \mathbf{W}_s y \le \mathbf{h}_s - \mathbf{T}_s \mathbf{x}. \tag{2}$$

It is well known that $\mathbf{x} \mapsto Q_s(\mathbf{x})$ is a nonsmooth convex function of $\mathbf{x}$. If the above subproblem has a solution, then a subgradient of $Q_s$ at $\mathbf{x}$ can be computed by making use of a Lagrange multiplier, $\pi_s$, associated with the constraint $\mathbf{W}_s y_s \le \mathbf{h}_s - \mathbf{T}_s \mathbf{x}$: $-\mathbf{T}_s^\top \pi_s \in \partial Q_s(\mathbf{x})$. On the other hand, if subproblem (2) is infeasible at $\mathbf{x}$, then the point $\mathbf{x}$ can be cut off by adding a feasibility cut [5].
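As a small illustration of the subgradient property, consider a scalar simple-recourse instance, which is an assumption made here for illustration and not the SHTUC model: $Q(x) = \min\{q\,y : y \ge d - x,\ y \ge 0\}$ with $q > 0$, whose closed form is $Q(x) = q \max(d - x, 0)$. The hypothetical helper `recourse` below returns the value and one subgradient, and the loop checks the subgradient inequality $Q(x) \ge Q(x_0) + g_0 (x - x_0)$ on a grid.

```python
def recourse(x, q, d):
    """Value and one subgradient of Q(x) = min{q*y : y >= d - x, y >= 0}.

    For q > 0 the optimal recourse is y = max(d - x, 0), so
    Q(x) = q * max(d - x, 0). The slope is -q where the shortage
    constraint is active (x < d) and 0 otherwise; at the kink x = d
    any value in [-q, 0] is a subgradient, and 0 is returned.
    """
    shortage = max(d - x, 0.0)
    value = q * shortage
    subgrad = -q if x < d else 0.0
    return value, subgrad


# Hypothetical data: unit recourse cost q and demand d.
q, d = 3.0, 4.0
grid = [i * 0.5 for i in range(-4, 21)]

# Subgradient inequality: every linearization underestimates Q everywhere.
for x0 in grid:
    v0, g0 = recourse(x0, q, d)
    for x in grid:
        v, _ = recourse(x, q, d)
        assert v >= v0 + g0 * (x - x0) - 1e-12

print(recourse(1.0, q, d), recourse(5.0, q, d))
```

The function is convex and piecewise linear with a kink at $x = d$, which is exactly the kind of nonsmoothness the recourse functions exhibit in general.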

Let $\mathcal{P}$ be a partition of $\mathcal{S}$ into $w$ subsets: $\mathcal{P} = \{P_1, \ldots, P_w\}$, with $P_j \neq \emptyset$ for all $j \in \{1, \ldots, w\}$, and $P_j \cap P_i = \emptyset$ for $i \neq j$. By defining $f^j(\mathbf{x}) := \sum_{s \in P_j} Q_s(\mathbf{x})$, Problem (1) can be rewritten as

$$f_* = \min_{\mathbf{x} \in \mathcal{X}} \mathbf{c}^\top \mathbf{x} + f^1(\mathbf{x}) + \dots + f^w(\mathbf{x}). \tag{3}$$
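The regrouping behind (3) can be sketched in a few lines. The scenario data, the round-robin partition rule, and the closed-form toy recourse $Q_s(x) = q_s \max(d_s - x, 0)$ are all assumptions made for illustration; the names `partition_scenarios`, `Q` and `f_j` are hypothetical.

```python
def partition_scenarios(scenarios, w):
    """Split the scenario list into w nonempty, disjoint subsets (round-robin)."""
    if not 1 <= w <= len(scenarios):
        raise ValueError("need 1 <= w <= number of scenarios")
    return [scenarios[j::w] for j in range(w)]


def Q(x, scen):
    """Toy recourse value; q is assumed to already include the scenario probability."""
    q, d = scen
    return q * max(d - x, 0.0)


def f_j(x, subset):
    """Aggregated recourse of one worker: sum of Q_s(x) over its scenarios."""
    return sum(Q(x, s) for s in subset)


# Hypothetical scenarios given as (q_s, d_s) pairs.
scenarios = [(0.5, 2.0), (1.0, 4.0), (0.25, 6.0), (0.25, 8.0), (1.0, 3.0)]
c, x = 1.0, 3.0

parts = partition_scenarios(scenarios, w=2)
total_by_worker = c * x + sum(f_j(x, P) for P in parts)
total_by_scenario = c * x + sum(Q(x, s) for s in scenarios)

# Regrouping the scenarios does not change the objective value.
assert abs(total_by_worker - total_by_scenario) < 1e-12
print(parts, total_by_worker)
```

Any partition into nonempty disjoint subsets gives the same objective; the choice only affects how the evaluation work is distributed among the workers.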

In our notation, w stands for the number of workers evaluating the recourse functions. The workers j ∈ {1, ... ,w} are processes running on a single machine or multiple machines. Likewise, we define a master process—hereafter referred to only as master—to solve the master program (which is defined shortly).

#### *2.1. The Mixed-Integer and Asynchronous Level Decomposition*

For every point $\mathbf{x}_k$, where $k$ represents an iteration counter, worker $j$ receives $\mathbf{x}_k$ and provides us with the first-order information on the component function $f^j$: the value of the function, $f^j(\mathbf{x}_k)$, and a subgradient [23], $g_k^j \in \partial f^j(\mathbf{x}_k)$; in the two-stage setting, $g_k^j := -\sum_{s \in P_j} \mathbf{T}_s^\top \pi_s$. Convexity of $f^j$ implies that the linearization $f^j(\mathbf{x}_k) + \langle g_k^j, \mathbf{x} - \mathbf{x}_k \rangle$ approximates $f^j(\mathbf{x})$ from below for all $\mathbf{x}$. By gathering into the index sets $J^j \subset \{1, 2, \ldots, k\}$ the iterations at which $f^j$ was evaluated, we can construct individual cutting-plane models for the functions $f^j$, with $j \in \{1, \ldots, w\}$: $\max_{i \in J^j} \{ f^j(\mathbf{x}_i) + \langle g_i^j, \mathbf{x} - \mathbf{x}_i \rangle \} \le f^j(\mathbf{x})$. These models define, together with a stability center $\hat{\mathbf{x}}_k$, a level parameter $f_k^{\mathrm{lev}} \in \mathbb{R}$, and a given norm $\|\cdot\|_2$, the following master program (MP)

$$\begin{cases} \min_{\mathbf{x}, r} & \|\mathbf{x} - \hat{\mathbf{x}}_k\|_2 \\ \text{s.t.} & \text{possible feasibility cuts} \\ & f^j(\mathbf{x}_i) + \langle g_i^j, \mathbf{x} - \mathbf{x}_i \rangle \le r_j, \quad \forall i \in J_k^j,\ \forall j = 1, \ldots, w \\ & \mathbf{c}^\top \mathbf{x} + \sum_{j=1}^{w} r_j \le f_k^{\mathrm{lev}},\ \mathbf{x} \in \mathcal{X}. \end{cases} \tag{4}$$
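The cutting-plane models that enter (4) through the variables $r_j$ can be illustrated on a one-dimensional toy function. The function `f`, its subgradient rule, and the iterates are assumptions chosen for illustration; the sketch only checks the two properties used in the text, namely that the model never overestimates a convex function and that it is exact at the evaluated points.

```python
def cut_model(cuts, x):
    """Cutting-plane model: the largest linearization f(x_i) + g_i * (x - x_i)."""
    return max(fx + g * (x - xi) for xi, fx, g in cuts)


def f(x):
    """Toy convex nonsmooth function (an assumption): |x - 1| + 0.5 * |x + 2|."""
    return abs(x - 1.0) + 0.5 * abs(x + 2.0)


def sign(t):
    """Sign of t, with sign(0) = 0 (a valid subgradient choice for |t| at 0)."""
    return (t > 0) - (t < 0)


def subgradient(x):
    """One subgradient of f at x."""
    return sign(x - 1.0) + 0.5 * sign(x + 2.0)


# Oracle calls at a few hypothetical iterates produce cuts (x_i, f(x_i), g_i).
iterates = [-4.0, 0.0, 3.0]
cuts = [(xi, f(xi), subgradient(xi)) for xi in iterates]

grid = [i * 0.25 for i in range(-24, 25)]
# The model is a lower approximation of f ...
assert all(cut_model(cuts, x) <= f(x) + 1e-12 for x in grid)
# ... and it is exact at every iterate where the oracle was called.
assert all(abs(cut_model(cuts, xi) - f(xi)) < 1e-12 for xi in iterates)

print(cut_model(cuts, 1.0), f(1.0))
```

Each new oracle call adds one cut, so the model can only tighten as iterations proceed.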

At iteration $k$, an MP solution is denoted by $\mathbf{x}_{k+1}$. If any $Q_s$ is infeasible at $\mathbf{x}_{k+1}$, then a feasibility cut is added to the MP. We skip further details on this matter, since it is a well-known subject in the literature of two-stage programming [5]. On the other hand, if $\mathbf{x}_{k+1}$ (sent to a worker $j$) is feasible for all recourse functions $Q_s$, the model of $f^j$ in the MP is updated. The improvement in the model of $f^j$ is possibly based on an outdated iterate $\mathbf{x}_{a(j)}$, where $a(j) < k$ is the iteration index of the *anterior* information provided by worker $j$. We care to mention that the MP can be infeasible itself depending on the level parameter $f_k^{\mathrm{lev}}$. Due to the convexity of the involved functions, if the MP is infeasible, then $f_k^{\mathrm{lev}}$ is a valid lower bound, $f_k^{\mathrm{low}}$, on $f_*$ [30].
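The logic of the MP, including the infeasible case, can be sketched under strong simplifying assumptions: the feasible set is a small finite subset of the real line, so the projection in (4) is done by plain enumeration instead of a mixed-integer solver, and feasibility cuts are omitted. The data and the names `model_value` and `solve_master` are hypothetical.

```python
def model_value(cuts, x):
    """Cutting-plane model of one component: max of its linearizations."""
    return max(fx + g * (x - xi) for xi, fx, g in cuts)


def solve_master(X, c, cuts_per_worker, x_hat, f_lev):
    """Project x_hat onto {x in X : c*x + sum_j model_j(x) <= f_lev}.

    Returns the feasible point closest to the stability center x_hat,
    or None when the level set is empty; in that case f_lev is a valid
    lower bound on the optimal value because the models underestimate
    the true component functions.
    """
    level_set = [
        x for x in X
        if c * x + sum(model_value(cuts, x) for cuts in cuts_per_worker) <= f_lev
    ]
    if not level_set:
        return None
    return min(level_set, key=lambda x: abs(x - x_hat))


# Hypothetical data: X = {0, 1, ..., 10} and two workers with one cut each,
# stored as (x_i, f_j(x_i), g_i).
X = list(range(11))
c = 1.0
cuts_per_worker = [
    [(0.0, 12.0, -3.0)],  # worker 1: linearization 12 - 3x
    [(0.0, 4.0, -0.5)],   # worker 2: linearization 4 - 0.5x
]
x_hat = 0.0

# Attainable level: the next iterate is the closest point of the level set.
x_next = solve_master(X, c, cuts_per_worker, x_hat, f_lev=8.0)

# Level that is too ambitious: the MP is infeasible and the level becomes a lower bound.
f_low = None
if solve_master(X, c, cuts_per_worker, x_hat, f_lev=-10.0) is None:
    f_low = -10.0

print(x_next, f_low)
```

In this toy instance the aggregated model is $16 - 2.5x$, so the level 8 is first reached at $x = 4$, while the level $-10$ cannot be reached on the feasible set. In an actual implementation the enumeration would be replaced by a solver call, and the level would be updated from the current lower and upper bounds.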

Without coordination, there is no reason for all workers to be called upon the same iterate. This fact precludes the computation of an upper bound, $f^{\mathrm{up}}_k$, on $f^*$. Algorithm 2 in Reference [31] deals with this situation without resorting to coordination techniques, but it requires more assumptions on the functions $f^j$: upper bounds on their Lipschitz constants should be known. Since we do not make this assumption, we will need scarce coordination akin to Algorithm 3 of Reference [31] for computing upper bounds on $f^*$. As in Reference [31], the coordination iterates are denoted by $\bar{\mathbf{x}}_k$. Assuming that all workers eventually respond (after an unknown time), the coordination allows them to compute the full value, $f(\bar{\mathbf{x}}_k)$, and a subgradient, $\bar{g} \in \partial f(\bar{\mathbf{x}}_k)$, at the coordination iterate. The function value is used to update the upper bound, $f^{\mathrm{up}}_k$, as usual for level methods; the subgradient is used to update the bound $L$ on the Lipschitz constant of $f$.

In our algorithm below, the coordination is implemented by two vectors of Booleans: **to\_coordinate** and **coordinating**. The role of **to\_coordinate**[j] is to indicate to the master that worker j will evaluate $f^j$ on the new coordination point $\bar{\mathbf{x}}_k$ (at that moment, **to\_coordinate**[j] is set to *false*, and **coordinating**[j] is set to *true*). Similarly, **coordinating**[j] indicates to the master that worker j is responding to a coordination step, which is used to update the upper bound. When a worker j responds, it is included in the set $\mathcal{A}$ of available workers. If all workers are busy, then $\mathcal{A} = \emptyset$. Our algorithm mirrors as much as possible Algorithm 3 of Reference [31], but contains some important specificities to handle (i) mixed-integer feasible sets and (ii) extended real-valued objective functions (we do not assume that $f(x)$ is finite for all $x \in \mathcal{X}$). To handle (ii), we furnish our algorithm with a feasibility check (and addition of cuts), and for (i) we not only use a specialized solver for the MP but also change the rule for scarce coordination. The reason is that the rule of Reference [31] is only valid in the convex setting. Under nonconvexity, using the coordination test $\|\mathbf{x}_k - \mathbf{x}_{k-1}\| < \frac{\alpha}{L} \Delta_{k-1}$ (with $\alpha \in (0, 1)$ and $L \ge \|g_i\|$, $i = 1, \ldots, k$) jeopardizes the following inequality, which is important for the convergence analysis:

$$\left\|\mathbf{x}_k - \hat{\mathbf{x}}_k\right\|^2 \ge \left\|\mathbf{x}_{k-1} - \hat{\mathbf{x}}_k\right\|^2 + \left(\frac{\alpha \Delta_{k-1}}{L}\right)^2. \tag{5}$$

In the algorithm below, coordination is triggered when (5) is not satisfied and all workers have already responded on the last coordination iterate (i.e., rr = 0, where **rr** stands for "remaining to respond").
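The trigger described above can be written as a short Python sketch. The function names and the use of NumPy are assumptions made for illustration; the test itself follows inequality (5) and the condition rr = 0 as stated in the text.

```python
import numpy as np


def inequality_5_holds(x_k, x_prev, x_hat, alpha, delta_prev, L):
    """Check ||x_k - x_hat||^2 >= ||x_prev - x_hat||^2 + (alpha*delta_prev/L)^2."""
    x_k, x_prev, x_hat = (
        np.asarray(v, dtype=float) for v in (x_k, x_prev, x_hat)
    )
    lhs = np.linalg.norm(x_k - x_hat) ** 2
    rhs = np.linalg.norm(x_prev - x_hat) ** 2 + (alpha * delta_prev / L) ** 2
    return bool(lhs >= rhs)


def trigger_coordination(x_k, x_prev, x_hat, alpha, delta_prev, L, rr):
    """Coordinate only if (5) fails and no worker still owes a coordination reply."""
    return rr == 0 and not inequality_5_holds(
        x_k, x_prev, x_hat, alpha, delta_prev, L
    )
```

For example, with stability center (0, 0), previous iterate (1, 0), α = 0.5, Δ = 2, and L = 1, the iterate (2, 0) satisfies (5), so no coordination is triggered, whereas the iterate (1, 0.1) violates it.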

The assumption that the algorithm starts with a feasible point is made only for the sake of simplicity. Indeed, the initial point can be infeasible, but, in this case, Step 3 must be changed to ensure that the first computed feasible point is a coordination iterate. For the problem of interest, the feasibility check performed at line 45 amounts to verifying if *f*(**x**k+1) < ∞. In our SHTUC, the feasibility check comprises an auxiliary problem for verifying if ramp-rate constraints would be violated by **x**k+<sup>1</sup> and an additional auxiliary problem for checking if reservoir-volume bounds would be violated. Both problems are easily reduced to small linear-programming problems that can be solved to optimality in split seconds by off-the-shelf solvers.

#### **Algorithm 1: Asynchronous Level Decomposition.**

**1.** Choose a gap tolerance $\mathrm{tol}_\Delta$, upper bound $f^{\mathrm{up}}_1 > f^* + \mathrm{tol}_\Delta$, lower bound $f^{\mathrm{low}}_1 < f^*$, $\alpha \in (0, 1)$, $L > 0$, and $\mathbf{x}_0$ a feasible point. Set $\mathbf{x}_1 = \hat{\mathbf{x}}_1 = \mathbf{x}_{\mathrm{best}} = \mathbf{x}_0$, $\Delta_0 \leftarrow f^{\mathrm{up}}_1 - f^{\mathrm{low}}_1$, $\hat{\Delta} \leftarrow \infty$, rr ← 0, $\mathcal{A} \leftarrow \{1, 2, \ldots, w\}$, k ← 0, $J^j \leftarrow \emptyset$ for $j \in \mathcal{A}$.

**2.** **for** k ← 1 **to** k + 1 **do**

**3.** **if** (5) does not hold and rr = 0 **then**

**4.** $\bar{\mathbf{x}}_k \leftarrow \mathbf{x}_k$, rr ← w, $\bar{f} \leftarrow \mathbf{c}^{\mathrm{T}} \bar{\mathbf{x}}_k$ and $\bar{g} \leftarrow \mathbf{c}$

**5.** **for all** $j \in \mathcal{A}$ **do**

**6.** **to\_coordinate**[j] ← *false* and

**7.** **coordinating**[j] ← *true*

**8.** **end for**

**9.** **for all** $j \in \{1, \ldots, w\} \setminus \mathcal{A}$ **do**

**10.** **to\_coordinate**[j] ← *true* and

**11.** **coordinating**[j] ← *false*

**12.** **end for**

**13.** **end if**

**14.** Send $\mathbf{x}_k$ to all available workers $j \in \mathcal{A}$ and set $\mathcal{A} = \emptyset$

**15.** Update the set $\mathcal{A}$ of idle workers and receive $(f^j(\mathbf{x}_{a(j)}), g^j_{a(j)})$ from workers $j \in \mathcal{A}$

**16.** Update $J^j \leftarrow J^j \cup \{a(j)\}$ for all $j \in \mathcal{A}$ and set $\mathcal{R} \leftarrow \emptyset$

**17.** **for all** $j \in \mathcal{A}$ **do**

**18.** **if** **coordinating**[j] = *true* **then**

**19.** **coordinating**[j] ← *false* and rr ← rr − 1

**20.** $\bar{f} \leftarrow \bar{f} + f^j(\mathbf{x}_{a(j)})$ and $\bar{g} \leftarrow \bar{g} + g^j_{a(j)}$

**21.** **if** rr = 0 **then**

**22.** Set $L \leftarrow \max\{L, \|\bar{g}\|\}$

**23.** **if** $\bar{f} < f^{\mathrm{up}}_k$ **then**

**24.** $f^{\mathrm{up}}_k \leftarrow \bar{f}$ and $\mathbf{x}_{\mathrm{best}} \leftarrow \bar{\mathbf{x}}_k$

**25.** **end if**

**26.** **end if**

**27.** **else**

**28.** **if** **to\_coordinate**[j] = *true* **then**

**29.** Send $\bar{\mathbf{x}}_k$ to worker j and set $\mathcal{R} \leftarrow \mathcal{R} \cup \{j\}$

**30.** Set **to\_coordinate**[j] ← *false* and

**31.** **coordinating**[j] ← *true*

**32.** **end if**

**33.** **end if**

**34.** **end for**

**35.** Set $\mathcal{A} \leftarrow \mathcal{A} \setminus \mathcal{R}$

**36.** Set $\Delta_k \leftarrow f^{\mathrm{up}}_k - f^{\mathrm{low}}_k$

**37.** **if** $\Delta_k \le \mathrm{tol}_\Delta$ **then stop**: return $\mathbf{x}_{\mathrm{best}}$ and $f^{\mathrm{up}}_k$ **end if**

**38.** **if** $\Delta_k \le \alpha \hat{\Delta}$ **then** Set $\hat{\mathbf{x}}_k \leftarrow \mathbf{x}_{\mathrm{best}}$ and $\hat{\Delta} \leftarrow \Delta_k$ **end if**

**39.** $f^{\mathrm{lev}}_k \leftarrow f^{\mathrm{up}}_k - \alpha \Delta_k$

**40.** **if** (4) is feasible **then**

**41.** Get a new iterate $\mathbf{x}_{k+1}$ from the solution of (4)

**42.** **else**

**43.** Set $f^{\mathrm{low}}_k \leftarrow f^{\mathrm{lev}}_k$ and go to Step 36

**44.** **end if**

**45.** **if** $\mathbf{x}_{k+1}$ leads to infeasible subproblems **then**

**46.** Add a feasibility cut to the MP (2) and go to Step 40

**47.** **end if**

**48.** Set $f^{\mathrm{up}}_{k+1} \leftarrow f^{\mathrm{up}}_k$, $f^{\mathrm{low}}_{k+1} \leftarrow f^{\mathrm{low}}_k$, $\hat{\mathbf{x}}_{k+1} \leftarrow \hat{\mathbf{x}}_k$ and $\bar{\mathbf{x}}_{k+1} \leftarrow \bar{\mathbf{x}}_k$

**49.** **end for**
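The bound bookkeeping in Steps 36 to 39 of Algorithm 1 can be sketched in Python as follows. The names `LevelState` and `update_level` are hypothetical, and the MP solve, the workers, and the feasibility cuts are deliberately left out; this is only an illustration of how the gap, the stability center, and the level parameter interact.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class LevelState:
    f_up: float                      # current upper bound
    f_low: float                     # current lower bound
    delta_hat: float = float("inf")  # gap recorded at the last critical iteration
    x_best: Any = None               # best coordination iterate so far
    x_hat: Any = None                # stability center


def update_level(state: LevelState, alpha: float, tol: float) -> Tuple[bool, Optional[float]]:
    """Return (stop, f_lev) following Steps 36 to 39 of Algorithm 1."""
    delta = state.f_up - state.f_low            # Step 36: optimality gap
    if delta <= tol:                            # Step 37: stopping test
        return True, None
    if delta <= alpha * state.delta_hat:        # Step 38: critical iteration
        state.x_hat = state.x_best
        state.delta_hat = delta
    return False, state.f_up - alpha * delta    # Step 39: level parameter
```

When the MP turns out to be infeasible (Step 43), a caller would set `state.f_low` to the returned level and call `update_level` again, which mirrors the jump back to Step 36.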

#### *2.2. Convergence Analysis*

To analyze the convergence of the mixed-integer asynchronous computing (ASYN) level decomposition (LD) described above, we rely as much as possible on Reference [31]. However, to account for the mixed-integer nature of the feasible set, we need novel developments like the ones in Theorem 1 below. Throughout this section, we assume $\mathrm{tol}_\Delta = 0$, as well as the following:

#### **Hypothesis 1 (H1).** *all the workers are responsive;*

#### **Hypothesis 2 (H2).** *the algorithm generates only finitely many feasibility cuts;*

#### **Hypothesis 3 (H3).** *the workers provide bounded subgradients.*

As for H1, the assumption H2 is a mild one: H2 holds, for instance, when *f* is a polyhedral function, or when X has only finitely many points. The problem of interest satisfies both these properties, and, therefore, H2 is verified. Due to convexity of *f*, assumption H3 holds, e.g., if X is contained in an open convex set that is itself a subset of *Dom*(*f*) (in this case, no feasibility cut will be generated). H3 also holds in our setting if subgradients are computed via basic optimal dual solutions of the second-stage subproblems. Under H3, we can ensure that the parameter *L* in the algorithm is finite.

In our analysis, we use the fact that the sequences of the optimality gap, $\Delta_k$, and upper bound, $f^{\mathrm{up}}_k$, are non-increasing by definition, and that the sequence of lower bounds, $f^{\mathrm{low}}_k$, is non-decreasing. More specifically, we update the lower bound only when the MP is infeasible. We count with $\ell$ the number of times the gap significantly decreases, meaning that the test of line 38 is triggered, and denote by $k(\ell)$ the corresponding iteration. We have the following by construction:

$$
\Delta\_{\mathbf{k}(\ell+1)} \le \alpha \Delta\_{\mathbf{k}(\ell)} \le \alpha^2 \Delta\_{\mathbf{k}(\ell-1)} \le \dots \le \alpha^\ell \Delta\_1 \quad \forall \; \ell = 1, 2, \dots \tag{6}
$$
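To make the geometric decrease in (6) concrete, the following sketch counts the smallest $\ell$ for which the bound $\alpha^\ell \Delta_1$ drops below a target value. The function name and the positive target `eps` are hypothetical (the analysis itself takes a zero tolerance), so this is only an illustration of the rate implied by (6).

```python
def critical_iterations_needed(delta_1: float, alpha: float, eps: float) -> int:
    """Smallest ell with alpha**ell * delta_1 <= eps, for 0 < alpha < 1 and eps > 0."""
    if not (0.0 < alpha < 1.0) or eps <= 0.0:
        raise ValueError("require 0 < alpha < 1 and eps > 0")
    ell, bound = 0, delta_1
    while bound > eps:
        bound *= alpha  # one more significant decrease of the gap
        ell += 1
    return ell
```

With an initial gap of 100 and α = 0.5, seven such decreases suffice to bring the bound below 1.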

As in Reference [31], $k(\ell)$ denotes a critical iteration, and $\mathbf{x}_{k(\ell)}$ denotes a critical iterate. We introduce the set of iteration indices between two consecutive critical iterations by $K_\ell := \{k(\ell) + 1, \ldots, k(\ell + 1) - 1\}$. The proof of convergence of the ASYN LD consists in showing that the algorithm performs infinitely many critical iterations when $\mathrm{tol}_\Delta = 0$. We start with the following lemma, which is a particular case of Reference [31], Lemma 3, and does not depend on the structure of $\mathcal{X}$.

**Lemma 1.** *Fix an arbitrary $\ell$ and let $K_\ell$ be defined as above. Then, for all $k \in K_\ell$, (a) the MP is feasible, and (b) the stability center is fixed: $\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k(\ell)}$.*

Item (a) above ensures that the MP is well-defined and that $f^{\mathrm{low}}_k$ is fixed for all $k \in K_\ell$. Note that the lower bound is updated only when the MP is found infeasible, and this fact immediately triggers the test at line 38 of the algorithm. Similarly, Algorithm 1 guarantees that the stability center remains fixed for all $k \in K_\ell$, since an update of the stability center would imply a new critical iteration.

**Theorem 1.** *Assume that $\mathcal{X}$ is a compact set and that H1-H3 hold. Let $\mathrm{tol}_\Delta = 0$ in the algorithm. Then $\lim_{k \to \infty} \Delta_k = 0$.*

**Proof of Theorem 1.** By (6), we only need to show that the counter $\ell$ increases indefinitely (i.e., that there are infinitely many critical iterations). We obtain this by showing that, for any $\ell$, the set $K_\ell$ is finite; for this, suppose that $\Delta_k > \Delta > 0$ for all $k \in K_\ell$. We proceed in two steps, showing the following: (i) the number of asynchronous iterations between two consecutive coordination steps is finite, and (ii) the number of coordination steps in $K_\ell$ is finite as well. If case (i) were not true, then (5) and Lemma 1(b) would give $\|\mathbf{x}_k - \hat{\mathbf{x}}_{k(\ell)}\|^2 \ge \|\mathbf{x}_{k-1} - \hat{\mathbf{x}}_{k(\ell)}\|^2 + \left(\frac{\alpha \Delta}{L}\right)^2$ for all $k \in K_\ell$ greater than the iteration $\bar{k}$ of the last coordination iterate. Applying this inequality recursively down to $\bar{k}$, we obtain $\mathrm{Diam}(\mathcal{X})^2 \ge \|\mathbf{x}_k - \hat{\mathbf{x}}_{k(\ell)}\|^2 \ge (k - \bar{k} - 1) \left(\frac{\alpha \Delta}{L}\right)^2$. However, this inequality, together with H1 and $L < \infty$ (due to H3), contradicts the fact that $\mathcal{X}$ is bounded. Therefore, item (i) holds. We now turn our attention to item (ii): let $s, s' \in K_\ell$ with $s < s'$ be the iteration indices of any two coordination steps. At the moment in which $\bar{\mathbf{x}}_{s'}$ is computed, the information $(f^j(\bar{\mathbf{x}}_s), g^j_s)$ is available at the MP for all $j = 1, \ldots, w$. As a result of the MP definition, the following constraints are satisfied by $\bar{\mathbf{x}}_{s'}$:

$$f^j(\bar{\mathbf{x}}_s) + \langle g^j_s, \bar{\mathbf{x}}_{s'} - \bar{\mathbf{x}}_s \rangle \le r_j \quad \text{and} \quad \mathbf{c}^{\mathrm{T}} \bar{\mathbf{x}}_{s'} + \sum_{j=1}^{w} r_j \le f^{\mathrm{lev}}_{s'-1}. \tag{7}$$

By summing these inequalities over $j$ and rearranging terms, we get $f(\bar{\mathbf{x}}_s) - f^{\mathrm{lev}}_{s'-1} \le \langle \mathbf{c} + \sum_{j=1}^{w} g^j_s, \bar{\mathbf{x}}_s - \bar{\mathbf{x}}_{s'} \rangle \le \Gamma \|\bar{\mathbf{x}}_s - \bar{\mathbf{x}}_{s'}\|$, where the constant $\infty > \Gamma \ge L \ge \|\mathbf{c} + \sum_{j=1}^{w} g^j_s\|$ is ensured by H3. The definition $f^{\mathrm{lev}}_{s'-1} = f^{\mathrm{up}}_{s'-1} - \alpha \Delta_{s'-1}$ and the inequality $f(\bar{\mathbf{x}}_s) \ge f^{\mathrm{up}}_{s'-1}$ give $\|\bar{\mathbf{x}}_s - \bar{\mathbf{x}}_{s'}\| \ge \frac{\alpha \Delta_{s'-1}}{\Gamma} \ge \frac{\alpha \Delta}{\Gamma} > 0$. If there were an infinite number of coordination steps inside $K_\ell$, the compactness of $\mathcal{X}$ would allow us to extract a converging subsequence, and this would contradict the above inequality. The number of coordination steps inside $K_\ell$ is thus finite. As a conclusion of (i) and (ii), the index set $K_\ell$ is finite, and the chain (6) concludes the proof. □
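As a small worked step (a rearrangement of the inequality used for item (i) in the proof, not an additional result from the source), dividing by $(\alpha \Delta / L)^2$ makes the finite bound explicit. Here $\bar{k}$ denotes the iteration of the last coordination iterate and $\mathrm{Diam}(\mathcal{X})$ the diameter of $\mathcal{X}$, as in the proof:

$$
k - \bar{k} - 1 \;\le\; \left( \frac{L \, \mathrm{Diam}(\mathcal{X})}{\alpha \Delta} \right)^{2}.
$$

Since the right-hand side does not depend on $k$, only finitely many asynchronous iterations satisfying (5) can follow a coordination step while the gap stays above $\Delta$.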

#### *2.3. Dynamic Asynchronous Level Decomposition*

In the asynchronous approach described in Algorithm 1, the component functions *f*<sup>j</sup> are statically assigned to workers: worker j always evaluates the same component function j. Likewise, the usual implementation of the synchronous LD strategy is to task workers with solving fixed sets of *Q*<sub>s</sub>. We call these strategies static asynchronous LD and static synchronous LD. However, as previously mentioned, such task-allocation policies might result in significant idle times, even for the asynchronous method, because we need the first-order information on all *f*<sup>j</sup> to compute valid bounds. To lessen the idle times, we implement dynamic-task-allocation strategies, in which component functions are dynamically assigned to workers as soon as they become available. Our dynamic allocation differs from Reference [15] because we do not use a list of iterates.

To ease the understanding of the LD methods applied in this work, and to highlight their differences, we introduce a new figure: a coordinator process. The coordinator is responsible for tasking workers with functions to be evaluated. Note, however, that this additional figure is only strictly necessary in the dynamic asynchronous LD; in the other three methods, this responsibility can be taken by the master. Nonetheless, in all methods, the master has three roles: solving the MP, getting iterates, and requesting functions to be evaluated at the newly obtained iterates. By construction, in the synchronous methods, the master requests the coordinator to evaluate all functions *f*<sup>j</sup> at the same iterate, and it waits until the information on all functions has been received to continue the process. On the other hand, in the asynchronous variants, the master computes a new iterate, requests the coordinator to evaluate it on possibly not all *f*<sup>j</sup>, and receives information on outdated iterates from the coordinator.

Given that the master has requested an iterate **x** to be evaluated in some *f*<sup>j</sup>, the main difference between the static and the dynamic asynchronous methods is that, in the static form, the coordinator always sends **x** to the same worker that has been previously tasked with solving *f*<sup>j</sup>, while in the dynamic one, the coordinator sends **x** to any available worker.
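The effect of the two allocation policies on idle time can be illustrated with a toy Python simulation of a single round in which every component function is evaluated once. The function names, the task costs, and the worker speeds are hypothetical, and the sketch is not the authors' coordinator; it only contrasts a fixed task-to-worker mapping with handing each task to whichever worker becomes idle first.

```python
import heapq


def static_makespan(task_costs, worker_speeds):
    """Task j always runs on worker j mod (number of workers)."""
    finish = [0.0] * len(worker_speeds)
    for j, cost in enumerate(task_costs):
        w = j % len(worker_speeds)
        finish[w] += cost / worker_speeds[w]
    return max(finish)


def dynamic_makespan(task_costs, worker_speeds):
    """Each task is handed to the worker that becomes idle first."""
    heap = [(0.0, w) for w in range(len(worker_speeds))]  # (time idle, worker id)
    heapq.heapify(heap)
    makespan = 0.0
    for cost in task_costs:
        t, w = heapq.heappop(heap)
        t += cost / worker_speeds[w]
        makespan = max(makespan, t)
        heapq.heappush(heap, (t, w))
    return makespan


costs = [1.0] * 16             # 16 component functions of equal cost (hypothetical)
speeds = [2.0, 2.0, 1.0, 1.0]  # two fast and two slow workers (hypothetical)
static_time = static_makespan(costs, speeds)
dynamic_time = dynamic_makespan(costs, speeds)
```

With these illustrative numbers, the fixed mapping is held back by the slow workers, while the dynamic policy lets the fast workers absorb more tasks.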

#### **3. Modeling Details and Case Studies**

The general formulation of our SHTUC is presented in (8)–(19).

$$f^* = \min \sum_{g \in \mathcal{G}} \left[ \sum_{t \in \mathcal{T}} \left( \mathbf{CS}_g \cdot a_{gt} + \sum_{s \in \mathcal{S}} \mathbf{C}_g \cdot tg_{gts} \right) \right] + \sum_{b \in \mathcal{B}} \sum_{t \in \mathcal{T}} \mathbf{CL} \cdot (\delta^{+}_{bt} + \delta^{-}_{bt}) + \sum_{s \in \mathcal{S}} f^{\omega}_s(v_s) \tag{8}$$

$$\text{s.t.} \quad \sum_{o = t - \mathbf{TU}_g + 1}^{t} a_{go} \le I_{gt}, \quad \sum_{o = t - \mathbf{TD}_g + 1}^{t} b_{go} \le 1 - I_{gt} \tag{9}$$

$$a_{gt} - b_{gt} = I_{gt} - I_{g,t-1}, \quad z_{ht} - u_{ht} = w_{ht} - w_{h,t-1} \tag{10}$$

$$z_{ht}, u_{ht}, w_{ht}, a_{gt}, b_{gt}, I_{gt} \in \{0, 1\} \tag{11}$$

$$I_{gt} \cdot \underline{\mathbf{P}}_g \le tg_{gts} \le I_{gt} \cdot \overline{\mathbf{P}}_g \tag{12}$$

$$tg_{gts} - tg_{g,t-1,s} \le I_{g,t-1} \cdot \overline{\mathbf{R}}_g + (1 - I_{g,t-1}) \cdot \mathbf{SU}_g \tag{13}$$

$$tg_{g,t-1,s} - tg_{gts} \le I_{gt} \cdot \underline{\mathbf{R}}_g + (1 - I_{gt}) \cdot \mathbf{SD}_g \tag{14}$$

$$v_{hts} - v_{h,t-1,s} + f^{v}_{hts}(q, s) + \mathbf{A}_{hts} = 0 \tag{15}$$

$$\underline{\mathbf{V}}_h \le v_{hts} \le \overline{\mathbf{V}}_h, \quad w_{ht} \cdot \underline{\mathbf{Q}}_h \le q_{hts} \le w_{ht} \cdot \overline{\mathbf{Q}}_h, \quad 0 \le s_{hts} \le \overline{\mathbf{S}}_h \tag{16}$$

$$0 \le hg_{hts} \le f^{hg}_{hts}(q, s) \tag{17}$$

$$f^{p}_{bts}(tg, hg, \delta^{+}, \delta^{-}) + \mathbf{WG}_{bts} - \mathbf{L}_{bt} = 0 \tag{18}$$

$$\underline{\mathbf{TL}}_l \le f^{l}_{lts}(tg, hg, \delta^{+}, \delta^{-}) \le \overline{\mathbf{TL}}_l, \quad \forall l \in \mathcal{L} \tag{19}$$

In our model, the indices and respective sets containing them are g ∈ G for thermal generators, h ∈ H for hydro plants, b ∈ B for buses, l ∈ L for transmission lines, and t and o ∈ T for periods. In (8), thermal generators' start-up costs are **CS**, and we assume that the shutdown cost is null. The thermal-generation costs are **C**; **CL** is the per-unit cost of load shedding (δ+) and generation surplus (δ−). Expected future-operation cost for scenario s is represented by the piecewise-affine function $f^{\omega}_s(v_s) : \mathbb{R}^{|\mathcal{H}|} \to \mathbb{R}$, where $v_s \in \mathbb{R}^{|\mathcal{H}|}$ are the reservoir volumes in the last period of scenario s. The first-stage decisions are thermal generators' commitment, start-up, and shutdown, respectively, *I*, *a*, and *b*, and their hydro counterparts (*w*, *z*, and *u*). Set X in (1) contains the feasible commitments of thermal and hydro generators in our SHTUC, and it is defined by Constraints (9)–(11). In this work, we model the statuses of hydro plants with associated binary variables only in the first 48 h, to reduce the computational burden. For the remaining periods, the hydro plants are modeled only with continuous variables.

The minimum up-time constraint in (9) ensures that, once turned on, thermal generator g remains on for at least $\mathbf{TU}_g$ periods. Likewise, the minimum down-time constraint in (9) requires that, once g has been turned off, it must remain off for at least $\mathbf{TD}_g$ periods. Constraints (10) guarantee the satisfaction of logical relations of status, start-up, and shutdown for thermal and hydro plants. The sets $Y_s$ are defined by (12)–(19). Constraints (12) are the usual limits on thermal generation *tg*; (13) and (14) are the up and down ramp-rate limits, and the start-up and shutdown requirements of generators g. Equation (15) is the mass balance of the hydro plant h's reservoir, where $\mathbf{A}_{hts}$ is the inflow to reservoir h in period t of scenario s. Moreover, the affine function $f^{v}_{hts}(q, s) : \mathbb{R}^{2 \cdot |\mathcal{H}| \cdot |\mathcal{T}| \cdot |\mathcal{S}|} \to \mathbb{R}$ maps the inflow to h's reservoir in period t of scenario s given the vectors of turbine discharge *q* and spillage *s*. The constraints in (16) are the limits on reservoir volume, *v*, turbine discharge, *q*, and spillage, *s*. In (17), the piecewise-affine function $f^{hg}_{hts}(q, s) : \mathbb{R}^{2 \cdot |\mathcal{H}| \cdot |\mathcal{T}| \cdot |\mathcal{S}|} \to \mathbb{R}$ bounds the hydropower generation $hg_{hts}$ of plant h. We use the classical DC network model: Equation (18) is the bus power balance, where the linear function $f^{p}_{bts}(tg, hg, \delta^{+}, \delta^{-}) : \mathbb{R}^{|\mathcal{T}| \cdot |\mathcal{S}| \cdot (|\mathcal{G}| + |\mathcal{H}| + 2 \cdot |\mathcal{B}|)} \to \mathbb{R}$ maps the controlled generation at each bus into the power injection at bus b, $\mathbf{WG}_{bts}$ is the wind generation at bus b, and $\mathbf{L}_{bt}$ is the corresponding load at b. Lastly, (19) are the limits on the flow of transmission line l in period t and scenario s, defined by the affine function $f^{l}_{lts}(tg, hg, \delta^{+}, \delta^{-}) : \mathbb{R}^{|\mathcal{T}| \cdot |\mathcal{S}| \cdot (|\mathcal{G}| + |\mathcal{H}| + 2 \cdot |\mathcal{B}|)} \to \mathbb{R}$.
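The logical relation (10) and the minimum up-time and down-time constraints (9) can be checked on a single unit's commitment schedule with the short Python sketch below. The function names are hypothetical, and two simplifying assumptions are made that the source does not state: the unit's status before the horizon is given by `initial_status`, and the summation windows in (9) are truncated at the first period.

```python
def startup_shutdown(status, initial_status=0):
    """Derive start-up a_t and shutdown b_t from status I_t via a_t - b_t = I_t - I_{t-1}."""
    a, b = [], []
    prev = initial_status
    for cur in status:
        a.append(1 if cur - prev == 1 else 0)
        b.append(1 if prev - cur == 1 else 0)
        prev = cur
    return a, b


def satisfies_min_up_down(status, tu, td, initial_status=0):
    """Check the two inequalities in (9) for every period of the horizon."""
    a, b = startup_shutdown(status, initial_status)
    for t in range(len(status)):
        starts_in_window = sum(a[max(0, t - tu + 1): t + 1])
        stops_in_window = sum(b[max(0, t - td + 1): t + 1])
        if starts_in_window > status[t]:      # a recent start-up forces I_t = 1
            return False
        if stops_in_window > 1 - status[t]:   # a recent shutdown forces I_t = 0
            return False
    return True
```

For instance, a unit that starts in period 1 and stops in period 2 violates a three-period minimum up-time, while one that stays on for three periods does not.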

We assess our algorithm on a 46-bus system with 11 thermal plants, 16 hydro plants, 3 wind farms, and 95 transmission lines. The system's installed capacity is 18,600 MW, of which 18.9% is due to thermal plants, 68.1% to hydro plants, and 13% to wind farms. We consider a one-week-long planning horizon with hourly discretization. Thus, a one-scenario instance of our SHTUC would have 7848 binary variables and 5315 constraints at the first stage, and 36,457 continuous variables and 100,949 constraints for each scenario in the second stage. Furthermore, the weekly peak load in the baseline case is 11,204 MW, nearly 60.2% of the installed capacity. The hydro plants are distributed over two basins and include both run-of-river ones and plants with reservoirs capable of regularization. Further information about the system can be found in the multimedia files attached.
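The binary-variable count quoted above can be reproduced under one reading of the model. This breakdown is an assumption on our part rather than one given in the source: three binaries per thermal unit per hour over the whole week (status, start-up, shutdown), and three binaries per hydro plant per hour over the first 48 h only.

```python
# Figures taken from the text
n_thermal, n_hydro = 11, 16
hours_in_week = 7 * 24            # hourly discretization over one week
hydro_binary_hours = 48           # hydro statuses are binary only in the first 48 h

# Assumed breakdown: three binary variables per unit and period
binaries_per_unit_period = 3      # (I, a, b) for thermal, (w, z, u) for hydro

thermal_binaries = n_thermal * binaries_per_unit_period * hours_in_week
hydro_binaries = n_hydro * binaries_per_unit_period * hydro_binary_hours
total_binaries = thermal_binaries + hydro_binaries
```

Under this reading, the thermal and hydro contributions add up to the 7848 first-stage binary variables reported in the text.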

The uncertainty comes from wind generation and the inflows. In all tests, we use a scenario set with 256 scenarios. To assess how our algorithm performs on distinct scenario sets, three sets (A, B, and C) are considered. Moreover, we use three initial useful-reservoir-volume levels: 40%, 50%, and 70%. The impact of different load levels on the performance of our algorithms is analyzed through three load levels: low (L), moderate (M), and high (H). Level H is our baseline case regarding load. Levels M and L have the same load profile as H, but with all loads multiplied by factors of 0.9 and 0.8, respectively. Lastly, to investigate how our algorithm's convergence rate is affected by different choices of initial stability centers, we implement two strategies for obtaining the initial stability center. In both strategies, we solve an expected-value problem, as defined in Reference [5]. In the first one, we use the classical Benders decomposition (BD) with a coarse relative-optimality-gap tolerance of 10% to get a possibly low-quality stability center (LQSC). To obtain a stability center of hopefully high quality, which we refer to as high-quality stability center (HQSC), we solve the expected-value problem directly with Gurobi 8.1.1 [33] with a relative-optimality-gap tolerance of 1%. The time limit for obtaining the initial stability centers LQSC and HQSC is set to 5 min.

The computing setting consists of seven machines of two types: 4 of them have 128 GB of RAM and two Xeon E5-2660 v3 processors with 10 cores clocking at 2.6 GHz; the other 3 machines have 32 GB of RAM and two Xeon X5690 processors with 6 cores clocking at 3.47 GHz. All machines are in a LAN with 1-Gbps network interfaces. We test two machine combinations. In Combination 1, there are four 20-core machines and one 12-core machine. In Combination 2, we replace one 20-core machine by two 12-core machines. Regardless of the combination, one 12-core machine is defined as the head node, where only the master is launched. Gurobi can use up to 10 cores for the master; for all other processes, i.e., the workers, Gurobi is limited to a single core.

Our computing setting is composed of machines with different configurations. Naturally, solving the same component function on two distinct machines may result in different outputs and different runtimes. Consequently, the path taken by the MP across iterations might change significantly between experiments on the same data. More specifically to asynchronous methods, the varying order of information arrival at the MP may also yield different convergence rates. Hence, to reduce the effect of these seemingly random behaviors, we conducted 5 experiments for each problem instance. Therefore, our testbed E is defined as E = {40, 50, 70} × {A, B, C} × {L, M, H} × {LQSC, HQSC} × {Trial 1, ... , Trial 5} × {Combination 1, Combination 2}; that is, we have 54 problems and 540 experiments. In all instances in E, we divide S into 16 subsets. Thus, following our previous definitions, w = 16 and any subset $P_j$ is such that $|P_j| = 16$. Additionally, we set a relative-optimality-gap tolerance of 1% and a time limit of 30 min for all instances in E. Gurobi 8.1.1 is used to solve the MILP MP and the component functions (linear-programming problems) that form the subproblem. The inter-process communication is implemented with mpi4py and Microsoft MPI v10.0.
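The size of the testbed and of the scenario subsets can be verified with a few lines of Python. The variable names are hypothetical, and the contiguous split of the 256 scenarios into subsets is an assumption made only for illustration; the source does not say how scenarios are grouped.

```python
from itertools import product

volumes = (40, 50, 70)                 # initial useful-reservoir-volume levels (%)
scenario_sets = ("A", "B", "C")
load_levels = ("L", "M", "H")
stability_centers = ("LQSC", "HQSC")
trials = range(1, 6)
combinations = (1, 2)

# A "problem" is a data instance; an "experiment" adds the trial and machine combination
problems = list(product(volumes, scenario_sets, load_levels, stability_centers))
experiments = list(product(problems, trials, combinations))

# Split 256 scenarios into w = 16 subsets P_j (contiguous split is an assumption)
n_scenarios, w = 256, 16
size = n_scenarios // w
subsets = [list(range(j * size, (j + 1) * size)) for j in range(w)]
```

The counts match the 54 problems, 540 experiments, and 16 subsets of 16 scenarios stated in the text.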

#### **4. Results**

In this section, the methods are analyzed based on their computing-time performances. We focus on this metric because our results have not shown significant differences among the methods for other metrics, e.g., optimality gap and upper bounds. In addition to analyzing averages of the metric, we use the well-known performance profile [34]. Multimedia files containing the main results for the set E are attached to this work.

Figure 1 presents the performance profiles of the methods considering the experiments E. In Figure 1, ρ(τ) is the probability that the performance ratio of a given method is within a factor τ of the best ratio, as in Reference [34]. Applying the classical Benders decomposition (BD) on the set {40, 50, 70} × {A} × {L, M, H} × {Combination 1} results in the convergence only of the problem in {70} × {A} × {M} × {Combination 1}, for which BD converges to a 1%-optimal solution in 1281.42 s. Thus, it is reasonable to expect that the classical BD would also perform poorly for the remaining experiments E.

**Figure 1.** Performance profiles over the set E.
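For readers unfamiliar with performance profiles, the quantity ρ(τ) can be computed as in the sketch below. It assumes NumPy, a hypothetical function name, a matrix of running times with one row per experiment and one column per method, and at least one finite time in every row; the numbers in the example are made up and are not results from this work.

```python
import numpy as np


def performance_profile(times, taus):
    """Return rho[i, m]: fraction of experiments where method m is within taus[i] of the best.

    times has shape (n_experiments, n_methods); use np.inf for runs that failed.
    """
    times = np.asarray(times, dtype=float)
    best = times.min(axis=1, keepdims=True)   # best time per experiment
    ratios = times / best                     # performance ratios (>= 1)
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus])


# Made-up running times for three experiments and two methods
example_times = [[10.0, 20.0], [30.0, 15.0], [8.0, 40.0]]
rho = performance_profile(example_times, taus=[1.0, 2.0, 5.0])
```

Each row of `rho` gives one point of the curves: the value at τ = 1 is the share of experiments on which a method was the fastest, and the curves rise toward 1 as τ grows.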

In Figure 1, we see that the dynamic asynchronous LD outperforms all other methods for most instances in E. Its performance ratio is within a factor of 2 from the best ratio for about 500 instances (about 92% of the total). The static asynchronous LD also has a reasonable overall performance: it is within a factor of 2 from the best ratio for more than 400 instances. Moreover, we see that the dynamic-allocation strategy provides significant improvements for both the asynchronous and synchronous LD approaches. The dynamic synchronous LD converges faster than its static counterpart for most of the experiments. Figures 2 and 3 show the performance profiles considering only instances in E with machine Combinations 1 and 2, respectively.

**Figure 2.** Performance profiles for the instances with machine Combination 1.

**Figure 3.** Performance profiles for the instances with machine Combination 2.

Figure 2 illustrates that, for a distributed setting in which workers are deployed on machines with identical characteristics, the performances of the methods with dynamic allocation and those with static allocation are similar. Nonetheless, we see that the asynchronous methods still outperform the synchronous LD for most experiments.

In contrast to Figure 2, Figure 3 shows that the dynamic-allocation strategy provides significant time savings for the instances in E with Machine Combination 2. This is due to the great imbalance between the different machines in Combination 2—machines with processors Xeon E5-2660 v3 are much faster than those with processors Xeon X5690.

Table 1 gives the average wall-clock computing times over subsets of E. From this table, we see that the relative average speed-ups of the dynamic and static asynchronous LD over the entire set E w.r.t. the static synchronous LD are 54% and 29%, respectively; w.r.t. the dynamic synchronous LD, the speed-ups are 45% and 16%, respectively. Moreover, we see that the time savings are more significant for harder-to-solve instances, e.g., instances with high load and/or low-quality initial stability centers. Additionally, Table 1 shows that the dynamic asynchronous LD provides considerable reductions in the standard deviations of the elapsed computing times, in comparison with the other methods. For example, for the problems with high load level (H), the dynamic asynchronous LD has a standard deviation about 16%, 13%, and 27% smaller than that of the static asynchronous LD, dynamic synchronous LD, and static synchronous LD, respectively.

Based on the data from Table 1, we can compute the speed-up provided by our proposed dynamic ASYN LD w.r.t. the other three variants considered here. To better appreciate such speed-ups, we show them in Table 2, where we see that the proposed ASYN LD provides consistent speed-ups over the entire range of operating conditions considered here.

The advantages of the asynchronous methods are made clearer in Figure 4, where we see that the asynchronous methods not only provide (on average) better running times but also present significantly less variation among the problems in E. The latter is relevant in the day-to-day operations of ISOs, since, if there are stochastic hydrothermal unit-commitment (SHTUC) cases that take significantly more time to be solved than expected, subsequent operation steps that depend on the results of the SHTUC might be affected. Take, for instance, the case from the Midcontinent Independent System Operator reported in Reference [3], where the (deterministic) UC is reported to have solution times varying from just 50 to over 3600 s. Such variation can be problematic in the day-to-day operation of power systems since it may disrupt tightly scheduled operations. Naturally, methods that can reduce such variance and still produce high-quality solutions in reasonable times are appealing.


**Table 1.** Average elapsed time and standard deviation in seconds.

Each row reports the average elapsed time, with the standard deviation in parentheses, computed considering only the instances in E with the parameter given in column 1. For example, the averages and respective standard deviations in row 3 are computed considering all experiments for which the initial useful-reservoir-volume level is 40%. Likewise, rows 4 and 7 provide the averages over instances with scenario set A and load level L, respectively. In rows 10 and 11, HQSC and LQSC stand for high-quality stability center and low-quality stability center, respectively.



As in Table 1, the rows indicate that the average speed-up is computed considering only the instances in E with the parameter given in column 1. Moreover, the columns indicate the method the speed-up is computed for. For example, column Static SYN (synchronous computing) LD gives the speed-ups provided by the ASYN LD over instances in the first column w.r.t. the static synchronous level decomposition.

**Figure 4.** Boxplot of the methods over the set E.

#### **5. Conclusions**

In this work, we present an extension of the asynchronous level decomposition of Reference [31] in a Benders-decomposition framework. We show a convergence analysis of our algorithm, proving that it converges to an optimal solution, if one exists, in finitely many iterations. Our experiments are conducted on an extensive testbed from a real-life-size system. The results show that the proposed asynchronous algorithm outperforms its synchronous counterpart in most of the problems and provides significant time savings. Moreover, we show that the improvements provided by the asynchronous methods over the synchronous ones are even more evident in a distributed-computing setting with machines of different computational powers. Additionally, we show that the asynchronous method is further enhanced by implementing a dynamic-task-allocation strategy.

**Author Contributions:** Conceptualization, B.C., E.C.F., and W.d.O.; methodology, B.C., E.C.F., and W.d.O.; software, B.C.; validation, B.C.; formal analysis, E.C.F. and W.d.O.; investigation, B.C., E.C.F., and W.d.O.; resources, B.C. and E.C.F.; data curation, B.C.; writing—original draft preparation, B.C., E.C.F., and W.d.O.; writing—review and editing, B.C., E.C.F., and W.d.O.; visualization, B.C., E.C.F., and W.d.O.; supervision, E.C.F. and W.d.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** The third author acknowledges financial support from the Gaspard-Monge program for Optimization and Operations Research (PGMO) project "Models for planning energy investment under uncertainty".

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **On a Nonsmooth Gauss–Newton Algorithms for Solving Nonlinear Complementarity Problems**

#### **Marek J. Śmietański**

Faculty of Mathematics and Computer Science, University of Lodz, Banacha 22, 90-238 Łódź, Poland; marek.smietanski@wmii.uni.lodz.pl

Received: 25 June 2020; Accepted: 31 July 2020; Published: 4 August 2020

**Abstract:** In this paper, we propose a new version of the generalized damped Gauss–Newton method for solving nonlinear complementarity problems based on the transformation to the nonsmooth equation, which is equivalent to some unconstrained optimization problem. The B-differential plays the role of the derivative. We present two types of algorithms (usual and inexact), which have superlinear and global convergence for semismooth cases. These results can be applied to efficiently find all solutions of the nonlinear complementarity problems under some mild assumptions. The results of the numerical tests are attached as a complement of the theoretical considerations.

**Keywords:** Gauss–Newton method; nonsmooth equations; nonsmooth optimization; nonlinear complementarity problem; B-differential; superlinear convergence; global convergence

#### **1. Introduction**

Let *F* : *R<sup>n</sup>* → *R<sup>n</sup>* and let *F<sub>i</sub>*, *i* = 1, ..., *n*, denote the components of *F*. The nonlinear complementarity problem (NCP) is to find *x* ∈ *R<sup>n</sup>* such that

$$\mathbf{x} \ge \mathbf{0}, F(\mathbf{x}) \ge \mathbf{0} \text{ and } \mathbf{x}^T F(\mathbf{x}) = \mathbf{0}. \tag{1}$$

The *i*th component of a vector *x* is represented by *x<sub>i</sub>*. Solving (1) is equivalent to solving a nonlinear equation *G*(*x*) = 0, where the operator *G* : *R<sup>n</sup>* → *R<sup>n</sup>* is defined by

$$G(x) = \begin{bmatrix} \varphi(x\_1, F\_1(x)) \\ \vdots \\ \varphi(x\_n, F\_n(x)) \end{bmatrix}$$

with some special function *ϕ*. Function *ϕ* may have one of the following forms:

$$\begin{aligned} \varphi\_1(a,b) &= \min\{a,b\};\\ \varphi\_2(a,b) &= \sqrt{a^2+b^2} - a - b;\\ \varphi\_3(a,b) &= \theta(|a-b|) - \theta(a) - \theta(b), \end{aligned}$$

where *θ* : *R* → *R* is any strictly increasing function with *θ*(0) = 0, see [1].
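As a quick numerical illustration, the three functions can be evaluated at a few pairs (*a*, *b*). The sketch below is not part of the original experiments: the names `phi1`, `phi2`, `phi3`, `theta` and `is_complementary` are ours, and for *θ* we assume the particular choice *θ*(*t*) = *t*|*t*| that is adopted in Section 2.2. Each function is expected to vanish exactly when *a* ≥ 0, *b* ≥ 0 and *ab* = 0.

```python
import math

def theta(t):
    # Assumed choice theta(t) = t|t|: strictly increasing with theta(0) = 0.
    return t * abs(t)

def phi1(a, b):
    return min(a, b)

def phi2(a, b):
    return math.sqrt(a * a + b * b) - a - b

def phi3(a, b):
    return theta(abs(a - b)) - theta(a) - theta(b)

def is_complementary(a, b):
    # Scalar version of problem (1): a >= 0, b >= 0 and a*b = 0.
    return a >= 0 and b >= 0 and a * b == 0

# Hand-picked sample pairs: the first three are complementary, the rest are not.
pairs = [(0.0, 2.0), (3.0, 0.0), (0.0, 0.0), (1.0, 1.0), (-1.0, 0.0), (2.0, -3.0)]
for a, b in pairs:
    values = (phi1(a, b), phi2(a, b), phi3(a, b))
    print((a, b), is_complementary(a, b), values)
```

On these sample pairs, all three functions are zero for the complementary pairs and nonzero otherwise.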

The (NCP) problem is one of the fundamental problems of mathematical programming, operations research, economic equilibrium models, and engineering sciences. Many interesting and important applications can be found in the papers of Harker and Pang [2] and Ferris and Pang [3]. We can find the most essential applications in:


We borrow a technique used in solving some smooth problems. If *g* is a merit function of *G*, i.e., *g*(*x*) = ½ *G*(*x*)<sup>*T*</sup>*G*(*x*), then any stationary point of *g*(*x*) is a least-squares solution of the equation *G*(*x*) = 0. Then, algorithms for minimization are equivalent to algorithms for solving equations. The usual Gauss–Newton method (known also as the differential corrections method), presented by Ortega and Rheinboldt [4] in the smooth case, has the form

$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \left[ G'(\mathbf{x}^{(k)})^T G'(\mathbf{x}^{(k)}) \right]^{-1} G'(\mathbf{x}^{(k)})^T G(\mathbf{x}^{(k)}).\tag{2}$$

Local convergence properties of the Gauss–Newton method were discussed by Chen and Li [5], but only for some smooth cases. The Levenberg–Marquardt method, which is a modified Gauss–Newton method, is also considered in some papers, e.g., [6] or [7]. Moreover, a comparison of semismooth algorithms for solving (NCP) problems has been made in [8].

In practice, we may also consider the damped Gauss–Newton method

$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \omega\_k \left[ G'(\mathbf{x}^{(k)})^T G'(\mathbf{x}^{(k)}) + \lambda\_k I \right]^{-1} G'(\mathbf{x}^{(k)})^T G(\mathbf{x}^{(k)}) \tag{3}$$

with parameters *ω<sub>k</sub>* and *λ<sub>k</sub>*. The parameter *ω<sub>k</sub>* may be chosen to ensure a suitable decrease of *g*. If *λ<sub>k</sub>* is positive for all *k*, then the inverse matrix in (3) always exists because *G*′(*x*<sup>(*k*)</sup>)<sup>*T*</sup>*G*′(*x*<sup>(*k*)</sup>) is a symmetric and positive semidefinite matrix. Method (3) has an important advantage: the search direction always exists, even if *G*′(*x*) is singular. Naturally, in the case of nonsmooth equations, some additional assumptions are needed to allow the use of line search strategies and to ensure global convergence. In some cases, the function *G* is nondifferentiable, so the equation *G*(*x*) = 0 is nonsmooth and method (3) may be useless. A version of the Gauss–Newton method for solving complementarity problems was also introduced by Xiu and Zhang [9] for generalized problems, but only for linear ones. Thus, for solving nonsmooth and nonlinear problems, we propose two new versions of a damped Gauss–Newton algorithm based on the B-differential. The usual generalized method is a relevant extension of the work by Subramanian and Xiu [10] to the nonsmooth case. In turn, the inexact version is related to the traditional approach, which was widely studied, e.g., in [11]. In recent years, various versions of the Gauss–Newton method were discussed, although most frequently for solving nonlinear least-squares problems, e.g., in [12,13].

The paper is organized as follows: in the next section, we review some necessary notions, such as the B-differential, BD-regularity, semismoothness, etc. (Section 2.1). Next, we propose new optimization-based methods for the NCP, transforming the NCP into an unconstrained minimization problem by employing the function *ϕ*<sub>3</sub> (Section 2.2). We state their global convergence and superlinear convergence rate under appropriate conditions. In Section 3, we present the results of numerical tests.

#### **2. Materials and Methods**

#### *2.1. Preliminaries*

If *F* is Lipschitz continuous, Rademacher's theorem [14] implies that *F* is differentiable almost everywhere. Let the set of points where *F* is differentiable be denoted by *D<sub>F</sub>*. Then, the B-differential (the Bouligand differential) of *F* at *x* (introduced in [15]) is

$$\partial\_B F(x) = \left\{ \lim\_{x^{(n)} \to x} F'\left(x^{(n)}\right) : x^{(n)} \in D\_F \right\},$$

where *F*′(*x*) denotes the usual Jacobian of *F* at *x*. The generalized Jacobian of *F* at *x* in the sense of Clarke [14] is

$$
\partial F(x) = \operatorname{conv} \partial\_B F(x).
$$

We say that *F* is BD-regular at *x* if *F* is locally Lipschitz at *x* and if all *V* ∈ *∂<sub>B</sub>F*(*x*) are nonsingular (regularity on account of the B-differential). Qi proved (Lemma 2.6, [15]) that, if *F* is BD-regular at *x*, then a neighborhood *N* of *x* and a constant *C* > 0 exist such that, for any *y* ∈ *N* and *V* ∈ *∂<sub>B</sub>F*(*y*), *V* is nonsingular and

$$\left\|V^{-1}\right\| \leq C.$$

Throughout this paper, ‖·‖ denotes the 2-norm.
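To make these notions concrete, consider the scalar function *F*(*x*) = |*x*|. This is a standard textbook illustration added here as a sketch, not an example taken from the numerical part of the paper.

```latex
D_F = \mathbb{R} \setminus \{0\}, \qquad
F'(x) = \operatorname{sign}(x) \quad (x \neq 0), \\
\partial_B F(0) = \{-1,\; 1\}, \qquad
\partial F(0) = \operatorname{conv}\{-1,\; 1\} = [-1,\; 1].
```

Both elements of the B-differential at 0 are nonsingular, so this *F* is BD-regular at 0 (with *C* = 1), although Clarke's generalized Jacobian at 0 contains the singular element 0.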

The notion of semismoothness was originally introduced for functionals by Mifflin [16]. The following definition is taken from Qi and Sun [17]. A function *F* is semismooth at a point *x*, if *F* is locally Lipschitzian at *x* and

$$\lim\_{\mathbf{V}\in\partial F(\mathbf{x}+\mathbf{t}h'), h'\to h,\mathbf{t}\downarrow\mathbf{0}} \mathbf{V}h'$$

exists for any *h* ∈ *R<sup>n</sup>*. *F* is also said to be semismooth at *x* if it is directionally differentiable at *x* and, for any *V* ∈ *∂F*(*x* + *h*) as *h* → 0,

$$Vh - F'(x; h) = o\left(\|h\|\right).$$

Scalar products and sums of semismooth functions are still semismooth functions. Piecewise smooth functions and the maximum of a finite number of smooth functions are also semismooth. Semismoothness is the most commonly seen assumption on *F* in papers dealing with nonsmooth equations because it implies some important properties for the convergence analysis of methods in nonsmooth optimization.

If, for any *V* ∈ *∂F*(*x* + *h*), as *h* → 0,

$$Vh - F'(x; h) = O\left(\left\|h\right\|^{1+p}\right),$$

where 0 < p ≤ 1, then we say *F* is p-order semismooth at *x*. Clearly, p-order semismoothness implies semismoothness. If p = 1, then the function *F* is called strongly semismooth. Piecewise *C*<sup>2</sup> functions are examples of strongly semismooth functions.

Qi and Sun [17] remarked that, if *F* is semismooth at *x*, then, as *h* → 0,

$$F(x + h) - F(x) - F'(x; h) = o\left(\|h\|\right),$$

and, if *F* is p-order semismooth at *x*, then, as *h* → 0,

$$F(\mathbf{x} + \mathbf{h}) - F(\mathbf{x}) - F'(\mathbf{x}; \mathbf{h}) = O\left(\left\|\mathbf{h}\right\|^{1+\mathbf{p}}\right).$$

**Remark 1.** *Strong semismoothness of the appropriate function usually implies quadratic convergence of the method instead of the superlinear convergence obtained for a semismooth function.*

In turn, Pang and Qi [18] proved that semismoothness of *F* at *x* implies that

$$\sup\_{V \in \partial F(x+h)} \left\| F(x+h) - F(x) - Vh \right\| = o\left( \|h\| \right).$$

Moreover, if *F* is p-order semismooth at *x*, then

$$\sup\_{V \in \partial F(x + h)} \left\| F(x + h) - F(x) - Vh \right\| = O\left( \left\| h \right\|^{1 + p} \right).$$

#### *2.2. The Algorithm and Its Convergence*

Consider the nonlinear equation *G*(*x*) = 0 defined by *ϕ*<sub>3</sub>. The equivalence of solving this equation and problem (NCP) is described by the following theorem:

**Theorem 1** (Mangasarian [1])**.** *Let θ be any strictly increasing function from R into R, that is, a > b ⇔ θ(a) > θ(b), and let θ(0) = 0. Then, x solves the complementarity problem (1) if and only if*

$$
\theta(|F\_i(\mathbf{x}) - \mathbf{x}\_i|) - \theta(F\_i(\mathbf{x})) - \theta(\mathbf{x}\_i) = 0, \; i = 1, 2, \dots, n. \tag{4}
$$

For the convenience, denote

$$G\_i(x) := \theta(|F\_i(x) - x\_i|) - \theta(F\_i(x)) - \theta(x\_i) \tag{5}$$

for *i* = 1, 2, ..., *n*.

We assume that the function *θ* in Theorem 1 has the form

$$
\theta(\xi) = \xi \, |\xi| .
$$

Let *G*(*x*) be the associated function. We define function *g* in the following way:

$$g(x) = \frac{1}{2} \left\| G(x) \right\|^2,$$

which allows for solving the system *G*(*x*) = 0 by solving the nonlinear least-squares problem

$$\min\_{\mathbf{x}} g(\mathbf{x}).\tag{6}$$

Let us note that *x*<sup>∗</sup> solves *G*(*x*) = 0 if and only if it is a stationary point of *g*. Thus, from Theorem 1, *x*<sup>∗</sup> solves (1).
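With this choice of *θ*, the components (5) take a simple closed form. The short derivation below is ours and is only a sketch: it assumes that *F* is continuously differentiable and uses nothing more than *θ*(|*u*|) = |*u*|·|*u*| = *u*<sup>2</sup> and *θ*′(*ξ*) = 2|*ξ*|; the symbol *e<sub>i</sub>* denotes the *i*th unit vector.

```latex
G_i(x) = \bigl(F_i(x) - x_i\bigr)^2 - F_i(x)\,|F_i(x)| - x_i\,|x_i|, \\
\nabla G_i(x) = 2\bigl(F_i(x) - x_i\bigr)\bigl(\nabla F_i(x) - e_i\bigr)
              - 2\,|F_i(x)|\,\nabla F_i(x) - 2\,|x_i|\,e_i .
```

Under this assumption each *G<sub>i</sub>* is continuously differentiable, so the B-differential of *G* at *x* contains only the Jacobian whose rows are the transposed gradients above.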

**Remark 2.** *On the other hand, the first-order optimality conditions for problem (6) are equivalent to the nonlinear system*

$$\nabla \mathcal{g}(\mathbf{x}) = G'(\mathbf{x})^T G(\mathbf{x}) = 0,$$

*where* ∇*g is the gradient of g, provided G is differentiable and G*′ *is the Jacobian matrix of G.*

The continuous differentiability of the merit function *g* for some kind of nonsmooth functions was established by Ulbrich in the following lemma:

**Lemma 1** (Ulbrich, [19])**.** *Assume that the function G : R<sup>n</sup> ⊃ D → R<sup>n</sup> is semismooth or, more strongly, p-order semismooth, 0 < p ≤ 1. Then the merit function ½‖G(x)‖<sup>2</sup> is continuously differentiable on D with gradient ∇g(x) = V<sup>T</sup>G(x), where V ∈ ∂G(x) is arbitrary.*

**Lemma 2.** *For any x ∈ R<sup>n</sup>, let A<sub>x</sub> = V<sub>x</sub><sup>T</sup>V<sub>x</sub>, where V<sub>x</sub> ∈ ∂<sub>B</sub>G(x). Suppose that ∇g(x) ≠ 0. Then, given λ > 0, the direction d given by*

$$(A\_x + \lambda I)d = \nabla g(x)$$

*is an ascent direction for g. In particular, there is a positive w such that g(x − wd) < g(x).*

**Proof.** There exist constants *β* ≥ 0 and *γ* > 0 such that

$$\beta \left\| h \right\|^2 \leq h^T A\_x h \leq \gamma \left\| h \right\|^2 \text{ for all } h \in \mathbb{R}^n,$$

because *A<sub>x</sub>*, defined as *V<sub>x</sub><sup>T</sup>V<sub>x</sub>*, is symmetric and positive semidefinite.

It follows that

$$(\beta + \lambda) \left\| h \right\|^2 \le h^T (A\_x + \lambda I) h \le (\gamma + \lambda) \left\| h \right\|^2 \text{ for all } h \in \mathbb{R}^n.$$

Since ∇*g*(*x*) ≠ 0, we have *d* ≠ 0. If we take *h* = *d* and use (*A<sub>x</sub>* + *λI*)*d* = ∇*g*(*x*), we obtain

$$d^T \nabla g(x) \ge (\beta + \lambda) \left\| d \right\|^2 > 0.$$

It follows that ∇*g*(*x*)<sup>*T*</sup>*d* > 0 and that *d* is an ascent direction for *g* (Section 8.2.1 in [4]).

Now, we present the generalized version of the damped Gauss–Newton method for solving the nonlinear complementarity problem.

#### **Algorithm 1:** The damped Gauss-Newton method for solving NCP

Let *β*, *δ* ∈ (0, 1) be given. Let *x*<sup>(0)</sup> be a starting point. Given *x*<sup>(*k*)</sup>, the steps for obtaining *x*<sup>(*k*+1)</sup> are:

Step 1: If ∇*g*(*x*<sup>(*k*)</sup>) = 0, then stop. Otherwise, choose any matrix *V<sub>k</sub>* ∈ *∂<sub>B</sub>G*(*x*<sup>(*k*)</sup>) and let *A<sub>k</sub>* = *V<sub>k</sub><sup>T</sup>V<sub>k</sub>*.

Step 2: Let *λ<sub>k</sub>* = *g*(*x*<sup>(*k*)</sup>).

Step 3: Find *d*<sup>(*k*)</sup> that is a solution of the linear system

$$(A\_k + \lambda\_k I)d^{(k)} = \nabla g(x^{(k)}).$$

Step 4: Compute the smallest nonnegative integer *m<sub>k</sub>* such that, for *m* = *m<sub>k</sub>*,

$$g(x^{(k)} - \beta^m d^{(k)}) - g(x^{(k)}) \le -\delta \beta^m \nabla g(x^{(k)})^T d^{(k)}$$

and set

$$x^{(k+1)} = x^{(k)} - \beta^{m\_k} d^{(k)}.$$

**Remark 3.** *(i) In Step 2, letting λ<sub>k</sub> = g(x<sup>(k)</sup>) is one of the simplest strategies because then {λ<sub>k</sub>} converges to 0. (ii) The line search step (Step 4) in the algorithm follows the Armijo rule.*
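The steps of Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names (`theta_residual`, `solve`, `damped_gauss_newton`) and the default parameter values are ours, *F* is assumed to be continuously differentiable with a user-supplied Jacobian `JF`, and *θ*(*ξ*) = *ξ*|*ξ*| so that *G* has an ordinary Jacobian. The sketch follows the sign convention of Lemma 2: *d* solves (*A* + *λI*)*d* = ∇*g*, hence it is an ascent direction and the step is taken along −*d*. The test function *F*(*x*) = (*x*<sub>1</sub> − 1, *x*<sub>2</sub> + 1), whose NCP solution is (1, 0), is a hypothetical toy problem.

```python
import math

def theta_residual(F, JF, x):
    """Return G(x) from (5) with theta(t) = t|t| and the Jacobian V of G at x."""
    n = len(x)
    Fx, J = F(x), JF(x)
    # theta(|u|) = u^2, so G_i = (F_i - x_i)^2 - F_i|F_i| - x_i|x_i|.
    G = [(Fx[i] - x[i]) ** 2 - Fx[i] * abs(Fx[i]) - x[i] * abs(x[i]) for i in range(n)]
    V = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a = 2.0 * (Fx[i] - x[i]) - 2.0 * abs(Fx[i])  # factor of row i of F'(x)
        b = 2.0 * (Fx[i] - x[i]) + 2.0 * abs(x[i])   # factor of the unit vector e_i
        for j in range(n):
            V[i][j] = a * J[i][j] - (b if i == j else 0.0)
    return G, V

def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= f * A[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (A[r][n] - sum(A[r][k] * y[k] for k in range(r + 1, n))) / A[r][r]
    return y

def damped_gauss_newton(F, JF, x0, beta=0.5, delta=0.1, tol=1e-7, max_iter=100):
    x, n = list(x0), len(x0)
    for k in range(max_iter):
        G, V = theta_residual(F, JF, x)
        g = 0.5 * sum(v * v for v in G)
        grad = [sum(V[i][j] * G[i] for i in range(n)) for j in range(n)]  # V^T G
        if all(c == 0.0 for c in grad):           # Step 1: stationary point
            return x, k
        # Steps 2-3: (A_k + lambda_k I) d = grad with A_k = V^T V, lambda_k = g(x_k)
        M = [[sum(V[r][i] * V[r][j] for r in range(n)) + (g if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        d = solve(M, grad)
        slope = sum(gi * di for gi, di in zip(grad, d))  # positive by Lemma 2
        for m in range(60):                       # Step 4: Armijo backtracking along -d
            step = beta ** m
            x_new = [xi - step * di for xi, di in zip(x, d)]
            G_new, _V = theta_residual(F, JF, x_new)
            if 0.5 * sum(v * v for v in G_new) - g <= -delta * step * slope:
                break
        moved = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_new, x)))
        x = x_new
        if moved < tol:                           # stopping rule on the step length
            return x, k + 1
    return x, max_iter

# Hypothetical toy problem: F(x) = (x1 - 1, x2 + 1), NCP solution (1, 0).
F = lambda x: [x[0] - 1.0, x[1] + 1.0]
JF = lambda x: [[1.0, 0.0], [0.0, 1.0]]
x, iters = damped_gauss_newton(F, JF, [0.5, -0.5])
print(x, iters)
```

The dense elimination routine keeps the sketch free of external dependencies; for larger problems a library linear solver would be the natural replacement.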

**Theorem 2.** *Let x<sup>(0)</sup> be a starting point and {x<sup>(k)</sup>} be a sequence generated by Algorithm 1. Assume that: (a) sup<sub>k</sub> ‖V<sub>k</sub>‖ < ∞ for all V<sub>k</sub> ∈ ∂<sub>B</sub>G(x<sup>(k)</sup>); (b) ∇g(x) is Lipschitzian with a constant L<sub>g</sub> > 0 on the level set L = {x : g(x) ≤ g(x<sup>(0)</sup>)}.*

*Then, the generalized damped Gauss–Newton method described by Algorithm 1 is well defined and either {x<sup>(k)</sup>} terminates at a stationary point of g, or else every accumulation point of {x<sup>(k)</sup>}, if it exists, is a stationary point of g.*

**Proof.** The proof is almost the same as that of Theorem 2.1 in [10], with appropriately modified assumptions.

For the nonsmooth case, an alternative condition may be considered instead of the Lipschitz continuity of ∇*g*(*x*) (similarly to [10]). Thus, we have the following convergence theorem:

**Theorem 3.** *Let x<sup>(0)</sup> be a starting point and {x<sup>(k)</sup>} be a sequence generated by Algorithm 1. Assume that: (a) the level set L = {x : g(x) ≤ g(x<sup>(0)</sup>)} is bounded; (b) G is semismooth on L.*

*Then, the generalized damped Gauss–Newton method described by Algorithm 1 is well defined and either {x<sup>(k)</sup>} terminates at a stationary point of g, or else every accumulation point of {x<sup>(k)</sup>}, if it exists, is a stationary point of g.*

Now, we take up the rate of convergence of the considered algorithm. The following theorem shows suitable conditions in various cases.

**Theorem 4.** *Suppose that x<sup>∗</sup> is a solution of problem (1), G is semismooth, and G is BD-regular at x<sup>∗</sup>. Then, there exists a neighborhood N<sup>∗</sup> of x<sup>∗</sup> such that, if x<sup>(0)</sup> ∈ N<sup>∗</sup> and the sequence {x<sup>(k)</sup>} is generated by Algorithm 1, we have:*

*(i) x<sup>(k)</sup> ∈ N<sup>∗</sup> for all k and the sequence {x<sup>(k)</sup>} is linearly convergent to x<sup>∗</sup>; (ii) if δ < 0.5, then the convergence is at least superlinear; (iii) if G is strongly semismooth, then the convergence is quadratic.*

**Proof.** The proof of a similar theorem given by Subramanian and Xiu [10] is based on three lemmas, which have the same assumptions as the theorem. Now, we present these lemmas in versions adapted to our nonsmooth case.

**Lemma 3.** *Assume that d<sub>x</sub> is a solution of the equation*

$$(A\_x + \lambda\_x I) d\_x = \nabla g(x),$$

*where*

$$
\lambda\_x = g(x) \text{ and } A\_x = V\_x^T V\_x
$$

*for some matrix V<sub>x</sub> taken from ∂<sub>B</sub>G(x). Then, there is a neighborhood D<sub>1</sub> of x<sup>∗</sup> such that, for all x ∈ D<sub>1</sub>,*

$$\left\| x - d\_x - x^\* \right\| = o\left( \left\| x - x^\* \right\| \right).$$

**Lemma 4.** *There is a neighborhood D<sub>2</sub> of x<sup>∗</sup> such that, for all x ∈ D<sub>2</sub>, (a) g(x) = ½ (x − x<sup>∗</sup>)<sup>T</sup>A<sub>∗</sub>(x − x<sup>∗</sup>) + o(‖x − x<sup>∗</sup>‖<sup>2</sup>), (b) g(x) = ½ (x − x<sup>∗</sup>)<sup>T</sup>A<sub>x</sub>(x − x<sup>∗</sup>) + o(‖x − x<sup>∗</sup>‖<sup>2</sup>).*

**Lemma 5.** *Suppose that the conditions of Lemma 1 hold. Then, there is a neighborhood D<sub>3</sub> of x<sup>∗</sup> such that, for all x ∈ D<sub>3</sub>,*

$$g(x - d\_x) - g(x) + \frac{1}{2} \nabla g(x)^T d\_x \le o\left( \left\| x - x^\* \right\|^2 \right).$$

The proofs of Lemmas 4 and 5 are almost the same as in [10]; however, in the proof of Lemma 3, we have to take into account the semismoothness and use Lemma 1 to obtain the desired result.

At the same time, in a similar way, we may show a suitable rate of convergence.

Now, we consider the inexact version of the considered method, which computes an approximate step, using the nonnegative sequence of forcing terms to control the level of accuracy.

For the inexact version of the algorithm (Algorithm 2), we can state analogous theorems, which are counterparts of Theorems 2–4. Based on our previous results, the proofs can be carried out in almost the same way as those of the theorems for the 'exact' version of the method. However, condition (7), implied by the inexactness in Step 3 of Algorithm 2, has to be taken into account. Thus, we omit both the theorems and their proofs here.

#### **Algorithm 2:** The inexact version of the damped Gauss-Newton method for solving NCP

Let *β*, *δ*, *θ* ∈ (0, 1) and *η<sub>k</sub>* ∈ [0, 1) for all *k* be given. Let *x*<sup>(0)</sup> ∈ *R<sup>n</sup>* be a starting point. Given *x*<sup>(*k*)</sup>, the steps for obtaining *x*<sup>(*k*+1)</sup> are:

Step 1: If ∇*g*(*x*<sup>(*k*)</sup>) = 0, then stop. Otherwise, choose any matrix *V<sub>k</sub>* ∈ *∂<sub>B</sub>G*(*x*<sup>(*k*)</sup>) and let *A<sub>k</sub>* = *V<sub>k</sub><sup>T</sup>V<sub>k</sub>*.

Step 2: Let *λ<sub>k</sub>* = *g*(*x*<sup>(*k*)</sup>).

Step 3: Find *d*<sup>(*k*)</sup> that satisfies

$$\left\|(A\_k + \lambda\_k I)d^{(k)} + \nabla g(x^{(k)})\right\| \leq \eta\_k \left\|\nabla g(x^{(k)})\right\|.\tag{7}$$

Step 4: If

$$\left\| \nabla g(x^{(k)} + d^{(k)}) \right\| \leq \theta \left\| \nabla g(x^{(k)}) \right\|,$$

then let

$$x^{(k+1)} = x^{(k)} + d^{(k)}$$

and go to Step 1.

Step 5: Compute the smallest nonnegative integer *m<sub>k</sub>* such that, for *m* = *m<sub>k</sub>*,

$$g(x^{(k)} + \beta^m d^{(k)}) - g(x^{(k)}) \le \delta \beta^m \nabla g(x^{(k)})^T d^{(k)}$$

and set

$$x^{(k+1)} = x^{(k)} + \beta^{m\_k} d^{(k)}$$

and go to Step 1.
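Condition (7) only asks for an approximate solution of the linear system (*A<sub>k</sub>* + *λ<sub>k</sub>I*)*d* = −∇*g*(*x*<sup>(*k*)</sup>). The text does not say which inner solver was used; one common choice, assumed here purely for illustration, is the conjugate gradient method stopped as soon as the residual test in (7) holds. The function names and the 2×2 data below are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def inexact_direction(M, grad, eta, max_iter=50):
    """Approximately solve M d = -grad by CG until ||M d + grad|| <= eta * ||grad||."""
    d = [0.0] * len(grad)
    r = [-gi for gi in grad]          # residual -grad - M d at the start d = 0
    p = r[:]
    rr = dot(r, r)
    target = eta * math.sqrt(dot(grad, grad))
    for _ in range(max_iter):
        if math.sqrt(rr) <= target:   # residual test of condition (7)
            break
        Mp = matvec(M, p)
        alpha = rr / dot(p, Mp)
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, Mp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return d

M = [[4.0, 1.0], [1.0, 3.0]]   # stands in for A_k + lambda_k I (symmetric positive definite)
grad = [1.0, 2.0]              # stands in for the gradient of g at x_k
for eta in (0.3, 0.1):
    d = inexact_direction(M, grad, eta)
    res = math.sqrt(sum((m + g) ** 2 for m, g in zip(matvec(M, d), grad)))
    print(eta, d, res)
```

A larger forcing term lets the inner loop stop earlier with a cruder direction, which is the trade-off controlled by the sequence {*η<sub>k</sub>*}.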

#### **3. Numerical Results**

In this section, we present the results of our numerical experiments, obtained by coding both algorithms in Code::Blocks. We use double precision on an Intel Core i7 3.2 GHz running under the Windows Server 2016 operating system. We applied the generalized damped Gauss–Newton method to solve three nonlinear complementarity problems. In the following examples, *N*<sub>1</sub> and *N*<sub>2</sub> denote the number of iterations performed to satisfy the stopping criterion ‖*x*<sup>(*k*+1)</sup> − *x*<sup>(*k*)</sup>‖ < 10<sup>−7</sup>, using Algorithms 1 and 2, respectively. The forcing terms in Algorithm 2 were chosen as follows: *η<sub>k</sub>* = (10*k*)<sup>−1</sup> for all *k*.

**Example 1** (from Kojima and Shindo [20])**.** *Let the function F : R<sup>4</sup> → R<sup>4</sup> have the form*

$$\begin{array}{rclrcl}F^1(\mathbf{x})&=&3x\_1^2+2x\_1x\_2+2x\_2^2+x\_3+3x\_4-6,\\F^2(\mathbf{x})&=&2x\_1^2+x\_1+x\_2^2+10x\_3+2x\_4-2,\\F^3(\mathbf{x})&=&3x\_1^2+x\_1x\_2+2x\_2^2+2x\_3+9x\_4-9,\\F^4(\mathbf{x})&=&x\_1^2+3x\_2^2+2x\_3+3x\_4-3.\end{array}$$

*Problem (NCP) with the above function F has two solutions:*

$$\mathbf{x}^\* = (1,0,3,0)^T \text{ and } \mathbf{x}^{\*\*} = (\sqrt{6}/2,0,0,0.5)^T$$

*for which*

$$F(\mathbf{x}^\*) = (0, 31, 0, 4)^T \text{ and } F(\mathbf{x}^{\*\*}) = \left(0, 2 + \frac{\sqrt{6}}{2}, 0, 0\right)^T.$$

*Thus, x*<sup>∗</sup> *is a non-degenerate solution of (NCP) because*

$$\mathcal{L} := \left\{ i : x\_i^\* = 0, \ F^i(x^\*) = 0 \right\} = \emptyset,$$

*but x*∗∗ *is a degenerate solution.*

*Depending on the starting point, the iteration process converged to one or the other of the two solutions (see Table 1 or Figure 1).*
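The two stated solutions of Example 1 can be checked by direct evaluation. The following sketch is ours (the helper name `ncp_check` and the tolerance are assumptions); it evaluates *F*, tests the three conditions in (1) and lists the degenerate indices, i.e., those *i* with *x<sub>i</sub>* = 0 and *F<sup>i</sup>*(*x*) = 0.

```python
import math

def F(x):
    x1, x2, x3, x4 = x
    return [
        3 * x1**2 + 2 * x1 * x2 + 2 * x2**2 + x3 + 3 * x4 - 6,
        2 * x1**2 + x1 + x2**2 + 10 * x3 + 2 * x4 - 2,
        3 * x1**2 + x1 * x2 + 2 * x2**2 + 2 * x3 + 9 * x4 - 9,
        x1**2 + 3 * x2**2 + 2 * x3 + 3 * x4 - 3,
    ]

def ncp_check(x, tol=1e-12):
    """Return (solves NCP up to tol, list of degenerate indices, 1-based)."""
    Fx = F(x)
    feasible = all(xi >= -tol for xi in x) and all(fi >= -tol for fi in Fx)
    complementary = abs(sum(xi * fi for xi, fi in zip(x, Fx))) <= tol
    degenerate = [i + 1 for i, (xi, fi) in enumerate(zip(x, Fx))
                  if abs(xi) <= tol and abs(fi) <= tol]
    return feasible and complementary, degenerate

x_star = [1.0, 0.0, 3.0, 0.0]
x_2star = [math.sqrt(6.0) / 2.0, 0.0, 0.0, 0.5]
print(F(x_star), ncp_check(x_star))
print(F(x_2star), ncp_check(x_2star))
```

For the second point the only degenerate index is *i* = 3, which agrees with the classification given in the example.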

**Table 1.** Results for Example 1.


**Figure 1.** Number of iterations for various starting points (for Example 1).

**Example 2.** *Let the function F : R<sup>2</sup> → R<sup>2</sup> be defined as follows:*

$$F(\mathbf{x}) = \begin{bmatrix} 2\mathbf{x}\_1 + \mathbf{x}\_2^2 - 6 \\ -\mathbf{x}\_1^2 + 4\mathbf{x}\_1 + \frac{1}{2}\mathbf{x}\_2 - 3 \end{bmatrix}.$$

*Then, problem (NCP) has two solutions:*

*- non-degenerate*

$$\mathbf{x}^\* = (0,6)^T \text{ for which } F(\mathbf{x}^\*) = (30,0)^T$$

*- degenerate*

$$\mathbf{x}^{\*\*} = (3, 0)^T \text{ for which } F(\mathbf{x}^{\*\*}) = (0, 0)^T.$$

*As in Example 1, the iteration process converged to one or the other of the two solutions, depending on the starting point (see Table 2 or Figure 2).*

**Table 2.** Results for Example 2.


**Figure 2.** Number of iterations for various starting points (for Example 2).

**Example 3** (from Jiang and Qi [21])**.** *Let the function F : R<sup>n</sup> → R<sup>n</sup> have the form F(x) = Mx + q, where*

$$\mathbf{M} = \begin{bmatrix} 4 & -1 & 0 & \dots & 0 & 0 \\ -1 & 4 & -1 & \dots & 0 & 0 \\ 0 & -1 & 4 & \dots & 0 & 0 \\ \dots & \dots & \dots & \dots & \dots & \dots \\ 0 & 0 & 0 & \dots & 4 & -1 \\ 0 & 0 & 0 & \dots & -1 & 4 \end{bmatrix}, \ q = (-1, \dots, -1)^T.$$

*Because F is strictly monotone, the corresponding problem (NCP) has exactly one solution.*

*Calculations were made for various n with one starting point x<sup>(0)</sup> = (0, ..., 0)<sup>T</sup>. For all tests, we obtained the same numbers of iterations, N<sub>1</sub> = 3 and N<sub>2</sub> = 4.*
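The monotonicity used in Example 3 can be verified directly. The estimate below is a standard argument added here as a sketch: since *M* is symmetric with diagonal entries 4 and at most two off-diagonal entries −1 per row, Gershgorin's circle theorem places all its eigenvalues in the interval [2, 6].

```latex
\bigl(F(x) - F(y)\bigr)^T (x - y) = (x - y)^T M (x - y)
  \;\ge\; \lambda_{\min}(M)\, \|x - y\|^2 \;\ge\; 2\, \|x - y\|^2
  \qquad \text{for all } x, y \in \mathbb{R}^n .
```

Hence *F* is even strongly monotone, for every dimension *n*.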

#### **4. Conclusions**

We have given the nonsmooth version of the damped generalized Gauss–Newton method presented by Subramanian and Xiu [10]. The generalized Newton algorithms related to the Gauss–Newton method are well-known important tools for solving nonsmooth equations, which arise from various nonlinear problems such as nonlinear complementarity or variational inequality. These algorithms are especially useful when the problem has many variables. We have proved that the sequences generated by the methods are superlinearly convergent under mild assumptions. Clearly, semismoothness and BD-regularity are sufficient to obtain only superlinear convergence of the methods, while strong semismoothness even gives quadratic convergence. However, if the function *G* is not semismooth or BD-regular or the gradient of *g* is not Lipschitzian, the Gauss–Newton methods may be useless.

The performance of both methods was evaluated in terms of the number of iterations required. The analysis of the numerical results seems to indicate that the methods are usually reliable for solving semismooth problems. The results show that the inexact approach can produce a noticeable slowdown in terms of the number of iterations (compare *N*<sub>1</sub> and *N*<sub>2</sub> in Figures 1 and 2). In turn, an important advantage is that the algorithms allow us to find various solutions to the problem (this can be observed in two examples: the first and the second one). However, if there are many solutions of the problem, then the relationship between the starting point and the obtained solution may be unpredictable.

Clearly, traditional numerical algorithms are not the only methods for solving nonlinear complementarity problems, regardless of the degree of nonsmoothness. Apart from the methods presented in the paper and mentioned in the Introduction, some computational intelligence algorithms can be used to solve (NCP) problems, among others, monarch butterfly optimization (see [22,23]), the earthworm optimization algorithm (see [24]), elephant herding optimization (see [25,26]), or the moth search algorithm (see [27,28]). All of these approaches are bio-inspired metaheuristic algorithms.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Polyhedral DC Decomposition and DCA Optimization of Piecewise Linear Functions**

#### **Andreas Griewank and Andrea Walther \***

Institut für Mathematik, Humboldt-Universität zu Berlin, 10099 Berlin, Germany; griewank@math.hu-berlin.de **\*** Correspondence: andrea.walther@math.hu-berlin.de

Received: 28 May 2020; Accepted: 8 July 2020; Published: 11 July 2020

**Abstract:** For piecewise linear functions $f : \mathbb{R}^n \mapsto \mathbb{R}$ we show how their abs-linear representation can be extended to yield simultaneously their decomposition into a convex part $\check{f}$ and a concave part $\hat{f}$, including a pair of generalized gradients $\check{g} \in \mathbb{R}^n \ni \hat{g}$. The latter satisfy strict chain rules and can be computed in the reverse mode of algorithmic differentiation, at a small multiple of the cost of evaluating $f$ itself. It is shown how $\check{f}$ and $\hat{f}$ can be expressed as a single maximum and a single minimum of affine functions, respectively. The two subgradients $\check{g}$ and $-\hat{g}$ are then used to drive DCA algorithms, where the (convex) inner problem can be solved in finitely many steps, e.g., by a Simplex variant or the true steepest descent method. Using a reflection technique to update the gradients of the concave part, one can ensure finite convergence to a local minimizer of $f$, provided the Linear Independence Kink Qualification holds. For piecewise smooth objectives the approach can be used as an inner method for successive piecewise linearization.

**Keywords:** DC function; abs-linearization; DCA

#### **1. Introduction and Notation**

There is a large class of functions $f : \mathbb{R}^n \mapsto \mathbb{R}$ that are called DC because they can be represented as the difference of two convex functions, see for example [1,2]. This property can be exploited in various ways, especially for (hopefully global) optimization. We find it notationally and conceptually more convenient to express these functions as averages of a convex and a concave function such that

$$f(x) = \frac{1}{2}\left(\check{f}(x) + \hat{f}(x)\right) \quad \text{with } \check{f}(x) \text{ convex and } \hat{f}(x) \text{ concave.}$$

Throughout we will annotate the convex part by superscriptqand the concave part by superscriptp, which seems rather intuitive since they remind us of the absolute value function and its negative. Since we are mainly interested in piecewise linear functions we assume without much loss of generality that the functions *f* and the convex and concave components are well defined and finite on all of the Euclidean space R*n*. Allowing both components to be infinite outside their proper domain would obviously generate serious indeterminacies, i.e., NaNs in the numerical sense. As we will see later we can in fact ensure in our setting that pointwise

$$
\widehat{f}(\mathbf{x}) \preccurlyeq f(\mathbf{x}) \preccurlyeq \check{f}(\mathbf{x}) \quad \text{for all} \quad \mathbf{x} \in \mathbb{R}^n\text{ }\tag{1}
$$

which means that we actually obtain an inclusion in the sense of interval mathematics [3]. This is one of the attractions of the averaging notation. We will therefore also refer to p*f* and q*f* as the concave and convex bounds of *f* .

#### *Conditioning of the Decomposition*

In parts of the literature the two convex functions $\check{f}$ and $-\hat{f}$ are assumed to be nonnegative, which has some theoretical advantages. In particular, see, e.g., [4], one obtains for the square $h = f^2$ of a DC function $f$ the decomposition

$$h = \frac{1}{4}(\check{f} + \hat{f})^2 = \frac{1}{2}\Big\{\underbrace{(\check{f}^2 + \hat{f}^2)}_{=\check{h}} + \underbrace{\big[-\tfrac{1}{2}(\check{f} - \hat{f})^2\big]}_{=\hat{h}}\Big\}.\tag{2}$$

The sign conditions on $\check{f}$ and $\hat{f}$ are necessary to ensure that the three squares on the right hand side are convex functions. Using the Apollonius identity $f \cdot h = \frac{1}{2}\left[(f + h)^2 - f^2 - h^2\right]$ one may then deduce in a constructive way that not only sums but also products of DC functions inherit this property. In general, since the convex functions $\check{f}$ and $-\hat{f}$ both have supporting hyperplanes, one can at least theoretically always find positive coefficients $\alpha$ and $\beta$ such that

$$\check{f}(x) + \alpha + \beta \|x\|^2 \;\geq\; 0 \;\geq\; \hat{f}(x) - \alpha - \beta \|x\|^2 \qquad \text{for} \quad x \in \mathbb{R}^n .$$

Then the average of these modified functions is still *f* and their respective convexity/concavity properties are maintained. In fact, this kind of proximal shift can be used to show that any twice Lipschitz continuously differentiable function is DC, which raises the suspicion that the property by itself does not provide all that much exploitable structure from a numerical point of view. We believe that for its use in practical algorithms one has to make sure or simply assume that the condition number

$$\kappa(\check{f}, \hat{f}) = \sup_{x \in \mathbb{R}^n} \frac{|\check{f}(x)| + |\hat{f}(x)|}{|\check{f}(x) + \hat{f}(x)|} \in [1, \infty]$$

is not too large. Otherwise, there is the danger that the value of $f$ is effectively lost in the rounding error of evaluating $\check{f} + \hat{f}$. For sufficiently large quadratic shifts of the nature specified above one has $\kappa \sim \beta$. The danger of an excessive growth in $\kappa$ seems akin to the successive widening in interval calculations and similarly stems from the lack of strict arithmetic rules. For example, doubling $f$ and then subtracting it yields the successive decompositions

$$(2f) - f = (\check{f} + \hat{f}) - \frac{1}{2}(\check{f} + \hat{f}) = (\check{f} - \frac{1}{2}\hat{f}) + (\hat{f} - \frac{1}{2}\check{f}) = \frac{1}{2}[(2\check{f} - \hat{f}) + (2\hat{f} - \check{f})] \,. \tag{3}$$

If in Equation (3) by chance we had originally $-\hat{f} = \frac{1}{2}\check{f} > 0$, so that $f = \frac{1}{2}(\check{f} + \hat{f}) = \frac{1}{4}\check{f}$ with the condition number $\kappa(\check{f}, -0.5\,\check{f}) = 3$, we would get after the doubling and subtraction the condition number $\kappa(2.5\,\check{f}, -2\,\check{f}) = 9$. So it is obviously important that the original algorithm avoids as much as possible calculations that are ill-conditioned in that they even just partly compensate each other.
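The growth of the condition number in this example can be replayed numerically. The following minimal sketch evaluates the ratio inside the supremum at a single point; the function name `pointwise_kappa` and the sample value `4.0` are illustrative assumptions, not part of the paper.

```python
def pointwise_kappa(convex_val, concave_val):
    """Ratio (|f_check| + |f_hat|) / |f_check + f_hat| at one point x.

    Hypothetical helper: the paper defines kappa as the supremum of this
    ratio over all x; here only a single sample point is evaluated.
    """
    return (abs(convex_val) + abs(concave_val)) / abs(convex_val + concave_val)


f_check = 4.0             # any positive sample value works
f_hat = -0.5 * f_check    # the special case -f_hat = f_check / 2 > 0
kappa_before = pointwise_kappa(f_check, f_hat)

# Doubling f gives the pair (2 f_check, 2 f_hat). Subtracting f then uses the
# rule for differences: convex part minus concave part, and vice versa.
after = (2 * f_check - f_hat, 2 * f_hat - f_check)
kappa_after = pointwise_kappa(*after)

print(kappa_before, kappa_after)   # ratios before and after the round trip
```

The average of the two parts, and hence the value of $f$, is unchanged by the round trip; only the two bounds have drifted apart.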

Throughout the paper we assume that the functions in question are evaluated by a computational procedure that generates a sequence of intermediate scalars, which we denote generically by $u$, $v$ and $w$. The last one of these scalar variables is the dependent, which is usually denoted by $f$. All of them are continuous functions $u = u(x)$ of the vector $x \in \mathbb{R}^n$ of independent variables. As customary in mathematics we will often use the same symbol to identify a function and its dependent variable. For the overall objective we will sometimes distinguish them and write $y = f(x)$. For most of the paper we assume that the intermediates are obtained from each other by affine operations or the absolute value function so that the resulting $u(x)$ are all piecewise linear functions.

The paper is organized as follows. In the following Section 2 we develop rules for propagating the convex/concave decomposition through a sequence of abs-linear operations applied to intermediate quantities $u$. This can be done either directly on the pair of bounds $(\check{u}, \hat{u})$ or on their average $u$ and their halved distance $\delta u = \frac{1}{2}(\check{u} - \hat{u})$. In Section 3 we organize such sequences into an abs-linear form for $f$ and then extend it to simultaneously yield the convex/concave decomposition. As a consequence of this analysis we get a strengthened version of the classical max-min representation of piecewise linear functions, which reduces to the difference of two polyhedral parts in max- and min-form. In Section 4 we develop strict rules for propagating certain generalized gradient pairs $(\check{g}, \hat{g})$ of $(\check{u}, \hat{u})$ exploiting convexity and the cheap gradient principle [5]. In Section 5 we discuss the consequences for the DCA when using limiting gradients $(\check{g}, \hat{g})$, solving the inner, linear optimization problem (LOP) exactly, and ensuring optimality via polyhedral reflection. In Section 6 we demonstrate the new results on the nonconvex and piecewise linear chained Rosenbrock version of Nesterov [6]. Section 7 contains a summary and preliminary conclusion with outlook. In Appendix A we give the details of the necessary and sufficient optimality test from [7] in the present DC context.

#### **2. Propagating Bounds and/or Radii**

In Equation (3) we already assumed that doubling is done componentwise and that for a difference $v = w - u$ of DC functions $w$ and $u$, one defines the convex and concave parts by

$$\widecheck{(w-u)} = \check{w} - \hat{u} \qquad \text{and} \qquad \widehat{(w-u)} = \hat{w} - \check{u},$$

respectively. This yields in particular for the negation

$$
\widecheck{(-u)} = -\hat{u} \quad \text{and} \quad \widehat{(-u)} = -\check{u} \,. \tag{4}
$$

For piecewise linear functions we need neither the square formula Equation (2) nor the more general decompositions for products. Therefore we will not insist on the sign conditions even though they would be also maintained automatically by Equation (4) as well as the natural linear rules for the convex and concave parts of the sum and the multiple of a DC function, namely

$$\begin{aligned} \widecheck{(w+u)} &= \check{w} + \check{u} & &\text{and} & \widehat{(w+u)} &= \hat{w} + \hat{u}, \\ \widecheck{(c\,u)} &= c\,\check{u} & &\text{and} & \widehat{(c\,u)} &= c\,\hat{u} & &\text{if } c \geq 0, \\ \widecheck{(c\,u)} &= c\,\hat{u} & &\text{and} & \widehat{(c\,u)} &= c\,\check{u} & &\text{if } c \leq 0. \end{aligned}$$

However, the sign conditions would force one to decompose simple affine functions $u(x) = a^\top x + \beta$ as

$$u(x) = \max(0, a^\top x + \beta) + \min(0, a^\top x + \beta) = \tfrac{1}{2}\left(\check{u}(x) + \hat{u}(x)\right), \tag{5}$$

which does not seem such a good idea from a computational point of view.

The key observation for this paper is that as is well known (see e.g., [8]), one can propagate the absolute value operation according to the identity

$$\begin{array}{rcl} |u| &=& \max(u, -u) = \frac{1}{2}\max(\check{u} + \hat{u}, -\check{u} - \hat{u}) \\ &=& \max(\check{u}, -\hat{u}) + \frac{1}{2}(\hat{u} - \check{u}) \\ \Longrightarrow \quad \widecheck{|u|} &=& 2\max(\check{u}, -\hat{u}) \quad \text{and} \quad \widehat{|u|} = \hat{u} - \check{u} \,. \end{array} \tag{6}$$

Here the equality in the second line can be verified by shifting the difference $\frac{1}{2}(\hat{u} - \check{u})$ into the two arguments of the max. Again we see that when applying the absolute value operation to an already positive convex function $u = \frac{1}{2}\check{u} \geq 0$ we get $\widecheck{|u|} = 2\check{u}$ and $\widehat{|u|} = -\check{u}$ so that the condition number grows from $\kappa(\check{u}, 0) = 1$ to $\kappa(2\check{u}, -\check{u}) = 3$. In other words, we observe once more the danger that both component functions drift apart. This looks a bit like simultaneous growth of numerator and denominator in rational arithmetic, which can sometimes be limited through cancellations by common integer factors. It is currently not clear when and how a similar compactification of a given convex/concave decomposition can be achieved. The corresponding rule for the maximum is similarly easily derived, namely

$$\max(u, w) = \frac{1}{2} \max(\check{u} + \hat{u}, \check{w} + \hat{w}) = \frac{1}{2} \left( \max(\check{u} - \hat{w}, \check{w} - \hat{u}) + (\hat{u} + \hat{w}) \right) .$$

When $u$ and $w$ as well as their decompositions are identical we arrive at the new decomposition $u = \max(u, u) = \frac{1}{2}\left((\check{u} - \hat{u}) + 2\hat{u}\right)$, which obviously represents again some deterioration in the conditioning.
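The pair rules of Equations (4) and (6), together with the sum and maximum rules, can be collected in a small value type. The sketch below is only an illustration under stated assumptions: the class name `DCPair`, the helper `dc_max`, and the test function $f(x) = |x_2 - |x_1||$ are hypothetical choices and do not come from the paper. Each object stores the values of the convex and concave parts at one fixed point $x$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DCPair:
    """Values (check, hat) of the convex and concave part at one point."""
    check: float
    hat: float

    @property
    def value(self):
        # Averaging convention: u = (u_check + u_hat) / 2
        return 0.5 * (self.check + self.hat)

    @staticmethod
    def affine(v):
        # Affine quantities are their own convex and concave part
        return DCPair(v, v)

    def __add__(self, other):
        return DCPair(self.check + other.check, self.hat + other.hat)

    def __neg__(self):
        # Equation (4): negation swaps the roles of the two parts
        return DCPair(-self.hat, -self.check)

    def __sub__(self, other):
        return self + (-other)

    def abs(self):
        # Equation (6)
        return DCPair(2.0 * max(self.check, -self.hat), self.hat - self.check)


def dc_max(u, w):
    """Maximum rule: convex part max(u_check - w_hat, w_check - u_hat)."""
    return DCPair(max(u.check - w.hat, w.check - u.hat), u.hat + w.hat)


# Hypothetical test function f(x) = |x2 - |x1|| at the point x = (1.5, -0.5)
x1, x2 = DCPair.affine(1.5), DCPair.affine(-0.5)
f = (x2 - x1.abs()).abs()
print(f.check, f.value, f.hat)
```

For this test function the propagated parts agree with the closed forms $\check{f}(x) = 2\max(x_2, 2|x_1| - x_2)$ and $\hat{f}(x) = -2|x_1|$, which are convex and concave, respectively.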

While it was pointed out in [4] that the DC functions $u = \frac{1}{2}(\check{u} + \hat{u})$ themselves form an algebra, their decomposition pairs $(\check{u}, \hat{u})$ are not even an additive group, as only the zero $(0, 0)$ has a negative partner, i.e., an additive inverse. Naturally, the pairs $(\check{u}, \hat{u})$ form the Cartesian product between the convex cone of convex functions and its negative, i.e., the cone of concave functions. The DC functions are then the linear envelope of the two cones in some suitable space of locally Lipschitz continuous functions. It is not clear whether this interpretation helps in some way, and in any case we are here mainly concerned with piecewise linear functions.

#### *Propagating the Center and Radius*

Rather than propagating the pairs $(\check{u}, \hat{u})$ through an evaluation procedure as defined in [5] to calculate the function value $f(x)$ at a given point $x$, it might be simpler and better for numerical stability to propagate the pair

$$
u = \frac{1}{2}(\check{u} + \hat{u}) \ \wedge \ \delta u = \frac{1}{2}(\check{u} - \hat{u}) \ \iff \ \check{u} = u + \delta u \ \wedge \ \hat{u} = u - \delta u \,. \tag{7}
$$

This representation resembles the so-called central form in interval arithmetic [9] and we will therefore call $u$ the central value and $\delta u$ the radius. In other words, $u$ is just the normal piecewise affine intermediate function and $\delta u$ is a convex distance function to the hopefully close convex and concave parts. Should the potential blow-up discussed above actually occur, this will only affect $\delta u$ but not the central value $u$ itself. Moreover, at least theoretically one might decide to reduce $\delta u$ from time to time, making sure of course that the corresponding $\check{u}$ and $\hat{u}$ as defined in Equation (7) stay convex and concave, respectively. The condition number now satisfies the bound

$$\begin{aligned} \kappa (\boldsymbol{u} + \delta \boldsymbol{u}, \boldsymbol{u} - \delta \boldsymbol{u}) &= \sup\_{\boldsymbol{x}} \frac{|\boldsymbol{u} + \delta \boldsymbol{u}| + |\boldsymbol{u} - \delta \boldsymbol{u}|}{2|\boldsymbol{u}|} \\ &= \sup\_{\boldsymbol{x}} \frac{1}{2} \left\{ \left| 1 + \frac{\delta \boldsymbol{u}}{\boldsymbol{u}} \right| + \left| 1 - \frac{\delta \boldsymbol{u}}{\boldsymbol{u}} \right| \right\} \leqslant 1 + \sup\_{\boldsymbol{x}} \left| \frac{\delta \boldsymbol{u}}{\boldsymbol{u}} \right|. \end{aligned}$$

Recall here that all intermediate quantities $u = u(x)$ are functions of the independent variable vector $x \in \mathbb{R}^n$. Naturally, we will normally only evaluate the intermediate pairs $u$ and $\delta u$ at a few iterates of whatever numerical calculation one performs involving $f$ so that we can only sample the ratio

$$\rho u(x) \equiv |\delta u(x)/u(x)|$$

pointwise, where the denominator is hopefully nonzero. We will also refer to this ratio as the relative gap of the convex/concave decomposition at a certain evaluation point *x*. The arithmetic rules for propagating radii of the central form in central convex/concave arithmetic are quite simple.

**Lemma 1** (Propagation rules for central form)**.** *With $c, d, x \in \mathbb{R}$ two constants and an independent variable we have*

$$\begin{aligned} v &= c + d\,x & &\Longrightarrow & \delta v &= 0 & &\Longrightarrow & \rho v &= 0 \ \text{ if } \ v \neq 0 \\ v &= u \pm w & &\Longrightarrow & \delta v &= \delta u + \delta w & &\Longrightarrow & \rho v &\leq \frac{|u| + |w|}{|u \pm w|} \max(\rho u, \rho w) \\ v &= c\,u & &\Longrightarrow & \delta v &= |c|\,\delta u & &\Longrightarrow & \rho v &= \rho u \ \text{ if } \ c \neq 0 \\ v &= |u| & &\Longrightarrow & \delta v &= |u| + 2\,\delta u & &\Longrightarrow & \rho v &\in [1, 1 + 2\rho u] \,. \end{aligned} \tag{8}$$

**Proof.** The last rule follows from Equation (6) by

$$\begin{aligned} \delta(|u|) &= \tfrac{1}{2} \left( \widecheck{|u|} - \widehat{|u|} \right) = \max(\check{u}, -\hat{u}) - \tfrac{1}{2} (\hat{u} - \check{u}) \\ &= \max(\check{u} - \delta u, -\hat{u} - \delta u) + 2\delta u \\ &= \max(u, -u) + 2\delta u = |u| + 2\delta u \,. \end{aligned}$$

The first rule in Equation (8) means that for all quantities $u$ that are affine functions of the independent variables $x$ the corresponding radius $\delta u$ is zero so that $\check{u} = u = \hat{u}$ until we reach the first absolute value. Notice that $\delta v$ does indeed grow additively for the subtraction just like for the addition. By induction it follows from the rules above for an inner product that

$$\delta\left(\sum\_{j=1}^{m}c\_{j}u\_{j}\right) = \sum\_{j=1}^{m}|c\_{j}|\,\delta u\_{j} \,. \tag{9}$$

where the $c_j \in \mathbb{R}$ are assumed to be constants. As we can see from the bounds in Lemma 1 the relative gap can grow substantially whenever one performs an addition of values with opposite sign or applies the absolute value operation. In contrast to interval arithmetic on smooth functions one sees that the relative gap, though it may be zero or small initially, immediately jumps above 1 when one hits the first absolute value operation. This is not really surprising since the best concave lower bound on $u(x) = |x|$ itself is $\hat{u}(x) = 0$ so that $\delta u = |x|$, $\check{u}(x) = 2|x|$ and thus $\rho u(x) = 1$ constantly. On the positive side one should notice that throughout we do not lose sight of the actual central values $u(x)$, which can be evaluated with full arithmetic precision. In any case we can think of neither $\rho$ nor $\kappa \leq 1 + \rho$ as small numbers, but we must be content if they do not actually explode too rapidly. Therefore they will be monitored throughout our numerical experiments.
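The rules of Lemma 1 translate directly into an operator-overloading type that carries the central value and the radius side by side. The following is a minimal sketch under stated assumptions: the class name `Centered`, its method names, and the test function $f(x) = |x_2 - |x_1||$ are illustrative and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Centered:
    """Central value u and radius du of one intermediate at a fixed point x."""
    u: float
    du: float = 0.0          # affine quantities start with radius zero

    def __add__(self, other):
        return Centered(self.u + other.u, self.du + other.du)

    def __sub__(self, other):
        # The radius grows additively for subtraction as well
        return Centered(self.u - other.u, self.du + other.du)

    def scale(self, c):
        return Centered(c * self.u, abs(c) * self.du)

    def abs(self):
        # Last rule of Equation (8): delta|u| = |u| + 2 delta u
        return Centered(abs(self.u), abs(self.u) + 2.0 * self.du)

    def gap(self):
        """Relative gap rho u = |du / u|; requires a nonzero central value."""
        return abs(self.du / self.u)

    def bounds(self):
        """Return (u_hat, u_check) according to Equation (7)."""
        return self.u - self.du, self.u + self.du


# Hypothetical test function f(x) = |x2 - |x1|| at the point x = (1.5, -0.5)
x1, x2 = Centered(1.5), Centered(-0.5)
inner = x2 - x1.abs()
f = inner.abs()
print(f.u, f.du, f.bounds(), f.gap())
```

In this sample the relative gap of the result reaches the upper end $1 + 2\rho u$ of the interval stated in the lemma, while the central value is still the exact function value.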

Again we see that the computational effort is almost exactly doubled. The radii can be treated as additional variables that occur only in linear operations and stay nonnegative throughout. Notice that in contrast to the (nonlinear) interval case we do not lose any accuracy by propagating the central form. It follows immediately by induction from Lemma 1 that any function evaluated by an evaluation procedure that comprises a finite sequence of the operations covered there, i.e., affine initializations in the independent variables, additions or subtractions, multiplications by constants, and absolute value applications, is piecewise affine and continuous. We will call these operations and the resulting evaluation procedure abs-linear. It is also easy to see that the absolute values $|\cdot|$ can be replaced by the maximum $\max(\cdot, \cdot)$ or the minimum $\min(\cdot, \cdot)$ or the positive part function $\max(0, \cdot)$ or any combination of them, since they can all be mutually expressed in terms of each other and some affine operations. Conversely, it follows from the min-max representation established in [10] (Proposition 2.2.2) that any piecewise affine function $f$ can be evaluated by such an evaluation procedure. Consequently, by applying the formulas in Equations (4)–(6) one can propagate at the same time the convex and concave components for all intermediate quantities. Alternatively, one can propagate the centered form according to the rules given in Lemma 1. These rules are also piecewise affine so that we have a finite procedure for simultaneously evaluating $\check{u}$ and $\hat{u}$ or $u$ and $\delta u$ as piecewise linear functions. The combined computation requires about 2–3 times as many arithmetic operations and twice as many memory accesses. Of course, due to the interdependence of the two components it is not possible to evaluate just one of them without the other. As we will see, the same is true for the generalized gradients to be discussed later in Section 4.

#### **3. Forming and Extending the Abs-Linear Form**

In practice all piecewise linear objectives can be evaluated by a sequence of abs-linear operations, possibly after min and max have been rewritten as

$$\min(u, w) = \frac{1}{2}(u + w - |u - w|) \quad \text{and} \quad \max(u, w) = \frac{1}{2}(u + w + |u - w|) \,. \tag{10}$$

Our only restriction is that the number $s$ of intermediate scalar quantities, say $z_i$, is fixed, which is true for example in the max-min representation. Then we can immediately cast the procedure in matrix-vector notation as follows:

**Lemma 2** (Abs-Linear Form)**.** *Any continuous piecewise affine function $f : x \in \mathbb{R}^n \mapsto y \in \mathbb{R}$ can be represented by*

$$\begin{aligned} z &= c + Zx + Mz + L|z| \\ y &= d + a^\top x + b^\top z \end{aligned} \tag{11}$$

*where $z \in \mathbb{R}^s$, $Z \in \mathbb{R}^{s \times n}$, $M, L \in \mathbb{R}^{s \times s}$ strictly lower triangular, $d \in \mathbb{R}$, $a \in \mathbb{R}^n$, $b \in \mathbb{R}^s$ and $|z|$ denotes the componentwise modulus of the vector $z$.*

It should be noted that the construction of this general abs-linear form requires no analysis or computation whatsoever. However, especially for our purpose of generating a reasonably tight DC decomposition, it is advantageous to reduce the size of the abs-normal form by eliminating all intermediates $z_j$ with $j < s$ for which $|z_j|$ never occurs on the right hand side. To this end we may simply substitute the expression of $z_j$ given in the $j$-th row in all places where $z_j$ itself occurs on the right hand side. The result is what we will call a reduced abs-normal form, where after renumbering, all remaining $z_j$ with $j < s$ are switching variables in that $|z_j|$ occurs somewhere on the right hand side. In other words, all but the last column of the reduced, strictly lower triangular matrix $L$ are nontrivial. Again, this reduction process is completely mechanical and does not require any nontrivial analysis, other than looking up which columns of the original $L$ were zero. The resulting reduced system is smaller and probably denser, which might increase the computational effort for evaluating $f$ itself. However, in view of Equation (9) we must expect that for the reduced form the radii will grow more slowly if we first accumulate linear coefficients and then take their absolute values. Hence we will assume in the remainder of this paper that the abs-normal form for our objective $f$ of interest is reduced.

Based on the concept of abs-linearization introduced in [11], a slightly different version of a (reduced) abs-normal form was already proposed in [12]. In the present paper, both $z$ and $y$ depend directly on $z$ via the matrix $M$ and the vector $b$, but $y$ no longer depends directly on $|z|$. All forms can be easily transformed into each other by elementary modifications. The intermediate variables $z_i$ can be calculated successively for $1 \leq i \leq s$ by

$$z\_i = \mathbf{c}\_i + Z\_i \mathbf{x} + M\_i z + L\_i |z| \; \tag{12}$$

where $Z_i$, $M_i$ and $L_i$ denote the $i$th rows of the corresponding matrix. By induction on $i$ one sees immediately that they are piecewise affine functions $z_i = z_i(x)$, and we may define for each $x$ the signature vector

$$\sigma(x) = (\operatorname{sgn}(z_i(x)))_{i=1 \ldots s} \in \{-1, 0, 1\}^s .$$

Consequently we get the inverse images

$$\mathcal{P}\_{\sigma} = \{ \mathbf{x} \in \mathbb{R}^n : \text{sgn}(\mathbf{z}(\mathbf{x})) = \sigma \} \quad \text{for} \quad \sigma \in \{-1, 0, 1\}^s \,, \tag{13}$$

which are relatively open polyhedra that form collectively a disjoint decomposition of $\mathbb{R}^n$. The situation for the second example of Nesterov is depicted in Figure 3 in the penultimate section. There are six polyhedra of full dimension, seven polyhedra of codimension 1 drawn in blue and two points, which are polyhedra of dimension 0. The point $(0, -1)$ with signature $(0, -1, 0)$ is stationary and the point $(1, 1)$ with signature $(1, 0, 0)$ is the minimizer as shown in [7]. The arrows indicate the path of our reflection version of the DCA method as described in Section 5.
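The successive evaluation in Equation (12) and the signature vector can be sketched in a few lines of plain Python. Everything named here is an assumption for illustration: the functions `eval_abs_linear` and `signature`, and the data, which encode the hypothetical test function $f(x) = |x_2 - |x_1||$ with $z_1 = x_1$, $z_2 = x_2 - |z_1|$, $z_3 = |z_2|$ and $y = z_3$ rather than an example from the paper.

```python
def eval_abs_linear(c, Z, M, L, d, a, b, x):
    """Forward substitution through Equation (12); returns (z, y)."""
    z = []
    for i in range(len(c)):
        zi = c[i] + sum(Zik * xk for Zik, xk in zip(Z[i], x))
        # M and L are strictly lower triangular: only j < i contributes
        zi += sum(M[i][j] * z[j] + L[i][j] * abs(z[j]) for j in range(i))
        z.append(zi)
    y = d + sum(ak * xk for ak, xk in zip(a, x)) + sum(bi * zi for bi, zi in zip(b, z))
    return z, y


def signature(z):
    """Componentwise sign in {-1, 0, 1}."""
    return tuple((v > 0) - (v < 0) for v in z)


# Illustrative data for f(x) = |x2 - |x1|| with s = 3 and n = 2
c = [0.0, 0.0, 0.0]
Z = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
M = [[0.0, 0.0, 0.0] for _ in range(3)]
L = [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
d, a, b = 0.0, [0.0, 0.0], [0.0, 0.0, 1.0]

z, y = eval_abs_linear(c, Z, M, L, d, a, b, [1.5, -0.5])
sigma = signature(z)
print(z, y, sigma)
```

Points with a definite signature, such as the sample point above, lie in a polyhedron of full dimension; a zero component signals that the point sits on a kink of the corresponding switching variable.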

When $\sigma$ is definite, i.e., has no zero components, which we will denote by $0 \notin \sigma$, it follows from the continuity of $z(x)$ that $\mathcal{P}_{\sigma}$ has full dimension $n$ unless it is empty. In degenerate situations this may also be true for indefinite $\sigma$ but then the closure of $\mathcal{P}_{\sigma}$ is equal to the extended closure

$$\bar{\mathcal{P}}_{\tilde{\sigma}} \equiv \{ x \in \mathbb{R}^n : \sigma(x) \prec \tilde{\sigma} \} \supset \operatorname{close}(\mathcal{P}_{\tilde{\sigma}}) \tag{14}$$

for some definite $0 \notin \tilde{\sigma} \succ \sigma$. Here the (reflexive) partial ordering $\prec$ between the signature vectors satisfies the equivalence

$$
\tilde{\sigma} \prec \sigma \qquad \Longleftrightarrow \quad \tilde{\sigma}_i^2 \leq \tilde{\sigma}_i \sigma_i \quad \text{for} \quad i = 1 \ldots s \qquad \Longleftrightarrow \quad \bar{\mathcal{P}}_{\tilde{\sigma}} \subset \bar{\mathcal{P}}_{\sigma}
$$

as shown in [13]. One can easily check that for any $\sigma \succ \mathring{\sigma}$ there exists a unique signature

$$(\sigma \rhd \mathring{\sigma})_{i} \quad = \quad \begin{cases} \quad \sigma_{i} & \text{if } \quad \mathring{\sigma}_{i} \neq 0 \\ -\sigma_{i} & \text{if } \quad \mathring{\sigma}_{i} = 0 \end{cases} \quad \text{for} \quad i = 1 \ldots s . \tag{15}$$

We call $\tilde{\sigma} = \sigma \rhd \mathring{\sigma}$ the reflection of $\sigma$ at $\mathring{\sigma}$, which also satisfies $\tilde{\sigma} \succ \mathring{\sigma}$, and we have in fact $\bar{\mathcal{P}}_{\tilde{\sigma}} \cap \bar{\mathcal{P}}_{\sigma} = \bar{\mathcal{P}}_{\mathring{\sigma}}$. Hence the relation between $\sigma$ and $\tilde{\sigma}$ is symmetric in that also $\sigma = \tilde{\sigma} \rhd \mathring{\sigma}$. Therefore we will call $(\sigma, \tilde{\sigma})$ a complementary pair with respect to $\mathring{\sigma}$. In the very special case $z_i = x_i$ for $i = 1 \ldots n = s - 1$ the $\bar{\mathcal{P}}_{\sigma}$ are orthants and their reflections at the origin $\{0\} = \bar{\mathcal{P}}_{0} \subset \mathbb{R}^n$ are their geometric opposites $\bar{\mathcal{P}}_{\tilde{\sigma}}$ with $\tilde{\sigma} = -\sigma$. Here one can see immediately that all edges, i.e., one-dimensional polyhedra, have Cartesian signatures $\pm e_i$ for $i = 1 \ldots n$ and belong to $\bar{\mathcal{P}}_{\sigma}$ or $\bar{\mathcal{P}}_{\tilde{\sigma}}$ for any given $\sigma$. Notice that $\mathring{x}$ is a local minimizer of a piecewise linear function if and only if it is a local minimizer along all edges of nonsmoothness emanating from it. Consequently, optimality of $f$ restricted to a complementary pair is equivalent to local optimality on $\mathbb{R}^n$, not only in this special case, but whenever the Linear Independence Kink Qualification (LIKQ) holds as introduced in [13] and defined in Appendix A. This observation is the basis of the implicit optimality condition verified by our DCA variant Algorithm 1 through the use of reflections. The situation is depicted in Figure 3 where the signatures $(-1, -1, -1)$ and $(1, -1, 1)$ as well as $(1, -1, 1)$ and $(1, 1, -1)$ form complementary pairs at $(0, -1)$ and $(1, 1)$, respectively. At both reflection points there are four emanating edges, which all belong to one of the three polyhedra mentioned.
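The reflection in Equation (15) is a purely combinatorial operation on sign vectors. As a minimal sketch (the function names `precedes` and `reflect` are hypothetical), the two complementary pairs quoted above for Figure 3 can be reproduced as follows.

```python
def precedes(lower, upper):
    """Partial ordering on signatures: every nonzero component of `lower`
    must agree with the corresponding component of `upper`."""
    return all(lo == 0 or lo == up for lo, up in zip(lower, upper))


def reflect(sigma, sigma_ring):
    """Reflection of sigma at sigma_ring according to Equation (15)."""
    if not precedes(sigma_ring, sigma):
        raise ValueError("sigma_ring must precede sigma in the partial ordering")
    # Keep components where sigma_ring is nonzero, flip the sign elsewhere
    return tuple(s if r != 0 else -s for s, r in zip(sigma, sigma_ring))


# Complementary pairs stated in the text for the two reflection points
pair_at_stationary = reflect((-1, -1, -1), (0, -1, 0))
pair_at_minimizer = reflect((1, -1, 1), (1, 0, 0))
print(pair_at_stationary, pair_at_minimizer)
```

Applying `reflect` twice with the same `sigma_ring` returns the original signature, which mirrors the symmetry of the relation noted above.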

Applying the propagation rules from Lemma 1, one obtains with $\delta x = 0 \in \mathbb{R}^n$ the recursion

$$\begin{aligned} \delta z_1 &= \delta(c_1 + Z_1 x) = 0 \\ \delta z_i &= (|M_i| + 2|L_i|)\,\delta z + |L_i||z| \qquad \text{for} \quad i = 2 \ldots s, \end{aligned}$$

where the modulus is once more applied componentwise for vectors and matrices. Hence, we have again in matrix vector notation

$$
\delta z = (|M| + 2|L|)\delta z + |L||z|\,,\tag{16}
$$

which yields for *δz* the explicit expression

$$\delta z \;=\; \left( I - |M| - 2|L| \right)^{-1} |L||z| \;=\; \sum_{j=0}^{\nu} (|M| + 2|L|)^{j}\, |L||z| \,. \tag{17}$$

Here, $\nu$ is the so-called switching depth of the abs-linear form of $f$, namely the largest $\nu \in \mathbb{N}$ such that $(|M| + |L|)^{\nu} \neq 0$, which is always less than $s$ due to the strict lower triangularity of $M$ and $L$. The unit lower triangular $(I - |M| - 2|L|)$ is an M-matrix [14], and interestingly enough does not even depend on $x$ but directly maps $|z| = |z(x)|$ to $\delta z = \delta z(x)$. For the radius of the function itself, the propagation rules from Lemma 1 then yield

$$
\delta f(x) = \delta y = |b|^{\top} \delta z \geq 0 \,. \tag{18}
$$

This nonnegativity implies the inclusion Equation (1) already mentioned in Section 1, i.e.:

**Theorem 1** (Inclusion by convex/concave decomposition)**.** *For any piecewise affine function f in abs-linear form, the construction defined in Section 2 yields a convex/concave inclusion*

$$\hat{f}(x) \leq f(x) \equiv \frac{1}{2} \left(\check{f}(x) + \hat{f}(x)\right) \leq \check{f}(x) \,.$$

*Moreover, the convex and the concave parts $\check{f}(x)$ and $\hat{f}(x)$ have exactly the same switching structure as $f(x)$ in that they are affine on the same polyhedra $\mathcal{P}_{\sigma}$ defined in* (13)*.*

**Proof.** Equations (16) and (17) ensure that $\delta f(x)$ is nonnegative at all $x \in \mathbb{R}^n$ such that

$$\hat{f}(x) = f(x) - \delta f(x) \leq f(x) \leq f(x) + \delta f(x) = \check{f}(x) \,.$$

It follows from Equation (17) that the radii $\delta z_i(x)$ are, like the $|z_i(x)|$, piecewise linear with the only nonsmoothness arising through the switching variables $z(x)$ themselves. Obviously this property is inherited by $\delta f(x)$ and the linear combinations $\check{f}(x) = f(x) + \delta f(x)$ and $\hat{f}(x) = f(x) - \delta f(x)$, which completes the proof.
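The recursion (16), the finite series in Equation (17) and the inclusion of Theorem 1 can be checked numerically on a small case. The sketch below is an illustration under assumptions: all function names are hypothetical, and the data again encode the test function $f(x) = |x_2 - |x_1||$ (with $c = 0$, $M = 0$, $d = 0$, $a = 0$), which is not an example from the paper.

```python
def forward_z(Z, M, L, x):
    """Switching variables by forward substitution (c = 0 in this example)."""
    z = []
    for i in range(len(Z)):
        zi = sum(Zik * xk for Zik, xk in zip(Z[i], x))
        zi += sum(M[i][j] * z[j] + L[i][j] * abs(z[j]) for j in range(i))
        z.append(zi)
    return z


def radii_recursion(M, L, z):
    """Equation (16) solved row by row."""
    dz = []
    for i in range(len(z)):
        dz.append(sum((abs(M[i][j]) + 2 * abs(L[i][j])) * dz[j]
                      + abs(L[i][j]) * abs(z[j]) for j in range(i)))
    return dz


def radii_series(M, L, z):
    """Equation (17) as a finite sum; powers beyond s - 1 vanish."""
    s = len(z)
    A = [[abs(M[i][j]) + 2 * abs(L[i][j]) for j in range(s)] for i in range(s)]
    term = [sum(abs(L[i][j]) * abs(z[j]) for j in range(s)) for i in range(s)]
    total = list(term)
    for _ in range(s - 1):
        term = [sum(A[i][j] * term[j] for j in range(s)) for i in range(s)]
        total = [t + u for t, u in zip(total, term)]
    return total


def bounds(x, Z, M, L, b):
    """Return (f_hat, f, f_check) using Equation (18), with d = 0 and a = 0."""
    z = forward_z(Z, M, L, x)
    dz = radii_recursion(M, L, z)
    f = sum(bi * zi for bi, zi in zip(b, z))
    df = sum(abs(bi) * dzi for bi, dzi in zip(b, dz))
    return f - df, f, f + df


Z = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
M = [[0.0, 0.0, 0.0] for _ in range(3)]
L = [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [0.0, 0.0, 1.0]
print(bounds([1.5, -0.5], Z, M, L, b))
```

For this test function the computed bounds coincide with $\hat{f}(x) = -2|x_1|$ and $\check{f}(x) = 2\max(x_2, 2|x_1| - x_2)$ at every sample point tried.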

Combining Equations (16) and (18) with the abs-linear form of the piecewise affine function $f$ and defining $\tilde{z} = (z, \delta z) \in \mathbb{R}^{2s}$, one obtains for the calculation of $\tilde{f}(x) = \tilde{y} = (y, \delta y)$ the following abs-linear form

$$
\tilde{z} = \tilde{c} + \tilde{Z} x + \tilde{M}\tilde{z} + \tilde{L}|\tilde{z}| \tag{19}
$$

$$
\tilde{y} = \tilde{d} + \tilde{a}^{\top}\mathbf{x} + \tilde{b}^{\top}\tilde{z} \tag{20}
$$

with the vectors and matrices defined by

$$\begin{aligned} \tilde{c} &= \begin{bmatrix} c \\ 0 \end{bmatrix} \in \mathbb{R}^{2s}, \; \tilde{Z} = \begin{bmatrix} Z \\ 0 \end{bmatrix} \in \mathbb{R}^{2s \times n}, \; \tilde{M} = \begin{bmatrix} M & 0 \\ 0 & |M| + 2|L| \end{bmatrix} \in \mathbb{R}^{2s \times 2s}, \; \tilde{L} = \begin{bmatrix} L & 0 \\ |L| & 0 \end{bmatrix} \in \mathbb{R}^{2s \times 2s}, \\ \tilde{d} &= \begin{bmatrix} d \\ 0 \end{bmatrix} \in \mathbb{R}^{2}, \; \tilde{a} = \begin{bmatrix} a & 0 \end{bmatrix} \in \mathbb{R}^{n \times 2}, \; \tilde{b} = \begin{bmatrix} b & 0 \\ 0 & |b| \end{bmatrix} \in \mathbb{R}^{2s \times 2}. \end{aligned}$$

Then, Equations (19) and (20) yield

$$
\begin{aligned}
\begin{bmatrix} z \\ \delta z \end{bmatrix} &= \begin{bmatrix} c \\ 0 \end{bmatrix} + \begin{bmatrix} Z \\ 0 \end{bmatrix} x + \begin{bmatrix} M & 0 \\ 0 & |M| + 2|L| \end{bmatrix} \begin{bmatrix} z \\ \delta z \end{bmatrix} + \begin{bmatrix} L & 0 \\ |L| & 0 \end{bmatrix} \begin{bmatrix} |z| \\ |\delta z| \end{bmatrix} = \begin{bmatrix} c + Zx + Mz + L|z| \\ (|M| + 2|L|)\delta z + |L||z| \end{bmatrix} \\
\begin{bmatrix} y \\ \delta y \end{bmatrix} &= \tilde{d} + \tilde{a}^\top x + \tilde{b}^\top \tilde{z} = \begin{bmatrix} d \\ 0 \end{bmatrix} + \begin{bmatrix} a^\top x \\ 0 \end{bmatrix} + \begin{bmatrix} b & 0 \\ 0 & |b| \end{bmatrix}^\top \begin{bmatrix} z \\ \delta z \end{bmatrix} = \begin{bmatrix} d + a^\top x + b^\top z \\ |b|^\top \delta z \end{bmatrix}
\end{aligned}
$$

i.e., Equations (16) and (18). As can be seen, the matrices $\tilde{M}$ and $\tilde{L}$ have the required strictly lower triangular form. Furthermore, it is easy to check that the switching depth of the abs-linear form of $f$ carries over to the abs-linear form for $\tilde{f}$ in that also $(|\tilde{M}| + |\tilde{L}|)^{\nu} \neq 0 = (|\tilde{M}| + |\tilde{L}|)^{\nu+1}$. However, notice that this system is not reduced since the $s$ radii are not switching variables, but globally nonnegative anyhow. We can now obtain explicit expressions for the central values, radii, and bounds for a given signature $\sigma$.
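The block structure of the extended form in Equations (19) and (20) can be assembled mechanically from the original data. The following sketch uses plain nested lists; the function names `extend` and `evaluate`, the storage of $\tilde{a}$ and $\tilde{b}$ as rows of their transposes, and the test function $f(x) = |x_2 - |x_1||$ are assumptions made for illustration only.

```python
def extend(c, Z, M, L, d, a, b):
    """Build the data of the extended abs-linear form for (y, delta y)."""
    s, n = len(c), len(a)
    zero = [0.0] * s
    grow = [[abs(M[i][j]) + 2 * abs(L[i][j]) for j in range(s)] for i in range(s)]
    abs_l = [[abs(v) for v in row] for row in L]
    c_t = list(c) + zero
    Z_t = [list(row) for row in Z] + [[0.0] * n for _ in range(s)]
    M_t = [list(M[i]) + zero for i in range(s)] + [zero + grow[i] for i in range(s)]
    L_t = [list(L[i]) + zero for i in range(s)] + [abs_l[i] + zero for i in range(s)]
    d_t = [d, 0.0]
    a_t = [list(a), [0.0] * n]                         # rows of a_tilde^T
    b_t = [list(b) + zero, zero + [abs(v) for v in b]]  # rows of b_tilde^T
    return c_t, Z_t, M_t, L_t, d_t, a_t, b_t


def evaluate(c, Z, M, L, d_rows, a_rows, b_rows, x):
    """Forward substitution; returns z and the vector of outputs."""
    z = []
    for i in range(len(c)):
        zi = c[i] + sum(p * q for p, q in zip(Z[i], x))
        zi += sum(M[i][j] * z[j] + L[i][j] * abs(z[j]) for j in range(i))
        z.append(zi)
    y = [dk + sum(p * q for p, q in zip(ak, x)) + sum(p * q for p, q in zip(bk, z))
         for dk, ak, bk in zip(d_rows, a_rows, b_rows)]
    return z, y


# Illustrative data for f(x) = |x2 - |x1||
c = [0.0, 0.0, 0.0]
Z = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
M = [[0.0, 0.0, 0.0] for _ in range(3)]
L = [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
d, a, b = 0.0, [0.0, 0.0], [0.0, 0.0, 1.0]

z_t, y_t = evaluate(*extend(c, Z, M, L, d, a, b), [1.5, -0.5])
print(z_t, y_t)   # first s entries: z, last s entries: radii; y_t = (y, delta y)
```

A single forward sweep over the $2s$ extended variables thus delivers the central value and the radius together, from which both bounds follow by Equation (7).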

**Corollary 1** (Explicit representation of the centered form)**.** *For any definite signature* $\sigma \not\ni 0$ *and all* $x \in \mathcal{P}_\sigma$ *we have with* $\Sigma = \operatorname{diag}(\sigma)$

$$z_{\sigma}(x) = (I - M - L\Sigma)^{-1}(c + Zx) \quad \text{and} \quad |z_{\sigma}(x)| = \Sigma z_{\sigma}(x) \geq 0 \tag{21}$$

$$\delta z_{\sigma}(x) = \left(I - |M| - 2|L|\right)^{-1} |L| \,\Sigma \left(I - M - L\Sigma\right)^{-1} (c + Zx) \tag{22}$$

$$
\nabla z_{\sigma} = \left(I - M - L\Sigma\right)^{-1} Z \implies \nabla f_{\sigma} = a^\top + b^\top \left(I - M - L\Sigma\right)^{-1} Z \tag{23}
$$

$$\nabla \check{f}_{\sigma} = a^{\top} + \left[ b^{\top} + |b|^{\top} (I - |M| - 2|L|)^{-1} |L| \Sigma \right] \left( I - M - L\Sigma \right)^{-1} Z \tag{24}$$

$$
\nabla \hat{f}_{\sigma} = a^{\top} + \left[ b^{\top} - |b|^{\top} (I - |M| - 2|L|)^{-1} |L| \Sigma \right] (I - M - L\Sigma)^{-1} Z \tag{25}
$$

*where the restrictions of the functions and their gradients to* $\mathcal{P}_\sigma$ *are denoted by subscript* $\sigma$*. Notice that the gradients are constant on these open polyhedra.*

**Proof.** Equations (21) and (23) follow directly from Equation (12), the abs-linear form (11) and the properties of $\Sigma$. Combining Equation (16) with (21) yields Equation (22). Since $\check{f}(x) = f(x) + \delta f(x)$ and $\hat{f}(x) = f(x) - \delta f(x)$, Equations (24) and (25) follow from the representation in abs-linear form and Equation (23).
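The values in Equations (21) and (22) never require an explicit inverse, because $M$ and $L$ are strictly lower triangular. The following is a minimal sketch in plain Python, assuming the abs-linear form $z = c + Zx + Mz + L|z|$, $y = d + a^\top x + b^\top z$ stored as nested lists. The function name `eval_abs_linear` and the test function $f(x_1, x_2) = |x_2 - |x_1||$ are illustrative assumptions and do not come from the paper.

```python
def eval_abs_linear(c, Z, M, L, d, a, b, x):
    """Return (z, dz, y, dy) using one forward substitution for z and one for the radii dz.

    M and L are assumed strictly lower triangular, so row i only uses entries j < i.
    """
    s, n = len(c), len(x)
    z = [0.0] * s
    dz = [0.0] * s
    for i in range(s):
        z[i] = c[i] + sum(Z[i][j] * x[j] for j in range(n))
        for j in range(i):
            z[i] += M[i][j] * z[j] + L[i][j] * abs(z[j])
            # radius recurrence: dz = (|M| + 2|L|) dz + |L| |z|
            dz[i] += (abs(M[i][j]) + 2.0 * abs(L[i][j])) * dz[j] + abs(L[i][j]) * abs(z[j])
    y = d + sum(a[j] * x[j] for j in range(n)) + sum(b[i] * z[i] for i in range(s))
    dy = sum(abs(b[i]) * dz[i] for i in range(s))
    return z, dz, y, dy


# Hypothetical example: f(x1, x2) = | x2 - |x1| | with s = 3 intermediates
# z1 = x1, z2 = x2 - |z1|, z3 = |z2|, y = z3.
c = [0.0, 0.0, 0.0]
Z = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
M = [[0.0] * 3 for _ in range(3)]
L = [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
d, a, b = 0.0, [0.0, 0.0], [0.0, 0.0, 1.0]

z, dz, y, dy = eval_abs_linear(c, Z, M, L, d, a, b, [1.0, 3.0])
f_check, f_hat = y + dy, y - dy  # convex and concave part, f = (f_check + f_hat) / 2
print(z, dz, y, f_check, f_hat)
```

For this particular function the two parts can be verified by hand as $\check{f}(x) = 2\max(x_2, 2|x_1| - x_2)$ and $\hat{f}(x) = -2|x_1|$, which agrees with the printed values at $x = (1, 3)$.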

As one can see, the computation of the gradient $\nabla f_\sigma$ requires the solution of one unit upper triangular linear system and that of both $\nabla \check{f}_\sigma$ and $\nabla \hat{f}_\sigma$ one more. Naturally, upper triangular systems are solved by back substitution, which corresponds to the reverse mode of algorithmic differentiation as described in the following section. Hence, the complexity for calculating the gradients is exactly the same as that for calculating the functions, which can be obtained by one forward substitution for $f_\sigma$ and an extra one for $\delta f_\sigma$ and thus $\check{f}_\sigma$ and $\hat{f}_\sigma$. The given $\nabla f_\sigma$, $\nabla \check{f}_\sigma$ and $\nabla \hat{f}_\sigma$ are proper gradients in the interior of the full dimensional domains $\mathcal{P}_\sigma$. For some or even many $\sigma$ the inverse image $\mathcal{P}_\sigma$ of the map $x \mapsto \operatorname{sgn}(z(x))$ may be empty, in which case the formulas in the corollary do not apply. Checking the nonemptiness of $\mathcal{P}_\sigma$ for a given signature $\sigma$ amounts to checking the consistency of a set of linear inequalities, which costs the same as solving an LOP and is thus nontrivial. Expressions for the generalized gradients at points in lower dimensional polyhedra are given in the following Section 4. There it is also not required that the abs-linear normal form has been reduced, but one may consider any given sequence of abs-linear operations.
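The back substitutions mentioned above can be sketched as follows for Equations (23) to (25). This is an illustrative plain Python sketch under the assumption that the abs-linear data are stored as nested lists; the helper names `solve_transposed` and `gradients`, and the example $f(x_1, x_2) = |x_2 - |x_1||$ with signature $\sigma = (1, 1, 1)$ at $x = (1, 3)$, are hypothetical choices made here.

```python
def solve_transposed(A, rhs):
    """Solve (I - A)^T w = rhs for strictly lower triangular A by back substitution."""
    s = len(rhs)
    w = list(rhs)
    for i in range(s - 1, -1, -1):
        # row i reads: w_i - sum_{j > i} A[j][i] * w_j = rhs_i
        w[i] = rhs[i] + sum(A[j][i] * w[j] for j in range(i + 1, s))
    return w


def gradients(Z, M, L, a, b, sigma):
    """Return (grad f, grad f_check, grad f_hat) on the polyhedron with definite signature sigma."""
    s, n = len(b), len(a)
    A = [[M[i][j] + L[i][j] * sigma[j] for j in range(s)] for i in range(s)]      # M + L Sigma
    B = [[abs(M[i][j]) + 2.0 * abs(L[i][j]) for j in range(s)] for i in range(s)]  # |M| + 2|L|
    u = solve_transposed(B, [abs(bi) for bi in b])            # u^T = |b|^T (I - |M| - 2|L|)^{-1}
    r = [sum(u[i] * abs(L[i][j]) for i in range(s)) * sigma[j] for j in range(s)]  # u^T |L| Sigma

    def grad(bvec):
        w = solve_transposed(A, bvec)                          # w^T = bvec^T (I - M - L Sigma)^{-1}
        return [a[k] + sum(w[i] * Z[i][k] for i in range(s)) for k in range(n)]

    return (grad(b),
            grad([b[j] + r[j] for j in range(s)]),
            grad([b[j] - r[j] for j in range(s)]))


# Hypothetical example: z1 = x1, z2 = x2 - |z1|, z3 = |z2|, y = z3
Z = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
M = [[0.0] * 3 for _ in range(3)]
L = [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a, b = [0.0, 0.0], [0.0, 0.0, 1.0]

g, g_check, g_hat = gradients(Z, M, L, a, b, sigma=[1, 1, 1])
print(g, g_check, g_hat)
```

Each gradient costs one transposed triangular solve, plus one shared solve for the radius contribution, which mirrors the operation count stated in the text. The three results satisfy $\nabla f_\sigma = \frac{1}{2}(\nabla \check{f}_\sigma + \nabla \hat{f}_\sigma)$ on the open polyhedron.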

#### *The Two-Term Polyhedral Decomposition*

It is well known ([15], Theorem 2.49) that all piecewise linear and globally convex or concave functions can be represented as the maximum or the minimum of a finite collection of affine functions, respectively. Hence, from the convex/concave decomposition we get the following drastic simplification of the classical min-max representation given, e.g., in [10].

**Corollary 2** (Additive max/min decomposition of PL functions)**.** *For every piecewise affine function* $f : \mathbb{R}^n \mapsto \mathbb{R}$ *there exist* $k \geq 0$ *affine functions* $\alpha_i + a_i^\top x$ *for* $i = 1 \ldots k$ *and* $l \geq 0$ *affine functions* $\beta_j + b_j^\top x$ *for* $j = 1 \ldots l$ *such that at all* $x \in \mathbb{R}^n$

$$f(x) = \underbrace{\max_{i=1\ldots k} (\alpha_i + a_i^\top x)}_{=\frac{1}{2}\check{f}(x)} + \underbrace{\min_{j=1\ldots l} (\beta_j + b_j^\top x)}_{=\frac{1}{2}\hat{f}(x)}\tag{26}$$

*where furthermore* $\hat{f}(x) \leq f(x) \leq \check{f}(x)$*.*

The max-part of this representation is what is called a polyhedral function in the literature [15]. Since the min-part is correspondingly the negative of a polyhedral function we may also refer to Equation (26) as a DP decomposition, i.e., the difference of two polyhedral functions.

We are not aware of a publication that gives a practical procedure for computing such a collection of affine functions $\alpha_i + a_i^\top x$, $i = 1 \ldots k$, and $\beta_j + b_j^\top x$, $j = 1 \ldots l$, for a given piecewise linear function $f$. Of course the critical question is in which form the function $f$ is specified. Here as throughout our work we assume that it is given by a sequence of abs-linear operations. Then we can quite easily compute for each intermediate variable $v$ representations of the form

$$v \; = \sum_{i=1}^{\bar{m}} \max_{1 \le j \le k_i} (\alpha_{ij} + a_{ij}^{\top} x) \; + \sum_{i=1}^{\bar{n}} \min_{1 \le j \le l_i} (\beta_{ij} + b_{ij}^{\top} x) \tag{27}$$

$$\phantom{v} \; = \max_{\substack{j_i \in I_i \\ 1 \le i \le \bar{m}}} \sum_{i=1}^{\bar{m}} (\alpha_{ij_i} + a_{ij_i}^\top x) + \min_{\substack{j_i \in J_i \\ 1 \le i \le \bar{n}}} \sum_{i=1}^{\bar{n}} (\beta_{ij_i} + b_{ij_i}^\top x) \tag{28}$$

with index sets $I_i = \{1, \ldots, k_i\}$, $1 \le i \le \bar{m}$, and $J_i = \{1, \ldots, l_i\}$, $1 \le i \le \bar{n}$, since one has to consider all possibilities of selecting one affine function each from one of the $\bar{m}$ max and $\bar{n}$ min groups, respectively. Obviously, (28) involves $\prod_{i=1}^{\bar{m}} k_i$ and $\prod_{i=1}^{\bar{n}} l_i$ affine function terms in contrast to the first representation (27) which contains just $\sum_{i=1}^{\bar{m}} k_i$ and $\sum_{i=1}^{\bar{n}} l_i$ of them. Still the second version conforms to the classical representation of convex and concave piecewise linear functions, which yields the following result:

**Corollary 3** (Explicit computation of the DP representation)**.** *For any piecewise linear function given as abs-linear procedure one can explicitly compute the representation* (26) *by implementing the rules of Lemma 1.*

**Proof.** We will consider the representations (27) from which (26) can be directly obtained in the form (28). Firstly, the independent variables $x_j$ are linear functions of themselves with gradient $a = e_j$ and inhomogeneity $\alpha = 0$. Then for multiplications by a constant $c > 0$ we have to scale all affine functions by $c$. Secondly, addition requires appending the expansions of the two summands to each other without any computation. Taking the negative requires switching the sign of all affine functions and interchanging the max and min group. Finally, to propagate through the absolute values we have to apply the rule (6), which means switching the signs in the min group, expressing it in terms of max and merging it with the existing max group. Here merging means pairwise joining each polyhedral term of the old max-group with each term in the switched min-group. Then the new min-group is the old one plus the old max-group with its sign switched.
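To make the bookkeeping concrete, here is a small sketch that propagates the max and min groups directly in the collapsed form (28), i.e., one max group $P$ and one min group $Q$ per intermediate. This is a simplification chosen for brevity and not the staged representation (27) used in the proof. It assumes that each variable is seeded as $x_j = \frac{1}{2}x_j + \frac{1}{2}x_j$ and that the absolute value is handled through the identity $|P + Q| = \max(2P, -2Q) + (Q - P)$, which follows from $|u| = \max(u, -u)$. All function names are hypothetical.

```python
from itertools import product

# An affine function alpha + a^T x is stored as (alpha, a) with a tuple a.
def aff_scale(t, p):
    return (t * p[0], tuple(t * ai for ai in p[1]))

def aff_add(p, q):
    return (p[0] + q[0], tuple(ai + bi for ai, bi in zip(p[1], q[1])))

def aff_eval(p, x):
    return p[0] + sum(ai * xi for ai, xi in zip(p[1], x))

# An intermediate v is stored as (P, Q) with v(x) = max_{p in P} p(x) + min_{q in Q} q(x).
def variable(j, n):
    half = (0.0, tuple(0.5 if i == j else 0.0 for i in range(n)))
    return ([half], [half])

def scale(c, v):
    """Multiplication by a constant c > 0 scales all affine functions."""
    return ([aff_scale(c, p) for p in v[0]], [aff_scale(c, q) for q in v[1]])

def negate(v):
    """Switch all signs and interchange the max and the min group."""
    return ([aff_scale(-1.0, q) for q in v[1]], [aff_scale(-1.0, p) for p in v[0]])

def add(u, w):
    """Sum of two maxima is the maximum over all pairwise sums (same for minima)."""
    return ([aff_add(p, r) for p, r in product(u[0], w[0])],
            [aff_add(q, r) for q, r in product(u[1], w[1])])

def absolute(u):
    P, Q = u
    new_P = [aff_scale(2.0, p) for p in P] + [aff_scale(-2.0, q) for q in Q]
    new_Q = [aff_add(q, aff_scale(-1.0, p)) for q, p in product(Q, P)]
    return (new_P, new_Q)

def evaluate(v, x):
    return max(aff_eval(p, x) for p in v[0]) + min(aff_eval(q, x) for q in v[1])

# Hypothetical example: f(x1, x2) = | x2 - |x1| |
x1, x2 = variable(0, 2), variable(1, 2)
rep = absolute(add(x2, negate(absolute(x1))))
print(len(rep[0]), len(rep[1]), evaluate(rep, (1.0, 3.0)))
```

Even in this tiny case the pairwise products in `add` and `absolute` show where the number of affine terms multiplies, which is the growth discussed in the following paragraph.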

We see that taking the absolute value or, alternatively, maxima or minima generates the strongest growth in the number of polyhedral terms and their size. It seems clear that this representation is generally not very useful because the number of terms will likely blow up exponentially. This is not surprising because we will need one affine function for each element of the polyhedral decompositions of the domain of the max and min term. Typically, many of the affine terms will be redundant, i.e., could be removed without changing the values of the polyhedral terms. Unfortunately, identifying those already requires solving primal or dual linear programming problems, see, e.g., [16]. It seems highly doubtful that this would ever be worthwhile. Therefore, we will continue to advocate dealing with piecewise linear functions in a convenient procedural abs-linear representation.

#### **4. Computation of Generalized Gradients and Constructive Oracle Paradigm**

For optimization by variants of the DCA algorithm [17] one needs generalized gradients of the convex and the concave component. Normally, there are no strict rules for propagating generalized gradients through nonsmooth evaluation procedures. However, exactly this is simply assumed in the frequently invoked oracle paradigm, which states that at any point $x \in \mathbb{R}^n$ the function value $f(x)$ and an element $g \in \partial f(x)$ can be evaluated. We have argued in [18] that this is not at all a reasonable assumption.

On the other hand, it is well understood that for the convex operations (positive scaling, addition, and taking the maximum) the rules are strict and simple. Moreover, then the generalized gradient in the sense of Clarke $\partial \check{f}(x) \subset \mathbb{R}^n$ is actually a subdifferential in that all its elements define supporting hyperplanes. Similarly $\partial \hat{f}(x)$ might be called a superdifferential in that the tangent planes bound the concave part from above.

In other words, we have at all $x \in \mathbb{R}^n$ and for all increments $\Delta x$

$$
\check{f}(x + \Delta x) \geq \check{f}(x) + \check{g}^{\top} \Delta x \quad \text{if} \quad \check{g} \in \partial \check{f}(x),
$$

and

$$
\hat{f}(x + \Delta x) \leq \hat{f}(x) + \hat{g}^{\top} \Delta x \quad \text{if} \quad \hat{g} \in \partial \hat{f}(x),
$$

which imply for $\check{g} \in \partial \check{f}(x)$ and $\hat{g} \in \partial \hat{f}(x)$ that

$$
\hat{f}(x + \Delta x) + \check{f}(x) + \check{g}^{\top}\Delta x \leq 2f(x + \Delta x) \leq \check{f}(x + \Delta x) + \hat{f}(x) + \hat{g}^{\top}\Delta x \tag{29}
$$

where the lower bound on the left is a concave function and the upper bound is convex, both with respect to $\Delta x$. Notice that the generalized superdifferential $\partial \hat{f}$ being the negative of the subdifferential of $-\hat{f}$ is also a convex set.

Now the key question is how we can calculate a suitable pair of generalized gradients $(\check{g}, \hat{g}) \in \partial \check{f}(x) \times \partial \hat{f}(x)$. As we noted above the convex part and the negative of the concave part only undergo convex operations so that for $v = c\,u$

$$
\partial \check{v} = \begin{cases}
c \, \partial \check{u} & \text{if} \quad c > 0 \\
0 & \text{if} \quad c = 0 \\
c \, \partial \hat{u} & \text{if} \quad c < 0
\end{cases}
\quad \text{and} \quad \partial \hat{v} = \begin{cases}
c \, \partial \hat{u} & \text{if} \quad c > 0 \\
0 & \text{if} \quad c = 0 \\
c \, \partial \check{u} & \text{if} \quad c < 0
\end{cases} \tag{30}
$$

and for $v = u + w$

$$
\partial \check{v} = \partial \check{u} + \partial \check{w} \quad \text{and} \quad \partial \hat{v} = \partial \hat{u} + \partial \hat{w} \,. \tag{31}
$$

Finally, for $v = |u|$ we find by Equation (6) that $\partial \hat{v} = \partial \hat{u} - \partial \check{u}$ as well as

$$\tfrac{1}{2} \partial \check{v} = \partial \max(\check{u}, -\hat{u}) = \begin{cases} \partial \check{u} & \text{if} \quad u > 0 \\ \operatorname{conv}\{\partial \check{u} \cup (-\partial \hat{u}) \} & \text{if} \quad u = 0 \\ -\partial \hat{u} & \text{if} \quad u < 0 \end{cases} \tag{32}$$

where we have used that $u = \frac{1}{2}(\check{u} + \hat{u})$ in Equation (32). The signs of the arguments $u$ of the absolute value function are of great importance, because they determine the switching structure. For this reason, we formulated the cases in terms of $u$ rather than in the convex/concave components. The operator $\operatorname{conv}\{\cdot\}$ denotes taking the convex hull or envelope of a given usually closed set. It is important to state that within an abs-linear representation the multipliers $c$ will stay constant independent of the argument $x$, even if they were originally computed as partial derivatives by an abs-linearization process and thus subject to round-off error. In particular their sign will remain fixed throughout whatever algorithmic calculation we perform involving the piecewise linear function $f$. So, actually the case $c = 0$ could be eliminated by dropping this term completely and just initializing the left hand side $v$ to zero.
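The rules (30) to (32) translate almost literally into a forward propagation of values together with one generalized gradient per component. The sketch below is a hedged illustration: the class name `DC`, the function names, and the parameter `mu` (the weight of the branch $-\partial \hat{u}$ in the convex hull for the critical case $u = 0$) are assumptions introduced here, and every independent variable is seeded with the Cartesian basis vector $e_j$.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DC:
    val: float            # value of the intermediate u
    check: float          # convex part, with u = (check + hat) / 2
    hat: float            # concave part
    g_check: List[float]  # one element of the subdifferential of the convex part
    g_hat: List[float]    # one element of the superdifferential of the concave part


def var(x, j):
    e = [1.0 if i == j else 0.0 for i in range(len(x))]
    return DC(x[j], x[j], x[j], e, list(e))


def scale(c, u):
    if c >= 0:  # rule (30); c = 0 simply produces zeros
        return DC(c * u.val, c * u.check, c * u.hat,
                  [c * g for g in u.g_check], [c * g for g in u.g_hat])
    # a negative factor interchanges the roles of the convex and concave part
    return DC(c * u.val, c * u.hat, c * u.check,
              [c * g for g in u.g_hat], [c * g for g in u.g_check])


def add(u, w):  # rule (31)
    return DC(u.val + w.val, u.check + w.check, u.hat + w.hat,
              [p + q for p, q in zip(u.g_check, w.g_check)],
              [p + q for p, q in zip(u.g_hat, w.g_hat)])


def absolute(u, mu=0.5):  # rule (32) and the identity for the concave part
    if u.val > 0:
        m = 0.0
    elif u.val < 0:
        m = 1.0
    else:
        m = mu  # critical case: any weight in [0, 1] gives a valid element
    g_check = [2.0 * ((1.0 - m) * p - m * q) for p, q in zip(u.g_check, u.g_hat)]
    g_hat = [q - p for p, q in zip(u.g_check, u.g_hat)]
    return DC(abs(u.val), 2.0 * max(u.check, -u.hat), u.hat - u.check, g_check, g_hat)


# Hypothetical example: f(x1, x2) = | x2 - |x1| | at x = (1, 3)
x = [1.0, 3.0]
f_dc = absolute(add(var(x, 1), scale(-1.0, absolute(var(x, 0)))))
print(f_dc)

# Critical case f(x) = |x| at x = 0 with two different weights
kink_half = absolute(var([0.0], 0), mu=0.5)
kink_zero = absolute(var([0.0], 0), mu=0.0)
print(kink_half.g_check, kink_zero.g_check)
```

The cost of this forward variant grows with the number of independent variables because every intermediate carries two full gradient vectors.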


Because we have set identities we can propagate generalized gradient pairs $(\nabla \check{u}, \nabla \hat{u}) \in \partial \check{u} \times \partial \hat{u}$ and perform the indicated algebraic operations on them, starting with the Cartesian basis vectors

$$\nabla \check{x}_j = \nabla \hat{x}_j = \nabla x_j = e_j \quad \text{since} \quad \check{x}_j = \hat{x}_j = x_j \quad \text{for} \quad j = 1 \ldots n .$$

The result of this propagation is guaranteed to be an element of $\partial \check{f} \times \partial \hat{f}$. Recall that in the merely Lipschitz continuous case generalized gradients cannot be propagated with certainty since for example the difference $v = w - u$ generates a proper inclusion $\partial v \subset \partial w - \partial u$. In that vein we must emphasize that the average $\frac{1}{2}(\nabla \check{f} + \nabla \hat{f})$ need not be a generalized gradient of $f = \frac{1}{2}(\check{f} + \hat{f})$ as demonstrated by the possibility that $\hat{f} = -\check{f}$ algebraically but we happen to calculate different generalized gradients of $\check{f}$ and $-\hat{f}$ at a particular point $x$. In fact, if one could show that $\partial f = \frac{1}{2}(\partial \check{f} + \partial \hat{f})$ one would have verified the oracle paradigm, whose use we consider unjustified in practice. Instead, we can formulate another corollary for sufficiently piecewise smooth functions.

**Definition 1.** *For any* $d \in \mathbb{N}$*, the set of functions* $f : \mathbb{R}^n \mapsto \mathbb{R}$, $y = f(x)$, *defined by an abs-normal form*

$$\begin{array}{rcl} z & = & F(x, z, |z|), \\ y & = & \varphi(x, z) \end{array}$$

*with* $F \in C^d(\mathbb{R}^{n+s+s})$ *and* $\varphi \in C^d(\mathbb{R}^{n+s})$*, is denoted by* $C^d_{\mathrm{abs}}(\mathbb{R}^n)$*.*

Once more, this definition differs slightly from the one given in [7] in that *y* depends only on *z* and not on |*z*| in order to match the abs-linear form used here. Then one can show the following result:

**Corollary 4** (Constructive Oracle Paradigm)**.** *For any function* $f \in C^2_{\mathrm{abs}}(\mathbb{R}^n)$ *and a given point* $x$ *there exist a convex polyhedral function* $\Delta\check{f}(x; \Delta x)$ *and a concave polyhedral function* $\Delta\hat{f}(x; \Delta x)$ *such that*

$$f(x + \Delta x) - f(x) = \frac{1}{2} \left( \Delta\check{f}(x; \Delta x) + \Delta\hat{f}(x; \Delta x) \right) + \mathcal{O}(\|\Delta x\|^2) .$$

*Moreover, both terms and their generalized gradients at* $\Delta x = 0$ *or anywhere else can be computed with the same order of complexity as* $f$ *itself.*

**Proof.** In [11], we show that

$$f(x + \Delta x) - f(x) = \Delta f(x; \Delta x) + \mathcal{O}(\|\Delta x\|^2),$$

where $\Delta f(x; \Delta x)$ is a piecewise linearization of $f$ developed at $x$ and evaluated at $\Delta x$. Applying the convex/concave decomposition of Theorem 1, one obtains immediately the assertion with a convex polyhedral function $\Delta\check{f}(x; \Delta x)$ and a concave polyhedral function $\Delta\hat{f}(x; \Delta x)$ evaluated at $\Delta x$. The complexity results follow from the propagation rules derived so far.

We had hoped that it would be possible to use this approximate decomposition into polyhedral parts to construct at least locally an exact decomposition of a general function $f \in C^d_{\mathrm{abs}}(\mathbb{R}^n)$ into a convex and a concave part. The natural idea seems to be to add a sufficiently large quadratic term $\beta\|\Delta x\|^2$ to

$$f(x + \Delta x) - f(x) - \frac{1}{2}\Delta\hat{f}(x; \Delta x) = \frac{1}{2}\Delta\check{f}(x; \Delta x) + \mathcal{O}(\|\Delta x\|^2)$$

such that it would become convex. Then the same term could be subtracted from $\Delta\hat{f}(x; \Delta x)$ maintaining its concavity. Unfortunately, the following simple example shows that this is not possible.

**Example 1** (Half pipe)**.** *The function*

$$f: \mathbb{R}^2 \mapsto \mathbb{R}, \quad \begin{aligned} f(x_1, x_2) &= \max(x_2^2 - \max(x_1, 0), 0) \\ &= \begin{cases} x_2^2 & \text{if } x_1 \leq 0 \\ x_2^2 - x_1 & \text{if } 0 \leq x_1 \leq x_2^2 \\ 0 & \text{if } 0 \leq x_2^2 \leq x_1 \end{cases} \end{aligned} \tag{33}$$

*in the class* $C^\infty_{\mathrm{abs}}(\mathbb{R}^n)$ *is certainly nonconvex as shown in Figure 1. As already observed in [19] this generally nonsmooth function is actually Fréchet differentiable at the origin* $x = 0$ *with a vanishing gradient* $\nabla f(0) = 0$*. Hence, we have* $f(\Delta x) = \mathcal{O}(\|\Delta x\|^2)$ *and may simply choose constantly* $\Delta\check{f}(0; \Delta x) = 0 = \Delta\hat{f}(0; \Delta x)$*. However, neither by adding* $\beta\|\Delta x\|^2$ *nor any other smooth function to* $f(\Delta x)$ *can we eliminate the downward facing kink along the vertical axis* $\Delta x_1 = 0$*. In fact, it is not clear whether this example has any DC decomposition at all.*

**Figure 1.** Half pipe example as defined in Equation (33).
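The two claims of the example can be checked numerically with a few lines of Python. The sketch below is only an illustration; the helper `second_difference` and the chosen sample values are assumptions made here. It verifies that $f(\Delta x)/\|\Delta x\|^2$ stays bounded near the origin, and that the second difference across the kink $\Delta x_1 = 0$ remains negative after adding $\beta\|\Delta x\|^2$, which rules out convexity of the shifted function.

```python
def half_pipe(x1, x2):
    """Half pipe function of Equation (33)."""
    return max(x2 ** 2 - max(x1, 0.0), 0.0)


# f(dx) = O(|dx|^2): the ratio f(t d) / |t d|^2 stays bounded as t -> 0
ratios = []
for t in (1e-1, 1e-2, 1e-3):
    for d1, d2 in ((1.0, 0.0), (0.0, 1.0), (-1.0, 1.0), (1.0, 1.0)):
        ratios.append(half_pipe(t * d1, t * d2) / (t * t * (d1 * d1 + d2 * d2)))


def second_difference(x2, h, beta):
    """Second difference in x1 across x1 = 0 of f + beta * |x|^2 at height x2.

    A negative value contradicts convexity along the line x2 = const.
    For 0 < h < x2**2 it equals -h + 2 * beta * h**2.
    """
    def g(x1):
        return half_pipe(x1, x2) + beta * (x1 ** 2 + x2 ** 2)
    return g(h) + g(-h) - 2.0 * g(0.0)


print(max(ratios))
print(second_difference(0.1, 1e-3, beta=100.0))  # still negative for a large beta
```

Since the second difference behaves like $-h + 2\beta h^2$, it is negative for all sufficiently small $h$ regardless of how large $\beta$ is chosen.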

#### *Applying the Reverse Mode for Accumulating Generalized Gradients*

Whenever gradients are propagated forward through a smooth evaluation procedure, i.e., for functions in $C^2(\mathbb{R}^n)$, they are uniquely defined as affine combinations of each other, starting from Cartesian basis vectors for the components of $x$. Given only the coefficients of the affine combinations one can propagate corresponding adjoint values, or impact factors, backwards to obtain the gradient of a single dependent with respect to all independents at a small multiple of the operations needed to evaluate the dependent variable by itself. This cheap gradient result is a fundamental principle of computational mathematics, which is widely applied under various names, for example discrete adjoints, back propagation, and reverse mode differentiation. For a historical review see [20] and for a detailed description using similar notation to the current paper see our book [5]. For good reasons, there has been little attention to the reverse mode in the context of nonsmooth analysis, where one can at best obtain subgradients. The main obstacle is again that the forward propagation rules are only sharp when all elementary operations maintain convexity, which is by the way the only constructive way of verifying convexity for a given evaluation procedure. While general affine combinations and the absolute value are themselves convex functions, they do not maintain convexity when applied to a convex argument.

The last equation of Lemma 1 shows that one cannot directly propagate a subgradient of the convex radius functions $\delta u$ because there is a reference to $v = |u|$ itself, which does not maintain convexity except when it is redundant due to its argument having a constant sign. However, it follows from the identity $\delta u = \frac{1}{2}(\check{u} - \hat{u})$ that for all intermediates $u$

$$
\nabla \check{u} \in \partial \check{u} \;\wedge\; \nabla \hat{u} \in \partial \hat{u} \quad \implies \quad \frac{1}{2}(\nabla \check{u} - \nabla \hat{u}) \in \partial\, \delta u \,.
$$

Hence one can get affine lower bounds of the radii, although one would probably prefer upper bounds to limit the discrepancy between the convex and concave parts. When $v = |u|$ and $u = 0$ we may choose according to Equation (32) any convex combination

$$
\frac{1}{2}\nabla\check{v} = (1-\mu)\nabla\check{u} - \mu\nabla\hat{u} \quad \text{for} \quad 0 \leq \mu \leq 1 . \tag{34}
$$

It is tempting but not necessarily a good idea to always choose the weight $\mu$ equal to $\frac{1}{2}$ for simplicity.

Before discussing the reasons for this at the end of this subsection, let us note that from the values of the constants $c$, the intermediate values $u$, and the chosen weights $\mu$ it is clear how the next generalized gradient pair $(\nabla \check{v}, \nabla \hat{v})$ is computed as a linear combination of the generalized gradients of the inputs for each operation, possibly with a switch in their roles. That means after only evaluating the function $f$ itself, not even the bounds $\check{f}$ and $\hat{f}$, we can compute a pair of generalized gradients in $\partial \check{f} \times \partial \hat{f}$ using the reverse mode of algorithmic differentiation, which goes back to at least [21] though not under that name. The complexity of this computation will be independent of the number of variables and relative to the complexity of the function $f$ itself. All the operations are relatively benign, namely scaling by constants, interchanges and additions and subtractions. After all the reverse mode is just a reorganization of the linear algebra in the forward propagation of gradients. Hence, it appears that we can be comparatively optimistic regarding the numerical stability of this process.

To be specific we will indicate the (scalar) adjoint value of all intermediates $\check{u}$ and $\hat{u}$ as usual by $\bar{\check{u}} \in \mathbb{R}$ and $\bar{\hat{u}} \in \mathbb{R}$. They are all initialized to zero except for either $\bar{\check{y}} = 1$ or $\bar{\hat{y}} = 1$. Then at the end of the reverse sweep, the vectors $(\bar{x}_j)_{j=1}^n$ represent either $\nabla \check{y}$ or $\nabla \hat{y}$, respectively. For computational efficiency one may propagate both adjoint components simultaneously, so that one computes with sextuplets consisting of $\check{u}$, $\hat{u}$ and their adjoints with respect to $\check{y}$ and $\hat{y}$. In any case we have the following adjoint operations. For $v = u + w$

$$(\bar{\check{w}}, \bar{\hat{w}}) \mathrel{+}= (\bar{\check{v}}, \bar{\hat{v}}) \quad \text{and} \quad (\bar{\check{u}}, \bar{\hat{u}}) \mathrel{+}= (\bar{\check{v}}, \bar{\hat{v}}),$$

for $v = c\,u$

$$(\bar{\check{u}}, \bar{\hat{u}}) \mathrel{+}= \begin{cases} c \, (\bar{\check{v}}, \bar{\hat{v}}) & \text{if } c > 0 \\ (0, 0) & \text{if } c = 0 \\ c \, (\bar{\hat{v}}, \bar{\check{v}}) & \text{if } c < 0 \end{cases}$$

and finally for $v = |u|$

$$(\bar{\check{u}}, \bar{\hat{u}}) \mathrel{+}= \begin{cases} (2\bar{\check{v}} - \bar{\hat{v}}, \; \bar{\hat{v}}) & \text{if } u > 0 \\ ( -\bar{\hat{v}} + 2(1 - \mu)\bar{\check{v}}, \; \bar{\hat{v}} - 2\mu\bar{\check{v}}) & \text{if } u = 0 \\ ( -\bar{\hat{v}}, \; \bar{\hat{v}} - 2\bar{\check{v}} ) & \text{if } u < 0 \end{cases} \tag{35}$$

Of course, the update for the critical case $u = 0$ of the absolute value is just the convex combination for the two cases $u > 0$ and $u < 0$ weighted by $\mu$. Due to round-off errors it is very unlikely that the critical case $u = 0$ ever occurs in floating point arithmetic. Once more, the signs of the arguments $u$ of the absolute value function are of great importance, because they determine on which faces of the polyhedral functions $\check{f}$ and $\hat{f}$ the current argument $x$ is located. In some situations one prefers a gradient that is limiting in that it actually occurs as a proper gradient on one of the adjacent smooth pieces. For example, if we had simply $f(x) = v = |x|$ for $x \in \mathbb{R}$ and chose $\mu = \frac{1}{2}$ we would get $\check{v} = 2|x|$, $\hat{v} = 0$ and find by Equation (34) that $\nabla \check{v} = 2(\frac{1}{2} - \frac{1}{2}) = 0$ at $x = \check{x} = \hat{x} = 0$. This is not a limiting gradient of $\check{v}$ since $\partial \check{v} = [-2, 2]$, whose interior contains the particular generalized gradient 0.
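The adjoint rules of this subsection can be sketched as a small tape-based reverse sweep. The following Python code is an illustration under stated assumptions: the tape format (tuples `("add", i, j)`, `("scale", c, i)`, `("abs", i)` referring to earlier indices), the function names, and the final summation $\bar{x}_j = \bar{\check{x}}_j + \bar{\hat{x}}_j$ (which uses $\check{x}_j = \hat{x}_j = x_j$) are choices made here and not prescribed by the paper. Only the values of the intermediates of $f$ itself are evaluated in the forward pass.

```python
def forward(tape, x):
    """Evaluate the intermediates of f; v[0:n] are the independents."""
    v = list(x)
    for op in tape:
        if op[0] == "add":
            v.append(v[op[1]] + v[op[2]])
        elif op[0] == "scale":
            v.append(op[1] * v[op[2]])
        else:  # "abs"
            v.append(abs(v[op[1]]))
    return v


def reverse(tape, v, n, seed, mu=0.5):
    """Reverse sweep; seed = (1, 0) yields a gradient of the convex part,
    seed = (0, 1) a gradient of the concave part of the last intermediate."""
    bc = [0.0] * len(v)  # adjoints of the convex components
    bh = [0.0] * len(v)  # adjoints of the concave components
    bc[-1], bh[-1] = seed
    for k in range(len(tape) - 1, -1, -1):
        op, i = tape[k], n + k
        if op[0] == "add":
            for j in (op[1], op[2]):
                bc[j] += bc[i]
                bh[j] += bh[i]
        elif op[0] == "scale":
            c, j = op[1], op[2]
            if c > 0:
                bc[j] += c * bc[i]
                bh[j] += c * bh[i]
            elif c < 0:  # roles of convex and concave part are interchanged
                bc[j] += c * bh[i]
                bh[j] += c * bc[i]
        else:  # "abs", Equation (35)
            j = op[1]
            m = 0.0 if v[j] > 0 else 1.0 if v[j] < 0 else mu
            bc[j] += 2.0 * (1.0 - m) * bc[i] - bh[i]
            bh[j] += bh[i] - 2.0 * m * bc[i]
    return [bc[j] + bh[j] for j in range(n)]


# Hypothetical example: f(x1, x2) = | x2 - |x1| |
# v2 = |v0|, v3 = -v2, v4 = v1 + v3, v5 = |v4|
tape = [("abs", 0), ("scale", -1.0, 2), ("add", 1, 3), ("abs", 4)]
vals = forward(tape, [1.0, 3.0])
grad_check = reverse(tape, vals, 2, seed=(1.0, 0.0))
grad_hat = reverse(tape, vals, 2, seed=(0.0, 1.0))
print(vals[-1], grad_check, grad_hat)

# The scalar example f(x) = |x| at x = 0 for two choices of the weight mu
kink_vals = forward([("abs", 0)], [0.0])
print(reverse([("abs", 0)], kink_vals, 1, (1.0, 0.0), mu=0.5),
      reverse([("abs", 0)], kink_vals, 1, (1.0, 0.0), mu=0.0))
```

With $\mu = \frac{1}{2}$ the sweep reproduces the non-limiting gradient 0 from the scalar example, whereas $\mu = 0$ or $\mu = 1$ returns one of the limiting gradients $\pm 2$ of $\check{v} = 2|x|$.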

#### **5. Exploiting the Convex/Concave Decomposition for the DC Algorithm**

In order to minimize the decomposed objective function *f* we may use the DCA algorithm [17] which is given in its basic form using our notation by

$$\begin{array}{l} \text{Choose } x_0 \in \mathbb{R}^n \\ \text{For } k = 0, 1, 2, \ldots \\ \quad \text{Calculate } g_k \in -\partial\left(\tfrac{1}{2}\hat{f}\right)(x_k) \\ \quad \text{Calculate } x_{k+1} \in \partial\left(\tfrac{1}{2}\check{f}\right)^{*}(g_k) \end{array}$$

where $\left(\frac{1}{2}\check{f}\right)^{*}$ denotes the Fenchel conjugate of $\frac{1}{2}\check{f}$. For a convex function $h : \mathbb{R}^n \mapsto \mathbb{R}$ one has

$$w \in \partial h^{*}(y) \quad \Leftrightarrow \quad w \in \operatorname*{argmin}_{x \in \mathbb{R}^{n}} \{h(x) - y^{\top}x\},$$

see [15], Chapter 11. Hence, the classic DCA reduces in our Euclidean scenario to a simple recurrence

$$x_{k+1} \in \operatorname*{argmin}_{x \in \mathbb{R}^n} \left\{ \check{f}(x) + \hat{g}_k^\top x \right\} \quad \text{for some} \quad \hat{g}_k \in \partial\hat{f}(x_k) \,. \tag{36}$$

The objective function on the left hand side is a constantly shifted convex polyhedral upper bound on $2 f(x)$ since

$$\check{f}(x) + \hat{g}_k^\top x = 2f(x) - \left(\hat{f}(x) - \hat{g}_k^\top x\right) \geq 2f(x) - \hat{f}(x_k) + \hat{g}_k^\top x_k \,. \tag{37}$$

It follows from Equation (29) and $x_{k+1}$ being a minimizer that

$$\begin{split} f(x_{k+1}) &\leq \frac{1}{2} \left( \check{f}(x_{k+1}) + \hat{f}(x_{k}) + \hat{g}_{k}^{\top}(x_{k+1} - x_{k}) \right) \\ &\leq \frac{1}{2} \left( \check{f}(x_{k}) + \hat{f}(x_{k}) \right) = f(x_{k}) \,. \end{split}$$
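A one-dimensional toy run illustrates the recurrence (36) and the resulting monotone descent. The sketch below assumes the hypothetical objective $f(x) = ||x| - 1|$, whose parts $\check{f}(x) = 2\max(2|x| - 1, 1)$ and $\hat{f}(x) = -2|x|$ were worked out by hand from the propagation rules of Section 4. The inner convex problem is solved by brute force over the breakpoints of $\check{f}$, which is only a stand-in for a proper LOP solver and is valid here because the shifted problem is bounded for $|\hat{g}_k| \leq 4$.

```python
def f_check(x):
    """Convex part of the hypothetical f(x) = | |x| - 1 |."""
    return 2.0 * max(2.0 * abs(x) - 1.0, 1.0)


def f_hat(x):
    """Concave part."""
    return -2.0 * abs(x)


def f(x):
    return 0.5 * (f_check(x) + f_hat(x))


# f_check written as a maximum of affine pieces alpha + a * x
PIECES = [(-2.0, 4.0), (-2.0, -4.0), (2.0, 0.0)]


def argmin_shifted(g):
    """Minimize f_check(x) + g * x by enumerating the breakpoints (1D brute force)."""
    candidates = []
    for i, (al1, a1) in enumerate(PIECES):
        for al2, a2 in PIECES[i + 1:]:
            if a1 != a2:
                candidates.append((al2 - al1) / (a1 - a2))
    return min(candidates, key=lambda x: f_check(x) + g * x)


def g_hat(x):
    """One element of the superdifferential of f_hat; at x = 0 any value in [-2, 2] is valid."""
    return -2.0 if x >= 0 else 2.0


x = 0.3
history = [f(x)]
for k in range(10):
    x_new = argmin_shifted(g_hat(x))
    history.append(f(x_new))
    if x_new == x:
        break
    x = x_new

print(x, history)
```

Starting from $x_0 = 0.3$ the iteration moves to the minimizer $x = 1$ in one step and then stalls, with function values that never increase, as predicted by the chain of inequalities above.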

Now, since (36) is an LOP, an exact solution $x_{k+1}$ can be found in finitely many steps, for example by a variant of the Simplex method. Moreover, we can then assume that $x_{k+1}$ is one of finitely many vertex points of the epigraph of $\check{f}$. At these vertex points, $f$ itself attains a finite number of bounded values. Provided $f$ itself is bounded below, we can conclude that for any choice of the $\hat{g}_k \in \partial \hat{f}_{\sigma^{(k)}}$ the resulting function values $f(x_k)$ can only be reduced finitely often so that $f(x_k) = f(x_{k-1})$ and w.l.o.g. $x_k = x_{k-1}$ eventually. We then choose the next $\hat{g}_k = \nabla \hat{f}_{\sigma^{(k)}}$ with $\sigma^{(k)} = \sigma^{(k-1)} \triangleright \sigma(x_k)$ as the reflection of $\sigma^{(k-1)}$ at $\sigma(x_k)$ as defined in (15). If then again $f(x_{k+1}) = f(x_k)$ it follows from Corollary A2 that $x_k$ is a local minimizer of $f$ and we may terminate the optimization run. Hence we obtain the DCA variant listed in Algorithm 1, which is guaranteed to reach local optimality under LIKQ. It is well defined even without this property and we conjecture that otherwise the final iterate is still a stationary point of $f$. The path of the algorithm on the example discussed in Section 5 is sketched in Figure 3. It reaches the stationary point $(0, -1)$ where $\sigma = (0, -1, 0)$ from within the polyhedron with the signature $(-1, -1, -1)$ and then continues after the reflection $(1, -1, 1) = (-1, -1, -1) \triangleright (0, -1, 0)$. From within that polyhedron the inner loop reaches the point $(1, 1)$ with signature $(1, 0, 0)$, whose minimality is established after a search in the polyhedron $\bar{\mathcal{P}}_{(1,1,-1)}$.

If the function $f(x)$ is unbounded below, so will be one of the inner convex problems and the convex minimizer should produce a ray of infinite descent instead of the next iterate $x_{k+1}$. This exceptional scenario will not be explicitly considered in the remainder of the paper. The reflection operation is designed to facilitate further descent or establish local optimality. It is discussed in the context of general optimality conditions in the following subsection.

**Algorithm 1** Reflection DCA

**Require:** $x_0 \in \mathbb{R}^n$

1. Set $f_{-1} = \infty$ and evaluate $f_0 = f(x_0)$
2. **for** $k = 0, 1, \ldots$ **do**
3. **if** $f_k < f_{k-1}$ **then** ▷ Normal iteration with function reduction
4. Choose $0 \notin \sigma \succ \sigma(x_k)$ ▷ Here different heuristics may be applied
5. Compute $\hat{g}_k = \nabla \hat{f}_\sigma$ ▷ Apply formula of Corollary 1
6. **else** ▷ The starting point was already optimal
7. Reflect $\tilde{\sigma} = \sigma \triangleright \sigma(x_k)$ ▷ The symbol $\triangleright$ is defined in Equation (15)
8. Update $\hat{g}_k = \nabla \hat{f}_{\tilde{\sigma}}$
9. **end if**
10. Calculate $x_{k+1} \in \operatorname{argmin} \left\{ \check{f}(x) + \hat{g}_k^\top x \;\middle|\; x \in \mathbb{R}^n \right\}$ ▷ Apply any finite LOP solver
11. Set $f_{k+1} = f(x_{k+1})$
12. **if** $f_{k+1} = f_k = f_{k-1}$ **then** ▷ Local optimality established
13. Stop
14. **end if**
15. **end for**

#### *5.1. Checking Optimality Conditions*

Stationarity of $x_k$ happens when the convex function $\check f(x) + \hat g_k^\top x$ is minimal at $x_k$ so that for all large $k$

$$0 \in \partial \check f(x_k) + \hat g_k \iff \hat g_k \in \partial \hat f(x_k) \cap (-\partial \check f(x_k)) \neq \emptyset \,. \tag{38}$$

The nonemptiness condition on the right hand side is known as criticality of the DC decomposition at $x_k$, which is necessary but not sufficient even for local optimality of $f(x)$ at $x_k$. To ensure the latter one has to verify that all $\hat g_k \in \partial \hat f(x_k)$ satisfy the criticality condition (38) so that

$$
\partial \hat f(x_k) \subset -\partial \check f(x_k) \iff \partial^L \hat f(x_k) \subset -\partial \check f(x_k) \,. \tag{39}
$$

The left inclusion is a well known local minimality condition [22], which is already sufficient in the piecewise linear case. The right inclusion is equivalent to the left one due to the convexity of $\partial \check f(x_k)$.

If $\check f$ and $\hat f$ were unrelated convex and concave polyhedral functions, one would normally consider it extremely unlikely that $\hat f$ were nonsmooth at any one of the finitely many vertices of the polyhedral domain decomposition of $\check f$. For instance when $\hat f$ is smooth at $x_k$ we find that $\partial \hat f(x_k) = \{\hat g_k\}$ is a singleton so that criticality according to Equation (38) is already sufficient for local minimality according to Equation (39). As we have seen in Theorem 1 the two parts have exactly the same switching structure. That means they are nonsmooth on the same skeleton of lower dimensional polyhedra. Hence, neither $\partial^L \check f(x_k)$ nor $\partial^L \hat f(x_k)$ will be singletons at minimizing vertices of the upper bound so that checking the validity of Equation (39) appears to be a combinatorial task at first sight.

However, provided the Linear Independence Kink Qualification (LIKQ) defined in [7] is satisfied at the candidate minimizer $x_k$, the minimality can be tested with cubic complexity even in case of a dense abs-linear form. Moreover, if the test fails one can easily calculate a descent direction $d$. The details of the optimality test in our context, including the calculation of a descent direction, are given in Appendix A. They differ slightly from the ones in [7]. Rather than applying the optimality test of Proposition A1 explicitly, one can use its Corollary A2, which states that if $x_*$ with $\sigma_* = \sigma(x_*)$ is a local minimizer of the restriction of $f$ to a polyhedron $\overline{\mathcal{P}}_{\sigma}$ with definite $\sigma \succ \sigma_*$ then it is a local minimizer of the unrestricted $f$ if and only if it also minimizes the restriction of $f$ to $\overline{\mathcal{P}}_{\tilde\sigma}$ with the reflection $\tilde\sigma = \sigma \triangleright \sigma_*$. The latter condition must be true if $x_*$ also minimizes $\check f(x) + \nabla \hat f_{\tilde\sigma}^\top x$, which can be checked by solving that convex problem. If that test fails the optimization can continue.

#### *5.2. Proximal Rather Than Global*

Some authors have credited the DCA with being able to reach global minimizers with a higher probability than other algorithms. There is really no justification for this optimism in the light of the following observation. Suppose the objective $f(x) = \frac{1}{2}(\check f(x) + \hat f(x))$ has an isolated local minimizer $x_*$. Then there exists an $\varepsilon > 0$ such that the level set $\{x \in \mathbb{R}^n : f(x) \le f(x_*) + \varepsilon\}$ has a bounded connected component containing $x_*$, say $\mathcal{L}_\varepsilon$. Now suppose DCA is started from any point $x_0 \in \mathcal{L}_\varepsilon$. Since $f_0(x) = \frac{1}{2}(\check f(x) + \hat f(x_0) + \hat g(x_0)^\top (x - x_0))$ is by Equation (37) a convex upper bound on $f(x)$, its level set $\{f_0(x) \le f(x_0)\}$ will be contained in $\mathcal{L}_\varepsilon$. Hence any step from $x_0$ that reduces the upper bound $f_0(x)$ must stay in the same component, so there is absolutely no chance to move away from the catchment $\mathcal{L}_\varepsilon$ of $x_0$ towards another local minimizer of $f$, whether global or not. In fact, by adding the convex term

$$\frac{1}{2}\left(\hat f(x_0) + \hat g(x_0)^\top (x - x_0) - \hat f(x)\right) \ge 0,$$

which vanishes at $x_0$, to the actual objective $f(x)$ one performs a kind of regularization, like in the proximal point method. This means the step is actually held back compared to a larger step that might be taken by a method that only requires the reduction of $f(x)$ itself.

Hence we may interpret DCA as a proximal point method where the proximal term is defined as an affinely shifted negative of the concave part. Since in general the norm and the coefficient defining the proximal term may be quite hard to select, this way of defining it may make a lot of sense. However, it is certainly not global optimization. Notice that in this argument we have used neither the polyhedrality nor the inclusion property. So it applies to a general DC decomposition on Euclidean space. Another conclusion from the "holding back" observation is that it is probably not worthwhile to minimize the upper bound very carefully. One might rather readjust the shift $\hat g^\top x$ after a few or even just one iteration.
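The upper bound property behind this argument can be checked numerically on a small case. The sketch below borrows the decomposition derived in Section 6 for $n = 2$, namely $\check f(x) = \frac{1}{2}|x_1 - 1| + 2(|x_2 - 2|x_1| + 1| + 2|x_1|)$ and $\hat f(x) = -4|x_1|$. The function names, the base point $x_0 = (-1, 0)$ and the sampling box are illustrative assumptions.

```python
import random

def f_check(x):
    """Convex part for n = 2."""
    return 0.5 * abs(x[0] - 1) + 2 * (abs(x[1] - 2 * abs(x[0]) + 1) + 2 * abs(x[0]))

def f_hat(x):
    """Concave part for n = 2."""
    return -4 * abs(x[0])

def f(x):
    return 0.5 * (f_check(x) + f_hat(x))

def dca_model(x, x0):
    """Convex model f_0: f_hat is replaced by its linearization at x0 (x0[0] != 0)."""
    g_hat = (-4.0 if x0[0] > 0 else 4.0, 0.0)  # gradient of f_hat at x0
    lin = f_hat(x0) + g_hat[0] * (x[0] - x0[0]) + g_hat[1] * (x[1] - x0[1])
    return 0.5 * (f_check(x) + lin)

x0 = (-1.0, 0.0)
rng = random.Random(0)
samples = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(1000)]
gap = min(dca_model(x, x0) - f(x) for x in samples)

# The model touches f at x0 and lies above f at all sampled points
print(dca_model(x0, x0) == f(x0), gap >= -1e-12)
```

Any point that lowers the model below $f(x_0)$ therefore also lowers $f$ below $f(x_0)$, which is the mechanism that keeps the iterates in the level set component of the starting point.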

#### **6. Nesterov's Piecewise Linear Example**

According to [6], Nesterov suggested three Rosenbrock-like test functions for nonsmooth optimization. One of them, given by

$$f(x) = \frac{1}{4}|x_1 - 1| + \sum_{i=1}^{n-1} \big|x_{i+1} - 2|x_i| + 1\big| \tag{40}$$

is nonconvex and piecewise linear. It is shown in [6] that this function has $2^{n-1}$ Clarke stationary points, only one of which is a local and thus the global minimizer. Numerical studies showed that optimization algorithms tend to be trapped at one of the stationary points, making it an interesting test problem. We have demonstrated in [23] that using an active signature strategy one can guarantee convergence to the unique minimizer from any starting point, albeit using in the worst case $2^n$ iterations as all stationary points are visited. Let us first write the problem in the new abs-linear form.

Defining the $s = 2n$ switching variables

$$z_i = F_i(x, |z|) = x_i \quad \text{for} \quad 1 \le i < n, \qquad z_n = F_n(x, |z|) = x_1 - 1,$$

and

$$z_{n+i} = F_{n+i}(x, |z|) = x_{i+1} - 2|z_i| + 1 \quad \text{for} \quad 1 \le i < n, \qquad z_s = \frac{1}{4}|z_n| + \sum_{i=1}^{n-1} |z_{n+i}|,$$

the resulting objective function is then simply identical to $y = f(x) = z_s$. With the vectors and matrices

$$\begin{aligned} c^{\top} &= (0, -1, e_{n-1}^{\top}, 0) \in \mathbb{R}^{(n-1)+1+(n-1)+1}, \quad Z = \begin{bmatrix} I_{n-1} & 0 \\ I_{n-1} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{s \times ((n-1)+1)}, \\ M &= 0, \quad L = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -2I_{n-1} & 0 & 0 & 0 \\ 0 & \frac{1}{4} & e_{n-1}^{\top} & 0 \end{bmatrix} \in \mathbb{R}^{s \times ((n-1)+1+(n-1)+1)}, \quad d = 0 \in \mathbb{R}, \\ a &= 0, \quad b^{\top} = (0, \cdots, 0, 1) \in \mathbb{R}^{(2n-1)+1}, \end{aligned}$$

where $Z$ and $L$ have different row partitions, one obtains an abs-linear form (11) of $f$. Here, $I_k$ denotes the identity matrix of dimension $k$, $e^\top = (1, \cdots, 1) \in \mathbb{R}^k$ the vector containing only ones, and the symbol $0$ pads with zeros to achieve the specified dimensions. One can easily check that $|L|^2 \neq 0 = |L|^3$, hence this example has switching depth $\nu = 2$. The geometry of the situation is depicted in Figure 3, which was already briefly discussed in Sections 3 and 5.
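The claim about the powers of $|L|$ is easy to confirm for small $n$. The following sketch builds $|L|$ entrywise from the block structure given above, with row and column partition $(n-1, 1, n-1, 1)$, and multiplies it out in plain Python. The helper names are illustrative.

```python
def abs_L(n):
    """Entrywise absolute value of L for the partition (n-1, 1, n-1, 1)."""
    s = 2 * n
    A = [[0.0] * s for _ in range(s)]
    for i in range(n - 1):
        A[n + i][i] = 2.0       # block -2 I_{n-1}: z_{n+i} depends on |z_i|
        A[s - 1][n + i] = 1.0   # block e^T: z_s sums the |z_{n+i}|
    A[s - 1][n - 1] = 0.25      # entry 1/4: z_s contains |z_n| / 4
    return A

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def is_zero(A):
    return all(v == 0 for row in A for v in row)

for n in (2, 3, 6):
    A = abs_L(n)
    A2 = matmul(A, A)
    A3 = matmul(A2, A)
    # |L|^2 is nonzero while |L|^3 vanishes, i.e., switching depth 2
    print(n, is_zero(A2), is_zero(A3))
```

The only nonzero entries of $|L|^2$ sit in the last row, which reflects that $z_s$ depends on the $|z_i|$ only through the intermediate $|z_{n+i}|$.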

Since the corresponding extended abs-linear form for $\tilde f = (y, \delta y)$ does not provide any new insight we do not state it here. Directly in terms of the original equations we obtain for the radii

$$\delta z_i = 0 \quad \text{for} \quad 1 \le i \le n, \qquad \delta z_{n+i} = 2|z_i| = 2|x_i| \quad \text{for} \quad 1 \le i < n \tag{41}$$

and

$$\begin{aligned} \delta f = \delta z_s &= \frac{1}{4}|z_n| + \sum_{i=1}^{n-1} \left(|z_{n+i}| + 2\,\delta z_{n+i}\right) \\ &= \frac{1}{4}|x_1 - 1| + \sum_{i=1}^{n-1} \left(\big|x_{i+1} - 2|x_i| + 1\big| + 4|x_i|\right). \end{aligned} \tag{42}$$

Thus, from Equation (7) we get the convex and concave part explicitly as

$$\begin{aligned} \check z_i &= z_i = \hat z_i && \text{for } 1 \le i \le n, \\ \check z_{n+i} &= x_{i+1} + 1 && \text{for } 1 \le i < n, \\ \hat z_{n+i} &= x_{i+1} - 4|z_i| + 1 = x_{i+1} - 4|x_i| + 1 && \text{for } 1 \le i < n, \end{aligned}$$

and most importantly

$$\begin{aligned} \check f &= z_s + \delta z_s = \frac{1}{2}|x_1 - 1| + 2\sum_{i=1}^{n-1} \left(\big|x_{i+1} - 2|x_i| + 1\big| + 2|x_i|\right), \\ \hat f &= z_s - \delta z_s = -4\sum_{i=1}^{n-1} |x_i| \,. \end{aligned}$$

Clearly $\hat f$ is a concave function and to check the convexity of $\check f$ we note that

$$\begin{aligned} \big|x_{i+1} - 2|x_i| + 1\big| + 2|x_i| &= \big|2|x_i| - 1 - x_{i+1}\big| + \left(2|x_i| - 1 - x_{i+1}\right) + x_{i+1} + 1 \\ &= 1 + x_{i+1} + 2\max\left(0,\, 2|x_i| - x_{i+1} - 1\right). \end{aligned} \tag{43}$$

The last expression is the sum of an affine function and the positive part of the sum of the absolute value and an affine function, which must therefore also be convex. The corresponding term in Equation (42) is the same with the convex function $2|x_i|$ added, so that $\delta f$ is also convex in agreement with the general theory. Finally, one verifies easily that

$$\hat f \le f = \frac{1}{2}(\check f + \hat f) \le \check f,$$

which is the whole idea of the decomposition. It would seem that the automatic decomposition by propagation through the abs-linear procedure yields a rather tight result. The function $f$ as well as the lower and upper bound given by the convex/concave decomposition are illustrated on the left hand side of Figure 2. Notice that the switching structure is indeed identical for all three as stated in Theorem 1. On the right hand side of Figure 2, the difference $2\delta f$ between the upper, convex and lower, concave bound is shown, which is indeed convex.
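The inclusion can also be spot-checked numerically. The sketch below transcribes Equation (40) and the explicit formulas for the convex and concave parts given above, and then samples random points for $n = 4$. The function names, the sampling box and the tolerance are illustrative assumptions.

```python
import random

def f(x):
    """Nesterov's piecewise linear function, Equation (40)."""
    return 0.25 * abs(x[0] - 1) + sum(
        abs(x[i + 1] - 2 * abs(x[i]) + 1) for i in range(len(x) - 1))

def f_check(x):
    """Convex upper bound z_s + delta z_s."""
    return 0.5 * abs(x[0] - 1) + 2 * sum(
        abs(x[i + 1] - 2 * abs(x[i]) + 1) + 2 * abs(x[i]) for i in range(len(x) - 1))

def f_hat(x):
    """Concave lower bound z_s - delta z_s."""
    return -4 * sum(abs(x[i]) for i in range(len(x) - 1))

rng = random.Random(1)
ok = True
for _ in range(1000):
    x = [rng.uniform(-5, 5) for _ in range(4)]
    average = 0.5 * (f_check(x) + f_hat(x))
    ok = ok and abs(average - f(x)) < 1e-9 and f_hat(x) <= f(x) <= f_check(x)

print(ok)                    # both the averaging identity and the inclusion hold
print(f([1.0, 1.0, 1.0, 1.0]))  # 0.0 at the global minimizer
```

At the minimizer the two parts do not vanish individually: for $n = 4$ one gets $\check f = 12$ and $\hat f = -12$, which already hints at the conditioning issue discussed next.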

**Figure 2.** Nesterov–Rosenbrock test function polyhedral inclusion for $n = 2$.

It is worthwhile to look at the condition number of the decomposition, for which we get the following trivial bound

$$\begin{split} \kappa(\check f, \hat f) &= \sup_{x \in \mathbb{R}^{n}} \frac{\frac{1}{2}|x_1 - 1| + 2\sum_{i=1}^{n-1} \left(\big|x_{i+1} - 2|x_i| + 1\big| + 4|x_i|\right)}{\frac{1}{2}|x_1 - 1| + 2\sum_{i=1}^{n-1} \big|x_{i+1} - 2|x_i| + 1\big|} \\ &= 1 + \sup_{x \in \mathbb{R}^{n}} \frac{8\sum_{i=1}^{n-1} |x_i|}{\frac{1}{4}|x_1 - 1| + 2\sum_{i=1}^{n-1} \big|x_{i+1} - 2|x_i| + 1\big|} \,. \end{split}$$

The disappointing right hand side value follows from the fact that at the well known unique global optimizer $x_* = (1, 1, \ldots, 1) \in \mathbb{R}^n$ the denominator is zero and the numerator positive, so that the supremum is infinite. However, elsewhere, we can bound the conditioning as follows.

**Lemma 3.** *In case of the example* (40) *there is a constant $c \in \mathbb{R}$ such that*

$$\kappa(\check f(x), \hat f(x)) \le 1 + \frac{c}{\min(\|x - x_*\|, 3)} \,. \tag{44}$$

**Proof.** Since the denominator is piecewise linear and vanishes only at the minimizer $x_*$ there must be a constant $c_0 > 0$ such that for $\|x - x_*\|_\infty \le 3$

$$\frac{8\sum_{i=1}^{n-1}|x_i|}{\frac{1}{4}|x_1 - 1| + 2\sum_{i=1}^{n-1}\big|x_{i+1} - 2|x_i| + 1\big|} \le \frac{8\sum_{i=1}^{n-1}|x_i|}{c_0\|x - x_*\|_\infty} \le \frac{8(n-1)\|x\|_\infty}{c_0\|x - x_*\|_\infty} \le \frac{32(n-1)}{c_0\|x - x_*\|_\infty},$$

which takes the value $32(n-1)/(3c_0)$ on the boundary. On the other hand we get for $\|x\|_\infty \ge 2$ and thus in particular for $\|x - x_*\|_\infty \ge 3$

$$\frac{8\sum_{i=1}^{n-1}|x_i|}{\frac{1}{4}|x_1 - 1| + 2\sum_{i=1}^{n-1}\big|x_{i+1} - 2|x_i| + 1\big|} \le \frac{4(n-1)\|x\|_\infty}{\max_{1 \le i < n}\big|2|x_i| - x_{i+1} - 1\big|} \le \frac{4(n-1)}{2 - 1 - 1/2} \le 8(n-1) \,.$$

Assuming without loss of generality that $c_0 \le 4/3$ we can combine the two bounds to obtain the assertion with $c = 32(n-1)/c_0$.

Hence, we see the condition number $\kappa(\check f(x), \hat f(x))$ is nicely bounded and the decomposition should work as long as our optimization algorithm has not yet reached its goal $x_*$. It is verified in the companion article [24] that the DCA exploiting the observations made in this paper reaches the global minimizer in finitely many steps. It was already shown in [7] that the LIKQ condition is satisfied everywhere and that the optimality test singles out the unique minimizer correctly. In Figure 3, the arrows indicate the path of our reflection version of the DCA method as described in Section 5.
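A small numerical illustration for $n = 2$ shows how the conditioning degrades only close to the minimizer $(1, 1)$. The sketch assumes that the quantity of interest is the pointwise ratio $(\check f + |\hat f|)/|\check f + \hat f|$, as suggested by the first line of the bound on $\kappa$ above. The function names and the sample points are illustrative.

```python
def f_check(x):
    """Convex part for n = 2."""
    return 0.5 * abs(x[0] - 1) + 2 * (abs(x[1] - 2 * abs(x[0]) + 1) + 2 * abs(x[0]))

def f_hat(x):
    """Concave part for n = 2."""
    return -4 * abs(x[0])

def ratio(x):
    """(f_check + |f_hat|) / |f_check + f_hat| at a point x different from (1, 1)."""
    return (f_check(x) + abs(f_hat(x))) / abs(f_check(x) + f_hat(x))

for t in (1.0, 0.1, 0.01, 0.001):
    x = (1.0 + t, 1.0)  # approach the minimizer (1, 1) along the first axis
    print(t, ratio(x))
```

Along this particular line the ratio equals $(12.5\,t + 8)/(4.5\,t)$, so it grows like $1/t$, in line with the form of the bound (44).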

**Figure 3.** Signatures and reflection-based DCA for Nesterov–Rosenbrock variant (40) with $n = 2$.

#### **7. Summary, Conclusions and Outlook**

In this paper the following new results were achieved


These results are illustrated on the piecewise linear Rosenbrock variant of Nesterov.

On a theoretical level it would be gratifying and possibly provide additional insight, to prove the result of Corollary A3 directly using the explicit representations of the generalized differentials of the convex and concave part given in Corollary 1. Moreover, it remains to be explored what happens when LIKQ is not satisfied. We have conjectured in [25] that just verifying the weaker Mangasarian Fromovitz Kink Qualification (MFKQ) represents an NP hard task. Possibly, there are other weaker conditions that can be cheaply verified and facilitate the testing for at least local optimality.

Global optimality can be characterized theoretically in terms of $\varepsilon$-subgradients, albeit with $\varepsilon$ arbitrarily large [26]. There is the possibility that the alternative definition of $\varepsilon$-gradients given in [18] might allow one to constructively check for global optimality. It does not really seem clear how these global optimality conditions can be used to derive corresponding algorithms.

The implementation of the DCA algorithm can be optimized in various ways. Notice that for applying the Simplex method in standard form, one could use for the representation as DC function the max-part in the more economical representation Equation (27) introducing $\bar m$ additional variables, rather than the potentially combinatorial Equation (28) to assemble the constraint matrix. In any case it seems doubtful that solving each subproblem to completion is a good idea, especially as the resulting step in the outer iteration is probably much too small anyhow. Therefore, the generalized gradient of the concave part, which defines the inner problem, should probably be updated much more frequently. Moreover, the inner solver might be an SQOP type active signature method or a matrix free gradient method with momentum term, as is used in machine learning, notwithstanding the nonsmoothness of the objective. Various options in that range will be discussed and tested in the companion article [24].

Finally, one should always keep in mind that the task of minimizing a piecewise linear function will most likely occur as an inner problem in the optimization of a piecewise smooth and nonlinear function. As we have shown in [27] the local piecewise linear model problem can be obtained easily by a slight generalization of automatic or algorithmic differentiation, e.g., ADOL-C [28] and Tapenade [29].

**Author Contributions:** Conceptualization, A.G. and A.W.; methodology, A.G. and A.W.; writing–original draft preparation; writing–review and editing, A.G. and A.W. Both authors have read and agreed to the published version of the manuscript.

**Funding:** We acknowledge support by the German Research Foundation (DFG) and the Open Access Publication Fund of Humboldt-Universität zu Berlin.

**Acknowledgments:** We thank Napsu Karmitsa and Sona Taheri for inviting us to participate in this special issue in honor of Adil M. Bagirov. We also thank the three anonymous referees, who asked for various corrections and clarifications, which made the paper much more self-contained and readable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Polynomial Optimality Test Based on Abs-Linear Form**

As illustrated for the Nesterov test function, it may be advantageous to use intermediate variables $z_i$ that are not arguments of the absolute value themselves. For simplicity, we assume that these switching variables that do not impose nonsmoothness are located in the last components of $z$ and that only the $\tilde s \le s$ components $z_1, \ldots, z_{\tilde s}$ are arguments of the absolute value. Let us abbreviate the current iterate $x_k$ with $\mathring x = x_k$ and denote the corresponding switching vector by $\mathring z = z(\mathring x)$, the signature vector by $\mathring\sigma = \operatorname{sgn}(\mathring z)$ and the active index set by $\alpha = \{i \le \tilde s : \mathring\sigma_i = 0\}$ with cardinality $m = |\alpha| \le \tilde s$. Consequently, there are exactly $2^m$ definite signatures $\sigma \succ \mathring\sigma$ and the same number of limiting gradients for the three generalized differentials $\partial \check f$, $\partial \hat f$, and $\partial f$.

For all $x \in \mathcal{P}_{\mathring\sigma}$, the signature $\mathring\sigma$ is constant and we can use Corollary 1 to define the smooth function

$$z_{\mathring\sigma}(x) \;=\; (I - M - L\mathring\Sigma)^{-1}(c + Zx) \;=\; \mathring c + \mathring Z x \,, \tag{A1}$$

where we have pulled out the unit lower triangular factor $(I - M - L\mathring\Sigma)$ such that

$$
\mathring Z = (I - M - L\mathring\Sigma)^{-1} Z \quad \text{and} \quad \mathring c = (I - M - L\mathring\Sigma)^{-1} c \,.
$$

For $x \approx \mathring x$ to be contained in the extended closure $\overline{\mathcal{P}}_{\mathring\sigma}$ as defined in Equation (14), it must satisfy the $m$ linear equations

$$P_\alpha z(x) = 0 \in \mathbb{R}^{m} \quad \text{for} \quad P_\alpha = (e_i^\top)_{i \in \alpha} \in \mathbb{R}^{m \times \tilde s}$$

with $e_i$ denoting the $i$th unit vector in $\mathbb{R}^{\tilde s}$. Thus it is necessary and sufficient for $\overline{\mathcal{P}}_{\mathring\sigma}$ to be a polyhedron of dimension $n - m$ that the Jacobian $P_\alpha \mathring Z \in \mathbb{R}^{m \times n}$ has full row rank $m$. This rank condition was introduced as LIKQ in [7] and obviously requires that no more than $n$ switches are active at $\mathring x$. As discussed in [7], for the point $\mathring x$ to be a local minimizer of $f$ it is necessary that it solves the trunk problem

$$\min\; a^\top x + b^\top z \quad \text{s.t.} \quad |\mathring\Sigma| z - \mathring c - \mathring Z x = 0 \,.$$

Here $|\mathring\Sigma| \in \mathbb{R}^{\tilde s \times \tilde s}$ is the projection onto the $\tilde s - m$ vector components whose indices do not belong to $\alpha$ so the equality constraint combines (A1) and the constraint $P_\alpha z = 0$. Now we get from KKT theory or equivalently LOP duality that $\mathring x$ is a minimizer on $\mathcal{P}_\alpha$ if and only if for some Lagrange multiplier vector $\lambda \in \mathbb{R}^{\tilde s}$

$$a^\top = -\lambda^\top \mathring Z \quad \text{and} \quad b^\top = \lambda^\top |\mathring\Sigma| \,. \tag{A2}$$

Since $I = |\mathring\Sigma| + P_\alpha^\top P_\alpha$ we derive that

$$
\lambda^\top (I - |\mathring\Sigma|) \mathring Z \;=\; \lambda_\alpha^\top P_\alpha \mathring Z \;=\; -a^\top - b^\top \mathring Z \,, \tag{A3}
$$

where $\lambda_\alpha = P_\alpha \lambda$. This is a generally overdetermined system of $n$ equations in the $m$ components of $\lambda_\alpha$. If it is solvable the full multiplier vector $\lambda = P_\alpha^\top \lambda_\alpha + |\mathring\Sigma| b$ is immediately available. Because of the assumed full rank of the Jacobian $P_\alpha \mathring Z$ we have $m \le n$, and if $\mathring x$ is a vertex in that $m = n$ the tangential stationarity condition (A3) is automatically satisfied.

Now it is necessary and sufficient for local minimality that $\mathring x$ is also a minimizer of $f$ on all polyhedra $\overline{\mathcal{P}}_\sigma$ with definite $\sigma \succ \mathring\sigma$. Any such $\sigma \succ \mathring\sigma$ can be written as $\sigma = \mathring\sigma + \gamma$ with $\gamma \in \{-1, 0, 1\}^{\tilde s}$ structurally orthogonal to $\mathring\sigma$ such that for $\Gamma = \operatorname{diag}(\gamma)$ we have the matrix equations

$$
\Sigma = \mathring\Sigma + \Gamma \quad \text{and} \quad \mathring\Sigma\,\Gamma = 0 = |\mathring\Sigma|\,\Gamma \,.
$$

Then we can express the $z(x) = z_\sigma(x)$ for $x \in \mathcal{P}_\sigma$ as

$$\begin{aligned} z_\sigma(x) = z_{\mathring\sigma + \gamma}(x) &= (I - M - L\mathring\Sigma - L\Gamma)^{-1}(c + Zx) \\ &= (I - \mathring L\Gamma)^{-1}(\mathring c + \mathring Z x) \end{aligned}$$

with $\mathring L = (I - M - L\mathring\Sigma)^{-1} L$. Now $\mathring x$ must be the minimizer of $f$ on $\overline{\mathcal{P}}_\sigma$, i.e., solve the problem

$$\min\; a^\top x + b^\top z \qquad \text{s.t.} \qquad (I - \mathring L\Gamma) z = \mathring c + \mathring Z x, \quad P_\alpha \Gamma z \ge 0 \in \mathbb{R}^m \,. \tag{A4}$$

Notice that the inequalities are only imposed on the sign constraints that are active at $\mathring x$ since the strict inequalities are maintained in a neighborhood of $\mathring x$ due to the continuity of $z(x)$. Then we get again from KKT theory or equivalently LOP duality that still $a^\top = -\lambda^\top \mathring Z$ and for a second multiplier vector $0 \le \mu \in \mathbb{R}^m$ the equalities

$$a^\top = -\lambda^\top \mathring Z \quad \text{and} \quad b^\top = \lambda^\top (I - \mathring L\Gamma) + \mu^\top P_\alpha \Gamma \,. \tag{A5}$$

Multiplying from the right by the projection $|\mathring\Sigma|$ we find that the conditions (A2) and (A3) must still hold so that $\lambda$ remains exactly the same. Moreover, multiplying from the right by $\Gamma P_\alpha^\top$ we get with $P_\alpha P_\alpha^\top = I_m$ and $\Gamma\Gamma = P_\alpha^\top P_\alpha$ after some rearrangement the inequality

$$(\lambda - b)^{\top} \Gamma P_\alpha^{\top} = \lambda^{\top} \mathring L P_\alpha^{\top} - \mu^{\top} \le \lambda^{\top} \mathring L P_\alpha^{\top} \,. \tag{A6}$$

Now the key observation is that this condition is linear in $\Gamma$ and is strongest for the choice $\gamma_i = \operatorname{sgn}(\lambda_i - b_i)$ for $i \in \alpha$ yielding the inequalities

$$|\lambda_i - b_i| \le e_i^\top \mathring L^\top \lambda \quad \text{for} \quad i \in \alpha \,. \tag{A7}$$

In other words, $\mathring x$ is a solution of the branch problems (A4) if and only if it is for the worst case where $\gamma_i = \operatorname{sgn}(\lambda_i - b_i)$ for $i \in \alpha$. When coincidentally $\lambda_i = b_i$ we can define $\gamma_i$ arbitrarily. Note that the complementarity condition $\mu^\top P_\alpha z(\mathring x) = 0$ associated with Equation (A4) is automatically satisfied at $\mathring x$ for any $\mu$, since $P_\alpha \mathring z = 0$ by definition of the active index set $\alpha$. These observations yield immediately:

**Proposition A1** (Necessary and sufficient minimality condition)**.** *Assume LIKQ holds in that $P_\alpha \mathring Z$ has full row rank $m = |\alpha|$. Then the point $\mathring x$ is a local minimizer of $f$ if and only if we have tangential stationarity in that $a + \mathring Z^\top b$ belongs to the range of $\mathring Z^\top P_\alpha^\top$ and normal growth holds in that $|P_\alpha(\lambda - b)| \le P_\alpha \mathring L^\top \lambda$.*

The verification that LIKQ holds and subsequently the test whether tangential stationarity is satisfied can be based on a QR decomposition of the active Jacobian $P_\alpha \mathring Z \in \mathbb{R}^{m \times n}$. The main expense here is the calculation of $\mathring Z$ itself, which requires one forward substitution on $(I - M - L\mathring\Sigma)$ for each of the $n$ columns of $Z$ and hence at most $n s^2/2$ fused multiply adds. Very likely this effort will already be made by any kind of active set method for reaching the candidate point $\mathring x$. Once the multiplier vector $\lambda$ is obtained the remaining test (A7) for normal growth is almost for free so that we have a polynomial minimality criterion provided LIKQ holds. Otherwise one may assume a weaker generalization of the Mangasarian Fromovitz constraint qualification called MFKQ in [25]. However, we have conjectured in [19] that verifying MFKQ is probably already NP-hard.

**Corollary A1** (Descent direction in the nonoptimal case)**.** *Suppose that LIKQ holds. If tangential stationarity is violated there exists some direction $d \in \mathbb{R}^n$ such that $P_\alpha \mathring Z d = 0$ but $(a^\top + b^\top \mathring Z) d < 0$, which implies descent in that $f(\mathring x + \tau d) < f(\mathring x)$ for $\tau \gtrsim 0$. If tangential stationarity holds but normal growth fails there exists at least one $i \in \alpha$ with $|\lambda_i - b_i| > e_i^\top \mathring L^\top \lambda$. Defining $\gamma = \operatorname{sgn}(\lambda_i - b_i)\, e_i \in \mathbb{R}^{\tilde s}$, any $d$ satisfying $P_\alpha (I - \mathring L\Gamma)^{-1} \mathring Z d = P_\alpha \gamma$ is a descent direction.*

**Proof.** In the first case it is clear that $\mathring x + \tau d \in \mathcal{P}_{\mathring\sigma}$ for $\tau \gtrsim 0$ since the components of $z(\mathring x + \tau d)$ with indices in $\alpha$ stay zero and the others vary only slightly. Then the directional derivative of $f(\cdot)$ at $\mathring x$ in direction $\tau d$ is given by

$$
\tau a^\top d + \tau b^\top \mathring Z d = \tau (a^\top d + b^\top \mathring Z d) < 0 \,,
$$

which proves the first assertion. Otherwise, $\lambda$ is well defined and we can choose $i \in \alpha$ with $|\lambda_i - b_i| > e_i^\top \mathring L^\top \lambda$. Setting $\gamma = \gamma_i e_i$ with $\gamma_i = \operatorname{sgn}(\lambda_i - b_i)$, one obtains for $d$ with $P_\alpha (I - \mathring L\Gamma)^{-1} \mathring Z d = P_\alpha \gamma$ that $\mathring x + \tau d \in \mathcal{P}_{\mathring\sigma + \gamma}$ for $\tau \gtrsim 0$. On that polyhedron the Lagrange multiplier vector $\mu$ is also well defined by Equation (A6) but we have

$$\mu_i = e_i^\top \mathring L^\top \lambda - (\lambda_i - b_i)\gamma_i = e_i^\top \mathring L^\top \lambda - |\lambda_i - b_i| < 0 \,.$$

Then we get the directional derivative of $f(\cdot)$ at $\mathring x$ in direction $\tau d$

$$\begin{aligned} \tau a^\top d + \tau b^\top (I - \mathring L\Gamma)^{-1} \mathring Z d &= \tau\left(-\lambda^\top \mathring Z d + \lambda^\top \mathring Z d + \mu^\top P_\alpha \Gamma (I - \mathring L\Gamma)^{-1} \mathring Z d\right) \\ &= \tau \mu_i \gamma_i^2 < 0 \,, \end{aligned}$$

where we have used identity (A5). Hence we have again descent, which completes the proof.

**Corollary A2** (Optimality via Reflection)**.** *Suppose an $\mathring x$ where LIKQ holds has been reached by minimizing $\check f(x) + \hat g^\top x$ with $\hat g = \nabla \hat f_\sigma$ for $0 \notin \sigma \succ \mathring\sigma$. Then $\mathring x$ is a local minimizer of $f$ on $\mathbb{R}^n$ if and only if it is also a minimizer of $\check f(x) + \nabla \hat f_{\tilde\sigma}^\top x$ with $\tilde\sigma = \sigma \triangleright \mathring\sigma$ as defined in* (15)*.*

**Proof.** By assumption $\mathring x$ solves one of the branch problems of $f$ itself. Hence we must have tangential stationarity (A5) with the corresponding $\Gamma = \operatorname{diag}(\gamma)$ for $\gamma = \sigma - \mathring\sigma$. Since $\tilde\sigma - \mathring\sigma = -\gamma$ we conclude from (A6) that

$$(\lambda - b)^\top \Gamma P_\alpha^\top \le \lambda^\top \mathring L P_\alpha^\top \ge (\lambda - b)^\top (-\Gamma) P_\alpha^\top = -(\lambda - b)^\top \Gamma P_\alpha^\top$$

which implies that

$$\left|(\lambda - b)^{\top} P_\alpha^{\top}\right| \;=\; \left|(\lambda - b)^{\top} \Gamma P_\alpha^{\top}\right| \;\le\; \lambda^{\top} \mathring L P_\alpha^{\top} \,. \tag{A8}$$

Hence both tangential stationarity and normal growth are satisfied, which completes the proof by Proposition A1 as the converse implication is trivial.

The key conclusion is that if an $\mathring x$ is the solution of two complementary convex problems it must be locally optimal in the full dimensional space $\mathbb{R}^n$. Hence one can establish local optimality just using the preferred convex solver. If this test fails one naturally obtains descent to function values below $f(\mathring x)$ until eventually a local minimizer is found.

#### *Appendix A.1. Equivalence to DC Optimality Condition*


Using the explicit expressions given in Lemma 1 we find that (see [18])

$$\hat{c}^{L}f(\vec{x}) = \bigcup\_{0 \hookrightarrow \gamma^{\top}\vec{b}} \left\{ a^{\top} + b^{\top}(I - \vec{L}\Gamma)^{-1}\vec{Z} \right\}\,\,,\tag{A9}$$

where *γ* ranges over all complements of *σ*˚ such that *σ*˚ `*γ* P t´1, 1u*<sup>s</sup>* is definite. Similarly we obtain with

$$\bar{\mathfrak{b}}^{\top} = |\mathfrak{b}|^{\top} (I - |M| - 2|L|)^{-1} |L| \quad \text{as} \quad \mathbf{0} \in \mathbb{R}^{s}$$

the limiting differentials of the convex and the concave part as

$$\hat{c}^{L}\check{f}(\vec{x}) \quad = \bigcup\_{0=\gamma^{\top}\vec{b}} \left\{ a^{\top} + (b^{\top} + \tilde{b}^{\top}\hat{\Sigma} + \tilde{b}^{\top}\Gamma)(I - \hat{L}\Gamma)^{-1}\mathcal{Z} \right\}, \tag{A10}$$

$$\hat{c}^{L}\hat{f}(\vec{x}) \quad = \bigcup\_{0=\gamma^{\top}\delta} \left\{ a^{\top} + (b^{\top} - \vec{b}^{\top}\vec{\Sigma} - \vec{b}^{\top}\Gamma)(I - \hat{L}\Gamma)^{-1}\mathcal{Z} \right\}. \tag{A11}$$

Hence we have an explicit representation for the limiting gradients of *f* as well as its convex and concave part q*f* and p*f* at *x*˚. It is easy to see that the minimality condition (A5) requires *a* to be in the range of *<sup>Z</sup>*˚ <sup>J</sup> so that we have again *<sup>a</sup>*<sup>J</sup> " ´*λ*J*Z*˚ yielding

$$\partial^{L}\check{f}(\mathring{x}) \quad = \bigcup\_{0=\gamma^{\top}\mathring{\sigma}} \left\{ (b^{\top} - \lambda^{\top} + \lambda^{\top}\mathring{L}\Gamma + \tilde{b}^{\top}\mathring{\Sigma} + \tilde{b}^{\top}\Gamma)(I - \mathring{L}\Gamma)^{-1}\mathring{Z} \right\},\tag{A12}$$

$$\partial^{L}\hat{f}(\mathring{x}) \quad = \bigcup\_{0=\gamma^{\top}\mathring{\sigma}} \left\{ (b^{\top} - \lambda^{\top} + \lambda^{\top}\mathring{L}\Gamma - \tilde{b}^{\top}\mathring{\Sigma} - \tilde{b}^{\top}\Gamma)(I - \mathring{L}\Gamma)^{-1}\mathring{Z} \right\}. \tag{A13}$$

We had hoped to be able to derive directly from these expressions that normal growth implies the condition (39), but we have so far not been able to do so. However, we can indirectly derive the following equivalence.

**Corollary A3** (First order minimality condition)**.** *Under LIKQ the limiting differential* ∂<sup>*L*</sup>*f̂*(*x̊*) *is contained in the convex hull of* −∂<sup>*L*</sup>*f̌*(*x̊*) *if and only if tangential stationarity and normal growth condition hold according to Proposition A1.*

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **On the Use of Biased-Randomized Algorithms for Solving Non-Smooth Optimization Problems**

#### **Angel Alejandro Juan <sup>1,</sup>\*, Canan Gunes Corlu <sup>2</sup>, Rafael David Tordecilla <sup>3</sup>, Rocio de la Torre <sup>4</sup> and Albert Ferrer <sup>5</sup>**


Received: 13 December 2019; Accepted: 23 December 2019; Published: 25 December 2019

**Abstract:** Soft constraints are quite common in real-life applications. For example, in freight transportation, the fleet size can be enlarged by outsourcing part of the distribution service and some deliveries to customers can be postponed as well; in inventory management, it is possible to consider stock-outs generated by unexpected demands; and in manufacturing processes and project management, some deadlines frequently cannot be met due to delays in critical steps of the supply chain. However, capacity-, size-, and time-related limitations are included in many optimization problems as hard constraints, while it would usually be more realistic to consider them as soft ones, i.e., they can be violated to some extent by incurring a penalty cost. Most of the time, this penalty cost will be nonlinear and even discontinuous, which might transform the objective function into a non-smooth one. Despite their many practical applications, non-smooth optimization problems are quite challenging, especially when the underlying optimization problem is *NP-hard* in nature. In this paper, we propose the use of biased-randomized algorithms as an effective methodology to cope with *NP-hard* and non-smooth optimization problems in many practical applications. Biased-randomized algorithms extend constructive heuristics by introducing a nonuniform randomization pattern into them. Hence, they can be used to explore promising areas of the solution space without the limitations of gradient-based approaches, which assume the existence of smooth objective functions. Moreover, biased-randomized algorithms can be easily parallelized, thus requiring short computing times while exploring a large number of promising regions. This paper discusses these concepts in detail, reviews existing work in different application areas, and highlights current trends and open research lines.

**Keywords:** non-smooth optimization; biased-randomized algorithms; heuristics; soft constraints

#### **1. Introduction**

Optimization models are used in many practical situations to represent decision-making challenges in areas such as computational finance, transportation and logistics, telecommunication networks, smart cities, etc. [1]. Many of these challenges can be transformed into optimization problems (OPs) that can then be solved using a plethora of methods of both exact and approximate nature. Typically, solving an OP implies exploring a vast solution space while searching for one solution that minimizes or maximizes a given objective function. In addition, the solution has to satisfy a series of constraints in order to be a feasible one [2]. These OPs are frequently modeled by using linear programming (LP), integer programming (IP), or mixed integer linear programming (MILP) methods. Unfortunately, in many real-life situations, these OPs are also *NP-hard*, which implies that the computing time required to find an optimal solution grows extraordinarily fast as the size of the problem increases [3]. Hence, one has to make use of heuristic-based algorithms if reasonably good solutions are needed in short computing times for large-scale *NP-hard* OPs [4]. Moreover, the effective use of exact methods might also be limited whenever the mathematical model does not comply with desirable properties such as convexity or smoothness. In particular, the existence of non-convex and non-smooth objective functions might limit the efficiency of gradient-based optimization methods.

Bagirov and Yearwood [5] discuss a non-smooth OP called the minimum sum-of-squares clustering problem. According to the authors, previously employed approaches such as dynamic programming, branch-and-bound, or the *k*-means algorithm are efficient only for small instances of this problem. The authors also support the idea that the use of heuristic-based approaches becomes necessary for large-size instances. Similarly, Bagirov et al. [6] analyze another non-smooth OP related to the facility location problem in a wireless sensor network. Roy et al. [7] study non-smooth power-flow problems, while Lu et al. [8] propose an adaptive hybrid differential evolution algorithm to cope with a non-smooth version of the dynamic economic dispatch problem. To the best of our knowledge, however, there is a lack of publications considering realistic non-smooth cost functions in many OPs. Nevertheless, OPs with soft constraints might frequently appear in real-life applications. As discussed in Hashimoto et al. [9], "in real-world simulations, time windows and capacity constraints can be often violated to some extent". Hence, for example, in cost minimization problems, violating these soft constraints might generate penalty costs that have to be taken into account in the objective function. These penalty costs will typically come in the form of a piecewise cost function, which can transform the objective function into a non-smooth one.
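As a hedged illustration of how a penalty makes a cost non-smooth, consider a single delivery cost *g* as a function of the shipped quantity *q*. The symbols *q*, *Q* (soft capacity), *c* (unit cost), *F* (fixed surcharge), and *π* (unit penalty) are assumptions introduced only for this example and are not taken from the cited works:

$$g(q) = \begin{cases} c\,q & \text{if } q \le Q \\ c\,q + F + \pi\,(q - Q) & \text{otherwise} \end{cases}$$

With *F* = 0 and *π* > 0, the function is continuous but has a kink at *q* = *Q*: the slope is *c* just below *Q* and *c* + *π* just above it, so no gradient exists at that point. With *F* > 0, the function additionally jumps by *F* at *q* = *Q* and is therefore not even continuous there.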

This paper reviews different examples of OPs with non-smooth objective functions and then analyzes how biased-randomized algorithms (BRAs) can constitute an effective methodology to generate reasonably good solutions in very short computing times. As described in Ferone et al. [10], BRAs make use of skewed probability distributions to integrate a "biased" (nonuniform) random behavior into a heuristic. This allows one to quickly generate a large set of alternative good solutions by simply changing the seed of the pseudo-random number generator [11,12]. Hence, each execution of the BRA can be seen as an individual "agent" searching the solution space following the logic behind the heuristic but starting from a different point and following a different search pattern (Figure 1). Moreover, the execution of these BRA agents can be performed in parallel, thus consuming virtually the same time as the original heuristic (i.e., milliseconds in most cases).

**Figure 1.** Exploring the solution space using biased-randomized algorithms.

According to our previous experience with using BRAs to solve OPs in different application fields, these algorithms can be especially useful in cases where the solution space is highly irregular (non-convex and/or non-smooth) and requires an extensive exploration stage, thus reducing the effectiveness of traditional optimization methods. In fact, BRAs have already been proposed to solve non-smooth OPs in different application areas. For instance, they have been used to solve different rich and realistic variants of the well-known vehicle routing problem (VRP), including the two-dimensional VRP [13], VRP variants with horizontal cooperation [14], multi-agent versions of the VRP [15], the location routing problem [16], the fleet mixed VRP with backhauls [17,18], the multi-period VRP [19], and even other versions of the multi-depot VRP [20]. BRAs have also been employed in solving other OPs, such as the single-round divisible load scheduling [21], the stochastic flow-shop scheduling [22], scheduling heterogeneous multi-round systems [23], the minimization of open stacks problem [24], the dynamic home service routing [25], waste collection management [26], or the maximum quasi-clique problem [27].

Accordingly, the main contributions of this paper are as follows: (i) a discussion on the importance of considering non-smooth objective functions in realistic combinatorial OPs, mainly due to the existence of soft constraints which might be violated to some extent by incurring non-smooth penalty costs, and (ii) a discussion on how BRAs can be employed in different applications to solve these non-smooth OPs in short computing times. The remainder of the paper is structured as follows: Section 2 reviews some basic concepts related to non-smooth OPs. Section 3 presents a review of recent works on BRAs. Sections 4–6 review applications of BRAs to non-smooth OPs in logistics, transportation, and scheduling, respectively. Section 7 provides an overview on current trends and open research lines. Finally, Section 8 concludes by highlighting the main contributions of this work.

#### **2. Non-Smooth Optimization Problems**

OPs can be broadly classified as convex or non-convex. Convex OPs are usually characterized by a convex objective function and a set of constraints that form a convex region. Each constraint restricts the solution space to a convex region, and the intersection of these regions, which forms the set of feasible solutions, is also convex. The main feature that makes convex OPs easy to work with is that any local optimum is also a global optimum. This significantly reduces the computational time, yielding exact solutions in reasonable times. Therefore, if doable, it is of interest to convert any optimization problem into a convex OP. Despite the specific structure needed for a convex OP, we find several applications of it in real-life problems. For example, those problems that can be modeled as a linear programming model are convex problems because all linear functions are convex [28]. Nevertheless, there are some other problems that cannot be modeled as a convex OP. Non-convex problems have either non-convex objective functions or non-convex feasible regions (or both). This brings several challenges when solving these problems. The main challenge is that the solution methods employed for convex OPs cannot be directly applied to non-convex ones because of the presence of many disjoint regions in the solution space, each of which usually has its own local optimum. Therefore, it is easy for the algorithm to get trapped in one of these local optima, which may indeed be far away from the global optimum. Also, it is usually time-consuming (or even impossible) to demonstrate that the algorithm has reached the global optimum or whether a feasible solution can be obtained.

Another way of classifying an OP is by whether it is a smooth or a non-smooth one. Smooth optimization problems have smooth objective functions and constraints. A smooth function is differentiable with continuous derivatives (in the strictest sense, derivatives of all orders). On the contrary, a non-smooth problem has an objective function, or at least one constraint, that lacks at least one of these properties. Figure 2 shows an example of a one-dimensional function which is neither smooth nor convex.

**Figure 2.** Example of a non-convex and non-smooth piecewise objective function.

From a combinatorial point of view, non-smooth OPs possess properties similar to those of non-convex OPs because they are time consuming to solve. The lack of derivative information makes it almost impossible to determine the direction in which the function is increasing or decreasing. Likewise, the solution space may also have several disjoint regions, each of which has its own local optimum. Unfortunately, non-convex and non-smooth OPs arise in several application domains, including telecommunication networks, economic load dispatch, portfolio optimization, vehicle routing, regression, or clustering problems. For instance, the minimum sum-of-squares clustering problem is solved by Bagirov et al. [29] and by Karmitsa et al. [30]. Both papers formulate the clustering problem as a non-smooth and non-convex optimization problem and make use of incremental algorithms. However, the former is based on the difference of convex functions and the latter is based on the limited memory bundle method. Real-world data sets are used to test both approaches, demonstrating numerically their efficiency compared to other incremental algorithms. Differences of convex functions are also used by Bagirov et al. [31] to solve the nonparametric regression estimation problem. These authors propose an algorithm to minimize a non-convex and non-smooth empirical *L*2-risk function. Synthetic and real-world data sets are used to test it. Compared to other algorithms, this approach is shown to be a good alternative in terms of computational time and several prediction indicators.

Several studies have investigated the applicability of well-known metaheuristic approaches—such as tabu search, artificial bee colony optimization, or particle swarm optimization—to solve non-smooth and non-convex OPs [32]. For example, tabu search has been used in Al-Sultan [33] for the clustering problem and in Oonsivilai et al. [34] for a telecommunication network problem. Ant colony optimization has been used to solve the non-smooth economic load dispatch problem in Hemamalini and Simon [35], while particle swarm optimization has been investigated for the same problem in Niknam et al. [36] and Basu [37]. Both ant colony optimization and particle swarm optimization have been utilized for the non-smooth portfolio selection problem in Schlüter et al. [38] and Corazza et al. [39], respectively. The remainder of this paper discusses the use of BRAs in solving non-smooth optimization problems in logistics, transportation, and scheduling.

#### **3. Basic Concepts on Biased-Randomized Algorithms**

Pure greedy constructive heuristics are algorithms that iteratively build a solution by selecting the next movement from a list of candidates. Such candidates have been sorted previously according to some criteria, such as costs, savings, profits, etc. These heuristics typically select the "most promising" (in the short run) candidate from the list. Since they follow a constructive logic, a good final solution is expected by the end of the procedure. Nevertheless, these algorithms are deterministic, i.e., the solution is the same every time the heuristic is executed. This means that the exploration process is poor, which prevents the algorithm from finding better solutions unless more complex searching structures (i.e., local searches and perturbation movements) are considered by investing more computing time. Examples of such heuristics are the well-known savings heuristic for the VRP [40], the nearest neighbor criterion for the traveling salesman problem [41], or the shortest processing time dispatching rule for some scheduling problems [42].

As described in Juan et al. [43], using a skewed (nonuniform) probability distribution to introduce a biased-randomization behavior into the process that selects the candidates from the sorted list is an efficient way of generating better solutions. The idea is to assign a weighted probability to each candidate in the list, in such a way that the more promising candidates—those at the top of the list—receive a higher probability of being selected than those below them. This randomization process leads to the generation of slightly different solutions every time the algorithm is executed. Hence, multiple executions of a BRA—either completed in a sequential or in a parallel mode—will yield a set of alternative solutions, all of them based on the logic behind the heuristic. Since we are executing many biased-random variations of the constructive procedure defined by the heuristic, chances are that some of these "near-greedy" heuristics lead to solutions that outperform the one generated by the greedy heuristic [10]. Algorithm 1 shows a pseudo-code description of a basic BRA that performs in a sequential way.
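The multi-start idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not a reproduction of the authors' Algorithm 1: the candidate list is sorted once by a user-supplied `score`, a solution is simply an ordering of the candidates, and the names `geometric_index`, `biased_randomized_construction`, `multi_start_bra`, and the parameter `beta` are hypothetical.

```python
import math
import random


def geometric_index(n, beta, rng):
    """Pick a position in 0..n-1, favoring small indices (top of the sorted list)."""
    u = 1.0 - rng.random()                      # u lies in (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(1.0 - beta)) % n


def biased_randomized_construction(candidates, score, beta, rng):
    """Build one solution by repeatedly drawing from the score-sorted candidate list."""
    remaining = sorted(candidates, key=score)   # most promising (lowest score) first
    solution = []
    while remaining:
        solution.append(remaining.pop(geometric_index(len(remaining), beta, rng)))
    return solution


def multi_start_bra(candidates, score, cost, beta=0.3, runs=100, seed=0):
    """Sequential multi-start: keep the best of `runs` biased-randomized solutions."""
    rng = random.Random(seed)
    best = sorted(candidates, key=score)        # pure greedy solution as the baseline
    best_cost = cost(best)
    for _ in range(runs):
        sol = biased_randomized_construction(candidates, score, beta, rng)
        c = cost(sol)
        if c < best_cost:                       # keep only strict improvements
            best, best_cost = sol, c
    return best, best_cost
```

Because the greedy solution is used as the starting incumbent, the returned cost can never be worse than that of the underlying heuristic; each run could equally be dispatched to a separate worker, since the runs do not share state.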


Notice that, by using this approach, a broad exploration of the solution space is carried out, which might be especially beneficial in the case of highly irregular objective functions such as the ones characterizing non-smooth OPs. The proposed methodology can be seen as a natural extension of the basic greedy randomized adaptive search procedure (GRASP) [44], as analyzed in Ferone et al. [10]. Instead of employing empirical probability distributions (which require time-consuming parameter fine tuning and thus might slow down computations), a theoretical probability distribution such as the geometric distribution or the decreasing triangular distribution can be used. Random variates from these theoretical distributions can be quickly generated by employing analytical expressions. Moreover, they tend to have fewer parameters and these are typically easy to set. Application fields such as food logistics [45], flow-shop scheduling [46], or mobile cloud computing [47] have successfully utilized geometric distributions to introduce biased-randomized processes during the selection of the candidates that are employed to construct a feasible solution. Figure 3 illustrates how geometric probability distributions with four different parameter values (*p* ∈ {0.1, 0.3, 0.6, 0.9}) will have a different behavior while assigning probabilities of being selected to the elements of the sorted list during the iterative construction of a biased-randomized solution.

**Figure 3.** Biased-random sampling of elements from a list using a geometric distribution.

Thus, while for *p* = 0.1 the distribution is closer to a uniform one (i.e., the probabilities are distributed among a relatively large number of top positions in the sorted list), for *p* = 0.9, the behavior is closer to the greedy one that characterizes a classical heuristic, with the top element in the sorted list accumulating most of the chances of being the next selected element. Both extremes (*p* → 0 and *p* → 1) represent diversification and greediness, respectively. Usually, parameter values in the middle of both extremes are able to provide a better trade-off between these two cases, thus promoting some degree of diversification without losing the rational (domain-specific) criterion employed to sort the list.
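The effect of *p* on the selection probabilities can be checked numerically. The sketch below assumes that the geometric distribution is truncated to the length of the list and renormalized, which is one common way to apply it to a finite list; the function name `geometric_selection_probs` and the list length of 10 are illustrative choices, not details taken from the cited works.

```python
def geometric_selection_probs(n, p):
    """Probability of picking position k (0 = top) in a sorted list of n elements,
    using a geometric distribution truncated to the list and renormalized."""
    weights = [p * (1.0 - p) ** k for k in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


# Compare how much probability mass the top positions receive for each p.
for p in (0.1, 0.3, 0.6, 0.9):
    probs = geometric_selection_probs(10, p)
    print(f"p={p}: top element {probs[0]:.2f}, top three {sum(probs[:3]):.2f}")
```

For small *p* the mass is spread over many positions, while for *p* close to 1 almost all of it falls on the first element, which recovers the greedy choice.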

#### **4. Applications in Logistics**

The field of logistics encompasses several problems, including supply chain design, facility location, warehouse management, etc. All of these problems have been studied extensively in the literature, mostly with the consideration of hard constraints and smooth objective functions. Nevertheless, as previously discussed, real-world problems in the field of logistics may allow some constraints to be violated by incurring a penalty cost, which needs to be incorporated into the objective function. This typically leads to the emergence of non-smooth objective functions. Therefore, traditional exact methods cannot always be efficiently employed to solve these problems and heuristic-based algorithms are required. This section focuses on the use of BRAs in solving the facility location problem (FLP) [48] and its variants. This problem consists of locating a set of facilities—e.g., production plants, distribution centers, warehouses, etc.—from which a set of customers must be served. Basic decisions are as follows: (i) which potential facilities must be open (or remain open) and which ones must be closed (or not open) and (ii) how to allocate customers to open facilities. This problem is *NP-hard* [49]. Moreover, facilities can be considered capacitated or uncapacitated. The former refers to the case in which each facility has a limited capacity that cannot be exceeded by the total demand served from there. In the latter, the facilities' total capacity is virtually infinite or at least much greater than the cumulative demand of all customers.

BRAs have been applied successfully to solve both capacitated and uncapacitated FLPs. The latter has been tackled mainly considering hard constraints [50]. Correia and Melo [51] considered a multi-period FLP in which customers are sensitive to delivery lead times (i.e., some flexibility is allowed regarding the delivery dates). Using similar concepts, Estrada-Moreno et al. [52] consider soft constraints and a non-smooth and non-convex objective function for the single-source capacitated FLP. In this context, "single-source" refers to an additional constraint stating that each customer must be served from just one facility. Since the capacity constraints are treated as soft, the capacity of each facility may be exceeded. In the real world, decision-makers manage this by using strategies such as storing safety stocks, performing emergency deliveries, and outsourcing part of the customers' service. These strategies tend to generate additional costs that need to be considered as well during the optimization process. The aforementioned authors propose the following model to represent the single-source FLP with soft capacity constraints:

$$\text{Minimize } \sum\_{j \in J} f\_j^\* y\_j + \sum\_{(i,j) \in I \times J} c\_{ij} x\_{ij}$$

subject to:

$$\begin{aligned} \sum\_{j \in J} x\_{ij} &= 1 & \forall \ i \in I \\ x\_{ij} &\le y\_j & \forall \ (i, j) \in I \times J \\ y\_j(1 - y\_j) &= 0 & \forall \ j \in J \\ x\_{ij}(1 - x\_{ij}) &= 0 & \forall \ (i, j) \in I \times J \\ y\_j &\in \mathbb{R} & \forall \ j \in J \\ x\_{ij} &\in \mathbb{R} & \forall \ (i, j) \in I \times J \end{aligned}$$

In this model, *x<sub>ij</sub>* is a binary variable that takes the value 1 if customer *i* is served by facility *j* (0 otherwise). Similarly, *y<sub>j</sub>* is another binary variable that takes the value 1 if facility *j* is open; *c<sub>ij</sub>* is the service cost of assigning customer *i* to facility *j*; and *f*<sup>∗</sup><sub>*j*</sub> is a piecewise function representing the cost of opening facility *j*:

$$f\_{j}^{\*} = \begin{cases} f\_{j} & \text{if } \sum\_{i \in I} d\_{i} x\_{ij} \le s\_{j} y\_{j}\\ f\_{j} + \lambda \left(d\_{j}^{\*}, s\_{j}\right) & \text{otherwise} \end{cases}$$

where *d<sub>i</sub>* > 0 is the demand of customer *i*; *s<sub>j</sub>* ≥ max{*d<sub>i</sub>*} is the nominal capacity of facility *j*; *d*<sup>∗</sup><sub>*j*</sub> = ∑<sub>*i*∈*I*</sub> *d<sub>i</sub>x<sub>ij</sub>* is defined for any *j* ∈ *J*; and *λ*(*d*<sup>∗</sup><sub>*j*</sub>, *s<sub>j</sub>*) is a non-smooth function which will be applied whenever the total demand assigned to facility *j* exceeds its maximum capacity *s<sub>j</sub>*.
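To make the piecewise opening cost concrete, the following Python sketch evaluates it for a single open facility. The specific penalty used here (a fixed surcharge plus a per-unit charge on the excess demand) is a hypothetical choice made only for illustration, since the text does not fix the form of *λ*; the function and parameter names are likewise assumptions.

```python
def opening_cost(fixed_cost, capacity, assigned_demands, penalty):
    """Piecewise opening cost of one open facility under a soft capacity constraint."""
    load = sum(assigned_demands)                  # total demand assigned to the facility
    if load <= capacity:                          # capacity respected: plain fixed cost
        return fixed_cost
    return fixed_cost + penalty(load, capacity)   # capacity exceeded: add the penalty


def example_penalty(load, capacity, setup=50.0, unit=4.0):
    """Hypothetical penalty: a fixed surcharge plus a per-unit charge on the excess."""
    return setup + unit * (load - capacity)
```

For instance, with a fixed cost of 100 and a capacity of 30, demands of 10 and 20 leave the cost at 100, whereas demands of 10 and 25 raise it to 100 + 50 + 4 · 5 = 170. The jump of 50 at the capacity limit is what makes this particular objective discontinuous.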

A BRA is integrated within an iterated local search metaheuristic to solve the OP above. The algorithm contains the following components: (i) an initial solution generation, based on a BRA; (ii) a local search procedure composed of functions that open or close facilities; (iii) a perturbation procedure that destroys the current solution by opening a number of closed facilities and reallocating all customers to the newly open facilities; and (iv) an acceptance criterion based on the concept of "credit", in which a solution with a worse cost is accepted if this credit is not exceeded. The aim of this criterion is to explore other regions of the solution space in order to escape from local optima. A total of 60 small-, medium-, and large-scale instances are used to test this approach. Different levels of penalties are also tested. The authors demonstrate the advantages of using soft constraints, obtaining costs that are lower than the optimal ones found in the literature for hard constraints. They show that, if penalty costs are low or moderate, violating the capacity constraints is worthwhile because some facilities do not need to be open and a more efficient allocation of customers can be made. Finally, a comparison between their BRA-inspired metaheuristic and the solution provided by the commercial tool *LocalSolver* is drawn. Both approaches obtain similar solutions for small- and medium-scale instances, but the BRA-based algorithm proves to be superior for large-scale instances.

#### **5. Applications in Transportation**

Transportation and distribution are two fields belonging to the operational level of decisions in logistics. The vehicle routing problem [53,54] and the arc-routing problem (ARP) [55,56] are two well-known optimization problems in the area of transportation and freight distribution. A traditional VRP consists of a graph formed by a set of nodes and a set of arcs. One of the nodes represents a depot and the rest represent customers, which are connected to each other by the set of arcs. A network of routes must be designed to visit each customer in order to meet a known demand. A single vehicle departing from and returning to the depot is assigned to each route. The objective is to minimize the total cost of traversing the arcs. The traditional ARP is similar to the VRP, but the former assigns a demand to each arc and not to each node. Moreover, in the ARP, the underlying graph is not usually a complete one. These problems are *NP-hard*, and the number of alternative solutions grows extremely fast as the number of customers increases. Therefore, heuristic and metaheuristic algorithms have been employed extensively to solve these OPs. The problems become even more difficult to solve in the presence of non-smooth and non-convex objective functions, as happens when soft constraints are considered. In this context, soft constraints would include time window constraints or capacity constraints that are allowed to be violated to some extent [9]. These soft constraints also allow decision-makers to consider more realistic models that take into account different management strategies and policies. For instance, customers would accept a delayed delivery if the supplier offers a discount. Likewise, a percentage of the deliveries can be outsourced if in-house capacity is exceeded.

In the VRP case, Juan et al. [43] consider a capacitated version of the problem with a non-smooth and non-convex objective function and soft constraints. These authors propose a BRA-based approach called *MIRHA*. This is a multi-start procedure consisting of two phases: a first phase in which a biased-randomized version of a constructive heuristic is designed according to a geometric probability distribution and a second (improvement) phase in which an adaptive local search procedure is implemented. Several instances from the literature are used to test the proposed algorithm and to compare it with a traditional GRASP. In general, the new algorithm outperforms the existing ones in terms of solution quality, both in the presence of hard and soft constraints. In the case of the ARP, De Armas et al. [57] propose a BRA to solve the capacitated version of the problem with a non-smooth and non-convex objective function. The base heuristic considered is *SHARP* [58]. Firstly, they propose the following model:

$$\text{Minimize } \sum\_{\rho \in \mathcal{S}} c\_{\rho}^{\*}$$

subject to:

$$\begin{aligned} \mathcal{S} &\in \text{CSR} \\ \sum\_{(i,j)\in\rho} q\_{ij} \mathbf{x}\_{ij}^k &\leq \mathcal{Q} \qquad \forall \ \rho \in \mathcal{S}, \forall \ k \in T \\ \mathbf{x}\_{ij}^k (1 - \mathbf{x}\_{ij}^k) &= 0 \qquad \forall \ (i,j) \in \rho, \ \forall \ \rho \in \mathcal{S}, \forall \ k \in T \end{aligned}$$

where *ρ* represents a route in a set of routes 𝒮; *CSR* represents a complete set of routes (i.e., a solution); *c*<sup>∗</sup><sub>*ρ*</sub> is the total cost of using route *ρ*; *q<sub>ij</sub>* is the demand of arc (*i*, *j*); and *x*<sup>*k*</sup><sub>*ij*</sub> is a binary variable that takes the value 1 if and only if the arc (*i*, *j*) is covered by a vehicle *k* in the set of vehicles *T*. The cost function associated with any route *ρ* is defined as a piecewise function as follows:

$$c^\*\_{\rho} = \begin{cases} c\_{\rho} & \text{if } c\_{\rho} \le C \\ c\_{\rho} + \lambda \left( c\_{\rho}, C \right) & \text{otherwise} \end{cases} \tag{1}$$

In the former expression, *λ*(*c<sub>ρ</sub>*, *C*) is a non-smooth function which will be applied whenever the actual route cost exceeds the threshold value allowed for any route, *C*. In this work, the associated penalty factor is linearized to obtain a truncated version of the problem. Then, the authors propose a BRA combined with an iterated local search metaheuristic. This hybrid algorithm is divided into four main phases: (i) an initial solution generation using a biased-randomized version of the *SHARP* heuristic; (ii) a perturbation procedure based on destruction-reconstruction strategies; (iii) a local search phase using cache memory; and (iv) the use of an acceptance criterion based on a simulated annealing procedure. A total of 87 artificial and real-world instances are used to test this approach. The mathematical model is solved using the *CPLEX* commercial tool and is used to obtain lower bounds for optimal costs. Thus, the performance of the metaheuristic is assessed, obtaining important average gap reductions with respect to previous methods.
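An acceptance criterion of the simulated annealing type can be illustrated with the classical Metropolis rule. The sketch below is a generic version under our own assumptions and is not claimed to match the exact rule or cooling schedule used in the cited work; the function name `accept` and its parameters are hypothetical.

```python
import math
import random


def accept(current_cost, candidate_cost, temperature, rng=random):
    """Metropolis-style rule: always accept improvements, sometimes accept worse moves."""
    if candidate_cost <= current_cost:
        return True                               # never reject a non-worsening solution
    if temperature <= 0.0:
        return False                              # zero temperature: pure descent
    worsening = candidate_cost - current_cost
    return rng.random() < math.exp(-worsening / temperature)
```

At high temperature almost any candidate is accepted, which favors exploration; as the temperature is lowered, the rule behaves more and more like a strict descent. This is what allows the search to leave a local minimum that a pure local search would be stuck in.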

The results obtained in the previous studies demonstrate clear advantages of considering soft constraints over hard ones. Moreover, the associated models are a better representation of the real-world problems. For example, budget limits established per route can be violated in the model as is done in real life. The soft constraints are not free, but they can enhance profitability. For instance, penalization costs must be incurred for violating budget limits. However, this violation can lead to a better route design, which generates savings for the company. In the end, it is worth exploring whether the value of the savings compensates for the penalties incurred. Finally, the consideration of soft constraints implies the construction of a more generic model, which includes a combination of soft and hard constraints. A more generic model yields more alternative solutions, and therefore, decision makers have more options to design a routing or distribution plan that better fits their utility function.

#### **6. Applications in Scheduling**

In the operations research field, scheduling OPs are among the most studied topics. According to Pinedo [59], "scheduling is a decision-making process [...] that deals with the allocation of resources to tasks over given time periods and its goal is to optimize one or more objectives." A typical example considers that the resources are machines, that the tasks are operations carried out by these machines, and that the objective is the minimization of the *makespan*—i.e., the completion time of the last task. This apparently simple definition actually includes a huge family of problems that are *NP-hard* as well.

BRAs have also proved useful for scheduling problems. For instance, Martin et al. [15] used them in a multi-agent based framework to solve both routing and scheduling problems. Usually, hard constraints are considered in the literature. However, Ferrer et al. [60] solved the permutation flow-shop problem (PFSP) with a non-smooth objective function. This problem consists of a set of jobs that must be processed by a set of machines. Each job is composed of a set of operations, the number of which is equal to the number of machines. Moreover, all operations in each job must be executed in the same sequence by the set of machines. The processing time of each operation on each machine is known, although it differs from job to job. The goal is to determine the sequence in which jobs must be executed in order to minimize the makespan. One of the contributions of the aforementioned authors is the consideration of a failure-risk term in the objective function. This term is added to the traditional makespan target. The failure-risk cost is incurred when a machine operates continuously without a break, which is very common when minimizing the makespan. This cost is equivalent to a penalty cost in logistics and transportation problems, and therefore, the failure-risk cost introduces a non-smooth component into the objective function. A mathematical model is proposed to tackle this problem. Then, a BRA is combined with an iterated local search to develop the solving approach. Its basic steps are as follows: (i) generation of an initial solution through a biased-randomized version of a classical heuristic; (ii) perturbation and local search procedures to improve the solution quality; and (iii) an acceptance criterion, based on simulated annealing, which may accept worse solutions in order to explore the solution space and escape from local minima. A total of 120 benchmark instances are used to test this approach. Results show that explicit consideration of the failure-risk cost leads to a reduction of the total cost for all sets of instances (negative gaps are shown). The makespan cost increases, but the reduction in failure-risk costs compensates for this rise. A comparison with other metaheuristic approaches shows that the performance of the new method is similar or even better. This novel approach proves to be useful for decision makers since they have more solution alternatives to choose from, given the particular goals of each company, i.e., for some decision makers, the makespan may be more relevant, but for others, the failure-risk cost may be a more important indicator.
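The PFSP makespan and the simulated-annealing-style acceptance rule mentioned in step (iii) can be illustrated with a short Python sketch. It is a minimal illustration and not the implementation of Ferrer et al. [60]: the function names, the data layout `proc[job][machine]`, and the Metropolis-type acceptance formula are assumptions, and the failure-risk term is omitted because its exact form is not given here.

```python
import math
import random


def makespan(sequence, proc):
    """Completion time of the last job in a permutation flow shop.

    proc[job][machine] is the processing time of `job` on `machine`;
    every job visits machines 0, 1, ..., m-1 in that order.
    """
    m = len(proc[0])
    finish = [0] * m  # finish[k]: time at which machine k becomes free
    for job in sequence:
        ready = 0  # time at which the job leaves the previous machine
        for k in range(m):
            start = max(finish[k], ready)
            finish[k] = start + proc[job][k]
            ready = finish[k]
    return finish[-1]


def accept(candidate_cost, current_cost, temperature, rng):
    """Metropolis-type rule (assumed): always accept improvements, and
    accept a worse solution with probability exp(-delta / temperature)."""
    if candidate_cost <= current_cost:
        return True
    delta = candidate_cost - current_cost
    return rng.random() < math.exp(-delta / temperature)


proc = [[3, 2], [1, 4]]          # two jobs, two machines (hypothetical data)
print(makespan([0, 1], proc))    # job 0 scheduled first
print(makespan([1, 0], proc))    # job 1 scheduled first
```

In this toy instance the two job orders give different makespans, which is exactly the quantity the sequencing decision tries to reduce.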

#### **7. General Insights from Previous Numerical Experiments**

Based on the data provided by Juan et al. [43] for the non-smooth VRP, De Armas et al. [57] for the non-smooth ARP, Ferrer et al. [60] for the non-smooth permutation FSP, and Estrada-Moreno et al. [52] for the non-smooth FLP, Figure 4 shows percentage gaps between the corresponding BRA and the reference value employed in the corresponding work.

**Figure 4.** Percentage gaps between BRAs and reference values in non-smooth optimization problems (OPs).

Caution is needed when interpreting this figure, since these gaps depend on the particular OP being considered, the specific instances, the selected reference value, etc. However, some insights can be obtained: (i) in all four OPs, BRAs have been able to obtain negative gaps with respect to the reference values, which in several cases represent the best-known solutions for the hard-constrained version of the problem; (ii) whenever soft constraints can be considered in a real-life scenario, it might pay off to design solutions that violate these constraints to some extent, since the associated benefits might outweigh the corresponding penalties; (iii) since they combine good diversification (exploration) and heuristic-based rational searching of the solution space, BRAs constitute an effective tool to cope with non-smooth OPs with highly irregular objective functions; and (iv) in some particular cases, considering soft constraints instead of hard ones might generate noticeable improvements in the quality of the solution; hence, modeling and solving experts should always consider how hard a constraint really is in practice. Finally, it should also be noted that, in all four studies and regardless of the specific OP being solved and the application field, the authors illustrated with numerical examples the limitations of exact methods when solving non-smooth OPs. Despite this general conclusion, they also recommended investigating combinations of exact methods (from global optimization and mathematical programming) and heuristic-based algorithms. The latter can play a more exploratory role and identify promising areas, while the former can intensify the search inside these selected areas.

#### **8. Conclusions**

Many real-world problems can be more accurately modeled using soft constraints rather than hard ones. Soft constraints can be violated to some extent, and whenever this occurs, a penalty cost, usually defined via a piecewise function of the magnitude of the violation, has to be taken into account. Hence, non-smooth and non-convex optimization models are highly relevant in many practical applications. In general, decision makers may consider some constraints to be soft, specifically when the associated capacity limitations can be outsourced or certain delays in the service can be managed with the customer. This paper has reviewed several works in this area. These works refer to the consideration of non-smooth objective functions in popular OPs such as the vehicle routing problem, the arc routing problem, the facility location problem, and the permutation flow-shop problem. In all these cases, the use of BRAs has proven to be an effective tool to generate a myriad of high-quality solutions in short computing times, even for large-size instances of these *NP-hard* OPs. Also, these BRAs have outperformed other classical optimization methods from the areas of mathematical programming, global optimization, and even metaheuristics.
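To illustrate why a soft constraint makes the objective non-smooth, the sketch below models a capacity limit that may be exceeded at a price. The penalty shape (a fixed charge plus a linear term) and all numbers are assumptions chosen for illustration, not values from the reviewed papers.

```python
def penalty(violation, fixed_charge=50.0, unit_cost=4.0):
    """Piecewise penalty: zero while the constraint holds, then a jump
    (fixed_charge) plus a linear term in the size of the violation."""
    if violation <= 0:
        return 0.0
    return fixed_charge + unit_cost * violation


def total_cost(base_cost, load, capacity):
    """Base cost plus the penalty for exceeding a soft capacity limit."""
    return base_cost + penalty(load - capacity)


# Hypothetical example: the cost jumps as soon as the load exceeds 100 units.
for load in (90, 100, 101, 110):
    print(load, total_cost(200.0, load, capacity=100))
```

The jump at the capacity limit is what breaks differentiability (and here even continuity) of the objective, so gradient-based reasoning around that point is no longer valid.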

The reviewed papers demonstrate that using BRAs enhances the exploration of the solution space by iteratively generating alternative solutions. The execution of these algorithms can be easily parallelized by simply changing the seed of the pseudo-random number generator, which means that, in many cases, high-quality, near-optimal solutions can frequently be obtained in real time (less than a second). Part of the effectiveness of these algorithms lies in the fact that they preserve the logic of a good constructive heuristic while, at the same time, offering a much larger exploration capability of the solution space. The paper has also discussed how these BRAs can be hybridized with classical metaheuristic frameworks (e.g., iterated local search or simulated annealing) in order to intensify the search if more computing time is allowed.
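The two ideas in this paragraph, keeping the greedy logic of a constructive heuristic while randomizing it, and running independent replications that differ only in the seed, can be sketched as follows. The geometric distribution, the parameter name `beta`, and the toy cost function are assumptions made for illustration; the reviewed papers may use other skewed distributions and problem-specific heuristics.

```python
import math
import random


def biased_index(n, beta, rng):
    """Index in [0, n) from a truncated geometric law: index 0 (the greedy
    choice) is the most likely one, larger indices are less likely."""
    u = 1.0 - rng.random()  # u lies in (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(1.0 - beta)) % n


def biased_order(sorted_items, beta, rng):
    """Build a sequence by repeatedly drawing from the remaining sorted items."""
    pool = list(sorted_items)
    order = []
    while pool:
        order.append(pool.pop(biased_index(len(pool), beta, rng)))
    return order


def best_over_seeds(sorted_items, cost, seeds, beta=0.3):
    """Independent replications that differ only in the seed; the runs share
    no state, so each one could be executed on its own core."""
    runs = (biased_order(sorted_items, beta, random.Random(s)) for s in seeds)
    return min(runs, key=cost)


# Toy cost (hypothetical): position-weighted sum of the items.
items = [1, 2, 3, 4, 5]
cost = lambda order: sum(pos * value for pos, value in enumerate(order))
print(best_over_seeds(items, cost, seeds=range(20)))
```

Because every replication is fully determined by its seed, results are reproducible and the replications can be distributed without any communication between them.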

Several research lines can be explored for future work: (i) the hybridization of the BRAs with the *ECAM* global optimization algorithm [61], so that the former can provide different starting points (exploration) that the latter can use to intensify the search in promising regions; (ii) other optimization problems can be considered as well, especially in application fields such as smart cities, e-commerce, computational finance, or bioinformatics; and (iii) considering even more realistic versions of the optimization problems by adding stochastic and dynamic conditions into them, for which hybridization of BRAs with simulation and machine learning techniques might be necessary.

**Author Contributions:** Conceptualization, A.A.J.; methodology, C.G.C., R.D.T., and R.d.l.T.; writing—original draft preparation, A.A.J., C.G.C., R.D.T., A.F., and R.d.l.T.; writing—review and editing, A.A.J., C.G.C., R.D.T., A.F., and R.d.l.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partially funded by the IoF2020 European project, AGAUR (2018-LLAV-00017), the Erasmus+ program (2018-1-ES01-KA103-049767), and the Spanish Ministry of Science, Innovation, and Universities (RED2018-102642-T).

**Acknowledgments:** We thank Dr. Napsu Karmitsa and Dr. Sona Taheri for inviting us to participate in this special issue in honor of Prof. Dr. Adil M. Bagirov.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Planning the Schedule for the Disposal of the Spent Nuclear Fuel with Interactive Multiobjective Optimization**

#### **Outi Montonen <sup>1,</sup>\*, Timo Ranta <sup>1,2</sup> and Marko M. Mäkelä <sup>1</sup>**


Received: 24 October 2019; Accepted: 22 November 2019; Published: 25 November 2019

**Abstract:** Several countries utilize nuclear power and face the problem of what to do with the spent nuclear fuel. One possibility, which is the subject of this paper, is to dispose of the fuel assemblies in a disposal facility. Before the assemblies can be disposed of, they must be cooled in the interim storage until their decay heat power has decreased sufficiently. Next, they are loaded into canisters in the encapsulation facility, and finally, the canisters are placed in the disposal facility. In this paper, we model this process as a nonsmooth multiobjective mixed-integer nonlinear optimization problem with the minimization of nine objectives: the maximum number of assemblies in the storage, maximum storage time, average storage time, total number of canisters, end time of the encapsulation, operation time of the encapsulation facility, the lengths of disposal and central tunnels, and total costs. As a result, we obtain the disposal schedule, i.e., the number of canisters disposed of in each period. We introduce an interactive multiobjective optimization method using the two-slope parameterized achievement scalarizing functions, which enables us to systematically obtain several different Pareto optimal solutions from the same preference information. Finally, a case study adapting the disposal in Finland is given. The results obtained are analyzed in terms of the objective values and disposal schedules.

**Keywords:** achievement scalarizing functions; interactive method; multiobjective optimization; nonsmooth optimization; spent nuclear fuel disposal

#### **1. Introduction**

The disposal of the spent nuclear fuel is a challenging task where careful planning and optimization of processes definitely pay dividends. The difficulty of the decision making is also increased by the fact that the disposal continues far into the future and many parameters are still unknown. Indeed, the decisions made now have long-term consequences. Thus, it is only reasonable to investigate different scenarios from different perspectives by utilizing multiobjective optimization.

The disposal is a topical issue since many of the countries utilizing nuclear power have not yet disposed of any spent nuclear fuel. Nevertheless, all of them have to do something about it sooner or later. Long-term storage in interim storage is not considered a safe or ethical solution [1]. At the same time, geological disposal is stated to be widely accepted as a safe method [1]. Finland is going to be one of the first countries to dispose of the spent nuclear fuel by starting the disposal in the 2020s [2].

The aim of geological disposal is to isolate the spent nuclear fuel in the bedrock such that its impact on the environment does not exceed that of the regular background radiation. First, the fuel assemblies are removed from the reactor and stored in the water pool in the reactor hall in order to decrease the radiation and the decay heat power to a suitable level such that the assemblies can be transferred to the water pool in the interim storage facility for decades. When the assemblies are cool enough, they can be transferred to the encapsulation facility, where the assemblies are encapsulated into copper-iron canisters. After that, the canister is moved to the disposal facility, at a depth of more than 400 m. The disposal facility consists of the central tunnel and several parallel disposal tunnels that are connected to the central tunnel. The canister is placed vertically in a hole in the floor of the disposal tunnel. Finally, the disposal tunnel is filled up and sealed. In this study, we divide the disposal process into three parts: the interim storage, the encapsulation facility, and the disposal facility.

As nuclear waste management as a whole is a large task, optimization-related studies about it usually focus on smaller entities. Some of these studies concentrate on political or social aspects, such as determining where to site a disposal repository [3] or how to route the transfer of nuclear waste, or hazardous waste in general [4,5]. More safety-related aspects are the optimization of the nuclear safeguards [6,7] and the safety assessment of nuclear waste repositories [8]. In our study, we aim to produce a disposal schedule such that several goals related to the interim storage, the encapsulation facility, and the disposal facility are all taken into account simultaneously with multiobjective optimization. Other studies aiming at a disposal schedule are, for example, [9], where a single-objective mixed-integer linear programming (MILP) model minimizing the costs is given, and [10], which tries to achieve the minimal area of the disposal facility with a linear transportation model. Further research related to the disposal facility is discussed in [11], where a multiobjective MILP problem is given to optimize the nuclear waste placement in the disposal facility. In addition, there are attempts to optimize the loading of canisters in Finland [9,12], Slovenia [13], and Switzerland [14].

This study continues the work of [9], where the aim was to minimize the total costs of the disposal in Finland by selecting the schedule of the disposal. Here, this work is continued by remodeling the situation as a nonsmooth multiobjective mixed-integer nonlinear programming (MINLP) problem. In a nonsmooth optimization problem [15–17], the objectives and the constraints are not necessarily continuously differentiable functions. This allows us to model the situation more accurately. Indeed, many practical applications are nonsmooth in nature (see e.g., [18–20]), even if in practice they are often modeled as differentiable problems.

Many practical problems also involve several objectives [21–24]. Being a problem of this scale, this application naturally offers several conflicting objectives. Besides total costs, it is reasonable to optimize, for instance, the area of the disposal facility. In our model, this is done by minimizing the lengths of both disposal and central tunnels. In total, our model contains nine objectives. In addition to the previous three objectives, we have three objectives related to the interim storage and three related to the encapsulation facility. In the interim storage, we want to minimize the maximum number of assemblies in the storage, the maximum storage time, and the average storage time. For the encapsulation facility, in turn, we aim to minimize the operation time, the end time of the encapsulation, and the total number of canisters or, in other words, the number of empty assembly positions.

These objectives are indeed conflicting. For instance, we want the whole disposal process to be over as early as possible, but this raises the heat production load of the canister. This, in turn, increases the distances between the canisters in the disposal facility. The heat load of the canister can be decreased by leaving empty assembly positions, but then more canisters are needed. Another option is to increase the cooling time, which again delays the end of the disposal, and if the disposal is delayed, more storage space is needed. Obviously, all of these decisions have an impact on costs. As exemplified, the minimization of only one objective may lead to an unsatisfactory solution with respect to some other objective. This leads us to a situation where compromises are certainly needed.

As a result of the multiobjective optimization, we obtain several mathematically equally good compromises, called Pareto optimal solutions. The final selection is left to the decision maker, who has more insight into the problem. In this paper, we propose an interactive procedure utilizing the achievement scalarizing function (ASF), in particular, the two-slope parameterized ASF [25], which is based on the parameterized ASF [26] and the two-slope ASF [27] and generalizes both of them by combining their advantages. Via scalarization, the original multiobjective problem is transformed into a single-objective problem. In brief, the idea of an ASF is that the decision maker gives a reference point expressing the decision maker's wishes for the final solution. Then, the closest optimal solution with respect to some metric is found. If we use only one metric, as is the case in general with ASFs, the selected metric defines which solution is found [28–30]. With the parameterization, we are able to use several metrics, nine in this particular case, and thus obtain different solutions with a reasonable distribution. This ability to systematically generate different solutions from the same preference information is utilized in the interactive framework.

This paper is organized as follows. In Section 2, we begin by depicting the situation under consideration and give a nonsmooth multiobjective MINLP model for it. In Section 3, we first introduce some fundamental preliminaries about multiobjective optimization, and then describe the multiobjective interactive method utilizing two-slope parameterized ASFs (MITSPA). In Section 4, a case study of the disposal in Finland is given and the solutions are analyzed. Finally, Section 5 gives some concluding remarks.

#### **2. Mathematical Model**

In this section, we give a comprehensive description of the model for scheduling the disposal of the spent nuclear fuel. The aim is to provide general guidelines for the disposal schedule: we only plan how many canisters are disposed of, and we neither decide which assemblies are placed in which canister nor give any complex layout for the disposal facility. We model the situation adapting the disposal in Finland as described in the introduction, with some limitations; for example, we omit the transportation between the facilities. Furthermore, we suppose that nothing has been disposed of yet and only one type of fuel is considered. Some other simplifying assumptions are that we have access to all the assemblies, assemblies are identical, and the bedrock is homogeneous such that we can build tunnels anywhere.

The model formulated is a nonsmooth multiobjective MINLP problem having nine objectives. One obvious objective is total costs. Due to the long term time perspective of the disposal, the costs will probably change during the years so we minimize also some cost factors as their own objectives. Besides being a cost factor, these objectives have also other reasons to be selected as an objective. The interim storage-related objectives minimize storage times and amounts. The faster the assemblies get under the ground, the safer it is. Other safety issues are handled as constraints, like the cooling time of the assembly must be sufficient, the maximum decay heat power of the canister is limited, and the distances between disposal tunnels and canisters depend on the heat load of the canister. While we allow empty positions in canisters, we still try to keep the total amount of the canisters as low as possible. The other objectives related to the encapsulation facility aim to get disposal done as soon as possible. Finally, the area of the disposal facility is minimized.

#### *2.1. Parameters*

The model involves several parameters mostly dealing with lower and upper bounds and costs. First, we begin with two parameters determining the size of the model. Let


In addition, we define two sets of indices: the set of periods $\mathcal{N} = \{1, \dots, N\}$ and the set of removals from the reactor $\mathcal{Z} = \{1, \dots, Z\}$. Note that some of the removals are done before the first disposal period begins. In order to link the removals from the reactor and periods, we introduce two parameters:


In the following, we specify notation and measurement units for some physical magnitudes:


The next seven parameters describe the cost information needed as an input data for the model:


Finally, we give some parameters related to the upper and lower bounds:


- $p\_{max}^{low}$, $p\_{max}^{up}$ lower and upper bound for the maximum average power of a canister
- $d\_{CA}^{low}$, $d\_{CA}^{up}$ lower and upper bound for the distance between canisters [m]
- $d\_{DT}^{low}$, $d\_{DT}^{up}$ lower and upper bound for the distance between disposal tunnels [m].

#### *2.2. Continuous Variables*

The model involves *N*(2*Z* + 1) + 3 continuous variables, all of which are assumed to be non-negative. The continuous variables used are:


Note that the first three variables are integer in nature, but in order to ease the computation, they are relaxed to continuous variables.

#### *2.3. Binary Variables*

Besides continuous variables, the model also contains *N*(2*Z* + 3) binary variables, listed below:

	- at the beginning of the period $j \in \mathcal{N}$
	- $r\_{i,j}$ indicates that assemblies belonging to the removal $i \in \mathcal{Z}$ can be disposed of during the period $j \in \mathcal{N}$.

*Algorithms* **2019**, *12*, 252

#### *2.4. Objectives*

The model involves nine objectives such that six of them are nonlinear and three are linear. These objectives are:

$$\min \quad \max \left\{ \sum\_{i=1}^{a} M\_i, \ \sum\_{i=1}^{a+j} z\_{i,j}, \ \sum\_{i \in \mathcal{Z}} z\_{i,l} \, \middle| \, j \in \{1, \dots, b-1\}, \ l \in \{b, \dots, N\} \right\} \tag{1}$$

$$\min \quad \max \{ A\_{i,j} s\_{i,j} - 1 \mid i \in \mathcal{Z}, \ j \in \mathcal{N} \} \tag{2}$$

$$\min \quad \frac{\sum\_{i \in \mathcal{Z}} \sum\_{j \in \mathcal{N}} A\_{i,j} x\_{i,j}}{\sum\_{i \in \mathcal{Z}} M\_i} \tag{3}$$

$$\min \quad \sum\_{j \in \mathcal{N}} y\_j \tag{4}$$

$$\min \quad \max \{ e\_{OFF}^j \cdot j - 1 \mid j \in \mathcal{N} \} \tag{5}$$

$$\min \quad \sum\_{j \in \mathcal{N}} e\_j \tag{6}$$

$$\min \quad d\_{CA} \sum\_{j \in \mathcal{N}} y\_j \tag{7}$$

$$\min \quad \frac{1}{Q} d\_{CA} d\_{DT} \sum\_{j \in \mathcal{N}} y\_j \tag{8}$$

$$\begin{split} \min \quad & C\_{AS} \sum\_{i \in \mathcal{Z}} \sum\_{j \in \mathcal{N}} A\_{i,j} x\_{i,j} + C\_{IS} \max \{ e\_{OFF}^{j} \cdot j - 1 \mid j \in \mathcal{N} \} \\ &+ C\_{SP} \max \left\{ \sum\_{i=1}^{a} M\_{i}, \ \sum\_{i=1}^{a+j} z\_{i,j}, \ \sum\_{i \in \mathcal{Z}} z\_{i,l} \, \middle| \, j \in \{1, \dots, b-1\}, \ l \in \{b, \dots, N\} \right\} \\ &+ C\_{CA} \sum\_{j \in \mathcal{N}} y\_{j} + C\_{EF} \sum\_{j \in \mathcal{N}} e\_{j} + C\_{DT} d\_{CA} \sum\_{j \in \mathcal{N}} y\_{j} + C\_{CT} \frac{1}{Q} d\_{DT} d\_{CA} \sum\_{j \in \mathcal{N}} y\_{j}. \end{split} \tag{9}$$

Note that, of the nonlinear objectives, (1), (2), (5), and (9) are also nonsmooth. The objectives (1)–(3) are related to the interim storage such that (1) minimizes the maximum number of assemblies in the storage, (2) minimizes the maximum storage time, and (3) minimizes the average storage time. In the objective (1), the first component takes into account the first *a* removals from the reactor, for which all the assemblies must be stored simultaneously. The second component handles the cases where removals take place during the disposal periods. Finally, the third component covers the cases where all removals have been done.

The next three objectives (4)–(6) are related to the encapsulation facility. The objective (4) minimizes the total number of canisters, (5) aims to stop the disposal as early as possible, and (6) minimizes the time which the encapsulation facility is in operation.

The objectives (7) and (8) aim to minimize the size of the disposal facility such that (7) minimizes the total length of disposal tunnels and (8) minimizes the length of the central tunnel.

Finally, the ninth objective (9) minimizes the total costs of the disposal process. The costs taken into account are related to the storage, cost of individual canisters, the encapsulation facility operating costs, and the building costs of the disposal and central tunnels.

#### *2.5. Constraints—Interim Storage*

The first set of constraints is related to the interim storage. All of these *Z*(5*N* + 2) constraints are linear.

$$z\_{i,1} - M\_i + x\_{i,1} = 0, \quad i \in \mathcal{Z} \tag{10}$$

$$z\_{i,j} - z\_{i,j-1} + x\_{i,j} = 0, \quad i \in \mathcal{Z}, \ j \in \mathcal{N} \backslash \{1\} \tag{11}$$

$$z\_{i,N} = 0, \quad i \in \mathcal{Z} \tag{12}$$

$$\sum\_{j \in \mathcal{N}} s\_{i,j} = 1, \quad i \in \mathcal{Z} \tag{13}$$

$$r\_{i,1} = e\_{ON}^1 - s\_{i,1}, \quad i \in \mathcal{Z} \tag{14}$$

$$r\_{i,j} = r\_{i,j-1} + e\_{ON}^j - s\_{i,j}, \quad i \in \mathcal{Z}, \ j \in \mathcal{N} \backslash \{1\} \tag{15}$$

$$r\_{i,j} \le e\_j, \quad i \in \mathcal{Z}, \ j \in \mathcal{N} \tag{16}$$

$$x\_{i,j} \le U K r\_{i,j}, \quad i \in \mathcal{Z}, \ j \in \mathcal{N} \tag{17}$$

$$r\_{i,j}(A\_{i,j} - R) \ge 0, \quad i \in \mathcal{Z}, \ j \in \mathcal{N} \tag{18}$$

The constraints (10)–(12) define the variables $z\_{i,j}$ describing the number of assemblies in storage. The constraint (13) forces all the assemblies to be disposed of exactly once. The constraints (14)–(16) define the variables $r\_{i,j}$. The constraint (17) ensures that the production capacity is not exceeded, and the constraint (18) ensures that a disposed assembly has been cooling long enough.
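The storage balance expressed by constraints (10)–(12) can be checked numerically for a single removal. The sketch below is illustrative only: it assumes that `M_i` assemblies enter the storage and that `x[j]` assemblies are disposed of in period `j + 1` (Python lists are 0-indexed), and the numbers are hypothetical.

```python
def storage_levels(M_i, x):
    """Assemblies of one removal left in storage after each period,
    following z_1 = M_i - x_1 and z_j = z_{j-1} - x_j."""
    z = []
    remaining = M_i
    for disposed in x:
        remaining -= disposed
        z.append(remaining)
    return z


def satisfies_balance(M_i, x):
    """True if no level is negative and the storage is empty after the last period."""
    z = storage_levels(M_i, x)
    return all(level >= 0 for level in z) and z[-1] == 0


print(storage_levels(100, [0, 40, 60, 0]))
```

A schedule that leaves assemblies in storage after the last period violates the final condition and is therefore infeasible.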

#### *2.6. Constraints—Encapsulation Facility*

In order to guarantee the acceptable encapsulation, the following 4*N* + 1 linear constraints are needed.

$$\sum\_{j \in \mathcal{N}} e\_{ON}^j = 1 \tag{19}$$

$$\sum\_{j \in \mathcal{N}} e\_{OFF}^{j} = 1 \tag{20}$$

$$e\_1 = e\_{ON}^1 - e\_{OFF}^1 \tag{21}$$

$$e\_j = e\_{j-1} + e\_{ON}^j - e\_{OFF}^j, \quad j \in \mathcal{N} \backslash \{1\} \tag{22}$$

$$y\_j \ge \frac{1}{K} \sum\_{i \in \mathcal{Z}} x\_{i,j}, \quad j \in \mathcal{N} \tag{23}$$

$$y\_j \le U e\_j, \quad j \in \mathcal{N} \tag{24}$$

$$y\_j \ge T(e\_j - e\_{OFF}^{j+1}), \quad j \in \mathcal{N} \backslash \{N\}. \tag{25}$$

The constraints (19) and (20) ensure that the encapsulation facility is switched on and off exactly once, meaning that all the canisters must be encapsulated in one uninterrupted operating phase. The constraints (21) and (22) define the variable $e\_j$. The constraints (23)–(25) guide the encapsulation process: (23) guarantees that there are enough canisters such that all the assemblies can be disposed of, (24) keeps the number of canisters within the production capacity, and (25) forces the minimum production to be fulfilled.
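Constraints (19)–(22) and the objectives (5) and (6) can be traced with a small numerical sketch. It assumes, as the recursion suggests, that the facility operates from its switch-on period up to the period just before its switch-off period; the function names and the example periods are hypothetical.

```python
def operating_indicator(N, on_period, off_period):
    """Values e_1, ..., e_N from e_1 = eON_1 - eOFF_1 and
    e_j = e_{j-1} + eON_j - eOFF_j, with one switch-on and one switch-off."""
    e = []
    state = 0
    for j in range(1, N + 1):
        state += (j == on_period) - (j == off_period)
        e.append(state)
    return e


def end_time(N, off_period):
    """Objective (5): the maximum over j of eOFF_j * j - 1."""
    return max((j == off_period) * j - 1 for j in range(1, N + 1))


e = operating_indicator(N=6, on_period=2, off_period=5)
print(e, sum(e), end_time(6, 5))  # e_j values, objective (6), objective (5)
```

Summing the indicator gives the operation time of objective (6), while the switch-off period alone determines the end time of objective (5).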

#### *2.7. Constraints—Disposal Facility*

The number of constraints related to the disposal facility is *N* + 4 such that *N* + 1 of them are nonlinear, and three of the constraints are box constraints.

$$\sum\_{i \in \mathcal{Z}} P\_{i,j} x\_{i,j} - p\_{max} y\_j \le 0, \quad j \in \mathcal{N} \tag{26}$$

$$d\_{CA} - g(p\_{max}, d\_{DT}) = 0 \tag{27}$$

$$p\_{max} \in \left[ p\_{max}^{low}, p\_{max}^{up} \right] \tag{28}$$

$$d\_{CA} \in \left[ d\_{CA}^{low}, d\_{CA}^{up} \right] \tag{29}$$

$$d\_{DT} \in \left[ d\_{DT}^{low}, d\_{DT}^{up} \right] \tag{30}$$

The constraints (26) and (27) are the nonlinear constraints of this model. The constraint (26) ensures that the heat power of the canisters disposed of is allowable, while the constraint (27) defines the dependence between the variables $d\_{CA}$, $p\_{max}$, and $d\_{DT}$. In our case, this nonlinear function $g : \mathbb{R}^2 \to \mathbb{R}$ is approximated with a piecewise linear function (see Appendix A). Finally, the box constraints (28)–(30) give lower and upper bounds for the variables $p\_{max}$, $d\_{CA}$, and $d\_{DT}$, respectively.

Finally, we give some boundaries for the variables:

$$\begin{aligned} &x\_{i,j} \ge 0, \quad z\_{i,j} \ge 0 \quad \text{for all} \quad i \in \mathcal{Z}, j \in \mathcal{N}, \\ &y\_j \ge 0 \quad \text{for all} \quad j \in \mathcal{N}, \\ &e^j\_{ON} \in \{0, 1\}, \quad e^j\_{OFF} \in \{0, 1\}, \quad e\_j \in \{0, 1\} \quad \text{for all} \quad j \in \mathcal{N}, \\ &s\_{i,j} \in \{0, 1\}, \quad r\_{i,j} \in \{0, 1\} \quad \text{for all} \quad i \in \mathcal{Z}, j \in \mathcal{N}. \end{aligned}$$

To conclude, the model has nine objectives, of which 6 are nonlinear and 3 are linear. The other dimensions of the model depend on two parameters: the number of periods *N* and the number of removals from the reactor *Z*. The number of constraints is 5(*N*(*Z* + 1) + 1) + 2*Z*, of which *Z*(5*N* + 2) + 4*N* + 1 are linear, *N* + 1 are nonlinear, and 3 are box constraints. The total number of variables is 4*N*(*Z* + 1) + 3, of which *N*(2*Z* + 1) + 3 are non-negative continuous variables and *N*(2*Z* + 3) are binary variables. Evidently, with any realistic values of *N* and *Z*, for example *N* = 19 and *Z* = 11 when one period is five years, the problem becomes quite large.
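The size formulas above are easy to evaluate for concrete values. The following sketch simply codes the counts stated in this section (the function name and dictionary keys are assumptions) so that the totals can be compared with the closed-form expressions for *N* = 19 and *Z* = 11.

```python
def model_size(N, Z):
    """Variable and constraint counts as functions of the number of
    periods N and the number of removals from the reactor Z."""
    continuous = N * (2 * Z + 1) + 3
    binary = N * (2 * Z + 3)
    linear = Z * (5 * N + 2) + 4 * N + 1
    nonlinear = N + 1
    box = 3
    return {
        "continuous": continuous,
        "binary": binary,
        "variables": continuous + binary,
        "linear": linear,
        "nonlinear": nonlinear,
        "box": box,
        "constraints": linear + nonlinear + box,
    }


size = model_size(N=19, Z=11)
print(size)
# The totals should agree with the closed-form expressions in the text.
print(size["variables"] == 4 * 19 * (11 + 1) + 3)
print(size["constraints"] == 5 * (19 * (11 + 1) + 1) + 2 * 11)
```

For this example the model has 915 variables and 1167 constraints, which supports the remark that realistic instances are large.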

#### **3. Multiobjective Optimization Approach**

In this section, we define some fundamental aspects on multiobjective optimization, and then, describe the family of two-slope parameterized achievement scalarizing functions (ASFs) [25] with its properties. Finally, the interactive method utilizing two-slope parameterized ASFs is introduced.

#### *3.1. Mathematical Background*

We consider the following multiobjective MINLP problem of the form

$$\min\_{\mathbf{x}\in\mathcal{X}}\quad f(\mathbf{x}) = \{f\_1(\mathbf{x}), \dots, f\_k(\mathbf{x})\},\tag{31}$$

where $x \in X = \{x = (y, z) \mid y \in \mathbb{R}^n, \ z \in \mathbb{Z}^m\} \cap C$ is a decision variable, $C$ is the set of constraints, and $X$ is a nonempty and compact set of feasible solutions. The objectives $f\_i : X \to \mathbb{R}$ for all $i \in I = \{1, \dots, k\}$ are assumed to be lower semicontinuous with respect to $y$ and at least partially conflicting. Therefore, we cannot find a minimal solution for every objective simultaneously, and the minimization of only one objective may lead to an arbitrarily bad solution with respect to other objectives. In order to compare objective vectors, for $x, y \in \mathbb{R}^k$ we write $x < y$ if $x\_i < y\_i$ for all $i \in I$ and $x \le y$ if $x\_i \le y\_i$ for all $i \in I$.

In multiobjective optimization, we say that a solution is Pareto optimal if we cannot improve any objective without causing a deterioration in some other objective at the same time. Mathematically speaking, a solution $x^{\ast} \in X$ is Pareto optimal if there does not exist any solution $x \in X$ such that $f(x) \le f(x^{\ast})$ and $f\_j(x) < f\_j(x^{\ast})$ for at least one index $j \in I$. It is noteworthy that usually we do not have a unique Pareto optimum but a set of Pareto optimal solutions, called the Pareto set. All these Pareto optimal solutions also belong to a larger class of weakly Pareto optimal solutions. A solution $x' \in X$ is an element of this class if there does not exist another solution $x \in X$ such that $f(x) < f(x')$.
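For a finite list of objective vectors, the two definitions above can be checked directly. The following sketch is only an illustration on hypothetical data (all objectives are minimized); the function names are assumptions.

```python
def dominates(a, b):
    """True if a is at least as good as b in every component and
    strictly better in at least one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(points):
    """Points of a finite list that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def weakly_pareto_set(points):
    """Points for which no other point is strictly better in every component."""
    return [p for p in points
            if not any(all(x < y for x, y in zip(q, p)) for q in points)]


points = [(1, 5), (2, 2), (3, 2), (4, 1), (4, 4)]  # hypothetical objective vectors
print(pareto_set(points))
print(weakly_pareto_set(points))
```

In this example the vector (3, 2) is weakly Pareto optimal but not Pareto optimal, because (2, 2) improves the first objective without worsening the second.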

In order to obtain some information about the Pareto set, we can define an ideal and a nadir vector, $f^{id} \in \mathbb{R}^k$ and $f^{nad} \in \mathbb{R}^k$, which give the lower and the upper bound for a Pareto optimal solution, respectively. The ideal vector consists of the individual minima of the objectives. This means that the component $f\_i^{id}$ is the optimal value of the problem $\min\_{x \in X} f\_i(x)$. Due to the conflicting objectives, the ideal vector is not feasible. The nadir vector, in turn, represents the worst objective values in the Pareto set. Unfortunately, the exact calculation of the nadir vector requires the maximization of objectives over the set of Pareto optimal solutions, which is a hard task. Thus, the nadir vector needs to be approximated, for example, with a pay-off table (see e.g., [31,32]).
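The ideal vector and the pay-off table estimate of the nadir vector can be illustrated on a small finite set of objective vectors. This is a sketch under simplifying assumptions: the set is given explicitly instead of being the image of a MINLP feasible set, ties between minimizers are broken by taking the first one, and the data are hypothetical.

```python
def ideal_vector(points):
    """Component-wise minima over a finite set of objective vectors."""
    return tuple(min(column) for column in zip(*points))


def payoff_nadir(points):
    """Pay-off table estimate of the nadir vector: row i of the table is the
    objective vector of a minimizer of objective i, and the estimate is the
    column-wise maximum of the table."""
    k = len(points[0])
    table = [min(points, key=lambda p, i=i: p[i]) for i in range(k)]
    return tuple(max(row[i] for row in table) for i in range(k))


# Hypothetical, mutually nondominated objective vectors (minimization).
points = [(1, 5, 3), (2, 2, 4), (4, 1, 2), (3, 3, 1)]
print(ideal_vector(points), payoff_nadir(points))
```

Here the true worst value of the third objective over the four nondominated vectors is 4, whereas the pay-off table only sees 3, which shows why the table yields an approximation rather than the exact nadir vector.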

#### *3.2. Two-Slope Parameterized ASFs*

We approach the multiobjective mixed-integer problem with a special type of achievement scalarizing function. In general, an achievement scalarizing function (ASF) is used to find a Pareto optimal solution that is as close as possible to a so-called reference point $f^R$. The components $f_i^R$, $i \in I$, express the decision maker's wishes for each objective. The search is done by transforming the multiobjective optimization problem into a scalarized problem and then applying a suitable single-objective optimization method.

We use here the two-slope parameterized ASF proposed in [25], which is a generalization of the parameterized ASF [26] and the two-slope ASF [27]. Usually, to find the point closest to the reference point $f^R$, the distance from $f^R$ is measured with only one metric. With the parameterization used in the parameterized ASF and the two-slope parameterized ASF, we can combine different metrics such that the $L_\infty$ and $L_1$ metrics are the extreme cases. Thus, by systematically producing different Pareto optimal solutions from the same preference information, we can give the decision maker a wider perspective on the range of Pareto optimal solutions. Another benefit of the two-slope parameterized ASF, as well as the two-slope ASF, is that we do not need to test the achievability of the reference point. This is because different weights are used depending on whether the reference point is achievable (i.e., it belongs to the image of the feasible set in the objective space) or unachievable. The use of different weights is reasonable since the decision maker usually prefers different solutions depending on whether the reference point is achievable or not, as was suggested in [28].

In order to solve the model described in Section 2, we apply the two-slope parameterized ASF. Once the multiobjective problem is converted to the single-objective one, we obtain a scalarized version of the problem (31) in the form [25]

$$\min_{x \in X} \max_{\substack{I^q \subseteq I \\ |I^q| = q}} \left\{ \sum_{i \in I^q} \left[ \max\{\lambda_i^U (f_i(x) - f_i^R), 0\} + \min\{\lambda_i^A (f_i(x) - f_i^R), 0\} \right] \right\}, \tag{32}$$

where the weighting coefficients $\lambda_i^U, \lambda_i^A > 0$ for all $i \in I$ are used for the unachievable and the achievable reference point, respectively. The parameter $q \in I$ specifies which metric is used, and $I^q$ is a set containing $q$ integers from the interval $[1, k]$, where $k$ is the total number of objectives. The maximization is then taken over all such sets $I^q$. In order to gain the benefits of the parameterization, or in other words, to use more metrics than only $L_1$ (i.e., $q = k$) and $L_\infty$ (i.e., $q = 1$), the problem must contain at least three objectives, while the maximum number of different metrics equals the number of objectives.
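To make the structure of the objective in (32) concrete, the following minimal sketch evaluates it for a given objective vector. The function names and the numerical values are illustrative assumptions, not part of the original model. The sketch also uses the observation that maximizing a sum over all index sets of size $q$ is the same as summing the $q$ largest terms.

```python
from itertools import combinations


def two_slope_terms(f, f_ref, lam_u, lam_a):
    """Per-objective terms max{lam_u*d, 0} + min{lam_a*d, 0} with d = f_i - f_ref_i."""
    return [
        max(lu * (fi - ri), 0.0) + min(la * (fi - ri), 0.0)
        for fi, ri, lu, la in zip(f, f_ref, lam_u, lam_a)
    ]


def asf_by_enumeration(f, f_ref, lam_u, lam_a, q):
    """Literal evaluation: maximum over all index sets with exactly q elements."""
    terms = two_slope_terms(f, f_ref, lam_u, lam_a)
    if not 1 <= q <= len(terms):
        raise ValueError("q must lie between 1 and the number of objectives")
    return max(sum(terms[i] for i in subset)
               for subset in combinations(range(len(terms)), q))


def asf_by_sorting(f, f_ref, lam_u, lam_a, q):
    """Equivalent evaluation: sum of the q largest terms."""
    terms = sorted(two_slope_terms(f, f_ref, lam_u, lam_a), reverse=True)
    return sum(terms[:q])


# Hypothetical data: three objectives, one above, one below, one at the reference.
f = (3.0, 1.0, 2.0)
f_ref = (2.0, 2.0, 2.0)
lam_u = (2.0, 2.0, 2.0)
lam_a = (0.5, 0.5, 0.5)
values = [asf_by_enumeration(f, f_ref, lam_u, lam_a, q) for q in (1, 2, 3)]
print(values)
```

With $q = 1$ only the worst deviation counts, while with $q = k$ all deviations are summed, which corresponds to the two extreme metrics mentioned above.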

Next, we are interested in what can be deduced from the optimal solution of the scalarized problem. As a justification for the use of the two-slope parameterized ASF, we can prove the following results by adapting the proofs from [25].

**Theorem 1** ([25])**.** *For the scalarized problem* (32) *it holds that:*


**Proof.** (i) Assume that $x^*$ is an optimal solution of the problem (32) but not a weakly Pareto optimal solution of the problem (31). Then there exists a feasible solution $x' \in X$ such that $f(x') < f(x^*)$. For any $x \in X$, denote $I_x = \{i \in I^q \mid f_i^R \le f_i(x)\}$, $J_x = \{i \in I^q \mid f_i^R > f_i(x)\}$ and $s_R^q(f(x), \lambda^U, \lambda^A)$ as the objective of the scalarized problem (32). Now

$$\begin{split} s_R^q(f(x'), \lambda^U, \lambda^A) &= \max_{\substack{I^q \subseteq I \\ |I^q| = q}} \left\{ \sum_{i \in I_{x'}} \lambda_i^U (f_i(x') - f_i^R) + \sum_{i \in J_{x'}} \lambda_i^A (f_i(x') - f_i^R) \right\} \\ &< \max_{\substack{I^q \subseteq I \\ |I^q| = q}} \left\{ \sum_{i \in I_{x'}} \lambda_i^U (f_i(x^*) - f_i^R) + \sum_{i \in J_{x'}} \lambda_i^A (f_i(x^*) - f_i^R) \right\} \\ &\le \max_{\substack{I^q \subseteq I \\ |I^q| = q}} \left\{ \sum_{i \in I_{x^*}} \lambda_i^U (f_i(x^*) - f_i^R) + \sum_{i \in J_{x^*}} \lambda_i^A (f_i(x^*) - f_i^R) \right\} \\ &= s_R^q(f(x^*), \lambda^U, \lambda^A), \end{split}$$

which yields a contradiction.

(ii) Assume that $x^*$ is an optimal solution of the problem (32) but not a Pareto optimal solution of the problem (31). Then there exists $x' \in X$ such that $f(x') \le f(x^*)$ and at least one index $j \in I$ such that $f_j(x') < f_j(x^*)$. Similarly to (i), we can deduce that $s_R^q(f(x'), \lambda^U, \lambda^A) \le s_R^q(f(x^*), \lambda^U, \lambda^A)$. If equality holds, $x'$ is an optimal solution of the problem (32) and Pareto optimal for the problem (31). In the case of strict inequality, we obtain a contradiction with the assumption that $x^*$ is an optimal solution of the problem (32).

(iii) First, we observe that $s_R^q$ is strictly increasing, i.e., $s_R^q(f(x_1), \lambda^U, \lambda^A) < s_R^q(f(x_2), \lambda^U, \lambda^A)$ for any $x_1, x_2 \in X$ with $f(x_1) < f(x_2)$. Indeed, by taking $x_1, x_2 \in X$ such that $f(x_1) < f(x_2)$, we see that

$$\begin{split} s_R^q(f(x_1), \lambda^U, \lambda^A) &= \max_{\substack{I^q \subseteq I \\ |I^q| = q}} \left\{ \sum_{i \in I_{x_1}} \lambda_i^U (f_i(x_1) - f_i^R) + \sum_{i \in J_{x_1}} \lambda_i^A (f_i(x_1) - f_i^R) \right\} \\ &< \max_{\substack{I^q \subseteq I \\ |I^q| = q}} \left\{ \sum_{i \in I_{x_2}} \lambda_i^U (f_i(x_2) - f_i^R) + \sum_{i \in J_{x_2}} \lambda_i^A (f_i(x_2) - f_i^R) \right\} \\ &= s_R^q(f(x_2), \lambda^U, \lambda^A). \end{split}$$

The claim follows, since for any strictly increasing ASF it holds that a weakly Pareto optimal solution $x^*$ of the problem (31) is a solution of the scalarized problem with $f^R = f(x^*)$, and the optimal value of $s_R^q$ is zero (see [32]).

Thus, we know that every Pareto optimal solution can be obtained and the solution of the scalarized problem (32) is weakly Pareto optimal for the original multiobjective problem. In order to guarantee the Pareto optimality of solutions, a so-called augmentation term [32]

$$\rho \sum_{i \in I} \lambda_i (f_i(x) - f_i^R), \quad \rho > 0 \tag{33}$$

may be added to the objective of the scalarized problem (32) [25]. Note that, similarly to Theorem 8 in [25], it can be proven that if the set $X$ and the objectives $f_i$, $i \in I$, are convex, then $s_R^q(f(x), \lambda^U, \lambda^A)$ preserves the convexity.
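The effect of the augmentation term (33) can be seen on a toy example with two candidate objective vectors that tie under the plain ASF. The sketch below is an illustration under stated assumptions: the weights `lam` of the augmentation term are set to one, `rho` is an arbitrary small positive number, and the two vectors are hypothetical.

```python
def asf(f, f_ref, lam_u, lam_a, q):
    """Objective of (32): sum of the q largest two-slope terms."""
    terms = sorted(
        (max(lu * (fi - ri), 0.0) + min(la * (fi - ri), 0.0)
         for fi, ri, lu, la in zip(f, f_ref, lam_u, lam_a)),
        reverse=True,
    )
    return sum(terms[:q])


def augmented_asf(f, f_ref, lam_u, lam_a, q, lam, rho):
    """Objective of (32) plus the augmentation term (33)."""
    augmentation = rho * sum(l * (fi - ri) for l, fi, ri in zip(lam, f, f_ref))
    return asf(f, f_ref, lam_u, lam_a, q) + augmentation


f_ref = (0.0, 0.0)
ones = (1.0, 1.0)
f_a = (2.0, 1.0)  # Pareto optimal among the two candidates
f_b = (2.0, 2.0)  # dominated by f_a, but still weakly Pareto optimal

# With q = 1 the plain ASF cannot tell the two candidates apart.
plain_a = asf(f_a, f_ref, ones, ones, 1)
plain_b = asf(f_b, f_ref, ones, ones, 1)

# The augmentation term breaks the tie in favor of the Pareto optimal candidate.
rho = 1e-3
aug_a = augmented_asf(f_a, f_ref, ones, ones, 1, ones, rho)
aug_b = augmented_asf(f_b, f_ref, ones, ones, 1, ones, rho)
print(plain_a, plain_b, aug_a < aug_b)
```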

#### *3.3. Multiobjective Interactive Method Utilizing the Two-Slope Parameterized ASFs*

In the following, we outline the multiobjective interactive method utilizing the two-slope parameterized ASFs (MITSPA), which uses reference point based preference information. The general framework of interactive methods is usually similar: first, some range for the Pareto optimal solutions is shown to the decision maker; second, the decision maker provides some preference information; third, some solutions are presented to the decision maker; and fourth, the decision maker expresses his/her opinion on the solutions and modifies the preference information as a basis for new solutions. The process is stopped when the decision maker is satisfied with the solution. The main differences between interactive methods lie in the way the preference information is given and in the solvers applied (see e.g., [32,33]).

Similar approaches to ours in terms of the utilization of scalarization functions and the reference point as preference information are proposed, for instance, in [32–38]. Compared with these, in our case with the two-slope parameterized ASFs, we can systematically produce different Pareto optimal solutions to obtain a reasonably distributed selection of the Pareto set.

Multiobjective interactive method utilizing the two-slope parameterized ASFs (MITSPA)


Some remarks about the above algorithm are in order. Step 0 consists of the illustration of the Pareto set. Some Pareto optimal solutions for the decision maker to start with can be calculated, for example, by using the two-slope parameterized ASF (32) with the ideal vector as the reference point or by applying a suitable no-preference method such as the descent methods [39–43]. In Step 3, $s \in [1, k]$ solutions are presented to the decision maker, where $k$ is the number of objectives. As mentioned, with the two-slope parameterized ASF we are able to compute as many different solutions as there are objectives. If the number of objectives is high, it eases the task of the decision maker if only some of these solutions are presented. However, if the decision maker wants to see more solutions from the same reference point, this is enabled in Step 4. If more than $k$ solutions are needed in total for one reference point, they can be obtained by varying the coefficients $\lambda^U$ and $\lambda^A$. During the solution process, the decision maker learns about the model and, after seeing some solutions, has more insight into the problem and might want to change his/her opinion on a good reference point. Thus, in Step 5, a new reference point is allowed and new solutions are computed in Step 2.
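The loop described in these remarks can be sketched in a few lines. This is only a rough illustration under several assumptions: the feasible set is replaced by a hypothetical finite list of objective vectors, the decision maker is a scripted callback, only the solutions for the smallest and the largest $q$ are shown, and Steps 0 and 4 are omitted. It is not the authors' implementation.

```python
def asf(f, f_ref, lam_u, lam_a, q):
    """Objective of (32): sum of the q largest two-slope terms."""
    terms = sorted(
        (max(lu * (fi - ri), 0.0) + min(la * (fi - ri), 0.0)
         for fi, ri, lu, la in zip(f, f_ref, lam_u, lam_a)),
        reverse=True,
    )
    return sum(terms[:q])


def solve_scalarized(outcomes, f_ref, lam_u, lam_a, q):
    """Stand-in for the solver call: minimize the ASF over a finite set."""
    return min(outcomes, key=lambda f: asf(f, f_ref, lam_u, lam_a, q))


def interactive_loop(outcomes, lam_u, lam_a, first_reference, decide, max_iterations=10):
    k = len(first_reference)
    reference, history = first_reference, []
    for _ in range(max_iterations):
        # One solution per metric q = 1, ..., k for the current reference point.
        candidates = {q: solve_scalarized(outcomes, reference, lam_u, lam_a, q)
                      for q in range(1, k + 1)}
        # Present a subset: here the smallest and the largest q.
        shown = {1: candidates[1], k: candidates[k]}
        # The decision maker picks a solution and optionally a new reference point.
        chosen_q, new_reference = decide(reference, shown)
        history.append((reference, chosen_q, shown[chosen_q]))
        if new_reference is None:  # the decision maker is satisfied
            break
        reference = new_reference
    return history


# Scripted decision maker (hypothetical): two iterations, then stop.
answers = iter([(1, (1.0, 1.0)), (2, None)])


def scripted_decision_maker(reference, shown):
    return next(answers)


outcomes = [(0.0, 3.0), (2.0, 2.0), (3.0, 0.5)]
ones = (1.0, 1.0)
history = interactive_loop(outcomes, ones, ones, (0.0, 0.0), scripted_decision_maker)
for entry in history:
    print(entry)
```

Even in this toy setting, the solutions for $q = 1$ and $q = 2$ differ for the same reference point, which is the property the method exploits.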

#### **4. Case Study: The Disposal in Finland**

In practice, the scalarized problem (32) with an augmentation term (33) in Step 2 of MITSPA is solved with a branch-and-cut type method for single-objective MINLP problems called BARON [44,45] in GAMS [46]. The CPU time of solving each problem (32) presented here varies from 9 s to 28000 s while the average CPU time is 3475 s and the median CPU time is 142 s. The weighting vectors used are of the form

$$
\lambda^U = \frac{1}{f^{nad} - f^R}, \quad \lambda^A = \frac{1}{f^R - f^{id}}
$$

such that $f^{nad} - f^R > 0$ and $f^R - f^{id} > 0$, as suggested in [27]. The approximation of the nadir vector is obtained with a pay-off table [31,32].
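Read componentwise, these weights can be computed as in the following minimal sketch. The function name and the numerical values are hypothetical; the check mirrors the requirement that the reference point lies strictly between the ideal vector and the nadir estimate.

```python
def two_slope_weights(f_ideal, f_nadir, f_ref):
    """Componentwise weights lam_u = 1/(f_nad - f_R) and lam_a = 1/(f_R - f_id)."""
    lam_u, lam_a = [], []
    for f_id, f_nad, f_r in zip(f_ideal, f_nadir, f_ref):
        if not f_id < f_r < f_nad:
            raise ValueError("each reference value must lie strictly between ideal and nadir")
        lam_u.append(1.0 / (f_nad - f_r))
        lam_a.append(1.0 / (f_r - f_id))
    return lam_u, lam_a


# Hypothetical ideal, nadir and reference vectors for two objectives.
lam_u, lam_a = two_slope_weights((0.0, 0.0), (4.0, 8.0), (2.0, 4.0))
print(lam_u, lam_a)
```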

We investigate the disposal of the spent nuclear fuel from the European pressurized water reactor (EPR) produced by Olkiluoto 3 in Finland, which is expected to start operating in the near future. The length of one disposal period is selected to be 5 years, and the parameters $N$ and $Z$ are 19 and 11, respectively. The other parameters used are given in Appendix A, except the cost parameters, which are omitted due to their commercially sensitive nature. This parameter selection yields a multiobjective MINLP problem with 9 objectives, 440 continuous and 475 binary variables, 1144 linear constraints, 20 nonlinear constraints, and 3 box constraints. Apart from these, we need some auxiliary variables and constraints to overcome the nonsmoothness of the problem. Indeed, the two-slope parameterized ASFs are nonsmooth, but due to their min-max structure, the problem (32) can be written in the MINLP form as in [25]. Similarly, this trick can also be applied to the nonsmooth objectives. After that, we have to solve a single-objective problem with 441 continuous and 484 binary variables, 1153 linear constraints, 21–146 nonlinear constraints, and 3 box constraints.
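As a generic illustration of this kind of reformulation (the textbook epigraph form, not necessarily the exact reformulation used in [25]), a problem whose objective is the maximum of finitely many smooth functions $g_1, \ldots, g_m$ can be written with one auxiliary variable $t$ and $m$ smooth constraints. The symbols $g_j$, $m$ and $t$ are introduced here only for illustration:

$$\min_{x \in X} \max_{j = 1, \ldots, m} g_j(x) \quad \Longleftrightarrow \quad \min_{x \in X,\ t \in \mathbb{R}} t \quad \text{s.t.} \quad g_j(x) \le t, \quad j = 1, \ldots, m.$$

The outer maximum in (32) fits this pattern with one constraint per index set $I^q$, that is, $\binom{k}{q}$ constraints. For $k = 9$ this number ranges from 1 (for $q = 9$) to 126 (for $q = 4$ or $q = 5$), which would be consistent with the reported range of 21–146 nonlinear constraints once the 20 original nonlinear constraints are added, although the source does not state this explicitly. The inner max and min terms require further auxiliary variables, which are not sketched here.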

Before we proceed to the solution process, we discuss the trade-offs of the problem. There are three parts in the final disposal of spent nuclear fuel: the interim storage, the encapsulation facility, and the disposal facility. These three parts interact with each other as is exemplified in the following.


In order to investigate these and other trade-offs, the interactive method MITSPA is employed. In each iteration of MITSPA, new preference information reflecting his/her preferences is requested from the decision maker. For each iteration, we compute nine solutions by using the current reference point with different metrics, varying the value of the parameter $q$ from 1 to 9 in Step 2. To exemplify this, the nine solutions computed using reference point 1 are shown in Figure 1. These nine solutions represent different trade-offs between the objective function values. The results obtained are scaled to the interval from 0 to 1 such that 0 is the value of the ideal vector and 1 is the value of the nadir vector for the objective under consideration. The different solutions are labeled based on the reference point used and the value of the parameter $q$. For example, the solution r1q1 is the result obtained by using the reference point 1 and $q = 1$. Moreover, the reference point 1 is labeled with r1.

**Figure 1.** The objective values of 9 solutions obtained using the reference point 1.
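The scaling used in the figures can be written as a one-line transformation per objective. The sketch below assumes hypothetical ideal and nadir values; a scaled value below 0 or above 1 can only occur for points outside the range spanned by the two vectors, for example when the nadir vector is only an estimate.

```python
def scale_to_unit_interval(f, f_ideal, f_nadir):
    """Map each objective value so that the ideal value becomes 0 and the nadir value 1."""
    return [(fi - lo) / (hi - lo) for fi, lo, hi in zip(f, f_ideal, f_nadir)]


# Hypothetical values for two objectives.
scaled = scale_to_unit_interval((2.0, 5.0), (1.0, 1.0), (3.0, 9.0))
print(scaled)
```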

In Step 3, two solutions are selected to be presented to the decision maker for closer inspection. The number of presented solutions $s$ is restricted to two so that the decision maker only has to select the best out of two options and the presentation stays clear. At each iteration, one solution with a smaller value of $q$ and one with a larger value of $q$ are presented, and different values of $q$ are used in order to exemplify the variety of solutions. Next, we present four iterations of MITSPA.

Iteration 1. At the first iteration, the decision maker begins by investigating the trade-off between the operation time of the encapsulation facility and the cooling times of the assemblies, starting with an unachievable reference point such that the operation time is short and the cooling time is long. The two solutions chosen for reference point 1 are shown in Figure 2a together with reference point 1. The solution obtained by using the value $q = 1$ (r1q1), shown with the green line, corresponds to the early starting time of the disposal. The solution obtained by using the value $q = 9$ (r1q9), shown with the orange line, corresponds to the late starting time of the disposal. In Figure 2b,c, the corresponding disposal schedules are given. The solution r1q9 has the shortest possible encapsulation time, but the maximum cooling time is long. The solution r1q9, like r1q1, has a high maximum number of assemblies in the storage (see the objective (1)), but the maximum and average storage times (the objectives (2) and (3)) are slightly shorter. The solution r1q9 does not allow any empty positions in canisters while the solution r1q1 does (see (4)), but the encapsulation ends much later (see (5)) in the solution r1q9 than in r1q1. However, the operation time of the encapsulation facility (see (6)) is shorter in the solution r1q9 than in r1q1. When the disposal facility-related objectives (see (7) and (8)) are compared, the solution r1q9 needs a smaller area than the solution r1q1. Moreover, the solution r1q9 is cheaper than the reference, while the solution r1q1 is more expensive than the reference (see (9)). Mainly due to the significant difference in the costs, the decision maker selects the solution r1q9 as the current solution $f^1$.

**Figure 2.** Results for the iteration 1. **(a)** The objective values for the selected solutions *q* = 1 and *q* = 9 and the reference point 1.

Iteration 2. In order to learn more about the trade-off between the operation time of the encapsulation facility and the cooling time, another reference point (reference point 2) is selected. In this case, the reference point is achievable. Now we try to find solutions such that the operation time is longer and cooling time shorter. The two solutions chosen for reference point 2 are shown in Figure 3a and the corresponding disposal schedules in Figure 3b,c. Again, the solution obtained with the small value *q* = 1 (r2q1) represents the early starting time of the disposal. This is depicted with the green line in Figure 3a while the orange line depicts the solution obtained using value *q* = 9 (r2q9) corresponding to the late starting time of the disposal.

If we compare the disposal schedules in Figure 3b,c to the schedules for reference point 1 given in Figure 2b,c, we notice some similarity. Even though the starting and ending times as well as the total number of canisters differ, the solutions obtained with the same value of $q$ (1 or 9) have the same shape for both reference points. The smaller $q$ suggests a schedule where we first encapsulate a small number of canisters per period and the number of canisters grows as time goes by, whereas the larger $q$ recommends a schedule where all the canisters are encapsulated within two periods. The solution r2q1 captures the reference point well, since they coincide with respect to all objectives other than (1) and (5), which are better than the reference values. Thus, the decision maker is willing to continue with the solution r2q1 as the current solution $f^2$.

Iteration 3. The long operation time of the encapsulation facility (the objective (6)) is still under the microscope at the third iteration, but the decision maker is tempted by the short central tunnel that appeared in the previous iteration and combines the long operation time with a small disposal facility area. Like the first reference point, this one is also unachievable. The solution obtained by using $q = 2$ (r3q2) is shown in green and the solution obtained with $q = 8$ (r3q8) is shown in orange in Figure 4a. Figure 4b illustrates that the solution r3q2 yields a schedule with an early starting date in which the disposal takes the longest time, while the solution r3q8 starts the disposal later but performs it faster, as seen in Figure 4c. The solution r3q8 yields an almost ideal value for the costs, and we can deduce that in order to achieve lower costs we have to compromise on the objectives related to the storage capacity and times. Moreover, the disposal ends rather late. For the current solution $f^3$, the decision maker selects the solution r3q8 due to the low costs and the small disposal facility area.

**Figure 3.** Results for the iteration 2. **(a)** The objective values for the selected solutions *q* = 1 and *q* = 9 and reference point 2. **(b)** The disposal schedule for *q* = 1. **(c)** The disposal schedule for *q* = 9.

**(a)** The objective values for the selected solutions *q* = 2 and *q* = 8 and reference point 3.

Since one motivation for this research was to take into account more goals than just the costs, we are eager to see what happens if we omit the costs and solve the problem with only the first eight objectives (1)–(8). The reference point 3' is the reference point 3 without the value for the costs. The results with $q = 2$ (r3'q2) and $q = 8$ (r3'q8) are given in Figure 5. Note that since there are now only eight objectives, the scalarized function is different from that in the case of nine objectives. The solutions in Figure 5 are quite similar, and there is less variation than in the solutions in Figure 4a.

**Figure 5.** The objective values corresponding to the selected solutions *q* = 2 and *q* = 8 for the modified reference point 3 with objectives (1)–(8).

Iteration 4. The current solution $f^3$ has a high interim storage capacity and a small number of canisters. At the fourth iteration, the decision maker is interested in seeing whether the opposite is possible, namely a solution with a small interim storage capacity while allowing a higher number of canisters. Again, the reference point is unachievable. In Figure 6a, the reference point 4 and the solutions with $q = 4$ (r4q4) and $q = 9$ (r4q9) are illustrated. The solutions are shown in green and orange, respectively. Again, the corresponding disposal schedules are given in Figure 6b,c. As we see, the solution r4q4 satisfies the wishes concerning the interim storage capacity as well as the utilization of the empty canister positions quite well. Additionally, better values than the reference are obtained in the goals related to the repository area and the costs. The solution r4q9 expresses this as well, but the original wishes concerning the interim storage capacity are not satisfied. Since the solution r4q4 better captures the ideas of the decision maker, it is selected as the current solution $f^4$.

**(a)** The objective values for the selected solutions *q* = 4 and *q* = 9 and reference point 4.

Eventually, the decision maker is ready to make the final choice. During the solution process, we have learned that the solutions obtained from the four different reference points can be split broadly into two main groups. The first group includes the solutions where the disposal starts early, while the other group includes the solutions with a late start. The most striking fact is that the solutions of the first group are obtained with smaller values of $q$ and the solutions of the second group with larger values of $q$. This phenomenon is repeated with all four reference points. Interestingly, with the modified reference point 3, where only eight objectives were considered, mainly solutions with an earlier starting time were obtained. In general, an earlier starting time of the disposal improves the objectives (1)–(3) and impairs the others compared with the case where the disposal starts later. We also notice that the solutions obtained adapt to the reference points quite well.

In Figure 7, the solutions of the first group with an early starting time are illustrated. It can be seen that even though all these solutions suggest an early start of the disposal, they still have some differences. One can improve goals (7) and (8) by disposing of spent fuel with a small volume at the beginning. However, this worsens goals (1)–(3), (5), (6) and (9), which can be seen from the solution r3q2. It is possible to improve the goals (1)–(3), (5) and (6) by allowing some canister positions to be empty. However, this in turn worsens goals (4) and (7)–(9), which can be seen from the solution r4q4. As the final solution for this group, the decision maker prefers to return to the reference point 2, and the solution r2q1 looks like a good compromise when the disposal begins early.

**Figure 7.** The four solutions where disposal starts early.

A similar examination is done for the solutions of the second group with the late starting time. The solutions in terms of the objective function values and the disposal schedules are given in Figure 8. Again, we can observe some differences, which depend on the number of years by which the start of the disposal operations is postponed. It can be seen from Figure 8 that the disposal volume is large in every solution where the disposal starts late. On the one hand, one can improve goals (7)–(9) by delaying the start of the disposal, but on the other hand, this worsens goals (2), (3) and (5), as illustrated by the solution r3q8. When the disposal starts late, empty canister positions have only a minor impact on the solution. One can improve goals (2), (3), and (5) by allowing empty canister positions, but this impairs goals (4) and (7)–(9), which can be seen from the solution r4q9. Again, the decision maker is willing to return to the reference point 2 and considers the solution r2q9 a satisfactory solution when the disposal starts late. Additionally, the decision maker selects this solution as the final solution $f^*$, since it yields rather good values for all objectives other than the maximum storage. However, we learned that this is the price of the lower costs and the smaller disposal facility area. Moreover, compared with the solution r2q1, also obtained from the reference point 2, the later start does not delay the end of the disposal.

**Figure 8.** The four solutions where disposal starts late.

#### **5. Conclusions**

In this paper, we have proposed a nonsmooth multiobjective MINLP model to optimize the spent nuclear fuel disposal in order to obtain a disposal schedule. The modeled process is described and the model is presented in detail. Then, the two-slope parameterized ASF is briefly stated and its use is justified. Additionally, we proposed an interactive solution method utilizing these ASFs. Finally, some numerical results from the case study are given. The solutions obtained are exemplified and analyzed in terms of objective function values and disposal schedules.

With slight modifications, the model presented is applicable to countries other than Finland as well, provided that the spent nuclear fuel is to be disposed of in a disposal facility. It is possible to change the objectives or leave some of them out. Indeed, this model has quite many objectives, and in some cases it may be advantageous to have fewer goals, either to ease the decision maker's task or to reduce the computations needed.

The schedules obtained are realistic and viable. One conspicuous feature of the solutions is that they are segmented into two groups based on the value of the parameter $q$, which enables the parameterization when the two-slope parameterized ASF is used. With the lower values of $q$ (i.e., closer to the $L_\infty$ metric), the disposal starts early, and with the larger values of $q$ (i.e., closer to the $L_1$ metric), a later start of the disposal is suggested. If only one metric, for instance the $L_\infty$ metric, had been used, no solutions with a late start would have been obtained in these iterations. For further studies, it would be interesting to investigate whether this kind of phenomenon is observable in other applications as well when the two-slope parameterized ASF is used. The role of $q$ is also fascinating in terms of which value of $q$ yields the most desirable solution for the decision maker.

As future research, it would also be interesting to include all of the three different fuel types used in Finland. Another interesting topic would be including the possible hiatus for the operation of the encapsulation facility in the model.

**Author Contributions:** Conceptualization, O.M., T.R. and M.M.M.; methodology, O.M. and T.R.; software, O.M.; validation, O.M.; formal analysis, O.M.; investigation, O.M.; resources, M.M.M.; data curation, O.M. and T.R.; writing—original draft preparation, O.M.; writing—review and editing, O.M., T.R. and M.M.M.; visualization, O.M.; supervision, M.M.M.; project administration, M.M.M.; funding acquisition, O.M., T.R. and M.M.M.

**Funding:** The study is financially supported by Emil Aaltonen foundation, University of Turku Graduate School UTUGS Matti programme, Academy of Finland project No. 294002, University of Turku, and Tampere University of Technology.

**Acknowledgments:** The authors are grateful to Yury Nikulin and Jani Huttunen for their comments during the preparation of this paper. The authors also wish to congratulate Adil Bagirov on his 60th birthday!

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Parameters of the Case Study**

The parameters used in the case study given in Section 4 are


and the values for $A_{i,j}$ and $P_{i,j}$ are given in Tables A1 and A2, respectively. Furthermore, the following approximation is used in the constraint (27):

$$\begin{split} g(p_{\text{max}}, d_{DT}) &= \max \{-2.26911d_{DT} + 0.00675p_{\text{max}} + 54.5228, \\ &- 0.05833d_{DT} + 0.00596p_{\text{max}} - 0.727083, \\ &- 0.14d_{DT} + 0.17701p_{\text{max}} - 350.651 \}. \end{split}$$
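The approximation above is the pointwise maximum of three affine functions and can be evaluated as in the following minimal sketch. The argument names follow the formula; their meaning and units are defined with the constraint (27) in the model description and are not repeated here, and the evaluation point is arbitrary.

```python
# Coefficients (d_DT, p_max, constant) of the three affine pieces from the formula above.
AFFINE_PIECES = [
    (-2.26911, 0.00675, 54.5228),
    (-0.05833, 0.00596, -0.727083),
    (-0.14, 0.17701, -350.651),
]


def g(p_max, d_dt):
    """Pointwise maximum of the three affine pieces."""
    return max(a * d_dt + b * p_max + c for a, b, c in AFFINE_PIECES)


# Arbitrary evaluation point: at the origin only the constants remain.
print(g(0.0, 0.0))
```

Because the function is a maximum of affine pieces, it is convex and piecewise linear, so a constraint of the form $g \le t$ can be replaced by three linear constraints.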

**Table A1.** Values for the parameters *Ai*,*j*.


**Table A2.** Values for the parameters *Pi*,*j*.


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **SVM-Based Multiple Instance Classification via DC Optimization**

#### **Annabella Astorino 1,\*,†, Antonio Fuduli 2,†, Giovanni Giallombardo 3,† and Giovanna Miglionico 3,†**


Received: 31 October 2019; Accepted: 20 November 2019; Published: 23 November 2019

**Abstract:** A multiple instance learning problem consists of categorizing objects, each represented as a set (bag) of points. Unlike the supervised classification paradigm, where each point of the training set is labeled, the labels are only associated with bags, while the labels of the points inside the bags are unknown. We focus on the binary classification case, where the objective is to discriminate between positive and negative bags using a separating surface. Adopting a support vector machine setting at the training level, the problem of minimizing the classification-error function can be formulated as a nonconvex nonsmooth unconstrained program. We propose a difference-of-convex (DC) decomposition of the nonconvex function, which we tackle using an appropriate nonsmooth DC algorithm. Some numerical results on benchmark data sets are reported.

**Keywords:** multiple instance learning; support vector machine; DC optimization; nonsmooth optimization

#### **1. Introduction**

Multiple instance learning (MIL) is a recent machine learning paradigm [1–3], which consists of classifying sets of points. Each set is called a bag, while the points inside the bags are called instances. The main characteristic of an MIL problem is that in the learning phase the instance labels are hidden and only the labels of the bags are known.

A seminal MIL paper is [4], where a drug-design problem was addressed. Such a problem consists of determining whether a drug molecule (bag) is active or non-active. A molecule provides the desired drug effect (positive label) if, and only if, at least one of its conformations (instances) binds to the target site. The crucial issue is that it is not known a priori which conformation makes the molecule active.

Some MIL applications are image classification [5–8], drug discovery [9,10], classification of text documents [11], bankruptcy prediction [12], and speaker identification [13].

For this kind of problem, there are various solutions in the literature that fall into three different classes: instance-space approaches, bag-space approaches, and embedding-space approaches. In instance-space approaches, classification is performed at the instance level, finding a separation surface directly in the instance space, without looking at the global structure of the bags; the label of each bag is determined as an aggregation of the labels of its corresponding instances. Vice versa, in bag-space approaches (for example, see [14–16]), the separation is performed at a global level, considering the bag as a whole entity. A compromise between these two kinds of approaches is constituted by embedding-space techniques, where each bag is represented by one feature vector and the classification is consequently performed in the instance space. An example of an embedding-space approach is presented in [17].

The method we propose uses the instance-space approach and provides a separation hyperplane for the binary case, where the objective is to discriminate between positive and negative bags. We start from the standard MIL assumption stating that a bag is positive if, and only if, at least one of its instances is positive and it is negative whenever all its instances are negative.

Some examples of linear instance-space MIL classifiers can be found in [18–22]. In particular, in [18], two different models have been proposed. The first one is a mixed-integer nonlinear optimization problem, solved using a heuristic technique based on the block coordinate descent method [23] and tackled in [19] using a Lagrangian relaxation technique. The second model, which will be the object of our analysis in the next section, is a nonsmooth nonconvex optimization problem, solved in [21] using the bundle-type method described in [24]. In [20], a semi-proximal support vector machine (SVM) approach is used, coming from a compromise between the classical SVM [25] and the proximal approach proposed in [26] for supervised classification. Finally, an optimization problem with bilinear constraints is analyzed in [22], where each positive bag is expressed as a convex combination of its instances and a local solution is obtained by solving successive linear programs.

Recently, nonlinear instance-space MIL classifiers have also been proposed in the literature, such as in [27] and in [28], where a spherical separation approach is adopted: in particular, in the former a variable neighborhood search method [29] is used, while in the latter a DC (difference of convex) model is solved using an appropriate DC algorithm [30]. In passing, we stress that many DC models have been introduced in machine learning, in the supervised [31–35], semisupervised [36,37] and unsupervised cases [38–40].

In this work, we propose a DC optimization model providing a linear classifier for binary MIL problems. The solution method we adopt is the proximal bundle method introduced in [30] for the minimization of nonsmooth DC functions. The paper is organized as follows. In the next two sections, we describe, respectively, the DC optimization model and the corresponding nonsmooth solution algorithm. Finally, in Section 4, we report the results of our numerical experimentation performed on some data sets drawn from the literature.

#### **2. A DC Decomposition of the SVM-Based MIL**

We tackle a binary MIL problem whose objective is to discriminate between *m* positive bags and *k* negative ones using a hyperplane

$$H(w, b) \triangleq \{ \mathbf{x} \in \mathbb{R}^n \mid w^T \mathbf{x} + b = 0 \},$$

where $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$. Indicating by $J_i^+$, $i = 1, \ldots, m$, the index set of the instances belonging to the $i$th positive bag and by $J_i^-$, $i = 1, \ldots, k$, the index set of the instances belonging to the $i$th negative bag, we recall that, on the basis of the standard MIL assumption, a bag is positive if, and only if, at least one of its instances is positive, and it is negative otherwise. As a consequence, while a positive bag is allowed to straddle the hyperplane, the negative bags should lie completely on the negative side.

More formally, indicating by $x_j \in \mathbb{R}^n$ the $j$th instance of a positive or negative bag, the hyperplane $H(w, b)$ performs a correct separation if, and only if, the following conditions hold:

$$\begin{cases} w^T x_j + b \ge 1, & \text{for at least one index } j \in J_i^+ \text{ and for all } i = 1, \dots, m, \\ w^T x_j + b \le -1, & \text{for all } j \in J_i^- \text{ and for all } i = 1, \dots, k. \end{cases}$$


As a consequence (see Figures 1 and 2), a positive bag $J_i^+$, $i = 1, \ldots, m$, is misclassified if

$$\max\_{j \in J\_i^+} (w^T x\_j + b) < 1$$

and a negative one $J_i^-$, $i = 1, \ldots, k$, is misclassified if

$$\max\_{j \in J\_i^{-}} (w^T x\_j + b) > -1.$$
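The two misclassification tests above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy data are assumptions made here, not part of the original experiments.

```python
from typing import List, Sequence

Vector = Sequence[float]


def score(w: Vector, b: float, x: Vector) -> float:
    """Return w^T x + b for a single instance x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def positive_bag_misclassified(w: Vector, b: float, bag: List[Vector]) -> bool:
    """A positive bag is misclassified when no instance reaches the +1 margin."""
    return max(score(w, b, x) for x in bag) < 1.0


def negative_bag_misclassified(w: Vector, b: float, bag: List[Vector]) -> bool:
    """A negative bag is misclassified when some instance lies above the -1 margin."""
    return max(score(w, b, x) for x in bag) > -1.0


# Toy data in R^2 (hypothetical values, for illustration only).
w, b = [1.0, 0.0], 0.0
positive_bag = [[-3.0, 0.0], [2.0, 1.0]]   # one instance is beyond the +1 margin
negative_bag = [[-2.0, 0.0], [-0.5, 1.0]]  # one instance is too close to H(w, b)

print(positive_bag_misclassified(w, b, positive_bag))
print(negative_bag_misclassified(w, b, negative_bag))
```

Note that both tests only need the largest score in the bag, which is why a positive bag may straddle the hyperplane while a negative bag may not.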

**Figure 2.** Negative bag $J_i^-$.

We thus obtain the following error function, already introduced in [18]:

$$f(w, b) \triangleq \frac{1}{2} \|w\|^2 + C \left[ \sum_{i=1}^{m} \max\{0, 1 - \max_{j \in J_i^+} (w^T x_j + b) \} + \sum_{i=1}^{k} \max\{0, 1 + \max_{j \in J_i^-} (w^T x_j + b) \} \right],\tag{1}$$

where *C* > 0 represents the trade-off between two objectives: the maximization of the separation margin, characterizing the classical SVM [25] approach, and the minimization of the classification error.
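As a hedged illustration of Equation (1), the following Python sketch evaluates the error function on a toy problem. The name `mil_error` and the data are hypothetical choices made for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]
Bag = List[Vector]


def best_score(w: Vector, b: float, bag: Bag) -> float:
    """Largest value of w^T x + b over the instances of a bag."""
    return max(sum(wi * xi for wi, xi in zip(w, x)) + b for x in bag)


def mil_error(w: Vector, b: float, positive_bags: List[Bag],
              negative_bags: List[Bag], C: float) -> float:
    """Evaluate f(w, b): margin term plus C times the bag-level hinge losses."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    positive_loss = sum(max(0.0, 1.0 - best_score(w, b, bag)) for bag in positive_bags)
    negative_loss = sum(max(0.0, 1.0 + best_score(w, b, bag)) for bag in negative_bags)
    return margin_term + C * (positive_loss + negative_loss)


# Hypothetical toy data: two positive bags and one negative bag in R^2.
positive_bags = [[[-3.0, 0.0], [2.0, 1.0]], [[0.5, 0.0]]]
negative_bags = [[[-2.0, 0.0], [-0.5, 1.0]]]

value = mil_error([1.0, 0.0], 0.0, positive_bags, negative_bags, C=2.0)
print(value)
```

Here the first positive bag contributes no loss, while the second positive bag and the negative bag each contribute 0.5, so the value is 0.5 + 2 · (0.5 + 0.5) = 2.5.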

To minimize function *f*, we propose a DC decomposition based on the following formula:

$$\max\{0, 1 - h(y)\} = \max\{1, h(y)\} - h(y),\tag{2}$$


*Algorithms* **2019**, *12*, 249

where *h* is a convex function. By applying Equation (2) to our case, we can write *f* in the form:

$$f(w, b) = f\_1(w, b) - f\_2(w, b),$$

where

$$f_1(w, b) \triangleq \frac{1}{2} \|w\|^2 + C \sum_{i=1}^k \max\{0, 1 + \max_{j \in J_i^-} (w^T x_j + b)\} + C \sum_{i=1}^m \max\{1, \max_{j \in J_i^+} (w^T x_j + b)\}$$

and

$$f_2(w, b) \triangleq C \sum_{i=1}^m \max_{j \in J_i^+} (w^T x_j + b)$$

are convex functions. Hence, we come up with the following nonconvex nonsmooth optimization problem, DC-MIL,

$$\min\_{w,b} \left[ f\_1(w,b) - f\_2(w,b) \right]. \tag{3}$$
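The decomposition can be checked numerically. The sketch below, which uses hypothetical toy bags and helper names introduced only for this example, evaluates the error function and its two convex components at random points and records the largest discrepancy between the error function and the difference of the components.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]
Bag = List[Vector]


def best_score(w: Vector, b: float, bag: Bag) -> float:
    """Largest value of w^T x + b over the instances of a bag."""
    return max(sum(wi * xi for wi, xi in zip(w, x)) + b for x in bag)


def f(w, b, pos: List[Bag], neg: List[Bag], C: float) -> float:
    """Error function: margin term plus bag-level hinge losses."""
    return (0.5 * sum(wi * wi for wi in w)
            + C * sum(max(0.0, 1.0 - best_score(w, b, bag)) for bag in pos)
            + C * sum(max(0.0, 1.0 + best_score(w, b, bag)) for bag in neg))


def f1(w, b, pos: List[Bag], neg: List[Bag], C: float) -> float:
    """First convex component of the DC decomposition."""
    return (0.5 * sum(wi * wi for wi in w)
            + C * sum(max(0.0, 1.0 + best_score(w, b, bag)) for bag in neg)
            + C * sum(max(1.0, best_score(w, b, bag)) for bag in pos))


def f2(w, b, pos: List[Bag], C: float) -> float:
    """Second convex component: C times the sum of the positive-bag maxima."""
    return C * sum(best_score(w, b, bag) for bag in pos)


# Hypothetical toy data and a fixed seed so the check is reproducible.
pos = [[[2.0, 1.0], [-3.0, 0.0]], [[0.5, 0.0]]]
neg = [[[-2.0, 0.0], [-0.5, 1.0]]]
random.seed(0)
max_gap = 0.0
for _ in range(1000):
    w = [random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0)]
    b = random.uniform(-5.0, 5.0)
    gap = abs(f(w, b, pos, neg, 2.0) - (f1(w, b, pos, neg, 2.0) - f2(w, b, pos, 2.0)))
    max_gap = max(max_gap, gap)

print(max_gap < 1e-9)
```

The check relies only on the identity max{0, 1 − h} = max{1, h} − h applied to each positive bag, so any discrepancy is due to floating-point rounding.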

#### **3. Solving DC-MIL using a Nonsmooth DC Algorithm**

We start by recalling some preliminary properties of the DC optimization problem, adopting the same notation as above. Given the DC optimization problem

$$\min_{y} [f_1(y) - f_2(y)] \tag{4}$$

where both $f_1$ and $f_2$ are convex nonsmooth functions, we say that a point $y^*$ is a local minimizer if $f_1(y^*) - f_2(y^*)$ is finite and there exists a neighborhood $\mathcal{N}$ of $y^*$ such that

$$f\_1(y^\*) - f\_2(y^\*) \le f\_1(y) - f\_2(y), \ \forall y \in \mathcal{N}.\tag{5}$$

Considering that, in general, the Clarke subdifferential calculus cannot be used to compute subgradients of the DC function since

$$
\partial\_{cl} f(y) \subseteq \partial f\_1(y) - \partial f\_2(y), \tag{6}
$$

where $\partial_{cl} f(\cdot)$ denotes Clarke's subdifferential, different stationary points can be defined for nonsmooth DC functions. A point $y^*$ is called inf-stationary for problem (4) if

$$
\emptyset \neq \partial f_2(y^*) \subseteq \partial f_1(y^*). \tag{7}
$$

Furthermore, a point $y^*$ is called Clarke stationary for problem (4) if

$$0 \in \partial_{cl} f(y^*), \tag{8}$$

while it is called a critical point of *f* if

$$
\partial f_2(y^*) \cap \partial f_1(y^*) \neq \emptyset. \tag{9}
$$

Denoting the set of inf-stationary points by $S_{inf}$, the set of Clarke stationary points by $S_{cl}$, and the set of critical points of the function *f* by $S_{cr}$, the following inclusions hold

$$\mathcal{S}\_{inf} \subseteq \mathcal{S}\_{cl} \subseteq \mathcal{S}\_{cr}$$

as shown in (Proposition 3, [30]).
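As an illustrative one-dimensional example, not taken from [30], the last inclusion can be strict. Consider the DC components

$$f_1(y) = \max\{0, 2y\}, \qquad f_2(y) = |y|, \qquad f(y) = f_1(y) - f_2(y) = y, \quad y \in \mathbb{R}.$$

At the point $y^* = 0$ we have

$$\partial f_1(0) = [0, 2], \qquad \partial f_2(0) = [-1, 1], \qquad \partial_{cl} f(0) = \{1\}.$$

Since $\partial f_1(0) \cap \partial f_2(0) = [0, 1] \neq \emptyset$, the point $y^* = 0$ is critical. However, $0 \notin \partial_{cl} f(0)$, so it is not Clarke stationary, and $\partial f_2(0) \not\subseteq \partial f_1(0)$, so it is not inf-stationary either. The same example shows that the inclusion between Clarke's subdifferential and the difference of the component subdifferentials can be strict, as $\{1\} \subsetneq [0, 2] - [-1, 1] = [-1, 3]$.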

Nonsmooth DC functions have attracted the interest of several researchers, both from the theoretical and from the algorithmic viewpoint. Focusing in particular on the algorithmic side, the most relevant contribution has probably been provided by the methods based on the linearization of function $f_2$ (see [41] and references therein), where the problem is tackled via successive convexifications of function *f*. In recent years, nonsmooth-tailored DC programming has received considerable attention, as it has many practical applications (see [28,42]). In fact, several nonsmooth DC algorithms have been developed ([30,43–47]).

Here, we adopt the algorithm DCPCA, a bundle-type method introduced in [30] to solve problem (4), which is based on a model function built by combining two convex piecewise approximations, each related to one component function. In more detail, a simplified version of DCPCA works as follows:


In fact, DCPCA is based on constructing a model function as the pointwise maximum of several concave piecewise-affine pieces. To construct this model, starting from some cutting-plane ideas, the information coming from the two component functions is kept separate in two bundles. We denote the stability center by *z* (i.e., an estimate of the minimizer), and by *I* and *L* the index sets of the points generated by the algorithm where the information of functions $f_1$ and $f_2$ has been evaluated, respectively. Therefore, we denote the two bundles of information as

$$\mathcal{B}_1 = \{ (g_i^{(1)}, \alpha_i^{(1)}) : i \in I \}$$

and

$$\mathcal{B}_2 = \{ (g_l^{(2)}, \alpha_l^{(2)}) : l \in L \},$$

where, for every $i \in I$, $g_i^{(1)} \in \partial f_1(y_i)$ with

$$\alpha_i^{(1)} = f_1(z) - \left(f_1(y_i) + g_i^{(1)T}(z - y_i)\right)$$

and, for every $l \in L$, $g_l^{(2)} \in \partial f_2(y_l)$ with

$$\alpha_l^{(2)} = f_2(z) - \left(f_2(y_l) + g_l^{(2)T}(z - y_l)\right).$$

We remark that both component functions, along with their subgradients, could be evaluated at some iterate-point, and, indeed, we assume that $(g^{(1)}(z), 0) \in \mathcal{B}_1$ and $(g^{(2)}(z), 0) \in \mathcal{B}_2$, where $g^{(1)}(z) \in \partial f_1(z)$ and $g^{(2)}(z) \in \partial f_2(z)$.
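To make the bundle elements concrete, the following sketch computes a linearization error for a simple convex function. The choice of the squared Euclidean norm as the component function is an assumption made only for illustration; it is not one of the DC-MIL components.

```python
from typing import Callable, List

Vector = List[float]


def linearization_error(func: Callable[[Vector], float], z: Vector,
                        y: Vector, g: Vector) -> float:
    """Return func(z) - (func(y) + g^T (z - y)), with g a subgradient of func at y."""
    slope_term = sum(gk * (zk - yk) for gk, zk, yk in zip(g, z, y))
    return func(z) - (func(y) + slope_term)


def sq_norm(y: Vector) -> float:
    """Hypothetical convex test function: the squared Euclidean norm."""
    return sum(yk * yk for yk in y)


def sq_norm_gradient(y: Vector) -> Vector:
    """Gradient of the squared norm, which is its only subgradient."""
    return [2.0 * yk for yk in y]


z = [1.0, 0.0]        # stability center
y_i = [0.0, 2.0]      # a previously generated point
alpha_i = linearization_error(sq_norm, z, y_i, sq_norm_gradient(y_i))
alpha_z = linearization_error(sq_norm, z, z, sq_norm_gradient(z))

print(alpha_i, alpha_z)
```

For this particular function the linearization error equals the squared distance between the stability center and the bundle point, so it is nonnegative and vanishes at the stability center, in agreement with the assumption that the pair formed by a subgradient at *z* and a zero error belongs to each bundle.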

To approximate the difference function

$$
\left(f\_1(z+d) - f\_2(z+d)\right) - \left(f\_1(z) - f\_2(z)\right),
$$

at a given iteration *k* the following nonconvex model function Γ*k*(*d*) is introduced

$$\Gamma_k(d) \triangleq \max_{i \in I} \min_{l \in L} \left\{ \left( g_i^{(1)} - g_l^{(2)} \right)^T d - \alpha_i^{(1)} + \alpha_l^{(2)} \right\},\tag{10}$$

which is defined as the maximum of finitely many concave piecewise-affine functions. The model function $\Gamma_k$ is used to state a sufficient descent condition of the type

$$\left(f\_1(z+d) - f\_2(z+d)\right) - \left(f\_1(z) - f\_2(z)\right) \le m\Gamma\_k(d)$$

where $m \in (0, 1)$. The interesting property of such a model function is that whenever sufficient descent is not achieved at a point close to the stability center, say $z + \bar{d}$, an improved cutting-plane model can be obtained by only updating the bundle of $f_1$ with the appropriate information related to the point $z + \bar{d}$. On the other hand, given its nonconvexity, the minimization of the model function $\Gamma_k$ is clearly difficult to adopt as a building block of any algorithm. In fact, DCPCA does not require the direct minimization of $\Gamma_k(d)$; the search direction can instead be obtained by solving the following auxiliary quadratic problem:

$$\begin{aligned} \min_{d \in \mathbb{R}^n, v \in \mathbb{R}} \quad & v + \frac{1}{2} \|d\|^2 \\ \text{s.t.} \quad & v \ge (g_i^{(1)} - g_{\bar{l}}^{(2)})^T d - \alpha_i^{(1)} \quad \forall i \in I, \end{aligned} \tag{$QP(I)$}$$

where $\bar{l} \in L(0) \triangleq \{l \in L : \alpha_l^{(2)} = 0\}$. We observe that $L(0) \neq \emptyset$, as $\mathcal{B}_2$ is assumed to contain the information about the current stability center. More precisely, DCPCA works by forcing $L(0)$ to be a singleton, hence by letting $g_{\bar{l}}^{(2)} = g^{(2)}(z)$. Denoting the unique optimal solution of $QP(I)$ by $(\bar{d}, \bar{v})$, a standard duality argument ensures that

$$\bar{d} = -\sum_{i \in I} \bar{\lambda}_i (g_i^{(1)} - g_{\bar{l}}^{(2)}) \tag{11}$$

$$\bar{v} = -\left\| \sum_{i \in I} \bar{\lambda}_i (g_i^{(1)} - g_{\bar{l}}^{(2)}) \right\|^2 - \sum_{i \in I} \bar{\lambda}_i \alpha_i^{(1)} \tag{12}$$

where $\bar{\lambda}_i \ge 0$, $i \in I$, are the optimal variables of the dual of $QP(I)$, with $\sum_{i \in I} \bar{\lambda}_i = 1$.
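The duality relations above suggest one simple way to compute the search direction: minimize the dual objective, half the squared norm of the aggregated vector plus the aggregated linearization error, over the unit simplex, and then recover the primal solution from the multipliers. The sketch below does this with a basic Frank-Wolfe iteration with exact line search. This solver is an assumption chosen for self-containedness; it is not the QP solver used in the experiments, and the function name and toy bundle are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def search_direction(g1: List[Vector], alpha1: List[float], g2_bar: Vector,
                     max_iter: int = 1000, tol: float = 1e-10
                     ) -> Tuple[Vector, float, List[float]]:
    """Solve the dual of the auxiliary QP on the simplex and recover (d, v, lambda)."""
    n, m = len(g2_bar), len(g1)
    s = [[gi[k] - g2_bar[k] for k in range(n)] for gi in g1]  # g_i^(1) - g_lbar^(2)
    lam = [1.0 / m] * m                                       # start at the barycenter

    def aggregate(weights: List[float]) -> Vector:
        return [sum(weights[i] * s[i][k] for i in range(m)) for k in range(n)]

    for _ in range(max_iter):
        p = aggregate(lam)
        grad = [dot(s[i], p) + alpha1[i] for i in range(m)]   # dual gradient
        j = min(range(m), key=grad.__getitem__)               # best simplex vertex
        direction = [-lam[i] for i in range(m)]
        direction[j] += 1.0
        gap = -dot(grad, direction)                           # Frank-Wolfe gap
        if gap <= tol:
            break
        q = aggregate(direction)
        curvature = dot(q, q)
        step = 1.0 if curvature == 0.0 else min(1.0, gap / curvature)
        lam = [lam[i] + step * direction[i] for i in range(m)]

    p = aggregate(lam)
    d = [-pk for pk in p]                                     # primal direction
    v = -dot(p, p) - dot(lam, alpha1)                         # predicted reduction
    return d, v, lam


# Hypothetical one-dimensional bundle with two elements.
d_bar, v_bar, lam_bar = search_direction([[1.0], [-1.0]], [0.0, 0.5], [0.0])
print(d_bar, v_bar, lam_bar)
```

For the toy bundle above, the recovered value of *v* coincides with the largest constraint right-hand side evaluated at the recovered direction, which is a quick consistency check between the primal and dual solutions.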

Given any starting point $z = y_0$, DCPCA returns an approximate critical point $z^*$, see (Theorem 1, [30]). The following parameters are adopted: the optimality parameter *θ* > 0, the subgradient threshold *η* > 0, the linearization-error threshold *ε* > 0, the approximate line-search parameter *m* ∈ (0, 1), and the step-size reduction parameter *σ* ∈ (0, 1). In Algorithm 1, we report an algorithmic scheme of the main iteration, namely of the set of steps where the stability center is unchanged. An exit from the main iteration is obtained as soon as a stopping criterion is satisfied or whenever the stability center is updated. To make the presentation clearer, without impairing convergence properties, we skip the description of some rather technical steps, which are strictly related to the management of bundle $\mathcal{B}_2$. Details can be found in [30].
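The model function of Equation (10) and the sufficient-descent test governed by the line-search parameter *m* can be sketched as follows. The one-dimensional component functions and the bundle contents are hypothetical, chosen only so that the values are easy to verify by hand; this is not the DCPCA implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Bundle = List[Tuple[Vector, float]]  # pairs (subgradient, linearization error)


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def model_value(d: Vector, bundle1: Bundle, bundle2: Bundle) -> float:
    """Gamma(d): max over bundle 1 of the min over bundle 2 of the affine pieces."""
    return max(
        min(dot(g1, d) - dot(g2, d) - a1 + a2 for g2, a2 in bundle2)
        for g1, a1 in bundle1
    )


def sufficient_descent(f1: Callable[[Vector], float], f2: Callable[[Vector], float],
                       z: Vector, d: Vector, bundle1: Bundle, bundle2: Bundle,
                       m: float = 0.01) -> bool:
    """Check whether the actual reduction is at most m times the model value."""
    z_new = [zk + dk for zk, dk in zip(z, d)]
    actual = (f1(z_new) - f2(z_new)) - (f1(z) - f2(z))
    return actual <= m * model_value(d, bundle1, bundle2)


# Hypothetical components on the real line: f1(y) = y^2 and f2(y) = |y|.
f1 = lambda y: y[0] ** 2
f2 = lambda y: abs(y[0])
z = [2.0]
bundle1 = [([4.0], 0.0)]   # gradient of f1 at z, zero linearization error
bundle2 = [([1.0], 0.0)]   # subgradient of f2 at z, zero linearization error

print(model_value([-3.0], bundle1, bundle2))
print(sufficient_descent(f1, f2, z, [-3.0], bundle1, bundle2))
print(sufficient_descent(f1, f2, z, [1.0], bundle1, bundle2))
```

With a single element in each bundle the model reduces to the affine function 3*d*, so the step *d* = −3 predicts a reduction of 9 and achieves an actual reduction of 2, which passes the test for *m* = 0.01, whereas the ascent step *d* = 1 fails it.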

**Algorithm 1** DCPCA Main Iteration


We remark that the stopping condition *v*¯ ≥ −*θ*, checked at Step 2 of the DCPCA, is an approximate *θ*-criticality condition for *z*∗. Indeed, taking into account Equation (12), the stopping condition ensures that

$$\left\|\sum_{i\in I} \lambda_i^* g_i^{(1)} - g_{\bar{l}}^{(2)}\right\| \le \sqrt{\theta} \quad \text{and} \quad \left\|\sum_{i\in I} \lambda_i^* \alpha_i^{(1)}\right\| \le \sqrt{\theta},$$

which in turn implies that there exist $g_*^{(1)} \in \partial_{\theta} f_1(z^*)$ and $g_*^{(2)} \in \partial f_2(z^*)$ such that

$$\left\| g_*^{(1)} - g_*^{(2)} \right\|^2 \le \theta,$$

namely, that

$$\text{dist}\left(\partial_{\theta}f_1(z^*), \partial f_2(z^*)\right) \le \theta,$$

an approximate *θ*-criticality condition for *z*∗, see Equation (9).

#### **4. Results**

We tested the performance of the algorithm DCPCA applied to the DC-MIL formulation (3) by adopting two sets of medium- and large-size problems extracted from [18]. The relevant characteristics of each problem are reported in Tables 1 and 2, where we list the problem dimension *n*, the number of instances, and the number of bags.


**Table 1.** Medium-size test problems.

**Table 2.** Large-size test problems.


The two-level cross-validation protocol has been used to tune *C* and to train the classifier. Before proceeding with the training phase, the model-selection phase is aimed at finding a promising value of parameter *C* in the set $\{2^{-7}, 2^{-6}, \ldots, 1, \ldots, 2^{6}, 2^{7}\}$, using a lower-level cross-validation protocol on each training set. The selected *C* value, for each training set, is the one returning the highest average test-correctness in the model-selection phase.

Choosing a good starting point is a critical phase to ensure good performance for a local optimization algorithm like DCPCA. For each training set, denoting the barycenter of all the instances belonging to positive bags by $\overline{w}_+$ and the barycenter of all the instances belonging to negative bags by $\overline{w}_-$, we have selected the starting point $(w_0, b_0)$ by setting
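A minimal sketch of the model-selection step is given below. The grid matches the set of candidate values stated above, while the function name `select_C` and the correctness figures are hypothetical placeholders, not results from the experiments.

```python
from typing import Dict, List

# Candidate values 2^-7, 2^-6, ..., 1, ..., 2^6, 2^7.
C_GRID: List[float] = [2.0 ** e for e in range(-7, 8)]


def select_C(inner_correctness: Dict[float, List[float]]) -> float:
    """Return the C value with the highest average correctness over the inner folds."""
    return max(
        inner_correctness,
        key=lambda C: sum(inner_correctness[C]) / len(inner_correctness[C]),
    )


# Hypothetical inner-fold correctness values (%) for three candidate values of C.
example_scores = {0.5: [80.0, 82.0], 1.0: [85.0, 83.0], 2.0: [70.0, 90.0]}
best_C = select_C(example_scores)

print(len(C_GRID), C_GRID[0], C_GRID[-1])
print(best_C)
```

In an actual run, the dictionary would hold one list of lower-level fold results for each of the 15 grid values, computed separately on every training set.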

$$
w\_0 = \overline{w}\_+ - \overline{w}\_- \tag{13}$$

and choosing $b_0$ such that the corresponding hyperplane correctly classifies all the positive bags.

We adopted the Java implementation of algorithm DCPCA, running the computational experiments on a 3.50 GHz Intel Core i7 computer. We limited the computational budget for every execution of DCPCA to 500 and 200 evaluations of the objective function for medium-size and large-size problems, respectively, and we restricted the size of the bundle to 100 elements, adopting a restart strategy as soon as the bundle size exceeds the threshold and a new stability center is obtained. The QP solver of IBM ILOG CPLEX 12.8 has been used to solve quadratic subprograms. The following set of parameters, according to the notation introduced in [45], has been selected: the optimality parameter *θ* = 0.7, the subgradient threshold *η* = 0.7, the approximate line-search parameter *m* = 0.01, the step-size reduction parameter *σ* = 0.01, and the linearization-error threshold *ε* = 0.95.
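The starting-point rule can be sketched as follows. The text does not specify how the bias is computed, so the rule used here (the smallest shift for which the best instance of every positive bag reaches the +1 margin) is an assumption, as are the function names and the toy data.

```python
from typing import List, Tuple

Vector = List[float]
Bag = List[Vector]


def barycenter(points: List[Vector]) -> Vector:
    """Componentwise mean of a nonempty list of points."""
    n = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(n)]


def starting_point(positive_bags: List[Bag], negative_bags: List[Bag]) -> Tuple[Vector, float]:
    """Return (w0, b0) with w0 the difference between the two barycenters."""
    w_plus = barycenter([x for bag in positive_bags for x in bag])
    w_minus = barycenter([x for bag in negative_bags for x in bag])
    w0 = [a - b for a, b in zip(w_plus, w_minus)]
    # Assumed rule: make the best instance of the worst positive bag sit on the +1 margin.
    best_scores = [max(sum(wk * xk for wk, xk in zip(w0, x)) for x in bag)
                   for bag in positive_bags]
    b0 = 1.0 - min(best_scores)
    return w0, b0


# Hypothetical toy data in R^2.
positive_bags = [[[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [2.0, 0.0]]]
negative_bags = [[[-2.0, 0.0], [-2.0, 0.0]]]
w0, b0 = starting_point(positive_bags, negative_bags)

print(w0, b0)
```

With this choice every positive bag has at least one instance satisfying the +1 margin condition, while nothing is guaranteed for the negative bags, which is left to the optimization.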

We compare our DC-MIL approach against the algorithms mi-SVM [18], MI-SVM [18], MICA [22], MIL-RL [19], and for medium-size problems also against the MIC*Bundle* [21] and DC-SMIL [28]. All such methods have been briefly surveyed in the introduction section.

To analyze the reliability of our approach, in Tables 3 and 4 we report the numerical results in terms of the percentage test-correctness averaged over 10 folds, with the best performance being underlined. We remark that some data are not reported in Table 5 as the corresponding results are obtained by adopting only nonlinear kernels in [18,22]. Moreover, to provide some insight into the efficiency of DC-MIL, we report in Tables 5 and 6 the average train-correctness (**train**, %), the average CPU time (**cpu**, sec), the average number of function evaluations (**nF**), and the average number of subgradient evaluations of the two functions (**nG1** and **nG2**). The reliability results show a good and balanced performance of the DC-MIL approach equipped with DCPCA, both for the medium-size problems, where in one case DC-MIL slightly outperforms the other approaches, and for the large-size problems. Moreover, we observe that our approach appears efficient, as it manages to achieve high train-correctness in reasonably small execution times even for large-size problems.

**Table 3.** Average test-correctness (%) for medium-size problems.


Underlining indicates the best performance.


**Table 4.** Average test-correctness (%) for large-size problems.

Underlining indicates the best performance.

**Table 5.** DC-MIL average efficiency. Medium-size test problems.


**Table 6.** DC-MIL average efficiency. Large-size test problems.


#### **5. Conclusions**

We have considered a multiple instance learning problem consisting of classifying sets instead of single points. The resulting binary classification problem, addressed by a support vector machine approach, is formulated as an unconstrained nonsmooth optimization problem for which an original DC decomposition is presented. The problem is solved by a proximal bundle-type method, specialized for nonsmooth DC optimization, which is tested on some benchmark datasets against a set of state-of-the-art approaches. The numerical results in terms of reliability show, on the one hand, that no method outperforms the others on all the test problems and, on the other hand, that our method achieves performance comparable with that of the other approaches. Moreover, the encouraging results obtained in terms of efficiency show that there is room for improvement by further investigating the parameter settings in relation to specific test problems.

**Author Contributions:** Methodology, A.A., A.F., G.G., G.M.; software, A.A., A.F., G.G., G.M.; writing–review & editing, A.A., A.F., G.G., G.M.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

MIL Multiple instance learning


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Algorithms* Editorial Office E-mail: algorithms@mdpi.com www.mdpi.com/journal/algorithms
