**2. Background**

## *2.1. Differential Evolution*

Differential evolution (DE) is an evolutionary algorithm for optimization over continuous spaces, which operates by improving a population of *N* candidate solutions, evaluated by means of a fitness function *f*, through an iterative process. The first phase is the initialization, in which the first population is generated; various approaches exist, the most common being the random generation of each vector. Then, during the iterative phase, at each generation a new population is computed through mutation and crossover operators; each new vector is evaluated and the best ones are chosen, according to a selection operator, for the next generation. The evolution may proceed for a fixed number of generations or until a given criterion is met.

The mutation used in DE is called *differential mutation*. For each *target vector* *xi*, for *i* = 1, ... , *N*, of the current generation, a vector *y*¯*i*, called the *donor vector*, is calculated as a linear combination of some vectors of the DE population, selected according to a given strategy. In the literature, there exist many variants of the mutation operator (see for instance [32]). In this work, we implemented and used three operators: rand/1 [33], current\_to\_*p*best [34], and DEGL [35].

The operator rand/1 is defined as

$$
\bar{y}\_i = \mathbf{x}\_a + F(\mathbf{x}\_b - \mathbf{x}\_c) \tag{1}
$$

where *F* ∈ [0, 2] is a real parameter called *mutation factor*, and *a*, *b*, *c* are distinct random indices, all different from *i*.
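A minimal Python sketch of the rand/1 operator may help make the index constraints concrete (this is our illustrative code, not the authors' C++ implementation; names are ours):

```python
import numpy as np

def rand_1(pop, i, F, rng):
    """DE rand/1 mutation (Equation (1)): build the donor vector for target i.

    pop is an (N, D) array of candidate solutions and F the mutation factor;
    a, b, c are distinct random indices, all different from i.
    """
    candidates = [j for j in range(len(pop)) if j != i]
    a, b, c = rng.choice(candidates, size=3, replace=False)
    return pop[a] + F * (pop[b] - pop[c])
```

Note that drawing *a*, *b*, *c* without replacement from the indices excluding *i* enforces their mutual distinctness.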

The operator current\_to\_*p*best is defined as

$$
\bar{y}\_i = \mathbf{x}\_i + F(\mathbf{x}\_{\text{pbest}} - \mathbf{x}\_i) + F(\mathbf{x}\_a - \mathbf{x}\_b) \tag{2}
$$

where *p* ∈ (0, 1] and *pbest* is an index randomly selected from the indices of the best *N* × *p* individuals of the population. Moreover, *xb* is an individual randomly chosen from the set

$$\{\mathbf{x}\_1, \dots, \mathbf{x}\_N\} \setminus \{\mathbf{x}\_a, \mathbf{x}\_i\} \cup \mathcal{A}$$

where A is an external archive of bounded size (usually with at most *N* individuals) that contains the individuals discarded by the selection operator.
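A sketch of this operator, assuming the population is kept sorted by fitness so that the *p*-best pool is simply its first rows (an assumption of this illustration, not stated in the text):

```python
import numpy as np

def current_to_pbest(pop, archive, i, F, p, rng):
    """current-to-pbest/1 mutation (Equation (2)). We assume `pop` is sorted
    by fitness, best first, so the first ceil(N*p) rows form the p-best pool."""
    N = len(pop)
    pbest = rng.integers(0, max(1, int(np.ceil(N * p))))
    a = rng.choice([j for j in range(N) if j != i])
    # x_b is drawn from the population, minus the target and x_a, united
    # with the external archive A of discarded individuals
    pool = [pop[j] for j in range(N) if j not in (i, a)] + list(archive)
    xb = pool[rng.integers(0, len(pool))]
    return pop[i] + F * (pop[pbest] - pop[i]) + F * (pop[a] - xb)
```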

Finally, DEGL is defined as

$$\begin{cases} \bar{y}\_i = wL\_i + (1 - w)G\_i \\ L\_i = \mathbf{x}\_i + \alpha(\mathbf{x}\_{\text{nbest}} - \mathbf{x}\_i) + \beta(\mathbf{x}\_a - \mathbf{x}\_b) \\ G\_i = \mathbf{x}\_i + \alpha(\mathbf{x}\_{\text{best}} - \mathbf{x}\_i) + \beta(\mathbf{x}\_c - \mathbf{x}\_d) \end{cases} \tag{3}$$

where *best* is the index of the best individual in the population, *nbest* is the index of the best individual in the neighborhood of the target *xi*, α and β are scale factors, and *w* ∈ [0, 1] is the weight of the convex combination between *Li* and *Gi*.

The crossover operator creates a new vector *yi*, called the *trial vector*, by recombining the donor with the corresponding target vector. There are many kinds of crossover; the best known is the binomial crossover, where *yi* is computed as follows,

$$y\_{i,j} = \begin{cases} \bar{y}\_{i,j} & \text{if } \mathit{rand}\_{i,j} \le CR \text{ or } j = j\_{\mathit{rand}} \\ x\_{i,j} & \text{otherwise} \end{cases} \quad \text{for } j = 1, \ldots, D \tag{4}$$

where *randi*,*j* is a real random number uniformly generated in [0, 1], *jrand* is an integer random number in {1, ... , *D*}, and *CR* ∈ [0, 1] is the crossover probability.

Finally, the selection operator compares each trial vector *yi* with the corresponding target vector *xi* and selects the better of the two for the population of the next generation.
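Putting the three operators together, one DE generation can be sketched as follows (a minimal Python illustration of the generic scheme, assuming a minimization problem; all names are ours):

```python
import numpy as np

def binomial_crossover(target, donor, CR, rng):
    """Binomial crossover (Equation (4)): each coordinate comes from the
    donor with probability CR; coordinate j_rand always comes from the donor."""
    mask = rng.random(len(target)) <= CR
    mask[rng.integers(len(target))] = True   # the j_rand coordinate
    return np.where(mask, donor, target)

def de_generation(pop, fit, f, F, CR, rng):
    """One DE generation: rand/1 mutation, binomial crossover, then
    one-to-one selection of each trial against its target."""
    new_pop, new_fit = pop.copy(), fit.copy()
    for i in range(len(pop)):
        a, b, c = rng.choice([j for j in range(len(pop)) if j != i], 3, replace=False)
        donor = pop[a] + F * (pop[b] - pop[c])
        trial = binomial_crossover(pop[i], donor, CR, rng)
        ft = f(trial)
        if ft <= fit[i]:                     # selection keeps the better vector
            new_pop[i], new_fit[i] = trial, ft
    return new_pop, new_fit
```

Because selection is one-to-one and greedy, the best fitness in the population can never worsen from one generation to the next.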

## 2.1.1. Self-Adaptive Differential Evolution

The DE parameters *F* and *CR* have a strong impact on the evolution and choosing their values is hard. In the literature, there exist many proposals of self-adaptive methods that select the values of *F* and *CR*.

One of the simplest and most popular methods is *jDE* [36]. Each population individual *xi* has its own values *Fi* and *CRi*. The trial individual inherits the values *Fi* and *CRi* from the target, separately, with probability 0.9; otherwise, a new value for *F* and/or *CR* is randomly generated in [0.1, 1] or in [0, 1], respectively. The trial is then created using its own values of *F* and *CR*. If the trial survives the selection phase, it keeps its values of *F* and *CR* in the next generation.
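The jDE inheritance rule fits in a few lines; a sketch (the regeneration probability is written as a parameter `tau`, with the paper's value 0.1 as default):

```python
import random

def jde_inherit(F_target, CR_target, rng, tau=0.1):
    """jDE rule: the trial inherits F and CR from the target with probability
    0.9 each; otherwise a fresh F in [0.1, 1] or CR in [0, 1] is drawn."""
    F = 0.1 + 0.9 * rng.random() if rng.random() < tau else F_target
    CR = rng.random() if rng.random() < tau else CR_target
    return F, CR
```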

Another self-adaptive method is JADE [37], in which the value of *F* is randomly generated from a Cauchy distribution *C*(*μF*, 0.1) and the value of *CR* from the normal distribution *N*(*μCR*, 0.1). The means *μF* and *μCR* of these distributions are initialized to 0.5 and are updated at each generation as

$$
\mu\_F \leftarrow (1-c)\,\mu\_F + c\, m\_L(S\_F),
$$

and

$$
\mu\_{CR} \leftarrow (1-c)\,\mu\_{CR} + c\, m\_A(S\_{CR}),
$$

where *c* ∈ [0, 1] is a learning rate, *mL*(*SF*) is the Lehmer mean of the successful *F* values (i.e., those used to generate trials which are better than their targets), and *mA*(*SCR*) is the arithmetic mean of the successful *CR* values.
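For concreteness, the two update rules can be sketched as follows (our illustration; `c=0.1` is a common choice, not taken from the text):

```python
def lehmer_mean(xs):
    """Lehmer mean of order 2, sum(x^2)/sum(x); it biases the mean
    towards the larger successful F values."""
    return sum(x * x for x in xs) / sum(xs)

def jade_update(mu_F, mu_CR, S_F, S_CR, c=0.1):
    """Update of the JADE distribution means from the sets S_F and S_CR
    of successful parameter values, with learning rate c."""
    if S_F:
        mu_F = (1 - c) * mu_F + c * lehmer_mean(S_F)
    if S_CR:
        mu_CR = (1 - c) * mu_CR + c * sum(S_CR) / len(S_CR)
    return mu_F, mu_CR
```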

A variant of JADE is ShaDE [34], in which the values of *F* and *CR* are generated in the same way as in JADE, but the means of the distributions are randomly selected from a *success history*, which stores the means computed with respect to the successful trials.

Finally, L-ShaDE [38] is an enhancement of ShaDE where the population size is reduced as the generations go on.

## 2.1.2. Self-Adaptive Mutation

There also exist self-adaptive variants of DE which select, for instance at each generation or even for each trial, the mutation operator to be applied among a set of possible choices.

We decided to implement SaMDE [39], a variant of *jDE* that applies an automatic selection of the mutation strategy from a pool of given strategies. Each population individual has its own vector *V* of *o* real numbers, where *o* is the number of mutation operators. The vector *V* is evolved in the same way as the individual itself. The values of *V* are used to randomly choose, by means of the roulette-wheel method, the mutation operator used to create the trial individual.
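The roulette-wheel step can be sketched as follows (operator *k* is chosen with probability proportional to *V*[*k*]; our illustration):

```python
import numpy as np

def roulette_wheel(V, rng):
    """Roulette-wheel choice of a mutation operator: index k is selected
    with probability V[k] / sum(V), as SaMDE does with the per-individual
    vector V of operator weights."""
    V = np.asarray(V, dtype=float)
    return int(rng.choice(len(V), p=V / V.sum()))
```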

## *2.2. Neuroevolution*

The term neuroevolution identifies the research area where evolutionary algorithms are used to construct and train artificial neural networks. It covers a wide range of network architectures and neural models. Most neural learning methods focus on modifying the strengths of neural connections (i.e., their connection weights), whereas other models can optimize the structure of the network, the type of computation performed by individual neurons, and even the learning rules that modify the network during evaluation.

The evolutionary approach dominating the scene of neuroevolution is the genetic approach, by means of genetic algorithms. Typically, to find a network that solves the given task, a population of genetic encodings of neural networks (genotypes) is evolved. The process constitutes an intelligent parallel search towards better genotypes in the space of solutions, and continues until a network with a sufficiently high fitness is found. The generate-and-test loop of evolutionary algorithms is usually applied: (i) each genotype is chosen in turn and decoded into the corresponding neural network, called the phenotype; (ii) the performance of this network is then measured by a fitness value; (iii) after all individuals have been evaluated, genetic operators are applied and the next generation is created.

In the evolution, the individuals with the highest fitness are crossed over and mutated with each other, and replace the genotypes with the lowest fitness in the population.

Conventional neuroevolution (CNE) follows this approach for the network weights [17,22]. This is the most used technique because it is easy to implement yet practical in many domains.

## **3. Related Works**

The first DE-based optimizers for NNs were presented in the late 1990s and early 2000s [40,41], with applications of DE to the problem of feedforward NN training. In recent times, new applications of evolutionary algorithms have been presented in the area of neuroevolution [32].

The dominating evolutionary approach is the genetic one [17,22]: it is used to optimize both the topology and the weights of the network, but in the latter case it is severely limited by being a discrete approach. In the literature, several encodings of the real weights have been proposed, with genes represented either as real-valued strings or as character sequences, which can be interpreted as real values with a specific precision, using for example Gray-coded numbers.

More adaptive approaches have been suggested, for example in [42] or more recently in [43]. In the first paper, the authors presented a dynamic encoding, which depends on the exploitation and exploration phases of the search. In the second one, the authors proposed a self-adaptive encoding, where the string characters are interpreted as a system of particles whose center of mass determines the encoded value. Other adaptive approaches have been developed for network immunization and diffusion in link prediction [44,45].

Moreover, these approaches also use a direct encoding that exploits the particular problem structure.

These methods are not general and not easily extendable to more general cases [17]. In [46], a direct floating-point encoding of the NN's weights is used. Precisely, the authors apply the evolution strategy CMA-ES, a real-valued optimization algorithm, to the well-known reinforcement learning problem of pole balancing.

Among DE applications to neuroevolution, the most related works are [13–15,30,47], even if they apply the evolutionary meta-heuristics in different ways.

In [47], the search exploration is enhanced by a DE algorithm with a modified best mutation operation: the algorithm is used to train the network and the global best value is used as a seed by the backpropagation procedure (BPG).

In [13], three different methods (GA, DE, and EDA) are compared and used to train a simple network architecture with one hidden layer, optimizing also the learning factor and the seed for the weight initialization.

In [14], the authors use the Adaptive DE (ADE) algorithm to calculate the initial weights and thresholds of standard neural networks trained by BPG. The authors demonstrated that the system is effective in solving time series forecasting problems.

In [15], a Limited Evaluation Evolutionary Algorithm (LEEA) is applied to optimize the weights of the network. This paper is related to our paper because we employ a similar batching system, in which minibatches are used in the training phase and are changed after a certain number of generations.

The work in [30] has a strong connection with ours, because the author studied how different mutation operators work to train neural networks. The results showed that DEGL-trig (a composition of DEGL with trigonometric mutation) is the best mutation operator to use with small NNs.

DE and the other enhancement methods permit our algorithm to train neural networks much larger than those used in [15,30]: whereas the largest network handled in [15] has fewer than 1500 weights and the largest in [30] has only 46 weights, we are able to train a feedforward neural network for MNIST with more than 7000 weights.

## **4. The DENN Algorithm**

This section describes Differential Evolution for Neural Networks (DENN). The idea is to apply Differential Evolution to the optimization of the NN's weights, taking into account the structure of the network.

Given a fixed topology and fixed activation functions, a population P is defined as a set of *N* neural networks.

We decided to exploit the DE characteristic of working with continuous values by using a direct codification based on a one-to-one mapping between the weights of the neural network and the individuals in the DE population.

More precisely, let *N* be a feedforward neural network composed of *L* levels. Each level *l* of the network is defined by a real-valued matrix **W**(*l*) and a real-valued vector **b**(*l*), representing, respectively, the connection weights and the bias values. Therefore, each population individual *xi* is defined as a sequence

$$\langle (\hat{\mathbf{W}}^{(i,1)}, \mathbf{b}^{(i,1)}), \dots, (\hat{\mathbf{W}}^{(i,L)}, \mathbf{b}^{(i,L)}) \rangle$$

where $\hat{\mathbf{W}}^{(i,l)}$ is the real-valued vector obtained by linearizing the matrix $\mathbf{W}^{(i,l)}$, for *l* = 1, . . . , *L*.

For a population individual *xi*, we indicate by $x\_i^{(h)}$ its *h*-th component, for *h* = 1, ... , 2*L*. For example, $x\_i^{(h)} = \hat{\mathbf{W}}^{(i,(h+1)/2)}$ if *h* is odd, whereas $x\_i^{(h)} = \mathbf{b}^{(i,h/2)}$ if *h* is even.

Note that, for each solution *xi*, the component $x\_i^{(h)}$ is a vector whose size $d^{(h)}$ depends on the number of neurons in the corresponding level.

The individuals of the population are evolved by applying the mutation and crossover operators in a component-wise way. For instance, the mutation *rand*/1 for the individual *xi* is applied as follows: three indices *a*, *b*, *c* are randomly chosen in the set {1, ... , *N*}\{*i*} without repetition; then, for *h* = 1, ... , 2*L*, the *h*-th component $\bar{y}\_i^{(h)}$ of the donor individual *y*¯*i* is calculated as the linear combination

$$
\bar{y}\_i^{(h)} = \mathbf{x}\_a^{(h)} + F\left(\mathbf{x}\_b^{(h)} - \mathbf{x}\_c^{(h)}\right).
$$
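This component-wise application can be sketched as follows, representing each individual as a list of 2*L* arrays (a Python illustration with our naming, not the authors' C++ code):

```python
import numpy as np

def rand1_component_wise(pop, i, F, rng):
    """rand/1 applied component-wise to a DENN individual, represented as
    a list of 2L arrays [W_hat^(1), b^(1), ..., W_hat^(L), b^(L)]."""
    a, b, c = rng.choice([j for j in range(len(pop)) if j != i], 3, replace=False)
    # the same (a, b, c) triple is used for every component h
    return [xa + F * (xb - xc) for xa, xb, xc in zip(pop[a], pop[b], pop[c])]
```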

The evaluation of a population element *xi* is performed by a fitness function *f* , which is the objective function to be optimized.

As proposed in many other efficient applications, we split the dataset *D* into three different subsets: a training set *TS*, a validation set *VS*, and a test set *ES*. The *TS* is used for the training phase, the *VS* is used at the end of each training phase for a uniform evaluation of the individuals, and *ES* is used on the best neural network in order to evaluate its performance.

As the evaluation phase is the most time-consuming operation and can lead to unacceptable computation times if the fitness is computed on the whole dataset, we decided to use a batching method similar to the one proposed in [15], by partitioning the training set *TS* into *k* batches *B*0, ... , *Bk*−1 of size *b* = |*TS*|/*k*.

Note that the records in each batch should follow the same distribution, to avoid the risk of overfitting, i.e., of generating a model that is unable to generalize.

At each generation, the population is evaluated against only a small number of training samples, given by the size of the current batch, instead of the whole training set. This reduces the computational load, especially on large training sets.

## *Mathematics* **2020**, *8*, 69

To reduce the problems that arise when the batch is changed, and to obtain a smoother transition from one batch to the next, we define a window *U* of size *b*, which is a set of samples taken from the current batch *Bi* and from the next one *Bi*+1.

At the beginning of an epoch, the fitness of all the individuals in P is re-evaluated on the new batch defined by the current window *U*.

The window is changed after *s* generations, by substituting *b*/*r* examples of *U* from *Bi* with *b*/*r* examples taken from *Bi*+<sup>1</sup> and not already present in *U*.

Then, given the *sub-epoch* length *s*, the window passes from one batch to the next in *r* *sub-epochs*, in other words in *rs* generations (we call this period an *epoch*). In this way, the fitness function changes more smoothly and the evolution has more time to learn from the batch, because the window is updated only every *s* generations.

Moreover, the batches are reused in a cyclic way; when the algorithm iterates for more than *k* epochs and thus runs out of available batches, the batch sequence restarts from the first one.
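The window update described above can be sketched as follows (our simplified illustration, assuming hashable samples; after *r* sub-epochs the window coincides with the next batch):

```python
def slide_window(window, next_batch, r):
    """One sub-epoch step of the batching scheme: replace b/r samples of
    the window U with b/r samples of the next batch not already in U."""
    step = len(window) // r                      # b/r records replaced
    incoming = [x for x in next_batch if x not in set(window)][:step]
    return window[step:] + incoming
```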

Since the fitness function also depends on the batch, and a fixed one is needed to compare the individuals across the *epochs*, at the end of every epoch *e* the best individual *x*∗*e* is computed as the NN in P that reaches the highest accuracy on the validation set *VS*. The global best network *x*∗∗ found so far is then updated accordingly.

A restart method is used to avoid a premature convergence of the algorithm. The restart strategy adopted discards all the individuals in the current population, except the best one, and a new randomly generated population is used for the next algorithm iteration. The restart is applied at the end of an epoch *e* if the fitness of *x*∗∗ has not changed for a given number *M* of epochs. The complete algorithm, namely DENN, is depicted in Algorithm 1.

In the algorithm DENN, the function *generate\_offspring* executes the mutation and crossover operators in order to produce the *trial individual*, whereas the function *best\_score* finds the best network *x*∗ and computes the respective score *f*∗ among all the individuals in the population.

## *4.1. Fitness Function*

In the case of classification problems, the fitness function used to evaluate an individual *x* is the well-known cross-entropy. In this case, the optimization problem is to find the neural network *x* minimizing the value *H*(*x*), computed as

$$H(\mathbf{x}) = -\sum\_{i=1}^{b} \sum\_{j=1}^{C} z\_{ij}' \log(z\_{ij}) \tag{5}$$

where *zij* and *z*′*ij* are, respectively, the value predicted by *x* and the actual value for the *i*-th record of *U* with respect to the *j*-th class (*C* is the number of classes).
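Equation (5) amounts to summing the negative log-probability assigned to the true class of each record; a one-line sketch:

```python
import numpy as np

def cross_entropy(pred, actual):
    """Cross-entropy fitness (Equation (5)): pred is the (b, C) matrix of
    predicted class probabilities z, actual the one-hot true labels z'."""
    return -float(np.sum(actual * np.log(pred)))
```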

## *4.2. The Interm Crossover*

We implemented a new crossover operator, called *interm*, which is a randomized version of the arithmetic crossover. If *xi* is the target and *y*¯*i* is the donor, then the trial *yi* is obtained in the following way: for each component $x\_i^{(h)}$ of *xi* and $\bar{y}\_i^{(h)}$ of *y*¯*i*, let $a\_i^{(h)}$ be a vector of $d^{(h)}$ random numbers, generated with a uniform distribution on [0, 1]; then,

$$y\_{ij}^{(h)} = a\_{ij}^{(h)} x\_{ij}^{(h)} + (1 - a\_{ij}^{(h)})\, \bar{y}\_{ij}^{(h)}$$

for *j* = 1, . . . , *d*(*h*).
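The interm crossover is thus an entry-wise random convex combination; a sketch over list-of-components individuals (our illustration):

```python
import numpy as np

def interm_crossover(target, donor, rng):
    """interm crossover: each trial entry is a random convex combination of
    the corresponding target and donor entries, with its own weight a_ij."""
    trial = []
    for x_h, y_h in zip(target, donor):
        a = rng.random(x_h.shape)               # a_i^(h), uniform in [0, 1]
        trial.append(a * x_h + (1.0 - a) * y_h)
    return trial
```

Since each weight lies in [0, 1], every trial entry lies between the corresponding target and donor entries.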

## *4.3. The MAB-ShaDE Mutation Method*

We also implemented a variant of the ShaDE algorithm, called MAB-ShaDE. Like ShaDE (Section 2), MAB-ShaDE has a solution archive and a history of the best *CR* and *F* parameters.

The novelty of MAB-ShaDE lies in the method used to select one mutation strategy among a list of possible operators, inspired by the multi-armed bandit algorithm UCB1 [31].

We consider the mutation strategies as the arms of the bandit and the epochs as the rounds in which the reward of the selected arm is computed. For each mutation operator *OP*, UCB1 stores the average reward *μOP* and the number of epochs *nOP* in which *OP* has been used. At the end of epoch *e*, the operator

$$O = \underset{OP}{\arg\max} \left(\mu\_{OP} + \sqrt{\frac{2\log e}{n\_{OP}}}\right)$$

is chosen as mutation strategy for the next epoch.
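The UCB1 rule balances exploitation (the average reward) against exploration (the square-root bonus, which grows for rarely used operators); a sketch:

```python
import math

def ucb1_select(mu, n, e):
    """UCB1 choice of the next epoch's mutation operator: mu[OP] is the
    average reward of OP, n[OP] the number of epochs in which OP was used,
    and e the number of epochs elapsed."""
    return max(mu, key=lambda op: mu[op] + math.sqrt(2 * math.log(e) / n[op]))
```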

## **5. Experiments**

In this section, we describe the experiments performed to assess the effectiveness of the DENN algorithm as an alternative to backpropagation for neural network optimization.

Moreover, we are interested in finding the best algorithm combination and, in particular, the best mutation and crossover operators. To this end, we organized two rounds of experiments. First, we tested all the possible combinations in order to determine the best algorithm for each dataset and the global best. These experiments are described in Section 5.3 and allow us to conclude that there is no winning combination if we consider the results grouped by dataset, whereas the combination of ShaDE with *curr\_p\_best* and *interm* globally performs better than any other. Then, we verified the effectiveness, both in terms of computational effort and accuracy, compared with the classical backpropagation. These results are shown in Section 5.6.

All the networks used in these experiments have no hidden layer.

DENN has been implemented as a C++ program (source code available at https://github.com/Gabriele91/DENN). The results presented here were obtained on a computer with an AMD Ryzen 1600 CPU and 16 GB of RAM.

## *5.1. Datasets*

We tested DENN on various classification datasets from the UCI repository (https://archive.ics.uci.edu/ml/datasets) (MAGIC, QSAR, and GASS) and also on the well-known MNIST dataset (http://yann.lecun.com/exdb/mnist/) for hand-written digit classification. They were chosen because of their differences in the number of features and records. Moreover, we chose the MNIST dataset because it is a classical challenge with well-known results obtained by various NN classification systems. Note that these datasets are also considered interesting challenges in [15].


## *5.2. System Parameters*

The DENN algorithm depends on various parameters: some directly derive from DE (*F*, *CR*, the self-adaptive variant of DE, and the mutation and crossover operators), others depend on the batching system (*s*, *b*, and *r*). For each dataset, we analyzed these parameters; their values are shown in Table 1.


**Table 1.** Parameters values.

We have chosen three levels for the window size *b*, called *low*, *mid*, and *high*, which depend on the dataset size, hence they correspond to different values for each dataset (see Table 2).


**Table 2.** Size of batches for each dataset.

We also chose three levels for the length *s* of the sub-epoch, which are proportional to the number *b*/*r* of records changed at each sub-epoch. For instance, the lowest level is *b*/(4*r*), which corresponds to a number of generations equal to 1/4 of *b*/*r*. The main motivation for this choice is that DENN should need more generations with larger batches/windows.

Another aspect of our tests is that we used two versions of each dataset: the original one and the normalized one. In this way, we can see whether the normalization process affects the performance of DENN.

As we implemented a complete test for each possible combination in each dataset and we run the same configuration five times, we collected accuracy values and computation time for 30,240 runs.

All the results are stored on GitHub (available at https://github.com/Gabriele91/DENN-RESULTS-2019); in this paper, only the most significant ones are shown.

## *5.3. Algorithm Combination Analysis*

The first analysis was made on the convergence plots, where, for each dataset, the accuracy is plotted over the generations. For each dataset and for each self-adaptive method, the data of the method which obtained the highest accuracy are displayed in Figures 1–4.

**Figure 2.** Plots of convergences on QSAR and QSAR normalized datasets.


**Figure 4.** Plots of convergences on MNIST and MNIST normalized datasets.

From the plots, it is possible to see that, excluding the cases where the differences are not significant, *MAB-ShaDE* works well on smaller datasets (MAGIC and QSAR), whereas ShaDE is the best method for larger datasets.

## *5.4. Convergence Analysis*

In this subsection, we discuss the convergence across all DE used and analyzed in this paper on the datasets discussed before. On the MAGIC dataset, *SHADE* and *L-SHADE* converge in around 1750 generations, whereas the proposed *MAB-SHADE* requires only 250 generations to achieve a solution with a comparable quality. Other methods were able to discover lower quality solutions only. Regarding the other binary classification problem, QSAR, *MAB-SHADE* converges faster than all the other methods in less than 200 generations, while simultaneously obtaining a higher quality solution.

On the GASS multi-class problem, *MAB-SHADE* follows the same convergence path as *L-SHADE*, whereas *SHADE* has a slower convergence but reaches a slightly better result; the other methods do not reach a satisfactory solution.

On the image classification problem MNIST, *SHADE* and *L-SHADE* turned out to be the best algorithms in terms of solution quality and convergence time, whereas the other methods did not obtain solutions of comparable quality; noticeably, *MAB-SHADE* did not get stuck, but it likely requires more generations to converge to a solution.

We also performed the same tests on the normalized versions of the datasets, finding noticeable differences with respect to the previous results.

On MAGIC, all the methods converged to the same solution, whereas on QSAR the best solution was reached by *MAB-SHADE* and *jDE*. Regarding GASS, no analyzed method reached a solution comparable in quality to those found on the corresponding original dataset. Finally, on MNIST all the methods, except *SAMDE*, reached good solutions, which are, however, below the solutions found with *SHADE* on the non-normalized datasets.

Generally speaking, the best DE method is *SHADE* for the multi-class problems and *MAB-SHADE* for binary classification, although the convergence curves of *SHADE* are close to those of *MAB-SHADE* in the latter kind of problems.

Finally, it is worth noting that *MAB-SHADE* systematically performed better than its direct competitor *SAMDE*, without requiring the choice of a particular mutation strategy.

## Quade Weighted Rank

As we have different results for different datasets, we applied the Quade test [48] in order to obtain a global ranking which takes into account the differences among the datasets.

The Quade test considers that some datasets could be more difficult to deal with (i.e., the differences in accuracy of the various algorithms are larger). In this way, the rankings computed on each dataset are scaled depending on the differences observed in the algorithms' performances [48].

With reference to Table 3, for each algorithm combination, the weighted ranking values are shown in the last column *Quade rank*.

These values are computed as follows.

Given the 756 parameter configurations obtained by varying the values of each dimension as shown in Table 1, we stored in *vij* the average accuracy value obtained by the configuration in row *i* on the dataset in column *j*. The ranks *rij* of these values are computed within each dataset. Ranks are also assigned to the datasets according to the sample range of the accuracy values obtained on them; the sample range within dataset *j* is the difference between the largest and the smallest accuracy *vij* within that dataset. Let *Qj* be the rank assigned to the *j*-th dataset with respect to these values. Then, the Quade weighted rank is obtained by ordering the parameter configurations with respect to *Si* = ∑*j* *rijQj*.
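The computation of these weighted scores can be sketched as follows (our illustration; the orientation of the within-dataset ranks is an assumption of the sketch, here rank 1 for the worst accuracy so that higher scores mean better configurations):

```python
import numpy as np

def quade_scores(v):
    """Quade-style weighted scores: v[i, j] is the average accuracy of
    configuration i on dataset j. Configurations are ranked within each
    dataset, datasets are weighted by the rank of their accuracy sample
    range, and S_i = sum_j r_ij * Q_j."""
    r = np.argsort(np.argsort(v, axis=0), axis=0) + 1   # within-dataset ranks
    ranges = v.max(axis=0) - v.min(axis=0)              # sample range per dataset
    Q = np.argsort(np.argsort(ranges)) + 1              # dataset weights
    return (r * Q).sum(axis=1)
```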

In Table 3, the top 20 among the 756 configurations tested are shown. We can see that *SHADE*, *curr\_p\_best*, *interm*, and *b* = *high* are the best choices.


**Table 3.** Top 20 Quade ranking for parameter configurations.

## *5.5. Execution Times*

The execution time of DENN varies with the number of features and the batch size. Therefore, in Table 4, we show the average execution time in seconds of DENN on each dataset and for each level of *b*. Note that the execution time is not noticeably affected by the normalization of the datasets.


**Table 4.** Average execution times.

As shown in Table 4, the worst case required approximately three minutes to compute the solution, thanks also to a strong parallelization of the computation. Note that this parallelization is a plus of the evolutionary approach: it would have been impossible with an inherently sequential method like backpropagation. Therefore, we can conclude that the time to reach the solution is reasonable and the approach is feasible, even if it is slower than gradient-based methods.

## *5.6. Comparison with Backpropagation*

In this section, we compare our method with the backpropagation (BPG) algorithm, using two optimizers: Stochastic Gradient Descent (SGD) and the more powerful Adam. The experiments were performed on the same datasets, MAGIC, QSAR, GASS, and MNIST, using both the original and the normalized versions.

The results are reported in Table 5, where for each dataset we compare the classification accuracy obtained by NNs trained with BPG (using both optimizers) with the accuracy obtained by our method (DENN). As can be seen, in such a scenario our method shows performance better than, or in some cases comparable to, the competitors. More specifically, DENN obtained higher accuracy than SGD on all classification problems, while Adam performed better only on MNIST.


**Table 5.** Comparison BPG - DENN.

The difference between MNIST and the other datasets lies in their features: in MNIST, the features are purely quantitative, whereas in the other datasets some data have a quantitative nature and others are qualitative.

Generally, all the algorithms work better on the normalized datasets, except on MNIST, where the data already have a high degree of homogeneity. On the other hand, on GASS the effect of using normalized datasets is much greater for all the algorithms.

Note that our method can be useful in MLP networks trained for problems on which traditional algorithms can hardly achieve satisfying performances or need larger networks to achieve the same results.

## **6. Conclusions and Future Works**

In this paper, the DENN framework, a learning algorithm for Neural Networks based on Self-Adaptive Differential Evolution, is presented. Experiments show that the framework is able to solve classification problems, reaching satisfying levels of accuracy even in the case of large datasets. The use of batch systems allows the application of DE to new untested domains. Indeed, it is worth noticing that the size of the problems handled in this work is significantly larger than those tested in other works available in the literature.

Furthermore, the per-layer mutation and crossover strategies introduced in this work perform better than the traditional DE used in previous works. From the experiments we found the following:


The results obtained with DENN are almost always better than those obtained with backpropagation. Moreover, DENN appears to be more robust than its competitors with respect to normalization.

Future research will investigate the possibility of using DENN as optimizer for other Neural Network structures, including Convolutional Neural Networks, Recurrent Neural Networks, and Neural Turing Machines. Another scenario could be the application of Evolutionary Algorithms to those problems and domains where gradient-based optimizers do not perform as well as in supervised learning. A first direction will be the application of DENN in the Reinforcement Learning context, where a NN approximates the Value-Action Function (or Q Function) for agents in a nonlinear and complex environment.

**Author Contributions:** Writing-original draft: A.M., M.B., V.P., G.D.B.; Conceptualization and all other contributions: A.M., M.B., V.P., G.D.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research has been partially supported by *Progetti Ricerca di Base 2015–2019 Baioletti-Milani-Poggioni* granted by Department of Mathematics and Computer Science University of Perugia, Italy.

**Conflicts of Interest:** The authors declare no conflict of interest.
