**1. Introduction**

The use of Neural Network (NN) models has been steadily increasing in the recent past, following the introduction of Deep Learning methods and the ever-growing computational capabilities of modern machines. Thus, such models are applied to various problems, including image classification [1] and generation [2], text classification [3], speech recognition [4], emotion recognition [5], and many more. New and more complex network structures, such as Convolutional Neural Networks [6], Neural Turing Machines [7], and NRAM [8], were developed and applied to the aforementioned tasks; such new problems and structures also required the development of new optimization techniques [9–11].

In line with these trends, interest in neuroevolution has also been renewed [12–15]. The term neuroevolution identifies the research area in which evolutionary algorithms are used to construct and train artificial neural networks. Several approaches have been proposed, both to train the weights and topology of networks and to exploit the generality of neuroevolution, which allows learning with nondifferentiable activation functions, without explicit targets, and with recurrent networks [16,17].

The traditional method used to learn the weights and biases of a neural network is gradient descent applied to a cost function, and its best-known implementation is the backpropagation procedure. Backpropagation is still the workhorse of learning in neural networks, even though its origins date back to the 1970s; its importance was recognized in 1986 [18].

Backpropagation works under two main assumptions about the form of the cost function: it must be expressible as an average over cost functions *Cx* for individual training examples *x*, and as a function of the outputs of the neural network. Moreover, the activation functions must be differentiable.
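
In symbols, the two assumptions might be written as follows. The notation is an assumption introduced here for illustration and is not taken from the original text: $n$ is the number of training examples, $C_x$ is the cost on example $x$, and $a^{L}$ denotes the output activations of the network.

$$C = \frac{1}{n} \sum_{x} C_x, \qquad C = C\left(a^{L}\right)$$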

That said, there are tricks for working around these restrictions, and finding alternatives to gradient descent is an active area of investigation. An interesting analysis of why backpropagation is the most widely used gradient-based technique for training neural networks, while evolutionary approaches remain insufficiently studied, is presented in [15].

Because meta-heuristic algorithms are generally nondeterministic and insensitive to the differentiability and continuity of the objective function, they are used in a wide range of complex optimization problems. In addition, stochastic global optimization methods can identify the global minimum without being trapped in local minima [19–21].

The most widely used evolutionary approach in neuroevolution is the genetic one, extensively employed in conventional neuroevolution (CNE) [17,22] and recently also proposed for deep neuroevolution [23]. In these algorithms, the best individuals (those with the highest fitness) are evolved by means of mutation and crossover operators and replace the genotypes with the lowest fitness in the population. The genetic approach is the most widely used technique because it is easy to implement and practical in many domains. However, it raises an encoding problem, since a discrete optimization method is used to solve continuous problems.

To avoid the encoding problem, other continuous evolutionary meta-heuristics have been proposed, in particular differential evolution (DE). DE evolves a population of real-valued vectors, so no encoding or decoding is required.

DE has been reported to perform better than other popular evolutionary algorithms [24], to converge quickly, and to be robust [25]; it also performs better in learning applications [26]. At the same time, DE relies on simple genetic operations, such as its mutation operator and a survival strategy based on one-to-one competition. Moreover, it can use both global information about the population and local information about individuals to search for the optimal solution.
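
The mutation operator and the one-to-one survival rule can be illustrated with a minimal sketch of one DE generation. It assumes the classic DE/rand/1/bin variant, minimization of the fitness, and illustrative values for the scale factor `f` and the crossover rate `cr`; none of these choices is taken from the text above, and the function name `de_generation` is hypothetical.

```python
import random


def de_generation(population, fitness, f=0.5, cr=0.9, rng=random):
    """One generation of DE/rand/1/bin, minimizing `fitness`.

    Assumptions for illustration only: the variant, `f`, and `cr`.
    The population needs at least four individuals.
    """
    dim = len(population[0])
    next_population = []
    for i, target in enumerate(population):
        # Pick three distinct individuals, all different from the target.
        others = [p for j, p in enumerate(population) if j != i]
        a, b, c = rng.sample(others, 3)
        # Differential mutation: a base vector plus a scaled difference.
        mutant = [a[k] + f * (b[k] - c[k]) for k in range(dim)]
        # Binomial crossover; position j_rand always comes from the mutant.
        j_rand = rng.randrange(dim)
        trial = [
            mutant[k] if (rng.random() < cr or k == j_rand) else target[k]
            for k in range(dim)
        ]
        # One-to-one survival: the trial competes only with its own target.
        if fitness(trial) <= fitness(target):
            next_population.append(trial)
        else:
            next_population.append(target)
    return next_population
```

Because each trial vector competes only against its own target, the fitness of every slot in the population never gets worse from one generation to the next.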

When the optimization problem is complex, the performance of the traditional DE algorithm depends on the selected control parameters and mutation strategy [19,27–29]. If these are unsuitable, DE is likely to suffer from premature convergence, stagnation, and excessive consumption of computational resources. In particular, the stagnation problem for DE applied to neural network optimization has been studied in [30].

This paper presents DENN, a system that optimizes artificial neural networks using DE. The system uses a direct encoding with a one-to-one mapping between the weights of the neural network and the values of the individuals in the population. It is an enhanced version of the system introduced in [12], where a preliminary implementation was described.
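
A direct encoding of this kind can be sketched as a pair of flatten and rebuild functions. The layer layout used here (a list of weight rows plus a bias list per layer) and the names `encode` and `decode` are assumptions made for illustration, not the representation used by DENN.

```python
def encode(layers):
    """Flatten a list of (weights, biases) layers into one real-valued vector.

    Hypothetical layout: `weights` is a list of n_in rows with n_out values each.
    """
    genome = []
    for weights, biases in layers:
        for row in weights:
            genome.extend(row)
        genome.extend(biases)
    return genome


def decode(genome, shapes):
    """Rebuild the layers from a flat vector; `shapes` lists (n_in, n_out) per layer."""
    layers, pos = [], 0
    for n_in, n_out in shapes:
        weights = [
            genome[pos + r * n_out: pos + (r + 1) * n_out] for r in range(n_in)
        ]
        pos += n_in * n_out
        biases = genome[pos: pos + n_out]
        pos += n_out
        layers.append((weights, biases))
    return layers
```

Each position in the flat vector corresponds to exactly one weight or bias, so the evolutionary operators can work on the vector without any further translation step.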

A batching system is introduced to overcome one of the main computational problems of the proposed approach, namely the fitness computation. In every generation the population is evaluated on a limited number of training examples, given by the size of the current batch, rather than on the whole training set. This reduces the computational load, particularly for large training sets. Moreover, a restart method is applied to avoid premature convergence of the algorithm: the best individual is saved and the rest of the current population is discarded, and the search continues on a new randomly generated population.
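
The batching and restart ideas can be sketched as follows. This is an illustration under stated assumptions, not the DENN implementation: batches are taken cyclically from the training set, fitness is a mean loss to be minimized, new individuals are drawn uniformly from `[low, high]`, and the saved best individual is reinserted into the new population. All function names and parameters are hypothetical.

```python
import random


def batches(dataset, batch_size):
    """Yield consecutive batches forever, wrapping around the training set."""
    start = 0
    while True:
        yield [dataset[(start + k) % len(dataset)] for k in range(batch_size)]
        start = (start + batch_size) % len(dataset)


def batch_fitness(individual, batch, loss):
    """Mean loss of one individual over the current batch only."""
    return sum(loss(individual, x, y) for x, y in batch) / len(batch)


def restart(population, fitness, low=-1.0, high=1.0, rng=random):
    """Keep the best individual and replace all others with random vectors.

    Assumption: the saved best is placed back into the new population.
    """
    best = min(population, key=fitness)
    fresh = [
        [rng.uniform(low, high) for _ in range(len(best))]
        for _ in range(len(population) - 1)
    ]
    return [best] + fresh
```

With this layout, the cost of one generation grows with the batch size rather than with the size of the training set.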

Finally, a new self-adaptive mutation strategy, *MAB-ShaDE*, inspired by the multi-armed bandit algorithm UCB1 [31], and a new crossover operator, *interm*, a randomized version of the arithmetic crossover, are proposed.
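
For readers unfamiliar with UCB1, its selection rule can be sketched as below. This shows only the standard bandit rule, with each arm standing in for a candidate mutation strategy; it is not the *MAB-ShaDE* method itself, whose details are not given in this section. The function name and the reward bookkeeping are assumptions.

```python
import math


def ucb1_select(counts, rewards, total_pulls):
    """Return the index of the arm chosen by the standard UCB1 rule.

    counts[i]  : how many times arm i has been played
    rewards[i] : sum of rewards collected by arm i (assumed to lie in [0, 1])
    """
    # Play every arm once before applying the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Empirical mean plus an exploration bonus that shrinks with use.
    scores = [
        rewards[arm] / counts[arm]
        + math.sqrt(2.0 * math.log(total_pulls) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)
```

The exploration bonus makes rarely used arms attractive again over time, which is what lets a bandit-based scheme keep testing strategies that looked poor early on.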

An extensive experimental study has been carried out to (i) determine whether this approach is scalable and applicable to large classification problems, such as MNIST digit recognition; (ii) study the performance achieved by using the *MAB-ShaDE* and *interm* components; and (iii) identify the best algorithm configurations, i.e., the configurations reaching the highest accuracy.

The experimental results show that DENN is able to outperform the backpropagation algorithm in training neural networks without hidden layers. Moreover, DENN is a viable solution from a computational point of view as well, even though its learning time is higher than that of its competitor, backpropagation (BPG).

The paper is organized as follows. Background concepts about neuroevolution, the DE algorithm, and its self-adaptive strategies are summarized in Section 2; related work is presented in Section 3; the system is presented in Section 4; and experimental results are shown in Section 5. Section 6 closes the paper with some final considerations and ideas for future work.
