Article

Finding the Optimal Topology of an Approximating Neural Network

Faculty of Mathematics and Informatics, University of Plovdiv Paisii Hilendarski, 236 Bulgaria Blvd., 4027 Plovdiv, Bulgaria
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(1), 217; https://doi.org/10.3390/math11010217
Submission received: 27 October 2022 / Revised: 25 December 2022 / Accepted: 28 December 2022 / Published: 1 January 2023
(This article belongs to the Section Mathematics and Computer Science)

Abstract

A large number of researchers spend a lot of time searching for the most efficient neural network to solve a given problem. The procedure of configuration, training, testing, and comparison for expected performance is applied to each experimental neural network. The configuration parameters—training methods, transfer functions, number of hidden layers, number of neurons, number of epochs, and tolerable error—have multiple possible values. Setting guidelines for appropriate parameter values would shorten the time required to create an efficient neural network, facilitate researchers, and provide a tool to improve the performance of automated neural network search methods. The task considered in this paper is related to the determination of upper bounds for the number of hidden layers and the number of neurons in them for approximating artificial neural networks trained with algorithms using the Jacobi matrix in the error function. The derived formulas for the upper limits of the number of hidden layers and the number of neurons in them are proved theoretically, and the presented experiments confirm their validity. They show that the search for an efficient neural network can focus below certain upper bounds, and above them, it becomes pointless. The formulas provide researchers with a useful auxiliary tool in the search for efficient neural networks with optimal topology. They are applicable to neural networks trained with methods such as Levenberg–Marquardt, Gauss–Newton, Bayesian regularization, scaled conjugate gradient, BFGS quasi-Newton, etc., which use the Jacobi matrix.

1. Introduction

Neural networks are widely used in various fields: education, healthcare and medicine, insurance and financial activities, business and industrial production, etc. They are used to solve various practical tasks: data classification, predictions, sound and image recognition, nonlinear control, system analysis and diagnosis, fault detection analysis, etc.
A major difficulty in solving problems with neural networks is to determine the topology of the network—how many hidden layers and how many neurons the neural networks should have. This question is of great importance because finding an efficient solution depends on the choice of topology. The random selection of a number of hidden neurons might cause either overfitting or underfitting problems.
Some applications use neural networks with a single hidden layer. In cases where there are multiple input data, the number of neurons in such shallow architecture is large. In many cases, it is more appropriate to use alternative architectures—deep neural networks (DNNs), with more hidden layers and fewer neurons. DNNs have much more power and can be used to solve much more complex tasks. On the other hand, in order to have good generalization abilities, the network must be as compact as possible [1]. The topology of the neural network is of key importance in the processes of memorization and retrieval of patterns [2]. Emmert-Streib conducted multiple experiments to solve a single task by using multiple neural networks with different topologies. He proved that the neural network topology has a great influence on its learning dynamics [3].
Many scholars have worked on the task of determining the optimal network topology. Various formulas have been published that relate to solving specific problems or specific neural networks. The wide variety of different types of neural networks and their use make it impossible to find a universal solution for choosing an appropriate topology.
Automating the process of finding neural networks with a suitable topology is usually done by a procedure that iteratively constructs different neural networks and checks whether they are acceptable depending on a predetermined tolerance. This is done by changing certain parameters, e.g., the number of neurons, the number of hidden layers, epochs, etc. Criteria are needed to determine within what limits these parameters should be changed.
This paper derives formulas for determining the maximum number of hidden layers and the maximum number of neurons in a neural network design. They provide researchers with starting points and guidance in building efficient neural networks. The subject of research is neural networks that have the same number of neurons in the hidden layers and are trained with algorithms using the Jacobi matrix. Some of the widely used training methods are Levenberg–Marquardt [4,5], Gauss–Newton [6,7], Bayesian regularization [8,9], scaled conjugate gradient [10,11], BFGS quasi-Newton [12,13], etc.

2. Determining the Topology of an Artificial Neural Network—State of the Art

Many scientists are working on the task of determining an appropriate neural network topology [14,15,16]. They use different approaches and reach different conclusions applicable to different situations.
One of the problems is related to determining the number of neurons that an artificial network should have [17,18]. Single-parameter approaches, where the number of neurons depends on a single parameter—the number of network inputs—can be seen in the works of Chow et al. [19], Tamura and Tateishi [20], and Sheela and Deepa [21]. For example, Chow et al. proposed the following formula to determine the number of neurons needed:
N_h = \frac{\sqrt{1 + 8 N_i} - 1}{2}
where N_i (N_i ∈ ℕ) is the number of input stimuli [19].
Sheela and Deepa researched Elman neural networks. To fix the number of hidden neurons as a function of the number of input parameters n, they examined 101 different criteria and compared statistical errors [21]. They arrived at the following formula:
N_h = \frac{4 n^2 + 3}{n^2 - 8}
Madhiarasan and Deepa developed a methodology for estimating the number of hidden neurons in a multilayer perceptron NN. They derived the following formula:
N_h = \frac{4 n^2}{n - 3}
where n is the number of input parameters. They performed a comparative analysis for estimating the number of hidden neurons and validated the applicability of the formula with the convergence theorem [22].
Other scientists define the number of neurons as a function of two parameters. For example, in the work of Xu and Chen [23], the number of neurons is determined by two parameters—N_t and N_i (N_t, N_i ∈ ℕ)—the number of input–output training samples and the number of network inputs, respectively:
N_h = \begin{cases} \dfrac{1}{2}\sqrt{\dfrac{N_t}{N_i \log N_t}}, & \dfrac{N_t}{N_i} > 30 \\[4pt] \sqrt{\dfrac{N_t}{N_i}}, & \dfrac{N_t}{N_i} \le 30 \end{cases}
A similar two-parameter approach can be found in [14,24].
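For a quick numerical comparison, the single-parameter criteria quoted above can be evaluated directly. The following MATLAB fragment is a minimal illustration based on the formulas as reconstructed in this section (it is not code from the cited papers), with an arbitrarily chosen number of inputs:
% Hidden-neuron estimates from two single-parameter criteria (illustrative only).
n = 4;                                  % number of network inputs (example value)
Nh_chow   = (sqrt(1 + 8*n) - 1) / 2;    % Chow et al. [19]
Nh_sheela = (4*n^2 + 3) / (n^2 - 8);    % Sheela and Deepa [21] (requires n^2 > 8)
fprintf('Chow et al.: %.2f   Sheela-Deepa: %.2f\n', Nh_chow, Nh_sheela);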
In one of our research projects, we looked for an optimal number of neurons for the cases in which the neural network training method uses the Jacobi matrix in its error function [25]. Such training methods are Levenberg–Marquardt, Gauss–Newton, Bayesian regularization, scaled conjugate gradient, BFGS quasi-Newton, etc. As a result, we derived formulas with which we determine an upper bound for the number of neurons q in the hidden layers of single-layer and multi-layer neural networks, as follows:
In single-layer neural networks:
q \le \frac{m - 1}{n + 2} \qquad (1)
and in multi-layer neural networks with r hidden layers in which there is an equal number of neurons in each layer:
q \le \frac{\sqrt{(r + n + 1)^2 + 4(r - 1)(m - 1)} - (r + n + 1)}{2(r - 1)} \qquad (2)
In both cases, m is the number of input–output training samples, and n is the number of input stimuli. These formulas, however, apply only when the considered neural networks meet certain conditions (a short computational illustration of the bounds follows the list below):
(1)
Neural networks are trained with an algorithm using the Jacobi matrix in the error function;
(2)
There is an equal number of neurons in each of the hidden layers;
(3)
The network has one output neuron.
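For orientation, the two bounds (1) and (2) can be evaluated directly from m, n, and the intended number of hidden layers r. The MATLAB fragment below is our own illustration with example values, not code from [25]:
% Upper bounds on the number of neurons per hidden layer for networks with
% one output neuron trained with Jacobi-matrix-based methods.
m = 200;  n = 2;                         % training samples and input stimuli (example values)
q_single = (m - 1) / (n + 2);            % Formula (1): one hidden layer
r = 5;                                   % intended number of hidden layers (example value)
q_multi  = (sqrt((r + n + 1)^2 + 4*(r - 1)*(m - 1)) - (r + n + 1)) / (2*(r - 1));   % Formula (2)
fprintf('q_max (1 layer): %.1f   q_max (%d layers): %.1f\n', q_single, r, q_multi);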
Other scientists have directed their efforts toward finding an optimal neural network model from the point of view of layered architecture. For example, Ref. [26] discusses the possibilities of finding the number of layers of networks with backpropagation of the error. It has also been emphasized that the number of hidden layers is related to the number of neurons and thus affects the training time of the neural network and the magnitude of the error. This once again confirms that the choice of the number of hidden layers and, correspondingly, of neurons has a great influence on the performance of the entire neural network. On the other hand, Choldun et al. address this issue from the perspective of classification tasks [27]. Stathakis looks for a solution for the design of topology in neural networks for classification problems. He proposes a method based on the synergy of genetic algorithms and neural networks. The method is only applicable to topologies with up to two hidden layers [28].
Hanay et al. propose a method for network topology selection based on the attractor selection-based (ASB) algorithm. They perform virtual topology reconfiguration using multistate neural memories [29].
Perzina works on the problem of neural network topology optimization. He offers a self-learning genetic algorithm with self-adaptation of all its parameters. This is an interesting possibility for self-adaptation, not only of one or a few parameters, but of all possible parameters of the genetic algorithm at the same time. This allows the algorithm to be used to solve a wide range of optimization problems without setting parameters for each type of problem in advance [30].
Vizitiu and Popescu propose an interesting approach to improve the quality of neural network architecture. They develop a genetic algorithm which simultaneously optimizes both the topology and the neural weights of a feedforward neural network. The GANN system architecture presented by them consists of two main modules. The first genetic module has the task of optimizing the connectivity (i.e., the number of neurons, layers, and neural weights) of a feedforward neural network. The second module is responsible for optimizing the distribution of the neural weights assigned to this topology [31]. Other solutions for optimizing NN topologies based on genetic algorithms are proposed by White and Ligomenides [32], Arena et al. [33], etc.
Leon proposes a method for optimizing the topology of neural networks based on the Shapley value: a game-theoretic solution concept that estimates the contribution of each network element to the overall performance. With it, more network elements can be simultaneously pruned. After each network simplification, an evolutionary hill-climbing procedure is used to fine-tune the network [34].
A method for estimating the neural network topology by examining multiple spike sequences has been proposed by Kuroda and Hasegawa. In the proposed method, they used the SPIKE distance, which is a parameter-free measure for quantifying the distance between spike sequences, and applied partialization analysis to the SPIKE distance [35].
While working on a specific task, some scientists create different topologies using various methods and methodologies, such as the methodology based on optimizing neural network parameters, methods based on genetic algorithms, differential evolution, etc. After that, they examine these methodologies and select the one with the best performance [36].
Other research related to neural network architectures aims at comparing the performance of neural networks with one and two hidden layers. Guliyev and Ismailov explore possibilities for the approximation of multivariate functions by single-layer and two-layer feedforward neural networks with fixed weights. They prove that single hidden layer networks cannot approximate all continuous multivariate functions. This is also a criterion for choosing an NN topology [37]. Thomas et al. conducted research to determine whether feedforward neural networks with two hidden layers generalize better than a network with a single hidden layer. They propose a method called “transformative optimization” through which they perform a hidden-node-by-hidden-node comparison of single-layer and double-layer feedforward neural networks across ten separate datasets. They determined that in nine out of ten cases, the two-layer neural networks had a better performance [38]. Nakama uses another approach to compare single- and multiple-hidden-layer networks. For both types of neural networks, he uses the same activation and output functions, number of nodes, feedforward connections, parameters, and inputs. The networks are trained by the gradient descent algorithm to approximate linear and quadratic functions, and their convergence properties are examined. It has been established that the single-hidden-layer network converges faster than the multiple-hidden-layer network only in cases where a linear function is approximated. In all other cases, it is recommended to use multilayer neural networks [39].
Other approaches to determining the topology of neural networks are systematized by Choldun, Santoso, and Surendro [40].
Thus, in the end, the answer to the question of the required number of layers remains open. However, at the same time, it is important enough to be one of the most popular topics in specialized forums, training seminars, and scientific conferences, and the answer is almost always related to the need for experimentation in each specific case or task. The difficulty of determining the optimal number of layers also stems from the fact that the influence of the number of layers on the network performance is ambiguous.
The experiments conducted within the current study show that expanding the layered architecture often leads to a decrease in network efficiency. A formula is derived for the upper bound of the number of neurons in the generalized case of neural networks with several output neurons, and it has been proved that this formula can help determine the upper limit of the number of layers, thus facilitating the search for the optimal neural network.

3. Experiments to Search for a Relationship between the Number of Layers and the Efficiency of the Network

For the purpose of this study, experiments were conducted to search for approximating neural networks using different types of objective functions: linear, quadratic, trigonometric, logarithmic, exponential, and mixed. Using the MatLab scripting language [41], fitting neural nets were modeled for these functions with different numbers of layers under the following conditions:
(1)
A total of 286 input–output pairs were used, with approximately 70% of them (200 samples) for training, 15% (43 samples) for validation, and 15% (43 samples) for an independent test of the neural networks.
(2)
The transfer function in the body of the hidden neurons is a hyperbolic tangent:
\tanh(x) = \frac{2}{1 + e^{-2x}} - 1.
(3)
Training is done with the Levenberg–Marquardt algorithm.
(4)
The number of layers in the network, in each particular approximation case, is changed sequentially from 1 to 100 using a purpose-built algorithm implemented through the MatLab scripting language.
(5)
The number of neurons for each hidden layer in all neural networks is three.
(6)
Uniform, randomly generated input–output datasets were used for training, validation, and testing. This eliminates the risk of the neural networks’ performance being affected by the way the dataset is randomized.
Neural network efficiency was determined using this formula:
\mathrm{efficiency} = \frac{1}{\mathrm{performance}}
where performance is the aggregate performance of the network during training, validation, and testing. The results of the conducted experiment on how the number of layers influences the accuracy when neural networks are used to approximate the objective functions are presented in Figure 1.
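The experiment can be reproduced with a loop of the following form. This is a minimal MATLAB sketch assuming the fitnet function of the Deep Learning Toolbox (whose default hidden transfer function is the hyperbolic tangent, as in the setup above); the objective function and the input range are illustrative choices and not the authors' exact script:
% Efficiency of fitting networks with 1..100 hidden layers of 3 neurons each.
x = 10 * rand(1, 286);                       % uniformly random inputs (assumed range)
t = 2 * x + 1;                               % example objective function y = 2x + 1
efficiency = zeros(1, 100);
for r = 1:100
    net = fitnet(3 * ones(1, r), 'trainlm'); % r hidden layers, 3 neurons each, Levenberg-Marquardt
    net.divideParam.trainRatio = 0.70;       % 200 samples for training
    net.divideParam.valRatio   = 0.15;       % 43 samples for validation
    net.divideParam.testRatio  = 0.15;       % 43 samples for testing
    net.trainParam.showWindow  = false;
    net = train(net, x, t);
    perf = perform(net, t, net(x));          % aggregate performance (mean squared error)
    efficiency(r) = 1 / perf;                % efficiency = 1/performance
end
plot(1:100, efficiency); xlabel('number of hidden layers'); ylabel('efficiency');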
It is obvious that there is no universal rule that relates the performance of a neural network to the number of hidden layers. In some cases, increasing the number of layers leads to a drastic decrease in accuracy right after the first layer (e.g., when approximating functions such as y = 2x + 1 and y = x^2). In other cases, the optimal number of layers is reached later, after which the efficiency remains at a certain level (e.g., the functions y = cos(x) and y = sin(x)). In a third group of cases (e.g., y = e^x), after the peak in efficiency, multiple secondary lower highs and lows are observed. This suggests that finding the optimal number of layers is a strictly individual decision that depends on the specific task and is a matter of intuition and experimentation on the part of the researcher.
On the other hand, with a fixed number of hidden neurons, increasing the number of layers leads to a decrease in efficiency. Figure 2 shows some of the results for neural networks with different numbers of neurons in the hidden layers approximating the two-dimensional function y = 2x_1 + 3x_2 + 1.
The Levenberg–Marquardt algorithm was used to train the considered neural networks, with 100 samples (70%) of a total of 143 input–output samples used for training, 15% used for validation, and the remaining 15% used for testing; the transfer function in the neuron body is again the hyperbolic tangent. The obtained results lead to the conclusion that after a certain number of layers, the efficiency of neural networks decreases significantly. In the cases described here, for neural networks with up to six neurons in each layer, the numerical value of the efficiency, after a certain expansion of the layered architecture, goes through some short fluctuations and approaches zero. Similar results were obtained when the number of neurons in each layer was increased, as well as when trying to approximate the other investigated functions.
Even if we make the objective function a bit more complicated, for example:
y = \sin(x_1) + \cos(x_2) + 1
the observation is again that after the optimal network is modeled at the 11th layer, with
\mathrm{efficiency} = 0.1639,
the efficiency drops, then begins to maintain a lower level characterized by very sharp and significant drops (Figure 3).
The results of the experiments raise two questions: (1) What is the reason for the unpredictable change in neural network efficiency? (2) How can we explain the fact that, in some cases, too many hidden layers lead to low network efficiency, which makes it pointless to use huge approximating architectures?

4. Generalized Formula for Determining the Upper Bound of the Number of Neurons in Artificial Neural Networks

In a previous study, we derived formulas for the upper bounds of the number of neurons for neural networks with a single output neuron trained with algorithms using the Jacobi matrix (Formulas (1) and (2)) [25]. Let us now consider the more general case of a neural network with k output neurons (Figure 4) and look for analogous formulas for an optimal upper bound on the number of neurons.
Theorem 1 (upper bound for the number of neurons in hidden layers). 
For an approximating neural network that is trained with an algorithm using the Jacobi matrix for the error function and that has the same number of neurons q in each layer, the highest approximation efficiency is achieved for some q :
q \le \frac{\sqrt{(r + n + k)^2 + 4(r - 1)(m - k)} - (r + n + k)}{2(r - 1)},
where:
m is the number of training samples,
n the number of input stimuli,
r the number of hidden layers,
k the number of output neurons (m, n, r, k ∈ ℕ).
Proof. 
Let us consider an artificial neural network for which the following conditions are met:
(1)
There are q neurons in each hidden layer (q ∈ ℕ).
(2)
The network is trained with an algorithm using the Jacobi matrix of the error function on m training examples:
\{[x_{j1}, x_{j2}, \ldots, x_{jn};\; t_j = F(x_{j1}, x_{j2}, \ldots, x_{jn})]\}_{j=1}^{m}
(3)
Let n input stimuli propagate along the dendritic tree of the network, and let it have r hidden layers and k output neurons (n, r, k ∈ ℕ).
A commonly used method for finding a set of weights that achieve a minimum of the error function is the iterative procedure in the Gauss–Newton algorithm. It is aimed at finding the minimum of the function
S(z) = \sum_{i=1}^{m} [\varphi_i(z)]^2.
In our case, z is a p-dimensional point:
z = (w_1, w_2, w_3, \ldots, w_p)
where p is the number of all weights w_i, i = 1, 2, …, p, in the network, and the number of functions
\varphi_i(z) = t_i - f(x_{i1}, x_{i2}, \ldots, x_{in};\; w_1, w_2, \ldots, w_p), \quad i = 1, 2, \ldots, m,
is determined by the number of training examples.
After initializing the network with a random set of weights
z_0 = (w_{01}, w_{02}, w_{03}, \ldots, w_{0p}),
at the s-th iteration we have:
z^{(s)} = z^{(s-1)} - \left[ J_\varphi(z^{(s-1)})^T J_\varphi(z^{(s-1)}) \right]^{-1} J_\varphi(z^{(s-1)})^T \varphi(z^{(s-1)}), \qquad (3)
where
\varphi = (\varphi_1, \varphi_2, \ldots, \varphi_m), \quad \varphi_i(z) = t_i - f(x_{i1}, x_{i2}, \ldots, x_{in};\; w_1, w_2, \ldots, w_p),
and J_\varphi(z^{(s-1)}) is the Jacobi matrix of \varphi at z^{(s-1)}.
The algorithm based on the Gauss–Newton method works correctly if one important condition is met: the number of variables in the functions φ_i in the sum
S(z) = \sum_{i=1}^{m} [\varphi_i(z)]^2
must not exceed the number of these functions [42,43,44], i.e.,
p \le m. \qquad (4)
In cases where this condition is violated, the matrix
J_\varphi(z^{(s-1)})^T J_\varphi(z^{(s-1)})
is singular (non-invertible), and iteration (3) cannot return an unambiguous result.
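As an illustrative aside (not part of the formal proof), the effect of violating condition (4) can be checked numerically: for a generic m × p Jacobian, rank(JᵀJ) = rank(J) ≤ min(m, p), so JᵀJ is singular whenever p > m. A minimal MATLAB check with a random matrix standing in for the Jacobian:
% A random stand-in for an m-by-p Jacobian: J'*J is p-by-p, but its rank cannot exceed m.
m = 20;  p = 30;                         % more parameters than training samples
J = randn(m, p);
A = J' * J;
fprintf('size(A) = %d x %d, rank(A) = %d\n', size(A, 1), size(A, 2), rank(A));
% rank(A) = 20 < 30, so A is singular and the step in (3) is not uniquely defined.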
With the considered architecture (Figure 4), we have
p = (r - 1)q^2 + (r + n + k)q + k,
which means that the number q of neurons in each layer is a root of the quadratic equation:
(r - 1)q^2 + (r + n + k)q + k - p = 0. \qquad (5)
The root of (5) that has a real physical meaning for r > 1 and n, k > 0 is
q = \frac{\sqrt{(r + n + k)^2 + 4(r - 1)(p - k)} - (r + n + k)}{2(r - 1)}. \qquad (6)
Thus, from condition (4), for m ≥ p, it follows that
q \le \frac{\sqrt{(r + n + k)^2 + 4(r - 1)(m - k)} - (r + n + k)}{2(r - 1)}. \qquad (7)
This proves the theorem. □
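The parameter count used in the proof can be checked directly in MATLAB for a standard fully connected feedforward network with biases in every layer (the default of fitnet): getwb returns the full vector of weights and biases of a configured network, and its length should equal p = (r − 1)q² + (r + n + k)q + k. The sketch below is our illustration with arbitrarily chosen sizes; it also evaluates the bound (7):
% Verify the parameter count p and evaluate the bound of Theorem 1.
n = 3;  k = 2;  r = 4;  q = 5;  m = 200;          % example sizes
net = fitnet(q * ones(1, r));                     % r hidden layers with q neurons each
net = configure(net, rand(n, 10), rand(k, 10));   % fix n inputs and k outputs
p_actual  = numel(getwb(net));                    % weights and biases actually in the network
p_formula = (r - 1)*q^2 + (r + n + k)*q + k;
q_max = (sqrt((r + n + k)^2 + 4*(r - 1)*(m - k)) - (r + n + k)) / (2*(r - 1));
fprintf('p (network) = %d, p (formula) = %d, q_max = %.2f\n', p_actual, p_formula, q_max);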
In many cases, approximating neural networks have a single hidden layer. An analogous formula for this particular case can be derived as a consequence of Theorem 1.
Corollary 1 (upper bound for the number of neurons in hidden layers for one hidden layer). 
For an approximating neural network with one hidden layer that is trained with an algorithm using the Jacobi matrix for the error function and that has q neurons in its hidden layer, the highest approximation efficiency is achieved for some q:
q \le \frac{m - k}{n + k + 1},
where:
m is the number of training samples,
n the number of input stimuli,
k the number of output neurons (m, n, k ∈ ℕ).
Proof. 
Let us consider the function F(r, n, m, k), formed by the expression on the right-hand side of (7):
F(r, n, m, k) = \frac{\sqrt{(r + n + k)^2 + 4(r - 1)(m - k)} - (r + n + k)}{2(r - 1)}. \qquad (8)
Obviously, for a neural network with one layer, at r = 1, the numerator and denominator in (8) are equal to 0. To find a formula for this case, we need to find the limit of F(r, n, m, k) with r tending to 1:
G(n, m, k) = \lim_{r \to 1} F(r, n, m, k)
A standard approach, when the numerator and denominator are both equal to 0, is to use l’Hôpital’s rule and find the limit of the quotient of the first derivatives of the numerator and denominator. The solution is a little easier to reach when we multiply the numerator and denominator by \sqrt{(r + n + k)^2 + 4(r - 1)(m - k)} + (r + n + k). In this way, we get a variant of F(r, n, m, k) in which there is no uncertainty at r = 1:
F(r, n, m, k) = \frac{2(m - k)}{\sqrt{(r + n + k)^2 + 4(r - 1)(m - k)} + (r + n + k)}
For single-layer neural networks, by replacing r with 1, we easily arrive at the final result for an optimal upper bound on the number of neurons in an ANN trained by methods using the Jacobi matrix, for which the number of training samples is m, the number of input stimuli is n, there is one hidden layer, and the number of output neurons is k (m, n, k ∈ ℕ); this is the function G(n, m, k) = F(1, n, m, k):
G(n, m, k) = F(1, n, m, k) = \frac{m - k}{n + k + 1}.
Ergo,
q \le \frac{m - k}{n + k + 1}.
By this, Corollary 1 is proved. □
Corollary 2 (upper bound for the number of neurons in hidden layers, given one hidden layer and one output neuron). 
For an approximating neural network with one hidden layer that is trained with an algorithm using the Jacobi matrix for the error function and that has q neurons in its hidden layer, the highest approximation efficiency is achieved for some q:
q \le \frac{m - 1}{n + 2},
where:
m is the number of training samples,
n the number of input stimuli (m, n ∈ ℕ).
The proof is trivial. The formula follows from Corollary 1, after replacing k with 1.
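As a numerical illustration, with the setup of Section 3 (m = 200 training samples, n = 1 input stimulus, k = 1 output neuron), Corollary 2 gives
q \le \frac{200 - 1}{1 + 2} = \frac{199}{3} \approx 66.3,
so single-hidden-layer networks with more than 66 hidden neurons need not be considered for that task, since they would violate condition (4).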

5. Upper Bound on the Number of Layers for Neural Networks Containing an Equal Number of Neurons in Each Layer

The formulas for upper bounds on the number of neurons presented in the previous part can be a good guide in the search for neural networks. The question remains of finding a similar guideline for the number of layers of a single ANN. We will prove the following theorem:
Theorem 2 (upper bound for the number of hidden layers). 
For an approximating neural network that is trained with an algorithm using the Jacobi matrix for the error function and that has r hidden layers with the same number of neurons in each layer, the highest approximation efficiency is achieved for some r :
r \le \frac{m - 2k - n + 1}{2},
where:
m is the number of training samples,
n the number of input stimuli,
k the number of output neurons (m, n, k ∈ ℕ).
Proof. 
When solving a specific task, we can assume that n, m, and k are constants (n, m, k ∈ ℕ). Then we can consider F as a function of the number of layers r alone, treating all other parameters as previously known constants:
F(r) = \frac{\sqrt{(r + n + k)^2 + 4(r - 1)(m - k)} - (r + n + k)}{2(r - 1)}
The study of the partial derivative of F(r, n, m, k):
\frac{\partial F}{\partial r} = \frac{\dfrac{4(m - k) + 2(k + n + r)}{2\sqrt{(k + n + r)^2 + 4(m - k)(r - 1)}} - 1}{2(r - 1)} - \frac{\sqrt{(k + n + r)^2 + 4(m - k)(r - 1)} - (k + n + r)}{2(r - 1)^2}
shows that ∂F/∂r < 0, i.e., this derivative has a negative value for all r, n, m, k ∈ ℕ. At the same time, as can be seen from Figure 5a,
\lim_{r \to \infty} \frac{\partial F}{\partial r} = 0.
In turn, this means that the function F(r) is decreasing. Moreover, just like its derivative, its limit as r tends to infinity is equal to 0 (Figure 5b):
\lim_{r \to \infty} F(r, n, m, k) = 0.
As a result of the monotonicity of F(r) and its limit tending to zero with an unlimited increase in the number of layers, it becomes clear that, to preserve the network optimality, an increase in the number of hidden layers must be accompanied by a decrease in the number of neurons in each of them. Thus, the expansion of the layered architecture of the network cannot grow infinitely because, after a certain value of r, the number of neurons would be q < 1, which deprives the network of practical meaning. In practice, the maximum number of hidden layers r_max satisfies the equation:
F(r_{max}) = 1, i.e.,
\frac{\sqrt{(r_{max} + n + k)^2 + 4(r_{max} - 1)(m - k)} - (r_{max} + n + k)}{2(r_{max} - 1)} = 1, \qquad r_{max}, n, m, k \in \mathbb{N}. \qquad (11)
Here again, n is the number of inputs to the neural network, m is the number of samples used to train it, and k is the number of output neurons.
The domain of definition of the equation includes the requirements:
r_{max} \in \left(-\infty,\; -\sqrt{k^2 + (m - k)(m + n - 1)} - (2m + n - k)\right] \cup \left[\sqrt{k^2 + (m - k)(m + n - 1)} - (2m + n - k),\; +\infty\right) \text{ and } r_{max} \ne 1.
One of the roots of (11) is r_max = 1, which, however, does not belong to the domain of definition. The other root gives the solution of the given problem, i.e.,
r_{max} = \frac{m - 2k - n + 1}{2},
ergo,
r \le \frac{m - 2k - n + 1}{2}, \qquad (13)
with which the theorem is proved. □
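The closed-form bound can be cross-checked numerically: solving F(r) = 1 with a root finder reproduces r_max = (m − 2k − n + 1)/2. A minimal MATLAB sketch with example values of m, n, and k (our illustration):
% Cross-check Theorem 2: solve F(r) = 1 numerically and compare with the closed form.
m = 100;  n = 2;  k = 1;                          % example values (cf. Section 6)
F = @(r) (sqrt((r + n + k).^2 + 4*(r - 1).*(m - k)) - (r + n + k)) ./ (2*(r - 1));
r_numeric = fzero(@(r) F(r) - 1, [1.5, m]);       % bracket excludes the spurious root r = 1
r_closed  = (m - 2*k - n + 1) / 2;
fprintf('numerical root: %.2f   closed form: %.2f\n', r_numeric, r_closed);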

6. Experimental Verification of the Formula for the Upper Limit of the Number of Layers

Let us go back to the example considered above—the approximation of the function
y = 2x_1 + 3x_2 + 1
with a multilayer (r > 1) neural network, for which we have 100 input–output samples for training (n = 2, m = 100, and k = 1). In this particular task, Equation (13) gives us the maximum optimal number of required layers, r_max = 48, which confirms the results shown in Figure 2.
For the one-dimensional functions presented in Figure 1, we have n = 1, m = 200, and k = 1. According to these values, it follows from (13) that we should expect an optimal network with a number of layers not greater than r_max = 99. The results shown in Figure 1 confirm this value, and for all the optimal neural networks presented in the figure, the number of layers is less than the specified number.
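For reference, both values follow directly from the formula of Theorem 2:
r_{max} = \frac{100 - 2 \cdot 1 - 2 + 1}{2} = \frac{97}{2} = 48.5, \quad \text{i.e., at most 48 whole layers,}
r_{max} = \frac{200 - 2 \cdot 1 - 1 + 1}{2} = \frac{198}{2} = 99.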
Similar experiments were conducted with multivariate mixed functions, using 200 training samples and the Levenberg–Marquardt algorithm. For example, for the three-dimensional mixed function
y = x_1^2 + e^{x_2} + \tan(x_3)
we have n = 3, m = 200, and k = 1, which means an optimal number of layers not greater than r_max = 98. The variation in the efficiency of the approximating network, with different numbers of layers, is shown in Figure 6a. The peak in efficiency is reached with an artificial neural network with 59 hidden layers. When experimenting with the randomly composed five-dimensional mixed function
y = x_1 + x_2 + \cos(x_3) + x_4 x_5
the task comes down to the situation n = 5, m = 200, and k = 1. In this case, according to (13), we expect the optimal number of layers to be no greater than r_max = 97.
After the experiment, it was found that the optimal neural network approximating the described five-dimensional function has nine hidden layers (Figure 6b).
Another part of the conducted experiments is related to checking the validity of Formula (13) when reducing the number of training examples, which obviously has a strong impact on the maximum number of layers. In general, the sample size must be large enough to provide for the training, validation, and testing of the resulting network. When we reduce the training examples, we should also expect a decrease in the value of r_max.
In these experiments (Figure 7), we used the selected functions y = 2x_1 + 3x_2 + 1, y = x_1^2 + e^{x_2} + tan(x_3), and y = x_1 + x_2 + cos(x_3) + x_4 x_5. Using only 50 training samples, for example, for the two-dimensional linear function (Figure 7a), the maximum upper threshold of the number of layers is r_max = 24, and the experimentally determined most efficient number of layers is r_optimum = 2. Similarly, when examining the same function with only 10 training samples available, it is once again observed that the established effective number of layers, r_optimum = 1, is below the calculated maximum number, r_max = 4 (Figure 7b).
Experiments with the other considered functions also confirm the expectations laid down in Formula (13) (Figure 7c–f). Although there are efficient neural networks with a number of layers greater than the calculated maximum number, it is clear from the presented graphical results (e.g., Figure 7c,d) that the search for neural networks with a number of layers greater than r_max is unnecessary.
Similar experiments conducted with other randomly composed multidimensional functions also confirm the validity of Formula (13).

7. Conclusions

One of the important questions in the theory of artificial neural networks is that of determining the topology of the network. There are many different types of neural networks for different purposes, and therefore no universal answer can be provided to this question. In some cases, a suitable neural network can be searched for automatically by iteratively changing certain network parameters. In such cases, it is useful to determine in advance the upper limit of the number of layers and the number of hidden neurons of the neural network. This allows for the minimization of computations and facilitates the finding of a suitable neural network.
This paper derives formulas for the maximum number of layers and neurons in neural networks trained with algorithms that use the Jacobi matrix for error estimates. The subject of research is neural networks that have the same number of neurons in the hidden layers. The proposed formulas are verified experimentally with multilayer artificial neural networks approximating one-dimensional and multidimensional objective functions, both of uniform and of mixed type. They can be used as a criterion when searching for the optimal topology of approximating neural networks.
Finding stricter rules for determining the number of layers and neurons of an ANN can significantly optimize the process of automated ANN construction. Our future work is aimed at looking for additional factors affecting the number of layers and neurons in an artificial neural network, including implementing a system for preliminary analysis of the training samples to determine the degree of non-linearity of the problem, determining the influence of the type of transfer function of neurons on the required number of neurons, etc.

Author Contributions

Conceptualization, K.Y. and E.H.; methodology, K.Y., E.H. and S.H.; software, K.Y., E.H. and S.H.; validation, S.H. and S.C.; formal analysis, K.Y. and E.H.; investigation, K.Y., E.H., S.H. and S.C.; data curation, K.Y. and S.C.; writing—original draft preparation, K.Y. and S.H.; writing—review and editing, E.H. and S.C.; visualization, K.Y. and S.C.; supervision, E.H. and S.H.; project administration, E.H. and S.H.; funding acquisition, S.H. and S.C. All authors have read and agreed to the published version of the manuscript.

Funding

The work is partly funded by the MU21-FMI-004 project at the Research Fund of the University of Plovdiv “Paisii Hilendarski”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available through the public databases mentioned in the text.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rozycki, P.; Kolbusz, J.; Wilamowski, B.M. Dedicated deep neural network architectures and methods for their training. In Proceedings of the IEEE 19th International Conference on Intelligent Engineering Systems (INES), Bratislava, Slovakia, 3–5 September 2015.
2. Torres, J.J.; Munoz, M.A.; Marro, J.; Garrido, P.L. Influence of topology on the performance of a neural network. Neurocomputing 2004, 58–60, 229–234.
3. Emmert-Streib, F. Influence of the neural network topology on the learning dynamics. Neurocomputing 2006, 69, 1179–1182.
4. Lv, C.; Xing, Y.; Zhang, J.; Na, X.; Li, Y.; Liu, T.; Cao, D.; Wang, F.-Y. Levenberg–Marquardt Backpropagation Training of Multilayer Neural Networks for State Estimation of a Safety-Critical Cyber-Physical System. IEEE Trans. Ind. Inform. 2018, 14, 3436–3446.
5. Sapna, S.; Tamilarasi, A.; Kumar, M. Backpropagation learning algorithm based on Levenberg Marquardt algorithm. Comput. Sci. Inf. Technol. 2012, 7, 393–398.
6. Botev, A.; Ritter, H.; Barber, D. Practical Gauss-Newton Optimisation for Deep Learning. In Proceedings of the 34th International Conference on Machine Learning, PMLR 70, Sydney, Australia, 6–11 August 2017; pp. 557–565.
7. Gratton, S.; Lawless, A.; Nichols, N. Approximate Gauss–Newton Methods for Nonlinear Least Squares Problems. SIAM J. Optim. 2007, 18, 106–132.
8. Okut, H. Bayesian Regularized Neural Networks for Small n Big p Data. In Artificial Neural Networks—Models and Applications; IntechOpen: London, UK, 2016.
9. Gouravaraju, S.; Narayan, J.; Sauer, R.; Gautam, S. A Bayesian regularization-backpropagation neural network model for peeling computations. J. Adhes. 2023, 99, 92–115.
10. Chel, H.; Majumder, A.; Nandi, D. Scaled Conjugate Gradient Algorithm in Neural Network Based Approach for Handwritten Text Recognition. Commun. Comput. Inf. Sci. 2011, 204, 196–210.
11. Babani, L.; Jadhav, S.; Chaudhari, B. Scaled Conjugate Gradient Based Adaptive ANN Control for SVM-DTC Induction Motor Drive. In Artificial Intelligence Applications and Innovations. AIAI 2016; Iliadis, L., Maglogiannis, I., Eds.; Springer: Cham, Switzerland, 2016; Volume 475, pp. 384–395.
12. Goldfarb, D.; Ren, Y.; Bahamou, A. Practical Quasi-Newton Methods for Training Deep Neural Networks. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020.
13. Wills, A.; Schön, T.; Jidling, C. A fast quasi-Newton-type method for large-scale stochastic optimization. IFAC-PapersOnLine 2020, 53, 1249–1254.
14. Hunter, D.; Yu, H.; Pukish III, M.S.; Kolbusz, J.; Wilamowski, B.M. Selection of proper neural network sizes and architectures: A comparative study. IEEE Trans. Ind. Inform. 2012, 8, 228–240.
15. Kuri, A. The Best Neural Network Architecture. In Proceedings of the Mexican International Congress on Artificial Intelligence, Monterrey, Mexico, 24–29 October 2014.
16. Khalil, K.; Eldash, O.; Kumar, A.; Bayoumi, M. An Efficient Approach for Neural Network Architecture. In Proceedings of the 25th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Bordeaux, France, 9–12 December 2018.
17. Jinchuan, K.; Xinzhe, L. Empirical analysis of optimal hidden neurons in neural network modeling for stock prediction. In Proceedings of the Pacific-Asia Workshop on Computational Intelligence and Industrial Application, Wuhan, China, 19–20 December 2008; Volume 2, pp. 828–832.
18. Alvarez, J.M.; Salzmann, M. Learning the Number of Neurons in Deep Networks. Adv. Neural Inf. Process. Syst. 2016, 29, 1–9.
19. Li, J.Y.; Chow, T.W.S.; Yu, Y.L. Estimation theory and optimization algorithm for the number of hidden units in the higher-order feedforward neural network. In Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 27 November–1 December 1995; pp. 1229–1233.
20. Tamura, S.; Tateishi, M. Capabilities of a four-layered feedforward neural network: Four layers versus three. IEEE Trans. Neural Netw. 1997, 8, 251–255.
21. Sheela, K.G.; Deepa, S.N. Review on Methods to Fix Number of Hidden Neurons in Neural Networks. Math. Probl. Eng. 2013, 2013, 425740.
22. Madhiarasan, M.; Deepa, S.N. Comparative analysis on hidden neurons estimation in multi layer perceptron neural networks for wind speed forecasting. Artif. Intell. Rev. 2016, 48, 449–471.
23. Xu, S.; Chen, L. A novel approach for determining the optimal number of hidden layer neurons for FNN’s and its application in data mining. In Proceedings of the 5th International Conference on Information Technology and Applications, Cairns, Australia, 23–26 July 2008; pp. 683–686.
24. Shibata, K.; Ikeda, Y. Effect of number of hidden neurons on learning in large-scale layered neural networks. In Proceedings of the ICROS-SICE International Joint Conference, Fukuoka International Congress Center, Fukuoka, Japan, 18–21 August 2009; pp. 5008–5013.
25. Yotov, K.; Hadzhikolev, E.; Hadzhikoleva, S. Determining the Number of Neurons in Artificial Neural Networks for Approximation, Trained with Algorithms Using the Jacobi Matrix. TEM J. 2020, 9, 1320–1329.
26. Shen, H.-Y.; Wang, Z.; Gao, C.-Y. Determining the number of BP neural network hidden layer units. J. Tianjin Univ. Technol. 2008, 24, 13–15.
27. Ibnu Choldun, R.M.; Santoso, J.; Surendro, K. Determining the number of hidden layers in neural network by using principal component analysis. Adv. Intell. Syst. Comput. 2020, 1038, 490–500.
28. Stathakis, D. How many hidden layers and nodes? Int. J. Remote Sens. 2009, 30, 2133–2147.
29. Hanay, Y.S.; Arakawa, S.; Murata, M. Network topology selection with multistate neural memories. Expert Syst. Appl. 2015, 42, 3219–3226.
30. Perzina, R. Self-learning Genetic Algorithm for Neural Network Topology Optimization. Smart Innov. Syst. Technol. 2015, 38, 179–188.
31. Vizitiu, I.-C.; Popescu, F. GANN system to optimize both topology and neural weights of a feedforward neural network. In Proceedings of the 8th International Conference on Communications, Sintok, Malaysia, 1–3 October 2022.
32. White, D.; Ligomenides, P. GANNet: A genetic algorithm for optimizing topology and weights in neural network design. Lect. Notes Comput. Sci. 1995, 686, 322–327.
33. Arena, P.; Caponetto, R.; Fortuna, L.; Xibilia, M.G. Genetic algorithms to select optimal neural network topology. In Proceedings of the 35th Midwest Symposium on Circuits and Systems, Washington, DC, USA, 9–12 August 1992.
34. Leon, F. Optimizing neural network topology using Shapley value. In Proceedings of the 18th International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania, 17–19 October 2014.
35. Kuroda, K.; Hasegawa, M. Method for Estimating Neural Network Topology Based on SPIKE-Distance. Lect. Notes Comput. Sci. 2016, 9886, 91–98.
36. Curteanu, S.; Leon, F.; Furtuna, R.; Dragoi, E.N.; Curteanu, N. Comparison between different methods for developing neural network topology applied to a complex polymerization process. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010.
37. Guliyev, N.J.; Ismailov, V.E. On the approximation by single hidden layer feedforward neural networks with fixed weights. Neural Netw. 2018, 98, 296–304.
38. Thomas, A.J.; Petridis, M.; Walters, S.D.; Gheytassi, S.M.; Morgan, R.E. Two Hidden Layers are Usually Better than One. Commun. Comput. Inf. Sci. 2017, 744, 279–290.
39. Nakama, T. Comparisons of Single- and Multiple-Hidden-Layer Neural Networks. Lect. Notes Comput. Sci. 2011, 6675, 270–279.
40. Ibnu, C.R.M.; Santoso, J.; Surendro, K. Determining the Neural Network Topology. In Proceedings of the 8th International Conference on Software and Computer Applications—ICSCA, Penang, Malaysia, 19–21 February 2019; pp. 357–362.
41. MatLab. Available online: https://www.mathworks.com/products/matlab.html (accessed on 8 October 2022).
42. Börlin, N. Nonlinear Optimization Least Squares Problems—The Gauss-Newton Method. Available online: https://www8.cs.umu.se/kurser/5DA001/HT07/lectures/lsq-handouts (accessed on 27 December 2022).
43. Cartis, C. Linear and Nonlinear Least-Squares Problems; the Gauss-Newton Method. Mathematical Institute, University of Oxford. Available online: https://courses-archive.maths.ox.ac.uk/node/view_material/4898 (accessed on 8 October 2022).
44. Madsen, K.; Nielsen, H.B.; Tingleff, O. Methods for Non-Linear Least Squares Problems. 2004. Available online: http://www2.imm.dtu.dk/pubdb/edoc/imm3215.pdf (accessed on 8 October 2022).
Figure 1. Efficiency of neural networks in the studied objective functions for a number of layers from 1 to 100 and 3 neurons in each layer.
Figure 2. Optimal number of layers in neural networks approximating y = 2x_1 + 3x_2 + 1.
Figure 3. Efficiency of a neural network approximating y = sin(x_1) + cos(x_2) + 1.
Figure 4. ANN with r hidden layers containing an equal number of neurons and k output neurons.
Figure 5. (a) The derivative ∂F/∂r tends to zero from below as the number of layers increases. (b) The function F(r) is decreasing with respect to the number of layers.
Figure 6. Efficiency of neural networks in approximating various functions with 200 training samples and one output neuron (m = 200, k = 1). (a) y = x_1^2 + e^{x_2} + tan(x_3), n = 3, r_max = 98, r_optimum = 59. (b) y = x_1 + x_2 + cos(x_3) + x_4 x_5, n = 5, r_max = 97, r_optimum = 9.
Figure 7. Efficiency of neural networks in approximating different functions with different numbers of training samples and one output neuron (k = 1). (a) y = 2x_1 + 3x_2 + 1, n = 2, m = 50, r_max = 24, r_optimum = 2. (b) y = 2x_1 + 3x_2 + 1, n = 2, m = 10, r_max = 4, r_optimum = 1. (c) y = x_1^2 + e^{x_2} + tan(x_3), n = 3, m = 50, r_max = 23, r_optimum = 1. (d) y = x_1^2 + e^{x_2} + tan(x_3), n = 3, m = 10, r_max = 3, r_optimum = 2. (e) y = x_1 + x_2 + cos(x_3) + x_4 x_5, n = 5, m = 50, r_max = 22, r_optimum = 4. (f) y = x_1 + x_2 + cos(x_3) + x_4 x_5, n = 5, m = 10, r_max = 2, r_optimum = 2.

Citation

Yotov, K.; Hadzhikolev, E.; Hadzhikoleva, S.; Cheresharov, S. Finding the Optimal Topology of an Approximating Neural Network. Mathematics 2023, 11, 217. https://doi.org/10.3390/math11010217