1. Introduction
Usually, neural networks (NNs) are constructed using multiple layers of summation units (SUs) in all non-input layers. The net input signal to each SU is calculated as the weighted sum of the inputs connected to that unit. NNs that use SUs are referred to in this paper as summation unit neural networks (SUNNs). SUNNs with a single hidden layer of SUs can approximate any function to an arbitrary degree of accuracy provided that a sufficient number of SUs are used in that hidden layer and provided that a set of optimal weights and biases can be found [
1]. However, approximating complex functions of higher order may then require a large number of SUs. Alternatively, higher-order combinations of input signals can be used to compute the net input signal to a unit. There are many types of higher-order NNs [
2,
3,
4,
5], of which this paper concentrates on product unit neural networks (PUNNs) [
2], which are also referred to as pi–sigma NNs. PUNNs calculate the net input signal as the weighted product of inputs connected to that unit. Such units are referred to as product units (PUs). These PUs allow PUNNs to more easily approximate non-linear relationships and to automatically learn higher-order terms [
6], using fewer hidden units than SUNNs to achieve the same level of accuracy. Additionally, PUNNs have the advantage of increased accuracy, less training time, and simpler network architectures [
7].
Although PUNNs do provide advantages, they also introduce problems. If the weights leading to a PU are too large, input signals are transformed to too high an order, which may result in overfitting. Furthermore, weight updates using gradient-based optimization algorithms are computationally significantly more expensive than when SUs are used. PUs have a severe effect on the loss surface of the NN [
2,
6,
8]. The loss surface is the hyper-surface formed by the objective function values that are calculated across the search space. In the context of NN training, the objective function is the error function, e.g., sum-squared error, and the extent of the search space is defined by the range of values that can be assigned to the NN weights and biases. While analyses of the loss surfaces of feedforward NNs that employ SUs have been done [
9,
10,
11,
12,
13], the nature and characteristics of higher-order NN loss surfaces are not very well understood [
12]. Research has shown that PUs produce convoluted error surfaces, introducing more local minima, deep ravines, valleys, and extreme gradients [
7,
14]. Saddle points are likely to become more prevalent as the dimensionality of the problem increases [
12]. As a result, gradient-based training algorithms become trapped in local minima or become paralyzed (which occurs when the gradient of the error with respect to the current weight is nearly zero) [
7]. Additionally, the exponential term in PUs induces large, abrupt changes to the weights, causing good optima to be overshot [
15,
16].
Furthermore, the high dimensionality of the loss surface makes it very difficult to visualize its characteristics. Recently, Li et al. [
17] developed approaches to visualize NN loss surfaces and used such visualizations to understand the aspects that make NNs trainable. Ding et al. [
18] considered visualization of the entire search trajectory of deep NNs and projected the high-dimensional loss surfaces to lower-dimensional spaces. However, quantitative mechanisms are still needed to characterize the NN loss surface in order to better understand it. Fitness landscape analysis (FLA) is a formal approach to characterize loss surfaces [
19,
20], with the goal of estimating and quantifying various features of the loss surface and discovering correlations between loss surface features and algorithm performance. FLA can provide insight into the nature of PUNN loss surfaces in order to better understand why certain optimization algorithms succeed or fail in training PUNNs.
The goal of this paper is to perform FLA of PUNN loss surfaces and to determine how PUNN loss surfaces differ from those of SUNNs. The loss surfaces of oversized PUNNs, the effects of regularization, and the effects of the search bounds of the loss surface are also analyzed. The paper maps the performance of selected optimization algorithms to PUNN loss surface characteristics to determine for which characteristics some algorithms perform poorly or well.
The rest of this paper is structured as follows:
Section 2 describes PUNNs.
Section 3 discusses FLA, reviews FLA metrics, and describes the random walks used to gather the necessary information about the loss surface.
Section 4 provides a review of current FLA studies of NNs. PUNN training algorithms are reviewed in
Section 5. The empirical process followed to analyze the loss surface characteristics of PUNNs is described in
Section 6.
Section 7 discusses the loss surface characteristics, while correlations between the performance of PUNN training algorithms and loss surface characteristics are discussed in
Section 8.
2. Product Unit Neural Networks
Higher-order NNs include functional link NNs [4], sigma–pi NNs [3], second-order NNs [5], and PUNNs [2]. PUNNs [2,6,8] calculate the net input signal to hidden units as a weighted product of the input signals, i.e.,

$$net_{j,p} = \prod_{i=1}^{I+1} y_{i,p}^{w_{ji}}, \quad (1)$$

instead of using the traditional SU, where the net input signal is calculated as a linear weighted sum of the input signals, i.e.,

$$net_{j,p} = \sum_{i=1}^{I+1} w_{ji} y_{i,p}. \quad (2)$$

In the above, $net_{j,p}$ is the net input signal to unit $j$ for pattern $p$, $y_{i,p}$ is the activation level of unit $i$, $w_{ji}$ is the weight between units $j$ and $i$, and $I$ is the total number of units in the previous layer [14]. The bias is modeled as the $(I+1)$-th unit, where $y_{I+1,p} = -1$ for all patterns, and $w_{j,I+1}$ represents the bias [14]. A SUNN is implemented with bias units for the hidden and output layers; a PUNN is implemented with a bias unit for the output layer only. There are two types of architectures that incorporate PUs [2]: (1) each layer alternates between PUs and SUs, with the output layer always consisting of SUs; (2) a group of dedicated PUs is connected to each SU while also being connected to the input units. This paper makes use of the former architecture, with one hidden layer consisting of PUs and linear activation functions used in all layers. Using this architecture, the activation of a PU for a pattern $p$ is expressed as

$$y_{j,p} = net_{j,p} = \prod_{i=1}^{I} y_{i,p}^{w_{ji}} = e^{\sum_{i=1}^{I} w_{ji} \ln y_{i,p}} \quad (3)$$

for $y_{i,p} > 0$. If $y_{i,p} < 0$, then $y_{i,p}$ is written as the complex number $y_{i,p} = |y_{i,p}|(\cos \pi + i \sin \pi)$, yielding

$$net_{j,p} = e^{\sum_{i=1}^{I} w_{ji} \ln |y_{i,p}|} \left( \cos(\pi \phi_{j,p}) + i \sin(\pi \phi_{j,p}) \right), \quad (4)$$

where

$$\phi_{j,p} = \sum_{i=1}^{I} w_{ji} I_{i,p}, \qquad I_{i,p} = \begin{cases} 0 & \text{if } y_{i,p} \geq 0 \\ 1 & \text{if } y_{i,p} < 0. \end{cases} \quad (5)$$

The above equations illustrate that the computational costs for gradient-based approaches are higher than when SUs are used.
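To make the product unit computation concrete, the sketch below evaluates a single PU with the complex part omitted, applying the cosine correction for negative inputs described in this section. It is an illustrative helper (function and variable names are ours, not the paper's) and assumes strictly non-zero inputs.

```python
import numpy as np

def pu_activation(y, w):
    """Real part of a single product unit's output: prod_i y_i^{w_i},
    computed via exp(sum_i w_i ln|y_i|) with a cosine correction that
    tracks the sign contributed by negative inputs. Illustrative sketch;
    assumes all inputs are non-zero."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    # phi is the weighted count of negative inputs (I_i = 1 iff y_i < 0)
    neg = (y < 0).astype(float)
    phi = np.dot(w, neg)
    # magnitude term e^{sum_i w_i ln|y_i|}
    magnitude = np.exp(np.dot(w, np.log(np.abs(y))))
    return magnitude * np.cos(np.pi * phi)
```

With integer weights, the helper reproduces ordinary products, including the sign: for example, inputs (-2, 3) with unit weights give -6.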
Durbin and Rumelhart discovered that, apart from the added complexity of working in the complex domain, which results in double the number of equations and weight variables, no substantial improvements in results were gained [
2,
14]. Therefore, the complex part of Equation (
4) is omitted. Refer to [
14] for the PUNN training rules using stochastic gradient descent (SGD).
Research has shown that the approximation of higher-order functions using PUNNs provides more accurate results, better training time, and simpler network architectures than SUNNs [
7]. Training time is less because PUNNs automatically learn the higher-order terms that are required to implement a specific function [
PUNNs have increased information capacity compared to SUNNs [2,6]. The information capacity of a single PU is approximately $3N$, compared to $2N$ for a single SU, where $N$ is the number of inputs to the unit. The increased information capacity means that fewer PUs are required to learn complex functions, resulting in smaller network architectures.
3. Fitness Landscape Analysis
The concept of FLA comes from the evolutionary context in the study of the landscapes of discrete combinatorial problems [
19]. FLA has since been successfully adapted to continuous fitness landscapes [
20]. The goal of fitness landscape analysis is to estimate and quantify various features of the error surface and to discover correlations between landscape features and algorithm performance. FLA provides a better understanding as to why certain algorithms succeed or fail as well as providing a deeper understanding of the optimization problem [
21]. The features of a fitness landscape are related to four high-level properties, namely modality, structure, separability, and searchability. Modality refers to the number and distribution of optima in a fitness landscape. Structure refers to the amount of variability in the landscape and describes the regions surrounding the optima. Separability refers to the correlations and dependencies among the variables of the loss function. Searchability refers to the ability of the optimization algorithm to improve the quality of a given solution and can further be considered a metric of problem hardness [
21].
FLA is performed by randomly sampling points from the landscape, calculating the fitness value for each sampled point, and then analyzing the relationship between the spatial and qualitative characteristics of the sampled points. Therefore, it is important to consider the manner in which points are sampled for FLA. The samples need to be large enough to sufficiently describe and represent the search space in order to accurately estimate its characteristics. However, samples need to be obtained without a complete enumeration of every point in the search space, because the search space is infinite. A balance needs to be found between comprehensive sampling of the search space and the computational efficiency of doing so. It is important to note that FLA has to be done in a computationally affordable manner to make it a viable option compared to selecting the optimization algorithm and hyper-parameters through a trial-and-error approach. However, Malan argues that this requirement is not absolute, as FLA still provides a deeper understanding of the problem, providing clarification of the “black-box” nature of NNs [
20]. The computational effort in FLA is largely dependent on the sampling techniques.
The sampling techniques considered in this paper are uniform and random-walk-based sampling. Uniform sampling simply takes uniform samples from the whole landscape within set bounds. No bias is given to any points in the landscape, thus providing a more objective view of the entire landscape. However, many points are required in order for it to be effective [
20]. Alternatively, random walk sampling refers to “walking” through the landscape by taking random steps in all dimensions. Random walk methods have the advantage of gathering fitness information of neighboring points, which is required for certain fitness measures. However, simple random walks do not provide enough coverage of the search space [
20]. Instead, progressive random walks (PRWs) are used [
22]. PRWs provide better coverage by starting on the edge of the search space and then randomly moving through all dimensions, with a bias towards the opposite side of the search space. Finally, the Manhattan random walk (MRW) [
22] is similar to the PRW, but each step moves in only one dimension. MRWs allow gradient information of the landscape to be estimated. Refer to [
22] for a more detailed discussion and a visualization of the coverage of the sampling techniques.
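As an illustration of the sampling procedure, the following is a minimal sketch of a progressive random walk under the description above: start on an edge of the bounded space, step in every dimension with a bias towards the opposite side, and mirror at the boundary. Function and parameter names are illustrative, not from the cited work.

```python
import numpy as np

def progressive_random_walk(bounds, n_steps, max_step, rng=None):
    """Sketch of a PRW over a box-bounded search space.
    bounds: (lower, upper) arrays; max_step: largest per-dimension step."""
    rng = np.random.default_rng(rng)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    dim = lo.size
    # start on a random edge: each coordinate begins at lo or hi
    start_side = rng.integers(0, 2, dim)          # 0 -> at lo, 1 -> at hi
    pos = np.where(start_side == 0, lo, hi).astype(float)
    bias = np.where(start_side == 0, 1.0, -1.0)   # walk towards the opposite side
    walk = [pos.copy()]
    for _ in range(n_steps):
        step = rng.uniform(0, max_step, dim) * bias
        pos = pos + step
        # mirror at the boundary and flip the bias for that dimension
        over_hi, under_lo = pos > hi, pos < lo
        pos[over_hi] = 2 * hi[over_hi] - pos[over_hi]
        pos[under_lo] = 2 * lo[under_lo] - pos[under_lo]
        bias[over_hi | under_lo] *= -1.0
        walk.append(pos.copy())
    return np.array(walk)
```

A Manhattan variant would differ only in stepping in a single randomly chosen dimension per iteration.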
The magnitude of change in fitness throughout the landscape is quantified using gradient measures [23]. The average estimated gradient $G_{avg}$ and the standard deviation of the gradient $G_{dev}$ are both obtained by sampling with MRWs. A low value for $G_{dev}$ indicates that $G_{avg}$ is a good estimator of the gradient. Larger values of $G_{dev}$ indicate that the gradients of certain walks deviate a lot from $G_{avg}$. This is an indication of “cliffs” or sudden “peaks” or “valleys” present in the landscape [23].
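A minimal sketch of how such gradient statistics could be estimated from Manhattan-walk samples follows. The unnormalised formulation below is an assumption on our part (the cited measures normalise by the fitness range and the search-space diagonal), and the function name is illustrative.

```python
import numpy as np

def gradient_measures(fitness_walks, step_size):
    """Estimate an average-gradient and gradient-deviation statistic
    from MRW samples. Each walk is a sequence of fitness values whose
    consecutive points differ in exactly one dimension by step_size.
    Simplified, unnormalised sketch."""
    per_walk_means, per_walk_devs = [], []
    for f in fitness_walks:
        f = np.asarray(f, dtype=float)
        g = np.abs(np.diff(f)) / step_size      # |delta fitness| per unit step
        per_walk_means.append(g.mean())
        per_walk_devs.append(g.std())
    # average gradient over all walks, and average within-walk deviation
    return float(np.mean(per_walk_means)), float(np.mean(per_walk_devs))
```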
The variability of the fitness values, or ruggedness, of the landscape is estimated with the first entropic measure ($FEM$) [22]. Malan and Engelbrecht [22] proposed two measures based on the $FEM$: namely, micro-ruggedness ($FEM_{0.01}$), where the step sizes of the PRWs are 1% of the search space, and macro-ruggedness ($FEM_{0.1}$), where the step sizes of the PRWs are 10% of the search space. The $FEM$ measures provide a value in $[0, 1]$, where 0 indicates a flat landscape, and larger values indicate a more rugged landscape. For a detailed description of the $FEM$ measures and pseudocode, see [23].
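The entropic measure can be sketched as follows: fitness differences along a walk are converted into a three-symbol string at a sensitivity eps, the entropy of unequal consecutive symbol pairs is computed in base 6 (there are six such pairs), and the maximum over increasing eps is taken. This is a simplified reading of the cited formulation, not the paper's code.

```python
import numpy as np
from itertools import permutations

def entropy_h(fitness, eps):
    """H(eps): entropy of rugged (unequal) consecutive symbol pairs in
    the walk's symbol string, in base 6."""
    diffs = np.diff(np.asarray(fitness, dtype=float))
    symbols = np.where(diffs < -eps, -1, np.where(diffs > eps, 1, 0))
    pairs = list(zip(symbols[:-1], symbols[1:]))
    n = len(pairs)
    h = 0.0
    for p, q in permutations([-1, 0, 1], 2):   # the 6 pairs with p != q
        prob = pairs.count((p, q)) / n
        if prob > 0:
            h -= prob * (np.log(prob) / np.log(6))
    return h

def fem(fitness, n_eps=100):
    """FEM sketch: max of H(eps) over increasing sensitivity eps; the
    micro/macro variants differ only in the walk's step size."""
    diffs = np.abs(np.diff(np.asarray(fitness, dtype=float)))
    eps_star = diffs.max()            # smallest eps that flattens the walk
    if eps_star == 0:
        return 0.0                    # already flat
    return max(entropy_h(fitness, eps_star * i / n_eps) for i in range(n_eps))
```

A perfectly flat walk yields 0, while an alternating up/down walk yields log_6(2), reflecting that only two of the six rugged pair types occur.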
The fitness–distance correlation ($FDC$) was introduced by Jones [24] as a measure of global problem hardness. The $FDC$ measure is based on the premise that, for a landscape to be easily searched, error should decrease as the distance to the optimum decreases in the case of minimization problems. The $FDC$ measures the covariance between the fitness of a solution and its distance to the nearest optimum. Fitness should therefore correlate well with the distance to the optimum if the optimum is easy to locate. However, the $FDC$ requires knowledge of the global optima, which is often unknown for optimization problems. Therefore, this measure was extended by Malan [20] by making use of the fittest points in the sample instead of the global optima ($FDC_s$). Instead of estimating how well the landscape guides the search towards the optimum, the $FDC_s$ quantifies how well the problem guides the search towards areas of better fitness. Therefore, $FDC_s$ changes the focus from a measure of problem hardness to one of searchability. The $FDC_s$ measure gives a value in $[-1, 1]$, where 1 indicates a highly searchable landscape, −1 indicates a deceptive landscape, and 0 indicates a lack of information in the landscape to guide the search.
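Using the fittest sampled point in place of the (unknown) global optimum, a sketch of this searchability measure could look as follows; the Pearson-correlation formulation and function name are assumptions on our part.

```python
import numpy as np

def fdc_s(points, fitness):
    """Fitness-distance correlation to the fittest sampled point:
    Pearson correlation between each sample's fitness (error, to be
    minimised) and its Euclidean distance to the best sample. A value
    near 1 means fitness improves as the search nears the best region."""
    points = np.asarray(points, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    best = points[np.argmin(fitness)]          # fittest point in the sample
    dists = np.linalg.norm(points - best, axis=1)
    return float(np.corrcoef(fitness, dists)[0, 1])
```

On a sample where error grows linearly with distance from the best point, the sketch returns 1, the highly searchable extreme.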
The dispersion metric ($DM$) [25] is calculated by comparing the overall dispersion of uniformly sampled points to that of a subset of the fittest points. The $DM$ describes the underlying structure of the landscape by estimating the presence of funnels. A funnel in a landscape is a global basin shape that consists of clustered local minima [20]. A single-funnel landscape has an underlying unimodal “basin”-like structure, whereas a multi-funnel landscape has an underlying multimodal structure. Multi-funnel landscapes can present problems for optimization algorithms because they may become trapped in sub-optimal funnels [20]. A positive value for $DM$ indicates the presence of multiple funnels.
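A sketch of the dispersion computation: compare the average pairwise distance of the fittest few percent of a uniform sample with that of the whole sample. The 5% threshold and the normalisation-free form are illustrative assumptions.

```python
import numpy as np

def dispersion_metric(points, fitness, best_fraction=0.05):
    """DM sketch: dispersion (mean pairwise distance) of the fittest
    best_fraction of the sample minus the dispersion of the full
    sample. Positive values hint at multiple funnels: the best points
    are spread across separate basins rather than clustered."""
    points = np.asarray(points, dtype=float)
    order = np.argsort(fitness)                      # ascending error
    k = max(2, int(len(points) * best_fraction))
    best = points[order[:k]]

    def avg_pairwise(x):
        # mean Euclidean distance over all distinct pairs
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        n = len(x)
        return d.sum() / (n * (n - 1))

    return float(avg_pairwise(best) - avg_pairwise(points))
```

On a 1-D sample whose error grows away from a single point, the fittest points cluster and the sketch is negative; if the error is low at both ends of the range, the fittest points straddle both "funnels" and the value turns positive.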
Neutrality of the landscape can be characterized by the $M_1$ and $M_2$ measures [26]. $M_1$ calculates the proportion of neutral structures in a PRW in order to estimate the overall neutrality of the landscape. $M_2$ estimates the relative size of the largest neutral region. The $M_1$ and $M_2$ measures both produce values in $[0, 1]$, where 1 indicates a completely neutral landscape, and 0 indicates that the landscape has no neutral regions.
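A simplified sketch of the two neutrality estimates, treating a walk step as neutral when the fitness change is within a small eps; the cited formulation is defined over three-point neutral structures on a PRW, so this is an approximation with illustrative names.

```python
import numpy as np

def neutrality_measures(fitness, eps=1e-8):
    """Sketch of two neutrality statistics over a walk's fitness values:
    the proportion of neutral steps (|change| <= eps), and the longest
    run of consecutive neutral steps relative to the walk length."""
    diffs = np.abs(np.diff(np.asarray(fitness, dtype=float)))
    neutral = diffs <= eps
    m1 = float(neutral.mean())                 # overall neutrality
    longest = run = 0
    for flag in neutral:                       # longest neutral run
        run = run + 1 if flag else 0
        longest = max(longest, run)
    m2 = longest / len(neutral)                # relative largest neutral region
    return m1, m2
```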
4. Neural Network Fitness Landscape Analysis
Though NNs have been studied extensively and have been widely applied, the landscape properties of the loss function are still poorly understood [
12]. A review of early analyses of NN error landscapes can be found in [
21].
Recent FLA of feedforward NNs have provided valuable insights into the characteristics of the loss surfaces produced when SUs are used in the hidden and output layers. Gallagher [
27] applied principal component analysis to simplify the error landscape representation to visualize NN error landscapes. It was found that NN error landscapes have many flat areas with sudden cliffs and ravines: a finding recently supported by Rakitianskaia et al. [
28]. Using formal random matrix theory, proofs have been provided to show that NN error landscapes contain more saddle points than local minima, and the number of local minima reduces as the dimensionality of the loss surfaces increases [
12]. This finding was also recently supported by Rakitianskaia et al. [
28] and Bosman et al. [
29].
Bosman et al. [
30] analyzed fitness landscape properties under different space boundaries. The study showed that larger bounds result in highly rugged error surfaces with extremely steep gradients and provide little information to guide the training algorithm. Rakitianskaia et al. [
28] and Bosman et al. [
29] showed that more hidden units per hidden layer reduce the number of local minima and simplify the shape of the global attractor, while more hidden layers sharpen the global attractor, making it more exploitable. In addition, the dimensionality of loss surfaces increases, which results in more rugged, flatter landscapes with more treacherous cliffs and ravines. Bosman et al. [
10] investigated landscape changes induced by the weight elimination penalty function under various penalty coefficient values. It was shown that weight elimination alters the search space and does not necessarily make the landscape easier to search: the error landscape becomes smoother, while more local minima are introduced. An analysis of the impact of the quadratic and entropic loss functions on the error landscape indicated that the entropic loss results in stronger gradients and fewer stationary points than the quadratic loss, yielding a more searchable landscape.
In order to cover as many insightful areas of the loss surfaces of NNs as possible, Bosman et al. [
11] proposed a progressive gradient walk to specifically characterize basins of attraction. Van Aardt et al. [
26] developed measures of neutrality specifically for NN error landscapes.
Dennis et al. [
13] evaluated the impact of changes in the set of training samples to NN error surfaces by considering different active learning approaches and mini-batch sizes. It was shown that aspects of structure (specifically gradients), modality, and searchability are highly sensitive to changes in the training examples used to adjust the NN weights. It was also found that different subsets of training examples produce minima at different locations in the loss surface.
Very recently, Bosman et al. [
31] analyzed the impact of activation functions on loss surfaces. It was shown that the rectified linear activation function yields the most convex loss surfaces, while the exponential linear activation function yields the flattest loss surface.
Yang et al. [
32] analyzed the local and global properties of NN loss surfaces. Changes to the loss surface characteristics under variation of control parameter values were analyzed, as well as the impact of different training phases on the loss surfaces. Sun et al. [
33] provided a recent review of research on the global structure of NN loss surfaces, with specific focus on deep linear networks. Approaches to perturb the loss function to eliminate bad local minima were analyzed, as well as the impact of initialization and batch normalization. Recent loss surface analyses focused on gaining a better understanding of the loss surfaces of deep NNs [
34,
35,
36,
37].
Despite the advances made in gaining a better understanding of NN loss surfaces, no FLA studies exist to analyze the characteristics of loss surfaces produced when PUs are used. Therefore, a need exists for such an analysis, which is the focus of this paper.
7. Empirical Analysis of Loss Surface Characteristics
This section discusses the results of the FLA of PUNN loss surfaces for the different architectures in comparison to the loss surfaces produced by SUNNs.
Section 7.1 discusses the results obtained from the optimal network architectures, while
Section 7.2 and
Section 7.3, respectively, consider the oversized and regularized architectures. The results for the regression problems are given in
Table 3, and those for the classification problems are given in
Table 4. In these tables, oPUNN refers to the optimal PUNN architectures, osPUNN refers to the oversized PUNN architectures, and rPUNN refers to regularized PUNN architectures.
7.1. Optimal Architectures
For SUNN and PUNN loss surfaces, the nature of the PUNN loss surface is best captured by the $G_{avg}$ and $G_{dev}$ metrics, which are substantially larger for PUNNs in every scenario except the XOR problem. Even for loss surfaces with smaller bounds, the PUNN $G_{avg}$ is significantly larger than that of the SUNN for the majority of the problems. Larger bounds resulted in loss surfaces with even larger gradients, especially for the Diabetes and problems. The large values for $G_{dev}$ mean that the gradients of certain walks deviate substantially from $G_{avg}$. This is an indication of sudden cliffs or valleys present in the loss surfaces. The $G_{avg}$ and $G_{dev}$ metrics portray the treacherous nature of the PUNN landscape, i.e., extreme gradients and deep ravines and valleys.
The ruggedness of loss surfaces is estimated using entropy via the $FEM_{0.01}$ and $FEM_{0.1}$ metrics. The amount of entropy can be interpreted as the amount of “information” or variability in the loss surface [22]. There exists a prominent trend between the gradient and ruggedness measures: loss surfaces with smaller gradients are related to very rugged surfaces, whereas extremely large gradients are related to smoother surfaces. This relationship is observed for all of the problems, where Iris, Wine, Diabetes, and have large gradients and smaller $FEM$ values. Conversely, XOR, , and have smaller gradients and larger $FEM$ values. Except for XOR, the PUNN loss surfaces are smoother than the SUNN loss surfaces, which is validated by the smaller values obtained for $FEM_{0.01}$ and $FEM_{0.1}$. SUNN landscapes tend to have more variability or “information”, whereas PUNN landscapes tend to be smoother, with more consistent increases or decreases of loss values. Since surfaces with larger bounds have larger gradients, larger bounds tend to produce smoother loss landscapes. The macro-ruggedness values of $FEM_{0.1}$ exceed the corresponding micro-ruggedness values of $FEM_{0.01}$ for all scenarios, indicating that larger step sizes experience more variation in both NN loss surfaces.
$FDC_s$ estimates how searchable a loss surface is by quantifying how well the surface guides the search towards areas of better quality. The PUNN $FDC_s$ values for the regression problems are all moderately positive, indicating that PUNN landscapes are not deceptive but possess informative landscapes, making them more searchable. PUNN loss surfaces for all regression problems except are more searchable than those of SUNNs. This does not hold for the classification problems, where PUNN loss surfaces tend to be less searchable. Further, the searchability of both PUNN and SUNN loss surfaces decreases for the classification problems. This is a result of the fact that the classification problems are higher-dimensional, and thus, the volume of the landscape grows exponentially with the dimension of the landscape. Therefore, the distances between solutions of good quality become very large, producing smaller $FDC_s$ values. This also explains the fact that landscapes with larger bounds are less searchable for all problems.
$DM$ indicates the presence of funnels: negative values indicate single funnels, while positive values indicate multiple funnels. Negative values were obtained for all loss surfaces, indicating single-funnel landscapes that create basin-like structures for both PUNNs and SUNNs. It is important to note that the $DM$ measure does not estimate modality. Therefore, it is possible and likely to still have multiple local minima residing in the global basin structure. PUNN landscapes tend to produce more negative $DM$ values, which is indicative of a simpler global topology for PUNN surfaces. Landscapes with larger bounds produce more negative $DM$ values, correlating with landscapes of simpler global topology. Single-funneled landscapes are more searchable landscapes [20], which suggests why PUNN landscapes are more searchable with respect to $FDC_s$ than SUNN landscapes for regression problems.
The neutrality metrics $M_1$ and $M_2$ show a general trend of smaller neutrality for PUNN landscapes, indicating that the SUNN loss surfaces are more neutral than those of PUNNs. This is in agreement with the observation of larger gradients in the PUNN loss surfaces. Larger bounds create even less neutral loss surfaces for PUNNs, correlating with the observation that larger bounds create larger gradients. The effects that larger bounds have on neutrality are amplified when architectures are higher-dimensional, such as Iris, Wine, Diabetes, and
; for lower-dimensional architectures, e.g.,
and XOR, larger bounds actually create more neutral PUNN loss surfaces. The higher-dimensional architectures have more weights in the PUs, and thus, solution quality is more susceptible to changes in the weights. Another reason why SUNN loss surfaces are more neutral is because of their tendency to have more saddle points. This is a result of the fact that SUNN architectures tend to be higher-dimensional, for which, according to theoretical findings, saddle points are more prevalent [
12]. Furthermore, $M_2$ tends to differ less drastically and is similar in cases such as
,
,
, and Wine. This indicates that, although PUNN loss surfaces tend not to be as neutral as SUNN loss surfaces in general, the longest neutral areas of both tend to be the same size.
7.2. Oversized Architectures
Recall that oversized architectures are investigated to analyze the effect of overfitting behavior on the PUNN loss surfaces. The loss surfaces produced by PUNNs with oversized hidden layers are referred to as complex PUNN landscapes (CPLs) for the purposes of this section. The landscapes of PUNNs with optimal architectures are referred to as optimal PUNN landscapes (OPLs).
Most of the differences between CPLs and OPLs are a result of the differences in dimensionality. CPLs tend to have larger gradients, as indicated by larger $G_{avg}$ values for most problems. CPLs also have larger $G_{dev}$ values, which is indicative of more sudden ravines and valleys in the landscape. Smaller $M_1$ and $M_2$ values for CPLs show that OPLs are more neutral than CPLs. This can be attributed to the larger gradients of CPLs. $FDC_s$ values tend to be smaller for CPLs than for OPLs. This is a result of the dimensionality differences, as discussed in the previous section, as well as the fact that the oversized architectures have irrelevant weights, introducing extra dimensions to the search space. The extra dimensions do not add any extra information and only divert the search, thus making the landscape less searchable. Larger $DM$ values are obtained for CPLs, indicating that they have multi-funnel landscapes. Therefore, the global underlying structures of CPLs are more complex than those of OPLs, which is in agreement with the fact that CPLs are less searchable than OPLs, as is the case with multi-funnel landscapes. There is a mixed result with respect to the micro-ruggedness of the CPLs: even though CPLs tend to have larger gradients than OPLs, which is usually an indication of a smoother landscape, CPLs produce larger $FEM_{0.01}$ values than OPLs for XOR, Iris, , and . The macro-ruggedness $FEM_{0.1}$ values of CPLs tend to be larger than those of OPLs, which suggests that CPLs experience more variation in the landscape with larger step sizes than OPLs. Therefore, CPLs possess higher variability across the landscapes than OPLs.
7.3. Regularized Architectures
For the purposes of this section, the loss surfaces produced by regularized PUNNs are referred to as regularized PUNN landscapes (RPLs). The only noticeable effect that regularization has on the fitness landscape of a PUNN is a change in the gradient measures. RPLs have larger magnitudes of gradients, as indicated by larger $G_{avg}$ values. Larger gradients are caused by the addition of the penalty term to the objective function, which increases the overall error and causes larger loss values and, thus, larger gradients. Additionally, larger $G_{dev}$ values indicate that regularization creates sudden ravines and valleys in the landscape, possibly introducing more local minima. The regularization coefficient $\lambda$ has a severe effect on the landscape [14,21]. However, a small value of $\lambda$ (for SUNNs) is not likely to influence the error landscape significantly [21]. Referring to Table 2, the optimal value obtained from tuning the penalty coefficient was small for all problems. This was most likely due to the fact that a smaller value for $\lambda$ made the contribution of the penalty term insignificant to the overall error. Therefore, as a result of the small optimal value used for $\lambda$, no other significant changes to the fitness landscape were detected by the fitness landscape measures besides $G_{avg}$ and $G_{dev}$.
8. Performance and Loss Surface Property Correlation
The purpose of this section is to find correlations between good (or bad) performance of the optimization algorithms and the fitness landscape characteristics of the PUNN loss surfaces produced for the different classification and regression problems. The purpose of the section is not to compare the performances of the optimization algorithms. Comparisons of PUNN training algorithms can be found in [
7,
15,
16].
The performance results for the different PUNN training algorithms are summarized in
Table 5 and
Table 6 for the regression and classification problems, respectively. Provided in these tables are the average training error $\bar{E}_T$, the best training error achieved over the independent runs, the average generalization error $\bar{E}_G$, the best generalization error, and deviation values (given in parentheses).
Results for SGD are not provided because it failed to train PUNNs for all problems. SGD only succeeded when the weights were initialized very close to the optimal weights. The reasons behind the failure of SGD can now be understood using FLA: it was observed that the average gradients for PUNN loss surfaces were exceptionally large, orders of magnitude larger than those of SUNN loss surfaces. The standard deviations for PUNN loss surfaces were also very large, indicative of sudden ravines or valleys in the PUNN loss surfaces. These characteristics trap or paralyze SGD. Larger values of $G_{dev}$ suggest that not all the MRWs sampled such extreme gradients. Taking into consideration that the longest neutral areas of both SUNN and PUNN loss surfaces tend to be the same size, only certain parts of the PUNN loss surface have extreme gradients, whereas some areas are still relatively level. Such loss surfaces are impossible to search using gradient-based algorithms. $G_{avg}$ and $G_{dev}$ are the only measures that differ substantially between PUNN and SUNN loss surfaces. Therefore, the gradient measures are likely to be the most relevant fitness landscape measures that explain why SGD works for SUNNs and fails for PUNNs.
Smaller $\bar{E}_T$ values were obtained for OPLs compared to CPLs for all classification problems. Note that the dimensionality difference between OPLs and CPLs is the most significant for the classification problems. Loss surfaces with larger bounds, and hence larger landscape volumes, are also correlated with worse training performance. Therefore, the performance of both PSO and DE deteriorates for loss surfaces with higher dimensionality. This agrees with findings in the literature [14] and is referred to as the “curse of dimensionality”. The deterioration in training performance for CPLs can be explained by the observed loss surface characteristics. CPLs were found to be less searchable, possessing more complex global structures (multi-funnels) and increased ruggedness. Therefore, the $FDC_s$, $DM$, and $FEM$ measures capture the effect that the “curse of dimensionality” has on the loss surface. A general trend of overfitting and inferior $\bar{E}_G$ is observed for CPLs for Diabetes, Iris, Wine, and the majority of regression problems. The correlation of $FDC_s$, $DM$, and $FEM$ with the training and generalization performance indicates that they are meaningful fitness landscape measures for performance prediction for PUNNs, especially where oversized PUNN architectures are used.
Training of regularized PUNN architectures resulted in lower $\bar{E}_T$ for nearly all problems compared to oversized PUNN architectures, suggesting that regularization makes the RPLs more searchable. The only effect that regularization had on the PUNN loss surfaces was larger gradient measures. Larger $G_{avg}$ values can thus be linked with improved training performance on RPLs. For Diabetes and Wine, the PUNNs with larger bounds produced very large $\bar{E}_T$ and $\bar{E}_G$ values. However, the best training and generalization errors are still small. This observation, along with the fact that $\bar{E}_T$ and $\bar{E}_G$ have large deviations, suggests that a few simulations became stuck in poor areas. This can be correlated with the fact that large $G_{dev}$ values were observed for RPLs, which suggests sudden valleys and ravines. These landscape features are possibly the reason that the PSO and DE algorithms became stuck, leading to poor performance. Furthermore, DE became stuck in areas of worse quality for more problems, suggesting that large values of $G_{dev}$ are an indication to use PSO instead of DE. Finally, $\bar{E}_G$ decreased for RPLs compared to CPLs; therefore, regularization proved effective at improving the generalization performance of PUNNs.
9. Conclusions
The main purpose of this work was to perform a fitness landscape analysis (FLA) on the loss surfaces produced by product unit neural networks (PUNNs). The loss surface characteristics of PUNNs were analyzed and compared to those of SUNNs to determine in what way PUNN and SUNN loss surfaces differ.
PUNN loss surfaces have extremely large gradients on average, with large amounts of deviation over the landscape suggesting many deep ravines and valleys. Larger bounds and regularized PUNN architectures lead to even larger gradients. Stochastic gradient descent (SGD) failed to train PUNNs due to the treacherous gradients of the PUNN loss surfaces. The gradients of PUNN loss surfaces are significantly larger than those of SUNN loss surfaces, which explains why gradient descent works for SUNNs and not PUNNs. Therefore, optimization algorithms that make use of gradient information should be avoided when training PUNNs. Instead, meta-heuristics such as particle swarm optimization (PSO) and differential evolution (DE) should be used. PSO and DE successfully trained PUNNs of all architectures for all problems.
PUNN loss surfaces are less rugged, more searchable in lower dimensions, and less neutral than SUNN loss surfaces. The smoother PUNN loss surfaces were strongly correlated with the larger gradient measures and were found to possess simpler overall global structures than SUNN loss surfaces, where the latter had more multi-funnel landscapes. Oversized architectures created higher-dimensional landscapes, decreasing searchability, increasing ruggedness, and producing an overall more complex multi-funnel global structure. The $FDC_s$, $DM$, and $FEM$ metrics correlated well with the poor training performance DE and PSO achieved for oversized PUNN architectures, capturing the effects of the “curse of dimensionality” on PUNN loss surfaces. Regularized PUNN loss surfaces were more searchable than complex PUNN loss surfaces, leading to better training and generalization performance. The $G_{avg}$ metric described the effect regularization had on PUNN loss surfaces and was found to correlate well with the better training performance of DE and PSO for regularized PUNNs. Regularized PUNN loss surfaces had more deep ravines and valleys that trapped PSO and DE. Finally, PSO is suggested for loss surfaces that have large $G_{dev}$ values.