1. Introduction
Every connection between the input and output units of a neural network (NN) carries a weight, and a neural network is composed of many such connected units. The initial values of these weights are chosen at random. The weights and biases connecting the nodes in each layer of the neural network are trained using a training set, which represents a sizeable portion of the original dataset. The two error functions most commonly used with backpropagation are the mean squared error (MSE) and the root mean squared error (RMSE). The neural network modifies its weights and biases through backpropagation. However, training a neural network with backpropagation has a few drawbacks. First, producing the best discriminant for classification, or the best function for regression-value prediction, requires considerable computational cost and data to tune the weights and biases. Second, the model can fall into a local minimum without being able to escape it. Another commonly observed issue is the number of epochs required for network training: more data are needed for training, which frequently takes a long time.
In classification tasks, each data point must be assigned to one of N classes. Regression problems, on the other hand, aim to produce a specific output value for a given input. To perform classification, the NN learns a discriminant function that separates the classes. For instance, a network with a single linear output can solve a two-class problem by learning a discriminant function that is larger than zero for class A and smaller than zero for class B.
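As a minimal illustration of such a single linear output unit (the weights, bias, and sample below are hypothetical, not taken from the paper), the sign of the discriminant decides the class:

import numpy as np

# Hypothetical weights and bias of a single linear output unit
w = np.array([0.7, -1.2])
b = 0.3

def discriminant(x):
    """Linear discriminant: > 0 -> class A, < 0 -> class B."""
    return np.dot(w, x) + b

sample = np.array([1.5, 0.4])
label = "A" if discriminant(sample) > 0 else "B"
print(f"discriminant value = {discriminant(sample):.3f}, predicted class = {label}")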
The most noticeable feature of a swarm is its ability to forage for food. In computational terms, this behavior is referred to as exploration and exploitation. Exploiting this behavior allows machine learning algorithms to escape local minima and to reduce their search time by making use of communication between particles. One popular algorithm based on particle behavior is particle swarm optimization (PSO). It can be used to optimize a neural network by finding the ideal distribution of weights and biases. The backpropagation algorithm used with neural networks has a propensity to fall into local minima, which lowers the algorithm's overall performance. The integration of PSO and NN aims to highlight the differences in NN performance between networks trained with backpropagation on the training data and those trained with the PSO method.
Perceptrons, feed-forward NNs, long short-term memory networks, multilayer perceptrons, convolutional NNs, radial basis function NNs, recurrent NNs, and feed-forward networks with backpropagation are a few examples of NN types. This implementation uses a feed-forward NN.
2. Materials
When creating an NN for a problem or application, picking the right NN architecture is typically a crucial step. Usually this is done by trial and error, where the network is trained and tested and the number of layers is selected based on past performance. It is important to be aware that the data contain noise, such as irrelevant features, redundant features, and outliers, because the architecture of the NN is based on the training set. For this reason, it is crucial to pre-process the data before training the NN. One pre-processing method used to address this issue is feature selection (FS) [1].
In FS, a subset of the features is chosen to be included in a model for classification. The basic idea behind the attribute-selection process is that, since the dataset contains numerous attributes that are unnecessary and can be trimmed, deleting them will not cause the dataset to lose any information. This can further improve the predictor's performance, enabling an accurate prediction. Redundancy is a difficult notion to pin down, because an attribute is only redundant if the information it carries is already provided by another feature. The crucial point is that feature selection always yields a smaller subset of features than the original set, because it returns only the best subset of attributes from a larger attribute set. Attribute selection is especially important when working with datasets that have a large number of attributes or variables. It has the benefit of making the predictor more accurate, quicker, and less expensive [2]. FS techniques can be classified into two models: the filter model and the wrapper model. Filter models, which include statistical techniques such as discriminant analysis (DA), principal component analysis (PCA), which uses eigenvalues and eigenvectors for attribute reduction, factor analysis (FA), and independent component analysis (ICA), rely mostly on indirect, distance-guided measures of performance. Although this approach is quick, the subset of features it produces might not be the best. The wrapper model, in contrast, selects a subset of features using various selection techniques and assesses the outcome by computing the classification accuracy with the classification algorithm itself [3].
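To make the distinction concrete, the following sketch contrasts a filter-style reduction (PCA) with a wrapper-style evaluation that scores a candidate feature subset by classification accuracy; the dataset, classifier, and candidate subset are assumptions chosen only for illustration.

import numpy as np
from sklearn.datasets import load_breast_cancer          # stand-in dataset (assumption)
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Filter-style reduction: project onto principal components (eigen-decomposition),
# chosen without consulting the classifier.
X_filtered = PCA(n_components=5).fit_transform(X)

# Wrapper-style evaluation: score an explicit candidate subset of columns
# with the classifier that will ultimately be used.
candidate_subset = [0, 3, 7, 21]                          # hypothetical feature indices
clf = LogisticRegression(max_iter=5000)
wrapper_score = cross_val_score(clf, X[:, candidate_subset], y, cv=5).mean()
filter_score = cross_val_score(clf, X_filtered, y, cv=5).mean()

print(f"filter (PCA) accuracy:   {filter_score:.3f}")
print(f"wrapper subset accuracy: {wrapper_score:.3f}")

The filter step is fast but ignores the classifier, while the wrapper step is slower but measures exactly the quantity of interest, which mirrors the trade-off described above.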
Each PSO particle is responsible for selecting a subset of attributes and then determining the optimal neural network design based on those features. PSO algorithms are therefore able to carry out FS. Through the PSO algorithm's exploration and exploitation mechanisms, all particles cooperate to find the optimal features and neural network design, that is, the optimal number of layers and neurons in each layer. PSO can overcome the limitation of being restricted to local minima by applying particle mutations according to a mutation probability. The input features are represented as a feature vector [1].
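One common way for a particle to encode a feature subset and a layer configuration at the same time is a binary mask over the features plus integer neuron counts for the hidden layers; the encoding below is a hypothetical sketch of that idea, not the exact scheme used in [1].

import numpy as np

rng = np.random.default_rng(0)
n_features = 19            # e.g., number of attributes in the dataset (assumption)
max_hidden_layers = 3      # assumed upper bound on hidden layers

# Each particle: a binary mask over features plus neuron counts per hidden layer.
def random_particle():
    return {
        "feature_mask": rng.integers(0, 2, size=n_features),        # 1 = feature selected
        "hidden_sizes": rng.integers(2, 33, size=max_hidden_layers)  # neurons per layer
    }

particle = random_particle()
selected = np.flatnonzero(particle["feature_mask"])
print("selected feature indices:", selected)
print("hidden layer sizes:", particle["hidden_sizes"])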
The NN architecture can also be optimized using the PSO algorithm. In a multi-modal landscape, the standard PSO algorithm struggles to converge to the global optimum and may settle in a local one. To solve this problem, a dynamic multi-layer PSO (DMLPSO) can be utilized. Important information is dynamically exchanged among many swarms, increasing population variety and improving performance. Particle swarm optimization provides numerous benefits over other evolutionary computation techniques, including simplicity, quick convergence, and resilience. PSO also excels at solving global optimization problems. Through cross-layer cooperation, multi-layer PSO can thoroughly explore multi-modal regions by introducing multi-layer search strategies that add more search layers. PSO uses second-order dynamic equations, a type of discrete system, to update the positions of the particles. Two techniques that can be utilized in DMLPSO are multi-layer searching and dynamic reorganization. The search space explored by each particle is determined by a number of factors, such as the best location found by the particle itself, the best location of its lower sub-swarms, and the best location of its upper sub-swarms. For DMLPSO, the authors use the following parameter values: number of layers = 4, number of initial particles = 64, SPD = (1, 4, 4, 4), Velocity_max = 3, Velocity_min = −3 [4].
The PSO algorithm is a bionic, intelligent algorithm. It is a full-featured random search method with few parameters. By optimizing the weight values and thresholds of a backpropagation (BP) network, the combination with a BP neural network can fix some of its flaws and make it more adaptable. The authors of [5] suggest an immune algorithm that, through iterations, inoculates the undesirable particles in the population with a vaccine extracted from the population's best particles. A random suspension method is used to suspend the currently optimal individual particles in the swarm in order to prevent premature convergence of PSO. Immunological memory refers to the immune system's ability to retain memory cells that produce antibodies in response to foreign antigens; when the same antigen invades again, the associated memory cells are activated and produce a large number of antibodies. When properties such as diversity and immunological memory are incorporated into this PSO method, the algorithm improves its ability to conduct global searches rather than settling for local solutions. Antibodies with lower concentrations are more likely to be selected, while the selection probability decreases as the antibody concentration increases. The authors use the following initial values for the BP-NN: net training parameter goal = 0.0001, number of hidden layer nodes = 5, and learning rate lr = 0.1. In the PSO part, v_max = 1.2, v_min = 0.4, c1 = c2 = 2.05, population size pop_size = 40, and maximum number of generations max_gen = 100 [3].
A thorough examination of classification algorithms using NNs is provided in [6]. The study's goal was to provide a novel hybrid neural classification method for improving the efficacy and accuracy of classic NNs. According to the authors' research, the suggested algorithm outperforms the Adam neural network algorithm, Random Forest, Gradient Boosting Machine, Lasso, Ridge, linear and quadratic discriminant analysis, logistic regression, and Lasso regression in terms of accuracy and stability, but it also takes longer to process.
The convergence rate of the BP-NN is somewhat slow, and it may capture a solution at a local minimum. Under these conditions, PSO can be used to train a feed-forward neural network. The Nash–Sutcliffe and correlation coefficients can be used to evaluate a model's performance. The network may sometimes become stuck even when there is another set of neuronal weights in the weight space whose cost function value is significantly lower than the local minimum. While modelling an NN, many additional elements must be taken into account, and all of these factors also affect the convergence of BP-NN training. In comparison to GA, the authors found the PSO approach to be simpler, with fewer iterations and fewer control variables [7,8,9].
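Since the Nash–Sutcliffe efficiency and the correlation coefficient are named as evaluation measures, a short sketch of how both can be computed from observed and predicted values is given below (the sample arrays are hypothetical).

import numpy as np

observed  = np.array([2.1, 3.4, 4.0, 5.2, 6.8])   # hypothetical target values
predicted = np.array([2.0, 3.6, 3.9, 5.5, 6.5])   # hypothetical model outputs

# Nash–Sutcliffe efficiency: 1 - (residual variance / variance of the observations)
nse = 1.0 - np.sum((observed - predicted) ** 2) / np.sum((observed - observed.mean()) ** 2)

# Pearson correlation coefficient between observations and predictions
r = np.corrcoef(observed, predicted)[0, 1]

print(f"NSE = {nse:.3f}, correlation coefficient = {r:.3f}")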
The diabetic retinopathy dataset is used by the authors of [10]. They use 10-fold cross-validation to partition the dataset into 10 equal parts, each of which serves in turn as test and training data. After this division, the features or characteristics are selected using the PSO approach. Once the ideal characteristics have been selected, a neural network is used to classify the training data, while the test data with the chosen characteristics are validated against the training data, again using a neural network. The final phase, after the training and test data have been validated with the neural network, is the presentation of the data-classification results.
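A schematic of that evaluation protocol might look like the sketch below; the dataset, the trivial stand-in for PSO-based feature selection, and the MLPClassifier settings are all assumptions made to keep the example short and runnable.

import numpy as np
from sklearn.datasets import load_breast_cancer      # stand-in for the retinopathy data (assumption)
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

def pso_select_features(X_train, y_train):
    # Stand-in for the PSO-based feature selection step described above;
    # here we simply keep all features so the sketch stays runnable.
    return np.arange(X_train.shape[1])

X, y = load_breast_cancer(return_X_y=True)
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    features = pso_select_features(X[train_idx], y[train_idx])
    nn = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    nn.fit(X[train_idx][:, features], y[train_idx])
    scores.append(nn.score(X[test_idx][:, features], y[test_idx]))

print(f"mean 10-fold accuracy: {np.mean(scores):.3f}")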
Deciding on the ideal number of hidden neurons in each intermediate layer and the connection weights of a neural network at the same time is considered a challenging task. This is because modifying the hidden neurons drastically changes the topology of the network, making training more challenging and necessitating special considerations. The authors of [11] propose a PSO with Levy flight-based multiverse optimization (MVO). PSO is a comparatively new, fast algorithm that improves the balance between exploitation and exploration while preventing early convergence. Because of its quick convergence and simplicity of use, it is among the most important meta-heuristic algorithms.
The authors of [12] suggest an upgraded PSO that greatly outperforms the original PSO algorithm in terms of global search as well as solution fine-tuning. This improved PSO makes use of parameter automation strategies, velocity resets, crossover, and mutation. Improvements such as time-varying parameters and random perturbations like velocity resetting have been proposed to help overcome the limitations of the traditional PSO, while mutation and crossover are used to create diversity. The authors of [13] present a technique which states that, before a BPN is used to solve a problem, its parameter settings, such as the number of hidden layers, the learning rate, the momentum term, the number of neurons in the hidden layer, and the number of learning cycles, must be established. To prevent building sub-optimal network models, which can drastically increase computing cost and degrade results, the parameterization of the network architecture should be carefully considered. The authors claim that the classifier's classification accuracy can be increased by selecting pertinent features or de-noising the data. The authors use the following parameter values: c1 = 0.8, c2 = 1.5, w = 0.9, and population size = 10 [9].
3. Methodology
The error functions serve as the foundation for the BP of the NN. The weights and biases are modified in accordance with the results of the error function. The mean squared error (MSE) for problems requiring real-valued outputs (regression) and the cross entropy for problems requiring classes (targets or labels) are the two most widely used error functions. The error function in this instance is the cross entropy [11]. Given two input–output pairs, the mean square error function is as follows:
U = {(u_1, y_1), (u_2, y_2)},
where each input u is a position vector given in two dimensions and y is its target value. An ANN with parameters w produces the output f_w(u) for input u; the error function E(U, w) is

E(U, w) = (1/2) Σ_{i=1}^{2} (y_i − f_w(u_i))^2.
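A minimal sketch of this error computation, assuming a tiny feed-forward network f_w with one tanh hidden layer and two hypothetical training pairs (all numbers are illustrative), could look as follows:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-pair training set U = {(u1, y1), (u2, y2)}
U = [(np.array([0.5, -1.0]), 0.2),
     (np.array([1.5,  2.0]), 0.9)]

# Randomly initialized parameters w of a small feed-forward network (assumption)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=3), rng.normal()

def f_w(u):
    """Forward pass of the ANN: one tanh hidden layer, linear output."""
    h = np.tanh(W1 @ u + b1)
    return W2 @ h + b2

def error(U):
    """Mean squared error E(U, w) over the training pairs."""
    return 0.5 * sum((y - f_w(u)) ** 2 for u, y in U)

print(f"E(U, w) = {error(U):.4f}")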
PSO is a bio-inspired algorithm that can be used to find the best solution within the search space. It differs from other optimization methods in that it only needs the objective function and does not depend on the objective function's gradient or any other differential form. Moreover, it has remarkably few hyper-parameters [14].
The typical application of PSO is to find the maximum or minimum of a function g(u) defined over a multidimensional vector space. If g(u) is a function that takes a vector parameter, such as the coordinates (u, v) in a two-dimensional plane, and returns a real value that can take on almost any value in the search space—for instance, g(u) could compute the altitude of any point on the plane—then PSO can be applied. The PSO algorithm determines the parameter u that yields the minimal value of g(u). The direction of travel is always towards the global optimum. Every particle dynamically adjusts its travelling velocity based on its own prior flying experience as well as that of its group members [15,16]. First, each particle keeps track of its own best result, referred to as its personal best or p_best. Second, the best value over the entire search space, the most promising value found by any particle, is the global best or g_best. Each particle adjusts its position based on its current position vector, its velocity vector at the current time step, the distance between its current position p_current and p_best, and, lastly, the distance between its current position p_current and g_best [17].
The formula that updates the velocity of the particles is

v_i(t + 1) = W · v_i(t) + c1 · r1 · (Pb_i − x_i(t)) + c2 · r2 · (gb − x_i(t)),   i = 1, …, A,

where v_i denotes the velocity of particle (agent) i, A denotes the number of agents in the population, W the inertia weight, c1 the cognitive constant, c2 the social constant, x_i the position of the particle or agent, Pb_i its personal best, gb the global best, and r1 and r2 are random numbers drawn uniformly from [0, 1].
The movement of the particles is updated using the following formula:

x_i(t + 1) = x_i(t) + v_i(t + 1).
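Taken together, the two update rules can be sketched in a few lines of code; the swarm size, search-space dimension, and coefficient values below are illustrative assumptions (the coefficients match those listed later in Algorithm 1).

import numpy as np

rng = np.random.default_rng(42)

n_particles, n_dims = 150, 10          # swarm size and search-space dimension (assumptions)
W, c1, c2 = 0.9, 0.5, 0.3              # inertia, cognitive, and social coefficients

x  = rng.uniform(-1, 1, (n_particles, n_dims))   # positions
v  = np.zeros((n_particles, n_dims))             # velocities
pb = x.copy()                                    # personal bests
gb = x[0].copy()                                 # global best (placeholder)

def pso_step(x, v, pb, gb):
    """One velocity and position update for the whole swarm."""
    r1 = rng.random((n_particles, n_dims))
    r2 = rng.random((n_particles, n_dims))
    v_new = W * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
    x_new = x + v_new
    return x_new, v_new

x, v = pso_step(x, v, pb, gb)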
The architecture of the neural network is determined by the number of neurons in the input layer, the number in the output layer, the number in each hidden layer, and the number of hidden layers. The weight and bias vectors are first filled with random values. Each particle in the PSO is assigned to a neuron in the NN [17,18]. The particles then move around randomly within their own neighborhoods until they reach a better position, where the sum of squared errors between the model predictions and the actual data values is minimized. This continues until the stopping criterion, in this case the number of iterations, is reached. At the end, an optimal weight and bias matrix is obtained.
Figure 1 shows the flow of training using PSO with an NN, and the pseudocode is given in Algorithm 1.
The performance of the models is measured using a few different scores. A true positive counts the cases in which the model correctly classifies a positive sample as positive, while a false negative counts the cases in which the model misclassifies a positive sample as negative. A false positive counts the cases in which the model incorrectly classifies a negative sample as positive, and a true negative counts the cases in which the model correctly classifies a negative sample as negative.
The Confusion Matrix (Table 1) helps visualize where the model becomes confused when distinguishing between two classes. It is a 2 × 2 matrix with the actual (ground truth) labels in the rows and the predicted labels in the columns, which provides a clear picture of the model's behaviour.
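A small sketch of how these four counts can be tallied from hypothetical label vectors is shown below.

import numpy as np

# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # positives correctly classified
fn = np.sum((y_true == 1) & (y_pred == 0))   # positives misclassified as negative
fp = np.sum((y_true == 0) & (y_pred == 1))   # negatives misclassified as positive
tn = np.sum((y_true == 0) & (y_pred == 0))   # negatives correctly classified

# 2 x 2 confusion matrix: rows = actual labels, columns = predicted labels
confusion = np.array([[tn, fp],
                      [fn, tp]])
print(confusion)
print(f"accuracy = {(tp + tn) / len(y_true):.3f}")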
Algorithm 1. Calculating weight and bias matrix using PSO.

c1 ← 0.5
c2 ← 0.3
w ← 0.9
nparticles ← 150
iterations ← 1000
weights ← random vector
bias ← random vector
N ← 0
while N < iterations
    popCount ← 0
    while popCount < nparticles
        res ← forwardPropagation()
        popCount ← popCount + 1
    end while
    error ← errorEstimate(nparticles, res)
    nparticles ← adjustParticlePosition(nparticles, error)
    N ← N + 1
end while
weights ← nparticles
bias ← nparticles
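For readers who prefer executable code, the following is a hedged Python rendering of the loop in Algorithm 1. The objective (sum of squared errors of a one-hidden-layer network on synthetic data), the flattening of all weights and biases into each particle's position, and the reduced particle and iteration counts are assumptions made to keep the sketch short and runnable, not the authors' exact implementation.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumption, stands in for the training set)
X = rng.uniform(-1, 1, (200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

n_in, n_hidden, n_out = 2, 5, 1
n_weights = n_in * n_hidden + n_hidden + n_hidden * n_out + n_out  # flattened weights + biases

def forward_propagation(params, X):
    """Forward pass of a one-hidden-layer network whose weights and biases are flattened in params."""
    i = 0
    W1 = params[i:i + n_in * n_hidden].reshape(n_in, n_hidden); i += n_in * n_hidden
    b1 = params[i:i + n_hidden]; i += n_hidden
    W2 = params[i:i + n_hidden * n_out].reshape(n_hidden, n_out); i += n_hidden * n_out
    b2 = params[i:i + n_out]
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()

def error_estimate(params):
    """Sum of squared errors between predictions and targets."""
    return np.sum((y - forward_propagation(params, X)) ** 2)

# PSO hyper-parameters (reduced from Algorithm 1's 150 particles / 1000 iterations)
c1, c2, w = 0.5, 0.3, 0.9
n_particles, iterations = 30, 200

pos = rng.normal(size=(n_particles, n_weights))      # each particle = one weight/bias vector
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_err = np.array([error_estimate(p) for p in pos])
gbest = pbest[np.argmin(pbest_err)].copy()

for _ in range(iterations):
    r1 = rng.random((n_particles, n_weights))
    r2 = rng.random((n_particles, n_weights))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    errors = np.array([error_estimate(p) for p in pos])
    improved = errors < pbest_err
    pbest[improved], pbest_err[improved] = pos[improved], errors[improved]
    gbest = pbest[np.argmin(pbest_err)].copy()

print(f"best sum of squared errors after PSO training: {pbest_err.min():.4f}")

In this sketch each particle's position is exactly the flattened weight-and-bias vector, so the error estimate plays the role of the PSO fitness function, mirroring the forwardPropagation, errorEstimate, and adjustParticlePosition steps of Algorithm 1.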