#### *2.2. Genetic Algorithm*

Biological variation and its basic processes were clarified by Darwin's (2002) evolutionary theory [30]. Natural selection is fundamental to what is often referred to as the macroscopic understanding of evolution. In an environment that can sustain only a finite number of individuals, and given the basic tendency of organisms to multiply, selection is inevitable if the population is not to grow unchecked [31,32]. Evolution favors individuals that compete most successfully for the available resources; in other words, those that are better suited, or adapted, to their environment, an idea commonly summarized as survival of the fittest [33].

Selection on the basis of competition is one of the two pillars of the mechanism of evolution. The other main influence comes from phenotypic variation within populations. The phenotype comprises an individual's physical and behavioral characteristics, which determine its fitness with respect to the surrounding environment. Each individual represents a specific combination of phenotypic characteristics that is evaluated by the environment. If evaluated favorably, these characteristics are inherited by the individual's offspring; otherwise, they die out. Charles Darwin's insight was that small, spontaneous changes in phenotype occur across generations [34–36].

Through such mutations, new combinations of phenotypes arise and are assessed. This is the fundamental basis of the genetic algorithm: given a population of individuals, environmental constraints lead to natural selection and survival of the fittest, which increases the fitness of the population over time. A random collection of candidates is first generated [37]. Based on an abstract fitness metric, the best candidates are then selected, for example by roulette-wheel selection, to seed the next generation [38]. Crossover and mutation give rise to a number of new offspring that compete, on the basis of their fitness, with older members of the population for a place in the next generation. The process repeats until an individual of adequate quality is found or a previously determined computational budget is exceeded [39,40]. In line with this, Algorithm 1 shows the scheme of the genetic algorithm. The scheme coincides with the generate-and-test algorithm type: the fitness function constitutes a heuristic estimate of solution quality, and the selection, crossover, and mutation operators guide the search. The genetic algorithm has many characteristics that can support this generate-and-test process.
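The loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the bit-string encoding, the OneMax-style fitness function, and all parameter values are assumptions chosen only for demonstration.

```python
import random

def genetic_algorithm(fitness, n_genes=8, pop_size=20, generations=50,
                      crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Generate-and-test loop: select parents by roulette wheel,
    recombine them, mutate the offspring, and replace the population."""
    rng = random.Random(seed)
    # Random initial collection of bit-string candidates.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        total = sum(scores)

        def select():
            # Roulette wheel: selection probability proportional to fitness.
            r = rng.uniform(0, total)
            acc = 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]

        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = select(), select()
            # Single-point crossover.
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_genes - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation.
            child = [g ^ 1 if rng.random() < mutation_rate else g for g in child]
            offspring.append(child)
        pop = offspring
    return max(pop, key=fitness)

# Toy fitness: number of 1-bits (plus a tiny offset so the wheel never
# sees an all-zero total); the optimum is the all-ones string.
best = genetic_algorithm(fitness=lambda ind: sum(ind) + 1e-9)
```

Termination here is a fixed generation budget; a quality threshold on the best fitness would serve equally well as the stopping condition described above.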


#### *2.3. Cascade Neural Network Genetic Algorithm*

Backpropagation training algorithms based on traditional optimization methods, such as the conjugate gradient and Newton's method, exist in different variants. Plain gradient descent is the simplest and among the slowest of these; the conjugate gradient algorithm and Newton's method usually speed it up [41,42]. Throughout this study, we used genetic algorithms. Each neuron weight between the hidden layer and the output layer must be updated, and the weights of the neurons between the input and the hidden layer must be adjusted as well [43]. The weight change between the hidden and output layers is specified in Equation (8), where *ϕ* denotes the activation function.

$$\frac{\partial E\_r}{\partial w\_{jk}^{H-O}} = \frac{\partial E\_r}{\partial net\_k} \cdot \frac{\partial net\_k}{\partial w\_{jk}^{H-O}}$$

$$\frac{\partial E\_r}{\partial w\_{jk}^{H-O}} = -(t\_k - y\_k) \cdot \phi\_2'(net\_k) \cdot \theta\_j \tag{8}$$

The weights of the neurons between the input and hidden layers are updated as represented in Equation (9).

$$\frac{\partial E\_r}{\partial w\_{ij}^{I-H}} = \frac{\partial E\_r}{\partial \theta\_j} \cdot \frac{\partial \theta\_j}{\partial net\_j} \cdot \frac{\partial net\_j}{\partial w\_{ij}^{I-H}}$$

$$\frac{\partial E\_r}{\partial w\_{ij}^{I-H}} = \frac{\partial}{\partial \theta\_j}\left[\frac{1}{2} \sum\_{k=1}^{N\_O} (t\_k - y\_k)^2\right] \cdot \frac{\partial \theta\_j}{\partial net\_j} \cdot \frac{\partial net\_j}{\partial w\_{ij}^{I-H}}$$

$$\frac{\partial E\_r}{\partial w\_{ij}^{I-H}} = \left[-\sum\_{k=1}^{N\_O}\left(t\_k - y\_k\right) \cdot \frac{\partial y\_k}{\partial \theta\_j}\right] \cdot \frac{\partial \theta\_j}{\partial net\_j} \cdot \frac{\partial net\_j}{\partial w\_{ij}^{I-H}}$$

$$\frac{\partial E\_r}{\partial w\_{ij}^{I-H}} = \left[-\sum\_{k=1}^{N\_O}\left(t\_k - y\_k\right) \cdot \frac{\partial y\_k}{\partial net\_k} \cdot \frac{\partial net\_k}{\partial \theta\_j}\right] \cdot \frac{\partial \theta\_j}{\partial net\_j} \cdot \frac{\partial net\_j}{\partial w\_{ij}^{I-H}}$$

$$\frac{\partial E\_r}{\partial w\_{ij}^{I-H}} = -\sum\_{k=1}^{N\_0} \left[\left(t\_k - y\_k\right) \cdot \phi\_2' \left(net\_k\right) \cdot w^{H-O}\_{jk}\right] \cdot \phi\_1' \left(net\_j\right) \cdot x\_i \tag{9}$$

With backpropagation, the input data are repeatedly presented to the neural network. With each presentation, the output of the neural network is compared to the desired output, and the error is computed. This error is then backpropagated through the neural network and used to adjust the weights such that the error decreases with each iteration; the neural network thus gets closer and closer to producing the desired output, represented in Equation (10).

$$w(h+1) = w(h) + \Delta w(h)\tag{10}$$
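A minimal sketch of these update rules for a single-hidden-layer network, assuming logistic (sigmoid) activations in both layers so that *ϕ*′(*net*) = *ϕ*(*net*)(1 − *ϕ*(*net*)). The layer sizes, learning rate, and training pattern are illustrative assumptions, not values from this study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, w_ih, w_ho, lr=0.5):
    """One backpropagation update following Equations (8)-(10)."""
    # Forward pass.
    net_h = w_ih @ x          # net input to hidden units
    theta = sigmoid(net_h)    # hidden outputs theta_j
    net_o = w_ho @ theta      # net input to output units
    y = sigmoid(net_o)        # network outputs y_k

    # Equation (8): dE/dw^{H-O}_{jk} = -(t_k - y_k) * phi'_2(net_k) * theta_j
    delta_o = (t - y) * y * (1.0 - y)     # equals -dE/dnet_k
    grad_ho = -np.outer(delta_o, theta)   # dE/dw^{H-O}

    # Equation (9): error propagated back through w^{H-O} to the hidden layer.
    delta_h = (w_ho.T @ delta_o) * theta * (1.0 - theta)
    grad_ih = -np.outer(delta_h, x)       # dE/dw^{I-H}

    # Equation (10): w(h+1) = w(h) + Delta w(h), with Delta w = -lr * gradient.
    w_ih = w_ih - lr * grad_ih
    w_ho = w_ho - lr * grad_ho
    return w_ih, w_ho, 0.5 * np.sum((t - y) ** 2)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])               # one input pattern
t = np.array([1.0])                    # desired output
w_ih = rng.normal(0, 0.5, (3, 2))      # 2 inputs -> 3 hidden units
w_ho = rng.normal(0, 0.5, (1, 3))      # 3 hidden -> 1 output
errors = []
for _ in range(200):
    w_ih, w_ho, e = backprop_step(x, t, w_ih, w_ho)
    errors.append(e)
```

Repeated presentation of the input drives the error down at each iteration, as described above, so the network output approaches the desired output.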

Algorithm 2 shows the cascade neural network function. In this context, backpropagation repeatedly presents each input datum to the neural network; at every presentation, the output of the network is compared to the requested output and the error is computed. This error is fed back through the network and used to update the weights so as to reduce the error at each iteration, while the genetic algorithm generates new candidate networks for the next generation.


#### **3. Simulation and Results**

#### *3.1. Construction of VAR-Cascade*

There exist few guidelines for building a neural network model for time series. One approach considers the time series as a nonlinear function of several past observations and random errors. Since air pollution data are known to form nonlinear time series, we selected this approach as a benchmark for forecasting. Equation (11) represents this class of time series models:

$$y\_t = f\left[ (z\_{t-1}, z\_{t-2}, \dots, z\_{t-m}), \ (e\_{t-1}, e\_{t-2}, \dots, e\_{t-n}) \right] \tag{11}$$

where *f* is a nonlinear function determined by the neural network, *zt* = (1 − *B*)*<sup>d</sup> yt*, and *d* represents the order of differencing. The residuals at time *t* are denoted *et*, and *m* and *n* are integers. As Equation (12) shows, the VAR model is first fitted in order to generate the residuals *et*; a neural network is then used to model the nonlinear relations in the residuals together with the linear relations in the original series [22,44,45].

$$z\_t = w\_0 + \sum\_{j=1}^{Q} w\_j \cdot g\left(w\_{0j} + \sum\_{i=1}^{p} w\_{ij} \cdot z\_{t-i} + \sum\_{i=p+1}^{p+q} w\_{ij} \cdot e\_{t+p-i}\right) + \varepsilon\_t \tag{12}$$

Here, *wij* (*i* = 0, 1, 2, . . . , *p* + *q*; *j* = 1, 2, . . . , *Q*) and *wj* (*j* = 0, 1, 2, . . . , *Q*) are connection weights, *g* is the hidden-layer activation function, and *p*, *q*, *Q* are integers that must be determined in the design process of the cascade neural network. The values of *p* and *q* are determined by the underlying properties of the data. If the data consist only of nonlinear structure, then *q* can be set to 0, since the Box–Jenkins method is a linear model that cannot capture nonlinear interactions. Suboptimal methods may be used within a hybrid model, but such suboptimality does not change the functional characteristics of the hybrid approach [17,46–48].
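The two-stage construction can be sketched as follows: a linear autoregression is first fitted to obtain residuals, which are then fed, together with lagged observations, into the single-hidden-layer form of Equation (12). The toy series, lag orders, and random placeholder weights are assumptions for illustration only; in the actual model the weights would be determined by training (e.g., by the genetic algorithm), and the linear stage would be the full multivariate VAR rather than this scalar autoregression.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_forecast(z, e, p, q, w0, wj, w0j, w_ij):
    """Forward pass of Equation (12):
    z_t = w0 + sum_j wj * g(w0j + sum_i w_ij * u_i) + eps_t,
    where u = [z_{t-1..t-p}, e_{t-1..t-q}] and g is logistic."""
    u = np.concatenate([z[::-1][:p], e[::-1][:q]])
    return w0 + wj @ logistic(w0j + w_ij @ u)

rng = np.random.default_rng(1)
# Toy differenced series z_t (a stand-in for differenced pollution data).
z = np.sin(0.3 * np.arange(60)) + 0.1 * rng.normal(size=60)
n = len(z)

# Stage 1: fit a linear AR(p) model by least squares to obtain residuals e_t.
p, q, Q = 3, 2, 4
X = np.column_stack([z[p - i:n - i] for i in range(1, p + 1)])  # z_{t-1..t-p}
coef, *_ = np.linalg.lstsq(X, z[p:], rcond=None)
resid = z[p:] - X @ coef

# Stage 2: evaluate the nonlinear part. The weights here are random
# placeholders for values that a training procedure would determine.
w0, wj = 0.0, rng.normal(size=Q)
w0j, w_ij = rng.normal(size=Q), rng.normal(size=(Q, p + q))
z_next = hybrid_forecast(z, resid, p, q, w0, wj, w0j, w_ij)
```

Setting `q = 0` drops the residual inputs and reduces the model to a purely nonlinear autoregression on the lagged observations, matching the special case discussed above.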

The interpretation of time series requires quantifying the dynamic response of the vector to time shifts. The main feature of this method is to forecast future values using recent values of a variable, often referred to as lagged values [49]. Commonly, the most recent values influence the estimate of a future value most strongly [50,51]. In univariate time series analysis, a single scalar variable is frequently modeled by an autoregression, in which future values are estimated as a weighted sum of a preset number of lagged values. In the more general multivariate case, each variable depends on its own previous values as well as the previous values of the other variables [52–54].
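The multivariate case can be made concrete with a worked miniature: in a first-order vector autoregression, each variable's forecast is a weighted sum of the lagged values of all variables. The coefficient matrix, intercepts, and lagged values below are illustrative assumptions, not estimates from data.

```python
import numpy as np

# VAR(1) one-step forecast: y_t = c + A @ y_{t-1}.
# Row k of A holds the weights applied to every variable's lagged value
# when forecasting variable k.
A = np.array([[0.6, 0.2],      # variable 1 depends on both lagged series
              [0.1, 0.7]])     # variable 2 likewise
c = np.array([0.5, -0.3])      # intercepts
y_prev = np.array([2.0, 1.0])  # lagged values y_{t-1}

y_hat = c + A @ y_prev
# Variable 1: 0.5 + 0.6*2.0 + 0.2*1.0 = 1.9
# Variable 2: -0.3 + 0.1*2.0 + 0.7*1.0 = 0.6
```

Higher-order VAR(p) models extend this by stacking p such coefficient matrices, one per lag, with the most recent lags typically carrying the largest weights.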
