**4. Methodology**

Feature analysis and classification are performed using the Gaussian radial Boltzmann with Markov encoder model, as follows:

The mixture density is the weighted sum of the densities of the *M* component parts. The *i*th component density is denoted *p*(**x**; *θ<sub>i</sub>*), where *θ<sub>i</sub>* stands for the component parameters. We use *π<sub>i</sub>* to denote the weighting factor, or "mixing proportion", of the *i*th component, subject to the constraints *π<sub>i</sub>* ≥ 0 and ∑<sub>*i*=1</sub><sup>*M*</sup> *π<sub>i</sub>* = 1; *π<sub>i</sub>* is therefore the probability that a data sample belongs to the *i*th mixture component. An *M*-component mixture density is then defined by Equations (15) and (16),

$$s(\mathbf{x}) = \sum\_{i=1}^{M} \pi\_i\, p(\mathbf{x}; \theta\_i) \tag{15}$$

$$s(\mathbf{x}) = \sum\_{c=1}^{C} \pi\_{c} f\_{c}(\mathbf{x} \mid \boldsymbol{\theta}\_{c}) \tag{16}$$

The mixture model has a vector of parameters, *θ* = {*θ*<sub>1</sub>, ..., *θ<sub>M</sub>*, *π*<sub>1</sub>, ..., *π<sub>M</sub>*}.
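
For concreteness, the following minimal NumPy sketch evaluates such a mixture density, assuming Gaussian component densities of the form given later in Equations (19)–(21); the parameter values shown are purely illustrative.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density, cf. Equations (19)-(21)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

def mixture_density(x, pis, mus, covs):
    """Weighted sum of the M component densities, Equations (15)-(16)."""
    return sum(pi * gaussian_pdf(x, mu, cov)
               for pi, mu, cov in zip(pis, mus, covs))

# Illustrative 2-component mixture in two dimensions (hypothetical parameters).
pis = [0.6, 0.4]                      # mixing proportions, sum to 1
mus = [np.zeros(2), np.ones(2)]       # component means
covs = [np.eye(2), 0.5 * np.eye(2)]   # component covariances
print(mixture_density(np.array([0.5, 0.5]), pis, mus, covs))
```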

Mixture models treat the hidden variable as a latent variable *z*: a discrete *M*-dimensional indicator vector whose components satisfy *z<sub>k</sub>* ∈ {0, 1} and ∑<sub>*k*=1</sub><sup>*M*</sup> *z<sub>k</sub>* = 1. The joint distribution *p*(**x**, *z*) is defined through a marginal distribution *p*(*z*) and a conditional distribution *p*(**x** | *z*), i.e., as in Equation (17),

$$p(z, \mathbf{x}) = p(z)p(\mathbf{x} \mid z) \tag{17}$$

The mixing coefficients *π<sub>k</sub>* specify the marginal distribution over *z*, as illustrated in Equation (18),

$$p(z\_k = 1) = \pi\_k \tag{18}$$

Equations (19) and (20) define the probability density function of **x**.

$$p(\mathbf{x} \mid \boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k) = \frac{1}{(2\pi)^{N/2} \left| \boldsymbol{\Sigma}\_k \right|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}\_k)^{T} \boldsymbol{\Sigma}\_k^{-1} (\mathbf{x} - \boldsymbol{\mu}\_k)\right) \tag{19}$$

$$f\_{c}(\mathbf{x} \mid \boldsymbol{\mu}\_{c}, \boldsymbol{\Sigma}\_{c}) = \frac{1}{(2\pi)^{N/2} |\boldsymbol{\Sigma}\_{c}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}\_{c})^{t} \boldsymbol{\Sigma}\_{c}^{-1} (\mathbf{x} - \boldsymbol{\mu}\_{c})\right) \tag{20}$$

where *μ<sub>k</sub>* = (*μ*<sub>*k*1</sub>, ..., *μ<sub>kN</sub>*) is the vector of means and Σ<sub>*k*</sub> is the covariance matrix. A Gaussian mixture distribution can then be represented as a linear superposition of Gaussians, as in Equations (21)–(23).

$$p(\mathbf{x}) = \sum\_{k=1}^{K} \pi\_k p(\mathbf{x} \mid \boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k) \tag{21}$$

$$\begin{aligned} \hat{\pi}\_{c} &= \frac{n\_{c}}{n}, \qquad \hat{\boldsymbol{\mu}}\_{c} = \frac{1}{n\_{c}} \sum\_{\{i \mid y\_{i} = c\}} \mathbf{x}\_{i}, \qquad p(\mathbf{x} \mid z\_{k} = 1) = p(\mathbf{x} \mid \boldsymbol{\mu}\_{k}, \boldsymbol{\Sigma}\_{k}),\\ \hat{\boldsymbol{\Sigma}}\_{c} &= \frac{1}{n\_{c} - 1} \sum\_{\{i \mid y\_{i} = c\}} (\mathbf{x}\_{i} - \hat{\boldsymbol{\mu}}\_{c})(\mathbf{x}\_{i} - \hat{\boldsymbol{\mu}}\_{c})^{t} \end{aligned} \tag{22}$$
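
The class-wise estimates in Equation (22) can be computed directly from labeled feature vectors. The sketch below assumes a data matrix `X` with one feature vector per row and integer class labels `y`; both names are illustrative.

```python
import numpy as np

def fit_class_gaussians(X, y):
    """Per-class estimates of pi_c, mu_c and Sigma_c, Equation (22).

    X is an (n, d) array of feature vectors and y an (n,) array of labels.
    """
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = len(Xc)
        pi_c = n_c / n                                      # mixing proportion
        mu_c = Xc.mean(axis=0)                              # class mean
        sigma_c = (Xc - mu_c).T @ (Xc - mu_c) / (n_c - 1)   # class covariance
        params[c] = (pi_c, mu_c, sigma_c)
    return params
```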

The conditional distribution of **x** for a specific value of *z* is a Gaussian, according to Equation (23):

$$p(\mathbf{x} \mid \mathbf{z}) = \prod\_{k=1}^{K} p(\mathbf{x} \mid \mu\_k, \Sigma\_k)^{z\_k} \tag{23}$$

The marginal distribution of **x** is obtained by summing the joint distribution over all possible states of *z*, which yields Equation (24).

$$p(\mathbf{x}) = \sum\_{z} p(z)p(\mathbf{x} \mid z) = \sum\_{k=1}^{K} \pi\_k p(\mathbf{x} \mid \mu\_k, \Sigma\_k) \tag{24}$$

An important derived quantity is the posterior probability of a mixture component given a specific data vector, as indicated in Equation (25):

$$p(z\_{nk}) = \frac{\pi\_k \mathcal{N}(\mathbf{x}\_n \mid \boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k)}{\sum\_{j=1}^K \pi\_j \mathcal{N}(\mathbf{x}\_n \mid \boldsymbol{\mu}\_j, \boldsymbol{\Sigma}\_j)} = \frac{p(z\_k = 1)p(\mathbf{x} \mid z\_k = 1)}{\sum\_{j=1}^K p(z\_j = 1)p(\mathbf{x} \mid z\_j = 1)}\tag{25}$$
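
As a sketch, the posterior of Equation (25) can be evaluated by weighting each component density by its mixing coefficient and normalizing; the implementation below assumes Gaussian components and uses SciPy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x_n, pis, mus, covs):
    """Posterior probability of each mixture component for x_n, Equation (25)."""
    weighted = np.array([
        pi * multivariate_normal.pdf(x_n, mean=mu, cov=cov)
        for pi, mu, cov in zip(pis, mus, covs)
    ])
    return weighted / weighted.sum()   # one entry per component, sums to 1
```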

RBMs are energy-based approaches whose learning objective is maximum likelihood. Equation (26) defines the joint energy of the hidden units (*e*) and visible units (*f*).

$$\operatorname{En}(\mathbf{e}, \mathbf{f}; \theta) = -\sum\_{i\mathbf{j}} \mathcal{W}\_{i\mathbf{j}} \mathbf{f}\_i \mathbf{e}\_{\mathbf{j}} - \sum\_i \mathbf{b}\_i \mathbf{f}\_i - \sum\_{\mathbf{j}} \mathbf{a}\_{\mathbf{j}} \mathbf{e}\_{\mathbf{j}}.\tag{26}$$

Here *θ* = {*W*, *a*, *b*} denotes the model parameters. Using Equation (27), one can determine the joint probability of *f* and *e*.

$$P\_{\theta}(\mathbf{f}, \mathbf{e}) = \frac{1}{Z(\theta)} \exp(-\operatorname{En}(\mathbf{f}, \mathbf{e}; \theta)). \tag{27}$$

In this context, the partition function is denoted by *Z(θ).* The previous equation can be rewritten as Equation (28).

$$P\_{\theta}(\mathbf{f}, \mathbf{e}) = \frac{1}{Z(\theta)} \exp\left(\sum\_{i=1}^{D} \sum\_{j=1}^{F} W\_{ij} \mathbf{f}\_i \mathbf{e}\_j + \sum\_{i=1}^{D} \mathbf{f}\_i \mathbf{b}\_i + \sum\_{j=1}^{F} \mathbf{e}\_j \mathbf{a}\_j\right) \tag{28}$$
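
A minimal sketch of the energy in Equation (26) and the corresponding unnormalized joint probability of Equations (27)–(28) is given below; dividing by the partition function *Z*(*θ*) (not computed here) yields *P<sub>θ</sub>*(*f*, *e*).

```python
import numpy as np

def rbm_energy(f, e, W, b, a):
    """Energy of visible units f and hidden units e, Equation (26)."""
    return -(f @ W @ e) - (b @ f) - (a @ e)

def unnormalised_joint(f, e, W, b, a):
    """Numerator of Equations (27)-(28); dividing by Z(theta) gives P(f, e)."""
    return np.exp(-rbm_energy(f, e, W, b, a))
```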

Maximizing the probability function *P*(*f*) is the goal. Marginalizing *P*(*f*, *e*) over the hidden units makes it easy to calculate *P*(*f*) by Equation (29):

$$P\_{\theta}(\mathbf{f}) = \frac{1}{Z(\theta)} \sum\_{\mathbf{e}} \exp\left[\mathbf{f}^{T} \mathbf{W} \mathbf{e} + \mathbf{a}^{T} \mathbf{e} + \mathbf{b}^{T} \mathbf{f}\right] \tag{29}$$

The RBM parameters are derived by maximizing *P*(*f*). Equivalently, by optimizing log *P*(*f*) = *L*(*θ*), we obtain the maximum of *P*(*f*) using Equation (30):

$$\begin{array}{l} L(\theta) = \frac{1}{N} \sum\_{n=1}^{N} \log P\_{\theta} \left( \mathbf{f}^{(n)} \right) \\ \frac{\partial L(\theta)}{\partial \mathbf{W}\_{ij}} = \frac{1}{N} \sum\_{n=1}^{N} \frac{\partial}{\partial \mathbf{W}\_{ij}} \log \left( \sum\_{\mathbf{e}} \exp \left[ \mathbf{f}^{(n)T} \mathbf{W} \mathbf{e} + \mathbf{a}^{T} \mathbf{e} + \mathbf{b}^{T} \mathbf{f}^{(n)} \right] \right) - \frac{\partial}{\partial \mathbf{W}\_{ij}} \log Z(\theta) = E\_{P\_{\text{data}}} \left[ \mathbf{f}\_{i} \mathbf{e}\_{j} \right] - E\_{P\_{\theta}} \left[ \mathbf{f}\_{i} \mathbf{e}\_{j} \right] \end{array} \tag{30}$$

Stochastic gradient descent is used to maximize *L*(*θ*), and Equation (30) gives the derivative of *L*(*θ*) with respect to *W*.

The first term of this formula is easy to evaluate: the values of *f<sub>i</sub>e<sub>j</sub>* are averaged across the dataset. The second term, however, is computationally challenging, since it comprises all 2<sup>|*f*|+|*e*|</sup> possible configurations of *f* and *e*. This second term is given by Equation (31).

$$\sum\_{\mathbf{f},\mathbf{e}} \mathbf{f}\_i \mathbf{e}\_j P\_\theta(\mathbf{f}, \mathbf{e}) \tag{31}$$
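
To illustrate why this term is intractable, the following sketch computes E<sub>*P<sub>θ</sub>*</sub>[*f<sub>i</sub>e<sub>j</sub>*] exactly by enumerating all 2<sup>|*f*|+|*e*|</sup> configurations of a toy binary RBM; this brute-force evaluation is only feasible for a handful of units.

```python
import itertools
import numpy as np

def exact_model_expectation(W, b, a):
    """E_{P_theta}[f_i e_j] by brute force over all 2^(D+F) states, Equation (31)."""
    D, F = W.shape
    stats, weights = [], []
    for f_bits in itertools.product([0, 1], repeat=D):
        for e_bits in itertools.product([0, 1], repeat=F):
            f, e = np.array(f_bits), np.array(e_bits)
            # unnormalised probability exp(-En(f, e; theta)), cf. Equation (28)
            weights.append(np.exp(f @ W @ e + b @ f + a @ e))
            stats.append(np.outer(f, e))
    weights = np.array(weights) / np.sum(weights)   # normalise by Z(theta)
    return sum(w * s for w, s in zip(weights, stats))

# Tiny example: 3 visible and 2 hidden units already require 2**5 = 32 terms.
rng = np.random.default_rng(0)
print(exact_model_expectation(rng.normal(size=(3, 2)),
                              rng.normal(size=3), rng.normal(size=2)))
```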

Monte Carlo simulations are used to estimate this gradient, as shown in Equation (32):

$$\begin{array}{c} \Delta \mathbf{b}\_{i} = \mathbf{f}\_{i}^{(0)} - \mathbf{f}\_{i}^{(k)} \\ \Delta \mathbf{a}\_{j} = P\left(\mathbf{e}\_{j} = 1 \mid \mathbf{f}^{(0)}\right) - P\left(\mathbf{e}\_{j} = 1 \mid \mathbf{f}^{(k)}\right) \\ \Delta \mathbf{W}\_{ij} = P\left(\mathbf{e}\_{j} = 1 \mid \mathbf{f}^{(0)}\right) \mathbf{f}\_{i}^{(0)} - P\left(\mathbf{e}\_{j} = 1 \mid \mathbf{f}^{(k)}\right) \mathbf{f}\_{i}^{(k)} \end{array} \tag{32}$$

where *f<sub>i</sub>*<sup>(0)</sup> is the observed sample value and *f<sub>i</sub>*<sup>(*k*)</sup> is a sample that satisfies the distribution *P*(*f*), obtained after *k* steps of sampling. Lastly, Equation (33) provides the parameter update rule.

$$\begin{array}{l} \mathbf{b}\_{i} = \mathbf{b}\_{i} + \Delta \mathbf{b}\_{i} \\ \mathbf{a}\_{j} = \mathbf{a}\_{j} + \Delta \mathbf{a}\_{j} \\ \mathbf{W}\_{ij} = \mathbf{W}\_{ij} + \Delta \mathbf{W}\_{ij} \end{array} \tag{33}$$
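
The following sketch performs one such update for a binary RBM using *k* steps of Gibbs sampling (CD-*k*), following Equations (32) and (33). The sigmoid conditionals and the learning rate `lr` are standard assumptions not spelled out in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(f0, W, a, b, k=1, lr=0.1, rng=None):
    """One CD-k parameter update following Equations (32)-(33).

    f0 : binary visible vector f^(0); W : (D, F) weight matrix;
    a : hidden biases; b : visible biases.
    """
    rng = rng or np.random.default_rng(0)
    # positive phase: P(e_j = 1 | f^(0))
    p_e0 = sigmoid(a + f0 @ W)
    # k steps of Gibbs sampling to obtain f^(k)
    fk = f0
    for _ in range(k):
        e = (rng.random(a.shape) < sigmoid(a + fk @ W)).astype(float)
        fk = (rng.random(b.shape) < sigmoid(b + W @ e)).astype(float)
    p_ek = sigmoid(a + fk @ W)
    # gradient estimates, Equation (32)
    db = f0 - fk                                  # visible-bias update
    da = p_e0 - p_ek                              # hidden-bias update
    dW = np.outer(f0, p_e0) - np.outer(fk, p_ek)  # weight update
    # parameter updates, Equation (33); lr is a hypothetical step size
    return W + lr * dW, a + lr * da, b + lr * db
```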

Equation (34) gives the joint probability distribution over the visible units and the two hidden layers.

$$\begin{aligned} P\left(V, e^{(1)}, e^{(2)}\right) &= \frac{1}{Z(\theta)} \exp\left(-E\left(V, e^{(1)}, e^{(2)}; \theta\right)\right) \\ E\left(V, e^{(1)}, e^{(2)}; \theta\right) &= -V^T \mathcal{W}^{(1)} e^{(1)} - V^T \mathcal{W}^{(2)} e^{(2)} + b \end{aligned} \tag{34}$$

Encoders and decoders are essential elements of the design. Both the encoder and the decoder are implemented as standard matrix multiplications, followed by an activation that acts as a normalizing function. After the weights and biases of the autoencoder are adjusted, network training proceeds according to Equation (35).

$$\begin{aligned} e^{(0)} &= \sigma\left(b^{(0)} + V^{T} W^{(0)}\right) \\ e^{(n)} &= \sigma\left(b^{(n)} + e^{(n-1)T} W^{(n)}\right) \end{aligned} \tag{35}$$

where *n* = 1, 2, 3, . . . , *m*.
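
A minimal sketch of the layer-wise encoder pass in Equation (35) is shown below, assuming a sigmoid activation and lists of per-layer weights and biases; the names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(V, weights, biases):
    """Layer-wise encoder pass of Equation (35).

    weights[n] corresponds to W^(n) and biases[n] to b^(n); the visible
    vector V feeds the first layer and each later layer consumes the
    previous layer's code e^(n-1).
    """
    e = sigmoid(biases[0] + V @ weights[0])       # e^(0)
    for b_n, W_n in zip(biases[1:], weights[1:]):
        e = sigmoid(b_n + e @ W_n)                # e^(n), n = 1, ..., m
    return e
```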

Consider training on an HSI datacube with a two-hidden-layer model using Equations (36) and (37).

$$P\left(V\_i = 1; e^{(1)}, e^{(2)}\right) = aV^T\mathcal{W}\_1^{(1)} + aV^T\mathcal{W}\_1^{(2)}\tag{36}$$

When n = 1:

$$\begin{aligned} P\left(V\_1 = 1; e^{(1)}, e^{(2)}\right) &= aV^T \mathcal{W}\_1^{(1)} + aV^T \mathcal{W}\_1^{(2)}\\ P\left(V\_2 = 1; e^{(1)}, e^{(2)}\right) &= aV^T \mathcal{W}\_2^{(1)} + aV^T \mathcal{W}\_2^{(2)} \end{aligned} \tag{37}$$

The mean-field value is represented by Equation (38):

$$P(\mathbf{x}) = \sum\_{e^{(1)}, e^{(2)}} Q\left(e^{(1)}, e^{(2)}\right) \log\left(\frac{Q\left(e^{(1)}, e^{(2)}\right)}{P\left(e^{(1)}, e^{(2)}\right)}\right) \tag{38}$$

where the Gibbs energy is represented by Equation (39):

$$E(\mathbf{x}) = \frac{1}{Z(D)} \exp(-P(\mathbf{x})) \tag{39}$$

Equations (40) and (41) give the new weight values in terms of the weight updates. Each layer is assigned a bias of *b* = 0.

$$\Delta e\_i^{(1)} = \alpha \sum\_i V\_i \mathcal{W}^{(1)} \tag{40}$$

$$\Delta e\_i^{(2)} = \alpha \sum\_i V\_i \mathcal{W}^{(2)} \tag{41}$$

The convolution filters and the weights of the fully connected layers are the two sets of model parameters optimized with the gradient descent approach. Since the final layer has a significant impact on classification outcomes, it is essential that the image be classified into the correct category; this is achieved by properly linking the weights from the prior layers. Here, in order to improve classification accuracy, the training of the final weight vector is optimized using a newly created modified whale optimization method. The number of search agents is limited to 50, the maximum number of iterations is limited to 100, and the control parameter (vector *a*) is varied linearly over [0, 2].
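
The modified whale optimization method is specific to this work and is not reproduced here; the sketch below only illustrates the stated settings (50 search agents, 100 iterations, control parameter *a* varied linearly over [0, 2]) using the standard WOA encircling/spiral updates with a placeholder fitness function.

```python
import numpy as np

def woa_sketch(fitness, dim, n_agents=50, max_iter=100, seed=0):
    """Standard whale-optimization loop with the settings stated above."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, (n_agents, dim))          # search agents
    best = min(X, key=fitness).copy()
    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter                     # varies linearly over [0, 2]
        for i in range(n_agents):
            r, p, l = rng.random(dim), rng.random(), rng.uniform(-1, 1)
            A, C = 2.0 * a * r - a, 2.0 * rng.random(dim)
            if p < 0.5:
                # encircling prey (|A| < 1) or random exploration (|A| >= 1)
                ref = best if np.all(np.abs(A) < 1) else X[rng.integers(n_agents)]
                X[i] = ref - A * np.abs(C * ref - X[i])
            else:
                # spiral position update around the current best agent
                X[i] = np.abs(best - X[i]) * np.exp(l) * np.cos(2 * np.pi * l) + best
        candidate = min(X, key=fitness)
        if fitness(candidate) < fitness(best):
            best = candidate.copy()
    return best

# Placeholder fitness: squared norm of the weight vector (hypothetical).
print(woa_sketch(lambda w: float(np.sum(w ** 2)), dim=5))
```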
