**1. Introduction**

A fundamental problem in integrated geophysical interpretation is the proper geological understanding of multiple inverted physical property images, which demands high-dimensional techniques for extracting geological information from modeled physical property images. The study of seismic attributes is an example of high-dimensional pattern recognition that seeks to extract relevant geological features from broadband seismic data [1–7]. Non-seismic data interpretations in mineral exploration involve a similar classification problem, where extracting 3D geological information from inverted potential field data is a computational challenge.

Several techniques are available for feature extraction in high-dimensional space, and this computational challenge can be seen as a dimensionality increase problem that
consequently leads to an overload of information [8–10]. For example, the spectral decomposition of geophysical images provides a robust way for feature extraction in the frequency domain [11–13]. However, the underlying patterns inside the high-dimensional images are only related to certain frequencies, and most decomposed spectra are redundant. The question is which frequency or frequency ranges are geologically relevant. This computational challenge can be seen as a dimensionality reduction problem, in which we try to extract the best representative components of high-dimensional images to facilitate visual interpretations and machine learning optimization [14,15].

Conventionally, high redundancy in the spectral domain is treated by applying principal component analysis (PCA) to extract the principal components of multiple images [14,16]. PCA is an unsupervised algorithm that linearly transforms multivariate datasets into new features called principal components (PCs) by maximizing the variance of the input data [14,16]. However, the uncorrelatedness of principal components is a weaker form of independence; therefore, PCA is not a good choice for separating overlapped features inside multiple physical property images [17]. Alternatively, independent component analysis (ICA), specifically the negentropy maximization approach, has been proposed as a robust tool for separating background lithotypes from alteration events [17,18]. Unlike PCA, ICA goes beyond the Gaussian assumption and incorporates higher-order statistics to find maximally independent latent features [14,16].

As expected, the spectra of the ICs still contain certain hidden overlapped features at different frequencies. A similar problem was addressed by Honório et al. [19] to visualize broadband seismic data at different frequencies. They used ICA to maximize the non-gaussianity of the decomposed seismic frequency volume and stacked the resulting ICs in the red, green, and blue (RGB) channels. The unified RGB image delineated a hidden sedimentary channel system [19]. However, this approach lacks automation and involves manually selecting and stacking representative ICs of the decomposed seismic data. The extraction of spectral features provides more detailed input features for a machine learning algorithm; however, redundant features decrease the performance of the learning process. Finding a way to detect and avoid these redundancies can improve machine learning predictions. Moreover, it is challenging to decide how much dimensionality reduction is necessary to represent the most prominent spectral features to machine learning algorithms.

Feature subset selection (FSS) provides a powerful tool for adaptive dimensionality reduction and feature learning [15,20–22]. FSS reduces a large set of input features to a relatively small and representative set that improves the accuracy of artificial neural network (ANN) prediction, whether by decreasing the learning time and model complexity or by increasing generalization capacity and classification accuracy [23]. Therefore, unlike conventional statistical dimensionality reduction methods, which are blind source separation algorithms, FSS applies an explicit selection criterion: the learning process adapts by jointly optimizing the neural synaptic weights and the number of selected input features [21–23].

This study's spectral feature subset selection (SFSS) procedure is cast as a multi-objective optimization problem in which both the ANN weights and the number of selected features are updated across iterations. Several heuristic algorithms are known to solve such multi-objective optimization problems [24–26]. We use a bi-objective genetic algorithm (GA) optimization to regularize the ANN weights and find the most adaptable combination of input spectral features in the supervised machine learning procedure. The advantage of this approach over conventional representative learning algorithms is that it can direct the learning of the ANN so that only a relevant set of input features is used during the learning process.

The main objective of this study is to present a robust algorithm to train an ANN with automatically selected spectral features to predict geological targets such as mafic to felsic 3D lithological variations and the 3D Au-grade distribution. The predicted outputs of the SFSS procedure are the estimated geological targets and the selected sets of spectral features related to each geological target. The study also aims to show that the selected features and predicted targets provide a unique way for interpreters to see which spatial and spectral features are prominent in relation to the geological targets.


#### **2. Materials and Methods**

We developed an algorithm that automatically selects the best representative components based on multi-objective machine learning optimization. Our proposed 3D SFSS algorithm performs dimensionality reduction and separates high-dimensional overlapped spectral features. However, SFSS implementation in this study demands several preprocessing stages, including 3D inversion of geophysical data and 3D feature extraction methods. The accuracy of the inverted physical property images is critically important in the subsequent feature extraction and selection procedures, and 3D inversion algorithms are dedicated to tackling this problem [27–30]. Even if the geophysical inversion provides the most accurate physical property distributions, each physical property image provides only partial information about the subsurface geology. This phenomenon can be seen as a linear mixing process, and ICA can help to reconstruct the hidden geological features from multiple physical property images [17,18]. The advantage of this method is that it can separate the host geology from alteration overprints in multidimensional physical property space [17].

Subsequently, spectral decomposition reveals many latent features that are not properly visible in the spatial domain. However, the statistical interdependence of the decomposed images is a major difficulty in extracting and selecting geological information. The large volumes of decomposed spectra also limit the spectral decomposition's efficiency and demand effective dimensionality reduction methods to prevent information overload. Our proposed SFSS algorithm solves these high-dimensional problems in five stages (Figure 1):

1. Preprocessing of the geophysical data through 3D inversion, interpolation, and filtering.
2. Dimensionality reduction and separation of the inverted physical property images through ICA (spatial feature extraction).
3. Spectral decomposition of the ICs through a continuous wavelet transform (CWT).
4. Dimensionality reduction and separation of raw spectral features through ICA (spectral feature extraction).
5. SFSS based on a supervised feature selection algorithm optimized by a multi-objective GA (spatiospectral feature selection).

**Figure 1.** Workflow of SFSS for 3D predictive modeling.

### *2.1. Geological and Geophysical Settings*

We test our method on an Au/Ag epithermal deposit known as the Newton deposit in British Columbia, Canada [28–30]. Regionally, the Late Cretaceous volcanic sequence is overlain by Miocene–Pliocene Chilcotin Group flood basalts and Quaternary glacial deposits, which are variably eroded to expose the older rocks. Quaternary glacial tills cover most of the Newton property at the deposit scale. Consequently, deposit-scale geological information has primarily been obtained from borehole samples [27–30]. A bedrock geology map of the property is compiled from the mapping of limited outcrops, drill cores, and cross-section interpretations (Figure 2a).


**Figure 2.** Geological and geophysical settings of the deposit: (**a**) Bedrock geology map at the deposit scale (sliced at an elevation of 1000 m). The light gray parts are undefined regions due to the lack of borehole information in the periphery of the deposit. Contours indicate topography. (**b**) Electrical resistivity model. (**c**) IP chargeability model. (**d**) Magnetic susceptibility model. The bedrock geology is overlaid on all geophysical models [27].

Three layered volcano-sedimentary sequences, from bottom to top, are mafic volcanic rocks (MVR), sedimentary rocks (SR), and felsic volcanic rocks (FVR). The layered rocks are intruded by felsic to intermediate porphyritic intrusions, including the monzonite (M), quartz-feldspar porphyry (QFP), feldspar biotite porphyry (FBP), and the younger dioritic intrusion (D). The epithermal mineralization associated with the Newton deposit lies at a depth of ~50 m to ~600 m. Au/Ag mineralization is mainly hosted within the felsic volcanic sequence [27–30]. The deposit is offset by the Newton Hill Fault (NHF), which displaces the geology and mineralization by ~300 m of normal dip-slip movement.

Geophysical images of the Newton deposit were obtained from 3D inversion of direct current (DC) resistivity, induced polarization (IP), and magnetic data sets [27,31]. In 2010, an 85 line-kilometer DC/IP survey was conducted in the Newton Hill area using a "pole-dipole" electrode configuration [27,31]. The DC/IP survey lines were placed at 200 m intervals in the east–west direction in the hydrothermal alteration zone. The dipole length was set to 100 m and 200 m, with a maximum of ten dipole separations. Apparent chargeability (mV/V) was measured by recording the voltage drop after current interruption. The total magnetic field data were collected from a helicopter flying at an average altitude of 155 m above the ground. The magnetic data were collected along N–S flight lines spaced 200 m apart, and E–W tie lines were flown approximately every 500 m to 2000 m. A total of 7071 line-kilometers of magnetic data were collected, covering an area of 1293 km<sup>2</sup>. The magnetometer sampling rate was 0.1 s, yielding a measurement interval of approximately 10 m along each profile, depending on the helicopter speed [27,31].

The interrelationships between physical properties enabled the cooperative inversion of multiple datasets. The chargeability image in this study was used to constrain the magnetic inversion. The constrained susceptibilities set aside the less consistent equivalent solutions and narrowed down to a final solution that matches well with the IP-DC resistivity inversion results, improving deposit-scale susceptibility imaging [27,31]. The value of this approach is that it does not necessarily require prior information for physical property coupling. The total number of cells for the forward DC/IP calculation is 288,000, and the size of each cell is 25 m × 50 m in the horizontal plane. The model cells are also set to grow exponentially with depth in 20 intervals up to 600 m. The unit cell size for cooperative susceptibility inversion is also set to X = 25 m, Y = 25 m, and Z = 10 m, with a total of 613,800 cells.

The recovered cooperative susceptibilities allowed us to locate a magnetite-destructive zone associated with porphyritic intrusions and felsic volcanic rocks (Au host rocks), although, in some places, the distinction between the magnetic highs of diorite and mafic volcanic rocks and the magnetic lows of felsic volcanic rocks (Au host rocks) and porphyritic intrusions is still challenging. The result of the cooperative magnetic-IP-DC resistivity inversion is also shown in Figure 2. The high magnetic cap imaged from the constrained inversion is related to a mixture of the mafic volcanic rocks (MV) and the younger dioritic intrusions (D). A low magnetic zone (LMZ) was positioned deeper in earlier unconstrained models, but the cooperative inversion has displaced it upward (about 500 m) and imaged it right under the high magnetic cap (magnetic susceptibilities > 0.04 SI). Several flanks of this LMZ locally reduce the magnetic susceptibilities to 0.015 SI. This structure is probably the response of the intermediate-to-felsic porphyritic intrusions (FIP) and interbeds of felsic volcanic rocks (FV) within the upper high magnetic cap [27,31].

### *2.2. Feature Extraction*

The feature extraction in this study is based on implementing several blind source separation methods, including PCA, ICA, and CWT, to recover latent features from a set of highly mixed images. Consider a mixing model where every image (g) is a mix of several hidden features (f) with different contributions to constructing the observed image, determined by a set of mixing weights *A*, in the form of a linear mixing model [18]:

$$\mathbf{g} = A\mathbf{f} \tag{1}$$

To recover the hidden features (f), one needs to find a separation matrix (*W*) that unmixes the observed images:

$$\mathbf{f} = W\mathbf{g} = A^{-1}\mathbf{g} \tag{2}$$

The separation matrix (*W*) can be estimated as an optimization problem by minimizing a cost function. By making a few assumptions about the statistical measures of the data sets, the feature extraction process iteratively reduces the effect of the mixing on the observed images. In PCA, the second-order statistical measure, i.e., variance, is maximized for image separation, and the outcome is a set of linearly separated, uncorrelated images. However, observed images with nonlinear forms of correlation (dependency) pose a significant difficulty to PCA-based separation methods. We can solve this problem by maximizing higher-order statistical measures (non-gaussianity) to separate the images into nonlinearly uncorrelated images through ICA.
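To make the mixing/unmixing model concrete, the following minimal sketch (synthetic data, scikit-learn; not the authors' implementation) mixes two non-Gaussian sources by a random matrix *A* as in Equation (1) and compares PCA decorrelation with FastICA unmixing as in Equation (2). All array names are hypothetical.

```python
# A minimal sketch of the mixing/unmixing model of Equations (1) and (2):
# two non-Gaussian "hidden features" f are mixed by a random matrix A into
# observed "images" g; PCA decorrelates them, while FastICA estimates
# f = Wg by maximizing non-gaussianity. Illustrative only.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
n_pixels = 10_000

f = np.vstack([
    np.sign(rng.standard_normal(n_pixels)),  # bimodal (sub-Gaussian) source
    rng.laplace(size=n_pixels),              # heavy-tailed (super-Gaussian) source
])
A = rng.standard_normal((2, 2))              # unknown mixing matrix
g = A @ f                                    # observed images, Equation (1)

pcs = PCA(n_components=2).fit_transform(g.T).T                 # uncorrelated only
ics = FastICA(n_components=2, random_state=0).fit_transform(g.T).T
# `ics` recovers the sources up to order, sign, and scale, Equation (2).
```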

The ICA starts with a PCA as a preprocessing step that removes the mean of the input images (g). The principal components (y) are related to the centered images (g<sub>c</sub>) through a weight matrix *D*:

$$y = D \mathbf{g}\_{\mathbf{c}} \tag{3}$$

The matrix *D* can be estimated through variance maximization of the principal components (y). Finally, increasing the non-gaussianity of the principal components (y) produces the ICs of the images. The problem is to find a rotation matrix that, when multiplied with the principal components, produces the least Gaussian outputs [16,18].
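A compact sketch of this preprocessing step, assuming a simple eigendecomposition-based whitening; variable names and data are hypothetical:

```python
# The PCA preprocessing of Equation (3): center the observed images, then
# build D from the eigendecomposition of their covariance so that the rows
# of y = D g_c are uncorrelated (whitened).
import numpy as np

def pca_whiten(g):
    g_c = g - g.mean(axis=1, keepdims=True)   # remove the mean of each image
    eigval, E = np.linalg.eigh(np.cov(g_c))   # eigendecomposition of covariance
    D = np.diag(eigval ** -0.5) @ E.T         # whitening (weight) matrix
    return D @ g_c, D                         # principal components y, and D

g = np.random.default_rng(0).standard_normal((3, 5000))  # 3 mixed images
y, D = pca_whiten(g)
print(np.allclose(np.cov(y), np.eye(3), atol=1e-2))      # rows are uncorrelated
```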

The Fast-ICA algorithm, through negentropy maximization and the Hyvärinen fixed-point method, was used to obtain the rotation matrix that incorporates higher-order statistics to recover the latent independent sources [16,18]. The entropy (*H*) of an image (g) is defined as [16,18]:

$$H(\mathbf{g}) = -\int p(\mathbf{g}) \log p(\mathbf{g}) d\mathbf{g} \tag{4}$$

where *p*(g) is the probability density of the image g. Entropy is a measure of randomness: the more unpredictable and unstructured the variable is, the larger its entropy. Theoretically, among variables of equal variance, the Gaussian variable possesses the largest entropy. The negentropy (*J*) of an image is the normalized differential entropy of that image:

$$J(\mathbf{g}) = H(\mathbf{g}\_{gauss}) - H(\mathbf{g}) \tag{5}$$

where *H*(g) is the entropy of the image and *H*(g<sub>gauss</sub>) is the entropy of a Gaussian random image with the same covariance matrix; *J*(g) ≥ 0 is always non-negative and is zero only when the image has a pure Gaussian distribution. The negentropy maximization is based on bringing an objective function to an approximated maximum value [14,16–18,27].
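In practice, Fast-ICA does not evaluate Equations (4) and (5) directly; it maximizes Hyvärinen's approximation *J*(y) ≈ (E{*G*(y)} − E{*G*(ν)})², where ν is a standard Gaussian variable and *G* is a nonquadratic contrast function such as *G*(u) = log cosh u. A hedged numerical sketch of this approximation:

```python
# Approximate negentropy J(y) ~ (E[G(y)] - E[G(v)])^2 for standardized y,
# with v ~ N(0, 1) and the log-cosh contrast function G. A sketch of the
# quantity FastICA maximizes, not a full ICA implementation.
import numpy as np

def negentropy_approx(y, n_ref=100_000, seed=0):
    rng = np.random.default_rng(seed)
    y = (y - y.mean()) / y.std()            # zero mean, unit variance
    G = lambda u: np.log(np.cosh(u))        # Hyvarinen's contrast function
    v = rng.standard_normal(n_ref)          # Gaussian reference variable
    return (G(y).mean() - G(v).mean()) ** 2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.standard_normal(50_000)))  # ~0 for a Gaussian image
print(negentropy_approx(rng.laplace(size=50_000)))     # larger for non-Gaussian
```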

The CWT in two dimensions (horizontal planes) was used for feature extraction in the frequency domain. Since wavelets are localized in space and have finite durations, sharp changes in images are efficiently detectable by 2D wavelet decomposition, which makes it an ideal tool for spectral feature extraction [19,32]. The continuous wavelet transform of an image *I (x, y)* is defined as a decomposition of that image by translation and dilation of a mother wavelet *ψ (x, y)*. The resulting wavelet coefficients (*C<sub>s</sub>*) are then given by

$$C\_s(b\_1, b\_2, a) = \frac{1}{\sqrt{|a|}} \iint I(x, y) \psi^{\*}\left(\frac{x - b\_1}{a}, \frac{y - b\_2}{a}\right) dx\, dy \tag{6}$$

where *b*<sub>1</sub> and *b*<sub>2</sub> control the spatial translation, *a* > 1 denotes the scale, and *ψ\** is the complex conjugate of the mother wavelet *ψ (x, y)*. The mother wavelet shifts and scales in multiple directions and produces numerous features. In this study, taking advantage of the statistical properties of negentropy-based ICA (nICA), we can keep the most geologically pertinent information within the spectrally decomposed volumes [18].
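For a fixed scale *a*, Equation (6) evaluates over all translations (*b*<sub>1</sub>, *b*<sub>2</sub>) as a convolution of the image with the scaled mother wavelet, so the 2D CWT of a horizontal slice can be sketched with FFT convolution. The isotropic Mexican-hat wavelet below is an illustrative stand-in, not necessarily the mother wavelet used in this study:

```python
# Illustrative 2D CWT per Equation (6): for each scale a, convolve the image
# with the scaled mother wavelet to obtain C_s(b1, b2, a) for all (b1, b2).
import numpy as np
from scipy.signal import fftconvolve

def mexican_hat_2d(a, half_width=4):
    """Isotropic 2D Mexican-hat wavelet sampled at scale a (in pixels)."""
    n = int(np.ceil(half_width * a))
    x, y = np.meshgrid(np.arange(-n, n + 1), np.arange(-n, n + 1))
    r2 = (x**2 + y**2) / a**2
    # The 1/sqrt(|a|) normalization follows Equation (6).
    return (2.0 - r2) * np.exp(-r2 / 2.0) / np.sqrt(a)

def cwt2d(image, scales):
    """Stack of wavelet-coefficient slices, one per scale."""
    return np.stack([fftconvolve(image, mexican_hat_2d(a), mode="same")
                     for a in scales])

slice_2d = np.random.default_rng(0).standard_normal((128, 128))  # a model slice
coeffs = cwt2d(slice_2d, scales=[2, 4, 8, 16])                   # (4, 128, 128)
```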

### *2.3. Learning Spectral Features*

Spatial and spectral feature extraction provides inputs for predictive modeling through supervised machine learning algorithms. Training machine learning models directly with raw data sets often yields unsatisfactory results. The feature extraction process identifies the most discriminating characteristics in raw images, which a machine learning algorithm can more easily consume. However, the curse of dimensionality prevents efficient machine learning when the set of extracted features is too large [8–10]. Dimensionality reduction can improve spectral learning performance through feature extraction and feature subset selection methods. The main difference is that feature extraction combines the original features and creates a set of new features, while feature selection selects a subset of the original features [15,21–23,26]. We propose a hybrid algorithm that includes several phases of preprocessing, spatial feature extraction, spectral feature extraction, and machine learning for geophysical predictive modeling (Figure 3). In this section, we first outline a basis for SRL through a multilayer perceptron (MLP), and then we describe our proposed SFSS algorithm for 3D geophysical predictive modeling, feature selection, and dimensionality reduction.
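The extraction-versus-selection distinction reads as follows in code (an illustrative sketch with hypothetical data):

```python
# Feature extraction builds new features from combinations of all inputs,
# while feature selection keeps a subset of the originals unchanged.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).standard_normal((500, 40))  # 40 spectral features

X_extracted = PCA(n_components=5).fit_transform(X)       # 5 new combined features

mask = np.zeros(40, dtype=bool)                          # binary selection mask
mask[[3, 7, 12, 21, 34]] = True                          # SFSS lets the GA choose it
X_selected = X[:, mask]                                  # 5 original features, as-is
```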


**Figure 3.** Schematic view of SRL for predictive modeling. The algorithm consists of four sub-modules: The preprocessing prepares the data sets by 3D inversion, interpolation, and filtering methods. The spatial feature extraction with nICA separates the image overlaps in 3D. Spectral feature extraction with CWT-nICA extracts the wavelet decomposed features. Furthermore, the machine learning algorithm with MLP adjusts network weights (W1 and W2) to learn patterns inside the extracted features based on the sample targets.

The basic building block of the MLP is the perceptron, a mathematical analog of the biological neuron, first described by Rosenblatt [33]. In this model, the integrated weights and biases of the input vectors (*x*) are activated by a sigmoid function (activation function) to produce the output (*y*). After setting the initial weights randomly, the network is ready to train. The output is then compared to the target in an iterative process that adjusts the weights of the ANN. This process can be performed by defining an objective function (*F*) and minimizing it over *N* training samples. *F* is defined as the least mean square (LMS) of the output vector (*y<sub>i</sub>*) and the desired output or target vector (*t<sub>i</sub>*):

$$F = \frac{1}{N} \sum\_{i=1}^{N} (t\_i - y\_i)^2 \tag{7}$$

During the back-propagation process, the optimization updates the synaptic weights to make the output closer to the specified target.
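A minimal numpy sketch of this loop, assuming one sigmoid hidden layer and the LMS objective of Equation (7); biases and stopping criteria are omitted for brevity, and all sizes and data are hypothetical:

```python
# Forward pass, LMS objective (Equation (7)), and back-propagated weight
# updates for a tiny MLP with one sigmoid hidden layer. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))            # 200 samples, 8 input features
t = rng.standard_normal((200, 1))            # target vector

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1 = 0.1 * rng.standard_normal((8, 10))      # input -> hidden weights
W2 = 0.1 * rng.standard_normal((10, 1))      # hidden -> output weights
lr = 0.05

for _ in range(2000):
    h = sigmoid(X @ W1)                      # hidden activations
    y = h @ W2                               # network output
    err = y - t
    F = np.mean(err**2)                      # Equation (7), monitored per epoch
    grad_y = 2.0 * err / len(X)              # dF/dy
    dW2 = h.T @ grad_y                       # gradient w.r.t. W2
    dW1 = X.T @ ((grad_y @ W2.T) * h * (1.0 - h))  # back-propagated to W1
    W2 -= lr * dW2
    W1 -= lr * dW1
```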

Despite successful applications of ANNs in nearly all branches of science and engineering, we still face many problems and potential pitfalls. A potential pitfall is the over-fitting problem, in which designing too many neurons and running too many iterations gives rise to noisy outputs. On the other side, under-fitting happens when a too-simple network is created, sometimes leading to underestimated values. Preprocessing of the original data is also important. Random noise in inputs or targets can propagate through the network and disturb the outputs. For example, surficial noise in geophysical images can lead to underestimating the results. Smoothing the inputs and targets can reduce spurious effects but also eliminate valuable attributes in the images, leading to overestimation in the final models. An optimum smoothness filtering strategy can be used to avoid over-fitting and under-fitting effects [34–36].

Choosing an efficient optimization algorithm is also critical because a suitable optimization algorithm can make a considerable difference in the final estimations. The Levenberg–Marquardt optimization technique [37,38] is a fast and robust technique for minimizing the performance function. Because of the local minima problem inherent in nonlinear optimization procedures, finding a global optimum or a better local solution is the goal of an effective learning process. Several novel heuristic stochastic techniques are available to solve this complex hyper-dimensional optimization problem [24,35,38–42].

In this study, we use GA optimization to minimize the ANN cost function and the number of input features simultaneously. The process is called feature subset selection and begins with initializing the input parameters we wish to optimize. These considerations are codified in an objective or cost function with different weights depending on the survivability of individuals. The best-fitted individuals meet the specifications of the objective function and survive, while the rest of the population goes extinct [24,41,42]. The optimal solution emerges from parallel processing among a population with a complex fitness landscape. The whole idea is moving the population away from the local optima that classical hill-climbing techniques might be trapped in. This ability makes GA an extremely robust global optimization technique that can resolve the traditional problem of local pockets existing in older optimization routines.

During the SFSS procedure, the spectral extracted features are fed to the MLP network, and the learning process involves adjusting the network weights and the number of selected features while minimizing the bi-objective function *F* (Figure 4). In this study, we used a fast implementation of the Non-dominated Sorting Genetic Algorithm (NSGA) for bi-objective optimization. The NSGA algorithm [43] optimizes a global bi-objective cost function of the general form:

$$E\_{\text{Global}} = f\left(E\_{\text{ANN}}, n\_f\right) \tag{8}$$

where *f* is the global cost function operator. The bi-objective GA simultaneously minimizes the number of spectral features (*n<sub>f</sub>*) and the ANN cost (*E*<sub>ANN</sub>).
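A hedged sketch of how one candidate solution might be scored under Equation (8): the chromosome is a binary mask over the spectral features, and its two objective values are the validation error of an MLP trained on the masked inputs and the count of selected features. The data, network settings, and function names are hypothetical, not the authors' exact configuration:

```python
# Scoring one candidate under Equation (8): a binary feature mask yields the
# objective pair (E_ANN, n_f) that NSGA-II then minimizes jointly.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def evaluate(mask, X, t):
    if mask.sum() == 0:                       # an empty subset is infeasible
        return np.inf, 0
    Xtr, Xva, ttr, tva = train_test_split(X[:, mask], t, random_state=0)
    net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=500,
                       random_state=0).fit(Xtr, ttr)
    return mean_squared_error(tva, net.predict(Xva)), int(mask.sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 40))            # 40 candidate spectral features
t = X[:, 5] - 0.5 * X[:, 17] + 0.1 * rng.standard_normal(300)
mask = rng.random(40) < 0.2                   # one random candidate chromosome
print(evaluate(mask, X, t))                   # -> (E_ANN, n_f)
```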

**Figure 4.** Schematic view of spectral feature subset selection for predictive modeling. The algorithm consists of four main sub-routines like the SRL algorithm, except that the MLP is integrated with NSGA-II to adjust the network weights and the selection of inputs simultaneously.

The solution of multi-objective optimization problems gives rise to a set of Pareto-optimal solutions instead of a single solution. Conventional optimization methods convert multi-objective optimization problems into single-objective optimization problems by emphasizing one Pareto-optimal solution at a time; the algorithm then has to be iterated, obtaining a different solution at each run, to find multiple solutions. Over the past decades, evolutionary algorithms have been suggested to find multiple Pareto-optimal solutions in a single simulation run, emphasizing movement toward the true Pareto-optimal region. This study used an improved version of NSGA called NSGA-II [44,45]. NSGA-II can be detailed in the following steps:

### *Step 1: Population initialization*

NSGA-II randomly initializes a population *P<sub>t</sub>* of size *N*. An offspring population *Q<sub>t</sub>* is generated using the genetic operators (tournament selection, crossover, and mutation), and *P<sub>t</sub>* and *Q<sub>t</sub>* are combined to create a population *R<sub>t</sub>* of size 2*N*, as sketched below.
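Step 1 might look as follows in code, with binary chromosomes encoding feature masks; the `fitness` stub stands in for the bi-objective scorer of Equation (8) (e.g., the `evaluate` sketch above), and the simplified pairwise dominance check stands in for NSGA-II's full rank-and-crowding tournament. An illustrative sketch only:

```python
# Step 1 of NSGA-II: initialize P_t, breed an offspring population Q_t with
# tournament selection, one-point crossover, and bit-flip mutation, then
# combine P_t and Q_t into R_t of size 2N.
import numpy as np

rng = np.random.default_rng(0)
N, n_features = 40, 40

def fitness(mask):
    # Stand-in for the bi-objective scores (E_ANN, n_f) of Equation (8).
    return rng.random() + float(mask.sum() == 0), int(mask.sum())

P_t = rng.random((N, n_features)) < 0.5          # random initial population
scores = np.array([fitness(m) for m in P_t])

def tournament():
    """Binary tournament: keep the member that weakly dominates the other."""
    i, j = rng.integers(0, N, size=2)
    return i if (scores[i] <= scores[j]).all() else j

Q_t = []
while len(Q_t) < N:
    a, b = P_t[tournament()], P_t[tournament()]
    cut = rng.integers(1, n_features)            # one-point crossover
    child = np.concatenate([a[:cut], b[cut:]])
    flip = rng.random(n_features) < 0.02         # bit-flip mutation
    Q_t.append(np.where(flip, ~child, child))

R_t = np.vstack([P_t, Q_t])                      # combined population, size 2N
```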
