Optimization of Feature Selection in Mineral Prospectivity Using Ensemble Learning

Zhang, Hong; Xie, Miao; Dan, Shiyao; Li, Meilin; Li, Yunhe; Yang, Die; Wang, Yuanxi

doi:10.3390/min14100970


by Hong Zhang ¹, Miao Xie ²,³,*, Shiyao Dan ⁴, Meilin Li ⁴, Yunhe Li ⁴, Die Yang ⁴ and Yuanxi Wang ⁴

¹ Chengdu Center, China Geological Survey (Geosciences Innovation Center of Southwest China), Chengdu 610218, China
² Institute of Geophysical and Geochemical Exploration, Chinese Academy of Geological Sciences, Langfang 065000, China
³ Key Laboratory of Geochemical Exploration, Institute of Geophysical and Geochemical Exploration, Langfang 065000, China
⁴ Geomathematics Key Laboratory of Sichuan Province, Chengdu University of Technology, Chengdu 610059, China
* Author to whom correspondence should be addressed.

Minerals 2024, 14(10), 970; https://doi.org/10.3390/min14100970

Submission received: 3 August 2024 / Revised: 13 September 2024 / Accepted: 23 September 2024 / Published: 26 September 2024

(This article belongs to the Special Issue Application of Big Data Mining, Machine Learning and Artificial Intelligence in Geoscience, 2nd Edition)


Abstract:

In recent years, machine learning (ML) has been extensively used for the quantitative prediction of mineral resources. However, the accuracy of prediction models is often influenced by data quality, feature selection, and algorithm limitations. This research investigates the benefits of data-driven feature optimization techniques in enhancing model accuracy. Using the Lhasa region in Tibet as the study area, this research applies ensemble learning methods, such as random forest and gradient boosting tree techniques, to optimize 43 feature variables encompassing geology, geochemistry, and geophysics. The optimized feature variables are then input into a support vector machine (SVM) model to generate a prospectivity map. The performance characteristics of the SVM, RF_SVM, and GBDT_SVM models are evaluated using ROC curves. The results indicate that the feature-optimized GBDT_SVM model achieves superior classification accuracy and prediction effectiveness, demonstrating that feature optimization is a necessary step for mineral prospectivity mapping, as it can significantly improve the performance of mineral prospectivity prediction.

Keywords:

feature optimization; machine learning; mineral prospectivity mapping

1. Introduction

Mineral Prospectivity Mapping (MPM) seeks to identify promising areas for mineral exploration to uncover new deposits through the examination of the connection between deposits and ore-controlling variables [1]. Traditional prospecting work mainly relies on the expert subjective evaluation of spatial control factors, and the predicted output and intermediate products are also based on qualitative results [2,3]. Due to the swift advancement of geographic information technology and the increasing diversification of data sources, the prediction model based on the data-driven method can more objectively and accurately capture the spatiotemporal variation characteristics of various factors in geoscience, laying a foundation for the improvement of prediction accuracy [4,5,6]. In past geological surveys and metallogenic predictions, many ML algorithms have made important contributions [7,8,9,10,11,12].

It is worth noting that ensemble learning, as one of the most advanced and effective methods in ML, reduces bias and variance by training multiple weak estimators, thereby enhancing the robustness and reliability of the overall model. Particularly in MPM, ensemble learning can fully utilize various data sources and features, providing more accurate and reliable predictions [13,14,15]. For instance, Zhang et al. [16] used the random forest algorithm to quantitatively analyze various factors influencing mineralization and established a mineral prediction model. Zhang et al. [17] used the XGBoost algorithm for 3D mineral prospecting of the Lannigou gold deposit in Guizhou, demonstrating the effectiveness of ensemble learning in 3D prediction. In addition to being applied to classification tasks, ensemble learning is also a research hotspot for selecting features that have significant impacts on the target variable [18,19,20,21].

Feature optimization is key to improving the performance of ML technology [22,23,24]. Mineral resource prediction can be framed as a binary classification problem: determining whether a deposit is present. In the process of mineral resource prediction, the data sources are complex and diverse, with many factors that may be unrelated to mineralization [25,26,27]. Feature optimization helps identify and filter out these irrelevant factors, ensuring that the model focuses on genuinely valuable information, thereby improving the model’s classification accuracy [28].

Feature optimization in machine learning involves selecting, filtering, and adjusting features to enhance the model’s performance and efficiency [29]. In exploration geochemistry, potential geochemical models may be disturbed by “noise”, which may originate either intrinsically within the data or from human factors during sampling, preparation, or analysis, as well as other factors unrelated to the mineralization process [30]. These factors can decrease the accuracy of classifying samples of interest. Thus, feature optimization is particularly important for improving the classification accuracy of the model [31]. While many studies have successfully applied ML, little attention has been paid to selecting feature variables specifically related to mineralization [32].

This study focuses on the Lhasa area, utilizing RF and GBDT for feature optimization in metallogenic prediction. Then, the SVM model is constructed after feature optimization. On this basis, the PSO algorithm is used to optimize SVM parameters, enhancing the model’s accuracy. Comparative analyses of prediction results from SVM, RF_SVM, and GBDT_SVM offer a novel approach for regional mineral resource exploration.

2. Geological Background and Dataset

2.1. Geological Background

The Gangdese Metallogenic Belt is of immense importance in the western regions of China. The belt has experienced the northward subduction of the Neo-Tethys Ocean coupled with the orogenic processes of continental–continental collision between the Indian and Eurasian plates. The belt includes rocks from the Paleozoic to the Cenozoic. Large-scale intermediate-acid intrusive rocks have been developed, and the fault structure is well developed with favorable metallogenic conditions [33,34,35]. The Lhasa region, situated in the eastern part of the Gangdese Metallogenic Belt, represents one of the most extensively explored and remarkable areas on the Qinghai–Tibet Plateau.

The strata in this region are categorized into two parts based on their location in the north or south. Among them, the strata in the south belong to the Lhasa–Waka stratigraphic area; the strata in the north belong to the Quesang–Songduo stratigraphic area (Figure 1b). The carbonate rocks in the Lhasa–Waka stratigraphic area are mainly hosted in the Late Jurassic Duodigou Formation. The Quesang–Songduo strata are relatively older and from the early Carboniferous to the early Jurassic.

The rock mass within the research area can be broadly categorized into two types: magmatic rock and metamorphic rock. The Yeba Formation, dating from the Early to Middle Jurassic, includes volcanic rocks within the southern Lhasa–Woka stratigraphic zone [36,37]. Moreover, volcanic rocks are extensively spread across diverse strata in the northern Quesang–Songduo stratigraphic area. Granitoids are the primary intrusive rocks, and they are abundant throughout the study area. Basic rocks, partially exposed, may have close relationships with the porphyry copper deposits [38,39] (Figure 1b). The predominant metamorphic rocks in the research area include quartzite, quartz schist, phyllite containing gneiss, and marble.

The Qinghai–Tibet Plateau has evolved from Paleo-Tethys to Meso-Tethys to Neo-Tethys and formed three world-class plate collisional suture zones, namely, the Jinshajiang, Bangonghu–Nujiang, and Indus–Yarlung Zangbo Suture Zones. The whole Qinghai–Tibet Plateau is segmented into four regions: the Songpan–Ganzi Terrane, Qiangtang Terrane, Lhasa Terrane, and Himalaya Terrane [40,41] (Figure 1a). The research site is located in the Menba Township of Lhasa City in the eastern sector of the Lhasa Terrane. A series of east–west-trending nappe tectonic systems developed in this area due to the bidirectional effect of the southward subduction of the Meso-Tethys Ocean and the northward subduction of the Neo-Tethys Ocean, with a considerable occurrence of overturned folds and thrust faults (Figure 1b). During the Indian–Asian continental collision, due to the influence of the eastern Himalayan syntaxis, the whole Lhasa Terrane underwent a large strike–slip movement, forming nearly north–south-trending tensile faults [41,42,43,44].

During the Paleogene period, the collision between the Indian and Asian continental plates triggered significant volcanic and magmatic activity in the Gangdese belt, leading to the formation of widespread Linzizong volcanic rocks and Gangdese collisional granitoids [45,46,47]. This process triggered the first major phase of mineralization in the region. Influenced by the low-angle subduction of the Neo-Tethys oceanic crust [48], the northern part of the study area formed several mineral concentrated areas [49]. These include the Dongzhongla–Yaguila skarn Pb–Zn and porphyry Mo mineral concentrated area, the Mengyaa–Longmala–Hahaigang skarn Pb–Zn–W–Mo mineral concentrated area, the Lietinggang–Leqingla skarn Pb–Zn–Fe–Cu mineral concentrated area, and the Xingaguo–Lunlang skarn Pb–Zn mineral concentrated area around Lhunzhub Basin [35] (Figure 1b).

During the Miocene epoch, the India–Asia plates entered a post-collisional phase. As a result of either the slab breakoff or regional extension of the Neo-Tethys oceanic crust, mantle upwelling took place, leading to significant porphyry and skarn Cu–Mo mineralization in the Gangdese belt [50,51,52]. This process represents the second major mineralization episode in the study area, occurring primarily south of the Milashan–Songduo collision zone and resulting in the formation of several super-large deposits and key mineral concentrated areas, particularly the Qulong–Jiama porphyry–skarn Cu–Mo mineral concentrated area. Simultaneously, the India–Asia continental collision progressed into the “soft collision” phase [53], which triggered the development of a series of near north–south trending fault structures in the Lhasa and Tethys terranes. These fault systems controlled the formation of multiple hydrothermal vein Pb–Zn and Au mineral concentrated areas, including the Riwuduo–Banduoxi Pb–Zn–Ag and Nongruri Au mineral concentrated areas in this region [54,55] (Figure 1b).

The mineralization within the study area exhibits distinct spatial zoning, with the Milashan–Songduo collisional zone serving as a general boundary. To the north, skarn Pb–Zn mineralization predominates, along with a few porphyry and skarn W and Mo deposits. In contrast, porphyry and skarn Cu–Mo deposits are mainly concentrated in the southern region [42,56]. Additionally, skarn Pb–Zn deposits are abundant around the Lhunzhub Basin, with some deposits also containing Fe and Cu. The hydrothermal vein deposits between Menba and Ruduo are largely influenced by the distribution of regional fault structures.

Figure 1. The geological schematic map of Lhasa [57]. (a) The geotectonic location of the Lhasa area, Tibet [58]. JF, Jiali fracture; KF, Kunlun fault; ALT, Altyn Tagh fault; KLF, Karakoram fault; BNS, Bangonghu–Nujiang suture; IYZS, Indus–Yarlung Zangbo suture; and JS, Jinshajiang suture. (b) The geological map with the locations of significant deposits in Lhasa.

2.2. Dataset

The geochemical dataset used in this study was collected via stream sediment sampling as part of the “Regional Geochemistry National Reconnaissance (RGNR) Project”, which was initiated in 1979 and has now covered more than 7 million km² of China. The sample density was 1 per 7 km², and a total of 2689 samples were included in this study. The specific distribution map of the sampling points was provided by Wang [56]. A total of 39 elements were determined using various methods, including inductively coupled plasma–mass spectrometry (ICP–MS), X-ray fluorescence spectrometry (XRF), and ICP–atomic emission spectroscopy (ICP–AES). Specific information on the analytical techniques used, their detection limits, and their precision and accuracy is provided in Wang et al., Xie et al., and Xie et al. [59,60,61] and is not repeated here.

2.2.1. Geological Variables

Strata and structures are critical factors that cannot be overlooked in the formation of mineral deposits. As the foundation of deposit formation, strata provide the bedrock and sedimentary environment necessary for mineral enrichment. In the study area, the Carboniferous-to-Jurassic strata are the main ore-bearing strata, characterized by terrigenous clastic and carbonate rock deposits. Porphyry copper deposits are mainly emplaced within the Jurassic Yeba and Duodigou Formations. Skarn-type deposits are partially located between the Early Cretaceous Linzizong Formation’s sandstone slate and the Late Jurassic Duodigou Formation’s limestone, while others are found in the Permian Luoba Formation. Additionally, tectonic activities, such as faulting and folding, cause rock layers to fracture and deform, creating pathways and spaces for the migration and concentration of materials and allowing ore bodies to accumulate in favorable structural locations. The spatial distribution of the studied deposits is jointly controlled by east–west thrust fault zones and north–south extensional structural systems. Stratigraphic combination entropy represents the relationship between the types of strata and their exposed areas with mineralization, reflecting whether the stratigraphic combination is favorable for mineralization. In quantitatively representing stratigraphic information, most studies use binary discrete variables to distinguish whether strata are favorable for mineralization. However, stratigraphic combination entropy, as a continuous variable, better reflects the degree of stratigraphic variation, facilitating subsequent machine learning models in learning the features of the data. The structural information is quantitatively expressed by the structural distance based on GIS analysis.

(1): Strata Variables

Stratigraphic combination entropy reflects the complexity of strata and structural development [62].

H_{r} = \frac{-\sum_{i=1}^{N} p_{i} \lg p_{i}}{-\sum_{i=1}^{N} \frac{1}{N} \lg \frac{1}{N}} \times 100\%

where H_r signifies the measure of geological complexity for the r-th statistical unit, N represents the number of stratum types, and p_i denotes the ratio of the exposed area of the i-th stratum within a statistical unit to the total area of the statistical unit [63].
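
As a minimal sketch of the entropy formula above, the following Python function computes H_r for a single statistical unit. The input format (a dictionary from stratum name to exposed area) and the choice to count N as the number of strata passed in are assumptions made for illustration; they are not specified in the text.

```python
import math


def stratigraphic_entropy(areas):
    """Normalized stratigraphic combination entropy (percent) for one unit.

    `areas` maps each stratum type to its exposed area inside the unit
    (hypothetical input format). N is taken as the number of entries.
    """
    n = len(areas)
    total = sum(areas.values())
    if n <= 1 or total <= 0:
        return 0.0  # a single stratum (or an empty unit) has no mixing
    h = 0.0
    for area in areas.values():
        p = area / total
        if p > 0:  # the term p * lg(p) tends to 0 as p tends to 0
            h -= p * math.log10(p)
    h_max = math.log10(n)  # equals -sum((1/N) * lg(1/N)) over N strata
    return 100.0 * h / h_max


# Hypothetical unit: three quarters of one formation, one quarter of another.
example = stratigraphic_entropy({"Yeba Fm": 3.0, "Duodigou Fm": 1.0})
```

A unit with equal areas of every stratum yields 100%, while a unit dominated by a single stratum approaches 0%.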

In this study, stratigraphic combination entropy is used to describe the complexity of the strata. The Miocene epoch represents the most concentrated period of mineralization intensity within the Gangdese metallogenic belt, and large-scale intermediate-acid intrusive rocks are developed. The region characterized by high values of stratigraphic entropy exhibits a strong correlation with the distribution pattern of Miocene granites. In addition, a majority of the ore bodies are situated adjacent to the fracture, which is in the transition zone of the stratigraphic combination entropy from high to low (Figure 2). This figure shows that the stratigraphic combination entropy can represent the complex metallogenic environment, reflect the strong structure, and serve as an indicator for predicting mineral exploration outcomes. However, stratigraphic combination entropy depends on the integrity of the data. If the strata data are incomplete or there are errors, the calculation results of the entropy value will be affected.

(2): Fracture Variables

Fractures are important ore-controlling factors that can provide pathways and space for mineralization. Since most deposits are located alongside the main fractures, an analysis of the fracture distance can reveal a favorable environment for mineralization around the fractures on a regional scale. Therefore, this study uses fracture distance to describe the spatial relationships between deposits and fractures (Figure 3). Areas closer to faults are assigned smaller values, which are represented in red on the map, while areas farther from faults are assigned larger values, which are represented in blue. Most mineralization is located in the red low-value areas.
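
The fracture distance can be illustrated with a short sketch that measures the shortest planar distance from a grid-cell centre to a set of fault traces. The representation of faults as polylines of (x, y) vertices in a projected coordinate system is an assumption; the study itself derives this variable through GIS analysis, whose exact procedure is not described here.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (planar coordinates)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def fracture_distance(cell_center, faults):
    """Distance from a cell centre to the nearest fault polyline.

    `faults` is a list of polylines, each a list of (x, y) vertices
    (hypothetical format).
    """
    return min(
        point_segment_distance(cell_center, line[i], line[i + 1])
        for line in faults
        for i in range(len(line) - 1)
    )
```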

2.2.2. Geochemical Variables

(1): Geochemical Element Anomalies

The geochemical measurement of river sediments is an effective method to describe geochemical anomalies. Due to factors such as sampling bias and weathering, this method may not capture all geochemical anomalies. However, it remains effective for metallogenic prediction in the region. As shown in Figure 4, although most of the mineralization is indeed located within the anomalous areas, a few copper deposits have not been accurately identified. Therefore, in addition to using single-element geochemical data as feature variables for metallogenic prediction, it is essential to extract the association of geochemical elements, as they can reflect comprehensive mineralization information (as detailed below).

(2): Geochemical Association Anomalies

Principal component analysis (PCA) was performed on the 39 elements after centered log-ratio (clr) transformation. In the biplot (Figure 5), among the 39 elements, 14 elements show positive values on the first principal component. The elements with higher loadings on PC1 are Y and Hg. The elements with high loadings on PC2 are As, Ag, Cd, Sb, and Cu. The elements in the first quadrant are related to Cu mineralization, and the occurrence of Ni and Cr may also be related to basic and ultrabasic rocks. The elements appearing in the second quadrant are related to Au mineralization [64,65]. Au, As, Sb, Hg, Pb, and Zn are used as indicator elements for gold and lead–zinc deposits. In addition, W, Mo, and Bi, as typical high-temperature elements, appear together in the second quadrant, which may indicate the existence of deposits related to the magmatic–hydrothermal fluid to a certain extent [65,66]. The elements in the third quadrant may be related to the stratigraphic distribution. Al₂O₃ and SiO₂ in the third quadrant are the main constituents of granite and Linzizong volcanic rocks in the area. MgO has a higher content in the Laigu Formation dolomite and Songduo group. The distribution characteristics of the elements are closely related to the lithology of the strata. The iron group elements (Fe₂O₃, V, Ti) appearing in the fourth quadrant may be related to basic and ultrabasic rocks [65,66]. From the data-driven results, the positive loadings on PC2 include Cu, Au, Ag, Mo, Pb, Zn, and other elements, which is related to polymetallic mineralization in the study area. This finding confirms the existence of certain correlations among the metallogenic elements.
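
For readers unfamiliar with the clr transformation applied before PCA, a minimal sketch follows. It assumes each sample is a list of strictly positive concentrations; how zeros or values below the detection limit were handled in the study is not stated, so that step is left out.

```python
import math


def clr(composition):
    """Centered log-ratio transform of one sample.

    Each part is divided by the geometric mean of all parts and then
    log-transformed, which is the same as centring the logarithms.
    All parts must be strictly positive.
    """
    logs = [math.log(value) for value in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical three-part sample; the transformed values sum to zero.
transformed = clr([1.0, 2.0, 4.0])
```

Because only ratios matter, multiplying every concentration in a sample by the same constant leaves the clr values unchanged.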

Data-driven relationships among elements are utilized, and the metallogenic geological background is considered. The sequential binary partition technique in compositional data analysis (CoDA) is used to extract the associations of elements of interest. This technique, which also serves as a dimensionality reduction technique, constructs contrasts between groups of elements using orthogonal bases in the simplex space. For the theory of CoDA, see the previous paper [67].
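
A single step of a sequential binary partition produces one balance, which contrasts two groups of elements. The sketch below uses the standard balance formula from the compositional data literature; the element groups and concentrations in the example are hypothetical and do not reproduce the partition used in the study.

```python
import math


def balance(sample, numerator, denominator):
    """Balance between two groups of parts in one compositional sample.

    sqrt(r*s/(r+s)) * ln(gmean(numerator parts) / gmean(denominator parts)),
    where r and s are the group sizes. `sample` maps element names to
    strictly positive concentrations.
    """
    r, s = len(numerator), len(denominator)
    mean_log_num = sum(math.log(sample[e]) for e in numerator) / r
    mean_log_den = sum(math.log(sample[e]) for e in denominator) / s
    return math.sqrt(r * s / (r + s)) * (mean_log_num - mean_log_den)


# Hypothetical sample: ore-related elements against rock-forming oxides.
sample = {"Cu": 4.0, "Mo": 4.0, "Al2O3": 1.0, "SiO2": 1.0}
b = balance(sample, ["Cu", "Mo"], ["Al2O3", "SiO2"])
```

A positive balance means the numerator group is enriched relative to the denominator group, which is how a balance can act as a single association variable.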

Based on the above analysis, an association of elements reflecting Mn-Ag-Pb-Cd-As-Sn-W-Cu-Sb-Mo-Au-Zn polymetallic mineralization in the study area is extracted using the compositional data analysis [65]. The main metallogenic elements in the study area include Cu, Mo, Pb, Zn, Au, Ag and W [41]. As shown in Figure 6, about 60% of the mineral deposits fall within the range of 1.0 (yellow in the figure) or above.

2.2.3. Geophysical Variables

Tectonic activity and magnetic differences between different rock types can lead to complexity and variability in the magnetic field. The general trend of aeromagnetic anomalies follows an almost east–west direction, as they are distributed on both sides of the main fault. In addition, known deposits are mostly distributed in the transition zones between positive and negative anomalies or within the traps of positive anomalies (Figure 7).

In summary, there are 43 characteristic variables based on geology, geochemistry, and geophysics in this paper, including 2 geological characteristic variables, 40 geochemical characteristic variables, and 1 geophysical characteristic variable. The details are shown in Table 1.

3. Methods

3.1. Random Forest

As mentioned above, RF is an algorithm constructed using decision trees as weak classifiers through bagging ensemble methods [67]. The methodology exhibits exceptional robustness against outliers and noise while also effectively mitigating the risk of overfitting. Additionally, it demonstrates superior capability in the classification of multiple categories [68]. Its core step is to use the bootstrap method for random sampling with replacement on the training set, constructing different sample sets. Random features are selected in each set to establish multiple decision trees. The ultimate predictive outcome is derived by aggregating the results from all decision trees, utilizing either majority voting or averaging techniques to synthesize the final decision. Additionally, in the construction of each decision tree, Random Forest randomly selects features to determine the split condition for each node. For each node’s split decision, Random Forest uses the Gini index to evaluate the impurity of the features [69].

Gini = 1 - \sum_{i=1}^{C} P_{i}^{2} \qquad (1)

where C represents the total number of categories and P_{i} denotes the probability of the i-th class sample.
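
Equation (1) can be checked with a few lines of Python. The helper `split_gini`, which weights the impurity of two child nodes by their sizes, is the usual way a candidate split is scored in decision trees; it is added here for illustration and is not taken from the text.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_gini(left, right):
    """Size-weighted Gini index of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
```

A node with equal numbers of mineralized (1) and non-mineralized (0) samples has a Gini index of 0.5, and a split that separates the two classes perfectly reduces it to 0.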

In the RF algorithm, indicators such as Gini Importance or Mean Decrease Impurity are commonly utilized to measure the impacts of features on the predictive accuracy of the model. Specifically, new features can be constructed in the following steps:

Train an RF model and calculate the importance score of each feature.
Arrange the features in descending order according to their scores and choose the top k features with the highest scores.
For each sample, concatenate its values on the top k features to complete the feature selection.
Train the model and make predictions using the new feature vector as the input vector.
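
The ranking and projection steps above can be sketched as follows, assuming the importance scores have already been obtained from a trained RF model. Apart from the Cu score of 0.11 reported later in the text, the feature names and scores in the example are hypothetical.

```python
def select_top_k(importances, k):
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def project(sample, selected):
    """Build the reduced feature vector of one sample from the selected features."""
    return [sample[name] for name in selected]


# Hypothetical scores (only the Cu value comes from the text).
scores = {"Cu": 0.11, "fracture_distance": 0.05, "Na2O": 0.004}
selected = select_top_k(scores, k=2)
vector = project({"Cu": 35.2, "fracture_distance": 1.8, "Na2O": 2.1}, selected)
```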

3.2. Gradient Boosting Decision Tree (GBDT)

GBDT is a boosting algorithm proposed by Friedman [70]. It is an ensemble learning algorithm based on decision trees, which constructs a series of weak learners and combines them into a strong learner through multiple iterations. Unlike Random Forest, GBDT is a sequential algorithm. In each iteration, the emphasis is on rectifying the residuals from the preceding iteration model. By fitting residuals, the new decision tree can identify the parts that the model is currently unable to predict, thereby continuously optimizing the model’s prediction accuracy.

The training objective of GBDT is to find a strong learner F(x) such that, for any x_{i} ∈ X, the error between the predicted value \hat{y}_{i} = F(x_{i}) and the true value y_{i} is minimized. The strong learner F(x) is composed of the weak learners f_{m}(x), and its formula is as follows.

F(x) = f_{M}(x) = f_{0}(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} h_{mj} I(x \in T_{mj}) \qquad (2)

where f_{0}(x) represents the initialization of the weak learner, M is the number of iterations (trees), h_{mj} denotes the optimal fitting value of the j-th leaf node of the m-th tree, T_{mj} is the region of the feature space covered by that leaf node, I(·) is the indicator function, and J represents the number of leaf nodes.
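
To make the additive form of Equation (2) concrete, the sketch below evaluates a tiny hand-built model. Each tree is represented by a function returning the index of the leaf that contains x, together with a list of leaf values; this representation and all numbers are hypothetical and serve only to show how the indicator term selects one leaf value per tree.

```python
def gbdt_predict(x, f0, trees):
    """Evaluate F(x) = f0 + sum over trees of the value of the leaf containing x.

    `trees` is a list of (leaf_of, leaf_values) pairs, where leaf_of(x)
    returns the leaf index j and leaf_values[j] plays the role of h_mj.
    """
    total = f0
    for leaf_of, leaf_values in trees:
        j = leaf_of(x)           # exactly one leaf region contains x
        total += leaf_values[j]  # the indicator keeps only that leaf's value
    return total


# Two hypothetical depth-1 trees, each splitting on one feature.
trees = [
    (lambda x: 0 if x[0] < 1.0 else 1, [-0.5, 0.5]),
    (lambda x: 0 if x[1] < 2.0 else 1, [-0.25, 0.25]),
]
prediction = gbdt_predict([0.5, 3.0], f0=0.1, trees=trees)
```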

In terms of constructing new features, the idea of GBDT is to use the original information features (each sample with 43 features) to train the GBDT model and use the trained GBDT model to build new features. Figure 8 shows the specific steps of GBDT to generate new feature vectors. First, the GBDT algorithm will be applied to the training dataset to learn and generate a decision tree model. Secondly, all samples will be classified by inputting them into the model, and the output paths of the training set samples will be recorded within the decision tree model. In each tree, only one leaf node will be assigned a value of 1, indicating that the sample falls into the data interval represented by this leaf node, and the values of other leaf nodes and decision nodes are 0. This process is called One-Hot Encoding. The construction of the new feature vector is obtained through the One-Hot Encoding of the leaf nodes of the decision tree. Each element in the vector represents the case where the sample point falls into the corresponding leaf node, with a value of either 0 or 1. Finally, this new feature vector serves as the data fed into the SVM model to build the final predictive model.
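
The one-hot encoding of leaf nodes described above can be sketched in a few lines. The function assumes that the leaf reached in each tree is already known as an index, and that the number of leaves per tree is given; both inputs are hypothetical stand-ins for what a trained GBDT model would report.

```python
def one_hot_leaves(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree into a single 0/1 feature vector.

    leaf_indices[m] is the index of the leaf that the sample reaches in
    tree m; leaves_per_tree[m] is the number of leaves in that tree.
    """
    vector = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1  # only the leaf reached by the sample is set to 1
        vector.extend(block)
    return vector


# A sample reaching leaf 1 of a 3-leaf tree and leaf 0 of a 2-leaf tree.
encoded = one_hot_leaves([1, 0], [3, 2])
```

The resulting vector has as many ones as there are trees, and its length equals the total number of leaves in the ensemble.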

3.3. Support Vector Machine (SVM)

SVM is a supervised learning method established on the principle of structural risk minimization [71]. Its core idea is to classify data by finding the maximum-margin hyperplane. For linearly separable data, SVM identifies the optimal classification hyperplane (linear segmentation boundary) in the original feature space to separate the two classes of samples. In two-dimensional space, a hyperplane is a straight line. In three-dimensional space, a hyperplane is a plane, and in high-dimensional space, a hyperplane is a subspace one dimension lower than the space. In situations where data are not linearly separable, SVM introduces slack variables and utilizes nonlinear mapping to a high-dimensional space, making the data linearly separable in the new feature space. Subsequently, the search for the optimal classification hyperplane takes place within this transformed feature space.

w^{T} \cdot ϕ(x) + b = 0 \qquad (3)

where w represents the weight coefficient vector, b denotes the classification threshold value of the hyperplane, and ϕ(x) is a mapping of the feature space.

To address the problem of nonlinear separable data classification, input variables can be mapped through a mapping function to a high-dimensional space. In this study, a Gaussian kernel function is employed. It is used to map raw data to a high-dimensional space, enabling better linear segmentation in that space [72].

K(x_{i}, x_{j}) = \exp\left(-\frac{\|x_{i} - x_{j}\|^{2}}{2σ^{2}}\right), \quad σ > 0 \qquad (4)
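
Equation (4) translates directly into code. The sketch below is a plain-Python illustration for two feature vectors of equal length; the function name is an assumption, and no particular SVM library is implied.

```python
import math


def gaussian_kernel(xi, xj, sigma):
    """Gaussian kernel exp(-||xi - xj||^2 / (2 * sigma^2)) for sigma > 0."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

The kernel equals 1 for identical vectors and decays towards 0 as the vectors move apart, with σ controlling how quickly the similarity falls off.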

4. Results and Discussion

4.1. Data Preprocessing

Based on the gridded method of GIS spatial analysis, a 2 km × 2 km grid was used to divide the study area into 4848 prediction units, each containing the 43 features listed in Table 1. ML-based MPM is a classification process that calculates the probability of each prediction unit belonging to either the ‘mineral’ or ‘non-mineral’ category [7]. Selecting appropriate training samples is one of the keys to successful prediction. The positive samples are prediction units that contain ore points, and their number is relatively limited. Non-mineralization, by contrast, is the common case and distinguishes background areas from mineralized areas. Negative samples were randomly chosen from regions that lie away from mineralization zones and tectonic areas and that have low contents of the main ore-forming elements.
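
The assignment of ore points to prediction units can be sketched as below. The code assumes projected coordinates in kilometres and a grid origin at (0, 0); both are illustrative assumptions, since the study performs this step with GIS tools.

```python
import math


def cell_index(x_km, y_km, origin=(0.0, 0.0), size_km=2.0):
    """Return the (row, col) of the grid cell containing a point."""
    col = math.floor((x_km - origin[0]) / size_km)
    row = math.floor((y_km - origin[1]) / size_km)
    return row, col


def positive_cells(ore_points, origin=(0.0, 0.0), size_km=2.0):
    """Set of cells that contain at least one ore point (positive samples)."""
    return {cell_index(x, y, origin, size_km) for x, y in ore_points}


# Three hypothetical ore points; two of them fall in the same 2 km cell.
cells = positive_cells([(0.5, 0.5), (1.9, 1.9), (2.1, 0.1)])
```

Several ore points inside one cell yield a single positive sample, which is one reason the number of positive units can be smaller than the number of known occurrences.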

There are 117 positive samples in the study area (known mineralized areas). To maintain data balance, 123 negative samples (background areas) were selected based on the above principles. These positive and negative samples constitute the model’s training and test datasets, totaling 240 samples. Samples with close geographic locations tend to have high similarities. When similar samples are assigned to different sets during the split of the training and test sets, it can affect the evaluation of the model’s generalization ability. This study uses 92° E as a boundary to divide the study area into eastern and western regions. The western region contains 149 positive and negative samples, which are used as the training set, while the eastern region contains 91 positive and negative samples, which are used as the test set. The following figure shows the results of positive and negative sample selection (Figure 9).
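
The spatial split described above can be written as a simple filter on longitude. The sample format (a dictionary with a "lon" key) and the decision to place samples lying exactly on the boundary in the test set are assumptions made for this sketch.

```python
def split_by_longitude(samples, boundary_deg=92.0):
    """Split samples into a western training set and an eastern test set."""
    train = [s for s in samples if s["lon"] < boundary_deg]
    test = [s for s in samples if s["lon"] >= boundary_deg]
    return train, test


# Hypothetical samples with a longitude and a class label.
samples = [
    {"lon": 91.5, "label": 1},
    {"lon": 92.3, "label": 0},
    {"lon": 92.0, "label": 1},
]
train_set, test_set = split_by_longitude(samples)
```

Splitting along a geographic boundary, rather than at random, keeps neighbouring and therefore similar samples on the same side of the split.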

4.2. Construction of Predictive Variables

4.2.1. Determination of Predictive Variables by RF

The study employs the RF algorithm to assess the significance of 43 features for MPM. Grid search and cross-validation techniques are used to determine the optimal model parameters. Grid search iterates through a predefined set of hyperparameter combinations, training the model, evaluating its performance for each combination, and selecting the combination with the best performance as the final model’s hyperparameters. The optimization of the model involves grid search and 5-fold cross-validation, leading to the configuration of 100 decision trees. The default values are retained for the maximum number of features and the maximum depth of each tree.
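
The combination of grid search and k-fold cross-validation can be sketched without reference to a particular library. In the code below, `fit_and_score` is a hypothetical callback that would train a model on the training indices and return its score on the validation indices; the folds are built without shuffling or stratification, which a real experiment would normally add.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def grid_search(param_grid, n_samples, fit_and_score, k=5):
    """Return the parameter combination with the best mean validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            fit_and_score(params, train, val)
            for train, val in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

With a grid such as `{"n_trees": [50, 100, 200]}` (a hypothetical parameter name), the search trains and validates the model five times per candidate value and keeps the value with the highest mean score.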

Figure 10 illustrates the importance of each feature variable. The most important feature is Cu, with an importance score of 0.11, which is consistent with copper being the main mineral commodity in the study area. This study retains the features that accumulate to the top 54% of importance, i.e., those with an importance greater than 0.008. As a result, 22 indicators, including Cu, fracture distance, and stratigraphic combination entropy, are the data features selected by RF (Table 2). These features include geological, geochemical, and geophysical information, and most of the main metallogenic elements in the study area are included. These 22 features carry the most important information for determining whether a deposit is present and can represent most of the original information in the data.
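
One possible reading of the selection rule above is a cumulative-importance cut-off: features are added in descending order of importance until a chosen share of the total importance is reached. The sketch below implements that reading; it is an interpretation, not the authors' documented procedure, and the scores in the example are hypothetical.

```python
def select_by_cumulative_importance(importances, fraction):
    """Keep the most important features until `fraction` of total importance is covered."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, score in ranked:
        if running >= fraction * total:
            break
        selected.append(name)
        running += score
    return selected


# Hypothetical scores for three features.
kept = select_by_cumulative_importance({"a": 0.5, "b": 0.3, "c": 0.2}, 0.54)
```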

4.2.2. Determination of Predictive Variables by GBDT

First, the parameters for the GBDT model were configured, and the model was trained. Since the internal structure of the GBDT algorithm uses decision trees, the parameter settings for decision tree models are applicable to GBDT as well. The main parameters are generally the same as those for the RF model, so they are not repeated here. The specific parameter settings were as follows: the maximum number of features was set to “sqrt”, the depth was limited to 3, and the minimum number of samples was set to 4. Other parameters, such as the maximum number of leaf nodes, were set to their default values.

Building on this, new composite features were constructed using the GBDT model. Based on the previously established dataset (4848 samples with 43 feature variables), all sample data were input into the GBDT model built from the training set for classification prediction. In the GBDT model, each leaf node of every decision tree represented a new feature, and all leaf nodes together form a new feature vector. The classification path of each sample point (the output of the leaf nodes) was recorded and one-hot encoded, converting the variables into binary feature vectors to serve as new features for the sample points. This completed the feature optimization and extraction process of the GBDT algorithm (Figure 11). However, after the output of the GBDT leaf nodes and one-hot encoding, the generated feature vectors were often high-dimensional and sparse. High-dimensional sparse data could increase computational complexity and memory consumption, and the features optimized by GBDT were often the result of complex, non-linear combinations, which significantly reduced their interpretability.

4.3. Mineral Prospectivity Mapping Based on SVM

For the baseline SVM model, the original 43 variables are used as inputs. Particle swarm optimization (PSO) is used to tune the parameters C and g. The optimized parameters of the SVM model are C = 93.6 and g = 0.007 (Table 3). The model achieves an accuracy of 81.3% on the test set. On this basis, metallogenic prediction is carried out for the study area.
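To make the PSO tuning step concrete, the following is a minimal sketch of a particle swarm searching over two parameters. Several things here are assumptions rather than details from the text: the search runs in log10 space, the inertia and acceleration coefficients are common textbook values, and `toy_error` is a stand-in objective with a known minimum. In practice the objective would be something like a cross-validated SVM error, which the text does not specify.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=100, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Basic particle swarm minimizer over a box given by `bounds`."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]             # personal best positions
    best_val = [objective(p) for p in pos]     # personal best values
    lead = min(range(n_particles), key=lambda i: best_val[i])
    swarm_pos, swarm_val = best_pos[lead][:], best_val[lead]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (swarm_pos[d] - pos[i][d]))
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < swarm_val:
                    swarm_pos, swarm_val = pos[i][:], val
    return swarm_pos, swarm_val


def toy_error(log_params):
    """Hypothetical error surface with its minimum at log10(C) = 2, log10(g) = -2."""
    log_c, log_g = log_params
    return (log_c - 2.0) ** 2 + (log_g + 2.0) ** 2


best_log, best_err = pso_minimize(toy_error, bounds=[(-2.0, 3.0), (-5.0, 1.0)])
best_C, best_g = 10 ** best_log[0], 10 ** best_log[1]
print(best_C, best_g, best_err)
```

Searching in log space is a design choice made for this sketch: C and g typically vary over several orders of magnitude, so a uniform search on the raw scale would spend most of its particles on large values.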

In RF, variables are ranked by Gini importance. The 22 most important feature variables are selected as the input vectors of the SVM model. The parameters are again optimized with the PSO algorithm, giving C = 69.43 and g = 0.02 (Table 3). The accuracy on the test set is 86.5%. Compared with the SVM model without RF feature selection, the RF_SVM model is more accurate. Using the RF_SVM model constructed from the training set, a prediction is made for the whole study area, and the results are visualized.

In the GBDT model, the feature vector is composed of the leaf nodes of the trees. Each element of the feature vector corresponds to a leaf node and takes the value 0 or 1. The extracted feature vector serves as the input for the SVM model. The PSO-optimized parameters of the GBDT_SVM model are C = 46.3 and g = 0.0001 (Table 3), and the accuracy on the test set is 91.7%. Compared with the SVM model without GBDT feature extraction, the GBDT_SVM model is substantially more accurate.

The receiver operating characteristic (ROC) curve is a tool for evaluating the performance of classification models, especially in binary classification problems. The ROC curve illustrates a model's classification ability by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold settings. The higher the area under the ROC curve (AUC), the better the classification performance of the model (Figure 12). The AUC values for GBDT_SVM, RF_SVM, and SVM are 0.96, 0.94, and 0.84, respectively. This indicates that the GBDT_SVM model has the best prediction performance for this dataset. The AUC values of the two feature-optimized models both exceed 0.9 and are higher than that of the plain SVM model. This shows that feature optimization can improve the prediction performance of the model and is an essential stage of mineral prospectivity mapping (MPM).
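As a worked illustration of how an ROC curve and its AUC are obtained from predicted scores, the sketch below sweeps the decision threshold from high to low and integrates the resulting curve with the trapezoidal rule. The labels and scores are hypothetical toy values, not data from this study, and the function names are illustrative.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points traced as the threshold is lowered."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = deposit, 0 = non-deposit) and model scores.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(toy_labels, toy_scores))
print(round(toy_auc, 4))
```

For this toy input, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9, which matches the interpretation of AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.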

Figure 13 shows the mineral probability maps obtained from GBDT_SVM, RF_SVM, and SVM, each produced by inverse distance weighted interpolation. Compared with the SVM model, the high-probability regions of GBDT_SVM and RF_SVM contain most of the known metal deposits. This suggests a significant spatial correlation between the identified high-value anomalies and the known mineral deposits. When the optimized features are input into the SVM model, the results improve in every case, with higher accuracy. This confirms the key role of feature optimization in model construction. Although the SVM algorithm handles nonlinearity well and generalizes strongly, its input features usually depend on expert experience, which introduces subjectivity. Feature optimization can effectively reduce this subjectivity and improve the adaptability of the model to complex geological data, so that classification and prediction performance improves significantly.
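The interpolation used to turn point predictions into the maps of Figure 13 is read here as standard inverse distance weighting (IDW). That reading is an assumption, as are the power of 2 and the toy sample points below. A minimal sketch of the idea:

```python
def idw(x, y, samples, power=2.0):
    """Inverse distance weighted estimate at (x, y).
    `samples` is a list of (sx, sy, value) tuples."""
    num = den = 0.0
    for sx, sy, value in samples:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        if d2 == 0.0:
            return value  # exactly on a sample point: return its value
        w = 1.0 / d2 ** (power / 2.0)  # weight = 1 / distance**power
        num += w * value
        den += w
    return num / den


# Two hypothetical sample points with predicted probabilities 0.2 and 0.8.
toy_samples = [(0.0, 0.0, 0.2), (2.0, 0.0, 0.8)]
midpoint = idw(1.0, 0.0, toy_samples)
near_first = idw(0.5, 0.0, toy_samples)
print(midpoint, near_first)
```

At the midpoint both samples receive equal weight, while a location closer to the first sample is pulled toward its value. A larger `power` makes the interpolated surface follow the nearest samples more closely.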

From an algorithmic point of view, the GBDT algorithm records the leaf node reached in each decision tree as a new feature, so the new feature vector is built from all of the original data. In contrast, the RF_SVM model filters the original features using the RF algorithm, which may lose some information. Overall, the RF_SVM and GBDT_SVM models both classify better than SVM, and GBDT_SVM performs best.

Geologically, the high-probability areas of GBDT_SVM and RF_SVM are concentrated near intermediate-acid rock bodies and faults. These intrusions significantly influence copper polymetallic mineralization by providing material sources for the formation of copper–polymetallic deposits. Specifically, porphyry copper deposits are predominantly found in biotite granite porphyry, granodiorite, and quartz monzonite porphyry, while skarn copper deposits form in the outer contact zones of acidic intrusions. In addition, the predicted high-probability areas contain many known typical deposits, such as the Mengya’a and Longmala lead–zinc deposits in the north, the Luobadui and Xingagou skarn deposits in the west, and the Qulong and Jiama porphyry and porphyry–skarn copper–polymetallic deposits in the south. Compared with RF_SVM, the high-probability area of GBDT_SVM delineates the mineralized range more clearly and gives more accurate predictions. On the whole, the results obtained from data-driven feature optimization are credible, and the predictions have reference value for exploration.

5. Conclusions

In the field of machine learning, features are key factors in building models, and effective feature processing methods can significantly improve model accuracy. This study employs RF and GBDT methods to optimize and select feature variables, using a Support Vector Machine (SVM) model for metallogenic prediction. Based on the research results, the following conclusions are drawn:

A total of 43 feature variables, including the concentrations of 39 elements, fault distance, stratigraphic combination entropy, ore-forming element combinations, and aeromagnetic data, were used as inputs for the RF and GBDT models. The RF model selected the top 22 features based on individual variable importance scores to form a new feature set, while the GBDT model used the classification paths of the sample points as new feature combinations, which were one-hot encoded and represented as binary vectors.
The original 43 feature variables and the features optimized by RF and GBDT were each input into the SVM model. Data-driven feature optimization significantly improved the model’s prediction performance because it effectively reduced manual intervention during feature construction, allowing the model to focus more on the intrinsic characteristics of the data.
A comparison of the RF_SVM and GBDT_SVM models showed that the GBDT_SVM model delineated anomalies more precisely, and the ROC curves further confirmed that it achieved higher classification accuracy.
In some cases, feature optimization can lead to interpretability challenges. For instance, the features optimized by GBDT can be difficult to understand intuitively, which may reduce the interpretability of the model and increase its uncertainty in practical applications.

Author Contributions

Methodology, H.Z. and M.X.; software, S.D., M.L., Y.L., D.Y. and Y.W.; writing—original draft preparation, H.Z., M.X., S.D. and M.L.; writing—review and editing, H.Z. and M.X.; visualization, S.D., M.L., Y.L., D.Y. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Construction and Application of the ASEAN-China Geoscience Information Big Data Platform.

Data Availability Statement

The authors do not have permission to share the data due to internal policy.

Conflicts of Interest

The authors declare no conflict of interest.

References

Zuo, R. Geodata Science-Based Mineral Prospectivity Mapping: A Review. Nat. Resour. Res. 2020, 29, 3415–3424. (In Chinese with English Abstract).
Harris, J.R.; Grunsky, E.; Behnia, P.; Corrigan, D. Data- and knowledge-driven mineral prospectivity maps for Canada’s North. Ore Geol. Rev. 2015, 71, 788–803.
Abedi, M.; Kashani, S.B.M.; Norouzi, G.H.; Yousefi, M. A deposit scale mineral prospectivity analysis: A comparison of various knowledge-driven approaches for porphyry copper targeting in Seridune, Iran. J. Afr. Earth Sci. 2017, 128, 127–146.
Zuo, R. Data science-based theory and method of quantitative prediction of mineral resources. Earth Sci. Front. 2021, 28, 49–55. (In Chinese with English Abstract).
Carranza, E.J.M.; Laborte, A.G. Data-driven predictive mapping of gold prospectivity, Baguio district, Philippines: Application of Random Forests algorithm. Ore Geol. Rev. 2015, 71, 777–787.
Carranza, E.J.M.; Laborte, A.G. Data-driven predictive modeling of mineral prospectivity using random forests: A case study in Catanduanes Island (Philippines). Nat. Resour. Res. 2016, 25, 35–50.
Zuo, R.; Carranza, E.J.M. Support vector machine: A tool for mapping mineral prospectivity. Comput. Geosci. 2011, 37, 1967–1975.
Chen, Y.; Wu, W. Mapping mineral prospectivity using an extreme learning machine regression. Ore Geol. Rev. 2017, 80, 200–213.
Zhang, Z.; Wang, G.; Liu, C.; Cheng, L.; Sha, D. Bagging-based positive-unlabeled learning algorithm with Bayesian hyperparameter optimization for three-dimensional mineral potential mapping. Comput. Geosci. 2021, 154, 104817.
Martins, T.F.; Seoane, J.C.S.; Tavares, F.M. Cu–Au exploration target generation in the eastern Carajás Mineral Province using random forest and multi-class index overlay mapping. J. South Am. Earth Sci. 2022, 116, 103790.
Porwal, A.; Carranza, E.J.M.; Hale, M. Bayesian network classifiers for mineral potential mapping. Comput. Geosci. 2006, 32, 1–16.
Wang, C.; Pan, Y.; Chen, J.; Ouyang, Y.; Rao, J.; Jiang, Q. Indicator element selection and geochemical anomaly mapping using recursive feature elimination and random forest methods in the Jingdezhen region of Jiangxi Province, South China. Appl. Geochem. 2020, 122, 104760.
Xiang, J.; Xiao, K.; Carranza, E.J.M.; Chen, J.; Li, S. 3D mineral prospectivity mapping with random forests: A case study of Tongling, Anhui, China. Nat. Resour. Res. 2020, 29, 395–414.
Brandmeier, M.; Zamora, I.G.C.; Nykänen, V.; Middleton, M. Boosting for Mineral Prospectivity Modeling: A New GIS Toolbox. Nat. Resour. Res. 2022, 29, 71–88.
Zhao, J.; Chi, H.; Shao, Y.; Peng, X. Application of AdaBoost Algorithms in Fe Mineral Prospectivity Prediction: A Case Study in Hongyuntan–Chilongfeng Mineral District, Xinjiang Province, China. Nat. Resour. Res. 2022, 31, 2001–2022.
Zhang, S.; Xiao, K. Random forest-based mineralization prediction of the Lala-type Cu deposit in the Huili area, Sichuan Province. Geol. Explor. 2020, 56, 239–252. (In Chinese with English Abstract).
Zhang, Q.; Chen, J.; Xu, H.; Jia, Y.; Chen, X.; Jia, Z.; Liu, H. Three Dimensional Mineral Prospectivity Mapping by XGBoost Modeling: A Case Study of the Lannigou Gold Deposit, China. Nat. Resour. Res. 2022, 31, 1135–1156.
Charbuty, B.; Abdulazeez, A.M. Classification based on decision tree algorithm for machine learning. J. Appl. Sci. Technol. Trends 2021, 2, 20–28.
Wang, J.; Zuo, R.; Xiong, Y. Mapping mineral prospectivity via semi-supervised random forest. Nat. Resour. Res. 2020, 29, 189–202.
Wu, C.; Ma, S. A selective review of robust variable selection with applications in bioinformatics. Brief. Bioinf. 2015, 16, 873–883.
Archibald, R.; Fann, G. Feature selection and classification of hyperspectral images with support vector machines. IEEE Geosci. Remote Sens. Lett. 2007, 4, 674–677.
Yu, L.; Liu, H. Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 2004, 5, 1205–1224.
McKinley, J.M.; Grunsky, E.; Mueller, U. Environmental monitoring and peat assessment using multivariate analysis of regional-scale geochemical data. Math. Geosci. 2018, 50, 235–246.
Chen, S.; Hattori, K.; Grunsky, E.C. Identification of sandstones above blind uranium deposits using multivariate statistical assessment of compositional data, Athabasca Basin, Canada. J. Geochem. Explor. 2018, 188, 229–239.
Gonbadi, A.M.; Tabatabaei, S.H.; Carranza, E.J.M. Supervised geochemical anomaly detection by pattern recognition. J. Geochem. Explor. 2015, 157, 81–91.
Wang, L. Geochemical Features and Metallogenic Prognosis of Gold Ore Deposit in Chifeng-Weichang Area. Master’s Thesis, China University of Geosciences, Beijing, China, 2010. (In Chinese with English Abstract).
Janecek, A.; Gansterer, W.; Demel, M.; Ecker, G. On the relationship between feature selection and classification accuracy. In Proceedings of the 2008 International Conference on New Challenges for Feature Selection in Data Mining and Knowledge Discovery, Antwerp, Belgium, 15 September 2008; Volume 4, pp. 90–105.
Zhao, Z.H. Study on Mineral Resources Prediction Model Based on Ensemble Learning. Master’s Thesis, Jilin University, Changchun, China, 2018. (In Chinese with English Abstract).
Khalid, S.; Khalil, T.; Nasreen, S. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the 2014 Science and Information Conference, London, UK, 27–29 August 2014.
Zekri, H.; Mokhtari, A.; Cohen, D.R. Application of singular value decomposition (SVD) and semi-discrete decomposition (SDD) techniques in clustering of geochemical data, an environmental study in central Iran. Stoch. Environ. Res. Risk Assess. 2016, 30, 1947–1960.
Zekri, H.; Cohen, D.R.; Mokhtari, A.R.; Esmaeili, A. Geochemical Prospectivity Mapping through a Feature Extraction–Selection Classification Scheme. Nat. Resour. Res. 2019, 28, 867–868.
Wang, C.B.; Chen, J.G.; Ouyang, Y.P. Determination of Predictive Variables in Mineral Prospectivity Mapping Using Supervised and Unsupervised Methods. Nat. Resour. Res. 2022, 31, 2081–2102.
Hou, Z.; Cook, N.J.; Zaw, K. Metallogenesis of the Tibetan collisional orogeny: A review and introduction to the special issue. Ore Geol. Rev. 2009, 36, 2–24.
Hou, Z.; Duan, L.; Lu, Y.; Zheng, Y.; Zhu, D.; Yang, Z.; Yang, Z.; Wang, B.; Pei, Y.; Zhao, Z.; et al. Lithospheric architecture of the Lhasa terrane and its control on ore deposits in the Himalayan-Tibetan orogen. Econ. Geol. 2015, 110, 1541–1575.
Hou, Z.; Yang, Z.; Xu, W. Metallogenesis in Tibetan collisional orogenic belt: I. Mineralization in main collisional orogenic setting. Miner. Depos. 2006, 25, 337–358.
Zheng, W.; Liu, B.; McKinley, M.J.; Cooper, M.R.; Wang, L. Geology and geochemistry-based metallogenic exploration model for the eastern Tethys Himalayan metallogenic belt, Tibet. J. Geochem. Explor. 2021, 224, 106743.
Wang, L.; Tang, J.; Zheng, W.; Chen, W.; Lin, X.; Kang, H.; Luo, M. Study on metallogeny of main molybdenum polymetallic deposits in the eastern section of the Gangdese metallogenic belt. Geol. Rev. 2014, 60, 363–379. (In Chinese with English Abstract).
Tang, J.; Lang, X.; Xie, F.; Gao, Y.; Li, Z.; Huang, Y.; Ding, F.; Yang, H.; Zhang, L.; Wang, Q.; et al. Geological characteristics and genesis of the Jurassic No. I porphyry Cu–Au deposit in the Xiongcun district, Gangdese porphyry copper belt, Tibet. Ore Geol. Rev. 2015, 70, 438–456.
Pan, G.; Ding, J.; Yao, D.; Wang, L. Geological Map of the Tibetan Plateau and Adjacent Areas, 1:1500000; Chengdu Map Publishing House: Chengdu, China, 2004.
Pan, G.; Wang, L.; Li, R.; Yuan, S.; Ji, W.; Yin, F.; Zhang, W.; Wang, B. Tectonic evolution of the Qinghai-Tibet plateau. J. Asian Earth Sci. 2012, 53, 3–14.
Wang, L.; Liu, B.; McKinley, J.M.; Cooper, M.R.; Li, C.; Kong, Y.; Shan, M. Compositional data analysis of regional geochemical data in the Lhasa area of Tibet, China. Appl. Geochem. 2021, 135, 105108.
Tang, J.; Duo, J.; Liu, H.; Lang, X.; Zhang, J.; Zheng, W.; Ying, L. Minerogenetic series of ore deposits in the east part of the Gangdise metallogenic belt. Acta Geosci. Sin. 2012, 33, 393–410.
Tang, J.; Wang, L.; Zheng, W.; Zhong, K. Ore deposits metallogenic regularity and prospecting in the eastern section of the Gangdese metallogenic belt. Acta Geol. Sin. 2014, 88, 2545–2555.
Xie, F.; Lang, X.; Tang, J.; He, Q.; Deng, Y.; Wang, X.; Wang, Y.; Jia, M. Metallogenic regularity of Gangdese Metallogenic Belt, Tibet. Miner. Depos. 2022, 41, 952–974.
Ji, W.; Wu, F.; Liu, C.; Chung, S. Early Eocene crustal thickening in southern Tibet: New age and geochemical constraints from the Gangdese batholith. J. Asian Earth Sci. 2012, 53, 82–95.
Mo, X.; Dong, G.; Zhao, Z.; Zhou, S.; Wang, L.; Qiu, R.; Zhang, F. Spatial and temporal distribution and characteristics of granitoids in the Gangdese, Tibet and implication for crustal growth and evolution. Geol. J. China Univ. 2005, 11, 281–290. (In Chinese with English Abstract).
Schärer, U.; Xu, R.; Allègre, C. U–Pb geochronology of Gandese (Transhimalaya) plutonism in the Lhasa–Xigaze region Tibet. Earth Planet. Sci. Lett. 1984, 69, 311–320.
Kapp, P.; DeCelles, P.; Leier, A.; Fabijanic, J.; He, S.; Pullen, A.; Gehrels, G.; Ding, L. The Gangdese retroarc thrust belt revealed. GSA Today 2007, 17, 4–9.
Groves, D.; Santosh, M.; Zhang, L.; Deng, J.; Yang, L.; Wang, Q. Subduction: The recycling engine room for global metallogeny. Ore Geol. Rev. 2021, 134, 104130.
Hou, Z.; Gao, Y.; Qu, X.; Rui, Z.; Mo, X. Origin of adakitic intrusives generated during mid-Miocene east–west extension in southern Tibet. Earth Planet. Sci. Lett. 2004, 220, 139–155.
Hou, Z.; Zhao, Z.; Gao, Y.; Yang, Z.; Jiang, W. Tearing and subduction of the Indian continental slab, evidence from Cenozoic Gangdese igneous rocks in Tibet. Acta Petrol. Sin. 2006, 22, 761–774. (In Chinese with English Abstract).
Richards, J. Tectonic, magmatic, and metallogenic evolution of the Tethyan orogen: From subduction to collision. Ore Geol. Rev. 2015, 70, 323–345.
Xu, Z.; Wang, Q.; Li, Z.; Li, H.; Cai, Z.; Liang, F.; Dong, H.; Cao, H.; Chen, X.; Huang, X.; et al. Indo-Asian collision: Tectonic transition from compression to strike slip. Acta Geol. Sin. 2016, 90, 1–23. (In Chinese with English Abstract).
Dong, S.; Huang, H.; Liu, B.; Zhang, L.; Zhang, H. Geological characteristics and exploration direction of the Nongruri gold deposit in Tibet. Geol. Explor. 2010, 46, 207–213. (In Chinese with English Abstract).
Li, L.; Xie, C.; Ren, Y.; Yu, Y.; Dong, Y.; Gao, Z.; Hao, Y. Discovery of Late Triassic mineralization in the Gangdese Metallogenic Belt, Tibet: The Banduo Pb–Zn deposit, Somdo area. Ore Geol. Rev. 2020, 126, 103754.
Tang, J.; Wang, L.; Ci, Q.; Zhong, K.; Zhang, H.; Du, X.; Zeren, Z.; Yan, J.; Ma, G.; Song, Y. Minerogenetic Series of Ore Deposits in the East Part of the Gangdese Metallogenic Belt; Geological Publishing House: Beijing, China, 2020; pp. 1–226.
Wang, L. Analysis of Multi-Scale Geochemical Data and Optimization of the Mineral Prospectivity. Ph.D. Thesis, Chengdu University of Technology, Chengdu, China, 2022. (In Chinese with English Abstract).
Yin, A.; Harrison, T.M. Geologic evolution of the Himalayan-Tibetan orogen. Annu. Rev. Earth Planet. Sci. 2000, 28, 211–280.
Xie, X.; Mu, X.; Ren, T. Geochemical mapping in China. J. Geochem. Explor. 1997, 60, 99–113.
Xie, X.; Wang, X.; Zhang, Q.; Zhou, G.; Cheng, H.; Liu, D.; Cheng, Z.; Xu, S. Multi-scale geochemical mapping in China. Geochem. Explor. Environ. Anal. 2008, 8, 333–341.
Wang, X.; Zhang, Q.; Zhou, G. National-scale geochemical mapping projects in China. Geostand. Geoanal. Res. 2007, 31, 311–320.
Chen, Y.; Chen, J.; Wang, X. Quantitatively Integrated Techniques for Assessment of Mineral Resources Based on GIS; Geological Publishing House: Beijing, China, 2008.
Dong, Q. Quantitative Evaluation and Prediction of Regional Metallogeny in Northern Segment of Three River Region, Southwest China. Ph.D. Thesis, China University of Geosciences, Beijing, China, 2009. (In Chinese with English Abstract).
Liu, Y.; Cao, L.; Li, Z. Element Geochemistry; Science Press: Beijing, China, 1984. (In Chinese).
Liu, B.; Zheng, W.; Wang, L.; Li, C.; Kong, Y.; Tang, R.; Luo, D.; Xie, M. Mineral exploration model for Lhasa Area, eastern Gangdese metallogenic belt: Based on knowledge-driven compositional data analysis and catchment basin division. J. Geochem. Explor. 2024, 259, 107415.
Xie, M.; Liu, B.; Wang, L.; Li, C.; Kong, Y.; Tang, R. Auto encoder generative adversarial networks-based mineral prospectivity mapping in Lhasa area, Tibet. J. Geochem. Explor. 2023, 255, 107326.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Fawagreh, K.; Gaber, M.M.; Elyan, E. Random forests: From early developments to recent advancements. Syst. Sci. Control 2014, 2, 602–609.
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112, p. 18.
Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999.
Maepa, F.; Smith, R.S.; Tessema, A. Support vector machine and artificial neural network modelling of orogenic gold prospectivity mapping in the Swayze greenstone belt, Ontario, Canada. Ore Geol. Rev. 2021, 130, 103968.

Figure 2. The map of stratigraphic combination entropy.

Figure 3. The map of the fracture distance.

Figure 4. The geochemical map of Cu.

Figure 5. The biplots of 39 elements.

Figure 6. The association anomaly of metallogenic elements.

Figure 7. The map of the vertical first derivative of the regional aeromagnetic ΔT polarization (modified from [56]).

Figure 8. The map of the GBDT_SVM fusion model.

Figure 9. The distribution map of the positive and negative samples.

Figure 10. The map of feature importance based on RF.

Figure 11. The map of features optimized by GBDT.

Figure 12. The ROC curves for SVM, RF_SVM, GBDT_SVM.

Figure 13. Mineral potential maps of the Lhasa area generated using (a) the SVM model, (b) the RF_SVM model, and (c) the GBDT_SVM model.

Table 1. The feature variables.

The Type of Feature	Features
Geochemistry	39 Elements
Geochemistry	Geochemical Association Anomalies
Geology	Fracture Distance
Geology	Stratigraphic Combination Entropy
Geophysical	Regional Aeromagnetic ΔT Polarization

Table 2. The feature variables optimized by RF.

The Type of Feature	Feature
Geochemistry	Cu, SiO₂, Bi, Au, Cd, Sn, MgO, Mo, Ba, Be, Mn, Li, F, Zn, U, Ni, La
Geochemistry	Geochemical Association Anomalies
Geology	Fracture Distance
Geology	Stratigraphic Combination Entropy
Geophysical	Regional Aeromagnetic ΔT Polarization

Table 3. The parameter values of each model based on PSO.

Name	Parameter Value
SVM	C = 93.6, g = 0.007
RF_SVM	C = 69.43, g = 0.02
GBDT_SVM	C = 46.3, g = 0.0001

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
