Application of Clustering Methods in Multivariate Data-Based Prospecting Prediction

Chang, Xiaopeng; Zhang, Minghua; Chen, Liang; Zhang, Sheng; Ren, Wei; Zhang, Xiang

doi:10.3390/min15070760


by Xiaopeng Chang ¹,²,³, Minghua Zhang ³,⁴,*, Liang Chen ¹,²,*, Sheng Zhang ⁴, Wei Ren ⁴ and Xiang Zhang ¹,²

¹ Geophysical Survey Center, China Geological Survey, Langfang 065000, China

² Technology Innovation Center for Earth Near Surface Detection, China Geological Survey, Langfang 065000, China

³ School of Geophysics and Information Technology, China University of Geosciences, Beijing 100830, China

⁴ Natural Resources Survey, China Geological Survey, Beijing 100830, China

* Authors to whom correspondence should be addressed.

Minerals 2025, 15(7), 760; https://doi.org/10.3390/min15070760 (registering DOI)

Submission received: 8 June 2025 / Revised: 13 July 2025 / Accepted: 18 July 2025 / Published: 20 July 2025

(This article belongs to the Special Issue Application of Big Data Mining, Machine Learning and Artificial Intelligence in Geoscience, 2nd Edition)


Abstract

Mining and analyzing information from multiple sources—such as geophysics and geochemistry—is a key aspect of big data-driven mineral prediction. Clustering, which groups large datasets based on distance metrics, is an essential method in multidimensional data analysis. The Two-Step Clustering (TSC) approach offers advantages by handling both categorical and continuous variables and automatically determining the optimal number of clusters. In this study, we applied the TSC method to mineral prediction in the northeastern margin of the Jiaolai Basin by: (i) converting residual gravity and magnetic anomalies into categorical variables using Ward clustering; and (ii) transforming 13 stream sediment elements into independent continuous variables through factor analysis. The results showed that clustering is sensitive to categorical variables and performs better with fewer categories. When variables share similar distribution characteristics, consistency between geophysical discretization and geochemical boundaries also influences clustering results. In this study, the (3 × 4) and (4 × 4) combinations yielded optimal clustering results. Cluster 3 was identified as a favorable zone for gold deposits due to its moderate gravity, low magnetism, and the enrichment in F1 (Ni–Cu–Zn), F2 (W–Mo–Bi), and F3 (As–Sb), indicating a multi-stage, shallow, hydrothermal mineralization process. This study demonstrates the effectiveness of combining Ward clustering for variable transformation with TSC for the integrated analysis of categorical and numerical data, confirming its value in multi-source data research and its potential for further application.

Keywords:

two-step clustering; Ward clustering; Jiaolai Basin; mineral prospecting

1. Introduction

The Jiaodong region hosts the largest concentration of gold (Au) deposits in China [1,2,3]. Since the initial exploration breakthrough (2011–2020), Pengjiakuang- and Fayunkuang-style gold deposits have been discovered along the northeastern (NE) and northwestern (NW) margins of the Jiaolai Basin [4,5]. The NE margin is characterized by intense tectonic activity and widespread magmatic intrusions, resulting in complex mineralization and ore-controlling systems. This has led to a series of gold occurrences, including Muping-Liaoshang, Tudui-Shawang, and Rushan-Pengjiakuang [6,7,8,9,10].

Geochemistry, the study of the distribution and association of metallogenic elements, has contributed significantly to mineral exploration [11,12,13,14]. Geochemical anomalies have been used to define hydrothermal alteration halos (e.g., As, Sb, Hg, Te, B) since the 1980s [15,16,17]. Alteration minerals, such as sericite and carbonate, along with trace element ratios (e.g., Au/As), have also been used to vector toward mineralization [16,17,18]. In particular, stream sediment geochemical surveys have played a key role in prospecting for gold and other polymetallic minerals [19,20,21]. Geophysical data also provide insights into metallogenic processes and the regional tectonic evolution underlying deep mineral resources [22,23]. Gravity surveys have long been used to detect density contrasts associated with mineralization-related structures (e.g., shear zones, faults, hydrothermal alteration zones) and lithological controls (e.g., dense mafic/ultramafic rocks) [24,25]. Electromagnetic (EM) methods detect conductive zones caused by sulfide-rich alteration (e.g., pyrite, arsenopyrite) or graphite-bearing rocks associated with hydrothermal systems, helping to locate conductive shear zones, fluid pathways [26], and near-surface sulfide mineralization [27]. In the Abitibi greenstone belt (Canada), EM anomalies have been correlated with gold deposits. Thus, comprehensive analysis of geophysical and geochemical anomalies is essential for mineral prediction. The integration of multi-source anomaly data supports the development of more accurate models, enhancing target identification and improving the efficiency and success rate of mineral exploration.

The rapid development of big data and machine learning (ML) offers new tools for analyzing geochemical data. Clustering, an unsupervised method, enables the classification of large datasets based on intrinsic relationships. It also mitigates issues common in supervised learning, such as limited positive samples and mislabeling, while enhancing interpretability. The Two-Step Clustering (TSC) algorithm used in this study is a procedure in the SPSS 26 software [28] that can handle both categorical and continuous variables simultaneously [29].

The 1:200,000 Regional Geochemistry-National Reconnaissance (RGNR) program collected stream sediment data from the northeastern margin of the Jiaolai Basin. In this study, we utilized gravity and magnetic data derived from this extensive dataset. Based on geochemical surveys and previous research [19,20], key mineralizing and associated elements were selected to create continuous variables, while categorical variables were constructed by classifying anomalous value ranges. The TSC method was applied to both continuous and categorical variables, incorporating geological context. We assessed the validity and accuracy of the clustering results and explored the influencing factors, yielding a series of exploratory results for the proposed method. Based on mineral distribution and clustering outcomes within the study area, prospective zones for mineral deposits were delineated.

2. Study Area Overview

The Jiaodong region hosts the largest gold (Au) reserves in China, bounded by the Jiaobei uplift to the north and the Jiaolai Basin to the south (Figure 1). Since the Mesozoic era, the region has developed a complex tectonic framework due to the subduction of the Pacific Plate beneath the Eurasian Plate and the collision of the North and South China plates [1,2,22]. Centrally located on the Jiaodong Peninsula, the Jiaolai Basin evolved into a terrestrial sedimentary basin during the Cretaceous period [7,10]. Its northeastern margin is primarily characterized by interlayer-sliding fault breccia-type gold deposits (one of the three main deposit types in the region, alongside altered rock and quartz vein types [22,30,31]). These deposits are controlled by both interlayer-sliding tectonics along the basin edge and the NE-trending Muping-Jimo fault zone. In the extensional tectonic setting, interlayer-sliding fractures developed in the soft layers between the Laiyang (cover) and Jingshan (basement) groups. These fault zones are mainly composed of fractured rocks and breccias, distributed along the basin margin and strongly influenced by basin morphology.

Since the Mesozoic era, the Tanlu fault zone in western Jiaodong has undergone intensified activity, producing a series of NNE–NE-trending fractures of various scales that shaped the region’s current tectonic framework. As a result, NNE–NE-trending faults are larger and more continuous, while E–W, S–N, and NW-trending fractures are smaller, more discontinuous, and often appear only in isolated fragments. This long history of tectonic evolution has created a complex fracture network. Major fracture zones in the region include the Muping-Jimo and Jinniushan faults [31]. The Muping-Jimo fault zone comprises the Taocun, Guocheng, Yazi-Zhuwu, and Yuli-Haiyang faults, all trending NNE–NE in the study area [32], with numerous smaller E–W and S–N faults also present. Gold mineralization in Jiaodong is predominantly fault-controlled. Key deposits, such as Songjiagou, Guocheng, Liaoshang, and Xilaokou, are distributed along these major fault zones [30] (Figure 1). Table 1 summarizes the features of the major gold deposits on the northeastern margin of the Jiaolai Basin. Magmatic activity is also well developed in the area, notably the Paleoproterozoic Muniushan intrusive body and Mesozoic vein rocks [33]. The Muniushan body, composed mainly of diorite granite, serves as the principal ore-hosting rock [1]. The region’s vein rocks (largely NE-trending amphibole porphyrites) are controlled by the NE-trending faults and their subsidiary fractures [4].

In recent years, several medium- to large-scale gold deposits—such as Pengjiakuang, Tudui-Shawang, Xijingkou, Liaoshang, Xilaokou, and Qianchuiliu—have been discovered, with total gold resources exceeding 230 tons, including proven, probable, and inferred categories [34]. Based on their distinct characteristics (Table 1), gold deposits in the study area are categorized into four types: Pengjiakuang, Tudui, Liaoshang, and Songjiagou. The Pengjiakuang-type is a mesothermal hydrothermal metasomatic deposit controlled by superimposed ductile–brittle structures at the basin margin; the Liaoshang-type is a pyrite–carbonate vein deposit controlled by NE-trending faults; the Tudui-type is characterized by stockwork vein systems and is classified as a meso- to low-temperature hydrothermal deposit; and the Songjiagou-type is a mesothermal hydrothermal metasomatic “conglomerate-type” deposit associated with interlayer detachment structures [3,7,34]. Geological, geophysical, and geochemical studies suggest that the Jiaolai Basin and its surrounding areas share a similar geological background with the Jiaobei Uplift and offer favorable conditions for gold mineralization [3].

3. Methodology

3.1. Available Dataset

The data used in this study were obtained from the Data Integration and Comprehensive Analysis Project for Resource, Environmental, and Geological Surveys (Project No. DD20190427), conducted by the Development Research Center of the China Geological Survey (Beijing, China). The dataset includes stream sediment geochemical data and geophysical data [20].

Stream sediment geochemical data were obtained from the Regional Geochemistry-National Reconnaissance (RGNR) Program [35]. In the Jiaodong area, samples were collected at a density of one sample per 1 km², and the field samples were then combined into composite analytical samples. Each analytical sample was tested for 39 elements. Multi-element analysis of samples and the associated quality control procedures followed national standards.

The geophysical data included residual gravity and magnetic anomalies. A total of 368 samples were analyzed, focusing on elements related to gold mineralization (Au, Ag, Cu, Pb, Zn, Bi, Mo, Sb, W, Ni, Sn, As, Hg). Thirteen types of geochemical and geophysical data were used in the analysis.

3.2. Geochemical Data Processing: Au Distribution Pattern in Stream Sediments

Based on the Au geochemical data from 837 regional (1:200,000) geochemical survey maps and 44,422 combined stream sediment samples across China, the statistically derived median and arithmetic mean Au contents were 1.32 ppb and 2.03 ppb, respectively [36]. In the study area, the median Au content in stream sediments was 2.57 ppb, with an arithmetic mean of 7.54 ppb, both considerably higher than the national values. The Au content in stream sediments ranged from 0.82 ppb to 226.24 ppb, with a standard deviation of 20.10 ppb, a range of 225.42 ppb, and a coefficient of variation of 2.66, indicating high dispersion. Normality analysis was based on skewness and kurtosis as indicators. Skewness measures the asymmetry and direction of the data distribution, while kurtosis reflects the peakedness and tail thickness compared to a normal distribution (both are 0 for a normal distribution when kurtosis is expressed as excess kurtosis). Au showed the highest coefficient of variation, followed by Ag and W (Table A1). All elements exhibited high skewness and kurtosis, indicating significant deviation from normality.

Geochemical data are typically compositional and subject to closure effects, meaning that using raw data can lead to spurious correlations. To address this, specific transformation methods are recommended to “open” the data for analysis [37]. Common approaches include direct logarithmic transformation (DLT), additive log-ratio (ALR), centered log-ratio (CLR), and isometric log-ratio (ILR) transformations [37]. The ILR method assumes that the sum of all components equals 1. However, the geochemical dataset presented two issues: (i) incomplete elemental coverage, so that the components did not sum to 1; and (ii) a difference of four orders of magnitude between major and trace element units. As a result, ILR was applied with caution. For datasets involving only trace elements or where components did not sum to 1, DLT was used. This approach has been shown to improve data normality (Table A2). After logarithmic transformation, the statistical analysis of the 13 elements showed a significant reduction in skewness, kurtosis, and coefficients of variation, supporting the assumption of approximate normality. In Figure 2, plots (a) and (c) highlight anomalous variations in Au, Ag, As, Sn, and Bi. The skewness–kurtosis plot for raw values (b) indicates non-normal distributions for Sn, Au, Ag, As, W, and Bi, and to a lesser extent for Pb, Zn, and Cu. Au, Ag, W, Bi, and Sn may exhibit bimodal distributions (kurtosis > 2). The skewness–kurtosis plot for log-transformed values (d) shows that most elements still deviate from a log-normal distribution (skewness ≠ 0), with long tails and possible bimodal behavior (kurtosis > 2) in Sn, As, W, and Ag.
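The effect of the direct logarithmic transformation on the normality indicators can be illustrated with a minimal sketch. It assumes synthetic, log-normally distributed concentrations (not the survey data), population moments, and excess kurtosis; the function name `describe` and the sample parameters are hypothetical.

```python
import math
import random
import statistics


def describe(values):
    """Return (cv, skewness, excess_kurtosis) from population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    cv = math.sqrt(m2) / mean          # coefficient of variation
    skew = m3 / m2 ** 1.5              # 0 for a symmetric distribution
    kurt = m4 / m2 ** 2 - 3.0          # excess kurtosis, 0 for a normal one
    return cv, skew, kurt


# Synthetic right-skewed "Au-like" concentrations in ppb (assumption, not real data).
rng = random.Random(0)
au_ppb = [rng.lognormvariate(2.0, 1.0) for _ in range(368)]

raw_stats = describe(au_ppb)
log_stats = describe([math.log10(x) for x in au_ppb])

print("raw  : cv=%.2f skew=%.2f kurt=%.2f" % raw_stats)
print("log10: cv=%.2f skew=%.2f kurt=%.2f" % log_stats)
```

On data of this kind the skewness of the log-transformed values should lie much closer to 0 than that of the raw values, which is the behavior reported for the 13 elements.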

Factor analysis is a common preprocessing method in geochemistry, used to reduce dimensionality and extract independent components that capture key data patterns [38]. Applied to the log-transformed values of 13 elements, factor analysis identified four factors (Table 2), explaining 71.23% of the total variance. The factor groups (Table 3) are: F1 (Ni, Cu, Zn, Pb, Hg), F2 (W, Bi, Mo), F3 (As, Sb, Sn), and F4 (Au, Ag). These factor scores were used as continuous variables in the TSC analysis (Figure 3). Figure 3 shows the score distributions for the four factors (F1–F4), overlaid with the locations of existing mineral deposits and towns in the study area. In Figure 3a–c, positive and negative anomalies exhibit an overall NE-trending pattern. Gold deposits are mainly located in the positive anomaly zones in Figure 3a,c. Figure 3d shows positive anomalies concentrated around Guocheng and Yazi Towns, both significant gold mining areas. The groupings of Au–Ag and Ni, Cu, Zn, Pb, Hg reflect known geochemical associations in the Jiaodong region [19]. The Au–Ag association (F4) corresponds to electrum, commonly found in gold deposits. The As–Sn–Sb grouping (F3) likely indicates low-temperature hydrothermal or intrusion-related systems (e.g., granitoids, porphyries), often aligned with fault zones controlling gold mineralization. In contrast, the W–Bi–Mo association (F2) suggests a higher-temperature geochemical signature, linked to magmatic-hydrothermal systems and granitic or porphyry intrusion-related gold deposits [39]. The Ni–Cu–Zn–Pb–Hg group (F1) likely reflects hydrothermal processes with potential for polymetallic sulfide mineralization, aiding in classifying gold systems as epithermal, orogenic, or porphyry types.

3.3. Geophysical Data Processing

Long-term geological surveys have accumulated extensive data, leading to the development of various metallogenic and prospecting theories [40,41]. One key challenge in mineral prospecting is the integration of qualitative data into geological big data analytics [40]. Qualitative data can be encoded numerically, allowing for participation in data analysis through category-specific codes.

Unlike geochemical methods, geophysical exploration relies on linear and nonlinear inversion theories to infer the distribution and depth of subsurface structures based on observed data and model parameters. This approach focuses on physical model-driven parameter estimation rather than statistical inference. Geophysical data are typically acquired through areal measurements and represented as gridded, discrete data [42], which are corrected and visualized as two-dimensional maps. Interpretation is largely qualitative, focusing on relative variations in anomalies linked to subsurface geological bodies or structures. These anomalies often exhibit spatial autocorrelation. The Moran index [43], widely used in geography, economics, and ecology, quantifies spatial patterns—positive values indicate clustering, negative values indicate dispersion, and near zero values indicate randomness. In this study, the Moran indices for residual gravity and magnetic anomalies were 0.414 and 0.260, respectively, indicating positive spatial autocorrelations with identifiable high- and low-value clusters. The differences in these indices reflect the nature of the anomalies: residual gravity anomalies, tied to large-scale density variations like basement topography and sediment thickness, exhibit stronger spatial autocorrelation. Residual magnetic anomalies, influenced by smaller magnetic bodies, such as volcanic rocks or mineral deposits, are more scattered and less autocorrelated.
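As an illustration of how the Moran index separates clustered from dispersed patterns, the following minimal sketch computes the global index on two toy grids. The text does not state which spatial weights were used, so binary rook-contiguity weights (shared cell edges) are an assumption here, and the grid values are hypothetical.

```python
def morans_i(grid):
    """Global Moran's I for a 2-D grid with binary rook-contiguity weights."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)

    num = 0.0    # sum over neighbor pairs of (x_i - mean) * (x_j - mean)
    w_sum = 0.0  # sum of all weights (each directed neighbor pair counts 1)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    w_sum += 1.0
    return (n / w_sum) * (num / denom)


# Toy grids (hypothetical values, not survey data).
clustered = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
checkerboard = [
    [1, 5, 1, 5],
    [5, 1, 5, 1],
    [1, 5, 1, 5],
    [5, 1, 5, 1],
]
print(round(morans_i(clustered), 3))     # 0.667: like values are adjacent
print(round(morans_i(checkerboard), 3))  # -1.0: like values are dispersed
```

The reported indices of 0.414 (gravity) and 0.260 (magnetic) fall between randomness and the strongly clustered toy case.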

Geophysical analysis highlights anomaly variations by converting continuous data into categories such as “high,” “moderate,” and “low,” based on data characteristics and application needs. Traditional classification methods, like binning and empirical approaches, are often prone to subjective bias. This study employs Ward-based hierarchical clustering [44] to automatically classify residual gravity and magnetic anomalies. The method requires a predefined number of clusters and, at each merge, joins the pair of clusters that yields the smallest increase in within-cluster variance, which makes it suitable for small samples but also sensitive to outliers [45]. For spatially autocorrelated data, Ward produces compact clusters with low internal variance and strong similarity among adjacent regions. It also preserves geographical continuity, avoiding fragmented clusters [46]. Therefore, Ward is well suited for applications requiring spatial coherence and clear boundaries. Results for K = 3, 4, and 5 are shown in Figure 4, with distinct colors representing different clusters.
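The discretization step can be sketched as follows. This is a minimal illustration of Ward's criterion applied to scalar anomaly values only, not the SPSS implementation used in the study; the function name `ward_1d`, the input values, and the relabeling of categories by ascending mean are assumptions made for the example.

```python
def ward_1d(values, k):
    """Agglomerative Ward clustering of scalar values into k categories.

    Each cluster is stored as (count, mean, member_indices). At every step the
    pair whose merge gives the smallest increase in within-cluster sum of
    squares, n_a * n_b / (n_a + n_b) * (mean_a - mean_b) ** 2, is merged.
    """
    clusters = [(1, float(v), [i]) for i, v in enumerate(values)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, ma, _ = clusters[a]
                nb, mb, _ = clusters[b]
                cost = na * nb / (na + nb) * (ma - mb) ** 2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        na, ma, ia = clusters[a]
        nb, mb, ib = clusters[b]
        merged = (na + nb, (na * ma + nb * mb) / (na + nb), ia + ib)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)

    # Relabel so that category 1 has the lowest mean and category k the highest.
    clusters.sort(key=lambda c: c[1])
    labels = [0] * len(values)
    for label, (_, _, members) in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical residual anomaly values (arbitrary units), not the survey data.
anomaly = [-9.5, -8.0, -7.2, -0.6, 0.1, 0.4, 1.0, 6.8, 7.5, 30.0]
print(ward_1d(anomaly, 3))  # [1, 1, 1, 2, 2, 2, 2, 2, 2, 3]
print(ward_1d(anomaly, 4))  # [1, 1, 1, 2, 2, 2, 2, 3, 3, 4]
```

In this toy case the single extreme value (30.0) occupies a category of its own at K = 3, and raising K to 4 splits the moderate group, which illustrates the outlier sensitivity noted above.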

Figure 4 (a0,b0) shows the contour maps of residual gravity and magnetic anomalies in the study area, both displaying a similar NE-trending pattern of alternating positive and negative anomalies. The low-gravity anomaly is mainly caused by widespread low-density granite bodies and the crustal thinning under an extensional tectonic setting. The magnetic anomaly is closely related to the presence of magnetic minerals such as magnetite. Low magnetic anomalies in this area are mainly due to the weak magnetism of granites and hydrothermal alteration. The distribution of gold deposits in the region typically coincides with zones of low gravity and low magnetic fields, reflecting the presence of migmatites and granitic bodies. Therefore, low-gravity and low magnetic field anomalies can be used as important geophysical indicators for gold exploration.

The clustering results (Figure 4(a1–b3)) show that the samples are predominantly uniformly distributed, consistent with the anomaly distribution ranges (Figure 4(a0,b0)). High-value areas are shown in red and orange, while low-value areas appear in blue. However, some individual sample points display isolated distributions. As the number of clusters (K) increases, previously grouped samples are further subdivided. For example, Cluster 4 in Figure 4(a2) and Cluster 5 in Figure 4(a3) both originate from Cluster 3 in Figure 4(a1), while Cluster 1 in Figure 4(a3) corresponds to Cluster 1 in Figure 4(a1). Similar patterns are observed in Figure 4b. Independently distributed sample points primarily represent extreme values (for instance, Cluster 5 (maximum) and Cluster 1 (minimum) in Figure 4(a3), with comparable trends in Figure 4(b3)), demonstrating the strong interpretability of the classification results. Comparing Figure 4(a1–a3), the positions of isolated sample points (red circles) remain consistent. As the number of clusters (K) increases, the algorithm extracts extreme values sequentially from existing clusters. For gravity data, maximum values (Figure 4(a2), Cluster 4) are identified first when K increases from 3 to 4, followed by minimum values (Figure 4(a3), Cluster 1) when K increases to 5. In contrast, magnetic data show minimum values (Figure 4(b2), Cluster 1) first, followed by maximum values (Figure 4(b3), Cluster 5). Overall, K = 3 already captures the main distribution characteristics of the study area.

The silhouette coefficient is a key metric for evaluating clustering performance, with higher values indicating better clustering quality. From the results (Table 4), we observed that:

(i) As the number of clusters (K) increases, the silhouette coefficients for gravity anomalies decrease. This occurs because more clusters lead to smaller cluster sizes and reduced separation, lowering the silhouette coefficient;

(ii) The silhouette coefficient for residual gravity anomalies is consistently higher than that for magnetic anomalies. This is because stronger spatial autocorrelation enhances clustering by producing tighter, more distinct clusters, which raises the silhouette coefficient;

(iii) When K = 4 or 5, the silhouette coefficients for magnetic and gravity anomalies show minimal differences, indicating similar spatial distribution patterns. This is reflected in the anomaly contour maps in Figure 4, which display alternating high and low values trending NE;

(iv) When K = 3 and 4, the silhouette coefficients for magnetic anomalies are similar, indicating a saturation effect—where increasing K no longer significantly improves the silhouette coefficient. This is common in geophysical data clustering and results mainly from the spatial continuity of the data. The result at K = 5 is obtained by extracting additional extreme values from the K = 4 case, with minimal changes in overall features. However, the silhouette coefficient decreases notably at K = 5, mainly because the clusters are already compact at K = 3 and 4. Further increases in K reduce inter-cluster distances more than intra-cluster distances, leading to a decline in the silhouette quality;

(v) For magnetic anomalies, the highest silhouette coefficient occurs at K = 4 and the lowest at K = 3, likely due to their spatial structure. Gravity anomalies tend to form broader, more continuous regions that are better suited to fewer clusters; increasing K in these regions reduces cluster distinctiveness and lowers the silhouette coefficient. In contrast, magnetic anomalies, characterized by weaker spatial autocorrelation and more scattered patterns, naturally form more compact and consistent clusters, so achieving a high silhouette coefficient with small K values is more difficult.

Another possible reason for the influence of the K value in Ward clustering is that excessive clustering can over-segment geological units, reducing their integrity and amplifying noise, which lowers the silhouette coefficient. Thus, considering both clustering performance and geological relevance, K = 3 or 4 is recommended as the optimal number of clusters.
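The way over-segmentation lowers the silhouette coefficient can be shown with a minimal sketch. It assumes scalar data, absolute-difference distances, and a score of 0 for singleton clusters; the function name `silhouette_1d` and the toy values are hypothetical and unrelated to Table 4.

```python
def silhouette_1d(values, labels):
    """Mean silhouette coefficient for scalar data with absolute distance."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)

    scores = []
    for v, lab in zip(values, labels):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(v - u) for u in other) / len(other)
            for key, other in groups.items()
            if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two well-separated groups of hypothetical values.
values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
two = silhouette_1d(values, [1, 1, 1, 2, 2, 2])
three = silhouette_1d(values, [1, 1, 2, 3, 3, 3])  # first group over-split
print(round(two, 3), round(three, 3))
```

Splitting a compact group shrinks the distance to the nearest neighboring cluster faster than it shrinks the within-cluster distance, so the coefficient for the three-cluster labeling is clearly lower than for the two-cluster one.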

3.4. Review of Clustering Method

Unsupervised learning (UL) is a machine learning (ML) approach that identifies patterns and structures in unlabeled data. Clustering, a widely used UL technique, groups data based on intrinsic similarities. Traditional clustering methods include K-means and hierarchical clustering, which are generally limited to continuous variables. In contrast, Two-Step Clustering (TSC) can handle both categorical and continuous variables simultaneously [29]. TSC not only determines the optimal number of clusters but also identifies key variables influencing clustering through predictive importance scores, feature maps, and variable characteristics within each cluster. The method constructs a cluster feature (CF) tree and uses a distance metric as a similarity criterion [47]. Model selection is guided by automatically calculated Bayesian Information Criterion (BIC) or Akaike’s Information Criterion (AIC), both of which assess model fit.

AIC, derived from entropy, evaluates how well a model fits the data [48], while BIC, based on AIC, includes a stronger penalty for model complexity, especially with larger datasets [29,49]. Consequently, BIC tends to favor simpler models. In most parameter estimation problems, the likelihood function serves as the objective function; model selection aims to balance model complexity with its ability to represent the data effectively.
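For reference, the standard textbook forms of the two criteria are given below. The symbols are introduced here for illustration and do not come from the paper: $\hat{L}$ is the maximized likelihood, $p$ the number of free parameters, and $N$ the number of records.

```latex
\mathrm{AIC} = -2 \ln \hat{L} + 2p,
\qquad
\mathrm{BIC} = -2 \ln \hat{L} + p \ln N
```

The BIC penalty per parameter exceeds that of AIC whenever $\ln N > 2$, that is, for $N \geq 8$. For a dataset of 368 samples, $\ln 368 \approx 5.9$, so each additional parameter costs roughly three times as much under BIC as under AIC.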

The log-likelihood distance $d(i, j)$ between clusters $i$ and $j$ is defined as [29]:

$$d(i, j) = \xi_{i} + \xi_{j} - \xi_{\langle i, j \rangle} \tag{1}$$

where $\xi_{i}$ and $\xi_{j}$ are the log-likelihood values of clusters $i$ and $j$, and $\xi_{\langle i, j \rangle}$ is the log-likelihood of the new cluster formed by merging $i$ and $j$.

The log-likelihood $\xi_{v}$ of a cluster $v$ is calculated as follows [29]:

$$\xi_{v} = -N_{v} \left( \sum_{k=1}^{K^{A}} \frac{1}{2} \log\left(\hat{\sigma}_{k}^{2} + \hat{\sigma}_{vk}^{2}\right) + \sum_{k=1}^{K^{B}} \hat{E}_{vk} \right) \tag{2}$$

$$\hat{E}_{vk} = -\sum_{l=1}^{L_{k}} \frac{N_{vkl}}{N_{v}} \log \frac{N_{vkl}}{N_{v}} \tag{3}$$

where $K^{A}$ is the number of continuous variables; $K^{B}$ is the number of categorical variables; $L_{k}$ is the number of categories of the $k$th categorical variable; $\hat{\sigma}_{k}^{2}$ is the variance of the $k$th continuous variable in the entire dataset; $\hat{\sigma}_{vk}^{2}$ is the variance of the $k$th continuous variable in cluster $v$; $\hat{E}_{vk}$ is the entropy of the $k$th categorical variable in cluster $v$, as given by Equation (3); $N_{v}$ is the number of samples in cluster $v$; and $N_{vkl}$ is the number of samples in cluster $v$ that fall in the $l$th category of the $k$th categorical variable.

The BIC is computed as follows [29]:

$$\mathrm{BIC}(J) = -2 \sum_{j=1}^{J} \xi_{j} + m_{J} \log(N) \tag{4}$$

where $N$ is the total number of records in the dataset; $\xi_{j}$ is the log-likelihood of cluster $j$; and $m_{J}$ is the number of model parameters of the $J$-cluster solution.

Theoretical studies suggest that AIC often lacks rigor and may misidentify models, making BIC the preferred criterion for model performance [50]. In TSC, following Bayesian quasi-test principles, a smaller BIC value indicates better clustering quality. As the number of clusters increases, the rate of BIC reduction diminishes, helping to determine the optimal cluster count. All data analysis and clustering were performed using the SPSS 26 (IBM, Chicago, USA) software package.
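Equations (1)–(4) can be made concrete with a minimal sketch on a toy dataset of one continuous factor score and one categorical anomaly class. This is not the SPSS implementation: the record layout, function names, and data are hypothetical, and the parameter count m_J = J(2K^A + Σ(L_k − 1)) is taken from the commonly documented TwoStep formulation rather than from the text above.

```python
import math


def variance(xs):
    """Population variance (maximum-likelihood estimate)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def xi(cluster, total_var):
    """Eq. (2): log-likelihood of one cluster.

    cluster   : list of (continuous_values, categorical_values) records
    total_var : variance of each continuous variable over the whole dataset
    """
    n = len(cluster)
    cont_term = 0.0
    for k, var_all in enumerate(total_var):
        var_v = variance([rec[0][k] for rec in cluster])
        cont_term += 0.5 * math.log(var_all + var_v)
    cat_term = 0.0
    for k in range(len(cluster[0][1])):
        counts = {}
        for rec in cluster:
            counts[rec[1][k]] = counts.get(rec[1][k], 0) + 1
        # Eq. (3): entropy of the k-th categorical variable in this cluster
        cat_term += -sum(c / n * math.log(c / n) for c in counts.values())
    return -n * (cont_term + cat_term)


def distance(ci, cj, total_var):
    """Eq. (1): decrease in log-likelihood caused by merging ci and cj."""
    return xi(ci, total_var) + xi(cj, total_var) - xi(ci + cj, total_var)


def bic(clusters, total_var, levels):
    """Eq. (4), assuming m_J = J * (2 * K_A + sum(L_k - 1))."""
    n_total = sum(len(c) for c in clusters)
    m = len(clusters) * (2 * len(total_var) + sum(l - 1 for l in levels))
    return -2 * sum(xi(c, total_var) for c in clusters) + m * math.log(n_total)


# Hypothetical records: (factor score,), (gravity class,)
low_a = [((-1.2,), ("low",)), ((-0.9,), ("low",)), ((-1.0,), ("low",))]
low_b = [((-0.8,), ("low",)), ((-1.1,), ("low",))]
high = [((1.1,), ("high",)), ((0.9,), ("high",)), ((1.3,), ("high",))]
everything = low_a + low_b + high
total_var = [variance([rec[0][0] for rec in everything])]

d_similar = distance(low_a, low_b, total_var)
d_different = distance(low_a, high, total_var)
bic_1 = bic([everything], total_var, levels=[2])
bic_2 = bic([low_a + low_b, high], total_var, levels=[2])
bic_3 = bic([low_a, low_b, high], total_var, levels=[2])
print(round(d_similar, 4), round(d_different, 4))
print(round(bic_1, 2), round(bic_2, 2), round(bic_3, 2))
```

In this toy case the two "low" groups are almost free to merge, whereas merging "low" with "high" is costly, and most of that cost comes from the entropy term of the categorical variable. The two-cluster solution also has the smallest BIC, which matches the rule that a smaller BIC indicates the better solution.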

4. Discussion

4.1. Clustering Results

TSC was applied to: (i) the four factors (F1–F4) derived from the factor analysis of 13 geochemical elements in stream sediments; and (ii) the residual gravity and magnetic anomaly data from the northeastern margin of the Jiaolai Basin. Since TSC accommodates both continuous and categorical variables, the four factors (F1–F4) were treated as continuous variables, while the residual gravity and magnetic anomalies were treated as categorical variables (Table 5). A comparative analysis was conducted to assess how the number of categories affects the clustering results. The combinations were ranked from 1 to 9 by clustering quality (see Table 5, “Clustering Quality”).

Table 5 includes four fixed continuous variables and two categorical variables. The interaction between the categorical variables produces nine combinations, each yielding distinct clustering outcomes. Specifically, the resulting number of clusters is K = 5 in three combinations, K = 2 in three, K = 3 in one, and K = 4 in two. Generally, fewer categories result in higher quality [51]. In this study, combination c2 (3 × 4) achieved the best performance, while a3 (5 × 3) performed the worst.

A comparison of continuous variable importance rankings showed that F1 is the most influential. Although the 13 original elements and the four factor-derived variables were standardized (mean = 0, variance = 1), variations in factor weights led to differing contributions. F1, having the largest weight, contributed the most to the clustering, further confirming its significance as the most important continuous variable.

The silhouette coefficient was used to evaluate TSC quality, typically classified as poor, fair, or good (Figure 5). The figure illustrates horizontal differences in the clustering results due to changes in the residual gravity anomaly classifications while the residual magnetic anomaly classifications remain fixed. Vertical differences are caused by changes in the magnetic anomaly classifications while the gravity classifications are held constant. A comparison across the nine TSC category combinations showed that b2 and c1 yield a similar clustering performance, while c2 produces the best result.

Figure 6 shows a planar visualization of the results for the top four combinations (c2, b2, c1, a2). The remaining combinations were not further analyzed due to their small K values or poor clustering quality. The clustering pattern predominantly follows a northeast orientation, likely influenced by the distribution of residual gravity and magnetic anomalies. The results also show strong consistency, particularly in the stable positions of the purple and blue regions. When compared with Figure 4(a1), the purple region corresponds to the high-value zone and the blue region to the low-value zone in the residual gravity anomalies, confirming the significant role of gravity anomalies in shaping the clustering outcomes.

4.2. Factors Affecting Clustering Results

4.2.1. The Difference in the Number of Categories of Categorical Variables

Table 5 includes four fixed continuous variables and two categorical variables, which together yield nine variable combinations. Different combinations of these categorical variables produce distinct clustering outcomes. A comparison of combinations c2, b2, and c1 reveals that fewer categories generally result in higher clustering quality. For example, combination c2 (3 × 4) yields the best performance, while a3 (5 × 3) and b1 (4 × 5) perform the worst. Notably, c3 (3 × 3) does not produce the highest clustering quality. In Ward clustering (Table 4), the highest silhouette coefficient occurs when gravity K = 3 and magnetic K = 4, while the lowest occurs when gravity K = 5 and magnetic K = 3. These results show that variable combinations influence silhouette values, with higher coefficients generally indicating better clustering performance.

Furthermore, this combination pattern likely stems from the nature of residual gravity and magnetic anomalies, which represent inherently continuous geophysical fields. Excessive clustering can overly discretize the data, disrupt the natural spatial continuity, and reduce compatibility with continuous geochemical factor variables. In contrast, a moderate number of categories (e.g., 3 × 4, 4 × 4) preserves variability while minimizing noise, leading to more effective clustering outcomes.

As shown in Figure 6, when the number of categories for the residual gravity anomaly is fixed at 3 (Figure 6a,c), clustering performance remains relatively stable, with minimal variation, likely due to the smooth nature of the gravity field. In contrast, the number of categories for the residual magnetic anomaly varies more flexibly between 3 and 5 (Table 5), possibly reflecting its stronger local variability. The poor performance of the 5 × 5 combination may be the result of an excessively high number of categories, leading to information redundancy. In the first step (pre-clustering), too many subcategories were generated and critical information may have been lost during the second step (hierarchical clustering compression).

4.2.2. The Impact of Algorithms

The Ward algorithm is sensitive to outliers. As the number of clusters increases, extreme outliers are often grouped into separate clusters. When using Ward’s hierarchical clustering to discretize variables into categories, too many clusters may capture noise and local variations rather than meaningful overall patterns. This not only increases computational complexity but also distorts distance calculations during TSC. TSC begins by pre-clustering continuous variables before integrating them with categorical variables. When categorical variables have too many categories (e.g., 5 × 5), pre-clustering produces an excessive number of subcategories, increasing the complexity of the merging step and making the process more susceptible to initial conditions, which can lead to suboptimal local solutions. Additionally, TSC uses log-likelihood distance for mixed data. In this context, categorical variables can outweigh continuous variables such as geochemical factors. However, after factor analysis reduces dimensionality, geochemical factors retain a high signal-to-noise ratio. As a result, moderate category combinations (e.g., 3 × 4) help to balance variable weights, leading to improved clustering outcomes.
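The weight that categorical variables can carry in the log-likelihood distance can be sketched as follows. The sketch assumes the distance takes the form documented for the SPSS TwoStep procedure, d(i, j) = ξ_i + ξ_j − ξ_⟨i,j⟩, where ξ combines a log-variance term for each continuous variable with an entropy term for each categorical variable; this section does not restate the formula, so that form is an assumption here. The clusters and the names xi and loglik_distance are hypothetical.

```python
import math


def xi(cont_rows, cat_rows, global_var):
    """Log-likelihood term of one cluster (assumed TwoStep form)."""
    n = len(cont_rows)
    total = 0.0
    # Continuous part: 0.5 * log(global variance + within-cluster variance)
    for k, gvar in enumerate(global_var):
        col = [row[k] for row in cont_rows]
        m = sum(col) / n
        var = sum((x - m) ** 2 for x in col) / n
        total += 0.5 * math.log(gvar + var)
    # Categorical part: entropy of the category frequencies in the cluster
    for k in range(len(cat_rows[0])):
        counts = {}
        for row in cat_rows:
            counts[row[k]] = counts.get(row[k], 0) + 1
        total += -sum(c / n * math.log(c / n) for c in counts.values())
    return -n * total


def loglik_distance(a, b, global_var):
    """Decrease in log-likelihood caused by merging clusters a and b."""
    (cont_a, cat_a), (cont_b, cat_b) = a, b
    merged = xi(cont_a + cont_b, cat_a + cat_b, global_var)
    return xi(cont_a, cat_a, global_var) + xi(cont_b, cat_b, global_var) - merged


cont = [[0.0], [1.0]]  # hypothetical factor scores, identical in both clusters
same = loglik_distance((cont, [("g1",), ("g1",)]), (cont, [("g1",), ("g1",)]), [1.0])
diff = loglik_distance((cont, [("g1",), ("g1",)]), (cont, [("g2",), ("g2",)]), [1.0])
print(same, diff)
```

In this toy case the two clusters have identical continuous values, so the whole distance comes from the categorical mismatch. This is consistent with the observation above that finely discretized categorical variables can outweigh the continuous geochemical factors.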

In summary, the clustering results are highly dependent on the discretization of categorical variables, variable combinations, and the number of clusters (K). This sensitivity stemmed from algorithmic parameters, the nature of the geophysical fields (e.g., smooth gravity vs. complex magnetic anomalies), and geological unit divisions within the study area. The top-performing combinations struck a balance between parameter selection and data structure, while poor-performing ones suffered from noise introduction or information loss. The next step was to conduct geological interpretation based on the optimal combinations identified.

4.3. Geological Interpretation Based on the 3 × 4 and 4 × 4 Combination

This study selected the 3 × 4 and 4 × 4 combinations for further analysis, as both combine moderate geophysical discretization with similar K values in clustering (5 and 4, respectively; Table 5). Different classification schemes for gravity and magnetic data carry distinct geological implications. The 4 × 4 combination reflects a scenario where the complexity of the study area warranted slightly more gravity categories and fewer magnetic ones, effectively minimizing noise. This suggests that regional structures primarily influence the gravity field, while magnetic anomalies are more locally controlled by mineralization.

In contrast, the 3 × 4 combination emphasizes large-scale gravity features with fewer categories, while the higher number of magnetic categories captures fine-scale variations such as veins and alteration zones. The distribution of clustering results for each variable is shown according to its importance in Figure 7(a2,b2), with known Au, Ag, Cu, and Pb deposits marked in Figure 7(a3,b3).

The five clusters in Figure 7(a1,b1) directly reflect the spatial coupling relationships among gravity, magnetic, and geochemical factors. Clusters with the highest mineral exploration potential typically feature a favorable combination of geochemical signatures and specific gravity and magnetic anomaly patterns. Variable F4, representing Au and Ag, serves as a direct indicator of precious metal mineralization. Given the close spatial association between Au and Ag, their distribution patterns can be considered key targets for exploration. As illustrated in Figure 7(a3,b3), gold deposits are predominantly located in Cluster 3, demonstrating the prospecting effectiveness of this cluster. Other areas within the same cluster may be considered priority exploration zones. Cluster 3 is characterized by moderate-gravity anomalies, low-magnetic anomalies, and high F1 and F2 values. Geologically, this corresponds to a medium-density background, low magnetism, and significant enrichment in medium- to low-temperature polymetallic elements.

In this study, F4 exhibits the smallest variance and lowest importance in clustering (Table A1), resulting in a weak correlation between clustering outcomes and known deposit locations. Contributing factors may include: (1) a research focus primarily on gold, with limited data on other deposit types, leading to sparse positive samples; and (2) sampling conducted on a 1:200,000 grid with wide sample spacing, which may reduce analytical accuracy. Nevertheless, a key strength of TSC lies in its ability to identify not only individual anomaly intensities but also spatial coupling among multiple anomalies. For example, a medium-level Au anomaly (F4), when coupled with a strong W-Mo anomaly (F2) and specific magnetic features, may indicate higher mineralization potential than an isolated high-intensity Au anomaly. The clustering results in this study effectively capture these spatial coupling relationships. Therefore, analyzing the distribution of each variable within the clusters is essential to determine optimal exploration targets.

Table 6 summarizes the gravity and magnetic anomaly combinations for each cluster. For the gravity anomalies, categories 1–3 indicate low, medium, and high, and category 4 (present only in the 4 × 4 combination) indicates very high. For the magnetic anomalies, categories 1–4 indicate very low, low, medium, and high. Except for Clusters 2 and 5 (highlighted in bold), differences among the other clusters are relatively minor. The difference between the two combinations lies in the number of gravity anomaly categories (3 and 4), resulting in K values of 5 and 4, respectively. In Figure 4(a1,a2), the fourth gravity category represents a local maximum with minimal impact on the overall distribution. As a result, Clusters 1 to 4 in Table 6 share the same gravity–magnetic category combinations. Cluster 5 (green) in the 3 × 4 setup is a refined subset of Cluster 2 in the 4 × 4 setup. In Table 6, Clusters 2 and 5 in the 3 × 4 combination are characterized by (F3+) and (F1+, F2+), respectively, which together correspond to the (F1+, F2+, F3+) profile of Cluster 2 in the 4 × 4 combination.

Figure 7 shows that existing ore deposits are primarily located within Clusters 2 and 3. Based on the variable combinations in Table 6, this study focuses on the geological backgrounds of Clusters 2 and 3. Both show moderate-gravity anomalies, combined with medium-magnetic anomalies in Cluster 2 and low-magnetic anomalies in Cluster 3. Moderate gravity may indicate a fracture transition zone or lithological boundary. Low magnetism suggests weak alteration in shallow settings; medium magnetism likely reflects alteration along rock margins.

Cluster 2, characterized by moderate gravity and magnetism with F3 (As, Sb) enrichment, indicates low-temperature hydrothermal gold–antimony mineralization controlled by shallow to mid-crustal faults. Regional faults channel gold-bearing fluids upward, with As and Sb serving as primary gold carriers. At temperatures below 200 °C, these elements precipitate in fractures, forming fine disseminated gold deposits such as the Pengjiakuang deposit.

Cluster 3 is defined by moderate gravity and low magnetism, with significant enrichment of F1 (Ni–Cu–Zn), F2 (W–Mo–Bi), and F3 (As–Sb), indicating a multi-stage, shallow, hydrothermal mineralization process. Early high-temperature fluids formed quartz vein-type molybdenum mineralization (F2), followed by medium- to low-temperature hydrothermal overprinting that introduced Cu, Pb, Zn sulfides (F1), and Au–Sb mineralization (F3). This sequence resulted in vein-type polymetallic ore bodies, similar to those in the Jinniushan district of Jiaodong.

Cluster 4, marked by independent F4 enrichment, shows a high-gravity anomaly linked to a basic rock mass, potentially indicating gold mineralization controlled by the outer contact zone of the intrusion. Additionally, this study examined the similarity in variable characteristics between Clusters 2 and 5 (in the 3 × 4 scheme) and Cluster 2 (in the 4 × 4 scheme).

A comparative analysis of Cluster 5 (green) in Figure 7(a3) and Cluster 2 (yellow) in Figure 7(b3) shows that their gravity–magnetic combinations are 2, 4 and 2, 3, respectively (Table 6). Cluster 5 (Figure 7(a3)) has 67 samples marked with green ☆ on Cluster 2 (Figure 7(b3)), of which 37 (55%) overlap. The remaining 45% fall within Cluster 4 (Figure 7(b1) or Figure 7(b3)). These overlaps indicate that Cluster 5 represents a refined subset of Cluster 2 and is not just a simple extraction but a deeper mineralization layer masked by shallow alteration within the same fault–magmatic system. This pattern is commonly observed along the eastern Tan-Lu Fault Zone. The data suggest that multi-stage emplacement of basic magmas during Late Mesozoic lithospheric thinning led to the spatial superposition of shallow gold and deeper copper–nickel deposits. In this study, TSC effectively distinguished magnetic anomalies, splitting Cluster 2 from the 4 × 4 combination into two parts: shallow gold mineralization (medium magnetism) and deeper copper–nickel mineralization (high magnetism). This finding demonstrates the vertical stratification of the mineralization system.
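The overlap percentages quoted above amount to a cross-tabulation of the two cluster assignments. A minimal sketch follows, in which the label lists are rebuilt only from the counts reported in the text (67 samples, 37 of them overlapping); the function name overlap_shares and the list names are hypothetical, and the per-sample assignments themselves are not reproduced here.

```python
from collections import Counter


def overlap_shares(labels_a, labels_b, cluster_a):
    """Share of the samples in `cluster_a` (scheme A) that fall in each scheme-B cluster."""
    hits = Counter(b for a, b in zip(labels_a, labels_b) if a == cluster_a)
    total = sum(hits.values())
    return {cluster_b: count / total for cluster_b, count in hits.items()}


# 67 samples of Cluster 5 in the 3 x 4 scheme; 37 coincide with Cluster 2 of the
# 4 x 4 scheme and the remaining 30 are assigned to Cluster 4, as the text reports.
labels_3x4 = [5] * 67
labels_4x4 = [2] * 37 + [4] * 30

shares = overlap_shares(labels_3x4, labels_4x4, cluster_a=5)
print({cluster: f"{share:.0%}" for cluster, share in shares.items()})
```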

5. Conclusions

Geophysical and geochemical anomalies reflect a region’s tectonic framework. Integrating these data with suitable methods can yield valuable insights, particularly for mineral prospecting. In this study, geophysical data were transformed into two categorical variables—residual gravity and magnetic anomalies—while geochemical data were converted into continuous variables via factor analysis. Two-Step Clustering (TSC) was then applied to jointly analyze the categorical and continuous variables. The key findings are as follows.

A key feature of the Ward method in transforming continuous variables into categorical ones is that increases in K tend to extract extreme values (maximum or minimum). While this process does not alter the overall anomaly distribution, applying different variable combinations in TSC leads to varied clustering results. More categories do not necessarily result in more clusters (K) or better performance. Two factors explain this: (1) geophysical data exhibit natural spatial continuity and a high K in Ward clustering can fragment these patterns, disrupting geological coherence; and (2) clustering performance depends on how well geochemical data distributions align with the discretized geophysical data. When variables share similar anomaly patterns—such as the NE-trending anomaly in this study—fewer, more coherent clusters are typically formed. Across all nine combinations tested, K never exceeded 5. These results suggest that exploring additional geophysical variable combinations could benefit future research.

In this study, the silhouette coefficient was used to evaluate clustering quality. Combinations with high silhouette scores, such as gravity–magnetic pairs (3 × 4) and (4 × 4), yielded better clustering results. Notably, known gold deposits are primarily located in Cluster 3, suggesting that other areas within this cluster may represent high-priority exploration targets. Furthermore, TSC effectively divided the previously mixed Cluster 2 (from the 4 × 4 combination) into two distinct subclusters: one associated with shallow gold mineralization (medium magnetic response), and another linked to deep copper–nickel mineralization (high magnetic response). This differentiation clearly reveals the vertical zonation of the metallogenic system.

Future research should further explore the application of clustering methods in geological big data analytics to enhance mineral exploration.

Author Contributions

Conceptualization, M.Z. and X.C.; methodology, X.C. and L.C.; software, X.C.; validation, M.Z. and X.C.; formal analysis, X.Z.; investigation, L.C.; resources, M.Z. and W.R.; data curation, X.C.; writing—original draft preparation, X.C.; writing—review and editing, M.Z.; visualization, X.Z.; supervision, S.Z.; project administration, M.Z.; funding acquisition, L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the China Geological Survey Project (DD20230485, DD20243177, DD20240208604).

Data Availability Statement

Data are contained within the article.

Acknowledgments

We thank all editors and reviewers for their valuable comments and suggestions for improving this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Statistical parameters of elemental concentration in stream sediments of 368 samples in the study area (Note: Au and Ag values are in ppb, while all other elements are in ppm).

Element	Content Arithmetic Mean	Content Standard Deviation	Variation Coefficient	Elemental Content			Distribution Characteristics
Element	Content Arithmetic Mean	Content Standard Deviation	Variation Coefficient	Minimum	Median	Maximum	Skewness	Kurtosis
Au	7.54	20.10	2.66	0.82	2.57	226.24	7.76	70.89
Ag	62.84	33.34	0.53	30.38	54.41	387.47	4.49	31.51
As	6.21	2.08	0.33	2.10	5.76	24.46	3.41	21.79
Cu	18.67	5.45	0.29	9.73	17.23	46.29	1.59	4.32
Bi	0.18	0.07	0.37	0.09	0.17	0.64	2.84	13.45
Pb	22.69	5.98	0.26	10.99	21.87	58.46	2.10	7.97
Zn	55.35	13.45	0.24	29.63	54.02	149.84	1.75	7.54
Sb	0.34	0.07	0.22	0.20	0.33	0.64	1.22	2.31
Mo	0.80	0.26	0.32	0.40	0.73	2.11	2.04	5.33
W	1.41	0.60	0.43	0.68	1.27	5.70	3.43	16.80
Ni	26.54	6.57	0.25	13.39	25.41	47.29	0.59	−0.16
Sn	2.29	0.59	0.26	1.45	2.30	10.22	6.92	87.61
Hg	16.86	3.22	0.19	10.33	15.98	32.45	1.32	2.20

Table A2. Elemental statistical parameters after logarithmic transformation of 368 samples in the study area.

Element	Arithmetic Mean	Standard Deviation	Variation Coefficient	Elemental			Normal Distribution
Element	Arithmetic Mean	Standard Deviation	Variation Coefficient	Minimum	Median	Maximum	Skewness	Kurtosis
Au	0.54	0.43	0.79	−0.08	0.41	2.35	1.32	2.07
Ag	1.76	0.16	0.09	1.48	1.74	2.59	1.51	3.62
As	0.77	0.12	0.16	0.32	0.76	1.39	0.66	3.70
Cu	1.26	0.11	0.09	0.99	1.24	1.67	0.60	0.38
Bi	−0.77	0.13	−0.17	−1.06	−0.78	−0.20	0.87	1.99
Pb	1.34	0.10	0.08	1.04	1.34	1.77	0.61	2.24
Zn	1.73	0.10	0.06	1.47	1.73	2.18	0.46	1.14
Sb	−0.47	0.09	−0.19	−0.70	−0.48	−0.20	0.43	0.75
Mo	−0.12	0.12	−1.02	−0.39	−0.14	0.32	1.03	1.26
W	0.12	0.14	1.13	−0.17	0.10	0.76	1.29	3.17
Ni	1.41	0.11	0.08	1.13	1.41	1.67	0.07	−0.56
Sn	0.35	0.09	0.25	0.16	0.36	1.01	1.30	9.16
Hg	1.22	0.08	0.06	1.01	1.20	1.51	0.74	0.63
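The effect of the logarithmic transformation on skewness (Table A1 versus Table A2) can be illustrated with a short sketch. A base-10 logarithm is assumed here because it reproduces the tabulated maxima (for example, the Au maximum of 226.24 becomes 2.35); the skewness below is the simple moment-based estimate, which may differ slightly from the bias-corrected value that a statistics package reports, and the sample is hypothetical.

```python
import math


def skewness(xs):
    """Moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def log10_transform(xs):
    """Base-10 logarithm of strictly positive concentrations."""
    return [math.log10(x) for x in xs]


sample = [1, 2, 3, 4, 100]  # hypothetical right-skewed concentrations
print(round(skewness(sample), 2), round(skewness(log10_transform(sample)), 2))

# Consistency check against the tabulated maxima of Au and Ag (Tables A1 and A2)
print(round(math.log10(226.24), 2), round(math.log10(387.47), 2))
```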

References

1. Song, M.C.; Yang, L.Q.; Fan, H.R. Current progress of metallogenic research and deep prospecting of gold deposits in the Jiaodong Peninsula during 10 years for Exploration Breakthrough Strategic Action. Geol. Bull. China 2022, 41, 903–935.
2. Ding, Z.J.; Sun, F.Y.; Liu, F.L.; Liu, J.P.; Peng, Q.M.; Ji, P.; Li, B.L.; Zhang, P.J. Mesozoic geodynamic evolution and metallogenic series of major metal deposits in Jiaodong Peninsula. Acta Petrol. Sin. 2015, 10, 3045–3080.
3. Liu, Y.Q.; Shi, H.; Li, J.; Huang, T.L.; Jin, Y.M.; Wang, F. Geological, Geophysical and Geochemical Characteristics of Gold Deposits around Jiaolai basin, Shandong Province and Their Prospecting Significance. Acta Geosci. Sin. 2004, 6, 593–600.
4. Liu, J.Y.; Duan, L.A. The North Edge of Jiao-Lai basin Gold Deposit Features and Analysis on mine-prospecting Prospect. Gold Sci. Technol. 2006, 14, 30–35.
5. Liu, Y.Q. Further discussion of the direction in gold prospecting in the Jiao-Lai basin. Geol. China 2001, 11, 13–19.
6. Zhao, B.J.; Gao, M.B.; Li, Y.D.; Fu, H.Q.; Li, D.D.; Feng, Q.W.; Zhang, D.; Zheng, D.C.; Ma, M.; Wang, L.G. Study on metallogenic regularity of gold deposits in Longkou-Tudui mining area on the northeastern margin of Jiaolai basin. Acta Geol. Sin. 2019, 93 (Suppl. 1), 1–11.
7. Li, G.H.; Ding, Z.J.; Ji, P.; Li, Y.; Tang, J.Z.; Liu, S.S. Features and Prospecting Direction of the Gold Deposits in the Northeastern Margin of the Jiaolai basin. Geol. Explor. 2016, 6, 1029–1036.
8. Wang, R.Z.; Wu, C.X. Characteristics of deep mineralization of Guocheng gold deposit in the northeast margin of Jiaolai basin and its geological prospecting potentials. China Met. Bull. 2022, 15, 40–42.
9. Li, D.D. Study on Genesis and Metallogenic Model of Longkou-Tudui Gold Deposit in Northeastern Margin of Jiaolai basin. Mod. Min. 2020, 9, 6–11.
10. Li, Y.; Ding, Z.J.; Bo, J.W.; Song, M.C.; Wu, F.P.; Li, G.H.; Li, T.T. Geochemical characteristics of mine-forming elements and metallogenic potentiality in the gold mineralization area of northeast margin of Jiaolai basin. Gold 2018, 8, 15–21.
11. Botbol, J.M.; Sinding-Larsen, R.; McCammon, R.B.; Gott, C.B. A regionalised multivariate approach to target selection in geochemical exploration. Econ. Geol. 1978, 73, 534–546.
12. Howarth, R.J.; Sinding-Larsen, R. Multivariate Analysis General Information. In Handbook of Exploration Geochemistry Statistics and Data Analysis in Geochemical Prospecting; Elsevier: Amsterdam, The Netherlands, 1983; pp. 207–289.
13. McCammon, R.B.; Botbol, J.M.; Sinding-Larsen, R.; Bowen, R.W. Characteristic analysis—1981: Final program and a possible discovery. Math. Geol. 1983, 15, 59–83.
14. Sinding-Larsen, R.; Botbol, J.M.; McCammon, R.B. Use of weighted characteristic analysis as a tool in resource assessment. In Proceedings of the Evaluation of Uranium Research Proceedings of the Advisory Group Meeting, Rome, Italy, 29 November–3 December 1976; International Atomic Energy Agency: Vienna, Austria, 1979; pp. 275–285.
15. Plant, J.; Hale, M. Drainage Geochemistry. Handbook of Exploration Geochemistry; Elsevier: Amsterdam, The Netherlands, 1994; p. 8.
16. Beus, A.A.; Grigorian, S.V. Geochemical Exploration Methods for Mineral Deposits; Applied Publishing Ltd.: Wilmette, IL, USA, 1977; 288p.
17. Levinson, A.A. Introduction to Exploration Geochemistry; Applied Publishing Ltd.: Wilmette, IL, USA, 1974; 612p.
18. Robert, F.; Brommecker, R.; Bourne, B.T.; Dobak, P.J.; McEwan, C.J.; Rowe, R.R.; Zhou, X. Models and exploration methods for major gold deposit types. In Proceedings of the Exploration 07: Fifth Decennial International Conference on Mineral Exploration, Toronto, ON, Canada, 9–12 September 2007; pp. 691–711.
19. Wang, S.H.; Wang, J.K.; Zhang, X.F. Geochemical characteristics and gold metallogenic potential for stream sediment in Muping-Wendeng area, Jiaodong Peninsula. Geol. Rev. 2020, 66, 510–519.
20. Li, R.H.; Wang, X.Q.; Chi, Q.H.; Zhang, B.M.; Liu, Q.Q.; Liu, H.L. Distribution of geochemical anomaly of gold in drainage sediment in the Jiaodong Peninsula, China and its significance. Earth Sci. Front. 2019, 4, 221–230.
21. Wei, Y.T. Geochemical Characteristics of Stream Sediments and mine-prospecting Direction in Haiyang Area in Jiaodong Peninsula. Shandong Land Resour. 2017, 33, 30–36.
22. Song, M.C.; Wang, H.J.; Liu, H.B.; He, C.Y.; Wei, Y.T.; Li, B. Deep characteristics of ore-controlling faults in Jiaoxibei gold deposits and its implications for prospecting: Evidence from geophysical surveys. Geol. China 2024, 51, 1–16.
23. He, C.Y.; Yao, Z.; Guo, G.Q.; Liu, H.B.; Song, M.C.; Li, S.Y. Deep structural features of the Jiaobei terrane of Jiaodong and the North Sulu orogenic belt: the inspiration from deep exploration of the geophysics. Prog. Geophys. 2022, 37, 1392–1404.
24. Dentith, M.; Frankcombe, K.; Trench, A. Geophysical signatures of Western Australian mineral deposits: An overview. Explor. Geophys. 1994, 25, 103–160.
25. Song, M.C.; Xue, G.Q.; Liu, H.B.; Li, Y.X.; He, C.Y.; Wang, H.J.; Wang, B.; Song, Y.X.; Li, S.Y. A geological-geophysical prospecting model for deep-seated gold deposits in the Jiaodong Peninsula, China. Minerals 2021, 11, 1393.
26. Oldenburg, D.W.; Li, Y.; Ellis, R.G. Inversion of geophysical data over a copper gold porphyry deposit: A case history for Mt. Milligan. Geophysics 1998, 62, 1419–1431.
27. Goldfarb, R.J.; Groves, D.I. Orogenic gold: Common or evolving fluid and metal sources through time. Lithos 2015, 233, 2–26.
28. Nie, N.H. SPSS Statistical Package for the Social Sciences, Second Edition. New York: McGraw-Hill Book Co., 1975. J. Advert. 1976, 5, 41–42.
29. Zhu, Y.C. Twostep Cluster Model and Its Application. Mark. Res. 2005, 1, 40–42.
30. Song, M.C.; Lin, S.Y.; Yang, L.Q.; Song, Y.X.; Ding, Z.J.; Li, J.; Li, S.Y.; Zhou, M.L. Metallogenic model of Jiaodong Peninsula gold deposits. Miner. Depos. 2020, 39, 215–236.
31. Song, M.C.; Song, Y.X.; Li, J. Thermal doming-extension metallogenic system of Jiaodong type gold deposits. Acta Petrol. Sin. 2023, 5, 1241–1260.
32. Song, M.C.; Li, S.Z.; Santosh, M.; Zhao, S.J.; Yu, S.; Yi, P.H.; Cui, S.X.; Lv, G.X.; Xu, J.X.; Song, Y.X.; et al. Types, characteristics and Metallogenesis of Gold Deposits in the Jiaodong Peninsula, Eastern North China Craton. Mine Geol. Rev. 2015, 65, 612–625.
33. Cheng, S.B.; Liu, Z.J.; Wang, Q.F. SHRIMP zircon U-Pb dating and Hf isotope analyses of the Muniushan Monzogranite, Guocheng, Jiaobei Terrane, China: implications for the tectonic evolution of the Jiao-Liao-Ji Belt, North China Craton. Precambrian Res. 2017, 301, 36–48.
34. Wang, L.G.; Zhi, Y.B.; Wang, Y.P.; Dong, J.; Wang, Q.Y. Ore-controlling Characteristics and Metallogenic Prospect Analysis of Detachment Structure in Guocheng-Yazi Gold Ore Concentration Area, Northeastern Margin of Jiaolai Basin. Sci. Technol. Eng. 2025, 25, 1359–1369.
35. Xie, X.J.; Mu, X.Z.; Ren, T.X. Geochemical mapping in China. J. Geochem. Explor. 1997, 60, 99–113.
36. Chi, Q.H.; Yan, M.C. Handbook of Applied Geochemical Elemental Abundance Data; Geological Publishing House: Beijing, China, 2007; pp. 1–148.
37. Tan, Q.P.; Xia, Y.; Wang, X.Q. Reflections on some issues of log-ratio conversion of geochemical composition data. In Proceedings of the 8th National Symposium on Metallogenic Theory and Prospecting Methods, Nanchang, China, 8–11 December 2017; pp. 737–738.
38. Yin, Z.G.; Jiang, R.; Cheng, J.D.; Li, M.M.; Zhou, X.G.; Zhang, K.Q.; Jiang, Q.; Guo, H. Characteristics of soil geochemical anomalies and metallogenic prediction in Liukuaidi South of Mulan County, Heilongjiang Province. Geol. Bull. China 2023, 42, 2015–2027.
39. Aliyari, F.; Yousefi, T.; Abedini, A.; Calagari, A.A. Primary geochemical haloes and alteration zoning applied to gold exploration in the Zarshuran Carlin-type deposit, northwestern Iran. J. Geochem. Explor. 2021, 231, 106864.
40. Zuo, R.G.; Xia, Q.L.; Wang, H.C. Compositional data analysis in the study of integrated geochemical anomalies associated with mineralization. Appl. Geochem. 2013, 28, 202–211.
41. Zuo, R.G. Data science-based theory and method of quantitative prediction of mineral resources. Earth Sci. Front. 2021, 28, 49–55.
42. Wang, C.B.; Ma, X.G.; Chen, J.G. The application of data pre-processing technology in the geoscience big data. Acta Petrol. Sin. 2018, 34, 303–313.
43. Moran, P.A.P. Notes on Continuous Stochastic Phenomena. Biometrika 1950, 37, 17–23.
44. Ward, J.H.J. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244.
45. Li, Y.W.; Song, S. Comprehensive analysis of the techniques and tactics in 2018 FIFA World Cup using principal component analysis and Q type cluster analysis. Comput. Era 2020, 1, 57–61.
46. Wang, F.F.; Wu, Y.F.; Kong, M. Region segmentation of gravity anomaly based on agglomerative clustering algorithm. Hydrogr. Surv. Chart. 2021, 41, 18–22.
47. Liu, X.; Xia, X.Y.; Liu, D.G.; Hu, J.H.; Huang, R.; Li, Z.W.; Shi, Z.Y. Line loss anomaly identification of low-voltage-station based on Two-Step clustering and robust random cut Forest algorithm. Mod. Electr. Power 2024, 41, 441–447.
48. Zhang, Y.; Meng, G. Simulation of an Adaptive Model Based on AIC and BIC ARIMA Predictions. J. Phys. Conf. Ser. 2023, 2449, 012027.
49. Qin, L.; Michael, A.C.; Shane, A.R.; Barbara, R.H. Performance of AIC and BIC in Selecting Partition Models and Mixture Models. Syst. Biol. 2022, 72, 92–105.
50. Xia, F.; Zhang, L.F. Application of BIC Criteria in Data Fusion. J. Jiangxi Univ. Sci. Technol. 2007, 28, 35–38.
51. Chang, X.P.; Zhang, M.H.; Zhang, X.; Zhang, S. Two-Step Clustering for Mineral Prospectivity Mapping: A Case Study from the Northeastern Edge of the Jiaolai Basin, China. Minerals 2024, 14, 1089.

Figure 1. Geological map of the study area: (a) location map of the study area; (b) regional geological map of the study area (revised according to reference [5]).

Figure 2. Statistical characteristics of geochemical elements plotted for: raw values (a,b); and log-transformed values (c,d).

Figure 3. Score maps of stream sediment geochemical factors (F1–F4): (a) F1 (Ni, Cu, Zn, Pb, Hg); (b) F2 (W, Bi, Mo); (c) F3 (As, Sb, Sn); and (d) F4 (Au, Ag).

Figure 4. Residual gravity (a0); and magnetic (b0) anomalies in the study area, along with their corresponding categorical classifications ((a1–a3, b1–b3), respectively).

Figure 5. Comparison of clustering quality for nine gravity and magnetic anomaly combinations. (For example, residual gravity anomalies 5 × residual magnetic anomalies 5 denotes the 5 × 5 combination.)

Figure 6. Comparison of clustering results across four TSC models. Different colors represent distinct sample clusters.

Figure 7. Comparison of the 3 × 4 and 4 × 4 combinations in the study area: panels (a1,b1) show the category distribution, and (a2,b2) show the detailed distributions of the corresponding clusters (red dotted lines indicate Cluster 3); while (a3,b3) display the distribution of Clusters 1 and 5 alongside known Au, Ag, Cu, and Pb deposits (the overlapping points are marked with green ☆).

Table 1. Major deposit features in the study area (revised according to reference [7]).

Number	Deposit	Controlling Structure	Orebody Features	Ore Type	Deposit Scale	Type of Deposits
1	Pengjiakuang	Interlayer detachment zone	Layer, Lenticular	Au, Ag, S	Large-scale	Pengjiakuang-type
2	Songjiagou	Structural fracture zone	Layer, vein	Au, Ag, S	Large-scale	Tudui-type
3	Tudui-Shawang	Structural fracture zone	Layer	Au, Ag	Large-scale	Liaoshang-type
4	Liaoshang	Exomorphic zone	Vein	Au, Ag, S	Extra-large scale	Songjiagou-type
5	Xilaokou	Interlayer detachment zone, Exomorphic zone	Lenticular, Tabular	Au, S	Large-scale	Liaoshang-type, Pengjiakuang-type
6	Xijingkou	Interlayer detachment zone	Layer	Au, Pb, Zn, S	Medium-scale	Pengjiakuang-type
7	Longkou	Interlayer fracture Zone	Layer, Vein	Au, Ag	Large-scale	Tudui-type
8	Daligou	Structural fracture zone	Vein	Au, Ag, S	Ore spot	Songjiagou-type
9	Nanguozi	Interlayer fracture Zone	Vein	Au, Ag	Ore spot	Tudui-type

Table 2. Total variance explained by factor analysis of the 13 log-transformed elements in the study area.

Component	Initial Eigenvalues			Extraction Sums of Squared Loadings			Rotation Sums of Squared Loadings
Component	Total	Variance %	Cumulative %	Total	Variance %	Cumulative %	Total	Variance %	Cumulative %
1	4.22	32.50	32.50	4.22	32.50	32.50	3.10	23.81	23.81
2	2.31	17.80	50.30	2.31	17.80	50.30	2.50	19.27	43.07
3	1.45	11.15	61.45	1.45	11.15	61.45	2.14	16.44	59.52
4	1.27	9.79	71.23	1.27	9.79	71.23	1.52	11.72	71.23
5	0.85	6.51	77.74
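The variance percentages in Table 2 follow from the eigenvalues, since each of the 13 standardized variables contributes one unit of variance. A brief sketch of that arithmetic is given below. The retention rule (eigenvalue greater than 1, often called the Kaiser criterion) is an assumption that happens to reproduce the four retained factors; it is not stated in this section, and the function names are illustrative.

```python
# Eigenvalues of the first five components, as listed in Table 2
eigenvalues = [4.22, 2.31, 1.45, 1.27, 0.85]
n_variables = 13  # number of log-transformed elements


def variance_percent(eigs, n_vars):
    """Share of total variance explained by each component, in percent."""
    return [100 * e / n_vars for e in eigs]


def n_retained(eigs, threshold=1.0):
    """Number of components with eigenvalue above the threshold (assumed rule)."""
    return sum(e > threshold for e in eigs)


pct = variance_percent(eigenvalues, n_variables)
kept = n_retained(eigenvalues)
cumulative = sum(pct[:kept])
print(kept, [round(p, 2) for p in pct], round(cumulative, 2))
```

Small differences from the tabulated percentages are expected because the eigenvalues in Table 2 are rounded to two decimals.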

Table 3. Four factors derived from the factor analysis of 13 geochemical elements in the study area.

	Component
	1	2	3	4
Ni	0.89
Cu	0.84
Zn	0.79
Pb	0.67
Hg	0.60
W		0.83
Bi		0.79
Mo		0.78
As			0.85
Sb			0.78
Sn			0.59
Ag				0.84
Au				0.64

Table 4. Silhouette coefficients of residual gravity and residual magnetic anomalies for different values of K.

K	Residual Gravity Anomalies	Residual Magnetic Anomalies
3	0.522	0.477
4	0.521	0.517
5	0.493	0.491

Table 5. Combination of gravity and magnetic anomalies in nine TSC models.

No.	Combination of Categorical Variables		Number of Continuous Variables	Number of Clusters	Importance of Continuous Variables	Clustering Quality (Rank, 1 = Best)
No.	Residual Gravity Anomalies	Residual Magnetic Anomalies	Number of Continuous Variables	Number of Clusters	Importance of Continuous Variables	Clustering Quality (Rank, 1 = Best)
a1	5	5	4	3	F1 > F3 > F4 > F2	7
a2	5	4	4	4	F3 > F1 > F2 > F4	4
a3	5	3	4	2	F1 > F3 > F4 > F2	9
b1	4	5	4	2	F1 > F3 > F2 > F4	6
b2	4	4	4	4	F1 > F2 > F3 > F4	2
b3	4	3	4	5	F1 > F2 > F4 > F3	8
c1	3	5	4	5	F3 > F1 > F2 > F4	3
c2	3	4	4	5	F2 > F1 > F4 > F3	1
c3	3	3	4	2	F1 > F3 > F4 > F2	5
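One way to read Table 5 is to average the clustering-quality values over models that share the same number of gravity or magnetic categories. The sketch below does this under the assumption that the quality column is a rank in which 1 is best, which is consistent with the text describing c2 as the best and a3 as the worst model; the names TABLE5 and mean_rank_by are illustrative.

```python
# (gravity categories, magnetic categories, quality rank) for each model in Table 5
TABLE5 = {
    "a1": (5, 5, 7), "a2": (5, 4, 4), "a3": (5, 3, 9),
    "b1": (4, 5, 6), "b2": (4, 4, 2), "b3": (4, 3, 8),
    "c1": (3, 5, 3), "c2": (3, 4, 1), "c3": (3, 3, 5),
}


def mean_rank_by(position):
    """Mean quality rank grouped by the category count at `position` (0 = gravity, 1 = magnetic)."""
    groups = {}
    for row in TABLE5.values():
        groups.setdefault(row[position], []).append(row[2])
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}


by_gravity = mean_rank_by(0)
by_magnetic = mean_rank_by(1)
print(by_gravity)
print(by_magnetic)
```

Under this reading, the mean rank worsens steadily as the number of gravity categories rises from 3 to 5, while for the magnetic anomaly the best mean rank occurs at 4 categories.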

Table 6. Comparison of gravity and magnetic combinations for each cluster. Gravity 2 refers to the second category of residual gravity anomalies. Magnetic 1 refers to the first category of residual magnetic anomalies. A ‘+’ indicates that the cluster median for the corresponding factor exceeds the overall median.

Cluster	3 × 4 Combination						4 × 4 Combination
Cluster	Gravity	Magnetic	F1	F2	F3	F4	Gravity	Magnetic	F1	F2	F3	F4
1	1	3	+			+	1	3	+			+
2	2	3			+		2	3	+	+	+
3	2	2	+	+	+		2	2	+	+	+
4	3	3				+	3	3		+		+
5	2	4	+	+
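The '+' convention in Table 6 (cluster median above the overall median) can be expressed in a few lines. The factor scores and cluster labels below are hypothetical, and the function name plus_flags is illustrative.

```python
from statistics import median


def plus_flags(scores, labels):
    """Map each cluster to True when its median score exceeds the overall median."""
    overall = median(scores)
    flags = {}
    for cluster in sorted(set(labels)):
        members = [s for s, lab in zip(scores, labels) if lab == cluster]
        flags[cluster] = median(members) > overall
    return flags


f1_scores = [-1.2, -0.4, 0.1, 0.9, 1.5, 2.0]  # hypothetical factor scores
clusters = [1, 1, 1, 2, 2, 2]                 # hypothetical cluster labels
print(plus_flags(f1_scores, clusters))
```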

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
