An Unsupervised Learning Approach for Coal Spontaneous Combustion Warning Level Classification Using t-SNE and k-Means Clustering

Zhang, Pengyu; Chen, Xiaokun

doi:10.3390/app15073756



by Pengyu Zhang * and Xiaokun Chen

School of Safety Science and Engineering, Xi’an University of Science and Technology, 58, Yanta Mid. Rd., Xi’an 710054, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(7), 3756; https://doi.org/10.3390/app15073756

Submission received: 12 February 2025 / Revised: 10 March 2025 / Accepted: 25 March 2025 / Published: 29 March 2025


Abstract:

Accurate prediction of coal spontaneous combustion levels is crucial for preventing and controlling spontaneous combustion in goaf areas. To address the ambiguity in classification standards of coal spontaneous combustion warning levels, 21 groups of coal samples from different mining areas were subjected to experiments with programmed temperatures, generating a database of 336 sets of temperatures and data on indicator gas concentrations. An unsupervised learning approach combining t-distributed Stochastic Neighbor Embedding (t-SNE) and k-means clustering was proposed to perform dimensionality reduction and clustering of high-dimensional data features. The clustering results of the original data were compared with Principal Component Analysis (PCA) and Stochastic Neighbor Embedding (SNE) methods to determine coal spontaneous combustion warning levels. The indicator gases and warning levels were input into a trained Support Vector Classifier (SVC) to establish a classification model for coal spontaneous combustion warning levels in goaf areas. The results showed that the maximum Maximal Information Coefficients (MICs) between temperature and CO and O₂ concentrations were 0.95 and 0.81, respectively, indicating strong nonlinear relationships between indicator gases and warning levels. The t-SNE method effectively extracted nonlinear mapping relationships between the indicator gas features, while the k-means clustering categorized coal spontaneous combustion data using distance as a similarity measure. By combining the t-SNE and k-means methods for accurate dimensionality reduction and clustering of goaf spontaneous combustion data, the warning levels were classified into six categories: safe, low risk, moderate risk, high risk, severe risk, and extremely severe risk. 
The application in the Longgu mine demonstrated that the SVC method could accurately classify spontaneous combustion warning levels in field goaf areas and implement corresponding response measures based on different warning levels, providing a valuable reference for spontaneous combustion prevention in goaf areas.

Keywords:

coal spontaneous combustion; warning level; unsupervised learning; support vector classifier; classification model

1. Introduction

Coal spontaneous combustion in goaf areas is characterized by secondary effects, concealment, and uncontrollability, which can easily trigger secondary disasters, such as coal dust and gas explosions [1,2,3]. Several catastrophic accidents have occurred due to this phenomenon, including the gas explosion triggered by coal spontaneous combustion at the Jilin Tonghua Babao Coal Industry in 2013, a similar incident at the Yanshitai Coal Mine in 2014, and the gas explosion caused by the spontaneous combustion of residual coal at Tangyang Coal Mine in 2018. These incidents highlight the critical importance of accurate prediction and early warnings for coal spontaneous combustion for ensuring safe coal mine production [4,5,6].

Coal spontaneous combustion involves complex physicochemical processes that result in heat accumulation within the coal mass, eventually leading to spontaneous ignition as temperatures rise [7,8]. During this process, various indicator gases, including CO, CO₂, CH₄, C₂H₄, C₂H₆, and C₂H₂, are released, with their concentrations exhibiting distinct trends with increasing coal temperature [9]. The traditional gas analysis method has been widely adopted due to its non-invasive nature and real-time monitoring capabilities [10]. While the Coal Mine Safety Regulations categorize coal spontaneous combustion into three stages (incubation, self-heating, and spontaneous combustion periods), this classification is overly simplistic and lacks quantitative analysis [11,12]. Although numerous researchers have conducted experimental studies to establish quantitative relationships between indicator gases and warning levels [13], the highly nonlinear relationship between these parameters makes it challenging to obtain accurate predictions using traditional mechanism-based methods [14,15]. Recent studies have also revealed that environmental factors, such as the temperature, humidity, and oxygen concentration, significantly influence the spontaneous combustion process, further complicating the prediction problem [16,17].

With the rapid advancement of computer science and artificial intelligence, many scholars have applied machine learning methods, such as BP neural networks [18,19], support vector machines [20,21], and random forests [22,23], to predict coal spontaneous combustion. These data-driven approaches have shown promising results in laboratory conditions, demonstrating superior performance compared to traditional statistical methods [24]. However, these supervised learning approaches require labeled training samples, which are often difficult to obtain in practical situations [25]. Furthermore, existing classification standards for warning levels vary significantly due to differences in the project backgrounds, testing methods, and coal sample selection, with the number of warning levels ranging from 3 to 7 across different studies. This inconsistency in classification standards, coupled with varying severity interpretations of the same warning level, hampers the effective implementation of prevention and control measures.

To address these limitations, this study analyzes 336 sets of data collected through oil bath experiments with programmed temperatures across 21 different mining areas and proposes a novel unsupervised learning approach combining t-SNE and k-means clustering for warning level classification. The method integrates machine learning techniques with traditional coal mine safety practices, establishing a comprehensive warning level system based on multiple indicator gases. Our approach differs from previous studies by eliminating the need for pre-labeled training data and providing a more objective classification framework. The effectiveness of the proposed approach is validated through field application in actual mining conditions, providing a practical solution for spontaneous combustion prediction in goaf areas. The methodology incorporates recent advances in dimensionality reduction techniques and cluster analysis, offering improved accuracy and reliability compared to conventional methods. This paper presents the establishment of the prediction database, details the warning level classification methodology, and demonstrates the development and validation of the prediction model through extensive field application and comparative analysis.

2. Experiments and Methods

2.1. Experimental System and Data Acquisition

To develop a universally applicable prediction method for coal spontaneous combustion in goaf, 21 coal samples were collected from various mining regions, including Shaanxi, Guizhou, Anhui, and Shanxi provinces. The raw coal samples were sealed and transported to the laboratory, where they were crushed and screened into five particle size fractions: ≤0.9 mm, >0.9–≤3 mm, >3–≤5 mm, >5–≤7 mm, and >7–≤10 mm. A 250 g mixed sample was prepared by taking 50 g from each size fraction for the programmed temperature experiment to simulate the oxidation and temperature rise processes of coal spontaneous combustion.

The experimental system with programmable oil bath heating consists of three main components: a gas circuit, a temperature control chamber, and a data acquisition system, as shown in Figure 1. An SPB-3 automatic air pump supplied the gas flow, while a temperature control device regulated the heating of the oil bath chamber. The heating rate was set at 0.3 °C min⁻¹ with an air flow rate of 30 mL min⁻¹. Gas samples were collected at 10 °C intervals when the coal temperature ranged from 30 °C to 180 °C. The collected gas samples were analyzed using gas chromatography to determine their composition and concentration.

Through the experiments, a comprehensive database of 336 sets of data was established. The database includes temperature measurements and corresponding gas concentrations (O₂, N₂, CO₂, CO, CH₄, C₂H₂, C₂H₄, and C₂H₆). Partial data from the programmed temperature experiment is presented in Appendix A. Given the complex nature of coal spontaneous combustion and the multiple gas indicators involved, it is essential to understand the relationships between these variables for effective monitoring and early warning. Therefore, a systematic analysis of the correlations between gas concentrations and temperature variations was conducted.
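The database size is consistent with the sampling schedule described above. The short sketch below reproduces the count, assuming that every one of the 21 coal samples was measured at every 10 °C step with both endpoints (30 °C and 180 °C) included; the text does not state this explicitly, and the variable names are illustrative only.

```python
# Sampling schedule from Section 2.1: one gas sample every 10 °C from 30 °C to 180 °C.
# Assumption: both endpoints are sampled and no measurement is missing.
sampling_temps_c = list(range(30, 181, 10))
n_coal_samples = 21

# One record per (coal sample, temperature point) pair.
records = [
    (sample_id, temp_c)
    for sample_id in range(1, n_coal_samples + 1)
    for temp_c in sampling_temps_c
]

print(len(sampling_temps_c), "temperature points per coal sample")
print(len(records), "records in total")
```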

To quantitatively analyze the complex relationships between gas indicators and temperature variation during coal spontaneous combustion, the Maximal Information Coefficient (MIC) was employed due to its ability to capture both linear and nonlinear associations.
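To illustrate why a grid-based information measure can detect relationships that a linear correlation coefficient misses, the sketch below computes a simplified stand-in for MIC. It is not the full MIC algorithm: it only tries equal-frequency grids with at most n^0.6 cells and keeps the largest normalized mutual information, whereas the actual MIC also optimizes the grid boundaries. The function names and the parabola test data are assumptions made for this example and are not taken from the study.

```python
import math

def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins rank-based bins of (nearly) equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

def mutual_information_bits(labels_x, labels_y):
    """Mutual information (in bits) between two discrete label sequences."""
    n = len(labels_x)
    joint, count_x, count_y = {}, {}, {}
    for a, b in zip(labels_x, labels_y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        count_x[a] = count_x.get(a, 0) + 1
        count_y[b] = count_y.get(b, 0) + 1
    return sum((c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
               for (a, b), c in joint.items())

def grid_dependence_score(x, y, exponent=0.6):
    """Largest normalized mutual information over equal-frequency grids (MIC-like)."""
    max_cells = len(x) ** exponent
    best = 0.0
    n_x = 2
    while n_x * 2 <= max_cells:
        n_y = 2
        while n_x * n_y <= max_cells:
            mi = mutual_information_bits(equal_frequency_bins(x, n_x),
                                         equal_frequency_bins(y, n_y))
            best = max(best, mi / math.log2(min(n_x, n_y)))
            n_y += 1
        n_x += 1
    return min(best, 1.0)

def pearson_r(x, y):
    """Ordinary linear correlation coefficient, for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Synthetic example: a symmetric parabola is strongly dependent but not linear.
x = [i / 63 for i in range(64)]
y_parabola = [(v - 0.5) ** 2 for v in x]
score = grid_dependence_score(x, y_parabola)
r = pearson_r(x, y_parabola)
print(f"grid score = {score:.2f}, Pearson r = {r:.2f}")
```

On this synthetic parabola the grid score is close to 1 while the Pearson coefficient is close to 0, which is the kind of nonlinear association the MIC analysis is meant to expose.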

As illustrated in Figure 2, the analysis revealed distinct patterns in the relationship between temperature and various gases. CO exhibited the strongest correlation with temperature (MIC = 0.95), which can be attributed to its continuous production during the coal oxidation process. The high MIC value suggests that CO serves as a primary indicator for monitoring temperature evolution. O₂ showed the second highest correlation (MIC = 0.81), reflecting its crucial role in the oxidation reaction and its consistent consumption pattern as temperature increases.

The varying MIC values among different gases reflect the dynamic nature of coal oxidation processes and provide insights into the evolution of gas compositions at different stages of spontaneous combustion. A detailed examination of these relationships reveals distinctive patterns in gas generation and consumption.

Hydrocarbon gases demonstrated varying degrees of correlation: C₂H₄ (MIC = 0.74) showed moderate correlation, primarily emerging during higher temperature stages, while C₂H₆ (MIC = 0.42) exhibited weaker correlation, typically appearing in the later stages of spontaneous combustion. CO₂ (MIC = 0.70) displayed intermediate correlation, indicating its value as a supplementary indicator. The relatively low MIC value for N₂ (MIC = 0.47) reflects its role as an environmental background gas rather than a direct product of coal oxidation.

This hierarchical correlation structure presents both opportunities and challenges for data analysis. While retaining all gas indicators provides comprehensive monitoring coverage, it introduces computational complexity and potential sparsity issues in high-dimensional feature space. Conversely, selecting only highly correlated indicators risks overlooking subtle but important changes in the combustion process. To address this trade-off, the t-SNE algorithm was introduced as a dimensionality reduction technique. This approach effectively preserves the intrinsic relationships between variables while reducing data redundancy, enabling more efficient and accurate analysis of the spontaneous combustion process.

The MIC analysis further revealed distinct temporal patterns in gas evolution throughout the spontaneous combustion process. During the initial stage, the process is characterized predominantly by O₂ consumption with minimal CO production, indicating the onset of coal oxidation. As the process advances to the development stage, a marked increase in CO concentration becomes evident, accompanied by accelerated O₂ depletion, signifying intensified oxidation reactions. In the advanced stage, the process is distinguished by the emergence of hydrocarbon gases (C₂H₄ and C₂H₆) alongside continued CO accumulation, reflecting the progression toward more severe combustion conditions and potential thermal decomposition of coal structures.

2.2. t-SNE Algorithm for Dimensionality Reduction

The t-SNE algorithm is a nonlinear dimensionality reduction technique that is particularly effective for visualizing high-dimensional data by embedding them in a lower-dimensional space. This algorithm excels in preserving both local and global structures of the original data through probability distribution matching [26,27]. In this study, all three dimensionality reduction methods (PCA, SNE, and t-SNE) were used to reduce the high-dimensional gas concentration data to 2 dimensions to facilitate intuitive visualization and enable effective comparison between the clustering results.

In the context of dimensionality reduction, local data structure refers to the preservation of proximity relationships between neighboring data points, ensuring that points that are close in the high-dimensional space remain close in the reduced space. Global data structure refers to the maintenance of large-scale patterns and relationships across the entire dataset, such as clusters, general shapes, and relative distances between distant points. Unlike PCA, which prioritizes global variance preservation, t-SNE effectively balances both local and global structural preservation through its probability-based similarity measurements, making it particularly suitable for capturing the complex evolutionary patterns in coal spontaneous combustion data where both short-range similarities and long-range relationships are critical for accurate classification.

In the high-dimensional space, t-SNE first computes the conditional probability p_{j|i} that represents the similarity between data points x_i and x_j using a Gaussian distribution:

p_{j | i} = \frac{\exp (- {‖x_{i} - x_{j}‖}^{2} / 2 {σ_{i}}^{2})}{\sum_{k \neq i} \exp (- {‖x_{i} - x_{k}‖}^{2} / 2 {σ_{i}}^{2})},

(1)

where ||x_i − x_j|| denotes the Euclidean distance between points x_i and x_j, and σ_i is the standard deviation of the Gaussian distribution centered at x_i. This standard deviation is determined through binary search to achieve a predefined perplexity value, which reflects the effective number of neighbors for each data point.

To ensure symmetry, the joint probability distribution p_ij in the high-dimensional space is computed as follows:

p_{i j} = \frac{p_{j | i} + p_{i | j}}{2 n},

(2)

where n represents the total number of data points. The joint probability p_ij characterizes the pairwise similarities between data points, with larger values indicating a higher likelihood of points being neighbors.
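The two steps above (Gaussian conditional probabilities, then symmetrization) can be sketched in a few lines. This is a minimal illustration on four made-up 2-D points with a fixed σ_i = 1 for every point; in t-SNE proper each σ_i is tuned to the chosen perplexity. The function names are hypothetical.

```python
import math

def conditional_probabilities(points, sigmas):
    """Return p[i][j] = p_{j|i} from Eq. (1); the diagonal p[i][i] is set to 0."""
    n = len(points)
    p = []
    for i in range(n):
        weights = []
        for k in range(n):
            if k == i:
                weights.append(0.0)  # a point is not its own neighbor
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
                weights.append(math.exp(-d2 / (2.0 * sigmas[i] ** 2)))
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def joint_probabilities(p_cond):
    """Symmetrize the conditional probabilities as in Eq. (2)."""
    n = len(p_cond)
    return [[(p_cond[i][j] + p_cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]

# Toy data: three nearby points and one distant point.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
p_cond = conditional_probabilities(points, sigmas=[1.0] * len(points))
p_joint = joint_probabilities(p_cond)

print("row sums of p_cond:", [round(sum(row), 6) for row in p_cond])
print("sum of p_joint:", round(sum(map(sum, p_joint)), 6))
```

Each row of the conditional matrix sums to 1, and dividing by 2n makes the symmetric joint matrix sum to 1 over all pairs, so it can be compared with q_ij as a probability distribution.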

In the low-dimensional space, t-SNE employs a Student t-distribution with one degree of freedom (Cauchy distribution) to compute the similarities q_ij between mapped points y_i and y_j:

q_{i j} = \frac{{(1 + {‖y_{i} - y_{j}‖}^{2})}^{- 1}}{\sum_{k \neq l} {(1 + {‖y_{k} - y_{l}‖}^{2})}^{- 1}},

(3)

The Student t-distribution was chosen over a Gaussian distribution because its heavier tails help mitigate the “crowding problem”, that is, the tendency for moderately distant points to be collapsed together in the low-dimensional embedding.
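For intuition about the heavier tails, the sketch below compares the two unnormalized similarity kernels at a few distances: exp(−d²), as used for the low-dimensional similarities in SNE, and (1 + d²)⁻¹, as used in t-SNE. The distances are arbitrary illustrative values.

```python
import math

def gaussian_kernel(d):
    """Unnormalized Gaussian similarity at low-dimensional distance d."""
    return math.exp(-d ** 2)

def student_t_kernel(d):
    """Unnormalized Student t (one degree of freedom) similarity at distance d."""
    return 1.0 / (1.0 + d ** 2)

for d in (0.5, 1.0, 2.0, 4.0):
    g, t = gaussian_kernel(d), student_t_kernel(d)
    print(f"d = {d:>3}: Gaussian = {g:.2e}, Student t = {t:.2e}, ratio = {t / g:.1f}")
```

The ratio grows rapidly with distance, so under the Student t kernel a pair of moderately dissimilar points can sit much farther apart in the embedding while still matching a given p_ij.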

The objective function of t-SNE minimizes the Kullback–Leibler divergence between the high-dimensional and low-dimensional probability distributions:

C = KL (P \| Q) = \sum_{i} \sum_{j} p_{i j} \log \frac{p_{i j}}{q_{i j}},

(4)

The KL divergence quantifies the dissimilarity between the two probability distributions, where a smaller value indicates better preservation of the original data structure.

The gradient of the cost function with respect to the low-dimensional coordinates is given by the following:

\frac{\partial C}{\partial y_{i}} = 4 \sum_{j} (p_{i j} - q_{i j}) (y_{i} - y_{j}) {(1 + {‖y_{i} - y_{j}‖}^{2})}^{- 1},

(5)

This gradient expression reveals the underlying force field where p_ij > q_ij generates an attractive force between points, while p_ij < q_ij results in a repulsive force.
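As a sanity check on the cost and its gradient, the following sketch evaluates the KL divergence and the analytic gradient for a tiny hand-made example and compares the gradient with a central finite-difference estimate. The 3 × 3 matrix `p` and the coordinates `y` are made-up values (`p` is symmetric, has a zero diagonal, and sums to 1); all function names are illustrative.

```python
import math

def low_dim_affinities(y):
    """Return (q, w): normalized q_ij and the unnormalized weights (1 + d^2)^-1."""
    n = len(y)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(y[i], y[j]))
                w[i][j] = 1.0 / (1.0 + d2)
    total = sum(map(sum, w))
    q = [[w[i][j] / total for j in range(n)] for i in range(n)]
    return q, w

def kl_divergence(p, q):
    """KL(P || Q) summed over all off-diagonal pairs."""
    n = len(p)
    return sum(p[i][j] * math.log(p[i][j] / q[i][j])
               for i in range(n) for j in range(n)
               if i != j and p[i][j] > 0.0)

def kl_gradient(p, y):
    """Analytic gradient: 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + d_ij^2)^-1."""
    q, w = low_dim_affinities(y)
    n, dim = len(y), len(y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                coeff = 4.0 * (p[i][j] - q[i][j]) * w[i][j]
                for d in range(dim):
                    grad[i][d] += coeff * (y[i][d] - y[j][d])
    return grad

def numerical_gradient(p, y, eps=1e-6):
    """Central finite differences of the KL cost, used only for checking."""
    grad = [[0.0] * len(y[0]) for _ in y]
    for i in range(len(y)):
        for d in range(len(y[0])):
            y_plus = [row[:] for row in y]
            y_minus = [row[:] for row in y]
            y_plus[i][d] += eps
            y_minus[i][d] -= eps
            c_plus = kl_divergence(p, low_dim_affinities(y_plus)[0])
            c_minus = kl_divergence(p, low_dim_affinities(y_minus)[0])
            grad[i][d] = (c_plus - c_minus) / (2.0 * eps)
    return grad

# Made-up joint probabilities (symmetric, zero diagonal, total 1) and 2-D coordinates.
p = [[0.0, 0.30, 0.05],
     [0.30, 0.0, 0.15],
     [0.05, 0.15, 0.0]]
y = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0]]

analytic = kl_gradient(p, y)
numeric = numerical_gradient(p, y)
max_diff = max(abs(a - b) for ra, rn in zip(analytic, numeric) for a, b in zip(ra, rn))
print("KL cost:", kl_divergence(p, low_dim_affinities(y)[0]))
print("largest analytic-vs-numeric difference:", max_diff)
```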

The optimization process employs gradient descent with momentum:

y (t) = y (t - 1) - η \frac{\partial C}{\partial y} + α (t) (y (t - 1) - y (t - 2)),

(6)

where η denotes the learning rate controlling the step size of each iteration, and α(t) is the momentum coefficient that accelerates convergence and helps escape local optima.
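The momentum update can be tried on a one-dimensional toy problem. The sketch below minimizes C(y) = y², whose gradient is 2y, and writes the step as a descent step (the gradient term is subtracted so that the cost decreases). The constant momentum coefficient and the chosen learning rate are simplifying assumptions for illustration; α(t) in the text may vary with the iteration.

```python
def momentum_descent(grad_fn, y0, learning_rate=0.1, momentum=0.5, n_iter=100):
    """Gradient descent with momentum on a scalar variable."""
    y_prev, y_curr = y0, y0  # y(t-2) and y(t-1); no momentum on the first step
    for _ in range(n_iter):
        y_next = (y_curr
                  - learning_rate * grad_fn(y_curr)    # gradient step
                  + momentum * (y_curr - y_prev))      # momentum term
        y_prev, y_curr = y_curr, y_next
    return y_curr

# Toy cost C(y) = y^2 with gradient 2y and its minimum at y = 0.
y_final = momentum_descent(lambda y: 2.0 * y, y0=5.0)
print("final y:", y_final)
```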

The perplexity parameter, which is crucial for t-SNE’s performance, is related to the Shannon entropy H(P_i) of the conditional probability distribution P_i:

Perp (P_{i}) = 2^{H (P_{i})},

(7)

Higher perplexity values emphasize the preservation of global structure, while lower values focus more on maintaining local features. In practice, perplexity values typically range from 5 to 50, requiring adjustment based on the specific characteristics of the dataset.
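The link between σ_i and perplexity can be made concrete with a small search routine. The sketch below computes the perplexity of the Gaussian conditional distribution around one point and bisects on σ (geometrically, between assumed bounds of 10⁻³ and 10³) until a target perplexity is reached. The squared distances and the target value of 3 are made-up inputs, and the function names are hypothetical.

```python
import math

def perplexity_for_sigma(sq_dists, sigma):
    """Perplexity 2^H of the Gaussian conditional distribution for one point."""
    d_min = min(sq_dists)  # shift the exponent to avoid underflow for small sigma
    weights = [math.exp(-(d2 - d_min) / (2.0 * sigma ** 2)) for d2 in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def find_sigma(sq_dists, target_perplexity, lo=1e-3, hi=1e3, n_steps=100):
    """Bisection on sigma; perplexity grows monotonically with sigma."""
    for _ in range(n_steps):
        mid = math.sqrt(lo * hi)  # geometric midpoint suits a scale parameter
        if perplexity_for_sigma(sq_dists, mid) > target_perplexity:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

# Squared distances from one point to its five neighbors (illustrative values).
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
sigma = find_sigma(sq_dists, target_perplexity=3.0)
print("sigma:", sigma)
print("achieved perplexity:", perplexity_for_sigma(sq_dists, sigma))
```

With five neighbors the perplexity can range from 1 (all weight on the nearest neighbor) to 5 (uniform weights), which is why it is read as an effective number of neighbors.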

Through iterative optimization, t-SNE progressively refines the low-dimensional embedding until the probability distributions in both spaces achieve optimal alignment, thereby creating a meaningful visualization of the high-dimensional data structure.

2.3. K-Means Clustering Algorithm

K-means clustering is a fundamental unsupervised machine learning algorithm that partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean. The algorithm operates by iteratively refining cluster assignments based on the minimization of within-cluster variances [28,29].

The algorithm aims to minimize the objective function, commonly known as the Sum of Squared Errors (SSE) or inertia:

J (X, μ) = \sum_{i = 1}^{n} \sum_{j = 1}^{k} w_{i j} {‖x_{i} - μ_{j}‖}^{2},

(8)

where X = {x₁, x₂, …, x_n} represents the set of n data points, μ = {μ₁, μ₂, …, μ_k} denotes the k cluster centroids, and w_ij is a binary indicator variable: w_ij = 1 if x_i belongs to cluster j, and is 0 otherwise. The term ||x_i − μ_j|| represents the Euclidean distance between data point x_i and centroid μ_j.

The cluster assignment for each data point is determined by the following:

c (i) = \arg \min_{j} {‖x_{i} - μ_{j}‖}^{2},

(9)

where c(i) indicates the cluster assignment for the i-th data point. This equation assigns each point to the cluster whose centroid is nearest in terms of Euclidean distance.

After cluster assignments, the centroids are updated using the following:

μ_{j} = \frac{1}{|C_{j}|} \sum_{x_{i} \in C_{j}} x_{i},

(10)

where |C_j| represents the number of points assigned to cluster j, and C_j is the set of all points currently assigned to cluster j. This update computes the mean position of all points assigned to each cluster.

The convergence criterion is typically defined as follows:

|J (t) - J (t - 1)| < ε,

(11)

where J(t) represents the objective function value at iteration t, and ε is a small threshold value determining when the algorithm should terminate.
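The assignment, update, and convergence steps above fit together as in the following minimal sketch. It runs on six made-up 2-D points with hand-picked initial centroids, so it illustrates the iteration itself and not the clustering of the experimental data; the function names are illustrative.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, tol=1e-9, max_iter=100):
    """Plain k-means iteration; returns (labels, centroids, objective J)."""
    centroids = [list(c) for c in initial_centroids]
    previous_j = float("inf")
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Objective: sum of squared distances to the assigned centroids.
        j_value = sum(squared_distance(p, centroids[lab])
                      for p, lab in zip(points, labels))
        # Convergence test on the change in the objective.
        if abs(previous_j - j_value) < tol:
            break
        previous_j = j_value
    return labels, centroids, j_value

# Two well-separated toy groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids, j_value = k_means(points, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels, centroids, j_value)
```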

The initialization of cluster centroids significantly influences the algorithm’s performance. A common initialization method is the following:

μ_{j} = x_{i} + δ,

(12)

where x_i is randomly selected from the dataset and δ represents a small random perturbation. The within-cluster sum of squares (WCSS) for each cluster is calculated as follows:

WCSS (j) = \sum_{x_{i} \in C_{j}} {‖x_{i} - μ_{j}‖}^{2},

(13)

This metric helps evaluate the compactness of individual clusters and can be used for determining the optimal number of clusters through techniques like the elbow method.
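A per-cluster WCSS can be computed directly from the points and their labels, as in this small sketch on made-up data. Summing the per-cluster values gives the overall objective J, which is the quantity plotted against k in an elbow analysis.

```python
def wcss_per_cluster(points, labels):
    """Within-cluster sum of squares for each cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    result = {}
    for lab, members in clusters.items():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        result[lab] = sum(sum((a - c) ** 2 for a, c in zip(p, centroid))
                          for p in members)
    return result

# Toy example: a tight cluster (label 0) and a looser one (label 1).
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]]
wcss = wcss_per_cluster(points, labels=[0, 0, 1, 1])
print(wcss, "total:", sum(wcss.values()))
```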

The silhouette coefficient, which measures how similar a point is to its own cluster compared to other clusters, is computed as follows:

s (i) = \frac{b (i) - a (i)}{\max \{a (i), b (i)\}},

(14)

where a(i) is the mean distance between point i and all other points in its cluster, and b(i) is the minimum mean distance between point i and points in any other cluster.
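The silhouette coefficient is simple enough to compute by hand for a tiny dataset. The sketch below does so for four made-up one-dimensional points in two clusters; singleton clusters are given s(i) = 0, which is a common convention and an assumption here, since the text does not specify that case.

```python
import math

def silhouette_values(points, labels):
    """Silhouette coefficient s(i) for every point, using Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    values = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            values.append(0.0)  # singleton cluster (convention)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(  # smallest mean distance to any other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        values.append((b - a) / max(a, b))
    return values

points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
mean_silhouette = sum(s) / len(s)
print(s, mean_silhouette)
```

For the first point, a = 1 and b = 10.5, so s = 9.5 / 10.5 ≈ 0.905; averaging s(i) over all points gives the score that is compared across candidate values of k.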

2.4. Support Vector Classification Model

Support Vector Classification (SVC) is a sophisticated machine learning algorithm founded on the structural risk minimization principle. It constructs an optimal hyperplane in a high-dimensional feature space to separate different classes while maximizing the margin between them. This approach effectively balances model complexity with classification accuracy [30,31,32].

The fundamental concept of SVC is expressed through its decision function, which combines a weight vector, bias term, and feature mapping:

f (x) = w ϕ (x) + b,

(15)

This function represents the classifier’s decision boundary, where w is the weight vector determining the orientation of the separating hyperplane, ϕ(x) is a nonlinear mapping function that transforms input data into a higher-dimensional feature space, and b is the bias term adjusting the hyperplane’s position.

To find the optimal hyperplane, SVC formulates a primal optimization problem that minimizes both structural risk and classification errors:

\min \frac{1}{2} {‖w‖}^{2} + C \sum_{i = 1}^{m} (ξ_{i} + ξ_{i}^{*}),

(16)

This optimization is subject to essential constraints that enforce proper classification while allowing for some margin violations:

\begin{array}{l} y_{i} - (w ϕ (x_{i}) + b) \leq ε + ξ_{i} \\ (w ϕ (x_{i}) + b) - y_{i} \leq ε + {ξ_{i}}^{*} \\ ξ_{i}, {ξ_{i}}^{*} \geq 0, i = 1, 2, \dots, m \end{array},

(17)

Through the application of Lagrangian multipliers and optimization theory, the problem transforms into its dual form, which is computationally more tractable. The kernel function K(x_i,x_j) plays a crucial role in enabling nonlinear classification by implicitly mapping data to a higher dimensional space. The linear kernel, being the simplest form, directly computes the dot product in input space:

K (x_{i}, x_{j}) = x_{i} \cdot x_{j},

(18)

The Radial Basis Function (RBF) kernel introduces nonlinearity by applying a Gaussian function to the squared Euclidean distance between points:

K (x_{i}, x_{j}) = \exp (- γ {‖x_{i} - x_{j}‖}^{2}),

(19)

For problems requiring polynomial decision boundaries, the polynomial kernel offers an alternative approach:

K (x_{i}, x_{j}) = {(γ x_{i} \cdot x_{j} + r)}^{d},

(20)
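The three kernels can be written as plain functions, with the polynomial kernel in its usual form (γ x_i · x_j + r)^d. The input vectors and parameter values below are arbitrary illustrative choices, apart from γ = 0.01, which matches the RBF coefficient reported for this study.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    """Dot product in the input space."""
    return dot(a, b)

def rbf_kernel(a, b, gamma):
    """Gaussian function of the squared Euclidean distance."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def polynomial_kernel(a, b, gamma, r, d):
    """Polynomial of degree d in the scaled dot product."""
    return (gamma * dot(a, b) + r) ** d

u, v = [1.0, 2.0], [3.0, 4.0]
print("linear:", linear_kernel(u, v))
print("RBF (gamma = 0.01):", rbf_kernel(u, v, gamma=0.01))
print("polynomial (gamma = 1, r = 1, d = 2):", polynomial_kernel(u, v, 1.0, 1.0, 2))
```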

The final classification decision is made using the sign of the decision function, which incorporates the learned support vectors and their corresponding coefficients:

f (x) = sign (\sum_{i = 1}^{m} (α_{i} - {α_{i}}^{*}) K (x_{i}, x) + b),

(21)

This model demonstrates remarkable capability in handling complex classification tasks through its kernel-based approach and margin maximization strategy. The solution’s sparseness, achieved through support vector identification, ensures computational efficiency during prediction while maintaining robust generalization performance. The hyperparameter C and kernel-specific parameters provide flexible control over the model’s behavior, allowing practitioners to adapt the classifier to specific problem requirements and data characteristics.

For the SVC implementation in this study, parameter optimization was conducted to ensure optimal classification performance. The radial basis function (RBF) kernel was selected after comparative testing with linear and polynomial kernels, as it demonstrated superior ability in handling the nonlinear relationships in coal spontaneous combustion data. The key parameters were determined through grid search with 5-fold cross-validation: the penalty parameter C was set to 10.0 to balance model complexity and classification accuracy, while the kernel coefficient γ was optimized to 0.01 to control the influence radius of support vectors.
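The structure of a grid search with 5-fold cross-validation can be sketched without any particular classifier. In the code below, `score_fn` is a hypothetical callback that would train an RBF-kernel SVC on the training indices and return its accuracy on the test indices. The `toy_score` function is an artificial score surface whose peak was placed at the reported values (C = 10, γ = 0.01) purely so that the example has a known answer; it does not reproduce the study's search or say anything about the real data. The candidate grids are also assumptions.

```python
import itertools
import math

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def grid_search(param_grid, score_fn, n_samples, k=5):
    """Return the parameter combination with the best mean cross-validated score."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for combo in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [score_fn(params, train, test)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_score(params, train_idx, test_idx):
    """Artificial stand-in for 'fit on train_idx, evaluate on test_idx'."""
    return (-(math.log10(params["C"]) - 1.0) ** 2
            - (math.log10(params["gamma"]) + 2.0) ** 2)

param_grid = {"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.001, 0.01, 0.1, 1.0]}
best_params, best_score = grid_search(param_grid, toy_score, n_samples=336, k=5)
print(best_params)
```

In a real run the samples would normally be shuffled (and preferably stratified by warning level) before being split into folds; the contiguous folds here keep the sketch short.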

3. Results and Discussion

3.1. Analysis of Warning Level Classification

To achieve the optimal dimensionality reduction, Maximum Likelihood Estimation (MLE) was employed as the criterion for determining intrinsic dimensionality. Compared to other estimation methods, MLE has superior convergence properties and is, in theory, asymptotically unbiased. The MLE results indicated that compressing the sample variables to two dimensions preserved the essential data characteristics. The data were therefore embedded into a two-dimensional space, which also allows intuitive visualization. Furthermore, to evaluate the dimensionality reduction performance of t-SNE, both PCA and SNE methods were applied to the original data using identical procedures. The two-dimensional visualizations of the coal spontaneous combustion data under the different dimensionality reduction methods are shown in Figure 3. PCA, as a linear dimensionality reduction method, resulted in significant data stacking in two-dimensional space and failed to compress the data effectively. In contrast, both SNE and t-SNE methods successfully mapped nonlinear data from high-dimensional space to low-dimensional space with better preservation of the data structure.

The PCA projection shows significant overlap between data points and lacks clear boundaries between different regions, indicating its limitation in capturing nonlinear relationships. The SNE method demonstrates better separation of data points but still exhibits some clustering inconsistencies at region boundaries. The t-SNE projection achieves the most distinct separation of data points with clear boundaries between different regions, suggesting its superior ability to preserve both local and global data structures. These visualization results demonstrate that t-SNE is more effective at capturing the intrinsic nonlinear relationships in coal spontaneous combustion data compared to traditional dimensionality reduction methods.

The k-means clustering algorithm was employed to classify the warning levels for coal spontaneous combustion. The first crucial step was determining the optimal number of clusters (k) for partitioning the combustion data, as both excessive and insufficient cluster numbers would compromise the warning system’s effectiveness. Therefore, the range of k values was set between 2 and 7, and the clustering performance was evaluated using silhouette coefficients for each k value, as illustrated in Figure 4. The analysis revealed that the clustering performance improved with increasing k values. Overall, t-SNE demonstrated superior clustering effectiveness, achieving a peak silhouette coefficient of 0.95 when k = 6, indicating both optimal cluster separation and cohesion.

Based on the maximum silhouette coefficient values, the coal spontaneous combustion warning levels were classified into seven levels using PCA, while both the t-SNE and SNE methods yielded six distinct levels. Subsequently, iterative clustering was performed for each method independently, with the iteration process illustrated in Figure 5.

The convergence patterns of the loss functions reveal the important characteristics of each method. PCA shows rapid initial convergence but stabilizes at a relatively high loss value, indicating potential information loss during the dimensionality reduction. SNE exhibits moderate convergence speed with intermediate loss values, suggesting a balance between data compression and information preservation. The t-SNE method, while showing slightly slower initial convergence, achieves the lowest final loss value, demonstrating its superior ability to maintain data relationships during the dimensionality reduction. The stabilization of all three methods after 15 iterations indicates that further iterations would not significantly improve the clustering results, confirming the optimization process’s effectiveness.

The final clustering outcomes after convergence are presented in Figure 6.

After dimensionality reduction and clustering, the coal spontaneous combustion data was partitioned into six clusters (Clusters 1–6), though these clusters were not inherently ordered. To interpret the significance of each cluster, we analyzed the corresponding coal combustion temperatures and assessed the cluster risk levels based on temperature variations, thereby establishing the warning levels.

PCA clustering yielded seven clusters, with the temperature frequency distributions shown in Figure 7. Clusters 5 and 1 predominantly span 30–120 °C, Clusters 3 and 6 are concentrated within 90–160 °C, and Clusters 7, 2, and 4 are mainly distributed across 140–180 °C. Cluster 5 appears the earliest with the highest frequency and disappears at 130 °C, indicating the lowest risk. Cluster 1 follows, with an increasing frequency until it disappears at 170 °C. Cluster 4 appears the latest and is predominantly distributed between 170 and 180 °C, representing the highest risk, followed by Cluster 2. However, PCA’s seven-level classification resulted in some clusters containing insufficient data points, making the risk level determination through peak analysis unreliable.

SNE clustering results, illustrated in Figure 8, reveal several limitations in classification effectiveness. The overall frequency distributions show relatively low and inconsistent patterns across different clusters, suggesting suboptimal data separation. Figure 8 demonstrates a particularly problematic aspect where Cluster 5 spans across the entire temperature range (30–180 °C), violating the physical principle of progressive risk development in coal spontaneous combustion. This unrealistic distribution indicates that SNE fails to capture the underlying temperature-dependent relationships between indicator gases. Moreover, the highly concentrated nature of SNE-reduced data manifests in significant cluster overlap, particularly in the middle temperature ranges (90–150 °C), where multiple clusters share substantial common regions. This overlap creates ambiguity in risk level determination, as similar gas concentration patterns are assigned to different clusters without clear physical justification. The lack of distinct boundaries between the adjacent clusters and the absence of clear temporal progression in the cluster appearances make it challenging to establish reliable temperature thresholds for different risk levels. These limitations stem from SNE’s tendency to compress data points too closely in the reduced dimensional space, potentially losing important structural information about the relationships between different risk levels. Unlike t-SNE, which maintains both local and global data structures, SNE’s focus on preserving local structures results in less interpretable clustering outcomes for coal spontaneous combustion warning level classification.

The t-SNE clustering results, presented in Figure 9, demonstrate distinct distribution ranges for each cluster. High-risk and low-risk clusters exhibit skewed distributions toward temperature extremes, while medium-risk clusters approximate normal distributions centered around intermediate temperatures. Figure 9 shows Cluster 3 appearing last, indicating the highest risk, followed by decreasing risk levels in Clusters 5, 1, and 2. Figure 9 also reveals Cluster 4 disappearing last, representing the lowest risk, with Clusters 6 and 2 showing progressively higher risk levels.

The t-SNE clustering results reveal intricate distribution patterns that effectively characterize the progression of coal spontaneous combustion risk levels. The temperature distributions within each cluster demonstrate clear differentiation with minimal overlap, enabling precise risk level classification. Higher risk clusters (3 and 5) exhibit more concentrated distributions, particularly in the elevated temperature ranges, indicating well-defined conditions for severe risk scenarios. In contrast, the moderate-risk clusters (1 and 2) show broader distributions, reflecting the gradual transition nature of the spontaneous combustion process. Most notably, the temporal evolution of the cluster appearances and disappearances closely aligns with the physical progression of coal spontaneous combustion, where lower-risk clusters dominate the early stages and higher-risk clusters emerge as temperatures increase. This natural progression validates the clustering methodology and provides a reliable foundation for the warning level classification system. The clear separation between the clusters, combined with their logical temporal sequence, demonstrates the effectiveness of t-SNE-based clustering in capturing both the spatial and temporal characteristics of spontaneous combustion risk development.

Based on the comprehensive analysis of temperature distributions and temporal patterns, six distinct warning levels were established for coal spontaneous combustion risk assessment. These levels progress from “Safe” (representing minimal risk conditions), through “Low Risk”, “Moderate Risk”, and “High Risk”, to “Severe Risk”, and ultimately “Critical Risk” (indicating the most dangerous conditions). Each of these warning levels corresponds to specific cluster assignments as detailed in Table 1, providing a systematic framework for risk evaluation and monitoring.
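The assignment of cluster numbers to warning levels can be illustrated with a short sketch. The study orders the clusters by when they appear and disappear along the temperature axis in Figure 9; the code below substitutes a simpler proxy, the median temperature of each cluster, which is an assumption made here and not the authors' stated rule. The sample pairs are hypothetical and were chosen so that the resulting order matches Table 1.

```python
from statistics import median

# Warning levels in ascending order of risk, as named in the text and Table 1.
LEVELS = ["Safe", "Low Risk", "Moderate Risk",
          "High Risk", "Severe Risk", "Critical Risk"]

# Hypothetical (temperature in °C, cluster id) pairs; NOT the paper's data.
samples = [
    (30, 4), (40, 4), (50, 6), (60, 6), (80, 2), (90, 2),
    (110, 1), (120, 1), (140, 5), (150, 5), (170, 3), (180, 3),
]


def assign_levels(pairs, levels):
    """Rank clusters by median temperature and pair them with warning levels."""
    temps_by_cluster = {}
    for temp, cluster in pairs:
        temps_by_cluster.setdefault(cluster, []).append(temp)
    if len(temps_by_cluster) != len(levels):
        raise ValueError("number of clusters must equal number of warning levels")
    ordered = sorted(temps_by_cluster, key=lambda c: median(temps_by_cluster[c]))
    return dict(zip(ordered, levels))


mapping = assign_levels(samples, LEVELS)
for cluster, level in mapping.items():
    print(cluster, level)
```

Because k-means cluster numbers are arbitrary, a ranking step of this kind is needed each time the clustering is rerun; the numbers 4, 6, 2, 1, 5, 3 in Table 1 carry no meaning on their own.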

These comprehensive analyses of gas indicators, their correlations, and temporal evolution patterns provide a solid foundation for developing more effective monitoring strategies. The insights gained from this study can be directly applied to optimize early warning systems and improve safety measures in coal storage and mining operations. This classification system effectively captures the progressive nature of spontaneous combustion risk while maintaining practical utility for implementation in real-world monitoring scenarios.

3.2. Engineering Application and Discussion

The coal samples from the 3302 fully mechanized caving face of the Longgu Coal Mine consist of 1/3 coking coal and fat coal, which are prone to spontaneous combustion with a natural ignition period of 41 days. To validate the model’s universality and accuracy, 10 sets of data were selected for SVC model prediction, as shown in Table 2. These datasets include routine tube bundle monitoring data and representative measurements from before and after spontaneous combustion events in the goaf area.

No C₂H₄ or C₂H₂ was detected in the field data. Since C₂H₄ is typically produced during high-temperature coal pyrolysis, its absence indicates that the coal had not reached pyrolysis temperatures, which rules out a classification at the highest warning level.
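The programmed-temperature data in Table A1 illustrate why these two gases serve as high-temperature markers. The sketch below reads the C₂H₄ and C₂H₂ columns of that table and reports the first temperature step at which each gas is detected. It covers only the single sample listed in Table A1, so the onset temperatures should not be read as general thresholds; the helper `first_detection` is a name introduced here.

```python
# Columns copied from Table A1 (one coal sample).
temperature_c = list(range(30, 181, 10))                      # 30, 40, ..., 180 °C
c2h4_ppm = [0.0] * 11 + [5.97, 10.11, 18.19, 22.12, 28.47]    # first non-zero at 140 °C
c2h2_ppm = [0.0] * 13 + [1.96, 3.55, 7.94]                    # first non-zero at 160 °C


def first_detection(temps, concentrations):
    """Return the first temperature with a non-zero reading, or None if never detected."""
    for temp, conc in zip(temps, concentrations):
        if conc > 0:
            return temp
    return None


c2h4_onset = first_detection(temperature_c, c2h4_ppm)
c2h2_onset = first_detection(temperature_c, c2h2_ppm)
print(c2h4_onset, c2h2_onset)

# A series with no detections, as reported for the field data, yields None.
field_onset = first_detection(temperature_c, [0.0] * len(temperature_c))
print(field_onset)
```

For this sample, neither gas appears below 140 °C, which is consistent with treating their absence in the field as evidence that the highest warning level had not been reached.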

Based on the t-SNE method analysis, the warning levels demonstrated a clear progression of risk indicators. Under normal monitoring conditions, the system registered as “Safe”. The risk level elevated to “Low Risk” when minor increases in CO concentration were detected alongside decreasing oxygen levels. As spontaneous combustion events developed, characterized by sharp increases in CO and CO₂ concentrations, significant oxygen depletion, and C₂H₆ detection, the system escalated to “High Risk”. Finally, the warning level reached “Severe Risk” when the fire spread intensified and the CO concentrations exceeded critical thresholds.

The clustering and prediction results were credible, confirming that unsupervised learning methods are suitable for categorizing coal spontaneous combustion warning levels. When combined with the SVC model, this approach enables effective early warning of coal spontaneous combustion in goaf areas.
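The overall workflow (t-SNE embedding, k-means clustering, then an SVC trained on the cluster labels) can be sketched with scikit-learn. This is an illustration under several assumptions and not the authors' implementation: it uses only the 16 rows of Table A1 instead of the full 336-sample database, restricts the features to the six gases that also appear in Table 2, and picks the perplexity, random seeds, and RBF kernel arbitrarily. The cluster numbers it produces are therefore not comparable with those in Table 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Table A1 rows; columns: CO2 %, C2H6 ppm, CH4 ppm, CO ppm, O2 %, N2 %.
X_lab = np.array([
    [0.05, 0.00, 0.54, 0.00, 19.42, 78.67],
    [0.11, 0.00, 2.35, 0.00, 18.98, 79.23],
    [0.59, 0.00, 5.30, 7.16, 18.49, 80.66],
    [0.64, 0.00, 14.61, 12.40, 17.71, 75.99],
    [0.73, 1.04, 52.63, 26.92, 17.08, 75.00],
    [0.95, 5.40, 72.63, 96.71, 16.94, 75.86],
    [1.22, 6.65, 95.16, 153.74, 16.68, 76.73],
    [1.46, 10.54, 109.46, 183.45, 16.53, 77.47],
    [1.65, 14.18, 137.96, 219.70, 16.28, 77.62],
    [3.15, 38.90, 222.38, 835.21, 14.93, 77.18],
    [4.41, 69.87, 227.53, 2102.48, 12.60, 80.10],
    [5.48, 106.10, 238.79, 2725.89, 10.76, 81.99],
    [7.38, 82.44, 252.63, 4142.65, 8.04, 84.88],
    [9.58, 122.92, 327.96, 5865.54, 6.67, 86.77],
    [10.88, 196.51, 517.61, 6366.78, 4.13, 88.39],
    [12.76, 253.42, 578.61, 6909.85, 3.13, 90.37],
])

# Step 1: standardize, then embed in 2-D. Perplexity must stay below the
# sample count (16 here); the value 5 is an assumption for this tiny set.
X_scaled = StandardScaler().fit_transform(X_lab)
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X_scaled)

# Step 2: k-means on the embedding, with six clusters as in the study.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(embedding)
score = silhouette_score(embedding, labels)

# Step 3: train an SVC on the ORIGINAL gas features with the cluster labels.
# The RBF kernel is an assumption made for this sketch.
classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
classifier.fit(X_lab, labels)

# Classify one field measurement (row 6 of Table 2).
field_sample = np.array([[0.95, 6.20, 79.0, 107.0, 17.95, 81.05]])
predicted_cluster = int(classifier.predict(field_sample)[0])
print(round(float(score), 3), predicted_cluster)
```

The classifier is trained on the original gas concentrations because scikit-learn's `TSNE` has no `transform` method for unseen samples: the embedding is used only to derive labels, and new field measurements are classified directly from their gas readings.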

4. Conclusions

(1) The proposed unsupervised learning approach combining t-SNE and k-means clustering demonstrated superior performance in classifying coal spontaneous combustion warning levels compared to traditional PCA and SNE methods. The t-SNE method achieved optimal clustering with six distinct warning levels and a peak silhouette coefficient of 0.95, indicating excellent cluster separation and cohesion. This classification framework provides a more objective and reliable basis for risk assessment compared to conventional supervised learning methods.

(2) The analysis of temperature distributions and gas concentration patterns revealed strong nonlinear relationships between indicator gases and warning levels, with maximum MIC values of 0.95 and 0.81 between temperature and the CO and O₂ concentrations, respectively. The t-SNE method effectively captured these nonlinear relationships while preserving both local and global data structures, enabling more accurate risk level classification based on multiple indicator gases rather than on single-parameter thresholds.

(3) The field application at the Longgu Coal Mine validated the practical effectiveness of the integrated t-SNE and SVC approach. The system successfully detected and classified different risk levels based on gas concentration patterns, even in the absence of high-temperature indicators like C₂H₄. This demonstrated the model’s robustness in real-world conditions and its capability to provide an early warning for spontaneous combustion risks in goaf areas, making it a valuable tool for mine safety management.

Author Contributions

Conceptualization, P.Z. and X.C.; Methodology, P.Z.; Data curation, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 52174206.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Partial data from programmed temperature experiment.

Temperature/°C	CO₂/%	C₂H₄/ppm	C₂H₆/ppm	CH₄/ppm	CO/ppm	C₂H₂/ppm	O₂/%	N₂/%
30	0.05	0	0	0.54	0	0	19.42	78.67
40	0.11	0	0	2.35	0	0	18.98	79.23
50	0.59	0	0	5.30	7.16	0	18.49	80.66
60	0.64	0	0	14.61	12.40	0	17.71	75.99
70	0.73	0	1.04	52.63	26.92	0	17.08	75.00
80	0.95	0	5.40	72.63	96.71	0	16.94	75.86
90	1.22	0	6.65	95.16	153.74	0	16.68	76.73
100	1.46	0	10.54	109.46	183.45	0	16.53	77.47
110	1.65	0	14.18	137.96	219.70	0	16.28	77.62
120	3.15	0	38.90	222.38	835.21	0	14.93	77.18
130	4.41	0	69.87	227.53	2102.48	0	12.60	80.10
140	5.48	5.97	106.10	238.79	2725.89	0	10.76	81.99
150	7.38	10.11	82.44	252.63	4142.65	0	8.04	84.88
160	9.58	18.19	122.92	327.96	5865.54	1.96	6.67	86.77
170	10.88	22.12	196.51	517.61	6366.78	3.55	4.13	88.39
180	12.76	28.47	253.42	578.61	6909.85	7.94	3.13	90.37

Figure 1. Schematic diagram of programmed oil bath experimental system.

Figure 2. MIC values between temperature and indicator gases.

Figure 3. Two-dimensional visualization results of coal spontaneous combustion data: (a) Principal Component Analysis (PCA) projection; (b) Stochastic Neighbor Embedding (SNE) projection; and (c) t-distributed Stochastic Neighbor Embedding (t-SNE) projection.

Figure 4. Silhouette coefficients under different dimension reduction methods.

Figure 5. Loss function value of iteration process.

Figure 6. Clustering visualization results using different dimensionality reduction methods: (a) PCA clustering results with 7 clusters and center points; (b) SNE clustering results with 6 clusters and center points; and (c) t-SNE clustering results with 6 clusters and center points.

Figure 7. Classification of coal spontaneous combustion warning levels with PCA.

Figure 8. Classification of coal spontaneous combustion warning levels with SNE.

Figure 9. Classification of coal spontaneous combustion warning levels with t-SNE.

Table 1. The warning levels corresponding to the clustering results.

Sequence	t-SNE Cluster	Warning Level
1	4	Safe
2	6	Low Risk
3	2	Moderate Risk
4	1	High Risk
5	5	Severe Risk
6	3	Critical Risk

Table 2. Gas concentration monitoring data and warning level classification results from 3302 fully mechanized caving face of Longgu Coal Mine.

Sequence	CO₂/%	C₂H₆/ppm	CH₄/ppm	CO/ppm	O₂/%	N₂/%	Warning Level (t-SNE)
1	0.06	0	0	0	20.56	79.38	Safe
2	0.06	0	0	0	20.37	79.57	Safe
3	0.53	0	5	9	19.39	80.08	Low Risk
4	0.68	0	6	8	19.22	80.10	Low Risk
5	0.75	0	3	5	19.63	79.57	Low Risk
6	0.95	6.20	79	107	17.95	81.05	High Risk
7	0.91	8.55	86	132	16.97	81.10	High Risk
8	1.45	13.53	132	192	15.21	83.30	Severe Risk
9	1.64	13.31	112	199	15.18	83.16	Severe Risk
10	1.59	15.64	136	226	13.94	84.44	Severe Risk

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
