Article

Combined Prediction of Dust Concentration in Opencast Mine Based on RF-GA-LSSVM

1 School of Energy Engineering, Xi’an University of Science and Technology, Xi’an 710054, China
2 Shaanxi Coalbed Methane Development Co., Ltd., Xi’an 719000, China
3 North Weijiamao Power and Coal Co., Ltd., Ordos 017000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8538; https://doi.org/10.3390/app14188538
Submission received: 2 August 2024 / Revised: 14 September 2024 / Accepted: 18 September 2024 / Published: 23 September 2024
(This article belongs to the Section Ecology Science and Engineering)

Abstract: Accurate prediction of dust concentration is essential for effectively preventing and controlling mine dust. The environment of opencast mines is intricate, with numerous factors influencing dust concentration, making accurate prediction challenging. To enhance the prediction accuracy of dust concentration in these mines, a combined prediction algorithm based on RF-GA-LSSVM is developed. First, the random forest (RF) algorithm is employed to identify key features from the meteorological and dust concentration data collected on site, ultimately selecting five indicators—temperature, humidity, stripping amount, wind direction, and wind speed—as the input variables for the prediction model. Next, the data are split into a training set and a test set at a 7:3 ratio, and the genetic algorithm (GA) is applied to optimize the least squares support vector machine (LSSVM) model for predicting dust concentration in opencast mines. Model evaluation metrics and testing methods are also established. Compared with the LSSVM, PSO-LSSVM, ISSA-LSSVM, GWO-LSSVM, and other prediction models, the GA-LSSVM model achieves a final fitting degree of 0.872 for PM2.5 concentration data and 0.913 for PM10 concentration data, clearly demonstrating strong predictive performance with low error and a high degree of fit. The research results can serve as a foundation for developing dust control measures in opencast mines.

1. Introduction

Opencast mining is an important method for extracting mineral resources. The production process includes drilling, blasting, mining, loading, transportation, and unloading. During these operations, a significant amount of rock crushing will generate highly concentrated composite dust. The small composite dust particles float in the air for extended periods, posing a threat to workers’ health, hastening mechanical wear, diminishing visibility in the workspace, and polluting the ecological environment [1]. Dust pollution is a significant environmental challenge encountered by opencast mines. Therefore, it is crucial to predict the dust concentration in opencast mines and to provide guidance for effective dust prevention and control measures.
Dust concentration prediction methods can generally be categorized into four types: qualitative and semi-quantitative predictions, linear regression predictions, machine learning predictions, and combined predictions [2]. The prediction of dust concentration has evolved from traditional methods such as life tables, macro measurements, and empirical analogies to more advanced methods such as time series analysis, linear regression, and grey theory. In recent years, this has expanded to include combined methods such as artificial neural networks, random forests, support vector machines, and machine learning optimized by biologically inspired intelligent algorithms. For example, Wang Zhiming and colleagues proposed a new method for predicting dust concentration that combines meteorological parameters and production intensity. This method uses weather forecast data and mine production intensity data as model inputs and integrates various machine learning techniques to develop a new model for predicting daily dust concentration in opencast mines. The results indicate that incorporating production intensity significantly enhances the model’s performance [3]. Balaga et al. introduced and formulated a functional model for predicting dust concentration using a power function algorithm. The model can forecast the distribution state of dust without relying on empirical data and can establish the distribution characteristics of PM10, PM4, and PM2.5 dust particles [4]. Tripathy et al. focused on monitoring dust concentration in various working areas of a mine, characterizing dust from different sources, and assessing personal dust exposure; AERMOD software (https://www.weblakes.com/software/air-dispersion/aermod-view/) was used to forecast the dust concentration at various locations within the mine and its surrounding areas [5]. Luan Boyu et al. introduced an enhanced machine learning approach to improve the estimation of dust concentration. The method combines a random forest and a Markov chain (RF-MC) to establish a model; after the Markov correction is applied, the root mean square error is significantly reduced, and the Pearson correlation coefficient and mean absolute error also improve markedly [6]. Wang Meng et al. developed a CNN-BiLSTM attention multivariate hybrid model, which incorporates a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and an attention mechanism, to predict PM2.5 concentration for the following 24 h. The model fits well and has a low error rate [7].
The aforementioned research demonstrates how the existing prediction algorithm is utilized in the context of opencast mining environments. Machine learning, deep learning, and various algorithms discussed in the literature can achieve superior predictive outcomes compared with conventional experimental techniques. However, they also have inherent drawbacks, including a strong reliance on data accuracy, low efficiency, extended timeframes, and a tendency to become stuck in local optima [8]. Moreover, it is common to encounter local minimization issues with individual algorithms. Relying solely on primary algorithms like decision trees, support vector machines, and neural networks often makes it challenging to attain optimal prediction results. However, the likelihood of getting stuck in a local minimum can be significantly decreased when using a combined approach through iterative processes.
With the widespread application of biologically inspired optimization algorithms, the leading prediction techniques today are integrated models that combine machine learning with such optimization [9]. An increasing number of researchers use the particle swarm optimization algorithm to optimize the parameters of prediction models. The genetic algorithm, with its global optimization capability and high parallelism, is even more popular than particle swarm optimization: it is less prone to falling into local optima and offers high accuracy and computational efficiency [10]. For example, Zhou Changwei et al. developed a genetic algorithm to optimize a BP neural network model for predicting the time series of dust concentration at a mine working face and verified that it met high-precision prediction requirements [11]. Wang Bo et al. proposed a BP neural network model based on genetic optimization to predict the dust concentration at a coal mining face; its accuracy improved significantly compared with the traditional model [12]. The support vector machine (SVM) has been widely used because of its strong generalization ability, rapid learning speed, and effective noise resistance. To improve the classification and regression learning capabilities of support vector machines and enhance prediction accuracy, scholars have proposed an enhanced model, the least squares support vector machine (LSSVM). Amin Bemani et al. developed a hybrid particle swarm optimization and genetic algorithm (HGAPSO) to optimize an LSSVM model for predicting the cetane number of biodiesel, comparing the performance of three combined models: HGAPSO-LSSVM, PSO-LSSVM, and GA-LSSVM [13]. Liu Yang et al. proposed a hybrid intelligent prediction model combining the random forest (RF) and least squares support vector machine (LSSVM) algorithms: the RF algorithm establishes the optimal index set for predicting the chloride ion penetration resistance of high-performance concrete (HPC), and the LSSVM algorithm then accurately predicts that resistance [14]. Yu Chengbing et al. developed a model for predicting dye concentration based on PSO-LSSVM; the results show that the efficiency (Ep) of the model is above 96%, indicating high prediction accuracy and strong dye specificity [15]. Pan Xi et al. applied a method based on the least squares support vector machine (LSSVM) and the genetic algorithm (GA) to predict the coefficient of performance (COP) and regulate the load of each unit in a chiller system; the results indicate that the GA-optimized LSSVM model is sufficiently accurate for predicting COP [16]. Ma Li et al. used Fourier series to model the shape of a blasting muck pile and established a GA-LSSVM prediction model for its shape and effect, achieving high prediction accuracy [17]. While LSSVM is not yet common in mine dust prediction, it has been successfully applied to predicting telecom traffic [18], wind energy storage [19], blasting effects [20], and other problems.
The environment of opencast mines is intricate and constantly changing, leading to significant fluctuations in dust concentration due to various influencing factors. Without a comprehensive analysis of the dust distribution patterns, achieving accurate predictions becomes challenging. Thus, accurately identifying the factors that affect dust concentration is crucial in the realm of predicting dust levels in opencast mining. In their study on the influencing factors of dust concentration, Wang Zhiming et al. analyzed the factors affecting dust concentration in the Haerwusu opencast coal mine. They determined the order of influencing factors as follows: coal production > boundary layer height > wind speed > temperature difference > temperature > humidity [21].
Liu Zhigao conducted an analysis of dust concentration characteristics and the influence of meteorological factors such as temperature, humidity, wind speed, and air pressure on dust levels, using data on dust concentration and weather conditions. The research also explained the reasons behind the variations in dust concentration. Findings revealed that dust levels at the open-pit mine peaked in March, November, and during winter, while they were lower in summer and autumn. The daily changes in humidity and temperature across different seasons displayed a “herringbone” and “inverted herringbone” pattern, respectively. Additionally, a dust concentration prediction model utilizing a long short-term memory (LSTM) neural network was created, achieving a fitting degree of approximately 0.88 in its predictions [22]. Additionally, in the interdisciplinary domain, many researchers utilize the random forest method for classification and regression tasks. One significant benefit of random forest is its ability to perform feature selection, allowing for the identification of the most important features from a larger set. This approach helps to remove less relevant factors, thereby enhancing the model’s predictive accuracy [23]. Qi Chongchong et al. proposed a hybrid particle swarm optimization algorithm (PSO) to optimize the estimated PM concentration of the random forest (RF) model. The input variables included five indicators: wind direction, wind speed, temperature, humidity, and noise [24].
In conclusion, while research on dust prediction in opencast mines has yielded significant results and theoretical methods continue to be refined, the technology remains imperfect. When predicting dust concentration from data with many features, researchers still tend to rely on a single, fixed prediction approach, and statistical tools and machine learning algorithms each have their respective strengths and weaknesses (see Table 1). Additionally, the factors affecting dust concentration have not been thoroughly explored, and prediction indicators are often selected subjectively [25]. Research on time series data of dust concentration in open-pit mines is also limited, which makes accurate prediction in these environments more challenging. To address these issues, this paper designs a real-time monitoring system for meteorological conditions and dust concentration in open-pit mines, based on the existing literature. The system enables online monitoring of temperature, humidity, wind speed, wind direction, rainfall, noise, TSP, PM10, and PM2.5 in open-pit mines. A screening method for the dust concentration prediction index set based on random forest (RF) is proposed, which quantitatively screens out the optimal index set; the resulting prediction index system for the open-pit mine addresses the subjective selection of dust concentration prediction indices. The genetic algorithm (GA) is then used to optimize the least squares support vector machine (LSSVM), and a combined GA-LSSVM prediction model for dust concentration in open-pit mines is established. This approach overcomes the premature convergence, local convergence, and slow optimization speed associated with single algorithms, resulting in more accurate predictions.
The research results can provide theoretical guidance for developing dust prevention and control measures in open-pit mines.

2. Monitoring Project Overview

2.1. Mine Overview

The Weijiamao open-pit mine is situated in Weijiamao Town, Zhungeer Banner, Ordos City, Inner Mongolia. The mine was completed and began production in 2011. The mining method is open-pit mining, the production scale is 15 million tons per year, and the total area of the mining site is approximately 52.6 square kilometers. The mining area lies in the semi-arid region of Inner Mongolia, which experiences a typical continental arid climate: cold in winter and hot in summer, with low total rainfall and a large temperature difference between day and night. The annual average temperature is 5.3–7.6 °C; the highest temperature is 38.4 °C and the lowest is −36.3 °C. The freezing period typically spans from October to April of the following year, with the maximum depth of frozen soil reaching 1.50 m. Total annual precipitation ranges from 231 mm to 459 mm, with the majority concentrated in July, August, and September, accounting for 60% to 70% of the annual total. As a result of the low air humidity, water evaporates easily, and daily mining operations generate a significant amount of dust through wear, mining, transportation, and drainage, leading to severe dust pollution [32].

2.2. Monitoring Point Layout and Data Acquisition

Two monitoring equipment units, designated No.1 and No.2, were installed in the coal mine area, as shown in Figure 1. Each monitoring instrument was approximately 2 m tall. In this paper, the statistical variables include dust concentration, temperature, humidity, wind speed, wind direction, wind force, rainfall, and production intensity (stripping amount, noise). The data were collected from 17:30 on 7 August 2023 to 9:00 on 18 August 2023 at a 10 min acquisition interval, yielding approximately 1500 sets of data over roughly ten days. The stripping amount data were provided by the mine production technology department, while the other data were exported from the real-time online monitoring system through its client. The specific technical parameters of the equipment are presented in Table 2.

2.3. Analysis of Experimental Data Processing

2.3.1. Data Pre-Processing

During dust data monitoring, interruptions may occur due to equipment power failures, abnormal peaks caused by trucks bumping along transportation roads, and other uncertain factors such as human error and natural events, so some data values are missing or abnormal. Data comparison showed that the dust concentration at Monitoring Point No.1 fluctuated significantly, likely because that monitoring point is located about 6 m from the center of the transportation road, on the side where many trucks pass. Therefore, only Monitoring Point No.2 is analyzed and used for prediction in the remainder of this paper.
The research uses dust concentration data from Monitoring Point No.2, collected by the real-time online monitoring system. To improve data quality and make it easier to observe and analyze how each variable changes over time, noise reduction was performed first to address existing data gaps and anomalies. A weighted average was then used to reduce the data frequency from ten-minute to hourly intervals, resulting in approximately 250 sets in total. The specific steps for handling missing and abnormal data are as follows.
1. Missing data
The presence of missing values not only complicates in-depth data analysis but also hinders the development of a stable and accurate model. Dust concentration data are missing for the intervals 12:00–13:00 on August 10 and 21:26–22:23 on August 11 because the monitoring instrument’s power supply was switched off; these gaps were restored and repaired as described below.
Three methods are commonly used: the mean interpolation method, the weighting method, and direct deletion. When the proportion of missing values is small, they can be directly discarded or filled in manually; when the missing data are relatively significant to the overall dataset, the mean interpolation method is needed to fill in the missing values. Here, there are 12 instances of missing data, and the gaps are filled using mean interpolation.
F(d, t) = [F(d − 1, t) + F(d − 2, t) + F(d + 1, t) + F(d + 2, t)] / 4
where F(d, t) represents the filled data value, d is the day of the missing data, t is the corresponding collection point, F(d − 1, t) and F(d − 2, t) represent the data at the same time on the previous day and the day before that, and F(d + 1, t) and F(d + 2, t) represent the data at the same time on the next day and the day after that.
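As an illustration, the mean interpolation formula above can be sketched in a few lines of Python. The (day, time) keys and data values here are made up for demonstration and are not taken from the mine dataset.

```python
def fill_missing(series, d, t):
    """Mean interpolation: average the readings taken at the same
    time t on the two days before and the two days after day d."""
    return (series[(d - 1, t)] + series[(d - 2, t)]
            + series[(d + 1, t)] + series[(d + 2, t)]) / 4.0

# Toy readings keyed by (day, hour); the value for day 10 is missing.
data = {(8, 12): 30.0, (9, 12): 34.0, (11, 12): 38.0, (12, 12): 42.0}
data[(10, 12)] = fill_missing(data, 10, 12)  # restore the gap on day 10
```

With these toy values, the restored reading is the arithmetic mean of the four neighboring days at the same hour.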
2. Abnormal data
The monitoring equipment is installed near the road, so abnormal fluctuations in the data caused by passing trucks and road unevenness need to be eliminated or smoothed out. Because the sample size is small, a few outliers were identified by direct visual inspection. Two common anomaly processing methods, the horizontal processing method and the longitudinal processing method, are used to refine the data: the horizontal method suits data with strong continuity, while the longitudinal method suits data with obvious periodicity.
In the dust-related data, the particle concentration data exhibit strong continuity, so the horizontal processing method is used for correction. The specific operation compares each point with the data one hour before and one hour after it; if the difference from both neighbors exceeds the threshold, the point is judged abnormal and is replaced by the mean of the two neighboring values. The threshold is one third of the mean of the entire sample.
|F(d, t) − F(d, t − 1)| > Ω(t)
|F(d, t) − F(d, t + 1)| > Ω′(t)
F(d, t) = [F(d, t − 1) + F(d, t + 1)] / 2
where Ω(t) and Ω′(t) represent thresholds, F(d, t) represents the data at time t on day d, and F(d, t − 1) and F(d, t + 1) represent the data at the previous and next moments, respectively.
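A minimal sketch of the horizontal method in Python, assuming (as the text states) a single threshold equal to one third of the sample mean and replacement by the mean of the two neighbors; the input values are illustrative only.

```python
def correct_horizontal(values):
    """Horizontal method: a point is abnormal when it differs from both
    its previous and next reading by more than one third of the sample
    mean; it is then replaced by the mean of those two neighbors."""
    threshold = sum(values) / len(values) / 3.0
    out = list(values)
    for i in range(1, len(values) - 1):
        if (abs(values[i] - values[i - 1]) > threshold
                and abs(values[i] - values[i + 1]) > threshold):
            out[i] = (values[i - 1] + values[i + 1]) / 2.0
    return out

# A single spike (90.0) amid otherwise smooth hourly concentrations.
smoothed = correct_horizontal([30.0, 31.0, 90.0, 32.0, 33.0])
```

The spike exceeds the threshold relative to both neighbors and is replaced by their mean, while the smooth points are left untouched.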
The meteorological data for temperature and humidity exhibit clear daily periodicity, so abnormal temperature and humidity data are corrected using the longitudinal processing method. The specific operation compares each reading with the mean of the data at the same time over recent days. If the deviation exceeds the threshold and the reading is greater than the mean, the reading is replaced by the sum of the mean and the threshold; if the deviation exceeds the threshold and the reading is less than the mean, it is replaced by the mean minus the threshold [33].
|F(d, t) − m(t)| > l(t)
F(d, t) = m(t) + l(t),  F(d, t) > m(t)
F(d, t) = m(t) − l(t),  F(d, t) < m(t)
where m(t) is the mean of the data at time t over the past 5 days and l(t) is one third of m(t).
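The longitudinal correction can likewise be sketched directly from the three formulas above; the temperature values below are hypothetical.

```python
def correct_longitudinal(value, recent_same_time):
    """Longitudinal method: compare a reading with the mean m(t) of the
    same clock time over recent days; the threshold l(t) is one third
    of m(t). Out-of-range readings are clamped to m(t) +/- l(t)."""
    m = sum(recent_same_time) / len(recent_same_time)
    l = m / 3.0
    if abs(value - m) > l:
        return m + l if value > m else m - l
    return value

# A 25 C spike against five recent same-hour temperatures near 15 C.
corrected = correct_longitudinal(25.0, [14.0, 15.0, 16.0, 15.0, 15.0])
normal = correct_longitudinal(16.0, [14.0, 15.0, 16.0, 15.0, 15.0])
```

The spike is clamped to m(t) + l(t), while a reading within the threshold passes through unchanged.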
Finally, among the meteorological monitoring data, the rainfall readings record the cumulative rainfall for the day. These data need to be averaged to determine the rainfall at specific times, and the rainfall accumulator should be cleared promptly once the rain stops.
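One simple way to recover per-interval rainfall from such a cumulative counter is to difference the series and discard negative jumps caused by the reset. This is a sketch with made-up readings, not necessarily the exact procedure used by the monitoring system.

```python
import pandas as pd

# Hypothetical cumulative daily rainfall readings (mm); the counter
# is cleared to zero after the rain stops.
cum = pd.Series([0.0, 1.2, 3.0, 3.0, 5.5, 0.0, 0.0])

# Interval rainfall is the positive difference between successive
# cumulative readings; a drop in the counter marks a reset, not rain.
interval = cum.diff().fillna(0.0).clip(lower=0.0)
```

Summing the interval series recovers the total rainfall before the reset, and the reset itself contributes nothing.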

2.3.2. Test Data Analysis

1. Analysis of Distribution Patterns
The 24 h of each day are divided into six periods of 4 h, and violin plots, which combine box plots with kernel density plots [34], are used to examine the distribution of temperature, humidity, noise, wind speed, and other variables during different periods of the day, as depicted in Figure 2.
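The grouping behind such plots is straightforward: each sample is binned by its hour stamp into one of six 4 h periods. The sketch below uses synthetic temperatures (not the mine data) to show the grouping step; matplotlib's `ax.violinplot(groups)` would then render a Figure 2-style panel.

```python
import numpy as np

rng = np.random.default_rng(1)
hours = rng.integers(0, 24, size=250)            # hour stamp of each sample
# Synthetic diurnal temperature curve plus noise, purely illustrative.
temps = (25 + 8 * np.sin((hours - 8) / 24 * 2 * np.pi)
         + rng.normal(scale=1.0, size=250))

# Six 4-hour periods: 0-4, 4-8, ..., 20-24.
groups = [temps[hours // 4 == p] for p in range(6)]
```

Each entry of `groups` holds all samples falling in one period, which is exactly the per-period dataset a violin plot summarizes.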
The temperature data in Figure 2a indicate that the temperature is low during 0:00–4:00, 4:00–8:00, and 8:00–12:00, generally between 15 and 30 °C, and gradually increases during 12:00–16:00, reaching 30–40 °C, although a small amount of data remain between 20 and 30 °C. The temperature rises to 40 °C during 16:00–20:00, the fifth period, which shows the largest temperature variation, and then gradually decreases during 20:00–24:00. Over the ten days, the highest temperature recorded was 41.2 °C and the lowest was 13.9 °C.
The humidity violin plot in Figure 2b is “slender” overall: the data distribution density is low and fluctuation is large, indicating significant changes in humidity at different times. In the first, second, and third periods, the humidity data are mainly concentrated between 40% and 80% relative humidity (RH), with a small amount of data above 80% RH or below 40% RH. In the fourth and fifth periods (12:00–20:00), the humidity gradually decreases to below 40% RH. Over the ten days, the maximum humidity reached 91.6% RH and the minimum dropped to 11% RH.
In Figure 2c, the distribution of the noise data tracks mine production operations. Under typical outsourced or self-operated working conditions, the noise intensity ranges from 55 to 65 dB. During 4:00–6:00, in the second period, the noise intensity gradually decreases because outsourced operations shut down and production shifts to self-operated work. Some noise readings in the third, fourth, and fifth periods are minimal, for two possible reasons: first, work pauses for half an hour during the breaks at 12:00–12:30 and 17:30–18:00; second, it rained throughout the day in the early morning of August 11, causing the outsourced operations to suspend production, with only self-operated work continuing. An increasing noise trend can be observed in periods 3, 4, and 5 in Figure 2c.
Figure 2d shows the wind speed violin diagram. Most of the data are concentrated in the range of 0–2.5 m/s. However, in the second period, there are outliers in the wind speed from 4:00 to 8:00, where it increases from 2 m/s to 13.6 m/s, and the data fluctuate greatly. At this time, there is a light rainfall.
2. Analysis of the Relationship
The concentration of dust is influenced by numerous factors. A single factor can visually display the density and distribution of the data, but it needs to be combined with the change in dust concentration to better reflect the relationship between dust concentration and its variables. Figure 3 illustrates the correlation between individual variables, including temperature, humidity, noise, rainfall, wind speed, wind direction, and dust concentration.
From Figure 3, it is evident that the PM2.5 and PM10 data are strongly correlated, with essentially identical change patterns. The PM2.5 levels remain within 10–130 μg/m³, while the PM10 levels generally stay between 20 and 140 μg/m³. Figure 3a shows a significant negative correlation between dust concentration and temperature: as the temperature increases, the dust concentration decreases. One possible explanation is that, at lower temperatures, dust particles remain suspended in the air more readily, causing dust to accumulate and disperse slowly, which leads to higher observed concentrations.
Air humidity influences the diffusion and settling of particulate matter concentration to some degree [31]. There is a distinct relationship between rainfall and humidity. During the observation period, rain is recorded on the afternoons of August 11 and August 20, which is reflected in samples 81–101 and 241–250 in Figure 3c. When combined with Figure 3b, it shows that humidity levels shift with rainfall, rising from 20% to 70% initially, and then reaching 80% to 100%. Figure 3b indicates a positive correlation between air humidity and PM concentration overall, meaning that, as humidity increases, dust concentration also tends to rise. However, this trend is primarily evident in areas where humidity is not influenced by rainfall and exhibits notable periodic patterns. For instance, in the intervals of samples 21–41, 61, 121–141, and 141–161, both humidity and particle concentration increase. This phenomenon occurs because the summer climate in the mine is typically dry. Even though there is a temporary rise in humidity, the air still lacks sufficient moisture, which diminishes the interaction between particles and leads to an increase in dust concentration in the air [35]. In addition, by analyzing Figure 3c, it can be seen that, due to the rainy weather, the stope environment is relatively humid, which helps the water vapor in the air to condense, and the particles in the air can absorb more water, which leads to the increase in particle size and settlement [35]. Therefore, the dust concentration is significantly reduced in the two intervals of 81–101 and 241–250. However, with the stop of rainfall, the humidity and dust concentration show a transient stable state and gradually decrease and continue to show periodic distribution characteristics.
The magnitude of the noise indicates the intensity of mine production. It can be observed from Figure 3d that there is a lagged positive correlation between the mining production intensity and the dust concentration. Specifically, as the production intensity increases, the dust concentration will increase after a certain period of time. Figure 3e,f present wind rose diagrams that demonstrate how PM2.5 and PM10 levels are influenced by wind speed, direction, and force. It is evident that higher wind speeds correspond to lower dust concentrations, as shown by the blue area. The wind flow helps to disperse the dust mass concentration, resulting in a reduced level of concentration observed. Moreover, it is important to clarify that dust concentration is influenced by multiple factors rather than just one. Despite the impact of various external elements, there are no clear patterns in the changes of particulate matter concentration. Consequently, a more in-depth analysis and discussion of how different factors affect dust concentration is required.
The overall analysis indicates that the monitoring data align with the fundamental pattern of dust concentration changing with the weather, and the accuracy is reasonable.

3. Important Indicator Screening Based on RF Algorithm

Understanding the impact of meteorological factors and production intensity on dust concentration is essential for enhancing the accuracy of dust prediction models. The impact of different factors on dust concentration varies, and there may be a correlation between multiple factors. If irrelevant factors are included, the accuracy of the prediction model may be reduced. Therefore, to enhance prediction accuracy, it is essential to exclude unimportant and redundant factors before establishing a dust concentration prediction model.

3.1. Random Forest Algorithm

Random forest (RF) is an ensemble learning algorithm based on decision trees and has been described as a method representative of the state of the art in ensemble learning [36,37]. The algorithm is simple, easy to implement, and has low computational overhead, and it performs strongly in many real-world tasks. The RF algorithm is widely applied to prediction problems and is often used for feature selection. In large-scale learning, feature selection can eliminate uncorrelated or weakly correlated input variables and create a subset of the most relevant ones, reducing data noise and modeling time and improving the prediction accuracy of the model [38]. Importance scores are computed from the total increase in node homogeneity (i.e., decrease in impurity) attributable to each factor. After fitting the data, the RF model provides a measure of the importance of each variable in the data attribute columns. In sklearn, the feature_importances_ attribute of a fitted random forest model returns a numpy array giving the importance of each training feature as determined by the model; the larger the value, the more important that attribute column is for prediction accuracy [39]. The basic principle can be summarized in three steps: random sampling, random feature selection, and aggregated voting. The specific steps of the RF algorithm are shown in Figure 4.
1. Randomly select n samples from the sample set to form a new sample set.
2. A decision tree is created from the sampled set. At each node of the growing tree: (a) randomly select d feature attributes without repetition; (b) use these features to divide the sample set and find the best feature for the split.
3. Repeat steps 1–2 b times to build b decision trees and form a random forest.
4. The trained random forest predicts the test samples by combining the outputs of the individual trees and voting on the results.
The b learners are independent of one another, which allows the decision trees to be trained and evaluated in parallel.
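The feature_importances_ mechanism described above can be demonstrated with a small sklearn sketch. The data here are synthetic (the true mine dataset is not public), and the feature names simply mirror the paper's eight candidate factors; the target is constructed so that the "stripping" and "temperature" columns actually drive it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the monitoring data: 8 candidate factors.
features = ["temperature", "humidity", "noise", "stripping",
            "wind_speed", "wind_direction", "wind_force", "rainfall"]
X = rng.normal(size=(200, 8))
# Make the target depend mostly on "stripping", then "temperature".
y = 3.0 * X[:, 3] - 1.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranked = sorted(zip(features, rf.feature_importances_),
                key=lambda p: p[1], reverse=True)
```

Because the importances sum to one, ranking them gives a direct quantitative screening of the candidate factors, which is how the index set is narrowed in the next subsection.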

3.2. Screening of Important Indicators

The literature and engineering experience are summarized to provide insights into the factors affecting dust concentration. The factors that commonly influence the level of dust concentration are chosen as input variables to assess the extent of dust concentration pollution. An index system for predicting and evaluating dust concentration is initially established, as depicted in Figure 5.
Through the random forest (RF) model, the input indices (independent variables) for dust concentration are analyzed to obtain the degree to which each independent variable explains the dependent variable (using PM2.5 as an example). This process helps determine the importance of each influencing factor for dust concentration, as shown in Figure 6.
Taking PM2.5 data as an example, the importance scores of eight influencing factors, including temperature, humidity, noise, stripping amount, wind speed, wind direction, wind force, and rainfall, are 0.88, 0.67, 0.61, 1.43, 0.20, 0.54, 0.08, and 0.41, respectively. The importance is ranked from highest to lowest as follows: stripping amount > temperature > humidity > noise > wind direction > rainfall > wind speed > wind force.
Noise is essentially an indicator of the stripping amount, and humidity is highly correlated with rainfall, while most of the time there is no rainfall; these redundant factors are therefore removed. Combined with the quantitative calculation results of the RF algorithm, the input parameters (independent variables) of the dust concentration prediction model are finally determined as stripping amount, temperature, humidity, wind direction, and wind speed. The output parameters (dependent variables) mainly include PM2.5, with PM10 used as an auxiliary dependent-variable reference. The GA-LSSVM model for predicting dust concentration based on RF index analysis is established [14], as shown in Figure 7.

4. Prediction Model Establishment

4.1. Least Squares Support Vector Machine Algorithm (LSSVM)

The support vector machine (SVM) solves classification and regression problems by means of quadratic programming (QP). The least squares support vector machine (LSSVM) is an enhanced version of SVM: the inequality constraints of the traditional QP problem are replaced with equality constraints, which simplifies the function approximation problem and greatly eases the solution process. It copes with the abnormal regression caused by rough, strongly fluctuating datasets and effectively avoids the local-optimum problems of BP neural networks and similar methods [40]. Compared with fuzzy logic systems and artificial neural networks, LSSVM involves few tunable parameters, enabling fast computation and parameter tuning [41]. The principle is as follows:
For a given training dataset $\{(x_i, y_i)\}$, $i = 1, 2, \dots, N$, where $x_i = (x_{i1}, x_{i2}, \dots, x_{id})^T$ is a d-dimensional input vector and $y_i$ is the corresponding output, N represents the total number of training data points. To represent the mapping relationship between the input and output variables, the model is estimated with the following nonlinear function:
$f(x) = \langle \varphi(x), \omega \rangle + b$
where $\omega$ represents the weight vector; $b$ represents the bias term; and $\langle \cdot , \cdot \rangle$ denotes the inner product operation.
Based on the principle of structural risk minimization, the squared error term is taken as the empirical risk, and the quadratic programming problem of SVM is reformulated as:
$\min_{\omega, e} J(\omega, e) = \frac{1}{2} \|\omega\|^{2} + \frac{1}{2} \gamma \sum_{i=1}^{N} e_i^{2}, \quad \text{s.t.} \; y_i = \langle \varphi(x_i), \omega \rangle + b + e_i, \; i = 1, 2, \dots, N; \; \gamma > 0$
where $\varphi(x)$ represents the nonlinear mapping to the high-dimensional feature space; $e_i$ denotes the error variable; and $\gamma$ signifies the penalty (regularization) factor.
To solve the optimization problem above, we construct the corresponding Lagrange function.
$L_{LSSVM} = \frac{1}{2} \|\omega\|^{2} + \frac{1}{2} \gamma \sum_{i=1}^{N} e_i^{2} - \sum_{i=1}^{N} \alpha_i \left[ \langle \varphi(x_i), \omega \rangle + b + e_i - y_i \right]$
where $\alpha_i$ denotes the Lagrange multipliers.
Setting the partial derivatives of Formula (3) with respect to $\omega$, $b$, $e_i$, and $\alpha_i$ to zero yields the conditions for the optimal solution:
$\frac{\partial L_{LSSVM}}{\partial \omega} = 0 \Rightarrow \omega = \sum_{i=1}^{N} \alpha_i \varphi(x_i); \quad \frac{\partial L_{LSSVM}}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i = 0; \quad \frac{\partial L_{LSSVM}}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i; \quad \frac{\partial L_{LSSVM}}{\partial \alpha_i} = 0 \Rightarrow \langle \varphi(x_i), \omega \rangle + b + e_i - y_i = 0$
By eliminating the variables $\omega$ and $e_i$, the problem is reduced to the following linear system:
$\begin{bmatrix} 0 & E^{T} \\ E & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$
where $y = [y_1, y_2, \dots, y_n]^T$; $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_n]^T$; $E = [1, 1, \dots, 1]^T$ is an n-dimensional all-ones vector; $I$ is the n-order identity matrix; and $\Omega$ is the n-order kernel matrix with entries:
$\Omega_{ij} = K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j), \quad i, j = 1, 2, \dots, n$
where the Gaussian radial basis function (RBF) kernel is selected:
$K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}} \right), \quad \sigma > 0$
where σ represents the bandwidth of the kernel function.
Finally, the expression for the LSSVM model function is obtained.
$y(x) = \langle \varphi(x), \omega \rangle + b = \sum_{i=1}^{n} \alpha_i \langle \varphi(x_i), \varphi(x) \rangle + b = \sum_{i=1}^{n} \alpha_i K(x_i, x) + b$
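The linear system and prediction function above can be sketched directly in code. This is a minimal illustrative implementation, and the γ and σ values below are placeholders rather than the GA-optimized parameters reported later:

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    n = len(y)
    Omega = rbf_kernel(X, X, sigma)
    # Assemble [[0, E^T], [E, Omega + I/gamma]] [b; alpha] = [0; y]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # bias b, multipliers alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=1.0):
    # y(x) = sum_i alpha_i K(x_i, x) + b
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Smoke test on a noisy sine curve (illustrative data, not mine measurements)
X = np.linspace(0, 6, 60)[:, None]
y = np.sin(X).ravel() + 0.05 * np.random.default_rng(0).normal(size=60)
b, alpha = lssvm_fit(X, y)
pred = lssvm_predict(X, b, alpha, X)
print(f"train RMSE: {np.sqrt(np.mean((pred - y) ** 2)):.4f}")
```

Because the equality constraints turn training into one dense linear solve, no iterative QP solver is needed, which is the computational advantage the text describes.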

4.2. Biological Evolution Algorithm: Genetic (GA) Algorithm

Optimizing the parameters can prevent underfitting in LSSVM learning, ensure the model's prediction performance, and enhance the accuracy of the predicted results. The genetic algorithm (GA) is a branch of evolutionary computation proposed by John Holland of the University of Michigan in 1975. It searches globally for the optimal solution by simulating Darwinian genetic selection and natural elimination. The algorithm has a strong capability to optimize a wide variety of functions; it is simple, universal, robust, and highly parallel, making it one of the most popular adaptive, global probabilistic search optimization algorithms [42]. The core elements of a GA are encoding (binary), the genetic operations (selection, crossover, mutation), and the fitness function. First, the parameters to be optimized are binary-coded, transforming the solution space into a chromosome space. The initial population settings, such as the number of generations, individual (chromosome) length, and population size, are established, and an appropriate fitness function is determined; the fitness of each individual in the population is then calculated. The population subsequently undergoes the genetic operators of selection, crossover, and mutation. Through iterative calculation, the population evolves toward the optimum until the optimal solution is obtained [43]. Since the LSSVM model requires two parameters to be optimized (penalty factor γ and kernel parameter σ), the population dimension is 2. The specific operation of the GA optimization algorithm is as follows.
Step 1: generate the initial chromosome solutions and initialize the mutation and crossover factors.
Step 2: determine the initial fitness function values $F_i^k = f(X_i^k)$ and the index of the optimal chromosome.
Step 3: randomly select two different genes (parameter values) from chromosome j and exchange them to generate new, high-quality individuals.
Step 4: randomly select chromosome j of an individual and introduce a small perturbation, mutating it toward the optimal chromosome; this maintains population diversity and increases the coverage of the search space.
Step 5: if the optimal chromosome is obtained in Step 4, stop the loop; otherwise, repeat the steps above. After multiple generations of evolution, the optimal or near-optimal individual is selected, decoded, and used as the parameter set of the LSSVM prediction model, yielding a more accurate prediction model.
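Steps 1–5 can be sketched as a compact real-coded GA for the two hyperparameters (γ, σ). The fitness function below is an illustrative stand-in, a quadratic with a known minimum, whereas in the paper it would be the LSSVM prediction error on validation data; the bounds and operator rates are likewise assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
LO, HI = np.array([0.1, 0.1]), np.array([100.0, 100.0])   # search bounds (assumed)

def fitness(ind):
    # Placeholder objective: minimum at gamma = 20, sigma = 60 (illustrative only);
    # in practice this would evaluate the LSSVM validation RMSE.
    return (ind[0] - 20) ** 2 + (ind[1] - 60) ** 2

pop = rng.uniform(LO, HI, size=(30, 2))                   # population, dimension = 2
for gen in range(100):
    f = np.array([fitness(p) for p in pop])
    order = np.argsort(f)
    elite = pop[order[:10]]                               # selection: keep best 10
    children = []
    while len(children) < 20:
        a, b = elite[rng.integers(10, size=2)]
        w = rng.random()
        child = w * a + (1 - w) * b                       # crossover: blend two parents
        if rng.random() < 0.1:                            # mutation: small perturbation
            child = child + rng.normal(scale=2.0, size=2)
        children.append(np.clip(child, LO, HI))
    pop = np.vstack([elite, children])

best = min(pop, key=fitness)
print(f"best (gamma, sigma) ~ ({best[0]:.2f}, {best[1]:.2f})")
```

Elite retention guarantees the best fitness never worsens between generations, matching the monotone fitness curves shown later in Figure 10.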

4.3. GA-LSSVM Combination Forecasting Algorithm

In this paper, we propose a method for predicting dust concentration based on the genetic optimization least squares support vector machine. The core idea of the algorithm is to utilize the global search ability of genetic algorithms to address the issue of LSSVM parameter selection. It aims to automatically optimize the parameters to find the most suitable model configuration, enhancing the accuracy and stability of model predictions. As an enhanced variant of support vector machines, LSSVM is solved by minimizing the squared error. In the regression prediction problem, it is necessary to determine a hyperplane that best fits the training dataset. This involves two main parameters: penalty factor and kernel parameter. The choice of these parameters is crucial for the performance of the model. As a global optimization search algorithm, GA can optimize the parameters of LSSVM. In the genetic algorithm, each individual represents a set of specific parameter values (such as the value of γ and σ ), and these individuals constitute an initial population. Through multiple iterations of evolution, GA continuously searches and updates the optimal parameters of the LSSVM, aiming to gradually enhance the prediction performance of the model on the training set.
The specific steps of the model prediction are as follows:
Step 1. Data normalization: collect the sample data for the time series and perform noise reduction processing, such as data normalization.
Step 2. LSSVM model construction: establish an LSSVM (least squares support vector machine) prediction model.
Step 3. Baseline training: train the LSSVM prediction model on the training sample data, and obtain the root mean square error and correlation coefficient for both the training and test samples.
Step 4. Genetic optimization: integrate a GA to optimize the parameters of the LSSVM prediction model. Set the initial values of the parameters to be optimized, reconstruct the LSSVM model with the optimal parameters found by the GA, then train and test the model to verify its generalization ability and prediction accuracy, and output the root mean square error and correlation coefficient of the GA-optimized model.
Step 5. Model evaluation: analyze and evaluate all error results.
The training of models such as support vector machines and random forests requires large datasets to ensure the applicability of the model. This type of model involves two datasets: a training set for algorithm learning and a test set for algorithm prediction. The research must ensure that there is enough training data for learning and for establishing the transfer matrix. Therefore, the training and test sample sets are split at a 7:3 ratio. For example, of the 250 data sets newly monitored for dust in August 2023, the first 7 days of the 10-day collection period (approximately 180 data sets) are used as the training set, and the last 3 days (approximately 70 data sets) as the test set. The modeling process is shown in Figure 8 [16].
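The chronological 7:3 split described above can be sketched as follows; no shuffling is applied, so the earliest 70% of the time-ordered records form the training set and the most recent records form the test set (the array here is a stand-in for the 250 monitoring records):

```python
import numpy as np

data = np.arange(250).reshape(250, 1)     # stand-in for 250 time-ordered records
split = int(len(data) * 0.7)
train, test = data[:split], data[split:]  # no shuffling: order preserves time
print(len(train), len(test))              # -> 175 75
```

Preserving temporal order matters for dust data: a random shuffle would leak future meteorological conditions into training and overstate test accuracy.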

4.4. Evaluation Index Selection

In order to test the predictive performance of the LSSVM model and the LSSVM model optimized by different algorithms, this paper uses common metrics such as root mean square error (RMSE), standard deviation (STD), and correlation coefficient (R2) to evaluate the model’s performance. The specific calculation formula is as follows:
Root mean square error (RMSE):
$RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_{true}^{i} - y_{pred}^{i} \right)^{2} }$
where n represents the number of samples, y t r u e i represents the true value, and y p r e d i is the predicted value.
Standard deviation (STD):
$STD = \sqrt{ \frac{ \sum_{i=1}^{n} \left( y_i - \mu \right)^{2} }{ n - 1 } }$
where μ represents the sample mean value.
Correlation coefficient (R2):
$R^{2} = 1 - \frac{ \sum_{i=1}^{n} \left( y_{true}^{i} - y_{pred}^{i} \right)^{2} }{ \sum_{i=1}^{n} \left( y_{true}^{i} - \bar{y} \right)^{2} }$
where y ¯ represents the average value.
The root mean square error (RMSE) is the square root of the average squared difference between the predicted and true values in the sample. The metric is highly sensitive to large errors and directly reflects the average deviation between the true and predicted values, making it well suited to evaluating the stability of the model. The standard deviation represents the degree of dispersion within the dataset: the larger the value, the more dispersed the data. The correlation coefficient represents the proportion of the variation in the actual values that the model can explain; its value lies between 0 and 1, and an R2 greater than 0.80 is generally considered to indicate a good fit.
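The three metrics can be sketched directly from the formulas above; note that the sample standard deviation uses the n − 1 denominator. The four-point arrays are illustrative values only:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Square root of the mean squared prediction error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def std(y):
    # Sample standard deviation (n - 1 denominator)
    return np.sqrt(np.sum((y - y.mean()) ** 2) / (len(y) - 1))

def r2(y_true, y_pred):
    # Proportion of variance in y_true explained by the predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.5])
print(rmse(y_true, y_pred), std(y_true), r2(y_true, y_pred))
```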

5. Results and Discussion

5.1. Data Normalization and Parameter Setting

5.1.1. Data Normalization

In statistical modeling, the model may be unevenly scaled across various dimensions. Differences in the scaling of dimensions can lead to calculation results being influenced more by features with larger dimensions, causing a disparity between the optimal solution and the original data. Therefore, to ensure the reliability of the results, the original monitoring data are normalized to the same dimension to eliminate dimensional differences.
The commonly used data standardization methods are min-max standardization and Z-score standardization. Z-score normalization is the most commonly used normalization method in SPSS, also known as standard deviation normalization. This method calculates the mean and standard deviation of the original data in order to normalize it. The processed data conform to the standard normal distribution, with a mean value of 0 and a standard deviation of 1. The transformation function is as follows.
$x^{*} = \frac{x - \mu}{\sigma}$
where μ represents the sample mean, and σ represents the sample standard deviation.
The Z-score normalization method is suitable for cases where the maximum and minimum values of attribute A are unknown, or when there are outliers beyond the range of values. The dust concentration data are disorganized and fluctuate significantly, with more outliers present. The normalized data preserve the valuable information in the outliers, reducing the algorithm’s sensitivity to them. Compared with the min-max method, this method is less affected by outliers. Therefore, this paper utilizes the Z-score method to normalize the data. The normalized change steps are as follows.
  1. Calculate the mathematical expectation $\bar{X}_i$ and standard deviation $S_i$ of each input and output variable, respectively.
  2. Normalize: $Z_{ij} = \frac{X_{ij} - \bar{X}_i}{S_i}$, where $Z_{ij}$ is the normalized variable value and $X_{ij}$ is the actual variable value.
  3. Adjust the sign in front of any inversely related variable.
The normalized variable values fluctuate around 0: a value greater than 0 indicates a level above the average, while a value less than 0 indicates a level below the average.
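The normalization steps above can be sketched as follows, applied column by column with the sample standard deviation; the two columns are illustrative stand-ins for, e.g., temperature and humidity:

```python
import numpy as np

X = np.array([[20.0, 55.0],
              [25.0, 60.0],
              [30.0, 65.0]])              # illustrative temperature, humidity columns

# Z-score: subtract each column's mean, divide by its sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(Z.mean(axis=0))                     # -> ~[0, 0]
print(Z.std(axis=0, ddof=1))              # -> [1, 1]
```

The same column means and standard deviations computed on the training set must be reused to transform the test set, otherwise information leaks between the two splits.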

5.1.2. Model Parameter Settings

The temperature, humidity, stripping amount, wind speed, and wind direction variables are input to output PM2.5 or PM10, and the GA-LSSVM prediction model is then established. The input- and output-layer index data are detailed in Table 3. The root mean square error (RMSE) between the actual and predicted values of the model is used as the fitness function; it is combined with the sample standard deviation (STD) and the correlation coefficient (R2) to evaluate the model's prediction effectiveness.
Specify the maximum number of iterations (M), population size (N), penalty parameter (γ), the optimization range of the kernel function (σ), and model-specific hyperparameters, including the crossover probability value (Table 4).
The parameter table reflects only the initial values, which must be tested and revised in later stages. The accuracy of the prediction results and the suitability of the model depend on an analysis of the data characteristics, so the parameters are continuously adjusted and optimized according to the characteristics of the dust concentrations measured in the field to improve the accuracy of the results.
By adjusting the model parameters and incorporating various optimization algorithms, we compare the prediction results of dust concentrations PM2.5 and PM10 before and after optimization.

5.2. Analysis of Model Results

Firstly, the least squares support vector machine (LSSVM) single model is used to predict the concentration of PM2.5 and PM10, and the results are shown in Figure 9. It can be observed from Figure 9 that the LSSVM model can approximately predict the trend of dust concentration, but its accuracy in prediction is still low.
The fitness curve and prediction results of the GA-LSSVM combined prediction model established in this paper are depicted in Figure 10 and Figure 11. It can be seen from Figure 10 that the average fitness function values at population initialization are fitness (PM2.5) = 0.275 and fitness (PM10) = 0.44. After multiple iterations and refinements, the average fitness curves for PM prediction stabilize within a consistent range, reaching a stable state at the 14th and 30th iterations, respectively. After the optimization calculation, the population's optimal fitness function values are fitness (PM2.5) = 0.13 and fitness (PM10) = 0.14. The optimal parameter combinations of the final GA-LSSVM prediction model are K1(γ, σ) = (21.8731, 62.5992) and K2(γ, σ) = (11.6278, 99.9998).
From Figure 11, it can be seen that before optimization the root mean square error (RMSE) of the LSSVM model is 13.484 (PM2.5) and 15.351 (PM10), and the coefficient of determination (R2) is 0.482 (PM2.5) and 0.466 (PM10). After genetic algorithm (GA) optimization, the RMSE for PM2.5 and PM10 decreases by 36% and 53%, respectively, compared with before optimization, while R2 increases by 81% and 96%, reaching final fitting degrees of 0.872 and 0.913. The optimization therefore brings a significant improvement.

5.3. Model Comparison and Test

5.3.1. Model Comparison

The model is a hybrid model derived from the least squares support vector machine (LSSVM) algorithm and the genetic algorithm (GA) optimization algorithm. Therefore, to verify the accuracy of the model’s predictions, it is necessary to compare the model’s results to those of similar algorithms to determine if the combined model demonstrates better fitting and generalization abilities.
The least squares support vector machine (LSSVM) model is also optimized using the improved sparrow search algorithm (ISSA), grey wolf optimization (GWO), and particle swarm optimization (PSO). Three combined prediction models, ISSA-LSSVM, PSO-LSSVM, and GWO-LSSVM, are constructed for comparison with the GA-LSSVM model. Training and test set data with the same split ratio are imported for training and prediction, and the PM2.5 and PM10 concentrations are then tested separately. The sparrow search algorithm is iterated 40 times, the grey wolf algorithm 20 times, and the particle swarm algorithm 50 times. The prediction results of the three models are depicted in Figure 12.
The most representative evaluation metrics R2, RMSE, and STD are chosen to compare the model accuracy, as shown in Table 5.
To facilitate comparative analysis of the prediction results, a Taylor diagram (Figure 13) is used to illustrate the relationship between the root mean square error (RMSE), the sample standard deviation (STD), and the correlation coefficient (R2), allowing a comprehensive evaluation of the errors. The scatter points in the Taylor diagram represent the various prediction models; the x-axis shows the standard deviation (STD), the radial lines the correlation coefficient (R2), and the dotted arcs the root mean square error (RMSE). According to Figure 13 and Table 5, the results of the three comparison algorithms differ only slightly. Compared with the LSSVM model, the PSO-LSSVM model reduces RMSE by 32% (PM2.5) and 29% (PM10) and increases R2 by 67% and 74%; the GWO-LSSVM model reduces RMSE by 31% and 43% and increases R2 by 68% and 84%; and the ISSA-LSSVM model reduces RMSE by 19% and 41% and increases R2 by 58% and 79%. The comprehensive analysis ranks the models as GA-LSSVM > GWO-LSSVM > PSO-LSSVM > ISSA-LSSVM > LSSVM. The LSSVM model optimized by the GA algorithm therefore clearly demonstrates a good prediction effect, with low error and a high degree of fit.

5.3.2. Model Test

In order to effectively evaluate the robustness of the model’s performance, the generalization ability and overfitting phenomenon of the model are tested by adjusting the number of training samples. This study divides the samples into five different proportions to assess the generalization ability and actual impact of the validated optimal model, aiming to boost the credibility and scientific value of the research findings. Figure 14 illustrates the predictive effect of the model with varying numbers of training samples, e.g., 140, 160, 180, 200, and 220 (using PM10 as an example). Combined with the prediction performance of the training set and the test set as shown by the learning curve in Figure 15, the prediction performance of both sets changes with the increase in the number of training samples. If the training error decreases with an increase in the number of samples, but the verification error does not decrease (or even increases), the model appears to be overfitting.
Therefore, as seen in Figure 14c, when the number of training samples is 180, the dataset is split at a 7:3 ratio. The results on the model's prediction set indicate that the correlation coefficient and the error value are both optimal, and it can be concluded that the model's fitting effect and robustness are best in this state.
It is important to highlight that the model comparison and testing discussed above demonstrate the effectiveness of the algorithm presented in this paper. First, the dust concentration prediction index set is quantitatively refined using the random forest (RF) method, which helps eliminate redundant factors and enhances the model’s prediction speed. Next, a combined prediction model for dust concentration in opencast mines is developed using a GA-LSSVM approach, where the GA algorithm optimizes the LSSVM model, resulting in strong predictive performance with low error and high accuracy. However, the algorithm does have some limitations that require further enhancement. These limitations include its suitability for modeling long sequence data, its robustness and generalization capabilities when handling high-dimensional data, and the necessity for a substantial amount of data for model training, which adds complexity and makes it sensitive to hyperparameter selection. Additionally, while the genetic optimization algorithm used in the model is flexible, its outcomes are probabilistic rather than deterministic, and the optimization process may lead to local convergence. In the future, we will keep enhancing the algorithm model to significantly boost the accuracy and reliability of predictions.

6. Conclusions

6.1. Main Conclusions

  • An analysis was conducted on the distribution of temperature, humidity, noise, wind speed, and other factors throughout the monitoring period, along with their relationship with dust concentration. A notable negative correlation was found between dust concentration and temperature, humidity, rainfall, and wind speed. Conversely, a strong positive correlation exists between the intensity of mine production (noise) and dust concentration. The fluctuations in PM2.5 and PM10 concentrations are largely similar.
  • A new feature importance screening technique utilizing the random forest (RF) algorithm was introduced. The importance scores for the eight influencing factors, ranked from highest to lowest, are as follows: stripping amount (1.43), temperature (0.88), humidity (0.67), noise (0.61), wind direction (0.54), rainfall (0.41), wind speed (0.20), and wind force (0.08). Ultimately, the best combination of indicators for forecasting dust concentration consists of temperature, humidity, stripping amount, wind direction, and wind speed.
  • A predictive model for dust concentration was developed utilizing genetic optimization and least squares support vector machine techniques. The model’s input variables consist of temperature, humidity, stripping amount, wind direction, and wind speed. The primary output variable is PM2.5 concentration, while PM10 concentration serves as a reference auxiliary variable. The model’s sample library was created using 10 days of data collected in August from the Weijiamao open-pit coal mine in Inner Mongolia, with training and testing samples divided in a 7:3 ratio.
  • In comparison with LSSVM, PSO-LSSVM, ISSA-LSSVM, GWO-LSSVM, and other models, the GA-LSSVM model achieves fitting degrees (R2) of 0.872 for PM2.5 and 0.913 for PM10, along with root mean square errors (RMSE) of 8.592 and 7.476, respectively. The GA-LSSVM model shows superior overall performance, showcasing excellent predictive abilities with minimal error and high fitting accuracy.

6.2. Application of the Model

The predictive capabilities of the integrated model utilizing RF-GA-LSSVM presented in this paper surpass those of other algorithms. The findings serve as a valuable reference for real-world applications. This model offers an effective approach to enhance environmental management, integrate production data, and mitigate dust hazard risks in opencast mining operations, thereby reducing the environmental, production, and occupational risks associated with dust levels in these areas. This is primarily evident in the following two key areas:
(1)
An effective model for predicting dust concentration can assess the level of dust pollution in opencast mines. This offers a new approach for managing and preventing dust, allowing production teams and on-site workers to proactively develop strategies, adjust mining operations promptly, and mitigate the negative effects of dust pollution on the mine’s ecological environment. It also helps prevent damage to mining machinery and enhances the precision and effectiveness of dust control measures, thereby ensuring safer mining operations.
(2)
Dust pollution in opencast mines not only harms the ecological environment but also impacts production efficiency and safety, posing serious health risks to workers. By implementing a prediction model, it is possible to thoroughly investigate the factors contributing to dust concentration in these mines, providing a theoretical framework for assessing dust-related risks and helping to manage or reduce the incidence of occupational diseases like pneumoconiosis.

6.3. Limitations and Future Research Directions

In the future, we need to develop a new algorithm for monitoring dust concentration in opencast mines by combining various types of sensors to address the shortcomings of relying on a single monitoring method. Additionally, we should enhance fundamental theoretical research and create a comprehensive database for dust concentration in opencast mines. This involves sampling dust from key production processes, analyzing its physical and chemical properties in the lab, and studying the characteristics of dust generation, spatial variations, and the factors influencing different dust concentrations. This will help refine the input variables for our prediction model, providing a solid theoretical foundation for dust concentration forecasting. Furthermore, we should conduct a thorough seasonal analysis of dust concentration data to explore trends across different seasons, enabling precise predictions despite variations in dust concentration data. Lastly, we should investigate the development of an integrated system platform for monitoring, analyzing, forecasting, early warning, and intelligent prevention and control of dust in opencast mines.

Author Contributions

S.X. and Y.Z. conceived this study, and proposed the overall framework; S.X. and J.L. wrote this study; S.X., Y.Z. and Y.M. conducted a statistical analysis of the data; J.L. and Y.M. performed the data prediction work; S.X., J.L. and Y.M. participated in the design of this study and verified the results; all authors contributed to revisions. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by [The National Natural Science Foundation of China] grant number [52004202], and [Open Project of Key Laboratory of Xinjiang Coal Resources Green Mining, Ministry of Education] grant number [KLXGY-KB2424].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Author Yajie Ma was employed by the company Shaanxi Coalbed Methane Development Co., Ltd. Author Yonggui Zhang was employed by the company North Weijiamao Power and Coal Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Xiao, S.; Ma, Y.; Li, W.; Xue, J.; Li, K.; Ma, X.; Ding, X.; Zhang, Y. Research Progress and Prospect on Theory and Technology for Dust Prevention and Control in Open Pit Mine of China in the Past 20 Years. Met. Mine 2023, 7, 1–24. [Google Scholar]
  2. Ko, K.K.; Jung, E.S. Improving air pollution prediction system through multimodal deep learning model optimization. Appl. Sci. 2022, 12, 10405. [Google Scholar] [CrossRef]
  3. Wang, Z.; Zhou, W.; Jiskani, I.M.; Yang, Y.; Yan, J.; Luo, H.; Han, J. A novel approach to forecast dust concentration in open pit mines by integrating meteorological parameters and production intensity. Environ. Sci. Pollut. Res. 2023, 30, 114591–114609. [Google Scholar] [CrossRef] [PubMed]
  4. Bałaga, D.; Kalita, M.; Dobrzaniecki, P.; Jendrysik, S.; Kaczmarczyk, K.; Kotwica, K.; Jonczy, I. Analysis and forecasting of PM2.5, PM4, and PM10 dust concentrations, based on insitutests in hard coal mines. Energies 2021, 14, 5527. [Google Scholar] [CrossRef]
  5. Tripathy, D.P.; Dash, T.R.; Badu, A.; Kanugo, R. Assessment and modelling of dust concentration in an opencast coal mine in India. Glob. Nest J. 2015, 17, 825–834. [Google Scholar]
  6. Luan, B.; Zhou, W.; Jiskani, I.M.; Wang, Z. An Improved Machine Learning Approach for Optimizing Dust Concentration Estimation in Open-Pit Mines. Int. J. Environ. Res. Public Health 2023, 20, 1353. [Google Scholar] [CrossRef]
  7. Wang, M.; Yang, Z.; Tai, C.; Zhang, F.; Zhang, Q.; Shen, K.; Guo, C. Prediction of road dust concentration in open-pit coal mines based on multivariate mixed model. PLoS ONE 2023, 18, e0284815. [Google Scholar] [CrossRef]
  8. Yang, S.; Wu, H. A novel PM2.5 concentrations probability density prediction model combines the least absolute shrinkage and selection operator with quantile regression. Environ. Sci. Pollut. Res. 2022, 29, 78265–78291. [Google Scholar] [CrossRef]
  9. Tian, Z.; Gai, M. A novel air pollution prediction system based on data processing, fuzzy theory, and multi-strategy improved optimizer. Environ. Sci. Pollut. Res. 2023, 30, 59719–59736. [Google Scholar] [CrossRef]
  10. Emaminejad, S.A.; Sparks, J.; Cusick, R.D. Integrating Bio-Electrochemical Sensors and Machine Learning to Predict the Efficacy of Biological Nutrient Removal Processes at Water Resource Recovery Facilities. Environ. Sci. Technol. 2023, 57, 18372–18381. [Google Scholar] [CrossRef]
  11. Zhou, C.; Xie, X.; Du, X. Prediction of mine dust concentration based on GA-BP neural network. Nonferrous Met. Mine Part 2023, 75, 88–93. [Google Scholar]
  12. Wang, B.; Yao, X.; Jiang, Y.; Sun, C. Dust concentration prediction model in thermal power plant using improved genetic algorithm. Soft Comput. 2023, 27, 10521–10531. [Google Scholar] [CrossRef]
  13. Bemani, A.; Xiong, Q.; Baghban, A.; Habibzadeh, S.; Mohammadi, A.H.; Doranehgard, M.H. Modeling of cetane number of biodiesel from fatty acid methyl ester (FAME) information using GA-, PSO-, and HGAPSO-LSSVM models. Renew. Energy 2020, 150, 924–934. [Google Scholar] [CrossRef]
  14. Liu, Y.; Cao, Y.; Wang, L.; Chen, Z.-S.; Qin, Y. Prediction of the durability of high-performance concrete using an integrated RF-LSSVM model. Constr. Build. Mater. 2022, 356, 129232. [Google Scholar] [CrossRef]
  15. Yu, C.; Cao, W.; Liu, Y.; Shi, K.; Ning, J. Evaluation of a novel computer dye recipe prediction method based on the pso-lssvm models and single reactive dye database. Chemom. Intell. Lab. Syst. 2021, 218, 104430. [Google Scholar] [CrossRef]
  16. Pan, X.; Xing, Z.; Tian, C.; Wang, H. A method based on GA-LSSVM for COP prediction and load regulation in the water chiller system. Energy Build. 2021, 230, 110604. [Google Scholar] [CrossRef]
  17. Ma, L.; Li, T.; Lai, X. GA-LSSVM prediction of throwing blasting effect in open-pit mine based on Fourier series. J. China Coal Soc. 2022, 47, 4455–4465. [Google Scholar]
  18. Luo, S.; Liu, C. Design of network communication load status recognition system based on QPSO-LSSVM. Mod. Electron. Tech. 2019, 42, 81–89. [Google Scholar]
  19. Liu, Z.; Li, L.; Tseng, M.; Tan, R.R.; Aviso, K.B. Improving the reliability of photovoltaic and wind power storage systems using Least Squares Support Vector Machine optimized by Improved Chicken Swarm Algorithm. Appl. Sci. 2019, 9, 3788. [Google Scholar] [CrossRef]
  20. Guo, J.; Zhao, Z.; Zhao, P.; Chen, J. Prediction and optimization of open-pit Mine blasting based on intelligent algorithms. Appl. Sci. 2024, 14, 5609. [Google Scholar] [CrossRef]
  21. Wang, Z.; Zhou, W.; Jiskani, I.M.; Ding, X. Dust pollution in cold region Surface Mines and its prevention and control. Environ. Pollut. 2022, 292, 118293. [Google Scholar] [CrossRef] [PubMed]
  22. Liu, Z.; Zhang, R.; Ma, J.; Zhang, W.; Li, L. Analysis and Prediction of the Meteorological Characteristics of Dust Concentrations in Open-Pit Mines. Sustainability 2023, 15, 4837. [Google Scholar] [CrossRef]
  23. Bai, Y.; Liu, M. Multi-scale spatiotemporal trends and corresponding disparities of PM2.5 exposure in China. Environ. Pollut. 2024, 340, 122857. [Google Scholar] [CrossRef] [PubMed]
  24. Qi, C.; Zhou, W.; Lu, X.; Luo, H. Particulate matter concentration from open-cut coal mines: A hybrid machine learning estimation. Environ. Pollut. 2020, 263, 114517. [Google Scholar] [CrossRef]
25. Xiao, S.; Ma, Y.; Li, W.; Liu, J. Prediction of dust concentration in open-pit mine based on CiteSpace knowledge graph analysis. J. Xi’an Univ. Sci. Technol. 2023, 43, 675–685. [Google Scholar]
  26. Han, L.; Li, Y.; Yan, W.; Xie, L.; Wang, S.; Wu, Q.; Ji, X.; Zhu, B.; Ni, C. Quality of life and influencing factors of coal miners in Xuzhou, China. J. Thorac. Dis. 2018, 10, e0267440. [Google Scholar] [CrossRef]
27. Shen, Y.; Wang, Y.; Cao, L.; Shi, T.; Huang, L.; Cui, F. Assessing cumulative dust exposure for excavating workers in a high-speed tunnel industry using the Bayesian decision analysis technique. Mod. Prev. Med. 2018, 45, 1753–1758. [Google Scholar]
  28. Chen, R. Gray prediction of underground dust concentration. Ind. Saf. Dust Control. 2000, 22, 5–7. [Google Scholar]
  29. Lal, B.; Tripathy, S.S. Prediction of dust concentration in open cast coal mine using artificial neural network. Atmos. Pollut. Res. 2012, 3, 211–218. [Google Scholar] [CrossRef]
  30. Bian, Z.; Tang, J.; Ni, C.; Zhu, B.; Zhang, H.; Dinga, B.; Shen, H.; Han, L. Analysis on prevalence of pneumoconiosis in Jiangsu province using ARIMA-GRNN combined model. J. Environ. Occup. Med. 2019, 36, 755–760. [Google Scholar]
  31. Li, L.; Zhang, R.; Sun, J.; He, Q.; Kong, L.; Liu, X. Monitoring and prediction of dust concentration in an open-pit mine using a deep-learning algorithm. Environ. Health Sci. Eng. 2021, 19, 401–414. [Google Scholar] [CrossRef] [PubMed]
  32. Li, J.; Qiao, G.; Zhang, Y. Environmental risk identification and prevention of Weijiamao open-pit coal mine. Open-Pit Min. Technol. 2023, 38, 97–99. [Google Scholar]
  33. Duan, X. Design and Implementation of Rural Power Load Forecasting System Based on Hybrid Neural Network. Master’s Thesis, Hebei University of Engineering, Handan, China, 2022. [Google Scholar]
  34. Xu, G.; Wang, X. Development of blasting vibration prediction model based on neural network algorithm. Nonferrous Met. Eng. 2023, 13, 94–102. [Google Scholar]
  35. Cheng, P. Distribution law of meteorological factors and evolution characteristics of particulate matter in low temperature stope of open-pit coal mine. Coal Eng. 2020, 52, 85–90. [Google Scholar]
36. Wang, H.; Feng, S.; Liu, Z. Geological structure recognition model based on improved random forest algorithm. Coal Sci. Technol. 2023, 51, 149–156. [Google Scholar]
  37. Duan, G.; Dong, J. Construction of ensemble learning model for home appliance demand forecasting. Appl. Sci. 2024, 14, 7658. [Google Scholar] [CrossRef]
  38. Chen, R.; Wang, X.; Wang, Z.; Qu, H.; Ma, T.; Chen, Z.; Gao, R. Wavelength screening method for near-infrared spectroscopy based on random forest feature importance and interval partial least squares. Spectrosc. Spectr. Anal. 2023, 43, 1043–1050. [Google Scholar]
  39. Hu, X.; Belle, J.H.; Meng, X.; Wildani, A.; Waller, L.A.; Strickland, M.J.; Liu, Y. Estimating PM2.5 concentrations in the conterminous United States using the random forest approach. Environ. Sci. Technol. 2017, 51, 6936–6944. [Google Scholar] [CrossRef]
  40. Wang, M.; Zhong, C.; Yue, K.; Zheng, Y.; Jiang, W.; Wang, J. Modified MF-DFA model based on LSSVM fitting. Fractal Fract. 2024, 8, 320. [Google Scholar] [CrossRef]
  41. Wan, P.; Zou, H.; Wang, K.; Zhao, Z. Hot deformation characterization of Ti–Nb alloy based on GA-LSSVM and 3D processing map. J. Mater. Res. Technol. 2021, 13, 1083–1097. [Google Scholar] [CrossRef]
  42. Ali, M.Z.; Awad, N.H.; Suganthan, P.N.; Shatnawi, A.M.; Reynolds, R.G. An improved class of real-coded Genetic Algorithms for numerical optimization. Neurocomputing 2018, 275, 155–166. [Google Scholar] [CrossRef]
  43. Zendehboudi, A. Implementation of GA-LSSVM modelling approach for estimating the performance of solid desiccant wheels. Energy Convers. Manag. 2016, 127, 245–255. [Google Scholar] [CrossRef]
Figure 1. Monitoring point layout diagram.
Figure 2. The data distribution status of each influence index and the density of distribution across different time periods.
Figure 3. The variation law of each influencing factor and dust concentration data.
Figure 4. RF algorithm schematic diagram.
Figure 5. The initial index system diagram.
Figure 6. RF importance evaluation of the dust concentration indices.
Figure 7. Screening and prediction process of dust concentration index based on RF-GA-LSSVM.
Figure 8. GA-LSSVM dust concentration prediction model process.
Figure 9. Effect of LSSVM in predicting PM2.5 and PM10.
Figure 10. The fitness curves of PM2.5 and PM10 predicted by GA-LSSVM model.
Figure 11. The effect of GA-LSSVM on predicting PM2.5 and PM10.
Figure 12. Comparison of models in predicting PM2.5 and PM10.
Figure 13. Taylor diagram error analysis of test-sample prediction results.
Figure 14. Model test result diagram.
Figure 15. Model learning curve.
Table 1. Technology of dust concentration prediction method and its application.

| Method | Prediction Technique | Submitter | Domain of Application | Characteristic |
|---|---|---|---|---|
| Qualitative and semi-quantitative prediction | Statistical regression analysis, mortality table method | HAN [26] | Prediction and early warning of coal mine dust and coal worker pneumoconiosis | Semi-quantitative prediction with low accuracy. |
| Qualitative and semi-quantitative prediction | Bayesian decision analysis technique | SHEN [27] | Prediction of dust exposure in highway tunnel excavation work | Quantitative evaluation and forecasting based on risk likelihood enhances the objectivity of assessing long-term cumulative dust exposure. |
| Linear regression forecasting | Grey theory | CHEN [28] | Prediction of mine dust concentration | A sliding average is applied to the original data, yielding a small relative error in the prediction outcome. |
| Machine learning algorithm | ANN | LAL [29] | Prediction of dust at various locations of a mine | Effective for nonlinear prediction, but limited in scope and lacking robustness. |
| Combination forecasting | ARIMA-GRNN: autoregressive integrated moving average combined with generalized regression neural network | BIAN [30] | Pneumoconiosis prediction | The GRNN (general regression neural network) model handles nonlinear relationships well, offering high prediction accuracy and stability for nonlinear, unstable datasets. |
| Combination forecasting | Long short-term memory network with attention mechanism | LI [31] | Total suspended particulate concentration in the Pingshuo Anjialing opencast coal mine | High prediction accuracy and strong stability; various algorithm combinations can be applied. |
Table 2. Monitoring equipment parameters.

| Monitored Object | Monitoring Range | Resolution | Precision |
|---|---|---|---|
| PM10 | 0~1000 μg/m3 | 1 μg/m3 | ±10 μg/m3 |
| PM2.5 | 0~1000 μg/m3 | 1 μg/m3 | ±10 μg/m3 |
| Temperature | −40~120 °C | 0.1 °C | ±0.5 °C |
| Humidity | 0~99% RH | 0.1% RH | ±3% RH |
| Wind speed | 0~70 m/s | 0.1 m/s | ±0.3 m/s |
| Wind direction | 8 bearings | 1 bearing | — |
| Rainfall | 0~8 mm/min | 0.2 mm | ≤±5% |
| Noise | 30~130 dB | 0.1 dB | ±0.5 dB |
Table 3. Partial input and output layer index data. The first five columns are the input indicators; PM2.5 and PM10 are the outputs.

| Temperature (°C) | Humidity (%RH) | Stripping Volume (hectare) | Wind Direction | Wind Velocity (m/s) | PM2.5 (μg/m3) | PM10 (μg/m3) |
|---|---|---|---|---|---|---|
| 34.0 | 34.8 | 1.73 | 121 | 1.6 | 40 | 52 |
| 31.0 | 41.4 | 1.95 | 113 | 1.4 | 44 | 60 |
| 27.6 | 48.6 | 1.91 | 141 | 2.1 | 38 | 58 |
| 25.8 | 53.2 | 1.87 | 120 | 1.2 | 37 | 55 |
| 24.9 | 55.0 | 1.90 | 122 | 1.1 | 35 | 51 |
| 24.7 | 55.5 | 1.90 | 148 | 1.2 | 36 | 57 |
| 24.1 | 59.2 | 1.89 | 242 | 1.6 | 40 | 55 |
| 23.7 | 61.5 | 1.92 | 112 | 1.0 | 44 | 61 |
| 23.2 | 58.6 | 1.85 | 137 | 1.3 | 36 | 56 |
| 22.7 | 60.5 | 1.86 | 129 | 1.0 | 39 | 55 |
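As described in the abstract, the data are split into training and test sets at a 7:3 ratio. A minimal sketch of that preparation step, using only the ten sample rows shown in Table 3 (the full dataset is not published here), could look as follows:

```python
import numpy as np

# Sample rows from Table 3: temperature, humidity, stripping volume,
# wind direction, wind speed -> PM2.5, PM10 (values as published).
data = np.array([
    [34.0, 34.8, 1.73, 121, 1.6, 40, 52],
    [31.0, 41.4, 1.95, 113, 1.4, 44, 60],
    [27.6, 48.6, 1.91, 141, 2.1, 38, 58],
    [25.8, 53.2, 1.87, 120, 1.2, 37, 55],
    [24.9, 55.0, 1.90, 122, 1.1, 35, 51],
    [24.7, 55.5, 1.90, 148, 1.2, 36, 57],
    [24.1, 59.2, 1.89, 242, 1.6, 40, 55],
    [23.7, 61.5, 1.92, 112, 1.0, 44, 61],
    [23.2, 58.6, 1.85, 137, 1.3, 36, 56],
    [22.7, 60.5, 1.86, 129, 1.0, 39, 55],
])

X, y = data[:, :5], data[:, 5:]  # five inputs; two outputs (PM2.5, PM10)

# 7:3 train/test split as stated in the paper.
n_train = int(round(0.7 * len(X)))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```

Whether the authors split sequentially or after shuffling is not stated; a sequential split is shown here for simplicity.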
Table 4. Model hyperparameter settings.

| Parameter | Description | Value / Initialization Range |
|---|---|---|
| Bestc | LSSVM regularization parameter search range | [0.1, 1000] |
| Bestg | LSSVM kernel parameter search range | [0.001, 100] |
| MAXGEN | Maximum number of generations | 100 |
| NIND | Population size | 20 |
| Select | Selection probability | 0.9 |
| Recombin | Recombination (crossover) probability | 0.7 |
| Mut | Mutation probability | 0.01 |
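The settings above can be made concrete with a minimal, self-contained sketch of the GA-LSSVM idea: an LSSVM regressor (solved as a linear system with an RBF kernel) whose regularization parameter C and kernel parameter gamma are tuned by a simple real-coded GA using the ranges and rates from Table 4. The synthetic data, the truncated generation count, and the specific selection/crossover operators are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, gamma):
    # Pairwise RBF kernel matrix between row sets A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def lssvm_fit(X, y, C, gamma):
    # LSSVM training reduces to one linear system:
    # [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]
    n = len(X)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, gamma) + np.eye(n) / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, support values alpha

def lssvm_predict(X_train, b, alpha, gamma, X_new):
    return rbf_kernel(X_new, X_train, gamma) @ alpha + b

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy data standing in for the mine measurements (assumption).
X = rng.uniform(0, 1, (60, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=60)
X_tr, X_va, y_tr, y_va = X[:42], X[42:], y[:42], y[42:]

# Real-coded GA over (C, gamma) with ranges/rates from Table 4.
LOW, HIGH = np.array([0.1, 0.001]), np.array([1000.0, 100.0])
pop = rng.uniform(LOW, HIGH, (20, 2))                 # NIND = 20

def fitness(ind):
    # Lower validation RMSE = fitter individual.
    b, alpha = lssvm_fit(X_tr, y_tr, *ind)
    return rmse(y_va, lssvm_predict(X_tr, b, alpha, ind[1], X_va))

for _ in range(30):                                   # MAXGEN truncated for the sketch
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[: int(0.9 * len(pop))]]  # Select = 0.9
    children = []
    while len(children) < len(pop):
        a, b_ = parents[rng.integers(len(parents), size=2)]
        if rng.random() < 0.7:                        # Recombin = 0.7
            w = rng.random()
            child = w * a + (1 - w) * b_              # arithmetic crossover
        else:
            child = a.copy()
        if rng.random() < 0.01:                       # Mut = 0.01
            child = rng.uniform(LOW, HIGH)
        children.append(np.clip(child, LOW, HIGH))
    pop = np.array(children)

best = min(pop, key=fitness)                          # best (C, gamma) found
```

The key design point is that LSSVM replaces the SVM's quadratic program with a single linear solve, which makes the hundreds of fitness evaluations inside the GA loop cheap.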
Table 5. Evaluation of prediction results from different models.

| Model | PM | R2 | RMSE | STD |
|---|---|---|---|---|
| LSSVM | PM2.5 | 0.482 | 13.484 | 4.197 |
| LSSVM | PM10 | 0.466 | 15.751 | 3.618 |
| GA-LSSVM | PM2.5 | 0.872 | 8.592 | 13.503 |
| GA-LSSVM | PM10 | 0.913 | 7.476 | 14.606 |
| PSO-LSSVM | PM2.5 | 0.805 | 9.189 | 10.082 |
| PSO-LSSVM | PM10 | 0.813 | 11.110 | 9.937 |
| GWO-LSSVM | PM2.5 | 0.808 | 9.319 | 12.033 |
| GWO-LSSVM | PM10 | 0.859 | 8.974 | 11.161 |
| ISSA-LSSVM | PM2.5 | 0.763 | 10.956 | 7.630 |
| ISSA-LSSVM | PM10 | 0.835 | 9.283 | 10.816 |
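The three indicators in Table 5 are straightforward to compute from a prediction series. A minimal sketch, assuming STD denotes the sample standard deviation of the predicted series (the table does not define it explicitly):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return R2, RMSE, and sample standard deviation of the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    std = np.std(y_pred, ddof=1)
    return r2, rmse, std

# Sanity check: perfect predictions give R2 = 1 and RMSE = 0.
r2, rmse, std = evaluate([40, 44, 38, 37], [40, 44, 38, 37])
```

Note that R2 and RMSE both measure agreement with the observations, while STD only characterizes the spread of the predicted series; this is consistent with the low-variance LSSVM baseline in Table 5 scoring poorly on R2 despite a small STD.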
Xiao, S.; Liu, J.; Ma, Y.; Zhang, Y. Combined Prediction of Dust Concentration in Opencast Mine Based on RF-GA-LSSVM. Appl. Sci. 2024, 14, 8538. https://doi.org/10.3390/app14188538