1. Introduction
Soil reclamation in coal mining subsidence areas is a critical component of ecological restoration, aiming to mitigate the adverse environmental impacts of mining activities and restore degraded land to productive use. In China, the cumulative area of coal mining subsidence has reached 13.5 million hectares, with an annual increase of 70,000 hectares, leading to severe soil degradation and significant socio-economic and ecological challenges. Soil reclamation is, therefore, an urgent priority. At present, a variety of treatment methods for coal mining subsidence areas, such as in situ soil improvement, natural recovery [1], foreign soil landfill recovery, coal gangue landfill remediation, and chemical remediation [2], have been proposed to manage the soil quality of reclamation areas. However, because some of these methods carry a risk of pollution or are costly and resource-intensive, foreign soil remediation has become the mainstream method for reclaiming coal mining subsidence land.
The quality of reclaimed soil is an important index for evaluating the effect of soil reclamation. However, soil quality assessments are diverse and complex, and they differ significantly across reclamation objectives. Reclamation to restore soils to agricultural use usually focuses on soil fertility and productivity, emphasizing chemical and physical indicators [3,4], while reclamation for ecological restoration is more concerned with biodiversity restoration and relies on bioindicators [5]. Pollution remediation prioritizes pollutant toxicity and degradation potential, combining chemical and biological toxicity tests [6,7]. In recent years, research on soil quality assessment has gradually developed in a multidimensional and comprehensive direction. For example, in reclamation to restore soils to agricultural use, researchers have not only focused on traditional chemical indicators but also introduced soil microbiomics and metabolomics analyses [3,8]. However, balancing the synergies and trade-offs between indicators under different reclamation objectives remains a core challenge of current research [9].
Traditional methods for selecting indicators rely on literature reviews and expert consultations. Chen et al. characterized the quality of reclamation in the Xuzhou mining area using a soil quality evaluation index, which includes two categories of indicators, soil productivity and environmental quality, employing an ordinary weighted ball method [10]. Zhao et al. utilized frequency analysis, theoretical analysis methods, and expert scoring to select evaluation indicators and assessed soil fertility in the reclamation area using a nonlinear membership degree function method and an improved Nemerow index [11]. However, such methods depend on the experience of the selectors and are therefore often subjective. Various scholars have attempted to apply mathematical methods to select evaluation indicators to reduce the interference of subjective factors. For example, Yang et al. comprehensively evaluated the soil fertility level and degree of heavy metal pollution in the Huaihe mining area’s subsidence area using the single-factor index method and the Nemerow index method based on fuzzy mathematics principles [12]. In recent years, with the development of computer technology, researchers have increasingly applied computational methods to soil quality assessment, such as principal component analysis (PCA) for dimensionality reduction and deep learning evaluation models trained on large datasets. However, deep learning requires large amounts of data, which are generally difficult to obtain. PCA performs relatively well and has relatively low data requirements, but when the number of indicators is high and the sample size is limited, PCA may produce distorted results because of multicollinearity, a phenomenon demonstrated in the study of Johnstone [13]. In high-dimensional, small-sample data in particular, PCA’s assumption of linear relationships makes it difficult to extract data features accurately, so there is still a need for improvement.
Methods of soil quality assessment have advanced through interdisciplinary research, and many new methods have emerged. For example, technologies based on microbiomics and metabolomics enable in-depth analyses of the biological activity and biochemical status of soils [14], remote sensing and geographic information systems (GIS) enable large-scale, dynamic soil monitoring [15], and deep learning and artificial intelligence provide high-precision predictive models of soil quality through the integration of data from multiple sources [16]. In addition, classical evaluation methods, such as the soil quality index (SQI), soil degradation index (SDI), and soil productivity index (SPI) [17,18], have found new applications in soil quality assessment through integration with emerging technologies. Some of these techniques have been applied to assess the quality of reclaimed soil in coal mining subsidence areas. Examples include collecting soil data from different years and using geographic information technology to study the spatial and temporal changes in the quality of reclaimed soils on a large scale [19], using complex network theory to establish a computer model to analyse the relationship between indicators and soil quality [20], using unmanned aerial vehicle spectral scanning to rapidly monitor the quality of reclaimed soils on a large scale in coal mining subsidence zones [21], and calculating soil quality indices by using principal component analysis to evaluate the quality of reclaimed soils [22]. However, some of these methods are subject to demanding conditions: geographic information technology requires the collection of a large amount of data from different years, and drone spectral scanning requires suitable drone equipment and therefore cannot be widely implemented. Given the current widespread distribution of coal mining subsidence areas, an affordable and feasible evaluation method that can be widely disseminated is particularly important. Among the existing methods for reclamation soil quality assessment, SQI has been widely used in most soil environments due to its simplicity of calculation and flexibility of quantification [23]. For example, Wendyam et al. employed this method to assess soil quality in a watershed area [24], while Muhammad and Roderick applied it to evaluate soil quality in a semi-arid region and to explore soil pollution, respectively [25,26]. The accuracy of soil quality assessments using the SQI method largely depends on the selection of indicators and scoring methods [27,28].
Traditional methods such as expert review rely heavily on subjective judgement and are susceptible to personal bias. In contrast, modern computational techniques such as deep learning tend to require large datasets and are costly, making them impractical or suboptimal in many cases. This study addresses the limitations of both approaches and proposes a statistically robust and feasible framework designed for small-sample scenarios. To improve the accuracy of soil quality evaluation in reclaimed areas with small-sample data, we first use cluster analysis (CA) and correlation analysis (CDA) to screen the full dataset of soil indicators and remove redundant indicators, and then use PCA for in-depth screening to establish the minimum dataset. Reducing the number of indicators before PCA improves the accuracy and interpretability of the PCA results, reduces subjectivity in the component selection process, and improves the reliability of the minimum dataset affecting the quality of soil reclamation. Finally, the SQI method and membership functions (MF) were used to evaluate the quality of the reclaimed soil, and the accuracy of the resulting CA-CDA-PCA-MF method was validated.
3. Minimum Dataset Establishment
3.1. Cluster Analysis
Cluster analysis is a statistical method that groups the items in a dataset into categories based on similarity. By grouping soil indicators with similar features, it can preliminarily reveal the data structure and reduce the redundancy of highly correlated indicators. This study selected suitable indicators for soil quality evaluation using the R-type (variable) clustering method within hierarchical clustering, implemented in SPSS 27. The variables were clustered after the data were standardized using z-scores. The distance between variables was defined as the squared Euclidean distance; the two closest variables were merged, and the distances between the merged cluster and the remaining variables were then recalculated. This process continued until a dendrogram illustrating the relationships between variables was obtained, which was used to establish the minimal dataset [35]. The squared Euclidean distance is calculated as follows [36]:
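For reference, the standard definition of the squared Euclidean distance between two standardized variables is sketched below. The symbols are notation assumed here rather than taken from the study: $x_{ik}$ and $x_{jk}$ denote the z-scores of indicators $i$ and $j$ at sampling point $k$, and $m$ is the number of sampling points.

```latex
d_{ij} = \sum_{k=1}^{m} \left( x_{ik} - x_{jk} \right)^{2}
```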
3.2. Correlation Analysis
Correlation analysis is a statistical method used to measure the relationship between two or more variables. Correlation analysis further quantifies the correlation between indicators and identifies groups of variables with high covariance through correlation coefficient matrices. If two or more indicators are highly correlated during the minimum dataset selection process, correlation analysis can be used to screen out redundant variables and retain only the core variables that have a strong influence on soil quality. This step not only reduces the number of variables but also reduces the computational burden during principal component analysis. After conducting normality tests for all evaluation indicators, researchers select the correlation coefficient matrix based on the results. If all factors follow a normal distribution, the Pearson correlation coefficient matrix is used for analysis. Otherwise, the Spearman correlation coefficient matrix is applied.
After selecting the appropriate correlation matrix, the analysis proceeds based on the cluster analysis results. Within different groups classified by cluster analysis, the correlation between indicators is analyzed. A correlation coefficient greater than 0.5 indicates a strong relationship, and such indicators are prioritized for inclusion in the minimal dataset based on practical considerations and established research. Indicators with a correlation coefficient less than 0.5 are considered to have no significant relationship and are treated as backup options for the minimal dataset.
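The screening rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SPSS workflow used in the study: the function names, the `all_normal` flag, and the demo values are hypothetical, and only the 0.5 threshold and the Pearson/Spearman choice come from the text.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def strongly_correlated_pairs(cluster, all_normal, threshold=0.5):
    """Return indicator pairs within one cluster whose |r| exceeds the threshold."""
    corr = pearson if all_normal else spearman
    pairs = {}
    for a, b in combinations(sorted(cluster), 2):
        r = corr(cluster[a], cluster[b])
        if abs(r) > threshold:
            pairs[(a, b)] = r
    return pairs


# Hypothetical values for three indicators at five sampling points.
demo_cluster = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.1, 9.8],  # rises monotonically with A
    "C": [3.0, 1.0, 4.0, 2.0, 3.5],  # only weakly related to A and B
}
flagged = strongly_correlated_pairs(demo_cluster, all_normal=False)
```

In this toy cluster only the pair (A, B) is flagged; deciding which member of a flagged pair to keep remains a judgement based on practical considerations, as the text notes.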
3.3. Principal Component Analysis
Principal component analysis (PCA) is used to transform the original variables into a set of independent principal components through linear transformation. This process reduces data dimensions, minimizes redundant information, and extracts the main variability in the data.
First, the Kaiser–Meyer–Olkin (KMO) and Bartlett’s tests were performed on the indicators entering PCA to determine the suitability of the data for PCA. For soil quality evaluation, principal components with eigenvalues of 1 or higher are retained. Indicators with loadings of 0.5 or greater on the same principal component are considered alternative (candidate) indicators for inclusion in the minimal dataset. If an indicator has loadings of 0.5 or greater on two or more principal components, it is assigned to the principal component in which its correlation with the other indicators is lower.
The vector norm (Norm) is introduced as an additional reference for selecting indicators for the minimal dataset, so that selection does not rely solely on indicator loadings and risk overlooking some indicator information [37]. The larger the Norm value of an indicator, the stronger its ability to explain comprehensive information. The formula for calculating the Norm value is as follows [38]:
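A form of this equation that is widely used in minimum-dataset studies and agrees with the symbol definitions in the text is sketched below. The summation index $j$, which runs over the first $k$ retained components, is notation introduced here as an assumption.

```latex
N_{ik} = \sqrt{\sum_{j=1}^{k} u_{ij}^{2} \, e_{j}}
```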
where N_ik is the Norm value of the i-th indicator over the first k principal components with eigenvalues greater than 1, u_ik is the loading of the i-th indicator on the k-th principal component, and e_k is the eigenvalue of the k-th principal component.
3.4. Soil Quality Evaluation
The SQI is a multidimensional concept that relies on indicators to comprehensively assess soil quality. The assessment is more representative when it includes multiple indicators rather than focusing on individual ones. The SQI represents soil quality numerically by establishing membership functions between the evaluation indicators and soil quality based on their positive or negative effects, and by integrating the weights of the indicators in each dataset. The membership function standardizes soil quality indicators with different units and scales to a membership value between 0 and 1, making them comparable. The way the membership function is defined affects the accuracy of the soil quality index. Different types of membership functions reflect the way in which each indicator affects soil quality: S-type functions are used for positive indicators, inverse S-type functions for negative indicators, and parabolic functions for indicators for which an optimal range exists [39,40].
The commonly used membership functions are categorized into three types as follows: S-shaped membership function:
Inverted S-shaped membership function:
Parabolic membership function:
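A common piecewise-linear form of the three functions is sketched below. It is an assumption consistent with the symbol definitions given in the text, not necessarily the exact form used in this study; some studies, for instance, bound the scores between 0.1 and 1 rather than between 0 and 1.

```latex
% S-shaped membership function (positive indicators)
f(x) =
\begin{cases}
1, & x \ge b \\
\dfrac{x - a}{b - a}, & a < x < b \\
0, & x \le a
\end{cases}

% Inverted S-shaped membership function (negative indicators)
f(x) =
\begin{cases}
1, & x \le a \\
\dfrac{b - x}{b - a}, & a < x < b \\
0, & x \ge b
\end{cases}

% Parabolic membership function (indicators with an optimal range)
f(x) =
\begin{cases}
1, & b_1 \le x \le b_2 \\
\dfrac{x - a_1}{b_1 - a_1}, & a_1 < x < b_1 \\
\dfrac{a_2 - x}{a_2 - b_2}, & b_2 < x < a_2 \\
0, & x \le a_1 \ \text{or} \ x \ge a_2
\end{cases}
```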
where x is the actual measured value of the evaluation indicator; a and b are the lower and upper limits of the critical values of the indicator, representing the minimum and maximum measured values; a1 and a2 are the lower and upper limits of the critical values of the indicator, representing the minimum and maximum measured values; and b1 and b2 are the lower and upper limits of the optimal value.
Principal component analysis was performed on the whole indicator data set and the minimum indicator data set to calculate the weights of each evaluation indicator in the soil quality index calculation. The SQI for different datasets is calculated using the following formula by combining the membership degree:
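The weighted additive form implied by the symbol definitions in the text is sketched here; it is the standard SQI formulation and is stated as an assumption about the equation used in this study.

```latex
\mathrm{SQI} = \sum_{i=1}^{n} W_i \, S_i
```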
where S_i is the score of the i-th indicator, n is the number of indicators, and W_i is the weight of the i-th indicator. A higher SQI value indicates better soil quality and greater suitability for plant growth.
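The scoring and aggregation steps can be illustrated with a short Python sketch. It assumes piecewise-linear membership functions bounded between 0 and 1 and a weighted sum of scores; the function names and the demo thresholds, scores, and weights are hypothetical and are not the values reported in Tables 2 and 3.

```python
def s_type(x, a, b):
    """Score for a positive indicator: 0 at or below a, 1 at or above b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def inverse_s_type(x, a, b):
    """Score for a negative indicator: mirror image of the S-type function."""
    return 1.0 - s_type(x, a, b)


def parabolic(x, a1, b1, b2, a2):
    """Score for an indicator with an optimal range [b1, b2] inside (a1, a2)."""
    if x <= a1 or x >= a2:
        return 0.0
    if x < b1:
        return (x - a1) / (b1 - a1)
    if x <= b2:
        return 1.0
    return (a2 - x) / (a2 - b2)


def soil_quality_index(scores, weights):
    """Weighted sum of indicator scores; weights are renormalized to sum to 1."""
    total = sum(weights.values())
    return sum(weights[name] / total * scores[name] for name in scores)


# Hypothetical example with two indicators (values are illustrative only).
demo_scores = {
    "positive_indicator": s_type(5.0, a=0.0, b=10.0),          # 0.5
    "negative_indicator": inverse_s_type(0.0, a=0.0, b=10.0),  # 1.0
}
demo_weights = {"positive_indicator": 0.25, "negative_indicator": 0.75}
demo_sqi = soil_quality_index(demo_scores, demo_weights)
```

Renormalizing the weights inside `soil_quality_index` is a defensive design choice so that the index stays between 0 and 1 even if the supplied weights do not sum exactly to 1.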
3.5. Validation for the CA-CDA-PCA-MF Method
The core of validating the accuracy of the CA-CDA-PCA-MF method is to compare the soil quality index calculated from the minimum data set obtained with the CA-CDA-PCA-MF method against the soil quality index calculated from the full data set using PCA. The comparison is made by calculating the coefficient of determination R² and the coefficient of deviation CV between the two groups of soil quality indices using the following formulae [41]:
where the quantities involved are the soil quality index calculated for the full data set for the i-th sample, the soil quality index calculated for the minimum data set for the i-th sample, the mean value of the soil quality index calculated for the full data set, and the number of samples n.
The closer the coefficient of determination R² is to 1, the closer the soil quality index calculated from the minimum data set is to that calculated from the full data set, and the more accurate the model. The closer the coefficient of deviation CV is to 0, the smaller the deviation of the model.
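The comparison can be sketched in Python as follows. The definitions used here are assumptions chosen to match the quantities listed in the text (per-sample SQI from both datasets, the mean of the full-dataset SQI, and the number of samples): R² is computed as 1 minus the ratio of residual to total sum of squares, and the deviation as a signed relative difference of totals. The study’s own formulae may differ, and the demo values are hypothetical.

```python
def coefficient_of_determination(sqi_full, sqi_mds):
    """R^2 as 1 - SS_res / SS_tot, with the full-dataset SQI as the reference."""
    mean_full = sum(sqi_full) / len(sqi_full)
    ss_res = sum((f - m) ** 2 for f, m in zip(sqi_full, sqi_mds))
    ss_tot = sum((f - mean_full) ** 2 for f in sqi_full)
    return 1.0 - ss_res / ss_tot


def relative_deviation(sqi_full, sqi_mds):
    """Signed relative deviation of the minimum-dataset SQI from the full-dataset SQI."""
    return (sum(sqi_mds) - sum(sqi_full)) / sum(sqi_full)


# Hypothetical SQI values for four sampling points (not the study's data).
demo_full = [0.2, 0.4, 0.6, 0.8]
demo_mds = [0.25, 0.35, 0.65, 0.75]
demo_r2 = coefficient_of_determination(demo_full, demo_mds)
demo_cv = relative_deviation(demo_full, demo_mds)
```

Under these assumed definitions, a negative deviation means that the minimum dataset yields lower SQI values on average than the full dataset.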
4. Results and Discussion
4.1. Cluster Analysis Results
Cluster analysis was conducted on a dataset comprising ten factors, with the results displayed in Figure 3. According to these results, the dataset was divided into three classes when the clustering level was between 10 and 15. The first class includes exchangeable magnesium (EMg), exchangeable calcium (ECa), available iron (Fe(avail.)), organic matter (OM), total nitrogen (TN), and available copper (Cu(avail.)). The second class includes available manganese (Mn(avail.)) and available zinc (Zn(avail.)), and the third class includes available silicon (Si(avail.)) and free iron (Fe(free)).
4.2. Correlation Analysis Results
Preliminary screening of the results of cluster analysis is conducted by using correlation analysis. Initially, a normality test was performed on all factors in the dataset, indicating that Mn(avail.), Zn(avail.), Cu(avail.), TN, and Si(avail.) met the criteria for normal distribution. However, OM, Fe(free), Fe(avail.), ECa, and EMg did not follow a normal distribution. Hence, this study utilized the Spearman correlation coefficient matrix for further analysis.
Figure 4 shows the correlation coefficients for the plots.
The correlation coefficient between EMg and ECa in Class 1 indicators is 0.927 **. EMg enhances the aggregation and cementation of soil particles, promoting soil structure stability and permeability. It also supports plant growth by regulating physiological metabolic processes within plant cells and affecting soil pH, which influences the effectiveness of other ions in soil and nutrient absorption by plants. Exchangeable calcium promotes soil particle aggregation and cementation, benefiting soil structure stability and aeration. It influences soil biological activity, microbial growth, and organic matter decomposition and plays a significant role in plant cell wall synthesis and cell division, profoundly impacting plant root growth and development [42,43]. Although their functions overlap to some extent, both are indispensable and are thus included in the alternative minimum dataset.
The correlation coefficient between Fe(avail.) and Cu(avail.) is 0.924 **. Fe(avail.) is vital for chlorophyll synthesis and nitrogen metabolism in plants, necessary for photosynthesis, and supports the healthy development of plant roots and leaves. It also facilitates oxidation-reduction reactions in the soil, maintaining soil redox balance. Cu(avail.) is crucial for plants’ photosynthesis, respiration, nitrogen metabolism, protein and enzyme synthesis, root growth, and nutrient absorption and utilization. Their levels directly impact plant growth, and their deficiency or excess significantly affects soil quality [42,43]. Hence, both are included in the alternative minimum dataset.
The correlation coefficient between OM and TN is 0.936 **. TN is a critical nutrient for plant growth and is involved in synthesizing organic compounds such as proteins and nucleic acids. It promotes plant growth, improves yield and quality, and reflects soil fertility. As an essential indicator, it evaluates soil fertility effectively. OM provides vital nutrients for plant growth, including carbon, nitrogen, and phosphorus, playing a crucial role in enhancing plant development, soil structure, water and nutrient retention, soil permeability, aeration, microbial growth, and activity. It also maintains the soil ecosystem’s balance and stability [44,45]. Their relationship is not directly subordinate, so both are included in the alternative minimum dataset.
In Class 2 indicators, Mn(avail.) participates in plant photosynthesis and respiration, promotes chlorophyll synthesis and aids in oxidation–reduction reactions. It also assists plants in absorbing and utilizing nutrients such as nitrogen, phosphorus, and potassium. Zn(avail.) is involved in the synthesis of plant growth hormones and enzyme activity, enhancing the plant’s resistance to diseases and pests [46]. Both indicators show no significant correlation and are thus included in the alternative minimum dataset.
In Class 3 indicators, the correlation coefficient between Si(avail.) and Fe(free) is 0.753 **. Si(avail.) enhances plants’ resistance to diseases and pests and improves their tolerance to adverse environmental conditions. Fe(free) affects the absorption capacity of plant roots and the nutrient supply [44]. The relationship underlying this correlation remains unclear; therefore, to prevent errors in evaluating indicators based on correlation coefficients alone, both are included in the alternative minimum dataset.
4.3. Principal Component Analysis Results
After cluster analysis and correlation analysis, Mn(avail.), Zn(avail.), Fe(free), and Si(avail.) were selected for inclusion in the minimum dataset for soil quality evaluation. However, there is still data redundancy in the alternative minimum dataset of the first-class indicators. Hence, PCA was used to select the first-class indicators that ultimately enter the minimum dataset.
First, the indicators undergoing PCA were subjected to Kaiser–Meyer–Olkin (KMO) and Bartlett’s tests to determine whether they are suitable for PCA [47]. The results of the tests showed a KMO value of 0.712, meeting the requirements for conducting PCA, and Bartlett’s test result was p < 0.01, indicating a significant correlation, thus making the indicators suitable for PCA.
Using SPSS 27, PCA was then conducted on the alternative dataset comprising Fe(avail.), Cu(avail.), OM, TN, EMg, and ECa. Rather than the conventional eigenvalue threshold of 1, components with eigenvalues exceeding 0.8 were retained to ensure that the principal components achieved a sufficient cumulative contribution rate. This selection process identified two principal components, together achieving a cumulative contribution rate of 92.95%. These components were used to filter the factors. The specific contributions of the different principal components and the loadings of the different factors in the principal components are shown in Table 1.
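The relationship between the reported contribution rates and the eigenvalue threshold can be checked with a short calculation. The sketch assumes that PCA was run on the correlation matrix of the six standardized indicators, so that the eigenvalues sum to the number of indicators; the function name is hypothetical, and the contribution rates of 79.25% and 13.70% are those reported in the text.

```python
def eigenvalue_from_contribution(contribution_percent, n_indicators):
    """For correlation-matrix PCA, eigenvalue = contribution rate x number of indicators."""
    return contribution_percent / 100.0 * n_indicators


N_INDICATORS = 6                # Fe(avail.), Cu(avail.), OM, TN, EMg, ECa
contributions = [79.25, 13.70]  # percent, first and second principal components

eigenvalues = [eigenvalue_from_contribution(c, N_INDICATORS) for c in contributions]
retained_at_0_8 = [ev for ev in eigenvalues if ev > 0.8]
retained_at_1_0 = [ev for ev in eigenvalues if ev >= 1.0]
cumulative = sum(contributions)
```

Under this assumption the two eigenvalues are about 4.76 and 0.82, so the second component passes the 0.8 threshold but would be dropped by a threshold of 1, which would leave a cumulative contribution of only 79.25% instead of 92.95%.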
The contribution rate of the eigenvalue of the first principal component was 79.25%. Factors exhibiting high loadings in this component comprised Fe(avail.), Cu(avail.), OM, TN, EMg, and ECa. The Norm value of each factor was computed, and factors whose Norm values reached at least 90% of the highest value, that of exchangeable magnesium, were selected. The analysis indicated that Cu(avail.), TN, EMg, and ECa satisfied the selection criteria. The correlation coefficients of EMg with ECa, Cu(avail.), and TN were 0.96 **, 0.706, and 0.585, respectively. Thus, EMg, which had the highest Norm value, and TN, which had the lowest correlation with EMg, were included in the minimum dataset.
The contribution rate of the eigenvalue of the second principal component was 13.70%, with EMg and Cu(avail.) showing the higher loadings. Since EMg had already been selected for the minimum dataset from the first component, the analysis focused on Cu(avail.). Cu(avail.) was incorporated into the minimum dataset, and no other factors showed high loadings in the second component.
Accordingly, three indicators, EMg, TN, and Cu(avail.), were added to the minimum dataset through principal component analysis. Ultimately, cluster analysis, correlation analysis, and PCA selected seven indicators, namely EMg, TN, Cu(avail.), Mn(avail.), Zn(avail.), Fe(free), and Si(avail.), as the minimum dataset for soil quality evaluation.
4.4. Soil Quality Index Calculation and Soil Quality Assessment
The weights of factors in the minimum dataset and the types of membership functions are presented in Table 2. In this study area, EMg, TN, Cu(avail.), Mn(avail.), Zn(avail.), and Si(avail.) all exhibit positive correlations with soil quality and are scored using S-type functions. In contrast, the concentration of Fe(free) is negatively correlated with soil quality and is scored using an inverse S-type function. The soil quality index for the study area, obtained by integrating the weights of each indicator and the membership function values, is listed in Table 3. This index varies from 0.06 to 0.88, with an average of 0.4575, indicating a moderate level of overall soil quality. A map illustrating the soil quality grade distribution across the study area was generated based on the index values from sampling points (Figure 5), providing a visual representation of varying soil quality grades throughout the study area. The area predominantly shows a moderate soil quality grade, with the northeastern section exhibiting a fairly good grade while the northwestern and southeastern sections display fairly poor grades.
The soil quality index and the corresponding map of soil quality grades reveal substantial disparities in the quality of reclaimed soil throughout the study area. Soil quality is lowest at sampling point 1 and highest at sampling point 5, demonstrating a clear north-to-south trend of declining quality. Critical contributors to these differences include variations in OM, Cu(avail.), and EMg levels, which result in relatively poor soil aeration and fertility at sampling point 1, adversely affecting plant growth and respiration [48].
4.5. Validation Results for the CA-CDA-PCA-MF Method
The accuracy of the SQI values for the minimum dataset was tested by comparing them with the SQI values calculated for the entire dataset using PCA. The SQI for the whole dataset calculated for each point using PCA is shown in Table 4. The SQI based on the whole dataset ranges from 0.02 to 0.91, with a mean value of 0.488. The SQI calculated with the minimum dataset ranges from 0.06 to 0.88, with a mean value of 0.4575. The difference between the two mean values was small, at 6.2%. Regression analysis of the two sets of soil quality indices was carried out, and a regression plot was obtained (Figure 6). The regression equation was y = 0.78x + 0.09, and the calculated coefficient of determination, R², was 0.882. The slope of 0.78 indicates that the trends of the two datasets are highly similar (1 when identical), the intercept of 0.09 indicates that the bias between the two datasets is small (0 when unbiased), and the R² indicates a good fit between the two datasets (1 for a perfect fit). The deviation coefficient CV was calculated as −0.053, which indicated that the deviation between the soil quality index of the minimum dataset and that of the whole dataset was small, and that the accuracy of the soil quality index calculated from the minimum dataset met the requirements. In conclusion, the accuracy of the SQI calculated by the CA-CDA-PCA-MF method was verified.
4.6. Discussion
The observed differences in soil quality can be attributed to several key factors. At sampling point 1, where the lowest SQI (0.105) was recorded, the relatively low concentrations of exchangeable magnesium (EMg), total nitrogen (TN), and available copper (Cu) likely contributed to poor soil aeration and fertility. EMg and TN are essential for maintaining soil structure and nutrient availability, while Cu plays a critical role in plant photosynthesis and respiration. The deficiency of these elements at sampling point 1 may have limited plant growth and overall soil health. In contrast, sampling point 5, with the highest SQI (0.77), exhibited more favorable conditions, likely due to higher concentrations of these key nutrients. Additionally, the high content of free iron (Fe) across the study area was found to negatively impact soil quality, as excessive Fe can lead to nutrient imbalances and reduced plant uptake of essential elements.
The accuracy test showed that the SQI calculated from the minimum dataset was reliable enough to explain the soil quality in the region. This is consistent with Johnstone’s study, in which PCA on a specific subset of the variables also gave results consistent with those of PCA on the full dataset [13]. In the case of high-dimensional small samples, PCA is affected by multicollinearity, which obscures the ecological significance of the principal components [49]. As shown in Table 1, the first principal component (with a contribution of 79.25%) contains six highly correlated indicators (correlation coefficients greater than 0.7), such as EMg, ECa, and Fe(avail.), and it is difficult to differentiate their actual contributions to soil quality. In contrast, CA-CDA-PCA-MF pre-screened the indicators by cluster analysis (CA) and correlation analysis (CDA), merged redundant variables (e.g., ECa and EMg, with a correlation coefficient as high as 0.96), and ultimately retained seven relatively independent indicators in the minimum dataset (MDS). This strategy greatly reduced the information overlap between principal components and made the ecological significance of the PCA loading matrix clearer, i.e., enhanced the interpretability of the data. In addition, the computational complexity of PCA grows cubically with the number of indicators, whereas CA-CDA-PCA-MF reduces the dimensionality of indicators through pre-screening, thus reducing the computational volume of PCA. Despite the small sample size of this study, the stepwise process of CA-CDA-PCA-MF (CA → CDA → PCA) provides a scalable framework for large-scale data scenarios, and the pre-screening process avoids repeated analyses of redundant indicators (e.g., ECa) and reduces the consumption of computational resources. This method is particularly important for areas with limited data, as it does not require large datasets for accurate soil quality assessment. In addition, the flexibility of the method allows it to be adapted to different geographical and geological conditions, providing a versatile tool for assessing the effectiveness of reclamation in different environments. For example, the method can be applied to other mining areas or even to non-mining reclamation projects, such as agricultural land rehabilitation or urban greenfield development.
Our study also has some limitations. First, the research site is a coal mining subsidence area reclaimed with foreign soil, so the overall soil texture and soil structure are relatively uniform; that is, the geological conditions are relatively simple. The results are therefore broadly applicable to sites reclaimed with foreign soil, but the experimental program would need to be adjusted for more complex land conditions. For example, Fayez Raiesi explored soil evaluation indexes in semi-arid areas and found that anthropogenic farming also has a great impact on soil quality and that the most important soil quality indicators are enzyme and microbial activities [50]. Our study site has not yet been disturbed by human cultivation, so the soil quality indicators and sampling sites would need to be redesigned to meet the evaluation requirements of more complex site conditions. Another limitation is that the reclamation was completed in a relatively short period of time, and the reclaimed site was not yet covered by vegetation, as shown in Figure 1c. Therefore, the study focused on physicochemical indicators of soil quality, and biological factors such as microbial activity and vegetation restoration were not included.
To overcome these limitations, future work will first conduct experimental studies in complex areas with different reclamation methods to explore the effectiveness of the method on different sites. Second, more ecological indicators, such as microbial activity and vegetation restoration, should be incorporated to provide a more comprehensive assessment of soil quality. Finally, the weighting and classification of the indicators should be optimized, and once sufficient data have accumulated, machine learning techniques can be integrated to further improve the accuracy and applicability of the model in predicting soil quality and reclamation outcomes.
5. Conclusions
Based on cluster analysis, correlation analysis, and principal component analysis, a minimum dataset for assessing soil quality in the reclamation area of the Ezhuang Coal Mine, Laiwu District, is established. The soil quality index of collected soil samples is calculated to evaluate the reclamation quality in the study area using a membership function approach. The research findings are summarized as follows:
(1) Cluster analysis, correlation analysis, and PCA determined the minimum dataset for soil quality evaluation in Laiwu’s reclamation area, which includes exchangeable magnesium, total nitrogen, available copper, available manganese, available zinc, free iron, and available silicon;
(2) The soil quality of the Ezhuang coal mine reclamation area was assessed. The soil quality index (SQI) of the study area ranged from 0.06 to 0.88, with a mean value of 0.4575. The soil quality of the whole reclamation area was mainly moderate, with large spatial variations; sampling site No. 1 had the worst soil quality (SQI = 0.105), and sampling site No. 5 had the best (SQI = 0.77);
(3) The accuracy of the CA-CDA-PCA-MF method was verified. The accuracy of the minimum dataset created by the CA-CDA-PCA-MF method was verified using the coefficient of determination and the coefficient of deviation, and it was determined that the method can be used for soil quality assessment in topsoil reclamation projects;
(4) Factors affecting soil quality were investigated: The apparent differences in soil quality at the study site were attributed primarily to variations in EMg, TN, and Cu concentrations that affect soil structure, fertility, and plant growth. In addition, high levels of free iron (Fe) negatively affected soil quality across the region. These findings contribute to targeted interventions to increase soil fertility and improve soil structure to support sustainable land restoration;
(5) Future directions for extending the study were explored. The completed research concerns coal mining subsidence areas reclaimed with foreign soil, and the method is effective and feasible for soil quality evaluation in areas reclaimed in the same way, which makes it well suited to wider application. However, for soils reclaimed by other methods, such as chemical reclamation, the soil conditions are affected by the reclamation method, and the sampling program and the selection of indicators would need to be redesigned.