Article

Prediction of Soil Compaction Parameters Using Machine Learning Models

1
College of Civil Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
2
ZhongYifeng Construction Group Co., Ltd., Suzhou 215131, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(7), 2716; https://doi.org/10.3390/app14072716
Submission received: 21 February 2024 / Revised: 21 March 2024 / Accepted: 22 March 2024 / Published: 24 March 2024
(This article belongs to the Section Civil Engineering)

Abstract

Simple Summary

The research of this paper can be applied in the compaction parameter prediction of soil-filling materials, which can significantly reduce the amount of laboratory work and improve the efficiency of optimizing design for soil resource utilization in engineering construction.

Abstract

Maximum Dry Density (MDD) and Optimum Moisture Content (OMC) are two important parameters of soil filling, which affect the soil stability and bearing capacity, and thus the reliability and durability of facilities such as highways and bridges. Therefore, it is important to make reasonable predictions of OMC and MDD. Four machine learning algorithms, namely, Support Vector Machine (SVM), Artificial Neural Network (ANN), Random Forest (RF), and Extreme Gradient Boosting Tree (XGBoost), are adopted in this paper to establish MDD and OMC prediction models. After training and testing, the best models of the four algorithms are compared. The results show that, as an ensemble learning algorithm, XGBoost is the best model for predicting MDD and OMC, with an R2 of 0.9234 for OMC, and an R2 of 0.9098 for MDD. Finally, the feature importance analysis concludes that the plastic limit (PL) and the liquid limit (LL) are the two features that affect OMC and MDD the most. The prediction of soil compaction parameters using machine learning models, especially ensemble learning, can significantly reduce the amount of laboratory work and improve the efficiency of optimizing design for soil resource utilization in engineering construction.

1. Introduction

Nowadays, high-quality filling materials are becoming scarce in infrastructure construction. Guided by sustainability concepts, more types of soil are being considered as filling materials. Some soils that were previously not permitted as road construction materials can now be used after testing and evaluation, which enables the recycling and utilization of waste resources.
Compaction performance is a vital property to determine whether a kind of soil can be used as a filling for roads or other engineering. The density of soils can be increased by increasing the contact area between particles and decreasing the distance between particles when subjected to an external force [1,2]. In geotechnical engineering, the compaction can affect the stability, deformation characteristics, and bearing capacity of soil [3,4]. By controlling the compaction state of soil, the strength and stability of soil can be improved, and deformation can be reduced, which ensures the safety and reliability of geotechnical engineering. Therefore, in engineering construction, it is crucial to control the compaction of soil reasonably.
The compaction of soils is usually evaluated by determining the maximum dry density (MDD) and optimum moisture content (OMC) [5]. The standard and modified Proctor tests are the most commonly used compaction tests for determining OMC and MDD. In the Proctor compaction test, MDD is the dry density of the soil when it reaches its densest state under a given number of blows, and OMC is the moisture content of the soil in that state. These parameters are important for constructed facilities such as highways, bridges, and houses. For example, Oluremi et al. [6] used OMC and MDD to evaluate the compaction characteristics of cured lead-contaminated red soil; Rahman et al. [7] used OMC and MDD to study the engineering properties of roadbed soils; and Chen et al. [8] investigated the microstructure and hydraulic properties of roadbed soils using OMC and MDD.
The accurate prediction of OMC and MDD can improve the design accuracy, construction quality, and economic efficiency of geotechnical engineering. The Proctor compaction test can be used to determine OMC and MDD [9], but it is time-consuming and labor-intensive because multiple tests are required to plot the compaction curve [10]. Although this method provides reasonable results in many cases, it may be subject to a certain amount of error because it is limited by experimental conditions and sample size. Regression analysis is a statistical method that can be used to predict soil compaction parameters [11,12,13,14,15]. However, this method can only make preliminary predictions of soil MDD and OMC and has a large error [16,17,18,19,20]. Multiple linear regression allows the creation of empirical equations to predict the OMC and MDD of soils [21,22]. However, the reliability of such empirical equations may suffer when multiple factors are involved, and they can only reflect linear soil behavior, not complex nonlinear soil behavior. Therefore, more advanced methods are needed to predict OMC and MDD.
In recent years, with the development of artificial intelligence technology, some researchers have used it to predict the compaction parameters of soils, and it has been shown to perform well. Saikia et al. [23] established a prediction model for the compaction characteristics of fine-grained soil through logistic regression analysis. Hasnat et al. [24] applied Support Vector Machine (SVM) to develop a model for predicting the compaction parameters of soil. Zhu et al. [25] used the Support Vector Machine (SVM), Additive Regression Support Vector Machine (AD-SVM), and Imperialist Competitive Algorithm Support Vector Machine (ICA-SVM) to predict the compaction parameters of lateritic soils, and the comparison showed that the ICA-SVM algorithm outperformed the other two algorithms. Sinha et al. [16] used Artificial Neural Network (ANN) to develop a predictive model for the compaction parameters and permeability of soils. Khuntia et al. [12] used Artificial Neural Network (ANN), Least Squares Support Vector Machine (LS-SVM), and Multivariate Adaptive Regression Splines (MARS) to develop a prediction model for the compaction parameters of sandy soils, and the MARS model was the most accurate. Othman [26] constructed a prediction model for aggregate base materials using ANNs trained with different numbers of hidden layers, different numbers of hidden layer neurons, and three activation functions. The results showed that the hyperbolic tangent function (Tanh) is the most effective activation function and that the performance of the ANN deteriorates with an increase in the number of hidden layers or the number of neurons per hidden layer. Jalal et al. [27] used Gene Expression Programming (GEP) and Multi Expression Programming (MEP) to develop a prediction model for the compaction parameters of expansive soils.
In addition, the researchers conducted feature importance analysis, which showed that liquid limit (LL), plastic limit (PL), and sand content (S) affect OMC and MDD predictions [12,28].
Machine learning (ML) is an important branch of artificial intelligence. In this paper, ML is used to predict OMC and MDD. Compared with traditional methods, an ML workflow of data processing, model training, and model testing saves considerable time and labor and can achieve more accurate results. Compared with multiple linear regression, ML can handle large, nonlinear datasets and can be retrained as new data become available, adapting to engineering needs and to changes in the data. A well-trained machine learning model can generalize across a range of data and tasks and can still perform well on previously unseen data. These advantages make machine learning an important tool in various fields. However, few studies have used machine learning (especially Random Forest and Extreme Gradient Boosting Tree) to predict MDD and OMC. Therefore, further research on machine-learning-based prediction models for MDD and OMC is necessary.
Firstly, this paper builds MDD and OMC prediction models from 168 collected sets of soil sample data using four machine learning algorithms, namely Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting Tree (XGBoost), and Artificial Neural Network (ANN). Secondly, in order to improve the performance and generalization ability of the models, hyperparameter tuning and K-fold cross-validation are carried out. Then, the best models of the four algorithms are compared using the evaluation metrics to obtain the most effective and accurate final model for predicting OMC and MDD. Finally, feature importance analysis is used to investigate the impact of each feature on OMC and MDD and to improve the interpretability of the overall model.

2. Materials and Methods

In this paper, the soils are from a particular region of the country and the data are obtained from standard physical and compaction tests. When certain types of soils are lacking, the database can be supplemented and secondary development of the model can be performed at a relatively small computational cost. The ultimate goal is to form a large database covering multiple regions in multiple countries to train a large predictive model that can be used directly. As the first step of the goal, the main objective of this paper is to evaluate the feasibility of machine learning in predicting soil compaction parameters and to compare the advantages and disadvantages of different models. In terms of generalization performance, a large database is being collected and models are being trained.

2.1. Soil Database

In this research, 168 sets of soil sample data were collected. In the study by Gunaydin [17], nine different soil types were gathered from small dams constructed in the vicinity of Nigde in Turkey: high-plasticity inorganic clay (CH), medium-plasticity inorganic clay (CI), low-plasticity inorganic clay (CL), clayey gravel (GC), silty gravel (GM), high-compressibility inorganic silt (MH), medium-plasticity inorganic silt (MI), low-plasticity inorganic silt (ML), and sandy clay (SC). From these soils, 126 sets of data were obtained through the Proctor compaction test. In the study by Nagaraj et al. [29], different natural soil types were collected from different geologic locations in India, including high-plasticity inorganic clay (CH), medium-plasticity inorganic clay (CI), low-plasticity inorganic clay (CL), high-compressibility inorganic silt (MH), medium-plasticity inorganic silt (MI), low-plasticity inorganic silt (ML), sandy clay (SC), ML-CL, sandy silt (SM), and MI-CI, and 42 sets of data were obtained through the Proctor compaction test.
In this database, there are five input features: fine content (FC, the content of particles smaller than 0.075 mm, i.e., clay and silt), sand content (S), specific gravity (SG), liquid limit (LL), and plastic limit (PL). Table 1 shows that the soil samples in this database have a fine content of 13–98%, a sand content of 2–80%, a specific gravity of 2.58–2.85, a liquid limit of 23.1–115%, a plastic limit of 13.68–45.3%, an OMC of 7.6–36.8%, and an MDD of 12.6–20.51 kN/m3. Then, in order to evaluate the generalization ability of the model, 134 (80%) and 34 (20%) sets of soil sample data were randomly selected from the soil database as the training and test sets, respectively. The training set is used to train the model, and the test set, which is kept separate from the training set, is used to test and validate the performance of the model.
Randomized division is a data division method commonly used in machine learning. It can reduce assessment bias, improve model generalization, and ensure objectivity, accuracy, and reliability in model assessment.
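As an illustration of the randomized division described above, the following minimal sketch splits 168 sample indices into 134 training and 34 test indices. The function name `random_split`, the seed, and the use of Python's standard `random` module are assumptions made for this example; the source does not state which tool or seed was used.

```python
import random

def random_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test index lists.

    Hypothetical helper for illustration; the seed is an arbitrary assumption.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible random order
    n_test = round(n_samples * test_fraction)  # 168 * 0.2 = 33.6, rounded to 34
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

train_idx, test_idx = random_split(168)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, so different models can be compared on the same test samples.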

2.2. Statistical Analysis of Database

Pearson’s correlation coefficient is a common measure of the degree of linear correlation between parameters. In machine learning, calculating the correlation coefficient between pairs of features reveals linear relationships among them, which helps to avoid feeding redundant feature information into the model. In addition, calculating the correlation coefficient between each feature and the target variable allows features that are highly correlated with the target to be selected for constructing machine learning models. This helps to reduce the feature dimension and improve the explanatory ability and generalization performance of the model.
Figure 1 illustrates the distribution of the parameter densities and the Pearson correlation coefficients between the parameters, with the Pearson correlation coefficients denoted by r. The r within ±0.81 to ±1.0, ±0.61 to ±0.80, ±0.41 to ±0.60, ±0.21 to ±0.40, and ±0.0 to ±0.20 represents very strong, strong, moderate, weak, and irrelevant, respectively.
As shown in Figure 1, FC has a strong correlation with S (−0.7189) and MDD (−0.6151), and a moderate correlation with LL (0.4899) and OMC (0.5703). S has a moderate correlation with LL (−0.4442) and MDD (0.4414), and a weak-to-moderate correlation with OMC (−0.3997). There was no severe collinearity between FC and LL, PL, OMC, and MDD. Therefore, FC, S, LL, and PL can be used as input features to predict OMC and MDD.
SG is largely independent of the other parameters, which suggests that SG, as an input feature, contributes very little to predicting OMC and MDD. In addition, LL is strongly correlated with PL (0.7966), OMC (0.7990), and MDD (−0.7520). PL has a very strong correlation with OMC (0.8497), a strong correlation with MDD (−0.7901), and a weak correlation with FC (0.3869) and S (−0.2880). The results indicate that LL is collinear with PL and OMC.
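The correlation analysis above can be sketched as follows. This is a minimal illustration rather than the authors' code: `pearson_r` computes the standard Pearson coefficient, and `correlation_strength` maps |r| to the five verbal classes used in this section. Treating each class boundary as "greater than the lower band's upper limit" (for example, |r| > 0.80 is very strong) is an assumption, because the stated ranges leave small gaps between bands.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_strength(r):
    """Map |r| to the verbal classes of Section 2.2 (band edges are assumed)."""
    a = abs(r)
    if a > 0.80:
        return "very strong"
    if a > 0.60:
        return "strong"
    if a > 0.40:
        return "moderate"
    if a > 0.20:
        return "weak"
    return "irrelevant"

# Coefficients quoted in the text
print(correlation_strength(0.8497))   # PL vs. OMC
print(correlation_strength(-0.7189))  # FC vs. S
print(correlation_strength(-0.2880))  # PL vs. S
```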

2.3. Machine Learning Algorithms

2.3.1. Support Vector Machines

SVM is a classic model for the binary classification of data, which is based on the principle of separating data into two classes in an optimal way by means of a hyperplane [30]. SVM performs well on small- and medium-sized data samples and on nonlinear and high-dimensional problems, and it can be used to handle both regression and classification tasks [31]. As shown in Figure 2, the SVM hyperplane equation is
$$\omega^{T} x + b = 0 \tag{1}$$
where $\omega$ is the weight vector and $b$ is the offset.
The spacing between the two marginal hyperplanes on either side of the hyperplane is $2 / \lVert \omega \rVert$. Maximizing this spacing is equivalent to
$$\min_{\omega, b} \ \frac{1}{2} \lVert \omega \rVert^{2} \quad \text{s.t.} \quad y_i \left( \omega^{T} x_i + b \right) \ge 1, \quad i = 1, 2, \dots, n \tag{2}$$
where $x_i$ is the input variable and $y_i$ is the output indicator.
The maximum margin problem is then converted to a simpler dual problem using the Lagrange optimization method, which can be used to find the unique optimal solutions $\omega^{*}$ and $b^{*}$. The Lagrange optimization function is
$$L(\omega, b, \alpha) = \frac{1}{2} \lVert \omega \rVert^{2} - \sum_{i=1}^{n} \alpha_i \left[ y_i \left( \omega^{T} x_i + b \right) - 1 \right] \tag{3}$$
where $\alpha$ is the Lagrange multiplier and $\alpha_i \ge 0$.
Substituting the optimal solution gives the separation decision function
$$f(x) = \operatorname{sign}\left( \omega \cdot x + b \right) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right) \tag{4}$$
where $K(x, z)$ is the kernel function evaluated for two input vectors $x$ and $z$.
The SVM built-in kernel function expressions are as follows:
Linear kernel function:
$$K(x, z) = x^{T} z \tag{5}$$
Gaussian kernel function:
$$K(x, z) = \exp\left( -\frac{\lVert x - z \rVert^{2}}{2 \sigma^{2}} \right) \tag{6}$$
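The two kernel functions can be written out directly, as in the minimal sketch below. The function names and the default bandwidth `sigma=1.0` are illustrative assumptions; the source does not report the kernel parameters used.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the inner product x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel: exp(-||x - z||^2 / (2 * sigma^2)).

    sigma is the kernel width; the default of 1.0 is an assumption.
    """
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x = [1.0, 2.0]
z = [3.0, 4.0]
print(linear_kernel(x, z))    # 1*3 + 2*4 = 11.0
print(gaussian_kernel(x, x))  # identical points give 1.0
```

The Gaussian kernel depends only on the distance between the two points, so it decays from 1 toward 0 as the points move apart, which is what allows a nonlinear decision function in the original feature space.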

2.3.2. Artificial Neural Network

ANN is an information processing model that resembles the structure and function of neurons in the brain and consists of a large number of interconnected neurons [32]. As shown in Figure 3, ANN is composed of the input layer, hidden layer, and output layer. The role of the input layer is to accept the input data and pass them to the hidden layer; the role of the hidden layer is to receive the data from the input layer and transform them during training to build the model; and the role of the output layer is to receive the data from the hidden layer and calculate the final output value [33].
ANN has many advantages, such as dealing with complex nonlinear data, utilizing parallel distributed training, adaptively adjusting the weights and parameters of the network, and learning deeper features. However, ANN also has many problems. For instance, it is easy to fall into the local optimum. In addition, too many hyperparameters lead to difficulty in tuning, and it is quite sensitive to noise and outliers. Therefore, before choosing an ANN model, it is crucial to consider its advantages and disadvantages and weigh them according to the needs of the problem.
BP (Back Propagation) neural network is one of the most used neural network models in engineering prediction [34]. The BP algorithm is used to backpropagate the error between the output value of the forward propagation and the actual value, and it continuously adjusts the weights and thresholds through multiple iterations and gradient descent methods to minimize the error.
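The forward and backward passes described above can be sketched for a network with one hidden layer. This is a simplified illustration, not the authors' implementation: the tanh activation, the learning rate, the per-sample (stochastic) gradient descent updates, the toy target y = x², and the name `train_bp` are all assumptions chosen for the example.

```python
import math
import random

def train_bp(xs, ys, n_hidden=7, lr=0.05, epochs=500, seed=0):
    """Train a one-hidden-layer network by backpropagation.

    Returns the mean squared error recorded in each epoch.
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_in = len(xs[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for x, y in zip(xs, ys):
            # Forward propagation: input -> tanh hidden layer -> linear output
            h = [math.tanh(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
                 for j in range(n_hidden)]
            y_hat = sum(w2[j] * h[j] for j in range(n_hidden)) + b2
            err = y_hat - y
            sq_err += err * err
            # Backward propagation for the loss 0.5 * err^2
            for j in range(n_hidden):
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)  # uses w2[j] before its update
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
        history.append(sq_err / len(xs))
    return history

# Toy data (assumed): learn y = x^2 on 21 points in [-1, 1]
xs = [[i / 10.0] for i in range(-10, 11)]
ys = [x[0] ** 2 for x in xs]
history = train_bp(xs, ys)
print(history[0], history[-1])
```

Comparing the first and last entries of `history` shows whether the iterative weight and threshold adjustments are reducing the error.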

2.3.3. Random Forest

RF belongs to the bagging family of ensemble learning algorithms, and its basic unit is the decision tree. As shown in Figure 4, RF utilizes multiple decision trees for prediction or classification, and the final result is the average or voting result of all decision trees. In constructing each decision tree, RF employs two randomization methods: bagging and random feature selection. Bagging draws a bootstrap sample (sampling with replacement) of the training data for each tree. Random feature selection randomly selects a subset of features from the feature set for tree node partitioning, which can reduce the correlation between decision trees and further improve the performance of the model [35].
The general steps of RF are as follows:
(1)
Bag a certain amount of data from the original database as a data subset. Then, randomly select a feature subset from the original feature set.
(2)
Build a decision tree for the data and feature subsets. At each tree node, select the optimal features from the feature subset for splitting.
(3)
Repeat steps (1) and (2) to construct multiple decision trees. Each decision tree is constructed from a different subset of data and a subset of features.
(4)
For classification problems, the results of all decision trees are voted to obtain the final classification result. For regression problems, the results of all decision trees are averaged to obtain the final value.
RF is considered as an ensemble algorithm with the advantages of high accuracy, resistance to overfitting, parallelized training, flexibility, and ease of use. Therefore, RF is widely used in various areas to solve classification and prediction problems [36].
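The four steps above can be sketched in a deliberately simplified form. In this illustration each "tree" is a single-split regression stump, so it is not a full Random Forest and not the authors' model; the names `fit_stump`, `fit_forest`, and `predict_forest`, the toy data, and all parameter values are assumptions made for the example.

```python
import random

def fit_stump(rows, targets, feature_ids):
    """Fit a one-split regression tree using only the given feature subset."""
    best = None
    for f in feature_ids:
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            mean_l = sum(left) / len(left)
            mean_r = sum(right) / len(right)
            sse = (sum((t - mean_l) ** 2 for t in left)
                   + sum((t - mean_r) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, mean_l, mean_r)
    if best is None:  # every candidate feature is constant in this sample
        mean = sum(targets) / len(targets)
        return (feature_ids[0], float("inf"), mean, mean)
    return best[1:]

def fit_forest(rows, targets, n_trees=25, n_sub_features=2, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):                                   # step (3): repeat
        idx = [rng.randrange(n) for _ in range(n)]             # step (1): bootstrap sample
        feats = rng.sample(range(n_features), n_sub_features)  # step (1): feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))  # step (2): build tree
    return forest

def predict_forest(forest, row):
    """Step (4) for regression: average the predictions of all trees."""
    preds = [mean_l if row[f] <= thr else mean_r for f, thr, mean_l, mean_r in forest]
    return sum(preds) / len(preds)

# Toy data (assumed): the target depends only on feature 0
rows = [[float(i), float(i % 3), float(i % 2)] for i in range(20)]
targets = [0.0 if i < 10 else 10.0 for i in range(20)]
forest = fit_forest(rows, targets)
low = predict_forest(forest, [2.0, 0.0, 0.0])
high = predict_forest(forest, [17.0, 0.0, 0.0])
print(low, high)
```

Trees that happen to receive only uninformative features predict poorly on their own, but averaging over many trees still separates the low and high regions, which is the variance-reduction effect bagging relies on.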

2.3.4. Extreme Gradient Boosting

XGBoost is improved from the Gradient Boosting Decision Tree (GBDT) and belongs to the boosting algorithm in ensemble learning. As shown in Figure 5, the boosting algorithm is used to integrate many weak classifiers together to form a strong classifier, and its core idea is to correct the error of the previous model each time a new tree model is trained, so as to improve the overall performance of the model [37].
XGBoost is an additive model consisting of k tree models. The tree model to be trained in the $t$th iteration is $f_t(x)$, so that
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{7}$$
where $\hat{y}_i^{(t)}$ is the prediction result of sample $i$ after the $t$th iteration, $\hat{y}_i^{(t-1)}$ is the prediction result of the first $t-1$ trees, and $f_t(x_i)$ is the model of the $t$th tree.
The loss function can be represented by the predicted value $\hat{y}_i$ and the true value $y_i$:
$$L = \sum_{i=1}^{n} l\left( y_i, \hat{y}_i \right) \tag{8}$$
where $n$ is the number of samples.
The objective function is composed of the loss function $L$ of the model and the regularization term, and so the objective function is
$$Obj = \sum_{i=1}^{n} l\left( y_i, \hat{y}_i \right) + \sum_{i=1}^{t} \Omega\left( f_i \right) \tag{9}$$
where $\sum_{i=1}^{t} \Omega(f_i)$ is the regularization term.
Substituting Equation (7) into the objective function transforms it into
$$Obj^{(t)} = \sum_{i=1}^{n} l\left( y_i, \hat{y}_i^{(t-1)} + f_t(x_i) \right) + \sum_{i=1}^{t} \Omega\left( f_i \right) \tag{10}$$
The second-order Taylor expansion formula is
$$f(x + \Delta x) \approx f(x) + f'(x) \Delta x + \frac{1}{2} f''(x) \Delta x^{2} \tag{11}$$
Definition:
$$g_i = \partial_{\hat{y}^{(t-1)}} \, l\left( y_i, \hat{y}^{(t-1)} \right), \qquad h_i = \partial^{2}_{\hat{y}^{(t-1)}} \, l\left( y_i, \hat{y}^{(t-1)} \right) \tag{12}$$
A second-order Taylor expansion of the loss function gives the objective function as
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left( y_i, \hat{y}_i^{(t-1)} \right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \sum_{i=1}^{t} \Omega\left( f_i \right) \tag{13}$$
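As a worked instance of the gradient terms, assume a squared-error loss $l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$. The source does not state which loss was used, so this choice and the numeric values below are assumptions for illustration only.

```latex
% Assumed loss (not specified in the source): l(y, \hat{y}) = \tfrac{1}{2}(y - \hat{y})^2
g_i = \frac{\partial}{\partial \hat{y}^{(t-1)}} \, \tfrac{1}{2}\bigl(y_i - \hat{y}_i^{(t-1)}\bigr)^2
    = \hat{y}_i^{(t-1)} - y_i ,
\qquad
h_i = \frac{\partial^2}{\partial \bigl(\hat{y}^{(t-1)}\bigr)^2} \, \tfrac{1}{2}\bigl(y_i - \hat{y}_i^{(t-1)}\bigr)^2
    = 1

% Numeric instance: y_i = 20, \hat{y}_i^{(t-1)} = 18 gives g_i = -2 and h_i = 1, so the
% per-sample term of the expanded objective is
l + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)
    = 2 - 2 f_t(x_i) + \tfrac{1}{2} f_t^2(x_i)
    = \tfrac{1}{2}\bigl(20 - 18 - f_t(x_i)\bigr)^2
```

For this loss the second-order expansion is exact, and the per-sample term is smallest when the new tree outputs the current residual, $f_t(x_i) = 2$, which matches the idea that each new tree corrects the error of the previous model.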
On the one hand, XGBoost improves the overfitting resistance and generalization ability of the models by adding regularization terms to the loss function. On the other hand, the introduction of the second-order Taylor expansion makes the model more accurate and efficient in the training and optimization process. The advantages of XGBoost include a more accurate approximation of the loss function, faster convergence, more stable training, a more optimized splitting strategy, and better handling of nonlinear data [38,39,40]. In addition, XGBoost supports parallel processing to increase the speed of model training and provides feature importance scores to study the impact of features on model results. XGBoost has become one of the most widely used algorithms today due to its strong performance in most of these aspects. The risk of overfitting can be reduced by limiting the depth of the tree or setting the minimum number of samples for leaf nodes, which prevents the model from generating overly complex trees.

2.4. Model Performance Evaluation Metrics

Prediction models in machine learning are generally evaluated with metrics such as Root Mean Square Error (RMSE) and the coefficient of determination (R2). For a perfect model, RMSE is equal to 0; the smaller the RMSE, the more accurate the model. For a reasonably fitted model, R2 ranges from 0 to 1, and the closer its value is to 1, the better the model fits the data; a model that predicts worse than the mean of the observations can yield a negative R2. The mathematical expressions are as follows:
$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2} } \tag{14}$$
$$R^{2} = 1 - \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2} }{ \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^{2} } \tag{15}$$
where $n$ is the number of samples, $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean value.
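The two metrics can be computed directly from their definitions, as in the short sketch below. The function names and the four-sample example values are made up for illustration and are not taken from the paper's data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only (not from the soil database)
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 16.0]
print(rmse(y_true, y_pred))       # sqrt(2 / 4), about 0.7071
print(r_squared(y_true, y_pred))  # 1 - 2 / 20 = 0.9
```

Predicting the mean of the observations for every sample gives R2 = 0, which is why R2 is often read as the improvement over that baseline.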

3. Results

3.1. Support Vector Machines

Since there are different kernel functions built into the SVM, the linear kernel and Gaussian kernel function are adopted for the prediction of OMC and MDD to study the effect of different kernel functions on the SVM. The prediction results of OMC and MDD by linear kernel and Gaussian kernel function models are shown in Table 2.
From Table 2 and Figure 6, it can be seen that the Gaussian kernel function model outperforms the linear kernel in predicting OMC, with RMSE = 1.6308% and R2 = 0.8847. Similarly, the Gaussian kernel function model outperforms the linear kernel in predicting MDD, with RMSE = 0.5039 kN/m3 and R2 = 0.8831. Therefore, in the prediction of compaction parameters by the SVM, the performance using the Gaussian kernel function is superior.
It can be observed from the results and previous research that when the data present a complex nonlinear structure, a Gaussian kernel can be used to obtain good model performance, while a linear kernel can be chosen when the data are linearly separable.

3.2. Artificial Neural Network

A BP neural network is used to build the OMC and MDD prediction models. The architecture of an ANN has to be chosen by the user. On the one hand, a larger number of hidden layers helps the network to solve complex nonlinear problems, but too many hidden layers may lead to overfitting. On the other hand, a larger number of neurons increases the expressive and learning ability of the network, but an unsuitable number of neurons may lead to overfitting or underfitting. Therefore, it is important to determine the appropriate number of hidden layers and neurons. According to existing research, ANNs with one hidden layer have been able to meet most requirements [41,42,43,44,45]. Therefore, in this paper, an ANN with one hidden layer is used to study the effect of the number of neurons on the prediction model, and the results are shown in Table 3.
As shown in Figure 7, in the OMC prediction model, as the number of neurons increases, the RMSE of the training set first decreases and then increases, and then decreases and increases again, while the RMSE of the test set decreases and then increases. Therefore, the model with 7 neurons is the best OMC prediction model. In the MDD prediction model, the RMSE of both the training set and the test set decreases and then increases. Therefore, the model with 7 neurons is also the best MDD prediction model. From Table 3 and Figure 8, in the best OMC model, RMSE = 1.6302% and R2 = 0.8848. In the best MDD model, RMSE = 0.5727 kN/m3 and R2 = 0.8490.

3.3. Random Forest

In the construction of the prediction model with the RF algorithm, maximum tree depth, the minimum number of samples required to split a node, minimum samples required to be at a leaf node, and the number of trees in the forest have a large impact on the RF model. In this paper, the above four hyperparameters are tuned, and the rest of the hyperparameters are adopted as the default values. In the tuning process, the Bayesian optimization method is applied.
As shown in Table 4 and Figure 9, in the best prediction model for OMC, RMSE = 1.3605% and R2 = 0.9198. In the best prediction model for MDD, RMSE = 0.5664 kN/m3 and R2 = 0.8523. The results show that RF has an excellent performance in predicting OMC, with an R2 of more than 0.90, while for MDD, although the R2 reaches about 0.85, it is still about 0.07 lower than that for OMC.

3.4. Extreme Gradient Boosting

XGBoost has been shown to be an ensemble algorithm with excellent performance. However, it has more hyperparameters that need to be tuned, such as the number of trees, the maximum tree depth, the regularization parameter, and the learning rate. As with Random Forest, Bayesian optimization is employed for tuning. Compared with grid search and other tuning methods, Bayesian optimization can search for a near-optimal hyperparameter combination with high efficiency [38].
As shown in Table 5 and Figure 10, in the best prediction model, RMSE = 1.3290% and R2 = 0.9234 for OMC, and RMSE = 0.4447 kN/m3 and R2 = 0.9089 for MDD. The results show that XGBoost has a good performance in predicting both OMC and MDD, and its R2 can reach above 0.90.
XGBoost has superior model performance and generalization ability whether it deals with classification problems or prediction problems. As a result, XGBoost has become one of the preferred algorithms for problem-solving in the field of machine learning. However, XGBoost has many built-in hyperparameters and still needs to be carefully tuned for the problem to be solved to obtain the best model performance.

3.5. Feature Importance Analysis

Feature importance analysis is a key step in machine learning, which can help us to assess the degree of contribution of each feature to the prediction results and to understand the decision-making way of the model. And it can also help us to identify the reasons for model performance degradation through the importance of each feature and to find the potential problems in the model and optimize and improve it accordingly. Therefore, feature importance analysis plays a very important role in model evaluation, understanding, and optimization.
RF feature importance is derived by calculating the importance of each feature across all decision trees. XGBoost feature importance is derived from how often each feature is used as a split feature across all decision trees. Although the two importance measures are based on different criteria, both RF and XGBoost are ensemble algorithms with good stability and robustness. Therefore, RF and XGBoost are chosen for feature importance analysis in this paper.
As can be seen from Figure 11, in RF feature importance, the feature that affects OMC the most is PL with an impact factor of 0.598, followed by LL with 0.294, and the lesser impacts are FC, Sand, and SG. In XGBoost feature importance, the feature that affects OMC the most is LL with an impact factor of 0.476, followed by PL with 0.297, and the lesser impacts are FC, Sand, and SG. In summary, the two features that affect OMC the most are PL and LL, which is consistent with geotechnical engineering experience.
As can be seen from Figure 12, in RF feature importance, the two features that affect the MDD the most are LL and PL, whose influence factors are 0.403 and 0.391, respectively, and the lesser influences are FC, SG, and Sand. In the feature importance analysis of XGBoost, the feature that affects the MDD the most is PL, with an influence factor of 0.466, followed by LL with 0.25, and the lesser influences are FC, Sand, and SG. Similarly, the two features that affect the MDD the most are PL and LL.
The above findings coincide with those found in the literature [12,28]. Feature importance analysis is meaningful in engineering to guide feature selection, interpret model predictions, identify data anomalies, guide decision making, etc., thus improving the efficiency, performance, and credibility of engineering.

4. Discussion

In this paper, different prediction models are trained for different machine learning algorithms, different kernel functions, and different numbers of neurons, and all models are evaluated for model performance by 5-fold cross validation to obtain the best performing prediction model. The k-fold cross-validation is a commonly used method for model evaluation, in which the database is first divided into k subsets, then each subset is made a test set once, and the remaining k-1 subsets are used as the training set. Finally, the results of multiple evaluations are averaged, thus eliminating the effect of unbalanced database partitioning. The results of the best prediction models in SVM, RF, XGBoost, and ANN are shown in Table 6.
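The k-fold procedure described above can be sketched as follows for the 168-sample database with k = 5. This is an illustration under stated assumptions rather than the authors' code: the helper names `k_fold_indices` and `cross_validate`, the seed, and the placeholder scoring function are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is in the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, score_fold, k=5, seed=0):
    """Average a user-supplied score over the k folds.

    score_fold(train_idx, test_idx) is a hypothetical callback that would
    train a model on train_idx and return its metric on test_idx.
    """
    scores = [score_fold(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(168, k=5))
print([len(te) for _, te in splits])  # fold sizes: [34, 34, 34, 33, 33]

# Placeholder score: the size of the test fold, just to show the averaging step
mean_fold_size = cross_validate(168, lambda tr, te: len(te))
print(mean_fold_size)
```

Because every sample is used for testing exactly once, the averaged score depends less on any single partition of the database than a one-off split does.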
As shown in Figure 13, XGBoost has excellent performance in predicting both OMC and MDD, with its R2 reaching above 0.90, and RF also has good performance in predicting OMC (RMSE = 1.3605%, R2 = 0.9198), which suggests that ensemble algorithms possess superior model performance and generalization ability. However, RF performs poorly in predicting MDD, which may be because the best combination of hyperparameters was not found. SVM has moderate performance in predicting both OMC and MDD, which may be because SVM remains a single learner rather than an ensemble. ANN shows the worst model performance, whether it is predicting OMC or MDD. On the one hand, this may be related to the sample size being too small. On the other hand, the ANN was not paired with a global optimization algorithm to search for better weights. Such algorithms are general-purpose global search methods that are well suited to parallel processing and can, in principle, find an optimal or near-optimal solution within a limited time. Commonly used optimization algorithms include Genetic Algorithm (GA), Simulated Annealing (SA), Particle Swarm Optimization (PSO), and Ant Colony Algorithm (ACA).
Figure 14 and Figure 15 show the actual and predicted values of OMC and MDD for the test set in XGBoost, respectively. From the figures, it can be seen that the overlap between the real and predicted values is very high when the model has excellent performance. Even though some of the data are poorly predicted, the error with the true values is around 4%, which is an acceptable range. This shows that machine learning algorithms can be applied to predict OMC and MDD with great efficiency and advantage.

5. Conclusions

The prediction of soil compaction parameters based on machine learning algorithms is a promising option; however, this method requires knowledge of the characteristic parameters of the soil. In this paper, OMC and MDD prediction models based on the machine learning algorithms SVM, ANN, RF, and XGBoost are developed and compared using collected soil sample data. The following conclusions can be drawn.
(1) When dealing with large and complex data, SVM with a Gaussian kernel function predicts OMC and MDD better than SVM with a linear kernel function.
(2) In ANN modeling, model performance varies considerably with the number of neurons. For a fixed number of hidden layers, the number of neurons has a strong impact on the prediction of OMC and MDD.
(3) Among the four machine learning algorithms, XGBoost is the best model for predicting OMC and MDD, while RF also performs well in predicting OMC. This shows that ensemble algorithms have better model performance and generalization ability than the other algorithms.
(4) ANN is less successful in predicting both OMC and MDD. This suggests that ANN requires a large, high-quality database paired with an optimization algorithm to achieve higher model performance.
(5) Finally, the feature importance output by RF and XGBoost was analyzed. The results show that PL and LL are the two features that affect the predicted OMC and MDD the most, which agrees with practical experience.
This research shows that machine learning, especially ensemble learning, can be applied to predict the compaction parameters of soil-filling materials. With this method, the amount of laboratory work can be significantly reduced and the efficiency of design optimization significantly improved, which is meaningful for soil resource utilization in engineering construction.

Author Contributions

Conceptualization, B.L. and Y.W.; methodology, Z.Y. and K.N.; software, Z.Y. and K.N.; validation, Z.Y.; data curation, K.N.; writing—original draft preparation, B.L. and Y.W.; writing—review and editing, B.L. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the IOT Technology Application Transportation Industry R & D Center (Hangzhou) (No. 2023-02) and the Suzhou Science and Technology Plan Project (No. 2022SS57).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Bingyi Li was employed by the company ZhongYifeng Construction Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ren, X.C.; Lai, Y.M.; Zhang, F.Y.; Hu, K. Test method for determination of optimum moisture content of soil and maximum dry density. KSCE J. Civ. Eng. 2015, 19, 2061–2066. [Google Scholar] [CrossRef]
  2. Du, Y.J.; Hong, S.L.; Tao, W.; Hao, W. Experimental study on compaction characteristics of coarse-grained soil with discontinuous gradation. Chin. J. Geotech. Eng. 2019, 41, 2142–2148. [Google Scholar] [CrossRef]
  3. Miao, M.G.; Ning, L.; Chao, D.S.; Hui, C.Z.; Jubert, P. Experimental investigation of microscopic deformation mechanism of unsaturated compacted loess under hydraulic coupling conditions. Geotech. Mech. 2021, 42, 2437–2448. [Google Scholar] [CrossRef]
  4. Gang, W.; Yi, L.W.; Xing, W.; Ming, J.Z. Permeability variation of compacted clay during triaxial compression. Geomechanics 2020, 41, 32–38. [Google Scholar] [CrossRef]
  5. Rimbarngaye, A.; Mwero, J.N.; Ronoh, E.K. Effect of gum Arabic content on maximum dry density and optimum moisture content of laterite soil. Heliyon 2022, 8, 553. [Google Scholar] [CrossRef] [PubMed]
  6. Oluremi, J.R.; Ishola, K. Compaction and strength characteristics of lead contaminated lateritic soil treated with eco-friendly biopolymer for use as road foundation material. Hybrid Adv. 2024, 5, 100158. [Google Scholar] [CrossRef]
  7. Rahman, I.U.; Raheel, M.; Khawaja, M.W.A.; Khan, R.; Li, J.; Khan, A.; Khan, M.T. Characterization of engineering properties of weak subgrade soils with different pozzolanic & cementitious additives. Case Stud. Constr. Mater. 2021, 15, e00676. [Google Scholar] [CrossRef]
  8. Chen, R.P.; Qi, S.; Wang, H.L.; Cui, Y.J. Microstructure and hydraulic properties of coarse-grained subgrade soil used in high-speed railway at various compaction degrees. J. Mater. Civ. Eng. 2019, 31, 04019301. [Google Scholar] [CrossRef]
  9. Wang, H.L.; Yin, Z.Y. High performance prediction of soil compac-tion parameters using multi expression programming. Eng. Geol. 2020, 276, 105758. [Google Scholar] [CrossRef]
  10. Verma, G.; Kumar, B. Prediction of compaction parameters for fine-grained and coarse-grained soils: A review. Int. J. Geotech. Eng. 2020, 14, 970–977. [Google Scholar] [CrossRef]
  11. Farooq, K.; Khalid, U.; Mujtaba, H. Prediction of Compaction Characteristics of Fine-Grained Soils Using Consistency Limits. Arab. J. Sci. Eng. 2016, 41, 1319–1328. [Google Scholar] [CrossRef]
  12. Khuntia, S.; Mujtaba, H.; Patra, C.; Farooq, K.; Sivakugan, N.; Das, B.M. Prediction of compaction parameters of coarse grained soil using multivariate adaptive regression splines (MARS). Int. J. Geotech. Eng. 2015, 9, 79–88. [Google Scholar] [CrossRef]
  13. Hohn, A.V.; Leme, R.F.; Moura, T.E.; Llanque Ayala, G.R. Empirical models to predict compaction parameters for soils in the state of ceará, northeastern Brazil. Ingeniería e Investigación. 2022, 42, e86328. [Google Scholar] [CrossRef]
  14. Arama, Z.A.; Gençdal, H.B. Simple Regression Models to Estimate the Standard and Modified Proctor Characteristics of Specific Compacted Fine-Grained Soils. In Proceedings of the 7th World Congress on Civil, Structural, and Environmental Engineering, Istanbul, Turkey, 10–12 April 2022; pp. 1–9. [Google Scholar] [CrossRef]
  15. Khalid, U.; Rehman, Z.U. Evaluation of compaction parameters of fine-grained soils using standard and modified efforts. Int. J. Geo-Eng. 2018, 9, 15. [Google Scholar] [CrossRef]
  16. Sinha, S.K.; Wang, M.C. Artificial neural network prediction models for soil compaction and permeability. Geotech. Geol. Eng. 2008, 26, 47–64. [Google Scholar] [CrossRef]
  17. Günaydın, O.J.E.G. Estimation of soil compaction parameters by using statistical analyses and artificial neural networks. Environ. Geol. 2009, 57, 203–215. [Google Scholar] [CrossRef]
  18. Hussain, A.; Atalar, C. Estimation of compaction characteristics of soils using Atterberg limits. IOP Conf. Ser. Mater. Sci. Eng. 2020, 800, 012024. [Google Scholar] [CrossRef]
  19. Ratnam, U.V.; Prasad, K.N. Prediction of compaction and compressibility characteristics of compacted soils. Int. J. Appl. Eng. Res. 2019, 14, 621–632. [Google Scholar]
  20. Yousif, A.A.; Mohamed, I.A. Prediction of compaction parameters from soil index properties case study: Dam complex of upper atbara project. Am. J. Pure Appl. Sci. 2022, 4, 01–09. [Google Scholar] [CrossRef]
  21. Farooq, K.; Mujtaba, H. Prediction of California Bearing Ratio (CBR) and Compaction Characteristics of Granular Soil. Acta Geotech. Slov. 2017, 14, 63–72. [Google Scholar]
  22. Lubis, A.S.; Muis, Z.A.; Hastuty, I.P.; Siregar, I.M. Estimation of compaction parameters based on soil classification. IOP Conf. Ser. Mater. Sci. Eng. 2018, 306, 012005. [Google Scholar] [CrossRef]
  23. Saikia, A.; Baruah, D.; Das, K.; Rabha, H.J.; Dutta, A.; Saharia, A. Predicting compaction characteristics of fine-grained soils in terms of Atterberg limits. Int. J. Geosynth. Groun. Eng. 2017, 3, 18. [Google Scholar] [CrossRef]
  24. Hasnat, A.; Hasan, M.M.; Islam, M.R.; Alim, M.A. Prediction of compaction parameters of soil using support vector regression. Curr. Trends Civ. Struct. Eng. 2019, 4, 1–7. [Google Scholar] [CrossRef]
  25. Zhu, P.; Zhu, Y.; Zhang, P. Comparison of SVR models for predicting the compaction properties of lateritic soils as novel hybrid methods. Eng. Res. Express 2022, 4, 035038. [Google Scholar] [CrossRef]
  26. Othman, K. Deep neural network models for the prediction of the aggregate base course compaction parameters. Designs 2021, 5, 78. [Google Scholar] [CrossRef]
  27. Jalal, F.E.; Xu, Y.; Iqbal, M.; Jamhiri, B.; Javed, M.F. Predicting the compaction characteristics of expansive soils using two genetic programming-based algorithms. Transp. Geotech. 2021, 30, 100608. [Google Scholar] [CrossRef]
  28. Ardakani, A.; Kordnaeij, A. Soil compaction parameters prediction using GMDH-type neural network and genetic algorithm. Eur. J. Environ. Civ. Eng. 2019, 23, 449–462. [Google Scholar] [CrossRef]
  29. Nagaraj, H.B.; Reesha, B.; Sravan, M.V.; Suresh, M.R. Correlation of compaction characteristics of natural soils with modified plastic limit. Transp. Geotech. 2015, 2, 65–77. [Google Scholar] [CrossRef]
  30. Tenpe, A.R.; Patel, A. Utilization of support vector models and gene expression programming for soil strength modeling. Arab. J. Sci. Eng. 2020, 45, 4301–4319. [Google Scholar] [CrossRef]
  31. Tabarsa, A.; Latifi, N.; Osouli, A.; Bagheri, Y. Unconfined compressive strength prediction of soils stabilized using artificial neural networks and support vector machines. Front. Struct. Civ. Eng. 2021, 15, 520–536. [Google Scholar] [CrossRef]
  32. Armaghani, D.J.; Mirzaei, F.; Shariati, M.; Trung, N.T.; Shariati, M.; Trnavac, D. Hybrid ANN-based techniques in predicting cohesion of sandy-soil combined with fiber. Geomech. Eng. 2020, 20, 191–205. [Google Scholar] [CrossRef]
  33. Nguyen, T.A.; Ly, H.B.; Jaafari, A.; Pham, B.T. Estimation offriction capacity of driven piles in clay using. Vietnam J. Earth Sci. 2020, 42, 265–275. [Google Scholar] [CrossRef]
  34. Nguyen, T.A.; Ly, H.B.; Pham, B.T. Backpropagation neural network-based machine learning model for prediction of soil friction angle. Math. Probl. Eng. 2020, 2020, 8845768. [Google Scholar] [CrossRef]
  35. Zhang, R.; Li, Y.; Goh, A.T.; Zhang, W.; Chen, Z. Analysis of ground surface settlement in anisotropic clays using extreme gradient boosting and random forest regression models. J. Rock Mech. Geotech. Eng. 2021, 13, 1478–1484. [Google Scholar] [CrossRef]
  36. Nejad, A.S.; Güler, E.; Özturan, M. Evaluation of liquefaction potential using random forest method and shear wave velocity results. In Proceedings of the 2018 International Conference on Applied Mathematics & Computational Science (ICAMCS NET), Budapest, Hungary, 6–8 October 2018; pp. 23–233. [Google Scholar] [CrossRef]
  37. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 13 August 2016; pp. 785–794. Available online: https://dl.acm.org/doi/abs/10.1145/2939672.2939785 (accessed on 10 November 2023).
  38. Zhang, W.; Wu, C.; Zhong, H.; Li, Y.; Wang, L. Prediction of undrained shear strength using extreme gradient boosting and random forest based on Bayesian optimization. Geosci. Front. 2021, 12, 469–477. [Google Scholar] [CrossRef]
  39. Zhang, W.; Zhang, R.; Wu, C.; Goh, A.T.; Wang, L. Assessment of basal heave stability for braced excavations in anisotropic clay using extreme gradient boosting and random forest regression. Undergr. Space. 2020, 7, 233–241. [Google Scholar] [CrossRef]
  40. Ikeagwuani, C.C. Estimation of modified expansive soil CBR with multivariate adaptive regression splines, random forest and gradient boosting machine. Innov. Infrastruct. Solut. 2021, 6, 199. [Google Scholar] [CrossRef]
  41. Orbanić, P.; Fajdiga, M. A neural network approach to describing the fretting fatigue in aluminium-steel couplings. Int. J. Fatigue 2003, 25, 201–207. [Google Scholar] [CrossRef]
  42. Naderpour, H.; Rafiean, A.H.; Fakharian, P. Compressive strength prediction of environmentally friendly concrete using artificial neural networks. J. Build. Eng. 2018, 16, 213–219. [Google Scholar] [CrossRef]
  43. Kaveh, A.; Eskandari, A.; Movasat, M. Buckling resistance prediction of high-strength steel columns using metaheuristic-trained artificial neural networks. Structures 2023, 56, 104853. [Google Scholar] [CrossRef]
  44. Taha, O.M.E.; Majeed, Z.H.; Ahmed, S.M. Artificial neural network prediction models for maximum dry density and optimum moisture content of stabilized soils. Transp. Infrastruct. Geotechnol. 2018, 5, 146–168. [Google Scholar] [CrossRef]
  45. Khatti, J.; Grover, K.S. Prediction of compaction parameters of compacted soil using LSSVM, LSTM, LSBoostRF, and ANN. Innov. Infrastruct. Solut. 2023, 8, 76. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1. Detail of Pearson’s correlation coefficient/matrix of soil database. FC fine content; S sand content; SG specific gravity; LL liquid limit; PL plastic limit; OMC optimum moisture content; MDD maximum dry density; r Pearson correlation coefficient; R2 coefficient of determination.
Figure 2. Schematic diagram of SVM.
Figure 3. Schematic diagram of ANN.
Figure 4. Schematic diagram of RF.
Figure 5. Schematic diagram of XGBoost.
Figure 6. Regression plots between actual and predicted values with SVM (Gaussian kernel function); (a) OMC; (b) MDD.
Figure 7. Effect of the number of neurons with ANN; (a) OMC; (b) MDD.
Figure 8. Regression plots between actual and predicted values with ANN; (a) OMC; (b) MDD.
Figure 9. Regression plots between actual and predicted values with RF; (a) OMC; (b) MDD.
Figure 10. Regression plots between actual and predicted values with XGBoost; (a) OMC; (b) MDD.
Figure 11. Feature importance of OMC; (a) RF; (b) XGBoost.
Figure 12. Feature importance of MDD; (a) RF; (b) XGBoost.
Figure 13. Comparison of RMSE for the test set in different models.
Figure 14. The actual and predicted OMC results for the test set in XGBoost.
Figure 15. The actual and predicted MDD results for the test set in XGBoost.
Table 1. Descriptive statistics of soil database.
| Statistics | Fine Content (%) | Sand Content (%) | Specific Gravity (-) | Liquid Limit (%) | Plastic Limit (%) | Optimum Moisture Content (%) | Maximum Dry Density (kN/m3) |
|---|---|---|---|---|---|---|---|
| Mean | 53.55 | 37.27 | 2.7 | 41.91 | 23.01 | 17.55 | 17.12 |
| Standard deviation | 17.05 | 13.87 | 0.06 | 12.4 | 5.93 | 4.82 | 1.42 |
| Minimum | 13 | 2 | 2.58 | 23.1 | 13.68 | 7.6 | 12.6 |
| Maximum | 98 | 80 | 2.85 | 115 | 45.3 | 36.8 | 20.51 |
Table 2. Different kernel function performance results of SVM.
| Kernel Function | Model | Training Set RMSE | Training Set R2 | Test Set RMSE | Test Set R2 |
|---|---|---|---|---|---|
| Linear kernel | OMC | 2.1127 | 0.8061 | 1.9083 | 0.8421 |
| Linear kernel | MDD | 0.6908 | 0.7558 | 0.6677 | 0.7947 |
| Gaussian kernel | OMC | 1.4927 | 0.9032 | 1.6308 | 0.8847 |
| Gaussian kernel | MDD | 0.4391 | 0.9014 | 0.5039 | 0.8831 |
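Table 3. Different neuronal performance results of ANN.
| Model | Number of Neurons | Training Set RMSE | Training Set R2 | Test Set RMSE | Test Set R2 |
|---|---|---|---|---|---|
| OMC | 4 | 1.7850 | 0.8616 | 1.8375 | 0.8536 |
| OMC | 5 | 1.5441 | 0.8964 | 1.7155 | 0.8724 |
| OMC | 6 | 1.5916 | 0.8899 | 1.6621 | 0.8802 |
| OMC | 7 | 1.5790 | 0.8917 | 1.6302 | 0.8848 |
| OMC | 8 | 1.5961 | 0.8893 | 1.7834 | 0.8621 |
| MDD | 4 | 0.5828 | 0.8262 | 0.6237 | 0.8208 |
| MDD | 5 | 0.5675 | 0.8351 | 0.6179 | 0.8241 |
| MDD | 6 | 0.5625 | 0.8381 | 0.5949 | 0.8370 |
| MDD | 7 | 0.5249 | 0.8590 | 0.5727 | 0.8490 |
| MDD | 8 | 0.5574 | 0.8410 | 0.6100 | 0.8286 |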
Table 4. Performance results of RF.
| Model | Training Set RMSE | Training Set R2 | Test Set RMSE | Test Set R2 |
|---|---|---|---|---|
| OMC | 0.9368 | 0.9619 | 1.3605 | 0.9198 |
| MDD | 0.3486 | 0.9378 | 0.5664 | 0.8523 |
Table 5. Performance results of XGBoost.
| Model | Training Set RMSE | Training Set R2 | Test Set RMSE | Test Set R2 |
|---|---|---|---|---|
| OMC | 0.8320 | 0.9699 | 1.3290 | 0.9234 |
| MDD | 0.2506 | 0.9679 | 0.4447 | 0.9089 |
Table 6. Comparison of best model performance for the test set.
| Model | OMC RMSE | OMC R2 | MDD RMSE | MDD R2 |
|---|---|---|---|---|
| SVM | 1.6308 | 0.8847 | 0.5039 | 0.8831 |
| RF | 1.3605 | 0.9198 | 0.5664 | 0.8523 |
| XGBoost | 1.3290 | 0.9234 | 0.4447 | 0.9098 |
| ANN | 1.6302 | 0.8848 | 0.5727 | 0.8490 |

Share and Cite

MDPI and ACS Style

Li, B.; You, Z.; Ni, K.; Wang, Y. Prediction of Soil Compaction Parameters Using Machine Learning Models. Appl. Sci. 2024, 14, 2716. https://doi.org/10.3390/app14072716