A Study on Developing a Model for Predicting the Compression Index of the South Coast Clay of Korea Using Statistical Analysis and Machine Learning Techniques

Lee, Sungyeol; Kang, Jaemo; Kim, Jinyoung; Baek, Wonjin; Yoon, Hyeonjun

doi:10.3390/app14030952


¹ Department of Geotechnical Engineering Research, Korea Institute of Civil Engineering and Building Technology, Goyang-si 10223, Republic of Korea

² Department of Rural & Biosystems Engineering, Chonnam National University, Gwangju 61186, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(3), 952; https://doi.org/10.3390/app14030952

Submission received: 4 January 2024 / Revised: 19 January 2024 / Accepted: 20 January 2024 / Published: 23 January 2024

(This article belongs to the Special Issue Advances in Disaster Prevention and Reduction for Geotechnical Engineering)


Abstract

As large cities are continually being developed around coastal areas, structural damage due to the consolidation settlement of soft ground is becoming more of a problem. Estimating consolidation settlement requires calculating an accurate compression index through consolidation tests. However, these tests are time-consuming, and there is a risk of the test results becoming compromised while preparing and testing the specimens. Therefore, predicting the compression index based on the results of relatively simple physical property tests enables more reliable and accurate predictions of consolidation settlement by calculating the compression index at multiple points. In this context, this study collected geotechnical data from the soft ground of Korea’s south coast. The collected data were used to construct a dataset for developing a compression index prediction model, and significant influencing factors were identified through Pearson correlation analysis. Simple and multiple linear regression analyses were performed using these factors to derive regression equations, and compression index prediction models were developed by applying machine learning algorithms. The results of deriving the significance of the influencing factors from the developed compression index prediction model showed that natural water content was the most significant factor in predicting the compression index. By collecting a significant amount of high-quality data and using the compression index prediction model and the model construction process proposed in this study, more accurate predictions of the compression index will be possible in the future.

Keywords: consolidation; compression index; machine learning; statistical analysis

1. Introduction

As large-scale infrastructure is built along coastal areas with soft ground, concerns are emerging about damage to superstructures due to the consolidation settlement of soft soil. Soft ground refers to very soft deposits composed of layers of silt and clay. Soft soils have different geotechnical characteristics depending on conditions such as how they were formed, their sedimentary environment, and their mineral composition [1]. Accordingly, the compression index of each region has different characteristics and ranges, and values that reflect the local ground characteristics are required to estimate consolidation settlement accurately. However, precise geotechnical surveys are difficult to conduct over wide areas because of the time and costs involved. As a result, the amount of soft ground settlement is generally predicted by estimating the overall soil properties based on geotechnical surveys.

The compression index of soft ground can be derived from consolidation tests in the laboratory using samples collected in the field [2]. The compression index is often used to calculate the primary consolidation settlement for normally consolidated or overconsolidated ground [3]. The compression index and primary consolidation settlement show a proportional relationship [4]. The compression index is calculated as the slope of the compression curve relating the void ratio to the logarithm of the effective overburden pressure (effective stress) obtained from a standard consolidation test [5]. However, standard consolidation tests are expensive and time-consuming. Therefore, many empirical formulas have been developed to calculate the compression index from relatively simple physical test results [6,7,8,9,10]. Many studies have been conducted to predict the compression index by applying various techniques, such as developing a model for predicting the compression index through single and multiple regression analysis using the natural water content, liquid limit, and initial void ratio [8]. In addition, many studies have also been conducted on developing compression index prediction models based on machine learning algorithms [11,12,13]. However, most previous studies developed these models using relatively little data derived from local areas. In addition, different kinds of soft ground have different characteristics depending on various factors (such as how they were formed), so there are limitations to applying compression index prediction models from previous studies to other regions. Therefore, this study analyzed the characteristics of geotechnical data of the soft ground from Korea’s south coast and proposed a compression index prediction model using linear regression analysis and machine learning techniques. The geotechnical data of the target area (natural water content, liquid limit, plastic limit, plasticity index, initial void ratio, compression index) were collected and analyzed.
Then, Pearson correlation analysis was used to identify the geotechnical properties that are significantly correlated with the compression index, and regression equations between these properties and the compression index were derived through linear regression analysis. After removing any missing values, compression index prediction models were developed by applying the dataset to machine learning algorithms, such as RandomForest (RF), XGBoost (XGB), and LightGBM (LGBM). After this was completed, the significance of the factors used by the model to predict the compression index was investigated. Using the compression index prediction model proposed in this study will enable the prediction of the approximate compression index of the target area, and securing more data will enable the development of a higher-quality model in the future.

2. Target Area and Data

2.1. Target Area

Geotechnical data of the soft soil along the southern coast of Korea were collected to develop compression index prediction models using multiple regression analysis and machine learning techniques. Korea’s south coast has soft ground located in the Yeongsan, Seomjin, and Nakdong River water systems. The soft ground in the target area is characterized by the deposition of a sand layer on top of a clay layer resulting from flowing water. The thickness of the soft ground is mostly 10–20 m [1]. Table 1 shows the 4868 data points collected from these areas. Figure 1 shows a map of the region; the red portion represents the target area.

2.2. Data

The data used in this study were collected from geotechnical surveys conducted during the design and construction stages of projects in the target area. The geotechnical data collected included the soils’ average sampling depth, natural water content, specific gravity, liquid limit, plasticity index, plastic limit, initial void ratio, saturated unit weight, uniaxial compressive strength, compression index, and pre-consolidation pressure. The natural water content was measured by drying samples collected in the field in a laboratory. Samples collected closer to the coast and at lower depths have higher natural water contents [14,15,16].

Table 2 shows the range and mean values of the data. The dataset is based on CH or CL groups, which are in the clay series under the unified soil classification system.

3. Selecting the Influencing Factors

To develop a machine learning-based compression index prediction model using geotechnical properties, geotechnical data of the study area were collected and correlation analysis was performed to derive factors that significantly correlated with the compression index. Preprocessing is required to remove empty values and errors from the data before performing correlation analysis and machine learning. The empty values and errors from the data used in this study were removed, and since much of the deleted data concerned saturated unit weight, uniaxial compressive strength, and pre-consolidation pressure, these data were excluded from the influencing factors to secure the dataset. The pre-consolidation pressure is significantly correlated with the compression index in addition to the effective overburden pressure [17,18,19]. Therefore, adding the pre-consolidation pressure as an influencing factor to the compression index prediction model was expected to yield good results. However, in addition to saturated unit weight and uniaxial compressive strength, only a limited amount of data was available for the pre-consolidation pressure. Therefore, it was unfortunately excluded as an influencing factor in this study.

Pearson correlation analysis was performed to derive the geotechnical properties (average depth, natural water content, specific gravity, liquid limit, plasticity index, plastic limit, and initial void ratio) that affect the compression index.

Correlation analysis is a statistical method used to measure the strength of the linear relationship between independent and dependent variables, and the correlation coefficient has values between −1 and 1. The correlation coefficient between variables is calculated through Equations (1) and (2), below [20,21].

\mathrm{Corr}(X, Y) = ρ(X, Y) = \frac{\mathrm{Cov}(X, Y)}{σ_{X} σ_{Y}}

(1)

r = \frac{s_{x y}}{s_{x} s_{y}} = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sqrt{\sum_{i = 1}^{n} (X_{i} - \bar{X})^{2} \sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2}}}

(2)

s_{xy}: covariance of variables X and Y; s_{x}: standard deviation of variable X; s_{y}: standard deviation of variable Y.

After obtaining the correlation coefficient using Equations (1) and (2), the hypothesis about the correlation in the population is tested using the test statistic in Equation (3), which follows a t-distribution with n − 2 degrees of freedom, to determine whether the correlation coefficient between the two variables is statistically significant. If the p-value corresponding to the test statistic is less than 0.05, then the correlation between the two variables is statistically significant [22,23].

t = r \sqrt{\frac{n - 2}{1 - r^{2}}}

(3)
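
As an illustration of Equations (2) and (3), the following minimal Python sketch computes the sample correlation coefficient and the corresponding test statistic. The function names and the six data pairs are hypothetical and are not taken from the study's dataset.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient, following Equation (2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test statistic of Equation (3); it has n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical values for illustration only (not the study's data).
water_content = [45.0, 52.0, 60.0, 68.0, 75.0, 83.0]      # natural water content (%)
compression_index = [0.38, 0.45, 0.61, 0.66, 0.85, 0.92]  # Cc

r = pearson_r(water_content, compression_index)
t = t_statistic(r, len(water_content))
print(f"r = {r:.3f}, t = {t:.2f}")
```

The p-value is then read from the t-distribution with n − 2 degrees of freedom; a statistics library such as SciPy (`scipy.stats.pearsonr`) returns the coefficient and the p-value together.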

Table 3 shows the Pearson correlation analysis results between the mean depth, natural water content, specific gravity, liquid limit, plasticity index, plastic limit, initial void ratio, and compression index. The average depth and specific gravity were shown to not be significantly correlated with the compression index. In contrast, the natural water content, liquid limit, plasticity index, and initial void ratio showed a highly significant correlation (p < 0.001). In particular, the correlation between the natural water content, initial void ratio, and compression index was very high. Therefore, the factors influencing the compression index in this study were natural water content, liquid limit, plasticity index, plastic limit, and initial void ratio. In addition, a total of 2071 data points were used to build a dataset for statistical analysis and machine learning. Figure 2 shows the distribution between the compression index and the influencing factors.

4. Statistical Analysis

In this section, regression equations are proposed through single and multiple linear regression analyses using the natural water content, liquid limit, plasticity index, plastic limit, and initial void ratio, which were significantly correlated with the compression index in the correlation analysis. Multicollinearity between variables was checked through VIF analysis, and linear regression analysis was performed using factors not affected by multicollinearity to present the regression equation and the Coefficient of Determination.

4.1. VIF

Multicollinearity refers to the correlation between explanatory variables. If multicollinearity exists between variables, it distorts the estimated regression coefficients of the multiple regression model. Therefore, multicollinearity must be checked in order to build a multiple regression model. In general, the variance inflation factor (VIF) is used to test for multicollinearity. A VIF of less than 10 is generally taken to mean that no serious multicollinearity is present [24]. Equation (4) shows how to calculate the VIF for β_j, the estimator of the jth regression coefficient [25]. R_j^2 is the Coefficient of Determination calculated from the regression model with x_j as the dependent variable and the remaining variables as explanatory variables.

\mathrm{VIF}_{j} = \frac{1}{1 - R_{j}^{2}}

(4)
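
A minimal sketch of Equation (4) in plain Python is given here for illustration. Each variable is regressed on the remaining ones by ordinary least squares, and the resulting R_j^2 is converted to a VIF. The helper names and the sample values are hypothetical, and the study's own software is not stated in this section.

```python
def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(y, columns):
    """R^2 of an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = _solve(xtx, xty)
    fitted = [sum(beta[a] * row[a] for a in range(k)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """VIF of every column, Equation (4): 1 / (1 - R_j^2)."""
    result = []
    for j, target in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        result.append(1.0 / (1.0 - r_squared(target, others)))
    return result

# Hypothetical values for illustration only (not the study's data).
wn = [40.0, 55.0, 62.0, 70.0, 85.0, 48.0]  # natural water content (%)
ll = [45.0, 50.0, 66.0, 64.0, 90.0, 52.0]  # liquid limit (%)
pl = [22.0, 25.0, 24.0, 28.0, 30.0, 21.0]  # plastic limit (%)

for name, value in zip(["Wn", "LL", "PL"], vif([wn, ll, pl])):
    print(f"VIF({name}) = {value:.2f}")
```

With the threshold used in this section, R_j^2 = 0.9 corresponds exactly to a VIF of 10.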

Table 4 shows the VIFs of the data variables. All factors except the plastic limit had VIFs greater than 10. Accordingly, the initial void ratio and plasticity index, which were suspected to overlap with the natural water content and liquid limit in the linear regression analysis, were excluded. The initial void ratio is calculated using the natural water content, and the plasticity index is the numerical difference between the liquid limit and plastic limit. Therefore, considering the amount of data, a linear regression analysis was conducted with the compression index using the natural water content, liquid limit, and plastic limit and excluding the initial void ratio and plasticity index. Table 5 shows the VIFs of the final data variables.

4.2. Simple Regression Analysis

Regression analysis expresses the relationship between the independent and dependent variables in the data as a functional formula so that the independent variable can be used to predict the dependent variable. Equation (5) shows the form of a simple linear regression model. The intercept is the value of the dependent variable when the independent variable is zero, and the slope is the influence (weight) of the independent variable. Depending on its sign, the slope can be interpreted as a positive or negative correlation between the two variables. The larger the absolute value of the slope, the greater the influence (weight) of the independent variable on the dependent variable [26].

Y_{i} = b_{0} + b_{1} x_{i}

(5)

Y_{i}: dependent variable; x_{i}: independent variable; b_{0}: intercept; b_{1}: slope.

Table 6 shows the simple linear regression analysis results between the natural water content, liquid limit, plastic limit, and compression index. According to the single-factor linear regression results, the highest Coefficient of Determination was found in the regression equation between the natural water content and compression index. The plastic limit’s Coefficient of Determination was the lowest, just as in the Pearson correlation analysis results. In addition, all independent variables were positively correlated with the dependent variable (Cc). The reliability of the simple regressions in Table 6 can be assessed using R², and the results are similar to those of previous studies in similar geographic regions [8,9,27,28]. The limited R² values were likely due to the analysis using relatively little data from a wide target area. Therefore, it is necessary to apply statistical analysis and machine learning techniques to propose a regression model for predicting the compression index using multiple factors.
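
For illustration, a simple linear regression of the kind described in this subsection can be fitted in closed form, with the slope given by the ratio of the covariance to the variance of the independent variable. The sketch below uses a hypothetical function name and made-up data, and it does not reproduce the coefficients in Table 6.

```python
def fit_line(x, y):
    """Least-squares intercept, slope, and R^2 for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only (not the study's data).
water_content = [45.0, 52.0, 60.0, 68.0, 75.0, 83.0]      # natural water content (%)
compression_index = [0.38, 0.45, 0.61, 0.66, 0.85, 0.92]  # Cc

b0, b1, r2 = fit_line(water_content, compression_index)
print(f"Cc = {b0:.4f} + {b1:.4f} * Wn, R^2 = {r2:.3f}")
```

For a single predictor, the R² obtained this way equals the square of the Pearson correlation coefficient, which is why the ranking of the factors matches the correlation analysis.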

4.3. Multiple Regression Analysis

Multiple linear regression analysis for predicting the compression index using geotechnical properties was performed using OLS analysis to investigate the effects of the natural water content, liquid limit, and plastic limit on the compression index and present regression equations. Multiple linear regression analysis expresses the relationship between multiple explanatory variables and the dependent variable (here, the compression index) as a regression equation, as shown in Equation (6). The coefficients of the regression equation are estimated through the Ordinary Least Squares method (OLS), which minimizes the Sum of Squared Residuals [29,30].

y = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{i} x_{i} + ε

(6)

y: dependent variable; x_{i}: independent variable; β_{i}: estimated regression coefficient; ε: error.

Table 7 shows the results of the OLS analysis performed in this study. The weights of the natural water content and liquid limit were found to be statistically significant (p < 0.001). The weight of the natural water content was higher than that of the liquid limit. These results were consistent with the Pearson correlation and simple linear regression analysis above. However, in the case of the plastic limit, the weight was relatively small and not statistically significant. Therefore, the compression index regression equation using the geotechnical properties of the southern coastal area of Korea is composed of the natural water content and liquid limit, excluding the plastic limit, as shown in Equation (7), below.

C_{c} = 0.0152 W_{n} + 0.0018 \mathrm{LL} - 0.3165

(7)
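
As a worked example, Equation (7) can be evaluated directly. The sketch below assumes that the natural water content and liquid limit are entered in percent, which is an assumption based on the magnitude of the coefficients; the function name and the input values are hypothetical.

```python
def compression_index_eq7(wn, ll):
    """Compression index from Equation (7).

    wn: natural water content, assumed to be in percent
    ll: liquid limit, assumed to be in percent
    """
    return 0.0152 * wn + 0.0018 * ll - 0.3165

# Hypothetical sample: Wn = 60 %, LL = 70 %.
cc = compression_index_eq7(60.0, 70.0)
print(f"Cc = {cc:.4f}")
```

For this sample the terms are 0.0152 × 60 = 0.912 and 0.0018 × 70 = 0.126, so the estimate is 0.912 + 0.126 − 0.3165 = 0.7215. The much larger contribution of the water content term reflects its larger weight in the regression.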

5. Machine Learning

The dataset used in the linear regression analysis was applied to machine learning algorithms to propose a model for predicting the compression index of the target area. This study used algorithms such as RF, XGB, and LGBM, which have shown excellent capabilities in predicting the compression index in previous studies [31]. The flowchart in Figure 3 shows the process for developing the compression index prediction model using machine learning. The dataset was built by securing the data of the target area and splitting it into an 80:20 ratio of training and evaluation data. After applying the training data to the machine learning algorithms, the model’s performance was tested by checking the Root Mean Squared Error (RMSE) and the Coefficient of Determination (R²).
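
The 80:20 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the random seed, and the use of integer placeholders for the 2071 records are assumptions, and the study's actual splitting procedure may differ.

```python
import random

def split_80_20(records, seed=0):
    """Shuffle the records and split them 80:20 into training and evaluation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Placeholder record identifiers standing in for the 2071 data points.
records = list(range(2071))
train_set, eval_set = split_80_20(records)
print(len(train_set), len(eval_set))
```

Keeping the evaluation records out of training is what allows the RMSE and R² on that set to indicate how the model behaves on unseen data.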

5.1. Machine Learning Algorithms

The machine learning algorithms used in this study have been reported to be effective in solving regression and classification problems [32,33,34]. RandomForest (RF) is a commonly used ensemble learning method of the bagging technique, which derives optimal results by combining the output of multiple decision trees [35]. Bagging is accomplished by training each single algorithm to produce a result and then voting for the best result by majority vote or by presenting the average result. Unlike models that derive results from a single algorithm, this method derives results by training multiple algorithms, resulting in more efficient evaluation metrics. RF is often used in machine learning because it is efficient for processing data with unclear features and it splits trees in a dichotomous manner. Therefore, it is easy to solve problems by generalizing the data [36].

RF works as follows. First, n random bootstrap samples are used to train a decision tree, where attributes are randomly selected at each node without allowing for duplicates. Then, the process of splitting the nodes by creating the best split based on an objective function, such as information gain, is repeated, with the results of each decision tree averaged to derive the results. The information gain used as the objective function is defined as Equation (8), below [33].

IG(D_{p}, f) = I(D_{p}) - \sum_{j = 1}^{m} \frac{N_{j}}{N_{p}} I(D_{j})

(8)

f: attribute used for splitting

D_p, D_j: datasets in the parent node and in the jth child node

I: impurity

N_p, N_j: number of samples in the parent node and in the jth child node
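
To make Equation (8) concrete, the sketch below evaluates the information gain of one candidate split. It assumes that the impurity I is the variance of the target values in a node, a common choice for regression trees that is not stated in the source; the function names and values are hypothetical.

```python
def variance(values):
    """Impurity of a node, taken here as the variance of its target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def information_gain(parent, children, impurity=variance):
    """Equation (8): parent impurity minus the size-weighted child impurities."""
    n_parent = len(parent)
    weighted = sum(len(child) / n_parent * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical target values in a parent node and two candidate splits.
parent = [1.0, 1.0, 5.0, 5.0]
good_split = [[1.0, 1.0], [5.0, 5.0]]  # separates the two groups completely
poor_split = [[1.0, 5.0], [1.0, 5.0]]  # leaves both children as mixed as the parent

print(information_gain(parent, good_split), information_gain(parent, poor_split))
```

A tree-growing procedure evaluates this quantity for each candidate attribute and threshold and keeps the split with the largest gain.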

XGBoost (XGB) and LightGBM (LGBM) are machine learning algorithms that use boosting techniques, in which individual learners are trained sequentially to derive results. Each learner learns from the errors of the previous learners, and the errors in the derived results are reduced step by step to obtain the optimal outcome. Boosting reduces the error by following the gradient of the loss function (gradient descent). Equations (9) and (10) define the prediction and the objective function, which combines a loss function with a regularization term [37].

\hat{y}_{i} = \sum_{k = 1}^{K} f_{k}(x_{i})

(9)

\mathrm{Obj} = \sum_{i = 1}^{n} l(y_{i}, \hat{y}_{i}) + \sum_{k = 1}^{K} Ω(f_{k})

(10)

\hat{y}_{i}: prediction score

K: number of trees

f_{k}: a function (tree) in the function space F

l: loss function; Ω: regularization term
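
The additive form of Equation (9) can be illustrated with a deliberately small boosting sketch. It is not XGBoost or LightGBM: it uses one-split trees (stumps) on a single feature, a squared-error loss, and no regularization term, so the second sum in Equation (10) is omitted. All names, the learning rate, and the data are hypothetical.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of a 1-D feature, minimizing squared error.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, rounds, learning_rate=0.5):
    """Sequentially fit stumps to the residuals of the current prediction."""
    base = sum(y) / len(y)          # initial constant prediction
    trees = []
    preds = [base] * len(y)
    for _ in range(rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        preds = [pi + learning_rate * tree(xi) for xi, pi in zip(x, preds)]
    # Final model: base value plus the scaled sum of all trees, as in Equation (9).
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)

def mse(model, x, y):
    """Mean squared error of a model on the given data."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Hypothetical values for illustration only (not the study's data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.2, 0.3, 0.5, 0.6, 0.9, 1.1]

for rounds in (0, 1, 10):
    print(rounds, round(mse(boost(x, y, rounds), x, y), 5))
```

The training error falls as rounds are added, which is the sequential error correction described above; the regularization term in Equation (10) exists to keep that process from overfitting.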

5.2. Model Evaluation Metrics

The performance of the machine learning models used in this study was evaluated by the Root Mean Squared Error (RMSE) and Coefficient of Determination (R²). The RMSE is calculated by taking the square root of the MSE (Mean Squared Error). RMSE is widely used because it penalizes large errors more heavily than small ones, and it is easy to interpret because it is expressed in the same units as the predicted quantity. R² evaluates the performance of a model based on the proportion of the variance that it explains. The closer the value is to 1, the better the model is. Equations (11) and (12) show how to calculate the RMSE [38].

\mathrm{MSE} = \frac{1}{N} \sum_{i = 1}^{N} (y_{i} - \hat{y}_{i})^{2}

(11)

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (y_{i} - \hat{y}_{i})^{2}}

(12)
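
The two metrics can be computed as in the following sketch. The RMSE follows Equations (11) and (12); the source gives no equation for R², so the usual definition, 1 minus the ratio of the residual sum of squares to the total sum of squares, is assumed. The function names and values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error, Equations (11) and (12)."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of Determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and predicted values for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

print(rmse(measured, predicted), r2_score(measured, predicted))
```

In this example a single error of 2 on four points gives MSE = 4/4 = 1 and RMSE = 1, while R² = 1 − 4/5 = 0.2, which shows how strongly one large error affects both metrics.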

5.3. Results

This study built a dataset using the target area’s geotechnical properties and compression index data. The dataset was applied to machine learning algorithms (RF, XGB, and LGBM), and RMSE and R² were used to determine the model with the best performance. Overfitting was considered to be avoided when the difference in R² between the training set and testing set was less than 0.1. Table 8 shows the results of the machine learning models for predicting the compression index. The model using the RF algorithm had the lowest RMSE (0.1669). Its R² (0.72) was also the highest, although the margin was small. The performance of the XGB and LGBM models was not significantly different from that of the RF model. This similarity is likely due to the small amount of data used.

Table 9 shows the hyperparameters of the models. A compression index regression model was built based on the RF model. Figure 4 shows the importance of the influencing factors used in the analysis through the feature importance provided by RF. Natural water content showed the highest importance in predicting the compression index. Liquid limit and plastic limit had relatively low importance, as in the linear regression analysis results. Compared to the results of a previous study that used data samples from multiple countries to predict the compression index of soft ground based on machine learning (R² = 0.93), the model proposed in this study shows relatively low results [39]. This was likely due to the number of data samples used in the analysis and the regional characteristics. The formation of soft ground is highly influenced by geographical characteristics such as parent rock and flow rate. Therefore, it would be better to develop a customized model for predicting geotechnical properties by subdividing the region.

Figure 5 shows the RF model’s compression index predictions for the target area and the actual data. Since most of the data are concentrated in the center, the model can be used to predict a rough compression index.

6. Discussion

This study proposed a compression index prediction model using geotechnical properties based on linear regression analysis and machine learning for analyzing soft clay on the south coast of Korea. The performance of the models is likely to improve as more high-quality data are collected. The compression index is a very important factor used to calculate consolidation settlement. As such, there are limitations to immediately applying the compression index prediction model proposed in this study in the field. However, using the results of this study enables a rough prediction of the compression index and can improve the reliability of the consolidation test results performed in laboratories. In addition, securing more comprehensive data (adhesion, uniaxial compressive strength, etc.) can help develop an improved compression index prediction model by better classifying regions and depths.

7. Conclusions

This study applied linear regression analysis and machine learning techniques to develop a compression index prediction model using geotechnical properties for soft clay on the south coast of Korea. As a result, regression equations and compression index prediction models using the natural water content, liquid limit, and plastic limit were derived. Linear regression analysis using a single influence factor resulted in R² values ranging from 0.22 to 0.69. While this is comparable to previous studies in similar regions, the reliability was low compared to general R² standards. Therefore, this study conducted an OLS analysis using multiple factors and obtained a value of R² = 0.69. In addition, the natural water content and liquid limit were found to be factors with a significant effect on the compression index, and a regression equation was derived using them. A model for predicting the compression index was proposed using machine learning techniques to improve the results. The model using the RF algorithm showed the best result (R² = 0.72), and the importance of the factors was in the order of natural water content, liquid limit, and plastic limit. This was also reflected in the statistical analysis results. The R² was higher than that of the multiple linear regression analysis. Machine learning techniques were therefore the most advantageous approach for proposing a model to predict the compression index of clay on the south coast of Korea. In addition, soft ground varies in composition and mineral properties depending on the geographical characteristics. Therefore, presenting a customized compression index prediction model by subdividing the target area would be more effective.

Author Contributions

Conceptualization, J.K. (Jaemo Kang) and J.K. (Jinyoung Kim); developed the models and carried out the model simulations, S.L. and H.Y.; writing—original draft preparation, S.L.; writing—review and editing, J.K. (Jaemo Kang), J.K. (Jinyoung Kim) and W.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant from the project “Underground Utilities Diagnosis and Assessment Technology (5/5)”, which was funded by the Korea Institute of Civil Engineering and Building Technology (KICT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to restrictions related to the rights of the data collection agency involved.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kim, S.K.; Lim, H.D.; Moon, S.K. Clay Minerals and Their Distribution in the Soft Ground Deposited along the Coastline. J. Korean Geotech. Soc. 1998, 14, 73–80.
2. Gregory, A.S.; Whalley, W.R.; Watts, C.W.; Birtd, N.R.A.; Hallett, P.D.; Whitmore, A.P. Calculation of the compression index and precompression stress from soil compression test data. Soil Tillage Res. 2006, 89, 45–57.
3. Bryan, A.M.; Brian, B.S.; Michael, M.L.; Fintan, J.B.; Eric, R.F. Empirical correlations for the compression index of Irish soft soils. Proc. Inst. Civ. Eng.-Geotech. Eng. 2014, 167, 507–599.
4. Kalantary, F.; Kordnaeij, A. Prediction of compression index using artificial neural network. Sci. Res. Essays 2012, 7, 2835–2848.
5. Balasubramaniam, A.S.; Brenner, R.P. Consolidation and Settlement of Soft Clay. Dev. Geotech. Eng. 1981, 20, 479–566.
6. Park, C.S.; Kim, S.S. A Study on the Estimation of Compression Index in the East-Southern Coast Clay of Korea. J. Korean Geotech. Soc. 2019, 35, 43–56.
7. Skempton, A.W.; Jones, O.T. Notes on the compressibility of clays. J. Geol. Soc. 1944, 100, 119–135.
8. Heo, Y.; Hwang, I.S.; Kang, C.W.; Bae, W.S. Correlations Between the Physical Properties and Consolidation Parameter of West Shore Clay. J. Korean Geo-Environ. Soc. 2015, 16, 33–40.
9. Bae, W.S.; Kim, J.W. Correlations Between the Physical Properties and Compression Index of KwangYang Clay. J. Korean Geo-Environ. Soc. 2009, 10, 7–14.
10. Chung, S.G.; Kwag, J.M.; Jang, W.Y.; Kim, D.G. Compressibility Characteristics of Estuarine Clays in the Nakdong River Plain. J. Korean Geotech. Soc. 2002, 18, 295–307.
11. Bae, W.S.; Kwon, Y.C. Prediction of consolidation parameter using multiple regression analysis. Mar. Georesources Geotechnol. 2017, 35, 643–652.
12. Nguyen, M.D.; Pham, B.T.; Ho, L.S.; Ly, H.B.; Le, T.T.; Qi, C.; Le, V.M.; Le, L.M.; Prakash, I.; Son, L.H.; et al. Soft-computing techniques for prediction of soils consolidation coefficient. Catena 2020, 195, 104802.
13. Singh, M.J.; Kaushik, A.; Patnaik, G.; Xu, D.S.; Feng, W.Q.; Rajput, A.; Prakash; Borana, L. Machine learning-based approach for predicting the consolidation characteristics of soft soil. Mar. Georesources Geotechnol. 2023.
14. Heo, Y.; Yun, S.; Jung, K.; Oh, S. Analysis on the Relationship of Soil Parameters of Marine Clay. J. Korean Geo-Environ. Soc. 2008, 9, 37–45.
15. Chilingar, G.V.; Knight, L. Relationship Between Pressure and Moisture Content of Kaolinite, Illite, and Montmorillonite Clays. Bull. Am. Assoc. Pet. Geol. 1960, 44, 101–106.
16. Partha, N.M.; Alexander, S.; Bhuyan, M.H. A Unified Approach for Establishing Soil Water Retention and Volume Change Behavior of Soft Soils. Geotech. Test. J. 2021, 44, 1197–1216.
17. Stamatopoulos, C.; Petridis, P.; Parcharidis, I.; Foumelis, M. A method predicting pumping-induced ground settlement using back-analysis and its application in the Karla region of Greece. Nat. Hazards 2018, 92, 1733–1762.
18. Tripathy, S.; Schanz, T. Compressibility behavior of clays at large pressures. Can. Geotech. J. 2007, 44, 355–362.
19. Marcial, D.; Delage, P.; Cui, Y. On the high stress compression of bentonites. Can. Geotech. J. 2002, 39, 816.
20. Pearson, K. Note on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 1895, 58, 240–242.
21. Box, G.E.; Cox, D.R. An analysis of transformations. J. R. Stat. Soc. Ser. B Stat. Methodol. 1964, 26, 211–252.
22. Pearson, K. On lines and planes of closest fit to systems of points in space. Philos. Mag. 1901, 2, 559–572.
23. Fisher, R.A. On the “probable error” of a coefficient of correlation deduced from a small sample. Metron 1921, 1, 3–32.
24. Chan, J.Y.L.; Leow, S.M.H.; Bea, K.T.; Cheng, W.K.; Phoong, S.W.; Hong, Z.W.; Chen, Y.L. Mitigating the Multicollinearity Problem and Its Machine Learning Approach: A Review. Mathematics 2022, 10, 1283.
25. Kimon, N.; Alex, K.; Andreas, A. Interdependency Pattern Recognition in Econometrics: A Penalized Regularization Antidote. Econometrics 2021, 9, 44.
26. Manoranjan, P.; Bharati, P. Introduction to Correlation and Linear Regression Analysis. Appl. Regres. Tech. 2019, 1–18.
27. Lee, S.Y.; Kim, J.Y.; Kang, J.M.; Baek, W.J.; Yoon, H.J. Comparison of Performance of Machine Learning Models for Predicting Compression Index Based on Clay Properties. J. Korean Soc. Hazard Mitig. 2022, 22, 127–134.
28. Hong, S.J.; Kim, D.H.; Choi, Y.M.; Lee, W.J. Prediction of Compression Index of Busan and Inchon Clays Considering Sedimentation State. J. Korean Geotech. Soc. 2011, 27, 37–46.
29. Hayes, A.F. Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation. Behav. Res. Methods 2007, 39, 709–722.
30. Hayes, A.F.; Mattels, J. Computational procedures for probing interactions in OLS and logistic regression: SPSS and SAS implementations. Behav. Res. Methods 2009, 41, 924–936.
31. Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222.
32. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In KDD’16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 785–794.
33. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
34. Breiman, L.; Friedman, J.; Stone, C.; Olshen, R. Classification and Regression Trees; Taylor & Francis: Abingdon, UK, 1984.
35. Louppe, G. Understanding Random Forests; University of Liège: Liège, Belgium, 2014; p. 211.
36. Kim, H.I.; Lee, Y.S.; Kim, B. Real-time flood prediction applying random forest regression model in urban areas. J. Korea Water Resour. Assoc. 2021, 54, 1119–1130.
37. Zhang, Y.; Haghani, A. A gradient boosting method to improve travel time prediction. Transp. Res. Part C Emerg. Technol. 2015, 58, 308–324.
38. Willmott, C.J. Some Comments on the Evaluation of Model Performance. Bull. Am. Meteorol. Soc. 1982, 63, 1309–1313.
39. Díaz, E.; Spagnoli, G. A super-learner machine learning model for a global prediction of compression index in clays. Appl. Clay Sci. 2024, 249, 107239.

Figure 1. Target area. (Source: National Geographic Information Institute).

Figure 2. Correlation between the compression index and the influencing factors: (a) Wn-Cc, (b) LL-Cc, (c) PI-Cc, (d) e0-Cc, (e) PL-Cc.

Figure 3. Flowchart for developing the compression index prediction model.

Figure 4. Importance of factors used.

Figure 5. Distribution of actual and predicted data.

Table 1. Number of data points by water system in the target area.

Water System	Number of Data Points
Yeongsan River	1759
Seomjin River	1778
Nakdong River	1331
Total	4868

Table 2. Data characteristics.

Geotechnical Properties	Range	Mean
Average Depth (m)	0.050~66.000	10.652
Natural Water Content ($W_{n}$, %)	7.400~147.100	58.996
Specific Gravity ($G_{s}$)	2.530~2.900	2.692
Liquid Limit (LL, %)	22.200~142.500	58.554
Plasticity Index (PI, %)	1.400~118.900	24.707
Plastic Limit (PL, %)	1.400~91.800	25.866
Initial Void Ratio ($e_{0}$)	0.607~3.825	1.675
Saturated Unit Weight ($r_{t}$, tf/m$^{3}$)	1.063~4.155	1.648
Uniaxial Compressive Strength ($q_{u}$, kgf/cm$^{2}$)	0.012~2.323	0.346
Compression Index ($C_{c}$)	0.119~2.614	0.711
Pre-consolidation Pressure ($P_{c}$, kgf/cm$^{2}$)	0.000~4.900	0.732

Table 3. Pearson correlation analysis results.

Influencing Factor	Pearson Correlation Coefficient with Compression Index
Average Depth	−0.049
Natural Water Content	0.838 ***
Specific Gravity	−0.077
Liquid Limit	0.632 ***
Plasticity Index	0.617 ***
Plastic Limit	0.473 ***
Initial Void Ratio	0.869 ***

*** p < 0.001.
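
The values in Table 3 are Pearson correlation coefficients. As a minimal sketch of how such a coefficient is obtained for one pair of variables, the following standard-library Python function may be used. The function name `pearson_r` and the sample values are hypothetical and are not taken from the dataset of this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not measured data).
water_content = [35.0, 48.0, 60.0, 72.0, 90.0]      # natural water content, %
compression_index = [0.30, 0.52, 0.70, 0.95, 1.20]  # compression index

r = pearson_r(water_content, compression_index)
print(f"r = {r:.3f}")
```

A coefficient close to 1, such as those reported for the initial void ratio and the natural water content, indicates a strong positive linear relationship with the compression index.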

Table 4. The VIFs of the initial data.

Category	VIF
Natural Water Content	15.025
Liquid Limit	93.882
Plasticity Index	60.218
Plastic Limit	9.899
Initial Void Ratio	14.976

Table 5. The VIFs of the final data.

Category	VIF
Natural Water Content	2.149
Liquid Limit	3.262
Plastic Limit	2.101
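
To make the VIFs in Tables 4 and 5 easier to interpret, the sketch below converts each VIF into the coefficient of determination it implies when one factor is regressed on the remaining factors. It assumes the standard definition VIF = 1/(1 − R²); the function and variable names are hypothetical.

```python
def vif_from_r2(r2):
    """VIF implied by the R^2 of regressing one factor on the others."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Inverse relation: R^2 implied by a given VIF (VIF >= 1)."""
    if vif < 1.0:
        raise ValueError("vif must be at least 1")
    return 1.0 - 1.0 / vif


# VIFs copied from Table 4 (initial data) and Table 5 (final data).
INITIAL_VIF = {
    "Natural Water Content": 15.025,
    "Liquid Limit": 93.882,
    "Plasticity Index": 60.218,
    "Plastic Limit": 9.899,
    "Initial Void Ratio": 14.976,
}
FINAL_VIF = {
    "Natural Water Content": 2.149,
    "Liquid Limit": 3.262,
    "Plastic Limit": 2.101,
}

implied_r2_initial = {k: r2_from_vif(v) for k, v in INITIAL_VIF.items()}
implied_r2_final = {k: r2_from_vif(v) for k, v in FINAL_VIF.items()}

for name, r2 in implied_r2_initial.items():
    print(f"initial {name}: implied R^2 = {r2:.3f}")
for name, r2 in implied_r2_final.items():
    print(f"final   {name}: implied R^2 = {r2:.3f}")
```

Under this definition, the liquid limit in the initial data is almost entirely explained by the other factors, whereas the three factors in the final data share far less variance with one another.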

Table 6. Linear regression analysis results.

Category	Regression Equation	$R^{2}$
Natural Water Content	$C_{c}$ = 0.0167 $W_{n}$ − 0.2913	0.687
Liquid Limit	$C_{c}$ = 0.0125 LL − 0.0146	0.425
Plastic Limit	$C_{c}$ = 0.0284 PL − 0.0019	0.223
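
As an illustration of how the equations in Table 6 are applied, the sketch below evaluates each of them at the mean value of the corresponding property in Table 2. The names `SIMPLE_MODELS` and `predict_cc_simple` are hypothetical, and using the Table 2 means as input is only an assumption made for this example.

```python
# (slope, intercept) pairs copied from Table 6.
SIMPLE_MODELS = {
    "Wn": (0.0167, -0.2913),  # natural water content, %
    "LL": (0.0125, -0.0146),  # liquid limit, %
    "PL": (0.0284, -0.0019),  # plastic limit, %
}


def predict_cc_simple(factor, value):
    """Compression index from a single property using a Table 6 equation."""
    slope, intercept = SIMPLE_MODELS[factor]
    return slope * value + intercept


# Mean values from Table 2, used here only as sample inputs.
means = {"Wn": 58.996, "LL": 58.554, "PL": 25.866}

estimates = {name: predict_cc_simple(name, value) for name, value in means.items()}
for name, cc in estimates.items():
    print(f"Cc estimated from {name}: {cc:.3f}")
```

The three estimates can be compared with the mean compression index of 0.711 in Table 2, keeping in mind that the equations differ considerably in explanatory power.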

Table 7. OLS analysis results.

Category	Variable	Linear Regression Model (OLS)
Constant		−0.3165 ***
Geotechnical Properties	Natural Water Content	0.0152 ***
Geotechnical Properties	Liquid Limit	0.0018 ***
Geotechnical Properties	Plastic Limit	0.0003
Explanatory Power	$R^{2}$	0.691

*** p < 0.001.
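
The coefficients in Table 7 can be read as a single multiple regression equation. The sketch below writes it out as a function, assuming that the plastic limit term is retained even though its coefficient is not significant at the level marked in the table; the function name `predict_cc_ols` and the sample input are hypothetical.

```python
# Coefficients copied from Table 7.
INTERCEPT = -0.3165
COEF_WN = 0.0152  # natural water content, %
COEF_LL = 0.0018  # liquid limit, %
COEF_PL = 0.0003  # plastic limit, % (not significant in Table 7)


def predict_cc_ols(wn, ll, pl):
    """Compression index from the OLS model: Cc = b0 + b1*Wn + b2*LL + b3*PL."""
    return INTERCEPT + COEF_WN * wn + COEF_LL * ll + COEF_PL * pl


# Sample input: the mean values of Wn, LL, and PL from Table 2.
cc_at_means = predict_cc_ols(wn=58.996, ll=58.554, pl=25.866)
print(f"Cc at the Table 2 means: {cc_at_means:.3f}")
```

Because the coefficient of the natural water content is roughly an order of magnitude larger than the others, the prediction is driven mainly by that term.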

Table 8. Model results.

Model	RMSE (Train)	RMSE (Test)	$R^{2}$ (Train)	$R^{2}$ (Test)
RF	0.1341	0.1669	0.82	0.72
XGB	0.1403	0.1715	0.80	0.70
LGBM	0.1401	0.1702	0.80	0.71
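
For reference, the two metrics reported in Table 8 can be computed as in the following sketch. It assumes the usual definitions (root mean square error, and R² as one minus the ratio of the residual to the total sum of squares); the function names and the sample values are hypothetical and are not results from this study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted compression indices.
actual_cc = [0.40, 0.65, 0.80, 1.10, 1.35]
predicted_cc = [0.45, 0.60, 0.90, 1.00, 1.30]

print(f"RMSE = {rmse(actual_cc, predicted_cc):.4f}")
print(f"R^2  = {r_squared(actual_cc, predicted_cc):.4f}")
```

A lower RMSE and a higher R² on the test set than on the training set would be unusual; the gap between the train and test columns of Table 8 reflects the expected loss of accuracy on unseen data.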

Table 9. Model hyperparameters.

Model	Hyperparameter
RF	N_estimators = 500, Max_depth = 6
XGB	N_estimators = 200, Max_depth = 3, Learning_rate = 0.05
LGBM	N_estimators = 300, Max_depth = 5, Learning_rate = 0.02
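
The hyperparameters in Table 9 could be passed to common Python implementations of the three algorithms as sketched below. This is an assumption for illustration: the source does not state which libraries were used, so scikit-learn, XGBoost, and LightGBM are assumed here, and the helper `build_model` is hypothetical. All other settings are left at the library defaults.

```python
# Hyperparameters copied from Table 9, using the keyword names of the
# assumed libraries (scikit-learn, xgboost, lightgbm).
HYPERPARAMS = {
    "RF": {"n_estimators": 500, "max_depth": 6},
    "XGB": {"n_estimators": 200, "max_depth": 3, "learning_rate": 0.05},
    "LGBM": {"n_estimators": 300, "max_depth": 5, "learning_rate": 0.02},
}


def build_model(name):
    """Create an untrained regressor; each library is imported only on demand."""
    if name not in HYPERPARAMS:
        raise ValueError(f"unknown model name: {name!r}")
    params = HYPERPARAMS[name]
    if name == "RF":
        from sklearn.ensemble import RandomForestRegressor
        return RandomForestRegressor(**params)
    if name == "XGB":
        from xgboost import XGBRegressor
        return XGBRegressor(**params)
    from lightgbm import LGBMRegressor
    return LGBMRegressor(**params)
```

With the corresponding library installed, `build_model("RF")` returns a regressor that can be trained with `fit(X_train, y_train)` and applied with `predict(X_test)`, where the training and test arrays are supplied by the user.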

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lee, S.; Kang, J.; Kim, J.; Baek, W.; Yoon, H. A Study on Developing a Model for Predicting the Compression Index of the South Coast Clay of Korea Using Statistical Analysis and Machine Learning Techniques. Appl. Sci. 2024, 14, 952. https://doi.org/10.3390/app14030952

