*Article* **Measurement-While-Drilling Based Estimation of Dynamic Penetrometer Values Using Decision Trees and Random Forests**

**Eduardo Martínez García <sup>1,2</sup>, Marcos García Alberti <sup>1,</sup>\* and Antonio Alfonso Arcos Álvarez <sup>3</sup>**


**Abstract:** Machine learning is a branch of artificial intelligence (AI) that applies algorithms to extract information from large data sets. These algorithms are especially useful for solving the nonlinear problems that appear frequently in some engineering fields. Geotechnical engineering involves complex relationships among multiple variables, which makes it an ideal field for machine learning techniques. These techniques have already been applied, with a certain degree of success, to determine soil parameters, admissible load, settlement, and slope stability. Dynamic penetrometer tests are very common in geotechnical studies and, in many cases, are used to design the foundation solution. In addition, their continuous nature reveals the variations in the terrain profile. The objective of this study was to correlate the drilling parameters of deep foundation machinery (Measurement-While-Drilling, MWD) with the number of blows of the dynamic penetrometer test. The drilling logs could then be equated with such tests, providing information that a geotechnical engineer can easily interpret and that allows the design hypotheses to be validated. Decision trees and random forest algorithms were used for this purpose. The ability of these algorithms to replicate the complex relationships between drilling parameters and terrain characteristics made it possible to obtain a reliable reproduction of the penetrometric profile of the traversed soil.

**Keywords:** machine learning; decision trees; random forests; penetrometer; MWD; rigid inclusions

#### **1. Introduction**

Machine learning (ML) is a thriving field. As a concept, it is not new (the first decision tree algorithm dates from 1963 [1]), but the current capacity to acquire and handle huge amounts of data has greatly increased its popularity as an analysis tool. ML is a branch of artificial intelligence that applies, through computer programs, a series of algorithms to extract information from large data sets. The term learning is used because these systems are capable of improving and adapting to the information supplied to them.

Perhaps some of the best-known applications are those related to advertising and economics. For example, based on our browsing data, the algorithms can predict our preferences and offer us personalized advertising [2]. ML has also been widely used in medicine, where algorithms support medical diagnosis. By way of example, Scikit-learn, a machine learning library for Python, comes preloaded with small data sets on diabetes, exercise, and breast cancer [3]. In areas such as the industrial sector, the main aim of ML has been to predict when breakdowns will occur [4]. Mining, a field with problems similar to those of civil and geotechnical engineering, has also applied machine learning techniques [5].

**Citation:** García, E.M.; Alberti, M.G.; Arcos Álvarez, A.A. Measurement-While-Drilling Based Estimation of Dynamic Penetrometer Values Using Decision Trees and Random Forests. *Appl. Sci.* **2022**, *12*, 4565. https://doi.org/10.3390/app12094565

Academic Editors: Małgorzata Jastrzębska, Krystyna Kazimierowicz-Frankowska, Gabriele Chiaro and Jaroslaw Rybak

Received: 9 March 2022 Accepted: 27 April 2022 Published: 30 April 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In civil engineering, there have already been important contributions to the monitoring of dams [6], excavations [7], and traffic prediction [8]. However, in certain fields such as geotechnics, studies that combine large amounts of data with ML are still scarce.

Geotechnical engineering is a field whose problems present complex, non-linear relationships among its variables. These problems have traditionally been solved using empirical formulations that approximate the solution with more or less accuracy. ML algorithms have proven to be extremely efficient at analyzing non-linear problems. Therefore, they are an ideal tool for geotechnics. One of the obvious applications is to predict the behavior of the soil from parameters obtained from laboratory tests. This approach achieves higher degrees of accuracy than the traditional analytical formulas [9]. However, the application of ML has not been limited to geotechnical parameters; it has also been used to compute load-bearing capacity, settlement, liquefaction and slope stability [10–12].

Modern geotechnical machinery and equipment carry abundant instrumentation that records a multitude of drilling parameters, such as the drilling speed or the applied torque. This constant collection of data is usually called Measurement-While-Drilling (MWD). This record is an interesting source of information for the application of machine learning techniques.

On another note, continuous dynamic penetration tests consist of driving a penetration element into the ground by striking it with a hammer of a defined weight. The result of the test is the number of blows required to advance a certain length. This type of test has become one of the most attractive alternatives because it offers several advantages:


The data obtained from MWD are in some way related to the characteristics of the terrain. These data are recorded continuously throughout the perforation, so it should be possible to correlate them with the blows of the dynamic penetrometer. However, this relationship is unknown, complex, and non-linear.

In this study, a significant amount of data obtained from the MWD system from deep foundation and ground improvement machinery is analyzed. Figure 1 shows the plan views of the four sites studied. The circles represent the perforations made; the crosses are the penetrometers available. The aim was to correlate said penetration tests with the values obtained from MWD using ML techniques, specifically decision trees and random forests.

The significance of this research lies in the fact that, at the time of writing, it is the first time that this approach has been applied to rigid inclusions execution data.

The input variables were the drilling speed (m/h), rotation speed (rpm), rotation torque (t·m) and thrust (t). The target variable was the number of blows of the dynamic penetrometers.

The results obtained show that the algorithms were able to relate, with a certain degree of precision, the drilling parameters to the blows of the penetrometers. This information makes it possible to increase the level of safety, adapt the execution to the real conditions of the ground and detect possible anomalies, among other things.

The structure of the paper is the following:


**Figure 1.** Distribution of columns (circles) and penetrometers (crosses) available in the analyzed sites. (**a**) Site 1. (**b**) Site 2. (**c**) Site 3. (**d**) Site 4.

#### **2. Related Works**

Geotechnical engineering hosts problems with many uncertainties in which the true relationships among the variables are unknown. This makes it an interesting field for the application of ML algorithms. In this way, several authors have used this approach [10], especially through the application of neural networks (NN).

In the field of layer characterization, Shuku et al. [13] proposed a new method for estimating trends and layer boundaries in depth-dependent soil data based on lasso. This new method (SBLasso) was applied to synthetic data, as well as an actual CPT sounding taken at Texas A&M University and it provided stratification consistent with existing methods.

Zhao and Wang [14] proposed an interpolation-first method to characterize a multilayer soil property profile when measurements within each layer are sparse and limited. This method first interpolates the multilayer soil property profile using measurements from all layers together as input to a Bayesian supervised machine learning method. Then, the interpolated multilayer soil property profile is stratified using an unsupervised machine learning method, e.g., the modified k-means clustering method.

Focusing on the use of MWD, Rai et al. [15] reviewed MWD techniques in the extractive industry, and therefore focused on rock drilling. Kadkhodaie-Ilkhchi et al. [16] conducted a comparative study of three machine learning techniques using MWD data from drilling for explosives at an iron mine in Australia. Other studies on the recognition of rocks using MWD and ML have also been published [17–20].

In the field of soils and civil engineering, Goh [21,22] estimated the load-bearing capacity of driven piles using NN in non-cohesive soils from data collected by Flaate [23] for wooden, precast concrete and steel piles. The input parameters were hammer weight, hammer drop height, hammer type, pile length, pile weight, pile modulus of elasticity, pile cross-sectional area, and pile set. The target value was the load-bearing capacity of the pile. The model was able to obtain high correlation coefficients between the input parameters and the load-bearing capacity for both the train set and the test set. Furthermore, NN were found to perform better than classical formulas when predicting load-bearing capacity.

Lee and Lee [24] also used NN to calculate the load-bearing capacity of piles. Five input variables were used: penetration depth ratio, the average SPT blow count along the axis of the pile, the average SPT blow count near the tip of the pile, the penetration per blow and the hammer energy. The results obtained were compared with the Meyerhof formula [25]. The values obtained through NN correlated better with the measured values than those obtained through the Meyerhof equation.

Although the input parameters did not come from MWD, Teh et al. [26] proposed an NN model to estimate the load-bearing capacity of piles from dynamic pressure wave data. The data came from 37 precast piles from 21 different sites. The target parameters were the values of the soil parameters predicted by the CAPWAP model [27]. This approach to predicting parameter values is similar to the one considered in this article.

In another use of NN, Diaz et al. [28] predicted the rate of penetration during wellbore drilling using data obtained as the drilling progresses within a given well.

Pal and Deswal [29] used Gaussian process (GP) regression to predict the load-bearing capacity of piles. Part of the input data was the same as that used by Goh. The performance of the proposed model was compared with support vector machines (SVM) and empirical relationships, obtaining a better result.

Galende-Hernández et al. [30] carried out a study estimating the value of the RMR from the characterization of the excavation face of a tunnel, using MWD and expert knowledge in the execution of a tunnel using explosives. The results obtained showed a good correlation. Such work posed a similar philosophy to that of this article since it sought to estimate the value of a soil parameter (the RMR) from the drilling parameters.

Although several examples of the combined use of MWD data and ML have been collected, the use of data obtained directly from machinery is relatively scarce. In this proposal, data from construction sites were taken directly from the MWD system and used to estimate the values of the dynamic penetrometer. The objective was for the result at each perforation to be equivalent to having carried out a penetration test there. Although data from rigid inclusions with ground displacement were used, the methodology is general enough to be applicable to other works.

#### **3. MWD in Pile Driving and Soil Improvement Machinery**

The current pile machinery has a series of measuring instruments that allow the collection of performance data. The main information obtained during rotary drilling usually consists of the following [31]: the length of the drilling, drilling speed (*VA*), rotation speed (*VR*) and the hydraulic pressures of the rotary and thrust motors. These data are obtained with a high frequency, for example, every 10 cm of perforation.

#### *3.1. Direct and Indirect Parameters*

The parameters obtained from MWD can be divided into two groups depending on the process of attainment: direct parameters and indirect parameters.

Directly obtained parameters are those in which it is not necessary to apply any type of correlation to the values obtained from the sensors. Depth, rotational speed, and drilling speed can be considered direct parameters as they are obtained directly or as a function of time. Perhaps the most relevant of this type of parameter is the drilling speed as it is very indicative of the hardness of the terrain.

However, other parameters such as thrust force (PO) and rotation torque (CR) are obtained through mechanical correlations from the hydraulic pressure of the motors. These values (although they can be provided by the manufacturer) depend on a multitude of factors: the efficiency of the motor, the configuration of the machinery at any given time, the wear of the components, and so forth. Consequently, it is exceedingly difficult to know the true value of these parameters. These possible sources of error must be considered when analyzing the results obtained when using parameters that do not come from direct measurements. To this type of error attributed to the correlations, the errors and tolerances of the measuring devices must also be added.
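As a minimal sketch of how a parameter such as drilling speed can be obtained "as a function of time", the snippet below differentiates depth records against their timestamps. The function name, the record layout and the sample values are hypothetical and are not taken from the MWD system used in this study.

```python
def drilling_speed_m_per_h(depths_m, times_s):
    """Drilling speed between consecutive records, in m/h.

    depths_m: depths of the tool at each record (m), increasing.
    times_s:  timestamps of the same records (s), increasing.
    """
    speeds = []
    for i in range(1, len(depths_m)):
        d_depth = depths_m[i] - depths_m[i - 1]   # advance in metres
        d_time = times_s[i] - times_s[i - 1]      # elapsed time in seconds
        speeds.append(d_depth / d_time * 3600.0)  # m/s converted to m/h
    return speeds


# Hypothetical records taken every 10 cm of perforation.
depths = [0.0, 0.1, 0.2, 0.3]
times = [0.0, 2.0, 5.0, 11.0]
speeds = drilling_speed_m_per_h(depths, times)
```

In this made-up log the time needed to advance each 10 cm grows, so the computed speed drops, which is the kind of signal that is read as harder ground.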

#### *3.2. Compound Indices and Parameters*

Drilling parameters are related to soil resistance. However, by taking these parameters separately, the relationship is not clear. For this reason, various authors have developed compound parameters that combine individual parameters into energy expressions or empirical indices seeking to show the strength of the ground from the drilling data.

Most of these parameters or compound indices follow the same basic structure: the hardness of the ground is directly proportional to the rotation torque and the applied load and inversely proportional to the drilling area and the drilling speed. These relationships tend to smooth the profile, giving it greater physical meaning and making its interpretation easier. The most common of these indices are listed in Table 1 [32].


**Table 1.** Most used compound parameters.

Of these parameters, penetration resistance, Somerton's index, and drilling specific energy are intended for research in hard rock and soils or for recording parameters during on-site testing. Their application to soft soils, based on MWD data, can be difficult to interpret.

The main obstacle with these formulas is that they assume a form of the relationship between the parameters and the strength that, although logical, does not necessarily correspond to reality. Therefore, these formulas give us a qualitative idea of the resistance to perforation and are useful when the differences in behavior within the soil are clear, but they do not supply precise figures.
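To illustrate the basic structure described above (proportional to thrust and torque, inversely proportional to drilling area and drilling speed), the following sketch evaluates the classical specific energy of drilling in Teale's form, E = F/A + 2πNT/(AV). This is offered as an assumption about what such an index looks like; it is not necessarily the exact expression listed in Table 1. The function name, the tool diameter and the sample values are hypothetical, and the unit conversions assume thrust in tonne-force and torque in tonne-force metres.

```python
import math

G = 9.80665  # m/s^2, standard gravity, converts tonne-force to newtons


def specific_energy_pa(thrust_t, torque_tm, rotation_rpm, speed_m_per_h, diameter_m):
    """Specific energy (J/m^3, i.e. Pa) in Teale's form: F/A + 2*pi*N*T/(A*V)."""
    area = math.pi * diameter_m ** 2 / 4.0   # drilling area, m^2
    thrust_n = thrust_t * 1000.0 * G         # t  -> N
    torque_nm = torque_tm * 1000.0 * G       # t*m -> N*m
    rev_per_s = rotation_rpm / 60.0          # rpm -> rev/s
    speed_m_per_s = speed_m_per_h / 3600.0   # m/h -> m/s
    thrust_term = thrust_n / area
    rotary_term = 2.0 * math.pi * rev_per_s * torque_nm / (area * speed_m_per_s)
    return thrust_term + rotary_term


# Hypothetical readings: same thrust, torque and rotation, different drilling speed.
e_fast = specific_energy_pa(5.0, 3.0, 30.0, 300.0, 0.4)
e_slow = specific_energy_pa(5.0, 3.0, 30.0, 150.0, 0.4)
```

With everything else fixed, halving the drilling speed doubles the rotary term, so the index rises where the tool advances more slowly.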

#### **4. Decision Trees and Random Forests**

#### *4.1. Decision Trees*

Decision trees (DT) are a popular algorithm for classification or regression capable of providing reliable and easily interpretable results.

Although there are a variety of DT algorithms, they all have a similar structure. To be concise, the algorithm consists of dividing the data set into successive subsets following established rules. This process can be represented graphically in structures that resemble a tree (Figure 2).

**Figure 2.** Decision tree example for site 1, limited to 4 leaf nodes.

The success of DTs is explained by several factors that make them quite useful in practice [39]:


#### *4.2. Division Rules*

A division at a node can be understood as a question whose answer divides the set at the node into two or more subsets. These questions or division rules vary depending on the type of decision tree chosen. Once these rules have been applied to each of the variables in the set, they are ordered according to the information they provide, establishing the most useful division as a node.

The application of these criteria defines the shape of the tree.
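As a rough sketch of one possible division rule for regression, the code below tries every candidate threshold on every variable and keeps the split with the lowest weighted mean square error of the two resulting subsets. This CART-style criterion is an assumption chosen for illustration; other tree types use other rules. The function names and the sample data (drilling speed, rotation speed, blows) are hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared errors of the values around their mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return (weighted MSE, feature index, threshold) of the best binary split.

    rows:    list of feature lists, one per sample.
    targets: list of target values, one per sample.
    """
    n = len(targets)
    best = None
    for j in range(len(rows[0])):
        # Sort the samples by the j-th variable, keeping their targets.
        pairs = sorted((row[j], t) for row, t in zip(rows, targets))
        for i in range(1, n):
            if pairs[i][0] == pairs[i - 1][0]:
                continue  # no threshold can separate equal values
            left = [t for _, t in pairs[:i]]
            right = [t for _, t in pairs[i:]]
            score = (sse(left) + sse(right)) / n
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


# Hypothetical samples: [drilling speed (m/h), rotation speed (rpm)] -> blows.
rows = [[400, 30], [380, 31], [350, 29], [120, 30], [100, 31], [90, 29]]
blows = [3, 4, 3, 20, 22, 21]
score, feature, threshold = best_split(rows, blows)
```

In this made-up data the drilling speed separates low and high blow counts cleanly, so it is ranked above the rotation speed and becomes the node.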

#### *4.3. Stop Criteria*

Overfitting refers to the situation where the model fits the training data too closely, leading to larger errors on the test data. To avoid this problem, stopping criteria can be set. These rules can be tuned to obtain trees that are neither too short (too general) nor too long (overfitted).

The tree will stop its development by itself in two cases:


In the previous cases, it is not possible to gain purity in the nodes by continuing the divisions, so the algorithm stops. These nodes that are not divided are called leaf nodes. In addition to these situations, other criteria can be added. The most common approaches are:

• Set a node as a leaf node if it contains fewer than a minimum number of samples.


All these criteria must be defined by the user looking for the most adequate balance. Such balance can be difficult to achieve, which is why specially dedicated models are commonly used at the expense of a greater computational load.

These stopping criteria can be understood as a method of pruning the tree. Specifically, they are pre-pruning methods that are carried out during the growth of the tree.

Another method would be to develop the entire tree and perform the pruning eliminating those nodes that provide the worst results in a different data set than that used to generate the tree. This method usually provides better results than pre-pruning.
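The pre-pruning idea can be sketched with a small one-variable regression tree that stops when a node is already pure, when it holds fewer than a minimum number of samples, or when a maximum depth is reached. This is an independent illustration, not the implementation used in this study; the parameter names `min_samples_split` and `max_depth` are borrowed from common usage, and the data are hypothetical. Post-pruning is not covered by the sketch.

```python
from statistics import fmean


def sse(values):
    """Sum of squared errors of the values around their mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(points, depth=0, max_depth=3, min_samples_split=2):
    """Grow a regression tree on (x, y) pairs with pre-pruning stop criteria."""
    ys = [y for _, y in points]
    leaf = {"value": fmean(ys), "n": len(points)}
    # Natural stop: no purity can be gained (equal targets or equal inputs).
    if len(set(ys)) == 1 or len({x for x, _ in points}) == 1:
        return leaf
    # User-defined stops: too few samples in the node, or maximum depth reached.
    if len(points) < min_samples_split or depth >= max_depth:
        return leaf
    points = sorted(points)
    best = None  # (cost, split position)
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue
        cost = sse([y for _, y in points[:i]]) + sse([y for _, y in points[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    i = best[1]
    return {
        "threshold": (points[i - 1][0] + points[i][0]) / 2.0,
        "left": build_tree(points[:i], depth + 1, max_depth, min_samples_split),
        "right": build_tree(points[i:], depth + 1, max_depth, min_samples_split),
    }


def count_leaves(node):
    """Number of leaf nodes below (and including) this node."""
    if "value" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


# Hypothetical (drilling speed in m/h, blows) pairs.
data = [(90, 21), (100, 22), (120, 20), (200, 10),
        (220, 11), (350, 3), (380, 4), (400, 3)]
full = build_tree(data, max_depth=10, min_samples_split=2)
coarse = build_tree(data, max_depth=10, min_samples_split=4)
shallow = build_tree(data, max_depth=1, min_samples_split=2)
leaves = (count_leaves(full), count_leaves(coarse), count_leaves(shallow))
```

Tightening either criterion reduces the number of leaf nodes; choosing how far to tighten them is the balance discussed above.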

#### *4.4. Assignment Rules*

The assignment rules are the criteria followed to assign a value to a leaf node. In regression problems, the leaf is assigned the value that gives the smallest mean square error, that is, the mean of the target values of the samples in the node. In a classification problem, the value is the most probable category.
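A minimal sketch of these two assignment rules follows. The function names, the blow counts and the class labels are hypothetical; the only assumption beyond the text is the standard result that the mean is the constant that minimizes the mean square error.

```python
from collections import Counter
from statistics import fmean


def regression_leaf_value(targets):
    """Value assigned to a regression leaf: the mean of its targets."""
    return fmean(targets)


def classification_leaf_value(labels):
    """Value assigned to a classification leaf: the most frequent category."""
    return Counter(labels).most_common(1)[0][0]


def mse(targets, value):
    """Mean square error of predicting a single value for all targets."""
    return fmean([(t - value) ** 2 for t in targets])


# Hypothetical blow counts of the samples that fall in one leaf.
blows_in_leaf = [18, 20, 22, 24]
leaf_value = regression_leaf_value(blows_in_leaf)

# Hypothetical categories of the samples in a classification leaf.
leaf_class = classification_leaf_value(["soft", "hard", "hard"])
```

Evaluating `mse(blows_in_leaf, v)` for values of `v` on either side of `leaf_value` gives a larger error, which is why the mean is the assigned value.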

#### *4.5. Implementation of DT in Scikit-Learn*

For the application of DT and RF, the Scikit-learn library was used. Scikit-learn [40] is a Python module that integrates a wide range of machine learning algorithms. It is a package designed for non-experts, focused on ease of use, documentation, and consistency of the application programming interface (API).

The learning is carried out using the fit method together with the train values (in the case of supervised learning, these values would be X\_train and y\_train tables for input variables and target variables to predict, respectively). During the definition of the model, the hyperparameters that control the algorithm are introduced. The following pseudocode shows an example of DT training where the minimum number of samples to divide the node is five.

```