3.1. Neural Network Prediction Model
The neural network regression prediction model is a method of regression analysis and prediction using artificial neural networks. Its basic principle is to transform the input data nonlinearly through a certain number of neurons and to optimize the connection weights of the neurons using a backpropagation algorithm in order to fit and predict the output data. Specifically, the neural network regression prediction model can be represented as a directed acyclic graph, in which the input layer receives the data, the output layer outputs the prediction results, and the intermediate hidden layer is responsible for the nonlinear transformation of the input data. Each neuron receives a certain number of input signals; after weighted summation, the result is nonlinearly transformed by the activation function and output to the next layer of neurons [29].
Assuming that there are $L$ layers of neurons, $x_i$ denotes the input vector of the $i$th sample, $y_i$ denotes the corresponding output, and $\hat{y}_i$ denotes the predicted value of the model for $x_i$, the neural network regression prediction model can be expressed as Formula (3):

$$\hat{y}_i = f_L\big(\cdots f_2\big(f_1(x_i; \theta_1); \theta_2\big) \cdots ; \theta_L\big) \tag{3}$$

where $f_1$ is the output of the first-layer neuron and $\theta_1$ is the parameter set of the first-layer neuron.
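As an illustration of Formula (3), the minimal sketch below (an illustrative assumption using NumPy and random weights, not the authors' code) computes the nested layer-by-layer transformation for a network with one hidden layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x_i = rng.random(6)                      # one sample with 6 input features

# Layer parameters theta_1 = (W1, b1) and theta_2 = (W2, b2), randomly
# initialized here purely for illustration.
W1, b1 = rng.normal(size=(9, 6)), np.zeros(9)
W2, b2 = rng.normal(size=(1, 9)), np.zeros(1)

f1 = np.tanh(W1 @ x_i + b1)              # f1: nonlinear hidden-layer output
y_hat = W2 @ f1 + b2                     # f2: linear output layer = prediction
print(y_hat)
```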
We determine the outer diameter, wall thickness, yield strength, ovality, wall thickness unevenness, and residual stress as the input data and the ultimate deformation load as the output data. The normalized data are divided into a training set (70%), used for model training and parameter tuning, and a testing set (30%), used to evaluate the training effect of the model.
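A minimal sketch of this preprocessing step, assuming scikit-learn tooling and placeholder arrays in place of the measured data (the variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: 6 features per sample (outer diameter, wall thickness,
# yield strength, ovality, wall thickness unevenness, residual stress).
rng = np.random.default_rng(1)
X, y = rng.random((100, 6)), rng.random(100)

# Normalize features and target to [0, 1], then split 70% / 30%.
X_scaled = MinMaxScaler().fit_transform(X)
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_scaled, train_size=0.7, random_state=42)
```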
The study data involve no serial or temporal correlation, the dimensionality of the features to be processed is low, and the task is a regression from known inputs to a predicted output. A feedforward neural network with a single hidden layer is therefore chosen: the model structure is simple, easy to explain and understand, and quick to train while maintaining prediction accuracy. The network structure (input layer–hidden layer–output layer) is shown in Figure 8.
The Levenberg–Marquardt (LM) algorithm is an optimization algorithm mainly used to train feedforward neural networks. It combines the second-order Gauss–Newton method with first-order gradient descent, offering efficient convergence and stability, and it adjusts the learning rate adaptively during training, avoiding the oscillation and divergence problems of the plain gradient descent algorithm.
Neural network hyperparameter tuning mainly involves determining the number of neurons, the learning rate, and the number of iterations. For the learning rate, the LM algorithm is chosen as the training algorithm, which adjusts the learning rate adaptively. For the iteration count, the number of iterations can be set to a relatively large value (1000) and constrained by an error threshold, so that training stops when the training error falls below the threshold. Determining the number of neurons is the critical issue, because the number of neurons directly affects the complexity and learning ability of the model. In general, the number of neurons should be large enough for the model to fully learn the features and patterns in the dataset; however, too many neurons can lead to overfitting or an excessively long training time.
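Scikit-learn provides no LM solver, so a sketch of LM training has to go through a general least-squares routine; the following uses `scipy.optimize.least_squares(method='lm')` on a single-hidden-layer network with placeholder data, an assumption about tooling rather than the authors' implementation:

```python
import numpy as np
from scipy.optimize import least_squares

n_in, n_hidden = 6, 9                   # 6 inputs, 9 hidden neurons
rng = np.random.default_rng(0)
X = rng.random((200, n_in))             # placeholder normalized inputs
y = rng.random(200)                     # placeholder normalized target

def unpack(p):
    """Split the flat parameter vector into weights and biases."""
    i = n_hidden * n_in
    W1 = p[:i].reshape(n_hidden, n_in)
    b1, W2, b2 = p[i:i + n_hidden], p[i + n_hidden:i + 2 * n_hidden], p[-1]
    return W1, b1, W2, b2

def residuals(p):
    W1, b1, W2, b2 = unpack(p)
    h = np.tanh(X @ W1.T + b1)          # hidden-layer nonlinearity
    return h @ W2 + b2 - y              # LM minimizes the squared residuals

p0 = rng.normal(scale=0.1, size=n_hidden * (n_in + 2) + 1)
fit = least_squares(residuals, p0, method='lm', max_nfev=1000)
print('training RMSE:', np.sqrt(np.mean(fit.fun ** 2)))
```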
The number of neurons in the hidden layer is usually determined from the numbers of input and output nodes. A commonly used empirical formula is Formula (4):

$$h = \sqrt{m + n} + a \tag{4}$$

where $h$ is the number of hidden neurons, $m$ is the number of nodes in the input layer, $n$ is the number of nodes in the output layer, and $a$ is a constant between 1 and 10.
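With six input nodes and one output node, this empirical formula yields the trial range used below; a quick hypothetical check:

```python
import math

m, n = 6, 1                              # input and output node counts
lo = math.sqrt(m + n) + 1                # a = 1  -> about 3.6
hi = math.sqrt(m + n) + 10               # a = 10 -> about 12.6
print(round(lo), '~', round(hi))         # 4 ~ 13, the trial-and-error range
```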
The trial-and-error range (4~13, shown in Table 3) is determined by the empirical formula. The neural network model is trained across this range, and the root mean square error (RMSE) and correlation coefficient ($R^2$) are recorded for each training run. Comparing the performance across training runs, the number of hidden layer neurons at which the RMSE is smallest and $R^2$ is closest to one is taken as the optimal parameter of the model.
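A sketch of this trial-and-error search is shown below; since scikit-learn's `MLPRegressor` offers no LM solver, the `lbfgs` solver stands in for it here, and the placeholder arrays are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(2)
X_train, y_train = rng.random((70, 6)), rng.random(70)   # placeholder data
X_test, y_test = rng.random((30, 6)), rng.random(30)

results = {}
for h in range(4, 14):                   # trial range 4~13 from Formula (4)
    net = MLPRegressor(hidden_layer_sizes=(h,), solver='lbfgs',
                       max_iter=1000, random_state=0).fit(X_train, y_train)
    pred = net.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    results[h] = (rmse, r2_score(y_test, pred))

best = min(results, key=lambda h: results[h][0])  # smallest RMSE
print('best neuron count:', best, results[best])
```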
At nine neurons, the RMSE reaches its minimum value of 0.0163 and the $R^2$ value of 0.99328 is the closest to 1; the number of neurons was therefore determined to be nine.
The topology of the finalized neural network prediction model is shown in
Figure 9.
The prediction results of the training set and test set of the neural network prediction model are shown in
Figure 10.
The model reaches the best training effect after eight iterations. After training, the training set, the test set, and the overall dataset are predicted, respectively, and the data are denormalized to scale the predictions distributed between [0, 1] back to the actual value range. The comparison curve between the actual and predicted values is shown in
Figure 11.
The close agreement between the predicted and actual value curves indicates the high accuracy of the current neural network model.
The evaluation metrics for assessing the model prediction results include the root mean square error (RMSE), the coefficient of determination ($R^2$), and the mean relative error (MRE).
The root mean square error (RMSE) is an indicator used to assess the magnitude of prediction errors in regression models. It is the square root of the mean of the squared differences between the predicted and actual values, as given in Formula (5):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} \tag{5}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples. The smaller the RMSE, the better the predictive ability of the model.
$R^2$ is an indicator used to evaluate the goodness of fit of a regression model; it indicates how much of the variation in the target variable can be explained by the model. It is calculated using Formula (6):

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \tag{6}$$

where $SS_{res}$ is the residual sum of squares and $SS_{tot}$ is the total sum of squares, calculated using Formulas (7) and (8):

$$SS_{res} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{7}$$

$$SS_{tot} = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \tag{8}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the true values, and $n$ represents the number of samples.
The value of $R^2$ ranges between 0 and 1; values closer to 1 mean that the model fits the data better and can explain most of the variation in the target variable. When $R^2$ equals 1, the model fits the data completely; when $R^2$ equals 0, the model is unable to explain the variation in the target variable and cannot fit the data.
The mean relative error (MRE), calculated by Formula (9), is a common indicator of the difference between the predicted value and the true value:

$$MRE = \frac{1}{n} \sum_{i=1}^{n} \frac{\left|y_i - \hat{y}_i\right|}{y_i} \tag{9}$$

where $n$ is the number of samples, $y_i$ is the true value of the $i$th sample, and $\hat{y}_i$ represents the corresponding predicted value.
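The three metrics of Formulas (5)–(9) can be computed directly; a small NumPy sketch with hypothetical helper names:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error, Formula (5)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r2(y, y_hat):
    """Coefficient of determination, Formulas (6)-(8)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def mre(y, y_hat):
    """Mean relative error, Formula (9)."""
    return np.mean(np.abs(y - y_hat) / y)
```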
The results of neural network model prediction are shown in
Table 4.
3.2. Random Forest Prediction Model
Random forest is an ensemble learning algorithm based on decision trees, commonly used for classification and regression problems. In regression problems, the random forest model consists of multiple decision trees, each constructed by bootstrap sampling and random feature selection of the training data [30]. For each decision tree, a subset of samples and a subset of features are generated by randomly sampling the original data and randomly selecting the features, and a decision tree model is trained on this sample subset and feature subset. The random sampling and random feature selection reduce the variance of the decision trees.
The prediction function for each decision tree in the random forest regression prediction model is given in Formula (10):

$$f_k(x) = \sum_{j=1}^{M_k} c_{kj}\, I\!\left(x \in R_{kj}\right) \tag{10}$$

where $K$ denotes the number of decision trees ($k = 1, \dots, K$), $c_{kj}$ denotes the constant term in the $k$th decision tree, $R_{kj}$ denotes the leaf node region in the $k$th decision tree, and $I(x \in R_{kj})$ indicates whether sample $x$ falls within the leaf node $R_{kj}$. The final random forest prediction averages the outputs of all trees, as in Formula (11):

$$\hat{f}(x) = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{M_k} c_{kj}\, I\!\left(x \in R_{kj}\right) \tag{11}$$

where $M_k$ denotes the number of leaf nodes in the $k$th decision tree and $c_{kj}$ denotes the constant term of the $j$th leaf node in the $k$th decision tree.
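The averaging in Formula (11) can be observed directly in scikit-learn, whose `estimators_` attribute exposes the individual trees; a brief sketch with placeholder data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X, y = rng.random((100, 6)), rng.random(100)     # placeholder data
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The mean of the K individual tree predictions equals the forest prediction.
per_tree = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)
assert np.allclose(per_tree, rf.predict(X))
```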
The decision tree is the basic unit of random forest construction. The outer diameter, wall thickness, yield strength, ovality, wall thickness unevenness, and residual stress are taken as the decision tree features, and the ultimate deformation load as the prediction target. Multiple decision trees are then constructed from randomly selected features and sample data to build the random forest.
The ranges and intervals of the number of trees in the random forest (n_estimators), the maximum depth of each tree (max_depth), and the minimum number of leaf node samples (min_samples_leaf) are specified, and the parameter combinations within these ranges form a "grid". A random forest regression model is constructed with each combination of parameters, and the performance of the model on the validation set is evaluated, usually using metrics such as the root mean square error (RMSE) or correlation coefficient ($R^2$). The optimal combination of parameters is selected based on model performance.
The number of trees (n_estimators) is determined to be 100, the maximum depth of each tree (max_depth) is five, and the minimum number of leaf node samples (min_samples_leaf) is five according to the grid search algorithm.
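A sketch of this grid search using scikit-learn's `GridSearchCV` follows; the grid values and placeholder data are assumptions, as the text reports only the selected parameters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X_train, y_train = rng.random((70, 6)), rng.random(70)  # placeholder data

param_grid = {                       # assumed ranges forming the "grid"
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'min_samples_leaf': [1, 3, 5],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)           # e.g. n_estimators=100, max_depth=5, ...
```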
With the model trained using the optimal parameters, the error variation curve with the number of decision trees is shown in
Figure 12.
After training, the training set, the test set, and the overall dataset are predicted, respectively, and the data are denormalized to scale the predictions distributed between [0, 1] back to the actual value range. The comparison curve between the actual and predicted values is shown in
Figure 13.
The comparison curves of the prediction results, together with the metrics listed in Table 5, show that the random forest prediction model has higher prediction accuracy for sample points where the data are more concentrated, but larger errors for outlier points with larger sample values.
3.3. Support Vector Machine Prediction Model
The support vector machine (SVM) is a machine-learning algorithm widely used in the fields of classification, regression, and anomaly detection. In regression problems, an SVM can be used to fit a nonlinear function to describe the relationship between input variables (independent variables) and output variables (dependent variables).
The regression problem of an SVM can be transformed into a convex quadratic programming problem that minimizes the model complexity while keeping the prediction error within a tolerance. Specifically, given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the independent variable and $y_i$ is the dependent variable, the goal of the SVM regression model is to find a function $f(x) = w^{T}x + b$, where $w$ is the weight vector and $b$ is the bias, such that for all $x_i$ the error $|f(x_i) - y_i|$ is less than a given tolerance $\varepsilon$, while minimizing the complexity of the model [31].
The objective function of the SVM regression model can be expressed as Formula (12):

$$\min_{w,\, b,\, \xi,\, \xi^{*}} \; \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^{*}\right) \tag{12}$$

where $\xi_i$ and $\xi_i^{*}$ are slack variables for the non-separable case and $C$ is a regularization parameter that controls the complexity of the model. This objective function must also satisfy the constraints of Formula (13):

$$\begin{cases} y_i - \left(w^{T} x_i + b\right) \le \varepsilon + \xi_i \\ \left(w^{T} x_i + b\right) - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} \ge 0 \end{cases} \tag{13}$$

where the first constraint indicates that $f(x_i)$ is greater than or equal to $y_i - \varepsilon - \xi_i$, and the second constraint indicates that $f(x_i)$ is less than or equal to $y_i + \varepsilon + \xi_i^{*}$. Together, these constraints require the training sample points to lie within the $\varepsilon$-insensitive band, relaxed by the slack variables.
After solving the above convex quadratic programming problem to obtain the weight vector and bias, the predicted values of the SVM regression model are given by Formula (14):

$$f(x) = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^{*}\right) K\!\left(x_i, x\right) + b \tag{14}$$

where $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers obtained from the dual problem and $K(\cdot, \cdot)$ is the kernel function.
The advantages of SVM regression models are their ability to handle high-dimensional and nonlinear data and their robustness to noise and outliers. The disadvantages are the large amount of computation and storage space required and the difficulty of parameter selection and tuning.
Hyperparameter determination for the support vector machine regression prediction model usually includes the selection of the kernel function and the determination of the penalty factor. A kernel function maps the original data to a high-dimensional space and is used to deal with nonlinear problems; in an SVM, it is used to construct classifiers or regressors that transform nonlinear problems into linear ones.
The radial basis function (RBF) kernel is the most commonly used kernel function. It has smooth nonlinear characteristics, handles nonlinear problems well, and offers good robustness and adaptability. Its expression is:

$$K\!\left(x_i, x_j\right) = \exp\!\left(-\gamma \left\|x_i - x_j\right\|^{2}\right)$$

where $\gamma$ is a parameter that determines the rate of change in the function, also known as the bandwidth.
After determining the RBF as the kernel function, it is necessary to further determine the parameter $\gamma$ of the radial basis kernel function and the penalty factor $C$. The penalty factor $C$ is a hyperparameter used to control the complexity of the model: the larger $C$ is, the larger the penalty on misclassified points and the more complex the model; the smaller $C$ is, the smaller the penalty and the simpler the model. Therefore, the value of $C$ needs to be optimized in model selection. The parameter $\gamma$ controls the rate of change in the radial basis kernel function: when $\gamma$ is larger, the value of the kernel function decreases rapidly as the distance between points increases, and the decision boundary becomes more complicated; when $\gamma$ is smaller, the kernel value decreases more slowly and the decision boundary becomes smoother. In the model training process, the optimal $\gamma$ and $C$ values are selected by cross-validation. The values $\gamma = 0.25$ and $C = 11.3137$ were determined by double-loop five-fold cross-validation.
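A sketch of this cross-validated selection using scikit-learn's `SVR` with an RBF kernel follows; the power-of-two grid and placeholder data are assumptions, chosen to be consistent with the reported values ($0.25 = 2^{-2}$ and $11.3137 \approx 2^{3.5}$):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X_train, y_train = rng.random((70, 6)), rng.random(70)  # placeholder data

param_grid = {                         # assumed power-of-two search grid
    'C': 2.0 ** np.arange(-4, 8, 0.5),
    'gamma': 2.0 ** np.arange(-6, 4, 0.5),
}
search = GridSearchCV(SVR(kernel='rbf'), param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)             # e.g. gamma=0.25, C=11.3137 (2**3.5)
```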
After training, the training set, the test set, and the overall dataset are predicted, respectively, and the data are denormalized to scale the predictions distributed between [0, 1] back to the actual value range. The comparison curve between the actual and predicted values is shown in
Figure 14.
The comparison curves of the prediction results, together with the metrics listed in Table 6, show that the support vector machine regression model achieves high accuracy on the training set, the test set, and the overall data.
3.4. Comparative Analysis of Three Prediction Models and API Formulas
The comparison curves of the three models' prediction results, the values calculated by the API formula (ISO 10400), and the measured values of casing collapse strength are plotted as follows. Because the sample size used for model training is relatively small, the results for the training set, the test set, and the overall data are shown in their entirety. The model evaluation is mainly based on the test set samples, which were not involved in model training.
Three casing ultimate deformation load regression prediction models, neural network, random forest, and support vector machine, were constructed based on the measured data, and all three machine-learning prediction models have high prediction accuracy. From Figure 15, the best prediction model was selected based on the root mean square error (RMSE), correlation coefficient ($R^2$), and mean relative error (MRE) as the evaluation indexes.
The comparison curves of the actual values, predicted values, and formula-calculated values show that all three machine-learning prediction models achieve good prediction results, while the API calculation formula has a larger error. This indicates that machine-learning algorithms have an advantage over the traditional API formula in predicting casing collapse strength. The differences between the three prediction models and the calculation formula are further quantified by specific evaluation indexes to determine the best prediction model. To exclude interference from the training data, the evaluation metrics were computed on the test set data, which were not used in training.
The prediction results of the three models, neural network, support vector machine, and random forest, are summarized and compared in Table 7. The accuracy (ACC) is calculated by Formula (15):

$$ACC = \left(1 - \frac{1}{n} \sum_{i=1}^{n} \frac{\left|y_i - \hat{y}_i\right|}{y_i}\right) \times 100\% \tag{15}$$
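Assuming ACC is the complement of the mean relative error, consistent with Formula (15) as given above, it can be computed as follows (hypothetical helper):

```python
import numpy as np

def acc(y, y_hat):
    """Average prediction accuracy, Formula (15): 100% minus the MRE."""
    return (1.0 - np.mean(np.abs(y - y_hat) / y)) * 100.0
```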
Among the three machine-learning prediction models, the neural network prediction model performs best: its correlation coefficient of 0.9733 is the closest to 1 and its root mean square error of 0.0267 is the smallest. Comparing the actual and predicted values of the collapse strength, the average prediction accuracy of the traditional API calculation formula is only 63.3%, while the three machine-learning prediction models are more accurate, with the neural network prediction model reaching an average prediction accuracy of 92.2%.