1. Introduction
As a crucial replacement energy source in China’s “carbon peak and carbon neutrality” strategy, the exploration and development of deep and ultra-deep natural gas resources has become a primary focus of the Chinese petroleum industry [1,2]. The complex geological conditions of deep gas resources make it challenging to detect downhole incidents, such as gas influx and kicks, in a timely and accurate manner during drilling operations, which leads to high well-control risk [3]. Statistical data from the drilling of 37 ultra-deep natural gas wells (well depth > 6000 m) in the SY block of the Sichuan Basin show that a total of 113 kick events have occurred, with an average non-productive time of 137.2 h per well due to handling these kicks. The primary reason is that there is an abnormally high-pressure section in the Ziliujing and Xujiahe Formations (3300–4800 m) corresponding to the φ333.4 mm borehole. After a kick occurs, the invasion of formation fluid into the wellbore is reflected in continuous and subtle changes in surface drilling parameters, including the outflow rate, rate of penetration (ROP), and pit gain. Even highly experienced on-site engineers may struggle to identify kicks accurately from such minute parameter changes and issue timely warnings. Therefore, monitoring kicks under complex geological conditions has long been a significant challenge in drilling engineering and a pressing issue that needs to be resolved, especially in terms of ensuring well control safety in deep and ultra-deep natural gas drilling [4].
Traditional kick detection methods determine whether a kick has occurred based on parameter changes measured by on-site equipment. These methods include monitoring the mud pit level, flow meter monitoring, casing pressure monitoring, acoustic interference detectors, ultrasonic Doppler gas influx monitors, and acoustic impedance monitors, among others [5,6,7]. In 1987, Orban et al. [8] installed a flow sensor on the triplex pump and set 1.5 L/s as the critical flow rate difference for detecting kicks, using the flow difference between the inlet and outlet of the drilling fluid to monitor kicks. In 1991, Bryant et al. [9] used acoustic responses from logging-while-drilling tools to detect gas influx. They conducted over 40 tests in water-based and oil-based muds, finding that the accuracy of acoustic kick detection depended on factors such as the drilling fluid flow rate, fluid type, and instrument response frequency. In 2003, Helio et al. [10] used bottomhole measurement tools to monitor downhole parameters and installed sensors to measure flow rate, density, and temperature at the surface, providing reference data for on-site engineers to judge kicks. In 2015, Fu et al. [11] proposed a kick detection method using ultrasonic devices to measure annular flow velocity, with 15 MPa pressure tests verifying the reliability of ultrasonic sensors below the mudline. In 2021, Gu et al. [12] designed a gas influx monitoring device based on Doppler ultrasonic propagation principles, optimizing the position of the Doppler probe. Experiments revealed the variation in ultrasonic waves with changes in the gas content of the fluid, providing theoretical guidance for kick detection in deepwater drilling. Despite some advances in traditional kick detection techniques, drilling still relies heavily on manual supervision and the field experience of operators, resulting in limited timeliness and accuracy in detecting kicks. Improving the accuracy of kick detection during drilling and reducing false alarms and missed detections remain primary topics in drilling engineering. Machine learning technologies offer the potential to accelerate the transition from manual monitoring to intelligent warning systems for kick detection.
As a crucial subset of artificial intelligence (AI), machine learning technology, which uses data-driven algorithms to automatically execute specific tasks, has been utilized for over half a century and is expected to provide intelligent solutions for complex drilling problems [13,14,15]. With the continuous development of AI technology, kick detection methods based on machine learning have also emerged. In 2001, David et al. [16] developed a kick detection system based on the Bayesian algorithm. By analyzing and processing large amounts of historical drilling data from both normal operations and kick events, they established models to differentiate between normal conditions and kick occurrences. In 2010, Mohammadreza et al. [17], recognizing the limitations of static neural networks for kick detection, proposed a method using dynamic neural networks for this task. They trained their model using drilling data from four wells in three different blocks in Iran where kicks had occurred. In 2018, Raed et al. [18] developed an automated system for monitoring kicks during drilling. They used surface drilling parameters (such as hook load, ROP, torque, pump rate, and weight on bit (WOB)) to train and optimize five models: decision trees, k-nearest neighbors (KNN), sequential minimal optimization (SMO), artificial neural networks (ANN), and Bayesian networks, thereby finding that the decision tree and KNN models performed best.
In 2019, Yin et al. [19] proposed a kick detection method based on the autoregressive integrated moving average (ARIMA). By predicting changes in the total pit volume before shutting in the well, they assessed the severity of the kick. The test results showed that this method had high accuracy in predicting kick volume over short time steps. In 2020, Augustine et al. [20] introduced a data-driven kick detection method, using the d-exponent and riser pressure as inputs. They employed a long short-term memory recurrent neural network (LSTM-RNN) to capture the relationship between input time series data and kick events. Nhat et al. [21], using the data from simulated kicks in laboratory experiments, analyzed the impact of kicks on downhole parameters. They introduced a data-driven Bayesian network to identify kick events, with no false positives or missed kicks being reported in model testing. Liang et al. [22] proposed a remote monitoring platform for kicks and developed a kick identification model based on a bat-optimized random forest algorithm. This model optimized the parameter combinations and demonstrated high prediction accuracy for kicks. Arunthavanathan et al. [23] presented a kick detection method based on convolutional neural networks (CNN), LSTM, and unsupervised support vector machines. They monitored kicks by predicting system parameters identified from future sampling windows.
In 2022, Kopbayev et al. [24] combined CNN with bidirectional long short-term memory (Bi-LSTM) networks to construct a monitoring model for wellbore leaks and kicks. They trained and tested their model using sequence curves generated from open-source simulation data, successfully identifying the kicks and classifying their severity. In 2023, Xing et al. [25] proposed a kick detection model framework that established an operating condition interaction classification model based on maximizing the use of limited kick data. To improve the timeliness of kick warnings, Zhang et al. [26] introduced a hierarchical kick detection method using cascaded gated recurrent unit (GRU) networks. In this method, the GRU served as the fundamental unit for monitoring abnormal parameter changes. The hierarchical kick warning model assessed the risk of kicks based on the number of abnormal parameters at different times. Testing with the data from 22 wells showed that this method achieved correct classifications from low to high risk, improving kick detection accuracy by 5.88% compared to traditional GRU models. Xu et al. [27] proposed a pattern recognition-based kick detection method for offshore drilling by integrating multiphase flow, data filtering, pattern recognition, and Bayesian networks. This method combined computational technology with pattern recognition algorithms, allowing for the effective monitoring of gas influx based on the shape and fluctuation characteristics of curves, even when using a single parameter.
Compared to traditional models, machine learning-based data-driven models offer advantages such as flexible model inputs, higher prediction accuracy, and the ability to uncover hidden patterns, and they are therefore widely used in kick detection. During drilling, kicks are rare events, and the number of kick samples in real-world drilling datasets is far smaller than the number of normal drilling samples. Therefore, when a classification algorithm is utilized, kick detection is a binary classification problem involving imbalanced small sample data. Most existing intelligent kick detection models do not address the issues of data imbalance and small sample sizes; they rely on algorithms built on the assumption of balanced sample sizes in the dataset, which may lead to lower accuracy in kick detection. Improving model performance with a limited number of real kick samples remains an unresolved engineering challenge, and solving this issue is crucial for achieving efficient and intelligent kick detection. Methods for addressing imbalanced small sample datasets include undersampling, oversampling, mixed sampling, and algorithm-level approaches such as one-class learning, ensemble learning, and cost-sensitive learning [28]. With the continuous advancement of artificial intelligence technology, researchers have been inspired by zero-sum game theory to propose generative adversarial networks (GANs) and their improved models [29,30,31] for generating artificial data that match the distribution of existing samples. GANs have been successfully applied to tasks such as image completion and data augmentation.
To address the challenge that insufficient real kick data leads to difficulties in training intelligent models, as well as poor model accuracy and generalization capability, this paper constructs an improved intelligent kick detection model for detecting kicks during ultra-deep well drilling in the Sichuan Basin. Considering the time-series characteristics of surface drilling parameters after a kick, the model uses TimeGAN [32] to generate synthetic kick samples. This approach improves the sample imbalance ratio of the original real drilling dataset, increases sample diversity, and mitigates issues related to model overfitting and weak generalization. Subsequently, the LSTM algorithm is employed to extract the multidimensional time-series features of the surface drilling parameters, which are then input into an MLP model to distinguish kick from normal drilling conditions, enabling intelligent kick detection. The model is then trained and tested using real drilling data from ultra-deep wells in the SY block of the Sichuan Basin. The effects of the k-fold setup, imbalanced data processing methods, and dataset imbalance ratios on model performance are analyzed. Ablation experiments are conducted to evaluate the contribution of each module to the model’s kick detection capability. Finally, the trained model is applied to new drilling operations in the field.
2. Materials and Method
2.1. Framework of the Intelligent Kick Detection Model
To address the challenges of binary classification under imbalanced small sample conditions, this paper constructs an intelligent kick detection model based on TimeGAN for kick time-series data augmentation, LSTM for multidimensional time-series feature extraction, and MLP for downhole condition classification. The model framework is shown in Figure 1. The model consists of three main components: the sample augmentation module (TimeGAN), the feature extraction module (LSTM), and the condition classification module (MLP). The model’s workflow is illustrated in Figure 2. First, we preprocess the real drilling data collected from the field to obtain the original dataset M(a). Then, we divide the original dataset into a kick dataset N_P and a normal drilling dataset N_N. Based on the real kick dataset N_P, we generate a certain amount of artificial kick samples using TimeGAN, forming the augmented kick dataset N(C). After mixing the kick dataset N_P, the normal drilling dataset N_N, and the augmented kick dataset N(C), we feed the mixed dataset into the feature extraction module. We utilize LSTM networks, which excel at handling time series problems, for feature extraction and dimensionality reduction of the mixed dataset, to obtain the deep data features of the surface multidimensional parameters. Finally, we use cross-validation to divide the mixed dataset into training and testing datasets and input the training dataset into the MLP for classification training. After training, we test the model’s performance using the test dataset. Throughout this process, a k-fold cross-validation strategy is applied.
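The workflow in Figure 2 can be summarized in a short script. The sketch below is illustrative only: preprocess, TimeGANAugmenter, LSTMFeatureExtractor, and build_mlp are hypothetical stand-ins for the three modules described above, not the authors’ actual implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical helpers standing in for the modules in Figures 1 and 2.
from kick_pipeline import preprocess, TimeGANAugmenter, LSTMFeatureExtractor, build_mlp

# 1. Preprocess raw field data into the original dataset M(a):
#    windows of shape (n_samples, 600, 8) with labels (1 = kick, 0 = normal).
X, y = preprocess("raw_field_logs.csv")
X_kick, X_normal = X[y == 1], X[y == 0]           # N_P and N_N

# 2. Augment the kick samples with TimeGAN to form N(C) and balance the mix.
augmenter = TimeGANAugmenter(seq_len=600, n_features=8)
augmenter.fit(X_kick)
X_synth = augmenter.sample(len(X_normal) - len(X_kick))

X_mix = np.concatenate([X_kick, X_synth, X_normal])
y_mix = np.concatenate([np.ones(len(X_kick) + len(X_synth)), np.zeros(len(X_normal))])

# 3. Extract multidimensional time-series features with LSTM encoders,
#    then train and test the MLP classifier under k-fold cross-validation.
features = LSTMFeatureExtractor(window_sizes=(3, 5, 7, 9)).transform(X_mix)
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True).split(features, y_mix):
    mlp = build_mlp(input_dim=features.shape[1])
    mlp.fit(features[train_idx], y_mix[train_idx])
    print(mlp.score(features[test_idx], y_mix[test_idx]))
```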
2.2. Kick Characterization Parameters and Data Preprocessing
Machine learning systems make decisions and predictions by learning patterns and regularities from data. During actual drilling operations, due to complex formation conditions and measurement noise and errors, relying solely on a single factor (such as the difference in inlet and outlet flow rates or the increase in pit volume) to detect kicks may result in low model reliability and accuracy. To improve the prediction accuracy of the model, this paper combines field experience and previous research results [15,16,17,18,19,20,21,22,23,24,25,26,27,28] to select eight surface drilling parameters that are closely related to kicks (Table 1) as the feature (input) parameters for the model. These parameters include the difference in inlet and outlet flow rates, the difference in inlet and outlet temperatures, standpipe pressure (SPP), and the total volume of the mud pit. Drilling time refers to the time the bit needs to penetrate 1 m, which is inversely proportional to the ROP. After a kick occurs, the ROP increases because of reduced bottomhole pressure, and the drilling time therefore decreases. This model utilizes multidimensional time series data on surface drilling parameters to conduct kick detection.
To make the real drilling data more suitable for machine learning, appropriate data preprocessing methods need to be selected. Since the raw drilling data collected are extensive and the parameters have different dimensions, data quality needs to be improved to enhance the correlation between the data and the kick events, reduce the difficulty of kick detection, and improve prediction accuracy.
The preprocessing methods include the following: (1) Data cleaning: we remove outliers and missing values to ensure the data are clean and reliable. (2) Noise reduction and smoothing: we apply the Savitzky–Golay filtering algorithm to reduce noise and smooth the data. (3) Normalization: we normalize the data to ensure consistency and improve model performance. The min-max normalization function used in this paper is as follows:

$$x' = \frac{x - \mathrm{Min}(x)}{\mathrm{Max}(x) - \mathrm{Min}(x)}$$

where x represents the original data, Max(x) is the maximum value in the sample, Min(x) is the minimum value in the sample, and x′ represents the normalized data.
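As an illustration of steps (2) and (3), the following sketch applies Savitzky–Golay smoothing and min-max normalization to a single 600 × 8 window using SciPy and NumPy; the filter window length and polynomial order are illustrative choices rather than the values used in this study.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess_window(window: np.ndarray) -> np.ndarray:
    """Denoise and normalize one 600 x 8 window of surface drilling parameters.

    window: raw time series, shape (600, 8), one column per feature in Table 1.
    """
    # (2) Savitzky-Golay smoothing along the time axis (window length and
    #     polynomial order are illustrative choices, not the paper's values).
    smoothed = savgol_filter(window, window_length=21, polyorder=3, axis=0)

    # (3) Min-max normalization per feature, guarding against constant columns.
    col_min = smoothed.min(axis=0)
    col_max = smoothed.max(axis=0)
    return (smoothed - col_min) / np.maximum(col_max - col_min, 1e-12)
```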
In real-world drilling processes, kicks are rare events, so the number of kick samples in the real drilling data is much smaller than the number of normal drilling samples, leading to a severe imbalance between kick data and normal drilling data. Kick detection is a binary classification problem under imbalanced small sample conditions and faces the following challenges:
(1) Extreme imbalance in terms of sample quantity. The number of normal drilling samples far exceeds the number of kick samples. This extreme imbalance can cause the model to focus on the more abundant category during training, neglecting the less frequent category [15,28]. As a result, the trained model may incorrectly classify severe kick events as normal drilling conditions.
(2) Difficulty in extracting multidimensional time-series features due to an imbalance in sample sizes [33,34,35]. The time series of wellhead parameters have both high dimensionality and long sequences of non-kick data relative to the limited kick samples. This imbalance makes it challenging for an intelligent model to fully learn the deep temporal features associated specifically with kick events. This difficulty in effectively extracting these time-series features further hinders the model’s ability to accurately identify kicks.
2.3. Time Series Data Augmentation of Kick Using TimeGAN
TimeGAN combines the flexibility of unsupervised learning with the control of supervised training, allowing for more precise dynamic adjustments of the model. TimeGAN is a derivative of GAN, which consists of two neural networks: a generator and a discriminator. During training, the discriminator network is trained to maximize the objective function (Equation (1)), while the generator network is trained to minimize it:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \quad (1)$$

The generator and discriminator networks are optimized alternately until the entire training process is complete. After a certain number of iterations and updates, the output of the discriminator D for the artificially generated data converges to 1/2, indicating that the generated data closely match the distribution of the real data. In this formula, P_z(z) represents the distribution of the random noise z; P_data(x) represents the distribution of the real sample data x; G(z) refers to the samples generated by the generator network; and D(x) represents the probability that the sample is a real data sample.
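The following minimal NumPy snippet illustrates how the value function in Equation (1) is estimated from a batch of discriminator outputs; it is a numerical illustration only, not part of the TimeGAN implementation.

```python
import numpy as np

def gan_objective(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Monte-Carlo estimate of the value function in Equation (1).

    d_real: discriminator outputs D(x) on a batch of real samples, in (0, 1).
    d_fake: discriminator outputs D(G(z)) on a batch of generated samples.
    """
    eps = 1e-12  # numerical guard against log(0)
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

# At the adversarial equilibrium D(x) -> 1/2 for both real and generated data,
# so the estimate approaches 2 * log(0.5) = -1.386.
print(gan_objective(np.full(64, 0.5), np.full(64, 0.5)))
```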
In addition to the adversarial module of a traditional GAN, TimeGAN incorporates an autoencoder module (an embedding network and a recovery network), which enables a reversible mapping between the feature space and the latent space. The embedding and recovery functions can reconstruct the data $(\tilde{S}, \tilde{X}_{1:T})$ from the hidden features $(h_S, h_{1:T})$ of the original data $(S, X_{1:T})$. Therefore, the reconstruction loss added to the adversarial objective (Equation (1)) is expressed as follows:

$$L_R = \mathbb{E}_{S, X_{1:T} \sim p}\left[\lVert S - \tilde{S} \rVert_2 + \sum_t \lVert X_t - \tilde{X}_t \rVert_2\right]$$
The training process of the TimeGAN network is illustrated in Figure 3. In this figure, the solid lines represent the forward propagation of data, while the dashed lines represent the backpropagation of the loss gradients. The symbols e, r, g, and d denote the embedding network, recovery network, generator network, and discriminator network, respectively. The terms (S, X1:T), (hS, h1:T), and $(\hat{h}_S, \hat{h}_{1:T})$ represent the real time series, the latent time series, and the generated time series, respectively; (ZS, Z1:T) refers to random vectors; $(\tilde{S}, \tilde{X}_{1:T})$ is the reconstructed time series used to compute the reconstruction loss $L_R$; $L_S$ is the supervised training loss backpropagated to the generator; $L_U$ is the unsupervised adversarial loss, from which the generator and discriminator network losses are derived; $\hat{h}$ and $\hat{y}$ represent the generated latent features and the classification score from the discriminator network; and θe, θr, θg, and θd are the parameters of the embedding function, recovery function, sequence generator network, and sequence discriminator network.
The training framework of the TimeGAN network consists of three main parts: (1) training the autoencoder (embedding network and recovery network) with the given sequence data for optimized reconstruction; (2) supervising the training using real sequence data to capture historical patterns; (3) simultaneously training the four networks by minimizing the loss functions.
In the process of training the TimeGAN, the generator network receives two types of inputs. When operating in fully open-loop mode, to better generate the next latent vector $\hat{h}_t$, the autoregressive generator network accepts the latent features $(\hat{h}_S, \hat{h}_{1:t-1})$ from its own synthetic embedding process. Then, the gradient is computed through the unsupervised loss to further improve the discrimination between the real latent data (hS, h1:T) and the data generated by the generator network $(\hat{h}_S, \hat{h}_{1:T})$. The unsupervised loss expression is as follows:

$$L_U = \mathbb{E}_{S, X_{1:T} \sim p}\left[\log y_S + \sum_t \log y_t\right] + \mathbb{E}_{S, X_{1:T} \sim \hat{p}}\left[\log(1 - \hat{y}_S) + \sum_t \log(1 - \hat{y}_t)\right]$$

where $y_S$ and $y_t$ are the discriminator scores for the real latent codes, and $\hat{y}_S$ and $\hat{y}_t$ are the scores for the generated latent codes.
Relying solely on the binary adversarial feedback from the GAN’s discriminator network is insufficient to fully motivate the generator network to capture the conditional distribution of the real sample data. Therefore, TimeGAN introduces additional losses to constrain the training process, performing alternating training in a closed-loop mode. The input to the generator network is the embedded sequence data $h_{1:t-1}$ calculated by the embedding function, from which the generator produces the next latent vector $\hat{h}_t$. Here, the gradient is calculated using maximum likelihood estimation to compute the supervised loss, which captures the discrepancy between the distributions $p(h_t \mid h_{1:t-1})$ and $\hat{p}(h_t \mid h_{1:t-1})$. The mathematical model for this supervised loss is as follows:

$$L_S = \mathbb{E}_{S, X_{1:T} \sim p}\left[\sum_t \lVert h_t - g(h_S, h_{t-1}, z_t) \rVert_2\right] \quad (5)$$

In Equation (5), the expectation over the real data distribution is approximated by stochastic gradient descent over randomly sampled sequences. At any stage during model training, the latent vector produced by the embedding function and the latent vector generated by the generator network from the historical latent sequence are compared, to highlight the difference between the actual next latent vector and the generated one. This ensures that while the unsupervised loss encourages the generator network to produce realistic sequences, the supervised loss simultaneously guarantees that the model reproduces the correct transitions between consecutive latent vectors, facilitating the smoother and more accurate generation of time series data.
The main process of kick data augmentation using the TimeGAN algorithm is as follows: (1) Collect and organize real kick data from actual drilling operations. (2) Set the model training parameters and input the organized real kick sample data into the TimeGAN for training. (3) Minimize the reconstruction, supervised, and unsupervised losses, thereby capturing the temporal characteristics of the kick data and generating random synthetic kick data.
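A compact sketch of this augmentation workflow is shown below. The TimeGAN wrapper class, its constructor arguments, and the file names are hypothetical stand-ins (e.g., for the reference implementation accompanying [32]); only the overall fit-then-sample pattern is intended.

```python
import numpy as np

# Hypothetical wrapper around a TimeGAN implementation; the class name,
# arguments, and file names are illustrative only.
from timegan import TimeGAN

# N_P: the 96 preprocessed real kick windows, shape (96, 600, 8).
real_kicks = np.load("kick_windows.npy")

model = TimeGAN(seq_len=600, n_features=8, hidden_dim=24, epochs=20_000)
model.fit(real_kicks)                    # jointly minimizes the reconstruction,
                                         # supervised, and unsupervised losses

# Generate synthetic kick windows N(C), e.g., enough to reach an imbalance ratio of 1.
synthetic_kicks = model.sample(n_samples=1104 - 96)
np.save("synthetic_kick_windows.npy", synthetic_kicks)
```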
2.4. Feature Extraction of Surface Multivariate Time Series Data
The LSTM neural network is a special type of recurrent neural network (RNN) [36]. While RNNs tend to suffer from vanishing and exploding gradients when handling long-sequence problems, LSTM networks, with their complex gating mechanisms and stronger memory capacity, are better at controlling and filtering the flow of information, helping to capture important features within sequence data. A standard LSTM unit consists of a memory cell, an input gate, an output gate, and a forget gate. The standard LSTM structure is shown in Figure 4.
LSTM stores the temporal correlations of time series data in memory cells for processing. The expressions for each neuron in an LSTM unit are as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

where $x_t$ is the input to the LSTM unit; $h_t$ is the hidden layer vector; $C_t$ is the memory cell state and $\tilde{C}_t$ is the candidate cell state; $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates; $W_f$, $W_i$, and $W_o$ are the weight matrices and $b_f$, $b_i$, and $b_o$ are the biases of the forget gate, input gate, and output gate, respectively; $W_C$ and $b_C$ are the weight matrix and bias of the candidate cell state; and $\sigma$ and $\tanh$ represent the activation functions.
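For clarity, the gate equations above can be written directly in NumPy. The following single-step function is a didactic sketch (weight shapes and dictionary keys are illustrative), not the LSTM layer used in the model.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step implementing the gate equations above.

    x_t: input vector (the eight surface parameters at time t);
    h_prev, c_prev: previous hidden and cell states;
    W, b: dicts of weight matrices / biases keyed by 'f', 'i', 'o', 'c',
          each W of shape (hidden_dim + input_dim, hidden_dim).
    """
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(z @ W["f"] + b["f"])             # forget gate
    i_t = sigmoid(z @ W["i"] + b["i"])             # input gate
    o_t = sigmoid(z @ W["o"] + b["o"])             # output gate
    c_hat = np.tanh(z @ W["c"] + b["c"])           # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat               # new cell state
    h_t = o_t * np.tanh(c_t)                       # new hidden state
    return h_t, c_t
```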
The time-series feature extraction model for surface drilling parameters based on the LSTM network, as shown in Figure 5, transforms the original time series data from an $N_L \times 8$ matrix into a vector of dimension $1 \times d_L$. This representation captures the variation patterns of the surface drilling parameters while reducing the dimensionality of the dataset (i.e., decreasing the number of feature parameters input into the subsequent modules). The model employs four sliding windows of different sizes (with time steps of 3, 5, 7, and 9, respectively) to extract features from the real drilling time series data. These capture multivariate time-series features over different time ranges, resulting in four corresponding time-step-based time-series feature datasets N1, N2, N3, and N4. These four feature datasets are then fused, summed, and averaged to obtain the final time-series feature dataset M(b), which contains features from both the kick data and normal drilling data.
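One possible reading of this multi-scale procedure is sketched below with TensorFlow/Keras: a small LSTM encoder is applied to sliding windows of lengths 3, 5, 7, and 9, and the per-scale features are averaged into a single 1 × d_L vector. The encoder is left untrained here for brevity; in the actual model the LSTM weights are learned together with the downstream classifier, and the exact fusion details may differ from this sketch.

```python
import numpy as np
import tensorflow as tf

def build_extractor(window_size: int, n_features: int = 8, d_l: int = 32) -> tf.keras.Model:
    """A single-scale LSTM encoder mapping a (window_size, 8) slice to a 1 x d_L vector."""
    inputs = tf.keras.Input(shape=(window_size, n_features))
    features = tf.keras.layers.LSTM(d_l)(inputs)     # last hidden state as the feature vector
    return tf.keras.Model(inputs, features)

def multiscale_features(x: np.ndarray, window_sizes=(3, 5, 7, 9)) -> np.ndarray:
    """Fuse features from several sliding-window scales by averaging, as in Figure 5.

    x: one preprocessed sample of shape (600, 8).
    """
    per_scale = []
    for w in window_sizes:
        encoder = build_extractor(w)
        # Slice the sequence into overlapping windows of length w (stride 1).
        windows = np.stack([x[i:i + w] for i in range(len(x) - w + 1)])
        per_scale.append(encoder.predict(windows, verbose=0).mean(axis=0))
    return np.mean(per_scale, axis=0)                # final 1 x d_L feature vector
```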
2.5. Classification of Downhole Condition
MLP is employed in this study to perform the final downhole condition classification task. MLP is a deep learning model based on a feed-forward neural network [37]. It is widely used to solve problems such as classification, regression, and clustering. An MLP consists of an input layer, hidden layers, and an output layer. The input layer is responsible for receiving external input features and providing data to the neural network, while the hidden layers optimize the network. The MLP forms a chain-like network structure by composing two or more functions layer by layer; the length of the chain is referred to as the depth of the network. The most common network structure connects the first layer $f^{(1)}$, the second layer $f^{(2)}$, and the output layer $g$ to form the chain $f(x) = g(f^{(2)}(f^{(1)}(x)))$, mapping the input data $x$ to a category.
$X \in \mathbb{R}^{n \times d}$ represents the feature matrix for $n$ samples, where each sample has $d$ input features. $H \in \mathbb{R}^{n \times h}$ represents the output of the hidden layer of an MLP with a single hidden layer containing $h$ hidden units. Since both the hidden layer and the output layer are fully connected, the hidden layer has the weights $W^{(1)} \in \mathbb{R}^{d \times h}$ and biases $b^{(1)} \in \mathbb{R}^{1 \times h}$, while the output layer has the weights $W^{(2)} \in \mathbb{R}^{h \times q}$ and biases $b^{(2)} \in \mathbb{R}^{1 \times q}$. The output of the single hidden-layer MLP is expressed as follows:

$$H = XW^{(1)} + b^{(1)}, \qquad O = HW^{(2)} + b^{(2)}$$
To fully utilize the potential of the multi-layer architecture, a non-linear activation function $\sigma$ must be applied to each hidden unit after the affine transformation:

$$H = \sigma(XW^{(1)} + b^{(1)}), \qquad O = HW^{(2)} + b^{(2)}$$

This non-linearity allows the network to capture more complex patterns and relationships in the data, rather than being limited to linear mapping. Without the non-linear activation, the hidden layers of an MLP can be merged into an equivalent single-layer model with the parameters $W = W^{(1)}W^{(2)}$ and $b = b^{(1)}W^{(2)} + b^{(2)}$:

$$O = (XW^{(1)} + b^{(1)})W^{(2)} + b^{(2)} = XW^{(1)}W^{(2)} + b^{(1)}W^{(2)} + b^{(2)} = XW + b$$
In a typical MLP structure, the functions $f^{(1)}$ and $f^{(2)}$ are responsible for data filtering and feature extraction throughout the network. These layers process the input data, transforming it by extracting meaningful patterns and features through the learned weights and biases. In contrast, the final layer, represented by the function $g$ in conjunction with the softmax activation function, is used to map the extracted features to the output dimensions. The softmax function then converts the raw output values into probabilities, allowing the model to determine the likelihood of the input belonging to each class. This final layer is crucial for making decisions and assigning the input data to the appropriate category.
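The forward pass described by these equations can be written in a few lines of NumPy, shown below with ReLU as the non-linearity σ and a softmax output; the layer sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 4, 32, 16, 2          # samples, input features (d_L), hidden units, classes

X = rng.normal(size=(n, d))        # feature matrix from the LSTM extraction module
W1, b1 = rng.normal(size=(d, h)), np.zeros((1, h))
W2, b2 = rng.normal(size=(h, q)), np.zeros((1, q))

H = np.maximum(X @ W1 + b1, 0.0)   # hidden layer with ReLU non-linearity sigma
O = H @ W2 + b2                    # raw output layer scores

# Softmax maps the scores to class probabilities (normal drilling vs. kick).
probs = np.exp(O - O.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(probs.argmax(axis=1))        # 1 = kick, 0 = normal drilling
```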
After the field drilling data are processed through the data augmentation and the feature extraction, the resulting time-series feature dataset M(b) is used as the input for the MLP. The detailed process is shown in Figure 6. The dataset is divided into k equal parts using the k-fold cross-validation method. In each iteration, k − 1 parts are used as the training dataset and are fed into the MLP network for model training, while the remaining part is used as the test set for model evaluation. The final output of the model is either 1 or 0, indicating a kick or normal drilling conditions, thus enabling the identification and warning of kick events.
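A minimal sketch of this training and evaluation loop is given below using scikit-learn, with an MLPClassifier standing in for the paper’s MLP module and randomly generated feature vectors standing in for the fused dataset M(b); the sample counts mirror the balanced dataset but are otherwise illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import recall_score

# Illustrative stand-in for M(b): 2208 balanced samples (1104 kick + 1104 normal
# after augmentation), each represented by a 1 x 32 feature vector.
rng = np.random.default_rng(42)
labels = np.repeat([1, 0], 1104)
features = rng.normal(size=(2208, 32)) + labels[:, None] * 0.5

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
recalls = []
for train_idx, test_idx in kfold.split(features, labels):
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)  # stands in for the paper's MLP
    clf.fit(features[train_idx], labels[train_idx])
    recalls.append(recall_score(labels[test_idx], clf.predict(features[test_idx])))

print(f"mean recall over 10 folds: {np.mean(recalls):.3f}")
```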
2.6. Model Structure and Hyperparameter Optimization
In this study, Bayesian optimization is used to optimize the hyperparameters of the kick detection model. Bayesian optimization is a global optimization algorithm based on Bayes’ theorem, often applied to approximate complex functions [38]. The Bayesian optimizer consists mainly of a probabilistic surrogate model and an acquisition function. The surrogate model is designed to reduce the complexity of the objective function by acting as a substitute model, which is typically a Gaussian process. This approach allows the normal distribution of each sample point to be determined, offering significant convenience for the acquisition function and enabling the precise location of the next optimal sample point. The acquisition function selects the next optimal sample point, helping to avoid local optima during the exploration process. The acquisition function chosen in this study is Expected Improvement.
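A sketch of such a search with the scikit-optimize gp_minimize routine (Gaussian-process surrogate with the Expected Improvement acquisition function) is shown below; the search-space bounds and the train_and_validate helper are hypothetical placeholders.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Hyperparameter search space for the MLP module (illustrative bounds only).
space = [Integer(16, 128, name="hidden_units"),
         Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
         Integer(8, 64, name="batch_size")]

def train_and_validate(hidden_units, learning_rate, batch_size) -> float:
    """Placeholder: train the TimeGAN-LSTM-MLP model with these hyperparameters
    and return the mean cross-validated F1 score (dummy value used here)."""
    return 0.9  # replace with the real 10-fold training/evaluation routine

def objective(params):
    hidden_units, learning_rate, batch_size = params
    return -train_and_validate(hidden_units, learning_rate, batch_size)  # gp_minimize minimizes

result = gp_minimize(objective, space, acq_func="EI",   # Expected Improvement
                     n_calls=30, random_state=0)
print(result.x, -result.fun)
```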
After trial and error, the architectures and hyperparameters of the TimeGAN, LSTM, and MLP are determined and listed in Table 2. The length of the input time series of the TimeGAN is 600, the number of epochs is 20,000, and the loss weights are 75. For the LSTM, the batch size is 24, the number of epochs is 50, and the learning rate is 0.03. For the MLP, the batch size is 32, the number of epochs is 25, the learning rate is 0.001, and the loss function is binary cross-entropy.
2.7. Evaluation Metrics of the Kick Detection Model
To comprehensively evaluate the performance of the intelligent kick detection model in identifying both kick and normal drilling conditions, this paper utilizes four performance metrics: accuracy, recall, precision, and F-measure (F1 score), based on the confusion matrix shown in Table 3.
Accuracy measures the overall performance of the model, indicating the proportion of correct predictions (in both kick and normal drilling conditions):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Recall (sensitivity or true positive rate) indicates the model’s ability to correctly identify actual kick incidents. It measures the proportion of actual kick cases that are correctly predicted:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Precision shows the proportion of predicted positive instances (kick cases) that are actually positive, i.e., the accuracy of positive predictions:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

The F-measure is the harmonic mean of precision and recall, used to balance these two metrics, especially in those cases where the dataset is imbalanced:

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Together, these metrics provide a well-rounded evaluation of the kick detection model’s effectiveness in distinguishing between kick incidents and normal drilling operations. In situations where kick samples are scarce, traditional intelligent models tend to ignore the minority class, leading to high accuracy but very low recall, which means that the model cannot accurately identify kick events.
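With scikit-learn, these four metrics can be computed directly from the predicted and true labels of a test fold, as the short example below shows (the label vectors are illustrative).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred: labels for a test fold (1 = kick, 0 = normal drilling).
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))    # sensitivity to real kicks
print("precision:", precision_score(y_true, y_pred)) # reliability of kick alarms
print("f1 score :", f1_score(y_true, y_pred))        # harmonic mean of the two
```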
3. Results and Discussion
3.1. Field Drilling Dataset
This study collected normal drilling data and kick data from 37 ultra-deep gas wells (ST1, SY001-1, SY001-H2, SY001-H6, SY001-X9, SY001-X3, SY001-X7, SY132, SYX131, HT1, CK1, etc.) in the SY block of the Sichuan Basin, forming the original dataset M(a). The detailed information is shown in Table 4. The dataset M(a) consists of 96 positive samples (kick samples) and 1104 negative samples (normal drilling samples), totaling 1200 samples. The imbalance ratio between the positive and negative samples in M(a) is as high as 11.5, making it a typical imbalanced dataset. Each sample in M(a) contains eight feature parameters and one label, with a sequence length of 600 (a data sampling interval of 1 s, corresponding to a time length of 10 min). Each feature parameter has the same sequence length. The first sample in the dataset M(a) is shown in Table 5.
The dataset used in this study was collected from real-time drilling operations. It includes time-series measurements of various well parameters, with data sampled at 1 Hz. This dataset spans a diverse range of drilling conditions, providing a robust foundation for the model. All eight of the feature parameters are recorded by the rig sensors of a well. Specifically, the standpipe pressure is measured by a pressure gauge in the standpipe. The outlet flow rate is measured directly using a flow meter placed at the mud return line, capturing the volume of mud exiting the well. This measurement is crucial, as it can indicate kick events or mud losses. The inlet flow rate is calculated from the pump rate. The mud pit volume is measured by a liquid level gauge. The mud density, temperature, and conductivity are measured by a densitometer, thermometer, and conductivity meter located at the inlet and outlet flow lines. The Dc exponent is calculated using Equation (18) [39], while the data statistics are listed in Table 6. In Equation (18), ρmN is the normal pore pressure equivalent density, which is typically 1.05 g/cm³; ρmR is the mud density, in g/cm³; ROP is the rate of penetration, in m/h; N is the rotary speed, in revolutions per minute; and Db is the diameter of the drill bit, in mm.
The correlations of the selected input parameters and other surface drilling parameters with kick are calculated using the Spearman method. The results are shown in Figure 7. The correlation indexes between the eight selected feature parameters and kick are all greater than 0.75, showing a strong correlation, while other surface drilling parameters such as WOB and RPM are less closely related to kick. Specifically, the relative importance is ΔDD > ΔV > DT > ΔSPP > ΔCD = ΔTD > ΔDF > Dc exponent.
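The Spearman screening step can be reproduced with scipy.stats.spearmanr, as sketched below; the parameter names and the randomly generated stand-in data are illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Illustrative stand-ins: ten candidate surface parameters over 1000 time steps
# and a binary kick indicator; in practice these come from the field dataset M(a).
names = ["dDD", "dV", "DT", "dSPP", "dCD", "dTD", "dDF", "Dc", "WOB", "RPM"]
kick_label = (rng.random(1000) < 0.1).astype(int)
candidate_params = rng.normal(size=(1000, len(names))) + kick_label[:, None] * rng.random(len(names))

for j, name in enumerate(names):
    rho, p_value = spearmanr(candidate_params[:, j], kick_label)
    print(f"{name:>5s}: rho = {rho:+.2f} (p = {p_value:.3g})")
```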
3.2. Example of Kick Data Augmentation
Figure 8a–h shows a set of preprocessed real kick data and a set of kick data generated by the TimeGAN data augmentation method (the blue curves represent real kick data, and the red curves represent generated data). Graphically presenting the results of both TimeGAN-generated and actual data offers unique advantages for evaluating the quality of the synthetic data, particularly when dealing with time series data. Visualizations allow the researcher to directly observe how well the generated data replicate the temporal patterns, trends, and feature relationships that are present in the actual data. Unlike a single evaluation metric like R², which captures overall fit, visual comparisons can reveal nuances such as the alignment of peaks and valleys, continuity in sequences, and other subtleties that contribute to the model’s realism. It can be observed that the kick data generated by TimeGAN exhibit similar characteristics and patterns to those of real kick data. Compared to traditional oversampling and undersampling methods, TimeGAN-generated kick data do not simply replicate the existing data but add diversity to the kick dataset, which helps enhance the generalization ability of the kick detection model.
3.3. Results of k-Fold Cross-Verification
This section aims to evaluate the robustness and generalizability of the model’s performance in terms of kick detection through k-fold cross-validation. The objective is to assess how consistently the model performs across different subsets of the data, reducing the risk of overfitting and ensuring that the results are not dependent on any particular training–test split. This section presents the model’s accuracy, precision, recall, F1 score, and other relevant metrics across different sets of k cross-validation. By calculating these metrics over k folds, this study provides a comprehensive view of the model’s reliability and variability. This evaluation helps confirm that the model can generalize well with unseen data.
To determine the appropriate value of k, the imbalance ratio of positive and negative samples (the ratio of the number of positive and negative samples in the dataset) is fixed at 1, and k is set to values ranging from 2 to 12, with a step size of 2. Additionally, the performance of the model is analyzed when the eight feature parameters are transformed into different dimensions, ranging from 2 to 8. To minimize the errors caused by a single test and to fully validate the model’s classification performance, each combination of parameters is tested 10 times, and the final results are obtained by averaging the outcomes, as shown in Figure 9. The figure shows that as the dimension of the feature parameters decreases, the model’s performance worsens when using k-fold cross-validation for training and testing. This is because the eight selected feature parameters in this study are highly correlated with kick events, and there is minimal data redundancy. Reducing the data dimensions is not beneficial for model training. Therefore, in subsequent training and testing processes, the data dimension is set to 8.
The choice of k value also has an impact on model performance. The accuracy, recall, precision, and F-measure all improve as the value of k increases. This is because, with a smaller k value, fewer samples participate in training, leading to poorer model performance and reduced kick detection capability. In contrast, with larger k values, more samples are involved in training, resulting in better model performance on the test set. When k = 10 and the data dimension is 8, the model achieves an accuracy of 0.988, a recall of 0.938, a precision of 0.915, and an F-measure of 0.926. When k = 12 and the data dimension is 8, the model demonstrates the best kick detection performance, with an accuracy of 0.991, a recall of 0.942, a precision of 0.928, and an F-measure of 0.935.
The conclusion drawn from the k-fold cross-validation results is that the model demonstrates better performance metrics as the fold number increases, indicating better generalizability. It is important to note that as the value of k increases, the number of samples used for testing the model decreases, which may negatively affect the model’s generalization ability. Moreover, as the value of k increases, the time required for model training also increases (as shown in Figure 10). After the k value reaches 10, further increasing the k value yields only a minimal improvement in model accuracy. Therefore, considering the balance between model accuracy, generalization ability, and training time, k is set to 10 in the subsequent models, adopting a 10-fold cross-validation method. Compared to single-experiment methods, 10-fold cross-validation provides a more objective and comprehensive evaluation of a model’s performance.
3.4. The Impacts of the Sample Imbalance Ratio and Methods for Handling Imbalanced Data on Model Performance
To compare the effectiveness of different imbalanced data handling methods, this study replaces the TimeGAN method in the intelligent kick detection model with classic sampling methods (undersampling, oversampling, and hybrid sampling) and the original GAN method, keeping other parts of the model unchanged.
Table 7 shows the optimal performance of the model using various imbalanced data handling methods.
It is evident that when mitigating imbalanced data with traditional sampling methods such as undersampling, oversampling, and hybrid sampling, the performance of the model in identifying kicks is inferior to that when using methods that generate synthetic kick data using GAN algorithms. Among them, undersampling yields the worst performance. This is because undersampling achieves balance by deleting part of the majority class samples (Table 8), which can lead to the loss of important information. Oversampling balances the number of positive and negative samples by duplicating minority class samples, while hybrid sampling combines the strengths of both undersampling and oversampling, improving model performance somewhat, but still falling short of expectations (Table 9). Unlike traditional sampling methods, both GAN and TimeGAN retain all the information from the original kick data (Table 9), and, during data augmentation, the loss function ensures that the synthetic kick data closely resemble real kick data. This effectively addresses the imbalance in sample quantity. Since the real kick data used in this study are time series data, TimeGAN, which accounts for temporal sequence characteristics, is more suitable for kick data augmentation compared to GAN. As a result, the data generated by TimeGAN align more closely with the actual surface drilling parameter changes observed during real kicks, leading to better overall model performance.
However, the results indicate certain limitations of synthetic data generation methods. While TimeGAN shows slight improvements, the lack of statistically significant differences suggests that various synthetic data generation methods may have comparable utility in many scenarios. This finding indicates a need to temper the expectation of large performance gains when choosing one synthetic generation model over another. Time series data often require models to capture long-term dependencies and complex temporal patterns. Synthetic generation methods, including TimeGAN, can struggle with these dependencies, sometimes producing sequences that lack the richness and continuity of real-world data. This limitation may account for the observed similarities in performance across the tested methods.
In addition, to study the effect of sample imbalance ratios on the performance of the kick detection model, this paper uses different imbalance handling techniques to transform the original dataset into datasets with various imbalance ratios (as shown in Table 9). These datasets are then used for model training and testing. All models have undergone Bayesian hyperparameter optimization and are fully trained, with the final results being the average of multiple experiments. The model is evaluated across various imbalance ratios to examine how the imbalance in the dataset influences its kick detection accuracy. The results, depicted in Figure 11, highlight the changes in model performance as the imbalance ratio increases.
As shown in Figure 11, the sample imbalance ratio has a significant impact on the performance of the kick detection model, regardless of the imbalance handling technique used. The model performs best when the sample imbalance ratio is equal to 1. As the imbalance ratio increases, the gap between the number of positive and negative samples widens, leading to a slight decrease in model accuracy and a sharp decline in recall, precision, and F-measure. This indicates that models trained with high imbalance ratios struggle to detect kicks accurately. Thus, the sample imbalance ratio is a key factor in determining the effectiveness of model training.
In contrast, except for undersampling and hybrid sampling techniques, other imbalanced data handling methods yield the highest accuracy when the imbalance ratio is 1. Although undersampling results in increased accuracy when the imbalance ratio is less than 4, and hybrid sampling maintains stable accuracy when the imbalance ratio is less than 6, both techniques show that recall, precision, and F-measure are highest when the imbalance ratio is 1. As the imbalance ratio increases, these metrics decline noticeably, demonstrating that the closer the positive and negative sample sizes, the better the model’s ability to accurately detect kicks. When the imbalance ratio is 1, the kick detection model using TimeGAN performs best, followed by GAN. Models trained with other traditional imbalance handling techniques perform less satisfactorily, reaffirming the superiority of the TimeGAN data augmentation technique used in this study.
3.5. Results of the Ablation Experiment
Through ablation experiments, specific modules within the model were systematically removed or altered to evaluate their impact on the performance of the kick detection model. The following experiments were conducted: (1) Removing the LSTM feature extraction module, to assess the contribution of feature extraction to the model. (2) Removing the MLP classification module. In this case, a sigmoid function was added to the LSTM neural network as the output activation function for binary classification, making the LSTM network handle both feature extraction and condition classification. This experiment aimed to evaluate the contribution of the MLP to the model. (3) Removing the TimeGAN data augmentation module. The original dataset with an imbalance ratio of 11.5 was used to evaluate the contribution of data augmentation. (4) Using LSTM alone. The LSTM neural network was used to perform both feature extraction and condition classification on the original dataset.
The ablation experiment results are shown in Table 10. From the table, it is evident that removing or altering any module in the model significantly reduces its ability to detect kicks. The worst performance occurs when the TimeGAN data augmentation module is removed. Although the model’s accuracy remains above 0.85, its recall drops below 0.65, indicating a substantial decline in its ability to detect kicks. This poor performance is due to the highly imbalanced nature of the original dataset, which causes the model to overlook the minority of positive kick samples. Moreover, the TimeGAN + LSTM model outperforms the TimeGAN + MLP model. This is because, in machine learning, any neural network performs some degree of feature extraction internally. Using a neural network to perform multivariate time-series feature extraction before classification effectively involves two rounds of feature extraction, which enhances the model’s overall feature extraction capability and improves the classification performance.
3.6. Field Application of the Intelligent Kick Detection Model
A blind test of the proposed model was conducted in a total of seven ultra-deep directional wells (SY-2, SY-5, SY-7, SY-11, SY-20, SY-31, SY-57) in the SY block of the Sichuan Basin, the data from which were not included in the dataset used for the training and testing of the intelligent kick detection model. The typical geological era and lithology (taking well SY-5 as an example) are listed in Table 11, while the well configuration is illustrated in Figure 12.
Based on the daily drilling reports and the well histories of the seven wells, 11 kick events occurred during drilling. To verify the engineering applicability of the intelligent kick detection model developed in this study, real-time logging data from the wells were used. Time series data of the eight feature parameters listed in Table 1 were extracted as input for the trained kick detection model, and a sliding window of 600 s was used to monitor kicks during the drilling process.
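In deployment, this amounts to a simple sliding-window loop over the 1 Hz data stream, sketched below; predict_window is a hypothetical wrapper around the trained feature extraction and classification modules.

```python
import numpy as np

WINDOW = 600          # seconds of history fed to the model (1 Hz sampling)
buffer = []           # most recent preprocessed feature rows, each of length 8

def on_new_sample(row: np.ndarray, model) -> None:
    """Called once per second with the latest values of the eight Table 1 parameters.

    model is the trained TimeGAN-LSTM-MLP detector; predict_window is assumed to
    wrap feature extraction plus MLP classification for a single 600 x 8 window.
    """
    buffer.append(row)
    if len(buffer) < WINDOW:
        return                             # not enough history yet
    window = np.asarray(buffer[-WINDOW:])  # sliding window of the last 10 min
    if model.predict_window(window) == 1:  # 1 = kick, 0 = normal drilling
        print("KICK WARNING: check pit gain and shut-in procedures")
```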
The kick detection results for the real drilling process of the seven wells are shown in Table 12. The results indicate that the model with TimeGAN successfully identified the 11 kick events during the drilling process, with one false alarm (Recall = 1, Precision = 0.917). As for the other methods, the model with undersampling identified only 7 of the 11 kicks and produced three false alarms (Recall = 0.636, Precision = 0.700). The model with oversampling identified 8 of the 11 kicks and produced three false alarms (Recall = 0.727, Precision = 0.727). The model with hybrid sampling identified 8 of the 11 kicks and produced two false alarms (Recall = 0.727, Precision = 0.800). The model with GAN identified 10 of the 11 kicks and produced two false alarms (Recall = 0.909, Precision = 0.833). Overall, the intelligent kick detection model constructed with TimeGAN demonstrated high accuracy in identifying kick events and performed well in the real-world application of ultra-deep well drilling in the Sichuan Basin, providing useful guidance for field operations.
3.7. Discussion of Model Limitations
(1) Limitations of surface-based sensor measurements for kick detection
In this study, we relied on surface-based sensors to measure key drilling parameters in real time, including inlet–outlet mud conductivity differences and inlet–outlet mud density differences. While these surface measurements provide valuable insights into wellbore conditions, they come with inherent limitations, particularly regarding the timeliness of kick detection. Surface sensors only capture changes in mud properties after the drilling mud has circulated to the surface. This delay means that kicks will only be detectable at the surface after the mud has traveled up the annulus. In scenarios where rapid kick detection is critical, this latency could lead to delayed response times, potentially increasing the risk of blowouts.
While surface-based monitoring remains standard practice in drilling operations, due to its cost-effectiveness and ease of deployment, downhole sensors offer a promising alternative for the real-time detection of downhole events directly at the source. However, the implementation of downhole sensors faces practical challenges in terms of cost, technical complexity, and data transmission.
Given that our study utilized surface-based measurements, our approach is limited by the delayed detection of downhole events. As a result, while the model developed in this study can identify potential kicks, it does so only after the kick has affected the mud’s properties at the surface. In future studies, integrating downhole sensors could enhance the accuracy and timeliness of kick detection, improving the reliability of early warning systems for well integrity monitoring. For real-world implementations where early kick detection is critical, combining surface and downhole measurements may provide a more comprehensive monitoring system.
(2) Limitations of TimeGAN
When using TimeGAN for the data augmentation of kick samples, several limitations may impact the model’s ability to detect kicks. First, TimeGAN, like other generative models, may struggle to capture the full variability of rare events. In the case of kick detection, kicks are typically much rarer than normal drilling conditions, which can lead to challenges in accurately modeling and generating synthetic kick events. As a result, the generated kick samples may not capture the full spectrum of kick scenarios, which could limit the model’s ability to generalize.
Second, TimeGAN is prone to mode collapse, a common issue in GAN-based models where the generator produces a limited variety of outputs. This can reduce the diversity of generated kick samples, resulting in synthetic data that are too similar across instances. Under such circumstances, the augmented training dataset will not provide sufficient variability for the detection model to learn the range of possible kick scenarios. This can lead to overfitting and decreased performance regarding real kick detection.
Third, TimeGAN is designed to generate realistic time series data, but capturing long-term dependencies and subtle variations in kick events can be challenging. Kicks often exhibit complex, time-dependent changes that may not be fully captured by TimeGAN. However, evaluating the quality of synthetic kick data is challenging, as no single metric fully captures the realism and relevance needed for kick detection. Existing evaluation metrics like reconstruction error or discriminative score may not adequately reflect the kick characteristics that are essential for accurate detection. Poor evaluation of synthetic data quality can lead to an augmented dataset that appears adequate but lacks the critical features of real kicks.
4. Conclusions
In addressing well control safety issues during the exploration and development of deep and ultra-deep natural gas resources, this paper selects eight feature parameters closely related to kick events to construct an intelligent kick detection model based on TimeGAN-LSTM-MLP. Using real drilling data from ultra-deep gas wells in the Sichuan Basin, the model has been trained and tested, leading to the following key conclusions:
(1) Based on the time-series characteristics of real drilling data, the TimeGAN network was used to construct a data augmentation method for kicks during real drilling, enhancing the diversity of the kick dataset (Figure 8). This method overcomes the challenges of poor generalization and low detection accuracy caused by the scarcity and imbalance of kick samples.
(2) The LSTM network, known for its ability to capture crucial information over a long time series, was used to build a time series feature extraction module for surface drilling parameters. This reduced the difficulty of training the MLP-based downhole operating condition classification module, improving classification accuracy.
(3) The eight selected feature parameters exhibited minimal data redundancy; reducing the dimension of the feature parameters would hinder model training (Figure 9). Considering model accuracy, generalization ability, and training time (Figure 10), the optimal k for k-fold cross-validation was found to be 10 (accuracy = 0.988, recall = 0.938, precision = 0.915, and F-measure = 0.926). Compared to single-experiment methods, ten-fold cross-validation provides a more objective and comprehensive evaluation of model performance.
(4) In comparison with other imbalanced data handling methods, the kick detection model performed better when using TimeGAN to generate synthetic kick data (Table 8). The model’s kick identification capability is the best when the sample imbalance ratio is 1 but decreases as the imbalance ratio increases (Figure 11). Achieving a near-balance of positive and negative samples through appropriate data augmentation techniques is key to ensuring accurate kick identification by the intelligent model.
(5) The ablation experiments demonstrated that all three modules of the intelligent kick detection model are indispensable for ensuring accuracy, with the TimeGAN data augmentation module being the most critical (Table 10). Without this module, the model’s ability to identify kick events significantly decreases (accuracy = 0.880, recall = 0.638, precision = 0.859, and F-measure = 0.732).
(6) The trained model was applied in the field using unseen drilling data from seven wells in a certain area of Sichuan. The model with TimeGAN successfully identified 11 kick events during drilling, with a low false alarm rate (Recall = 1, Precision = 0.917), thereby providing a valuable reference for kick warnings during drilling operations, in comparison with other methods (Table 12).