Article

An Improved Transformer Framework for Well-Overflow Early Detection via Self-Supervised Learning

1 Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Changchun 130012, China
2 CNPC Engineering Technology R&D Company Limited, National Engineering Research Center of Oil & Gas Drilling and Completion Technology, Beijing 102206, China
* Author to whom correspondence should be addressed.
Energies 2022, 15(23), 8799; https://doi.org/10.3390/en15238799
Submission received: 29 October 2022 / Revised: 15 November 2022 / Accepted: 18 November 2022 / Published: 22 November 2022
(This article belongs to the Special Issue Optimization and Simulation of Intelligent Oil and Gas Wells)

Abstract

Oil drilling has always been considered a vital part of resource exploitation, and overflow is its most common and troublesome threat, one that can escalate into a blowout, a catastrophic accident. To prevent further damage, it is therefore necessary to detect overflow as early as possible. However, the imbalanced distribution and the scarcity of labeled data make it difficult to design a suitable solution. To address this issue, an improved Transformer framework based on self-supervised learning is proposed in this paper, which can accurately detect overflow 20 min in advance even when the labeled data are limited and severely imbalanced. The framework includes a self-supervised pre-training scheme that focuses on long-term time dependence, offers performance benefits over fully supervised learning on downstream tasks, and makes unlabeled data useful in the training process. Next, to better extract temporal features and adapt to the multi-task training process, a Transformer-based auto-encoder with a temporal convolution layer is proposed. In the experiment, we used 20 min of data to detect overflow in the next 20 min. The results show that the proposed framework reaches 98.23% accuracy and a 0.84 F1 score, which is much better than other methods. We also compare several modifications of our framework and different pre-training tasks in an ablation experiment to demonstrate the advantages of our methods. Finally, we discuss the influence of important hyperparameters on efficiency and accuracy.

1. Introduction

Oil, a non-renewable energy source, is considered one of the most important resources in current society and underpins modern industry. However, in oil drilling there are many accidents that waste oil resources and can even lead to severe consequences; the most common of these is overflow. If an overflow is not detected early and handled properly, it can escalate into a blowout, resulting in the scrapping of the wellbore, which not only causes incalculable economic loss and environmental damage but also endangers the safety of employees. Therefore, it is highly necessary to detect overflow as early as possible.
Many traditional machine learning methods have been proposed for overflow detection. Liang et al. (2019) [1] established an intelligent early-warning model employing pattern identification and K-means dynamic clustering. Haibo et al. (2019) [2] improved DBSCAN clustering with time-series scanning and stratification to refine the clustering procedure. Liang et al. [3] proposed a warning method based on fuzzy theory and the PSO-SVR algorithm. Liu et al. (2021) [4] developed a dynamic Bayesian network to create a dynamic risk assessment model for evaluating the safety of deep-water drilling operations. Wang et al. (2022) [5] proposed a drilling identification method based on an optimized SVM. These methods rely heavily on statistical features of the data and are very sensitive to feature selection.
In deep learning, Lind et al. (2014) [6] proposed a radial basis function (RBF) neural network based on the k-means clustering algorithm to predict drilling risk. Liang et al. (2019) [7] established a model for overflow diagnosis based on the monitored standpipe pressure and casing pressure in pressure wave transmission, using a genetic algorithm and a BP neural network (GA-BP). Sabah et al. (2020) [8] combined a number of heuristic search algorithms, including the genetic algorithm (GA), particle swarm optimization (PSO), and the cuckoo search algorithm (COA), with a multilayer perceptron (MLP) neural network and a least squares support vector machine (LSSVM) to present different hybrid algorithms for lost-circulation prediction.
In both ML and DL methods, non-iterative approaches are widely used for optimization. For example, the bat optimization algorithm has been used to improve random forest [9]; it can find the optimal parameter combination and handle a large number of eigenvalues. The GA-BP structure [8,10], which optimizes a back-propagation neural network (BP) with a genetic algorithm (GA), is more commonly used. In addition to GA-BP, the SGTM neural-like structure [11,12] can also serve as an effective approach to overflow detection tasks.
However, oil drilling data differ in certain respects from vision or text data. Their distribution varies over time, which means that the statistical hypothesis underlying traditional machine learning methods such as Bayesian classifiers and SVM (that samples are generated from a single population distribution and are independently and identically distributed) may not hold [13,14]. Meanwhile, deep learning methods such as MLP, LSTM, and CNN may extract the relationships within the data but suffer from over-fitting and lack explainability [15]. Moreover, DL methods require enormous amounts of labeled data, which are difficult to obtain in practice. The second problem is the class imbalance of overflow samples. Many methods use sliding windows with a small stride for sampling [16,17,18], but windows close to each other are so similar that they could effectively be considered a single sample. Therefore, the choice of stride is an important aspect of data preparation. For the convenience of discussion, however, we handle the data in a radical way: the stride of the windows is set equal to their size, which leads to an extreme class imbalance.
In this paper, an improved Transformer framework based on TCN, the Transformer [19], and Self-Supervised Learning (SSL) [20] is proposed to address the problems mentioned above. The rest of the paper is organized as follows. The Transformer and other related methods are introduced in the next section. The details of our framework are presented in the Methods section. Comparisons and experimental results are shown in the Experiments section, with a discussion of each experiment. Finally, our proposal is summarized in the Discussion and Conclusions section. Overall, our contributions are as follows.
  • A new Transformer framework is proposed for the oil overflow detection task;
  • A combined model is built for the forecasting and classification tasks, and several SSL pre-training tasks are compared in the experiment;
  • Several experiments are conducted to demonstrate the advantages and disadvantages of the Transformer in the oil overflow detection task.

2. Related Work

2.1. Auto-Encoder and Self-Supervised Learning

The auto-encoder (AE) [21,22,23] is a classical method for representation learning, like PCA and t-SNE: an encoder maps the input into a low-dimensional representation space and a decoder reconstructs the input. The denoising auto-encoder (DAE) [24] is an improved version of the AE. Instead of using the original input, it corrupts the input signal and learns to reconstruct the original, uncorrupted signal.

2.2. Temporal Convolution Networks

The Temporal Convolution Network (TCN) [25] is a convolution-based auto-encoder network (shown in Figure 1) that uses a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. TCN is capable of capturing action compositions, segment durations, and long-range dependencies. Moreover, TCN is over an order of magnitude faster to train than competing LSTM-based recurrent neural networks [26]. In our method, the Temporal Convolution Layer (TCL) from TCN is used; a sketch is given below.
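The following is a minimal PyTorch-style sketch of a TCL as described in Figure 1: two dilated causal convolutions, a chomp operation that trims trailing padding, and a size-1 convolution as residual connection. The class names and default kernel size are illustrative, not the authors' exact implementation.

```python
import torch.nn as nn

class Chomp1d(nn.Module):
    """Trim the trailing padding so the output length matches the input length."""
    def __init__(self, chomp_size):
        super().__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size] if self.chomp_size > 0 else x

class TemporalConvLayer(nn.Module):
    """Two dilated causal convolutions with a 1x1 convolution as residual connection."""
    def __init__(self, in_ch, hidden_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, hidden_ch, kernel_size, padding=pad, dilation=dilation),
            Chomp1d(pad), nn.ReLU(),
            nn.Conv1d(hidden_ch, out_ch, kernel_size, padding=pad, dilation=dilation),
            Chomp1d(pad), nn.ReLU(),
        )
        self.residual = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):            # x: (batch, channels, time)
        return self.net(x) + self.residual(x)
```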

2.3. Transformer for Time Series

The Transformer [19] is widely used in Natural Language Processing, Computer Vision [27,28], and Time Series Forecasting [29,30,31]. There are two main blocks in a Transformer structure (Figure 2a): the Multi-Head Attention mechanism (MHA) (Figure 2b) and the Feedforward Network (FFN) (Figure 2c). MHA is good at capturing long-distance feature dependencies, while FFN integrates the information from MHA. In this paper, the original Transformer is used, mainly to examine the suitability of the Transformer for overflow detection problems. Other variants of the Transformer, such as Informer [29], Autoformer [30], and Reformer [31], can be compared for better performance and longer time dependence.

3. Method

3.1. Base Model

The main framework of our proposal has an Encoder-Decoder structure. As shown in Figure 3, a Temporal Convolution Layer (TCL) is used for global temporal relation extraction and compression, and the Transformer serves as the core of the encoder. For the decoder, different designs are used for the two tasks, classification and forecasting (reconstruction), both of which are described in Section 3.2.
In the encoder (shown in Figure 3, right), for better feature extraction and lower computation, the TCL projects the input x ∈ ℝ^(D_f × D_t) into the first hidden feature space h ∈ ℝ^(D_h × D_t), where D_h is the hidden feature dimension. Instead of sinusoidal encodings, the learnable positional embedding layer proposed by Zerveas [32] is used; it is a weight matrix w_pos ∈ ℝ^(l × 1 × D_h), where l is any length up to D_t. After the embedding, the Transformer layers follow. Each Transformer block contains two main parts, multi-head attention (MHA) and a feedforward network (FFN). MHA extracts long-term relations at both the temporal and channel levels and accumulates their effect for each location in feature space, while the FFN integrates the information from all locations. In the end, we obtain the hidden feature space h ∈ ℝ^(D_h × D_t) as the output of the encoder. A minimal sketch of this encoder is given below.
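The sketch below assembles the encoder described above in PyTorch: a temporal convolution projection, a learnable positional embedding, and Transformer blocks. The class name is illustrative, a plain Conv1d stands in for the TCL, and the default dimensions follow Table 3 (D_f = 22, D_h = 16, 2 attention heads) but are assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class OverflowEncoder(nn.Module):
    def __init__(self, d_feat=22, d_hidden=16, d_time=1000, n_heads=2, d_ff=8, depth=1):
        super().__init__()
        # A plain Conv1d stands in here for the TCL projection into the hidden space.
        self.tcl = nn.Conv1d(d_feat, d_hidden, kernel_size=3, padding=1)
        # Learnable positional embedding: one vector per time step.
        self.pos = nn.Parameter(torch.zeros(d_time, 1, d_hidden))
        layer = nn.TransformerEncoderLayer(d_model=d_hidden, nhead=n_heads,
                                           dim_feedforward=d_ff)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                 # x: (batch, D_f, D_t)
        h = self.tcl(x)                   # (batch, D_h, D_t)
        h = h.permute(2, 0, 1)            # (D_t, batch, D_h) for the Transformer
        h = h + self.pos[: h.size(0)]     # add positional embedding
        return self.transformer(h)        # (D_t, batch, D_h)
```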
In the decoder (shown in Figure 4), a perceptron layer [33] with a Softmax function is used for the classification task in the supervised learning scheme, and a single linear layer is used for the reconstruction and forecasting tasks in the SSL pre-training scheme. Such a simple setting is chosen mainly to limit computational cost. Usually, an Encoder-Decoder framework uses a symmetrical structure [19,21,22], but the Transformer has a very high computational complexity that limits the size of the time windows used in sampling. Therefore, a simple decoder is used for the reconstruction and forecasting tasks.
In the first stage, the SSL pre-training task, a Transformer encoder with a forecasting or reconstruction decoder (shown in Figure 3, right) is used. Training details are introduced in the next section. Before moving to the next stage (the overflow detection task), the decoder of the pre-trained model is replaced with a classification decoder (shown in Figure 3, left), while the encoder is kept the same, as sketched below.
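A hedged sketch of this two-stage scheme: pre-train the encoder with a single linear forecasting/reconstruction head, then swap in a classification head while reusing the encoder weights. The class names are illustrative, and the classification head here is a flatten-plus-linear simplification of the perceptron-with-Softmax decoder described above.

```python
import torch.nn as nn

class PretrainModel(nn.Module):
    def __init__(self, encoder, d_hidden=16, d_feat=22):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(d_hidden, d_feat)    # reconstruction / forecasting head

    def forward(self, x):
        return self.head(self.encoder(x))          # (D_t, batch, D_f)

class ClassifierModel(nn.Module):
    def __init__(self, encoder, d_hidden=16, n_classes=2):
        super().__init__()
        self.encoder = encoder                     # reused, pre-trained encoder
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(n_classes))

    def forward(self, x):
        z = self.encoder(x).permute(1, 0, 2)       # (batch, D_t, D_h)
        return self.head(z)                        # logits; Softmax is applied in the loss
```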

3.2. Self-Supervised Pre-Training Tasks

In this paper, three different tasks are used for self-supervised learning in the pre-training scheme: the masked auto-encoder reconstruction task (masked reconstruction task if only the Transformer is used in the encoder), the auto-encoder reconstruction task, and the forecasting task (shown in Figure 5).
From right to left in Figure 5, the first task is the auto-encoder reconstruction task (AE task) [34], which converts high-dimensional data into low-dimensional codes with an encoder network and reconstructs them with a decoder, so that gradient descent can be used to adjust the low-dimensional codes. During the dimension transformation, information is inevitably lost, which we call “Structure Noise” in Figure 5. This method relies on the premise that dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data. In overflow detection, the AE task may have a potential benefit by increasing the margin in the low-dimensional codes. The loss function used in our proposal is the mean squared error (MSE) loss:
MSE(x, y) = Σ_i (x_i − y_i)^2    (1)
The second task is the masked auto-encoder reconstruction task (MAE task), similar to the masked image modeling task [35] and the masked language modeling task [36]. There are two differences between the AE task and the MAE task. (1) Besides adding Structure Noise, we also add Mask Noise to the input by setting a small portion of the data to 0. (2) The purpose of the model is to reconstruct the masked input back to the original input (before masking), but only the masked part of the output and the corresponding part of the input are used for loss calculation. Overall, it is like a simplified version of the AE task with additional mask noise. Random masking forces the model to learn the dependence between the masked and unmasked parts without knowing the masked content, which has been proven to be an effective SSL task in Natural Language Processing. The loss function is also MSE; a sketch of this objective is given below.
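The following is a minimal sketch of the MAE objective described above, assuming the PretrainModel sketched earlier (output shape (D_t, batch, D_f)): a small portion of the input is masked to 0 and the MSE is computed only on the masked positions. The mask ratio of 15% is an illustrative assumption, not a value taken from the paper.

```python
import torch

def masked_reconstruction_loss(model, x, mask_ratio=0.15):
    # x: (batch, D_f, D_t); mask whole time steps with probability mask_ratio
    mask = torch.rand(x.size(0), 1, x.size(2), device=x.device) < mask_ratio
    x_masked = x.masked_fill(mask, 0.0)            # Mask Noise: set masked values to 0
    recon = model(x_masked)                        # reconstruction of the full input
    recon = recon.permute(1, 2, 0)                 # back to (batch, D_f, D_t)
    # MSE computed only where the input was masked
    return ((recon - x) ** 2)[mask.expand_as(x)].mean()
```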
The third task is completely different from the AE-style tasks above. The forecasting task takes the input x_t at moment t and predicts x_{t+1} at the next moment t + 1. In general, there is no obvious correlation between classification tasks and forecasting tasks. However, overflow detection is indeed a combination of both: it uses the data at moment t to predict the label at moment t + 1. Therefore, we use forecasting as a self-supervised pre-training task to offer potential benefits for the downstream task of overflow early detection. The loss function is also MSE; one training step is sketched below.
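A hedged sketch of one forecasting pre-training step, assuming the PretrainModel above and interpreting "predict the next moment" as predicting the next non-overlapping window from the current one (this exact setup is an assumption). The current window x_t is fed to the model and an MSE loss against x_{t+1} is minimized, as in Equation (1).

```python
import torch.nn.functional as F

def forecasting_step(model, optimizer, x_t, x_next):
    # x_t, x_next: (batch, D_f, D_t), consecutive non-overlapping windows
    pred = model(x_t).permute(1, 2, 0)     # (batch, D_f, D_t)
    loss = F.mse_loss(pred, x_next)        # MSE between prediction and next window
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```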

3.3. Supervised Training

In the supervised learning scheme, labeled data are used to train the overflow detection model. In this subsection, two techniques are introduced: weighted sampling and Cross-Entropy.
Weighted sampling is used as a solution to the sample imbalance problem. The main idea is to assign each sample a weight and draw random samples according to the probability distribution defined by the weights, which performs minority oversampling and majority undersampling at the same time. The calculation of the weights is shown in Equations (2) and (3).
W_n = N / n_0 + b_0    (2)
W_o = 1 / (1 − n_0 / N)    (3)
In the above equations, W_n and W_o are the weights of the normal samples and the overflow samples, N is the total number of samples, and n_0 is the number of normal samples. b_0 is a manual bias that adjusts the weighting effect in each batch. After weighting all samples, random sampling is used to create batches for training. The actual effect is shown in the experiment section. A minimal sketch of this sampling scheme is given below.
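The sketch below builds a weighted loader with PyTorch's WeightedRandomSampler, using the per-class weights from Equations (2) and (3) as reconstructed above; the default b0 value and function name are assumptions for illustration.

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

def make_weighted_loader(X, y, b0=0.5, batch_size=256):
    # X: (n_samples, D_f, D_t) tensor, y: (n_samples,) tensor of 0/1 labels
    N = len(y)
    n0 = int((y == 0).sum())                       # number of normal samples
    w_normal = N / n0 + b0                         # Equation (2), as reconstructed
    w_overflow = 1.0 / (1.0 - n0 / N)              # Equation (3), as reconstructed
    weights = torch.full((N,), w_normal, dtype=torch.double)
    weights[y == 1] = w_overflow                   # overflow samples get a larger weight
    sampler = WeightedRandomSampler(weights, num_samples=N, replacement=True)
    return DataLoader(TensorDataset(X, y), batch_size=batch_size, sampler=sampler)
```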
In supervised training, Cross-Entropy is used as the loss function instead of a weighted Binary Cross-Entropy. Although a weighted loss is a common solution to the imbalance problem, weighted sampling has the same effect, and combining the two methods would aggravate over-fitting.

3.4. Data Preparation

The data used in this paper were collected from a real drilling process: 56 features were recorded before and after overflow, covering an hour-long interval, at a sampling frequency of nearly 1 Hz. In sampling, a sliding window of size 1000 is used, so each sample covers roughly 20 min of data. The label of each sample is determined by the next window: if overflow occurs in the next window, the label of the current sample is 1, otherwise 0. For feature selection, we follow the same procedure proposed by Liu et al. [18] and select the 22 features shown in Table 1. Each sample has shape (D_f, D_t), where D_f is the dimension of the feature channel and D_t is the dimension of the time channel. After sampling and feature selection, D_t is 1000 and D_f is 22. Table 1 and Table 2 present a few statistical characteristics of our data after sampling.
As shown in Table 2, the ratio of normal to overflow samples is around 40–50, which means the data used in our paper are highly imbalanced, as in most anomaly detection tasks [37]. According to Table 1, some features show dramatic changes, which are not always caused by sensor faults and are more likely caused by manual operation. Therefore, Gaussian normalization (standardization) is used on the data after sampling instead of Min-Max normalization; a sketch of the sampling and normalization procedure is given below.
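A hedged sketch of the data preparation described above: non-overlapping windows (stride = window size = 1000), labels taken from whether overflow occurs anywhere in the next window, and per-channel Gaussian (z-score) normalization. The overflow_flag array name and the exact labeling rule are assumptions consistent with the description, not the authors' exact code.

```python
import numpy as np

def make_samples(data, overflow_flag, window=1000):
    # data: (T, D_f) array of selected features; overflow_flag: (T,) array of 0/1 flags
    X, y = [], []
    for start in range(0, len(data) - 2 * window + 1, window):
        win = data[start:start + window]                      # current ~20 min window
        nxt = overflow_flag[start + window:start + 2 * window]
        X.append(win.T)                                       # sample shape (D_f, D_t)
        y.append(int(nxt.max() > 0))                          # 1 if overflow in next window
    X = np.stack(X)
    # Gaussian normalization per feature channel instead of Min-Max scaling
    mean = X.mean(axis=(0, 2), keepdims=True)
    std = X.std(axis=(0, 2), keepdims=True) + 1e-8
    return (X - mean) / std, np.array(y)
```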

4. Experiments

4.1. Experiment Setup

The data used in this paper have already been described in the prior section. To demonstrate the advantages of our proposed framework, several common methods are selected as baselines in our experiment.
Deep Learning methods:
  • TCN: We follow the same training framework as in [25] and use three layers of temporal convolution blocks for feature extraction.
  • CNN-LSTM: We use the same network as in [18], which contains three convolution layers and an LSTM layer for feature extraction.
Variants of the Transformer framework:
  • Linear + Transformer: We replace the TCL with a single linear layer. This is similar to the TST framework [32] but uses layer normalization instead.
  • Transformer: The original Transformer without a linear layer or TCL.
Machine Learning methods:
  • LightGBM [38]: An improved Gradient Boosting Decision Tree with Gradient-based One-Side Sampling and Exclusive Feature Bundling.
  • SVM: Support Vector Machine, a supervised algorithm based on statistical learning frameworks.
  • KNN: The K-Nearest Neighbors algorithm, a metric-based method.
Since the ML methods require 2-D input, we flatten the data and apply PCA, and also evaluate the performance without PCA. We set the amount of variance to be explained to greater than 0.95, as sketched below. The main parameter settings of the DL methods are listed in Table 3 and Table 4.
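A minimal sketch of the PCA flattening used for the ML baselines, assuming scikit-learn; passing n_components=0.95 keeps enough components to explain more than 95% of the variance, as stated above. The function name is illustrative.

```python
from sklearn.decomposition import PCA

def flatten_with_pca(X_train, X_test):
    # X_*: (n_samples, D_f, D_t) -> flatten each sample to a single feature vector
    Xtr = X_train.reshape(len(X_train), -1)
    Xte = X_test.reshape(len(X_test), -1)
    pca = PCA(n_components=0.95)          # keep components explaining >95% of the variance
    return pca.fit_transform(Xtr), pca.transform(Xte)
```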
We train all the baseline DL models with sufficient hyper-parameter tuning to produce the reported results. However, due to the limited computational budget, the hyper-parameters of the Transformer are only roughly tuned, which means better performance is possible. To select hyper-parameters, we randomly split the training set 80–20% and use the 20% as a validation set for tuning. After fixing the hyper-parameters, we train the model again on the entire training set several times and keep the model with the lowest training loss for evaluation.

4.2. Baselines Experiment

In Table 5, we present the performance of our proposed framework, TCN + Transformer + Forecasting, compared with several baseline methods. Overall, the Transformer framework and its variants achieve the best results on nearly all evaluation metrics and the second-best P on overflow samples; SVM and LightGBM achieve the best P on overflow samples. Considering the nature of our task (overflow detection), higher Recall and F1 scores are more important, because in an anomaly detection task the most disastrous fault is a False Negative: if an overflow occurs and is wrongly predicted as normal, it will cause catastrophic loss, whereas stopping drilling for a few hours is comparatively acceptable. Therefore, the Transformer is much better than all the other baseline methods, and the DL methods are also better than the traditional ML methods.
In further discussion, we observe that PCA prevents KNN from correctly distinguishing overflow samples. The reason may be that the component selection in PCA is not related to the downstream task. The same effect is also observed in the pre-training experiment.

4.3. Pre-Training Experiment

As shown in Table 6, the forecasting task is the best SSL pre-training task, providing a significant performance improvement compared with the AE task and the MAE task, the latter being the most common SSL task for anomaly detection. We believe there are two main reasons why AE-style tasks (including PCA) are not useful for overflow prediction: 1. Abnormal characteristics are local [39] and surrounded by normal values, so learning a low-dimensional space may average the abnormal values with the normal ones. 2. Dimensionality reduction has no obvious correlation with the downstream task, so the feature space, i.e., the representation learned by the AE task, may be unsuitable for overflow prediction. Such a gap is also mentioned in TARNet [40], which combines the two tasks and trains them in a multi-task fashion instead of designing a relevant pre-training task.
In our experiments, the forecasting task proves to be suitable for the Transformer in the overflow prediction problem and can be used as a pretext task for representation learning or feature extraction.

4.4. Sampling Experiment

As shown in Table 7, weighted sampling is able to slightly reduce the impact of the imbalanced data in deep learning, but the performance on the overflow class is still unsatisfactory.

4.5. Time Interval Experiment

As the window widens, the size of the data set and the number of overflow samples decrease (shown in Table 8). The ratio of normal to overflow samples varies around 41 to 45. However, the performance shown in Figure 6 does not gradually decrease or increase, unlike [19]. This is partly because of the differences in stride, which cause the distribution to change more dramatically; it also shows that the Transformer is very sensitive to the data distribution. Another reason for the variation of Recall and F1 score is that some normal samples contain part of the early signs of overflow, which directly reduces the separability of the dataset. Therefore, it is better to find an interval that roughly contains all the early signs up to the point the overflow occurs, or to set the label to 1 if the sample contains early signs. In our experiment, a window size of 1000 is suitable for our data for overflow prediction.

5. Discussion and Conclusions

In this paper, an improved Transformer framework for the class-imbalanced overflow prediction problem is proposed. Specifically, forecasting is used as a self-supervised pre-training (pretext) task for overflow prediction and is compared with several other SSL tasks in the experiment. Three different Transformer structures are also evaluated, the original Transformer, Linear + Transformer, and TCL + Transformer, with a discussion of their advantages and disadvantages. Finally, an efficiency experiment varying the window size and parameters is presented.
Overall, there are three main conclusions from the experiments. 1. The Transformer has great potential in the overflow detection task, as demonstrated by the baseline experiment. 2. The forecasting task brings significant benefits to Transformer models in the overflow detection task, while the AE and MAE tasks bring little improvement. 3. Transformer models are very sensitive to the sample distribution, and a resampling solution alone is inadequate for this problem.
As a further discussion, the Transformer showed two vital disadvantages during the experiments: 1. high computational complexity; 2. over-dependence on data. High complexity brings a significant demand for computational resources and increases the cost of equipment, while over-dependence limits the generalization ability of a model trained on a small data set. From a human perspective, certain rules can be used to decide whether an overflow will occur or not. However, both ML and DL methods are currently merely a reflection of the collected data and lack the ability to learn such rules. In consequence, the generalization ability to different wells is unknown.
Besides the disadvantages mentioned above, many valuable questions remain for the Transformer. For example: do the attention values in the MHA block provide explainability? In data sampling, we only demonstrate the importance of an appropriate window size, but how to select a window size that covers the early signs before overflow is also a vital research problem.
In future work, we will work on the explainability and sampling of the Transformer framework in oil overflow prediction problems.

Author Contributions

Conceptualization, L.H. and X.H.; Data curation, W.L. and J.F.; Investigation, J.F.; Methodology, W.Y., W.L. and X.H.; Project administration, W.L.; Software, L.H.; Supervision, L.H. and X.H.; Validation, J.F.; Visualization, W.Y.; Writing—original draft, W.Y.; Writing—review & editing, W.Y. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

The authors are grateful for the support of the National Key Research and Development Program of China No. 2019YFA0708304, the National Natural Science Foundation of China No. 61972174 and No. 62172187, the Science and Technology Planning Project of Jilin Province No. 20220201145GX, No. 20200708112YY and No. 20220601112FG, the Science and Technology Planning Project of Guangdong Province No. 2020A0505100018, Guangdong Universities’ Innovation Team Project No. 2021KCXTD015 and Guangdong Key Disciplines Project No. 2021ZDJS138, Projects of CNPC No. 2021DQ0503 and No. 2020B-4019.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liang, H.; Li, G.; Liang, W. Intelligent early warning model of early-stage overflow based on dynamic clustering. Clust. Comput. 2019, 22, 481–492. [Google Scholar] [CrossRef]
  2. Haibo, L.; Zhi, W. Application of an intelligent early-warning method based on DBSCAN clustering for drilling overflow accident. Clust. Comput. 2019, 22, 12599–12608. [Google Scholar] [CrossRef]
  3. Liang, H.; Zou, J.; Li, Z.; Khan, M.J.; Lu, Y. Dynamic evaluation of drilling leakage risk based on fuzzy theory and PSO-SVR algorithm. Future Gener. Comput. Syst. 2019, 95, 454–466. [Google Scholar] [CrossRef]
  4. Liu, Z.; Ma, Q.; Cai, B.; Liu, Y.; Zheng, C. Risk assessment on deepwater drilling well control based on dynamic Bayesian network. Process. Saf. Environ. Prot. 2021, 149, 643–654. [Google Scholar] [CrossRef]
  5. Wang, K.; Liu, Y.; Li, P. Recognition method of drilling conditions based on support vector machine. In Proceedings of the 2022 IEEE 2nd International Conference on Power, Electronics and Computer Applications (ICPECA), Shenyang, China, 21–23 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 233–237. [Google Scholar]
  6. Lind, Y.B.; Kabirova, A.R. Artificial Neural Networks in Drilling Troubles Prediction. In Proceedings of the SPE Russian Oil and Gas Exploration & Production Technical Conference and Exhibition, Moscow, Russia, 14–16 October 2014; OnePetro: Richardson, TX, USA, 2014. [Google Scholar]
  7. Liang, H.; Zou, J.; Liang, W. An early intelligent diagnosis model for drilling overflow based on GA–BP algorithm. Clust. Comput. 2017, 22, 10649–10668. [Google Scholar] [CrossRef]
  8. Sabah, M.; Mehrad, M.; Ashrafi, S.B.; Wood, D.A.; Fathi, S. Hybrid machine learning algorithms to enhance lost-circulation prediction and management in the Marun oil field. J. Pet. Sci. Eng. 2021, 198, 108125. [Google Scholar] [CrossRef]
  9. Liang, H.; Han, H.; Ni, P.; Jiang, Y. Overflow warning and remote monitoring technology based on improved random forest. Neural Comput. Appl. 2021, 33, 4027–4040. [Google Scholar] [CrossRef]
  10. Li, M.; Zhang, H.; Zhao, Q.; Liu, W.; Song, X.; Ji, Y.; Wang, J. A New Method for Intelligent Prediction of Drilling Overflow and Leakage Based on Multi-Parameter Fusion. Energies 2022, 15, 5988. [Google Scholar] [CrossRef]
  11. Izonin, I.; Tkachenko, R.; Kryvinska, N.; Tkachenko, P. Multiple Linear Regression Based on Coefficients Identification Using Non-iterative SGTM Neural-like Structure. In International Work-Conference on Artificial Neural Networks; Springer: Cham, Switzerland, 2019; pp. 467–479. [Google Scholar] [CrossRef]
  12. Izonin, I.; Tkachenko, R.; Vitynskyi, P.; Zub, K.; Tkachenko, P.; Dronyuk, I. Stacking-based GRNN-SGTM ensemble model for prediction tasks. In Proceedings of the 2020 International Conference on Decision Aid Sciences and Application (DASA), Sakheer, Bahrain, 8–9 November 2020; pp. 326–330. [Google Scholar]
  13. Litterman, R.B. A random walk, Markov model for the distribution of time series. J. Bus. Econ. Stat. 1983, 1, 169–173. [Google Scholar]
  14. Kitagawa, G. Introduction to Time Series Modeling; Chapman and Hall/CRC: Boca Raton, FL, USA, 2010. [Google Scholar]
  15. Xu, F.; Uszkoreit, H.; Du, Y.; Fan, W.; Zhao, D.; Zhu, J. Explainable AI: A brief survey on history, research areas, approaches and challenges. In Proceedings of the CCF International Conference on Natural Language Processing and Chinese Computing, Dunhuang, China, 9–14 October 2019; Springer: Cham, Switzerland, 2019; pp. 563–574.
  16. Wei, L.; Kumar, N.; Lolla, V.N.; Keogh, E.J.; Lonardi, S.; Ratanamahatana, C.A. Assumption-Free Anomaly Detection in Time Series. SSDBM 2005, 5, 237–242. [Google Scholar]
  17. Perea, J.A.; Deckard, A.; Haase, S.B.; Harer, J. SW1PerS: Sliding windows and 1-persistence scoring; discovering periodicity in gene expression time series data. BMC Bioinform. 2015, 16, 257. [Google Scholar] [CrossRef] [PubMed]
  18. Liu, W.; Fu, J.; Liang, Y.; Cao, M.; Han, X. A Well-Overflow Prediction Algorithm Based on Semi-Supervised Learning. Energies 2022, 15, 4324. [Google Scholar] [CrossRef]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  20. Misra, I.; Maaten, L.V.D. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6707–6717. [Google Scholar]
  21. Wang, Y.; Yao, H.; Zhao, S. Auto-encoder based dimensionality reduction. Neurocomputing 2016, 184, 232–242. [Google Scholar] [CrossRef]
  22. Suk, H.-I.; Initiative, T.A.D.N.; Lee, S.-W.; Shen, D. Latent feature representation with stacked auto-encoder for AD/MCI diagnosis. Brain Struct. Funct. 2015, 220, 841–859. [Google Scholar] [CrossRef] [PubMed]
  23. Aytekin, C.; Ni, X.; Cricri, F.; Aksu, E. Clustering and Unsupervised Anomaly Detection with l2 Normalized Deep Auto-Encoder Representations. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–6. [Google Scholar] [CrossRef]
  24. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2018; pp. 1096–1103. [Google Scholar]
  25. Koh, B.H.D.; Lim, C.L.P.; Rahimi, H.; Woo, W.L.; Gao, B. Deep Temporal Convolution Network for Time Series Classification. Sensors 2021, 21, 603. [Google Scholar] [CrossRef] [PubMed]
  26. Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Doctoral Dissertation, Technical University of Munich: Munich, Germany, 2012; pp. 37–45. [Google Scholar] [CrossRef]
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  28. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  29. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar] [CrossRef]
  30. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
  31. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
  32. Zerveas, G.; Jayaraman, S.; Patel, D.; Bhamidipaty, A.; Eickhoff, C. A transformer-based framework for multivariate time series representation learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 2114–2124. [Google Scholar]
  33. Gallant, S. Perceptron-based learning algorithms. IEEE Trans. Neural Netw. 1990, 1, 179–191. [Google Scholar] [CrossRef]
  34. Sabokrou, M.; Fathy, M.; Hoseini, M. Video anomaly detection and localisation based on the sparsity and reconstruction error of auto-encoder. Electron. Lett. 2016, 52, 1122–1124. [Google Scholar] [CrossRef]
  35. Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 9653–9663. [Google Scholar]
  36. Kenton, J.D.M.W.C.; Toutanova, L.K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  37. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58. [Google Scholar] [CrossRef]
  38. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  39. Li, C.L.; Sohn, K.; Yoon, J.; Pfister, T. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9664–9674. [Google Scholar]
  40. Chowdhury, R.R.; Zhang, X.; Shang, J.; Gupta, R.K.; Hong, D. TARNet: Task-Aware Reconstruction for Time-Series Transformer. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 212–220. [Google Scholar]
Figure 1. A Temporal Convolution Layer (left) and the dilated convolution operation (right). The first convolution projects the input into a hidden dimension, and the second convolution projects it into the output dimension. In each layer, a convolution of size 1 is used as a residual connection. The chomp operation removes the padding values that would otherwise appear in the output and keeps the dimensions consistent.
Figure 2. Structure of Transformer: (a) overall structure of Transformer, (b) Multi-head Attention Layer, (c) Feedforward Network Layer. X is the input of the Transformer layer, Z is the hidden feature space that the Transformer outputs, and Z’ is the output of the Transformer block.
Figure 3. The structure of Auto-Encoder (left) and the Encoder layer of our proposal (right). The inner structure of the encoder is sequential.
Figure 4. Decoders of different tasks. The inner structure of the decoder is sequential.
Figure 5. Simple demonstration of different SSL tasks.
Figure 6. Metrics with different sliding window sizes. The forecasting + TCN + Transformer model is used in this experiment.
Table 1. Selected features with mean and standard deviation.
Feature Name | Mean ± Std | Feature Name | Mean ± Std
Torque (kN·m) | 4.99 × 10^24 ± 7.43 × 10^25 | Bit Pressure (kN) | 20.51 ± 71.03
Total Pump Stroke (spm) | 9.18 ± 30.04 | Inlet Flow log (L/s) | 219.8 ± 233,331.7
Standpipe Pressure (log MPa) | 9.86 ± 19.97 | Outlet Flow log (%) | 4.4 × 10^28 ± 1.48 × 10^31
PWD Vertical Depth (m) | 11.44 ± 269.99 | Lag Time (min) | 259.61 ± 264.74
PWD Annulus Pressure (MPa) | −5.23727 ± 276.1197 | Inlet Temperature (°C) | 1.07 × 10^17 ± 8.06 × 10^19
PWD Angle of Inclination (°) | −0.68 ± 270.27 | Outlet Temperature (°C) | 1.34 × 10^26 ± 1.01 × 10^29
Hook Load (kN) | 846.8 ± 492.44 | Mud Tanks Volume (m³) | 135.7 ± 50.13
Hook Position (m) | 3.93 × 10^21 ± 2.09 × 10^24 | Drilling Time (min/m) | 259.61 ± 264.74
Hook Speed (m/min) | 31.36 ± 49.9 | Circulating Pressure Loss (MPa) | 0.54 ± 0.79
PWD Direction (°) | 117.95 ± 381.01 | Total Hydrocarbons (%) | 5.09 × 10^27 ± 6.2 × 10^29
C2 (%) | −1.68 ± 42.02 | Wellhead Pressure (MPa) | 0.53 ± 0.91
Table 2. A simple view of the data used in training and evaluation.
Dataset | Sample Number | Normal | Overflow | Ratio
Total | 3869 | 3781 | 88 | 42.97
Train | 2901 | 2833 | 68 | 41.66
Test | 968 | 948 | 20 | 47.4
The ratio is calculated with Normal/Overflow, which is used to demonstrate the imbalance problem.
Table 3. Parameter settings of TCN and Transformer.
Model | Batch | Hidden | MLP | Depth | Attention Heads | Kernel Sizes | TCL Depth
TCN + Transformer | 256 | 16 | 8 | 1 | 2 | 3 | 1
Linear + Transformer | 256 | 16 | 8 | 1 | 2 | \ | 1
Transformer | 256 | 22 | 8 | 1 | 2 | \ | \
TCN | 256 | 16 | \ | \ | \ | 3 | 3
Hidden is the inner dimension of MHA, while MLP is the inner dimension of FFN. Attention Heads is the number of heads in MHA. Kernel Sizes and TCL Depth are the parameters of the TCL.
Table 4. Parameter settings of CNN-LSTM.
Model | Batch | Conv1 | Conv2 | Conv3 | LSTM Input | LSTM Hidden | LSTM Layer
CNN-LSTM | 256 | (22, 64, 11) | (64, 128, 7) | (128, 128, 10) | 128 | 50 | 1
In CNN-LSTM, a convolution layer is defined as (input channel, output channel, kernel size). In TCN, kernel size is the initial size that will dilate with depth.
Table 5. Baselines experiment results.
Methods | ACC | Recall | F1 | P | R | F
SVM | 0.97 | 0.862 | 0.747 | 0.385 | 0.75 | 0.508
SVM + PCA | 0.986 | 0.699 | 0.763 | 0.8 | 0.4 | 0.533
KNN | 0.981 | 0.599 | 0.649 | 0.667 | 0.2 | 0.308
KNN + PCA | 0.978 | 0.597 | 0.632 | 0.444 | 0.2 | 0.276
LightGBM | 0.978 | 0.573 | 0.606 | 0.429 | 0.15 | 0.222
LightGBM + PCA | 0.986 | 0.699 | 0.763 | 0.8 | 0.4 | 0.533
CNN-LSTM | 0.971 | 0.863 | 0.756 | 0.41 | 0.75 | 0.526
TCN | 0.965 | 0.914 | 0.744 | 0.359 | 0.86 | 0.506
TCN + Transformer + Forecasting | 0.987 | 0.944 | 0.863 | 0.62 | 0.9 | 0.735
Linear + Transformer + Forecasting | 0.988 | 0.871 | 0.854 | 0.681 | 0.75 | 0.714
Transformer + Forecasting | 0.989 | 0.872 | 0.863 | 0.714 | 0.75 | 0.732
+PCA means PCA is used for dimension reduction in data preparation. +Forecasting means the forecasting task is used as the pre-training task. TCN + Transformer + Forecasting is the framework proposed in this paper. P, R, and F are the corresponding metrics computed on overflow samples only. Numbers in bold are the best results; underlined numbers are the second best.
Table 6. Ablation experiment on SSL pre-training tasks.
Methods | ACC | Recall | F1 | P | R | F
Transformer | 97.87% | 84.22% | 77.90% | 49.10% | 70.00% | 56.89%
+AE | +0.35% | +1.65% | +3.13% | +7.54% | +3% | +6.07%
+MAE | −0.08% | −2% | −0.64% | +0.36% | −4% | −1.24%
+Forecasting | +0.99% | +2.95% | +8.39% | +22.32% | +5% | +16.27%
Linear + Transformer | 98.16% | 80.95% | 78.87% | 55.22% | 63.00% | 58.68%
+AE | −0.17% | −6.45% | −3.73% | −0.31% | −13% | −7.39%
+MAE | −0.21% | −3.53% | −2.85% | −4.47% | −7% | −5.6%
+Forecasting | +0.6% | +6.18% | +6.53% | +12.95% | +12% | +12.75%
TCN + Transformer | 97.97% | 80.85% | 77.64% | 51.32% | 63.00% | 56.32%
+AE | +0.02% | −0.48% | −0.02% | +2.29% | −1% | −0.05%
+MAE | 0% | +3.92% | +1.62% | +0.95% | +8% | +3.23%
+Forecasting | +0.68% | +13.56% | +8.74% | +10.75% | +27% | +17.14%
+AE stands for using the AE task as the pre-training task. For better comparison, we report the differences between supervised-only training and pre-training + supervised training.
Table 7. Ablation experiment on weighted sampling.
Methods | ACC | Recall | F1 | P | R | F
CNN-LSTM | 97.19% | 86.33% | 75.57% | 41.02% | 75% | 52.59%
W/O | 0.074% | −2.067% | −1.131% | −1.033% | −4.3% | −2.301%
TCN | 96.49% | 91.35% | 74.37% | 35.94% | 86% | 50.55%
W/O | −2.176% | −2.09% | −2.508% | −2.657% | −2% | −3.628%
TCN + Transformer | 98.66% | 94.42% | 86.39% | 62.07% | 90% | 73.47%
W/O | −1.116% | −2.283% | −2.745% | −3.704% | −3.5% | −4.784%
W/O stands for without weighted sampling.
Table 8. Statistics of the data set with different sizes of sliding windows.
Window Size | Data Set | Size | Normal | Overflow | Ratio
1000 | train | 2901 | 2833 | 68 | 41.66
1000 | test | 968 | 948 | 20 | 47.4
1500 | train | 1933 | 1891 | 42 | 45.02
1500 | test | 645 | 631 | 14 | 45.07
1800 | train | 1611 | 1573 | 38 | 41.39
1800 | test | 537 | 524 | 13 | 40.31
2000 | train | 1449 | 1417 | 32 | 44.28
2000 | test | 484 | 473 | 11 | 43
2200 | train | 1317 | 1286 | 31 | 41.48
2200 | test | 440 | 429 | 11 | 39
The ratio is calculated with Normal/Overflow, which is used to demonstrate the imbalance problem.
