1. Introduction
High-voltage overhead transmission lines, an important part of modern power systems, increasingly use insulated overhead conductors (IOCs). Compared with bare overhead conductors, IOCs can prevent power outages caused by interline and line-to-ground faults, thereby improving the reliability of modern power systems. In addition, environmental disturbances such as forest fires, the electrocution of animals, and falling trees are mitigated by the use of insulated conductors [
1,
2]. However, ordinary protective devices used for bare conductor systems are unable to detect IOC faults that do not produce overcurrent. IOC faults can cause partial discharge (PD) before breakdown [
2,
3,
4,
5]. Therefore, early PD detection helps to ensure the high reliability and security of power grid assets. In other words, the emphasis of IOC fault detection has shifted from traditional current detection to PD detection. However, due to the small amplitude of PD pulses and interference from external background noise, fault detection through PD identification remains one of the most difficult challenges [
3].
Due to the non-stationarity and randomness of PD signals, various types of time–frequency analysis methods, such as wavelet transform (WT), Hilbert–Huang transform (HHT), and S transform, have been proposed to analyze them. In [
6], a simple algorithm that works in four steps (increasing the signal-to-noise ratio, removing discrete spectral interference (DSI) noise, removing random pulse interference (RPI) noise, and detecting PD) was created for automatic PD pattern detection. In [
7], a fast approach was proposed for detecting partial discharge in data collected by an antenna, with a small computational demand and a low false-positive rate. In [
8], two new features, namely, the template-matching degree and the intracluster concentration degree, were proposed to improve the detection of PD based on the light gradient-boosting machine (LightGBM).
While the above detection methods are effective, with the complexity of power systems increasing, fault diagnosis requires more efficient techniques to automatically extract multilevel features from large amounts of data. Therefore, fault detection methods based on deep learning have been widely applied to PD recognition. Reference [
9] proposed a deep learning neural network model called the stacked sparse autoencoder (SSAE) to extract features from PD signals, and then the extracted features are fed into a softmax classifier to be classified into one of four defined PD severity states. In [
10], the deep belief network is introduced for recognizing and classifying different PD patterns. In [
11], to improve the recognition accuracy, a kind of PD pattern recognition method based on variational mode decomposition (VMD)–Choi–Williams distribution (CWD) spectrum and an optimized convolutional neural network (CNN) with cross-layer feature fusion is proposed. In [
12], a new approach based on discrete wavelet transform (DWT) and a long short-term memory network (LSTM) is presented for the detection of IOC faults according to partial discharge. In [
13], a multichannel CNN–LSTM (convolutional neural network, long short-term memory) network is proposed for fault detection by determining PD. In [
14], a heterogeneous stacking ensemble neural network is applied to classify PDs obtained by the contactless method.
Although various deep learning models have been applied to PD recognition, the recognition accuracy still needs to be improved and the computational complexity reduced. First of all, these models treat all input features equally and do not weight them according to the correlation between each input feature and the recognition result. In addition, they adapt poorly to different background noises.
Recently, the attention mechanism (AM) [
15,
16] has emerged as a research hotspot in the field of deep learning. By assigning different weights to emphasize important information and ignore unimportant information, the AM offers higher accuracy, less computation time, and a better generalization ability. However, the existing work on PD recognition has not taken the AM into consideration.
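As a minimal illustration of how such weighting works, the following NumPy sketch implements generic additive-attention pooling over a sequence of hidden states. The score vector `w` and the states `H` are made-up toy values, and this is not the specific attention formulation proposed later in this paper:

```python
import numpy as np

def attention_pool(H, w):
    """Additive-attention pooling: score each time step, softmax the
    scores into weights, and return the weighted sum of hidden states."""
    scores = np.tanh(H) @ w                # one scalar score per time step
    e = np.exp(scores - scores.max())      # numerically stable softmax
    alpha = e / e.sum()                    # attention weights, sum to 1
    context = alpha @ H                    # weighted sum over time steps
    return context, alpha

# Toy example: 4 time steps, hidden size 3 (made-up numbers).
H = np.array([[0.1, 0.2, 0.0],
              [0.9, 0.8, 0.7],
              [0.2, 0.1, 0.3],
              [0.5, 0.4, 0.6]])
w = np.array([1.0, 0.5, 0.25])

context, alpha = attention_pool(H, w)
print(alpha)    # larger weights on the "more informative" time steps
print(context)  # fixed-length summary vector
```

The time step with the largest scored activation (here, the second row) receives the largest weight, which is how the mechanism emphasizes important information and suppresses the rest.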
To make better use of important information and ignore irrelevant information, this paper introduces the attention mechanism into the Bi-LSTM (bi-directional long short-term memory) framework. Furthermore, to reduce noise in the original PD signals, discrete wavelet transform (DWT) is adopted for denoising. On this basis, a Bi-LSTM model based on the AM and DWT is proposed for PD recognition. The key contributions of this paper are as follows: (1) a noise reduction technique based on DWT is proposed for denoising PD signals; (2) the AM is introduced into the PD recognition model, which improves the recognition accuracy and reduces the computational complexity by emphasizing effective characteristics; (3) Bi-LSTM, which combines past and future information, has an excellent time-series information-mining capability and makes PD recognition more accurate.
The remainder of this paper is organized as follows: The proposed fault detection method based on Bi-LSTM with AM is introduced in
Section 2. In
Section 3, the simulation analysis and comparison results are presented. Finally, conclusions are discussed in
Section 4.
3. Simulation and Analysis
3.1. Data Analysis
The experimental dataset in this paper is from the ENET dataset (
https://www.kaggle.com/competitions/vsb-power-line-fault-detection/data, accessed on 21 December 2018) published by the Technical University of Ostrava (VSB) in 2018. VSB devised a special instrument to measure the PD signals. As shown in
Figure 8, a single-layer coil is wrapped around the IOC to acquire the stray electrical field voltage along the IOC. Furthermore, a capacitive voltage divider is connected in parallel at the voltage output, and the output capacitor is connected in parallel with the inductor to obtain the voltage signal. VSB’s approach is more cost-effective than another possible solution that uses Rogowski sensors to directly measure currents in conductors [
19].
During the experimental measurement, the PD measurement device was placed at the end of the power system substation in an actual medium-voltage system and powered by a 5 km long IOC with a nominal phase voltage of 12.7 kV/50 Hz. The length of IOC contact with the ground was gradually varied (spot contact, 1 m, 5 m, 10 m, and 35 m).
The ENET dataset includes signal data and metadata. The signal data contain 8712 voltage signals from four different locations, which represent the deployment in the real environment, such as forested and hard-to-access terrain. The metadata provide corresponding label information, namely, fault (1) or non-fault (0), and phase A (0), phase B (1), or phase C (2). Each voltage signal is a one-cycle voltage waveform with 800,000 sampling points at a sampling frequency of 40 MHz and was pre-marked as PD (525) or non-PD (8187).
Figure 9 and
Figure 10 show the signals without PD and with PD, respectively. Obviously, both kinds of signals are disturbed by external background noise, which hinders accurate feature extraction. Compared with non-PD signals, PD signals have larger amplitudes (about 60 mV for PD versus 40 mV for non-PD) and stronger fluctuations. Data labels are shown in
Table 1, which records the ID, phase, fault, and other important information of each data point.
The signal_id is the index of each data point, ranging from 0 to 8711, for 8712 data points in total. The id_measurement is the number of one group of three-phase signals. Phase indicates which phase the signal belongs to (0-A, 1-B, or 2-C), and target gives the fault result of the line, where 0 represents no fault (normal) and 1 represents a fault.
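The metadata layout above can be sketched as follows. The assumption that each id_measurement covers three consecutive signal_ids is illustrative and not confirmed by the text; the field names follow Table 1:

```python
# Sketch of how the metadata rows in Table 1 group into three-phase
# measurements. Grouping by consecutive triples is an ASSUMPTION made
# for illustration, not a documented property of the dataset.
N_SIGNALS = 8712

rows = [{"signal_id": i,
         "id_measurement": i // 3,   # assumed grouping of 3 phases
         "phase": i % 3}             # 0-A, 1-B, 2-C
        for i in range(N_SIGNALS)]

n_measurements = len({r["id_measurement"] for r in rows})
print(n_measurements)  # 2904 three-phase groups (8712 / 3)
```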
3.2. PD Signal Denoising
When partial discharge occurs, each pulse signal contains unique information about the IOC’s specific state and fault features. The ENET dataset, acquired from real environments, such as forested and hard-to-access terrain, is subject to interference from external background noise. To improve the model’s anti-noise ability and ensure detection accuracy, it is crucial to reduce noise before feature extraction. In this paper, DWT was used for denoising the original signals, and
Figure 11 and
Figure 12 show the denoised signals, which are smoother than the original signals in
Figure 9 and
Figure 10.
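The paper does not specify the wavelet family, decomposition depth, or threshold rule used, so the following NumPy sketch only illustrates the general DWT denoising idea with a one-level Haar transform and a fixed soft threshold; the synthetic waveform and noise scale are made-up values:

```python
import numpy as np

def haar_dwt_denoise(x, threshold):
    """One-level Haar DWT denoising sketch: decompose, soft-threshold
    the detail coefficients, and reconstruct. The Haar wavelet and the
    fixed threshold are illustrative choices, not the paper's settings."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)      # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)      # detail (noise-dominated)
    d = np.sign(d) * np.maximum(np.abs(d) - threshold, 0.0)  # soft threshold
    y = np.empty_like(x)
    y[0::2] = (a + d) / np.sqrt(2)            # inverse Haar transform
    y[1::2] = (a - d) / np.sqrt(2)
    return y

# Synthetic example: one 50 Hz cycle plus white noise (made-up scale).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 0.02, 4096)
clean = 0.04 * np.sin(2 * np.pi * 50 * t)
noisy = clean + 0.005 * rng.standard_normal(t.size)

denoised = haar_dwt_denoise(noisy, threshold=0.005)
# The denoised trace should be closer to the clean cycle than the noisy one.
print(np.mean((noisy - clean) ** 2) > np.mean((denoised - clean) ** 2))
```

Soft-thresholding shrinks the small, noise-dominated detail coefficients toward zero while leaving the low-frequency approximation band intact, which is why the denoised traces in Figures 11 and 12 look smoother than the originals.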
3.3. Feature Extraction
Feature extraction is the key to PD recognition. Although the voltage signal itself contains all the information, it is difficult to fit the original signal into sets of rules and criteria that can intelligently interpret the latent information it carries. In contrast, feature extraction purposefully extracts useful information. Additionally, reducing the dimensionality of the data can sometimes improve the performance of certain classification algorithms [
20]. Thus, the statistical features and entropy features of PD signals were extracted in this study. Statistical features reflect the morphological characteristics, while entropy features characterize the time–frequency-spectrum complexity and detect small changes in PD signals. There are 19 statistical features, including the mean, standard deviation (std), mean plus std, mean minus std, seven percentiles, the amplitude, and seven relative percentiles. There are seven entropy features, including permutation entropy, singular entropy, approximate entropy, sample entropy, and three fractal dimensions. A one-phase PD signal has 19 statistical features; thus, the three-phase PD signals have 57 statistical features in all, which form a 57-dimensional vector, as depicted in
Figure 13. Considering that the sample size was too large, each PD signal was divided into 160 subgroups of 5000 data points each. The statistical features extracted from all the subgroups were then fused to form a 160 × 57 matrix, which is shown in
Figure 14.
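The subgroup statistics step can be sketched as follows. Only a handful of the 19 statistical features are computed here, and the percentile set is an assumed stand-in, so the per-phase feature count differs from the paper's; the entropy features are omitted entirely:

```python
import numpy as np

# ASSUMED percentile set for illustration; the paper does not list its
# exact percentile choices.
PERCENTILES = [0, 1, 25, 50, 75, 99, 100]

def subgroup_features(signal, n_subgroups=160):
    """Split one 800,000-point phase signal into 160 subgroups of 5000
    points and extract a few statistics from each subgroup."""
    parts = np.split(np.asarray(signal, dtype=float), n_subgroups)
    feats = []
    for p in parts:
        mean, std = p.mean(), p.std()
        row = [mean, std, mean + std, mean - std]
        row += list(np.percentile(p, PERCENTILES))
        feats.append(row)
    return np.array(feats)   # shape: (160, n_features_per_phase)

# Toy one-phase signal (random stand-in for a measured waveform).
rng = np.random.default_rng(1)
phase_signal = rng.normal(0.0, 0.04, 800_000)

F = subgroup_features(phase_signal)
print(F.shape)  # (160, 11) for this reduced feature set
```

Stacking the three phases column-wise would give a (160, 3k) matrix analogous to the paper's 160 × 57 layout.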
3.4. Model Evaluation
The whole experiment in this paper was developed in Python 3.6, and the deep learning model was built with TensorFlow-GPU 1.13.2 and Keras 2.1.5.
This paper uses k-fold cross-validation to evaluate the model's performance and to prevent an unrepresentative split of the training and test sets from distorting the results. Specifically, 5-fold cross-validation is used, as shown in
Figure 15. The dataset is divided into five parts, with four used as training sets (green color represents training sets) and one as the test set (red color represents the test set). This process is repeated five times, and the average values are taken as the final result.
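The splitting scheme can be sketched as follows. This is a plain, unstratified split with an arbitrary shuffling seed; given the fault/non-fault imbalance noted later, a stratified variant may well be preferable, and the paper does not state which was used:

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and yield (train, test) index arrays for
    each of the k folds, mirroring the 5-fold scheme in Figure 15."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Each sample lands in the test set of exactly one fold.
n = 8712   # number of signals in the ENET dataset
seen = np.concatenate([test for _, test in kfold_indices(n)])
print(np.array_equal(np.sort(seen), np.arange(n)))  # True
```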
The parameter settings during the training are listed in
Table 2. The optimization process is terminated when it reaches the maximum number of epochs.
Figure 16 shows the training and test loss curves for one fold. It can be seen that after about 40 epochs, the training loss decreased slowly and the test loss barely changed, with both eventually falling to about 0.07.
Figure 17 displays the Matthews correlation coefficient curve for one fold. It can be seen that after about 20 training epochs, the Matthews correlation coefficient fluctuated around 0.7, reaching a maximum of 0.78.
The confusion matrix is an intuitive representation of the model performance.
Figure 18 shows the detection confusion matrix for the test set of one fold. The rows represent the predicted results (faults and non-faults), while the columns represent the actual results. The diagonal elements correspond to correctly classified samples, and the off-diagonal elements to misclassified samples. In other words, 537 non-fault samples and 25 fault samples were correctly classified, while 8 non-fault samples and 11 fault samples were misclassified.
To thoroughly evaluate the proposed model’s effectiveness, several evaluation metrics were selected, including the accuracy rate, precision rate, recall rate,
F1 score, and the Matthews correlation coefficient (
MCC) [
21]. They are calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
where TP is the number of positive signals correctly labelled as positive, TN is the number of negative signals correctly labelled as negative, FP is the number of negative signals incorrectly labelled as positive, and FN is the number of positive signals incorrectly labelled as negative.
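These definitions can be checked against the single-fold confusion matrix of Figure 18, treating fault as the positive class. Note that Table 3 reports averages over all five folds, so its values differ slightly from this one fold:

```python
import math

# Counts read from Figure 18 (one fold): 25 faults and 537 non-faults
# correctly classified; 8 non-faults and 11 faults misclassified.
TP, TN, FP, FN = 25, 537, 8, 11

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1  = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(round(accuracy, 3))  # 0.967 -> the ~97% accuracy quoted below
print(round(mcc, 2))       # 0.71  -> consistent with Figure 17
```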
Table 3 summarizes the five corresponding evaluation indexes and their average values. It is evident from
Table 3 that the training and testing sets have different accuracies, which reflects the necessity of k-fold cross-validation. Because there is a serious imbalance between fault samples and non-fault samples, the non-fault samples perform better than the fault ones in terms of
precision,
recall, and
F1. Accuracy is the most widely used evaluation index. The proposed method has an accuracy of about 97% for detecting faults and non-faults, indicating its effectiveness for PD detection. However, when fault and non-fault samples are imbalanced, accuracy has a substantial weakness. Therefore, evaluating a model only by its accuracy is neither rigorous nor comprehensive.
The recall rate reflects the ability of the model to identify positive samples, while the precision rate reflects how reliable its positive predictions are. The recall rate of non-fault samples is about 98.4%, and that of fault samples is about 75.9%, indicating that the model has a good ability to recognize non-faults. The precision rate of non-fault samples is about 98.4%, and that of fault samples is about 77.1%, indicating that the model has a good ability to recognize faults. F1 integrates precision and recall: the F1 of non-fault samples is about 98.4%, while that of fault samples is about 75.7%, indicating that the proposed method is effective for fault detection and reasonably robust.
3.5. Algorithm Contrast
To demonstrate the superiority of the proposed AM-Bi-LSTM model, LSTM, Bi-LSTM, and attention-mechanism-based LSTM (AM-LSTM) were chosen as comparison models for PD detection. During the comparison simulation, the parameter settings of the other three models were the same as those of the proposed model. The detection results of the different models are shown in
Figure 19 and
Figure 20. The AM-Bi-LSTM outperformed the other models in terms of the
accuracy,
F1, and
MCC. Models with the attention mechanism, whether LSTM or Bi-LSTM, are better than the corresponding models without it. This is because, with the AM, the model can pay different amounts of attention to the input features depending on their contribution to the detection results.
In addition, to highlight the advantages of the feature extraction in this paper, the evaluation index
F1 is compared with those of the methods proposed in other papers, such as FNN (fuzzy neural network), SVM (support vector machine), XGBoost, and MLR in [
12]. The comparison between these different methods is shown in
Table 4. As can be seen from
Table 4, the model proposed in this paper outperforms the other models in terms of effectiveness. In addition, due to the small proportion of fault data in the public dataset, the value of
F1 obtained by the proposed model for fault samples is slightly higher than those obtained by other models, and the values of
F1 for non-fault samples are far higher than those of the other models. It is worth mentioning that although the comparative algorithms in
Table 4 are not the most advanced, they are the most widely used.
3.6. The Effect of Features on Results
Considering the importance of feature extraction, the PD classification results shown in
Figure 21 are compared based on two different feature combinations. One is based on statistical features, and the other is based on the combination of statistical features and entropy features. Obviously, the accuracy, precision rate, recall rate,
F1 score, and
MCC based on the statistical features plus entropy features are superior to those based on only the statistical features. Specifically, the accuracy is increased by 0.4%, the precision rate is increased by 1.3%, the recall rate is increased by 6.8%, the
F1 score is increased by 4.3%, and the
MCC is increased by 0.4. However, extracting entropy features takes more time than extracting statistical features. Therefore, if high accuracy is required, statistical features plus entropy features can be selected; if real-time performance and efficiency are required, only statistical features can be selected.