1. Introduction
As the main part of the power system that undertakes power transmission tasks, high-voltage direct-current (HVDC) systems can retain the independence of power grids at both ends of transmission and reception, which has distinctive advantages over AC power transmission, such as no inductance and no synchronization. It is an important guarantee for high-capacity grid interconnection and large-scale power exchange [
1,
2,
3]. To solve the problem of increasing the distance between power production and load centers [
4,
5,
6], the HVDC system has taken on the main power transmission task in many transmission projects. Currently, seven ±800 kV HVDC transmission projects have been completed and put into operation in China, effectively contributing to the realization of West-East Power Transmission Project [
7,
8,
9]. However, as the voltage level and scale of HVDC systems increase, system stability under fault conditions and the safety of personnel and equipment become serious concerns, so efficient and comprehensive fault diagnosis is particularly important [
10,
11]. Currently, fault diagnosis methods applied to HVDC transmission systems mainly include analytical model-based, expert system, neural network-based [
12], support vector machine (SVM) [
13], ensemble learning (EM) [
14], and K-nearest neighbor (KNN) methods [
15]. The analytical model-based approach requires a mathematical model of the actual input–output relationships of the power system and is thus easily limited by the complexity of the power system in practical engineering. A neural network is a strong learner with high accuracy and robustness, but its design is complex, and its construction often demands substantial server computing capacity. SVM is a small-sample learning method that avoids the curse of dimensionality in computation and has good robustness. However, it is difficult to apply to large-scale training samples and to the multi-classification problems of the power system. EM combines multiple weak learners to achieve the effect of a strong learner, largely reducing the gray area of traditional single learners. However, it can overfit classification or regression problems with high noise, and it suffers from long iteration times and high data processing costs.
In recent years, HVDC fault diagnosis has attracted widespread academic attention, and various fault diagnosis methods have emerged. Reference [
16] took the 500 kV HVDC transmission system from Yunnan to Guizhou as an example to study the impact of lightning current peak and grounding resistance on the change in shield failure flashover. Reference [
17] analyzed the impact of converter transformers on the single-phase grounding fault current on the grid side and revealed the mechanism of high short-circuit current in the nearby DC area. In order to ensure the safe and stable operation of the HVDC system, the lightning over-voltage at the neutral point of the high-voltage DC converter transformer was studied in work [
18]. To improve the safety of the ±800 kV HVDC system, the grounding mode of the neutral point of the converter transformer was considered and analyzed in work [
19]. Reference [
20] used convolutional neural network (CNN) to identify internal and external faults of HVDC transmission lines, but with long computation time. In view of the sudden changes in current in HVDC transmission line faults, the Teager energy operator was adopted to form a feature vector based on the energy ratio of positive to negative pole current sudden change variables in faults, and 1D-CNN was applied to train and test the feature vector set, thus realizing the effective discrimination of fault types and fault poles inside and outside the area [
21]. For the collection and processing of transient quantity information during faults, reference [
22] carried out variational mode decomposition (VMD) on the transient current signal during faults to obtain the intrinsic mode function (IMF) component of the transient current signal and then calculated multi-scale fuzzy entropy using the intrinsic mode function component of the transient current signal; finally, multi-scale fuzzy entropy was input to the Softmax classifier for HVDC transmission line fault identification. When a fault occurs on an HVDC transmission line, there are a lot of characteristics that can be collected. Reference [
23] used SVM to classify 13 fault characteristics, such as the AC/DC voltage and current of the HVDC transmission line in case of a ground fault, to realize fault identification. Reference [
24] used AdaBoost SVM optimized with the bird swarm algorithm to identify HVDC transmission line faults after extracting fault characteristics from the DC-side voltage signal with the wavelet packet transform. Reference [
25] designed a fault diagnosis method for HVDC transmission systems based on a temporal convolutional network (TCN) optimized with the improved gray wolf optimizer (IGWO) to solve the problems of HVDC transmission line fault diagnosis and pole selection under high transition resistance, addressing the low sensitivity and low identification accuracy of existing HVDC fault diagnosis methods [
26]. However, the aforementioned methods have some distinct problems, such as weak robustness, high modeling cost, and slow diagnosis speed, due to complex models [
27].
This study proposes a fault diagnosis model for HVDC transmission systems based on extreme gradient boosting (XGBoost). XGBoost is an ensemble learning method built on the gradient boosting decision tree (GBDT) that also supports column sampling, which can greatly improve the efficiency of the algorithm and reduce overfitting. Moreover, this study uses the back propagation (BP) neural network and the probabilistic neural network (PNN) as comparison methods to diagnose HVDC system faults. The simulation results show that the proposed method has high accuracy and reliability in fault diagnosis in HVDC systems.
In addition, the main contributions of this study are as follows:
The dataset used with several classification methods is obtained from the Tianshengqiao HVDC transmission project in Guangzhou, China, which is of great practical engineering significance;
XGBoost is applied to HVDC fault diagnosis for the first time;
This research combines knowledge graphs (KGs) with fault diagnosis to realize the visualization of HVDC fault processing.
Moreover, the rest of the article is arranged as follows:
Section 2 shows the application of KGs in HVDC systems;
Section 3 presents fault classification in HVDC systems;
Section 4 is an introduction to the XGBoost algorithm;
Section 5 presents the multi-classification fault diagnosis model based on XGBoost;
Section 6 presents a case analysis of fault diagnosis; finally,
Section 7 presents the summary of the whole paper.
2. Knowledge Graph Platform in HVDC System
With the rapid development of artificial intelligence, knowledge graph (KG) technology has become one of the core driving technologies to promote the development of cognitive intelligence. At the same time, machine learning technology has been widely used [
28].
This research aims to study abnormal signal identification and auxiliary decision making in HVDC systems based on state information. The research content is mainly divided into three parts: sequence-of-events recorder (SER) data abnormal signal identification module, SER data abnormal signal oriented fault identification module, and typical fault auxiliary decision-making module based on KGs. The data used in this study were all measured fault data from the HVDC transmission system of China Southern Power Grid. The actual HVDC system is named the Tianshengqiao (Guangxi Province, China)–Guangzhou (Guangdong Province, China) transmission project. The voltage level of the project is ±500 kV; the total length is 960 km; and the rated power is 1800 MW.
Figure 1 shows the frame diagram of abnormal signal identification and auxiliary decision-making technology of the Tianshengqiao HVDC system, which is divided into three parts, namely, the SER data signal abnormal identification module, the SER data abnormal signal oriented fault identification module, and the typical fault assistant decision-making module based on KGs.
The distributed word vector representation of natural language words provides a new foundation for the in-depth application of different artificial intelligence methods in natural language processing. KG relational reasoning is an effective means to solve knowledge verification, prediction, and reasoning. Focusing on the problems of HVDC systems, i.e., the lack of massive data collection carriers and lack of intelligent means for fault anomaly analysis [
29], this research aimed to build a technical framework for fault diagnosis in HVDC systems based on small-sample machine learning and multi-parameter fusion. The SER data and fault recording data were obtained by sending a request to the HVDC transmission system knowledge base; then, the obtained data were extracted from the key recording segments [
30]. Finally, the processed fault data were input into the HVDC system risk analysis model for fault classification. In particular,
Figure 2 shows the fault handling and risk analysis framework of the KG-based HVDC system. Because of the long route and high voltage level of the Tianshengqiao–Guangzhou HVDC system, fault diagnosis requires high accuracy and high security. KGs can efficiently assist researchers in completing fault diagnosis and, in particular, in handling faults quickly.
3. Fault Classification of HVDC System
HVDC transmission systems are mainly composed of a converter station, a transmission line, and a grounding electrode system [
31]. Among them, the converter station is one of the core components of HVDC transmission systems, has a complex structure, and often becomes a high-incidence area of faults [
32]. HVDC transmission systems have many fault types, such as AC faults, DC faults, inverter commutation faults [
33], converter valve faults [
34], single-phase faults [
35], interphase faults, and lightning stroke faults [
36]. This study constructed a fault diagnosis model based on the XGBoost algorithm according to the measured data of four types of faults in a substation of a southwest power grid, and analyzed and diagnosed the four types of faults [
37].
3.1. Grounding Fault on Converter Transformer Valve Side
Grounding faults on the valve side of the DC converter have spatiotemporal dispersion. A fault occurring at the high- and low-voltage bridges causes a huge difference in the working conditions of the valve bridge [
38]. The relative positions of the fault point and the current transformer also lead to differences in the measured current of the fault phase. A change in the fault occurrence time causes a change in the fault characteristics. When single-phase grounding faults occur in HVDC systems, arc grounding occurs at the fault point, and over-voltage and resonant over-voltage form in the non-fault phases of the faulted line bus. Compared with normal distribution network operation, the over-voltage reaches 1.732 times the original voltage level under complete grounding, or the resulting resonance over-voltage exceeds the withstand range of the line and directly burns the line. The impact of single-phase grounding faults on distribution network lines is direct. If the line is repeatedly subjected to voltage rises, the aging of weak insulation points in line cables and equipment accelerates, and the breakdown of these weak points causes short circuits. While a distribution network line operates with a ground fault, the fault may cause relay-type short circuits and power failure. For over-voltage faults of small current grounding systems, fault line selection, fault point location, and distance measurement are difficult. Researchers can address the reliable detection of small current grounding faults by studying the characteristics of single-phase grounding faults of medium- and low-voltage distribution networks, promptly finding the grounded line, locating the fault point, and taking corresponding treatment measures.
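The factor of 1.732 quoted above is the square root of 3. A brief worked derivation follows, assuming an isolated-neutral (small current grounding) system with balanced phase voltages of magnitude $U_{ph}$ and a solid ground fault on phase a. The symbols are introduced here for illustration and do not appear in the original text.

```latex
\[
\begin{aligned}
\dot{U}_a &= U_{ph}, \qquad \dot{U}_b = U_{ph}\,e^{-j120^\circ}, \qquad \dot{U}_c = U_{ph}\,e^{+j120^\circ},\\
\dot{U}_N &= -\dot{U}_a \quad \text{(neutral displacement when phase } a \text{ is solidly grounded)},\\
\dot{U}_b' &= \dot{U}_b + \dot{U}_N = U_{ph}\left(e^{-j120^\circ} - 1\right),\\
\lvert \dot{U}_b' \rvert &= U_{ph}\sqrt{(-1.5)^2 + (0.866)^2} = \sqrt{3}\,U_{ph} \approx 1.732\,U_{ph}.
\end{aligned}
\]
```

The same result holds for phase c by symmetry, so both healthy phases rise from phase voltage to line voltage with respect to ground.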
3.2. Interphase Short-Circuit Fault on Converter Transformer Valve Side
Interphase short circuits refer to power supply short circuits caused by the connection between the end lines with no passing of the load. Interphase short circuits only have positive sequence current [
39] and negative sequence current, and no zero sequence current. They include two-phase short circuits and three-phase short circuits. When interphase short circuits occur in HVDC systems, the harmonic component and its variation rule in the line are consistent with those of single-phase grounding faults, but the harmonic component content in interphase short circuits is higher, so the probability of 50 Hz protection maloperation during fault recovery is higher.
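The absence of zero sequence current in an interphase fault can be checked numerically with the symmetrical component transform. The following minimal sketch assumes an idealized phase-b-to-phase-c short circuit in which phase a carries no fault current; the function name and the current magnitude are hypothetical and are not taken from the measured data.

```python
import cmath
import math

# Fortescue operator a = 1 at an angle of 120 degrees
A = cmath.exp(2j * math.pi / 3)


def sequence_components(ia, ib, ic):
    """Return the (zero, positive, negative) sequence phasors of three phase currents."""
    i0 = (ia + ib + ic) / 3
    i1 = (ia + A * ib + A * A * ic) / 3
    i2 = (ia + A * A * ib + A * ic) / 3
    return i0, i1, i2


# Assumed idealized b-c short circuit: the fault current circulates
# between phases b and c, so ib = -ic and ia = 0.
i_fault = 10.0  # illustrative magnitude in kA, not a measured value
ia, ib, ic = 0j, complex(i_fault, 0.0), complex(-i_fault, 0.0)

i0, i1, i2 = sequence_components(ia, ib, ic)
print(f"|I0| = {abs(i0):.3f} kA, |I1| = {abs(i1):.3f} kA, |I2| = {abs(i2):.3f} kA")
```

Under these assumptions, the zero sequence component vanishes, while the positive and negative sequence components have equal magnitude.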
3.3. Short-Circuit Fault of Converter Valve Arm
As core equipment of HVDC systems, converters undertake the energy conversion function in AC/DC systems, that is, they convert AC electric energy into DC electric energy at the power transmission end of the system and then transmit it to the AC power grid at the receiving end to complete the energy transmission process (from the sending end to the receiving end). Bridge arm short-circuit faults of the converter valve are common faults of the converter valve. After such fault occurs, the AC system alternately has two-phase short circuits and three-phase short circuits; then, the AC system power cannot be transmitted to the receiving converter station through the DC line, and the receiving power grid cannot receive the DC power normally, which has a serious impact on the AC systems on both sides. Valve arm short-circuit faults can be divided into AC-side area valve arm short-circuit fault and DC-side area valve arm short-circuit fault. AC-side valve arm short-circuit faults mainly refer to interphase short-circuit faults caused by the reduction in the interphase insulation performance of the converter valve side. DC-side valve arm short-circuit faults include single-bridge valve arm short-circuit faults, single-phase valve short-circuit faults, and pole bus and neutral bus short-circuit faults [
40].
3.4. Fault of Converter Valve Group
Due to the nonlinearity of the converter, a large number of harmonics are generated in the HVDC system during operation, distorting the voltage and current of the transmission system and thus degrading power quality. The core of the converter is the converter valve group, which is its key piece of equipment. Therefore, the analysis of the harmonic characteristics of converter valve group faults (valve false opening and valve non-opening faults) is of great significance to the safe and stable operation of HVDC transmission systems.
4. HVDC Fault Identification Based on XGBoost
In classification and recognition, a model with excellent recognition performance can usually be built with the aid of ensemble ideas. Boosting is a supervised classification learning method that combines weak classifiers to form a strong classifier; examples include AdaBoost and GBDT. Each submodel or subtree tries to enhance the overall effect of the model by constantly iterating and updating sample point weights. The boosting method has excellent classification and recognition performance when the dataset is not complex. However, when the dataset is complex, the number of iterations increases, which directly leads to a sharp increase in the amount of computation. This not only slows down model training but also degrades the final classification and recognition performance of the model. This is the biggest disadvantage of the boosting algorithm.
Motivated by this, an XGBoost model is constructed based on the boosting ensemble idea and the parallel construction of regression trees implemented in C++, which is consistent with the GBDT idea. Each iteration is trained on the residual of the weak classifier generated by the previous iteration. Multiple weak classifiers are trained over multiple iterations and then combined into an accurate and efficient ensemble learner. Both XGBoost and GBDT improve the model along the negative gradient direction of the loss function. However, XGBoost has higher accuracy, better generalization ability, and higher efficiency than GBDT.
4.1. Principle of XGBoost
The training objective of XGBoost is to minimize the error between the predicted value and the real value through successive predictions. XGBoost integrates many classification and regression trees (CARTs) generated over iterations, and the new tree generated in each iteration builds on the training and prediction of the tree generated in the previous iteration, that is, it is optimized along the negative gradient direction of the loss function. Each iteration generates a tree to fit the prediction residual of the tree from the previous iteration, and the process continues until the residual can no longer be reduced, thus improving the performance of the model.
The generation of the XGBoost tree depends on the additive model of decision trees and the forward stagewise algorithm. The features are continuously split to generate a tree. Each newly generated tree is equivalent to a new function that fits the residuals of the last prediction, yielding a new predictive value and new residuals. This training process is repeated; this is the forward stagewise algorithm. When
N trees are obtained after training, each sample falls into one leaf node of each tree, and that node carries a corresponding predictive value. The final predictive value of the sample is the sum of these node values over all trees, which is the additive model. The specific steps are shown in
Figure 3.
XGBoost only uses the decision tree as the basic classifier, which essentially integrates multiple decision trees. Therefore, the model can be expressed as follows [41]:
$$f_N(x_i) = \sum_{n=1}^{N} T(x_i; \Theta_n)$$
where $T(x_i; \Theta_n)$ is the decision tree, $x_i$ is the ith input sample, $\Theta_n$ is the parameter of the corresponding decision tree, N is the number of decision trees, and $f_n(x_i)$ represents the predicted value of the model after the nth iteration.
The generation of the XGBoost tree needs to initialize the predictive value of each sample to be 0, that is, $f_0(x_i) = 0$; then, the model of the nth iteration is [42]
$$f_n(x_i) = f_{n-1}(x_i) + T(x_i; \Theta_n)$$
The parameters can be obtained by minimizing the loss function of the algorithm; in particular, the solution formula is
$$\hat{\Theta}_n = \arg\min_{\Theta_n} \sum_{i} L\big(y_i, f_{n-1}(x_i) + T(x_i; \Theta_n)\big)$$
where $y_i$ is the true value of the ith sample and $L$ is the loss function.
The boosted tree model, $f_N(x)$, completed by the final iteration depends on the forward stagewise algorithm and the additive model of XGBoost. The XGBoost model is obtained by adding the N decision trees, $T(x; \Theta_n)$, obtained by means of iteration.
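As an illustration of the iterative procedure described in this subsection, the following minimal sketch boosts one-split regression trees (stumps) on a toy one-dimensional dataset under squared loss, for which the residual equals the negative gradient. It is a simplified stand-in and not the XGBoost implementation: it omits regularization, second-order information, and column sampling, and all function names and data values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        wl, wr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - wl) ** 2 for r in left) + sum((r - wr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, wl, wr)
    _, thr, wl, wr = best
    return lambda x: wl if x <= thr else wr


def boost(xs, ys, n_trees):
    """Build the additive model step by step, starting from a zero prediction."""
    trees = []
    preds = [0.0] * len(xs)  # initial predictive value of every sample is 0
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)  # new tree fits the current residuals
        trees.append(tree)
        preds = [p + tree(x) for p, x in zip(preds, xs)]
    return trees


def predict(trees, x):
    """Final prediction is the sum of the node values of all trees."""
    return sum(tree(x) for tree in trees)


# Toy data (hypothetical): a step-like target with small noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.0, 1.1, 3.0, 3.2, 2.9]


def mse(trees):
    return sum((y - predict(trees, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


errors = [mse(boost(xs, ys, n)) for n in (1, 2, 5, 20)]
print(errors)
```

Because each stump is a least-squares fit to the current residuals, the training error in this sketch cannot increase as more trees are added.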
4.2. Construction of Loss Function
For a dataset $D = \{(x_i, y_i)\}$ with a samples and b features, the final prediction output, $\hat{y}_i$, of M classification regression trees is [43]
$$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i), \qquad f_m \in \mathcal{F}, \qquad \mathcal{F} = \{ f(x) = w_{q(x)} \}$$
where $\hat{y}_i$ is the final predictive value of the ith sample, which is obtained by summing up the scores of leaf nodes $w_{q(x_i)}$ corresponding to each classification regression tree $f_m$; T is the number of leaf nodes, that is, each tree $f_m$ has a leaf tag q and a leaf weight $w_{q(x_i)}$ corresponding to the current prediction sample; and $\sum_{m=1}^{M} f_m(x_i)$ is the sum of the predicted scores of the weights of leaf nodes q corresponding to all classification regression trees, that is, the final predictive value of the XGBoost model for the sample.
To make the XGBoost model learn fully and achieve the best classification and recognition performance, it is necessary to minimize the loss function of the XGBoost model and, at the same time, add regularization terms to prevent the model from being too complex. The loss function of the XGBoost model is [44]
$$L = \sum_{i=1}^{a} l(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(f_m), \qquad \Omega(f_m) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$
where $L$ is the loss function that measures the deviation between the predicted value of the sample and the true value, including the deviation term $l(y_i, \hat{y}_i)$ and the regularization term $\Omega(f_m)$ to prevent overfitting; and γ and λ are the regularization parameters that control the model complexity. The larger the parameter values are, the less likely the model is to overfit.
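To build the final XGBoost model, one needs to calculate $f_m$ of each tree. It is necessary to train the tth tree with the forward stagewise algorithm. By setting the initial predictive value to 0, that is, $\hat{y}_i^{(0)} = 0$, the following model is obtained with t iterations [45]:
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$
By summing the iteratively generated t trees with the additive model, the objective function is
$$L^{(t)} = \sum_{i=1}^{a} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + C$$
where $C$ collects the regularization terms of the first t − 1 trees, which are constant at the tth step. The second-order Taylor expansion of each training objective function is obtained as follows:
$$L^{(t)} \approx \sum_{i=1}^{a} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + C$$
where
$$g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}$$
are the first- and second-order gradients of the loss function, respectively. By removing the constant terms, we can obtain the simplified objective function of the tth training as follows:
$$\tilde{L}^{(t)} = \sum_{i=1}^{a} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)$$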
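The simplified objective depends on the data only through the first- and second-order gradients. As a hedged illustration, the sketch below computes these gradients for the logistic loss and then applies two standard consequences of the quadratic objective that are not stated in the text above: the optimal weight of a leaf, $w^* = -G/(H + \lambda)$, and the gain of a candidate split, where $G$ and $H$ are the sums of the gradients over the samples in the leaf. The choice of logistic loss, the function names, and the toy labels are assumptions made for this example.

```python
import math


def logistic_grad_hess(y_true, raw_score):
    """First- and second-order gradients of the log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)


def leaf_weight(g_sum, h_sum, lam):
    """Weight that minimizes G*w + 0.5*(H + lam)*w**2 for a single leaf."""
    return -g_sum / (h_sum + lam)


def leaf_score(g_sum, h_sum, lam):
    """Twice the objective reduction achieved by a leaf with the optimal weight."""
    return g_sum ** 2 / (h_sum + lam)


def split_gain(left, right, lam, gamma):
    """Objective reduction from splitting one leaf into two.

    `left` and `right` are lists of (g, h) pairs; gamma penalizes the extra leaf.
    """
    gl, hl = sum(g for g, _ in left), sum(h for _, h in left)
    gr, hr = sum(g for g, _ in right), sum(h for _, h in right)
    return 0.5 * (leaf_score(gl, hl, lam) + leaf_score(gr, hr, lam)
                  - leaf_score(gl + gr, hl + hr, lam)) - gamma


# Toy binary labels (hypothetical); the previous prediction is 0 for every sample
labels = [0, 0, 1, 1]
stats = [logistic_grad_hess(y, 0.0) for y in labels]
left, right = stats[:2], stats[2:]

lam, gamma = 1.0, 0.0
w_left = leaf_weight(sum(g for g, _ in left), sum(h for _, h in left), lam)
w_right = leaf_weight(sum(g for g, _ in right), sum(h for _, h in right), lam)
gain = split_gain(left, right, lam, gamma)
print(w_left, w_right, gain)
```

In this toy case, the split separates the two classes, so the left leaf receives a negative weight, the right leaf receives a positive weight, and the gain is positive. Increasing λ shrinks the leaf weights, and increasing γ lowers the gain of every split, which is how the two parameters restrain model complexity.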
5. Multi-Classification Fault Diagnosis Model Based on XGBoost
In this section, based on the measured fault data of a substation of a southwest power grid, the specific electrical diagram of the fault points and fault types of the transmission system is shown in
Figure 4. In particular,
Table 1 summarizes the fault types represented by the number of each fault point. From the original dataset, the recording data of 15 cycles before and after the fault were extracted, that is, the extraction duration of the recording data was 0.5 s. In the extraction of the recording data, 11 representative signal channels were sorted out. The specific signal meaning is described in
Table 2. Among them, the elements in the data samples of single-phase ground faults, interphase faults, converter valve arm short-circuit faults, and converter valve group faults were
N1 = 56,
N2 = 42,
N3 = 44, and
N4 = 96, respectively. Based on the original dataset, the XGBoost algorithm was used for fault diagnosis, and the effectiveness of this method was verified. The algorithm implementation process is shown in
Figure 5, and the meaning of the six classifiers therein is elaborated in
Table 3; the first brace indicates the four labels of the XGBoost algorithm, and the second brace shows the specific classification of the six classifiers.
The specific steps are as follows: Firstly, the 11 channel data of each sample in each type of fault data were concatenated in series for data preprocessing and then stacked according to the number of samples to form a full fault dataset. Then, 80% of the full fault dataset was randomly selected as the training dataset, and the remaining 20% was used as the test dataset. Secondly, ensemble learning was used to extract the features of the fault data, and the 80% training portion was used to train the model. Multi-classification XGBoost was used to determine the number of classifiers and labels. According to the multi-classification XGBoost method, the number of fault classes was four, so six classifiers were required. Among them, the labels of single-phase ground fault, interphase fault, converter arm short-circuit fault, and converter valve group fault were 1, 2, 3, and 4, respectively. The specific classification method and explanation are shown in
Table 3.
Figure 6 shows the data waveforms of 11 channels corresponding to the four HVDC faults.
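The preprocessing steps described in this section can be sketched as follows. The example uses synthetic values in place of the measured recordings and keeps only the class sizes and the number of channels from the text; the number of points per channel, the random seed, and all names are assumptions. The six classifiers are interpreted here as the pairwise (one-vs-one) combinations of the four labels, which is an inference from the stated count rather than a confirmed detail of the method.

```python
import random
from itertools import combinations

N_CHANNELS = 11            # signal channels per sample (from the text)
POINTS_PER_CHANNEL = 4     # hypothetical; real recordings are much longer
CLASS_COUNTS = {1: 56, 2: 42, 3: 44, 4: 96}  # label -> sample count N1..N4

rng = random.Random(0)     # fixed seed so the sketch is reproducible


def make_sample():
    """Stand-in for one fault recording: 11 channels of synthetic values."""
    return [[rng.random() for _ in range(POINTS_PER_CHANNEL)]
            for _ in range(N_CHANNELS)]


def flatten(channels):
    """Concatenate the channels in series into a single feature vector."""
    return [value for channel in channels for value in channel]


# Stack the flattened samples of all four fault types into one dataset
dataset = [(flatten(make_sample()), label)
           for label, count in CLASS_COUNTS.items()
           for _ in range(count)]

# Random 80/20 split into training and test data
rng.shuffle(dataset)
n_test = int(0.2 * len(dataset))
test, train = dataset[:n_test], dataset[n_test:]

# Pairwise label combinations: four classes give six binary classifiers
pairs = list(combinations(sorted(CLASS_COUNTS), 2))

print(len(dataset), len(train), len(test))
print(pairs)
```

With the class sizes given in the text, the dataset holds 238 samples, and truncating 20% of that total yields a 47-sample test set.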
As reported in the next section, after determining the number of data classifiers and training data, the remaining 20% of data were used as test samples for fault diagnosis and classification, and the test results were compared with the standard fault category threshold. In addition, the Euclidean distance between the test results and the standard fault threshold was used to determine the fault type. Finally, the accuracy of fault data diagnosis using this method was discussed.
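The distance-based decision mentioned above can be illustrated with a nearest-reference rule. The text does not specify the form of the standard fault threshold, so this sketch assumes one reference vector per fault label (here a one-hot vector) and assigns a test output to the label whose reference is closest in Euclidean distance; the reference vectors and the function name are hypothetical.

```python
import math

# Assumed reference ("standard") output vector for each fault label
REFERENCES = {
    1: [1.0, 0.0, 0.0, 0.0],  # single-phase ground fault
    2: [0.0, 1.0, 0.0, 0.0],  # interphase fault
    3: [0.0, 0.0, 1.0, 0.0],  # converter arm short-circuit fault
    4: [0.0, 0.0, 0.0, 1.0],  # converter valve group fault
}


def nearest_fault(output):
    """Return the label whose reference vector is closest to `output`."""
    return min(REFERENCES, key=lambda label: math.dist(output, REFERENCES[label]))


# Hypothetical model output for one test sample
label = nearest_fault([0.1, 0.7, 0.2, 0.0])
print(label)
```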
6. Case Study
In this section, the remaining 20% of the full dataset is used as test data to verify the accuracy of XGBoost. Note that the test data were randomly selected from the full dataset. In particular, the BP neural network and PNN were used as comparison methods to verify the superiority and effectiveness of XGBoost. We input the test samples into XGBoost, BP neural network, and PNN, respectively. The test set data of XGBoost were
N1 = 10,
N2 = 5,
N3 = 27, and
N4 = 5; the test set data of BP neural network were
N1 = 16,
N2 = 9,
N3 = 5, and
N4 = 17; the test set data of PNN were
N1 = 21,
N2 = 5,
N3 = 5, and
N4 = 16; and the test set data of Classification learner were
N1 = 14,
N2 = 11,
N3 = 9, and
N4 = 13. The parameter settings of the six methods are shown in
Table 4. We compared the accuracy of fault diagnosis of the six methods using test sets of the same size.
To intuitively observe the accuracy of fault diagnosis of the six methods, confusion matrices were drawn for data statistics and analysis. After the six methods trained their fault diagnosis models, the fault diagnosis results of XGBoost, BP neural network, PNN, Classification learner, SVM, and KNN were obtained and are shown in
Figure 7a,
Figure 7b,
Figure 7c,
Figure 7d,
Figure 7e, and
Figure 7f, respectively. All six methods produced some errors in the diagnosis results of the four types of faults. However, the diagnostic error rates of BP neural network, PNN, Classification learner, SVM, and KNN were significantly higher than that of XGBoost. In particular, XGBoost had 100% accuracy in identifying single-phase ground faults, converter arm faults, and converter valve group faults. The BP neural network only had 100% accuracy in identifying single-phase ground faults, while PNN and Classification learner showed different degrees of misdiagnosis of the four faults, which shows that XGBoost could accurately and efficiently extract and identify the characteristics of fault data.
Finally, according to the confusion matrices, the accuracy of the six methods in diagnosing the four types of faults in the HVDC system was obtained, as shown in
Table 5. In addition, the five parameters of F1-score, precision score, recall score, AUC score, and test time obtained with the six methods are shown in
Table 6. Note that all parameters are the average values obtained after five-fold cross-validation. The fault diagnosis accuracy of the XGBoost model on the full dataset was as high as 87.23%, while the fault diagnosis accuracy rates of BP neural network, PNN, Classification learner, SVM, and KNN were only 74.47%, 78.72%, 72.30%, 55.32%, and 65.96%, respectively, which demonstrates the superiority of XGBoost in HVDC system fault diagnosis. All simulation experiments were run in the PyCharm Community Edition 2022 Python environment on a computer configured with a 2.90 GHz Intel(R) Core(TM) i5-9400 CPU, 32.0 GB RAM, and 64-bit Windows 10.
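For reference, accuracy, precision, recall, and F1-score can all be derived from a confusion matrix. The following minimal sketch shows the computation for a four-class case. The matrix values are hypothetical and are not the results reported in Figure 7 or Tables 5 and 6; the function name is likewise an assumption.

```python
def per_class_metrics(cm):
    """Compute overall accuracy and per-class (precision, recall, F1).

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # class k missed
        fp = sum(cm[i][k] for i in range(n)) - tp    # others labeled as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return accuracy, metrics


# Hypothetical confusion matrix for four fault classes (illustration only)
cm = [
    [9, 1, 0, 0],
    [0, 4, 1, 0],
    [0, 0, 10, 0],
    [1, 0, 0, 4],
]

accuracy, metrics = per_class_metrics(cm)
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)
print(f"accuracy = {accuracy:.4f}, macro F1 = {macro_f1:.4f}")
```

Averaging the per-class scores with equal weight, as done for `macro_f1` here, is one common convention; the averaging scheme used for Table 6 is not stated in the text.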