1. Introduction
It is well known that an oil-immersed transformer is the most expensive component in the power grid, and the failure of a transformer can result in a widespread blackout and incalculable economic losses [1,2]. Hence, it is important to recognize the condition of a transformer and perform proper maintenance before it fails [3]. Over the past decades, dissolved gas analysis (DGA) has been widely used for incipient transformer fault identification [4]. However, offline DGA tests are not always effective because sudden faults can develop between sampling intervals.
In recent years, with the development of sensor and computing technology, online monitoring combined with artificial intelligence (AI) algorithms, such as the support vector machine (SVM) [5,6], the back-propagation neural network (BPNN) [7,8] and their improved variants, has been applied to the status evaluation of power transformers. Although AI algorithms can establish a complex, nonlinear relationship between transformer faults and the feature gases, a large number of samples is needed during training [9] to achieve good diagnosis accuracy. In practice, however, the fault occurrence rate of an individual transformer is rather low, so only a few historical samples can be collected by an online monitor, which makes training difficult. Furthermore, most existing AI algorithms are used only for fault classification. What matters more is power transformer condition prediction, since discovering latent faults in advance is more helpful for avoiding serious transformer problems.
To address diagnosis with small samples and transformer status forecasting, a new hybrid model combining the advantages of ensemble learning (EL) and a criss-cross optimization (CSO)-optimized neural network (CSO-NN), i.e., EL-CSO-NN, is proposed in this study for transformer condition evaluation and prediction, building on the online DGA monitoring system developed in our previous work [10]. The CSO algorithm was proposed by Meng et al. [11]; it is a heuristic optimization algorithm that has been widely used. For example, Xiangang Peng et al. applied CSO to an ODGA problem involving interacting operators [12]. In this study, the CSO algorithm is employed in the prediction of the transformer condition.
The contributions of this work are as follows: (1) ensemble learning via the bagging algorithm [13] was applied to transformer fault diagnosis for the first time to deal with small samples, which usually have an unbalanced distribution, and (2) on the basis of our newly developed CSO algorithm [14], the CSO-optimized artificial neural network (CSO-NN), which circumvents the backpropagation (BP) algorithm's tendency to become stuck in local minima, was used for DGA data prediction for the first time to predict the condition of power transformers. In the new model, EL is used for transformer fault classification, while the CSO-NN is used to forecast the DGA series data. To validate the hybrid model, verification was performed, and EL was found to be superior to other methods in terms of diagnosis accuracy even when the training data set was smaller than the test data set. Combined with the CSO-NN prediction method, the forecasting results showed that the proposed EL-CSO-NN model has advantages in terms of both fault type classification and power transformer condition prediction. All of these aspects are elaborated on in this paper.
2. Framework of the Online DGA System
Figure 1 shows the framework of the online DGA system, which focuses on dissolved gas measurement and is used to monitor the transformer status. The whole online DGA monitoring system contains several units, namely, oil sampling, gas extraction, gas separation, gas detection, data sampling, data analysis and fault diagnosis units. A controlling unit is used to control and connect the other parts. Each component involved is discussed in this section.
As gases are generated and dissolved in the transformer oil, the feature gases must be extracted from the oil for gas chromatography (GC) analysis. The gas extraction module integrates the oil sampling and gas extraction units. After extraction, the feature gases are mixed; chromatographic separation is then performed on the mixed gas by the separation unit.
After separation, the target gases need to be detected. The solid oxide fuel cell (SOFC) detector designed in Ref. [15] was adopted for the online monitor. Although the SOFC sensor is mostly used as an oxygen sensor, it is also sensitive to H2, C2H4, C2H6, C2H2 and CO, since these gases react readily with oxygen. Furthermore, the SOFC sensor only requires nitrogen as the carrier gas. Therefore, the SOFC detector offers the advantages of high sensitivity and low cost.
Within the online DGA monitoring system, a data analysis and fault prediction unit based on the EL-CSO-NN was implemented. Hence, based on the measurement results, the transformer status and fault type can be evaluated automatically. This makes it possible to distinguish and predict fault types and to provide maintenance personnel with suggestions for further action. In the following sections of this paper, the EL-CSO-NN algorithms for transformer fault diagnosis and forecasting are detailed.
3. Ensemble Learning for Transformer Fault Diagnosis
3.1. Bagging Algorithm
Ensemble learning [13] with bagging is a machine learning method in which many classifiers are integrated to obtain better performance than any single one. It combines several machine learning models to reduce prediction errors: several models are trained, and for each test sample their outputs are combined (e.g., averaged). This combination and averaging of models is expected to yield more accurate predictions than any individual model. There are two types of ensemble learning methods, heterogeneous and homogeneous, which are distinguished by the type of base classifiers. In heterogeneous EL, base classifiers of different types are integrated, whereas in homogeneous EL, all base classifiers are of the same type, although their parameters differ. The base classifiers can be selected from decision trees [16], artificial neural networks, k-nearest neighbors, etc. On one hand, to achieve better performance, the classification error of each base classifier should be less than 50%; otherwise, the error rate of the final result increases. On the other hand, the base classifiers should be clearly diverse: if their outputs are similar to one another, the ensemble result will not improve on the decision of a single classifier. The technique is effective because the variances of the individual classifiers on the same test data set differ. The variance of high-variance algorithms can be reduced by the general procedure of bootstrap aggregation, for which the classification and regression tree (CART) [16] is the most frequently used base learner. A CART uses the Gini-index minimization criterion for feature selection, taking the attribute and split with the smallest Gini value as the splitting property. Each internal node is then split into two child nodes by binary recursive splitting, which forms a binary tree with a simple structure.
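For reference, the Gini index used in this splitting criterion takes the standard textbook form below (not reproduced from the original), where p_k is the proportion of class k among the K classes at node t, and t_L and t_R are the child nodes produced by a candidate split of the N samples at t; CART selects the split that minimizes the weighted child impurity:

```latex
\mathrm{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^{2}, \qquad
\mathrm{Gini}_{\mathrm{split}}(t) = \frac{N_L}{N}\,\mathrm{Gini}(t_L) + \frac{N_R}{N}\,\mathrm{Gini}(t_R)
```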
In this study, transformer fault diagnosis was essentially treated as a classification problem. Hence, bagging of CART models was suitable and worked as follows: the samples from the prepared DGA data set were resampled to create different sub-samples, and a CART model was trained on each sub-sample. For a new test data set, the diagnosis was then performed with each model. The transformer faults were classified into five types: according to the energy level, high-energy discharge (high-intensity arcing), low-energy discharge and partial discharge (low-intensity discharge); and according to the temperature level, low- and medium-temperature thermal faults (<700 °C) and high-temperature thermal faults (>700 °C). All of the outputs from the trained CART models were collected, and the most frequent class was taken as the output class. When bagging decision trees, overfitting of the training data by each individual tree is of less concern. To maintain efficiency and preserve the variability of the base classifiers, the generated trees were not pruned; each classification tree kept branching until the overall impurity was optimal. Without pruning, the trees therefore usually grew very deep and had high variance as well as low bias. The advantage of bagging decision trees is that only a few parameters are needed, namely, the sample and tree numbers. As the number of classifiers increases, more time is needed for training and the efficiency is affected. However, it is worth noting that bagging does not overfit the training data, i.e., it avoids a model that performs excellently on the training set but poorly on the test set. Bagging can be applied to both classification and regression problems.
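As a concrete illustration, the sketch below shows how such a bagged-CART diagnoser could be assembled in Python with scikit-learn. This is only an illustrative reconstruction under our own assumptions (the library choice, variable names such as X_dga and y_fault, and synthetic stand-in data), not the authors' implementation; the tree count of 175 and the 75% sub-sampling ratio anticipate the settings reported in Section 3.2.

```python
# Minimal sketch of bagging unpruned CART classifiers for DGA-based fault
# diagnosis (assumes scikit-learn; data and names are placeholders).
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# The five fault classes described in the text, encoded as 0..4.
FAULT_CLASSES = [
    "high-energy discharge",
    "low-energy discharge",
    "partial discharge",
    "low/medium-temperature thermal fault (<700 C)",
    "high-temperature thermal fault (>700 C)",
]

def build_bagged_cart(n_trees=175, sample_fraction=0.75, seed=0):
    """Bag unpruned CART base classifiers trained on bootstrap sub-samples."""
    return BaggingClassifier(
        DecisionTreeClassifier(criterion="gini"),  # unpruned tree, grows deep
        n_estimators=n_trees,                      # number of base classifiers
        max_samples=sample_fraction,               # fraction of training data per tree
        bootstrap=True,                            # resampling with duplication
        random_state=seed,
    )

# Example usage with synthetic stand-in data (five gas concentrations per record).
X_dga = np.random.rand(90, 5)                  # 90 training records (placeholder)
y_fault = np.random.randint(0, 5, size=90)     # encoded fault labels (placeholder)
model = build_bagged_cart().fit(X_dga, y_fault)
label = model.predict(np.random.rand(1, 5))[0]  # majority-voted class
print(FAULT_CLASSES[label])
```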
3.2. Structure of Fault Diagnosis
The structure of the bagging-based ensemble learning approach for transformer diagnosis based on DGA is shown in Figure 2. A series of classifiers (called base classifiers) was trained on DGA samples, and their output classification results were fused to obtain a better overall decision for a faulted transformer. By combining the decisions of these base classifiers, the generalization capability of the ensemble classifier for fault classification could be improved.
Obtaining a series of diverse base classifiers and integrating their output results are both essential to ensemble learning. To construct the ensemble classifier in this study, a series of training subsets originating from the DGA set (90 records, 45% of the whole data set) was used to train different classifiers. The base classifier used was a decision tree, and the number of trees was set to 175. For each base estimator, 75% of the samples were drawn randomly from the training data set; that is, the sub-training sets in Figure 2 were extracted from the training data set. Hence, there were 175 sub-sample sets, and 175 base classifiers were obtained via training.
After training, each classifier could output a fault label when applied to a test sample (110 records, 55% of the whole data set). The final diagnostic result was determined by voting among all of the classifiers involved; typical voting schemes include simple, weighted and Bayesian voting [17]. The transformer fault diagnosis system designed in this study builds each classifier by resampling the data set with duplication, a procedure known as bootstrapping [18]. Bootstrapping is a powerful technique for assessing the accuracy of a parameter when the available samples are limited. Because each base classifier is trained on a different bootstrap sample, each training run produces a different model, and the final result is obtained via simple voting.
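To make the bootstrapping and simple-voting steps explicit, the following sketch implements them directly (again an illustrative reconstruction assuming scikit-learn decision trees, not the authors' code):

```python
# Illustrative reconstruction of bootstrapping plus simple (majority) voting.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_bagged_trees(X_train, y_train, n_trees=175, sample_fraction=0.75, rng=None):
    """Train one unpruned CART per bootstrap sub-sample (drawn with replacement)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_sub = int(sample_fraction * len(X_train))
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X_train), size=n_sub)  # resampling with duplication
        trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))
    return trees

def diagnose(trees, x):
    """Simple voting: the most frequent label among the base classifiers wins."""
    votes = [int(t.predict(x.reshape(1, -1))[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]
```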
4. DGA Data Prediction with CSO-ANN
DGA is a technique used worldwide for the detection of incipient thermal and electrical faults in power transformers. The trace gases dissolved in the transformer insulating oil and analyzed using the chromatographic method mainly include H2, CO, CO2, CH4, C2H2, C2H4, C2H6, etc. Generally, the first five or six gases are sufficient for most transformer diagnosis methods, e.g., the IEC or Rogers ratio methods [19] and decision trees [20]. At present, most transformer fault diagnosis algorithms are based on historical DGA data obtained using offline measuring methods with relatively long time intervals. This means that the diagnosis is usually performed after the faults have already taken place, i.e., breakdown maintenance is done after the event. Under the principle that "prevention is better than cure", it is more valuable to forecast transformer faults in advance so as to prevent failure.
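For illustration, the three gas ratios on which the IEC ratio method is based can be computed directly from the measured concentrations. The sketch below is a hedged example with placeholder names; the fault-type decision table is deliberately omitted because it depends on the specific standard being applied.

```python
# Minimal sketch: the three IEC gas ratios computed from measured
# concentrations in ppm (keys and values are placeholders).
def iec_ratios(gas_ppm):
    """gas_ppm: dict with keys 'H2', 'CH4', 'C2H2', 'C2H4', 'C2H6'."""
    eps = 1e-9  # guard against division by zero for gases below detection limit
    return {
        "C2H2/C2H4": gas_ppm["C2H2"] / (gas_ppm["C2H4"] + eps),
        "CH4/H2": gas_ppm["CH4"] / (gas_ppm["H2"] + eps),
        "C2H4/C2H6": gas_ppm["C2H4"] / (gas_ppm["C2H6"] + eps),
    }

# Example: iec_ratios({"H2": 50, "CH4": 30, "C2H2": 2, "C2H4": 20, "C2H6": 15})
```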
DGA data forecasting with a time label is essentially a time-series prediction problem, the difference being that a vector with five or six dimensions must be predicted. Various prediction techniques have been developed in recent years, such as the grey model [21], the SVM [22] and the artificial neural network (ANN) [23]. Grey theory uses an exponential law to model the time-series data, which is suitable for monotonically decreasing or increasing processes. However, fitting DGA data with an exponential law may lead to errors because of the complexity of transformer faults and the corresponding gas contents in the oil. The SVM utilizes the structural risk minimization principle instead of empirical risk minimization, which gives it excellent generalization ability on small samples for binary classification problems. Although the SVM has been extended to multi-class classification and regression problems [24], its practicability and effectiveness vary with the specific application.
When sufficient training samples are available, an artificial neural network, e.g., a BPNN, is one of the most suitable methods for nonlinear prediction; theoretically, it can fit any nonlinear function [25]. Conventionally, the gradient descent algorithm (GDA) is the most commonly used method for standard BP network training. However, the training speed of the GDA for a BP network is low, and it easily converges to a local minimum. More generally, the traditional ANN suffers from weaknesses during training, such as slow convergence and easily falling into local minima, which largely affect the prediction accuracy. To overcome these limitations, the CSO algorithm proposed by Meng et al. [11] was adopted in this study as the training method for DGA series data prediction. CSO is a heuristic optimization algorithm that has been exploited for wind speed and energy prediction [14]. The algorithm offers excellent global optimization ability as well as rapid convergence, which gives it an obvious advantage over particle swarm optimization (PSO) [26] and the genetic algorithm (GA) [27]. Two search operators, called horizontal crossover (HC) and vertical crossover (VC), play an important role in ensuring the global search ability of CSO. The procedure of CSO is as follows [28]:
(1) Population initialization
Suppose that X is a randomly generated matrix with M rows and D columns; it represents a population of M individuals, each with D dimensions.
(2) Horizontal crossover operation
Generate a random permutation of the integers 1 to M so that the individuals in X are matched into M/2 pairs. Suppose X(i) and X(j) are a pair of parents. The next generation of moderation solutions is reproduced using Equation (1) [26], where r1 and r2 are uniformly distributed random values in [0, 1], c1 and c2 are uniformly distributed random values in [−1, 1], and MShc(i) and MShc(j) are the offspring of X(i) and X(j), respectively.
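Equation (1) itself is not reproduced here; in the standard formulation of the CSO horizontal crossover [11], it takes the following form (a reconstruction based on the definitions above, with d denoting the dimension index):

```latex
\begin{aligned}
MS_{hc}(i,d) &= r_1\,X(i,d) + (1-r_1)\,X(j,d) + c_1\bigl(X(i,d)-X(j,d)\bigr),\\
MS_{hc}(j,d) &= r_2\,X(j,d) + (1-r_2)\,X(i,d) + c_2\bigl(X(j,d)-X(i,d)\bigr),
\end{aligned}
\qquad d = 1,2,\ldots,D
```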
(3) Vertical crossover operation
By performing the VC operation on the d1 and d2 dimensions of the individual X(i), the offspring MSvc(i) can be generated using Equation (2):
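Equation (2) is likewise not reproduced here; in the standard CSO formulation [11], the vertical crossover between dimensions d1 and d2 of individual X(i) can be written as follows (a reconstruction, with r a uniformly distributed random value in [0, 1]):

```latex
MS_{vc}(i,d_1) = r\,X(i,d_1) + (1-r)\,X(i,d_2), \qquad r \in [0,1]
```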
The HC and VC search operators have been shown to be effective methods with excellent global search ability. The role of VC is particularly important because this crossover helps stagnant dimensions step out of local minima.
In this study, the weights and thresholds (biases) of the ANN for DGA data prediction are optimized using CSO, which helps to improve the prediction accuracy; the quadratic cost function [28] was used as the fitness function. Each element of the population consists of the weights and thresholds, stored as one row of the matrix X, so the number of rows of X equals the population size. At each iteration, the horizontal crossover operation is performed according to Equation (1) to obtain the moderation solution MShc. In CSO, new solutions must be evaluated before entering the next generation of the population; hence, the fitness values of the solutions in MShc are calculated and compared with those of the parent population. Similarly, the fitness values of the solutions in MSvc are calculated after the vertical crossover operation is performed on the updated population according to Equation (2). Offspring survive only when they outperform their parents; hence, only the best solutions are maintained in the population during the iterations. Once the iterations are completed, the solution with the best fitness is output as the weights and thresholds of the ANN.
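To summarize the procedure, the sketch below outlines how a one-hidden-layer ANN could be trained with CSO in this way. It is a simplified illustration under our own assumptions (network size, population size, iteration count, and a simplified vertical crossover), not the authors' implementation; names such as X_hist and Y_next are placeholders.

```python
# Simplified sketch of training a one-hidden-layer ANN with criss-cross
# optimization (CSO); all sizes and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def unpack(theta, n_in, n_hid, n_out):
    """Split a flat parameter vector into weights and thresholds (biases)."""
    i = 0
    W1 = theta[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = theta[i:i + n_hid]; i += n_hid
    W2 = theta[i:i + n_hid * n_out].reshape(n_hid, n_out); i += n_hid * n_out
    b2 = theta[i:i + n_out]
    return W1, b1, W2, b2

def forward(theta, X, n_in, n_hid, n_out):
    W1, b1, W2, b2 = unpack(theta, n_in, n_hid, n_out)
    h = np.tanh(X @ W1 + b1)      # hidden layer
    return h @ W2 + b2            # linear output layer

def fitness(theta, X, Y, n_in, n_hid, n_out):
    """Quadratic cost (mean squared error) of the network on the training data."""
    err = forward(theta, X, n_in, n_hid, n_out) - Y
    return 0.5 * np.mean(err ** 2)

def cso_train(X, Y, n_hid=8, pop_size=40, iters=200):
    n_in, n_out = X.shape[1], Y.shape[1]
    dim = n_in * n_hid + n_hid + n_hid * n_out + n_out
    pop = rng.uniform(-1, 1, size=(pop_size, dim))       # (1) population initialization
    fit = np.array([fitness(p, X, Y, n_in, n_hid, n_out) for p in pop])

    for _ in range(iters):
        # (2) Horizontal crossover: pair individuals at random and recombine.
        order = rng.permutation(pop_size)
        ms_hc = pop.copy()
        for a, b in zip(order[0::2], order[1::2]):
            r1, r2 = rng.random(dim), rng.random(dim)
            c1, c2 = rng.uniform(-1, 1, dim), rng.uniform(-1, 1, dim)
            ms_hc[a] = r1 * pop[a] + (1 - r1) * pop[b] + c1 * (pop[a] - pop[b])
            ms_hc[b] = r2 * pop[b] + (1 - r2) * pop[a] + c2 * (pop[b] - pop[a])
        # Competition: offspring replace parents only if they are fitter.
        for k in range(pop_size):
            f = fitness(ms_hc[k], X, Y, n_in, n_hid, n_out)
            if f < fit[k]:
                pop[k], fit[k] = ms_hc[k], f

        # (3) Vertical crossover (simplified): recombine two random dimensions.
        d1, d2 = rng.permutation(dim)[:2]
        ms_vc = pop.copy()
        r = rng.random(pop_size)
        ms_vc[:, d1] = r * pop[:, d1] + (1 - r) * pop[:, d2]
        for k in range(pop_size):
            f = fitness(ms_vc[k], X, Y, n_in, n_hid, n_out)
            if f < fit[k]:
                pop[k], fit[k] = ms_vc[k], f

    return pop[np.argmin(fit)]   # best weights and thresholds found

# Example usage with synthetic stand-in DGA series data (5 gases, one-step-ahead):
X_hist = rng.random((60, 5))     # past gas concentrations (placeholder)
Y_next = rng.random((60, 5))     # next-step concentrations (placeholder)
theta = cso_train(X_hist, Y_next)
y_pred = forward(theta, X_hist[-1:], 5, 8, 5)
```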