Next Article in Journal
A Multi-Condition-Based Junction Temperature Estimation Technology for Double-Sided Cooled Insulated-Gate Bipolar Transistor Modules
Previous Article in Journal
Geospatial Planning for Least-Cost Electrification in Developing Countries
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

InvMOE: MOEs Based Invariant Representation Learning for Fault Detection in Converter Stations

1
Kunming Bureau of EHV Transmission Company, Kunming 650217, China
2
Electric Power Engineering, Kunming University of Science and Technology, Kunming 650500, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Energies 2025, 18(7), 1783; https://doi.org/10.3390/en18071783
Submission received: 13 December 2024 / Revised: 17 March 2025 / Accepted: 26 March 2025 / Published: 2 April 2025
(This article belongs to the Topic Advances in Power Science and Technology, 2nd Edition)

Abstract

:
Converter stations are pivotal in high-voltage direct current (HVDC) systems, enabling power conversion between an alternating current (AC) and a direct current (DC) while ensuring efficient and stable energy transmission. Fault detection in converter stations is crucial for maintaining their reliability and operational safety. This paper focuses on image-based detection of five common faults: metal corrosion, discoloration of desiccant in breathers, insulator breakage, hanging foreign objects, and valve cooling water leakage. Despite advancements in deep learning, existing detection methods face two major challenges: limited model generalization due to diverse and complex backgrounds in converter station environments and sparse supervision signals caused by the high cost of collecting labeled images for certain faults. To overcome these issues, we propose InvMOE, a novel fault detection algorithm with two core components: (1) invariant representation learning, which captures task-relevant features and mitigates background noise interference, and (2) multi-task training using a mixture of experts (MOE) framework to adaptively optimize feature learning across tasks and address label sparsity. Experimental results on real-world datasets demonstrate that InvMOE achieves superior generalization performance and significantly improves detection accuracy for tasks with limited samples, such as valve cooling water leakage. This work provides a robust and scalable approach for enhancing fault detection in converter stations.

1. Introduction

Converter stations play a pivotal role in high-voltage direct current (HVDC) systems by enabling efficient power conversion between an alternating current (AC) and a direct current (DC) [1]. They regulate power flow, ensuring stable and reliable energy transmission across interconnected grids, which is essential for long-distance power transmission, asynchronous grid interconnection, and cross-regional power distribution. The critical importance of these systems cannot be overstated, as failures can lead to widespread power outages, grid instability, and significant economic disruptions [2,3]. Consequently, effective fault detection is indispensable to maintain the reliable operation of converter stations and the overall health of the power network.
As shown in Figure 1, we present two kinds of common faults of converter stations: insulator breakage and hanging foreign object. Among them, insulator breakage can lead to electrical arcing, which may cause severe damage to the converter station’s equipment, leading to system instability and potential power outages. Additionally, hanging foreign objects can interfere with the mechanical components of the converter station, leading to physical damage or malfunction. These objects may also obstruct air flow or cooling systems, increasing the risk of overheating and reducing the efficiency of the converter station. So, fault detection is crucial to ensure the safe and reliable operation of converter stations.
However, fault detection in converter stations is inherently challenging due to the complex and dynamic environments in which these stations operate [4]. The environments are subject to various unpredictable factors such as changing lighting conditions, weather, camera angles, and background clutter, which can drastically affect the performance of conventional detection models. Traditional fault detection methods, which often rely on domain-specific expertise and handcrafted features [5,6], are limited in their ability to adapt to such environmental variability. These methods typically involve feature extraction based on fixed rules, which do not generalize well under different conditions, leading to performance degradation when deployed in real-world scenarios.
In contrast, deep-learning-based methods, particularly those utilizing large-scale data, have shown significant promise in automatically learning rich, high-level feature representations [7,8]. Convolutional neural networks (CNNs) have been widely used for tasks such as feature extraction, fusion, and decision making. While these methods have demonstrated impressive performance on controlled datasets, they are not immune to two critical limitations that hinder their real-world applicability: (1) Limited Model Generalization: Environmental variations, such as changes in lighting, weather, and the introduction of background noise, can significantly degrade model performance. Models trained on specific datasets may fail to generalize to out-of-distribution (OOD) data, leading to poor performance when the operational conditions differ from those encountered during training [9,10]. This challenge is particularly pronounced in converter station environments, where conditions can vary unpredictably over time. (2) Sparse Supervision Signals: Fault detection tasks are heavily reliant on labeled data, but acquiring high-quality annotations for certain fault categories, such as valve cooling water leakage, is expensive and time-consuming. As a result, there is often a severe imbalance in the availability of labeled data, which makes it difficult to train robust models [11,12]. Sparse supervision exacerbates the difficulty of accurately detecting rare or hard-to-label faults.
To address these challenges, we introduce InvMOE (Invariant representation learning with Mixture Of Experts), a novel fault detection framework specifically designed for the dynamic and challenging environments of converter stations. InvMOE incorporates two key components: (1) Invariant Representation Learning: This component aims to disentangle task-relevant features from environmental noise, thereby enhancing the model’s robustness and generalization capabilities. From a causal perspective, fault occurrences are independent of environmental factors, which are treated as confounders [13]. By learning invariant representations, the model can focus on the causal features that are consistent across different environments in alignment with causal inference principles. Techniques such as adversarial training and contrastive loss are employed to facilitate this disentanglement, making InvMOE particularly well-suited for OOD scenarios where environmental factors vary significantly from the training data distribution. (2) Multi-task Training with Mixture of Experts (MOE): To tackle the issue of sparse supervision, the MOE framework is employed to adaptively route inputs to specialized expert subnetworks that are optimized for different fault detection tasks. This approach not only prevents negative transfer between tasks but also allows the model to leverage shared knowledge across tasks, improving performance even for categories with limited data. By simultaneously learning from multiple related tasks, the model can effectively utilize available data, mitigating the challenge of data scarcity for rare fault types, such as valve cooling water leakage detection.
Experiments conducted on real-world datasets demonstrate that InvMOE significantly outperforms existing methods, achieving superior generalization across OOD conditions and showing substantial improvements in fault detection tasks with limited labeled data.
Contributions: The main contributions of this paper are as follows:
  • We propose InvMOE, a novel fault detection algorithm that combines invariant representation learning and a mixture of experts framework to address the challenges of limited generalization and sparse supervision.
  • We introduce a causal-inspired approach for disentangling task-relevant features from environmental noise, enabling robust and reliable fault detection across diverse and unpredictable converter station environments.
  • We develop a multi-task training strategy with MOE, which improves model efficiency and effectiveness, particularly in handling tasks with limited data, and demonstrate its superiority through extensive experiments on real-world datasets.
The structure of this paper is organized as follows: In Section 2, we provide a comprehensive review of related works, highlighting key studies and methodologies relevant to our research. In addition, we point out the differences between previous works and our proposed method. Section 3 introduces our proposed InvMOE framework, detailing its design, components, and the rationale behind our approach. Following this, Section 4 presents the experimental results, where we evaluate the performance of our framework on various benchmark datasets. In Section 5, we conduct an ablation study to assess the impact of different components of the InvMOE framework and their contribution to the overall performance. Finally, we conclude in Section 6 and Section 7, summarizing the key findings and suggesting potential avenues for future work.

2. Related Works

In this section, we review the related works on fault detection methods for converter stations, out-of-distribution (OOD) generalization techniques, and multi-task learning approaches, with a specific focus on image recognition and detection tasks.

2.1. Converter Station Fault Detection

Fault detection in converter stations is a critical task to ensure the safe and efficient operation of high-voltage direct current (HVDC) systems [2]. Traditional fault detection methods for converter stations often rely on rule-based systems, threshold-based approaches, or model-based techniques [5,6]. These methods typically focus on detecting specific anomalies like overcurrent or voltage irregularities, which might indicate faults in key components such as transformers, capacitors, or switches. For instance, methods like open circuit fault diagnosis have been used for detecting anomalies in converter circuits, with some systems employing automatic feature extraction coupled with algorithms such as random forests to identify faults in the presence of nonstationary influences [5]. More recent approaches have leveraged deep learning [7] and computer vision techniques [8] for fault detection in converter stations, particularly focusing on analyzing images captured in real-world environments [14,15]. These methods generally involve extracting features from images and classifying them into fault categories. Additionally, researchers have explored hybrid techniques that combine deep learning with traditional electrical fault detection to improve diagnostic accuracy and robustness [16]. Furthermore, some studies have extended fault detection by incorporating environmental factors, such as light conditions and camera angles, which can significantly affect the accuracy of image-based fault detection systems. This incorporation of contextual features helps improve the robustness and generalization of fault detection models, addressing the challenges of real-world deployment [17].

2.2. Out-of-Distribution Generalization

Out-of-distribution (OOD) generalization is a significant challenge in machine learning [18,19], especially when deploying models in dynamic real-world environments like converter stations, where variations in lighting, angle, and background can drastically impact model performance. OOD generalization refers to the model’s ability to make accurate predictions on data that differ from the training data distribution. Several approaches have been proposed to address this challenge, with a strong focus on domain adaptation and robust learning techniques. Invariant Risk Minimization (IRM) is one of the prominent techniques used to address OOD generalization [10,20,21,22]. IRM encourages the model to learn invariant features across different domains (or environments) to ensure consistency in predictions when exposed to unseen data distributions. This is particularly useful when labeled data from the target domain (e.g., converter stations under specific environmental conditions) are scarce. Techniques like adversarial training and self-supervised learning have been utilized to align feature distributions between source and target domains, reducing the domain shift that hampers generalization [23,24]. Additionally, methods like data augmentation (e.g., random cropping, rotation, color jittering, etc.) are commonly used to expose models to a variety of input conditions during training. This technique helps improve the model’s robustness, allowing it to better handle OOD data when deployed in diverse operational environments.

2.3. Multi-Task Learning

Multi-task learning (MTL) is a machine learning paradigm that simultaneously learns multiple tasks with shared representations, aiming to improve the model’s performance on each individual task by leveraging common information [25]. MTL has been widely applied to image recognition, classification, and detection tasks, where multiple related objectives need to be tackled simultaneously.
In the context of fault detection in converter stations, MTL can be particularly beneficial, as it allows the model to learn different fault detection tasks (e.g., corrosion detection, insulator breakage detection, etc.) concurrently, sharing useful features across tasks. For instance, one task might focus on detecting corrosion, while another focuses on detecting insulator cracks. By sharing the learned features, the model benefits from a more generalized understanding of the environment, which can improve performance on each individual task, especially when the amount of labeled data is limited for some tasks. Ref. [23] is a popular approach within MTL, where different “experts” specialize in different tasks and are dynamically selected based on the input. MOE provides the flexibility of task specialization while maintaining the benefits of sharing common knowledge. In image recognition tasks, the MOE framework has been applied to ensure that the relevant experts are activated depending on the type of fault being detected. This allows the model to efficiently allocate resources to tasks that require more specialized attention while still benefiting from shared feature learning for tasks that have overlapping characteristics. MTL methods can be combined with techniques like attention mechanisms and graph neural networks (GNNs) to model dependencies between different faults or tasks in converter station environments [24]. These dependencies can enhance the model’s ability to handle complex relationships between fault types, improving detection accuracy.
In summary, while existing approaches provide valuable contributions, they are often limited by poor generalization to diverse environments, reliance on sparse supervision signals, and inefficiencies in multi-task learning. Our proposed InvMOE addresses these shortcomings by combining invariant representation learning and a flexible multi-task learning framework, resulting in improved fault detection performance and generalization.

3. The Proposed InvMOE Framework

This section describes the proposed InvMOE framework for fault detection in converter stations. As shown in the Figure 2, our proposed InvMOE framework consists of three key components: (1) image feature extraction, (2) MOE-based multi-task learning, and (3) invariant-learning-based optimization. The detailed processes of each component are as follows.

3.1. Image Feature Extraction

The first stage of the InvMOE framework focuses on extracting robust feature representations from images captured in real-world converter station environments. These images, taken under varying conditions of lighting, angles, and background complexity, serve as the primary data source for fault detection.
While the framework is designed to accommodate any state-of-the-art visual backbone, we employed the Swin Transformer model due to its advanced capabilities in capturing both local and global features through a hierarchical structure and self-attention mechanism [26]. The Swin Transformer has shown superior performance in a variety of vision tasks, making it an ideal choice for the diverse and challenging nature of converter station images [12,26]. Compared to VIT and traditional CNN models, the advantages of the Swin Transformer are summarized as follows: (a) Hierarchical Representation: Swin Transformer employs a hierarchical feature extraction approach, where images are progressively downsampled through a series of stages, allowing the model to capture both local and global contexts at multiple scales. This enables the model to capture fine-grained details as well as high-level abstract features. In contrast, ViT operates by flattening the entire image into a sequence of patches, which limits its ability to efficiently model hierarchical structures in the image. (b) Locality-Aware Attention: The Swin Transformer introduces the concept of shifted window-based attention, where the image is divided into nonoverlapping local windows, and attention is computed within each window. These windows are shifted between layers, allowing for cross-window interactions, which enhances the model’s ability to capture local patterns while still maintaining long-range dependencies in the image. Considering the above advantages, we employed the Swin Transformer to extract image features for fault detection in converter stations.
Given an input image x R H × W × C , representing a high-resolution RGB photo with height H, width W, and C color channels, the Swin Transformer processes the image as follows:
  • Patch Tokenization: The input image x is divided into nonoverlapping patches of size P × P . Each patch is flattened into a vector, and a linear embedding layer maps these vectors into a d-dimensional feature space. This produces an initial set of tokens T R ( H / P ) × ( W / P ) × d .
  • Hierarchical Feature Extraction: The tokenized patches are passed through multiple Transformer layers. Each layer consists of shifted window-based self-attention modules and feedforward networks, enabling efficient computation and the capture of long-range dependencies within the image [26].
  • Feature Aggregation: As the processing progresses through hierarchical stages, features are aggregated and downsampled to reduce spatial dimensions while increasing semantic richness. This results in a compact feature vector z R d , encapsulating the key visual information of the input image.
The extracted feature vector z serves as the input for subsequent stages of the InvMOE framework, ensuring that rich and invariant representations are available for fault detection tasks. By leveraging the Swin Transformer’s ability to effectively balance computational efficiency and representation quality, this stage establishes a strong foundation for the proposed method.

3.2. MOE-Based Multi-Task Learning

InvMOE leverages a mixture of experts (MOE) framework to address the challenges posed by multiple fault detection tasks and sparse supervision signals. For each input image feature, extracted via a Swin Transformer, the MOE architecture is used to adapt the features for the downstream five-fault detection tasks: metal corrosion, respirator silicone discoloration, insulator fracture, suspended object, and valve cooling water leakage detection.

3.2.1. Adaptive Expert Routing

The generalized features z c obtained from the invariant representation learning stage are passed through the MOE layer. The MOE framework consists of K expert networks { E 1 , E 2 , , E K } , where each expert specializes in a subset of fault detection tasks. A gating network G dynamically routes the input to one or more experts based on the task requirements:
z output = i = 1 k g i ( z ) E i ( z ) ,
where g i ( z ) is the gating weight for expert E i . This adaptive routing mechanism enables the model to allocate computational resources effectively, ensuring that each task benefits from task-specific expertise. In this case, the five tasks will use a combination of experts to process their respective features, ensuring specialized detection. To better illustrate the adaptive expert routing process, we give a clearer description of the“gating network” and “experts”:
  • Gating Network: The gating network is responsible for deciding which “expert” should process the input for each task. It operates by taking the input data and assigning them to one or more experts based on the relevance of each expert for a given task. This routing is adaptive, meaning the gating decisions change dynamically based on the input data and task requirements. A detailed working of the gating network is presented as follows: (a) The gating network takes the feature vector of the input data and computes a routing decision using a softmax function, producing a probability distribution over the experts. (b) The network assigns each input to one or more experts depending on the routing probabilities. The experts are chosen dynamically depending on the type of task and the input data.
  • Experts: The experts in the framework are specialized subnetworks that are trained to focus on particular tasks. Each “expert” is associated with one or more fault types or specific task requirements. For example, in the case of fault detection, some experts may specialize in detecting electrical faults, while others may specialize in detecting mechanical or thermal faults. A detailed explanation of the experts is presented as follows: (a) Each expert is a neural network that has been fine-tuned to handle specific data patterns or tasks. These experts are trained on task-specific datasets, which could include different fault types or other specialized information relevant to each task. (b) Regarding the relationship between experts, the experts are not isolated. The gating mechanism dynamically routes inputs to the appropriate experts based on their capabilities and relevance to the current task. This allows the model to specialize in different aspects of the task while also facilitating knowledge sharing across the experts.

3.2.2. Multi-Task Learning

To jointly train the experts for all fault detection tasks, we adopted an empirical risk minimization (ERM)-based multi-task optimization framework. The loss function for each individual task, i { 1 , , 5 } , is based on the cross-entropy loss, which is defined as
L task i = c = 1 C y c log ( p c ) ,
where y c is the ground truth for class c for task i, p c is the predicted probability for class c, and C is the number of possible classes for the task. The model has five such cross-entropy loss functions—one for each task.
The overall multi-task loss function is a weighted sum of the task-specific losses:
L ERM = i = 1 5 λ i L task i ,
where λ i is the weight for task i, and L ERM denotes the total empirical risk minimization. This design not only enhances the performance of individual tasks but also facilitates knowledge sharing across tasks, leveraging common features to mitigate the issue of sparse supervision. The weighted sum of losses ensures that the model optimizes all five tasks simultaneously, with task-specific contributions adjusted according to the importance and difficulty of each task.

3.3. Invariant-Learning-Based Optimization

To address the challenge of limited generalization caused by environmental variability, we adopted an invariant representation learning strategy inspired by Invariant Risk Minimization (IRM). Specifically, we divided the input images into multiple environments { e 1 , e 2 , , e N } based on factors such as lighting, angles, and surrounding backgrounds. The goal was to encourage the model to focus on causal features that are invariant across these environments while ignoring task-irrelevant environmental noise.

3.3.1. Causal Framework for Invariance

From a causal perspective, the fault label y is determined by latent causal factors z c , which are independent of environmental factors z e . We model the representation z as the combination of z c and z e , where
z = z c + z e ,
The goal of invariant representation learning is to extract z c while suppressing z e , thus focusing on the causal factors that are responsible for fault detection and minimizing the influence of environmental variations.

3.3.2. IRM-Based Regularization

We employed Invariant Risk Minimization (IRM) to enforce invariance across different environments. The IRM framework aims to encourage the model to learn stable detection ability across various environments. Thus, the IRM-based regularization is defined as follows:
L IRM = Var ( { L ERM k : 1 k K } ) ,
where L IRM k is the loss for environment e k . This formulation encourages the model to minimize the loss across all environments, ensuring that the learned representations are invariant and robust to environmental variations.
By combining the IRM-based regularization, the final optimization objective of InvMOE is
L = 1 N n = 1 N L ERM k + β Var ( { L ERM k : 1 k K } ) ,
where β is the hyper-parameter used to balance the ERM and IRM losses. Based on the above optimization, InvMOE can learn the invariant representation z c and achieve robust fault detection performance in converter station environments, addressing both the issues of sparse supervision and limited model generalization due to environmental variability. The detailed training process is shown in Algorithm 1.
Algorithm 1 InvMOE framework for fault detection
  1:
Input: Image x R H × W × C
  2:
Output: Fault detection results for five tasks
  3:
Step 1: Feature Extraction
  4:
   Divide x into patches and apply linear embedding to map to d-dimensional space.
  5:
   Pass through Swin Transformer and aggregate features to get z R d .
  6:
Step 2: MOE-based Multi-task Learning
  7:
   Pass z through MOE layer with K experts, using dynamic routing by gating network.
  8:
   For each task i, compute cross-entropy loss L task i and total loss L ERM .
  9:
Step 3: Invariant Learning-based Optimization
10:
   Model z as causal and environmental factors, extract causal factors z c .
11:
   Minimize variance across environments with IRM regularization.
12:
Step 4: Parameter Optimization
13:
   Optimize model parameters using gradient-based optimization (e.g., Adam optimizer) to minimize the total loss:
14:
       L = L ERM + β Var ( { L ERM k } )
15:
Step 5: Output
16:
   Output fault detection results for five tasks.

4. Experimental Results

4.1. Dataset Preprocessing

In order to evaluate the performance of the proposed InvMOE framework for fault detection in converter stations, we utilized a custom dataset consisting of high-resolution RGB images captured from various converter stations under real-world conditions. The dataset includes images of different fault types, such as metal corrosion, respirator silicone discoloration, insulator fracture, suspended objects, and valve cooling water leakage. Given the challenging nature of these images, the dataset processing steps were carefully designed to ensure high-quality and diverse inputs for training the model.

4.1.1. Data Collection

The dataset was collected from multiple converter stations across varying environmental conditions, including different lighting, camera angles, and background complexities. Each image was annotated with labels corresponding to the five fault detection tasks. The dataset was split into training, validation, and test sets, with approximately 80% of the data allocated for training, 10% for validation, and 10% for testing. The detailed distribution of images in the dataset is summarized in Table 1.

4.1.2. Data Augmentation

To enhance the generalization capabilities of the model and mitigate the effects of sparse supervision, data augmentation techniques were applied to increase the diversity of the training data. These techniques included the following:
  • Random cropping: Randomly cropping regions of the image to simulate variations in object size and position.
  • Color jittering: Random adjustments to the image’s brightness, contrast, and saturation to simulate varying lighting conditions.
  • Rotation and flipping: Random rotations and horizontal flips to simulate different camera angles.
  • Noise injection: Adding random noise to images to simulate real-world disturbances and background complexities.

4.1.3. Normalization and Standardization

Before feeding the images into the model, the pixel values were normalized to the range [ 0 , 1 ] by dividing by 255. Additionally, the images were standardized by subtracting the mean and dividing by the standard deviation of the training dataset to ensure consistent feature scaling. This helped to improve the convergence of the model during training and ensured that the input data were suitable for the Swin Transformer.

4.2. Experimental Settings

4.2.1. Training Setup

To ensure stable and efficient training, the following setup was used:
  • The model was trained using mini-batch gradient descent, with a batch size of 32.
  • Early stopping was employed to prevent overfitting, with a patience of 10 epochs. Training stopped if the validation loss did not improve for 10 consecutive epochs.
  • The learning rate was initialized at 0.001 and decreased by a factor of 0.1 after every 20 epochs.
  • The Adam optimizer was used for model optimization, which adapts the learning rate based on first and second moments of the gradients.

4.2.2. Evaluation Baselines

We conducted experiments with several competing baselines, which are introduced as follows:
  • KNN: KNN [27] is a simple, yet effective, machine learning algorithm used for classification and regression tasks. It works by identifying the K nearest data points to a given input and assigning a label (in classification) or predicting a value (in regression) based on the majority vote or average of the neighbors. KNN is nonparametric, meaning it makes no assumptions about the underlying data distribution. However, it can be computationally expensive for large datasets, since it requires calculating the distance between the query point and all training samples.
  • ResNet: ResNet (Residual Network) [8] is a deep convolutional neural network architecture known for its use of residual connections, which help mitigate the vanishing gradient problem by allowing gradients to flow through the network more effectively. It is particularly effective in image classification tasks and has been widely used in various computer vision applications. ResNet is commonly employed as a baseline model for comparison in tasks requiring deep learning architectures.
  • Swin Transformer: The Swin Transformer [26] is a state-of-the-art vision Transformer architecture that uses shifted windows for efficient self-attention and hierarchical feature representation. It overcomes the limitations of traditional Vision Transformers (ViTs) by processing images in smaller, nonoverlapping patches and dynamically adjusting attention regions, making it highly effective for capturing both local and global features. It has shown superior performance in various vision tasks compared to CNN-based architectures.
  • IRM (Invariant Risk Minimization): IRM [10] focuses on learning representations that generalize across multiple environments by enforcing invariance in the learned features. In the context of fault detection, this method would remove the multi-task learning component, resulting in a model variant that learns invariant representations across different environmental conditions without task-specific adaptation. This approach helps in addressing environmental variability, but without the benefit of multi-task learning shared across tasks.
  • EIIL: The EIIL [20] paradigm is proposed to address the challenges of learning invariant representations across diverse environments. By leveraging environment inference techniques, EIIL aims to improve model robustness and generalization by identifying and utilizing the invariant factors that are critical for learning in real-world, multi-environment settings.

4.2.3. Evaluation Metrics

The performance of the model was evaluated using two key metrics: accuracy (ACC) and F1-score. These metrics are defined as follows:
  • Accuracy (ACC): The accuracy of a model is the proportion of correct predictions out of the total number of predictions. It is computed as
    ACC = TP + TN TP + TN + FP + FN ,
    where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative values, respectively. Accuracy provides a general measure of how well the model is performing across all classes.
  • F1-Score: The F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics. It is defined as
    F 1 - Score = 2 × Precision × Recall Precision + Recall ,
    where
    Precision = TP TP + FP , Recall = TP TP + FN .
    The F1-score is especially useful when the dataset is imbalanced, as it accounts for both false positives and false negatives.

4.3. Performance Comparisons with Baselines

We evaluated the performance of the proposed InvMOE model on five distinct fault detection tasks, covering scenarios such as detecting metal corrosion, respirator silica gel discoloration, insulator breakage, overhead suspension detection, and valve cooling water leakage.
The performance of the proposed InvMOE model was compared to three baseline models—KNN [27], ResNet [8], Swin Transformer [26], IRM [10], and EIIL [20]—across five distinct fault detection tasks. The accuracy and F1-score results, as shown in Table 2 and Table 3, reveal several key findings:
  • Accuracy Performance: InvMOE consistently outperformed all baseline models across all tasks, achieving the highest accuracy in each task. Notably, InvMOE achieved an accuracy of 97.5% in Task 1 (metal corrosion detection), which was higher than the next best model, IRM (97.0%), and significantly higher than ResNet (94.0%) and Swin Transformer (96.5%). In Task 5 (valve cooling water leakage), InvMOE maintained an impressive accuracy of 88.0%, surpassing all baseline models, with IRM coming in second at 85.5%.
  • F1-Score Performance: The trend observed in the accuracy results is reflected in the F1-scores. InvMOE again led with the highest F1-scores across all tasks. For example, in Task 1, InvMOE achieved an F1-score of 97.3%, outperforming IRM (97.1%), Swin Transformer (96.2%), and ResNet (93.5%). In Task 5, InvMOE maintained its superior performance with an F1-score of 87.5%, which was considerably higher than IRM (82.5%) and Swin Transformer (80.1%).
  • Comparison to Baselines: As a classic machine learning method, KNN demonstrated strong classification performance on the first four tasks. However, it did not perform satisfactorily on the last task. This was primarily due to the lack of sufficient training samples, which hindered the model’s ability to generalize effectively in more complex scenarios. In addition, ResNet, while a strong baseline, generally fell behind both Swin Transformer and IRM in terms of both accuracy and F1-score. This is expected given that ResNet is a convolutional neural network, which may not capture the fine-grained relationships and long-range dependencies in the data as effectively as Transformer-based models. Swin Transformer and IRM performed similarly on most tasks, with IRM slightly outperforming Swin Transformer. This indicates that enforcing invariance across different environments (as done by IRM) offered some benefits over the self-attention mechanism used in Swin Transformer, particularly in tasks with varied environmental conditions. EIIL further inferred the environment labels considering the reality that there are potential unknown test scenarios, and it showed slight improvements over IRM in most cases. Overall, InvMOE demonstrated the most robust performance, suggesting that the integration of multi-task learning and the model’s ability to handle diverse fault detection scenarios contribute to its superior results. The accuracy and F1-score results for the different variants of the InvMOE model are summarized in Figure 3 and Figure 4.
In summary, InvMOE significantly outperformed all baseline models in both the accuracy and F1-score across the fault detection tasks, confirming its effectiveness in real-world fault detection applications.

5. Ablation Study

5.1. Description of Each Component of InvMOE

In this section, we conducted a detailed ablation study to evaluate the contributions of key components within the InvMOE model. By systematically removing or altering various components of the model, we can isolate their individual effects on performance and gain deeper insights into the mechanisms driving the model’s success. Specifically, we assessed the impact of three critical factors: multi-task learning, the Invariant Risk Minimization (IRM) mechanism, and the Swin Transformer backbone. The following variants were explored:
  • InvMOE (full model): The complete model, which integrates multi-task learning, IRM, and the Swin Transformer backbone, as originally designed.
  • No Multi-Task Learning: In this variant, the multi-task learning component was removed, and the model was instead trained in a traditional single-task learning setup. This isolated the contribution of task sharing and mutual information extraction.
  • No IRM: This variant removed the IRM mechanism, allowing the model to train without the additional regularization imposed by Invariant Risk Minimization. This tested the hypothesis that mitigating environment variability is essential for consistent performance.
  • Swin Transformer Backbone: In this variant, we maintained the multi-task learning and IRM components but replaced the Swin Transformer backbone with a simpler, more conventional architecture. This helped evaluate the importance of the Swin Transformer’s advanced self-attention and hierarchical feature extraction capabilities.

5.2. Ablation Analysis of InvMOE

  • Impact of Invariant Risk Minimization (IRM): The removal of the IRM mechanism resulted in a noticeable drop in performance—again, about 1–2% in the accuracy and F1-score. This decline indicates the significance of the IRM component, which helps to neutralize environmental variations and ensures that the model learns stable representations that generalize across different environments. Without IRM, the model became more sensitive to environmental noise, leading to fluctuations in its performance. This reinforces the notion that invariant learning is key to achieving consistency and reliability in real-world applications, where environmental conditions can vary dramatically.
  • Impact of Swin Transformer Backbone: The most striking performance drop occurred when the Swin Transformer backbone was replaced by a simpler architecture. The accuracy and F1-score both showed substantial reductions, particularly in tasks that required capturing long-range dependencies or complex contextual relationships. This highlights the power of the Swin Transformer’s self-attention mechanism, which allows the model to capture intricate dependencies between distant features. The hierarchical structure of the Swin Transformer further enhances its ability to extract meaningful features at multiple scales, which is particularly beneficial for tasks involving high-dimensional or sequential data. The performance drop in this variant reinforces the idea that state-of-the-art backbones, like the Swin Transformer, offer significant advantages over simpler models in terms of capturing fine-grained patterns and improving overall model performance.

6. Discussion

In the realm of fault detection, existing methodologies have made significant strides but often encounter challenges related to generalization across diverse environments, reliance on sparse supervision signals, and inefficiencies in multi-task learning. For instance, traditional deep learning models may struggle to generalize effectively when faced with varying environmental conditions, leading to decreased performance in real-world applications. Additionally, many models depend heavily on labeled data, which can be scarce and expensive to obtain, thereby limiting their applicability in scenarios with limited supervision. Furthermore, while multi-task learning frameworks aim to enhance efficiency by sharing knowledge across tasks, they sometimes face difficulties in balancing the learning process, potentially leading to suboptimal performance.
Our proposed InvMOE model addresses these limitations by integrating invariant representation learning with a flexible multi-task learning framework. This combination enables the model to capture task-relevant features while mitigating background noise interference, thereby enhancing generalization to out-of-distribution scenarios. The multi-task mixture of experts (MOE) framework further allows for adaptive routing of inputs to task-specific experts, effectively sharing knowledge across tasks and improving performance, especially in categories with limited training samples. Experimental results on real-world datasets demonstrate that InvMOE outperforms existing methods, achieving superior generalization and detection accuracy, particularly in tasks with sparse supervision signals. In summary, while existing approaches provide valuable contributions, they are often limited by poor generalization to diverse environments, reliance on sparse supervision signals, and inefficiencies in multi-task learning. Our proposed InvMOE addresses these shortcomings by combining invariant representation learning and a flexible multi-task learning framework, resulting in improved fault detection performance and generalization.

7. Conclusions

In this study, we proposed InvMOE, an advanced fault detection algorithm designed to tackle the challenges of generalization and sparse supervision in complex converter station environments. By integrating invariant representation learning and a multi-task mixture of experts (MOE) framework, InvMOE demonstrated significant robustness and accuracy across diverse fault detection tasks. Our approach leverages invariant representation learning to disentangle task-relevant causal features from environmental noise, improving the model’s generalization to out-of-distribution (OOD) scenarios. Additionally, the multi-task MOE framework enables adaptive routing of inputs to task-specific experts while effectively sharing knowledge across tasks. This design mitigates the impact of limited training samples and achieves improved performance for low-resource fault detection categories, such as valve cooling water leakage. Experimental results on real-world datasets confirm the efficacy of InvMOE. In future work, we aim to extend this framework by incorporating additional causal priors and exploring self-supervised pretraining strategies to further improve the robustness and scalability of fault detection systems in dynamic industrial environments.

Author Contributions

Conceptualization, H.S., S.L., H.L. and J.H.; Methodology, H.S.; Software, H.S.; Validation, H.S., S.L. and X.T.; Formal analysis, H.S.; Investigation, H.S.; Resources, H.S.; Data curation, H.S.; Writing—original draft preparation, H.S.; Writing—review and editing, S.L., H.L., J.H., Z.Q., J.W. and X.T.; Visualization, H.S.; Supervision, X.T.; Project administration, X.T.; Funding acquisition, X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 52167011).

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Hao Sun, Shaosen Li, Hao Li, Jianxiang Huang, Zhuqiao Qiao and Jialei Wang were employed by the Kunming Bureau of EHV Transmission Company. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wang, S.; Wang, X.; Ren, X.; Wang, Y.; Xu, S.; Ge, Y.; He, J. Fault Diagnosis Method for Converter Stations Based on Fault Area Identification and Evidence Information Fusion. Sensors 2024, 24, 7321. [Google Scholar] [CrossRef] [PubMed]
  2. Jovcic, D.; Ahmed, K. High Voltage Direct Current (HVDC) Transmission Systems; Wiley: Hoboken, NJ, USA, 2015. [Google Scholar]
  3. Adamson, C.; Hingorani, N.G. High-Voltage Direct-Current Power Transmission; Garraway: London, UK, 1960. [Google Scholar]
  4. Li, R.; Xu, L.; Yao, L. DC fault detection and location in meshed multiterminal HVDC systems based on DC reactor voltage change rate. IEEE Trans. Power Deliv. 2016, 32, 1516–1526. [Google Scholar] [CrossRef]
  5. Bu, S.; Liu, Z.; Shi, Y.; Sun, Y. Feature-based fault detection in converter stations. IEEE Trans. Ind. Electron. 2017, 64, 7800–7808. [Google Scholar]
  6. Sun, C.; Zhang, X.; Li, H.; Chen, W. Hybrid methods for converter station monitoring. Energy Rep. 2018, 4, 202–209. [Google Scholar]
  7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the NeurIPS, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  9. Gulrajani, I.; Lopez-Paz, D. In search of lost domain generalization. In Proceedings of the International Conference on Learning Representations ICLR, Virtual Event, 3–7 May 2021. [Google Scholar]
  10. Arjovsky, M.; Bottou, L.; Gulrajani, I. Invariant Risk Minimization. arXiv 2019, arXiv:1907.02893. [Google Scholar]
  11. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the NeurIPS, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. Image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations ICLR, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  13. Rojas-Carulla, M.; Schölkopf, B.; Turner, R.; Peters, J. Invariant models for causal transfer learning. J. Mach. Learn. Res. 2018, 19, 1–34. [Google Scholar]
  14. Li, Z.; Wang, Y.; Zhang, Q.; Chen, L. Deep learning-based fault diagnosis in HVDC systems. IEEE Trans. Power Electron. 2019, 34, 10245–10256. [Google Scholar]
  15. Mu, D.; Lin, S.; Zhang, H.; Zheng, T. A novel fault identification method for HVDC converter station Section based on energy relative entropy. IEEE Trans. Instrum. Meas. 2022, 71, 3507910. [Google Scholar] [CrossRef]
  16. Liu, W.; Chen, X.; Li, J.; Yang, H. Image-based fault detection in converter stations using deep learning. IEEE Access 2020, 8, 123456–123467. [Google Scholar]
  17. Li, Y.; Zhang, M.; Zhou, P.; Wu, F. Advanced deep learning models for converter station fault detection. IEEE Trans. Ind. Inform. 2021, 17, 4567–4578. [Google Scholar]
  18. Peng, C.; Zhang, Y.; Li, X.; Wang, J. Out-of-Distribution Generalization: A Survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar]
  19. Pearl, J.; Glymour, M.; Jewell, N.P. Causal Inference in Statistics: A Primer; Wiley: Hoboken, NJ, USA, 2016. [Google Scholar]
  20. Creager, E.; Jacobsen, J.H.; Zemel, R. Environment inference for invariant learning. In Proceedings of the 38th International Conference on Machine Learning, ICML, PMLR, Virtual, 18–24 July 2021; pp. 2189–2200. [Google Scholar]
  21. Krueger, D.; Caballero, E.; Jacobsen, J.H.; Zhang, A.; Binas, J.; Zhang, D.; Le Priol, R.; Courville, A. Out-of-distribution generalization via risk extrapolation (rex). In Proceedings of the 38th International Conference on Machine Learning, ICML, PMLR, Virtual, 18–24 July 2021; pp. 5815–5826. [Google Scholar]
  22. Li, H.; Zhang, Z.; Wang, X.; Zhu, W. Learning invariant graph representations for out-of-distribution generalization. NeurIPS 2022, 35, 11828–11841. [Google Scholar]
  23. Ajra, Y.; Hoblos, G.; Al Sheikh, H.; Moubayed, N. A Literature Review of Fault Detection and Diagnostic Methods in Three-Phase Voltage-Source Inverters. Machines 2024, 12, 631. [Google Scholar] [CrossRef]
  24. Wu, F.; Chen, K.; Qiu, G.; Zhou, W. Robust Open Circuit Fault Diagnosis Method for Converter Using Automatic Feature Extraction and Random Forests Considering Nonstationary Influence. IEEE Trans. Ind. Electron. 2024, 71, 13263–13273. [Google Scholar] [CrossRef]
  25. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  27. Cover, T.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
Figure 1. Llustration of faults in converter stations.
Figure 1. Llustration of faults in converter stations.
Energies 18 01783 g001
Figure 2. Overview of our proposed InvMOE framework.
Figure 2. Overview of our proposed InvMOE framework.
Energies 18 01783 g002
Figure 3. Accuracy comparison for different variants.
Figure 3. Accuracy comparison for different variants.
Energies 18 01783 g003
Figure 4. F1-score comparison for different variants.
Figure 4. F1-score comparison for different variants.
Energies 18 01783 g004
Table 1. Dataset statistics: Pu’er Converter Station dataset.
Table 1. Dataset statistics: Pu’er Converter Station dataset.
TaskNumber of Images
Task 1: Metal Corrosion500
Task 2: Silica Gel Discoloration500
Task 3: Insulator Breakage500
Task 4: Overhead Suspension500
Task 5: Valve Cooling Water Leak100
Table 2. Accuracy of different models for fault detection tasks (%).
Table 2. Accuracy of different models for fault detection tasks (%).
ModelTask 1Task 2Task 3Task 4Task 5
KNN92.590.988.889.660.7
ResNet94.093.291.592.080.0
Swin Transformer96.595.394.194.584.2
IRM97.096.595.094.885.5
EIIL96.897.195.294.986.7
InvMOE97.596.895.695.188.0
Table 3. F1-score of different models for fault detection tasks (%).
Table 3. F1-score of different models for fault detection tasks (%).
ModelTask 1Task 2Task 3Task 4Task 5
KNN91.789.689.587.858.3
ResNet93.592.090.390.875.2
Swin Transformer96.294.493.093.280.1
IRM97.196.494.994.782.5
EIIL97.096.995.194.786.9
InvMOE97.396.595.494.987.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, H.; Li, S.; Li, H.; Huang, J.; Qiao, Z.; Wang, J.; Tian, X. InvMOE: MOEs Based Invariant Representation Learning for Fault Detection in Converter Stations. Energies 2025, 18, 1783. https://doi.org/10.3390/en18071783

AMA Style

Sun H, Li S, Li H, Huang J, Qiao Z, Wang J, Tian X. InvMOE: MOEs Based Invariant Representation Learning for Fault Detection in Converter Stations. Energies. 2025; 18(7):1783. https://doi.org/10.3390/en18071783

Chicago/Turabian Style

Sun, Hao, Shaosen Li, Hao Li, Jianxiang Huang, Zhuqiao Qiao, Jialei Wang, and Xincui Tian. 2025. "InvMOE: MOEs Based Invariant Representation Learning for Fault Detection in Converter Stations" Energies 18, no. 7: 1783. https://doi.org/10.3390/en18071783

APA Style

Sun, H., Li, S., Li, H., Huang, J., Qiao, Z., Wang, J., & Tian, X. (2025). InvMOE: MOEs Based Invariant Representation Learning for Fault Detection in Converter Stations. Energies, 18(7), 1783. https://doi.org/10.3390/en18071783

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop