Article

Incremental Spatio-Temporal Augmented Sampling for Power Grid Operation Behavior Recognition

1 Electric Power Research Institute of Guizhou Power Grid Co., Ltd., Guiyang 550002, China
2 School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3579; https://doi.org/10.3390/electronics14183579
Submission received: 2 August 2025 / Revised: 1 September 2025 / Accepted: 8 September 2025 / Published: 9 September 2025
(This article belongs to the Special Issue Applications and Challenges of Image Processing in Smart Environment)

Abstract

Accurate recognition of power grid operation behaviors is crucial for ensuring both safety and operational efficiency in smart grid systems. However, this task presents significant challenges due to dynamic environmental variations, limited labeled training data availability, and the necessity for continuous model adaptation. To overcome these limitations, we propose an Incremental Spatio-temporal Augmented Sampling (ISAS) method for power grid operation behavior recognition. Specifically, we design a spatio-temporal Feature-Enhancement Fusion Module (FEFM), which employs multi-scale spatio-temporal augmented fusion combined with a cross-scale aggregation mechanism, enabling robust feature learning that is resilient to environmental interference. Furthermore, we introduce a Selective Replay Mechanism (SRM) that implements a dual-criteria sample selection strategy based on error variability and feature-space divergence metrics, ensuring optimal memory bank updates that maximize information gain while minimizing feature redundancy. Experimental results on the power grid behavior dataset demonstrate that the proposed method offers significant advantages over existing methods in recognition robustness and knowledge retention. For example, it achieves an accuracy of 89.80% on sunny days and maintains exceptional continual learning stability, with a forgetting rate of merely 2.74% across three meteorological scenarios.

1. Introduction

With the continuous advancement of power grid infrastructure, traditional manual supervision methods, characterized by low efficiency, high cost, and strong subjectivity, are increasingly inadequate for meeting modern operational requirements [1,2]. Consequently, intelligent safety supervision [3,4] has become a critical component in ensuring the stability and reliability of power grid systems. Recent advances in machine vision and deep learning have established power grid operation behavior recognition as a significant research focus in this field [5]. This task is distinct from grid-state recognition, which typically involves monitoring electrical parameters (e.g., voltage, current, frequency) to assess the operational status of the power network. Instead, it focuses on video-based human action recognition [6,7,8] within complex and dynamic environments characteristic of power grid operation scenarios.
In this context, power grid operation behavior recognition refers to the visual recognition of human operational activities performed by field personnel in power grid environments, such as climbing transmission towers, voltage verification, and installing grounding rods. These behaviors are critical for ensuring operational safety and compliance with standard procedures. However, power grid operation scenarios present unique complexities and challenges. First, the diversity of operational behaviors, encompassing equipment handling, safety inspections, and tool usage, presents rich and varied action patterns. Second, complex temporal dynamics mean that identical behaviors exhibit stage-dependent motion characteristics and are susceptible to environmental variations such as lighting conditions, weather, and equipment status.
The evolution of action-recognition technology began with hand-crafted feature descriptors, such as Spatio-Temporal Interest Points (STIPs) and Improved Dense Trajectories (IDT), which were commonly used in combination with bag-of-words models for action classification [9,10]. While these methods performed well in controlled environments with clear motion patterns, they exhibited limited generalization capability in dynamically complex scenarios, such as power grid operations (Figure 1), due to their reliance on manually designed features. The advent of deep learning significantly advanced the field. Three-dimensional Convolutional Neural Networks (3D-CNNs) [11,12] enabled joint spatio-temporal feature learning through volumetric convolutions. Subsequent architectures, like STCA [13], further enhanced discriminative power and robustness by incorporating spatio-temporal convolution and attention mechanisms. The Two-Stream Convolutional Network [14,15] improved motion perception by separately processing appearance and motion information via RGB and optical flow streams. This idea was extended in graph-based approaches, such as the two-stream spatio-temporal GCN-transformer [16], which integrated graph convolutional networks and transformers to effectively model skeletal dynamics.
However, current methods exhibit several limitations when applied to power grid operation behavior recognition. First, recognition accuracy is typically low because conventional models cannot adequately capture the complexity and diversity of power grid operation behaviors. Second, limited generalization capability results in poor performance beyond the training scenarios, with inadequate adaptation to varying operational conditions, personnel habits, and environmental changes. Third, the absence of continual learning mechanisms leads to catastrophic forgetting (Figure 2), where learning new operational behaviors causes the model to lose previously acquired recognition capabilities.
As a key technical approach to address these challenges, continual learning aims to maintain model performance on previously learned tasks while acquiring new knowledge. Existing continual learning approaches [17] can be categorized into three main paradigms: (1) Regularization-based methods preserve prior knowledge by constraining parameter updates; (2) architecture-based methods dynamically expand model capacity to accommodate new tasks; and (3) memory-based methods leverage stored exemplars or feature representations to facilitate knowledge retention. However, these approaches encounter specific challenges when applied to power grid operation behavior recognition, particularly in selecting optimal exemplars for knowledge preservation given storage constraints, and effectively utilizing temporal dynamics to enhance model learning capabilities.
To address these challenges, i.e., environmental variability and catastrophic forgetting, we propose an Incremental Spatio-temporal Augmented Sampling (ISAS) method for power grid operation behavior recognition. First, we design the spatio-temporal Feature-Enhancement Fusion Module (FEFM) that employs multi-scale spatio-temporal augmented fusion combined with a cross-scale aggregation mechanism, enabling robust feature learning that is resilient to environmental interference. Second, we develop the Selective Replay Mechanism (SRM) to implement a dual-criteria sample selection strategy based on error variability and feature-space divergence metrics, ensuring optimal memory bank updates that simultaneously maximize information gain while minimizing feature redundancy. Experimental results on the power-grid behavior dataset demonstrate the superior performance of the proposed method, achieving 89.80% recognition accuracy under sunny conditions while maintaining remarkable continual learning stability (forgetting rate = 2.74%) across three distinct meteorological scenarios. These results outperform existing approaches in both recognition robustness and long-term knowledge retention.
The main contributions of this work are summarized as follows:
  • We propose a novel Incremental Spatio-temporal Augmented Sampling (ISAS) framework for robust power grid operation behavior recognition under dynamic environments.
  • We design a Feature-Enhancement Fusion Module (FEFM) that incorporates multi-scale spatio-temporal augmentation and cross-scale aggregation to enhance feature robustness.
  • We introduce a Selective Replay Mechanism (SRM) with a dual-criteria strategy to optimize memory sample selection, mitigating catastrophic forgetting.
  • We validate the proposed method on a real-world power grid behavior dataset, demonstrating superior performance across multiple meteorological scenarios.

2. Methods

Figure 3 illustrates the overall framework of the proposed method. The pipeline begins with a spatio-temporal feature extraction module, where raw video data is processed to obtain preliminary feature representations. These features are subsequently fed into the spatio-temporal Feature-Enhancement Fusion Module (FEFM), which significantly enhances feature discriminability through the multi-scale spatio-temporal augmented fusion and the cross-scale aggregation mechanism. Concurrently, the Selective Replay Mechanism (SRM) evaluates samples based on two critical metrics: (1) error variability and (2) feature-space divergence, to identify the most informative samples for memory retention. The SRM dynamically updates the memory bank with these selected exemplars, ensuring optimal knowledge preservation. Finally, through the continual learning strategy, our method achieves an effective balance between new knowledge acquisition and prior knowledge retention, while consistently improving action-recognition performance across different scenarios.

2.1. Spatio-Temporal Feature Extraction Module

To address the complex dynamics of power grid operation behaviors, we employ a dual-pathway strategy for spatio-temporal feature extraction. First, a pre-trained YOLOv7 model [18] processes the video sequence $V = \{I_1, I_2, \ldots, I_T\}$ to detect operators and extract operator-specific local representations $\{F_1, F_2, \ldots, F_T\}$, where $T$ denotes the temporal length. We adopt YOLOv7 for its favorable trade-off between detection accuracy and computational efficiency, as well as its widely validated performance in industrial vision tasks. Although newer versions exist, YOLOv7 provides a stable and well-documented backbone suitable for real-time applications in power grid scenarios.
In the temporal pathway, we compute dense optical flow [19] between consecutive frames to capture motion patterns:
$$G_{\text{temp}} = \mathrm{CNN}_{\text{temp}}\big(\mathrm{OF}(I_1, I_2), \ldots, \mathrm{OF}(I_{T-1}, I_T)\big), \quad P_{\text{temp}} = \mathrm{CNN}_{\text{temp}}\big(\mathrm{OF}(F_1, F_2), \ldots, \mathrm{OF}(F_{T-1}, F_T)\big),$$
where $\mathrm{OF}$ denotes the optical flow field, $\mathrm{CNN}_{\text{temp}}$ represents the 3D convolutional network, and $G_{\text{temp}}$ and $P_{\text{temp}}$ are the global and local features in the temporal dimension, respectively. Concurrently, the spatial pathway processes the RGB frames using the 2D convolutional network $\mathrm{CNN}_{\text{rgb}}$:
$$G_{\text{rgb}} = \mathrm{CNN}_{\text{rgb}}(I_1, I_2, \ldots, I_T), \quad P_{\text{rgb}} = \mathrm{CNN}_{\text{rgb}}(F_1, F_2, \ldots, F_T).$$
Finally, the original spatio-temporal features $E_{\text{ori}}$ integrate both pathways:
$$E_{\text{ori}} = [G_{\text{rgb}}; G_{\text{temp}}; P_{\text{rgb}}; P_{\text{temp}}],$$
where $[\,\cdot\,;\,\cdot\,]$ indicates channel-wise concatenation.
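As a concrete illustration of this dual-pathway concatenation, the following PyTorch sketch combines the global/local and RGB/flow features into $E_{\text{ori}}$; the placeholder backbones, tensor shapes, and pre-computed optical flow are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Illustrative shapes: B clips, T frames, 224x224 resolution.
B, T, H, W = 2, 16, 224, 224
frames = torch.randn(B, 3, T, H, W)       # global RGB frames I_1..I_T
crops = torch.randn(B, 3, T, H, W)        # operator crops F_1..F_T from the detector
flow = torch.randn(B, 2, T - 1, H, W)     # dense optical flow OF(I_t, I_{t+1})
crop_flow = torch.randn(B, 2, T - 1, H, W)

# Placeholder backbones standing in for CNN_rgb and CNN_temp.
cnn_rgb = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
    nn.GELU(),
    nn.AdaptiveAvgPool3d((8, 28, 28)),
)
cnn_temp = nn.Sequential(
    nn.Conv3d(2, 64, kernel_size=3, padding=1),
    nn.GELU(),
    nn.AdaptiveAvgPool3d((8, 28, 28)),
)

G_rgb, P_rgb = cnn_rgb(frames), cnn_rgb(crops)        # spatial pathway (global / local)
G_temp, P_temp = cnn_temp(flow), cnn_temp(crop_flow)  # temporal pathway (global / local)

# Channel-wise concatenation yields the original spatio-temporal features E_ori.
E_ori = torch.cat([G_rgb, G_temp, P_rgb, P_temp], dim=1)
print(E_ori.shape)  # torch.Size([2, 256, 8, 28, 28])
```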

2.2. Spatio-Temporal Feature-Enhancement Fusion Module

To enhance robustness against visual disturbances, we propose a spatio-temporal Feature-Enhancement Fusion Module (FEFM) incorporating random masking and multi-scale processing; its structure is shown in Figure 4. First, we apply a random masking mechanism to simulate real-world occlusions, forcing the model to learn more robust feature representations while preventing over-reliance on specific spatio-temporal regions. Randomly selected spatio-temporal regions are set to zero:
$$E_{\text{en}} = E_{\text{ori}} \odot U, \quad U \sim \mathrm{Bernoulli}(p_{\text{mask}}),$$
where $E_{\text{en}}$ denotes the enhanced spatio-temporal features, $U \in \{0, 1\}$ is a binary mask, $p_{\text{mask}}$ is the masking probability, and $\odot$ denotes element-wise multiplication.
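A minimal sketch of this masking step, assuming an element-wise mask in which $U$ equals 1 with probability $1 - p_{\text{mask}}$ (so roughly a $p_{\text{mask}}$ fraction of positions is zeroed); whether masking acts element-wise or over contiguous spatio-temporal blocks is an implementation choice not fixed by the text.

```python
import torch

def random_spatiotemporal_mask(e_ori: torch.Tensor, p_mask: float = 0.3) -> torch.Tensor:
    """Zero out randomly selected spatio-temporal positions of the feature map.

    The keep-mask U is drawn element-wise, so on average a p_mask fraction of
    E_ori is suppressed, simulating occlusions and encouraging robust features.
    """
    keep = torch.bernoulli(torch.full_like(e_ori, 1.0 - p_mask))
    return e_ori * keep  # element-wise product E_ori ⊙ U

E_ori = torch.randn(2, 256, 8, 28, 28)
E_en = random_spatiotemporal_mask(E_ori, p_mask=0.3)  # enhanced (masked) features
```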
Subsequently, we extract multi-scale representations through the spatio-temporal feature encoder, formulated as:
$$S^{k}_{\text{ori}} = \mathrm{Conv3D\_Block}_k(E_{\text{ori}}), \quad S^{k}_{\text{en}} = \mathrm{Conv3D\_Block}_k(E_{\text{en}}),$$
where $k \in \{1, 2, 3, 4\}$ denotes different scale levels. Each $\mathrm{Conv3D\_Block}$ consists of cascaded temporal and spatial convolutions designed to capture local patterns in the temporal and spatial dimensions independently:
$$\mathrm{Conv3D\_Block}(X) = \sigma\big(W_t * (W_s * X)\big),$$
where $W_t$ and $W_s$ represent the learnable parameters of the temporal and spatial convolutions, respectively, $*$ denotes the convolution operation, and $\sigma$ is the GELU activation function.
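The factorized block can be sketched as below, applying a spatial convolution $W_s$ followed by a temporal convolution $W_t$ and a GELU, as in the formula above; the kernel sizes, channel widths, and per-scale strides are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Conv3DBlock(nn.Module):
    """Cascaded spatial-then-temporal 3D convolutions with a GELU activation."""

    def __init__(self, in_ch: int, out_ch: int, spatial_stride: int = 1):
        super().__init__()
        # W_s: spatial convolution over (H, W) only.
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 stride=(1, spatial_stride, spatial_stride), padding=(0, 1, 1))
        # W_t: temporal convolution over T only.
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.temporal(self.spatial(x)))

# One block per scale level k = 1..4, e.g., with progressively larger spatial strides.
blocks = nn.ModuleList(Conv3DBlock(256, 256, spatial_stride=2 ** k) for k in range(4))
E_ori = torch.randn(2, 256, 8, 28, 28)
S_ori = [blk(E_ori) for blk in blocks]  # multi-scale features S_ori^1..S_ori^4
```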
To fully exploit the complementary information between original spatio-temporal features and enhanced spatio-temporal features, their multi-scale representations are then fused through concatenation and subjected to dual pooling operations:
$$Y_k = \mathrm{TPool}_k\big([S^{k}_{\text{ori}}; S^{k}_{\text{en}}]\big),$$
where $\mathrm{TPool}(X) = [\mathrm{MaxPool}(X); \mathrm{AvgPool}(X)]$ and $[\,\cdot\,;\,\cdot\,]$ denotes feature concatenation.
Cross-scale Fusion: Interactive fusion of features across different scales is achieved through adaptive attention weights:
$$\hat{Y}_1 = \sum_{k=1}^{4} \alpha_k \cdot \mathrm{UpSample}_k(Y_k),$$
$$\alpha_k = \frac{\exp(W_a^{\top} Y_k)}{\sum_{j=1}^{4} \exp(W_a^{\top} Y_j)},$$
where $W_a$ is a learnable weight and $\mathrm{UpSample}$ denotes bilinear interpolation upsampling, ensuring consistent spatial resolution across all scales.
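A rough sketch of the dual pooling and attention-weighted cross-scale fusion, assuming all scales share the same channel width, scoring each scale with $W_a$ applied to a globally pooled $Y_k$, and using trilinear interpolation for the 5D feature tensors in place of the bilinear upsampling mentioned above; these are illustrative choices rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tpool(x: torch.Tensor) -> torch.Tensor:
    """Concatenate max- and average-pooled responses along the channel axis."""
    return torch.cat([F.max_pool3d(x, kernel_size=(1, 2, 2)),
                      F.avg_pool3d(x, kernel_size=(1, 2, 2))], dim=1)

class CrossScaleFusion(nn.Module):
    """Attention-weighted sum of multi-scale features after upsampling."""

    def __init__(self, channels: int):
        super().__init__()
        self.w_a = nn.Linear(channels, 1)  # W_a, producing one score per scale

    def forward(self, feats):  # feats: list of (B, C, T, H_k, W_k) tensors
        target_hw = feats[0].shape[-2:]    # upsample everything to the finest scale
        upsampled, scores = [], []
        for y in feats:
            y_up = F.interpolate(y, size=(y.shape[2], *target_hw),
                                 mode="trilinear", align_corners=False)
            upsampled.append(y_up)
            scores.append(self.w_a(y_up.mean(dim=(2, 3, 4))))  # (B, 1) score per scale
        alpha = torch.softmax(torch.cat(scores, dim=1), dim=1)  # (B, 4) attention weights
        return sum(alpha[:, k, None, None, None, None] * upsampled[k]
                   for k in range(len(upsampled)))

# Toy usage: four scales produced by the Conv3D blocks, pooled and fused.
Y = [tpool(torch.randn(2, 256, 8, s, s)) for s in (28, 14, 7, 4)]  # 512 channels each
fused = CrossScaleFusion(channels=512)(Y)  # (2, 512, 8, 14, 14)
```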
Granularity-aware Aggregation: Differentiated aggregation strategies are employed based on the granularity characteristics of features at different scales:
$$\hat{Y}_2 = \big[\mathrm{ReLU}\big(\mathrm{FC}(\mathrm{AvgPool}(Y_1))\big) \otimes \mathrm{FC}(Y_2);\; \mathrm{Conv}(Y_3) \otimes \mathrm{Conv}(Y_4)\big],$$
where $\otimes$ and $[\,\cdot\,;\,\cdot\,]$ denote matrix multiplication and feature concatenation, respectively. Fully connected (FC) layers are utilized for global modeling of the fine-grained features ($Y_1$, $Y_2$), while convolutional operations preserve the spatial structural information of the coarse-grained features ($Y_3$, $Y_4$).
Finally, two convolutional layers are used to adjust the dimension and enhance the representation of the fused multi-scale spatio-temporal features:
$$\hat{Y}_{\text{final}} = \mathrm{Conv}\big(\mathrm{Conv}(\hat{Y})\big),$$
where $\hat{Y} = [\hat{Y}_1; \hat{Y}_2]$. Hence, action category prediction is accomplished through global pooling followed by fully connected classification layers:
$$\hat{y} = \mathrm{Softmax}\big(W_c^{\top} \cdot \mathrm{GlobalPool}(\hat{Y}_{\text{final}})\big),$$
where $\mathrm{GlobalPool}$ represents global average pooling, $W_c$ is the classification weight matrix, and $\hat{y}$ is the predicted action category probability distribution. Moreover, temporal localization of action segments is performed through one-dimensional convolution:
$$T_{\text{pred}} = \mathrm{Conv1D}(\hat{Y}_{\text{final}}),$$
where $T_{\text{pred}} = (\hat{t}_s, \hat{t}_e, \hat{p})$ represents the predicted action clips, including the start time $\hat{t}_s$, end time $\hat{t}_e$, and confidence score $\hat{p}$.
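The two prediction heads can be sketched as follows, with global average pooling feeding a softmax classifier and a spatially pooled feature map feeding a 1D convolution for localization; the head widths, the per-time-step (start, end, confidence) layout, and the spatial pooling before the Conv1D are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RecognitionHeads(nn.Module):
    """Action classification head plus 1D temporal localization head."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        # Two convolutions adjusting/refining the fused multi-scale features.
        self.fuse = nn.Sequential(nn.Conv3d(channels, channels, kernel_size=1), nn.GELU(),
                                  nn.Conv3d(channels, channels, kernel_size=1))
        self.classifier = nn.Linear(channels, num_classes)                  # W_c
        self.localizer = nn.Conv1d(channels, 3, kernel_size=3, padding=1)   # (t_s, t_e, p) per step

    def forward(self, y_hat: torch.Tensor):
        y_final = self.fuse(y_hat)                                # (B, C, T, H, W)
        pooled = y_final.mean(dim=(2, 3, 4))                      # global average pooling
        probs = torch.softmax(self.classifier(pooled), dim=-1)    # category distribution
        temporal = y_final.mean(dim=(3, 4))                       # (B, C, T) for the Conv1D head
        t_pred = self.localizer(temporal)                         # (B, 3, T) localization outputs
        return probs, t_pred

heads = RecognitionHeads(channels=512, num_classes=3)
probs, t_pred = heads(torch.randn(2, 512, 8, 14, 14))
```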

2.3. Selective Replay Mechanism

This mechanism aims to mitigate catastrophic forgetting and enhance model robustness by strategically selecting high-value samples from current task data. Selected samples are stored in a capacity-constrained dynamic memory bank M to facilitate continual learning. The memory bank is initialized with randomly selected samples, formally defined as:
$$M = \big\{(V_j, y_j, \tilde{Y}_j)\big\}_{j=1}^{n},$$
where $y_j$ denotes the ground-truth label of sample $V_j$, $\tilde{Y}_j = \mathrm{GlobalPool}(\hat{Y}_{\text{final}}^{j})$, and $n$ indicates the memory capacity ($n = 500$).
Sample recognition difficulty dynamically evolves during training. Samples exhibiting persistently high errors or substantial error fluctuations typically contain critical discriminative information or boundary patterns not yet mastered by the model. To identify such challenging samples, we propose an error variability (EV)-based evaluation method. The error change rate (ECR) for sample $V_i$ quantifies prediction error variation across consecutive training epochs:
$$\mathrm{ECR}_i = \frac{1}{M-1}\sum_{m=2}^{M} \Delta e_i^{(m)}, \qquad \Delta e_i^{(m)} = \frac{\big|e_i^{(m)} - e_i^{(m-1)}\big|}{q_i}, \; m = 2, 3, \ldots, M, \qquad e_i^{(m)} = -\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}^{(m)},$$
where $\mathrm{ECR}_i$ is the error change rate, $e_i^{(m)}$ denotes the prediction error at epoch $m$, $\hat{y}_{i,c}^{(m)}$ is the predicted probability for class $c$, and $q_i$ represents the duration of the key operational behaviors. A high $\mathrm{ECR}_i$ value indicates unstable model performance on $V_i$, suggesting proximity to decision boundaries or the presence of complex discriminative patterns. Such samples are crucial for maintaining boundary clarity and enhancing discriminative capability.
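A small sketch of how the error change rate could be computed from per-epoch errors; taking the absolute value of consecutive error differences and normalizing by the behavior duration $q_i$ follows my reading of the fluctuation-based criterion above and should be treated as an assumption.

```python
import torch

def error_change_rate(per_epoch_errors: torch.Tensor, duration: torch.Tensor) -> torch.Tensor:
    """Average normalized fluctuation of the per-sample prediction error.

    per_epoch_errors: (N, M) cross-entropy error of sample i at epoch m.
    duration:         (N,)   duration q_i of the key operational behavior.
    """
    deltas = (per_epoch_errors[:, 1:] - per_epoch_errors[:, :-1]).abs()  # epoch-to-epoch change
    return (deltas / duration[:, None]).mean(dim=1)                      # mean over M - 1 steps

# Toy example: 4 samples tracked over 5 training epochs.
errors = torch.rand(4, 5)
q = torch.full((4,), 30.0)
ecr = error_change_rate(errors, q)  # higher values -> less stable, harder samples
```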
To complement error variability, sample diversity prevents feature-space over-concentration in the memory bank. We introduce a feature-embedding distance metric:
$$\mathrm{Div}_i = \min_{\tilde{Y}_j \in M} \big\| \tilde{Y}_i - \tilde{Y}_j \big\|_2,$$
where $\tilde{Y}_i = \mathrm{GlobalPool}(\hat{Y}_{\text{final}}^{i})$ and $\|\cdot\|_2$ is the Euclidean distance. A high $\mathrm{Div}_i$ value indicates significant feature-space separation from existing memory bank samples. Selecting high-divergence samples improves global representativeness and mitigates forgetting in corner regions of the feature space.
Ultimately, we balance learning difficulty (ECR) and feature coverage (Div) through a composite score:
$$\mathrm{Score}_i = \frac{\mathrm{Norm}(\mathrm{ECR}_i) + \mathrm{Norm}(\mathrm{Div}_i)}{2},$$
where $\mathrm{Norm}(\cdot)$ denotes min–max normalization. When adding new samples to a full memory bank, replacement follows an importance-driven strategy:
$$B = \big\{ V_i \mid \mathrm{Score}_i > \tau, \; i = 1, 2, \ldots, N \big\},$$
where $\tau$ is a threshold and $B = \{V_i\}_{i=1}^{K}$ is the selected sample set. Updates involve two phases: (1) addition of the samples in $B$, and (2) replacement of the lowest-scoring stored samples when capacity is exceeded. This maintains a high-value sample set that optimizes the difficulty-diversity trade-off.
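The divergence metric, composite score, and thresholded candidate set can be sketched together as below; the default $\tau = 0.55$ mirrors the value reported later in the hyperparameter analysis, and the replacement of the lowest-scoring stored samples is only indicated in a comment.

```python
import torch

def feature_divergence(new_feats: torch.Tensor, memory_feats: torch.Tensor) -> torch.Tensor:
    """Minimum Euclidean distance of each candidate embedding to the memory bank."""
    return torch.cdist(new_feats, memory_feats).min(dim=1).values

def minmax_norm(x: torch.Tensor) -> torch.Tensor:
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def select_replay_candidates(ecr: torch.Tensor, div: torch.Tensor, tau: float = 0.55):
    """Composite difficulty/diversity score and the indices exceeding the threshold."""
    score = 0.5 * (minmax_norm(ecr) + minmax_norm(div))
    return torch.nonzero(score > tau).flatten(), score

# Toy example: 6 new candidates scored against a memory bank of 10 stored embeddings.
new_feats = torch.randn(6, 128)     # GlobalPool(Y_final) embeddings of the candidates
mem_feats = torch.randn(10, 128)    # embeddings already stored in M
div = feature_divergence(new_feats, mem_feats)
ecr = torch.rand(6)                 # e.g., from error_change_rate above
selected, score = select_replay_candidates(ecr, div)
# Once the capacity n = 500 is exceeded, the lowest-scoring stored samples are replaced.
```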

2.4. Model Optimization

We integrate knowledge distillation and sample replay to alleviate catastrophic forgetting. The total loss combines task loss, distillation loss, and replay loss:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \mathcal{L}_{\text{distill}} + \delta \mathcal{L}_{\text{replay}},$$
where $\delta$ is a weighting factor and $\mathcal{L}_{\text{task}} = \mathcal{L}_{\text{action}} + \mathcal{L}_{\text{temp}}$. The action recognition and temporal localization losses are:
$$\mathcal{L}_{\text{action}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} \mathbb{I}(y_i = c) \log \hat{y}_{i,c}, \qquad \mathcal{L}_{\text{temp}} = \frac{1}{N}\sum_{i=1}^{N} \big[ f(t_s^{i} - \hat{t}_s^{i}) + f(t_e^{i} - \hat{t}_e^{i}) \big],$$
where $N$ is the batch size, $C$ is the number of categories, and $f$ denotes the Smooth L1 function [20]. The knowledge distillation loss preserves historical decision boundaries:
$$\mathcal{L}_{\text{distill}} = \mathrm{KL}\big( \hat{y}_i(V_i, \theta_{\text{old}}) \,\|\, \hat{y}_i(V_i, \theta_{\text{new}}) \big),$$
where $\mathrm{KL}$ denotes the KL divergence, and $\theta_{\text{old}}$ and $\theta_{\text{new}}$ represent the previous and current model parameters, respectively. The sample replay loss consolidates recognition capabilities:
$$\mathcal{L}_{\text{replay}} = \sum_{(V_j, y_j) \in M} \mathcal{L}_{\text{CE}}\big( \hat{y}(V_j), y_j \big),$$
where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss.
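A rough sketch of the combined objective, assuming classification logits, (start, end) span regression with Smooth L1, and the KL direction written above (old model as the target distribution); the default $\delta = 0.6$ mirrors the value reported in the hyperparameter analysis, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(new_logits, labels, old_logits, pred_spans, gt_spans,
               replay_logits, replay_labels, delta: float = 0.6):
    """Task loss + distillation loss + weighted replay loss."""
    # Task loss: action classification plus Smooth-L1 temporal localization.
    l_task = F.cross_entropy(new_logits, labels) + F.smooth_l1_loss(pred_spans, gt_spans)

    # Distillation: KL(old || new) between previous and current predictions.
    l_distill = F.kl_div(F.log_softmax(new_logits, dim=1),
                         F.softmax(old_logits, dim=1), reduction="batchmean")

    # Replay: cross-entropy on exemplars drawn from the memory bank.
    l_replay = F.cross_entropy(replay_logits, replay_labels)
    return l_task + l_distill + delta * l_replay

# Toy shapes: batch of 8, 3 behavior classes, (start, end) spans in seconds.
loss = total_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)), torch.randn(8, 3),
                  torch.rand(8, 2) * 60, torch.rand(8, 2) * 60,
                  torch.randn(8, 3), torch.randint(0, 3, (8,)))
print(float(loss))
```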

3. Experimental Results

3.1. Dataset and Evaluation Metrics

The dataset comprises 4716 high-definition (1080P) videos capturing power grid operation under diverse conditions, including varying lighting, time periods, locations, and weather. Professional annotators labeled three key operational behaviors: climbing up, voltage verification, and placing the grounding rod. Table 1 details the dataset distribution.
To evaluate the adaptability of the proposed method for power grid operation behavior recognition under dynamic meteorological conditions, we designed continual learning task units based on different meteorological scenarios (sunny—Task 1, cloudy—Task 2, rainy—Task 3, i.e., r = 3 ). Within each task unit, training, validation, and test sets contain three key operational behaviors, covering feature drift caused by operational environment changes (such as visual blurriness in grounding wire operations during foggy and rainy weather, and representational variations in climbing postures under low-light conditions). Model performance is evaluated using mean Average Precision (mAP) as the core evaluation metric, which comprehensively considers the average precision across all test samples for each category.

3.2. Implementation Details

Experimental environment: Python 3.10, PyTorch 2.1.0, CUDA 12.1, with NVIDIA RTX 3090 GPU hardware. Network architecture: Pre-trained I3D network [21] for initializing the spatio-temporal feature-extraction module, and pre-trained ActionFormer network [22] for initializing the spatio-temporal feature encoder. Training phase: AdamW optimizer with learning rate 0.0001, momentum 0.9, weight decay 0.05; learning rate schedule with linear warm-up for 10 epochs followed by cosine annealing for 30 epochs; batch size N = 64 , total training epochs 40.
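A sketch of the reported training schedule in PyTorch, interpreting the stated momentum of 0.9 as the AdamW beta_1 and stepping the scheduler once per epoch; the warm-up start factor and the placeholder model are assumptions.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(512, 3)  # placeholder for the full ISAS network
optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.05)

# 10 epochs of linear warm-up followed by 30 epochs of cosine annealing (40 total).
scheduler = SequentialLR(
    optimizer,
    schedulers=[LinearLR(optimizer, start_factor=0.01, total_iters=10),
                CosineAnnealingLR(optimizer, T_max=30)],
    milestones=[10],
)

for epoch in range(40):
    # ... one epoch over the current task's data plus replayed memory samples ...
    scheduler.step()
```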

3.3. Comparative Results and Analysis

To evaluate the advantages of our method in addressing environmental changes, we select four representative benchmark models for comparison: (1) Two-Stream CNN (TS-CNN) [23] that fuses spatial (RGB stream) and temporal (optical flow) information; (2) Elastic Weight Consolidation (EWC) [24], a method for addressing catastrophic forgetting; (3) TriDet [25] that uses a feature pyramid with layer-by-layer down-sampling to expand the high-level feature receptive field; and (4) SVFormer [6], a Transformer-based behavior recognition method. Table 2 shows the behavior recognition performance (mAP) of our method and comparative methods across three meteorological task units. The inter-scenario mAP difference (i.e., forgetting rate) is defined as the difference between the maximum and minimum mAP of a single model across the three tasks.
From the results, it can be observed that under sunny conditions (Task 1), all methods achieve relatively high performance (mAP > 82%), with our method achieving the highest value of 89.80%. Our method demonstrates highly competitive performance in sunny scenarios, matching SVFormer within one percentage point even in categories where SVFormer leads slightly (e.g., “put grounding rod (grounding)” at 85.26%). When meteorological conditions deteriorate, all methods experience performance degradation, but the extent of degradation varies significantly. Our method achieves mAPs of 89.80%, 88.44%, and 87.06% in the sunny (Task 1), cloudy (Task 2), and rainy (Task 3) conditions, respectively, with a fluctuation range (forgetting rate) of only 2.74%, significantly outperforming the comparative methods (TS-CNN: 8.52%, EWC: 7.06%, TriDet: 5.28%, SVFormer: 4.92%). This validates the advantages of our method in cross-scenario continual learning, particularly in addressing feature drift caused by complex weather conditions. Although the EWC method alleviates forgetting through parameter regularization, it struggles to adequately address deep feature changes caused by non-semantic factors (such as meteorology), and its overall performance remains significantly lower than that of our method, indicating that the proposed method possesses stronger adaptability and generalization capability in cross-environment recognition.
Further analysis reveals that the performance drop in rainy and cloudy scenarios stems not merely from static image-quality reduction (e.g., lower contrast), but mainly from dynamic visual disturbances that disrupt motion consistency. For example, rain induces high-frequency spatio-temporal artifacts due to transient raindrop occlusions, resulting in inconsistent optical flow estimates. Similarly, under cloudy or low-light conditions, rapid illumination changes cause significant inter-frame intensity variations, impairing motion cue integration essential for recognizing continuous actions such as “voltage verification” or “put grounding rod”. The proposed FEFM module tackles these issues via a random spatio-temporal masking strategy that emulates real-world occlusions and variations during training, thereby enhancing temporal representation robustness. Moreover, the cross-scale aggregation mechanism helps reconstruct corrupted motion details from stable context, leading to improved consistency and explaining the notably lower forgetting rate observed with our method.
To further verify the role of each module in the proposed method, we conduct ablation experiments, with results shown in Table 3. Introducing the FEFM significantly improves the basic adaptability of the model in adverse environments, increasing the rainy-task mAP from 74.25% to 78.62% (+4.37%), which indicates that fusing the enhanced spatio-temporal features helps the model capture key behavioral cues under low-visibility conditions. However, this module alone still struggles to address cross-scenario knowledge forgetting. Introducing the SRM, which focuses on difficult samples and enhances the retention of key features across scenarios, yields substantial performance improvements on all three tasks (rainy mAP reaching 83.79%), effectively mitigating feature degradation. The final complete model achieves an average mAP of 88.43% across the three task scenarios, a 9.41 percentage point improvement over the baseline model. Furthermore, we ablate the contribution of the granularity-aware aggregation mechanism within the FEFM: without this component, the average mAP drops by 1.57%, indicating its importance in integrating multi-scale features effectively. These results validate the critical value of the synergistic effects of the modules in improving recognition performance under extreme environments and demonstrate the wide adaptability of the proposed method under complex meteorological conditions.
As illustrated in Figure 5, we include training curves depicting the progression of mAP across epochs for both the full model and its ablated variants under various meteorological conditions. These curves provide intuitive insights into the training dynamics, highlighting the stability and convergence behavior of our approach as well as the incremental performance gains contributed by each module. The curves show that the full model converges robustly and attains higher performance under different weather scenarios, emphasizing the complementary roles of the modules.
Furthermore, we analyze the sensitivity of key hyperparameters, i.e., the threshold τ and the weighting factor δ , with results shown in Figure 6. Experiments demonstrated that when τ deviates from 0.55 or δ deviates from 0.6, model performance significantly decreases. Optimal performance occurs at τ = 0.55 and δ = 0.6 . This indicates that reasonable hyperparameter settings are crucial for ensuring model performance, and fine-tuning not only improves recognition accuracy but also enhances the robustness and generalization capability of the model under variable meteorological conditions.
To more intuitively demonstrate the practical effectiveness of the proposed method in power grid operation behavior recognition, Figure 7 shows frame-level recognition visualization results of a random sample video from the test set. As illustrated in Figure 7, the first row depicts a “climbing up” scenario in which the model accurately localizes the climber and identifies the behavior with high confidence (0.81). Practically, this capability enables the supervision system to reliably trigger real-time alerts when personnel are climbing an electric pole, ensuring that safety protocols (such as the use of dual safety harnesses) are activated and monitored from the outset of an operation. In the second row, which presents a “put grounding rod” scenario, the model successfully identifies the precise moment of installing the grounding wire, focusing on the relative positional relationship between the clamp and the operating handle. This fine-grained discriminative capability is crucial for verifying procedural compliance. For instance, the system could automatically confirm whether the grounding wire is correctly attached to the designated point before the operator proceeds, a key step in preventing electrocution accidents.
To assess the generalizability of our approach, we further conduct experiments on a public dataset, THUMOS-14 [26], with the results listed in Table 4. This dataset comprises 200 untrimmed training videos and 213 untrimmed test videos covering 20 action categories, with background clips occupying an average of 71% of each video's duration. In alignment with the evaluation protocols of most baseline models, we assess our method using mean Average Precision (mAP) at temporal Intersection over Union (tIoU) thresholds of 0.3, 0.5, and 0.7. The experimental results demonstrate that the proposed method significantly outperforms other approaches. Specifically, it achieves mAPs of 83.9%, 73.1%, and 47.0% at tIoU thresholds of 0.3, 0.5, and 0.7, respectively, representing improvements of 0.8%, 1.4%, and 1.2% over ASL [27].
Table 4. Results (%) on the publicly available THUMOS-14 dataset [26].
Method          | 0.3  | 0.5  | 0.7  | Avg. mAP
TALLFormer [28] | 68.4 | 57.6 | 30.8 | 52.3
TadTR [29]      | 74.8 | 60.1 | 32.8 | 55.9
ASL [27]        | 83.1 | 71.7 | 45.8 | 66.9
Ours            | 83.9 | 73.1 | 47.0 | 68.0

4. Discussion

Experimental results demonstrate that the proposed method effectively tackles two major challenges in power grid operation behavior recognition: environmental variability and catastrophic forgetting. The FEFM module exhibits strong capability in handling meteorological variations through multi-scale spatio-temporal enhancement and cross-scale aggregation, achieving 89.80% mAP under sunny conditions while maintaining robust performance across diverse weather scenarios. This advantage stems from its adaptive attention mechanism and spatio-temporal feature augmentation, which dynamically emphasize informative regions while suppressing weather-induced noise, a critical capability for real-world deployment in unpredictable environments. Meanwhile, the SRM module proves highly effective in continual learning, as reflected by a notably low forgetting rate of 2.74% across three meteorological tasks. Its dual-criteria selection strategy balances error variability for boundary sample prioritization and feature divergence for representation diversity, outperforming conventional methods in mitigating forgetting while reducing feature redundancy.
From a practical standpoint, this integrated approach offers substantial value for power grid safety monitoring by ensuring consistent performance amid seasonal variations, enabling continuous adaptation to edge cases, and supporting scalable deployment across regions without full model retraining. However, certain limitations remain. The study currently excludes extreme conditions such as snow and fog, and the multi-scale processing introduces additional computational overhead. Moreover, while the FEFM’s augmentation strategy effectively counteracts short-term and intermittent disturbances, its performance remains theoretically bounded under persistently adverse conditions (e.g., heavy rain or dense fog) where visual inputs are severely corrupted. This inherent constraint of vision-based methods explains the slight performance decline observed even with our method in rainy settings. Future work will focus on integrating multi-modal sensors (e.g., thermal or LiDAR) that are less susceptible to such visual degradation, thereby providing more stable and complementary signals for reliable behavior recognition.

5. Conclusions

We have presented ISAS, an Incremental Spatio-temporal Augmented Sampling method that advances continual learning for power grid operation behavior recognition by simultaneously addressing environmental variability and catastrophic forgetting. The core contributions include the spatio-temporal Feature-Enhancement Fusion Module (FEFM), which achieves weather-invariant feature extraction through multi-scale spatio-temporal enhancement and adaptive cross-scale aggregation, yielding 89.80% mAP under optimal conditions. Complementing this, the Selective Replay Mechanism (SRM) implements a dual-criteria sample selection strategy based on error variability and feature divergence metrics, resulting in exceptional continual learning performance with a forgetting rate of merely 2.74% across diverse meteorological scenarios. Experimental validation confirms the superiority of the proposed ISAS in both environmental robustness and knowledge retention, outperforming established benchmarks while providing a practical solution for real-world monitoring challenges.

Author Contributions

Conceptualization, L.M. and D.H.; methodology, L.M. and D.H.; validation, G.B.; formal analysis, L.M.; investigation, D.H.; data curation, S.G.; writing—original draft preparation, D.H.; writing—review and editing, L.M.; visualization, D.H. and S.G.; supervision, G.B. and S.G.; project administration, G.B.; funding acquisition, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Guizhou Power Grid Co., Ltd., grant number GZKJXM20222320.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is unavailable due to privacy or ethical restrictions.

Conflicts of Interest

Authors Lingwen Meng, Guobang Ban, and Siqi Guo were employed by the company Electric Power Research Institute of Guizhou Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from Guizhou Power Grid Co., Ltd., grant number GZKJXM20222320. The funder had the following involvement with the study: Conceptualization, Lingwen Meng; methodology, Lingwen Meng; validation, Guobang Ban; formal analysis, Lingwen Meng; data curation, Siqi Guo; writing—review and editing, Lingwen Meng; visualization, Siqi Guo; supervision, Guobang Ban and Siqi Guo; project administration, Guobang Ban; funding acquisition, Lingwen Meng.

References

  1. Meng, L.; Ban, G.; Liu, F.; Qiu, W.; He, D.; Zhang, L.; Wang, S. Classification of Violations in Power Grid Operations Based on Cross-domain Few-Shot Learning. Power Big Data 2024, 27, 69–76. [Google Scholar] [CrossRef]
  2. Wang, J.; Sun, L.; Du, N.; Hua, C. The Functions and Applications of Live-line Operation Robots in Distribution Networks. Electr. Power Energy 2024, 45, 518–520. [Google Scholar]
  3. Sun, Y.; Liu, Y.; Han, Y. Research on Safety Supervision Technology for Power Grid Operators Based on Wearable Sensors and Video Surveillance. Electr. Age 2018, 45–46. Available online: https://kns.cnki.net/kcms2/article/abstract?v=IMWkopLkOPXW_5EjYSgpPWEUdRHZwPWVzhcuJ7FYvdI8SR78Ll29rMUC1ZHuUAPyyqc79swY2xZuvZpF2naVE8tXTab6WnTpx9MGrZefCCe5Vx3J4M4Q6z_DwjWa1V7ma2nBkq3qgm6f1D8ze77fKwz6qUWZsslANIJAoVLLWfI=&uniplatform=NZKPT&language=CHS (accessed on 1 August 2025).
  4. Wu, T. Research on On-Site Safety Management of State Grid Xiaogan Power Supply Company. Master’s Thesis, Huazhong University of Science and Technology, Wuhan, China, 2022. [Google Scholar] [CrossRef]
  5. Cen, J.; Weng, Z.; Lin, T.; Li, G.; Yang, L. Detection Method of Violations in Power Grid Operation Sites Based on Machine Vision. Electr. Technol. Econ. 2025, 307–308+315. Available online: https://kns.cnki.net/kcms2/article/abstract?v=IMWkopLkOPUkJXTJIt04G0SE_o9sx1w_bHTk0Q2z8R4E4HP90PMFTFjWpJJfgH0Ij7X01EkIQcfNTT8E8nGwsg4MnJkKjeaB39OHpi-5TTJEMEQtgooYaMni3ecwZB5QiDdGlH4RNj4DOIrekrXzcCuPu0WPvMueen7_ES0L2Yooytgzee_wEPzTFI4Oiwiz&uniplatform=NZKPT&language=CHS (accessed on 1 August 2025).
  6. Xing, Z.; Dai, Q.; Hu, H.; Chen, J.; Wu, Z.; Jiang, Y.G. Svformer: Semi-supervised video transformer for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18816–18826. [Google Scholar]
  7. Xiao, C.; Lei, Y.; Liu, C.; Wu, J. Mean teacher-based cross-domain activity recognition using WiFi signals. IEEE Internet Things J. 2023, 10, 12787–12797. [Google Scholar] [CrossRef]
  8. Ban, G.; Fu, L.; Jiang, L.; Du, H.; Li, A.; He, Y.; Zhou, J. Dynamic Risk Identification of Personnel Behavior in Two-stage Complex Operations Based on Image Screening. Power Big Data 2024, 27, 58–69. [Google Scholar] [CrossRef]
  9. Song, X.; Yao, X. Human Behavior Recognition Based on Multi-Descriptor Feature Coding. Comput. Technol. Dev. 2018, 28, 17–21. [Google Scholar]
  10. Shi, A.; Cheng, Y.; Cao, X. Human Behavior Recognition Method Combining Codebook Optimization and Feature Fusion. Comput. Technol. Dev. 2018, 28, 107–111. [Google Scholar]
  11. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef] [PubMed]
  12. Feichtenhofer, C. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213. [Google Scholar]
  13. Tian, Q.; Miao, W.; Zhang, L.; Yang, Z.; Yu, Y.; Zhao, Y.; Yao, L. STCA: An action recognition network with spatio-temporal convolution and attention. Int. J. Multimed. Inf. Retr. 2025, 14, 1. [Google Scholar] [CrossRef]
  14. Lin, J.; Gan, C.; Han, S. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 7083–7093. [Google Scholar]
  15. Zhao, C.; Feng, X.; Cao, R. Video Behavior Recognition Based on Spatio-Temporal Dual-Stream Feature Enhancement Network. Comput. Eng. Des. 2025, 46, 871–878. [Google Scholar] [CrossRef]
  16. Chen, D.; Chen, M.; Wu, P.; Wu, M.; Zhang, T.; Li, C. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Sci. Rep. 2025, 15, 4982. [Google Scholar] [CrossRef] [PubMed]
  17. Lu, Y.; Zhou, X.; Zhang, S.; Liang, G.; Xing, Y.; Cheng, D.; Zhang, Y. Review of Continuous Learning Methods Based on Pre-Training. Comput. Eng. 2025, 1–17. [Google Scholar] [CrossRef]
  18. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  19. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; Van Der Smagt, P.; Cremers, D.; Brox, T. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
  20. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  21. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  22. Zhang, C.L.; Wu, J.; Li, Y. Actionformer: Localizing moments of actions with transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 492–510. [Google Scholar]
  23. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, Montreal, QC, Canada, 8–13 December 2014; pp. 568–576. [Google Scholar]
  24. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  25. Shi, D.; Zhong, Y.; Cao, Q.; Ma, L.; Li, J.; Tao, D. Tridet: Temporal action detection with relative boundary modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18857–18866. [Google Scholar]
  26. Jiang, Y.; Liu, J.; Roshan Zamir, A.; Toderici, G.; Laptev, I.; Shah, M.; Sukthankar, R. THUMOS’14: ECCV Workshop on Action Recognition with a Large Number of Classes. 2014. Available online: http://crcv.ucf.edu/THUMOS14/ (accessed on 1 August 2025).
  27. Shao, J.; Wang, X.; Quan, R.; Zheng, J.; Yang, J.; Yang, Y. Action sensitivity learning for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 13457–13469. [Google Scholar]
  28. Cheng, F.; Bertasius, G. Tallformer: Temporal action localization with a long-memory transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 503–521. [Google Scholar]
  29. Liu, X.; Wang, Q.; Hu, Y.; Tang, X.; Zhang, S.; Bai, S.; Bai, X. End-to-end temporal action detection with transformer. IEEE Trans. Image Process. 2022, 31, 5427–5441. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Sample examples of power grid operation in complex environments. The red box represents the target to be recognized.
Figure 2. Feature-distribution shift of old samples in continual learning.
Figure 3. Overall framework of the proposed method.
Figure 4. The structure of the spatio-temporal Feature-Enhancement Fusion Module.
Figure 5. Training curves of mAP over epochs for ablation study and full model.
Figure 6. Impact of the threshold τ and the weighting factor δ on model performance.
Figure 7. Visualization results of power grid operation behavior recognition, where the behavior and confidence of the recognition are placed in the top right corner. The person is marked with a red box, the electric pole is marked with a blue box, and the voltage detector is marked with a green box.
Table 1. Dataset distribution.
Data           | Number of Videos | Avg. Video Duration | Avg. Action Duration
Training Set   | 3319             | 805.9 s             | 57.4 s
Validation Set | 700              | 876.2 s             | 56.3 s
Test Set       | 697              | 822.5 s             | 50.5 s
Table 2. Comparison of recognition performance of different models on different tasks.
Method       | Scenario | Climbing (%) | Verification (%) | Grounding (%) | mAP (%) | Inter-Scenario mAP Diff. (%)
TS-CNN [23]  | Sunny    | 85.66 | 81.63 | 80.06 | 82.45 | 8.52
             | Cloudy   | 80.75 | 77.79 | 76.68 | 78.41 |
             | Rainy    | 76.84 | 73.86 | 71.08 | 73.93 |
EWC [24]     | Sunny    | 88.24 | 83.82 | 84.78 | 85.61 | 7.06
             | Cloudy   | 84.87 | 79.68 | 81.25 | 81.93 |
             | Rainy    | 81.58 | 76.37 | 77.69 | 78.55 |
TriDet [25]  | Sunny    | 89.74 | 85.63 | 84.81 | 86.73 | 5.28
             | Cloudy   | 87.15 | 83.98 | 81.63 | 84.25 |
             | Rainy    | 84.03 | 81.76 | 78.57 | 81.45 |
SVFormer [6] | Sunny    | 90.12 | 86.75 | 85.26 | 87.38 | 4.92
             | Cloudy   | 87.43 | 85.01 | 82.58 | 85.01 |
             | Rainy    | 84.65 | 83.26 | 79.47 | 82.46 |
Ours         | Sunny    | 93.67 | 89.59 | 86.13 | 89.80 | 2.74
             | Cloudy   | 91.89 | 88.32 | 85.11 | 88.44 |
             | Rainy    | 90.68 | 86.87 | 83.64 | 87.06 |
Table 3. Results of ablation study (mAP, %). The symbol * indicates ablating the contribution of the granularity-aware aggregation mechanism within FEFM.
Method   | Sunny mAP | Cloudy mAP | Rainy mAP | Avg. mAP
Baseline | 83.67     | 79.13      | 74.25     | 79.02
+FEFM    | 85.19     | 83.35      | 78.62     | 82.39
+SRM     | 86.76     | 85.27      | 83.79     | 85.27
Ours     | 89.80     | 88.44      | 87.06     | 88.43
Ours *   | 88.15     | 86.78      | 85.64     | 86.86
