1. Introduction
Electricity, as a pillar industry of the modern economy, plays a crucial role in enhancing the service level of the power sector and supporting high-quality socio-economic development through its secure operation [1]. With the continuous growth of national electricity demand, the security and reliability of power systems have become a focal point of societal concern. In this context, electricity load forecasting, as a core technical component in power system planning and operation, serves irreplaceable functions. For short-term dispatch, accurate load forecasting optimizes power generation scheduling and transmission resource allocation, preventing power outages or resource wastage caused by supply–demand imbalance [2]. For medium-to-long-term planning, forecasting results provide scientific foundations for power source investments and grid expansion, ensuring the economic efficiency and stability of power systems [3,4]. Furthermore, from an environmental protection perspective, precise load forecasting reduces energy use intensity and emissions, mitigates environmental impacts, and promotes the sustainable development of energy systems [5].
However, electricity load data exhibit significant nonlinearity, periodicity, and long-term dependencies, compounded by dynamic interference from external factors such as meteorological conditions and holidays [6], rendering load forecasting inherently challenging. As global energy demand escalates and power systems grow increasingly complex, the importance of load forecasting becomes more pronounced. By accurately predicting load demand, power suppliers can optimize generation and transmission plans, reduce operational costs, minimize energy waste, and advance sustainable energy system development [7,8].
In the face of the above challenges, this paper is committed to building a high-precision power load forecasting framework. The structure of this article is outlined as follows: Section 2 systematically reviews time-series prediction methods and research on their applications in power load forecasting. Section 3 proposes a hybrid model, LMD-MSCA-SCINet, based on signal decomposition and a multi-scale attention mechanism. In Section 4, we conduct comparative experiments and analysis with other models using actual datasets to verify the advantages of the proposed model. Finally, Section 5 summarizes our conclusions and briefly looks forward to future work.
2. Related Works
In recent years, many researchers have conducted extensive explorations in the field of time-series forecasting, with significant improvements in the diversity and innovation of research methods. In power load forecasting, existing time-series forecasting methods can be broadly categorized into classical forecasting methods, machine learning methods, deep learning methods, and hybrid forecasting methods [9,10].
In the realm of classical forecasting methods and machine learning, Yunsun Kim utilized the ARIMA model for short-term power load forecasting [11]. Waqas Ahmad employed a genetic algorithm based on extreme learning machines for short-term power load forecasting [12]. Additionally, many researchers have focused on integrating different machine learning algorithms to construct new forecasting models. Jinliang Zhang proposed a hybrid model based on improved empirical mode decomposition, autoregressive integrated moving average, and fruit fly optimization algorithm (FOA)-optimized wavelet neural networks (WNNs) for short-term power load forecasting [13] and further developed an adaptive hybrid model for short-term electricity price forecasting [14]. Guo-Feng Fan et al. combined empirical mode decomposition, support vector regression, and particle swarm optimization for electricity consumption forecasting [15] and developed a new short-term load forecasting model [16]. Salahuddin Khan developed a novel integrated short-term load forecasting model that combines wavelet transform decomposition, radial basis function networks, and the heat exchange optimization algorithm [4]. Other researchers combined ELM with quantile regression for probabilistic electricity price forecasting and demonstrated the method's effectiveness on data from several electricity markets [17].
In the field of deep learning-based time-series forecasting, many innovative models and methods have been proposed, primarily focusing on improving the accuracy and adaptability of predictions. Anastasia Borovykh introduced the Temporal Convolutional Network (TCN), which uses stacked dilated convolutions to give the network access to extensive historical data, enhancing its ability to handle long-range dependencies in time series [18]. Yitian Chen developed a probabilistic forecasting framework based on convolutional neural networks suitable for handling multiple related time series, highlighting the advantages of deep learning in capturing complex relationships between time series [19]. Bo-Sung Kwon utilized a deep neural network based on LSTM layers for short-term load forecasting [20]. Many researchers have explored combining multiple methods to enhance predictive accuracy, particularly in applications of LSTM. As a common time-series forecasting model, LSTM has been used with some success in predicting the electricity consumption of hospital facilities, but it faces challenges such as high computational costs during training and susceptibility to the vanishing gradient problem [21]. To address these issues, researchers have investigated hybrid architecture optimization, integrating different techniques and algorithms to improve model performance. Slawek Smyl proposed a hybrid method combining exponential smoothing with recurrent neural networks for time-series forecasting, merging the strengths of traditional statistical techniques and modern deep learning approaches [22]. Xinlei Zhou combined LSTM with reinforcement learning agents to predict the electricity consumption of buildings [23]. Zulfiqar Ahmad Khan used a CNN to extract spatial features and integrated them with LSTM for resource prediction [24,25]. Oreshkin et al. proposed the N-BEATS neural network, which effectively addresses mid-term power load forecasting problems to some extent [26]. Xinjian Xiang et al. proposed a combined CEEMDAN-TCN-SMA method to predict short-term power load [27].
Improved models based on the Transformer architecture have also been widely applied in time-series forecasting [28], with the Informer model being a typical representative [29]. The Transformer model has been introduced into time-series forecasting due to its efficient handling of long-sequence data [30]. Some researchers used a combined CEA-TCN-Transformer model to predict power load [31]. The Informer model has achieved promising results in power load forecasting [32,33], but its global dependency modeling may lead to the loss of temporal information, limiting its performance [34]. Some researchers have attempted to combine the fast Fourier transform with the Transformer architecture to address long-sequence time-series forecasting [35]. In contrast, SCINet has demonstrated excellent performance in multi-scale feature extraction and time-series decomposition, showing strong adaptability in complex forecasting tasks [36,37]. Subsequently, researchers have further enhanced the accuracy and reliability of forecasting models by integrating SCINet with techniques such as Bayesian probabilistic recursion [38]. There have also been attempts to add LSTM to SCINet for power load forecasting [39]. Additionally, the introduction of local mean decomposition has allowed for the effective extraction of trend and fluctuation characteristics in time series, improving the robustness and accuracy of forecasting models [40,41].
From the above literature, it can be seen that signal decomposition technology can effectively enhance the representation ability of deep learning models by decoupling the multi-scale components of time-series data, and the application of attention mechanisms in architectures such as the Transformer verifies the universality of dynamic feature enhancement strategies. On this basis, this paper constructs a "decomposition–feature enhancement–time-series modeling" collaborative framework and proposes a SCINet hybrid prediction model that integrates local mean decomposition and multi-scale channel attention. The model separates the high-frequency noise and low-frequency trend of the power load sequence through a signal-decomposition-driven multi-scale feature decoupling mechanism and uses a dynamic attention weighting strategy to adaptively focus on key time granularities, thereby improving the joint modeling of multi-scale dynamic features.

However, despite its strong prediction accuracy, the model still has certain limitations. Because it uses a deep convolutional structure, its parameter scale is large; with limited training data, there is a risk of overfitting, especially in scenarios with few abnormal samples, such as extreme weather or emergencies, where the model's generalization ability may suffer. In terms of interpretability, although LMD enhances the interpretability of input features through physically driven modal separation, the nested structure of deep convolution and attention still gives the feature interaction process a black-box character; its decision-making process remains relatively complex and lacks intuitive interpretability compared with traditional statistical models.
3. Method
3.1. Local Mean Decomposition
Local mean decomposition is an adaptive time-frequency analysis method widely used to decompose complex, multi-component amplitude-modulated and frequency-modulated (AM-FM) signals. The process of LMD is implemented through iterative cycles, with the specific steps described as follows:
Suppose the original signal is $x(t)$. An extreme-value detection algorithm traverses all time points to identify local maximum points (higher than adjacent points) and local minimum points (lower than adjacent points). These extreme points are interpolated with cubic splines to generate the upper envelope $E_{\max}(t)$ and the lower envelope $E_{\min}(t)$. The construction of these two envelopes essentially describes the instantaneous amplitude range boundaries of the signal.

Based on the upper and lower envelopes, the local mean function $m_1(t)$ and the instantaneous amplitude function $a_1(t)$ are calculated:

$$m_1(t) = \frac{E_{\max}(t) + E_{\min}(t)}{2}, \qquad a_1(t) = \frac{E_{\max}(t) - E_{\min}(t)}{2},$$

where $m_1(t)$ characterizes the time-varying baseline of the signal (the low-frequency trend after removing high-frequency oscillations), reflecting the local trend component of the signal, while $a_1(t)$ quantifies the local fluctuation intensity of the signal and reflects its instantaneous amplitude fluctuation range.
By subtracting the local mean from the original signal, a high-frequency oscillation signal $h_1(t)$ is obtained:

$$h_1(t) = x(t) - m_1(t).$$

Then, $h_1(t)$ is normalized by the instantaneous amplitude to obtain $s_1(t)$:

$$s_1(t) = \frac{h_1(t)}{a_1(t)}.$$
After normalization, the amplitude of $s_1(t)$ is constrained to the [−1, 1] interval, and the signal is converted into a pure frequency-modulated signal (containing only phase information), becoming a standardized oscillation component after the amplitude modulation is eliminated.
The first PF component is reconstructed by multiplying the normalized signal $s_1(t)$ by the instantaneous amplitude $a_1(t)$:

$$\mathrm{PF}_1(t) = a_1(t)\, s_1(t).$$

The physical meaning of the PF component is a single-mode AM-FM component of the signal, containing both AM ($a_1(t)$) and FM ($s_1(t)$) information. Then, the residual signal $u_1(t)$ is calculated:

$$u_1(t) = x(t) - \mathrm{PF}_1(t).$$
The residual signal $u_1(t)$ represents the remaining part of the original signal after subtracting the current PF component, including the low-frequency trend and the remaining high-frequency components not explained by the current PF component, and serves as the input signal for the next round of decomposition.

Take $u_1(t)$ as the new input and repeat the above steps to iteratively extract PF components, gradually removing the multi-scale components of the signal, until the residual signal is a monotonic trend or noise, that is, until the residual energy ratio falls below a threshold or the number of extreme points of the residual signal is less than or equal to 1.
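To make the iterative procedure concrete, the following is a minimal Python sketch of LMD as described above. It assumes cubic-spline envelopes and a fixed inner sifting budget; the function name `lmd`, the tolerance, and the iteration limits are illustrative choices, not the exact settings used in this paper.

```python
import numpy as np
from scipy.signal import argrelextrema
from scipy.interpolate import CubicSpline

def lmd(x, max_pf=7, sift_iters=10, tol=1e-3):
    """Minimal LMD sketch: peel off product functions (PFs) one by one."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    pfs, resid = [], x.copy()
    for _ in range(max_pf):
        h, a_total = resid.copy(), np.ones_like(resid)
        for _ in range(sift_iters):  # inner sifting loop
            maxima = argrelextrema(h, np.greater)[0]
            minima = argrelextrema(h, np.less)[0]
            if len(maxima) < 2 or len(minima) < 2:   # residual is (near) monotonic
                return pfs, resid
            upper = CubicSpline(maxima, h[maxima])(t)  # upper envelope E_max(t)
            lower = CubicSpline(minima, h[minima])(t)  # lower envelope E_min(t)
            m = (upper + lower) / 2                    # local mean m(t)
            a = np.maximum(np.abs(upper - lower) / 2, 1e-12)  # amplitude a(t)
            h = (h - m) / a                            # demodulate: h -> s(t)
            a_total *= a                               # accumulate the envelope
            if np.max(np.abs(a - 1.0)) < tol:          # envelope flat: pure FM
                break
        pf = a_total * h                               # PF = envelope * FM signal
        pfs.append(pf)
        resid = resid - pf                             # residual for next round
    return pfs, resid
```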
3.2. Multi-Scale Channel Attention
The multi-scale convolutional attention module effectively combines the advantages of the local receptive field of convolutional neural networks with the multi-scale characteristics of Transformers, achieving efficient modeling of contextual information in a lightweight manner. At a comparable parameter scale, its performance surpasses that of traditional Transformer models [42].
As shown in Figure 1, the overall architecture of MSCA consists of three key components: depth-wise convolution, multi-branch depth-wise strip convolution, and 1 × 1 convolution. Specifically, depth-wise convolution is responsible for aggregating local information and extracting fine-grained features, laying the foundation for subsequent multi-scale modeling. Multi-branch depth-wise strip convolution captures multi-scale contextual information through multiple branches, where each branch employs convolution kernels of different sizes to simulate varying receptive field ranges, approximating a standard large-kernel depth-wise convolution. This significantly reduces computational complexity while preserving multi-scale features. The 1 × 1 convolution models the relationships between different channels, generating attention weights used to reweight the input features, thereby introducing a dynamic attention mechanism across channels.
The computational process of MSCA can be expressed as follows:

$$\mathrm{Att} = \mathrm{Conv}_{1\times 1}\!\left(\sum_{i=0}^{3} \mathrm{Scale}_i\big(\mathrm{DW\text{-}Conv}(F)\big)\right), \qquad \mathrm{Out} = \mathrm{Att} \otimes F,$$

where $F$ represents the input feature, ⊗ denotes element-wise multiplication, $\mathrm{DW\text{-}Conv}$ indicates depth-wise convolution, and $\mathrm{Scale}_i$ corresponds to the $i$-th branch in Figure 1, with $\mathrm{Scale}_0$ implementing the residual (identity) connection. Att contains the generated attention weights, and Out is the final output feature map.
The formula shows that the input features are first lightly spatially encoded using depth-wise convolution (DW-Conv). Load features of different time granularities are then extracted in parallel through the multi-scale convolution branches ($\mathrm{Scale}_i$): $\mathrm{Scale}_0$ focuses on short-term information and captures sudden spikes, $\mathrm{Scale}_1$–$\mathrm{Scale}_2$ focus on medium-term information and model weekly periodicity, and $\mathrm{Scale}_3$ focuses on long-term information and fits quarterly trends. The features of each scale are then summed and fused channel by channel. Finally, the channel attention weights are generated through 1 × 1 convolution to quantify the importance of the features at each scale, thereby realizing the attention mechanism.
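The sketch below gives one hedged PyTorch rendering of this module, adapted to 1-D load sequences; the class name `MSCA1d` and the kernel sizes are illustrative assumptions rather than the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class MSCA1d(nn.Module):
    """Multi-scale channel attention adapted to 1-D sequences (a sketch)."""
    def __init__(self, channels, kernel_sizes=(7, 11, 21)):
        super().__init__()
        # Depth-wise convolution: aggregates local information per channel
        self.dw = nn.Conv1d(channels, channels, 5, padding=2, groups=channels)
        # Multi-branch depth-wise convolutions with growing receptive fields
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
             for k in kernel_sizes]
        )
        # 1x1 convolution mixes channels and produces the attention weights
        self.pw = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                    # x: (batch, channels, time)
        u = x                                # keep the input for reweighting
        feat = self.dw(x)
        # Scale_0 is the identity branch; the others add larger scales
        feat = feat + sum(branch(feat) for branch in self.branches)
        att = self.pw(feat)                  # channel attention weights Att
        return att * u                       # Out = Att ⊗ F
```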
3.3. Sample Convolution and Interactive Learning
SCINet (Sample Convolution and Interaction Network) is a hierarchical downsampling–convolution–interaction architecture specifically designed for complex time-series modeling and prediction tasks. It can extract time-series features at multiple temporal resolutions and achieve deep modeling and prediction of dynamic characteristics [43]. It adopts an encoder–decoder structure, where the encoder captures dynamic feature dependencies at multiple temporal resolutions through a hierarchical convolutional network. By combining a rich set of convolutional filters, it effectively extracts both local and global features of the time series while mitigating information loss during downsampling through an interaction learning mechanism. The decoder then reorganizes and aligns the multi-level extracted features to generate the final prediction results.
The core module of SCINet is the SCI-Block. Each SCI-Block first downsamples the input data or features, decomposing them into two subsequences ($F_{\mathrm{even}}$ and $F_{\mathrm{odd}}$). Each subsequence is subsequently processed through independent convolutional modules ($\phi$, $\psi$, $\rho$, and $\eta$) for feature extraction, each capturing different temporal dependency characteristics. To further enhance feature representation capability and mitigate potential information loss during downsampling, each SCI-Block incorporates an interactive learning mechanism after subsequence feature extraction. This mechanism shares and integrates information between the two subsequences, capturing their latent correlations.
The SCI-Block in Figure 2 is defined by the following formulas:

$$F_{\mathrm{odd}}^{s} = F_{\mathrm{odd}} \odot \exp\big(\phi(F_{\mathrm{even}})\big), \qquad F_{\mathrm{even}}^{s} = F_{\mathrm{even}} \odot \exp\big(\psi(F_{\mathrm{odd}})\big),$$

$$F'_{\mathrm{odd}} = F_{\mathrm{odd}}^{s} + \rho\big(F_{\mathrm{even}}^{s}\big), \qquad F'_{\mathrm{even}} = F_{\mathrm{even}}^{s} - \eta\big(F_{\mathrm{odd}}^{s}\big),$$

where ⊙ denotes element-wise multiplication.
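A minimal PyTorch sketch of one SCI-Block follows, assuming an even-length input and simple two-layer convolutions for $\phi$, $\psi$, $\rho$, and $\eta$; the internal structure of these modules is an assumption here, not the exact design from the SCINet paper.

```python
import torch
import torch.nn as nn

def _conv_module(channels, hidden=32, k=3):
    # One of the phi/psi/rho/eta modules: a small 1-D conv stack (assumed structure)
    return nn.Sequential(
        nn.Conv1d(channels, hidden, k, padding=k // 2),
        nn.LeakyReLU(0.01),
        nn.Conv1d(hidden, channels, k, padding=k // 2),
        nn.Tanh(),
    )

class SCIBlock(nn.Module):
    """Minimal SCI-Block sketch: parity split, interactive scaling, feature exchange."""
    def __init__(self, channels):
        super().__init__()
        self.phi, self.psi = _conv_module(channels), _conv_module(channels)
        self.rho, self.eta = _conv_module(channels), _conv_module(channels)

    def forward(self, x):                          # x: (batch, channels, time), time even
        f_even, f_odd = x[..., ::2], x[..., 1::2]      # downsample into two subsequences
        f_odd_s = f_odd * torch.exp(self.phi(f_even))  # interactive scaling
        f_even_s = f_even * torch.exp(self.psi(f_odd))
        f_odd_out = f_odd_s + self.rho(f_even_s)       # feature exchange
        f_even_out = f_even_s - self.eta(f_odd_s)
        return f_even_out, f_odd_out
```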
SCINet adopts a hierarchical design similar to a binary tree, stacking multiple SCI-Blocks layer by layer to form a progressive process of decomposition and reconstruction. In each layer, the input sequence is decomposed into smaller subsequences, and the temporal resolution of the feature representation is progressively enhanced. The low-resolution components are realigned and concatenated into new sequence representations. These newly generated feature representations are added to the original time series, and the residual connection mechanism is utilized to further improve prediction performance. Finally, the final output of SCINet is generated through a fully connected layer.
Figure 3 shows the structure diagram of SCINet.
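As one way to read this binary-tree design, the recursive sketch below reuses the `SCIBlock` above; sharing one block per level across tree nodes is a simplification assumed here for brevity, and the input length must be divisible by $2^{\text{levels}}$.

```python
import torch

def sci_tree(x, blocks, level=0):
    """Recursive binary-tree sketch: split/interact per level, then realign."""
    if level == len(blocks):
        return x
    f_even, f_odd = blocks[level](x)              # one SCI-Block at this level
    f_even = sci_tree(f_even, blocks, level + 1)  # recurse into subsequences
    f_odd = sci_tree(f_odd, blocks, level + 1)
    # Realign: interleave even/odd positions back into a single sequence
    return torch.stack((f_even, f_odd), dim=-1).flatten(-2)

# Residual connection: the prediction head sees x + sci_tree(x, blocks),
# followed by a fully connected layer, as described above.
```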
To further enhance the model’s ability to model complex time series, SCINet also supports the stacking of multiple SCINet components, forming a stacked SCINet with intermediate supervision. Through this hierarchical architecture, the model can capture multi-scale dynamic characteristics and complex dependencies in time series within deeper networks. However, the corresponding model structure becomes more complex.
The architecture of SCINet shows that in time-series tasks such as power load forecasting, which exhibit strong non-stationarity and interwoven multi-scale features, its hierarchical convolutional decomposition can effectively address multi-scale dependency, noise interference, and burst-mode capture compared with traditional recurrent networks or standard Transformer variants, and its modular design provides room for technical iteration on complex time-series tasks (such as the LMD and MSCA modules integrated in this article). SCINet adopts a binary-tree, layer-by-layer downsampling mechanism to model the multimodal coupling of high-frequency and low-frequency fluctuations in power load data, whereas LSTM and RNNs rely on gating mechanisms to model temporal dependency implicitly, making it difficult to explicitly separate multi-scale components, and they are susceptible to high-frequency noise (such as load mutation events). Although the Informer model reduces computational complexity to $O(N \log N)$ through ProbSparse self-attention, its global attention mechanism has difficulty focusing on local key features (such as peak time periods) and lacks structured decomposition capabilities. In contrast, SCINet builds a pyramid structure through hierarchical downsampling convolution and skip connections and models features of different time granularities at different levels. Its computational complexity is $O(N \log N)$, significantly lower than the Transformer's $O(N^2)$. Compared with the recursive sequential structure of LSTM and RNNs, its convolution operations support parallel computing, showing clear advantages in the efficiency and accuracy of time-series modeling.
3.4. The Prediction Model Based on LMD-MSCA-SCINet
To improve the accuracy and efficiency of power load forecasting, this paper proposes a prediction model based on SCINet optimized with LMD and MSCA. This model integrates the advantages of local feature decomposition, adaptive attention mechanisms, and multi-resolution time-series modeling, demonstrating outstanding performance in complex time-series forecasting tasks.
3.4.1. Optimization via Local Mean Decomposition
Electricity data usually have significant nonlinear and non-stationary characteristics and are affected by various factors, such as seasonal fluctuations, trend changes, and random disturbances. Directly modeling the original electricity series may make it difficult to accurately capture these characteristics, affecting the prediction accuracy. LMD gradually extracts the local characteristics of the signal by constructing the local mean and envelope function so that each decomposed component has a clear meaning in both the time domain and the frequency domain. The decomposed PF component can describe the instantaneous amplitude and instantaneous frequency of the signal, forming a comprehensive expression of the global and local characteristics of the signal. Based on the characteristics of the signal itself, LMD can adaptively decompose, effectively avoiding the modal aliasing phenomenon that may occur in traditional methods, and is good at capturing local characteristics such as mutation points and short-term frequency changes in the signal.
Through LMD, the original signal $x(t)$ is decomposed into the sum of $k$ PF components and one residual component:

$$x(t) = \sum_{i=1}^{k} \mathrm{PF}_i(t) + u_k(t),$$

where $\mathrm{PF}_i(t)$ represents the $i$-th-order PF component and $u_k(t)$ denotes the final residual component. Each PF component can be further expressed as the product of an envelope signal $a_i(t)$ and a pure frequency-modulated signal $s_i(t)$:

$$\mathrm{PF}_i(t) = a_i(t)\, s_i(t).$$
Throughout the decomposition process, the envelope signal $a_i(t)$ represents the instantaneous amplitude of the PF component, while the frequency information of the pure frequency-modulated signal $s_i(t)$ directly reflects the instantaneous frequency of the component. The residual component $u_k(t)$ represents the low-frequency trend term remaining after the decomposition is completed.
The LMD method can fully extract the local features of the signal, and through the layer-by-layer peeling method, the original signal can be completely restored to several components with clear physical meanings. It has a good effect on the local feature extraction of nonlinear and non-stationary signals.
As can be seen from Figure 4, LMD successfully decomposes the signal features at different scales. The high-frequency components PF1 to PF3 mainly capture high-frequency noise, mutations, and short-period oscillations; among them, the overall trend of the PF3 component is relatively close to the original signal, while the PF1 and PF2 components capture strong fluctuations but with insufficient accuracy. The waveforms of the intermediate-frequency components (PF4 to PF5) gradually become smoother and change more slowly, showing periodic characteristics, with the PF5 component gradually approaching the global trend of the original signal. The low-frequency components (PF6 to PF7) mainly reflect the overall evolution of the signal, but high-frequency details are lost and fine fluctuations cannot be retained. All components contribute to trend modeling; starting from the PF3 component, trend capture is significantly enhanced and accuracy gradually improves. However, owing to the violent fluctuations of the original signal, the reconstruction of high-frequency transient characteristics by the PF1 and PF2 components is distorted to a certain extent, so some local details of sudden fluctuations in the original signal are lost during decomposition.
3.4.2. Introduction of Multi-Scale Channel Attention for Optimization
Because the components of the LMD output may contain redundant information, the MSCA module extracts the features of these components through a multi-scale convolution mechanism and dynamically allocates feature weights with an adaptive attention mechanism, thereby screening out key features and effectively reducing the interference of noise and redundant information. MSCA enhances the model's ability to capture contextual information at different resolutions by introducing a multi-scale convolution mechanism, optimizes the design of traditional convolution blocks, and uses the convolutional attention mechanism to encode spatial and channel information efficiently, avoiding the computational overhead of traditional large convolution kernels. Compared with the self-attention modeling of the Transformer architecture, MSCA provides a more lightweight and efficient solution.
In addition, MSCA captures local and global features in time series at different resolutions, forming a good complementary relationship with the hierarchical downsampling strategy of SCINet, alleviating the problem of information loss that may be introduced in the downsampling process of SCINet, thereby further enhancing the modeling ability of different time scales and channel information. Especially when dealing with non-stationary and high-noise time series, the multi-scale feature extraction advantage of SCINet itself, combined with the feature screening ability of the MSCA module, further enhances the prediction performance of the model. The combination of the two not only deepens the modeling of multi-scale dynamic characteristics of time series but also enhances the ability to focus on key features and robustness to noise, significantly improving the prediction model in terms of stability, accuracy, and efficiency.
3.4.3. Ensemble Framework
First, the input historical power consumption sequence is preprocessed and then enters the LMD module. The LMD module decomposes the complex original signal into several physically meaningful product functions (PF components) and a residual component. This decomposition effectively extracts the local dynamic characteristics of the time series. However, the components output by LMD may contain redundant information, so to further refine the features, the decomposed PF components and residual component are sent to the MSCA module.
The MSCA module extracts the features of different components through a multi-scale convolution mechanism, captures contextual information at different scales, and enhances the modeling ability of the model for complex dependencies in the time series.
Subsequently, the optimized features are input into the SCINet module for multi-level processing. The SCINet module extracts features from the time series at multiple time scales: it decomposes the sequence into two subsequences through a hierarchical downsampling strategy and extracts their respective local features through independent convolution operations. To avoid the information loss caused by processing subsequence features in isolation, SCINet introduces an inter-subsequence interactive convolution mechanism to effectively integrate the features of the two subsequences.
Finally, after completing the encoding operations at all levels, the decoder restores and aligns the features layer by layer, integrates the information of all levels in the reorganization layer, and generates the final prediction result with both global dynamic patterns and local feature details.
A diagram of the LMD-MSCA-SCINet prediction model is shown in Figure 5.
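To summarize the data flow, the following end-to-end sketch wires together the `lmd`, `MSCA1d`, and `SCIBlock` sketches from earlier in this section; the sequence length, forecast horizon, and linear decoder head are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
import torch

def forecast(history, msca, sci_block, head):
    """One hedged forward pass: decompose -> reweight -> interact -> predict."""
    pfs, resid = lmd(history)                      # LMD: PF components + residual
    comps = np.stack(pfs + [resid])                # (channels, time)
    x = torch.tensor(comps, dtype=torch.float32).unsqueeze(0)
    x = msca(x)                                    # MSCA: attention-weighted components
    f_even, f_odd = sci_block(x)                   # SCI-Block: split + interaction
    feats = torch.cat([f_even, f_odd], dim=-1)     # realign the two subsequences
    return head(feats.flatten(1))                  # linear decoder -> forecast

# Example wiring (dimensions are illustrative)
T, horizon = 64, 8
history = np.sin(np.linspace(0, 12 * np.pi, T)) + 0.1 * np.random.randn(T)
pfs, resid = lmd(history)
channels = len(pfs) + 1
msca, block = MSCA1d(channels), SCIBlock(channels)
head = torch.nn.Linear(channels * T, horizon)
print(forecast(history, msca, block, head).shape)  # torch.Size([1, 8])
```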
4. Experimental Results and Discussions
4.1. Dataset and Evaluation Indicators
The dataset used in this paper comes from a company in Zhejiang Province, covering electricity usage from 2 January 2021 to 5 November 2023. The original data contain electricity consumption indicators for three daily tariff periods (sharp peak, peak, and trough), as well as non-time-series fields such as user name, address information, and electricity type. Since static attributes have a low correlation with the load forecasting target, Python's Pandas library was used to perform structured cleaning of the data, removing redundant fields such as "user name" and "address" and constructing a time-series data matrix indexed by date with the three periods of electricity consumption as the core. To enhance feature characterization, a targeted Python web crawler was used to obtain refined meteorological observations for Zhejiang Province over the same period from the National Meteorological Data Center, including key indicators such as daily maximum and minimum temperature, and an exact date-matching algorithm was used to align the cross-source data. In addition, the date sequence was automatically labeled using a statutory-holiday rule library in Python, generating two binary time-attribute features: "whether it is a weekend" and "whether it is a statutory holiday". Finally, the dataset was divided into a training set and a test set in an 8:2 ratio and stored in a MySQL database to support subsequent analysis and modeling in a standardized format.
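A hedged Pandas sketch of this preprocessing pipeline is shown below; the file names, column names, and the two example holiday dates are placeholders, since the actual schema is not given in the paper.

```python
import pandas as pd

# Placeholder file and column names; the real schema is not published.
load = pd.read_csv("load_raw.csv", parse_dates=["date"])
load = load.drop(columns=["user_name", "address", "electricity_type"])  # static fields
load = load.set_index("date").sort_index()           # date-indexed consumption matrix

weather = pd.read_csv("weather_zj.csv", parse_dates=["date"]).set_index("date")
df = load.join(weather[["t_max", "t_min"]], how="inner")  # exact-date alignment

# Binary calendar features
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)
holidays = pd.to_datetime(["2021-01-01", "2021-02-12"])   # e.g., from a holiday rule library
df["is_holiday"] = df.index.isin(holidays).astype(int)

# Chronological 8:2 train/test split
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```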
To comprehensively evaluate the performance of the prediction model, this paper selects the mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and coefficient of determination (R²) as evaluation indicators to measure the prediction error from different dimensions. These indicators jointly assess the absolute error, squared error, standardized error, and the model's ability to explain data variation. The specific calculation formulas are:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2},$$

where $y_i$ denotes the true value, $\hat{y}_i$ represents the predicted value, and $\bar{y}$ is the mean of the true values.
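For reference, a small NumPy implementation of these four indicators might look as follows (a sketch; the function name is our own).

```python
import numpy as np

def metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R^2 as defined above."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
```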
In addition, to represent the data intuitively, this paper normalizes the data and applies inverse normalization to the final prediction results to ensure that they have good interpretability and intuitiveness. The final prediction results are presented in kilowatt-hours.
The server used in this experiment runs Ubuntu 18.04.5 LTS, with eight NVIDIA GeForce RTX 2080 Ti GPUs and an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz. The Python version is 3.7.16, the PyTorch version is 1.13.1, and the CUDA version is 11.0.207.
4.2. Parameter Analysis and Settings
In time-series forecasting, the choice of input step size has an important impact on the model's prediction performance. To further study the effect of input step size on the prediction results, and considering the scale of the dataset and the characteristics of each model architecture, the experiment covered a variety of input step sizes (such as 1, 2, 4, 8, 12, 16, 24, 32, 40, and their subsets) and used MAE, MSE, and other indicators as performance evaluation criteria. The experiment was conducted on four models—LSTM, RNN, Informer, and SCINet—to explore the impact of different input step sizes on the prediction performance of each model. Common hyperparameters were set as follows: a batch size of 32, 50 training epochs, and an initial learning rate of 0.005. Other specific hyperparameters are shown in Table 1.
The results in Figure 6 show that the impact of input step size on model performance is significantly model-dependent, with different architectures responding in markedly different ways. Specifically, the overall fluctuation of LSTM is small: its minimum MAE occurs at an input step size of 1, the fluctuation is relatively gentle in the step-size range of 1 to 8, and the fluctuation amplitude then gradually increases, showing an overall trend of first rising, then falling. The behavior of RNN is similar: its minimum MAE also occurs at a step size of 1, it fluctuates little up to a step size of 12, and its MAE then shows an overall upward trend. SCINet is constrained by its hierarchical decomposition structure, so the input step size must be an integer multiple of 8; its error follows a U-shaped curve that first decreases, then increases, with optimal performance at a step size of 32. In contrast, due to the computational characteristics of its self-attention mechanism, the Informer model requires the input step size to start from 4. Its MAE fluctuates most violently and shows no obvious regularity; it reaches its lowest value at a step size of 40, but with multiple ups and downs in between.
In all models, the trends of MAE and MSE are basically consistent, while the R² indicator shows higher uncertainty and sometimes disagrees with the other error indicators. Based on the above results, different models have significantly different requirements for input step size. To unify the evaluation criteria, in subsequent experiments the optimal input step size is selected for each model based on the MAE indicator.
4.3. Analysis of Experimental Results
In this experiment, we used MAE, MSE, RMSE, and R² as evaluation metrics to analyze the prediction performance of each model based on its optimal input step size and pre-set parameters.
4.3.1. Comparison Experiments
Table 2 shows the performance of LSTM, RNN, Informer, and LMD-MSCA-SCINet on the prediction task. The results show that the traditional recurrent neural network models (LSTM and RNN) performed the worst, with MAE close to 0.9, MSE close to 1.0, and all R² values negative, indicating that their prediction performance is even weaker than a baseline that directly uses the data mean; this highlights the limitations of traditional recurrent structures in dealing with complex time-series features. In contrast, Informer improved performance through its self-attention mechanism, with much lower MAE and MSE values and a positive R². It is better than the traditional models in error control and fitting ability, but there is still a significant gap relative to LMD-MSCA-SCINet. LMD-MSCA-SCINet performs best, with significantly lower MAE and MSE and a significantly higher R² than the other models. Its MAE and MSE are reduced by 63.4% and 72.1%, respectively, compared with Informer, and its R² is increased by 47.2%; compared with the negative R² values of LSTM and RNN, its advantages are even more pronounced. It is also superior to VMD-CNN-LSTM in all respects. Overall, LMD-MSCA-SCINet significantly improves the prediction accuracy of complex time-series tasks through its multi-scale modeling advantages and outperforms the other models in both error control and fitting capability.
4.3.2. Ablation Experiment
Further analyzing the impact of each module on SCINet, Table 3 shows that the LMD and MSCA modules both play an important role in improving SCINet's performance, but introducing either one alone still has limitations. The model that introduces the LMD module alone (LMD-SCINet) improves R² by 6.6% compared with the baseline model (SCINet), but its MAE and MSE deteriorate by 55.4% and 1.8%, respectively. This contradiction shows that although LMD can enhance global trend modeling through multi-scale signal separation, its aggressive decomposition of the original sequence may destroy the continuity of local temporal features, reducing single-step prediction accuracy.
In contrast, the performance of the model that introduces the MSCA module alone (MSCA-SCINet) changes little: its MAE and MSE are basically the same as those of the baseline model, but its R² decreases by 4.3%. This shows that, without explicit decoupling of the signal's original time scales, the dynamic allocation mechanism of MSCA's attention weights is prone to a scale-confusion trap. Lacking the hierarchical structure provided by a decomposition algorithm, the attention mechanism has difficulty autonomously constructing optimal cross-scale feature correlation paths without physical guidance; its weight updates preferentially strengthen pseudo-correlations between local extreme points rather than capturing the signal's intrinsic cross-level dependencies. This scale-entanglement effect stems from the interwoven interference of different frequency-band features in the time-frequency domain, which causes the attention matrix to converge to a suboptimal equilibrium during backpropagation: the weight distributions of high-frequency noise and effective transient features overlap heavily, while the low-frequency trend component is phase-shifted by residual periodic fluctuations. Ultimately, the semantic consistency of multi-scale interaction breaks down, short-term mutation patterns are submerged by high-frequency noise, and long-term trend modeling is disturbed by non-correlated periods; the two form competitive inhibition in the undecomposed feature space, weakening the model's overall ability to explain the dynamic evolution of power load.
However, the combination of the two, LMD-MSCA-SCINet, achieved the best results on all indicators. Its MAE and MSE are 37.3% and 63.4% lower than those of the baseline model, respectively, and its R² is increased by 30.9%, which is 18.5% better than the R² of LMD-SCINet. This shows that when LMD and MSCA are introduced jointly, they complement each other: local mean decomposition extracts refined features of the time-series signal, and the multi-scale attention mechanism dynamically filters key information, thereby achieving improvements in both error control and model interpretability.
4.3.3. Analysis of Prediction Results
The prediction results in Figure 7 show significant differences in how well the models adapt to complex power load fluctuations. The real power consumption curve fluctuates violently in every interval, with frequent instantaneous extreme values, posing a dual challenge for time-series modeling: capturing long-term periodic patterns while responding quickly to sudden load fluctuations. The prediction trajectories of the traditional recurrent neural networks, LSTM and RNN, show obvious lag and over-smoothing. Despite repeated parameter tuning and experiments, they still fail to adapt to the dynamic changes in the data; their predictions fluctuate little but deviate from the true values overall and essentially cannot track the changes in the true values, indicating insufficient capacity to model the complex temporal patterns and sudden extremes of this dataset. In contrast, the Informer model, based on self-attention, focuses on key time steps through its sparse attention mechanism and tracks the trend in most regions, but it still produces instantaneous deviations at many load peaks, exposing the insensitivity of global attention to local emergencies.
SCINet achieves smoother tracking through its layered convolutional network, but in intervals where the load fluctuates violently, its architecture struggles to account for both long-term and short-term characteristics, so cumulative deviations gradually expand. LMD-SCINet, which further introduces local mean decomposition, adaptively strips out the instantaneous amplitude-frequency components of the signal; its prediction trajectory is closer to the fluctuation trend of the true values than the baseline SCINet in many intervals, but its global stability is still limited by single-path feature extraction. The trend of MSCA-SCINet, which integrates multi-scale channel attention, is generally very close to that of SCINet; its dynamic weight allocation mechanism can adaptively enhance feature interaction at key scales, improving the response to sudden peaks (such as samples 120–140), but it is generally inferior to LMD-SCINet. LMD-MSCA-SCINet, which integrates both LMD and MSCA, significantly improves feature extraction refinement and dynamic weight allocation through their collaborative optimization. It captures both the trend and the fluctuations of the true values and keeps its prediction trajectory close to the true curve throughout the entire sequence. It performs best overall, and its error remains stable within a low range, with only a few sample points showing controllable deviations.
5. Conclusions
Aiming at the problem of complex power load forecasting, this paper proposes a SCINet forecasting model optimized with local mean decomposition and multi-scale channel attention. The non-stationarity, multi-scale characteristics, and complex temporal dependencies of power load sequences pose severe challenges for forecasting accuracy, and traditional methods often find it difficult to model long-term trends and short-term fluctuations simultaneously. To this end, this study constructed a "decomposition–feature enhancement–time-series modeling" collaborative framework. First, the LMD module uses adaptive signal decomposition to decouple the original load sequence into a trend term characterizing the long-term evolution and multi-scale components reflecting random fluctuations, effectively reducing the modeling complexity of non-stationary sequences. Second, the MSCA module uses multi-scale convolution to extract local features at different time granularities and dynamically allocates weights through the channel attention mechanism, focusing on key features and suppressing redundant information, thereby overcoming the bottleneck of traditional methods' insufficient multi-scale feature fusion. Finally, the hierarchical interactive structure of SCINet explicitly models the coupling between long-term and short-term dependencies in the load sequence through downsampling and cross-scale feature interaction, breaking through the limitations of traditional recurrent neural networks in expressing complex temporal patterns.
Experimental results show that, compared with mainstream models such as LSTM, RNN, and Informer, the proposed model achieves higher prediction accuracy in short-term power load forecasting tasks. In addition, the effectiveness of the LMD and MSCA modules was verified through ablation experiments, confirming the important role of both in feature extraction and model optimization. Our empirical results not only verify the effectiveness of the model but also provide valuable practical insights for the field of power load forecasting and a reliable technical reference for the safe operation of power systems and energy-optimized scheduling.
Future research will further verify the generalization ability of the model in multiple industrial types and power systems in different regions and explore its application potential in real-time data stream scenarios to achieve a low-latency dynamic prediction system. As the scope of application expands, it is necessary to establish an ethical assessment system covering data privacy protection and decision bias monitoring and adopt a microservice architecture deployment to achieve data isolation and hierarchical authority management to ensure data security and system stability.