Article

A Multi-Channel Δ-BiLSTM Framework for Short-Term Bus Load Forecasting Based on VMD and LOWESS

1 School of Electrical and Information Engineering, Changsha University of Science & Technology, Changsha 410114, China
2 State Key Laboratory of Disaster Prevention & Reduction for Power Grid, Changsha University of Science & Technology, Changsha 410114, China
3 Hubei Engineering and Technology Research Center for AC/DC Intelligent Distribution Network, School of Electrical Engineering and Automation, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4772; https://doi.org/10.3390/electronics14234772
Submission received: 14 November 2025 / Revised: 1 December 2025 / Accepted: 2 December 2025 / Published: 4 December 2025

Abstract

Short-term bus load forecasting in distribution networks faces severe challenges of non-stationarity, high-frequency disturbances, and multi-scale coupling arising from renewable integration and emerging loads such as centralized EV charging. Conventional statistical and deep learning approaches often exhibit instability under abrupt fluctuations, whereas decomposition-based frameworks risk redundancy and information leakage. This study develops a hybrid forecasting framework that integrates variational mode decomposition (VMD), locally weighted scatterplot smoothing (LOWESS), and a multi-channel differential bidirectional long short-term memory network (Δ-BiLSTM). VMD decomposes the bus load sequence into intrinsic mode functions (IMFs), residuals are adaptively smoothed using LOWESS, and effective channels are selected through correlation-based redundancy control. The Δ-target learning strategy enhances the modeling of ramping dynamics and abrupt transitions, while Bayesian optimization and time-sequenced validation ensure reproducibility and stable training. Case studies on coastal-grid bus load data demonstrate substantial improvements in accuracy. In single-step forecasting, RMSE is reduced by 65.5% relative to ARIMA, and R² remains above 0.98 for horizons h = 1–3, with slower error growth than LSTM, RNN, and SVM. Segment-wise analysis further shows that, for h = 1, the RMSE on the fluctuation, stable, and peak segments is reduced by 69.4%, 62.5%, and 62.4%, respectively, compared with ARIMA. The proposed Δ-BiLSTM exhibits compact error distributions and narrow interquartile ranges, confirming its robustness under peak-load and highly volatile conditions. Overall, the VMD–LOWESS–Δ-BiLSTM framework achieves superior accuracy, calibration, and robustness in complex, noisy, and non-stationary environments. Its interpretable structure and reproducible training protocol make it a reliable and practical solution for short-term bus load forecasting in modern distribution networks.

1. Introduction

In modern power systems, the increasing penetration of distributed energy sources and the ongoing electrification of end-use sectors, coupled with the rapid growth of new loads such as centralized charging and demand response, have led to the emergence of non-stationary, highly disturbed, and multi-scale coupled characteristics in bus loads within distribution networks. The frequency of peak mutations and “spike” noise has risen significantly, thereby increasing uncertainty and engineering constraints in short-term forecasting [1,2,3]. Short-term bus load forecasting not only provides boundary conditions for power flow and reactive voltage but also serves as a critical input for reserve scheduling, peak shaving, and electricity price incentives, directly impacting operational safety and economic efficiency [4,5].
To improve stability in noisy and non-stationary scenarios, existing studies generally follow three approaches. The first is statistical or traditional machine learning, which relies on structural priors and handcrafted features. These methods are simple and interpretable, but they struggle to adapt to holidays, weather variations, and ramp-up segments, especially under nonlinear dynamics or when features are incomplete [6,7]. The second is deep learning, which uses end-to-end representation learning to capture temporal dependencies with high accuracy. Several studies have explored deep learning models for load forecasting, such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, and their bidirectional variants (BiLSTM), which have demonstrated strong performance in handling temporal dependencies and predicting future loads. Methods such as Adaptive Fusion Domain-Cycling Variational Generative Adversarial Networks (VGAN) and Adaptive Variational Autoencoding Generative Adversarial Networks (VAE-GAN) have also been investigated, showcasing their potential for improving model robustness and generalization in forecasting tasks. Furthermore, multi-modal techniques, such as those applied in arc detection for railway systems, have shown promise in enhancing forecasting accuracy by incorporating multiple data sources. These approaches highlight the growing interest in, and success of, deep learning models in forecasting applications, particularly in scenarios with complex and dynamic load behaviors [8,9,10,11]. The third is the "decomposition–learning" paradigm, based on methods such as empirical mode decomposition (EMD), complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), variational mode decomposition (VMD), wavelet transform (WT), and seasonal-trend decomposition based on loess (STL), which first separates multi-scale components or performs denoising, and then processes each component individually before combining the results. This reduces component interference and balances high and low frequencies [12,13,14]. However, wavelets suffer from subjectivity in selecting frequency bands and decomposition levels, while EMD-based methods are prone to mode mixing and endpoint effects. If all components are modeled and combined indiscriminately, noise and redundancy may be introduced, reducing robustness [15,16]. Engineering practice also shows that errors often accumulate in regions with compounded seasonal patterns and short-term pulses, as well as in ramp-up segments during electricity usage peaks, extreme weather, and concentrated charging access [3]. Furthermore, improper prior settings in over-parameterized models can lead to sensitivity to initialization and search range, affecting reproducibility and generalization [17].
Short-term bus load forecasting primarily faces three challenges:
(1) Multi-scale/multi-phase coupling. Single-channel end-to-end models struggle to capture trends, seasonality, and pulses simultaneously, often trading one off against the others. Indiscriminate full modeling after decomposition, in turn, tends to dilute effective signals and increases the risk of overfitting [18,19].
(2) A discrepancy between forecast targets and actual scheduling requirements. Minimizing point-value prediction errors alone fails to adequately constrain rate-of-change or ramp behavior, so errors concentrate during peak and abrupt-change periods [20].
(3) Lack of robustness and reproducibility in the processing pipeline. Decomposing before splitting, or normalizing across time segments, often leads to information leakage [21]. Hybrid frameworks have many degrees of freedom, and results are sensitive to random seeds and training budgets [22,23].
To address these issues, this paper proposes a multi-channel Δ-BiLSTM (differential bidirectional long short-term memory network) framework, beginning with mode extraction using VMD. Residuals are then smoothed and denoised using LOWESS (locally weighted scatterplot smoothing). Channel selection is performed based on forward correlations beyond the forecast origin. Δ-targets are employed to model change rates and phase mismatches via a BiLSTM. Robust training and reproducible evaluation are ensured through lightweight Bayesian optimization and time-sequenced validation. Compared with existing VMD/EMD-based hybrid models and single-channel deep networks, VMD–LOWESS–Δ-BiLSTM applies VMD–LOWESS only within a leakage-free training pipeline and couples correlation-screened multi-channel inputs with a Δ-target BiLSTM under a unified training budget, jointly tackling noise, redundancy, and ramp dynamics.
An overview of the VMD-LOWESS-Δ-BiLSTM forecasting workflow is shown in Figure 1. The innovations of the proposed framework are as follows:
Multi-scale robust input. The method applies “VMD + LOWESS” processing to the training segment to suppress noise and anomalous fluctuations, thereby mitigating interference and preventing data leakage. This approach enables multi-scale input that captures both trends and transient disturbances.
Scheduling-consistent Δ-target learning. By integrating Δ-targets with a BiLSTM network, this approach enhances the model’s ability to capture rate-of-change patterns and ramp transitions, leading to significantly improved prediction accuracy during peak and abrupt-change periods.
Robust training and reproducible evaluation. The framework enhances training stability and ensures result reproducibility through foresight-based channel selection, lightweight Bayesian optimization, and time-sequenced validation.
The remaining structure of this paper is organized as follows: Section 2 outlines the methodological foundations, including VMD, LOWESS residual smoothing, Δ-target construction, LSTM/BiLSTM networks, multi-channel construction, correlation-based channel selection, and Bayesian optimization integrated with robust training strategies. Section 3 presents the integrated framework, which combines the aforementioned modules into a multi-channel Δ-BiLSTM architecture. It further elaborates on the end-to-end process and describes the leakage-prevention training/validation protocols. Section 4 covers experiments, including data preprocessing, parameter configurations, and evaluation metrics. It compares the proposed approach against baseline models, assessing performance during peak and abrupt-change periods as well as computational complexity. Section 5 gives the conclusions.

2. Methodological Foundations

2.1. Variational Mode Decomposition

Variational mode decomposition (VMD) is a widely used signal processing method for noise reduction in power systems. It iteratively solves for mode functions at different time scales and their corresponding central frequencies, yielding several intrinsic mode functions (IMFs). Selective reconstruction of these IMFs then achieves the noise reduction. The VMD-based denoising of bus load can be divided into two parts: construction and decomposition [24].
For the noise and spikes present in the original bus load sequence, VMD is used to stabilize the noisy load sequence, decomposing it into multiple stationary subsequences and an unstable residual. To prevent optimistic evaluation results caused by information leakage, a leakage-prevention process is applied [25]. Finally, reconstruction with the IMFs results in a multi-channel input that preserves both scale information and robustness.
All numerical experiments and model training in this study were implemented using MATLAB R2023b and Python 3.11.8.

2.1.1. Sub-Sequence Analytic Signal and Frequency Mixing

The rules for VMD are as follows:
(1) Let the noisy bus load sequence be $X(t)$, where $t \in [1, T]$ and $T$ is the length of the noisy bus load sequence.
(2) The noisy bus load sequence $X(t)$ is decomposed by VMD into subsequences $u_k(t)$, where $k \in [1, K]$ and $K$ represents the number of subsequences after decomposition.
(3) Each subsequence $u_k(t)$ is converted into an analytic signal via the Hilbert transform:

$$\left[ \delta(t) + \frac{j}{\pi t} \right] * u_k(t)$$

where $\delta(t)$ is the Dirac delta function, $*$ denotes the convolution operator, and $u_k(t)$ is the subsequence of the bus load sequence after applying VMD.
(4) The analytic signal is demodulated to shift the frequency component of each mode to the baseband:

$$\left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j \omega_k t}$$

where $\omega_k$ is the central frequency of the $k$-th mode.

2.1.2. Sub-Sequence Baseband Minimum Bandwidth Model

After converting the analytic signals of the bus load subsequences to the baseband, the objective is to minimize the sum of the estimated mode bandwidths. A constrained decomposition problem is established, and the augmented Lagrangian is introduced for iterative solution:

$$\mathcal{L}\left(\{u_k\}, \{\omega_k\}, \lambda\right) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j \omega_k t} \right\|_2^2 + \left\| X(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, X(t) - \sum_{k=1}^{K} u_k(t) \right\rangle$$

where $\lambda(t)$ is the Lagrangian multiplier and $\alpha$ is the bandwidth penalty parameter.
Initialization is performed for $\hat{u}_k^1$, $\omega_k^1$, and $\hat{\lambda}^1$. For $\omega \geq 0$, $\hat{u}_k$, $\omega_k$, and $\hat{\lambda}^n$ are updated iteratively:

$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i^{n+1}(\omega) + \hat{\lambda}^n(\omega)/2}{1 + 2\alpha \left( \omega - \omega_k^n \right)^2}$$

$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \left| \hat{u}_k^{n+1}(\omega) \right|^2 \mathrm{d}\omega}{\int_0^{\infty} \left| \hat{u}_k^{n+1}(\omega) \right|^2 \mathrm{d}\omega}$$

$$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^n(\omega) + \tau \left( \hat{f}(\omega) - \sum_{k} \hat{u}_k^{n+1}(\omega) \right)$$

The iteration stops when $\sum_k \| \hat{u}_k^{n+1} - \hat{u}_k^n \|_2^2 / \| \hat{u}_k^n \|_2^2 < \varepsilon$ is satisfied, where $\varepsilon$ is the preset error tolerance, $\tau$ is the ascent step of the multiplier, and $n$ is the iteration index.

2.1.3. Sub-Sequence Smoothing Optimization

The bus load sequence X(t) after VMD is expressed as follows:
$$X(t) = \sum_{i=1}^{K} \mathrm{IMF}_i(t) + r(t)$$

where $\mathrm{IMF}_i(t)$ represents the decomposed intrinsic mode functions, and $r(t)$ is the residual.
The residual of the bus load sequence is smoothed using the LOWESS method, which assigns weights to proximal data points within a local section to ensure a smooth fit. The local regression equation for the residual sequence $r(t)$ of the bus load sequence after VMD is as follows:

$$r(t) = a x_t + b, \quad t \in [1, T]$$

where $x_t$ is the independent variable of the sequence, and $a$ and $b$ are the local regression coefficients.
The regression coefficients of this smoothed sequence are solved by defining the cost function as follows:

$$J(a, b) = \frac{1}{N} \sum_{i=1}^{N} \omega_i \left( y_i - a x_i - b \right)^2$$

where $N$ is the number of neighboring points used in the smoothing operation, and $\omega_i$ is the cubic weight of the point $x_i$ relative to its neighboring points within the operation range.
The cubic weighting function is expressed as follows:

$$\omega_i(x) = \left( 1 - \left| \frac{x - x_i}{d_i} \right|^3 \right)^3$$

where $x$ is the current center point, $x_i$ is a neighboring point within the bandwidth, and $d_i$ is the distance from $x_i$ to the current point along the horizontal axis.
The cubic weights obtained above are used for locally weighted linear regression. The parameters $a$ and $b$ are obtained by setting the partial derivatives of the cost function $J(a, b)$ with respect to $a$ and $b$ to zero:

$$\frac{\partial J(a, b)}{\partial a} = -\frac{2}{N} \sum_{i=1}^{N} \omega_i x_i \left( y_i - a x_i - b \right) = 0, \qquad \frac{\partial J(a, b)}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \omega_i \left( y_i - a x_i - b \right) = 0$$

Through Equation (11), the regression equation gives the predicted value $\hat{y}_i$ corresponding to $x_i$. After moving to the next point $x_{i+1}$, the step is repeated to obtain the predicted value $\hat{y}_{i+1}$.
The smoothed residual $\tilde{r}(t)$ is then reconstructed together with the decomposed subsequences $\mathrm{IMF}_i(t)$ to obtain the denoised bus load sequence $\mathrm{Load}(t)$:

$$\mathrm{Load}(t) = \sum_{i=1}^{K} \mathrm{IMF}_i(t) + \tilde{r}(t)$$
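To make this decomposition–smoothing–reconstruction chain concrete, the following Python sketch applies VMD, extracts the residual, smooths it with LOWESS, and reconstructs the denoised series. This is a minimal illustration under stated assumptions: the vmdpy and statsmodels libraries stand in for the authors' unspecified implementation, and all parameter values are placeholders rather than the tuned settings of the paper.

```python
# Minimal sketch of "VMD + residual LOWESS" denoising (assumed libraries: vmdpy, statsmodels).
import numpy as np
from vmdpy import VMD                                    # pip install vmdpy
from statsmodels.nonparametric.smoothers_lowess import lowess

def vmd_lowess_denoise(x, K=6, alpha=2000.0, tau=0.0, tol=1e-7, span=9):
    """Decompose x into K IMFs, smooth the residual, and reconstruct Load(t)."""
    # vmdpy returns the modes u with shape (K, T'); T' may be trimmed to an even length
    u, _, _ = VMD(x, alpha, tau, K, DC=0, init=1, tol=tol)
    imfs = u
    x_aligned = x[: imfs.shape[1]]
    residual = x_aligned - imfs.sum(axis=0)              # r(t) = X(t) - sum_k IMF_k(t)
    t = np.arange(len(residual))
    frac = span / len(residual)                          # span in points -> LOWESS fraction
    residual_smooth = lowess(residual, t, frac=frac, return_sorted=False)
    denoised = imfs.sum(axis=0) + residual_smooth        # Load(t) = sum_i IMF_i(t) + r~(t)
    return imfs, residual_smooth, denoised
```

In a leakage-free setting, this routine would be called separately on each visible prefix (training, validation, test), as described in the next subsection.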

2.1.4. Multi-Channel Input and Leakage Prevention

Let the original sequence $\{X(t)\}_{t=1}^{T}$ be divided into training, validation, and testing segments in chronological order:

$$\mathcal{D}_{\mathrm{tr}} = \{1, \ldots, T_{\mathrm{tr}}\}, \quad \mathcal{D}_{\mathrm{va}} = \{T_{\mathrm{tr}}+1, \ldots, T_{\mathrm{va}}\}, \quad \mathcal{D}_{\mathrm{te}} = \{T_{\mathrm{va}}+1, \ldots, T\}$$

where $T_{\mathrm{tr}}$, $T_{\mathrm{va}}$, and $T$ denote the endpoints of the training, validation, and total sequence, respectively, and $\mathcal{D}_{\mathrm{tr}}$, $\mathcal{D}_{\mathrm{va}}$, and $\mathcal{D}_{\mathrm{te}}$ are the index sets of the three time segments.
At each stage $l \in \{\mathrm{tr}, \mathrm{va}, \mathrm{te}\}$, only the visible prefix $\{X(t)\}_{t \leq T_l}$ is used to independently perform VMD and residual LOWESS, yielding $\{\mathrm{IMF}_k^{(l)}(t)\}_{k=1}^{K}$ and $r^{(l)}(t)$, respectively. This then yields the following:

$$X_{V+L}^{(l)}(t) = \sum_{k=1}^{K} \mathrm{IMF}_k^{(l)}(t) + \tilde{r}^{(l)}(t), \quad t \leq T_l, \; l \in \{\mathrm{tr}, \mathrm{va}, \mathrm{te}\}$$

where $\mathrm{IMF}_k^{(l)}(t)$ denotes the $k$-th mode obtained by applying VMD to the historical prefix $[1, T_l]$ of the sequence $X(t)$, $K$ is the number of modes, $r^{(l)}(t)$ is the residual from the same decomposition, $\tilde{r}^{(l)}(t) = \mathrm{LOWESS}(r^{(l)}(t))$ is the residual smoothed by LOWESS, and $X_{V+L}^{(l)}(t)$ represents the sequence reconstructed and denoised by "VMD + LOWESS".
The K modes and one smoothed residual are combined into C = K + 1 channels. The channel vector is defined as follows:

$$z^{(l)}(t) = \left[ \mathrm{IMF}_1^{(l)}(t), \ldots, \mathrm{IMF}_K^{(l)}(t), \tilde{r}^{(l)}(t) \right]^{\mathsf{T}} \in \mathbb{R}^{C}, \quad C = K + 1$$
With window length $L$, the input tensor is constructed as follows:

$$X_t^{(l)} = \left[ z^{(l)}(t-L+1), \ldots, z^{(l)}(t) \right] \in \mathbb{R}^{C \times L}, \quad t = L, \ldots, T_l - H$$
The corresponding point-wise prediction target is defined as follows:
$$y_t^{(l)} = \left[ X(t+1), \ldots, X(t+H) \right]^{\mathsf{T}}$$

where $z^{(l)}(t)$ is the channel vector, $X_t^{(l)}$ is the input tensor of length $L$ ending at window position $t$, and $y_t^{(l)}$ is the point-wise observation target of length $H$.
The mean and standard deviation of each channel are calculated only from the training set samples $\{X_t^{(\mathrm{tr})}\}$ and then applied consistently across all stages for normalization:

$$\tilde{X}_t^{(l)}(c, \cdot) = \frac{X_t^{(l)}(c, \cdot) - \mu_c}{\sigma_c + \varepsilon_{\mathrm{num}}}, \quad c = 1, \ldots, C, \; l \in \{\mathrm{tr}, \mathrm{va}, \mathrm{te}\}$$

where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of channel $c$, respectively. These values are calculated exclusively from the training data $\{X_t^{(\mathrm{tr})}\}$ and then fixed for validation and testing. The term $\varepsilon_{\mathrm{num}} > 0$ is a numerical stabilizer, and $\tilde{X}_t^{(l)}$ denotes the normalized input.
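As an illustration of this leakage-free protocol, the following Python sketch builds the window tensors and fits the per-channel statistics of Equation (18) on training windows only, reusing them unchanged for validation and test. Function names and array shapes are illustrative assumptions, not the authors' code.

```python
# Sketch of multi-channel windowing and training-only normalization (Eq. (18)).
import numpy as np

def make_windows(channels, L, H):
    """channels: (C, T) array of IMFs + smoothed residual; returns (N, C, L) windows."""
    C, T = channels.shape
    return np.stack([channels[:, t - L + 1 : t + 1] for t in range(L - 1, T - H)])

def fit_channel_stats(X_tr, eps=1e-8):
    """mu_c and sigma_c per channel, estimated from the training tensor only."""
    mu = X_tr.mean(axis=(0, 2), keepdims=True)     # average over samples and time
    sigma = X_tr.std(axis=(0, 2), keepdims=True)
    return mu, sigma + eps                         # eps is the numerical stabilizer

def normalize(X, mu, sigma):
    """Apply the fixed training statistics to any stage (train/validation/test)."""
    return (X - mu) / sigma

# usage: mu, sd = fit_channel_stats(X_tr); X_va_norm = normalize(X_va, mu, sd)
```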

2.2. Recurrent Neural Networks

A bidirectional long short-term memory network (BiLSTM) is a variant of LSTM that learns contextual information in both the forward and backward directions, extracting sequence features more fully than a traditional unidirectional recurrent network. Combined with multi-channel inputs and Δ-target learning, the BiLSTM model is well suited to the highly non-stationary and noisy characteristics of bus load data, improving forecasting accuracy and stability.

2.2.1. LSTM Unit

Suppose there are several synchronous input sequences. The LSTM unit employs three gates: a forget gate, an input gate, and an output gate. The input vector $x_t$ carries the network's input values; the hidden state $h_t$ delivers the output to the next layer; the cell state $c_t$ stores the internal state of the LSTM unit for long-term memory.
The LSTM steps are as follows:
(1) Forget gate update. It determines how much of the past memory is retained:

$$f_t = \sigma \left( W_f x_t + U_f h_{t-1} + b_f \right)$$

(2) Input gate update. It regulates how much new information is written into the cell state:

$$i_t = \sigma \left( W_i x_t + U_i h_{t-1} + b_i \right)$$

$$\tilde{c}_t = \tanh \left( W_c x_t + U_c h_{t-1} + b_c \right)$$

(3) Cell state update. It combines the retained memory and the newly added candidate memory:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

(4) Output gate update. It determines how much of the memory is exposed at the current time step:

$$o_t = \sigma \left( W_o x_t + U_o h_{t-1} + b_o \right)$$

(5) Hidden state update. It computes the portion of memory exposed to the output at the current time step:

$$h_t = o_t \odot \tanh(c_t)$$

(6) Final prediction update. It generates the current predicted output:

$$\hat{y}_t = \sigma \left( V h_t + c \right)$$

where $c_t$ is the cell state, $h_t$ is the hidden state, $x_t$ is the input, and $W$, $U$, and $b$ are the input weight, recurrent weight, and bias, respectively. $\odot$ denotes element-wise multiplication, and $\sigma(\cdot)$ is the sigmoid gate activation, which maps its input into a probability value.
By dynamically balancing the operations of “retain-write-read” through the gating mechanism, the LSTM model can stably propagate information across relatively long time scales.
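The gate equations above can be traced in a few lines of numpy. The following is an illustrative sketch of a single LSTM step, not the trained network used in the experiments; the parameter dictionary P holds randomly initialized placeholder weights.

```python
# Sketch of one LSTM step, following the gate equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """P holds (W_*, U_*, b_*) for the forget/input/candidate/output gates."""
    f_t = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])      # forget gate
    i_t = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])      # input gate
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                             # cell state update
    o_t = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])      # output gate
    h_t = o_t * np.tanh(c_t)                                       # hidden state
    return h_t, c_t
```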

2.2.2. BiLSTM Architecture

Based on the above, within a fixed-length window, “backward” contextual information is introduced to enhance discrimination of local patterns without crossing the window boundary.
$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\left( x_{t-L+1:t} \right), \quad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\left( x_{t:t-L+1} \right)$$

where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ represent the forward and backward LSTM units applied within the window $[t-L+1, t]$, ensuring that no information outside the window is accessed.
The forward and backward hidden representations are then concatenated to form a more complete temporal representation of the window:

$$h_t^{(\mathrm{bi})} = \left[ \overrightarrow{h}_t ; \overleftarrow{h}_t \right] \in \mathbb{R}^{2H}$$

2.2.3. Δ-Target and Multi-Channel Sample Organization

To use the bidirectional representations produced by BiLSTM within a single window for regression prediction, the hidden states over the window are first aggregated into a fixed-length representation, yielding the vector at the input of the regression head:
$$g_t = \mathrm{Pool}\left( \{ h_\tau^{(\mathrm{bi})} \}_{\tau = t-L+1}^{t} \right) \in \mathbb{R}^{2H}$$

where $\{ h_\tau^{(\mathrm{bi})} \}$ are the bidirectional hidden states, and $\mathrm{Pool}(\cdot)$ by default returns the last hidden state $h_t^{(\mathrm{bi})}$. Alternatively, average pooling or other strategies can be applied within the window, provided that the window boundary is not exceeded.
The regression head then maps $g_t$ linearly to predict the future $h$-step increment:

$$\Delta \hat{y}_{t+h} = w^{\mathsf{T}} g_t + b$$

where $w \in \mathbb{R}^{2H}$ and $b \in \mathbb{R}$ are trainable parameters.
To enhance the aggregation representation, an attention mechanism can be introduced within the window to weight different time steps adaptively:
$$e_\tau = v^{\mathsf{T}} \tanh \left( W_a h_\tau^{(\mathrm{bi})} \right), \quad \alpha_\tau = \frac{\exp(e_\tau)}{\sum_{\kappa} \exp(e_\kappa)}, \quad g_t = \sum_{\tau} \alpha_\tau h_\tau^{(\mathrm{bi})}$$

where $W_a$ and $v$ are the attention parameters, and $\alpha_\tau$ is normalized only within the window to ensure leakage-free evaluation.
To strengthen the modeling of change rates and ramp points, this paper reformulates the point-wise prediction into a relative difference target:
$$\Delta y_{t+h} = y_{t+h} - y_{t+h-1}$$

The mean squared error (MSE) on the Δ-targets is used for optimization:

$$L_{\mathrm{MSE}} = \frac{1}{N_0} \sum_{n=1}^{N_0} \left( \Delta \hat{y}_{t_n+h} - \Delta y_{t_n+h} \right)^2$$
where N0 is the number of samples. During the inference stage, the ground truth value at the end of the window is used as the baseline, and the accumulated increments are added back to restore the point prediction:
$$\hat{y}_{t+h} = y_t + \sum_{j=1}^{h} \Delta \hat{y}_{t+j}$$
where yt is the true load at time step t. Multi-step prediction can be performed either recursively or in parallel, depending on requirements.
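To make the index bookkeeping explicit, the Δ-target construction and the "Δ-to-point" restoration (Equation (33)) can be sketched as follows; the helper names are hypothetical and serve only as illustration.

```python
# Sketch of Delta-target construction and Delta-to-point restoration.
import numpy as np

def delta_targets(y, t, H):
    """Delta y_{t+h} = y_{t+h} - y_{t+h-1} for h = 1..H."""
    return np.array([y[t + h] - y[t + h - 1] for h in range(1, H + 1)])

def restore_points(y_t, delta_hat):
    """y_hat_{t+h} = y_t + sum_{j<=h} Delta y_hat_{t+j}; y_t is the window-end truth."""
    return y_t + np.cumsum(delta_hat)

# usage: point_forecast = restore_points(y[t], predicted_increments)  # length-H vector
```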
Finally, to ensure that the input sufficiently incorporates multiple scales as well as exogenous factors, we construct multi-channel window tensors and apply leakage-free normalization. The concatenated form of the window channels is as follows:
$$X_t = \left[ x_{t-L+1:t} \,\middle\|\, \tilde{r}_{t-L+1:t} \,\middle\|\, u_{k_1, t-L+1:t}, \ldots, u_{k_M, t-L+1:t} \,\middle\|\, z_{t-L+1:t} \right]$$

where $\tilde{r}$ denotes the residual smoothed by LOWESS within the window, $x$ is the original sequence, $u_{k_m}$ ($m = 1, \ldots, M$) are the $M$ IMF channels selected from VMD, $z$ denotes exogenous variables (e.g., calendar factors), and $\|$ indicates concatenation across channels. To prevent statistical leakage, the mean and standard deviation of each channel are estimated only on the training block, then fixed and applied to validation/testing. This yields the following:
$$\tilde{X}_t(c) = \frac{X_t(c) - \mu_c^{\mathrm{train}}}{\sigma_c^{\mathrm{train}}}, \quad c = 1, \ldots, C_{\mathrm{sel}}$$

where $X_t(c)$ denotes the sub-sequence segment of length $L$ for channel $c$, and $\mu_c^{\mathrm{train}}$ and $\sigma_c^{\mathrm{train}}$ are the mean and standard deviation of channel $c$, respectively, computed only from the training block.
The model finally takes the normalized $\tilde{X}_t(c)$ as input, outputs $\Delta \hat{y}_{t+h}$, and restores the point prediction $\hat{y}_{t+h}$ through Equation (33).

2.3. Channel Value Metrics and Selection

2.3.1. Forward Correlation

For the $c$-th candidate channel sequence in the training stage, denoted as $x_c^{(\mathrm{tr})}(t)$, the target is the original point-value sequence $X(t)$. To evaluate its contribution after the "forecast origin", forward correlations are defined for each horizon:

$$\rho_c(h) = \mathrm{corr}\left( x_c^{(\mathrm{tr})}(t), X(t+h) \right), \quad t = L, \ldots, T_{\mathrm{tr}} - h, \; h = 1, \ldots, H$$

where $H$ is the maximum forecasting horizon.
Here, the correlation coefficients are computed on the training samples. The inputs and targets are standardized using the means and standard deviations obtained from Equation (18), i.e., with z-score normalization. To account for multi-step forecasting, the absolute correlations at different horizons are aggregated with weights $\{w_h\}_{h=1}^{H}$, yielding a single score for channel $c$:

$$s_c = \sum_{h=1}^{H} w_h \left| \rho_c(h) \right|, \quad \sum_{h=1}^{H} w_h = 1$$
To capture potential nonlinear dependencies, mutual information and gray relational degree can also be computed and fused with weights:
$$I_c = \sum_{h=1}^{H} w_h\, \mathrm{MI}\left( x_c^{(\mathrm{tr})}(t), X(t+h) \right), \quad \gamma_c = \sum_{h=1}^{H} w_h\, \gamma_c(h)$$

$$\gamma_c(h) = \frac{1}{N_h} \sum_{t} \frac{\Delta_{\min} + \zeta \Delta_{\max}}{\Delta_c(t, h) + \zeta \Delta_{\max}}, \quad \Delta_c(t, h) = \left| X(t+h) - x_c^{(\mathrm{tr})}(t) \right|, \quad \zeta \in (0, 1]$$

where $\gamma_c(h)$ is computed from the difference sequence $\Delta_c(t, h)$ using a three-parameter formulation, and $N_h$ is the number of comparable samples.
All three metrics are normalized to the range [0, 1], with the maximum value across channels set to 1. The comprehensive score is then given by the following:
$$\tilde{s}_c = \alpha \hat{s}_c + \beta \hat{I}_c + (1 - \alpha - \beta) \hat{\gamma}_c, \quad \alpha, \beta \in [0, 1]$$

2.3.2. Redundancy-Aware Channel Subset Selection

The number of candidate channels is C = K + 1 . Given the channel budget M budget , the upper bound is set as follows:
M max = min C , M budget
First, channels are ranked by their scores s ˜ c , and the top M 0 M max are retained for secondary selection. To reduce redundancy caused by overlapping frequency bands, a forward stepwise selection with a penalty is applied. Specifically, given the current selected set S , each step selects:
c = arg max c S s ˜ c λ 1 | S | j S corr x c ( tr ) , x j ( tr ) , λ [ 0 , 1 ]
Until the target number of channels is reached | S | = M M max . In the equation λ [ 0 , 1 ] controls the strength of redundancy penalization, and the correlations are calculated using the training samples. Denote the final selected set by S = { c 1 ,   , c M } .
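A direct transcription of this greedy rule is shown below; the composite scores and training channels are assumed to be precomputed, and all names are illustrative.

```python
# Sketch of redundancy-penalized forward channel selection.
import numpy as np

def select_channels(scores, X_tr, M, lam=0.5):
    """scores: (C,) composite scores s~_c; X_tr: (C, T) training channels."""
    corr = np.abs(np.corrcoef(X_tr))          # pairwise |corr| on training data
    selected = [int(np.argmax(scores))]       # seed with the top-scoring channel
    while len(selected) < M:
        best, best_val = None, -np.inf
        for c in range(len(scores)):
            if c in selected:
                continue
            penalty = lam * corr[c, selected].mean()  # mean |corr| with chosen set
            if scores[c] - penalty > best_val:
                best, best_val = c, scores[c] - penalty
        selected.append(best)
    return selected
```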

2.4. Lightweight Bayesian Optimization and Stable Training

To avoid large-scale brute-force search and to ensure fair comparison under a unified training budget, this study adopts lightweight Bayesian optimization on the validation set targets. Stable training protocols such as early stopping, gradient clipping, regularization, and fixed random seeds are used to guarantee reproducibility and comparability within the leakage-free boundary.

2.4.1. Objective Functions and Evaluation Protocol

After determining the channel set and constructing the multi-channel input windows, the performance of the Δ-BiLSTM model mainly depends on the hyperparameter set θ. The hyperparameters most relevant to training and sensitive to performance are included in the search space:
$$\theta = \{ \mathrm{hidden}, \mathrm{lr}, L_2, \mathrm{dropout}, \mathrm{clip}, \mathrm{batch} \} \in \Omega$$
For a given hyperparameter set θ, after training the model on the training set, evaluation on the validation set is performed under the “forward” protocol. For each horizon h , the index set of comparable validation samples at the horizon h in the validation stage is as follows:
$$I_{\mathrm{va}}(h) = \left\{ t \mid L \leq t \leq T_{\mathrm{vae}} - h \right\}$$
where Tvae denotes the endpoint of the validation set, and L is the input window length.
The corresponding root mean squared error (RMSE) at horizon h during the validation stage is calculated as follows:
$$\mathrm{RMSE}_h^{(\mathrm{va})}(\theta) = \left[ \frac{1}{|I_{\mathrm{va}}(h)|} \sum_{t \in I_{\mathrm{va}}(h)} \left( \hat{y}_{t+h}(\theta) - y_{t+h} \right)^2 \right]^{1/2}$$
The point-wise predictions are restored by accumulating increments:
$$\hat{y}_{t+h}(\theta) = y_t + \sum_{j=1}^{h} \Delta \hat{y}_{t+j}(\theta)$$

where $\hat{y}_{t+h}(\theta)$ denotes the point forecast of $y_{t+h}$, and the baseline value $y_t$ is taken from the end of the sliding window to ensure causality.
The step-weighted validation objective J ( θ ) is aggregated:
$$J(\theta) = \sum_{h=1}^{H} w_h\, \mathrm{RMSE}_h^{(\mathrm{va})}(\theta)$$
where the step weights w h are non-negative and normalized to unity.
When the difference in $J$ between two hyperparameter sets $\theta_1$ and $\theta_2$ is smaller than a tolerance $\varepsilon$, the symmetric mean absolute percentage error (sMAPE) and mean absolute error (MAE) are further used as tie-breakers:

$$\mathrm{sMAPE}_h^{(\mathrm{va})}(\theta) = \frac{100\%}{|I_{\mathrm{va}}(h)|} \sum_{t \in I_{\mathrm{va}}(h)} \frac{\left| \hat{y}_{t+h}(\theta) - y_{t+h} \right|}{\frac{1}{2}\left( \left| \hat{y}_{t+h}(\theta) \right| + \left| y_{t+h} \right| \right)}$$

$$\mathrm{MAE}_h^{(\mathrm{va})}(\theta) = \frac{1}{|I_{\mathrm{va}}(h)|} \sum_{t \in I_{\mathrm{va}}(h)} \left| \hat{y}_{t+h}(\theta) - y_{t+h} \right|$$
To further improve robustness, the validation set can be partitioned into Q consecutive sub-windows, where metrics are computed for each sub-window and then aggregated. The stability-penalized evaluation across sub-windows J ¯ ( θ ) is given by the following:
$$\bar{J}(\theta) = \mu_J(\theta) + \eta \sqrt{ \frac{1}{Q} \sum_{q=1}^{Q} \left( J_q(\theta) - \mu_J(\theta) \right)^2 }, \quad \mu_J(\theta) = \frac{1}{Q} \sum_{q=1}^{Q} J_q(\theta), \quad \eta \geq 0$$

where $J_q(\theta)$ is the objective value on the $q$-th sub-window, $\mu_J(\theta)$ is its mean across sub-windows, and $\eta \geq 0$ is the penalty weight for fluctuation.

2.4.2. Search Space and Lightweight Search Strategy

Bayesian optimization models $J(\theta)$ with surrogates, recording the posterior mean and standard deviation of the objective function after the $t$-th round as $\mu_t(\theta)$ and $\sigma_t(\theta)$, respectively.
The acquisition function selects the next evaluation point, typically using expected improvement (EI) or lower confidence bound (LCB). For the minimization problem, EI is defined as follows:
$$\mathrm{EI}_t(\theta) = \left( J_{\min, t} - \mu_t(\theta) \right) \Phi\left( \frac{J_{\min, t} - \mu_t(\theta)}{\sigma_t(\theta)} \right) + \sigma_t(\theta)\, \phi\left( \frac{J_{\min, t} - \mu_t(\theta)}{\sigma_t(\theta)} \right)$$

where $J_{\min, t} = \min_{i \leq t} J(\theta_i)$, and $\Phi(\cdot)$ and $\phi(\cdot)$ denote the cumulative distribution function and probability density function of the standard normal distribution, respectively.
LCB is expressed as follows:
$$\mathrm{LCB}_t(\theta) = \mu_t(\theta) - \kappa_t \sigma_t(\theta), \quad \kappa_t > 0$$
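Given the surrogate's posterior mean and standard deviation at a candidate $\theta$, both acquisition rules reduce to a few lines. The sketch below assumes a scipy-based implementation; the paper does not specify the optimizer library.

```python
# Sketch of the EI and LCB acquisition functions for a minimization objective.
from scipy.stats import norm

def expected_improvement(mu, sigma, j_min):
    """EI at a candidate point; returns 0 when the surrogate is certain (sigma = 0)."""
    if sigma <= 0:
        return 0.0
    z = (j_min - mu) / sigma
    return (j_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def lower_confidence_bound(mu, sigma, kappa=2.0):
    """Smaller LCB values are more promising for minimization."""
    return mu - kappa * sigma
```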
To embody the lightweight search strategy, each training run adopts the same budget set:
$$\mathcal{B} = \left\{ E_{\max},\, P,\, N_{\mathrm{iter}} \right\}$$

where $E_{\max}$ is the maximum number of epochs, $P$ is the early-stopping patience, and $N_{\mathrm{iter}}$ is the number of iterations per epoch.
This ensures comparability across different θ . The final optimal solution is recorded as follows:
$$\theta^{*} = \arg\min_{\theta \in \Omega} \bar{J}(\theta) \quad \left( \text{or } J(\theta) \right)$$

2.4.3. Stable Training and Reproducibility Protocols

All evaluations strictly follow the leakage-free boundary defined in Section 2.1.4. Channel normalization statistics are computed only during the training stage as defined in Equation (18), and remain fixed for the validation and test stages. The training loss adopts the weighted mean squared error on the Δ-target:
$$L_{\mathrm{train}}(\theta) = \sum_{h=1}^{H} \lambda_h \frac{1}{|I_{\mathrm{tr}}(h)|} \sum_{t \in I_{\mathrm{tr}}(h)} \left( \Delta \hat{y}_{t+h}(\theta) - \Delta y_{t+h} \right)^2, \quad \sum_{h=1}^{H} \lambda_h = 1$$
When combined with L2 regularization (with dropout applied inside the network), the total loss becomes:

$$L(\theta) = L_{\mathrm{train}}(\theta) + L_2 \left\| W \right\|_2^2$$
During backpropagation, L2-norm gradient clipping is applied to prevent exploding gradients:
$$g \leftarrow g \cdot \min\left( 1, \frac{\mathrm{clip}}{\| g \|_2} \right)$$
Early stopping is also applied: if the validation objective shows no improvement within $P$ consecutive rounds (with tolerance $\varepsilon$), training is terminated and the model with the best validation weights is retained. All evaluations are conducted with fixed random seeds and device settings to ensure reproducibility. Testing is performed only once, strictly using the $\theta^{*}$ obtained from the training–validation procedure, along with the corresponding normalization parameters and the selected channel set $S$.
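The clipping rule and early-stopping protocol combine into a compact training loop. The skeleton below assumes a PyTorch implementation, which the paper does not specify; names such as train_stable and validate are illustrative.

```python
# Skeleton of the stable-training protocol: gradient clipping plus early stopping.
import copy
import torch

def train_stable(model, loss_fn, opt, train_loader, validate, E_max=60, P=10, clip=1.0):
    best_J, best_state, stale = float("inf"), None, 0
    for epoch in range(E_max):
        model.train()
        for X, dy in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(X), dy)               # weighted MSE on Delta-targets
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # L2-norm clipping
            opt.step()
        J = validate(model)                            # step-weighted validation RMSE
        if J < best_J:
            best_J, best_state, stale = J, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= P:                             # no improvement for P rounds
                break
    model.load_state_dict(best_state)                  # keep the best checkpoint
    return model
```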

3. Combined Forecasting Framework

3.1. Overall Procedure and End-to-End Mapping

At each time step $t$, within the causal sliding window $\{t-L+1, \ldots, t\}$, the "VMD + residual LOWESS" procedure is executed to obtain $K$ band-limited components and one smoothed residual [26]. Together they form a candidate feature vector of dimension $C = K + 1$:

$$\left[ \mathrm{IMF}_1(t), \ldots, \mathrm{IMF}_K(t), \tilde{r}(t) \right] = \mathcal{D}_{K, \alpha, \tau, \mathrm{tol}, N_{\mathrm{iter}}, \mathrm{span}}\left( X_{t-L+1:t} \right) \in \mathbb{R}^{C}$$

where $\mathcal{D}$ denotes the operator for "VMD + residual LOWESS", and $\tilde{r}(t)$ is the residual smoothed by LOWESS within the window. The parameters $K$, $\alpha$, $\tau$, $\mathrm{tol}$, $N_{\mathrm{iter}}$, and the LOWESS span are predetermined or chosen from small candidate sets, focusing the search on core hyperparameters of the learner.
Candidate channels are standardized independently during training using z-score statistics, and then selected according to the forward correlation measures and redundancy control in Section 2.3, yielding $M$ effective channels. The $c$-th channel after normalization, $\tilde{x}_c(t)$, is given by the following:

$$\tilde{x}_c(t) = \frac{x_c(t) - \mu_c^{(\mathrm{tr})}}{\sigma_c^{(\mathrm{tr})}}, \quad S = \mathrm{Select}\left( \{ \tilde{s}_c \}_{c=1}^{C}, M, \lambda \right)$$

where $\mu_c^{(\mathrm{tr})}$ and $\sigma_c^{(\mathrm{tr})}$ are the mean and standard deviation estimated and fixed during the training stage, respectively, and $\mathrm{Select}(\cdot)$ denotes the "Top-M + redundancy penalty" channel selection.
According to S , the multi-channel input tensor is constructed and trained with the Δ-target using a BiLSTM architecture:
$$X_t = \left[ \tilde{x}_{t-L+1:t}^{(c)} \right]_{c \in S} \in \mathbb{R}^{L \times M}, \quad \Delta \hat{y}_{t+1:t+H} = f_\theta\left( X_t \right)$$

where $X_t$ is the input tensor of length $L$ and dimension $M$, and $f_\theta(\cdot)$ denotes the Δ-BiLSTM model trained under configuration $\theta$, outputting the $H$-step incremental prediction $\Delta \hat{y}_{t+1:t+H}$.
By compressing the above steps into a single operator, the end-to-end mapping is expressed as follows:
$$\hat{y}_{t+1:t+H} = \mathcal{R} \circ f_\theta \circ \mathcal{W}_S \circ \mathcal{N}_{\mu^{(\mathrm{tr})}, \sigma^{(\mathrm{tr})}} \circ \mathcal{D}_{K, \alpha, \tau, \mathrm{tol}, N_{\mathrm{iter}}, \mathrm{span}}\left( X_{t-L+1:t} \right)$$

where $\mathcal{N}_{\mu^{(\mathrm{tr})}, \sigma^{(\mathrm{tr})}}$ denotes per-channel normalization driven by training-stage statistics, $\mathcal{W}_S$ represents channel selection with window concatenation, $\mathcal{R}$ indicates the "Δ-to-point" restoration, and $\mathcal{D}_{K, \alpha, \tau, \mathrm{tol}, N_{\mathrm{iter}}, \mathrm{span}}$ is the VMD + residual LOWESS decomposition operator. The optimal hyperparameters are determined by the unified objective function in Equation (53).
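As one concrete reading of the learner $f_\theta$, the following PyTorch sketch stacks a single-layer BiLSTM over the $(L \times M)$ window tensor and maps the last bidirectional state (the default $\mathrm{Pool}(\cdot)$ of Section 2.2.3) to $H$ increments. The framework choice and layer sizes are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative Delta-BiLSTM learner f_theta: BiLSTM over the window, linear head.
import torch
import torch.nn as nn

class DeltaBiLSTM(nn.Module):
    def __init__(self, n_channels, hidden=64, horizon=3, dropout=0.2):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=n_channels, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)
        self.head = nn.Linear(2 * hidden, horizon)   # maps g_t to H increments

    def forward(self, x):                            # x: (batch, L, M)
        out, _ = self.bilstm(x)                      # (batch, L, 2*hidden)
        g = out[:, -1, :]                            # last bidirectional state as Pool(.)
        return self.head(self.drop(g))               # (batch, H) predicted increments

# usage: model = DeltaBiLSTM(n_channels=6, hidden=64, horizon=2)
```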

3.2. Training and Validation Protocol

Building on the mapping in Section 3.1, this section specifies the protocol for non-leakage training–validation testing, and the model selection criteria. The RMSE at horizon h as defined in Equation (44) is computed, and the aggregated validation objective is given by Equation (46). When robustness is required, Equation (49) is used. Based on these computations, θ is obtained, and the validation-optimal weight is fixed as the sole checkpoint for testing [27].
The test procedure is executed only once and is excluded from any selection or early stopping. The data generation and normalization procedures of the validation stage are replicated. Δ y ^ t + 1 : t + H is subsequently obtained using f θ ( X t ) , and a “Δ-to-point” restoration is performed according to Equation (45). This approach ensures no information leakage while guaranteeing fairness, comparability, and reproducibility.

3.3. Complexity Analysis

The end-to-end framework mainly consists of "causal decomposition + channel scoring + Δ-BiLSTM forward/backward training". At any reporting time $t$, within the causal window $\{t-L+1, \ldots, t\}$, the computational complexity of VMD and residual LOWESS can be approximated as follows:

$$\mathrm{cost}_{\mathrm{VMD+LOWESS}} \approx O\left( K L \log L \cdot N_{\mathrm{iter}} \right) + O(L)$$
where Niter is the number of iterations in the VMD process.
Channel scoring and selection are only performed during training. Its overall scale is determined by the candidate channel number C = K + 1, sequence length H, and training length. The complexity can be approximated as follows:
$$\mathrm{cost}_{\mathrm{score}} \approx O\left( C H T_{\mathrm{tr}} \right) + O\left( M (C - M) T_{\mathrm{tr}} \right)$$

The former term corresponds to the forward-looking correlation or mutual information, whereas the latter corresponds to the forward selection with redundancy penalties. Compared with network training, this scoring cost can usually be ignored.
For the bidirectional LSTM, the single-layer forward and backward unfolding with sliding window length L, channel number M, and hidden units hidden , has a complexity approximately given by the following:
$$\mathrm{cost}_{\mathrm{BiLSTM,\ per\ batch}} \approx O\left( 2 L \left( \mathrm{hidden}^2 + \mathrm{hidden} \cdot M \right) \right)$$
where the constant 2 arises from the bidirectional structure.
Under the unified training budget of maximum epochs and batch size, the training time scale satisfies the following:
$$T_{\mathrm{train}} \propto E \cdot N_{\mathrm{iter}} \cdot L \left( \mathrm{hidden}^2 + \mathrm{hidden} \cdot M \right)$$
The main computational cost is determined by the gate states and parameter matrices. It scales approximately quadratically with the hidden dimension, and linearly with both L and M. The inference phase only involves forward propagation, with the same order of complexity as above but with a smaller constant factor.

4. Case Study Analysis

The load power measurement data of the main transformer bus in Area #1 was selected as the dataset, containing a total of 864 time steps, with one time step equal to 5 min. This dataset represents a coastal region along China’s southeast coast with abundant wind and photovoltaic resources, characterized by high renewable energy penetration. Compared to inland regions with lower renewable integration, it captures the complex load fluctuations and system challenges typical of high renewable penetration areas, providing a more accurate reflection of their impact on power system behavior [28].
The dataset includes three typical load fluctuation days. However, given the similar fluctuation patterns observed across the three days, and to enhance computational efficiency, 400 time steps (roughly 33 h of data) covering a typical day were selected. This selection keeps the data representative of typical operating conditions while reducing the computational burden.

4.1. Time Series Decomposition and Residual Smoothing

4.1.1. VMD Results and Multi-Scale Features

The case study data was chronologically divided into training, validation, and test sets, comprising 70%, 15%, and 15% of the total data, respectively. Only the training set statistics were used for differencing and normalization, so as to avoid information leakage.
During the training stage, VMD was applied, yielding six intrinsic mode functions (IMFs) and one residual component. Figure 2 illustrates the VMD of the training series into six IMFs and a residual. The higher-order IMFs capture short-term fluctuations and noise, whereas the residual and lower-frequency components preserve the long-term trend and smooth variations in the load.
As shown in Figure 2:
(1) IMF1–IMF3 capture intraday low-frequency fluctuations, while IMF4–IMF6 gradually shift to higher central frequencies, identifying peaks and rapid oscillations.
(2) The residual mainly reflects slowly varying trends and contains ultra-low-frequency or slowly changing information not captured by the IMFs.
A comparison is conducted between the decomposed and the original sequences to verify correct decomposition and reconstruction, numerical stability, and consistency.
Figure 3 presents the reconstruction performance of VMD across the train, validation, and test segments. The reconstructed series closely overlaps with the original load curves in all phases, indicating that the selected IMF components and residual effectively preserve both short-term variations and long-term trends without introducing noticeable distortion.
As shown in Figure 3, the VMD reconstruction signal almost entirely overlaps with the original load profile across the training, validation, and test sets, preserving trends and turning points. In the training segment, there is a dip between samples 50–70 and peaks around 120–140 and 190–220. The validation segment declines smoothly, while the test segment rises. This confirms the fidelity of the decomposition and the numerical stability of the reconstruction.

4.1.2. Channel Value Evaluation and Selection

As confirmed in Figure 3, the full VMD reconstruction (∑IMFs + Residual) shows nearly complete overlap with the original series; the focus thus shifts from decomposition validation to channel utilization. On the training segment, each decomposed component (i.e., IMF1-IMF6 and the residual) is evaluated using two complementary criteria, and a compact subset is retained. The results are summarized in Table 1.
(i) Foresight correlation
With a window length of L = 24, multi-step incremental targets are constructed as follows:

$$\Delta y_h(t) = y(t+h) - y(t), \quad h \in \{1, 2, 3\}$$

For each channel $F_i(t)$, after alignment (discarding the first $L$ samples), the absolute Pearson correlations $|r_i(h)| = |\mathrm{corr}(F_i, \Delta y_h)|$ are computed and then averaged across all horizons. This yields the following:

$$\mathrm{Score}_i^{\mathrm{corr}} = \frac{1}{3} \sum_{h=1}^{3} \left| r_i(h) \right|$$

The absolute value measures the strength of the linear relationship regardless of sign, and the use of future increments prevents information leakage.
(ii) Energy ratio
For each channel, energy is defined as $E_i = \sum_t F_i(t)^2$, and the energy ratio is given by $E_i / E_y$, where $E_y = \sum_t y(t)^2$ denotes the energy of the raw series. Since VMD modes are not orthogonal, this ratio serves as an indicator of prominence relative to the raw energy rather than a normalized share. Thus, the sum of ratios across modes need not equal one and may exceed unity, as observed for IMF5 and IMF6 in Table 1. This metric is used jointly with the correlation to avoid retaining components that are highly correlated but energetically negligible.
(iii) Selection rule (Top-M)
Channels are ranked by $\mathrm{Score}_i^{\mathrm{corr}}$, and ties are broken by the energy ratio. The Top-M channels, where M = 6, are retained (the residual is treated as a candidate channel but is specifically smoothed in Section 4.1.3). The "Selected" column in Table 1 marks the retained channels with a 1.
According to Table 1, IMF2 attains the highest foresight correlation on the training segment (≈0.5236), followed by IMF1 (≈0.3427) and IMF3 (≈0.3203). The Residual shows moderate correlation (≈0.1982) and is retained both by the selection rule and for subsequent smoothing. In contrast, IMF4 exhibits very weak correlation (≈0.0185) and a negligible energy ratio (≈1.29 × 10−4), making it the only channel not selected. Although IMF5 and IMF6 display relatively lower correlations (≈0.2063 and ≈0.0216, respectively), their energy ratios are large (≈3.015 and ≈1.506), indicating significant oscillatory content on the raw energy scale. Retaining them helps preserve multi-scale structure for ablation and reconstruction. Thus, the retained set is {IMF1, IMF2, IMF3, IMF5, IMF6, Residual}.
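The two screening criteria of this subsection can be sketched in a few lines; channel and series names are illustrative, and the routine assumes the channel $F_i$ and raw series $y$ have equal length.

```python
# Sketch of the foresight correlation score and the energy ratio used in Table 1.
import numpy as np

def foresight_score(F_i, y, L=24, horizons=(1, 2, 3)):
    """Mean absolute Pearson correlation between a channel and future increments."""
    scores = []
    for h in horizons:
        dy = y[L + h:] - y[L:-h]                  # Delta y_h(t) = y(t+h) - y(t)
        f = F_i[L:-h]                             # channel aligned to the same t range
        scores.append(abs(np.corrcoef(f, dy)[0, 1]))
    return float(np.mean(scores))

def energy_ratio(F_i, y):
    """E_i / E_y; may exceed 1 because VMD modes are not orthogonal."""
    return float(np.sum(F_i**2) / np.sum(y**2))
```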

4.1.3. Residual LOWESS and Spike Suppression

Only the residual is smoothed with LOWESS, suppressing slowly varying noise while preserving the multi-scale structural features already captured by the IMFs. Using the training segment as the benchmark, the reconstruction error (RMSE) and roughness are jointly considered across spans to evaluate the reconstruction. The optimal span is selected according to the objective function $J = \lambda\, \mathrm{RMSE}_{\mathrm{norm}} + (1 - \lambda)\, \mathrm{Roughness}_{\mathrm{norm}}$ with $\lambda = 0.4$.
Figure 4 shows the span selection process for LOWESS, where RMSE to the raw training sequence (blue curve) and roughness measured by second-difference energy (orange curve) are jointly evaluated. As the span increases, RMSE grows monotonically while roughness decreases sharply and then stabilizes, indicating a trade-off between fidelity to the raw signal and smoothness of the residual.
From Figure 4, it can be observed that the comprehensive index reaches its minimum when span ≈ 9. This span is then fixed and subsequently applied consistently during re-decomposition and reconstruction at the validation and testing stages. Consequently, the residual component and its corresponding LOWESS-smoothed counterpart can be obtained for comparison.
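This span search can be reproduced with a short script that sweeps candidate spans, scores each by min-max-normalized RMSE and second-difference roughness, and minimizes J. Library calls and names are illustrative assumptions, matching the sketch earlier in Section 2.1.

```python
# Sketch of the LOWESS span search: J = 0.4 * RMSE_norm + 0.6 * Roughness_norm.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def span_objective(x_raw, imfs_sum, residual, spans, lam=0.4):
    """Return the best span and the objective values over the candidates."""
    t = np.arange(len(residual))
    rmse, rough = [], []
    for s in spans:
        r_s = lowess(residual, t, frac=s / len(residual), return_sorted=False)
        recon = imfs_sum + r_s                          # VMD-LOWESS reconstruction
        rmse.append(np.sqrt(np.mean((recon - x_raw) ** 2)))
        rough.append(np.sum(np.diff(r_s, n=2) ** 2))    # second-difference energy
    rmse, rough = np.array(rmse), np.array(rough)
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    J = lam * norm(rmse) + (1 - lam) * norm(rough)
    return spans[int(np.argmin(J))], J
```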
The residual and LOWESS comparison is shown in Figure 5. As shown in Figure 5, across the training, validation, and test panels, the residual is overlaid with its residual–LOWESS counterpart. Peaks, troughs, and turning points remain aligned. The residual and the LOWESS curves almost overlap, indicating that no significant structural bias has been introduced. Meanwhile, compared with the residual, the curve after LOWESS processing is smoother, showing better stability and less fluctuation.
To intuitively illustrate the advantages of the proposed processing method, four curves are compared during the training stage: the original series, VMD complete reconstruction, VMD-LOWESS, and the weighted moving average (WMA, as a baseline).
In Figure 6, the original raw signal on the training segment is contrasted with three denoising methods: plain VMD (without LOWESS), a weighted moving average (WMA) baseline, and the proposed VMD-LOWESS. The upper-right inset replots the curves within the magnified region, while the lower-right inset reports the MAE computed over the same region.
From Figure 6, it can be observed that after the “VMD + LOWESS” processing, spikes and local fluctuations in the input sequence are effectively suppressed, while the integrity of the overall trend and extreme values is maintained. This aligns with the proposed multi-scale denoising input method, which provides cleaner, more stable, and leakage-preventive input features for subsequent model construction.
During the training stage, a fine-grained grid search over K × span was further conducted. Figure 7 illustrates the composite objective J evaluated on the training segment across a grid of LOWESS span and VMD modes. The color and height encode the normalized magnitude of J, where lower values indicate better performance.
As shown in Figure 7, although roughness decreases as the span increases, the RMSE rises, degrading the composite metric J; excessive smoothing therefore discards useful information. When K increases from 3 to 6, the objective J improves significantly, but further increases yield diminishing returns and higher decomposition costs.
In summary, setting K = 6 and span ≈ 9 provides a favorable trade-off between accuracy and computational complexity.

4.2. Experimental Setup and Evaluation

Building on the decomposition and channel selection in Section 4.1, this section further conducts forecasting experiments and model selection. The entire process follows a chronological partition. All normalization parameters and channel statistics are fixed within the training phase and then consistently applied to the validation/testing phase, avoiding any form of information leakage.

4.2.1. Δ Target Definition and Experimental Protocol

The parameter ranges are as follows: $\mathrm{hidden} \in \{32, 64, \ldots, 256\}$, $\mathrm{lr} \in [10^{-4}, 5 \times 10^{-3}]$ (on a logarithmic scale), $L_2 \in [10^{-6}, 10^{-2}]$, $\mathrm{dropout} \in [0, 0.5]$, $\mathrm{clip} \in [0.5, 5]$, $\mathrm{batch} \in \{32, 64, 128\}$. The forecasting target is expressed in Δ form, with inputs retaining the multi-channel structure of "IMFs + LOWESS residual". The network adopts a single-layer BiLSTM, and training is combined with early stopping. Key training items and computational records are listed in Table 2.
The test set evaluation includes the following metrics: RMSE, MAE, sMAPE, and R². Both RMSE and MAE are expressed in kilowatts (kW), consistent with the unit of the load. sMAPE is expressed as a percentage. R² assesses the goodness of fit and facilitates comparisons across different forecast horizons.

4.2.2. Single-Step Prediction Result Analysis

Under this protocol, the single-step prediction performance is first examined. Figure 8 shows the comparison between predicted and actual values at the test stage with step length h = 1.
From Figure 8, the following can be seen:
(1) The overall trend and turning points are accurately captured, with the predicted curve almost overlapping the actual curve.
(2) After a slight lag at certain positions before the peaks, the model aligns well with the peaks and subsequent decline phases.
(3) The enlarged window indicates that the model maintains smoothness during rapid climbing stages without obvious overshooting, which meets the expected design goals.

4.2.3. Multi-Step Forecasting Error Propagation

Subsequently, the forecast horizon is extended to h = 2 and h = 3, as shown in Figure 9, which presents the model's multi-step performance across horizons, displaying bar plots of RMSE, sMAPE, and R² computed on the test set.
The bar chart in Figure 9 illustrates that the error accumulates step by step: RMSE and sMAPE grow progressively from h = 1 to h = 3, while R² remains at a relatively high level. This indicates that the Δ-target training strategy enables the model to capture the dynamics of rapid climbing and sudden changes more sharply. Even during peak surges and high-load transitions, Δ-BiLSTM maintains smooth and stable predictions, validating the effectiveness of the consistency adjustment in the proposed Δ-target learning innovation.
The corresponding detailed numerical values are listed in Table 3.

4.2.4. Time Window and Step Length Sensitivity Analysis

To identify a stable temporal-correlation setting, a grid search is performed under the same leakage-free pipeline over window lengths $L \in \{28, 56, 112\}$ and maximum step lengths $h_{\max} \in \{1, 2, 3\}$. Each L is first segmented and processed, followed by decomposition, LOWESS, and Top-M channel selection in the training phase. Then, for each $h \leq h_{\max}$, a Δ-model is trained separately, and the validation RMSE weighted by step length is used as the selection metric.
Figure 10 visualizes the validation weighted RMSE on the $(L, h_{\max})$ grid. For each cell in the grid, the leakage-free pipeline is executed, Δ-models are trained for all $h \leq h_{\max}$, and their RMSE values are aggregated with step-length weights. As shown in Figure 10, a distinct performance valley appears around L ≈ 56 with $h_{\max} \leq 2$. Extending L beyond 56 or increasing $h_{\max}$ to three raises the weighted RMSE. Therefore, L = 56 and $h_{\max}$ = 2 are adopted as a balanced choice that maintains accuracy while limiting model complexity.
Figure 11 summarizes the validation weighted RMSE for the three input window lengths $L \in \{28, 56, 112\}$. For each L, the bars represent the maximum forecast horizon $h_{\max} \in \{1, 2, 3\}$, corresponding to the step-weighted aggregation of RMSE for horizons up to $h_{\max}$. As shown in Figure 11, L = 56 maintains consistently low error across all three $h_{\max}$ settings, whereas for a fixed L, the weighted RMSE rises markedly as $h_{\max}$ increases, reflecting the cumulative difficulty of multi-step joint modeling. The complete grid results are presented in Table 4. Note that if a combination was only trained up to $h_{\max}$ = 1 or 2, the uncovered horizons are marked as "-", which does not affect the conclusions. Based on the consistency results of Table 4, subsequent comparisons adopt the configuration L = 56, with $h_{\max}$ also determined by the same weighted criterion.
In summary, under the unified framework of "decomposition–selection–Δ representation" and leakage-free evaluation, the model demonstrates strong trend depiction and peak-tracking ability in both single-step and multi-step scenarios, with L = 56 serving as the more robust time window.

4.3. Results and Discussion

4.3.1. Model Comparison and Advantages

Before comparing the baseline models (such as SVM, RawLSTM, RNN, ARIMA, etc.) with the proposed method (Δ-BiLSTM), it is essential to clarify that all models were trained using the same configuration to ensure fairness and comparability of the results. Specifically, all models employed the same early stopping strategy, training epochs (60 epochs), learning rate (0.001), dropout (0.2), L2 regularization (1.00 × 10−5), gradient clipping (1), batch size (64), and random seed (123), ensuring identical convergence behavior and training stability. However, the proposed method (Δ-BiLSTM) applied VMD and LOWESS during the data preprocessing phase to remove noise and capture multi-scale features, whereas the baseline models were trained directly on the raw data without these additional processing steps. This allowed the proposed method to better handle noise and abrupt fluctuations, resulting in improved prediction accuracy and robustness.
With this unified configuration and training budget, Δ-BiLSTM is compared with baseline models such as RawLSTM and RNN (Recurrent Neural Network) [29]. The results are presented in Table 5.
As shown in Table 5, the proposed Δ-BiLSTM model achieves superior performance on the three core metrics of R², RMSE, and sMAPE in most cases, with particularly significant advantages at h = 1 and h = 2. Specifically, at h = 1 and h = 2, RMSE decreases by 65.5% and 38.26%, respectively, compared to the best baseline model, ARIMA (AutoRegressive Integrated Moving Average). When the step length is extended to h = 3, although the overall error increases, the rate of error growth is lower than that of the baseline methods, while high predictive stability is maintained. This indicates that the proposed method not only demonstrates outstanding accuracy in single-step forecasting but also maintains robust advantages in multi-step forecasting.

4.3.2. Robustness Analysis

From the perspective of robustness analysis, Figure 12 presents the boxplot distribution of absolute errors across the horizon. The three boxplots show the test set absolute error (in kW) distributions of five models at forecast horizons h = 1, 2, and 3. The boxes denote the interquartile range (IQR), with medians shown in red and means in black; the whiskers extend to 1.5 × IQR, and circles mark outliers.
As shown in Figure 12, across the three scenarios with h = 1 to 3, dispersion increases with horizon for all methods. Δ-BiLSTM maintains the lowest median and the narrowest interquartile range (IQR) in the absolute error distributions, achieving its clearest advantage at h = 1 and remaining competitive as h grows (the margins narrow at h = 3), while its 95% quantile is significantly lower than that of the benchmark models, indicating the strongest robustness.
By contrast, the RNN exhibits higher tail risks in long-horizon forecasts, while the SVM and RawLSTM show slightly better medians than ARIMA but still suffer from wider distribution spreads. Overall, Δ-BiLSTM demonstrates superior accuracy and robustness compared to traditional machine learning models and less sophisticated deep learning baseline models.

4.3.3. Error Distribution Characteristics and Prediction Comparison of Multiple Algorithms

Figure 13 and Figure 14 further reveal the spatial distribution of forecasting errors across time and prediction horizons. Here, Algo1 denotes Δ-BiLSTM, Algo2 denotes SVM, Algo3 denotes RawLSTM, Algo4 denotes RNN, and Algo5 denotes ARIMA. As shown in Figure 13, the paired heatmaps display test errors along the sequence. Rows correspond to Algo1 to Algo5, which map to Δ-BiLSTM, SVM, RawLSTM, RNN, and ARIMA, respectively. The left panel shows absolute error (kW), and the right panel shows relative error (%). Warmer colors denote larger error magnitudes. The Δ-BiLSTM row is predominantly composed of cooler shades, whereas RNN and RawLSTM show recurring warm patterns and localized hotspots. Overall, Δ-BiLSTM exhibits the lowest spatial error profile on both absolute and relative scales.
As shown in Figure 14, the left shows the residual series at h = 1 for five algorithms. The right shows the corresponding predictions at the same horizon. It is observed that Algorithm 1 (Δ-BiLSTM) produces errors that are most tightly concentrated around zero, with the narrowest amplitude and no visible drift. Its prediction trace closely matches the measured values in terms of peak height and ramp timing. In contrast, competing methods (e.g., Algorithm 4) show larger excursions or offsets. These patterns substantiate the advantage of the proposed VMD-LOWESS + Δ-BiLSTM pipeline in delivering unbiased and stable single-step forecasts.
In summary, under the same experimental protocol and unified evaluation criteria, the proposed Δ-BiLSTM model outperforms baseline models across four aspects: overall accuracy, calibration, robustness, and spatial consistency. These results align with the experimental findings presented in Section 4.1 and Section 4.2, further demonstrating that Δ-BiLSTM, when trained under unified training protocols and evaluated without leakage, exhibits stronger consistency and stability. The results not only highlight its superiority over baseline models but also show that in repeated experiments and under varying prediction horizons, it maintains robust stability.

4.3.4. Segment-Wise Improvement Under Different Operating Conditions

To further quantify the improvement amplitude of the proposed Δ-BiLSTM under different operating conditions, the 5 min load dataset is divided into three segments along the sample-index axis, as shown in Figure 15.
The segmentation is based on both the load level and the local variability. Specifically, the active power series P(k) is first processed with a sliding window of 12 samples (1 h) to compute the local standard deviation σ(k). Continuous intervals with σ(k) ≤ 1.0 kW and a duration longer than 20 samples are regarded as stable segments, which correspond to three relatively flat periods in the dataset (sample indices 20–64, 138–192, and 313–348). Among the remaining samples, intervals with a load level P(k) ≥ 67 kW are identified as peak segments, forming two sustained peak-load periods (indices 107–137 and 193–245). The rest of the samples, which mainly cover the ramp-up and ramp-down transitions between valley and peak, are classified as the fluctuation segment.
As shown in Figure 15, the RMSE of the h = 1 forecasts is evaluated separately on the stable, fluctuation, and peak segments. Since ARIMA achieves the best overall accuracy among the conventional benchmarks in this study, the comparison in this subsection focuses on Δ-BiLSTM versus ARIMA. For h = 1, Δ-BiLSTM reduces the RMSE on the fluctuation segment by 69.4%, and achieves reductions of 62.5% and 62.4% on the stable and peak segments, respectively, compared with ARIMA. A minimal sketch of the segmentation rule and the segment-wise comparison is given below.
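The sketch assumes the thresholds stated above (a 12-sample trailing window, σ(k) ≤ 1.0 kW sustained for more than 20 samples, and P(k) ≥ 67 kW for peaks); the function and variable names are hypothetical, not the study's code.

```python
import numpy as np

def local_std(p, w=12):
    """Trailing-window standard deviation; w = 12 samples = 1 h at 5 min."""
    sigma = np.full(len(p), np.inf)  # head samples never qualify as stable
    for k in range(w - 1, len(p)):
        sigma[k] = p[k - w + 1 : k + 1].std()
    return sigma

def label_segments(p, sig_th=1.0, p_th=67.0, min_len=20):
    """Label each sample: 0 = fluctuation, 1 = stable, 2 = peak."""
    lab = np.zeros(len(p), dtype=int)
    quiet = local_std(p) <= sig_th
    start = None
    for k, q in enumerate(np.append(quiet, False)):  # sentinel closes the last run
        if q and start is None:
            start = k
        elif not q and start is not None:
            if k - start > min_len:                  # sustained low variability
                lab[start:k] = 1
            start = None
    lab[(lab == 0) & (p >= p_th)] = 2                # peaks among the rest
    return lab

def rmse_reduction(y, yhat_model, yhat_ref, mask):
    """Percentage RMSE reduction of a model versus a reference on one segment."""
    rmse = lambda e: float(np.sqrt(np.mean(e ** 2)))
    return 100.0 * (1.0 - rmse((y - yhat_model)[mask]) / rmse((y - yhat_ref)[mask]))
```

Evaluating rmse_reduction with the Δ-BiLSTM and ARIMA forecasts on each of the three segment masks yields the percentages reported above.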
These segment-wise results indicate that Δ-BiLSTM not only improves the prediction accuracy in normal stable periods, but also significantly suppresses error amplification during rapid load changes and peak-load conditions. In particular, the model shows a clear advantage in tracking high-frequency variations and sustained peaks while maintaining low error levels, which is crucial for short-term dispatch, reserve allocation, and secure operation of distribution networks.

4.3.5. Ablation Analysis of Component Contributions

In this section, Δ-BiLSTM is compared with four ablated variants: Abl w/o Delta, Abl w/o ChSel, Abl w/o LOWESS, and Abl w/o BiLSTM. The contribution of each component is quantified as the percentage increase in test-set RMSE when that component is removed, evaluated at h = 1 for consistency across the models; the metric is sketched below.
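The metric is simply the relative RMSE increase of an ablated variant over the full model. A minimal sketch follows; note that recomputing it from the four-decimal values in Table 6 reproduces the reported percentages only up to rounding.

```python
def rmse_increase(rmse_ablated: float, rmse_full: float) -> float:
    """Percentage RMSE increase caused by removing one component."""
    return 100.0 * (rmse_ablated / rmse_full - 1.0)

# Example with the rounded h = 1 values from Table 6:
print(f"{rmse_increase(0.7977, 0.7389):.2f}%")  # Abl w/o Delta -> ~8%
```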
The results are presented in Table 6, from which the following observations can be made:
(i) Abl w/o Delta removes the Δ-target component, which models the rate of change and abrupt transitions in the bus load (a minimal sketch of the Δ-target construction is given after this list). Removing it increases the RMSE by 8.00%, highlighting the importance of capturing dynamic transitions and fluctuations in the load sequence.
(ii) Abl w/o ChSel removes the channel selection component, which filters out redundant input channels through correlation-based screening. Its removal increases the RMSE by 4.43%, emphasizing the value of selecting effective channels for improving model performance.
(iii) Abl w/o LOWESS removes the LOWESS component, which smooths the residual and suppresses noise in the data. Its removal increases the RMSE by 17.91%, demonstrating the importance of noise reduction for stable and reliable forecasts.
(iv) Abl w/o BiLSTM replaces the BiLSTM with a unidirectional LSTM, which processes the input window in only one direction, whereas the BiLSTM reads each window both forward and backward. This change leads to the largest RMSE increase, 29.47%, emphasizing the critical role of the bidirectional architecture in capturing both long- and short-term dependencies.
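As referenced in item (i), the Δ-target idea trains the learner on first differences of the load and recovers level forecasts by cumulatively adding the predicted increments to the last observed value. The sample construction below is a simplified, hypothetical illustration (the window length L and all names are assumptions), not the paper's implementation.

```python
import numpy as np

def make_delta_samples(p, L, h):
    """Build (input, Δ-target, anchor) triples from a load series p."""
    dp = np.diff(p)                       # ΔP(k) = P(k) - P(k-1)
    X, y, anchors = [], [], []
    for k in range(L, len(dp) - h + 1):
        X.append(dp[k - L : k])           # past L increments as model input
        y.append(dp[k : k + h])           # next h increments as Δ targets
        anchors.append(p[k])              # last observed level P(k)
    return np.array(X), np.array(y), np.array(anchors)

def reconstruct_level(anchor, dp_hat):
    """Undo the differencing: P_hat(k+j) = P(k) + cumulative sum of ΔP_hat."""
    return anchor + np.cumsum(dp_hat)
```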
These results show that every component of Δ-BiLSTM contributes materially to its performance. In particular, the Δ-target and BiLSTM components are crucial for capturing the dynamic behavior of the load and improving prediction accuracy; removing either causes significant degradation, especially under abrupt changes and complex temporal patterns. The channel selection and LOWESS steps further improve robustness and accuracy by reducing redundancy and noise.

5. Conclusions

In this paper, we address the challenges of non-stationarity, irregular fluctuations, and multi-scale coupling in short-term bus load forecasting for distribution networks, and propose a multi-channel Δ-BiLSTM forecasting framework based on VMD and residual LOWESS. By performing VMD and residual smoothing only during the training phase, combined with correlation-based channel screening and redundancy suppression, we construct a multi-channel input embedding that captures both deterministic and stochastic features. A Δ-BiLSTM network is then employed to realize multi-step forecasting and abrupt-change tracking. Bayesian optimization and early stopping, together with robustness-oriented training protocols, ensure reproducibility and fair comparison across horizons.
Quantitative results on the coastal-grid bus load dataset show that prediction accuracy is substantially improved under a unified training budget. Compared with the best conventional benchmark, ARIMA, Δ-BiLSTM reduces the test-set RMSE by 65.5% at h = 1 and by 38.3% at h = 2, while maintaining higher R2 and lower sMAPE across all horizons. The boxplot analysis further shows that Δ-BiLSTM produces the tightest absolute-error distributions, with the lowest medians and 95% quantiles, confirming that its error band is more concentrated and lower-variance than those of SVM, RawLSTM, RNN, and ARIMA.
To link the improvement to operating conditions, the 5 min load dataset is divided into stable, fluctuation, and peak segments according to the load level and local variability. For h = 1, Δ-BiLSTM reduces the RMSE by 69.4% on the fluctuation segment and by 62.5% and 62.4% on the stable and peak segments, respectively, relative to ARIMA. In addition, ablation experiments show that removing the Δ-target, channel selection, residual LOWESS, and BiLSTM components increases the test-set RMSE by 8.00%, 4.43%, 17.91%, and 29.47%, respectively. These results clarify each component's contribution: the Δ-targets and the bidirectional architecture are crucial for tracking ramping dynamics and abrupt transitions, while channel selection and LOWESS significantly enhance robustness by suppressing redundant information and high-frequency noise.
From an application perspective, the proposed VMD–LOWESS–Δ-BiLSTM framework is especially suitable for short-term bus load forecasting in distribution networks with strong high-frequency disturbances, significant peak–valley transitions, and limited historical data. In such scenarios, conventional statistical models or single-channel deep networks tend to underestimate peak levels and amplify errors during rapid ramps, whereas the proposed method maintains low error levels and stable calibration by combining multi-scale inputs, Δ-target learning, and leakage-free training protocols. This makes the framework a practical tool for short-term dispatch, reserve allocation, and security assessment in feeders with high renewable penetration and emerging loads such as centralized EV charging.
Nevertheless, this study still has several limitations. The case study is conducted on a single coastal-grid feeder with a moderate data length and a limited set of exogenous variables, and the Δ-BiLSTM learner is restricted to a relatively compact architecture for the sake of reproducibility. Future work will extend the framework to larger-scale datasets and multi-node shared forecasting models, and will explore alternative sequence learners such as Transformer-based architectures and mixture-of-experts models. In addition, integrating richer exogenous information (e.g., high-resolution meteorological data and market signals) and conducting formal forecast-comparison tests under different operating regimes will further improve the generality and interpretability of the proposed approach.

Author Contributions

Conceptualization, methodology, software, writing—original draft, validation, formal analysis, investigation, Y.G.; resources, J.Z.; data curation, L.W. and J.Z.; writing—review and editing, L.W. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National College Students Innovation and Entrepreneurship Training Program (Grant No. S202410536024) and the National Key Research and Development Program of China (Grant No. 2024YFE0115600).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available for research security reasons.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figure 1. An overview of the VMD–LOWESS–Δ-BiLSTM forecasting workflow.
Figure 2. VMD of the training segment (K = 6). Top: raw series; middle: IMF1–IMF6 with center frequencies ωk; bottom: residual.
Figure 3. VMD reconstruction accuracy for the bus load series: (a) training, (b) validation, and (c) test segments.
Figure 4. Trade-off between reconstruction RMSE and roughness for different LOWESS spans on the training set. Blue circles denote the RMSE against the raw training series (left axis); orange squares denote the roughness (right axis).
Figure 5. Comparison between the VMD residual and its LOWESS-smoothed counterpart for the (a) training, (b) validation, and (c) test segments.
Figure 6. Denoising comparison on the training segment (Raw/VMD-plain/WMA/VMD–LOWESS) with a zoomed-in MAE view.
Figure 7. Grid search of the composite objective J over the number of VMD modes K and the LOWESS span on the training set.
Figure 8. Comparison between measured and predicted bus load on the test set for horizon h = 1, with a zoomed-in view of a peak-load period.
Figure 9. Horizon-wise test performance of Δ-BiLSTM on the bus load dataset (RMSE, sMAPE, and R2).
Figure 10. Validation weighted RMSE on the (L, hmax) grid under the leakage-free pipeline.
Figure 11. Weighted RMSE distribution on the (L, hmax) plane.
Figure 12. Boxplots of test-set absolute errors |e| (in kW) for five models (Δ-BiLSTM, SVM, RawLSTM, RNN, and ARIMA) at horizons h = 1, 2, and 3. Red horizontal lines denote medians, black dots denote sample means, and circles mark outliers.
Figure 13. Heatmaps of absolute error (left) and percentage relative error (right) for five algorithms across sample points.
Figure 14. Comparison of residuals (left) and predicted versus measured bus load (right) at h = 1 for five algorithms (Δ-BiLSTM, SVM, RawLSTM, RNN, and ARIMA).
Figure 15. Segmented load profile with stable, fluctuation, and peak segments.
Table 1. Correlation, energy ratio, and selection results.

Component | Correlation | Energy Ratio | Selected
IMF1 | 0.3426509 | 0.78173379 | 1
IMF2 | 0.5236128 | 0.00835674 | 1
IMF3 | 0.3203291 | 0.04366188 | 1
IMF4 | 0.0184756 | 0.00012883 | 0
IMF5 | 0.2062651 | 3.0151668 | 1
IMF6 | 0.0216455 | 1.5061263 | 1
Residual | 0.198236 | 0.166074 | 1
Table 2. Training configuration for Δ-BiLSTM by horizon (h = 1 to 3).

h | Hidden | lr | Dropout | L2 | Clip | Epochs | Patience | Batch | Seed
1 | 64 | 0.001 | 0.2 | 1.00 × 10−5 | 1 | 60 | 8 | 64 | 123
2 | 64 | 0.001 | 0.2 | 1.00 × 10−5 | 1 | 60 | 8 | 64 | 123
3 | 64 | 0.001 | 0.2 | 1.00 × 10−5 | 1 | 60 | 8 | 64 | 123
Table 3. Test set performance across the horizon (h = 1, 2, and 3).

h | RMSE | MAE | R2 | sMAPE
1 | 0.509282 | 0.408892 | 0.998107 | 0.922703
2 | 1.661738 | 1.431842 | 0.979848 | 2.940043
3 | 1.525787 | 1.295397 | 0.98301 | 2.896958
Table 4. Weighted RMSE across (L, hmax) combinations.

L | hmax | RMSE_weight | RMSE_h1 | RMSE_h2 | RMSE_h3 | RMSE_std
28 | 1 | 0.853788 | 0.853788 | - | - | 0.0
28 | 2 | 1.537469 | 0.841834 | 1.885286 | - | 0.737832
28 | 3 | 2.085647 | 0.825713 | 1.705126 | 2.759306 | 0.968112
56 | 1 | 0.373426 | 0.373426 | - | - | 0.0
56 | 2 | 0.576676 | 0.396564 | 0.666732 | - | 0.191037
56 | 3 | 0.8369 | 0.408113 | 0.643933 | 1.108469 | 0.35634
112 | 1 | 0.394342 | 0.394342 | - | - | 0.0
112 | 2 | 0.812356 | 0.458781 | 0.909494 | - | 0.375977
112 | 3 | 1.354393 | 0.564388 | 0.997245 | 1.855827 | 0.657311
Table 5. Model performance comparison with baseline models.

h | Model | RMSE | MAE | sMAPE | R2
1 | Δ-BiLSTM | 0.7389 | 0.6055 | 1.1648 | 0.9972
1 | SVM | 2.3686 | 2.1735 | 4.5011 | 0.9716
1 | RawLSTM | 2.3847 | 2.0314 | 4.0968 | 0.9712
1 | RNN | 2.9761 | 2.1992 | 4.2940 | 0.9552
1 | ARIMA | 2.1529 | 1.9474 | 4.1697 | 0.9766
2 | Δ-BiLSTM | 1.3095 | 1.0822 | 2.0845 | 0.9913
2 | SVM | 2.3284 | 2.1413 | 4.3742 | 0.9725
2 | RawLSTM | 2.1514 | 2.0558 | 4.2051 | 0.9765
2 | RNN | 3.0320 | 2.2967 | 4.5144 | 0.9534
2 | ARIMA | 2.1211 | 1.9634 | 4.0636 | 0.9772
3 | Δ-BiLSTM | 1.8737 | 1.5495 | 2.9898 | 0.9822
3 | SVM | 1.8227 | 1.6084 | 3.3730 | 0.9831
3 | RawLSTM | 2.0252 | 1.7485 | 3.3213 | 0.9792
3 | RNN | 3.2521 | 2.4447 | 4.8504 | 0.9463
3 | ARIMA | 2.1830 | 1.8968 | 4.2016 | 0.9758
Table 6. Model performance comparison with four variant models.

h | Model | RMSE | MAE | sMAPE | R2
1 | Δ-BiLSTM | 0.7389 | 0.6055 | 1.1648 | 0.9972
1 | Abl w/o Delta | 0.7977 | 0.6981 | 1.3982 | 0.9816
1 | Abl w/o ChSel | 0.7716 | 0.6717 | 1.3717 | 0.9828
1 | Abl w/o LOWESS | 0.8714 | 0.7716 | 1.4716 | 0.9713
1 | Abl w/o BiLSTM | 0.9554 | 0.7554 | 1.5554 | 0.9668
2 | Δ-BiLSTM | 1.3095 | 1.0822 | 2.0845 | 0.9913
2 | Abl w/o Delta | 1.4915 | 1.0917 | 2.1919 | 0.9727
2 | Abl w/o ChSel | 1.4728 | 1.0872 | 2.1629 | 0.9866
2 | Abl w/o LOWESS | 1.5766 | 1.1267 | 2.2767 | 0.9734
2 | Abl w/o BiLSTM | 1.6534 | 1.1536 | 2.3537 | 0.9675
3 | Δ-BiLSTM | 1.8737 | 1.5495 | 2.9898 | 0.9822
3 | Abl w/o Delta | 1.9427 | 1.6101 | 3.2102 | 0.9733
3 | Abl w/o ChSel | 1.9335 | 1.6047 | 3.2075 | 0.9793
3 | Abl w/o LOWESS | 1.9796 | 1.8236 | 3.6727 | 0.9705
3 | Abl w/o BiLSTM | 1.9866 | 1.8467 | 3.6947 | 0.9659