1. Introduction
Wind energy is an important alternative to fossil fuels and contributes significantly to global renewable energy capacity. The Global Wind Report 2022 [1] noted a 94 GW increase in wind energy in 2021, although growth slowed in major markets like China and the United States. Offshore wind saw significant expansion, particularly in China. However, one of the significant challenges in wind energy management is intermittency, which refers to the unpredictable nature of wind. Wind speeds, influenced by weather patterns, fluctuate throughout the day, resulting in inconsistent power output. In practice, this problem is usually managed by predicting wind speed in advance and then integrating other available resources with the wind energy system for sustainable power generation. Various prediction techniques have been developed, which can be categorized into physical models, statistical models, and machine learning (ML) models.
Numerical Weather Prediction (NWP) models use physical principles for wind speed prediction (WSP), eliminating the need for historical data and model training. These models simulate atmospheric conditions to provide location-specific forecasts, as demonstrated by Brabec et al. [2]. However, their performance is sensitive to initial physical conditions, where minor errors can affect results. Another study, by Moreno et al. [3], pointed out the complexity and high computational costs associated with NWP models. Additionally, accurate input data and substantial computational resources are required, since inaccuracies in initial atmospheric data can lead to forecast errors. Nevertheless, when integrated into hybrid models, NWP models become valuable tools for wind speed prediction, particularly in regions lacking historical wind data [2,4].
On the other hand, statistical models treat WSP as a stochastic process, using historical data to identify time-variable relationships. Common statistical models include autoregressive (AR), moving average (MA), autoregressive integrated moving average (ARIMA), and the Kalman filter [5]. These models have been applied extensively due to their simplicity and relatively low computational cost. However, they assume linear relationships, which may not accurately capture the nonlinear characteristics of wind speed (WS) time series. Torres et al. [6] applied an ARMA model to predict the hourly mean WS in Spain, demonstrating that although such models provide reasonable short-term forecasts, they struggle with the nonlinear and random nature of wind speeds. Li et al. [7] utilized ARMA-based approaches for WSP, showing that these models can be effective under certain conditions but have limitations in capturing complex wind patterns. Collectively, these statistical models contribute to understanding WS patterns by providing a foundational approach to time-series analysis. However, their limitations in capturing nonlinearity and randomness highlight the need for more advanced methods that can model the complex behavior of wind speeds.
Machine learning (ML) models, including artificial neural networks (ANNs), support vector machines (SVMs), and deep learning methods, have been applied in WSP for their ability to model nonlinear relationships [8]. Unlike statistical models, ML models do not assume the normality of the residuals or the stationarity of the time series. ANNs model wind speed as a nonlinear system, as demonstrated by Bechrakis [9], while Mohandes et al. [10] showed the potential of SVMs for WSP. These models capture patterns in wind speed data but require large datasets and significant computational power for training. Deep learning models, such as long short-term memory (LSTM) networks, capture long-term dependencies in wind speed data, and combining these approaches can improve prediction accuracy. Geng et al. [11] used LSTM networks for short-term WSP, achieving higher accuracy than traditional methods. Cai et al. [12] applied Extreme Gradient Boosting (XGBoost) for WSP, effectively handling large datasets and using regularization to prevent overfitting. Aslam et al. [13] proposed a physics-informed machine learning approach with a novel cost function that enhances WSP by collecting features from surrounding locations to predict future wind speeds. In summary, ANNs and SVMs model nonlinear relationships, while LSTM and XGBoost handle sequential data and large datasets, respectively. Together, these techniques improve prediction accuracy and model complex wind speed behaviors.
Hybrid models (HMs) combine multiple approaches to leverage their strengths and address individual limitations. HMs often combine statistical models with ML techniques, such as ARIMA with ANNs or SVMs, to improve prediction quality [14]. Recent studies have enhanced time-series data pre-processing using Singular Spectrum Analysis (SSA) and wavelet transforms to increase model accuracy [15]. Advances in HMs include transformer models, initially developed for natural language processing, applied to WSP. Wan et al. [16] and Pan et al. [17] integrated convolutional neural networks (CNNs) with LSTM networks to capture spatial and temporal dependencies in wind speed data. An attention mechanism was added to these models to focus on relevant parts of the input sequence, improving forecasting accuracy [18]. By combining different methodologies, HMs provide a balanced approach that enhances the accuracy and reliability of WSP, effectively managing the complexities and nonlinearity in wind speed data.
Since intermittency is just one challenge related to wind energy utilization, other fundamental challenges in wind power production include visual impact [19,20,21], noise pollution [22,23], bird and bat collisions [24,25,26], upfront costs [27,28,29], and grid integration [30,31]. These issues primarily arise in commercial wind energy projects and raise questions about the sustainability of wind energy as a solution. Although these challenges are significant for commercial wind energy projects, increasing domestic small-scale windmill installations can reduce their impact.
In existing studies such as [32,33,34], the challenge of intermittency and prediction is mostly studied in the context of commercial windmills, which can afford large cloud servers to run sophisticated ML models for accurate predictions. In these projects, the focus is on improving prediction accuracy, and the size of the models is not a significant concern. For small windmills, however, accessing a server to run such models introduces issues such as latency, data privacy, security, bandwidth usage, cloud server costs, and energy consumption. Therefore, designing smaller models that predict WS with minimal error could avoid these challenges and enable efficient local energy management. Deploying ML models on edge devices is a potential solution, but this area has not been well studied in the context of domestic wind energy production. To address these challenges for small windmills, this research investigates a discrete design space (DDS) comprising the hyperparameters of a transformer–LSTM hybrid model such that the prediction accuracy of these models is improved while their size is significantly reduced.
The primary contributions of this research are summarized as follows:
- The proposal of a hybrid baseline model (HBM) that combines a transformer model with LSTM, designed for optimal tuning with variable hyperparameters.
- The introduction of a novel cost function and the use of genetic algorithms for optimizing discrete design spaces (DDSs) influenced by model hyperparameters.
- The incorporation of hardware-specific performance evaluations at each optimization step, ensuring effective deployment on targeted hardware.
This research enhances the understanding and application of transformer–LSTM models for wind speed prediction on low-power devices. It investigates the trade-offs between model size (MS) and prediction accuracy, providing solutions to challenges associated with memory constraints and performance optimization.
2. Proposed Methodology
This study utilizes a baseline model with two main components: a transformer encoder and LSTM layers. The transformer encoder extracts features by handling complex data patterns, while the LSTM layers, as the prediction head, capture temporal dependencies. This section first details the transformer–LSTM baseline architecture and then discusses model performance metrics, which are integral to the cost and fitness functions. Finally, a hardware-centric DDS optimization scheme is presented.
2.1. Baseline Architecture
The baseline model’s architecture is shown in Figure 1. First, the input features pass through a dense embedding layer. This layer changes the dimensions of the input data using Equation (1):

$d_{model} = n_h \times h_s$ (1)

Here, $n_h$ is the number of heads and $h_s$ is the head size.
Next, positional embeddings are added using sinusoidal functions:

$PE_{(pos,\,2i)} = \sin\!\left(pos/10000^{2i/d_{model}}\right)$ (2)

$PE_{(pos,\,2i+1)} = \cos\!\left(pos/10000^{2i/d_{model}}\right)$ (3)

where $pos$ is the position in the sequence, $i$ is the dimension index, and $d_{model}$ represents the embedding dimension. Then, the features pass through the multi-head attention mechanism:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)V$ (4)

where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the dimension of the keys. This computes attention scores, focusing on key parts of the input data. Next, the features pass through a 1D convolutional neural network (CNN) layer that applies a ReLU activation function, adding nonlinearity to help the model learn complex patterns. Following the CNN layer, an add-and-normalize layer is used:

$x_{out} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ (5)
This layer normalizes the data, stabilizing the learning process. The model then stacks TB transformer blocks, each with a multi-head attention mechanism, a CNN layer, a feed-forward network, and an add-and-normalize layer. Next, a layer of MLP dense units is added. These dense layers enhance the representation. The features then pass through more dense layers, adjusting dimensions to match the number of time stamps. This is followed by a reshape layer, which organizes features by time stamps. The final dense layer allocates features to specific time stamps for better prediction accuracy. Finally, a layer of LSTM units acts as a regression head to predict future events, which is suitable for tasks like forecasting wind speeds. The LSTM uses several gates to manage information flow:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (6)

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (7)

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ (8)

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (9)

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (10)

$h_t = o_t \odot \tanh(C_t)$ (11)

Equations (6)–(11) show how the LSTM manages information. The forget gate ($f_t$) discards irrelevant data, the input gate ($i_t$) updates the cell state with new information, and the output gate ($o_t$) outputs part of the cell state. These mechanisms help the LSTM make accurate predictions based on historical data.
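To make Equations (6)–(11) concrete, the following is a minimal NumPy sketch of a single LSTM step under the standard stacked-weight convention; the function names and weight layout are illustrative conventions, not taken from the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev; x_t] to the stacked f, i, c~, o pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates (Eqs. 6, 7, 10)
    g = np.tanh(g)                                # candidate cell state (Eq. 8)
    c_t = f * c_prev + i * g                      # cell state update (Eq. 9)
    h_t = o * np.tanh(c_t)                        # hidden state / output (Eq. 11)
    return h_t, c_t
```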
This baseline model improves on a pure LSTM or a pure transformer model because it leverages the strengths of both architectures. The transformer’s attention mechanism excels at capturing contextual relationships in the data, while the LSTM is adept at handling long-term dependencies in sequential data. This combination allows the model to effectively address both short-term and long-term patterns, leading to more accurate predictions.
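As an illustration of how such a hybrid could be assembled, the following is a minimal Keras sketch, assuming TensorFlow 2.x. The hyperparameter names mirror the DDS parameters defined in Section 2.3, but the layer ordering (e.g., the reshape and time-stamp alignment stages) is simplified, so this should be read as an approximation of Figure 1 rather than the authors' reference implementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from Equations (2) and (3)."""
    pos = np.arange(seq_len, dtype=np.float32)[:, None]
    i = np.arange(d_model, dtype=np.float32)[None, :]
    angles = pos / np.power(10000.0, (2.0 * np.floor(i / 2.0)) / d_model)
    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions
    return pe[None, ...]  # add a broadcastable batch dimension

def transformer_block(x, head_size, num_heads, ff_dim, dropout):
    # Multi-head self-attention (Equation (4)) with add & normalize (Equation (5)).
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size,
                                     dropout=dropout)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
    # 1D convolutional feed-forward sublayer with ReLU nonlinearity.
    ff = layers.Conv1D(ff_dim, kernel_size=1, activation="relu")(x)
    ff = layers.Dropout(dropout)(ff)
    ff = layers.Conv1D(x.shape[-1], kernel_size=1)(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + ff)

def build_baseline(seq_len, n_features, head_size, num_heads, ff_dim,
                   num_blocks, mlp_units, lstm_units, dropout, mlp_dropout):
    d_model = num_heads * head_size                # Equation (1)
    inputs = layers.Input(shape=(seq_len, n_features))
    x = layers.Dense(d_model)(inputs)              # dense embedding layer
    x = x + positional_encoding(seq_len, d_model)  # inject position information
    for _ in range(num_blocks):                    # TB stacked transformer blocks
        x = transformer_block(x, head_size, num_heads, ff_dim, dropout)
    x = layers.Dense(mlp_units, activation="relu")(x)
    x = layers.Dropout(mlp_dropout)(x)
    x = layers.LSTM(lstm_units)(x)                 # LSTM regression head
    return tf.keras.Model(inputs, layers.Dense(1)(x))  # one-step-ahead WS

# Example: 24 hourly steps of 5 features; all hyperparameter values are arbitrary.
model = build_baseline(24, 5, head_size=16, num_heads=4, ff_dim=64,
                       num_blocks=2, mlp_units=64, lstm_units=32,
                       dropout=0.1, mlp_dropout=0.2)
```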
2.2. Model Performance Metrics
The performance of the model is quantitatively assessed using various metrics, such as the MSE, MAE, and R2 score, as well as the model’s storage size on disk and its latency. The formulas for these metrics are as follows:

Coefficient of Determination (R2 Score):

$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$ (12)

Mean Squared Error (MSE):

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (13)

Mean Absolute Error (MAE):

$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$ (14)

In the above equations, $y_i$ denotes the actual wind speed, $\hat{y}_i$ denotes the predicted WS, and $\bar{y}$ denotes the average of the actual wind speed over the $n$ test samples. MSE is sensitive to larger errors because it squares the error terms. In contrast, MAE weights all errors linearly, making it more robust to outliers.
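For concreteness, a minimal NumPy sketch of these three metrics follows; the array names are illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    # Equation (13): mean of squared errors, sensitive to large deviations.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Equation (14): mean of absolute errors, more robust to outliers.
    return np.mean(np.abs(y_true - y_pred))

def r2_score(y_true, y_pred):
    # Equation (12): 1 minus the ratio of residual to total sum of squares.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```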
2.3. Cost and Fitness Functions
The proposed baseline model can capture time-series data patterns. However, such models have a high MS if not properly optimized. Hence, there is a trade-off between prediction accuracy, quantified by the MSE, and the MS. Navigating this balance can be seen as the optimization of a DDS. Unlike in a continuous space, where there are infinite options for selecting parameter values, choices in a DDS are quantized, offering only a finite set of possibilities. The pivotal components of our model, all within this DDS, comprise the following:
Head Size (HS): Influences the granularity of attention and overall model capacity.
Number of Heads (NH): Enables concurrent attention mechanisms, facilitating the model’s understanding of diverse data facets.
Feed-Forward Dimension (FF): Signifies the inner processing capability of the feed-forward network within the transformer.
Number of Transformer Blocks (TB): Multiple blocks augment the model’s depth, enhancing its ability to discern complex patterns.
MLP Units (MLPU): Governs the capacity of the dense layers within the model.
Dropout (DO): Acts as a regularization knob in the transformer encoder section.
MLP Dropout (MLPD): Acts as a regularization knob in the dense layer section, mitigating the risk of overfitting.
LSTM Units (LSTMU): Determines the temporal processing power of the LSTM layers.
Considering these parameters, the optimization problem is given by Equation (15):

$A^{*} = \arg\min_{A} C$ (15)

where $X$ represents the known configurations of the model, consolidated into a column vector:

$X = [HS,\ NH,\ FF,\ TB,\ MLPU,\ DO,\ MLPD,\ LSTMU]^{T}$ (16)

Our objective is to identify the optimal coefficient matrix $A$ to minimize the cost $C$. The cost metric $C$ symbolizes the equilibrium between accuracy and MS, as defined in Equation (17):

$C = \varepsilon \, \frac{MSE}{\alpha} + (1 - \varepsilon) \, \frac{MS}{\beta}$ (17)

where $\varepsilon$ varies between 0 and 1, serving as a balancing agent between the model’s predictive prowess (MSE, scaled by $\alpha$ in m/s) and its computational footprint (size of model, scaled by $\beta$ in bytes).
Each column of $A$ is subject to specific upper threshold constraints and has different data type requirements: for example, $0 < HS \le mhs$ and $0 < NH \le mnh$, with integer types for the unit counts and real values in $[0, 1)$ for the dropout rates. Here, the upper limits ($mhs$, $mnh$, $mffd$, etc.) represent the maximum permissible value of each parameter (Max_Head_Size, Max_Number_of_Heads, etc.), ensuring the model’s feasibility and efficiency within the specified constraints.
In this work, a genetic algorithm (GA)-inspired technique is employed to identify models that minimize prediction error while also maintaining a minimal size. Consequently, we have devised a fitness function that incorporates an additional constraint to ensure model effectiveness. This constraint ensures that the model achieves a positive R2 score, which is a standard metric for assessing the goodness of fit in regression models. The fitness function, therefore, integrates this additional criterion and is formulated as shown in Equation (18):

$F = C + \lambda \max(0, -R^{2})$ (18)

This method ensures that the chosen model balances accuracy and size while meeting a basic standard of predictive performance, as measured by the R2 score.
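A small Python sketch of Equations (17) and (18) is given below; it assumes the penalty term activates only for a negative R2 score, which is one plausible reading of the Lagrange-multiplier description above, and the default scaling constants are placeholders rather than the paper's tuned values.

```python
def cost(mse_val, size_bytes, eps, alpha=1.0, beta=1_048_576):
    # Equation (17): eps in [0, 1] trades prediction error against model size.
    # alpha, beta are scaling constants; the defaults here are illustrative.
    return eps * (mse_val / alpha) + (1.0 - eps) * (size_bytes / beta)

def fitness(mse_val, size_bytes, r2, eps, lam=0.001):
    # Equation (18): add a lam-weighted penalty only when R^2 is negative.
    return cost(mse_val, size_bytes, eps) + lam * max(0.0, -r2)
```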
2.4. Hardware-Centric DDS Optimization Scheme
In this section, the proposed hardware-centric DDS optimization algorithm is discussed. The proposed methodology is shown graphically in Figure 2. This scheme is based on the optimization principles of the genetic algorithm. The details of the proposed scheme are as follows:
1. A genetic algorithm initiates a random generation of models, where each gene represents a specific aspect of the model and one chromosome is represented as X in Equation (16). Once these models are generated, they are passed through a constraint check.
2. These models are then built and trained for a limited duration of 25 epochs. After training, they are deployed on a memory-constrained device to evaluate their performance on the test data.
3. The fitness of these models is calculated from the test results using Equation (18), where λ is a Lagrange multiplier that penalizes models with a negative R2 score after 25 epochs.
4. The GA process for generating new models involves selection, mutation, and crossover, governed by the following criteria (a constraint-repair sketch follows this list):
   - If there is an improvement in performance over the last three generations, the two best models (minimum fitness values) are selected for mutation and crossover.
   - If performance does not improve, the best parent and a second randomly selected gene are used for mutation and crossover.
   - During mutation or crossover, if any gene exceeds its predefined limits, a random value within the permissible range is reassigned to keep the MS within the required memory space.
5. The process returns to step 2 unless a stopping criterion is met.
6. Finally, the top five models from the experiment are selected for further training over an extended number of epochs. They are then re-evaluated on the memory-constrained device to ascertain their final performance metrics.
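The sketch below illustrates the constraint-repairing mutation and crossover of step 4; the gene set is limited to the integer-valued hyperparameters, and the numeric upper bounds are placeholders rather than the paper's actual limits.

```python
import random

# Hypothetical per-gene upper bounds standing in for mhs, mnh, mffd, etc.
LIMITS = {"HS": 64, "NH": 16, "FF": 128, "TB": 4, "MLPU": 128, "LSTMU": 128}

def repair(gene, value):
    # Reassign a random in-range value whenever a gene exceeds its limits.
    return value if 1 <= value <= LIMITS[gene] else random.randint(1, LIMITS[gene])

def crossover(parent_a, parent_b):
    # Single-point crossover over the ordered genes, followed by repair.
    keys = list(parent_a)
    point = random.randrange(1, len(keys))
    child = {k: (parent_a if n < point else parent_b)[k] for n, k in enumerate(keys)}
    return {k: repair(k, v) for k, v in child.items()}

def mutate(chromosome, rate=0.2):
    # Perturb each gene with probability `rate`, repairing out-of-range results.
    return {k: (repair(k, v + random.choice([-1, 1])) if random.random() < rate else v)
            for k, v in chromosome.items()}
```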
In this work, models are trained on a server and tested on memory-constrained devices (MCDs) such as the Jetson. This dual-environment, hardware-centric approach is more practical than the standard neural architecture search procedure, where models are trained and tested on the same machine, typically a server. In our research, models are trained on servers to accelerate the training process and then tested on the actual hardware. Since results may vary when models are deployed on such devices, testing them before selecting the best parent ensures the selection of solutions that are optimal for the specific hardware. To transfer the trained models from the server to the Jetson device, Google’s gRPC protocol is employed, ensuring efficient and reliable model transfer.
Training lasts for 25 epochs to quickly assess model behavior and adaptability. A model’s performance during this period, as indicated by the MSE and R2 scores, determines its potential for further development or the need for reevaluation. Predefined limits on hyperparameters prevent training failures in memory-constrained environments, ensuring that models remain within available memory space and are practical for real-world deployment.
Furthermore, an adaptive genetic algorithm strategy introduces new random genes when performance stagnates over three generations. This keeps the search space diverse and dynamic, promoting exploration and increasing the chance of finding effective models. Hence, the top five models are selected based on initial performance metrics for extended training beyond 25 epochs. This approach ensures that promising candidates are refined further. The computational complexity of the proposed algorithm is analyzed in Appendix A. To facilitate a better understanding of the methodology, the pseudocode in Algorithm 1 succinctly outlines the scheme.
Algorithm 1 Hardware-centric DDS optimization algorithm

Require: N (population size), E (maximum epochs), S (stopping criterion)
Ensure: Optimized models
 1: P ← GenerateRandomModels(N)        ▹ Generate initial population
 2: g ← 0                              ▹ Generation counter
 3: while S is not met do
 4:     for all X in P do
 5:         Build and train X for 25 epochs
 6:         Deploy X on the memory-constrained device
 7:         F(X) ← fitness of X via Equation (18)
 8:     end for
 9:     P′ ← ∅                         ▹ New population
10:     if fitness improved over the last three generations then
11:         parents ← two best models in P (minimum fitness values)
12:     else
13:         parents ← best model in P and a randomly selected chromosome
14:     end if
15:     for i ← 1 to N do
16:         X′ ← Crossover(parents)
17:         X′ ← Mutate(X′)
18:         Repair any out-of-range genes of X′ with random in-range values
19:         P′ ← P′ ∪ {X′}
20:     end for
21:     P ← P′
22:     g ← g + 1
23: end while
24: TopFive ← five best models found
25: Fine-tune TopFive for up to E epochs and re-evaluate on the device
26: return TopFive
3. Results and Comparative Analysis
This section discusses the results and conducts a comparative analysis of our proposed methodology, starting with a detailed overview of the datasets and parameter configurations.
3.1. Datasets and Parameter Configuration
The datasets used in this research were compiled by the National Renewable Energy Laboratory (NREL) and focus on three significant locations in Pakistan: Gwadar, Pasni, and Jhimpir. Despite being the fifth-most populous country, Pakistan has faced an energy crisis for decades, and the resources available to the government are insufficient to install commercial windmills. Instead, domestic small-scale windmills are installed by the local government with a lower budget, supporting the United Nations Sustainable Development Goals (SDGs), particularly SDG 7 (Affordable and Clean Energy).
The datasets contain meteorological variables at 60-min intervals, including temperature, wind speed (WS), wind direction, atmospheric pressure, and cloud type. This study focuses on DDS optimization for MCDs, targeting one-step-ahead forecasts. The dataset was pre-processed to include measurements from the past 24 h and then normalized. The model aimed to estimate the WS (in m/s) for the upcoming 25th hour, denoted as $\hat{y}$. After prediction, denormalization transformed $\hat{y}$ back to the actual scale to determine the real-world wind speed.
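A minimal sketch of this windowing and scaling step follows, assuming min-max normalization and wind speed stored in the first feature column; both are illustrative assumptions.

```python
import numpy as np

def make_windows(series, window=24):
    """series: (time, features) array; returns 24-h inputs and 25th-hour WS targets."""
    X, y = [], []
    for t in range(len(series) - window):
        X.append(series[t:t + window])    # measurements from the past 24 h
        y.append(series[t + window, 0])   # wind speed at the upcoming hour
    return np.array(X), np.array(y)

def normalize(x, lo, hi):
    return (x - lo) / (hi - lo)           # scale to [0, 1]

def denormalize(x_norm, lo, hi):
    return x_norm * (hi - lo) + lo        # back to m/s after prediction
```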
The maximum values for the model parameters (HS, NH, FF, TB, MLPU, DO, MLPD, and LSTMU) are represented as a row vector of upper limits ($mhs$, $mnh$, $mffd$, etc.). The models were trained on an Nvidia GeForce RTX 3080 Ti and, to ensure practical deployment on MCDs, were optimized for Nvidia Jetson Nano devices. The Nvidia Jetson Nano is designed for edge computing and features a quad-core ARM Cortex-A57 CPU and an integrated GPU, making it energy-efficient and suitable for real-time applications in renewable energy management.
3.2. Results
In this study, the parameter λ in Equation (18) was set to 0.001, and β was set to 1,048,576 bytes, with α serving as the corresponding scaling factor for the MSE. These parameter values were determined intuitively by evaluating the performance of the baseline model, which has the maximum possible size for its hyperparameters. Additionally, five distinct experiments were conducted, each varying the value of ε in Equation (18). The chosen values for ε in these experiments were 0.01, 0.25, 0.5, 0.75, and 0.99. In the experiment where ε was 0.01, the primary focus was on minimizing the MS, with negligible concern for the MSE. As ε increased to 0.25, there was a noticeable shift in priority, with greater emphasis on reducing the MSE and less importance placed on the MS. This trend continued with higher values of ε: as ε increased, the significance given to minimizing the MSE grew, consequently reducing the emphasis on the size of the model.
Initial training over 25 epochs yielded five distinct models for each case. Upon fine-tuning these models using the dataset from Gwadar city, results were obtained for varying values of ε. These results are depicted in Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7. Each figure comprises six subplots: the first subplot summarizes the results of the five models, while the remaining five subplots detail the DDS each model occupies. For instance, Figure 3 features a model in the first row and second column, highlighted in purple, with its optimal parameters annotated. The notation ‘NH 5/16’ indicates that the optimal number of heads is 5, within the maximum allowed limit of 16.
3.2.1. Experiment with ε = 0.01
Figure 3 showcases the top five models for ε = 0.01, focusing on the trade-off between MS and performance. The most efficient model achieved an impressive balance, with a total cost of 0.28, an MSE of 0.11 m/s, and an MS of only 0.28 MB. The second model, with a total cost of 0.33, an MSE of 0.22 m/s, and an MS of 0.33 MB, showed an increase in both size and MSE. This makes it a less favorable choice, as it does not efficiently balance the trade-off between size and performance.
Figure 3. Top five models for ε = 0.01.
In contrast, the third model demonstrated a significant improvement in performance at the expense of increased size. It registered a total cost of 0.45, an MSE of 0.08 m/s, and an MS of 0.45 MB, indicating a substantial enhancement in accuracy for a reasonable increment in MS. The fourth and fifth models, with slightly different sizes of 0.46 MB and 0.47 MB but identical total costs of 0.46 and MSEs of 0.06 m/s, showcased a consistent level of high performance for these larger models.
3.2.2. Experiment with ε = 0.25
Figure 4 presents the top five models for ε = 0.25, for which the balance between MS and performance was considered under different criteria compared to the earlier experiment. The best-performing model in this setup achieved a total cost of 0.29, an MSE of 0.06 m/s, and an MS of 0.36 MB, demonstrating an efficient balance between accuracy and compactness. The second model exhibited a total cost of 0.31, an MSE of 0.09 m/s, and an MS of 0.38 MB. This model indicates a preference for a slightly larger size while still maintaining high accuracy.
Figure 4. Top five models for ε = 0.25.
Moving to the third model, it presented a total cost of 0.39, an increased MSE of 0.26 m/s, and an MS of 0.44 MB. This suggests a shift toward accommodating a larger MS in exchange for a moderate increase in error rates. The fourth model showed an increase in both total cost (0.44) and MS (0.48 MB) but with a slightly higher MSE of 0.3 m/s, indicating a trade-off for a larger MS against a marginal increase in error performance. Interestingly, the final model, despite having the largest size at 0.6 MB, achieved a low MSE of 0.05 m/s and a total cost of 0.46. This outcome suggests that a larger MS can significantly improve accuracy, albeit at the cost of increased resource consumption. The detailed optimal parameters for these models are depicted in Figure 4.
3.2.3. Experiment with ε = 0.5
In the experiment characterized by ε = 0.5, as illustrated in Figure 5, a distinct set of outcomes was observed, indicating a more balanced trade-off between MS and performance. The first model achieved a total cost of 0.25, an MSE of 0.05 m/s, and an MS of 0.44 MB, demonstrating a favorable balance between accuracy and size.
Figure 5. Top five models for ε = 0.5.
The second model, with a slightly larger size of 0.46 MB, also recorded a total cost of 0.25 but achieved a slightly lower MSE of 0.04 m/s, an incremental improvement in accuracy. For the third model, the total cost rose to 0.3 and the MS to 0.56 MB, while the MSE held at 0.04 m/s. This model indicates a preference for a larger size while maintaining a low error rate, balancing cost and performance.
The fourth model, significantly larger at 0.8 MB, demonstrated a total cost of 0.42 and an MSE of 0.04 m/s, matching the accuracy of the smaller models at the expense of a substantial increase in size. Finally, the fifth model, with a size of 0.85 MB, showed a total cost of 0.44 and an MSE of 0.04 m/s. Although similar in size to the fourth model, it presented a slightly higher total cost, making it slightly less efficient in comparison. The detailed parameters for these models, including their specific configurations, are further elucidated in Figure 5.
3.2.4. Experiment with ε = 0.75
The experiment with ε = 0.75, as shown in Figure 6, revealed a compelling set of results, emphasizing a greater focus on model accuracy while also considering MS. The first model in this series achieved a total cost of 0.218, an MSE of 0.05 m/s, and an MS of 0.721 MB, indicating a shift toward prioritizing accuracy while maintaining a moderately large size.
The second model, with an MS of 0.927 MB, had a total cost of 0.267 and an MSE of 0.047 m/s, indicating a further enhancement in accuracy but a noticeable increase in size. Advancing to the third model, the total cost rose to 0.324, the MSE to 0.076 m/s, and the MS to 1.065 MB, reflecting a continued emphasis on reducing the error rate, albeit at the expense of larger model dimensions.
The fourth model, at 1.191 MB, showed a cost of 0.343 and an MSE of 0.06 m/s. Despite its size, it balanced size and accuracy well. The fifth model, the largest at 1.311 MB, recorded a cost of 0.368 and an MSE of 0.054 m/s. Although it was the largest, it maintained good accuracy, illustrating a balance between size and performance. The details of these models are shown in Figure 6.
3.2.5. Experiment with ε = 0.99
The final experiment, with ε = 0.99, as shown in Figure 7, represents an approach with an emphasis on reducing the MSE while placing minimal constraints on MS. Hence, the first model achieved a total cost of 0.059, an MSE of 0.047 m/s, and an MS of 1.217 MB, showing a significant emphasis on accuracy. The second model, with a total cost of 0.062, an MSE of 0.05 m/s, and a slightly smaller size of 1.163 MB, maintained a similar level of accuracy with a marginal reduction in size.
Moreover, the third model, with a total cost of 0.063, an MSE of 0.057 m/s, and a smaller MS of 0.655 MB, presented a better balance between size and accuracy compared to its predecessors. The fourth model, at 0.781 MB, showed a higher cost of 0.177 and an MSE of 0.171 m/s, indicating a decline in both size efficiency and accuracy. Finally, the fifth model, despite a slightly smaller size of 0.719 MB, recorded the highest cost of 0.207 and an MSE of 0.202 m/s. This highlights the difficulty of optimizing for high accuracy while maintaining a reasonable MS. Detailed insights into the model configurations and trade-offs are provided in Figure 7.
Figure 6. Top five models for ε = 0.75.
This experiment revealed the top five models for each ε value, highlighting the trade-offs. For example, suppose that in Experiment 1 (ε = 0.01) an MSE of 0.08 m/s or less is acceptable and the MS is limited to under 500 KB. The third model is then suitable, with a size of 0.455 MB and an MSE of 0.078 m/s. This approach emphasizes that model selection depends on specific application requirements and constraints.
Figure 7. Top five models for ε = 0.99.
3.3. Performance of Optimal Models Across Datasets
The models and results presented above were derived using the Gwadar dataset. However, when these models, configured with their optimal hyperparameters, were trained and tested on data from the other two datasets, Jhimpir and Pasni, they also exhibited acceptable performance, as shown in Table 1.
3.4. Analyzing the Influence of ε on Model Metrics
This section explores the relationship between the parameter ε and various model performance indicators, specifically the MSE, MAE, R2 score, and MS. Figure 8 graphically presents these relationships, allowing for a comprehensive comparison of how changes in ε affect each metric. A clear trend emerged: increasing ε led to reductions in both MSE and MAE, demonstrating the direct impact of ε on minimizing these error metrics. Furthermore, a positive correlation between ε and the R2 score was observed, suggesting improved predictive performance with larger ε values, while the MS increased with ε.
3.5. Comparison
This section compares the proposed optimal model with three existing schemes. First, this work compares DeepAR [35], an autoregressive recurrent neural network model for probabilistic forecasting, with our proposed scheme. The second model for comparison is DeepTCN, a CNN-based probabilistic forecasting framework [36]. Lastly, this work investigates the CNN-LSTM model [37], which integrates a CNN and an LSTM network for wind power prediction.
As shown in the comprehensive comparative analysis presented in Table 2, the proposed scheme consistently outperformed existing models like DeepAR, DeepTCN, and CNN-LSTM across all datasets. For example, on the Jhimpir dataset, our proposed scheme achieved up to a 98.78% reduction in size, a 51.89% improvement in MSE, a 31.80% improvement in MAE, and a 9.80% increase in the R2 score compared to DeepAR. Similar trends were observed on the Gwadar and Pasni datasets, with substantial improvements across all metrics. Specifically, compared to DeepTCN on the Gwadar dataset, the proposed scheme showed a 92.24% reduction in size, a 32.66% better MSE, an 18.15% better MAE, and a 3.52% higher R2 score. Similarly, on the Pasni dataset, compared to CNN-LSTM, the proposed scheme demonstrated a 90.18% reduction in size, a 43.07% improvement in MSE, a 23.60% improvement in MAE, and an 18.29% increase in the R2 score. The predicted versus actual data for each dataset are shown in Figure 9, Figure 10 and Figure 11.