Article

Benchmarking Transformer Variants for Hour-Ahead PV Forecasting: PatchTST with Adaptive Conformal Inference

Faculty of Electrical Engineering, Wroclaw University of Science and Technology, 50-370 Wroclaw, Poland
Energies 2025, 18(18), 5000; https://doi.org/10.3390/en18185000
Submission received: 26 August 2025 / Revised: 12 September 2025 / Accepted: 16 September 2025 / Published: 19 September 2025
(This article belongs to the Special Issue Renewable Energy System Technologies: 3rd Edition)

Abstract

Accurate hour-ahead photovoltaic (PV) forecasts are essential for grid balancing, intraday trading, and renewable integration. While Transformer architectures have recently reshaped time series forecasting, their application to short-term PV prediction with calibrated uncertainty remains largely unexplored. This study provides a systematic benchmark of five Transformer variants (Autoformer, Informer, FEDformer, DLinear, and PatchTST) evaluated on a five-year, rooftop PV dataset (5 kW peak) against an unseen 12-month test set. All models are trained within a pipeline using a 48-h rolling input window with cyclical temporal encodings to ensure comparability. Beyond point forecasts, we introduce Adaptive Conformal Inference (ACI), a distribution-free and adaptive framework, to quantify uncertainty in real time. The results demonstrate that PatchTST, through its patch-based temporal tokenization, delivers superior accuracy (MAE = 0.194 kW, RMSE = 0.381 kW), outperforming both classical persistence and other Transformer baselines. When coupled with ACI, PatchTST achieves 86.2% empirical coverage with narrow intervals (0.62 kW mean width) and probabilistic scores (CRPS = 0.54; Winkler = 1.86) that strike a balance between sharpness and reliability. The findings establish that combining patch-based Transformers with adaptive conformal calibration provides a novel and viable route to risk-aware PV forecasting.

1. Introduction

Accurate hour-ahead solar photovoltaic (PV) power forecasting is required for ensuring the stability, efficiency, and profitability of modern power systems. The intermittent and weather-dependent nature of PV generation causes rapid fluctuations in output, which, if not anticipated, can lead to costly imbalances between supply and demand [1]. Hour-ahead forecasts enable grid operators to fine-tune dispatch schedules, minimize reliance on expensive reserve generation, and reduce the need for sudden ramping of conventional plants. In electricity markets, precise short-term predictions help PV producers avoid imbalance penalties, which can be several times higher than day-ahead tariffs, thereby protecting revenue streams. Moreover, with the growing penetration of PV systems exceeding 900 GW of installed capacity worldwide, localized, high-resolution forecasts are essential for mitigating congestion, optimizing energy storage dispatch, and enhancing participation in intraday trading [1]. Unlike day-ahead predictions, hour-ahead forecasting captures the most recent meteorological dynamics, making it highly responsive to transient phenomena such as fast-moving cloud cover or dust events. This capability not only supports operational resilience in large-scale grids but also underpins the reliable integration of distributed PV systems in microgrids and islanded networks.
To tackle the challenges of hour-ahead PV power forecasting, recent research has adopted increasingly data-driven and hybrid modeling strategies, leveraging both physical and machine learning architectures. Traditional statistical approaches, such as regression and support vector methods, have been largely supplanted by artificial neural networks (ANNs) and their derivatives, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) models, which model nonlinear temporal dependencies in PV generation data [2]. More recent advancements have focused on attention-based architectures, particularly Transformers, which have reshaped the forecasting landscape by enabling models to capture both fine-grained and long-range temporal correlations, while dynamically reweighting features under varying weather conditions. Hybrid approaches now combine signal decomposition, such as variational mode decomposition optimized by advanced metaheuristics with causal convolutional Transformers to isolate multi-scale components before prediction, thereby improving robustness across diverse seasonal and meteorological regimes [3]. Other domain-specific Transformer adaptations employ patch partitioning to better capture rapid irradiance fluctuations, integrate two-stage attention stacks for sequential temporal and cross-variable dependencies, and couple these with BiLSTM decoders to strengthen bidirectional temporal learning [4]. At the system scale, benchmarking studies that fuse modeled irradiance, numerical weather predictions, and plant metadata consistently report lower MAE and Mean Absolute Percentage Error (MAPE) values for Transformer architectures compared with AutoRegressive Integrated Moving Average (ARIMA), Support Vector Regression (SVR), random forest, and deep recurrent baselines [5]. Furthermore, interpretability-enhanced frameworks, such as the Temporal Importance Model Explanation (TIME) combined with Temporal Fusion Transformers (TFTs), achieve high predictive fidelity while exposing variable contributions for operational decision-making. Compared with iterative hour-by-hour prediction schemes, sequence-to-sequence Transformer models and their hybrid variants deliver multi-horizon forecasts in a single pass, offering a balance between accuracy, adaptability, and computational efficiency for grid operations and market participation.
In the literature, Transformer-based architectures for time series forecasting have evolved into several distinct categories, each addressing different challenges. Vanilla Transformers form the baseline, using self-attention to model long-range dependencies but often needing optimization for sequence length and efficiency [6]. Efficient Transformers such as Informer and FedFormer address scalability by introducing sparse attention, distillation, or frequency-enhanced modules to reduce computational cost while retaining predictive accuracy in long sequences [7]. Decomposition-based Transformers like Autoformer embed time series decomposition directly into the architecture, separating trend and seasonal components to enhance stability and interpretability [7]. Spatial–temporal Transformers expand the attention mechanism to model dependencies across both time and geographic space, enabling multi-site meteorological correlation modeling. This categorization, from general-purpose to task-optimized designs, forms a clear hierarchical flow that can be visualized as a taxonomy of Transformer variants for renewable energy time series forecasting. This structure is visualized by means of a chart, as shown in Figure 1.
In terms of Transformer-based models used in the broad area of energy research, the following works have been described in the literature. The study in [8] proposes a Transformer-based deep neural network augmented with wavelet decomposition features to forecast wind speed and derived wind power up to 6 h ahead, benchmarked against a carefully hyper-parameter-tuned LSTM baseline (plus a persistence model). Inputs include hourly meteorological variables (wind speed/direction, air temperature, humidity, and pressure) and cyclic time encodings; data come from anemometric towers at 100, 120, and 150 m in three sites in Bahia, Brazil (Mucuge, Esplanada, and Mucuri), yielding nine time series of 744 h each, with 550 h for training/validation (70/30 split) and 194 h for testing. The method learns on original signals plus optimal wavelet reconstructions (mother wavelet chosen by lowest reconstruction RMSE), and model selection uses Normalized Mean Squared Error (NMSE) and Pearson’s r, with final reporting on MAE/MSE/RMSE/Fraction within a factor of two (Fac2) as well. Across sites/heights, the Transformer generally outperforms the fine-tuned LSTM, especially for longer horizons (3rd–6th hour) and trains faster while having similar inference time. Wavelet feature augmentation improves accuracy in most configurations. Exceptions include the LSTM at Mucugê-120 and Mucuri-100/120 m, in which it performed better. A Wilcoxon signed-rank test confirms that the LSTM–Transformer performance differences are statistically significant across towers.
A Powerformer model was proposed in [9], which is a temporal-based Transformer architecture for short-term wind power forecasting, integrating LSTM-based embeddings, sparse self-attention, temporal pooling, and gated residual connections to enhance temporal feature extraction and reduce complexity. The model was evaluated against Back-Propagation (BP), GRU, LSTM, and vanilla Transformer on two real-world datasets: WindPower-GP (Guangxi, April–October 2021, 15 min data, 14 features) and WindPower-XP (Xinjiang, April–September 2019, 15 min data, 10 features). Using standard metrics (MSE, MAE, RMSE, and MAPE), Powerformer consistently achieved the best performance, e.g., on WindPower-GP, reducing MAPE to 1.08% compared with 3.83% for Transformer, while similar gains were observed on WindPower-XP. These results demonstrate the superiority of tailored Transformer designs over both recurrent and vanilla Transformer baselines for wind power prediction.
An SL-Transformer is introduced in [10], which is a hybrid deep learning framework designed for time series forecasting of both wind and solar power. The model combines a Transformer encoder with an LSTM decoder and integrates an attention mechanism, while also incorporating Savitzky–Golay (SG) filtering for noise reduction in wind speed data and a Local Outlier Factor (LOF) filter for anomaly removal in solar data, ensuring higher quality inputs. The datasets comprised one year of wind farm data (5 min interval, including wind speed and power generation) and four months of photovoltaic plant data (hourly solar generation records). The proposed SL-Transformer was benchmarked against ARIMA, Support Vector Machine (SVM), LSTM, Neural Basis Expansion Analysis for Time Series (NBEATS), and vanilla Transformer, with evaluation based on MSE, MAE, RMSE, R2, and Symmetric Mean Absolute Percentage Error (SMAPE). The results show that the SL-Transformer significantly outperformed all baselines: for wind power, it achieved R2 = 0.9989 and SMAPE = 5.85%, compared to ~0.97 R2 and >16% SMAPE for the LSTM–Transformer; for solar power, it obtained R2 = 0.9674 and SMAPE = 4.22%, representing ~15% accuracy improvement over competing models. These outcomes highlight the strong capability of the SL-Transformer to capture complex temporal dependencies in renewable energy forecasting, delivering high accuracy at the cost of slightly higher computation time.
The study in [11] proposed a CNN–Transformer hybrid model for short-term photovoltaic (PV) power forecasting, aiming to leverage CNN’s strength in local feature extraction and the Transformer’s capability in long-range temporal dependency modeling. The approach integrates 1D-CNN layers for spatial–temporal feature encoding, followed by a multi-head self-attention mechanism to enhance temporal correlations in PV output data. The study used a real-world PV dataset from a power station in China, consisting of meteorological parameters (solar irradiance, temperature, humidity, wind speed, etc.) and historical PV generation records, sampled at 15 min intervals. The proposed CNN–Transformer was compared against traditional methods (ARIMA, SVR) and deep learning models (CNN, LSTM, GRU, and vanilla Transformer). Performance was evaluated using RMSE, MAE, and R2. The results show that the CNN–Transformer consistently outperformed all baselines, e.g., reducing RMSE by more than 20% compared to LSTM/GRU and achieving an R2 above 0.98, indicating high accuracy and robustness under varying weather conditions. The findings demonstrate that coupling CNN feature extraction with Transformer temporal modeling is highly effective for PV forecasting.
An hour-ahead PV power forecasting framework using Recurrent Neural Networks with Long Short-Term Memory (RNN-LSTM) was applied to three different PV plants installed at the University of Malaya, consisting of polycrystalline, monocrystalline, and thin-film modules as described in [12]. The dataset spans four years (2016–2019), with five-minute resolution measurements of solar irradiance, wind speed, ambient temperature, and module temperature. After preprocessing and standardization, 70% of each year’s data was used for training and 30% for testing. The proposed RNN-LSTM method was benchmarked against regression approaches (Gaussian Process Regression (GPR) and Support Vector Regression (SVR), both with and without PCA), artificial neural networks (ANNs), and hybrid adaptive neuro-fuzzy inference systems (ANFISs) using grid partitioning, subtractive clustering, and fuzzy C-means. Multiple LSTM structures (single, double, and bidirectional) were also tested. The results showed that the single-layered RNN-LSTM consistently delivered the lowest RMSE and MAE across all three PV technologies, outperforming conventional machine learning and hybrid baselines, and demonstrating robustness to different PV system types.
The work described in [13] investigates the application of GRU for one-hour-ahead solar irradiance forecasting, with a focus on multivariate inputs. The dataset consists of eleven years (2004–2014) of hourly Global Horizontal Irradiance (GHI) and weather data from Phoenix International Airport, sourced from the U.S. National Solar Radiation Database (NREL NSRDB), along with cloud cover information from NOAA’s ISCCP satellite products. The authors designed and compared univariate and multivariate GRU and LSTM models, with exogenous features including solar zenith angle, relative humidity, air temperature, and cloud cover. The models were trained using 48 h rolling windows, with univariate cases relying solely on past irradiance, while multivariate configurations incorporated the additional variables. The results demonstrated that multivariate models significantly outperformed univariate ones, with the inclusion of cloud cover yielding the greatest improvements in forecasting accuracy. A comparative table that situates this work within the broader context of state-of-the-art approaches is presented in Table 1. It highlights the forecasting task, input features, and methods used.
Building on the literature on Transformer-based architectures for renewable energy forecasting, this work aims to address two gaps: (i) the absence of a systematic benchmark of Transformer variants for hour-ahead solar PV forecasting and (ii) the lack of integration of adaptive uncertainty quantification techniques with these models to enhance their practical reliability. While prior studies have introduced specialized Transformer designs for wind or hybrid renewable prediction tasks, the potential of architectures such as Autoformer, Informer, FEDformer, PatchTST, and DLinear for solar forecasting remains underexplored from both deterministic and probabilistic perspectives. Motivated by this, the study makes the following contributions:
  • Benchmarking of five advanced Transformer-based architectures (Autoformer, Informer, Fedformer, PatchTST, and DLinear) for hour-ahead solar PV forecasting, providing a systematic performance comparison under real-world conditions.
  • First application of PatchTST to solar PV forecasting, demonstrating its capability in capturing temporal patterns and outperforming competing Transformer-based models.
  • Novel integration of Adaptive Conformal Inference (ACI) with PatchTST, enabling non-parametric probabilistic forecasting; to the best of the author’s knowledge, this represents the first attempt in the literature to combine PatchTST with ACI for uncertainty quantification in solar forecasting.

2. Forecasting Model Architectures

To investigate the suitability of Transformer variants for hour-ahead solar PV forecasting, we consider a set of recent architectures that have demonstrated strong performance in time series prediction tasks. Specifically, we evaluate Autoformer, Informer, Fedformer, PatchTST, and DLinear, each of which introduces distinct mechanisms to enhance sequence modelling efficiency, capture long-range dependencies, or improve forecasting accuracy. The following subsections provide a brief overview of these models, highlighting their design principles.

2.1. Autoformer Model

The Autoformer, introduced in [14], is a Transformer-based architecture tailored to the requirements of long-term time series forecasting. Traditional Transformers rely heavily on dot-product self-attention to capture dependencies among time steps. While effective for natural language processing, this mechanism often proves inefficient for time series data, where long-range periodic patterns and trend–seasonal interactions dominate the signal. Autoformer addresses these limitations by introducing two key innovations: (i) a series Decomposition Block, which explicitly separates trend and seasonal components to improve representation learning and reduce redundancy; and (ii) an auto-correlation mechanism, which replaces conventional attention with a correlation-based operator that identifies periodic temporal dependencies directly. Together, these innovations allow Autoformer to model both multi-scale seasonality and long-term dependencies, making it effective for applications such as renewable energy forecasting, where diurnal and weather-driven cycles play a central role [15].
Series Decomposition
Autoformer integrates decomposition within each encoder and decoder block to disentangle slowly varying and cyclic components of the signal. For an input time series, the decomposition is defined as follows:
$$ x_t = T_t + S_t \tag{1} $$
where $T_t$ is the trend component and $S_t$ is the seasonal component, described by (2) and (3). The trend is estimated using a moving average operator $M$:
$$ T_t = M(x_t) \tag{2} $$
$$ S_t = x_t - T_t \tag{3} $$
By iteratively applying this operator within stacked layers, Autoformer ensures progressive refinement of temporal features, enhancing robustness against noise and short-term irregularities.
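As an illustration of Equations (1)–(3), the following minimal PyTorch sketch (not the authors' implementation) applies a replicate-padded moving average to extract the trend and subtracts it to obtain the seasonal part; the kernel size is an assumed value.

```python
# Minimal sketch of moving-average series decomposition (Eqs. (1)-(3));
# the kernel size is an assumption, not a value reported in this work.
import torch
import torch.nn.functional as F

def series_decomposition(x: torch.Tensor, kernel_size: int = 25):
    """x: (batch, length, channels) -> (trend T_t, seasonal S_t) of the same shape."""
    pad = (kernel_size - 1) // 2
    front = x[:, :1, :].repeat(1, pad, 1)                    # replicate-pad both ends so the
    back = x[:, -1:, :].repeat(1, kernel_size - 1 - pad, 1)  # average keeps the original length
    padded = torch.cat([front, x, back], dim=1)
    trend = F.avg_pool1d(padded.transpose(1, 2), kernel_size, stride=1).transpose(1, 2)
    seasonal = x - trend                                     # S_t = x_t - T_t
    return trend, seasonal
```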
Auto-Correlation Mechanism
Instead of using standard dot-product attention, Autoformer introduces an auto-correlation mechanism that focuses on identifying time-delay dependencies. Given an input sequence $X \in \mathbb{R}^{L \times d}$ with embedding dimension $d$, queries, keys, and values are constructed as $Q, K, V \in \mathbb{R}^{L \times d}$. The auto-correlation is then computed as follows:
$$ \mathrm{AC}(X) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \tag{4} $$
Unlike traditional attention, Autoformer enhances this operator by applying Fourier transforms to efficiently locate dominant periodic lags, allowing the model to emphasize recurring patterns while reducing computational overhead [16]. This is crucial for solar PV generation, where irradiance and output frequently exhibit repeated cycles (e.g., diurnal periodicity).
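A compact sketch of how dominant periodic lags can be located in the frequency domain is given below; it uses the Wiener–Khinchin relation (autocorrelation as the inverse FFT of the power spectrum) and is only an illustration of the idea, with `top_k` as an assumed parameter rather than part of the original Autoformer code.

```python
# Illustrative sketch: locating dominant periodic lags via the FFT
# (autocorrelation = IFFT of the power spectrum); not the original Autoformer code.
import torch

def dominant_lags(x: torch.Tensor, top_k: int = 3) -> torch.Tensor:
    """x: (batch, length) signal; returns indices of the top-k candidate lags."""
    spec = torch.fft.rfft(x, dim=-1)
    power = spec * torch.conj(spec)
    autocorr = torch.fft.irfft(power, n=x.shape[-1], dim=-1)  # lag-domain correlation
    mean_corr = autocorr.mean(dim=0)                          # aggregate over the batch
    mean_corr[0] = float("-inf")                              # exclude the trivial zero lag
    return torch.topk(mean_corr, top_k).indices
```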
Encoder–Decoder Framework
The encoder and decoder of Autoformer are composed of stacked decomposition and auto-correlation blocks. In the encoder, trend and seasonal components are separated at each layer and then passed to subsequent blocks, ensuring multi-scale temporal representation. In the decoder, trend forecasting is performed recursively. Given the decomposed seasonal sequence $S_t$, the predicted trend at horizon $h$ is updated as follows:
$$ \hat{T}_{t+h} = \hat{T}_{t+h-1} + f_{\theta}(S_t) \tag{5} $$
where $f_{\theta}$ denotes the learned seasonal extrapolation function. This recursive structure allows Autoformer to progressively extend forecasts while maintaining temporal consistency. A flow chart describing a simplified Autoformer architecture for time series forecasting is presented in Figure 2.

2.2. Informer Model

Informer [17] is a Transformer-based model designed to overcome the computational and scalability bottlenecks of long-sequence time series forecasting. Standard self-attention in the original Transformer requires computing similarity scores across all pairs of input tokens, leading to both time and memory complexity of $O(L^2)$, where $L$ is the input length. For large-scale forecasting tasks with long historical windows, such as those encountered in renewable energy forecasting, this quadratic cost becomes prohibitive. Informer addresses this limitation through three principal innovations: ProbSparse self-attention, distilling operation, and a generative-style decoder, each of which contributes to improved efficiency while preserving forecasting accuracy.
ProbSparse Self-Attention
In traditional multi-head self-attention, the output is computed as follows:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \tag{6} $$
where $Q, K, V \in \mathbb{R}^{L \times d}$ denote the query, key, and value matrices, and $d$ is the feature dimension. This formulation requires evaluating attention scores for every query–key pair. Informer introduces the ProbSparse self-attention mechanism, which is motivated by the observation that only a small number of queries dominate the attention distribution. By selecting the top-$u$ queries with the highest sparsity measurement (derived from the Kullback–Leibler divergence of the attention distribution), Informer reduces the number of queries that require full attention computation [18]. This results in the following computational complexity:
$$ O(L \log L) \tag{7} $$
which is substantially more efficient than the quadratic cost of conventional attention, enabling the model to handle much longer input sequences.
Distilling Operation
A second contribution is the distilling operation, which reduces redundancy and enforces hierarchical feature extraction across layers. After each encoder layer, the input sequence is compressed through a convolutional filter followed by pooling, effectively halving the sequence length. Formally, given the hidden state $X^{l} \in \mathbb{R}^{L \times d}$ at layer $l$, the distilled representation is computed as follows:
$$ X^{l+1} = \mathrm{MaxPool}\left(\sigma\left(\mathrm{Conv1D}(X^{l})\right)\right) \tag{8} $$
where σ is a nonlinear activation function. This operation forces the encoder to focus on salient temporal dependencies while progressively reducing the sequence length across layers, thereby achieving efficient representation learning with reduced memory consumption.
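The distilling step of Equation (8) can be sketched as follows; the kernel sizes and ELU activation follow common Informer implementations but are assumptions here, not values reported in this paper.

```python
# Hedged sketch of an Informer-style distilling layer (Eq. (8)):
# Conv1D -> activation -> max pooling roughly halves the sequence length.
import torch
import torch.nn as nn

class DistillLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) -> (batch, ~length/2, d_model)
        x = x.transpose(1, 2)                 # Conv1d expects (batch, channels, length)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)
```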
Generative Decoder
Finally, Informer departs from the conventional autoregressive decoding paradigm and employs a generative-style decoder that produces the entire forecast horizon in parallel. Instead of predicting one time step at a time, the decoder takes as input the compressed latent representation $Z$ from the encoder, together with any known covariates $C$, and directly generates the $H$-step forecast:
$$ \hat{Y}_{t+1:t+H} = f_{\theta}(Z, C) \tag{9} $$
where $f_{\theta}$ represents the learned decoder mapping. By avoiding recursive step-by-step prediction, this design eliminates error accumulation that typically degrades the performance of autoregressive decoders in long-horizon forecasting. Moreover, the parallel generation strategy significantly accelerates inference, making Informer attractive for real-time applications.
The architecture of the Informer model is shown in Figure 3, showcasing its encoder–decoder structure. The encoder processes the input time series through embedding, ProbSparse self-attention, convolutional distillation, and feed-forward layers to produce compressed representations. The decoder utilizes masked self-attention and cross-attention mechanisms with the encoder output to generate future sequence predictions [19]. A final projection layer maps the decoder output to the forecasted values. This architecture enables efficient long-sequence time series forecasting with reduced computational complexity.

2.3. FEDformer Model

The Frequency Enhanced Decomposed Transformer (FEDformer) [20] represents an important advancement in long-sequence time series forecasting by extending the Transformer architecture with both seasonal–trend decomposition and frequency-domain modeling. Unlike earlier models such as Informer, which focused on sparse attention, or Autoformer, which emphasized auto-correlation in the time domain, FEDformer leverages the insight that periodicities in time series are more naturally expressed in the frequency domain. To this end, the architecture introduces two essential innovations: a Frequency Enhanced Block, which operates directly on spectral representations, and a Decomposition Block, which splits the input signal into trend and seasonal components for specialized processing. This design allows the model to capture global temporal dependencies with reduced complexity, while ensuring interpretability through decomposition.
Seasonal–Trend Decomposition
Similar to Autoformer, FEDformer applies decomposition at both the encoder and decoder stages to disentangle long-term and short-term variations. Formally, the input time series is expressed as follows:
$$ x_t = T_t + S_t \tag{10} $$
where $T_t$ is the trend component and $S_t$ is the seasonal component, described by (2) and (3). The trend is estimated using a moving average operator $M$, as described in the Autoformer section. This decomposition enables FEDformer to process seasonal fluctuations in the spectral domain while forecasting trends with simpler linear projections.
Frequency Enhanced Block
The central innovation of FEDformer lies in its Frequency Enhanced Block, which applies self-attention not in the time domain but in the frequency domain [21]. Given an embedded input sequence $X \in \mathbb{R}^{L \times d}$, its frequency representation is computed according to (11):
$$ \mathcal{F}(X) = \mathrm{FFT}(X) \tag{11} $$
where $\mathcal{F}$ denotes the Fast Fourier Transform (FFT). In practice, FEDformer does not retain all spectral components but selects only a subset of dominant frequencies $\Omega$ that contribute most to the variance of the signal, according to (12):
$$ \mathcal{F}_{\Omega}(X) = \left\{ \mathcal{F}_k(X) \mid k \in \Omega \right\} \tag{12} $$
The resulting compressed representation captures the essential periodicities while discarding redundant or noisy frequencies. After processing in the frequency domain, the signal is reconstructed back into the time domain through the inverse Fourier transform:
$$ \tilde{X} = \mathrm{IFFT}\left(\mathcal{F}_{\Omega}(X)\right) \tag{13} $$
This operation allows attention layers to focus on meaningful periodic components rather than the entire time sequence, thereby improving efficiency and generalization.
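A simplified sketch of this FFT–select–IFFT pathway (Equations (11)–(13)) is shown below; retaining the 16 highest-energy frequency bins is an assumption made only for illustration and is not the mode-selection rule of the original FEDformer.

```python
# Simplified sketch of a FEDformer-style frequency path (Eqs. (11)-(13)):
# FFT, keep only the dominant frequencies, inverse FFT back to the time domain.
import torch

def frequency_filter(x: torch.Tensor, keep: int = 16) -> torch.Tensor:
    """x: (batch, length, channels) -> reconstruction from dominant frequencies."""
    spec = torch.fft.rfft(x, dim=1)                      # frequency representation F(X)
    energy = spec.abs().mean(dim=(0, 2))                 # average energy per frequency bin
    top = torch.topk(energy, min(keep, energy.shape[0])).indices
    mask = torch.zeros_like(energy)
    mask[top] = 1.0                                      # keep only the selected set Omega
    spec = spec * mask.view(1, -1, 1)
    return torch.fft.irfft(spec, n=x.shape[1], dim=1)    # back to the time domain
```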
Segment-Wise Attention
FEDformer further reduces computational cost through segment-wise attention, where the input sequence is divided into non-overlapping patches. Instead of computing attention across the full sequence of length $L$, attention is applied over compressed segments of length $P \ll L$. For a query $Q$, key $K$, and value $V$ belonging to a patch, the attention output is calculated according to (6). By restricting this computation to patches, the complexity of self-attention is reduced from $O(L^2)$ to approximately $O(L \log L)$, while still retaining global receptive fields via the frequency-based projection.
Decoder and Forecast Reconstruction
In the decoder, the trend component is modeled through linear extrapolation, while the seasonal component is reconstructed through frequency-enhanced attention. Let $\hat{T}_{t+h}$ denote the predicted trend and $\hat{S}_{t+h}$ the predicted seasonal component at horizon $h$. The final forecast is obtained by recombining both parts according to (14):
$$ \hat{x}_{t+h} = \hat{T}_{t+h} + \hat{S}_{t+h} \tag{14} $$
This dual-pathway design ensures that long-term smooth variations and high-frequency periodic patterns are modeled separately but integrated at the output stage. The hybrid design of FEDformer leads to several important benefits. By leveraging the frequency domain, the model captures periodicities such as daily and weekly cycles with high fidelity. The decomposition into trend and seasonal components provides interpretability and reduces the burden on attention layers. Finally, the use of segment-wise sparse attention significantly improves computational efficiency, enabling FEDformer to scale to very long historical sequences without prohibitive memory or runtime costs. These features make FEDformer one of the most advanced architectures for long-sequence forecasting, offering a robust balance between accuracy, interpretability, and efficiency [22].
Figure 4 illustrates the architecture of the FEDformer model. The input time series is first decomposed into trend and seasonal components. The trend path is modeled using a linear projection to capture long-term variations, while the seasonal path is processed through a Frequency Enhanced Block, which applies FFT, selects dominant frequencies, and reconstructs the signal via Inverse Fast Fourier Transform (IFFT). The enhanced seasonal representation is further refined using segment-wise attention to capture localized dependencies. Finally, the outputs of the trend and seasonal forecasts are recombined in the reconstruction layer to produce the final forecast output.

2.4. PatchTST Model

The Patch Time Series Transformer (PatchTST) [23] is a recent Transformer-based architecture that adapts the principles of vision transformers (ViTs) to time series forecasting by operating on subsequence patches rather than individual time points. Unlike conventional Transformer models that process every timestamp as a token, PatchTST divides the input series into overlapping or non-overlapping patches, which serve as tokens for the encoder. This patching strategy reduces sequence length, enhances local contextual representation, and allows the model to scale more efficiently to long time horizons.
Patch Tokenization
Given an input sequence $X \in \mathbb{R}^{L \times d}$ of length $L$ with $d$ variables, PatchTST partitions the series into patches of length $P$. Each patch is flattened and projected into an embedding space through a linear mapping:
$$ z_i = W_p \, \mathrm{vec}\left(X_{i:i+P-1}\right), \quad i = 1, \ldots, N \tag{15} $$
where $W_p \in \mathbb{R}^{dP \times d_{\mathrm{model}}}$ is the learnable projection matrix and $N = L/P$ is the number of patches. The resulting patch embeddings $\{ z_i \}_{i=1}^{N}$ form the token sequence that enters the Transformer encoder.
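The patching and linear projection of Equation (15) can be sketched for a single channel as follows; the patch length, stride, and model dimension below are assumed values for illustration rather than the tuned hyper-parameters of this study.

```python
# Illustrative sketch of PatchTST-style patch tokenization (Eq. (15)) for one channel;
# patch length, stride, and d_model are assumptions chosen for clarity.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_len: int = 16, stride: int = 8, d_model: int = 128):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len, d_model)      # plays the role of W_p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length) univariate series -> (batch, num_patches, d_model)
        patches = x.unfold(-1, self.patch_len, self.stride)
        return self.proj(patches)                      # patch tokens fed to the encoder
```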
Channel-Independent Modeling
A distinctive feature of PatchTST is its channel-independent (CI) design. Instead of concatenating multivariate time series across variables, PatchTST learns separate patch embeddings for each channel. This approach prevents inter-channel interference and allows the model to scale efficiently with the number of input variables. Formally, for each channel $c \in \{1, \ldots, d\}$:
$$ z_i^{c} = W_p^{c} \, \mathrm{vec}\left(X_{i:i+P-1}^{c}\right), \quad c = 1, \ldots, d \tag{16} $$
where each channel has its own projection $W_p^{c}$.
Transformer Encoder
The tokenized patches are passed through a standard Transformer encoder composed of multi-head self-attention and feed-forward layers. The self-attention mechanism operates on patch-level embeddings according to (6). Since patches encapsulate local temporal information, the attention mechanism captures long-range dependencies across subsequences rather than individual time points [24].
Forecast Head
After the encoder, the contextualized patch embeddings are projected back to the original resolution. A regression head maps the representation to the forecast horizon $H$:
$$ \hat{Y}_{t+1:t+H} = W_o Z_{\mathrm{enc}} \tag{17} $$
where $W_o$ is the output projection and $Z_{\mathrm{enc}}$ is the encoded patch representation. This design ensures that local temporal details captured within patches are aligned with global dependencies captured by self-attention.
PatchTST offers several advantages over conventional Transformer architectures. Reducing sequence length through patching improves computational efficiency while maintaining temporal resolution. Its channel-independent design ensures scalability in multivariate settings by avoiding interference across variables, enabling robust modeling of high-dimensional inputs. Moreover, the use of patches enhances representation learning by capturing local temporal patterns that complement the long-range dependencies extracted by self-attention [25]. Its architecture is shown in Figure 5.

2.5. DLinear Model

The DLinear model is a widely used linear decomposition-based forecasting approach that has gained popularity in the time series community for its simplicity and competitive performance. The key idea is to decompose the input sequence into trend and seasonal components, which are then modelled through separate linear transformations before being recombined to form the final forecast. Due to its lightweight design and strong performance on various benchmarks, DLinear has become a standard baseline in the literature. As the model has already been extensively described and evaluated in prior works and related applications in solar forecasting, a detailed formulation is not repeated here; instead, we refer the reader to these sources for an in-depth presentation [26,27].

3. Adaptive Conformal Predictions

To extend the point forecasts from the above-mentioned models into reliable probabilistic intervals, we employ Adaptive Conformal Inference (ACI), a non-parametric framework that dynamically adjusts prediction intervals according to observed calibration errors [28]. Unlike fixed conformal methods, which assume stationarity in the data distribution, ACI adapts to changes in uncertainty over time, making it particularly well-suited for renewable energy forecasting where stochastic variability is high.
Basic Conformal Prediction
In conformal prediction, given a sequence of predictions $\hat{y}_t$ and observations $y_t$, one defines a non-conformity score for each observation based on (18):
$$ e_t = \left| y_t - \hat{y}_t \right| \tag{18} $$
Then, from a history of calibration scores $e_1, e_2, \ldots, e_T$, the $(1-\alpha)$-quantile is determined using (19):
$$ q_{1-\alpha} = \mathrm{Quantile}_{1-\alpha}\left(e_1, e_2, \ldots, e_T\right) \tag{19} $$
The corresponding $(1-\alpha)$ prediction interval for horizon $t+h$ is then defined according to (20):
$$ I_{t+h} = \left[ \hat{y}_{t+h} - q_{1-\alpha}, \; \hat{y}_{t+h} + q_{1-\alpha} \right] \tag{20} $$
This guarantees marginal coverage of approximately $1-\alpha$, provided the data are exchangeable. However, in many real-world settings, such as renewable energy forecasting, the error distribution is non-stationary, which limits the reliability of fixed conformal methods.
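A minimal sketch of this fixed conformal step (Equations (18)–(20)) is given below, using absolute residuals on a held-out calibration set; the variable names are illustrative assumptions.

```python
# Minimal sketch of fixed (non-adaptive) conformal intervals, Eqs. (18)-(20);
# array names are illustrative assumptions.
import numpy as np

def conformal_interval(y_cal, yhat_cal, yhat_new, alpha=0.1):
    """Symmetric (1 - alpha) interval around each new point forecast."""
    scores = np.abs(y_cal - yhat_cal)      # non-conformity scores e_t
    q = np.quantile(scores, 1 - alpha)     # empirical (1 - alpha)-quantile
    return yhat_new - q, yhat_new + q      # lower and upper bounds
```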
Adaptive Update Mechanism
To address this, ACI introduces an online adaptation rule for the miscoverage rate $\alpha_t$. After each forecast, a coverage indicator is computed according to (21):
$$ c_t = \mathbf{1}\{ y_t \in I_t \} \tag{21} $$
where $c_t = 1$ if the observed value lies within the interval and $c_t = 0$ otherwise. The coverage error at time $t$ is then as follows:
$$ \varepsilon_t = 1 - c_t \tag{22} $$
The adaptive rule for updating $\alpha$ is calculated by (23):
$$ \alpha_{t+1} = \alpha_t + \gamma\left(\alpha_t - \varepsilon_t\right), \quad \gamma > 0 \tag{23} $$
where $\gamma > 0$ is a learning rate. If recent intervals fail to cover the observed values frequently ($\varepsilon_t = 1$), the update reduces $\alpha_t$, widening the prediction intervals to regain nominal coverage. Conversely, if intervals are consistently too wide ($\varepsilon_t = 0$), the update increases $\alpha_t$, narrowing the intervals to improve efficiency [29].
To ensure stability, $\alpha_t$ is typically constrained within a feasible range (e.g., $0 < \alpha_t < 0.30$). The adaptive mechanism ensures that the method reacts quickly to shifts in error distributions caused by weather-driven fluctuations or seasonal changes. The advantage of this method is that it offers distribution-free guarantees without relying on parametric assumptions, while adaptively recalibrating to reflect non-stationary error distributions in real time. Its recursive update rule is computationally lightweight, requiring only quantile calculations and a simple adjustment of $\alpha_t$. These properties make ACI highly effective for renewable energy forecasting, where uncertainty evolves rapidly due to changing weather and environmental conditions. The entire process is visually shown in Figure 6.
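The online recalibration described by Equations (21)–(23) can be sketched as a simple loop; the learning rate, warm-up length, and clipping range below are assumptions for illustration, not the tuned values of this study.

```python
# Hedged sketch of the Adaptive Conformal Inference loop (Eqs. (21)-(23));
# gamma, warmup, and the clipping range are illustrative assumptions.
import numpy as np

def aci_intervals(y, yhat, alpha0=0.1, gamma=0.05, warmup=100):
    alphas, lower, upper = [alpha0], [], []
    scores = list(np.abs(y[:warmup] - yhat[:warmup]))           # initial calibration scores
    for t in range(warmup, len(y)):
        a = float(np.clip(alphas[-1], 0.01, 0.30))              # keep alpha_t in a feasible range
        q = np.quantile(scores, 1 - a)
        lo, up = yhat[t] - q, yhat[t] + q
        lower.append(lo); upper.append(up)
        eps = 0.0 if lo <= y[t] <= up else 1.0                  # coverage error (Eq. (22))
        alphas.append(alphas[-1] + gamma * (alphas[-1] - eps))  # adaptive update (Eq. (23))
        scores.append(abs(y[t] - yhat[t]))                      # extend the score history
    return np.array(lower), np.array(upper), np.array(alphas)
```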

4. Data Description and Preparation

4.1. Data Visualization

The dataset used in this study includes a continuous five-year period from January 2014 to January 2019 with hourly resolution, resulting in 44,568 observations. It was collected from rooftop solar PV panels installed at the Department of Electrical Engineering and Electrotechnology Fundamentals, Wroclaw University of Science and Technology. The system has a total installed capacity of 15.21 kWp and began operation in November 2011. It consists of three independent single-phase installations with different PV technologies: (i) a monocrystalline array comprising 27 SUNTECH STP190S-24/Ad+ modules (5.13 kWp total, 190 W per module, and 14.9% efficiency), (ii) a polycrystalline array of 21 SOLAR FUTURE ENERGY PF-6:240 modules (5.04 kWp total, 240 W per module, and 15.5% efficiency), and (iii) a thin-film CIGS array with 56 Q.CELLS Q.SMART 90 modules (5.04 kWp total, 90 W per module, and 11.8% efficiency). All modules are mounted with a 40° tilt, with the mono- and thin-film arrays oriented 135° southeast and the polycrystalline array oriented 225° southwest. For consistency, this work primarily focuses on the polycrystalline module (5 kWp), though the dataset structure allows extension to other modules. The measured variables include solar irradiation (W/m2), ambient temperature (°C), wind speed (m/s), PV module temperature (°C), and electrical output power (W), recorded at 10 min intervals; it was resampled to an hourly resolution according to the needs of this study.
This dataset has been validated and benchmarked in multiple earlier studies. For instance, [30] applied CNN, multi-headed CNN, CNN–LSTM, autoregressive moving average (ARMA), and multiple linear regression (MLR) models for short- and medium-term PV forecasting, demonstrating the trustworthiness of the measurements and the suitability of the dataset for deep learning approaches. A subsequent study [31] employed an LSTM–autoencoder forecasting model embedded within a microgrid energy management system, further establishing the dataset’s relevance.
The PV power is expressed in kilowatts, with an average generation of 0.62 kW across the full sample, and nearly half of the hours showing zero output, reflecting night-time conditions. When splitting the data chronologically, the training set (January 2014–January 2018, 35,785 rows) has a mean power of 0.61 kW, while the test set (February 2018–January 2019, 8784 rows) records a slightly higher mean of 0.68 kW, indicating an approximate 12% increase. Irradiation exhibits a similar upward shift of about 4.5% in the test period. Strong correlations are observed between irradiation and power (r = 0.97 overall, r = 0.99 in training, and r = 0.92 in testing), underscoring the dominant role of solar input in driving PV generation. Figure 7 presents the hourly PV power output over the study horizon, where the training period (January 2014–January 2018) is shown in blue and the unseen test period (February 2018–January 2019) in orange. The visualization highlights both the strong seasonal oscillations and the day–night cycle inherent to solar generation. The clear distinction between training and test intervals ensures that model development is based solely on historical observations, while the final twelve months provide an independent benchmark for evaluating forecasting performance.

4.2. Normalization of Features

The input variables used in this study differ in terms of units, scale, and statistical distribution. Without adjustment, such discrepancies could make the forecasting models overly sensitive to certain inputs and lead to unstable weight updates during training [30]. To address this, all features were rescaled to a uniform range using normalization. This step ensures numerical consistency across inputs, improving both model stability and predictive performance. In particular, min–max normalization was applied, as shown in (24), where $x_i$ denotes an individual data point, $\min(x)$ is the minimum value observed for the concerned feature, and $\max(x)$ is the maximum value observed:
$$ x' = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{24} $$
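A small sketch of Equation (24) is shown below; fitting the scaling statistics on the training split only is a standard precaution assumed here to avoid information leakage, and the guard against constant features is likewise an assumption.

```python
# Tiny sketch of min-max scaling (Eq. (24)); statistics are taken from the training split.
import numpy as np

def min_max_scale(train: np.ndarray, test: np.ndarray):
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)    # guard against constant features
    return (train - lo) / scale, (test - lo) / scale
```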

4.3. Cyclical Encoding of Time Indices

In addition to historical power output, time indices were incorporated as input features to account for the natural periodicity of solar generation. These indices include the hour of the day, day of the week, and month of the year, all of which strongly influence photovoltaic output due to daily solar cycles and seasonal variation [32]. However, representing such variables as simple integers (e.g., 0–23 for hours, 1–12 for months) introduces artificial discontinuities; for example, hour 23 and hour 0 would appear maximally distant, despite being consecutive in time. To overcome this limitation, the indices were transformed into a cyclical representation using sine and cosine functions. For each time index Y with a maximum value M , the encoding is defined as follows:
$$ Y_{\sin} = \sin\!\left(\frac{2\pi Y}{M}\right), \qquad Y_{\cos} = \cos\!\left(\frac{2\pi Y}{M}\right) \tag{25} $$
This mapping projects the values onto a unit circle, ensuring that the beginning and end of the cycle are smoothly connected. In practice, this encoding allows the learning algorithm to recognize recurring temporal patterns such as the midday peak and nighttime minimum in daily solar production, or the higher summer yields relative to winter, without being misled by discontinuities in the raw index values. By embedding the cyclical structure directly into the feature space, the models are better equipped to capture the intrinsic periodic behavior of solar power generation.
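The following short sketch applies Equation (25) to the hour, weekday, and month indices; the DataFrame column names are illustrative assumptions.

```python
# Sketch of the sine/cosine encoding in Eq. (25); column names are assumptions.
import numpy as np
import pandas as pd

def add_cyclical_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col, period in [("hour", 24), ("dayofweek", 7), ("month", 12)]:
        out[f"{col}_sin"] = np.sin(2 * np.pi * out[col] / period)
        out[f"{col}_cos"] = np.cos(2 * np.pi * out[col] / period)
    return out

# Example: derive the raw indices from a datetime index before encoding, e.g.
# df["hour"], df["dayofweek"], df["month"] = df.index.hour, df.index.dayofweek, df.index.month
```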

4.4. Rolling Window for Transformers

Time series forecasting with deep learning models requires a systematic way of constructing training samples from a continuous sequence. A widely adopted approach is the rolling window method, which divides the historical series into overlapping input–output pairs. This ensures that each model observes past segments of fixed length while learning to predict future values within a specified horizon [32]. The step-by-step implementation of this window is described in Algorithm 1.
Algorithm 1 Rolling Window
1: Let Y denote the complete time series of length T. Choose a window size C, representing the number of past observations used as input.
2: Initialize counters j = 0 for the current position in the series and m = 0 for the number of constructed windows. Create an empty set W to store all generated windows.
3: While j + C ≤ T:
    Extract the subsequence Y[j : j + C − 1] of length C and append it to W[m].
    Shift the starting index j forward by one step (i.e., j = j + 1) to allow overlapping windows.
    Increment the window counter (m = m + 1).
4: Continue until the entire time series has been traversed.
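A direct sketch of Algorithm 1 for the single-step setting is shown below; it pairs each window of length C with the value one step after it, matching the hour-ahead mapping described next.

```python
# Sketch of Algorithm 1: overlapping input windows of length C with hour-ahead targets.
import numpy as np

def rolling_windows(y: np.ndarray, window: int = 48):
    """Return (num_samples, window) inputs and the matching one-step-ahead targets."""
    X, t = [], []
    for j in range(len(y) - window):       # stop once the target index leaves the series
        X.append(y[j:j + window])          # the input segment Y[j : j + C - 1]
        t.append(y[j + window])            # the value one step after the window
    return np.stack(X), np.array(t)
```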
In this work, all models employ the rolling window framework in a consistent manner, with each input segment mapped to a single-step (hour-ahead) forecast. Autoformer, Informer, and FedFormer apply the method directly by treating each window of past observations as the basis for predicting the next step. DLinear likewise follows the same principle but processes the extracted sequence through a linear projection to obtain the forecast. PatchTST also relies on the identical rolling window setup; however, it introduces an additional internal step by dividing the input window into smaller patches before embedding them in the Transformer encoder. This modification alters the internal representation of the window but does not change the construction of the rolling samples themselves.

4.5. Hyper-Parameter Selection

The cyclical encoding of temporal features, as shown in the previous section, allows the models to represent recurring daily and yearly patterns, thereby linking the feature space to model parameters such as sequence length and embedding dimension.
The hyper-parameters for each of the five Transformer variants are summarized in Table 2. All models were retrained with a common look-back window of 48 h, ensuring comparability while also capturing at least two full diurnal cycles. This is an important consideration given the seasonal dependence of solar generation. Batch sizes and dropout ratios were selected to balance computational efficiency with regularization, limiting overfitting under variable weather conditions. Model dimension and number of heads were chosen based on the prior literature and preliminary tuning, providing sufficient model capacity to learn interactions among meteorological and seasonal features without too much complexity.

5. Evaluation Metrics

5.1. Point Forecast Metrics

The relative performance of the models investigated in this study is assessed using standard error-based metrics widely employed in the solar forecasting literature. Following recommendations from highly cited studies and the International Energy Agency (IEA) guidelines on forecast evaluation, three complementary measures are considered: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Bias Error (MBE) [33].
MAE is the simplest of the three, providing the average magnitude of forecast errors irrespective of direction. It is computed as the mean of the absolute deviations between the predicted and observed values, thereby offering an intuitive measure of overall accuracy. RMSE, while similar in form, squares the deviations before averaging and then applies a square root. This formulation penalizes large deviations more strongly, making RMSE particularly sensitive to infrequent but significant forecast errors. MBE, in contrast, is designed to evaluate systematic tendencies in the predictions. Unlike MAE and RMSE, it retains the sign of the error, allowing detection of persistent overestimation or underestimation in the forecasts. Formally, for a time series of N observations, the error at time step i is defined as follows:
$$ e_i = y_i^{\mathrm{forecast}} - y_i^{\mathrm{actual}} \tag{26} $$
from which MAE, RMSE, and MBE are derived according to Equations (27)–(29).
$$ \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| e_i \right| \tag{27} $$
$$ \mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^{2}} \tag{28} $$
$$ \mathrm{MBE} = \frac{1}{N}\sum_{i=1}^{N} e_i \tag{29} $$
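For completeness, a compact sketch computing Equations (26)–(29) is given below.

```python
# Compact sketch of the deterministic metrics in Eqs. (26)-(29).
import numpy as np

def point_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    e = y_pred - y_true                # signed error, Eq. (26)
    mae = np.mean(np.abs(e))           # Mean Absolute Error
    rmse = np.sqrt(np.mean(e ** 2))    # Root Mean Squared Error
    mbe = np.mean(e)                   # Mean Bias Error (sign retained)
    return mae, rmse, mbe
```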

5.2. Interval Forecast Metrics

The quality of interval forecasts is evaluated using complementary criteria that capture both reliability and sharpness. The first measure is validity, which refers to the proportion of observed values that fall within the forecasted intervals [29]. Validity ensures that the constructed intervals achieve the intended coverage probability in finite samples, without assuming any particular distribution of the underlying data. A well-calibrated model should achieve empirical coverage close to the nominal level, indicating that the intervals are neither systematically too narrow nor too wide.
The second measure is the mean interval width (MIW), which quantifies the average width of the prediction intervals [29]. The MIW provides insight into the informativeness of the forecasts: narrower intervals imply greater precision and more confident predictions, whereas wider intervals reflect increased uncertainty. For a sample of N forecasts, MIW is computed as follows:
$$ \mathrm{MIW} = \frac{1}{N}\sum_{i=1}^{N} w_i \tag{30} $$
where the interval width for each forecast is as follows:
$$ w_i = y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}} \tag{31} $$
While coverage and interval width provide valuable information, they do not jointly assess the overall quality of the predictive distribution. The Continuous Ranked Probability Score (CRPS) offers a more comprehensive evaluation by simultaneously accounting for both calibration and sharpness. The CRPS measures the discrepancy between the predictive cumulative distribution and the observed outcome, effectively generalizing the mean absolute error to the probabilistic setting [34].
Based on the quantile–interval approximation in this study, the CRPS for an observation $y_i$ with forecast interval $[y_i^{\mathrm{lower}}, y_i^{\mathrm{upper}}]$ is defined as follows:
$$ \mathrm{CRPS}(y_i) = \begin{cases} \left(y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}}\right) + \dfrac{\left(y_i^{\mathrm{lower}} - y_i\right)^2}{y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}}}, & \text{if } y_i < y_i^{\mathrm{lower}} \\[2ex] \left(y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}}\right) + \dfrac{\left(y_i - y_i^{\mathrm{upper}}\right)^2}{y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}}}, & \text{if } y_i > y_i^{\mathrm{upper}} \\[2ex] \left(y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}}\right)/2, & \text{if } y_i^{\mathrm{lower}} \le y_i \le y_i^{\mathrm{upper}} \end{cases} \tag{32} $$
Lower CRPS values indicate better forecast quality, as they correspond to intervals that are both sharp and well-aligned with the observed outcomes.
Another widely used measure for evaluating prediction intervals is the Winkler score, which incorporates both interval width and the degree of miscoverage. For a central prediction interval at confidence level 1     α , the Winkler score penalizes not only wide intervals but also observations falling outside the predicted bounds [35]. This makes it a more discriminative measure than validity or MIW alone. It is defined as follows:
$$ \mathrm{Winkler}(y_i) = \begin{cases} \left(y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}}\right) + \dfrac{2}{\alpha}\left(y_i^{\mathrm{lower}} - y_i\right), & \text{if } y_i < y_i^{\mathrm{lower}} \\[2ex] \left(y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}}\right) + \dfrac{2}{\alpha}\left(y_i - y_i^{\mathrm{upper}}\right), & \text{if } y_i > y_i^{\mathrm{upper}} \\[2ex] y_i^{\mathrm{upper}} - y_i^{\mathrm{lower}}, & \text{if } y_i^{\mathrm{lower}} \le y_i \le y_i^{\mathrm{upper}} \end{cases} \tag{33} $$
Here, $\alpha$ denotes the significance level (e.g., $\alpha = 0.1$ for a 90% interval). A smaller Winkler score indicates narrower, well-calibrated intervals, while large penalties are imposed when the observed value lies outside the predicted bounds.
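The interval metrics can be computed jointly as sketched below; the CRPS branch follows the interval-based approximation of Equation (32) as reconstructed above, so the snippet should be read as illustrative rather than definitive.

```python
# Sketch of coverage, mean interval width (Eq. (30)), the interval-based CRPS
# approximation (Eq. (32)), and the Winkler score (Eq. (33)); illustrative only.
import numpy as np

def interval_metrics(y, lower, upper, alpha=0.1):
    width = np.maximum(upper - lower, 1e-9)                       # assume positive widths
    coverage = np.mean((y >= lower) & (y <= upper))               # empirical validity
    miw = np.mean(width)                                          # mean interval width
    below, above = y < lower, y > upper
    crps = np.where(below, width + (lower - y) ** 2 / width,
           np.where(above, width + (y - upper) ** 2 / width, width / 2)).mean()
    winkler = np.where(below, width + (2.0 / alpha) * (lower - y),
              np.where(above, width + (2.0 / alpha) * (y - upper), width)).mean()
    return coverage, miw, crps, winkler
```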
To ensure reproducibility, the full dataset, preprocessing pipeline, and model configurations were uploaded to a GitHub repository available at the following link: https://github.com/vsidaarth/Transformer-Variants-for-HourvAhead-PV-Forecasting.git (accessed on 15 September 2025).

6. Results

6.1. Point Forecast Performance

The comparative evaluation of the five forecasting models, along with a persistence baseline model, was carried out on an unseen 12-month test set consisting of hourly solar power values, ensuring that the results reflect true generalization performance rather than training fit. All models were trained with a rolling input window of 48 h (two days) and a one-hour forecast horizon, using 10 training epochs under identical conditions.
As shown in Table 3, PatchTST delivered the most accurate forecasts, achieving MAE = 0.194 kW (≈3.9% of peak) and RMSE = 0.381 kW (≈7.6%), with a small underestimation bias (MBE = −0.089 kW, −1.8%). Informer and DLinear followed closely, with MAE ≈0.22 kW (≈4.4–4.5%) and RMSE in the range of 0.45–0.46 kW (≈9.0%), each showing minimal bias (MBE = +0.055 kW, +1.1% for Informer; MBE = −0.028 kW, −0.6% for DLinear). Autoformer produced slightly higher errors (MAE = 0.237 kW, 4.7%; RMSE = 0.456 kW, 9.1%) and a mild overestimation bias (MBE = +0.051 kW, +1.0%). FEDformer was the weakest among the tested models, yielding MAE = 0.248 kW (≈5.0%) and RMSE = 0.481 kW (≈9.6%), with a stronger overprediction tendency (MBE = +0.108 kW, +2.2%). The persistence baseline performed worse than PatchTST, Informer, and DLinear, with MAE = 0.244 kW (≈4.9%) and RMSE = 0.474 kW (≈9.5%), underscoring the added value of Transformer-based architectures. Overall, these results confirm that PatchTST’s patch-based temporal encoding yields the most accurate short-term PV forecasts, while Informer, DLinear, and Autoformer deliver competitive results, and FEDformer lags behind under the same test conditions.
Figure 8 presents a qualitative comparison of one-hour-ahead point forecasts from all evaluated models (Autoformer, Informer, FEDformer, DLinear, and PatchTST), benchmarked against the persistence baseline and the measured PV power output on four representative test days randomly selected across the 2018 seasons. These examples complement the statistical results in Table 3, providing further insight into how the models behave under different seasonal conditions.
On February 5 (winter), when solar output was relatively low with a peak around 3 kW, most models captured the general shape of the curve. PatchTST followed the measured trajectory closely during the ramp-up, while FEDformer and Autoformer tended to overshoot the midday peak. During the ramp-down, PatchTST underpredicted, while the persistence model overpredicted, and the other models captured it well. Persistence underpredicted during the rising edges, highlighting its lagging behavior.
On June 15 (summer), the actual peak exceeded 3.5 kW under high irradiance. Due to the variable nature of this day, all models had difficulties predicting the two midday peaks and the one midday dip that were formed. FEDformer and Informer notably overpredicted the midday peaks and the dip. While Autoformer, PatchTST, and FEDformer predicted the first peak fairly accurately, they could not capture the midday dip and subsequently underpredicted the second midday peak. The persistence model, due to its lagged behavior, underpredicted the ramp-up and overpredicted the ramp-down periods.
On September 17 (autumn), the measured curve exhibited a smooth bell-shaped peak above 3.0 kW. PatchTST aligned well with the ramp-up and decay. Informer, Autoformer, and FEDformer again showed an overestimation tendency, while DLinear provided forecasts closer to the measured values. The persistence model, due to its nature, underpredicted the ramp-up and overpredicted the ramp-down periods.
Finally, on December 20 (winter, low irradiance), solar generation peaked below 0.8 kW. This regime proved most challenging, with FEDformer, Autoformer, and Informer considerably overpredicting the peak, while DLinear and persistence provided lower estimates. PatchTST remained the closest to the actual curve, capturing both timing and order of magnitude despite underestimation of the peaks.
Figure 9 illustrates the point forecast performance of all five Transformer-based models and the persistence baseline on two highly variable PV generation days (8 April 2018 and 23 June 2018). Unlike the mostly clear-sky examples shown previously, these days are characterized by strong fluctuations and irregular peaks due to intermittent cloud cover. Such conditions are particularly challenging for forecasting models, as they require capturing rapid power ramps and avoiding systematic bias.
On 8 April 2018, all models were able to follow the overall diurnal trend but tended to overpredict the midday peak, with FEDformer and Autoformer showing the largest deviations. PatchTST and Informer maintained closer alignment with the actual trajectory, especially during the ramp-up and ramp-down periods. The persistence model lagged in capturing the sudden fluctuations, leading to misalignment at peak hours and during the ramp-up and ramp-down periods.
On 23 June 2018, the variability was even stronger, with sharp power ramps in the late morning and early afternoon. PatchTST significantly overestimated the ramp-up, while the steep decline after noon was better captured; the DLinear and persistence models performed competitively during this day. The Autoformer and FEDformer tended to overestimate the peak magnitude.
Given the above point forecast results, the PatchTST model is, both visually and statistically, the best-performing model on most, though not all, days. Since ACI is a post-processing uncertainty quantification method whose performance depends on the quality of the underlying point forecasts, it is subsequently applied to the PatchTST model.
To provide insight into how the investigated models compare with approaches popular in the literature, Table 4 is presented. It summarizes key datasets, forecasting horizons, methods, and error metrics reported in previous studies alongside the results obtained in this work, thereby situating the contributions within the broader context of PV forecasting research. This comparison highlights both the competitiveness of the investigated Transformer architectures and the added value of the proposed PatchTST+ACI approach for delivering accurate and reliable short-term forecasts.

6.2. Interval Forecast Performance (PatchTST + ACI)

For the PatchTST model, the Adaptive Conformal Inference (ACI) evaluation yielded strong probabilistic forecasting performance. The prediction intervals achieved a coverage of 86.2%, which is close to the nominal target, indicating that the constructed intervals successfully capture most of the observed values. The mean interval width (0.62 kW) reflects reasonably tight bounds relative to the 5 kW system peak, suggesting a good balance between reliability and precision. In addition, the mean CRPS (0.54) and mean Winkler score (1.86) further confirm the model’s ability to generate well-calibrated and sharp interval forecasts. These results demonstrate that PatchTST’s patch-based temporal representation, combined with ACI, provides robust uncertainty quantification for short-term PV power forecasting.
Figure 10 illustrates the interval forecast performance of the PatchTST model with ACI correction across four representative test days, each selected from different seasons. The blue dashed lines show the observed PV power generation, while the red solid lines indicate the ACI-based predictive intervals.
  • 5 February 2018 (top-left, winter): The actual output is very low and irregular, with a sharp midday peak. The ACI intervals overestimate generation during most of the day and fail to fully capture the spike at 11:00.
  • 28 May 2018 (top-right, late spring): The forecast follows the diurnal cycle well, with intervals closely surrounding the actual trajectory. Both the morning ramp-up and the afternoon decline are well-aligned, and the midday peak is well-captured except at 11:00 and 14:00, demonstrating good calibration under clear irradiance conditions.
  • 2 August 2018 (bottom-left, summer): A clear, high-generation day where the model performs in a satisfactory manner. The predictive intervals encompass most of the observed curve, with a slight tendency to overpredict around midday but good coverage during morning and evening transitions. Slight deviations are noticed at 8:00, 13:00, and 14:00.
  • 21 December 2018 (bottom-right, winter solstice): The shortest day in the dataset, with low irradiance and truncated production. The model systematically overestimates generation, producing intervals that are shifted upwards relative to the actual output, which is consistent with the challenges of capturing winter conditions and reduced daylight hours.
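For completeness, the empirical coverage, mean interval width, and Winkler score reported above can be computed directly from the interval bounds, as sketched below; the CRPS additionally requires the predictive distribution (or a set of quantiles) and is therefore omitted. Variable names are illustrative.

```python
import numpy as np

def interval_metrics(y_true, lower, upper, alpha=0.1):
    """Empirical coverage, mean width, and Winkler score of (1 - alpha) prediction intervals."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    inside = (y_true >= lower) & (y_true <= upper)
    coverage = inside.mean()                    # e.g. 0.862 for 86.2% coverage
    mean_width = (upper - lower).mean()         # mean interval width [kW]

    # Winkler score: interval width plus a 2/alpha penalty per unit of miss distance.
    penalty = (2.0 / alpha) * (np.clip(lower - y_true, 0.0, None)
                               + np.clip(y_true - upper, 0.0, None))
    winkler = ((upper - lower) + penalty).mean()
    return coverage, mean_width, winkler
```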
All experiments were conducted in Google Colab Pro using an NVIDIA A100 GPU under PyTorch 2.1. The total training and evaluation runtimes for 10 epochs were as follows: DLinear ≈ 1 min 29 s, PatchTST ≈ 2 min 19 s, Informer ≈ 3 min 23 s, FEDformer ≈ 5 min 19 s, and Autoformer ≈ 5 min 50 s. These differences reflect the varying architectural complexity of the models, with Autoformer and FEDformer requiring longer convergence times due to their decomposition and frequency-domain operations. Importantly, once trained, all models generate one-hour-ahead forecasts in the order of milliseconds per step, confirming their feasibility for real-time solar PV forecasting applications. Notably, PatchTST, despite delivering the highest forecast accuracy, is also the second-fastest model, offering the best balance between accuracy and computational cost.

7. Discussion

The results of this study demonstrate the effectiveness of Transformer-based architectures for short-term photovoltaic (PV) power forecasting, with PatchTST emerging as the most accurate and robust among the five benchmarked models. PatchTST achieved the lowest deterministic errors (MAE = 0.194 kW, RMSE = 0.381 kW) and, when coupled with Adaptive Conformal Inference (ACI), provided well-calibrated probabilistic forecasts with 86% empirical coverage and narrow mean interval widths of 0.62 kW. These results confirm that patch-level temporal tokenization and channel-independent modeling are particularly well-suited for capturing sharp ramps and intermittent fluctuations in PV output that other Transformer variants tend to smooth.
From the perspective of existing studies, the results align with the broader evidence that Transformer variants such as Informer, Autoformer, and FEDformer provide meaningful improvements in renewable energy forecasting. However, while prior works have shown the benefits of decomposition-based or frequency-enhanced architectures, the experiments reveal that PatchTST achieves better performance by leveraging patch tokenization and channel-independent modeling. The findings also reinforce the hypothesis that probabilistic calibration is strongly dependent on the accuracy of the underlying point forecasts, as seen in the consistent relationship between PatchTST’s superior deterministic metrics and its robust interval coverage under ACI.
From a practical perspective, the implications of these findings are significant. More accurate and reliable hour-ahead forecasts directly support grid operators in reducing reserve requirements, minimizing imbalance penalties, and improving energy storage scheduling. For PV producers, the ability to quantify uncertainty with adaptive, non-parametric intervals enhances participation in intraday markets by providing actionable risk metrics. The demonstrated performance of PatchTST+ACI suggests that the framework is readily applicable for operational deployment in both standalone PV plants and aggregated portfolios, helping to stabilize grids under increasing renewable penetration.
Looking ahead, future research should explore several directions. First, the integration of exogenous variables such as satellite imagery, sky cameras, or high-resolution weather forecasts could further enhance PatchTST’s predictive accuracy. Second, while this work focuses on a single-site case study, extending the approach to multi-site or regional PV forecasting would allow evaluation of its scalability and effectiveness in capturing spatial correlations. Finally, combining ACI with other ensemble or hybrid architectures could provide complementary strengths, offering an avenue for developing robust forecasting pipelines that adapt to diverse meteorological regimes. Collectively, these directions can help unlock the full potential of Transformer-based probabilistic forecasting for enabling resilient and economically efficient renewable energy integration.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available upon request.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PV – Photovoltaic
MAE – Mean Absolute Error
RMSE – Root Mean Squared Error
MBE – Mean Bias Error
MSE – Mean Squared Error
MAPE – Mean Absolute Percentage Error
SMAPE – Symmetric Mean Absolute Percentage Error
NMSE – Normalized Mean Squared Error
CRPS – Continuous Ranked Probability Score
MIW – Mean Interval Width
MLR – Multiple Linear Regression
ARMA – Autoregressive Moving Average
ACI – Adaptive Conformal Inference
Fac2 – Fraction Within a Factor of Two
FFT – Fast Fourier Transform
IFFT – Inverse Fast Fourier Transform
CNN – Convolutional Neural Network
RNN – Recurrent Neural Network
LSTM – Long Short-Term Memory
GRU – Gated Recurrent Unit
ViT – Vision Transformer
ARIMA – AutoRegressive Integrated Moving Average
SVM – Support Vector Machine
SVR – Support Vector Regression
BP – Backpropagation
PatchTST – Patch Time Series Transformer
N-BEATS – Neural Basis Expansion Analysis for Time Series
SG – Savitzky–Golay
CI – Channel Independent

References

1. Jiang, X.; Gou, Y.; Jiang, M.; Luo, L.; Zhou, Q. Photovoltaic Power Forecasting with Weather Conditioned Attention Mechanism. Big Data Min. Anal. 2025, 8, 326–345.
2. Yang, J.; He, H.; Zhao, X.; Wang, J.; Yao, T.; Cao, H.; Wan, M. Day-Ahead PV Power Forecasting Model Based on Fine-Grained Temporal Attention and Cloud-Coverage Spatial Attention. IEEE Trans. Sustain. Energy 2024, 15, 1062–1073.
3. Wang, X.; Ma, W. A Hybrid Deep Learning Model with an Optimal Strategy Based on Improved VMD and Transformer for Short-Term Photovoltaic Power Forecasting. Energy 2024, 295, 131071.
4. Liu, M.; Rao, S.; Huang, M.; Deng, S. Short-Term Photovoltaic Power Forecasting Based on Improved Transformer with Feature Enhancement. Sustain. Energy Grids Netw. 2025, 43, 101759.
5. Piantadosi, G.; Dutto, S.; Galli, A.; De Vito, S.; Sansone, C.; Di Francia, G. Photovoltaic Power Forecasting: A Transformer Based Framework. Energy AI 2024, 18, 100444.
6. Su, L.; Zuo, X.; Li, R.; Wang, X.; Zhao, H.; Huang, B. A Systematic Review for Transformer-Based Long-Term Series Forecasting. Artif. Intell. Rev. 2025, 58, 80.
7. Ahmed, S.; Nielsen, I.E.; Tripathi, A.; Siddiqui, S.; Ramachandran, R.P.; Rasool, G. Transformers in Time-Series Analysis: A Tutorial. Circuits Syst. Signal Process. 2023, 42, 7433–7466.
8. Nascimento, E.G.S.; de Melo, T.A.C.; Moreira, D.M. A Transformer-Based Deep Neural Network with Wavelet Transform for Forecasting Wind Speed and Wind Energy. Energy 2023, 278, 127678.
9. Mo, S.; Wang, H.; Li, B.; Xue, Z.; Fan, S.; Liu, X. Powerformer: A Temporal-Based Transformer Model for Wind Power Forecasting. Energy Rep. 2024, 11, 736–744.
10. Zhu, J.; Zhao, Z.; Zheng, X.; An, Z.; Guo, Q.; Li, Z.; Sun, J.; Guo, Y. Time-Series Power Forecasting for Wind and Solar Energy Based on the SL-Transformer. Energies 2023, 16, 7610.
11. Al-Ali, E.M.; Hajji, Y.; Said, Y.; Hleili, M.; Alanzi, A.M.; Laatar, A.H.; Atri, M. Solar Energy Production Forecasting Based on a Hybrid CNN-LSTM-Transformer Model. Mathematics 2023, 11, 676.
12. Akhter, M.N.; Mekhilef, S.; Mokhlis, H.; Almohaimeed, Z.M.; Muhammad, M.A.; Khairuddin, A.S.M.; Akram, R.; Hussain, M.M. An Hour-Ahead PV Power Forecasting Method Based on an RNN-LSTM Model for Three Different PV Plants. Energies 2022, 15, 2243.
13. Wojtkiewicz, J.; Hosseini, M.; Gottumukkala, R.; Chambers, T.L. Hour-Ahead Solar Irradiance Forecasting Using Multivariate Gated Recurrent Units. Energies 2019, 12, 4055.
14. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. arXiv 2022, arXiv:2106.13008.
15. Oliveira, J.M.; Ramos, P. Evaluating the Effectiveness of Time Series Transformers for Demand Forecasting in Retail. Mathematics 2024, 12, 2728.
16. Jiang, Y.; Gao, T.; Dai, Y.; Si, R.; Hao, J.; Zhang, J.; Gao, D.W. Very Short-Term Residential Load Forecasting Based on Deep-Autoformer. Appl. Energy 2022, 328, 120120.
17. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2020, 35, 11106–11115.
18. Zhao, M.; Peng, H.; Li, L.; Ren, Y. Graph Attention Network and Informer for Multivariate Time Series Anomaly Detection. Sensors 2024, 24, 1522.
19. Xu, H.; Peng, Q.; Wang, Y.; Zhan, Z. Power-Load Forecasting Model Based on Informer and Its Application. Energies 2023, 16, 3086.
20. Zhou, T.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-Term Series Forecasting. arXiv 2022, arXiv:2201.12740.
21. Li, D.; Liu, Q.; Feng, D.; Chen, Z. A Medium- and Long-Term Residential Load Forecasting Method Based on Discrete Cosine Transform-FEDformer. Energies 2024, 17, 3676.
22. Jin, X.; Pan, T.; Yu, H.; Wang, Z.; Cao, W. Electricity Load Forecasting Method Based on the GRA-FEDformer Algorithm. Energies 2025, 18, 4057.
23. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series Is Worth 64 Words. arXiv 2023, arXiv:2211.14730.
24. Anh, L.H.; Vu, D.T.; Oh, S.; Yu, G.-H.; Han, N.B.N.; Kim, H.-G.; Kim, J.-S.; Kim, J.-Y. Partial Transfer Learning from Patch Transformer to Variate-Based Linear Forecasting Model. Energies 2024, 17, 6452.
25. Zhang, K.; Zheng, S. An Interpretable Deep Learning Approach Integrating PatchTST, Quantile Regression, and SHAP for Dam Displacement Interval Prediction. Water 2025, 17, 1661.
26. Yu, Y.; Loskot, P.; Zhang, W.; Zhang, Q.; Gao, Y. A Spatial–Temporal Time Series Decomposition for Improving Independent Channel Forecasting. Mathematics 2025, 13, 2221.
27. Wang, G.; Liao, Y.; Guo, L.; Geng, J.; Ma, X. DLinear Photovoltaic Power Generation Forecasting Based on Reversible Instance Normalization. In Proceedings of the 2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS), Xiangtan, China, 12–14 May 2023; pp. 990–995.
28. Gibbs, I.; Candès, E. Adaptive Conformal Inference Under Distribution Shift. arXiv 2021, arXiv:2106.00170.
29. Zaffran, M.; Dieuleveut, A.; Féron, O.; Goude, Y.; Josse, J. Adaptive Conformal Predictions for Time Series. arXiv 2022, arXiv:2202.07282.
30. Suresh, V.; Janik, P.; Rezmer, J.; Leonowicz, Z. Forecasting Solar PV Output Using Convolutional Neural Networks with a Sliding Window Algorithm. Energies 2020, 13, 723.
31. Suresh, V.; Janik, P.; Guerrero, J.M.; Leonowicz, Z.; Sikorski, T. Microgrid Energy Management System With Embedded Deep Learning Forecaster and Combined Optimizer. IEEE Access 2020, 8, 202225–202239.
32. Suresh, V.; Swain, A.; Revathi, B.S.; Guerrero, J.M. Mamba Based Adaptive Conformal Inference for Probabilistic Short-Term Load Forecasting. Knowl.-Based Syst. 2025, 328, 114222.
33. Pelland, S.; Remund, J.; Kleissl, J.; Oozeki, T.; De Brabandere, K. Photovoltaic and Solar Forecasting: State of the Art; International Energy Agency: Paris, France, 2013.
34. Xu, C.; Zhong, P.; Zhu, F.; Xu, B.; Wang, Y.; Yang, L.; Wang, S.; Xu, S. A Hybrid Model Coupling Process-Driven and Data-Driven Models for Improved Real-Time Flood Forecasting. J. Hydrol. 2024, 638, 131494.
35. Li, G.; Zhang, J.; Shen, X.; Kong, C.; Zhang, Y.; Li, G. A New Wind Speed Evaluation Method Based on Pinball Loss and Winkler Score. Adv. Electr. Comput. Eng. 2022, 22, 11–18.
Figure 1. Transformer-based architectures for time series forecasting.
Figure 2. Autoformer architecture for time series forecasting.
Figure 3. Informer architecture for time series forecasting.
Figure 4. FEDformer architecture for time series forecasting.
Figure 5. PatchTST architecture for time series forecasting.
Figure 6. ACI procedure for building forecast intervals.
Figure 7. Dataset visualization.
Figure 8. Point forecast performance of the investigated models across random days.
Figure 9. Point forecast performance of the investigated models across days with high variability.
Figure 10. PatchTST interval forecast performance across random days.
Table 1. Literature summary of Transformer-based models in time series energy forecasting.

| Study/Reference | Forecasting Task | Input Features | Methods Applied | Key Findings |
| --- | --- | --- | --- | --- |
| [8] Wavelet-Transformer (Wind, Brazil) | Wind speed and power, up to 6 h ahead | Meteorological vars (wind speed/direction, temperature, humidity, and pressure), cyclic time, wavelet features | Transformer + wavelet decomposition vs. tuned LSTM, persistence | Transformer outperforms LSTM at 3–6 h horizons, statistically significant (Wilcoxon), faster training |
| [9] Powerformer (Wind) | Short-term wind power (15 min ahead) | Wind features (10–14 vars) | Powerformer (LSTM embeddings + sparse attention + pooling) vs. GRU, LSTM, vanilla Transformer | Achieves best MAE/RMSE/MAPE; reduces MAPE to 1.08% vs. 3.83% (Transformer) |
| [10] SL-Transformer (Hybrid Wind and Solar) | Wind (5 min) and solar PV (hourly) generation | Wind speed/power, solar generation + preprocessing (SG filter, LOF) | Transformer encoder + LSTM decoder + attention vs. ARIMA, SVM, LSTM, N-BEATS, vanilla Transformer | SL-Transformer significantly outperforms all baselines; solar SMAPE = 4.22% (~15% better than baselines) |
| [11] CNN–Transformer (Solar PV, China) | PV power (15 min ahead) | Meteorological (irradiance, temp, humidity, and wind speed) + historical PV | CNN + Transformer vs. ARIMA, SVR, CNN, LSTM, GRU, vanilla Transformer | CNN–Transformer reduces RMSE by >20% vs. LSTM/GRU; achieves R2 > 0.98 |
| [12] RNN-LSTM (Solar PV, Malaysia) | PV power, hour-ahead forecasting | Irradiance, wind speed, and ambient and module temperatures | RNN-LSTM (single, double, and bidirectional) vs. ANN, SVR, GPR, and ANFIS | Single-layer LSTM delivers the lowest RMSE/MAE across modules, robust to PV tech type |
| [13] GRU (Solar Irradiance, USA) | GHI, hour-ahead forecasting | Irradiance + exogenous vars (zenith angle, humidity, temperature, and cloud cover) | GRU and LSTM (uni- vs. multivariate) | Multivariate > univariate; cloud cover adds the largest gains; LSTM has slightly better accuracy; GRU is more efficient |
| Present Study | PV power, hour-ahead forecasting | Irradiance, module temperature, past output, and cyclical time encodings | Transformers: Autoformer, Informer, FEDformer, PatchTST, and DLinear; ACI for uncertainty | PatchTST achieves best accuracy; ACI yields calibrated 86% coverage with sharp intervals |
Table 2. Investigated models' hyperparameter values.

| Model | Sequence Length | Batch Size | Epochs | Dropout | Model Dimension | Number of Heads | Label Length | Other Key Settings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PatchTST | 48 | 48 | 10 | 0.1 | 128 | 4 | – | Patch length = 16; Stride = 8; Individual = True |
| Informer | 48 | 32 | 10 | 0.05 | 64 | 2 | 24 | Factor = 3; Distil = True; Embed = fixed |
| FEDformer | 48 | 32 | 10 | 0.05 | 64 | 8 | 24 | Modes = 32; Version = Fourier; Moving average = 25 |
| Autoformer | 48 | 32 | 10 | 0.05 | 64 | 2 | 24 | Factor = 3; Moving average = 25; Distil = True |
| DLinear | 48 | 32 | 10 | – | – | – | 24 | Individual = True; Moving average = 50 |
Table 3. Point forecast performance evaluation metrics.

| Model | MAE (kW) | RMSE (kW) | MBE (kW) | MAE (% of 5 kW Peak) | RMSE (% of 5 kW Peak) | MBE (% of 5 kW Peak) |
| --- | --- | --- | --- | --- | --- | --- |
| Autoformer | 0.2373 | 0.4559 | 0.0511 | 4.7460 | 9.1178 | 1.0229 |
| Informer | 0.2202 | 0.4486 | 0.0548 | 4.4043 | 8.9726 | 1.0965 |
| FEDformer | 0.2480 | 0.4808 | 0.1075 | 4.9600 | 9.6164 | 2.1510 |
| PatchTST | 0.1937 | 0.3805 | −0.0886 | 3.8743 | 7.6102 | −1.7723 |
| DLinear | 0.2183 | 0.4590 | −0.0283 | 4.3662 | 9.0174 | −0.5650 |
| Persistence | 0.2443 | 0.4737 | 0.0000 | 4.8867 | 9.4744 | −0.0000 |
Table 4. Comparison of the point forecast performance of the proposed methods with those from the literature.

| Study/Reference | Forecasting Task | Dataset/Resolution | Methods Applied | Main Results |
| --- | --- | --- | --- | --- |
| [11] CNN–Transformer (Solar PV, China) | 15 min PV power forecasting | PV plant in China; meteorological + historical PV | 1D-CNN + Transformer vs. ARIMA, SVR, CNN, LSTM, GRU, and vanilla Transformer | CNN–Transformer reduced RMSE by >20% vs. LSTM/GRU; R2 > 0.98 |
| [12] RNN-LSTM (Solar PV, Malaysia) | Hour-ahead PV power | University of Malaya; poly/mono/thin-film PV; 2016–2019; 5 min resolution | RNN-LSTM (single, double, and bidirectional) vs. ANN, SVR, GPR, and ANFIS | Single-layer LSTM gave the lowest RMSE/MAE across all PV modules; robust to PV type |
| [13] GRU (Solar Irradiance, USA) | Hour-ahead GHI | Phoenix Airport; 2004–2014; hourly; NREL NSRDB + NOAA | GRU and LSTM (uni- vs. multivariate) | Multivariate > univariate; cloud cover gave the largest improvement; GRU is more efficient than the compared models |
| Present Study (Poland) | Hour-ahead PV power | Rooftop PV (WUST, Poland); 5 kW; 5 years hourly; 12-month unseen test | Autoformer, Informer, FEDformer, PatchTST, DLinear; persistence baseline; and PatchTST+ACI | PatchTST best: MAE = 0.194 kW (3.9%), RMSE = 0.381 kW (7.6%); ACI coverage = 86% with sharp intervals |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
